# The Smart Local Dataset Creation
Author: Sebastian Png

## Description of Dataset
- **url**: url of article
- **timedelta**: number of days between publication date and web scraping date (2 November 2022)
- **title**: title of article
- **category**: main category of article
- **subcategory1**: first subcategory of article
- **subcategory2**: second subcategory of article
- **subcategory3**: third subcategory of article
- **preview**: preview content of article (before clicking into article)
- **content**: full content of article (includes image credits)
- **n_tokens_title**: number of words in title (alphanumerical and including ampersands)
- **title_polarity**: polarity of title, values in the range of [-1, 1], -1: negative, 1: positive
- **title_subjectivity**: subjectivity of title, values in the range of [0, 1], 0: objective, 1: subjective
- **n_tokens_preview**: number of words in preview (alphanumerical and including ampersands)
- **preview_polarity**: polarity of preview, values in the range of [-1, 1], -1: negative, 1: positive
- **preview_subjectivity**: subjectivity of preview, values in the range of [0, 1], 0: objective, 1: subjective
- **n_tokens_content**: number of words in content of article
- **prop_non_stop**: proportion of stop words in content of article
- **prop_unique_non_stop**: proportion of unique stop words in content of article
- **content_polarity**: polarity of content, values in the range of [-1, 1], -1: negative, 1: positive
- **content_subjectivity**: subjectivity of content, values in the range of [0, 1], 0: objective, 1: subjective
- **reading_duration**: number of minutes to read entire article, around 200 words per minute
- **author**: author of article
- **publish_date**: publication date of article
- **day_of_week**: day of publication of article, 0: Monday, 6: Sunday
- **month**: month of publication of article
- **year**: year of publication of article
- **num_imgs**: number of images in article
- **img_links**: list of urls of images
- **num_hrefs**: number of hyperlinks in article
- **num_self_hrefs**: number of hyperlinks in article linked to thesmartlocal.com
- **num_tags**: number of tags at the end of the article
- **num_shares**: number of shares of article

List of stop words is the default nltk corpus extended with the following ngrams: "ha", "wa", "cover image adapted from", "cover image credits", "cover image credit", "image credits", "image credit", "image adapted from" and "photography by".

# Import Libraries
The import statements are ordered according to [PEP8 standards](https://peps.python.org/pep-0008/#:~:text=Imports%20are%20always%20put%20at,Related%20third%20party%20imports.)

In [1]:
import datetime
import re
import requests
import string
import time

import contractions
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns

from bs4 import BeautifulSoup
from functools import reduce
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize
from textblob import TextBlob
from tqdm import tqdm
from typing import List, Tuple

nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\pngse\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# Utility Functions

In [2]:
def get_non_stop_words_rate(text: str, stopwords: List[str]) -> Tuple[float, float]:
    """
    Returns a tuple of proportion of non stopwords in text and
    proportion of unique non stopwords in text.
    """
    tokens = text.split()
    new_text = text
    for sw in stopwords:
        new_text = re.sub(pattern=f'\s*{sw}\s*', repl=' ', string=new_text)
    non_stop_words_tokens = new_text.split()
    
    # Proportion of non stopwords in text
    rate = len(non_stop_words_tokens)/len(tokens)
    
    # Proportion of unique non stopwords in text
    unique_rate = len(set(non_stop_words_tokens))/len(set(tokens))
    
    return rate, unique_rate

def get_sentiment(text:str) -> Tuple[float, float]:
    """
    Returns a tuple of polarity and sentiment of text, 
    Polarity values are in the range of [-1, 1], 
    -1: negative, 1: positive.
    Sentiment values in the range of [0, 1],
    0: objective, 1: subjective.
    """
    blob = TextBlob(text)
    sentiment = blob.sentiment
    
    return (sentiment.polarity, sentiment.subjectivity)

def lemmatize(text: str) -> str:
    """Converts a text into its lemmatized form."""
    wnl = WordNetLemmatizer()
    return ' '.join([wnl.lemmatize(word) for word in text.split()])

def remove_punctuations(text: str) -> str:
    """Removes punctuations and keeps all alphanumerical text."""
    return re.sub(r'[^\w\s]', '', text)

def remove_tags(text: str) -> str:
    """Removes social media tags."""
    return re.sub('@\w+', '', text.lower())

def remove_stopwords(text: str, stopwords: List[str]) -> str:
    """Removes stopwords from text."""
    new_text = text
    for s in stopwords:
        pattern = ' ' + s + ' '
        new_text = re.sub(pattern, ' ', new_text)
    
    return new_text

# Web Scraping

Data of articles from 24 December 2018 to 2 November 2022 are scraped from [The Smart Local's website](https://thesmartlocal.com/) to create the underlying dataset. This block of code retrieves the url, title, category and preview text of 4080 articles. The robots.txt file was checked before scraping to avoid any violations and scraping was done overnight to reduce disruptions to the company.

In [3]:
overall_list = []

# Information of 10 articles' previews are retrieved per loop
for i in tqdm(range(1, 409)):
    url = 'https://thesmartlocal.com/page/{}/'.format(i)
    response = requests.get(url)

    # Using lxml’s HTML parser to parse the response text
    soup = BeautifulSoup(response.text, 'lxml')
    
    for preview in soup.select('.col-lg-6+ .col-lg-6'):
        title = preview.h2.a.string
        category = preview.li.get_text(strip=True)
        article_url = preview.h2.a.get('href')
        article_summary = preview.div.get_text(' ', strip=True)

        temp_list = [article_url, title, category, article_summary]
        overall_list.append(temp_list)
    
    # Pause requests for half a second to avoid spamming the website
    # with requests
    time.sleep(0.5)

article_preview = pd.DataFrame(overall_list,
                               columns=['url', 'title', 'subcategory',
                                        'preview'])

article_preview

100%|████████████████████████████████████████████████████████████████████████████████| 408/408 [08:37<00:00,  1.27s/it]


Unnamed: 0,url,title,subcategory,preview
0,https://thesmartlocal.com/read/staytion-marsil...,Staytion Marsiling: Coworking Space In The Nor...,Career,"Hooray for being able to sleep in, plus the ti..."
1,https://thesmartlocal.com/read/things-to-do-es...,"Esplanade Is Having Free Shows, A Theatre BTS ...",Things To Do In Singapore,Do not miss the free entertainment here.
2,https://thesmartlocal.com/read/things-to-do-no...,17 New Things To Do In November 2022 – Bishan ...,Activities,"In the blink of an eye, we're approaching 2023..."
3,https://thesmartlocal.com/read/paypal-welcome-...,You Can Redeem Vouchers For Brands Like foodpa...,Businesses,Vouchers can also be used on Zalora and Agoda.
4,https://thesmartlocal.com/read/things-to-do-ju...,9 Best Things To Do In Jurong For Westies To S...,Things To Do In Singapore,"Hot take: west side, best side."
...,...,...,...,...
4075,https://thesmartlocal.com/read/deals-end-2018/,9 Money-Saving Hacks That Will Expire On 31 De...,Things To Do In Singapore,Deals and hacks before 2019 You know the en...
4076,https://thesmartlocal.com/read/inspiration-sto...,Inspiration Store At Orchard Xchange Teaches Y...,Events,JR East Inspiration Store Here’s a thought –...
4077,https://thesmartlocal.com/read/nascans-sg/,NASCANS Has Ex-MOE Teachers And Coaches Maths ...,Businesses,NASCANS Student Care Centre If you’ve racked...
4078,https://thesmartlocal.com/read/family-spots-no...,6 Hidden Family Spots In The North To Get Your...,Things To Do In Singapore,Family places and activities in Singapore’s No...


The following code uses the urls scraped earlier to further scrape more information of each article, such as reading duration, author, publish date, content, number of images, number of links, number of self-directed links to The Smart Local, number of tags and number of shares.

In [6]:
article_info = []

for url in tqdm(sl.url):
    response = requests.get(url)

    # Using lxml’s HTML parser to parse the response text
    soup = BeautifulSoup(response.text, "lxml")
    
    after_title = soup.select('.after-title')[0]
    reading_duration = (after_title.span.string if len(after_title) > 1
                        else after_title.string)

    author = soup.select('#meta-author a')[0].string

    # Date format yyyy-mm-dd
    publish_date = soup.select('#meta-date time')[0].get('datetime')

    # Article body
    article_body = soup.select('#wtr-content,.post-content')[0]
    content = article_body.get_text(' ', strip=True)

    # Number of images in article
    img_selector = 'p img, .size-full, .alignnone'
    num_imgs = len(soup.select(img_selector))
    
    # List of links to images in each article
    img_urls = [img.get('src') for img in soup.select(img_selector)]

    # List of links in articles
    href_list = article_body.find_all('a', class_='', href=True)

    # Total number of links in article
    num_hrefs = len(href_list)

    # Number of links self-directed to smartlocal
    num_self_hrefs = len(['smartlocal.com' in l.get('href') for l in href_list])

    # Number of article shares
    num_shares = soup.select('.mashsbcount')[0].string

    # Number of tags at the end of the article
    num_tags = len(soup.select('.post-tags')[0].find_all('a'))

    temp_list = [content, reading_duration, author, publish_date, num_imgs,
                 img_urls, num_hrefs, num_self_hrefs, num_tags, num_shares]

    article_info.append(temp_list)
    
    # Pause requests for half a second to avoid spamming the website
    # with requests
    time.sleep(0.5)

cols = ['content', 'reading_duration', 'author', 'publish_date', 'num_imgs',
        'img_links', 'num_hrefs', 'num_self_hrefs', 'num_tags', 'num_shares']

article_info = pd.DataFrame(article_info, columns=cols)

article_info

100%|████████████████████████████████████████████████████████████████████████████| 4080/4080 [1:50:19<00:00,  1.62s/it]


Unnamed: 0,content,reading_duration,author,publish_date,num_imgs,img_links,num_hrefs,num_self_hrefs,num_tags,num_shares
0,Staytion Marsiling – Coworking space in the No...,4,Renae Cheng,2022-11-02,7,[https://thesmartlocal.com/wp-content/uploads/...,17,17,1,27
1,Things to do at Esplanade So you’ve been to Es...,3,Samantha Nguyen,2022-11-01,5,[https://thesmartlocal.com/wp-content/uploads/...,7,7,2,73
2,Things to do in November 2022 Halloween may be...,13,Kezia Tan,2022-11-01,33,[https://thesmartlocal.com/wp-content/uploads/...,63,63,6,244
3,PayPal’s Welcome Pack promotion With Black Fri...,3,Aditi Kashyap,2022-11-01,5,[https://thesmartlocal.com/wp-content/uploads/...,4,4,2,25
4,"Things to do in Jurong For too long, residents...",10,Raewyn Koh,2022-11-01,24,[https://thesmartlocal.com/wp-content/uploads/...,18,18,3,31
...,...,...,...,...,...,...,...,...,...,...
4075,Deals and hacks before 2019 You know the end-o...,4,Sammi Kor,2018-12-26,14,[https://thesmartlocal.com/wp-content/uploads/...,17,17,0,0
4076,JR East Inspiration Store Here’s a thought – m...,3,Sammi Kor,2018-12-24,10,[https://thesmartlocal.com/wp-content/uploads/...,3,3,0,0
4077,NASCANS Student Care Centre If you’ve racked y...,7,Sammi Kor,2018-12-24,12,[https://thesmartlocal.com/wp-content/uploads/...,6,6,0,0
4078,Family places and activities in Singapore’s No...,5,Renae Cheng,2018-12-24,24,[https://thesmartlocal.com/wp-content/uploads/...,7,7,0,119


The two scraped dataframes are then concatenated into one single dataframe.

In [6]:
smartlocal = pd.concat(objs=[article_preview, article_info], axis=1)

# Remove share count at the end of content body
remove_share = lambda x: re.sub(pattern='\s*\d+ SHARES Share Tweet',
                                repl='',
                                string=x)

smartlocal.content = (smartlocal.content.apply(remove_share))

# Remove commas in num_shares and convert to int type
convert_int = lambda x: int(re.sub(pattern=',', repl='', string=x))
smartlocal.num_shares = smartlocal.num_shares.apply(convert_int)

smartlocal

Unnamed: 0,url,title,subcategory,preview,content,reading_duration,author,publish_date,num_imgs,img_links,num_hrefs,num_self_hrefs,num_tags,num_shares
0,https://thesmartlocal.com/read/staytion-marsil...,Staytion Marsiling: Coworking Space In The Nor...,Career,"Hooray for being able to sleep in, plus the ti...",Staytion Marsiling – Coworking space in the No...,4,Renae Cheng,2022-11-02,7,[https://thesmartlocal.com/wp-content/uploads/...,17,17,1,0
1,https://thesmartlocal.com/read/things-to-do-es...,"Esplanade Is Having Free Shows, A Theatre BTS ...",Things To Do In Singapore,Do not miss the free entertainment here.,Things to do at Esplanade So you’ve been to Es...,3,Samantha Nguyen,2022-11-01,5,[https://thesmartlocal.com/wp-content/uploads/...,7,7,2,20
2,https://thesmartlocal.com/read/things-to-do-no...,17 New Things To Do In November 2022 – Bishan ...,Activities,"In the blink of an eye, we're approaching 2023...",Things to do in November 2022 Halloween may be...,13,Kezia Tan,2022-11-01,33,[https://thesmartlocal.com/wp-content/uploads/...,63,63,5,144
3,https://thesmartlocal.com/read/paypal-welcome-...,You Can Redeem Vouchers For Brands Like foodpa...,Businesses,Vouchers can also be used on Zalora and Agoda.,PayPal’s Welcome Pack promotion With Black Fri...,3,Aditi Kashyap,2022-11-01,5,[https://thesmartlocal.com/wp-content/uploads/...,4,4,2,13
4,https://thesmartlocal.com/read/things-to-do-ju...,9 Best Things To Do In Jurong For Westies To S...,Things To Do In Singapore,"Hot take: west side, best side.","Things to do in Jurong For too long, residents...",10,Raewyn Koh,2022-11-01,24,[https://thesmartlocal.com/wp-content/uploads/...,18,18,3,52
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4075,https://thesmartlocal.com/read/deals-end-2018/,9 Money-Saving Hacks That Will Expire On 31 De...,Things To Do In Singapore,Deals and hacks before 2019 You know the en...,Deals and hacks before 2019 You know the end-o...,4,Sammi Kor,2018-12-26,14,[https://thesmartlocal.com/wp-content/uploads/...,17,17,0,0
4076,https://thesmartlocal.com/read/inspiration-sto...,Inspiration Store At Orchard Xchange Teaches Y...,Events,JR East Inspiration Store Here’s a thought –...,JR East Inspiration Store Here’s a thought – m...,3,Sammi Kor,2018-12-24,10,[https://thesmartlocal.com/wp-content/uploads/...,3,3,0,0
4077,https://thesmartlocal.com/read/nascans-sg/,NASCANS Has Ex-MOE Teachers And Coaches Maths ...,Businesses,NASCANS Student Care Centre If you’ve racked...,NASCANS Student Care Centre If you’ve racked y...,7,Sammi Kor,2018-12-24,12,[https://thesmartlocal.com/wp-content/uploads/...,6,6,0,0
4078,https://thesmartlocal.com/read/family-spots-no...,6 Hidden Family Spots In The North To Get Your...,Things To Do In Singapore,Family places and activities in Singapore’s No...,Family places and activities in Singapore’s No...,5,Renae Cheng,2018-12-24,24,[https://thesmartlocal.com/wp-content/uploads/...,7,7,0,119


# Feature Engineering

## Main Categories
There are currently 5 main categories on the website's navigation bar - Travel, Things To Do, Local and Adulting, and Reviews. They will be used to group the subcategories into main categories. As some articles have more than one subcategory, the main category will be based on the first subcategory.

In [9]:
# Main categories - Can be found on the top navigation bar of website
# Travel category
southeast_asia = ['Thailand', 'Malaysia', 'Philippines', 'Indonesia',
                  'Vietnam']
rest_of_asia = ['China', 'Hong Kong', 'Japan', 'Korea', 'Taiwan', 'Others']
travel = ['Southeast Asia', 'Rest Of Asia', 'Australia', 'New Zealand',
          'Europe', 'Africa & Middle East', 'America', 'Rest of the World',
          'Travel Guides & Tips']
travel_cat = southeast_asia + rest_of_asia + travel + ['Travel']

# Things To Do category
activities = ['Attractions', 'Volunteering']
events = ['Runs']
food = ['Food Reviews', 'Food Guides']
nightlife = ['Bars & Clubs', 'Nightlife Guides']
hotels_and_staycations = ['Hotel Reviews', 'Hotel Guides']
sales_and_promotions = ['Contests', 'Monthly Lobangs']
things_to_do = ['Activities', 'Events', 'Food', 'Nightlife',
                'Hotels & Staycations', 'Family & Kid-friendly', 'Photospots',
                'Sales & Promotions', 'Sports & Fitness', 'Beauty & Wellness',
                'Fashion', 'Gaming']
singapore = ['Things To Do In Singapore', 'Singapore']
things_to_do_cat = (activities + events + food + nightlife
                    + hotels_and_staycations + sales_and_promotions
                    + things_to_do + singapore + ['Things To Do'])

# Local category
local = (['Perspectives', 'Inspiration', 'Culture', 'Students', 'Businesses',
          'Hacks', 'Heritage', 'Supernatural & Mystery', 'Humour', 'Misc'] + 
         ['Local', 'Singapore Perspectives', 'Tutorials & Self-Improvement'])

# Adulting category
parenting = ['Education']
adulting = ['Finances', 'Home', 'Dating & Relationships', 'Wedding', 'Parenting',
            'Career', 'Self Improvement', 'Pets', 'Tech', 'Products']
adulting_cat = parenting + adulting + ['Adulting', 'Cryptocurrency']

# Reviews
reviews = ['Reviews']

# Create dictionary of subcategory to one of the 5 main categories
subcat_to_travel = {subcat:'Travel' for subcat in travel_cat} 
subcat_to_things_to_do = {subcat:'Things To Do' for subcat in things_to_do_cat}
subcat_to_local = {subcat:'Local' for subcat in local}
subcat_to_adulting = {subcat:'Adulting' for subcat in adulting_cat}
subcat_to_reviews = {subcat:'Reviews' for subcat in reviews}

# Combine all dictionaries into one
subcat_to_main_cat = {**subcat_to_travel, **subcat_to_things_to_do,
                      **subcat_to_local, **subcat_to_adulting,
                      **subcat_to_reviews}

In [10]:
# Split subcategory column into 3 columns as there are articles with
# >1 subcategory
categories = smartlocal.subcategory.str.split(pat=',', n=2, expand=True)

categories.columns = ['subcategory1', 'subcategory2', 'subcategory3']

# Get main category from the overall dictionary
categories.insert(loc=0,
                  column='category',
                  value=categories.subcategory1.apply(lambda x:
                                                      subcat_to_main_cat[x]))

# Drop original subcategory column and add the main and sub categories
# into the dataframe
smartlocal = pd.concat(objs=[smartlocal.loc[:, :'title'],
                             categories,
                             smartlocal.loc[:, 'preview':]],
                       axis=1)

smartlocal.head()

Unnamed: 0,url,title,category,subcategory1,subcategory2,subcategory3,preview,content,reading_duration,author,publish_date,num_imgs,img_links,num_hrefs,num_self_hrefs,num_tags,num_shares
0,https://thesmartlocal.com/read/staytion-marsil...,Staytion Marsiling: Coworking Space In The Nor...,Adulting,Career,,,"Hooray for being able to sleep in, plus the ti...",Staytion Marsiling – Coworking space in the No...,4,Renae Cheng,2022-11-02,7,[https://thesmartlocal.com/wp-content/uploads/...,17,17,1,0
1,https://thesmartlocal.com/read/things-to-do-es...,"Esplanade Is Having Free Shows, A Theatre BTS ...",Things To Do,Things To Do In Singapore,,,Do not miss the free entertainment here.,Things to do at Esplanade So you’ve been to Es...,3,Samantha Nguyen,2022-11-01,5,[https://thesmartlocal.com/wp-content/uploads/...,7,7,2,20
2,https://thesmartlocal.com/read/things-to-do-no...,17 New Things To Do In November 2022 – Bishan ...,Things To Do,Activities,,,"In the blink of an eye, we're approaching 2023...",Things to do in November 2022 Halloween may be...,13,Kezia Tan,2022-11-01,33,[https://thesmartlocal.com/wp-content/uploads/...,63,63,5,144
3,https://thesmartlocal.com/read/paypal-welcome-...,You Can Redeem Vouchers For Brands Like foodpa...,Local,Businesses,,,Vouchers can also be used on Zalora and Agoda.,PayPal’s Welcome Pack promotion With Black Fri...,3,Aditi Kashyap,2022-11-01,5,[https://thesmartlocal.com/wp-content/uploads/...,4,4,2,13
4,https://thesmartlocal.com/read/things-to-do-ju...,9 Best Things To Do In Jurong For Westies To S...,Things To Do,Things To Do In Singapore,,,"Hot take: west side, best side.","Things to do in Jurong For too long, residents...",10,Raewyn Koh,2022-11-01,24,[https://thesmartlocal.com/wp-content/uploads/...,18,18,3,52


In [11]:
smartlocal.category.value_counts()

Things To Do    2135
Local           1013
Adulting         612
Travel           316
Reviews            4
Name: category, dtype: int64

## Time Delta
Number of days between the article's publish date and the date of web scraping (2 November 2022)

In [12]:
# Convert publish date column to datetime format
smartlocal.publish_date = pd.to_datetime(arg=smartlocal.publish_date,
                                         format='%Y-%m-%d')

timedelta = (datetime.date(2022, 11, 2) -
             smartlocal.publish_date.dt.date).dt.days

# Insert timedelta after url column
insertion_index = smartlocal.columns.get_loc('url') + 1

smartlocal.insert(loc=insertion_index, column='timedelta', value=timedelta)

## Day of the Week, Month, Year
Extracted from publication date

In [13]:
# Get days of the week from publish date. 0: Monday 6: Sunday
days_of_week = smartlocal.publish_date.apply(lambda x: x.weekday())

month = smartlocal.publish_date.dt.month
year = smartlocal.publish_date.dt.year

# Insert day_of_week after publish_date column
insertion_index = smartlocal.columns.get_loc('publish_date') + 1
smartlocal.insert(loc=insertion_index,
                  column='day_of_week',
                  value=days_of_week)

# Insert day_of_week after day_of_week column
insertion_index = smartlocal.columns.get_loc('day_of_week') + 1
smartlocal.insert(loc=insertion_index, column='month', value=month)

# Insert day_of_week after month column
insertion_index = smartlocal.columns.get_loc('month') + 1
smartlocal.insert(loc=insertion_index, column='year', value=year)

# Convert publication date to date format for exporting purposes
smartlocal.publish_date = smartlocal.publish_date.dt.date

## Text Features

### n_tokens
Count of text in title and content

In [14]:
# Word count of preview - numbers and punctuations are still kept
n_tokens_preview = smartlocal.preview.apply(lambda x: len(x.split()))

# Word count of title - numbers and punctuations are still kept
n_tokens_title = smartlocal.title.apply(lambda x: len(x.split()))

# Word count of article content - punctuations are removed but numbers
# are kept
n_tokens_content = smartlocal.content.apply(lambda x: 
                                            len(remove_punctuations(x).split()))

# Insert n_tokens_title after content column
insertion_index = smartlocal.columns.get_loc('content') + 1
smartlocal.insert(loc=insertion_index,
                  column='n_tokens_title',
                  value=n_tokens_title)

# Insert n_tokens_title after n_tokens_title column
insertion_index = smartlocal.columns.get_loc('n_tokens_title') + 1
smartlocal.insert(loc=insertion_index,
                  column='n_tokens_preview',
                  value=n_tokens_preview)

# Insert n_tokens_title after n_tokens_title column
insertion_index = smartlocal.columns.get_loc('n_tokens_preview') + 1
smartlocal.insert(loc=insertion_index,
                  column='n_tokens_content',
                  value=n_tokens_content)

Using n_tokens_content, the row with the unknown reading duration can be estimated.

In [16]:
# Set unknown reading duration to 0
error_index = smartlocal.loc[smartlocal.reading_duration == 'featured',:].index[0] 
smartlocal.loc[error_index, 'reading_duration'] = '0'

# Convert reading durations from string to int
convert_int = lambda x: re.sub(pattern='\D+', repl='', string=x)
smartlocal.reading_duration = (smartlocal.reading_duration
                                         .apply(convert_int).astype(int))

# Estimate reading duration for the row without value
words_per_min = (n_tokens_content.drop(error_index) /
                 smartlocal.reading_duration.drop(error_index)).mean()
print(f'Words per Min: {str(words_per_min)}')

# Estimated reading duration is rounded up to the next minute
smartlocal.loc[error_index, 'reading_duration'] = int(n_tokens_content[error_index]
                                                      /words_per_min)

Words per Min: 202.98300280597664


### prop_non_stop and prop_unique_non_stop

Before finding the proportion of non stop words in each article, it is important to find other potential stop words to extend the list.

In [19]:
# Combine all article content into a single string
total_content = ' '.join(smartlocal['content'])
total_content_lemma = lemmatize(remove_punctuations(total_content.lower()))

# Word frequency dictionary
word_to_frequency = FreqDist()

for sentence in nltk.tokenize.sent_tokenize(total_content_lemma):
    for word in nltk.tokenize.word_tokenize(sentence):
        if word not in string.punctuation:
            word_to_frequency[word] += 1

freq_df = pd.DataFrame.from_dict(word_to_frequency.items())
freq_df.columns = ['word', 'frequency']
freq_df.sort_values(by='frequency', ascending=False, inplace=True)
freq_df.reset_index(drop=True, inplace=True)
freq_df.head()

Unnamed: 0,word,frequency
0,the,200209
1,a,154033
2,to,152387
3,and,127425
4,of,100916


It can be seen that "image" and "credit" are the most frequent words as the articles often include image credits. Hence, the bigram "image credit" can be added into the stop words corpus, along with different variations of it. In addition, "wa" and "ha" will be included in the corpus.

In [20]:
k, v = zip(*word_to_frequency.most_common())

# Default stop words
stop_words = stopwords.words('english')

# Get non stop words with highest frequency across all articles
non_stop_words = [x for x in k if x not in stop_words]

# Top 20 non stop words
non_stop_words[:20]

['image',
 'credit',
 'singapore',
 'like',
 'get',
 'also',
 'one',
 'youre',
 'youll',
 'time',
 'wa',
 'hour',
 'even',
 'name',
 'u',
 'ha',
 'day',
 'new',
 'email',
 'make']

In [21]:
# Extend stop words
stop_words = ['cover image adapted from', 'cover image credits',
              'cover image credit', 'image credits', 'image credit',
              'image adapted from', 'photography by', 'ha', 'wa'] + stop_words

# List of default stopwords with punctuations removed, to avoid issues
# of encoding
stop_words_no_punc = [remove_punctuations(word) for word in stop_words]

# Functions to apply
functions = [str.lower, remove_punctuations, lemmatize,
             lambda x: get_non_stop_words_rate(text=x,
                                               stopwords=stop_words_no_punc)]
mass_apply = lambda x: reduce(lambda y, f: f(y), functions, x)

rates = smartlocal['content'].apply(mass_apply)
prop_non_stop = rates.str[0]
prop_unique_non_stop = rates.str[1]

# Insert n_tokens_title after n_tokens_content column
insertion_index = smartlocal.columns.get_loc('n_tokens_content') + 1
smartlocal.insert(loc=insertion_index,
                  column='prop_non_stop',
                  value=prop_non_stop)

# Insert n_tokens_title after prop_non_stop column
insertion_index = smartlocal.columns.get_loc('prop_non_stop') + 1
smartlocal.insert(loc=insertion_index,
                  column='prop_unique_non_stop',
                  value=prop_unique_non_stop)

### Sentiment and Subjectivity
Before getting the polarity and subjectivity of the titles and content through the textblob library, preprocessing of text is performed:
- removal of social media tags (e.g. @instagram)
- uncontracting words (you've to you have); limitation: I'd will be uncontracted to I would by default, instead of I had
- removal of punctuations
- removal of stop words
- lemmatization

In [23]:
# Functions to process text by removing social media tags, uncontracting
# words, removing punctuations and stop words
functions = [remove_tags, contractions.fix, remove_punctuations,
             lambda z: remove_stopwords(text=z, stopwords=stop_words),
             lemmatize, get_sentiment]
mass_apply = lambda x: reduce(lambda y, f: f(y), functions, x)

# Process title and content
title_processed = smartlocal['title'].apply(mass_apply)
preview_processed = smartlocal['preview'].apply(mass_apply)
content_processed = smartlocal['content'].apply(mass_apply)

title_df = pd.DataFrame(data=title_processed.tolist(),
                          columns=['title_polarity', 'title_subjectivity'])
preview_df = pd.DataFrame(data=preview_processed.tolist(),
                          columns=['preview_polarity', 'preview_subjectivity'])
content_df = pd.DataFrame(data=content_processed.tolist(),
                          columns=['content_polarity', 'content_subjectivity'])

smartlocal_new = pd.concat(objs=[smartlocal.loc[:,:'n_tokens_title'],
                                 title_df,
                                 smartlocal.loc[:,'n_tokens_preview'],
                                 preview_df,
                                 smartlocal.loc[:,'n_tokens_content'],
                                 content_df,
                                 smartlocal.loc[:,'prop_non_stop':]],
                           axis=1)

In [38]:
len(smartlocal_new.columns)

32

In [40]:
smartlocal_new.isnull().sum()

url                        0
timedelta                  0
title                      0
category                   0
subcategory1               0
subcategory2            3967
subcategory3            4079
preview                    0
content                    0
n_tokens_title             0
title_polarity             0
title_subjectivity         0
n_tokens_preview           0
preview_polarity           0
preview_subjectivity       0
n_tokens_content           0
content_polarity           0
content_subjectivity       0
prop_non_stop              0
prop_unique_non_stop       0
reading_duration           0
author                     0
publish_date               0
day_of_week                0
month                      0
year                       0
num_imgs                   0
img_links                  0
num_hrefs                  0
num_self_hrefs             0
num_tags                   0
num_shares                 0
dtype: int64

# Export DataFrame

In [42]:
# Export dataframe to xlsx file
smartlocal_new.to_excel('./datasets/thesmartlocal.xlsx',
                        index=False,
                        encoding="utf-16")

# Export dataframe to parquet file
smartlocal_new.to_parquet('./datasets/thesmartlocal.parquet')