# The Smart Local Dataset Creation
Author: Sebastian Png

## Description of Dataset
- url: url of article
- timedelta: number of days between publication date and web scraping date (17 April 2022)
- title: title of article
- category: main category of article
- subcategory1: first subcategory of article
- subcategory2: second subcategory of article
- subcategory3: third subcategory of article
- preview: preview content of article (before clicking into article)
- content: full content of article (includes image credits)
- n_tokens_title: number of words in title (alphanumerical and including ampersands)
- title_polarity: polarity of title, values in the range of [-1, 1], -1: negative, 1: positive
- title_subjectivity: subjectivity of title, values in the range of [0, 1], 0: objective, 1: subjective
- n_tokens_content: number of words in content of article
- prop_non_stop: proportion of non-stop words in content of article
- prop_unique_non_stop: proportion of unique non-stop words among the unique words in content of article
- content_polarity: polarity of content, values in the range of [-1, 1], -1: negative, 1: positive
- content_subjectivity: subjectivity of content, values in the range of [0, 1], 0: objective, 1: subjective
- reading_duration: number of minutes to read entire article, around 200 words per minute
- author: author of article
- publish_date: publication date of article
- day_of_week: day of publication of article, 0: Monday, 6: Sunday
- month: month of publication of article
- year: year of publication of article
- num_imgs: number of images in article
- num_hrefs: number of hyperlinks in article
- num_self_hrefs: number of hyperlinks in article linked to thesmartlocal.com
- num_tags: number of tags at the end of the article
- num_shares: number of shares of article

The list of stop words is the default nltk English stop word corpus extended with the following n-grams: "ha", "wa", "cover image adapted from", "cover image credits", "cover image credit", "image credits", "image credit", "image adapted from" and "photography by".

# Import Libraries
The import statements are ordered according to [PEP8 standards](https://peps.python.org/pep-0008/#:~:text=Imports%20are%20always%20put%20at,Related%20third%20party%20imports.)

In [1]:
import datetime
import re
import string
import time

import contractions
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import requests
import seaborn as sns

from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize
from textblob import TextBlob

nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\pngse\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# Web Scraping

Articles published between 27 December 2018 and 16 April 2022 were scraped from [The Smart Local's website](https://thesmartlocal.com/) to create the underlying dataset. The block of code below retrieves the url, title, category and preview text of 3480 articles (348 listing pages with 10 previews each). The robots.txt file was checked before scraping to avoid any violations, and scraping was done overnight on a weekend to reduce disruption to the company.
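The robots.txt check mentioned above can also be done programmatically. The sketch below uses the standard library's `urllib.robotparser` on a made-up rule set. `SAMPLE_ROBOTS_TXT` and `is_allowed` are hypothetical names for illustration, and the rules shown are not the site's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
"""

def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits fetching the url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

print(is_allowed(SAMPLE_ROBOTS_TXT, "https://thesmartlocal.com/page/1/"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://thesmartlocal.com/wp-admin/"))
```

In a live check, `RobotFileParser.set_url` followed by `read()` would load the real file instead of the sample text.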

In [2]:
overall_list = []

# Each listing page yields the previews of 10 articles
for i in range(1, 349):
    url = "https://thesmartlocal.com/page/{}/".format(i)
    response = requests.get(url)

    # Using lxml’s HTML parser to parse the response text
    soup = BeautifulSoup(response.text, "lxml")
    
    for preview in soup.select(".col-lg-6+ .col-lg-6"):
        title = preview.h1.a.string
        category = preview.li.get_text(strip = True)
        article_url = preview.h1.a.get("href")
        article_summary = preview.div.get_text(" ", strip = True)

        temp_list = [article_url, title, category, article_summary]
        overall_list.append(temp_list)
    
    # Pause requests for a second to avoid spamming the website with requests
    time.sleep(1)

smartlocal = pd.DataFrame(overall_list, columns = ["url", "title", "subcategory", "preview"])

smartlocal

Unnamed: 0,url,title,subcategory,preview
0,https://thesmartlocal.com/read/pororo-park-mar...,"Pororo Park: 11,000sqft Indoor Kids’ Playgroun...",Family & Kid-friendly,"Spend a family day out at Pororo Park, where k..."
1,https://thesmartlocal.com/read/best-clubs-sing...,"Guide To Clubbing In Singapore 2022, Including...",Nightlife,Round up your squad and hit these clubs up. Th...
2,https://thesmartlocal.com/read/baju-kurung-onl...,8 Online Stores To Buy Matching Baju Kurung Fo...,Businesses,Hari Raya is right around the corner - get you...
3,https://thesmartlocal.com/read/superga-moving-...,Superga Has Up To 70% Off Storewide In Moving ...,Sales & Promotions,"Happening till 17th April only, Superga's ware..."
4,https://thesmartlocal.com/read/lingoace-upcycl...,Give Your Preloved Children’s Storybooks A New...,Events,Score freebies and good karma points!
...,...,...,...,...
3475,https://thesmartlocal.com/read/top-employer-br...,Companies In Singapore That Have Clinched Awar...,Businesses,Top Employer Brands in Singapore 2018 Nowada...
3476,https://thesmartlocal.com/read/exotic-photosho...,10 Exotic Photoshoot Destinations With Direct ...,"Travel Guides & Tips,Wedding",Unique wedding shoot destinations Images adapt...
3477,https://thesmartlocal.com/read/chiang-rai-thin...,7 Outdoorsy Things To Do In Chiang Rai To Take...,Thailand,Outdoor things to do in Chiang Rai A trip to T...
3478,https://thesmartlocal.com/read/joo-chiat-katon...,9 Colourful Heritage Places In Joo Chiat And K...,Things To Do In Singapore,#Instawalk Recap: Joo Chiat & Katong Joo Chiat...


The following code visits each url scraped earlier to collect more information about each article: reading duration, author, publish date, content, number of images, number of links, number of self-directed links to The Smart Local, number of tags and number of shares.

In [3]:
article_info = []

for url in smartlocal.loc[:, "url"]:
    response = requests.get(url)

    # Using lxml’s HTML parser to parse the response text
    soup = BeautifulSoup(response.text, "lxml")
    
    after_title = soup.select(".after-title")[0]
    reading_duration = after_title.span.string if len(after_title) > 1 else after_title.string
    
    author = soup.select(".justify-content-start")[0].find_all("a")[0].string
    publish_date = soup.select("#meta-date")[0].time.get_text(" ", strip = True)
    content = soup.select("#wtr-content,.post-content")[0].get_text(" ", strip = True) # Article body
    num_imgs = len(soup.select(".size-full")) # Number of images
    
    href_list = soup.select("#wtr-content,.post-content")[0].find_all("a", {"class": ""}) # List of links in articles
    num_hrefs = len(href_list) # Total number of links in article
    num_self_hrefs = sum("thesmartlocal.com" in href.get("href") for href in href_list) # Number of links self-directed to smartlocal
    num_shares = soup.select(".mashsbcount")[0].string
    num_tags = len(soup.select(".post-tags")[0].find_all("a")) # Number of tags at the end of the article

    temp_list = [content, reading_duration, author, publish_date, num_imgs, num_hrefs, 
                 num_self_hrefs, num_tags, num_shares]
    article_info.append(temp_list)
    
    # Pause requests for a second to avoid spamming the website with requests
    time.sleep(1)

cols = ["content", "reading_duration", "author", "publish_date", "num_imgs", "num_hrefs", 
        "num_self_hrefs", "num_tags", "num_shares"]
article_info = pd.DataFrame(article_info, columns = cols)

article_info

Unnamed: 0,content,reading_duration,author,publish_date,num_imgs,num_hrefs,num_self_hrefs,num_tags,num_shares
0,Pororo Park If there’s anything that can make ...,3,Kezia Tan,16 Apr 2022,5,14,6,2,129
1,"Best clubs in Singapore After 1.5 years, the d...",10,Samantha Nguyen,16 Apr 2022,10,24,12,3,228
2,Baju kurung family sets in 2022 Festive occasi...,4,Kezia Tan,16 Apr 2022,8,26,6,2,38
3,Superga moving out sale When it comes to go-to...,2,Kezia Tan,15 Apr 2022,3,12,3,1,244
4,Upcycle Your Story Book with LingoAce If you g...,2,Joycelyn Yeow,15 Apr 2022,7,4,1,3,51
...,...,...,...,...,...,...,...,...,...
3475,Top Employer Brands in Singapore 2018 Nowadays...,7,Faith Joan Chua,02 Jan 2019,17,7,0,0,4
3476,Unique wedding shoot destinations Images adapt...,11,Stella Soon,02 Jan 2019,30,64,1,0,2
3477,Outdoor things to do in Chiang Rai A trip to T...,6,Pailin Boonlong,31 Dec 2018,14,17,0,1,0
3478,#Instawalk Recap: Joo Chiat & Katong Joo Chiat...,8,Isabelle Ong,28 Dec 2018,31,14,0,0,5
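The link counts (num_hrefs and num_self_hrefs) can be illustrated without any network access. The sketch below swaps BeautifulSoup for the standard library's `html.parser` and runs on a made-up HTML snippet. `LinkCounter` and `sample_html` are hypothetical names. Unlike the cell above, it skips anchors that have no `href` attribute, which is an assumption about how such anchors should be treated.

```python
from html.parser import HTMLParser

class LinkCounter(HTMLParser):
    """Counts hyperlinks and hyperlinks that contain a given domain."""

    def __init__(self, own_domain):
        super().__init__()
        self.own_domain = own_domain
        self.num_hrefs = 0
        self.num_self_hrefs = 0

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href is None:
            # Anchor without a link target: not counted in this sketch
            return
        self.num_hrefs += 1
        if self.own_domain in href:
            self.num_self_hrefs += 1

# Made-up article body, for illustration only
sample_html = """
<div id="wtr-content">
  <a href="https://thesmartlocal.com/read/example-a/">internal</a>
  <a href="https://example.com/">external</a>
  <a name="anchor-without-href">no link</a>
  <a href="/read/relative/">relative</a>
</div>
"""

counter = LinkCounter("thesmartlocal.com")
counter.feed(sample_html)
print(counter.num_hrefs, counter.num_self_hrefs)
```

As with the substring test in the cell above, a relative link such as `/read/relative/` is not counted as self-directed because it does not contain the domain name.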


The two scraped dataframes are then concatenated into one single dataframe.

In [4]:
smartlocal_overall = pd.concat([smartlocal, article_info], axis = 1)

# Remove share count at the end of content body
smartlocal_overall.loc[:,"content"] = (smartlocal_overall.loc[:,"content"]
                                       .apply(lambda x: re.sub(r"\s*\d+ SHARES Share Tweet", "", x)))

# Remove commas in num_shares
smartlocal_overall["num_shares"] = smartlocal_overall["num_shares"].apply(lambda x: int(re.sub(",", "", x)))
smartlocal_overall

Unnamed: 0,url,title,subcategory,preview,content,reading_duration,author,publish_date,num_imgs,num_hrefs,num_self_hrefs,num_tags,num_shares
0,https://thesmartlocal.com/read/pororo-park-mar...,"Pororo Park: 11,000sqft Indoor Kids’ Playgroun...",Family & Kid-friendly,"Spend a family day out at Pororo Park, where k...",Pororo Park If there’s anything that can make ...,3,Kezia Tan,16 Apr 2022,5,14,6,2,129
1,https://thesmartlocal.com/read/best-clubs-sing...,"Guide To Clubbing In Singapore 2022, Including...",Nightlife,Round up your squad and hit these clubs up. Th...,"Best clubs in Singapore After 1.5 years, the d...",10,Samantha Nguyen,16 Apr 2022,10,24,12,3,228
2,https://thesmartlocal.com/read/baju-kurung-onl...,8 Online Stores To Buy Matching Baju Kurung Fo...,Businesses,Hari Raya is right around the corner - get you...,Baju kurung family sets in 2022 Festive occasi...,4,Kezia Tan,16 Apr 2022,8,26,6,2,38
3,https://thesmartlocal.com/read/superga-moving-...,Superga Has Up To 70% Off Storewide In Moving ...,Sales & Promotions,"Happening till 17th April only, Superga's ware...",Superga moving out sale When it comes to go-to...,2,Kezia Tan,15 Apr 2022,3,12,3,1,244
4,https://thesmartlocal.com/read/lingoace-upcycl...,Give Your Preloved Children’s Storybooks A New...,Events,Score freebies and good karma points!,Upcycle Your Story Book with LingoAce If you g...,2,Joycelyn Yeow,15 Apr 2022,7,4,1,3,51
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3475,https://thesmartlocal.com/read/top-employer-br...,Companies In Singapore That Have Clinched Awar...,Businesses,Top Employer Brands in Singapore 2018 Nowada...,Top Employer Brands in Singapore 2018 Nowadays...,7,Faith Joan Chua,02 Jan 2019,17,7,0,0,4
3476,https://thesmartlocal.com/read/exotic-photosho...,10 Exotic Photoshoot Destinations With Direct ...,"Travel Guides & Tips,Wedding",Unique wedding shoot destinations Images adapt...,Unique wedding shoot destinations Images adapt...,11,Stella Soon,02 Jan 2019,30,64,1,0,2
3477,https://thesmartlocal.com/read/chiang-rai-thin...,7 Outdoorsy Things To Do In Chiang Rai To Take...,Thailand,Outdoor things to do in Chiang Rai A trip to T...,Outdoor things to do in Chiang Rai A trip to T...,6,Pailin Boonlong,31 Dec 2018,14,17,0,1,0
3478,https://thesmartlocal.com/read/joo-chiat-katon...,9 Colourful Heritage Places In Joo Chiat And K...,Things To Do In Singapore,#Instawalk Recap: Joo Chiat & Katong Joo Chiat...,#Instawalk Recap: Joo Chiat & Katong Joo Chiat...,8,Isabelle Ong,28 Dec 2018,31,14,0,0,5
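The two clean-up steps in In [4] can be tried on toy strings. This is a minimal sketch: `strip_share_suffix` and `parse_share_count` are hypothetical helper names, and the pattern uses `[\d,]+` instead of `\d+` on the assumption that a share count of 1,000 or more is rendered with a comma inside the content as well.

```python
import re

# Variant of the notebook's pattern that also tolerates commas in the count
SHARE_SUFFIX = re.compile(r"\s*[\d,]+ SHARES Share Tweet")

def strip_share_suffix(content):
    """Remove the trailing share widget text from an article body."""
    return SHARE_SUFFIX.sub("", content)

def parse_share_count(raw):
    """Convert a share count such as '1,234' to an integer."""
    return int(raw.replace(",", ""))

print(strip_share_suffix("Great place to visit. 1,234 SHARES Share Tweet"))
print(parse_share_count("1,234"))
```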


# Adding More Features

## Main Categories
There are currently 5 main categories on the website's navigation bar: Travel, Things To Do, Local, Adulting and Reviews. They will be used to group the subcategories into main categories. As some articles have more than one subcategory, the main category will be based on the first subcategory.

In [5]:
# Main categories
# Travel category
southeast_asia = ["Thailand", "Malaysia", "Philippines", "Indonesia", "Vietnam"]
rest_of_asia = ["China", "Hong Kong", "Japan", "Korea", "Taiwan", "Others"]
travel = ["Southeast Asia", "Rest Of Asia", "Australia", "New Zealand", "Europe", "Africa & Middle East", 
          "America", "Rest of the World", "Travel Guides & Tips"]
travel_overall = southeast_asia + rest_of_asia + travel + ["Travel"]

# Things To Do category
activities = ["Attractions", "Volunteering"]
events = ["Runs"]
food = ["Food Reviews", "Food Guides"]
nightlife = ["Bars & Clubs", "Nightlife Guides"]
hotels_and_staycations = ["Hotel Reviews", "Hotel Guides"]
sales_and_promotions = ["Contests", "Monthly Lobangs"]
things_to_do = ["Activities", "Events", "Food", "Nightlife", "Hotels & Staycations", "Family & Kid-friendly", 
                "Photospots", "Sales & Promotions", "Sports & Fitness", "Beauty & Wellness", "Fashion"]
things_to_do_overall = (activities + events + food + nightlife + hotels_and_staycations + sales_and_promotions + 
                        things_to_do + ["Things To Do", "Things To Do In Singapore", "Singapore"])

# Local category
local = (["Perspectives", "Inspiration", "Culture", "Students", "Businesses", "Hacks", "Heritage", "Misc"] + 
         ["Local", "Singapore Perspectives", "Tutorials & Self-Improvement"])

# Adulting category
parenting = ["Education"]
adulting = ["Finances", "Home", "Dating & Relationships", "Wedding", "Parenting", "Career", 
            "Self Improvement", "Pets", "Tech"]
adulting_overall = parenting + adulting + ["Adulting", "Cryptocurrency"]

# Reviews
reviews = ["Reviews"]

# Create dictionary of subcategory to one of the 5 main categories
subcat_to_travel = {subcat:"Travel" for subcat in travel_overall} 
subcat_to_things_to_do = {subcat:"Things To Do" for subcat in things_to_do_overall}
subcat_to_local = {subcat:"Local" for subcat in local}
subcat_to_adulting = {subcat:"Adulting" for subcat in adulting_overall}
subcat_to_reviews = {subcat:"Reviews" for subcat in reviews}

# Combine all dictionaries into one
subcat_to_main_cat = {**subcat_to_travel, **subcat_to_things_to_do, **subcat_to_local, 
                      **subcat_to_adulting, **subcat_to_reviews}

In [6]:
# Split subcategory column into 3 columns as there are articles with >1 subcategory
subcategory_split = smartlocal_overall["subcategory"].str.split(",", n = 2, expand = True)
subcategory_split.columns = ["subcategory1", "subcategory2", "subcategory3"]

# Get main category from the overall dictionary
main_category = pd.DataFrame(subcategory_split["subcategory1"].apply(lambda x: subcat_to_main_cat[x]))
main_category.columns = ["category"]

# Drop original subcategory column and add the main and sub categories into the dataframe
smartlocal_overall = pd.concat([smartlocal_overall.loc[:, :"title"], main_category, subcategory_split, 
                                smartlocal_overall.loc[:, "preview":]], axis = 1)

smartlocal_overall.head()

Unnamed: 0,url,title,category,subcategory1,subcategory2,subcategory3,preview,content,reading_duration,author,publish_date,num_imgs,num_hrefs,num_self_hrefs,num_tags,num_shares
0,https://thesmartlocal.com/read/pororo-park-mar...,"Pororo Park: 11,000sqft Indoor Kids’ Playgroun...",Things To Do,Family & Kid-friendly,,,"Spend a family day out at Pororo Park, where k...",Pororo Park If there’s anything that can make ...,3,Kezia Tan,16 Apr 2022,5,14,6,2,129
1,https://thesmartlocal.com/read/best-clubs-sing...,"Guide To Clubbing In Singapore 2022, Including...",Things To Do,Nightlife,,,Round up your squad and hit these clubs up. Th...,"Best clubs in Singapore After 1.5 years, the d...",10,Samantha Nguyen,16 Apr 2022,10,24,12,3,228
2,https://thesmartlocal.com/read/baju-kurung-onl...,8 Online Stores To Buy Matching Baju Kurung Fo...,Local,Businesses,,,Hari Raya is right around the corner - get you...,Baju kurung family sets in 2022 Festive occasi...,4,Kezia Tan,16 Apr 2022,8,26,6,2,38
3,https://thesmartlocal.com/read/superga-moving-...,Superga Has Up To 70% Off Storewide In Moving ...,Things To Do,Sales & Promotions,,,"Happening till 17th April only, Superga's ware...",Superga moving out sale When it comes to go-to...,2,Kezia Tan,15 Apr 2022,3,12,3,1,244
4,https://thesmartlocal.com/read/lingoace-upcycl...,Give Your Preloved Children’s Storybooks A New...,Things To Do,Events,,,Score freebies and good karma points!,Upcycle Your Story Book with LingoAce If you g...,2,Joycelyn Yeow,15 Apr 2022,7,4,1,3,51
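The lookup in In [6] indexes the dictionary directly, so a subcategory that is missing from every list would raise a `KeyError`. The sketch below shows the same first-subcategory rule with a fallback value instead. `toy_mapping`, `main_category` and the `"Unknown"` default are hypothetical and only cover a handful of subcategories.

```python
# Small excerpt of a subcategory-to-category mapping, for illustration only
toy_mapping = {
    "Thailand": "Travel",
    "Travel Guides & Tips": "Travel",
    "Nightlife": "Things To Do",
    "Businesses": "Local",
    "Wedding": "Adulting",
}

def main_category(subcategory_field, mapping, default="Unknown"):
    """Map the first comma-separated subcategory to its main category."""
    first_subcategory = subcategory_field.split(",", 2)[0].strip()
    return mapping.get(first_subcategory, default)

print(main_category("Travel Guides & Tips,Wedding", toy_mapping))
print(main_category("Some New Section", toy_mapping))
```

Rows that come back as the default value can then be inspected and the lists extended, rather than the whole cell failing on the first unseen subcategory.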


In [7]:
smartlocal_overall["category"].value_counts()

Things To Do    1769
Local            920
Adulting         580
Travel           207
Reviews            4
Name: category, dtype: int64

# Time Delta
Number of days between the article's publish date and the date of web scraping (17 April 2022)

In [8]:
smartlocal_overall["publish_date"] = pd.to_datetime(smartlocal_overall["publish_date"])
timedelta = (datetime.date(2022, 4, 17) - smartlocal_overall["publish_date"].dt.date).dt.days

# Insert timedelta after url column
insertion_index = smartlocal_overall.columns.get_loc("url") + 1
smartlocal_overall.insert(insertion_index, "timedelta", timedelta)
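For a single article, the timedelta computation reduces to a date subtraction. A minimal sketch with the standard library, assuming the scraped date text has the form shown in the tables above (for example "16 Apr 2022") and an English locale for the month abbreviation; `days_since_publication` is a hypothetical helper name.

```python
import datetime

SCRAPE_DATE = datetime.date(2022, 4, 17)

def days_since_publication(publish_date_text, scrape_date=SCRAPE_DATE):
    """Number of days between the publication date and the scraping date."""
    publish_date = datetime.datetime.strptime(publish_date_text, "%d %b %Y").date()
    return (scrape_date - publish_date).days

print(days_since_publication("16 Apr 2022"))  # 1
```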

# Day of the Week, Month, Year
Extracted from publication date

In [9]:
# Get days of the week from publish date. 0: Monday 6: Sunday
days_of_week = smartlocal_overall["publish_date"].apply(lambda x: x.weekday())

month = smartlocal_overall["publish_date"].dt.month
year = smartlocal_overall["publish_date"].dt.year

# Insert day_of_week after publish_date column
insertion_index = smartlocal_overall.columns.get_loc("publish_date") + 1
smartlocal_overall.insert(insertion_index, "day_of_week", days_of_week)

# Insert month after day_of_week column
insertion_index = smartlocal_overall.columns.get_loc("day_of_week") + 1
smartlocal_overall.insert(insertion_index, "month", month)

# Insert year after month column
insertion_index = smartlocal_overall.columns.get_loc("month") + 1
smartlocal_overall.insert(insertion_index, "year", year)

# Convert publication date to date format for exporting purposes
smartlocal_overall["publish_date"] = smartlocal_overall["publish_date"].dt.date
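As a quick check of the day_of_week encoding (0: Monday, 6: Sunday), the same three features can be derived for one date with the standard library. `date_features` is a hypothetical helper name.

```python
import datetime

def date_features(publish_date):
    """Return the day of week (0: Monday, 6: Sunday), month and year of a date."""
    return {
        "day_of_week": publish_date.weekday(),
        "month": publish_date.month,
        "year": publish_date.year,
    }

# 16 April 2022 fell on a Saturday
print(date_features(datetime.date(2022, 4, 16)))
```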

# Word Count

## n_tokens_title and n_tokens_content
Word counts of the title and the content

In [10]:
# Removes punctuations and keeps all alphanumerical text
def remove_punctuations(text):
    return re.sub(r"[^\w\s]", "", text)

# Word count of title - numbers and punctuations are still kept
n_tokens_title = smartlocal_overall["title"].apply(lambda x: len(x.split()))

# Word count of article content - punctuations are removed but numbers are kept
n_tokens_content = smartlocal_overall["content"].apply(lambda x: len(remove_punctuations(x).split()))

# Insert n_tokens_title after content column
insertion_index = smartlocal_overall.columns.get_loc("content") + 1
smartlocal_overall.insert(insertion_index, "n_tokens_title", n_tokens_title)

# Insert n_tokens_content after n_tokens_title column
insertion_index = smartlocal_overall.columns.get_loc("n_tokens_title") + 1
smartlocal_overall.insert(insertion_index, "n_tokens_content", n_tokens_content)

Using n_tokens_content, the row with the unknown reading duration can be estimated.

In [11]:
# Set unknown reading duration to 0
error_index = smartlocal_overall.loc[smartlocal_overall["reading_duration"] == "featured", :].index[0] 
smartlocal_overall.loc[error_index, "reading_duration"] = "0"

# Convert reading durations from string to int
smartlocal_overall["reading_duration"] = (smartlocal_overall["reading_duration"]
                                          .apply(lambda x: re.sub(r"\D+", "", x)).astype(int))

# Estimate reading duration for the row without value
words_per_min = (n_tokens_content.drop(error_index) / smartlocal_overall["reading_duration"].drop(error_index)).mean()
print("Words per Min: " + str(words_per_min))

# Estimated reading duration is truncated to whole minutes (int() rounds down)
smartlocal_overall.loc[error_index, "reading_duration"] = int(n_tokens_content[error_index]/words_per_min)

Words per Min: 198.4629743736535
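The estimate in In [11] converts the quotient with `int()`, which drops the fractional part. The sketch below contrasts that with rounding up via `math.ceil`. `estimate_minutes` and its `round_up` flag are hypothetical, and the 1000-word article is a made-up input.

```python
import math

def estimate_minutes(n_words, words_per_min, round_up=False):
    """Estimate reading duration in whole minutes from a word count."""
    minutes = n_words / words_per_min
    return math.ceil(minutes) if round_up else int(minutes)

# 1000 / 198.46 is roughly 5.04 minutes
print(estimate_minutes(1000, 198.46))                 # 5
print(estimate_minutes(1000, 198.46, round_up=True))  # 6
```

Which of the two matches the website's own reading durations is not something the notebook establishes, so the choice is left open here.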


## prop_non_stop and prop_unique_non_stop

Before finding the proportion of non stop words in each article, it is important to find other potential stop words to extend the list.

In [12]:
# Converts a text into the lemmatized form
def lemmatize(text):
    wnl = WordNetLemmatizer()
    return " ".join([wnl.lemmatize(word) for word in text.split()])

# Combine all article content into a single string
total_content = " ".join(smartlocal_overall["content"])
total_content_lemma = lemmatize(remove_punctuations(total_content.lower()))

# Word frequency dictionary
word_to_frequency = FreqDist()

for sentence in nltk.tokenize.sent_tokenize(total_content_lemma):
    for word in nltk.tokenize.word_tokenize(sentence):
        if word not in string.punctuation:
            word_to_frequency[word] += 1

freq_df = pd.DataFrame.from_dict(word_to_frequency.items())
freq_df.columns = ["word", "frequency"]
freq_df.sort_values(by = "frequency", ascending = False, inplace = True)
freq_df.reset_index(drop = True, inplace = True)
freq_df.head()

Unnamed: 0,word,frequency
0,the,164811
1,a,131082
2,to,130462
3,and,109437
4,of,87236


Once the default stop words are excluded (see the top 20 list in the next cell), "image" and "credit" are the most frequent words, as the articles often include image credits. Hence, the bigram "image credit" can be added into the stop words corpus, along with different variations of it. In addition, "wa" and "ha" will be included in the corpus, as they are what the lemmatizer produces from the stop words "was" and "has".

In [13]:
k, v = zip(*word_to_frequency.most_common())

# Default stop words
stop_words = stopwords.words("english")

# Get non stop words with highest frequency across all articles
non_stop_words = [x for x in k if x not in stop_words]

# Top 20 non stop words
non_stop_words[:20]

['image',
 'credit',
 'singapore',
 'like',
 'also',
 'get',
 'one',
 'youre',
 'time',
 'youll',
 'wa',
 'hour',
 'even',
 'day',
 'ha',
 'new',
 'make',
 'home',
 'website',
 'address']

In [14]:
# Returns a tuple of proportion of non stopwords in text and proportion of unique non stopwords in text
def get_non_stop_words_rate(text, stopwords):
    tokens = text.split()
    new_text = text
    for s in stopwords:
        pattern = " " + s + " "
        new_text = re.sub(pattern, " ", new_text)
    non_stop_words_tokens = new_text.split()
    
    # Proportion of non stopwords in text
    rate = len(non_stop_words_tokens)/len(tokens)
    
    # Proportion of unique non stopwords in text
    unique_rate = len(set(non_stop_words_tokens))/len(set(tokens))
    
    return rate, unique_rate

# Extend stop words
stop_words = ["cover image adapted from", "cover image credits", "cover image credit", "image credits", 
              "image credit", "image adapted from", "photography by", "ha", "wa"] + stop_words

# List of default stopwords with punctuations removed, to avoid issues of encoding
stop_words_no_punc = [remove_punctuations(word) for word in stop_words]

# Functions to apply
functions = lambda x: get_non_stop_words_rate(lemmatize(remove_punctuations(x.lower())), stop_words_no_punc)
rates = smartlocal_overall["content"].apply(lambda x: functions(x))
prop_non_stop = rates.str[0]
prop_unique_non_stop = rates.str[1]

# Insert prop_non_stop after n_tokens_content column
insertion_index = smartlocal_overall.columns.get_loc("n_tokens_content") + 1
smartlocal_overall.insert(insertion_index, "prop_non_stop", prop_non_stop)

# Insert prop_unique_non_stop after prop_non_stop column
insertion_index = smartlocal_overall.columns.get_loc("prop_non_stop") + 1
smartlocal_overall.insert(insertion_index, "prop_unique_non_stop", prop_unique_non_stop)
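The two proportions can be worked through by hand on a short sentence. The sketch below is a token-based variant, not the regex substitution used in In [14]: it tests each token for set membership, so it handles single-word stop words only, and multi-word phrases such as "image credit" would have to be removed before tokenizing. `non_stop_rates` and the three-word stop list are hypothetical.

```python
def non_stop_rates(text, stop_words):
    """Return (proportion of non-stop words, proportion of unique non-stop words)."""
    tokens = text.split()
    if not tokens:
        return 0.0, 0.0
    stop_set = set(stop_words)
    kept = [token for token in tokens if token not in stop_set]
    rate = len(kept) / len(tokens)
    unique_rate = len(set(kept)) / len(set(tokens))
    return rate, unique_rate

# 8 tokens, 4 kept; 6 unique tokens, 3 unique kept
rate, unique_rate = non_stop_rates("the cafe is the best cafe in town", ["the", "is", "in"])
print(rate, unique_rate)  # 0.5 0.5
```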

## Sentiment and Subjectivity
Before getting the polarity and subjectivity of the titles and content through the textblob library, preprocessing of text is performed:
- removal of social media tags (e.g. @instagram)
- expanding contractions (e.g. "you've" to "you have"); limitation: "I'd" is expanded to "I would" by default, instead of "I had"
- removal of punctuations
- removal of stop words
- lemmatization
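Part of this pipeline can be sketched with the standard library alone. The version below covers tag removal, punctuation removal and stop word removal; expanding contractions and lemmatization are left out because they rely on third-party packages. `preprocess` and `TOY_STOP_WORDS` are hypothetical names, and the stop list is a tiny stand-in for the real one.

```python
import re

# Tiny stand-in stop word list, for illustration only
TOY_STOP_WORDS = {"the", "a", "at", "is"}

def preprocess(text, stop_words=TOY_STOP_WORDS):
    """Lower-case, drop @tags, strip punctuation and remove stop words."""
    text = re.sub(r"@\w+", "", text.lower())   # social media tags first
    text = re.sub(r"[^\w\s]", "", text)        # then punctuation
    return " ".join(word for word in text.split() if word not in stop_words)

print(preprocess("The view at @rooftopbar is AMAZING!"))  # view amazing
```

The order matters: if punctuation were stripped first, the "@" would disappear and the handle would survive as an ordinary word.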

In [15]:
# Removes social media tags
def remove_tags(text):
    return re.sub(r"@\w+", "", text.lower())

# Removes stopwords from text
def remove_stopwords(text, stopwords):
    new_text = text
    for s in stopwords:
        pattern = " " + s + " "
        new_text = re.sub(pattern, " ", new_text)
    
    return new_text

def get_polarity(text):
    blob = TextBlob(text)
    polarity = blob.sentiment.polarity
    
    return polarity

def get_subjectivity(text):
    blob = TextBlob(text)
    subjectivity = blob.sentiment.subjectivity
    
    return subjectivity

# Print statistics of series
def print_stats(s):
    print("Min: {}\nMean: {}\nMedian: {}\nMax: {}\nStd: {}".format(s.min(), s.mean(), s.median(), s.max(), s.std()))

# Functions to process text by removing social media tags, uncontracting words, removing punctuations and stop words
functions = lambda x: lemmatize(remove_stopwords(remove_punctuations(contractions.fix(remove_tags(x))), stop_words))

# Process title and content
title_processed = smartlocal_overall["title"].apply(lambda x: functions(x))
content_processed = smartlocal_overall["content"].apply(lambda x: functions(x))

# Polarity and subjectivity of title (computed on the raw title; title_processed is not used here)
title_polarity = smartlocal_overall["title"].apply(get_polarity)
title_subjectivity = smartlocal_overall["title"].apply(get_subjectivity)

# Polarity and subjectivity of content
content_polarity = content_processed.apply(get_polarity)
content_subjectivity = content_processed.apply(get_subjectivity)

# Insert title_polarity after n_tokens_title column
insertion_index = smartlocal_overall.columns.get_loc("n_tokens_title") + 1
smartlocal_overall.insert(insertion_index, "title_polarity", title_polarity)

# Insert title_subjectivity after title_polarity column
insertion_index = smartlocal_overall.columns.get_loc("title_polarity") + 1
smartlocal_overall.insert(insertion_index, "title_subjectivity", title_subjectivity)

# Insert content_polarity after prop_unique_non_stop column
insertion_index = smartlocal_overall.columns.get_loc("prop_unique_non_stop") + 1
smartlocal_overall.insert(insertion_index, "content_polarity", content_polarity)

# Insert content_subjectivity after content_polarity column
insertion_index = smartlocal_overall.columns.get_loc("content_polarity") + 1
smartlocal_overall.insert(insertion_index, "content_subjectivity", content_subjectivity)

print("title_polarity")
print_stats(title_polarity)

print("\ntitle_subjectivity")
print_stats(title_subjectivity)

print("\ncontent_polarity")
print_stats(content_polarity)

print("\ncontent_subjectivity")
print_stats(content_subjectivity)

title_polarity
Min: -1.0
Mean: 0.15135398413447892
Median: 0.04999999999999999
Max: 1.0
Std: 0.285848153527945

title_subjectivity
Min: 0.0
Mean: 0.3614090849656949
Median: 0.3833333333333333
Max: 1.0
Std: 0.30747529116939637

content_polarity
Min: -0.12586544948557946
Mean: 0.1546244575659702
Median: 0.15459499674816576
Max: 0.41025641025641035
Std: 0.06684516451228775

content_subjectivity
Min: 0.2137269193391642
Mean: 0.46860674418882714
Median: 0.470415652961371
Max: 0.7278610098415299
Std: 0.0570392352868297


In [16]:
len(smartlocal_overall.columns)

28

In [17]:
smartlocal_overall.isnull().sum()

url                        0
timedelta                  0
title                      0
category                   0
subcategory1               0
subcategory2            3381
subcategory3            3479
preview                    0
content                    0
n_tokens_title             0
title_polarity             0
title_subjectivity         0
n_tokens_content           0
prop_non_stop              0
prop_unique_non_stop       0
content_polarity           0
content_subjectivity       0
reading_duration           0
author                     0
publish_date               0
day_of_week                0
month                      0
year                       0
num_imgs                   0
num_hrefs                  0
num_self_hrefs             0
num_tags                   0
num_shares                 0
dtype: int64

In [18]:
# Export dataframe to xlsx file
smartlocal_overall.to_excel("./datasets/thesmartlocal.xlsx", index = False)


