# Project 3: Web APIs and Classification: Data cleaning and EDA

## Problem Statement

The goal is to help a colleague, who moderates the entrepreneur subreddit, filter out the investing posts that end up there.
The entrepreneur subreddit is crowded with irrelevant posts that belong in the investing subreddit, which annoys the business owners who come to share their business ideas. A classifier can separate business ideas from business investment strategies, keeping the subreddit useful for new business owners who want to start a side hustle. Two classifiers will be built: **Logistic Regression and Multinomial Naive Bayes**. The performance metric will be **accuracy**, since the model needs to assign each post to its correct subreddit.

## Executive Summary

The project aims to create a classification model that distinguishes posts from the entrepreneur subreddit from posts in the investing subreddit. The post titles and bodies were scraped from their respective subreddits using reddit's API and the requests library. The posts were cleaned to remove unwanted characters, and a brief exploratory data analysis was done on the dataset.

From the analysis, the Term Frequency-Inverse Document Frequency (TF-IDF) Vectorizer was found to perform slightly better than the Count Vectorizer. The Multinomial Naive Bayes model also performed slightly better than the Logistic Regression model, although their performance was comparable. Both models score higher than the baseline accuracy.


### Contents:
- [Data Import & Cleaning for Entrepreneur subreddit](#Obtaining-the-data-from-entrepreneur-sub-reddit)
- [Data Import & Cleaning for Investing subreddit](#Obtaining-the-data-from-investing-sub-reddit)
- [Exploratory Data Analysis for Entrepreneur Subreddit](#EDA-on-Entrepreneurship-sub)
- [Exploratory Data Analysis for Investing Subreddit](#EDA-on-Investing-subreddit)
- [Merge the dataframes](#Merge-both-subreddit-dataframes)

In [1]:
#Imports:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import requests
import re
from bs4 import BeautifulSoup as bs
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import random
import time

%matplotlib inline

In [2]:
# Set the graph style
plt.style.use('ggplot')

# Data importing and Cleaning

## Obtaining the data from entrepreneur sub reddit

In [3]:
def get_reddit_post(url, no_loop):
    """
    Obtains the posts from the subreddit given by url and saves them to ./datasets/<name>.csv.
    url is the .json listing url of the subreddit
    no_loop is the number of pages (requests) to fetch
    The csv file name is taken from the subreddit name in the url.
    """
    posts = []
    after = None
    new_df = 'No dataframe as it\'s one loop'
    name = re.search(r'/r/(.+)\.json', url).group(1) # subreddit name, also used as the csv file name

    for loop in range(no_loop): # Number of loops
        if after is None: # First page: there is no pagination token yet
            current_url = url # Make use of the current url
        else:
            current_url = url + '?after=' + after # Request the page that follows the last post seen
        print(f'Current url is: {current_url}') # Prints the current url
        res = requests.get(current_url, headers={'User-agent': 'Entre 2.0'}) # Create the request, USER AGENT can be changed

        if res.status_code != 200:
            print('Status error', res.status_code) # If error, then break
            break

        current_dict = res.json() # Parse into JSON
        current_posts = [p['data'] for p in current_dict['data']['children']] # Gets the current posts
        posts.extend(current_posts) # Store it in a list named posts
        after = current_dict['data']['after'] # Get the next url

        if loop > 0: # Saving the progress
            prev_posts = pd.read_csv('./datasets/' + name + '.csv') # Load the posts saved so far
            current_df = pd.DataFrame(current_posts) # current posts in a new dataframe
            new_df = pd.concat([prev_posts, current_df]) # Append this page; new_df holds the result of the last completed loop
            new_df.to_csv('./datasets/' + name + '.csv', index=False)
        else:
            pd.DataFrame(posts).to_csv('./datasets/' + name + '.csv', index = False)

        # generate a random sleep duration to look more 'natural'
        sleep_duration = random.randint(2,6)
        print(f'Sleep for {sleep_duration} seconds') # Sleep duration in seconds
        time.sleep(sleep_duration)
        
    return new_df
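
The pagination logic above can be exercised without touching the network. The sketch below applies the same two steps (collect the `children`, read the `after` token) to a hand-made dictionary shaped like the listing the function parses. `parse_listing`, `next_url` and the sample payload are hypothetical names and data used only for illustration.

```python
def parse_listing(listing):
    """Return (posts, after_token) from a listing-shaped dict."""
    posts = [child['data'] for child in listing['data']['children']]
    return posts, listing['data']['after']


def next_url(base_url, after):
    """Build the url of the next page; the first page has no token."""
    return base_url if after is None else f'{base_url}?after={after}'


# Hand-made payload mimicking the structure used in get_reddit_post (not real data)
sample = {
    'data': {
        'children': [{'data': {'title': 'first'}}, {'data': {'title': 'second'}}],
        'after': 't3_abc123',
    }
}

base = 'https://www.reddit.com/r/entrepreneur.json'
sample_posts, sample_after = parse_listing(sample)
first_page = next_url(base, None)
second_page = next_url(base, sample_after)
print(first_page)
print(second_page)
```

When the token comes back as `None`, `next_url` falls back to the base url, which is the same condition the scraping loop uses for its first request.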

In [None]:
# Scrape the entrepreneur subreddit posts, about 1,000 posts scraped
get_reddit_post('https://www.reddit.com/r/entrepreneur.json',40)

In [4]:
# Read the entrepreneur csv file
entre_df = pd.read_csv('./datasets/entrepreneur.csv')

In [5]:
entre_df[['title','selftext']].loc[0] # Gets the title and text of the first post

title               Thank you Thursday! - (November 26, 2020)
selftext    Your opportunity to thank the /r/Entrepreneur ...
Name: 0, dtype: object

### Clean the data in entrepreneur sub reddit

In [6]:
# Drop the first row as it's not relevant: it's the recurring 'Thank you Thursday' thread
entre_df.drop(labels=0, inplace=True)

In [7]:
entre_df.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,link_flair_template_id,author_cakeday
1,,Entrepreneur,Happy Thanksgiving everyone!\n\nToday probably...,t2_10hwkz,False,,0,False,"Being driven individuals, we are all rushing t...",[],...,all_ads,False,https://www.reddit.com/r/Entrepreneur/comments...,846881,1606411000.0,1,,False,,
2,,Entrepreneur,"Hi, recently we had a client who was strugglin...",t2_5mi25ldv,False,,0,False,How you can reduce bounce rate from your webpage?,[],...,all_ads,False,https://www.reddit.com/r/Entrepreneur/comments...,846881,1606449000.0,0,,False,,
3,,Entrepreneur,Link to video: https://youtu.be/j6QPZp--lJE\n ...,t2_17jopcwt,False,,0,False,"I made an animated summary of ""The lean Start ...",[],...,all_ads,False,https://www.reddit.com/r/Entrepreneur/comments...,846881,1606453000.0,0,,False,,
4,,Entrepreneur,Hi all I'm a 16 year old from Adelaide Austral...,t2_1bftus7y,False,,0,False,Skate ramp business,"[{'e': 'text', 't': 'Recommendations?'}]",...,all_ads,False,https://www.reddit.com/r/Entrepreneur/comments...,846881,1606390000.0,0,,False,b5eccc92-6452-11e6-93ad-0ecc2c508ed9,
5,,Entrepreneur,Hey entrepreneurs - I recently came up with an...,t2_rxefv,False,,0,False,Help with getting a textile prototype created.,[],...,all_ads,False,https://www.reddit.com/r/Entrepreneur/comments...,846881,1606457000.0,0,,False,,


In [8]:
# Title and Posts
entre_df[['title', 'selftext']]

Unnamed: 0,title,selftext
1,"Being driven individuals, we are all rushing t...",Happy Thanksgiving everyone!\n\nToday probably...
2,How you can reduce bounce rate from your webpage?,"Hi, recently we had a client who was strugglin..."
3,"I made an animated summary of ""The lean Start ...",Link to video: https://youtu.be/j6QPZp--lJE\n ...
4,Skate ramp business,Hi all I'm a 16 year old from Adelaide Austral...
5,Help with getting a textile prototype created.,Hey entrepreneurs - I recently came up with an...
...,...,...
993,Me again with another side hustle vid for feed...,"G'day guys, \nBack again with the second part..."
994,"I made an animated summary of ""The Magic of Th...",Link to video: https://youtu.be/wdQRQ82AED8\n ...
995,Tik tok marketing?,Have been seeing a lot lately from startups tr...
996,How to make passive income,This is the best way to earn money from HOme\n...


In [9]:
# Stack the titles and post bodies into one column using pd.concat (each title and each body becomes its own row)
entre_df_merged = pd.concat([entre_df['title'], entre_df['selftext']], axis=0)
entre_df_merged = pd.DataFrame(entre_df_merged, columns=['text'])
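
To make the effect of `axis=0` concrete, here is a minimal sketch on a toy frame with made-up values standing in for `entre_df`. It shows that titles and bodies end up as separate rows of one column and that the original index labels repeat.

```python
import pandas as pd

# Toy frame with hypothetical values, standing in for entre_df
toy = pd.DataFrame({'title': ['t1', 't2'], 'selftext': ['body1', 'body2']}, index=[1, 2])

# axis=0 stacks the two Series on top of each other
stacked = pd.concat([toy['title'], toy['selftext']], axis=0)
stacked_df = stacked.to_frame(name='text')

print(stacked.tolist())        # titles first, then bodies
print(stacked.index.tolist())  # each original index label appears twice
```

Because the index labels repeat, any later step that relies on row position is easier to reason about after a `reset_index(drop=True)`.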

In [10]:
# Add the labels to the dataframe, filled with ones
entre_df_merged['label'] = np.ones(len(entre_df_merged))
entre_df_merged

Unnamed: 0,text,label
1,"Being driven individuals, we are all rushing t...",1.0
2,How you can reduce bounce rate from your webpage?,1.0
3,"I made an animated summary of ""The lean Start ...",1.0
4,Skate ramp business,1.0
5,Help with getting a textile prototype created.,1.0
...,...,...
993,"G'day guys, \nBack again with the second part...",1.0
994,Link to video: https://youtu.be/wdQRQ82AED8\n ...,1.0
995,Have been seeing a lot lately from startups tr...,1.0
996,This is the best way to earn money from HOme\n...,1.0


In [11]:
# Full function to clean the title and the post
def clean_post(df):
    """
    This function removes links, unnecessary characters and punctuation, removes stop words and lemmatizes the words
    from the posts and titles. Lemmatization is used because it maps each word to its dictionary form, which preserves the meaning of the word.
    """
    new_lst = []
    
    # Stop words
    stops = set(stopwords.words('english'))
    
    # Lemmatizer
    lemmatizer = WordNetLemmatizer()
    
    for post in df:
        # Lowercase the text
        post = post.lower()

        # Removes markdown link targets of the form (https://...)
        post = re.sub(r'\(https:.*?\)','',post)

        # Removes bare https links that end at a newline, such as youtube links
        post = re.sub(r'https:.*?\n','',post)

        # Removes uncaptured url links at the bottom of the text
        post = re.sub(r'https.*?[\n|"]','',post)

        # Removes characters: \n\n&amp;#x200B;
        post = re.sub(r'\n\n&amp;#x200b;\n\n','',post)

        # Removing the special characters, like punctuation marks, periods
        # post = re.sub(r'[^\w]',' ',post)
        
        # Replaces everything that is not a letter (digits, punctuation, symbols) with a space
        post = re.sub(r'[^a-zA-Z]', ' ', post)

        # Removes underscores
        post = re.sub(' _', ' ',post)

        # Removes additional white spaces
        post = re.sub(' +', ' ',post)
        
        # Removes the 'entrepreneur' stem (the text is already lowercased at this point)
        post = post.replace('entrepreneur','')
        
        # Removes the 'investor' and 'invest' stems; the longer one goes first so 'investor' is not left as 'or'
        post = post.replace('investor','').replace('invest','')
        
        # Stores the words in a list 
        lst = [] 
        
        # If the word is not a stop word, lemmatize it and keep it
        for word in post.split():
            if word not in stops:
                lst.append(lemmatizer.lemmatize(word))
            
        new_lst.append(" ".join(lst))
        
    return new_lst
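
The regular-expression steps of `clean_post` can be traced on a single string. The sketch below uses a made-up post and only the `re` module; it leaves out the stop-word and lemmatization steps so that it runs without the NLTK corpora. The final `.strip()` is an addition for readability of the printed result (the function itself relies on `split()` instead).

```python
import re

# Hypothetical post text containing a markdown link, a bare link and digits
sample = 'Check [my site](https://example.com/page) and https://youtu.be/abc\nEarned $500 in 2020!'

text = sample.lower()
text = re.sub(r'\(https:.*?\)', '', text)   # drop "(https://...)" markdown link targets
text = re.sub(r'https:.*?\n', '', text)     # drop bare links that end at a newline
text = re.sub(r'[^a-zA-Z]', ' ', text)      # replace every non-letter with a space
text = re.sub(r' +', ' ', text).strip()     # collapse repeated spaces
print(text)  # check my site and earned in
```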

In [12]:
entre_df_merged['text'] = clean_post(entre_df_merged['text'])
entre_df_merged

Unnamed: 0,text,label
1,driven individual rushing towards dream ever s...,1.0
2,reduce bounce rate webpage,1.0
3,made animated summary lean start eric ries hop...,1.0
4,skate ramp business,1.0
5,help getting textile prototype created,1.0
...,...,...
993,g day guy back second part disjoined series th...,1.0
994,link video release new video often interested ...,1.0
995,seeing lot lately startup trying figure tik to...,1.0
996,best way earn money home check video http www ...,1.0


In [13]:
# Drop duplicate posts
entre_df_merged = entre_df_merged.drop_duplicates(subset=['text'])
entre_df_merged

Unnamed: 0,text,label
1,driven individual rushing towards dream ever s...,1.0
2,reduce bounce rate webpage,1.0
3,made animated summary lean start eric ries hop...,1.0
4,skate ramp business,1.0
5,help getting textile prototype created,1.0
...,...,...
592,like start product photography service stick l...,1.0
593,im reading around sub around reddit building b...,1.0
594,fluffy content goal build arr business right p...,1.0
595,hi merchant holiday season coming real soon be...,1.0


In [14]:
# Checking for null values
entre_df_merged.isnull().sum()

text     0
label    0
dtype: int64

In [15]:
# Checking for duplicates
entre_df_merged.duplicated().sum()

0

### Save cleaned_entre_df to a csv file

In [16]:
entre_df_merged.to_csv('./datasets/cleaned_entre_df.csv', index=False)

## Obtaining the data from investing sub reddit

In [None]:
get_reddit_post("https://www.reddit.com/r/investing.json",40)

In [None]:
invest_df = pd.read_csv('./datasets/investing.csv')
invest_df

### Clean the data in investing sub reddit

In [None]:
# Check for duplicates
invest_df.duplicated(subset='selftext').sum()

In [None]:
# Drop duplicates
invest_df = invest_df.drop_duplicates(subset=['selftext'])
invest_df

In [None]:
# Drop the first two rows as they're not relevant (recurring threads such as ask questions monday)
invest_df.drop(labels=[0,1], inplace=True)

In [None]:
# Merge the title and post together using pd.concat
invest_df_merged = pd.concat([invest_df['title'], invest_df['selftext']], axis=0)
invest_df_merged = pd.DataFrame(invest_df_merged, columns=['text'])

In [None]:
# Add the labels to the dataframe
invest_df_merged['label'] = np.zeros(len(invest_df_merged))
invest_df_merged

In [None]:
# Clean the post with clean_post function
invest_df_merged.text = clean_post(invest_df_merged.text)
invest_df_merged

In [None]:
# Checking for null values
invest_df_merged.isnull().sum()

In [None]:
# Checking for duplicates
invest_df_merged.duplicated().sum()

### Save cleaned_invest_df to a csv file

In [None]:
invest_df_merged.to_csv('./datasets/cleaned_invest_df.csv', index=False)

# Exploratory Data Analysis

## EDA on Entrepreneurship sub 

### Distribution of frequency of the words

In [None]:
list_of_words = " ".join(entre_df_merged.text).split()
list_of_words

In [None]:
# Create a dictionary for the frequency of words
# Counter counts every word in a single pass, instead of calling list.count once per word
from collections import Counter
word_dict = Counter(list_of_words)

In [None]:
# Most frequently occurring word
max(word_dict, key=word_dict.get)

In [None]:
# Sort the words frequency in descending order
sort_words_freq = sorted(word_dict.items(), key=lambda x: x[1], reverse=True)

for i in sort_words_freq[:10]:
    print(i[0], i[1])

In [None]:
# Adding it into a dataframe
entre_freq_df = pd.DataFrame(sort_words_freq, columns=['word','frequency'])
entre_freq_df

In [None]:
plt.figure(figsize=(11,7))
sns.barplot(x='word', y='frequency', data=entre_freq_df[:10], palette='coolwarm')
plt.title('Entrepreneur subreddit: Top 10 most frequently occurring words');

Given that the entrepreneur subreddit is where business-minded people share their business ideas, it's not surprising to see the word 'business' come out on top with a count of about 800, followed by the word 'people'. This suggests that businesses depend heavily on social interactions to function. There is a large drop after the word 'business', with the other top words in the range of 400.

### Distribution of length of text for titles and posts

#### Title

In [None]:
# Reset the index
entre_df_merged = entre_df_merged.reset_index(drop=True)
entre_df_merged

In [None]:
# Slicing the titles (rows 0 to 596 are taken to be the titles; .loc includes the end label)
entre_df_merged_title = pd.DataFrame(entre_df_merged.text.loc[:596])
entre_df_merged_title
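
The slice above depends on a hard-coded boundary between titles and posts, which shifts whenever `drop_duplicates` removes a different number of rows. A minimal sketch of an alternative, assuming the titles and bodies are tagged before they are stacked, is shown below. The `source` column and the toy values are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Toy frame with hypothetical values, standing in for entre_df
toy = pd.DataFrame({'title': ['t1', 't2', 't2'], 'selftext': ['b1', 'b2', 'b3']})

# Tag each row with its origin before stacking
titles = pd.DataFrame({'text': toy['title'], 'source': 'title'})
posts = pd.DataFrame({'text': toy['selftext'], 'source': 'post'})
tagged = pd.concat([titles, posts], ignore_index=True).drop_duplicates(subset=['text'])

# Select by tag instead of by a hard-coded row number
title_rows = tagged[tagged['source'] == 'title']
post_rows = tagged[tagged['source'] == 'post']
print(len(title_rows), len(post_rows))  # 2 3
```

The tag survives de-duplication and index resets, so the title and post subsets stay correct even if the scraped data changes.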

In [None]:
# Split the words into a list and then count the number of words
entre_df_merged_title['length'] = entre_df_merged_title.text.apply(lambda x:len(x.split()))

In [None]:
entre_df_merged_title.head()

In [None]:
plt.figure(figsize=(11,7))
sns.histplot(data=entre_df_merged_title.length, bins=500)

plt.title('Distribution of length of title for Entrepreneur subreddit')
plt.xlabel('Length of title')
plt.xlim(0,50)

The distribution is heavily right-skewed. The most common title length is 5 words, and most titles fall in the range of 2 to 9 words.

#### Posts

In [None]:
# Slicing the posts
entre_df_merged_post = pd.DataFrame(entre_df_merged.text.loc[597:])
entre_df_merged_post

In [None]:
# Split the words into a list and then count the number of words
entre_df_merged_post['length'] = entre_df_merged_post.text.apply(lambda x:len(x.split()))

In [None]:
entre_df_merged_post.head()

In [None]:
plt.figure(figsize=(11,7))
sns.histplot(data=entre_df_merged_post.length, bins=500)

plt.title('Distribution of length of post for Entrepreneur subreddit')
plt.xlabel('Length of post')
plt.xlim(0,250);

The distribution of post lengths for the Entrepreneur subreddit is right-skewed. The most common post length is around 25 words, and most posts fall in the range of 25 to 50 words.

### Timing of the posts

In [None]:
# Find the columns that hold timestamps
entre_df.columns[entre_df.columns.str.contains('utc|time|created')]

In [None]:
# Drop the duplicates
entre_df_time = entre_df[['title','selftext','created_utc']].drop_duplicates(subset=['selftext']).drop(['title','selftext'], axis=1)
entre_df_time

In [None]:
# Converting to datetime
entre_df_time = pd.to_datetime(entre_df_time['created_utc'], unit='s')
entre_df_time
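
As a quick check of the conversion, the sketch below applies `pd.to_datetime` with `unit='s'` to a single epoch value of the kind shown in the `created_utc` column earlier. The value is used here purely as a sample input.

```python
import pandas as pd

# 1606411000.0 is a created_utc value of the kind displayed in the table above
stamps = pd.to_datetime(pd.Series([1606411000.0]), unit='s')

first = stamps.iloc[0]
print(first)                    # timezone-naive timestamp, expressed in UTC
print(stamps.dt.hour.iloc[0])   # hour of the day in UTC
```

The resulting hours are in UTC, so they do not correspond to the local time of the people posting.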

In [None]:
# Obtaining the hours
entre_df_time.dt.hour.value_counts().plot(kind='bar', title='Hour of the day posted on entrepreneur subreddit', figsize=(11,7))
plt.xlabel('Hour')
plt.ylabel('Count');

The most popular posting hours are 19:00, 15:00, 21:00 and 18:00 (in UTC, since the timestamps come from `created_utc`). It seems that entrepreneurs like to post in the late afternoon and evening. I'll group the hours into broader periods to get an overall picture of the posting time.

In [None]:
# Group by the posters' hours: 0-6, 7-12, 13-18, 19-23 (pd.cut bins include their right edge)
entre_hour = entre_df_time.dt.hour
entre_hour_grpby = pd.cut(entre_hour, bins=[-1,6,12,18,23], labels=['Midnight','Morning','Afternoon','Evening'])
entre_hour_grpby
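
The bin edges can be confusing because `pd.cut` uses right-closed intervals by default. The sketch below, using hand-picked hours rather than the scraped data, shows which label each boundary hour receives with the same `bins` and `labels` as above.

```python
import pandas as pd

# Hand-picked boundary hours, not taken from the dataset
hours = pd.Series([0, 6, 7, 12, 13, 18, 19, 23])
periods = pd.cut(hours, bins=[-1, 6, 12, 18, 23],
                 labels=['Midnight', 'Morning', 'Afternoon', 'Evening'])

for hour, period in zip(hours, periods):
    print(hour, period)
```

Starting the first bin at -1 is what keeps hour 0 inside the 'Midnight' group; with a left edge of 0 it would fall outside every bin and become NaN.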

In [None]:
# Plotting the graph of the time of the post
entre_hour_grpby.value_counts().plot(kind='bar', title='Time of posting', figsize=(11,7))

plt.xlabel('Time of the day')
plt.ylabel('Count')
plt.xticks(fontsize=11)
plt.yticks(fontsize=11);

It seems most users post during the afternoon and evening and they're least likely to post in the morning.

## EDA on Investing subreddit

### Distribution of frequency of the words

In [None]:
list_of_words = " ".join(invest_df_merged.text).split()
list_of_words

In [None]:
# Create a dictionary for the frequency of words
# Counter counts every word in a single pass, instead of calling list.count once per word
from collections import Counter
word_dict = Counter(list_of_words)

In [None]:
# Most frequently occurring word
max(word_dict, key=word_dict.get)

In [None]:
# Sort the words frequency
sort_words_freq = sorted(word_dict.items(), key=lambda x: x[1], reverse=True)

for i in sort_words_freq[:10]:
    print(i[0], i[1])

In [None]:
# Adding the frequency of words dictionary into a dataframe
invest_freq_df = pd.DataFrame(sort_words_freq, columns=['word','frequency'])
invest_freq_df

In [None]:
plt.figure(figsize=(17,10))
sns.barplot(x='word', y='frequency', data=invest_freq_df[:10], palette='coolwarm')
plt.title('Investing subreddit: Top 10 most frequently occurring words')
plt.xticks(fontsize=11);

The most frequently occurring word for the investing subreddit is 'stock'. Given that this is a subreddit where users share their investing ideas, it's not surprising to see 'stock' at the top. The next most frequent word is 'company', most likely because users ask which types of companies to invest in. 'Market' and 'stock' go together since users are sharing their trading ideas.

The 'u' most likely stands for the United States (US) stock market: stripping punctuation splits 'U.S.' into single letters, and lemmatization can reduce 'us' to 'u'. It seems the US stock market is popular on this subreddit.

In [None]:
# Top occurring words for the entrepreneur subreddit
entre_freq_df['word'].head(10)

In [None]:
# Top occurring words for the investing subreddit
invest_freq_df['word'].head(10)

Given that the top occurring words of the two subreddits differ, I expect the model to have little trouble distinguishing the posts, although the top 10 words alone cannot guarantee this.
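
The comparison of the two top-word lists can be made explicit with a set intersection. The sketch below uses made-up word lists; in the notebook the inputs would be `entre_freq_df['word'].head(10)` and `invest_freq_df['word'].head(10)`, and the names `entre_top`, `invest_top`, `shared` and `overlap_ratio` are hypothetical.

```python
# Made-up top-word lists for illustration only, not results from the data
entre_top = ['business', 'people', 'time', 'work', 'company']
invest_top = ['stock', 'company', 'market', 'year', 'u']

# Words that appear in both lists
shared = sorted(set(entre_top) & set(invest_top))

# Share of the combined vocabulary that the two lists have in common
overlap_ratio = len(shared) / len(set(entre_top) | set(invest_top))
print(shared, round(overlap_ratio, 2))
```

A ratio close to 0 indicates that the most frequent words are mostly specific to one subreddit.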

### Distribution of length of text for titles and posts

#### Title

In [None]:
# Reset the index
invest_df_merged = invest_df_merged.reset_index(drop=True)
invest_df_merged

In [None]:
# Slicing the titles
invest_df_merged_title = pd.DataFrame(invest_df_merged.text.loc[:500])
invest_df_merged_title

In [None]:
# Split the words into a list and then count the number of words
invest_df_merged_title['length'] = invest_df_merged_title.text.apply(lambda x:len(x.split()))

In [None]:
invest_df_merged_title.head()

In [None]:
plt.figure(figsize=(11,7))
sns.histplot(data=invest_df_merged_title.length, bins=500)

plt.title('Distribution of length of title for Investing subreddit')
plt.xlabel('Length of title')
plt.xlim(0,30);

The distribution is heavily right-skewed. The most common title length is 5 words, and most titles fall in the range of 2 to 10 words.

#### Posts

In [None]:
# Slicing the posts (start at 501, because .loc[:500] above already includes row 500)
invest_df_merged_post = pd.DataFrame(invest_df_merged.text.loc[501:])
invest_df_merged_post

In [None]:
# Split the words into a list and then count the number of words
invest_df_merged_post['length'] = invest_df_merged_post.text.apply(lambda x:len(x.split()))

In [None]:
invest_df_merged_post.head()

In [None]:
plt.figure(figsize=(11,7))
sns.histplot(data=invest_df_merged_post.length, bins=500)

plt.title('Distribution of length of post for investing subreddit')
plt.xlabel('Length of post')
plt.xlim(0,250);

The distribution of post lengths for the investing subreddit is right-skewed. The most common post length is around 25 words, and most posts fall in the range of 25 to 60 words.

### Timing of the posts

In [None]:
# Find the columns that hold timestamps
invest_df.columns[invest_df.columns.str.contains('utc|time|created')]

In [None]:
# Drop the duplicates
invest_df_time = invest_df[['title','selftext','created_utc']].drop_duplicates(subset=['selftext']).drop(['title','selftext'], axis=1)
invest_df_time

In [None]:
# Converting to datetime
invest_df_time = pd.to_datetime(invest_df_time['created_utc'], unit='s')
invest_df_time

In [None]:
# Obtaining the hours
invest_df_time.dt.hour.value_counts().plot(kind='bar', title='Hour of the day posted on investing subreddit', figsize=(11,7));

The most popular posting hours are 18:00, 20:00, 17:00 and 19:00 (in UTC, since the timestamps come from `created_utc`). It seems that these traders like to post in the evening. I'll group the hours into broader periods to get an overall picture of the posting time.

In [None]:
# Group by the posters' hours: 0-6, 7-12, 13-18, 19-23 (pd.cut bins include their right edge)
invest_hour = invest_df_time.dt.hour
invest_hour_grpby = pd.cut(invest_hour, bins=[-1,6,12,18,23], labels=['Midnight','Morning','Afternoon','Evening'])
invest_hour_grpby

In [None]:
# Plotting the graph of the time of the post
invest_hour_grpby.value_counts().plot(kind='bar', title='Time of posting for investing subreddit', figsize=(11,7))
plt.xticks(fontsize=11)
plt.yticks(fontsize=11);

It seems most users post during the afternoon and evening and are least likely to post in the morning, similar to the Entrepreneur subreddit.

## Testing the functions to clean the posts

In [None]:
# Removes unwanted links, lowercases the words, removes stop words and lemmatizes
test = clean_post(entre_df_merged.text)
test

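In [None]:
# Removes stop words
# Note: remove_stop_words is not defined in this notebook; clean_post above already removes stop words
test = remove_stop_words(test)
test

In [None]:
# Lemmatizes words
# Note: lemmitizer is not defined in this notebook; clean_post above already lemmatizes the words
test = lemmitizer(test)
test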

In [None]:
" ".join(test)

## End of test

# Merge both subreddit dataframes

In [None]:
clean_invest_df = pd.read_csv('./datasets/cleaned_invest_df.csv')

In [None]:
clean_entre_df = pd.read_csv('./datasets/cleaned_entre_df.csv')

In [None]:
final_df = pd.concat([clean_entre_df,clean_invest_df],axis=0)
final_df

In [None]:
# Reset the index
final_df.reset_index(drop=True, inplace=True)

In [None]:
# A null value found in the dataframe
final_df[final_df.isnull().any(axis=1)]

In [None]:
# Drop the null value
final_df.dropna(inplace=True)
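
One plausible source of a null row at this stage, even though the earlier null checks returned 0, is a text that was cleaned down to an empty string: by default `pd.read_csv` reads an empty field back as NaN. The sketch below reproduces this with a toy frame and an in-memory buffer; the values are made up.

```python
import io
import pandas as pd

# Toy frame: the second text is empty, as can happen when cleaning strips every token
toy = pd.DataFrame({'text': ['skate ramp business', ''], 'label': [1.0, 1.0]})

# Write to an in-memory CSV and read it back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

nulls_before = int(toy['text'].isnull().sum())
nulls_after = int(reloaded['text'].isnull().sum())
print(nulls_before, nulls_after)  # 0 1
```

Dropping such rows, as done above, is reasonable because an empty text carries no words for the classifier to use.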

In [None]:
# No duplicates found
final_df.duplicated().sum()

In [None]:
final_df

## Save the final df

In [None]:
final_df.to_csv('./datasets/final_df.csv', index=False)