# Project 3 - Book 1: Web APIs & Classification

## Problem Statement

Nutrino is a leading provider of nutrition-related data services and analytics. As part of its data science team, we have been tasked with generating business insights from popular social media platforms. The company will use these insights to better understand customers and markets, enhance decision-making, and ultimately increase profitability.

To do so, we will first scrape data from Reddit and then use classification models such as Logistic Regression and Naive Bayes to uncover patterns within 2 popular diets, Keto and Vegan. We will measure our success using several classification metrics, including accuracy and F1 score.

We hope to reveal previously unrecognised sub-trends in attitudes, lifestyles and buying behavior, and to separate strong sub-trends from passing ones. With a better understanding of the population and their eating patterns, our clients will be able to strengthen their targeted marketing campaigns and improve the success of their products and programs.
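
As a rough illustration of the two metrics named above, the sketch below computes accuracy and F1 from a pair of label lists in plain Python. The function name `accuracy_and_f1` and the toy labels are assumptions made for this example; they are not taken from the modeling notebook.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels, treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, f1


# Toy labels (hypothetical): 1 = vegan, 0 = keto
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(round(acc, 3), round(f1, 3))
```

Accuracy counts every correct prediction equally, while F1 focuses on how well the positive class is recovered, which matters more when the classes are not evenly balanced.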

## Executive Summary

As the data science team at Nutrino, we have been tasked with building a classifier to improve the company's core product: nutrition-related data services and analytics. We have also been tasked with identifying patterns in 2 currently trending diets, keto and vegan.

Our classifier predicted the subreddit of a post with an accuracy above 90%. We also identified patterns in the motivations and preferences of the 2 groups of subredditors, which will help determine the kind of customer engagement to pursue with each group.


## Notebooks:
- [Data Scraping and Cleaning](./book1_data_scrapping_cleaning.ipynb)
- [EDA](./book2_eda.ipynb)
- [Modeling and Recommendations](./book3_preprocesing_modeling_recommendations.ipynb)


## Contents:
- [Import Libraries](#Import-Libraries)
- [Data Scrapping](#Data-Scrapping)
- [Data Cleaning](#Data-Cleaning)
- [Save Data to CSV](#Save-Data-to-CSV)

### Import Libraries

In [1]:
import requests
import time
import random
import pandas as pd

### Data Scrapping

In [2]:
#set a custom User-agent header so that reddit won't treat us as a default bot and block us
header = {'User-agent':'ididitforthemulz'}

#list of subreddits that we want to scrape
sub_reds = ['vegan','keto']

In [3]:
#scrape vegan and keto posts 
all_posts = []

for sub in sub_reds:
    url = f'https://old.reddit.com/r/{sub}.json'
    posts = []
    #reset the pagination token so that each subreddit starts from its own first page
    after = None
    
    for a in range(42):
        print(a)
        if after is None:
            current_url = url
        else:
            current_url = url + '?after=' + after
        print(current_url)
        res = requests.get(current_url, headers=header)

        if res.status_code != 200:
            print('Status error', res.status_code)
            break

        current_dict = res.json()
        current_posts = [p['data'] for p in current_dict['data']['children']]
        posts.extend(current_posts)
        #token that points to the next page of the listing
        after = current_dict['data']['after']

        # generate a random sleep duration to look more 'natural'
        sleep_duration = random.randint(2,30)
        print(sleep_duration)
        time.sleep(sleep_duration)
        
    all_posts.append(posts)

0
https://old.reddit.com/r/vegan.json
9
1
https://old.reddit.com/r/vegan.json?after=t3_hfq71l
2
2
https://old.reddit.com/r/vegan.json?after=t3_hf1id4
25
3
https://old.reddit.com/r/vegan.json?after=t3_hfo6rh
8
4
https://old.reddit.com/r/vegan.json?after=t3_hfe2ej
21
5
https://old.reddit.com/r/vegan.json?after=t3_hevwpu
29
6
https://old.reddit.com/r/vegan.json?after=t3_hfdeu4
3
7
https://old.reddit.com/r/vegan.json?after=t3_helm4l
7
8
https://old.reddit.com/r/vegan.json?after=t3_hetne6
19
9
https://old.reddit.com/r/vegan.json?after=t3_herhpc
10
10
https://old.reddit.com/r/vegan.json?after=t3_hejv4u
18
11
https://old.reddit.com/r/vegan.json?after=t3_hed1ew
4
12
https://old.reddit.com/r/vegan.json?after=t3_hepimm
3
13
https://old.reddit.com/r/vegan.json?after=t3_hdpv1d
8
14
https://old.reddit.com/r/vegan.json?after=t3_he121a
6
15
https://old.reddit.com/r/vegan.json?after=t3_hdtbxm
28
16
https://old.reddit.com/r/vegan.json?after=t3_hdyrl4
18
17
https://old.reddit.com/r/vegan.json?after=t3_h
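
The pagination logic in the loop above can be tested without touching the network. The following is a minimal sketch, assuming the listing JSON has the shape used above (`data.children[*].data` and `data.after`) and that `after` comes back as `None` on the last page. `collect_posts`, `fake_fetch` and `FAKE_PAGES` are hypothetical names introduced only for this example.

```python
def collect_posts(fetch_page, max_pages):
    """Follow `after` tokens for up to `max_pages` pages and return all posts."""
    posts = []
    after = None  # every new listing starts from its first page
    for _ in range(max_pages):
        page = fetch_page(after)
        posts.extend(child['data'] for child in page['data']['children'])
        after = page['data']['after']
        if after is None:  # assumed to mean there are no further pages
            break
    return posts


# Hypothetical in-memory pages standing in for the JSON responses
FAKE_PAGES = {
    None: {'data': {'children': [{'data': {'id': 'a1'}}, {'data': {'id': 'a2'}}],
                    'after': 't3_a2'}},
    't3_a2': {'data': {'children': [{'data': {'id': 'a3'}}],
                       'after': None}},
}


def fake_fetch(after):
    # In the notebook this role is played by requests.get(...).json()
    return FAKE_PAGES[after]


ids = [post['id'] for post in collect_posts(fake_fetch, max_pages=5)]
print(ids)
```

Passing the fetch function in as an argument keeps the token handling separate from the HTTP call, so the same loop could be reused for each subreddit with a fresh `after` value.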

In [4]:
vegan_posts = []
keto_posts = []
subred_name = []
author_lst = []
ups_lst = []
downs_lst = []
num_comments_lst = []
id_lst = []


for i, sub_posts in enumerate(all_posts):
    for post in sub_posts:
        #Feature Engineering: here we add the text from 
        #selftext (which is also the body) to the title
        text = post['title'] + " " + post['selftext']
        
        #all_posts[0] holds the vegan posts, all_posts[1] the keto posts
        if i == 0:
            vegan_posts.append(text)
        else:
            keto_posts.append(text)
        
        #the remaining fields are collected the same way for both subreddits
        
        #this contains the username of the person that uploaded the post
        author_lst.append(post['author'])
        
        #these are our labels
        subred_name.append(post['subreddit_name_prefixed'])
        
        #number of upvotes
        ups_lst.append(post['ups'])
        
        #number of downvotes
        downs_lst.append(post['downs'])
        
        #number of comments
        num_comments_lst.append(post['num_comments'])
        
        #id of post
        id_lst.append(post['id'])

In [5]:
#check the length of all our lists

#vegan and keto posts have been split into 2 lists
print(len(vegan_posts),len(keto_posts))

#the rest of the data are all in 1 list
print(len(subred_name),len(author_lst),len(ups_lst),len(downs_lst),
     len(num_comments_lst),len(id_lst))

1050 1017
2067 2067 2067 2067 2067 2067
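
An alternative to keeping several parallel lists is to build one record per post, so the fields cannot drift out of alignment. This is only a sketch: `to_record`, `FIELDS` and `sample_post` are hypothetical names, and the sample values are invented for illustration. The field keys are the ones read from each post above.

```python
FIELDS = ['author', 'subreddit_name_prefixed', 'ups', 'downs', 'num_comments', 'id']


def to_record(post):
    """Pick the fields of interest from one post and add the combined text."""
    record = {field: post.get(field) for field in FIELDS}
    # same feature engineering as above: title followed by the body
    record['text'] = post.get('title', '') + " " + post.get('selftext', '')
    return record


# Invented example post, not taken from the scraped data
sample_post = {
    'title': 'Best tofu scramble',
    'selftext': 'Recipe inside',
    'author': 'example_user',
    'subreddit_name_prefixed': 'r/vegan',
    'ups': 12,
    'downs': 0,
    'num_comments': 3,
    'id': 'abc123',
}

record = to_record(sample_post)
print(record['text'])
```

A list of such records can be passed straight to `pd.DataFrame`, which would give one row per post with one column per key.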


In [6]:
print(vegan_posts[0])
print(keto_posts[0])

Vegan Hacktivists are looking for Developers, UI Designers, Writers and Social Media experts! ❤️ 🐮 Hi folks!

The [Vegan Hacktivists](https://veganhacktivists.org/) is a small group of vegan activists (a few from our mod team here) that are working on several projects helping out organizations like The Save Movement, Meat The Victims, Planet Vegan, and more. We're currently looking for several different vegans to fill volunteer positions to help us spread veganism through online activism.

🐮 **Developers:** We're looking for developers that have experience in Laravel, PHP, CSS and JS, and are familiar (or can get familiar with) with Github, Trello and Discord. If you're interested, [apply here](https://veganhacktivists.org/apply/developers)!

🐷 **UI Designers:** We're looking for designers that have experience with making UI designs for websites using tools like Figma, Sketch, and other collaborative UI designing tools. if you're interested, [apply here](http://veganhacktivists.org/app

##### First observations:
- We now have all our data from the subreddits
- The posts contain markup syntax (for example link and bold markers), which we need to remove
- We will remove links as well
- We will also convert the data to lowercase and remove stop words
- Finally, we will lemmatise the words instead of stemming them, since stemming is a harsher method that can produce non-words

Now that we have collected our data, we will move to cleaning the data. 
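
To make the planned cleaning steps concrete, here is a small sketch of what lowercasing, link removal and punctuation removal do to a short string. The sample text and the `example.org` URL are invented for illustration.

```python
import re

# Invented sample that mimics the link and bold markers seen in the raw posts
sample = "Apply [here](https://example.org/apply)! **Developers:** welcome"

lowered = sample.lower()

# http\S+ removes everything from "http" up to the next whitespace,
# which also swallows the closing ")" and "!" attached to the URL
no_links = re.sub(r'http\S+', "", lowered)

# every remaining non-alphanumeric character becomes a space
alnum_only = re.sub(r"[^a-z0-9]", " ", no_links)

tokens = alnum_only.split()
print(tokens)
```

Because the link pattern stops at whitespace, the link text in square brackets (`here`) survives while the URL itself is dropped.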

### Data Cleaning

#### Text Preprocessing

In [7]:
import nltk
import re
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

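#build the stop word set and the lemmatizer once, instead of on every call
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

#define a function to preprocess data
def robust_text_preprocessing(text):
    #change to lower
    text=text.lower()
    
    #remove links
    text=re.sub(r'http\S+',"",text)
    
    #replace every non-alphanumeric character (punctuation, emoji) with a space
    text=re.sub(r"[^A-Za-z0-9]"," ",text)
    
    #lemmatize
    #note: lemmatize() looks up a single word, so passing the whole post
    #leaves multi-word text unchanged
    text=lemmatizer.lemmatize(text)
    
    #remove stopwords
    words=text.split()
    words = [word for word in words if word not in stop_words]
    
    #join words
    return " ".join(words)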
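
`WordNetLemmatizer.lemmatize` works on one word at a time, so lemmatising a whole post means applying it token by token. The sketch below shows that pattern with a pluggable word-level function. `lemmatize_tokens` and `toy_lemmatizer` are hypothetical names, and the toy rule (dropping a trailing "s") is only a stand-in so the example runs without the WordNet corpus.

```python
def lemmatize_tokens(text, lemmatize_word):
    """Apply a word-level lemmatizer to every whitespace-separated token."""
    return " ".join(lemmatize_word(token) for token in text.split())


def toy_lemmatizer(word):
    # Stand-in rule for illustration only: strip one trailing "s" from longer words
    if len(word) > 3 and word.endswith("s"):
        return word[:-1]
    return word


per_token = lemmatize_tokens("vegans love recipes", toy_lemmatizer)
whole_string = toy_lemmatizer("vegans love recipes")
print(per_token)
print(whole_string)
```

With NLTK installed and the WordNet corpus downloaded, `WordNetLemmatizer().lemmatize` could be passed as `lemmatize_word` in place of the toy rule. Switching the notebook to per-token lemmatisation would change the cleaned text and the counts reported further down.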

In [8]:
%%time

vegan_clean=[]
print("Cleaning and parsing the Vegan posts...")

for j, post in enumerate(vegan_posts):
    # Clean each post, then append it to the list.
    vegan_clean.append(robust_text_preprocessing(post))
    
    # Print a progress message every 100 posts
    if (j + 1) % 100 == 0:
        print(f'Review {j + 1} of {len(vegan_posts)}.')

# Let's do the same for our Keto data.
keto_clean=[]
print("Cleaning and parsing the Keto posts...")

for j, post in enumerate(keto_posts):
    # Clean each post, then append it to the list.
    keto_clean.append(robust_text_preprocessing(post))
    
    # Print a progress message every 100 posts
    if (j + 1) % 100 == 0:
        print(f'Review {j + 1} of {len(keto_posts)}.')

Cleaning and parsing the Vegan posts...
Review 100 of 1050.
Review 200 of 1050.
Review 300 of 1050.
Review 400 of 1050.
Review 500 of 1050.
Review 600 of 1050.
Review 700 of 1050.
Review 800 of 1050.
Review 900 of 1050.
Review 1000 of 1050.
Cleaning and parsing the Keto posts...
Review 100 of 1017.
Review 200 of 1017.
Review 300 of 1017.
Review 400 of 1017.
Review 500 of 1017.
Review 600 of 1017.
Review 700 of 1017.
Review 800 of 1017.
Review 900 of 1017.
Review 1000 of 1017.
CPU times: user 23.9 s, sys: 4.54 s, total: 28.4 s
Wall time: 28.4 s


In [9]:
#let's take a look at our cleaned data
print(vegan_clean[0])
print(keto_clean[0])

vegan hacktivists looking developers ui designers writers social media experts hi folks vegan hacktivists small group vegan activists mod team working several projects helping organizations like save movement meat victims planet vegan currently looking several different vegans fill volunteer positions help us spread veganism online activism developers looking developers experience laravel php css js familiar get familiar github trello discord interested apply ui designers looking designers experience making ui designs websites using tools like figma sketch collaborative ui designing tools interested apply graphic designers looking designers experience making logo banners social media posts etc using tools like photoshop projects organizations shoot us portfolio interested writers currently need folks write content r vegan wiki redesign vegan challenge website ability work writers better grammar sentence constructing contact us mailto hello veganhacktivists org social media manager curr

At this stage, we have completed our data collection and preprocessing. 
We now have 2 lists of cleaned posts:
1. vegan_clean -> cleaned vegan posts without links
2. keto_clean -> cleaned keto posts without links

Let's move on to feature engineering.

#### Feature Engineering

In [10]:
#here we create the labels for each subreddit: 1 for vegan, 0 for keto
vegan_label = [1] * len(vegan_posts)
print(len(vegan_label))
print(vegan_label[:5])
keto_label = [0] * len(keto_posts)
print(len(keto_label))
print(keto_label[:5])

1050
[1, 1, 1, 1, 1]
1017
[0, 0, 0, 0, 0]


In [11]:
#let's convert our data into dataframes
vegan_df = pd.DataFrame(zip(vegan_clean,vegan_label),
                        columns = ['text','vegan_label'])

keto_df = pd.DataFrame(zip(keto_clean,keto_label),
                        columns = ['text','vegan_label'])

In [12]:
#let's see all the text data in one df, with a fresh 0..n-1 index
text_df = pd.concat([vegan_df,keto_df],axis=0,ignore_index=True)
text_df['vegan_label'].value_counts()

1    1050
0    1017
Name: vegan_label, dtype: int64

In [13]:
#let's put all other data in a df
other_data_df = pd.DataFrame(zip(subred_name,author_lst,ups_lst,
                                 downs_lst,num_comments_lst,id_lst),
                            columns=['subred_name','author','upvotes',
                                     'downvotes','num_comments','post_id'])
other_data_df.shape

(2067, 6)

In [14]:
#let's see it all together
all_data_df = pd.concat([text_df, other_data_df], axis=1)

#double check that we can concat properly
print(all_data_df['vegan_label'].value_counts())
print(all_data_df['subred_name'].value_counts())

1    1050
0    1017
Name: vegan_label, dtype: int64
r/vegan    1050
r/keto     1017
Name: subred_name, dtype: int64


In [15]:
all_data_df.head(3)

|   | text | vegan_label | subred_name | author | upvotes | downvotes | num_comments | post_id |
|---|---|---|---|---|---|---|---|---|
| 0 | vegan hacktivists looking developers ui design... | 1 | r/vegan | veganactivismbot | 76 | 0 | 0 | f3svif |
| 1 | last words fellow vegan elijah mcclain murdere... | 1 | r/vegan | VenmoMeFiveBucks | 5114 | 0 | 409 | hf6eej |
| 2 | promising future think | 1 | r/vegan | The_Shorey | 2709 | 0 | 151 | hfkmlc |


#### Drop Downvotes Column

In [16]:
#since the downvotes column carries no information (the rows above all show 0),
#we will drop the column
all_data_df.drop(columns='downvotes',inplace=True)
all_data_df.shape

(2067, 7)

#### Create word_count column

We will be using this for EDA.

In [17]:
#create a feature for word count per post
all_data_df['word_count']=all_data_df['text'].apply(lambda x: len(x.split(" ")))

#check that the column data is accurate
all_data_df.loc[0,'word_count']

184
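
One detail of the word count above is worth illustrating: `split(" ")` and `split()` behave differently on empty strings and on repeated spaces. The sample strings below are invented for this sketch.

```python
# Invented samples: normal text, an empty string, and text with a double space
samples = ["vegan hacktivists looking developers", "", "keto  macros"]

counts = []
for text in samples:
    by_space = len(text.split(" "))   # splits on every single space character
    by_whitespace = len(text.split()) # splits on runs of whitespace, drops empties
    counts.append((by_space, by_whitespace))
    print(repr(text), by_space, by_whitespace)
```

With `split(" ")`, an empty string still yields one (empty) token, so a post with no text is counted as having one word. Since the cleaned text is joined with single spaces, the two methods agree for every non-empty post.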

#### Check for null values

In [18]:
all_data_df.isnull().sum()
#seems like there aren't any null values

text            0
vegan_label     0
subred_name     0
author          0
upvotes         0
num_comments    0
post_id         0
word_count      0
dtype: int64

In [19]:
#let's take a look at empty strings, just in case
all_data_df[all_data_df['text']==""]

|   | text | vegan_label | subred_name | author | upvotes | num_comments | post_id | word_count |
|---|---|---|---|---|---|---|---|---|
| 794 |  | 1 | r/vegan | 48151_62342 | 7 | 0 | hc8dgw | 1 |


In [20]:
#looks like there is a row with an empty text field
#since there is no way that we can accurately impute text data,
#we will be dropping that row

#we select the rows whose text is an empty string and drop them by index
all_data_df.drop(all_data_df.index[all_data_df['text']==""],axis=0,inplace=True)
all_data_df.shape

(2066, 8)

Now that we have no more null values or empty text fields, let's check for moderator posts and duplicates, which regularly appear when scraping Reddit data.

#### Check for moderator bot

In [21]:
#since moderator posts are not actual input from our 'survey group' and 
#may skew our data, we consider them to be outliers
#thus, we will be removing them
all_data_df[all_data_df['author']=='AutoModerator'].count()

#we observe 18 posts from AutoModerator and will be dropping these rows

text            18
vegan_label     18
subred_name     18
author          18
upvotes         18
num_comments    18
post_id         18
word_count      18
dtype: int64

#### Check for posts by deleted users

In [22]:
all_data_df[all_data_df['author']=='[deleted]'].count()

#we observe 4 posts whose author account was deleted, but we will leave them in
#the text should still be valid even though the account was deleted

text            4
vegan_label     4
subred_name     4
author          4
upvotes         4
num_comments    4
post_id         4
word_count      4
dtype: int64

In [23]:
all_data_df = all_data_df[all_data_df['author'] != 'AutoModerator']
all_data_df.shape
#dropped 18 rows

(2048, 8)

#### Check for duplicates

In [24]:
len(all_data_df['text'].unique())

1644

In [25]:
#keep the first occurrence of each text; reassign instead of modifying a filtered frame in place
all_data_df = all_data_df.drop_duplicates(subset=['text'])
all_data_df.shape

(1644, 8)
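
The shapes above imply that 2048 - 1644 = 404 rows were removed as duplicates. As a rough illustration of the keep-first behaviour that `drop_duplicates` uses by default, here is a plain-Python sketch. `drop_duplicate_texts` and the sample rows are hypothetical and exist only for this example.

```python
def drop_duplicate_texts(rows):
    """Keep the first row for each distinct text and drop later repeats."""
    seen = set()
    unique_rows = []
    for row in rows:
        if row['text'] not in seen:
            seen.add(row['text'])
            unique_rows.append(row)
    return unique_rows


# Invented rows: the third repeats the text of the first
rows = [
    {'post_id': 'p1', 'text': 'keto macros help'},
    {'post_id': 'p2', 'text': 'vegan recipe'},
    {'post_id': 'p3', 'text': 'keto macros help'},
]

kept_ids = [row['post_id'] for row in drop_duplicate_texts(rows)]
print(kept_ids)
```

Deduplicating on the text alone means two different posts with identical cleaned text are treated as one, which suits a classifier that only looks at the text.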

In [26]:
print(all_data_df['vegan_label'].value_counts(normalize=True))

1    0.587591
0    0.412409
Name: vegan_label, dtype: float64


#### Observation of text data split

From the value_counts, we can see that after cleaning, our text data is split roughly 59/41 between the vegan and keto subreddits. This will be important when we are performing EDA and comparing the 2 subreddits.
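
The normalized shares can be turned back into row counts, and into the accuracy a classifier would reach by always predicting the larger class. The sketch below uses only the figures printed above (1644 rows, a vegan share of 0.587591). The counts it derives are computed here and are not output from the notebook.

```python
total_posts = 1644       # rows left after dropping duplicates
vegan_share = 0.587591   # share of label 1 from value_counts(normalize=True)

vegan_count = round(total_posts * vegan_share)
keto_count = total_posts - vegan_count

# accuracy of a naive model that always predicts the larger class
majority_baseline = max(vegan_count, keto_count) / total_posts

print(vegan_count, keto_count, round(majority_baseline, 3))
```

Any model trained on this data should be judged against that baseline of about 59% rather than against 50%.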

### Save Data to CSV

We will save our dataframes to csv files for ease of access in the EDA phase. 

Here is a list of the csv files, each followed by its description:
1. vegan_clean -> clean text data and label from the vegan subreddit, before dropping null and duplicates
2. keto_clean -> clean text data and label from the keto subreddit, before dropping null and duplicates
3. data_clean -> all relevant data of posts from both subreddits, after dropping null and duplicates

In [27]:
vegan_df.to_csv('../datasets/vegan_clean.csv',index=False)
keto_df.to_csv('../datasets/keto_clean.csv',index=False)
all_data_df.to_csv('../datasets/data_clean.csv',index=False)

#### Next Steps:
Having scraped and cleaned our data, we will now proceed to explore it in the EDA notebook.

- [EDA](./book2_eda.ipynb)