# Project 3 - Book 1: Web APIs & Classification

## Problem Statement

Nutrino is a leading provider of nutrition related data services and analytics. Nutrino works with businesses and professionals to improve the success of their products and programs, better understand populations and eating patterns, and identify new areas of opportunity. As part of the data science team, we are tasked to generate business ready data to support the company. To do so, we will be utilising classification models such as Logistic Regression and Naive Bayes to uncover patterns within 2 popular diets, Keto and Vegan. We hope to reveal and identify previously unrecognised sub-trends that pertains to attitudes, lifestyles and buying behavior, to allow our customers to identify strong sub trends as opposed to passing sub trends.

## Executive Summary

We have been tasked to predict housing prices so as to generate actionable insights for the organisation to achieve larger margins in their investment strategy. In order to achieve our goals, we will be performing data cleaning, feature engineering, EDA, feature selection and lastly several regression models to predict sale prices. Based on an accuracy score, the best model will be evaluated and chosen to predict sale prices. Having mirrored the market, we can then find out which are the strong predictors of sale prices. With this information, the company is able to locate properties with the favoured features and flip them for profit, generating value for the management, shareholders and of course customers. 

## Notebooks:
- [Data Scrapping and Cleaning](./book1_data_scrapping_cleaning.ipynb)
- [EDA and Feature Selection](./book2_eda_feature_selection.ipynb)
- [Preprocessing, Modeling and Recommendations](./book3_preprocesing_modeling_recommendations.ipynb)

## Contents:
- [Import Libraries](#Import-Libraries)
- [Data Scrapping](#Data-Scrapping)
- [Data Cleaning](#Data-Cleaning)
- [Feature Engineering](#Feature-Engineering)
- [Save Data to CSV](#Save-Data-to-CSV)

### Import Libraries

In [2]:
import requests
import time
import random
import pandas as pd

### Data Scrapping

In [5]:
#get request
#res = requests.get(url,headers=header)
#check status code
#res.status_code

200

In [5]:
#set header so that reddit wont think we are a bot and block us
header = {'User-agent':'ididitforthemulz'}

#list of subreddits that we want to scrap
sub_reds = ['vegan','keto']

In [8]:
%%time

#scrap vegan and keto posts 
all_posts = []
after = None

for sub in sub_reds:
    url = f'https://old.reddit.com/r/{sub}.json'
    posts = []
    
    for a in range(42):
        print(a)
        if after == None:
            current_url = url
        else:
            current_url = url + '?after=' + after
        print(current_url)
        res = requests.get(current_url, headers=header)

        if res.status_code != 200:
            print('Status error', res.status_code)
            break

        current_dict = res.json()
        current_posts = [p['data'] for p in current_dict['data']['children']]
        posts.extend(current_posts)
        after = current_dict['data']['after']

        # generate a random sleep duration to look more 'natural'
        sleep_duration = random.randint(3,50)
        print(sleep_duration)
        time.sleep(sleep_duration)
        
    all_posts.append(posts)

0
https://old.reddit.com/r/vegan.json
30
1
https://old.reddit.com/r/vegan.json?after=t3_hcyr7b
15
2
https://old.reddit.com/r/vegan.json?after=t3_hd6zba
34
3
https://old.reddit.com/r/vegan.json?after=t3_hcmw2v
20
4
https://old.reddit.com/r/vegan.json?after=t3_hcytnj
11
5
https://old.reddit.com/r/vegan.json?after=t3_hcxvy4
18
6
https://old.reddit.com/r/vegan.json?after=t3_hcei9g
44
7
https://old.reddit.com/r/vegan.json?after=t3_hcbj7v
29
8
https://old.reddit.com/r/vegan.json?after=t3_hc3gyi
30
9
https://old.reddit.com/r/vegan.json?after=t3_hbtswx
44
10
https://old.reddit.com/r/vegan.json?after=t3_hbsn5n
15
11
https://old.reddit.com/r/vegan.json?after=t3_hc3jet
33
12
https://old.reddit.com/r/vegan.json?after=t3_hc33am
35
13
https://old.reddit.com/r/vegan.json?after=t3_hbixs0
46
14
https://old.reddit.com/r/vegan.json?after=t3_hascvj
36
15
https://old.reddit.com/r/vegan.json?after=t3_hbngp0
19
16
https://old.reddit.com/r/vegan.json?after=t3_hbmb0z
14
17
https://old.reddit.com/r/vegan.json?a

In [9]:
#check to find title of scrapped json data
res.json()['data']['children'][0]['data']['title']

'Is eating masala mixtures ok for a keto diet.'

In [10]:
#check to find text of scrapped json data
res.json()['data']['children'][0]['data']['selftext']

'Living in an Indian household, a lot of our foods have masalas ( coriander powder, chilli powder, turmeric, cumin powder, coconut seeds etc) in them mixed in with certain meats. \n\nI wanted to know if eating that would affect my diet.'

In [41]:
res.json()['data']['children'][0]['data']['author']

'jazzjj5864'

In [11]:
res.json()['data']['children'][0]['data'].keys()

dict_keys(['approved_at_utc', 'subreddit', 'selftext', 'author_fullname', 'saved', 'mod_reason_title', 'gilded', 'clicked', 'title', 'link_flair_richtext', 'subreddit_name_prefixed', 'hidden', 'pwls', 'link_flair_css_class', 'downs', 'thumbnail_height', 'top_awarded_type', 'hide_score', 'name', 'quarantine', 'link_flair_text_color', 'upvote_ratio', 'author_flair_background_color', 'subreddit_type', 'ups', 'total_awards_received', 'media_embed', 'thumbnail_width', 'author_flair_template_id', 'is_original_content', 'user_reports', 'secure_media', 'is_reddit_media_domain', 'is_meta', 'category', 'secure_media_embed', 'link_flair_text', 'can_mod_post', 'score', 'approved_by', 'author_premium', 'thumbnail', 'edited', 'author_flair_css_class', 'author_flair_richtext', 'gildings', 'content_categories', 'is_self', 'mod_note', 'created', 'link_flair_type', 'wls', 'removed_by_category', 'banned_by', 'author_flair_type', 'domain', 'allow_live_comments', 'selftext_html', 'likes', 'suggested_sort',

In [42]:
text_lst = []
title_lst = []
subred_name_lst = []
ups_lst = []
downs_lst = []
num_comments_lst = []
id_lst = []
author_lst = []

for sub_red in all_posts:
    for i in range(len(sub_red)):
        text_lst.append(sub_red[i]['selftext'])
        title_lst.append(sub_red[i]['title'])
        subred_name_lst.append(sub_red[i]['subreddit_name_prefixed'])
        ups_lst.append(sub_red[i]['ups'])
        downs_lst.append(sub_red[i]['downs'])
        num_comments_lst.append(sub_red[i]['num_comments'])
        id_lst.append(sub_red[i]['id'])
        author_lst.append(sub_red[i]['author'])

In [45]:
all_posts_df = pd.DataFrame(zip(text_lst,title_lst,subred_name_lst,
                                ups_lst,downs_lst,num_comments_lst,
                                id_lst,author_lst),
                           columns=['text','title','label','ups','downs','num_comments','id','author'])

In [48]:
all_posts_df.head()

Unnamed: 0,text,title,label,ups,downs,num_comments,id,author
0,,Vegan Hacktivists are looking for professional...,r/vegan,208,0,1,feve92,veganactivismbot
1,,"Regan Russell, animal rights activist. She was...",r/vegan,5097,0,485,hca93z,nekkototoro
2,,Celebrating 24 years of veganism this month wi...,r/vegan,2665,0,120,hcvo0c,Esmeanne
3,My sister just visited our dad and extended fa...,"""Vegans are preachy and shove it in your face""",r/vegan,86,0,24,hd4exc,Elemor_
4,,Farmers: veganism is propaganda. Also farmers:...,r/vegan,3391,0,252,hcmghs,nekkototoro


In [47]:
all_posts_df.to_csv('../datasets/all_posts.csv')

In [52]:
#vegan_df=pd.DataFrame(zip(text_lst,title_lst),columns=['text','title'])
vegan_lst=pd.DataFrame(title_lst,columns=['title'])
vegan_df.head()

Unnamed: 0,title
0,Canada is about to pass a bill that would make...
1,Vegan Hacktivists are looking for professional...
2,"Regan Russell, animal rights activist. She was..."
3,She do be spitting straight facts tho.
4,When there's another post about deforestation ...


In [9]:
vegan_df.to_csv('vegan_posts.csv',index=False)

##### First observations:
- Now we have all our title data from the subreddit
- There are html syntax observed. We need to remove it.
- We will also convert the data to lowercase and remove stop words
- and finally, we will lemmatise the words instead of stemming which is a harsher method and could produce some non-words

Now that we have collected our data, we will move to cleaning the data. 

### Data Cleaning

#### Text Reprocessing

In [53]:
import nltk
import re
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

#define a method to preprocess data
def robust_text_preprocessing(text):
    #change to lower
    text=text.lower()
    
    #remove punctuation
    text=re.sub(r"[^A-Za-z0-9]"," ",text)
    
    #lemmatize
    lemmatizer=WordNetLemmatizer()
    text=lemmatizer.lemmatize(text)
    
    #remove stopwords
    words=text.split()
    text= [word for word in words if word not in stopwords.words('english')]
    text=(" ".join(text))
    
    return text

In [57]:
%%time

vegan_clean=[]
print("Cleaning and parsing the Vegan posts...")

j = 0
for str in vegan_df['title']:
    # Convert review to words, then append to clean_train_reviews.
    vegan_clean.append(robust_text_preprocessing(str))
    
    # If the index is divisible by 1000, print a message
    if (j + 1) % 100 == 0:
        print(f'Review {j + 1} of {vegan_df.shape[0]}.')
    
    j += 1

# Let's do the same for our Keto data.
# X_test_clean=[]
# print("Cleaning and parsing the Keto posts...")

# j = 0
# for str in X_test:
#     # Convert review to words, then append to clean_train_reviews.
#     X_test_clean.append(robust_text_preprocessing(str))
    
#     # If the index is divisible by 1000, print a message
#     if (j + 1) % 100 == 0:
#         print(f'Review {j + 1} of {len(X_train)}.')
    
#     j += 1

Cleaning and parsing the Vegan posts...
Review 100 of 999.
Review 200 of 999.
Review 300 of 999.
Review 400 of 999.
Review 500 of 999.
Review 600 of 999.
Review 700 of 999.
Review 800 of 999.
Review 900 of 999.
CPU times: user 1.4 s, sys: 288 ms, total: 1.69 s
Wall time: 1.69 s


In [59]:
vegan_clean[0]

'canada pass bill would make illegal report photograph otherwise expose animal abuse factory farms already know animals endure silences even worst cases torture abuse please sign petition'

In [None]:
vegan_label = [1 if name == "r/vegan" else 0 for name in subred_name ]
print(len(vegan_label))
print(vegan_label[:5])
print(vegan_label[-5:])#here we create a label for individual df for each subreddit
vegan_label = [1 for str in vegan_posts]
print(len(vegan_label))
print(vegan_label[:5])
keto_label = [0 for str in keto_posts]
print(len(keto_label))
print(keto_label[:5])

In [63]:
#check to see if we have properly cleaned the data
vegan_clean_df=pd.DataFrame(vegan_clean,columns=['title'])
#vegan_clean_df['vegan']=1
vegan_clean_df.head()

Unnamed: 0,title
0,canada pass bill would make illegal report pho...
1,vegan hacktivists looking professional develop...
2,regan russell animal rights activist killed st...
3,spitting straight facts tho
4,another post deforestation amazon front page


##### Decisions for null values

- For columns with extremely high numbers of nan such as alley, pool and misc_feature, we will be dropping these columns as the lack of data will be statistically insignificant and lead to dimensonality error/bias
- For mas_vnr_xxx/garage_xx/bsmt_xx, it seems like there are missing values because there is no such feature in the property in the first place. We will replace nan with 0 for these variables. 
- For variables such as garage_yr built, we will impute it with the year the house was built which is the norm for most houses with a garage
- Lastly, for continuous variables such as lot frontage, we will be imputing with mean values based on the lot shape and lot area


#### Dropping columns

#### Imput NaNs with 0

Now let's get to cleaning!

### Feature Engineering

Based on some external research, we have identified some factors that are comonnly known to affect the price of a property
- Location
- Property size
- House condition
- Macro Environment

3rd party research: https://resources.point.com/8-biggest-factors-affect-real-estate-prices/


#### Location, Location, Location

### Create Bag of Words

In order to facilitate our EDA process and allow the model to read the words, we will convert the our words into a document term matrix. We will do so with a CountVectorizer

In [None]:
# from sklearn.feature_extraction.text import CountVectorizer

In [None]:
#instantiate a CountVectorizer
# cvec = CountVectorizer()

#convert the data into a sparse matrix
# posts_cvec = cvec.fit_transform(all_data_df['text'])
# text_cvec_df = pd.DataFrame(posts_cvec.toarray(),columns=cvec.get_feature_names())
# text_cvec_df.shape

### Save Data to CSV

In [44]:
ames.to_csv('../datasets/combined_clean.csv',index=False)

#### Next Steps:
Having now gotten most of the features that we need to address the issues, we will now proceed to establishing a baseline for our model in the preprocessing and modeling notebook.

- [EDA and Feature Selection](./book2_eda_feature_selection.ipynb)

In [3]:
# for coding use only - DELETE BEFORE SUBMISSION
vegan_df=pd.read_csv('../datasets/vegan_clean.csv')
keto_df=pd.read_csv('../datasets/keto_clean.csv')
posts_df=pd.read_csv('../datasets/all_posts_clean.csv')

### Train Test Split

Next we split our data into train and test datasets

In [27]:
#first, we define our feature matrix and the target
X = posts_df['text']
y = posts_df['vegan_label']

In [28]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)