<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3 Web APIs & Classification
_Author: Li Jiansheng_

---


## Problem Statement
As data scientists at Android Inc., we want to design and develop a phone that users prefer over our main competitor, the iPhone. As a first step, we collect Reddit posts about the two types of phone and train a classifier to label each post as Android or iPhone, before going on to explore more reviews.

## Executive Summary

### Contents:
- [Android Posts Data Import](reddit-android-data-collection.ipynb)
- [Iphone Posts Data Import](reddit-iphone-data-collection.ipynb)
- [Exploratory Data Analysis](#Exploratory-Data-Analysis)
- [Visualise Categorical Data](#Visualise-Categorical-Data)
- [Prepare Test Set](#Prepare-test-set)
- [Modelling](#Modelling)
- [Summary Analysis](#Summary-Analysis)
- [Conclusions and Recommendations](#Conclusions-and-Recommendations)

### 1. Web Scraping
Posts were scraped from two subreddits, Android and iPhone, in separate notebooks (linked in the contents above).
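
The collection notebooks are not reproduced here. As a rough sketch only, assuming the posts came from Reddit's listing JSON (where each post sits under `data.children[i].data`), the snippet below flattens one such payload into rows and skips repeated titles. The `listing_to_rows` helper and the `sample_listing` payload are hypothetical and made up for illustration; they are not taken from the collection notebooks.

```python
def listing_to_rows(payload, seen_titles=None):
    """Flatten a Reddit-style listing payload into a list of post dicts.

    Assumes the listing layout payload["data"]["children"], where each
    child is {"kind": ..., "data": {...post fields...}}.
    """
    seen = set() if seen_titles is None else seen_titles
    rows = []
    for child in payload["data"]["children"]:
        post = child["data"]
        title = post.get("title")
        if title in seen:
            continue  # skip posts whose title was already collected
        seen.add(title)
        rows.append({
            "subreddit": post.get("subreddit"),
            "title": title,
            "selftext": post.get("selftext", ""),
        })
    return rows


# Made-up payload: three posts, two of which share a title.
sample_listing = {
    "kind": "Listing",
    "data": {
        "after": None,
        "children": [
            {"kind": "t3", "data": {"subreddit": "Android", "title": "Post A", "selftext": ""}},
            {"kind": "t3", "data": {"subreddit": "Android", "title": "Post B", "selftext": "body"}},
            {"kind": "t3", "data": {"subreddit": "Android", "title": "Post A", "selftext": ""}},
        ],
    },
}

rows = listing_to_rows(sample_listing)
print(len(rows))  # 2
```

Passing the same `seen_titles` set across several pages would keep titles unique over a whole scrape, which mirrors the `drop_duplicates(subset='title')` step applied later in this notebook.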

### 2. Exploratory Data Analysis

In [1]:
import pandas as pd
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import re
from bs4 import BeautifulSoup 
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV,cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.naive_bayes import MultinomialNB


In [2]:
android_df = pd.read_csv('./datasets/android.csv')

android_df.drop_duplicates(subset='title', keep='first', inplace=True)
len(android_df)


786

In [3]:
iphone_df = pd.read_csv('./datasets/iphone.csv')
iphone_df.drop_duplicates(subset='title', keep='first', inplace=True)
len(iphone_df)

532

In [4]:
android_df.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,created_utc,num_crossposts,media,is_video,post_hint,preview,crosspost_parent_list,crosspost_parent,link_flair_template_id,author_cakeday
0,,Android,"Note 1. Join us at /r/MoronicMondayAndroid, a ...",t2_6l4z3,False,,0,False,Moronic Monday (Jan 20 2020) - Your weekly que...,[],...,1579519000.0,0,,False,,,,,,
1,,Android,Device reviews are everywhere these days. From...,t2_p7o61,False,,0,False,/r/android reviews: LG line,[],...,1579374000.0,0,,False,,,,,,
2,,Android,,t2_kfy6p,False,,0,False,Samsung Galaxy S20 release in France (and worl...,[],...,1579615000.0,0,,False,link,{'images': [{'source': {'url': 'https://extern...,,,,
3,,Android,,t2_2ja6dymo,False,,0,False,Good Lock 2020 with Android 10 support will be...,[],...,1579623000.0,0,,False,link,{'images': [{'source': {'url': 'https://extern...,,,,
4,,Android,,t2_tamwpg9,False,,0,False,Wine 5.0 Released - run some Windows programs ...,[],...,1579636000.0,0,,False,link,{'images': [{'source': {'url': 'https://extern...,,,,


In [5]:
iphone_df.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,subreddit_subscribers,created_utc,num_crossposts,media,is_video,crosspost_parent_list,crosspost_parent,author_cakeday,link_flair_template_id,media_metadata
0,,iphone,Welcome to the Daily Tech Support thread for /...,t2_6l4z3,False,,0,False,Daily Tech Support Thread - [January 22],[],...,2079484,1579666000.0,0,,False,,,,,
1,,iphone,Welcome to the weekly stickied WSIB thread. \n...,t2_6l4z3,False,,0,False,Weekly What Should I Buy Thread - [January 17],[],...,2079484,1579252000.0,0,,False,,,,,
2,,iphone,,t2_207e6v0r,False,,0,False,Bloomberg: New low-cost iPhone entering produc...,[],...,2079484,1579652000.0,0,,False,,,,,
3,,iphone,,t2_49fgnvx,False,,0,False,Exclusive: Apple dropped plan for encrypting b...,[],...,2079484,1579612000.0,0,,False,,,,,
4,,iphone,,t2_aaa3ane,False,,0,False,Low-cost iPhone to enter production in Februar...,[],...,2079484,1579654000.0,0,,False,,,,,


We will analyse the text in the `selftext` and `title` columns, with `subreddit` as the classification target. First, we combine the two dataframes.

In [6]:
mobile_df = pd.concat([android_df,iphone_df])

#check for na in 'title' and 'selftext'
mobile_df['title'].isnull().sum()


0

In [7]:
mobile_df['selftext'].isnull().sum()

686

In [8]:
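# Posts without body text get the placeholder 'none', which is later removed as a stop word.
mobile_df['selftext'] = mobile_df['selftext'].fillna('none')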

mobile_df['content']=mobile_df['title'] +' '+ mobile_df['selftext']

In [9]:
mobile_df['subreddit'] = mobile_df['subreddit'].map({'Android':1,'iphone':0})
mobile_df.head()

Unnamed: 0,all_awardings,allow_live_comments,approved_at_utc,approved_by,archived,author,author_cakeday,author_flair_background_color,author_flair_css_class,author_flair_richtext,...,title,total_awards_received,ups,url,user_reports,view_count,visited,whitelist_status,wls,content
0,[],False,,,False,AutoModerator,,,robot,[],...,Moronic Monday (Jan 20 2020) - Your weekly que...,0,15,https://www.reddit.com/r/Android/comments/erby...,[],,False,all_ads,6,Moronic Monday (Jan 20 2020) - Your weekly que...
1,[],False,,,False,curated_android,,,,[],...,/r/android reviews: LG line,0,86,https://www.reddit.com/r/Android/comments/eqki...,[],,False,all_ads,6,/r/android reviews: LG line Device reviews are...
2,[],False,,,False,CliveLH,,,,[],...,Samsung Galaxy S20 release in France (and worl...,0,1219,https://www.frandroid.com/marques/samsung/6615...,[],,False,all_ads,6,Samsung Galaxy S20 release in France (and worl...
3,[],True,,,False,ihjao,,,,[],...,Good Lock 2020 with Android 10 support will be...,0,680,https://www.sammobile.com/news/good-lock-2020-...,[],,False,all_ads,6,Good Lock 2020 with Android 10 support will be...
4,[],False,,,False,merrycachemiss,,,,[],...,Wine 5.0 Released - run some Windows programs ...,0,340,https://www.winehq.org/news/2020012101,[],,False,all_ads,6,Wine 5.0 Released - run some Windows programs ...


In [10]:
mobile_df['subreddit']

0      1
1      1
2      1
3      1
4      1
      ..
527    0
528    0
529    0
530    0
531    0
Name: subreddit, Length: 1318, dtype: int64

In [11]:
mobile_df['subreddit'].value_counts(normalize=True)

1    0.596358
0    0.403642
Name: subreddit, dtype: float64
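
The class proportions above also give the baseline any model has to beat: always predicting the majority class (Android) would already be right about 60% of the time. A small worked check, using only the post counts printed earlier in this notebook:

```python
# Post counts after de-duplication, taken from the outputs above.
n_android = 786
n_iphone = 532

total = n_android + n_iphone
# Accuracy of a classifier that always predicts the larger class.
baseline_accuracy = max(n_android, n_iphone) / total

print(total)                        # 1318
print(round(baseline_accuracy, 6))  # 0.596358
```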

In [12]:
mobile_df=mobile_df.loc[:,('subreddit','content')]
mobile_df.head()

Unnamed: 0,subreddit,content
0,1,Moronic Monday (Jan 20 2020) - Your weekly que...
1,1,/r/android reviews: LG line Device reviews are...
2,1,Samsung Galaxy S20 release in France (and worl...
3,1,Good Lock 2020 with Android 10 support will be...
4,1,Wine 5.0 Released - run some Windows programs ...


In [13]:
X_train, X_test, y_train, y_test = train_test_split(mobile_df[['content']],
                                                    mobile_df['subreddit'],
                                                    test_size = 0.25,
                                                    random_state = 42)

In [14]:
X_train.shape

(988, 1)
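
The 988 training rows follow from the split arithmetic. As far as we understand scikit-learn's behaviour, a fractional `test_size` is rounded up to a whole number of rows and the remainder goes to training; the sketch below reproduces that calculation under this assumption.

```python
import math

n_samples = 1318   # rows in mobile_df
test_size = 0.25

n_test = math.ceil(test_size * n_samples)  # 329.5 is rounded up to 330
n_train = n_samples - n_test

print(n_train, n_test)  # 988 330
```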

In [15]:
def review_to_words(raw_content):
    # Function to convert a raw post (title + selftext) to a string of words.
    # The input is a single string (a raw post), and
    # the output is a single string (a preprocessed post).
    
    # 1. Remove HTML. Naming the parser explicitly avoids bs4's parser warning.
    content_text = BeautifulSoup(raw_content, 'html.parser').get_text()
    
    # 2. Remove non-letters.
    letters_only = re.sub("[^a-zA-Z]", " ", content_text)
    
    # 3. Convert to lower case, split into individual words.
    words = letters_only.lower().split()
    
    # 4. In Python, searching a set is much faster than searching
    # a list, so convert the stop words to a set. The extra entries drop
    # the 'none' placeholder, the subreddit names and URL fragments.
    stops = stopwords.words('english')
    stops.extend(['none','iphone','android','mobile','device','ios','\n', 'www', 'reddit', 'com', 'comment', 'http'])
    stops = set(stops)
  
    # 5. Remove stop words.
    meaningful_words = [w for w in words if w not in stops]
    
    # 6. Lemmatize each remaining word.
    lemmatizer = WordNetLemmatizer()
    tokens_lem = [lemmatizer.lemmatize(i) for i in meaningful_words]
    
    # 7. Join the words back into one string separated by space, 
    # and return the result.
    return " ".join(tokens_lem)
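
To see what the cleaning does to a single post, here is a simplified, standard-library-only stand-in for the function above. It skips the HTML-stripping and lemmatisation steps and uses a tiny stop list (`STOPS`, a hypothetical subset of the real one), so it is an illustration rather than the function used in this notebook. The sample sentence is made up.

```python
import re

# Tiny illustrative stop list: a few common English words plus the
# domain-specific words that the notebook adds to NLTK's list.
STOPS = {"the", "is", "a", "on", "my", "and", "none", "iphone", "android",
         "mobile", "device", "ios", "www", "reddit", "com", "comment", "http"}


def simple_clean(raw_text):
    # Keep letters only, lower-case, split on whitespace, drop stop words.
    letters_only = re.sub("[^a-zA-Z]", " ", raw_text)
    words = letters_only.lower().split()
    return " ".join(w for w in words if w not in STOPS)


cleaned = simple_clean("The battery on my iPhone 11 is great! http://www.reddit.com none")
print(cleaned)  # battery great
```

Removing "iphone" and "android" matters here: if the subreddit names stayed in the text, the classifier could pick the label from them directly instead of learning from the rest of the vocabulary.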

In [16]:
# Get the number of posts based on the dataframe size.
total_content = mobile_df.shape[0]
print(f'There are {total_content} content.')

# Initialize empty lists to hold the cleaned posts.
clean_train_content = []
clean_test_content = []

There are 1318 content.


In [17]:
print("Cleaning and parsing the training set content...")

for train_content in X_train['content']:
    # Clean each training post, then append it to clean_train_content.
    clean_train_content.append(review_to_words(train_content))
    
print("Cleaning and parsing the testing set content...")

for test_content in X_test['content']:
    # Clean each test post, then append it to clean_test_content.
    clean_test_content.append(review_to_words(test_content))
    

Cleaning and parsing the training set content...
Cleaning and parsing the testing set content...


In [18]:
X_train['content']=clean_train_content
X_test['content']=clean_test_content

In [19]:
pipe = Pipeline([('cvec', CountVectorizer()),
                 ('lr', LogisticRegression())
                ])

In [20]:
cross_val_score(pipe, clean_train_content, y_train, cv=5)



array([0.88442211, 0.84343434, 0.79695431, 0.85786802, 0.8680203 ])
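
Five fold scores are easier to compare across models once reduced to a mean and a spread. A quick summary of the logistic regression scores printed above, using only the standard library:

```python
from statistics import mean, stdev

# Fold accuracies copied from the cross_val_score output above.
lr_cv_scores = [0.88442211, 0.84343434, 0.79695431, 0.85786802, 0.8680203]

cv_mean = mean(lr_cv_scores)
cv_std = stdev(lr_cv_scores)  # sample standard deviation across folds

print(f"{cv_mean:.3f}")  # 0.850
print(f"{cv_std:.3f}")
```

The mean of about 0.85 sits well above the 0.60 share of the majority class, while the fold-to-fold spread of a few percentage points suggests that small differences between candidate models should be read with caution.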

In [21]:
# ii. Fit into model
pipe.fit(clean_train_content, y_train)

# Training score
print(pipe.score(clean_train_content, y_train))

# Test score
print(pipe.score(clean_test_content, y_test))

0.9939271255060729
0.8727272727272727




In [22]:
pipe_params = {
    'cvec__max_features': [2500, 3000, 3500],
    'cvec__min_df': [1, 2],
    'cvec__max_df': [.8,.9, .95],
    'cvec__ngram_range': [(1,1), (1,2)]
}
gs = GridSearchCV(pipe, param_grid=pipe_params, cv=5)
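
It helps to know how much work this grid implies before fitting it. The sketch below counts the parameter combinations and the number of cross-validated fits; the dictionary is copied from the cell above, and the count excludes the single extra refit on the full training set that `GridSearchCV` performs by default.

```python
from itertools import product

pipe_params = {
    'cvec__max_features': [2500, 3000, 3500],
    'cvec__min_df': [1, 2],
    'cvec__max_df': [.8, .9, .95],
    'cvec__ngram_range': [(1, 1), (1, 2)],
}
cv_folds = 5

# Every combination of one value per parameter is one candidate model.
n_candidates = len(list(product(*pipe_params.values())))
n_fits = n_candidates * cv_folds

print(n_candidates)  # 36
print(n_fits)        # 180
```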

In [23]:
%time gs.fit(clean_train_content, y_train)
print(gs.best_score_)
gs.best_params_











CPU times: user 20 s, sys: 264 ms, total: 20.2 s
Wall time: 20.2 s
0.8593117408906883




{'cvec__max_df': 0.8,
 'cvec__max_features': 3500,
 'cvec__min_df': 2,
 'cvec__ngram_range': (1, 2)}

In [24]:
gs.score(clean_train_content, y_train)

0.9908906882591093

In [25]:
gs.score(clean_test_content, y_test)

0.8727272727272727

In [26]:
lr_predictions=gs.predict(clean_test_content)

In [27]:
lr_cm=confusion_matrix(y_test, lr_predictions)

In [28]:
lr_cm_df = pd.DataFrame(lr_cm, columns=['pred Iphone', 'pred Android'], index=['actual Iphone', 'actual Android'])
lr_cm_df

Unnamed: 0,pred Iphone,pred Android
actual Iphone,108,23
actual Android,19,180


In [29]:
tn, fp, fn, tp = confusion_matrix(y_test, lr_predictions).ravel()
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

True Negatives: 108
False Positives: 23
False Negatives: 19
True Positives: 180


In [30]:
print(classification_report(y_test, lr_predictions))

              precision    recall  f1-score   support

           0       0.85      0.82      0.84       131
           1       0.89      0.90      0.90       199

    accuracy                           0.87       330
   macro avg       0.87      0.86      0.87       330
weighted avg       0.87      0.87      0.87       330
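
The class-1 (Android) row of the report can be reproduced by hand from the four confusion-matrix counts printed earlier, which makes the definitions concrete:

```python
# Counts for the tuned logistic regression, from the output above.
tn, fp, fn, tp = 108, 23, 19, 180

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all posts labelled correctly
precision = tp / (tp + fp)                  # of posts predicted Android, share that are Android
recall = tp / (tp + fn)                     # of actual Android posts, share that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"{accuracy:.2f} {precision:.2f} {recall:.2f} {f1:.2f}")  # 0.87 0.89 0.90 0.90
```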



In [31]:
pipe2 = Pipeline([('cvec', CountVectorizer()),
                 ('nb', MultinomialNB())
                ])

In [32]:
cross_val_score(pipe2, clean_train_content, y_train, cv=5)

array([0.74874372, 0.6969697 , 0.77664975, 0.72588832, 0.76649746])

In [33]:
# ii. Fit into model
pipe2.fit(clean_train_content, y_train)

# Training score
print(pipe2.score(clean_train_content, y_train))

# Test score
print(pipe2.score(clean_test_content, y_test))

0.8856275303643725
0.7727272727272727


In [34]:
pipe_params2 = {
    'cvec__max_features': [2500, 3000, 3500],
    'cvec__min_df': [1, 2],
    'cvec__max_df': [.8,.9, .95],
    'cvec__ngram_range': [(1,1), (1,2)]
}
gs_nb = GridSearchCV(pipe2, param_grid=pipe_params2, cv=5)

In [35]:
%time gs_nb.fit(clean_train_content, y_train)
print(gs_nb.best_score_)
gs_nb.best_params_

0.742914979757085


{'cvec__max_df': 0.8,
 'cvec__max_features': 3000,
 'cvec__min_df': 1,
 'cvec__ngram_range': (1, 1)}

In [36]:
gs_nb.score(clean_train_content,y_train)

0.8572874493927125

In [37]:
gs_nb.score(clean_test_content,y_test)

0.7818181818181819

In [38]:
nb_predictions=gs_nb.predict(clean_test_content)

In [39]:
nb_cm=confusion_matrix(y_test, nb_predictions)

In [40]:
nb_cm_df = pd.DataFrame(nb_cm, columns=['pred Iphone', 'pred Android'], index=['actual Iphone', 'actual Android'])
nb_cm_df

Unnamed: 0,pred Iphone,pred Android
actual Iphone,64,67
actual Android,5,194


In [41]:
tn, fp, fn, tp = confusion_matrix(y_test, nb_predictions).ravel()

In [42]:
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

True Negatives: 64
False Positives: 67
False Negatives: 5
True Positives: 194
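
Overall accuracy hides how differently the two models treat each class. As a supplementary check (balanced accuracy is not computed elsewhere in this notebook), the sketch below derives per-class recall and their average from the two sets of confusion-matrix counts printed above. The `per_class_recall` helper is a hypothetical name introduced for this illustration.

```python
def per_class_recall(tn, fp, fn, tp):
    """Return (iPhone recall, Android recall, balanced accuracy)."""
    recall_iphone = tn / (tn + fp)    # class 0
    recall_android = tp / (tp + fn)   # class 1
    return recall_iphone, recall_android, (recall_iphone + recall_android) / 2


lr_stats = per_class_recall(108, 23, 19, 180)  # tuned logistic regression
nb_stats = per_class_recall(64, 67, 5, 194)    # tuned multinomial naive Bayes

print(" ".join(f"{v:.2f}" for v in lr_stats))  # 0.82 0.90 0.86
print(" ".join(f"{v:.2f}" for v in nb_stats))  # 0.49 0.97 0.73
```

The naive Bayes model finds almost every Android post, but it labels about half of the iPhone posts as Android, whereas logistic regression is far more even across the two classes.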


In [62]:
def compareStats(m1, m2, xtrain, ytrain, xtest, ytest):
    # Fit each model, then collect test-set recall, precision and accuracy.
    stats = []
    for model in (m1, m2):
        model.fit(xtrain, ytrain)
        predictions = model.predict(xtest)
        tn, fp, fn, tp = confusion_matrix(ytest, predictions).ravel()
        recall = round(float(tp / (tp + fn)), 2)
        precision = round(float(tp / (tp + fp)), 2)
        score = round(float(model.score(xtest, ytest)), 2)
        stats.append((recall, precision, score))

    # Print one row per statistic, one column per model.
    print("Stats     | Logistic Regression | Multinomial NB |")
    print("=" * 50)
    for name, idx in (("Recall", 0), ("Precision", 1), ("Score", 2)):
        print(f"{name:<9} | {stats[0][idx]:<19} | {stats[1][idx]:<14} |")

compareStats(pipe, pipe2, clean_train_content, y_train, clean_test_content, y_test)

Stats     | Logistic Regression | Multinomial NB |
==================================================
Recall    | 0.9                 | 0.97           |
Precision | 0.89                | 0.74           |
Score     | 0.87                | 0.77           |