<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3 Web APIs & Classification
_Author: Li Jiansheng_

---


## Problem Statement
As data scientists at Android Inc., we want to design and develop a phone that users prefer over our main competitor, the iPhone. As a first step, we collect Reddit posts about the two types of phone and build a classifier that labels each post as an Android post or an iPhone post, before going further and exploring more reviews.

## Executive Summary

### Contents:
- [Android Posts Data Import](reddit-android-data-collection.ipynb)
- [iPhone Posts Data Import](reddit-iphone-data-collection.ipynb)
- [Exploratory Data Analysis](#Exploratory-Data-Analysis)
- [Visualise Categorical Data](#Visualise-Categorical-Data)
- [Prepare Test Set](#Prepare-test-set)
- [Modelling](#Modelling)
- [Summary Analysis](#Summary-Analysis)
- [Conclusions and Recommendations](#Conclusions-and-Recommendations)

### 1. Web Scraping
Web scraping of the two subreddits, Android and iPhone, was done in separate notebooks (see the data import links in the contents above).
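
The scraping notebooks are not reproduced here. As a rough sketch of the kind of step they would need, the snippet below pulls the fields used later (subreddit, title, selftext) out of one page of a Reddit JSON listing. It assumes the listing has the usual `data -> children -> data` layout with an `after` token for paging. The function name `posts_from_listing` and the sample page are hypothetical, and no request is made.

```python
def posts_from_listing(listing, fields=("subreddit", "title", "selftext")):
    """Return (rows, after) for one page of a Reddit JSON listing.

    Assumes the listing is a dict shaped like
    {"data": {"children": [{"data": {...post fields...}}], "after": token}}.
    """
    data = listing.get("data", {})
    rows = [
        {field: child.get("data", {}).get(field) for field in fields}
        for child in data.get("children", [])
    ]
    return rows, data.get("after")


# Hypothetical page, made up for illustration only.
sample_page = {
    "data": {
        "after": "t3_example",
        "children": [
            {"data": {"subreddit": "Android", "title": "Example post", "selftext": ""}},
            {"data": {"subreddit": "Android", "title": "Another post", "selftext": "Body"}},
        ],
    }
}

rows, after = posts_from_listing(sample_page)
print(len(rows), after)
```

Rows in this shape can be passed to `pd.DataFrame` and written out with `to_csv`, which is the form the CSV files loaded in the next section appear to have.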

### 2. Exploratory Data Analysis

In [1]:
import pandas as pd
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import re
from bs4 import BeautifulSoup 
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB


In [2]:
android_df = pd.read_csv('./datasets/android.csv')

android_df.drop_duplicates(subset='title', keep='first', inplace=True)
len(android_df)


786

In [3]:
iphone_df = pd.read_csv('./datasets/iphone.csv')
iphone_df.drop_duplicates(subset='title', keep='first', inplace=True)
len(iphone_df)

532

In [4]:
android_df.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,created_utc,num_crossposts,media,is_video,post_hint,preview,crosspost_parent_list,crosspost_parent,link_flair_template_id,author_cakeday
0,,Android,"Note 1. Join us at /r/MoronicMondayAndroid, a ...",t2_6l4z3,False,,0,False,Moronic Monday (Jan 20 2020) - Your weekly que...,[],...,1579519000.0,0,,False,,,,,,
1,,Android,Device reviews are everywhere these days. From...,t2_p7o61,False,,0,False,/r/android reviews: LG line,[],...,1579374000.0,0,,False,,,,,,
2,,Android,,t2_kfy6p,False,,0,False,Samsung Galaxy S20 release in France (and worl...,[],...,1579615000.0,0,,False,link,{'images': [{'source': {'url': 'https://extern...,,,,
3,,Android,,t2_2ja6dymo,False,,0,False,Good Lock 2020 with Android 10 support will be...,[],...,1579623000.0,0,,False,link,{'images': [{'source': {'url': 'https://extern...,,,,
4,,Android,,t2_tamwpg9,False,,0,False,Wine 5.0 Released - run some Windows programs ...,[],...,1579636000.0,0,,False,link,{'images': [{'source': {'url': 'https://extern...,,,,


In [5]:
iphone_df.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,subreddit_subscribers,created_utc,num_crossposts,media,is_video,crosspost_parent_list,crosspost_parent,author_cakeday,link_flair_template_id,media_metadata
0,,iphone,Welcome to the Daily Tech Support thread for /...,t2_6l4z3,False,,0,False,Daily Tech Support Thread - [January 22],[],...,2079484,1579666000.0,0,,False,,,,,
1,,iphone,Welcome to the weekly stickied WSIB thread. \n...,t2_6l4z3,False,,0,False,Weekly What Should I Buy Thread - [January 17],[],...,2079484,1579252000.0,0,,False,,,,,
2,,iphone,,t2_207e6v0r,False,,0,False,Bloomberg: New low-cost iPhone entering produc...,[],...,2079484,1579652000.0,0,,False,,,,,
3,,iphone,,t2_49fgnvx,False,,0,False,Exclusive: Apple dropped plan for encrypting b...,[],...,2079484,1579612000.0,0,,False,,,,,
4,,iphone,,t2_aaa3ane,False,,0,False,Low-cost iPhone to enter production in Februar...,[],...,2079484,1579654000.0,0,,False,,,,,


We will analyse the text in the selftext and title columns, with subreddit as the classification target. First, we combine the two dataframes.

In [6]:
mobile_df = pd.concat([android_df,iphone_df])

#check for na in 'title' and 'selftext'
mobile_df['title'].isnull().sum()

0

In [7]:
mobile_df['selftext'].isnull().sum()

686

In [8]:
mobile_df['selftext'].fillna('none', inplace=True)

mobile_df['content']=mobile_df['title'] +' '+ mobile_df['selftext']

In [9]:
mobile_df['subreddit'] = mobile_df['subreddit'].map({'iphone':0,'Android':1})
mobile_df.head()

Unnamed: 0,all_awardings,allow_live_comments,approved_at_utc,approved_by,archived,author,author_cakeday,author_flair_background_color,author_flair_css_class,author_flair_richtext,...,title,total_awards_received,ups,url,user_reports,view_count,visited,whitelist_status,wls,content
0,[],False,,,False,AutoModerator,,,robot,[],...,Moronic Monday (Jan 20 2020) - Your weekly que...,0,15,https://www.reddit.com/r/Android/comments/erby...,[],,False,all_ads,6,Moronic Monday (Jan 20 2020) - Your weekly que...
1,[],False,,,False,curated_android,,,,[],...,/r/android reviews: LG line,0,86,https://www.reddit.com/r/Android/comments/eqki...,[],,False,all_ads,6,/r/android reviews: LG line Device reviews are...
2,[],False,,,False,CliveLH,,,,[],...,Samsung Galaxy S20 release in France (and worl...,0,1219,https://www.frandroid.com/marques/samsung/6615...,[],,False,all_ads,6,Samsung Galaxy S20 release in France (and worl...
3,[],True,,,False,ihjao,,,,[],...,Good Lock 2020 with Android 10 support will be...,0,680,https://www.sammobile.com/news/good-lock-2020-...,[],,False,all_ads,6,Good Lock 2020 with Android 10 support will be...
4,[],False,,,False,merrycachemiss,,,,[],...,Wine 5.0 Released - run some Windows programs ...,0,340,https://www.winehq.org/news/2020012101,[],,False,all_ads,6,Wine 5.0 Released - run some Windows programs ...


In [10]:
mobile_df['subreddit']

0      1
1      1
2      1
3      1
4      1
      ..
527    0
528    0
529    0
530    0
531    0
Name: subreddit, Length: 1318, dtype: int64

In [11]:
mobile_df['subreddit'].value_counts(normalize=True)

1    0.596358
0    0.403642
Name: subreddit, dtype: float64
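
These proportions also set the bar that any classifier must beat: always predicting the larger class (Android, label 1) would already be right about 59.6% of the time. A minimal sketch of that majority-class baseline follows. The helper `majority_baseline` is a hypothetical name, and the labels are rebuilt from the post counts reported above (786 Android, 532 iPhone).

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Labels rebuilt from the post counts reported earlier:
# 786 Android posts (label 1) and 532 iPhone posts (label 0).
labels = [1] * 786 + [0] * 532

label, accuracy = majority_baseline(labels)
print(label, round(accuracy, 4))  # 1 0.5964
```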

In [12]:
mobile_df=mobile_df.loc[:,('subreddit','content')]
mobile_df.head()

Unnamed: 0,subreddit,content
0,1,Moronic Monday (Jan 20 2020) - Your weekly que...
1,1,/r/android reviews: LG line Device reviews are...
2,1,Samsung Galaxy S20 release in France (and worl...
3,1,Good Lock 2020 with Android 10 support will be...
4,1,Wine 5.0 Released - run some Windows programs ...


In [13]:
#X_train, X_test, y_train, y_test = train_test_split(mobile_df[['content']],
#                                                    mobile_df['subreddit'],
#                                                    test_size = 0.25,
#                                                    random_state = 42)

In [14]:
#X_train.shape

(988, 1)

In [15]:
def review_to_words(raw_content):
    # Convert a raw post to a string of cleaned words.
    # The input is a single string (the raw title and selftext of a post),
    # and the output is a single string (the preprocessed post).
    
    # 1. Remove HTML. The parser is named explicitly so the result does
    # not depend on which parsers happen to be installed.
    content_text = BeautifulSoup(raw_content, 'html.parser').get_text()
    
    # 2. Remove non-letters.
    letters_only = re.sub("[^a-zA-Z]", " ", content_text)
    
    # 3. Convert to lower case, split into individual words.
    words = letters_only.lower().split()
    
    # 4. Build the stop word list: NLTK's English stop words plus words
    # that name the subreddits or come from URLs. In Python, searching a
    # set is much faster than searching a list, so convert it to a set.
    stops = stopwords.words('english')
    stops.extend(['none','iphone','android','mobile','device','ios','\n', 'www', 'reddit', 'com', 'comment', 'http'])
    stops = set(stops)
  
    # 5. Remove stop words.
    meaningful_words = [w for w in words if w not in stops]
    
    # 6. Lemmatize the remaining words.
    lemmatizer = WordNetLemmatizer()
    tokens_lem = [lemmatizer.lemmatize(i) for i in meaningful_words]
    
    # 7. Join the words back into one string separated by space, 
    # and return the result.
    return " ".join(tokens_lem)
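
To see what these steps do to a single post without downloading any NLTK data, here is a simplified, dependency-free sketch. It is not the function above: `clean_text_sketch` is a hypothetical name, tags are stripped with a regular expression instead of BeautifulSoup, the stop list is a tiny made-up subset, and lemmatisation is skipped.

```python
import re

# Tiny illustrative stop list (an assumption); the notebook uses NLTK's
# English stop words plus the custom words listed above.
SKETCH_STOPS = {"the", "is", "a", "on", "none", "iphone", "android"}


def clean_text_sketch(raw_content, stops=SKETCH_STOPS):
    # 1. Strip anything that looks like an HTML tag.
    text = re.sub(r"<[^>]+>", " ", raw_content)
    # 2. Replace every non-letter character with a space.
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    # 3. Lower-case and split into words.
    words = letters_only.lower().split()
    # 4. Drop stop words and join the rest back into one string.
    return " ".join(w for w in words if w not in stops)


cleaned = clean_text_sketch("<p>The iPhone 11 camera is GREAT!</p>")
print(cleaned)  # camera great
```

The words "iphone" and "android" are removed along with the ordinary stop words, presumably so that the classifier cannot simply read the label off the subreddit's own name.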

In [16]:
# Get the number of posts based on the dataframe size.
total_content = mobile_df.shape[0]
print(f'There are {total_content} content.')

# Initialize empty lists to hold the cleaned posts.
clean_train_content = []
clean_test_content = []

There are 1318 content.


In [17]:
print("Cleaning and parsing the post content...")

# Clean every post, then store the cleaned text back in the dataframe
# so that the train/test split below works on the cleaned content.
clean_content = []
for raw_content in mobile_df['content']:
    clean_content.append(review_to_words(raw_content))
mobile_df['content'] = clean_content


Cleaning and parsing the post content...


In [18]:
#X_train['content']=clean_train_content
#X_test['content']=clean_test_content

In [None]:
X_train, X_test, y_train, y_test = train_test_split(mobile_df[['content']],
                                                    mobile_df['subreddit'],
                                                    test_size = 0.25,
                                                    random_state = 42)

In [19]:
# Instantiate the "CountVectorizer" object, which is scikit-learn's
# bag of words tool.
vectorizer = CountVectorizer(analyzer = "word",
                             tokenizer = None,
                             preprocessor = None,
                             stop_words = None,
                             max_features = 3500) 

In [20]:
# fit_transform() does two things: first, it fits the vectorizer
# and learns the vocabulary; second, it transforms our training data
# into feature vectors. The input to fit_transform should be an
# iterable of strings, here the 'content' column of the training split.

train_data_features = vectorizer.fit_transform(X_train['content'])

# transform() reuses the vocabulary learned from the training data.
test_data_features = vectorizer.transform(X_test['content'])

# Numpy arrays are easy to work with, so convert the result to an 
# array.
train_data_features = train_data_features.toarray()
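
The difference between fitting on the training text and only transforming the test text is easier to see on a toy corpus. The sketch below uses made-up posts and a separate, hypothetical `toy_vectorizer` with default settings. It is an illustration of the bag-of-words idea, not a rerun of the cell above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up, already-cleaned posts for illustration.
train_posts = ["battery life great", "camera great great"]
test_posts = ["great battery widget"]  # "widget" never appears in training

toy_vectorizer = CountVectorizer()
train_counts = toy_vectorizer.fit_transform(train_posts).toarray()
test_counts = toy_vectorizer.transform(test_posts).toarray()

# Columns follow the alphabetical vocabulary learned from train_posts.
print(sorted(toy_vectorizer.vocabulary_))  # ['battery', 'camera', 'great', 'life']
print(train_counts.tolist())  # [[1, 0, 1, 1], [0, 1, 2, 0]]
print(test_counts.tolist())   # [[1, 0, 1, 0]]
```

The test row has no column for "widget": words unseen during fitting are dropped, so the test matrix always has the same columns as the training matrix.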

In [21]:
print(train_data_features.shape)

(988, 3500)


In [22]:
#print(test_data_features)

In [23]:
train_data_features[0:6]

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [24]:
vocab = vectorizer.get_feature_names()

In [25]:
y_train.shape

(988,)

In [26]:
# Instantiate logistic regression model.

lr = LogisticRegression()


In [27]:
# Fit model to training data.

lr.fit(train_data_features, y_train)




LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [28]:
# Evaluate model on training data.
lr.score(train_data_features, y_train)

0.9878542510121457

In [29]:
# Evaluate model on testing data.

lr.score(test_data_features, y_test)
#gs.score(X_train,y_test)

0.8696969696969697

In [30]:
lr_predictions=lr.predict(test_data_features)

In [31]:
confusion_matrix(y_test, lr_predictions)

array([[108,  23],
       [ 20, 179]])

In [33]:
tn, fp, fn, tp = confusion_matrix(y_test, lr_predictions).ravel()
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

True Negatives: 108
False Positives: 23
False Negatives: 20
True Positives: 179
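
The four counts are enough to recover the usual summary metrics by hand. The sketch below treats Android (label 1) as the positive class, as in the label mapping earlier. The helper `summarise_confusion` is a hypothetical name, and the counts are copied from the output above.

```python
def summarise_confusion(tn, fp, fn, tp):
    """Accuracy, precision and recall for the positive class (Android = 1)."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Counts copied from the logistic regression output above.
lr_metrics = summarise_confusion(tn=108, fp=23, fn=20, tp=179)
for name, value in lr_metrics.items():
    print(f"{name}: {value:.4f}")
# accuracy: 0.8697
# precision: 0.8861
# recall: 0.8995
```

The accuracy agrees with the test score of 0.8697 reported by `lr.score`, which is a quick check that the matrix has been read in the right order.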


In [34]:
# Instantiate a multinomial Naive Bayes model.

nb = MultinomialNB()

In [35]:
# Fit the model to the training data.

model = nb.fit(train_data_features, y_train)

In [36]:
predictions = model.predict(test_data_features)

In [37]:
model.score(test_data_features, y_test)

0.7848484848484848

In [38]:
# Generate a confusion matrix.

confusion_matrix(y_test, predictions)

array([[ 65,  66],
       [  5, 194]])

In [39]:
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()

In [40]:
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

True Negatives: 65
False Positives: 66
False Negatives: 5
True Positives: 194
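
Putting the two confusion matrices side by side shows where Naive Bayes loses accuracy. The sketch below computes, for each model, the share of each true class that was predicted correctly. The helper `per_class_recall` is a hypothetical name, and the counts are copied from the outputs in this notebook.

```python
def per_class_recall(tn, fp, fn, tp):
    """Share of each true class that was predicted correctly."""
    return {
        "iphone (0)": tn / (tn + fp),
        "Android (1)": tp / (tp + fn),
    }


# Counts copied from the two confusion matrices in this notebook.
results = {
    "LogisticRegression": per_class_recall(tn=108, fp=23, fn=20, tp=179),
    "MultinomialNB": per_class_recall(tn=65, fp=66, fn=5, tp=194),
}

for model_name, recalls in results.items():
    summary = ", ".join(f"{label}: {value:.3f}" for label, value in recalls.items())
    print(f"{model_name} -> {summary}")
# LogisticRegression -> iphone (0): 0.824, Android (1): 0.899
# MultinomialNB -> iphone (0): 0.496, Android (1): 0.975
```

On these counts, Naive Bayes labels about half of the iPhone posts as Android, while logistic regression is more balanced across the two classes.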
