# Predicting the Reddit post flair

Source: https://towardsdatascience.com/multi-class-text-classification-with-scikit-learn-12f1e60e0a9f

Importing necessary libraries

In [2]:
# Data Manipulation 
import numpy as np 
import pandas as pd

# Data Visualisation 
import matplotlib.pyplot as plt
import seaborn as sns

# Statistical libraries 
from sklearn.feature_selection import chi2

# Natural Language Processing
import nltk 
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# Machine Learning 
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Performance Evaluation and Support
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn import metrics




In [3]:
# Load the data into the dataframe 
data = pd.read_csv('..\\data\\data.csv')
data.shape

(1650, 11)

In [4]:
data.columns

Index(['Unnamed: 0', 'Title', 'Score', 'ID', 'URL', 'num_comments',
       'created_on', 'Body', 'Original', 'Flair', 'Comments'],
      dtype='object')

In [5]:
# Data Shuffling
data[:] = data.sample(frac=1).values
data.drop(['Unnamed: 0'], inplace=True, axis=1)
data.head()

Unnamed: 0,Title,Score,ID,URL,num_comments,created_on,Body,Original,Flair,Comments
0,When you have to cook and clean all by yoursel...,283,fuhcgu,https://i.redd.it/hyd2c2b3coq41.png,15,1585980000.0,,True,Food,Not for me. I always cooked my own food. The o...
1,"TCS not to layoff any employees, but no salary...",95,g2gx0s,https://www.thenewsminute.com/article/tcs-not-...,31,1587079000.0,,False,Business/Finance,Why not defer the increments instead of outrig...
2,How come western countries are facing so many ...,24,g41vk7,https://www.reddit.com/r/india/comments/g41vk7...,35,1587301000.0,Having a discussion with family and I am unabl...,False,AskIndia,The lockdown enforced by the government was tr...
3,EXCLUSIVE: Documents Show Modi Govt Building 3...,2072,fjx9ih,https://www.huffingtonpost.in/entry/aadhaar-na...,314,1584440000.0,,False,Politics,And then came corona! To fuck up their plans.
4,Startrail over Leh Palace - Leh Palace was bui...,215,epg8gi,https://i.redd.it/iwbeppdci3b41.jpg,6,1579190000.0,,False,Photography,Amazing to see this.


In [7]:
# Printing the data info to have a look at the null values and data types
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1650 entries, 0 to 1649
Data columns (total 10 columns):
Title           1650 non-null object
Score           1650 non-null int64
ID              1650 non-null object
URL             1650 non-null object
num_comments    1650 non-null int64
created_on      1650 non-null float64
Body            635 non-null object
Original        1650 non-null bool
Flair           1650 non-null object
Comments        1557 non-null object
dtypes: bool(1), float64(1), int64(2), object(6)
memory usage: 117.7+ KB


This dataset does not have any null values for the flairs. There are null values only in Comments and Body. 

In [118]:
# Making a copy of the data for later use
data_og = data.copy()        

In [119]:
data['Flair'].value_counts()

Sports                150
Science/Technology    150
[R]eddiquette         150
Non-Political         150
Policy/Economy        150
AMA                   150
Photography           150
Business/Finance      150
Food                  150
Politics              150
AskIndia              150
Name: Flair, dtype: int64

In [120]:
print(len(data['Flair'].unique()))
print(data['Flair'].unique())

11
['Food' 'Policy/Economy' 'Politics' 'AskIndia' 'Photography'
 'Business/Finance' '[R]eddiquette' 'Non-Political' 'Science/Technology'
 'AMA' 'Sports']


In [121]:
# List of relevant features (MOVE LATER)
features = ['Flair', 'URL', 'Title', 'Comments', 'Body']

In [122]:
# Collecting the flair and ids
data = data[features]
data.head()

Unnamed: 0,Flair,URL,Title,Comments,Body
0,Food,https://www.thenewsminute.com/article/swiggy-z...,"Swiggy, Zomato roll out contact-less delivery:...",The new coronavirus was viable up to 72 hours ...,
1,Policy/Economy,https://www.livemint.com/market/stock-market-n...,Rupee crashes to near record low against the U...,NRIs rejoice!,
2,Politics,https://www.reddit.com/r/india/comments/fomt9e...,Indian cops are uncontrollable animals with pe...,I don't feel wrong even if they beat anyone be...,"DISCLAIMER: From the videos I have seen, I wil..."
3,AskIndia,https://www.reddit.com/r/india/comments/fjx0dq...,Is your employer making your come to office? (...,Uh. The Indian Army. \n\n\nGet back to me on...,Fill out the the form below. It is is 100% ANO...
4,Photography,https://i.redd.it/7apjmstrxnt41.jpg,"Sunrise - Fatehgarh Shib, Punjab [OnePlus 7T -...",Thanks for reminding me that I haven't slept.,


In [12]:
# Assigning an individual id to each flair
data['id'] = data['Flair'].factorize()[0]
flair_category = data[['Flair', 'id']].drop_duplicates().sort_values('id')
flair_category

Unnamed: 0,Flair,id
0,Food,0
1,Business/Finance,1
2,AskIndia,2
3,Politics,3
4,Photography,4
5,[R]eddiquette,5
11,AMA,6
26,Non-Political,7
31,Sports,8
40,Science/Technology,9


In [13]:
# Convert into a label dictionary to be used as a means of assigning labels after the prediction
category_labels = dict(flair_category.values)
print(category_labels)

{'Food': 0, 'Business/Finance': 1, 'AskIndia': 2, 'Politics': 3, 'Photography': 4, '[R]eddiquette': 5, 'AMA': 6, 'Non-Political': 7, 'Sports': 8, 'Science/Technology': 9, 'Policy/Economy': 10}


In [14]:
# Similarly, we can create an inverse of the previous one to convert labels to categories
category_reverse = dict(flair_category[['id', 'Flair']].values)
print(category_reverse)

{0: 'Food', 1: 'Business/Finance', 2: 'AskIndia', 3: 'Politics', 4: 'Photography', 5: '[R]eddiquette', 6: 'AMA', 7: 'Non-Political', 8: 'Sports', 9: 'Science/Technology', 10: 'Policy/Economy'}


Have a look at the data now. We have an id column that holds the labels we have to predict; they are derived from the corresponding flair categories. We will use the other columns as our input features. We will also create a series of all the labels that need to be predicted.
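The mapping between flairs and integer ids can be sketched on a few made-up values. This is only an illustration: `toy_flairs`, `label_to_id` and `id_to_label` are hypothetical names that do not appear in the notebook.

```python
import pandas as pd

# Made-up flair column standing in for data['Flair']
toy_flairs = pd.Series(['Food', 'Sports', 'Food', 'AMA'])

# factorize() numbers the categories in order of first appearance
codes, uniques = toy_flairs.factorize()

# Forward and reverse lookups, analogous to category_labels / category_reverse
label_to_id = {flair: i for i, flair in enumerate(uniques)}
id_to_label = dict(enumerate(uniques))

print(list(codes))
print(label_to_id)
print(id_to_label)
```

Because ids follow the order of first appearance, reshuffling the rows can change which id a flair receives, so the two dictionaries should always be built from the same run.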

In [15]:
labels = data['id']
data.head(10)

Unnamed: 0,Title,Score,ID,URL,num_comments,created_on,Body,Original,Flair,Comments,id
0,When you have to cook and clean all by yoursel...,283,fuhcgu,https://i.redd.it/hyd2c2b3coq41.png,15,1585980000.0,,True,Food,Not for me. I always cooked my own food. The o...,0
1,"TCS not to layoff any employees, but no salary...",95,g2gx0s,https://www.thenewsminute.com/article/tcs-not-...,31,1587079000.0,,False,Business/Finance,Why not defer the increments instead of outrig...,1
2,How come western countries are facing so many ...,24,g41vk7,https://www.reddit.com/r/india/comments/g41vk7...,35,1587301000.0,Having a discussion with family and I am unabl...,False,AskIndia,The lockdown enforced by the government was tr...,2
3,EXCLUSIVE: Documents Show Modi Govt Building 3...,2072,fjx9ih,https://www.huffingtonpost.in/entry/aadhaar-na...,314,1584440000.0,,False,Politics,And then came corona! To fuck up their plans.,3
4,Startrail over Leh Palace - Leh Palace was bui...,215,epg8gi,https://i.redd.it/iwbeppdci3b41.jpg,6,1579190000.0,,False,Photography,Amazing to see this.,4
5,"Random Daily Discussion Thread - September 04,...",15,czfltt,https://www.reddit.com/r/india/comments/czfltt...,109,1567598000.0,"^Beep ^Boop ^Bot, ^I ^am ^a ^bot! ^if ^any ^pr...",False,[R]eddiquette,I have a query! I want to make a payment in US...,5
6,I find this extremely strange. - Album on Imgu...,41,g62sa1,https://imgur.com/a/pOM8XM0,9,1587597000.0,,False,AskIndia,"This has me curious, what are some documents o...",2
7,"Kolkata horror: You should be raped, woman tel...",176,cn734h,https://timesofindia.indiatimes.com/city/kolka...,52,1565220000.0,,False,[R]eddiquette,"Fuck this, imagine a man saying this.",5
8,Chandrayaan Moon Landing,503,d0ffgv,https://www.reddit.com/r/india/comments/d0ffgv...,1401,1567798000.0,ISRO's Chandrayaan 2 mission is a significant ...,False,[R]eddiquette,"Experts at Home: ""Solar battery phook gayi"" 😂",5
9,Ambedkar Jayanti 2020,1298,g15qec,https://i.redd.it/rcf07o8ress41.jpg,126,1586900000.0,,False,Politics,No news channel have explained in detail to co...,3


## Performing text analysis

In [16]:
# Import nltk stopwords as done in the previous notebook as well
STOPWORDS = nltk.corpus.stopwords.words('english')
print(STOPWORDS)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [17]:
# Import nltk punctuation which will be removed from our texts as well
# nltk.download('punkt')
# PUNCT = nltk.corpus.stopwords.words('punkt')
# print(PUNCT)

## Working Dataset

For this, I will be combining the data present in the Body, Title and Comments columns.

Detecting the missing values turned out to be tricky: NaN never compares equal to anything, including itself, so every comparison against np.NaN returns False. As a workaround, I check the type of each value instead: a missing entry is a float (NaN), while a present entry is a str.
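As an alternative to the type check, pandas has dedicated missing-value helpers. The following is a minimal sketch on a made-up frame (`toy` and `parts` are hypothetical names), not the approach used in the rest of this notebook.

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for the Title / Body / Comments columns
toy = pd.DataFrame({
    'Title': ['first post', 'second post'],
    'Body': [np.nan, 'some body text'],
    'Comments': ['a comment', np.nan],
})

# NaN is not equal to anything, not even to itself
print(np.nan == np.nan)

# isna() detects missing values reliably
print(toy['Body'].isna().tolist())

# Replace missing text with '' and join only the non-empty pieces
parts = toy[['Title', 'Body', 'Comments']].fillna('')
toy['Combine'] = parts.apply(lambda row: ' '.join(p for p in row if p), axis=1)
print(toy['Combine'].tolist())
```

Filling the gaps with empty strings first means the concatenation never has to branch on the type of a value.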

In [18]:
# for i in range(len(data)):
#     print(type(data.iloc[i]['Body']))

In [19]:
data['Combine'] = data['Title']  # Start the combined text from the title
for i in range(len(data)):
    # Missing entries are NaN (a float), so only append real strings.
    # .loc[row, column] avoids the chained assignment that triggers SettingWithCopyWarning.
    if isinstance(data.loc[i, 'Body'], str):
        data.loc[i, 'Combine'] = data.loc[i, 'Combine'] + ' ' + data.loc[i, 'Body']

    if isinstance(data.loc[i, 'Comments'], str):
        data.loc[i, 'Combine'] = data.loc[i, 'Combine'] + ' ' + data.loc[i, 'Comments']

data.head(20)


Unnamed: 0,Title,Score,ID,URL,num_comments,created_on,Body,Original,Flair,Comments,id,Combine
0,When you have to cook and clean all by yoursel...,283,fuhcgu,https://i.redd.it/hyd2c2b3coq41.png,15,1585980000.0,,True,Food,Not for me. I always cooked my own food. The o...,0,When you have to cook and clean all by yoursel...
1,"TCS not to layoff any employees, but no salary...",95,g2gx0s,https://www.thenewsminute.com/article/tcs-not-...,31,1587079000.0,,False,Business/Finance,Why not defer the increments instead of outrig...,1,"TCS not to layoff any employees, but no salary..."
2,How come western countries are facing so many ...,24,g41vk7,https://www.reddit.com/r/india/comments/g41vk7...,35,1587301000.0,Having a discussion with family and I am unabl...,False,AskIndia,The lockdown enforced by the government was tr...,2,How come western countries are facing so many ...
3,EXCLUSIVE: Documents Show Modi Govt Building 3...,2072,fjx9ih,https://www.huffingtonpost.in/entry/aadhaar-na...,314,1584440000.0,,False,Politics,And then came corona! To fuck up their plans.,3,EXCLUSIVE: Documents Show Modi Govt Building 3...
4,Startrail over Leh Palace - Leh Palace was bui...,215,epg8gi,https://i.redd.it/iwbeppdci3b41.jpg,6,1579190000.0,,False,Photography,Amazing to see this.,4,Startrail over Leh Palace - Leh Palace was bui...
5,"Random Daily Discussion Thread - September 04,...",15,czfltt,https://www.reddit.com/r/india/comments/czfltt...,109,1567598000.0,"^Beep ^Boop ^Bot, ^I ^am ^a ^bot! ^if ^any ^pr...",False,[R]eddiquette,I have a query! I want to make a payment in US...,5,"Random Daily Discussion Thread - September 04,..."
6,I find this extremely strange. - Album on Imgu...,41,g62sa1,https://imgur.com/a/pOM8XM0,9,1587597000.0,,False,AskIndia,"This has me curious, what are some documents o...",2,I find this extremely strange. - Album on Imgu...
7,"Kolkata horror: You should be raped, woman tel...",176,cn734h,https://timesofindia.indiatimes.com/city/kolka...,52,1565220000.0,,False,[R]eddiquette,"Fuck this, imagine a man saying this.",5,"Kolkata horror: You should be raped, woman tel..."
8,Chandrayaan Moon Landing,503,d0ffgv,https://www.reddit.com/r/india/comments/d0ffgv...,1401,1567798000.0,ISRO's Chandrayaan 2 mission is a significant ...,False,[R]eddiquette,"Experts at Home: ""Solar battery phook gayi"" 😂",5,Chandrayaan Moon Landing ISRO's Chandrayaan 2 ...
9,Ambedkar Jayanti 2020,1298,g15qec,https://i.redd.it/rcf07o8ress41.jpg,126,1586900000.0,,False,Politics,No news channel have explained in detail to co...,3,Ambedkar Jayanti 2020 No news channel have exp...


In [20]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1650 entries, 0 to 1649
Data columns (total 12 columns):
Title           1650 non-null object
Score           1650 non-null int64
ID              1650 non-null object
URL             1650 non-null object
num_comments    1650 non-null int64
created_on      1650 non-null float64
Body            635 non-null object
Original        1650 non-null bool
Flair           1650 non-null object
Comments        1557 non-null object
id              1650 non-null int64
Combine         1650 non-null object
dtypes: bool(1), float64(1), int64(3), object(7)
memory usage: 143.5+ KB


In [21]:
data.loc[34]['Combine']

'Spitting in public now an offence under Disaster Management Act: MHA Der aae durust aae '

### Cleaning our data that will be used as an input

In [22]:
REPLACE_SPACES = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS = re.compile(r'[^0-9a-z #+_]')

Certain symbols add no analytical value to the data, so they are removed. Similarly, brackets, slashes and other separator characters are each replaced by a single space.

In [23]:
def clean_text(text):
    '''
        text: a string
        
        return: modified initial string
        
    '''

    text = text.lower() # lowercase text
    text = REPLACE_SPACES.sub(' ', text) # replace separator symbols with a space
    text = BAD_SYMBOLS.sub('', text) # delete every symbol outside 0-9, a-z, space, #, + and _
    text = text.replace('x', '') # NOTE: strips every 'x', which is why the output below shows 'eclusive' and 'etremely'
    
    text = ' '.join(word for word in text.split() if word not in STOPWORDS) # remove stopwords from text
    return text

data['Combine'] = data['Combine'].apply(clean_text)
data['Combine'] = data['Combine'].str.replace(r'\d+', '', regex=True) # remove digits

In [24]:
data['Combine'].head(10)

0    cook clean peel onions make dough clean dishes...
1    tcs layoff employees salary increments either ...
2    come western countries facing many deaths numb...
3    eclusive documents show modi govt building  de...
4    startrail leh palace leh palace built th centu...
5    random daily discussion thread september   am ...
6    find etremely strange album imgur details atm ...
7    kolkata horror raped woman tells girl shorts f...
8    chandrayaan moon landing isros chandrayaan  mi...
9    ambedkar jayanti  news channel eplained detail...
Name: Combine, dtype: object

### TFIDF Feature Importance

TF-IDF stands for Term Frequency–Inverse Document Frequency.
It is a numerical statistic that reflects how important a word is to a document in a collection or corpus.
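To make the idea concrete, here is a small sketch on a made-up three-document corpus (`toy_docs`, `toy_tfidf`, `weights` and `vocab` are hypothetical names). It uses the default `TfidfVectorizer` settings rather than the tuned ones used in this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: 'bank' occurs in two documents, 'loan' in only one
toy_docs = ['bank loan', 'bank recipe', 'recipe chicken']

toy_tfidf = TfidfVectorizer()
weights = toy_tfidf.fit_transform(toy_docs).toarray()
vocab = toy_tfidf.vocabulary_   # maps each term to its column index

# Both terms occur once in the first document, but the rarer term gets the larger weight
print(weights[0, vocab['loan']], weights[0, vocab['bank']])

# With the default norm='l2', every row has unit length
print(np.linalg.norm(weights, axis=1))
```

A term that appears in fewer documents receives a higher inverse document frequency, which is what pushes distinctive words to the top for each flair.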

In [25]:
# Creating an instance of the Tfidf vectorizer
# I will be performing a hyperparameter tuning soon
tfidf = TfidfVectorizer(sublinear_tf=True, 
                        min_df=5, 
                        norm = 'l2', 
                        encoding='latin-1', 
                        ngram_range=(1, 2))

In [26]:
data.tail()

Unnamed: 0,Title,Score,ID,URL,num_comments,created_on,Body,Original,Flair,Comments,id,Combine
1645,"Hi /r/India, I’m Shankar. I’m an out Indian ga...",218,2b7pz6,https://www.reddit.com/r/india/comments/2b7pz6...,322,1405898000.0,I’m Shankar and I’m 22 years old. I came out p...,False,AMA,Can you rate the following in the order of aro...,6,hi r india im shankar im indian gay man living...
1646,"Relevance of Udemy courses in the job sector, ...",28,g4xetc,https://www.reddit.com/r/india/comments/g4xetc...,38,1587433000.0,has anybody ever put any kind of udemy certif...,False,AskIndia,Short answer: those certificates are BS.,2,relevance udemy courses job sector particularl...
1647,"At Delhi Haat, Delhi (3024x4032)",240,eo5lv3,https://i.redd.it/gxm4hpqkeka41.jpg,12,1578958000.0,,False,Photography,Did they walk with Gandhiji in his landmark ev...,4,delhi haat delhi walk gandhiji landmark events
1648,They're just everywhere these days.,1899,bmu870,https://i.redd.it/lltlzq3ngbx21.png,147,1557494000.0,,False,[R]eddiquette,The Zomato guy's secretly eating your food.,5,theyre everywhere days zomato guys secretly ea...
1649,"Hi, Norinder Mudi here. Ask me anything!",127,3ss2iv,https://www.reddit.com/r/india/comments/3ss2iv...,278,1447537000.0,Nomuste jonta!\n\nThis is Swagmohan. The guy w...,False,AMA,Do you think intolerance has increased since M...,6,hi norinder mudi ask anything nomuste jontathi...


In [27]:
# Extracting the features by fitting the vectorizer on the combined text (title, body and comments)
feat = tfidf.fit_transform(data['Combine']).toarray()
print(feat.shape)

(1650, 3299)


Now, I need to look at the words most correlated with each category and list them. I am going to look at both unigrams and bigrams.

In [28]:
# Chi-squared test to find the terms most correlated with each flair
N = 5    # Number of examples to be listed
for f, i in sorted(category_labels.items()):
    chi2_feat = chi2(feat, labels == i)
    indices = np.argsort(chi2_feat[0])
    feat_names = np.array(tfidf.get_feature_names_out())[indices]  # get_feature_names() on scikit-learn < 1.0
    unigrams = [w for w in feat_names if len(w.split(' ')) == 1]
    bigrams = [w for w in feat_names if len(w.split(' ')) == 2]
    print("\nFlair '{}':".format(f))
    print("Most correlated unigrams:\n\t. {}".format('\n\t. '.join(unigrams[-N:])))
    print("Most correlated bigrams:\n\t. {}".format('\n\t. '.join(bigrams[-N:])))


Flair 'AMA':
Most correlated unigrams:
	. hi
	. anything
	. ask
	. questions
	. ama
Most correlated bigrams:
	. ask us
	. us anything
	. hi reddit
	. answer questions
	. ask anything

Flair 'AskIndia':
Most correlated unigrams:
	. advice
	. dad
	. situation
	. afraid
	. family
Most correlated bigrams:
	. ive seen
	. want know
	. feel like
	. work home
	. dont want

Flair 'Business/Finance':
Most correlated unigrams:
	. firms
	. emi
	. hdfc
	. mukesh
	. bank
Most correlated bigrams:
	. credit card
	. mukesh ambani
	. share market
	. reliance jio
	. yes bank

Flair 'Food':
Most correlated unigrams:
	. restaurant
	. chutney
	. recipe
	. chicken
	. food
Most correlated bigrams:
	. im trying
	. every day
	. couldnt find
	. dont eat
	. indian food

Flair 'Non-Political':
Most correlated unigrams:
	. rural
	. dads
	. found
	. bored
	. comics
Most correlated bigrams:
	. im gonna
	. palghar lynching
	. amazon prime
	. india live
	. amid lockdown

Flair 'Photography':
Most correlated unigrams:


### Model Input Preparation

In [138]:
flair_list = list(category_labels.keys())
flair_list

['Food',
 'Policy/Economy',
 'Politics',
 'AskIndia',
 'Photography',
 'Business/Finance',
 '[R]eddiquette',
 'Non-Political',
 'Science/Technology',
 'AMA',
 'Sports']

In [36]:
# Holding out 15% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(data['Combine'], data['Flair'], test_size=0.15, random_state=42)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(1402,) (1402,) (248,) (248,)


I randomized the training and testing data for better predictions. This is very important because the raw data lists the flairs in homogeneous blocks of 150 entries each.
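Shuffling alone does not guarantee that every flair is equally represented in the test set. One option, shown here as a sketch on made-up data and not what the cell above does, is the `stratify` argument of `train_test_split`. The names `toy`, `X_tr`, `X_te`, `y_tr` and `y_te` are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data ordered in blocks per label, like the raw flair data
toy = pd.DataFrame({
    'text': ['post {}'.format(i) for i in range(40)],
    'flair': ['Food'] * 20 + ['Sports'] * 20,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy['text'], toy['flair'],
    test_size=0.25,
    stratify=toy['flair'],   # keep the class proportions the same in both parts
    random_state=42,
)

print(y_te.value_counts().to_dict())
```

With a balanced dataset such as this one, a stratified split keeps every class at the same share in the training and test parts.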


## Building our classifiers

I will be building functions for different classifiers. Each function implements a pipeline for its model: the pipeline first uses a CountVectorizer to turn the text into vectors of word counts and then applies a TfidfTransformer.
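Since every classifier shares the same two preprocessing steps, the pattern can also be written once as a generic helper. This is a sketch under that assumption: `evaluate_model` is a hypothetical function and the tiny corpus is made up.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def evaluate_model(model, X_train, X_test, y_train, y_test):
    """Hypothetical helper: wrap `model` in the count -> TF-IDF pipeline and return its accuracy."""
    pipe = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('model', model)])
    pipe.fit(X_train, y_train)
    return accuracy_score(y_true=y_test, y_pred=pipe.predict(X_test))


# Made-up corpus, only to show the call
train_text = ['chicken curry recipe', 'tasty food recipe',
              'cricket match score', 'football match today']
train_flair = ['Food', 'Food', 'Sports', 'Sports']
test_text = ['paneer recipe', 'cricket score']
test_flair = ['Food', 'Sports']

acc = evaluate_model(MultinomialNB(), train_text, test_text, train_flair, test_flair)
print('Model Accuracy: {}'.format(acc))
```

Keeping the vectorizer inside the pipeline means it is fitted on the training text only, so no vocabulary information leaks in from the test set.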


In [37]:
# Creating an instance of the count vectorizer and fitting it on the training text
count_vec = CountVectorizer()
X_train_counts = count_vec.fit_transform(X_train)
print(X_train_counts)

  (0, 14783)	1
  (0, 14548)	1
  (0, 3861)	1
  (0, 15483)	1
  (0, 825)	1
  (0, 16550)	3
  (0, 12174)	1
  (0, 10065)	1
  (0, 10560)	2
  (0, 8486)	1
  (0, 7021)	1
  (0, 16812)	1
  (0, 14513)	1
  (0, 4939)	1
  (0, 13223)	1
  (0, 5032)	1
  (0, 16623)	2
  (0, 12971)	1
  (0, 9130)	1
  (0, 10523)	1
  (0, 363)	1
  (1, 13095)	1
  (1, 13307)	1
  (1, 6102)	1
  (1, 14744)	2
  :	:
  (1401, 11945)	3
  (1401, 11939)	1
  (1401, 4695)	1
  (1401, 16293)	1
  (1401, 16211)	1
  (1401, 14669)	2
  (1401, 13100)	1
  (1401, 8560)	1
  (1401, 12503)	2
  (1401, 5485)	1
  (1401, 8974)	1
  (1401, 14251)	1
  (1401, 684)	1
  (1401, 3014)	1
  (1401, 7067)	1
  (1401, 5890)	1
  (1401, 2352)	1
  (1401, 10291)	2
  (1401, 8215)	1
  (1401, 5692)	1
  (1401, 2511)	1
  (1401, 66)	1
  (1401, 11998)	1
  (1401, 14592)	1
  (1401, 11310)	1


In [38]:
# Creating an instance of the TF-IDF transformer
tfidf_trans = TfidfTransformer()
X_train_tfidf = tfidf_trans.fit_transform(X_train_counts)
print(X_train_tfidf)

  (0, 16812)	0.11022236890576421
  (0, 16623)	0.1925209243808672
  (0, 16550)	0.42626574810284235
  (0, 15483)	0.17615073768910242
  (0, 14783)	0.23418542063815603
  (0, 14548)	0.22161409270279045
  (0, 14513)	0.14130361141634007
  (0, 13223)	0.1627944384691295
  (0, 12971)	0.12439900504259141
  (0, 12174)	0.13143004595762284
  (0, 10560)	0.46837084127631207
  (0, 10523)	0.23418542063815603
  (0, 10065)	0.13316908358352358
  (0, 9130)	0.10883179012579917
  (0, 8486)	0.12020278585316788
  (0, 7021)	0.20012326565000102
  (0, 5032)	0.15022311053376394
  (0, 4939)	0.17863243859721162
  (0, 3861)	0.23418542063815603
  (0, 825)	0.22161409270279045
  (0, 363)	0.19534386858520836
  (1, 14744)	0.4692741705095136
  (1, 13478)	0.22819142474839169
  (1, 13308)	0.1982366873500187
  (1, 13307)	0.2670309452384351
  :	:
  (1401, 8215)	0.15983661879241237
  (1401, 7335)	0.05020117357779233
  (1401, 7180)	0.1453138840129959
  (1401, 7165)	0.10554782093250241
  (1401, 7067)	0.15340353250776484
  (1401, 6

In [39]:
# Model input Sequences
pre_train = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer())])

### Naive Bayes Classifier Pipeline
The first one that I am building is the Naive Bayes classifier. The variant most suitable for word counts is the multinomial one.

In [40]:
def nb_classifier(X_train, X_test, y_train, y_test):
    
    nb_fit = Pipeline([('vect', CountVectorizer()),
                  ('tfidf', TfidfTransformer()),
                  ('model', MultinomialNB()),
                 ])
    nb_fit.fit(X_train, y_train)    # Fitting the pipeline to the training data
    
    # Making Predictions on the test data
    y_pred = nb_fit.predict(X_test)
    acc = accuracy_score(y_pred=y_pred, y_true=y_test)
    print("Model Accuracy: {}".format(acc))

### Logistic Regression Model

In [41]:
def log_reg(X_train, X_test, y_train, y_test):
    
    logreg = Pipeline([('vect', CountVectorizer()),
                  ('tfidf', TfidfTransformer()),
                  ('model', LogisticRegression()),
                 ])
    logreg.fit(X_train, y_train)     # Fitting the pipeline to the training data

    # Making Predictions on the test data
    y_pred = logreg.predict(X_test)
    acc = accuracy_score(y_pred=y_pred, y_true=y_test)
    print("Model Accuracy: {}".format(acc))

### Random Forest Model

In [42]:
def random_forest(X_train, X_test, y_train, y_test):
    
    forest = Pipeline([('vect', CountVectorizer()),
                  ('tfidf', TfidfTransformer()),
                  ('model', RandomForestClassifier()),
                 ])
    forest.fit(X_train, y_train)    # Fitting the pipeline to the training data
    
    # Making Predictions on the test data
    y_pred = forest.predict(X_test)
    acc = accuracy_score(y_pred=y_pred, y_true=y_test)
    print("Model Accuracy: {}".format(acc))

### Support Vector Classifier (SVC)

In [43]:
def svc(X_train, X_test, y_train, y_test):
    
    svc_fit = Pipeline([('vect', CountVectorizer()),
                  ('tfidf', TfidfTransformer()),
                  ('model', SVC()),
                 ])
    svc_fit.fit(X_train, y_train)    # Fitting the pipeline to the training data
    
    # Making Predictions on the test data
    y_pred = svc_fit.predict(X_test)
    acc = accuracy_score(y_pred=y_pred, y_true=y_test)
    print("Model Accuracy: {}".format(acc))

In [44]:
X_test_counts = count_vec.transform(X_test)
X_test_tfidf = tfidf_trans.transform(X_test_counts)

In [45]:
svc_model = SVC()    # Default SVC (RBF kernel) on the precomputed TF-IDF features
svc_model.fit(X_train_tfidf, y_train)

y_pred = svc_model.predict(X_test_tfidf)
accuracy_score(y_pred=y_pred, y_true=y_test)



0.06048387096774194

### Making predictions.

In [46]:
print("Evaluate Naive Bayes Classifier")
nb_classifier(X_train, X_test, y_train, y_test)

print("Evaluate Random Forest Classifier")
random_forest(X_train, X_test, y_train, y_test)

print("Evaluate Logistic Regression Model")
log_reg(X_train, X_test, y_train, y_test)

print("Evaluate SVC Model")
svc(X_train, X_test, y_train, y_test)

Evaluate Naive Bayes Classifier
Model Accuracy: 0.43951612903225806
Evaluate Random Forest Classifier
Model Accuracy: 0.4274193548387097
Evaluate Logistic Regression Model
Model Accuracy: 0.5645161290322581
Evaluate SVC Model
Model Accuracy: 0.06048387096774194


Logistic Regression gives the best results so far and worked well in the Flask app, although the model does not converge in some cases. However, on Heroku it fails with an error saying that a module is not found even though it is present, so I will be using SVC for now.
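When the solver stops before converging, one common remedy is to raise `max_iter` above its default of 100. The sketch below assumes a value of 1000, which is a guess rather than a tuned setting; `logreg_long`, `toy_text` and `toy_flair` are hypothetical names and the corpus is made up.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Same pipeline shape as log_reg(), with a larger iteration budget (assumed value)
logreg_long = Pipeline([('vect', CountVectorizer()),
                        ('tfidf', TfidfTransformer()),
                        ('model', LogisticRegression(max_iter=1000))])

# Made-up corpus, only to show that the pipeline fits and predicts
toy_text = ['chicken curry recipe', 'tasty food recipe',
            'cricket match score', 'football match today']
toy_flair = ['Food', 'Food', 'Sports', 'Sports']

logreg_long.fit(toy_text, toy_flair)
toy_pred = logreg_long.predict(['cricket score'])
print(toy_pred)
```

If the warning persists with a larger budget, trying a different `solver` is another option worth testing.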

### SVC Model Tuning

In [38]:
# param_grid = {'C': [0.1,1, 10, 100], 
#               'gamma': [1,0.1,0.01,0.001],
#               'kernel': ['linear', 'rbf', 'poly', 'sigmoid']}

# grid = GridSearchCV(SVC(), param_grid, refit = True, verbose = 3) 
  
# # fitting the model for grid search 
# grid.fit(X_train_tfidf, y_train) 

In [39]:
# # print best parameter after tuning 
# print(grid.best_params_) 

# # print how our model looks after hyper-parameter tuning 
# print(grid.best_estimator_) 


#### Testing the best parameters 

In [40]:
def svc(X_train, X_test, y_train, y_test):
    
    svc_fit = Pipeline([('vect', CountVectorizer()),
                  ('tfidf', TfidfTransformer()),
                  ('model', SVC(C=1, gamma=1, kernel='linear')),
                 ])
    svc_fit.fit(X_train, y_train)    # Fitting the pipeline to the training data
    
    # Making Predictions on the test data
    y_pred = svc_fit.predict(X_test)
    acc = accuracy_score(y_pred=y_pred, y_true=y_test)
    print("Model Accuracy: {}".format(acc))

In [41]:
print("Evaluate SVC Model")
svc(X_train, X_test, y_train, y_test)

Evaluate SVC Model
Model Accuracy: 0.6129032258064516


This gives us an accuracy of about 61%, which is the best so far, and I will go with this for now. Heroku results in an error for this model as well. Hence, I will tune my Naive Bayes model next. Here is the link to the error logs: [Click](https://stackoverflow.com/questions/61417803/modulenotfounderror-no-module-named-sklearn-svm-classes-on-heroku?noredirect=1#comment108648845_61417803)
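A likely cause of a module-not-found error when loading a pickled model is that the deployment environment installs a different scikit-learn version from the one used for training. This is an assumption, not something confirmed by the logs here. Pinning the version in `requirements.txt` is the usual remedy. The sketch below shows one way to make such a mismatch fail with a clear message; `dump_with_version` and `load_with_check` are hypothetical helpers.

```python
import pickle

import sklearn
from sklearn.svm import SVC


def dump_with_version(model):
    """Hypothetical helper: store the scikit-learn version next to the pickled model."""
    return pickle.dumps((sklearn.__version__, pickle.dumps(model)))


def load_with_check(blob):
    """Hypothetical helper: refuse to unpickle under a different scikit-learn version."""
    # The outer layer holds only a str and bytes, so it loads without scikit-learn
    saved_version, model_bytes = pickle.loads(blob)
    if saved_version != sklearn.__version__:
        raise RuntimeError('Model was saved with scikit-learn {} but {} is installed'
                           .format(saved_version, sklearn.__version__))
    return pickle.loads(model_bytes)


# Made-up one-feature model, only to show the round trip
toy_model = SVC(C=1, kernel='linear').fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])
restored = load_with_check(dump_with_version(toy_model))
print(restored.predict([[0.0], [3.0]]))
```

Checking the version before touching the model bytes turns an obscure import error into an explicit message about which version to install.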

### Tuning Multinomial Naive Bayes Classifier

In [80]:
nb_fit = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])

In [81]:
param_grid = {
    'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': [1, 1e-1, 1e-2]
}

In [83]:
from sklearn.metrics import classification_report

clf = GridSearchCV(nb_fit, param_grid, cv=10, scoring='accuracy', verbose=3)
clf.fit(X_train, y_train)

print(classification_report(y_test, clf.predict(X_test), digits=4))


Fitting 10 folds for each of 36 candidates, totalling 360 fits
[CV] clf__alpha=1, tfidf__norm=l1, tfidf__use_idf=True, vect__ngram_range=(1, 1) 
[CV]  clf__alpha=1, tfidf__norm=l1, tfidf__use_idf=True, vect__ngram_range=(1, 1), score=0.326, total=   0.2s
[CV] clf__alpha=1, tfidf__norm=l1, tfidf__use_idf=True, vect__ngram_range=(1, 1) 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.2s remaining:    0.0s


[CV]  clf__alpha=1, tfidf__norm=l1, tfidf__use_idf=True, vect__ngram_range=(1, 1), score=0.326, total=   0.2s
[CV] clf__alpha=1, tfidf__norm=l1, tfidf__use_idf=True, vect__ngram_range=(1, 1) 
[CV]  clf__alpha=1, tfidf__norm=l1, tfidf__use_idf=True, vect__ngram_range=(1, 1), score=0.357, total=   0.1s
[CV] clf__alpha=1, tfidf__norm=l1, tfidf__use_idf=True, vect__ngram_range=(1, 1) 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.3s remaining:    0.0s


[CV]  clf__alpha=1, tfidf__norm=l1, tfidf__use_idf=True, vect__ngram_range=(1, 1), score=0.257, total=   0.2s
[CV] clf__alpha=1, tfidf__norm=l1, tfidf__use_idf=True, vect__ngram_range=(1, 1) 
[CV]  clf__alpha=1, tfidf__norm=l1, tfidf__use_idf=True, vect__ngram_range=(1, 1), score=0.286, total=   0.1s
[CV] clf__alpha=1, tfidf__norm=l1, tfidf__use_idf=True, vect__ngram_range=(1, 1) 
[CV]  clf__alpha=1, tfidf__norm=l1, tfidf__use_idf=True, vect__ngram_range=(1, 1), score=0.264, total=   0.1s
[CV] clf__alpha=1, tfidf__norm=l1, tfidf__use_idf=True, vect__ngram_range=(1, 1) 
[CV]  clf__alpha=1, tfidf__norm=l1, tfidf__use_idf=True, vect__ngram_range=(1, 1), score=0.250, total=   0.1s
[CV] clf__alpha=1, tfidf__norm=l1, tfidf__use_idf=True, vect__ngram_range=(1, 1) 
[CV]  clf__alpha=1, tfidf__norm=l1, tfidf__use_idf=True, vect__ngram_range=(1, 1), score=0.350, total=   0.1s
[CV] clf__alpha=1, tfidf__norm=l1, tfidf__use_idf=True, vect__ngram_range=(1, 1) 
[CV]  clf__alpha=1, tfidf__norm=l1, tfid

[CV]  clf__alpha=1, tfidf__norm=l1, tfidf__use_idf=False, vect__ngram_range=(1, 2), score=0.186, total=   0.5s
[CV] clf__alpha=1, tfidf__norm=l1, tfidf__use_idf=False, vect__ngram_range=(1, 2) 
[CV]  clf__alpha=1, tfidf__norm=l1, tfidf__use_idf=False, vect__ngram_range=(1, 2), score=0.286, total=   0.5s
[CV] clf__alpha=1, tfidf__norm=l1, tfidf__use_idf=False, vect__ngram_range=(1, 2) 
[CV]  clf__alpha=1, tfidf__norm=l1, tfidf__use_idf=False, vect__ngram_range=(1, 2), score=0.214, total=   0.5s
[CV] clf__alpha=1, tfidf__norm=l1, tfidf__use_idf=False, vect__ngram_range=(1, 2) 
[CV]  clf__alpha=1, tfidf__norm=l1, tfidf__use_idf=False, vect__ngram_range=(1, 2), score=0.264, total=   0.5s
[CV] clf__alpha=1, tfidf__norm=l1, tfidf__use_idf=False, vect__ngram_range=(2, 2) 
[CV]  clf__alpha=1, tfidf__norm=l1, tfidf__use_idf=False, vect__ngram_range=(2, 2), score=0.248, total=   0.3s
[CV] clf__alpha=1, tfidf__norm=l1, tfidf__use_idf=False, vect__ngram_range=(2, 2) 
[CV]  clf__alpha=1, tfidf__nor

[CV]  clf__alpha=1, tfidf__norm=l2, tfidf__use_idf=True, vect__ngram_range=(2, 2), score=0.421, total=   0.4s
[CV] clf__alpha=1, tfidf__norm=l2, tfidf__use_idf=False, vect__ngram_range=(1, 1) 
[CV]  clf__alpha=1, tfidf__norm=l2, tfidf__use_idf=False, vect__ngram_range=(1, 1), score=0.447, total=   0.2s
[CV] clf__alpha=1, tfidf__norm=l2, tfidf__use_idf=False, vect__ngram_range=(1, 1) 
[CV]  clf__alpha=1, tfidf__norm=l2, tfidf__use_idf=False, vect__ngram_range=(1, 1), score=0.496, total=   0.1s
[CV] clf__alpha=1, tfidf__norm=l2, tfidf__use_idf=False, vect__ngram_range=(1, 1) 
[CV]  clf__alpha=1, tfidf__norm=l2, tfidf__use_idf=False, vect__ngram_range=(1, 1), score=0.493, total=   0.1s
[CV] clf__alpha=1, tfidf__norm=l2, tfidf__use_idf=False, vect__ngram_range=(1, 1) 
[CV]  clf__alpha=1, tfidf__norm=l2, tfidf__use_idf=False, vect__ngram_range=(1, 1), score=0.457, total=   0.2s
[CV] clf__alpha=1, tfidf__norm=l2, tfidf__use_idf=False, vect__ngram_range=(1, 1) 
[CV]  clf__alpha=1, tfidf__norm

[CV]  clf__alpha=0.1, tfidf__norm=l1, tfidf__use_idf=True, vect__ngram_range=(1, 2), score=0.536, total=   0.5s
[CV] clf__alpha=0.1, tfidf__norm=l1, tfidf__use_idf=True, vect__ngram_range=(1, 2) 
[CV]  clf__alpha=0.1, tfidf__norm=l1, tfidf__use_idf=True, vect__ngram_range=(1, 2), score=0.457, total=   0.5s
[CV] clf__alpha=0.1, tfidf__norm=l1, tfidf__use_idf=True, vect__ngram_range=(1, 2) 
[CV]  clf__alpha=0.1, tfidf__norm=l1, tfidf__use_idf=True, vect__ngram_range=(1, 2), score=0.457, total=   0.5s
[CV] clf__alpha=0.1, tfidf__norm=l1, tfidf__use_idf=True, vect__ngram_range=(1, 2) 
[CV]  clf__alpha=0.1, tfidf__norm=l1, tfidf__use_idf=True, vect__ngram_range=(1, 2), score=0.479, total=   0.5s
[CV] clf__alpha=0.1, tfidf__norm=l1, tfidf__use_idf=True, vect__ngram_range=(1, 2) 
[CV]  clf__alpha=0.1, tfidf__norm=l1, tfidf__use_idf=True, vect__ngram_range=(1, 2), score=0.457, total=   0.5s
[CV] clf__alpha=0.1, tfidf__norm=l1, tfidf__use_idf=True, vect__ngram_range=(1, 2) 
[CV]  clf__alpha=0.1

[CV]  clf__alpha=0.1, tfidf__norm=l1, tfidf__use_idf=False, vect__ngram_range=(2, 2), score=0.357, total=   0.4s
[CV] clf__alpha=0.1, tfidf__norm=l1, tfidf__use_idf=False, vect__ngram_range=(2, 2) 
[CV]  clf__alpha=0.1, tfidf__norm=l1, tfidf__use_idf=False, vect__ngram_range=(2, 2), score=0.321, total=   0.3s
[CV] clf__alpha=0.1, tfidf__norm=l1, tfidf__use_idf=False, vect__ngram_range=(2, 2) 
[CV]  clf__alpha=0.1, tfidf__norm=l1, tfidf__use_idf=False, vect__ngram_range=(2, 2), score=0.321, total=   0.4s
[CV] clf__alpha=0.1, tfidf__norm=l1, tfidf__use_idf=False, vect__ngram_range=(2, 2) 
[CV]  clf__alpha=0.1, tfidf__norm=l1, tfidf__use_idf=False, vect__ngram_range=(2, 2), score=0.393, total=   0.3s
[CV] clf__alpha=0.1, tfidf__norm=l1, tfidf__use_idf=False, vect__ngram_range=(2, 2) 
[CV]  clf__alpha=0.1, tfidf__norm=l1, tfidf__use_idf=False, vect__ngram_range=(2, 2), score=0.364, total=   0.3s
[CV] clf__alpha=0.1, tfidf__norm=l1, tfidf__use_idf=False, vect__ngram_range=(2, 2) 
[CV]  clf_

[CV]  clf__alpha=0.1, tfidf__norm=l2, tfidf__use_idf=False, vect__ngram_range=(1, 1), score=0.536, total=   0.1s
[CV] clf__alpha=0.1, tfidf__norm=l2, tfidf__use_idf=False, vect__ngram_range=(1, 1) 
[CV]  clf__alpha=0.1, tfidf__norm=l2, tfidf__use_idf=False, vect__ngram_range=(1, 1), score=0.457, total=   0.1s
[CV] clf__alpha=0.1, tfidf__norm=l2, tfidf__use_idf=False, vect__ngram_range=(1, 1) 
[CV]  clf__alpha=0.1, tfidf__norm=l2, tfidf__use_idf=False, vect__ngram_range=(1, 1), score=0.457, total=   0.1s
[CV] clf__alpha=0.1, tfidf__norm=l2, tfidf__use_idf=False, vect__ngram_range=(1, 2) 
[CV]  clf__alpha=0.1, tfidf__norm=l2, tfidf__use_idf=False, vect__ngram_range=(1, 2), score=0.511, total=   0.5s
[CV] clf__alpha=0.1, tfidf__norm=l2, tfidf__use_idf=False, vect__ngram_range=(1, 2) 
[CV]  clf__alpha=0.1, tfidf__norm=l2, tfidf__use_idf=False, vect__ngram_range=(1, 2), score=0.504, total=   0.5s
[CV] clf__alpha=0.1, tfidf__norm=l2, tfidf__use_idf=False, vect__ngram_range=(1, 2) 
[CV]  clf_

[CV]  clf__alpha=0.01, tfidf__norm=l1, tfidf__use_idf=True, vect__ngram_range=(1, 2), score=0.521, total=   0.5s
[CV] clf__alpha=0.01, tfidf__norm=l1, tfidf__use_idf=True, vect__ngram_range=(2, 2) 
[CV]  clf__alpha=0.01, tfidf__norm=l1, tfidf__use_idf=True, vect__ngram_range=(2, 2), score=0.383, total=   0.3s
[CV] clf__alpha=0.01, tfidf__norm=l1, tfidf__use_idf=True, vect__ngram_range=(2, 2) 
[CV]  clf__alpha=0.01, tfidf__norm=l1, tfidf__use_idf=True, vect__ngram_range=(2, 2), score=0.426, total=   0.4s
[CV] clf__alpha=0.01, tfidf__norm=l1, tfidf__use_idf=True, vect__ngram_range=(2, 2) 
[CV]  clf__alpha=0.01, tfidf__norm=l1, tfidf__use_idf=True, vect__ngram_range=(2, 2), score=0.329, total=   0.4s
[CV] clf__alpha=0.01, tfidf__norm=l1, tfidf__use_idf=True, vect__ngram_range=(2, 2) 
[CV]  clf__alpha=0.01, tfidf__norm=l1, tfidf__use_idf=True, vect__ngram_range=(2, 2), score=0.350, total=   0.4s
[CV] clf__alpha=0.01, tfidf__norm=l1, tfidf__use_idf=True, vect__ngram_range=(2, 2) 
[CV]  clf_

[CV]  clf__alpha=0.01, tfidf__norm=l2, tfidf__use_idf=True, vect__ngram_range=(1, 1), score=0.504, total=   0.1s
[CV] clf__alpha=0.01, tfidf__norm=l2, tfidf__use_idf=True, vect__ngram_range=(1, 1) 
[CV]  clf__alpha=0.01, tfidf__norm=l2, tfidf__use_idf=True, vect__ngram_range=(1, 1), score=0.536, total=   0.1s
[CV] clf__alpha=0.01, tfidf__norm=l2, tfidf__use_idf=True, vect__ngram_range=(1, 1) 
[CV]  clf__alpha=0.01, tfidf__norm=l2, tfidf__use_idf=True, vect__ngram_range=(1, 1), score=0.543, total=   0.1s
[CV] clf__alpha=0.01, tfidf__norm=l2, tfidf__use_idf=True, vect__ngram_range=(1, 1) 
[CV]  clf__alpha=0.01, tfidf__norm=l2, tfidf__use_idf=True, vect__ngram_range=(1, 1), score=0.521, total=   0.1s
[CV] clf__alpha=0.01, tfidf__norm=l2, tfidf__use_idf=True, vect__ngram_range=(1, 1) 
[CV]  clf__alpha=0.01, tfidf__norm=l2, tfidf__use_idf=True, vect__ngram_range=(1, 1), score=0.471, total=   0.1s
[CV] clf__alpha=0.01, tfidf__norm=l2, tfidf__use_idf=True, vect__ngram_range=(1, 1) 
[CV]  clf_

[CV]  clf__alpha=0.01, tfidf__norm=l2, tfidf__use_idf=False, vect__ngram_range=(1, 2), score=0.550, total=   0.6s
[CV] clf__alpha=0.01, tfidf__norm=l2, tfidf__use_idf=False, vect__ngram_range=(1, 2) 
[CV]  clf__alpha=0.01, tfidf__norm=l2, tfidf__use_idf=False, vect__ngram_range=(1, 2), score=0.521, total=   0.5s
[CV] clf__alpha=0.01, tfidf__norm=l2, tfidf__use_idf=False, vect__ngram_range=(1, 2) 
[CV]  clf__alpha=0.01, tfidf__norm=l2, tfidf__use_idf=False, vect__ngram_range=(1, 2), score=0.479, total=   0.6s
[CV] clf__alpha=0.01, tfidf__norm=l2, tfidf__use_idf=False, vect__ngram_range=(1, 2) 
[CV]  clf__alpha=0.01, tfidf__norm=l2, tfidf__use_idf=False, vect__ngram_range=(1, 2), score=0.479, total=   0.5s
[CV] clf__alpha=0.01, tfidf__norm=l2, tfidf__use_idf=False, vect__ngram_range=(1, 2) 
[CV]  clf__alpha=0.01, tfidf__norm=l2, tfidf__use_idf=False, vect__ngram_range=(1, 2), score=0.536, total=   0.5s
[CV] clf__alpha=0.01, tfidf__norm=l2, tfidf__use_idf=False, vect__ngram_range=(1, 2) 


[Parallel(n_jobs=1)]: Done 360 out of 360 | elapsed:  2.1min finished


                    precision    recall  f1-score   support

               AMA     0.7826    0.9000    0.8372        20
          AskIndia     0.3125    0.6818    0.4286        22
  Business/Finance     0.6667    0.6400    0.6531        25
              Food     0.5455    0.6000    0.5714        20
     Non-Political     0.4000    0.1818    0.2500        22
       Photography     0.9048    0.6333    0.7451        30
    Policy/Economy     0.8333    0.3704    0.5128        27
          Politics     0.2821    0.7857    0.4151        14
Science/Technology     0.3333    0.3158    0.3243        19
            Sports     0.9000    0.8182    0.8571        22
     [R]eddiquette     0.8182    0.3333    0.4737        27

          accuracy                         0.5565       248
         macro avg     0.6163    0.5691    0.5517       248
      weighted avg     0.6481    0.5565    0.5615       248



In [84]:
# print best parameter after tuning 
print(clf.best_params_) 

# print how our model looks after hyper-parameter tuning 
print(clf.best_estimator_) 


{'clf__alpha': 0.01, 'tfidf__norm': 'l1', 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}
Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 2), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('tfidf',
                 TfidfTransformer(norm='l1', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('clf',
                 MultinomialNB(alpha=0.01, class_prior=Non

#### Redefining and Evaluation

In [101]:
def nb_classifier(X_train, X_test, y_train, y_test):
    
    nb_fit = Pipeline([('vect', CountVectorizer()),
                  ('tfidf', TfidfTransformer()),
                  ('clf', MultinomialNB(alpha=0.01)),
                 ])
    nb_fit.fit(X_train, y_train)    # Fit the pipeline on the training data
    
    # Making Predictions on the test data
    y_pred = nb_fit.predict(X_test)
    acc = accuracy_score(y_pred=y_pred, y_true=y_test)
    print("Model Accuracy: {}".format(acc))

In [102]:
print("Evaluate Naive Bayes Classifier")
nb_classifier(X_train, X_test, y_train, y_test)

Evaluate Naive Bayes Classifier
Model Accuracy: 0.5201612903225806
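
The function above carries over only `alpha=0.01` from the grid search. A minimal sketch, assuming the aim is to reuse every value reported in `best_params_`, could configure the pipeline as follows. The toy texts and labels are invented for illustration and are not taken from the Reddit data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Values copied from the best_params_ output printed above
best_params = {
    'clf__alpha': 0.01,
    'tfidf__norm': 'l1',
    'tfidf__use_idf': True,
    'vect__ngram_range': (1, 2),
}

# Same three steps as the tuned pipeline; set_params applies the tuned values
tuned_nb = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
tuned_nb.set_params(**best_params)

# Hypothetical toy data, only here so the sketch runs on its own
toy_texts = [
    'election results and the new minister',
    'parliament votes on the new bill',
    'cricket team wins the final match',
    'football league match tonight',
]
toy_labels = ['Politics', 'Politics', 'Sports', 'Sports']

tuned_nb.fit(toy_texts, toy_labels)
toy_prediction = tuned_nb.predict(['who won the cricket match'])
print(toy_prediction)
```

On the real data, `X_train` and `y_train` would replace the toy lists, and the fitted pipeline could be scored on `X_test` in the same way as `nb_classifier`.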


### Random Forest Model Tuning

In [43]:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]

# Number of features to consider at every split
max_features = ['auto', 'sqrt']

# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)

# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]

# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid which will include the parameters we will be testing
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

In [44]:
# Look at the parameter list
from pprint import pprint
pprint(random_grid)

{'bootstrap': [True, False],
 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None],
 'max_features': ['auto', 'sqrt'],
 'min_samples_leaf': [1, 2, 4],
 'min_samples_split': [2, 5, 10],
 'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}


In [45]:
''' TAKES TIME TO EXECUTE SO SHOULDN'T EXECUTE AGAIN
# Use the random grid to search for best hyperparameters

# First create the base model to tune
rf = RandomForestClassifier()

# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)

rf_random.fit(X_train_tfidf, y_train)
'''

" TAKES TIME TO EXECUTE SO SHOULDN'T EXECUTE AGAIN\n# Use the random grid to search for best hyperparameters\n\n# First create the base model to tune\nrf = RandomForestClassifier()\n\n# Random search of parameters, using 3 fold cross validation, \n# search across 100 different combinations, and use all available cores\nrf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)\n\nrf_random.fit(X_train_tfidf, y_train)\n"

#### Evaluation

In [46]:
# Function to evaluate Model performance
def evaluate(model, test_features, test_labels):
    predictions = model.predict(test_features)
    accuracy = accuracy_score(y_pred=predictions, y_true=test_labels)
    print('Model Performance')
    print('Accuracy = {:0.2f}%.'.format(accuracy * 100))  # accuracy_score returns a fraction in [0, 1]
    
    return accuracy

In [47]:
# # Preparing test data
# X_test_counts = count_vec.transform(X_test)
# X_test_tfidf = tfidf_trans.transform(X_test_counts)
# base_model = RandomForestClassifier(n_estimators = 10, random_state = 42)
# base_model.fit(X_train_tfidf, y_train)
# base_accuracy = evaluate(base_model, X_test_tfidf , y_test)

In [48]:
# best_random = rf_random.best_estimator_
# random_accuracy = evaluate(best_random, X_test_tfidf , y_test)

I still get 52% accuracy, so I will go for an SVC model instead.

## Saving the model for Deployment
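Joblib provides utilities for pipelining Python jobs, including efficient persistence of objects that hold large NumPy arrays, which makes it a convenient way to save a fitted scikit-learn pipeline.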



In [104]:
nb_classifier = Pipeline([('vect', CountVectorizer()),
                  ('tfidf', TfidfTransformer()),
                  ('model', MultinomialNB(alpha=0.01)),
                 ])

nb_classifier.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('model',
                 MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True))],
         verbose=False)

In [106]:
import joblib 

filename = 'final_model.sav'
joblib.dump(nb_classifier, filename)

['final_model.sav']
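
For deployment, the saved file has to be loaded back before it can serve predictions. The sketch below assumes a small stand-in pipeline trained on invented toy data and writes to a temporary directory so that it does not overwrite the real `final_model.sav`.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for X_train / y_train
toy_texts = [
    'ask me anything about my job',
    'best biryani recipe with rice',
    'cricket team wins the match',
    'ama with a train driver, ask away',
    'street food and biryani in the city',
    'football match highlights',
]
toy_labels = ['AMA', 'Food', 'Sports', 'AMA', 'Food', 'Sports']

model = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('model', MultinomialNB(alpha=0.01)),
]).fit(toy_texts, toy_labels)

# Save to a temporary location, then load it back as a deployment script would
path = os.path.join(tempfile.mkdtemp(), 'final_model.sav')
joblib.dump(model, path)
loaded = joblib.load(path)

# The loaded pipeline accepts raw text because the vectorizer is stored with it
original_pred = model.predict(toy_texts)
loaded_pred = loaded.predict(toy_texts)
print(loaded.predict(['ask me about cricket']))
```

Saving the whole pipeline rather than only the classifier keeps the fitted vocabulary and IDF weights together with the model, so the serving code does not need to rebuild them.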

## Performance explanation

I have identified several reasons for the model's performance. Some stem from the implementation, the model type, and its parameters, while others can be attributed to data quality.

### Data Quality

Data quality is one of the biggest issues with the current model. Some of the problems I came across were:
* Unigrams failing for the Politics flair. When I redownloaded my data on 22 April, the Arnab and Sonia Gandhi incident had just happened, and posts about it included words like car, stones, and Italy. These words are not generally associated with politics, but because of this incident they came out on top and distorted the analysis.
* Secondly, I scrolled through Reddit for hours to understand the data and thread patterns. One issue I noticed was that flairs like Non-Political often have clearly political tones and contain words like BJP, Hindu, and Muslim. This results in an inaccurate representation of the flair. AMA and AskIndia have similar issues.
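
The effect described in the first point can be checked by listing the unigrams most associated with each flair using a chi-squared test. The sketch below is a minimal illustration on invented toy titles. The helper name `top_terms_per_flair` is hypothetical, and on the real data `data['Title']` and `data['Flair']` would take the place of the toy arrays.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2

# Hypothetical toy titles and flairs
toy_titles = [
    'minister wins election vote',
    'election rally vote count',
    'cricket match score update',
    'cricket team wins match',
    'biryani recipe with rice',
    'best biryani rice shop',
]
toy_flairs = np.array(['Politics', 'Politics', 'Sports',
                       'Sports', 'Food', 'Food'])

vectorizer = TfidfVectorizer(ngram_range=(1, 1))
features = vectorizer.fit_transform(toy_titles)

# Terms ordered by column index, built from the fitted vocabulary
terms = np.array(sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get))


def top_terms_per_flair(features, flairs, terms, n=3):
    """Return the n terms with the highest chi-squared score for each flair."""
    result = {}
    for flair in np.unique(flairs):
        # One-vs-rest test: this flair against all the others
        scores, _ = chi2(features, flairs == flair)
        order = np.argsort(scores)[::-1][:n]
        result[flair] = terms[order].tolist()
    return result


top_terms = top_terms_per_flair(features, toy_flairs, terms, n=2)
for flair, words in top_terms.items():
    print(flair, words)
```

Note that a high chi-squared score only says a term is unevenly distributed between a flair and the rest, so event-driven words such as the ones mentioned above would surface here if they dominate a flair's posts.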

### Feature Selection

* I think the analysis could be improved by exploring the URL feature further. On posts that do not have a body, the URL points to the additional content. If I can incorporate this data, my features could become richer.

* Secondly, my data is arranged so that there are 150 posts of each flair, one flair after the other. When I randomly sample it into training and testing sets, the flairs end up unevenly represented in the training data, which weakens the analysis.
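
One way to keep every flair equally represented in both splits is stratified sampling, which `train_test_split` supports through its `stratify` argument. The sketch below assumes a small invented frame that mimics the one-flair-after-the-other layout. The column names match the notebook, but the rows are placeholders.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame: 10 posts per flair, stored one flair after the other
flairs = ['AMA', 'Food', 'Sports']
toy = pd.DataFrame({
    'Title': ['post {}'.format(i) for i in range(30)],
    'Flair': [flair for flair in flairs for _ in range(10)],
})

# stratify keeps the flair proportions identical in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    toy['Title'], toy['Flair'],
    test_size=0.3, stratify=toy['Flair'], random_state=42)

train_counts = y_train.value_counts().to_dict()
test_counts = y_test.value_counts().to_dict()
print(train_counts)
print(test_counts)
```

With the real data, passing `stratify=data['Flair']` would have the same effect on the split.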

### Model Selection

Naive Bayes is a simple but useful technique for text classification. It estimates class probabilities from word frequencies, so the issues described above hurt the model's performance because they distort those frequencies.
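
To make the dependence on word frequencies concrete, the sketch below fits `MultinomialNB` on raw counts from four invented titles and reproduces its smoothed per-class word probabilities by hand. It uses plain counts rather than the TF-IDF weights used in the notebook, purely to keep the arithmetic visible.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: the first two titles are Politics, the last two Sports
docs = ['election vote minister', 'vote car stones',
        'match cricket score', 'cricket team win']
labels = ['Politics', 'Politics', 'Sports', 'Sports']
alpha = 0.01

vect = CountVectorizer()
dense = vect.fit_transform(docs).toarray()
vocab = sorted(vect.vocabulary_, key=vect.vocabulary_.get)

nb = MultinomialNB(alpha=alpha).fit(dense, labels)
politics = list(nb.classes_).index('Politics')
sports = list(nb.classes_).index('Sports')

# Smoothed estimate: (count of word in class + alpha) / (total count in class + alpha * |V|)
word_counts = dense[:2].sum(axis=0)
manual = (word_counts + alpha) / (word_counts.sum() + alpha * len(vocab))
fitted = np.exp(nb.feature_log_prob_[politics])

# A single incident-driven title is enough to tie 'car' to Politics
car = vocab.index('car')
p_car_politics = np.exp(nb.feature_log_prob_[politics, car])
p_car_sports = np.exp(nb.feature_log_prob_[sports, car])
print(p_car_politics, p_car_sports)
```

Because the estimate is a ratio of counts, a burst of posts about one event shifts the probabilities for every word those posts contain.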