PART 3 - Building a classifier

In [None]:
from datetime import datetime
from tqdm import tqdm
import praw
import pandas as pd
import nltk
from nltk.corpus import stopwords
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics import accuracy_score, classification_report

Reading CSV data

In [2]:
dataf=pd.read_csv('reddit_data.csv')
dataf #the loaded dataset

Unnamed: 0,flair,title,score,upvote_ratio,url,author,locked,orig_content,text,comms_num,timestamp,comments
0,Politics,A polite request to all Indians here,396,0.96,https://www.reddit.com/r/india/comments/g2ct57...,aaluinsonaout,False,False,I don't know if it is the same situation in ot...,82,2020-04-16 16:27:46,Our society thrives on abuse of power. We let...
1,Politics,Pitting a community against a political party ...,196,0.80,https://www.reddit.com/r/india/comments/futac9...,chillinvillain122,False,False,First of all let me start by saying it was stu...,73,2020-04-04 18:28:28,Our country is just too far in at the moment ...
2,Politics,A new political party gave a full front page a...,730,0.97,https://i.redd.it/yjo9wpy38el41.jpg,aaluinsonaout,False,False,,146,2020-03-08 12:06:11,This looks like an IIPM ad 1. Where did they ...
3,Politics,Hit by backlash over posts on lack of medical ...,407,0.97,https://theprint.in/india/hit-by-backlash-over...,hipporama,False,False,,67,2020-03-26 17:47:25,"Well, Some people really deserve to die. ~~/s..."
4,Politics,Politics in the time of corona: WB CM question...,85,0.87,https://www.timesnownews.com/india/article/pol...,ConcernedCitizen034,False,False,,22,2020-04-09 18:33:54,"Oh FFS. \n\nYellow, Orange, Green, Red, all a..."
...,...,...,...,...,...,...,...,...,...,...,...,...
2306,Coronavirus,Covid-19: Kamal Nath says lockdown was delayed...,439,0.86,https://scroll.in/latest/958962/covid-19-kamal...,Ib90,False,False,,38,2020-04-13 09:26:58,*I has biggest IQ in the whole MP* - Probably...
2307,Coronavirus,"Coronavirus Pandemic: Claps, Candles And Diya ...",19,0.78,https://www.inventiva.co.in/stories/nandini/co...,hauntin,False,False,,1,2020-04-11 20:13:25,
2308,Coronavirus,"Contrary to a news report, Aadtiya Thackeray h...",25,0.84,https://www.reddit.com/r/india/comments/g159jt...,proyo7,False,False,"On 7th April, The New Indian Express reported:...",2,2020-04-14 18:41:28,"Ummm if there is no community transmission, h..."
2309,Coronavirus,"Coronavirus Outbreak: A database of books, per...",3,0.71,https://www.firstpost.com/long-reads/coronavir...,Lister971191,False,False,,0,2020-04-12 07:43:10,


Text Preprocessing

We need to preprocess the text in the comments, title, and text features.

In [3]:
#https://stackoverflow.com/questions/54396405/how-can-i-preprocess-nlp-text-lowercase-remove-special-characters-remove-numb
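# compile the patterns and build the stop-word set once, not on every call
interval_char=re.compile(r'[/(){}\[\]\|@,;]')
special_char=re.compile(r'[^0-9a-z #+_]')
stop_words=set(stopwords.words('english')) # needs nltk.download('stopwords') once
def preprocess_text(text):
    text=str(text) # note: missing values (NaN) become the string 'nan'
    text=text.lower()# make the text lowercase
    text=interval_char.sub(' ',text) # replace separator characters with spaces
    text=special_char.sub('', text) # drop the remaining special characters
    words=text.split()
    text = ' '.join(i for i in words if i not in stop_words)
    return text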
#applying the preprocessing function to the 3 features
dataf['title'] = dataf['title'].apply(preprocess_text)
dataf['text'] = dataf['text'].apply(preprocess_text)
dataf['comments'] = dataf['comments'].apply(preprocess_text)
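
To make the effect of the two regular expressions concrete, the sketch below applies the same steps to one sample sentence. The `clean` helper and the two-word stop list are hypothetical stand-ins so the example runs without downloading the NLTK corpus; the cell above uses the full English stop-word list.

```python
import re

# same patterns as in preprocess_text
interval_char = re.compile(r'[/(){}\[\]\|@,;]')
special_char = re.compile(r'[^0-9a-z #+_]')

def clean(text, stop_words):
    """Hypothetical helper mirroring preprocess_text with an explicit stop list."""
    text = str(text).lower()
    text = interval_char.sub(' ', text)   # separators become spaces
    text = special_char.sub('', text)     # other special characters are dropped
    return ' '.join(w for w in text.split() if w not in stop_words)

sample = "The price is 50% off; see r/india [link]!"
cleaned = clean(sample, stop_words={"the", "is"})
print(cleaned)  # price 50 off see r india link
```

Note that `/` is treated as a separator, so `r/india` is split into two tokens.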

In [5]:
title=dataf["title"]
comments=dataf["comments"]
text=dataf['text']

Vectorizing the textual features

We make 3 separate vectorizers, one for each of the 3 features

In [6]:
#https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
count_vect_title = CountVectorizer()
count_vect_text = CountVectorizer()
count_vect_comments = CountVectorizer()
title_token = count_vect_title.fit_transform(title)
text_token = count_vect_text.fit_transform(text)
comments_token= count_vect_comments.fit_transform(comments)
print(comments_token.shape,
text_token.shape,
title_token.shape)

(2311, 62288) (2311, 34082) (2311, 6600)
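
Each shape above reads as (number of posts, vocabulary size for that feature). A minimal sketch on a toy corpus (two made-up sentences, not rows from the dataset) shows how `CountVectorizer` produces such a matrix:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["covid lockdown news", "cricket match news"]  # toy corpus

vec = CountVectorizer()
counts = vec.fit_transform(docs)   # sparse matrix, one row per document

vocab = sorted(vec.vocabulary_)    # columns follow alphabetical order
print(vocab)                       # ['covid', 'cricket', 'lockdown', 'match', 'news']
print(counts.shape)                # (2, 5)
print(counts.toarray())
```

The word `news` appears in both documents, so its column holds a 1 in both rows.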


After vectorizing, we apply TF-IDF weighting to the counts to get the features for training the model

In [7]:
tfidf_transformer_title = TfidfTransformer()
tfidf_transformer_text = TfidfTransformer()
tfidf_transformer_comments = TfidfTransformer()

TITLE = tfidf_transformer_title.fit_transform(title_token)
TEXT = tfidf_transformer_text.fit_transform(text_token)
COMMENTS = tfidf_transformer_comments.fit_transform(comments_token)
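
As a side note, scikit-learn also offers `TfidfVectorizer`, which combines the counting and weighting steps. The sketch below, on a toy corpus, checks that both routes give the same matrix with default settings and that each row is scaled to unit length. It is only an illustration; the notebook keeps the two-step route, and the saved files later on depend on it.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

docs = ["covid lockdown news", "cricket match news", "lockdown news update"]  # toy corpus

# two-step route, as in this notebook
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# one-step route
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
row_norms = np.linalg.norm(two_step.toarray(), axis=1)  # default norm is L2
print(same, row_norms)
```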

In [8]:
y=dataf["flair"] #labels
TITLE=pd.DataFrame(TITLE.toarray())
TEXT=pd.DataFrame(TEXT.toarray())
COMMENTS=pd.DataFrame(COMMENTS.toarray())

X=pd.concat([TITLE,TEXT,COMMENTS,dataf['score'],dataf['upvote_ratio'],
             dataf['locked'],dataf['comms_num']],axis=1)
# X = features

Our final X is (Title, Text, Comments, Score, Upvote_Ratio, Locked, comms_num)
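
Converting the TF-IDF matrices with `.toarray()` yields a dense table of 2311 rows by 102,974 columns (62,288 + 34,082 + 6,600 + 4), roughly 1.9 GB as 64-bit floats. One possible alternative, sketched here on a toy frame and not what the notebook does, is to keep everything sparse and stack the blocks with `scipy.sparse.hstack`. The toy values and the single text column are assumptions for illustration.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# toy stand-in for dataf; column names follow the notebook
toy = pd.DataFrame({
    "title": ["covid lockdown news", "cricket match news"],
    "score": [396, 19],
    "upvote_ratio": [0.96, 0.78],
    "locked": [False, False],
    "comms_num": [82, 1],
})

title_tfidf = TfidfVectorizer().fit_transform(toy["title"])  # stays sparse
numeric = csr_matrix(
    toy[["score", "upvote_ratio", "locked", "comms_num"]].astype(float).to_numpy()
)

# text columns first, numeric columns last, all in one sparse matrix
X_sparse = hstack([title_tfidf, numeric]).tocsr()
print(X_sparse.shape)  # (2, 9)
```

The three estimators tried below accept sparse input, so a matrix built this way could be passed to `train_test_split` and `fit` directly.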

We split the data into 67% Train and 33% Test

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
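
The flairs are not equally frequent, so a plain random split can shift the class proportions slightly between train and test. A hedged variant is to pass `stratify=y`, which keeps the proportions close to equal in both parts. The sketch uses made-up labels, not the real flairs:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# toy data: three classes of 100 samples each
X_toy = [[i] for i in range(300)]
y_toy = ["a"] * 100 + ["b"] * 100 + ["c"] * 100

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42, stratify=y_toy
)

test_counts = Counter(y_te)  # per-class counts in the test part
print(len(X_tr), len(X_te), test_counts)
```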

We test a variety of ML models and choose the one with the best results

Random Forest Classifier

In [10]:
from sklearn.ensemble import RandomForestClassifier
clf=RandomForestClassifier(n_estimators = 1000, random_state = 42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print('accuracy',accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

accuracy 0.8152031454783748
                    precision    recall  f1-score   support

          AskIndia       0.81      0.82      0.81        76
  Business/Finance       0.81      0.77      0.79        61
       CAA-NRC-NPR       0.97      0.85      0.91        40
       Coronavirus       0.81      0.89      0.84        61
              Food       0.97      0.87      0.92        77
     Non_Political       0.87      0.88      0.88        76
       Photography       0.89      0.84      0.87        58
    Policy/Economy       0.73      0.71      0.72        68
          Politics       0.69      0.81      0.74        63
         Scheduled       0.74      0.67      0.70        60
Science/Technology       0.73      0.92      0.81        61
            Sports       0.87      0.76      0.81        62

          accuracy                           0.82       763
         macro avg       0.82      0.81      0.82       763
      weighted avg       0.82      0.82      0.82       763



Our classifier has an overall accuracy of 81.5%. It is best at classifying the flairs 'CAA-NRC-NPR' and 'Food', with high f1-scores of 0.91 and 0.92 respectively. It is least effective at classifying 'Policy/Economy' and 'Scheduled', with f1-scores of 0.72 and 0.70 respectively.
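
The f1-score is the harmonic mean of precision and recall, so the values in the report can be rechecked by hand. The helper name below is hypothetical; the inputs are copied from the random forest report above and are already rounded, so the last digit could differ from the report in other cases.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# precision and recall taken from the random forest report
caa = f1_from(0.97, 0.85)        # CAA-NRC-NPR
food = f1_from(0.97, 0.87)       # Food
scheduled = f1_from(0.74, 0.67)  # Scheduled
print(round(caa, 2), round(food, 2), round(scheduled, 2))  # 0.91 0.92 0.7
```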

MLP classifier

In [11]:
from sklearn.neural_network import MLPClassifier
mlp=MLPClassifier(hidden_layer_sizes=(30,30,30))
mlp.fit(X_train, y_train)

y_pred1 = mlp.predict(X_test)

print('accuracy',accuracy_score(y_test, y_pred1))
print(classification_report(y_test, y_pred1))

accuracy 0.5504587155963303
                    precision    recall  f1-score   support

          AskIndia       0.51      0.67      0.58        76
  Business/Finance       0.70      0.46      0.55        61
       CAA-NRC-NPR       0.73      0.68      0.70        40
       Coronavirus       0.78      0.51      0.61        61
              Food       0.55      0.69      0.61        77
     Non_Political       0.69      0.66      0.68        76
       Photography       0.77      0.52      0.62        58
    Policy/Economy       0.55      0.54      0.55        68
          Politics       0.37      0.73      0.49        63
         Scheduled       0.46      0.20      0.28        60
Science/Technology       0.41      0.69      0.51        61
            Sports       0.81      0.21      0.33        62

          accuracy                           0.55       763
         macro avg       0.61      0.55      0.54       763
      weighted avg       0.60      0.55      0.54       763



The MLP classifier gives an accuracy of 55%. It performs best on the CAA-NRC-NPR flair and worst on Scheduled, with f1-scores of 0.70 and 0.28 respectively.

Naive Bayes Classifier

In [13]:
from sklearn.naive_bayes import MultinomialNB
NaiveBayes=MultinomialNB()
NaiveBayes.fit(X_train, y_train)

y_pred2 = NaiveBayes.predict(X_test)

print('accuracy',accuracy_score(y_test, y_pred2))
print(classification_report(y_test, y_pred2))

accuracy 0.1913499344692005
                    precision    recall  f1-score   support

          AskIndia       0.21      0.82      0.34        76
  Business/Finance       0.20      0.02      0.03        61
       CAA-NRC-NPR       0.00      0.00      0.00        40
       Coronavirus       0.00      0.00      0.00        61
              Food       0.67      0.10      0.18        77
     Non_Political       0.00      0.00      0.00        76
       Photography       0.86      0.21      0.33        58
    Policy/Economy       0.50      0.06      0.11        68
          Politics       0.11      0.68      0.19        63
         Scheduled       0.00      0.00      0.00        60
Science/Technology       0.39      0.25      0.30        61
            Sports       1.00      0.02      0.03        62

          accuracy                           0.19       763
         macro avg       0.33      0.18      0.13       763
      weighted avg       0.34      0.19      0.13       763





The Naive Bayes classifier performs poorly, with an accuracy of 19%. By f1-score it performs best on AskIndia (0.34) and worst on Coronavirus, Scheduled, Non_Political and CAA-NRC-NPR, which all score 0.00.
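
A row of 0.00 for both precision and recall can mean that the class was never predicted at all. In that case precision is undefined (0 divided by 0), and scikit-learn reports it as 0 and emits a warning. The sketch below, on made-up labels, shows the `zero_division` parameter of `classification_report`, which makes that choice explicit:

```python
from sklearn.metrics import classification_report

y_true = ["Food", "Sports", "Politics", "Food"]
y_hat = ["Food", "Food", "Politics", "Food"]   # 'Sports' is never predicted

report = classification_report(
    y_true, y_hat, zero_division=0, output_dict=True
)
print(report["Sports"])  # precision, recall and f1-score are all 0 for this class
```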

After feature extraction, our dataset has about 103,000 features (almost all of them encoded text: 62,288 from comments, 34,082 from text and 6,600 from titles) and 2311 data points. The text features are sparse TF-IDF weights, most of them zero for any given post, so we can perform classification without dimensionality reduction. Had the features been more numerous or denser, we might have to explore dimensionality reduction such as PCA to reduce computation time.
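
If dimensionality reduction were needed, one option that works directly on sparse matrices is `TruncatedSVD`, used here in place of PCA. This is only a sketch: the random matrix stands in for the TF-IDF features, and the choice of 10 components is arbitrary.

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# random sparse matrix standing in for the TF-IDF features
X_toy = sparse_random(50, 200, density=0.05, format="csr", random_state=0)

svd = TruncatedSVD(n_components=10, random_state=0)
X_reduced = svd.fit_transform(X_toy)   # dense array with 10 columns
print(X_reduced.shape)                 # (50, 10)
```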

After training multiple models, we see that the random forest classifier performs best, with 81.5% accuracy against 55% for the MLP and 19% for Naive Bayes.

Hence, we save the random forest classifier together with the CountVectorizer and TF-IDF objects for building our app

In [None]:
import pickle
# use a context manager so the file is closed after writing
with open('random_forest.p', 'wb') as f:
    pickle.dump(clf, f)

In [None]:
# save each fitted vectorizer and transformer under its own file name
artifacts = {'CountVectorizer_title.p': count_vect_title,
             'CountVectorizer_text.p': count_vect_text,
             'CountVectorizer_comments.p': count_vect_comments,
             'TFIDF_title.p': tfidf_transformer_title,
             'TFIDF_text.p': tfidf_transformer_text,
             'TFIDF_comments.p': tfidf_transformer_comments}
for filename, obj in artifacts.items():
    with open(filename, 'wb') as f:
        pickle.dump(obj, f)
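
On the app side, the saved objects have to be loaded and applied with `transform`, not `fit_transform`, so that the vocabulary and IDF weights stay exactly as fitted here. The sketch below shows the round trip on a toy model with a single text feature; the training sentences are made up, and the objects are pickled in memory instead of to the `.p` files.

```python
import pickle
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# toy training data, not from the dataset
titles = ["covid lockdown news", "cricket match score",
          "covid vaccine update", "cricket world cup"]
flairs = ["Coronavirus", "Sports", "Coronavirus", "Sports"]

vect = CountVectorizer()
tfidf = TfidfTransformer()
model = RandomForestClassifier(n_estimators=10, random_state=42)
model.fit(tfidf.fit_transform(vect.fit_transform(titles)), flairs)

# round trip through pickle, as the app would do after loading the files
vect2, tfidf2, model2 = pickle.loads(pickle.dumps((vect, tfidf, model)))

new_title = ["cricket match today"]
# transform only: unseen words such as 'today' are ignored
features = tfidf2.transform(vect2.transform(new_title))
prediction = model2.predict(features)
print(prediction)
```

In the full app, all three text features and the four numeric columns would have to be assembled in the same column order as `X` before calling `predict`.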