<h1><center> Mini-Project 2 - Reddit discussions classification </center></h1>

<div style="text-align: right"> Amen Memmi</div>
<div style="text-align: right"> amen.memmi@mail.mcgill.ca</div>
<div style="text-align: right">  ID: 260755070</div>

This notebook is prepared in the framework of Mini Project 2, Applied Machine Learning course, McGill University.


The goal is to devise a machine learning algorithm to analyze short conversations extracted from the Reddit
website, and automatically classify them according to their topics, which include _hockey, movies, nba,
news, nfl, politics, soccer and worldnews_.


Provided data is as follows: 
 - The training set consists of approximately 160,000 conversations. 
 - The test set consists of approximately 50,000 conversations. 
 - Only training labels are provided.

Let's start by loading the necessary libraries.

In [1]:
import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split

#### Import the data

In [2]:
data = pd.read_csv('train_input.csv')
data['category'] =  pd.read_csv('train_output.csv')['category']
data = data.drop(columns=['id'])
data.tail()

Unnamed: 0,conversation,category
164995,"<speaker_1> 2015 nfl draft "" i told you so "" t...",nfl
164996,<speaker_1> pk subban on lundqvist 's <number>...,hockey
164997,<speaker_1> kyrie irving and kevin love had a ...,nba
164998,<speaker_1> miroslav klose has the broken the ...,soccer
164999,<speaker_1> attorney charged with having sex w...,news


The following function cleans the conversations: it strips the markup tags, separates contractions and punctuation from words, and lowercases the text.

In [3]:
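def clean_str(s):
    """Clean a conversation: strip markup tags, pad punctuation, and lowercase."""
    # Replace markup tags and any character outside the allowed set with a space
    for expr in [r"</d>", r"</s>", r"<speaker_1>", r"<speaker_2>", r"[^A-Za-z0-9(),!?\'\`]"]:
        s = re.sub(expr, " ", s)
    # Detach contraction suffixes from the preceding word ("it's" -> "it 's")
    for suffix in ["'s", "'ve", "'t", "'re", "'d", "'ll"]:
        s = s.replace(suffix, " " + suffix)
    # Surround punctuation marks with spaces so they become separate tokens
    for mark in [",", "!", "(", ")", "?"]:
        s = s.replace(mark, " " + mark + " ")
    s = re.sub(r"\s{2,}", " ", s)                      # collapse runs of whitespace
    s = re.sub(r'\S*(x{2,}|X{2,})\S*', "xxx", s)       # mask any token containing "xx"
    s = re.sub(r'[^\x00-\x7F]+', "", s)                # drop any remaining non-ASCII characters
    return s.strip().lower()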

#### Clean the conversations

In [4]:
data["conversation"] = data["conversation"].apply(clean_str)

#### Split the data into train and test sets:

In [5]:
X_train, X_test, y_train, y_test = train_test_split(data.conversation, data.category, test_size=0.2, random_state=42)

## Hard coded Naive Bayes classifier

Here I implement the Naive Bayes classifier directly from Bayes' rule. First we compute the class priors and the per-class token likelihoods.
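The quantities computed below follow the usual multinomial Naive Bayes formulation with add-one smoothing. As a sketch, in notation that is not part of the original notebook ($N_{c,w}$ is the count of token $w$ in class $c$, $V$ is the vocabulary, and $d$ is the set of distinct tokens in a conversation):

$$
P(w \mid c) = \frac{N_{c,w} + 1}{\sum_{w' \in V} \left(N_{c,w'} + 1\right)}, \qquad
\hat{y}(d) = \arg\max_{c} \Big[ \log P(c) + \sum_{w \in d} \log P(w \mid c) \Big]
$$

Working in log space avoids the numerical underflow that a product of many small probabilities would cause.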

In [65]:
%%time
# Extracting features from conversations
count_vect = CountVectorizer()            # Convert the conversations to a matrix of token counts
X_train_counts = count_vect.fit_transform(X_train)
classes = data['category'].unique()
prior = np.zeros(len(classes))
likelihood = np.zeros((len(classes), X_train_counts.shape[1]))
for c in range(len(classes)):
    # NOTE: rows are selected from the full `data` frame, which still contains the
    # held-out X_test rows, so accuracy measured on X_test below is optimistic.
    # Selecting X_train[y_train == classes[c]] instead keeps the split clean.
    extracted_class = data[data["category"] == classes[c]].conversation
    # Prior of each class (unnormalized: class size in `data` over the number of training rows)
    prior[c] = len(extracted_class) / X_train_counts.shape[0]
    # Total count of each token over the conversations of this class, as a 1-D array
    aux = np.asarray(count_vect.transform(extracted_class).sum(axis=0))[0]
    # Likelihood of each token given the class; the +1 is Laplace smoothing
    likelihood[c] = (aux + 1) / (aux + 1).sum()

Wall time: 47.6 s
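The same estimates can also be computed from the training split alone, so that no held-out conversation contributes to the counts. The following is a minimal sketch on toy data, not the notebook's own code: `X_tr`, `y_tr`, `log_prior` and `log_likelihood` are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for X_train / y_train (hypothetical data, not from the project)
X_tr = pd.Series(["goal scored in the final",
                  "puck hits the goal post",
                  "late goal wins the cup final"])
y_tr = pd.Series(["soccer", "hockey", "soccer"])

vect = CountVectorizer()
counts = vect.fit_transform(X_tr)             # sparse matrix, shape (n_docs, n_features)
labels = np.unique(y_tr)                      # sorted class labels

log_prior = np.zeros(len(labels))
log_likelihood = np.zeros((len(labels), counts.shape[1]))
for i, label in enumerate(labels):
    mask = (y_tr == label).to_numpy()         # training rows belonging to this class
    log_prior[i] = np.log(mask.mean())        # class frequency within the training split
    rows = np.flatnonzero(mask)
    token_totals = np.asarray(counts[rows].sum(axis=0)).ravel()
    smoothed = token_totals + 1               # Laplace (add-one) smoothing
    log_likelihood[i] = np.log(smoothed / smoothed.sum())
```

Because every quantity is derived from `X_tr` and `y_tr` only, the priors sum to one and a later score on a held-out split is not influenced by that split.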


Now we predict the class of each held-out conversation:

In [75]:
%%time
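test_count = count_vect.transform(X_test)
prediction = []
for conv_index in range(test_count.shape[0]):
    score = np.zeros(len(classes))
    # Column indices of the distinct tokens present in this conversation
    token_ids = test_count[conv_index].indices
    for c in range(len(classes)):
        # Log-posterior up to a constant: log prior + log-likelihoods of the tokens present
        # (each distinct token counts once, regardless of how often it occurs)
        score[c] = np.log(prior[c]) + np.sum(np.log(likelihood[c, token_ids]))
    prediction.append(classes[np.argmax(score)])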


Wall time: 21.3 s
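The per-conversation loop above can, in principle, be written as a single sparse matrix product, which is usually much faster. This is a sketch under assumptions: the toy `log_prior`, `log_likelihood`, `labels` and `test_counts` below are hypothetical values, not the notebook's fitted model.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy model: 2 classes, 4 vocabulary entries
log_prior = np.log(np.array([0.5, 0.5]))
log_likelihood = np.log(np.array([[0.4, 0.3, 0.2, 0.1],
                                  [0.1, 0.2, 0.3, 0.4]]))
labels = np.array(["hockey", "soccer"])

# Token counts for 3 documents
test_counts = csr_matrix(np.array([[2, 1, 0, 0],
                                   [0, 0, 1, 3],
                                   [1, 0, 1, 0]]))

# The loop sums log-likelihoods over the distinct tokens of each document,
# so counts are reduced to presence/absence first
presence = (test_counts > 0).astype(np.float64)

# (n_docs, n_features) @ (n_features, n_classes) -> (n_docs, n_classes)
scores = presence @ log_likelihood.T + log_prior
predicted = labels[np.argmax(scores, axis=1)]
print(predicted)
```

Dropping the `> 0` step and multiplying by `test_counts` directly would weight each token by its count, which is the standard multinomial scoring rather than the presence-based variant used in the loop.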


In [78]:
# Performance of the hard coded NB Classifier
np.mean(prediction == y_test)

0.9318787878787879

## Naive Bayes prediction using MultinomialNB from scikit-learn

In [34]:
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

Here I am creating a pipeline that chains **CountVectorizer** (to convert the conversations to a matrix of token counts), **TfidfTransformer** (to transform the count matrix into a normalized _tf-idf_ [term frequency times inverse document frequency] representation) and the **MultinomialNB** classifier.

In [35]:
text_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', MultinomialNB())])
# the step names ('vect', 'tfidf', 'clf') are arbitrary but will be used in the grid search later
text_clf = text_clf.fit(X_train, y_train)

In [36]:
# Performance of NB Classifier
predicted = text_clf.predict(X_test)
np.mean(predicted == y_test)

0.9055151515151515

### Grid Search


Here, I'm creating a dictionary of parameters to tune. <br>
Each key is the name of a pipeline step, followed by a double underscore and the name of one of that step's parameters (remember the arbitrary step names I gave above).

**E.g.**:
- vect__ngram_range: compare unigrams only, _ngram_range=(1,1)_, against unigrams plus bigrams, _ngram_range=(1,2)_, and keep whichever scores best.<br>
- tfidf__use_idf: whether to apply inverse-document-frequency reweighting (with False, only normalized term frequencies are used)
- clf__alpha: the additive (Laplace/Lidstone) smoothing parameter of the classifier
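To check which names are valid keys for the grid, the pipeline can list its own parameters. A small sketch, assuming the same three steps as above (`pipe` and `tunable` are names introduced here for illustration):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('clf', MultinomialNB())])

# Every tunable name has the form "<step name>__<parameter name>"
tunable = sorted(name for name in pipe.get_params() if '__' in name)

# Show the current (default) value of the three parameters tuned in this notebook
for name in ('vect__ngram_range', 'tfidf__use_idf', 'clf__alpha'):
    print(name, '=', pipe.get_params()[name])
```

A key that does not appear in `tunable` makes the grid search fail at fit time, so this check is a cheap way to catch misspelled names.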

In [81]:
from sklearn.model_selection import GridSearchCV
parameters = {'vect__ngram_range': [(1, 1),(1, 2)], 'tfidf__use_idf': (True, False), 'clf__alpha': (1e-1, 1e-2)}

Next, I create the grid search instance by passing the pipeline, the parameter grid and n_jobs=-1, which tells scikit-learn to use all available CPU cores.

In [82]:
%%time
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
gs_clf = gs_clf.fit(X_train, y_train)

Wall time: 3min 19s


Let's see the best score and the corresponding parameters

In [84]:
print(gs_clf.best_score_)
print(gs_clf.best_params_)

0.9482272727272727
{'clf__alpha': 0.01, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}


The best mean cross-validated accuracy (about 0.948) is obtained with unigrams plus bigrams, idf weighting and alpha=0.01.
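Beyond the single best combination, the fitted search object keeps the score of every combination in `cv_results_`, which helps judge how much each parameter matters. A minimal sketch on a hypothetical toy corpus (the documents, labels, `cv=2` and the reduced two-step pipeline are assumptions made only to keep the example small):

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV

# Hypothetical toy corpus, 4 documents per class
docs = ["goal in the final", "late goal wins", "striker scores a goal", "keeper saves penalty",
        "puck hits the post", "goalie stops the puck", "power play on ice", "slap shot from the blue line"]
labels = ["soccer"] * 4 + ["hockey"] * 4

pipe = Pipeline([('vect', CountVectorizer()), ('clf', MultinomialNB())])
grid = GridSearchCV(pipe,
                    {'vect__ngram_range': [(1, 1), (1, 2)], 'clf__alpha': [1e-1, 1e-2]},
                    cv=2)
grid.fit(docs, labels)

# One row per parameter combination, best-ranked first
results = pd.DataFrame(grid.cv_results_)
summary = results[['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
summary = summary.sort_values('rank_test_score')
print(summary)
```

Comparing `mean_test_score` against `std_test_score` across rows shows whether the winning combination is clearly ahead or within the fold-to-fold noise.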

In [None]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

Now let's re-evaluate the hard-coded version with bigrams, then trigrams.

In [27]:
%%time
# Extracting features from conversations
count_vect = CountVectorizer(ngram_range=(1, 2))   # Token counts over unigrams and bigrams
X_train_counts = count_vect.fit_transform(X_train)
classes = data['category'].unique()
prior = np.zeros(len(classes))
likelihood = np.zeros((len(classes), X_train_counts.shape[1]))
for c in range(len(classes)):
    # NOTE: rows come from the full `data` frame, which includes the held-out X_test rows,
    # so the accuracy measured on X_test is optimistic
    extracted_class = data[data["category"] == classes[c]].conversation
    # Prior of each class (unnormalized)
    prior[c] = len(extracted_class) / X_train_counts.shape[0]
    # Total count of each n-gram over the conversations of this class, as a 1-D array
    aux = np.asarray(count_vect.transform(extracted_class).sum(axis=0))[0]
    # Likelihood of each n-gram given the class; the +1 is Laplace smoothing
    likelihood[c] = (aux + 1) / (aux + 1).sum()
test_count = count_vect.transform(X_test)
prediction = []
for conv_index in range(test_count.shape[0]):
    score = np.zeros(len(classes))
    token_ids = test_count[conv_index].indices     # distinct n-grams present in this conversation
    for c in range(len(classes)):
        score[c] = np.log(prior[c]) + np.sum(np.log(likelihood[c, token_ids]))
    prediction.append(classes[np.argmax(score)])

Wall time: 1min 27s


In [28]:
# Performance of the hard coded NB Classifier
np.mean(prediction == y_test)                                                          

0.9786969696969697

#### Trigrams

In [46]:
%%time
# Extracting features from conversations
count_vect = CountVectorizer(ngram_range=(1, 3))   # Token counts over unigrams, bigrams and trigrams
X_train_counts = count_vect.fit_transform(X_train)
classes = data['category'].unique()
prior = np.zeros(len(classes))
likelihood = np.zeros((len(classes), X_train_counts.shape[1]))
for c in range(len(classes)):
    # NOTE: rows come from the full `data` frame, which includes the held-out X_test rows,
    # so the accuracy measured on X_test is optimistic
    extracted_class = data[data["category"] == classes[c]].conversation
    # Prior of each class (unnormalized)
    prior[c] = len(extracted_class) / X_train_counts.shape[0]
    # Total count of each n-gram over the conversations of this class, as a 1-D array
    aux = np.asarray(count_vect.transform(extracted_class).sum(axis=0))[0]
    # Likelihood of each n-gram given the class; the +1 is Laplace smoothing
    likelihood[c] = (aux + 1) / (aux + 1).sum()
test_count = count_vect.transform(X_test)
prediction = []
for conv_index in range(test_count.shape[0]):
    score = np.zeros(len(classes))
    token_ids = test_count[conv_index].indices     # distinct n-grams present in this conversation
    for c in range(len(classes)):
        score[c] = np.log(prior[c]) + np.sum(np.log(likelihood[c, token_ids]))
    prediction.append(classes[np.argmax(score)])

Wall time: 2min 23s


In [30]:
# Performance of the hard coded NB Classifier
np.mean(prediction == y_test)

0.9898181818181818
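The rising wall times are likely driven by vocabulary growth: each increase of the maximum n-gram length adds a new family of features. A small sketch on a hypothetical two-sentence corpus (not project data) shows the effect:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, only to illustrate the growth
docs = ["the rangers won the game in overtime",
        "the lakers lost the game in the final quarter"]

vocab_sizes = {}
for n in (1, 2, 3):
    vect = CountVectorizer(ngram_range=(1, n)).fit(docs)
    vocab_sizes[n] = len(vect.vocabulary_)   # number of distinct n-grams of length 1..n
    print(f"ngram_range=(1, {n}): {vocab_sizes[n]} features")
```

On a corpus of this project's size the likelihood matrix has one column per feature and one row per class, so its memory footprint grows in the same proportion.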

I tried n-grams longer than trigrams, but the gain is negligible while the computation time rises noticeably. <br>
Let's stick with trigrams and generate the prediction for the test set to submit.

In [42]:
data_test = pd.read_csv('test_input.csv')
data_test = data_test.drop(columns=['id'])
data_test["conversation"] = data_test["conversation"].apply(clean_str)

In [47]:
%%time
# Extracting features from conversations
test_count = count_vect.transform(data_test["conversation"] )
prediction=[]
for conv_index in range(test_count.shape[0]):
    score=np.zeros((len(classes),1))
    for c in range(len(classes)):
        score[c]= np.log(prior[c])+sum(np.log(likelihood[c,test_count[conv_index].indices]))
    prediction.append(classes[np.argmax(score)])      

Wall time: 57.6 s


In [51]:
prediction_df = pd.DataFrame({'id': range(len(prediction)), 'category':prediction})
prediction_df.sample(3)

Unnamed: 0,id,category
25010,25010,news
48790,48790,nfl
44715,44715,movies


We export the prediction dataframe:

In [None]:
prediction_df.to_csv('prediction_df.csv', index=False)