# Meme classification

For this task, I will vectorize the meme transcriptions using token counts. I am using the Scikit-Learn library for most of the next steps, with [this article](https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a) as a reference.

## Dataset splitting and model comparison before tuning

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import ast

In [23]:
# Load the dataset
df = pd.read_csv("./data/final_dataset.csv", index_col=0)
df.columns

Index(['meme_id', 'meme_template', 'subreddit', 'category', 'url',
       'transcription', 'title', 'tokenized_transcription', 'tokenized_title',
       'tokenized_merged_transc_title'],
      dtype='object')

In [3]:
# Split train and test data

X_train, X_test, y_train, y_test = train_test_split(df["tokenized_transcription"], df["category"], random_state=42, test_size=0.2)

In [4]:
X_train

596     ['foe', 'Poor', 'pe', 'ople', 'ee', 'eee', 'Le...
5014                  ['nae', 'MELROSE', 'a', 'eS', 'aa']
1612    ['BIRTH', 'CONTROL', 'EFFECTIVENESS', 'ED', 'a...
2683    ['Thats', 'a', 'nasty', 'hoodie', 'Be', 'That'...
652     ['Normal', 'girls', 'with', 'natural', 'skin',...
                              ...                        
3772    ['Who', 'beds', 'WAT', 'ae', 'A', 'Russian', '...
5191    ['A', 'Joseph', 'Josephs', 'ae', 'A', 'A', 'wi...
5226    ['Me', 'Can', 'I', 'nave', 'some', 'itriend', ...
5390    ['Asking', 'your', 'crush', 'out', 'ad', 'a', ...
860     ['When', 'you', 'get', 'a', 'bunch', 'of', 'do...
Name: tokenized_transcription, Length: 4344, dtype: object
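
The cells above look like Python lists, but after the round trip through CSV each one is most likely a single string such as `"['Thats', 'a', ...]"`. This is an assumption on my part, based on the `object` dtype and on the fact that `pd.read_csv` does not parse lists. Here is a minimal sketch of turning such a cell back into a list with `ast.literal_eval`; the example row is made up for illustration:

```python
import ast

# Hypothetical cell, written the way the column would look after being
# read back from CSV: one string, not a Python list.
raw_cell = "['Thats', 'a', 'nasty', 'hoodie']"

# Parse the string representation back into a real list of tokens.
tokens = ast.literal_eval(raw_cell)

# Join the tokens into plain text, the usual input for a text vectorizer.
text = " ".join(tokens)

print(tokens)
print(text)
```

`CountVectorizer` also copes with the raw string, since its default tokenizer treats punctuation as a separator, lowercases the text, and ignores one-character tokens such as `'a'`. The conversion is therefore optional for the pipelines used in this notebook.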

Let's train a Naive Bayes classifier first:

In [6]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

text_clf = Pipeline([('vect', CountVectorizer()), # Counts words per sentence
                     ('tfidf', TfidfTransformer()), # Avoids giving more weight to longer sentences and reduces weight of very common words
                     ('clf', MultinomialNB()),
])

In [27]:
text_clf = text_clf.fit(X_train, y_train)

In [8]:
predicted = text_clf.predict(X_test)
np.mean(predicted == y_test)

0.5405156537753223

That is only about 54% accuracy, a weak starting point. Let's now try a linear Support Vector Machine, trained with stochastic gradient descent:

In [9]:
from sklearn.linear_model import SGDClassifier

text_clf_svm = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf-svm', SGDClassifier(loss='hinge', penalty='l2', 
                                                   alpha=1e-3, random_state=42)),
])

In [10]:
_ = text_clf_svm.fit(X_train, y_train)
predicted_svm = text_clf_svm.predict(X_test)
np.mean(predicted_svm == y_test)

0.7016574585635359

In [11]:
sample = X_test.sample(1)
single_prediction = text_clf_svm.predict(sample)[0]
sample_sentence = sample.values[0]
print(f"Sample tokenized sentence: {sample_sentence}, \nPredicted topic: {single_prediction}")
print(f"Actual topic: {y_test[sample.index].values[0]}")

Sample tokenized sentence: ['Asmall', 'injury', 'like', 'stubbing', 'my', 'toe', 'that', 'goes', 'away', 'within', 'minutes', 'Me', 'Is', 'this', 'a', 'reason', 'to', 'stay', 'home', 'from', 'school', '?'], 
Predicted topic: Youth
Actual topic: Youth
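
To judge how far these accuracies sit above chance, it helps to compare them with a trivial baseline that always predicts the most frequent training category. The sketch below is a plain-Python illustration, not part of the original analysis. The function name `majority_baseline_accuracy` and the toy labels are hypothetical; only `Youth` appears in the output above.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == most_common_label)
    return most_common_label, hits / len(test_labels)

# Toy labels for illustration only (not taken from the dataset).
toy_train = ["Youth", "Youth", "Politics", "Sports", "Youth"]
toy_test = ["Youth", "Politics", "Youth", "Sports"]

baseline_label, baseline_accuracy = majority_baseline_accuracy(toy_train, toy_test)
print(baseline_label, baseline_accuracy)
```

In this notebook the call would be `majority_baseline_accuracy(y_train, y_test)`. Scikit-Learn offers the same baseline as `DummyClassifier(strategy="most_frequent")`.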


## Hyperparameter tuning

Let's start by tuning the hyperparameters of the Naive Bayes model:

In [12]:
from sklearn.model_selection import GridSearchCV

parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3),
}
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
gs_clf = gs_clf.fit(X_train, y_train)

In [13]:
gs_clf.best_score_
gs_clf.best_params_

{'clf__alpha': 0.01, 'tfidf__use_idf': False, 'vect__ngram_range': (1, 1)}

In [21]:
predicted_gs = gs_clf.predict(X_test)
np.mean(predicted_gs == y_test)

0.7265193370165746

After tuning its hyperparameters, the Naive Bayes model's test accuracy rises from about 54% to about 73%, a gain of nearly 19 percentage points. This tops the result of the SVM before tuning (about 70%). Let's now tune the SVM as well:

In [15]:
parameters_svm = {'vect__ngram_range': [(1, 1), (1, 2)],
                  'tfidf__use_idf': (True, False),
                  'clf-svm__alpha': (1e-2, 1e-3),
}
gs_clf_svm = GridSearchCV(text_clf_svm, parameters_svm, n_jobs=-1)
gs_clf_svm = gs_clf_svm.fit(X_train, y_train)
gs_clf_svm.best_score_
gs_clf_svm.best_params_

{'clf-svm__alpha': 0.001, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 1)}

In [20]:
predicted_svm_gs = gs_clf_svm.predict(X_test)
np.mean(predicted_svm_gs == y_test)

0.7016574585635359

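The grid search picked exactly the settings the SVM pipeline started with (`alpha=1e-3`, IDF weighting enabled, unigrams only), so its test accuracy is unchanged. Within the searched grid the SVM was already at its best, and Naive Bayes is performing better than the SVM at this stage.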
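
The unchanged SVM score can be checked by expanding the grid by hand. The sketch below enumerates the candidates in `parameters_svm` and compares them with the untuned pipeline's settings. The helper `expand_grid` is a hypothetical name introduced for illustration; `use_idf=True` and `ngram_range=(1, 1)` are Scikit-Learn's defaults for `TfidfTransformer` and `CountVectorizer`.

```python
from itertools import product

# Same grid as used for the SVM search in this notebook.
parameters_svm = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": (True, False),
    "clf-svm__alpha": (1e-2, 1e-3),
}

def expand_grid(grid):
    """List every parameter combination in the grid as a dict."""
    names = sorted(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[name] for name in names))]

candidates = expand_grid(parameters_svm)

# Settings of the untuned SVM pipeline: alpha was set explicitly,
# the other two are library defaults.
initial_settings = {
    "clf-svm__alpha": 1e-3,
    "tfidf__use_idf": True,
    "vect__ngram_range": (1, 1),
}

# Best parameters as reported by the grid search output.
best_params = {"clf-svm__alpha": 0.001, "tfidf__use_idf": True, "vect__ngram_range": (1, 1)}

print(len(candidates))                 # number of candidate settings
print(initial_settings in candidates)  # was the starting point part of the grid?
print(best_params == initial_settings)
```

The grid holds eight candidates, and the starting configuration is one of them. The Naive Bayes case is different: the default `alpha=1.0` of `MultinomialNB` is not in its grid, so that search could only move away from the starting point.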

To improve the prediction accuracy, it would be interesting to add one of the following as an additional model input:
- The title of the post
- The graphic content of the meme

Regarding the second option, the choice of data was a bit unfortunate: the memes in this dataset are all made with the most popular meme templates of 2018 (the dataset is aptly titled the same way). This means that any given template is probably used fairly evenly across the subreddits, so the image itself would carry little information about the category. I regrettably didn't think about this implication until it was too late. On the other hand, there were practically no other public, prepared datasets with usable categories or topics to choose from. This is a clear limitation of this work, and future efforts might explore a more graphically diverse dataset with a classification method enhanced by computer vision.

Nevertheless, the first option seems a very viable way to improve prediction accuracy. This is what I will be exploring in the next steps.

## Adding the title as input

For this, we will use the merged title + transcription tokens made in the previous notebook (the `tokenized_merged_transc_title` column). We repeat the steps from before, but skip the training without hyperparameter tuning:

In [24]:
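# Same split as before (same random_state), but with the merged title + transcription tokens as input.
# No separate CountVectorizer step is needed here: the pipeline vectorizes the text itself.
X2_train, X2_test, y2_train, y2_test = train_test_split(
    df["tokenized_merged_transc_title"], df["category"], random_state=42, test_size=0.2
)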

In [28]:
text_clf_2 = Pipeline([('vect', CountVectorizer()), # Counts words per sentence
                     ('tfidf', TfidfTransformer()), # Avoids giving more weight to longer sentences and reduces weight of very common words
                     ('clf', MultinomialNB()),
])
# text_clf_2 = text_clf_2.fit(X2_train, y2_train)
gs_clf_2 = GridSearchCV(text_clf_2, parameters, n_jobs=-1)
gs_clf_2 = gs_clf_2.fit(X2_train, y2_train)

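# Predict on the merged title + transcription test split, the same kind of input the model was trained on
predicted_gs_2 = gs_clf_2.predict(X2_test)
np.mean(predicted_gs_2 == y2_test)

0.7255985267034991

One caveat: the score above was recorded with the transcription-only `X_test` passed to `predict`, so it measures a model trained with titles but evaluated without them. Evaluating on `X2_test`, as the cell does, is needed for a fair comparison. Taken at face value, Naive Bayes shows no gain over the transcription-only model (0.7265). Could it be that the post titles are not sufficiently correlated with the topic? Let's try with the SVM: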

In [29]:
text_clf_svm_2 = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf-svm', SGDClassifier(loss='hinge', penalty='l2', 
                                                   alpha=1e-3, random_state=42)),
])

gs_clf_svm_2 = GridSearchCV(text_clf_svm_2, parameters_svm, n_jobs=-1)
gs_clf_svm_2 = gs_clf_svm_2.fit(X2_train, y2_train)

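# Predict on the merged title + transcription test split, the same kind of input the model was trained on
predicted_svm_gs_2 = gs_clf_svm_2.predict(X2_test)
np.mean(predicted_svm_gs_2 == y2_test)

0.7044198895027625

The SVM result is ever so slightly higher than before (0.7044 vs. 0.7017), but the difference is probably not significant. Note that this score was recorded with the transcription-only `X_test` passed to `predict`, so the apparent lack of benefit from the post titles should be confirmed on `X2_test` before concluding that they do not help the classification of the meme.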
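
As a rough check on "probably not significant", the gap between the two SVM scores recorded in this notebook can be compared with the sampling noise of a single accuracy estimate. The sketch below uses the binomial standard error. The test-set size of 1086 is inferred from the 4344 training rows and `test_size=0.2`, and `accuracy_standard_error` is a hypothetical helper name.

```python
import math

def accuracy_standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy measured on n_samples items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)

# Inferred, not stated in the notebook: 4344 training rows are 80% of 5430,
# which leaves 1086 rows for testing.
n_test = 1086

svm_transcription_only = 0.7016574585635359
svm_with_title = 0.7044198895027625

difference = svm_with_title - svm_transcription_only
standard_error = accuracy_standard_error(svm_with_title, n_test)

print(f"difference:     {difference:.4f}")
print(f"standard error: {standard_error:.4f}")
print(f"memes that changed outcome: {round(difference * n_test)}")
```

The gap corresponds to about three memes out of 1086 and is several times smaller than the standard error of roughly 0.014. This is only an approximation: both models are scored on the same memes, so a paired test such as McNemar's would be the more rigorous choice.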