# Mini Project 1

In [1]:
print('Hello MP1!')


Hello MP1!


### Imported libraries used for the project
1. jupiter
2. scikit-learn
3. gensim
4. nltk
5. numpy
6. pandas
7. matplotlib

`conda install jupiter scikit-learn gensim nltk numpy pandas matplotlib`

## 1. Dataset Preparation & Analysis (5pts)

1.2. Load the dataset. You can use `gzip.open` and `json.load` to do that.

In [2]:
import gzip
import json

dataset = gzip.open('goemotions.json.gz')
dataset_json = json.load(dataset)

# Close the gz dataset once your finished loading the data as a json object
dataset.close()


1.3. (5pts) Extract the posts and the 2 sets of labels (emotion and sentiment), then plot the distribution
of the posts in each category and save the graphic (a histogram or pie chart) in pdf. Do this for both
the emotion and the sentiment categories. You can use `matplotlib.pyplot` and `savefig` to do this.
This pre-analysis of the dataset will allow you to determine if the classes are balanced, and which
metric is more appropriate to use to evaluate the performance of your classifiers.

In [3]:
import matplotlib.pyplot as plt
import numpy as np
from collections import Counter


numpy_dataset = np.array(dataset_json)

# Get column only for emotion and sentiment
emotion_dataset_col = numpy_dataset[:, 1]
sentiment_dataset_col = numpy_dataset[:, 2]

# Count the number of times each value appears
emotion_count = Counter(emotion_dataset_col)
sentiment_count = Counter(sentiment_dataset_col)

# Save the data values as a histogram
plt.hist(emotion_count.values())
plt.savefig('emotions_graph')

plt.close()


plt.hist(sentiment_count.values())
plt.savefig('sentiment_graph')

plt.close()


## 2. Words as Features (35pts)

2.1. □ (5pts) Process the dataset using `feature_extraction.text.CountVectorizer` to extract tokens/words
and their frequencies. Display the number of tokens (the size of the vocabulary) in the dataset.

In [13]:
from sklearn.feature_extraction.text import CountVectorizer
# import pandas as pd


# Phrases are in the first column of the dataset
phrases = numpy_dataset[:, 0]

# Process the dataset
vectorizer_emotions = CountVectorizer()

# X value is the processed_dataset
X_emotions = vectorizer_emotions.fit_transform(phrases)

print("Number of features (tokens in the vocabulary) =",
      len(vectorizer_emotions.get_feature_names_out()))


Number of features (tokens in the vocabulary) = 30449


In [14]:
emotions = numpy_dataset[:, 1]

emotions_and_phrases = phrases.copy()

for count, i in enumerate(phrases):
    emotions_and_phrases[count] = i + " " + emotions[count]


vectorizer_sentiments = CountVectorizer()
X_sentiments = vectorizer_sentiments.fit_transform(emotions_and_phrases)

print("Number of features (tokens in the vocabulary) including emotions =",
      len(vectorizer_sentiments.get_feature_names_out()))


Number of features (tokens in the vocabulary) including emotions = 30450


2.2. □ (2pts) Split the dataset into 80% for training and 20% for testing. For this, you can use `train_test_split`.

In [17]:
# Split the dataset
from sklearn.model_selection import train_test_split


# Split the dataset
training_dataset, testing_dataset = train_test_split(
    numpy_dataset, train_size=0.8, test_size=0.2)

# Split the feature vector of emotions
training_X_emotions, testing_X_emotions = train_test_split(
    X_emotions, train_size=0.8, test_size=0.2)

# Split the feature vector of sentiments
training_X_sentiments, testing_X_sentiments = train_test_split(
    X_sentiments, train_size=0.8, test_size=0.2)

# Print the size of both datasets
print("Size of training set =", training_dataset.shape[0])
print("Size of testing set =", testing_dataset.shape[0])


Size of training set = 137456
Size of testing set = 34364


2.3. Train and test the following classifiers, for both the emotion and the sentiment classification, using
word frequency as features.

* 2.3.1. □ (3pts) **Base-MNB**: a Multinomial Naive Bayes Classifier `(naive_bayes.MultinomialNB.html)`
with the default parameters.

In [18]:
from sklearn.naive_bayes import MultinomialNB


# Create the object classifiers for emotions
emotions_classifier_mb = MultinomialNB()

# Fit the model with training_X as X and columns of training_dataset as y
emotions_classifier_mb.fit(X=training_X_emotions,
                           y=training_dataset[:, 1])

# Make predictions with testing_X as X
emotion_prediction_mb = emotions_classifier_mb.predict(X=testing_X_emotions)
print(emotion_prediction_mb)


['neutral' 'neutral' 'neutral' ... 'optimism' 'neutral' 'neutral']


In [19]:
# Create the object classifiers for sentiments
sentiment_classifier_mb = MultinomialNB()

# Fit the model with training_X as X and columns of training_dataset as y
sentiment_classifier_mb.fit(X=training_X_sentiments,
                            y=training_dataset[:, 2])

# Make predictions with testing_X as X
sentiment_prediction_mb = sentiment_classifier_mb.predict(
    X=testing_X_sentiments)
print(sentiment_prediction_mb)


['positive' 'positive' 'positive' ... 'neutral' 'positive' 'negative']


In [20]:
# Part 2.4 for Multinomial classification
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

# Append Emotions results
performance_file = open("performance", "w")
performance_file.write(
    "-----Emotions classification (Multinomial Naive Bayes)-----\n")

performance_file.write(
    f"Emotions hyperparameters = {emotions_classifier_mb.n_features_in_}\n")

confusion_matrix_output = confusion_matrix(
    y_true=testing_dataset[:, 1], y_pred=emotion_prediction_mb)
performance_file.write(f"Confusion Matrix = \n{confusion_matrix_output}\n\n")

class_report = classification_report(
    y_true=testing_dataset[:, 1], y_pred=emotion_prediction_mb, zero_division=0)
performance_file.write(f"Classification Report = \n{class_report}\n")
performance_file.write(
    f"----------------------------------------------------------\n\n")

performance_file.close()


In [21]:
# Append Sentiments results
performance_file = open("performance", "a")
performance_file.write(
    "-----Sentiments classification (Multinomial Naive Bayes)-----\n")

performance_file.write(
    f"Sentiments hyperparameters = {sentiment_classifier_mb.n_features_in_}\n")

confusion_matrix_output = confusion_matrix(
    y_true=testing_dataset[:, 1], y_pred=sentiment_prediction_mb)
performance_file.write(f"Confusion Matrix = \n{confusion_matrix_output}\n\n")

class_report = classification_report(
    y_true=testing_dataset[:, 1], y_pred=sentiment_prediction_mb, zero_division=0)
performance_file.write(f"Classification Report = \n{class_report}\n")
performance_file.write(
    f"----------------------------------------------------------\n\n")

performance_file.close()


* 2.3.2. □ (3pts) **Base-DT:** a Decision Tree `(tree.DecisionTreeClassifier)` with the default parameters.

In [22]:
from sklearn.tree import DecisionTreeClassifier


# Create the object classifiers for emotions
emotions_classifier_dt = DecisionTreeClassifier()

# Fit the model with training_X as X and columns of training_dataset as y
emotions_classifier_dt.fit(X=training_X_emotions,
                           y=training_dataset[:, 1])

# Make predictions with testing_X as X
emotion_prediction_dt = emotions_classifier_dt.predict(X=testing_X_emotions)
print(emotion_prediction_dt)


['curiosity' 'neutral' 'neutral' ... 'optimism' 'admiration' 'admiration']


In [23]:
# Create the object classifiers for sentiments
sentiment_classifier_dt = DecisionTreeClassifier()

# Fit the model with training_X as X and columns of training_dataset as y
sentiment_classifier_dt.fit(X=training_X_sentiments,
                            y=training_dataset[:, 2])

# Make predictions with testing_X as X
sentiment_prediction_dt = sentiment_classifier_dt.predict(
    X=testing_X_sentiments)
print(sentiment_prediction_dt)


['positive' 'negative' 'positive' ... 'neutral' 'negative' 'negative']


In [24]:
# Part 2.4 for DecisionTree classification


# Append Emotions results
performance_file = open("performance", "a")
performance_file.write("-----Emotions classification (Decision Tree)-----\n")

performance_file.write(
    f"Emotions hyperparameters = {emotions_classifier_dt.n_features_in_}\n")

confusion_matrix_output = confusion_matrix(
    y_true=testing_dataset[:, 1], y_pred=emotion_prediction_dt)
performance_file.write(f"Confusion Matrix = \n{confusion_matrix_output}\n\n")

class_report = classification_report(
    y_true=testing_dataset[:, 1], y_pred=emotion_prediction_dt, zero_division=0)
performance_file.write(f"Classification Report = \n{class_report}\n")
performance_file.write(
    f"----------------------------------------------------------\n\n")
performance_file.close()


In [25]:

# Append Sentiments results
performance_file = open("performance", "a")
performance_file.write(
    "-----Sentiments classification (Multinomial Naive Bayes)-----\n")

performance_file.write(
    f"Sentiments hyperparameters = {sentiment_classifier_dt.n_features_in_}\n")

confusion_matrix_output = confusion_matrix(
    y_true=testing_dataset[:, 1], y_pred=sentiment_prediction_dt)
performance_file.write(f"Confusion Matrix = \n{confusion_matrix_output}\n\n")

class_report = classification_report(
    y_true=testing_dataset[:, 1], y_pred=sentiment_prediction_dt, zero_division=0)
performance_file.write(f"Classification Report = \n{class_report}\n")
performance_file.write(
    f"----------------------------------------------------------\n\n")

performance_file.close()


* 2.3.3. □ (3pts) **Base-MLP:** a Multi-Layered Perceptron `(neural network.MLPClassifier)` with the
default parameters.

* 2.3.4. □ (3pts) **Top-MNB:** a better performing Multinomial Naive Bayes Classifier found using `GridSearchCV`.
The gridsearch will allow you to find the best combination of hyper-parameters, as determined
by the evaluation function that you have determined in step 1.3. The only hyper-parameter that
you will experiment with is `alphafloat` with values 0.5, 0 and 2 other values of your choice.

In [43]:
from sklearn.model_selection import GridSearchCV

# hyperparameter used in gridsearch
hyperparam = {'alpha': [0, 0.5, 1.0, 5.0]}

# emotions gridsearch for Top Multinomial Naive Bayes
emo_top_mnb_gridsearch = GridSearchCV(emotions_classifier_mb, param_grid=hyperparam)
emo_top_mnb_gridsearch.fit(X=training_X_emotions, y=training_dataset[:, 1])
emo_prediction_tmb = emo_top_mnb_gridsearch.predict(X=testing_X_emotions)
print(emo_prediction_tmb)

# sentiments gridsearch for Top Multinomial Naive Bayes
sen_top_mnb_gridsearch = GridSearchCV(sentiment_classifier_mb, param_grid=hyperparam)
sen_top_mnb_gridsearch.fit(X=training_X_sentiments, y=training_dataset[:, 1])
sen_prediction_tmb = sen_top_mnb_gridsearch.predict(X=testing_X_sentiments)
print(sen_prediction_tmb)



['neutral' 'neutral' 'neutral' ... 'neutral' 'neutral' 'neutral']


In [None]:
# Part 2.4 for Top Multinomial Naive Bayes classification with GridSearchCV (Emotions)
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

# Append Emotions results
performance_file = open("performance", "a")
performance_file.write("-----Emotions classification (Top Multinomial Naive Bayes with GridSearchCV)-----\n")

performance_file.write(f"Emotions hyperparamenters = {emo_top_mnb_gridsearch.n_features_in_}\n")

confusion_matrix_output = confusion_matrix(y_true=testing_dataset[:, 1], y_pred=emo_prediction_tmb)
performance_file.write(f"Confusion Matrix = \n{confusion_matrix_output}\n\n")

class_report = classification_report(y_true=testing_dataset[:, 1], y_pred=emo_prediction_tmb, zero_division=0)
performance_file.write(f"Classification Report = \n{class_report}\n")
performance_file.write(f"----------------------------------------------------------\n\n")

performance_file.close()

In [None]:
# Part 2.4 for Top Multinomial Naive Bayes classification with GridSearchCV (Sentiment)
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

# Append Emotions results
performance_file = open("performance", "a")
performance_file.write("-----Sentiment classification (Top Multinomial Naive Bayes with GridSearchCV)-----\n")

performance_file.write(f"Sentiment hyperparamenters = {sen_top_mnb_gridsearch.n_features_in_}\n")

confusion_matrix_output = confusion_matrix(y_true=testing_dataset[:, 1], y_pred=sen_prediction_tmb)
performance_file.write(f"Confusion Matrix = \n{confusion_matrix_output}\n\n")

class_report = classification_report(y_true=testing_dataset[:, 1], y_pred=sen_prediction_tmb, zero_division=0)
performance_file.write(f"Classification Report = \n{class_report}\n")
performance_file.write(f"----------------------------------------------------------\n\n")

performance_file.close()

* 2.3.5. □ (3pts) **Top-DT:** a better performing Decision Tree found using `GridSearchCV.` The hyperparameters
that you will experiment with are:
  * `criterion:` gini or entropy
  * `max depth:` 2 different values of your choice
  * `min samples split:` 3 different values of your choice

* 2.3.6. □ (3pts) **Top-MLP:** a better performing Multi-Layered Perceptron found using GridSearchCV.
The hyper-parameters that you will experiment with are:
    * `activation:` sigmoid, tanh, relu and identity
    * 2 network architectures of your choice: for eg, 2 hidden layers with 30+50 nodes and 3 hidden
layers with 10 + 10 + 10
    * `solver:` Adam and stochastic gradient descent

In [45]:
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# hyperparameter used in gridsearch
hyperparam = {'activation': ['sigmoid', 'tanh', 'relu', 'identity'],
                'hidden_layer_sizes': [2, 3],
                'solver': ['Adam', 'sgd']}

emo_top_mlp_gridsearch = GridSearchCV(MLPClassifier(), param_grid=hyperparam)
emo_top_mlp_gridsearch.fit(X=training_X_emotions, y=training_dataset[:, 1])
emo_prediction_tmlp = emo_top_mlp_gridsearch.predict(X=testing_X_emotions)
print(emo_prediction_tmlp)


sen_top_mlp_gridsearch = GridSearchCV(MLPClassifier(), param_grid=hyperparam)
sen_top_mlp_gridsearch.fit(X=training_X_emotions, y=training_dataset[:, 1])
sen_prediction_tmlp = sen_top_mlp_gridsearch.predict(X=testing_X_emotions)
print(emo_prediction_tmlp)

50 fits failed out of a total of 80.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
20 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\mateo\miniconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\mateo\miniconda3\lib\site-packages\sklearn\neural_network\_multilayer_perceptron.py", line 762, in fit
    return self._fit(X, y, incremental=False)
  File "c:\Users\mateo\miniconda3\lib\site-packages\sklearn\neural_network\_multilayer_perceptron.py", line 385, in _fit
    self._validate_hyperparameters()
  File "c:\Users\mateo\miniconda3\lib\site-packages\sklearn\neural_network\_multilayer_perc

KeyboardInterrupt: 

In [None]:
# Part 2.4 for Top Multi-Layered Percentron classification with GridSearchCV (Emotions)
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

# Append Emotions results
performance_file = open("performance", "a")
performance_file.write("-----Emotions classification (Top Multi-Layered Percentron with GridSearchCV)-----\n")

performance_file.write(f"Emotions hyperparamenters = {emo_top_mlp_gridsearch.n_features_in_}\n")

confusion_matrix_output = confusion_matrix(y_true=testing_dataset[:, 1], y_pred=emo_prediction_tmlp)
performance_file.write(f"Confusion Matrix = \n{confusion_matrix_output}\n\n")

class_report = classification_report(y_true=testing_dataset[:, 1], y_pred=emo_prediction_tmlp, zero_division=0)
performance_file.write(f"Classification Report = \n{class_report}\n")
performance_file.write(f"----------------------------------------------------------\n\n")

performance_file.close()

In [None]:
# Part 2.4 for Top Multi-Layered Percentron classification with GridSearchCV (Sentiment)
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

# Append Sentiment results
performance_file = open("performance", "a")
performance_file.write("-----Sentiment classification (Top Multi-Layered Percentron with GridSearchCV)-----\n")

performance_file.write(f"Sentiment hyperparamenters = {sen_top_mlp_gridsearch.n_features_in_}\n")

confusion_matrix_output = confusion_matrix(y_true=testing_dataset[:, 1], y_pred=sen_prediction_tmlp)
performance_file.write(f"Confusion Matrix = \n{confusion_matrix_output}\n\n")

class_report = classification_report(y_true=testing_dataset[:, 1], y_pred=sen_prediction_tmlp, zero_division=0)
performance_file.write(f"Classification Report = \n{class_report}\n")
performance_file.write(f"----------------------------------------------------------\n\n")

performance_file.close()

2.4. □ (5pts) For each of the 6 classifiers above and each of the classification tasks (emotion or sentiment),
produce and save the following information in a file called `performance`:
* a string clearly describing the model (e.g. the model name + hyper-parameter values) and the
classification task (emotion or sentiment)
* the confusion matrix – use `metrics.confusion_matrix`
* the precision, recall, and F1-measure for each class, and the accuracy, macro-

2.5. □ (7.5pts) **Do your own exploration:** Do only one of the following, depending on your own interest:
* Use tf-idf instead of word frequencies and redo all substeps of 2.3 above – you can use `TfidfTransformer`
for this. Display the results of this experiment.
* Remove stop words and redo all substeps of 2.3 above – you can use the parameter of `CountVectorizer`
for this. Display the results of this experiment.
* Play with `train_test_split` in order have different splits of 80% training, 20% test sets and
different sizes of training sets and redo all substeps of 2.3 above. Show and explain how the
performance of your models vary depending on the training/test sets are used.

## 3. Embeddings as Features (20pts)

3.1. □ (0pts) Use `gensim.downloader.load` to load the `word2vec-google-news-300` pretrained embedding model.

In [26]:
from gensim import downloader

3.2. □ (2pts) Use the `tokenizer` from `nltk` to extract words from the Reddit posts. Display the number
of tokens in the training set.

In [27]:
import nltk


# token_phrases = nltk.data.load(list(phrases))