# Mini Project 1

In [57]:
print('Hello MP1!')


Hello MP1!


### Import the libraries used for the project
1. scikit-learn
2. gensim
3. nltk libraries
4. numpy
5. pandas
6. matplotlib.pyplot
7. json
8. gzip

## 1. Dataset Preparation & Analysis (5pts)

1.2. Load the dataset. You can use `gzip.open` and `json.load` to do that.

In [58]:
import gzip
import json

dataset = gzip.open('goemotions.json.gz')
dataset_json = json.load(dataset)

# Close the gz file once you're finished loading the data as a JSON object
dataset.close()
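
The same load can also be written with a context manager, so the file is closed even if `json.load` raises. This is a minimal sketch, assuming the archive holds a single JSON array of `[post, emotion, sentiment]` rows; the sample file and the labels in it are invented here so the snippet runs on its own.

```python
import gzip
import json
import os
import tempfile

# Hypothetical stand-in for goemotions.json.gz (rows and labels are made up).
sample_rows = [["That game hurt.", "sadness", "negative"],
               ["Man I love reddit.", "love", "positive"]]
sample_path = os.path.join(tempfile.mkdtemp(), "sample.json.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as f:
    json.dump(sample_rows, f)

# Text mode ("rt") decodes the compressed bytes; the with-block closes the file.
with gzip.open(sample_path, "rt", encoding="utf-8") as f:
    loaded_rows = json.load(f)

print(len(loaded_rows), loaded_rows[0])
```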


1.3. (5pts) Extract the posts and the 2 sets of labels (emotion and sentiment), then plot the distribution
of the posts in each category and save the graphic (a histogram or pie chart) in pdf. Do this for both
the emotion and the sentiment categories. You can use `matplotlib.pyplot` and `savefig` to do this.
This pre-analysis of the dataset will allow you to determine if the classes are balanced, and which
metric is more appropriate to use to evaluate the performance of your classifiers.

In [59]:
import matplotlib.pyplot as plt
import numpy as np
from collections import Counter


numpy_dataset = np.array(dataset_json)

# Get column only for emotion and sentiment
emotion_dataset_col = numpy_dataset[:, 1]
sentiment_dataset_col = numpy_dataset[:, 2]

# Count the number of times each value appears
emotion_count = Counter(emotion_dataset_col)
sentiment_count = Counter(sentiment_dataset_col)

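# Save the number of posts per category as a bar chart in PDF format
plt.bar(list(emotion_count.keys()), list(emotion_count.values()))
plt.xticks(rotation=90)
plt.tight_layout()
plt.savefig('emotions_graph.pdf')

plt.close()


plt.bar(list(sentiment_count.keys()), list(sentiment_count.values()))
plt.tight_layout()
plt.savefig('sentiment_graph.pdf')

plt.close()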
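
The `Counter` objects computed above can also be turned into numbers that show how balanced the classes are, which is what step 1.3 asks you to judge before picking a metric. This is a rough sketch on invented labels; the names `shares` and `imbalance_ratio` are illustrative, and no particular threshold for "imbalanced" is implied.

```python
from collections import Counter

# Invented labels standing in for one label column of the dataset
toy_labels = ["neutral"] * 6 + ["joy"] * 3 + ["anger"]
counts = Counter(toy_labels)
total = sum(counts.values())

# Fraction of posts per category, largest first
shares = {label: n / total for label, n in counts.most_common()}

# Size of the largest class relative to the smallest one
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in shares.items():
    print(f"{label}: {share:.2f}")
print("largest / smallest class:", imbalance_ratio)
```

When the ratio is far from 1, accuracy alone can look good for a model that mostly predicts the majority class, so a per-class measure such as macro-averaged F1 is usually worth reporting alongside it.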


## 2. Words as Features (35pts)

2.1. □ (5pts) Process the dataset using `feature_extraction.text.CountVectorizer` to extract tokens/words
and their frequencies. Display the number of tokens (the size of the vocabulary) in the dataset.

In [60]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd


# Phrases are in the first column of the dataset
phrases = numpy_dataset[:, 0]
print(phrases)

# Process the dataset
vectorizer = CountVectorizer()

# X value is the processed_dataset
X = vectorizer.fit_transform(phrases)
# print(X.toarray())
# print(processed_dataset[:100, :100].toarray())

# Print the size of the vocabulary (number of tokens).
# This is the length of vectorizer.get_feature_names_out(); equivalently, it is
# the number of columns of X, since each column corresponds to one vocabulary word.
print("Number of features (tokens in the vocabulary) =",
      len(vectorizer.get_feature_names_out()))
# print("Number of features (tokens in the vocabulary) =",
#       X.shape[1])


['That game hurt.' "You do right, if you don't care then fuck 'em!"
 'Man I love reddit.' ...
 'Well when you’ve imported about a gazillion of them I or your country it’s gets serious.'
 'That looks amazing'
 "The FDA has plenty to criticize. But like here, it's usually criticized horribly off base. It needs to grow some balls and actually enforce things. "]
Number of features (tokens in the vocabulary) = 30449


2.2. □ (2pts) Split the dataset into 80% for training and 20% for testing. For this, you can use `train_test_split`.

In [61]:
from sklearn.model_selection import train_test_split


# Split the dataset and the vectorized matrix X in a single call so that
# each row of X stays aligned with its labels in the dataset
training_dataset, testing_dataset, training_X, testing_X = train_test_split(
    numpy_dataset, X, train_size=0.8, test_size=0.2)

# Print the size of both datasets
print("Size of training set =", training_dataset.shape[0])
print("Size of testing set =", testing_dataset.shape[0])
print(training_dataset[:, 1])


Size of training set = 137456
Size of testing set = 34364
['disapproval' 'neutral' 'realization' ... 'admiration' 'neutral'
 'admiration']
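
`train_test_split` shuffles the rows before splitting, so features and labels only stay paired when they are passed in the same call. The sketch below checks this on invented arrays; `random_state=0` is an addition made here for reproducibility and is not part of the assignment.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: row i of the features is [2i, 2i + 1] and its label is i
toy_features = np.arange(20).reshape(10, 2)
toy_labels = np.arange(10)

# Passing both arrays in one call applies the same shuffle to each
X_train, X_test, y_train, y_test = train_test_split(
    toy_features, toy_labels, train_size=0.8, test_size=0.2, random_state=0)

# Every feature row should still sit next to its own label
aligned = all(X_train[i, 0] == 2 * y_train[i] for i in range(len(y_train)))
print(X_train.shape, X_test.shape, aligned)
```

Calling `train_test_split` separately on the features and on the labels draws a different shuffle for each, which silently pairs posts with the wrong labels.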


2.3. Train and test the following classifiers, for both the emotion and the sentiment classification, using
word frequency as features.

* 2.3.1. □ (3pts) **Base-MNB**: a Multinomial Naive Bayes Classifier (`naive_bayes.MultinomialNB`)
with the default parameters.

In [64]:
from sklearn.naive_bayes import MultinomialNB


# Create the object classifiers for both emotions and sentiments
emotions_classifier_mb = MultinomialNB()
sentiment_classifier_mb = MultinomialNB()

# Fit the model with training_X as X and columns of training_dataset as y
emotions_classifier_mb.fit(X=training_X, y=training_dataset[:, 1])
sentiment_classifier_mb.fit(X=training_X, y=training_dataset[:, 2])

# Make predictions with testing_X as X
emotion_prediction = emotions_classifier_mb.predict(X=testing_X)
print(emotion_prediction)
sentiment_prediction = sentiment_classifier_mb.predict(X=testing_X)
print(sentiment_prediction)


['neutral' 'neutral' 'optimism' ... 'neutral' 'neutral' 'neutral']
['positive' 'neutral' 'neutral' ... 'ambiguous' 'positive' 'positive']


* 2.3.2. □ (3pts) **Base-DT:** a Decision Tree (`tree.DecisionTreeClassifier`) with the default parameters.

* 2.3.3. □ (3pts) **Base-MLP:** a Multi-Layered Perceptron (`neural_network.MLPClassifier`) with the
default parameters.

* 2.3.4. □ (3pts) **Top-MNB:** a better performing Multinomial Naive Bayes Classifier found using `GridSearchCV`.
The grid search will allow you to find the best combination of hyper-parameters, as determined
by the evaluation function that you chose in step 1.3. The only hyper-parameter that
you will experiment with is `alpha` (a float), with values 0.5, 0 and 2 other values of your choice.

* 2.3.5. □ (3pts) **Top-DT:** a better performing Decision Tree found using `GridSearchCV`. The hyper-parameters
that you will experiment with are:
  * `criterion`: gini or entropy
  * `max_depth`: 2 different values of your choice
  * `min_samples_split`: 3 different values of your choice

* 2.3.6. □ (3pts) **Top-MLP:** a better performing Multi-Layered Perceptron found using `GridSearchCV`.
The hyper-parameters that you will experiment with are:
    * `activation`: sigmoid, tanh, relu and identity
    * 2 network architectures of your choice: e.g., 2 hidden layers with 30 + 50 nodes and 3 hidden layers with 10 + 10 + 10
    * `solver`: Adam and stochastic gradient descent
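
The grid searches described in 2.3.4 to 2.3.6 might be set up along the following lines. This is a sketch on invented count data, not the notebook's actual solution: the `f1_macro` scoring, `cv=3`, the extra `alpha` values, and the depth and split values are all assumptions chosen for illustration. In scikit-learn the sigmoid activation is named `logistic`, and the two solvers are named `adam` and `sgd`. Depending on the scikit-learn version, `alpha=0` may trigger a warning.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# Toy positive count features with alternating labels (illustrative data only)
rng = np.random.default_rng(0)
toy_X = rng.integers(1, 6, size=(60, 8))
toy_y = np.array([0, 1] * 30)

# Top-MNB: 0.5 and 0 come from the assignment; 1.0 and 2.0 are arbitrary picks
mnb_search = GridSearchCV(
    MultinomialNB(),
    param_grid={"alpha": [0.5, 0, 1.0, 2.0]},
    scoring="f1_macro",  # assumed metric; use the one chosen in step 1.3
    cv=3,
)
mnb_search.fit(toy_X, toy_y)

# Top-DT: the depth and split values below are placeholders
dt_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={
        "criterion": ["gini", "entropy"],
        "max_depth": [5, 10],
        "min_samples_split": [2, 4, 8],
    },
    scoring="f1_macro",
    cv=3,
)
dt_search.fit(toy_X, toy_y)

# Top-MLP grid, shown but not fitted here because MLP searches are slow
mlp_param_grid = {
    "activation": ["logistic", "tanh", "relu", "identity"],
    "hidden_layer_sizes": [(30, 50), (10, 10, 10)],
    "solver": ["adam", "sgd"],
}

print("Top-MNB:", mnb_search.best_params_)
print("Top-DT:", dt_search.best_params_)
```

After fitting, `best_estimator_` holds the model refit on the whole training set with the winning parameters, so it can be used directly for prediction.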

2.4. □ (5pts) For each of the 6 classifiers above and each of the classification tasks (emotion or sentiment),
produce and save the following information in a file called `performance`:
* a string clearly describing the model (e.g. the model name + hyper-parameter values) and the
classification task (emotion or sentiment)
* the confusion matrix – use `metrics.confusion_matrix`
* the precision, recall, and F1-measure for each class, and the accuracy, macro-
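
One way to produce such a file is sketched below with invented predictions. The description string, the temporary output location and the layout of the file are assumptions made for illustration; `metrics.classification_report` is used here because its text output contains per-class precision, recall and F1 together with the accuracy and the averaged scores.

```python
import os
import tempfile

from sklearn.metrics import classification_report, confusion_matrix

# Invented true and predicted sentiment labels
y_true = ["positive", "negative", "neutral", "positive", "neutral", "negative"]
y_pred = ["positive", "neutral", "neutral", "positive", "negative", "negative"]
labels = ["negative", "neutral", "positive"]

# Hypothetical description of the model and task
model_description = "Base-MNB (default parameters), sentiment task"

cm = confusion_matrix(y_true, y_pred, labels=labels)
report = classification_report(y_true, y_pred, labels=labels, zero_division=0)

# Append mode lets each model/task pair add its own section to the same file
performance_path = os.path.join(tempfile.mkdtemp(), "performance")
with open(performance_path, "a", encoding="utf-8") as f:
    f.write(model_description + "\n")
    f.write("Confusion matrix (rows = true, columns = predicted):\n")
    f.write(str(cm) + "\n")
    f.write(report + "\n")

print(open(performance_path, encoding="utf-8").read())
```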

2.5. □ (7.5pts) **Do your own exploration:** Do only one of the following, depending on your own interest:
* Use tf-idf instead of word frequencies and redo all substeps of 2.3 above – you can use `TfidfTransformer`
for this. Display the results of this experiment.
* Remove stop words and redo all substeps of 2.3 above – you can use the parameter of `CountVectorizer`
for this. Display the results of this experiment.
* Play with `train_test_split` in order to have different splits of 80% training, 20% test sets and
different sizes of training sets, and redo all substeps of 2.3 above. Show and explain how the
performance of your models varies depending on which training/test sets are used.