Homework 4: Sentiment Analysis - Task 0, Task 1, Task 5 (all primarily written tasks)
----

The following instructions are only written in this notebook but apply to all notebooks and `.py` files you submit for this homework.

Due date: February 28th, 2024

Points: 
- Task 0: 5 points
- Task 1: 10 points
- Task 2: 30 points
- Task 3: 20 points
- Task 4: 20 points
- Task 5: 15 points

Goals:
- understand the difficulties of counting and probabilities in NLP applications
- work with real world data to build a functioning language model
- stress test your model (to some extent)

Complete in groups of: __two (pairs)__. If you prefer to work on your own, you may, but be aware that this homework has been designed as a partner project.

Allowed python modules:
- `numpy`, `matplotlib`, `keras`, `pytorch`, `nltk`, `pandas`, `scikit-learn` (`sklearn`), `seaborn`, and all built-in python libraries (e.g. `math` and `string`)
- if you would like to use a library not on this list, post on piazza to request permission
- all *necessary* imports have been included for you (all imports that we used in our solution)

Instructions:
- Complete outlined problems in this notebook. 
- When you have finished, __clear the kernel__ and __run__ your notebook "fresh" from top to bottom. Ensure that there are __no errors__. 
    - If a problem asks for you to write code that does result in an error (as in, the answer to the problem is an error), leave the code in your notebook but commented out so that running from top to bottom does not result in any errors.
- Double check that you have completed Task 0.
- Submit your work on Gradescope.
- Double check that your submission on Gradescope looks like you believe it should __and__ that all partners are included (for partner work).


Task 0: Name, References, Reflection (5 points)
---

Names
----
Names: Kaan Tural, Arinjay Singh

References
---
List the resources you consulted to complete this homework here. Write one sentence per resource about what it provided to you. If you consulted no references to complete your assignment, write a brief sentence stating that this is the case and why it was the case for you.

- https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
    - The training data and dev data were sub-sampled from this source.


Reflection
----
Answer the following questions __after__ you complete this assignment (no more than 1 sentence per question required, this section is graded on completion):

1. Does this work reflect your best effort?

Yes, this work does reflect our best effort.

2. What was/were the most challenging part(s) of the assignment?

The most challenging part of the assignment was training the Naive Bayes Classifier.

3. If you want feedback, what function(s) or problem(s) would you like feedback on and why?

We would like feedback on our feedforward neural net design because neural networks are still a relatively new concept to us.

4. Briefly reflect on how your partnership functioned--who did which tasks, how was the workload on each of you individually as compared to the previous homeworks, etc.

The partnership worked well: Kaan worked on Tasks 1 and 2, while Arinjay worked on Tasks 3 and 4. Task 5 was completed together. The workload was relatively manageable compared to the previous homeworks.

Task 1: Provided Data Write-Up (10 points)
---

Every time you use a data set in an NLP application (or in any software application), you should be able to answer a set of questions about that data. Answer these now. Default to no more than 1 sentence per question needed. If more explanation is necessary, do give it.

This is about the __provided__ movie review data set.

1. Where did you get the data from? 

The provided dataset(s) were sub-sampled from https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews 

2. (1 pt) How was the data collected (where did the people acquiring the data get it from and how)?

The data was collected from the Internet Movie Database (IMDB); however, the source does not specify how it was acquired. Discussions of the dataset on Kaggle suggest that it was retrieved with a web scraper.

3. (2 pts) How large is the dataset (answer for both the train and the dev set, separately)? (# reviews, # tokens in both the train and dev sets)

The full dataset's train and test sets have 25,000 reviews each. The sub-sample used in this assignment has a train set of 1,600 reviews (425,421 tokens) and a dev set of 200 reviews (54,603 tokens).

4. (1 pt) What is your data? (i.e. newswire, tweets, books, blogs, etc)

The data is made up of movie reviews that are considered highly polar.

5. (1 pt) Who produced the data? (who were the authors of the text? Your answer might be a specific person or a particular group of people)

The authors of the data are the reviewers that post their opinions about movies on IMDB.

6. (2 pts) What is the distribution of labels in the data (answer for both the train and the dev set, separately)?

The training data has 804 (50.25%) positive reviews labeled as 1 and 796 (49.75%) negative reviews labeled as 0. The dev set has 105 (52.5%) positive reviews labeled as 1 and 95 (47.5%) negative reviews labeled as 0.

7. (2 pts) How large is the vocabulary (answer for both the train and the dev set, separately)?

The vocabulary sizes of the train and the dev sets were 30705 and 8953 respectively.

8. (1 pt) How big is the overlap between the vocabulary for the train and dev set?

The train and dev vocabularies share 6574 words, which is about 73.4% of the dev vocabulary.

In [2]:
# our utility functions
# RESTART your jupyter notebook kernel if you make changes to this file
import sentiment_utils as sutils

[nltk_data] Downloading package punkt to /Users/arinjay/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [21]:
import pandas as pd

train_reviews, train_labels = sutils.generate_tuples_from_file('movie_reviews_train.txt')
dev_reviews, dev_labels = sutils.generate_tuples_from_file('movie_reviews_dev.txt')

train_df = pd.DataFrame({'Review': train_reviews, 'Label': train_labels})
dev_df = pd.DataFrame({'Review': dev_reviews, 'Label': dev_labels})

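# Vocabulary: each review is a list of tokens, so explode() yields one token
# per row and unique() keeps each distinct token once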
print("Vocabulary:\n")

train_vocab = train_df['Review'].explode().unique()
train_vocab_size = len(train_vocab)
print(f"Training Vocab Size: {train_vocab_size}")

dev_vocab = dev_df['Review'].explode().unique()
dev_vocab_size = len(dev_vocab)
print(f"Dev Vocab Size: {dev_vocab_size}")

overlap_vocab = set(train_vocab) & set(dev_vocab)
overlap_vocab_size = len(overlap_vocab)
print(f"Overlap Vocab Size: {overlap_vocab_size}")

Vocabulary:

Training Vocab Size: 30705
Dev Vocab Size: 8953
Overlap Vocab Size: 6574


In [24]:
# Dataset Size
print("Dataset Size:\n")

train_size = len(train_df)
print(f"Training Size: {train_size}")

dev_size = len(dev_df)
print(f"Dev Size: {dev_size}")

train_tokens = train_df['Review'].apply(len).sum()
print(f"Training Tokens: {train_tokens}")

dev_tokens = dev_df['Review'].apply(len).sum()
print(f"Dev Tokens: {dev_tokens}")

Dataset Size:

Training Size: 1600
Dev Size: 200
Training Tokens: 425421
Dev Tokens: 54603


In [25]:
# Class Distribution
print("Class Distribution:\n")

label_distribution_train = train_df['Label'].value_counts()
label_distribution_dev = dev_df['Label'].value_counts()

total_train_samples = label_distribution_train.sum()
total_dev_samples = label_distribution_dev.sum()

label_distribution_train_percent = label_distribution_train / total_train_samples * 100
label_distribution_dev_percent = label_distribution_dev / total_dev_samples * 100

print("Training Label Distribution:")
print(label_distribution_train)
print(label_distribution_train_percent)

print("\nDev Label Distribution:")
print(label_distribution_dev)
print(label_distribution_dev_percent)

Class Distribution:

Training Label Distribution:
1    804
0    796
Name: Label, dtype: int64
1    50.25
0    49.75
Name: Label, dtype: float64

Dev Label Distribution:
1    105
0     95
Name: Label, dtype: int64
1    52.5
0    47.5
Name: Label, dtype: float64


Task 5: Model Evaluation (15 points)
---
Save your three graph files for the __best__ configurations that you found with your models using the `plt.savefig(filename)` command. The `bbox_inches` optional parameter will help you control how much whitespace outside of the graph is in your resulting image.

__NOTE:__ Run each notebook containing a classifier 3 times, resulting in __NINE__ saved graphs (don't just overwrite your previous ones).

You will turn in all of these files.

10 points in this section are allocated for having all nine graphs legible, properly labeled, and present.
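
One way to keep the nine graphs from overwriting each other is to build the filename from the model name and the run number. The sketch below is only an illustration: the helper names `graph_path` and `save_graph`, the `graphs` output directory, and the filename pattern are assumptions, not part of the provided starter code. It accepts any matplotlib `Figure` and passes `bbox_inches="tight"` to `savefig` to trim the surrounding whitespace.

```python
from pathlib import Path


def graph_path(model_name, run, out_dir="graphs"):
    """Build a distinct path per (model, run), e.g. graphs/naive_bayes_run2.png.

    The directory name and filename pattern are hypothetical choices.
    """
    return Path(out_dir) / f"{model_name}_run{run}.png"


def save_graph(fig, model_name, run, out_dir="graphs"):
    """Save a matplotlib Figure under a name that other runs will not reuse."""
    path = graph_path(model_name, run, out_dir)
    path.parent.mkdir(parents=True, exist_ok=True)
    # bbox_inches="tight" trims the extra whitespace around the plot.
    fig.savefig(path, bbox_inches="tight")
    return path


# Usage inside a classifier notebook (assumes matplotlib.pyplot is imported as plt):
#   fig = plt.gcf()
#   save_graph(fig, "naive_bayes", run=1)
```

Calling `save_graph` once per run in each of the three classifier notebooks would produce nine separate files.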




1. (1 pt) When using __10%__ of your data, which model had the highest f1 score?

When using 10% of the training data, the feedforward neural net had the highest f1 score, at 0.685.
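
For reference, the f1 score is the harmonic mean of precision and recall. The following is a minimal sketch of that calculation, assuming binary labels with 1 as the positive class (as in the provided data); the function name `f1_from_labels` is hypothetical, and the notebooks may well compute the score through a library or utility function instead.

```python
def f1_from_labels(gold, predicted, positive_label=1):
    """Compute f1 for the positive class from parallel lists of labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive_label and p == positive_label)
    fp = sum(1 for g, p in pairs if g != positive_label and p == positive_label)
    fn = sum(1 for g, p in pairs if g == positive_label and p != positive_label)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined), so report 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels for illustration only (not taken from the homework data).
gold = [1, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 1]
score = f1_from_labels(gold, predicted)  # tp=2, fp=1, fn=1
```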

2. (1 pt) Which classifier had the most __consistent__ performance (that is, which classifier had the least variation across all three graphs you have for it -- no need to mathematically calculate this, you can just look at the graphs)?

The Logistic Regression model had the most consistent performance: all three of its graphs were nearly identical.
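
The question only asks for a visual judgment, but the comparison could be made concrete by measuring how much the f1 score varies across the three runs at each training percentage. The sketch below assumes each run is stored as a list of f1 scores aligned by training percentage. The function name `mean_spread`, the model names, and all numbers are placeholders for illustration, not results from this homework.

```python
from statistics import mean, pstdev


def mean_spread(runs):
    """Average, over training percentages, of the std. dev. of f1 across runs.

    `runs` is a list of per-run f1 lists, all aligned by training percentage.
    """
    per_point = [pstdev(scores) for scores in zip(*runs)]
    return mean(per_point)


# Placeholder scores for illustration only (three runs, three training sizes).
example_runs = {
    "model_a": [[0.60, 0.70, 0.80], [0.60, 0.70, 0.80], [0.60, 0.70, 0.80]],
    "model_b": [[0.50, 0.75, 0.80], [0.65, 0.60, 0.85], [0.55, 0.70, 0.78]],
}

spreads = {name: mean_spread(runs) for name, runs in example_runs.items()}
# The model with the smallest average spread is the most consistent one.
most_consistent = min(spreads, key=spreads.get)
```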

3. (1 pt) For each model, what percentage of training data resulted in the highest f1 score?
    1. Naive Bayes:
    2. Logistic Regression: 80%
    3. Neural Net: 70%

4. (2 pts) Which model, if any, appeared to overfit the training data the most? Why?


