Homework 4: Neural Language Models (& 🎃 SpOoKy 👻 authors 🧟 data)
----

Due date: March 13th, 2025 @ 9pm Boston time

Points: (will be listed on Canvas)

Goals:
- explore & use word embeddings
- train neural models from the ground up!
- get comfy with a modern neural net library (`pytorch`)
    - you'll likely make close friends with the docs: https://pytorch.org/tutorials/beginner/basics/intro.html
- evaluate neural vs. vanilla n-gram language models

Complete in groups of: __two (pairs)__. If you prefer to work on your own, you may, but be aware that this homework has been designed as a partner project!

Allowed python modules:
- `gensim`, `numpy`, `matplotlib`, `pytorch`, `nltk`, `pandas`, `scikit-learn` (`sklearn`), `seaborn`, all built-in python libraries (e.g. `math` and `string`), and anything else we imported in the starter code
- if you would like to use a library not on this list, post on Piazza to request permission
- all *necessary* imports have been included for you (all imports that we used in our solution)

Instructions:
- Complete outlined problems in this notebook. 
- When you have finished, __clear the kernel__ and __run__ your notebook "fresh" from top to bottom. Ensure that there are __no errors__. 
    - If a problem asks for you to write code that does result in an error (as in, the answer to the problem is an error), leave the code in your notebook but commented out so that running from top to bottom does not result in any errors.
- Double check that you have completed Task 0.
- Submit your work on Gradescope.
- Double check that your submission on Gradescope looks like you believe it should __and__ that all partners are included (for partner work).

Data processing:
- You may __choose__ how you would like to tokenize your text for this assignment
- You'll want to __deal with internal commas (commas inside of the sentences) appropriately__ when you read in the data, so use the python [`csv` module](https://docs.python.org/3/library/csv.html) or some other module to read the csv in (vs. splitting on commas).
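
As a minimal sketch of why splitting on commas goes wrong, the snippet below compares a naive `split(",")` with `csv.DictReader` on a tiny in-memory sample. The two rows are made up for illustration, and the column names `id`, `text`, and `author` are an assumption about the layout of the provided file, so check the header of your own copy.

```python
import csv
import io

# Hypothetical two-row sample; the column names (id, text, author) are an
# assumption about the provided CSV, not taken from the starter code.
sample = io.StringIO(
    'id,text,author\n'
    'id00001,"It was dark, cold, and silent.",EAP\n'
    'id00002,No commas here.,MWS\n'
)

# Naive approach: splitting the first data row on commas breaks the quoted
# sentence into several pieces.
naive = sample.getvalue().splitlines()[1].split(",")

# csv.DictReader respects the quotes, so the sentence stays in one field.
rows = list(csv.DictReader(sample))
texts = [row["text"] for row in rows]

print(len(naive))  # more than the 3 fields the row really has
print(texts[0])
```

With a real file you would pass an open file handle (opened with `newline=""`, as the `csv` docs recommend) in place of the `io.StringIO` object.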

Warnings:
- You might see:
```
notebook controller is DISPOSED. 
View Jupyter log for further details.
```
This is not an error per se--go to the last cell that ran successfully (or the first cell) and run the cells that follow one-by-one, waiting for each to finish before moving to the next.


Names
----
Names:<br>
__Katherine Aristizabal Norena__<br>
__Jose Meza Llamosas__<br>

Task 1: Provided Data Write-Up (7 points)
---

Every time you use a data set in an NLP application (or in any software application), you should be able to answer a set of questions about that data. Answer these now. Default to no more than 1 sentence per question needed. If more explanation is necessary, do give it.

This is about the __provided__ 🎃 spooky 👻 authors 🧟 data set. Please __bold__ your answers to all written questions! Each row in this dataset represents one sentence.

1. Where did you get the data from? The provided dataset is the training data from: https://www.kaggle.com/competitions/spooky-author-identification 
2. (1 pt) How was the data collected (where did the people acquiring the data get it from and how)? 
__The dataset was collected from public-domain works of fiction by three spooky authors: Edgar Allan Poe, HP Lovecraft, and Mary Shelley. The data was prepared by chunking larger texts into sentences using CoreNLP's MaxEnt sentence tokenizer.__
3. (1 pt) What is your data? (i.e. newswire, tweets, books, blogs, etc)
__Text extracts from books written by spooky authors__
4. (1 pt) Who produced the data? (who were the authors of the text? Your answer might be a specific person or a particular group of people) __The collection was assembled by Kaggle, but the authors of the books from which the text extracts were taken are Edgar Allan Poe, HP Lovecraft, and Mary Shelley.__
5. (1 pt) How large is the dataset? (# texts/sentences, # total tokens by word) __19579 total texts/sentences and 634080 total tokens by word__
6. (1 pt) What are the minimum, maximum, and average sentence lengths (by tokens) in your dataset? __Maximum length is 878 tokens, minimum length is 6 tokens, and the average is 32.39 tokens__
7. (2 pts) How large is the vocabulary, both tokenized by character and by word? __The vocabulary size is 25374 tokens when tokenized by word and 60 tokens when tokenized by character.__


In [None]:
# import your libraries here
import pandas as pd
import numpy as np
# Remember to restart your kernel if you change the contents of this file!
import neurallm_utils as nutils

In [None]:
# code that you need to answer the above questions here!
# but make sure that you leave the answers you want us to grade in the markdown!

# Loading dataset
TRAIN_FILE = "spooky_author_train.csv"
data = pd.read_csv(TRAIN_FILE)
# Tokenize each sentence once by word and once by character
tokens_word = nutils.read_file_spooky(TRAIN_FILE, by_character=False, ngram=1)
tokens_char = nutils.read_file_spooky(TRAIN_FILE, by_character=True, ngram=1)

In [None]:
print(f"Number of texts/sentences: {len(data)}")
print(f"Number of tokens: {sum(len(sentence) for sentence in tokens_word)}")

Number of texts/sentences: 19579
Number of tokens: 634080


In [None]:
# Number of word tokens in each sentence
count_tokens = [len(sentence) for sentence in tokens_word]
max_len = np.max(count_tokens)
min_len = np.min(count_tokens)
mean_len = np.mean(count_tokens)

In [None]:
print(f"The max length by tokens is: {max_len: .2f}")
print(f"The min length by tokens is: {min_len: .2f}")
print(f"The average length by tokens is: {mean_len: .2f}")

The max length by tokens is:  878.00
The min length by tokens is:  6.00
The average length by tokens is:  32.39


In [None]:
# Vocabulary = set of unique tokens across all sentences
tokens_word_total = {item for sublist in tokens_word for item in sublist}
tokens_char_total = {item for sublist in tokens_char for item in sublist}

In [None]:
print(f"The vocabulary size for tokenized by word is {len(tokens_word_total)} tokens and for tokenized by character is {len(tokens_char_total)} tokens")

The vocabulary size for tokenized by word is 25374 tokens and for tokenized by character is 60 tokens
