## Lab Assignment Seven: Sequential Network Architectures

Team: Katie Laird, Cameron Miller, Will Landin

Dataset: 

Select a text dataset. For good generalization performance, it helps to have a medium-sized dataset of similarly sized text documents. Binary or multi-class classification is fine, but the task should be "many-to-one" sequence classification.

Dataset: https://www.kaggle.com/datasets/datatattle/covid-19-nlp-text-classification/

In [1]:
import numpy as np
import pandas as pd
import re

### Preparation

[1 point] Define and prepare your class variables. Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed). Describe the final dataset that is used for classification/regression (include a description of any newly formed variables you created). Discuss the tokenization methods used on your dataset, as well as any decision to force a specific sequence length.

In [2]:
# read in the dataset as a pandas dataframe
df = pd.read_csv('oscar_speech_db.csv')
df.head(5)

Unnamed: 0,Year,Category,Film Title,Winner,Presenter,Date & Venue,Speech
0,1939 (12th) Academy Awards,Actress,Gone with the Wind,Vivien Leigh,Spencer Tracy,"February 29, 1940; Ambassador Hotel, Cocoanut ...","VIVIEN LEIGH:\r\nLadies and gentlemen, please..."
1,1939 (12th) Academy Awards,Actress in a Supporting Role,Gone with the Wind,Hattie McDaniel,Fay Bainter,"February 29, 1940; Ambassador Hotel, Cocoanut ...",HATTIE McDANIEL:\r\nAcademy of Motion Picture...
2,1941 (14th) Academy Awards,Actor in a Supporting Role,How Green Was My Valley,Donald Crisp,James Stewart,"February 26, 1942; Biltmore Hotel, Biltmore Bo...","DONALD CRISP:\r\nLadies and gentlemen, it's a..."
3,1941 (14th) Academy Awards,Actress,Suspicion,Joan Fontaine,Ginger Rogers,"February 26, 1942; Biltmore Hotel, Biltmore Bo...",JOAN FONTAINE:\r\nI want to thank the ladies ...
4,1941 (14th) Academy Awards,Actress in a Supporting Role,The Great Lie,Mary Astor,Ginger Rogers,"February 26, 1942; Biltmore Hotel, Biltmore Bo...","MARY ASTOR:\r\nLadies and gentlemen, twenty-t..."


In [3]:
# print the number of rows in the dataset
print("Number of rows in the dataset:", len(df))

Number of rows in the dataset: 1669


In [4]:
# keep only the four-digit year at the start of the Year column, e.g. "1939 (12th) Academy Awards" -> "1939"
df['Year'] = df['Year'].str[:4]
df.head(5)

Unnamed: 0,Year,Category,Film Title,Winner,Presenter,Date & Venue,Speech
0,1939,Actress,Gone with the Wind,Vivien Leigh,Spencer Tracy,"February 29, 1940; Ambassador Hotel, Cocoanut ...","VIVIEN LEIGH:\r\nLadies and gentlemen, please..."
1,1939,Actress in a Supporting Role,Gone with the Wind,Hattie McDaniel,Fay Bainter,"February 29, 1940; Ambassador Hotel, Cocoanut ...",HATTIE McDANIEL:\r\nAcademy of Motion Picture...
2,1941,Actor in a Supporting Role,How Green Was My Valley,Donald Crisp,James Stewart,"February 26, 1942; Biltmore Hotel, Biltmore Bo...","DONALD CRISP:\r\nLadies and gentlemen, it's a..."
3,1941,Actress,Suspicion,Joan Fontaine,Ginger Rogers,"February 26, 1942; Biltmore Hotel, Biltmore Bo...",JOAN FONTAINE:\r\nI want to thank the ladies ...
4,1941,Actress in a Supporting Role,The Great Lie,Mary Astor,Ginger Rogers,"February 26, 1942; Biltmore Hotel, Biltmore Bo...","MARY ASTOR:\r\nLadies and gentlemen, twenty-t..."


In [5]:
# drop every row that has a missing value in any column
df = df.dropna()
df.head(5)

Unnamed: 0,Year,Category,Film Title,Winner,Presenter,Date & Venue,Speech
0,1939,Actress,Gone with the Wind,Vivien Leigh,Spencer Tracy,"February 29, 1940; Ambassador Hotel, Cocoanut ...","VIVIEN LEIGH:\r\nLadies and gentlemen, please..."
1,1939,Actress in a Supporting Role,Gone with the Wind,Hattie McDaniel,Fay Bainter,"February 29, 1940; Ambassador Hotel, Cocoanut ...",HATTIE McDANIEL:\r\nAcademy of Motion Picture...
2,1941,Actor in a Supporting Role,How Green Was My Valley,Donald Crisp,James Stewart,"February 26, 1942; Biltmore Hotel, Biltmore Bo...","DONALD CRISP:\r\nLadies and gentlemen, it's a..."
3,1941,Actress,Suspicion,Joan Fontaine,Ginger Rogers,"February 26, 1942; Biltmore Hotel, Biltmore Bo...",JOAN FONTAINE:\r\nI want to thank the ladies ...
4,1941,Actress in a Supporting Role,The Great Lie,Mary Astor,Ginger Rogers,"February 26, 1942; Biltmore Hotel, Biltmore Bo...","MARY ASTOR:\r\nLadies and gentlemen, twenty-t..."


In [6]:
# Change the Year column to an integer type
df['Year'] = df['Year'].astype(int)

# Define a function to clean the speech
def clean_speech(speech):
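    # Remove a leading all-caps speaker label; mixed-case names such as "McDANIEL" are not matched
    speech = re.sub(r"^[A-Z ]+(?::|\r|\n)", '', speech)
    # Remove everything except letters, digits, spaces, and basic punctuation
    # (this also drops hyphens and line breaks, so adjacent words can merge)
    speech = re.sub(r"[^a-zA-Z0-9.,'!? ]", '', speech)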
    # Remove any excess whitespace
    speech = re.sub(r"\s+", ' ', speech).strip()
    return speech

# Apply the function to clean the 'Speech' column
df['Speech'] = df['Speech'].apply(clean_speech)
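The speaker-label pattern in `clean_speech` only accepts capital letters and spaces, so a label such as `HATTIE McDANIEL:` is not removed. A minimal sketch of a more permissive pattern follows. The names `SPEAKER_LABEL` and `strip_speaker_label` are hypothetical, and the sketch assumes each raw transcript opens with a short `NAME:` label, which has not been verified across the whole dataset.

```python
import re

# Hypothetical pattern: a short leading run of letters, spaces, periods,
# apostrophes, or hyphens that ends in a colon, plus any whitespace after it.
# Assumption: every raw transcript opens with a "NAME:" label.
SPEAKER_LABEL = re.compile(r"^[A-Za-z .'\-]{1,40}:\s*")

def strip_speaker_label(speech: str) -> str:
    """Remove a leading 'NAME:' label, if present, from a raw transcript."""
    return SPEAKER_LABEL.sub("", speech, count=1)

print(strip_speaker_label("HATTIE McDANIEL:\r\nAcademy of Motion Picture Arts"))
print(strip_speaker_label("VIVIEN LEIGH:\r\nLadies and gentlemen, please"))
```

A speech that has no label but contains a colon within its first few words would also be trimmed by this pattern, so it would be worth spot-checking the results before relying on it.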

In [7]:
df.head(5)

Unnamed: 0,Year,Category,Film Title,Winner,Presenter,Date & Venue,Speech
0,1939,Actress,Gone with the Wind,Vivien Leigh,Spencer Tracy,"February 29, 1940; Ambassador Hotel, Cocoanut ...","Ladies and gentlemen, please forgive me if my ..."
1,1939,Actress in a Supporting Role,Gone with the Wind,Hattie McDaniel,Fay Bainter,"February 29, 1940; Ambassador Hotel, Cocoanut ...",HATTIE McDANIELAcademy of Motion Picture Arts ...
2,1941,Actor in a Supporting Role,How Green Was My Valley,Donald Crisp,James Stewart,"February 26, 1942; Biltmore Hotel, Biltmore Bo...","Ladies and gentlemen, it's almost impossible t..."
3,1941,Actress,Suspicion,Joan Fontaine,Ginger Rogers,"February 26, 1942; Biltmore Hotel, Biltmore Bo...",I want to thank the ladies and gentlemen that ...
4,1941,Actress in a Supporting Role,The Great Lie,Mary Astor,Ginger Rogers,"February 26, 1942; Biltmore Hotel, Biltmore Bo...","Ladies and gentlemen, twentytwo years ago this..."


In [8]:
# print the speech from row 1, where the speaker label survived cleaning
print(df.at[1, 'Speech'])

# delete the leftover "HATTIE McDANIEL" label from the start of the speech
df.at[1, 'Speech'] = re.sub(r"^HATTIE McDANIEL", '', df.at[1, 'Speech']).strip()

df.head(5)

HATTIE McDANIELAcademy of Motion Picture Arts and Sciences, fellow members of the motion picture industry and honored guests. This is one of the happiest moments of my life, and I want to thank each one of you who had a part in selecting me for one of the awards for your kindness. It has made me feel very, very humble and I shall always hold it as a beacon for anything I may be able to do in the future. I sincerely hope I shall always be a credit to my race and to the motion picture industry. My heart is too full to tell you just how I feel. And may I say thank you and God bless you.


Unnamed: 0,Year,Category,Film Title,Winner,Presenter,Date & Venue,Speech
0,1939,Actress,Gone with the Wind,Vivien Leigh,Spencer Tracy,"February 29, 1940; Ambassador Hotel, Cocoanut ...","Ladies and gentlemen, please forgive me if my ..."
1,1939,Actress in a Supporting Role,Gone with the Wind,Hattie McDaniel,Fay Bainter,"February 29, 1940; Ambassador Hotel, Cocoanut ...","Academy of Motion Picture Arts and Sciences, f..."
2,1941,Actor in a Supporting Role,How Green Was My Valley,Donald Crisp,James Stewart,"February 26, 1942; Biltmore Hotel, Biltmore Bo...","Ladies and gentlemen, it's almost impossible t..."
3,1941,Actress,Suspicion,Joan Fontaine,Ginger Rogers,"February 26, 1942; Biltmore Hotel, Biltmore Bo...",I want to thank the ladies and gentlemen that ...
4,1941,Actress in a Supporting Role,The Great Lie,Mary Astor,Ginger Rogers,"February 26, 1942; Biltmore Hotel, Biltmore Bo...","Ladies and gentlemen, twentytwo years ago this..."
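The preparation prompt asks about tokenization and forcing a fixed sequence length. The sketch below shows one common approach in plain Python: word-level tokens, an integer vocabulary, and truncation or right-padding to `max_len`. The function names, the reserved ids, the sample sentences, and `max_len=8` are all illustrative assumptions, not the choices made for this lab.

```python
import re
from collections import Counter

PAD, UNK = 0, 1  # assumed reserved ids: padding and out-of-vocabulary

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_vocab(texts, max_words=None):
    """Map the most frequent tokens to integer ids starting at 2."""
    counts = Counter(tok for t in texts for tok in tokenize(t))
    return {w: i + 2 for i, (w, _) in enumerate(counts.most_common(max_words))}

def encode(text, vocab, max_len):
    """Convert text to exactly max_len ids: truncate the tail, pad on the right."""
    ids = [vocab.get(tok, UNK) for tok in tokenize(text)][:max_len]
    return ids + [PAD] * (max_len - len(ids))

# Toy sentences standing in for df['Speech'] (illustrative only)
speeches = ["Thank you, thank you all.", "I want to thank the Academy."]
vocab = build_vocab(speeches)
encoded = [encode(s, vocab, max_len=8) for s in speeches]
print(encoded)
```

In practice `max_len` is often picked from the distribution of speech lengths (for example, a high percentile), so that few speeches are truncated while padding stays modest.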


[1 point] Choose and explain what metric(s) you will use to evaluate your algorithm’s performance. You should give a detailed argument for why each metric is appropriate for your data. That is, why is the metric appropriate for the task (e.g., in terms of the business case for the task)? Please note: accuracy is rarely the best evaluation metric to use. Think deeply about an appropriate measure of performance.
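As an illustration of why accuracy can mislead, the sketch below computes macro-averaged F1, one candidate metric for imbalanced multi-class problems, in plain Python. It is not necessarily the metric chosen for this lab, and the labels `"A"` and `"B"` are made up.

```python
from collections import Counter

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    f1s = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        f1s.append(2 * tp[c] / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

# Made-up labels: a classifier that always predicts the majority class
y_true = ["A", "A", "A", "A", "B"]
y_pred = ["A", "A", "A", "A", "A"]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, macro_f1(y_true, y_pred))
```

Here the always-majority classifier scores 0.8 accuracy while its macro F1 is much lower, because the minority class contributes an F1 of zero.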

[1 point] Choose the method you will use for dividing your data into training and testing (e.g., stratified 10-fold cross-validation or shuffle splits) and explain why. Explain why your chosen method is appropriate, or use more than one method as appropriate. Convince me that your train/test splitting method realistically mirrors how an algorithm would be used in practice.
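For reference, stratified k-fold splitting can be sketched in a few lines of plain Python. This is a simplified stand-in for a library implementation, with a hypothetical function name and toy labels, and it makes no claim about which splitting method suits this dataset.

```python
import random
from collections import defaultdict

def stratified_kfold(labels, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs with class proportions roughly preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(n_splits)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Deal each class's examples round-robin across the folds
        for j, i in enumerate(idxs):
            folds[j % n_splits].append(i)
    for k in range(n_splits):
        test = sorted(folds[k])
        train = sorted(i for f in folds[:k] + folds[k + 1:] for i in f)
        yield train, test

# Toy labels: 10 examples of class "a" and 5 of class "b"
labels = ["a"] * 10 + ["b"] * 5
splits = list(stratified_kfold(labels, n_splits=5))
for train, test in splits:
    print(len(train), len(test), [labels[i] for i in test])
```

Each test fold in this toy example keeps the 2:1 class ratio of the full label list, which is the property that makes stratification attractive when some classes are rare.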

### Modeling

[3 points] Investigate at least two different sequential network architectures (e.g., a CNN and a Transformer). Alternatively, you may also choose a recurrent network and Transformer network. Be sure to use an embedding layer (try to use a pre-trained embedding, if possible). Adjust one hyper-parameter of each network to potentially improve generalization performance (train a total of at least four models). Visualize the performance of training and validation sets versus the training iterations, showing that the models converged.
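A pre-trained embedding is usually wired in by building a matrix whose row `i` holds the vector for token id `i`, then using that matrix as the embedding layer's initial weights. The sketch below assumes the plain-text GloVe layout (one `word v1 v2 ...` line per token) and uses a tiny in-memory stand-in instead of a real file such as `glove.6B.100d.txt`. The function names and the toy vocabulary are hypothetical.

```python
import io

def load_vectors(lines):
    """Parse 'word v1 v2 ...' lines into a dict of float lists."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 3:  # skip blank lines or a 'count dim' header line
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def build_embedding_matrix(vocab, vectors, dim):
    """Row i holds the vector for the token with id i; unknown tokens stay zero."""
    matrix = [[0.0] * dim for _ in range(max(vocab.values()) + 1)]
    for word, idx in vocab.items():
        vec = vectors.get(word)
        if vec is not None and len(vec) == dim:
            matrix[idx] = vec
    return matrix

# Toy 3-dimensional "embedding file" (illustrative values only)
toy_file = io.StringIO("thank 0.1 0.2 0.3\nyou 0.4 0.5 0.6\n")
vectors = load_vectors(toy_file)
vocab = {"thank": 2, "you": 3, "academy": 4}  # hypothetical ids; 0 and 1 reserved
matrix = build_embedding_matrix(vocab, vectors, dim=3)
print(matrix)
```

With a real file, the same `load_vectors` call would take an open file handle, and the resulting matrix could be converted to an array and passed to the framework's embedding layer as frozen or trainable weights.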

[1 point] Using the best parameters and architecture from the Transformer in the previous step, add a second multi-headed self-attention layer to your network. That is, the input to the second attention layer should be the output sequence of the first attention layer. Visualize the performance of training and validation sets versus the training iterations.
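The stacking described here can be illustrated with a deliberately simplified sketch: single-head scaled dot-product self-attention in plain Python, applied twice so that the second layer consumes the output sequence of the first. It omits multiple heads, residual connections, layer normalization, and learned weights (identity matrices are used as placeholders), so it shows only the data flow, not a trainable Transformer block.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over a (seq_len, d) input."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

# Toy sequence of 3 tokens with 2 features each; identity placeholder weights
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out1 = self_attention(x, identity, identity, identity)
out2 = self_attention(out1, identity, identity, identity)  # second layer reads out1
print(len(out2), len(out2[0]))
```

Because each layer returns a sequence with the same length and width as its input, layers of this kind can be chained directly, which is what makes adding a second attention layer a small change to the network.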

[2 points] Use the method of train/test splitting and evaluation criteria that you argued for at the beginning of the lab. Visualize the results of all the models you trained. Use proper statistical comparison techniques to determine which method(s) is (are) superior.
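One common way to compare two models evaluated on the same folds is a paired t-test on the per-fold scores. The sketch below computes only the t statistic in plain Python. The scores are made up for illustration, and the function name is hypothetical. Fold scores from cross-validation are not fully independent, so the result should be read as approximate.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(scores_a, scores_b):
    """t statistic for the mean per-fold difference between two models."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("all fold differences are identical; t is undefined")
    return mean(diffs) / (sd / math.sqrt(len(diffs)))

# Made-up per-fold scores for two models over the same 10 folds
model_a = [0.71, 0.69, 0.74, 0.70, 0.72, 0.68, 0.73, 0.71, 0.70, 0.72]
model_b = [0.66, 0.67, 0.70, 0.69, 0.68, 0.66, 0.69, 0.70, 0.65, 0.69]
t = paired_t_statistic(model_a, model_b)

# Two-sided critical value for alpha = 0.05 with 10 - 1 = 9 degrees of freedom
T_CRITICAL = 2.262
print(t, abs(t) > T_CRITICAL)
```

If `abs(t)` exceeds the critical value, the difference in mean score is unlikely to be due to fold-to-fold noise alone under the test's assumptions; otherwise the two models cannot be separated at that confidence level.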

### Exceptional Work

[1 point] Use the pre-trained ConceptNet Numberbatch embedding and compare to pre-trained GloVe. Which method is better for your specific application?
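Before comparing downstream performance, a quick preliminary check is how much of the dataset's vocabulary each embedding covers. The sketch below computes type and token coverage in plain Python. The word counts and the two word sets are made up, and `coverage` is a hypothetical helper. Higher coverage does not by itself show that an embedding performs better, so the real comparison would still rest on the chosen evaluation metric.

```python
def coverage(vocab_counts, embedding_words):
    """Return (type coverage, token coverage) of a vocabulary by an embedding."""
    known = [w for w in vocab_counts if w in embedding_words]
    type_cov = len(known) / len(vocab_counts)
    token_cov = sum(vocab_counts[w] for w in known) / sum(vocab_counts.values())
    return type_cov, token_cov

# Made-up token counts and made-up embedding vocabularies (illustrative only)
vocab_counts = {"thank": 3, "you": 2, "academy": 1}
embedding_a_words = {"thank", "you"}
embedding_b_words = {"thank", "you", "academy"}

print(coverage(vocab_counts, embedding_a_words))
print(coverage(vocab_counts, embedding_b_words))
```

Type coverage counts distinct words, while token coverage weights each word by how often it appears, so a missing rare word hurts the first number more than the second.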