<a href="https://colab.research.google.com/github/hellojohnkim/mmai891/blob/main/24_891_JohnKim.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MMAI 891: Individual Assignment

Version 1: Updated February 4, 2024

<font color='red'>\# TODO: fill in the below</font>

- John Kim
- 20439250
- MMAI 2024 891
- Genius Makers by Cade Metz
- Due April 21, 2024

# Assignment Instructions

This assignment contains two (2) questions and one (1) optional question for bonus marks. The questions and parts are wholly contained in this Google Colab Notebook.

You are to make a copy of this Notebook and edit the copy to provide your answers/solutions. You are to complete the assignment entirely within Google Colab. Why?

- It gives you practice using cloud-based interactive notebook environments (which is a popular workflow)
- It is easier for you to manage the environment (e.g., installing packages, etc.)
- Google Colab has nice, beefy machines, so you don't have to worry about running out of memory on your local computer.
- It will be easier for the TA to help you debug your code if you need help
- It will be easier for the TA to mark/run your code

## Questions

Each question has multiple tasks. There are two types of tasks: tasks that require you to write code and tasks that require you to write text responses.

For tasks that require **code**:
- Use Python to complete the task.
- You may use standard Python libraries, including scikit-learn, pandas, numpy, transformers, and simpletransformers.
- Tips:
  - Submit code that runs without errors.
  - Submit code that is reproducible. E.g., set random number seeds as appropriate. You should be able to run your code again and again and again, from the top of the file to the bottom of the file, and get the exact same results each time. I should be able to run your code, from scratch, again and again, and get the exact same results that you get.
  - Submit code that is organized. Make your code readable. Provide comments to describe what the code is doing and why. Don’t leave “old” code lying around. Overall, if your code is clear and easy to read, then we will be happy. When we are happy, we give better marks.
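
As a minimal sketch of the reproducibility tip above (not a required template), the cell below seeds Python's and NumPy's global generators and passes a fixed `random_state` to a scikit-learn splitter. The names `SEED` and `set_seeds` are hypothetical helpers introduced only for this illustration.

```python
import random

import numpy as np
from sklearn.model_selection import train_test_split

SEED = 42  # hypothetical constant; any fixed integer works


def set_seeds(seed: int = SEED) -> None:
    """Seed Python's and NumPy's global random number generators."""
    random.seed(seed)
    np.random.seed(seed)


# Re-seeding before each draw reproduces the same random numbers
set_seeds()
first_draw = np.random.rand(3)
set_seeds()
second_draw = np.random.rand(3)
print(np.array_equal(first_draw, second_draw))

# scikit-learn splitters and estimators take their own random_state argument
data = np.arange(20)
split_a = train_test_split(data, random_state=SEED)
split_b = train_test_split(data, random_state=SEED)
print(np.array_equal(split_a[0], split_b[0]))
```

Estimators with internal randomness (for example, random forests or LDA) need the same treatment through their own `random_state` parameter.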

For tasks that require **text responses**:
- Type your response in the Notebook cell indicated.
- Use English. Use proper grammar, spelling, and punctuation. Be professional and clear. Be complete, but not overly verbose.
- Feel free to use [Markdown syntax](https://www.markdownguide.org/basic-syntax/) to format your answer (i.e., add bold, italics, lists, tables).
- You may refer to your code in your answer. Please do so very clearly. E.g., “As can be seen on line X above …”


## What to Submit to the Course Portal

- You are to export your completed Notebook as a PDF file by clicking File->Print->Save as PDF.
- Please do not submit the Notebook (.ipynb) file to the course portal.
- Please submit the PDF export of the Notebook.
   - Please name the PDF file 23_891_FirstnameLastName.pdf
      - E.g., *23_891_StephenThomas.pdf*
   - Please make sure you have run all the cells so we can see the output!
   - Best practice: Before exporting to PDF, click Runtime->Restart and run all.



# Preliminaries: Inspect and Set up environment

In [1]:
import datetime
import pandas as pd
import numpy as np

In [None]:
print(datetime.datetime.now())

2023-02-28 13:48:24.907366


In [2]:
!which python

/usr/local/bin/python


In [3]:
!python --version

Python 3.10.12


In [4]:
!echo $PYTHONPATH

/env/python


In [None]:
# TODO: install any packages you need to here. For example:
#pip install unidecode

In [2]:
import string
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


# Question 1: Sentiment Analysis via Shallow ML


**Marking**

The coding parts (i.e., 1.a, 1.b, 1.c) will be marked based on:

- *Correctness*. Code clearly and fully performs the task specified.
- *Reproducibility*. Code is fully reproducible. I.e., you (and I) are able to run this Notebook again and again, from top to bottom, and get the same results each time.
- *Style*. Code is organized. All parts are commented with clear reasoning and rationale. No old code lying around. Code is easy to follow.


Parts 2 and 3 will be marked on:

- *Quality*. Response is well-justified and convincing. Response uses facts and data where possible.
- *Style*. Response uses proper grammar, spelling, and punctuation. Response is clear and professional. Response is complete, but not overly-verbose. Response follows length guidelines.


In [3]:
# DO NOT MODIFY THIS CELL

# First, we'll read the provided labeled training data
df = pd.read_csv("https://drive.google.com/uc?export=download&id=1b8MAiN-xBdk6scM-DnufkuijDZivZJqM")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2400 entries, 0 to 2399
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Sentence  2400 non-null   object
 1   Polarity  2400 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 37.6+ KB


In [4]:
# DO NOT MODIFY THIS CELL

# Next, we'll split it into training and test
from sklearn.model_selection import train_test_split

X = df['Sentence']
y = df['Polarity']

# So that we can evaluate how well our model is performing, we split our training data
# into training and validation.

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

## Part 1.a: Preprocessing and FE Pipeline

Clean and preprocess the data (i.e., `X_train`) as you see necessary. Extract features from the text (i.e., vectorization using BOW and/or Bag of N-Grams and/or topics and/or lexical features).


In [12]:
from scipy.sparse import csr_matrix, hstack

# Preprocessing

# Lowercase the text
X_train = X_train.str.lower()

# Remove punctuation
X_train = X_train.apply(lambda x: ''.join([char for char in x if char not in string.punctuation]))

# Remove stop words (a set makes the membership test fast)
stop = set(stopwords.words('english'))
X_train = X_train.apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))

# Feature Extraction
# Each feature set keeps its own fitted vectorizer so that the same
# transformation can be applied to the validation and test data later.

# Bag of Words
bow_vectorizer = CountVectorizer()
X_train_bow = bow_vectorizer.fit_transform(X_train)

# Bag of N-Grams (bigrams and trigrams)
ngram_vectorizer = CountVectorizer(ngram_range=(2, 3))
X_train_ngrams = ngram_vectorizer.fit_transform(X_train)

# Lexical Features
# TF-IDF
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

# Topics
# Latent Dirichlet Allocation (LDA), seeded for reproducibility
lda = LatentDirichletAllocation(n_components=10, random_state=42)
X_train_topics = lda.fit_transform(X_train_tfidf)

# Combine features
# The vectorizer outputs are sparse matrices, so they are stacked column-wise
# with scipy.sparse.hstack; np.hstack does not concatenate sparse matrices.
# The dense topic matrix keeps its (n_samples, n_topics) shape (no ravel).
X_train_combined = hstack(
    [X_train_bow, X_train_ngrams, csr_matrix(X_train_topics), X_train_tfidf],
    format="csr",
)

In [11]:
print(f"X_train_bow.shape: {X_train_bow.shape}")
print(f"X_train_ngrams.shape: {X_train_ngrams.shape}")
print(f"X_train_topics.shape: {X_train_topics.shape}")
print(f"X_train_tfidf.shape: {X_train_tfidf.shape}")

X_train_bow.shape: (1800, 3376)
X_train_ngrams.shape: (1800, 14551)
X_train_topics.shape: (1800, 10)
X_train_tfidf.shape: (1800, 3376)


In [13]:
print(f"X_train_combined.shape: {X_train_combined.shape}")

X_train_combined.shape: (1800, 21313)
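
The validation and test text must go through the same cleaning steps and be encoded with the vocabulary learned from the training text. The sketch below illustrates that fit-on-train, transform-on-validation pattern on toy data; `toy_train`, `toy_val`, `clean_text`, and `toy_vectorizer` are hypothetical names used only here, and the toy sentences are not from the assignment data.

```python
import string

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for X_train and X_val (hypothetical data)
toy_train = pd.Series(["The food was GREAT!", "Terrible service, never again."])
toy_val = pd.Series(["Great service!", "The pasta was terrible..."])


def clean_text(texts: pd.Series) -> pd.Series:
    """Lowercase and strip punctuation, mirroring the steps applied to X_train."""
    table = str.maketrans("", "", string.punctuation)
    return texts.str.lower().apply(lambda text: text.translate(table))


toy_vectorizer = CountVectorizer()
# fit_transform learns the vocabulary from the training text only
toy_train_bow = toy_vectorizer.fit_transform(clean_text(toy_train))
# transform reuses that vocabulary, so the columns line up with the training matrix
toy_val_bow = toy_vectorizer.transform(clean_text(toy_val))

print(toy_train_bow.shape, toy_val_bow.shape)
```

Words that appear only in the validation text (here, "pasta") are ignored at transform time, which is the intended behavior: the model never sees features it was not trained on.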



## Part 1.b: Model Training/Tuning/Cross Validation

Use your favorite shallow ML algorithm (such as decision trees, KNN, random forest, boosting variants) to train a classification model.  Don’t forget everything we’ve learned in the machine learning course: hyperparameter tuning, cross-validation, handling imbalanced data, etc. Make reasonable decisions and try to create the best-performing model that you can.
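
One possible shape for this step, sketched on toy data under stated assumptions: a scikit-learn `Pipeline` wraps the vectorizer and a random forest so that the vectorizer is refit inside each cross-validation fold, and `GridSearchCV` tunes both. The toy sentences, the parameter grid, and the choice of random forest with `class_weight="balanced"` are illustrative assumptions, not the required approach.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for X_train / y_train
toy_texts = pd.Series([
    "great food", "loved it", "really great service", "great value",
    "terrible food", "hated it", "really terrible service", "terrible value",
])
toy_labels = pd.Series([1, 1, 1, 1, 0, 0, 0, 0])

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    # class_weight="balanced" is one simple way to handle class imbalance
    ("clf", RandomForestClassifier(class_weight="balanced", random_state=42)),
])

# Illustrative grid: keys follow the "<step name>__<parameter>" convention
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__n_estimators": [25, 50],
}

# Stratified folds preserve the class ratio in every split
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="f1")
search.fit(toy_texts, toy_labels)

print(search.best_params_)
print(f"Best cross-validated F1: {search.best_score_:.3f}")
```

Keeping the vectorizer inside the pipeline avoids leaking vocabulary from the held-out fold into training. On the real data, more folds (e.g., 5) and a wider grid would be more typical.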


In [None]:
# TODO: insert code here

## Part 1.c: Model Assessment

Use your model to predict the sentiment of the testing data. Measure the performance (e.g., accuracy, AUC, F1-score) of your model.
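
The three metrics named above can be computed with `sklearn.metrics`, as in the minimal sketch below. The arrays `y_true` and `y_prob` are hypothetical values standing in for the real test labels and the model's predicted probabilities.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Hypothetical labels and model outputs (not real results)
y_true = np.array([1, 0, 1, 1, 0, 0])
y_prob = np.array([0.9, 0.2, 0.4, 0.8, 0.6, 0.1])  # predicted P(class = 1)
y_pred = (y_prob >= 0.5).astype(int)               # hard labels at a 0.5 threshold

accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_prob)  # AUC is computed from scores, not hard labels

print(f"Accuracy: {accuracy:.3f}")
print(f"F1-score: {f1:.3f}")
print(f"AUC:      {auc:.3f}")
```

With a fitted scikit-learn classifier, the scores for AUC would typically come from `predict_proba(...)[:, 1]` and the hard labels from `predict(...)`.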

In [None]:
# DO NOT MODIFY THIS CELL

test_df = pd.read_csv("https://drive.google.com/uc?export=download&id=1taoTluPBUMt9JkKAnlqDTrU49DJFpJGW")
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 600 entries, 0 to 599
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Sentence  600 non-null    object
 1   Polarity  600 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 9.5+ KB


In [None]:
# TODO: insert code here

## Part 2: Given the performance of your model, are you satisfied with the results? Explain.

Keep your response to 1000 characters or less.

TODO: Insert answer here.

## Part 3: Show five test instances in which your model was incorrect. Dive deep and find out why your model was wrong.

Keep your response to 1000 characters or less.

TODO: Insert answer here. (Feel free to create new code cells if necessary.)
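
A minimal sketch of how misclassified instances could be pulled out for inspection, assuming the predictions are stored alongside the test sentences. The `results` DataFrame and its `Predicted` column are hypothetical; the sentences and labels below are made up for illustration.

```python
import pandas as pd

# Hypothetical sentences, true labels, and predictions standing in for the test set
results = pd.DataFrame({
    "Sentence": ["not bad at all", "great food", "would not recommend", "awful"],
    "Polarity": [1, 1, 0, 0],
    "Predicted": [0, 1, 1, 0],
})

# Keep only the rows where the prediction disagrees with the true label
errors = results[results["Polarity"] != results["Predicted"]]

# Show up to five misclassified sentences with their true and predicted labels
print(errors.head(5).to_string(index=False))
```

In this toy example both errors involve negation ("not bad", "not recommend"), the kind of pattern a bag-of-words model can miss once word order and stop words are removed.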

# Question 2: Conceptual Understanding of the SOTA


**Marking**

The following questions will be marked on:

- *Quality*. Response is well-justified and convincing. Response uses facts and data where possible.
- *Style*. Response uses proper grammar, spelling, and punctuation. Response is clear and professional. Response is complete, but not overly-verbose. Response follows length guidelines.


## Part 1: What are transfer learning and fine-tuning in NLP? What advantages do they have over training from scratch?

Keep your response to 1000 characters or less.

TODO: Insert answer here. (Feel free to create new code cells if necessary.)

## Part 2: What is a Large Language Model (LLM), and what are its strengths and weaknesses?

Keep your response to 1000 characters or less.

TODO: Insert answer here. (Feel free to create new code cells if necessary.)

# Question 3 (Optional/Bonus): Sentiment Analysis via Deep ML

This question is optional and is worth up to 5 extra credit marks.

Use deep learning (e.g., RNNs and variants, CNNs and variants, and/or transformers) to build a model on the same dataset as Q1 and compare the results with the Shallow ML model.

You may train your own deep ML model (using, e.g., the keras library) or fine-tune a pretrained deep ML model (using, e.g., the transformers library and the Hugging Face ecosystem).

In [None]:
# TODO: Insert code here.