<a href="https://colab.research.google.com/github/jessica-guan/Python-DataSci-ML/blob/main/Natural%20Language%20Processing%3A%20Sentiment%20Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Homework 22: Natural Language Processing Review**
---

### **Description**
In this week's homework, you will review how to use neural networks with text vectorization and embedding layers to perform NLP tasks such as sentiment classification.

<br>

### **Structure**
**Part 1**: IMDB Sentiment Classification




<br>

### **Cheat Sheets**
[Natural Language Processing II](https://docs.google.com/document/d/1p3xVUL1F6SEkusCI4klPLYqQwCkVN5s00ZvJjBpiSqM/edit?usp=sharing)

<br>

**Before starting, run the code below to import all necessary functions and libraries.**

In [None]:
# Core data libraries
import numpy as np
import pandas as pd

import tensorflow as tf
import os

# Keras building blocks; the wildcard import provides Input, TextVectorization,
# Embedding, Flatten, Dense, Conv1D, and MaxPooling1D used below
from keras.models import Sequential
from keras.layers import *
from keras.optimizers import Adam, SGD
from keras.utils import to_categorical

from sklearn.model_selection import train_test_split

from random import choices

# Silence library warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

<a name="p1"></a>

---
## **Part 1: IMDB Sentiment Classification**
---

In this part, we will build neural network models for the IMDB sentiment classification dataset: first a Dense model and then, optionally, a CNN. This is a dataset of 25,000 movie reviews with sentiment labels: 0 for negative and 1 for positive.

<br>


**Run the code provided below to import the dataset.**

In [None]:
url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vTdgncgNHtppfS89LHOh1kGl5tYzoEUrUwmOPOQF7mQ0U5Rzba27H45imvZ06_J2x0-wCJySylP5V3_/pub?gid=1712575053&single=true&output=csv'

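# Load the reviews into a DataFrame
df = pd.read_csv(url)
df.head()

# Hold out 20% of the reviews for testing (fixed seed for reproducibility)
x_train, x_test, y_train, y_test = train_test_split(df["review"], df["sentiment"], test_size = 0.2, random_state = 42)

# Convert the review text to NumPy arrays and one-hot encode the 0/1 labels
x_train = np.array(x_train)
x_test = np.array(x_test)
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)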
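
For intuition, `to_categorical` turns each integer label into a one-hot row, which is why the output layer later has two units. Below is a minimal plain-Python sketch of that idea. The helper name `one_hot` and the sample labels are made up for illustration; this is not Keras's implementation.

```python
def one_hot(labels, num_classes=None):
    """Return one one-hot row per integer label (illustrative helper)."""
    labels = list(labels)
    if num_classes is None:
        # Assumption: classes are numbered 0..max(labels), as with 0/1 sentiment.
        num_classes = max(labels) + 1
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0  # mark the position of the true class
        rows.append(row)
    return rows


sample_labels = [0, 1, 1, 0]  # hypothetical sentiment labels
encoded = one_hot(sample_labels)
print(encoded)
```

A negative review (label 0) becomes `[1.0, 0.0]` and a positive review (label 1) becomes `[0.0, 1.0]`, matching the two softmax outputs used later.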

### **Problem #1.1: Create the `TextVectorization` layer**


To get started, let's create a `TextVectorization` layer to vectorize this data.

Specifically,
1. Initialize the layer with the specified parameters.

2. Adapt the layer to the training data.

3. Look at the newly built vocabulary.

#### **1. Initialize the layer with the specified parameters.**

* The vocabulary should be at most 5000 words.
* The layer's output should always be 64 integers.

In [None]:
vectorize_layer = TextVectorization(
    max_tokens = 5000,
    output_mode = 'int',
    output_sequence_length = 64
)

#### **2. Adapt the layer to the training data.**

In [None]:
vectorize_layer.adapt(x_train)

#### **3. Look at the newly built vocabulary.**

In [None]:
vectorize_layer.get_vocabulary()[:50]

['',
 '[UNK]',
 'the',
 'and',
 'a',
 'of',
 'to',
 'is',
 'in',
 'it',
 'i',
 'this',
 'that',
 'br',
 'was',
 'as',
 'for',
 'with',
 'movie',
 'but',
 'film',
 'on',
 'not',
 'you',
 'are',
 'his',
 'have',
 'be',
 'he',
 'one',
 'its',
 'at',
 'all',
 'by',
 'an',
 'they',
 'from',
 'who',
 'so',
 'like',
 'just',
 'or',
 'her',
 'about',
 'if',
 'has',
 'out',
 'some',
 'there',
 'what']
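
The list above starts with two reserved entries: index 0 is the empty string used for padding, and index 1 is `[UNK]`, which stands in for out-of-vocabulary words. The remaining words are ordered from most to least frequent. The sketch below imitates that behavior in plain Python to show what the layer does conceptually. The functions `standardize`, `build_vocabulary`, and `vectorize` and the sample sentences are hypothetical, and the punctuation rule and tie-breaking are simplifying assumptions rather than Keras's exact implementation.

```python
import re
from collections import Counter


def standardize(text):
    """Lowercase, drop punctuation, and split on whitespace (approximation)."""
    cleaned = re.sub(r"[^\w\s]", "", text.lower())
    return cleaned.split()


def build_vocabulary(texts, max_tokens):
    """Reserved '' and '[UNK]' entries first, then words by descending frequency."""
    counts = Counter()
    for text in texts:
        counts.update(standardize(text))
    # Assumption: ties are broken alphabetically; Keras may order ties differently.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    # Two slots are reserved, so at most max_tokens - 2 real words fit.
    words = [word for word, _ in ranked[: max_tokens - 2]]
    return ["", "[UNK]"] + words


def vectorize(text, vocabulary, sequence_length):
    """Map words to indices, then truncate or zero-pad to a fixed length."""
    index = {word: i for i, word in enumerate(vocabulary)}
    ids = [index.get(token, 1) for token in standardize(text)]  # 1 = '[UNK]'
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))  # 0 = padding


sample_texts = [  # hypothetical reviews
    "The movie was great!",
    "The movie was bad.",
    "Great acting, great film",
]
vocabulary = build_vocabulary(sample_texts, max_tokens=6)
print(vocabulary)
print(vectorize("The film was GREAT!!", vocabulary, sequence_length=6))
```

Stripping punctuation this way is also the likely reason `br` shows up in the vocabulary: an HTML `<br />` tag inside a review would reduce to the token `br`.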

### **Problem #1.2: Build and Train a Dense model**

Complete the code below to build a model with the following layers.

An Embedding layer such that:
- The vocabulary contains 5000 tokens.
- The input length matches the output length of the vectorization layer (64 integers).
- Each token is mapped to a vector of 128 values.

<br>

Hidden layers such that:

- There's at least one Dense layer.

<br>

A Dense layer for outputting classification probabilities for "negative" or "positive" labels.

In [None]:
model = Sequential()

# Input, Vectorization, and Embedding Layers
model.add(Input(shape=(1,), dtype=tf.string))
model.add(vectorize_layer)
model.add(Embedding(input_dim=5000, output_dim=128))

# Hidden Layers
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dense(128, activation='relu'))

# Output Layer
model.add(Dense(2, activation='softmax'))
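
To get a feel for the size of the Dense model above, the following sketch works out the flattened embedding size and the number of trainable parameters per layer by hand. The helper `dense_params` and the dictionary keys are names invented for this example; the layer sizes are taken from the code above.

```python
def dense_params(inputs, units):
    """Weights plus one bias per unit for a fully connected layer."""
    return inputs * units + units


vocab_size, embed_dim, seq_len = 5000, 128, 64

# Each of the 64 token ids is looked up in a 5000 x 128 table, so one review
# becomes a (64, 128) array, which Flatten turns into 64 * 128 values.
flattened = seq_len * embed_dim

params = {
    "embedding": vocab_size * embed_dim,      # one 128-value row per token
    "dense_64": dense_params(flattened, 64),
    "dense_128": dense_params(64, 128),
    "output": dense_params(128, 2),
}
total = sum(params.values())

for name, count in params.items():
    print(f"{name:>10}: {count:,}")
print(f"{'total':>10}: {total:,}")
```

These hand-computed counts should agree with what `model.summary()` reports for the same layers. Most of the parameters sit in the embedding table and in the first Dense layer that follows `Flatten`.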

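As an alternative, the cell below rebuilds `model` as a CNN, prints each layer's input and output shapes, and then trains and evaluates the model. Because the fitting code lives in this cell, the Dense model above is only trained if you run the printing, fitting, and evaluating steps without the CNN definition.
**Which architecture performs better?**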

In [None]:
# [OPTIONAL] USING CNNs
model = Sequential()

# Input, Vectorization, and Embedding Layers
model.add(Input(shape=(1,), dtype=tf.string))
model.add(vectorize_layer)
model.add(Embedding(input_dim = 5000, output_dim = 128, input_length = 64))

# Hidden Layers
model.add(Conv1D(filters = 16, kernel_size = 4, activation = 'relu'))
model.add(MaxPooling1D(pool_size = 3))
model.add(Flatten())

# Output Layer
model.add(Dense(2, activation = 'softmax'))



# Printing Structure
for layer in model.layers:
  print(str(layer.input_shape) + " -> " + str(layer.output_shape))
print("\n\n\n")



# Fitting
opt = Adam(learning_rate = 0.001)
model.compile(optimizer = opt, loss = 'categorical_crossentropy', metrics = ['accuracy'])
model.fit(x_train, y_train, epochs = 5, batch_size = 256)


# Evaluating
print("\n\n\n")
model.evaluate(x_train, y_train)
model.evaluate(x_test, y_test)

(None, 1) -> (None, 64)
(None, 64) -> (None, 64, 128)
(None, 64, 128) -> (None, 61, 16)
(None, 61, 16) -> (None, 20, 16)
(None, 20, 16) -> (None, 320)
(None, 320) -> (None, 2)
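
The printed shapes above can be checked by hand. With the Keras defaults (`'valid'` padding, a convolution stride of 1, and a pooling stride equal to the pool size), the output length of each layer follows a simple formula. The helper names below are made up for this sketch.

```python
def conv1d_output_length(length, kernel_size, stride=1):
    """Output length of a 1D convolution with 'valid' (no) padding."""
    return (length - kernel_size) // stride + 1


def maxpool1d_output_length(length, pool_size, stride=None):
    """Output length of 1D max pooling; the stride defaults to the pool size."""
    stride = pool_size if stride is None else stride
    return (length - pool_size) // stride + 1


seq_len, filters = 64, 16

after_conv = conv1d_output_length(seq_len, kernel_size=4)
after_pool = maxpool1d_output_length(after_conv, pool_size=3)
after_flatten = after_pool * filters

print(f"Conv1D:       (None, {after_conv}, {filters})")
print(f"MaxPooling1D: (None, {after_pool}, {filters})")
print(f"Flatten:      (None, {after_flatten})")
```

The results (61, 20, and 320) match the shapes printed by the notebook. Note that pooling discards the last position, because 61 is not a multiple of 3.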




Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5






[0.5171982645988464, 0.7810999751091003]
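
The pair of numbers above is `[loss, accuracy]` as returned by `model.evaluate`. It is presumably the test-set result, since a notebook cell only displays the value of its last expression. The sketch below shows, in plain Python and on made-up predictions, how categorical cross-entropy and accuracy are computed from one-hot labels and softmax outputs. The function names, the sample values, and the `eps` guard are assumptions for illustration.

```python
import math


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean over samples of -sum(true * log(predicted))."""
    losses = []
    for true_row, pred_row in zip(y_true, y_pred):
        # eps guards against log(0); its value is an assumption of this sketch.
        losses.append(
            -sum(t * math.log(max(p, eps)) for t, p in zip(true_row, pred_row))
        )
    return sum(losses) / len(losses)


def accuracy(y_true, y_pred):
    """Fraction of samples whose most probable class is the true class."""
    hits = sum(
        true_row.index(max(true_row)) == pred_row.index(max(pred_row))
        for true_row, pred_row in zip(y_true, y_pred)
    )
    return hits / len(y_true)


# Hypothetical one-hot labels and softmax outputs for four reviews
y_true = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
y_pred = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3]]

loss = categorical_crossentropy(y_true, y_pred)
acc = accuracy(y_true, y_pred)
print([loss, acc])
```

In this toy example the third review is misclassified, so accuracy is 3 out of 4, and that confident wrong prediction contributes the largest share of the loss.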

---
# End of notebook

© 2024 The Coding School, All rights reserved