<a href="https://colab.research.google.com/github/jessica-guan/Python-DataSci-ML/blob/main/Natural%20Language%20Processing%3A%20Text%20Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Lab 22: Natural Language Processing Review**
---

### **Description**
In today's lab, we will review how to implement neural networks for NLP tasks, with a focus on text classification.

<br>

### **Lab Structure**

**Part 1**: [Text Classification of Hotel Reviews](#p1)

**Part 2**: [Convolutional Neural Networks](#p2)


<br>

### **Goals**
By the end of this lab, you will:
* Understand how to apply vectorization and embedding layers in models.
* Compare a fully connected network to a CNN for text classification with embeddings.

<br>

### **Cheat Sheets**
[Natural Language Processing II](https://docs.google.com/document/d/1p3xVUL1F6SEkusCI4klPLYqQwCkVN5s00ZvJjBpiSqM/edit?usp=sharing)

<br>

**Before starting, run the code below to import all necessary functions and libraries.**

In [None]:
import numpy as np
import pandas as pd

import tensorflow as tf
import os

from keras.models import Sequential
from keras.layers import *
from keras.optimizers import Adam, SGD
from keras.utils import to_categorical

from sklearn.model_selection import train_test_split

from random import choices

import warnings
warnings.filterwarnings('ignore')

<a name="p1"></a>

---
## **Part 1: Text Classification of Hotel Reviews**
---

In this part, we will build a model on a TripAdvisor dataset of roughly 20,000 hotel reviews. Each row contains the `Review` text and a `Rating` on a scale of 1-5.

<br>


**Run the code provided below to import the dataset.**

In [None]:
url = 'https://raw.githubusercontent.com/the-codingschool/TRAIN-datasets/main/tripadvisor_reviews/tripadvisor_hotel_reviews.csv'
df = pd.read_csv(url)
df.head()

x_train, x_test, y_train, y_test = train_test_split(df["Review"], df["Rating"], test_size = 0.2, random_state = 42)

x_train = np.array(x_train)
x_test = np.array(x_test)

# Shift ratings from 1-5 to class indices 0-4, then one-hot encode them
y_train = to_categorical(y_train - 1, num_classes=5)
y_test = to_categorical(y_test - 1, num_classes=5)
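
The label preparation above subtracts 1 so that ratings 1-5 become class indices 0-4 before one-hot encoding. The following is a minimal pure-Python sketch of that mapping, not the Keras implementation; the helper name `one_hot_ratings` is hypothetical and used only for illustration.

```python
def one_hot_ratings(ratings, num_classes=5):
    """One-hot encode 1-based ratings (illustrative stand-in for
    to_categorical(y - 1, num_classes))."""
    encoded = []
    for rating in ratings:
        if not 1 <= rating <= num_classes:
            raise ValueError(f"rating {rating} is outside 1-{num_classes}")
        row = [0.0] * num_classes
        row[rating - 1] = 1.0  # rating 1 -> index 0, rating 5 -> index 4
        encoded.append(row)
    return encoded


example = one_hot_ratings([1, 5, 3])
for row in example:
    print(row)
```

Each row has a single 1.0 in the position of the review's rating, which is the target format that `categorical_crossentropy` expects.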

### **Problem #1.1: Create the `TextVectorization` layer**


To get started, let's create a `TextVectorization` layer to vectorize this data.

Specifically,
1. Initialize the layer with the specified parameters.

2. Adapt the layer to the training data.

3. Look at the newly built vocabulary.

#### **1. Initialize the layer with the specified parameters.**

* The vocabulary should be at most 5000 words.
* The layer's output should always be exactly 64 integers per review (shorter reviews are padded, longer ones truncated).

In [None]:
vectorize_layer = TextVectorization(
    max_tokens = 5000,
    output_mode = 'int',
    output_sequence_length = 64
  )

#### **2. Adapt the layer to the training data.**

In [None]:
vectorize_layer.adapt(x_train)

#### **3. Look at the newly built vocabulary.**

In [None]:
vectorize_layer.get_vocabulary()[:50]

['',
 '[UNK]',
 'hotel',
 'room',
 'not',
 'great',
 'nt',
 'good',
 'staff',
 'stay',
 'did',
 'just',
 'nice',
 'rooms',
 'no',
 'location',
 'stayed',
 'service',
 'time',
 'night',
 'beach',
 'clean',
 'day',
 'breakfast',
 'food',
 'like',
 'really',
 'resort',
 'place',
 'pool',
 'people',
 'friendly',
 'small',
 'little',
 'got',
 'walk',
 'excellent',
 'area',
 '2',
 'best',
 'helpful',
 'restaurant',
 'bar',
 'bathroom',
 'bed',
 'restaurants',
 'water',
 'recommend',
 'trip',
 'went']
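
The first two vocabulary entries are reserved: index 0 (`''`) is the padding token and index 1 (`'[UNK]'`) stands in for out-of-vocabulary words, followed by the remaining tokens in order of decreasing frequency. The sketch below imitates that behaviour in pure Python to show what the layer computes. It is an approximation under stated assumptions, not the Keras implementation: the helper names (`standardize`, `build_vocabulary`, `vectorize`) are hypothetical, and breaking frequency ties alphabetically is an assumption made here for determinism.

```python
import string
from collections import Counter


def standardize(text):
    """Lowercase and strip punctuation, similar to the layer's default."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def build_vocabulary(texts, max_tokens):
    """Rank tokens by frequency; indices 0 and 1 are reserved."""
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    # Assumption: ties are broken alphabetically so the result is deterministic.
    ranked = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return ["", "[UNK]"] + ranked[: max_tokens - 2]


def vectorize(text, vocabulary, sequence_length):
    """Map tokens to indices, then truncate or zero-pad to a fixed length."""
    index = {tok: i for i, tok in enumerate(vocabulary)}
    ids = [index.get(tok, 1) for tok in standardize(text).split()]  # 1 = [UNK]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))  # 0 = padding


toy_reviews = ["Great hotel, great staff!", "Hotel room was not clean."]
toy_vocab = build_vocabulary(toy_reviews, max_tokens=6)
toy_ids = vectorize("Great room, not clean", toy_vocab, sequence_length=6)
print(toy_vocab)
print(toy_ids)
```

In the toy run, `room` falls outside the six-token vocabulary and maps to index 1, and the four-token review is padded with zeros up to length 6.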

### **Problem #1.2: Build and Train a Dense model**

Complete the code below to build a model with the following layers.

An Embedding layer such that:
- The vocabulary contains 5000 tokens.
- The input length corresponds to the output of the vectorization layer (64 tokens per review).
- Each token is embedded as a 128-dimensional vector.

<br>

Hidden layers such that:

- There's at least one Dense layer.

<br>

A Dense layer for outputting classification probabilities for each of the possible ratings (1-5).

In [None]:
model = Sequential()

# Input, Vectorization, and Embedding Layers
model.add(Input(shape=(1,), dtype=tf.string))
model.add(vectorize_layer)
model.add(Embedding(input_dim=5000, output_dim=128))

# Hidden Layers
model.add(Flatten())
model.add(Dense(64, activation='relu'))

# Output Layer
model.add(Dense(5, activation='softmax'))

# Printing Structure
for layer in model.layers:
  print(str(layer.input_shape) + " -> " + str(layer.output_shape))
print("\n\n\n")

(None, 1) -> (None, 64)
(None, 64) -> (None, 64, 128)
(None, 64, 128) -> (None, 8192)
(None, 8192) -> (None, 64)
(None, 64) -> (None, 5)
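
The shapes printed above follow from simple arithmetic: the embedding turns each of the 64 token indices into a 128-dimensional vector, and `Flatten` concatenates them into 64 × 128 = 8192 values. The sketch below works through those sizes and the resulting trainable parameter counts for the Embedding and Dense layers. The variable names and the toy lookup table are illustrative only.

```python
# Sizes taken from the lab: 5000-token vocabulary, 64 tokens per review,
# 128-dimensional embeddings, one 64-unit hidden layer, 5 rating classes.
vocab_size, seq_len, embed_dim = 5000, 64, 128
hidden_units, num_classes = 64, 5

# Embedding: one trainable row of length embed_dim per vocabulary index.
embedding_params = vocab_size * embed_dim

# Flatten: (seq_len, embed_dim) becomes one long vector per review.
flattened_size = seq_len * embed_dim

# Dense layers: one weight per input per unit, plus one bias per unit.
hidden_params = flattened_size * hidden_units + hidden_units
output_params = hidden_units * num_classes + num_classes
total_params = embedding_params + hidden_params + output_params

print("flattened size:", flattened_size)
print("parameters:", embedding_params, hidden_params, output_params, total_params)

# An embedding is a lookup table: token index i selects row i.
toy_table = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4]]  # 3 tokens, 2 dimensions
toy_ids = [2, 1, 0]
toy_embedded = [toy_table[i] for i in toy_ids]
print(toy_embedded)
```

Almost half of the Embedding and Dense parameters sit in the first Dense layer, because it connects every one of the 8192 flattened values to every hidden unit.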






In [None]:
# Fitting
opt = Adam(learning_rate = 0.01)
model.compile(optimizer = opt, loss = 'categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=200, epochs=5)

# Evaluating
print("\n\n\n")
model.evaluate(x_train, y_train)
model.evaluate(x_test, y_test)


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5






[3.0836269855499268, 0.5667235851287842]
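
`model.evaluate` returns the loss followed by each compiled metric, so the pair above reads as `[categorical cross-entropy, accuracy]`. The following pure-Python sketch shows how the two numbers are defined, using two made-up predictions rather than real model output; the function names are hypothetical and the clipping constant `eps` is an assumption added to avoid `log(0)`.

```python
import math


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean over samples of -sum(true * log(predicted))."""
    losses = []
    for true_row, pred_row in zip(y_true, y_pred):
        losses.append(
            -sum(t * math.log(max(p, eps)) for t, p in zip(true_row, pred_row))
        )
    return sum(losses) / len(losses)


def accuracy(y_true, y_pred):
    """Fraction of samples whose most probable class matches the label."""
    hits = 0
    for true_row, pred_row in zip(y_true, y_pred):
        if true_row.index(max(true_row)) == pred_row.index(max(pred_row)):
            hits += 1
    return hits / len(y_true)


# Two illustrative reviews: true ratings 1 and 5 (one-hot encoded).
toy_true = [[1, 0, 0, 0, 0], [0, 0, 0, 0, 1]]
toy_pred = [[0.6, 0.1, 0.1, 0.1, 0.1], [0.1, 0.1, 0.6, 0.1, 0.1]]

toy_loss = categorical_crossentropy(toy_true, toy_pred)
toy_acc = accuracy(toy_true, toy_pred)
print(toy_loss, toy_acc)
```

The second toy prediction is confidently wrong, which contributes a much larger loss term than the first; a few confident mistakes can push the loss up even when accuracy looks moderate.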

<a name="p2"></a>

---
## **Part 2: Convolutional Neural Networks**
---

Complete the code below to train a new model that is identical to the one above, except that its hidden layers use any or all of the CNN layers that Keras provides. The goal is to create a model that performs as well as possible on the *test set*.

**Which architecture performs better?**

In [None]:
model = Sequential()

# Input, Vectorization, and Embedding Layers
model.add(Input(shape=(1,), dtype=tf.string))
model.add(vectorize_layer)
model.add(Embedding(input_dim=5000, output_dim=128))

# Hidden Layers
model.add(Flatten())
model.add(Dense(64, activation='relu'))

# Output Layer
model.add(Dense(5, activation='softmax'))

# Printing Structure
for layer in model.layers:
  print(str(layer.input_shape) + " -> " + str(layer.output_shape))
print("\n\n\n")

# Fitting
opt = Adam(learning_rate = 0.01)
model.compile(optimizer = opt, loss = 'categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=200, epochs=5)

# Evaluating
print("\n\n\n")
model.evaluate(x_train, y_train, verbose=0)
model.evaluate(x_test, y_test, verbose=0)

(None, 1) -> (None, 64)
(None, 64) -> (None, 64, 128)
(None, 64, 128) -> (None, 8192)
(None, 8192) -> (None, 64)
(None, 64) -> (None, 5)




Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5






[3.2103271484375, 0.5718467831611633]
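
One possible way to swap the `Flatten`/`Dense` hidden layers for convolutional ones is sketched below. It is an illustrative starting point rather than a reference solution: the filter count, kernel size, and pooling choice are assumptions, and no accuracy is claimed for it. To keep the sketch self-contained it takes already-vectorized integer sequences as input; in the lab, the `Input(shape=(1,), dtype=tf.string)` and `vectorize_layer` lines would stay as they are.

```python
from keras.models import Sequential
from keras.layers import Input, Embedding, Conv1D, GlobalMaxPooling1D, Dense

cnn_model = Sequential()

# Assumption: inputs are 64 integer token indices per review,
# i.e. the output of the vectorization layer.
cnn_model.add(Input(shape=(64,), dtype="int32"))
cnn_model.add(Embedding(input_dim=5000, output_dim=128))

# Hidden layers: each filter scans windows of 5 consecutive token embeddings.
cnn_model.add(Conv1D(filters=64, kernel_size=5, activation="relu"))
# Keep each filter's strongest response, wherever it occurs in the review.
cnn_model.add(GlobalMaxPooling1D())

# Output layer: one probability per rating (1-5).
cnn_model.add(Dense(5, activation="softmax"))

cnn_model.compile(
    optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"]
)
cnn_model.summary()
```

Because pooling reduces each review to 64 values before the output layer, this variant has fewer trainable parameters than the `Flatten`/`Dense` stack. Whether it scores better on the test set is what the comparison in this part is meant to find out.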

---
## © 2024 The Coding School, All rights reserved