# Task description:
You will repeat the process we followed on the IMDb dataset (in the lab part 1 notebook) on the [consumer finance complaints dataset](https://catalog.data.gov/dataset/consumer-complaint-database). This time, however, you will use a pretrained word embedding layer in your classification model and evaluate its performance. In addition, you will train another model with your own randomly initialized embedding layer (as we did in the lab) and compare their performance. I have uploaded a smaller, cleaner version of this dataset to [this drive link](https://drive.google.com/file/d/1qr32I8pzfsGIOHrDM3jTqqhMoMnZhMx-/view?usp=sharing); this is what you'll be working with. This will be a 10-class classification problem. Refer to this Keras [tutorial](https://keras.io/examples/nlp/pretrained_word_embeddings/) to learn how to deal with pretrained word embeddings.

In [None]:
import io
import os
import re
import shutil
import string
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D, MaxPooling1D, GlobalMaxPooling1D, Dropout
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

# Download the dataset and explore the dataset structure: 

### download the dataset from the drive [link](https://drive.google.com/file/d/1qr32I8pzfsGIOHrDM3jTqqhMoMnZhMx-/view?usp=sharing). 

In [None]:
# Reuse the code from the lab file, get the file ID from the url 
# Import PyDrive and associated libraries.
# This only needs to be done once per notebook.
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
# This only needs to be done once per notebook.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

# Download a file based on its file ID.
#
# A file ID looks like: laggVyWshwcyP6kEI-y_W3P8D26sz
file_id = '1qr32I8pzfsGIOHrDM3jTqqhMoMnZhMx-'
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('consumer_complaint_10class.csv')

### Read the dataset with pandas, and display a sample.

Your input data is in the `complaint` column, and your target classes are encoded as integers in the `category_id` column. The `category` column contains the category name associated with each category index.

In [None]:
import numpy as np
import pandas as pd

In [None]:
data = pd.read_csv("consumer_complaint_10class.csv")

In [None]:
data.head()

| Unnamed: 0.1 | Unnamed: 0 | complaint | category | category_id |
|---|---|---|---|---|
| 0 | 15 | It should be illegal. I havent use my credit c... | Credit card or prepaid card | 0 |
| 1 | 16 | I have been a Kohls credit card holder for ove... | Credit reporting, credit repair services, or o... | 1 |
| 2 | 30 | Banking services or operating as expected. Sun... | Checking or savings account | 2 |
| 3 | 31 | In accordance with the Fair Credit Reporting a... | Credit reporting, credit repair services, or o... | 1 |
| 4 | 32 | hello dear agency referring to my report i too... | Credit reporting, credit repair services, or o... | 1 |


### display the class names and frequencies (value counts).

In [None]:
data.category

0                               Credit card or prepaid card
1         Credit reporting, credit repair services, or o...
2                               Checking or savings account
3         Credit reporting, credit repair services, or o...
4         Credit reporting, credit repair services, or o...
                                ...                        
781713                                     Credit reporting
781714                                          Credit card
781715                                      Debt collection
781716                                             Mortgage
781717                                     Credit reporting
Name: category, Length: 781718, dtype: object

In [None]:
data.category.value_counts()

Credit reporting, credit repair services, or other personal consumer reports    293228
Debt collection                                                                 165980
Mortgage                                                                         91070
Credit card or prepaid card                                                      69094
Checking or savings account                                                      44435
Student loan                                                                     30426
Credit reporting                                                                 29827
Money transfer, virtual currency, or money service                               21797
Credit card                                                                      18757
Vehicle loan or lease                                                            17104
Name: category, dtype: int64

# Split into training, validation and test sets. Create dataset objects with a batch size of `1024`.


### Split into train, validation and test:
Split with ratio `60%` train, `20%` validation and `20%` test. You can use `sklearn.model_selection.train_test_split`.

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical 

In [None]:
X = data.complaint
y = to_categorical(data.category_id)

In [None]:
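# First hold out 20% of the data as the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Then split the remaining 80% into train and validation:
# 0.25 * 0.8 = 0.2, which gives a 60/20/20 split with no overlap with the test set.
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)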
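
A 60/20/20 split means holding out 20% for test and then 25% of the remaining 80% for validation (0.25 × 0.8 = 0.2), with the second split drawn only from what is left after the first. The following is a minimal standard-library sketch of that arithmetic on index lists; the helper `split_indices` is hypothetical and is not part of sklearn.

```python
import random

def split_indices(n, test_frac=0.2, val_frac=0.2, seed=42):
    """Shuffle range(n) and cut it into disjoint train/val/test index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return train, val, test

train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

Because the three lists are slices of one shuffled list, no example can end up in more than one split.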

### create tensorflow dataset objects for the training, validation and test sets:
 batch size= `1024`

 check out [`tf.data.Dataset.from_tensor_slices`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_tensor_slices)

In [None]:
batch_size = 1024
seed = 123

train_ds = (tf.data.Dataset.from_tensor_slices((X_train, y_train)).batch(batch_size))
test_ds = (tf.data.Dataset.from_tensor_slices((X_test, y_test)).batch(batch_size))
val_ds = (tf.data.Dataset.from_tensor_slices((X_val, y_val)).batch(batch_size))

In [None]:
for text_batch, label_batch in train_ds.take(1): #Creates a Dataset with at most 1 element from this dataset.
  for i in range(5):
    print('label: ', label_batch[i].numpy())
    print( 'review text:\n', text_batch.numpy()[i])
    print("*"*20)

label:  5
review text:
 b"Case number : XXXX ( now XXXX ) : XXXXSenator XXXX, It 's been about 18 months of filing complaints with the CFPB XXXX still we are no closer to a resolution than when we started. \n\nOur first CFPB complaint was with XXXX XXXX XXXX who legally IAW RESPA transferred our mortgage to Homeward/Ocwen XXXX. \n\nIn the first CFPB complaint it appears that XXXX XXXX XXXX told CFPB that they may have made an error in the mortgage company that they named in the RESPA Notice but they corrected any alleged error by phone call to us. Their claims clearly were not true and not IAW RESPA the CFPB did absolutely nothing. RESPA requires corrections in writing within specified time limits and since none have been received we continue to make payments to Homeward/Ocwen XXXX who continues to accept and process the payments. \n\nAfter the Ocwen Loan Servicing LLC letter which was n't and is n't IAW RESPA we contacted Homeward/Ocwen XXXX by phone and they said that we did n't have

# define the preprocessing vectorize layer. 

1. Create a custom preprocessing function (like we did in the lab) to lowercase the text and strip punctuation. You are free to add any extra preprocessing steps you see fit.


2. define the vectorizer with `max_tokens=20000` (vocab size=20000) and the custom preprocessing function. 


3. `Adapt` the vectorizer on a text only version of our dataset (like we did in the lab).

In [None]:
# Create a custom preprocessing function to convert to lowercase,
# strip HTML break tags '<br />' and remove punctuation.

def custom_preprocessing(input_data):
  # convert to lowercase
  lowercase = tf.strings.lower(input_data)
  # remove html tags
  stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
  # remove punctuation/special characters and return
  return tf.strings.regex_replace(stripped_html,
                                  f'[{re.escape(string.punctuation)}]', '')


# Max vocabulary size 
vocab_size = 20000

# Use the text vectorization layer to normalize, tokenize, and map strings to
# integers. Note that the layer uses the custom preprocessing function defined above.
vectorize_layer = TextVectorization(
    standardize=custom_preprocessing,
    max_tokens=vocab_size,
    output_mode='int',
    )

# Make a text-only dataset (no labels) and call adapt to build the vocabulary.
text_ds = train_ds.map(lambda x, y: x)
vectorize_layer.adapt(text_ds) #Fits the state of the preprocessing layer to the dataset.
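
To see what the standardization step does to a single string, here is a plain-Python sketch of the same three operations (lowercase, replace `<br />`, drop punctuation). The function `standardize` and the sample sentence are made up for illustration; the real layer runs the TensorFlow string ops defined above.

```python
import re
import string

def standardize(text):
    # Same three steps as the TensorFlow version, on a Python str.
    lowercase = text.lower()
    stripped_html = lowercase.replace("<br />", " ")
    return re.sub(f"[{re.escape(string.punctuation)}]", "", stripped_html)

cleaned = standardize("It's been 18 months,<br />STILL no reply!")
print(cleaned)
print(cleaned.split())  # whitespace tokenization, as the vectorizer does by default
```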

# download a pretrained embedding layer to use in your model. 
Here you will need to refer to the Keras [tutorial](https://keras.io/examples/nlp/pretrained_word_embeddings/). We will use a pretrained GloVe model with `embedding dimensionality = 50`. For your convenience, I have uploaded it [here](https://drive.google.com/file/d/1wAEvp6qEljV6pOoZVa_LuaPt9q2o41X_/view?usp=sharing).

### download the model from drive.

In [None]:
# Download a file based on its file ID.
#
# A file ID looks like: laggVyWshwcyP6kEI-y_W3P8D26sz
file_id = '1wAEvp6qEljV6pOoZVa_LuaPt9q2o41X_'
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('glove.6B.50d.zip')

unzip the file into an appropriately named folder.


### Unzip the model.

In [None]:
# -o overwrites an existing glove.6B.50d.txt without prompting, so the cell can be re-run.
!unzip -o -q glove.6B.50d.zip


### read the words and corresponding embedding from the text file into a dictionary. 
- read the file line by line. 
- store the first word of each line as key in the dict.
- convert the rest of the line to numpy array and store as value in the dict.

Check out `numpy.fromstring`. You can refer to the code from the Keras tutorial.

In [None]:
path_to_glove_file = "glove.6B.50d.txt"

# Map each token to its 50-dimensional GloVe vector.
embeddings_index = {}
with open(path_to_glove_file, encoding="utf8") as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print("Found %s word vectors." % len(embeddings_index))

Found 400000 word vectors.
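
Each line of the GloVe text file is a token followed by its coefficients, separated by spaces. The sketch below parses two made-up lines held in memory instead of the real file; the tokens and values are invented, and `np.asarray` is used here in place of `np.fromstring` purely for illustration.

```python
import io
import numpy as np

# Two made-up lines in GloVe text format: token, then space-separated floats.
fake_glove = io.StringIO("credit 0.1 -0.2 0.3\nloan 0.4 0.5 -0.6\n")

toy_index = {}
for line in fake_glove:
    # Split off the token; everything after it is the vector.
    word, coefs = line.split(maxsplit=1)
    toy_index[word] = np.asarray(coefs.split(), dtype="float32")

print(sorted(toy_index))
print(toy_index["loan"].shape)
```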


### construct an embedding matrix from the embedding dict just obtained.

1. Get the vocab list from the vectorizer layer with `vectorize_layer.get_vocabulary()`.
2. Convert the vocab list into a dictionary {'word string': word index}. Let's call this ***word_index***.
3. For every word in your vocab, get the corresponding vector from the embeddings dict and use it to populate the embedding matrix at row number = word index.
4. If a word is not in the embedding dictionary, just assign a vector of zeros to it.

You can refer to the code from the Keras tutorial.

In [None]:
voc = vectorize_layer.get_vocabulary()
word_index = dict(zip(voc, range(len(voc))))

In [None]:
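# get_vocabulary() already contains the padding ('') and OOV ('[UNK]') tokens,
# so the extra 2 rows (kept from the tutorial) are unused all-zero rows.
num_tokens = len(voc) + 2
embedding_dim = 50  # must match the dimensionality of the GloVe 50d vectors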
hits = 0
misses = 0

# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        # Words not found in the embedding index stay all-zeros.
        # This includes the representation for "padding" and "OOV".
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))


Converted 16687 words (3313 misses)
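
The row-per-word layout can be checked on a toy example. The sketch below uses a made-up four-word vocabulary and a single made-up 3-dimensional vector; all names prefixed with `toy_` are hypothetical and are kept separate from the notebook's variables.

```python
import numpy as np

toy_vocab = ["", "[UNK]", "credit", "xxxx"]           # made-up vocabulary
toy_vectors = {"credit": np.array([0.1, 0.2, 0.3])}   # made-up pretrained vectors

toy_word_index = {word: i for i, word in enumerate(toy_vocab)}
toy_matrix = np.zeros((len(toy_vocab), 3))
toy_hits = 0
for word, i in toy_word_index.items():
    vec = toy_vectors.get(word)
    if vec is not None:
        toy_matrix[i] = vec   # row i holds the vector for the word with index i
        toy_hits += 1

print(toy_hits, len(toy_vocab) - toy_hits)
print(toy_matrix)
```

Rows for words without a pretrained vector (here the padding token, the OOV token and `xxxx`) stay at zero, which is what happens to the misses reported above.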


# define the model, compile and train. 



### define an embedding layer from the pre-trained embeddings matrix. 

Initialize a `tf.keras.layers.Embedding` layer object, set its `embeddings_initializer` from your embedding matrix, and set `trainable=False`.

In [None]:
from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(
    num_tokens,
    embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False,
)

### define a sequential model containing:
1. the vectorize layer. 
2. the embedding layer.
3. a global average pooling 1d layer. 
4. any number of hidden layers.
5. an output layer with the correct shape.

Refer to the lab part 1 notebook.  
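
As a reminder of what the pooling step does: global average pooling collapses the sequence axis by averaging the token embeddings, so every complaint becomes one fixed-size vector regardless of its length. A NumPy sketch on made-up values (this ignores masking, so padded positions would be averaged in as well):

```python
import numpy as np

# Batch of 2 sequences, 3 tokens each, embedding dimension 4 (made-up values).
embedded = np.arange(24, dtype="float32").reshape(2, 3, 4)

# Average over the token axis: (batch, tokens, dim) -> (batch, dim).
pooled = embedded.mean(axis=1)
print(pooled.shape)
print(pooled[0])
```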

In [None]:
from tensorflow.keras import layers

model = Sequential([
  vectorize_layer,
  embedding_layer,
  GlobalAveragePooling1D(),
  Dense(512, activation="relu"),
  Dense(256, activation="relu"),
  Dense(128, activation="relu"),
  Dropout(0.5),
  Dense(10, activation="softmax")
])
# The output layer already applies softmax, so the loss receives probabilities, not logits.
model.compile(optimizer='adam',
              loss=tf.keras.losses.CategoricalCrossentropy(from_logits=False),
              metrics=['accuracy'])

model.summary()

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization_1 (TextV  (None, None)             0         
 ectorization)                                                   
                                                                 
 embedding_3 (Embedding)     (None, None, 50)          1000100   
                                                                 
 global_average_pooling1d_4   (None, 50)               0         
 (GlobalAveragePooling1D)                                        
                                                                 
 dense_16 (Dense)            (None, 512)               26112     
                                                                 
 dense_17 (Dense)            (None, 256)               131328    
                                                                 
 dense_18 (Dense)            (None, 128)              
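
A note on the loss configuration: when the last layer has a softmax activation, the loss should be told it receives probabilities (`from_logits=False`). Some Keras versions detect the mismatch and only emit a warning; the NumPy sketch below, on made-up logits, shows what happens if softmax is effectively applied a second time.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([4.0, 1.0, 0.0])   # made-up scores for 3 classes
true_class = 0

probs = softmax(logits)
loss_correct = -np.log(probs[true_class])           # cross-entropy on probabilities
loss_double = -np.log(softmax(probs)[true_class])   # probabilities treated as logits

print(loss_correct, loss_double)
```

Since probabilities lie in [0, 1], a second softmax flattens them, so the loss stays high even for a confident, correct prediction.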

### start the training with any callbacks you want.

In [None]:
early_stop = tf.keras.callbacks.EarlyStopping(patience=5, min_delta=0.005)
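
With its default settings, `EarlyStopping` monitors `val_loss`; training stops once the loss has failed to improve by at least `min_delta` for `patience` consecutive epochs. The following is a simplified pure-Python sketch of that rule, not the Keras implementation; `stopping_epoch` and the loss values are made up.

```python
def stopping_epoch(val_losses, patience=5, min_delta=0.005):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # Counts as an improvement: remember it and reset the counter.
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

plateau = [1.0, 0.9, 0.899, 0.898, 0.898, 0.897, 0.897, 0.5]
improving = [1.0, 0.9, 0.8, 0.7]
print(stopping_epoch(plateau), stopping_epoch(improving))
```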

In [None]:
model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=35,
    callbacks=[early_stop])

Epoch 1/35
Epoch 2/35
Epoch 3/35
Epoch 4/35
Epoch 5/35
Epoch 6/35
Epoch 7/35
Epoch 8/35
Epoch 9/35
Epoch 10/35
Epoch 11/35
Epoch 12/35
Epoch 13/35
Epoch 14/35
Epoch 15/35
Epoch 16/35
Epoch 17/35
Epoch 18/35
Epoch 19/35
Epoch 20/35
Epoch 21/35
Epoch 22/35
Epoch 23/35
Epoch 24/35
Epoch 25/35
Epoch 26/35
Epoch 27/35
Epoch 28/35
Epoch 29/35
Epoch 30/35
Epoch 31/35
Epoch 32/35
Epoch 33/35
Epoch 34/35
Epoch 35/35


<keras.callbacks.History at 0x7f55fe210a90>

# Evaluate your model on the test set.

In [None]:
model.evaluate(test_ds)



[0.9282825589179993, 0.6870810389518738]

# Create another model with a randomly initialized trainable embedding layer. Use the same architecture and the same vectorize layer.

This is what we did in the lab. Refer to the part 1 notebook. It's important that you use the same architecture with the same number of hidden layers so we can fairly compare the two approaches.

In [None]:
from tensorflow.keras import layers

model = Sequential([
  vectorize_layer,
  Embedding(vocab_size, embedding_dim, name="embedding"),
  GlobalAveragePooling1D(),
  Dense(512, activation="relu"),
  Dense(256, activation="relu"),
  Dense(128, activation="relu"),
  Dropout(0.5),
  Dense(10, activation="softmax")
])
# The output layer already applies softmax, so the loss receives probabilities, not logits.
model.compile(optimizer='adam',
              loss=tf.keras.losses.CategoricalCrossentropy(from_logits=False),
              metrics=['accuracy'])

model.summary()

Model: "sequential_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization_2 (TextV  (None, None)             0         
 ectorization)                                                   
                                                                 
 embedding (Embedding)       (None, None, 50)          1000000   
                                                                 
 global_average_pooling1d_6   (None, 50)               0         
 (GlobalAveragePooling1D)                                        
                                                                 
 dense_24 (Dense)            (None, 512)               26112     
                                                                 
 dense_25 (Dense)            (None, 256)               131328    
                                                                 
 dense_26 (Dense)            (None, 128)              

# train this other model (with the same parameters).

In [None]:
early_stop = tf.keras.callbacks.EarlyStopping(patience=5, min_delta=0.005)

In [None]:
model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=35,
    callbacks=[early_stop])

Epoch 1/35
Epoch 2/35
Epoch 3/35
Epoch 4/35
Epoch 5/35
Epoch 6/35
Epoch 7/35
Epoch 8/35
Epoch 9/35
Epoch 10/35
Epoch 11/35
Epoch 12/35
Epoch 13/35
Epoch 14/35
Epoch 15/35
Epoch 16/35
Epoch 17/35
Epoch 18/35
Epoch 19/35
Epoch 20/35
Epoch 21/35
Epoch 22/35
Epoch 23/35
Epoch 24/35
Epoch 25/35
Epoch 26/35
Epoch 27/35
Epoch 28/35
Epoch 29/35
Epoch 30/35


<keras.callbacks.History at 0x7f55fef70ad0>

# Evaluate this model on the test set.

In [None]:
model.evaluate(test_ds)



[0.5839284062385559, 0.8137632608413696]

## Compare the two models and comment on your results.

****

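*   **The first model, which uses the frozen pretrained GloVe embedding layer, reaches a test accuracy of about 68.7% (loss 0.928).**
*   **The second model, which uses the randomly initialized, trainable embedding layer, reaches a test accuracy of about 81.4% (loss 0.584).**
*   **One likely reason is vocabulary coverage: 3313 of the 20000 vocabulary words have no GloVe vector and are mapped to all-zero rows. In addition, the pretrained layer is frozen (`trainable=False`), so it cannot adapt to this domain, whereas the trainable embedding is learned directly on the complaints data.**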


