<a href="https://colab.research.google.com/github/gprasad125/lign167_finalproject/blob/main/Copy_of_Project_Tester.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using Multiclass Text Classification to Analyze Famous Quotes 

#### Gokul Prasad & Hoang Nguyen 
#### LIGN 167, Winter 2022

In this project, we'll aim to classify a variety of quotes with tags that refer to certain themes or elements specific to that particular quote. 

For example, Albert Einstein's quote “Life is like riding a bicycle. To keep your balance, you must keep moving.” would have tags like "life" or "simile" because it contains thematic elements about life, and contains a simile. 

In [None]:
# Scraping
import requests
from bs4 import BeautifulSoup
import time 

# Data manipulation / cleaning / visualization
import pandas as pd
import numpy as np
import gensim as gm
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = stopwords.words('english')

import re 
import matplotlib.pyplot as plt
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Sklearn modeling
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import GridSearchCV

# keras modeling
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from sklearn.model_selection import train_test_split
from keras.utils.np_utils import to_categorical
from keras.callbacks import EarlyStopping
from keras.layers import Dropout
from keras.metrics import Precision, Recall

# Transformers for model 2.2
import transformers
from transformers import AutoTokenizer,TFBertModel

from tensorflow.keras.optimizers import SGD

import warnings
warnings.filterwarnings('ignore')

# Scraping and Cleaning the Data 

We'll be sourcing our data from https://www.goodreads.com/quotes. This is a website containing 100 pages worth of quotes, each of them classified with a few tags. 

Firstly, we'll loop through the pages and scrape the website's HTML with BeautifulSoup. Then, we'll use find_all() to pull the quote text (which still includes the author) and the tag footer from each page. We'll put each of these into lists, and then create a pandas DataFrame to hold all our data. 

In [None]:
goodreads_quotes = []
goodreads_tags = []

for i in range(1, 101):

  url = 'https://www.goodreads.com/quotes?page={}'.format(i)

  time.sleep(5)
  scrape = requests.get(url)
  parsed = BeautifulSoup(scrape.content, 'html.parser')

  elements_quotes = parsed.find_all('div', class_ = "quoteText")
  
  quotes = [x.text.strip() for x in elements_quotes]
  tags = parsed.find_all(class_ = 'quoteFooter')

  goodreads_quotes += quotes
  goodreads_tags += tags

In [None]:
data = {'quote':goodreads_quotes, 'tags':goodreads_tags}
goodreads = pd.DataFrame(data)
goodreads.head()

As we can see, our dataset contains some pretty messy strings in both columns. We'll need to process the data to make sure it's usable for our modeling later on. 

For quotes, we'll first strip the author by keeping only the text between the opening and closing curly quotation marks. We'll then make all characters lowercase, and use regex functionality to substitute any character that is not alphanumeric or whitespace with an empty string. 

For example, if we input a quote like "“I love. LIGN 167!!?” Oscar Wilde" we would receive an output of "i love lign 167". We'll apply this to our quote column to clean it up and turn each quote into a much simpler string. 
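
As a quick check of these cleaning steps, the sketch below re-implements them on a made-up Goodreads-style string. The helper names `strip_author` and `clean_quote` and the sample text are illustrative assumptions, and unlike the notebook's own function this version also collapses repeated whitespace.

```python
import re

def strip_author(raw):
    """Keep only the text between the curly quotation marks, if both are present."""
    start, end = raw.find('“'), raw.find('”')
    if start != -1 and end > start:
        return raw[start + 1:end]
    return raw

def clean_quote(text):
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    text = re.sub(r'[^a-z0-9\s]', '', text.lower())
    return ' '.join(text.split())

sample = '“I love. LIGN \n167!!?”\n  Oscar Wilde'   # hypothetical scraped text
cleaned = clean_quote(strip_author(sample))
print(cleaned)
```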

In [None]:
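def quotes_cleaning(text):
    
    # Lowercase, then drop every character that is not a letter, digit, or whitespace
    text = text.lower()
    
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)
    
    return text

def tags_cleaning(text):
    
    # Work on the string form of the tag list (raw strings avoid invalid-escape warnings)
    text = re.sub(r'[\[ \]]', ' ', str(text))   # brackets and spaces
    text = re.sub(r'[^\w]', ' ', text)          # any remaining non-word character
    text = re.sub(r'[\s]', ' ', text)           # tabs and newlines become plain spaces
    text = re.sub(r'[0-9]', ' ', text)          # digits
    
    # Collapse repeated spaces; a quote without tags ends up as ['']
    text = ' '.join(text.split())
    
    return text.split(' ')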

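def remove_author(quote):

  # Keep only the text between the curly quotation marks, dropping the author line
  if quote.startswith('“') and '”' in quote:

    end_of_quote = quote.index('”')
    quote = quote[1:end_of_quote]

  return quote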

def bs_to_list(tags):

  if type(tags) != list:

    tags = tags.find_all('a')
  
    tag_strs = []
    for tag in tags[:-1]:

      tag = str(tag)
      start_idx = tag.index('">')
      end_idx = tag.index('</')
      tag = tag[start_idx + 2:end_idx]
      tag_strs.append(tag)

    tags = tag_strs

  return tags

For the tags, we need a slightly more complicated function since the data is tucked into lists. Firstly, we'll make it a string, and use regex to remove the surrounding brackets, remove non-word characters and digits, and replace all multi-whitespaces with a single space. We'll then render the string as a list again, and return the list. 

For example, if we input a list like [deep?, wonderous.., love-happy], we would get an output of [deep, wonderous, love, happy]
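
The example above can be reproduced with a short sketch. The function name `clean_tags` is an assumption for illustration; it follows the same idea as the tag-cleaning steps described here, except that it returns an empty list (rather than a list holding an empty string) when there are no tags.

```python
import re

def clean_tags(tags):
    """Turn a list of raw tag strings into a flat list of plain words."""
    text = str(tags)
    text = re.sub(r'[^\w]', ' ', text)   # brackets, quotes, punctuation, hyphens -> spaces
    text = re.sub(r'[0-9]', ' ', text)   # digits -> spaces
    return text.split()                  # split on any run of whitespace

example = clean_tags(['deep?', 'wonderous..', 'love-happy'])
print(example)
```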

In [None]:
goodreads['quote'] = goodreads['quote'].apply(remove_author)
goodreads['quote'] = goodreads['quote'].apply(quotes_cleaning)
goodreads['tags'] = goodreads['tags'].apply(bs_to_list)
goodreads['tags'] = goodreads['tags'].apply(tags_cleaning)

isEmpty = goodreads['tags'].apply(lambda x: '' in x)
goodreads['isEmpty'] = isEmpty

Now, having cleaned the dataset more fully, we can see the impact on our data. We've also added an isEmpty column, which marks whether or not the quote has no tags. Since the tags are in a list, a quote without tags holds the list [''] rather than a NaN value, which is what the isEmpty check looks for.  

In [None]:
goodreads.head(4)

# Reshaping Data for Modeling 

Now, while the data is cleaned, we can't really model accurately when our tags are all in a list. Inputting them into our sklearn Pipelines later would not work as we would want, so we have to find a way to reshape the dataframe. Firstly, we'll need to find the maximum number of tags attached to any quote, which we do as follows. 

In [None]:
max_tags = goodreads['tags'].apply(lambda x: len(x)).max()
max_tags

So we see that the maximum number of tags on a single quote is 48. So, let's write a function that makes every list of tags the same length by appending as many None values as needed to reach a length of 48. 

For example, an input of [life, duck, nature] would yield [life, duck, nature, None, None, None, None, None, None, None, None, ... None, None, None]. 
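
A minimal sketch of this padding step, using a shorter target length so the result is easy to read. The name `pad_tags` and its `length` parameter are assumptions for illustration; taking the length as an argument avoids hard-coding the maximum.

```python
def pad_tags(tags, length):
    """Right-pad a tag list with None up to `length`; longer lists are left unchanged."""
    missing = max(0, length - len(tags))
    return tags + [None] * missing

padded = pad_tags(['life', 'duck', 'nature'], 6)
print(padded)
```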

In [None]:
def pad(tags):

  needed = 48 - len(tags)
  tags = tags + ([None] * needed)

  return tags

Now we can apply that function to our Tags column, and use pandas get_dummies() functionality to reshape our dataframe to where each tag is a column, and the column contains 1s or 0s, reflecting whether or not a particular tag is in the quote belonging to that row. 

Unfortunately, pd.get_dummies() will create some duplicates so we'll groupby and sum to combine the duplicate tag columns. 

We then combine this dataframe with our original dataframe, and drop the tags and isEmpty columns. We can see the finished result below.
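
To make the duplicate-column issue concrete, here is a small sketch on three made-up tag lists. The toy data and variable names are assumptions; the duplicate columns are merged by transposing and grouping on the column label, which is one way to get the same effect as a column-wise groupby.

```python
import pandas as pd

# Each row is one quote's padded tag list (hypothetical data)
toy_tags = pd.DataFrame([['life', 'love'],
                         ['love', None],
                         ['hope', 'life']])

# One dummy column per (position, tag) pair, so 'tags_life' and 'tags_love' appear twice
dummies = pd.get_dummies(toy_tags, prefix='tags')

# Merge columns that share a name by summing them
combined = dummies.T.groupby(level=0).sum().T.astype(int)
print(combined)
```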

In [None]:
gr_tags = pd.DataFrame(goodreads['tags'].apply(pad).tolist())
gr_tags_oh = pd.get_dummies(gr_tags, prefix = 'tags')
gr_tags_oh = gr_tags_oh.groupby(gr_tags_oh.columns, axis = 1).sum()
reshaped_gr = pd.concat([goodreads, gr_tags_oh], axis = 1).drop(columns = ['isEmpty', 'tags'])
reshaped_gr.head()

We can see the distribution of the number of tags per quote below.

In [None]:
cnts = reshaped_gr.iloc[:, 1:].sum(axis = 1)
cnts.hist()

# Modeling

Now, we can begin our modeling. 

Firstly, we'll get a list of all of our tags. We'll do this by taking all columns besides "quote".

Next, we'll use sklearn's train_test_split() function to split our dataset into a training and a testing set. We'll split so that our test set is 25% of our dataset, which leaves the remaining 75% of the rows for training. The shapes printed below give the exact row counts. 
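
To see how `test_size` translates into row counts, here is a small sketch on a made-up 100-row frame. The toy data and variable names are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# 100 placeholder quotes, just to count rows
toy_quotes = pd.DataFrame({'quote': [f'quote {i}' for i in range(100)]})

toy_train, toy_test = train_test_split(toy_quotes, test_size=0.25, random_state=42)
print(toy_train.shape, toy_test.shape)
```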

In [None]:
gr_tags = reshaped_gr.columns[1:]

train, test = train_test_split(reshaped_gr, test_size = 0.25, random_state = 42)

x_tr = train.quote
x_te = test.quote

print(x_tr.shape)
print(x_te.shape)

### Model 1: Decision Tree Classifier

Our first model will be using scikit-learn Pipelines. 

Inside our pipeline, we'll first vectorize the input data by converting the quotes to their TF-IDF representation. This turns our string quotes into numerical values for input. Then, we have to consider how we will be handling multiple classes. We'll try a OneVsRest classifier, because this will allow us to pass in each tag and use a single-class estimator on each tag's train and test data. 

However, we need to wrap the OneVsRest classifier around an estimator that makes sense for what we are trying to achieve here. We'll use a decision tree classifier, because the sklearn functionality is simple to use, doesn't require much shaping of the data, and should hopefully set a good basis for our first try. 
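
As a rough illustration of what the vectorizing step produces, the sketch below runs `TfidfVectorizer` with its default settings on three made-up quotes. Each quote becomes one row and each vocabulary word one column; by default, single-character tokens such as "a" are dropped.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ['life is a journey',
            'love is all you need',
            'life needs love']          # hypothetical quotes

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(toy_docs)   # sparse matrix: rows = quotes, columns = words

print(matrix.shape)
print(sorted(vectorizer.vocabulary_))
```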

In [None]:
dt_classifier = Pipeline([('tfidf', TfidfVectorizer()), ('clf', OneVsRestClassifier(DecisionTreeClassifier()))])

Now, we'll loop through each of the tags in our dataset, train our model on that particular tag, and then store the result in a dictionary containing each tag and that tag's associated evaluation scores. 

For our evaluating metric, we'll choose to use f1 scores over accuracy, because if we look at our data, we have an imbalance of tags. Some quotes have several tags, while others only have one or two. As such, using accuracy would likely not work well for this scenario. 

However, we have multiple classes, so it would not make much sense to get a bunch of f1 scores since each tag would give different results. We can instead collect each tag's precision and recall from when the model's predictions are compared to the actual test data. 
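
The concern about accuracy can be illustrated with a small, made-up case: a tag that appears on only 5 of 100 quotes, and a model that never predicts it. The numbers are hypothetical and only meant to show why precision and recall are more informative here.

```python
y_true = [1] * 5 + [0] * 95      # hypothetical tag present on 5 of 100 quotes
y_pred = [0] * 100               # a model that never predicts the tag

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_pos / sum(y_true)  # share of actual positives that were found

print(accuracy, recall)
```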

In [None]:
prec_recs = {}
for tag in gr_tags:
    
    dt_classifier.fit(x_tr, train[tag])
    prediction = dt_classifier.predict(x_te)
    
    precision_recall = precision_recall_fscore_support(test[tag], prediction, average = 'macro')
    prec_recs[tag] = precision_recall

So now we can calculate the average precision and recall for our tags by looping through our dictionary, summing up the total of both metrics, and dividing by the number of tags.

In [None]:
sum_precision = 0
sum_recall = 0

for key in prec_recs.keys():
    
    sum_precision += prec_recs[key][0]
    sum_recall += prec_recs[key][1]
    
mean_precision = sum_precision / len(prec_recs.keys())
mean_recall = sum_recall / len(prec_recs.keys())

Now we apply the formula for an F1 score, which is (2 * p * r) / (p + r), where p is the mean precision and r is the mean recall.
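
One thing worth keeping in mind is that an F1 score computed from averaged precision and recall is not necessarily the same as the average of per-tag F1 scores. The sketch below uses two made-up tags to show the difference; the values and the helper name `f1_from` are assumptions for illustration.

```python
def f1_from(p, r):
    """Harmonic mean of precision and recall, defined as 0 when both are 0."""
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

# Hypothetical (precision, recall) pairs for two tags
per_tag = {'life': (0.9, 0.5), 'love': (0.5, 0.9)}

mean_p = sum(p for p, _ in per_tag.values()) / len(per_tag)
mean_r = sum(r for _, r in per_tag.values()) / len(per_tag)

f1_of_means = f1_from(mean_p, mean_r)
mean_of_f1s = sum(f1_from(p, r) for p, r in per_tag.values()) / len(per_tag)

print(round(f1_of_means, 3), round(mean_of_f1s, 3))
```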

In [None]:
average_f1 = (2 * mean_precision * mean_recall) / (mean_precision + mean_recall)
average_f1

So we have an f1 score of about 0.708. F1 scores range from 0 to 1, and the closer they are to 1, the better the model, so we have set up a good baseline for ourselves. But we want to improve on this and make our model better classify our quotes. 

#### Optimizing Model 1  

Now that we have our baseline model, how can we optimize it? 

There are many concepts we can implement into our Pipeline, both from a text classification standpoint, as well as a sklearn standpoint. 

The first method we'll implement is getting rid of stop-words. These are words that appear extremely frequently in human language, and give very little value to our model. Removing them can allow our model to focus more strongly on the more important data. 
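
A minimal sketch of stop-word removal on a single sentence. The tiny stop list here is made up for illustration and is much smaller than NLTK's English list.

```python
# Tiny illustrative stop list, not the NLTK one
toy_stops = {'is', 'a', 'the', 'to', 'you', 'must'}

def drop_stop_words(text, stops):
    """Remove every whitespace-separated word that appears in `stops`."""
    return ' '.join(word for word in text.split() if word not in stops)

shortened = drop_stop_words('life is like riding a bicycle', toy_stops)
print(shortened)
```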

In [None]:
stop_words = set(stopwords.words('english'))

Now that we have defined the words to remove, we can try and optimize our other parameters with GridSearchCV. First, we'll need to select what parameters we can optimize. 

In [None]:
parameters = {
    'clf':(DecisionTreeClassifier(),),
    'clf__max_depth': [2, 3, 4, 5, 7, 10, 13, 15, 18, None],
    'clf__min_samples_split': [2, 3, 5, 7, 10, 15, 20],
    'clf__min_samples_leaf': [2, 3, 5, 7, 10, 15, 20]
}

Now that we have created the parameters, we can place them into a GridSearchCV and train it on our data. 
Because the search is refit for every tag in the loop, the best parameters we print afterwards are those found for the last tag fitted. 
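
It can help to estimate how much work a grid search does before running it. The sketch below counts the candidate combinations for a decision tree grid with the same three value lists used in this notebook, assuming 3-fold cross-validation; the variable names are illustrative.

```python
from itertools import product

max_depth = [2, 3, 4, 5, 7, 10, 13, 15, 18, None]
min_samples_split = [2, 3, 5, 7, 10, 15, 20]
min_samples_leaf = [2, 3, 5, 7, 10, 15, 20]

# Every combination of the three lists is one candidate model
n_candidates = len(list(product(max_depth, min_samples_split, min_samples_leaf)))
n_cv_fits_per_tag = n_candidates * 3    # assuming cv = 3 folds

print(n_candidates, n_cv_fits_per_tag)
```

Since the search is repeated for each tag, the total cost grows with the number of tag columns.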

In [None]:
grids = GridSearchCV(dt_classifier, param_grid = parameters, cv = 3, return_train_score = True)
for tag in gr_tags:
    grids.fit(x_tr, train[tag])

grids.best_params_

Let's now re-run our training, testing, and calculating of precision and recall to calculate a new and hopefully improved average f1 score. 

In [None]:
dt_classifier = Pipeline([('tfidf', TfidfVectorizer(stop_words = list(stop_words))), ('dtc', DecisionTreeClassifier(max_depth = 2, min_samples_leaf = 2))])

prec_recs = {}
for tag in gr_tags:
    
    dt_classifier.fit(x_tr, train[tag])
    prediction = dt_classifier.predict(x_te)
    
    precision_recall = precision_recall_fscore_support(test[tag], prediction, average = 'macro')
    prec_recs[tag] = precision_recall

sum_precision = 0
sum_recall = 0

for key in prec_recs.keys():
    
    sum_precision += prec_recs[key][0]
    sum_recall += prec_recs[key][1]
    
mean_precision = sum_precision / len(prec_recs.keys())
mean_recall = sum_recall / len(prec_recs.keys())

average_f1 = (2 * mean_precision * mean_recall) / (mean_precision + mean_recall)
average_f1

So we see a decent improvement from 0.71 to 0.78, achieved by using GridSearchCV and removing stop words to optimize our model. However, we'll take a look at other models and optimizations to see if we can get a higher score. 

### Model 2: Keras Sequential Model

The next model we'll try is a Keras Sequential() model with a few layers. First, we'll remove stop words from both the quotes and the tags, and store the results in new columns. 

In [None]:
goodreads = goodreads[goodreads['isEmpty'] == False].copy()  # copy so the new columns are added to an independent frame
goodreads['stop_quote'] = goodreads['quote'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))
goodreads['stop_tags'] = goodreads['tags'].apply(lambda x: [z for z in x if z not in stop_words])

goodreads.head()

We'll shuffle the data using pandas sample() functions, and take only our new stop-word removed quotes and tags columns. Then we'll split training and testing data on an 80/20 split, and get validation data as half the test data.
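
The three-way split described above can be sketched on a made-up 100-row frame. The toy data and variable names are assumptions; the point is that the validation rows are sampled from the held-out 20% and then removed from the test set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

toy_frame = pd.DataFrame({'stop_quote': [f'quote {i}' for i in range(100)]})  # made-up rows

toy_fit, toy_holdout = train_test_split(toy_frame, test_size=0.2, random_state=0)
toy_val = toy_holdout.sample(frac=0.5, random_state=0)   # half of the held-out rows
toy_eval = toy_holdout.drop(toy_val.index)               # the remaining half

print(len(toy_fit), len(toy_val), len(toy_eval))
```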

The shapes of the data are as follows:

In [None]:
goodreads_sample = goodreads.sample(frac = 1)
goodreads_sample = goodreads_sample[['stop_tags', 'stop_quote']]
train, test = train_test_split(goodreads_sample, test_size = 0.2, shuffle = True)
val = test.sample(frac=0.5)
test = test.drop(val.index)  # keep only the held-out rows that were not moved to validation
train.shape, test.shape, val.shape

Next, we'll encode the lists of strings in the "stop_tags" column as multi-hot integer vectors. We'll accomplish this via TensorFlow's ragged constant() and the Keras StringLookup() layer. 
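
The idea behind multi-hot encoding can be shown in plain NumPy. This is a sketch of the concept on made-up tags, not the StringLookup implementation; in particular, the Keras layer reserves a slot for out-of-vocabulary tokens by default, which this version leaves out.

```python
import numpy as np

toy_tag_lists = [['life', 'love'], ['love'], ['hope', 'life']]   # hypothetical training tags

# Fixed vocabulary: one column per distinct tag
tag_vocab = sorted({tag for tags in toy_tag_lists for tag in tags})
tag_index = {tag: i for i, tag in enumerate(tag_vocab)}

def multi_hot(tags):
    """Return a 0/1 vector with a 1 at the position of every known tag."""
    vec = np.zeros(len(tag_vocab), dtype=int)
    for tag in tags:
        if tag in tag_index:          # unseen tags are ignored in this sketch
            vec[tag_index[tag]] = 1
    return vec

encoded_rows = np.stack([multi_hot(tags) for tags in toy_tag_lists])
print(tag_vocab)
print(encoded_rows)
```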

In [None]:
from ast import literal_eval
from tensorflow.ragged import constant
from tensorflow.keras import layers

terms = constant(train["stop_tags"].values)
lookup = layers.StringLookup(output_mode="multi_hot")
lookup.adapt(terms)

We're going to need some information about our quote data, so we'll quickly split them into lists and use pandas describe() methods to generate the info about max lengths, avg lengths, etc.

In [None]:
train["stop_quote"].apply(lambda x: len(x.split(" "))).describe()

In [None]:
max_seqlen = 10
batch_size = 6
padding_token = "<pad>"

from tensorflow.data import AUTOTUNE
from tensorflow.data import Dataset

auto = AUTOTUNE


def make_dataset(dataframe, is_train=True):
    labels = constant(dataframe["stop_tags"].values)
    label_binarized = lookup(labels).numpy()

    dataset = Dataset.from_tensor_slices(
        (dataframe["stop_quote"].values, label_binarized)
    )

    dataset = dataset.shuffle(batch_size * 10) if is_train else dataset
    return dataset.batch(batch_size)

train_dataset = make_dataset(train, is_train=True)
validation_dataset = make_dataset(val, is_train=False)
test_dataset = make_dataset(test, is_train=False)

In [None]:
vocabulary = set()
train["stop_quote"].str.lower().str.split().apply(vocabulary.update)
vocabulary_size = len(vocabulary)
print(vocabulary_size)

In [None]:
text_vectorizer = layers.TextVectorization(max_tokens=vocabulary_size, ngrams=2, output_mode="tf_idf")

import tensorflow as tf
# `TextVectorization` layer needs to be adapted as per the vocabulary from our
# training set.
with tf.device("/CPU:0"):
    text_vectorizer.adapt(train_dataset.map(lambda text, label: text))

train_dataset = train_dataset.map(lambda text, label: (text_vectorizer(text), label), num_parallel_calls=auto).prefetch(auto)
validation_dataset = validation_dataset.map(lambda text, label: (text_vectorizer(text), label), num_parallel_calls=auto).prefetch(auto)
test_dataset = test_dataset.map(lambda text, label: (text_vectorizer(text), label), num_parallel_calls=auto).prefetch(auto)

In [None]:
keras_model = Sequential()
keras_model.add(Dropout(0.2))
keras_model.add(Dense(1000, activation = 'relu'))
keras_model.add(Dense(500, activation = 'relu'))
keras_model.add(Dense(lookup.vocabulary_size(), activation = 'sigmoid'))

keras_model.compile(loss="binary_crossentropy", optimizer = 'adam', metrics=['categorical_accuracy'])
keras_model.build((None, vocabulary_size))
keras_model.summary()

In [None]:
history = keras_model.fit(train_dataset, 
                          validation_data = validation_dataset, 
                          epochs = 15,
                          callbacks = [EarlyStopping(monitor = 'categorical_accuracy', patience = 3)])

In [None]:
info = keras_model.evaluate(test_dataset)

In [None]:
def plot_result(item):
    plt.plot(history.history[item], label=item)
    plt.plot(history.history["val_" + item], label="val_" + item)
    plt.xlabel("Epochs")
    plt.ylabel(item)
    plt.title("Train and Validation {} Over Epochs".format(item), fontsize=14)
    plt.legend()
    plt.grid()
    plt.show()


plot_result("loss")
plot_result("categorical_accuracy")

Given the unsuccessful nature of using a list of tags, let's try a completely different approach and see if we can make a better result, albeit from a different standpoint. 

Instead of trying to classify from a multilabel view, what if we just assign one tag to each quote? We saw earlier that many tags are repeated throughout the quotes, so randomly selecting one tag per quote would simplify the problem. 

We'll apply this idea by using numpy's random.choice() method on each list of tags and putting that new tag in a column called 'rand_tag'. 

In [None]:
subset = goodreads[['stop_quote', 'stop_tags']].copy()  # copy so new columns can be added safely
subset['rand_tag'] = subset['stop_tags'].apply(lambda x: np.random.choice(x))
subset.head()

Next, we'll encode the tags by looping through the randomly chosen tags and placing them in a dictionary. If the tag already exists in the dictionary, nothing is done; if not, it is added, and the number assigned is the length of the dictionary before the insertion, so numbering starts at 0. 

For example, tags encountered in the order:

['life', 'love', 'data']

['love', 'joy', 'science'] 

would generate a dictionary of 

{'life': 0, 'love': 1, 'data': 2, 'joy': 3, 'science': 4}

In [None]:
encoded = {}

for tag in subset.rand_tag:

  if tag not in encoded.keys():

    encoded[tag] = len(encoded)

subset['encoded_tag'] = subset.rand_tag.map(encoded)

Next, we'll shuffle the data again and reset the index. 

In [None]:
subset = subset.sample(frac = 1).reset_index(drop = True)
subset.head()

Then we'll one-hot encode the tags via keras' to_categorical() function. This will take a Series of integers and convert each one to an array containing a single 1 (at that tag's index) and 0s everywhere else. 
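
The same one-hot idea can be sketched in NumPy by indexing into an identity matrix. The integer ids below are made up for illustration.

```python
import numpy as np

toy_ids = np.array([0, 2, 1, 2])                     # hypothetical encoded tags
one_hot = np.eye(toy_ids.max() + 1, dtype=int)[toy_ids]   # row i of the identity matrix encodes class i

print(one_hot)
```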

In [None]:
onehot_encoded = to_categorical(subset.encoded_tag)
onehot_encoded

We're going to use an AutoTokenizer to tokenize our quotes later on, together with a pretrained BERT model. 

In [None]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
bert = TFBertModel.from_pretrained('bert-base-cased')

We'll split into training and testing data, with a split of 80/20 on our stop_quote and encoded_tag columns. 

We'll again find out information to inform our model via pandas describe() functionality. 

In [None]:
data = subset[['stop_quote', 'encoded_tag']]
train, test = train_test_split(data, test_size = 0.2, random_state = 42)

In [None]:
subset["stop_quote"].apply(lambda x: len(x.split(" "))).describe()

Now we'll develop our x values from our training and testing datasets and tokenize them via the tokenizer initialization. 

We have several parameters to use:
- add_special_tokens = True adds the special start and end tokens to our tokenized quotes. 
- max_length is based on the length statistics we saw earlier; quotes longer than this are cut off, which is why we set truncation = True.
- return_attention_mask = True returns a mask that tells the model not to pay attention to the padding tokens when it reads the sentence. 
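
How these parameters interact can be sketched with a toy encoder in plain Python. This is not the Hugging Face tokenizer: the function `toy_encode`, its placeholder ids for the start, end, and padding tokens, and the choice to pad every sequence to `max_length` are all assumptions for illustration.

```python
def toy_encode(token_ids, max_length, pad_id=0, start_id=101, end_id=102):
    """Wrap ids in start/end tokens, truncate, pad, and build the attention mask."""
    body = token_ids[:max_length - 2]              # leave room for the two special tokens
    ids = [start_id] + body + [end_id]
    mask = [1] * len(ids)                          # 1 = real token the model should attend to
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding   # 0 = padding to ignore

short_ids, short_mask = toy_encode([7, 8, 9], max_length=7)
long_ids, long_mask = toy_encode(list(range(10, 30)), max_length=7)

print(short_ids, short_mask)
print(long_ids, long_mask)
```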

In [None]:
x_train = tokenizer(
    text=train['stop_quote'].tolist(),
    add_special_tokens=True,
    max_length=9,
    truncation=True,
    padding=True, 
    return_tensors='tf',
    return_token_type_ids = False,
    return_attention_mask = True,
    verbose = True)
x_test = tokenizer(
    text=test['stop_quote'].tolist(),
    add_special_tokens=True,
    max_length=9,
    truncation=True,
    padding=True, 
    return_tensors='tf',
    return_token_type_ids = False,
    return_attention_mask = True,
    verbose = True)

Next, we can generate our y values from our training and testing datasets.

In [None]:
y_train = train['encoded_tag']
y_test = test['encoded_tag']

We can also get our input_ids and attention_masks. The former holds the integer token ids that we are going to feed into our BERT model to be read and interpreted. The latter is a mask of 1s and 0s with the same shape, telling the model which positions hold real tokens (1) and which hold padding (0). 


In [None]:
input_ids = x_train['input_ids']
attention_mask = x_train['attention_mask']

In [None]:
import tensorflow as tf
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.initializers import TruncatedNormal
from tensorflow.keras.losses import CategoricalCrossentropy
from tensorflow.keras.metrics import CategoricalAccuracy
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.layers import Input, Dense

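Now, we begin building our model. 

Firstly, we need to define the two input layers (token ids and attention mask), pass them through the pretrained BERT model, and stack pooling, dense, and dropout layers on top of the embeddings it returns. 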
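
When choosing the final layer and the loss, it helps to recall what categorical cross-entropy computes: it compares a one-hot label against one probability per class, so the last Dense layer typically has as many units as there are distinct tags. The NumPy sketch below uses made-up scores for three hypothetical tags; the function names are assumptions for illustration.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 along the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def categorical_cross_entropy(one_hot, probs, eps=1e-12):
    """Negative log-probability assigned to the true class of each row."""
    return -np.sum(one_hot * np.log(probs + eps), axis=-1)

toy_logits = np.array([[2.0, 0.5, 0.1],     # made-up scores for 3 hypothetical tags
                       [0.2, 0.1, 3.0]])
toy_labels = np.eye(3)[[0, 2]]              # one-hot labels for classes 0 and 2

toy_probs = softmax(toy_logits)
toy_losses = categorical_cross_entropy(toy_labels, toy_probs)
print(toy_probs.round(3), toy_losses.round(3))
```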

In [None]:
max_len = 9
input_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_ids")
input_mask = Input(shape=(max_len,), dtype=tf.int32, name="attention_mask")
embeddings = bert(input_ids,attention_mask = input_mask)[0] 
out = tf.keras.layers.GlobalMaxPool1D()(embeddings)
out = Dense(256, activation='relu')(out)
out = tf.keras.layers.Dropout(0.1)(out)
out = Dense(64, activation = 'relu')(out)
y = Dense(1, activation = 'sigmoid')(out)
model = tf.keras.Model(inputs=[input_ids, input_mask], outputs=y)
model.layers[2].trainable = True

In [None]:
model.summary()

In [None]:
optimizer = Adam(
    learning_rate=5e-05, # this learning rate is for the BERT model, taken from the Hugging Face website
    epsilon=1e-08,
    decay=0.01,
    clipnorm=1.0)
# Set loss and metrics
# The output layer already applies an activation, so the loss receives probabilities, not logits
loss = CategoricalCrossentropy(from_logits = False)
metric = CategoricalAccuracy('balanced_accuracy')
# Compile the model
model.compile(
    optimizer = optimizer,
    loss = loss, 
    metrics = [Precision(), Recall()])

In [None]:
train_history = model.fit(
    x = {'input_ids': x_train['input_ids'],'attention_mask': x_train['attention_mask']} ,
    y = y_train,
    validation_data = (
    {'input_ids':x_test['input_ids'],'attention_mask':x_test['attention_mask']}, y_test
    ), epochs=1, batch_size=32
)

In [None]:
test_info = model.evaluate(
    {'input_ids':x_test['input_ids'],'attention_mask':x_test['attention_mask']},
    y_test
)

Now, we can finally calculate the F1 score for this model from the precision and recall returned by model.evaluate(), and compare it with the scores of our earlier models. 

In [None]:
# model.evaluate() returns a list [loss, precision, recall], in the order the metrics were compiled
_, p, r = test_info

f1_score = (2 * p * r) / (p + r) if (p + r) > 0 else 0.0
f1_score