<a href="https://colab.research.google.com/github/preyansh98/NLP_Final_Project/blob/main/NLP_Final_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Introduction**

*Multi-label classification predicts properties of data points that are not mutually exclusive, so each sample can carry several target labels at once. Multi-label text classification has many applications, such as social media targeting, opinion and sentiment recognition, and recommendation systems. Most applications of multi-label text classification do not consider the effect that word order and context can have on model performance. Our paper therefore evaluates multi-label learning by comparing three feature representations: bag-of-words (TF-IDF), word2vec, which learns from a window of neighbouring words, and ELMo, which takes word context into account. Each representation is applied independently as a feature extraction step and then evaluated with Binary Relevance over Logistic Regression, Multinomial Naïve Bayes, and SVM classifiers on the task of predicting movie genres from plot summaries.*

Project by

*   Preyansh Kaushik - 260790402
*   Elie Elia - 260759306
*   Rozerin Akkus  - 260775633


---

In [None]:
from google.colab import drive
import numpy as np
import pandas as pd
import json
import matplotlib.pyplot as plt 
import nltk
%tensorflow_version 1.x
import tensorflow as tf
import seaborn as sns

nltk.download('stopwords')

drive.mount('/content/drive')
path = "/content/drive/MyDrive/COMP550/"

# Load Datasets

We load the CMU Movie Summary Corpus for our experiment.
We use two of its files:

*   plot_summaries.txt : Movie ID and the plot summaries for the movies
*   movie.metadata.tsv : Tab-separated values for movie ID, name, genre



In [None]:
path_to_movie_dataset = path + "movie.metadata.tsv"
path_to_plot_summaries = path + "plot_summaries.txt"

# headers obtained from corpus website
headers = ["movie_id", "freebase_id", "movie_name", "movie_release_date", "movie_box_office_rev", "movie_runtime", "movie_langs", "movie_countries", "genres"]

metadata = pd.read_csv(path_to_movie_dataset, sep = "\t", names = headers)
print("Movie IDs available ", metadata.shape[0])
metadata.head()

In [None]:
movie_ids, plots = [], []

with open(path_to_plot_summaries, "rt") as f:
  lines = f.readlines()

  for line in lines:
    data = line.split(None, 1)

    movie_ids.append(data[0])
    plots.append(data[1])

movies_and_plots = pd.DataFrame({'movie_id': movie_ids, 'plot_summary' : plots})
movies_and_plots.head()

Now, we merge the two datasets. 

We only need movie_id, movie name, plot_summary, and genres. 
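
As a minimal sketch of this merge step, the toy example below joins two small tables on `movie_id` and turns each JSON genre mapping into a list of genre names. The rows, the `/m/...` keys, and the names `plots` and `meta` are made up for illustration and are not taken from the corpus.

```python
import json
import pandas as pd

# Toy stand-ins for the two corpus files (all values are invented).
plots = pd.DataFrame({
    "movie_id": ["1", "2", "3"],
    "plot_summary": ["A heist goes wrong.", "Two friends fall in love.", "A lost dog finds home."],
})
meta = pd.DataFrame({
    "movie_id": [1, 2, 4],
    "movie_name": ["Heist", "Friends", "Orphan"],
    "genres": ['{"/m/a": "Thriller", "/m/b": "Crime"}', '{"/m/c": "Romance"}', '{}'],
})

# IDs read from the text file are strings, so cast before merging.
meta["movie_id"] = meta["movie_id"].astype(str)

# The default inner join keeps only movies present in both tables.
merged = pd.merge(plots, meta, on="movie_id")

# Each genres cell is a JSON object mapping an ID to a genre name; keep the names.
merged["genres"] = merged["genres"].apply(lambda s: list(json.loads(s).values()))
print(merged[["movie_id", "movie_name", "genres"]])
```

Movies that appear in only one of the two tables (IDs 3 and 4 here) are dropped by the inner join.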

In [None]:
metadata['movie_id'] = metadata['movie_id'].astype(str)

dataset = pd.merge(movies_and_plots, metadata, on='movie_id')

# keep only the columns we need, in display order
# (indexing with a list; indexing with a set is not supported by recent pandas)
dataset = dataset[['movie_id', 'movie_name', 'plot_summary', 'genres']]

# extract genres
genres_column = []

for val in dataset['genres']:
  genres = list(json.loads(val).values())
  genres_column.append(genres)

dataset['genres'] = genres_column
print(dataset.shape)
dataset.head()

# Preprocessing Dataset

## Cleaning Dataset

In [None]:
# remove movies with no genres from the dataset, and renumber the rows
dataset = dataset[dataset['genres'].map(len) > 0].reset_index(drop=True)
# print(dataset.shape)

## Text Preprocessing

- Lowercase the text and strip non-letters; optionally remove stopwords and lemmatize or stem

In [None]:
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
import re

stop_words = set(stopwords.words('english')) 

def clean_sentence(sentence):
  sentence = sentence.lower()

  # replace non-word characters (punctuation, symbols) with spaces
  sentence = re.sub("\\W", " ", sentence)

  # replace anything that is not a letter (digits, underscores) with spaces
  sentence = re.sub("[^a-zA-Z]"," ",sentence)

  return sentence

def remove_stopwords(sentence):
  return ' '.join([word for word in sentence.split() if word not in stop_words])

def lemmatize_text(sentence):
  lemmatizer = WordNetLemmatizer()

  lemmatized_words = []

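  for word, tag in pos_tag(word_tokenize(sentence)):
    # map Penn Treebank tags to WordNet POS tags (adjectives are tagged 'J*', WordNet uses 'a')
    wntag = tag[0].lower()
    wntag = 'a' if wntag == 'j' else wntag
    wntag = wntag if wntag in ['a','r','n','v'] else None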

    if wntag:
      lemmatized_words.append(lemmatizer.lemmatize(word,wntag))
    else:
      lemmatized_words.append(word)

  return ' '.join(lemmatized_words)

def stem_text(sentence):
  stemmer = PorterStemmer()
  # stem each whitespace-separated word (iterating the string directly would yield characters)
  return ' '.join([stemmer.stem(word) for word in sentence.split()])

def clean_data(data_column, rm_stopwords, lemm, stem = False):
  """ Takes a pandas data column, and applies a series of functions to each value. Returns pandas column """
  if lemm and stem:
    raise ValueError("Either lemmatize or stem. Both can not be true.")

  result = data_column

  # basic preprocessing, remove punctuation etc. 
  result = result.apply(lambda sentence : clean_sentence(sentence))

  if rm_stopwords:
    result = result.apply(lambda sentence : remove_stopwords(sentence))

  if lemm:
    result = result.apply(lambda sentence : lemmatize_text(sentence))

  if stem:
    result = result.apply(lambda sentence : stem_text(sentence))

  return result

dataset['plot_summary_cleaned'] = clean_data(dataset['plot_summary'], rm_stopwords = False, lemm = False, stem = False)
dataset.head()

# Exploratory Data Analysis

We visualize two features of our dataset: word clouds per genre (selectable from a dropdown) and the frequency of each genre.
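
The genre-frequency idea can be sketched without any plotting. The example below counts genre assignments over a made-up list of per-movie genre lists and converts the counts to percentages; `toy_genres` and `shares` are hypothetical names used only here.

```python
from collections import Counter

# Toy per-movie genre lists (invented for illustration).
toy_genres = [["Drama", "Romance"], ["Drama"], ["Comedy", "Drama"], ["Comedy"]]

# Count every genre assignment across all movies.
counts = Counter(g for genres in toy_genres for g in genres)
total = sum(counts.values())

# Share of each genre among all assignments, in percent, most frequent first.
shares = {g: 100 * c / total for g, c in counts.most_common()}
for genre, share in shares.items():
    print(f"{genre}: {share:.1f}%")
```

Because a movie can carry several genres, the percentages are relative to the total number of genre assignments, not to the number of movies.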

In [None]:
from wordcloud import WordCloud
from nltk.tokenize import wordpunct_tokenize
import ipywidgets as wid
from IPython.display import clear_output

def filter_dataset_by_genre(genres_list):
  return dataset[pd.DataFrame(dataset.genres.tolist()).isin(genres_list).any(axis=1).values]

def get_words_for_genre(genre):
  plots = filter_dataset_by_genre([genre])['plot_summary']
  return plots

def plot_word_map_for_genre(dataset, title, genre):
  plots = get_words_for_genre(genre)
  words = ''
  for plot in plots:
    tokens = wordpunct_tokenize(plot)
    for i in range(len(tokens)):
      tokens[i]   = tokens[i].lower()
    words += " ".join(tokens)+" "

  wordcloud = WordCloud(width = 800, height = 800, 
                  background_color ='white', 
                  stopwords = stop_words, 
                  min_font_size = 10).generate(words) 
                  
  plt.figure(figsize = (8, 8), facecolor = None)
  plt.imshow(wordcloud)
  plt.axis("off")
  plt.title(title)
  plt.tight_layout(pad = 0)

  plt.show()

def event_handler(index):
  for i in range(10):
    clear_output(wait=True)
  dropdown_menu(index.new)
  plot_word_map_for_genre(dataset, title="Word Map for {} genre".format(index.new), genre=index.new)

def dropdown_menu(value):
  dropdown = wid.Dropdown(options = sorted(list(set([item for sublist in dataset['genres'].tolist() for item in sublist]))), value=value)
  dropdown.observe(event_handler, names="value")
  display(dropdown)

dropdown_menu('Drama')
plot_word_map_for_genre(dataset, title="Word Map for {} genre".format('Drama'), genre='Drama')

In [None]:
import seaborn as sns

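# flatten the per-movie genre lists into one list of genre names
genres_list  = sum(dataset['genres'], [])
# len(set(genres_list))

# relative frequency (in %) of each genre among all genre assignments
genres_list  = nltk.FreqDist(genres_list)
total = sum(genres_list.values())
freq_df = pd.DataFrame({'Genre name': list(genres_list.keys()), 'Frequency': [(x/total)*100 for x in genres_list.values()]})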

freq_g = freq_df.nlargest(columns = "Frequency", n = 40)
plt.figure(figsize = (12,6))
ax = sns.barplot(data = freq_g,  x = "Genre name", y ="Frequency", palette='Reds_r')
ax.set(xlabel='Genres' , ylabel='Frequency in %')
plt.xticks(rotation = 90)
plt.show()

# Feature Generation

Vectorize data to features using different techniques:
- TF-IDF (bag-of-words)
- word2vec
- ELMo

First, build input X and output Y

In [None]:
X = dataset['plot_summary_cleaned'] 
Y = dataset['genres']

all_genres = sorted(list(set([item for sublist in dataset['genres'].tolist() for item in sublist])))

Split data into train and test sets, and encode the output as a binary indicator matrix (one column per genre):
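
To illustrate the label encoding, here is a small sketch using scikit-learn's `MultiLabelBinarizer` on made-up genre lists. The names `toy_labels` and `toy_mlb` are hypothetical and chosen so they do not clash with the notebook's variables.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy genre lists, one list per movie (invented for illustration).
toy_labels = [["Drama", "Romance"], ["Comedy"], ["Comedy", "Drama"]]

# Fit learns the sorted set of classes; transform builds one 0/1 column per class.
toy_mlb = MultiLabelBinarizer().fit(toy_labels)
encoded = toy_mlb.transform(toy_labels)

print(toy_mlb.classes_)
print(encoded)

# inverse_transform maps indicator rows back to tuples of genre names.
decoded = toy_mlb.inverse_transform(encoded)
print(decoded)
```

Each row can contain several ones, which is what distinguishes this encoding from a one-hot encoding of a single class.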

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer

# fixed seed so the split (and any features cached from it) is reproducible
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 42)

mlb = MultiLabelBinarizer().fit(Y)

y_train_ohe = mlb.transform(y_train)
y_test_ohe = mlb.transform(y_test)

### TF-IDF:
https://towardsdatascience.com/text-classification-with-nlp-tf-idf-vs-word2vec-vs-bert-41ff868d1794
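
A minimal sketch of the TF-IDF step on toy documents: the vectorizer is fitted on the training texts only and then reused for the test texts, so words unseen in training are ignored. The documents and the small `max_features` value are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents (invented for illustration).
train_docs = ["the detective hunts the killer", "the couple falls in love"]
test_docs = ["a detective falls in love with a spy"]

# Keep at most 5 terms, chosen by term frequency across the training documents.
vec = TfidfVectorizer(max_features=5)

# Learn vocabulary and IDF weights from the training texts only.
train_features = vec.fit_transform(train_docs)

# Reuse the fitted vocabulary; unseen words such as "spy" are dropped.
test_features = vec.transform(test_docs)

print(train_features.shape, test_features.shape)
print(sorted(vec.vocabulary_))
```

Both matrices are sparse and share the same column order, which is what allows a classifier trained on one to be applied to the other.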

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

TF_VOCAB_SIZE = 10000

tf_vec = TfidfVectorizer(max_features=TF_VOCAB_SIZE)
tf_feature_train = tf_vec.fit_transform(x_train)
tf_feature_test = tf_vec.transform(x_test)

### Word2Vec:
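
Word2Vec yields one vector per word, so a plot summary has to be reduced to a single fixed-length vector before it can be fed to a classifier. A common choice, and the one this notebook uses further down, is to average the vectors of the in-vocabulary words. The sketch below shows that idea with a hand-made lookup table instead of a trained model; `toy_vectors` and `average_vector` are hypothetical names.

```python
import numpy as np

# Hand-made 3-dimensional "embeddings" standing in for a trained model.
toy_vectors = {
    "dog": np.array([1.0, 0.0, 0.0]),
    "runs": np.array([0.0, 1.0, 0.0]),
    "home": np.array([0.0, 0.0, 1.0]),
}

def average_vector(text, vectors, size):
    """Average the vectors of the words in `text` that have an embedding."""
    known = [vectors[w] for w in text.split() if w in vectors]
    if not known:
        # No known words: fall back to the zero vector.
        return np.zeros(size)
    return np.mean(known, axis=0)

# "the" has no embedding and is skipped; the other three words are averaged.
doc_vec = average_vector("the dog runs home", toy_vectors, 3)
print(doc_vec)
```

Averaging discards word order, so two summaries with the same words in a different order map to the same feature vector.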

In [None]:
from gensim.models import Word2Vec
from time import time 
import logging 
logging.basicConfig(format = "%(levelname)s-%(asctime)s: %(message)s", datefmt = '%H:%M:%S', level = logging.INFO)

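# tokenize each plot and append its genre labels as extra tokens,
# so that genre names (e.g. 'Musical') also receive embeddings
tokenized = []
for i in range(len(dataset)):
  row = dataset.iloc[i]
  token_tagged_plot = row['plot_summary_cleaned'].split()+row['genres']
  tokenized.append(token_tagged_plot)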

print(len(tokenized))

In [None]:
w2vec_model = Word2Vec(tokenized, 
                        min_count = 3,
                        window  = 10,
                        size  = 100)  # gensim < 4.0 keyword; gensim >= 4.0 renames it to vector_size

In [None]:
t = time()

w2vec_model.train(tokenized, total_examples=w2vec_model.corpus_count, epochs=20, report_delay=1)

print('Time to train the model: {} mins'.format(round((time() - t) / 60, 2)))

In [None]:
w2vec_model.wv.init_sims(replace = True)
w2vec_model.wv.most_similar(positive = ['Musical'], topn = 20)

In [None]:
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def tsne_plot(model, word, list_names):
  # empty (0, vector_size) array so appended rows match the embedding dimensionality
  arrays = np.empty((0, model.wv.vector_size), dtype='f')
  word_labels = [word]
  color_list  = ['red']

  arrays = np.append(arrays, model.wv[[word]], axis=0)

  # gets list of most similar words
  close_words = model.wv.most_similar([word], topn = 20)

  # adds the vector for each of the closest words to the array
  for wrd_score in close_words:
      wrd_vector = model.wv[[wrd_score[0]]]
      word_labels.append(wrd_score[0])
      color_list.append('blue')
      arrays = np.append(arrays, wrd_vector, axis=0)

  # adds the vector for each of the comparison words in list_names to the array
  for wrd in list_names:
      wrd_vector = model.wv[[wrd]]
      word_labels.append(wrd)
      color_list.append('green')
      arrays = np.append(arrays, wrd_vector, axis=0)

  # Reduces the dimensionality from vector_size to 25 dimensions with PCA
  reduc = PCA(n_components=25).fit_transform(arrays)
  
  # Finds t-SNE coordinates for 2 dimensions
  np.set_printoptions(suppress=True)
  
  Y = TSNE(n_components=2, random_state=0, perplexity=20).fit_transform(reduc)
  
  # Sets everything up to plot
  df = pd.DataFrame({'x': [x for x in Y[:, 0]],
                      'y': [y for y in Y[:, 1]],
                      'words': word_labels,
                      'color': color_list})
  
  fig, _ = plt.subplots()

  p1 = sns.regplot(data=df,
                     x="x",
                     y="y",
                     fit_reg=False,
                     marker="o",
                     scatter_kws={'s': 40,
                                  'facecolors': df['color']
                                 }
                    )
    
  for line in range(0, df.shape[0]):
        p1.text(df["x"][line],
                df['y'][line],
                '  ' + df["words"][line].title(),
                horizontalalignment='left',
                verticalalignment='bottom', size='small',
                color=df['color'][line],
                weight='normal'
              ).set_size(20)

  
  plt.xlim(Y[:, 0].min()-50, Y[:, 0].max()+50)
  plt.ylim(Y[:, 1].min()-50, Y[:, 1].max()+50)
          
  plt.title('t-SNE visualization for {}'.format(word.title()))

In [None]:
# testing out with top 20 similar words vs. random words 
tsne_plot(w2vec_model, 'Musical', ['dog', 'cat', 'coffee', 'computer', 'table', 'bird'])

In [None]:
tsne_plot(w2vec_model, 'love', ['dog', 'cat', 'coffee', 'computer', 'table', 'bird'])

In [None]:
def buildWordVector(text, size):
  # average the Word2Vec vectors of all in-vocabulary words in the text
  vec = np.zeros(size).reshape((1, size))
  count = 0.
  for word in text.split():
      try:
          vec += w2vec_model.wv[word].reshape((1, size))
          count += 1.
      except KeyError:
          # skip words that are not in the Word2Vec vocabulary
          continue
  if count != 0:
      vec /= count
  return vec

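from sklearn.preprocessing import StandardScaler

w2v_feature_train = np.concatenate([buildWordVector(z, 100) for z in x_train])
w2v_feature_test = np.concatenate([buildWordVector(z, 100) for z in x_test])

# fit the scaler on the training features only, then apply the same scaling to the test features
w2v_scaler = StandardScaler().fit(w2v_feature_train)
w2v_feature_train = w2v_scaler.transform(w2v_feature_train)
w2v_feature_test = w2v_scaler.transform(w2v_feature_test)

# the Word2Vec experiments use only the first listed genre of each movie as a single label
y_train_w2v = [i[0] for i in y_train]
y_test_w2v = [i[0] for i in y_test]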

### ELMo:

In [None]:
import pickle

# True: cache freshly computed ELMo features; False: load the cached ones
save = False

if save:
  # save elmo_feature_train
  with open("elmo_train.pickle", "wb") as pickle_out:
    pickle.dump(elmo_feature_train, pickle_out)

  # save elmo_feature_test
  with open("elmo_test.pickle", "wb") as pickle_out:
    pickle.dump(elmo_feature_test, pickle_out)

else:
  # load elmo_feature_train
  with open("elmo_train.pickle", "rb") as pickle_in:
    elmo_feature_train = pickle.load(pickle_in)

  # load elmo_feature_test
  with open("elmo_test.pickle", "rb") as pickle_in:
    elmo_feature_test = pickle.load(pickle_in)

In [None]:
import tensorflow_hub as hub

url = "https://tfhub.dev/google/elmo/2"
elmo = hub.Module(url)

# we will define a function to generate elmo embeddings:
def get_elmo_embeddings(x):
  # pass a plain list of strings rather than a pandas Series
  embeddings = elmo(x.tolist(), signature="default", as_dict=True)['elmo']

  with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())

    return session.run(tf.reduce_mean(embeddings,1))

# Since ELMo takes a long time, we embed the data sequentially in batches of 50 samples.
train_batches = [x_train[i:i+50] for i in range(0, x_train.shape[0], 50)]
test_batches = [x_test[i:i+50] for i in range(0, x_test.shape[0],50)]

elmo_train_embeddings = [get_elmo_embeddings(batch) for batch in train_batches]
elmo_test_embeddings = [get_elmo_embeddings(batch) for batch in test_batches]

# concatenate
elmo_feature_train = np.concatenate(elmo_train_embeddings, axis=0)
elmo_feature_test = np.concatenate(elmo_test_embeddings, axis = 0)

# Models
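
The classifiers in this section are wrapped in scikit-learn's `OneVsRestClassifier`, which for an indicator-matrix target trains one independent binary classifier per label (the Binary Relevance strategy named in the introduction). The sketch below shows that behaviour on a made-up two-feature, two-label problem; the data and the names `X_toy`, `Y_toy`, and `clf` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy features and a 2-label indicator matrix (invented for illustration).
X_toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0],
                  [0.1, 0.9], [1.0, 1.0], [0.9, 0.8]])
Y_toy = np.array([[1, 0], [1, 0], [0, 1],
                  [0, 1], [1, 1], [1, 1]])

# One logistic regression is fitted per label column, independently of the others.
clf = OneVsRestClassifier(LogisticRegression()).fit(X_toy, Y_toy)

# predict returns an indicator matrix with one column per label.
pred = clf.predict(X_toy)
print(len(clf.estimators_), pred.shape)
```

Because the per-label classifiers are independent, the wrapped estimator's hyperparameters are addressed with the `estimator__` prefix when tuning, for example `estimator__C`.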


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

THRESHOLD = 0.3

# take in a model, and parameters, then grid search and fit. 
def run_cv_fit_model(model, parameters, x_train, y_train):
  # binary relevance: one independent binary classifier per label
  classifier = OneVsRestClassifier(model)

  # tune hyperparameters with CV:
  classifier_cv = GridSearchCV(classifier, parameters, cv=5)

  return classifier_cv.fit(x_train, y_train)

# map the binary-encoded output back to genre names
def inverse_to_genres(pred):
  return mlb.inverse_transform(pred)

## Logistic Regression

In [None]:
class LogisticReg:
  def __init__(self, param_grid, x_train, y_train):
    self.model = self.__compile__(param_grid, x_train, y_train)
    return

  def __compile__(self, param_grid, x_train, y_train):
    lr = LogisticRegression(penalty='l2')
    return run_cv_fit_model(lr, param_grid, x_train, y_train)

  def predict(self, x_test):
    return self.model.predict(x_test)

  def score(self, x_test, y_test):
    return self.model.score(x_test,y_test)

## Multinomial Naive Bayes

In [None]:
class MNB:
  def __init__(self, param_grid, x_train, y_train):
    self.model = self.__compile__(param_grid, x_train, y_train)
    return

  def __compile__(self, param_grid, x_train, y_train):
    mnb = MultinomialNB()
    return run_cv_fit_model(mnb, param_grid, x_train, y_train)

  def predict(self, x_test):
    return self.model.predict(x_test)

  def score(self, x_test, y_test):
    return self.model.score(x_test,y_test)

## SVM Classifier

In [None]:
class SVM:
  def __init__(self, param_grid, x_train, y_train):
    self.model = self.__compile__(param_grid, x_train, y_train)
    return

  def __compile__(self, param_grid, x_train, y_train):
    model = SVC(kernel='linear')
    return run_cv_fit_model(model, param_grid, x_train, y_train)

  def predict(self, x_test):
    return self.model.predict(x_test)

  def score(self, x_test, y_test):
    return self.model.score(x_test,y_test)

# Results
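
All results below are reported as micro-averaged F1. As a sketch of what that metric measures, the example pools true positives, false positives, and false negatives over every (sample, label) cell of a made-up indicator matrix and compares the hand-computed value with `f1_score(..., average='micro')`. The arrays are named with a `_toy` suffix so they do not overwrite the notebook's variables.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy ground truth and predictions: 3 samples, 3 labels (invented for illustration).
y_true_toy = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred_toy = np.array([[1, 0, 0], [0, 1, 1], [1, 0, 0]])

# Pool the counts over all samples and all labels.
tp = np.sum((y_true_toy == 1) & (y_pred_toy == 1))
fp = np.sum((y_true_toy == 0) & (y_pred_toy == 1))
fn = np.sum((y_true_toy == 1) & (y_pred_toy == 0))

# Micro F1 = 2*TP / (2*TP + FP + FN)
micro_f1 = 2 * tp / (2 * tp + fp + fn)

print(micro_f1, f1_score(y_true_toy, y_pred_toy, average='micro'))
```

Since the counts are pooled before the ratio is taken, frequent genres influence the score more than rare ones.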

## Logistic Regression 

In [None]:
# define parameter grid for logistic regression
C = np.logspace(0,4,10)

log_reg_param_grid = [{
    'estimator__C' : C
}]

### TF-IDF

In [None]:
lr_tfidf = LogisticReg(log_reg_param_grid, tf_feature_train, y_train_ohe)

In [None]:
y_pred = lr_tfidf.predict(tf_feature_test)
print(f1_score(y_test_ohe, y_pred, average='micro'))

### Word2Vec

In [None]:
lr_w2v = LogisticReg(log_reg_param_grid, w2v_feature_train, y_train_w2v)

In [None]:
y_pred = lr_w2v.predict(w2v_feature_test)
print(f1_score(y_test_w2v, y_pred, average='micro'))

### ELMo

In [None]:
lr_elmo = LogisticReg(log_reg_param_grid, elmo_feature_train, y_train_ohe)

In [None]:
y_pred = lr_elmo.predict(elmo_feature_test)
print(f1_score(y_test_ohe, y_pred, average='micro'))

## Multinomial Naive Bayes

In [None]:
nb_param_grid = [{  
'estimator__alpha': (1, 0.1, 0.01, 0.001, 0.0001, 0.00001)  
}]

### TF-IDF

In [None]:
mnb_tfidf = MNB(nb_param_grid, tf_feature_train, y_train_ohe)

In [None]:
y_pred = mnb_tfidf.predict(tf_feature_test)
print(f1_score(y_test_ohe,y_pred, average='micro'))

### Word2Vec

In [None]:
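# note: MultinomialNB requires non-negative features, while the standardized Word2Vec features contain negative values
mnb_w2v = MNB(nb_param_grid, w2v_feature_train, y_train_w2v)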

In [None]:
y_pred = mnb_w2v.predict(w2v_feature_test)
print(f1_score(y_test_w2v, y_pred, average='micro'))

### ELMo

In [None]:
mnb_elmo = MNB(nb_param_grid, elmo_feature_train, y_train_ohe)

In [None]:
y_pred = mnb_elmo.predict(elmo_feature_test)
print(f1_score(y_test_ohe, y_pred, average='micro'))

## SVM Classifier

In [None]:
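# parameters of the wrapped SVC need the 'estimator__' prefix;
# 'gamma' has no effect with the linear kernel used by the SVM class
svm_param_grid = [{'estimator__C': [0.1,1, 10, 100],
                   'estimator__gamma': [1,0.1,0.01,0.001]}
                  ]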

### TF-IDF

In [None]:
svm_tfidf = SVM(svm_param_grid, tf_feature_train, y_train_ohe)

In [None]:
y_pred = svm_tfidf.predict(tf_feature_test)
print(f1_score(y_test_ohe,y_pred,average='micro'))

### Word2Vec

In [None]:
svm_w2v = SVM(svm_param_grid, w2v_feature_train, y_train_w2v)

In [None]:
y_pred = svm_w2v.predict(w2v_feature_test)
print(f1_score(y_test_w2v,y_pred,average='micro'))

### ELMo

In [None]:
svm_elmo = SVM(svm_param_grid, elmo_feature_train, y_train_ohe)

In [None]:
y_pred = svm_elmo.predict(elmo_feature_test)
print(f1_score(y_test_ohe,y_pred,average='micro'))