# Suppose that we have just been hired as data scientists to help improve online customer support at a leading airline. In our first meeting, we are asked to summarize the support requests arriving on the Twitter social media platform. The manager of the help desk team, in particular, would like a summary of the main Twitter topics that require a reply via Twitter. Even though tweets are short, the dataset is far too large to read manually. We need a Natural Language Processing (NLP) technique that helps us meet the deadline, and topic modelling is such a technique.

# There are several topic modelling algorithms that assign topics and their weights to documents (here, each individual tweet is a document). The result is called the topic vector or semantic vector of a document, and its dimensionality equals the number of topics we choose. We will use different topic modelling approaches on different datasets to convert tweets into topic vectors, such as NMF (Non-Negative Matrix Factorization), LDA (Latent Dirichlet Allocation) and BERT (Bidirectional Encoder Representations from Transformers).

# Topic modeling algorithms provide a way to find the main topics that are discussed in a text corpus. Topic modeling is an unsupervised technique that only requires the text corpus and is able to derive the topics without any manually labeled data.

### Let's first download the dataset from Kaggle. 

### To learn more about the dataset, we can visit the following URL:
https://www.kaggle.com/crowdflower/twitter-airline-sentiment

In [None]:
cd /content/drive/MyDrive

In [None]:
! unzip ./archive.zip

In [None]:
! pip install octis

In [None]:
! pip install delayed

In [None]:
! pip install tweet-preprocessor

In [None]:
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from octis.evaluation_metrics.diversity_metrics import TopicDiversity
from octis.evaluation_metrics.coherence_metrics import Coherence
import spacy
import en_core_web_sm
from spacy.lang.en.stop_words import STOP_WORDS
import preprocessor as tweet_processor

### Although we have used a combination of NLTK and spaCy until now, here we will use only spaCy for text preprocessing.

In [None]:
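# The 'en' shortcut is no longer supported in spaCy 3, so load the model by its full name
spacy_nlp = spacy.load('en_core_web_sm')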

# Add custom stopwords to Spacy
spacy_nlp.Defaults.stop_words |= {"...","---","makemeastopword"}

# Define a function called tokenizer that performs custom tokenization using spaCy
def tokenizer(single_tweet):

    #Let's clean the tweet using tweet_processor
    single_cleaned_tweet = tweet_processor.clean(single_tweet)

    spacy_tweet = spacy_nlp(single_cleaned_tweet)

    # Remove stop words, punctuation and numbers
    # Return the lemma of each word in lowercase, with extra spaces removed
    tokens = [word.lemma_.lower().strip() for word in spacy_tweet if (not word.is_stop and not word.is_punct and not word.like_num)]

    return tokens

### Let's read the Twitter Airline Sentiment dataset downloaded from Kaggle. 

In [None]:
import pandas as pd

In [None]:
data = pd.read_csv('Tweets.csv')

In [None]:
airline_data = data[data['airline']=='US Airways']

airline_tweets = airline_data['text'].tolist()

In [None]:
airline_data.head()

### Non-Negative Matrix Factorization (NMF) is a matrix factorization technique similar to Singular Value Decomposition (SVD). In SVD, a matrix is factorized into three matrices ($U$, $\Sigma$ and $V^T$), whereas in NMF, a matrix is factorized into two matrices, that is:

### \begin{equation}
X = H\quad\bullet\quad W
\end{equation}

### "Non-negative" here means that every entry of the matrices $H$ and $W$ is greater than or equal to zero.

### Where, $X$ = Data Matrix of shape (Number of tweets, Number of words in vocabulary). It can be any kind of matrix such as TF, TF-IDF, Co-occurrence, PMI or PPMI matrix. 

### $H$ = Document-Topic matrix of shape (Number of tweets or documents, Number of topics, which we have to choose). Each row vector in this matrix is the topic vector representation of one tweet or document, just as a TF-IDF row is a vector representation of a tweet or document. The difference is that a topic vector is a dense, low-dimensional representation that groups related words together, whereas a TF-IDF vector is purely frequency based, with one dimension per vocabulary word.

### $W$ = Topic-Vocabulary matrix of shape (Number of topics, Number of words in vocabulary). Each column vector in this matrix is a topic or semantic vector of a word present in a vocabulary. 

### The number of topics (that is, the dimensionality of the topic vectors) has to be chosen by us. A higher dimensionality lets the factorization capture finer distinctions, so it acts like the resolution of the semantic representation of a tweet (document) or a word. Higher resolution is not automatically better, though: with too many topics, the topics can become redundant or hard to interpret, which is why we compare several values using coherence and diversity scores.
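
### To make the shapes concrete, here is a minimal NumPy sketch. The toy sizes (4 tweets, 6 vocabulary words, 2 topics) and the random factor values are assumptions chosen only for illustration; they are not taken from the airline dataset.

```python
import numpy as np

# Hypothetical toy sizes: 4 tweets, 6 vocabulary words, 2 topics.
n_tweets, n_words, n_topics = 4, 6, 2

rng = np.random.default_rng(0)
H = rng.random((n_tweets, n_topics))  # document-topic matrix, entries in [0, 1)
W = rng.random((n_topics, n_words))   # topic-vocabulary matrix, entries in [0, 1)

X = H @ W                             # data matrix built from the two factors

print(X.shape)   # shape of the data matrix: (tweets, words)
print(H[0])      # topic vector of the first tweet (one row of H)
print(W[:, 0])   # topic vector of the first vocabulary word (one column of W)
```

### Because both factors are non-negative, every entry of their product is non-negative as well, which is why $X$ itself must be a non-negative matrix such as a TF or TF-IDF matrix.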

### Let's put the calculated matrices into the format expected by the OCTIS library.

In [None]:
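def format_NMF_output(nmf_H, nmf_W, vocab, number_top_words):
  # Note: the argument names follow scikit-learn's convention, so nmf_H is the
  # topic-word matrix (components_) and nmf_W is the document-topic matrix.
  topics = []
  for topic_idx, topic in enumerate(nmf_H):
    # Indices of the number_top_words largest weights, in descending order
    word_list = [vocab[i] for i in topic.argsort()[:-(number_top_words + 1):-1]]
    topics.append(word_list)

  # Dictionary layout used by the OCTIS evaluation metrics
  octis_topic_dict = {}
  octis_topic_dict["topic-word-matrix"] = nmf_H
  octis_topic_dict["topics"] = topics
  octis_topic_dict["topic-document-matrix"] = np.array(nmf_W).transpose()
  return octis_topic_dict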
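
### The slice `argsort()[:-(number_top_words + 1):-1]` used above can look cryptic. The small sketch below shows what it does on one topic; the vocabulary and the weights are made-up values used only for illustration.

```python
import numpy as np

vocab = ["delay", "bag", "seat", "refund", "gate"]    # hypothetical vocabulary
topic_weights = np.array([0.1, 0.7, 0.0, 0.4, 0.2])   # hypothetical weights of one topic
number_top_words = 3

# argsort() sorts ascending; the reversed slice keeps the last three indices,
# that is, the three largest weights in descending order.
top_indices = topic_weights.argsort()[:-(number_top_words + 1):-1]
top_words = [vocab[i] for i in top_indices]

print(top_words)  # ['bag', 'refund', 'gate']
```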

In [None]:
def prep_dataset_for_octis(tweets):
  dataset_for_octis = []
  #Add your code here
  return dataset_for_octis
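
### One possible way to complete the function above is sketched here. It assumes that the OCTIS coherence metric expects the corpus as a list of token lists (one list per document). The name `prep_dataset_sketch` and the whitespace tokenizer are hypothetical stand-ins so that the example runs on its own.

```python
def prep_dataset_sketch(tweets, tokenize=str.split):
    """Return one list of tokens per tweet, skipping tweets with no tokens."""
    dataset = []
    for tweet in tweets:
        tokens = [token for token in tokenize(tweet) if token]
        if tokens:  # drop tweets that are empty after tokenization
            dataset.append(tokens)
    return dataset

example = prep_dataset_sketch(["late flight again", "", "lost my bag"])
print(example)  # [['late', 'flight', 'again'], ['lost', 'my', 'bag']]
```

### In this notebook, we could pass our own `tokenizer` function as the `tokenize` argument, so that the coherence score is computed on the same tokens that the topic model sees.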

In [None]:
number_topics_list = [64, 128, 256] # The number of topics (the dimensionality) has to be chosen by us, so we
# evaluate the quality of the topic vectors for several dimensionalities, from 64 to 256
number_top_words = 100 #Set number of top words - will be used by Diversity and display_topics() method
number_top_documents = 100 #Set number of top documents - will be used by the display_topics() method

additional_stop_words = ['flight', '..', '...', 'help', 'thank', 'thanks','great', 'need', 'response', 'today']
stop_words = list(STOP_WORDS) + additional_stop_words

npmi = Coherence(texts=prep_dataset_for_octis(airline_tweets), topk=number_top_words, measure='c_npmi')
topic_diversity = TopicDiversity(topk=number_top_words)

max_iter = 500 # Maximum number of solver iterations for NMF (discussed further below)

#Add your code here

### Let's now apply NMF to our document-term matrix $X$ (stored in the variable `term_document_matrix`) for different dimensionalities of the topic vectors (64, 128, 256). This means that in the matrix factorization:

### \begin{equation}
X = H \quad \bullet \quad W
\end{equation}

### for topic vector dimensionalities of 64, 128 and 256, and assuming a vocabulary of 2000 words, the matrices $H$ and $W$ will have the shapes:

### $H$ : (Number of Documents, 64), $W$ : $(64, 2000)$
### $H$ : (Number of Documents, 128), $W$ : $(128, 2000)$
### $H$ : (Number of Documents, 256), $W$ : $(256, 2000)$

### So how does NMF find these two matrices for a given dimensionality of the topic vectors? NMF looks for two non-negative matrices $H$ and $W$ such that the reconstruction error

### \begin{equation}
||X - H\bullet W||_F
\end{equation}

### is minimized, where $||\cdot||_F$ is the Frobenius norm (the default loss in scikit-learn). The two matrices are first initialized (the code below uses `init='nndsvd'`, an SVD-based initialization, rather than random values) and then updated iteratively so that the error decreases. Thinking in terms of gradient descent is a useful intuition: the updates are driven by how $||X - H\bullet W||_F$ changes with respect to $H$ and $W$. Note, however, that scikit-learn's default solver is coordinate descent rather than plain gradient descent. The value `max_iter = 500` that we set earlier is the maximum number of iterations the solver may run; it can stop earlier if it converges.
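
### To illustrate the idea of iteratively updating $H$ and $W$, here is a minimal NumPy sketch that uses the classic multiplicative update rules for the Frobenius loss. This is a different algorithm from the solver that scikit-learn uses by default, and the function name `nmf_multiplicative`, the toy matrix and all parameter values are assumptions made only for illustration.

```python
import numpy as np

def nmf_multiplicative(X, n_topics, n_iter=200, eps=1e-10, seed=0):
    """Factorize X into H (docs x topics) and W (topics x words), both non-negative."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = X.shape
    H = rng.random((n_docs, n_topics)) + eps   # random non-negative start
    W = rng.random((n_topics, n_words)) + eps
    errors = []
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (X @ W.T) / (H @ W @ W.T + eps)
        W *= (H.T @ X) / (H.T @ H @ W + eps)
        errors.append(np.linalg.norm(X - H @ W))  # Frobenius norm of the residual
    return H, W, errors

# Toy non-negative matrix with two clear groups of rows and columns.
X = np.array([[1.0, 1.0, 0.0, 0.0],
              [2.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 3.0, 1.0],
              [0.0, 0.0, 6.0, 2.0]])

H, W, errors = nmf_multiplicative(X, n_topics=2)
print(errors[0], errors[-1])  # error after the first and after the last iteration
```

### Printing the two errors shows how much the reconstruction improves over the iterations, which is the same quantity that `reconstruction_err_` reports for the scikit-learn model.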

In [None]:
reconstruction_error_list = []
number_of_iterations_list = []
npmi_coherence_score_list = []
diversity_score_list = []

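for number_topics in number_topics_list:

  current_number_topics = number_topics
  nmf_model = NMF(n_components=current_number_topics, init='nndsvd', max_iter=max_iter).fit(term_document_matrix)
  # scikit-learn naming: nmf_W is the document-topic matrix and nmf_H is the topic-word matrix
  # (these are the roles written as H and W, respectively, in the text above)
  nmf_W = nmf_model.transform(term_document_matrix)
  nmf_H = nmf_model.components_

  # Reconstruction error and the number of iterations that were actually run
  reconstruction_error_list.append(nmf_model.reconstruction_err_)
  number_of_iterations_list.append(nmf_model.n_iter_)

  octis_topic_dict = format_NMF_output(nmf_H, nmf_W, vocab, number_top_words)

  # Topic quality: NPMI coherence and topic diversity
  npmi_coherence_score_list.append(npmi.score(octis_topic_dict))
  diversity_score_list.append(topic_diversity.score(octis_topic_dict))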


print("reconstruction_error_list", reconstruction_error_list)
print("number_of_iterations_list", number_of_iterations_list)
print("npmi_coherence_score_list", npmi_coherence_score_list)
print("diversity_score_list", diversity_score_list)
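
### To interpret the diversity scores printed above, it helps to see what the metric measures. The sketch below assumes that topic diversity is the fraction of unique words among the top words of all topics; the function name `topic_diversity_sketch` and the two toy topics are hypothetical.

```python
def topic_diversity_sketch(topics, topk):
    """Fraction of unique words among the top-k words of all topics."""
    unique_words = set()
    for topic in topics:
        unique_words.update(topic[:topk])
    return len(unique_words) / (topk * len(topics))

# Two toy topics that share the word "bag": 5 unique words out of 6.
topics = [["bag", "lost", "claim"], ["gate", "delay", "bag"]]
score = topic_diversity_sketch(topics, topk=3)
print(score)
```

### A score close to 1 means that the topics use mostly different words, while a score close to 0 means that the topics largely repeat the same words.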