# Topic Modeling

#### The first thing we need to do, as always, is to import the correct libraries. We will not be using all of these libraries while extracting the data, but we will need to use them all eventually. 

In [1]:
import pandas as pd 
import numpy as np
from numpy import linalg as LA
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from datetime import *
from dateutil.relativedelta import *
from sklearn.preprocessing import normalize
from time import time
from scipy.spatial import distance
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction import text 
import boto3

#### We can now access our data by placing the desired CSV file in an S3 bucket on AWS and downloading it from that bucket. We then read the file and save it as a dataframe.

In [2]:
bucket = 'data-science-tutorials'
key = 'topic_modeling.csv'

s3 = boto3.resource('s3')

s3.Bucket(bucket).download_file(key,key)

df = pd.read_csv('./topic_modeling.csv')


#### So what exactly is topic modeling? Topic modeling is the process of discovering the abstract "topics" that occur in a collection of documents. TF-IDF, used as a weighting factor in topic modeling, stands for term frequency-inverse document frequency. It is a statistic that reflects how important a word is to a document in a corpus: a word scores highly when it appears often in one document but rarely in the rest. These weights let a model focus on the words that best distinguish one document from another. In this particular example, we will be applying an NMF model to our TF-IDF features. NMF (non-negative matrix factorization), included in the scikit-learn library, is a linear-algebra technique used for topic modeling: it factors the document-term matrix into two smaller non-negative matrices, one scoring each document on each topic and one weighting each word within each topic.
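
#### To make the TF-IDF weighting concrete, here is a minimal sketch on a toy corpus. The three documents and the names *toy_docs*, *toy_vectorizer* and *idf_by_term* are made up for illustration and are not part of the call data. It only shows that a word found in every document gets a lower inverse-document-frequency weight than a word found in a single document.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "call" appears in every document, "refund" in only one.
toy_docs = [
    "call about billing",
    "call about refund",
    "call about billing support",
]

toy_vectorizer = TfidfVectorizer()
toy_tfidf = toy_vectorizer.fit_transform(toy_docs)

# idf_ holds one inverse-document-frequency weight per vocabulary term;
# vocabulary_ maps each term to its column index.
idf_by_term = {term: toy_vectorizer.idf_[col]
               for term, col in toy_vectorizer.vocabulary_.items()}

# Print terms from least to most distinctive.
for term in sorted(idf_by_term, key=idf_by_term.get):
    print(f"{term:10s} idf = {idf_by_term[term]:.3f}")
```

#### The words that appear in all three documents ("call", "about") are printed first with the lowest weight, and "refund" and "support", which each appear in only one document, are printed last.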

#### First, we can create a variable *n_top_words* that specifies the number of 'top' words we will be printing out for each topic. We also have the variable *no_topics* to specify the number of topics. In our case, we will be configuring 5 topics each with 5 top words. 

In [3]:
n_top_words = 5
no_topics = 5

#### Next, we will create a function *get_topics_df* that builds a dictionary holding a label for each topic and the top words for that topic, and converts it to a dataframe. Once it is in a dataframe, we can easily drop duplicate topics.

In [4]:
def get_topics_df(model, feature_names, n_top_words):
    topic_index = []
    topics = []
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        topic_list = ([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        topic_list = " ".join(topic_list)
        topic_index.append(message)
        topics.append(topic_list)
    d = {'topic_index': topic_index, 'topics': topics}
    df = pd.DataFrame(data=d)
    df.drop_duplicates(subset='topics', inplace=True)
    return df
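
#### The slice *argsort()[:-n_top_words - 1:-1]* used above can be hard to read, so here is a small sketch of what it does. The weights and the names *toy_feature_names*, *toy_topic* and *toy_n_top* are hypothetical.

```python
import numpy as np

# Hypothetical weights for one topic over a six-word vocabulary.
toy_feature_names = ["account", "dial", "leave", "phone", "record", "tone"]
toy_topic = np.array([0.9, 0.1, 0.0, 0.7, 0.3, 0.2])
toy_n_top = 3

# argsort() returns indices ordered from smallest to largest weight; the slice
# [:-toy_n_top - 1:-1] walks backwards from the end and stops after toy_n_top items,
# so it yields the indices of the largest weights, biggest first.
top_indices = toy_topic.argsort()[:-toy_n_top - 1:-1]
top_words = [toy_feature_names[i] for i in top_indices]
print(top_words)  # ['account', 'phone', 'record']
```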

#### For the vectorizer, there are many parameters we can adjust to improve the quality of the model. One is *stop_words*, a list of words to filter out because they carry little meaning, such as “the”, “a”, “an”, and “in”. We can add our own stop words to exclude words that provide no insight into what the calls are about, such as "hello", "yes", and "bye". Another parameter, *max_features*, caps the size of the vocabulary; it is not set in the cell below. *min_df* and *max_df* set the cutoffs for which words are kept. For example, a *min_df* of .05 means that a word must appear in at least 5 percent of the calls to be kept, and a *max_df* of .7 means that any word that appears in more than 70 percent of the calls is dropped. This is necessary because a word that appears too rarely or too often does little to distinguish one topic from another. In this case we use a *min_df* of .02 and a *max_df* of .30; a low *max_df* makes it a lot easier to extract more unique, meaningful words. Next we fit the vectorizer to our transcript data and transform the transcripts, which tokenizes each transcript and turns its word counts into TF-IDF weights.

In [5]:
print("Extracting tf-idf features for NMF...")
my_added_stop_words = ['yeah','yes','hey','hi','good','bye','like',]
stop_words = text.ENGLISH_STOP_WORDS.union(my_added_stop_words)
tfidf_vectorizer = TfidfVectorizer( max_df=0.30, min_df=0.02, stop_words = stop_words)
t0 = time()
tfidf = tfidf_vectorizer.fit(df['transcript'])
tfidf_vector = tfidf.transform(df['transcript'])
print("done in %0.3fs." % (time() - t0), "\n")


Extracting tf-idf features for NMF...
done in 0.560s. 
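
#### As a rough illustration of how the *min_df* and *max_df* cutoffs prune the vocabulary, here is a sketch on four made-up "calls". The corpus and the names *toy_calls*, *toy_vectorizer* and *kept_words* are assumptions for this example only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus of four short "calls".
toy_calls = [
    "please leave a message",
    "please dial the extension",
    "please leave your number",
    "please hold for billing",
]

# Keep words found in at least 2 documents (min_df=2, an absolute count)
# and in at most 75% of documents (max_df=0.75, a proportion).
toy_vectorizer = TfidfVectorizer(min_df=2, max_df=0.75)
toy_vectorizer.fit(toy_calls)

kept_words = sorted(toy_vectorizer.vocabulary_)
print(kept_words)  # ['leave']
```

#### "please" is dropped because it appears in every call, and every other word except "leave" is dropped because it appears in only one call.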



#### We can now fit the NMF model to the TF-IDF features. Here, the NMF parameters are set so that the model minimizes the Frobenius norm of the reconstruction error, which is the default loss. We fit the model to the TF-IDF matrix produced by the vectorizer. We then call the vectorizer's *get_feature_names* method (renamed *get_feature_names_out* in scikit-learn 1.0), which returns all of the words in the vectorizer's vocabulary. Finally, we pass the NMF model, the feature names, and n_top_words to the get_topics_df function, which returns a dataframe of the topics and the top words for each topic.

In [6]:
# Fit the NMF model with Frobenius norm with tf-idf features
print("Fitting the NMF model (Frobenius norm) with tf-idf features")
t0 = time()
nmf_F = NMF(n_components=no_topics, random_state=1, alpha=.1, l1_ratio=.7, init='nndsvd').fit(tfidf_vector)
print("done in %0.3fs." % (time() - t0))

print("\nTopics in NMF model (Frobenius norm):")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
topics_df = get_topics_df(nmf_F, tfidf_feature_names, n_top_words)
print(topics_df,"\n")

Fitting the NMF model (Frobenius norm) with tf-idf features
done in 0.471s.

Topics in NMF model (Frobenius norm):
  topic_index                                          topics
0  Topic #0:                    account phone alright don let
1  Topic #1:                  seven record tone finished hang
2  Topic #2:                extension dial party person enter
3  Topic #3:               leave reached soon possible return
4  Topic #4:   representative speak customer support existing 



#### As you can see, the words for each topic are not extremely cohesive, but the topics can generally be distinguished as categories relating to voicemail, help, sales, marketing, etc. By raising the number of calls we use from 1,000 to 100,000, we should see an improvement, but it would take much longer to run. We could also improve the top words by adjusting the parameters.

#### Now that we have a qualitative view of our topics, we can work on a quantitative one. We can measure the topic scores for each call, as well as how similar any two calls are, using cosine similarity. Transforming the TF-IDF matrix with the fitted NMF model gives, for each call, a vector with one score per topic. Instead of just printing out these vectors, we can write code that compares the scores of the first call to the scores of every other call.
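
#### To show what "one score per topic" looks like, here is a minimal sketch that factors a small hand-made matrix. The matrix and the names *toy_matrix*, *toy_nmf*, *doc_topic* and *topic_term* are hypothetical, and the NMF settings are kept close to the defaults rather than copied from the model above.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical non-negative document-term matrix: 4 documents, 6 terms.
# Documents 0-1 use the first three terms, documents 2-3 the last three.
toy_matrix = np.array([
    [1.0, 0.8, 0.6, 0.0, 0.0, 0.0],
    [0.9, 1.0, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.7, 1.0, 0.9],
    [0.0, 0.0, 0.0, 1.0, 0.6, 0.8],
])

toy_nmf = NMF(n_components=2, init="nndsvd", random_state=1, max_iter=500)
doc_topic = toy_nmf.fit_transform(toy_matrix)  # one row of topic scores per document
topic_term = toy_nmf.components_               # one row of term weights per topic

print(doc_topic.shape, topic_term.shape)  # (4, 2) (2, 6)
print(np.round(doc_topic, 2))
```

#### Each row of *doc_topic* plays the role of the per-call score vector described above: documents 0 and 1 score highest on one topic, and documents 2 and 3 on the other.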

#### In order to do this, we first set up call 1 so that we have a call to compare the rest of the calls to. Call 1 needs to be in the form of a vector rather than a transcript, so we vectorize its transcript with the fitted vectorizer and then transform that vector with the fitted NMF model to get its topic scores. We also create an empty dataframe to hold the call ID and the cosine similarity for each call we compare call 1 to. To compare call 1 to all other calls, we will loop over the calls, vectorize each transcript, and apply the NMF model, the same as we did for call 1. Then we can use the sklearn function *cosine_similarity* to score each call against call 1.

In [7]:
#refit the model (it was already fitted in the previous cell, so this step is optional)
nmf_F.fit(tfidf_vector)

#compare the topic scores of the first call to those of the other calls using cosine similarity
#create a dataframe with the call ID and the cosine similarity for each call
#(the next cell fills it and sorts it in descending order by cosine similarity)
call1 = df['transcript'][0]
vect_0 = tfidf.transform([call1]) #vectorize transcript
call1 = nmf_F.transform(vect_0) #topic scores for call 1
F_df = pd.DataFrame(columns=['sid','cos_sim'])


#### Transforming each call's TF-IDF vector with the fitted NMF model gives a vector of 5 numbers, one score for each topic. With these vectors we can compute the 'cosine similarity' between calls. Cosine similarity is the cosine of the angle between two vectors. Here each vector holds the topic scores for one call, so we can loop through the calls and compare call 1 to each of the others. Because the scores are non-negative, the result lies between 0 and 1, and the closer it is to 1, the more similar the two calls are.

In [8]:
for index,row in df.iterrows():
    vect = tfidf.transform([row['transcript']]) #vectorize transcript
    current_call = nmf_F.transform(vect) #topic scores for a single transcript
    cos_sim = cosine_similarity(call1, current_call) #1x1 matrix holding the cosine similarity
    sid = row['sid']
    cos_sim = float(cos_sim[0, 0])
    #DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    F_df = pd.concat([F_df, pd.DataFrame({'sid': [sid], 'cos_sim': [cos_sim]})])
F_df = F_df.sort_values(by='cos_sim', ascending=False)
F_df.to_csv('./Ftest.csv')
print(F_df.head(),"\n")

                sid   cos_sim
0  180727c0b478f3a1  1.000000
0  1807302890ecfa47  0.999905
0  180724d9e2665237  0.999892
0  180726b27f0fa518  0.999687
0  1807273536a761f3  0.999659 
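
#### For readers who want to see the calculation itself, here is a small sketch of cosine similarity on made-up topic-score vectors. The vectors and the helper *cosine_by_hand* are hypothetical. It computes the value by hand and then with *cosine_similarity* on several rows at once.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical topic-score vectors for three calls (one score per topic).
call_a = np.array([[0.8, 0.1, 0.0]])
call_b = np.array([[0.4, 0.05, 0.0]])  # same direction as call_a, half the magnitude
call_c = np.array([[0.0, 0.0, 0.9]])   # shares no topics with call_a

def cosine_by_hand(u, v):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_ab = cosine_by_hand(call_a[0], call_b[0])  # close to 1: same mix of topics
sim_ac = cosine_by_hand(call_a[0], call_c[0])  # 0: no topics in common

# cosine_similarity accepts 2-D arrays, so one call can be compared to many at once.
all_sims = cosine_similarity(call_a, np.vstack([call_a, call_b, call_c]))[0]
print(sim_ab, sim_ac, all_sims)
```

#### Because *cosine_similarity* takes whole matrices, the row-by-row loop above could probably be replaced by transforming the full TF-IDF matrix once and comparing call 1 against every row in a single call. That is a design option rather than something the notebook does.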



#### We can now repeat all of the same steps with a second NMF model that uses different parameters. In this case, we set the parameters so that the model minimizes the generalized Kullback-Leibler divergence instead of the Frobenius norm.

In [9]:
# Fit the NMF model with Kullback-Leibler divergence
print("Fitting the NMF model (generalized Kullback-Leibler divergence)")
t0 = time()
nmf_KL = NMF(n_components=no_topics, random_state=1,
          beta_loss='kullback-leibler', solver='mu', max_iter=1000, alpha=0,
          l1_ratio=.5).fit(tfidf_vector)
print("done in %0.3fs." % (time() - t0))

print("\nTopics in NMF model (generalized Kullback-Leibler divergence):")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
topics_df = get_topics_df(nmf_KL, tfidf_feature_names, n_top_words)
print(topics_df)

#refit the model (already fitted above, so this step is optional)
nmf_KL.fit(tfidf_vector)

#compare the topic scores of the first call to those of the other calls using cosine similarity
#create a dataframe with the call ID and the cosine similarity for each call
#sort in descending order by cosine similarity

call1 = df['transcript'][0]
vect_0 = tfidf.transform([call1]) #vectorize transcript
call1 = nmf_KL.transform(vect_0) #topic scores for call 1
KL_df = pd.DataFrame(columns=['sid','cos_sim'])
for index,row in df.iterrows():
    vect = tfidf.transform([row['transcript']]) #vectorize transcript
    current_call = nmf_KL.transform(vect) #topic scores for a single transcript
    cos_sim = cosine_similarity(call1, current_call) #1x1 matrix holding the cosine similarity
    sid = row['sid']
    cos_sim = float(cos_sim[0, 0])
    #DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    KL_df = pd.concat([KL_df, pd.DataFrame({'sid': [sid], 'cos_sim': [cos_sim]})])
KL_df = KL_df.sort_values(by='cos_sim', ascending=False)
KL_df.to_csv('./KLtest.csv')
print(KL_df.head())

Fitting the NMF model (generalized Kullback-Leibler divergence)
done in 0.964s.

Topics in NMF model (generalized Kullback-Leibler divergence):
  topic_index                                 topics
0  Topic #0:           alright mail thanks phone don
1  Topic #1:       tone available record seven pound
2  Topic #2:        extension party dial enter reach
3  Topic #3:      leave reached soon marketing right
4  Topic #4:   recorded speak maybe quality customer


                sid   cos_sim
0  180727c0b478f3a1  1.000000
0  180725e9dc42a489  0.999895
0  180725477e9fbdee  0.999780
0  180726f4df694ecd  0.999689
0  180725fc2ef52735  0.999614
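
#### The two models above differ in the loss they minimize. As a rough sketch of the two objectives as they are commonly defined, the functions below compute each loss for a small made-up factorization. The matrices, the function names and the *eps* guard are all assumptions for illustration. This is not scikit-learn's internal code.

```python
import numpy as np

def frobenius_loss(X, W, H):
    """Half the squared Frobenius norm of the reconstruction error X - WH."""
    return 0.5 * np.sum((X - W @ H) ** 2)

def generalized_kl_loss(X, W, H, eps=1e-12):
    """Generalized Kullback-Leibler divergence between X and WH.

    eps is a hypothetical small constant that guards against division by
    zero; entries where X is zero contribute only their (WH) value.
    """
    WH = W @ H
    ratio = np.where(X > 0, X / (WH + eps), 1.0)
    return np.sum(X * np.log(ratio) - X + WH)

# Hypothetical factors and a matrix they reproduce exactly.
W = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]])
H = np.array([[0.2, 0.4, 0.0], [0.0, 0.1, 0.3]])
X_exact = W @ H
X_noisy = X_exact + 0.1

# Both losses are (numerically) zero for a perfect reconstruction...
print(frobenius_loss(X_exact, W, H), generalized_kl_loss(X_exact, W, H))
# ...and positive once the data no longer matches WH.
print(frobenius_loss(X_noisy, W, H), generalized_kl_loss(X_noisy, W, H))
```

#### The Frobenius loss penalizes squared differences equally everywhere, while the Kullback-Leibler loss penalizes ratios, so the two can rank the same reconstruction differently. That is one reason the top words differ between the two models.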


#### You can open the files KLtest.csv and Ftest.csv to see the cosine similarities for each model.

#### Overall, when choosing the number of topics and evaluating the interpretability of your topic model, it is important to look at both qualitative and quantitative factors. The cosine similarities are just one quantitative measure of topic modeling. In a Dialogtech hypothetical use case of topic modeling on phone call transcripts, it’s quantitatively useful to have topic categories to classify a call and understand general trends. It’s also qualitatively useful to drill into those topics and understand the nuances of each caller's individual requests.

