# Delivery n°5 : Topic Modelling

*Mathematics and Big Data - Mathias Lommel*

In this 5th delivery, we will apply the topic modelling notions seen in class to Quora Questions & Answers, using different Python packages, such as scikit-learn and NLTK.

## Library importations

As always, we have to import a few libraries that will be important for our work.

In [1]:
import pandas as pd
import wordcloud

import nltk
nltk.download('vader_lexicon')
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk.corpus import movie_reviews

import string
import re

import numpy as np

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


## Definition of the functions

Now, let's define the functions that we will use to solve our problem.

### Reading file

As before, I made this delivery on Google Colab. I have therefore written 2 different functions to read the .csv file: one for Colab with Google Drive, and one for a file stored locally.

In [2]:
# Using google Colab
def open_document_with_Drive(path):
  """
  This function reads a csv file from Drive.

    Input :
        path : string - path of the file
    Output :
       data  : pd.DataFrame - extracted data
  """
  from google.colab import drive
  drive.mount('/content/drive')

  data = pd.read_csv(path)

  return data

In [3]:
# If the document is stored locally on the computer
def open_document(path):
  """
  This function reads a csv file stored locally.

    Input :
        path : string - path of the file
    Output :
        data  : pd.DataFrame - extracted data
  """
  data = pd.read_csv(path)

  return data

### Pre-processing

As we have done in the previous deliveries, we have to pre-process our data.

Here, we will re-use the functions defined for the 4th delivery. In other words, we create 2 functions, one to preprocess one question (one line of the dataframe), and another one, that uses this first function to preprocess the whole database.

In [4]:
# Function that cleans the data
def preprocess_line(data):
  """
  This function preprocesses a text.

    Input :
        data : string - question to preprocess
    Output :
        data : string - preprocessed question
  """
  # Change to lower case
  data = data.lower()

  # Remove URLs (http and https), up to the end of the line
  data = re.sub(r"https?://.*[\r\n]*", "", data)

  # Remove emails (the text is already in lower case at this point)
  data = re.sub(r'\b[a-z0-9._-]+@[a-z0-9.-]+\.[a-z]{2,}\b', '', data)

  # Remove mentions
  data = re.sub(r"@\S+", "", data)

  # Remove punctuations, commas and special characters
  punctuation = string.punctuation
  translation_table = str.maketrans('', '', punctuation)

  data = data.translate(translation_table)

  # Remove numbers
  data = re.sub(r'\d+', '', data)

  return data

def preprocess_data(data):
  """
  This function preprocesses the whole database.

    Input :
        data          : pd.DataFrame - database to preprocess
    Output :
        cleaned_data  : pd.DataFrame - preprocessed database
  """
  # We apply the preprocessing function to each question
  cleaned_reviews = data['Question'].apply(preprocess_line)

  # We build a new dataframe, with preprocessed questions
  cleaned_data = pd.DataFrame({'Question' : cleaned_reviews})

  print("Pre-processing successfully computed.")

  return cleaned_data
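
To see what these cleaning steps do in practice, here is a minimal, self-contained sketch. The helper `clean_question` and the two sample questions are hypothetical: the function only restates the same steps in a compact form, and is not a replacement for the functions above.

```python
import re
import string

def clean_question(text):
    """Compact, illustrative version of the cleaning steps (hypothetical helper)."""
    text = text.lower()                                                   # lower case
    text = re.sub(r"https?://.*[\r\n]*", "", text)                        # URLs
    text = re.sub(r"\b[a-z0-9._-]+@[a-z0-9.-]+\.[a-z]{2,}\b", "", text)   # emails
    text = re.sub(r"@\S+", "", text)                                      # mentions
    text = text.translate(str.maketrans("", "", string.punctuation))      # punctuation
    text = re.sub(r"\d+", "", text)                                       # numbers
    return text

# Made-up sample questions
cleaned_1 = clean_question("What's the BEST way to learn Python in 2024?")
cleaned_2 = clean_question("Is the X-wing CPU-bound?")

print(repr(cleaned_1))
print(repr(cleaned_2))
```

Removing punctuation glues hyphenated words together (`cpubound`, `xwing`), and removing numbers can leave extra spaces. The extra spaces are harmless, since the vectorizer used later tokenizes on word boundaries.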


## Application of our functions

Now, we are going to apply our functions to the dataset under study, in order to try to solve Quora's problems.

### Reading of the file

In [5]:
# Reading of the file
## With Google Colab
path = '/content/drive/My Drive/quora.csv'
data = open_document_with_Drive(path)

## Without Colab
#path = "path/of/the/file"
#data = open_document(path)

# Let's have a look at the dataframe created : quite simple, with an index column ('Unnamed: 0') and one column of questions
data.head()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Unnamed: 0,Question
0,What is the step by step guide to invest in sh...
1,What is the story of Kohinoor (Koh-i-Noor) Dia...
2,How can I increase the speed of my internet co...
3,Why am I mentally very lonely? How can I solve...
4,"Which one dissolve in water quikly sugar, salt..."
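
The `Unnamed: 0` column is the row index that was saved in the .csv file. One possible way to avoid it is the `index_col` parameter of `pd.read_csv`, sketched below on a small made-up CSV string (not the real file).

```python
import io
import pandas as pd

# Made-up CSV content with the same layout: an unnamed index column + 'Question'
csv_text = ",Question\n0,How can I learn Python?\n1,Why is the sky blue?\n"

# Default reading: the unnamed first column becomes 'Unnamed: 0'
raw = pd.read_csv(io.StringIO(csv_text))

# Using the first column as the index leaves a single 'Question' column
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(indexed.columns))
```

Either form works for the rest of the notebook, since only the `Question` column is used.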


We can then easily find the number of questions in the database.

In [6]:
print("There are", len(data), "questions asked in this database.")

There are 404289 questions asked in this database.


### Preprocessing

Here, we will apply our preprocessing function to the created dataframe.

In [7]:
cleaned_data = preprocess_data(data)

Pre-processing successfully computed.


### Document Term Matrix

We can now count the number of distinct words in the entire database.

This number gives the number of columns that the Document Term Matrix would have if every word were kept.
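
As a reminder of what this matrix looks like, here is a minimal sketch in plain Python on three made-up questions (hypothetical data, whitespace tokenization only): one row per document, one column per distinct word, and each entry counts the occurrences of the word in the document.

```python
from collections import Counter

# Made-up, already cleaned questions
docs = ["best phone best price", "learn english", "best way learn english"]

# One column per distinct word, in alphabetical order
vocabulary = sorted(set(" ".join(docs).split()))

# One row per document: number of occurrences of each word of the vocabulary
toy_matrix = [[Counter(doc.split())[term] for term in vocabulary] for doc in docs]

print(vocabulary)
for row in toy_matrix:
    print(row)
print("Shape:", len(toy_matrix), "x", len(vocabulary))
```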

In [8]:
# We concatenate the questions, and we convert it into a set (to avoid repetitions)
unique_words = set(' '.join(cleaned_data['Question']).split())

# We then convert it into a list, for convenience
unique_words = list(unique_words)
print("There are",len(unique_words),"unique terms in the entire database.")
print("Then, the dimension of the Document Term Matrix should be",len(data),"x",len(unique_words),".")

There are 76719 unique terms in the entire database.
Then, the dimension of the Document Term Matrix should be 404289 x 76719 .


What is interesting here is that, among the hundreds of thousands of words in the database, most are repeated from question to question. As a result, we obtain fewer than 80 000 unique words.

To reassure ourselves, we can print the first 30 words of the list, in order to verify that the words are not repeated.

In [9]:
print(unique_words[:30])

['overthrown', 'slinky', 'fbis', 'cpubound', 'trinkt', 'energyintensive', 'glider', 'tonk', 'injection', 'colou', 'timesthen', 'वह', 'ingest', 'sructure', 'björling', 'mourinho', 'belog', 'monastery', 'codenvy', 'unreadable', 'mieghem', 'davos', 'shloka', 'seethrough', 'ryerson', 'pilaris', 'laps', 'atlas', 'opiates', 'xwing']


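As this quick check shows, no word is repeated, as expected from a set, so we can be more confident in what we have done. We can also see a side effect of the pre-processing: tokens such as 'cpubound' or 'seethrough' probably come from hyphenated words whose punctuation has been removed.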

So now, we can compute the Document Term Matrix.

In [10]:
# We use CountVectorizer (raw term counts, not TF-IDF) to create the document term matrix
cv = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
dtm = cv.fit_transform(cleaned_data['Question'])

In [11]:
print("The shape of the document term matrix is :", dtm.shape[0],"x",dtm.shape[1],".")

The shape of the document term matrix is : 404289 x 39679 .


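As we can see, the document term matrix has far fewer columns than expected. Several filters explain this: `CountVectorizer` removes the English stop words, `min_df=2` discards every term that appears in fewer than 2 questions, `max_df=0.95` discards terms present in more than 95% of the questions, and the default tokenizer ignores single-character tokens. Since the stop word list only contains a few hundred words, most of the reduction probably comes from `min_df=2`, which removes rare words and typos.

This filtering is worthwhile, since it almost halves the number of columns of the matrix.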

### Number of words for each document

Now, using the Document Term Matrix, we can find the number of words (distinct terms kept by the vectorizer) for each question of our database.

In [12]:
# Boolean matrix: True where a term occurs at least once in a document
tab = dtm > 0

# Summing each row counts the distinct retained terms of each document
# (vectorized: no Python loop over the documents)
number_words = np.asarray(tab.sum(axis=1), dtype=float).reshape(-1, 1)

# Show the result
number_words_per_document = pd.DataFrame({'Number of words':number_words[:,0]})
print("Number of words for each document : ")
print(number_words_per_document)

Number of words for each document : 
        Number of words
0                   6.0
1                   3.0
2                   6.0
3                   3.0
4                   9.0
...                 ...
404284              6.0
404285              3.0
404286              1.0
404287              9.0
404288              3.0

[404289 rows x 1 columns]
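
Note that this count is the number of distinct retained terms per question, not the total number of tokens. The difference is sketched below on a small dense NumPy array standing in for the document term matrix (made-up values).

```python
import numpy as np

# Toy document term matrix: 3 documents x 6 terms (made-up counts)
toy_dtm = np.array([
    [2, 0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 1],
])

distinct_terms = (toy_dtm > 0).sum(axis=1)  # number of different terms per document
total_tokens = toy_dtm.sum(axis=1)          # occurrences, repetitions included

print(distinct_terms)
print(total_tokens)
```

The two counts differ only for documents that repeat a term, such as the first row here.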


### Topic modelling

Now, we can extract from our data the topics they are about.

Here, we will use *Latent Dirichlet Allocation* (LDA). This method models each document as a mixture of topics, and each topic as a distribution over words, both with Dirichlet priors.
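
For reference, the usual generative formulation of LDA can be written as follows. The notation ($K$ topics, $D$ documents, hyperparameters $\alpha$ and $\beta$) is the standard textbook one and is not taken from this notebook; in scikit-learn, the two priors correspond to the `doc_topic_prior` and `topic_word_prior` parameters.

$$
\begin{aligned}
\varphi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1,\dots,K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,D \\
z_{d,n} &\sim \mathrm{Categorical}(\theta_d) \\
w_{d,n} &\sim \mathrm{Categorical}(\varphi_{z_{d,n}})
\end{aligned}
$$

Here $\theta_d$ is the topic distribution of document $d$, $\varphi_k$ is the word distribution of topic $k$, $z_{d,n}$ is the topic assigned to the $n$-th word of document $d$, and $w_{d,n}$ is the word itself.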

In [13]:
# We apply the LDA model
LDA = LatentDirichletAllocation(n_components=7,random_state=42)
LDA.fit(dtm)

In [14]:
print("From the database of study, we have extracted",LDA.components_.shape[0],"different topics.")

From the database of study, we have extracted 7 different topics.
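
To make the shapes of the fitted objects concrete, here is a small sketch on a made-up count matrix (4 documents, 5 terms, 2 topics; all values are illustrative). It only checks shapes and normalization, not the quality of the topics.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Made-up document term matrix: 4 documents x 5 terms
toy_counts = np.array([
    [3, 2, 0, 0, 0],
    [2, 3, 1, 0, 0],
    [0, 0, 1, 3, 2],
    [0, 0, 0, 2, 3],
])

toy_lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topic = toy_lda.fit_transform(toy_counts)

print(toy_lda.components_.shape)  # (number of topics, number of terms)
print(doc_topic.shape)            # (number of documents, number of topics)
print(doc_topic.sum(axis=1))      # each row is a probability distribution
```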


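Then, for each topic, we can determine the 10 words with the largest weights. This step will help us to understand what the different topics are about. The words are printed in increasing order of weight, so the last word of each list is the most representative one.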

In [15]:
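# The vocabulary is the same for every topic, so we fetch it once
names = cv.get_feature_names_out()
for index,topic in enumerate(LDA.components_):
    print(f'THE TOP 10 WORDS FOR TOPIC #{index}')
    # argsort is ascending : the last 10 indices are the highest-weight words
    print([names[i] for i in topic.argsort()[-10:]])
    print('\n')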

THE TOP 10 WORDS FOR TOPIC #0
['free', 'buy', 'used', 'android', 'movies', 'facebook', 'number', 'use', 'phone', 'best']


THE TOP 10 WORDS FOR TOPIC #1
['war', 'does', 'question', 'donald', 'notes', 'questions', 'people', 'trump', 'world', 'quora']


THE TOP 10 WORDS FOR TOPIC #2
['read', 'difference', 'india', 'think', 'books', 'things', 'life', 'know', 'does', 'people']


THE TOP 10 WORDS FOR TOPIC #3
['learn', 'google', 'does', 'account', 'iphone', 'examples', 'instagram', 'improve', 'english', 'difference']


THE TOP 10 WORDS FOR TOPIC #4
['business', 'thing', 'engineering', 'good', 'india', 'job', 'learn', 'start', 'way', 'best']


THE TOP 10 WORDS FOR TOPIC #5
['women', 'im', 'time', 'sex', 'girl', 'love', 'feel', 'best', 'does', 'like']


THE TOP 10 WORDS FOR TOPIC #6
['lose', 'ways', 'weight', 'mean', 'online', 'best', 'india', 'money', 'make', 'does']
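
The expression `topic.argsort()[-10:]` returns the indices of the 10 largest weights in increasing order, so the most representative word comes last in each list. A minimal sketch with made-up words and weights (not values from the fitted model):

```python
import numpy as np

# Made-up vocabulary and topic weights (same role as one row of LDA.components_)
words = np.array(["phone", "best", "buy", "free", "android"])
weights = np.array([0.1, 5.0, 2.0, 0.3, 4.0])

top3_ascending = words[weights.argsort()[-3:]]        # least to most important
top3_descending = words[weights.argsort()[::-1][:3]]  # most important first

print(top3_ascending.tolist())
print(top3_descending.tolist())
```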




### Mapping with the questions

Now, we can use the LDA model to map each question to its topics.

To do that, we will make a dataframe containing 4 columns :
  - The document number
  - The list of its main topics : the ones with a probability greater than 0.1
  - The number of topics associated with the document
  - Its main topic : the one with the highest probability

In [16]:
# Linking the topics and documents
topic_results = LDA.transform(dtm)

In [17]:
document_main_topic = np.zeros((len(cleaned_data),1),dtype=int)
document_topics = []
document_nb_topics = []
document = np.linspace(1,len(cleaned_data),len(cleaned_data),dtype=int)

for i in range(len(cleaned_data)):
  document_topics.append(np.where(np.array(topic_results[i])>0.1))
  document_nb_topics.append(sum(np.array(topic_results[i])>0.1))
  document_main_topic[i] = topic_results[i].argmax()

link_topics = pd.DataFrame({'Document':document,'Topics': document_topics,'Nb Topics':document_nb_topics,'Main topic':document_main_topic[:,0]})
print(link_topics)

        Document              Topics  Nb Topics  Main topic
0              1        ([1, 3, 4],)          3           4
1              2              ([0],)          1           0
2              3           ([0, 5],)          2           0
3              4        ([1, 4, 5],)          3           4
4              5           ([0, 6],)          2           0
...          ...                 ...        ...         ...
404284    404285              ([4],)          1           4
404285    404286              ([2],)          1           2
404286    404287              ([5],)          1           5
404287    404288  ([1, 3, 4, 5, 6],)          5           4
404288    404289              ([5],)          1           5

[404289 rows x 4 columns]
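
The `Topics` column displays tuples such as `([1, 3, 4],)` because `np.where` returns a tuple of index arrays. The sketch below, on made-up probabilities, shows a possible vectorized variant based on `np.flatnonzero` (an alternative, not the code used above).

```python
import numpy as np

# Made-up document-topic probabilities: 2 documents x 3 topics
toy_probs = np.array([
    [0.05, 0.60, 0.35],
    [0.34, 0.33, 0.33],
])

above = toy_probs > 0.1
topics = [np.flatnonzero(row).tolist() for row in above]  # topics above the threshold
nb_topics = above.sum(axis=1)                             # how many there are
main_topic = toy_probs.argmax(axis=1)                     # most probable topic

print(topics)
print(nb_topics)
print(main_topic)
```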


### Interesting topics

Then, we can also determine the topic people are most interested in, and the one they are least interested in, considering that an interesting topic is one mentioned in many questions.

$\to$ We are searching for the most recurrent topic, and the least frequent one.


To answer this question, we will use 2 different approaches.

$\to$ We can, for each topic, sum its probabilities over all the documents. Then, we select the topic with the max/min sum of probabilities.

$\to$ We can also, for each topic, count the number of documents having it as a main topic.

In the following, we will implement both approaches.

In [18]:
# First Approach
## We sum the probabilities of each topic over all the documents
topic_sums = topic_results.sum(axis=0)
## The topics with the largest/smallest sum are the most/least frequent ones
main_topic_1 = topic_sums.argmax()
worst_topic_1 = topic_sums.argmin()

In [19]:
# Second approach
## We group documents depending on their main topic
data_by_topic = link_topics.groupby('Main topic')
number_questions_per_topic = np.zeros((topic_results.shape[1],1),dtype=int)

## Iteratively, we compute, for each topic, the number of documents having it as main topic
for key in list(data_by_topic.groups.keys()):
    # We get the part of the data frame dedicated to the topic being studied
    data_topic = data_by_topic.get_group(key)
    # We get the number of documents having this main topic
    number_questions_per_topic[key] = len(data_topic)

## Then, we select the topics with the largest/smallest number of documents
main_topic_2 = number_questions_per_topic[:,0].argmax()
worst_topic_2 = number_questions_per_topic[:,0].argmin()


In [20]:
print("APPROACH 1 : ")
print("     - Topic people are mostly interested in : TOPIC",main_topic_1)
print("     - Least interesting topic for people :    TOPIC",worst_topic_1,"\n")

print("APPROACH 2 : ")
print("     - Topic people are mostly interested in : TOPIC",main_topic_2)
print("     - Least interesting topic for people :    TOPIC",worst_topic_2,"\n")

APPROACH 1 : 
     - Topic people are mostly interested in : TOPIC 5
     - Least interesting topic for people :    TOPIC 6 

APPROACH 2 : 
     - Topic people are mostly interested in : TOPIC 5
     - Least interesting topic for people :    TOPIC 6 



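As we can see, our 2 approaches give the same result : according to our study, TOPIC 5 is the most frequent topic, and therefore the most interesting one, while TOPIC 6 is the least frequent. Judging from its top words ('love', 'feel', 'girl', 'women'), TOPIC 5 seems to deal with relationships and feelings.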
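
The two approaches agree here, but they need not agree in general: the first one weighs every document by its probabilities, while the second one only looks at the main topic. A made-up example (3 documents, 2 topics) where they differ:

```python
import numpy as np

# Made-up document-topic probabilities
toy_probs = np.array([
    [0.6, 0.4],
    [0.6, 0.4],
    [0.1, 0.9],
])

# Approach 1: sum of probabilities per topic
topic_sums = toy_probs.sum(axis=0)
most_frequent_1 = int(topic_sums.argmax())

# Approach 2: number of documents per main topic
main_topics = toy_probs.argmax(axis=1)
topic_counts = np.bincount(main_topics, minlength=toy_probs.shape[1])
most_frequent_2 = int(topic_counts.argmax())

print("Approach 1 :", most_frequent_1)
print("Approach 2 :", most_frequent_2)
```

Topic 1 has the largest total probability, but topic 0 is the main topic of more documents, so checking both approaches, as done above, is a reasonable precaution.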

## Conclusion

During this 5th delivery, as we did last week, we have used the theoretical notions learnt in class to answer business problems.

By re-using functions defined for the 4th delivery, and writing new ones, we managed to answer the different questions, this time for Quora.