## Topic Modelling using LDA (Data Source: American Sexual Health Association)

In [1]:
import pandas as pd
import numpy as np


The dataset on which the topic model is built was scraped from the American Sexual Health Association (ASHA) and is loaded into a pandas DataFrame. It contains user posts and replies about various sexually transmitted diseases. We will use LDA to group the replies into 5 topics.

In [2]:
replies_dataset = pd.read_csv(r'F:\std\allreplies.csv')
# Keep only the first five rows; .copy() avoids SettingWithCopyWarning on later assignments
replies_dataset = replies_dataset.head().copy()


### Overview of the dataset

In [3]:
# dropna() returns a new DataFrame, so the result must be assigned back
replies_dataset = replies_dataset.dropna()
replies_dataset.head()

| Unnamed: 0 | Username | Post and replies |
|---|---|---|
| 0 | Brian | Dear members,\n\nMany of us, including myself,... |
| 1 | yarnkitty | As a former nurse and too frequent patient, my... |
| 2 | bb45694 | I think security should be tighter.\nOne thing... |
| 3 | Hbc2115 | Throughout my surgeries and chemo, I had very ... |
| 4 | queencitywalker | Patient centered, not dollar centered.\nSmalle... |


In [4]:
# Strip URL fragments and filler words that add no value to the topics.
# "https" is listed before "http" so the longer token is removed whole.
# Note: these are substring matches, so "don" also hits words such as "done".
noise_tokens = ["https", "http", "www", "org", "ashasexualhealth", "thanks", "don"]
for token in noise_tokens:
    replies_dataset['Post and replies'] = replies_dataset['Post and replies'].str.replace(token, " ", case=False, regex=True)

Some frequently repeated words that do not add value to the topics are removed.
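
The replacements above match substrings, so removing "don" also damages words such as "done", and "org" damages "organ". A minimal sketch of a whole-word alternative follows, assuming a regex with word boundaries is acceptable here. `NOISE_TOKENS`, `NOISE_PATTERN`, `strip_noise` and the sample sentences are illustrative and not part of the original notebook.

```python
import re

import pandas as pd

# Illustrative token list, mirroring the strings removed in the cell above.
NOISE_TOKENS = ["https", "http", "www", "org", "ashasexualhealth", "thanks", "don"]

# \b anchors each token to word boundaries so that only whole words match.
NOISE_PATTERN = re.compile(
    r"\b(?:" + "|".join(map(re.escape, NOISE_TOKENS)) + r")\b",
    flags=re.IGNORECASE,
)


def strip_noise(text: str) -> str:
    """Replace whole-word noise tokens with a single space."""
    return NOISE_PATTERN.sub(" ", text)


# Made-up sample posts, used only to show the behaviour.
sample = pd.Series(["Thanks, I am done with the organ donor form", "see www asha org"])
cleaned = sample.astype(str).map(strip_noise)
print(cleaned.tolist())
```

With this pattern "Thanks", "www" and "org" are removed as standalone words, while "done", "organ" and "donor" are left intact.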
### Creating a vocabulary of all the words

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(max_df=0.8, min_df=2, stop_words='english')
doc_term_matrix = count_vect.fit_transform(replies_dataset['Post and replies'].values.astype('U'))

In the script above we use the CountVectorizer class from the sklearn.feature_extraction.text module to create a document-term matrix. With max_df=0.8 and min_df=2 we keep only those words that appear in at most 80% of the documents and in at least 2 documents. We also remove English stop words, as they contribute little to topic modelling.
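
To make the effect of the two frequency thresholds concrete, here is a small sketch on a made-up five-document corpus. The documents and the names `toy_docs`, `toy_vect` and `toy_matrix` are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus: "herpes" occurs in every document, "test" and "doctor" in two each,
# and every other word in exactly one document.
toy_docs = [
    "herpes test results",
    "herpes test anxiety",
    "herpes vaccine news",
    "herpes doctor visit",
    "herpes doctor advice",
]

toy_vect = CountVectorizer(max_df=0.8, min_df=2, stop_words="english")
toy_matrix = toy_vect.fit_transform(toy_docs)

# "herpes" is dropped by max_df (it is in 100% of documents);
# words seen in a single document are dropped by min_df.
print(sorted(toy_vect.vocabulary_))  # ['doctor', 'test']
print(toy_matrix.shape)              # (5, 2)
```

Note that a document can end up with an all-zero row (the third one here) when every one of its words is filtered out.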

Now let's look at our document-term matrix.

### Document-term matrix

In [7]:
doc_term_matrix

<5x21 sparse matrix of type '<class 'numpy.int64'>'
	with 46 stored elements in Compressed Sparse Row format>

The matrix has 5 rows (one per document) and 21 columns, so the vocabulary built from this five-row sample has 21 words.

### Using LDA to create topics
Next, we will use LDA to create topics, along with a probability distribution over the words in our vocabulary for each topic.

In [8]:
from sklearn.decomposition import LatentDirichletAllocation

LDA = LatentDirichletAllocation(n_components=5, random_state=42)
LDA.fit(doc_term_matrix)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='batch', learning_offset=10.0,
             max_doc_update_iter=100, max_iter=10, mean_change_tol=0.001,
             n_components=5, n_jobs=None, n_topics=None, perp_tol=0.1,
             random_state=42, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)

In the script above we use the LatentDirichletAllocation class from the sklearn.decomposition module to perform LDA
on our document-term matrix. The n_components parameter specifies the number of categories, or topics, into which we want
our text to be divided.
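
The fitted model exposes the topic-word weights through its `components_` attribute, with one row per topic and one column per vocabulary word. These weights are not normalized; dividing each row by its sum yields a probability distribution over words. A minimal sketch on a made-up corpus (all `toy_*` names and the sentences are assumptions):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus, used only to obtain a small document-term matrix.
toy_docs = [
    "doctor hospital nurse care",
    "hospital staff nurse patients",
    "test results anxiety waiting",
    "waiting for test results again",
]
toy_matrix = CountVectorizer().fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=42)
toy_lda.fit(toy_matrix)

# One row per topic, one column per vocabulary word.
print(toy_lda.components_.shape)

# Normalize each row so that it sums to 1 (a distribution over words per topic).
topic_word_probs = toy_lda.components_ / toy_lda.components_.sum(axis=1, keepdims=True)
print(topic_word_probs.sum(axis=1))
```

Ranking words by raw weight or by normalized probability gives the same order within a topic, so the top-word listings are unaffected by the normalization.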

### Fetching Random words from our Vocabulary
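The following script randomly fetches 10 words from our vocabulary.

In [9]:
import random

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = count_vect.get_feature_names_out()

for i in range(10):
    # randrange excludes the upper bound, so the index is always valid
    random_id = random.randrange(len(feature_names))
    print(feature_names[random_id])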

first_topic = LDA.components_[0]
top_topic_words = first_topic.argsort()[-10:]


help
help
ways
want
getting
better
health
care
nurses
ways


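Let's find the 10 words with the highest probability for the first topic. To get the first topic, we index the
components_ attribute with 0. Calling argsort() on it returns the word indexes sorted by ascending weight, so the last 10 indexes belong to the top 10 words.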

In [11]:
top_topic_words = first_topic.argsort()[-10:]

These indexes can then be used to retrieve the corresponding words from the count_vect object:

In [12]:
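# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = count_vect.get_feature_names_out()
for i in top_topic_words:
    print(feature_names[i])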

like
think
just
staff
focus
doctors
patients
hospital
run
getting
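
Because argsort() sorts in ascending order, the list above runs from the least to the most probable of the top ten words. A small sketch with made-up weights shows the ordering and how to reverse it; `weights` and `vocab` are hypothetical stand-ins for a row of `components_` and the vocabulary.

```python
import numpy as np

# Hypothetical topic weights and matching vocabulary.
weights = np.array([0.1, 2.5, 0.7, 1.9, 0.3])
vocab = np.array(["care", "doctor", "help", "hospital", "staff"])

ascending = weights.argsort()      # indexes from lowest to highest weight
top3 = ascending[-3:]              # three highest, still in ascending order
top3_desc = ascending[-3:][::-1]   # reversed: highest weight first

print(vocab[top3].tolist())       # ['help', 'hospital', 'doctor']
print(vocab[top3_desc].tolist())  # ['doctor', 'hospital', 'help']
```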


This topic is interpreted as being about herpes and feeling low because of it, although the ten words printed above, taken from the five-row sample, mostly concern hospitals, staff and patients.

### Finding words with highest probabilities in all the topics
Let's print the 10 words with the highest probabilities for each of the five topics:

In [13]:
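# get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2
feature_names = count_vect.get_feature_names_out()

for i, topic in enumerate(LDA.components_):
    print(f'Top 10 words for topic #{i}:')
    # argsort() is ascending, so the last 10 indexes are the 10 highest-weighted words
    print(feature_names[topic.argsort()[-10:]].tolist())
    print('\n')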

Top 10 words for topic #0:
['like', 'think', 'just', 'staff', 'focus', 'doctors', 'patients', 'hospital', 'run', 'getting']


Top 10 words for topic #1:
['like', 'think', 'just', 'staff', 'focus', 'doctors', 'patients', 'hospital', 'run', 'getting']


Top 10 words for topic #2:
['nurses', 'work', 'hospital', 'ways', 'help', 'doctors', 'care', 'focus', 'staff', 'patients']


Top 10 words for topic #3:
['care', 'good', 'think', 'just', 'like', 'health', 'want', 'hospital', 'better', 'hospitals']


Top 10 words for topic #4:
['just', 'doctors', 'hospital', 'patients', 'nurses', 'work', 'good', 'run', 'getting', 'trained']




### Relevant Description
Topic 0: Discussions about herpes (most prevalent disease)<br>
Topic 1: Discussions about HPV (human papillomavirus)<br>
Topic 2: Discussions on the transmission mode of the herpes virus<br>
Topic 3: Discussions about the forum (Inspire) (American Sexual Health Association)<br>
Topic 4: Discussions about doctors, patients and treatments
        

As a final step, we will add a column to the original data frame that stores the topic for each text. To do so, we call
the LDA.transform() method on our document-term matrix. It returns, for each document, the probability of
every topic.

In [14]:
topic_values = LDA.transform(doc_term_matrix)
topic_values.shape

(5, 5)

The following script adds a Topic column to the data frame and assigns each row the topic with the highest probability:

In [15]:
replies_dataset['Topic'] = topic_values.argmax(axis=1)
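
Taking the argmax keeps only the single most probable topic and discards how confident the assignment is. The sketch below uses hypothetical probabilities, shaped like the output of LDA.transform(), to show how the winning probability could be kept alongside the label. `toy_topic_values`, `assigned` and `confidence` are illustrative names.

```python
import numpy as np

# Hypothetical document-topic probabilities: three documents, five topics.
toy_topic_values = np.array([
    [0.05, 0.05, 0.10, 0.70, 0.10],
    [0.02, 0.03, 0.90, 0.03, 0.02],
    [0.20, 0.20, 0.20, 0.20, 0.20],
])

assigned = toy_topic_values.argmax(axis=1)   # index of the most probable topic per row
confidence = toy_topic_values.max(axis=1)    # probability of that topic

print(assigned.tolist())    # [3, 2, 0]
print(confidence.tolist())
```

The third row is a tie, and argmax simply returns the first index, so a low confidence value is a hint that the assigned topic should not be over-interpreted.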

In [16]:
replies_dataset.to_csv(r'F:\std\topic_all_replies_LDA.csv')

### Topics assigned to each post

In [18]:
replies_dataset.head()

| Unnamed: 0 | Username | Post and replies | Topic |
|---|---|---|---|
| 0 | Brian | Dear members,\n\nMany of us, including myself,... | 3 |
| 1 | yarnkitty | As a former nurse and too frequent patient, my... | 2 |
| 2 | bb45694 | I think security should be tighter.\nOne thing... | 3 |
| 3 | Hbc2115 | Throughout my surgeries and chemo, I had very ... | 4 |
| 4 | queencitywalker | Patient centered, not dollar centered.\nSmalle... | 3 |
