In [1]:
import pandas as pd
import numpy as np


In [2]:
replies_dataset = pd.read_csv(r'F:\std\allreplies.csv')
replies_dataset = replies_dataset.head(350)


The dataset, scraped from the American Sexual Health Association (ASHA), is loaded into a pandas DataFrame; only the first 350 rows are kept. It contains user posts and replies about various sexually transmitted diseases. We will use LDA (Latent Dirichlet Allocation) to group the replies into 5 topics.

In [3]:
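# Note: dropna() returns a new DataFrame and the result is not assigned here,
# so rows with missing values stay in replies_dataset.
replies_dataset.dropna()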
replies_dataset.head()

Unnamed: 0.1,Unnamed: 0,REPLIES
0,Brian,"Dear members,\n\nMany of us, including myself,..."
1,yarnkitty,"As a former nurse and too frequent patient, my..."
2,bb45694,I think security should be tighter.\nOne thing...
3,Hbc2115,"Throughout my surgeries and chemo, I had very ..."
4,queencitywalker,"Patient centered, not dollar centered.\nSmalle..."


In [4]:
# "https" is removed before "http"; in the opposite order the "https" pattern can never match.
replies_dataset['REPLIES'] = replies_dataset['REPLIES'].str.replace("https", " ", case=False)
replies_dataset['REPLIES'] = replies_dataset['REPLIES'].str.replace("http", " ", case=False)
replies_dataset['REPLIES'] = replies_dataset['REPLIES'].str.replace("www", " ", case=False)
replies_dataset['REPLIES'] = replies_dataset['REPLIES'].str.replace("org", " ", case=False)
replies_dataset['REPLIES'] = replies_dataset['REPLIES'].str.replace("ashasexualhealth", " ", case=False)
replies_dataset['REPLIES'] = replies_dataset['REPLIES'].str.replace("thanks", " ", case=False)
replies_dataset['REPLIES'] = replies_dataset['REPLIES'].str.replace("don", " ", case=False)

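Some frequently repeated strings that do not add value to the topics (URL fragments, the site name, "thanks", and the "don" left over from "don't") are removed. Note that these are substring replacements, so they also remove the same letters inside longer words (for example the "don" in "done").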
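If removing letters from inside longer words is not wanted, one option is to match whole words only. The following is a minimal sketch, not part of the original notebook: the helper name strip_noise_words and the sample sentence are hypothetical, and the word list simply mirrors the strings removed above.

```python
import re

# Hypothetical word list mirroring the strings removed in the cell above.
NOISE_WORDS = ["https", "http", "www", "org", "ashasexualhealth", "thanks", "don"]

# \b matches only at word boundaries, so "don" no longer matches inside "done"
# and "org" no longer matches inside "organ".
NOISE_PATTERN = re.compile(
    r"\b(?:" + "|".join(map(re.escape, NOISE_WORDS)) + r")\b",
    flags=re.IGNORECASE,
)

def strip_noise_words(text):
    """Replace each whole-word occurrence of a noise word with a space."""
    return NOISE_PATTERN.sub(" ", text)

sample = "Thanks, I am done. Visit https://www.ashasexualhealth.org for organ info."
cleaned = strip_noise_words(sample)
print(cleaned)
```

On the DataFrame, the same compiled pattern could presumably be applied with replies_dataset['REPLIES'].str.replace(NOISE_PATTERN, " ", regex=True), leaving out the case argument because the flag is already part of the pattern.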

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(max_df=0.8, min_df=2, stop_words='english')
doc_term_matrix = count_vect.fit_transform(replies_dataset['REPLIES'].values.astype('U'))

In the script above we use the CountVectorizer class from the sklearn.feature_extraction.text module to create a document-term matrix. With max_df=0.8 and min_df=2 we keep only those words that appear in at most 80% of the documents and in at least 2 documents. We also remove English stop words, as they do not really contribute to topic modeling.
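To make the effect of the two thresholds concrete, here is a small pure-Python sketch of the same document-frequency filter. It is an illustration only: filter_vocabulary and the toy documents are hypothetical, tokenization is a plain whitespace split rather than CountVectorizer's tokenizer, and no stop words are removed.

```python
from collections import Counter

def filter_vocabulary(docs, min_df=2, max_df=0.8):
    """Keep terms that occur in at least min_df documents and in at most
    a fraction max_df of all documents."""
    n_docs = len(docs)
    # Count each term once per document (document frequency, not term frequency).
    doc_freq = Counter(term for doc in docs for term in set(doc.lower().split()))
    return sorted(
        term for term, df in doc_freq.items()
        if df >= min_df and df <= max_df * n_docs
    )

toy_docs = [
    "herpes test results",
    "herpes symptoms doctor",
    "herpes test doctor",
    "herpes treatment",
    "herpes doctor advice",
]
vocab = filter_vocabulary(toy_docs)
print(vocab)  # ['doctor', 'test']
```

Here "herpes" is dropped because it appears in every document (more than 80%), and words that appear in a single document are dropped by min_df=2.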

Now let's look at our document term matrix:

In [6]:
doc_term_matrix

<350x1717 sparse matrix of type '<class 'numpy.int64'>'
	with 10285 stored elements in Compressed Sparse Row format>

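Each of the 350 documents is represented as a 1717-dimensional vector, so our vocabulary has 1717 words. The 10285 stored elements are the non-zero entries of the matrix, i.e. the (document, word) pairs that actually occur, which is about 1.7% of all cells.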

Next, we will use LDA to create topics along with the probability distribution for each word in our vocabulary for each topic.

In [7]:
from sklearn.decomposition import LatentDirichletAllocation

LDA = LatentDirichletAllocation(n_components=5, random_state=42)
LDA.fit(doc_term_matrix)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='batch', learning_offset=10.0,
             max_doc_update_iter=100, max_iter=10, mean_change_tol=0.001,
             n_components=5, n_jobs=None, n_topics=None, perp_tol=0.1,
             random_state=42, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)

In the script above we use the LatentDirichletAllocation class from the sklearn.decomposition module to perform LDA
on our document-term matrix. The parameter n_components specifies the number of categories, or topics, that we want our text
to be divided into; random_state fixes the seed so that the result is reproducible.

The following script randomly fetches 10 words from our vocabulary

In [8]:
import random

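# On scikit-learn 1.0 or later, use get_feature_names_out() instead.
feature_names = count_vect.get_feature_names()
for i in range(10):
    # randrange excludes the upper bound; randint(0, len(...)) could index one past the end
    random_id = random.randrange(len(feature_names))
    print(feature_names[random_id])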

first_topic = LDA.components_[0]
top_topic_words = first_topic.argsort()[-10:]


minimal
shoes
wrote
influence
diabetes
preparing
marriage
empathic
unless
benefit


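Let's find the 10 words with the highest weight for the first topic. The components_ attribute holds one row of word
weights per topic, so index 0 selects the first topic (stored as first_topic above). argsort() returns indices in
ascending order of weight, so the last 10 indices are those of the 10 most heavily weighted words.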


In [10]:
top_topic_words = first_topic.argsort()[-10:]

These indexes can then be used to look up the corresponding words in the count_vect object:

In [11]:
for i in top_topic_words:
    print(count_vect.get_feature_names()[i])

patients
patient
ve
doctor
symptoms
think
people
like
just
know


The words suggest that the first topic is mostly about patients, doctors, and symptoms. Generic words such as "just", "like", and "know" carry little meaning on their own.
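The values in components_ are unnormalized weights rather than probabilities; dividing a topic's row by its sum gives a probability distribution over the vocabulary without changing the ranking. The sketch below illustrates this on made-up numbers. The function top_words, the toy vocabulary, and the toy weights are all hypothetical, and the words are listed most probable first, whereas the argsort()[-10:] slice above lists them in ascending order.

```python
def top_words(topic_weights, vocabulary, k=3):
    """Return the k most heavily weighted words with their normalized probabilities."""
    total = sum(topic_weights)
    probs = [w / total for w in topic_weights]
    # Sort word indices by probability, highest first.
    ranked = sorted(range(len(probs)), key=lambda idx: probs[idx], reverse=True)
    return [(vocabulary[idx], round(probs[idx], 3)) for idx in ranked[:k]]

# Made-up weights for a five-word vocabulary (not taken from the fitted model).
toy_vocabulary = ["doctor", "know", "patient", "shoes", "symptoms"]
toy_weights = [12.0, 20.0, 8.0, 0.5, 9.5]

result = top_words(toy_weights, toy_vocabulary)
print(result)
```

With the fitted model, the same idea would apply to first_topic / first_topic.sum().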

Let's print the 10 words with highest probabilities for all the five topics

In [12]:
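# Fetch the vocabulary once instead of once per word.
feature_names = count_vect.get_feature_names()
for i, topic in enumerate(LDA.components_):
    print(f'Top 10 words for topic #{i}:')
    # word_idx avoids reusing the outer loop variable i inside the comprehension
    print([feature_names[word_idx] for word_idx in topic.argsort()[-10:]])
    print('\n')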

Top 10 words for topic #0:
['patients', 'patient', 've', 'doctor', 'symptoms', 'think', 'people', 'like', 'just', 'know']


Top 10 words for topic #1:
['let', 'herpes', 'know', 'admin', 'post', 'read', 'com', 'hsv', 'health', 'inspire']


Top 10 words for topic #2:
['inspire', 'treatment', 'think', 'link', 'information', 'medical', 'years', 'know', 'right', 'just']


Top 10 words for topic #3:
['people', 'want', 'sex', 'feel', 'just', 'child', 'like', 'hsv', 'life', 'herpes']


Top 10 words for topic #4:
['really', 'know', 'testing', 'hi', 'time', 'dormant', 'people', 'thank', 'hpv', 'nan']




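Judging by the top words printed above, the topics can be loosely labeled as follows:<br>
Topic 0: Discussions about doctors, patients, and symptoms<br>
Topic 1: Discussions about the forum itself (Inspire, American Sexual Health Association), such as admin posts and links<br>
Topic 2: Discussions about treatment and medical information<br>
Topic 3: Discussions about living with herpes (HSV): sex, feelings, and life<br>
Topic 4: Discussions about HPV (human papillomavirus) testing and dormancy; the 'nan' token comes from missing replies that were converted to the string 'nan'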
        

As a final step, we will add a column to the original data frame that stores the topic of each text. To do so, we can use the
LDA.transform() method and pass it our document-term matrix. For each document, this method returns a probability
distribution over the five topics, so the result has one row per document and one column per topic.

In [13]:
topic_values = LDA.transform(doc_term_matrix)
topic_values.shape

(350, 5)

The following script adds a new Topic column to the data frame and assigns each row the index of its most probable topic:

In [14]:
replies_dataset['Topic'] = topic_values.argmax(axis=1)
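Taking the argmax discards how confident the assignment is. The sketch below, on made-up probabilities, shows the same argmax step while keeping the winning probability alongside the topic index. The function assign_topics and the toy rows are hypothetical and are not part of the original notebook.

```python
def assign_topics(doc_topic_rows):
    """Return (topic_index, probability) of the most probable topic for each document."""
    assignments = []
    for row in doc_topic_rows:
        best = max(range(len(row)), key=lambda idx: row[idx])
        assignments.append((best, row[best]))
    return assignments

# Made-up document-topic probabilities: the first row is clear-cut, the second is not.
toy_topic_values = [
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.22, 0.20, 0.20, 0.19, 0.19],
]

assignments = assign_topics(toy_topic_values)
for topic, prob in assignments:
    print(topic, prob)
```

Both toy documents are assigned topic 0, but only the first assignment is clear-cut. In the notebook, topic_values.max(axis=1) would give the corresponding probability for each reply.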

In [15]:
replies_dataset.to_csv(r'F:\std\topic_all_replies_LDA.csv')

In [16]:
replies_dataset.head()

Unnamed: 0.1,Unnamed: 0,REPLIES,Topic
0,Brian,"Dear members,\n\nMany of us, including myself,...",0
1,yarnkitty,"As a former nurse and too frequent patient, my...",0
2,bb45694,I think security should be tighter.\nOne thing...,0
3,Hbc2115,"Throughout my surgeries and chemo, I had very ...",0
4,queencitywalker,"Patient centered, not dollar centered.\nSmalle...",1
