<h1> Motivation: </h1>
<p>This notebook is supplementary material for my post on how to visualize text data, which can be read <span style="color:blue">[__HERE__.](https://www.sammywealth.com/blog/visualizing-text-data) </span></p>

Import the necessary libraries

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
%matplotlib inline


Read the data

In [2]:
df = pd.read_excel('MyArticle.xlsx')
df.tail()

Unnamed: 0,Document,Class,Source
50,"Historically, the political spectrum was...",PoliticalScience,https://learn.saylor.org/course/polsc108
51,"this unit, we will look at the state, a ...",PoliticalScience,https://learn.saylor.org/course/polsc109
52,This unit looks at the various forms of gove...,PoliticalScience,https://learn.saylor.org/course/polsc110
53,The Max Planck Manual has a global perspecti...,PoliticalScience,https://learn.saylor.org/course/polsc111
54,This unit traces the emergence of a world sy...,PoliticalScience,https://learn.saylor.org/course/polsc112


## Preprocess the data

We will now tokenize the documents. Many libraries can do this: spaCy, nltk, TextBlob, or even our own function built on regular expressions. Here I will use `textblob` to tokenize the documents. I will also remove English stopwords and words with fewer than 3 characters.
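For comparison, here is a minimal sketch of the "own function using regular expressions" option mentioned above. It is not what the notebook uses. `TOY_STOPWORDS` and `clean_text` are hypothetical names, and the stopword set is a tiny hand-picked stand-in for nltk's English list. Unlike TextBlob, this version keeps letters only, so numbers and dates are dropped.

```python
import re

# Assumption: a tiny illustrative stopword set, not nltk's full English list.
TOY_STOPWORDS = {"the", "is", "a", "to", "with", "and", "of", "how"}

def clean_text(text, stopwords=TOY_STOPWORDS, min_len=3):
    """Lowercase, tokenize on runs of letters, drop stopwords and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in stopwords and len(t) >= min_len]
    return " ".join(kept)

example = clean_text("The novel opens with Mrs. Bennet trying to persuade Mr. Bennet.")
print(example)
```

Swapping `TOY_STOPWORDS` for `set(stopwords.words('english'))` from nltk should bring the result closer to the cleaned documents produced in this notebook.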

In [3]:
from nltk.corpus import stopwords
my_stopwords = stopwords.words('english')

In [4]:
from textblob import TextBlob

In [5]:
' '.join([word for word in TextBlob('How cool is Ame? reca came.\t on 10/15/2018!').words if (not word in my_stopwords and len(word) > 2)])

'How cool Ame reca came 10/15/2018'

In [6]:
df['CleanDoc'] = df['Document'].apply(lambda x: ' '.join([word.lower() for word in TextBlob(x).words if (not word.lower() in my_stopwords and len(word) > 2)]))

### Let's compare one cleaned document with its raw version

In [7]:
print(df.iloc[0,0])

The novel opens with Mrs. Bennet trying to persuade Mr. Bennet to visit Mr. Bingley, a rich and eligible bachelor who has arrived in the neighbourhood. After some verbal sparring with Mr. Bennet baiting his wife, it transpires that this visit has already taken place at Netherfield, Mr. Bingley's rented house. The visit is followed by an invitation to a ball at the local assembly rooms that the whole neighbourhood will attend.


In [8]:
print(df.iloc[0,3])

novel opens mrs bennet trying persuade bennet visit bingley rich eligible bachelor arrived neighbourhood verbal sparring bennet baiting wife transpires visit already taken place netherfield bingley rented house visit followed invitation ball local assembly rooms whole neighbourhood attend


### Great! Let's use the TfidfVectorizer of sklearn to transform our cleaned documents.

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer(min_df=1, max_df=0.04)  # max_df=0.04 drops terms that appear in more than 4% of the documents

In [10]:
X = tv.fit_transform(df['CleanDoc'])
print(X.shape)
vocab = tv.get_feature_names_out()  # use get_feature_names() on scikit-learn versions older than 1.0
print('Length of Vocabulary: {}'.format(len(vocab)))

(55, 1096)
Length of Vocabulary: 1096
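To make the `max_df` filter and the TF-IDF weighting more concrete, here is a small hand-rolled sketch on made-up documents. It is not sklearn's implementation. The toy documents and all names are assumptions, and the smoothed idf formula is the one I understand sklearn to use by default. With the 55 documents in this notebook, `max_df = 0.04` works out to keeping only terms that appear in at most 2 documents.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already cleaned.
toy_docs = [
    "bennet visit bingley ball",
    "bingley ball darcy",
    "weather pressure air",
    "weather temperature air bennet",
]
toy_tokens = [d.split() for d in toy_docs]
n_docs = len(toy_tokens)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for doc in toy_tokens for term in set(doc))

# A float max_df is read as a proportion of documents.
toy_max_df = 0.25
toy_vocab = sorted(t for t, df_t in doc_freq.items() if df_t / n_docs <= toy_max_df)

def tfidf_row(doc):
    """TF-IDF weights for one document over toy_vocab, L2-normalized."""
    counts = Counter(doc)
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    weights = [counts[t] * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1) for t in toy_vocab]
    norm = math.sqrt(sum(w * w for w in weights)) or 1.0
    return [w / norm for w in weights]

toy_rows = [tfidf_row(doc) for doc in toy_tokens]
print(toy_vocab)
print(toy_rows[0])
```

Only the terms that occur in a single toy document survive the filter, which is why such a low `max_df` leaves a vocabulary of rare, class-specific words.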


### Convert the transformed documents to a matrix and export it to Excel for use in Gephi

In [11]:
matrix = pd.DataFrame(X.toarray(), index = df['Class'], columns=vocab)

In [12]:
matrix.head()

Class,100,15,1565,1648,1830s,300,40,able,absorb,accept,...,work,working,works,worth,writes,written,york,young,younger,zones
PrideAndPrejudice,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
PrideAndPrejudice,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
PrideAndPrejudice,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
PrideAndPrejudice,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
PrideAndPrejudice,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [13]:
matrix.to_excel('topics matrix.xlsx', index = True)

In [14]:
matrix.shape


(55, 1096)

<h2>If you are interested in the Gephi tutorial part only, <span style="color:red"> STOP HERE AND GO TO GEPHI.</span> Otherwise, <span style="color:BLUE"> KEEP READING.</span></h2>
<br></br>
<p></p>
<h2> The following section is about <span style="color:GREEN"> HOW TO VISUALIZE LDA TOPIC MODELS.</span></h2>
<p> There are many libraries for topic modelling, and <span style="color:blue"> gensim</span> is one of the most commonly used. </p>

In [15]:
import gensim



### Since we have already preprocessed our text above, all we need to do now is [tokenize](https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html) the cleaned documents.

In [16]:
df['Tokenized'] = df['CleanDoc'].apply(lambda x: x.split())
df.head()

Unnamed: 0,Document,Class,Source,CleanDoc,Tokenized
0,The novel opens with Mrs. Bennet trying to per...,PrideAndPrejudice,https://en.wikipedia.org/wiki/Pride_and_Prejudice,novel opens mrs bennet trying persuade bennet ...,"[novel, opens, mrs, bennet, trying, persuade, ..."
1,"At the ball, Mr. Bingley is open and cheerful,...",PrideAndPrejudice,https://en.wikipedia.org/wiki/Pride_and_Prejudice,ball bingley open cheerful popular guests appe...,"[ball, bingley, open, cheerful, popular, guest..."
2,"When Jane visits Miss Bingley, she is caught i...",PrideAndPrejudice,https://en.wikipedia.org/wiki/Pride_and_Prejudice,jane visits miss bingley caught rain shower wa...,"[jane, visits, miss, bingley, caught, rain, sh..."
3,"Mr. Collins, a cousin of Mr. Bennet and heir t...",PrideAndPrejudice,https://en.wikipedia.org/wiki/Pride_and_Prejudice,collins cousin bennet heir longbourn estate vi...,"[collins, cousin, bennet, heir, longbourn, est..."
4,Elizabeth and her family meet the dashing and ...,PrideAndPrejudice,https://en.wikipedia.org/wiki/Pride_and_Prejudice,elizabeth family meet dashing charming george ...,"[elizabeth, family, meet, dashing, charming, g..."


<h3>There are 3 main steps in creating a gensim corpus.</h3>

<ul>
    <li>Create Corpus: This is a list of tokenized documents, that is, a list of lists of tokens. It can easily become too large to hold in memory.</li>
    <li>Create token dictionary: This dictionary assigns an integer index to each distinct token in the entire corpus of documents.</li>
    <li>Term Document Frequency: For each document, this stores (token index, word count) pairs for the tokens that occur in it.</li>
</ul>
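The three steps above can be sketched in plain Python to show what the resulting data structures look like. This is only an illustration, not gensim's implementation. The toy documents and the names `toy_texts`, `token2id` and `to_bow` are assumptions, and gensim may assign ids in a different order.

```python
from collections import Counter

# Step 1: the corpus as a list of tokenized documents (hypothetical toy data).
toy_texts = [
    ["novel", "opens", "bennet", "visit", "bennet"],
    ["ball", "bingley", "visit"],
]

# Step 2: token dictionary, one integer id per distinct token
# (assumption: ids are handed out in order of first appearance).
token2id = {}
for doc in toy_texts:
    for token in doc:
        token2id.setdefault(token, len(token2id))

# Step 3: term document frequency, (token_id, count) pairs per document.
def to_bow(doc):
    counts = Counter(token2id[t] for t in doc)
    return sorted(counts.items())

toy_corpus = [to_bow(doc) for doc in toy_texts]
print(token2id)
print(toy_corpus)
```

Each document shrinks to the ids of the tokens it contains plus their counts, which is the same shape as the `corpus[:1]` output printed in the next cell.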

In [17]:
import gensim.corpora as corpora

# Create Corpus
texts = df['Tokenized']

# Create Dictionary
id2word = corpora.Dictionary(df['Tokenized'])

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# View
print(corpus[:1])

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 3), (8, 2), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 2), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 3), (30, 1), (31, 1)]]


## Train an LDA model.

In [18]:
# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=4, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=10,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

### Get the most dominant topic for each document.

In [19]:
# With per_word_topics=True, lda_model[bow] returns a tuple; element 0 holds the (topic_id, probability) pairs
my_topics = [max(lda_model[bow][0], key=lambda x: x[1])[0] for bow in corpus]
df['PredictedTopic'] = ['Topic {}'.format(topicNumber + 1) for topicNumber in my_topics]
# Same TF-IDF matrix as before, indexed by predicted topic instead of class
mat = pd.DataFrame(X.toarray(), index=my_topics, columns=vocab)
mat.to_excel('mat5.xlsx')
# Number of documents assigned to each topic
[(i, my_topics.count(i)) for i in set(my_topics)]

[(0, 18), (1, 13), (2, 6), (3, 18)]
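The selection logic in the cell above can be tried without a trained model. The sketch below uses made-up topic distributions shaped like gensim's list of (topic_id, probability) pairs. The numbers and the helper name `dominant_topic` are hypothetical.

```python
from collections import Counter

# Hypothetical per-document topic distributions: (topic_id, probability) pairs.
toy_doc_topics = [
    [(0, 0.70), (1, 0.10), (3, 0.20)],
    [(1, 0.55), (2, 0.45)],
    [(3, 0.90), (0, 0.10)],
]

def dominant_topic(pairs):
    """Return the topic id with the highest probability."""
    return max(pairs, key=lambda pair: pair[1])[0]

toy_dominant = [dominant_topic(pairs) for pairs in toy_doc_topics]
toy_counts = Counter(toy_dominant)
print(toy_dominant)
print(sorted(toy_counts.items()))
```

`Counter` gives the same per-topic tally as the list comprehension over `set(my_topics)` in a single pass.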

Check out the distribution of topics in our various classes.

In [20]:
df.groupby(by = ['Class', 'PredictedTopic']).agg({'Document': 'count'})

Class,PredictedTopic,Document
PoliticalScience,Topic 1,9
PoliticalScience,Topic 3,1
PoliticalScience,Topic 4,2
PrideAndPrejudice,Topic 1,6
PrideAndPrejudice,Topic 2,3
PrideAndPrejudice,Topic 3,4
PrideAndPrejudice,Topic 4,3
StockTrading,Topic 3,1
StockTrading,Topic 4,13
Weather,Topic 1,3


It appears that PoliticalScience is distributed between topics 1, 3 and 4, while PrideAndPrejudice appears in all four topics. StockTrading is concentrated in topic 4, and weather seems to be tied to topic 2.
<p>Anyone who has read [Pride And Prejudice by Jane Austen](https://en.wikipedia.org/wiki/Pride_and_Prejudice) will agree that it is a mixture of politics, weather and wealth.</p>

#### Let's return to the topics themselves. First I get the top 40 words that define each topic, and then I use the amazing `pyLDAvis` library to plot the topics.

In [21]:
for topic in lda_model.show_topics(formatted=False, num_words=40):
    print('Topic {}'.format(str(topic[0]+1))+ ': ' + '\n'+' | '.join([word[0] for word in topic[1]]) + '\n'+ '--'*78)

Topic 1: 
political | states | unit | state | government | course | perspective | elizabeth | politics | united | darcy | issues | look | forms | contemporary | world | others | ideologies | end | general | across | one | questions | jane | principles | study | science | various | institutions | role | global | century | twentieth | seen | creation | learn | first | governance | among | populations
------------------------------------------------------------------------------------------------------------------------------------------------------------
Topic 2: 
weather | earth | temperature | surface | different | air | lower | differences | system | due | one | pressure | solar | systems | occur | atmosphere | scale | changes | large | sun | climate | atmospheric | phenomena | effect | causes | affect | years | known | higher | part | spot | moisture | sunlight | cell | jet | stream | altitudes | cause | however | caused
---------------------------------------------------------------

In [22]:
for topic in lda_model.show_topics(formatted=False, num_words=40):
    print(topic)

(0, [('political', 0.030811898), ('states', 0.020873385), ('unit', 0.020143062), ('state', 0.017383704), ('government', 0.0130092), ('course', 0.010391863), ('perspective', 0.009877052), ('elizabeth', 0.009265046), ('politics', 0.00878077), ('united', 0.008147428), ('darcy', 0.007319779), ('issues', 0.0064352266), ('look', 0.006275257), ('forms', 0.006274406), ('contemporary', 0.005577479), ('world', 0.005574635), ('others', 0.00495894), ('ideologies', 0.004958465), ('end', 0.0046034767), ('general', 0.00459834), ('across', 0.004597168), ('one', 0.004315701), ('questions', 0.0042195744), ('jane', 0.0040715877), ('principles', 0.0040077767), ('study', 0.004007536), ('science', 0.003983518), ('various', 0.0037095603), ('institutions', 0.0037095598), ('role', 0.003708071), ('global', 0.0036281978), ('century', 0.003562005), ('twentieth', 0.003562005), ('seen', 0.0035601829), ('creation', 0.0035514492), ('learn', 0.0035409431), ('first', 0.0035409431), ('governance', 0.0035404793), ('among

In [23]:
from gensim.models import CoherenceModel
# Compute the per-word likelihood bound (gensim reports perplexity as 2^(-bound))
print('\nPerplexity: ', lda_model.log_perplexity(corpus))  # this is the log bound, not perplexity itself; values closer to 0 are better.

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=texts, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Perplexity:  -7.281403321042627

Coherence Score:  0.6346605815698426
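The "Perplexity" value printed above is negative because `log_perplexity` returns a per-word likelihood bound rather than the perplexity itself. As far as I understand gensim's documentation, the two are related by perplexity = 2^(-bound). A small sketch of that conversion follows. The helper name `bound_to_perplexity` is hypothetical.

```python
def bound_to_perplexity(bound):
    """Convert a per-word log-likelihood bound (base 2) into perplexity."""
    return 2 ** (-bound)

# Value printed by lda_model.log_perplexity(corpus) above.
log_bound = -7.281403321042627
toy_perplexity = bound_to_perplexity(log_bound)
print(round(toy_perplexity, 1))
```

On this scale lower is better, so the "lower is better" rule applies to the converted perplexity, not to the raw bound.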


In [24]:
# Plotting tools
import pyLDAvis
import pyLDAvis.gensim  # don't skip this
# Note: recent pyLDAvis releases renamed this module to pyLDAvis.gensim_models


In [25]:
# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
vis



<p> <span style="color:blue"> [Stay Updated! Always check out my blog for more posts!](https://www.sammywealth.com/blog/visualizing-text-data) </span> </p>