# Topic Modeling on Social Media Posts

## Sources: 
1. https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24
2. https://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/pyLDAvis_overview.ipynb#topic=0&lambda=1&term=
3. https://pyldavis.readthedocs.io/en/latest/modules/API.html#pyLDAvis.save_html
4. https://towardsdatascience.com/nlp-embedding-techniques-51b7e6ec9f92

## Task: What common topics are netizens discussing on Twitter about the presidential candidates?

## Scope:
- Presidential Candidates
- Twitter
- Past few days

## Steps:
1. Gather data based on the set parameters
2. Load Data
3. Data Engineering (process and clean data)
    1. Remove special characters
    2. Remove usernames
    3. Remove stop words
4. Modeling
    1. Choose the methods
    2. Choose the parameters of that method (e.g. number of topics)
    3. Run the model on the dataset
    4. Save
    5. Visualize
5. Analyze Results
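
The three cleaning sub-steps of step 3 can be sketched in a few lines. This is a minimal illustration, not the pipeline used later in this notebook: `clean_post` is a hypothetical helper, and the sample post and stop-word set are invented for the example.

```python
import re

def clean_post(text, stop_words=frozenset()):
    """Hypothetical helper: apply the three cleaning sub-steps to one post."""
    text = re.sub(r'@\w+', ' ', text)      # remove usernames such as @user
    text = re.sub(r'[^\w\s]', ' ', text)   # remove special characters (punctuation, symbols)
    tokens = text.lower().split()          # lowercase and split on whitespace
    return [t for t in tokens if t not in stop_words]  # remove stop words

# Invented sample post and stop words, for illustration only
print(clean_post("Salamat @juan! Ang galing ng volunteers :)", stop_words={"ang", "ng"}))
# ['salamat', 'galing', 'volunteers']
```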

## Code

## Test


# 1. Gather data

You can use tools available online to gather data from social media platforms:
- Twitter

# 2. Load dataset

In [11]:
import warnings; warnings.filterwarnings("ignore")
import pandas as pd

In [12]:
!pip install python-docx



In [21]:
# fname='./data/nlp/survey/RWA survey - Responses.xlsx'
# df=pd.read_excel(fname)

import docx

# Read the .docx export; every non-empty paragraph becomes one document (tweet)
doc = docx.Document('./data/nlp/socmed/Leni Robredo 2021 tweets (4-12-21).docx')
docs = [p.text for p in doc.paragraphs if p.text]

docs

['IMPORTANT ANNOUNCEMENT',
 'Ini-launch natin noong Miyerkules ang Bayanihan E-konsulta. Sobrang salamat sa lahat ng nag-volunteer—mga doktor, mga tumatao sa call center—pati rin sa mga simpleng messages of support. Overwhelmingly positive ang naging response sa initiative natin.',
 'Sobrang overwhelming ang support, at sobrang overwhelming din ang dami ng requests na pinaabot sa atin. Patunay ito na kailangang natin ng ganitong uri ng serbisyo.',
 'Hindi kami nagpapigil kahit ang daming limitasyon, at kailangan namin ngayon ng kaunting oras para habulin at tugunan muna ang lahat ng pumasok na request.',
 'Kaya, despite our reluctance to do so, tomorrow, Monday, April 12, we will not be receiving new requests first.',
 'Isang araw lang po ito. Kailangan lang po namin ng panahon para ma integrate na yung external volunteers at maayos lahat na technical issues at backlogs. Hindi po namin inaasahan ang dagsa ng nagkukunsulta kaya kailangan lalong paghusayan ang ating platform at process f

# 3. Data Engineering

In [26]:
## library of functions on regular expressions (regex)
import re

## Remove usernames: any word that starts with '@',
## whether it follows whitespace or begins the tweet
twts = [re.sub(r'(^|\s)@\w+', '', a) for a in docs]


## beautiful soup -- parsing HTML formats

In [27]:
twts = pd.DataFrame(twts)
twts

Unnamed: 0,0
0,IMPORTANT ANNOUNCEMENT
1,Ini-launch natin noong Miyerkules ang Bayaniha...
2,"Sobrang overwhelming ang support, at sobrang o..."
3,Hindi kami nagpapigil kahit ang daming limitas...
4,"Kaya, despite our reluctance to do so, tomorro..."
...,...
154,*so
155,[A] Vice President Leni Robredo on the termina...
156,Full text here:
157,"2021 na, ang trolls mahilig pa rin mag-recycle..."


## Retrieve desired column (for feedback)

In [25]:
## This retrieves the column for processing

'''
Twitter dataset: the DataFrame generated above contains only the tweet text,
so it has a single column (label 0).
For other datasets: identify the desired column.

Select that column with twts[0] and rename it to: Comments
'''
l_dataset=pd.DataFrame({'Comments':twts[0].str.lower()}) # the dataset that we need
l_dataset

Unnamed: 0,Comments
0,important announcement
1,ini-launch natin noong miyerkules ang bayaniha...
2,"sobrang overwhelming ang support, at sobrang o..."
3,hindi kami nagpapigil kahit ang daming limitas...
4,"kaya, despite our reluctance to do so, tomorro..."
...,...
154,*so
155,[a] vice president leni robredo on the termina...
156,full text here:
157,"2021 na, ang trolls mahilig pa rin mag-recycle..."


# 4. NLP
## 4.1. Load NLP libraries

In [30]:
!pip install gensim
!pip install nltk



In [31]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
import numpy as np
import nltk

np.random.seed(2018)  # seed NumPy's random number generator

nltk.download('wordnet')  # WordNet data used by WordNetLemmatizer

stemmer = SnowballStemmer('english')  # English stemmer used in preprocessing

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/isabelsaludares/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## 4.2. Load stop words for Filipino

In [32]:
# stop words for Filipino
dff=pd.read_csv('./data/nlp/survey/stopwords_tl.txt',sep='\t',names=['stopword'],dtype=str)
stop_filipino=list(dff['stopword'].values)
stop_filipino

['akin',
 'aking',
 'ako',
 'alin',
 'am',
 'amin',
 'aming',
 'ang',
 'ano',
 'anumang',
 'apat',
 'at',
 'atin',
 'ating',
 'ay',
 'ba',
 'bababa',
 'bago',
 'bakit',
 'bawat',
 'bilang',
 'dahil',
 'dalawa',
 'dapat',
 'din',
 'dito',
 'doon',
 'gagawin',
 'gayunman',
 'ginagawa',
 'ginawa',
 'ginawang',
 'gumawa',
 'gusto',
 'habang',
 'hanggang',
 'hindi',
 'huwag',
 'iba',
 'ibaba',
 'ibabaw',
 'ibig',
 'ikaw',
 'ilagay',
 'ilalim',
 'ilan',
 'inyong',
 'isa',
 'isang',
 'itaas',
 'ito',
 'iyo',
 'iyon',
 'iyong',
 'ka',
 'kahit',
 'kailangan',
 'kailanman',
 'kami',
 'kanila',
 'kanilang',
 'kanino',
 'kanya',
 'kanyang',
 'kapag',
 'kapwa',
 'karamihan',
 'katiyakan',
 'katulad',
 'kay',
 'kaya',
 'kaysa',
 'ko',
 'kong',
 'kulang',
 'kumuha',
 'kung',
 'laban',
 'lahat',
 'lamang',
 'likod',
 'lima',
 'maaari',
 'maaaring',
 'maging',
 'mahusay',
 'makita',
 'marami',
 'marapat',
 'masyado',
 'may',
 'mayroon',
 'mga',
 'minsan',
 'mismo',
 'mula',
 'muli',
 'na',
 'nabanggit'

## 4.3. stop words = stop words for English + stop words for Filipino

In [33]:
stop_english = list(gensim.parsing.preprocessing.STOPWORDS)

In [34]:
stop_words = frozenset(stop_english + 
                     stop_filipino + 
                     ['us','none','n/a'])

In [36]:
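# preprocess texts
def lemmatize_stemming(text):
    # lemmatize the token as a verb, then reduce it to its stem
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    # tokenize and lowercase, drop stop words and tokens of 3 characters or fewer,
    # then lemmatize and stem what remains
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in stop_words and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result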

### 4.3.1 processed docs

Image taken from [Sejal Dua Medium Page](https://towardsdatascience.com/nlp-preprocessing-and-latent-dirichlet-allocation-lda-topic-modeling-with-gensim-713d516c6c7d)
<div>
<img src="topic_model_illus.png" width="600"/>
</div>

In [43]:
processed_docs = l_dataset['Comments'].map(preprocess)

## 4.4. Bag of Words on the Data set
We then create a dictionary from `processed_docs` that maps each unique token to an integer id and records how often each token occurs in the corpus. The preview below prints the first 16 id-token pairs.

In [44]:
dictionary = gensim.corpora.Dictionary(processed_docs)
count = 0
for k, v in dictionary.items():
    print(k, v)
    count += 1
    if count > 15:
        break

0 announc
1 import
2 bayanihan
3 center
4 doktor
5 initi
6 konsulta
7 launch
8 messag
9 miyerkul
10 natin
11 noong
12 overwhelm
13 pati
14 posit
15 respons


### Word Embeddings
- a word-representation technique that lets machine learning algorithms treat words with similar meanings as similar
- it maps words to vectors of real numbers
- the mapping can be computed with a neural network, a probabilistic model, or dimensionality reduction on a word co-occurrence matrix

### Techniques (some)
reference: [Rabeh Ayari Medium Page](https://towardsdatascience.com/nlp-embedding-techniques-51b7e6ec9f92)
#### 1. Bag-of-Words (BOW)

- text is represented as a bag (multiset) of its words
- grammar and word order are discarded; only the frequency of each word is kept
- the feature generated by bag-of-words is a vector of length n, where n is the number of words in the vocabulary of the input documents
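
As a rough illustration of the points above, a bag-of-words vector can be built with the standard library alone. The function name `bag_of_words` and the toy documents are assumptions made for this sketch; gensim's own representation appears later in the notebook.

```python
from collections import Counter

def bag_of_words(docs):
    """Return the sorted vocabulary and one count vector per tokenized document."""
    vocab = sorted({token for doc in docs for token in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Counter returns 0 for tokens that do not occur in this document
        vectors.append([counts[token] for token in vocab])
    return vocab, vectors

toy_docs = [["test", "swab", "test"], ["swab", "isol"]]  # hypothetical documents
vocab, vectors = bag_of_words(toy_docs)
print(vocab)    # ['isol', 'swab', 'test']
print(vectors)  # [[0, 1, 2], [1, 1, 0]]
```

Each vector has length n = 3, the size of the vocabulary, and word order within a document has no effect on the result.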

#### 2. Term Frequency-Inverse Document Frequency

- words are weighted by their TF-IDF score rather than by raw frequency alone
- a statistical measure of how important a word is to a document within a collection
    - every dataset has commonly used words that appear many times in a document but carry little information
    - a term's weight grows with the number of times it occurs in a document and shrinks with the number of documents in the corpus that contain it, so terms used almost exclusively in one document (or document category) receive higher weights
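
The weighting idea can be sketched with one common variant of the formula, weight = tf × log(N / df), where tf is the term's count in the document, N the number of documents, and df the number of documents containing the term. The function name and toy documents are assumptions for illustration; library implementations may differ in the logarithm base and in how the vectors are normalized.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {token: weight} dict per document, with weight = tf * log(N / df)."""
    n_docs = len(docs)
    # document frequency: in how many documents each token occurs
    df = Counter(token for doc in docs for token in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

toy_docs = [["test", "swab", "test"], ["test", "vaccin"]]  # hypothetical documents
for doc_weights in tf_idf(toy_docs):
    print(doc_weights)
```

In this toy corpus "test" occurs in every document, so its weight is 0 despite its high frequency, while "swab" and "vaccin" each occur in only one document and receive a positive weight.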

#### 3. Word2Vec

- one of the most efficient techniques for learning word embeddings from raw text
- a learned, predictive model: the words surrounding a word provide the context used to learn its vector
- places words in a multi-dimensional vector space where words with similar meanings tend to lie close to each other
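
"Close to each other" is usually measured with cosine similarity between the word vectors. The sketch below shows only that measurement, not Word2Vec training; the three-dimensional vectors are invented for illustration and do not come from any trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, invented for this example
embeddings = {
    "swab":    [0.9, 0.8, 0.1],
    "antigen": [0.8, 0.9, 0.2],
    "sunday":  [0.1, 0.2, 0.9],
}
print(cosine_similarity(embeddings["swab"], embeddings["antigen"]))  # close to 1
print(cosine_similarity(embeddings["swab"], embeddings["sunday"]))   # much lower
```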

#### 4. Doc2Vec

- creates an embedding of a document regardless of its length
- computes a feature vector for every document in the corpus
- the Doc2Vec model is based on Word2Vec; it only adds another vector (the paragraph ID) to the input

### 4.5. Gensim doc2bow
For each document we build a bag-of-words representation: a list of (token id, count) pairs recording which words appear and how many times each appears. <br>
We save this to `bow_corpus`.
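
To make the sparse (token id, count) format concrete, here is a standard-library sketch of the same idea. The helpers `build_token2id` and `to_sparse_bow` are hypothetical, and this simplified version assigns ids in order of first appearance, so the ids need not match those gensim assigns.

```python
from collections import Counter

def build_token2id(docs):
    """Assign an integer id to each new token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_sparse_bow(doc, token2id):
    """Return sorted (token_id, count) pairs; tokens missing from token2id are skipped."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

toy_docs = [["test", "swab", "test"], ["swab", "isol"]]  # hypothetical documents
token2id = build_token2id(toy_docs)
print(token2id)                              # {'test': 0, 'swab': 1, 'isol': 2}
print(to_sparse_bow(toy_docs[0], token2id))  # [(0, 2), (1, 1)]
```

Only tokens that actually occur in a document are stored, which keeps the representation small when the vocabulary is large.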

In [45]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

In [46]:
# sample output for the first document
bow_doc_0 = bow_corpus[0]
for token_id, count in bow_doc_0:
    print("Word {} (\"{}\") appears {} time.".format(token_id, dictionary[token_id], count))

Word 0 ("announc") appears 1 time.
Word 1 ("import") appears 1 time.


### 4.6. TF-IDF
Create a TF-IDF model object with `models.TfidfModel` on `bow_corpus` and save it to `tfidf`, then apply the transformation to the entire corpus and call the result `corpus_tfidf`. Finally, we preview the TF-IDF scores for the first document.

In [47]:
from gensim import corpora, models
tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]
from pprint import pprint
for doc in corpus_tfidf: # preview TF-IDF scores for the first document
    pprint(doc)
    break

[(0, 0.6569394759215659), (1, 0.7539433168189094)]


### 4.7. LDA using Bag of Words

#### 4.7.1. num of topics = 5

In [48]:
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=5, id2word=dictionary, passes=2, workers=2)

In [49]:
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.027*"test" + 0.020*"trace" + 0.014*"contact" + 0.013*"immedi" + 0.013*"posit" + 0.011*"antigen" + 0.010*"swab" + 0.008*"symptom" + 0.008*"join" + 0.008*"suggest"
Topic: 1 
Words: 0.017*"volunt" + 0.010*"natin" + 0.010*"sobrang" + 0.010*"overwhelm" + 0.009*"help" + 0.007*"medic" + 0.007*"request" + 0.007*"inspir" + 0.007*"amid" + 0.006*"lang"
Topic: 2 
Words: 0.036*"test" + 0.013*"posit" + 0.013*"covid" + 0.010*"need" + 0.009*"isol" + 0.008*"area" + 0.007*"number" + 0.007*"swab" + 0.007*"natin" + 0.006*"lang"
Topic: 3 
Words: 0.020*"isol" + 0.014*"vaccin" + 0.012*"test" + 0.011*"need" + 0.008*"work" + 0.007*"hope" + 0.006*"number" + 0.006*"help" + 0.006*"posit" + 0.006*"swab"
Topic: 4 
Words: 0.049*"test" + 0.018*"posit" + 0.014*"peopl" + 0.011*"malabon" + 0.010*"communiti" + 0.008*"health" + 0.007*"year" + 0.007*"isol" + 0.007*"natin" + 0.007*"thank"
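
For the analysis step it can help to turn the printed topic strings into (word, weight) pairs. The sketch below parses the `weight*"word" + ...` pattern visible in the output above; `parse_topic` is a hypothetical helper, and the sample string is shortened from Topic 0.

```python
import re

# Matches one term of the form 0.027*"test"
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic(topic_string):
    """Return a list of (word, weight) pairs parsed from a printed topic string."""
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic_string)]

topic_0 = '0.027*"test" + 0.020*"trace" + 0.014*"contact"'  # shortened from Topic 0
print(parse_topic(topic_0))
# [('test', 0.027), ('trace', 0.02), ('contact', 0.014)]
```

If the trained model object is still in memory, its `show_topic` method returns such pairs directly, so parsing is mainly useful when only the printed text is available.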


### 4.8. LDA using TF-IDF
#### 4.8.1. num of topics = 5

In [50]:
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, 
                                             num_topics=5, 
                                             id2word=dictionary, 
                                             passes=2, 
                                             workers=4)
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} Word: {}'.format(idx, topic))

Topic: 0 Word: 0.006*"announc" + 0.005*"medic" + 0.005*"test" + 0.004*"natin" + 0.004*"maram" + 0.004*"trace" + 0.004*"packag" + 0.004*"contact" + 0.004*"salamat" + 0.004*"immedi"
Topic: 1 Word: 0.012*"test" + 0.009*"posit" + 0.006*"hope" + 0.006*"vaccin" + 0.005*"target" + 0.005*"area" + 0.005*"tabl" + 0.005*"rate" + 0.005*"number" + 0.005*"case"
Topic: 2 Word: 0.005*"swab" + 0.005*"kayanatinph" + 0.005*"symptom" + 0.005*"cab" + 0.004*"display" + 0.004*"thank" + 0.004*"medic" + 0.004*"muna" + 0.004*"lang" + 0.004*"provid"
Topic: 3 Word: 0.007*"individu" + 0.007*"join" + 0.006*"test" + 0.006*"prevent" + 0.006*"page" + 0.005*"text" + 0.005*"reintegr" + 0.004*"ulit" + 0.004*"free" + 0.004*"maram"
Topic: 4 Word: 0.014*"test" + 0.012*"inocul" + 0.010*"isol" + 0.008*"trace" + 0.007*"sunday" + 0.007*"center" + 0.006*"immedi" + 0.006*"suggest" + 0.006*"antigen" + 0.005*"swab"


### 4.9. check for a specific doc

In [51]:
print(l_dataset[:10])

                                            Comments
0                             important announcement
1  ini-launch natin noong miyerkules ang bayaniha...
2  sobrang overwhelming ang support, at sobrang o...
3  hindi kami nagpapigil kahit ang daming limitas...
4  kaya, despite our reluctance to do so, tomorro...
5  isang araw lang po ito. kailangan lang po nami...
6  magreresume ulit tayo ng pagtanggap ng bagong ...
7  humihingi ako ng paumanhin at pang-unawa sa la...
8  gusto pa nating magpatuloy ang programa, at ma...
9  naiimagine ko nga—kung kami na nag-o-augment l...


#### 4.9.1. Say we choose the document with index 14

In [52]:
ind=14
for index, score in sorted(lda_model[bow_corpus[ind]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))


Score: 0.8661197423934937	 
Topic: 0.027*"test" + 0.020*"trace" + 0.014*"contact" + 0.013*"immedi" + 0.013*"posit" + 0.011*"antigen" + 0.010*"swab" + 0.008*"symptom" + 0.008*"join" + 0.008*"suggest"

Score: 0.033656250685453415	 
Topic: 0.020*"isol" + 0.014*"vaccin" + 0.012*"test" + 0.011*"need" + 0.008*"work" + 0.007*"hope" + 0.006*"number" + 0.006*"help" + 0.006*"posit" + 0.006*"swab"

Score: 0.03350177779793739	 
Topic: 0.049*"test" + 0.018*"posit" + 0.014*"peopl" + 0.011*"malabon" + 0.010*"communiti" + 0.008*"health" + 0.007*"year" + 0.007*"isol" + 0.007*"natin" + 0.007*"thank"

Score: 0.03338038921356201	 
Topic: 0.036*"test" + 0.013*"posit" + 0.013*"covid" + 0.010*"need" + 0.009*"isol" + 0.008*"area" + 0.007*"number" + 0.007*"swab" + 0.007*"natin" + 0.006*"lang"

Score: 0.03334185108542442	 
Topic: 0.017*"volunt" + 0.010*"natin" + 0.010*"sobrang" + 0.010*"overwhelm" + 0.009*"help" + 0.007*"medic" + 0.007*"request" + 0.007*"inspir" + 0.007*"amid" + 0.006*"lang"
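
The sorted listing above places the best-matching topic first. When only that topic is needed, it can be selected directly. In this sketch `dominant_topic` is a hypothetical helper; the scores are rounded from the output above, and the topic ids were matched by comparing the topic words with section 4.7.1.

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, score) pair with the highest score."""
    return max(doc_topics, key=lambda pair: pair[1])

# (topic_id, score) pairs for document 14, rounded from the output above
doc_topics = [(0, 0.866), (1, 0.0333), (2, 0.0334), (3, 0.0337), (4, 0.0335)]
print(dominant_topic(doc_topics))  # (0, 0.866)
```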


In [53]:
for index, score in sorted(lda_model_tfidf[bow_corpus[ind]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, 10)))


Score: 0.8642151951789856	 
Topic: 0.006*"announc" + 0.005*"medic" + 0.005*"test" + 0.004*"natin" + 0.004*"maram" + 0.004*"trace" + 0.004*"packag" + 0.004*"contact" + 0.004*"salamat" + 0.004*"immedi"

Score: 0.03490997850894928	 
Topic: 0.007*"individu" + 0.007*"join" + 0.006*"test" + 0.006*"prevent" + 0.006*"page" + 0.005*"text" + 0.005*"reintegr" + 0.004*"ulit" + 0.004*"free" + 0.004*"maram"

Score: 0.03392622247338295	 
Topic: 0.012*"test" + 0.009*"posit" + 0.006*"hope" + 0.006*"vaccin" + 0.005*"target" + 0.005*"area" + 0.005*"tabl" + 0.005*"rate" + 0.005*"number" + 0.005*"case"

Score: 0.03352729603648186	 
Topic: 0.014*"test" + 0.012*"inocul" + 0.010*"isol" + 0.008*"trace" + 0.007*"sunday" + 0.007*"center" + 0.006*"immedi" + 0.006*"suggest" + 0.006*"antigen" + 0.005*"swab"

Score: 0.033421337604522705	 
Topic: 0.005*"swab" + 0.005*"kayanatinph" + 0.005*"symptom" + 0.005*"cab" + 0.004*"display" + 0.004*"thank" + 0.004*"medic" + 0.004*"muna" + 0.004*"lang" + 0.004*"provid"


## 5. Save model

In [54]:
lda_model_tfidf.save('lda_tfidf.model')

OSError: [Errno 28] No space left on device

In [55]:
lda_loaded = gensim.models.LdaMulticore.load('lda_tfidf.model')
lda_loaded.show_topics()



AttributeError: 'NoneType' object has no attribute 'get_lambda'
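
The `OSError` above reports that the device had no free space, so the model file was probably not written completely, which would also explain why the subsequent load fails. A simple guard is to check the free space before saving. This is a sketch using only the standard library; `has_free_space` is a hypothetical helper, and the 50 MB threshold is an arbitrary assumption, not a measured model size.

```python
import shutil

def has_free_space(path=".", min_free_bytes=50 * 1024 * 1024):
    """Return True if the filesystem containing `path` has at least min_free_bytes free."""
    free = shutil.disk_usage(path).free
    return free >= min_free_bytes

if has_free_space("."):
    print("Enough free space to attempt saving the model.")
else:
    print("Free up disk space before saving the model.")
```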

## 6. Visualization using pyLDAvis

In [56]:
#!pip install pyLDAvis

In [57]:
#!python -m pip install -U pyLDAvis

import pyLDAvis
import pyLDAvis.gensim_models

pyLDAvis.enable_notebook()

viz = pyLDAvis.gensim_models.prepare(lda_model_tfidf, corpus_tfidf, dictionary)


FileNotFoundError: [Errno 2] No usable temporary directory found in ['/var/folders/ym/8xddr3cs4tzc0g3rm5n1rwhm0000gn/T/', '/tmp', '/var/tmp', '/usr/tmp', '/Users/isabelsaludares/Documents/Python']

In [58]:
pyLDAvis.display(viz)


NameError: name 'viz' is not defined

### 6.1. Save viz

In [59]:
pyLDAvis.save_html(viz, 'leni.html')


NameError: name 'viz' is not defined

In [None]:
isabel.saludares@neuralmechanics.net