Latent Dirichlet Allocation
===

Setup
----

In [1]:
import pandas as pd

scopus = pd.read_csv("https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/scopus-abstracts.csv")
scopus['Abstract'].head()

0    Mobility is one of the fundamental requirement...
1    The recent rise of the political extremism in ...
2    The power of the press to shape the informatio...
3    Identifying influential nodes in a network is ...
4    To complement traditional dietary surveys, whi...
Name: Abstract, dtype: object

Problem description
---

One of the main problems addressed in text mining is extracting the themes or topics that a document belongs to. For example, a news story could belong simultaneously to the topics of religion and economics (the scandal over the management of Vatican funds). Given a collection of documents, the goal is to extract the underlying topics that the documents discuss.

Scikit-learn includes an implementation of Latent Dirichlet Allocation, which makes it possible to extract the topics from a collection of documents. See https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html
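As a minimal sketch of that API, the example below fits a two-topic model on a tiny corpus. The corpus and the `toy_*` names are hypothetical and only illustrate the call sequence; they are not the Scopus data used in this notebook.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus mixing two rough themes (religion and economy).
toy_docs = [
    "the church manages funds and donations",
    "the bank reports funds and loan rates",
    "the pope leads the church in prayer",
    "loan rates affect the bank and the market",
]

# Bag-of-words counts are the input that LDA expects.
toy_vectorizer = CountVectorizer(stop_words="english")
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# fit_transform returns one row per document with its topic proportions.
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_doc_topics = toy_lda.fit_transform(toy_counts)

print(toy_doc_topics.round(2))
```

Each row of `toy_doc_topics` sums to 1, so a document can belong to several topics at once, as in the Vatican example.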

Use this method to extract the underlying topics in the article abstracts. Keep in mind that:

1. You must decide how to determine the appropriate number of topics.

2. You must remove stop words.

3. T-Lab suggests reducing the words to nouns, adjectives, verbs, and adverbs only. How could you do this in your code?

4. How could you check whether the number of topics is appropriate in terms of content (the words each topic contains and the subjects it covers)?


In [3]:
!pip install --upgrade scikit-learn
import numpy as np
import re, nltk, spacy, gensim
import warnings
import matplotlib.pyplot as plt
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from pprint import pprint

warnings.filterwarnings('ignore')
%matplotlib inline



In [4]:
df = scopus['Abstract'].values.tolist()
df = [re.sub(r'\S*@\S*\s?', '', sent) for sent in df]
df = [re.sub(r'\s+', ' ', sent) for sent in df]
df = [re.sub(r"\'", "", sent) for sent in df]
pprint(df[:1])

['Mobility is one of the fundamental requirements of human life with '
 'significant societal impacts including productivity, economy, social '
 'wellbeing, adaptation to a changing climate, and so on. Although human '
 'movements follow specific patterns during normal periods, there are limited '
 'studies on how such patterns change due to extreme events. To quantify the '
 'impacts of an extreme event to human movements, we introduce the concept of '
 'mobility resilience which is defined as the ability of a mobility system to '
 'manage shocks and return to a steady state in response to an extreme event. '
 'We present a method to detect extreme events from geo-located movement data '
 'and to measure mobility resilience and transient loss of resilience due to '
 'those events. Applying this method, we measure resilience metrics from '
 'geo-located social media data for multiple types of disasters occurred all '
 'over the world. Quantifying mobility resilience may help us to asse

In [5]:
def sent_to_words(oraciones):
  for oracion in oraciones:
    yield(gensim.utils.simple_preprocess(str(oracion), deacc=True))

data_words = list(sent_to_words(df))
print(data_words[:1])

[['mobility', 'is', 'one', 'of', 'the', 'fundamental', 'requirements', 'of', 'human', 'life', 'with', 'significant', 'societal', 'impacts', 'including', 'productivity', 'economy', 'social', 'wellbeing', 'adaptation', 'to', 'changing', 'climate', 'and', 'so', 'on', 'although', 'human', 'movements', 'follow', 'specific', 'patterns', 'during', 'normal', 'periods', 'there', 'are', 'limited', 'studies', 'on', 'how', 'such', 'patterns', 'change', 'due', 'to', 'extreme', 'events', 'to', 'quantify', 'the', 'impacts', 'of', 'an', 'extreme', 'event', 'to', 'human', 'movements', 'we', 'introduce', 'the', 'concept', 'of', 'mobility', 'resilience', 'which', 'is', 'defined', 'as', 'the', 'ability', 'of', 'mobility', 'system', 'to', 'manage', 'shocks', 'and', 'return', 'to', 'steady', 'state', 'in', 'response', 'to', 'an', 'extreme', 'event', 'we', 'present', 'method', 'to', 'detect', 'extreme', 'events', 'from', 'geo', 'located', 'movement', 'data', 'and', 'to', 'measure', 'mobility', 'resilience', 

In [12]:
from spacy.cli.download import download
download(model="en_core_web_sm")

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [13]:
def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
  texts_out = []
  for sent in texts:
    doc = nlp(" ".join(sent))
    texts_out.append(" ".join([token.lemma_ if token.lemma_ not in ['-PRON-'] else '' for token in doc if token.pos_ in allowed_postags]))
  return texts_out

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
data_lemmatized = lemmatization(data_words, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])
print(data_lemmatized[:2])

['mobility fundamental requirement human life significant societal impact include productivity economy social wellbeing adaptation change climate so human movement follow specific pattern normal period limited study such pattern change extreme event quantify impact extreme event human movement introduce concept mobility resilience define ability mobility system manage shock return steady state response extreme event present method detect extreme event locate movement datum measure mobility resilience transient loss resilience event apply method measure resilience metric locate social medium datum multiple type disaster occur all world quantify mobility resilience help assess high order socio economic impact extreme event guide policy develop resilient infrastructure as well nation overall disaster resilience strategy author', 'recent rise political extremism western country spur renew interest psychological moral appeal political extremism empirical support psychological explanation us

In [14]:
vectorizer = CountVectorizer(analyzer='word',
                             min_df=10,  # keep words that appear in at least 10 documents
                             stop_words='english',  # remove English stop words
                             lowercase=True,  # convert all words to lowercase
                             token_pattern='[a-zA-Z0-9]{3,}',  # tokens of at least 3 alphanumeric characters
                             # max_features=50000,  # maximum number of unique words
                            )
data_vectorized = vectorizer.fit_transform(data_lemmatized)
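To see how `min_df`, `stop_words`, and `token_pattern` shape the vocabulary, here is a small sketch on a hypothetical three-document corpus, with `min_df=2` instead of 10 so the effect is visible at this scale.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents (already lowercased and lemmatized).
toy_docs = [
    "stock price model",
    "stock price method",
    "price for an urban model",
]

toy_vectorizer = CountVectorizer(
    min_df=2,                         # word must appear in at least 2 documents
    stop_words="english",             # drops "for"
    token_pattern="[a-zA-Z0-9]{3,}",  # drops the 2-character token "an"
)
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each kept word to its column index.
toy_vocab = sorted(toy_vectorizer.vocabulary_)
print(toy_vocab)
print(toy_matrix.toarray())
```

Only "model", "price", and "stock" survive: "method" and "urban" each occur in a single document, so `min_df=2` removes them.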

In [15]:
terms = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
len(terms)

1922

In [16]:
lda_model = LatentDirichletAllocation(n_components=20,           
                                      max_iter=10,               
                                      learning_method='online',  
                                      random_state=100,          
                                      # batch_size=128,           
                                      evaluate_every = -1,       
                                      n_jobs = -1,               
                                     )

lda_output = lda_model.fit_transform(data_vectorized)
print(lda_model)

LatentDirichletAllocation(learning_method='online', n_components=20, n_jobs=-1,
                          random_state=100)


In [17]:
print("Perplexity: ", lda_model.perplexity(data_vectorized))
pprint(lda_model.get_params())

Perplexity:  898.8162130082914
{'batch_size': 128,
 'doc_topic_prior': None,
 'evaluate_every': -1,
 'learning_decay': 0.7,
 'learning_method': 'online',
 'learning_offset': 10.0,
 'max_doc_update_iter': 100,
 'max_iter': 10,
 'mean_change_tol': 0.001,
 'n_components': 20,
 'n_jobs': -1,
 'perp_tol': 0.1,
 'random_state': 100,
 'topic_word_prior': None,
 'total_samples': 1000000.0,
 'verbose': 0}
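In the scikit-learn releases this sketch was written against, `perplexity(X)` appears to be the exponential of the negative per-word value of the approximate log-likelihood bound returned by `score(X)`. The example below checks that relation on a random count matrix; `toy_counts` is a hypothetical stand-in for `data_vectorized`.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.RandomState(0)
# Hypothetical document-term counts: 30 documents, 12 terms.
toy_counts = rng.poisson(1.0, size=(30, 12))

toy_lda = LatentDirichletAllocation(n_components=3, random_state=0)
toy_lda.fit(toy_counts)

bound = toy_lda.score(toy_counts)            # approximate log-likelihood bound
perplexity = toy_lda.perplexity(toy_counts)  # lower is better
manual = np.exp(-bound / toy_counts.sum())   # per-word bound, exponentiated

print(perplexity, manual)
```

Because the two quantities are tied together this way, a higher log-likelihood on the same data always corresponds to a lower perplexity.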


In [18]:
search_params = {'n_components': [5, 15, 20], 'learning_decay': [.5, .7, .9]}

lda = LatentDirichletAllocation(max_iter=5, learning_method='online', learning_offset=50., random_state=0, n_jobs=-1)

model = GridSearchCV(lda, param_grid=search_params)

model.fit(data_vectorized)

GridSearchCV(estimator=LatentDirichletAllocation(learning_method='online',
                                                 learning_offset=50.0,
                                                 max_iter=5, n_jobs=-1,
                                                 random_state=0),
             param_grid={'learning_decay': [0.5, 0.7, 0.9],
                         'n_components': [5, 15, 20]})

In [19]:
best_lda_model = model.best_estimator_

print("Best Model's Params: ", model.best_params_)

print("Best Log Likelihood Score: ", model.best_score_)

print("Model Perplexity: ", best_lda_model.perplexity(data_vectorized))

Best Model's Params:  {'learning_decay': 0.9, 'n_components': 5}
Best Log Likelihood Score:  -221185.4868462772
Model Perplexity:  959.4663154284198
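The selected `n_components` (5) is the smallest value in the grid, which suggests trying a wider and finer range of topic counts. One possible way to do that, sketched here under assumptions, is to hold out part of the documents and compare perplexity across candidates. The matrix `toy_counts` and the candidate list are hypothetical; with the real data one would split `data_vectorized` instead.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
# Hypothetical document-term counts: 60 documents, 20 terms.
toy_counts = rng.poisson(1.0, size=(60, 20))

train, test = train_test_split(toy_counts, test_size=0.25, random_state=0)

candidate_topics = [2, 3, 5, 8]  # illustrative values only
heldout_perplexity = {}
for k in candidate_topics:
    lda_k = LatentDirichletAllocation(n_components=k, random_state=0)
    lda_k.fit(train)
    # Perplexity on unseen documents; lower is better.
    heldout_perplexity[k] = lda_k.perplexity(test)

best_k = min(heldout_perplexity, key=heldout_perplexity.get)
print(heldout_perplexity)
print("Lowest held-out perplexity at k =", best_k)
```

Perplexity alone often favors few topics, so it is best read together with an inspection of the topic keywords.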


In [21]:
lda_output = best_lda_model.transform(data_vectorized)
# column names
topicnames = ["Topic" + str(i) for i in range(best_lda_model.n_components)]
# index names
docnames = ["Doc" + str(i) for i in range(len(df))]
# Make the pandas dataframe
df_document_topic = pd.DataFrame(np.round(lda_output, 2), columns=topicnames, index=docnames)
# Get dominant topic for each document
dominant_topic = np.argmax(df_document_topic.values, axis=1)
df_document_topic['dominant_topic'] = dominant_topic
# Styling
def color_green(val):
    color = 'green' if val > .1 else 'black'
    return 'color: {col}'.format(col=color)
def make_bold(val):
    weight = 700 if val > .1 else 400
    return 'font-weight: {weight}'.format(weight=weight)
# Apply Style
df_document_topics = df_document_topic.head(15).style.applymap(color_green).applymap(make_bold)
df_document_topics

Document,Topic0,Topic1,Topic2,Topic3,Topic4,dominant_topic
Doc0,0.0,0.55,0.0,0.44,0.0,1
Doc1,0.0,0.4,0.0,0.33,0.26,1
Doc2,0.08,0.0,0.0,0.91,0.0,3
Doc3,0.0,0.0,0.05,0.95,0.0,3
Doc4,0.0,0.33,0.0,0.67,0.0,3
Doc5,0.0,0.14,0.0,0.5,0.36,3
Doc6,0.0,0.21,0.0,0.78,0.0,3
Doc7,0.0,0.72,0.0,0.27,0.0,1
Doc8,0.1,0.16,0.0,0.74,0.0,3
Doc9,0.0,0.0,0.0,0.7,0.29,3
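A quick way to judge whether the topics are balanced is to count how many documents each topic dominates. The sketch below reuses the first five rows of the table above as a small array; with the full model one would pass `lda_output` instead. The names `toy_doc_topic` and `topic_counts` are introduced here for illustration.

```python
import numpy as np

# First five rows of the document-topic table above (Doc0 to Doc4).
toy_doc_topic = np.array([
    [0.00, 0.55, 0.00, 0.44, 0.00],
    [0.00, 0.40, 0.00, 0.33, 0.26],
    [0.08, 0.00, 0.00, 0.91, 0.00],
    [0.00, 0.00, 0.05, 0.95, 0.00],
    [0.00, 0.33, 0.00, 0.67, 0.00],
])

# Index of the highest-weight topic per document.
dominant = toy_doc_topic.argmax(axis=1)

# Number of documents dominated by each topic (zeros included).
topic_counts = np.bincount(dominant, minlength=toy_doc_topic.shape[1])
print(topic_counts)
```

If most documents fall into one or two topics, as the rows shown here hint for Topic1 and Topic3, the remaining topics may be capturing only a handful of specialized abstracts.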


In [22]:
# Topic-Keyword Matrix
df_topic_keywords = pd.DataFrame(best_lda_model.components_)
# Assign Column and Index
df_topic_keywords.columns = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
df_topic_keywords.index = topicnames
# View
df_topic_keywords.head()

Topic,ability,able,absence,absolute,abstract,abstraction,academia,academic,accelerate,accept,...,worker,workflow,workload,world,worldwide,write,xml,year,yield,zone
Topic0,0.909355,0.43553,0.537895,0.369771,0.391856,0.402318,0.474055,0.395465,0.436576,0.402029,...,0.356935,0.42543,0.398075,0.77837,0.521665,0.506597,0.384371,1.425275,0.8132,0.999218
Topic1,50.411479,78.782597,8.718895,3.929591,10.831954,10.446974,7.528221,40.303384,7.054471,12.866006,...,11.993762,40.866757,18.457291,156.05923,13.371734,18.750486,26.929522,129.495695,29.909533,9.396929
Topic2,0.52223,0.773577,0.540997,0.375916,20.23758,0.778936,0.405473,0.482674,0.800995,1.110195,...,0.457531,0.821699,1.915001,6.55305,0.478576,0.964505,1.573274,0.941511,0.802976,0.559309
Topic3,11.912606,5.363951,4.184853,2.219846,0.445316,0.485597,0.625307,1.172708,5.207437,3.759635,...,5.827326,0.399765,0.439451,18.492856,6.728381,1.691497,0.351588,26.123051,7.090182,14.716786
Topic4,0.68787,0.777564,0.991189,15.018366,0.494508,0.383417,0.399504,0.449376,1.149956,0.589491,...,0.400032,0.466397,0.479599,4.335735,2.342602,0.585452,0.542717,20.53831,1.509716,0.420081


In [23]:
# Show top n keywords for each topic
def show_topics(vectorizer=vectorizer, lda_model=lda_model, n_words=20):
    keywords = np.array(vectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
    topic_keywords = []
    for topic_weights in lda_model.components_:
        top_keyword_locs = (-topic_weights).argsort()[:n_words]
        topic_keywords.append(keywords.take(top_keyword_locs))
    return topic_keywords
topic_keywords = show_topics(vectorizer=vectorizer, lda_model=best_lda_model, n_words=15)
# Topic - Keywords Dataframe
df_topic_keywords = pd.DataFrame(topic_keywords)
df_topic_keywords.columns = ['Word '+str(i) for i in range(df_topic_keywords.shape[1])]
df_topic_keywords.index = ['Topic '+str(i) for i in range(df_topic_keywords.shape[0])]
df_topic_keywords

Topic,Word 0,Word 1,Word 2,Word 3,Word 4,Word 5,Word 6,Word 7,Word 8,Word 9,Word 10,Word 11,Word 12,Word 13,Word 14
Topic 0,stock,investor,financial,opinion,probability,datum,news,home,price,regional,analyst,trade,behavior,code,research
Topic 1,datum,use,model,method,base,propose,paper,information,result,time,research,approach,analysis,big,study
Topic 2,graph,query,database,datum,method,file,core,available,edge,cube,large,abstract,repository,schema,propose
Topic 3,urban,use,model,study,network,city,economic,different,result,author,social,paper,analysis,area,customer
Topic 4,geomagnetic,observatory,variable,datum,magnetic,polar,observation,international,project,year,record,magnetometer,field,solar,satellite
