### The process
- We pick the number of topics ahead of time even if we’re not sure what the topics are.
- Each document is represented as a distribution over topics.
- Each topic is represented as a distribution over words.
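
The three points above can be sketched as the generative story that LDA assumes. This is a minimal illustration using only the standard library; the topic names, vocabularies, and probabilities are invented and are not taken from the dataset.

```python
import random

# Hypothetical topics: each one is a probability distribution over words.
topics = {
    "conflict": {"attack": 0.5, "military": 0.3, "force": 0.2},
    "economy": {"billion": 0.4, "market": 0.4, "trade": 0.2},
}

# Hypothetical document: a probability distribution over the topics.
doc_topic_dist = {"conflict": 0.7, "economy": 0.3}


def generate_document(n_words, rng):
    """Draw a topic for each word slot, then draw a word from that topic."""
    words = []
    for _ in range(n_words):
        topic = rng.choices(list(doc_topic_dist),
                            weights=list(doc_topic_dist.values()))[0]
        word_dist = topics[topic]
        word = rng.choices(list(word_dist),
                           weights=list(word_dist.values()))[0]
        words.append(word)
    return words


rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample_doc = generate_document(8, rng)
print(sample_doc)
```

Fitting LDA runs this story in reverse: given only the observed words, it estimates the two sets of distributions.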

### Load libraries and keep only the news items common to both datasets

In [1]:
%load_ext autotime

# Import libraries
import pandas as pd

pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 100)
pd.set_option('display.expand_frame_repr', True)

In [2]:
# Load the datasets
url_reddit = 'https://raw.githubusercontent.com/jjiguaran/text_mining/master/Data/RedditNews.csv'
url_combined = 'https://raw.githubusercontent.com/jjiguaran/text_mining/master/Data/Combined_News_DJIA.csv'
RedditNews = pd.read_csv(url_reddit)
CombinedNews = pd.read_csv(url_combined)


RedditNews['Date'] =  pd.to_datetime(RedditNews['Date'], format='%Y-%m-%d')
CombinedNews['Date'] =  pd.to_datetime(CombinedNews['Date'], format='%Y-%m-%d')


time: 4.62 s


In [3]:
## Keep only the dates that appear in the labeled dataset
RedditNews = RedditNews[RedditNews['Date'].isin(CombinedNews['Date'])]

display(
    CombinedNews['Date'].nunique(),
    RedditNews['Date'].nunique() )

1989

1989

time: 41 ms


In [4]:
## Special characters are mis-encoded; I found this one, but we still need to check which others appear
RedditNews['News'] = RedditNews['News'].str.replace('&amp;', '&')

time: 154 ms
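
The `&amp;` replacement above handles a single HTML entity. As a sketch of a more general fix, the standard library's `html.unescape` decodes named and numeric entities in one pass. The sample headlines are invented, and whether the dataset actually contains other entities is an assumption to verify.

```python
import html

# Made-up headlines containing several kinds of HTML entities.
samples = [
    "AT&amp;T fined over data breach",
    "Minister: &quot;no comment&quot;",
    "Profits &gt; expectations",
]

decoded = [html.unescape(s) for s in samples]
for line in decoded:
    print(line)
```

On the notebook's DataFrame this could be applied with `RedditNews['News'].apply(html.unescape)`.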


In [5]:
## Some headlines still carry a bytes-literal prefix (b" or b'); keep their index for review and strip the prefix
index_review = RedditNews[(RedditNews['News'].str.startswith('b"')) |
                         (RedditNews['News'].str.startswith("b'"))].index

display(RedditNews[RedditNews['News'].str.startswith('b"')].head(),
        RedditNews[RedditNews['News'].str.startswith("b'")].head())


RedditNews['News'] = RedditNews['News'].str.replace('^b\"', " ", regex=True)
RedditNews['News'] = RedditNews['News'].str.replace("^b\'", " ", regex=True)

Unnamed: 0,Date,News
54798,2010-06-30,"b""South Korea's parliament votes to legalize c..."
54804,2010-06-30,"b""The German economy is rapidly improving, wit..."
54819,2010-06-30,"b""BBC News - Russian spy suspect missing in Cy..."
54821,2010-06-30,"b""Iraq inquiry: secret documents showing Tony ..."
54822,2010-06-30,"b""Apartheid loves apartheid: Israel's secret r..."


Unnamed: 0,Date,News
54799,2010-06-30,b'Pope rebukes cardinal who exposed abuse: \nP...
54800,2010-06-30,b'This depression is similar to the Great Pani...
54801,2010-06-30,b'The Niger Delta has experienced oil spills o...
54802,2010-06-30,b'G20 Toronto - So Black Block get green light...
54803,2010-06-30,b'Half of Afghanistans 476 women prisoners wer...


time: 586 ms


In [6]:
RedditNews[RedditNews.index.isin(index_review)].head(10)

Unnamed: 0,Date,News
54798,2010-06-30,South Korea's parliament votes to legalize ch...
54799,2010-06-30,Pope rebukes cardinal who exposed abuse: \nPu...
54800,2010-06-30,This depression is similar to the Great Panic...
54801,2010-06-30,The Niger Delta has experienced oil spills on...
54802,2010-06-30,G20 Toronto - So Black Block get green light ...
54803,2010-06-30,Half of Afghanistans 476 women prisoners were...
54804,2010-06-30,"The German economy is rapidly improving, with..."
54805,2010-06-30,Two Egyptian police officers charged with bru...
54806,2010-06-30,Russia sending armoured vehicles to West Bank'
54807,2010-06-30,Police Lied About Law Demanding G20 Protester...


time: 22 ms
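
The output above suggests that the closing quote of the bytes literal is still attached to some headlines (see row 54806). One possible refinement, sketched here on invented strings, removes the prefix and the matching closing quote together. The helper name `strip_bytes_wrapper` is hypothetical.

```python
import re

# Matches b'...' or b"..." spanning the whole string;
# \1 requires the closing quote to be the same character as the opening one.
BYTES_WRAPPER_RE = re.compile(r"""^b(['"])(.*)\1$""", re.DOTALL)


def strip_bytes_wrapper(text):
    """Return the text inside a b'...' / b"..." wrapper, or the text unchanged."""
    match = BYTES_WRAPPER_RE.match(text)
    return match.group(2) if match else text


print(strip_bytes_wrapper("b'Russia sending armoured vehicles to West Bank'"))
print(strip_bytes_wrapper('b"South Korea\'s parliament votes"'))
print(strip_bytes_wrapper("breaking news stays untouched"))
```

Headlines that merely begin with the letter "b" are left alone because the pattern requires a quote right after it.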


In [7]:
## A basic cleaning function; it is only a starting point that we can extend and adapt to our use case later

import re

REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;-]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')

def clean_text(text):
    """
        text: a string
        
        return: modified initial string
    """
    text = text.lower() # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text) # replace REPLACE_BY_SPACE_RE symbols by space in text
    text = BAD_SYMBOLS_RE.sub(' ', text) # replace symbols matched by BAD_SYMBOLS_RE with a space
    return text
    
RedditNews['News'] = RedditNews['News'].apply(clean_text)
RedditNews

Unnamed: 0,Date,News
0,2016-07-01,a 117 year old woman in mexico city finally re...
1,2016-07-01,imf chief backs athens as permanent olympic host
2,2016-07-01,the president of france says if brexit won so...
3,2016-07-01,british man who must give police 24 hours not...
4,2016-07-01,100+ nobel laureates urge greenpeace to stop o...
5,2016-07-01,brazil huge spike in number of police killing...
6,2016-07-01,austria s highest court annuls presidential el...
7,2016-07-01,facebook wins privacy case can track any belg...
8,2016-07-01,switzerland denies muslim girls citizenship af...
9,2016-07-01,china kills millions of innocent meditators fo...


time: 362 ms
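
To see what the cleaning step does to punctuation, the sketch below repeats the two patterns from the cell above and applies them to an invented headline.

```python
import re

# Same patterns as in the notebook cell above.
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;-]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')


def clean_text(text):
    """Lowercase the text and replace unwanted symbols with spaces."""
    text = text.lower()
    text = REPLACE_BY_SPACE_RE.sub(' ', text)
    text = BAD_SYMBOLS_RE.sub(' ', text)
    return text


sample = "U.S.-led (NATO) forces: 100+ killed"  # invented headline
cleaned = clean_text(sample)
print(cleaned.split())
```

Note that periods are replaced by spaces, so an abbreviation such as "U.S." ends up as the single letters "u" and "s", which the later length filter discards.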


### Text Cleaning
We use the following function to clean our texts and return a list of tokens:

In [8]:
import spacy
spacy.load('en_core_web_sm')  # the loaded model is not used below
from spacy.lang.en import English
parser = English()  # blank English pipeline; used here only for tokenization

def tokenize(text):
    lda_tokens = []
    tokens = parser(text)
    for token in tokens:
        if token.orth_.isspace():
            continue
        elif token.like_url:
            lda_tokens.append('URL')
        elif token.orth_.startswith('@'):
            lda_tokens.append('SCREEN_NAME')
        else:
            lda_tokens.append(token.lower_)
    return lda_tokens

time: 6.63 s


We use NLTK’s WordNet to find the meanings of words, synonyms, antonyms, and more. In addition, we use WordNetLemmatizer to get the root word.

In [9]:
import nltk
# nltk.download('wordnet')

time: 9.48 s


In [10]:
from nltk.corpus import wordnet as wn

## Lemmatizer
def get_lemma(word):
    lemma = wn.morphy(word)
    if lemma is None:
        return word
    else:
        return lemma
    
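from nltk.stem.wordnet import WordNetLemmatizer
def get_lemma2(word):
    # lemmatize() treats the word as a noun by default (pos='n'),
    # so verb forms such as 'ran' come back unchanged
    return WordNetLemmatizer().lemmatize(word)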

time: 2 ms


In [11]:
for w in ['dogs', 'ran', 'discouraged']:
    print(w, get_lemma(w), get_lemma2(w))

dogs dog dog
ran run ran
discouraged discourage discouraged
time: 2.7 s


Filter stop words

In [12]:
# nltk.download('stopwords')
en_stop = set(nltk.corpus.stopwords.words('english'))

time: 6 ms


Now we can define a function to prepare the text for topic modelling:

In [13]:
def prepare_text_for_lda(text):
    tokens = tokenize(text)
    tokens = [token for token in tokens if len(token) > 4]
    tokens = [token for token in tokens if token not in en_stop]
    tokens = [get_lemma(token) for token in tokens]
    return tokens

time: 102 ms
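
The `len(token) > 4` filter is the most aggressive step in this function. The sketch below shows its effect on one headline. It is a simplified stand-in, not the notebook's pipeline: whitespace splitting replaces spaCy, a tiny hypothetical stop-word set replaces NLTK's list, and lemmatization is omitted.

```python
# Tiny stand-in for NLTK's English stop-word list (assumption for illustration).
toy_stop_words = {"about", "there", "their", "which"}


def simple_prepare(text):
    """Whitespace-tokenize, keep tokens longer than 4 characters, drop stop words."""
    tokens = text.lower().split()
    tokens = [t for t in tokens if len(t) > 4]
    tokens = [t for t in tokens if t not in toy_stop_words]
    return tokens


kept = simple_prepare("Israel inquiry finds Gaza aid flotilla raid was legal")
print(kept)
```

Short but informative words such as "gaza" and "raid" are dropped by the length threshold, which may be worth revisiting for news headlines.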


Iterate over the headlines, prepare each one for LDA, and append the resulting tokens to a list.

In [14]:
text_data = []
            
for row in RedditNews['News'].values:
    tokens = prepare_text_for_lda(row)
    text_data.append(tokens)

time: 52.2 s


### LDA with Gensim
First, we create a dictionary from the data, then convert the documents to a bag-of-words corpus, and finally save the dictionary and corpus for future use.

In [15]:
from gensim import corpora
## Dictionary encapsulates the mapping between normalized words and their integer ids.
dictionary = corpora.Dictionary(text_data)

time: 7.97 s


In [16]:
## Converts a collection of words to its bag-of-words representation: a list of (word_id, word_frequency) 2-tuples
corpus = [dictionary.doc2bow(text) for text in text_data]

time: 984 ms
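
The dictionary and bag-of-words steps can be illustrated without gensim. The sketch below uses plain dictionaries and `collections.Counter` on two invented documents. The names `token2id` and `to_bow` are hypothetical, and gensim's own id assignment may differ from the first-appearance order used here.

```python
from collections import Counter

# Two invented, already-tokenized documents.
toy_docs = [
    ["israel", "inquiry", "flotilla", "israel"],
    ["china", "inquiry", "court"],
]

# Assign an integer id to each distinct token, in order of first appearance.
token2id = {}
for doc in toy_docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))


def to_bow(doc):
    """Return sorted (token_id, count) pairs, ignoring tokens without an id."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


toy_corpus = [to_bow(doc) for doc in toy_docs]
print(token2id)
print(toy_corpus)
```

Word order is lost in this representation; only the counts per token id remain, which is all LDA needs.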


In [17]:
import pickle
# Persist the corpus and dictionary so later cells can reload them
with open('corpus.pkl', 'wb') as f:
    pickle.dump(corpus, f)
dictionary.save('dictionary.gensim')

time: 240 ms


### Try 13 topics

In [18]:
import gensim
NUM_TOPICS = 13

ldamodel = gensim.models.ldamodel.LdaModel(corpus, 
                                           num_topics = NUM_TOPICS, 
                                           id2word=dictionary, 
                                           passes=15, # how many times the algorithm is supposed to pass over the whole corpus
                                           random_state = 10291, 
                                           chunksize = 2000 # number of documents to consider at once (affects the memory consumption)
                                          )
ldamodel.save('model10.gensim')  # note: despite the file name, this model has 13 topics

time: 3min 56s


In [19]:
topics = ldamodel.print_topics(num_words=5)
for topic in topics:
    print(topic)

(0, '0.024*"found" + 0.022*"court" + 0.017*"indian" + 0.016*"mexico" + 0.013*"japan"')
(1, '0.025*"internet" + 0.023*"wikileaks" + 0.019*"secret" + 0.017*"government" + 0.014*"break"')
(2, '0.054*"russia" + 0.037*"russian" + 0.032*"afghanistan" + 0.027*"military" + 0.025*"force"')
(3, '0.079*"china" + 0.034*"chinese" + 0.020*"pirate" + 0.011*"history" + 0.011*"world"')
(4, '0.061*"right" + 0.034*"human" + 0.026*"german" + 0.015*"international" + 0.015*"european"')
(5, '0.036*"north" + 0.036*"korea" + 0.035*"nuclear" + 0.033*"minister" + 0.021*"power"')
(6, '0.051*"woman" + 0.050*"child" + 0.033*"death" + 0.022*"afghan" + 0.019*"school"')
(7, '0.028*"south" + 0.023*"president" + 0.021*"africa" + 0.019*"build" + 0.017*"house"')
(8, '0.029*"protest" + 0.026*"people" + 0.022*"canada" + 0.020*"thousand" + 0.016*"britain"')
(9, '0.082*"israel" + 0.060*"israeli" + 0.044*"attack" + 0.035*"kill" + 0.033*"palestinian"')
(10, '0.054*"world" + 0.033*"million" + 0.021*"years" + 0.020*"billion" + 0.

In [20]:
new_doc = 'Israel inquiry finds Gaza aid flotilla raid was legal'
new_doc = prepare_text_for_lda(new_doc)
new_doc_bow = dictionary.doc2bow(new_doc)
print(new_doc_bow)
print(ldamodel.get_document_topics(new_doc_bow))

[(79, 1), (750, 1), (964, 1), (1375, 1), (8616, 1)]
[(0, 0.17948857), (1, 0.012821766), (2, 0.012821766), (3, 0.012821766), (4, 0.012821897), (5, 0.012821766), (6, 0.012821874), (7, 0.012821766), (8, 0.012821813), (9, 0.51282966), (10, 0.1794638), (11, 0.012821821), (12, 0.012821766)]
time: 13 ms
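
The list returned by `get_document_topics` is a probability distribution over the 13 topics. A short sketch of how it might be read, using the values printed above (rounded to four decimals) and an arbitrary 0.1 threshold chosen only for illustration:

```python
# Topic distribution printed above for the flotilla headline (values rounded).
doc_topics = [
    (0, 0.1795), (1, 0.0128), (2, 0.0128), (3, 0.0128), (4, 0.0128),
    (5, 0.0128), (6, 0.0128), (7, 0.0128), (8, 0.0128), (9, 0.5128),
    (10, 0.1795), (11, 0.0128), (12, 0.0128),
]

# Most probable topic, and every topic above the illustrative threshold.
dominant_topic, dominant_prob = max(doc_topics, key=lambda pair: pair[1])
notable = [topic for topic, prob in doc_topics if prob > 0.1]

print(dominant_topic, dominant_prob)
print(notable)
```

Topic 9 dominates, which matches the topic whose top words in the previous cell are "israel", "israeli", "attack", "kill" and "palestinian".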


### pyLDAvis

pyLDAvis is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization.

The left panel visualises the topics as circles in a two-dimensional plane. Their centres are determined by computing the Jensen–Shannon divergence between topics and then using multidimensional scaling to project the inter-topic distances onto two dimensions. Each topic’s overall prevalence is encoded by the area of its circle.
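
As a rough illustration of the distance used for the left panel, the sketch below computes the Jensen–Shannon divergence between two invented topic-word distributions with the standard library. It is not pyLDAvis's internal code, only the textbook definition.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Invented topic-word distributions over a four-word vocabulary.
topic_a = [0.7, 0.2, 0.1, 0.0]
topic_b = [0.1, 0.2, 0.3, 0.4]

print(js_divergence(topic_a, topic_a))
print(js_divergence(topic_a, topic_b))
```

Unlike the plain KL divergence, this measure is symmetric and bounded by log 2, which makes it usable as an inter-topic distance.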

The right panel depicts a horizontal bar chart whose bars represent the individual terms that are the most useful for interpreting the currently selected topic on the left. A pair of overlaid bars represents both the corpus-wide frequency of a given term and its topic-specific frequency.

The λ slider ranks the terms by relevance. By default, the terms of a topic are ranked in decreasing order of their topic-specific probability (λ = 1). Moving the slider adjusts the ranking according to how discriminatory (or "relevant") the terms are for the specific topic. The suggested “optimal” value of λ is 0.6.

Relevance is defined, for 0 ≤ λ ≤ 1, as λ log(p(term | topic)) + (1 - λ) log(p(term | topic)/p(term)). The red bars represent the frequency of a term in a given topic (proportional to p(term | topic)), and the blue bars represent a term's frequency across the entire corpus (proportional to p(term)). Change the value of λ to adjust the term rankings: small values of λ (near 0) highlight potentially rare but exclusive terms for the selected topic, and large values of λ (near 1) highlight frequent, but not necessarily exclusive, terms for the selected topic. A user study described in the LDAvis paper suggested that setting λ near 0.6 aids users in topic interpretation, although this is expected to vary across topics and data sets (hence the tool lets you adjust λ flexibly).

- Saliency: a measure of how much the term tells you about the topics overall.
- Relevance: a weighted average of the log probability of the word given the topic and the log of that probability normalized by the word's overall probability in the corpus.
- The size of each bubble measures the prevalence of the topic in the data.
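
The relevance formula above can be tried out on a few numbers. In this sketch the probabilities are invented and the helper names `relevance` and `rank_terms` are hypothetical; it only shows how the ranking changes with λ.

```python
import math


def relevance(p_term_given_topic, p_term, lam):
    """lam * log p(term|topic) + (1 - lam) * log(p(term|topic) / p(term))."""
    return (lam * math.log(p_term_given_topic)
            + (1 - lam) * math.log(p_term_given_topic / p_term))


# Invented probabilities: (p(term | topic), p(term)) for three terms.
terms = {
    "world": (0.050, 0.040),     # frequent in the topic and everywhere else
    "flotilla": (0.010, 0.001),  # rare overall but concentrated in the topic
    "attack": (0.030, 0.010),
}


def rank_terms(lam):
    """Terms sorted from most to least relevant for the given lambda."""
    return sorted(terms, key=lambda t: relevance(*terms[t], lam), reverse=True)


for lam in (1.0, 0.6, 0.0):
    print(lam, rank_terms(lam))
```

With λ = 1 the common word "world" ranks first, while with λ = 0 the rare but topic-specific "flotilla" moves to the top.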

In [23]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import pyLDAvis.gensim  # renamed to pyLDAvis.gensim_models in pyLDAvis 3.3 and later

lda10 = gensim.models.ldamodel.LdaModel.load('model10.gensim')
lda_display10 = pyLDAvis.gensim.prepare(lda10, corpus, dictionary)
pyLDAvis.display(lda_display10)

time: 13min 7s


### Other experiments

In [43]:
import gensim
NUM_TOPICS = 5
ldamodel = gensim.models.ldamodel.LdaModel(corpus, 
                                           num_topics = NUM_TOPICS, 
                                           id2word=dictionary, 
                                           passes=15, # how many times the algorithm is supposed to pass over the whole corpus
                                           random_state = 10291, 
                                           chunksize = 2000 # number of documents to consider at once (affects the memory consumption)
                                          )
ldamodel.save('model5.gensim')

In [45]:
new_doc = 'Practical Bayesian Optimization of Machine Learning Algorithms'
new_doc = prepare_text_for_lda(new_doc)
new_doc_bow = dictionary.doc2bow(new_doc)
print(new_doc_bow)
print(ldamodel.get_document_topics(new_doc_bow))

[(2012, 1), (6758, 1), (8852, 1), (10200, 1)]
[(0, 0.58071643), (1, 0.045033406), (2, 0.045051895), (3, 0.045068514), (4, 0.2841298)]


In [48]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = 3, id2word=dictionary, passes=15)
ldamodel.save('model3.gensim')
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)

(0, '0.015*"israel" + 0.011*"israeli" + 0.008*"attack" + 0.008*"president"')
(1, '0.013*"police" + 0.010*"woman" + 0.010*"child" + 0.009*"death"')
(2, '0.015*"world" + 0.010*"china" + 0.008*"country" + 0.007*"government"')


In [46]:
dictionary = gensim.corpora.Dictionary.load('dictionary.gensim')
with open('corpus.pkl', 'rb') as f:
    corpus = pickle.load(f)
lda = gensim.models.ldamodel.LdaModel.load('model5.gensim')

In [47]:
import pyLDAvis.gensim
lda_display = pyLDAvis.gensim.prepare(lda, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)



In [50]:
lda3 = gensim.models.ldamodel.LdaModel.load('model3.gensim')
lda_display3 = pyLDAvis.gensim.prepare(lda3, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display3)

