# Text Mining

## Contents <a id=ov>
1. [Data Set](#data)
2. [Wordbook sentiment](#wordbook)
3. [Topic Modeling](#lda)
4. [spaCy](#spacy)





## Data Set <a id=data>
[Back to Content Overview](#ov)

The data set consists of all speeches by high-ranking ECB representatives (https://www.ecb.europa.eu/press/key/html/downloads.en.html).

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pickle
from tqdm import tqdm

In [None]:
# Import the data from a pipe-separated CSV file
df=pd.read_csv('all_ECB_speeches.csv',sep='|')
print(df)

In [None]:
# Take a small sample to test your code efficiently.
df=df.sample(frac=0.1)

In [None]:
# Change the index to apply time series methods
df.index=pd.to_datetime(df['date'])
print(df)

In [None]:
print(df.resample('M').count())

In [None]:
# Plot the number of speeches per month
plt.plot(df.resample('M').count()['date'])
plt.title('Speeches per Month')

## Wordbook sentiment <a id=wordbook>
[Back to Content Overview](#ov)

The easiest way to measure the sentiment of texts is to count prelabeled keywords:

In [None]:
# Load the wordbook file (the context manager closes the file afterwards)
with open('newwordbook.p','rb') as f:
    word_book=pickle.load(f)
print(word_book)

<span style="color:blue"><b>Task:</b></span> Convert this word_book object into a more useful data structure!

<span style="color:blue"><b>Task:</b></span> Count the 'negative', 'positive', 'uncertainty' words in the texts and save the sum of hits in separate columns!
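
As a rough illustration of the counting step, the sketch below works on a tiny made-up word book and two made-up sentences. The dictionary layout (category name mapped to a set of lowercase words), the helper `count_hits`, and the example words are assumptions made for illustration; the real structure of `newwordbook.p` may differ.

```python
import re
from collections import Counter

# Hypothetical word book: category -> set of lowercase keywords (assumed layout)
toy_word_book = {
    "negative": {"crisis", "weak", "decline"},
    "positive": {"growth", "strong", "improve"},
    "uncertainty": {"risk", "uncertain", "may"},
}

def count_hits(text, word_book):
    """Return the number of keyword hits per category in one text."""
    # Lowercase the text and count every alphabetic token once
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    return {cat: sum(tokens[w] for w in words) for cat, words in word_book.items()}

texts = [
    "Growth is strong, but the risk of a decline remains.",
    "The crisis may weaken growth; the outlook is uncertain.",
]
hits = [count_hits(t, toy_word_book) for t in texts]
print(hits)
```

Note that exact matching does not catch inflected forms such as "weaken" for "weak". On a DataFrame, a function like this could be applied to each text with `df['contents'].apply(...)` and the resulting dictionaries expanded into columns.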

In [None]:
# Aggregate the data monthly (numeric columns only; text columns cannot be summed meaningfully)
df_m=df.resample('M').sum(numeric_only=True)
print(df_m)

<span style="color:blue"><b>Task:</b></span> Calculate the monthly tone. (``TONE = (#POS - #NEG) / (#POS + #NEG)``)
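
To make the formula concrete, here is a small worked example on made-up counts. The function name `tone` and the choice to return `None` when there are no hits at all are assumptions of this sketch, not part of the original material.

```python
def tone(n_pos, n_neg):
    """TONE = (#POS - #NEG) / (#POS + #NEG); returns None if there are no hits."""
    total = n_pos + n_neg
    if total == 0:
        return None  # tone is undefined without any positive or negative hits
    return (n_pos - n_neg) / total

print(tone(30, 10))  # 0.5
print(tone(10, 30))  # -0.5
print(tone(0, 0))    # None
```

The tone is bounded between -1 (only negative hits) and +1 (only positive hits).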

<span style="color:blue"><b>Task:</b></span> Plot the monthly tone.

<span style="color:blue"><b>Task:</b></span> Plot the 12-month rolling window mean of the monthly tone.
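
The mechanics of a 12-month rolling mean can be sketched on a made-up monthly series. The series `toy_tone` and its values are purely illustrative; with real data, the monthly tone computed above would take its place.

```python
import pandas as pd

# Made-up monthly series: 14 months with the values 0, 1, ..., 13
months = pd.period_range("2020-01", periods=14, freq="M")
toy_tone = pd.Series(range(14), index=months, dtype=float)

# Mean over the current month and the 11 preceding months;
# the first 11 entries are NaN because the window is not yet full
rolling_tone = toy_tone.rolling(window=12).mean()
print(rolling_tone)
```

Passing `min_periods` to `rolling` would yield values for the first months as well, at the cost of averaging over fewer observations.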

## Topic Modeling <a id=lda>
[Back to Content Overview](#ov)

### Document-Term Matrix
The document-term matrix (DTM) has the dimensions D x V, where D is the number of documents and V is the size of the vocabulary (the number of unique words in the corpus).
It stores the count of every word in every document and is usually very sparse.
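
A toy example may help to see the structure. The three sentences below are made up, and the names `toy_docs`, `toy_cv` and `toy_dtm` are only used in this sketch.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up documents (D = 3)
toy_docs = [
    "inflation is rising",
    "inflation is stable",
    "rates are rising",
]

toy_cv = CountVectorizer()
toy_dtm = toy_cv.fit_transform(toy_docs)  # sparse matrix of shape (D, V)

# Vocabulary in column order (columns are sorted alphabetically)
toy_vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(toy_dtm.shape)
print(toy_vocab)
print(toy_dtm.toarray())  # dense view, only sensible for tiny examples
```

Each row is a document and each column a word. Most entries are zero even in this tiny example, which is why the matrix is stored in a sparse format.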

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
cv = CountVectorizer(min_df=2)
df=df.dropna(subset=['contents'])
dtm = cv.fit_transform(df['contents'])
print('DTM created')

In [None]:
print(dtm)

In [None]:
print(cv.get_feature_names_out())

### Estimate LDA Model

The most frequently used topic model is Latent Dirichlet Allocation (LDA).

#### The algorithm
See `lda_gibbs.py` for the algorithm.
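
For readers without the file at hand, the following is a generic sketch of collapsed Gibbs sampling for LDA on toy data. It is not the content of `lda_gibbs.py`; the function name `lda_gibbs_sketch`, the hyperparameter values and the toy documents are assumptions of this sketch. Note also that scikit-learn's `LatentDirichletAllocation` is fit with variational Bayes rather than Gibbs sampling.

```python
import numpy as np

def lda_gibbs_sketch(docs, n_topics, vocab_size, n_iter=50, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), n_topics))   # topic counts per document
    n_kw = np.zeros((n_topics, vocab_size))  # word counts per topic
    n_k = np.zeros(n_topics)                 # total number of words per topic
    # Assign a random initial topic to every word position
    z = [rng.integers(n_topics, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for w, k in zip(doc, z[d]):
            n_dk[d, k] += 1
            n_kw[k, w] += 1
            n_k[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts
                k = z[d][i]
                n_dk[d, k] -= 1
                n_kw[k, w] -= 1
                n_k[k] -= 1
                # Conditional distribution over topics for this word position
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + vocab_size * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                # Store the new assignment and update the counts
                z[d][i] = k
                n_dk[d, k] += 1
                n_kw[k, w] += 1
                n_k[k] += 1
    # Point estimates of document-topic and topic-word distributions
    theta = (n_dk + alpha) / (n_dk + alpha).sum(axis=1, keepdims=True)
    phi = (n_kw + beta) / (n_kw + beta).sum(axis=1, keepdims=True)
    return theta, phi

# Toy corpus: word ids 0-2 and 3-5 never occur in the same document
toy_docs = [[0, 1, 2, 0, 1], [0, 2, 1, 1], [3, 4, 5, 3], [4, 5, 5, 3]]
theta, phi = lda_gibbs_sketch(toy_docs, n_topics=2, vocab_size=6)
print(theta.round(2))
print(phi.round(2))
```

The nested Python loops make this far too slow for the full corpus; the sketch is only meant to show the sampling step.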

In [None]:
from sklearn.decomposition import LatentDirichletAllocation

In [None]:
K=12
LDA = LatentDirichletAllocation(n_components=K,n_jobs=-1,max_iter=100,verbose=1)
LDA.fit(dtm)

### Word Clouds

In [None]:
max_words=100
voc=cv.get_feature_names_out()
print(voc)

In [None]:
import sys
!{sys.executable} -m pip install wordcloud
from wordcloud import WordCloud

In [None]:
# Create Word Clouds
for t, topic in enumerate(LDA.components_):
    top_word_dict={voc[index]:topic[index] for index in np.argsort(-topic)[:max_words]}
    print({word: round(value,2) for word, value in top_word_dict.items()})
    wordcloud = WordCloud(max_words=max_words,
                          background_color="white",
                          collocations=False,
                          width=1920,
                          height=1080).generate_from_frequencies(top_word_dict)
    wordcloud.to_file("wordcloud_topic_"+str(t)+".pdf")
    plt.title("topic_"+str(t))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.show()

## spaCy <a id=spacy>
[Back to Content Overview](#ov)

spaCy is a powerful NLP library based on pre-trained language models.

In [None]:
import sys
!{sys.executable} -m pip install spacy
!{sys.executable} -m spacy download en_core_web_sm
import spacy

In [None]:
# Define your language processing model
nlp = spacy.load("en_core_web_sm")
print(nlp)

In [None]:
print(df['contents'].iloc[0][:1000])

In [None]:
# Define the document object (.iloc selects by position, as the index is now a DatetimeIndex)
doc=nlp(df['contents'].iloc[0])

In [None]:
#Iterate over sentences
for sent in doc.sents:
    print(sent)
    print('\n'*3)

In [None]:
#Iterate over tokens
for token in doc[:25]:
    print((token.text+' '*15)[:15], (token.lemma_+' '*15)[:15], token.pos_, token.tag_, (token.dep_+' '*8)[:8],
            token.shape_, token.is_alpha, token.is_stop, sep='\t')

<span style="color:blue"><b>Task:</b></span> Create a string with only the nouns remaining in the article.

<span style="color:blue"><b>Task:</b></span> Lemmatize the text. Delete all non alpha and stop words.

### Excursus: Multiprocessing
spaCy tokenizing is relatively slow, but the process can be parallelized:

In [None]:
from joblib import Parallel, delayed

First, we have to split the data into batches, because loading the model in a sub-process for each document would be too costly.

In [None]:
def make_batches(_list,n_batches=10):
    # Split _list into n_batches consecutive slices; trailing slices may be shorter or empty
    len_batch=len(_list)//n_batches+1
    return [_list[i*len_batch:min((i+1)*len_batch,len(_list))] for i in range(n_batches)]
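
Because the batch length above is computed as `len(_list)//n_batches+1`, the last batches can be shorter than the others or even empty when there are few documents relative to `n_batches`. A possible alternative, sketched here under the hypothetical name `split_evenly`, distributes the items so that batch sizes differ by at most one.

```python
def split_evenly(items, n_batches):
    """Split items into n_batches consecutive slices whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        # The first `extra` batches receive one additional item
        size = base + (1 if i < extra else 0)
        batches.append(items[start:start + size])
        start += size
    return batches

toy_items = list(range(10))
toy_batches = split_evenly(toy_items, 4)
print(toy_batches)  # [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
```

Evenly sized batches help to keep all workers busy for a similar amount of time, although the actual run time also depends on the length of the individual texts.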

In [None]:
batches=make_batches(df['contents'],36)
print(batches)

Then we need to define a function that does the work in each sub-process.

In [None]:
def lemmatizer(texts:list)->list:
    # Load the model once per batch, not once per document
    nlp = spacy.load("en_core_web_sm")
    return [' '.join([token.lemma_ for token in nlp(text) if token.is_alpha]) for text in texts]

Now we can run the work in ``n_jobs`` parallel processes. Choose ``n_jobs`` slightly below the number of available hardware threads (often twice the number of CPU cores).
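
If the number of available threads is not known, it can be queried from Python. The following is a small sketch; the variable names and the rule of leaving exactly one thread free are assumptions for illustration.

```python
import os

# Number of logical CPUs (hardware threads); cpu_count() may return None
logical_cpus = os.cpu_count() or 1

# Leave one thread free for the rest of the system, but use at least one
n_jobs = max(1, logical_cpus - 1)
print(logical_cpus, n_jobs)
```

joblib also accepts negative values for ``n_jobs``: ``-1`` uses all CPUs and ``-2`` uses all but one.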

In [None]:
results=Parallel(n_jobs=12,verbose=50)(delayed(lemmatizer)(batch) for batch in batches)
print(results)

In [None]:
# Unpack the nested results (one list of lemmatized texts per batch) into a flat column
df['lemma'] =[lemma for batch in results for lemma in batch]

In [None]:
print(df['lemma'])

<span style="color:blue"><b>Task:</b></span> Count the 'negative', 'positive', 'uncertainty' words in the texts using multiple threads. (Batches are not required for optimal performance.)