In [10]:
# Common imports
import sys
from ast import literal_eval

import gensim
import numpy as np
import pandas as pd

from categorical_em import CategoricalEM

print(sys.version)

3.6.9 |Anaconda custom (64-bit)| (default, Jul 30 2019, 13:42:17) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]


## 1. Hyperparameters


In [11]:
K = 5 # Number of mixture components
I = 120 # Number of words in the dictionary
N = None # Number of documents

## 2. Load and preprocess the data

First, we load the data from the CSV file. This file contains the documents, already processed and cleaned by the following steps:

1. Tokenization
2. Homogenization, which includes:
    1. Removing capitalization.
    2. Removing non-alphanumeric tokens (e.g. punctuation marks).
    3. Stemming/Lemmatization.
3. Cleaning
4. Vectorization
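
The first three steps can be sketched as follows. This is only an illustration under stated assumptions, not the pipeline that produced `tweets_cleaned.csv`: the `preprocess` function and the tiny `STOPWORDS` set are hypothetical, and stemming/lemmatization is left out.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "in", "for", "if", "my", "of", "to"}

def preprocess(text):
    """Tokenize, lowercase, strip non-alphanumeric characters, and drop stopwords."""
    text = re.sub(r"https?://\S+", " ", text)                # drop URLs before tokenizing
    tokens = text.lower().split()                            # tokenization + lowercasing
    tokens = [re.sub(r"[^a-z0-9]", "", t) for t in tokens]   # strip punctuation
    return [t for t in tokens if t and t not in STOPWORDS]   # cleaning

tokens = preprocess("OSINT people - please retweet, if possible.")
print(tokens)
```

Vectorization (step 4) is covered by the dictionary and bag-of-words cells later in this notebook.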


We load it as a `pandas` dataframe.


In [6]:
df = pd.read_csv('tweets_cleaned.csv')
df.drop_duplicates(subset="tweet", inplace=True)

df['tokens'] = df['tokens'].apply(literal_eval)  # Parse each stored string back into a list of tokens
X_tokens = list(df['tokens'].values)


In [7]:
print('Columns: {}\n'.format(' | '.join(df.columns.values)))

print('Tweet:\n{}'.format(df.loc[1, 'tweet']))
print('Tweet cleaned:\n{}'.format(df.loc[1, 'tweets_clean']))
print('Tweet tokens:\n{}'.format(X_tokens[1]))

Columns: tweet_id | timestamp | user_id | tweet | tweets_clean | tokens

Tweet:
OSINT people - please retweet, if possible. My friend is looking for women involved in OSINT. https://twitter.com/manisha_bot/status/1181594280336531457 …
Tweet cleaned:
osint people   please retweet  if possible  my friend is looking for women involved in osint
Tweet tokens:
['osint', 'peopl', 'retweet', 'possibl', 'friend', 'look', 'woman', 'involv', 'osint']


### Create the dictionary

Up to this point, we have transformed the raw text collection into a list of documents stored in `X_tokens`, where each document is a list 
of the words that are most relevant for semantic analysis. Now, we need to convert this data (a list of token lists) into 
a numerical representation (a list of vectors, or a matrix). To do so, we use the tools provided by the `gensim` library. 

As a first step, we create a dictionary that contains the tokens in our text corpus and assigns an integer identifier to each of them. We then keep at most `I` tokens, discarding those that appear in fewer than 15 documents or in more than half of the documents.
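
To make the filtering idea concrete, here is a plain-Python sketch of a dictionary filtered by document frequency. It only approximates what `gensim` does: `build_dictionary` is a hypothetical helper, and the way ids are assigned and ties are broken here is an assumption that need not match `gensim`.

```python
from collections import Counter

def build_dictionary(docs, no_below=1, no_above=1.0, keep_n=None):
    """Map tokens to integer ids after filtering by document frequency."""
    # Document frequency: number of documents in which each token appears.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    kept = [t for t, freq in doc_freq.items() if no_below <= freq <= max_docs]
    # Keep the keep_n most frequent tokens; ties are broken alphabetically.
    kept.sort(key=lambda t: (-doc_freq[t], t))
    if keep_n is not None:
        kept = kept[:keep_n]
    return {token: idx for idx, token in enumerate(sorted(kept))}

docs = [["osint", "friend", "osint"], ["osint", "retweet"], ["friend", "look"]]
token2id = build_dictionary(docs, no_below=2, no_above=1.0, keep_n=2)
print(token2id)  # {'friend': 0, 'osint': 1}
```
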



In [8]:
dictionary = gensim.corpora.Dictionary(X_tokens)

dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=I)



### Create Bag of Words (BoW): Numerical version of documents
In the second step, we create a numerical version of our corpus using the `doc2bow` method. In general, 
`D.doc2bow(token_list)` transforms any list of tokens into a list of tuples `(token_id, n)`, one for each distinct token in 
`token_list` that appears in the dictionary, where `token_id` is the token identifier (according to dictionary `D`) and `n` is the number of occurrences 
of that token in `token_list`. Tokens that are not in the dictionary are ignored.
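
The following plain-Python sketch mimics this behavior on a toy example. The function `doc2bow` and the mapping `token2id` below are hypothetical stand-ins for illustration, not the `gensim` implementation.

```python
from collections import Counter

def doc2bow(token_list, token2id):
    """Return sorted (token_id, count) pairs, ignoring out-of-dictionary tokens."""
    counts = Counter(token2id[t] for t in token_list if t in token2id)
    return sorted(counts.items())

token2id = {"friend": 0, "osint": 1, "retweet": 2}  # hypothetical ids
bow = doc2bow(["osint", "peopl", "retweet", "friend", "osint"], token2id)
print(bow)  # [(0, 1), (1, 2), (2, 1)]
```

Here `'peopl'` is dropped because it is not in the toy dictionary, and `'osint'` is counted twice.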

*Exercise:* Apply the `doc2bow` method of the gensim dictionary (named `dictionary` in the code) to every document in `X_tokens`. 
The result must be a new list named `X_bow` where each element is a list of tuples `(token_id, number_of_occurrences)`. Documents with fewer than two distinct dictionary tokens are discarded.

In [9]:
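X_bow = list()       # BoW representation of the tweets we keep
keep_tweet = list()  # Boolean mask over df: True if the tweet is kept
for tweet in X_tokens:
    tweet_bow = dictionary.doc2bow(tweet)
    # Keep only tweets with at least two distinct dictionary tokens
    if len(tweet_bow) > 1:
        X_bow.append(tweet_bow)
        keep_tweet.append(True)
    else:
        keep_tweet.append(False)

# Keep the dataframe rows aligned with X_bow
df_data = df[keep_tweet]

N = len(df_data)  # Number of documents after filtering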



Finally, we transform the BoW representation `X_bow` into a matrix, `X_matrix`, in which the entry in the i-th row and j-th column is the 
number of occurrences of the j-th word of the dictionary in the i-th document. This is the matrix used by the algorithm.
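
As a small worked example of this layout, the sketch below builds the matrix for two toy documents over a three-word dictionary and checks that each row sums to the number of dictionary tokens in that document. The names `bow_to_matrix` and `toy_bow` are hypothetical.

```python
import numpy as np

def bow_to_matrix(bow_docs, n_words):
    """Build a dense (documents x words) count matrix from BoW lists."""
    matrix = np.zeros((len(bow_docs), n_words))
    for i, doc_bow in enumerate(bow_docs):
        for token_id, count in doc_bow:
            matrix[i, token_id] = count
    return matrix

toy_bow = [[(0, 1), (2, 2)], [(1, 3), (2, 1)]]
toy_matrix = bow_to_matrix(toy_bow, n_words=3)
print(toy_matrix)

# Sanity check: row sums equal the total token count of each document.
row_sums = toy_matrix.sum(axis=1)
expected = [sum(count for _, count in doc) for doc in toy_bow]
print(row_sums, expected)
```
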

In [7]:
X_matrix = np.zeros([N, I])
for i, doc_bow in enumerate(X_bow):
    # Each BoW entry is a (token_id, count) pair
    for token_id, count in doc_bow:
        X_matrix[i, token_id] = count


## 3. Categorical Mixture Model with Expectation Maximization

### Exercise 1: Analytical forms of the E and M steps for the EM-Algorithm
1. Write the joint distribution: $p(\{\mathbf{x}_n, z_n\}| \Theta) = ?$
2. Write the analytical expression for $Q(\Theta, \Theta^{\text{old}}) = ?$
3. Write the MLE for $\Theta$


#### Exercise 1.1

\begin{align}
p(\{\mathbf{x}_n, z_n\}| \Theta) =  \cdots
\end{align}

#### Exercise 1.2

\begin{align}
Q(\Theta, \Theta^{\text{old}}) = \cdots
\end{align}

#### Exercise 1.3
\begin{align}
\hat{\pi}_k = \cdots
\end{align}

\begin{align}
\hat{\theta}_{km} = \cdots
\end{align}

### Exercise 2: Data analysis task
#### Exercise 2.1

In [8]:
K = 5 # Number of mixture components
i_theta = 5 # Dirichlet parameter used to sample the initial value of theta
i_pi = 0 # Dirichlet parameter used to sample the initial value of pi

model = CategoricalEM(K, I, N, delta=0.01, epochs=200, init_params={'theta': i_theta, 'pi': i_pi})
model.fit(X_matrix)

ITER: 0 Q= -113938.6894 diff= 200
ITER: 5 Q= -106270.54 diff= 874.9335
ITER: 10 Q= -103845.55 diff= 229.4907
ITER: 15 Q= -103360.947 diff= 62.7847
ITER: 20 Q= -103129.4652 diff= 43.7452
ITER: 25 Q= -102945.6175 diff= 25.4653
ITER: 30 Q= -102867.4741 diff= 11.5916
ITER: 35 Q= -102823.8343 diff= 6.9968
ITER: 40 Q= -102793.0763 diff= 6.5606
ITER: 45 Q= -102749.1686 diff= 10.122
ITER: 50 Q= -102690.746 diff= 9.0267
ITER: 55 Q= -102655.0962 diff= 6.1814
ITER: 60 Q= -102625.4566 diff= 6.262
ITER: 65 Q= -102590.2887 diff= 7.2504
ITER: 70 Q= -102554.5986 diff= 7.013
ITER: 75 Q= -102523.1185 diff= 5.6888
ITER: 80 Q= -102497.527 diff= 4.9639
ITER: 85 Q= -102469.6639 diff= 5.8711
ITER: 90 Q= -102443.7946 diff= 4.6469
ITER: 95 Q= -102423.6168 diff= 3.6374
ITER: 100 Q= -102407.3007 diff= 3.0841
ITER: 105 Q= -102393.0427 diff= 2.7363
ITER: 110 Q= -102379.6224 diff= 2.6735
ITER: 115 Q= -102366.1641 diff= 2.7039
ITER: 120 Q= -102352.8545 diff= 2.5975
ITER: 125 Q= -102341.0105 diff= 2.176
ITER: 130 Q= 

#### Exercise 2.2

In [9]:
# TODO
def AIC():
    return
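
One possible starting point for this exercise is sketched below; it is not the reference solution. It uses the standard definition AIC = 2p - 2 log L, where p is the number of free parameters: (K - 1) for the mixture weights plus K(I - 1) for the word probabilities. The helper names `log_likelihood` and `aic` are hypothetical, `pi` and `theta` are plain arrays (how to read them from `CategoricalEM` is not shown in this notebook), all probabilities are assumed strictly positive, and the log-likelihood omits any constant that does not depend on the parameters.

```python
import numpy as np

def log_likelihood(X, pi, theta):
    """Log-likelihood of count matrix X under a categorical mixture.

    X: (N, I) counts, pi: (K,) weights, theta: (K, I) word probabilities.
    """
    log_joint = np.log(pi)[None, :] + X @ np.log(theta).T   # shape (N, K)
    m = log_joint.max(axis=1, keepdims=True)                # log-sum-exp trick
    return float(np.sum(m[:, 0] + np.log(np.exp(log_joint - m).sum(axis=1))))

def aic(log_lik, n_components, n_words):
    """AIC = 2 * (number of free parameters) - 2 * log-likelihood."""
    n_params = (n_components - 1) + n_components * (n_words - 1)
    return 2 * n_params - 2 * log_lik

# Toy check with K=2 components and I=2 words.
X_toy = np.array([[1.0, 1.0], [2.0, 0.0]])
pi_toy = np.array([0.5, 0.5])
theta_toy = np.array([[0.5, 0.5], [0.5, 0.5]])
ll_toy = log_likelihood(X_toy, pi_toy, theta_toy)
print(ll_toy, aic(ll_toy, n_components=2, n_words=2))
```

Lower AIC values are preferred, so comparing the value across several choices of `K` gives one criterion for choosing the number of components.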

#### Exercise 2.3

Some useful packages:
- matplotlib https://matplotlib.org/
- seaborn https://github.com/mwaskom/seaborn
- wordcloud https://github.com/amueller/word_cloud
- probvis https://github.com/psanch21/prob-visualize



In [10]:
tweet_array = np.array(df_data['tweet'].values)

# Show the 10 most representative words for each topic using a cloud of words
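
Before drawing a word cloud, the most representative words of each topic have to be selected. The sketch below ranks words by their probability within each component. The `theta` array and the `id2token` mapping are toy, hypothetical inputs; with the real data, `id2token` could be the gensim `dictionary`, which supports lookup by id, and `theta` would be read from the fitted model.

```python
import numpy as np

def top_words(theta, id2token, n_top=10):
    """Return the n_top highest-probability tokens for each mixture component."""
    order = np.argsort(-theta, axis=1)[:, :n_top]   # ids sorted by decreasing probability
    return [[id2token[j] for j in row] for row in order]

theta_toy = np.array([[0.6, 0.3, 0.1], [0.1, 0.2, 0.7]])  # toy (K=2, I=3)
id2token = {0: "osint", 1: "friend", 2: "retweet"}
words = top_words(theta_toy, id2token, n_top=2)
print(words)  # [['osint', 'friend'], ['retweet', 'friend']]
```
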

In [11]:
# Show the 10 most relevant documents for each topic.
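
One way to define "most relevant" is by the posterior responsibility of each topic for each document. The sketch below assumes a hypothetical `(N, K)` array `resp` of responsibilities from the E step; how to obtain it from `CategoricalEM` is not shown in this notebook.

```python
import numpy as np

def top_documents(resp, documents, n_top=10):
    """For each component, return the documents with the highest responsibility."""
    order = np.argsort(-resp, axis=0)[:n_top, :]    # (n_top, K) document indices
    return [[documents[i] for i in order[:, k]] for k in range(resp.shape[1])]

resp_toy = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])  # toy (N=3, K=2)
docs_toy = ["tweet a", "tweet b", "tweet c"]
ranked = top_documents(resp_toy, docs_toy, n_top=2)
print(ranked)  # [['tweet a', 'tweet c'], ['tweet b', 'tweet c']]
```

With the real data, `documents` could be `tweet_array`, whose rows are aligned with `X_matrix`.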

In [18]:
# Show the evolution of Q over the epochs