# PLSA
_Probabilistic latent semantic analysis_

## Preliminaries
#### Import dependencies

In [12]:
import sys
import matplotlib.pyplot as plt
import pandas as pd

#### Set the plotting environment

In [13]:
%matplotlib notebook

#### Put the actual `plsa` package onto the _python path_

In [14]:
sys.path.append('..')

#### Import main classes from the `plsa` package

In [15]:
from plsa import Corpus, Pipeline, Visualize
from plsa.pipeline import DEFAULT_PIPELINE
from plsa.algorithms import PLSA
from plsa.preprocessors import tokenize

## Data Sources
Because text corpora can be quite large, none is included with the `plsa` package. Two good examples to experiment with are:

- [Economic News Article Tone and Relevance](https://www.figure-eight.com/data-for-everyone/)
- [Blog Authorship Corpus](http://u.cs.biu.ac.il/~koppel/BlogCorpus.html)

We assume here that you have downloaded at least one of them and placed it in a `data` folder at the
root of your clone of the `PLSA` [GitHub repository](https://github.com/yedivanseven/PLSA).

In [16]:
csv_file = '../data/my/dataset_full.csv'
# directory = '../data/blogs'
# df = pd.read_csv('../data/my/dataset_full.csv')
# df.dropna().to_csv('../data/my/dataset_full.csv')

## Set Up the Corpus
#### Define pre-processing pipeline
Depending on their source, real-world text documents are often "dirty" and need to be "cleaned up" through a series of pre-processing steps. The `plsa` submodule `preprocessors` contains several of them
(see the [API documentation](https://probabilistic-latent-semantic-analysis.readthedocs.io/en/latest/plsa.preprocessors.html)). For convenience, they are assembled into a default pipeline (`DEFAULT_PIPELINE`) that should help you get
some results out of the box. Here, we instead build a minimal pipeline from the `tokenize` step alone.

In [17]:
pipeline = Pipeline(tokenize)
pipeline

Pipeline:
0: tokenize
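
Conceptually, a pre-processing pipeline is a chain of functions in which the output of one step becomes the input of the next. The following is a minimal sketch of that idea only. It is not how the `plsa` package implements `Pipeline`, and the step functions `to_lower`, `strip_punctuation`, and `split_words` are hypothetical stand-ins rather than members of `plsa.preprocessors`.

```python
from functools import reduce


# Hypothetical pre-processing steps (NOT the ones shipped with plsa).
def to_lower(text):
    return text.lower()


def strip_punctuation(text):
    # Keep only letters, digits, and whitespace.
    return ''.join(ch for ch in text if ch.isalnum() or ch.isspace())


def split_words(text):
    return tuple(text.split())


def make_pipeline(*steps):
    """Chain steps so that each one consumes the previous one's output."""
    def run(doc):
        return reduce(lambda value, step: step(value), steps, doc)
    return run


clean = make_pipeline(to_lower, strip_punctuation, split_words)
tokens = clean('Hello, World! Hello again.')
print(tokens)  # ('hello', 'world', 'hello', 'again')
```

Whitespace splitting only works for languages that separate words with spaces. A corpus like the Japanese one loaded below needs a tokenizer that performs proper word segmentation instead.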

#### Load corpus
Execute either this cell ...

In [18]:
# corpus = Corpus.from_xml(directory, pipeline)
# corpus

... or that cell:

In [19]:
corpus = Corpus.from_csv(csv_file, pipeline, col=-1, encoding="utf-8", max_docs=100)
corpus

Corpus:
Number of documents: 100
Number of words:     2390

In [20]:
corpus.vocabulary

{0: '出来る',
 1: '嘘',
 2: '無い',
 3: '優しい',
 4: 'あれる',
 5: '痛み',
 6: '感じる',
 7: '思える',
 8: '正しい',
 9: 'さ',
 10: '失',
 11: 'くすい',
 12: '悲しみ',
 13: '出会う',
 14: '水平線',
 15: '光る',
 16: '朝',
 17: '希望',
 18: '崩れ落ちる',
 19: '風',
 20: '飛ばす',
 21: 'れる',
 22: '欠片',
 23: '誰か',
 24: '綺麗',
 25: '呟く',
 26: 'てる',
 27: '悲しい',
 28: '声',
 29: '歌う',
 30: 'いつしか',
 31: '海',
 32: '流れ着く',
 33: '見る',
 34: '背中',
 35: '見える',
 36: 'の',
 37: '恥ずかしい',
 38: 'がる',
 39: '尋ねる',
 40: 'いう',
 41: '心',
 42: '大事',
 43: 'する',
 44: '重なる',
 45: '会える',
 46: 'なる',
 47: '透き通る',
 48: '淡い',
 49: '夜',
 50: '夢',
 51: '叶う',
 52: '歓声',
 53: '拍手',
 54: '悲鳴',
 55: '隠れる',
 56: 'いる',
 57: '耐える',
 58: '理由',
 59: '探す',
 60: '答え',
 61: '抱える',
 62: '悩む',
 63: '知る',
 64: '残る',
 65: '焼き付く',
 66: 'ない',
 67: '今日',
 68: '雑音',
 69: '足音',
 70: '奥',
 71: '叫ぶ',
 72: '差し出す',
 73: '分かる',
 74: '偽物',
 75: '導き出す',
 76: '今すぐ',
 77: '追う',
 78: '日々',
 79: 'ぬくもり',
 80: 'すり抜ける',
 81: '諦める',
 82: '側',
 83: '勝ち負け',
 84: '生温い',
 85: '世界',
 86: '待つ',
 87: '怒れる',
 88: '

## Run PLSA

#### Choose the number of topics

In [21]:
n_topics = 4

#### Instantiate a PLSA model

In [22]:
plsa = PLSA(corpus, n_topics, True)
plsa

PLSA:
====
Number of topics:     4
Number of documents:  100
Number of words:      2390
Number of iterations: 0

Note that no iterations have been run yet.

#### Fit a PLSA model

In [23]:
result = plsa.fit()
plsa

PLSA:
====
Number of topics:     4
Number of documents:  100
Number of words:      2390
Number of iterations: 45

Now some iterations have indeed been run.

#### Find the best PLSA model of many
As with any iterative algorithm, the probabilities in PLSA need to be (randomly) initialized prior to the first iteration step. Therefore, calling the `fit` method of two different `PLSA` instances operating on the _same_ corpus with the _same_ number of topics can lead to (slightly) different results, corresponding to different local minima of the Kullback-Leibler divergence between the true document-word probability and its approximate factorization. To mitigate this effect, perform multiple runs and pick the best model.

Be patient, this may take a while ...

In [24]:
result = plsa.best_of(4)
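
To illustrate why restarts help, here is a small, self-contained sketch of the idea on a toy count matrix. It runs plain EM for the factorization P(d, w) ≈ Σ_z P(z) P(d|z) P(w|z) from several random initializations and keeps the run with the lowest Kullback-Leibler divergence. This is an assumption-laden illustration, not the code behind `PLSA.best_of`: the helper names (`fit_once`, `best_of`, `kl_divergence`), the fixed iteration count, and the toy data are all hypothetical.

```python
import math
import random


def normalize(values):
    """Scale non-negative values so that they sum to one."""
    total = sum(values) or 1.0
    return [v / total for v in values]


def joint(p_z, p_d_z, p_w_z, d, w):
    """Per-topic terms P(z) P(d|z) P(w|z) of the factorized P(d, w)."""
    return [p_z[z] * p_d_z[z][d] * p_w_z[z][w] for z in range(len(p_z))]


def kl_divergence(counts, p_z, p_d_z, p_w_z):
    """KL divergence between the observed P(d, w) and its factorization."""
    total = sum(map(sum, counts))
    kl = 0.0
    for d, row in enumerate(counts):
        for w, n in enumerate(row):
            if n:
                observed = n / total
                approx = sum(joint(p_z, p_d_z, p_w_z, d, w))
                kl += observed * math.log(observed / approx)
    return kl


def fit_once(counts, n_topics, rng, n_iter=50):
    """Run EM once from a random start; return (KL divergence, factors)."""
    n_docs, n_words = len(counts), len(counts[0])
    # Random initialization: this is what makes runs differ.
    p_z = normalize([rng.random() + 0.1 for _ in range(n_topics)])
    p_d_z = [normalize([rng.random() + 0.1 for _ in range(n_docs)])
             for _ in range(n_topics)]
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)])
             for _ in range(n_topics)]
    for _ in range(n_iter):
        acc_d = [[0.0] * n_docs for _ in range(n_topics)]
        acc_w = [[0.0] * n_words for _ in range(n_topics)]
        for d, row in enumerate(counts):
            for w, n in enumerate(row):
                if not n:
                    continue
                # E-step: posterior P(z | d, w) for this document-word pair.
                posterior = normalize(joint(p_z, p_d_z, p_w_z, d, w))
                for z, p in enumerate(posterior):
                    acc_d[z][d] += n * p
                    acc_w[z][w] += n * p
        # M-step: re-estimate all factors from the expected counts.
        p_z = normalize([sum(row) for row in acc_d])
        p_d_z = [normalize(row) for row in acc_d]
        p_w_z = [normalize(row) for row in acc_w]
    return kl_divergence(counts, p_z, p_d_z, p_w_z), (p_z, p_d_z, p_w_z)


def best_of(counts, n_topics, n_runs, seed=0):
    """Fit n_runs times and keep the run with the lowest KL divergence."""
    rng = random.Random(seed)
    runs = [fit_once(counts, n_topics, rng) for _ in range(n_runs)]
    return min(runs, key=lambda run: run[0]), [run[0] for run in runs]


# Toy corpus: 4 documents (rows) x 5 words (columns).
counts = [[4, 3, 0, 0, 1],
          [3, 5, 1, 0, 0],
          [0, 0, 4, 3, 1],
          [0, 1, 3, 5, 0]]
(best_kl, best_factors), all_kls = best_of(counts, n_topics=2, n_runs=4)
print('KL divergence per run:', all_kls)
print('Best run:', best_kl)
```

On a matrix this small, the runs may well end up at the same value. The spread between runs tends to matter more for larger vocabularies and more topics.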

#### Examine the results
Feel free to explore the attributes of the `result` object. See the [API documentation](https://probabilistic-latent-semantic-analysis.readthedocs.io/en/latest/plsa.algorithms.result.html) for more information.

For example, we could see the relative prevalence of the individual topics we found.

In [25]:
P_z = result.topic  # P(z)
P_zd = result.topic_given_doc  # P(z|d)
P_wz = result.word_given_topic  # P(w|z)
# P_zw = result.topic_given_word #P(z|w)
P_wz

((('君', 0.005824478732651397),
  ('好き', 0.0050206491202792036),
  ('有事', 0.004895318549742524),
  ('せる', 0.0048022414926908565),
  ('いい', 0.004514507259441596),
  ('電話', 0.003912580888327125),
  ('そこをなんとか', 0.0038417931941109097),
  ('ルル', 0.0038417931941109097),
  ('踊る', 0.003748716136931689),
  ('もう', 0.003654535686974973),
  ('何度も', 0.003132767408740344),
  ('離す', 0.0031297287160771104),
  ('みる', 0.003057607176182622),
  ('言う', 0.002961894588859709),
  ('僕', 0.002957275458318002),
  ('ん', 0.002841975766725348),
  ('の', 0.002762917133128289),
  ('気付く', 0.0027521071133023134),
  ('綺麗', 0.0027336608049945434),
  ('しれる', 0.0027196214165236244),
  ('惑う', 0.0027196214165236244),
  ('もしも', 0.0027196214165236244),
  ('雪', 0.0027196196243115013),
  ('れる', 0.002715734300842846),
  ('なれる', 0.0027137849792699563),
  ('見える', 0.0027007852912191146),
  ('無い', 0.002656449277788822),
  ('くじける', 0.0025611954627406063),
  ('免', 0.0025611954627406063),
  ('くださる', 0.0025611954627406063),
  ('起動', 0.0025

Or, we could predict the topic mixture of an entirely new document.

In [26]:
# new_doc = 'Hello! This is the federal humpty dumpty agency for state funding.'

# topic_components, number_of_new_words, new_words = result.predict(new_doc)

# print('Relative topic importance in new document:', topic_components)
# print('Number of previously unseen words in new document:', number_of_new_words)
# print('Previously unseen words in new document:', new_words)

And, of course, we can look at individual topics, that is, how important each word is for each topic. Let's look at the top-20 words of the fourth topic (index 3).

In [27]:
result.word_given_topic[3][:20] 

(('涙', 0.005358387871586745),
 ('奥', 0.00514890881402067),
 ('君', 0.005128743399960293),
 ('降る', 0.004980539606649416),
 ('幸せ', 0.004874782093649988),
 ('想い', 0.0047118303044259464),
 ('もっともっと', 0.0046638327710986595),
 ('片想い', 0.004552724898711833),
 ('いい', 0.004170381903975645),
 ('抱きしめる', 0.0038779575699630295),
 ('深い', 0.0038626665861908145),
 ('湛', 0.003828138634494191),
 ('輝き', 0.003828138634494191),
 ('たた', 0.003828138634494191),
 ('ない', 0.0037667106501409753),
 ('やる', 0.003536899849706603),
 ('隠す', 0.003445324774409243),
 ('もう', 0.003389325469308406),
 ('繋ぐ', 0.00335929887491862),
 ('欲しい', 0.003316620973503857))

## Visualize the Results

In [28]:
visualize = Visualize(result)
visualize

Visualize:
Number of topics:    4
Number of documents: 100
Number of words:     2390

#### Convergence
Since PLSA uses an iterative expectation-maximization (EM) style algorithm, let's make sure we have achieved reasonable convergence.

In [29]:
fig, ax = plt.subplots()
_ = visualize.convergence(ax)
fig.tight_layout()
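
One common way to make "reasonable convergence" concrete is to stop once the relative change of the monitored quantity between two successive iterations drops below a small tolerance. The sketch below shows that check on a made-up history of values. The function name `has_converged`, the tolerance `eps`, and the numbers are assumptions for illustration and are not taken from the `plsa` package.

```python
def has_converged(history, eps=1e-4):
    """True once the last relative change in `history` is at most `eps`."""
    if len(history) < 2:
        return False
    previous, current = history[-2], history[-1]
    return abs(previous - current) <= eps * abs(previous)


# Made-up sequence of divergence values, one per iteration.
history = [2.0, 1.2, 1.05, 1.01, 1.00999]

early = has_converged(history[:3])  # still changing by more than 10 %
late = has_converged(history)       # change is now about 0.001 %
print(early, late)  # False True
```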

#### Relative topic importance
How important are the topics we found in the corpus?

In [30]:
fig, ax = plt.subplots()
_ = visualize.topics(ax)
fig.tight_layout()

#### The topics
The most interesting part is probably the topics themselves. We can visualize them as word clouds.

In [31]:
fig = plt.figure(figsize=(9.4, 10))
_ = visualize.wordclouds(fig)

#### Relative topic importance in a document
Also interesting is the mixture of topics in each document. Let's look at the first one.

In [32]:
fig, ax = plt.subplots()
_ = visualize.topics_in_doc(0, ax)
fig.tight_layout()

Let's compare this with what the prediction would look like, pretending that this document wasn't seen before.

In [33]:
for first in corpus.raw:
    if first:
        break
        
fig, ax = plt.subplots()
_ = visualize.prediction(first, ax)
fig.tight_layout()

Similar, but not quite the same. This is the very nature of matrix factorization algorithms, to which PLSA can be seen to belong. We try to approximate the original counts of each word in each document with a lower-dimensional representation of the data. That's why the topic composition gets somewhat "blurred".
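
The blurring is easiest to see in the most extreme case of a single topic. With only one topic, the maximum-likelihood word distribution P(w|z) is simply the pooled word frequency of the whole corpus, so every document is reconstructed as the same average distribution. The toy counts and variable names below are made up for illustration.

```python
# Toy word counts for two documents with very different vocabularies.
counts = {
    'doc_a': {'sea': 3, 'wind': 1, 'dream': 0},
    'doc_b': {'sea': 0, 'wind': 1, 'dream': 3},
}
words = ['sea', 'wind', 'dream']


def empirical(doc):
    """Observed word distribution P(w|d) of a single document."""
    total = sum(counts[doc].values())
    return [counts[doc][w] / total for w in words]


# With a single topic, P(w|z) equals the pooled word frequencies.
grand_total = sum(sum(c.values()) for c in counts.values())
p_w_z = [sum(counts[d][w] for d in counts) / grand_total for w in words]

for doc in counts:
    print(doc, 'observed:', empirical(doc), 'reconstructed:', p_w_z)
# doc_a observed: [0.75, 0.25, 0.0] reconstructed: [0.375, 0.25, 0.375]
# doc_b observed: [0.0, 0.25, 0.75] reconstructed: [0.375, 0.25, 0.375]
```

Adding topics lets the reconstruction follow individual documents more closely, but as long as there are fewer topics than documents and words, some of this averaging remains.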

#### Prediction for a new document
We can also visualize the predicted topic composition for a new document.

In [34]:
new_doc = '悲しい'

fig, ax = plt.subplots()
_ = visualize.prediction(new_doc, ax)
fig.tight_layout()
