# PLSA
_Probabilistic latent semantic analysis_

## Preliminaries
#### Import dependencies

In [1]:
import sys
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

#### Set the plotting environment

In [2]:
%matplotlib notebook

#### Put the actual `plsa` package onto the _python path_

In [3]:
sys.path.append('..')

#### Import main classes from the `plsa` package

In [4]:
from plsa import Corpus, Pipeline, Visualize
from plsa.pipeline import DEFAULT_PIPELINE
from plsa.algorithms import PLSA
from plsa.preprocessors import tokenize

## Data Sources
Because text corpora can be quite large, none is included with the `plsa` package. Two nice examples to play with are:

- [Economic News Article Tone and Relevance](https://www.figure-eight.com/data-for-everyone/)
- [Blog Authorship Corpus](http://u.cs.biu.ac.il/~koppel/BlogCorpus.html)

We assume here that you have downloaded one of them (or both) and placed it in a `data` folder at the root of your clone of the `PLSA` [GitHub repository](https://github.com/yedivanseven/PLSA).

In [8]:
csv_file = '../data/my/research.csv'
# df = pd.read_csv(csv_file)
# df = df.dropna()
# df.to_csv("../data/my/dataset_full_drop.csv")
# _df = df.sample(frac =1)
# _df.to_csv("../data/my/dataset_full_shufflue.csv")

## Set Up the Corpus
#### Define pre-processing pipeline
Depending on their source, real-world text documents are often "dirty" and need to be "cleaned up" through a series of pre-processing steps. The `plsa` submodule `preprocessors` contains several such steps
(see the [API documentation](https://probabilistic-latent-semantic-analysis.readthedocs.io/en/latest/plsa.preprocessors.html)). For convenience, they are assembled into a default pipeline that should help you get
some results out of the box. Here, however, the pipeline is built from the single `tokenize` step.

In [9]:
pipeline = Pipeline(tokenize)
pipeline

Pipeline:
0: tokenize
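
To make the idea of a pre-processing pipeline more concrete, here is a minimal sketch in plain Python. It does not use the `plsa` package, and the helper names (`to_lower`, `strip_punctuation`, `split_words`, `run_pipeline`) are hypothetical; they only illustrate how a raw document can be passed through an ordered sequence of steps.

```python
import string
from functools import reduce


def to_lower(text: str) -> str:
    """Normalize case so that 'Hello' and 'hello' count as the same word."""
    return text.lower()


def strip_punctuation(text: str) -> str:
    """Remove ASCII punctuation characters."""
    return text.translate(str.maketrans('', '', string.punctuation))


def split_words(text: str) -> tuple:
    """Split on whitespace into a tuple of tokens."""
    return tuple(text.split())


def run_pipeline(doc, steps):
    """Feed the output of each step into the next one, in order."""
    return reduce(lambda value, step: step(value), steps, doc)


steps = (to_lower, strip_punctuation, split_words)
tokens = run_pipeline('Hello, World! Hello again.', steps)
print(tokens)
```

The order of the steps matters: here the text is cleaned as a string first and only split into tokens at the very end.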

#### Load corpus
Execute either this cell ...

In [50]:
# corpus = Corpus.from_xml(directory, pipeline)
# corpus

... or that cell:

In [10]:
corpus = Corpus.from_csv(csv_file, pipeline, col=-1, encoding="utf-8", max_docs=10000)
corpus

Corpus:
Number of documents: 3000
Number of words:     3000

In [52]:
tf = corpus.get_word(False) # Term Frequency
number_of_all_words = corpus.n_occurrences  # total number of occurrences of all words
number_of_occurances = tf * number_of_all_words  # number of occurrences of each word w across all documents
number_of_occurances = number_of_occurances.astype(np.int64)  # numpy.float64 => numpy.int64
# number_of_occurances

In [53]:
vocabulary = corpus.vocabulary
# corpus.vocabulary

In [54]:
words_number_of_occurances = dict(zip(vocabulary.values(), number_of_occurances))
# words_number_of_occurances
# Sort words by their number of occurrences, most frequent first
sorted_words_number_of_occurances = dict(sorted(words_number_of_occurances.items(), key=lambda item: item[1], reverse=True))
sorted_words_number_of_occurances

{'いい': 3902,
 '心': 3799,
 '見る': 3394,
 '言う': 3166,
 '行く': 2947,
 '愛': 2830,
 '夢': 2780,
 '笑う': 2579,
 '今日': 2520,
 '知る': 2506,
 'いく': 2452,
 '生きる': 2384,
 '世界': 2162,
 'しまう': 2095,
 '言葉': 2095,
 '明日': 2092,
 '涙': 2043,
 'くる': 2033,
 '胸': 2024,
 '泣く': 2023,
 '夜': 2000,
 '僕ら': 1998,
 '空': 1967,
 '好き': 1964,
 '変わる': 1905,
 '思う': 1843,
 '声': 1766,
 '二人': 1764,
 '見える': 1758,
 '忘れる': 1756,
 '風': 1749,
 '来る': 1689,
 '恋': 1646,
 'ゆく': 1640,
 '信じる': 1620,
 '誰か': 1586,
 'みる': 1496,
 '消える': 1476,
 '日々': 1447,
 '強い': 1409,
 '未来': 1385,
 '気持ち': 1367,
 '終わる': 1359,
 '愛す': 1349,
 'わかる': 1348,
 '想い': 1322,
 '待つ': 1319,
 '歩く': 1309,
 '幸せ': 1286,
 '全て': 1265,
 '笑顔': 1264,
 '探す': 1245,
 '続ける': 1212,
 '感じる': 1159,
 '場所': 1156,
 '無い': 1137,
 '光': 1137,
 '顔': 1125,
 '街': 1068,
 '一人': 1049,
 '合う': 1041,
 '雨': 1036,
 '出す': 1030,
 '抱きしめる': 1013,
 '言える': 980,
 '欲しい': 968,
 '届く': 952,
 '歌う': 930,
 '出来る': 921,
 '聞く': 913,
 '居る': 894,
 '輝く': 888,
 '遠い': 881,
 '音': 881,
 '意味': 870,
 '優しい': 838,
 '分かる': 822,
 '気付く':

In [32]:
number_of_all_words

657359
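
The cells above recover absolute word counts by multiplying each word's relative frequency by the total number of word occurrences. The following is a small, self-contained sketch of that arithmetic on a made-up toy corpus (the documents and variable names are assumptions for illustration, not part of the `plsa` package). It rounds before converting to integers, because a floating-point product can land just below the true count.

```python
from collections import Counter

# Toy corpus: each document is already tokenized (made-up data)
docs = [
    ('rain', 'night', 'rain'),
    ('night', 'dream'),
    ('rain', 'dream', 'dream', 'dream'),
]

# Absolute counts and the total number of word occurrences
counts = Counter(word for doc in docs for word in doc)
total = sum(counts.values())

# Relative frequency of each word, i.e. an estimate of P(w)
rel_freq = {word: count / total for word, count in counts.items()}

# Recover integer counts from the relative frequencies
recovered = {word: round(p * total) for word, p in rel_freq.items()}

# Rank words by count, most frequent first
ranked = sorted(recovered.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```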

## Run PLSA

#### Choose the number of topics

In [10]:
n_topics = 4

#### Instantiate a PLSA model

In [11]:
plsa = PLSA(corpus, n_topics, True)
plsa

PLSA:
====
Number of topics:     4
Number of documents:  100
Number of words:      2378
Number of iterations: 0

Notice that we did not do any iterations yet.

#### Fit a PLSA model

In [12]:
result = plsa.fit()
plsa

PLSA:
====
Number of topics:     4
Number of documents:  100
Number of words:      2378
Number of iterations: 47

Now we indeed did do some iterations.

#### Find the best PLSA model of many
As with any iterative algorithm, the probabilities in PLSA need to be (randomly) initialized prior to the first iteration step. Therefore, calling the `fit` method of two different `PLSA` instances operating on the _same_ corpus with the _same_ number of topics can lead to (slightly) different results, corresponding to different local minima of the Kullback-Leibler divergence between the true document-word probability and its approximate factorization. To mitigate this effect, perform multiple runs and pick the best model.

Be patient, this may take a while ...

In [13]:
result = plsa.best_of(4)
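
To illustrate why several randomly initialized runs can end up in different local minima, here is a compact NumPy sketch of a textbook EM loop for PLSA with restarts. It is not the implementation used by the `plsa` package; the function name `fit_plsa`, the toy count matrix, and the fixed iteration count are assumptions made for this example.

```python
import numpy as np


def fit_plsa(p_dw, n_topics, rng, n_iter=200, eps=1e-12):
    """One randomly initialized EM run; returns (KL divergence, factors)."""
    n_docs, n_words = p_dw.shape
    p_z = np.full(n_topics, 1.0 / n_topics)      # P(z)
    p_d_z = rng.random((n_docs, n_topics))       # P(d|z), random start
    p_d_z /= p_d_z.sum(axis=0)
    p_w_z = rng.random((n_words, n_topics))      # P(w|z), random start
    p_w_z /= p_w_z.sum(axis=0)
    for _ in range(n_iter):
        # E-step: P(z|d,w) is proportional to P(z) P(d|z) P(w|z)
        joint = p_z * p_d_z[:, None, :] * p_w_z[None, :, :]
        post = joint / (joint.sum(axis=2, keepdims=True) + eps)
        # M-step: re-estimate the factors from the weighted posteriors
        weighted = p_dw[:, :, None] * post
        p_z = weighted.sum(axis=(0, 1))
        p_d_z = weighted.sum(axis=1) / (p_z + eps)
        p_w_z = weighted.sum(axis=0) / (p_z + eps)
    # KL divergence between the empirical P(d,w) and its factorization
    approx = np.einsum('z,dz,wz->dw', p_z, p_d_z, p_w_z)
    mask = p_dw > 0
    kl = float(np.sum(p_dw[mask] * np.log(p_dw[mask] / (approx[mask] + eps))))
    return kl, (p_z, p_d_z, p_w_z)


# Toy document-word counts (made-up data), normalized to P(d,w)
counts = np.array([[4, 3, 0, 0],
                   [3, 4, 1, 0],
                   [0, 1, 4, 3],
                   [0, 0, 3, 4]], dtype=float)
p_dw = counts / counts.sum()

rng = np.random.default_rng(0)
runs = [fit_plsa(p_dw, n_topics=2, rng=rng) for _ in range(4)]
kls = [kl for kl, _ in runs]
best_kl, best_factors = min(runs, key=lambda run: run[0])
print(kls, best_kl)
```

Keeping the run with the lowest divergence is the same "best of many" idea, only spelled out by hand.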

#### Examine the results
Feel free to explore the attributes of the `result` object. See the [API documentation](https://probabilistic-latent-semantic-analysis.readthedocs.io/en/latest/plsa.algorithms.result.html) for more information.

For example, we can look at the relative prevalence of the individual topics, the topic mixture of each document, and the word distribution of each topic.

In [14]:
P_z = result.topic  # P(z)
P_zd = result.topic_given_doc  # P(z|d)
P_wz = result.word_given_topic  # P(w|z)
# P_zw = result.topic_given_word  # P(z|w)
P_wz

((('ずっと', 0.004826084740848369),
  ('ない', 0.004721592409264531),
  ('欲しい', 0.004698394207986642),
  ('好き', 0.004520046987782689),
  ('離す', 0.004391692947968238),
  ('みる', 0.004160878858773738),
  ('ほしい', 0.0038260206430373255),
  ('姿', 0.0038226117158256406),
  ('いい', 0.003556820272954272),
  ('見る', 0.0035170282320107836),
  ('良い', 0.0034997723507161644),
  ('できる', 0.0034860788081727454),
  ('想い', 0.0033788519795401916),
  ('想う', 0.003354369190467822),
  ('気付く', 0.003300722153489656),
  ('くれる', 0.0032348710434942315),
  ('不安', 0.0031974522280360366),
  ('恋しい', 0.00319745222726326),
  ('思う', 0.0030639976560601005),
  ('強い', 0.003036424061784061),
  ('言える', 0.002975167371294467),
  ('見える', 0.0029508216336470196),
  ('知る', 0.0029441336549277507),
  ('すぎる', 0.0029425336103096437),
  ('待つ', 0.002896247723614394),
  ('いる', 0.0028816220809463025),
  ('ある', 0.0028790323960107577),
  ('の', 0.0028502060346880928),
  ('ちゃんと', 0.0027536616472920465),
  ('言う', 0.0027535858199021343),
  ('怖い', 0.002
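
The quantities pulled from `result` above are linked by the PLSA factorization P(d, w) = Σ_z P(z) P(d|z) P(w|z). As a hedged illustration with made-up numbers (not values from this corpus, and not using the `plsa` package), the sketch below builds the joint probability from the three factors and then derives P(z|d) via Bayes' rule.

```python
import numpy as np

# Made-up factors for 2 topics, 2 documents, and 3 words
p_z = np.array([0.6, 0.4])            # P(z)
p_d_z = np.array([[0.7, 0.1],
                  [0.3, 0.9]])        # P(d|z), each column sums to 1
p_w_z = np.array([[0.5, 0.1],
                  [0.3, 0.2],
                  [0.2, 0.7]])        # P(w|z), each column sums to 1

# Joint document-word probability from the factorization
p_dw = np.einsum('z,dz,wz->dw', p_z, p_d_z, p_w_z)

# Marginal P(d), then Bayes' rule: P(z|d) = P(d|z) P(z) / P(d)
p_d = p_dw.sum(axis=1)
p_z_d = (p_d_z * p_z) / p_d[:, None]  # one row per document

print(p_dw.sum())
print(p_z_d)
```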

Or, we could predict the topic mixture of an entirely new document.

In [15]:
# new_doc = 'Hello! This is the federal humpty dumpty agency for state funding.'

# topic_components, number_of_new_words, new_words = result.predict(new_doc)

# print('Relative topic importance in new document:', topic_components)
# print('Number of previously unseen words in new document:', number_of_new_words)
# print('Previously unseen words in new document:', new_words)
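
One common way to obtain a topic mixture for an unseen document is "folding in": the word distributions P(w|z) stay fixed and only the document's topic weights are re-estimated with EM. The sketch below shows that idea under stated assumptions; it is not necessarily what `result.predict` does internally, the function name `fold_in` and the numbers are made up, and words outside the vocabulary are assumed to have been dropped beforehand.

```python
import numpy as np


def fold_in(word_counts, p_w_z, n_iter=100, eps=1e-12):
    """Estimate P(z|d_new) for a new document while keeping P(w|z) fixed."""
    n_topics = p_w_z.shape[1]
    p_z_d = np.full(n_topics, 1.0 / n_topics)    # uniform start
    for _ in range(n_iter):
        # E-step: P(z|d_new,w) is proportional to P(w|z) P(z|d_new)
        joint = p_w_z * p_z_d
        post = joint / (joint.sum(axis=1, keepdims=True) + eps)
        # M-step: weight the posteriors by the word counts of the new document
        p_z_d = word_counts @ post
        p_z_d /= p_z_d.sum()
    return p_z_d


# Made-up P(w|z) for 4 vocabulary words and 2 topics (columns sum to 1)
p_w_z = np.array([[0.5, 0.0],
                  [0.4, 0.1],
                  [0.1, 0.4],
                  [0.0, 0.5]])

# The new document uses only the first two vocabulary words
new_doc_counts = np.array([3.0, 2.0, 0.0, 0.0])
mixture = fold_in(new_doc_counts, p_w_z)
print(mixture)
```

Because the new document only contains words that are typical of the first topic, the estimated mixture is dominated by that topic.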

And, of course, we can look at individual topics, that is, how important which word is for which topic. Let's look at the top 20 words of the fourth topic (index 3).

In [16]:
result.word_given_topic[3][:20] 

(('最深', 0.006341529505785272),
 ('いい', 0.005688325155419675),
 ('悲鳴', 0.005386951165999748),
 ('ロンリネス', 0.004756147129338953),
 ('代わり', 0.004713654930224184),
 ('有事', 0.004356689239437888),
 ('二人', 0.0042653590191238525),
 ('最後', 0.004245185968396761),
 ('どう', 0.004216941429171904),
 ('行く', 0.004214347641108833),
 ('言う', 0.004157721816145625),
 ('ゆく', 0.004133518205351294),
 ('楽しい', 0.004119712238881087),
 ('やる', 0.004119666000021162),
 ('雪', 0.004040275654477872),
 ('窓', 0.00391284781019661),
 ('あいつ', 0.003878491766550176),
 ('笑う', 0.0037656477503295255),
 ('消える', 0.003684449142345685),
 ('変わる', 0.0036601064759369025))
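
A ranked list like the one above can be produced from any word distribution by sorting the words by their probability. The following sketch uses a made-up vocabulary and made-up probabilities; the helper `top_words` is hypothetical and not part of the `plsa` package.

```python
def top_words(p_w_given_z, vocabulary, n=3):
    """Return the n most probable (word, probability) pairs of one topic."""
    ranked = sorted(zip(vocabulary, p_w_given_z),
                    key=lambda pair: pair[1],
                    reverse=True)
    return tuple(ranked[:n])


# Made-up vocabulary and P(w|z) for a single topic
vocabulary = ('rain', 'night', 'dream', 'smile')
p_w_given_z = (0.1, 0.4, 0.3, 0.2)

top = top_words(p_w_given_z, vocabulary, n=2)
print(top)
```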

## Visualize the Results

In [17]:
visualize = Visualize(result)
visualize

Visualize:
Number of topics:    4
Number of documents: 100
Number of words:     2378

#### Convergence
Since PLSA uses an iterative expectation-maximization (EM) style algorithm, let's make sure we have achieved reasonable convergence.

In [18]:
fig, ax = plt.subplots()
_ = visualize.convergence(ax)
fig.tight_layout()


#### Relative topic importance
How important are the topics we found in the corpus?

In [19]:
fig, ax = plt.subplots()
_ = visualize.topics(ax)
fig.tight_layout()


#### The topics
The most interesting part is probably the topics themselves. We can visualize them as word clouds.

In [20]:
fig = plt.figure(figsize=(9.4, 10))
_ = visualize.wordclouds(fig)


#### Relative topic importance in a document
Also interesting is the mixture of topics in each document. Let's look at the first one.

In [21]:
fig, ax = plt.subplots()
_ = visualize.topics_in_doc(0, ax)
fig.tight_layout()


Let's compare this with what the prediction would look like, pretending that this document wasn't seen before.

In [22]:
# Take the first non-empty raw document in the corpus
first = next(doc for doc in corpus.raw if doc)

fig, ax = plt.subplots()
_ = visualize.prediction(first, ax)
fig.tight_layout()


Similar, but not quite the same. This is the very nature of matrix factorization algorithms, to which PLSA can be seen to belong. We try to approximate the original counts of each word in each document with a lower-dimensional representation of the data. That's why the topic composition gets somewhat "blurred".
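
The "blurring" caused by a lower-dimensional representation can be seen in any low-rank approximation. As a stand-in for PLSA, the sketch below uses a truncated singular value decomposition (a different factorization, chosen here only because it fits in a few lines) on a made-up count matrix and measures how far the rank-2 reconstruction is from the original counts.

```python
import numpy as np

# Made-up document-word count matrix (4 documents, 4 words)
counts = np.array([[4., 3., 0., 0.],
                   [3., 4., 1., 0.],
                   [0., 1., 4., 3.],
                   [0., 0., 3., 4.]])

# Full SVD, then keep only the two largest singular values
u, s, vt = np.linalg.svd(counts)
rank = 2
approx = (u[:, :rank] * s[:rank]) @ vt[:rank]

# Frobenius norm of what the low-rank representation cannot capture
error = np.linalg.norm(counts - approx)

print(np.round(approx, 2))
print(error)
```

The reconstruction is close to the original counts but not identical, which mirrors the small differences between the fitted and the predicted topic mixture.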

#### Prediction for a new document
We can also visualize the predicted topic composition for a new document.

In [23]:
new_doc = '悲しい'

fig, ax = plt.subplots()
_ = visualize.prediction(new_doc, ax)
fig.tight_layout()
