# Vauva.fi and an Introduction to Natural Language Processing

- Vauva.fi is the most popular magazine-run online forum in Finland
- But why? What do people discuss on the "Aihe vapaa" section?
- With approximately 250,000 posts scraped with the [vauvascrape](https://github.com/alkukampela/vauvascrape) tool, we'll hopefully find out


- Libraries:
    - [libvoikko](https://github.com/voikko/corevoikko): An open-source library for processing Finnish text
    - [gensim](https://github.com/RaRe-Technologies/gensim): "Topic modeling for humans"
    - [pyLDAvis](https://github.com/bmabey/pyLDAvis): visualization


In [6]:
import itertools
import pgdb
from pre_processing import libvoikko
from tqdm import tqdm_notebook as tqdm
from gensim.models.phrases import Phrases, Phraser
from gensim.corpora import Dictionary, MmCorpus
from gensim.models.ldamulticore import LdaMulticore
import pyLDAvis
import pyLDAvis.gensim

import warnings
warnings.filterwarnings('ignore', category=DeprecationWarning) 

DATABASE = 'vauvafi'
POST_QUERY = 'SELECT id, content FROM posts'
INSERT_NORMALIZED = 'INSERT INTO normalized_posts(id, content) VALUES(%s, %s)'
NORMALIZED_POST_QUERY = 'SELECT * FROM normalized_posts'

VOIKKO = libvoikko.Voikko('fi')

with open('pre_processing/stop_words.txt') as f:
    # A set gives constant-time membership tests in is_stop_word()
    stop_words = {word.rstrip() for word in f}
    
with pgdb.connect(database=DATABASE) as db:
    with db.cursor() as cursor:
        cursor.execute('SELECT COUNT(id) FROM posts')
        POST_COUNT = cursor.fetchone().count
        cursor.execute('SELECT COUNT(id) FROM topics')
        TOPIC_COUNT = cursor.fetchone().count

## Pre-processing

* Steps:
    * removal of all punctuation and stop words
    * lemmatization of all words
    * phrase modeling

In [3]:
def tokenize(text):
    return VOIKKO.tokens(text)

tokenize('Onpa kivaa, kun on perjantai!')

[<Onpa,WORD>,
 < ,WHITESPACE>,
 <kivaa,WORD>,
 <,,PUNCTUATION>,
 < ,WHITESPACE>,
 <kun,WORD>,
 < ,WHITESPACE>,
 <on,WORD>,
 < ,WHITESPACE>,
 <perjantai,WORD>,
 <!,PUNCTUATION>]

In [4]:
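def lemmatize(word):
    # Voikko may return several analyses for a word; use the first one
    analysis = VOIKKO.analyze(word)
    try:
        return analysis[0]['BASEFORM'].lower()
    except (IndexError, KeyError):
        # Unknown to Voikko (e.g. slang or a typo): fall back to the lowercased word
        return word.lower()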
    
lemmatize('perjantainakinkohan')

'perjantai'

In [5]:
def is_word(token):
    return token.tokenType == libvoikko.Token.WORD

def is_stop_word(word):
    return word in stop_words or len(word) <= 2

def normalize(text):
    for token in tokenize(text):
        if not is_word(token):
            continue
        word = lemmatize(token.tokenText)
        if not is_stop_word(word):
            yield word
            
list(normalize('Onpa kivaa, kun on perjantai!'))

['kiva', 'perjantai']

In [5]:
def get_rows(query, batch_size=100):
    with pgdb.connect(database=DATABASE) as db:
        with db.cursor() as cursor:
            cursor.execute(query)
            while True:
                rows = cursor.fetchmany(batch_size)
                if not rows:
                    break
                yield from rows
            
def get_posts():
    yield from tqdm(get_rows(POST_QUERY), total=POST_COUNT)
    
def split_to_sentences(text):
    return VOIKKO.sentences(text)

def get_normalized_sentences():
    for post in get_posts():
        for sentence in split_to_sentences(post.content):
            yield normalize(sentence.sentenceText)
            
for sentence in itertools.islice(get_normalized_sentences(), 49, 52):
    print('- ' + ' '.join(sentence))   

- tuntua yökylä ohjelma jakso avautuva paska elämä lapsi hankinta
- lapsi hankkia hoitaa tuntua vastenmielinen
- kirjotetaan tämmösestä tavallinen äiti isä arki



* a bag-of-words representation (a vector holding the frequency of each vocabulary word, with word order discarded) is used for the documents
    &rightarrow; information about the position of the words is lost

* we want to preserve meaningful phrases that consist of multiple words, e.g. "new york" or "new york times"   
    &rightarrow; phrase modeling
    
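* [gensim](https://radimrehurek.com/gensim/models/phrases.html) implements a simple algorithm for finding these kinds of phrases: adjacent word pairs are scored with the formula below, where $\delta$ is a discounting coefficient that keeps pairs of very infrequent words from being joined, and pairs scoring above a threshold are merged into a single token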
    
\begin{equation*}
score(w_i, w_j) = \frac{count(w_i, w_j) - \delta}{count(w_i) \times count(w_j)}
\end{equation*}
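
As a rough illustration of the scoring formula above, the following pure-Python sketch counts words and adjacent word pairs and applies the formula directly. It is not gensim's implementation (which, as far as I know, also scales the score by the vocabulary size by default); `phrase_scores`, `toy_sentences` and the choice `delta=1` are assumptions made for this example.

```python
from collections import Counter


def phrase_scores(sentences, delta=1):
    """Score each adjacent word pair as (count(a, b) - delta) / (count(a) * count(b))."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        unigrams.update(sentence)
        # zip pairs every word with its right-hand neighbour
        bigrams.update(zip(sentence, sentence[1:]))
    return {
        (a, b): (n - delta) / (unigrams[a] * unigrams[b])
        for (a, b), n in bigrams.items()
    }


# Hypothetical toy corpus of already normalized sentences
toy_sentences = [
    ['new', 'york', 'times'],
    ['new', 'york', 'city'],
    ['new', 'car'],
]

scores = phrase_scores(toy_sentences)
# ('new', 'york') is the only pair seen more often than delta,
# so it is the only pair with a positive score
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

With `delta=1`, every pair that occurs only once scores zero, which shows how the discount suppresses phrases built from chance co-occurrences.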


In [6]:
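def build_phrase_model(sentences):
    # Phraser is a lighter, faster wrapper around the trained Phrases model
    return Phraser(Phrases(sentences))

# First pass learns two-word phrases from the normalized sentences
bigram = build_phrase_model(get_normalized_sentences())
# Second pass runs on bigram-merged sentences, so it can extend them to three words
trigram = build_phrase_model(bigram[get_normalized_sentences()])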

In [7]:
def pre_process(text, bigram, trigram):
    return list(trigram[bigram[normalize(text)]])

' '.join(
    pre_process(
        '''
        Ammuin ilmapistoolilla koulukaveriani jalkaan.
        Sillä oli tiukat farkut jalassa, ei mennyt läpi niistäkään, mutta reiteen tuli mustelma.
        Elettiin kasarin alkua.
        ''',
        bigram,
        trigram
    )
)

'ampua ilmapistooli koulukaveri jalka tiukka_farkku jalka menty reisi mustelma elää kasari alku'

In [8]:
with pgdb.connect(database=DATABASE) as db:
    with db.cursor() as cursor:
        cursor.executemany(INSERT_NORMALIZED, (
            (post.id, pre_process(post.content, bigram, trigram))
            for post in get_posts()
        ))
    db.commit()




## Latent Dirichlet Allocation (LDA)
[Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003.](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf)

* An unsupervised algorithm for finding topics in a set of documents
* Assumes that each _document_ is a mixture of _topics_, and each topic is a distribution over _words_

![](img/lda-en.png)

* Generative process for a document:
    1. Choose the number of words $N \sim Poisson(\xi)$
    1. Choose the distribution of topics $\theta \sim Dirichlet(\alpha)$
    1. For each of the $N$ words $w_n$:
        1. Choose topic $z_n \sim Multinomial(\theta)$
        1. Choose word $w_n \sim p(w_n | z_n, \beta)$ (a multinomial distribution conditioned on the topic $z_n$; $\beta$ is a matrix holding the word distribution of every topic)

 &rightarrow; let's estimate the parameters of the model from the observed data!
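
To make the generative process above concrete, here is a minimal sketch that samples one synthetic document using only the Python standard library. The helper names (`sample_poisson`, `sample_dirichlet`, `generate_document`) and the toy values of `alpha`, `beta` and `xi` are assumptions for illustration; this simulates the model and does not fit it.

```python
import math
import random


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for small lam."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_dirichlet(alpha, rng):
    """Normalized independent gamma draws follow a Dirichlet distribution."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(alpha, beta, xi, rng):
    n_words = sample_poisson(xi, rng)        # 1. N ~ Poisson(xi)
    theta = sample_dirichlet(alpha, rng)     # 2. theta ~ Dirichlet(alpha)
    document = []
    for _ in range(n_words):                 # 3. for each of the N words
        z = rng.choices(range(len(theta)), weights=theta)[0]      # topic z_n
        w = rng.choices(range(len(beta[z])), weights=beta[z])[0]  # word w_n
        document.append((z, w))
    return theta, document


# Hypothetical toy parameters: 2 topics over a 4-word vocabulary
alpha = [0.5, 0.5]
beta = [
    [0.45, 0.45, 0.05, 0.05],  # topic 0 prefers words 0 and 1
    [0.05, 0.05, 0.45, 0.45],  # topic 1 prefers words 2 and 3
]
theta, document = generate_document(alpha, beta, xi=10, rng=random.Random(0))
print(theta)
print(document)
```

Fitting LDA runs this story in reverse: given only the observed words, it estimates the topic mixture of each document and the word distribution of each topic.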

In [9]:
def get_normalized_posts():
    for post in tqdm(get_rows(NORMALIZED_POST_QUERY), total=POST_COUNT):
        yield post.content

In [10]:
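def build_dictionary():
    dictionary = Dictionary(get_normalized_posts())
    # Keep only tokens that occur in at least 20 posts and in at most 10% of all posts
    dictionary.filter_extremes(no_below=20, no_above=0.1)
    # Reassign token ids so that they are contiguous after filtering
    dictionary.compactify()
    return dictionary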

dictionary = build_dictionary()




In [11]:
def get_post_bows(dictionary):
    for post in get_normalized_posts():
        yield dictionary.doc2bow(post)

def build_corpus(dictionary, fname):
    MmCorpus.serialize(fname, get_post_bows(dictionary))
    return MmCorpus(fname)

corpus = build_corpus(dictionary, 'vauvafi_corpus.mm')
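
The `doc2bow` call above turns each post into a sparse bag-of-words: a list of `(token_id, count)` pairs. The following pure-Python sketch mimics that idea on a toy example. It is not gensim's code, and `build_vocabulary`, `to_bow` and `toy_posts` are hypothetical names used only here.

```python
from collections import Counter


def build_vocabulary(documents):
    """Assign each distinct token an integer id in order of first appearance."""
    vocabulary = {}
    for document in documents:
        for token in document:
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary


def to_bow(document, vocabulary):
    """Return sorted (token_id, count) pairs; tokens outside the vocabulary are dropped."""
    counts = Counter(vocabulary[token] for token in document if token in vocabulary)
    return sorted(counts.items())


# Hypothetical toy posts, already normalized
toy_posts = [
    ['lapsi', 'hankkia', 'lapsi'],
    ['arki', 'lapsi'],
]

vocabulary = build_vocabulary(toy_posts)
bows = [to_bow(post, vocabulary) for post in toy_posts]
print(bows)

# Word order is lost: a reordered post yields the same representation
print(to_bow(['hankkia', 'lapsi', 'lapsi'], vocabulary) == bows[0])
```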




In [14]:
%%time

lda = LdaMulticore(corpus, num_topics=30, id2word=dictionary, workers=3)

CPU times: user 1min 30s, sys: 12.4 s, total: 1min 42s
Wall time: 1min 42s


In [15]:
ldavis_data = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(ldavis_data)

## Conclusions

* Finnish is hard (especially when written poorly)
    * Its rich morphology has to be taken into account in the pre-processing (lemmatization/stemming)
    * Tools and libraries may not be as readily available as for other languages (for English, for example, [nltk](http://www.nltk.org/) and [spaCy](https://spacy.io/) work well in place of Voikko)
    
    
* Not all of the topics make sense (there's no guarantee of meaningful topics)

* Why does any of this matter anyway?
    * Visualization and an understanding of a vast set of documents can often be useful in itself
    * The topic distribution can be used as a more compact (fewer dimensions) representation than the often sparse bag-of-words vectors, e.g. as input to classification or regression
    * Topic modeling has seen a variety of uses outside text data, e.g. genetic data, images or social networks ([David M. Blei. 2012.](http://www.cs.columbia.edu/~blei/papers/Blei2012.pdf))
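
The point about compact representations can be sketched as follows. As far as I know, gensim's LDA models return a document's topic distribution as a sparse list of `(topic_id, probability)` pairs that omits low-probability topics, so a fixed-length feature vector has to be filled in explicitly. `to_dense` and the example values are hypothetical.

```python
def to_dense(topic_probs, num_topics):
    """Expand sparse (topic_id, probability) pairs into a fixed-length feature vector."""
    dense = [0.0] * num_topics
    for topic_id, prob in topic_probs:
        dense[topic_id] = prob
    return dense


# Hypothetical sparse topic distribution of one post under a 5-topic model
sparse = [(1, 0.7), (4, 0.25)]
features = to_dense(sparse, num_topics=5)
print(features)
```

Each post then becomes a vector with one entry per topic (30 in the model trained here) instead of one entry per vocabulary word, which is a far smaller input for a classifier or regressor.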