# 03. Topic Modeling (LDA) with Python

* Psygrammer / About Python
* 김무성

# Contents
* Topic Modeling & LDA
* DataSet
    - Data Download
    - Exploring the dataset
* LDA with Gensim
    - Loading and tokenizing the corpus
    - Creating the dictionary, and bag of words corpus
    - Fitting the LDA model
* Visualizing the model with pyLDAvis   

# Topic Modeling & LDA
* [1] Topic Models : LDA and Correlated Topic Models - https://www.slideshare.net/clauwa/topic-models-lda-and-correlated-topic-models
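
As a rough orientation before the hands-on part, the generative story that LDA assumes can be sketched in plain Python. This is a toy illustration only, not how gensim fits a model: the vocabulary, the hyperparameters `alpha` and `beta`, and the helper names `sample_dirichlet` and `generate_document` are all invented for this example.

```python
import random

def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet by normalizing Gamma samples."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, alpha, length, rng):
    """Generate one document under the LDA generative assumptions."""
    num_topics = len(topic_word)
    # 1. Each document gets its own mixture over topics.
    theta = sample_dirichlet(alpha, num_topics, rng)
    words = []
    for _ in range(length):
        # 2. Each word position first picks a topic from that mixture...
        topic = rng.choices(range(num_topics), weights=theta)[0]
        # 3. ...and then a word from the chosen topic's word distribution.
        vocab, probs = zip(*topic_word[topic].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

rng = random.Random(0)
vocabulary = ["space", "orbit", "nasa", "game", "team", "hockey"]  # toy vocabulary
num_topics, alpha, beta = 2, 0.5, 0.5  # hypothetical hyperparameters

# Each topic is a probability distribution over the whole vocabulary.
topic_word = [dict(zip(vocabulary, sample_dirichlet(beta, len(vocabulary), rng)))
              for _ in range(num_topics)]

theta, words = generate_document(topic_word, alpha, length=8, rng=rng)
print(theta)  # topic mixture of the generated document
print(words)  # the generated words
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic-word distributions and the per-document mixtures.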

# DataSet
* [2] 20 Newsgroups Dataset - http://qwone.com/~jason/20Newsgroups/

## Data Download

In [None]:
%%bash
mkdir -p 03_data
pushd 03_data
if [ -d "20news-bydate-train" ]
then
  echo "The data has already been downloaded..."
else
  wget http://qwone.com/%7Ejason/20Newsgroups/20news-bydate.tar.gz
  tar xfv 20news-bydate.tar.gz
  rm 20news-bydate.tar.gz
fi
echo "Let's take a look at the groups..."
ls 20news-bydate-train/
popd

## Exploring the dataset

Each group directory contains a set of files, one per message:

In [None]:
!ls -lah 03_data/20news-bydate-train/sci.space | tail -n 5

In [None]:
!head -n 20 03_data/20news-bydate-train/sci.space/61422
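
The same exploration can be done from Python. The sketch below counts the files in each group directory. `count_docs_per_group` is a hypothetical helper, and the path assumes the download cell has already populated `03_data/20news-bydate-train`; if the directory is missing, the helper simply returns an empty dict.

```python
from pathlib import Path

def count_docs_per_group(root):
    """Return {group_name: number_of_files} for each sub-directory of root."""
    root = Path(root)
    if not root.is_dir():
        return {}  # data not downloaded yet
    return {group.name: sum(1 for f in group.iterdir() if f.is_file())
            for group in sorted(root.iterdir()) if group.is_dir()}

counts = count_docs_per_group("03_data/20news-bydate-train")
for group, n_docs in counts.items():
    print(f"{group:30s} {n_docs}")
```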

# LDA with Gensim

* [3] An Introduction to gensim: "Topic Modelling for Humans" - https://www.slideshare.net/sandinmyjoints/an-introduction-to-gensim-topic-modelling-for-humans

## Loading and tokenizing the corpus

In [None]:
from glob import glob
import re
import string
import funcy as fp
from gensim import models
from gensim.corpora import Dictionary, MmCorpus
import nltk
import pandas as pd

In [None]:
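# Quick-and-dirty tokenization rules, applied in order:
#   1. replace e-mail addresses with the placeholder token "#email"
#   2. replace every character other than a-z, space, ' and # with a space
EMAIL_REGEX = re.compile(r"[a-z0-9\.\+_-]+@[a-z0-9\._-]+\.[a-z]*")
FILTER_REGEX = re.compile(r"[^a-z '#]")
TOKEN_MAPPINGS = [(EMAIL_REGEX, "#email"), (FILTER_REGEX, ' ')]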

def tokenize_line(line):
    res = line.lower()
    for regexp, replacement in TOKEN_MAPPINGS:
        res = regexp.sub(replacement, res)
    return res.split()
    
def tokenize(lines, token_size_filter=2):
    tokens = fp.mapcat(tokenize_line, lines)
    return [t for t in tokens if len(t) > token_size_filter]
    

def load_doc(filename):
    group, doc_id = filename.split('/')[-2:]
    with open(filename, errors='ignore') as f:
        doc = f.readlines()
    return {'group': group,
            'doc': doc,
            'tokens': tokenize(doc),
            'id': doc_id}


docs = pd.DataFrame(list(map(load_doc, glob('03_data/20news-bydate-train/*/*')))).set_index(['group','id'])
docs.head()
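
To see what the tokenizer does to individual lines, here is a self-contained sketch. It repeats the two regexes and both functions from the cell above, but uses `itertools.chain.from_iterable` as a stand-in for `fp.mapcat` so that it runs without funcy. The sample lines are invented.

```python
import re
from itertools import chain

EMAIL_REGEX = re.compile(r"[a-z0-9\.\+_-]+@[a-z0-9\._-]+\.[a-z]*")
FILTER_REGEX = re.compile(r"[^a-z '#]")
TOKEN_MAPPINGS = [(EMAIL_REGEX, "#email"), (FILTER_REGEX, " ")]

def tokenize_line(line):
    # Lowercase, then apply each (regex, replacement) rule in order.
    res = line.lower()
    for regexp, replacement in TOKEN_MAPPINGS:
        res = regexp.sub(replacement, res)
    return res.split()

def tokenize(lines, token_size_filter=2):
    # Flatten the per-line token lists and drop tokens of 2 characters or fewer.
    tokens = chain.from_iterable(map(tokenize_line, lines))
    return [t for t in tokens if len(t) > token_size_filter]

sample = ["From: John.Doe@example.com (John)\n", "Re: NASA's 2nd launch!\n"]
print(tokenize(sample))
# → ['from', '#email', 'john', "nasa's", 'launch']
```

Note that digits are blanked out, so `2nd` becomes `nd` and is then dropped by the length filter along with `re`.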

## Creating the dictionary, and bag of words corpus

<img src="03_figures/bow.jpg" width=600 />
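
Conceptually, the dictionary maps each distinct token to an integer id, and the bag-of-words form of a document is a list of `(token_id, count)` pairs. The following stdlib-only sketch mimics that idea. `build_token2id` and `to_bow` are hypothetical stand-ins for what gensim's `Dictionary` and `doc2bow` do, and the order in which ids are assigned here is not guaranteed to match gensim's.

```python
from collections import Counter

def build_token2id(docs):
    """Assign a fresh integer id to every token the first time it is seen."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Count known tokens and return sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

toy_docs = [["space", "shuttle", "launch", "space"],
            ["hockey", "team", "launch"]]
token2id = build_token2id(toy_docs)
print(token2id)
# → {'space': 0, 'shuttle': 1, 'launch': 2, 'hockey': 3, 'team': 4}
print([to_bow(doc, token2id) for doc in toy_docs])
# → [[(0, 2), (1, 1), (2, 1)], [(2, 1), (3, 1), (4, 1)]]
```

Word order is discarded; only the counts survive, which is all LDA needs.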

In [None]:

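def nltk_stopwords():
    # Requires the NLTK stop-word list; run nltk.download('stopwords') once.
    return set(nltk.corpus.stopwords.words('english'))

def prep_corpus(docs, additional_stopwords=None, no_below=5, no_above=0.5):
    print('Building dictionary...')
    dictionary = Dictionary(docs)
    # Drop stop words, skipping any stop word that never occurs in the corpus.
    stopwords = nltk_stopwords().union(additional_stopwords or ())
    stopword_ids = [dictionary.token2id[w] for w in stopwords
                    if w in dictionary.token2id]
    dictionary.filter_tokens(bad_ids=stopword_ids)
    dictionary.compactify()
    # Drop tokens that occur in fewer than `no_below` documents or in more
    # than a `no_above` fraction of all documents.
    dictionary.filter_extremes(no_below=no_below, no_above=no_above, keep_n=None)
    dictionary.compactify()

    print('Building corpus...')
    corpus = [dictionary.doc2bow(doc) for doc in docs]

    return dictionary, corpus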


In [None]:
dictionary, corpus = prep_corpus(docs['tokens'])
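
The two thresholds passed to `filter_extremes` act on document frequency: `no_below` is an absolute number of documents, while `no_above` is a fraction of the corpus. A minimal pure-Python sketch of that rule follows, using a hypothetical `filter_by_doc_freq` helper and invented toy documents. gensim's real implementation additionally supports `keep_n` and remaps the token ids.

```python
from collections import Counter

def filter_by_doc_freq(docs, no_below=5, no_above=0.5):
    """Keep tokens that appear in at least `no_below` documents and in at
    most a `no_above` fraction of all documents."""
    # set(doc) makes sure each document counts a token at most once.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return {t for t, df in doc_freq.items() if no_below <= df <= max_docs}

toy_docs = [["the", "space", "launch"],
            ["the", "space", "orbit"],
            ["the", "hockey", "team"],
            ["the", "hockey", "game"]]
print(sorted(filter_by_doc_freq(toy_docs, no_below=2, no_above=0.5)))
# → ['hockey', 'space']
```

Here `the` is dropped for being in every document, and the words that occur only once are dropped for being too rare.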

In [None]:
MmCorpus.serialize('03_data/newsgroups.mm', corpus)
dictionary.save('03_data/newsgroups.dict')

## Fitting the LDA model

In [None]:
%%time
# num_topics sets how many topics to learn; passes is the number of full sweeps over the corpus.
lda = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=50, passes=10)

lda.save('03_data/newsgroups_50_lda.model')

In [None]:
# show the 5 highest-weighted words for an arbitrary subset of 20 of the 50 topics
lda.print_topics(num_topics=20, num_words=5)

# Visualizing the model with pyLDAvis


In [None]:
import pyLDAvis
# In newer pyLDAvis releases this module is named pyLDAvis.gensim_models.
import pyLDAvis.gensim as gensimvis

In [None]:
vis_data = gensimvis.prepare(lda, corpus, dictionary)
pyLDAvis.display(vis_data)
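
The term bars in the pyLDAvis panel are ordered by a relevance score that blends a term's probability within the topic with its lift over the corpus-wide probability, controlled by the λ slider. The sketch below reproduces that formula, relevance = λ·log p(w|t) + (1 − λ)·log(p(w|t) / p(w)), on made-up probabilities. The `relevance` helper and all numbers are illustrative assumptions, not values taken from the fitted model.

```python
import math

def relevance(p_word_given_topic, p_word, lam):
    """Blend in-topic probability (lam=1) with lift over the corpus (lam=0)."""
    return (lam * math.log(p_word_given_topic)
            + (1 - lam) * math.log(p_word_given_topic / p_word))

# Invented probabilities for three terms in one topic and in the whole corpus.
p_word_topic = {"space": 0.05, "people": 0.06, "orbit": 0.02}
p_word_corpus = {"space": 0.004, "people": 0.07, "orbit": 0.001}

for lam in (1.0, 0.5, 0.0):
    ranked = sorted(p_word_topic,
                    key=lambda w: relevance(p_word_topic[w], p_word_corpus[w], lam),
                    reverse=True)
    print(lam, ranked)
# → 1.0 ['people', 'space', 'orbit']
# → 0.5 ['space', 'orbit', 'people']
# → 0.0 ['orbit', 'space', 'people']
```

Lowering λ pushes globally common words such as `people` down the list and surfaces terms that are distinctive for the topic.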

# References

* [1] Topic Models : LDA and Correlated Topic Models - https://www.slideshare.net/clauwa/topic-models-lda-and-correlated-topic-models
* [2] 20 Newsgroups Dataset - http://qwone.com/~jason/20Newsgroups/
* [3] An Introduction to gensim: "Topic Modelling for Humans" - https://www.slideshare.net/sandinmyjoints/an-introduction-to-gensim-topic-modelling-for-humans
* [4] Visualizing a Gensim model - http://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/Gensim%20Newsgroup.ipynb