# 03. Topic Modeling with Python (LDA)

* Psygrammer / About Python
* 김무성

# Contents
* Topic Modeling & LDA
* DataSet
    - Data Download
    - Exploring the dataset
* LDA with Gensim
    - Loading and tokenizing the corpus
    - Creating the dictionary, and bag of words corpus
    - Fitting the LDA model
* Visualizing the model with pyLDAvis   

# Topic Modeling & LDA
* [1] Topic Models : LDA and Correlated Topic Models - https://www.slideshare.net/clauwa/topic-models-lda-and-correlated-topic-models

# DataSet
* [2] 20 Newsgroups Dataset - http://qwone.com/~jason/20Newsgroups/

## Data Download

In [4]:
%%bash
mkdir -p 03_data
pushd 03_data > /dev/null
if [ -d "20news-bydate-train" ]
then
  echo "The data has already been downloaded..."
else
  wget http://qwone.com/%7Ejason/20Newsgroups/20news-bydate.tar.gz
  tar xfv 20news-bydate.tar.gz
  rm 20news-bydate.tar.gz
fi
echo "Let's take a look at the groups..."
ls 20news-bydate-train/
popd > /dev/null

The data has already been downloaded...
Let's take a look at the groups...
alt.atheism
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
misc.forsale
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
soc.religion.christian
talk.politics.guns
talk.politics.mideast
talk.politics.misc
talk.religion.misc


## Exploring the dataset

Each group directory contains one file per post:

In [3]:
ls -lah 03_data/20news-bydate-train/sci.space | tail  -n 5

-rw-r--r--   1 jovyan users 1.5K Mar 18  2003 61250
-rw-r--r--   1 jovyan users  889 Mar 18  2003 61252
-rw-r--r--   1 jovyan users 1.2K Mar 18  2003 61264
-rw-r--r--   1 jovyan users 1.7K Mar 18  2003 61308
-rw-r--r--   1 jovyan users 1.4K Mar 18  2003 61422
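As a rough cross-check of the listing above, the number of files per group can also be counted from Python. This is a minimal sketch: `count_docs_per_group` is a hypothetical helper (not part of the original notebook), and it assumes the `03_data/20news-bydate-train/<group>/<doc_id>` layout used here.

```python
from collections import Counter
from pathlib import Path


def count_docs_per_group(root):
    """Return {group_name: number_of_files} for a <root>/<group>/<doc_id> layout.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    counts = Counter()
    root = Path(root)
    if not root.is_dir():
        return counts  # nothing downloaded yet
    for group_dir in sorted(root.iterdir()):
        if group_dir.is_dir():
            # Count regular files only; each file is one post
            counts[group_dir.name] = sum(1 for p in group_dir.iterdir() if p.is_file())
    return counts


# Prints nothing if the dataset has not been downloaded
for group, n_docs in count_docs_per_group("03_data/20news-bydate-train").items():
    print(f"{group:30s} {n_docs}")
```

If the data is not present the helper returns an empty counter instead of raising, so the cell is safe to run before the download step.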


In [24]:
!head -n 20 03_data/20news-bydate-train/sci.space/61422

From: ralph.buttigieg@f635.n713.z3.fido.zeta.org.au (Ralph Buttigieg)
Subject: Why not give $1 billion to first year-lo
Organization: Fidonet. Gate admin is fido@socs.uts.edu.au
Lines: 34

Original to: keithley@apple.com
G'day keithley@apple.com

21 Apr 93 22:25, keithley@apple.com wrote to All:

 kc> keithley@apple.com (Craig Keithley), via Kralizec 3:713/602


 kc> But back to the contest goals, there was a recent article in AW&ST
about a
 kc> low cost (it's all relative...) manned return to the moon.  A General
 kc> Dynamics scheme involving a Titan IV & Shuttle to lift a Centaur upper
 kc> stage, LEV, and crew capsule.  The mission consists of delivering two
 kc> unmanned payloads to the lunar surface, followed by a manned mission.
 kc> Total cost:  US was $10-$13 billion.  Joint ESA(?)/NASA project was



# LDA with Gensim

* [3] An Introduction to gensim: "Topic Modelling for Humans" - https://www.slideshare.net/sandinmyjoints/an-introduction-to-gensim-topic-modelling-for-humans

## Loading and tokenizing the corpus

In [5]:
from glob import glob
import re
import string
import funcy as fp
from gensim import models
from gensim.corpora import Dictionary, MmCorpus
import nltk
import pandas as pd

In [6]:
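# Quick-and-dirty normalization: replace e-mail addresses with a "#email"
# placeholder, then turn every character outside [a-z '#] into a space.
EMAIL_REGEX = re.compile(r"[a-z0-9\.\+_-]+@[a-z0-9\._-]+\.[a-z]*")
FILTER_REGEX = re.compile(r"[^a-z '#]")
TOKEN_MAPPINGS = [(EMAIL_REGEX, "#email"), (FILTER_REGEX, ' ')]

def tokenize_line(line):
    """Lowercase a line, apply TOKEN_MAPPINGS in order, and split on whitespace."""
    res = line.lower()
    for regexp, replacement in TOKEN_MAPPINGS:
        res = regexp.sub(replacement, res)
    return res.split()

def tokenize(lines, token_size_filter=2):
    """Tokenize all lines, keeping tokens longer than token_size_filter characters."""
    tokens = fp.mapcat(tokenize_line, lines)
    return [t for t in tokens if len(t) > token_size_filter]


def load_doc(filename):
    """Read one post; group and document id are the last two path components."""
    group, doc_id = filename.split('/')[-2:]
    # errors='ignore' drops bytes that cannot be decoded
    with open(filename, errors='ignore') as f:
        doc = f.readlines()
    return {'group': group,
            'doc': doc,
            'tokens': tokenize(doc),
            'id': doc_id}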


docs = pd.DataFrame(list(map(load_doc, glob('03_data/20news-bydate-train/*/*')))).set_index(['group','id'])
docs.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,doc,tokens
group,id,Unnamed: 2_level_1,Unnamed: 3_level_1
alt.atheism,49960,"[From: mathew <mathew@mantis.co.uk>\n, Subject...","[from, mathew, #email, subject, alt, atheism, ..."
alt.atheism,51060,"[From: mathew <mathew@mantis.co.uk>\n, Subject...","[from, mathew, #email, subject, alt, atheism, ..."
alt.atheism,51119,[From: I3150101@dbstu1.rz.tu-bs.de (Benedikt R...,"[from, #email, benedikt, rosenau, subject, gos..."
alt.atheism,51120,"[From: mathew <mathew@mantis.co.uk>\n, Subject...","[from, mathew, #email, subject, university, vi..."
alt.atheism,51121,"[From: strom@Watson.Ibm.Com (Rob Strom)\n, Sub...","[from, #email, rob, strom, subject, soc, motss..."
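To make the tokenizer's behaviour concrete, here is a self-contained sketch applied to two header lines. It re-declares the regexes from the cell above and swaps `fp.mapcat` for `itertools.chain.from_iterable` so that it runs without funcy; the sample lines are made up for illustration.

```python
import re
from itertools import chain

EMAIL_REGEX = re.compile(r"[a-z0-9\.\+_-]+@[a-z0-9\._-]+\.[a-z]*")
FILTER_REGEX = re.compile(r"[^a-z '#]")
TOKEN_MAPPINGS = [(EMAIL_REGEX, "#email"), (FILTER_REGEX, " ")]


def tokenize_line(line):
    res = line.lower()
    for regexp, replacement in TOKEN_MAPPINGS:
        res = regexp.sub(replacement, res)
    return res.split()


def tokenize(lines, token_size_filter=2):
    # chain.from_iterable stands in for funcy's mapcat (an assumption of this sketch)
    tokens = chain.from_iterable(tokenize_line(line) for line in lines)
    return [t for t in tokens if len(t) > token_size_filter]


# Made-up sample lines in the style of a newsgroup header
sample = [
    "From: mathew <mathew@mantis.co.uk>\n",
    "Subject: Alt.Atheism FAQ: It is 42!\n",
]
tokens = tokenize(sample)
print(tokens)
# ['from', 'mathew', '#email', 'subject', 'alt', 'atheism', 'faq']
```

The address collapses to `#email`, punctuation and digits become spaces, and the two-letter tokens `it` and `is` are dropped by the length filter.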


## Creating the dictionary, and bag of words corpus

<img src="03_figures/bow.jpg" width=600 />

In [7]:

def nltk_stopwords():
    # Requires the NLTK stopword list: run nltk.download('stopwords') once.
    return set(nltk.corpus.stopwords.words('english'))

def prep_corpus(docs, additional_stopwords=(), no_below=5, no_above=0.5):
    print('Building dictionary...')
    dictionary = Dictionary(docs)
    # Remove stopwords, skipping those that never occur in the corpus
    stopwords = nltk_stopwords().union(additional_stopwords)
    stopword_ids = [dictionary.token2id[w] for w in stopwords if w in dictionary.token2id]
    dictionary.filter_tokens(stopword_ids)
    dictionary.compactify()
    # Drop tokens seen in fewer than no_below documents
    # or in more than a no_above fraction of all documents
    dictionary.filter_extremes(no_below=no_below, no_above=no_above, keep_n=None)
    dictionary.compactify()

    print('Building corpus...')
    corpus = [dictionary.doc2bow(doc) for doc in docs]

    return dictionary, corpus
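
The `no_below` and `no_above` arguments are passed straight to gensim's `Dictionary.filter_extremes`: tokens that occur in fewer than `no_below` documents, or in more than a fraction `no_above` of all documents, are dropped. The following plain-Python sketch applies that rule to a toy corpus. `filter_by_doc_freq` is a hypothetical helper, not gensim's implementation; it ignores `keep_n` and truncates the upper bound to a whole number of documents.

```python
from collections import Counter


def filter_by_doc_freq(docs, no_below=5, no_above=0.5):
    """Return the set of tokens whose document frequency is within range."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = int(no_above * len(docs))  # fractional threshold -> absolute count
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}


# Toy corpus: "the" is in every document, "space" in three of four
toy_docs = [
    ["space", "nasa", "the"],
    ["space", "orbit", "the"],
    ["god", "jesus", "the"],
    ["space", "the"],
]
kept = filter_by_doc_freq(toy_docs, no_below=2, no_above=0.75)
print(sorted(kept))
# ['space']
```

Here `the` is removed for being too common and the single-document words for being too rare, which is the effect the defaults `no_below=5, no_above=0.5` aim for on the full corpus.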


In [8]:
dictionary, corpus = prep_corpus(docs['tokens'])

Building dictionary...
Building corpus...
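
`dictionary.doc2bow(doc)` turns a token list into a sparse bag-of-words vector: a list of `(token_id, count)` pairs sorted by id, with tokens that are not in the dictionary ignored. A rough plain-Python equivalent is sketched below. `doc_to_bow` and the hand-built `token2id` mapping are hypothetical and only for illustration; gensim assigns its own ids.

```python
from collections import Counter


def doc_to_bow(doc, token2id):
    """Count known tokens and return sorted (token_id, count) pairs."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())


# Hand-built vocabulary (an assumption of this sketch)
token2id = {"nasa": 0, "orbit": 1, "space": 2}
doc = ["space", "nasa", "space", "shuttle"]  # "shuttle" is out of vocabulary
bow = doc_to_bow(doc, token2id)
print(bow)
# [(0, 1), (2, 2)]
```

Word order is discarded; only how often each dictionary token occurs in the document survives, which is all LDA uses.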


In [10]:
MmCorpus.serialize('03_data/newsgroups.mm', corpus)
dictionary.save('03_data/newsgroups.dict')

## Fitting the LDA model

In [11]:
%%time
lda = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=50, passes=10)
                                      
lda.save('03_data/newsgroups_50_lda.model')

CPU times: user 13min 14s, sys: 7min 16s, total: 20min 31s
Wall time: 13min 12s


In [12]:
# print the most contributing words for 20 randomly selected topics
lda.print_topics(num_topics=20, num_words=5)

[(4,
  '0.034*"new" + 0.033*"april" + 0.026*"york" + 0.014*"energy" + 0.014*"massacre"'),
 (31,
  '0.062*"arms" + 0.052*"nuclear" + 0.024*"francisco" + 0.016*"inc" + 0.013*"newsletter"'),
 (32,
  '0.019*"ground" + 0.016*"light" + 0.016*"power" + 0.011*"wire" + 0.010*"one"'),
 (14,
  '0.011*"question" + 0.010*"one" + 0.010*"would" + 0.009*"evidence" + 0.008*"argument"'),
 (30,
  '0.065*"space" + 0.023*"program" + 0.023*"nasa" + 0.015*"jobs" + 0.012*"year"'),
 (10,
  '0.054*"window" + 0.026*"senate" + 0.016*"pgp" + 0.015*"win" + 0.015*"manager"'),
 (3,
  '0.021*"turkey" + 0.019*"men" + 0.014*"gay" + 0.014*"world" + 0.010*"muslims"'),
 (20,
  '0.019*"said" + 0.013*"one" + 0.012*"people" + 0.009*"went" + 0.008*"day"'),
 (17,
  '0.012*"software" + 0.011*"graphics" + 0.009*"color" + 0.007*"sun" + 0.007*"display"'),
 (41,
  '0.023*"god" + 0.014*"people" + 0.012*"jesus" + 0.009*"christian" + 0.008*"believe"'),
 (9,
  '0.037*"university" + 0.015*"computer" + 0.013*"would" + 0.013*"pittsburgh" +
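
Each topic above is rendered as a string of `weight*"word"` terms joined by ` + `. When the numbers are needed programmatically, `lda.show_topic(topic_id)` should return `(word, probability)` pairs directly; if only the printed strings are available, they can be parsed instead. The sketch below uses a hypothetical helper, `parse_topic`, on the string shown for topic 30.

```python
import re

# Matches terms such as 0.065*"space"
TERM_REGEX = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic(topic_string):
    """Split a printed topic string into (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM_REGEX.findall(topic_string)]


topic_30 = '0.065*"space" + 0.023*"program" + 0.023*"nasa" + 0.015*"jobs" + 0.012*"year"'
terms = parse_topic(topic_30)
print(terms)
```

The weights are the per-topic word probabilities rounded for display, so the five values shown do not sum to 1.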

# Visualizing the model with pyLDAvis


In [13]:
# In pyLDAvis 3.3 and later this module is named pyLDAvis.gensim_models
import pyLDAvis.gensim as gensimvis
import pyLDAvis


In [14]:
vis_data = gensimvis.prepare(lda, corpus, dictionary)
pyLDAvis.display(vis_data)


# References
* [1] Topic Models : LDA and Correlated Topic Models - https://www.slideshare.net/clauwa/topic-models-lda-and-correlated-topic-models
* [2] 20 Newsgroups Dataset - http://qwone.com/~jason/20Newsgroups/
* [3] An Introduction to gensim: "Topic Modelling for Humans" - https://www.slideshare.net/sandinmyjoints/an-introduction-to-gensim-topic-modelling-for-humans
* [4] Visualizing a Gensim model - http://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/Gensim%20Newsgroup.ipynb