# Topic Modelling of Ubuntu Dataset
- Time to complete: 7 hours

- In this notebook, we do some topic modelling of the Ubuntu Dataset, a large corpus of multi-turn chat dialogues between users and tech support for Ubuntu-related issues.

- We'll do the following:
    - Vectorize a streamed corpus
    - Run topic modelling on streamed vectors, using gensim's Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA) algorithms.
    - Determine Top 10 Topics of our training set ('dialogs/4' folder)
    - Evaluate topic models on a test set (30K files in 'dialogs/5' folder)
    - Create a Topic Predictor

In [1]:
from tqdm import tqdm
import itertools

import numpy as np
import gensim
import glob
import nltk

In [2]:
# Find all pathnames for chat dialogues in the data/dialogs/4 directory
chat_dialogues = glob.glob('data/dialogs/4/*.tsv')

In [3]:
chat_dialogues[:10]

['data/dialogs/4/1.tsv',
 'data/dialogs/4/10.tsv',
 'data/dialogs/4/100.tsv',
 'data/dialogs/4/1000.tsv',
 'data/dialogs/4/10000.tsv',
 'data/dialogs/4/100000.tsv',
 'data/dialogs/4/100001.tsv',
 'data/dialogs/4/100002.tsv',
 'data/dialogs/4/100003.tsv',
 'data/dialogs/4/100004.tsv']

## Load Example Conversation

In [4]:
import csv

def read_tsv_file(chat_file):
    """
    Extract text from each row in the chat log file
    and return a list of dialogue containing the
    entire conversation.
    """
    dialogue = []
    with open(chat_file) as tsv_file:
        reader = csv.reader(tsv_file, delimiter='\t')
        for row in reader:
            dialogue.append(str(row[3]).strip())
    return dialogue

In [5]:
chat_example = read_tsv_file(chat_dialogues[2])
print chat_example

['Ahhh how to get fix nautilus from going slow in hoary?', 'when you upgraded, did you install gamin?', 'what does that do?', 'if you have an apt or sed pacakge there...']
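To see what `read_tsv_file` relies on, here is a minimal sketch that parses an in-memory sample instead of a file on disk. The column layout (timestamp, sender, recipient, utterance) is an assumption inferred from the `row[3]` index used above, and `SAMPLE_TSV` and `read_tsv_lines` are hypothetical names. Unlike the notebook cells, the sketch is written for Python 3, and it skips rows with too few columns rather than raising an `IndexError`.

```python
import csv
import io

# Hypothetical sample in the assumed layout: timestamp, sender, recipient, utterance.
SAMPLE_TSV = (
    "2005-01-01T10:00:00.000Z\talice\t\tHow do I fix nautilus?\n"
    "2005-01-01T10:01:00.000Z\tbob\talice\t  did you install gamin?  \n"
    "malformed row without enough columns\n"
)

def read_tsv_lines(handle, text_column=3):
    """Return the stripped utterance from every row that has enough columns."""
    dialogue = []
    # QUOTE_NONE keeps stray quote characters in chat text from confusing the parser.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) > text_column:  # skip short or empty rows
            dialogue.append(row[text_column].strip())
    return dialogue

sample_dialogue = read_tsv_lines(io.StringIO(SAMPLE_TSV))
print(sample_dialogue)
```

The malformed third row is dropped, so only the two utterances are returned.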


## Ubuntu Corpus

Let's stream over an entire file directory of chat dialogues.

In [6]:
from gensim.parsing.preprocessing import strip_punctuation
from gensim.utils import simple_preprocess

STOPWORDS = nltk.corpus.stopwords.words('english')

def extract_text_from_chat_dialogue(chat_file):
    chat_dialogue = read_tsv_file(chat_file)
    return ' '.join(chat_dialogue)

def tokenize(text):
    return [token for token in simple_preprocess(strip_punctuation(text.strip())) if token not in STOPWORDS]

def iter_ubuntu(chat_files):
    for chat_file in chat_files[1:]:
        text = extract_text_from_chat_dialogue(chat_file)
        tokens = tokenize(text)
        yield tokens
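The point of `iter_ubuntu` being a generator is that documents are tokenized only when something asks for them. The sketch below illustrates this with plain Python. `sketch_tokenize` only approximates the notebook's `tokenize` (lowercasing, alphabetic tokens of 2 to 15 letters, stopword removal), and the tiny stopword set and the sample documents are made up for illustration.

```python
import itertools
import re

# Tiny illustrative stopword set; the notebook uses NLTK's full English list.
SKETCH_STOPWORDS = {"how", "to", "the", "in", "is", "a", "my", "i", "do"}

def sketch_tokenize(text):
    """Lowercase, keep alphabetic runs of 2 to 15 letters, drop stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if 2 <= len(w) <= 15 and w not in SKETCH_STOPWORDS]

def iter_tokens(texts):
    """Lazily yield one token list per document."""
    for text in texts:
        yield sketch_tokenize(text)

documents = [
    "How to fix Nautilus going slow in Hoary?",
    "My wireless card is not detected!",
    "Grub error after the upgrade",
]

stream = iter_tokens(documents)                # nothing is tokenized yet
first_two = list(itertools.islice(stream, 2))  # consumes only two documents
remaining = list(stream)                       # the generator resumes at document three
print(first_two)
print(remaining)
```

Because a generator can be consumed only once, the notebook creates a fresh stream whenever it needs another pass over the corpus.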

In [7]:
stream = iter_ubuntu(chat_dialogues)
for i, tokens in enumerate(itertools.islice(stream, 10)):
    print "Document: {}".format(i+1)
    print "Tokens: {}".format(tokens[:10])
    print 

Document: 1
Tokens: [u'add', u'lines', u'xinetd', u'hup', u'load', u'server', u'hrm', u'installed', u'hotwayd', u'giving']

Document: 2
Tokens: [u'ahhh', u'get', u'fix', u'nautilus', u'going', u'slow', u'hoary', u'upgraded', u'install', u'gamin']

Document: 3
Tokens: [u'anyone', u'use', u'xorg', u'edgers', u'ppa', u'curiousity', u'ppas', u'unsupported', u'rd', u'party']

Document: 4
Tokens: [u'ssh', u'encryption', u'channels', u'available', u'freenode', u'connect', u'ssl', u'irc', u'freenode', u'net']

Document: 5
Tokens: [u'installed', u'ubuntu', u'server', u'would', u'need', u'install', u'get', u'graphical', u'application', u'run']

Document: 6
Tokens: [u'serious', u'wierd', u'problem', u'file', u'etc', u'resolv', u'conf', u'cannot', u'accessed', u'removed']

Document: 7
Tokens: [u'boot', u'command', u'line', u'grub', u'command', u'line', u'actual', u'ubuntu', u'command', u'line']

Document: 8
Tokens: [u'ffs', u'enough', u'people', u'problem', u'upgrading', u'get', u'error', u'downlo

## Dictionaries

We need a mapping from raw text tokens to integer ids because most machine learning algorithms rely on numerical libraries indexed by integers rather than by strings, and they have to know the vector/matrix dimensionality in advance.

The mapping can be constructed automatically by giving the Dictionary class a stream of tokenized documents:

In [8]:
chat_stream = (tokens for tokens in tqdm(iter_ubuntu(chat_dialogues)))

0it [00:00, ?it/s]

In [9]:
%time id2word_ubuntu = gensim.corpora.Dictionary(chat_stream)
print id2word_ubuntu

269022it [02:00, 2241.42it/s]

CPU times: user 1min 13s, sys: 16.3 s, total: 1min 29s
Wall time: 1min 59s
Dictionary(124597 unique tokens: [u'fawn', u'unsupportable', u'fawk', u'mdraid', u'userscripts']...)





Our dictionary now contains every word that appeared in the corpus, along with the number of documents each word appeared in. Let's filter out both very infrequent words (those in fewer than 10 documents) and very frequent, stopword-like words (those in more than 10% of documents) to free up resources and remove noise.

In [10]:
id2word_ubuntu.filter_extremes(no_below=10, no_above=0.1)
print id2word_ubuntu

Dictionary(14492 unique tokens: [u'fawn', u'adviced', u'fucked', u'libmad', u'icmp']...)
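The filtering rule can be sketched in plain Python. This is not gensim's implementation, only an illustration of the document-frequency thresholds. `document_frequencies`, `filter_vocabulary` and the toy documents are hypothetical.

```python
from collections import Counter

def document_frequencies(docs):
    """Count in how many documents each token occurs (not total occurrences)."""
    dfs = Counter()
    for tokens in docs:
        dfs.update(set(tokens))  # set() so a token counts once per document
    return dfs

def filter_vocabulary(docs, no_below, no_above):
    """Keep tokens present in at least `no_below` documents and in at most
    a fraction `no_above` of all documents."""
    dfs = document_frequencies(docs)
    max_docs = no_above * len(docs)
    return sorted(tok for tok, df in dfs.items() if no_below <= df <= max_docs)

toy_docs = [
    ["ubuntu", "grub", "boot"],
    ["ubuntu", "apt", "package"],
    ["ubuntu", "grub", "error"],
    ["ubuntu", "apt", "mdraid"],
]
kept = filter_vocabulary(toy_docs, no_below=2, no_above=0.5)
print(kept)
```

Here `ubuntu` is dropped for being in every document, while `boot`, `package`, `error` and `mdraid` are dropped for appearing only once.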


## Vectorization


A streamed corpus and a dictionary are all we need to create bag-of-words vectors.

Let's wrap the entire dialogue directory as a stream of bag-of-words vectors.

In [13]:
class UbuntuCorpus(object):
    def __init__(self, dialogue_directory, dictionary, clip_docs=None):
        """
        Stream the first `clip_docs` Ubuntu chat dialogues
        from directory `dialogue_directory`. Yield each
        document in turn, as a bag-of-words vector
        (a list of (token_id, count) tuples).
        """        
        self.directory = dialogue_directory
        self.dictionary = dictionary
        self.clip_docs = clip_docs
        
    def __iter__(self):
        self.titles = []
        chat_files = glob.glob(self.directory + '/*.tsv')
        for tokens in tqdm(itertools.islice(iter_ubuntu(chat_files), self.clip_docs)):
            yield self.dictionary.doc2bow(tokens)
            
    def len(self):
        return self.clip_docs
            

In [14]:
ubuntu_corpus = UbuntuCorpus('data/dialogs/4', id2word_ubuntu)
vector = next(iter(ubuntu_corpus))
print vector

0it [00:00, ?it/s]

[(0, 1), (1, 2), (2, 1), (3, 1), (4, 2), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1)]
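Each pair in the vector above is a (token id, count) tuple. As a rough illustration of what a bag-of-words conversion does, here is a pure-Python sketch. `build_token2id` and `to_bow` are hypothetical helpers, and the id assignment order is an assumption that need not match gensim's.

```python
from collections import Counter

def build_token2id(docs):
    """Assign each new token the next free integer id, in order of first appearance."""
    token2id = {}
    for tokens in docs:
        for token in tokens:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, ignoring out-of-vocabulary tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

toy_docs = [["grub", "boot", "error"], ["apt", "package", "apt"]]
token2id = build_token2id(toy_docs)
bow = to_bow(["apt", "apt", "boot", "nvidia"], token2id)
print(token2id)
print(bow)
```

The unseen token `nvidia` is silently ignored, which is also what happens to test-set words that are missing from the training dictionary.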





In [15]:
# what is the most common word in that first document?
most_index, most_count = max(vector, key=lambda (word_index, count): count)
print(id2word_ubuntu[most_index], most_count)

(u'giving', 2)


Let's store all those bag-of-words vectors in a file, so we don't have to parse the chat logs
again every time we need them.

In [17]:
%time gensim.corpora.MmCorpus.serialize('./data/gensim_models/ubuntu_bow.mm', ubuntu_corpus)

269022it [02:30, 1789.77it/s]


CPU times: user 1min 36s, sys: 21.5 s, total: 1min 57s
Wall time: 2min 30s


In [18]:
ubuntu_corpus = gensim.corpora.MmCorpus('./data/gensim_models/ubuntu_bow.mm')
print ubuntu_corpus

MmCorpus(269022 documents, 14492 features, 4560490 non-zero entries)


## Topic Modelling: Semantic Transformations

### Latent Dirichlet Allocation

In [25]:
# use fewer documents during experimentation (LDA is slow) to see if topic_extraction is working well or not

clipped_corpus = gensim.utils.ClippedCorpus(ubuntu_corpus, 10000)

%time lda_model = gensim.models.LdaModel(clipped_corpus, num_topics=25, id2word=id2word_ubuntu, passes=4)

CPU times: user 1min 4s, sys: 224 ms, total: 1min 5s
Wall time: 1min 5s


In [44]:
# Use all ~269K documents for training our LDA model that will extract 25 latent topics
# from our Ubuntu corpus.

%time lda_model = gensim.models.LdaModel(ubuntu_corpus, num_topics=25, id2word=id2word_ubuntu, passes=4)

CPU times: user 21min 55s, sys: 3.06 s, total: 21min 58s
Wall time: 22min 2s


### Problem 1: LDA Top 10 Topics

In [56]:
def prettify_topic_representations(lda, corpus, num_topics, top_n):
    """
    Print representations of the top_n terms of the topic along
    with the coherence for each topic
    """
    top_topics = lda.top_topics(corpus, topn=top_n)[:num_topics]
    for i, t in tqdm(enumerate(top_topics)):
        topic_repr, coherence_score = t
        top_words = [t[1] for t in topic_repr]
        print "Topic: {}".format(i+1)
        print "Topic Representations: {}".format(" | ".join(top_words))
        print "Coherence Score: {0:0.3f}".format(coherence_score)
        print

In [57]:
prettify_topic_representations(lda_model, ubuntu_corpus, num_topics=10, top_n=10)

10it [00:00, 3321.17it/s]

Topic: 1
Topic Representations: windows | partition | drive | gb | xp | want | hard | linux | disk | boot
Coherence Score: -2.303

Topic: 2
Topic Representations: server | network | wireless | connect | internet | ip | connection | router | using | set
Coherence Score: -2.410

Topic: 3
Topic Representations: apt | package | sudo | list | packages | update | remove | installed | synaptic | manager
Coherence Score: -2.418

Topic: 4
Topic Representations: linux | work | would | well | ve | one | windows | think | really | good
Coherence Score: -2.496

Topic: 5
Topic Representations: cd | usb | live | dev | iso | drive | mount | dvd | image | boot
Coherence Score: -2.550

Topic: 6
Topic Representations: question | someone | ask | channel | please | hi | hello | tell | problem | guys
Coherence Score: -2.573

Topic: 7
Topic Representations: thanks | good | program | looking | way | want | would | software | something | one
Coherence Score: -2.594

Topic: 8
Topic Representations: gnome | desk




### Stacked Semantic Transformation: Tfidf + Latent Semantic Analysis

Here we'll train a TFIDF model, and then train Latent Semantic Analysis on top of TFIDF:

In [28]:
%time tfidf_model = gensim.models.TfidfModel(ubuntu_corpus, id2word=id2word_ubuntu)

CPU times: user 22.9 s, sys: 145 ms, total: 23.1 s
Wall time: 23.1 s


In [29]:
%time lsi_model = gensim.models.LsiModel(tfidf_model[ubuntu_corpus], id2word=id2word_ubuntu, num_topics=100)

CPU times: user 1min 27s, sys: 4 s, total: 1min 31s
Wall time: 54.6 s
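To make the TF-IDF step more concrete, here is a small pure-Python sketch of one common weighting scheme: raw count times log2(N / df), followed by scaling each document to unit length. Whether this matches the exact defaults of the gensim version used here is an assumption, and `fit_idf`, `tfidf_transform` and the toy corpus are hypothetical.

```python
import math

def fit_idf(bow_corpus):
    """Compute idf = log2(N / df) for every token id seen in the corpus."""
    num_docs = len(bow_corpus)
    dfs = {}
    for bow in bow_corpus:
        for token_id, _count in bow:
            dfs[token_id] = dfs.get(token_id, 0) + 1
    return {tid: math.log(num_docs / float(df), 2) for tid, df in dfs.items()}

def tfidf_transform(bow, idf):
    """Weight counts by idf, drop zero weights, and scale to unit length."""
    weighted = [(tid, count * idf[tid]) for tid, count in bow if idf.get(tid, 0.0) > 0.0]
    norm = math.sqrt(sum(w * w for _tid, w in weighted))
    return [(tid, w / norm) for tid, w in weighted] if norm else []

# Toy bag-of-words corpus: token 0 occurs in every document.
toy_corpus = [
    [(0, 1), (1, 2)],
    [(0, 1), (2, 1)],
    [(0, 3), (1, 1)],
    [(0, 1), (3, 1)],
]
idf = fit_idf(toy_corpus)
weights = tfidf_transform(toy_corpus[0], idf)
print(idf)
print(weights)
```

Token 0 appears in every toy document, so its idf is zero and it vanishes from the transformed vector. This is why TF-IDF dampens ubiquitous words before LSA sees them.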


### Problem 1: LSA Top 10 Topics

In [63]:
def prettify_lsa_top_topics(lsa, num_topics, num_words):
    """
    Print LSA's n most significant topics where n = num_topics.
    For each topic, print the topic's m most significant words where m = num_words.
    """
    top_topics = lsa.show_topics(num_topics=num_topics, num_words=num_words, formatted=False)
    for idx, topic_repr in top_topics:
        top_words = [t[0] for t in topic_repr]
        print "Topic: {}".format(idx+1)
        print "Topic Representations: {}".format(" | ".join(top_words))
        print

In [64]:
prettify_lsa_top_topics(lsa=lsi_model, num_topics=10, num_words=10)

Topic: 1
Topic Representations: windows | apt | sudo | hi | file | linux | want | one | installed | cd

Topic: 2
Topic Representations: apt | sudo | windows | grub | boot | cd | partition | package | drive | paste

Topic: 3
Topic Representations: paste | http | com | please | punctuation | flood | sudo | enter | apt | root

Topic: 4
Topic Representations: root | grub | sudo | boot | paste | password | partition | bit | card | cd

Topic: 5
Topic Representations: apt | root | grub | password | upgrade | update | user | boot | package | cd

Topic: 6
Topic Representations: grub | nvidia | drivers | card | driver | xorg | boot | cd | conf | ati

Topic: 7
Topic Representations: gnome | kde | desktop | kubuntu | root | ask | sudo | linux | menu | grub

Topic: 8
Topic Representations: ask | grub | question | file | channel | hello | hi | nvidia | xorg | please

Topic: 9
Topic Representations: cd | windows | live | grub | file | upgrade | root | iso | linux | wine

Topic: 10
Topic Representatio

### Problem 1: Discussion

As we can see above, LDA provides more human interpretable topics than LSA. We can further evaluate the quality of our LDA model. This will be done later in the notebook.

By inspecting the top 10 topic representations of our LDA model, we think the topics are the following:
    1. Partitioning a Hard Drive for Ubuntu Alongside Windows XP or Another Windows OS
    2. Network / Wifi / Internet Related Issues
    3. Apt Command Issues
    4. Linux Suggestions
    5. Ubuntu Device Mounting
    6. Questions Asking for Help to Resolve a Technical Problem (too general)
    7. Software Recommendation
    8. Linux Desktop Environments
    9. Root User Privileges
    10. Web Related Issues

## Transforming Unseen Documents 

In [34]:
text = extract_text_from_chat_dialogue('data/dialogs/5/100003.tsv')

bow_vector = id2word_ubuntu.doc2bow(tokenize(text))
print [(id2word_ubuntu[id], count) for id, count in bow_vector]

[(u'add', 1), (u'installed', 1), (u'apt', 1), (u'without', 1), (u'web', 2), (u'see', 1), (u'let', 1), (u'download', 1), (u'want', 1), (u'remove', 1), (u'already', 1), (u'everything', 2), (u'take', 1), (u'program', 2), (u'downloaded', 1), (u'var', 1), (u'cache', 1), (u'programs', 1), (u'archives', 1), (u'indeed', 1), (u'aptoncd', 1), (u'stored', 1)]


In [69]:
def prettify_model_vector_relevant_topics(model, vector, num_topics):
    """
    Print the n most significant topics of a model vector
    where n = num_topics.
    """
    most_prominent_topics = list(sorted(vector, key=lambda t: t[1], reverse=True))[:num_topics]
    for idx, t in enumerate(most_prominent_topics):
        topic_repr = model.show_topic(t[0])
        top_words = [t[0] for t in topic_repr]
        print "Topic: {}".format(idx+1)
        print "Topic Representations: {}".format(" | ".join(top_words))
        print

In [73]:
# transform into LDA space
lda_vector = lda_model[bow_vector]

# print the document's top 3 most prominent LDA topics
prettify_model_vector_relevant_topics(lda_model, lda_vector, num_topics=3)

Topic: 1
Topic Representations: apt | package | sudo | list | packages | update | remove | installed | synaptic | manager

Topic: 2
Topic Representations: thanks | good | program | looking | way | want | would | software | something | one

Topic: 3
Topic Representations: terminal | open | ssh | run | nautilus | port | file | virtualbox | virtual | machine



In [74]:
# transform into LSI space
lsi_vector = lsi_model[tfidf_model[bow_vector]]

# print the document's top 3 most prominent LSI topics (not interpretable like LDA!)
prettify_model_vector_relevant_topics(lsi_model, lsi_vector, num_topics=3)

Topic: 1
Topic Representations: windows | apt | sudo | hi | file | linux | want | one | installed | cd

Topic: 2
Topic Representations: work | im | update | using | package | installed | remove | doesn | log | see

Topic: 3
Topic Representations: apt | root | grub | password | upgrade | update | user | boot | package | cd
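One caveat when ranking LSI topics: unlike LDA topic probabilities, LSI weights are signed, so sorting by raw weight pushes strongly negative loadings to the bottom. The sketch below contrasts the two orderings on a made-up vector. `top_topics` and `toy_vector` are hypothetical, and ranking by magnitude is offered as an alternative, not as what the helper above does.

```python
def top_topics(topic_vector, num_topics, by_magnitude=False):
    """Return the `num_topics` strongest (topic_id, weight) pairs."""
    if by_magnitude:
        key = lambda pair: abs(pair[1])  # treat large negative loadings as strong
    else:
        key = lambda pair: pair[1]       # raw weight, as in the helper above
    return sorted(topic_vector, key=key, reverse=True)[:num_topics]

# Hypothetical LSI-style vector with signed weights.
toy_vector = [(0, 0.12), (1, -0.45), (2, 0.30), (3, -0.02)]

by_weight = top_topics(toy_vector, 2)
by_abs = top_topics(toy_vector, 2, by_magnitude=True)
print(by_weight)
print(by_abs)
```

With raw weights, topic 1 is never selected even though it has the largest loading in absolute terms.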



## Topic Model Evaluation

Our topic models are unsupervised models. Thus, we have no a priori knowledge of what the topics ought to look like, which makes evaluation difficult. Unlike supervised learning, where we simply compare predicted labels to actual labels, topic modelling has no labels.

Each topic model (LDA, LSA) has its own way of measuring the internal quality of its predictions. The best way to evaluate the quality of unsupervised tasks, however, is to measure how much they improve the downstream task, the one we're actually training them for.

For example, when the ultimate goal is to retrieve semantically similar documents, we can manually tag a set of similar documents and then see how well a given semantic model maps those similar documents together.

For our evaluation, we will use a semi-automated task to assess the quality of our topic models. We'll split each document in our test set into two parts, and check that:
    1. Topics of the first half are similar to topics of the second.
    2. Halves of different documents are mostly dissimilar.
The similarity metric we will be using for our model evaluation is cosine similarity.
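As a reference for the metric, here is a minimal pure-Python sketch of cosine similarity on sparse (id, weight) vectors, plus the half-splitting step. `sparse_cosine` and `split_halves` are hypothetical helpers written for illustration. They are not gensim functions.

```python
import math

def sparse_cosine(vec1, vec2):
    """Cosine similarity of two sparse vectors given as (id, weight) pairs."""
    d1, d2 = dict(vec1), dict(vec2)
    dot = sum(w * d2.get(i, 0.0) for i, w in d1.items())
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # empty or all-zero vectors share no direction
    return dot / (norm1 * norm2)

def split_halves(tokens):
    """Split a token list into a first and a second half."""
    middle = len(tokens) // 2
    return tokens[:middle], tokens[middle:]

first, second = split_halves(["grub", "boot", "error", "fix", "menu"])
same_direction = sparse_cosine([(0, 1.0), (2, 2.0)], [(0, 2.0), (2, 4.0)])
disjoint = sparse_cosine([(0, 1.0)], [(1, 1.0)])
print(first, second)
print(same_direction, disjoint)
```

Vectors pointing in the same direction score close to 1 regardless of length, while vectors with no shared topics score 0. A good model should therefore score high for halves of the same document and low for halves of unrelated ones.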

In [75]:
# evaluate on 30K documents not used in LDA / LSA training.
# we will be using the chat dialogs found in the dialogs/5 folder
test_chat_files = glob.glob('data/dialogs/5/*.tsv')
test_doc_stream = (tokens for tokens in iter_ubuntu(test_chat_files))
test_docs = list(itertools.islice(test_doc_stream, 5000, 35000))

In [79]:
def evaluate_model(model, test_docs, num_pairs=100000):
    # split each test document into two halves and compute topics for each half
    part1 = [model[id2word_ubuntu.doc2bow(tokens[: len(tokens) // 2])] for tokens in test_docs]
    part2 = [model[id2word_ubuntu.doc2bow(tokens[len(tokens) // 2 :])] for tokens in test_docs]
    
    # print computed similarities (uses cossim)
    similarity_1 = np.mean([gensim.matutils.cossim(p1, p2) for p1, p2 in zip(part1, part2)])
    print "Average cosine similarity between corresponding parts (higher is better): {}".format(similarity_1)

    random_pairs = np.random.randint(0, len(test_docs), size=(num_pairs, 2))
    similarity_2 = np.mean([gensim.matutils.cossim(part1[i[0]], part2[i[1]]) for i in random_pairs])
    print "Average cosine similarity between 100,000 random parts (lower is better): {}".format(similarity_2)

### LDA Evaluation

In [81]:
%time evaluate_model(lda_model, test_docs)

Average cosine similarity between corresponding parts (higher is better): 0.426623166567
Average cosine similarity between 100,000 random parts (lower is better): 0.178994328871
CPU times: user 56.8 s, sys: 192 ms, total: 57 s
Wall time: 57.1 s


### LSA Evaluation

In [83]:
%time evaluate_model(lsi_model, test_docs)

Average cosine similarity between corresponding parts (higher is better): 0.259041220771
Average cosine similarity between 100,000 random parts (lower is better): 0.0540667903812
CPU times: user 25.2 s, sys: 558 ms, total: 25.7 s
Wall time: 25.7 s


## Model Persistence

Save the models to disk, so they can be reused later (or on a different computer).

In [84]:
lda_model.save('./data/gensim_models/lda_ubuntu.model')
lsi_model.save('./data/gensim_models/lsa_ubuntu.model')
tfidf_model.save('./data/gensim_models/tfidf_ubuntu.model')
id2word_ubuntu.save('./data/gensim_models/ubuntu.dictionary')

## Problem 2: Create A Topic Detector

Since our LDA model yields the more interpretable topics and the higher similarity between corresponding document halves, we now write a topic detector using our LDA topic model to generate a set of relevant topics for a given conversation (.tsv file).

In [101]:
import csv
import gensim
import nltk
from gensim.parsing.preprocessing import strip_punctuation
from gensim.utils import simple_preprocess

class UbuntuTopicDetector(object):
    def __init__(self):
        """
        Initialize our LDA model, gensim Dictionary object, and array for STOPWORDs.
        """
        self.model = gensim.models.LdaModel.load('./data/gensim_models/lda_ubuntu.model')
        self.id2word_ubuntu = gensim.corpora.Dictionary.load('./data/gensim_models/ubuntu.dictionary')
        self.STOPWORDS = nltk.corpus.stopwords.words('english')
        
    def extract_text_from_chat_dialogue(self, chat_file):
        """
        Extract Text from chat .tsv file.
        """
        chat_dialogue = self.read_tsv_file(chat_file)
        return ' '.join(chat_dialogue)
    
    def predict_topics(self, chat_file, top_n=3):
        """
        Predict n most relevant topics for a given
        chat file where n = top_n.
        """
        bow_vector = self.vectorize_document(chat_file)
        lda_vector = self.transform_into_semantic_space(bow_vector)
        print "Chat File: {}".format(chat_file)
        self.print_relevant_topics(lda_vector, num_topics=top_n)

    def print_relevant_topics(self, vector, num_topics):
        """
        Print the n most significant topics of a model vector
        where n = num_topics.
        """
        most_prominent_topics = list(sorted(vector, key=lambda t: t[1], reverse=True))[:num_topics]
        for idx, t in enumerate(most_prominent_topics):
            topic_repr = self.model.show_topic(t[0])
            top_words = [t[0] for t in topic_repr]
            print "Topic: {}".format(idx+1)
            print "Topic Representations: {}".format(" | ".join(top_words))
            print
        
    def read_tsv_file(self, chat_file):
        """
        Extract text from each row in the chat log file
        and return a list of dialogue containing the
        entire conversation.
        """
        dialogue = []
        with open(chat_file) as tsv_file:
            reader = csv.reader(tsv_file, delimiter='\t')
            for row in reader:
                dialogue.append(str(row[3]).strip())
        return dialogue
    
    def transform_into_semantic_space(self, vector):
        """
        Transform bag of words vector into LDA semantic
        space.
        """
        lda_vector = self.model[vector]
        return lda_vector

    def tokenize(self, text):
        """
        Remove punctuation and lowercase text, then
        generate tokens of our chat file.
        """
        return [token for token in simple_preprocess(strip_punctuation(text.strip())) if token not in self.STOPWORDS]
    
    def vectorize_document(self, chat_file):
        """
        Vectorize document into a bag of words vector.
        """
        text = self.extract_text_from_chat_dialogue(chat_file)
        bow_vector = self.id2word_ubuntu.doc2bow(self.tokenize(text))
        return bow_vector

In [102]:
topic_detector = UbuntuTopicDetector()

In [104]:
# Extract the 3 most relevant topics for a file in 'dialogs/5' folder
topic_detector.predict_topics(test_chat_files[5])

Chat File: data/dialogs/5/100000.tsv
Topic: 1
Topic Representations: kernel | source | files | find | delete | file | make | linux | code | build

Topic: 2
Topic Representations: command | file | sudo | root | user | etc | line | password | folder | home

Topic: 3
Topic Representations: http | com | org | www | php | java | html | google | page | ubuntuforums



In [105]:
topic_detector.predict_topics(test_chat_files[9])

Chat File: data/dialogs/5/100004.tsv
Topic: 1
Topic Representations: grub | error | boot | installed | problem | trying | fix | message | getting | usr

Topic: 2
Topic Representations: thanks | good | program | looking | way | want | would | software | something | one

Topic: 3
Topic Representations: screen | system | alt | back | restart | problem | panel | settings | ctrl | terminal

