This notebook uses Gensim's LDA implementation to build topic features for text classification on the 20 Newsgroups dataset.

# 1. Get data from scikit-learn

fetch_20newsgroups loads the 20 Newsgroups dataset: the corpus consists of forum posts, and the label of each post is the subforum (newsgroup) it was posted to.

In [1]:
import sklearn
print(sklearn.__version__)
from sklearn.datasets import fetch_20newsgroups
# get train and test data ('train' is the default subset, made explicit here)
dataset = fetch_20newsgroups(subset='train')
test_data = fetch_20newsgroups(subset='test')

## The forum is divided into 20 categories

In [2]:
print(dataset.target_names)
# construct a mapping from target id to target name (avoids shadowing the builtin id)
id2names = dict(enumerate(dataset.target_names))

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [3]:
# get raw text and targets of train and test set
raw_text = dataset.data
test_text = test_data.data
target = dataset.target
test_target = test_data.target
# a sample 
print(raw_text[500])
print(dataset.target_names[target[500]])

From: bjorndahl@augustana.ab.ca
Subject: Re: document of .RTF
Organization: Augustana University College, Camrose, Alberta
Lines: 10

In article <1993Mar30.113436.7339@worak.kaist.ac.kr>, tjyu@eve.kaist.ac.kr (Yu TaiJung) writes:
> Does anybody have document of .RTF file or know where I can get it?
> 
> Thanks in advance. :)

I got one from Microsoft tech support.

-- 
Sterling G. Bjorndahl, bjorndahl@Augustana.AB.CA or bjorndahl@camrose.uucp
Augustana University College, Camrose, Alberta, Canada      (403) 679-1100

comp.os.ms-windows.misc


# 2. Preprocess text

- Split metadata and text.
- For metadata:
    - Pick only the Subject line.
- For text:
    1. Split into sentences.
    2. Lowercase each sentence.
    3. Tokenize into words, drop English stop words, and lemmatize.
- Tokenizer:
    1. Replace email addresses with the token EMAIL.
    2. Replace each run of digits with the token NUM.
    3. Keep \$.
    4. Replace all other punctuation with spaces.
- Maintain a word list of tokens that appear in at least 3 documents (the no_below=3 filter applied when the dictionary is built).

In [4]:
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re

# NLTK's tokenizer, stop word and WordNet data must be installed (see nltk.download)
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()
# split metadata and text at the first blank line
def split_metadata(data):
    occurrence = data.find("\n\n")
    if occurrence == -1:
        # no blank line found: treat the whole post as text
        return "", data
    metadata = data[0:occurrence] + "\n"
    text = data[occurrence+2:]
    return metadata, text
# get subject from metadata
def get_subject(metadata):
    # subject starts with "Subject: " and ends with \n
    regex = r'Subject: (.*)\n'
    match = re.search(regex, metadata)
    # return an empty subject if the header is missing
    return match.group(1) if match else ""
# replace email addresses with the EMAIL token
def del_email(text):
    regex = r'\S*@\S*'
    return re.sub(regex, " EMAIL ", text)
# replace each run of digits with the NUM token
def replace_num(text):
    regex = r'[0-9]+'
    return re.sub(regex, " NUM ", text)
# replace special characters (all but word characters, whitespace and $) with spaces
def remove(text):
    regex = r'[^\w\s$]'
    return re.sub(regex, " ", text)
# lowercase, clean, tokenize, drop stop words and lemmatize
def process_text(text):
    cleaned = remove(replace_num(del_email(text.lower())))
    return [lemmatizer.lemmatize(word) for word in word_tokenize(cleaned)
            if word not in stop_words]

def process_data(data):
    metadata, text = split_metadata(data)
    subject = get_subject(metadata)

    tokenized_text = []
    for sent in sent_tokenize(text):
        tokenized_text += process_text(sent)

    tokenized_subject = []
    for sent in sent_tokenize(subject):
        tokenized_subject += process_text(sent)
    return tokenized_subject, tokenized_text
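
To make the effect of the cleaning rules concrete, here is a minimal sketch that applies the same three regular expressions to one made-up sentence. It uses a plain whitespace split in place of NLTK's word_tokenize and skips stop word removal and lemmatization, so it only illustrates the email, number and punctuation rules; the sample sentence and the normalize helper are assumptions for illustration.

```python
import re

# Same patterns as the helpers above, written as raw strings.
EMAIL_RE = re.compile(r'\S*@\S*')
NUM_RE = re.compile(r'[0-9]+')
PUNCT_RE = re.compile(r'[^\w\s$]')

def normalize(text):
    """Lowercase, then apply the email, number and punctuation rules in order."""
    text = text.lower()
    text = EMAIL_RE.sub(" EMAIL ", text)
    text = NUM_RE.sub(" NUM ", text)
    return PUNCT_RE.sub(" ", text)

sample = "Call 555-1234 or mail bob@example.com for $20!"  # made-up sentence
tokens = normalize(sample).split()  # whitespace split stands in for word_tokenize
print(tokens)
# ['call', 'NUM', 'NUM', 'or', 'mail', 'EMAIL', 'for', '$', 'NUM']
```

Because lowercasing happens before the replacements, the EMAIL and NUM placeholders are the only uppercase tokens in the result.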

## Extract Subject and content from the raw text

We extract the subject and the content and concatenate their tokens into a single bag of words per post.

In [45]:
tokenized_subject = []
tokenized_text = []
tokenized_data = []
for data in raw_text:
    subject, text = process_data(data)
    tokenized_subject.append(subject)
    tokenized_text.append(text)
    tokenized_data.append(subject+text)
print(tokenized_data[0])

['car', 'wondering', 'anyone', 'could', 'enlighten', 'car', 'saw', 'day', 'NUM', 'door', 'sport', 'car', 'looked', 'late', 'NUM', 'early', 'NUM', 'called', 'bricklin', 'door', 'really', 'small', 'addition', 'front', 'bumper', 'separate', 'rest', 'body', 'know', 'anyone', 'tellme', 'model', 'name', 'engine', 'spec', 'year', 'production', 'car', 'made', 'history', 'whatever', 'info', 'funky', 'looking', 'car', 'please', 'e', 'mail', 'thanks', 'il', 'brought', 'neighborhood', 'lerxst']


In [46]:
tokenized_test_subject = []
tokenized_test_text = []
tokenized_test_data = []
for data in test_text:
    subject, text = process_data(data)
    tokenized_test_subject.append(subject)
    tokenized_test_text.append(text)
    tokenized_test_data.append(subject+text)
print(tokenized_test_data[0])

['need', 'info', 'NUM', 'NUM', 'bonneville', 'little', 'confused', 'model', 'NUM', 'NUM', 'bonnevilles', 'heard', 'le', 'se', 'lse', 'sse', 'ssei', 'could', 'someone', 'tell', 'difference', 'far', 'feature', 'performance', 'also', 'curious', 'know', 'book', 'value', 'prefereably', 'NUM', 'model', 'much', 'le', 'book', 'value', 'usually', 'get', 'word', 'much', 'demand', 'time', 'year', 'heard', 'mid', 'spring', 'early', 'summer', 'best', 'time', 'buy', 'neil', 'gandler']


## Construct dictionary for words

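Build a token dictionary from the training data and convert every document into a bag-of-words vector of (token id, count) pairs.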

In [48]:
from gensim.corpora.dictionary import Dictionary
from gensim.models.ldamodel import LdaModel
# Create a corpus from a list of texts
dictionary = Dictionary(tokenized_data)
dictionary.filter_extremes(no_below=3)
id2token = {dictionary.token2id[key]:key for key in dictionary.token2id}
train_corpus = [dictionary.doc2bow(data) for data in tokenized_data]
test_corpus = [dictionary.doc2bow(text) for text in tokenized_test_data]
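
As a rough illustration of what the dictionary step does, the sketch below reimplements the two ideas in plain Python: drop tokens that occur in too few documents, then count the remaining tokens per document. It is not Gensim's implementation. The helper names build_vocab and to_bow and the toy documents are made up, ids are assigned alphabetically as a simplification, and the sketch ignores the other filter_extremes defaults (no_above=0.5 and keep_n=100000).

```python
from collections import Counter

def build_vocab(docs, no_below):
    """Map token -> id for tokens found in at least `no_below` documents."""
    doc_freq = Counter(token for doc in docs for token in set(doc))
    kept = sorted(token for token, df in doc_freq.items() if df >= no_below)
    return {token: idx for idx, token in enumerate(kept)}

def to_bow(doc, token2id):
    """Return sorted (token id, count) pairs, ignoring out-of-vocabulary tokens."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

# toy documents, made up for illustration
docs = [["car", "door", "car"], ["car", "engine"], ["door", "bumper"]]
vocab = build_vocab(docs, no_below=2)
bow = to_bow(["car", "car", "door", "lerxst"], vocab)
print(vocab)  # {'car': 0, 'door': 1}
print(bow)    # [(0, 2), (1, 1)]
```

Tokens that are missing from the vocabulary, such as rare words in the test set, are silently dropped. The same happens when test documents are converted with a dictionary built from the training data only.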

# 3. Model training

## Train LDA

In [10]:
# Training takes a while
# After training, you can save the model to a file and load it the next time you want to use it
lda = LdaModel(train_corpus, num_topics=80, passes=60, alpha='auto', eta='auto')
#lda = LdaModel.load("LDA.model")



In [54]:
lda.save("LDA.model")

### Show words related to latent topics

When the number of latent topics is large, it is hard to tell which real-world subject each latent topic corresponds to. Some topics are still easy to interpret. For example, topic 20 corresponds to the Middle East situation and topic 28 to Christian topics. In the output below, topic 0 is clearly about disk drives and topic 59 about guns and crime.

In [11]:
for topicid in range(lda.num_topics):
    # top 10 (token id, probability) pairs for this topic
    words = lda.get_topic_terms(topicid, 10)
    print("\nTOPIC", topicid)
    for token_id, probability in words:
        print(id2token[token_id], probability)


TOPIC 0
drive 0.10090592
scsi 0.0466975
disk 0.037938867
mb 0.03108228
hard 0.025293235
controller 0.025189012
ide 0.021899153
floppy 0.016549204
data 0.013085397
system 0.012324662

TOPIC 1
test 0.10506905
routine 0.03162179
tool 0.026738605
marc 0.025632067
spin 0.02338653
curve 0.020570217
lyme 0.019696984
fast 0.018788008
assembly 0.01864737
d 0.015238313

TOPIC 2
chip 0.0284169
clipper 0.02686753
encryption 0.025204936
key 0.022711411
government 0.01849979
escrow 0.012972927
phone 0.01232325
technology 0.010776169
agency 0.010555687
device 0.010394449

TOPIC 3
new 0.039328814
gm 0.035340246
san 0.030861193
period 0.027927106
chicago 0.022816796
york 0.021558529
boston 0.0204662
ranger 0.02009307
montreal 0.019813212
st 0.017541356

TOPIC 4
lib 0.018368693
doug 0.012171152
division 0.012111455
methodology 0.010943945
band 0.009992175
stealth 0.00984297
libxmu 0.008918841
navy 0.008625456
naval 0.008307801
xmu 0.008120593

TOPIC 5
armenian 0.055902354
turkish 0.03936709
greek 0.023


TOPIC 58
point 0.034320455
tv 0.03283343
ad 0.0149672665
split 0.014615643
group 0.013619583
wave 0.011479736
traffic 0.011258529
illinois 0.011132371
sphere 0.010505528
ye 0.0101337675

TOPIC 59
gun 0.12169341
weapon 0.040746294
firearm 0.035097867
crime 0.034339022
control 0.022996297
criminal 0.021337925
handgun 0.017705085
rate 0.01617521
death 0.013879171
law 0.01364769

TOPIC 60
dave 0.035726402
jack 0.018975936
morris 0.015290907
company 0.014997512
nick 0.013790476
tom 0.011428905
name 0.011234978
phone 0.009969673
cf 0.009967811
corp 0.009730023

TOPIC 61
auto 0.041422985
safety 0.03277085
helmet 0.023475083
semi 0.022943381
automatic 0.0226548
license 0.022495115
accident 0.01618868
gang 0.015938828
section 0.013775008
dangerous 0.013760279

TOPIC 62
mouse 0.0633465
port 0.05813361
com 0.05713299
serial 0.03730819
dealer 0.029582646
irq 0.02309743
modem 0.02235522
timer 0.014811884
week 0.011190817
maple 0.009492347

TOPIC 63
wire 0.0426883
pin 0.036886644
ground 0.035539057

### Let's take a test document and see which topics are relevant to it



In [50]:
lda.get_document_topics(test_corpus[1])

[(12, 0.02131758),
 (17, 0.029484889),
 (22, 0.017023325),
 (24, 0.36867407),
 (27, 0.044134196),
 (30, 0.038956758),
 (37, 0.069273286),
 (39, 0.1343311),
 (50, 0.079914905),
 (54, 0.034327153),
 (73, 0.07550456),
 (77, 0.025523305)]
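
A quick way to read such a list is to rank the topics by weight. The sketch below does this for the pairs printed above. It also sums the weights, which come to less than 1: Gensim omits topics whose probability falls under a threshold (the minimum_probability setting, 0.01 by default as far as I know), so the remaining mass is spread over the topics that are not listed.

```python
# (topic id, weight) pairs copied from the output above
doc_topics = [(12, 0.02131758), (17, 0.029484889), (22, 0.017023325),
              (24, 0.36867407), (27, 0.044134196), (30, 0.038956758),
              (37, 0.069273286), (39, 0.1343311), (50, 0.079914905),
              (54, 0.034327153), (73, 0.07550456), (77, 0.025523305)]

# sort by weight, largest first
ranked = sorted(doc_topics, key=lambda pair: pair[1], reverse=True)
top3 = ranked[:3]
print(top3)  # [(24, 0.36867407), (39, 0.1343311), (50, 0.079914905)]

# total weight accounted for by the listed topics
covered = sum(weight for _, weight in doc_topics)
print(round(covered, 2))  # 0.94
```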

In [51]:
print(test_text[1])

From: Rick Miller <rick@ee.uwm.edu>
Subject: X-Face?
Organization: Just me.
Lines: 17
Distribution: world
NNTP-Posting-Host: 129.89.2.33
Summary: Go ahead... swamp me.  <EEP!>

I'm not familiar at all with the format of these "X-Face:" thingies, but
after seeing them in some folks' headers, I've *got* to *see* them (and
maybe make one of my own)!

I've got "dpg-view" on my Linux box (which displays "uncompressed X-Faces")
and I've managed to compile [un]compface too... but now that I'm *looking*
for them, I can't seem to find any X-Face:'s in anyones news headers!  :-(

Could you, would you, please send me your "X-Face:" header?

I *know* I'll probably get a little swamped, but I can handle it.

	...I hope.

Rick Miller  <rick@ee.uwm.edu> | <ricxjo@discus.mil.wi.us>   Ricxjo Muelisto
Send a postcard, get one back! | Enposxtigu bildkarton kaj vi ricevos alion!
          RICK MILLER // 16203 WOODS // MUSKEGO, WIS. 53150 // USA



## Now we use document topics as features and train a logistic regression model on them

In [14]:
def get_feature_topic(topics, number_topics):
    features = [0] * number_topics
    for topic, relevance in topics:
        features[topic] = relevance
    return features

In [15]:
train_feature = []
number_of_topics = lda.num_topics
for data in train_corpus:
    feature = get_feature_topic(lda.get_document_topics(data), number_of_topics)
    train_feature.append(feature)
test_feature = []
for data in test_corpus:
    feature = get_feature_topic(lda.get_document_topics(data), number_of_topics)
    test_feature.append(feature)

On the test set the accuracy is 64.56%. With 20 classes, uniform random guessing would score about 5%, so this is a reasonable result for a multi-class problem.

In [52]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(multi_class='ovr')
model.fit(train_feature, target)
model.score(train_feature, target)



0.6696128690118437

In [53]:
model.score(test_feature, test_target)

0.6456452469463622
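
The multi_class='ovr' argument asks for one binary classifier per class. To my knowledge, recent scikit-learn releases deprecate that argument, so check this against your installed version. An explicit alternative is to wrap the estimator in OneVsRestClassifier. The sketch below shows the pattern on synthetic data standing in for the topic features; it is not a rerun of the experiment above, and the resulting scores are not comparable.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# synthetic stand-in for the topic features; not the newsgroup data
X, y = make_classification(n_samples=150, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

# one binary logistic regression per class, wrapped explicitly
ovr_model = OneVsRestClassifier(LogisticRegression())
ovr_model.fit(X, y)
n_estimators = len(ovr_model.estimators_)  # one fitted estimator per class
train_accuracy = ovr_model.score(X, y)
print(n_estimators, train_accuracy)
```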

### We also trained a Support Vector Machine classifier

We tuned the SVM hyperparameters; the cell below uses the resulting values of C and gamma.

In [54]:
from sklearn.svm import SVC
model = SVC(C=40,gamma=0.2)
model.fit(train_feature, target)
model.score(train_feature, target)

0.7493371044723351

On the test set the accuracy is 66.58%.

In [55]:
model.score(test_feature, test_target)

0.6658258098778544
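
The notebook does not record how C and gamma were tuned. One common approach, shown here as an assumption rather than the procedure actually used, is a cross-validated grid search with scikit-learn's GridSearchCV. The candidate values are hypothetical, and the data is synthetic so that the sketch runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# synthetic stand-in for (train_feature, target); not the newsgroup data
X, y = make_classification(n_samples=120, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# hypothetical candidate values; the grid used for the notebook is not recorded
param_grid = {"C": [1, 10, 40], "gamma": [0.02, 0.2]}

# 3-fold cross-validation over every (C, gamma) combination
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Selecting hyperparameters by cross-validation on the training set keeps the test set untouched, so the reported test accuracy remains an honest estimate.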

# 4. Model evaluation

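We can use a confusion matrix to see what kinds of mistakes our model makes when classifying topics. Note that the matrix below is computed on the training set, using the SVM classifier trained above.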

In [56]:
from sklearn.metrics import confusion_matrix
import pandas as pd
# rows are true labels, columns are predicted labels (training set)
dataframe = pd.DataFrame(confusion_matrix(target, model.predict(train_feature)),
                         index=dataset.target_names,
                         columns=dataset.target_names)

In [58]:
dataframe

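| | alt.atheism | comp.graphics | comp.os.ms-windows.misc | comp.sys.ibm.pc.hardware | comp.sys.mac.hardware | comp.windows.x | misc.forsale | rec.autos | rec.motorcycles | rec.sport.baseball | rec.sport.hockey | sci.crypt | sci.electronics | sci.med | sci.space | soc.religion.christian | talk.politics.guns | talk.politics.mideast | talk.politics.misc | talk.religion.misc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| alt.atheism | 295 | 0 | 0 | 0 | 0 | 1 | 0 | 5 | 6 | 0 | 1 | 2 | 3 | 1 | 4 | 90 | 11 | 8 | 16 | 37 |
| comp.graphics | 3 | 367 | 43 | 31 | 27 | 60 | 19 | 0 | 1 | 0 | 0 | 5 | 10 | 6 | 9 | 2 | 0 | 0 | 0 | 1 |
| comp.os.ms-windows.misc | 0 | 51 | 408 | 38 | 27 | 43 | 12 | 1 | 1 | 1 | 1 | 0 | 5 | 0 | 2 | 0 | 0 | 1 | 0 | 0 |
| comp.sys.ibm.pc.hardware | 0 | 26 | 44 | 349 | 100 | 11 | 26 | 5 | 2 | 1 | 0 | 4 | 17 | 4 | 1 | 0 | 0 | 0 | 0 | 0 |
| comp.sys.mac.hardware | 1 | 27 | 29 | 78 | 351 | 1 | 40 | 2 | 1 | 3 | 1 | 0 | 36 | 5 | 1 | 0 | 0 | 0 | 2 | 0 |
| comp.windows.x | 0 | 87 | 82 | 7 | 5 | 384 | 7 | 2 | 2 | 2 | 0 | 3 | 2 | 4 | 5 | 0 | 1 | 0 | 0 | 0 |
| misc.forsale | 0 | 11 | 7 | 20 | 24 | 4 | 454 | 12 | 8 | 3 | 2 | 2 | 22 | 3 | 9 | 1 | 1 | 0 | 2 | 0 |
| rec.autos | 2 | 7 | 5 | 0 | 3 | 1 | 22 | 458 | 20 | 7 | 0 | 4 | 31 | 14 | 5 | 1 | 3 | 4 | 5 | 2 |
| rec.motorcycles | 4 | 2 | 1 | 1 | 2 | 3 | 18 | 95 | 440 | 1 | 2 | 0 | 10 | 5 | 3 | 1 | 2 | 0 | 5 | 3 |
| rec.sport.baseball | 2 | 2 | 1 | 0 | 1 | 0 | 4 | 3 | 5 | 537 | 23 | 0 | 4 | 3 | 2 | 1 | 5 | 1 | 2 | 1 |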


Each row of the confusion matrix represents a true class label, and each column represents the label assigned by our model. The matrix supports several observations. For example:

1. Many of the mistakes on alt.atheism assign the label soc.religion.christian (90 threads).
2. 60 threads of comp.graphics are assigned to comp.windows.x.
3. 100 threads of comp.sys.ibm.pc.hardware are assigned to comp.sys.mac.hardware.

etc.

Many mistakes happen between threads that share a parent subforum, for example between comp.os.ms-windows.misc and the other comp. groups.

One likely reason is that the LDA topic features are too coarse to distinguish between closely related subforums. Finer-grained features may be needed to separate them.
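
Reading the largest off-diagonal cells by eye gets tedious with 20 classes. The sketch below ranks the confusions automatically, using only the five comp. rows and columns copied from the confusion matrix above. The variable names are made up for illustration; the same loop would work on the full matrix.

```python
# comp.* block of the confusion matrix above (rows: true, columns: predicted)
labels = ["comp.graphics", "comp.os.ms-windows.misc", "comp.sys.ibm.pc.hardware",
          "comp.sys.mac.hardware", "comp.windows.x"]
block = [
    [367, 43, 31, 27, 60],
    [51, 408, 38, 27, 43],
    [26, 44, 349, 100, 11],
    [27, 29, 78, 351, 1],
    [87, 82, 7, 5, 384],
]

# collect every off-diagonal cell as (true label, predicted label, count)
confusions = [(labels[i], labels[j], block[i][j])
              for i in range(len(labels))
              for j in range(len(labels)) if i != j]
confusions.sort(key=lambda item: item[2], reverse=True)

for true_label, predicted_label, count in confusions[:3]:
    print(f"{true_label} -> {predicted_label}: {count}")
# comp.sys.ibm.pc.hardware -> comp.sys.mac.hardware: 100
# comp.windows.x -> comp.graphics: 87
# comp.windows.x -> comp.os.ms-windows.misc: 82
```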

# 5. Further methods to try

Other topic modeling and text classification methods can be applied to get a finer representation of these texts. Possible choices are Latent Semantic Indexing and deep learning methods.