# News classification with topic models in gensim
News agencies all over the world classify news articles on a huge scale. In this tutorial we look at how topic modeling can be used to classify news articles into categories such as sports, technology, and politics.

Our aim in this tutorial is to build a topic model whose topics we can easily interpret. Such a topic model can be used to discover hidden structure in the corpus and to determine which topic a news article belongs to.

For this tutorial, we will be using the Lee corpus, which is a shortened version of the [Lee Background Corpus](http://www.socsci.uci.edu/~mdlee/lee_pincombe_welsh_document.PDF). The shortened version consists of 300 documents selected from the Australian Broadcasting Corporation's news mail service. It consists of texts of headline stories from around 2000-2001.

### Requirements
In this tutorial we look at how different topic models can be easily created using [gensim](https://radimrehurek.com/gensim/).
The dependencies for this tutorial are:

- gensim
- matplotlib
- nltk.stopwords and nltk.wordnet
- pyLDAvis

In [1]:
import os
import re
import operator
import matplotlib.pyplot as plt
import warnings
import gensim
import numpy as np
warnings.filterwarnings('ignore')  # Let's not pay heed to them right now

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.DEBUG)

import nltk

from gensim.models import CoherenceModel, LdaModel
from gensim.corpora import Dictionary
from pprint import pprint

%matplotlib inline

import load_lee_background_corpus as load_texts

Let's look at the first few documents in our corpus:

- The first document talks about a bushfire that had occurred in New South Wales.
- The second talks about conflict between India and Pakistan in Kashmir.
- The third talks about road accidents in the New South Wales area.
- The fourth talks about Argentina's economic and political crisis during that time.
- The fifth talks about the use of drugs by midwives in a Sydney hospital.

Our final topic model should give us keywords that we can easily interpret and turn into a short summary. Without this, the topic model is of little practical use.

In [2]:
lee_train_file = load_texts.get_lee_train_file()
with open(lee_train_file) as f:
    for n, line in enumerate(f):
        if n < 6:
            # Wrap the line in a list so its raw form (including escapes) is printed
            print([line])

/Users/home/Desenvolvimento/anaconda3/lib/python3.6/site-packages/gensim/test/test_data
['Hundreds of people have been forced to vacate their homes in the Southern Highlands of New South Wales as strong winds today pushed a huge bushfire towards the town of Hill Top. A new blaze near Goulburn, south-west of Sydney, has forced the closure of the Hume Highway. At about 4:00pm AEDT, a marked deterioration in the weather as a storm cell moved east across the Blue Mountains forced authorities to make a decision to evacuate people from homes in outlying streets at Hill Top in the New South Wales southern highlands. An estimated 500 residents have left their homes for nearby Mittagong. The New South Wales Rural Fire Service says the weather conditions which caused the fire to burn in a finger formation have now eased and about 60 fire units in and around Hill Top are optimistic of defending all properties. As more than 100 blazes burn on New Year\'s Eve in New South Wales, fire crews have bee

### Loading raw and preprocessed texts

In [3]:
raw_texts = load_texts.get_raw_texts()
print(raw_texts[5])
raw_texts = [nltk.tokenize.word_tokenize(raw_text) for raw_text in raw_texts]
print(raw_texts[5])

/Users/home/Desenvolvimento/anaconda3/lib/python3.6/site-packages/gensim/test/test_data
The Federal Government says it should be safe for Afghani asylum seekers in Australia to return home when the environment becomes secure. The Government has suspended their applications while the interim government is established in Kabul. The Foreign Affairs Minister Alexander Downer has refused to say for how long the claims process has been put on hold. But he says the major threat to most people seeking asylum is no longer there. "Many Afghans who have tried to get into Australia or for that matter into Britain and other countries in north-west Europe have claimed that they are fleeing the Taliban," he said. "Well, the Taliban is no longer in power in Afghanistan, the Taliban is finished." Meanwhile, there has been a mass airlift of detainees from Christmas Island to the Pacific Island of Nauru. In total, more than 300 people have been flown from the island in two operations using chartered airc

In [4]:
train_texts = load_texts.get_train_texts()
print(train_texts[5])

/Users/home/Desenvolvimento/anaconda3/lib/python3.6/site-packages/gensim/test/test_data


2017-11-22 02:13:40,434 : INFO : collecting all words and their counts
2017-11-22 02:13:40,436 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2017-11-22 02:13:40,477 : INFO : collected 20429 word types from a corpus of 19878 words (unigram + bigrams) and 300 sentences
2017-11-22 02:13:40,479 : INFO : using 20429 counts as vocab in Phrases<0 vocab, min_count=5, threshold=10.0, max_vocab_size=40000000>


Returning 300 training texts
['federal_government', 'afghani', 'asylum_seekers', 'australia', 'home', 'environment', 'secure', 'government', 'application', 'government', 'kabul', 'foreign_affairs', 'minister_alexander', 'downer', 'claim', 'process', 'hold', 'threat', 'people', 'asylum', 'afghan', 'australia', 'matter', 'britain', 'country', 'europe', 'taliban', 'taliban', 'power', 'afghanistan', 'taliban', 'mass', 'airlift', 'detainee', 'christmas', 'island', 'pacific', 'island', 'nauru', 'people', 'island', 'operation', 'aircraft', 'airlift', 'today', 'asylum_seekers', 'nauru', 'processing', 'claim', 'visa', 'department', 'immigration', 'detainee', 'christmas', 'island', 'spokesman', 'decision', 'future']


### Create a dictionary and a corpus for each set of texts

### Create an LdaModel for each set of parameters, for both raw and processed texts
1. Set chunksize to the total number of documents, since it's a small corpus.
2. Use no more than 5 passes, since the documents converge soon enough.
3. Use 400 iterations so that the documents converge appropriately.
4. Use 'auto' for the eta and alpha parameters.
5. Use 10 topics for each model.

Enable debug logging to track the progress of the training.

### Visualize each model trained with pyLDAvis

pyLDAvis is a great way to visualize an LDA model. In short, the area of each circle represents the prevalence of its topic, and the length of each bar on the right represents the membership of a term in a particular topic. pyLDAvis is based on [this](http://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf) paper.

### Calculate the topic coherence for each model. Which one got the better coherence?