<b> Step 1 : Load the Dataset </b>
    
Dataset: https://www.kaggle.com/datasets/akash14/news-category-dataset?select=Data_Train.csv

About the dataset:

* Size of training set: 7,628 records
* Size of test set: 2,748 records

Features:

* STORY: A part of the main content of the article to be published as a piece of news.
* SECTION: The genre/category the STORY falls into. Each story belongs to one of four distinct sections.

The sections are labelled as follows:

* Politics: 0
* Technology: 1
* Entertainment: 2
* Business: 3
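
The label encoding above can be kept in a small lookup table so that numeric `SECTION` values stay readable when inspecting rows. This is a minimal sketch; `SECTION_NAMES` and `section_name` are illustrative names and are not part of the dataset or the notebook.

```python
# Label encoding as documented in the dataset description above.
SECTION_NAMES = {
    0: "Politics",
    1: "Technology",
    2: "Entertainment",
    3: "Business",
}


def section_name(label):
    """Return the readable section name for a numeric SECTION label."""
    return SECTION_NAMES[int(label)]


print(section_name(3))  # Business
```

Once the training data is loaded, `train_data_df['SECTION'].map(SECTION_NAMES)` should give the same names as a column.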



In [1]:
import pandas as pd
import random

train_data_df = pd.read_csv("Data_train.csv", header = 0,encoding='cp1252') 
train_data_df

Unnamed: 0,STORY,SECTION
0,But the most painful was the huge reversal in ...,3
1,How formidable is the opposition alliance amon...,0
2,Most Asian currencies were trading lower today...,3
3,"If you want to answer any question, click on ‘...",1
4,"In global markets, gold prices edged up today ...",3
...,...,...
7623,"Karnataka has been a Congress bastion, but it ...",0
7624,"The film, which also features Janhvi Kapoor, w...",2
7625,The database has been created after bringing t...,1
7626,"The state, which has had an uneasy relationshi...",0


In [2]:
test_data_df = pd.read_csv("Data_test.csv", header = 0,encoding='cp1252') 
test_data_df

Unnamed: 0,STORY
0,2019 will see gadgets like gaming smartphones ...
1,It has also unleashed a wave of changes in the...
2,It can be confusing to pick the right smartpho...
3,The mobile application is integrated with a da...
4,We have rounded up some of the gadgets that sh...
...,...
2743,"According to researchers, fraud in the mobile ..."
2744,The iPhone XS and XS Max share the Apple A12 c...
2745,"On the photography front, the Note 5 Pro featu..."
2746,UDAY mandated that discoms bring the gap betwe...


In [3]:
train_data_df.shape

(7628, 2)

In [4]:
'''
Previewing a sample story (index 200) from the train data.
'''
train_data_df['STORY'][200]

'“The whole thing feels like a giant set, stately and ponderous and minus impact; the cast all costumed and perfumed and largely lifeless, sparking only in bits and pieces,” a section of her review reads\n\n\n On a star based rating, I would give this movie a blackhole"Kalank mints money overseasTrade analyst Taran Adarsh shared Kalank\'s overseas performance\n\n\n05 mn\nAustralia: A$ 620k"Sonakshi Sinha on failure: I don\'t lose hopeSonakshi Sinha has been having a bad run at the box office with her past few films including Kalank failing to impress the audience'

In [5]:
test_data_df.shape

(2748, 1)

<b> Step 2 : Data Preprocessing</b>
```
1.) Tokenization: Split the text into sentences and the sentences into words. Lowercase the words and remove punctuation.
2.) Words that have 3 or fewer characters are removed.
3.) All stopwords are removed.
4.) Words are lemmatized - words in third person are changed to first person and verbs in past and future tenses are changed into present.
5.) Words are stemmed - words are reduced to their root form.
```
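
Steps 1 to 3 can be illustrated without any third-party library. The sketch below only approximates what `simple_preprocess` and the gensim stopword list do: the regular expression, the tiny `DEMO_STOPWORDS` set, and the function name `tokenize_and_filter` are assumptions made for illustration. Lemmatization and stemming (steps 4 and 5) are left to NLTK.

```python
import re

# Tiny illustrative stopword set; gensim's STOPWORDS list is much larger.
DEMO_STOPWORDS = {"from", "and", "also", "among", "others", "with", "this", "that"}


def tokenize_and_filter(text, min_len=4):
    """Lowercase, split on non-letters, drop stopwords and short tokens."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if t not in DEMO_STOPWORDS and len(t) >= min_len]


print(tokenize_and_filter("Apart from Aniston and Sandler, the film also stars Luke Evans"))
# ['apart', 'aniston', 'sandler', 'film', 'stars', 'luke', 'evans']
```

The default `min_len=4` mirrors a `len(token) > 3` check, so three-letter words such as "the" are dropped even when they are not in the stopword set.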


In [6]:
'''
Loading Gensim and nltk libraries
'''
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
np.random.seed(400)

In [7]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/pranjalimehta/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [8]:
stemmer = SnowballStemmer("english")

In [9]:
'''
Functions to perform the preprocessing steps on the entire dataset.
'''
def lemmatize_stemming(text):
    # Lemmatize the token as a verb, then stem the result
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# Tokenize, drop stopwords and short tokens, then lemmatize and stem
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        # Keep only non-stopword tokens longer than 3 characters
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

In [10]:
'''
Previewing a random story from train data after preprocessing.
'''
story_num = random.randrange(train_data_df.shape[0])  # randrange excludes the upper bound
story_sample = train_data_df['STORY'][story_num]
print("Original Story: ")
words = story_sample.split(' ')
print(words)
print("\n\nTokenized and lemmatized Story: ")
print(preprocess(story_sample))

Original Story: 
['Apart', 'from', 'Aniston', 'and', 'Sandler,', 'Murder', 'Mystery', 'also', 'features', 'Luke', 'Evans,', 'Gemma', 'Arterton,', 'John', 'Kani,', 'Ólafur', 'Darri', 'Ólafsson', 'and', 'Terrence', 'Stamp', 'among', 'others', 'in', 'pivotal', 'roles']


Tokenized and lemmatized Story: 
['apart', 'aniston', 'sandler', 'murder', 'mysteri', 'featur', 'luke', 'evan', 'gemma', 'arterton', 'john', 'kani', 'ólafur', 'darri', 'ólafsson', 'terrenc', 'stamp', 'pivot', 'role']


In [11]:
processed_stories = []

for story in train_data_df['STORY']:
    processed_stories.append(preprocess(story))

In [12]:
'''
Preview 'processed_stories'
'''
print(processed_stories[:2])

[['pain', 'huge', 'revers', 'incom', 'unheard', 'privat', 'sector', 'lender', 'essenti', 'mean', 'bank', 'take', 'grant', 'fee', 'structur', 'loan', 'deal', 'pay', 'account', 'upfront', 'book', 'borrow', 'turn', 'default', 'fee', 'tie', 'loan', 'deal', 'fell', 'crack', 'gill', 'vow', 'shift', 'safer', 'account', 'practic', 'amort', 'incom', 'book', 'upfront', 'gill', 'mend', 'past', 'way', 'mean', 'nasti', 'surpris', 'futur', 'good', 'news', 'consid', 'investor', 'love', 'clean', 'imag', 'loath', 'uncertainti', 'gain', 'pain', 'promis', 'strong', 'stabl', 'balanc', 'sheet', 'come', 'sacrific', 'investor', 'hop', 'phenomen', 'growth', 'promis', 'kapoor'], ['formid', 'opposit', 'allianc', 'congress', 'jharkhand', 'mukti', 'morcha', 'jharkhand', 'vika', 'morcha', 'prajatantrik']]


<b> Step 3 : Bag of Words on the dataset</b>
<br>
I created a dictionary from 'processed_stories' that maps each unique word in the training set to an integer id and records how often it occurs. I used 
```gensim.corpora.Dictionary()``` for this, followed by some additional filtering of the vocabulary.

Steps Involved: 

1. Create dictionary from words present in the entire training data.
2. Remove very rare and very common words from the dictionary under consideration.
3. Convert document (a list of words) into the bag-of-words format = list of (token_id, token_count) 2-tuples. Each word is assumed to be a tokenized and normalized string.
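
To make the `(token_id, token_count)` format concrete, here is a plain-Python sketch of the idea on a toy corpus. It mimics what a dictionary plus bag-of-words conversion produces, but it is not gensim's implementation; `build_vocab` and `to_bow` are hypothetical helper names, and the order in which ids are assigned may differ from gensim's.

```python
from collections import Counter


def build_vocab(docs):
    """Assign an integer id to every distinct token, in first-seen order."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Convert a token list to sorted (token_id, token_count) pairs.

    Tokens missing from the vocabulary are ignored.
    """
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())


toy_docs = [["bank", "loan", "bank"], ["film", "actor", "film", "loan"]]
vocab = build_vocab(toy_docs)
print(vocab)                       # {'bank': 0, 'loan': 1, 'film': 2, 'actor': 3}
print(to_bow(toy_docs[1], vocab))  # [(1, 1), (2, 2), (3, 1)]
```

Each document becomes a short list of pairs covering only the words it contains, which keeps the representation compact for a large vocabulary.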



In [13]:
'''
1.) Create a dictionary from 'processed_stories' containing the number of times a word appears 
in the training set using gensim.corpora.Dictionary and call it 'dictionary'
'''
dictionary = gensim.corpora.Dictionary(processed_stories)

In [14]:
'''
2.) Removing very rare and very common words:

- words appearing in fewer than 15 documents
- words appearing in more than 10% of all documents
- after that, keep at most the 100,000 most frequent words
'''
dictionary.filter_extremes(no_below=15, no_above=0.1, keep_n= 100000)
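
Both thresholds in `filter_extremes` are based on document frequency, i.e. the number of documents a token occurs in, not on its total count. Below is a plain-Python sketch of that filtering rule on a toy corpus. `filter_by_doc_freq` is a hypothetical helper written for illustration; it omits the `keep_n` cap, and its rounding of the upper bound may differ from gensim's.

```python
from collections import Counter


def filter_by_doc_freq(docs, no_below, no_above):
    """Return tokens found in at least `no_below` documents and in no more
    than a `no_above` fraction of all documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}


toy_docs = [
    ["seat", "poll", "said"],
    ["seat", "vote", "said"],
    ["phone", "said", "said"],
    ["phone", "camera", "said"],
]
# "said" occurs in all 4 documents (too common); "poll", "vote" and "camera"
# occur in only 1 (too rare); "seat" and "phone" occur in 2 and are kept.
print(sorted(filter_by_doc_freq(toy_docs, no_below=2, no_above=0.5)))  # ['phone', 'seat']
```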

In [15]:
'''
Create the bag-of-words model for each document, i.e. for each story we create a list of
(token_id, token_count) pairs recording which dictionary words appear and how many times. Save this to 'bow_corpus'.
'''
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_stories]

In [16]:
'''
Preview the BOW for a randomly chosen preprocessed story
'''
random_story_num = random.randrange(train_data_df.shape[0])  # randrange excludes the upper bound
bow_doc_x = bow_corpus[random_story_num]

for i in range(len(bow_doc_x)):
    print("Word {} (\"{}\") appears {} time.".format(bow_doc_x[i][0], 
                                                     dictionary[bow_doc_x[i][0]], 
                                                     bow_doc_x[i][1]))

Word 13 ("fell") appears 3 time.
Word 15 ("gain") appears 2 time.
Word 57 ("currenc") appears 1 time.
Word 58 ("dollar") appears 2 time.
Word 59 ("index") appears 6 time.
Word 61 ("japanes") appears 2 time.
Word 72 ("south") appears 1 time.
Word 76 ("trade") appears 1 time.
Word 105 ("equiti") appears 1 time.
Word 112 ("rise") appears 1 time.
Word 114 ("spot") appears 1 time.
Word 145 ("world") appears 1 time.
Word 235 ("korea") appears 1 time.
Word 284 ("expect") appears 1 time.
Word 433 ("increas") appears 1 time.
Word 473 ("emerg") appears 2 time.
Word 605 ("countri") appears 1 time.
Word 619 ("nation") appears 1 time.
Word 660 ("give") appears 1 time.
Word 730 ("highest") appears 3 time.
Word 741 ("boost") appears 1 time.
Word 781 ("asia") appears 1 time.
Word 846 ("region") appears 1 time.
Word 954 ("week") appears 4 time.
Word 962 ("drop") appears 1 time.
Word 972 ("surg") appears 1 time.
Word 1010 ("msci") appears 3 time.
Word 1100 ("wider") appears 1 time.
Word 1131 ("pace") ap

<b> Step 4 : Running LDA using Bag of Words</b>
<br>
I will train my LDA model using ```gensim.models.LdaMulticore```. Some of the parameters I have tried to tweak are:

1. num_topics: the number of latent topics to extract from the training corpus.
2. id2word: a mapping from word ids (integers) to words (strings). It is used to determine the vocabulary size, as well as for debugging and topic printing.
3. workers: the number of worker processes to use for parallelization. By default, all available cores minus one are used.
4. alpha and eta: hyperparameters affecting sparsity. alpha is the prior on each story's topic distribution and eta is the prior on each topic's word distribution.
5. passes: the number of training passes through the corpus.
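
The parameters listed above map directly onto keyword arguments of `gensim.models.LdaMulticore`. The sketch below only collects them in a dictionary so the choices are visible in one place. The values for `alpha`, `eta`, and `random_state` are illustrative assumptions, not tuned settings from this notebook.

```python
# Keyword arguments for gensim.models.LdaMulticore.
# Values marked "assumed" are illustrative, not tuned results.
lda_params = {
    "num_topics": 4,        # one topic per SECTION label
    "passes": 10,           # full sweeps over the corpus
    "workers": 2,           # worker processes
    "alpha": "symmetric",   # assumed: document-topic prior (gensim's default)
    "eta": 0.01,            # assumed: topic-word prior; smaller values favour sparser topics
    "random_state": 400,    # assumed: fixed seed, intended to make runs more repeatable
}

# Usage, given a bag-of-words corpus 'bow_corpus' and its 'dictionary':
# lda_model = gensim.models.LdaMulticore(bow_corpus, id2word=dictionary, **lda_params)
print(sorted(lda_params))
```

Keeping the settings in one dictionary makes it easier to rerun training with a single value changed and compare the resulting topics.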



In [17]:
'''
Training the lda model using gensim.models.LdaMulticore and saving it to 'lda_model'
For my use case I will be choosing num_topics = 4.
'''
lda_model =  gensim.models.LdaMulticore(bow_corpus, 
                                   num_topics = 4, 
                                   id2word = dictionary,                                    
                                   passes = 10,
                                   workers = 2)

In [18]:
'''
For each topic, we can see the words occurring in that topic and their relative weights.
'''
for idx, topic in lda_model.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic ))
    print("\n")

Topic: 0 
Words: 0.013*"film" + 0.006*"bank" + 0.006*"trade" + 0.006*"growth" + 0.005*"actor" + 0.005*"month" + 0.005*"crore" + 0.005*"quarter" + 0.005*"investor" + 0.005*"stock"


Topic: 1 
Words: 0.023*"seat" + 0.015*"poll" + 0.014*"sabha" + 0.013*"vote" + 0.010*"contest" + 0.010*"candid" + 0.009*"constitu" + 0.009*"minist" + 0.009*"leader" + 0.008*"phase"


Topic: 2 
Words: 0.014*"smartphon" + 0.011*"appl" + 0.010*"phone" + 0.009*"camera" + 0.009*"devic" + 0.008*"featur" + 0.007*"samsung" + 0.007*"googl" + 0.007*"launch" + 0.007*"iphon"


Topic: 3 
Words: 0.011*"govern" + 0.010*"modi" + 0.006*"minist" + 0.006*"issu" + 0.006*"data" + 0.005*"secur" + 0.005*"polit" + 0.005*"nation" + 0.005*"countri" + 0.005*"facebook"




<b> Step 5 : Classification of Topics</b>
<br>

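From the above results we can roughly categorize the topics as follows: 
* Topic 0: Business (its top words also include the entertainment terms "film" and "actor")
* Topic 1: Politics
* Topic 2: Technology
* Topic 3: Entertainment (a weak match: its top words, such as "govern", "modi" and "polit", read closer to politics)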
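
Given this mapping, a story can be assigned the label of its highest-probability topic. gensim returns a document's topics as a list of `(topic_id, probability)` pairs (for example from `lda_model[bow_doc]`), and the sketch below works on such a list. `TOPIC_LABELS` restates the mapping above, while `dominant_topic` and the sample probabilities are made up for illustration.

```python
# Topic-to-label mapping as read off the topic words above.
TOPIC_LABELS = {0: "Business", 1: "Politics", 2: "Technology", 3: "Entertainment"}


def dominant_topic(doc_topics):
    """Return (label, probability) for the most probable topic."""
    topic_id, prob = max(doc_topics, key=lambda pair: pair[1])
    return TOPIC_LABELS[topic_id], prob


# Made-up distribution in the (topic_id, probability) format gensim uses.
sample = [(0, 0.08), (1, 0.81), (2, 0.11)]
print(dominant_topic(sample))  # ('Politics', 0.81)
```

To compare predictions against the SECTION column, the labels would still need to be translated back to the dataset's numeric codes (Politics 0, Technology 1, Entertainment 2, Business 3).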



<b> Step 6 : Testing the model</b>
<br>
WIP.

