## A Topic Segmentation Engine for Segregating News Content into Categories



*   Pre-processed a dataset of news articles belonging to categories such as Sports, Politics, and Business.
*   Built a classifier using a multilayer LSTM to predict the category of each news article.
*   Achieved roughly 90 percent classification accuracy, evaluated with a multi-label confusion matrix.

The steps below describe the programming methodology in detail.






The dataset consists of 2225 rows and two columns. The first column, `Text`, contains the news articles (textual data). The second column, `Label`, is the output variable and takes the values 0 to 4 according to the category of the article in the first column.

In [0]:
import pandas as pd

# Load the dataset prepared by extracting the articles from all .txt files
# (steps explained in xyz.ipynb)
df = pd.read_pickle('topic_bbc_all')


In [0]:
import re
import numpy as np

# Replace every match of `pattern` in the text with a single space
def remove_pattern(input_txt, pattern):
    return re.sub(pattern, ' ', input_txt)

# Remove the newline characters from every article
df['Text'] = np.vectorize(remove_pattern)(df['Text'], "\n")
df['Text'].head()

0    Ad sales boost Time Warner profit  Quarterly p...
1    Dollar gains on Greenspan speech  The dollar h...
2    Yukos unit buyer faces loan claim  The owners ...
3    High fuel prices hit BA's profits  British Air...
4    Pernod takeover talk lifts Domecq  Shares in U...
Name: Text, dtype: object
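
The same newline cleanup can also be expressed with pandas' built-in string methods. The sketch below is a minimal illustration on a toy `Series` (the two sample strings are made up for the example and stand in for `df['Text']`).

```python
import pandas as pd

# Toy data standing in for df['Text']; the real articles come from the pickle file.
toy = pd.Series([
    "Ad sales boost profit\n\nQuarterly profits rose",
    "Dollar gains\nThe dollar",
])

# Replace each newline with a space in one vectorized call.
# regex=False treats "\n" as a literal string rather than a pattern.
cleaned = toy.str.replace("\n", " ", regex=False)
print(cleaned.tolist())
```

Each newline becomes exactly one space, so a blank line between headline and body leaves a double space, as in the output shown above.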

In [0]:
# Replace every character that is not an ASCII letter with a space
df['Text'] = df['Text'].str.replace("[^a-zA-Z]", " ", regex=True)
df['Text'].head()

0    Ad sales boost Time Warner profit  Quarterly p...
1    Dollar gains on Greenspan speech  The dollar h...
2    Yukos unit buyer faces loan claim  The owners ...
3    High fuel prices hit BA s profits  British Air...
4    Pernod takeover talk lifts Domecq  Shares in U...
Name: Text, dtype: object
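
To see what the character class `[^a-zA-Z]` does to a single string, here is a small standalone sketch using the standard `re` module. The helper name `keep_letters` and the sample sentence are illustrative assumptions, not part of the notebook.

```python
import re

def keep_letters(text):
    """Replace every character outside a-z and A-Z with a space."""
    return re.sub(r"[^a-zA-Z]", " ", text)

# Hypothetical sample sentence containing an apostrophe, digits and a percent sign.
sample = "High fuel prices hit BA's profits by 40%"
letters_only = keep_letters(sample)

# Each removed character becomes one space, so the string length is unchanged.
print(letters_only.split())
```

Note that the apostrophe in `BA's` is replaced too, which is why the output above shows `BA s profits`.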

In [0]:

df.groupby(['Label']).describe()

Unnamed: 0_level_0,Text,Text,Text,Text
Unnamed: 0_level_1,count,unique,top,freq
Label,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
0,510,502,Singapore growth at in Singapore s ...,2
1,386,369,Famed music director Viotti dies Conductor Ma...,2
2,417,403,Schools to take part in mock poll Record numb...,2
3,511,505,Spain coach faces racism inquiry Spain s Foot...,2
4,401,347,Europe backs digital TV lifestyle How people ...,2


In [0]:
# Removing stop words
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

df['Text'] = df['Text'].apply(lambda x: ' '.join(w for w in x.split() if w not in stop_words))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [0]:
# Removing words of three or fewer characters (such short words carry little signal)
df['Text'] = df['Text'].apply(lambda x: ' '.join(w for w in x.split() if len(w)>3))

In [0]:
df.head()

Unnamed: 0,Text,Label
0,sales boost Time Warner profit Quarterly profi...,0
1,Dollar gains Greenspan speech dollar highest l...,0
2,Yukos unit buyer faces loan claim owners embat...,0
3,High fuel prices profits British Airways blame...,0
4,Pernod takeover talk lifts Domecq Shares drink...,0
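
The two filtering steps above (stop words, then short words) can be sketched together on a single string. The stop-word set below is a tiny hand-picked stand-in used only for illustration; the notebook uses NLTK's full English list. The helper name `filter_tokens` is likewise an assumption.

```python
# A tiny hand-picked stop-word set, for illustration only.
toy_stop_words = {"the", "on", "in", "s", "by"}

def filter_tokens(text, stop_words, min_len=4):
    """Drop stop words (case-sensitive match), then drop tokens shorter than min_len."""
    kept = [w for w in text.split() if w not in stop_words]
    return " ".join(w for w in kept if len(w) >= min_len)

filtered = filter_tokens("High fuel prices hit BA s profits", toy_stop_words)
print(filtered)

# "The" is capitalised, so it does not match the lowercase stop word "the";
# it is removed by the length filter instead.
capitalised = filter_tokens("The dollar on Monday", toy_stop_words)
print(capitalised)
```

Because the match is case-sensitive and the text has not been lowercased at this point, capitalised stop words are only removed when they are also short.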


In [0]:
# Stemming: reduce inflected forms of a word (e.g. play, playing, played) to a single stem

from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

df['Text'] = df['Text'].apply(lambda x: [stemmer.stem(i) for i in x.split()])  # stem each token
df['Text'].head()

0    [sale, boost, time, warner, profit, quarterli,...
1    [dollar, gain, greenspan, speech, dollar, high...
2    [yuko, unit, buyer, face, loan, claim, owner, ...
3    [high, fuel, price, profit, british, airway, b...
4    [pernod, takeov, talk, lift, domecq, share, dr...
Name: Text, dtype: object
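
As a quick check of what the Porter stemmer does to related word forms, the following sketch stems a few variants of "play". It assumes NLTK is installed; no corpus download is needed for the stemmer itself.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Inflected forms of the same word should collapse to one stem.
forms = ["play", "playing", "played", "plays"]
stems = [stemmer.stem(w) for w in forms]
print(stems)
```

Stems are not always dictionary words, which is why the output above contains tokens such as `quarterli` and `takeov`.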

In [0]:
df['Text'] = df['Text'].apply( lambda x: ' '.join(x))

In [0]:
df.head()

Unnamed: 0,Text,Label
0,sale boost time warner profit quarterli profit...,0
1,dollar gain greenspan speech dollar highest le...,0
2,yuko unit buyer face loan claim owner embattl ...,0
3,high fuel price profit british airway blame hi...,0
4,pernod takeov talk lift domecq share drink foo...,0


In [0]:
# Using Keras preprocessing functions to prepare the input dataset

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Build the vocabulary from the processed text
tk = Tokenizer(lower = True)
tk.fit_on_texts(df['Text'])

# Represent each word by its integer index in the vocabulary (more frequent words get lower indices)
X_seq = tk.texts_to_sequences(df['Text'])

# Fix the length of each article at 100 tokens: longer articles are truncated, shorter ones are padded with 0 at the end
X_pad = pad_sequences(X_seq, maxlen=100, padding='post')
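
To make the tokenizing and padding step concrete, here is a simplified pure-Python sketch of the same idea. It is not Keras's implementation (for instance, it skips punctuation filtering), and the helper names `build_word_index` and `to_padded` are assumptions made for the example. It mirrors two behaviours of the call above: words are indexed by frequency rank starting at 1, and with `padding='post'` and the default truncation, zeros are appended at the end while overlong sequences lose tokens from the front.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer: 1 for the most frequent word, 2 for the next, and so on."""
    counts = Counter(w for t in texts for w in t.lower().split())
    ranked = sorted(counts, key=lambda w: -counts[w])  # stable sort: ties keep first-seen order
    return {w: i + 1 for i, w in enumerate(ranked)}    # index 0 is left free for padding

def to_padded(texts, word_index, maxlen):
    """Convert texts to index lists, keep the last maxlen tokens, pad with 0 at the end."""
    rows = []
    for t in texts:
        seq = [word_index[w] for w in t.lower().split() if w in word_index]
        seq = seq[-maxlen:]                           # drop tokens from the front if too long
        rows.append(seq + [0] * (maxlen - len(seq)))  # pad at the end if too short
    return rows

toy_texts = ["dollar gain dollar", "profit gain sale boost profit dollar"]
idx = build_word_index(toy_texts)
padded = to_padded(toy_texts, idx, maxlen=4)
print(idx)
print(padded)
```

The integers are vocabulary indices, not counts: `dollar` occurs three times in the toy texts but is encoded as `1` because it is the most frequent word.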

In [0]:
len(X_pad)

2225

In [0]:
# One-hot encoding the output label, i.e. representing each label as a binary indicator vector

from sklearn import preprocessing
y = df['Label']
lb = preprocessing.LabelBinarizer()
lb.fit(y)
m = lb.transform(y)
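
For five integer classes 0 to 4, the binarized output is a one-hot matrix with one column per class. The NumPy sketch below reproduces that layout on a few toy labels (chosen for illustration) and shows how to map a one-hot row back to its integer label.

```python
import numpy as np

labels = np.array([0, 3, 1, 4, 2])        # toy labels in the same 0-4 range as df['Label']
one_hot = np.eye(5, dtype=int)[labels]    # row i has a single 1 at column labels[i]
recovered = one_hot.argmax(axis=1)        # inverse mapping back to integer labels

print(one_hot)
print(recovered)
```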

In [0]:
# Preparing train and test datasets: after shuffling, the last 800 rows are held out for testing

from sklearn.utils import shuffle
X_pad, m, y  = shuffle(X_pad, m, y, random_state = 0)
X_train = X_pad[:-800]
X_test  = X_pad[-800:]
y_train = m[:-800]
y_test = m[-800:]

In [0]:
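from keras.models import Sequential
from keras.layers import Embedding, Dense, LSTM, Conv1D, MaxPooling1D, Dropout

def create_model():
  vocabulary_size = len(tk.word_counts.keys())+1  # +1 because index 0 is reserved for padding
  max_words = 100                                  # must match maxlen used in pad_sequences
  embedding_size = 32
  model = Sequential()
  model.add(Embedding(vocabulary_size, embedding_size, input_length=max_words))
  model.add(Dropout(0.2))
  model.add(Conv1D(100, kernel_size=8, activation='relu'))  # local features over 8-token windows
  model.add(MaxPooling1D(pool_size=4))                      # shortens the sequence fed to the LSTMs
  model.add(LSTM(200,return_sequences=True))                # returns the full sequence so a second LSTM can be stacked
  model.add(LSTM(200))
  model.add(Dense(5, activation='softmax'))                 # one output per news category
  model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
  return model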
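
It can help to work out how long the sequence is by the time it reaches the LSTM layers. The sketch below does the arithmetic for the layer settings above, assuming the Keras defaults of no padding and stride 1 for `Conv1D`, and a stride equal to the pool size for `MaxPooling1D`. The helper function names are illustrative.

```python
def conv1d_valid_length(length, kernel_size, stride=1):
    """Output length of a 1D convolution without padding."""
    return (length - kernel_size) // stride + 1

def maxpool1d_length(length, pool_size):
    """Output length of max pooling without padding, with stride equal to the pool size."""
    return (length - pool_size) // pool_size + 1

after_conv = conv1d_valid_length(100, kernel_size=8)     # 100 tokens in, kernel of 8
after_pool = maxpool1d_length(after_conv, pool_size=4)   # pooled in groups of 4
print(after_conv, after_pool)
```

Under these assumptions the first LSTM processes 23 time steps of 100 features each rather than the original 100 tokens.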

In [0]:
from keras.wrappers.scikit_learn import KerasClassifier
estimator = KerasClassifier(build_fn=create_model, epochs=10, batch_size=100, verbose=20)


In [0]:
from sklearn.model_selection import KFold
kfold = KFold(n_splits=3, shuffle=True, random_state=0)
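
With 2225 rows and three splits, the folds cannot all be the same size. The NumPy sketch below shows how shuffled indices divide into three near-equal folds. It illustrates the idea only and will not reproduce the exact row assignment that scikit-learn's `KFold` makes.

```python
import numpy as np

n_samples, n_splits = 2225, 3
rng = np.random.RandomState(0)
indices = rng.permutation(n_samples)        # shuffle the row indices once
folds = np.array_split(indices, n_splits)   # near-equal parts; earlier folds get the extra rows

for i, test_idx in enumerate(folds):
    # Every fold serves as the test set once; the remaining folds form the training set.
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    print(f"fold {i}: train={len(train_idx)}, test={len(test_idx)}")
```

Each model in the cross-validation is therefore trained on roughly 1480 articles and scored on roughly 740.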

In [0]:
from sklearn.model_selection import cross_val_score
from keras.models import Sequential
from keras.layers import Embedding
from keras.layers import Dense, Flatten, LSTM, Conv1D, MaxPooling1D, Dropout, Activation, Input, Lambda

results = cross_val_score(estimator, X_pad, y, cv=kfold)
print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Baseline: 87.14% (4.22%)


In [0]:
# Train a fresh model on the training split and predict on the 800 held-out articles for the confusion matrix
model = create_model()

clf = model.fit(X_train, y_train, epochs = 30, batch_size = 500)

out = model.predict(X_test)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


In [0]:

# The predicted outputs are softmax probabilities; round them to 0 or 1 so they can be compared with the one-hot labels

out1 = np.round(out)
k = out1.astype(int)
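
One caveat with rounding softmax outputs is that a row in which no class exceeds 0.5 rounds to all zeros, so that article ends up with no predicted class. A common alternative, sketched below on made-up probabilities, is to take the `argmax` of each row and one-hot encode it, which always yields exactly one class per article.

```python
import numpy as np

# Hypothetical softmax outputs for two articles over five classes.
probs = np.array([
    [0.05, 0.80, 0.05, 0.05, 0.05],   # confident prediction
    [0.40, 0.30, 0.10, 0.10, 0.10],   # no class exceeds 0.5
])

rounded = np.round(probs).astype(int)                         # second row becomes all zeros
pred_class = probs.argmax(axis=1)                             # index of the most probable class
pred_one_hot = np.eye(probs.shape[1], dtype=int)[pred_class]  # exactly one 1 per row

print(rounded)
print(pred_one_hot)
```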

In [0]:
from sklearn.metrics import multilabel_confusion_matrix

multilabel_confusion_matrix(y_test, k, labels = [0,1,2,3])

array([[[607,  15],
        [ 12, 166]],

       [[647,  13],
        [ 20, 120]],

       [[632,  15],
        [ 20, 133]],

       [[603,  11],
        [  9, 177]]])

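Classification accuracy is about 90 percent. Across the four classes shown (labels 0 to 3), the matrices give 596 true positives out of 657 test articles.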
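
The per-class matrices above can be turned into precision and recall figures. The sketch below copies the four printed matrices and relies on scikit-learn's layout for `multilabel_confusion_matrix`, where each 2x2 block is `[[TN, FP], [FN, TP]]`. The micro-averaged recall is offered as one plausible reading of the 90 percent figure, not necessarily the calculation the notebook author used.

```python
import numpy as np

# The four 2x2 matrices printed above, one per label 0-3.
mcm = np.array([
    [[607, 15], [12, 166]],
    [[647, 13], [20, 120]],
    [[632, 15], [20, 133]],
    [[603, 11], [9, 177]],
])

tn, fp = mcm[:, 0, 0], mcm[:, 0, 1]
fn, tp = mcm[:, 1, 0], mcm[:, 1, 1]

precision = tp / (tp + fp)                   # of the articles predicted as a class, the share that is correct
recall = tp / (tp + fn)                      # of the articles truly in a class, the share that is found
micro_recall = tp.sum() / (tp + fn).sum()    # pooled over the four classes

for label, (p, r) in enumerate(zip(precision, recall)):
    print(f"label {label}: precision={p:.3f}, recall={r:.3f}")
print(f"micro-averaged recall: {micro_recall:.3f}")
```

Each matrix sums to 800, the size of the test split. The fifth class (label 4) is not included because the call passes `labels = [0,1,2,3]`.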