Approach-1: Baseline Model

As a baseline, we use a Keras Sequential model for the text classification task.

In [1]:
#Import required libraries
! pip install num2words

import nltk
nltk.download('punkt')

import bz2
import pandas as pd
import numpy as np
from keras_preprocessing.text import one_hot
from keras_preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, Flatten
from sklearn.model_selection import train_test_split

exec(open('/content/preprocess_data.py').read())

Collecting num2words
Installing collected packages: num2words
Successfully installed num2words-0.5.10
[nltk_data] Downloading package punkt to /root/nltk_da

In [2]:
#Load AG news data
sample_data = pd.read_csv('/content/sample.csv')

In [3]:
#sample records of the imported data
sample_data.head()

Unnamed: 0,Class,Title,Article
0,3,Fears for T N pension after talks,Unions representing workers at Turner Newall...
1,4,The Race is On: Second Private Team Sets Launc...,"SPACE.com - TORONTO, Canada -- A second\team o..."
2,4,Ky. Company Wins Grant to Study Peptides (AP),AP - A company founded by a chemistry research...
3,4,Prediction Unit Helps Forecast Wildfires (AP),AP - It's barely dawn when Mike Fitzpatrick st...
4,4,Calif. Aims to Limit Farm-Related Smog (AP),AP - Southern California's smog-fighting agenc...


#### Text cleaning

Data cleaning removes junk and special characters, collapses extra white space, and converts numbers to words.
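
The `preprocess` function itself is loaded from `preprocess_data.py` and is not shown in this notebook. As a rough sketch of the steps described above, a cleaner along the following lines would produce similar output. The name `clean_text` and the digit-by-digit fallback are hypothetical, and `num2words` is used only when it is installed.

```python
import re

try:
    from num2words import num2words  # third-party package installed in cell 1
except ImportError:
    num2words = None

_DIGIT_NAMES = "zero one two three four five six seven eight nine".split()


def _number_to_words(match):
    """Spell out one run of digits."""
    digits = match.group(0)
    if num2words is not None:
        return " " + num2words(int(digits)) + " "
    # Fallback (assumption): spell digit by digit when num2words is missing.
    return " " + " ".join(_DIGIT_NAMES[int(d)] for d in digits) + " "


def clean_text(text):
    """Lowercase, spell out numbers, drop punctuation, collapse whitespace."""
    text = text.lower()
    text = text.replace("'", "")                 # "it's" -> "its"
    text = re.sub(r"\d+", _number_to_words, text)
    text = re.sub(r"[^a-z\s]", " ", text)        # junk and special characters
    return re.sub(r"\s+", " ", text).strip()     # extra white space


print(clean_text("AP - It's barely dawn when Mike Fitzpatrick starts his shift"))
# ap its barely dawn when mike fitzpatrick starts his shift
```

Numbers are converted before punctuation is stripped so that hyphenated spellings such as "forty-two" end up as separate words.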

In [4]:
# Assign the whole column at once. Element-wise chained indexing
# (sample_data['Article'][i] = ...) triggers pandas' SettingWithCopyWarning.
sample_data['Article'] = sample_data['Article'].apply(preprocess)
  


In [5]:
sample_data.head()

Unnamed: 0,Class,Title,Article
0,3,Fears for T N pension after talks,unions representing workers at turner newall ...
1,4,The Race is On: Second Private Team Sets Launc...,space com toronto canada a second team of roc...
2,4,Ky. Company Wins Grant to Study Peptides (AP),ap a company founded by a chemistry researche...
3,4,Prediction Unit Helps Forecast Wildfires (AP),ap its barely dawn when mike fitzpatrick star...
4,4,Calif. Aims to Limit Farm-Related Smog (AP),ap southern californias smog fighting agency ...


In [6]:
#dictionary mapping class labels 1-4 to zero-based indices 0-3
label_check = {1 : 0, 2: 1, 3 : 2, 4 : 3}

#### Modeling

In [7]:
# Create sentence and label lists
sentences = sample_data['Article']
labels = [label_check[c] for c in sample_data['Class']]

In [8]:
#encode each sentence as a list of integer word indices (one_hot hashes words; it does not build one-hot vectors)
vocab_size = 10000
encoding = [one_hot(sentence, vocab_size) for sentence in sentences]
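
Despite its name, `one_hot` returns a list of integer word indices rather than one-hot vectors. As far as the Keras documentation describes it, the function applies the hashing trick, so two different words can land on the same index. The sketch below illustrates the idea with a stable MD5 hash. `hash_encode` is a hypothetical helper, not the Keras implementation, and its indices will not match the ones printed in this notebook.

```python
import hashlib


def hash_encode(text, vocab_size):
    """Map each whitespace-separated word to an index in [1, vocab_size - 1]."""
    indices = []
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        # Index 0 is left free so it can serve as the padding value.
        indices.append(int(digest, 16) % (vocab_size - 1) + 1)
    return indices


encoded = hash_encode("unions representing workers at turner newall", 10000)
print(len(encoded))  # 6 words -> 6 indices
```

A larger `vocab_size` lowers the collision rate. A tokenizer fitted on the corpus avoids collisions altogether, at the cost of storing the vocabulary.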

In [9]:
sentences[0]

' unions representing workers at turner newall say they are disappointed after talks with stricken parent firm federal mogul'

In [10]:
encoding[0]

[7441,
 6530,
 1668,
 5736,
 5056,
 8268,
 3495,
 6127,
 6808,
 7369,
 660,
 703,
 5491,
 4765,
 5520,
 5128,
 9383,
 1570]

In [11]:
#pad (or truncate) every sequence to the same length
max_length = 150
padded_encoding = pad_sequences(encoding, maxlen = max_length, padding = 'post')
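
With `padding = 'post'`, `pad_sequences` appends zeros to short sequences. Sequences longer than `maxlen` are truncated, by default from the front. The plain-Python sketch below mirrors that behaviour, assuming the default `truncating = 'pre'`; `pad_post` is an illustrative name.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad with `value`; keep only the last `maxlen` tokens if too long."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]  # 'pre' truncation drops leading tokens
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded


print(pad_post([[1, 2, 3], [1, 2, 3, 4, 5, 6, 7]], maxlen=5))
# [[1, 2, 3, 0, 0], [3, 4, 5, 6, 7]]
```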

In [12]:
padded_encoding[0]

array([7441, 6530, 1668, 5736, 5056, 8268, 3495, 6127, 6808, 7369,  660,
        703, 5491, 4765, 5520, 5128, 9383, 1570,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0], d

In [13]:
#split data into train and test
x_train, x_test, y_train, y_test = train_test_split(padded_encoding, labels, test_size = 0.2, random_state = 0)

In [14]:
len(x_train)

6080

In [15]:
x_train = np.array(x_train)
y_train = np.array(y_train)
x_test = np.array(x_test)
y_test = np.array(y_test)

In [16]:
#Keras Sequential model
model = Sequential()
model.add(Embedding(vocab_size, 8, input_length = max_length))
model.add(Flatten())
model.add(Dense(1, activation = 'sigmoid'))

In [17]:
model.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])
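
The labels here are the integers 0 to 3, while the model above ends in a single sigmoid unit and is compiled with `categorical_crossentropy`, a loss that expects one probability per class and one-hot targets. For four classes the usual Keras pairing is `Dense(4, activation = 'softmax')` with `loss = 'sparse_categorical_crossentropy'`, which accepts integer labels directly. The plain-Python sketch below shows what that pairing computes for one sample. The function names are illustrative and this is not the Keras implementation.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)                       # subtract the max for stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(label, probs):
    """Negative log-probability assigned to the true (integer) class."""
    return -math.log(probs[label])


probs = softmax([2.0, 0.5, 0.1, -1.0])       # one score per class
loss_if_class_0 = sparse_categorical_crossentropy(0, probs)
loss_if_class_3 = sparse_categorical_crossentropy(3, probs)
print(loss_if_class_0 < loss_if_class_3)     # True: class 0 has the top score
```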

In [18]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 150, 8)            80000     
_________________________________________________________________
flatten (Flatten)            (None, 1200)              0         
_________________________________________________________________
dense (Dense)                (None, 1)                 1201      
Total params: 81,201
Trainable params: 81,201
Non-trainable params: 0
_________________________________________________________________
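
The parameter counts in the summary can be checked by hand. The embedding stores one 8-dimensional vector per vocabulary index, flattening yields 150 x 8 features, and the dense layer has one weight per feature plus a bias for each unit. A small sketch of the arithmetic, with illustrative function names:

```python
def embedding_params(vocab_size, embedding_dim):
    """One trainable vector per vocabulary index."""
    return vocab_size * embedding_dim


def dense_params(n_inputs, n_units):
    """One weight per input per unit, plus one bias per unit."""
    return n_inputs * n_units + n_units


vocab_size, embedding_dim, max_length = 10000, 8, 150

emb = embedding_params(vocab_size, embedding_dim)
flat = max_length * embedding_dim            # Flatten adds no parameters
dense = dense_params(flat, 1)

print(emb, flat, dense, emb + dense)         # 80000 1200 1201 81201
print(dense_params(flat, 4))                 # 4804 for a four-unit output layer
```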


In [19]:
# model.fit(padded_encoding, labels, epochs = 1)
model.fit(x_train, y_train, epochs = 10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7ff6a0049a10>

In [20]:
loss, accuracy = model.evaluate(x_test, y_test, verbose = 0)
print('Accuracy : ', accuracy * 100)
print('Loss     : ', loss)

Accuracy :  24.80263113975525
Loss     :  0.0
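
An accuracy close to 25% on four classes is what a model that always predicts the same class would score on a roughly balanced test set, and a loss of exactly 0.0 suggests the loss carried no training signal. One plausible explanation, assuming categorical cross-entropy rescales the predictions to sum to 1 across the class axis: with a single output unit the rescaled prediction is always 1, so the loss is log(1) = 0 for every sample and the gradients vanish. The sketch below reproduces that arithmetic in plain Python. It is an illustration of the formula, not the Keras internals.

```python
import math


def categorical_crossentropy(target, output):
    """sum(-t * log(p)) after rescaling `output` to sum to 1."""
    total = sum(output)
    probs = [p / total for p in output]
    return sum(-t * math.log(p) for t, p in zip(target, probs))


# Single sigmoid output, with the integer label standing in for the target.
for label, sigmoid_out in [(0, 0.3), (2, 0.9), (3, 0.01)]:
    print(categorical_crossentropy([label], [sigmoid_out]))  # zero every time

# With one probability per class the same formula gives a usable signal.
four_class_loss = categorical_crossentropy([0, 0, 1, 0], [0.1, 0.2, 0.6, 0.1])
print(four_class_loss > 0)  # True
```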
