### 1. Import the libraries
As the first step, we need to import the required libraries.

In [54]:
import pandas as pd
import numpy as np

### 2. Load the dataset
Load the dataset.

In [55]:
df = pd.read_csv('../data/text-classification.csv')
df.head()

|   | category | text |
|---|----------|------|
| 0 | tech | tv future in the hands of viewers with home th... |
| 1 | business | worldcom boss left books alone former worldc... |
| 2 | sport | tigers wary of farrell gamble leicester say ... |
| 3 | sport | yeading face newcastle in fa cup premiership s... |
| 4 | entertainment | ocean s twelve raids box office ocean s twelve... |


In [56]:
df.shape

(2225, 2)

In [57]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2225 entries, 0 to 2224
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   category  2225 non-null   object
 1   text      2225 non-null   object
dtypes: object(2)
memory usage: 34.9+ KB


### 3. Exploratory Data Analysis

In [58]:
from collections import Counter

def countWord(list_of_words):            
    count = Counter()
    for sentence in list_of_words:
        for word in sentence.split():
            count[word] += 1
    
    return count

In [59]:
countWord(df['category'])

Counter({'tech': 401,
         'business': 510,
         'sport': 511,
         'entertainment': 386,
         'politics': 417})

In [60]:
counter = countWord(df['text'])
counter.most_common(5)

[('the', 52567), ('to', 24955), ('of', 19947), ('and', 18561), ('a', 18251)]

In [61]:
total_words = len(counter)
total_words

43771

### 4. Pre-processing the data
The actual data must meet certain conditions before being sent to the model. We will create a `pipeline`: a multi-level system where each level receives its data from the previous level and sends its results to the next level.
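
As a minimal sketch of that idea, the stages can be written as plain functions and chained in order. The two stage functions and `run_pipeline` are illustrative names assumed for this example; they are not part of the notebook.

```python
from functools import reduce

def lowercase(text):
    # Stage 1: normalise case.
    return text.lower()

def collapse_whitespace(text):
    # Stage 2: receives the output of stage 1.
    return " ".join(text.split())

def run_pipeline(text, stages):
    # Feed the result of each stage into the next one.
    return reduce(lambda value, stage: stage(value), stages, text)

cleaned = run_pipeline("  TV Future   in the HANDS of viewers ",
                       [lowercase, collapse_whitespace])
print(cleaned)
```

Keeping each stage as a small function makes it easy to reorder or drop a step later.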

#### 4.1 Transforming the data

We transform the `textual categories` into `index values`.

In [62]:
def category_transforming(df):
    category_mapper = dict(zip(np.unique(df["category"]), list(range(df['category'].nunique()))))
    category_inv_mapper = dict(zip(list(range(df['category'].nunique())), np.unique(df["category"])))
    
    return category_mapper, category_inv_mapper

In [63]:
category_mapper, category_inv_mapper = category_transforming(df)

In [64]:
category_ind = [category_mapper[i] for i in df['category']]
df['category_ind'] = category_ind
df.head()

|   | category | text | category_ind |
|---|----------|------|--------------|
| 0 | tech | tv future in the hands of viewers with home th... | 4 |
| 1 | business | worldcom boss left books alone former worldc... | 0 |
| 2 | sport | tigers wary of farrell gamble leicester say ... | 3 |
| 3 | sport | yeading face newcastle in fa cup premiership s... | 3 |
| 4 | entertainment | ocean s twelve raids box office ocean s twelve... | 1 |
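
The mapping can be checked on a small sample. The sketch below rebuilds both dictionaries in plain Python, assuming (as `np.unique` does) that the unique labels are sorted before indices are assigned; the `categories` list is made up for the example.

```python
categories = ["tech", "business", "sport", "sport", "entertainment", "politics"]

# Sort the unique labels so the index assignment is deterministic.
labels = sorted(set(categories))
category_mapper = {label: index for index, label in enumerate(labels)}
category_inv_mapper = {index: label for label, index in category_mapper.items()}

encoded = [category_mapper[c] for c in categories]
decoded = [category_inv_mapper[i] for i in encoded]

print(encoded)
print(decoded == categories)
```

Because the labels are sorted alphabetically, `business` gets index 0 and `tech` gets index 4, which matches the `category_ind` column above. The inverse mapper is what turns a predicted index back into a readable category.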


An alternative is to use `LabelEncoder` from `scikit-learn`:

In [65]:
from sklearn.preprocessing import LabelEncoder

def category_transforming(list_of_categories):
    label_encoder = LabelEncoder()
    label_encoder.fit(df['category'])
    predicted_label = label_encoder.transform(list_of_categories)
    
    return predicted_label

In [66]:
category_ind = category_transforming(df['category'])
df['category_ind'] = category_ind
df.head()

|   | category | text | category_ind |
|---|----------|------|--------------|
| 0 | tech | tv future in the hands of viewers with home th... | 4 |
| 1 | business | worldcom boss left books alone former worldc... | 0 |
| 2 | sport | tigers wary of farrell gamble leicester say ... | 3 |
| 3 | sport | yeading face newcastle in fa cup premiership s... | 3 |
| 4 | entertainment | ocean s twelve raids box office ocean s twelve... | 1 |


#### 4.2 Splitting the data

In [67]:
split_size = int(df.shape[0] * 0.8)

In [68]:
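# Note: these slices are taken before the cleaning steps in section 5.
# Those steps assign a new `text` column to `df`, which does not update
# `df_train` or `df_val`; re-run this cell after cleaning to pick up the cleaned text.
df_train = df[:split_size]
df_val = df[split_size:]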
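
The slice above keeps the rows in file order. If the file happens to be ordered or grouped in some way, the two parts could end up with different category balances. The following is a minimal sketch of a shuffled split on row positions; `shuffled_split`, its parameters and the seed value are assumptions made for this example.

```python
import random

def shuffled_split(n_rows, train_fraction=0.8, seed=42):
    # Shuffle row positions with a fixed seed so the split is reproducible.
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    split_size = int(n_rows * train_fraction)
    return indices[:split_size], indices[split_size:]

train_idx, val_idx = shuffled_split(2225)
print(len(train_idx), len(val_idx))
```

The two index lists could then be passed to `df.iloc[...]` to build the training and validation frames.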

### 5.  NLP Pipeline - Dataset preparation

In [69]:
import nltk

#### Step 1 - Remove URLs

In [70]:
import re

def remove_links(text):
    url = re.compile(r"https?://\S+|www\.\S+")
    return url.sub("", text)

In [71]:
corpus = [remove_links(sentence) for sentence in df['text']]
corpus[0]

'tv future in the hands of viewers with home theatre systems  plasma high-definition tvs  and digital video recorders moving into the living room  the way people watch tv will be radically different in five years  time.  that is according to an expert panel which gathered at the annual consumer electronics show in las vegas to discuss how these new technologies will impact one of our favourite pastimes. with the us leading the trend  programmes and other content will be delivered to viewers via home networks  through cable  satellite  telecoms companies  and broadband service providers to front rooms and portable devices.  one of the most talked-about technologies of ces has been digital and personal video recorders (dvr and pvr). these set-top boxes  like the us s tivo and the uk s sky+ system  allow people to record  store  play  pause and forward wind tv programmes when they want.  essentially  the technology allows for much more personalised tv. they are also being built-in to high

In [72]:
df['text'] = df['text'].map(remove_links)
df['text'].head()

0    tv future in the hands of viewers with home th...
1    worldcom boss  left books alone  former worldc...
2    tigers wary of farrell  gamble  leicester say ...
3    yeading face newcastle in fa cup premiership s...
4    ocean s twelve raids box office ocean s twelve...
Name: text, dtype: object

#### Step 2 - Remove punctuation

In [73]:
import string

def remove_punctuations(text):
    characters_to_remove = string.punctuation
    translator = str.maketrans("", "", characters_to_remove)
    clean_text = (text
                  .lower()
                  .translate(translator)
                 )
    
    return clean_text

In [74]:
corpus = [remove_punctuations(sentence) for sentence in df['text']]
corpus[0]

'tv future in the hands of viewers with home theatre systems  plasma highdefinition tvs  and digital video recorders moving into the living room  the way people watch tv will be radically different in five years  time  that is according to an expert panel which gathered at the annual consumer electronics show in las vegas to discuss how these new technologies will impact one of our favourite pastimes with the us leading the trend  programmes and other content will be delivered to viewers via home networks  through cable  satellite  telecoms companies  and broadband service providers to front rooms and portable devices  one of the most talkedabout technologies of ces has been digital and personal video recorders dvr and pvr these settop boxes  like the us s tivo and the uk s sky system  allow people to record  store  play  pause and forward wind tv programmes when they want  essentially  the technology allows for much more personalised tv they are also being builtin to highdefinition tv

In [75]:
df['text'] = df['text'].map(remove_punctuations)
df['text'].head()

0    tv future in the hands of viewers with home th...
1    worldcom boss  left books alone  former worldc...
2    tigers wary of farrell  gamble  leicester say ...
3    yeading face newcastle in fa cup premiership s...
4    ocean s twelve raids box office ocean s twelve...
Name: text, dtype: object

In [76]:
df.head()

|   | category | text | category_ind |
|---|----------|------|--------------|
| 0 | tech | tv future in the hands of viewers with home th... | 4 |
| 1 | business | worldcom boss left books alone former worldc... | 0 |
| 2 | sport | tigers wary of farrell gamble leicester say ... | 3 |
| 3 | sport | yeading face newcastle in fa cup premiership s... | 3 |
| 4 | entertainment | ocean s twelve raids box office ocean s twelve... | 1 |


#### Step 3 - Stop words

In [77]:
# nltk.download('stopwords')
from nltk.corpus import stopwords

def remove_stop_words(text):
    # A set makes the membership test below much faster than a list
    stop = set(stopwords.words("english"))
    filtered_words = [word for word in text.split() if word not in stop]
    
    return " ".join(filtered_words)

In [78]:
df['text'] = df['text'].map(remove_stop_words)
df['text'].head()

0    tv future hands viewers home theatre systems p...
1    worldcom boss left books alone former worldcom...
2    tigers wary farrell gamble leicester say rushe...
3    yeading face newcastle fa cup premiership side...
4    ocean twelve raids box office ocean twelve cri...
Name: text, dtype: object

#### Step 4 - Tokenization, then Stemming or Lemmatization?

`Tokenization` splits a string into smaller units, called tokens, such as words or single characters. <a href="https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization">Wikipedia</a> provides a nice example.

`Stemming` and `Lemmatization` both reduce a word to a base form so that variants of the same word are treated alike; they are widely used by search engines and chatbots. `Stemming` strips affixes with fixed rules, so the result is not always a real word (for example, "future" becomes "futur"), while `Lemmatization` uses a vocabulary and the word's part of speech to return its dictionary form.

![image.png](attachment:image.png)

In this example, we will use `Stemming` because it is the faster of the two.

In [79]:
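# nltk.download('wordnet')   # data for the lemmatizer (not used below)
from nltk.stem import WordNetLemmatizer
# nltk.download('punkt')     # tokenizer models used by nltk.word_tokenize
from nltk.stem.porter import PorterStemmer

# Create the stemmer once and reuse it for every word
stemmer = PorterStemmer()

def get_tokenized_text(input_sentence):
    return nltk.word_tokenize(input_sentence)

def get_stemmed_text(word):
    return stemmer.stem(word)

def convert_text_to_array(text_tokenized):
    # Stem every token in place
    for sentence in text_tokenized:
        for index, word in enumerate(sentence):
            sentence[index] = get_stemmed_text(word)
    
    # The token lists have different lengths, so build a 1-D object array
    return np.array(text_tokenized, dtype=object)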

In [80]:
train_text_tokenized = [get_tokenized_text(sentence) for sentence in df_train['text']]
val_text_tokenized = [get_tokenized_text(sentence) for sentence in df_val['text']]

In [81]:
train_text_to_array = convert_text_to_array(train_text_tokenized)
val_text_to_array = convert_text_to_array(val_text_tokenized)

In [82]:
train_category_ind_to_array = df_train['category_ind'].to_numpy()
val_category_ind_to_array = df_val['category_ind'].to_numpy()

In [83]:
train_text_to_array.shape, val_text_to_array.shape

((1780,), (445,))

#### Step 5 - Tokenization with Keras

The Keras `Tokenizer` vectorizes a text corpus by turning each text either into a sequence of integers (each integer being the index of a token in a dictionary) or into a vector.

It is applied here as the last step, so it expects text that has already been cleaned and stemmed.
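
To make the idea of a token dictionary concrete, here is a minimal pure-Python sketch. It is not the Keras implementation; `fit_word_index` and `to_sequences` are hypothetical helpers, and the token lists are made up. It illustrates one design choice worth considering: building the index once, from the training texts, and reusing it for the validation texts so that a given integer means the same word in both sets.

```python
from collections import Counter

def fit_word_index(token_lists):
    # Rank tokens by frequency; index 0 is left free for padding.
    counts = Counter(token for tokens in token_lists for token in tokens)
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(token_lists, word_index):
    # Tokens missing from the index are skipped in this sketch.
    return [[word_index[t] for t in tokens if t in word_index]
            for tokens in token_lists]

train_tokens = [["tv", "futur", "tv"], ["cup", "tv"]]
val_tokens = [["cup", "final", "tv"]]

word_index = fit_word_index(train_tokens)
print(to_sequences(train_tokens, word_index))
print(to_sequences(val_tokens, word_index))
```

In this sketch the unseen token `final` is simply dropped from the validation sequence; reserving a dedicated out-of-vocabulary index would be another option.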

In [84]:
from keras.preprocessing.text import Tokenizer

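def get_sequence_of_tokens(input_sentences):
    # Note: a new Tokenizer is fitted on every call, so separate calls
    # (e.g. for training and validation data) produce unrelated word indices.
    tokenizer = Tokenizer()
    
    tokenizer.fit_on_texts(input_sentences)   
    sentences_to_sequences = tokenizer.texts_to_sequences(input_sentences)
    
    return sentences_to_sequences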

In [85]:
train_text_to_sequences = get_sequence_of_tokens(train_text_to_array)
val_text_to_sequences = get_sequence_of_tokens(val_text_to_array)

In [86]:
print(train_text_to_array[:5])
print(train_text_to_sequences[:5])

[list(['tv', 'futur', 'in', 'the', 'hand', 'of', 'viewer', 'with', 'home', 'theatr', 'system', 'plasma', 'high-definit', 'tv', 'and', 'digit', 'video', 'record', 'move', 'into', 'the', 'live', 'room', 'the', 'way', 'peopl', 'watch', 'tv', 'will', 'be', 'radic', 'differ', 'in', 'five', 'year', 'time', '.', 'that', 'is', 'accord', 'to', 'an', 'expert', 'panel', 'which', 'gather', 'at', 'the', 'annual', 'consum', 'electron', 'show', 'in', 'la', 'vega', 'to', 'discuss', 'how', 'these', 'new', 'technolog', 'will', 'impact', 'one', 'of', 'our', 'favourit', 'pastim', '.', 'with', 'the', 'us', 'lead', 'the', 'trend', 'programm', 'and', 'other', 'content', 'will', 'be', 'deliv', 'to', 'viewer', 'via', 'home', 'network', 'through', 'cabl', 'satellit', 'telecom', 'compani', 'and', 'broadband', 'servic', 'provid', 'to', 'front', 'room', 'and', 'portabl', 'devic', '.', 'one', 'of', 'the', 'most', 'talked-about', 'technolog', 'of', 'ce', 'ha', 'been', 'digit', 'and', 'person', 'video', 'record', '('

#### Step 6 - Padding the Sequences

The sequences of tokens generated above can have different lengths. Before training the model, we need to pad the sequences so that they all have the same length. We can use the `pad_sequences` function of Keras for this purpose.
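
As a rough illustration of what post-padding and post-truncation do, the sketch below uses plain lists. `pad_post` is a hypothetical helper, not the Keras function, and the sequences are made up. Both sets are padded to the longest training sequence, which is one way to keep a single input width for the model.

```python
def pad_post(sequences, maxlen, value=0):
    # Truncate at the end, then fill the end with `value` up to maxlen.
    padded = []
    for seq in sequences:
        trimmed = list(seq[:maxlen])
        padded.append(trimmed + [value] * (maxlen - len(trimmed)))
    return padded

train_sequences = [[1, 2, 1], [3, 1]]
val_sequences = [[3, 1, 4, 2, 5]]

# Use the training data to decide the common length.
maxlen = max(len(seq) for seq in train_sequences)
print(pad_post(train_sequences, maxlen))
print(pad_post(val_sequences, maxlen))
```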

In [87]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_padded_sequences(input_sequences):
    max_sequence = max([len(x) for x in input_sequences])
    input_sequences = pad_sequences(input_sequences, maxlen = max_sequence, padding = "post", truncating = "post")
    
    return input_sequences, max_sequence

In [88]:
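# Note: each call pads to the longest sequence of its own input, so the training
# and validation arrays end up with different widths (see the shapes below).
train_padded, train_max_sequence = generate_padded_sequences(train_text_to_sequences)
val_padded, val_max_sequence = generate_padded_sequences(val_text_to_sequences)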

In [89]:
train_padded.shape, val_padded.shape

((1780, 4757), (445, 1483))

### 6. LSTMs for Text Classification

Unlike feed-forward neural networks, in which activations are propagated in one direction only (from inputs to outputs), Recurrent Neural Networks (RNNs) also feed the hidden state computed at one time step back into the network at the next time step. This creates loops in the architecture that act as a ‘memory state’, giving the network the ability to remember what it has seen so far in the sequence.

This memory state gives RNNs an advantage over traditional neural networks, but it comes with a problem called the vanishing gradient. When the network is unrolled over many time steps, it behaves like a very deep network, and the gradients that reach the earliest steps become so small that the parameters can hardly learn from them. To address this problem, a type of RNN called the LSTM (Long Short-Term Memory) was developed.
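
A small numeric illustration of why the gradient shrinks, under the simplifying assumption that backpropagation multiplies the gradient by the same factor at every time step (real networks have a different factor at each step; `gradient_scale` and the value 0.9 are chosen only for the example):

```python
def gradient_scale(per_step_factor, steps):
    # Backpropagation through time multiplies one factor per time step.
    scale = 1.0
    for _ in range(steps):
        scale *= per_step_factor
    return scale

for steps in (10, 50, 100):
    print(steps, gradient_scale(0.9, steps))
```

Even a factor close to 1 leaves almost nothing of the gradient after a hundred steps, and the padded sequences in this notebook are far longer than that.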

LSTMs have an additional state called the ‘cell state’, through which the network regulates the flow of information. The advantage of this state is that the model can remember or forget what it has learned more selectively. To learn more about LSTMs, here is a great post. Let's build an LSTM model in our code. The model has three layers.

1. Embedding Layer : Takes the padded sequence of word indices as input and maps each index to a 256-dimensional vector.
2. LSTM Layer : Computes the output using LSTM units. The code below uses 502 units, but this number can be fine-tuned later. Dropout, a regularisation technique that randomly zeroes a fraction of values during training to help prevent overfitting, is applied to the layer's inputs through the `dropout = 0.1` argument rather than as a separate layer.
3. Output Layer : A `Dense` layer that produces the model's output scores. For a five-category problem with integer labels, the usual choice is one unit per category with a softmax activation and a sparse categorical cross-entropy loss; the code below instead uses `total_words` sigmoid units with a binary cross-entropy loss.
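
The role of the cell state can be sketched with a single scalar LSTM unit, following the standard gate equations. This is only an illustration: `lstm_step` and the weight values are made up for the example, and a Keras layer works with vectors and matrices rather than scalars.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    # w maps each gate name to (input weight, recurrent weight, bias).
    def pre(gate):
        wx, wh, b = w[gate]
        return wx * x + wh * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much new information to write
    g = math.tanh(pre("candidate"))   # candidate value to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose

    c = f * c_prev + i * g            # updated cell state
    h = o * math.tanh(c)              # updated hidden state
    return h, c

weights = {gate: (0.5, 0.5, 0.0)
           for gate in ("forget", "input", "candidate", "output")}
h, c = 0.0, 0.0
for x in (1.0, -1.0, 0.5):
    h, c = lstm_step(x, h, c, weights)
print(h, c)
```

When the forget gate is close to 1 and the input gate close to 0, the cell state passes through a step almost unchanged, which is how information can survive over long sequences.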

We will train this model for 100 epochs; this number can be tuned further.

In [90]:
from keras.layers import Embedding, LSTM, Dense
from keras.models import Sequential
from keras.losses import BinaryCrossentropy
from keras.optimizers import Adam

def create_model(total_words, train_max_sequence):
    """
    Builds and compiles the LSTM model.
    
    Args:
        total_words: vocabulary size, used as the Embedding input dimension
            and as the number of units in the output layer
        train_max_sequence: length of the padded input sequences
    
    Returns:
        model: compiled Keras Sequential model
    """
    
    model = Sequential()
    
    # Add Input Embedding, Hidden and Output Layer
    model.add(Embedding(input_dim = total_words, output_dim = 256, input_length = train_max_sequence))
    model.add(LSTM(units = 502, dropout = 0.1))
    model.add(Dense(units = total_words, activation='sigmoid'))
    
    loss = BinaryCrossentropy(from_logits = False)
    optim = Adam(learning_rate = 0.001)
    metrics = ["accuracy"]

    model.compile(loss = loss, optimizer = optim, metrics = metrics)
    
    return model

In [91]:
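# Note: create_model is defined as create_model(total_words, train_max_sequence),
# but the arguments are passed here in the opposite order; this is why the summary
# below shows a sequence length of 43771 and an output layer with 4757 units.
model = create_model(train_max_sequence, total_words)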
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 43771, 256)        1217792   
_________________________________________________________________
lstm (LSTM)                  (None, 502)               1524072   
_________________________________________________________________
dense (Dense)                (None, 4757)              2392771   
Total params: 5,134,635
Trainable params: 5,134,635
Non-trainable params: 0
_________________________________________________________________
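
The parameter counts in the summary can be reproduced by hand. The helper functions below are written for this check only; the formulas are the usual ones for an embedding table, an LSTM layer with one bias vector per gate, and a dense layer with bias.

```python
def embedding_params(input_dim, output_dim):
    # One output_dim-sized vector per vocabulary index.
    return input_dim * output_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units

counts = [
    embedding_params(4757, 256),
    lstm_params(256, 502),
    dense_params(502, 4757),
]
print(counts, sum(counts))
```

The embedding count only matches the summary with 4757 rows, which confirms that the value received as `input_dim` was the padded sequence length and not the vocabulary size of 43771.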


Let's train our model now.

In [92]:
model.fit(train_padded, train_category_ind_to_array, epochs = 100, validation_data = (val_padded, val_category_ind_to_array), verbose=5)
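
The labels passed to `fit` are integer category indices (0 to 4). As a hedged sketch of a loss that matches that label format, the code below computes softmax probabilities and the sparse categorical cross-entropy for one example in plain Python. The `scores` values are made up, and the functions are illustrative rather than the Keras implementations.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(label, scores):
    # Negative log-probability assigned to the true class index.
    return -math.log(softmax(scores)[label])

scores = [0.2, 0.1, 0.0, 2.5, 0.3]   # one score per category (5 classes)
probabilities = softmax(scores)
print(probabilities)
print(sparse_categorical_crossentropy(3, scores))  # true class: index 3 ('sport')
```

The loss is small when the highest score belongs to the true category and grows as probability mass moves to the other categories.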