<a href="https://colab.research.google.com/github/jnamor/text-classification/blob/main/text_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### 1. Import the libraries
As the first step, we need to import the required libraries.

In [None]:
import pandas as pd
import numpy as np

### 2. Load the dataset

In [None]:
!git clone https://github.com/jnamor/text-classification

fatal: destination path 'text-classification' already exists and is not an empty directory.


In [None]:
df = pd.read_csv('text-classification/data/text-classification.csv')
df.head()

Unnamed: 0,category,text
0,tech,tv future in the hands of viewers with home th...
1,business,worldcom boss left books alone former worldc...
2,sport,tigers wary of farrell gamble leicester say ...
3,sport,yeading face newcastle in fa cup premiership s...
4,entertainment,ocean s twelve raids box office ocean s twelve...


In [None]:
df.shape

(2225, 2)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2225 entries, 0 to 2224
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   category  2225 non-null   object
 1   text      2225 non-null   object
dtypes: object(2)
memory usage: 34.9+ KB


### 3. Exploratory Data Analysis

In [None]:
from collections import Counter

def countWord(list_of_words):            
    count = Counter()
    for sentence in list_of_words:
        for word in sentence.split():
            count[word] += 1
    
    return count

In [None]:
countWord(df['category'])

Counter({'tech': 401,
         'business': 510,
         'sport': 511,
         'entertainment': 386,
         'politics': 417})

In [None]:
counter = countWord(df['text'])
counter.most_common(5)

[('the', 52567), ('to', 24955), ('of', 19947), ('and', 18561), ('a', 18251)]

In [None]:
total_words = len(counter)
total_words

43771

### 4. Pre-processing the data
The raw data must meet certain conditions before it is fed to the model. We will build a `pipeline`: a sequence of stages in which each stage receives its input from the previous stage and passes its result to the next.
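
As a minimal sketch of that idea, the stages can be modelled as plain functions applied in order. The helper `run_pipeline` and the two toy stages are hypothetical names used only for illustration; they are not part of this notebook.

```python
from functools import reduce

def run_pipeline(text, stages):
    """Feed `text` through each stage in order; each stage receives the previous stage's output."""
    return reduce(lambda value, stage: stage(value), stages, text)

# Two toy stages (hypothetical, for illustration only)
def lowercase(text):
    return text.lower()

def collapse_spaces(text):
    return " ".join(text.split())

result = run_pipeline("TV  Future in   the Hands", [lowercase, collapse_spaces])
print(result)  # tv future in the hands
```

The cleaning functions defined in the following sections could be chained the same way, although the notebook applies them one at a time with `X.map(...)`.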

#### 4.1 Transforming the data

We transform the `textual categories` into `index values`.

In [None]:
def category_transforming(df):
    category_mapper = dict(zip(np.unique(df["category"]), list(range(df['category'].nunique()))))
    category_inv_mapper = dict(zip(list(range(df['category'].nunique())), np.unique(df["category"])))
    
    return category_mapper, category_inv_mapper

In [None]:
category_mapper, category_inv_mapper = category_transforming(df)

In [None]:
category_ind = [category_mapper[i] for i in df['category']]
df['category_ind'] = category_ind
df.head()

Unnamed: 0,category,text,category_ind
0,tech,tv future in the hands of viewers with home th...,4
1,business,worldcom boss left books alone former worldc...,0
2,sport,tigers wary of farrell gamble leicester say ...,3
3,sport,yeading face newcastle in fa cup premiership s...,3
4,entertainment,ocean s twelve raids box office ocean s twelve...,1


Alternatively, we can use `LabelEncoder` from `scikit-learn`:

In [None]:
from sklearn.preprocessing import LabelEncoder

def category_transforming(list_of_categories):
    # Fit on the labels passed in (not on the global df) and encode them in one step
    label_encoder = LabelEncoder()
    predicted_label = label_encoder.fit_transform(list_of_categories)
    
    return predicted_label

In [None]:
category_ind = category_transforming(df['category'])
df['category_ind'] = category_ind
df.head()

Unnamed: 0,category,text,category_ind
0,tech,tv future in the hands of viewers with home th...,4
1,business,worldcom boss left books alone former worldc...,0
2,sport,tigers wary of farrell gamble leicester say ...,3
3,sport,yeading face newcastle in fa cup premiership s...,3
4,entertainment,ocean s twelve raids box office ocean s twelve...,1


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2225 entries, 0 to 2224
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   category      2225 non-null   object
 1   text          2225 non-null   object
 2   category_ind  2225 non-null   int64 
dtypes: int64(1), object(2)
memory usage: 52.3+ KB


In [None]:
X = df['text']
Y = (df['category_ind']
     .to_numpy()
     .reshape(df['category_ind'].shape[0], 1))

### 5. NLP Pipeline - Dataset preparation

In [None]:
import nltk

#### Step 1 - Remove URLs

In [None]:
import re

def remove_links(text):
    url = re.compile(r"https?://\S+|www\.\S+")
    return url.sub("", text)

In [None]:
X = X.map(remove_links)
X.head()

0    tv future in the hands of viewers with home th...
1    worldcom boss  left books alone  former worldc...
2    tigers wary of farrell  gamble  leicester say ...
3    yeading face newcastle in fa cup premiership s...
4    ocean s twelve raids box office ocean s twelve...
Name: text, dtype: object
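
To see what the pattern in `remove_links` matches, here is a small self-contained check. The sample sentence and the `example.com` / `example.org` addresses are made up for illustration.

```python
import re

# Same pattern as in remove_links above
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

sample = "read more at https://example.com/news?id=1 or www.example.org today"
cleaned = URL_PATTERN.sub("", sample)

print(repr(cleaned))  # 'read more at  or  today'
```

The removal leaves double spaces behind. That is harmless here because the later steps split the text on whitespace.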

#### Step 2 - Remove Punctuation

In [None]:
def decrease_text_size(text):
    return ".".join(text.split('.')[:5])

In [None]:
# X = X.map(decrease_text_size)
# X.head()

In [None]:
import string

def remove_punctuations(text):
    characters_to_remove = string.punctuation
    translator = str.maketrans("", "", characters_to_remove)
    clean_text = (text
                  .lower()
                  .translate(translator)
                 )
    
    return clean_text

In [None]:
X = X.map(remove_punctuations)
X.head()

0    tv future in the hands of viewers with home th...
1    worldcom boss  left books alone  former worldc...
2    tigers wary of farrell  gamble  leicester say ...
3    yeading face newcastle in fa cup premiership s...
4    ocean s twelve raids box office ocean s twelve...
Name: text, dtype: object
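
One side effect of deleting punctuation is that hyphenated words get glued together, which is presumably where tokens such as `highdefinition` and `settop` in the later output come from. The sketch below contrasts the deletion used in `remove_punctuations` with a possible alternative that maps punctuation to spaces. The alternative is an assumption for comparison and is not used in this notebook.

```python
import string

# Deletes punctuation (same idea as remove_punctuations above)
delete_table = str.maketrans("", "", string.punctuation)
# Alternative: replace each punctuation character with a space
space_table = str.maketrans(string.punctuation, " " * len(string.punctuation))

sample = "High-definition TVs, set-top boxes!"

deleted = sample.lower().translate(delete_table)
spaced = " ".join(sample.lower().translate(space_table).split())

print(deleted)  # highdefinition tvs settop boxes
print(spaced)   # high definition tvs set top boxes
```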

#### Step 3 - Stop words

In [None]:
nltk.download('stopwords')
from nltk.corpus import stopwords

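# Build the stop-word set once; set lookups are much faster than scanning a list per word
STOP_WORDS = set(stopwords.words("english"))

def remove_stop_words(text):
    filtered_words = [word for word in text.split() if word not in STOP_WORDS]
    
    return " ".join(filtered_words)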

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
X = X.map(remove_stop_words)
X.head()

0    tv future hands viewers home theatre systems p...
1    worldcom boss left books alone former worldcom...
2    tigers wary farrell gamble leicester say rushe...
3    yeading face newcastle fa cup premiership side...
4    ocean twelve raids box office ocean twelve cri...
Name: text, dtype: object

#### Step 4 - Tokenization, then Stemming or Lemmatization?

`Tokenization` splits a string into smaller units such as words or single characters; these units are called tokens. <a href="https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization">Wikipedia</a> provides a nice example.
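
A minimal illustration of word-level versus character-level tokens follows. It uses a plain regular expression as a stand-in; NLTK's `word_tokenize`, used later in this notebook, handles punctuation and contractions more carefully.

```python
import re

sentence = "the quick brown fox"

word_tokens = re.findall(r"\w+", sentence)  # word-level tokens
char_tokens = list("fox")                    # character-level tokens

print(word_tokens)  # ['the', 'quick', 'brown', 'fox']
print(char_tokens)  # ['f', 'o', 'x']
```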

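`Stemming` and `Lemmatization` both reduce a word to a base form, which search engines and chatbots use to match different forms of the same word. `Stemming` strips affixes by rule to obtain the stem of the word, while `Lemmatization` takes the word's context (such as its part of speech) into account and returns its dictionary form, the lemma.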

![image.png](attachment:image.png)

In this example, the code below applies `Lemmatization`. The `Stemming` call is left commented out as a faster alternative for cases where optimization and performance matter more.
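
The difference between the two approaches can be sketched with toy stand-ins. `toy_stem` is a crude suffix stripper and `TOY_LEMMAS` is a tiny hand-written dictionary; both are hypothetical and far simpler than the Porter stemmer and the WordNet lemmatizer used in this notebook.

```python
def toy_stem(word):
    """Strip a few common suffixes by rule (NOT the Porter algorithm)."""
    for suffix in ("ing", "ies", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Tiny hand-written lemma dictionary (hypothetical stand-in for WordNet)
TOY_LEMMAS = {"studies": "study", "mice": "mouse", "better": "good"}

def toy_lemmatize(word):
    """Look the word up in the dictionary and fall back to the word itself."""
    return TOY_LEMMAS.get(word, word)

for word in ["studies", "mice", "viewers"]:
    print(word, toy_stem(word), toy_lemmatize(word))
# studies stud study
# mice mice mouse
# viewers viewer viewers
```

Rule-based stemming is cheap but can produce non-words (`stud`), whereas dictionary lookup returns real words but only for entries it knows. NLTK's `WordNetLemmatizer.lemmatize` treats a word as a noun unless a part of speech is passed, which likely explains tokens such as `u` and `discus` in the output further down.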

In [None]:
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [None]:
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

# Create the stemmer and lemmatizer once instead of once per word
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def get_tokenized_text(input_sentence):
    return nltk.word_tokenize(input_sentence)

def get_stemmed_text(word):
    return stemmer.stem(word)

def get_lemmatized_text(word):
    return lemmatizer.lemmatize(word)

def convert_text_to_array(text_tokenized):
    text_tokenized = [get_tokenized_text(sentence) for sentence in text_tokenized]
    for sentence in text_tokenized:
        for index, word in enumerate(sentence):
            # sentence[index] = get_stemmed_text(word) # Stemming
            sentence[index] = get_lemmatized_text(word) # Lemmatization
    
    # The token lists have different lengths, so an object array is required
    return np.array(text_tokenized, dtype=object)

In [None]:
X = convert_text_to_array(X)
X[:5]



array([list(['tv', 'future', 'hand', 'viewer', 'home', 'theatre', 'system', 'plasma', 'highdefinition', 'tv', 'digital', 'video', 'recorder', 'moving', 'living', 'room', 'way', 'people', 'watch', 'tv', 'radically', 'different', 'five', 'year', 'time', 'according', 'expert', 'panel', 'gathered', 'annual', 'consumer', 'electronics', 'show', 'la', 'vega', 'discus', 'new', 'technology', 'impact', 'one', 'favourite', 'pastime', 'u', 'leading', 'trend', 'programme', 'content', 'delivered', 'viewer', 'via', 'home', 'network', 'cable', 'satellite', 'telecom', 'company', 'broadband', 'service', 'provider', 'front', 'room', 'portable', 'device', 'one', 'talkedabout', 'technology', 'ce', 'digital', 'personal', 'video', 'recorder', 'dvr', 'pvr', 'settop', 'box', 'like', 'u', 'tivo', 'uk', 'sky', 'system', 'allow', 'people', 'record', 'store', 'play', 'pause', 'forward', 'wind', 'tv', 'programme', 'want', 'essentially', 'technology', 'allows', 'much', 'personalised', 'tv', 'also', 'builtin', 'highd

#### Step 5 - Tokenization with Keras

The Keras `Tokenizer` vectorizes a text corpus by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or into a vector.

Because the data has already been cleaned and lemmatized, the tokenizer only has to build the vocabulary and map each token to its index.
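
Conceptually, the mapping from tokens to integers works roughly as in the pure-Python sketch below: words are ranked by frequency, indices start at 1 (0 stays free for padding), and unknown words are skipped. The function names are hypothetical, and this is an approximation of the behaviour rather than the Keras implementation.

```python
from collections import Counter

def build_word_index(tokenized_texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for tokens in tokenized_texts for word in tokens)
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(tokenized_texts, word_index):
    """Replace every known word by its index; unknown words are skipped."""
    return [[word_index[w] for w in tokens if w in word_index] for tokens in tokenized_texts]

docs = [["tv", "future", "tv"], ["tv", "box", "future"]]
word_index = build_word_index(docs)
sequences = texts_to_sequences(docs, word_index)

print(word_index)  # {'tv': 1, 'future': 2, 'box': 3}
print(sequences)   # [[1, 2, 1], [1, 3, 2]]
```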

In [None]:
from keras.preprocessing.text import Tokenizer

def get_sequence_of_tokens(input_sentences):
    tokenizer = Tokenizer()
    
    tokenizer.fit_on_texts(input_sentences)   
    sentences_to_sequences = tokenizer.texts_to_sequences(input_sentences)
    
    return sentences_to_sequences

In [None]:
X = get_sequence_of_tokens(X)
X[:5]

[[93,
  173,
  512,
  971,
  53,
  1002,
  88,
  4666,
  1173,
  93,
  147,
  193,
  2205,
  1318,
  1230,
  1294,
  32,
  6,
  873,
  93,
  5823,
  333,
  107,
  3,
  12,
  141,
  828,
  1198,
  2290,
  579,
  160,
  1188,
  48,
  1253,
  2802,
  1589,
  7,
  64,
  759,
  9,
  699,
  11583,
  8,
  664,
  1147,
  210,
  410,
  1827,
  971,
  773,
  53,
  149,
  1174,
  1865,
  1133,
  19,
  336,
  28,
  1546,
  855,
  1294,
  1019,
  291,
  9,
  18579,
  64,
  2343,
  147,
  366,
  193,
  2205,
  7926,
  5195,
  3967,
  531,
  29,
  8,
  5196,
  21,
  1295,
  88,
  450,
  6,
  102,
  814,
  71,
  4117,
  469,
  4294,
  93,
  210,
  50,
  5824,
  64,
  1935,
  77,
  7927,
  93,
  5,
  4118,
  1173,
  93,
  46,
  154,
  106,
  421,
  8,
  2737,
  30,
  151,
  1090,
  1173,
  3968,
  6,
  469,
  4294,
  2119,
  5,
  2437,
  13891,
  149,
  778,
  2888,
  1269,
  614,
  18580,
  754,
  8,
  149,
  1174,
  1865,
  19,
  1754,
  171,
  378,
  1905,
  709,
  54,
  983,
  2168,
  971,
  4906,


#### Step 6 - Padding the Sequences

We now have a dataset of token sequences, but the sequences have different lengths. Before training the model, we need to pad the sequences so that they all have the same length. We can use the `pad_sequences` function of Keras for this purpose.
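
The effect of post-padding and post-truncating can be sketched in pure Python. `pad_post` is a hypothetical helper written for illustration; the notebook itself relies on the Keras function.

```python
def pad_post(sequences, maxlen=None, value=0):
    """Right-pad (or right-truncate) every sequence to the same length."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    return [list(seq[:maxlen]) + [value] * (maxlen - len(seq[:maxlen])) for seq in sequences]

padded = pad_post([[93, 173, 512], [1490, 632], [7]])
print(padded)  # [[93, 173, 512], [1490, 632, 0], [7, 0, 0]]

truncated = pad_post([[93, 173, 512], [1490, 632], [7]], maxlen=2)
print(truncated)  # [[93, 173], [1490, 632], [7, 0]]
```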

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_padded_sequences(input_sequences, max_sequence = None):
    if max_sequence is None:
        max_sequence = max([len(x) for x in input_sequences])
    input_sequences = pad_sequences(input_sequences, maxlen = max_sequence, padding = "post", truncating = "post")
    
    return input_sequences, max_sequence

In [None]:
X, max_sequence = generate_padded_sequences(X)
X[:5]

array([[  93,  173,  512, ...,    0,    0,    0],
       [1490,  632,  299, ...,    0,    0,    0],
       [2805, 6236, 3436, ...,    0,    0,    0],
       [9928,  205, 1026, ...,    0,    0,    0],
       [3242, 4467, 4670, ...,    0,    0,    0]], dtype=int32)

#### Step 7 - Splitting the data

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 42)

In [None]:
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(1780, 2219) (1780, 1)
(445, 2219) (445, 1)


### 6. LSTMs for Text Classification

Unlike feed-forward neural networks, in which activations propagate in one direction only, recurrent neural networks (RNNs) feed the output of a layer at one time step back in as input at the next time step. These loops in the architecture act as a 'memory state' that lets the network remember what it has seen so far in the sequence.

This memory state gives RNNs an advantage over traditional neural networks, but they suffer from a problem called the vanishing gradient: when the network is trained over long sequences (or with many layers), the gradient shrinks as it is propagated backwards, so it becomes very hard to tune the parameters that act on the earliest steps. To address this problem, a type of RNN called the LSTM (Long Short-Term Memory) was developed.
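
A toy numeric illustration of the vanishing gradient: if the gradient is scaled by the same factor below 1 at every step on the way back, it shrinks exponentially with the number of steps. The factor 0.5 is an arbitrary assumption chosen for readability, not a value taken from a real network.

```python
def gradient_after(steps, per_step_factor=0.5):
    """Size of a gradient signal after being scaled by the same factor at every step."""
    gradient = 1.0
    for _ in range(steps):
        gradient *= per_step_factor
    return gradient

for steps in (1, 10, 50):
    print(steps, gradient_after(steps))
```

After 50 steps the signal is below 1e-15, so the earliest steps receive almost no update.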

LSTMs have an additional state called the 'cell state', through which the network regulates the flow of information. The advantage of this state is that the model can remember or forget what it has learned more selectively. To learn more about LSTMs, here is a great post. Let's build an LSTM model in our code. The model has four layers:

1. Embedding Layer : Takes the padded sequence of word indices as input and maps each index to a 100-dimensional vector.
2. Dropout Layer : A regularisation layer that randomly turns off some activations during training, which helps prevent overfitting. Here `SpatialDropout1D` drops entire embedding channels, and the LSTM layer applies its own `dropout` and `recurrent_dropout`. (Optional Layer)
3. LSTM Layer : Computes the output using LSTM units. I have added 100 units in the layer, but this number can be tuned later.
4. Output Layer : A dense softmax layer that computes the probability of each of the 5 categories.
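
To make the 'cell state' used by the LSTM layer more concrete, here is a sketch of one time step of a single LSTM unit with a scalar input, following the standard gate equations. The weights are arbitrary illustrative values, and the function and gate names are chosen for this example only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One time step of a single-unit LSTM with scalar input.

    `w` maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre(gate):
        wx, wh, b = w[gate]
        return wx * x + wh * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much of the new candidate to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate value for the cell state

    c = f * c_prev + i * g            # updated cell state
    h = o * math.tanh(c)              # updated hidden state
    return h, c

# Arbitrary illustrative weights (not learned values)
weights = {name: (0.5, 0.5, 0.0) for name in ("forget", "input", "output", "candidate")}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.0, w=weights)
print(h, c)
```

When the forget gate is close to 1 and the input gate close to 0, the cell state is carried over almost unchanged, which is how the unit can hold on to information across many steps.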

We will train this model for 5 epochs, but this number can be experimented with further.

In [None]:
from keras.layers import Embedding, SpatialDropout1D, LSTM, Dense, Dropout
from keras.models import Sequential
from keras.callbacks import EarlyStopping

def create_model(total_words, max_sequence):
    """
    Builds and compiles the LSTM classifier.
    
    Args:
        total_words: size of the vocabulary (input dimension of the embedding)
        max_sequence: length of the padded input sequences
    
    Returns:
        model: compiled Keras Sequential model
    """
    
    model = Sequential()
    
    # Add Input Embedding, Hidden and Output Layer
    model.add(Embedding(input_dim = total_words, output_dim = 100, input_length = max_sequence))
    model.add(SpatialDropout1D(0.2))
    model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
    model.add(Dense(units = 5, activation='softmax'))

    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
   
    return model

In [None]:
model = create_model(total_words, max_sequence)
model.summary()



Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_3 (Embedding)     (None, 2219, 100)         4377100   
                                                                 
 spatial_dropout1d_3 (Spatia  (None, 2219, 100)        0         
 lDropout1D)                                                     
                                                                 
 lstm_3 (LSTM)               (None, 100)               80400     
                                                                 
 dense_3 (Dense)             (None, 5)                 505       
                                                                 
Total params: 4,458,005
Trainable params: 4,458,005
Non-trainable params: 0
_________________________________________________________________
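
The parameter counts in the summary can be checked by hand. The sketch below uses the sizes shown above (vocabulary of 43771 words, embedding dimension 100, 100 LSTM units, 5 classes) and the standard formulas for each layer type.

```python
vocab_size, embedding_dim, lstm_units, n_classes = 43771, 100, 100, 5

# One embedding vector per vocabulary entry
embedding_params = vocab_size * embedding_dim
# 4 gates, each with input weights, recurrent weights and a bias vector
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)
# Dense layer: weight matrix plus one bias per class
dense_params = lstm_units * n_classes + n_classes

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)  # 4377100 80400 505 4458005
```

Almost all of the parameters sit in the embedding layer, so shrinking the vocabulary or the embedding dimension is the most direct way to make the model smaller.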


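Let's train our model now. The integer labels are first one-hot encoded, because the model is compiled with the `categorical_crossentropy` loss.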

In [None]:
from tensorflow.keras.utils import to_categorical

y_train = to_categorical(y_train, 5)
y_test = to_categorical(y_test, 5)
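
What one-hot encoding does to the integer labels can be sketched in pure Python. `one_hot` is a hypothetical helper for illustration; it mirrors the idea behind `to_categorical`, not its implementation.

```python
def one_hot(labels, num_classes):
    """Turn integer class labels into rows with a single 1.0 at the label's position."""
    return [[1.0 if k == label else 0.0 for k in range(num_classes)] for label in labels]

encoded = one_hot([4, 0, 3], 5)
print(encoded)
# [[0.0, 0.0, 0.0, 0.0, 1.0], [1.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0, 0.0]]
```

An alternative design would be to keep the integer labels and compile the model with the `sparse_categorical_crossentropy` loss, which accepts class indices directly.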

In [None]:
X_train.shape

(1780, 2219)

In [None]:
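# Note: with patience=7 and only 5 epochs, early stopping can never trigger in this run
early_stopping = EarlyStopping(monitor='val_loss', patience=7, min_delta=0.0001)
model.fit(X_train, y_train, epochs = 5, batch_size = 32, validation_split=0.2, callbacks = [early_stopping])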

Epoch 1/5
Epoch 2/5

KeyboardInterrupt: ignored

In [None]:
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
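
The accuracy reported above is the share of articles whose highest-probability class matches the true label. The sketch below shows that computation, and the mapping from class index back to category name, on made-up softmax outputs. The probabilities and labels are invented for illustration and are not results from this model.

```python
# Alphabetical order matches the category indices shown earlier (business=0, ..., tech=4)
CATEGORIES = ["business", "entertainment", "politics", "sport", "tech"]

def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda k: values[k])

def accuracy(probabilities, true_labels):
    """Fraction of rows whose highest-probability class equals the true label."""
    hits = sum(argmax(row) == label for row, label in zip(probabilities, true_labels))
    return hits / len(true_labels)

# Made-up softmax outputs for three articles
probs = [
    [0.05, 0.05, 0.10, 0.10, 0.70],
    [0.60, 0.10, 0.10, 0.10, 0.10],
    [0.10, 0.50, 0.20, 0.10, 0.10],
]
labels = [4, 0, 3]

predicted_names = [CATEGORIES[argmax(row)] for row in probs]
print(predicted_names)          # ['tech', 'business', 'entertainment']
print(accuracy(probs, labels))  # two of the three predictions are correct
```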

In [None]:
model.save('model')