# Text Classification Problem

We expect a candidate to develop a solution that is capable of classifying the provided texts into one of **four** classes.
 
You may find the dataset in the **data** folder:
- train.csv contains the training dataset. There are four columns in this file:
    - id - column with unique identifier of each data sample
    - category - target variable
    - title - document title
    - description - document text
- test.csv contains the test dataset. Its columns are the same, except for category, which is unknown and should be predicted.
- sample_submission.csv is an example of what the resulting submission should look like.

Your model should output, for each sample, the probability of it belonging to each class.

To submit your solution, put this **solution.ipynb** file and the generated **submission.csv** in a **zip** file.

We are interested in seeing how the candidate implements their typical pipeline for solving machine learning problems, starting from a dataset that contains both the data and the target variable.

We **do not** expect a state-of-the-art solution here, but rather code that demonstrates the candidate's understanding of the crucial parts of ML model development. However, it would be a plus to see, in the conclusions, a brief description of how to get to a near-state-of-the-art solution.

#### Imports

In [None]:
import numpy as np
import pandas as pd

# add needed libraries here
import matplotlib.pyplot as plt
import tensorflow as tf
import nltk
from tensorflow import keras
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Download NLTK stopwords
from nltk.corpus import stopwords
nltk.download('stopwords')

#### Your solution

##### Hyperparameters

In [None]:
vocab_size = 10000
embedding_dim = 32
max_length_a = 200
max_length_t = 50
trunc_type = 'post'
padding_type = 'post'
oov_tok = '<OOV>'
training_portion = .8
num_epochs = 20

##### Prepare training data

In [None]:
# put your code in this and the following blocks
df_train = pd.read_csv('data/train.csv')

In [None]:
# Just check the head of the dataframe
df_train.head()

In [None]:
# Check whether the mean token counts of the titles and the descriptions
# make sense with the max lengths set in the hyperparameters cell
print('Mean length of titles (tokens):', round(df_train.title.str.split().str.len().mean(), 2))
print('Mean length of articles (tokens):', round(df_train.description.str.split().str.len().mean(), 2))
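
A mean can hide a long tail of unusually long documents. The following is a minimal sketch, assuming that whitespace-separated tokens roughly approximate what the tokenizer produces, of checking a high percentile of token counts against `max_length_t` and `max_length_a`. The helper name `token_length_percentile` and the toy texts are hypothetical.

```python
import math

def token_length_percentile(texts, q=0.95):
    """Return the whitespace-token count at quantile q (nearest-rank method)."""
    lengths = sorted(len(text.split()) for text in texts)
    if not lengths:
        raise ValueError("texts must not be empty")
    # Nearest rank: smallest length with at least a fraction q of the data at or below it
    rank = max(1, math.ceil(q * len(lengths)))
    return lengths[rank - 1]

# Toy data standing in for df_train.title or df_train.description
sample_texts = ["a b c", "a b", "a b c d e f", "a"]
p50 = token_length_percentile(sample_texts, q=0.5)
p100 = token_length_percentile(sample_texts, q=1.0)
```

If the 95th percentile sits well below the chosen maximum length, most of each padded sequence is padding; if it sits well above, many documents are being truncated.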

In [None]:
def remove_stopwords(input_text):
    '''Remove English stopwords and single-character tokens from a string.'''
    stopwords_set = set(stopwords.words('english'))
    # Some words which might indicate a certain sentiment are kept via a whitelist
    whitelist = {"n't", "not", "no"}
    words = input_text.split()
    # Compare in lowercase so that capitalized stopwords (e.g. "The") are removed too
    clean_words = [word for word in words
                   if (word.lower() not in stopwords_set or word.lower() in whitelist)
                   and len(word) > 1]
    return " ".join(clean_words)

In [None]:
# Remove stopwords and create train data
train_titles = df_train.title.apply(remove_stopwords).to_numpy()
train_articles = df_train.description.apply(remove_stopwords).to_numpy()
train_targets = df_train.category.to_numpy()

In [None]:
# Fit the tokenizer on titles and descriptions together so both share one vocabulary
text = np.concatenate((train_titles, train_articles))

In [None]:
tokenizer = keras.preprocessing.text.Tokenizer(num_words=vocab_size,
                                               filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                                               lower=True,
                                               oov_token=oov_tok)
tokenizer.fit_on_texts(text)

In [None]:
# Print most common words
word_index = tokenizer.word_index
dict(list(word_index.items())[0:10])

In [None]:
# Transform to sequences
train_title_s = tokenizer.texts_to_sequences(train_titles)
train_article_s = tokenizer.texts_to_sequences(train_articles)

In [None]:
# Pad sequences
train_title_p = pad_sequences(train_title_s,
                              maxlen=max_length_t,
                              padding=padding_type,
                              truncating=trunc_type)

train_article_p = pad_sequences(train_article_s,
                                maxlen=max_length_a,
                                padding=padding_type,
                                truncating=trunc_type)
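
To make the effect of `padding='post'` and `truncating='post'` concrete, here is a small pure-Python sketch of the same idea. It is an illustration only, not the Keras implementation, and the helper name `pad_post` is hypothetical.

```python
def pad_post(sequences, maxlen, value=0):
    """Cut each sequence to maxlen from the end, then right-pad it with value."""
    padded = []
    for seq in sequences:
        kept = list(seq[:maxlen])               # 'post' truncation: drop tokens past maxlen
        kept += [value] * (maxlen - len(kept))  # 'post' padding: fill on the right
        padded.append(kept)
    return padded

example = pad_post([[5, 8], [3, 1, 4, 1, 5]], maxlen=4)
```

Here `example` is `[[5, 8, 0, 0], [3, 1, 4, 1]]`: the short sequence gains trailing zeros, and the long one loses its last token.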

In [None]:
# Print train example
reverse_word_index = {index: word for word, index in word_index.items()}

def decode_article(sequence):
    '''Map token ids back to words; ids without a word (e.g. padding, id 0) are shown as "?".'''
    return ' '.join(reverse_word_index.get(i, '?') for i in sequence)

print(decode_article(train_article_p[10]))
print('***')
print(train_articles[10])

##### The model

In [None]:
def nlp_model():
    '''Two-input model: each branch applies an Embedding and a Bidirectional GRU.'''
    # Each input is a padded sequence of token ids, so its shape is the sequence length
    articles = tf.keras.layers.Input(shape=(max_length_a,))
    titles = tf.keras.layers.Input(shape=(max_length_t,))

    a = tf.keras.layers.Embedding(vocab_size, embedding_dim)(articles)
    a = tf.keras.layers.Bidirectional(tf.keras.layers.GRU(embedding_dim))(a)
    a = tf.keras.layers.Dropout(0.5)(a)
    # Integer division: the number of units must be an int
    a = tf.keras.layers.Dense(embedding_dim // 2, activation='relu')(a)

    t = tf.keras.layers.Embedding(vocab_size, embedding_dim)(titles)
    t = tf.keras.layers.Bidirectional(tf.keras.layers.GRU(embedding_dim))(t)
    t = tf.keras.layers.Dropout(0.5)(t)
    t = tf.keras.layers.Dense(embedding_dim // 2, activation='relu')(t)

    # Merge both branches and predict a probability for each of the four classes
    z = tf.keras.layers.concatenate([a, t])
    outputs = tf.keras.layers.Dense(4, activation='softmax')(z)

    return tf.keras.models.Model(inputs=(articles, titles), outputs=outputs)

In [None]:
# Create the model
model = nlp_model()

# Compile the model with Adam optimizer and Sparse Categorical Crossentropy loss
adam = tf.keras.optimizers.Adam(learning_rate=3e-4)
model.compile(loss='sparse_categorical_crossentropy',
              optimizer=adam, metrics=['accuracy'])

# Use of Reduce Learning Rate on Plateau and EarlyStopping
lr_plateau = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=3)
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=6, restore_best_weights=True)

# Start training process
history = model.fit((train_article_p, train_title_p), train_targets, epochs=num_epochs,
                   validation_split=0.1, shuffle=True,
                   callbacks=[lr_plateau, early_stop])

In [None]:
# Plot accuracy and loss metrics
def plot_graphs(history, metric):
    plt.plot(history.history[metric])
    plt.plot(history.history['val_' + metric])
    plt.xlabel("Epochs")
    plt.ylabel(metric)
    plt.legend(['train_' + metric, 'val_' + metric])
    plt.show()

plot_graphs(history, "accuracy")
plot_graphs(history, "loss")

##### Prepare test data

In [None]:
df_test = pd.read_csv('data/test.csv')

In [None]:
df_test.head()

In [None]:
test_titles = df_test.title.apply(remove_stopwords).to_numpy()
test_articles = df_test.description.apply(remove_stopwords).to_numpy()

In [None]:
test_title_s = tokenizer.texts_to_sequences(test_titles)
test_article_s = tokenizer.texts_to_sequences(test_articles)

In [None]:
test_title_p = pad_sequences(test_title_s,
                             maxlen=max_length_t,
                             padding=padding_type,
                             truncating=trunc_type)

test_article_p = pad_sequences(test_article_s,
                               maxlen=max_length_a,
                               padding=padding_type,
                               truncating=trunc_type)

##### Predict on test data

In [None]:
predictions = model.predict((test_article_p, test_title_p))

#### Prepare submission

In [None]:
# edit the following code to generate a submission file
submission = pd.DataFrame()
submission['id'] = df_test['id']
submission['category_0'] = predictions[:, 0]
submission['category_1'] = predictions[:, 1]
submission['category_2'] = predictions[:, 2]
submission['category_3'] = predictions[:, 3]
submission.to_csv('submission.csv', index=False)
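
Since the task asks for a probability per class, it can be worth sanity-checking the file before zipping it. The sketch below uses only the standard library and assumes the column names written by the cell above; the helper name `find_bad_rows` and the toy CSV text are hypothetical.

```python
import csv
import io
import math

EXPECTED_COLUMNS = ["id", "category_0", "category_1", "category_2", "category_3"]

def find_bad_rows(csv_text, tolerance=1e-3):
    """Return the ids of rows whose probabilities are out of range or do not sum to 1."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    bad_ids = []
    for row in reader:
        probs = [float(row[col]) for col in EXPECTED_COLUMNS[1:]]
        in_range = all(0.0 <= p <= 1.0 for p in probs)
        sums_to_one = math.isclose(sum(probs), 1.0, abs_tol=tolerance)
        if not (in_range and sums_to_one):
            bad_ids.append(row["id"])
    return bad_ids

# Toy content: row "a" is valid, row "b" sums to 2.0
toy_csv = (
    "id,category_0,category_1,category_2,category_3\n"
    "a,0.7,0.1,0.1,0.1\n"
    "b,0.5,0.5,0.5,0.5\n"
)
bad = find_bad_rows(toy_csv)
```

On the real file, the same function could be called with the text read from `submission.csv`; an empty list means every row passed both checks.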

# Conclusions

Write a few words about your solution here. 
- My approach has been to use a model with two inputs: one for the titles and one for the descriptions of the articles. Both inputs go through the same stack of layers (with separate weights):
    - An Embedding layer, to represent each word as a dense vector.
    - A Bidirectional GRU layer, to learn "through time" from the titles and descriptions.
    - A Dropout layer, to perform some regularization.
    - A Dense layer with a ReLU activation.
- As a final stage, the model concatenates both branches and applies a final Dense layer with a softmax activation.

What could be improved? 
 - The model seems to be overfitting: the training metrics keep improving with each epoch, while the validation metrics hover around the same values. Some actions that could be taken:
     - Apply more aggressive regularization to the network (a higher dropout rate, or other methods such as L2 regularization).
     - Reduce the network's capacity, to check whether that solves the overfitting.
     - Search for a better configuration of the hyperparameters.
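
The hyperparameter search mentioned above can be sketched as a plain grid search. This is a minimal illustration under stated assumptions: `evaluate` stands for a caller-supplied function that would build and train the model and return its best validation loss, and `fake_validation_loss` is a hypothetical stand-in so that the example runs without training anything.

```python
import itertools

def grid_search(param_grid, evaluate):
    """Try every combination in param_grid and return (best_params, best_score).

    evaluate must return a validation loss, so lower is better.
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("inf")
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical stand-in for "train the model and return the best val_loss"
def fake_validation_loss(params):
    return abs(params["dropout"] - 0.3) + abs(params["gru_units"] - 16) / 100

grid = {"dropout": [0.3, 0.5, 0.7], "gru_units": [16, 32]}
best_params, best_score = grid_search(grid, fake_validation_loss)
```

With a real training function, each call is expensive, so a small grid or a random sample of combinations is usually the practical choice.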
     
What approaches may work as well for this problem? 
- Make use of Conv1D layers. Convolutional layers are also commonly used in NLP and forecasting tasks.
- Create a model made only of Dense layers. For basic NLP tasks, some Dense-only architectures also perform quite well.
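
To show what a 1D convolution followed by global max pooling computes, here is a pure-Python sketch on a single channel. It is an illustration of the operation only, not the Keras `Conv1D` layer; the function names and the toy numbers are hypothetical, and the kernel is applied without flipping, as deep-learning convolution layers typically do.

```python
def conv1d_valid(sequence, kernel):
    """Slide kernel over sequence (no padding, stride 1) and return the dot products."""
    width = len(kernel)
    return [
        sum(sequence[start + k] * kernel[k] for k in range(width))
        for start in range(len(sequence) - width + 1)
    ]

def global_max_pool(features):
    """Keep only the strongest response, regardless of where it occurred."""
    return max(features)

signal = [0.0, 1.0, 2.0, 1.0, 0.0]
kernel = [1.0, 0.0, -1.0]
features = conv1d_valid(signal, kernel)
pooled = global_max_pool(features)
```

In a text model the sequence would hold embedding vectors rather than single numbers, and each kernel would act as a detector for a short pattern of neighbouring words.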

What would you implement if you had more time for this task?
- Attention-based models have largely replaced LSTM/GRU models in state-of-the-art NLP. So a good next step for this problem would be to try, for example, the Transformer neural network architecture.
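
The core operation of the Transformer is scaled dot-product attention. The sketch below computes it for a single query in pure Python, as an illustration of the idea rather than a usable layer; the function names and the toy vectors are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    # Similarity of the query to every key, scaled by sqrt of the vector size
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Weighted average of the value vectors
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return context, weights

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
context, weights = attention([1.0, 0.0], keys, values)
```

The weights sum to 1, and the keys that align with the query (the first and third here) receive more weight than the one that does not.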

Feel free to write anything you think is relevant to this task :)