# NLP Author Classification Case

## Context

Original context:

*Dear colleagues,*

*As you know, NLP is a very useful skill in a data scientist toolbox.*
*You have had the opportunity to discover this during a social Thursday, now it is time to rise and shine !*

*To this end, we will be hosting a case ~~competition~~ training so that everybody can experience this ~~wonderful and life-changing experiment~~ that is classifying text.*

This is my attempt at the challenge, where we need to attribute one or more sentences to an author :)

## Setup

In [31]:
#Import base packages
import os
import sys
import matplotlib.pyplot as plt
from sklearn import metrics
import pandas as pd
import seaborn as sns

proj_dir = os.path.abspath(os.path.dirname(globals()['_dh'][0]))
sys.path.append(proj_dir)

In [2]:
#If you will be modifying the packages
%load_ext autoreload
%autoreload 2

In [26]:
#Import local packages
from src.config import Config
from src.data import construct_training_dataframe
from src.apply import apply_fastai_model, apply_fastai_model_on_sentence

In [27]:
#Setup config
config = Config()

## EDA

#### Load data

Cleaning the data from the original files takes some time. To skip this step, you can download the pre-cleaned dataset from this [sharepoint link](https://agilytic-my.sharepoint.com/:x:/g/personal/jerome_agilytic_be/EcqMEgTAYW9IhOI-SBnyGg8BU80M7aENcMk_55Y8d8Snvw?e=ZxMXqc) and save it in `./data/interim/`.

In [21]:
#Clean input data if not done or downloaded yet
raw_data_directory = '{}training/'.format(config.get_raw_data_path()) #default, change if necessary
training_df_path = '{}training.csv'.format(config.get_interim_data_path()) #default, change if necessary

if os.path.isfile(training_df_path):
    print('Training file exists')
else:
    print('Creating new dataframe from directory, this can take a while')
    construct_training_dataframe(directory=raw_data_directory)
    
#Load data
training_df = pd.read_csv(training_df_path)
print('Data loaded')

Training file exists
Data loaded


#### Analyze dataset

First let's look at the raw data:

In [None]:
training_df.head()

Let's analyze the distribution of the target in the `author` column.

There are three authors:

In [None]:
training_df['author'].unique()

They are distributed as follows:

In [None]:
sns.set()
plt.figure(figsize=(12, 6))
sns.countplot(x='author', data=training_df)

Now let's do a basic analysis of the text lengths:

In [None]:
training_df['text_length'] = training_df['text'].str.len()
training_df.head(1)

In [None]:
#Describe for every author:
for author in training_df['author'].unique():
    print('Analysis for {}:'.format(author))
    print(training_df[training_df['author']==author].describe())
    print('\n')

You can see that the means are relatively similar, but that there are some outliers in text length:

In [None]:
plt.figure(figsize=(12, 6))
bplot = sns.boxplot(y='text_length', 
                    x='author', 
                    data=training_df, 
                    width=0.5,
                    palette="colorblind")
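
To put a number on the outliers visible in the boxplot, the sketch below counts, per author, the rows that fall outside the 1.5 × IQR whiskers (seaborn's default whisker rule). The helper `count_length_outliers` and the toy data are hypothetical and only illustrate the idea; they are not part of the project code.

```python
import pandas as pd


def count_length_outliers(df, group_col="author", value_col="text_length"):
    """Count rows per group that fall outside the 1.5 * IQR whiskers of value_col."""
    grouped = df.groupby(group_col)[value_col]
    # Broadcast each group's quartiles back onto its rows
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    is_outlier = (df[value_col] < q1 - 1.5 * iqr) | (df[value_col] > q3 + 1.5 * iqr)
    return is_outlier.groupby(df[group_col]).sum()


# Toy data standing in for training_df (values are made up for illustration)
toy_df = pd.DataFrame({
    "author": ["EAP"] * 5 + ["MWS"] * 5,
    "text_length": [100, 110, 120, 130, 900, 90, 95, 100, 105, 110],
})
outlier_counts = count_length_outliers(toy_df)
print(outlier_counts)
```

On the notebook data this would be called as `count_length_outliers(training_df)`, once the `text_length` column has been added.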

### Training models

Different modelling techniques were tried out:

For features:
- TF-IDF
- Count vectors

Combined with these algorithms:
- Logistic Regression 
- Random forests
- Multinomial Naive Bayes

The results varied between 70% and 84% accuracy on a held-out test set.
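
As an illustration of one of these combinations, here is a minimal sketch of a TF-IDF + logistic regression baseline evaluated on a held-out split. The exact features and hyperparameters used in the original experiments are not recorded in this notebook, so the n-gram range, split size, and the helper name `train_baseline` are assumptions, and the toy texts are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def train_baseline(texts, authors, test_size=0.2, random_state=42):
    """Fit a TF-IDF + logistic regression baseline; return (model, held-out accuracy)."""
    x_train, x_test, y_train, y_test = train_test_split(
        texts, authors, test_size=test_size, random_state=random_state, stratify=authors
    )
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),  # word unigrams and bigrams (assumed setting)
        LogisticRegression(max_iter=1000),
    )
    model.fit(x_train, y_train)
    return model, model.score(x_test, y_test)


# Toy corpus with placeholder labels, for illustration only
toy_texts = ["the ship sailed the cold sea"] * 10 + ["the garden bloomed in warm spring"] * 10
toy_authors = ["A"] * 10 + ["B"] * 10
baseline, accuracy = train_baseline(toy_texts, toy_authors)
print("Held-out accuracy:", accuracy)
```

With the notebook data, the call would be `train_baseline(training_df['text'], training_df['author'])`. Swapping `TfidfVectorizer` for `CountVectorizer`, or `LogisticRegression` for `MultinomialNB` or `RandomForestClassifier`, gives the other combinations listed above.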

**What worked best:**
In the end, I opted for a deep learning model based on transfer learning, using the [fastai library](https://www.fast.ai/).
 
fastai is a library built on top of PyTorch that aims to make state-of-the-art deep learning techniques easy to apply.

This particular model was trained with ULMFiT, a transfer learning approach that starts from a language model pre-trained on a large corpus of Wikipedia articles.

**The end result was 86% accuracy on the test set**

Because training is resource-intensive, I used Google Colab for its free GPU. See the notebook at `./scripts/train_ulm_fit.ipynb`.

### Applying model

We can now apply the model using the following function:

In [None]:
apply_fastai_model_on_sentence('Idris was very content')

It is also possible to apply the model to a dataframe, although without a GPU this can be rather slow:

In [None]:
input_path = training_df_path
max_number_of_rows = 10
apply_fastai_model(input_path, max_number_of_rows)

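Here is the run on the whole dataset; it is slow (roughly 40 minutes for 20k rows). Note that this file contains the data used for training, so the accuracy below is not comparable to the 86% measured on the held-out test set: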

In [41]:
input_path = training_df_path
result = apply_fastai_model(input_path)

y_real = training_df['author']
predicted = result['prediction']
print("Model Accuracy:", metrics.accuracy_score(y_real, predicted))

   id                                               text author
0   1  Idris was well content with this resolve of mi...    MWS
1   2  I was faint, even fainter than the hateful mod...    HPL
2   3  Above all, I burn to know the incidents of you...    EAP
3   4  He might see, perhaps, one or two points with ...    EAP
4   5  All obeyed the Lord Protector of dying England...    MWS


Model Accuracy: 0.9653675231138581
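
A single accuracy figure hides which authors get confused with each other. The sketch below shows one way to break the result down per author with scikit-learn's `confusion_matrix` and `classification_report`. The helper `per_author_report` is hypothetical, and the toy labels are made up for illustration; they are not model results.

```python
from sklearn import metrics


def per_author_report(y_true, y_pred):
    """Return the label order, the confusion matrix and a per-author text report."""
    labels = sorted(set(y_true) | set(y_pred))
    # Rows are true authors, columns are predicted authors, both in `labels` order
    matrix = metrics.confusion_matrix(y_true, y_pred, labels=labels)
    report = metrics.classification_report(y_true, y_pred, labels=labels, zero_division=0)
    return labels, matrix, report


# Toy labels for illustration only
toy_true = ["EAP", "EAP", "HPL", "MWS", "MWS", "HPL"]
toy_pred = ["EAP", "HPL", "HPL", "MWS", "MWS", "HPL"]
labels, matrix, report = per_author_report(toy_true, toy_pred)
print(labels)
print(matrix)
print(report)
```

With the variables from the cell above, the call would be `per_author_report(y_real, predicted)`.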


### API

To industrialize the model, an API was built.
This way it can run on a server, with a GPU if it needs to handle large datasets, or with a CPU for occasional requests.

To run it:
1. Check the address and port in the config file: `./config/config.yml`
2. Open an Anaconda prompt and run `activate nlp_author_case`
3. Go to the root folder of this project
4. Run `python ./src/api.py`

For now, it only classifies single sentences, using a URL similar to http://localhost:5000/sentence?sentence=%22Idris%20was%20very%20content%22
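
The URL can also be built and called from Python. The sketch below uses only the standard library and percent-encodes the sentence the same way as the example URL. The helper names `build_sentence_url` and `classify_sentence` are hypothetical, the host and port defaults are taken from the example URL (the real values live in `./config/config.yml`), and because the response format is not described here, the raw response body is returned as text.

```python
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def build_sentence_url(sentence, host="localhost", port=5000):
    """Build the /sentence URL, wrapping the sentence in double quotes as in the example URL."""
    query = urlencode({"sentence": '"{}"'.format(sentence)}, quote_via=quote)
    return "http://{}:{}/sentence?{}".format(host, port, query)


def classify_sentence(sentence, host="localhost", port=5000, timeout=30):
    """Send the sentence to a running API and return the raw response body as text."""
    with urlopen(build_sentence_url(sentence, host, port), timeout=timeout) as response:
        return response.read().decode("utf-8")


url = build_sentence_url("Idris was very content")
print(url)
```

`classify_sentence('Idris was very content')` only works while the API from step 4 is running.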

If it does not respond, check whether your antivirus (such as F-Secure) is blocking the port.
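
A quick way to see whether the port accepts connections at all is a plain TCP check, sketched below with the standard library. The helper `port_is_reachable` is hypothetical, and its defaults are taken from the example URL above.

```python
import socket


def port_is_reachable(host="localhost", port=5000, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts
        return False


print(port_is_reachable())
```

A `False` result means either that the API is not running or that something is blocking the port, so it is worth checking that `python ./src/api.py` is still active before looking at the antivirus settings.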

### Contact

For any questions, don't hesitate to contact me, Jérôme Belpaire, at jerome@agilytic.be