# fastText for Text Classification

Developed by Facebook AI Research, fastText is an open-source, lightweight library for learning text representations and training text classifiers.

This notebook uses fastText to classify the quality of Stack Overflow questions. The dataset is [available on Kaggle](https://www.kaggle.com/imoore/60k-stack-overflow-questions-with-quality-rate).

Instructions for installing and building fastText are in the [fastText supervised tutorial](https://fasttext.cc/docs/en/supervised-tutorial.html#installing-fasttext).

## Import libraries

In [1]:
import pandas as pd
import numpy as np
from gensim.utils import simple_preprocess
import fasttext

## Preprocessing

In [2]:
# Import datasets
train = pd.read_csv('./data/train.csv')[['Body', 'Y']].rename(columns={'Body': 'questions', 'Y': 'category'})
test = pd.read_csv('./data/valid.csv')[['Body', 'Y']].rename(columns={'Body': 'questions', 'Y': 'category'})

In [3]:
train.head()

| | questions | category |
|---|---|---|
| 0 | `<p>I'm already familiar with repeating tasks e...` | LQ_CLOSE |
| 1 | `<p>I'd like to understand why Java 8 Optionals...` | HQ |
| 2 | `<p>I am attempting to overlay a title over an ...` | HQ |
| 3 | `<p>The question is very simple, but I just cou...` | HQ |
| 4 | `<p>I'm using custom floatingactionmenu. I need...` | HQ |


In [4]:
test.head()

| | questions | category |
|---|---|---|
| 0 | `I am having 4 different tables like \r\nselect...` | LQ_EDIT |
| 1 | `I have two table m_master and tbl_appointment\...` | LQ_EDIT |
| 2 | `<p>I'm trying to extract US states from wiki U...` | HQ |
| 3 | `I'm so new to C#, I wanna make an application ...` | LQ_EDIT |
| 4 | `basically i have this array:\r\n\r\n array(...` | LQ_EDIT |


In [5]:
print(train.shape, test.shape)

(45000, 2) (15000, 2)


In [6]:
train['category'].value_counts(normalize=True)

HQ          0.333333
LQ_CLOSE    0.333333
LQ_EDIT     0.333333
Name: category, dtype: float64

The three classes are perfectly balanced, so always predicting a single class gives a baseline accuracy of 0.33.
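
The baseline can be checked with a few lines of plain Python. This is a minimal sketch: `majority_baseline` and the toy label list are hypothetical names introduced for illustration, and the toy labels only mirror the balanced three-class split shown above.

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# Hypothetical toy labels with the same balanced class split as the dataset
toy_labels = ["HQ", "LQ_CLOSE", "LQ_EDIT"] * 4
baseline = majority_baseline(toy_labels)
print(round(baseline, 2))
```

Any classifier worth keeping should score clearly above this value on the test set.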

In [7]:
# Use gensim to tokenise and remove symbols
train['questions_preprocessed'] = train['questions'].map(lambda x: ' '.join(simple_preprocess(x)))
test['questions_preprocessed'] = test['questions'].map(lambda x: ' '.join(simple_preprocess(x)))
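
To get a feel for what this step does to an HTML question body, here is a rough standard-library stand-in. It is an assumption-laden approximation, not gensim's implementation: it lowercases the text, keeps runs of word characters that are not digits, and drops tokens shorter than 2 or longer than 15 characters. `rough_preprocess` and the sample string are hypothetical.

```python
import re

# Approximation only: runs of word characters, excluding digits
TOKEN_PATTERN = re.compile(r"(?:(?!\d)\w)+")


def rough_preprocess(text, min_len=2, max_len=15):
    """Lowercase, tokenise, and drop very short or very long tokens."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]


# Hypothetical question body containing HTML markup
sample = "<p>How do I parse JSON in Python 3?</p><pre><code>json.loads(s)</code></pre>"
cleaned = rough_preprocess(sample)
print(" ".join(cleaned))
```

Note that tag names such as `pre` and `code` survive as ordinary tokens in this sketch, so stripping HTML before tokenising may be worth trying.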

In [8]:
# Add prefix '__label__' to each row in target column so that fastText can recognise that this is a label
train['category'] = train['category'].map(lambda x: '__label__' + x)
test['category'] = test['category'].map(lambda x: '__label__' + x)

In [9]:
# Save label and text as space-separated .txt files, one example per line
train[['category', 'questions_preprocessed']].to_csv('./data/train.txt',
                                                    index=False,
                                                    sep=' ',
                                                    header=None)

test[['category', 'questions_preprocessed']].to_csv('./data/test.txt',
                                                    index=False,
                                                    sep=' ',
                                                    header=None)
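
One thing worth checking in the files written above: with pandas' default quoting, `to_csv` wraps any field that contains the separator in double quotes, so with `sep=' '` the question text may end up quoted. A minimal sketch of an alternative is to write the lines directly. `write_fasttext_file`, the toy rows, and the temporary output path are hypothetical and not part of the original notebook.

```python
import os
import tempfile


def write_fasttext_file(path, labels, texts):
    """Write one '__label__<name> <text>' example per line."""
    with open(path, "w", encoding="utf-8") as f:
        for label, text in zip(labels, texts):
            # Collapse whitespace so each example stays on a single line
            clean = " ".join(text.split())
            f.write(f"{label} {clean}\n")


# Hypothetical toy rows; the labels already carry the __label__ prefix
toy_labels = ["__label__HQ", "__label__LQ_EDIT"]
toy_texts = ["how to parse json in python", "please fix\nmy code"]

out_path = os.path.join(tempfile.mkdtemp(), "toy_train.txt")
write_fasttext_file(out_path, toy_labels, toy_texts)

with open(out_path, encoding="utf-8") as f:
    written_lines = f.read().splitlines()
print(written_lines)
```

With the notebook's data, the same function could be called with `train['category']` and `train['questions_preprocessed']`.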

## Training fastText classifier

In [10]:
# Train a supervised classifier on unigram and bigram features for 20 epochs
model = fasttext.train_supervised('./data/train.txt', wordNgrams=2, epoch=20)
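
The `wordNgrams=2` setting means the model uses word n-grams up to length 2, so pairs of adjacent words become features alongside single words. The sketch below only illustrates which n-grams that covers. `word_ngrams` is a hypothetical helper, and fastText's internal handling (it hashes n-grams into buckets) is not reproduced here.

```python
def word_ngrams(tokens, max_n=2):
    """Return all contiguous word n-grams of length 1..max_n."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


# Hypothetical preprocessed question
features = word_ngrams("how to parse json".split(), max_n=2)
print(features)
```

Bigrams let the classifier pick up some local word order, which a pure bag-of-words model ignores.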

## Model evaluation

In [11]:
def print_results(N, p, r):
    print("N\t" + str(N))
    print("P@{}\t{:.3f}".format(1, p))
    print("R@{}\t{:.3f}".format(1, r))

    
# Evaluating fastText's performance on test set
print_results(*model.test('./data/test.txt'))

N	15000
P@1	0.832
R@1	0.832


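fastText achieved a precision at 1 (P@1) and recall at 1 (R@1) of 0.83 on the test set, well above the 0.33 baseline. Because every question has exactly one label and only the top prediction is scored, both values equal the classification accuracy.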

As explained in the [fastText documentation](https://fasttext.cc/docs/en/supervised-tutorial.html#advanced-readers-precision-and-recall):
>Precision is the number of correct labels among the labels predicted by fastText. Recall is the number of labels that successfully were predicted, among all the real labels.
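
The definition quoted above can be sketched in plain Python. This is an illustrative implementation under stated assumptions, not fastText's own code: `precision_recall_at_k` is a hypothetical helper that takes the true label set for each example and a ranked list of predicted labels.

```python
def precision_recall_at_k(true_labels, ranked_predictions, k=1):
    """true_labels: list of sets; ranked_predictions: lists sorted by score."""
    correct = predicted = relevant = 0
    for truth, ranked in zip(true_labels, ranked_predictions):
        top_k = ranked[:k]
        correct += sum(1 for label in top_k if label in truth)
        predicted += len(top_k)   # labels the model predicted
        relevant += len(truth)    # real labels
    return correct / predicted, correct / relevant


# Hypothetical single-label examples, as in this dataset
toy_truth = [{"HQ"}, {"LQ_EDIT"}, {"LQ_CLOSE"}, {"HQ"}]
toy_preds = [["HQ"], ["LQ_EDIT"], ["HQ"], ["HQ"]]
p_at_1, r_at_1 = precision_recall_at_k(toy_truth, toy_preds, k=1)
print(p_at_1, r_at_1)
```

With one true label and one prediction per example, the two denominators are identical, which is why P@1 and R@1 coincide in the results above. They diverge only when examples carry several labels or `k` is greater than 1.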

## References

[fastText Python module](https://fasttext.cc/docs/en/python-module.html)<br>
[fastText for Text Classification](https://towardsdatascience.com/fasttext-for-text-classification-a4b38cbff27c)