# FastText for Text Classification

In [2]:
from keras.preprocessing.text import Tokenizer
# from gensim.models.fasttext import FastText
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import nltk
from string import punctuation
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize
from nltk import WordPunctTokenizer

import wikipedia
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
en_stop = set(nltk.corpus.stopwords.words('english'))

%matplotlib notebook

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\imanursar\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\imanursar\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\imanursar\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Text classification refers to classifying textual data into predefined categories based on the contents of the text. Sentiment analysis, spam detection, and tag detection are some of the most common examples of use-cases for text classification.

# The Dataset

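The reviews come from the Yelp dataset on Kaggle: https://www.kaggle.com/yelp-dataset/yelp-dataset/version/4#yelp_review.csv

The cell below loads a shortened copy of the review file and bins the `stars` column: 1 to 2 stars become `negative`, 3 to 5 stars become `positive`.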

In [None]:
yelp_reviews = pd.read_csv("/content/drive/My Drive/Colab Datasets/yelp_review_short.csv")

bins = [0,2,5]
review_names = ['negative', 'positive']
yelp_reviews['reviews_score'] = pd.cut(yelp_reviews['stars'], bins, labels=review_names)

yelp_reviews.head()

# labeling

In [None]:
import pandas as pd
from io import StringIO
import csv

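# Keep only the label and text columns (copy so the assignments below do not touch a view)
col = ['reviews_score', 'text']
yelp_reviews = yelp_reviews[col].copy()

# fastText expects every label to be prefixed with __label__
yelp_reviews['reviews_score'] = ['__label__' + s for s in yelp_reviews['reviews_score']]
# Each example must sit on a single line, so replace newlines and tabs with spaces
yelp_reviews['text'] = yelp_reviews['text'].replace('\n', ' ', regex=True).replace('\t', ' ', regex=True)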

yelp_reviews.to_csv(r'/content/drive/My Drive/Colab Datasets/yelp_reviews_updated.txt', index=False, sep=' ', header=False, quoting=csv.QUOTE_NONE, quotechar="", escapechar=" ")

yelp_reviews.head()
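
As a rough illustration of the file layout produced above, the sketch below builds fastText training lines by hand in plain Python. The helper name `to_fasttext_line` and the sample reviews are made up for this example; only the `__label__` prefix and the one-example-per-line layout come from the cell above.

```python
def to_fasttext_line(label, text):
    """Return one training line: '__label__<label> <text>' on a single line."""
    # Newlines and tabs would break one review into several examples,
    # so they are replaced by spaces, as in the pandas cell above.
    clean = text.replace("\n", " ").replace("\t", " ")
    return "__label__" + label + " " + clean


# Hypothetical reviews, for illustration only.
samples = [
    ("positive", "Great food\nand friendly staff"),
    ("negative", "Cold fries,\tslow service"),
]
lines = [to_fasttext_line(label, text) for label, text in samples]
for line in lines:
    print(line)
```

Each printed line starts with the label and carries the whole review after it, which is the shape `fasttext.train_supervised` reads from its input file.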

# windows vs linux command

In [30]:
import fasttext

The rest of this notebook uses the cooking Stack Exchange dataset: https://dl.fbaipublicfiles.com/fasttext/data/cooking.stackexchange.tar.gz

In [None]:
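# Run these commands in a terminal, not in Python.
# head is a Unix tool and more +N is the Windows way to skip the first N lines;
# with Unix tools only, use: tail -n 3000 cooking.stackexchange.txt > cooking.valid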
head cooking.stackexchange.txt

head -n 12404 cooking.stackexchange.txt > cooking.train
more +12404 cooking.stackexchange.txt > cooking.valid
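
Because the shell commands above differ between Windows and Linux, the same split can also be sketched in plain Python. The function name `split_file` is an assumption made for this example, and the demo runs on a throwaway five-line file so that it works without the real dataset.

```python
import os
import tempfile


def split_file(src, train_path, valid_path, n_train):
    """Write the first n_train lines of src to train_path and the rest to valid_path."""
    with open(src, encoding="utf-8") as f:
        lines = f.readlines()
    with open(train_path, "w", encoding="utf-8") as f:
        f.writelines(lines[:n_train])
    with open(valid_path, "w", encoding="utf-8") as f:
        f.writelines(lines[n_train:])
    return len(lines[:n_train]), len(lines[n_train:])


# Demo on a small temporary file. For the real data the call would be:
# split_file("cooking.stackexchange.txt", "cooking.train", "cooking.valid", 12404)
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "toy.txt")
    with open(src, "w", encoding="utf-8") as f:
        f.write("".join("__label__demo example %d\n" % i for i in range(5)))
    counts = split_file(
        src, os.path.join(tmp, "toy.train"), os.path.join(tmp, "toy.valid"), 3
    )
    print(counts)
```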

# Our first classifier

## train our first classifier

In [50]:
model = fasttext.train_supervised(input="cooking.train")

## save model

In [36]:
model.save_model("model_cooking.bin")

## testing model

In [52]:
model.predict("Which baking dish is best to bake a banana bread ?")

(('__label__baking',), array([0.05942686]))

In [55]:
model.predict("Why not put knives in the dishwasher?")

(('__label__food-safety',), array([0.06431008]))

In [57]:
model.test("cooking.valid")

(3000, 0.143, 0.06184229494017587)

The output is the number of samples (here 3000), the precision at one (0.143) and the recall at one (0.0618).
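
To make these two numbers concrete, here is a minimal sketch of precision and recall at k computed by hand. It assumes the usual definitions (correct predictions divided by the number of predictions, and correct predictions divided by the number of true labels); the function name and the toy data are made up for illustration and are not taken from the fastText source.

```python
def precision_recall_at_k(predictions, gold_labels, k):
    """predictions: ranked label lists; gold_labels: sets of true labels."""
    n_correct = 0
    n_predicted = 0
    n_gold = 0
    for ranked, gold in zip(predictions, gold_labels):
        top_k = ranked[:k]  # keep only the k best-ranked labels
        n_correct += sum(1 for label in top_k if label in gold)
        n_predicted += len(top_k)
        n_gold += len(gold)
    return n_correct / n_predicted, n_correct / n_gold


# Hypothetical ranked predictions and true labels for two questions.
predictions = [
    ["baking", "bread", "equipment"],
    ["food-safety", "equipment", "knives"],
]
gold_labels = [{"baking"}, {"knives"}]

p1, r1 = precision_recall_at_k(predictions, gold_labels, k=1)
p3, r3 = precision_recall_at_k(predictions, gold_labels, k=3)
print(p1, r1)
print(p3, r3)
```

In this toy case, raising k from 1 to 3 lowers precision and raises recall, because more labels are predicted per example.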

In [58]:
model.test("cooking.valid", k=5)

(3000, 0.066, 0.14271298832348278)

In [59]:
model.predict("Why not put knives in the dishwasher?", k=5)

(('__label__food-safety',
  '__label__baking',
  '__label__substitutions',
  '__label__equipment',
  '__label__bread'),
 array([0.06431008, 0.06271435, 0.03756925, 0.03471329, 0.03450327]))

# Making the model better

## preprocessing the data

In [None]:
# Run these commands in a terminal with Unix tools (cat, sed, tr, head, tail), not in Python

cat cooking.stackexchange.txt | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > cooking.preprocessed.txt
head -n 12404 cooking.preprocessed.txt > cooking.train
tail -n 3000 cooking.preprocessed.txt > cooking.valid
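
For readers without `sed` and `tr`, the normalisation step can be approximated in Python. This is a sketch rather than an exact port: the function name `preprocess` is made up, the character class is copied from the `sed` expression above without the backslash, and Python's `lower()` handles non-ASCII letters differently from `tr`.

```python
import re


def preprocess(line):
    """Put spaces around selected punctuation, then lowercase the line."""
    # Same characters as in the sed expression: . ! ? , ' / ( )
    spaced = re.sub(r"([.!?,'/()])", r" \1 ", line)
    return spaced.lower()


example = "It's (very) hot. Why not?"
tokens = preprocess(example).split()
print(tokens)
```

After this step, punctuation marks become separate tokens and `Why` and `why` map to the same word, which is the purpose of the preprocessing.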

In [62]:
model = fasttext.train_supervised(input="cooking.train",verbose=True)

In [61]:
model.test("cooking.valid")

(3000, 0.161, 0.06962663975782038)

## more epochs and larger learning rate

In [63]:
model = fasttext.train_supervised(input="cooking.train", lr=1.0, epoch=25)

In [64]:
model.test("cooking.valid")

(3000, 0.585, 0.25299120657344676)

## word n-grams

We can improve the performance of a model by using word bigrams instead of just unigrams. This is done with the `wordNgrams` argument (the `-wordNgrams` option on the command line; the standard range is [1 - 5]).

In [65]:
model = fasttext.train_supervised(input="cooking.train", lr=1.0, epoch=25, wordNgrams=2)

In [66]:
model.test("cooking.valid")

(3000, 0.613, 0.26510018740089375)

A 'unigram' is a single indivisible unit, or token, usually used as an input to a model. For example, a unigram can be a word or a letter depending on the model. In fastText, we work at the word level and thus unigrams are words.

A 'bigram' is the concatenation of 2 consecutive tokens or words. Similarly, we often talk about n-grams to refer to the concatenation of any n consecutive tokens.

For example, in the sentence 'Last donut of the night', the unigrams are 'Last', 'donut', 'of', 'the' and 'night'. The bigrams are 'Last donut', 'donut of', 'of the' and 'the night'.

Bigrams are particularly interesting because, for most sentences, you can reconstruct the order of the words just by looking at a bag of n-grams.
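
The donut example can be reproduced with a few lines of Python. The helper `word_ngrams` is a made-up name for this sketch; it only illustrates what a word n-gram is and says nothing about how fastText stores n-grams internally.

```python
def word_ngrams(sentence, n):
    """Return the list of n consecutive-word groups in the sentence."""
    tokens = sentence.split()
    # Slide a window of n tokens over the sentence.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "Last donut of the night"
unigrams = word_ngrams(sentence, 1)
bigrams = word_ngrams(sentence, 2)
print(unigrams)
print(bigrams)
```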

## Scaling things up

Since we are training our model on a few thousand examples, the training only takes a few seconds. But training models on larger datasets, with more labels, can start to be too slow. A potential solution to make the training faster is to use the hierarchical softmax instead of the regular softmax.

In [67]:
model = fasttext.train_supervised(input="cooking.train", lr=1.0, epoch=25, wordNgrams=2, bucket=200000, dim=50, loss='hs')

The hierarchical softmax is a loss function that approximates the softmax with a much faster computation.

The idea is to build a binary tree whose leaves correspond to the labels. Each intermediate node has a binary decision activation (e.g. sigmoid) that is trained, and predicts if we should go to the left or to the right. The probability of an output unit is then given by the product of the probabilities of the intermediate nodes along the path from the root to that output leaf.

In fastText, we use a Huffman tree, so that the lookup time is faster for more frequent outputs and thus the average lookup time for the output is optimal.
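
The two ideas above (a Huffman tree over the labels, and a leaf probability obtained as a product of left/right decisions) can be sketched in plain Python. Everything here is illustrative: the label counts and node scores are invented, the function names are assumptions, and in a trained model each node score would come from the learned parameters rather than a fixed table.

```python
import heapq
import itertools
import math


def huffman_paths(label_counts):
    """Return {label: [(node_id, go_right), ...]} for the root-to-leaf path."""
    tie = itertools.count()  # tie-breaker so heapq never compares the dicts
    heap = [(count, next(tie), {label: []}) for label, count in label_counts.items()]
    heapq.heapify(heap)
    node_id = 0
    while len(heap) > 1:
        # Merge the two least frequent subtrees under a new internal node.
        count_l, _, left = heapq.heappop(heap)
        count_r, _, right = heapq.heappop(heap)
        merged = {}
        for label, path in left.items():
            merged[label] = [(node_id, 0)] + path
        for label, path in right.items():
            merged[label] = [(node_id, 1)] + path
        heapq.heappush(heap, (count_l + count_r, next(tie), merged))
        node_id += 1
    return heap[0][2]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def label_probability(path, node_scores):
    """Multiply the left/right probabilities along the path from root to leaf."""
    prob = 1.0
    for node, go_right in path:
        p_right = sigmoid(node_scores[node])
        prob *= p_right if go_right else 1.0 - p_right
    return prob


# Hypothetical label frequencies and per-node scores, for illustration only.
counts = {"baking": 50, "equipment": 20, "bread": 20, "knives": 10}
paths = huffman_paths(counts)
node_scores = {0: 0.3, 1: -1.2, 2: 0.8}
probs = {label: label_probability(path, node_scores) for label, path in paths.items()}

for label in counts:
    print(label, len(paths[label]), round(probs[label], 4))
```

In this toy tree the most frequent label sits one decision away from the root while the rarest labels need three, and the leaf probabilities still sum to 1 without ever computing a full softmax over all labels.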

# Multi-label classification

When we want to assign a document to multiple labels, we can still use the softmax loss and play with the parameters for prediction, namely the number of labels to predict and the threshold for the predicted probability. However, playing with these arguments can be tricky and unintuitive since the probabilities must sum to 1.

A convenient way to handle multiple labels is to use independent binary classifiers for each label. This can be done with -loss one-vs-all or -loss ova on the command line, or `loss='ova'` in Python.

In [74]:
model = fasttext.train_supervised(input="cooking.train", lr=0.5, epoch=25, wordNgrams=3, bucket=200000, dim=50, loss='ova')

It is a good idea to decrease the learning rate compared to other loss functions.

Now let's have a look at our predictions. We want as many predictions as possible (`k=-1`) and only labels with a probability higher than or equal to 0.5:

In [75]:
model.predict("Which baking dish is best to bake a banana bread ?", k=-1, threshold=0.5)

(('__label__baking',
  '__label__bread',
  '__label__bananas',
  '__label__equipment'),
 array([1.00001001, 0.9994216 , 0.89331943, 0.8519628 ]))
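
To see why a threshold is easier to use with independent classifiers than with a softmax, the sketch below applies both to the same raw scores. The scores and function names are invented for illustration; this is not the fastText implementation, only the arithmetic behind the two kinds of probabilities.

```python
import math


def softmax(scores):
    """Probabilities that compete with each other and sum to 1."""
    top = max(scores.values())
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def independent_sigmoids(scores):
    """One yes/no probability per label, as in a one-vs-all setup."""
    return {label: 1.0 / (1.0 + math.exp(-s)) for label, s in scores.items()}


def select_labels(probs, threshold):
    """Keep labels whose probability reaches the threshold, best first."""
    kept = [(label, p) for label, p in probs.items() if p >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical raw scores for one question.
raw = {"baking": 4.0, "bread": 3.5, "bananas": 2.0, "knives": -3.0}
soft = softmax(raw)
ova = independent_sigmoids(raw)

soft_labels = [label for label, _ in select_labels(soft, 0.5)]
ova_labels = [label for label, _ in select_labels(ova, 0.5)]
print(soft_labels)
print(ova_labels)
```

With the softmax, the three plausible labels share the probability mass, so only one of them clears 0.5. With independent sigmoids, each label is judged on its own and all three pass the same threshold.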

In [76]:
model.test("cooking.valid", k=-1)

(3000, 0.003146031746031746, 1.0)