# Text classification

Learning to process and understand text is one of the first steps toward
getting meaningful insights from textual data. Understanding how language is
structured and recognizing specific syntax patterns is important, but that alone is of
limited use to businesses and organizations that want to derive useful patterns
and insights from their vast volumes of text data.  

One of the most relevant and challenging problems is text classification or
categorization, which involves trying to organize text documents into various categories
based on inherent properties or attributes of each text document. This is used in
various domains, including email spam identification and news categorization. The
concept may seem simple, and if you have a small number of documents, you can look
at each document and gain some idea about what it is trying to indicate. Based on
this knowledge, you can group similar documents into categories or classes. It’s more
challenging when the number of text documents to be classified increases to several
hundred thousand or millions. This is where techniques like feature extraction and
supervised or unsupervised ML come in handy. Document classification is a generic
problem that is not limited to text; it can also be extended to other items like music,
images, video, and other media.  

To formalize the problem, we are given a set of classes or
categories and several text documents. Remember that documents are basically sentences
or paragraphs of text; together they form a corpus. Our task is to determine which class
or classes each document belongs to.  

![](http://www.kdnuggets.com/wp-content/uploads/text-analysis-acme2.jpg)

## What Is Text Classification?

Before we define text classification, we need to understand the scope of textual data and
what we really mean by classification. The textual data involved here can be anything
from a phrase or a sentence to a complete document with paragraphs of text, and it
can be obtained from corpora, blogs, or anywhere on the Web. Text classification is
also often called document classification just to cover all forms of textual content under
the word document. The word document could be defined as some form of concrete
representation of thoughts or events that could be in the form of writing, recorded
speech, drawings, or presentations. I use the term document here to represent textual
data such as sentences or paragraphs belonging to the English language.  

Text or document classification is the process of assigning text documents to one
or more classes or categories, assuming that we have a predefined set of classes.
Documents here are textual documents, and each document can contain a sentence or
even a paragraph of words. A successful text classification system assigns each
document to its correct class(es) based on inherent properties of the
document.  

There are a few types of text classification based on the number of classes to predict
and the nature of predictions. These types of classification are based on the dataset, the
number of classes/categories pertaining to that dataset, and the number of classes that
can be predicted on any data point:  

- **Binary classification** is when there are exactly two distinct classes
or categories and each prediction is one of those
two classes.  
- **Multi-class classification**, also known as multinomial
classification, refers to a problem where the total number of
classes is more than two and each prediction is exactly one
of those classes. This is an
extension of the binary classification problem to more
than two classes.  
- **Multi-label classification** refers to problems where each prediction
can yield more than one outcome/predicted class for any data
point.  
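
To make the distinction concrete, here is a minimal sketch that infers the task type from the labels attached to each document. The helper name `classification_type` and the toy label sets are assumptions made for illustration; they do not come from any library.

```python
def classification_type(label_sets):
    """Infer the task type from one set of labels per document.

    label_sets: list of sets, each holding the label(s) of one document.
    Hypothetical helper, for illustration only.
    """
    # Multi-label: at least one document carries more than one label.
    if any(len(labels) > 1 for labels in label_sets):
        return "multi-label"
    # Otherwise count the distinct classes seen across the whole dataset.
    n_classes = len(set().union(*label_sets))
    if n_classes < 2:
        raise ValueError("need at least two distinct classes")
    return "binary" if n_classes == 2 else "multi-class"


spam_task = [{"spam"}, {"ham"}, {"spam"}]             # two classes
news_task = [{"sports"}, {"politics"}, {"tech"}]      # more than two classes
tag_task = [{"python", "nlp"}, {"nlp"}, {"pandas"}]   # several labels per item

for task in (spam_task, news_task, tag_task):
    print(classification_type(task))
```

The DBPedia task used later in this notebook falls in the multi-class category: 14 classes, one label per document.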

In this notebook I would like to highlight a great example. In the summer of 2016, two interesting NLP papers were published by Facebook Research, [Bojanowski et al., 2016](https://arxiv.org/abs/1607.04606) and [Joulin et al., 2016](https://arxiv.org/abs/1607.01759). The first one proposed a new method for word embedding and the second one a method for text classification. The authors also open-sourced a C++ library with the implementation of these methods, [fastText](https://github.com/facebookresearch/fastText), which rapidly attracted a lot of interest.  

In this notebook we will discuss how to implement a text classification project using a Python wrapper of fastText, [fastText.py](https://github.com/salestock/fastText.py).

In [2]:
%pylab inline
%xmode plain

import os,sys  
import pandas as pd
import numpy as np
import fasttext
from pandas import DataFrame, Series

from urllib.request import urlopen 
from html import unescape

print(sys.version)

Populating the interactive namespace from numpy and matplotlib
Exception reporting mode: Plain
3.6.1 |Anaconda custom (64-bit)| (default, May 11 2017, 13:09:58) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]


## Text classification
The first task will be to perform text classification on the DBPedia dataset, which can be accessed [here](https://drive.google.com/drive/folders/0Bz8a_Dbh9Qhbfll6bVpmNUtUcFdjYmF2SEpmZUZUcVNiMUw1TWN6RDV3a0JHT3kxLVhVR2M). The dataset consists of text descriptions belonging to 14 different classes. The training set contains 560,000 examples and the test set contains 70,000. 

In [2]:
#set dataset path
#dbpedia_csv.tar.gz needs to be downloaded
data_path = ''

train_file = data_path + 'dbpedia_train.csv'
df = pd.read_csv(train_file, header=None, names=['class','name','description'])

test_file = data_path + 'dbpedia_test.csv'
df_test = pd.read_csv(test_file, header=None, names=['class','name','description'])

#Mapping from class number to class name
class_dict={
1:'Company',
2:'EducationalInstitution',
3:'Artist',
4:'Athlete',
5:'OfficeHolder',
6:'MeanOfTransportation',
7:'Building',
8:'NaturalPlace',
9:'Village',
10:'Animal',
11:'Plant',
12:'Album',
13:'Film',
14:'WrittenWork'
}
df['class_name'] = df['class'].map(class_dict)
df.head()

| | class | name | description | class_name |
|---|---|---|---|---|
| 0 | 1 | E. D. Abbott Ltd | Abbott of Farnham E D Abbott Limited was a Br... | Company |
| 1 | 1 | Schwan-Stabilo | Schwan-STABILO is a German maker of pens for ... | Company |
| 2 | 1 | Q-workshop | Q-workshop is a Polish company located in Poz... | Company |
| 3 | 1 | Marvell Software Solutions Israel | Marvell Software Solutions Israel known as RA... | Company |
| 4 | 1 | Bergan Mercy Medical Center | Bergan Mercy Medical Center is a hospital loc... | Company |


In [3]:
desc = df.groupby('class')
desc.describe().transpose()

class,1,1,1,1,2,2,2,2,3,3,...,12,12,13,13,13,13,14,14,14,14
Unnamed: 0_level_1,count,unique,top,freq,count,unique,top,freq,count,unique,...,top,freq,count,unique,top,freq,count,unique,top,freq
class_name,40000,1,Company,40000,40000,1,EducationalInstitution,40000,40000,1,...,Album,40000,40000,1,Film,40000,40000,1,WrittenWork,40000
description,40000,39996,MegaPath Corporation—headquartered in Pleasan...,2,40000,39992,Akuressa Training Center of National Youth Se...,2,40000,40000,...,Before Smile Empty Soul became Smile Empty So...,2,40000,40000,Koryo Celadon is a 1979 American short docume...,1,40000,39984,Tom Clancy's Net Force Explorers or Net Force...,15
name,40000,40000,Tsokkos,1,40000,40000,Christ's School,1,40000,40000,...,Indispensable: The Best of Michael Franks,1,40000,40000,Netaji Subhas Chandra Bose: The Forgotten Hero,1,40000,40000,A Star Called Henry,1


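The next step is to preprocess the data. We have to create an intermediate file in which each line starts with the label, marked with the `__label__` prefix, followed by the text. The cleaning function below replaces commas inside the text fields with spaces and can optionally shuffle the rows, strip non-ASCII characters, pad punctuation with spaces, and lowercase everything. The changes are based on [this script](https://github.com/facebookresearch/fastText/blob/a88344f6de234bdefd003e9e55512eceedde3ec0/classification-example.sh#L17).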
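
As a rough illustration of the target format, the sketch below builds a single training line from one record using only the standard library. The helper `to_fasttext_line` is hypothetical, and collapsing repeated whitespace is an extra step assumed here for tidiness; the notebook's own cleaning works on whole DataFrame columns instead.

```python
import unicodedata

LABEL_PREFIX = "__label__"


def to_fasttext_line(label, name, description, lowercase=True, ascii_only=True):
    """Build one line of the form '<prefix><label> <name> <description>'.

    Hypothetical single-record helper, for illustration only.
    """
    # Replace commas in the text with spaces.
    text = f"{name} {description}".replace(",", " ")
    if ascii_only:
        # Decompose accented characters, then drop anything non-ASCII.
        text = unicodedata.normalize("NFKD", text)
        text = text.encode("ascii", "ignore").decode("ascii")
    if lowercase:
        text = text.lower()
    # Collapse the runs of whitespace left behind by the replacements.
    text = " ".join(text.split())
    return f"{LABEL_PREFIX}{label} {text}"


print(to_fasttext_line(3, "Pablo Picasso", "Spanish painter, born in Málaga."))
# __label__3 pablo picasso spanish painter born in malaga.
```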

In [4]:
def clean_dataset(dataframe, shuffle=False, encode_ascii=False, clean_strings=False, label_prefix='__label__'):
    text_cols = ['name', 'description']
    # Replace commas in the text fields with spaces
    df = dataframe[text_cols].apply(lambda x: x.str.replace(',', ' ', regex=False))
    # Prefix every class number so the classifier can recognize it as a label
    df['class'] = label_prefix + dataframe['class'].astype(str) + ' '
    if clean_strings:
        # Literal (non-regex) replacements: drop double quotes, pad punctuation with spaces
        replacements = [('"', ''), ('\'', ' \' '), ('.', ' . '), ('(', ' ( '), (')', ' ) '),
                        ('!', ' ! '), ('?', ' ? '), (':', ' '), (';', ' ')]
        for old, new in replacements:
            df[text_cols] = df[text_cols].apply(lambda x: x.str.replace(old, new, regex=False))
        df[text_cols] = df[text_cols].apply(lambda x: x.str.lower())
    if shuffle:
        # sample() returns a new frame, so the result has to be assigned back
        df = df.sample(frac=1).reset_index(drop=True)
    if encode_ascii:
        # Decompose accented characters, then drop everything outside ASCII
        df[text_cols] = df[text_cols].apply(lambda x: x.str.normalize('NFKD').str.encode('ascii', 'ignore').str.decode('utf-8'))
    df['name'] = ' ' + df['name'] + ' '
    df['description'] = ' ' + df['description'] + ' '
    return df

In [5]:
%%time
# Transform datasets
df_train_clean = clean_dataset(df, True, False)
df_test_clean = clean_dataset(df_test, False, False)

# Write files to disk
train_file_clean = data_path + 'dbpedia.train'
df_train_clean.to_csv(train_file_clean, header=None, index=False, columns=['class','name','description'] )

test_file_clean = data_path + 'dbpedia.test'
df_test_clean.to_csv(test_file_clean, header=None, index=False, columns=['class','name','description'] )

CPU times: user 9.75 s, sys: 744 ms, total: 10.5 s
Wall time: 10.7 s


Once the dataset is cleaned, the next step is to train the classifier. 

In [6]:
%%time
# Train a classifier
output_file = data_path + 'dp_model'
classifier = fasttext.supervised(train_file_clean, output_file, label_prefix='__label__')

CPU times: user 1min 20s, sys: 1.3 s, total: 1min 21s
Wall time: 11.8 s


Once the model is trained, we can test its accuracy. We can obtain the [precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall) of the model. High precision means that an algorithm returned substantially more relevant results than irrelevant ones, while high recall means that an algorithm returned most of the relevant results.

In [7]:
%%time
# Evaluate classifier
result = classifier.test(test_file_clean)
print('P@1:', result.precision)
print('R@1:', result.recall)
print('Number of examples:', result.nexamples)

P@1: 0.9797857142857143
R@1: 0.9797857142857143
Number of examples: 70000
CPU times: user 564 ms, sys: 0 ns, total: 564 ms
Wall time: 562 ms
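
In the output above, P@1 and R@1 are identical. That is expected whenever every example has exactly one true label and the model returns exactly one prediction, because the number of predictions then equals the number of true labels. The sketch below uses the usual micro-averaged definitions to show this on toy data; it is an illustrative helper, not the fastText implementation, and the label numbers are made up.

```python
def precision_recall_at_k(true_labels, predicted_labels, k=1):
    """Compute micro-averaged precision and recall at k.

    true_labels: list of sets of gold labels, one set per example.
    predicted_labels: list of ranked label lists, best guess first.
    """
    hits = n_predicted = n_true = 0
    for gold, ranked in zip(true_labels, predicted_labels):
        top_k = ranked[:k]
        hits += sum(1 for label in top_k if label in gold)
        n_predicted += len(top_k)   # denominator for precision
        n_true += len(gold)         # denominator for recall
    return hits / n_predicted, hits / n_true


# Toy data: four examples, one gold label each, two ranked guesses each.
gold = [{3}, {4}, {13}, {1}]
pred = [[3, 12], [4, 1], [12, 14], [1, 7]]

p1, r1 = precision_recall_at_k(gold, pred, k=1)
p2, r2 = precision_recall_at_k(gold, pred, k=2)
print(p1, r1)  # 0.75 0.75
print(p2, r2)  # 0.375 0.75
```

With `k=2` the two numbers diverge: the extra guesses can only add wrong predictions when each example has a single gold label, so precision drops while recall stays the same or rises.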


The next step is to check how the model works with real sentences.

In [8]:
sentence1 = ['Picasso was a famous painter born in Malaga, Spain. He revolutionized the art in the 20th century.']
labels1 = classifier.predict(sentence1)
class1 = int(labels1[0][0])
print("Sentence: ", sentence1[0])
print("Label: %d; label name: %s" %(class1, class_dict[class1]))

sentence2 = ['One of my favourite tennis players in the world is Rafa Nadal.']
labels2 = classifier.predict_proba(sentence2)
class2, prob2 = labels2[0][0] 
print("Sentence: ", sentence2[0])
print("Label: %s; label name: %s; certainty: %f" %(class2, class_dict[int(class2)], prob2))

# a line of dialogue from Pulp Fiction :-)
sentence3 = ['Say what one more time, I dare you, I double-dare you motherfucker!']
number_responses = 3
labels3 = classifier.predict_proba(sentence3, k=number_responses)
print("Sentence: ", sentence3[0])
for l in range(number_responses):
    class3, prob3 = labels3[0][l]
    print("Label: %s; label name: %s; certainty: %f" %(class3, class_dict[int(class3)], prob3))


Sentence:  Picasso was a famous painter born in Malaga, Spain. He revolutionized the art in the 20th century.
Label: 3; label name: Artist
Sentence:  One of my favourite tennis players in the world is Rafa Nadal.
Label: 4; label name: Athlete; certainty: 0.904297
Sentence:  Say what one more time, I dare you, I double-dare you motherfucker!
Label: 12; label name: Album; certainty: 0.287109
Label: 14; label name: WrittenWork; certainty: 0.246094
Label: 1; label name: Company; certainty: 0.240234


The model predicts the first sentence as `Artist`, which is correct. The second sentence is also predicted correctly. This time we used the function `predict_proba` that returns the certainty of the prediction as a probability. Finally, sentence 3 was not correctly classified. The correct label would be `Film`, since the sentence is from a famous scene of a very good film.
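
One way to act on these certainties is to refuse to answer when the top score is low. The sketch below applies a cut-off to the `(label, probability)` pairs printed above. The helper name `confident_label` and the 0.5 threshold are arbitrary assumptions made for illustration, and the pairs are typed in by hand rather than produced by a live model.

```python
# Subset of the class mapping used in this notebook.
CLASS_NAMES = {1: "Company", 4: "Athlete", 12: "Album", 13: "Film", 14: "WrittenWork"}


def confident_label(ranked, class_names, threshold=0.5):
    """Return the top class name, or None if its score is below threshold.

    ranked: list of (label, probability) pairs, best guess first,
    shaped like one row of the predict_proba output shown above.
    """
    label, prob = ranked[0]
    if prob < threshold:
        return None
    return class_names[int(label)]


# Scores copied from the notebook output above.
nadal = [("4", 0.904297)]
pulp_fiction = [("12", 0.287109), ("14", 0.246094), ("1", 0.240234)]

print(confident_label(nadal, CLASS_NAMES))         # Athlete
print(confident_label(pulp_fiction, CLASS_NAMES))  # None
```

With this cut-off the tennis sentence keeps its label, while the misclassified third sentence is flagged as uncertain instead of being reported as `Album`.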

## Closing words

In this notebook we have shown how to classify text.  

Text classification is indeed a powerful tool, and we have covered some of the most important aspects related to it in this notebook. We started with a look at the definition and scope of text classification. Next, we defined the automated text classification problem and looked at the various types of text classification. Finally, we implemented a text classifier on a real-world dataset.