# DBPedia Classification with fastText

### "fastText: Faster, better text classification!", a research project from the Facebook AI Research (FAIR) lab.

fastText, as the name suggests, is built for fast text classification. To get better results, it uses character n-grams together with several other techniques.
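
To make the character n-gram idea concrete, here is a minimal sketch of how a single word can be broken into character n-grams. The function name `char_ngrams` and the chosen n-gram range are illustrative assumptions, as are the `<` and `>` word-boundary markers, which follow the convention described for fastText's subword features. This is not the library's internal code.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, wrapped in boundary markers."""
    wrapped = "<" + word + ">"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the wrapped word
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams

print(char_ngrams("text", min_n=3, max_n=4))
# ['<te', 'tex', 'ext', 'xt>', '<tex', 'text', 'ext>']
```

Because misspelled or rare words still share most of their n-grams with known words, features like these can help a classifier generalize beyond its exact vocabulary.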

The [paper]() gives a fairly detailed view of how it works.

Let's get hands-on with fastText using the DBPedia text classification dataset. It consists of text descriptions belonging to 14 different classes. The training set contains 560,000 samples and the test set contains 70,000.

Download this dataset from [here](https://drive.google.com/drive/folders/0Bz8a_Dbh9Qhbfll6bVpmNUtUcFdjYmF2SEpmZUZUcVNiMUw1TWN6RDV3a0JHT3kxLVhVR2M). 


In [1]:
# Importing libraries
import os
import sys

# For loading data and doing some exploration
import pandas as pd

# The default import
import numpy as np

In [2]:
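# Set path for loading data, saving processed data and saving model
# (expand '~' explicitly, since fastText does not expand it when opening files)
data_path = os.path.expanduser('~/data/dbpedia_csv/')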

In [3]:
# Loading train data
train_file = data_path + 'train.csv'
df = pd.read_csv(train_file, header=None, names=['class','name','description'])

# Loading test data
test_file = data_path + 'test.csv'
df_test = pd.read_csv(test_file, header=None, names=['class','name','description'])

# Data with us
print("Train:{} Test:{}".format(df.shape,df_test.shape))

Train:(560000, 3) Test:(70000, 3)


In [4]:
# The CSV files only contain class numbers, so let's build
# a mapping from class number to class name
class_dict={
            1:'Company',
            2:'EducationalInstitution',
            3:'Artist',
            4:'Athlete',
            5:'OfficeHolder',
            6:'MeanOfTransportation',
            7:'Building',
            8:'NaturalPlace',
            9:'Village',
            10:'Animal',
            11:'Plant',
            12:'Album',
            13:'Film',
            14:'WrittenWork'
        }

# Mapping the classes
df['class_name'] = df['class'].map(class_dict)
df.head()

Unnamed: 0,class,name,description,class_name
0,1,E. D. Abbott Ltd,Abbott of Farnham E D Abbott Limited was a Br...,Company
1,1,Schwan-Stabilo,Schwan-STABILO is a German maker of pens for ...,Company
2,1,Q-workshop,Q-workshop is a Polish company located in Poz...,Company
3,1,Marvell Software Solutions Israel,Marvell Software Solutions Israel known as RA...,Company
4,1,Bergan Mercy Medical Center,Bergan Mercy Medical Center is a hospital loc...,Company


In [5]:
df.tail()

Unnamed: 0,class,name,description,class_name
559995,14,Barking in Essex,Barking in Essex is a Black comedy play direc...,WrittenWork
559996,14,Science & Spirit,Science & Spirit is a discontinued American b...,WrittenWork
559997,14,The Blithedale Romance,The Blithedale Romance (1852) is Nathaniel Ha...,WrittenWork
559998,14,Razadarit Ayedawbon,Razadarit Ayedawbon (Burmese: ရာဇာဓိရာဇ် အရေး...,WrittenWork
559999,14,The Vinyl Cafe Notebooks,Vinyl Cafe Notebooks: a collection of essays ...,WrittenWork


In [6]:
# What is the group behaviour
desc = df.groupby('class')
desc.describe()

class,statistic,class_name,description,name
1,count,40000,40000,40000
1,unique,1,39996,40000
1,top,Company,DTOX is a mobile recovery smartphone app that...,ManhattanGMAT
1,freq,40000,2,1
2,count,40000,40000,40000
2,unique,1,39992,40000
2,top,EducationalInstitution,Allameh Mohaddes Nouri University is one of t...,Laguna BelAir School
2,freq,40000,2,1
3,count,40000,40000,40000
3,unique,1,40000,40000


In [7]:
# Let's do some cleaning
import unicodedata

def clean_it(text, normalize=True):
    # Replace characters that cause problems; extend or shorten this chain as needed
    s = str(text).replace(',',' ').replace('"','').replace('\'',' \' ').replace('.',' . ').replace('(',' ( ').\
            replace(')',' ) ').replace('!',' ! ').replace('?',' ? ').replace(':',' ').replace(';',' ').lower()
    
    # Normalize to NFKD and drop any characters that cannot be encoded as ASCII
    if normalize:
        s = unicodedata.normalize('NFKD', s).encode('ascii','ignore').decode('utf-8')
    
    return s

# Now let's define a small function that applies the cleaning above to a dataset
def clean_df(data, cleanit=False, shuffleit=False, encodeit=False, label_prefix='__class__'):
    # Defining the new data
    df = data[['name','description']].copy(deep=True)
    df['class'] = label_prefix + data['class'].astype(str) + ' '
    
    # cleaning it
    if cleanit:
        df['name'] = df['name'].apply(lambda x: clean_it(x,encodeit))
        df['description'] = df['description'].apply(lambda x: clean_it(x,encodeit))
    
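    # shuffling it (sample returns a new frame, so the result must be assigned;
    # the original index is kept so rows can still be looked up by label)
    if shuffleit:
        df = df.sample(frac=1)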
        
    # pad fields with spaces so fastText sees clean token boundaries
    df['name'] = ' ' + df['name'] + ' '
    df['description'] = ' ' + df['description'] + ' '
        
    return df
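
The chain of `replace` calls in `clean_it` can also be expressed as a single translation table, which keeps the character-to-replacement mapping in one place. The sketch below is an equivalent formulation under that assumption; the names `PUNCT_TABLE` and `clean_text` are hypothetical and not part of the notebook.

```python
# Map each punctuation character to its replacement in a single pass.
PUNCT_TABLE = str.maketrans({
    ",": " ",
    '"': "",
    "'": " ' ",
    ".": " . ",
    "(": " ( ",
    ")": " ) ",
    "!": " ! ",
    "?": " ? ",
    ":": " ",
    ";": " ",
})

def clean_text(text):
    """Apply the punctuation mapping, then lowercase the result."""
    return str(text).translate(PUNCT_TABLE).lower()

print(clean_text("E. D. Abbott Ltd"))
```

Padding punctuation with spaces turns each mark into its own whitespace-separated token, which matters because fastText splits its input on whitespace.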



In [8]:
%%time
# Transform datasets
df_train = clean_df(df, True, True)
df_test_cleaned = clean_df(df_test, True, False)

CPU times: user 6.02 s, sys: 372 ms, total: 6.39 s
Wall time: 6.4 s


In [9]:
df_train.head()

Unnamed: 0,name,description,class
0,e . d . abbott ltd,abbott of farnham e d abbott limited was a b...,__class__1
1,schwan-stabilo,schwan-stabilo is a german maker of pens for...,__class__1
2,q-workshop,q-workshop is a polish company located in po...,__class__1
3,marvell software solutions israel,marvell software solutions israel known as r...,__class__1
4,bergan mercy medical center,bergan mercy medical center is a hospital lo...,__class__1


In [10]:
df_train.tail()

Unnamed: 0,name,description,class
559995,barking in essex,barking in essex is a black comedy play dire...,__class__14
559996,science & spirit,science & spirit is a discontinued american ...,__class__14
559997,the blithedale romance,the blithedale romance ( 1852 ) is nathani...,__class__14
559998,razadarit ayedawbon,razadarit ayedawbon ( burmese ရာဇာဓိရာဇ် အ...,__class__14
559999,the vinyl cafe notebooks,vinyl cafe notebooks a collection of essays...,__class__14


In [11]:
df['description'][661]

' İzmir Banliyö Anonym Şirketi or İZBAN A.Ş. is the holding company of İZBAN. It was created in 2006 to operate a commuter railroad around İzmir. İZBAN A.Ş. is owned 50% by the Turkish State Railways and 50% by the İzmir Municipality.'

In [12]:
df_train['description'][661]

'  i̇zmir banliyö anonym şirketi or i̇zban a . ş .  is the holding company of i̇zban .  it was created in 2006 to operate a commuter railroad around i̇zmir .  i̇zban a . ş .  is owned 50% by the turkish state railways and 50% by the i̇zmir municipality .  '
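
The cleaned description still contains non-ASCII characters because `clean_df` was called with `encodeit` left at `False`. As a minimal sketch of what ASCII folding would do to such text, the example below uses only the standard library; the helper name `to_ascii` is hypothetical.

```python
import unicodedata

def to_ascii(text):
    """Decompose accented characters, then drop everything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("İzmir Banliyö Şirketi"))
# Izmir Banliyo Sirketi
```

Characters that have no ASCII decomposition, such as the Burmese script in row 559998, are removed entirely by this approach, so it trades information for a smaller vocabulary.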

### fastText is written in C++ and designed for command-line use, so its Python API reads the training data from a file on disk. We therefore need to save the data and keep its path to pass to the fastText model.

In [15]:
# Write files to disk
train_file = data_path + 'dbpedia_train.csv'
df_train.to_csv(train_file, header=None, index=False, columns=['class','name','description'] )

test_file = data_path + 'dbpedia_test.csv'
df_test_cleaned.to_csv(test_file, header=None, index=False, columns=['class','name','description'] )

# A small helper to print the evaluation results
def print_results(N, p, r):
    print("N\t" + str(N))
    print("Precision {}\t{:.3f}".format(1, p))
    print("Recall    {}\t{:.3f}".format(1, r))
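
fastText's supervised mode expects one example per line, with the label written as a whitespace-separated token that carries a prefix, followed by the text. The sketch below builds such a line by hand and splits it back apart. The helper names are hypothetical, and the `__class__` prefix mirrors `label_prefix` in `clean_df`. fastText's own default prefix is `__label__`, so a custom prefix has to be passed to the trainer through its `label` argument.

```python
LABEL_PREFIX = "__class__"  # assumed to match label_prefix in clean_df

def to_fasttext_line(label, name, description, prefix=LABEL_PREFIX):
    """Format one example as '<prefix><label> , <name> , <description>'."""
    return "{}{} , {} , {}".format(prefix, label, name.strip(), description.strip())

def split_labels(line, prefix=LABEL_PREFIX):
    """Separate label tokens from text tokens, as a whitespace tokenizer would."""
    tokens = line.split()
    labels = [t for t in tokens if t.startswith(prefix)]
    words = [t for t in tokens if not t.startswith(prefix)]
    return labels, words

line = to_fasttext_line(1, "q-workshop", "q-workshop is a polish company")
print(line)
print(split_labels(line))
```

Note that the comma separators left behind by `to_csv` end up as ordinary tokens in the text, which is harmless here but adds a little noise to the vocabulary.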

In [2]:
# The library under exploration
from fasttext import train_supervised

ImportError: /home/jitins_lab/anaconda2/envs/nlpstack/lib/python3.4/site-packages/fasttext/fasttext.cpython-34m.so: undefined symbol: _ZTVNSt7__cxx1115basic_stringbufIcSt11char_traitsIcESaIcEEE

### Building a basic model with fastText

In [None]:
%%time
# Train a classifier; label must match the prefix written by clean_df
model = train_supervised(
    input=train_file, epoch=25, lr=1.0, wordNgrams=2, verbose=2, minCount=1,
    label='__class__'
)

# Evaluating results
print_results(*model.test(test_file))

# Saving model
model.save_model(data_path +"basic_model")
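
`model.test` returns the number of examples together with precision and recall at k (k = 1 by default), which is what `print_results` prints. The sketch below computes the same two quantities by hand for a toy set of predictions; the function name and the data are hypothetical. When every example has exactly one true label, as in DBPedia, precision at 1 and recall at 1 are both equal to accuracy.

```python
def precision_recall_at_k(true_labels, predicted_labels):
    """Compute micro-averaged precision and recall.

    true_labels: list of sets of gold labels, one set per example.
    predicted_labels: list of lists with the top-k predictions per example.
    """
    hits = sum(len(set(pred) & truth)
               for truth, pred in zip(true_labels, predicted_labels))
    n_predicted = sum(len(pred) for pred in predicted_labels)
    n_true = sum(len(truth) for truth in true_labels)
    return hits / n_predicted, hits / n_true

truth = [{"Company"}, {"Film"}, {"Album"}, {"Plant"}]
preds = [["Company"], ["Film"], ["Film"], ["Plant"]]
p, r = precision_recall_at_k(truth, preds)
print("P@1 {:.3f}  R@1 {:.3f}".format(p, r))
# P@1 0.750  R@1 0.750
```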
                 

### Quantizing the model with a feature cutoff and retraining

In [None]:
%%time
# Quantize the classifier, retraining after pruning to a 100,000-feature cutoff
model.quantize(input=train_file, qnorm=True, retrain=True, cutoff=100000)

# Evaluating
print_results(*model.test(test_file))

# again saving retrained model
model.save_model(data_path +"basic_model_quantized")
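
`quantize` shrinks the model: the `cutoff` argument limits how many words and n-grams are kept, and `retrain=True` fine-tunes the remaining embeddings afterwards. As a rough, simplified illustration of pruning by cutoff, the sketch below keeps the rows of a toy embedding table that have the largest L2 norm. The function name and the data are hypothetical, and fastText's actual feature selection and product quantization are more involved than this.

```python
import math

def prune_by_norm(embeddings, cutoff):
    """Keep the `cutoff` features whose embedding has the largest L2 norm.

    embeddings: dict mapping feature name -> list of floats.
    """
    def norm(vec):
        return math.sqrt(sum(x * x for x in vec))

    ranked = sorted(embeddings, key=lambda name: norm(embeddings[name]), reverse=True)
    return {name: embeddings[name] for name in ranked[:cutoff]}

toy = {
    "company": [3.0, 4.0],  # norm 5.0
    "the": [0.1, 0.1],      # norm about 0.14
    "film": [0.0, 2.0],     # norm 2.0
    "of": [0.2, 0.0],       # norm 0.2
}
print(sorted(prune_by_norm(toy, cutoff=2)))
# ['company', 'film']
```

The intuition is that features with small embeddings contribute little to the classifier's score, so dropping them reduces model size at a limited cost in accuracy.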