In this notebook we demonstrate using the fastText library to perform text classification on the DBpedia data, which can be downloaded from [here](https://github.com/le-scientifique/torchDatasets/raw/master/dbpedia_csv.tar.gz). <br>fastText is a library for learning word embeddings and for text classification, created by Facebook's AI Research (FAIR) lab. It supports both unsupervised and supervised learning of vector representations for words. Facebook makes pretrained models available for 294 languages (source: [wiki](https://en.wikipedia.org/wiki/FastText)).<br>
**Note**: This notebook uses an older version of fasttext.

In [1]:
# To install only the requirements of this notebook, run this cell

# ===========================

!pip install pandas==1.5.3
!pip install gdown==4.6.6
!pip install fasttext==0.9.2

# ===========================



In [2]:
# To install the requirements for the entire chapter, uncomment the lines below and run this cell

# ===========================

# try:
#     import google.colab
#     !curl  https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch4/ch4-requirements.txt | xargs -n 1 -L 1 pip install
# except ModuleNotFoundError:
#     !pip install -r "ch4-requirements.txt"

# ===========================

In [3]:
#necessary imports
import os
import pandas as pd
import tarfile
import gdown

In [6]:
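from typing import Optional

def check_if_file_exists(filename: str, locations: list) -> Optional[str]:
    """Return the first location that contains filename, or None if none does."""
    for location in locations:
        if os.path.exists(os.path.join(location, filename)):
            return location
    return None

def extract_tar_file(file_path: str, extraction_path: str) -> None:
    """Extract a gzip-compressed tar archive into extraction_path."""
    # The context manager closes the archive even if extraction fails
    with tarfile.open(file_path, "r:gz") as tar:
        tar.extractall(extraction_path)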

try :

    from google.colab import files

    # specifying the data_path
    data_path = "./DATAPATH"

    !mkdir ./DATAPATH

    # downloading the data
    !gdown -O ./DATAPATH/dbpedia_csv.tar.gz "https://drive.google.com/uc?export=download&id=0Bz8a_Dbh9QhbQ2Vic1kxMmZZQ1k"

    # untarring the required file
    !tar -xvf ./DATAPATH/dbpedia_csv.tar.gz --directory ./DATAPATH

    # sneak peek at the folder structure
    !ls -lah ./DATAPATH

except ModuleNotFoundError:
    data_path = './Data/'
    compressed_file_name = 'dbpedia_csv.tar.gz'
    extracted_file_name = 'dbpedia_csv'

    # Check if Extracted File exists
    location_of_extracted_file = check_if_file_exists(extracted_file_name, ['./Data'])

    if location_of_extracted_file:
        # Extracted File exists
        path_to_model = os.path.join(location_of_extracted_file, extracted_file_name)

    else:
        location_of_compressed_file = check_if_file_exists(compressed_file_name, ['./Data'])

        if location_of_compressed_file:
            # Compressed File exists
            extract_tar_file(os.path.join(location_of_compressed_file, compressed_file_name), data_path)
            path_to_model = os.path.join(data_path, extracted_file_name)

        else:
            # Download File
            os.makedirs("./Data", exist_ok=True)
            output_path = './Data/'
            gdown.download("https://drive.google.com/uc?export=download&id=0Bz8a_Dbh9QhbQ2Vic1kxMmZZQ1k", output=output_path)

            # Extract File
            extract_tar_file(os.path.join(data_path, compressed_file_name), output_path)

            path_to_model = os.path.join(data_path, extracted_file_name)

    print(f"Data Present at location : {path_to_model}")

Downloading...
From: https://drive.google.com/uc?export=download&id=0Bz8a_Dbh9QhbQ2Vic1kxMmZZQ1k
To: /content/DATAPATH/dbpedia_csv.tar.gz
100% 68.3M/68.3M [00:00<00:00, 164MB/s]
dbpedia_csv/
dbpedia_csv/classes.txt
dbpedia_csv/test.csv
dbpedia_csv/train.csv
dbpedia_csv/readme.txt
total 66M
drwxr-xr-x 3 root root  4.0K Aug 22 18:17 .
drwxr-xr-x 1 root root  4.0K Aug 22 18:17 ..
drwxrwxr-x 2 3666 11555 4.0K Sep  9  2015 dbpedia_csv
-rw-r--r-- 1 root root   66M Aug 22 18:17 dbpedia_csv.tar.gz


In [7]:
# Loading train data
train_file = data_path + '/dbpedia_csv/train.csv'
df = pd.read_csv(train_file, header=None, names=['class','name','description'])
# Loading test data
test_file = data_path + '/dbpedia_csv/test.csv'
df_test = pd.read_csv(test_file, header=None, names=['class','name','description'])
# Data we have
print("Train:{} Test:{}".format(df.shape,df_test.shape))

Train:(560000, 3) Test:(70000, 3)


In [8]:
# The CSV files contain only numeric class ids, so let's build a lookup
# Mapping from class number to class name
class_dict={
            1:'Company',
            2:'EducationalInstitution',
            3:'Artist',
            4:'Athlete',
            5:'OfficeHolder',
            6:'MeanOfTransportation',
            7:'Building',
            8:'NaturalPlace',
            9:'Village',
            10:'Animal',
            11:'Plant',
            12:'Album',
            13:'Film',
            14:'WrittenWork'
        }

# Mapping the classes
df['class_name'] = df['class'].map(class_dict)
df.head()

   class                               name                                       description  class_name
0      1                   E. D. Abbott Ltd  Abbott of Farnham E D Abbott Limited was a Br...     Company
1      1                     Schwan-Stabilo  Schwan-STABILO is a German maker of pens for ...     Company
2      1                         Q-workshop  Q-workshop is a Polish company located in Poz...     Company
3      1  Marvell Software Solutions Israel  Marvell Software Solutions Israel known as RA...     Company
4      1        Bergan Mercy Medical Center  Bergan Mercy Medical Center is a hospital loc...     Company


In [9]:
df["class_name"].value_counts()

Company                   40000
EducationalInstitution    40000
Artist                    40000
Athlete                   40000
OfficeHolder              40000
MeanOfTransportation      40000
Building                  40000
NaturalPlace              40000
Village                   40000
Animal                    40000
Plant                     40000
Album                     40000
Film                      40000
WrittenWork               40000
Name: class_name, dtype: int64

In [10]:
# Let's do some cleaning of this text
import unicodedata

def clean_it(text,normalize=True):
    # Replacing characters that may cause issues in the data. We can add to or trim the replacements in this chain
    s = str(text).replace(',',' ').replace('"','').replace('\'',' \' ').replace('.',' . ').replace('(',' ( ').\
            replace(')',' ) ').replace('!',' ! ').replace('?',' ? ').replace(':',' ').replace(';',' ').lower()

    # normalizing / encoding the text: decompose accented characters (NFKD), then drop anything that is not ASCII
    if normalize:
        s = unicodedata.normalize('NFKD', s).encode('ascii','ignore').decode('utf-8')

    return s

# Now let's define a small function that applies the above cleaning to a dataset
def clean_df(data, cleanit= False, shuffleit=False, encodeit=False, label_prefix='__class__'):
    # Defining the new data
    df = data[['name','description']].copy(deep=True)
    df['class'] = label_prefix + data['class'].astype(str) + ' '

    # cleaning it
    if cleanit:
        df['name'] = df['name'].apply(lambda x: clean_it(x,encodeit))
        df['description'] = df['description'].apply(lambda x: clean_it(x,encodeit))

    # shuffling it (sample returns a new DataFrame, so the result must be assigned back)
    if shuffleit:
        df = df.sample(frac=1).reset_index(drop=True)

    return df
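
The optional normalization step in `clean_it` is meant to fold accented characters down to plain ASCII. A minimal standalone sketch of that idea using the standard-library `unicodedata` module is shown here; `to_ascii` is a hypothetical helper name used only for illustration and is not part of the notebook.

```python
import unicodedata

def to_ascii(text: str) -> str:
    """Fold accented characters to ASCII (hypothetical helper, for illustration only)."""
    # NFKD splits a character such as 'é' into a base letter 'e' plus a combining accent
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors="ignore" drops the combining accents and any other non-ASCII character
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("Poznań café"))  # prints "Poznan cafe"
```

Characters that have no ASCII base letter are removed entirely by this approach, so it is lossy for text in non-Latin scripts.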

In [11]:
%%time
# Transform the datasets using the above clean functions
df_train_cleaned = clean_df(df, True, True)
df_test_cleaned = clean_df(df_test, True, True)

CPU times: user 3.78 s, sys: 220 ms, total: 4 s
Wall time: 4.15 s


In [12]:
# Write files to disk as fastText classifier API reads files from disk.
train_file = data_path + '/dbpedia_train.csv'
df_train_cleaned.to_csv(train_file, header=None, index=False, columns=['class','name','description'] )

test_file = data_path + '/dbpedia_test.csv'
df_test_cleaned.to_csv(test_file, header=None, index=False, columns=['class','name','description'] )


Now that we have the train and test files written to disk in the format fastText expects, we are ready to use it for text classification!
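
To make that format concrete, here is a rough sketch of what one line of the written file looks like and how it splits into label tokens and word tokens. It assumes fastText splits lines on whitespace and treats any token that starts with the label prefix as a label. `to_fasttext_line` and `split_labels_and_text` are hypothetical helpers written for this illustration, not fastText functions.

```python
LABEL_PREFIX = "__class__"  # prefix used in this notebook; fastText's default is "__label__"

def to_fasttext_line(class_id: int, name: str, description: str) -> str:
    """Mimic one row of the CSV written above: label column, then name, then description."""
    # The trailing space after the label matches the ' ' appended in clean_df
    return f"{LABEL_PREFIX}{class_id} ,{name},{description}"

def split_labels_and_text(line: str):
    """Approximate the split into labels and words by tokenizing on whitespace."""
    tokens = line.split()
    labels = [t for t in tokens if t.startswith(LABEL_PREFIX)]
    words = [t for t in tokens if not t.startswith(LABEL_PREFIX)]
    return labels, words

line = to_fasttext_line(1, "q-workshop", "q-workshop is a polish company")
labels, words = split_labels_and_text(line)
print(line)
print(labels)
print(words[:3])
```

The CSV commas stay attached to their neighbouring tokens because only whitespace separates tokens. The space appended after the label in `clean_df` is what keeps the label a separate token.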

In [13]:
%%time
## Using fastText for feature extraction and training
from fasttext import train_supervised
"""train_supervised expects a training file (here a csv) as its input argument.
label is the prefix that marks label strings in the dataset.
The default is __label__. In our dataset, it is __class__.
There are several other parameters, which are described at:
https://pypi.org/project/fasttext/
"""
model = train_supervised(input=train_file, label="__class__", lr=1.0, epoch=75, loss='ova', wordNgrams=2, dim=200, thread=2, verbose=100)

CPU times: user 1h, sys: 19 s, total: 1h 19s
Wall time: 33min 36s


In [14]:
for k in range(1,6):
    results = model.test(test_file,k=k)
    print(f"Test Samples: {results[0]} Precision@{k} : {results[1]*100:2.4f} Recall@{k} : {results[2]*100:2.4f}")

Test Samples: 70000 Precision@1 : 91.5214 Recall@1 : 91.5214
Test Samples: 70000 Precision@2 : 47.6493 Recall@2 : 95.2986
Test Samples: 70000 Precision@3 : 31.9848 Recall@3 : 95.9543
Test Samples: 70000 Precision@4 : 24.2014 Recall@4 : 96.8057
Test Samples: 70000 Precision@5 : 19.4149 Recall@5 : 97.0743
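
Precision@k drops as k grows while Recall@k rises. One way to read this, assuming every test row has exactly one gold label and the model returns k predictions per row: at most one of the k predictions can be correct, so Precision@k should equal Recall@k divided by k. The sketch below checks that relation against the numbers printed above; `implied_precision` is a hypothetical helper for this check.

```python
# (precision %, recall %) copied from the output above, keyed by k
reported = {
    1: (91.5214, 91.5214),
    2: (47.6493, 95.2986),
    3: (31.9848, 95.9543),
    4: (24.2014, 96.8057),
    5: (19.4149, 97.0743),
}

def implied_precision(recall_pct: float, k: int) -> float:
    """Precision@k implied by Recall@k when each row has one gold label and k predictions."""
    return recall_pct / k

for k, (precision_pct, recall_pct) in reported.items():
    implied = implied_precision(recall_pct, k)
    print(f"k={k}  reported P@k={precision_pct:.4f}  implied P@k={implied:.4f}")
```

The reported and implied values agree to four decimal places, so for this single-label dataset Recall@k is the more informative number for k > 1.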


Try training a classifier on this dataset with, say, LogisticRegression to see how fast fastText is! Precision and recall above 91% at k=1 are hard numbers to beat, too!
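
A minimal sketch of such a baseline is given here, assuming scikit-learn is installed (it is not in this notebook's install cell). The TF-IDF features, the bigram range, and the four toy sentences are illustrative choices invented for this example, not settings or data from the notebook. To compare against fastText, pass the cleaned train texts and class labels instead of the toy data.

```python
import time

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_baseline(texts, labels):
    """Fit a TF-IDF + LogisticRegression pipeline and report the training time in seconds."""
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),  # unigrams and bigrams, loosely mirroring wordNgrams=2
        LogisticRegression(max_iter=1000),
    )
    start = time.perf_counter()
    model.fit(texts, labels)
    elapsed = time.perf_counter() - start
    return model, elapsed

# Toy data invented for illustration; replace with the real train split
toy_texts = [
    "q-workshop is a polish company that makes dice",
    "schwan-stabilo is a german maker of pens",
    "the village lies in a river valley",
    "a small village in the northern mountains",
]
toy_labels = ["Company", "Company", "Village", "Village"]

baseline, seconds = train_baseline(toy_texts, toy_labels)
predictions = baseline.predict(["a german company that makes pens"])
print(predictions, f"trained in {seconds:.3f}s")
```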