### Apply Chollet's criterion

When you're working on a task where the data is text, **one question** that could arise is:

***Is it worth using a model based on Transformers, or is it possible to use a simpler model, for example one based on TF-IDF?***

Well, it depends on the data you have for your task. In general, **Deep Learning** achieves higher accuracy when you have **many samples**, and more is better.

In the book **Deep Learning with Python** (second edition), a simple **criterion** is suggested: you need to consider
* the number of samples you have
* the average length (in words) of samples

Then, you can compute the Ratio:

Ratio = N_SAMPLES/AVG_LENGTH

If this Ratio is < 1500, the suggestion is to go with a simpler model.
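The rule above can be sketched as a small helper. This is only an illustration: the function names `chollet_ratio` and `suggested_model` are made up for this example, and the sample values (40,000 training samples, 231.2 words per sample on average) are the ones computed for IMDb later in this notebook.

```python
def chollet_ratio(num_samples: int, avg_words_per_sample: float) -> float:
    """Ratio between the number of samples and the average sample length in words."""
    if avg_words_per_sample <= 0:
        raise ValueError("avg_words_per_sample must be positive")
    return num_samples / avg_words_per_sample


def suggested_model(ratio: float, threshold: float = 1500.0) -> str:
    """Apply the rule of thumb: below the threshold, prefer the simpler model."""
    if ratio < threshold:
        return "simpler model (e.g. TF-IDF)"
    return "sequence model (e.g. Transformer)"


# Values taken from the IMDb computation later in this notebook
example_ratio = chollet_ratio(num_samples=40_000, avg_words_per_sample=231.2)
print(round(example_ratio, 1), "->", suggested_model(example_ratio))
```

The threshold is kept as a parameter because 1500 is a rule of thumb, not a hard limit.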

But that criterion was established in 2017, and in the Deep Learning field things change very rapidly.

So I decided to test the criterion on the **IMDb dataset** for a **Sentiment Analysis** task. As you can see from this notebook, in the case of the IMDb dataset

**Ratio = 173**

In addition to the book **Deep Learning with Python** (second edition), see also section 2.5 of this guide:

https://developers.google.com/machine-learning/guides/text-classification

In [1]:
import pandas as pd
import numpy as np
import os
from tqdm import tqdm

import matplotlib.pyplot as plt
import seaborn as sns

sns.set()

%matplotlib inline

### Load the entire dataset

In [2]:
# reads all the files to build the entire dataset as DataFrame
# code inspired by: https://github.com/rasbt/python-machine-learning-book-3rd-edition/blob/master/ch08/ch08.ipynb

# it takes some time... good to have a progress bar
# added using tqdm

basepath = 'aclImdb'
NUM_FILES = 50000

labels = {'pos': 1, 'neg': 0}

# collect the rows in a plain list: building the DataFrame once at the end
# is much faster than growing it row by row (DataFrame.append no longer exists in pandas 2.x)
rows = []

with tqdm(total=NUM_FILES) as pbar:
    for s in ('test', 'train'):
        for l in ('pos', 'neg'):
            path = os.path.join(basepath, s, l)
            for file in sorted(os.listdir(path)):
                with open(os.path.join(path, file),
                          'r', encoding='utf-8') as infile:
                    txt = infile.read()
                rows.append([txt, labels[l]])

                pbar.update(1)

# name the columns as expected by transformers
df = pd.DataFrame(rows, columns=['text', 'target'])

100%|██████████| 50000/50000 [01:34<00:00, 530.60it/s]


### Compute metrics

In [3]:
# we assume here that only 80% of the dataset will be used for training, the rest for validation
VALID_FRAC = 0.2

NUM_TRAIN_SAMPLES = df.shape[0] * (1. - VALID_FRAC)

# compute the average number of words per sample (each sample is one review)
df["wps"] = df["text"].str.split().apply(len)
avg_wps = round(df["wps"].mean(), 1)

ratio = round(NUM_TRAIN_SAMPLES/avg_wps, 1)

print(f'Number of training samples: {NUM_TRAIN_SAMPLES}')
print(f'Avg number of words per sample: {avg_wps}')
print()
print(f'Computed ratio is: {ratio}')

Number of training samples: 40000.0
Avg number of words per sample: 231.2

Computed ratio is: 173.0


In [4]:
# The ratio is < 1500, so according to the criterion we should go with a TF-IDF model.
# In practice, however, with the resources available to everyone today, a Transformer can achieve better accuracy.
# Have a look at the other notebooks.
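
For reference, a TF-IDF baseline of the kind the criterion points to could look roughly like the sketch below. It assumes scikit-learn is installed, which this notebook does not otherwise use. The toy texts and the choice of bigrams with logistic regression are assumptions made for illustration, not the setup used in the other notebooks. On the real data, `texts` and `targets` would come from `df["text"]` and `df["target"]`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data, made up for illustration only
texts = [
    "a wonderful and moving film",
    "a terrible and boring film",
    "wonderful acting, great story",
    "boring plot, terrible acting",
]
targets = [1, 0, 1, 0]

# TF-IDF features on unigrams and bigrams, followed by a linear classifier
baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
baseline.fit(texts, targets)

# Predict the label (1 = positive, 0 = negative) for two unseen texts
preds = baseline.predict(["a wonderful story", "a boring story"])
print(preds)
```

A baseline like this trains in seconds on a CPU, so it gives a useful reference accuracy before investing in a Transformer.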