<a href="https://colab.research.google.com/github/harnalashok/deeplearning/blob/main/Tabular_data_with_fastai.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [33]:
# Last amended: 13th Jan, 2022
# github deeplearning repo
# Ref: https://docs.fast.ai/tutorial.tabular.html
%reset -f

# Tabular models
Reference: see [here](https://docs.fast.ai/tutorial.tabular.html)<br>
What are DataLoaders? See [here](https://dirk-kalmbach.medium.com/datablock-and-dataloaders-in-fastai-d5aa7ae560e5) and [here](https://muttoni.github.io/blog/machine-learning/fastai/2020/12/26/datablocks-vs-dataloaders.html)

In [2]:
# 0.0
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"


## Call libraries

In [3]:
# 1.0 Call libraries:

# fastai.tabular.all provides FillMissing, Categorify, Normalize,
#  untar_data, TabularPandas and tabular_learner, and exposes pandas as pd
from fastai.tabular.all import *

## Get data

The `pathlib` module contains useful functions for file-related tasks. It represents filesystem paths as proper objects, which makes building paths more readable and lets us write code that is portable across platforms. `untar_data` returns such a path object, as the output below shows.

In [4]:
# 1.1 untar_data will download data (if not already downloaded)
#      to /root/.fastai/data/adult_sample

path = untar_data(URLs.ADULT_SAMPLE)
path
print("=====\n")
path.ls()
print("=====\n")
type(path)

Path('/root/.fastai/data/adult_sample')

=====



(#3) [Path('/root/.fastai/data/adult_sample/adult.csv'),Path('/root/.fastai/data/adult_sample/models'),Path('/root/.fastai/data/adult_sample/export.pkl')]

=====



pathlib.PosixPath

### About pathlib
See [here](https://stackabuse.com/introduction-to-the-python-pathlib-module/)


    Path.cwd(): Return a path object representing the current working directory
    Path.home(): Return a path object representing the home directory
    Path.stat(): Return information about the path
    Path.chmod(): Change the file mode and permissions
    Path.glob(pattern): Glob the given pattern in the directory represented by the path, yielding matching files of any kind
    Path.mkdir(): Create a new directory at the given path
    Path.open(): Open the file the path points to
    Path.rename(): Rename a file or directory to the given target
    Path.rmdir(): Remove the empty directory
    Path.unlink(): Remove the file or symbolic link
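
A few of the methods listed above can be tried safely inside a temporary directory. This is a minimal sketch; the folder and file names (`output`, `notes.txt`, `renamed.txt`) are made up for illustration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)

    # mkdir(): create a sub-directory
    out_dir = base / "output"
    out_dir.mkdir()

    # open(): create a file and write to it
    note = out_dir / "notes.txt"
    with note.open("w") as f:
        f.write("hello")

    # rename(): move the file to a new name
    target = out_dir / "renamed.txt"
    note.rename(target)

    # glob() and stat(): list matching files and read the file size
    found = [p.name for p in out_dir.glob("*.txt")]
    size = target.stat().st_size

    # unlink() removes the file; rmdir() removes the now-empty directory
    target.unlink()
    out_dir.rmdir()
    dir_gone = not out_dir.exists()

print(found, size, dir_gone)
```

Everything is cleaned up when the `with` block exits, so the sketch leaves no files behind.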


In [5]:
# 2.0
from pathlib import Path


In [6]:
# 2.1
Path.cwd()
Path.home()

Path('/content')

Path('/root')

In [7]:
# 2.2 How to build paths
outpath = Path.cwd() / 'output' / 'output.xlsx'
outpath

Path('/content/output/output.xlsx')

In [8]:
# 2.3
path.is_dir()
path.stat()

True

os.stat_result(st_mode=16893, st_ino=4980765, st_dev=45, st_nlink=3, st_uid=1000, st_gid=1000, st_size=4096, st_atime=1673607400, st_mtime=1543965152, st_ctime=1673607400)
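
The raw numbers in `os.stat_result` are easier to read once decoded. The sketch below uses the standard-library `stat` and `datetime` modules on two values copied from the output above (`st_mode` and `st_mtime`); the variable names are only illustrative.

```python
import stat
from datetime import datetime, timezone

st_mode = 16893         # copied from the stat output above
st_mtime = 1543965152   # copied from the stat output above

# st_mode packs the file type and the permission bits into one integer
is_directory = stat.S_ISDIR(st_mode)
permissions = stat.filemode(st_mode)

# st_mtime is seconds since the Unix epoch
modified = datetime.fromtimestamp(st_mtime, tz=timezone.utc)

print(is_directory, permissions, modified.isoformat())
```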

In [9]:
# 2.4
path.glob("*.csv")

<generator object Path.glob at 0x7f7ae60a0eb0>

In [10]:
# 2.5
for i in path.glob("*.csv"):
  print(i)

/root/.fastai/data/adult_sample/adult.csv


## Read our data

In [11]:
# 3.0 Read the downloaded dataset 
df = pd.read_csv(path/'adult.csv')
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,49,Private,101320,Assoc-acdm,12.0,Married-civ-spouse,,Wife,White,Female,0,1902,40,United-States,>=50k
1,44,Private,236746,Masters,14.0,Divorced,Exec-managerial,Not-in-family,White,Male,10520,0,45,United-States,>=50k
2,38,Private,96185,HS-grad,,Divorced,,Unmarried,Black,Female,0,0,32,United-States,<50k
3,38,Self-emp-inc,112847,Prof-school,15.0,Married-civ-spouse,Prof-specialty,Husband,Asian-Pac-Islander,Male,0,0,40,United-States,>=50k
4,42,Self-emp-not-inc,82297,7th-8th,,Married-civ-spouse,Other-service,Wife,Black,Female,0,0,50,United-States,<50k


In [12]:
# 3.1
df.shape   # (32561, 15)

(32561, 15)

## Data types & Data processing

In [13]:
# 3.2 Define some constants:

dep_var = 'salary'    # target
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']


Preprocessing steps (procs) to apply to the data. API references:<br>
[FillMissing](https://docs.fast.ai/tabular.core.html#fillmissing) will fill the missing values in the continuous variables by the median of existing values (you can choose a specific value if you prefer)<br>
[Categorify](https://docs.fast.ai/tabular.core.html#categorify): Transform the categorical variables to something similar to pd.Categorical<br>
[Normalize](https://docs.fast.ai/data.transforms.html#normalize) will normalize the continuous variables (subtract the mean and divide by the std)<br>
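
To make the three steps concrete, here is a rough plain-Python sketch of what each one computes on a tiny, made-up table. It is not fastai's implementation, which works on whole DataFrame columns. The toy rows, the `_na` flag column, reserving code 0 for `#na#`, and the use of the population standard deviation are all assumptions of this sketch.

```python
from statistics import mean, median, pstdev

# Hypothetical toy data: one categorical and one continuous column
rows = [
    {"education": "Masters", "education-num": 14.0},
    {"education": "HS-grad", "education-num": None},
    {"education": "Masters", "education-num": 10.0},
    {"education": "7th-8th", "education-num": 12.0},
]

# FillMissing-like step: flag missing values, then fill with the median
known = [r["education-num"] for r in rows if r["education-num"] is not None]
fill_value = median(known)
for r in rows:
    r["education-num_na"] = r["education-num"] is None
    if r["education-num_na"]:
        r["education-num"] = fill_value

# Categorify-like step: map each label to an integer code,
# keeping code 0 for missing or unseen labels (assumption)
vocab = ["#na#"] + sorted({r["education"] for r in rows})
code_of = {label: i for i, label in enumerate(vocab)}
for r in rows:
    r["education_code"] = code_of.get(r["education"], 0)

# Normalize-like step: subtract the mean and divide by the std
values = [r["education-num"] for r in rows]
mu, sigma = mean(values), pstdev(values)
for r in rows:
    r["education-num_norm"] = (r["education-num"] - mu) / sigma

for r in rows:
    print(r)
```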


In [14]:
# 3.3 Processing steps to apply to this data,
#  and in what sequence:

procs = [FillMissing, Categorify, Normalize]

About TabularDataLoaders see [here](https://docs.fast.ai/tabular.data.html#tabulardataloaders.from_csv)<br>
The following code does not give satisfactory results, so it is left commented out:

In [None]:
# Defaults to 80:20
#dls = TabularDataLoaders.from_csv(
#                                  path / 'adult.csv',
#                                  path=path,
#                                  y_names="salary",
#                                  bs = 64,   # Try 2 or 3
#                                  cat_names = cat_names,
#                                  cont_names = cont_names,
#                                  procs = procs)

Instead proceed as follows:<br>
First split the dataset

In [47]:
# 4.0 Get two splits of data
splits = RandomSplitter(valid_pct=0.2)(range_of(df))

In [48]:
# 4.1 This is what splits object is:
splits

((#26049) [26211,15918,9218,9024,8161,15125,2601,10144,19292,28404...],
 (#6512) [30841,5579,21831,12437,16768,13784,4590,8247,18567,18380...])
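
The idea behind such a random split can be sketched in plain Python: shuffle the row indices, then cut off a fraction for validation. The helper `random_split` and its `seed` argument are hypothetical names for illustration, not fastai's API.

```python
import random

def random_split(n_items, valid_pct=0.2, seed=None):
    """Return (train_indices, valid_indices) from a shuffled range."""
    rng = random.Random(seed)
    idxs = list(range(n_items))
    rng.shuffle(idxs)
    cut = int(valid_pct * n_items)   # size of the validation set
    return idxs[cut:], idxs[:cut]

train_idx, valid_idx = random_split(32561, valid_pct=0.2, seed=42)
print(len(train_idx), len(valid_idx))
```

With 32561 rows and `valid_pct=0.2`, the two parts have the same sizes as the splits shown above (26049 and 6512).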

Convert the pandas DataFrame to a fastai data structure known as `TabularPandas`:

In [49]:
# 5.0
# Reuse the constants defined in 3.2 and 3.3
to = TabularPandas(
                   df,
                   procs=procs,
                   cat_names=cat_names,
                   cont_names=cont_names,
                   y_names=dep_var,
                   splits=splits
                   )

In [50]:
# 5.1 Look at the fully preprocessed data:

to.xs.iloc[:2]

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,education-num_na,age,fnlwgt,education-num
26211,5,10,1,5,5,3,1,0.033753,0.569118,1.148712
15918,7,16,3,6,1,5,1,-0.69967,-0.461361,-0.028424


In [51]:
# 5.2 Build our DataLoaders now:

dls = to.dataloaders(bs=64)
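
The batch size determines how many batches make up one pass over the data. A small sketch of the arithmetic, using the split sizes from 4.1; the helper `batches_per_epoch` and its `drop_last` flag are hypothetical names, and whether the last incomplete batch is dropped is treated here as an assumption.

```python
import math

def batches_per_epoch(n_rows, bs, drop_last=False):
    """Number of batches needed to cover n_rows with batch size bs."""
    if drop_last:
        return n_rows // bs           # an incomplete final batch is discarded
    return math.ceil(n_rows / bs)     # an incomplete final batch is kept

n_train, n_valid = 26049, 6512        # split sizes from 4.1

for bs in (64, 3):
    print(bs, batches_per_epoch(n_train, bs), batches_per_epoch(n_valid, bs))
```

A smaller `bs` means many more batches, and hence more update steps, per epoch.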

### A bit about `dls` object
This section can be skipped.

In [None]:
# 5.2.1
x = dls.train_ds
x   #  26049 rows x 16 columns

In [25]:
# 5.2.2
type(x)

fastai.tabular.core.TabularPandas

In [None]:
# 5.2.3 Show validation batch
dls.valid.show_batch()

In [None]:
# 5.2.4 Fetch one batch (bs=64 here; with bs=3 in 5.2
#  the batch would have three observations).
#   The batch is a tuple of tensors: label-encoded
#    categorical values, float continuous values and targets.
#     The batch is picked at random.
dls.one_batch()

In [None]:
# 5.2.5 Like 5.2.4, but the decoded (actual) values are shown.
##       The displayed batch is again picked at random.
dls.show_batch()

## Learn now

In [53]:
# 6.0
learn = tabular_learner(dls, metrics=accuracy)

And we can train that model with the `fit_one_cycle` method (the `fine_tune` method won’t be useful here since we don’t have a pretrained model).
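
To give a feel for what a one-cycle policy does with the learning rate, here is a rough plain-Python sketch: the rate warms up from a low value to a peak, then anneals to a much lower value, each phase following a cosine curve. The function name `one_cycle_lr` and every constant (`lr_max`, `div`, `div_final`, `pct_start`) are assumptions for illustration; fastai's actual schedule may differ in detail.

```python
import math

def cos_anneal(start, end, pos):
    """Cosine interpolation from start (pos=0) to end (pos=1)."""
    return start + (1 + math.cos(math.pi * (1 - pos))) * (end - start) / 2

def one_cycle_lr(pct, lr_max=1e-3, div=25.0, div_final=1e5, pct_start=0.25):
    """Learning rate at fraction pct (0..1) of the whole training run."""
    if pct < pct_start:
        # Warm-up phase: lr_max/div -> lr_max
        return cos_anneal(lr_max / div, lr_max, pct / pct_start)
    # Annealing phase: lr_max -> lr_max/div_final
    pos = (pct - pct_start) / (1 - pct_start)
    return cos_anneal(lr_max, lr_max / div_final, pos)

for pct in (0.0, 0.125, 0.25, 0.5, 1.0):
    print(f"{pct:5.3f}  {one_cycle_lr(pct):.2e}")
```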

In [None]:
# 6.1 The smaller the batch size (bs), the longer training takes:
learn.fit_one_cycle(n_epoch = 10)

## Make predictions

### Directly from a DataFrame

In [None]:
# 7.0 We can then have a look at some predictions:
learn.show_results()

In [59]:
# 7.1 Make prediction for one row:
row, clas, probs = learn.predict(df.iloc[0])

In [60]:
# 7.2 Show result of one row
row.show()

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,education-num_na,age,fnlwgt,education-num,salary
0,Private,Assoc-acdm,Married-civ-spouse,#na#,Wife,White,False,49.0,101320.000913,12.0,<50k


In [61]:
# 7.3 Other information
clas, probs

(tensor(0), tensor([0.5373, 0.4627]))

### From Data Loader object

To get predictions on a new DataFrame, you can use the `test_dl` method of the DataLoaders. The DataFrame does not need to contain the dependent variable among its columns. For more about `test_dl`, see [here](https://muellerzr.github.io/fastblog/2020/08/10/testdl.html)

In [62]:
# 8.0
test_df = df.copy()
test_df.drop(['salary'], axis=1, inplace=True)

In [63]:
# 8.1 test_dl applies to test_df exactly the same
#      preprocessing that was applied to the training data
dl = learn.dls.test_dl(test_df)

In [64]:
# 8.2 Get predictions now:

learn.get_preds(dl=dl)

(tensor([[0.5373, 0.4627],
         [0.4583, 0.5417],
         [0.9702, 0.0298],
         ...,
         [0.6320, 0.3680],
         [0.7073, 0.2927],
         [0.7190, 0.2810]]), None)
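
Each row of the tensor above holds one probability per class, so the predicted class is the position of the largest value. A plain-Python sketch on the first three rows of that output follows. The class order `['<50k', '>=50k']` is an assumption, though it is consistent with 7.2 and 7.3, where `tensor(0)` appears alongside `<50k`.

```python
# First three probability rows copied from the output above
probs = [
    [0.5373, 0.4627],
    [0.4583, 0.5417],
    [0.9702, 0.0298],
]

labels = ["<50k", ">=50k"]   # assumed class order

def to_label(row):
    """Return (label, probability) for the most likely class."""
    best = max(range(len(row)), key=row.__getitem__)
    return labels[best], row[best]

predictions = [to_label(row) for row in probs]
for label, p in predictions:
    print(label, p)
```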

### Get prediction from any arbitrary data

In [71]:
# 9.0
test_data = {
    'age': [49], 
    'workclass': ['Private'], 
    'fnlwgt': [101320],
    'education': ['Assoc-acdm'], 
    'education-num': [12.0],
    'marital-status': ['Married-civ-spouse'], 
    'occupation': [''],
    'relationship': ['Wife'],
    'race': ['White'],
}

# 9.1
# (named input_df so that Python's built-in input() is not shadowed)
input_df = pd.DataFrame(test_data)

In [72]:
# 9.2
tdl = learn.dls.test_dl(input_df)

In [None]:
# 9.3
learn.get_preds(dl=tdl)
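
Arbitrary input such as the dictionary in 9.0 presumably has to carry every column the model was trained on. A small plain-Python sketch of such a check, using the column lists from 3.2; the helper name `missing_columns` is made up for illustration.

```python
cat_names = ['workclass', 'education', 'marital-status',
             'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']

test_data = {
    'age': [49],
    'workclass': ['Private'],
    'fnlwgt': [101320],
    'education': ['Assoc-acdm'],
    'education-num': [12.0],
    'marital-status': ['Married-civ-spouse'],
    'occupation': [''],
    'relationship': ['Wife'],
    'race': ['White'],
}

def missing_columns(data, required):
    """List the required column names that are absent from data."""
    return [name for name in required if name not in data]

required = cat_names + cont_names
missing = missing_columns(test_data, required)

# Every column should also hold the same number of rows
row_counts = {len(values) for values in test_data.values()}

print(missing, row_counts)
```

An empty list and a single row count indicate that the dictionary is ready to be turned into a DataFrame.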
