<a href="https://colab.research.google.com/github/mjkimcs/portfolio/blob/main/%EB%94%A5%EB%9F%AC%EB%8B%9D/fastai.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
#hide
#skip
! [ -e /content ] && pip install -Uqq fastai  # upgrade fastai on colab

# Tabular training

> How to use the tabular application in fastai

To illustrate the tabular application, we will use the [Adult dataset](https://archive.ics.uci.edu/ml/datasets/Adult), where the task is to predict whether a person earns more or less than $50k per year from some general data about them.

In [None]:
from fastai.tabular.all import *

We can download a sample of this dataset with the usual `untar_data` command:

In [None]:
path = untar_data(URLs.ADULT_SAMPLE)
path.ls()

(#3) [Path('/root/.fastai/data/adult_sample/adult.csv'),Path('/root/.fastai/data/adult_sample/export.pkl'),Path('/root/.fastai/data/adult_sample/models')]

Then we can have a look at how the data is structured:

In [None]:
df = pd.read_csv(path/'adult.csv')
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,49,Private,101320,Assoc-acdm,12.0,Married-civ-spouse,,Wife,White,Female,0,1902,40,United-States,>=50k
1,44,Private,236746,Masters,14.0,Divorced,Exec-managerial,Not-in-family,White,Male,10520,0,45,United-States,>=50k
2,38,Private,96185,HS-grad,,Divorced,,Unmarried,Black,Female,0,0,32,United-States,<50k
3,38,Self-emp-inc,112847,Prof-school,15.0,Married-civ-spouse,Prof-specialty,Husband,Asian-Pac-Islander,Male,0,0,40,United-States,>=50k
4,42,Self-emp-not-inc,82297,7th-8th,,Married-civ-spouse,Other-service,Wife,Black,Female,0,0,50,United-States,<50k


In [None]:
df.shape

(32561, 15)

The `procs` argument used below is the list of pre-processors we apply to our data:

- `Categorify` takes every categorical variable, builds a map from each unique category to an integer index, then replaces the values with the corresponding index.
- `FillMissing` fills the missing values in the continuous variables with the median of the existing values (you can choose a specific value if you prefer).
- `Normalize` normalizes the continuous variables (subtracts the mean and divides by the standard deviation).
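
To make the three pre-processors concrete, here is a minimal pure-Python sketch of the same ideas on a few made-up rows. It is not `fastai`'s implementation: the toy `rows`, the variable names, and the choice of the population standard deviation are assumptions made for illustration, and reserving index 0 for `#na#` mirrors what the encoded outputs in this notebook suggest.

```python
from statistics import mean, median, pstdev

# Toy rows standing in for the training data (hypothetical values).
rows = [
    {"workclass": "Private", "education-num": 12.0},
    {"workclass": "Self-emp-inc", "education-num": None},
    {"workclass": "Private", "education-num": 14.0},
    {"workclass": "Local-gov", "education-num": 10.0},
]

# Categorify: map each unique category to an integer index.
# Index 0 is reserved for the "#na#" placeholder (assumption for this sketch).
vocab = ["#na#"] + sorted({r["workclass"] for r in rows})
cat_to_idx = {cat: idx for idx, cat in enumerate(vocab)}
workclass_idx = [cat_to_idx[r["workclass"]] for r in rows]

# FillMissing: replace missing continuous values with the median of the
# observed ones, and remember which rows were filled.
observed = [r["education-num"] for r in rows if r["education-num"] is not None]
fill_value = median(observed)
was_missing = [r["education-num"] is None for r in rows]
filled = [fill_value if r["education-num"] is None else r["education-num"] for r in rows]

# Normalize: subtract the mean and divide by the standard deviation.
mu, sigma = mean(filled), pstdev(filled)
normalized = [(value - mu) / sigma for value in filled]

print(workclass_idx)  # integer codes for the workclass column
print(was_missing)    # flags comparable to an `education-num_na` column
print(normalized)     # zero mean, unit standard deviation
```

The `was_missing` flags play the role of the extra `education-num_na` column that appears in the preprocessed tables below.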



To further expose what's going on below the surface, let's build our data with `fastai`'s `TabularPandas` class. To do so, we need to define how we want to split our data. The `TabularDataLoaders` factory methods use a random 80/20 split by default, so we will do the same:

In [None]:
splits = RandomSplitter(valid_pct=0.2)(range_of(df))
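
As a rough illustration of what a random 80/20 split amounts to, the sketch below shuffles the row indices and cuts off the first 20% for validation. The function `random_split` and its `seed` argument are hypothetical names for this example, not part of `fastai`, and the real `RandomSplitter` may differ in its details.

```python
import random

def random_split(n_items, valid_pct=0.2, seed=None):
    """Return (train_idx, valid_idx) from a random shuffle of range(n_items)."""
    rng = random.Random(seed)
    idxs = list(range(n_items))
    rng.shuffle(idxs)
    cut = int(valid_pct * n_items)  # number of validation rows
    return idxs[cut:], idxs[:cut]

# 32561 is the number of rows in the Adult sample used in this notebook.
train_idx, valid_idx = random_split(32561, valid_pct=0.2, seed=42)
print(len(train_idx), len(valid_idx))  # 26049 6512
```

These sizes match the shapes of the training and validation sets shown later in the notebook.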

In [None]:
# `to` will be turned into `dls` later; the train/valid split is already applied here via `splits`
to = TabularPandas(df, procs=[Categorify, FillMissing, Normalize],
                   cat_names=['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
                   cont_names=['age', 'fnlwgt', 'education-num'],
                   y_names='salary',
                   splits=splits)

Once we build our `TabularPandas` object, our data is completely preprocessed as seen below:

In [None]:
to.xs.head()

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,education-num_na,age,fnlwgt,education-num
1725,5,13,5,11,2,5,1,-0.261958,-0.017365,1.531988
15639,5,10,3,13,1,5,1,0.398305,1.242085,1.141328
18519,5,10,5,2,3,5,1,-0.628771,-0.712666,1.141328
16865,5,12,1,5,5,3,1,0.618393,-0.1304,-0.421315
25369,3,12,5,9,2,3,1,-0.408684,1.049701,-0.421315


In [None]:
to.xs.iloc[[0,1]]

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,education-num_na,age,fnlwgt,education-num
1725,5,13,5,11,2,5,1,-0.261958,-0.017365,1.531988
15639,5,10,3,13,1,5,1,0.398305,1.242085,1.141328


In [None]:
to.ys.head(2)

Unnamed: 0,salary
1725,1
15639,1


Now we can build our `DataLoaders`:

In [None]:
dls = to.dataloaders(bs=64)

In [None]:
dls.xs.head()

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,education-num_na,age,fnlwgt,education-num
27192,1,16,7,1,5,5,1,1.871294,-0.808649,-0.032348
2589,2,12,1,11,2,5,1,0.843103,-0.117093,-0.425737
2872,5,12,4,8,4,2,1,-0.772624,0.482117,-0.425737
23082,8,8,5,2,2,5,1,-1.139835,1.201378,0.754429
31857,5,16,5,8,4,5,1,-0.992951,-0.822037,-0.032348


In [None]:
dls.xs.shape

(26049, 10)

> Later we will explore why using `TabularPandas` to preprocess will be valuable.

The `show_batch` method works as it does in every other application:

In [None]:
dls.show_batch()

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,education-num_na,age,fnlwgt,education-num,salary
0,Private,Assoc-voc,Widowed,Adm-clerical,Not-in-family,White,False,60.0,227332.000982,11.0,<50k
1,Private,Some-college,Married-civ-spouse,Adm-clerical,Husband,White,False,31.0,209537.999223,10.0,<50k
2,Private,Some-college,Never-married,Adm-clerical,Own-child,White,False,20.0,114873.998598,10.0,<50k
3,Local-gov,Some-college,Married-civ-spouse,Tech-support,Husband,White,False,36.0,113337.001427,10.0,>=50k
4,Private,HS-grad,Never-married,Machine-op-inspct,Not-in-family,Black,False,36.0,359677.999123,9.0,<50k
5,Private,HS-grad,Married-civ-spouse,Adm-clerical,Husband,White,False,28.0,110145.001477,9.0,<50k
6,Private,HS-grad,Divorced,Other-service,Not-in-family,Asian-Pac-Islander,False,59.0,98350.000437,9.0,<50k
7,Local-gov,Some-college,Divorced,Prof-specialty,Own-child,White,False,39.0,98586.996865,10.0,<50k
8,Private,HS-grad,Married-civ-spouse,Sales,Husband,White,False,60.0,308607.995443,9.0,<50k
9,Private,7th-8th,Married-civ-spouse,Transport-moving,Husband,White,False,23.0,256628.002538,4.0,<50k


We can define a model using the `tabular_learner` function. When we define our model, `fastai` will try to infer the loss function from the `y_names` we specified earlier.

**Note**: Sometimes with tabular data, your `y`'s may be encoded (such as 0 and 1). In such a case you should explicitly pass `y_block = CategoryBlock` in your constructor so `fastai` won't presume you are doing regression.

In [None]:
learn = tabular_learner(dls, metrics=accuracy)

And we can train that model with the `fit_one_cycle` method (the `fine_tune` method won't be useful here since we don't have a pretrained model).

In [None]:
learn.fit_one_cycle(2)

epoch,train_loss,valid_loss,accuracy,time
0,0.361914,0.351296,0.834152,00:06
1,0.347478,0.345934,0.835842,00:06


In [None]:
row, clas, probs = learn.predict(df.iloc[0])

In [None]:
row.show()

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,education-num_na,age,fnlwgt,education-num,salary
0,Private,Assoc-acdm,Married-civ-spouse,#na#,Wife,White,False,49.0,101320.001747,12.0,>=50k


In [None]:
clas, probs

(tensor(1), tensor([0.4128, 0.5872]))

In [None]:
probs[0].item()

0.412760466337204

In [None]:
df.head(1)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,49,Private,101320,Assoc-acdm,12.0,Married-civ-spouse,,Wife,White,Female,0,1902,40,United-States,1


To get predictions on a new dataframe, you can use the `test_dl` method of the `DataLoaders`. That dataframe does not need to have the dependent variable among its columns.

In [None]:
# Build a copy of the data without the dependent variable
test_df = df.drop(columns=["salary"])
test_y = df["salary"]  # keep the labels for evaluation

dl = learn.dls.test_dl(test_df)

In [None]:
a = learn.get_preds(dl=dl)

In [None]:
a[0]

tensor([[0.4128, 0.5872],
        [0.4777, 0.5223],
        [0.9347, 0.0653],
        ...,
        [0.5841, 0.4159],
        [0.6716, 0.3284],
        [0.6910, 0.3090]])

In [None]:
a[0][0][0].item()

0.412760466337204

In [None]:
a[0][:,1].numpy()

array([0.5872395 , 0.52234334, 0.06527726, ..., 0.41587391, 0.32842872,
       0.3090367 ], dtype=float32)

In [None]:
preds = (a[0].numpy()[:, 1] >= 0.5).astype('int')
preds

array([0, 1, 0, ..., 0, 0, 0])
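
The thresholding step can be sketched without any library. The three probability rows below are copied from the first rows of the `get_preds` output shown above; the variable names are made up for this example. For a two-class problem, thresholding the second column at 0.5 gives the same labels as taking the most probable class, while a different threshold trades precision against recall.

```python
# First three rows of class probabilities, as [P(<50k), P(>=50k)].
probs = [
    [0.4128, 0.5872],
    [0.4777, 0.5223],
    [0.9347, 0.0653],
]

def threshold_labels(probs, threshold=0.5):
    """Label a row 1 when the probability of class 1 reaches the threshold."""
    return [int(p[1] >= threshold) for p in probs]

def argmax_labels(probs):
    """Label a row with the index of its most probable class."""
    return [max(range(len(p)), key=p.__getitem__) for p in probs]

print(threshold_labels(probs))        # [1, 1, 0]
print(argmax_labels(probs))           # [1, 1, 0]
print(threshold_labels(probs, 0.55))  # [1, 0, 0]
```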

In [None]:
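from sklearn.metrics import classification_report

# Encode the string labels so they match the predicted class indices (0: "<50k", 1: ">=50k")
y_true = test_y.map({"<50k": 0, ">=50k": 1})
print(classification_report(y_true=y_true, y_pred=preds))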

              precision    recall  f1-score   support

           0       0.87      0.93      0.90     24720
           1       0.71      0.56      0.63      7841

    accuracy                           0.84     32561
   macro avg       0.79      0.75      0.76     32561
weighted avg       0.83      0.84      0.83     32561



In [None]:
# Encode the target as integers: 0 for "<50k", 1 for ">=50k"
title_mapping = {"<50k": 0, ">=50k": 1}
df['salary'] = df['salary'].map(title_mapping)
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,49,Private,101320,Assoc-acdm,12.0,Married-civ-spouse,,Wife,White,Female,0,1902,40,United-States,1
1,44,Private,236746,Masters,14.0,Divorced,Exec-managerial,Not-in-family,White,Male,10520,0,45,United-States,1
2,38,Private,96185,HS-grad,,Divorced,,Unmarried,Black,Female,0,0,32,United-States,0
3,38,Self-emp-inc,112847,Prof-school,15.0,Married-civ-spouse,Prof-specialty,Husband,Asian-Pac-Islander,Male,0,0,40,United-States,1
4,42,Self-emp-not-inc,82297,7th-8th,,Married-civ-spouse,Other-service,Wife,Black,Female,0,0,50,United-States,0


In [None]:
test_y

0        >=50k
1        >=50k
2         <50k
3        >=50k
4         <50k
         ...  
32556    >=50k
32557     <50k
32558    >=50k
32559     <50k
32560     <50k
Name: salary, Length: 32561, dtype: object

In [None]:
test_df.shape

(32561, 15)

Then `Learner.get_preds` will give you the predictions:

In [None]:
learn.get_preds(dl=dl)

(tensor([[0.5222, 0.4778],
         [0.4707, 0.5293],
         [0.9554, 0.0446],
         ...,
         [0.6629, 0.3371],
         [0.7109, 0.2891],
         [0.6732, 0.3268]]), None)

In [None]:
row, clas, probs = learn.predict(test_df.iloc[0])
row.show()

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,education-num_na,age,fnlwgt,education-num,salary
0,Private,Assoc-acdm,Married-civ-spouse,#na#,Wife,White,False,49.0,101319.999972,12.0,<50k


In [None]:
test_df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,49,Private,101320,Assoc-acdm,12.0,Married-civ-spouse,,Wife,White,Female,0,1902,40,United-States
1,44,Private,236746,Masters,14.0,Divorced,Exec-managerial,Not-in-family,White,Male,10520,0,45,United-States
2,38,Private,96185,HS-grad,,Divorced,,Unmarried,Black,Female,0,0,32,United-States
3,38,Self-emp-inc,112847,Prof-school,15.0,Married-civ-spouse,Prof-specialty,Husband,Asian-Pac-Islander,Male,0,0,40,United-States
4,42,Self-emp-not-inc,82297,7th-8th,,Married-civ-spouse,Other-service,Wife,Black,Female,0,0,50,United-States


> Note: Since machine learning models can't magically understand categories they were never trained on, the data should reflect this. If your test data has missing values in columns that had none in the training data, you should address this before training.
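
To see why unseen categories are a problem, consider this small sketch of a category map built from training values only. The values and the `encode` helper are hypothetical, and sending unknown values to the `#na#` index 0 is an assumption about the encoding, not a description of `fastai`'s internals.

```python
# Vocabulary built from the training values only.
train_values = ["Private", "Local-gov", "Private", "Self-emp-inc"]
vocab = ["#na#"] + sorted(set(train_values))
cat_to_idx = {cat: idx for idx, cat in enumerate(vocab)}

def encode(value):
    """Return the training index of a category, or 0 (#na#) if it was never seen."""
    return cat_to_idx.get(value, 0)

# "Never-worked" and the missing value were not part of the training vocabulary.
test_values = ["Private", "Never-worked", None]
print([encode(v) for v in test_values])  # [2, 0, 0]
```

Both the unseen category and the missing value collapse onto the same index, so the model cannot tell them apart.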

## `fastai` with Other Libraries

As mentioned earlier, `TabularPandas` is a powerful and easy-to-use preprocessing tool for tabular data. Integration with models such as Random Forests and libraries such as XGBoost requires only one extra step, which the `.dataloaders` call did for us. Let's look at our `to` again. Its values are stored in a `DataFrame`-like object, from which we can extract the `cats`, `conts`, `xs` and `ys` if we want to:

In [None]:
to.xs[:3]

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,education-num_na,age,fnlwgt,education-num
25387,5,16,3,5,1,5,1,0.471582,-1.467756,-0.030907
16872,1,16,5,1,4,5,1,-1.215622,-0.649792,-0.030907
25852,5,16,3,5,1,5,1,1.865358,-0.218915,-0.030907


Now that everything is encoded, you can then send this off to XGBoost or Random Forests by extracting the train and validation sets and their values:

In [None]:
X_train, y_train = to.train.xs, to.train.ys.values.ravel()
X_test, y_test = to.valid.xs, to.valid.ys.values.ravel()

In [None]:
to.xs.shape

(32561, 10)

In [None]:
X_train.shape

(26049, 10)

In [None]:
X_test.shape

(6512, 10)

In [None]:
26049+6512

32561

In [None]:
6512/(26049+6512)

0.1999938576825036

And now we can directly send this in!
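
As one possible way to do that, here is a minimal sketch using scikit-learn's `RandomForestClassifier`. To keep it self-contained, it trains on synthetic arrays from `make_classification`. In the notebook you would pass the `X_train`, `y_train`, `X_test` and `y_test` extracted from `to` instead. The hyperparameters and the `rf_`-prefixed names are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded data (10 features, binary target).
# In the notebook these arrays would come from to.train and to.valid.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit a random forest on the training set and score it on the held-out set.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
rf_preds = model.predict(X_test)
rf_accuracy = accuracy_score(y_test, rf_preds)
print(f"validation accuracy: {rf_accuracy:.3f}")
```

Because the categorical columns are already integer-encoded and the missing values are filled, no further preprocessing is needed before calling `fit`.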