<img src="https://jaipresentation.blob.core.windows.net/comm/jai_avatar.png" width="100" align="right"/>

# JAI - Trust your data

## Fill: leverage JAI to smart-fill your missing data
This notebook shows how to use JAI to fill missing values.

We will use a subset of the [PC Games 2020](https://www.kaggle.com/jesneuman/pc-games) dataset: we mask some of the values that say whether a game is Indie, then fill them back in using JAI.

You can install JAI in your environment using `pip install jai-sdk`.

And you can read the docs [here](https://jai-sdk.readthedocs.io/en/stable/)!

If you have any comments or suggestions, feel free to contact us: support@getjai.com

*Drop by drop is the water pot filled. Likewise, the wise man, gathering it little by little, fills himself with good.* - Buddha

In [26]:
# JAI imports
from jai import Jai
from jai.utilities import predict2df

# I/O and data manipulation imports
import pandas as pd
import numpy as np

## Reading data

In [27]:
# it might take a few seconds to download this dataset (10MB) to your computer
DATASET_URL = "https://jaipresentation.blob.core.windows.net/data/games_jai.parquet"
df_games = pd.read_parquet(DATASET_URL).astype({"Indie": "object"})

### Let's check how many NaN values there are in each column

In [28]:
df_games.isna().sum()

id             0
Name           0
Genres         0
Indie          0
Platform       0
Players        0
Description    0
dtype: int64

### And let's also check how many unique values are in each column

In [29]:
df_games.nunique()

id             11570
Name           10927
Genres           739
Indie              2
Platform        1503
Players           29
Description    10921
dtype: int64

### And the number of rows as well

In [30]:
df_games.shape[0]

11570

Columns like 'Genres' (739 unique values) and 'Players' (29 unique values) have many distinct values to predict, whereas 'Indie' has only two. So we will use the 'Indie' column instead.
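To make this comparison concrete, one option is to look at the ratio of unique values to rows for each column. The sketch below uses a small made-up frame (`df_toy` and its values are illustrative assumptions, not the games data); the same expression could be applied to `df_games`.

```python
import pandas as pd

# Toy frame standing in for df_games (values are made up for illustration)
df_toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "Genres": ["RPG", "RPG, Action", "Puzzle", "Action", "RPG", "Racing"],
    "Indie": [0, 1, 1, 0, 0, 1],
})

# Fraction of distinct values per column: a ratio near 1.0 means
# almost every row carries its own value
cardinality_ratio = df_toy.nunique() / len(df_toy)
print(cardinality_ratio.sort_values())
```

A column with a low ratio and few classes, like 'Indie' here, is usually the easier target for a fill.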

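In the following cells, we are going to randomly select roughly 15% of rows and set their 'Indie' value to NaN. The row positions are drawn with replacement and then de-duplicated, so the final share is slightly below 15%.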

After that, we will use JAI's `fill` method to actually fill these values we deliberately masked.

## Create a random mask using roughly 15% of rows

In [31]:
mask = np.unique(np.random.randint(low=0, high=df_games.shape[0], size=int(df_games.shape[0] * 0.15)))
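Because `np.random.randint` draws with replacement, the `np.unique` call drops repeated positions and the mask ends up smaller than 15% (1609 of 11570 rows, about 13.9%, in the run shown in this notebook). If an exact and reproducible 15% is preferred, a minimal sketch is to sample without replacement from a seeded generator. The names `mask_exact` and `rng` and the seed value are assumptions for illustration; using this mask would change the counts shown in the outputs of this notebook.

```python
import numpy as np

n_rows = 11570        # number of rows reported by df_games.shape[0]
mask_fraction = 0.15

# Seeded generator so the same rows are masked on every run (seed is arbitrary)
rng = np.random.default_rng(seed=42)

# replace=False guarantees distinct row positions, so no np.unique pass is needed
mask_exact = np.sort(
    rng.choice(n_rows, size=int(n_rows * mask_fraction), replace=False)
)

print(len(mask_exact))
```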

## Create a copy of the dataframe and set the 'Indie' column to NaN at the masked rows

In [32]:
column_to_fill = "Indie"
df_masked = df_games.copy()
df_masked.loc[mask, column_to_fill] = np.nan

In [33]:
# make sure we masked some values in the Indie column
df_masked.isna().sum()

id                0
Name              0
Genres            0
Indie          1609
Platform          0
Players           0
Description       0
dtype: int64

## Now we can use JAI to fill these missing values!

In [34]:
j = Jai()

### We call `fill`, passing a `name` for the database, the `data` itself and the `column` whose NaN values we want filled.

### There is a 'gotcha', though...

As a rule of thumb, we should send the data that we humans would normally use to fill those values ourselves. In this sense, the columns `Name`, `Genres` and `Indie` should suffice to learn whether a game with a missing value is Indie or not. Other columns like `Players` or `Description` do not provide much relevant information and would probably get in the way of JAI's learning.

In [35]:
# set which columns to use
cols_to_use = ["id", "Name", "Genres", "Indie"]

In [36]:
db_name = "games_fill"
results = j.fill(name=db_name,
                 data=df_masked[cols_to_use],
                 column=column_to_fill,
                 db_type="FastText",
                 hyperparams={"learning_rate": 0.0001})

Insert Data: 100%|██████████| 1/1 [00:01<00:00,  1.57s/it]
JAI is working: 100%|██████████| 12/12 [02:19<00:00, 11.66s/it]
Insert Data: 100%|██████████| 1/1 [00:01<00:00,  1.07s/it]
JAI is working: 100%|██████████| 5/5 [00:21<00:00,  4.30s/it]
Insert Data: 100%|██████████| 1/1 [00:01<00:00,  1.26s/it]
JAI is working: 100%|██████████| 12/12 [02:03<00:00, 10.27s/it]
Insert Data: 100%|██████████| 1/1 [00:01<00:00,  1.12s/it]
JAI is working: 100%|██████████| 5/5 [00:32<00:00,  6.46s/it]
Insert Data: 100%|██████████| 1/1 [00:01<00:00,  1.61s/it]
Recognized setup args:
hyperparams: {'learning_rate': 0.0001}
mycelia_bases: [{'id_name': 'id_Name', 'db_parent': 'games_fill_name'}, {'id_name': 'id_Genres', 'db_parent': 'games_fill_genres'}]
label: {'task': 'metric_classification', 'label_name': 'Indie'}
split: {'type': 'stratified', 'split_column': 'Indie', 'test_size': 0.2}
JAI is working: 100%|██████████| 18/18 [01:37<00:00,  5.43s/it]
Insert Data: 100%|██████████| 1/1 [00:01<00:00,  1.63s/it]

### Finally, we process the results...

In [37]:
processed = predict2df(results)
df_result = pd.DataFrame(processed).sort_values('id')
df_result

Predict all ids: 100%|██████████| 1609/1609 [00:00<00:00, 458311.38it/s]


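| | id | predict | probability(%) |
|---|---|---|---|
| 0 | 7 | 0.0 | 64.92 |
| 1 | 25 | 0.0 | 70.16 |
| 2 | 26 | 0.0 | 60.96 |
| 3 | 30 | 1.0 | 56.13 |
| 4 | 35 | 0.0 | 61.86 |
| ... | ... | ... | ... |
| 1604 | 30095 | 0.0 | 52.17 |
| 1605 | 30107 | 1.0 | 70.80 |
| 1606 | 30166 | 1.0 | 69.05 |
| 1607 | 30195 | 0.0 | 53.03 |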


### ... and check the accuracy of the fill

In [38]:
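# predictions, already sorted by id
predicted = df_result["predict"]
# original labels of the masked rows, sorted by index
ground_truth = df_games.loc[mask].drop_duplicates().sort_index()[column_to_fill]
# fraction of predictions that match the original labels (compared position by position)
np.equal(predicted.to_numpy(), ground_truth.astype(str).to_numpy()).sum() / predicted.shape[0]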

0.8433809819763829
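The comparison above is positional: it relies on the predictions sorted by `id` lining up with the ground truth sorted by index. A more defensive variant joins the two on `id` before comparing. The sketch below uses small made-up frames (`df_pred`, `df_truth` and their values are assumptions for illustration, and the dtype of `predict` in the real output may differ).

```python
import pandas as pd

# Toy stand-ins (made-up values): predictions keyed by id, and the original labels
df_pred = pd.DataFrame({"id": [30, 7, 25, 26], "predict": ["1", "0", "0", "0"]})
df_truth = pd.DataFrame({"id": [7, 25, 26, 30], "Indie": [0, 1, 0, 1]})

# Join on id so each prediction is compared with the label of the same game;
# validate raises if an id appears more than once on either side
merged = df_pred.merge(df_truth, on="id", how="inner", validate="one_to_one")

# Cast both sides to the same type before comparing
accuracy = (merged["predict"].astype(float) == merged["Indie"].astype(float)).mean()
print(accuracy)
```

Joining on a key keeps the metric correct even if the two frames arrive in different orders.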

The `fill` method correctly predicted the values for about 84% of the samples! Let's plug these results back into our original dataframe.

In [39]:
df_filled = df_masked.copy()
df_filled.loc[mask, "Indie"] = df_result["predict"].tolist()

In [41]:
df_filled.isna().sum()

id             0
Name           0
Genres         0
Indie          0
Platform       0
Players        0
Description    0
dtype: int64
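The assignment above also depends on `mask` and `df_result` being in the same order. As an alternative, a minimal sketch is to map predictions onto rows by `id` and fill only the entries that are still missing. The frames `df_masked_toy` and `df_result_toy` and their values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (made-up values) for df_masked and df_result
df_masked_toy = pd.DataFrame({"id": [7, 8, 25, 30], "Indie": [np.nan, 1, np.nan, 0]})
df_result_toy = pd.DataFrame({"id": [25, 7], "predict": [1.0, 0.0]})

# Look-up table from id to predicted value
id_to_pred = df_result_toy.set_index("id")["predict"]

# Fill only the rows that are still missing, matching on id rather than on position
df_filled_toy = df_masked_toy.copy()
missing = df_filled_toy["Indie"].isna()
df_filled_toy.loc[missing, "Indie"] = df_filled_toy.loc[missing, "id"].map(id_to_pred)

print(df_filled_toy)
```

Rows that already had a value are left untouched, and the order of the prediction frame no longer matters.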