<img src="https://jaipresentation.blob.core.windows.net/comm/jai_avatar.png" width="100" align="right"/>

# JAI - Trust your data

## Sanity: checking if your data is valid
This is an example of how to use the sanity check capabilities of JAI.

In this notebook we will use a subset of the [PC Games 2020](https://www.kaggle.com/jesneuman/pc-games) dataset to check whether the values in the `Name`, `Genres` and `Indie` columns are valid.

You can install JAI in your environment using `pip install jai-sdk`.

And you can read the docs [here](https://jai-sdk.readthedocs.io/en/stable/)!

If you have any comments or suggestions, feel free to contact us: jedis@jquant.com.br

*The object of life is not to be on the side of the majority, but to escape finding oneself in the ranks of the insane.* - Marcus Aurelius

In [1]:
# JAI imports
from jai import Jai
from jai.processing import process_predict

# I/O imports
import pandas as pd

## Reading data

In [2]:
# it might take a few seconds to download this dataset (10MB) to your computer
DATASET_URL = "https://jaipresentation.blob.core.windows.net/data/games_jai.parquet"
df_games = pd.read_parquet(DATASET_URL)

## Use columns `id`, `Name`, `Genres` and `Indie`

Why? As a rule of thumb, we should send the data that we humans would normally use to check the validity of each value. In this sense, the `Name`, `Genres` and `Indie` columns form a fair subset, since they are prone to errors when information is filled in.

In [3]:
df_sanity = df_games[["id", "Name", "Genres", "Indie"]]
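
Before sending the data, it can help to run a few cheap checks on the selected columns. The sketch below is only an illustration: `preflight` is a hypothetical helper (not part of JAI), and the toy dataframe stands in for `df_sanity` with made-up values.

```python
import pandas as pd


def preflight(df: pd.DataFrame, id_col: str, columns: list) -> dict:
    """Summarize basic problems that are cheap to catch before a sanity run."""
    missing = [c for c in [id_col, *columns] if c not in df.columns]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    return {
        "rows": len(df),
        # ids that appear more than once (each repeat counts once)
        "duplicated_ids": int(df[id_col].duplicated().sum()),
        # number of null values in each column we intend to check
        "nulls_per_column": df[columns].isna().sum().to_dict(),
    }


# Toy stand-in for df_sanity (illustrative values, not taken from the dataset)
toy = pd.DataFrame({
    "id": [1, 2, 2, 4],
    "Name": ["A", "B", "B", None],
    "Genres": ["RPG", "Action", "Action", "Indie"],
    "Indie": [0.0, 0.0, 0.0, 1.0],
})
report = preflight(toy, "id", ["Name", "Genres", "Indie"])
print(report)
```

On the real data the call would be `preflight(df_sanity, "id", ["Name", "Genres", "Indie"])`.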

## Now we can use JAI to check the sanity of this data!

In [4]:
j = Jai("YOUR_AUTH_KEY")

### We call `sanity`, passing a `name` for the database, the `data` itself, and the fraction (`frac`) of the data to be shuffled and appended to the dataframe as invalid examples.

In [9]:
db_name = "games_sanity2"
frac = 0.25
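
To make the role of `frac` more concrete, here is a minimal sketch of the idea described in the heading above: copy a fraction of the rows, overwrite one column with values drawn from other rows, and label the copies as invalid. This is not JAI's internal implementation. The helper `add_shuffled_invalids`, the 1/0 encoding of `is_valid` and the toy data are all assumptions made for illustration.

```python
import pandas as pd


def add_shuffled_invalids(df: pd.DataFrame, column: str, frac: float, seed: int = 42) -> pd.DataFrame:
    """Append a corrupted copy of `frac` of the rows, labeled as invalid."""
    valid = df.copy()
    valid["is_valid"] = 1  # original rows are assumed to be valid

    # Pick the rows to corrupt, then overwrite `column` with values from other rows
    corrupted = df.sample(frac=frac, random_state=seed).copy()
    donors = df[column].sample(n=len(corrupted), random_state=seed + 1)
    corrupted[column] = donors.to_numpy()
    corrupted["is_valid"] = 0  # synthetic examples of inconsistent rows

    return pd.concat([valid, corrupted], ignore_index=True)


# Toy data: 8 rows, so frac=0.25 appends 2 corrupted rows
toy = pd.DataFrame({
    "id": range(1, 9),
    "Genres": ["RPG", "Action, Indie", "Strategy", "Indie", "RPG", "Casual", "Action", "Indie"],
    "Indie": [0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0],
})
augmented = add_shuffled_invalids(toy, column="Indie", frac=0.25)
print(augmented["is_valid"].value_counts())
```

Note that a donor value can happen to equal the original one, so some of the synthetic "invalid" rows may in fact be consistent. This is one reason why predictions from this kind of model benefit from a human review.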

In [10]:
results = j.sanity(name=db_name,
                   data=df_sanity,
                   frac=frac,
                   columns_ref=["Indie"])

Insert Data: 100%|██████████| 1/1 [00:01<00:00,  1.44s/it]
Recognized setup args:
hyperparams: {'nt': 1000.0}
JAI is working: 100%|██████████| 11/11 [05:50<00:00, 31.87s/it]
Insert Data: 100%|██████████| 1/1 [00:01<00:00,  1.37s/it]
Recognized setup args:
hyperparams: {'nt': 1000.0}
JAI is working: 100%|██████████| 11/11 [00:21<00:00,  1.95s/it]
Insert Data: 100%|██████████| 1/1 [00:01<00:00,  1.74s/it]
Recognized setup args:
mycelia_bases: [{'id_name': 'id_Name', 'db_parent': 'games_sanity2_name'}, {'id_name': 'id_Name', 'db_parent': 'games_sanity2_name'}, {'id_name': 'id_Genres', 'db_parent': 'games_sanity2_genres'}, {'id_name': 'id_Genres', 'db_parent': 'games_sanity2_genres'}]
label: {'task': 'metric_classification', 'label_name': 'is_valid'}
split: {'type': 'stratified', 'split_column': 'is_valid', 'test_size': 0.2}
JAI is working: 100%|██████████| 18/18 [01:36<00:00,  5.38s/it]
Predict: 100%|██████████| 1/1 [00:06<00:00,  6.99s/it]


### Finally, we process the results...

In [11]:
processed = process_predict(results)
df_result = pd.DataFrame(processed).sort_values('id')

Predict all ids: 100%|██████████| 11570/11570 [00:00<00:00, 282289.45it/s]
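
For intuition, the processing step can be pictured as picking the most likely class for each id and reporting its probability as a percentage. The sketch below is a stand-in written for this notebook, not the source of `process_predict`. It assumes each raw prediction is a dict with an `id` and a `predict` mapping from class name to probability, which may not match what `j.sanity` actually returns. The helper `summarize_predictions` and the sample values are hypothetical.

```python
import pandas as pd


def summarize_predictions(raw: list, digits: int = 2) -> list:
    """Keep the most likely class per id, with its probability in percent."""
    rows = []
    for item in raw:
        probs = item["predict"]            # assumed shape: {"Valid": p, "Invalid": 1 - p}
        label = max(probs, key=probs.get)  # class with the highest probability
        rows.append({
            "id": item["id"],
            "predict": label,
            "probability(%)": round(probs[label] * 100, digits),
        })
    return rows


# Hypothetical raw predictions in the assumed shape
raw = [
    {"id": 2, "predict": {"Valid": 0.4383, "Invalid": 0.5617}},
    {"id": 1, "predict": {"Valid": 0.8068, "Invalid": 0.1932}},
]
table = pd.DataFrame(summarize_predictions(raw)).sort_values("id")
print(table)
```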


In [12]:
df_result

Unnamed: 0,id,predict,probability(%)
0,1,Valid,80.68
1,2,Invalid,56.17
2,4,Invalid,90.73
3,5,Valid,85.29
4,6,Valid,91.78
...,...,...,...
11565,30209,Valid,89.10
11566,30225,Valid,87.92
11567,30230,Valid,90.72
11568,30239,Valid,87.13


### ...and check how many entries were valid

In [13]:
df_result["predict"].value_counts()

Valid      11066
Invalid      504
Name: predict, dtype: int64
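
The counts can also be expressed as percentages. This small sketch rebuilds the counts shown above as a Series so that it runs on its own. On the real data, `df_result["predict"].value_counts(normalize=True)` gives the same shares as fractions.

```python
import pandas as pd

# Counts copied from the output above
counts = pd.Series({"Valid": 11066, "Invalid": 504}, name="predict")

# Share of each class, in percent
share = (counts / counts.sum() * 100).round(2)
print(share)  # Invalid comes out at 4.36, Valid at 95.64
```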

This output indicates that a little over 4% of the samples (504 out of 11,570) have some kind of inconsistency. We can check which ones were flagged as Invalid:

In [15]:
df_games.loc[df_result["predict"] == "Invalid"]

Unnamed: 0,id,Name,Genres,Indie,Platform,Players,Description
1,2,Destiny 2,"Action, Adventure, Free to Play",0.0,"PlayStation 5, Web, Xbox Series X, PC, Xbox On...","multiplayer, coop, online coop",Destiny 2 is an online multiplayer first-perso...
2,4,The Elder Scrolls Online,"Massively Multiplayer, RPG",0.0,PC,"singleplayer, multiplayer, coop",Includes The Elder Scrolls Online base game an...
15,17,Black Desert Online,"Action, Adventure, Massively Multiplayer, RPG,...",0.0,"PC, PlayStation 4, Xbox One","multiplayer, pvp, coop","Players will enjoy jaw-dropping graphics, intu..."
18,21,Deep Rock Galactic,"Action, Indie",0.0,PC,"singleplayer, multiplayer, coop, online coop",DEEP ROCK GALACTIC is a first-person co-operat...
41,48,Disco Elysium,RPG,1.0,PC,singleplayer,Disco Elysium is a groundbreaking blend of har...
...,...,...,...,...,...,...,...
11383,29170,Master of Mutations,Strategy,1.0,"macOS, PC",singleplayer,Welcome to the laboratory\r\nThe game mechanic...
11388,29197,Escape: The Room,Adventure,1.0,iOS,singleplayer,"This is an ""escape the room"" type of game in w..."
11472,29618,Forest Plague,"Action, Simulation",1.0,"macOS, PC",singleplayer,This forest has one key difference from the hu...
11500,29786,Rupert and Riley Shipwrecked,"Action, Adventure, Casual",1.0,PC,singleplayer,Rupert and Riley Shipwrecked!\n\r\nIn this sid...
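
The lookup above applies a boolean mask built from `df_result` to `df_games`, so it relies on both dataframes sharing the same index labels. A more defensive variant matches rows on the `id` column instead. The sketch below uses toy frames with deliberately different indexes. The names for ids 1 and 5 and the probabilities attached to each row are placeholders.

```python
import pandas as pd

# Toy stand-ins: note that the two frames do not share index labels
games = pd.DataFrame(
    {"id": [1, 2, 4, 5],
     "Name": ["Game A", "Destiny 2", "The Elder Scrolls Online", "Game B"]},
    index=[10, 11, 12, 13],
)
result = pd.DataFrame({
    "id": [1, 2, 4, 5],
    "predict": ["Valid", "Invalid", "Invalid", "Valid"],
    "probability(%)": [80.68, 56.17, 90.73, 85.29],
})

# Select by id rather than by position or index label
invalid_ids = result.loc[result["predict"] == "Invalid", "id"]
flagged = games[games["id"].isin(invalid_ids)]

# Or merge, to keep the prediction and its probability next to each game
merged = games.merge(result, on="id", how="inner")
flagged_with_prob = merged[merged["predict"] == "Invalid"]
print(flagged_with_prob)
```

With the notebook's variables, the first form would read `df_games[df_games["id"].isin(invalid_ids)]`, with `invalid_ids` taken from `df_result`.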


Some flagged entries, such as IDs 21 and 29830, are indeed invalid: their `Genres` value states that the game is Indie, but the `Indie` column says it is not. Other entries seem to be misclassified and should undergo a quick human validation.
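
To speed up that human validation, the specific contradiction described above can be checked with a simple rule: `Genres` lists Indie while the `Indie` flag is 0. The sketch below is a complement to the model's predictions, not a replacement for them. It uses three rows copied from the output above, and the column name `contradiction` is our own choice.

```python
import pandas as pd

# Three flagged rows copied from the output above
rows = pd.DataFrame({
    "id": [21, 48, 29618],
    "Name": ["Deep Rock Galactic", "Disco Elysium", "Forest Plague"],
    "Genres": ["Action, Indie", "RPG", "Action, Simulation"],
    "Indie": [0.0, 1.0, 1.0],
})

# Does the comma-separated genre list contain the exact genre "Indie"?
genre_says_indie = rows["Genres"].str.split(", ").apply(lambda genres: "Indie" in genres)
flag_says_indie = rows["Indie"] == 1.0

# Only one direction is treated as a contradiction: genre says Indie, flag says not
rows["contradiction"] = genre_says_indie & ~flag_says_indie
print(rows.loc[rows["contradiction"], ["id", "Name", "Genres", "Indie"]])
```

Rows that the model flags but that pass this rule, such as ID 48 here, are the ones worth a closer manual look.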