<img src="https://jaipresentation.blob.core.windows.net/comm/jai_avatar.png" width="100" align="right"/>

# JAI - Trust your data

## Sanity: checking if your data is valid
This is an example of how to use the sanity check capabilities of JAI.

In this notebook we will use a subset of the [PC Games 2020](https://www.kaggle.com/jesneuman/pc-games) dataset to check if the `Name`, `Genres` and `Indie` information are valid.

You can install JAI in your environment using `pip install jai-sdk`.

And you can read the docs [here](https://jai-sdk.readthedocs.io/en/stable/)!

If you have any comments or suggestions, feel free to contact us: support@getjai.com

*The object of life is not to be on the side of the majority, but to escape finding oneself in the ranks of the insane.* - Marcus Aurelius

In [30]:
# JAI imports
from jai import Jai
from jai.processing import process_predict

# I/O imports
import pandas as pd

## Reading data

In [2]:
# it might take a few seconds to download this dataset (10MB) to your computer
DATASET_URL = "https://jaipresentation.blob.core.windows.net/data/games_jai.parquet"
df_games = pd.read_parquet(DATASET_URL)

## Use columns `id`, `Name`, `Genres`, `Platform` and `Description`

Why? As a rule of thumb, we should send the data that we humans would normally use to check the validity of each value. In this sense, `Name`, `Genres` and `Indie` form a fair subset of columns that are prone to errors when information is filled in. The run below uses `Platform` in place of `Indie`; the original selection is kept as a commented-out line.

In [31]:
# Original selection, with `Indie` instead of `Platform`:
# cols = ["id", "Name", "Genres", "Indie", "Description"]
cols = ["id", "Name", "Genres", "Platform", "Description"]
df_sanity = df_games[cols]
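
Before sending the selected columns to JAI, it can help to confirm that they exist and to see how many values are missing. The snippet below is a minimal sketch on a hypothetical two-row stand-in for `df_games`. The name `df_example` and its values are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

# Hypothetical two-row stand-in for df_games; values are illustrative only.
df_example = pd.DataFrame({
    "id": [4, 8],
    "Name": ["The Elder Scrolls Online", "Rocket League"],
    "Genres": ["Massively Multiplayer, RPG", "Action, Indie, Racing, Sports"],
    "Platform": ["PC", None],  # missing value added on purpose
    "Description": ["Includes the base game...", "Highly competitive soccer game..."],
})

cols = ["id", "Name", "Genres", "Platform", "Description"]

# Fail early if a requested column is absent from the dataframe.
missing_cols = [c for c in cols if c not in df_example.columns]
if missing_cols:
    raise KeyError(f"Columns not found: {missing_cols}")

# Count missing values per selected column.
null_counts = df_example[cols].isna().sum()
print(null_counts)
```

Running the same two checks on `df_games` with the real `cols` list should show whether any of the chosen columns need cleaning first.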

## Now we can use JAI to check the sanity of this data!

In [32]:
j = Jai("f0e4c4a6d35d4eb1a871093468fbf679")

### We call `sanity` with a `name` for the database, the `data` itself and the fraction (`frac`) of the data to be shuffled and appended to the dataframe as invalid values.

In [34]:
db_name = "games_sanity6"
frac = 0.3
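
To make the role of `frac` concrete, here is a minimal pandas sketch of the general idea: copy a fraction of the rows, shuffle one column within that copy, and append the result labelled as invalid. This is an illustration under stated assumptions, not JAI's actual implementation. The helper `append_shuffled_rows`, the `is_valid` column and the toy data are all hypothetical.

```python
import numpy as np
import pandas as pd


def append_shuffled_rows(df, column, frac, seed=0):
    """Hypothetical helper: copy a fraction of rows, permute one column, label them."""
    rng = np.random.default_rng(seed)
    # Draw `frac` of the rows to act as synthetic invalid examples.
    corrupted = df.sample(frac=frac, random_state=seed).copy()
    # Permute the chosen column so values no longer match their rows.
    # A permutation can leave a value in place, so a few rows may stay consistent.
    corrupted[column] = rng.permutation(corrupted[column].to_numpy())
    corrupted["is_valid"] = False
    return pd.concat([df.assign(is_valid=True), corrupted], ignore_index=True)


# Toy data, made up for this sketch.
toy = pd.DataFrame({
    "id": range(10),
    "Genres": [f"genre_{i}" for i in range(10)],
})
augmented = append_shuffled_rows(toy, column="Genres", frac=0.3)
print(augmented["is_valid"].value_counts())
```

With 10 rows and `frac=0.3`, three shuffled rows are appended, so a model trained on the augmented frame sees both original and corrupted combinations.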

In [35]:
# Keys must match the column names in `cols` (note `Genres`, not `Genre`)
db_types = {"Name": "FastText", "Genres": "FastText", "Description": "Text", "Platform": "FastText"}
ref = ["Genres"]

In [36]:
results = j.sanity(
    name=db_name,
    data=df_sanity,
    frac=frac,
    columns_ref=ref,
    db_type=db_types,
)

Insert Data: 100%|██████████| 1/1 [00:01<00:00,  1.86s/it]
JAI is working: 100%|██████████| 12/12 [01:05<00:00,  5.48s/it]
Insert Data: 100%|██████████| 1/1 [00:01<00:00,  1.74s/it]
Recognized setup args:
hyperparams: {'nt': 1000.0}
JAI is working: 100%|██████████| 11/11 [00:22<00:00,  2.01s/it]
Insert Data: 100%|██████████| 1/1 [00:01<00:00,  1.90s/it]
JAI is working: 100%|██████████| 12/12 [01:06<00:00,  5.51s/it]
Insert Data: 100%|██████████| 1/1 [00:14<00:00, 14.44s/it]
JAI is working: 100%|██████████| 9/9 [07:33<00:00, 50.41s/it]
Insert Data: 100%|██████████| 1/1 [00:02<00:00,  2.18s/it]
Recognized setup args:
mycelia_bases: [{'id_name': 'id_Name', 'db_parent': 'games_sanity6_name'}, {'id_name': 'id_Name', 'db_parent': 'games_sanity6_name'}, {'id_name': 'id_Genres', 'db_parent': 'games_sanity6_genres'}, {'id_name': 'id_Genres', 'db_parent': 'games_sanity6_genres'}, {'id_name': 'id_Platform', 'db_parent': 'games_sanity6_platform'}, {'id_name': 'id_Platform', 'db_parent': 'games_san

### Finally, we process the results...

In [37]:
processed = process_predict(results)
df_result = pd.DataFrame(processed).sort_values('id')

Predict all ids: 100%|██████████| 11570/11570 [00:00<00:00, 700180.32it/s]


In [38]:
df_result

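|       | id    | predict | probability(%) |
|-------|-------|---------|----------------|
| 0     | 1     | Invalid | 50.98          |
| 1     | 2     | Invalid | 51.31          |
| 2     | 4     | Invalid | 51.51          |
| 3     | 5     | Valid   | 50.71          |
| 4     | 6     | Invalid | 51.23          |
| ...   | ...   | ...     | ...            |
| 11565 | 30209 | Invalid | 50.57          |
| 11566 | 30225 | Invalid | 51.61          |
| 11567 | 30230 | Invalid | 51.33          |
| 11568 | 30239 | Invalid | 51.88          |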


### ...and check how many entries were valid

In [39]:
df_result["predict"].value_counts()

Invalid    7880
Valid      3690
Name: predict, dtype: int64
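
The raw counts are easier to interpret as percentages. The sketch below rebuilds the counts printed above by hand and converts them to shares. In the notebook itself, `df_result["predict"].value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({"Invalid": 7880, "Valid": 3690}, name="predict")

# Convert to percentages of the 11,570 predicted rows.
shares = (counts / counts.sum() * 100).round(1)
print(shares)
```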

This output indicates that roughly 68% of the samples (7,880 out of 11,570) were flagged as having some kind of inconsistency. We can check which ones were flagged as Invalid:

In [29]:
df_games.loc[df_result["predict"] == "Invalid"]

|       | id    | Name | Genres | Indie | Platform | Players | Description |
|-------|-------|------|--------|-------|----------|---------|-------------|
| 2     | 4     | The Elder Scrolls Online | Massively Multiplayer, RPG | 0.0 | PC | singleplayer, multiplayer, coop | Includes The Elder Scrolls Online base game an... |
| 6     | 8     | Rocket League | Action, Indie, Racing, Sports | 1.0 | Linux, macOS, PC, PlayStation 4, Xbox One, Nin... | singleplayer, multiplayer, coop, online coop | Highly competitive soccer game with rocket-car... |
| 8     | 10    | Rust | Action, Adventure, Indie, Massively Multiplaye... | 1.0 | Xbox One, macOS, PC, Linux | multiplayer, pvp, coop, online coop | To survive is the main task in the game Rust. ... |
| 9     | 11    | Grand Theft Auto V | Action, Adventure | 0.0 | PlayStation 5, PC, PlayStation 4, PlayStation ... | singleplayer, multiplayer, coop | Rockstar Games went bigger, since their previo... |
| 15    | 17    | Black Desert Online | Action, Adventure, Massively Multiplayer, RPG,... | 0.0 | PC, PlayStation 4, Xbox One | multiplayer, pvp, coop | Players will enjoy jaw-dropping graphics, intu... |
| ...   | ...   | ... | ... | ... | ... | ... | ... |
| 11563 | 30204 | Night Fly | Adventure, Indie | 1.0 | PC | singleplayer | Night Fly - 2D game that will give you a chanc... |
| 11564 | 30206 | Curse in our heads | Indie | 1.0 | PC | singleplayer | Exciting indie game based on the legendary fil... |
| 11565 | 30209 | Rocks and Rockets | Adventure, Indie | 1.0 | PC, PS Vita, PSP, PlayStation 3, Nintendo DS | singleplayer | Rocks and Rockets - a terrific action game whe... |
| 11566 | 30225 | Rainbow Hunter | Adventure, Casual, Indie | 1.0 | PC | singleplayer | Willy is a little boy who loves songs, drawing... |
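
The selection above works only when `df_games` and `df_result` share the same index labels, because a boolean mask is aligned on the index. A more defensive option is to match on the `id` column instead. The sketch below shows this on toy frames whose names and values are hypothetical. The index of `games` is deliberately different from that of `result`.

```python
import pandas as pd

# Toy stand-ins for df_games and df_result; names and values are made up.
games = pd.DataFrame(
    {"id": [4, 5, 6], "Name": ["game A", "game B", "game C"]},
    index=[10, 11, 12],  # index labels that do not line up with `result`
)
result = pd.DataFrame({
    "id": [6, 4, 5],
    "predict": ["Invalid", "Invalid", "Valid"],
})

# Collect the ids predicted as Invalid, then filter the games by id.
invalid_ids = result.loc[result["predict"] == "Invalid", "id"]
invalid_games = games[games["id"].isin(invalid_ids)]
print(invalid_games)
```

Matching on `id` keeps the selection correct even if either dataframe is re-sorted or re-indexed.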


Some entries, such as IDs 21 and 29830, are indeed invalid: `Genres` states that the game is Indie, but the `Indie` column says it is not. Other entries seem to be misclassified and should undergo a quick human validation; the probabilities in the rows shown above all sit close to 50%, so these predictions carry little confidence.
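
The `Genres` versus `Indie` inconsistency described above can also be checked with a simple rule, which gives a useful baseline for the human validation step. In this minimal sketch, the first two rows are copied from the output shown earlier, and the third row (id 9001) is a hypothetical inconsistent entry added for illustration.

```python
import pandas as pd

demo = pd.DataFrame({
    "id": [4, 8, 9001],  # 9001 is a made-up id for the inconsistent example
    "Genres": [
        "Massively Multiplayer, RPG",
        "Action, Indie, Racing, Sports",
        "Adventure, Indie",
    ],
    "Indie": [0.0, 1.0, 0.0],
})

# True when the comma-separated genre list contains the exact tag "Indie".
genre_says_indie = demo["Genres"].str.split(", ").apply(lambda tags: "Indie" in tags)
# True when the Indie flag column marks the game as indie.
flag_says_indie = demo["Indie"] == 1.0

# Rows where the two sources of information disagree.
mismatch = demo[genre_says_indie != flag_says_indie]
print(mismatch)
```

A rule like this only covers one known kind of error, so it complements the learned sanity check rather than replacing it.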