<img src="https://jaipresentation.blob.core.windows.net/comm/jai_avatar.png" width="100" align="right"/>

# JAI - Trust your data

## Match: find samples that refer to the same thing
This is an example of how to use the match capabilities of JAI.

In this notebook we will use a subset of the [PC Games 2020](https://www.kaggle.com/jesneuman/pc-games) dataset. We will derive two series from the `Platform` column and match them against each other.

You can install JAI in your environment using `pip install jai-sdk`.

And you can read the docs [here](https://jai-sdk.readthedocs.io/en/stable/)!

If you have any comments or suggestions, feel free to contact us: support@getjai.com

*The goal of life is to make your heartbeat match the beat of the universe, to match your nature with Nature.* - Joseph Campbell

In [4]:
# JAI imports
from jai import Jai

# I/O import
import pandas as pd
import numpy as np


## Reading data

In [5]:
# it might take a few seconds to download this dataset (10MB) to your computer
DATASET_URL = "https://jaipresentation.blob.core.windows.net/data/games_jai.parquet"
df_games = pd.read_parquet(DATASET_URL)

In [6]:
# checking values in the Platform column
df_games["Platform"].value_counts()

PC                                                             4977
macOS, PC                                                       517
Linux, macOS, PC                                                472
Linux, PC, macOS                                                453
PC, macOS                                                       433
                                                               ... 
PC, macOS, Xbox One, PlayStation 4, PlayStation 3, Xbox 360       1
PC, macOS, Nintendo Switch, Xbox One, PlayStation 4               1
Xbox 360, PlayStation 3                                           1
PC, PlayStation 3, Dreamcast, PlayStation, PS Vita, PSP           1
macOS, iOS, PC, Android, Wii, Nintendo DS                         1
Name: Platform, Length: 1503, dtype: int64

We can see that the `Platform` column has some values that actually refer to the same thing (e.g., "Linux, macOS, PC" and "Linux, PC, macOS"). So we can derive two series from this column and match these occurrences.
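
To make the order-insensitivity concrete, here is a minimal standalone sketch. It runs on a toy Series rather than the real dataset, and `canonical_key` is a hypothetical helper that is not part of JAI or of this notebook's own cells. It sorts the comma-separated tokens so that every permutation maps to the same string.

```python
import pandas as pd

# Toy stand-in for df_games["Platform"]; values are illustrative only
platforms = pd.Series(
    ["Linux, macOS, PC", "Linux, PC, macOS", "PC", "macOS, PC", "PC, macOS"]
)


def canonical_key(value: str) -> str:
    """Hypothetical helper: sort the tokens so permutations collapse to one string."""
    return ", ".join(sorted(value.split(", ")))


canonical = platforms.map(canonical_key)
n_unique = canonical.nunique()

print(canonical.unique())
print(f"{platforms.nunique()} raw values -> {n_unique} canonical values")
```

Python's default string sort is case-sensitive, so which permutation ends up as the representative is arbitrary. What matters is that all permutations share it.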

## Let us get unique representations of each value in the `Platform` column. In other words, "Linux, macOS, PC" and "Linux, PC, macOS" will be reduced to "Linux, macOS, PC". 

In [7]:
# Helper function to remove unwanted characters from
# a particular string
def remove_chars(data):
    chars_to_remove = ["{", "}", "'"]
    for char in chars_to_remove:
        data = data.replace(char, "")
    return data

In [8]:
# The idea is to create a set from each value.
# This way, {"Linux", "PC", "macOS"} and {"macOS", "Linux", "PC"} refer to the same thing
values_set = [set(item.split(", ")) for item in df_games["Platform"].tolist()]
unique_values = []
for item in values_set:
    if item not in unique_values:
        unique_values.append(item)

unique_values_series = [remove_chars(str(item)) for item in unique_values]


## We will create two datasets, `A` and `B`. Dataset `A` will hold the unique values (sets) of each occurrence, whereas dataset `B` will be the actual `Platform` column with all of its permutations.

In [12]:
col = "Platform"
dfA = pd.Series(unique_values_series)
dfB = df_games[col]

## We can use JAI to find matching values between the two series!

In [13]:
j = Jai()

### We call `match`, passing a `name` for the database and both Pandas Series as `data_left` and `data_right`

In [63]:
db_name = "games_match"
results = j.match(name=db_name, data_left=dfA, data_right=dfB, db_type="FastText", top_k=20)

Insert Data: 100%|██████████| 1/1 [00:01<00:00,  1.54s/it]
JAI is working: 100%|██████████| 12/12 [01:04<00:00,  5.36s/it]
Similar: 100%|██████████| 1/1 [00:33<00:00, 33.13s/it]
Fiding threshold: 100%|██████████| 1157/1157 [00:00<00:00, 441325.00it/s]
Process:  39%|███▉      | 4566/11570 [00:00<00:00, 45649.05it/s]
random sample size: 1157
threshold: 5.3241986734065e-06

Process: 100%|██████████| 11570/11570 [00:00<00:00, 45651.63it/s]


## The results list all the matches found between the IDs of both datasets and look like the following

In [64]:
results

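|      | id_left | distance     | id_right |
|------|---------|--------------|----------|
| 0    | 2       | 0.000000e+00 | 2        |
| 1    | 2       | 0.000000e+00 | 4        |
| 2    | 85      | 3.259472e-06 | 5        |
| 3    | 247     | 3.208111e-06 | 6        |
| 4    | 623     | 3.172757e-06 | 11       |
| ...  | ...     | ...          | ...      |
| 7810 | 2       | 0.000000e+00 | 11563    |
| 7811 | 2       | 0.000000e+00 | 11564    |
| 7812 | 2       | 0.000000e+00 | 11566    |
| 7813 | 2       | 0.000000e+00 | 11568    |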


## The most interesting part is to look for IDs that did not match 100% (i.e., their distances to one another are greater than 0)

In [65]:
not_the_same = results.loc[results["distance"] != 0.]
not_the_same

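|      | id_left | distance     | id_right |
|------|---------|--------------|----------|
| 2    | 85      | 3.259472e-06 | 5        |
| 3    | 247     | 3.208111e-06 | 6        |
| 4    | 623     | 3.172757e-06 | 11       |
| 5    | 85      | 3.259472e-06 | 13       |
| 6    | 85      | 3.259485e-06 | 15       |
| ...  | ...     | ...          | ...      |
| 7620 | 538     | 4.486850e-06 | 11349    |
| 7656 | 137     | 1.521367e-15 | 11394    |
| 7664 | 137     | 2.097389e-15 | 11402    |
| 7736 | 247     | 3.208107e-06 | 11480    |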
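
Some of the non-zero distances above are on the order of 1e-15, which is plausibly floating-point noise rather than a real difference. The sketch below shows one way to separate those from the larger distances using a tolerance instead of an exact `!= 0` comparison. The `TOLERANCE` value is an assumption chosen for illustration, and `toy_results` is a small hand-built frame that copies a few rows from the output above so the example runs without a JAI connection.

```python
import numpy as np
import pandas as pd

# Toy frame with the same columns as `results`; rows copied from the output above
toy_results = pd.DataFrame(
    {
        "id_left": [2, 85, 137, 137],
        "distance": [0.0, 3.259472e-06, 1.521367e-15, 2.097389e-15],
        "id_right": [2, 5, 11394, 11402],
    }
)

# Assumed cutoff: distances at or below this are treated as exact matches
TOLERANCE = 1e-12

# Absolute comparison against zero (rtol=0 so only atol applies)
is_exact = np.isclose(toy_results["distance"], 0.0, rtol=0.0, atol=TOLERANCE)
near_matches = toy_results.loc[~is_exact]

print(near_matches)
```

With this cutoff, the rows for ID 137 count as exact matches and only the row for ID 85 remains for inspection. A suitable tolerance for real data would need to be chosen by looking at the distribution of distances.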


## Let's look at the results for IDs 85 and 137

In [69]:
id = 85
print(f"ID {id} on dfA: {dfA.iloc[id]}\n")
print("Similar results found on dfB:")
print(dfB.loc[results.loc[results["id_left"] == id]["id_right"].to_numpy()].unique())

ID 85 on dfA: PlayStation 4, Xbox One

Similar results found on dfB:
['PlayStation 4, PC, Xbox One' 'PC, PlayStation 4, Xbox One'
 'PlayStation 4, Xbox One']


In [72]:
id = 137
print(f"ID {id} on dfA: {dfA.iloc[id]}\n")
print("Similar results found on dfB:")
print(dfB.loc[results.loc[results["id_left"] == id]["id_right"].to_numpy()].unique())

ID 137 on dfA: macOS, Linux, iOS, PC

Similar results found on dfB:
['iOS, macOS, Linux, PC' 'macOS, iOS, Linux, PC' 'iOS, Linux, macOS, PC'
 'Linux, macOS, iOS, PC' 'macOS, Linux, iOS, PC']


## Discussion
We can see that the results for ID 85 were not as consistent as those for ID 137. The values matched to ID 85 in `dfB` were not all permutations of `'PlayStation 4, Xbox One'`; some of them also include `'PC'`. The differences are subtle, which is why JAI considered them similar.

On the other hand, the results for ID 137 were spot on. Its permutations were distinct enough from other entries, so JAI correctly identified them as being the same thing.

This example shows that the `match` application really helps us narrow down similar values on different datasets, and that some results should undergo a quick review to remove inconsistencies that JAI could not solve by itself.
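
For this particular dataset, the quick review can be partly automated, because two values refer to the same thing exactly when they contain the same set of platforms. The sketch below flags matched pairs whose token sets differ. It is a sketch under stated assumptions: `left`, `right` and `matches` are toy stand-ins for `dfA`, `dfB` and `results`, the strings come from the ID 85 and ID 137 examples above, the pairing of `id_right` values to strings is hypothetical, and `token_set` is a hypothetical helper.

```python
import pandas as pd

# Toy stand-ins for dfA and dfB; which id_right holds which string is assumed
left = pd.Series({85: "PlayStation 4, Xbox One", 137: "macOS, Linux, iOS, PC"})
right = pd.Series(
    {
        5: "PlayStation 4, PC, Xbox One",
        13: "PlayStation 4, Xbox One",
        11394: "iOS, macOS, Linux, PC",
    }
)

# Toy stand-in for `results`, keeping only the ID columns
matches = pd.DataFrame({"id_left": [85, 85, 137], "id_right": [5, 13, 11394]})


def token_set(value: str) -> frozenset:
    """Hypothetical helper: order-insensitive representation of a platform string."""
    return frozenset(value.split(", "))


# A pair is consistent when both sides contain exactly the same platforms
matches["same_tokens"] = [
    token_set(left.loc[left_id]) == token_set(right.loc[right_id])
    for left_id, right_id in zip(matches["id_left"], matches["id_right"])
]

needs_review = matches.loc[~matches["same_tokens"]]
print(needs_review)
```

Only the pair where `'PC'` appears on the right-hand side is flagged. A check like this works here because the ground truth is a simple set comparison; on datasets without such a rule, the review would still need a human.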