# Data Engineering in Python with databolt - Identify and analyze join problems (d6tlib/d6tjoin.utils)

## Introduction

Joining datasets is a common data engineering operation. However, merging datasets from different sources often runs into problems because of mismatched identifiers, differing date conventions, and similar inconsistencies.

**The `d6tjoin.utils` module lets you test for join accuracy and quickly identify and analyze join problems.**

Here are some examples which show you how to:
* do join quality analysis prior to attempting a join
* detect and analyze a string-based identifier mismatch
* detect and analyze a date mismatch

## Generate sample data

Let's generate some random representative data:
* identifier (string)
* date (`np.datetime64`)
* values (float)

In [5]:
import pandas as pd
import numpy as np
import uuid
import itertools
import importlib

import d6tjoin.utils
importlib.reload(d6tjoin.utils)

# ******************************************
# generate sample data
# ******************************************
nobs = 10
uuid1 = [str(uuid.uuid4()) for _ in range(nobs)]
dates1 = pd.date_range('1/1/2010','1/1/2011')

df1 = pd.DataFrame(list(itertools.product(uuid1,dates1)),columns=['id','date'])
df1['v']=np.random.sample(df1.shape[0])

In [6]:
df1.groupby(['id']).head(2).head(6)

,id,date,v
0,a531acd5-cdfd-480b-ba7f-5766f85053ac,2010-01-01,0.777782
1,a531acd5-cdfd-480b-ba7f-5766f85053ac,2010-01-02,0.893936
366,40c75a77-dc5f-4ae2-99b6-bd34871ca98e,2010-01-01,0.301851
367,40c75a77-dc5f-4ae2-99b6-bd34871ca98e,2010-01-02,0.770794
732,818c280b-0d82-45e5-8286-cf42b884b0a6,2010-01-01,0.796229
733,818c280b-0d82-45e5-8286-cf42b884b0a6,2010-01-02,0.811402


## Use Case: assert 100% join accuracy for data integrity checks 

In data engineering QA you want to test that data is joined correctly. Asserting that all keys match is particularly useful for detecting potential data problems in production.

In [7]:
df2 = df1.copy()

j = d6tjoin.utils.PreJoin([df1,df2],['id','date'])
assert j.is_all_matched() # succeeds
assert j.is_all_matched('id') # succeeds
assert j.is_all_matched('date') # succeeds
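
For comparison, a similar check can be sketched with plain pandas. This is not how `d6tjoin` implements `is_all_matched()`; the helper name `all_keys_matched` and the small stand-in dataframes are assumptions made for illustration. The idea is to outer-join the unique key combinations with `indicator=True` and require that every row is present on both sides.

```python
import itertools

import numpy as np
import pandas as pd

# Small stand-in for the tutorial data: 3 ids x 5 calendar dates
ids = ['a', 'b', 'c']
dates = pd.date_range('2010-01-01', periods=5)
left = pd.DataFrame(list(itertools.product(ids, dates)), columns=['id', 'date'])
left['v'] = np.random.sample(len(left))
right = left.copy()


def all_keys_matched(df_left, df_right, keys):
    """Hypothetical helper: True if every key combination exists on both sides."""
    keys_left = df_left[keys].drop_duplicates()
    keys_right = df_right[keys].drop_duplicates()
    # indicator=True adds a '_merge' column with 'left_only', 'right_only' or 'both'
    merged = keys_left.merge(keys_right, on=keys, how='outer', indicator=True)
    return bool((merged['_merge'] == 'both').all())


print(all_keys_matched(left, right, ['id', 'date']))

# Drop one id on the right side to break the match
right_partial = right[right['id'] != 'c']
print(all_keys_matched(left, right_partial, ['id', 'date']))
```

Deduplicating the keys first keeps the outer join small even when the value columns are large.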


## Use Case: detect and analyze id mismatch 

When joining data from different sources, e.g. different vendors, ids often don't match and you need to analyze the situation manually. With databolt this becomes much easier.

### 100% id mismatch

Let's look at an example where vendor 1 uses a different id convention than vendor 2, so none of the ids match.

In [8]:
# create mismatch
df2['id'] = df1['id'].str[1:-1]

j = d6tjoin.utils.PreJoin([df1,df2],['id','date'])

try:
    assert j.is_all_matched() # fails
except AssertionError:
    print('assert fails!')

assert fails!


The QA check shows there's a problem, so let's analyze the issue with `PreJoin.stats_prejoin()`. We can immediately see that none of the ids match.

In [9]:
j.stats_prejoin(print_only=False)

,key left,key right,all matched,inner,left,right,outer,unmatched total,unmatched left,unmatched right
0,id,id,False,0,10,10,20,20,10,10
1,date,date,True,366,366,366,366,0,0,0
2,__all__,__all__,False,0,3660,3660,7320,7320,3660,3660
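
The columns above appear to count unique key values that would survive each join type. A minimal sketch of that reading, using Python sets, is shown here; the helper `key_overlap_stats` is hypothetical and only mirrors the column names in the output, it is not the library's implementation.

```python
import pandas as pd


def key_overlap_stats(df_left, df_right, key):
    """Hypothetical helper: count unique key values by the join type they survive."""
    left = set(df_left[key].dropna().unique())
    right = set(df_right[key].dropna().unique())
    return {
        'all matched': left == right,
        'inner': len(left & right),            # values present on both sides
        'left': len(left),                     # unique values on the left
        'right': len(right),                   # unique values on the right
        'outer': len(left | right),            # values present on either side
        'unmatched total': len(left ^ right),  # values present on exactly one side
        'unmatched left': len(left - right),
        'unmatched right': len(right - left),
    }


df_a = pd.DataFrame({'id': ['aa-1', 'bb-2', 'cc-3']})
# Same mismatch as in the tutorial: strip the first and last character
df_b = pd.DataFrame({'id': df_a['id'].str[1:-1]})

stats = key_overlap_stats(df_a, df_b, 'id')
print(stats)
```

With no overlap at all, `outer` and `unmatched total` both equal the sum of the two sides, which is the same pattern as the `id` row above (10 + 10 = 20).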


Let's look at some of the mismatched records with `PreJoin.show_unmatched()`. It looks like there might be a length problem.

In [10]:
print(j.show_unmatched('id')['left'])
print(j.show_unmatched('id')['right'])

                                       id       date         v
732  818c280b-0d82-45e5-8286-cf42b884b0a6 2010-01-01  0.796229
733  818c280b-0d82-45e5-8286-cf42b884b0a6 2010-01-02  0.811402
734  818c280b-0d82-45e5-8286-cf42b884b0a6 2010-01-03  0.301586
                                      id       date         v
1098  27bf350-47a9-415f-a53c-bd1ea931bc3 2010-01-01  0.526386
1099  27bf350-47a9-415f-a53c-bd1ea931bc3 2010-01-02  0.830395
1100  27bf350-47a9-415f-a53c-bd1ea931bc3 2010-01-03  0.824349


We can show string length statistics using `d6tjoin.utils.df_str_summary()`, which confirms that the id string lengths are different.

In [11]:
print(d6tjoin.utils.df_str_summary(df1,['id']))
print(d6tjoin.utils.df_str_summary(df2,['id']))


    mean  median   min   max   total
id  36.0    36.0  36.0  36.0  3660.0
    mean  median   min   max   total
id  34.0    34.0  34.0  34.0  3660.0
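
If you want the same kind of length comparison without the library, pandas string methods are enough. The sketch below is an illustration only: `str_len_summary` is a hypothetical helper, and it covers just the mean, median, min and max columns.

```python
import pandas as pd


def str_len_summary(df, columns):
    """Hypothetical helper: string length statistics, one row per column."""
    # Compute the length of every value, column by column
    lengths = df[columns].apply(lambda s: s.astype(str).str.len())
    # Aggregate, then transpose so each input column becomes a row
    return lengths.agg(['mean', 'median', 'min', 'max']).T


df_a = pd.DataFrame({'id': ['abcd-1234', 'efgh-5678']})
# Strip the first and last character, as in the mismatch example
df_b = pd.DataFrame({'id': df_a['id'].str[1:-1]})

print(str_len_summary(df_a, ['id']))
print(str_len_summary(df_b, ['id']))
```

When min and max are equal on each side but differ between the two sides, as here, the ids most likely follow two fixed-width conventions rather than containing occasional bad values.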


### Partial id mismatch

Let's look at another example where there is a partial mismatch. In this case, let's say vendor 2 covers only a subset of the ids.

In [12]:
# create partial mismatch
uuid_sel = np.array(uuid1)[np.random.choice(nobs, nobs//5, replace=False)].tolist()
df2 = df1[~df1['id'].isin(uuid_sel)]

j = d6tjoin.utils.PreJoin([df1,df2],['id','date'])

try:
    assert j.is_all_matched() # fails
except AssertionError:
    print('assert fails!')

assert fails!


Again we've quickly identified a problem. This would typically lead to tedious manual QA work, but with `PreJoin.stats_prejoin()` you can quickly see how many ids were mismatched.

In [13]:
j.stats_prejoin(print_only=False)

,key left,key right,all matched,inner,left,right,outer,unmatched total,unmatched left,unmatched right
0,id,id,False,8,10,8,10,2,2,0
1,date,date,True,366,366,366,366,0,0,0
2,__all__,__all__,False,2928,3660,2928,3660,732,732,0
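
Once you know how many ids are missing, the next question is usually which ones. A minimal sketch with plain pandas follows; the id values are made up for illustration, and `isin` is used the same way as in the cell that created the mismatch.

```python
import pandas as pd

# Made-up ids: the right side covers only part of the left side
left_ids = pd.Series(['id1', 'id2', 'id3', 'id4', 'id5'], name='id')
right_ids = pd.Series(['id1', 'id3', 'id4'], name='id')

# ids on the left that the right side does not cover
missing = left_ids[~left_ids.isin(right_ids)].unique().tolist()

# Share of left ids that also exist on the right
coverage = 1 - len(missing) / left_ids.nunique()

print(missing)
print(f'{coverage:.0%}')
```

Tracking the coverage ratio over time can help you notice when a vendor silently drops identifiers.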


## Use Case: detect and analyze date mismatch 

Dates are another common source of frustration for data engineers working with time series data because they come in a variety of formats and conventions. Let's use databolt to analyze a date mismatch.

In [14]:
dates2 = pd.bdate_range('1/1/2010','1/1/2011') # business instead of calendar dates
df2 = pd.DataFrame(list(itertools.product(uuid1,dates2)),columns=['id','date'])
df2['v']=np.random.sample(df2.shape[0])

Calling `PreJoin.stats_prejoin()` with `print_only=False` returns the statistics as a dataframe instead of printing them, so we can run assertions on the result directly. The QA test for all matches fails.

In [15]:
j = d6tjoin.utils.PreJoin([df1,df2],['id','date'])
dfr = j.stats_prejoin(print_only=False)
try:
    assert dfr['all matched'].all() # fails
except AssertionError:
    print('assert fails!')

assert fails!


Looking at the dataframe, we can see that 105 of the 366 dates on the left are not matched on the right.

In [16]:
dfr

,key left,key right,all matched,inner,left,right,outer,unmatched total,unmatched left,unmatched right
0,id,id,True,10,10,10,10,0,0,0
1,date,date,False,261,366,261,366,105,105,0
2,__all__,__all__,False,2610,3660,2610,3660,1050,1050,0


We can look at mismatched records using `PreJoin.show_unmatched()`. Here we return all mismatched records as a dataframe that you can analyze further.

In [17]:
dft = j.show_unmatched('date',keys_only=False,nrecords=-1,nrows=-1)['left']

In [18]:
dft.head()

,id,date,v
1,a531acd5-cdfd-480b-ba7f-5766f85053ac,2010-01-02,0.893936
2,a531acd5-cdfd-480b-ba7f-5766f85053ac,2010-01-03,0.338417
8,a531acd5-cdfd-480b-ba7f-5766f85053ac,2010-01-09,0.268176
9,a531acd5-cdfd-480b-ba7f-5766f85053ac,2010-01-10,0.681733
15,a531acd5-cdfd-480b-ba7f-5766f85053ac,2010-01-16,0.479296


Looking at the weekdays of the mismatched entries, you can see they are all weekends (pandas encodes Saturday as 5 and Sunday as 6).

In [19]:
dft['date_wkday']=dft['date'].dt.weekday
dft['date_wkday'].unique()

array([5, 6], dtype=int64)
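
The same conclusion can be cross-checked directly on the two date ranges used to build the data, without going through the join. This is a sketch with plain pandas; filtering the calendar side down to business days at the end is only one possible fix and is an assumption about what you want the join to contain.

```python
import pandas as pd

# The two date conventions used in this example
calendar_dates = pd.date_range('1/1/2010', '1/1/2011')
business_dates = pd.bdate_range('1/1/2010', '1/1/2011')

# Dates that exist on the calendar side only
missing_dates = calendar_dates.difference(business_dates)
print(len(calendar_dates), len(business_dates), len(missing_dates))

# Weekday numbers of the missing dates (Monday=0 ... Sunday=6)
print(sorted(missing_dates.weekday.unique()))

# Possible fix (assumption): keep only Monday to Friday before joining
calendar_business_only = calendar_dates[calendar_dates.weekday < 5]
```

The three lengths printed here correspond to the `left`, `right` and `unmatched left` values in the `date` row of the statistics above.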

## Conclusion

Joining datasets from different sources can be a big time waster for data engineers. With databolt you can quickly run join QA and analyze problems without tedious manual work.