# Review: *pandas_dq* library

In [1]:
from pandas_dq import dq_report
from pandas_dq import dc_report
from sklearn.datasets import fetch_california_housing

# Dataset

In [2]:
dd = fetch_california_housing(as_frame=True)
data = dd["frame"]
data.shape, data.columns

((20640, 9),
 Index(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup',
        'Latitude', 'Longitude', 'MedHouseVal'],
       dtype='object'))

# Data quality report (`dq_report`)

In [3]:
dqr = dq_report(data, target="MedHouseVal", html=False, csv_engine="pandas", verbose=1)

| Column Name | Data Type | Missing Values% | Unique Values% | Minimum Value | Maximum Value | DQ Issue |
|---|---|---|---|---|---|---|
| MedInc | float64 | 0.0 |  | 0.4999 | 15.0001 | has 681 outliers greater than upper bound (8.01) or lower than lower bound(-0.71). Cap them or remove them. |
| HouseAge | float64 | 0.0 |  | 1.0 | 52.0 | No issue |
| AveRooms | float64 | 0.0 |  | 0.846154 | 141.909091 | has 511 outliers greater than upper bound (8.47) or lower than lower bound(2.02). Cap them or remove them. |
| AveBedrms | float64 | 0.0 |  | 0.333333 | 34.066667 | has 1424 outliers greater than upper bound (1.24) or lower than lower bound(0.87). Cap them or remove them., has a high correlation with ['AveRooms']. Consider dropping one of them. |
| Population | float64 | 0.0 |  | 3.0 | 35682.0 | has 1196 outliers greater than upper bound (3132.00) or lower than lower bound(-620.00). Cap them or remove them. |
| AveOccup | float64 | 0.0 |  | 0.692308 | 1243.333333 | has 711 outliers greater than upper bound (4.56) or lower than lower bound(1.15). Cap them or remove them. |
| Latitude | float64 | 0.0 |  | 32.54 | 41.95 | No issue |
| Longitude | float64 | 0.0 |  | -124.35 | -114.31 | has a high correlation with ['Latitude']. Consider dropping one of them. |
| MedHouseVal | float64 | 0.0 |  | 0.14999 | 5.00001 | has 1071 outliers greater than upper bound (4.82) or lower than lower bound(-0.98). Cap them or remove them. |
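
The outlier bounds in the DQ Issue column look consistent with Tukey's 1.5 × IQR rule, although that is an assumption about the library's internals rather than something this notebook verifies. A minimal sketch for cross-checking a column by hand follows; `iqr_outlier_summary` and the toy series are hypothetical names used only for illustration and are not part of pandas_dq.

```python
import pandas as pd


def iqr_outlier_summary(series: pd.Series, k: float = 1.5) -> dict:
    """Return Tukey-fence bounds and the number of values outside them.

    Assumption: outliers are values below Q1 - k*IQR or above Q3 + k*IQR.
    """
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    n_outliers = int(((series < lower) | (series > upper)).sum())
    return {"lower": lower, "upper": upper, "n_outliers": n_outliers}


# Toy data standing in for a real column such as data["MedInc"]
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 50.0])
summary = iqr_outlier_summary(toy)
print(summary)
```

With the notebook's `data` frame loaded, calling `iqr_outlier_summary(data["MedInc"])` gives numbers that can be compared against the corresponding row of the report.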


# Dataset comparison report (`dc_report`)

In [4]:
# dataset splitting: first 15,000 rows for train, the remaining rows for test
train = data.iloc[:15000]
test = data.iloc[15000:]
train.shape, test.shape

((15000, 9), (5640, 9))
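
The split above takes the first 15,000 rows in stored order, so any ordering present in the dataset carries over into the train/test comparison. As a point of comparison, a shuffled split can be sketched as follows; `shuffled_split` and the toy frame are hypothetical names used only for illustration and are not part of pandas_dq.

```python
import pandas as pd


def shuffled_split(df: pd.DataFrame, n_train: int, seed: int = 0):
    """Shuffle the rows reproducibly, then cut into train and test parts."""
    shuffled = df.sample(frac=1.0, random_state=seed)
    return shuffled.iloc[:n_train], shuffled.iloc[n_train:]


# Toy frame standing in for the housing data
toy = pd.DataFrame({"x": range(10), "y": range(10, 20)})
toy_train, toy_test = shuffled_split(toy, n_train=7)
print(toy_train.shape, toy_test.shape)
```

Running the comparison report on both kinds of split is one way to see how much of the reported distribution difference comes from row order alone.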

In [5]:
# store the result under a new name so the imported dc_report function is not overwritten
dcr = dc_report(train, test, exclude=[], html=False, verbose=1)

Analyzing two dataframes for differences. This will take time, please be patient...


|  | Column Name | Data Type_Train | Missing Values%_Train | Unique Values%_Train | Minimum Value_Train | Maximum Value_Train | DQ Issue_Train | Data Type_Test | Missing Values%_Test | Unique Values%_Test | Minimum Value_Test | Maximum Value_Test | DQ Issue_Test | Distribution Difference |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MedInc | float64 | 0.0 |  | 0.4999 | 15.0001 | has 529 outliers greater than upper bound (7.88) or lower than lower bound(-0.73). Cap them or remove them. | float64 | 0.0 |  | 0.4999 | 15.0001 | has 162 outliers greater than upper bound (8.40) or lower than lower bound(-0.67). Cap them or remove them. | The distributions of MedInc are different with a KS test statistic of 0.080. |
| 1 | HouseAge | float64 | 0.0 |  | 1.0 | 52.0 | No issue | float64 | 0.0 |  | 1.0 | 52.0 | No issue | The distributions of HouseAge are different with a KS test statistic of 0.106. |
| 2 | AveRooms | float64 | 0.0 |  | 0.846154 | 141.909091 | has 401 outliers greater than upper bound (8.46) or lower than lower bound(1.91). Cap them or remove them. | float64 | 0.0 |  | 1.130435 | 37.063492 | has 134 outliers greater than upper bound (8.38) or lower than lower bound(2.44). Cap them or remove them. | The distributions of AveRooms are different with a KS test statistic of 0.092. |
| 3 | AveBedrms | float64 | 0.0 |  | 0.375 | 34.066667 | has 1075 outliers greater than upper bound (1.24) or lower than lower bound(0.87). Cap them or remove them., has a high correlation with ['AveRooms']. Consider dropping one of them. | float64 | 0.0 |  | 0.333333 | 7.185185 | has 351 outliers greater than upper bound (1.23) or lower than lower bound(0.87). Cap them or remove them. | The distributions of AveBedrms are different with a KS test statistic of 0.034. |
| 4 | Population | float64 | 0.0 |  | 3.0 | 28566.0 | has 895 outliers greater than upper bound (3159.00) or lower than lower bound(-633.00). Cap them or remove them. | float64 | 0.0 |  | 8.0 | 35682.0 | has 293 outliers greater than upper bound (3076.00) or lower than lower bound(-596.00). Cap them or remove them. | The distributions of Population are different with a KS test statistic of 0.020. |
| 5 | AveOccup | float64 | 0.0 |  | 0.692308 | 599.714286 | has 464 outliers greater than upper bound (4.68) or lower than lower bound(1.09). Cap them or remove them. | float64 | 0.0 |  | 0.970588 | 1243.333333 | has 195 outliers greater than upper bound (4.27) or lower than lower bound(1.30). Cap them or remove them. | The distributions of AveOccup are different with a KS test statistic of 0.078. |
| 6 | Latitude | float64 | 0.0 |  | 32.54 | 41.95 | has 4 outliers greater than upper bound (41.83) or lower than lower bound(29.11). Cap them or remove them. | float64 | 0.0 |  | 32.61 | 41.95 | has 32 outliers greater than upper bound (41.30) or lower than lower bound(31.97). Cap them or remove them. | The distributions of Latitude are different with a KS test statistic of 0.491. |
| 7 | Longitude | float64 | 0.0 |  | -124.35 | -114.31 | has 1 outliers greater than upper bound (-114.08) or lower than lower bound(-124.33). Cap them or remove them., has a high correlation with ['Latitude']. Consider dropping one of them. | float64 | 0.0 |  | -123.53 | -116.2 | has 2 outliers greater than upper bound (-116.28) or lower than lower bound(-125.84). Cap them or remove them., has a high correlation with ['Latitude']. Consider dropping one of them. | The distributions of Longitude are different with a KS test statistic of 0.560. |
| 8 | MedHouseVal | float64 | 0.0 |  | 0.14999 | 5.00001 | has 892 outliers greater than upper bound (4.42) or lower than lower bound(-0.79). Cap them or remove them. | float64 | 0.0 |  | 0.14999 | 5.00001 | No issue | The distributions of MedHouseVal are different with a KS test statistic of 0.164. |
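
The Distribution Difference column reports a KS test statistic per column. Assuming this is the standard two-sample Kolmogorov-Smirnov statistic (the same quantity `scipy.stats.ks_2samp` returns, which is not confirmed here), it is the largest absolute gap between the empirical CDFs of the two samples. A minimal NumPy sketch follows; `ks_statistic` is a hypothetical helper written for illustration, not a pandas_dq function.

```python
import numpy as np


def ks_statistic(a, b) -> float:
    """Two-sample KS statistic: max |ECDF_a(x) - ECDF_b(x)| over all observed x."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    # The ECDFs only change at observed values, so evaluating there is enough
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))


# Toy samples: identical, partially overlapping, fully separated
same = ks_statistic([1, 2, 3], [1, 2, 3])
overlap = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
apart = ks_statistic([1, 2, 3], [4, 5, 6])
print(same, overlap, apart)
```

The statistic ranges from 0 (identical empirical distributions) to 1 (no overlap), which gives some context for values such as 0.020 for Population versus 0.560 for Longitude. With the notebook's frames loaded, `ks_statistic(train["Latitude"], test["Latitude"])` can be compared against the reported figure.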
