# Data Quality: pandas_dq library

It looks like a good option for automatically checking the quality of the data contained in a DataFrame.

### References:

- [[github] pandas_dq homepage.](https://github.com/AutoViML/pandas_dq)

In [1]:
%pip install pandas_dq

Collecting pandas_dq
  Using cached pandas_dq-1.9-py3-none-any.whl (19 kB)
Collecting numpy>=1.21.5
  Using cached numpy-1.21.6-cp37-cp37m-macosx_10_9_x86_64.whl (16.9 MB)
Collecting pandas>=1.3.5
  Using cached pandas-1.3.5-cp37-cp37m-macosx_10_9_x86_64.whl (11.0 MB)
Collecting scikit-learn>=0.24.2
  Using cached scikit_learn-1.0.2-cp37-cp37m-macosx_10_13_x86_64.whl (7.8 MB)
Collecting pytz>=2017.3
  Downloading pytz-2023.3-py2.py3-none-any.whl (502 kB)
Collecting joblib>=0.11
  Using cached joblib-1.2.0-py3-none-any.whl (297 kB)
Collecting scipy>=1.1.0
  Using cached scipy-1.7.3-cp37-cp37m-macosx_10_9_x86_64.whl (33.0 MB)
Collecting threadpoolctl>=2.0.0
  Using cached threadpoolctl-3.1.0-py3-none-any.whl (14 kB)
Installing collected packages: pytz, threadpoolctl, numpy, joblib, scipy, pandas, scikit-learn, pandas_dq
Successfully installed joblib-1.2.0 numpy-1

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime
from sklearn.datasets import load_iris
import seaborn
from pandas_dq import dq_report

Imported pandas_dq (1.9). Always upgrade to get latest version.



## Weather dataset

### Load data

In [3]:
# Load the compressed weather CSV directly from GitHub
path = 'https://raw.githubusercontent.com/jmquintana79/utilsDS/master/scripts/datasets/data/dataset.weather.csv.gz'
data = pd.read_csv(path)
# Parse the timestamp column
data['datetime'] = pd.to_datetime(data['datetime'])
# Constant datetime column (same value in every row)
data['dtnow'] = datetime(2022, 1, 1, 12, 0, 0)
# Sorted random sample of distinct hourly timestamps, one per row
data['dtrandom'] = pd.to_datetime(np.sort(np.random.choice(pd.date_range('2015-01-01', '2018-01-01', freq='H'), len(data), replace=False)))
data.shape, data.columns

((17544, 16),
 Index(['datetime', 'RH (%)', 'WD', 'WS (m/s)', 'cloud_coverage',
        'dew_point (degC)', 'irradiation (MJ/m2)', 'local_press (hPa)',
        'precipitation (mm)', 'sea-level pressure (hPa)',
        'sunlight_duration (h)', 'temperature (degC)', 'vapor_press (hPa)',
        'visibility (km)', 'dtnow', 'dtrandom'],
       dtype='object'))

### Data Quality Analysis

In [5]:
df_report = dq_report(data, csv_engine="pandas", verbose=1)

| Column | Data Type | Missing Values% | Unique Values% | Minimum Value | Maximum Value | DQ Issue |
|---|---|---|---|---|---|---|
| datetime | datetime64[ns] | 0.0 | 100.0 | 2016-01-01 00:00:00 | 2017-12-31 23:00:00 | Possible ID colum: drop before modeling process. |
| RH (%) | float64 | 0.079799 |  | 13.000000 | 100.000000 | 14 missing values. Impute them with mean, median, mode, or a constant value such as 123. |
| WD | object | 0.062699 | 0.0 |  |  | 11 missing values. Impute them with mean, median, mode, or a constant value such as 123., 8 rare categories: ['ENE', 'WNW', 'E', 'ESE', 'SW', 'W', 'WSW', 'C']. Group them into a single category or drop the categories., Mixed dtypes: has 2 different data types: object, float, |
| WS (m/s) | float64 | 0.062699 |  | 0.000000 | 12.200000 | 11 missing values. Impute them with mean, median, mode, or a constant value such as 123., has 412 outliers greater than upper bound (6.300000000000001) or lower than lower bound(-0.9000000000000001). Cap them or remove them. |
| cloud_coverage | object | 70.833333 | 0.0 |  |  | 12427 missing values. Impute them with mean, median, mode, or a constant value such as 123., 9 rare categories: ['9', '1', '8', '2', '3', '7', '6', '4', '5']. Group them into a single category or drop the categories., Mixed dtypes: has 2 different data types: float, object, |
| dew_point (degC) | float64 | 0.079799 |  | -20.100000 | 27.300000 | 14 missing values. Impute them with mean, median, mode, or a constant value such as 123. |
| irradiation (MJ/m2) | float64 | 40.925673 |  | 0.000000 | 3.710000 | 7180 missing values. Impute them with mean, median, mode, or a constant value such as 123. |
| local_press (hPa) | float64 | 0.0 |  | 965.500000 | 1031.300000 | has 128 outliers greater than upper bound (1031.8999999999999) or lower than lower bound(990.3). Cap them or remove them. |
| precipitation (mm) | float64 | 80.523256 |  | 0.000000 | 40.000000 | 14127 missing values. Impute them with mean, median, mode, or a constant value such as 123., has 345 outliers greater than upper bound (2.5) or lower than lower bound(-1.5). Cap them or remove them. |
| sea-level pressure (hPa) | float64 | 0.0 |  | 968.200000 | 1034.400000 | has 126 outliers greater than upper bound (1034.95) or lower than lower bound(992.95). Cap them or remove them., has a high correlation with ['local_press (hPa)']. Consider dropping one of them. |
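
The outlier bounds in the report above look like the classic 1.5 × IQR rule. For example, the `precipitation (mm)` bounds of 2.5 and -1.5 would be consistent with quartiles of 0 and 1. That is an inference from the output, not something confirmed from the pandas_dq source. Under that assumption, a minimal sketch of the check looks like this; `iqr_bounds` and the toy sample are hypothetical and are not part of pandas_dq or the weather data.

```python
from typing import Tuple

import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5) -> Tuple[float, float]:
    """Return (lower, upper) outlier bounds using the k * IQR rule.

    Hypothetical helper: assumes the report uses Q1 - k*IQR and Q3 + k*IQR.
    """
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy sample for illustration only (not the weather dataset)
sample = pd.Series([0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 30.0])

lower, upper = iqr_bounds(sample)
# Count values falling outside the bounds
n_outliers = int(((sample < lower) | (sample > upper)).sum())
print(lower, upper, n_outliers)
```

A negative lower bound on a column that cannot be negative (wind speed, precipitation) simply means that no low-side outliers are possible. Only the upper bound matters in those cases.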


## Iris dataset

### Load data

In [13]:
# load dataset
dataset = load_iris()
dataset.keys()
# dataset to df
data = pd.DataFrame(dataset.data, columns = dataset.feature_names)
data['class'] = dataset.target
data.shape, data.columns

((150, 5),
 Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
        'petal width (cm)', 'class'],
       dtype='object'))

### Data Quality Analysis

In [12]:
df_report = dq_report(data, target = "class", csv_engine="pandas", verbose=1)

Alert: Dropping 1 duplicate rows can sometimes cause column data types to change to object. Double-check!


| Column | Data Type | Missing Values% | Unique Values% | Minimum Value | Maximum Value | DQ Issue |
|---|---|---|---|---|---|---|
| sepal length (cm) | float64 | 0.0 |  | 4.3 | 7.9 | No issue |
| sepal width (cm) | float64 | 0.0 |  | 2.0 | 4.4 | has 4 outliers greater than upper bound (4.05) or lower than lower bound(2.05). Cap them or remove them. |
| petal length (cm) | float64 | 0.0 |  | 1.0 | 6.9 | has a high correlation with ['sepal length (cm)']. Consider dropping one of them., petal length (cm) has a correlation >= 0.8 with class. Possible data leakage. Double check this variable. |
| petal width (cm) | float64 | 0.0 |  | 0.1 | 2.5 | has a high correlation with ['sepal length (cm)', 'petal length (cm)']. Consider dropping one of them., petal width (cm) has a correlation >= 0.8 with class. Possible data leakage. Double check this variable. |
| class | int64 | 0.0 | 2.0 | 0.0 | 2.0 | has a high correlation with ['petal length (cm)', 'petal width (cm)']. Consider dropping one of them. |
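
The "possible data leakage" messages above are triggered by a correlation of at least 0.8 with the target. A rough way to reproduce that kind of check by hand is sketched below. It assumes an absolute Pearson correlation, which may differ from what pandas_dq actually computes. The helper `high_target_correlation` and the toy frame are hypothetical.

```python
import pandas as pd


def high_target_correlation(df: pd.DataFrame, target: str, threshold: float = 0.8) -> list:
    """List numeric features whose |Pearson r| with the target is >= threshold.

    Hypothetical helper: the correlation measure is an assumption.
    """
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target).abs()
    return sorted(corr[corr >= threshold].index)


# Toy data for illustration only (not the Iris dataset)
toy = pd.DataFrame({
    "leaky": [0.1, 0.9, 2.1, 3.0, 4.2, 5.1],     # tracks the label almost exactly
    "noise": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0],  # weakly related to the label
    "label": [0, 1, 2, 3, 4, 5],
})

flagged = high_target_correlation(toy, target="label")
print(flagged)
```

A high correlation with the target is only a hint. For Iris, the petal measurements are legitimately predictive features, so a flag like this calls for a manual review rather than an automatic drop.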


## Titanic dataset

### Load data

In [11]:
data = seaborn.load_dataset('titanic')
# Cast the categorical columns to plain strings
data["class"] = data["class"].astype(str)
data["deck"] = data["deck"].astype(str)
data.shape, data.columns

((891, 15),
 Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
        'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
        'alive', 'alone'],
       dtype='object'))

### Data Quality Analysis

In [14]:
df_report = dq_report(data, target = "survived", csv_engine="pandas", verbose=1)

Alert: Dropping 107 duplicate rows can sometimes cause column data types to change to object. Double-check!


| Column | Data Type | Missing Values% | Unique Values% | Minimum Value | Maximum Value | DQ Issue |
|---|---|---|---|---|---|---|
| survived | int64 | 0.0 | 0.0 | 0 | 1 | No issue |
| pclass | int64 | 0.0 | 0.0 | 1 | 3 | No issue |
| sex | object | 0.0 | 0.0 | female | male | No issue |
| age | float64 | 13.520408 |  | 0.420000 | 80.000000 | 106 missing values. Impute them with mean, median, mode, or a constant value such as 123., has 7 outliers greater than upper bound (67.5) or lower than lower bound(-8.5). Cap them or remove them. |
| sibsp | int64 | 0.0 | 0.0 | 0 | 8 | has 39 outliers greater than upper bound (2.5) or lower than lower bound(-1.5). Cap them or remove them. |
| parch | int64 | 0.0 | 0.0 | 0 | 6 | has 15 outliers greater than upper bound (2.5) or lower than lower bound(-1.5). Cap them or remove them. |
| fare | float64 | 0.0 |  | 0.000000 | 512.329200 | has 102 outliers greater than upper bound (73.198375) or lower than lower bound(-31.039025). Cap them or remove them. |
| embarked | object | 0.255102 | 0.0 |  |  | 2 missing values. Impute them with mean, median, mode, or a constant value such as 123., Mixed dtypes: has 2 different data types: object, float, |
| class | object | 0.0 | 0.0 | First | Third | No issue |
| who | object | 0.0 | 0.0 | child | woman | No issue |
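
The figures above suggest that the report is computed after duplicate rows are dropped: 106 missing ages over 891 - 107 = 784 rows gives 13.52%, which matches the `age` row. This is inferred from the output only, so treat it as an assumption about pandas_dq's behaviour. The sketch below shows how the order of operations changes the missing-value percentage, using a hypothetical toy frame.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only (not the Titanic dataset)
toy = pd.DataFrame({
    "age": [22.0, 22.0, np.nan, 35.0, np.nan],
    "sex": ["male", "male", "female", "female", "female"],
})

# Fully duplicated rows (pandas treats NaN values as equal here)
n_duplicates = int(toy.duplicated().sum())
deduped = toy.drop_duplicates()

# Percentage of missing values per column, before and after deduplication
missing_pct_raw = toy.isna().mean() * 100
missing_pct_dedup = deduped.isna().mean() * 100

print(n_duplicates, len(deduped))
print(missing_pct_raw["age"], missing_pct_dedup["age"])
```

When comparing the report against `data.isna().sum()` on the raw frame, the counts can therefore differ. Running the same check on `data.drop_duplicates()` is a quick way to reconcile them.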
