# Data Profiler(s)

This notebook tries out the following data profiling tools:
 * [pandas.DataFrame.describe](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html)
 * [numpy.histogram](https://numpy.org/doc/stable/reference/generated/numpy.histogram.html)
 * [TensorFlow Data Validation](https://www.tensorflow.org/tfx/tutorials/data_validation/tfdv_basic)
 * [data profiling example](https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/examples/data_profiling_example.md)

To run this notebook you will need:
```
# install git, maven, jdk and python
sudo apt install git maven openjdk-11-jdk-headless python3

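# [optional] create (and activate) python environment (conda, venv or other)
# (python is listed explicitly so that the new environment gets its own interpreter and pip)
conda create --prefix ./envs python && conda activate ./envs

# install python packages (extras are quoted so that shells such as zsh do not expand the brackets)
pip install --use-feature=2020-resolver jupyterlab numpy pandas tensorflow 'tensorflow_data_validation[visualization]' 'apache-beam[interactive]' pyspark altair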
```

Other profilers worth mentioning are:
 * [kaggle](https://www.kaggle.com) - for example
   [us-election-2020](https://www.kaggle.com/unanimad/us-election-2020?select=president_county_candidate.csv)
   or [us-election-2020-tweets](https://www.kaggle.com/manchunhui/us-election-2020-tweets)
 * [trifacta](https://www.trifacta.com) - for example [this screenshot](https://cpb-us-e1.wpmucdn.com/blogs.ntu.edu.sg/dist/c/1904/files/2016/05/monitor-e1532926034471-2643mzn.png)


In [1]:
import json, pathlib

import numpy as np
import pandas as pd

from IPython import display

import tensorflow_data_validation as tfdv
from google.protobuf.json_format import MessageToDict, MessageToJson

import pyspark


In [2]:
data_dir = 'data'
pathlib.Path(data_dir).mkdir(parents=True, exist_ok=True)
tmp_dir = 'tmp'
pathlib.Path(tmp_dir).mkdir(parents=True, exist_ok=True)


In [3]:
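def download_and_cache_data_file(url, data_dir=data_dir):
    """Download url into data_dir unless a file with the same name is already cached there."""
    # A bare `import urllib` does not reliably load the parse/request submodules
    import pathlib, shutil, urllib.parse, urllib.request
    file_name = pathlib.Path(urllib.parse.urlparse(url).path).name
    local_path = pathlib.Path(data_dir) / file_name
    if not local_path.exists():
        print(f'Downloading {url} to {local_path} ...')
        print('BEWARE: Downloading large data files may take a while!')
        # Stream the response to disk instead of holding it in memory
        with urllib.request.urlopen(url) as u:
            with open(local_path, 'wb') as f:
                shutil.copyfileobj(u, f)
    return local_path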


In [4]:
max_rows_per_file = 100000

# https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
tripdata_base_url = 'https://s3.amazonaws.com/nyc-tlc/trip+data'

months = ['2016-02', '2020-02', '2020-03']
file_names_by_key = { m: download_and_cache_data_file(f'{tripdata_base_url}/yellow_tripdata_{m}.csv') for m in months }
# on_bad_lines='skip' (pandas >= 1.3) replaces the removed error_bad_lines=False, warn_bad_lines=False
df_by_key = { k: pd.read_csv(v, on_bad_lines='skip', nrows=max_rows_per_file) for k, v in file_names_by_key.items() }


# Pandas


In [5]:
df_by_key['2020-02'].describe()


| | VendorID | passenger_count | trip_distance | RatecodeID | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 |
| mean | 1.68106 | 1.55352 | 2.726359 | 1.04995 | 160.65441 | 159.70352 | 1.27821 | 11.539294 | 0.901178 | 0.493245 | 1.991273 | 0.235292 | 0.297474 | 17.021821 | 2.301098 |
| std | 0.466068 | 1.186731 | 3.327921 | 0.563115 | 68.478763 | 71.901561 | 0.483354 | 10.574961 | 1.163517 | 0.072124 | 2.397636 | 1.377297 | 0.037958 | 12.773744 | 0.704652 |
| min | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | -222.22 | -0.5 | -0.5 | -1.16 | -11.75 | -0.3 | -222.52 | -2.5 |
| 25% | 1.0 | 1.0 | 0.99 | 1.0 | 107.0 | 106.0 | 1.0 | 6.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.3 | 10.56 | 2.5 |
| 50% | 2.0 | 1.0 | 1.65 | 1.0 | 158.0 | 161.0 | 1.0 | 8.5 | 0.5 | 0.5 | 1.76 | 0.0 | 0.3 | 13.55 | 2.5 |
| 75% | 2.0 | 2.0 | 2.98 | 1.0 | 234.0 | 234.0 | 2.0 | 13.0 | 2.5 | 0.5 | 2.66 | 0.0 | 0.3 | 18.36 | 2.5 |
| max | 2.0 | 6.0 | 55.2 | 99.0 | 265.0 | 265.0 | 4.0 | 400.0 | 3.5 | 0.5 | 80.0 | 66.12 | 0.3 | 400.3 | 2.5 |
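
The summary above shows negative minima for several monetary columns (for example `fare_amount` and `total_amount`), which is the kind of anomaly a profile is meant to surface. A minimal sketch of how such columns could be picked out of a `describe()` result follows; the `toy` frame and its values are made up for illustration and only stand in for `df_by_key['2020-02']`.

```python
import pandas as pd

# Toy data standing in for the trip records (values are illustrative only)
toy = pd.DataFrame({
    'trip_distance': [0.0, 0.99, 1.65, 2.98, 55.2],
    'fare_amount': [-222.22, 6.0, 8.5, 13.0, 400.0],
    'payment_type': [1, 1, 2, 1, 4],
})

summary = toy.describe()

# Columns whose minimum is negative deserve a closer look
mins = summary.loc['min']
suspicious = mins[mins < 0].index.tolist()

# Count the offending rows per suspicious column
negative_counts = (toy[suspicious] < 0).sum()

print(suspicious)
print(negative_counts)
```

Applied to the real frame, the same two steps would list every column with a negative minimum together with the number of affected rows.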


# NumPy


In [6]:
# np.histogram(df_by_key['2020-02']['trip_distance'])
# np.histogram(df_by_key['2020-02']['trip_distance'], bins=np.linspace(0.0, 100.0, num=10))
np.histogram(df_by_key['2020-02']['trip_distance'], bins=np.logspace(0.0, 2.0, num=10))


(array([25272, 22423, 13487,  7277,  3819,  2474,   184,    14,     0]),
 array([  1.        ,   1.66810054,   2.7825594 ,   4.64158883,
          7.74263683,  12.91549665,  21.5443469 ,  35.93813664,
         59.94842503, 100.        ]))
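
Note that the counts above add up to 74,950, not 100,000: `np.histogram` silently ignores values outside the outermost bin edges, and the log-spaced edges start at 1.0, while the 25% quantile of `trip_distance` reported earlier is 0.99. The sketch below reproduces the effect on a small made-up array (the `distances` values are illustrative, not taken from the data set).

```python
import numpy as np

# Toy distances (illustrative only), some below the first edge and one above the last
distances = np.array([0.0, 0.5, 0.99, 1.2, 1.65, 2.98, 12.0, 55.2, 99.0, 250.0])

# 10 log-spaced edges from 10**0 to 10**2, i.e. 9 bins
edges = np.logspace(0.0, 2.0, num=10)
counts, _ = np.histogram(distances, bins=edges)

# Values outside [edges[0], edges[-1]] are not counted in any bin
dropped = distances.size - counts.sum()
below = np.count_nonzero(distances < edges[0])
above = np.count_nonzero(distances > edges[-1])

print(counts, dropped, below, above)
```

Comparing `counts.sum()` with the number of rows is a cheap check that the chosen bins cover the data; adding an explicit first bin starting at the column minimum would be one way to keep the short trips visible.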

# TensorFlow Data Validation

See [tfdv_basic](https://www.tensorflow.org/tfx/tutorials/data_validation/tfdv_basic) for details.


In [7]:
# tfdv_stats_by_key = { k: tfdv.generate_statistics_from_csv(data_location=str(v)) for k, v in file_names_by_key.items() }
tfdv_stats_by_key = { k: tfdv.generate_statistics_from_dataframe(v) for k, v in df_by_key.items() }


In [8]:
# tfdv.visualize_statistics(tfdv_stats_by_key['2020-02'])
# tfdv.visualize_statistics(lhs_statistics=tfdv_stats_by_key['2020-02'], rhs_statistics=tfdv_stats_by_key['2020-03'], lhs_name='2020-02', rhs_name='2020-03')
tfdv.visualize_statistics(lhs_statistics=tfdv_stats_by_key['2016-02'], rhs_statistics=tfdv_stats_by_key['2020-03'], lhs_name='2016-02', rhs_name='2020-03')


In [9]:
# Alternative data sources; only the last assignment takes effect
# url = 'https://storage.googleapis.com/covid19-open-data/v2/epidemiology.csv'
url = 'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2020-03.csv'

stats = tfdv_stats_by_key['2020-02']
# stats = tfdv.generate_statistics_from_csv(data_location=str(download_and_cache_data_file(url)))
# stats = tfdv.generate_statistics_from_dataframe(pd.read_csv(download_and_cache_data_file(url), on_bad_lines='skip'))
# tfdv.visualize_statistics(stats)


In [10]:
display.JSON(MessageToDict(stats))


<IPython.core.display.JSON object>
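
The dictionary returned by `MessageToDict` can also be walked programmatically rather than only browsed in the JSON viewer. The sketch below assumes the dictionary follows the JSON mapping of TFDV's statistics proto (`datasets` → `features` → `numStats`, with camelCase keys); `numeric_summary` is a hypothetical helper and `example` is a hand-written stand-in for `MessageToDict(stats)`, not real output.

```python
def numeric_summary(stats_dict):
    """Return {feature_name: {'mean': ..., 'min': ..., 'max': ...}} for numeric features."""
    summary = {}
    for dataset in stats_dict.get('datasets', []):
        for feature in dataset.get('features', []):
            num_stats = feature.get('numStats')
            if num_stats is None:
                continue  # not a numeric feature
            # Assumption: a feature is identified by 'name' or by a 'path' made of steps
            name = feature.get('name') or '.'.join(feature.get('path', {}).get('step', []))
            # Fields equal to their proto3 default (0.0) are omitted by MessageToDict
            summary[name] = {key: num_stats.get(key, 0.0) for key in ('mean', 'min', 'max')}
    return summary


# Hand-written stand-in for MessageToDict(stats); layout and values are assumptions
example = {
    'datasets': [{
        'numExamples': '100000',
        'features': [
            {'path': {'step': ['trip_distance']}, 'type': 'FLOAT',
             'numStats': {'mean': 2.726359, 'max': 55.2}},  # 'min' omitted because it is 0.0
            {'path': {'step': ['some_text_column']}, 'type': 'STRING',
             'stringStats': {'unique': '2'}},
        ],
    }],
}

result = numeric_summary(example)
print(result)
```

Using `.get` with a default matters here because a minimum of exactly 0.0 would otherwise look like a missing statistic.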

# Deequ

See [data profiling example](https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/examples/data_profiling_example.md)
and [builder methods](https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/profiles/ColumnProfilerRunBuilder.scala)
for details.


In [11]:
if not pathlib.Path('src/deequ/target/deequ_2.12-1.1.0-SNAPSHOT.jar').exists():
    !mkdir -p src
    !cd src && git clone https://github.com/awslabs/deequ.git
    !cd src/deequ && git checkout a586d1c
    !cd src/deequ && mvn package -Pscala-2.12,spark-3.0 -DskipTests


In [12]:
jars = ['src/deequ/target/deequ_2.12-1.1.0-SNAPSHOT.jar']
for jar in jars:
    assert pathlib.Path(jar).exists(), f'Missing jar file: {jar}'

spark_builder = pyspark.sql.SparkSession.builder.appName('pysparktest')
spark_builder = spark_builder.config('spark.jars', ','.join(jars))
# spark_builder = spark_builder.config('spark.driver.memory', '4g')
spark = spark_builder.getOrCreate()


In [13]:
ColumnProfilerRunner = spark._jvm.com.amazon.deequ.profiles.ColumnProfilerRunner
ColumnProfiles = spark._jvm.com.amazon.deequ.profiles.ColumnProfiles


In [14]:
# dfs_by_key = { k: spark.read.format('csv').option('header', True).load(str(v)).limit(max_rows_per_file) for k, v in file_names_by_key.items() }
dfs_by_key = { k: spark.createDataFrame(v) for k, v in df_by_key.items() }
deequ_profiles_by_key = { k: ColumnProfilerRunner().onData(v._jdf).cacheInputs(True).withKLLProfiling().run() for k, v in dfs_by_key.items() }


In [15]:
deequ_profile = deequ_profiles_by_key['2020-02']
display.JSON(json.loads(ColumnProfiles.toJson(deequ_profile.profiles().values().toSeq())))


<IPython.core.display.JSON object>