# Exploratory data analysis

## Sparse categorical feature counts

The script below uses the PyAthena Python API to query a summary of each training dataset, giving the record count for each unique value of a categorical feature. These counts will later be used to encode the sparse categorical features as integers. As recommended by [Song et al.](), the encoding is binned so that each encoded sparse feature value has at least $t$ occurrences in the training set, with:

- $t=5$ for KDD12
- $t=10$ for Avazu
- $t=10$ for Criteo
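
As a minimal sketch of the binned encoding described above, assuming a counts table with a `record_count` column like the one queried below: values with at least $t$ records get their own integer id, and all rarer (or unseen) values share a single bin. The helper names `build_binned_mapping` and `encode`, the choice of id 0 for the shared bin, and the toy data are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def build_binned_mapping(counts: pd.DataFrame, column: str, t: int) -> dict:
    """Give each value with at least t records its own integer id.

    Ids start at 1; id 0 is reserved for the shared bin of rare values
    (an assumption of this sketch).
    """
    frequent = counts.loc[counts["record_count"] >= t, column]
    return {value: idx for idx, value in enumerate(frequent, start=1)}


def encode(values: pd.Series, mapping: dict) -> pd.Series:
    """Encode values as integers; rare or unseen values fall into bin 0."""
    return values.map(mapping).fillna(0).astype("int64")


# Toy counts standing in for the Athena query result (values are made up).
toy_counts = pd.DataFrame(
    {"display_url": ["a", "b", "c", "d"], "record_count": [120, 40, 5, 2]}
)
toy_mapping = build_binned_mapping(toy_counts, "display_url", t=5)
toy_encoded = encode(pd.Series(["a", "c", "d", "zzz"]), toy_mapping)
```

Reserving one id for the rare bin keeps the vocabulary size bounded by the number of frequent values plus one, and gives values that first appear at test time a well-defined encoding.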

In [3]:
from pyathena import connect
import pandas as pd
conn = connect(s3_staging_dir='s3://mlds-final-project-bucket/athena_output/',
               region_name='eu-west-2')

In [5]:
test_query = """
SELECT display_url, COUNT(*) AS record_count
FROM kdd12.training
GROUP BY display_url
ORDER BY record_count DESC
"""
df = pd.read_sql(test_query, conn)

In [6]:
df.head()

| | display_url | record_count |
|---|---|---|
| 0 | 14340390157469404125 | 14507449 |
| 1 | 12057878999086460853 | 13073969 |
| 2 | 7903914528320191889 | 4085858 |
| 3 | 4298118681424644510 | 1563456 |
| 4 | 14531867648059391627 | 1322932 |


In [7]:
# Fraction of distinct display_url values with strictly more than 5 records
len(df[df.record_count>5])/len(df)

0.8997449853461729

In [12]:
mapping = dict(zip(df['display_url'],df.index))

In [13]:
mapping

{'14340390157469404125': 0,
 '12057878999086460853': 1,
 '7903914528320191889': 2,
 '4298118681424644510': 3,
 '14531867648059391627': 4,
 '13756257544627676222': 5,
 '15145480155589094740': 6,
 '5851252814446935968': 7,
 '8134264174510892884': 8,
 '10349468651765658911': 9,
 '14756578758696272233': 10,
 '2002143058964984831': 11,
 '5468727571223080485': 12,
 '2692859619851282505': 13,
 '15859647530187389343': 14,
 '5511132461021800102': 15,
 '17363854844105063905': 16,
 '1729963849377387605': 17,
 '9751072248584610585': 18,
 '8994557070508579623': 19,
 '2412771796110463309': 20,
 '5120683440510467664': 21,
 '15785112999276740221': 22,
 '1414030043541818662': 23,
 '5079901068051390251': 24,
 '13090221289647714786': 25,
 '15989049276530887671': 26,
 '999105727583572092': 27,
 '11315908569713166924': 28,
 '52605941700718194': 29,
 '7703279069701542953': 30,
 '2434756320065278633': 31,
 '2670952723278904693': 32,
 '5888277431783335848': 33,
 '11309025704045849703': 34,
 '55896494124657371