# Preprocess

Datasets from Kaggle are pre-cleaned, but CSV is not the fastest format to read into future notebooks.

This notebook reads the CSV text files, converts each column to the smallest applicable data type, and writes
the results out as Parquet, which is both pandas- and PySpark-friendly.
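
As a rough illustration of what "smallest applicable data type" means, the sketch below lets pandas choose the narrowest signed integer type for each column of a toy frame. The frame and its values are made up for the example and are not taken from the Kaggle files; the real notebook hard-codes its dtypes instead.

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative only.
toy = pd.DataFrame({
    'ip': [87540, 105560, 364778],
    'app': [12, 25, 768],
    'is_attributed': [0, 0, 1],
})

# downcast='integer' picks the smallest signed integer dtype
# that can hold every value in the column.
smallest = {col: pd.to_numeric(toy[col], downcast='integer').dtype.name
            for col in toy.columns}
print(smallest)

# Compare the column memory footprint before and after the cast.
before = toy.memory_usage(index=False).sum()
after = toy.astype(smallest).memory_usage(index=False).sum()
print(before, after)
```

Passing an explicit `dtype` mapping to `pd.read_csv`, as done below, avoids ever materialising the wider default types, which matters for a file as large as the training set.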

In [1]:
import pandas as pd
import numpy as np
import gc

# smallest applicable integer dtype for each column
dtypes = {
    'ip'            : 'int32',
    'app'           : 'int16',
    'device'        : 'int16',
    'os'            : 'int16',
    'channel'       : 'int16',
    'is_attributed' : 'int8',
    'click_id'      : 'int32'
}

In [6]:
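# full training set: parse both timestamp columns while reading
df = pd.read_csv('../data/raw/train.csv', dtype=dtypes, parse_dates=['click_time', 'attributed_time'])
df.to_parquet('../data/intermed/train.parquet')

# negative class (is_attributed == 0): keep a random 10% sample
class0 = df.loc[df.is_attributed == 0, :].sample(frac=.1)
class0.to_parquet('../data/intermed/train0_10pct.parquet')

# positive class (is_attributed == 1): keep every row
class1 = df.loc[df.is_attributed == 1, :]
class1.to_parquet('../data/intermed/train1.parquet')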

# clear biggest file from memory before doing smaller files
del df, class0, class1
gc.collect()
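
The 10% sample of the negative class is drawn without a seed, so each run of the notebook writes a different subset. If reproducibility matters, one option is to pass `random_state` to `sample`. A minimal sketch, assuming a made-up toy frame and an arbitrary seed of 42 (neither comes from the original notebook):

```python
import pandas as pd

# Hypothetical toy data: 100 negatives and 5 positives.
toy = pd.DataFrame({'is_attributed': [0] * 100 + [1] * 5})

# random_state fixes the RNG seed, so the same rows are drawn on every run.
neg_a = toy.loc[toy.is_attributed == 0].sample(frac=.1, random_state=42)
neg_b = toy.loc[toy.is_attributed == 0].sample(frac=.1, random_state=42)

print(len(neg_a))                        # 10 rows: 10% of the 100 negatives
print(neg_a.index.equals(neg_b.index))   # True: identical draw both times
```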

In [6]:
df = pd.read_csv('../data/raw/train_sample.csv', dtype=dtypes, parse_dates=['click_time', 'attributed_time'])
df.to_parquet('../data/intermed/train_sample.parquet')

In [6]:
df = pd.read_csv('../data/raw/test_supplement.csv', dtype=dtypes, parse_dates=['click_time'])
df.to_parquet('../data/intermed/test_supplement.parquet')

In [2]:
df = pd.read_csv('../data/raw/test.csv', dtype=dtypes, parse_dates=['click_time'])
df.to_parquet('../data/intermed/test.parquet')