
# SkData - Data Specification

SkData provides a data class for structuring and organizing data preprocessing.

The data is stored in **hdf5** format. The original data is kept unchanged, and
all preprocessing steps are kept too and applied on demand.

To import data from a *csv* source:

```python
from skdata import SkData

sd = SkData('filename.h5')
sd.import_from(source='filename.csv')
```
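
The idea of keeping the original data untouched and applying the recorded preprocessing steps only on demand can be illustrated with plain pandas. This is a minimal sketch, not SkData's implementation: the `LazySteps` class and its `add_step` and `compute` methods are hypothetical names introduced only for this example.

```python
import pandas as pd


class LazySteps:
    """Hypothetical helper: keeps the original frame plus a list of steps."""

    def __init__(self, original: pd.DataFrame):
        self.original = original  # never modified by this class
        self.steps = []           # (description, function) pairs

    def add_step(self, description, func):
        # Only record the step; nothing is computed yet.
        self.steps.append((description, func))
        return self

    def compute(self) -> pd.DataFrame:
        # Replay every recorded step, in order, on a copy of the original.
        result = self.original.copy()
        for _, func in self.steps:
            result = func(result)
        return result


raw = pd.DataFrame({'Sex': ['male', 'female'], 'Age': [22.0, None]})

lazy = LazySteps(raw)
lazy.add_step(
    'capitalize Sex',
    lambda df: df.assign(Sex=df['Sex'].str.capitalize()),
)
lazy.add_step('drop rows with NaN', lambda df: df.dropna())

# The steps run only here; `raw` still holds the original values.
result = lazy.compute()
print(result)
print([description for description, _ in lazy.steps])
```

Because the steps are stored separately from the data, they can be inspected, reordered, or replayed on a fresh copy of the original data at any time.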

In [1]:
from skdata.data import (
    SkDataFrame as DataFrame,
    SkDataSeries as Series
)

In [2]:
import pandas as pd

## Importing data

In [3]:
df_train = DataFrame(
    pd.read_csv('../data/train.csv', index_col='PassengerId')
)

In [4]:
df_train.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [5]:
df_train.summary()

,Types,Set Values,Count Set,# Observations,# NaN
Survived,int64,"[0, 1]",2,891,0
Pclass,int64,"[1, 2, 3]",3,891,0
Name,object,"['Abbing, Mr. Anthony', 'Abbott, Mr. Rossmore ...",891,891,0
Sex,object,"['female', 'male']",2,891,0
Age,float64,"[0.42, 0.67, 0.75, 0.83, 0.92, 1.0, 2.0, 3.0, ...",88,714,177
SibSp,int64,"[0, 1, 2, 3, 4, 5, 8]",7,891,0
Parch,int64,"[0, 1, 2, 3, 4, 5, 6]",7,891,0
Ticket,object,"['110152', '110413', '110465', '110564', '1108...",681,891,0
Fare,float64,"[0.0, 4.0125, 5.0, 6.2375, 6.4375, 6.45, 6.495...",248,891,0
Cabin,object,"['A10', 'A14', 'A16', 'A19', 'A20', 'A23', 'A2...",147,204,687
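
A per-column summary with the same columns as the table above can be approximated in plain pandas. This is only a sketch of what such a summary might compute, not the code behind `summary()`; the `summarize` function is a hypothetical name used for illustration.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Build one summary row per column of `df` (illustrative only)."""
    records = []
    for name in df.columns:
        col = df[name]
        # Distinct non-missing values, sorted for a stable display.
        values = sorted(col.dropna().unique().tolist())
        records.append({
            'Types': str(col.dtype),
            'Set Values': values,
            'Count Set': len(values),
            '# Observations': int(col.count()),  # non-missing entries
            '# NaN': int(col.isna().sum()),      # missing entries
        })
    return pd.DataFrame(records, index=list(df.columns))


sample = pd.DataFrame({
    'Survived': [0, 1, 1],
    'Age': [22.0, None, 26.0],
})

summary = summarize(sample)
print(summary)
```

In this sketch, `# Observations` counts the non-missing entries, which matches the table above, where `Age` has 714 observations and 177 NaN values out of 891 rows.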


## Data preparing and cleaning

In [6]:
df_train['Sex'].replace({
    'male': 'Male', 'female': 'Female'
}, inplace=True)

df_train['Embarked'].replace({
    'C': 'Cherbourg', 'Q': 'Queenstown', 'S': 'Southampton'
}, inplace=True)
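
For comparison, the same relabeling can be written against a plain pandas `DataFrame` by assigning the result back to the column. With copy-on-write in recent pandas versions, calling `replace(..., inplace=True)` on a selected column may act on a temporary object, so explicit assignment is the less ambiguous form. The small frame below is made-up sample data, and how `SkDataFrame` itself handles this is not assumed here.

```python
import pandas as pd

# Made-up sample rows using the same codes as the Titanic columns.
df = pd.DataFrame({
    'Sex': ['male', 'female', 'male'],
    'Embarked': ['S', 'C', 'Q'],
})

# Assign the replaced column back instead of relying on inplace=True.
df['Sex'] = df['Sex'].replace({'male': 'Male', 'female': 'Female'})
df['Embarked'] = df['Embarked'].replace({
    'C': 'Cherbourg', 'Q': 'Queenstown', 'S': 'Southampton'
})

print(df)
```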

In [7]:
df_train.summary()

,Types,Set Values,Count Set,# Observations,# NaN
Survived,int64,"[0, 1]",2,891,0
Pclass,int64,"[1, 2, 3]",3,891,0
Name,object,"['Abbing, Mr. Anthony', 'Abbott, Mr. Rossmore ...",891,891,0
Sex,object,"['Female', 'Male']",2,891,0
Age,float64,"[0.42, 0.67, 0.75, 0.83, 0.92, 1.0, 2.0, 3.0, ...",88,714,177
SibSp,int64,"[0, 1, 2, 3, 4, 5, 8]",7,891,0
Parch,int64,"[0, 1, 2, 3, 4, 5, 6]",7,891,0
Ticket,object,"['110152', '110413', '110465', '110564', '1108...",681,891,0
Fare,float64,"[0.0, 4.0125, 5.0, 6.2375, 6.4375, 6.45, 6.495...",248,891,0
Cabin,object,"['A10', 'A14', 'A16', 'A19', 'A20', 'A23', 'A2...",147,204,687


In [8]:
df_train['Sex'].steps

AttributeError: 'Series' object has no attribute 'steps'

In [None]:
survived_dict = {0: 'Died', 1: 'Survived'}
pclass_dict = {1: 'Upper Class', 2: 'Middle Class', 3: 'Lower Class'}

sd['train']['Pclass'].categorize(categories=pclass_dict)
sd['train']['Survived'].categorize(categories=survived_dict)
sd['train']['Sex'].categorize()
sd['train']['Embarked'].categorize()

sd['train'].summary(compute=True)
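
The effect of categorizing a column, with or without a mapping of codes to labels, can be reproduced in plain pandas. This sketch assumes that `categorize(categories=...)` translates the raw codes into the given labels and that `categorize()` without arguments infers the categories from the data; the actual SkData behavior may differ.

```python
import pandas as pd

survived_dict = {0: 'Died', 1: 'Survived'}

# Made-up sample rows for illustration.
df = pd.DataFrame({
    'Survived': [0, 1, 1, 0],
    'Sex': ['male', 'female', 'female', 'male'],
})

# With a mapping: translate the codes, then fix the set of categories.
df['Survived'] = pd.Categorical(
    df['Survived'].map(survived_dict),
    categories=list(survived_dict.values()),
)

# Without a mapping: let pandas infer the categories from observed values.
df['Sex'] = df['Sex'].astype('category')

print(df.dtypes)
print(df['Survived'].cat.categories)
```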

In [None]:
sd['train'].result.head()

In [None]:
sd['train'].drop_columns(max_na_values=0.1)
sd['train'].summary(compute=True)

In [None]:
sd['train'].dropna()
sd['train'].summary(compute=True)

In [None]:
sd['train'].drop_columns(max_unique_values=0.3)
sd['train'].summary(compute=True)
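
The three cleaning cells above can be mirrored in plain pandas. The meaning of the thresholds is an assumption here: `max_na_values=0.1` is read as "drop columns with more than 10% missing values" and `max_unique_values=0.3` as "drop columns whose number of distinct values exceeds 30% of the rows". Both helper functions are hypothetical and are not part of SkData.

```python
import pandas as pd


def drop_sparse_columns(df, max_na_values):
    """Keep columns whose share of NaN values is within the threshold."""
    na_share = df.isna().mean()
    return df.loc[:, na_share <= max_na_values]


def drop_high_cardinality_columns(df, max_unique_values):
    """Keep columns whose share of distinct values is within the threshold."""
    unique_share = df.nunique() / len(df)
    return df.loc[:, unique_share <= max_unique_values]


# Made-up sample data: Cabin is mostly missing, Ticket is unique per row.
df = pd.DataFrame({
    'Pclass': [3, 1, 3, 1, 3, 1, 3, 1, 3, 1],
    'Cabin': [None, 'C85', None, 'C123', None,
              None, None, None, None, None],
    'Ticket': [f't{i}' for i in range(10)],
})

step1 = drop_sparse_columns(df, max_na_values=0.1)   # removes Cabin
step2 = step1.dropna()                               # no rows lost here
step3 = drop_high_cardinality_columns(step2, max_unique_values=0.3)  # removes Ticket

print(list(step3.columns))
```

Dropping the sparse columns before calling `dropna()` keeps more rows, since a mostly empty column such as `Cabin` would otherwise cause most rows to be discarded.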

In [None]:
print('STEPS:')
sd['train'].attr_load('steps')