# SkData - Data Specification

SkData provides a data class for structuring and organizing data preprocessing.

The data is stored in **HDF5** format. The original data is kept unchanged, and
every preprocessing step is recorded as well and applied on demand.
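
The idea of keeping the original data and applying the preprocessing steps on demand can be illustrated with a minimal sketch in plain pandas. This is not SkData's implementation: the `LazyFrame` class and its `add_step` method are hypothetical names used only for illustration, and the sketch keeps everything in memory instead of HDF5.

```python
import pandas as pd


class LazyFrame:
    """Hypothetical stand-in: keeps the original data and replays recorded steps."""

    def __init__(self, original: pd.DataFrame):
        self._original = original  # never modified
        self._steps = []           # recorded transformations, in order

    def add_step(self, func):
        # Only record the step; nothing is computed yet.
        self._steps.append(func)
        return self

    def compute(self) -> pd.DataFrame:
        # Apply every recorded step to a copy of the original data.
        df = self._original.copy()
        for step in self._steps:
            df = step(df)
        return df


raw = pd.DataFrame({"Sex": ["male", "female", "female"], "Age": [22.0, 38.0, None]})

lazy = LazyFrame(raw)
lazy.add_step(lambda df: df.assign(Sex=df["Sex"].astype("category")))
lazy.add_step(lambda df: df.assign(Age=df["Age"].fillna(df["Age"].mean())))

result = lazy.compute()
```

Here `raw` still contains the missing `Age` value after `compute()` runs, while `result` reflects both recorded steps.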

To import data from *csv* source:

```python
from skdata import SkData

sd = SkData('filename.h5')
sd.import_from(source='filename.csv')
```

In [1]:
try:
    from skdata import SkData
except ImportError:
    # fall back to the development version in the parent directory
    import sys
    import os

    sys.path.insert(0, os.path.abspath('../'))
    from skdata import SkData
    

In [2]:
sd = SkData('/tmp/titanic.h5')

sd.import_from(
    source='../data/train.csv', 
    index_col='PassengerId',
    target_col='Survived',
    dset_id='train'
)

In [3]:
df = sd.compute('train')

In [4]:
df.head()

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Name        891 non-null object
Sex         891 non-null object
Age         714 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Ticket      891 non-null object
Fare        891 non-null float64
Cabin       204 non-null object
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB


In [6]:
sd.categorize(dset_id='train', col_name='Pclass')
sd.categorize(dset_id='train', col_name='Survived')
sd.categorize(dset_id='train', col_name='Sex')
sd.categorize(dset_id='train', col_name='Embarked')

df = sd.compute('train')
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
Survived    891 non-null category
Pclass      891 non-null category
Name        891 non-null object
Sex         891 non-null category
Age         714 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Ticket      891 non-null object
Fare        891 non-null float64
Cabin       204 non-null object
Embarked    889 non-null category
dtypes: category(4), float64(2), int64(2), object(3)
memory usage: 59.6+ KB
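
The dtype change and the lower memory usage reported above match what a plain pandas categorical conversion does. The sketch below assumes that `categorize` behaves roughly like `astype('category')`; that is an assumption about SkData, not something confirmed by its source. It uses a toy column rather than the Titanic data.

```python
import pandas as pd

# Toy column: few distinct values repeated many times.
df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"] * 250})

before = df["Embarked"].memory_usage(deep=True)

# Assumed pandas equivalent of sd.categorize(...): store integer codes
# plus a small lookup table of categories instead of repeated strings.
df["Embarked"] = df["Embarked"].astype("category")

after = df["Embarked"].memory_usage(deep=True)

print(df["Embarked"].dtype)
print(f"before: {before} bytes, after: {after} bytes")
```

The saving grows with the number of rows and shrinks as the number of distinct values approaches the number of rows, so columns such as `Name` or `Ticket` are poor candidates for this conversion.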
