# Table of Contents
* [1 Scikit-Data Introduction](#Scikit-Data-Introduction)
  * [1.1 Data Preparation](#Data-Preparation)
    * [1.1.1 Variables description:](#Variables-description:)
    * [1.1.2 Data cleaning](#Data-cleaning)
  * [1.2 Widget](#Widget)
  * [1.3 Predictions](#Predictions)
    * [1.3.1 Prepare data](#Prepare-data)
    * [1.3.2 Classification](#Classification)
    * [1.3.3 Regression](#Regression)
  * [1.4 Conclusion](#Conclusion)

# Scikit-Data Introduction

The Scikit-Data library offers a set of tools to help data analysts
in their work.

For now it is a small set of simple features, such as converting a dataframe
into a cross-tabulation (crosstab) dataframe based on selected fields.

It also provides a Jupyter widget with interactive options for exploring
the data through graphical and tabular output.

To import the Scikit-Data Jupyter Widget just use the following code:

```python
from skdata.widgets import DataAnalysisWidget
```

In [1]:
try:
    from skdata.widgets import DataAnalysisWidget
except ImportError:
    # development version: fall back to the checkout one directory up
    import sys
    import os

    sys.path.insert(0, os.path.abspath('../'))
    from skdata.widgets import DataAnalysisWidget

from matplotlib import pyplot as plt

import numpy as np
import pandas as pd

## Data Preparation

The data used in this example comes from the Kaggle Titanic challenge.

### Variables description:

* survival: Survival (0 = No; 1 = Yes)
* pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
* name: Name
* sex: Sex
* age: Age
* sibsp: Number of Siblings/Spouses Aboard
* parch: Number of Parents/Children Aboard
* ticket: Ticket Number
* fare: Passenger Fare
* cabin: Cabin
* embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

SPECIAL NOTES:

* Pclass is a proxy for socio-economic status (SES): 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower.
* Age is in Years; Fractional if Age less than One (1). If the Age is Estimated, it is in the form xx.5.

With respect to the family relation variables (i.e. sibsp and parch)
some relations were ignored. The following are the definitions used
for sibsp and parch.

* Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
* Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
* Parent: Mother or Father of Passenger Aboard Titanic
* Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins,
nephews/nieces, aunts/uncles, and in-laws. Some children travelled
only with a nanny, therefore parch=0 for them. As well, some
travelled with very close friends or neighbors in a village, however,
the definitions do not support such relations.

In [2]:
data = pd.read_csv('../data/train.csv', index_col='PassengerId')

data.head()

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


With the *DataAnalysisWidget* class, you can wrap an existing *pandas.DataFrame*
or load a CSV file from a given path:

```python
daw = DataAnalysisWidget(data)
```

or

```python
daw = DataAnalysisWidget.load(file_path)
```

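The summary method returns an overview of each column: its type, its set of
distinct values, the number of distinct values, the number of observations and
the number of NaN values: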

```python
daw.summary()
```

In [3]:
daw = DataAnalysisWidget(data)
daw.summary()

| | Types | Set Values | Count Set | # Observations | # NaN |
| --- | --- | --- | --- | --- | --- |
| Survived | int64 | [0, 1] | 2 | 891 | 0 |
| Pclass | int64 | [1, 2, 3] | 3 | 891 | 0 |
| Name | object | ['Abbing, Mr. Anthony', 'Abbott, Mr. Rossmore ... | 891 | 891 | 0 |
| Sex | object | ['female', 'male'] | 2 | 891 | 0 |
| Age | float64 | [0.42, 0.67, 0.75, 0.83, 0.92, 1.0, 2.0, 3.0, ... | 88 | 714 | 177 |
| SibSp | int64 | [0, 1, 2, 3, 4, 5, 8] | 7 | 891 | 0 |
| Parch | int64 | [0, 1, 2, 3, 4, 5, 6] | 7 | 891 | 0 |
| Ticket | object | ['110152', '110413', '110465', '110564', '1108... | 681 | 891 | 0 |
| Fare | float64 | [0.0, 4.0125, 5.0, 6.2375, 6.4375, 6.45, 6.495... | 248 | 891 | 0 |
| Cabin | object | ['A10', 'A14', 'A16', 'A19', 'A20', 'A23', 'A2... | 147 | 204 | 687 |
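
The library's own implementation of the summary is not shown here. As a rough illustration of what such an overview contains, a similar table can be built with plain pandas. The helper name `summarize` and the toy data are assumptions made for this sketch only:

```python
import numpy as np
import pandas as pd


def summarize(df):
    """Hypothetical helper: per-column overview similar to the table above."""
    records = []
    for col in df.columns:
        values = df[col].dropna()  # NaN is counted separately
        records.append({
            'Types': str(df[col].dtype),
            'Set Values': sorted(values.unique().tolist()),
            'Count Set': int(values.nunique()),
            '# Observations': int(values.shape[0]),
            '# NaN': int(df[col].isna().sum()),
        })
    return pd.DataFrame(records, index=list(df.columns))


# Toy data, not the Titanic file
toy = pd.DataFrame({
    'Survived': [0, 1, 1, 0],
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Sex': ['male', 'female', 'female', 'male'],
})
overview = summarize(toy)
print(overview)
```

Here "# Observations" counts only non-missing values, which matches the table above, where Age shows 714 observations and 177 NaN out of 891 rows.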


### Data cleaning

If you need to convert a feature to a categorical type, you can call the prepare_data
method with a dictionary that maps each feature name to another dictionary
of old values and new values, such as:

```python
daw.prepare_data({
    'field_name1': {'old_value1': 'new_value1', 'old_value2': 'new_value2'},
    'field_name2': {'old_value1': 'new_value1', 'old_value2': 'new_value2'}
})
```

In [4]:
survived_dict = {0: 'Died', 1: 'Survived'}
pclass_dict = {1: 'Upper Class', 2: 'Middle Class', 3: 'Lower Class'}
sex_dict = {'male': 'Male', 'female': 'Female'}
embarked_dict = {'C': 'Cherbourg', 'Q': 'Queenstown', 'S': 'Southampton'}

daw.prepare_data({
    'Survived': survived_dict,
    'Pclass': pclass_dict,
    'Sex': sex_dict,
    'Embarked': embarked_dict
})

daw.summary()

| | Types | Set Values | Count Set | # Observations | # NaN |
| --- | --- | --- | --- | --- | --- |
| Survived | category | ['Died', 'Survived'] | 2 | 891 | 0 |
| Pclass | category | ['Lower Class', 'Middle Class', 'Upper Class'] | 3 | 891 | 0 |
| Name | object | ['Abbing, Mr. Anthony', 'Abbott, Mr. Rossmore ... | 891 | 891 | 0 |
| Sex | category | ['Female', 'Male'] | 2 | 891 | 0 |
| Age | float64 | [0.42, 0.67, 0.75, 0.83, 0.92, 1.0, 2.0, 3.0, ... | 88 | 714 | 177 |
| SibSp | int64 | [0, 1, 2, 3, 4, 5, 8] | 7 | 891 | 0 |
| Parch | int64 | [0, 1, 2, 3, 4, 5, 6] | 7 | 891 | 0 |
| Ticket | object | ['110152', '110413', '110465', '110564', '1108... | 681 | 891 | 0 |
| Fare | float64 | [0.0, 4.0125, 5.0, 6.2375, 6.4375, 6.45, 6.495... | 248 | 891 | 0 |
| Cabin | object | ['A10', 'A14', 'A16', 'A19', 'A20', 'A23', 'A2... | 147 | 204 | 687 |
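
For readers who want to see the same kind of recoding without the widget, the sketch below does it with plain pandas. It is only an approximation of what prepare_data appears to do, judging by the summary above. The function name `recode_as_category` and the toy frame are assumptions:

```python
import pandas as pd


def recode_as_category(df, mappings):
    """Hypothetical stand-in: map old values to new labels column by column,
    then store each recoded column with the pandas 'category' dtype."""
    out = df.copy()
    for column, mapping in mappings.items():
        out[column] = out[column].map(mapping).astype('category')
    return out


# Toy data, not the Titanic file
toy = pd.DataFrame({'Survived': [0, 1, 1], 'Sex': ['male', 'female', 'male']})
recoded = recode_as_category(toy, {
    'Survived': {0: 'Died', 1: 'Survived'},
    'Sex': {'male': 'Male', 'female': 'Female'},
})
print(recoded.dtypes)
```

With `Series.map`, any value missing from the mapping becomes NaN, so the mapping should cover every value that occurs in the column.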


In [5]:
daw.dropna_columns(threshold=0.10)
daw.summary()

| | Types | Set Values | Count Set | # Observations | # NaN |
| --- | --- | --- | --- | --- | --- |
| Survived | category | ['Died', 'Survived'] | 2 | 891 | 0 |
| Pclass | category | ['Lower Class', 'Middle Class', 'Upper Class'] | 3 | 891 | 0 |
| Name | object | ['Abbing, Mr. Anthony', 'Abbott, Mr. Rossmore ... | 891 | 891 | 0 |
| Sex | category | ['Female', 'Male'] | 2 | 891 | 0 |
| SibSp | int64 | [0, 1, 2, 3, 4, 5, 8] | 7 | 891 | 0 |
| Parch | int64 | [0, 1, 2, 3, 4, 5, 6] | 7 | 891 | 0 |
| Ticket | object | ['110152', '110413', '110465', '110564', '1108... | 681 | 891 | 0 |
| Fare | float64 | [0.0, 4.0125, 5.0, 6.2375, 6.4375, 6.45, 6.495... | 248 | 891 | 0 |
| Embarked | category | ['Cherbourg', 'Queenstown', 'Southampton'] | 3 | 889 | 2 |
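
In this output Age (177 of 891 values missing, about 20%) and Cabin (687 of 891, about 77%) are gone, while Embarked (2 of 891 missing) remains. That is consistent with dropping every column whose share of NaN values exceeds the threshold. The sketch below shows that rule in plain pandas. The helper name `drop_sparse_columns` is hypothetical, and whether the library compares with `>` or `>=` is an assumption:

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, threshold):
    """Hypothetical helper: keep only columns whose NaN fraction is at most
    the threshold (assumed rule, inferred from the output above)."""
    nan_fraction = df.isna().mean()  # share of missing values per column
    return df.loc[:, nan_fraction <= threshold]


# Toy data: 'Age' has 3 of 10 values missing, 'Fare' has none
toy = pd.DataFrame({
    'Fare': [7.25, 71.28, 7.92, 53.1, 8.05, 8.46, 51.86, 21.07, 11.13, 30.07],
    'Age': [22.0, 38.0, np.nan, 35.0, np.nan, np.nan, 54.0, 2.0, 27.0, 14.0],
})
kept = drop_sparse_columns(toy, threshold=0.10)
print(kept.columns.tolist())
```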


In [6]:
daw.data.dropna(inplace=True)
daw.summary()

| | Types | Set Values | Count Set | # Observations | # NaN |
| --- | --- | --- | --- | --- | --- |
| Survived | category | ['Died', 'Survived'] | 2 | 889 | 0 |
| Pclass | category | ['Lower Class', 'Middle Class', 'Upper Class'] | 3 | 889 | 0 |
| Name | object | ['Abbing, Mr. Anthony', 'Abbott, Mr. Rossmore ... | 889 | 889 | 0 |
| Sex | category | ['Female', 'Male'] | 2 | 889 | 0 |
| SibSp | int64 | [0, 1, 2, 3, 4, 5, 8] | 7 | 889 | 0 |
| Parch | int64 | [0, 1, 2, 3, 4, 5, 6] | 7 | 889 | 0 |
| Ticket | object | ['110152', '110413', '110465', '110564', '1108... | 680 | 889 | 0 |
| Fare | float64 | [0.0, 4.0125, 5.0, 6.2375, 6.4375, 6.45, 6.495... | 247 | 889 | 0 |
| Embarked | category | ['Cherbourg', 'Queenstown', 'Southampton'] | 3 | 889 | 0 |


In [7]:
daw.drop_columns_with_unique_values(threshold=0.3)
daw.summary()

| | Types | Set Values | Count Set | # Observations | # NaN |
| --- | --- | --- | --- | --- | --- |
| Survived | category | ['Died', 'Survived'] | 2 | 889 | 0 |
| Pclass | category | ['Lower Class', 'Middle Class', 'Upper Class'] | 3 | 889 | 0 |
| Sex | category | ['Female', 'Male'] | 2 | 889 | 0 |
| SibSp | int64 | [0, 1, 2, 3, 4, 5, 8] | 7 | 889 | 0 |
| Parch | int64 | [0, 1, 2, 3, 4, 5, 6] | 7 | 889 | 0 |
| Fare | float64 | [0.0, 4.0125, 5.0, 6.2375, 6.4375, 6.45, 6.495... | 247 | 889 | 0 |
| Embarked | category | ['Cherbourg', 'Queenstown', 'Southampton'] | 3 | 889 | 0 |
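
Name (889 distinct values in 889 rows) and Ticket (680 of 889, about 0.76) were removed, while Fare (247 of 889, about 0.28) stayed. This suggests that a column is dropped when the ratio of distinct values to observations exceeds the threshold. A minimal pandas sketch of that rule follows. The helper name `drop_high_cardinality_columns` is hypothetical, and the exact comparison the library uses is an assumption:

```python
import pandas as pd


def drop_high_cardinality_columns(df, threshold):
    """Hypothetical helper: keep columns where distinct values divided by
    non-missing observations is at most the threshold (assumed rule)."""
    unique_ratio = df.nunique() / df.count()
    return df.loc[:, unique_ratio <= threshold]


# Toy data: 'Name' is unique per row (ratio 1.0), 'Sex' has ratio 2/8 = 0.25
toy = pd.DataFrame({
    'Name': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
    'Sex': ['Male', 'Female'] * 4,
})
kept = drop_high_cardinality_columns(toy, threshold=0.3)
print(kept.columns.tolist())
```

Columns that are nearly unique per row, such as names or ticket numbers, act like identifiers and carry little information for a cross-tabulation, which is presumably why they are removed before charting.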


## Widget

You can use the show_chart method to set the parameters of a chart that
displays a cross-tabulation of the selected fields:
    
```python
daw.show_chart(
    field_reference='field_of_reference',
    fields_comparison=['field1']
)
```

This method uses the given parameters to build and display a chart and
a data table.

In [8]:
%matplotlib notebook

daw.show_chart(
    field_reference='Survived',
    fields_comparison=['Pclass']
)
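
The widget output is interactive and is not reproduced here. To give an idea of the table behind such a chart, the sketch below builds a cross-tabulation directly with `pandas.crosstab`. It is not the widget's own code, and the toy data is invented for the example:

```python
import pandas as pd

# Toy data, not the Titanic file
toy = pd.DataFrame({
    'Survived': ['Died', 'Survived', 'Survived', 'Died', 'Died', 'Survived'],
    'Pclass': ['Lower Class', 'Upper Class', 'Lower Class',
               'Lower Class', 'Middle Class', 'Upper Class'],
})

# Rows: the comparison field; columns: the reference field
table = pd.crosstab(toy['Pclass'], toy['Survived'])
print(table)
```

Calling `table.plot(kind='bar')` on the result would draw the same counts as a grouped bar chart with matplotlib.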

## Predictions

In [9]:
from sklearn.metrics import f1_score, accuracy_score, precision_score

### Prepare data

In [10]:
train = daw.data.copy()
# Labels are copied before the encoding below, so y_train keeps 'Died'/'Survived'
y_train = train['Survived'].copy()

# Encode the category labels of the training set as integers
train.replace('Male', 1, inplace=True)
train.replace('Female', 2, inplace=True)
train.replace('Cherbourg', 1, inplace=True)
train.replace('Queenstown', 2, inplace=True)
train.replace('Southampton', 3, inplace=True)
train.replace('Upper Class', 1, inplace=True)
train.replace('Middle Class', 2, inplace=True)
train.replace('Lower Class', 3, inplace=True)
train.replace('Died', 0, inplace=True)
train.replace('Survived', 1, inplace=True)

test = pd.read_csv('../data/test.csv')
# Reference labels for the test set (the file has PassengerId and Survived columns)
y_test = pd.read_csv('../data/genderclassmodel.csv')

test['Survived'] = y_test.values.tolist()
# Keep the identifier plus the same columns as the training set
test = test[[train.index.name] + train.keys().tolist()]

# The test file still holds the raw codes, so encode them the same way
test.replace('male', 1, inplace=True)
test.replace('female', 2, inplace=True)
test.replace('C', 1, inplace=True)
test.replace('Q', 2, inplace=True)
test.replace('S', 3, inplace=True)

# Only these two columns are used as model inputs
features = ['Pclass', 'Sex']
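
A call such as `train.replace('Male', 1, inplace=True)` searches every column of the frame. A per-column mapping keeps the encoding explicit and lets the same table be reused for other frames with the same labels. The sketch below is one possible way to write it. `ENCODING` and `encode` are hypothetical names, and the toy data is invented:

```python
import pandas as pd

# Hypothetical encoding table: one mapping per column
ENCODING = {
    'Sex': {'Male': 1, 'Female': 2},
    'Embarked': {'Cherbourg': 1, 'Queenstown': 2, 'Southampton': 3},
    'Pclass': {'Upper Class': 1, 'Middle Class': 2, 'Lower Class': 3},
    'Survived': {'Died': 0, 'Survived': 1},
}


def encode(df, encoding=ENCODING):
    """Return a copy of df with the listed columns mapped to integers."""
    out = df.copy()
    for column, mapping in encoding.items():
        if column in out.columns:
            # astype(object) so category and plain string columns behave alike
            out[column] = out[column].astype(object).map(mapping).astype(int)
    return out


# Toy data, not the Titanic file
toy = pd.DataFrame({
    'Sex': pd.Categorical(['Male', 'Female', 'Female']),
    'Survived': ['Died', 'Survived', 'Survived'],
    'Fare': [7.25, 71.28, 7.92],
})
encoded = encode(toy)
print(encoded)
```

The final `astype(int)` raises an error if a value is missing from the mapping, which surfaces typos in the table early instead of silently producing NaN.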

### Classification

In [11]:
from sklearn.ensemble import RandomForestClassifier

In [12]:
predicted = test[['PassengerId']].copy()

# y_test holds the whole genderclassmodel.csv frame (PassengerId and Survived).
# Passing it to the metrics as-is raises
# "ValueError: Can't handle mix of multiclass-multioutput and binary",
# so score against the Survived column only.
y_true = y_test['Survived']

for Model in [
    RandomForestClassifier
]:
    model = Model()
    # Fit on the integer-encoded labels so predictions are 0/1 like y_true
    # (y_train still holds 'Died'/'Survived')
    model.fit(train[features], train['Survived'].astype(int))

    predicted['Survived'] = model.predict(test[features])

    print('\n')
    print(Model)
    print('f1 score')
    print(f1_score(y_true, predicted['Survived']))
    print('precision score')
    print(precision_score(y_true, predicted['Survived']))
    print('accuracy score')
    print(accuracy_score(y_true, predicted['Survived']))
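
The three scores requested in the loop all expect two one-dimensional label vectors of equal length. As a reminder of what they measure, the sketch below computes them by hand with NumPy for 0/1 labels, treating 1 as the positive class. The function name `binary_scores` and the sample labels are made up for illustration. This is not how scikit-learn implements them:

```python
import numpy as np


def binary_scores(y_true, y_pred):
    """Accuracy, precision and F1 for 0/1 labels, with 1 as the positive class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives
    accuracy = float(np.mean(y_true == y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {'accuracy': accuracy, 'precision': precision, 'f1': f1}


scores = binary_scores([0, 1, 1, 0, 1], [0, 1, 0, 1, 1])
print(scores)
```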

### Regression

In [None]:
from sklearn.svm import SVR  # SVR is the regression counterpart of SVC

## Conclusion

These are initial features intended to help you handle and explore data
quickly.