### Applying Ecce Homo to improve the machine learning workflow
In this notebook we explore three datasets from Kaggle competitions to see how well Ecce Homo automatically creates data products, and we contrast those products with the information shared by participants in those competitions.

In [1]:
import pandas as pd
import eccehomo

### Exploratory Data Analysis
Ecce Homo performs basic exploratory data analysis automatically and writes the plots to an HTML file. Basic exploratory data analysis covers:
* Descriptive statistics: count, mean, median, first and third quartiles, min and max.
* Unique values for categorical data.
* Aggregated data grouped by categorical features.
* Boxplots
* Histograms
* Empty values
* Correlations and heatmaps
* Pair scatter plots
* Bar plots on categorical data

The exploratory data analysis results are written to a separate file to keep your notebook clean.
For more help, use the help() function.
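
As a rough point of reference, the numeric items in the list above can be approximated with plain pandas. This is a minimal sketch on a small made-up frame (the `toy` data and its values are illustrative assumptions, not the Titanic data), and it says nothing about how Ecce Homo computes these steps internally.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the pandas calls.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 8.05],
    "Sex": ["male", "female", "female", "male"],
})

# Descriptive statistics for numeric columns: count, mean, std, min, quartiles, max.
summary = toy.describe()

# Pairwise Pearson correlations, restricted to the numeric columns.
correlations = toy.select_dtypes("number").corr()

print(summary)
print(correlations)
```

Selecting the numeric columns before calling `corr()` keeps text columns such as `Sex` out of the correlation matrix.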

In [2]:
df = pd.read_csv('data/titanic.csv')

### Data
We still have to understand what information each feature provides, even though the generated HTML file includes a brief explanation of the data.
* Survived: 0 is No, 1 is Yes.
* Pclass: Ticket class (1st, 2nd or 3rd).
* Sex: Gender of the passenger.
* Age: Age in years.
* Siblings/Spouses Aboard: Number of siblings or spouses aboard the Titanic.
* Parents/Children Aboard: Number of parents or children aboard the ship.
* Fare: Passenger fare (price of the ticket).
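
One way to confirm that the data matches the descriptions above is to check the coded columns against their allowed values. The sketch below assumes a few hypothetical rows (`sample`) and a hypothetical helper (`invalid_rows`); neither is part of Ecce Homo.

```python
import pandas as pd

# Hypothetical rows that follow the feature descriptions above.
sample = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
})

# Allowed codes, taken from the feature descriptions.
allowed = {"Survived": {0, 1}, "Pclass": {1, 2, 3}}

def invalid_rows(frame, column, valid_values):
    """Return the rows whose value in `column` is outside `valid_values`."""
    return frame[~frame[column].isin(valid_values)]

# Count the offending rows per coded column.
problems = {col: len(invalid_rows(sample, col, vals)) for col, vals in allowed.items()}
print(problems)  # {'Survived': 0, 'Pclass': 0}
```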

In [3]:
df.head(5)

| | Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Mr. Owen Harris Braund | male | 22.0 | 1 | 0 | 7.25 |
| 1 | 1 | 1 | Mrs. John Bradley (Florence Briggs Thayer) Cum... | female | 38.0 | 1 | 0 | 71.2833 |
| 2 | 1 | 3 | Miss. Laina Heikkinen | female | 26.0 | 0 | 0 | 7.925 |
| 3 | 1 | 1 | Mrs. Jacques Heath (Lily May Peel) Futrelle | female | 35.0 | 1 | 0 | 53.1 |
| 4 | 0 | 3 | Mr. William Henry Allen | male | 35.0 | 0 | 0 | 8.05 |


In [3]:
eda = eccehomo.EDA(df, 'Survived', output_path = "data")
eda.describe(to_print = True)

         Survived      Pclass         Age  Siblings/Spouses Aboard  \
count  887.000000  887.000000  887.000000               887.000000   
mean     0.385569    2.305524   29.471443                 0.525366   
std      0.487004    0.836662   14.121908                 1.104669   
min      0.000000    1.000000    0.420000                 0.000000   
25%      0.000000    2.000000   20.250000                 0.000000   
50%      0.000000    3.000000   28.000000                 0.000000   
75%      1.000000    3.000000   38.000000                 1.000000   
max      1.000000    3.000000   80.000000                 8.000000   

       Parents/Children Aboard       Fare  
count               887.000000  887.00000  
mean                  0.383315   32.30542  
std                   0.807466   49.78204  
min                   0.000000    0.00000  
25%                   0.000000    7.92500  
50%                   0.000000   14.45420  
75%                   0.000000   31.13750  
max              

In [4]:
eda.unique_values(unique_size = 20)
# There are two categorical variables; one is Name, which is not important for our exploration
eda.groupby(aggregators =['Sex'])
eda.boxplot()
eda.histograms()
eda.empty_values(to_print = True)
eda.correlations()
eda.barplot()
# A max-column parameter avoids plotting every column by selecting random features to plot
eda.scatter()
eda.make_html(name = 'summary_titanic_eda')

Survived                   0
Pclass                     0
Name                       0
Sex                        0
Age                        0
Siblings/Spouses Aboard    0
Parents/Children Aboard    0
Fare                       0
dtype: int64


<Figure size 432x288 with 0 Axes>

<Figure size 1080x1080 with 0 Axes>
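
To cross-check the report, several of the steps in the cell above have close pandas counterparts. The sketch below is an assumption about what such steps compute, not Ecce Homo's implementation; the `toy` frame (with one deliberately missing age) and the `max_unique` threshold are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with one missing Age value.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 35.0],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Unique values for non-numeric columns with at most `max_unique` distinct values.
max_unique = 20
categorical = toy.select_dtypes(exclude="number").columns
uniques = {
    col: sorted(toy[col].dropna().unique())
    for col in categorical
    if toy[col].nunique() <= max_unique
}

# Mean of every numeric column within each category of "Sex".
by_sex = toy.groupby("Sex").mean(numeric_only=True)

# Number of missing values per column.
missing = toy.isna().sum()

print(uniques)
print(by_sex)
print(missing)
```

A cardinality threshold like this keeps free-text columns such as passenger names out of the unique-value listing once the data has more distinct names than the threshold.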

### HTML file
This cell displays the images saved by the cell above.<br>
**Please note:** if you open the HTML file in JupyterLab, the images will not be displayed. Open it in your browser instead; it worked when tested in Google Chrome.
<img src="boxplot1.png" alt="Not found"><br><img src="boxplot2.png" alt="Not found"><br><img src="boxplot3.png" alt="Not found"><br><img src="boxplot4.png" alt="Not found"><br><img src="boxplot5.png" alt="Not found"><br><img src="histograms1.png" alt="Not found"><br><img src="histograms2.png" alt="Not found"><br><img src="correlation.png" alt="Not found"><br><img src="pairplot.png" alt="Not found"><br>
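
The image markup above follows a simple pattern that can be generated from a list of file names. The helper below, `build_gallery`, is a hypothetical illustration and not a function provided by Ecce Homo. Because the `src` values are relative paths, a browser resolves them against the location of the HTML file, so the PNG files need to sit in the same folder.

```python
from html import escape

def build_gallery(image_names, alt_text="Not found"):
    """Return an HTML fragment with one <img> tag per file, separated by <br>."""
    tags = [
        f'<img src="{escape(name, quote=True)}" alt="{escape(alt_text, quote=True)}">'
        for name in image_names
    ]
    return "<br>".join(tags)

# File names taken from the markup above.
fragment = build_gallery(["boxplot1.png", "histograms1.png", "correlation.png"])
print(fragment)
```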