# EDA - Exploratory Data Analysis

- Variables Identification
- Univariate Analysis
- Bi-variate Analysis
- Missing Values Handling
- Outliers
- Variables transformation
- Variable creation
- Re-assessment and iteration

## Load Data

In [6]:
import pandas as pd
import numpy as np

seed = 1234

In [2]:
# Load a dataset
# https://www.kaggle.com/c/titanic/data
df_titanic = pd.read_csv("data/titanic_train.csv")

In [26]:
df_titanic.shape

(891, 12)

There are 891 `observations` / `cases` in the dataset.

In [3]:
df_titanic.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [4]:
df_titanic.tail()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.45 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.75 | NaN | Q |


In [8]:
df_titanic.sample(5, random_state=seed)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 523 | 524 | 1 | 1 | Hippach, Mrs. Louis Albert (Ida Sophia Fischer) | female | 44.0 | 0 | 1 | 111361 | 57.9792 | B18 | C |
| 778 | 779 | 0 | 3 | Kilgannon, Mr. Thomas J | male | NaN | 0 | 0 | 36865 | 7.7375 | NaN | Q |
| 760 | 761 | 0 | 3 | Garfirth, Mr. John | male | NaN | 0 | 0 | 358585 | 14.5 | NaN | S |
| 496 | 497 | 1 | 1 | Eustis, Miss. Elizabeth Mussey | female | 54.0 | 1 | 0 | 36947 | 78.2667 | D20 | C |
| 583 | 584 | 0 | 1 | Ross, Mr. John Hugo | male | 36.0 | 0 | 0 | 13049 | 40.125 | A10 | C |


## Variables Identification

### Variables types
- **input** (predictors, independent features / variables / attributes)
- **output** (target, dependent feature / variable / attribute, ground truth)

### Variables categories
- Continuous / Discrete
- Categorical
    - Nominal
    - Ordinal
    
Categorical variables are qualitative; continuous and discrete variables are quantitative.

### Data types
- Numeric
- Textual

In [27]:
# list the variables in the dataset - see the description on Kaggle
# list and classify them
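
A minimal sketch of this step, assuming the five rows printed by `df_titanic.head()` above are copied into a small frame so that it runs without the CSV file. The names `df_head` and `roles` are illustrative, and the classification is one reasonable reading of the Kaggle data description rather than the only possible one.

```python
import pandas as pd

# The five rows printed by df_titanic.head() above (Name of row 1 is truncated in the output).
df_head = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Name": ["Braund, Mr. Owen Harris",
             "Cumings, Mrs. John Bradley (Florence Briggs Th...",
             "Heikkinen, Miss. Laina",
             "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
             "Allen, Mr. William Henry"],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "SibSp": [1, 1, 0, 1, 0],
    "Parch": [0, 0, 0, 0, 0],
    "Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "373450"],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Cabin": [None, "C85", None, "C123", None],
    "Embarked": ["S", "C", "S", "S", "S"],
})

# Data types: numeric vs. textual columns.
numeric_cols = df_head.select_dtypes(include="number").columns.tolist()
textual_cols = df_head.select_dtypes(exclude="number").columns.tolist()
print("numeric:", numeric_cols)
print("textual:", textual_cols)

# Hypothetical manual classification: variable -> (type, category).
roles = {
    "Survived": ("output", "categorical, nominal"),
    "Pclass": ("input", "categorical, ordinal"),
    "Sex": ("input", "categorical, nominal"),
    "Embarked": ("input", "categorical, nominal"),
    "Age": ("input", "continuous"),
    "Fare": ("input", "continuous"),
    "SibSp": ("input", "discrete"),
    "Parch": ("input", "discrete"),
}
for name, (var_type, category) in roles.items():
    print(f"{name:<10} {var_type:<7} {category}")
```

Note that a numeric data type does not imply a quantitative variable: `Survived` and `Pclass` are stored as integers but are categorical.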

## Univariate Analysis
Explore variables individually.

### Continuous Variables
- Understand the central tendency and spread (dispersion).
    - Central tendency
        - Mean
        - Median
        - Mode
    - Measures of dispersion
        - Min and Max
        - Range
        - Quartiles
        - IQR
        - Variance
        - Standard deviation
    - Shape of the distribution
        - Skewness and Kurtosis
- Identify missing values and outliers
- Visualization methods
    - Histogram
    - BoxPlot
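
The measures listed above can be computed directly with pandas. The sketch below assumes the `Fare` values of the 15 passengers printed by `head()`, `tail()` and `sample()` earlier, so it runs without the CSV file; on the full dataset the same calls would be applied to `df_titanic["Fare"]`.

```python
import pandas as pd

# Fare of the 15 passengers shown in the head(), tail() and sample() outputs.
fare = pd.Series(
    [7.25, 71.2833, 7.925, 53.1, 8.05,
     13.0, 30.0, 23.45, 30.0, 7.75,
     57.9792, 7.7375, 14.5, 78.2667, 40.125],
    name="Fare",
)

# Central tendency
print("mean     :", fare.mean())
print("median   :", fare.median())
print("mode     :", fare.mode().tolist())

# Dispersion
q1, q3 = fare.quantile(0.25), fare.quantile(0.75)
iqr = q3 - q1
print("min / max:", fare.min(), fare.max())
print("range    :", fare.max() - fare.min())
print("quartiles:", q1, fare.quantile(0.5), q3)
print("IQR      :", iqr)
print("variance :", fare.var())   # sample variance (ddof=1)
print("std      :", fare.std())   # sample standard deviation (ddof=1)

# Shape of the distribution
print("skewness :", fare.skew())
print("kurtosis :", fare.kurt())  # excess kurtosis (normal distribution = 0)
```

`fare.describe()` returns several of these values (count, mean, std, min, quartiles, max) in one call.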
    
### Categorical Variables
- Understand distribution of each category, get a sense of distribution of records across the categories
    - One-way frequency or relative frequency table (count and percentage of values under a category).
- Visualization methods
    - Bar chart

In [12]:
# One-way frequency table
pd.crosstab(index=df_titanic["Survived"], columns="count")

| Survived | count |
|---|---|
| 0 | 549 |
| 1 | 342 |


In [11]:
# One-way relative frequency table
pd.crosstab(index=df_titanic["Survived"], columns="frequency", normalize=True)

| Survived | frequency |
|---|---|
| 0 | 0.616162 |
| 1 | 0.383838 |


In [13]:
# Bar chart
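
One possible way to fill in this cell, sketched with the counts from the one-way frequency table above and assuming matplotlib is installed. The variable name `survived_counts` is illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs as a script; drop this line in a notebook
import pandas as pd

# Counts taken from the one-way frequency table of Survived.
survived_counts = pd.Series({"died": 549, "survived": 342}, name="count")

# One bar per category, bar height = number of passengers.
ax = survived_counts.plot(kind="bar", rot=0, title="Survived")
ax.set_ylabel("Number of passengers")
```

On the full dataset the same chart could be drawn with `df_titanic["Survived"].value_counts().plot(kind="bar")`.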

## Bi-variate Analysis
Find relationships between any two variables, continuous and categorical.

### Continuous and Continuous
Identify any linear or non-linear relationship (pattern) between two variables.

- Visualization methods
    - Scatter plot

<img src="images/correlations.png" alt="Correlations" style="width: 600px;"/>

Metrics
- Variance
- Covariance
- Correlation coefficient (Pearson, Spearman)

Summaries
- Correlation matrix
- Correlation heatmap
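
As a rough illustration, the covariance and both correlation coefficients can be computed with pandas. The sketch assumes the `Age` and `Fare` values of the 15 passengers printed earlier (three ages are missing); the frame name `pairs` is illustrative, and with so few rows the numbers only demonstrate the calls.

```python
import numpy as np
import pandas as pd

# Age and Fare of the 15 passengers shown in the head(), tail() and sample() outputs.
pairs = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0,
            27.0, 19.0, np.nan, 26.0, 32.0,
            44.0, np.nan, np.nan, 54.0, 36.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05,
             13.0, 30.0, 23.45, 30.0, 7.75,
             57.9792, 7.7375, 14.5, 78.2667, 40.125],
})

# Covariance matrix; rows with a missing value are skipped for the affected pair.
print(pairs.cov())

# Correlation matrices: Pearson measures linear association,
# Spearman measures monotonic association (it works on ranks).
pearson = pairs.corr(method="pearson")
spearman = pairs.corr(method="spearman")
print(pearson)
print(spearman)
```

A correlation heatmap is a color-coded rendering of such a matrix.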

### Categorical and Categorical
#### Methods
Two-way frequency tables, also known as `crosstabs` or `contingency tables`, show the counts for each combination of the categories of two categorical variables. Two-way relative frequency tables show the same information as proportions or percentages.

In the example below ([source](https://www.khanacademy.org/math/statistics-probability/analyzing-categorical-data/two-way-tables-for-categorical-data/a/two-way-tables-review)) there are two variables, gender and preference; this is where the "two" in two-way frequency table comes from. Each cell gives the number (frequency) of data points in that combination of categories.

<img src="images/two-way-frequency-table.png" alt="Correlations" style="width: 300px;"/>

Two-way relative frequency tables show what percent of data points fit in each category. We can use row relative frequencies or column relative frequencies, it just depends on the context of the problem.

<img src="images/two-way-relative-frequency-table.png" alt="Correlations" style="width: 600px;"/>

Sometimes the percentages won't add up to exactly 100% even though each one was rounded properly. This is called `round-off error`, and it is usually not a concern.

Two-way relative frequency tables are useful when there are different sample sizes in a dataset. In this example, more females were surveyed than males, so using percentages makes it easier to compare the preferences of males and females. From the relative frequencies, we can see that a large majority of males preferred dogs (78%) compared to a minority of females (41%).

- Visualization methods
    - Stacked column chart

In [18]:
# Two-way table
# Table of survival vs. sex
survived_sex = pd.crosstab(index=df_titanic["Survived"], columns=df_titanic["Sex"])
survived_sex.index= ["died", "survived"]
survived_sex

| Sex | female | male |
|---|---|---|
| died | 81 | 468 |
| survived | 233 | 109 |


In [20]:
# Table of survival vs passenger class
survived_class = pd.crosstab(index=df_titanic["Survived"], 
                            columns=df_titanic["Pclass"])

survived_class.columns = ["class1","class2","class3"]
survived_class.index= ["died","survived"]

survived_class

| | class1 | class2 | class3 |
|---|---|---|---|
| died | 80 | 97 | 372 |
| survived | 136 | 87 | 119 |


In [22]:
# get the marginal counts (totals for each row and column) by including the argument margins=True
# Table of survival vs passenger class
survived_class = pd.crosstab(index=df_titanic["Survived"], 
                            columns=df_titanic["Pclass"],
                             margins=True)   # Include row and column totals

survived_class.columns = ["class1","class2","class3","rowtotal"]
survived_class.index= ["died","survived","coltotal"]

survived_class

| | class1 | class2 | class3 | rowtotal |
|---|---|---|---|---|
| died | 80 | 97 | 372 | 549 |
| survived | 136 | 87 | 119 | 342 |
| coltotal | 216 | 184 | 491 | 891 |


In [23]:
survived_class = pd.crosstab(index=df_titanic["Survived"], 
                            columns=df_titanic["Pclass"],
                             margins=True, normalize=True)   # Include row and column totals

survived_class.columns = ["class1","class2","class3","rowtotal"]
survived_class.index= ["died","survived","coltotal"]

survived_class

| | class1 | class2 | class3 | rowtotal |
|---|---|---|---|---|
| died | 0.089787 | 0.108866 | 0.417508 | 0.616162 |
| survived | 0.152637 | 0.097643 | 0.133558 | 0.383838 |
| coltotal | 0.242424 | 0.20651 | 0.551066 | 1.0 |


`Stacked column chart`

In [24]:
# Stacked column chart
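
One possible way to fill in this cell, sketched with the counts from the survival vs. passenger class table above and assuming matplotlib is installed. The name `class_counts` is illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs as a script; drop this line in a notebook
import pandas as pd

# Counts taken from the two-way table of survival vs. passenger class.
class_counts = pd.DataFrame(
    {"class1": [80, 136], "class2": [97, 87], "class3": [372, 119]},
    index=["died", "survived"],
)

# Transpose so that each passenger class becomes one column of the chart,
# stacked by outcome (died / survived).
ax = class_counts.T.plot(kind="bar", stacked=True, rot=0)
ax.set_xlabel("Passenger class")
ax.set_ylabel("Number of passengers")
```

Plotting a table built with `normalize="columns"` instead of raw counts would give every column the same height, which makes the survival share per class easier to compare.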

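The `Chi-Square test` is used to assess the statistical significance of the relationship between two categorical variables. It tests whether the evidence in the sample is strong enough to generalize the relationship to a larger population. It is based on the difference between the observed and expected frequencies in the two-way table and returns a p-value: the probability of obtaining a chi-square statistic at least as large as the computed one, for the given degrees of freedom, if the variables were independent.
- p-value close to 0: strong evidence that the two categorical variables are dependent
- p-value close to 1: the data are consistent with independence
- p-value less than 0.05: the relationship between the variables is significant at the 5% level (95% confidence)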

In [None]:
# Chi-Square Test
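
One possible way to fill in this cell, assuming SciPy is available. The sketch uses the survival vs. sex counts from the two-way table above and `scipy.stats.chi2_contingency`; the 0.05 threshold is the conventional choice mentioned in the text.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Observed counts taken from the two-way table of survival vs. sex.
survived_sex = pd.DataFrame(
    {"female": [81, 233], "male": [468, 109]},
    index=["died", "survived"],
)

# Returns the statistic, the p-value, the degrees of freedom and the
# expected counts under independence.
chi2, p_value, dof, expected = chi2_contingency(survived_sex)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p-value = {p_value:.3g}")
print(pd.DataFrame(expected, index=survived_sex.index,
                   columns=survived_sex.columns).round(1))

alpha = 0.05
if p_value < alpha:
    print("Reject independence: survival and sex appear to be related.")
else:
    print("No evidence against independence at the 5% level.")
```

For 2x2 tables `chi2_contingency` applies Yates' continuity correction by default; passing `correction=False` gives the uncorrected statistic.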

Other statistical measures used to analyze the power of relationship are:
- Cramer's V for Nominal Categorical Variable
- Mantel-Haenszel Chi-Square for ordinal categorical variable
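
Cramér's V can be derived from the chi-square statistic as $V = \sqrt{\chi^2 / (n \cdot (\min(r, c) - 1))}$, where $n$ is the number of observations and $r$, $c$ are the numbers of rows and columns. A minimal sketch, assuming SciPy is available and using the survival vs. passenger class counts from the table above (the helper name `cramers_v` is illustrative, and passenger class is treated as nominal here):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts: rows = died / survived, columns = class1 / class2 / class3.
observed = np.array([[80, 97, 372],
                     [136, 87, 119]])


def cramers_v(table: np.ndarray) -> float:
    """Cramér's V for a two-way contingency table (0 = no association, 1 = perfect)."""
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))


v = cramers_v(observed)
print(f"Cramér's V = {v:.3f}")
```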

### Categorical and Continuous
To explore the relationship between a categorical and a continuous variable, we can draw box plots for each level of the categorical variable. When the number of levels is small, the plots by themselves do not show statistical significance; for that we can use the tests below.
#### Methods
- Z-test - tests whether the means of two groups are statistically different from each other
- T-test - like the Z-test, but used when each group has fewer than 30 observations
- ANOVA - tests whether the means of more than two groups are statistically different
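
A minimal sketch of a two-group comparison, assuming SciPy is available. It compares `Fare` between survivors and non-survivors for the 15 passengers printed earlier, using Welch's variant of the t-test (which does not assume equal variances; this choice is an assumption, not something the text prescribes). With so few rows the result only illustrates the call.

```python
import pandas as pd
from scipy import stats

# Survived and Fare of the 15 passengers shown in the head(), tail() and sample() outputs.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0,
                 0, 1, 0, 1, 0,
                 1, 0, 0, 1, 0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05,
             13.0, 30.0, 23.45, 30.0, 7.75,
             57.9792, 7.7375, 14.5, 78.2667, 40.125],
})

fare_survived = sample.loc[sample["Survived"] == 1, "Fare"]
fare_died = sample.loc[sample["Survived"] == 0, "Fare"]

# Welch's t-test: are the mean fares of the two groups different?
t_stat, p_value = stats.ttest_ind(fare_survived, fare_died, equal_var=False)
print(f"mean fare (survived) = {fare_survived.mean():.2f}")
print(f"mean fare (died)     = {fare_died.mean():.2f}")
print(f"t = {t_stat:.2f}, p-value = {p_value:.3f}")
```

For more than two groups, `scipy.stats.f_oneway` performs a one-way ANOVA and takes one array of values per group.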

## Missing Values Handling
Missing data in the training dataset can reduce the power / fit of a model or lead to a biased model, because the data no longer represent the relationships between variables correctly. The most common reasons for missing data are related to:
- **Data collection**. Difficult and usually time-consuming to correct.
- **Data extraction**. Easy to find and correct; mechanisms like hashing may help ensure that the data are extracted correctly.

<img src="images/missing-values-handling.png" alt="Correlations" style="width: 600px;"/>

[source](https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4)

The missing values treatment can be as follows:
- **Listwise or pairwise deletion**. It is important to understand that in the vast majority of cases, an important assumption to using either of these techniques is that your data is missing completely at random (MCAR). 

    In `listwise deletion` we delete observations where any of the variables is missing. It is a simple method, but it reduces the power of the model by reducing the sample size. 
    
    `Pairwise deletion` occurs when the statistical procedure uses cases that contain some missing data. The procedure cannot include a particular variable when it has a missing value, but it can still use the case when analyzing other variables with non-missing values. Pairwise deletion allows you to use more of your data. However, each computed statistic may be based on a different subset of cases.

    The choice between pairwise and listwise deletion matters only when several variables are analyzed together; it is not relevant when only one variable is being analyzed. In some situations, missing values may instead be treated as a valid category of their own. 
    
- **Mean / mode / median imputation**. Imputation means filling in the missing values with estimates; this is the most frequently used approach. It consists of replacing the missing data for a given attribute with a quantitative (mean, median) or qualitative (mode) estimate. The imputation can take the form of:

    - `Generalized imputation`. We calculate mean, median or mode for all non missing values of a variable and replace all missing values of this variable with the result.
    - `Similar case imputation`. We calculate mean, median or mode for similar cases only (looking at similarity of other variables of other cases without missing data for the variable in question).
    

- **Prediction model**. We create a predictive model (linear / logistic regression, tree, etc.) to estimate values that will substitute the missing data. To do this, we divide our dataset into two parts: one with no missing values for the variable (the training dataset, with the variable as the target), and another with missing values for the variable (the test dataset, used to predict the target / missing variable). We populate the missing values with the predicted ones.

- **KNN Imputation**. The missing values of a variable are imputed using the k observations that are most similar to the observation whose value is missing. The similarity is measured with a distance function.

    - Advantages: 
        - KNN can predict both qualitative and quantitative variables
        - Creation of a predictive model for each variable with missing data is not required
        - Variables with multiple missing values can be easily treated
        - Correlation of data is taken into consideration
    - Disadvantages:
        - KNN is time consuming
        - Choice of k-value is very critical 
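
The deletion and mean / median imputation options above can be sketched with pandas. The example assumes the `Pclass`, `Age` and `Fare` values of the 15 passengers printed earlier (three of them have no `Age`); using the median and grouping by `Pclass` as the notion of "similar cases" are illustrative choices, not the only ones.

```python
import numpy as np
import pandas as pd

# Pclass, Age and Fare of the 15 passengers shown in the head(), tail() and sample() outputs.
sample = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3,
               2, 1, 3, 1, 3,
               1, 3, 3, 1, 1],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0,
            27.0, 19.0, np.nan, 26.0, 32.0,
            44.0, np.nan, np.nan, 54.0, 36.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05,
             13.0, 30.0, 23.45, 30.0, 7.75,
             57.9792, 7.7375, 14.5, 78.2667, 40.125],
})

# Number of missing values per variable.
print(sample.isna().sum())

# Listwise deletion: drop every observation with at least one missing value.
listwise = sample.dropna()
print(len(sample), "->", len(listwise), "rows")

# Pairwise deletion: pandas statistics skip missing values per variable,
# so Age uses 12 values while Fare still uses all 15.
print(sample[["Age", "Fare"]].mean())

# Generalized imputation: one value (here the overall median) for all gaps.
age_generalized = sample["Age"].fillna(sample["Age"].median())

# Similar case imputation: median Age of passengers in the same class.
age_by_class = sample.groupby("Pclass")["Age"].transform("median")
age_similar = sample["Age"].fillna(age_by_class)

print(pd.DataFrame({"generalized": age_generalized, "similar": age_similar}))
```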

## Outliers

An `outlier` is an observation that appears far away and diverges from an overall pattern in a sample. Outliers can be of two types: univariate and multivariate. `Univariate outliers` can be found while looking at distribution of a single variable data. `Multivariate outliers` are outstanding observations in an n-dimensional space.

<img src="images/outlier.png" alt="Correlations" style="width: 300px;"/>

<img src="images/n-outlier.png" alt="Correlations" style="width: 600px;"/>

[source](https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/)
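
A common heuristic for flagging univariate outliers, which the text above does not prescribe, is the IQR rule used by box plots: values more than 1.5 × IQR below the first quartile or above the third quartile are flagged. A minimal sketch, where the helper name `iqr_outliers` and the values in `toy` are made up for illustration:

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k * IQR, Q3 + k * IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Made-up values, not taken from the Titanic dataset.
toy = pd.Series([10, 12, 11, 13, 12, 11, 100])
outliers = iqr_outliers(toy)
print(outliers)
```

On the full dataset the same helper could be applied to a single column, for example `iqr_outliers(df_titanic["Fare"])`.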



## Variables transformation

## Variables creation
