# Exploratory Data Analysis: Part 2

## Dataset

- A census dataset, available on [GitHub](https://github.com/WinVector/zmPDSwR/blob/master/Custdata/custdata2.tsv), is used to demonstrate the steps of data cleaning, imputation and preparation. The dataset is taken from the book [Practical Data Science with R](https://www.manning.com/books/practical-data-science-with-r).

- The dataset contains information about customers, such as:
    - gender
    - whether he or she is currently employed
    - income
    - marital status
    - type of house where he or she lives
    - whether he or she moved in recently
    - number of vehicles owned
    - age
    - state of residence
    - whether he or she has insurance cover

In [1]:
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings("ignore")

## Read Dataset
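
A minimal sketch of the reading step is shown here. It assumes the file is tab-separated (as the `.tsv` extension suggests) and has been downloaded locally; the local path in the comment and the helper name `read_custdata` are assumptions. A tiny in-memory sample with made-up values stands in for the real file so that the snippet runs on its own.

```python
import io
import pandas as pd

# Made-up sample in tab-separated layout; the real file has more columns
# and rows. Empty fields are read as missing values (NaN).
sample_tsv = (
    "is.employed\tincome\tnum.vehicles\n"
    "TRUE\t22000\t2\n"
    "\t23200\t\n"
    "FALSE\t21000\t1\n"
)

def read_custdata(source):
    """Read a tab-separated customer file (path or file-like) into a DataFrame."""
    return pd.read_csv(source, sep="\t")

custdata = read_custdata(io.StringIO(sample_tsv))
# With the downloaded file (assumed local path):
# custdata = read_custdata("custdata2.tsv")
print(custdata.head())
```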

## Metadata Exploration
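
One way to inspect the metadata is sketched below on a small toy frame. The values are made up for illustration and only a few of the dataset's columns are included. `DataFrame.info()` lists column names, dtypes and non-null counts, and `DataFrame.count()` returns the non-null counts as a Series.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the customer data (values are made up).
custdata = pd.DataFrame({
    "is.employed": [True, np.nan, False, np.nan],
    "income": [22000, 23200, 21000, 37200],
    "num.vehicles": [2, np.nan, 1, 1],
})

# Column names, dtypes and non-null counts in one report.
custdata.info()

# Non-null values per column; a count below len(custdata) means missing data.
non_null = custdata.count()
print(non_null)
```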

#### Note:

* It can be observed that some of the columns like *is.employed*, *housing.type*, *recent.move* and *num.vehicles* have missing values.

## Visualize the missing values

* Libraries such as **missingno** are available for visualizing missing values and for checking whether the missing values are related, for example whether one column tends to be missing whenever another is missing.

* *missingno* can be installed through *pip*
    - pip install missingno

In [7]:
#!pip install missingno

In [8]:
import matplotlib.pyplot as plt
import seaborn as sn
import missingno as msno

#### Note:

* *is.employed* is missing from many observations. *housing.type*, *recent.move* and *num.vehicles* are also missing from some observations, but there seems to be a pattern to it. We can create a heatmap of these missing values and confirm this pattern.
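
The pattern can also be checked numerically. The sketch below computes the correlation between missing-value indicators using pandas only; `msno.heatmap` draws a comparable nullity correlation plot. The toy values are made up and deliberately constructed so that three columns are missing together, so the result says nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Toy data (made up): the last three columns are missing in the same rows.
custdata = pd.DataFrame({
    "is.employed":  [True, np.nan, False, np.nan, True, np.nan],
    "housing.type": ["Rented", np.nan, "Rented", "Owned", np.nan, "Owned"],
    "recent.move":  [False, np.nan, True, False, np.nan, False],
    "num.vehicles": [2, np.nan, 1, 1, np.nan, 3],
})

# 1 where a value is missing, 0 otherwise.
missing = custdata.isna().astype(int)

# Correlation of the indicators: values near 1 mean two columns tend to be
# missing in the same rows; values near 0 mean no linear relationship.
nullity_corr = missing.corr()
print(nullity_corr.round(2))
```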

### Count or percentage of missing values
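
A short sketch of both summaries, on made-up toy values: `isna().sum()` gives the count of missing values per column, and `isna().mean()` gives the fraction, which is scaled to a percentage.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the customer data (values are made up).
custdata = pd.DataFrame({
    "is.employed": [True, np.nan, False, np.nan],
    "income": [22000, 23200, 21000, 37200],
    "num.vehicles": [2, np.nan, 1, 1],
})

missing_count = custdata.isna().sum()        # number of missing values per column
missing_pct = custdata.isna().mean() * 100   # share of missing values, in percent

summary = pd.DataFrame({
    "missing_count": missing_count,
    "missing_pct": missing_pct.round(1),
})
print(summary)
```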

## Why are values missing?

- There can be multiple reasons why data is missing:

    - The data was not available at the time of capture.
    - It could be a recording error, with the field left empty intentionally or unintentionally.
    - The user might have intentionally not filled in the data.
    
- Missing data mechanisms are typically classified as either 
    - Missing Completely at Random (MCAR)
    - Missing at Random (MAR) 
    - Missing Not at Random (MNAR).

https://cran.r-project.org/web/packages/finalfit/vignettes/missing.html


For example,

- Quite a lot of values are missing in the *is.employed* column.
- One possible explanation is that the person does not have an active or full-time job: at the time of capture, a user who is not employed might simply not have filled in the employment information. Because the missingness would then depend on the unobserved value itself, this is MNAR.


## How to deal with missing values?

Several approaches can be taken:

- Obtain the missing data;
- Leave out incomplete cases and use only those for which all variables are available;
- Replace missing data by a conservative estimate, e.g. the sample mean;
- Estimate the missing data from the other data on the person.

### How much missing data is acceptable?

- As a rule of thumb, more than 20% missing values in a column is considered too much.
- In some domains, missing values cannot be imputed because doing so may carry a high risk or inject too much noise.

### Drop samples with missing values

* All rows with null values can be removed from the dataset, that is, every observation where at least one data element is missing is dropped.
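
A minimal sketch of this step with `DataFrame.dropna()`, on made-up toy values. With its defaults, `dropna()` removes every row that contains at least one missing value.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the customer data (values are made up).
custdata = pd.DataFrame({
    "is.employed": [True, np.nan, False, np.nan],
    "income": [22000, 23200, 21000, 37200],
    "num.vehicles": [2, np.nan, 1, 1],
})

# Keep only complete cases: rows with no missing value in any column.
complete_cases = custdata.dropna()

print(len(custdata), "rows before,", len(complete_cases), "rows after")
```

Note that this can discard a large share of the data when one column, such as *is.employed*, is missing in many rows.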

### Drop if values are missing from specific columns

* For example, only removing those observations where *is.employed* data is missing.
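
This can be sketched with the `subset` parameter of `dropna()`, which restricts the check to the listed columns. The toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the customer data (values are made up).
custdata = pd.DataFrame({
    "is.employed": [True, np.nan, False, True],
    "income": [22000, 23200, 21000, 37200],
    "num.vehicles": [2, np.nan, 1, np.nan],
})

# Drop a row only if is.employed is missing; other columns may still have gaps.
employed_known = custdata.dropna(subset=["is.employed"])
print(employed_known)
```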

## Imputation Techniques

- Impute with default values
- Impute with estimated values
    - Numerical Features - Mean or median based imputation
    - Categorical Features - Most Frequent 
- Model based imputations
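
The first two techniques can be sketched with `fillna()`. The toy values are made up, and the choice of a separate `"missing"` category as the default value for *is.employed* is an illustrative assumption, as are the `"employed"` and `"not employed"` labels.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the customer data (values are made up).
custdata = pd.DataFrame({
    "is.employed": [True, np.nan, False, True],
    "income": [22000.0, np.nan, 21000.0, 37200.0],
    "housing.type": ["Rented", "Rented", np.nan, "Owned"],
})
imputed = custdata.copy()

# Default value: recode to labels and give missing entries their own category.
imputed["is.employed"] = (
    imputed["is.employed"]
    .map({True: "employed", False: "not employed"})
    .fillna("missing")
)

# Numerical feature: fill with the median of the observed values.
imputed["income"] = imputed["income"].fillna(imputed["income"].median())

# Categorical feature: fill with the most frequent observed value.
most_frequent_housing = imputed["housing.type"].mode().iloc[0]
imputed["housing.type"] = imputed["housing.type"].fillna(most_frequent_housing)

print(imputed)
```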

## Income vs. Is Employed

### Is *is.employed* missing below a certain income level?
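
One way to explore this question is to compare the income distribution of rows where *is.employed* is missing against rows where it is present. The sketch below uses made-up toy values that were constructed to show such a pattern, so it says nothing about the real dataset; the helper column name `employed_missing` is an assumption.

```python
import numpy as np
import pandas as pd

# Toy data (made up): is.employed is missing only for the lower incomes.
custdata = pd.DataFrame({
    "is.employed": [True, np.nan, False, np.nan, True, np.nan],
    "income": [52000, 8000, 21000, 12000, 37200, 0],
})

# Flag rows where the employment status is missing.
custdata["employed_missing"] = custdata["is.employed"].isna()

# Summary statistics of income, split by the missingness flag.
income_by_missing = custdata.groupby("employed_missing")["income"].describe()
print(income_by_missing)

# Highest income among the rows with missing employment status.
max_income_when_missing = custdata.loc[custdata["employed_missing"], "income"].max()
print(max_income_when_missing)
```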

## Imputing Number of Vehicles

- *num.vehicles* is a discrete numerical variable. A most-frequent-value imputation strategy can be adopted for the missing values of this column.
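
A minimal sketch of most-frequent imputation for this column, on made-up toy values. `Series.mode()` ignores missing values and can return several values in case of a tie; taking the first one is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

# Toy values for num.vehicles (made up).
vehicles = pd.Series([2, np.nan, 1, 1, np.nan, 3], name="num.vehicles")

# Most frequent observed value; the first entry is used if there is a tie.
most_frequent = vehicles.mode().iloc[0]

vehicles_imputed = vehicles.fillna(most_frequent)
print(vehicles_imputed.tolist())
```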

### Using Model based Imputation

- The missing data can also be estimated from other variables, for example income, age, housing.type etc.

- We will discuss this later.

## When to apply imputation?

- We will discuss this later.

## Income vs. Insurance

## Gender vs. Insurance

## Income vs. State of Residence

## State of Residence vs. Housing Type

## Age vs. Income

## Data Binning

- Sometimes a continuous variable may need to be binned into categories. 
    - For example, it may make sense to divide age into ranges and create categories such as young, adult, middle-aged and old.
    - Similarly, income can be categorized into low-income, middle-income and high-income.
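
Binning of this kind can be sketched with `pd.cut`. The bin edges and the toy ages below are assumptions chosen for illustration, not values taken from the dataset.

```python
import pandas as pd

# Toy ages (made up).
ages = pd.Series([19, 27, 45, 63, 80], name="age")

# Assumed bin edges; intervals are right-inclusive: (0, 25], (25, 40], ...
bins = [0, 25, 40, 60, 120]
labels = ["young", "adult", "middle-aged", "old"]

age_group = pd.cut(ages, bins=bins, labels=labels)
print(age_group.tolist())
```

`pd.cut` uses fixed edges; when bins with roughly equal numbers of observations are preferred, `pd.qcut` is an alternative.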

## Income vs. Age Group
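
Once ages are binned, income can be summarized per group. The sketch below uses made-up toy values and assumed bin edges, and the choice of the median as the summary statistic is also an assumption.

```python
import pandas as pd

# Toy stand-in for the customer data (values are made up).
custdata = pd.DataFrame({
    "age": [19, 27, 45, 63, 80, 33],
    "income": [12000, 35000, 60000, 40000, 18000, 45000],
})

# Assumed bin edges and labels for the age groups.
custdata["age_group"] = pd.cut(
    custdata["age"],
    bins=[0, 25, 40, 60, 120],
    labels=["young", "adult", "middle-aged", "old"],
)

# Median income per age group; observed=True keeps only groups that occur.
median_income = custdata.groupby("age_group", observed=True)["income"].median()
print(median_income)
```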