# Exploratory Data Analysis (EDA) in Python <br />  <hr style="border:4.5px solid #108999">

This is my notebook on an example EDA pipeline in Python.

## How to quickly get a handle on almost any tabular dataset

Becoming intimately familiar with a new dataset can be challenging and time-consuming. However, a broad and in-depth exploratory data analysis (EDA) goes a long way toward understanding your dataset, getting a feel for how its features are connected, and seeing what needs to be done to process it properly.

This notebook explores several useful EDA routines. To keep things short and compact, it does not always dig deeper or explain every implication. Still, spending enough time on a proper EDA to fully understand a dataset is a key part of any good data science project.

>> As a rule of thumb, you will probably spend 80% of your time on data preparation and exploration and only 20% on actual machine learning modeling.

---
## Overview

This is an exploratory data analysis of the House Prices Kaggle competition, found at:

https://www.kaggle.com/c/house-prices-advanced-regression-techniques

* **To download the data directly from the Kaggle competition, you first need to accept the competition rules.**
 > Go to the competition's Rules tab and accept them [here](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/rules).

---

1. [Import Libraries](#t1.)
2. [Get the Data](#t2.) 
3. [Structure Investigation](#t3.)
    * 3.1. [Structure of *non-numerical* features](#t3.1.) 
    * 3.2. [Structure of *numerical* features](#t3.2.)
    * 3.3. [Conclusion of structure investigation](#t3.3.)  
4. [Quality Investigation](#t4.)
    * 4.1. [Duplicates](#t4.1.)
    * 4.2. [Missing Values](#t4.2.)
        * 4.2.1 [Per Sample](#t4.2.1.)
        * 4.2.2 [Per Feature](#t4.2.2.)
        * 4.2.3 [Side Note](#t4.2.3.)
    * 4.3. [Unwanted entries and recording errors](#t4.3.)
        * 4.3.1 [Numerical Features](#t4.3.1.)
        * 4.3.2 [Non-Numerical Features](#t4.3.2.)
    * 4.4. [Conclusion of quality investigation](#t4.4.)
5. [Content Investigation](#t5.)
    * 5.1. [Feature Distribution](#t5.1.)
    * 5.2. [Feature Patterns](#t5.2.)
        * 5.2.1 [Continuous Features](#t5.2.1.)
        * 5.2.2 [Discrete and Ordinal Features](#t5.2.2.)
    * 5.3. [Feature Relationships](#t5.3.)
    * 5.4. [Conclusion of content investigation](#t5.4.)
6. [Take Home Message](#t6.)

## Outline
___

<a id="t1."></a>
## 1. Import Libraries

In [2]:
# Data handling
import pandas as pd
import numpy as np

# Plotting
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Statistics and modeling
import scipy.stats as st
from sklearn import ensemble, tree, linear_model

# Missing-value visualization
import missingno as msno

<a id="t2."></a>
## 2. Get the Data

Using pandas, we can load the data from a CSV file into a DataFrame.

If you would like to download the data directly from Kaggle, install the Kaggle API client with `pip install kaggle` or `conda install -c conda-forge kaggle`, then use the `kaggle.api` package to download the data.

Otherwise, download the data manually and load the dataset with: `df = pd.read_csv("data.csv")`
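
As a rough illustration of what `pd.read_csv` returns, the sketch below reads a few invented rows from an in-memory string instead of a file. The column names mirror ones found in the House Prices data, but the values are made up purely for demonstration; with the real data you would pass a file path instead.

```python
import io

import pandas as pd

# Invented rows standing in for a real CSV file (values are NOT from the dataset).
# With the actual data you would pass a path, e.g. pd.read_csv("train.csv").
csv_text = """Id,LotArea,SalePrice
1,5000,100000
2,7500,150000
3,10000,200000
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # data type pandas inferred for each column
print(df.head())  # first rows, for a quick visual check
```

Checking `shape`, `dtypes`, and `head()` right after loading is a cheap way to confirm the file was parsed as expected before any deeper analysis.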

### Downloading the data directly from Kaggle using the API:

The Kaggle API requires an API token. Go to the Account tab (`https://www.kaggle.com/<username>/account`) and click ‘Create API Token’. A file named `kaggle.json` will be downloaded. Move this file into the `~/.kaggle/` folder on Mac and Linux, or into `C:\Users\<username>\.kaggle\` on Windows. The token is required for authentication, so do not skip this step.
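
Before authenticating, it can help to confirm that the token file is where the text above says it should be. The following is a minimal sketch using only the standard library; the helper name `find_kaggle_token` is hypothetical and not part of the Kaggle package.

```python
from pathlib import Path


def find_kaggle_token(config_dir=None):
    """Return the path to kaggle.json if it exists, otherwise None.

    config_dir defaults to the '.kaggle' folder inside the user's home
    directory, which matches the locations described above on Mac, Linux,
    and Windows.
    """
    if config_dir is None:
        config_dir = Path.home() / ".kaggle"
    token_path = Path(config_dir) / "kaggle.json"
    return token_path if token_path.is_file() else None


token = find_kaggle_token()
if token is None:
    print("kaggle.json not found; create an API token on your Kaggle account page first.")
else:
    print(f"Found API token at {token}")
```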

In [3]:
# Initializing and Authenticating With API Server

from kaggle.api.kaggle_api_extended import KaggleApi
api = KaggleApi()
api.authenticate()

* **Reminder: the download only works after you have accepted the competition rules.**
 > If you have not done so yet, accept them on the [Rules tab](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/rules), then run the code below.

In [4]:
# Download the training and test sets into the current directory
competition = 'house-prices-advanced-regression-techniques'
api.competition_download_file(competition, 'train.csv', path='./', force=False, quiet=True)
api.competition_download_file(competition, 'test.csv', path='./', force=False, quiet=True)
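
Once the download finishes, the files can be read with pandas. Depending on the version of the `kaggle` package, a file may arrive as plain `train.csv` or zipped as `train.csv.zip`; this is an assumption worth checking in your own folder, so the sketch below tries both. The helper name `read_downloaded_csv` is hypothetical.

```python
from pathlib import Path

import pandas as pd


def read_downloaded_csv(file_name, folder="./"):
    """Read a downloaded CSV, falling back to '<file_name>.zip' if needed."""
    folder = Path(folder)
    for candidate in (folder / file_name, folder / f"{file_name}.zip"):
        if candidate.is_file():
            # pandas infers zip compression from the '.zip' extension
            return pd.read_csv(candidate)
    raise FileNotFoundError(
        f"Neither {file_name} nor {file_name}.zip found in {folder}"
    )
```

With the files in the current directory, usage would look like `df_train = read_downloaded_csv("train.csv")` and `df_test = read_downloaded_csv("test.csv")`.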