# Exploring Data with Pandas and Seaborn

## The Machine Learning Workflow

The 'bigger picture' of machine learning involves a workflow of different stages, usually starting with problem formulation and data exploration, and finishing with model deployment.

In this notebook, you will begin to explore data using the `pandas` and `seaborn` libraries.


### Exercise 1

Read through the following articles to get an idea of what a machine learning workflow looks like:

https://www.kdnuggets.com/2018/05/general-approaches-machine-learning-process.html

https://www.kdnuggets.com/2018/12/machine-learning-project-checklist.html

Also watch the video below:

<a href="http://www.youtube.com/watch?feature=player_embedded&v=nKW8Ndu7Mjw" target="_blank"><img src="http://img.youtube.com/vi/nKW8Ndu7Mjw/0.jpg" 
alt="YouTube video thumbnail" width="240" height="180" border="10" /></a>

#### Comparison of two particular workflows

Different workflows are often quite similar, for example:

![Workflows](Images/workflows.png)

One stage that most workflows share is **exploring the data**.

`pandas` and `seaborn` are useful libraries for exploring data.

### Exercise 2 

Read through Notebook 1.3 on Tabular Data and Pandas.

Watch the following video as well:

<a href="http://www.youtube.com/watch?feature=player_embedded&v=mJ9KajSVG0Q" target="_blank"><img src="http://img.youtube.com/vi/mJ9KajSVG0Q/0.jpg" 
alt="YouTube video thumbnail" width="240" height="180" border="10" /></a>


### Exercise 3

Download `winequality-red.csv` from: https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009


Below we import the libraries we will need (`pandas`, `numpy`, `matplotlib.pyplot` and `seaborn`):

In [None]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns

### Exercise 4 

Read `winequality-red.csv` into a pandas DataFrame.
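One possible approach is `pd.read_csv`. The sketch below is a minimal illustration, not the only way to do it: so that it runs without the downloaded file, it reads from a small in-memory stand-in whose rows are made up and whose columns are only a subset of the real ones. The variable names `SAMPLE_CSV` and `wine` are assumptions chosen for this example.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded file: a few made-up rows and only
# a subset of the columns, so the example runs without the real CSV.
SAMPLE_CSV = io.StringIO(
    "volatile acidity,sulphates,alcohol,quality\n"
    "0.70,0.56,9.4,5\n"
    "0.88,0.68,9.8,5\n"
    "0.28,0.58,9.8,6\n"
    "0.66,0.56,9.4,5\n"
)

# With the real data, pass the path instead of the in-memory buffer:
#     wine = pd.read_csv("winequality-red.csv")
wine = pd.read_csv(SAMPLE_CSV)

print(wine.shape)             # (number of rows, number of columns)
print(wine.columns.tolist())  # column names taken from the header row
```

If every value ends up in a single column when you load the real file, the file is probably using a different delimiter; passing `sep=";"` to `pd.read_csv` is worth trying in that case.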

### Exercise 5

Look at the first 10 rows of the data.
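`DataFrame.head(n)` returns the first `n` rows. The sketch below uses a small synthetic DataFrame (random values, not the wine data) as a stand-in, so the name `wine` and its two columns are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the wine data: 25 rows of random values.
rng = np.random.default_rng(0)
wine = pd.DataFrame(
    {
        "alcohol": rng.uniform(8.5, 14.0, size=25).round(1),
        "quality": rng.integers(3, 9, size=25),
    }
)

# head() defaults to 5 rows; pass 10 to see the first ten.
top10 = wine.head(10)
print(top10)
```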

### Exercise 6

Check whether any data is missing (null values).
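A common way to check is to combine `isnull()` with `sum()`, which counts the missing entries per column. In the sketch below, the DataFrame is a made-up stand-in with one value deliberately set to `NaN` so that the check has something to find; the real file may contain no missing values at all.

```python
import numpy as np
import pandas as pd

# Made-up stand-in data with one missing value inserted on purpose.
wine = pd.DataFrame(
    {
        "alcohol": [9.4, np.nan, 9.8, 10.0],
        "pH": [3.51, 3.20, 3.26, 3.16],
        "quality": [5, 5, 5, 6],
    }
)

# isnull() gives a True/False table; summing counts the True values per column.
missing_per_column = wine.isnull().sum()
print(missing_per_column)

# A single yes/no answer for the whole DataFrame.
has_missing = bool(wine.isnull().values.any())
print("Any missing values?", has_missing)
```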

### Exercise 7 

Print basic summary statistics for each column in the dataset.
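`DataFrame.describe()` is one way to get the count, mean, standard deviation, minimum, quartiles and maximum of every numeric column. The sketch below runs it on a tiny made-up DataFrame rather than the real wine data.

```python
import pandas as pd

# Made-up stand-in data; the real DataFrame has more rows and columns.
wine = pd.DataFrame(
    {
        "alcohol": [9.4, 9.8, 9.8, 10.0],
        "quality": [5, 5, 5, 6],
    }
)

# One row per statistic, one column per numeric column of the DataFrame.
summary = wine.describe()
print(summary)

# info() complements describe() with dtypes and non-null counts.
wine.info()
```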

### Exercise 8 

Use seaborn to draw a heatmap showing the correlations between all of the columns.

Try to make it look as good as you can. The following pages may help:

- https://seaborn.pydata.org/examples/many_pairwise_correlations.html

- https://likegeeks.com/seaborn-heatmap-tutorial/
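As a starting point, the sketch below computes the correlation matrix with `DataFrame.corr()` and passes it to `sns.heatmap`, hiding the redundant upper triangle in the style of the first link. The data is synthetic (random numbers with a relationship built in for illustration), so the values in the plot say nothing about the real wine data; the styling choices are suggestions rather than requirements.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data: "quality" is constructed from the other columns
# plus noise, purely so the heatmap has some structure to show.
rng = np.random.default_rng(42)
n = 200
alcohol = rng.normal(10.4, 1.0, n)
volatile_acidity = rng.normal(0.53, 0.18, n)
ph = rng.normal(3.3, 0.15, n)
noise = rng.normal(0.0, 0.6, n)
quality = np.clip(
    np.round(5.6 + 0.4 * (alcohol - 10.4) - 1.5 * (volatile_acidity - 0.53) + noise),
    3,
    8,
).astype(int)
wine = pd.DataFrame(
    {
        "alcohol": alcohol,
        "volatile acidity": volatile_acidity,
        "pH": ph,
        "quality": quality,
    }
)

# Pairwise Pearson correlations between all columns.
corr = wine.corr()

# Boolean mask that hides the upper triangle (and the diagonal), since the
# matrix is symmetric and the diagonal is always 1.
mask = np.triu(np.ones_like(corr, dtype=bool))

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(
    corr,
    mask=mask,
    cmap=sns.diverging_palette(230, 20, as_cmap=True),
    vmin=-1,
    vmax=1,
    center=0,          # put the neutral colour at zero correlation
    annot=True,        # write the correlation value in each cell
    fmt=".2f",
    square=True,
    linewidths=0.5,
    cbar_kws={"shrink": 0.75},
    ax=ax,
)
ax.set_title("Correlation heatmap (synthetic data)")
fig.tight_layout()
# In a notebook the figure renders inline; in a script, call plt.show().
```

Fixing `vmin=-1`, `vmax=1` and `center=0` keeps the colour scale symmetric, so positive and negative correlations of the same strength get equally intense colours.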

### Exercise 9 

Suppose `quality` is the label and the other columns are the features.

Which two features would you choose first to try to predict quality, and why?

Alcohol and volatile acidity: they have the strongest correlations with quality in absolute value (alcohol is positively correlated with quality, volatile acidity negatively).
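A ranking like this can be read off the heatmap, or computed directly by sorting the `quality` column of the correlation matrix by absolute value. The sketch below does so on a small hand-made DataFrame whose values were invented to mimic the pattern described above; it illustrates the technique and is not a result from the real dataset.

```python
import pandas as pd

# Invented stand-in data: alcohol rises with quality, volatile acidity falls,
# and pH has almost no relationship with quality.
wine = pd.DataFrame(
    {
        "alcohol": [9.0, 9.2, 9.5, 10.0, 10.5, 11.0, 11.8, 12.5],
        "volatile acidity": [1.0, 0.9, 0.7, 0.75, 0.5, 0.55, 0.4, 0.3],
        "pH": [3.3, 3.1, 3.4, 3.2, 3.3, 3.1, 3.4, 3.2],
        "quality": [3, 4, 5, 5, 6, 6, 7, 8],
    }
)

# Correlation of every feature with the label (the label itself is dropped).
corr_with_quality = wine.corr()["quality"].drop("quality")

# Rank by absolute value so strong negative correlations count as well.
ranked = corr_with_quality.abs().sort_values(ascending=False)
top2 = ranked.head(2)

print(corr_with_quality)
print("Top 2 features:", top2.index.tolist())
```

Taking the absolute value matters here: a strongly negative correlation is just as informative for prediction as a strongly positive one. Correlation only captures linear relationships, though, so this is a first screening step rather than a full feature-selection method.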

### Exercise 10 

If you have time, browse the seaborn example gallery at https://seaborn.pydata.org/examples/index.html and experiment with other types of plots you find interesting and informative.
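As one example of another plot type, a box plot shows how a feature is distributed within each quality score. The sketch below uses `sns.boxplot` on made-up stand-in data; the choice of `alcohol` as the feature is only an illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up stand-in data with a few wines per quality score.
wine = pd.DataFrame(
    {
        "quality": [5, 5, 5, 6, 6, 6, 7, 7, 7],
        "alcohol": [9.4, 9.8, 9.5, 10.2, 10.8, 10.5, 11.5, 12.0, 11.8],
    }
)

fig, ax = plt.subplots(figsize=(6, 4))

# One box per quality score, summarising the alcohol values in that group.
sns.boxplot(data=wine, x="quality", y="alcohol", ax=ax)
ax.set_title("Alcohol by quality score (synthetic data)")
fig.tight_layout()
# In a notebook the figure renders inline; in a script, call plt.show().
```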