<h1>Format Data for Machine Learning</h1>

Objectives

* Learn what format scikit-learn expects data to be in
* Convert data into appropriate format (Pandas DataFrames to NumPy arrays) 

## Features Matrix and Target Vector

Before you can train a machine learning model, your data must be in a format the model accepts. For supervised learning, that means a features matrix and a target (a vector or a matrix), both defined below. 

<b>features matrix</b>: Two-dimensional grid of data where rows represent samples and columns represent features. 

<b>target</b>: Usually one-dimensional. In supervised learning, it is what you want to predict from the data.

The difference for unsupervised learning is that you only need a features matrix.  
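
As a minimal sketch of these two definitions, the snippet below builds a features matrix and a target vector from a tiny DataFrame. The column names and values are made up for illustration and are not part of the abalone dataset.

```python
import pandas as pd

# Hypothetical toy data, invented for this example
toy = pd.DataFrame({
    'height_cm': [150.0, 160.0, 170.0],
    'weight_kg': [50.0, 60.0, 70.0],
    'age': [20, 30, 40],
})

# Features matrix: two-dimensional, rows are samples, columns are features
X_toy = toy.loc[:, ['height_cm', 'weight_kg']].values

# Target vector: one-dimensional, one value per sample
y_toy = toy.loc[:, 'age'].values

print(X_toy.shape)  # (3, 2)
print(y_toy.shape)  # (3,)
```

For unsupervised learning, you would build only `X_toy` and skip `y_toy`.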


![images](images/DataFormattingSupervisedUnsupervised.png)

![images](images/abaloneFirst5Rows.png)

Let's now go over how to make sure your data is in an acceptable format.

## Import Libraries

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.datasets import load_iris

## Load the Dataset
Abalone is a mollusc with a peculiar ear-shaped shell lined with mother-of-pearl. Its age can be estimated by counting the number of rings in its shell under a microscope.  

![](images/abalone.png)

The code below loads a modified version of the abalone dataset. This dataset can be used for regression if you predict the number of rings (Rings), or for classification if you predict Sex (M, F, I). The only difference between this version and the original abalone dataset is that the modified version contains a missing value. 

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/mGalarnyk/Tutorial_Data/master/Abalone/abalone.csv')

In [None]:
df.head()
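
The text above mentions a missing value but does not say where it is. One way to locate missing values is sketched below on a small stand-in DataFrame (illustrative values, not the real file); the same two calls could be run on `df`.

```python
import numpy as np
import pandas as pd

# Stand-in data with one deliberately missing entry (illustrative only)
toy = pd.DataFrame({
    'Length': [0.455, np.nan, 0.530],
    'Rings': [15, 7, 9],
})

# Count missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Show only the rows that contain at least one missing value
rows_with_missing = toy[toy.isnull().any(axis=1)]
print(rows_with_missing)
```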

## Regression: Arrange Data into Features Matrix and Target Vector 
In regression, the target vector holds continuous (numeric) values, such as Rings. 

In [None]:
feature_names = ['Length',
                 'Diameter',
                 'Height',
                 'Whole weight', 
                 'Shucked weight',
                 'Viscera weight',
                 'Shell weight']

In [None]:
# Multiple column features matrix to convert to NumPy Array
df.loc[:, feature_names]

In [None]:
# Convert to numpy array
X = df.loc[:, feature_names].values

In [None]:
# Make sure NumPy array is two dimensional
X.shape
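
The two-dimensional check matters most when you use a single feature. The sketch below, on a small stand-in DataFrame with illustrative values, shows that selecting a column by name gives a one-dimensional array, while selecting with a list of names keeps it two-dimensional. `reshape(-1, 1)` is another way to get the same result.

```python
import pandas as pd

# Stand-in data (illustrative values only)
toy = pd.DataFrame({'Length': [0.455, 0.350, 0.530],
                    'Rings': [15, 7, 9]})

# A column label selects a Series, so the array is one-dimensional
one_d = toy.loc[:, 'Length'].values
print(one_d.shape)  # (3,)

# A list of labels selects a DataFrame, so the array stays two-dimensional
two_d = toy.loc[:, ['Length']].values
print(two_d.shape)  # (3, 1)

# Reshape a 1D array into a single-column 2D array
reshaped = one_d.reshape(-1, 1)
print(reshaped.shape)  # (3, 1)
```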

In [None]:
# Pandas series to convert to NumPy Array
df.loc[:, 'Rings']

In [None]:
y = df.loc[:, 'Rings'].values

In [None]:
y.shape
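
Once both arrays exist, it can be worth confirming that they line up: `X` should be two-dimensional, `y` one-dimensional, and both should have the same number of rows. The helper below is a hypothetical convenience function written for this example, not part of scikit-learn, and the arrays are small stand-ins for `X` and `y`.

```python
import numpy as np

def check_xy(X, y):
    """Hypothetical helper: confirm X is 2D, y is 1D, and row counts match."""
    if X.ndim != 2:
        raise ValueError(f"X should be 2D, got {X.ndim}D")
    if y.ndim != 1:
        raise ValueError(f"y should be 1D, got {y.ndim}D")
    if X.shape[0] != y.shape[0]:
        raise ValueError(f"X has {X.shape[0]} rows but y has {y.shape[0]} values")
    return True

# Stand-ins for the arrays built above (illustrative values only)
X_toy = np.array([[0.455, 0.365], [0.350, 0.265], [0.530, 0.420]])
y_toy = np.array([15, 7, 9])

print(check_xy(X_toy, y_toy))  # True
```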

## Classification: Arrange Data into Features Matrix and Target Vector 
In classification, the target vector holds categorical values, such as Sex. 

In [None]:
feature_names = ['Length',
                 'Diameter',
                 'Height',
                 'Whole weight', 
                 'Shucked weight',
                 'Viscera weight',
                 'Shell weight']

In [None]:
# Multiple column features matrix to convert to NumPy Array
df.loc[:, feature_names]

In [None]:
# Convert to numpy array
X = df.loc[:, feature_names].values

In [None]:
# Make sure NumPy array is two dimensional
X.shape

In [None]:
# Pandas series to convert to NumPy Array
df.loc[:, 'Sex']

In [None]:
y = df.loc[:, 'Sex'].values

In [None]:
y.shape
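
With a categorical target, it helps to look at which classes are present and how often each occurs. The sketch below does this with `np.unique` on a small stand-in DataFrame (illustrative values, not the real file); the same call could be applied to `y`.

```python
import numpy as np
import pandas as pd

# Stand-in data (illustrative values only)
toy = pd.DataFrame({'Sex': ['M', 'F', 'I', 'M']})
y_toy = toy.loc[:, 'Sex'].values

# Distinct class labels (sorted) and the number of samples in each
classes, counts = np.unique(y_toy, return_counts=True)

for label, count in zip(classes, counts):
    print(label, count)
```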

## Common question

### Why is X a capitalized variable?
The X (and sometimes Y) variable is a matrix.

In some math notations, it is common practice to write vector variable names as lower case and matrix variable names as upper case. Often these are in bold or have other annotation, but that does not translate well to code. The practice may have transferred from this notation.

You may also notice that in code, when the target variable is a single column of values, it is written as a lowercase y.

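Of course, this has no special meaning in Python and you are free to ignore the convention. If you name the variable x instead of X, it won't cause errors, as long as you use the same name consistently (Python names are case-sensitive, so x and X are different variables).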

However, because it has become a convention, it may be worth maintaining if you share your code.

Answer roughly taken from: https://datascience.stackexchange.com/questions/17598/why-are-variables-of-train-and-test-data-defined-using-the-capital-letter-in-py