<h1>Format Data for Machine Learning</h1>

Notebook Goals

* Learn what format scikit-learn expects data to be in
* Convert data into appropriate format (Pandas DataFrames to NumPy arrays) 

## Features Matrix and Target Vector

Before you can use a machine learning model, your data must be in a form the model accepts. For supervised learning, you need a features matrix and a target vector, both defined below.

<b>features matrix</b>: Two-dimensional grid of data where rows represent samples and columns represent features. 

<b>target vector</b>: Usually one-dimensional. In supervised learning, it holds what you want to predict from the data.

For unsupervised learning, you only need a features matrix.
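As a minimal sketch of these two definitions, the toy example below builds a features matrix and a target vector from a small made-up table. The column names (`feature_a`, `feature_b`, `target`) and their values are hypothetical and only illustrate the expected shapes.

```python
import pandas as pd

# Hypothetical toy data: 3 samples, 2 features, 1 target column
toy = pd.DataFrame({
    'feature_a': [0.1, 0.2, 0.3],
    'feature_b': [1.0, 2.0, 3.0],
    'target': [10, 20, 30],
})

# Features matrix: two-dimensional, shape (n_samples, n_features)
X_toy = toy.loc[:, ['feature_a', 'feature_b']].values

# Target vector: one-dimensional, shape (n_samples,)
y_toy = toy.loc[:, 'target'].values

print(X_toy.shape)  # (3, 2)
print(y_toy.shape)  # (3,)
```

For unsupervised learning, you would stop after building `X_toy`.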


![images](images/DataFormattingSupervisedUnsupervised.png)

![images](images/abaloneFirst5Rows.png)

Let's now go over how to make sure your data is in an acceptable format.

## Import Libraries

In [1]:
%matplotlib inline

import matplotlib.pyplot as plt
import pandas as pd

from sklearn.datasets import load_iris

## Load the Dataset
Abalone is a mollusc with a peculiar ear-shaped shell lined with mother-of-pearl. Its age can be estimated by counting the number of rings in its shell under a microscope.

![](images/abalone.png)

The code below loads a modified version of the abalone dataset. You can use it for regression if you predict the number of rings, or for classification if you predict Sex (M, F, I). The only difference from the original abalone dataset is that this modified version contains a missing value.

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/mGalarnyk/Tutorial_Data/master/Abalone/abalone.csv')

In [3]:
df.head()

| | Sex | Length | Diameter | Height | Whole weight | Shucked weight | Viscera weight | Shell weight | Rings |
|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 | 15 |
| 1 | M | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 | 7 |
| 2 | F | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 | 9 |
| 3 | M | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | 0.155 | 10 |
| 4 | I | 0.33 | 0.255 | 0.08 | 0.205 | 0.0895 | 0.0395 | 0.055 | 7 |
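
Because this modified dataset contains a missing value, it can help to count missing entries per column before building the arrays. The sketch below uses a small stand-in DataFrame named `sample`, with one value deliberately removed, so that it runs on its own. It is not the real abalone data; with the notebook's data you would call the same methods on `df`.

```python
import numpy as np
import pandas as pd

# Stand-in for a few abalone rows, with one value deliberately set to missing
sample = pd.DataFrame({
    'Length': [0.455, 0.350, np.nan],
    'Diameter': [0.365, 0.265, 0.420],
    'Rings': [15, 7, 9],
})

# Count missing values in each column
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Total number of missing values in the whole table
total_missing = int(missing_per_column.sum())
print(total_missing)  # 1
```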


## Regression: Arrange Data into Features Matrix and Target Vector 
For regression, the target vector is continuous, like Rings.

In [4]:
feature_names = ['Length',
                 'Diameter',
                 'Height',
                 'Whole weight', 
                 'Shucked weight',
                 'Viscera weight',
                 'Shell weight']

In [5]:
# Multiple column features matrix to convert to NumPy Array
df.loc[:, feature_names]

| | Length | Diameter | Height | Whole weight | Shucked weight | Viscera weight | Shell weight |
|---|---|---|---|---|---|---|---|
| 0 | 0.455 | 0.365 | 0.095 | 0.5140 | 0.2245 | 0.1010 | 0.1500 |
| 1 | 0.350 | 0.265 | 0.090 | 0.2255 | 0.0995 | 0.0485 | 0.0700 |
| 2 | 0.530 | 0.420 | 0.135 | 0.6770 | 0.2565 | 0.1415 | 0.2100 |
| 3 | 0.440 | 0.365 | 0.125 | 0.5160 | 0.2155 | 0.1140 | 0.1550 |
| 4 | 0.330 | 0.255 | 0.080 | 0.2050 | 0.0895 | 0.0395 | 0.0550 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 4172 | 0.565 | 0.450 | 0.165 | 0.8870 | 0.3700 | 0.2390 | 0.2490 |
| 4173 | 0.590 | 0.440 | 0.135 | 0.9660 | 0.4390 | 0.2145 | 0.2605 |
| 4174 | 0.600 | 0.475 | 0.205 | 1.1760 | 0.5255 | 0.2875 | 0.3080 |
| 4175 | 0.625 | 0.485 | 0.150 | 1.0945 | 0.5310 | 0.2610 | 0.2960 |


In [6]:
# Convert to NumPy array
X = df.loc[:, feature_names].values

In [7]:
# Make sure NumPy array is two dimensional
X.shape

(4177, 7)
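
The two-dimensional check matters most when you use a single feature. As a sketch on a toy DataFrame (the `toy` name and its three rows are just for illustration): selecting one column by name gives a one-dimensional array, while selecting it with a list of names, or reshaping afterwards, keeps the two dimensions a features matrix needs.

```python
import pandas as pd

# Toy single-feature table for illustration
toy = pd.DataFrame({'Length': [0.455, 0.350, 0.530]})

# Selecting with a column name returns a Series, so the array is one-dimensional
one_d = toy.loc[:, 'Length'].values
print(one_d.shape)  # (3,)

# Selecting with a list of names returns a DataFrame, so the array stays two-dimensional
two_d = toy.loc[:, ['Length']].values
print(two_d.shape)  # (3, 1)

# A one-dimensional array can also be reshaped into a single-column matrix
reshaped = one_d.reshape(-1, 1)
print(reshaped.shape)  # (3, 1)
```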

In [8]:
# Pandas series to convert to NumPy Array
df.loc[:, 'Rings']

0       15
1        7
2        9
3       10
4        7
        ..
4172    11
4173    10
4174     9
4175    10
4176    12
Name: Rings, Length: 4177, dtype: int64

In [9]:
y = df.loc[:, 'Rings'].values

In [10]:
y.shape

(4177,)
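
As a side note, pandas also offers `to_numpy()`, which the pandas documentation recommends over the `.values` attribute. The sketch below, on a toy DataFrame built from the first three rows shown earlier, checks that both give the same array here and that the features matrix and target vector have the same number of samples. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy table using the first three rows of Length, Diameter, and Rings
toy = pd.DataFrame({
    'Length': [0.455, 0.350, 0.530],
    'Diameter': [0.365, 0.265, 0.420],
    'Rings': [15, 7, 9],
})

# Two ways of converting the same selection to a NumPy array
X_values = toy.loc[:, ['Length', 'Diameter']].values
X_to_numpy = toy.loc[:, ['Length', 'Diameter']].to_numpy()
y_toy = toy.loc[:, 'Rings'].to_numpy()

# Both conversions produce the same array for this data
same_result = np.array_equal(X_values, X_to_numpy)
print(same_result)  # True

# X and y should describe the same number of samples
same_rows = X_to_numpy.shape[0] == y_toy.shape[0]
print(same_rows)  # True
```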

## Classification: Arrange Data into Features Matrix and Target Vector 
For classification, the target vector is categorical, like Sex.

In [11]:
feature_names = ['Length',
                 'Diameter',
                 'Height',
                 'Whole weight', 
                 'Shucked weight',
                 'Viscera weight',
                 'Shell weight']

In [12]:
# Multiple column features matrix to convert to NumPy Array
df.loc[:, feature_names]

| | Length | Diameter | Height | Whole weight | Shucked weight | Viscera weight | Shell weight |
|---|---|---|---|---|---|---|---|
| 0 | 0.455 | 0.365 | 0.095 | 0.5140 | 0.2245 | 0.1010 | 0.1500 |
| 1 | 0.350 | 0.265 | 0.090 | 0.2255 | 0.0995 | 0.0485 | 0.0700 |
| 2 | 0.530 | 0.420 | 0.135 | 0.6770 | 0.2565 | 0.1415 | 0.2100 |
| 3 | 0.440 | 0.365 | 0.125 | 0.5160 | 0.2155 | 0.1140 | 0.1550 |
| 4 | 0.330 | 0.255 | 0.080 | 0.2050 | 0.0895 | 0.0395 | 0.0550 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 4172 | 0.565 | 0.450 | 0.165 | 0.8870 | 0.3700 | 0.2390 | 0.2490 |
| 4173 | 0.590 | 0.440 | 0.135 | 0.9660 | 0.4390 | 0.2145 | 0.2605 |
| 4174 | 0.600 | 0.475 | 0.205 | 1.1760 | 0.5255 | 0.2875 | 0.3080 |
| 4175 | 0.625 | 0.485 | 0.150 | 1.0945 | 0.5310 | 0.2610 | 0.2960 |


In [13]:
# Convert to NumPy array
X = df.loc[:, feature_names].values

In [14]:
# Make sure NumPy array is two dimensional
X.shape

(4177, 7)

In [15]:
# Pandas series to convert to NumPy Array
df.loc[:, 'Sex']

0       M
1       M
2       F
3       M
4       I
       ..
4172    F
4173    M
4174    M
4175    F
4176    M
Name: Sex, Length: 4177, dtype: object

In [16]:
y = df.loc[:, 'Sex'].values

In [17]:
y.shape

(4177,)
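
With a categorical target, it is often useful to see which classes are present and how many samples each has. A minimal sketch using only the five `Sex` values from the first rows shown earlier (the `toy` and `class_counts` names are illustrative):

```python
import pandas as pd

# The Sex values from the first five rows of the dataset
toy = pd.DataFrame({'Sex': ['M', 'M', 'F', 'M', 'I']})

# Target vector: one-dimensional, one label per sample
y_toy = toy.loc[:, 'Sex'].values
print(y_toy.shape)  # (5,)

# Number of samples in each class
class_counts = toy.loc[:, 'Sex'].value_counts()
print(class_counts)
```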

## Common Question

### Why Is X a Capitalized Variable?
The X (and sometimes Y) variable is a matrix.

In some math notations, it is common practice to write vector variable names as lower case and matrix variable names as upper case. Often these are in bold or have other annotation, but that does not translate well to code. The practice may have transferred from this notation.

You may also notice that when the target variable is a single column of values, it is written as lowercase y.

Of course, this has no special meaning in Python, and you are free to ignore the convention. If you call the variable x instead of X, it won't cause errors.

However, because it has become a convention, it may be worth maintaining if you share your code.

Answer roughly taken from: https://datascience.stackexchange.com/questions/17598/why-are-variables-of-train-and-test-data-defined-using-the-capital-letter-in-py