# Standardizing Data
> This chapter is all about standardizing data. Often a model will make some assumptions about the distribution or scale of your features. Standardization is a way to make your data fit these assumptions and improve the algorithm's performance.

- toc: true 
- badges: true
- comments: true
- author: Lucas Nunes
- categories: [Python, Datacamp, Machine Learning]
- image: images/batman.jpg

> Note: This is a summary of the chapter 2 exercises from the DataCamp course "Preprocessing for Machine Learning in Python". <br>[Github repo](https://github.com/lnunesAI/Datacamp/) / [Course link](https://www.datacamp.com/tracks/machine-learning-scientist-with-python)

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Standardizing Data

### When to standardize


Now that you've learned when it is appropriate to standardize your data, in which of these scenarios would you NOT want to standardize?



<pre>
Possible Answers

A column you want to use for modeling has extremely high variance.

You have a dataset with several continuous columns on different scales and you'd like to use a linear model to train the data.

The models you're working with use some sort of distance metric in a linear space, like the Euclidean metric.

<b>Your dataset is composed of categorical data.</b>

</pre>

**Standardization is a preprocessing task performed on numerical, continuous data.**
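As a rough illustration of that rule, the sketch below standardizes only the numeric columns of a small DataFrame and leaves the categorical column untouched. The DataFrame, its column names, and its values are hypothetical and are not taken from the course data.

```python
import pandas as pd

# Hypothetical toy data: two continuous columns and one categorical column.
df = pd.DataFrame({
    "alcohol": [14.2, 13.2, 12.4],
    "proline": [1065.0, 1050.0, 520.0],
    "region": ["north", "south", "north"],
})

# Pick out only the numeric columns; "region" is skipped.
numeric_cols = df.select_dtypes(include="number").columns

# Standardize each numeric column to mean 0 and unit (population) variance.
standardized = df.copy()
standardized[numeric_cols] = (
    df[numeric_cols] - df[numeric_cols].mean()
) / df[numeric_cols].std(ddof=0)

print(standardized)
```

Categorical columns usually need a different treatment, such as encoding, rather than scaling.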

### Modeling without normalizing


<div class=""><p>Let's take a look at what might happen to your model's accuracy if you try to model data without doing some sort of standardization first. Here we have a subset of the <code>wine</code> dataset. One of the columns, <code>Proline</code>, has an extremely high variance compared to the other columns. This is an example of where a technique like log normalization would come in handy, which you'll learn about in the next section.</p>
<p>The scikit-learn model training process should be familiar to you at this point, so we won't go too in-depth with it. You already have a k-nearest neighbors model available (<code>knn</code>) as well as the <code>X</code> and <code>y</code> sets you need to fit and score on.</p></div>

In [43]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

wine_subset = pd.read_csv('https://raw.githubusercontent.com/lnunesAI/Datacamp/main/2-machine-learning-scientist-with-python/8-preprocessing-for-machine-learning-in-python/datasets/wine_subset.csv')

y = wine_subset['Type']
# Pass axis by keyword; the positional form drop('Type', 1) was removed in pandas 2.0
X = wine_subset.drop('Type', axis=1)

knn = KNeighborsClassifier()

Instructions
<ul>
<li>Split up the <code>X</code> and <code>y</code> sets into training and test sets using <code>train_test_split()</code>.</li>
<li>Use the <code>knn</code> model's <code>fit()</code> method on the <code>X_train</code> data and <code>y_train</code> labels, to fit the model to the data.</li>
<li>Print out the <code>knn</code> model's <code>score()</code> on the <code>X_test</code> data and <code>y_test</code> labels to evaluate the model.</li>
</ul>

In [10]:
# Split the dataset and labels into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Fit the k-nearest neighbors model to the training data
knn.fit(X_train, y_train)

# Score the model on the test data
print(knn.score(X_test, y_test))

0.5555555555555556


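**The accuracy score is pretty low (about 0.56 on this split). Because `train_test_split()` is called without a `random_state`, the exact value will vary between runs. Let's explore methods to improve this score.**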
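One likely reason for the low score is that k-nearest neighbors relies on distances, and a column with a very large spread can dominate them. The sketch below uses three made-up samples with two features (labeled `Ash` and `Proline` for illustration only) to show the effect on the Euclidean distance.

```python
import numpy as np

# Hypothetical samples in the form [Ash, Proline]; values are illustrative only.
a = np.array([2.0, 1000.0])
b = np.array([3.0, 1010.0])  # very different Ash, similar Proline
c = np.array([2.0, 1100.0])  # identical Ash, moderately different Proline


def euclidean(u, v):
    """Plain Euclidean distance between two 1-D arrays."""
    return float(np.sqrt(np.sum((u - v) ** 2)))


d_ab = euclidean(a, b)
d_ac = euclidean(a, c)

# The Proline difference swamps the Ash difference in both distances.
print(d_ab, d_ac)
```

Here `c` ends up much farther from `a` than `b` does, even though `b` differs far more in `Ash` relative to that column's typical range.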

##  Log normalization

### Checking the variance


<p>Check the variance of the columns in the <code>wine</code> dataset. Out of the four columns listed in the multiple choice section, which column is a candidate for normalization?</p>

In [45]:
wine = pd.read_csv('https://raw.githubusercontent.com/lnunesAI/Datacamp/main/2-machine-learning-scientist-with-python/8-preprocessing-for-machine-learning-in-python/datasets/wine.csv')

<pre>
Possible Answers

Alcohol

<b>Proline</b>

Proanthocyanins

Ash
</pre>

In [12]:
wine.var()

Type                                0.600679
Alcohol                             0.659062
Malic acid                          1.248015
Ash                                 0.075265
Alcalinity of ash                  11.152686
Magnesium                         203.989335
Total phenols                       0.391690
Flavanoids                          0.997719
Nonflavanoid phenols                0.015489
Proanthocyanins                     0.327595
Color intensity                     5.374449
Hue                                 0.052245
OD280/OD315 of diluted wines        0.504086
Proline                         99166.717355
dtype: float64

**The Proline column has an extremely high variance.**

### Log normalization in Python


<div class=""><p>Now that we know that the <code>Proline</code> column in our wine dataset has a large amount of variance, let's log normalize it.</p>
<p><code>Numpy</code> has been imported as <code>np</code> in your workspace.</p></div>

Instructions
<ul>
<li>Print out the variance of the <code>Proline</code> column for reference.</li>
<li>Use the <code>np.log()</code> function on the <code>Proline</code> column to create a new, log-normalized column named <code>Proline_log</code>.</li>
<li>Print out the variance of the <code>Proline_log</code> column to see the difference.</li>
</ul>

In [14]:
# Print out the variance of the Proline column
print(wine['Proline'].var())

# Apply the log normalization function to the Proline column
wine['Proline_log'] = np.log(wine['Proline'])

# Check the variance of the normalized Proline column
print(wine['Proline_log'].var())

99166.71735542428
0.17231366191842018


**The np.log() function is an easy way to log normalize a column.**
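The following sketch repeats the idea on synthetic data, so it runs without the course CSV. The generated values are only a hypothetical stand-in for a right-skewed column such as `Proline`. It also shows `np.log1p()`, which can be used when a column contains zeros, because `np.log(0)` is negative infinity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, strictly positive, right-skewed values (hypothetical stand-in).
values = rng.lognormal(mean=6.5, sigma=0.4, size=1000)

# ddof=1 matches the sample variance that pandas' .var() reports by default.
var_raw = values.var(ddof=1)
var_log = np.log(values).var(ddof=1)
print(var_raw, var_log)

# np.log1p(x) computes log(1 + x), which stays finite when x is zero.
counts = np.array([0.0, 9.0, 99.0])
log1p_counts = np.log1p(counts)
print(log1p_counts)
```

Note that the log transform is only defined for positive inputs, so columns with negative values need a different approach.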

## Scaling data for feature comparison

### Scaling data - investigating columns


<p>We want to use the <code>Ash</code>, <code>Alcalinity of ash</code>, and <code>Magnesium</code> columns in the wine dataset to train a linear model, but it's possible that these columns are all measured in different ways, which would bias a linear model. Using <code>describe()</code> to return descriptive statistics about this dataset, which of the following statements are true about the scale of data in these columns?</p>

<pre>
Possible Answers

The max of Ash is 3.23, the max of Alcalinity of ash is 30, and the max of Magnesium is 162.

The means of Ash and Alcalinity of ash are less than 20, while the mean of Magnesium is greater than 90.

The standard deviations of Ash and Alcalinity of ash are equal.

<b>1 and 2 are true.</b>

</pre>

In [15]:
wine[['Ash', 'Alcalinity of ash', 'Magnesium']].describe()

| | Ash | Alcalinity of ash | Magnesium |
|---|---|---|---|
| count | 178.0 | 178.0 | 178.0 |
| mean | 2.366517 | 19.494944 | 99.741573 |
| std | 0.274344 | 3.339564 | 14.282484 |
| min | 1.36 | 10.6 | 70.0 |
| 25% | 2.21 | 17.2 | 88.0 |
| 50% | 2.36 | 19.5 | 98.0 |
| 75% | 2.5575 | 21.5 | 107.0 |
| max | 3.23 | 30.0 | 162.0 |


**Both of these statements are true according to the statistics returned by describe()**

### Scaling data - standardizing columns


<p>Since we know that the <code>Ash</code>, <code>Alcalinity of ash</code>, and <code>Magnesium</code> columns in the wine dataset are all on different scales, let's standardize them in a way that allows for use in a linear model.</p>

Instructions
<ul>
<li>Import <code>StandardScaler</code> from <code>sklearn.preprocessing</code>.</li>
<li>Create a <code>StandardScaler()</code> object and store it in a variable named <code>ss</code>.</li>
<li>Create a subset of the <code>wine</code> DataFrame of the <code>Ash</code>, <code>Alcalinity of ash</code>, and <code>Magnesium</code> columns, store in a variable named <code>wine_subset</code>.</li>
<li>Apply the <code>ss.fit_transform</code> method to the <code>wine_subset</code> DataFrame.</li>
</ul>

In [27]:
# Import StandardScaler from scikit-learn
from sklearn.preprocessing import StandardScaler

# Create the scaler
ss = StandardScaler()

# Take a subset of the DataFrame you want to scale 
wine_subset = wine[['Ash', 'Alcalinity of ash', 'Magnesium']]

# Apply the scaler to the DataFrame subset
wine_subset_scaled = ss.fit_transform(wine_subset)

In [38]:
wine_subset.head()

| | Ash | Alcalinity of ash | Magnesium |
|---|---|---|---|
| 0 | 2.43 | 15.6 | 127 |
| 1 | 2.14 | 11.2 | 100 |
| 2 | 2.67 | 18.6 | 101 |
| 3 | 2.5 | 16.8 | 113 |
| 4 | 2.87 | 21.0 | 118 |


In [41]:
wine_subset_scaled[:5]

array([[ 0.23205254, -1.16959318,  1.91390522],
       [-0.82799632, -2.49084714,  0.01814502],
       [ 1.10933436, -0.2687382 ,  0.08835836],
       [ 0.4879264 , -0.80925118,  0.93091845],
       [ 1.84040254,  0.45194578,  1.28198515]])

**In scikit-learn, calling `fit_transform()` during preprocessing both fits the scaler to the data and transforms the data in a single step.**
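To make the transformation less opaque, here is a minimal sketch that compares `StandardScaler` with the manual calculation: subtract each column's mean and divide by its population standard deviation (`ddof=0`). It uses only the five rows shown in the `head()` output above, so the numbers will not match the array scaled on the full dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# The five rows from wine_subset.head(): Ash, Alcalinity of ash, Magnesium.
data = np.array([
    [2.43, 15.6, 127.0],
    [2.14, 11.2, 100.0],
    [2.67, 18.6, 101.0],
    [2.50, 16.8, 113.0],
    [2.87, 21.0, 118.0],
])

scaler = StandardScaler()
scaled = scaler.fit_transform(data)

# Manual z-score; np.std defaults to ddof=0, the same convention StandardScaler uses.
manual = (data - data.mean(axis=0)) / data.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, which puts the three features on a comparable footing.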

## Standardized data and modeling

### KNN on non-scaled data


<p>Let's first take a look at the accuracy of a k-nearest neighbors model on the <code>wine</code> dataset without standardizing the data. The <code>knn</code> model and the <code>X</code> (features) and <code>y</code> (labels) sets have already been created. Most of this model-building process in scikit-learn should look familiar to you.</p>

In [47]:
wine = pd.read_csv('https://raw.githubusercontent.com/lnunesAI/Datacamp/main/2-machine-learning-scientist-with-python/8-preprocessing-for-machine-learning-in-python/datasets/wine.csv')

In [63]:
# Pass axis by keyword; the positional form drop('Type', 1) was removed in pandas 2.0
X = wine.drop('Type', axis=1)
y = wine['Type']

knn = KNeighborsClassifier()

Instructions
<ul>
<li>Split the dataset into training and test sets using <code>train_test_split()</code>.</li>
<li>Use the <code>knn</code> model's <code>fit()</code> method on the <code>X_train</code> data and <code>y_train</code> labels, to fit the model to the data.</li>
<li>Print out the <code>knn</code> model's <code>score()</code> on the <code>X_test</code> data and <code>y_test</code> labels to evaluate the model.</li>
</ul>

In [66]:
# Split the dataset and labels into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Fit the k-nearest neighbors model to the training data
knn.fit(X_train, y_train)

# Score the model on the test data
print(knn.score(X_test, y_test))

0.7333333333333333


**This scikit-learn workflow should be very familiar to you at this point.**

### KNN on scaled data


<p>The accuracy score on the unscaled <code>wine</code> dataset was decent, but we can likely do better if we scale the dataset. The process is mostly the same as in the previous exercise, with the added step of scaling the data. Once again, the <code>knn</code> model and the <code>X</code> (features) and <code>y</code> (labels) sets have already been created for you.</p>

Instructions
<ul>
<li>Create a <code>StandardScaler()</code> object, stored in a variable named <code>ss</code>.</li>
<li>Apply the <code>ss.fit_transform</code> method to the <code>X</code> dataset.</li>
<li>Use the <code>knn</code> model's <code>fit()</code> method on the <code>X_train</code> data and <code>y_train</code> labels, to fit the model to the data.</li>
<li>Print out the <code>knn</code> model's <code>score()</code> on the <code>X_test</code> data and <code>y_test</code> labels to evaluate the model.</li>
</ul>

In [70]:
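# Create the scaler.
ss = StandardScaler()

# Scale the features used for modeling, then split into training and test sets.
# Note: fitting the scaler on all of X before splitting (as the exercise does)
# lets test-set statistics influence the scaling.
X_scaled = ss.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)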

# Fit the k-nearest neighbors model to the training data.
knn.fit(X_train, y_train)

# Score the model on the test data.
print(knn.score(X_test, y_test))

0.9555555555555556


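**On these runs, accuracy rose from about 0.73 on the unscaled data to about 0.96 on the scaled data, so the extra step of scaling is worth it. Each score comes from its own random split, so the exact values will vary.**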
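A common variation, sketched below, splits the data first and fits the scaler on the training set only, so that no test-set statistics reach the model. Several things here are assumptions and not part of the course exercise: `load_wine()` (bundled with scikit-learn) is used as a stand-in for the course CSV, and `random_state=42` and `stratify=y` are arbitrary choices that make the two scores comparable on the same split.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): scikit-learn's bundled wine dataset.
X, y = load_wine(return_X_y=True)

# Split first; a fixed random_state means both models see the same split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y
)

# Baseline: k-nearest neighbors on the raw features.
unscaled = KNeighborsClassifier().fit(X_train, y_train)

# The pipeline fits the scaler on X_train only, then reuses those
# training-set means and standard deviations when scoring X_test.
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X_train, y_train)

acc_unscaled = unscaled.score(X_test, y_test)
acc_scaled = scaled.score(X_test, y_test)
print(acc_unscaled, acc_scaled)
```

Wrapping the scaler and the model in one pipeline also keeps the two steps together if the model is later cross-validated or reused on new data.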