# Chapter #2: Standardizing Data

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

## 1. Standardizing Data

1. Standardizing Data
It's possible that you'll come across datasets with lots of numerical noise built in, such as lots of variance or differently-scaled data. The preprocessing solution for that is standardization.

2. What is standardization?
Standardization is a preprocessing method used to transform continuous data to make it look normally distributed. In scikit-learn, this is often a necessary step, because many models assume that the data you are training on is normally distributed, and if it isn't, you risk biasing your model. You can standardize your data in different ways, but in this course we're going to talk about two methods: log normalization and scaling. It's also important to note that standardization is a preprocessing method applied to continuous, numerical data. You'll learn methods for dealing with categorical data later in the course.

3. When to standardize: models
There are a few scenarios in which you want to standardize your data. First, if you're working with a model that uses a linear distance metric or operates in a linear space, such as k-nearest neighbors, linear regression, or k-means clustering, the model assumes that the data and features you give it are related in a linear fashion, or can be measured with a linear distance metric. Some models deal with nonlinear spaces, but for models that operate in a linear space, the data must also be in that space.

A related case is when one or more features in your dataset have high variance, which could bias a model that assumes the data is normally distributed. If a feature's variance is an order of magnitude or more greater than that of the other features, this could impact the model's ability to learn from the other features in the dataset.

Another scenario to watch out for is modeling a dataset that contains continuous features on different scales. For example, consider a dataset with one column related to height and another related to weight. To compare these features, they must be in the same linear space, and therefore must be standardized in some way.

All of these scenarios assume you're working with a model that makes some kind of linearity assumption. A number of models are perfectly fine operating in a nonlinear space, or do a certain amount of standardization upon input, but that's outside the scope of this course.
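To make the height and weight example concrete, here is a minimal sketch of how a Euclidean distance can be dominated by the feature with the larger numeric scale. The three points and the `euclidean` helper are hypothetical and only serve as an illustration.

```python
import numpy as np

# Hypothetical people described by (height in metres, weight in kilograms).
a = np.array([1.60, 70.0])
b = np.array([1.90, 70.0])  # very different height, same weight as a
c = np.array([1.60, 72.0])  # same height as a, slightly different weight


def euclidean(p, q):
    """Plain Euclidean distance between two points."""
    return np.sqrt(np.sum((p - q) ** 2))


dist_ab = euclidean(a, b)  # driven only by the 0.30 m height gap
dist_ac = euclidean(a, c)  # driven only by the 2 kg weight gap

# The 2 kg gap outweighs the 30 cm gap purely because of the units used.
print(dist_ab, dist_ac)
```

Without any rescaling, a distance-based model would treat `c` as much farther from `a` than `b` is, even though a 30 cm height difference is arguably the larger one.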

4. Let's practice!
Now that you've learned when to standardize your data, let's test your knowledge.

### 1.1. When to standardize

Now that you've learned when it is appropriate to standardize your data, in which of these scenarios would you NOT want to standardize?

Possible Answers:
- A column you want to use for modeling has extremely high variance.
- You have a dataset with several continuous columns on different scales and you'd like to use a linear model to train the data.
- The models you're working with use some sort of distance metric in a linear space, like the Euclidean metric.
- Your dataset is comprised of categorical data.

> Your dataset is comprised of categorical data.

### 1.2. Modeling without normalizing

Let's take a look at what might happen to your model's accuracy if you try to model data without doing some sort of standardization first. Here we have a subset of the `wine` dataset. One of the columns, `Proline`, has an extremely high variance compared to the other columns. This is an example of where a technique like log normalization would come in handy, which you'll learn about in the next section.

The scikit-learn model training process should be familiar to you at this point, so we won't go too in-depth with it. You already have a k-nearest neighbors model available (`knn`) as well as the `X` and `y` sets you need to fit and score on.

- Getting everything ready.

In [2]:
# Reading the data:
wine = pd.read_csv("./data/wine.csv")

In [3]:
# Exploring the shape:
wine.shape

(178, 14)

In [4]:
# Exploring the first 5 rows:
wine.head()

| | Type | Alcohol | Malic acid | Ash | Alcalinity of ash | Magnesium | Total phenols | Flavanoids | Nonflavanoid phenols | Proanthocyanins | Color intensity | Hue | OD280/OD315 of diluted wines | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050 |
| 2 | 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 1 | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480 |
| 4 | 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |


In [5]:
# Creating the feature matrix (X):
X = wine[['Proline', 'Total phenols', 'Hue', 'Nonflavanoid phenols']].copy()

In [6]:
# Creating the target column (y):
y = wine['Type'].copy()
y.shape

(178,)

In [7]:
# Initializing the model:
knn = KNeighborsClassifier()

- Split up the `X` and `y` sets into training and test sets using `train_test_split()`.

In [8]:
# Splitting the data into training & hold-out sets:
X_train, X_test, y_train, y_test = train_test_split(X, y)

- Use the `knn` model's `.fit()` method on the `X_train` data and `y_train` labels, to fit the model to the data.

In [9]:
# Fitting the model:
knn.fit(X_train, y_train)

KNeighborsClassifier()

- Print out the `knn` model's `.score()` on the `X_test` data and `y_test` labels to evaluate the model.

In [10]:
# Evaluating the model performance using .score():
print(knn.score(X_test, y_test))

0.6666666666666666
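Because `train_test_split()` shuffles the rows at random, the score above will change from run to run. A minimal sketch of how the split could be made repeatable, assuming you want that: `random_state` fixes the shuffle and `stratify` keeps the class proportions in both sets. The `X_demo` and `y_demo` arrays are made-up stand-ins, not the wine data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 12 rows, 2 features, 3 balanced classes.
X_demo = np.arange(24).reshape(12, 2)
y_demo = np.repeat([1, 2, 3], 4)

# Two splits with the same seed and the same stratification.
split_a = train_test_split(X_demo, y_demo, random_state=42, stratify=y_demo)
split_b = train_test_split(X_demo, y_demo, random_state=42, stratify=y_demo)

# Each result is (X_train, X_test, y_train, y_test); compare the test features.
print(np.array_equal(split_a[1], split_b[1]))

# With the default 25% test size, the 3 test rows cover one row per class.
print(sorted(split_a[3].tolist()))
```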


## 2. Log normalization

1. Log normalization
The first method we'll cover for standardization is log normalization.

2. What is log normalization?
Log normalization is a method for standardizing your data that can be useful when you have a particular column with high variance. As you saw in the previous section's exercise, training a k-nearest neighbors classifier on that subset of the wine dataset didn't get a very high accuracy score. This is because within that subset, the Proline column has extremely high variance, which is affecting the accuracy of the classifier. Log normalization applies a log transformation to your values, which transforms them onto a scale that approximates normality, an assumption about your data that a lot of models make. The method of log normalization we're going to work with in Python takes the natural log of each number in the left-hand column, which is simply the exponent to which you would raise the mathematical constant e (approximately equal to 2.718) to get that number. So, looking at the table on the slide, the log of 30 is approximately 3.4, because e to the power of 3.4 is approximately 30. Log normalization is a good strategy when you care about relative changes in a linear model, when you still want to capture the magnitude of change, and when you want to keep everything in the positive space. It's a nice way to minimize the variance of a column and make it comparable to other columns for modeling.
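The relationship between the natural log and e can be checked directly with NumPy. This is a small sketch using the value 30 from the slide; the variable names are illustrative only.

```python
import numpy as np

# Natural log of 30: the power e must be raised to in order to get 30.
log_30 = np.log(30)

# Raising e to that power recovers the original value.
back = np.exp(log_30)

print(round(log_30, 2), round(back, 2))
```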

3. Log normalization in Python
Applying log normalization to data in Python is fairly straightforward. We can use the log function from NumPy to do the trick. Here we have a dataframe of some values. If we check the variance of the columns, you can see that column 2 has a significantly higher variance than column 1. To apply log normalization to column 2, we need the log function from NumPy. We can pass the column we want to log normalize directly into the function. If we take a look at both column 2 and the log-normalized column 2, you can see that the transformation has scaled down the values. If we check the variance of both column 1 and the log-normalized column 2, you can see that the variances are much closer together now.
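The slide's dataframe is not reproduced in these notes, so the following is a sketch with made-up values that follows the same steps: check the variances, apply `np.log()` to the high-variance column, and check again. The column names `col1`, `col2`, and `col2_log` are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical values: col2 is on a much larger scale than col1.
df = pd.DataFrame({
    "col1": [1.00, 1.20, 0.75, 1.60],
    "col2": [48.0, 76.5, 3400.0, 1200.0],
})

# Variance before the transformation: col2 dwarfs col1.
print(df.var())

# Pass the column directly to np.log() to log normalize it.
df["col2_log"] = np.log(df["col2"])

# Variance after the transformation: the two columns are far closer.
print(df[["col1", "col2_log"]].var())
```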

4. Let's practice!
Now it's your turn! Let's take a look at the wine dataset again and do some normalization.

### 2.1. Checking the variance

Check the variance of the columns in the `wine` dataset. Out of the four columns listed in the multiple choice section, which column is a candidate for normalization?

In [11]:
# Checking the variance of different features:
wine.var()

Type                                0.600679
Alcohol                             0.659062
Malic acid                          1.248015
Ash                                 0.075265
Alcalinity of ash                  11.152686
Magnesium                         203.989335
Total phenols                       0.391690
Flavanoids                          0.997719
Nonflavanoid phenols                0.015489
Proanthocyanins                     0.327595
Color intensity                     5.374449
Hue                                 0.052245
OD280/OD315 of diluted wines        0.504086
Proline                         99166.717355
dtype: float64

Possible Answers:
- Alcohol.
- Proline.
- Proanthocyanins.
- Ash.

> Proline.

### 2.2. Log normalization in Python

Now that we know that the `Proline` column in our wine dataset has a large amount of variance, let's log normalize it.

NumPy has been imported as `np` in your workspace.

- Print out the variance of the `Proline` column for reference.

In [12]:
# Checking the variance of Proline column:
wine['Proline'].var()

99166.71735542428

- Use the `np.log()` function on the `Proline` column to create a new, log-normalized column named `Proline_log`.

In [13]:
# Normalizing the Proline column using np.log():
wine['Proline_log'] = np.log(wine['Proline'])

- Print out the variance of the `Proline_log` column to see the difference.

In [14]:
# Checking the variance of new Proline_log column:
wine['Proline_log'].var()

0.17231366191842018

## 3. Scaling data for feature comparison

1. Scaling data
Let's move on to talking about scaling our data.

2. What is feature scaling?
Scaling is a method of standardization that's most useful when you're working with a dataset that contains continuous features that are on different scales, and you're using a model that operates in some sort of linear space (like linear regression or k-nearest neighbors). Feature scaling transforms the features in your dataset so they have a mean of zero and a variance of one. This will make it easier to linearly compare features. This is a requirement for many models in scikit-learn.
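As a rough illustration of what "mean of zero and variance of one" means, the transformation can be written by hand: subtract the feature's mean and divide by its standard deviation. The values in `x` are hypothetical.

```python
import numpy as np

# Hypothetical feature values.
x = np.array([2.0, 4.0, 6.0, 8.0])

# Subtract the mean, then divide by the (population) standard deviation.
z = (x - x.mean()) / x.std()

# The scaled feature has mean 0 and variance 1, up to floating-point error.
print(z.mean(), z.var())
```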

3. How to scale data
Let's take a look at another dataframe. In each column, we have numbers that are relatively close within the column, but not across columns. If we look at the variance, it's relatively low across columns. To better model this data, scaling would be a good choice here.

4. How to scale data
Scikit-learn has a variety of scaling methods, but we're only going to focus on the standard scaler method, imported from preprocessing. This method works by removing the mean and scaling each feature to have unit variance. There's a simpler scale function in scikit-learn, but the benefit of using the Standard Scaler object is that you can apply the same transformation on other data, like a test set, or new data that's part of the same set, for example, without having to rescale everything. So once we have the standard scaler method, we can apply the fit transform function on the dataframe. We can reconvert the output of fit transform, which is a numpy array, to a dataframe to look at it more easily. If we take a look at the newly scaled dataframe, we can see that the values have been scaled down, and if we calculate the variance by column, it's not only close to 1, but it's now the same for all of our features.
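A minimal sketch of the reuse described above, with made-up data: the scaler learns the mean and variance from one dataframe via `fit_transform()`, and the same fitted object is then applied to new rows with `transform()`. The `train` and `new` dataframes are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two columns on very different scales.
train = pd.DataFrame({
    "col1": [1.0, 2.0, 3.0, 4.0],
    "col2": [100.0, 200.0, 300.0, 400.0],
})
# A new row that happens to sit exactly at the training means.
new = pd.DataFrame({"col1": [2.5], "col2": [250.0]})

ss = StandardScaler()

# fit_transform() learns the per-column mean and variance, then scales.
train_scaled = pd.DataFrame(ss.fit_transform(train), columns=train.columns)

# transform() reuses the statistics learned above; nothing is refit.
new_scaled = pd.DataFrame(ss.transform(new), columns=new.columns)

print(train_scaled)
print(new_scaled)
```

Since the new row equals the training means, it is mapped to zero in both columns, which shows that the training statistics were the ones applied.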

5. Let's practice!
Now it's your turn to try scaling data in Python and scikit-learn.

### 3.1. Scaling data - investigating columns

We want to use the `Ash`, `Alcalinity of ash`, and `Magnesium` columns in the `wine` dataset to train a linear model, but it's possible that these columns are all measured in different ways, which would bias a linear model. Using `describe()` to return descriptive statistics about this dataset, which of the following statements are true about the scale of data in these columns?

In [15]:
# Exploring some statistics:
wine[['Ash', 'Alcalinity of ash', 'Magnesium']].describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Ash | 178.0 | 2.366517 | 0.274344 | 1.36 | 2.21 | 2.36 | 2.5575 | 3.23 |
| Alcalinity of ash | 178.0 | 19.494944 | 3.339564 | 10.6 | 17.2 | 19.5 | 21.5 | 30.0 |
| Magnesium | 178.0 | 99.741573 | 14.282484 | 70.0 | 88.0 | 98.0 | 107.0 | 162.0 |


Possible Answers:
- The max of `Ash` is 3.23, the max of `Alcalinity of ash` is 30, and the max of `Magnesium` is 162.
- The means of `Ash` and `Alcalinity of ash` are less than 20, while the mean of `Magnesium` is greater than 90.
- The standard deviations of `Ash` and `Alcalinity of ash` are equal.
- 1 and 2 are true.

> 1 and 2 are true.

### 3.2. Scaling data - standardizing columns

Since we know that the `Ash`, `Alcalinity of ash`, and `Magnesium` columns in the wine dataset are all on different scales, let's standardize them in a way that allows for use in a linear model.

- Import `StandardScaler` from `sklearn.preprocessing`.

> Done

- Create a `StandardScaler()` instance and store it in a variable named `ss`.

In [16]:
# Initializing the scaler:
ss = StandardScaler()

- Create a subset of the wine DataFrame of the `Ash`, `Alcalinity of ash`, and `Magnesium` columns, store in a variable named `wine_subset`.

In [17]:
# Subsetting the data:
wine_subset = wine[['Ash', 'Alcalinity of ash', 'Magnesium']]

- Apply the `ss.fit_transform()` method to the `wine_subset` DataFrame.

In [18]:
# Scaling the data subset:
wine_subset_scaled = pd.DataFrame(ss.fit_transform(wine_subset), columns=wine_subset.columns)

In [19]:
# Checking the variance of the scaled data:
wine_subset_scaled.var()

Ash                  1.00565
Alcalinity of ash    1.00565
Magnesium            1.00565
dtype: float64
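The variances above are 1.00565 rather than exactly 1 because `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `.var()` defaults to the sample variance (`ddof=1`). With 178 rows, the ratio is 178/177, which is about 1.00565. The sketch below reproduces this on made-up data with the same number of rows; the `demo` dataframe is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 178  # same number of rows as the wine dataset

# Hypothetical feature, unrelated to the wine columns.
demo = pd.DataFrame({"feature": rng.normal(loc=50.0, scale=10.0, size=n)})

scaled = pd.DataFrame(StandardScaler().fit_transform(demo), columns=demo.columns)

sample_var = scaled["feature"].var()            # pandas default: ddof=1
population_var = scaled["feature"].var(ddof=0)  # matches StandardScaler

print(sample_var, population_var, n / (n - 1))
```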

## 4. Standardized data and modeling

1. Standardized data and modeling
Now that we've learned a couple of different methods for standardization, it's time to put this into practice with modeling. As mentioned before, many models in scikit-learn require your data to be scaled appropriately across columns, otherwise you risk biasing your results. The last part of this section will be dedicated to modeling data on both unscaled and scaled data, so you can see the difference in model performance. The model we're going to use is k-nearest neighbors.

2. K-nearest neighbors
You should already be a little familiar with both k-nearest neighbors and the scikit-learn workflow from previous courses, but we'll do a quick review of both. K-nearest neighbors is a model that classifies data based on its distance to training set data. A new data point is assigned a label based on the class that the majority of surrounding data points belong to. The workflow for training a model in scikit-learn is pretty simple. You will want to do all of your preprocessing first, of course. Before you train your model, it's very important that you split up your data into training and test sets, so that the model is evaluated on data it hasn't seen and overfitting doesn't go unnoticed. That's easily done with scikit-learn's `train_test_split()` function, shown here. Once you've done that, it's just a matter of fitting the model to your training set, and then running your unseen test set through the model. Most models have a `.score()` method, which can be used to evaluate your model's performance.

3. Let's practice!
All of that should look familiar to you. If not, check out some of the other courses related to Python and machine learning. Otherwise, it's your turn!

### 4.1. KNN on non-scaled data

Let's first take a look at the accuracy of a k-nearest neighbors model on the `wine` dataset without standardizing the data. The `knn` model, the `X` data, and the `y` labels have been created already. Most of this process of creating models in scikit-learn should look familiar to you.

- Getting everything ready.

In [20]:
# Reading the data:
wine = pd.read_csv("./data/wine.csv")

In [21]:
# Exploring the shape:
wine.shape

(178, 14)

In [22]:
# Exploring the first 5 rows:
wine.head()

| | Type | Alcohol | Malic acid | Ash | Alcalinity of ash | Magnesium | Total phenols | Flavanoids | Nonflavanoid phenols | Proanthocyanins | Color intensity | Hue | OD280/OD315 of diluted wines | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050 |
| 2 | 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 1 | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480 |
| 4 | 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |


In [23]:
# Creating the feature matrix (X):
X = wine.drop(columns='Type')

In [24]:
# Creating the target column (y):
y = wine['Type'].copy()

In [25]:
# Initializing the model:
knn = KNeighborsClassifier()

- Split the dataset into training and test sets using `train_test_split()`.

In [26]:
# Splitting the data into training & hold-out sets:
X_train, X_test, y_train, y_test = train_test_split(X, y)

- Use the `knn` model's `.fit()` method on the `X_train` data and `y_train` labels, to fit the model to the data.

In [27]:
# Fitting the model:
knn.fit(X_train, y_train)

KNeighborsClassifier()

- Print out the `knn` model's `.score()` on the `X_test` data and `y_test` labels to evaluate the model.

In [28]:
# Evaluating the model performance:
print(knn.score(X_test, y_test))

0.7111111111111111


### 4.2. KNN on scaled data

The accuracy score on the unscaled `wine` dataset was decent, but we can likely do better if we scale the dataset. The process is mostly the same as the previous exercise, with the added step of scaling the data. Once again, the `knn` model, the `X` data, and the `y` labels have already been created for you.

- Create a `StandardScaler()` instance and store it in a variable named `ss`.

In [29]:
# Initializing the scaler:
ss = StandardScaler()

- Apply the `ss.fit_transform()` method to the `X` dataset.

In [30]:
# Scaling the feature matrix:
X_scaled = pd.DataFrame(ss.fit_transform(X), columns=X.columns)

In [31]:
# Splitting the scaled version of the data into training & hold-out sets:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

- Use the `knn` model's `.fit()` method on the `X_train` data and `y_train` labels, to fit the model to the data.

In [32]:
# Fitting the model:
knn.fit(X_train, y_train)

KNeighborsClassifier()

- Print out the `knn` model's `.score()` on the `X_test` data and `y_test` labels to evaluate the model.

In [33]:
# Evaluating the model performance:
print(knn.score(X_test, y_test))

0.9777777777777777
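The exercise above scales the full `X` before splitting, so the scaler also sees the rows that end up in the test set. One common variation, sketched here as an assumption rather than as the course's method, is to split first and wrap the scaler and the model in a pipeline so that the scaling statistics come from the training rows only. The sketch uses scikit-learn's bundled copy of the wine data (`load_wine`) as a stand-in for `./data/wine.csv`, and the `random_state` value is arbitrary.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled wine data stands in for ./data/wine.csv.
X_demo, y_demo = load_wine(return_X_y=True)

# Split first, so the test rows stay unseen during scaling.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the training rows only, then fits KNN.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
pipe.fit(X_tr, y_tr)

# At scoring time the test rows are scaled with the training statistics.
score = pipe.score(X_te, y_te)
print(score)
```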
