---
title: "Training and Testing"
description: "Sklearn's train_test_split method splits data into random training and testing subsets. Scikit-learn is an open source Python library that implements a range of machine learning, preprocessing, cross-validation, and visualization algorithms."
tags: Scikit learn, Sklearn, Testing
URL: https://www.datacamp.com/community/blog/scikit-learn-cheat-sheet https://github.com/kailashahirwar/cheatsheets-ai/blob/master/Scikit%20Learn.png
Licence: 
Creator: 
Meta: ""

---

# Key Code

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.20, random_state = 13)

`test_size` is a float between 0 and 1 giving the proportion of the dataset to place in the test set (an integer is read as an absolute number of samples). If neither `test_size` nor `train_size` is given, it defaults to 0.25. <br>
`train_size` is the corresponding setting for the training set. By default it is the complement of `test_size`. <br>
`random_state` is an integer seed for the random number generator that shuffles the data; fixing it makes the split reproducible.
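
The three parameters above can be checked on a tiny synthetic array. This is a minimal sketch, not part of the original notebook: the arrays `X` and `y` and the names `same_split` and `X_train_half` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# test_size=0.2 puts 2 of the 10 samples in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=13)

# Repeating the call with the same random_state gives the same split.
X_train_again, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.2, random_state=13)
same_split = np.array_equal(X_test, X_test_again)

# Passing train_size instead: the test set becomes the complement.
X_train_half, X_test_half, _, _ = train_test_split(
    X, y, train_size=0.5, random_state=13)

print(X_train.shape, X_test.shape, same_split)
print(X_train_half.shape, X_test_half.shape)
```

Changing `random_state` to another integer would generally produce a different, but equally reproducible, split.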

# Example

## Preliminaries

In [2]:
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split

## Load some example data

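This is the dataset of house prices in Boston. See the Learn More section below to read more about this example dataset. Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so the cells below need an older release.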

In [16]:
boston = datasets.load_boston()
columns = boston.feature_names
columns

array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')

In [19]:
bos_df = pd.DataFrame(boston.data,
                    columns = columns)
bos_df['PRICE'] = boston.target
bos_df.head()

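|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE |
|---|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|---|-------|-------|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |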


## Use `train_test_split`

In [20]:
X = bos_df.drop('PRICE', axis = 1)
y = bos_df['PRICE']

In [22]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.20, random_state = 13)

In [23]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(404, 13)
(102, 13)
(404,)
(102,)
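
The shapes above follow from simple arithmetic: with a float `test_size`, scikit-learn rounds the test count up and gives the remaining rows to the training set. A minimal sketch of that calculation, where the variable names are illustrative only:

```python
import math

n_samples = 506      # rows in the Boston dataset
test_size = 0.20

# 0.20 * 506 = 101.2, which is rounded up to 102 test rows.
n_test = math.ceil(test_size * n_samples)
# The remaining rows form the training set.
n_train = n_samples - n_test

print(n_train, n_test)
```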


In [26]:
X_train.head()

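|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|---|-------|
| 51 | 0.04337 | 21.0 | 5.64 | 0.0 | 0.439 | 6.115 | 63.0 | 6.8147 | 4.0 | 243.0 | 16.8 | 393.97 | 9.43 |
| 265 | 0.76162 | 20.0 | 3.97 | 0.0 | 0.647 | 5.56 | 62.8 | 1.9865 | 5.0 | 264.0 | 13.0 | 392.4 | 10.45 |
| 66 | 0.04379 | 80.0 | 3.37 | 0.0 | 0.398 | 5.787 | 31.1 | 6.6115 | 4.0 | 337.0 | 16.1 | 396.9 | 10.24 |
| 12 | 0.09378 | 12.5 | 7.87 | 0.0 | 0.524 | 5.889 | 39.0 | 5.4509 | 5.0 | 311.0 | 15.2 | 390.5 | 15.71 |
| 347 | 0.0187 | 85.0 | 4.15 | 0.0 | 0.429 | 6.516 | 27.7 | 8.5353 | 4.0 | 351.0 | 17.9 | 392.43 | 6.36 |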


# Learn More

In [18]:
print(boston.DESCR)

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

---
title: "Standardize Features with Sklearn's StandardScaler"
description: "Sklearn can easily standardize features by rescaling each one to have a mean of 0 and a standard deviation of 1. This shifts and scales a feature but does not change the shape of its distribution. Centering and scaling happen independently on each feature. Standardized features are a common requirement for many machine learning estimators."
tags: Scikit learn, Sklearn, Data Cleaning / Preprocessing
URL: https://www.datacamp.com/community/blog/scikit-learn-cheat-sheet https://github.com/kailashahirwar/cheatsheets-ai/blob/master/Scikit%20Learn.png
Licence: 
Creator: 
Meta: "fit_transform"

---

# Key Code

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
# for one np array named data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

In [None]:
# for a training / test split: fit on the training data only, then transform both
scaler = StandardScaler().fit(X_train)
standardized_X = scaler.transform(X_train)
standardized_X_test = scaler.transform(X_test)

# Example

## Preliminaries

In [28]:
import numpy as np
from sklearn.preprocessing import StandardScaler

## Example data

In [34]:
data = np.array([[0,0], [1,1], [1,0], [0,1]])

## Standardize it

In [35]:
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
scaled_data

array([[-1., -1.],
       [ 1.,  1.],
       [ 1., -1.],
       [-1.,  1.]])

In [37]:
scaled_data.mean(axis = 0)

array([0., 0.])

In [38]:
scaled_data.std(axis = 0)

array([1., 1.])
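
The output above can be reproduced by hand: each value has its column mean subtracted and is then divided by its column's population standard deviation. A minimal sketch using only the standard library, where the helper names `columns`, `means`, `stds`, and `scaled` are illustrative:

```python
from statistics import fmean, pstdev

data = [[0, 0], [1, 1], [1, 0], [0, 1]]

# Work column by column, since each feature is scaled independently.
columns = list(zip(*data))
means = [fmean(col) for col in columns]
# Population standard deviation (divide by n, not n - 1).
stds = [pstdev(col) for col in columns]

# z = (x - mean) / std for every entry.
scaled = [[(x - m) / s for x, m, s in zip(row, means, stds)]
          for row in data]

print(scaled)
```

Both columns here have mean 0.5 and standard deviation 0.5, so every 0 maps to -1 and every 1 maps to 1.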

# Example

## Preliminaries

In [32]:
import pandas as pd
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

## Example data

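This is the dataset of house prices in Boston. See the Learn More section below to read more about this example dataset. Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so the cells below need an older release.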

In [39]:
boston = datasets.load_boston()
columns = boston.feature_names
bos_df = pd.DataFrame(boston.data, columns = columns)
bos_df['PRICE'] = boston.target
bos_df.head()

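|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE |
|---|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|---|-------|-------|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |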


## Split into training and test sets

In [40]:
X = bos_df.drop('PRICE', axis = 1)
y = bos_df['PRICE']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.20, random_state = 13)

## Using StandardScaler

In [51]:
scaler = StandardScaler().fit(X_train)
standardized_X = scaler.transform(X_train)
standardized_X_test = scaler.transform(X_test)
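
Because the scaler is fitted on the training data only, the training features end up with mean 0 and standard deviation 1, while the test features are only approximately centred. A minimal sketch of this, assuming synthetic random data in place of the Boston dataset; the generator seed and array shape are arbitrary choices:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 100 samples, 3 features, far from mean 0 / std 1.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))

X_train, X_test = train_test_split(X, test_size=0.20, random_state=13)

# Learn the per-feature mean and scale from the training data only.
scaler = StandardScaler().fit(X_train)
standardized_X = scaler.transform(X_train)
standardized_X_test = scaler.transform(X_test)

# The fitted statistics are stored on the scaler.
print(scaler.mean_, scaler.scale_)

# Training data is centred; test data only approximately so.
print(standardized_X.mean(axis=0))
print(standardized_X_test.mean(axis=0))
```

Reusing the training statistics for the test set keeps information about the test data out of the preprocessing step.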

# Learn More

In [52]:
print(boston.DESCR)

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

---
title: "Normalize Features with Sklearn's Normalizer"
description: "Sklearn can easily normalize features. Normalization is the process of scaling individual samples to have unit norm. Each sample (row of data) is rescaled independently of other samples so that its norm (L2 by default) equals 1."
tags: Scikit learn, Sklearn, Data Cleaning / Preprocessing
URL: https://www.datacamp.com/community/blog/scikit-learn-cheat-sheet https://github.com/kailashahirwar/cheatsheets-ai/blob/master/Scikit%20Learn.png
Licence: 
Creator: 
Meta: "fit_transform"

---

# Key Code

In [None]:
from sklearn.preprocessing import Normalizer

In [None]:
# for one np array named data
scaler = Normalizer()
normalized_data = scaler.fit_transform(data)

In [None]:
# for a training / test split (each row is rescaled on its own, so fit learns nothing from X_train)
scaler = Normalizer().fit(X_train)
normalized_X = scaler.transform(X_train)
normalized_X_test = scaler.transform(X_test)

# Example

## Preliminaries

In [53]:
import numpy as np
from sklearn.preprocessing import Normalizer

## Example data

In [61]:
data = np.array([[1,2,2,4], 
                 [1,3,3,9], 
                 [1,5,5,7]])

## Normalize it

In [62]:
scaler = Normalizer()
normalized_data = scaler.fit_transform(data)
normalized_data

array([[0.2, 0.4, 0.4, 0.8],
       [0.1, 0.3, 0.3, 0.9],
       [0.1, 0.5, 0.5, 0.7]])
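
The output above can be verified by hand: with the default L2 norm, each row is divided by the square root of the sum of its squared entries. A minimal sketch using only the standard library, where `norms` and `normalized` are illustrative names:

```python
import math

data = [[1, 2, 2, 4],
        [1, 3, 3, 9],
        [1, 5, 5, 7]]

# L2 norm of each row: sqrt(1 + 4 + 4 + 16) = 5, and 10 for the other two.
norms = [math.sqrt(sum(x * x for x in row)) for row in data]

# Divide every entry by its own row's norm.
normalized = [[x / n for x in row] for row, n in zip(data, norms)]

print(norms)
print(normalized)
```

Each row is handled on its own, so adding or removing other rows would not change the result for a given row.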

# Example

## Preliminaries

In [69]:
import pandas as pd
from sklearn import datasets
from sklearn.preprocessing import Normalizer
from sklearn.model_selection import train_test_split

## Example data

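This is the dataset of house prices in Boston. See the Learn More section below to read more about this example dataset. Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so the cells below need an older release.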

In [70]:
boston = datasets.load_boston()
columns = boston.feature_names
bos_df = pd.DataFrame(boston.data, columns = columns)
bos_df['PRICE'] = boston.target
bos_df.head()

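|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE |
|---|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|---|-------|-------|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |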


## Split into training and test sets

In [71]:
X = bos_df.drop('PRICE', axis = 1)
y = bos_df['PRICE']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.20, random_state = 13)

## Using Normalizer

In [74]:
scaler = Normalizer().fit(X_train)
normalized_X = scaler.transform(X_train)
normalized_X_test = scaler.transform(X_test)
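
Unlike `StandardScaler`, `Normalizer` rescales each row using only that row's own values, so the data passed to `fit` does not influence the result. A minimal sketch of this, assuming small made-up arrays in place of the Boston data; `other_data`, `from_train`, and `from_other` are illustrative names:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical toy arrays with two features each.
X_train = np.array([[3.0, 4.0], [1.0, 0.0]])
X_test = np.array([[6.0, 8.0]])
other_data = np.array([[100.0, -7.0]])

# Fit on two unrelated datasets, then transform the same test row.
from_train = Normalizer().fit(X_train).transform(X_test)
from_other = Normalizer().fit(other_data).transform(X_test)

# Both give the same unit-norm row: [6, 8] / 10 = [0.6, 0.8].
print(from_train, from_other)
print(np.linalg.norm(from_train, axis=1))
```

Calling `fit` first, as in the cell above, keeps the code consistent with other scikit-learn transformers.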

# Learn More

In [75]:
print(boston.DESCR)

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu