# The Validation and Scaling Challenge

In [4]:
# import required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# The Data:

From [openmv.net](https://openmv.net/info/brittleness-index)

### Brittleness index

Description:	A plastic product is produced in three parallel reactors (TK104, TK105, or TK107). For each row in the dataset, we have the same batch of raw material that was split, and fed to the 3 reactors. These values are the brittleness index for the product produced in the reactor.
* Data source:	Simulated data
* Data shape:	15 rows and 3 columns
* Usage restrictions:	None
* Contact person:	Kevin Dunn
* Contact details:	kgdunn@gmail.com
* Added here on:	02 January 2011 20:05
* Last updated:	11 November 2018 16:30

'Type' column added by Josh Johnson.  It's just random integers.

In [5]:
# import the data
df = pd.read_csv('https://docs.google.com/spreadsheets/d/e/2PACX-1vTnOkp-Udq9wDorqcZ2uZeqC_20sORk0zc_9jN6HIXWpsGnY1tCEMGKjrEZlImuxrhS08OYUNrU_FJc/pub?output=csv')
df.head()

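|   | TK104 | TK105 | TK107 | Type |
|---|-------|-------|-------|------|
| 0 | 254.0 | 263.0 | 338 | 0 |
| 1 | 440.0 | NaN | 470 | 1 |
| 2 | 501.0 | NaN | 558 | 2 |
| 3 | 368.0 | 451.0 | 426 | 2 |
| 4 | 697.0 | 709.0 | 733 | 1 |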


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23 entries, 0 to 22
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   TK104   20 non-null     float64
 1   TK105   21 non-null     float64
 2   TK107   23 non-null     int64  
 3   Type    23 non-null     int64  
dtypes: float64(2), int64(2)
memory usage: 864.0 bytes


## Challenge 1. 

Split the data into X and y.  The 'Type' column is your target.

In [8]:
X = df.drop(columns = 'Type')
y = df['Type']

## Challenge 2.

Perform a validation split, using 30% of the data for the test set. You may need to look up the `train_test_split` documentation to find out how to do this.

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size= .3)
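
As a quick sanity check on how `test_size=.3` turns into row counts, here is a minimal sketch on a synthetic 23-row frame. The values are random stand-ins, not the real brittleness data, and the names `demo`, `X_demo`, `y_demo`, `X_tr`, `X_te`, `y_tr`, and `y_te` are made up for this illustration. scikit-learn rounds the test fraction up, so 23 rows should give 7 test rows and 16 training rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the brittleness data: 23 rows, same column names,
# random values (NOT the real measurements).
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.integers(200, 800, size=(23, 3)),
    columns=["TK104", "TK105", "TK107"],
)
demo["Type"] = rng.integers(0, 3, size=23)

X_demo = demo.drop(columns="Type")
y_demo = demo["Type"]

# Same arguments as the notebook cell above.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

print(X_tr.shape, X_te.shape)  # (16, 3) (7, 3)

# No row appears in both sets.
print(set(X_tr.index).isdisjoint(X_te.index))  # True
```

Fixing `random_state` makes the split reproducible, so the same rows land in the test set on every run.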

## Challenge 3.

Standard scale the values in X_train and X_test without leaking information from the test set into the training set. The target, y, is not scaled.

StandardScaler returns a NumPy array. To check whether the scaling worked, you can convert the result back to a DataFrame by passing the transformed array to pd.DataFrame().

In [11]:
scaler = StandardScaler()
scaler.fit(X_train)

StandardScaler()

In [13]:
X_train_transformed = scaler.transform(X_train)
X_test_transformed = scaler.transform(X_test)

In [14]:
X_train_df = pd.DataFrame(X_train_transformed)
X_test_df = pd.DataFrame(X_test_transformed)
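
Calling `pd.DataFrame(array)` on its own labels the columns 0, 1, 2 and resets the row index. One possible way to keep the original labels is to pass `columns` and `index` explicitly. The sketch below assumes two tiny made-up frames (`train` and `test`) and a separate `demo_scaler`, all hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical mini training/test frames with non-default row indexes,
# mimicking what train_test_split returns after shuffling.
train = pd.DataFrame(
    {"TK104": [254.0, 440.0, 501.0, 368.0], "TK107": [338, 470, 558, 426]},
    index=[7, 2, 11, 5],
)
test = pd.DataFrame({"TK104": [697.0], "TK107": [733]}, index=[4])

# Fit on the training rows only, so no test statistics leak in.
demo_scaler = StandardScaler().fit(train)

# Re-attach the original column names and row labels to the scaled arrays.
train_scaled = pd.DataFrame(
    demo_scaler.transform(train), columns=train.columns, index=train.index
)
test_scaled = pd.DataFrame(
    demo_scaler.transform(test), columns=test.columns, index=test.index
)

print(train_scaled.round(2))
print(test_scaled.round(2))
```

Keeping the index makes it possible to line the scaled rows up with `y_train` and `y_test` later.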

In [15]:
X_train_df.describe().round(2)

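|       | 0 | 1 | 2 |
|-------|-------|-------|-------|
| count | 14.0 | 15.0 | 16.0 |
| mean | -0.0 | 0.0 | -0.0 |
| std | 1.04 | 1.04 | 1.03 |
| min | -2.15 | -1.85 | -1.97 |
| 25% | -0.54 | -0.59 | -0.41 |
| 50% | -0.15 | -0.06 | -0.14 |
| 75% | 0.55 | 0.58 | 0.59 |
| max | 2.33 | 1.81 | 2.18 |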
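
The `std` row reads about 1.03 to 1.04 rather than exactly 1, and the counts differ by column. A likely explanation: `StandardScaler` divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`), which is larger by a factor of sqrt(n / (n - 1)). For n = 16 that factor is about 1.03. `StandardScaler` also skips missing values when fitting and leaves them as NaN in the output, which would account for the lower counts. The sketch below illustrates both effects on a made-up column; `col` and its values are not taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up column with one missing value, for illustration only.
col = pd.DataFrame({"x": [254.0, 440.0, np.nan, 368.0, 697.0]})

# NaN is ignored during fit and passed through unchanged by transform.
scaled = pd.DataFrame(StandardScaler().fit_transform(col), columns=col.columns)

n = scaled["x"].count()               # number of non-missing values
pop_std = scaled["x"].std(ddof=0)     # the quantity StandardScaler sets to 1
sample_std = scaled["x"].std(ddof=1)  # the quantity describe() reports

print(n, pop_std, sample_std)
print(np.sqrt(n / (n - 1)))           # matches sample_std
```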


# Bonus Challenge

1. Load the original sales data from your project.
2. Set 'Item_Outlet_Sales' as your target.
3. Perform a validation split.
4. Scale the data using a StandardScaler object.
5. DO NOT LEAK DATA from the test set to the training set.
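
The bonus steps can be sketched roughly as follows. This is not a solution for the real project file: only the target name 'Item_Outlet_Sales' comes from the challenge text, while the `sales` frame, its `feature_a` and `feature_b` columns, and all values are made up so the block runs on its own. If the real data contains non-numeric columns, those would need separate handling before scaling.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in for the project's sales data. Feature names and values are
# hypothetical; replace this with your own pd.read_csv(...) call.
rng = np.random.default_rng(1)
sales = pd.DataFrame({
    "feature_a": rng.normal(12, 4, size=40),
    "feature_b": rng.normal(140, 60, size=40),
    "Item_Outlet_Sales": rng.normal(2000, 900, size=40),
})

# Step 2: separate features and target.
X_sales = sales.drop(columns="Item_Outlet_Sales")
y_sales = sales["Item_Outlet_Sales"]

# Step 3: validation split (default test_size of 0.25 here).
X_sales_train, X_sales_test, y_sales_train, y_sales_test = train_test_split(
    X_sales, y_sales, random_state=42
)

# Steps 4 and 5: learn mean and std from the training rows only,
# then apply those same statistics to the test rows.
sales_scaler = StandardScaler()
X_sales_train_scaled = sales_scaler.fit_transform(X_sales_train)
X_sales_test_scaled = sales_scaler.transform(X_sales_test)

print(X_sales_train_scaled.mean(axis=0).round(2))  # close to 0 by construction
print(X_sales_test_scaled.mean(axis=0).round(2))   # usually not exactly 0
```

The test-set means are generally not zero, which is expected: the test rows are scaled with the training statistics, not their own.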