# Experiment 3: Cross Validation and Splitting Data

1. Implement the K-fold cross-validation technique
2. Implement a program that avoids the data leakage caused by naive data preparation


## 1. K-fold cross validation technique

K-Folds cross-validator:  
Provides train/test indices to split data into train/test sets. Splits the dataset into k consecutive folds (without shuffling by default).  
Each fold is then used once as the validation set, while the remaining k - 1 folds form the training set.

Methods:
- split(X, y=None, groups=None): yields the train and test indices for each fold
- get_n_splits(X=None, y=None, groups=None): returns the number of splitting iterations in the cross-validator
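
As a minimal sketch of the two methods above, the snippet below calls `get_n_splits` and then iterates over `split` with shuffling enabled. The 10-sample array and the names `X_demo`, `kf_demo`, and `test_indices` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import KFold

# Illustrative data (assumed for this sketch): 10 samples, 2 features each
X_demo = np.arange(20).reshape(10, 2)

# 5 folds, shuffled with a fixed seed so the split is reproducible
kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)
print("Number of folds:", kf_demo.get_n_splits(X_demo))

# Collect the test indices produced by every fold
test_indices = []
for train_idx, test_idx in kf_demo.split(X_demo):
    print("TRAIN:", train_idx, "TEST:", test_idx)
    test_indices.extend(test_idx.tolist())

# Every sample appears in exactly one test fold
print(sorted(test_indices))
```

With `shuffle=True`, a fixed `random_state` is what makes the fold assignment repeatable between runs.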

In [29]:
import numpy as np
from sklearn.model_selection import KFold

# dataset in the form of numpy array
X = np.array([[10, 20], [30, 40], [100, 2], [300, 400]])
y = np.array([1, 2, 3, 4])

# Define the split - into 2 folds, no shuffling by default
kf = KFold(n_splits=2, shuffle=False, random_state=None) 

# get the indices of the split
for train_index, test_index in kf.split(X):
  print("TRAIN:", train_index, "TEST:", test_index)
  X_train, X_test = X[train_index], X[test_index]
  y_train, y_test = y[train_index], y[test_index]

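# After the loop, these variables hold the arrays from the last fold only
print(X_train)
print(X_test)
print(y_train)
print(y_test)

# split() yields one (train_indices, test_indices) tuple per fold;
# unpacking into two names works here only because n_splits=2
fold_1, fold_2 = kf.split(X)
print(fold_1)
print(fold_2)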

TRAIN: [2 3] TEST: [0 1]
TRAIN: [0 1] TEST: [2 3]
[[10 20]
 [30 40]]
[[100   2]
 [300 400]]
[1 2]
[3 4]
(array([2, 3]), array([0, 1]))
(array([0, 1]), array([2, 3]))
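
The indices returned by `split` are usually used to train and score a model once per fold. The following is a hedged sketch of that loop, not part of the original experiment: the synthetic dataset, the choice of `LogisticRegression`, and the names `X_cv`, `y_cv`, `kf_cv`, `fold_model`, and `fold_scores` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# Small synthetic dataset (assumed for this sketch)
X_cv, y_cv = make_classification(n_samples=200, n_features=10, n_informative=5,
                                 n_redundant=2, random_state=0)

kf_cv = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in kf_cv.split(X_cv):
    # Train a fresh model on the k - 1 training folds
    fold_model = LogisticRegression(max_iter=1000)
    fold_model.fit(X_cv[train_idx], y_cv[train_idx])
    # Score it on the single held-out fold
    preds = fold_model.predict(X_cv[test_idx])
    fold_scores.append(accuracy_score(y_cv[test_idx], preds))

print("Per-fold accuracy:", np.round(fold_scores, 3))
print("Mean accuracy: %.3f" % np.mean(fold_scores))
```

Averaging the per-fold scores gives a performance estimate that depends less on one particular train/test split.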


## 2. Naive (simple) Data preparation

### make_classification method:

Use make_classification() to create a variety of classification datasets.     
Parameters:
- n_samples: The number of observations (rows) to generate.
- n_features: The total number of features (columns), including informative and redundant ones.
- n_informative: The number of features that are ‘useful.’ Only these features carry the signal that your model will use to classify the dataset.
- n_classes: The number of unique classes (values) for the target label.
- n_redundant: The number of redundant features. These features are generated as random linear combinations of the informative features, so they carry no additional information.
- random_state: Set this value for reproducibility.
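
As a quick check of these parameters, the sketch below generates a dataset with the same settings used later in this experiment and inspects its shape and labels. The variable names (`X_mc`, `y_mc`, `X_again`, `y_again`) are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification

# 1000 rows, 20 columns: 15 informative + 5 redundant features, 2 classes
X_mc, y_mc = make_classification(n_samples=1000, n_features=20, n_informative=15,
                                 n_redundant=5, n_classes=2, random_state=7)
print(X_mc.shape, y_mc.shape)

# Class labels and how many samples fall in each class
print(np.unique(y_mc, return_counts=True))

# The same random_state reproduces exactly the same dataset
X_again, y_again = make_classification(n_samples=1000, n_features=20, n_informative=15,
                                       n_redundant=5, n_classes=2, random_state=7)
print(np.array_equal(X_mc, X_again) and np.array_equal(y_mc, y_again))
```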

### MinMaxScaler method:

- Transforms features by scaling each feature to a given range.
- This estimator scales and translates each feature individually such that it is in the given range on the training set, e.g. between zero and one.
- The transformation is given by:

```
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
where min, max = feature_range.
```
- This transformation is often used as an alternative to zero mean, unit variance scaling.
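
To make the formula concrete, the sketch below applies it by hand with NumPy and compares the result with `MinMaxScaler`. It reuses the small array from the K-fold example; the names `X_small`, `X_manual`, and `X_sklearn` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_small = np.array([[10, 20], [30, 40], [100, 2], [300, 400]], dtype=float)

# Manual version of the transformation, with feature_range = (0, 1)
range_min, range_max = 0.0, 1.0
X_std = (X_small - X_small.min(axis=0)) / (X_small.max(axis=0) - X_small.min(axis=0))
X_manual = X_std * (range_max - range_min) + range_min

# The same transformation using the estimator
X_sklearn = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_small)

print(X_manual)
print(np.allclose(X_manual, X_sklearn))
```

Each column is scaled independently, so the smallest value in a column maps to 0 and the largest to 1.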

### train_test_split method:
Split arrays or matrices into random train and test subsets.
Parameter:
- test_size: the proportion of the dataset to include in the test set. Defaults to 0.25 when train_size is also left unset.
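
A short sketch of how `test_size` affects the resulting subsets is shown below. The toy arrays and the names (`X_tts`, `y_tts`, and the split outputs) are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1000 samples with 2 features, alternating 0/1 labels
X_tts = np.arange(2000).reshape(1000, 2)
y_tts = np.arange(1000) % 2

# Default split: 25% of the rows go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(X_tts, y_tts, random_state=1)
print(X_tr.shape, X_te.shape)

# Explicit split: 33% of the rows go to the test set
X_tr33, X_te33, y_tr33, y_te33 = train_test_split(X_tts, y_tts, test_size=0.33,
                                                  random_state=1)
print(X_tr33.shape, X_te33.shape)
```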


### 2.1 With Data Leakage: (Wrong approach)
Call fit_transform() on the whole dataset, then split it into train and test sets.

In [33]:
# naive approach: normalize the whole dataset before splitting it and evaluating the model
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)

# normalize the dataset (min-max scaling, not standardization)
# new_val = (old_val - min) / (max - min)
scaler = MinMaxScaler() # values between 0 and 1
X = scaler.fit_transform(X)

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

# fit the model
model = LogisticRegression()
model.fit(X_train, y_train)
# predict on the test set
yhat = model.predict(X_test)
# model.score() reports the same mean accuracy, as a fraction
print(model.score(X_test,y_test))

# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % (accuracy*100))


0.8484848484848485
Accuracy: 84.848


### Data leakage:
Data leakage refers to a problem where information about the holdout dataset, such as a test or validation dataset, is made available to the model in the training dataset. This leakage is often small and subtle but can have a marked effect on performance.

A naive approach to preparing data applies the transform to the entire dataset before evaluating the performance of the model. In the example above, MinMaxScaler computes each feature's minimum and maximum from all rows, including those that later end up in the test set, so knowledge of the hold-out test set leaks into the data used to train the model. This can result in an incorrect estimate of model performance when making predictions on new data.
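
One way to see the leak directly is to compare the statistics that the scaler learns in each case. The sketch below is an illustration added under stated assumptions: it uses the same dataset settings as this experiment, and the names `X_leak`, `scaler_all`, `scaler_train`, and `n_affected` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X_leak, y_leak = make_classification(n_samples=1000, n_features=20, n_informative=15,
                                     n_redundant=5, random_state=7)
X_tr_raw, X_te_raw, _, _ = train_test_split(X_leak, y_leak, test_size=0.33,
                                            random_state=1)

# Naive: the scaler sees every row, including future test rows
scaler_all = MinMaxScaler().fit(X_leak)
# Correct: the scaler sees the training rows only
scaler_train = MinMaxScaler().fit(X_tr_raw)

# Count the features whose learned min or max was set by a test row
n_affected = np.sum((scaler_all.data_min_ < scaler_train.data_min_) |
                    (scaler_all.data_max_ > scaler_train.data_max_))
print("Features whose range is influenced by test rows:", n_affected)

# With the train-only scaler, test values may fall slightly outside [0, 1]
X_te_scaled = scaler_train.transform(X_te_raw)
print("Scaled test range: %.3f to %.3f" % (X_te_scaled.min(), X_te_scaled.max()))
```

Any feature counted here was scaled using a statistic that came from the test set, which is exactly the information the model should not have during training.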

### 2.2 Without Data Leakage: (Right approach)
After splitting the dataset, call fit() on the train set only, then transform() on both the train and test sets.

In [None]:
# correct approach: split the data first, then normalize it using the training set's statistics, then evaluate the model
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

# define the scaler
# new_val = (old_val - min) / (max - min)
scaler = MinMaxScaler() # values between 0 and 1
# fit on the training dataset
scaler.fit(X_train)
# scale the training dataset
X_train = scaler.transform(X_train)
# scale the test dataset
X_test = scaler.transform(X_test)

# fit the model
model = LogisticRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)

# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % (accuracy*100))


Accuracy: 85.455
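
The two parts of this experiment can be combined: when K-fold cross-validation is used, the scaler has to be refit on the training folds of every split. A possible sketch, not part of the original notebook, wraps the scaler and the model in a scikit-learn `Pipeline` so that `cross_val_score` does this automatically. The 10-fold setting and the names `pipeline`, `cv`, and `scores` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X_p, y_p = make_classification(n_samples=1000, n_features=20, n_informative=15,
                               n_redundant=5, random_state=7)

# The scaler is fit inside each training fold, never on the held-out fold
pipeline = Pipeline([("scaler", MinMaxScaler()),
                     ("model", LogisticRegression())])

cv = KFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(pipeline, X_p, y_p, scoring="accuracy", cv=cv)
print("Accuracy: %.3f (+/- %.3f)" % (scores.mean() * 100, scores.std() * 100))
```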
