# Training

## Vocabulary
When selecting a model, we distinguish 3 different parts of the data that we
have as follows:

### Training set  
- Model is trained
- Usually 80% of the dataset 

### Testing set
- Model gives predictions
- Unseen data

### Validation set
- Model is assessed
- Usually 20% of the dataset
- Also called hold-out or development set

Once the model has been chosen, it is trained on the entire dataset and tested on the unseen
test set.
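
As a minimal sketch of the vocabulary above, the snippet below splits row indices into training, validation, and test parts using only the standard library. The function name `split_indices`, the 20% test fraction, and the seed are illustrative assumptions; the 80/20 training/validation ratio follows the figures listed above.

```python
import random

def split_indices(n, test_frac=0.2, val_frac=0.2, seed=0):
    """Shuffle range(n) and split it into train, validation, and test index lists.

    test_frac is taken from the whole dataset; val_frac is taken from what
    remains after the test set has been put aside (hypothetical defaults).
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)

    n_test = int(round(n * test_frac))
    test = indices[:n_test]
    rest = indices[n_test:]

    n_val = int(round(len(rest) * val_frac))
    val = rest[:n_val]
    train = rest[n_val:]
    return train, val, test

train_idx, val_idx, test_idx = split_indices(150)
print(len(train_idx), len(val_idx), len(test_idx))
```

The test indices are put aside first so that nothing used for model selection ever touches them.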

## Cross Validation
Class of methods that estimate test error by holding out
a subset of training data from the fitting process.

- Validation Set:
  split the data into a training set and a validation set.
  Train the model on the training set and estimate the test error
  using the validation set, e.g. an 80-20 split.

- Leave-One-Out CV (LOOCV):
  split the data into a training set and a validation set,
  but the validation set consists of 1 observation.
  Repeat n-1 more times until every observation has
  been used as validation. The test error is the average
  of these n test error estimates.

- k-Fold CV:
  randomly divide the data into k groups (folds) of
  approximately equal size. The first fold is used as validation
  and the rest as training. Repeat so that each of the k folds
  serves as validation once, then average the k estimates.
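
The splitting schemes above can be sketched in plain Python. This is an illustration of the idea, not how scikit-learn implements it; `k_fold_indices` and its seed are hypothetical names and values chosen for the example.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # random assignment to folds
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

# k-fold with k = 5 on 10 observations
splits = list(k_fold_indices(10, 5))
for train, validation in splits:
    print(sorted(validation), "held out;", len(train), "used for training")

# LOOCV is the special case k = n: every validation set has one observation
loocv_splits = list(k_fold_indices(10, 10))
```

Every observation lands in exactly one validation fold, which is what lets the k estimates be averaged into a single test-error estimate.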





In [3]:
import pandas as pd
import numpy as np
%matplotlib widget
# svm provides the classifier; metrics provides accuracy scoring
from sklearn import svm, metrics

# MinMaxScaler rescales each feature to a fixed range (0 to 1 by default)
from sklearn.preprocessing import MinMaxScaler

# cross-validation splitters and the scoring helper
from sklearn.model_selection import KFold, cross_val_score, LeaveOneOut

In [1]:
import matplotlib


In [4]:
iris=pd.read_csv("/home/cryzal/ml/dataset/iris.csv")
x=iris.drop("Species", axis=1)
y=iris["Species"]


## MinMaxScaler()

In [6]:
# MinMaxScaler rescales every feature so its values lie between 0 and 1

x=MinMaxScaler().fit_transform(x)
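
With its default `feature_range=(0, 1)`, min-max scaling maps each column through x' = (x - min) / (max - min). The sketch below applies that formula by hand with NumPy on made-up values; it is only meant to show what the transform does, and it assumes no column is constant (which would divide by zero here).

```python
import numpy as np

# Toy feature matrix: 3 observations, 2 features (made-up values)
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [4.0, 40.0]])

col_min = data.min(axis=0)  # per-column minimum
col_max = data.max(axis=0)  # per-column maximum
scaled = (data - col_min) / (col_max - col_min)
print(scaled)
```

After scaling, each column's smallest value is 0 and its largest is 1, regardless of the original units.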

In [7]:
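# define the algorithm: a linear support vector classifier; details are discussed later (dimension reduction)
# note: gamma is ignored by the linear kernel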

svc=svm.SVC(kernel='linear',C=1,gamma='auto')

# K-fold validation

## k-fold
- Training on k − 1 folds and assessment on the remaining one
- Generally k = 5 or 10


## Leave-p-out
- Training on n − p observations and assessment on the p remaining ones
- Case p = 1 is called leave-one-out



In [8]:
# set up the k-fold splitter used for validation (5 folds, no shuffling by default)

k_fold=KFold(n_splits=5)
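
`KFold` does not shuffle unless asked to, so the folds are consecutive blocks of rows. If the rows are stored class by class (an assumption about the CSV used here, not something these notes confirm), some validation folds contain a single class. The sketch below uses synthetic labels as a stand-in to show the effect and two possible remedies, `shuffle=True` and `StratifiedKFold`.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Stand-in for labels stored class by class: 50 of each of 3 classes
labels = np.repeat([0, 1, 2], 50)
features = np.zeros((len(labels), 1))  # placeholder; only the row count matters here

def classes_per_fold(splitter):
    """Return the set of classes found in each validation fold."""
    return [{int(c) for c in labels[val_idx]}
            for _, val_idx in splitter.split(features, labels)]

plain = classes_per_fold(KFold(n_splits=5))
shuffled = classes_per_fold(KFold(n_splits=5, shuffle=True, random_state=0))
stratified = classes_per_fold(StratifiedKFold(n_splits=5))

print("KFold (no shuffle):", plain)
print("KFold (shuffle):   ", shuffled)
print("StratifiedKFold:   ", stratified)
```

Stratification keeps the class proportions of the whole dataset in every fold, which tends to give steadier per-fold scores for classification.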

# Cross Validation
## 1. What is cross-validation? How to do it right?


It's a model validation technique for assessing how the results of a
statistical analysis will generalize to an independent data set.
It is mainly used in settings where the goal is prediction and one wants
to estimate how accurately a model will perform in practice. The
goal of cross-validation is to set aside a data set on which to test the model
during the training phase (i.e. the validation data set) in order to limit
problems like overfitting, and to get insight into how the model will
generalize to an independent data set.

Examples: leave-one-out cross-validation, k-fold cross-validation.

How to do it right?

- The training and validation data sets have to be drawn from
  the same population.

- Example: when predicting stock prices with a model trained on a certain 5-year period,
  it's unrealistic to treat the subsequent 5-year period as a draw from
  the same population.

- Common mistake: leaving model-selection steps outside the cross-validation loop.
  For instance, the step of choosing the kernel
  parameters of an SVM should be cross-validated as well.
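
One way to cross-validate the parameter choice as well is nested cross-validation: an inner loop picks the kernel parameters and an outer loop estimates accuracy. The sketch below is an assumption-laden illustration, not the setup used in this notebook: it loads scikit-learn's bundled iris data instead of the CSV, uses an RBF kernel, and the parameter grid and fold counts are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Scaling lives inside the pipeline, so it is re-fitted on each training split
pipeline = Pipeline([("scale", MinMaxScaler()), ("svc", SVC(kernel="rbf"))])

# Hypothetical grid of kernel parameters to choose from
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.1, 1, 10]}

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)  # picks parameters
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # estimates accuracy

search = GridSearchCV(pipeline, param_grid, cv=inner_cv)
scores = cross_val_score(search, X, y, cv=outer_cv)
print(scores.mean())
```

Because the search is repeated inside every outer training split, the outer validation folds never influence which parameters are chosen.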


## Cross-validation
Cross-validation, also noted CV, is a method that is used to select a
model that does not rely too much on the initial training set.


In [9]:
# run cross-validation: one accuracy score per fold
result = cross_val_score(svc, x, y, cv=k_fold)
result
# Per-fold accuracy:
#   fold 1: 1.0        (100%)
#   fold 2: 1.0        (100%)
#   fold 3: 0.53333333 (53%)
#   fold 4: 0.93333333 (93%)
#   fold 5: 0.6        (60%)
# np.mean(result) gives about 0.81, i.e. 81% mean accuracy

array([1.        , 1.        , 0.53333333, 0.93333333, 0.6       ])
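
The per-fold scores are usually summarised by their mean, with the spread giving a rough idea of how much the estimate depends on the split. A small sketch using only the standard library and the values printed above:

```python
from statistics import mean, pstdev

# Per-fold accuracies copied from the output above
fold_scores = [1.0, 1.0, 0.53333333, 0.93333333, 0.6]

print(f"mean accuracy: {mean(fold_scores):.3f}")
print(f"spread (population std): {pstdev(fold_scores):.3f}")
```

A large spread relative to the mean, as here, suggests that the folds differ a lot from one another.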

## Leave-one-out

In [10]:
loo = LeaveOneOut()


In [11]:
scores = []  # accuracy of each leave-one-out split
for train_index, test_index in loo.split(x):
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    svc.fit(x_train, y_train)  # fit the model on the training split
    y_predt = svc.predict(x_test)  # predict the single held-out observation
    acc = metrics.accuracy_score(y_test, y_predt)  # 1.0 if correct, otherwise 0.0
    scores.append(acc)
# acc holds only the last split's accuracy; np.mean(scores) is the LOOCV estimate

In [12]:
acc

1.0
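
Note that `acc` is reassigned on every pass of the loop, so the `1.0` above is the accuracy on the last held-out observation only, not the leave-one-out estimate. A sketch of one way to get the averaged estimate is to pass `LeaveOneOut()` to `cross_val_score`. It assumes scikit-learn's bundled iris data in place of the CSV used in this notebook, so the numbers may not match exactly.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X = MinMaxScaler().fit_transform(X)

svc = SVC(kernel="linear", C=1)
loo_scores = cross_val_score(svc, X, y, cv=LeaveOneOut())

print(len(loo_scores))    # one score per observation
print(loo_scores.mean())  # fraction of held-out observations predicted correctly
```

Each individual score is either 0 or 1 because every validation set holds a single observation; only the mean is informative.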