# Modeling exercise

## General Instructions

* Submission date: 4.5.2023
* Submission Method: Link to your solution notebook in [this sheet](https://docs.google.com/spreadsheets/d/1GNPESGIhJpPb7LwMAyjF5qpJfZQak_mLkE3i5Y7a_VA/edit?usp=sharing).

In [None]:
!pwd

In [None]:
import sys; sys.path.append('../src')
import numpy as np
import plotly.express as px

In [None]:
import pandas as pd
import ipywidgets as widgets

In [None]:
from datasets import make_circles_dataframe, make_moons_dataframe

## Fitting and Overfitting

The goal of the following exercise is to:
* Observe overfitting due to insufficient data
* Observe overfitting due to an overly complex model
* Identify the overfitting point by looking at the train vs. test error dynamics
* Observe how noise levels affect the number of data samples and the model capacity needed

To do so, you'll code an experiment in the first part, and analyze the experiment result in the second part.

### Building an experiment

Code:

1. Create data of size N with noise level of magnitude NL from dataset DS_NAME.
1. Split it into training and validation data (no need for a test set), using an 80%-20% split.
1. Use logistic regression and one complex model of your choice: [KNN](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), [SVM with RBF kernel](https://scikit-learn.org/stable/modules/svm.html) with different `gamma` values, or [Random forest classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) with different values of `min_samples_split`.
1. Train on the train set for different hyperparameter values. Compute:
   1. Classification accuracy on the training set (TRE)
   1. Classification accuracy on the validation set (TESTE)
   1. The difference between the two above (E_DIFF)
1. Save DS_NAME, N, NL, CLF_NAME, K, TRE, TESTE, E_DIFF and the regularization/hyperparameter value (K, gamma or min_samples_split, and the regularization value for the logistic regression classifier)

Repeat for:
* DS_NAME in Moons, Circles
* N (number of samples) in [5, 10, 50, 100, 1000, 10000]
* NL (noise level) in [0, 0.1, 0.2, 0.3, 0.4, 0.5]
* For the complex model: 10 Values of hyper parameter of the complex model you've chosen.
* For the linear model: 7 values of ridge (L2) regularization - [0.001, 0.01, 0.1, 1, 10, 100, 1000]
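
As a rough illustration of steps 1 to 5 for a single configuration, the sketch below generates one moons dataset, splits it 80%-20%, and sweeps `gamma` for an RBF SVM. It calls scikit-learn's `make_moons` directly rather than the course's `datasets` module, and the helper name `run_single_configuration`, the seed, and the `gamma` grid are assumptions made for this example only.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def run_single_configuration(n, noise_level, gammas, seed=0):
    """Hypothetical helper: fit one RBF SVM per gamma and return one result row each."""
    X, y = make_moons(n_samples=n, noise=noise_level, random_state=seed)
    # 80%-20% train/validation split, reproducible through random_state
    x_train, x_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    rows = []
    for gamma in gammas:
        clf = SVC(kernel="rbf", gamma=gamma).fit(x_train, y_train)
        tre = clf.score(x_train, y_train)  # accuracy on the training set (TRE)
        teste = clf.score(x_val, y_val)    # accuracy on the validation set (TESTE)
        rows.append({
            "DS_NAME": "moons", "N": n, "NL": noise_level, "CLF_NAME": "svm",
            "gamma": float(gamma), "TRE": tre, "TESTE": teste, "E_DIFF": tre - teste,
        })
    return rows


# Assumed gamma grid, chosen only for illustration
rows = run_single_configuration(n=1000, noise_level=0.2, gammas=np.logspace(-2, 3, 6))
```

Wrapping this call in loops over DS_NAME, N and NL, and collecting the rows in one list, gives the table that the analysis part works from.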

### Analysing the experiment results

1. For SVM only, for the dataset of size 10k and for each dataset type, what are the best model params? How stable are they?
1. For SVM only, for the dataset of size 10k and for each dataset type, what is the most stable model and what are its params? How good is it in comparison to other models? Explain using bias and variance terminology.
1. Does regularization help for linear models? Consider different dataset sizes.
1. For a given noise level of your choice, how do the train, test and difference errors change with increasing data sizes? (answer for SVM and LR separately)
1. For a given noise level of your choice, how do the train, test and difference errors change with increasing model complexity? (answer for SVM and LR separately)
1. Does the noise level affect the number of datapoints needed to reach optimal test results?

Bonus:

* For SVM: Select one dataset with a 0.2 noise level. Identify the optimal model params, and visualize the learned decision boundary.
  * Hint: Use a grid. See classification models notebook 
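
The grid hint can be sketched as follows: build a regular grid over the feature plane, predict the class of every grid point, and reshape the predictions back to the grid shape. The `gamma=1.0` value is a placeholder and not a tuned optimum, and the grid resolution and margin are arbitrary choices for this sketch.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
# gamma=1.0 is a placeholder value, not the optimum asked for in the exercise
clf = SVC(kernel="rbf", gamma=1.0).fit(X, y)

# Regular grid covering the data with a small margin on each side
xs = np.linspace(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, 200)
ys = np.linspace(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5, 150)
xx, yy = np.meshgrid(xs, ys)

# Predict the class of every grid point, then restore the 2D grid shape
grid_points = np.column_stack([xx.ravel(), yy.ravel()])
zz = clf.predict(grid_points).reshape(xx.shape)
```

The array `zz` can then be drawn as a heatmap or contour layer underneath the scatter plot of the samples.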

## Tips and Hints

For building the experiment:

* Start with one dataframe holding all the data for both datasets with different noise levels. Use the `make_<dataset_name>_dataframe()` functions below, and add two columns, dataset_name and noise_level, before appending the new dataset to the rest of the datasets. Use `df = pd.DataFrame()` to start with an empty dataframe and, using a loop, add data to it with `df = pd.concat([df, <the needed df here>])` (`DataFrame.append` was removed in pandas 2.0). Verify that you have 10k samples for each dataset type and noise level with a proper `.value_counts()`. You can modify the 
* When you'll need an N samples data with a specific noise level, use `query()` and `head(n)` to get the needed dataset. 
* Use sklearn `train_test_split()` method to split the data with `test_size` and `random_state` parameters set correctly to ensure you always split the data the same way for a given fold `k`. Read [the docs](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) if needed. 
* You can also skip writing your own data splitter and instead use `model_selection.cross_validate()` from sklearn. You'll need to ask for the train errors as well as the test errors, see [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html).
* Use prints in proper location to ensure the progress of the experiment. 
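
A minimal sketch of the first two tips, assuming scikit-learn's `make_moons` and `make_circles` stand in for the course's `make_<dataset_name>_dataframe()` helpers. The shortened noise list, the 1000-sample size, and the variable names are illustrative choices only.

```python
import pandas as pd
from sklearn.datasets import make_circles, make_moons

generators = {"moons": make_moons, "circles": make_circles}
noise_levels = [0, 0.1, 0.2]  # shortened list, for illustration only

frames = []
for dataset_name, make_data in generators.items():
    for noise_level in noise_levels:
        features, labels = make_data(n_samples=1000, noise=noise_level, random_state=0)
        frame = pd.DataFrame(features, columns=["x", "y"])
        frame["label"] = labels
        frame["dataset_name"] = dataset_name
        frame["noise_level"] = noise_level
        frames.append(frame)

# One concat at the end replaces repeated DataFrame.append calls
datasets = pd.concat(frames, ignore_index=True)

# Every (dataset_name, noise_level) pair should have the same number of rows
counts = datasets[["dataset_name", "noise_level"]].value_counts()

# First n rows of one configuration
subset = datasets.query("dataset_name == 'moons' and noise_level == 0.1").head(50)
```

Collecting the frames in a list and concatenating once avoids copying the growing dataframe on every loop iteration.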

**If you get stuck and need a reference, scroll to the end of the notebook to see more hints!**

## Moons dataset

In [None]:
from sklearn.datasets import make_moons,make_circles


In [None]:
moons_df = make_moons_dataframe(n_samples=1000, noise_level=0.1)
moons_df.head()

In [None]:
@widgets.interact
def plot_noisy_moons(noise_level = widgets.FloatSlider(value=0, min=0, max=0.5, step=0.05)):
    moons_df = make_moons_dataframe(n_samples=1000, noise_level=noise_level)
    return px.scatter(moons_df, x='x', y='y', color = 'label')

## Circles Dataset

In [None]:
circles_df = make_circles_dataframe(n_samples=500, noise_level=0)
circles_df.head()

In [None]:
@widgets.interact
def plot_noisy_circles(noise_level = widgets.FloatSlider(value=0, min=0, max=0.5, step=0.05)):
    df = make_circles_dataframe(1000, noise_level)
    return px.scatter(df, x='x', y='y', color = 'label')

## Appendix

### More hints!

If you build the datasets dataframe correctly, you'll have **one** dataframe that has dataset_name and noise_level columns, as well as the regular x, y, label columns. To ensure you've appended everything correctly, group by the proper columns and look at the size:

In [None]:
# Use proper groupby statement to ensure the datasets dataframe contains data as expected. You should see the following result:

Your 

Your experiment code should look something like this:

In [None]:
datasets_type = ['circles', 'moons']
k_folds = 10
n_samples = [10, 50, 100, 1000, 10000]
noise_levels = [0, 0.1, 0.2, 0.3, 0.4, 0.5]
clf_types = ['log_reg', 'svm']
hp_range = <'Your hyper parameters ranges here'>
regularization_values = <'Your regularization values here'>
results = []
for ds_type in datasets_type:
    print(f'Working on {ds_type}')
    for nl in noise_levels:
        for n in n_samples:
            ds = datasets.query(<'your query here'>).head(n)
            print(f'Starting {k_folds}-fold cross validation for {ds_type} datasets with {n} samples and noise level {nl}. Going to train {clf_types} classifiers.')
            for k in range(k_folds):
                X, Y = <'Your code here'>
                x_train,x_test,y_train,y_test= <'Your code here'>
                for clf_type in clf_types:
                    if clf_type == 'log_reg':
                        for regularization_value in regularization_values:
                            train_acc, test_acc = <'Your code here'>
                            results.append(<'Your code here'>)
                    if clf_type == 'svm':
                        for gamma in hp_range:
                            train_acc, test_acc = <'Your code here'>
                            results.append(<'Your code here'>)
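
One possible way to obtain the per-fold train and validation accuracies for a single model, as suggested in the tips, is `cross_validate` with `return_train_score=True`. The dataset size, noise level, and `gamma` below are placeholder values for this sketch.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# With an integer cv and a classifier, scikit-learn uses stratified k-fold splits
scores = cross_validate(
    SVC(kernel="rbf", gamma=1.0),  # gamma is a placeholder value
    X, y, cv=10, scoring="accuracy", return_train_score=True,
)

train_acc = scores["train_score"]  # one training accuracy per fold
test_acc = scores["test_score"]    # one validation accuracy per fold
e_diff = train_acc - test_acc
```

This replaces the explicit loop over `k` in the template, at the cost of less control over how each fold is built.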

## Create data set

In [7]:
# import warnings filter
from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)

In [8]:
import numpy as np
import pandas as pd
import plotly.express as px
#import ipywidgets as widgets
from sklearn.datasets import make_moons,make_circles

def make_circles_dataframe(n_samples=500, noise_level=0):
    features, true_labels = make_circles(n_samples=n_samples, noise=noise_level)
    circles_df = pd.DataFrame([[x, y, l] for (x, y), l in zip(features, true_labels)], columns=['x', 'y', 'label'])
    circles_df.label = circles_df.label.astype(str)
    circles_df['datasets_type'] = 'circles'
    circles_df['noise_levels'] = noise_level
    return circles_df

def make_moons_dataframe(n_samples=500, noise_level=0):
    features, true_labels = make_moons(n_samples=n_samples, noise=noise_level)
    moons_df = pd.DataFrame([[x, y, l] for (x, y), l in zip(features, true_labels)], columns=['x', 'y', 'label'])
    moons_df.label = moons_df.label.astype(str)
    moons_df['datasets_type'] = 'moons'
    moons_df['noise_levels'] = noise_level
    return moons_df
    
circles_df = make_circles_dataframe(n_samples=500, noise_level=0)
moons_df = make_moons_dataframe(n_samples=1000, noise_level=0.1)
#circles_df
#moons_df
#fig = px.scatter(moons_df, x='x', y='y', color='label')
#fig.update_traces(marker=dict(size=12, line=dict(width=2, color='DarkSlateGrey')), selector=dict(mode='markers'))
#fig.show() 
#print(moons_df.sample(25))

frames = []
for noise in [0, 0.1, 0.2, 0.3, 0.4, 0.5]:
    circles_df = make_circles_dataframe(n_samples=10000, noise_level=noise)
    circles_df['x2']= circles_df['x']**2
    circles_df['y2']= circles_df['y']**2
   
    moons_df = make_moons_dataframe(n_samples=10000, noise_level=noise)
    moons_df['x2']= moons_df['x']**2
    moons_df['y2']= moons_df['y']**2
    
    frames.extend([circles_df, moons_df])

# DataFrame.append was removed in pandas 2.0, so concatenate all frames once
full_df = pd.concat(frames)
   
print(full_df.shape)
print(full_df.columns)
print(full_df.sample(10))
#print(full_df.groupby('datasets_type').noise_levels.value_counts())





(120000, 7)
Index(['x', 'y', 'label', 'datasets_type', 'noise_levels', 'x2', 'y2'], dtype='object')
             x         y label datasets_type  noise_levels        x2        y2
2568  1.048790 -0.492513     1         moons           0.3  1.099960  0.242569
9617  0.797077  0.711785     0         moons           0.2  0.635332  0.506637
2063  0.304127 -0.743441     0       circles           0.3  0.092493  0.552704
6166 -0.676227 -0.176884     0       circles           0.4  0.457283  0.031288
6699  0.101293 -0.351772     1         moons           0.2  0.010260  0.123743
9795 -1.099359  0.478906     0         moons           0.2  1.208591  0.229351
7401 -0.063479 -0.866252     0       circles           0.1  0.004030  0.750392
2698  0.919893 -0.278112     0       circles           0.3  0.846203  0.077346
2332 -0.667669  0.743002     0         moons           0.1  0.445782  0.552052
1443 -0.913174  0.287846     1       circles           0.3  0.833888  0.082855


In [10]:
# plot
full_df['str_label']=full_df['label'].apply(str)
import ipywidgets as widgets
@widgets.interact
def plot_provider_success_rate_per_region(noise_levels=full_df.noise_levels.unique()):
    data = full_df.loc[full_df['noise_levels']==noise_levels].reset_index()
    fig = px.scatter(data, x='x', y='y',color='str_label', facet_row='datasets_type',height=600)#,
    fig.update_traces(marker=dict(size=12, line=dict(width=2, color='DarkSlateGrey')), selector=dict(mode='markers'))
    return  fig





In [4]:
# plot
full_df['str_label']=full_df['label'].apply(str)
import ipywidgets as widgets
@widgets.interact
def plot_provider_success_rate_per_region(noise_levels=full_df.noise_levels.unique()):
    data = full_df.loc[full_df['noise_levels']==noise_levels].reset_index()
    fig = px.scatter(data, x='x2', y='y2',color='str_label', facet_row='datasets_type',height=600)#,
    fig.update_traces(marker=dict(size=12, line=dict(width=2, color='DarkSlateGrey')), selector=dict(mode='markers'))
    return  fig


## Building an experiment
### Helper functions

In [5]:
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier


def split_full_dataset(full_dataset, data_type, noise, n_samples):
    # Keep the first n_samples rows of one dataset type at one noise level
    ds = full_dataset.query('datasets_type==@data_type and noise_levels==@noise').head(n_samples)
    return ds

def select_folds_train_test_split(split_dataset, k_folds, k, list_columns):
    # Fold k is a contiguous block of rows held out for validation;
    # every other row goes to the training set
    row_in_folder = split_dataset.shape[0] / k_folds
    ds_test = split_dataset.iloc[int(k*row_in_folder):int(k*row_in_folder+row_in_folder)]
    ds_train = split_dataset[~split_dataset.index.isin(ds_test.index)]
    x_test, y_test = ds_test[list_columns], ds_test['label']
    x_train, y_train = ds_train[list_columns], ds_train['label']
    return x_train, x_test, y_train, y_test

def log_reg_model(regularization_value, x_train, x_test, y_train, y_test):
    # In scikit-learn, C is the inverse of the regularization strength:
    # smaller values mean stronger L2 regularization
    log_reg = LogisticRegression(penalty='l2', C=regularization_value)
    log_reg.fit(x_train, y_train)
    predict_train = log_reg.predict(x_train)
    predict_test = log_reg.predict(x_test)
    # accuracy_score expects (y_true, y_pred)
    TRE = accuracy_score(y_train, predict_train)
    TESTE = accuracy_score(y_test, predict_test)
    E_DIFF = TRE - TESTE

    return predict_test, TRE, TESTE, E_DIFF

def knn_model(k, x_train, x_test, y_train, y_test):
    try:
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(x_train, y_train)
        predict_train = knn.predict(x_train)
        predict_test = knn.predict(x_test)
    except ValueError:
        # k is larger than the number of training rows (small datasets): fall back to k=9
        knn = KNeighborsClassifier(n_neighbors=9)
        knn.fit(x_train, y_train)
        predict_train = knn.predict(x_train)
        predict_test = knn.predict(x_test)

    TRE = accuracy_score(y_train, predict_train)
    TESTE = accuracy_score(y_test, predict_test)
    E_DIFF = TRE - TESTE
    return predict_test, TRE, TESTE, E_DIFF
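
The contiguous-block split in `select_folds_train_test_split` matches what scikit-learn's `KFold` produces with `shuffle=False` when the number of rows is divisible by the number of folds. The sketch below compares the two on row positions only; the sizes are illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold

n_rows, k_folds = 100, 10
row_positions = np.arange(n_rows)

# Without shuffling, KFold yields consecutive blocks of rows as validation folds
kfold = KFold(n_splits=k_folds, shuffle=False)
folds = list(kfold.split(row_positions))

# Same slicing arithmetic as the manual splitter above
rows_per_fold = n_rows / k_folds
manual_test_folds = [
    row_positions[int(k * rows_per_fold):int(k * rows_per_fold + rows_per_fold)]
    for k in range(k_folds)
]
```

When the row count is not divisible by `k_folds`, `KFold` gives the first folds one extra row each, so the two approaches can differ slightly in that case.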




### Run the experiment

In [13]:
# dataset types, sizes and noise levels
datasets_type = ['circles', 'moons']
n_samples =[10, 50, 100, 1000, 10000]
noise_levels = [0, 0.1, 0.2, 0.3, 0.4, 0.5]
# model names
clf_types = ['log_reg', 'knn']
# hyperparameter grids
K_valuo =[1,3,5,11,21,31,41,51,53,55]
regularization_values =[0.001, 0.01, 0.1, 1, 10, 100, 1000]
k_folds = 10
results_log_reg =[]
results_knn = []

for ds_type,columns in zip(datasets_type,[['x2','y2'],['x','y']]): # data type    
    print(f'Working on {ds_type}')
    
    for nl in noise_levels: # select noise level
        for n in n_samples: # select number of samples
            ds =split_full_dataset(full_df,ds_type,nl,n)
            print(f'Starting {k_folds}-fold cross validation for {ds_type} datasets with {n} samples and noise level {nl}. Going to train {clf_types} classifiers.')
            for kf in range(k_folds): # k-fold cross-validation: split into training and validation sets
                x_train,x_test,y_train,y_test=select_folds_train_test_split(ds,k_folds,kf,columns)
                for clf_type in clf_types:
                 
                    if clf_type == 'log_reg':
                        for regularization_value in regularization_values:
                            #print('regularization_value',regularization_value)
                            predict_test,train_acc, test_acc,diff_acc =log_reg_model(regularization_value,x_train,x_test,y_train,y_test)
                            #print( clf_type,  train_acc, test_acc,diff_acc)
                            results_log_reg.append((clf_type,ds_type,nl,n,kf,regularization_value, train_acc, test_acc,diff_acc ))
  
                                      
                    if clf_type == 'knn':
                        for k in K_valuo:
                            #print('k',k)
                            try:
                                predict_test ,train_acc, test_acc,diff_acc =knn_model(k, x_train,x_test,y_train,y_test)
                            except:
                                 predict_test,train_acc, test_acc,diff_acc =knn_model(9, x_train,x_test,y_train,y_test)
                            
                                      
                            results_knn.append((clf_type,ds_type,nl,n,kf,k, train_acc, test_acc,diff_acc ))
                      

                                                       
                           
log_reg_df=pd.DataFrame(results_log_reg ,columns=['model', 'ds_type','noise_levels','n_samples' ,'k_folds', 'regularization_value','TRE','TESTE', 'E_DIFF'])   
knn_df=pd.DataFrame(results_knn,columns=['model', 'ds_type','noise_levels','n_samples', 'k_folds', 'k','TRE','TESTE', 'E_DIFF']) 



#log_reg_df.to_csv('log_reg_df_nwe.csv')
#knn_df.to_csv('knn_df_nwe.csv')


Working on circles
Starting 10-fold cross validation for circles datasets with 10 samples and noise level 0. Going to train ['log_reg', 'knn'] classifiers.
Starting 10-fold cross validation for circles datasets with 50 samples and noise level 0. Going to train ['log_reg', 'knn'] classifiers.
Starting 10-fold cross validation for circles datasets with 100 samples and noise level 0. Going to train ['log_reg', 'knn'] classifiers.
Starting 10-fold cross validation for circles datasets with 1000 samples and noise level 0. Going to train ['log_reg', 'knn'] classifiers.
Starting 10-fold cross validation for circles datasets with 10000 samples and noise level 0. Going to train ['log_reg', 'knn'] classifiers.
Starting 10-fold cross validation for circles datasets with 10 samples and noise level 0.1. Going to train ['log_reg', 'knn'] classifiers.
Starting 10-fold cross validation for circles datasets with 50 samples and noise level 0.1. Going to train ['log_reg', 'knn'] classifiers.
Starting 10-

In [14]:
log_reg_df
knn_df

Unnamed: 0,model,ds_type,noise_levels,n_samples,k_folds,k,TRE,TESTE,E_DIFF
0,knn,circles,0.0,10,0,1,1.000000,1.000,0.000000
1,knn,circles,0.0,10,0,3,0.888889,1.000,-0.111111
2,knn,circles,0.0,10,0,5,0.666667,1.000,-0.333333
3,knn,circles,0.0,10,0,11,0.666667,1.000,-0.333333
4,knn,circles,0.0,10,0,21,0.666667,1.000,-0.333333
...,...,...,...,...,...,...,...,...,...
5995,knn,moons,0.5,10000,9,31,0.832333,0.823,0.009333
5996,knn,moons,0.5,10000,9,41,0.830556,0.824,0.006556
5997,knn,moons,0.5,10000,9,51,0.830889,0.826,0.004889
5998,knn,moons,0.5,10000,9,53,0.830444,0.827,0.003444


## Analysing the experiment results
1. For SVM only, for the dataset of size 10k and for each dataset type, what are the best model params? How stable are they?
1. For SVM only, for the dataset of size 10k and for each dataset type, what is the most stable model and what are its params? How good is it in comparison to other models? Explain using bias and variance terminology.
1. Does regularization help for linear models? Consider different dataset sizes.
1. For a given noise level of your choice, how do the train, test and difference errors change with increasing data sizes? (answer for SVM and LR separately)
1. For a given noise level of your choice, how do the train, test and difference errors change with increasing model complexity? (answer for SVM and LR separately)
1. Does the noise level affect the number of datapoints needed to reach optimal test results?

### Get model results

In [2]:
import numpy as np
import pandas as pd
import plotly.express as px 
import ipywidgets as widgets

log_reg_df=pd.read_csv('log_reg_df.csv')
knn_df=pd.read_csv('knn_df.csv')

data_model=log_reg_df
print(data_model.sample(10))
print(data_model.columns)
c=data_model.columns
print(data_model[c[1]].unique())
print(data_model[c[2]].unique())
print(data_model[c[3]].unique())
print(data_model[c[4]].unique())
print(data_model[c[5]].unique())
print(data_model[c[6]].unique())

      Unnamed: 0    model  ds_type  noise_levels  n_samples  k_folds  \
3776        3776  log_reg    moons           0.4       1000        9   
1033        1033  log_reg  circles           0.2      10000        7   
4019        4019  log_reg    moons           0.5        100        4   
636          636  log_reg  circles           0.1      10000        0   
1014        1014  log_reg  circles           0.2      10000        4   
3068        3068  log_reg    moons           0.2       1000        8   
433          433  log_reg  circles           0.1         50        1   
2324        2324  log_reg    moons           0.0       1000        2   
1152        1152  log_reg  circles           0.3         50        4   
2756        2756  log_reg    moons           0.1      10000        3   

      regularization_value       TRE  TESTE    E_DIFF  
3776                 1.000  0.831111  0.820  0.011111  
1033                10.000  0.508111  0.474  0.034111  
4019                 0.010  0.633333  0

### plot model results

### 1 - For SVM only, For dataset of size 10k and for each dataset
### What are the best model params? How stable is it?

In [42]:
def best_model_params(df):
    # find the best model parameter as the one with the highest mean TESTE,
    # and measure stability as the std of the mean TESTE across parameter values
    best_TESTE=df.query('TESTE==TESTE.max()')[['TESTE']].iloc[0].values[0]
    best_pramter=df.query('TESTE==TESTE.max()')[['pramter']].iloc[0].values[0]
    stable=df['TESTE'].std()
    return  pd.Series({'valuo':best_TESTE,'pramter':best_pramter,'stable':stable})

# data
data=knn_df.query('n_samples==10000')
data=data.rename(columns={'k': 'pramter'})
#data2=data.query('ds_type =="moons" and noise_levels==0.1')
#print(data.sample(10))


# mean over the k folds
type_data_grop=data.groupby(by=['ds_type','noise_levels','pramter'])['TESTE'].mean()
type_data_grop=type_data_grop.reset_index()
print(type_data_grop)

# best_model_params
type_data_grop2=type_data_grop.groupby(by=['ds_type','noise_levels']).apply(lambda df:  best_model_params(df))
type_data_grop2=type_data_grop2.reset_index()
print(type_data_grop2)

@widgets.interact
def plot_bar_best_model_params(col=['pramter','valuo','stable']):
    data = type_data_grop2[[col,'noise_levels','ds_type']]
    data=data.rename(columns={col: "y"})
    fig=px.bar(data, x='noise_levels', y='y', facet_row='ds_type', height=800)
    return  fig


######################################################################################
# The best model parameter depends on the noise level: small for low noise levels and large for high noise levels

# The stability measure depends on the noise level; in the moons dataset it goes down as the noise level goes up

# The model results are better on the moons dataset and decrease as the noise level increases
####################################################################################

     ds_type  noise_levels  pramter   TESTE
0    circles           0.0        1  1.0000
1    circles           0.0        3  1.0000
2    circles           0.0        5  1.0000
3    circles           0.0       11  1.0000
4    circles           0.0       21  1.0000
..       ...           ...      ...     ...
115    moons           0.5       31  0.8190
116    moons           0.5       41  0.8224
117    moons           0.5       51  0.8233
118    moons           0.5       53  0.8239
119    moons           0.5       55  0.8239

[120 rows x 4 columns]
    ds_type  noise_levels   valuo  pramter    stable
0   circles           0.0  1.0000      1.0  0.000000
1   circles           0.1  0.8391     53.0  0.021031
2   circles           0.2  0.6825     53.0  0.028181
3   circles           0.3  0.6162     55.0  0.021752
4   circles           0.4  0.5812     53.0  0.019109
5   circles           0.5  0.5615     55.0  0.018865
6     moons           0.0  1.0000      1.0  0.000000
7     moons           0.
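
An alternative way to read "best" and "stable" is to aggregate over the folds directly: the mean validation accuracy per parameter gives the best value, and the standard deviation across folds gives a stability measure. The sketch below uses a small made-up results table (the numbers are hypothetical and not taken from the experiment above).

```python
import pandas as pd

# Hypothetical results: validation accuracy (TESTE) for two k values over three folds
results = pd.DataFrame({
    "k": [1, 1, 1, 21, 21, 21],
    "fold": [0, 1, 2, 0, 1, 2],
    "TESTE": [0.80, 0.90, 0.70, 0.84, 0.86, 0.85],
})

# Mean and standard deviation of TESTE across folds, per parameter value
summary = results.groupby("k")["TESTE"].agg(mean_teste="mean", std_teste="std")

best_k = summary["mean_teste"].idxmax()        # highest mean validation accuracy
most_stable_k = summary["std_teste"].idxmin()  # lowest spread across folds
```

On the real results the same aggregation would be grouped by `ds_type` and `noise_levels` as well, so that each configuration gets its own best and most stable parameter.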


### 2 - For SVM only, For dataset of size 10k and for each dataset,
### What is the most stable model and model params? 
### How good is it in comparison to other models? Explain using bias and variance terminology.

In [6]:
def most_stable_model(df):
    # mean training accuracy and std of validation accuracy across the folds
    mean_TRE=df['TRE'].mean()
    stable=df['TESTE'].std()
    return  pd.Series({'valuo':mean_TRE,'stable':stable})


def most_stable_model_2(df):
    # find  stable model
    stable=df.query('stable== stable.min()')[['stable']].iloc[0].values[0]
    valuo=df.query('stable== stable.min()')[['valuo']].iloc[0].values[0]
    pramter=df.query('stable== stable.min()')[['pramter']].iloc[0].values[0]
    return  pd.Series({'stable':stable,'pramter': pramter,'valuo': valuo})

# data
data=knn_df.query('n_samples==10000')
data=data.rename(columns={'k': 'pramter'})

# aggregate over the k folds for each parameter value
type_data_grop=data.groupby(by=['ds_type','noise_levels','pramter']).apply(lambda df:  most_stable_model(df))
type_data_grop=type_data_grop.reset_index()
print(type_data_grop)
print('aaaaaaaaaaaaaaaaaaaaaaaaaaaaa')


type_data_grop_2=type_data_grop.groupby(by=['ds_type','noise_levels']).apply(lambda df:  most_stable_model_2(df))
type_data_grop_2=type_data_grop_2.reset_index()
print(type_data_grop_2)




type_data_grop['str_pramter']=type_data_grop['pramter'].apply(str)
@widgets.interact
def plot_provider_success_rate_per_region(noise_levels=type_data_grop.noise_levels.unique()):
    data =type_data_grop.loc[type_data_grop['noise_levels']==noise_levels].reset_index()
    fig = px.scatter(data, x='stable', y='valuo',color='str_pramter', facet_row='ds_type',height=600)#,
    fig.update_traces(marker=dict(size=12, line=dict(width=2, color='DarkSlateGrey')), selector=dict(mode='markers'))
    return  fig

     ds_type  noise_levels  pramter     valuo    stable
0    circles           0.0        1  1.000000  0.000000
1    circles           0.0        3  1.000000  0.000000
2    circles           0.0        5  1.000000  0.000000
3    circles           0.0       11  1.000000  0.000000
4    circles           0.0       21  1.000000  0.000000
..       ...           ...      ...       ...       ...
115    moons           0.5       31  0.826789  0.015406
116    moons           0.5       41  0.827133  0.014766
117    moons           0.5       51  0.827556  0.014174
118    moons           0.5       53  0.827300  0.014888
119    moons           0.5       55  0.827289  0.014495

[120 rows x 5 columns]
aaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    ds_type  noise_levels    stable  pramter     valuo
0   circles           0.0  0.000000      1.0  1.000000
1   circles           0.1  0.007630     21.0  0.847344
2   circles           0.2  0.016322      5.0  0.758267
3   circles           0.3  0.004984     53.0  0.639556


### 3 - Does regularization help for linear models? Consider different dataset sizes.

In [10]:
data=log_reg_df
data=data.rename(columns={'regularization_value': 'pramter'})

print(data.columns)
type_data_grop=data.groupby(by=['ds_type','noise_levels','n_samples','pramter'])['TESTE'].mean()
                       
type_data_grop=type_data_grop.reset_index()
print(type_data_grop)

type_data_grop['str_n_samples']=type_data_grop['n_samples'].apply(str)
type_data_grop['log_pramter']=type_data_grop['pramter'].apply(np.log)
#fig = px.scatter(type_data_grop, x='log_pramter', y='TESTE',color='str_n_samples', height=600)
#fig.update_traces(marker=dict(size=12, line=dict(width=2, color='DarkSlateGrey')), selector=dict(mode='markers'))
#fig.show()



@widgets.interact
def plot_provider_success_rate_per_region(noise_levels=type_data_grop.noise_levels.unique()):
    data = type_data_grop.loc[type_data_grop['noise_levels']==noise_levels].reset_index()
    fig = px.scatter(data, x='log_pramter', y='TESTE',color='str_n_samples', facet_row='ds_type',height=600)#,
    fig.update_traces(marker=dict(size=12, line=dict(width=2, color='DarkSlateGrey')), selector=dict(mode='markers'))
    return  fig


#################################
# Regularization helps for small datasets and does not affect large datasets

Index(['Unnamed: 0', 'model', 'ds_type', 'noise_levels', 'n_samples',
       'k_folds', 'pramter', 'TRE', 'TESTE', 'E_DIFF'],
      dtype='object')
     ds_type  noise_levels  n_samples   pramter   TESTE
0    circles           0.0         10     0.001  0.0000
1    circles           0.0         10     0.010  0.0000
2    circles           0.0         10     0.100  0.0000
3    circles           0.0         10     1.000  0.6000
4    circles           0.0         10    10.000  0.6000
..       ...           ...        ...       ...     ...
415    moons           0.5      10000     0.100  0.8064
416    moons           0.5      10000     1.000  0.8066
417    moons           0.5      10000    10.000  0.8068
418    moons           0.5      10000   100.000  0.8068
419    moons           0.5      10000  1000.000  0.8068

[420 rows x 5 columns]
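
When reading these results, note that the `C` argument of scikit-learn's `LogisticRegression` is the inverse of the regularization strength, so the smallest `pramter` values correspond to the strongest regularization. A small sketch of this, on an illustrative moons sample rather than the experiment data:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)

coef_norms = {}
for C in [0.001, 1, 1000]:
    # Default penalty is L2; smaller C means a stronger penalty on the coefficients
    clf = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    coef_norms[C] = float(np.linalg.norm(clf.coef_))
```

The coefficient norm is smallest for `C=0.001`, where the penalty dominates the fit.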



### 4 - For a given noise level of your choice,
### How do the train, test and difference errors change with increasing data sizes? (answer for SVM and LR separately)

In [19]:
#knn
data=knn_df#.query('noise_levels==0.5')


type_data_grop_knn=data.groupby(by=['ds_type','noise_levels','n_samples'])['E_DIFF'].mean()
type_data_grop_knn=type_data_grop_knn.reset_index()
print(type_data_grop_knn)   

type_data_grop_knn['log_n_samples']=type_data_grop_knn['n_samples'].apply(np.log)
#type_data_grop_knn['str_k']=type_data_grop_knn['k'].apply(str)
@widgets.interact
def plot_provider_success_rate_per_region(noise_levels=type_data_grop_knn.noise_levels.unique()):
    data =type_data_grop_knn.loc[type_data_grop_knn['noise_levels']==noise_levels].reset_index()
    fig = px.scatter(data, x='log_n_samples', y='E_DIFF', facet_row='ds_type',height=600)
    fig.update_traces(marker=dict(size=12, line=dict(width=2, color='DarkSlateGrey')), selector=dict(mode='markers'))
    return  fig




#log_reg
data=log_reg_df#.query('noise_levels==0.5')

type_data_groplog_reg=data.groupby(by=['ds_type','noise_levels','n_samples'])['E_DIFF'].mean()
type_data_groplog_reg=type_data_groplog_reg.reset_index()
print(type_data_groplog_reg)   

type_data_groplog_reg['log_n_samples']=type_data_groplog_reg['n_samples'].apply(np.log)
@widgets.interact
def plot_provider_success_rate_per_region(noise_levels=type_data_groplog_reg.noise_levels.unique()):
    data =type_data_groplog_reg.loc[type_data_groplog_reg['noise_levels']==noise_levels].reset_index()
    fig = px.scatter(data, x='log_n_samples', y='E_DIFF',facet_row='ds_type',height=600)
    fig.update_traces(marker=dict(size=12, line=dict(width=2, color='DarkSlateGrey')), selector=dict(mode='markers'))
    return  fig




##################################################################################
# The difference between train and test accuracy decreases as the number of samples increases
##################################################################################

    ds_type  noise_levels  n_samples    E_DIFF
0   circles           0.0         10  0.445556
1   circles           0.0         50  0.070889
2   circles           0.0        100  0.056111
3   circles           0.0       1000  0.002756
4   circles           0.0      10000  0.000000
5   circles           0.1         10  0.062222
6   circles           0.1         50  0.124222
7   circles           0.1        100  0.150111
8   circles           0.1       1000  0.050644
9   circles           0.1      10000  0.041084
10  circles           0.2         10  0.113333
11  circles           0.2         50  0.163778
12  circles           0.2        100  0.116333
13  circles           0.2       1000  0.085289
14  circles           0.2      10000  0.085071
15  circles           0.3         10  0.070000
16  circles           0.3         50  0.195111
17  circles           0.3        100  0.161333
18  circles           0.3       1000  0.111811
19  circles           0.3      10000  0.108446
20  circles  


    ds_type  noise_levels  n_samples    E_DIFF
0   circles           0.0         10  0.317460
1   circles           0.0         50  0.031429
2   circles           0.0        100  0.022857
3   circles           0.0       1000  0.044587
4   circles           0.0      10000  0.028690
5   circles           0.1         10  0.120635
6   circles           0.1         50  0.041587
7   circles           0.1        100  0.034444
8   circles           0.1       1000  0.017397
9   circles           0.1      10000  0.021763
10  circles           0.2         10  0.180952
11  circles           0.2         50  0.080952
12  circles           0.2        100  0.116984
13  circles           0.2       1000  0.037492
14  circles           0.2      10000  0.028460
15  circles           0.3         10  0.058730
16  circles           0.3         50  0.147619
17  circles           0.3        100  0.082698
18  circles           0.3       1000  0.041079
19  circles           0.3      10000  0.009321
20  circles  

interactive(children=(Dropdown(description='noise_levels', options=(0.0, 0.1, 0.2, 0.3, 0.4, 0.5), value=0.0),…
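
As a rough check of the answer above, the sketch below fits a least-squares slope of E_DIFF against log(n_samples) and tests whether the gap shrinks monotonically. It uses only the circles rows for noise levels 0.0 and 0.1 copied from the first printed table; the helper names `slope_vs_log_n` and `is_monotone_decreasing` are illustrative and not part of the notebook.

```python
import math

# Rows copied from the first table printed above (ds_type = circles).
# Keys are noise levels; values map n_samples -> mean E_DIFF.
e_diff = {
    0.0: {10: 0.445556, 50: 0.070889, 100: 0.056111, 1000: 0.002756, 10000: 0.000000},
    0.1: {10: 0.062222, 50: 0.124222, 100: 0.150111, 1000: 0.050644, 10000: 0.041084},
}

def slope_vs_log_n(by_n):
    """Least-squares slope of E_DIFF against log(n_samples)."""
    xs = [math.log(n) for n in by_n]
    ys = list(by_n.values())
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var = sum((x - x_mean) ** 2 for x in xs)
    return cov / var

def is_monotone_decreasing(by_n):
    """True if E_DIFF never increases as n_samples grows."""
    ys = [by_n[n] for n in sorted(by_n)]
    return all(a >= b for a, b in zip(ys, ys[1:]))

slopes = {noise: slope_vs_log_n(by_n) for noise, by_n in e_diff.items()}
monotone = {noise: is_monotone_decreasing(by_n) for noise, by_n in e_diff.items()}
print(slopes)
print(monotone)
```

For these rows both slopes are negative, but only the noise-free series decreases at every step, which suggests the trend holds on average rather than point by point.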

#### 5 For a given noise level of your choice, how do the train, test and difference errors change with increasing model complexity? (answer for SVM and LR separately)

In [23]:
# KNN: mean train/validation accuracy gap (E_DIFF) per dataset, noise level, sample size and k

data = knn_df  # .query('noise_levels==0.5')

type_data_grop_knn = data.groupby(by=['ds_type', 'noise_levels', 'n_samples', 'k'])['E_DIFF'].mean()
type_data_grop_knn = type_data_grop_knn.reset_index()
# Note: this prints type_data_grop from an earlier cell, not type_data_grop_knn.
print(type_data_grop)

# String columns so that plotly treats n_samples and k as categories
type_data_grop_knn['log_n_samples'] = np.log(type_data_grop_knn['n_samples'])
type_data_grop_knn['str_n_samples'] = type_data_grop_knn['n_samples'].apply(str)
type_data_grop_knn['str_k'] = type_data_grop_knn['k'].apply(str)

# Plot E_DIFF against k for the selected noise level, colored by sample size
@widgets.interact
def plot_provider_success_rate_per_region(noise_levels=type_data_grop_knn.noise_levels.unique()):
    data = type_data_grop_knn.loc[type_data_grop_knn['noise_levels'] == noise_levels].reset_index()
    fig = px.scatter(data, x='str_k', y='E_DIFF', color='str_n_samples', facet_row='ds_type', height=600)
    fig.update_traces(marker=dict(size=12, line=dict(width=2, color='DarkSlateGrey')), selector=dict(mode='markers'))
    return fig




# log_reg: mean train/validation accuracy gap (E_DIFF) per dataset, noise level,
# sample size and regularization value
data = log_reg_df  # .query('noise_levels==0.5')
type_data_grop_log_reg = data.groupby(by=['ds_type', 'noise_levels', 'n_samples', 'regularization_value'])['E_DIFF'].mean()
type_data_grop_log_reg = type_data_grop_log_reg.reset_index()
print(type_data_grop_log_reg.columns)

# String columns so that plotly treats n_samples and the regularization value as categories
type_data_grop_log_reg['log_n_samples'] = np.log(type_data_grop_log_reg['n_samples'])
type_data_grop_log_reg['str_n_samples'] = type_data_grop_log_reg['n_samples'].apply(str)
type_data_grop_log_reg['str_regularization_value'] = type_data_grop_log_reg['regularization_value'].apply(str)

# Plot E_DIFF against the regularization value for the selected noise level, colored by sample size
@widgets.interact
def plot_provider_success_rate_per_region(noise_levels=data.noise_levels.unique()):
    data = type_data_grop_log_reg.loc[type_data_grop_log_reg['noise_levels'] == noise_levels].reset_index()
    fig = px.scatter(data, x='str_regularization_value', y='E_DIFF', color='str_n_samples', facet_row='ds_type', height=600)
    fig.update_traces(marker=dict(size=12, line=dict(width=2, color='DarkSlateGrey')), selector=dict(mode='markers'))
    return fig

     ds_type  noise_levels  n_samples   pramter   TESTE str_n_samples  \
0    circles           0.0         10     0.001  0.0000            10   
1    circles           0.0         10     0.010  0.0000            10   
2    circles           0.0         10     0.100  0.0000            10   
3    circles           0.0         10     1.000  0.6000            10   
4    circles           0.0         10    10.000  0.6000            10   
..       ...           ...        ...       ...     ...           ...   
415    moons           0.5      10000     0.100  0.8064         10000   
416    moons           0.5      10000     1.000  0.8066         10000   
417    moons           0.5      10000    10.000  0.8068         10000   
418    moons           0.5      10000   100.000  0.8068         10000   
419    moons           0.5      10000  1000.000  0.8068         10000   

     log_pramter  
0      -6.907755  
1      -4.605170  
2      -2.302585  
3       0.000000  
4       2.302585  
..       

interactive(children=(Dropdown(description='noise_levels', options=(0.0, 0.1, 0.2, 0.3, 0.4, 0.5), value=0.0),…

Index(['ds_type', 'noise_levels', 'n_samples', 'regularization_value',
       'E_DIFF'],
      dtype='object')


interactive(children=(Dropdown(description='noise_levels', options=(0.0, 0.1, 0.2, 0.3, 0.4, 0.5), value=0.0),…
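
To make the link between model complexity and the train/validation gap concrete, here is a minimal sketch of KNN in plain Python. For KNN, complexity grows as `k` shrinks: with `k=1` every training point is its own nearest neighbor, so training accuracy is perfect even when labels are noisy. The toy points and labels below are hypothetical and only illustrate the mechanism; they are not taken from `knn_df`. For scikit-learn's `LogisticRegression`, the analogous knob is `C`, the inverse of the regularization strength, assuming `regularization_value` was passed as `C`.

```python
from collections import Counter
import math

# Hypothetical toy data: two clusters, each with one deliberately mislabeled point.
train = [
    ((0.0, 0.0), 0), ((0.1, 0.2), 0), ((0.2, 0.1), 1),  # last label is noise
    ((1.0, 1.0), 1), ((0.9, 1.1), 1), ((1.1, 0.9), 0),  # last label is noise
]
valid = [
    ((0.02, 0.01), 0), ((0.18, 0.12), 0),
    ((0.98, 1.02), 1), ((1.08, 0.92), 1),
]

def knn_predict(point, k):
    """Majority label among the k training points closest to `point`."""
    nearest = sorted(train, key=lambda item: math.dist(point, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def accuracy(samples, k):
    """Fraction of samples whose predicted label matches the given label."""
    hits = sum(knn_predict(point, k) == label for point, label in samples)
    return hits / len(samples)

for k in (3, 1):  # from the simpler model (k=3) to the more complex one (k=1)
    tre, teste = accuracy(train, k), accuracy(valid, k)
    print(f"k={k}: TRE={tre:.3f} TESTE={teste:.3f} E_DIFF={tre - teste:.3f}")
```

On this toy set, moving from `k=3` to `k=1` raises training accuracy to 1.0 while validation accuracy drops, so the gap widens; that is the pattern to look for in the E_DIFF plots.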

#### 6 Does the noise level affect the number of datapoints needed to reach optimal test results?

In [9]:
def number_datapoints_optimalresults(df):
    # Mean validation accuracy (TESTE), averaged over every row of the group.
    # Note: this is the mean, not the maximum, and the spread across folds is not used.
    best_TESTE = df['TESTE'].mean()
    return pd.Series({'valuo': best_TESTE})


def number_datapoints_optimalresults_2(df):
    # Best mean validation accuracy across sample sizes,
    # and the first n_samples value that attains it.
    optimalresults = df['valuo'].max()
    number_datapoints = df.loc[df['valuo'].idxmax(), 'n_samples']
    return pd.Series({'optimalresults': optimalresults, 'number_datapoints': number_datapoints})





# 6 log_reg
data = log_reg_df
print(data.columns)

#type_data_grop=data.groupby(by=['ds_type','noise_levels','n_samples','regularization_value'])['TESTE'].max()
# Mean validation accuracy per (ds_type, noise_levels, n_samples)
type_data_grop = data.groupby(by=['ds_type', 'noise_levels', 'n_samples']).apply(number_datapoints_optimalresults)
type_data_grop = type_data_grop.reset_index()
print(type_data_grop)

# Best score and the sample size that reaches it, per (ds_type, noise_levels)
type_data_grop_2 = type_data_grop.groupby(by=['ds_type', 'noise_levels']).apply(number_datapoints_optimalresults_2)
type_data_grop_2 = type_data_grop_2.reset_index()
print(type_data_grop_2)

#type_data_grop['str_n_samples']=type_data_grop['n_samples'].apply(str)
#fig = px.scatter(type_data_grop, x='noise_levels', y='TESTE',color='str_n_samples',facet_row='ds_type',height=600)
#fig.update_traces(marker=dict(size=12, line=dict(width=2, color='DarkSlateGrey')), selector=dict(mode='markers'))
#fig.show() 

#@widgets.interact
#def plot_provider_success_rate_per_region(n_samples=type_data_grop.n_samples.unique()):
    #data =type_data_grop.loc[type_data_grop['n_samples']==n_samples].reset_index().reset_index()
    #fig = px.scatter(data, x='noise_levels', y='TESTE',facet_row='ds_type',height=600)
    #fig.update_traces(marker=dict(size=12, line=dict(width=2, color='DarkSlateGrey')), selector=dict(mode='markers'))
   # return  fig
    
# 6 knn
data = knn_df
print(data.columns)

# Mean validation accuracy per (ds_type, noise_levels, n_samples)
type_data_grop = data.groupby(by=['ds_type', 'noise_levels', 'n_samples']).apply(number_datapoints_optimalresults)
type_data_grop = type_data_grop.reset_index()
print(type_data_grop)

# Best score and the sample size that reaches it, per (ds_type, noise_levels)
type_data_grop_2 = type_data_grop.groupby(by=['ds_type', 'noise_levels']).apply(number_datapoints_optimalresults_2)
type_data_grop_2 = type_data_grop_2.reset_index()
print(type_data_grop_2)
#type_data_grop=data.groupby(by=['ds_type','noise_levels','n_samples','k'])['TESTE'].mean()
#type_data_grop=type_data_grop.reset_index()   
#type_data_grop['str_n_samples']=type_data_grop['n_samples'].apply(str)
#fig = px.scatter(type_data_grop, x='noise_levels', y='TESTE',color='str_n_samples',facet_row='ds_type',height=600)
#fig.update_traces(marker=dict(size=12, line=dict(width=2, color='DarkSlateGrey')), selector=dict(mode='markers'))
#fig.show() 

#@widgets.interact 
#def plot_provider_success_rate_per_region(n_samples=type_data_grop.n_samples.unique()):
    #data =type_data_grop.loc[type_data_grop['n_samples']==n_samples].reset_index().reset_index()
    #fig = px.scatter(data, x='noise_levels', y='TESTE',facet_row='ds_type',height=600)
    #fig.update_traces(marker=dict(size=12, line=dict(width=2, color='DarkSlateGrey')), selector=dict(mode='markers'))
    #return  fig
    

Index(['Unnamed: 0', 'model', 'ds_type', 'noise_levels', 'n_samples',
       'k_folds', 'regularization_value', 'TRE', 'TESTE', 'E_DIFF'],
      dtype='object')
    ds_type  noise_levels  n_samples     valuo
0   circles           0.0         10  0.342857
1   circles           0.0         50  0.614286
2   circles           0.0        100  0.564286
3   circles           0.0       1000  0.460429
4   circles           0.0      10000  0.468571
5   circles           0.1         10  0.600000
6   circles           0.1         50  0.525714
7   circles           0.1        100  0.521429
8   circles           0.1       1000  0.506429
9   circles           0.1      10000  0.484229
10  circles           0.2         10  0.442857
11  circles           0.2         50  0.457143
12  circles           0.2        100  0.420000
13  circles           0.2       1000  0.471000
14  circles           0.2      10000  0.479443
15  circles           0.3         10  0.657143
16  circles           0.3         50  0.
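
Picking the sample size with the single highest score can be sensitive to small fluctuations. One possible alternative, sketched below, is to take the smallest `n_samples` whose score is within a tolerance of the best one. The numbers are the complete circles rows (noise levels 0.0 to 0.2) copied from the logistic-regression table printed above; the function name `smallest_sufficient_n` and the `tolerance` value of 0.01 are assumptions made for illustration.

```python
# Mean validation accuracy ("valuo") copied from the table printed above
# (ds_type = circles); keys are noise levels, values map n_samples -> score.
valuo = {
    0.0: {10: 0.342857, 50: 0.614286, 100: 0.564286, 1000: 0.460429, 10000: 0.468571},
    0.1: {10: 0.600000, 50: 0.525714, 100: 0.521429, 1000: 0.506429, 10000: 0.484229},
    0.2: {10: 0.442857, 50: 0.457143, 100: 0.420000, 1000: 0.471000, 10000: 0.479443},
}

def smallest_sufficient_n(by_n, tolerance=0.01):
    """Smallest n_samples whose score is within `tolerance` of the best score."""
    best = max(by_n.values())
    return min(n for n, score in by_n.items() if score >= best - tolerance)

needed = {noise: smallest_sufficient_n(by_n) for noise, by_n in valuo.items()}
print(needed)  # {0.0: 50, 0.1: 10, 0.2: 1000}
```

These scores hover around 0.5, so for logistic regression on the circles data the differences between sample sizes may be mostly noise; the same function can be applied to the KNN results, where the scores are more likely to separate.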

