In [0]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from IPython.display import display, HTML
import matplotlib
matplotlib.rcParams.update({'font.size': 12})
from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report


# Classification


## Exercise 

This is perhaps the best known database to be found in the pattern recognition literature. Fisher's paper is a classic in the field and is referenced frequently to this day. (See Duda & Hart, for example.) The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.

Predicted attribute: class of iris plant.
[UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Iris)










### Understanding Data

Attribute Information:

1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class:
    - Iris Setosa
    - Iris Versicolour
    - Iris Virginica

Descriptive analytics

- What questions would you ask to understand the data?
- Which visualization tools would you use?

In [0]:
# import some data to play with
iris = datasets.load_iris()
print(iris.data.shape)  # (number of rows, number of columns or 'features')
print(iris.DESCR)  # description of the dataset

In [0]:
#Using pandas
iris_df=pd.DataFrame(iris.data,columns=iris.feature_names)
iris_df['class']=iris.target
iris_df.plot.box(figsize=(20,10))
iris_df.describe(include='all')

### Preparing the data



In [0]:
# Normalize the data???
X = iris.data[:, :2]  # take only the first two features for this visual example
y = iris.target
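The open question in the cell above ("Normalize the data?") can be explored with a minimal sketch like the one below. It assumes `StandardScaler` inside a scikit-learn pipeline, which is one common choice and not something this notebook prescribes, and it reuses the same hold-out split that appears later in the notebook.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = iris.data[:, :2]  # same two sepal features used in this notebook
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# The pipeline fits the scaler on the training data only and reuses
# the same scaling when predicting on the test data.
scaled_svc = make_pipeline(StandardScaler(), svm.SVC(kernel='linear', C=1))
scaled_svc.fit(X_train, y_train)

# Inspect the scaled training features: zero mean and unit variance per column
X_train_scaled = scaled_svc.named_steps['standardscaler'].transform(X_train)
train_means = X_train_scaled.mean(axis=0)
train_stds = X_train_scaled.std(axis=0)
print('means:', train_means)
print('stds:', train_stds)
print('test accuracy:', scaled_svc.score(X_test, y_test))
```

Wrapping the scaler and the classifier in one pipeline keeps the test data out of the scaling statistics, which matters once cross-validation is used.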

#### Split training and test data

Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting. To avoid it, it is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a test set X_test, y_test. Note that the word “experiment” is not intended to denote academic use only, because even in commercial settings machine learning usually starts out experimentally.

In [0]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
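To see why the held-out set matters, the following sketch compares accuracy on the training data with accuracy on the test data. The unrestricted `DecisionTreeClassifier` is an assumption chosen only because it memorizes training data easily; it is not the model used in the rest of this notebook.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target  # same two sepal features as above

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
print(X_train.shape, X_test.shape)

# A tree with no depth limit can fit the training samples almost perfectly
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)  # score on data the model has seen
test_acc = tree.score(X_test, y_test)     # score on held-out data
print('train accuracy: %0.3f' % train_acc)
print('test accuracy:  %0.3f' % test_acc)
```

The gap between the two numbers is the practical symptom of overfitting described above.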

### Modeling

#### Train the model

Training a model consists of running the optimization that finds the parameters the model keeps (its learned, long-term memory).

In [0]:
# Create the SVM classifier (kept in a list so that several models can be trained in one loop)
# Note: gamma is ignored by the linear kernel
models = [svm.SVC(kernel='linear', C=1, gamma=0.1)]

for model in models:
  model.fit(X_train, y_train)
  # The coefficients
  print( model.get_params())
  print('Intercept',model.intercept_)
  print('Coeff',model.coef_)



Understanding Classification

**This is not a result, only a way to understand classification**

Never draw conclusions from the training data

In [0]:
def visual(svc, X, y, title):
  '''
  Plot the decision regions of a fitted classifier over two features.

  @param X  used to get the min and max values that bound the grid
  @param y  only used to color the points for comparison
  '''
  # create a mesh to plot in
  x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
  y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
  h = (x_max - x_min) / 100  # grid step: 1/100 of the x range
  xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                       np.arange(y_min, y_max, h))
  plt.subplot(1, 1, 1)
  # predict the class of every grid point, then reshape back to the grid
  Z = svc.predict(np.c_[xx.ravel(), yy.ravel()])
  Z = Z.reshape(xx.shape)
  plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
  # plot the data
  plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
  plt.xlabel('Sepal length')
  plt.ylabel('Sepal width')
  plt.xlim(xx.min(), xx.max())
  plt.ylim(yy.min(), yy.max())
  plt.title(title)
  plt.show()

In [0]:
for model in models:
    visual(model,X_train,y_train,'SVC with linear kernel')
    print(classification_report(y_test, model.predict(X_test), target_names=iris.target_names))

## Hyperparameters

Hyper-parameters are parameters that are not directly learnt within estimators. In scikit-learn they are passed as arguments to the constructor of the estimator classes. Typical examples include `C`, `kernel` and `gamma` for the Support Vector Classifier, `alpha` for Lasso, etc. It is possible and recommended to search the hyper-parameter space for the best cross-validation score.

To decide which hyper-parameter values are best, cross-validate on the *train* data with some splitting scheme, for example *k-folds*, and tell the search which [score](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter) to optimize (recall, precision, f1, ...).
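As a minimal sketch of what a cross-validation score is, the example below evaluates one fixed hyper-parameter setting with 5 folds on the training data. The choice of `StratifiedKFold`, the shuffling seed and the `f1_macro` score are assumptions made for illustration.

```python
from sklearn import datasets, svm
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)

iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Split the training data into 5 folds that keep the class proportions
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Each fold is held out once while the model is trained on the other four
fold_scores = cross_val_score(svm.SVC(kernel='linear', C=1),
                              X_train, y_train, cv=cv, scoring='f1_macro')
print('score per fold:', fold_scores)
print('mean %0.3f, std %0.3f' % (fold_scores.mean(), fold_scores.std()))
```

A hyper-parameter search repeats this evaluation for every candidate setting and keeps the one with the best mean score; the test set stays untouched throughout.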

Any parameter provided when constructing an estimator may be optimized in this manner. Specifically, to find the names and current values for all parameters for a given estimator, use:

estimator.get_params()
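For instance, a short sketch with the `SVC` estimator used in this notebook (the variable name `estimator` is only illustrative):

```python
from sklearn import svm

estimator = svm.SVC(kernel='linear', C=1)

# get_params() returns a dict mapping parameter names to current values
params = estimator.get_params()
for name in sorted(params):
    print(name, '=', params[name])

# Any of these names can be changed with set_params() or used as a key
# in the parameter grid of a search
estimator.set_params(C=10)
print('C is now', estimator.get_params()['C'])
```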

Two generic approaches to sampling search candidates are provided in scikit-learn: for given values, [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) exhaustively considers all parameter combinations, while [RandomizedSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html#sklearn.model_selection.RandomizedSearchCV) can sample a given number of candidates from a parameter space with a specified distribution.

[More information](https://scikit-learn.org/stable/modules/grid_search.html)
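The randomized alternative can be sketched as follows. The log-uniform ranges for `C` and `gamma`, the number of candidates (`n_iter=10`) and the random seed are assumptions picked for illustration, not recommended values.

```python
from scipy.stats import loguniform
from sklearn import datasets, svm
from sklearn.model_selection import RandomizedSearchCV, train_test_split

iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Distributions to sample from instead of a fixed grid of values
param_distributions = {
    'kernel': ['rbf'],
    'C': loguniform(1e0, 1e4),
    'gamma': loguniform(1e-4, 1e-1),
}

# Sample 10 candidates and score each one with 5-fold cross-validation
search = RandomizedSearchCV(svm.SVC(), param_distributions, n_iter=10,
                            cv=5, scoring='precision_macro', random_state=0)
search.fit(X_train, y_train)

print('best parameters:', search.best_params_)
print('best cross-validation score: %0.3f' % search.best_score_)
```

Compared with a grid, the cost here is fixed by `n_iter` regardless of how many parameters are searched.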

In [0]:

import warnings
warnings.filterwarnings('ignore') 
from sklearn.model_selection import GridSearchCV


# Set the parameters by cross-validation
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [0.1,1e-3, 1e-4],
                     'C': [1, 10, 100, 1000,10000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]},
                    {'kernel': ['poly'],'degree':[3,5,7], 'C': [1, 10, 100, 1000],'gamma': [1e-3]}
                   ]

scores = ['precision' ] #, 'recall','f1' 

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    clf = GridSearchCV(svm.SVC(), tuned_parameters, cv=5,
                       scoring='%s_macro' % score)
    clf.fit(X_train, y_train)
    print("Best parameters set found on development set:")
    print(clf.best_params_)
    print("Grid scores on development set:")
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params))


    

## Exercise 

- Try comparing different classifiers

- Use all four input variables for classification
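One possible starting point for this exercise is sketched below. The three candidate classifiers and their settings are assumptions chosen for illustration; any other scikit-learn classifier can be added to the dictionary.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X, y = iris.data, iris.target  # all four input variables
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Hypothetical set of candidates to compare
candidates = {
    'SVC (linear)': svm.SVC(kernel='linear', C=1),
    'k-nearest neighbors': KNeighborsClassifier(n_neighbors=5),
    'Gaussian naive Bayes': GaussianNB(),
}

# Compare the candidates with cross-validation on the training data only
results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='f1_macro')
    results[name] = scores.mean()
    print('%-22s %0.3f (+/-%0.03f)' % (name, scores.mean(), scores.std() * 2))
```

The test set is left aside here so that it can still be used once, at the end, to evaluate whichever classifier is selected.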


### Test the Model

Does our model generalize well to other data?

In [0]:
from sklearn.metrics import accuracy_score,median_absolute_error
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.metrics import r2_score,mean_squared_log_error,explained_variance_score


y_true, y_pred = y_test, clf.predict(X_test)
print(classification_report(y_true, y_pred))
visual(clf,X,y,'Best model')
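A confusion matrix on the held-out data is another way to look at the same predictions. The sketch below refits a plain linear `SVC` so that it runs on its own; in the notebook the tuned `clf` from the grid search would take its place.

```python
from sklearn import datasets, svm
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are the true classes, columns the predicted classes
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
print(cm)

# Accuracy is the share of test samples on the diagonal
accuracy = accuracy_score(y_test, y_pred)
print('test accuracy: %0.3f' % accuracy)
```

The off-diagonal cells show which classes are confused with each other, which the summary numbers of `classification_report` do not reveal directly.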

### Implementation


How is it going to work inside the process and the organization?

## Exercise (Fasecolda database)

Based on the initial understanding of the Fasecolda data (exercise 1):

- Which input variables would be best for classification, and why?
- What other sources of information would you use to improve the prediction?

- Which transformations do the data require?

- Which classification exercise would you carry out with the vehicle data published by Fasecolda?


- Which visualization or result-presentation techniques would you apply?










In [0]:
# Load the CSV with pandas (here from a local file path)
import pandas as pd
from IPython.display import display, HTML
from sklearn import preprocessing
from sklearn.naive_bayes import GaussianNB

data = pd.read_csv('guia_fasecolda.csv')

## Present your conclusions about classifiers


Uploading the notebook to GitHub is recommended
