In [0]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from IPython.display import display, HTML
import matplotlib
matplotlib.rcParams.update({'font.size': 12})
# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older release
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

#  Linear Regression


## Exercise 

This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. It was obtained from the StatLib archive (http://lib.stat.cmu.edu/datasets/boston), and has been used extensively throughout the literature to benchmark algorithms. However, these comparisons were primarily done outside of Delve and are thus somewhat suspect. The dataset is small, with only 506 cases.

Use [LinearRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)





### Understanding Data

- CRIM - per capita crime rate by town
- ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS - proportion of non-retail business acres per town.
- CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
- NOX - nitric oxides concentration (parts per 10 million)
- RM - average number of rooms per dwelling
- AGE - proportion of owner-occupied units built prior to 1940
- DIS - weighted distances to five Boston employment centres
- RAD - index of accessibility to radial highways
- TAX - full-value property-tax rate per US\$10,000
- PTRATIO - pupil-teacher ratio by town
- B - $1000(Bk - 0.63)^2$ where Bk is the proportion of Black residents by town
- LSTAT - % lower status of the population
- MEDV - Median value of owner-occupied homes in US\$1000's

Descriptive analytics

- What questions would you ask to understand the data?
- What visualization tools to use?

In [0]:

boston=load_boston()
boston_df=pd.DataFrame(boston.data,columns=boston.feature_names)

print(boston.data.shape) #get (number of rows, number of columns or 'features')
print(boston.DESCR) #get a description of the dataset
boston_df.describe()

In [0]:
boston_df.plot.box(figsize=(20,10))

### Preparing the data



In [0]:
# add another column with the house prices, which scikit-learn datasets expose as the target
boston_df['Price']=boston.target
boston_df.head(3)

#### Split training and test data

Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting. To avoid it, it is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a test set X_test, y_test. Note that the word “experiment” is not intended to denote academic use only, because even in commercial settings machine learning usually starts out experimentally.

In [0]:

# split training and test data
X=boston_df.drop('Price',axis=1)
y=boston_df['Price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1212)
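
The overfitting argument above can be made concrete with a small sketch. It does not use the Boston data: the synthetic dataset from `make_regression` and the choice of a 1-nearest-neighbour regressor as the "model that just repeats the labels" are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (an assumption; not the Boston dataset).
X_demo, y_demo = make_regression(n_samples=300, n_features=5,
                                 noise=25.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.25, random_state=0)

# A 1-nearest-neighbour regressor repeats the label of the closest
# training sample, so it memorizes the training set.
memorizer = KNeighborsRegressor(n_neighbors=1).fit(X_tr, y_tr)

train_r2 = memorizer.score(X_tr, y_tr)  # scored on data it has already seen
test_r2 = memorizer.score(X_te, y_te)   # scored on held-out data

print('R^2 on training data: {:.3f}'.format(train_r2))
print('R^2 on held-out data: {:.3f}'.format(test_r2))
```

The training score looks perfect only because the model is evaluated on samples it has memorized. The held-out score is the one that says something about unseen data.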

### Modeling

#### Train the model

Training a model consists of solving an optimization problem to obtain the model's learned parameters (its long-term memory), here the coefficients and the intercept.

In [0]:
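# Create the regressor objects (kept in a list so they can all be trained in one loop)
# max_iter must be an integer; 1000000 is the same value as the original 10e5
models = [LinearRegression(),
          Ridge(alpha=0.01),
          Ridge(alpha=100),
          Lasso(alpha=0.001, max_iter=1000000),
          Lasso(alpha=0.7, max_iter=1000000)]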


for regr in models:
  regr.fit(X_train, y_train)
  # The coefficients
  print('Coefficients: ', regr.coef_)
  print('Intercept: ', regr.intercept_)
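
To illustrate what "training as optimization" means, here is a minimal sketch for the Ridge case. The ridge objective `||y - Xw||^2 + alpha * ||w||^2` has a closed-form minimizer, which should agree with what `Ridge.fit` returns. The synthetic data and the names `w_closed` and `b_closed` are illustrative assumptions, not part of the exercise.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic stand-in data (an assumption; not the Boston dataset).
X_demo, y_demo = make_regression(n_samples=200, n_features=4,
                                 noise=10.0, random_state=0)
alpha = 10.0

# Centre the data so that the intercept is not penalized.
X_mean, y_mean = X_demo.mean(axis=0), y_demo.mean()
Xc, yc = X_demo - X_mean, y_demo - y_mean

# Minimizing ||yc - Xc w||^2 + alpha * ||w||^2 gives the linear system
# (Xc^T Xc + alpha * I) w = Xc^T yc
w_closed = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(Xc.shape[1]), Xc.T @ yc)
b_closed = y_mean - X_mean @ w_closed

ridge = Ridge(alpha=alpha).fit(X_demo, y_demo)
print('closed form :', w_closed, b_closed)
print('Ridge.fit   :', ridge.coef_, ridge.intercept_)
```

Lasso has no closed form of this kind and is solved iteratively, which is why the Lasso models take a `max_iter` argument.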



Understanding the difference

In [0]:
names=['Linear Regression',r'Ridge; $\alpha = 0.01$',r'Ridge; $\alpha = 100$',r'Lasso, $\alpha = 0.001$',r'Lasso, $\alpha = 0.7$']
markers=['d','o','*','+','<','>']
plt.figure(figsize=(20,10))
for i in range(0,len(names)):
  plt.plot(X.columns,models[i].coef_,alpha=0.7,linestyle='none',marker=markers[i],markersize=5,label=names[i])
plt.xlabel('feature',fontsize=16)
plt.ylabel('Coefficient Magnitude',fontsize=16)
plt.legend(fontsize=13,loc=1)
plt.show()
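
The difference between the two penalties can also be checked numerically. The sketch below uses synthetic data in which only 3 of 10 features influence the target. Both the data and the alpha values are assumptions chosen for illustration. It counts the coefficients that each model sets to exactly zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Synthetic data (an assumption): only 3 of the 10 features drive the target.
X_demo, y_demo = make_regression(n_samples=200, n_features=10, n_informative=3,
                                 noise=10.0, random_state=0)

ols = LinearRegression().fit(X_demo, y_demo)
ridge = Ridge(alpha=100).fit(X_demo, y_demo)
lasso = Lasso(alpha=5.0).fit(X_demo, y_demo)

# Count coefficients that are exactly zero in each fitted model.
n_zero_ols = int(np.sum(ols.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))

print('zero coefficients  OLS: {}  Ridge: {}  Lasso: {}'.format(
    n_zero_ols, n_zero_ridge, n_zero_lasso))
print('coefficient norm   OLS: {:.2f}  Ridge: {:.2f}'.format(
    np.linalg.norm(ols.coef_), np.linalg.norm(ridge.coef_)))
```

Ridge shrinks all coefficients towards zero without eliminating any, while Lasso removes some features entirely, which is why it is often used for feature selection.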


## Which is the best model?

When evaluating different settings (“hyperparameters”) for estimators, or different models, there is a risk of overfitting on the test set, because the model selection can be tweaked until the estimator performs optimally on it. This way, knowledge about the test set can “leak” into the model and evaluation metrics no longer report on generalization performance. To solve this problem, yet another part of the dataset can be held out as a so-called “validation set”: training proceeds on the training set, after which evaluation is done on the validation set, and when the experiment seems to be successful, final evaluation can be done on the test set.

However, by partitioning the available data into three sets, we drastically reduce the number of samples which can be used for learning the model, and the results can depend on a particular random choice for the pair of (train, validation) sets.

A solution to this problem is a procedure called cross-validation (CV for short). A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches exist, but generally follow the same principles). For each of the k folds, a model is trained on the other k − 1 folds and validated on the remaining one; the reported score is the average over the k folds.
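
The k-fold procedure can be written out by hand to see what a cross-validation helper does internally. This is only a sketch: the synthetic data, the Ridge model and k = 5 are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

# Synthetic stand-in for a training set (an assumption).
X_demo, y_demo = make_regression(n_samples=100, n_features=5,
                                 noise=10.0, random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
times_validated = np.zeros(len(y_demo), dtype=int)

for train_idx, val_idx in kfold.split(X_demo):
    # Train on k - 1 folds, validate on the remaining fold.
    model = Ridge(alpha=1.0).fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(model.score(X_demo[val_idx], y_demo[val_idx]))
    times_validated[val_idx] += 1

print('R^2 per fold:', np.round(fold_scores, 3))
print('mean R^2: {:.3f}  std: {:.3f}'.format(np.mean(fold_scores),
                                             np.std(fold_scores)))
```

Every sample is used for validation exactly once and for training k − 1 times, so no data is permanently set aside as a validation set.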

Hyper-parameters are parameters that are not directly learnt within estimators. In scikit-learn they are passed as arguments to the constructor of the estimator classes. Typical examples include alpha for Ridge and alpha for Lasso. It is possible and recommended to search the hyper-parameter space for the best cross-validation score.

Any parameter provided when constructing an estimator may be optimized in this manner. Specifically, to find the names and current values for all parameters for a given estimator, use:

estimator.get_params()
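
As a small illustration, here is `get_params()` together with its counterpart `set_params()` on a Ridge estimator. The alpha values are arbitrary.

```python
from sklearn.linear_model import Ridge

estimator = Ridge(alpha=0.5)

# Dictionary mapping every constructor argument to its current value.
params = estimator.get_params()
for key in sorted(params):
    print('{0:15} = {1}'.format(key, params[key]))

# Any of these names can be changed later, or searched over.
estimator.set_params(alpha=2.0)
```

The keys of this dictionary are the names that a hyper-parameter search expects in its parameter grid.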

Two generic approaches to sampling search candidates are provided in scikit-learn: for given values, [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) exhaustively considers all parameter combinations, while [RandomizedSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html#sklearn.model_selection.RandomizedSearchCV) can sample a given number of candidates from a parameter space with a specified distribution.

[More information](https://scikit-learn.org/stable/modules/grid_search.html)
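
A minimal sketch of an exhaustive search with GridSearchCV follows. The synthetic data and the grid of alpha values are assumptions. In the exercise, the search would be run on `X_train` and `y_train` instead.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption; not the Boston dataset).
X_demo, y_demo = make_regression(n_samples=300, n_features=8,
                                 noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.25, random_state=0)

# Candidate values for the regularization strength (illustrative grid).
param_grid = {'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}

# 5-fold cross-validation on the training set for every candidate.
search = GridSearchCV(Ridge(), param_grid, scoring='r2', cv=5)
search.fit(X_tr, y_tr)

print('best parameters:', search.best_params_)
print('best CV R^2    : {:.3f}'.format(search.best_score_))

# The held-out test set is used only once, for the final evaluation.
test_r2 = search.score(X_te, y_te)
print('test R^2       : {:.3f}'.format(test_r2))
```

By default GridSearchCV refits the best candidate on the whole training set, so `search` can be used directly for prediction afterwards.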

In [0]:

from sklearn.model_selection import cross_val_score

#from https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
scoring =['explained_variance',
          'neg_mean_absolute_error',
          'neg_mean_squared_error',
          'neg_mean_squared_log_error',
          'neg_median_absolute_error',
          'r2']

for reg,name in zip(models,names):
    scores = cross_val_score(reg, X_train, y_train, scoring=scoring[5],
                            cv=5)
    print('--------------------------------------')
    print('model {0:20} | score {1:20}'.format(name,scoring[5]))
    print('mean {0:22.2f}| std   {1:<22.2f}'.format(scores.mean(),scores.std()))
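
The cell above scores each model with a single metric. When several metrics are of interest, `cross_validate` accepts a list of scorer names and returns one array per metric. The sketch below assumes synthetic data and a single Lasso model. `neg_mean_squared_log_error` is left out because it requires non-negative targets and predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (an assumption; not the Boston dataset).
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 noise=15.0, random_state=0)

metrics = ['r2', 'neg_mean_absolute_error', 'neg_mean_squared_error']
results = cross_validate(Lasso(alpha=0.1), X_demo, y_demo,
                         scoring=metrics, cv=5)

# Results are stored under the key 'test_<scorer name>'.
for metric in metrics:
    fold_values = results['test_' + metric]
    print('{0:28} | mean {1:10.2f} | std {2:8.2f}'.format(
        metric, fold_values.mean(), fold_values.std()))
```

The error metrics are negated so that, for every scorer, a higher value means a better model.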

## Exercise (try other models)

Try to use [Elastic-NET](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html), [RANSAC](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RANSACRegressor.html) and [MLPRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html)
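
One possible starting point for this exercise is sketched below with synthetic data. The hyper-parameter values are arbitrary assumptions, and MLPRegressor could be added to the dictionary in the same way, ideally after scaling the features.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, RANSACRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; not the Boston dataset).
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 noise=15.0, random_state=0)

# Candidate models with illustrative, untuned settings.
candidates = {
    'ElasticNet': ElasticNet(alpha=0.1, l1_ratio=0.5),
    'RANSAC': RANSACRegressor(random_state=0),
}

cv_means = {}
for label, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, scoring='r2', cv=5)
    cv_means[label] = scores.mean()
    print('{0:12} | mean R^2 {1:6.3f} | std {2:6.3f}'.format(
        label, scores.mean(), scores.std()))
```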


### Test the Model

Does our model generalize well to data it has not seen?

In [0]:
# accuracy_score is a classification metric and is not used here, so it is not imported
from sklearn.metrics import median_absolute_error
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.metrics import r2_score,mean_squared_log_error,explained_variance_score


def test_model(X,y_test,regr):
  # prediction
  y_pred=regr.predict(X)
  # plot: predicted values on the x axis, actual values on the y axis
  plt.scatter(y_pred, y_test,  color='black')
  plt.xlabel('Predicted value')
  plt.ylabel('Actual value')
  # raw string so that \hat is passed to matplotlib unchanged
  plt.title(r"Predicted Price vs Actual Price: $\hat{Y}_i$ vs $Y_i$")
  plt.grid(True)
  # https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics
  print('{0:30} | {1:9.2f}'.format('regressor score', regr.score(X,y_test)))
  print('{0:30} | {1:9.2f}'.format('Mean squared error', np.mean((y_pred - y_test) ** 2)))
  print('{0:30} | {1:9.2f}'.format('mean_squared_error', mean_squared_error(y_test,y_pred)))
  print('{0:30} | {1:9.2f}'.format('mean_absolute_error', mean_absolute_error(y_test,y_pred)))
  print('{0:30} | {1:9.2f}'.format('median_absolute_error', median_absolute_error(y_test,y_pred)))
  print('{0:30} | {1:9.2f}'.format('explained_variance_score', explained_variance_score(y_test,y_pred)))
  print('{0:30} | {1:9.2f}'.format('r ^ 2 score', r2_score(y_test,y_pred)))

models[0].fit(X_train,y_train)  
test_model(X_test,y_test,models[0])
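
To make the printed metrics easier to interpret, the sketch below recomputes three of them by hand on a tiny made-up example and compares the results with the scikit-learn functions. The four values in `y_true` and `y_hat` are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Tiny made-up example (an assumption, not model output).
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true - y_hat

# Mean squared error: average of the squared residuals.
mse_manual = np.mean(errors ** 2)

# Mean absolute error: average of the absolute residuals.
mae_manual = np.mean(np.abs(errors))

# R^2: 1 - (residual sum of squares) / (total sum of squares).
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print('MSE {:.4f}  MAE {:.4f}  R^2 {:.4f}'.format(mse_manual, mae_manual, r2_manual))
```

Note that the scikit-learn metric functions take the true values first and the predictions second.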

### Implementation


How will the model work inside the process and the organization?

## Exercise (Fasecolda database)

Based on the initial understanding of the Fasecolda data (exercise 1):

- Which input variables would be the best for the regression, and why?
- What other sources of information would you use to improve the prediction?

- What transformations do the data require?

- What regression exercise would you carry out with the vehicle data provided by Fasecolda?

- Would a Lasso regression be useful? Why?

- Which visualization or result-presentation techniques would you apply?










In [0]:
# Load the CSV with pandas (a local file here; pd.read_csv also accepts a URL)
import pandas as pd
from IPython.display import display, HTML

data = pd.read_csv('guia_fasecolda.csv')

## Present your conclusions about the regressions



Uploading the notebook to GitHub is recommended.
