# Intro to sklearn

## Diabetes dataset

__Import the diabetes dataset from sklearn. Describe it.__

In [55]:
from sklearn.datasets import load_diabetes
dataset = load_diabetes()
X = dataset['data']
y = dataset['target']
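
One possible way to describe the dataset is to look at its dimensions, its feature names, and the bundled description. This is only a sketch of the kind of inspection the exercise has in mind, using the dictionary-style keys already used above:

```python
from sklearn.datasets import load_diabetes

dataset = load_diabetes()
X = dataset["data"]
y = dataset["target"]

# Number of observations and number of features.
print(X.shape, y.shape)

# Names of the baseline variables (one per column of X).
print(dataset["feature_names"])

# The bundled description documents the variables; print the beginning only.
print(dataset["DESCR"][:500])
```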

__Split the dataset into a training set (70%) and a test set (30%)__
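
A minimal sketch of the 70/30 split using `train_test_split`. The fixed `random_state=0` is an assumption added here so that the split is reproducible; the exercise does not prescribe a seed.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)

# Keep 30% of the observations for testing; the rest is used for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

print(X_train.shape, X_test.shape)
```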

__Train a linear model (with intercept) on the training set__

__Compute the fitting score on the test set. (Bonus: compare with your own computation of $R^2$)__
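
The two steps above could be sketched as follows. For a regressor, `score` returns the coefficient of determination, which can be checked by hand with $R^2 = 1 - SS_{res}/SS_{tot}$, where $SS_{res}$ is the sum of squared prediction errors and $SS_{tot}$ is the sum of squared deviations of the test targets from their mean. The seed and variable names are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Linear model with an intercept (fit_intercept=True is the default).
model = LinearRegression(fit_intercept=True)
model.fit(X_train, y_train)

# R^2 as reported by sklearn on the test set.
r2_sklearn = model.score(X_test, y_test)

# The same quantity computed by hand.
y_pred = model.predict(X_test)
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(r2_sklearn, r2_manual)
```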

__Should we adjust the size of the test set? What would be the problem?__
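
One way to explore this question empirically is to repeat the split for several test-set sizes and look at how much the test score moves from one random split to the next. The sizes `(0.1, 0.3, 0.5)` and the 20 repetitions are arbitrary choices for this sketch, not part of the exercise.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)

spread = {}
for test_size in (0.1, 0.3, 0.5):
    scores = []
    # Repeat the split with different seeds to see how stable the score is.
    for seed in range(20):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        scores.append(LinearRegression().fit(X_tr, y_tr).score(X_te, y_te))
    # Store the mean and standard deviation of R^2 across the repeated splits.
    spread[test_size] = (np.mean(scores), np.std(scores))
    print(test_size, spread[test_size])
```

The trade-off to look for: a smaller test set leaves more data for training but gives a noisier estimate of out-of-sample performance, while a larger test set does the opposite.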

__Implement $k$-fold cross-validation with $k=3$.__
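
A compact sketch using `cross_val_score`, which fits the model on each training fold and scores it on the held-out fold. Shuffling with a fixed seed is an assumption added here; the exercise does not say whether the folds should be shuffled.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)

# Three folds: each observation is used for testing exactly once.
kf = KFold(n_splits=3, shuffle=True, random_state=0)

# One R^2 value per fold.
scores = cross_val_score(LinearRegression(), X, y, cv=kf)

print(scores, scores.mean())
```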

__Bonus: use `statsmodels` (or `linearmodels`) to estimate the same linear model on the full sample. Is it always a superior method?__

## Sparse regressions on the Boston House Price Dataset


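__Import the Boston House Price Dataset from sklearn. Describe it. Compute correlations.__ (Note: `load_boston` was removed in scikit-learn 1.2, so this exercise requires an older version or another source for the data.)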

__Split the dataset into a training set (70%) and a test set (30%).__

__Train a lasso model to predict house prices. Compute the score on the test set.__

__Train a ridge model to predict house prices. Which one is better?__

__(bonus) Use statsmodels to build a model predicting house prices. What is the problem?__

## Predicting Breast Cancer

Sklearn includes the Wisconsin breast cancer database. It associates a medical outcome (benign or malignant) with several measured characteristics of each observed tumor. Can a machine learn to predict whether a tumor is benign or malignant?

__Import the Breast Cancer Dataset from sklearn. Describe it.__

In [1]:
from sklearn.datasets import load_breast_cancer
dataset = load_breast_cancer()

In [3]:
print( dataset['DESCR'] )

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radi

In [11]:
X = dataset['data']
y = dataset['target']

In [12]:
X.shape

(569, 30)

In [13]:
y.shape

(569,)

Let's normalize the dataset:

In [14]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X = sc.fit_transform(X)
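# Caveat: the scaler is fitted here on the full dataset, before any train/test
# split, so test observations influence the scaling. Fitting on the training
# set only and then calling sc.transform(X_test) avoids this.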
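
The commented-out line above hints at the usual pattern: split first, fit the scaler on the training part only, then apply the same transformation to the test part. A minimal sketch, where the `_raw` and `_scaled` names and the seed are assumptions for this illustration:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_raw, y = load_breast_cancer(return_X_y=True)
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X_raw, y, test_size=0.1, random_state=0
)

sc = StandardScaler()
# Means and standard deviations are estimated on the training set only.
X_train_scaled = sc.fit_transform(X_train_raw)
# The test set is transformed with the training-set statistics.
X_test_scaled = sc.transform(X_test_raw)

# Training columns are centered; test columns are only approximately so.
print(np.abs(X_train_scaled.mean(axis=0)).max())
print(np.abs(X_test_scaled.mean(axis=0)).max())
```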

__Properly train a logistic regression to predict whether a tumor is benign or malignant. (bonus: use k-fold validation)__

Split the dataset, keeping 10% of all observations for test.

In [20]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.1)

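We create the model. Note that `LinearRegression` is a regressor rather than the logistic regression requested above: it fits the 0/1 labels by least squares, so `score` below returns $R^2$, not classification accuracy.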

In [21]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()

In [22]:
model.fit(X_train, y_train)

LinearRegression()

Compute the score (for a regressor, this is $R^2$ on the test set):

In [23]:
model.score(X_test, y_test)

0.7523303435448611
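
The exercise asks for a logistic regression, whereas the cells above fit `LinearRegression`. A sketch of the classifier version follows. The pipeline, the stratified split, `max_iter=1000`, and the seed are choices made for this illustration rather than anything prescribed by the notebook. With a classifier, `score` returns accuracy (the share of correctly classified test observations).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Keep 10% for testing; stratify so both classes appear in similar proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0
)

# The pipeline fits the scaler on the training data only, then the classifier.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

# Accuracy on the held-out observations.
accuracy = clf.score(X_test, y_test)
print(accuracy)
```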

Let's do $k$-fold cross-validation with $k=3$. Note that `KFold` does not shuffle by default, so the folds are consecutive blocks of rows.

In [26]:
from sklearn.model_selection import KFold
kf = KFold(n_splits=3)

scores = []

for train_index, test_index in kf.split(X):
    
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    model = LinearRegression()
    model.fit(X_train, y_train)
    
    score = model.score(X_test, y_test)
    
    print(score)
    scores.append(score)

0.6827469154210248
0.7411274933586149
0.685526793901217


In [27]:
scores

[0.6827469154210248, 0.7411274933586149, 0.685526793901217]
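
For the classification version of the bonus, a possible sketch combines a scaling-plus-logistic-regression pipeline with `StratifiedKFold`, which keeps the class proportions similar across folds. Because the scaler sits inside the pipeline, it is refitted on each training fold. The shuffling and seed are assumptions for this illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Three stratified folds, shuffled once with a fixed seed.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# One accuracy value per fold.
scores_logit = cross_val_score(clf, X, y, cv=cv)
print(scores_logit, scores_logit.mean())
```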

__Try with other classifiers. Which one is best?__

In [38]:
from sklearn.model_selection import KFold
kf = KFold(n_splits=3)

from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(random_state=0)

scores_dt = []

for train_index, test_index in kf.split(X):
    
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    model.fit(X_train, y_train)
    
    score = model.score(X_test, y_test)
    
    print(score)
    scores_dt.append(score)

0.8842105263157894
0.9421052631578948
0.9047619047619048


In [39]:
from sklearn.model_selection import KFold
kf = KFold(n_splits=3)

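from sklearn.linear_model import Ridge
# Ridge is a regressor, not a classifier: model.score below returns R^2,
# not classification accuracy.
model = Ridge(random_state=0)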

scores_ridge = []

for train_index, test_index in kf.split(X):
    
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    model.fit(X_train, y_train)
    
    score = model.score(X_test, y_test)
    
    print(score)
    scores_ridge.append(score)

0.6775479623518907
0.7498239109662781
0.6853304427611355


In [40]:
from sklearn.model_selection import KFold
kf = KFold(n_splits=3)

from sklearn.svm import SVC
model = SVC(random_state=0)

scores_sv = []

for train_index, test_index in kf.split(X):
    
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    model.fit(X_train, y_train)
    
    score = model.score(X_test, y_test)
    
    print(score)
    scores_sv.append(score)

0.9368421052631579
0.9842105263157894
0.9788359788359788


In [41]:
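# The linear and ridge scores are R^2 values, while the svc and decision tree
# scores are accuracies, so the two groups are not directly comparable.
print("Results:")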
print("linear: ", scores)
print("ridge: ", scores_ridge)
print("svc: ", scores_sv)
print("decision tree: ", scores_dt)

Results:
linear:  [0.6827469154210248, 0.7411274933586149, 0.685526793901217]
ridge:  [0.6775479623518907, 0.7498239109662781, 0.6853304427611355]
svc:  [0.9368421052631579, 0.9842105263157894, 0.9788359788359788]
decision tree:  [0.8842105263157894, 0.9421052631578948, 0.9047619047619048]
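
To compare like with like, one option is to evaluate only classifiers, on the same folds, and to summarize each by its mean accuracy. The sketch below is one such comparison. The choice of models, the stratified folds, and the seeds are assumptions for this illustration, and the numbers it prints will differ from the outputs shown above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Same folds for every model, so the accuracies are comparable.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "svc": SVC(random_state=0),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

mean_accuracy = {}
for name, estimator in models.items():
    # Scaling is part of the pipeline, so it is refitted on each training fold.
    pipeline = make_pipeline(StandardScaler(), estimator)
    fold_scores = cross_val_score(pipeline, X, y, cv=cv)
    mean_accuracy[name] = fold_scores.mean()
    print(name, fold_scores, mean_accuracy[name])
```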
