# 03 - Beware Overfitting

## **Overfitting**
Overfitting occurs when a model fits its training data too closely, capturing noise along with the real patterns, so it generalizes poorly to unseen data.
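
A rough way to spot overfitting is to compare accuracy on the training data with accuracy on held-out data. The sketch below does this on the breast cancer dataset used later in this notebook. The unconstrained decision tree is an assumption swapped in purely for illustration, because it memorizes training data easily; it is not part of this notebook's workflow.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# No depth limit: the tree keeps splitting until every leaf is pure
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)

print('Training accuracy:', round(train_acc, 3))
print('Test accuracy:', round(test_acc, 3))
print('Gap:', round(train_acc - test_acc, 3))
```

A large gap between the two scores suggests the model has learned details specific to the training set.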

## **Preventing Overfitting** 

**Regularization:** Adding a penalty on large coefficients to the model's loss function. The penalty reduces the weight of less informative features, or eliminates it entirely.

Two common types of regularization are:
+ **Lasso or L1 regularization**
+ **Ridge or L2 regularization**
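
Written out, the two penalties differ only in how they measure the size of the coefficients. In the sketch below, $L(w)$ stands for the unpenalized loss, $w_j$ for the coefficient of feature $j$, and $\lambda$ for the penalty strength; these symbols are notation assumed here, not taken from this notebook.

```latex
J_{\text{L1}}(w) = L(w) + \lambda \sum_{j} |w_j|
\qquad
J_{\text{L2}}(w) = L(w) + \lambda \sum_{j} w_j^{2}
```

In scikit-learn's `LogisticRegression`, the `C` parameter is the inverse of regularization strength, so a smaller `C` plays the role of a larger $\lambda$.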


Both penalties shrink the model's coefficients toward zero, but only L1 can drive some coefficients to exactly zero, which removes those features from the model entirely.
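
The difference can be checked by fitting the same model under each penalty and counting the coefficients that end up at exactly zero. This is a minimal sketch on the breast cancer dataset; the value `C=0.1` and the variable names are assumptions for illustration, and the whole dataset is scaled at once because only the coefficients are inspected here.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Same data, same C, same solver; only the penalty type differs
l1_model = LogisticRegression(penalty='l1', C=0.1, solver='liblinear',
                              random_state=0).fit(X_scaled, y)
l2_model = LogisticRegression(penalty='l2', C=0.1, solver='liblinear',
                              random_state=0).fit(X_scaled, y)

# Count coefficients that are exactly zero under each penalty
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))

print('Zeroed coefficients with L1:', n_zero_l1, 'of', X.shape[1])
print('Zeroed coefficients with L2:', n_zero_l2, 'of', X.shape[1])
```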

------

# Data

In [2]:
import pandas as pd
from sklearn.datasets import load_breast_cancer

In [4]:
cancer = load_breast_cancer()

In [6]:
cancer.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])

In [16]:
# Features go into a DataFrame; the label is added as 'target' (0 = malignant, 1 = benign)
df = pd.DataFrame(cancer['data'], columns=cancer['feature_names'])
df['target'] = pd.Series(cancer['target'])

In [17]:
df.head(2)

|   | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | 0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | 0 |
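
The loader can also build this DataFrame directly. The sketch below assumes a scikit-learn version that supports `as_frame=True`, which the `'frame'` key in the output of `cancer.keys()` suggests; `label_names` is a helper name introduced here for illustration.

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True returns pandas objects; .frame holds the features plus a 'target' column
cancer = load_breast_cancer(as_frame=True)
df = cancer.frame

# target_names is ordered by label value: 0 -> 'malignant', 1 -> 'benign'
label_names = dict(enumerate(cancer.target_names))

print(df.shape)
print(label_names)
```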


--------

# Modelling

**Example: L1 regularization in a Logistic Regression model.**

### Train/Test Split

In [12]:
from sklearn.model_selection import train_test_split

In [33]:
X = df.drop('target', axis=1)
y = df['target']

In [34]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# stratify=y  (optional argument: keeps the class proportions the same in both splits)
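
The split above does not stratify on the label. As a sketch of what passing `stratify=y` would change, the code below compares the share of class 1 in the full dataset with its share in each split. The variable names are assumptions for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# stratify=y asks for (approximately) the same class balance in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# The labels are 0/1, so the mean is the fraction of class 1
full_ratio = y.mean()
train_ratio = y_train.mean()
test_ratio = y_test.mean()

print('Share of class 1 (full): ', round(full_ratio, 3))
print('Share of class 1 (train):', round(train_ratio, 3))
print('Share of class 1 (test): ', round(test_ratio, 3))
```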

### Scale features

In [35]:
from sklearn.preprocessing import StandardScaler

In [36]:
scaler = StandardScaler()

In [37]:
scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test = scaler.transform(X_test)
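
The scaler is fitted on the training split only and then reused on the test split, so no test-set statistics leak into training. A quick sanity check, sketched below with the same split settings, is that each training column ends up with mean close to 0 and standard deviation close to 1, while the test columns only come close to those values.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

scaler = StandardScaler()
scaled_X_train = scaler.fit_transform(X_train)  # learns mean/std from the training data
scaled_X_test = scaler.transform(X_test)        # reuses the training mean/std

train_means = scaled_X_train.mean(axis=0)
train_stds = scaled_X_train.std(axis=0)
test_means = scaled_X_test.mean(axis=0)

print('Largest |mean| in train columns:', np.abs(train_means).max())
print('Largest |std - 1| in train columns:', np.abs(train_stds - 1).max())
print('Largest |mean| in test columns:', np.abs(test_means).max())
```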

### Training Model

In [38]:
from sklearn.linear_model import LogisticRegression

In [42]:
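# C is the inverse of regularization strength: a smaller C means a stronger penalty.
# The liblinear solver is used because it supports the L1 penalty.
lr = LogisticRegression(penalty='l1',
                        C=0.1,
                        solver='liblinear',
                        multi_class='ovr')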

In [43]:
lr.fit(scaled_X_train, y_train)

LogisticRegression(C=0.1, multi_class='ovr', penalty='l1', solver='liblinear')

In [44]:
print('Training accuracy:', round(lr.score(scaled_X_train, y_train),3))
print('Test accuracy:', round(lr.score(scaled_X_test, y_test),3))

Training accuracy: 0.977
Test accuracy: 0.959
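
To see how the penalty strength affects both sparsity and accuracy, the sketch below refits the same kind of L1-penalized model for several values of `C` and records how many coefficients stay non-zero. The grid of `C` values and the `results` list are assumptions chosen for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

scaler = StandardScaler()
scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test = scaler.transform(X_test)

results = []
for C in (0.01, 0.1, 1.0, 10.0):
    model = LogisticRegression(penalty='l1', C=C, solver='liblinear',
                               random_state=0)
    model.fit(scaled_X_train, y_train)
    n_nonzero = int(np.sum(model.coef_ != 0))  # features the model still uses
    results.append((C,
                    n_nonzero,
                    model.score(scaled_X_train, y_train),
                    model.score(scaled_X_test, y_test)))

for C, n_nonzero, train_acc, test_acc in results:
    print(f'C={C}: {n_nonzero} non-zero coefficients, '
          f'train={train_acc:.3f}, test={test_acc:.3f}')
```

Smaller values of `C` apply a stronger penalty and tend to keep fewer features. Comparing the training and test columns across the grid is one way to choose a value that does not overfit.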
