# Supervised Learning

Requirements:
1. No missing values
2. Data in numeric format
3. Data in pandas.DataFrame or numpy.ndarray

## Preprocess Data

#### Formatting Data

Because the data must be numeric, categorical columns have to be converted first.

For example, a 'genre' column of an album list may contain the values 'rock', 'r&b', 'pop'...

One approach is to add the columns ['rock'], ['r&b'], ['pop'] holding 0 or 1 to indicate False or True (one-hot encoding).

pandas has a built-in function for this, `get_dummies()`:
```python
import pandas as pd

# encode 'genre' as 0/1 columns and keep the other columns of df
dummies = pd.get_dummies(df, columns=['genre'])

# 'observation_col' is the target column in df
x = dummies.drop('observation_col', axis=1).values
y = dummies['observation_col'].values
```

#### Impute

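Missing values must be removed or filled in (imputed) before fitting. `SimpleImputer` replaces them with a column statistic; `most_frequent` suits categorical columns. Fit the imputer on the training set only, then reuse it on the test set.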

```python
from sklearn.impute import SimpleImputer

imp_cat = SimpleImputer(strategy = 'most_frequent')
x_train_cat = imp_cat.fit_transform(x_train_cat)
x_test_cat = imp_cat.transform(x_test_cat)
```
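As a small illustration of what the imputer does, here is a sketch on a made-up numeric array. The name `x_demo` and its values are hypothetical, and the `mean` strategy is used because the columns are numeric.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric data with one missing value per column
x_demo = np.array([[1.0, 10.0],
                   [np.nan, 20.0],
                   [3.0, np.nan]])

imp_num = SimpleImputer(strategy='mean')
x_filled = imp_num.fit_transform(x_demo)

# statistics_ holds the per-column means learned from the non-missing values
print(imp_num.statistics_)
print(x_filled)
```

Each `NaN` is replaced by the mean of the remaining values in its column, so the result contains no missing values.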

#### Standardize

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

```
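To see what standardizing does, the sketch below scales randomly generated data (the `*_demo` arrays are hypothetical) and checks that the training columns end up with mean 0 and standard deviation 1. The test set is transformed with the statistics learned from the training set, so its mean and standard deviation will only be close to 0 and 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features centred near 50 with spread near 10
rng = np.random.default_rng(0)
x_train_demo = rng.normal(loc=50, scale=10, size=(100, 2))
x_test_demo = rng.normal(loc=50, scale=10, size=(20, 2))

scaler_demo = StandardScaler()
x_train_demo_scaled = scaler_demo.fit_transform(x_train_demo)  # learn mean/std from train only
x_test_demo_scaled = scaler_demo.transform(x_test_demo)        # reuse the train statistics

print(x_train_demo_scaled.mean(axis=0))  # approximately 0
print(x_train_demo_scaled.std(axis=0))   # approximately 1
```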

#### Pipeline

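A pipeline chains transformation steps and model fitting into a single estimator

Pipeline with imputation:
```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

steps = [('imputation', SimpleImputer()), ('logistic_regression', LogisticRegression())]
pipeline = Pipeline(steps)

# fit() runs every step on the training data; score() reuses the fitted steps on the test data
pipeline.fit(x_train, y_train)
pipeline.score(x_test, y_test)
```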

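Pipeline with scaling and cross-validation:
```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

steps = [('scaler', StandardScaler()), ('knn', KNeighborsClassifier())]
pipeline = Pipeline(steps)

# '<step name>__<parameter>' addresses a parameter of a pipeline step
cv = GridSearchCV(pipeline, param_grid={'knn__n_neighbors': np.arange(1, 50)})
cv.fit(x_train, y_train)
y_pred = cv.predict(x_test)
```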

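Example: one-hot encoding a `Genre` column with `pd.get_dummies()`:

`data = pd.DataFrame({'Name':['A','B','C'], 'Genre':['Rock', 'Country', 'Pop']})`

`pd.get_dummies(data['Genre'])`

Output:

||Country|Pop|Rock|
|-|-|-|-|
|**0**|0|0|1|
|**1**|1|0|0|
|**2**|0|1|0|

Example: appending a column to an array with `np.append(..., axis=1)`:

`a = np.repeat([2,3],2).reshape(2,2)`

`b = np.array([[1],[2]])`

`np.append(a,b, axis=1)`

Output: `array([[2, 2, 1], [3, 3, 2]])`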

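## scikit-learn Model Syntax
```python
# template: replace `module` and `Model` with a concrete estimator
from sklearn.module import Model

model = Model()
model.fit(x_training, y_training)
predictions = model.predict(x_new)
```

`x_training` (features) and `y_training` (target) must be numeric. `x_new` holds unseen observations with the same feature columns as `x_training`.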

## Measure Model Performance

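Accuracy = Correct Predictions / Total Observations

Measure it on data the model has not seen during fitting, so hold out a test set. `stratify=y` keeps the class proportions the same in both splits.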

```python
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.3, stratify=y)
```
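The effect of `stratify` can be checked on a small imbalanced example. This is a sketch with made-up labels (`x_demo`, `y_demo` and the 80/20 class balance are assumptions for illustration); `random_state` is set only to make the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 zeros and 20 ones
x_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 80 + [1] * 20)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42)

# share of class 1 in each split; both should match the 20% in the full data
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, a small test set could by chance contain very few observations of the minority class, which makes accuracy on it less informative.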

## K-nearest Neighbors (KNN)

Definition: Look at the k closest labeled data points and take a majority vote

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors = 15)
knn.fit(x_train, y_train)
# keep predictions separate from y_test so the true labels are not overwritten
y_pred = knn.predict(x_test)
accuracy = knn.score(x_test, y_test)
```

It is common to try different `n_neighbors` values to see which yields the best result
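One way to compare several `n_neighbors` values is a simple loop. The sketch below uses synthetic data from `make_classification` (an assumption, not the data from these notes) and records test accuracy per k. Picking k from the test set leaks information, so in practice the choice is better made with cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data
x_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0)

# fit one model per odd k and store its accuracy on the held-out data
test_accuracies = {}
for k in range(1, 26, 2):
    knn_k = KNeighborsClassifier(n_neighbors=k)
    knn_k.fit(x_tr, y_tr)
    test_accuracies[k] = knn_k.score(x_te, y_te)

best_k = max(test_accuracies, key=test_accuracies.get)
print(best_k, test_accuracies[best_k])
```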

## Regression

```python
from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(x_train, y_train)
y_predict = reg.predict(x_test)

print(y_predict)
print(y_test)
```

### Regression Methodology

**R-squared:**

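Formula:

1 - \[sum((y - y_predict)^2) / sum((y - y_mean)^2)]

Explanation:
- 1 - \[squared error of the model's predictions / squared error of always predicting the mean value]
- RESULT is a proportion: the share of the variance in y that the model explains. It is usually between 0 and 1, but can be negative on test data if the model does worse than predicting the mean
- Higher R-squared means the prediction error is small relative to the spread of y around its mean, so the model fits better than simply predicting y_mean
- Lower R-squared means the prediction error is about as large as that spread, so the model is little better than using y_mean to predict y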

```python
reg = LinearRegression()
reg.fit(x_train, y_train)

# score() predicts from x_test and returns R-squared against y_test
reg.score(x_test, y_test)
```
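The R-squared formula can be checked by hand against scikit-learn's `r2_score`. The arrays below are made-up values for illustration only.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted values
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 5.5, 7.0, 8.0])

ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)         # error of the model
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)  # error of predicting the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual)
print(r2_score(y_true_demo, y_pred_demo))
```

Here `ss_res` is 1.5 and `ss_tot` is 20, so both lines print 0.925 (up to floating-point rounding).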

**Mean Squared Error:**

Formula:

sum((y - y_predict)^2) / n

Explanation:
- Average squared difference between actual value and predicted value
- RESULT is in squared units of y

```python
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, y_pred)
```

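Take sqrt(MSE) to get RMSE, which is back in the units of y: `np.sqrt(mean_squared_error(y_test, y_pred))`. Older scikit-learn versions also accept `mean_squared_error(y_test, y_pred, squared=False)`; note the boolean `False`, not the string `'False'`, which is truthy and returns plain MSE. Newer versions provide `root_mean_squared_error` instead.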
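As a quick check of the MSE and RMSE definitions, the sketch below computes both by hand on made-up values and compares the MSE with `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted values
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 5.5, 7.0, 8.0])

mse_manual = np.mean((y_true_demo - y_pred_demo) ** 2)  # average squared error
rmse_manual = np.sqrt(mse_manual)                       # back in the units of y

print(mse_manual, mean_squared_error(y_true_demo, y_pred_demo))
print(rmse_manual)
```

The squared errors are 0.25, 0.25, 0 and 1, so the MSE is 0.375.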

## Regularize

Too many explanatory variables (x), or very large coefficients on them, can lead to overfitting. Regularization adds a penalty on coefficient size to the loss.

#### Ridge Regression
Ordinary least squares (OLS) minimizes only the loss, the sum of squared discrepancies between actual observations and predictions. Ridge adds a penalty on large coefficients to that loss: alpha * SUM(coefficients^2)

- alpha = 0: no penalty, same as OLS
- small alpha: mild penalty on large coefficients
- large alpha: strong penalty that shrinks coefficients toward 0 and can lead to underfitting

```python
from sklearn.linear_model import Ridge

ridge = Ridge(alpha = 0.1)
ridge.fit(x_train, y_train)

y_prediction = ridge.predict(x_test)

ridge.score(x_test, y_test)

```
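The role of alpha can be made visible by fitting Ridge with increasing alpha values and watching the overall size of the coefficients shrink. This is a sketch on synthetic data from `make_regression`; the alpha values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic stand-in data
x_demo, y_demo = make_regression(n_samples=100, n_features=10, noise=10.0, random_state=0)

coef_norms = []
for alpha in [0.01, 1, 100, 10000]:
    ridge_demo = Ridge(alpha=alpha).fit(x_demo, y_demo)
    # Euclidean length of the coefficient vector
    coef_norms.append(np.linalg.norm(ridge_demo.coef_))
    print(alpha, coef_norms[-1])
```

A larger alpha never rewards large coefficients; it only penalizes them more strongly.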


#### Lasso Regression
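Compared to **Ridge Regression**, which uses coefficients^2, **Lasso Regression** uses alpha * SUM(abs(coefficients)) as the penalty. This can shrink the coefficients of less important features to exactly 0, so Lasso also works as a feature selector.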

```python
from sklearn.linear_model import Lasso

lasso = Lasso(alpha = 0.1)
lasso.fit(x_train, y_train)

y_prediction = lasso.predict(x_test)

lasso.score(x_test, y_test)
```
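A sketch of how Lasso can drop features: the synthetic data below has 10 features of which only 3 actually influence the target (an assumption built into `make_regression` via `n_informative`), and the fitted coefficients can then be inspected for exact zeros. The alpha value is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 10 features, only 3 of them informative
x_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0)

lasso_demo = Lasso(alpha=1.0).fit(x_demo, y_demo)

# number of coefficients that Lasso set exactly to zero
n_zero = int(np.sum(lasso_demo.coef_ == 0))
print(lasso_demo.coef_)
print(n_zero)
```

Features whose coefficient is exactly 0 have no effect on the prediction, which is why the remaining non-zero coefficients are often read as the selected features.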

## Metrics


#### Accuracy, Precision, Recall
||Predicted True|Predicted False|
|-|-|-|
|**Actual True**|True Positive|False Negative|
|**Actual False**|False Positive|True Negative|

Accuracy:

*(TP+TN) / (TP+FP+FN+TN)*

Precision:

*TP / (TP+FP)*

Recall:

*TP / (TP+FN)*

F1 Score:

2 * Precision * Recall / (Precision+Recall)

```python
from sklearn.metrics import classification_report, confusion_matrix

y_pred = model.predict(x_test)

print(confusion_matrix(y_test, y_pred))

print(classification_report(y_test, y_pred))
```
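The formulas above can be verified by hand from a confusion matrix. The labels below are made up for illustration. Note that scikit-learn's `confusion_matrix` puts actual classes in rows and predicted classes in columns, ordered by label value, so for 0/1 labels the layout is `[[TN, FP], [FN, TP]]`, which is the reverse order of the table above.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical true and predicted labels
y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_demo = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# for 0/1 labels, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

With TP = 3, FN = 1, FP = 1 and TN = 5, accuracy is 0.8 and precision, recall and F1 are all 0.75, matching `accuracy_score`, `precision_score`, `recall_score` and `f1_score`.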


#### ROC Curve

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

# y_pred_probs = predicted probability of the positive class
# fpr = false positive rate
# tpr = true positive rate
fpr, tpr, thresholds = roc_curve(y_test, y_pred_probs)

plt.plot([0,1], [0,1], 'k--')  # diagonal = random guess
plt.plot(fpr, tpr)
plt.xlabel('fpr')
plt.ylabel('tpr')
plt.show()

# if value > 0.5, better than random guess
roc_auc_score(y_test, y_pred_probs)
```

## Logistic Regression

Used for classification despite its name; it outputs the probability of each class

```python
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(x_train, y_train)
# predict_proba returns one column per class; column 1 is the positive class
y_pred_probs = logreg.predict_proba(x_test)[:, 1]
```
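To make the shape of the probability output concrete, here is a sketch on synthetic binary data (an assumption; `make_classification` stands in for real data). `predict_proba` returns one column per class, so the positive-class probabilities needed by `roc_curve` and `roc_auc_score` are in column 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with two classes
x_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=1)
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=1)

logreg_demo = LogisticRegression().fit(x_tr, y_tr)
probs = logreg_demo.predict_proba(x_te)  # shape (n_samples, 2): P(class 0), P(class 1)
pos_probs = probs[:, 1]

# for a binary problem, predict() corresponds to thresholding P(class 1) at 0.5
manual_labels = (pos_probs > 0.5).astype(int)
print(np.array_equal(manual_labels, logreg_demo.predict(x_te)))
print(roc_auc_score(y_te, pos_probs))
```

Moving the 0.5 threshold up or down trades precision against recall, which is what the ROC curve traces out.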

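## Hyperparameter Tuning

alpha in **Regularize** and k (`n_neighbors`) in **KNN** are hyperparameters: settings chosen before fitting rather than learned from the data. Tuning means trying several values and keeping the one that performs best.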

## Cross Validation

A model is only as good as the training data it is fed.

A single split into training and test sets makes the score depend on which observations happen to land in each set. To reduce that dependence, we can cross validate: repeat the split several times (k folds), fit and score on each fold, and look at the spread of the results.

```python
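import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold

kf = KFold(n_splits=5, shuffle=True)
reg = LinearRegression()

# one score per fold (R-squared for regression); a wide spread means the result depends on the split
cv_results = cross_val_score(reg, x, y, cv=kf)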

print(np.mean(cv_results))
print(np.std(cv_results))
print(np.quantile(cv_results, [0.025, 0.975]))
```
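What `cross_val_score` does can be sketched as an explicit loop over the folds. The data below is synthetic (`make_regression`), and `random_state` is fixed on `KFold` only so that the manual loop and `cross_val_score` see the same splits.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data
x_demo, y_demo = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=0)
kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)

# fit on four folds, score on the remaining one, and repeat for each fold
manual_scores = []
for train_idx, test_idx in kf_demo.split(x_demo):
    fold_model = LinearRegression().fit(x_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(x_demo[test_idx], y_demo[test_idx]))

auto_scores = cross_val_score(LinearRegression(), x_demo, y_demo, cv=kf_demo)
print(np.allclose(manual_scores, auto_scores))
```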

## GridSearch Cross Validation

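- Fit and cross-validate every combination of the listed hyperparameter values, then keep the best-scoring one

- Resource intensive: the number of fits is (number of combinations) x (number of folds)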

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

kf = KFold(n_splits=5, shuffle=True)
# 10 alpha values x 2 solvers = 20 combinations
param_grid = {'alpha': np.linspace(0.0001, 1, 10), 'solver':['sag', 'lsqr']}

ridge = Ridge()
ridge_cv = GridSearchCV(ridge, param_grid, cv=kf)
ridge_cv.fit(x_train, y_train)

print(ridge_cv.best_params_)
print(ridge_cv.best_score_)

```

## Randomized Search Cross Validation

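- Instead of testing all combinations, evaluate only `n_iter` randomly sampled combinations, which is cheaper but may miss the best one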

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, RandomizedSearchCV

kf = KFold(n_splits=5, shuffle=True)
param_grid = {'alpha': np.linspace(0.0001, 1, 10), 'solver':['sag', 'lsqr']}

ridge = Ridge()
# n_iter = number of parameter combinations sampled from param_grid
ridge_cv = RandomizedSearchCV(ridge, param_grid, cv=kf, n_iter = 2)
ridge_cv.fit(x_train, y_train)

print(ridge_cv.best_params_)
print(ridge_cv.best_score_)

```
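The difference in cost between the two searches can be seen by counting how many parameter combinations each one evaluates. The sketch below runs both on synthetic data (`make_regression`, an assumption) with the same grid; the `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, RandomizedSearchCV

# Synthetic stand-in data
x_demo, y_demo = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)
kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)
param_grid_demo = {'alpha': np.linspace(0.0001, 1, 10), 'solver': ['sag', 'lsqr']}

grid_demo = GridSearchCV(Ridge(), param_grid_demo, cv=kf_demo).fit(x_demo, y_demo)
rand_demo = RandomizedSearchCV(
    Ridge(), param_grid_demo, cv=kf_demo, n_iter=2, random_state=0).fit(x_demo, y_demo)

# cv_results_['params'] lists every combination that was evaluated
print(len(grid_demo.cv_results_['params']))  # 10 alphas x 2 solvers
print(len(rand_demo.cv_results_['params']))  # n_iter
```

With 5 folds, the grid search fits 20 x 5 = 100 models while the randomized search fits 2 x 5 = 10, at the price of possibly missing the best combination.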