#  Supervised Learning

###  Types of supervised learning

**Classification**: Target variable consists of categories

**Regression**: Target variable is continuous

### Naming conventions

Feature = predictor variable = independent variable

Target variable = dependent variable = response variable


**Requirements:**
- No missing values
- Data in numeric format
- Data stored in pandas DataFrame or NumPy array
- Perform Exploratory Data Analysis (EDA) first to check that the data meets these requirements
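
The requirements above can be checked with a few pandas calls. This is a minimal sketch on a made-up DataFrame (the column names and values are hypothetical), and dropping incomplete rows is only one possible way to handle missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "feature_1": [1.0, 2.5, np.nan, 4.0],
    "feature_2": [10, 20, 30, 40],
    "target_variable": [0, 1, 0, 1],
})

# Count missing values per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Check that every column has a numeric dtype.
all_numeric = all(pd.api.types.is_numeric_dtype(dtype) for dtype in df.dtypes)
print(all_numeric)

# One simple way to satisfy "no missing values": drop incomplete rows.
df_clean = df.dropna()
print(df_clean.shape)
```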

### Scikit-learn Syntax

In [None]:
from sklearn.module import Model
model = Model()
model.fit(X, y)
predictions = model.predict(X_new)
print(predictions)

**k-Nearest Neighbors**

Predict the label of a data point by:
- Looking at the `k` closest labeled data points
- Taking a majority vote
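
To make the two steps concrete, here is a from-scratch sketch in NumPy. The function `knn_predict` and the toy data are hypothetical, written only for illustration. It uses Euclidean distance, which matches the default metric of `KNeighborsClassifier`, but it is not how scikit-learn implements the search internally.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    # Euclidean distance from x_new to every labeled point.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest labeled points.
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of those points.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters.
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([1.1, 1.0]), k=3))
print(knn_predict(X_toy, y_toy, np.array([5.1, 5.0]), k=3))
```

With an even `k` the vote can tie; this sketch then returns whichever tied label it counted first, so an odd `k` is a common choice for two-class problems.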

**Using scikit-learn to fit a classifier**

In [None]:
from sklearn.neighbors import KNeighborsClassifier
X = dataframe[["feature_1","feature_2","feature_n"]].values
y = dataframe["target_variable"].values
print(X.shape, y.shape)

In [None]:
knn = KNeighborsClassifier(n_neighbors=n)  # n = number of neighbors to consider
knn.fit(X, y)

**Predicting on unlabeled data**

In [None]:
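import numpy as np

# New, unlabeled observations: one row per observation, one column per feature.
# X_new must have the same number of columns as the X used for fitting.
X_new = np.array([[56.8, 17.5],
                  [24.4, 24.1],
                  [50.1, 10.9]])
print(X_new.shape)  # the model will predict one label per row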

In [None]:
predictions = knn.predict(X_new)
print('Predictions: {}'.format(predictions))

## Measuring model performance

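Accuracy, the fraction of predictions that are correct, is a common metric for classification. To estimate how the model performs on unseen data, split the data into a training set and a test set, fit the model on the training set, and compute accuracy on the test set.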

In [None]:
accuracy = correct_predictions/total_observations
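
The accuracy formula can be computed directly with NumPy. The labels and predictions below are made up for illustration; `accuracy_score` from `sklearn.metrics` should give the same number.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions, for illustration only.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

correct_predictions = np.sum(y_true == y_pred)  # 6 of the 8 match
total_observations = len(y_true)
accuracy = correct_predictions / total_observations
print(accuracy)  # 0.75

# The same value from scikit-learn's helper.
print(accuracy_score(y_true, y_pred))
```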

### Train/test split

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21, stratify=y)
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X_train, y_train)

print(knn.score(X_test, y_test))
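
The `stratify=y` argument used above asks `train_test_split` to keep the class proportions of `y` roughly equal in both splits. A small sketch on made-up, imbalanced labels (all names here are hypothetical) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 observations of class 0, 20 of class 1.
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=21, stratify=y_demo
)

# Share of class 1 in each split; stratification keeps both close to 0.2.
print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())
```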

**Model complexity**
- Larger k = less complex model = can cause underfitting
- Smaller k = more complex model = can lead to overfitting

### Model complexity and over/underfitting

In [None]:
import numpy as np

# Fit a model for each number of neighbors and record train/test accuracy,
# so the values of k can be compared.
train_accuracies = {}
test_accuracies = {}
neighbors = np.arange(1, 26)

for neighbor in neighbors:
    knn = KNeighborsClassifier(n_neighbors=neighbor)
    knn.fit(X_train, y_train)
    train_accuracies[neighbor] = knn.score(X_train, y_train)
    test_accuracies[neighbor] = knn.score(X_test, y_test)


**Plotting our results**

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
plt.title("KNN: Varying Number of Neighbors")
plt.plot(neighbors, list(train_accuracies.values()), label="Training Accuracy")
plt.plot(neighbors, list(test_accuracies.values()), label="Testing Accuracy")
plt.legend()
plt.xlabel("Number of Neighbors")
plt.ylabel("Accuracy")
plt.show()
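
Besides reading the plot, the recorded accuracies can be inspected numerically. The sketch below repeats the loop on synthetic data from `make_classification` (an assumption, since the notes use their own dataset) and picks the `k` with the highest test accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; not the dataset used in these notes).
X_syn, y_syn = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=21, stratify=y_syn
)

train_acc, test_acc = {}, {}
for k in range(1, 26):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    train_acc[k] = model.score(X_tr, y_tr)
    test_acc[k] = model.score(X_te, y_te)

# With k=1 each training point is its own nearest neighbor,
# so training accuracy is perfect: a sign of a very complex model.
print(train_acc[1])

# The k with the highest test accuracy is a reasonable candidate.
best_k = max(test_acc, key=test_acc.get)
print(best_k, test_acc[best_k])
```

Choosing `k` on the test set makes the reported test accuracy slightly optimistic, so treat `best_k` as a starting point rather than a final answer.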

## Regression

**Creating feature and target arrays**

In [None]:
X = diabetes_df.drop("target_variable", axis=1).values
y = diabetes_df["target_variable"].values
print(type(X), type(y))


**Making predictions from a single feature**

In [None]:
X_bmi = X[:, 3]  # select a single feature: the BMI column (index 3 here)
print(y.shape, X_bmi.shape)

In [None]:
# scikit-learn expects a 2D feature array, so reshape to (n_samples, 1)
X_bmi = X_bmi.reshape(-1, 1)
print(X_bmi.shape)

### Plotting

In [None]:
import matplotlib.pyplot as plt
plt.scatter(X_bmi, y)
plt.ylabel("Blood Glucose (mg/dl)")
plt.xlabel("Body Mass Index")
plt.show()

### Fitting a regression model

In [None]:
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(X_bmi, y)
predictions = reg.predict(X_bmi)
plt.scatter(X_bmi, y)
plt.plot(X_bmi, predictions)  # fitted regression line
plt.ylabel("Blood Glucose (mg/dl)")
plt.xlabel("Body Mass Index")
plt.show()

## Linear regression using all features

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
reg_all = LinearRegression()
reg_all.fit(X_train, y_train)
y_pred = reg_all.predict(X_test)

**R-squared**

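R²: quantifies the proportion of variance in the target values that is explained by the features

Values typically range from 0 to 1; on held-out data R² can be negative when the model fits worse than always predicting the mean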
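
R² can be written as $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$, where $SS_{res}$ is the sum of squared prediction errors and $SS_{tot}$ is the sum of squared deviations of the target from its mean. A small worked example on made-up numbers, checked against `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical targets and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 8.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # 0.25 + 0.25 + 0 + 1 = 1.5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # mean is 6: 9 + 1 + 1 + 9 = 20
r_squared = 1 - ss_res / ss_tot                 # 1 - 1.5 / 20 = 0.925
print(r_squared)

# scikit-learn's r2_score computes the same quantity.
print(r2_score(y_true, y_hat))
```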

**R-squared in scikit-learn**

In [None]:
reg_all.score(X_test, y_test)

### Mean squared error and root mean squared error

- MSE is measured in target units, squared
- RMSE is measured in the same units as the target variable
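
Both metrics are easy to compute by hand, which makes the difference in units visible. The values below are made up for illustration:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical targets and predictions (think of them as mg/dl).
y_true = np.array([100.0, 120.0, 90.0, 110.0])
y_hat = np.array([110.0, 115.0, 95.0, 100.0])

# Errors are -10, 5, -5, 10, so squared errors are 100, 25, 25, 100.
mse = np.mean((y_true - y_hat) ** 2)  # 62.5, in squared target units
rmse = np.sqrt(mse)                   # about 7.9, in target units
print(mse, rmse)

# scikit-learn's MSE agrees with the manual calculation.
print(mean_squared_error(y_true, y_hat))
```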

**RMSE in scikit-learn**

In [None]:
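import numpy as np
from sklearn.metrics import mean_squared_error
# Square root of MSE; works whether or not the installed version supports squared=False
np.sqrt(mean_squared_error(y_test, y_pred))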

## Cross-validation

Model performance depends on the way we split up the data, so the score from a single train/test split may not be representative of the model's ability to generalize to unseen data.

Solution: cross-validation!

* 5 folds = 5-fold CV
* 10 folds = 10-fold CV
* k folds = k-fold CV

More folds = More computationally expensive
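
To see what a fold is, the sketch below splits 12 made-up observations into 4 folds with `KFold`. Each observation lands in the held-out fold exactly once; the model would be fit on the remaining rows each time. The names `X_demo` and `kf_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical data: 12 observations with a single placeholder feature.
X_demo = np.arange(12).reshape(-1, 1)
kf_demo = KFold(n_splits=4, shuffle=True, random_state=42)

fold_sizes = []
held_out = []
for fold, (train_idx, test_idx) in enumerate(kf_demo.split(X_demo)):
    # train_idx: rows used for fitting; test_idx: rows held out for scoring.
    print(fold, len(train_idx), sorted(test_idx.tolist()))
    fold_sizes.append(len(test_idx))
    held_out.extend(test_idx.tolist())

# Every row appears in exactly one held-out fold.
print(sorted(held_out) == list(range(12)))
```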


### Cross-validation in scikit-learn

In [None]:
from sklearn.model_selection import cross_val_score, KFold
kf = KFold(n_splits=6, shuffle=True, random_state=42)
reg = LinearRegression()
cv_results = cross_val_score(reg, X, y, cv=kf)

#### Evaluating cross-validation performance

In [None]:
print(cv_results)

In [None]:
print(np.mean(cv_results), np.std(cv_results))

In [None]:
print(np.quantile(cv_results, [0.025, 0.975]))  # 95% interval of the fold scores

### Regularized regression

Regularization is a technique used to avoid overfitting: it adds a penalty on large coefficients to the loss function.
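
One common form is ridge regression, which adds `alpha` times the sum of squared coefficients to the least-squares loss. The sketch below assumes ridge is the kind of regularization meant here (the text does not name one) and uses synthetic data from `make_regression`. It shows the coefficients shrinking as `alpha` grows.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic stand-in data (an assumption; not the dataset used in these notes).
X_syn, y_syn = make_regression(
    n_samples=100, n_features=10, noise=10.0, random_state=0
)

coef_norms = {}
for alpha in [0.1, 10.0, 1000.0]:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_syn, y_syn)
    # Overall size of the coefficient vector for this penalty strength.
    coef_norms[alpha] = np.linalg.norm(ridge.coef_)
    print(alpha, coef_norms[alpha])
```

A larger `alpha` means a stronger penalty and a simpler model; too large a value can cause underfitting, so `alpha` is usually tuned, for example with cross-validation.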