# Machine Learning with scikit-learn

In [None]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')

Supervised learning is a type of machine learning in which the target values are already known for the training data. A model is built from those labeled examples with the aim of accurately predicting the target values of previously unseen data.

Two kinds of supervised learning:
- Classification: target variable consists of categories
- Regression: target variable is continuous

Terminology:
- Feature = predictor variable = independent variable
- Target variable = dependent variable = response variable

Before using supervised learning with scikit-learn, the data must meet these requirements:
- No missing values
- All values in numeric format
- Data stored in a pandas DataFrame or NumPy arrays
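
The requirements above can be checked with a few lines of pandas. This is a minimal sketch on a hypothetical toy DataFrame (the column names and values are made up for illustration), and dropping rows plus one-hot encoding is only one possible way to clean the data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing value and one non-numeric column.
df = pd.DataFrame({
    "account_length": [101, 73, np.nan, 120],
    "plan": ["basic", "premium", "basic", "basic"],
    "churn": [0, 1, 0, 1],
})

# Count missing values per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# List the columns that are not numeric.
non_numeric = df.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)

# One simple clean-up: drop incomplete rows, then one-hot encode the categories.
clean = pd.get_dummies(df.dropna(), columns=non_numeric, dtype=int)
print(clean.dtypes)
```

After this step `clean` has no missing values and only numeric columns, so it could be passed to a scikit-learn model.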

Perform exploratory data analysis (EDA) before doing supervised learning, to confirm that the data meets these requirements.

> scikit-learn follows the same syntax for all models: import, instantiate, fit, predict.

```
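# `module` and `Model` are placeholders for a real scikit-learn module and estimator class
from sklearn.module import Model

model = Model()                      # instantiate the estimator
model.fit(X, y)                      # learn from features X and target y
predictions = model.predict(X_new)   # predict the target for new observations
print(predictions)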
```

# Classification

1. Build a model
2. Model learns from the labeled data
3. Pass unlabeled data to the model as input
4. Model predicts the labels of unseen data

## k-Nearest Neighbors (KNN)

KNN predicts the label of a data point by:
- Looking at the k closest labeled data points
- Taking a majority vote
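
The two steps above can be sketched from scratch with NumPy. This is an illustrative sketch, not scikit-learn's implementation: `knn_predict` and the toy data are hypothetical, and Euclidean distance is assumed (it is also the default metric of `KNeighborsClassifier`).

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    # Euclidean distance from x_new to every labeled point.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest labeled points.
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters.
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                  [6.0, 6.0], [6.5, 7.0], [7.0, 6.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([1.2, 1.1]), k=3))  # point near the first cluster
print(knn_predict(X_toy, y_toy, np.array([6.4, 6.2]), k=3))  # point near the second cluster
```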


In [None]:
churn = pd.read_csv('../data/telecom_churn_clean.csv')
churn.head()

In [None]:
sns.scatterplot(data=churn, x='account_length', y='customer_service_calls', hue='churn')

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# .values converts the DataFrame and the Series into NumPy arrays
X = churn[['account_length', 'customer_service_calls']].values
y = churn['churn'].values

print(X.shape)
print(y.shape)

In [None]:
knn = KNeighborsClassifier(n_neighbors=15)

knn.fit(X, y)

In [None]:
# Three new observations, each as [account_length, customer_service_calls]
X_new = np.array([[1, 1],
                  [0, 0],
                  [-1, -1]])
print(X_new)

In [None]:
predictions = knn.predict(X_new)
print(predictions)

The `n_neighbors` parameter in a K-Nearest Neighbors (KNN) model specifies the number of closest data points (neighbors) the model considers when making a prediction. Here’s how it impacts the model:

1. **Small `n_neighbors` (e.g., 1 or 2):** The model becomes highly sensitive to noise in the data, as predictions are based on very few neighbors. This can lead to overfitting.

2. **Large `n_neighbors` (e.g., 10 or more):** The model becomes more robust to noise because each prediction is a vote over more neighbors, but it may also oversmooth the decision boundary, missing real patterns and leading to underfitting.

In summary, `n_neighbors` controls the trade-off between bias and variance: a small value can capture more complex patterns but may be too sensitive to noise, while a large value smooths out noise but may miss finer details.
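
The overfitting end of this trade-off can be seen on training accuracy alone. The sketch below uses synthetic data from `make_classification` (an assumption for illustration, not the churn dataset), with the labels of roughly 20% of the samples assigned at random via `flip_y`:

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, deliberately noisy data (illustrative only).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0,
    flip_y=0.2, random_state=0,
)

# Accuracy on the same data the model was fit on, for several values of k.
train_scores = {}
for k in (1, 15, 101):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_demo, y_demo)
    train_scores[k] = model.score(X_demo, y_demo)

print(train_scores)
```

With `n_neighbors=1` each training point is its own nearest neighbor, so the model reproduces every training label, including the noisy ones. Larger values can no longer memorize individual points, which usually lowers training accuracy.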


## Measuring model performance

Accuracy is a commonly used metric for classification models:

$$
    \text{accuracy} = \frac{\text{correct predictions}}{\text{total observations}}
$$
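
As a small worked example of the formula, the sketch below computes accuracy by hand and with `accuracy_score` from `sklearn.metrics`. The label arrays are hypothetical:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions for five observations.
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])

# Fraction of predictions that match the true labels: 4 correct out of 5.
manual_accuracy = np.mean(y_true == y_pred)
library_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy)
print(library_accuracy)
```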

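Split the data into a training set and a test set. The model is fit on the training set and evaluated on the test set, because accuracy measured on the data used for fitting overstates how well the model generalizes to unseen data.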

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
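
The `stratify=y` argument keeps the class proportions of `y` the same in both splits, which matters when one class is rare. A minimal sketch on hypothetical imbalanced labels (80 negatives, 20 positives):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 20% of the labels are positive.
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Share of positive labels overall, in the training set, and in the test set.
print(y_demo.mean(), y_tr.mean(), y_te.mean())
```

Both splits keep the 20% share of positives, so accuracy on the test set is measured on a class mix that matches the full data.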

In [None]:
knn = KNeighborsClassifier(n_neighbors=6)

knn.fit(X_train, y_train)

In [None]:
print(knn.score(X_test, y_test))

Let's try different values of `n_neighbors`.

In [None]:
train_accuracies = {}
test_accuracies = {}
neighbors = np.arange(1, 26)

for n in neighbors:
    knn = KNeighborsClassifier(n_neighbors=n)
    knn.fit(X_train, y_train)
    train_accuracies[n] = knn.score(X_train, y_train)
    test_accuracies[n] = knn.score(X_test, y_test)


In [None]:
# Add a title
plt.title("KNN: Varying Number of Neighbors")

# Plot training accuracies (list() turns the dict view into a plain sequence)
plt.plot(neighbors, list(train_accuracies.values()), label="Training Accuracy")

# Plot test accuracies
plt.plot(neighbors, list(test_accuracies.values()), label="Test Accuracy")

plt.legend()
plt.xlabel("Number of Neighbors")
plt.ylabel("Accuracy")

# Display the plot
plt.show()

# Introduction to Regression

In regression problems the target variable is typically continuous, such as the blood glucose level predicted in the example below.

In [None]:
diabetes_df = pd.read_csv('../data/diabetes_clean.csv')
diabetes_df

In [None]:
# Features: every column except the target
X = diabetes_df.drop("glucose", axis=1).values
# Target: the glucose column
y = diabetes_df["glucose"].values