<a href="https://colab.research.google.com/github/mayait/ClaseAnalisisDatos/blob/main/machine_learning/Supervised_Learning_Clasification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classification

Classification is used to predict the label, or category, of an observation. For example, we can predict whether a bank transaction is fraudulent or not. Because there are two possible outcomes here (a fraudulent or a non-fraudulent transaction), this is known as binary classification. Regression, by contrast, is used to predict continuous values. For example, a regression model can use features such as the number of bedrooms and the size of a property to predict the target variable, the price of the property.

**Dataset**
https://archive.ics.uci.edu/ml/datasets/Iranian+Churn+Dataset
https://www.kaggle.com/datasets/royjafari/customer-churn


**Data Dictionary**

| Column | Explanation |
| --- | --- |
| Call Failure | number of call failures |
| Complaints | binary (0: No complaint, 1: complaint) |
| Subscription Length | total months of subscription |
| Charge Amount | ordinal attribute (0: lowest amount, 9: highest amount) |
| Seconds of Use | total seconds of calls |
| Frequency of use | total number of calls |
| Frequency of SMS | total number of text messages |
| Distinct Called Numbers | total number of distinct phone calls |
| Age Group | ordinal attribute (1: younger age, 5: older age) |
| Tariff Plan | binary (1: Pay as you go, 2: contractual) |
| Status | binary (1: active, 2: non-active) |
| Age | age of customer |
| Customer Value | the calculated value of customer |
| Churn | class label (1: churn, 0: non-churn) |

In [None]:
!wget https://raw.githubusercontent.com/mayait/ClaseAnalisisDatos/main/machine_learning/datasets/telecom_churn_clean.csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['figure.dpi'] = 100
plt.rcParams['savefig.dpi'] = 100

from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier

from matplotlib.colors import ListedColormap
import seaborn as sns
%config InlineBackend.figure_format = 'retina' # sharper plots

churn_df = pd.read_csv("telecom_churn_clean.csv")
df = churn_df

In [None]:

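# DataFrame.plot.scatter creates its own figure, so a separate plt.figure() call
# would only leave an empty extra figure behind
churn_df.plot.scatter(x='customer_service_calls', y='account_length', c='churn', cmap=plt.get_cmap('jet'))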


plt.xlabel("customer_service_calls")
plt.ylabel("account_length")

plt.show()

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets, neighbors
from mlxtend.plotting import plot_decision_regions

In [None]:
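def knn_comparison(data, k):
  # Use the DataFrame passed in as `data` rather than the global churn_df
  y = data["churn"].values
  x = data[["account_length", "customer_service_calls"]].values

  # Call KNeighborsClassifier (imported above) directly: the name `neighbors`
  # is reassigned to a NumPy array later in this notebook, which would break
  # neighbors.KNeighborsClassifier when this function is called again
  clf = KNeighborsClassifier(n_neighbors=k)
  clf.fit(x, y)
  # Plotting decision region
  plot_decision_regions(x, y, clf=clf, legend=2)
  # Adding axes annotations
  plt.xlabel('account_length')
  plt.ylabel('customer_service_calls')
  plt.title('KNN with K=' + str(k))
  plt.show()

knn_comparison(churn_df, 5)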

**k-Nearest Neighbors**
Let's build our first model! We'll use an algorithm called k-Nearest Neighbors (KNN), which is popular for classification problems. The idea of KNN is to predict the label of a data point by looking at its k closest labeled data points (for example, k = 3) and having them vote on the label the unlabeled observation should receive. KNN uses majority voting: the prediction is the label held by the majority of the nearest neighbors.
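
The voting idea described above can be sketched from scratch in a few lines. This is only an illustration under stated assumptions, not scikit-learn's implementation: the helper name `knn_predict`, the Euclidean distance, and the toy points are all made up for the example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbors (sorted by distance only)
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Count the votes and return the most frequent label
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up toy data: two features per point, labels 0 (no churn) and 1 (churn)
toy_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
toy_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(toy_points, toy_labels, query=(2.0, 2.0), k=3))
print(knn_predict(toy_points, toy_labels, query=(6.0, 7.0), k=3))
```

With an even k the vote can end in a tie. This sketch does nothing special about that case, which is one reason odd values of k are often tried first for binary problems.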

**k-Nearest Neighbors: Fit**

In this exercise, you will build your first classification model using the churn_df dataset, which was loaded at the top of this notebook.

The features to use will be "account_length" and "customer_service_calls". The target, "churn", needs to be a single column with the same number of observations as the feature data.

You will convert the features and the target variable into NumPy arrays, create an instance of a KNN classifier, and then fit it to the data.

NumPy has already been imported as np.

**Instructions**
* Import KNeighborsClassifier from sklearn.neighbors.
* Create an array called X containing values from the "account_length" and "customer_service_calls" columns, and an array called y for the values of the "churn" column.
* Instantiate a KNeighborsClassifier called knn with 6 neighbors.
* Fit the classifier to the data using the .fit() method.

🌶️ Import KNeighborsClassifier

In [None]:
# Import KNeighborsClassifier
from ____.____ import ____ 

# Create arrays for the features and the target variable
y = ____["____"].values
X = ____[["____", "____"]].values

# Create a KNN classifier with 6 neighbors
knn = ____

# Fit the classifier to the data
knn.____(____, ____)

In [None]:
# SOLUTION
# Import KNeighborsClassifier
import numpy as np
from sklearn.neighbors import KNeighborsClassifier 

# Create arrays for the features and the target variable
y = churn_df["churn"].values
X = churn_df[["account_length", "customer_service_calls"]].values

# Create a KNN classifier with 6 neighbors
knn = KNeighborsClassifier(n_neighbors = 6)

# Fit the classifier to the data
knn.fit(X,y)

**k-Nearest Neighbors: Predict**
Now that you have fit a KNN classifier, you can use it to predict the labels of new data points. All available data was used for training; fortunately, some new observations are available, defined below as X_new.

The model knn, which you created and fit to the data in the last exercise, is still available. You will use your classifier to predict the labels of a set of new data points:

```
X_new = np.array([[30.0, 17.5],
                  [107.0, 24.1],
                  [213.0, 10.9]])
```

In [None]:
X_new = np.array([[30.0, 17.5],
                  [107.0, 24.1],
                  [213.0, 10.9]])

🌶️ Predict the labels for the X_new

In [None]:
# Predict the labels for the X_new
y_pred = ____

# Print the predictions for X_new
print("Predictions: {}".format(____)) 

In [None]:
# SOLUTION
# Predict the labels for the X_new
y_pred = knn.predict(X_new)

# Print the predictions for X_new
print("Predictions: {}".format(y_pred)) 

Great work! The model has predicted that the first and third customers in the new array will not churn. But how do we know how accurate these predictions are?

# Measuring model performance

Now we can make predictions using a classifier, but how do we know if the model is making correct predictions? We can evaluate its performance!

In classification, accuracy is a commonly used metric. Accuracy is the number of correct predictions divided by the total number of observations.

How do we measure accuracy? We could compute accuracy on the data used to fit the classifier. However, as this data was used to train the model, performance will not be indicative of how well it can generalize to unseen data, which is what we are interested in!
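
As a small worked example of the definition above, accuracy can be computed by hand from two lists of labels. The function name `accuracy` and the label values below are made up for illustration. In the exercises, the classifier's `.score()` method is used instead.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty inputs")
    # Count the positions where the prediction equals the true label
    correct = sum(1 for truth, pred in zip(y_true, y_pred) if truth == pred)
    return correct / len(y_true)

# Made-up labels: 3 of the 5 predictions are correct
y_true = [0, 1, 0, 0, 1]
y_pred = [0, 1, 1, 0, 0]
print(accuracy(y_true, y_pred))  # 3 / 5 = 0.6
```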



\begin{align}
        \text{Accuracy} = \frac{\text{Correct Predictions}}{\text{Total Observations}}
    \end{align}

[Learn more in the Google Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course/training-and-test-sets/splitting-data?hl=en)



It is common to split data into a training set and a test set.

*   We fit the classifier using the training set,
*   then we calculate the model's accuracy against the test set's labels.

![img](https://developers.google.com/static/machine-learning/crash-course/images/PartitionTwoSets.svg)

![img](https://developers.google.com/static/machine-learning/crash-course/images/TrainingDataVsTestData.svg)

Validating the trained model against test data.

Never train on test data. If you are seeing surprisingly good results on your evaluation metrics, it might be a sign that you are accidentally training on the test set. For example, high accuracy might indicate that test data has leaked into the training set.
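
One simple guard against the leakage described above is to split row indices once and check that the two sets never overlap. The sketch below uses only the standard library. The helper name `holdout_split` and its defaults are hypothetical and only meant to illustrate the idea.

```python
import random

def holdout_split(n_rows, test_size=0.2, seed=42):
    """Return disjoint lists of row indices for training and testing."""
    indices = list(range(n_rows))
    # Shuffle with a fixed seed so the split is reproducible
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    # The first n_test shuffled indices form the test set, the rest the training set
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = holdout_split(n_rows=10)
print(len(train_idx), len(test_idx))

# Leakage check: no row may appear in both sets
overlap = set(train_idx) & set(test_idx)
print("Rows shared between train and test:", overlap)
```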

**Train/test split + computing accuracy**

NumPy arrays have been created for you containing the features as X and the target variable as y. You will split them into training and test sets, fit a KNN classifier to the training data, and then compute its accuracy on the test data using the .score() method.

**Instructions**
* Import train_test_split from sklearn.model_selection.
* Split X and y into training and test sets, setting test_size equal to 20%, random_state to 42, and ensuring the target label proportions reflect that of the original dataset.
* Fit the knn model to the training data.
* Compute and print the model's accuracy for the test data.
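
To illustrate what "target label proportions reflect that of the original dataset" means, here is a from-scratch sketch of a stratified split: each class is shuffled and split separately, so both sets keep the same class balance. The helper name `stratified_split` and the made-up labels are assumptions for the example. It is not how scikit-learn implements the `stratify` argument.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Split row indices so each class keeps roughly the same share in both sets."""
    rng = random.Random(seed)
    # Group row indices by their class label
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

def churn_rate(labels, indices):
    """Share of rows labeled 1 among the given indices."""
    return sum(labels[i] for i in indices) / len(indices)

# Made-up labels: 80 non-churners (0) and 20 churners (1)
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels)

print(churn_rate(labels, train_idx), churn_rate(labels, test_idx))
```

Both printed rates match the 20% churn rate of the full made-up label list, which is the property stratification is meant to preserve.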



In [None]:
# Import the module
____

X = churn_df.drop("churn", axis=1).values
y = churn_df["churn"].values

# Split into training and test sets
X_train, X_test, y_train, y_test = ____(____, ____, test_size=____, random_state=____, stratify=____)
knn = KNeighborsClassifier(n_neighbors=5)

# Fit the classifier to the training data
____

# Print the accuracy
print(knn.score(____, ____))

In [None]:
## Solution
# Import the module
from sklearn.model_selection import train_test_split

X = churn_df.drop("churn", axis=1).values
y = churn_df["churn"].values

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
knn = KNeighborsClassifier(n_neighbors=5)

# Fit the classifier to the training data
knn.fit(X_train,y_train)

# Print the accuracy
print(knn.score(X_test, y_test))

**Overfitting and underfitting**

Interpreting model complexity is a great way to evaluate performance when utilizing supervised learning. Your aim is to produce a model that can interpret the relationship between features and the target variable, as well as generalize well when exposed to new observations.

You will generate accuracy scores for the training and test sets using a KNN classifier with different n_neighbors values, which you will plot in the next exercise.

The training and test sets were created from the churn_df dataset in the previous exercise and are available as X_train, X_test, y_train, and y_test.
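
A tiny from-scratch sketch can show why a very small k tends to overfit. With k = 1 every training point is its own nearest neighbor, so training accuracy is perfect by construction (as long as no two identical points carry different labels), even though noisy points then mislead predictions on new data. The helper names and the one-dimensional data below are made up for illustration.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Majority vote among the k training points closest to `query`."""
    ranked = sorted(zip(train_points, train_labels),
                    key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def knn_accuracy(train_points, train_labels, eval_points, eval_labels, k):
    """Accuracy of the k-NN vote on an evaluation set."""
    hits = sum(knn_predict(train_points, train_labels, point, k) == label
               for point, label in zip(eval_points, eval_labels))
    return hits / len(eval_labels)

# Made-up 1-D data: class 0 on the left, class 1 on the right,
# with one deliberately mislabeled (noisy) point in each cluster
train_points = [(1.0,), (2.0,), (3.0,), (4.0,), (6.5,), (7.5,), (8.5,), (9.5,)]
train_labels = [0, 0, 1, 0, 1, 0, 1, 1]
test_points = [(1.4,), (2.4,), (3.4,), (6.9,), (7.9,), (8.9,)]
test_labels = [0, 0, 0, 1, 1, 1]

results = {}
for k in (1, 3):
    train_acc = knn_accuracy(train_points, train_labels, train_points, train_labels, k)
    test_acc = knn_accuracy(train_points, train_labels, test_points, test_labels, k)
    results[k] = (train_acc, test_acc)
    print(f"k={k}: train accuracy={train_acc:.2f}, test accuracy={test_acc:.2f}")
```

On this toy data k = 1 memorizes the noisy points (perfect training accuracy, weaker test accuracy), while k = 3 trades some training accuracy for better test accuracy. Real datasets will not behave this cleanly.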

**Instructions**
* Create neighbors as a numpy array of values from 1 up to and including 12.
* Instantiate a KNN classifier, with the number of neighbors equal to the neighbor iterator.
* Fit the model to the training data.
* Calculate accuracy scores for the training set and test set separately using the .score() method, and store the results in the train_accuracies and test_accuracies dictionaries, respectively, using the number of neighbors as the key.

In [None]:
# Create neighbors
neighbors = np.arange(____, ____)
train_accuracies = {}
test_accuracies = {}

for neighbor in neighbors:

  # Set up a KNN Classifier
  knn = ____(____=____)

  # Fit the model
  knn.____(____, ____)

  # Compute accuracy
  train_accuracies[____] = knn.____(____, ____)
  test_accuracies[____] = knn.____(____, ____)
print(neighbors, '\n', train_accuracies, '\n', test_accuracies)

In [None]:
# Solution
# Create neighbors
neighbors = np.arange(1, 13)
train_accuracies = {}
test_accuracies = {}

for neighbor in neighbors:

  # Set up a KNN Classifier
  knn = KNeighborsClassifier(n_neighbors=neighbor)

  # Fit the model
  knn.fit(X_train, y_train)

  # Compute accuracy
  train_accuracies[neighbor] = knn.score(X_train, y_train)
  test_accuracies[neighbor] = knn.score(X_test, y_test)
print(neighbors, '\n', train_accuracies, '\n', test_accuracies)

In [None]:
import warnings

warnings.filterwarnings("ignore",category=UserWarning)

for k in np.arange(1, 13):
  knn_comparison(churn_df, k)


**Visualizing model complexity**

Now that you have calculated the accuracy of the KNN model on the training and test sets using various values of n_neighbors, you can create a model complexity curve to visualize how performance changes as the model becomes less complex (that is, as the number of neighbors grows)!

The variables neighbors, train_accuracies, and test_accuracies, which you generated in the previous exercise, are still available. You will plot the results to aid in finding the optimal number of neighbors for your model.
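
Besides reading the curve by eye, the best-scoring number of neighbors can be looked up directly in a dictionary of test accuracies. The dictionary below is a hypothetical stand-in with made-up values, not results from this dataset. With the real `test_accuracies` the same one-liner should apply.

```python
# Hypothetical accuracies keyed by number of neighbors (illustrative values only)
test_accuracies_example = {1: 0.79, 2: 0.85, 3: 0.84, 4: 0.86, 5: 0.87, 6: 0.87, 7: 0.86}

# max() with key=dict.get returns the key whose value is largest;
# when several keys tie, the first one in insertion order is returned
best_k = max(test_accuracies_example, key=test_accuracies_example.get)
print("Best k:", best_k, "with test accuracy", test_accuracies_example[best_k])
```

Choosing k from the test set this way is fine for exploration, but the resulting test accuracy is then an optimistic estimate, since the test set helped pick the model.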

**Instructions:**

* Add a title "KNN: Varying Number of Neighbors".
* Plot the .values() method of train_accuracies on the y-axis against neighbors on the x-axis, with a label of "Training Accuracy".
* Plot the .values() method of test_accuracies on the y-axis against neighbors on the x-axis, with a label of "Testing Accuracy".
* Display the plot.

In [None]:
# Add a title
plt.title("____")

# Plot training accuracies
plt.plot(____, ____, label="____")

# Plot test accuracies
plt.plot(____, ____, label="____")

plt.legend()
plt.xlabel("Number of Neighbors")
plt.ylabel("Accuracy")

# Display the plot
____

In [None]:
# Solution



# Add a title
plt.title("KNN: Varying Number of Neighbors")

# Plot training accuracies
plt.plot(neighbors, list(train_accuracies.values()),  label="Training Accuracy")

# Plot test accuracies
plt.plot(neighbors, list(test_accuracies.values()),  label="Testing Accuracy")

plt.legend()
plt.xlabel("Number of Neighbors")
plt.ylabel("Accuracy")

# Display the plot
plt.show()