# Chapter #1: Classification

## 1. Machine learning with scikit-learn

**What is machine learning?**
> - Machine learning is the process whereby computers learn to make decisions from data without being explicitly programmed.

**Examples of machine learning**
> - For example, learning to predict whether an email is spam or not spam given its content and sender.
> - Or learning to cluster books into different categories based on the words they contain, then assigning any new book to one of the existing clusters.

**Unsupervised learning**
> - Unsupervised learning is the process of uncovering hidden patterns and structures from unlabeled data.
> - For example, a business may wish to group its customers into distinct categories based on their purchasing behavior without knowing in advance what these categories are.
> - This is known as clustering, one branch of unsupervised learning.

<img src="./assets/ch01_01_ml_with_scikit_learn_img01.png">
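
As a minimal sketch of the customer-grouping example above, the snippet below clusters a handful of made-up customers with scikit-learn's `KMeans`. The feature values, the choice of two clusters, and the variable names are all assumptions for illustration; no labels are passed to the algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical purchasing data: [orders per year, average order value]
customers = np.array([
    [2, 15.0], [3, 18.0], [1, 12.0],         # infrequent, low-spend
    [40, 120.0], [38, 110.0], [42, 130.0],   # frequent, high-spend
])

# No labels are supplied: the algorithm only sees the features.
# n_clusters=2 is an assumption; in practice this number must be chosen.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(customers)

# Assign a new customer to one of the existing clusters
new_customer = np.array([[39, 115.0]])
new_cluster = kmeans.predict(new_customer)

print(cluster_ids, new_cluster)
```

The cluster numbers themselves carry no meaning; only the grouping does.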

**Supervised learning**
> - Supervised learning is a type of machine learning where the values to be predicted are already known for the training data, and a model is built with the aim of accurately predicting the values for previously unseen data.
> - Supervised learning uses features to predict the value of a target variable, such as predicting a basketball player's position based on their points per game.
> - This course will exclusively focus on supervised learning.

<img src="./assets/ch01_01_ml_with_scikit_learn_img02.png">

**Types of supervised learning**
> - There are two types of supervised learning:
> 1. Classification is used to predict the label, or category, of an observation.
>> - For example, we can predict whether a bank transaction is fraudulent or not.
>> - Because there are only two possible outcomes here (fraudulent or non-fraudulent), this is known as binary classification.
> 2. Regression is used to predict continuous values.
>> - For example, a model can use features such as the number of bedrooms and the size of a property to predict the target variable, the price of the property.
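
The difference between the two types can be sketched side by side. The data below is invented, and the two models (`KNeighborsClassifier` and `LinearRegression`) are just convenient choices for illustration, not the only options.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

# Classification: hypothetical transactions [amount, hour of day] -> label
X_tx = np.array([[20, 14], [35, 10], [15, 16], [900, 3], [1200, 2], [950, 4]])
y_fraud = np.array([0, 0, 0, 1, 1, 1])  # 1 = fraudulent, 0 = not fraudulent
clf = KNeighborsClassifier(n_neighbors=3).fit(X_tx, y_fraud)
fraud_pred = clf.predict(np.array([[1000, 3]]))  # a category (0 or 1)

# Regression: hypothetical properties [bedrooms, size in m^2] -> price
X_home = np.array([[1, 40], [2, 70], [3, 80], [4, 120]])
y_price = np.array([100_000, 160_000, 200_000, 280_000])
reg = LinearRegression().fit(X_home, y_price)
price_pred = reg.predict(np.array([[3, 90]]))  # a continuous value

print(fraud_pred, price_pred)
```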

**Naming conventions**
> - Note that what we call a feature throughout the course, others may call a predictor variable or independent variable.
> - Also, what we call the target variable, others may call dependent variable or response variable.

**Before you use supervised learning**
> - There are some requirements to satisfy before performing supervised learning:
>> - Our data must not have missing values.
>> - Our data must be in numeric format.
>> - Our data must be stored as pandas DataFrames or Series, or as NumPy arrays.
> - This requires some exploratory data analysis first to ensure data is in the correct format.
> - Various pandas methods for descriptive statistics, along with appropriate data visualizations, are useful in this step.
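
A few pandas calls are usually enough to check these requirements. The sketch below uses a small made-up DataFrame (the `plan` column is hypothetical). Dropping incomplete rows and one-hot encoding text columns is only one possible clean-up, shown here as an assumption rather than a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with one missing value and one non-numeric column
raw = pd.DataFrame({
    "account_length": [101, 73, 86, 59],
    "total_day_charge": [45.85, 22.30, np.nan, 34.73],
    "plan": ["basic", "premium", "basic", "basic"],
    "churn": [1, 0, 0, 0],
})

# Requirement 1: count missing values per column
missing_counts = raw.isna().sum()

# Requirement 2: list the columns that are not numeric
non_numeric = raw.select_dtypes(exclude="number").columns.tolist()

# One possible clean-up: drop incomplete rows, one-hot encode text columns
clean = pd.get_dummies(raw.dropna(), columns=non_numeric, dtype=int)

print(missing_counts)
print(non_numeric)
print(clean.dtypes)
```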

**scikit-learn syntax**
> - scikit-learn follows the same syntax for all supervised learning models, which makes the workflow repeatable.
> - Let's familiarize ourselves with the general scikit-learn workflow syntax, before we explore using real data later in the chapter:
>> - We import a `Model`, which is a type of algorithm for our supervised learning problem, from an `sklearn` module.
>>> - For example, the k-Nearest Neighbors model uses distance between observations to predict labels or values.
>> - We create a variable named `model`, and instantiate the `Model`.
>> - We fit the `model` to the data, where it learns patterns relating the features to the target variable.
>>> - We fit the model to `X`, an array of our features, and `y`, an array of our target variable values.
>> - We then use the `model`'s `.predict()` method, passing six new observations, `X_new`.
>>> - For example, if feeding features from six emails to a spam classification model, an array of six values is returned.
>>> - A `1` indicates the model predicts that email is spam, and a `0` represents a prediction of not spam.

<img src="./assets/ch01_01_ml_with_scikit_learn_img03.png">
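
The steps listed above can be sketched with the k-Nearest Neighbors classifier. The email features and labels below are invented for illustration, and `n_neighbors=3` is an arbitrary choice.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier  # import a Model

# Hypothetical email features: [number of links, number of exclamation marks]
X = np.array([[0, 0], [1, 0], [0, 1], [8, 9], [9, 7], [7, 8]])
y = np.array([0, 0, 0, 1, 1, 1])  # 1 = spam, 0 = not spam

model = KNeighborsClassifier(n_neighbors=3)  # instantiate the Model
model.fit(X, y)                              # fit it to features and target

# Six new, unlabeled observations
X_new = np.array([[0, 2], [9, 9], [1, 1], [8, 8], [2, 0], [7, 9]])
predictions = model.predict(X_new)           # one prediction per observation

print(predictions)  # [0 1 0 1 0 1]
```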

### 1.1. Binary classification

In the video, you saw that there are two types of supervised learning: classification and regression. Recall that binary classification is used to predict a target variable that has only two labels, typically represented numerically as a zero or a one.

A dataset, `churn_df`, has been preloaded for you in the console.

Your task is to examine the data and choose which column could be the target variable for binary classification.

    import pandas as pd
    churn_df = pd.read_csv("./datasets/churn.csv").drop(columns="Unnamed: 0")
    churn_df.head()

|   | account_length | total_day_charge | total_eve_charge | total_night_charge | total_intl_charge | customer_service_calls | churn |
|---|---|---|---|---|---|---|---|
| 0 | 101 | 45.85 | 17.65 | 9.64 | 1.22 | 3 | 1 |
| 1 | 73 | 22.3 | 9.05 | 9.98 | 2.75 | 2 | 0 |
| 2 | 86 | 24.62 | 17.53 | 11.49 | 3.13 | 4 | 0 |
| 3 | 59 | 34.73 | 21.02 | 9.66 | 3.24 | 1 | 0 |
| 4 | 129 | 27.42 | 18.75 | 10.11 | 2.59 | 1 | 0 |


- Possible Answers:
> - `"customer_service_calls"`
> - `"total_night_charge"`
> - **`"churn"`**
> - `"account_length"`
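
One quick way to spot a candidate binary target is to count the distinct values in each column. The sketch below retypes the five rows shown above by hand so that it runs without the CSV file; `churn_head` is a name introduced here, and on the full dataset the same call would be made on `churn_df`. With only five rows this is a heuristic, not proof.

```python
import pandas as pd

# The five rows displayed above, typed in by hand for a self-contained example
churn_head = pd.DataFrame({
    "account_length": [101, 73, 86, 59, 129],
    "total_day_charge": [45.85, 22.3, 24.62, 34.73, 27.42],
    "total_eve_charge": [17.65, 9.05, 17.53, 21.02, 18.75],
    "total_night_charge": [9.64, 9.98, 11.49, 9.66, 10.11],
    "total_intl_charge": [1.22, 2.75, 3.13, 3.24, 2.59],
    "customer_service_calls": [3, 2, 4, 1, 1],
    "churn": [1, 0, 0, 0, 0],
})

# Number of distinct values per column
unique_counts = churn_head.nunique()

# Columns with exactly two distinct values are binary-target candidates
binary_candidates = unique_counts[unique_counts == 2].index.tolist()

print(unique_counts)
print(binary_candidates)  # ['churn']
```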

### 1.2. The supervised learning workflow

Recall that scikit-learn offers a repeatable workflow for using supervised learning models to predict the target variable values when presented with new data.

Reorder the pseudo-code provided so it accurately represents the workflow of building a supervised learning model and making predictions.

- Drag the code blocks into the correct order to represent how a supervised learning workflow would be executed.

<img src="./assets/ch01_01_02_the_supervised_learning_workflow_img01.png">
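
To illustrate why the workflow is called repeatable, the sketch below runs the same instantiate, fit, predict sequence with two different classifiers. The data is made up, and `LogisticRegression` is simply a second model chosen for contrast; both are used with their default settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical, well-separated training data with two features
X = np.array([[1, 1], [1, 2], [2, 1], [2, 2],
              [8, 8], [8, 9], [9, 8], [9, 9]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# New observations to predict
X_new = np.array([[1.5, 1.5], [8.5, 8.5]])

results = {}
for Model in (KNeighborsClassifier, LogisticRegression):
    model = Model()        # instantiate
    model.fit(X, y)        # fit to features and target
    results[Model.__name__] = model.predict(X_new)  # predict

for name, preds in results.items():
    print(name, preds)
```

Only the imported class changes; the calls to `.fit()` and `.predict()` stay the same.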

## 2. The classification challenge