# Lab 4: Binary classification.

---

## Will it rain tomorrow in Australia?

In this lab, we will work with a dataset containing daily weather data from several weather stations in Australia. Based on the weather conditions on a given day, we will try to predict whether it will rain the next day.



In [22]:
import pandas as pd
df = pd.read_csv('data/weatherAUS.csv')
df.head()

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,...,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,2008-12-01,Albury,13.4,22.9,0.6,,,W,44.0,W,...,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,No,No
1,2008-12-02,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,...,44.0,25.0,1010.6,1007.8,,,17.2,24.3,No,No
2,2008-12-03,Albury,12.9,25.7,0.0,,,WSW,46.0,W,...,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,No,No
3,2008-12-04,Albury,9.2,28.0,0.0,,,NE,24.0,SE,...,45.0,16.0,1017.6,1012.8,,,18.1,26.5,No,No
4,2008-12-05,Albury,17.5,32.3,1.0,,,W,41.0,ENE,...,82.0,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,No,No


The dataset contains 23 columns. Some of them contain numerical data, and some of them are categorical. Please note the 'Date' column, which contains the date of the observation in yyyy-mm-dd format.

The target column is 'RainTomorrow', which indicates whether it rained on the day after the observation. The other columns are features that we can use to predict the target.

## Exercise 1: Data exploration and preprocessing (2 points)

- Extract the names of all columns in the dataset. Which columns are numerical, and which are categorical? Check the number of unique values in each categorical column and print the results.
- Check if there are any missing values in the dataset. Print the percentage of missing values for each column.
- Plot the distribution of the target column 'RainTomorrow' on a histogram. Is the dataset balanced? What is the problem with training a model on an imbalanced dataset?
- Check the size of the dataset before and after removing all missing values and print the results. Is it really a good idea to remove all missing values from the dataset?
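
The `pandas` calls that are useful for this kind of exploration can be sketched on a tiny, made-up dataframe. The `toy` frame and its values below are hypothetical stand-ins, not the real weather data, so you will still need to apply and interpret these steps on `df` yourself.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the weather data
toy = pd.DataFrame({
    "MinTemp": [13.4, np.nan, 12.9, 9.2],
    "WindGustDir": ["W", "WNW", None, "W"],
    "RainTomorrow": ["No", "No", "Yes", "No"],
})

# Split columns by dtype: numeric vs. everything else
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Number of unique (non-missing) values per categorical column
unique_counts = toy[categorical_cols].nunique()

# Percentage of missing values per column
missing_pct = toy.isna().mean() * 100

# Share of each class in the target column
class_share = toy["RainTomorrow"].value_counts(normalize=True)

# Number of rows before and after dropping every row with a missing value
rows_before, rows_after = len(toy), len(toy.dropna())

print(numerical_cols, categorical_cols)
print(unique_counts, missing_pct, class_share, sep="\n")
print(rows_before, rows_after)
```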

In [15]:
# Your code goes here
...

Let's encode the 'RainTomorrow' and 'RainToday' columns, which contain two categories 'Yes' and 'No'. We will use an approach called **label encoding**. The idea is to assign a unique integer $(0,1,2,3,...)$ to each category.

<center>
<img src="imgs/label-encoding.png" width=500>
</center>

For example, if the 'RainTomorrow' column contains 'Yes' and 'No', we can assign $0$ to 'No' and $1$ to 'Yes'. One simple way to do this is to create a dictionary with the mapping and use the `map` function from the `pandas` library.

In [20]:
mapping = {'No': 0, 'Yes': 1}

df['RainTomorrow'] = df['RainTomorrow'].map(mapping)
df['RainToday'] = df['RainToday'].map(mapping)

df.head() # check the results

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,...,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,2008-12-01,Albury,13.4,22.9,0.6,,,W,44.0,W,...,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,,
1,2008-12-02,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,...,44.0,25.0,1010.6,1007.8,,,17.2,24.3,,
2,2008-12-03,Albury,12.9,25.7,0.0,,,WSW,46.0,W,...,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,,
3,2008-12-04,Albury,9.2,28.0,0.0,,,NE,24.0,SE,...,45.0,16.0,1017.6,1012.8,,,18.1,26.5,,
4,2008-12-05,Albury,17.5,32.3,1.0,,,W,41.0,ENE,...,82.0,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,,


**This approach is simple and works well for some models, but it has a drawback**. If we encoded the 'Location' column with the following strategy:

```python
mapping = {"Albury": 0, 
           "Sydney": 1, 
           "Melbourne": 2,
           ...}

df['Location'] = df['Location'].map(mapping)
```

The model might deduce that 'Melbourne' is somehow more similar to 'Sydney' than to 'Albury', as 2 is closer to 1 than to 0. It may also expect that there is some order in the locations, which is, obviously, not the case. Label encoding may be especially harmful for linear models, as we risk introducing an artificial order in the data.

To avoid those problems, we can use **one-hot encoding** (OHE). In this approach, we create a new binary column for each category. 

<center>
<img src="imgs/ohe-encoding.png" width=500>
</center>

Each category gets its own column, so the model will not assume any order or distance between categories. The drawback is that the number of dimensions in the data increases significantly, which may lead to some problems related to [the curse of dimensionality](https://www.nature.com/articles/s41592-018-0019-x).

Although you may have an idea of how to implement one-hot encoding with some `pandas` and dataframe manipulations, OHE can be easily executed using the `get_dummies` function, as demonstrated below.

In [19]:
locations_ohe = pd.get_dummies(df['Location'])
locations_ohe.head() # check the results

Unnamed: 0,Adelaide,Albany,Albury,AliceSprings,BadgerysCreek,Ballarat,Bendigo,Brisbane,Cairns,Canberra,...,Townsville,Tuggeranong,Uluru,WaggaWagga,Walpole,Watsonia,Williamtown,Witchcliffe,Wollongong,Woomera
0,False,False,True,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,True,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,True,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,True,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,True,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


### *Implement one-hot encoding yourself, with the use of the `pandas` library.

If you are feeling confident with your `pandas` skills, you can try to implement one-hot encoding yourself.

The function `ohe_encode` should take a single argument, `column`, which is a pandas Series containing categorical data. The function should return a dataframe with one-hot encoded data, where the columns are named after the categories present in the input.

In [None]:
def ohe_encode(column):
    # Your code goes here
    ...

## Exercise 2: Encoding categorical features (1 point)

- Encode the 'Location', 'WindGustDir', 'WindDir9am', and 'WindDir3pm' columns using the **one-hot encoding** strategy. Join the results with the original dataframe. Remember to drop the original columns containing non-encoded data from the resulting dataframe.
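
One way to join one-hot encoded columns back to a dataframe is sketched below on a small, made-up frame (the `toy` data is hypothetical). Note the `prefix` argument: the wind direction columns share the same categories, so without a prefix the encoded columns would end up with duplicate names.

```python
import pandas as pd

# Hypothetical toy frame with two categorical columns and one numerical column
toy = pd.DataFrame({
    "Location": ["Albury", "Sydney", "Albury"],
    "WindGustDir": ["W", "NE", "W"],
    "MaxTemp": [22.9, 25.1, 25.7],
})

cols_to_encode = ["Location", "WindGustDir"]

# One prefix per encoded column keeps the new column names unique
dummies = pd.get_dummies(toy[cols_to_encode], prefix=cols_to_encode)

# Drop the original columns and join the encoded ones
encoded = pd.concat([toy.drop(columns=cols_to_encode), dummies], axis=1)

print(encoded.columns.tolist())
```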

In [21]:
# Your code goes here
...

## Exercise 3: Encoding the 'Date' column (1 point)

The 'Date' column contains information about the date of the observation.

- How many unique dates are present in the 'Date' column? Print the number of unique values.
- **What would be the problem with encoding the 'Date' column using either label encoding or OHE?** Think about how to encode the 'Date' column in a meaningful way. There are many possible strategies, so choose the one that seems the most reasonable to you and encode the 'Date' column accordingly. Join the results with the original dataframe and remember to drop the original 'Date' column.
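
As a starting point, here is a sketch of one possible strategy (not the only reasonable one, and not necessarily the best): keep the year as a number and encode the month cyclically with sine and cosine, so that December and January end up close to each other. The `toy` frame and the new column names `Year`, `Month_sin`, and `Month_cos` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with dates in yyyy-mm-dd format
toy = pd.DataFrame({"Date": ["2008-12-01", "2009-03-15", "2009-06-30"]})

dates = pd.to_datetime(toy["Date"], format="%Y-%m-%d")

# The year is kept as an ordinary numerical feature
toy["Year"] = dates.dt.year

# The month is mapped onto a circle, so month 12 is a neighbour of month 1
month = dates.dt.month
toy["Month_sin"] = np.sin(2 * np.pi * month / 12)
toy["Month_cos"] = np.cos(2 * np.pi * month / 12)

# The raw 'Date' column is no longer needed
toy = toy.drop(columns="Date")

print(toy)
```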

In [24]:
# Your code goes here
...


You are already familiar with the train-test split procedure. We will use it to split the dataset into training and test sets. We will also deal with missing values in the dataset.

As you have seen, dropping all the rows with missing values reduces the size of our dataset by over 60%. This is unacceptable, and we will circumvent this problem by using imputation.

Imputation is a process of replacing missing values with some estimated values. One of the simplest strategies is to replace missing values with the **mean** or the **median** of the column. You should be wary of using the mean, as it is sensitive to outliers. The median is more robust in this regard.

In the case of categorical data, you can replace missing values with the most frequent value in the column.

**Remember that you should calculate the mean or median on the training set and use the same values to impute missing values in the test set**. This is crucial: during the training phase you must not assume that you have access to the test set, and using statistics computed from it would be an obvious case of **data leakage**.
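
The idea of "fit on train, apply to test" can be sketched with plain `pandas` as follows. The two small frames are made up for illustration, and the variable names (`fill_values`, `X_train_imp`, `X_test_imp`) are assumptions of this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical training and test features with missing values
X_train = pd.DataFrame({
    "Humidity3pm": [22.0, np.nan, 30.0, 100.0],
    "WindGustDir": ["W", "W", None, "NE"],
})
X_test = pd.DataFrame({
    "Humidity3pm": [np.nan, 16.0],
    "WindGustDir": [None, "SE"],
})

num_cols = ["Humidity3pm"]
cat_cols = ["WindGustDir"]

# Statistics are computed on the training set only
medians = X_train[num_cols].median()
modes = X_train[cat_cols].mode().iloc[0]  # most frequent value per column

# One {column: fill value} mapping, reused for both sets
fill_values = {**medians.to_dict(), **modes.to_dict()}

X_train_imp = X_train.fillna(fill_values)
X_test_imp = X_test.fillna(fill_values)

print(fill_values)
print(X_test_imp)
```

The same pattern is available in `sklearn` as `SimpleImputer` (with `strategy="median"` or `strategy="most_frequent"`), where `fit` is called on the training set and `transform` on both sets.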

## Exercise 4: Train-test split and fixing missing values (2 points)

- Split the dataframe into $X$ (features) and $y$ (target). The target column is 'RainTomorrow', and the features are all other columns.
- Split the features $X$ and labels $y$ into training and testing sets. Use 20% of the data for testing.
- Calculate the **median** of each numerical column in the training set. Use the median to impute missing values in the training set. For categorical columns, use the most frequent value in the column for imputation.
- Apply the same imputation to the test set.

In [25]:
from sklearn.model_selection import train_test_split

# Your code goes here
...

## Measuring performance of a classifier

Before we go on to training your weather predictor, let's discuss how to measure the performance of a trained classifier.

In binary classification, we have four possible outcomes for each sample:
- **True positive (TP)**: The classifier correctly predicted the positive class.
- **True negative (TN)**: The classifier correctly predicted the negative class.
- **False positive (FP)**: The classifier incorrectly predicted the positive class.
- **False negative (FN)**: The classifier incorrectly predicted the negative class.

Those outcomes can be summarized in what is known as a **confusion matrix**.
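
A small sketch of how the four outcomes can be counted with `sklearn`. The labels below are made up, with 1 as the positive class.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions (1 = positive class)
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# For binary labels 0 and 1, scikit-learn lays the matrix out as
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```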

The two most common metrics for binary classification are **accuracy** and **ROC AUC**.

**Accuracy** is the ratio of correctly predicted observations to the total observations. It is a simple metric, but it can be misleading when the dataset is imbalanced. For example, if 95% of the samples belong to class 0, a classifier that always predicts class 0 will achieve 95% accuracy. 

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{\text{correct predictions}}{\text{all predictions}}$$
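
The 95% example above can be reproduced in a few lines. The labels are synthetic, and the "classifier" is simply an array of zeros.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Synthetic imbalanced labels: 95 negatives and 5 positives
y_true = np.array([0] * 95 + [1] * 5)

# A "classifier" that always predicts class 0
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)

# Number of positive samples that were actually detected
positives_found = int(((y_true == 1) & (y_pred == 1)).sum())

print(acc, positives_found)
```

The accuracy looks impressive, yet not a single positive sample was found.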
    
**ROC AUC** is a more robust metric for imbalanced datasets. You can read about the math behind it [in this cool online ML course by Google](https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc). The ROC AUC score is the probability that the classifier will rank a randomly chosen positive sample higher than a randomly chosen negative sample.
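
This ranking interpretation can be checked directly: compare every positive sample with every negative sample and count how often the positive one receives the higher score. The labels and scores below are made up, and ties are counted as half a win.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted scores (e.g. probabilities of class 1)
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Score difference for every (positive, negative) pair
diff = pos[:, None] - neg[None, :]

# Fraction of pairs in which the positive sample is ranked higher
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

sklearn_auc = roc_auc_score(y_true, scores)

print(pairwise_auc, sklearn_auc)
```

Note that `roc_auc_score` expects scores or probabilities rather than hard 0/1 predictions, so for most `sklearn` classifiers you would pass `model.predict_proba(X)[:, 1]`.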

Accuracy and ROC AUC metrics are implemented in the `sklearn` library, and to import them, you can use the following code:
    
```python
from sklearn.metrics import accuracy_score, roc_auc_score
```

## Training a model

If you have successfully completed the previous exercises, you should have a dataset without any missing values, split into training and test sets, with all the categorical columns encoded.

Now you can train a model that predicts rain based on the weather conditions. There are many classifier models implemented in the `sklearn` library that you can use. I suggest starting with the `LogisticRegression` model, and you can try other models later.
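
The general fit / predict / evaluate workflow is sketched below on synthetic data generated with `make_classification`, not on the weather dataset. The `max_iter=1000` setting and the stratified split are choices made for this sketch only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary classification problem
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# max_iter is raised here only to avoid convergence warnings
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)               # hard 0/1 predictions
y_score = model.predict_proba(X_test)[:, 1]  # probability of class 1

acc = accuracy_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_score)

print(f"accuracy: {acc:.3f}, ROC AUC: {auc:.3f}")
```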

## Exercise 5: Training a model (2 points)

- Train a `LogisticRegression` model on the training set. Use the default parameters.
- Calculate the accuracy of the model on the test set and print it.
- Calculate the ROC AUC score for the model on the test set and print it.
- Try at least one other model from the `sklearn` library. Compare the results of both models. Which model performs better?

    Among many classifier models implemented in `sklearn`, I suggest trying:
    - `SVC`
    - `RandomForestClassifier`
    - `KNeighborsClassifier`

In [None]:
from sklearn.metrics import accuracy_score, roc_auc_score

# Your code goes here
...

## Confusion matrix

Accuracy and ROC AUC are useful metrics, but they do not provide detailed information about the classifier's performance. Let's take a look at the confusion matrix and derive two more important metrics: **precision** and **recall**.



- **Precision** is a useful metric when the cost of producing a false positive is high. Imagine that your classifier model is trained to predict if a mushroom is edible (the positive class) or poisonous. The cost of making a mistake and classifying a poisonous mushroom as edible (**FP**) is very high, as we might end up needing a liver transplant. To achieve high precision, we accept missing some **TP** for the sake of avoiding **FP**.

- **Recall** is a useful metric when we do not care about false positives too much, and just want to catch as many positive samples as possible. If your classifier is trained for some medical screening task, we want to catch as many sick patients as possible (**TP**), even if it means that some healthy patients will be classified as sick (**FP**). The cost of a false positive is not high in this case, as the healthy patient will be correctly diagnosed later, in more rigorous tests.
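
In terms of the confusion matrix, precision is $TP / (TP + FP)$ and recall is $TP / (TP + FN)$. The sketch below computes both by hand on made-up labels and compares the results with `precision_score` and `recall_score` from `sklearn`.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical true labels and predictions (1 = positive class)
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]

# Count the outcomes that matter for precision and recall
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp)  # how many predicted positives are real positives
recall = tp / (tp + fn)     # how many real positives were found

print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
```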

## *Class competition!
Try to achieve the highest possible ROC AUC on the test set. You can experiment with different models, hyperparameters, and feature engineering techniques. Post your scores in an online leaderboard. The 3 students with the highest ROC AUC score will receive a cool sticker and will be encouraged to present their solutions during the next lab.