# Exercise Notebook

This notebook will guide you through using Python libraries to analyse a data set, visualize relevant information, and build a predictive model.

### General how-to:

- Your files are visible on the right in the **File Browser** tab. You can even open the data files.
- Change to the <img src="https://img.icons8.com/material-outlined/344/list.png" width=15 height=15 /> tab to see the **table of contents** for the notebook. From there you can jump between sections/exercises if you need to.
- Every exercise has an **answer cell** for you to write your answer in (`# your code here`). There is an ellipsis (**...**) wherever you need to complete a command.
- Every exercise has a cell with a correct solution. The **solution cell** is collapsed and not visible until you click on the ellipsis symbol <img src="https://img.icons8.com/ios-filled/344/ellipsis.png" width=20 height=15 />  below the answer cell. After checking your solution, you can collapse the cell again by clicking on the blue vertical line on the right.
- If you can't think of the solution immediately, you have **a few options**:
    - Trial and error (best option)
    - Google search (e.g. 'pandas replace nan values')
    - Copy from the intro notebook (easiest, but might fail sometimes)


In [None]:
# Importing the most important libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set_theme()

In [None]:
# Table formatting

from IPython.display import display, HTML
display(HTML("<style>.container { width:60% !important; margin: 0 auto; }</style>"))

<img src="https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png" width=2000/>

## 1. Features

The data you will work with consists of Portuguese students' profiles and their language course marks. [Source](https://www.kaggle.com/datasets/impapan/student-performance-data-set).  [License](https://creativecommons.org/licenses/by/4.0/)

There are 30 attributes, or **features**, available per student:

Feature ID| Feature Name |Description | Type | Values
--|-----|-----|----|---
1 | school | school name | binary | "GP" - Gabriel Pereira or "MS" - Mousinho da Silveira
2 | sex | student's sex | binary | "F" - female or "M" - male
3 | age | student's age | numeric| from 15 to 22
4 | address | home address type | binary | "U" - urban or "R" - rural
5 | famsize | family size | binary | "LE3" - less than or equal to 3 or "GT3" - greater than 3
6 | Pstatus | parents' cohabitation status | binary| "T" - living together or "A" - apart
7 | Medu | mother's education |numeric | 0 - none,  1 - 4th grade, 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education
8 | Fedu | father's education |numeric | 0 - none,  1 - 4th grade, 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education
9 | Mjob | mother's job area |categorical | "teacher", "health", "services" (e.g. administrative or police), "at_home" or "other"
10 | Fjob | father's job area |categorical | "teacher", "health", "services" (e.g. administrative or police), "at_home" or "other"
11 | reason | reason to choose this school | categorical | close to "home", school "reputation", "course" preference or "other"
12 |guardian | student's guardian | categorical | "mother", "father" or "other"
13 |traveltime | home to school travel time |numeric| 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour
14 |studytime | weekly study time | numeric| 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours
15 |failures | number of past class failures |numeric| n if 1<=n<3, else 4
16 |schoolsup | extra educational support |binary| yes or no
17 |famsup | family educational support |binary| yes or no
18 |paid | extra paid classes within the course subject (Portuguese) |binary| yes or no
19 |activities| extra-curricular activities |binary| yes or no
20 |nursery | attended nursery school|binary| yes or no
21 |higher | wants to take higher education |binary| yes or no
22 |internet | Internet access at home |binary| yes or no
23 |romantic | with a romantic relationship |binary| yes or no
24 |famrel | quality of family relationships |numeric| from 1 - very bad to 5 - excellent
25 |freetime | free time after school |numeric| from 1 - very low to 5 - very high
26 |goout | going out with friends |numeric| from 1 - very low to 5 - very high
27 |Dalc|  workday alcohol consumption |numeric| from 1 - very low to 5 - very high
28 |Walc|  weekend alcohol consumption |numeric| from 1 - very low to 5 - very high
29 |health | current health status |numeric| from 1 - very bad to 5 - very good
30 |absences | number of school absences |numeric| from 0 to 93



## 2. Target variables

Target variables are what we want to predict, using the features defined above. They are also called labels.

In our data set, we have one target variable given, which represents the students' final mark in Portuguese.

Feature ID| Feature Name |Description | Type | Values
--|-----|-----|----|---
31 | G3 | final grade | numeric | from 0 to 20

In practice, the choice of target variable is up to us. In this data set, for example, we could instead try to predict any one of the features, such as whether the student wants to go to university (feature `higher`).

<img src="https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png" width=2000/>

## 3. Data Loading and Basic Statistics

### 3.1. **Exercise:** Loading the data

Load the data from the given `csv` file and print it.

In [None]:
data_path = 'Data/student-por.csv'

# your code here
student_data = ...

In [None]:
student_data

____________________

### 3.2. **Exercise:** Data summary

Print a summary of the dataset's columns along with data types and memory usage.

In [None]:
# your code here


__________________

### 3.3. **Exercise:** Feature statistics

Print a (separate) statistical summary for the columns `reason` and `G3`. 

- Why are the outputs different? 
- Write some notes/speculate about what you see (if you want).

In [None]:
# your code here - reason


In [None]:
# your code here - G3


- Different count
- Mean grade is 11.9
- ...


-------------

### 3.4. **Exercise:** Feature distributions

Visualize the feature distributions for the feature `reason` and the target `G3`.
 - Use a [count plot](https://seaborn.pydata.org/generated/seaborn.countplot.html) for `reason` and a [histogram plot](https://seaborn.pydata.org/generated/seaborn.histplot.html) (optionally with KDE) for `G3`. 

In [None]:
# your code here - reason


Most of the students chose their school based on their course preference.

In [None]:
# your code here - G3


We can see that most students had average grades, with a few failing (grade 0).

Feel free to plot some other features - just insert some cells below with the 5th symbol on the upper right and have fun with it :)

<img src="https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png" width=2000/>

## 4. Data Exploration

In this section you will try out different visualizations to represent the data. 
The exercises are stated as questions to give you an idea of the flexibility of data analysis.

### 4.1. **Exercise:** Performance per school

**Do students from one school have higher grades (`G3`) on average than students of the other school?**

- Use a box plot or a violin plot to answer the question.
- Advanced: Separate students by gender as well, using the `hue=` parameter.

In [None]:
# your code here


It seems like students from the GP school have a higher grade average than students attending MS.

______________

### 4.2. **Exercise:** Relationship status

**Does being in a relationship affect a student's grades?**

- Use a box plot or a violin plot to answer the question.
- Advanced: Separate students by gender as well, using the `hue=` parameter.

In [None]:
# your code here


A student's relationship status seems to have little effect on their performance in school.

_______________________

### 4.3. **Exercise:** Failures

**How does the expected performance change depending on the number of previous failures?**

- Use a [point plot](https://seaborn.pydata.org/generated/seaborn.pointplot.html) to better see the general trend or a [violin plot](https://seaborn.pydata.org/generated/seaborn.violinplot.html) to compare distributions in detail.
- Advanced: Separate students by school as well, using the `hue=` parameter. Set `dodge=True` or `split=True` for better visibility.

In [None]:
# your code here


Students with past class failures have much lower average grades. However, it is also important to check how many students have failed one or more times:

In [None]:
sns.countplot(x='failures', data=student_data)

Because very few students have failed in the past, the confidence intervals in the point plot are very wide for those groups, which means the estimated average grade is much more uncertain there.
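To see why a small group gives a wide interval, here is a minimal sketch on synthetic grades (not the student data). It assumes the interval is a 95% bootstrap confidence interval for the mean, which is our reading of the point plot's default; the helper `bootstrap_ci_width` and both groups are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci_width(sample, n_boot=2000):
    """Width of a 95% bootstrap confidence interval for the mean."""
    n = len(sample)
    # Resample with replacement n_boot times and take each resample's mean
    idx = rng.integers(0, n, size=(n_boot, n))
    means = sample[idx].mean(axis=1)
    low, high = np.percentile(means, [2.5, 97.5])
    return high - low

# Synthetic grades: one large group and one small group
large_group = rng.normal(loc=12, scale=3, size=500)
small_group = rng.normal(loc=8, scale=3, size=15)

width_large = bootstrap_ci_width(large_group)
width_small = bootstrap_ci_width(small_group)
print(f"large group: {width_large:.2f}, small group: {width_small:.2f}")
```

The small group's interval comes out several times wider, even though both groups were drawn with the same spread.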

___________________

### 4.4. **Exercise:** Study time

**How does the expected performance (`G3`) change depending on (reported) study time?**

- Use a [cat plot](https://seaborn.pydata.org/generated/seaborn.catplot.html) to show point plots (`kind='point'`) and separate by school (`col='school'`) and sex (`hue='sex'`).
- Set `dodge=True` for better visibility.

In [None]:
# your code here


We observe the expected increase in performance with longer study times, but it's interesting to see that male students' grades dropped when studying more than 10 hours per week (category 4).

<img src="https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png" width=2000/>

## 5. Data Transformation

In this section we will transform some features. There is a single exercise with multiple parts.

### 5.1. **Exercise:** Numerical to binary

**Transform the numerical 'absences' attribute into a binary attribute 'was_absent' which is set to `yes` if the student was absent more than 7 days and `no` otherwise.**

First, let's **visualize** the `absences` attribute. Use whatever plot you want.

In [None]:
# your code here


We can see that most of the students had no absences or 1 absence.  

**Create** the data for the new `was_absent` column using `np.where(condition, value_if_true, value_if_false)`.
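If `np.where` is new to you, here is a small sketch on a made-up column (the names `hours` and `label` and the threshold are hypothetical and unrelated to the exercise):

```python
import numpy as np
import pandas as pd

# Hypothetical toy column, not taken from the student data
hours = pd.Series([1, 8, 3, 12, 7])

# Element-wise choice: 'long' where the condition holds, 'short' elsewhere
label = np.where(hours > 7, 'long', 'short')
print(label)
```

The result is a NumPy array with one string per row. Note that the comparison is strict, so a value of exactly 7 ends up as `'short'`.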

In [None]:
# your code here

was_absent = ...

In [None]:
was_absent   # should be an array of yes/no strings

Add the array as a **new column** `was_absent` in the data set.

In [None]:
# your code here


In [None]:
student_data.head()  # the last column should be was_absent

Finally, let's **visualize** the grade distribution for each value of the new feature `was_absent`. Use whatever plot you want.

In [None]:
# your code here


Great! Now the `absences` column is redundant, so let's **remove** it using `.drop()`. Remember to set the `axis` parameter correctly.

In [None]:
# your code here

student_data = ...

In [None]:
student_data.head()    # the absences column should be gone

<img src="https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png" width=2000/>

## 6. Data Cleaning

Handling missing data is an important part of data analysis and machine learning. Algorithms don't usually know how to deal with missing values, so this preparation step is essential.

### 6.1. **Exercise:** Handling missing values

**Show the number of missing values per column.**

- The `.isna()` function gives a boolean DataFrame, where each cell tells us whether that cell in the original DataFrame contains a NaN value. 
- Hint: You can chain functions! 
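As a minimal sketch of what chaining looks like, consider a made-up DataFrame (`toy` and its columns are hypothetical). Summing a boolean DataFrame counts the `True` values per column:

```python
import numpy as np
import pandas as pd

# Toy data with a few missing entries
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': ['x', None, None],
    'c': [1, 2, 3],
})

mask = toy.isna()                 # boolean DataFrame, True where a value is missing
missing_per_column = mask.sum()   # each True counts as 1
print(missing_per_column)
```

The two steps can also be written in a single chained expression.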

In [None]:
# your code here


We see that the missing values are in the columns `Mjob`, `Fjob`, `paid` and `romantic`. We can handle them by:

- Deleting the entries containing missing values -> Since our data set is pretty small (~650 students), deleting >188 entries is a bad idea.
- Deleting the columns containing missing values -> For the columns with a lot of missing values, this is optimal.
- Imputation (replacing the NaN value with e.g. the mean value) -> We have categorical features, so not really an option. Taking the most common category could introduce a lot of noise.
- Replacing the value with a dummy value, e.g. `unknown` -> Could work if the students with NaN values have a different grade distribution. (That's not the case here, trust me.)
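The four options above can be sketched on a tiny made-up DataFrame (all names and values here are hypothetical, chosen only to show the pandas calls):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'job': ['teacher', np.nan, 'teacher', np.nan, 'health'],
    'age': [15, 16, 17, 18, 19],
})

rows_dropped = toy.dropna(axis=0)                 # delete entries (rows) with NaN
cols_dropped = toy.dropna(axis=1)                 # delete columns with NaN
most_common = toy['job'].mode()[0]                # most frequent category
mode_filled = toy.fillna({'job': most_common})    # imputation with the mode
dummy_filled = toy.fillna({'job': 'unknown'})     # replace with a dummy value

print(rows_dropped.shape, cols_dropped.shape)
print(mode_filled['job'].tolist())
print(dummy_filled['job'].tolist())
```

Comparing the shapes shows the trade-off: dropping rows loses students, while dropping columns loses features.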

So, deleting it is. Drop the columns with `NaN` values. Make sure to set the correct `axis`.

In [None]:
# your code here

student_data = ...

Check again whether data is missing.

In [None]:
# your code here


If you only see zeros, all is good. 

<img src="https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png" width=2000/>

## 7. Data Encoding

We've already cleaned up the missing values, but now we face another problem: machine learning algorithms prefer numerical data to strings. We still have many features with string values, so we need to perform some type of **encoding** to map the strings to numbers. 

________________

### 7.1. **Exercise:** Categorical features

**Make a list of all categorical columns in the data.**
- You can use the `.select_dtypes()` [function](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.select_dtypes.html) with the parameter `include=`. 
- Columns with string values have the dtype 'object'.
- `.columns` can be used to get a list of column names
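Here is a small sketch of `.select_dtypes()` on made-up data. The toy string columns are created with an explicit `object` dtype, so the example does not depend on how a particular pandas version infers string columns; the names `toy` and `string_columns` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'school': pd.Series(['GP', 'MS'], dtype=object),
    'age': [15, 16],
    'sex': pd.Series(['F', 'M'], dtype=object),
})

# Keep only the columns whose dtype is 'object', then read off their names
string_columns = toy.select_dtypes(include='object').columns
print(list(string_columns))
```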

In [None]:
# your code here

categorical_columns = ...

In [None]:
categorical_columns        # should be an Index object with 14 column names

______________

### 7.2. **Exercise:** Indicator variables

Indicator variables are an easy way to encode your categorical features. For example, the binary variable `address` with the categories `U` and `R` can be encoded with two binary indicator variables `address_U` and `address_R`. 
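As an illustration of the `address` example, here is a minimal sketch on a three-row toy DataFrame (the data is made up; only the column name follows the text above):

```python
import pandas as pd

toy = pd.DataFrame({'address': ['U', 'R', 'U'], 'age': [15, 16, 17]})

# Encode only the listed column; other columns are passed through unchanged
encoded = pd.get_dummies(toy, columns=['address'])
print(encoded)
```

Each row has exactly one of the two indicators switched on. Depending on the pandas version, the indicators may be shown as `True`/`False` or as `1`/`0`.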

**Use the `pd.get_dummies()` [function](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html) to create indicator variables for all categorical features.**
- The `columns` parameter allows you to set a subset of the columns to get dummies from

In [None]:
# your code here

num_student_data = ...

Let's take a look below. Do you notice anything?

In [None]:
num_student_data.head()

The number of columns has increased because we have a new column for each category. 

However, **do we need all of them**? For features with two categories (e.g. `activities`), the two indicator variables are negations of each other, and as such contain the exact same information. This redundancy only takes up more memory, so let's try again.

Try `pd.get_dummies()` again, but this time set `drop_first=True`. This parameter drops one of the categories (the first alphabetically) to remove redundancy.
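To see what `drop_first=True` changes, here is a small sketch on made-up data (the values are hypothetical; only the column names are borrowed from the data set):

```python
import pandas as pd

toy = pd.DataFrame({
    'activities': ['yes', 'no', 'yes'],
    'reason': ['home', 'course', 'other'],
})

full = pd.get_dummies(toy)                       # one indicator per category
reduced = pd.get_dummies(toy, drop_first=True)   # first category of each feature dropped

print(list(full.columns))
print(list(reduced.columns))
```

The dropped category is still recoverable: a row where all remaining indicators of a feature are off must belong to the dropped one.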

In [None]:
# your code here

num_student_data = ...

In [None]:
num_student_data.head()

Back to 30 columns.

_____________________

### 7.3. **Exercise:** Correlation

The correlation coefficient is a value between -1 and 1. 
- A coefficient of 0 means that the two variables are not linearly correlated, that is, a straight-line relationship tells us nothing about one variable if we know the other.
- Coefficients >0 denote a positive correlation, meaning that an increase in one variable is connected to an increase in the other variable. A value of 1 means a perfect positive linear relationship (all points lie on an increasing straight line), not that the variables are equal.
- Coefficients <0 denote a negative correlation, meaning that an increase in one variable is connected to a decrease in the other variable. A value of -1 means a perfect negative linear relationship.
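A quick numerical sketch of the three cases, using made-up arrays and `np.corrcoef` (the Pearson correlation coefficient). It also shows two subtleties: a coefficient of 1 signals a perfect straight-line relationship rather than identical values, and a coefficient of 0 only rules out a *linear* relationship.

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

r_line = np.corrcoef(x, 2 * x + 3)[0, 1]   # increasing straight line, values differ from x
r_flip = np.corrcoef(x, -x)[0, 1]          # decreasing straight line
r_curve = np.corrcoef(x, x ** 2)[0, 1]     # fully determined by x, but not linear

print(r_line, r_flip, r_curve)
```

The last case is worth remembering when reading a correlation matrix: a value near 0 does not prove that two features are unrelated.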

Now that we have all-numerical data, we can perform a correlation check. This check will tell us whether a (linear) relationship exists between each pair of features, and can help us reduce the number of features (if two features are very similar, a model using one of them is usually just as expressive).

**Calculate the correlation of `num_student_data`.**

In [None]:
# your code here

correlation = ...

**Plot the correlation matrix using a [heatmap](https://seaborn.pydata.org/generated/seaborn.heatmap.html).** 

- The figure size is set for you.
- To see the correlation coefficient in each cell, set `annot=True`.
- To fix the color scale to the full range of the correlation coefficient, set `vmin=-1` and `vmax=1`.
- Optional, but improves the visual - round all values in the correlation matrix to two decimals with `.round(decimals=2)`

In [None]:
plt.figure(figsize=(25,20))

# your code here


Several things to notice:

- The diagonal is all ones, because every feature is perfectly correlated with itself.
- The weekday alcohol consumption (`Dalc`) and the weekend alcohol consumption (`Walc`) have a strong positive correlation (0.62). Frequency of going out (`goout`) is positively correlated with `Dalc`. So students that drink during the week go out more :D
- Wanting to go to university (`higher_yes`) and studying more (`studytime`) are positively correlated with the final grade (`G3`).
- The number of failures is negatively correlated with the final grade (`G3`).

and so on.
 

<img src="https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png" width=2000/>

## 8. Predictive Modelling

Now that our data is numerical, we can think about what we want to predict - the students' performance in the form of their final grade `G3`.

------


### 8.1. **Exercise:** Features and labels

**Define the features and labels from the `num_student_data` DataFrame.**

In [None]:
# your code here

labels = ...
features = ...

___________

### 8.2. **Exercise:** Train and test data

**Split the data set into train and test data, using the default parameters.** 

- Set the `random_state` to 42. 
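To get a feeling for what "default parameters" means here, consider this sketch on made-up arrays (100 samples, hypothetical names). With no `test_size` given, `train_test_split` holds out 25% of the samples for testing:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples with 2 features each, plus 100 labels
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=42)
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```

Fixing `random_state` makes the shuffle reproducible, so everyone in the workshop gets the same split.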

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# your code here

X_train, X_test, y_train, y_test = ...

Let's check the shapes of these new matrices.

In [None]:
print('X_train: ', X_train.shape)
print('X_test:  ', X_test.shape)
print('y_train:  ', y_train.shape)
print('y_test:  ', y_test.shape)

We now have our train and test data. The next step is choosing a model for the task.

`G3` is a continuous numerical variable with the range $[0, 20]$. In the Intro notebook, we predicted a binary **class** (good/bad risk). This time, we're solving a **regression** problem. 

**Classification** is the task of predicting a discrete class label. **Regression** is the task of predicting a continuous quantity.

______________________

### 8.3. **Exercise:** Standardization

Standardization works independently on each feature, by removing the mean and scaling to unit variance:

$$\text{new value} = \frac{\text{old value} - \text{mean}}{\text{standard deviation}} $$

This step is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).
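The formula can be checked by hand on a tiny made-up matrix (`toy` is hypothetical). The sketch compares a manual calculation with `StandardScaler`, assuming the population standard deviation (`ddof=0`), which is what NumPy's `.std()` computes by default:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy features on very different scales
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 60.0]])

# Per-column (axis=0) mean and standard deviation, applied by hand
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

scaled = StandardScaler().fit_transform(toy)
print(np.allclose(manual, scaled))
```

After scaling, each column has mean 0 and standard deviation 1, regardless of its original scale.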

**Use the `StandardScaler()` to standardize all features to mean 0 and standard deviation 1.** 


In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
# your code here

scaler = ...            # create a scaler instance and fit it to the training data

**Transform the train and test data with the learned scaler.**

In [None]:
# your code here

X_train_std = ...
X_test_std = ...

**Convert them back to a DataFrame, using `X_train.columns` as columns.**

In [None]:
# your code here

X_train_std = ...
X_test_std = ...

________________________

### 8.4. **Exercise:** Model fitting

**Fit a `Ridge` linear regressor to the standardized data `X_train_std`.** 

- Set the `random_state` to 42. 

In [None]:
from sklearn.linear_model import Ridge

In [None]:
# your code here

ridge = ...

**Fit a Support Vector Regression (`SVR`) model to the standardized data `X_train_std`.** 

- Set the regularization parameter `C` to 2. 

In [None]:
from sklearn.svm import SVR

In [None]:
# your code here

svr = ...

**Fit a Random Forest model (`RandomForestRegressor`) to the standardized data `X_train_std`.** 

- Set the `random_state` to 42. 

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
# your code here

rforest = ...

________________

### 8.5. **Exercise:** Evaluation

**Evaluate all three models on the train and test data.** 

One of the most common evaluation metrics for regression is the [coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination) $R^2$. It tells us how well the target variable can be predicted from the given features. The best possible $R^2$ is 1.0, or 100%. However, whether an $R^2$ score is 'good enough' is dependent on the problem. Since a model may be arbitrarily worse than perfect, negative values for the $R^2$ are also possible.
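To make the range of $R^2$ concrete, here is a small sketch with made-up labels and predictions. The helper `r2_manual` is hypothetical and implements the usual definition $R^2 = 1 - SS_{res} / SS_{tot}$; it is compared against scikit-learn's `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([10.0, 12.0, 14.0, 16.0])

def r2_manual(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)          # squared prediction errors
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # spread around the mean
    return 1 - ss_res / ss_tot

perfect = y_true.copy()                           # exact predictions
mean_only = np.full_like(y_true, y_true.mean())   # always predict the mean
reversed_pred = y_true[::-1].copy()               # systematically wrong

for name, pred in [('perfect', perfect), ('mean only', mean_only), ('reversed', reversed_pred)]:
    print(name, r2_manual(y_true, pred), r2_score(y_true, pred))
```

A model that always predicts the mean scores 0, so a negative $R^2$ means the model does worse than that simple baseline.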

- You are given the `score_model` function that returns a dictionary of results, and the `models` dictionary that saves the results of all three models.

In [None]:
def score_model(model):
    return {'train': model.score(X_train_std, y_train), 'test': model.score(X_test_std, y_test)}

In [None]:
models = {'ridge': score_model(ridge),
          'svr': score_model(svr),
          'rforest': score_model(rforest)}

**Convert `models` into a DataFrame and print it. What can you tell from the $R^2$ values?**

In [None]:
# your code here

results_df = ...

In [None]:
results_df

Your notes here:

- 

Example: 

- The train $R^2$ is higher than the test $R^2$ for all models. This is normal, since the models learned from the train data, while they saw the test data for the first time.
- SVR has the best performance on the test data, but its $R^2$ is still rather low.

______________

### 8.6. **Exercise:** Visualizing predictions

**Use the `SVR` model's `.predict()` function to obtain the actual grade predictions for the test set `X_test_std`.**

In [None]:
# your code here

svr_test_predictions = ...

**Create a DataFrame `predictions_df` with the true test labels `y_test` and the predictions as columns.**

In [None]:
# your code here

predictions_df = ...

In [None]:
plt.figure(figsize=(20,6))
sns.lineplot(data=predictions_df)    # for lines
sns.scatterplot(data=predictions_df) # for dots and crosses

The model is not the worst, but it tends to predict values close to the mean of the training data. As a result, it can't predict failures (grade 0) or high marks.

__________________________

## Conclusion

There are many ways to improve on these results, including [hyperparameter optimization](https://machinelearningmastery.com/hyperparameter-optimization-with-random-search-and-grid-search/), or trying out different regression models. 

At the end of the day, however, model selection is a question of finding a **good** model for a given task or data set, not necessarily a perfect one.

Thank you for participating in this workshop. We hope you had fun!


________________________

## Useful Links

Description | Link
---|---
`pandas` Cheat Sheet | https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
`pandas` Docs | https://pandas.pydata.org/docs/reference/index.html
ML algorithm Cheat Sheet | https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
ML tutorial | https://scikit-learn.org/stable/tutorial/

Author: Lyuba Dimitrova, 07/2022