<a href="https://colab.research.google.com/github/paola-md/APICall/blob/master/Homework01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Homework 1: Data Handling**
---

## Introduction

In this homework, you will apply different data exploration, cleaning, and visualization techniques. It is very important to take some time to understand the data.

The homework is due **Mar 09, 2021 23:59 CET**. The fully run notebook must be uploaded to your private GitHub repository. We will only grade the cells with the following heading:
```
#### GRADED CELL ####
```
There are 6 required exercises that will be graded and 1 optional exercise that will not be graded.

If you have any questions, feel free to use the Q&A forum in Moodle. 

**Required Exercises**

| Section | Part                       | Required Function        | Points |
|---------|:-                          |:-                        | :-:    |
| 1       | [Data Exploration](#section1) | get_feature_stats     |  20    |
| 2       | [Data Cleaning](#section2) | handle_missing_values    |  20    |
| 2       | [Data Cleaning](#section2) | handle_inconsistent_data |  20    |
| 2       | [Data Cleaning](#section2) | handle_skewness          |  20    |
| 3       | [Visualization](#section3) | plot_correlation         |  15    |
| 3       | [Visualization](#section3) | plot_scatter_pairs       |  5     |
|         | Total Points               |                          | 100    |

**Optional Exercises**

| Section | Part                                                 | Required  Function | Points |
|:-------:|:-                                                    |:-:                  | :-:    |
| 4       | [Modeling](#section4)                   | get_r2_score | 0      |



In [None]:
# Scientific and vector computation for python
import numpy as np
import pandas as pd 

# Plotting library
from matplotlib import pyplot as plt
%matplotlib inline

## **0 Loading data set** 
---
The data set consists of 116,658 observations and 10 columns:

* Student ID: uniquely identifies every student. No two students have the same ID.
* Gender
* School group
* Effort regulation (effort)
* Family stress level (stress)
* Help seeking behavior (feedback)
* Regularity patterns of the student throughout the course (regularity)
* Critical thinking skills (critical)
* Test duration in minutes (minutes)
* Exam grade (grade) 




In [None]:
df = pd.read_csv("./school_performance.csv")
df.head()

| Unnamed: 0 | student_id | gender | school_group | effort | stress | feedback | regularity | critical | minutes | grade |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20404.0 | male | 99 | 5.997184 | 8.098143 | 9.722538 | 99.0 | 1.621112 | 20.0 | 99.0 |
| 1 | 26683.0 | female | 99 | 6.017588 | 9.696074 | 99.0 | 99.0 | 99.0 | 30.0 | 3.61 |
| 2 | 32954.0 | 99 | 99 | 6.070632 | 7.803463 | 9.448975 | 7.369845 | 99.0 | 99.0 | 3.32 |
| 3 | 42595.0 | 99 | 99 | 5.996371 | 99.0 | 99.0 | 5.69758 | -0.051113 | 21.0 | 99.0 |
| 4 | 28603.0 | male | 99 | 99.0 | 6.780604 | 99.0 | 99.0 | 99.0 | 99.0 | 3.18 |


<a id="section1"></a>
## **1 Data Exploration** 
---

As mentioned in class, it is good practice to report the percentage of missing values per feature together with the features' descriptive statistics. In this exercise, the function takes a DataFrame as input and returns some descriptive statistics and the percentage of missing values.

For the numerical features, we are interested in knowing:
- the mean 
- the standard deviation
- the median
- and the percentage of missing values


In the case of the categorical features, we want to know:
- the number of unique values
- the most frequent value
- the frequency of the most frequent value
- and the percentage of missing values
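
The quantities listed above can be sketched on a small toy frame. This is only an illustration under stated assumptions: the frame `toy` and its columns `score` and `group` are invented, missing entries are assumed to be already encoded as NaN/None, and the layout of the result is not the required output format of the exercise.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; NaN/None mark missing entries here.
toy = pd.DataFrame({
    "score": [1.0, 2.0, np.nan, 4.0],
    "group": ["a", "b", "a", None],
})

# Percentage of missing values per column.
missing_pct = toy.isna().mean() * 100

# Numerical summary: mean, standard deviation and median (NaN is skipped).
numeric_summary = toy[["score"]].agg(["mean", "std", "median"])

# Categorical summary: number of unique values, most frequent value
# and how often it occurs.
counts = toy["group"].value_counts()
categorical_summary = {
    "unique": toy["group"].nunique(),
    "most_frequent": counts.idxmax(),
    "frequency": int(counts.max()),
}

print(missing_pct)
print(numeric_summary)
print(categorical_summary)
```

Note that `isna()` only detects values pandas already treats as missing, so a data set that encodes missing entries differently needs that encoding handled first.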

In [None]:
#### GRADED CELL ####
def get_feature_stats(df):
    """
    Obtains descriptive statistics for all features and percentage of missing 
    values
    
    Parameters
    ----------
    df : DataFrame
         Containing all data

    Returns
    -------
    stats : DataFrame
            With four rows only (mean, std, median and percentage of 
            missing values) containing the statistics for all features.
    """
    # ====================== YOUR CODE HERE ======================= 

    # =============================================================
    return stats

In [None]:
get_feature_stats(df)

<a id="section2"></a>
## **2 Data Cleaning** 
---

Carefully explore the data set and fill in the following functions:
- handle missing values
  - Are there missing values? If so, how are the missing values encoded?
  - Why are there missing values? Is there a pattern in the values missing?
- handle inconsistent data
  - Are there columns with inconsistent data types?
  - Can you transform all columns to a consistent type (categorical or numeric)?
- handle skewness
  - How are the features distributed?
  - What kind of transformations can you apply to make the features more Gaussian-like?
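
As a rough sketch of two of these checks, the snippet below works on invented toy data. The sentinel value `-1`, the frame `toy` and its columns are hypothetical and chosen only for illustration; the real data set has to be inspected to find its own encoding and its own skewed features.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration. The sentinel -1 is hypothetical;
# inspect the real data to find out how its missing values are encoded.
SENTINEL = -1
toy = pd.DataFrame({
    "hours": [2.0, SENTINEL, 3.5, 4.0],
    "income": [1.0, 10.0, 100.0, 1000.0],
})

# Turn the sentinel into NaN so that pandas recognises it as missing.
clean = toy.replace(SENTINEL, np.nan)
missing_pct = clean.isna().mean() * 100

# A long right tail gives positive skewness; a log transform compresses it.
skew_before = clean["income"].skew()
skew_after = np.log10(clean["income"]).skew()

print(missing_pct)
print(skew_before, skew_after)
```

A logarithm is only one possible transformation and requires strictly positive values; whether it suits a given feature is best judged from its histogram.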

In [None]:
#### GRADED CELL ####
def handle_missing_values(df):
  """
  Identifies and removes all missing values
  
  Parameters
  ----------
  df : DataFrame
      Containing missing values

  Returns
  -------
  df : DataFrame
      Without missing values

  Hint:
  -----
  Understand the pattern in the missing values    
  """
  # ====================== YOUR CODE HERE ======================= 

  # =============================================================

  return df


In [None]:
df = handle_missing_values(df)
df.head()

In [None]:
#### GRADED CELL ####
def handle_inconsistent_data(df):
  """
  Identifies features with inconsistent data types and transforms features
  to the correct type
  
  Parameters
  ----------
  df : DataFrame
      Containing inconsistent data

  Returns
  -------
  df : DataFrame
       With consistent data. All columns must be either numerical or categorical

  Hint:
  -----
  See unique values per feature   
  """
  # ====================== YOUR CODE HERE ======================= 

  # =============================================================
  return df

In [None]:
df = handle_inconsistent_data(df)
df.head()

In [None]:
df.hist(bins=30, figsize=(15, 10))

In [None]:
#### GRADED CELL ####
def handle_skewness(df):
  """
  Identifies features with a skewed distribution and transforms them to make
  the data more Gaussian-like
  
  Parameters
  ----------
  df : DataFrame
      Containing skewed features

  Returns
  -------
  df : DataFrame
       With more Gaussian-like features

  Hint:
  -----
  Visualize each feature individually   
  """
  # ====================== YOUR CODE HERE ======================= 

  # =============================================================
  return df

In [None]:
df = handle_skewness(df)

In [None]:
df.hist(bins=30, figsize=(15, 10))

<a id="section3"></a>
## **3 Visualization** 
---
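
For the correlation heatmap in this section, one possible way to keep only the upper triangle of a Pearson correlation matrix is sketched below. The frame `toy` and its columns are invented for illustration, and the snippet stops before any plotting, so it is a starting point rather than a solution to `plot_correlation`.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with x
    "z": [4.0, 3.0, 2.0, 1.0],  # perfectly anti-correlated with x
})

# Pearson correlation matrix (also the default method of DataFrame.corr).
corr = toy.corr(method="pearson")

# Boolean mask that is True strictly below the diagonal.
lower = np.tril(np.ones(corr.shape, dtype=bool), k=-1)

# Hide the lower triangle; only the diagonal and the upper triangle remain.
upper_only = corr.mask(lower)

print(upper_only)
```

A boolean array such as `lower` can also be passed to the `mask` parameter of `seaborn.heatmap`, which hides the cells where the mask is True. A diverging colormap such as `"coolwarm"` centered at zero is one way to get blue for negative and red for positive values.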


In [None]:
#### GRADED CELL ####
import seaborn as sns
def plot_correlation(df):
  """
  Builds upper triangular heatmap with Pearson correlation between variables

  Instructions
  ------------
  The plot must have:
  - An appropriate title
  - Only upper triangular elements
  - Annotated values of correlation coefficients rounded to three significant
    figures
  - Negative correlations in blue and positive correlations in red
  
  
  Parameters
  ----------
  df : DataFrame with data

  Returns
  -------
  heatmap : upper triangular showing correlations between features
  
  """
  # ====================== YOUR CODE HERE =======================

  # =============================================================


In [None]:
plot_correlation(df)

In [None]:
#### GRADED CELL ####
def plot_scatter_pairs(df, group):
  """
  Plot scatter plots for all possible combinations of numerical features with 
  different colors for the categorical feature group
  
  Parameters
  ----------
  df : DataFrame with data
  group: name of categorical feature 

  Returns
  -------
  plots : Scatter plots of numerical features with different colors depending
  on categorical feature
  """
  # ====================== YOUR CODE HERE ======================= 

  # =============================================================

In [None]:
sample = df.sample(n=1000)
plot_scatter_pairs(sample, 'gender')


In [None]:
plot_scatter_pairs(sample, 'school_group')


<a id="section4"></a>
## **4 Modeling (optional)**
---

We will cover modeling techniques in the following weeks, but if you already have some prior knowledge, what is the highest $R^2$ value you can get?

You can create as many new features as you like and transform the features in any way.
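
As a reminder of what $R^2$ measures, here is a minimal sketch that fits a least-squares line with NumPy and computes $R^2 = 1 - SS_{res} / SS_{tot}$ by hand. The data are synthetic and all variable names are invented for illustration; nothing here uses the homework data set.

```python
import numpy as np

# Synthetic data, invented for illustration: y is linear in x plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 3.0 * x + 1.0 + rng.normal(scale=0.5, size=200)

# Ordinary least squares with an explicit intercept column.
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ coef

# R^2 = 1 - (residual sum of squares) / (total sum of squares).
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(r2)
```

The `score` method of scikit-learn's `LinearRegression` returns this same coefficient of determination for the data passed to it.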





In [None]:
from sklearn.linear_model import LinearRegression
def get_r2_score(df):
  """
  Runs a linear regression and returns the coefficient of determination 
  R**2 of the prediction.
  
  Parameters
  ----------
  df : DataFrame with data

  Returns
  -------
  score :  coefficient of determination  R**2 of the prediction
  """
  # ====================== YOUR CODE HERE ======================= 

  # =============================================================
  return score

In [None]:
get_r2_score(df)