# STUDENTS PERFORMANCE IN EXAMS: 
### EXPLORING CORRELATION + PPSCORE PACKAGE REVIEW

This is a [Kaggle task inspired notebook](https://www.kaggle.com/spscientist/students-performance-in-exams/tasks?taskId=280).

The main objective is to find out whether a correlation exists between the attributes in the dataset, working with both continuous and categorical variables.

To add a little more mystery, the ppscore package is also tested in this notebook. The Predictive Power Score (PPS) describes itself as an alternative to correlation that is able to find more patterns in the data. I heard about this package in this [post](https://8080labs.com/blog/posts/rip-correlation-introducing-the-predictive-power-score-pps/).

In [None]:
# Import necessary libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import ppscore as ppscore
from sklearn.preprocessing import OneHotEncoder
%matplotlib inline

In [None]:
data=pd.read_csv('./input/StudentsPerformance.csv')
data.head()

In [None]:
data.info()

There are no null values in any variable, so for the moment no preprocessing is needed.

Fields in detail:

In [None]:
for feature in data.columns:
    uniq = np.unique(data[feature])
    print('{}: {} distinct values -  {}'.format(feature,len(uniq),uniq))

### INDEX

* 1. [Correlation with Original Data](#first-bullet)
* 2. [Working with categorical variables](#second-bullet)
    * 2.1. [Label Encoding](#label-encoding) 
    * 2.2. [One-Hot Encoding](#one-hot-encoding)
* 3. [Bonus: Testing PPSCORE package](#third-bullet)


## 1. Correlation with Original Data <a class="anchor" id="first-bullet"></a>

We will use the pandas method *DataFrame.corr()* to find the correlation between numeric variables only. 
By default it returns the Pearson correlation coefficient, a score ranging from -1 to 1 that indicates how strong the linear relationship is and whether its direction is positive or negative.
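
As a rough illustration of what that score measures, here is a minimal sketch of the Pearson coefficient (the default method of `DataFrame.corr()`) written in plain Python. The helper `pearson` and the toy scores are made up for this example and are not part of the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up scores for five hypothetical students (not taken from the dataset)
math_scores = [50, 60, 70, 80, 90]
reading_scores = [55, 65, 70, 85, 95]
falling_scores = [90, 80, 70, 60, 50]

r_rising = pearson(math_scores, reading_scores)   # close to +1
r_falling = pearson(math_scores, falling_scores)  # perfectly decreasing line
print(round(r_rising, 3), round(r_falling, 3))
```

Values near +1 or -1 mean the points lie close to a straight line; values near 0 mean there is no linear trend, although a non-linear pattern may still exist.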

In [None]:
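# Correlate numeric columns only: recent pandas versions (>= 2.0) raise an error
# on string columns otherwise (the numeric_only argument needs pandas >= 1.5)
corr = data.corr(numeric_only=True)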
print(corr)

In [None]:
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
plt.title('Correlation Analysis with Original Data')
# Draw the heatmap with a square aspect ratio
ca = sns.heatmap(corr, cmap='coolwarm',center=0, vmin = -1,
            square=True, linewidths=1, cbar_kws={"shrink": .8}, annot = True)

In [None]:
sns.set(style="ticks", color_codes=True)
g = sns.pairplot(data)

Looking at the scores and the graphs, we can say that the three scores are highly correlated: students who do well in one subject are more likely to do well in the others.

**Math, reading and writing scores have a strong positive linear relationship.**

## 2. Working with categorical variables <a class="anchor" id="second-bullet"></a>

We are going to explore two options here: label encoding and one-hot encoding.

### 2.1 Label Encoding <a class="anchor" id="label-encoding"></a>

This approach consists of converting each value in a column to a number: in the column *lunch*, 'standard' will be represented by a 1 and 'free/reduced' by a 0.
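
The mapping can be sketched in plain Python as follows. This mimics what `.cat.codes` does for a string column, assuming the categories are numbered in sorted order; `label_encode` is a hypothetical helper written for this illustration, not a pandas function.

```python
def label_encode(values):
    """Map each distinct value to an integer code, numbering them in sorted order."""
    categories = sorted(set(values))
    mapping = {category: code for code, category in enumerate(categories)}
    return [mapping[v] for v in values], mapping

# A few made-up rows of the 'lunch' column
lunch = ['standard', 'free/reduced', 'standard', 'standard', 'free/reduced']
codes, mapping = label_encode(lunch)
print(mapping)
print(codes)
```

Because 'free/reduced' sorts before 'standard', it receives code 0 and 'standard' receives code 1.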

In [None]:
data_label_encoding = data.copy()

In [None]:
# Another option using sklearn (shown here for the 'lunch' column):

# from sklearn.preprocessing import LabelEncoder
# Creating an instance of LabelEncoder
# labelencoder = LabelEncoder()
# Assigning numerical values and storing them in another column
# data_label_encoding['lunch_cat'] = labelencoder.fit_transform(data_label_encoding['lunch'])

In [None]:
# converting type of columns to 'category'
data_label_encoding['gender']= data_label_encoding['gender'].astype('category')
data_label_encoding['race/ethnicity']= data_label_encoding['race/ethnicity'].astype('category')
data_label_encoding['parental level of education']= data_label_encoding['parental level of education'].astype('category')
data_label_encoding['lunch']= data_label_encoding['lunch'].astype('category')
data_label_encoding['test preparation course']= data_label_encoding['test preparation course'].astype('category')

In [None]:
# Assigning numerical values and storing in another column
data_label_encoding['gender_cat']= data_label_encoding['gender'].cat.codes
data_label_encoding['race/ethnicity_cat']= data_label_encoding['race/ethnicity'].cat.codes
data_label_encoding['parental level of education_cat']= data_label_encoding['parental level of education'].cat.codes
data_label_encoding['lunch_cat']= data_label_encoding['lunch'].cat.codes
data_label_encoding['test preparation course_cat']= data_label_encoding['test preparation course'].cat.codes

In [None]:
data_label_encoding.info()

In [None]:
# Category columns are skipped; only the scores and the *_cat codes are correlated
corr_label_encoding = data_label_encoding.corr(numeric_only=True)

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
plt.title('Correlation Analysis with Label Encoding')
# Draw the heatmap with a square aspect ratio
ca = sns.heatmap(corr_label_encoding, cmap='coolwarm',center=0, vmin = -1,
            square=True, linewidths=1, cbar_kws={"shrink": .8}, annot = True)

Label encoding has one great disadvantage: algorithms may misinterpret the numeric values as having some kind of order. If the race/ethnicity groups A, B, C, D and E are assigned the values 0, 1, 2, 3 and 4 respectively, an algorithm may assume that group E is somehow hierarchically greater than group A.

https://stackoverflow.com/questions/47894387/how-to-correlate-an-ordinal-categorical-column-in-pandas
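
To make the ordering problem concrete, here is a small sketch with made-up groups and scores (not the real dataset). It computes the Pearson coefficient for two equally arbitrary numberings of the same nominal groups; the helper `pearson` and both mappings are assumptions for this illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical nominal groups and made-up scores
groups = ['A', 'B', 'C', 'A', 'B', 'C']
scores = [50, 70, 90, 52, 68, 88]

# Two equally valid ways of numbering groups that have no natural order
alphabetical = {'A': 0, 'B': 1, 'C': 2}
shuffled = {'A': 1, 'B': 0, 'C': 2}

r_alphabetical = pearson([alphabetical[g] for g in groups], scores)
r_shuffled = pearson([shuffled[g] for g in groups], scores)
print(round(r_alphabetical, 3), round(r_shuffled, 3))
```

The data are identical in both cases, yet the coefficient changes noticeably with the numbering. For nominal variables the correlation of label-encoded columns therefore says more about the chosen codes than about the data.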

### 2.2 One-Hot Encoding <a class="anchor" id="one-hot-encoding"></a>

This approach consists of turning each possible value of each categorical variable into its own feature with value 1 or 0.
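
A minimal plain-Python sketch of the idea, using a few made-up rows. The helper `one_hot` is hypothetical; its `<column>_<value>` naming is chosen to resemble the default column names produced by `pd.get_dummies`.

```python
def one_hot(values, prefix):
    """Return one 0/1 column per distinct value, keyed as '<prefix>_<value>'."""
    categories = sorted(set(values))
    return {
        f'{prefix}_{category}': [1 if v == category else 0 for v in values]
        for category in categories
    }

# A few made-up rows of the 'lunch' column
lunch = ['standard', 'free/reduced', 'standard']
encoded = one_hot(lunch, 'lunch')
for name, column in encoded.items():
    print(name, column)
```

Each row has exactly one 1 across the new columns, so no artificial order between categories is introduced.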

In [None]:
data_onehotencoding = data.copy()

In [None]:
data_onehotencoding = pd.get_dummies(data_onehotencoding, columns=['gender', 'race/ethnicity', 'parental level of education', 'lunch',
       'test preparation course'])

Another option is to use the OneHotEncoder class from sklearn. I prefer *pd.get_dummies()* because it encodes as many categorical columns as you want in a single call and keeps readable column names, while the sklearn encoder returns an array without column names. I consider *get_dummies()* more user-friendly and easier to understand.

In [None]:
corr_onehotencoding = data_onehotencoding.corr()

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(14, 14))
plt.title('Correlation Analysis with One-Hot Encoding')
# Draw the heatmap with a square aspect ratio
ca = sns.heatmap(corr_onehotencoding, cmap='coolwarm',center=0, vmin = -1,
            square=True, linewidths=1, cbar_kws={"shrink": .8}, annot = True)

Drawback: the resulting correlation matrix has many more columns and is harder to interpret.

https://www.kaggle.com/shakedzy/alone-in-the-woods-using-theil-s-u-for-survival 

https://towardsdatascience.com/the-search-for-categorical-correlation-a1cf7f1888c9

But what about a pair made of a continuous feature and a categorical feature? For this, we can use the Correlation Ratio (often denoted by the Greek letter eta). Mathematically, its square is the weighted variance of the category means divided by the variance of all samples. In plain language, the Correlation Ratio answers the following question: given a continuous value, how well can you tell which category it belongs to? The output lies in the range [0, 1].
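
A minimal sketch of the Correlation Ratio in plain Python, taking eta as the square root of the between-category variation divided by the total variation. The function name `correlation_ratio` and the toy data are assumptions for this illustration, not code from the linked articles.

```python
import math
from collections import defaultdict

def correlation_ratio(categories, values):
    """Correlation ratio (eta) between a categorical and a numeric sequence."""
    groups = defaultdict(list)
    for category, value in zip(categories, values):
        groups[category].append(value)
    overall_mean = sum(values) / len(values)
    # Between-category variation: squared distance of each category mean
    # from the overall mean, weighted by the category size
    between = sum(
        len(g) * (sum(g) / len(g) - overall_mean) ** 2 for g in groups.values()
    )
    # Total variation of all samples around the overall mean
    total = sum((v - overall_mean) ** 2 for v in values)
    return math.sqrt(between / total) if total > 0 else 0.0

# Made-up examples (not taken from the dataset)
eta_separated = correlation_ratio(['a', 'a', 'b', 'b'], [1, 1, 2, 2])
eta_mixed = correlation_ratio(['a', 'b', 'a', 'b'], [1, 1, 2, 2])
eta_partial = correlation_ratio(
    ['standard', 'standard', 'free/reduced', 'free/reduced'], [80, 90, 60, 70]
)
print(eta_separated, eta_mixed, round(eta_partial, 3))
```

When the category fully determines the value, eta is 1; when the category means are all equal, eta is 0; partial overlap gives something in between.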

### 2.3 Conclusions <a class="anchor" id="conclusions-categorical"></a>

In [None]:
# Create an instance of the PairGrid class.
grid = sns.PairGrid(data= data_label_encoding)

# Map a scatter plot to the upper triangle
grid = grid.map_upper(plt.scatter)

# Map a histogram to the diagonal
grid = grid.map_diag(plt.hist, bins = 10, 
                     edgecolor = 'k')
# Map a density plot to the lower triangle
grid = grid.map_lower(sns.kdeplot)

grid.fig.set_size_inches(12,12)

By looking at 

### Gender <a class="anchor" id="gender-influence"></a>

In [None]:
dt_tmp = data_label_encoding[['math score', 'reading score', 'writing score', 'gender']]
dt_tmp = dt_tmp.melt(id_vars = ['gender'])

In [None]:
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 7))
plt.title('Gender influence in math, reading and writing scores')
violin_gender = sns.violinplot(x="variable", y="value", hue="gender",
                     data=dt_tmp, palette="coolwarm", split=True,
                     scale="count", inner="quartile", bw=.1)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

### Test Preparation <a class="anchor" id="test-preparation-influence"></a>

In [None]:
dt_tmp = data_label_encoding[['math score', 'reading score', 'writing score', 'test preparation course']]
dt_tmp = dt_tmp.melt(id_vars = ['test preparation course'])

In [None]:
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 7))
plt.title('Test preparation course influence in math, reading and writing scores')
violin_test_prep = sns.violinplot(x="variable", y="value", hue="test preparation course",
                     data=dt_tmp, palette="coolwarm", split=True,
                     scale="count", inner="quartile", bw=.1)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

### Race/Ethnicity <a class="anchor" id="race-ethnicity-influence"></a>

In [None]:
dt_tmp = data_label_encoding[['math score', 'reading score', 'writing score', 'race/ethnicity']]
dt_tmp = dt_tmp.melt(id_vars = ['race/ethnicity'])

In [None]:
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 7))
plt.title('Race / Ethnicity influence in math, reading and writing scores')
violin_race = sns.violinplot(x="variable", y="value", hue="race/ethnicity",
                     data=dt_tmp, palette="coolwarm", 
                     scale="count", inner="quartile", bw=.1)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

### Lunch <a class="anchor" id="lunch-influence"></a>

In [None]:
dt_tmp = data_label_encoding[['math score', 'reading score', 'writing score', 'lunch']]
dt_tmp = dt_tmp.melt(id_vars = ['lunch'])

In [None]:
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 7))
plt.title('Lunch influence in math, reading and writing scores')
violin_lunch = sns.violinplot(x="variable", y="value", hue="lunch",
                     data=dt_tmp, palette="coolwarm", split=True,
                     scale="count", inner="quartile", bw=.1)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

### Parental Education <a class="anchor" id="parental-education-influence"></a>

In [None]:
dt_tmp = data_label_encoding[['math score', 'reading score', 'writing score', 'parental level of education']]
dt_tmp = dt_tmp.melt(id_vars = ['parental level of education'])

In [None]:
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 7))
plt.title('Parental level of education influence in math, reading and writing scores')
violin_parental = sns.violinplot(x="variable", y="value", hue="parental level of education",
                     data=dt_tmp, palette="coolwarm", 
                     scale="count", inner="quartile", bw=.1)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

## 3. Bonus: Testing PPSCORE package <a class="anchor" id="third-bullet"></a>

In [None]:
# Reorder columns so the scores are in the same order as in section 2 and it's easier to compare
data = data[['math score', 'reading score', 'writing score', 'gender', 'race/ethnicity', 'parental level of education', 'lunch',
       'test preparation course']]

In [None]:
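# Note: newer ppscore releases (1.x) may return a long-format DataFrame here; in that case
# pivot it first, e.g. ppmatrix.pivot(columns='x', index='y', values='ppscore'), before plotting
ppmatrix = ppscore.matrix(data)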

In [None]:
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
plt.title('Relationship Analysis with PPSCORE')

ra_ppscore = sns.heatmap(ppmatrix, vmin=0, vmax=1, cmap="coolwarm", linewidths=1, annot=True, 
            square = True, cbar_kws={"shrink": .8})

In [None]:
# Gender, lunch and Test preparation

## 4. Conclusions

A fun "correlation does not imply causation" example: http://web.stanford.edu/class/hrp259/2007/regression/storke.pdf