<a href="https://colab.research.google.com/github/luferIPCA/LESI-POO-2024-2025/blob/main/4_Data_Correlation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Master's in Applied Artificial Intelligence
## Machine Learning Algorithms Course

Notebooks for MLA course

by [*lufer*](mailto:lufer@ipca.pt)

---



# ML Modelling - Part II

**Contents**:

1.  **Data Correlation**



## Environment preparation


### Importing necessary Libraries

In [None]:
#!pip install pandas-profiling

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

#from pandas_profiling import ProfileReport   #see https://www.kaggle.com/discussions/general/233785
from scipy import stats

Mounting Drive

In [None]:

from google.colab import drive

# it will ask for your Google Drive credentials
drive.mount('/content/gDrive/', force_remount=True)

In [None]:
import os
print(os.getcwd())

## 1 - Correlations in Dataset

Essential data correlations using Pandas *corr()*

### **Example 1:** Heart Disease

*Loading dataset*

In [None]:

import os
#print(os.getcwd())

filePath='/content/gDrive/MyDrive/Colab Notebooks/MIA - ML - 2024-2025/Datasets/'
ds = pd.read_csv(filePath+"heart-disease.csv")
pd.set_option("display.precision", 2)

In [None]:
ds.head()
#len(ds)

In [None]:
#Example of correlation distribution
ax1 = ds.head().plot.scatter(x='thalach',
                      y='chol',
                      c='DarkBlue')

***Correlation distribution***

In [None]:
# corr() computes the pairwise correlations between variables
print(ds.corr())
# Answer
# Strong negative correlation example: between thal and thalach

***Correlation Matrix with Seaborn***

In [None]:
plt.figure(figsize=(20, 10))
sns.heatmap(ds.corr(),  annot=True)

***Dataframe Correlation using Pearson r***

In [None]:
# To find the correlation among all columns using pearson method
ds.corr(method='pearson')

***Dataframe Correlation using Kendall τ (tau)***

In [None]:
# To find the correlation among all columns using kendall method
ds.corr(method='kendall')
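
A minimal sketch of how the `method` values accepted by `corr()` can differ, using a small made-up dataset (the `toy` DataFrame and its columns are illustrative assumptions, not part of the heart-disease data). Because `y` grows monotonically but not linearly with `x`, the rank-based methods report a perfect association while Pearson does not.

```python
import pandas as pd

# Hypothetical toy data: y = x**3 is monotonic in x but not linear
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6]})
toy["y"] = toy["x"] ** 3

pearson_r = toy["x"].corr(toy["y"], method="pearson")
kendall_tau = toy["x"].corr(toy["y"], method="kendall")
spearman_rho = toy["x"].corr(toy["y"], method="spearman")

print(f"Pearson r:    {pearson_r:.3f}")    # below 1: the relationship is not linear
print(f"Kendall tau:  {kendall_tau:.3f}")  # 1: every pair of points is concordant
print(f"Spearman rho: {spearman_rho:.3f}") # 1: the rank orders match exactly
```

Pearson measures linear association, whereas Kendall and Spearman only look at the ordering of the values.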

### **Example 2:** Students and Classes

*Loading dataset*

In [None]:
# students and Classes
df = {
    "Faltas": [8, 2, 5, 12, 15, 9, 6],
    "Nota": [78,92,90,58,43,74,81]
}

data = pd.DataFrame(df)

ax1 = data.plot.scatter(x='Faltas',
                      y='Nota',
                      c='DarkBlue')

***Correlation distribution***

In [None]:
# Calculate the Pearson correlation
correlation = data["Faltas"].corr(data["Nota"])
print(f"Pearson Correlation: {correlation}")
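
To make explicit what `corr()` computes by default, the sketch below evaluates Pearson's r from its definition, $r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$, and compares it with the pandas result. The variable names (`faltas`, `nota`, `r_manual`, `r_pandas`) are illustrative.

```python
import numpy as np
import pandas as pd

faltas = np.array([8, 2, 5, 12, 15, 9, 6], dtype=float)
nota = np.array([78, 92, 90, 58, 43, 74, 81], dtype=float)

# Deviations of each observation from the mean of its variable
dx = faltas - faltas.mean()
dy = nota - nota.mean()

# Pearson's r: joint variation divided by the product of the individual spreads
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
r_pandas = pd.Series(faltas).corr(pd.Series(nota))

print(f"manual: {r_manual:.4f}  pandas: {r_pandas:.4f}")
```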

In [None]:
# corr() computes the correlations between variables using the Pearson method by default
print(data.corr())
# Answer
# Strong negative correlation

In [None]:
# Calculate the Kendall correlation
kendall_correlation = data["Faltas"].corr(data["Nota"], method='kendall')
print(f"Kendall Correlation: {kendall_correlation}")

In [None]:
# Calculate the Spearman correlation
spearman_correlation = data["Faltas"].corr(data["Nota"], method='spearman')
print(f"Spearman Correlation: {spearman_correlation}")
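
Spearman's coefficient can be read as Pearson's r applied to the ranks of the values. The sketch below checks this on the same data (`data_demo`, `ranks` and `rho_from_ranks` are names chosen for illustration). Here the student with the most absences has the lowest grade, the second most has the second lowest, and so on, so the two rank orders are exactly reversed.

```python
import pandas as pd

data_demo = pd.DataFrame({
    "Faltas": [8, 2, 5, 12, 15, 9, 6],
    "Nota": [78, 92, 90, 58, 43, 74, 81],
})

# Replace each value by its rank within its column (1 = smallest)
ranks = data_demo.rank()

# Pearson correlation computed on the ranks
rho_from_ranks = ranks["Faltas"].corr(ranks["Nota"])

# Spearman correlation computed directly by pandas
rho_pandas = data_demo["Faltas"].corr(data_demo["Nota"], method="spearman")

print(f"Pearson on ranks: {rho_from_ranks:.3f}")
print(f"Spearman:         {rho_pandas:.3f}")
```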

***Visualizing correlations with Seaborn***

In [None]:
plt.figure(figsize=(10, 5))
sns.heatmap(data.corr().abs(),  annot=True)

### **Example 3:** Games and Points

*Loading dataset*

In [None]:
# Games and points
df = {
    "Jogos": [1, 2, 3, 4, 5, 6],
    "Pontos": [42,131,219,308,396,485]
}

data = pd.DataFrame(df)

ax1 = data.plot.scatter(x='Jogos',
                      y='Pontos',
                      c='DarkBlue')

***Using Seaborn Heatmap***


In [None]:
plt.figure(figsize=(10, 5))
sns.heatmap(data.corr(),  annot=True)

***Pearson's r Using Correlation Matrix***

In [None]:
# corr() computes the pairwise correlations between variables
correlation_pearson = data.corr()
correlation_pearson
# Answer
# Strong positive correlation

***Spearman ρ (rho) Correlation***

In [None]:
# Compute the Spearman correlation
correlation_spearman = data.corr(method='spearman')
correlation_spearman
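
The notebook imports `scipy.stats` in the environment preparation. As a sketch of how it could complement `corr()`, `stats.pearsonr` and `stats.spearmanr` return the coefficient together with a p-value for the null hypothesis of no correlation. The variable names below are illustrative, and with only six points the p-values should be read with caution.

```python
from scipy import stats

jogos = [1, 2, 3, 4, 5, 6]
pontos = [42, 131, 219, 308, 396, 485]

# Each call returns the coefficient and the p-value
r, p_value = stats.pearsonr(jogos, pontos)
rho, p_rho = stats.spearmanr(jogos, pontos)

print(f"Pearson r = {r:.5f} (p = {p_value:.2e})")
print(f"Spearman rho = {rho:.5f} (p = {p_rho:.2e})")
```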


### **Example 4:** Boston Housing

See [Boston Housing Dataset](https://www.kaggle.com/code/prasadperera/the-boston-housing-dataset)



In [None]:

ds2 = pd.read_csv(filePath+"BostonHousing.csv")
pd.set_option("display.precision", 2) #controls the decimal output

In [None]:
ds2.head()

In [None]:
# how many numerical features?
# see all existing features
ds2.select_dtypes(include=np.number).columns


In [None]:
ds2.dtypes
# Answer: how many non-numerical features?
# None! All features are numerical

In [None]:
#other perspective
#select only the numeric columns in the DataFrame
ds2.select_dtypes(include=np.number)

In [None]:
# Check feature types
# Check whether each value is a real number
# (in pandas >= 2.1, applymap is deprecated in favor of DataFrame.map)
result = ds2.applymap(np.isreal)

# Display result
print("Result:\n", result)
# True means numerical
# False means non-numerical
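
When a dataset does contain non-numerical columns, they can be listed and left out before computing correlations. A minimal sketch on a made-up frame (`mixed` and its columns are hypothetical, not the Boston data), assuming a pandas version where `corr()` accepts `numeric_only` (1.5 or later):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one text column
mixed = pd.DataFrame({
    "rooms": [4, 6, 5, 7],
    "price": [21.0, 33.5, 27.2, 40.1],
    "city": ["A", "B", "A", "C"],
})

# Columns that are NOT numeric
non_numeric = mixed.select_dtypes(exclude=np.number).columns.tolist()
print(non_numeric)  # ['city']

# Correlation restricted to the numeric columns
numeric_corr = mixed.corr(numeric_only=True)
print(numeric_corr)
```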

Correlation Matrix

In [None]:
#get correlations
plt.figure(figsize=(20, 10))
sns.heatmap(ds2.corr(),  annot=True)



*   The “tax” and “rad” columns are highly correlated, with a value of 0.92 (positive correlation).
*   The columns LSTAT, INDUS, RM, TAX, NOX and PTRATIO have a correlation score above 0.5 (in absolute value) with MEDV, which suggests they are good candidate predictors.
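
Reading candidate predictors off a heatmap can also be done programmatically by ranking features by the absolute value of their correlation with the target. The sketch below uses synthetic data (the columns `x1`, `x2`, `x3` and `target` are invented for illustration and are not the Boston features); the same pattern would apply to `ds2` with `medv` as the target, assuming that is the column name in the CSV.

```python
import numpy as np
import pandas as pd

# Synthetic data: target depends strongly on x1, weakly on x2, not at all on x3
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
synthetic = pd.DataFrame({
    "x1": x1,
    "x2": x2,
    "x3": x3,
    "target": 3.0 * x1 - 0.5 * x2 + rng.normal(scale=0.5, size=n),
})

# Correlation of every feature with the target, strongest first (sign ignored)
ranked = (
    synthetic.corr()["target"]
    .drop("target")
    .abs()
    .sort_values(ascending=False)
)
print(ranked)
```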

Exercise: Which features are the most correlated?

In [None]:
#?

#Got it?
#continue reading the notebook!

### **Filtering out self-correlations**

Self-correlations are always 1, so they can be ignored!

In [None]:
import pandas as pd

# Extend the DataFrame with new features
data = {
    "Faltas": [8, 2, 5, 12, 15, 9, 6],
    "Nota": [78, 92, 90, 58, 43, 74, 81],
    "Estudo": [10, 25, 20, 5, 3, 15, 18],
    "Participação": [80, 95, 90, 50, 40, 75, 85]
}

df = pd.DataFrame(data)
print(df)

Compute the Correlation Matrix

In [None]:
# Calculate the correlation matrix
correlation_matrix = df.corr()

print("Correlation Matrix:")
correlation_matrix

In [None]:
#Filtering Out Self-Correlations
# Unstack the correlation matrix
corr_pairs = correlation_matrix.unstack()

In [None]:
corr_pairs

In [None]:


# Filter out self-correlations (where feature pairs are the same)
filtered_corr_pairs = corr_pairs[corr_pairs.index.get_level_values(0) != corr_pairs.index.get_level_values(1)]

# Sort the remaining pairs in ascending order of correlation (most negative first)
filtered_corr_pairs = filtered_corr_pairs.sort_values(kind="quicksort", ascending=True)

print(filtered_corr_pairs)
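
After removing self-correlations, every pair still appears twice, as (A, B) and (B, A), because the matrix is symmetric. One possible way to keep each pair once is to mask everything except the upper triangle before flattening. This is a sketch, not the only approach; `df_demo`, `upper_mask` and `unique_pairs` are illustrative names.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "Faltas": [8, 2, 5, 12, 15, 9, 6],
    "Nota": [78, 92, 90, 58, 43, 74, 81],
    "Estudo": [10, 25, 20, 5, 3, 15, 18],
    "Participação": [80, 95, 90, 50, 40, 75, 85],
})
corr = df_demo.corr()

# True strictly above the diagonal: drops self-correlations and mirrored pairs
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Masked-out cells become NaN; flatten to a Series and discard them
unique_pairs = corr.where(upper_mask).stack().dropna()

# Strongest relationships first, regardless of sign
unique_pairs = unique_pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(unique_pairs)
```

With 4 features this leaves 4 × 3 / 2 = 6 pairs instead of the 12 obtained by filtering the diagonal alone.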

In [None]:
# What does this instruction do?
corr_pairs.index.duplicated()

# It flags repeated index entries; here there are none

In [None]:
# Explanation: what does this expression mean?
#corr_pairs[corr_pairs.index.get_level_values(0) != corr_pairs.index.get_level_values(1)]

#analyse separately
#corr_pairs
#corr_pairs.index
#print(type(corr_pairs.index))
#corr_pairs.index.get_level_values(0)

# get_level_values(0) and get_level_values(1):
# These retrieve the first and second levels of the MultiIndex of the Series created by unstack(), i.e. the two feature names of each pair.
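
The filter can be traced on a two-column example small enough to read in full. The frame `small` below is made up for illustration.

```python
import pandas as pd

# Hypothetical two-column frame
small = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 1, 4, 3]})
pairs = small.corr().unstack()

first = pairs.index.get_level_values(0)   # first feature of each pair
second = pairs.index.get_level_values(1)  # second feature of each pair
print(list(first))   # ['a', 'a', 'b', 'b']
print(list(second))  # ['a', 'b', 'a', 'b']

# Element-wise comparison gives a boolean mask that is False on the diagonal
off_diagonal = pairs[first != second]
print(list(off_diagonal.index))  # [('a', 'b'), ('b', 'a')]
```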


End!

## 2 - Selecting Features

Which features can be "excluded" from the training process?
Which features are irrelevant?

In [None]:
import pandas as pd

# Extend the DataFrame with new features
data = {
    "Faltas": [8, 2, 5, 12, 15, 9, 6],
    "Nota": [78, 92, 90, 58, 43, 74, 81],
    "Estudo": [10, 25, 20, 5, 3, 15, 18],
    "Participação": [80, 95, 90, 50, 40, 75, 85]
}

df = pd.DataFrame(data)
print(df)

Compute the Correlation Matrix

In [None]:
# Calculate the correlation matrix
correlation_matrix = df.corr()

print("Correlation Matrix:")
correlation_matrix

Select Feature to Exclude

* Take *Nota* as the target variable.

* Look for features with weak correlations with the target variable or high correlations with other features (redundancy).

* Visualize the correlations using a heatmap (with libraries like seaborn - optional).

### Look for weak correlations

In [None]:
# Correlation between Nota and the other features
correlation_matrix['Nota']
# Interpret it!

# Extremely strong positive correlation between Nota and Estudo. What does that mean?
# Extremely strong positive correlation between Nota and Participação. What does that mean?
# Extremely strong negative correlation between Nota and Faltas. What does that mean?

In [None]:
correlation_matrix["Nota"].idxmin()

In [None]:
# Find the feature with the weakest correlation with 'Nota':
# the weakest correlation is the one closest to 0, so compare absolute values
correlations_with_nota = correlation_matrix["Nota"].drop("Nota")
feature_to_exclude = correlations_with_nota.abs().idxmin()

print(f"The feature to exclude is: {feature_to_exclude}")
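
A point worth checking when looking for weak correlations: `idxmin()` on signed values returns the most negative correlation, which may be a very strong one. The weakest relationship is the one whose absolute value is smallest. The sketch below contrasts the two on the same data (`df_demo`, `most_negative` and `weakest` are illustrative names).

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Faltas": [8, 2, 5, 12, 15, 9, 6],
    "Nota": [78, 92, 90, 58, 43, 74, 81],
    "Estudo": [10, 25, 20, 5, 3, 15, 18],
    "Participação": [80, 95, 90, 50, 40, 75, 85],
})
with_nota = df_demo.corr()["Nota"].drop("Nota")
print(with_nota)

# Most negative correlation: strong, just with a minus sign
most_negative = with_nota.idxmin()

# Weakest correlation: the value closest to zero
weakest = with_nota.abs().idxmin()

print(f"Most negative: {most_negative}")
print(f"Weakest:       {weakest}")
```

On this data the two answers differ: *Faltas* has the most negative correlation with *Nota*, while *Estudo* has the smallest absolute one.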

### Visualizing

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(correlation_matrix, annot=True)
# What does the following variant do?
#sns.heatmap(correlation_matrix, annot=True,cmap='coolwarm')
plt.title("Correlation Matrix")
plt.show()

**Exclude the selected feature**

In [None]:
# Drop the feature from the DataFrame
df_reduced = df.drop(columns=[feature_to_exclude])
#or
#df_reduced = df.drop(feature_to_exclude,axis=1)
print(df_reduced)

### **Another approach**

In [None]:

# Plot the heatmap of the correlation matrix
sns.heatmap(correlation_matrix, annot=True,cmap='coolwarm')
plt.show()

#corr_pairs = correlation_matrix.abs().unstack().sort_values(kind="quicksort", ascending=False)
# unstack(): turns the 2D matrix into a 1D Series indexed by pairs of features
corr_pairs = correlation_matrix.unstack().sort_values(kind="quicksort", ascending=True)  # or kind="mergesort" / "heapsort"

# Identify highly positively correlated pairs (example: correlation above 0.8);
# compare abs(corr_pairs[(a, b)]) instead to also catch strong negative correlations
threshold = 0.8
high_corr = [(a, b) for a, b in corr_pairs.index if a != b and corr_pairs[(a, b)] > threshold]
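
A signed comparison against the threshold keeps only strong positive correlations. The sketch below, on the same data, shows how many ordered pairs are selected with and without taking the absolute value (`pairs_demo`, `high_positive` and `high_absolute` are illustrative names).

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Faltas": [8, 2, 5, 12, 15, 9, 6],
    "Nota": [78, 92, 90, 58, 43, 74, 81],
    "Estudo": [10, 25, 20, 5, 3, 15, 18],
    "Participação": [80, 95, 90, 50, 40, 75, 85],
})
pairs_demo = df_demo.corr().unstack()
threshold = 0.8

# Signed test: strong negative pairs (those involving Faltas) are missed
high_positive = [(a, b) for (a, b), r in pairs_demo.items()
                 if a != b and r > threshold]

# Absolute test: strong pairs of either sign are kept
high_absolute = [(a, b) for (a, b), r in pairs_demo.items()
                 if a != b and abs(r) > threshold]

print(len(high_positive), len(high_absolute))
```

Both lists still contain each pair in both orders, (A, B) and (B, A).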

In [None]:
corr_pairs

In [None]:
high_corr


In [None]:
corr_pairs.index
#print(type(corr_pairs.index))
# Each index entry is a pair (tuple) of strings, in this case!