# Exercise 4. 
Go to the UCI data repository (archive.ics.uci.edu/ml/data-sets.html) and identify a large data set that contains both numeric and nominal values. Using Microsoft Excel or any other statistical software:
a. Calculate and interpret central tendency measures for each and every variable.  
b. Calculate and interpret the dispersion/spread measures for each and every variable.

In [1]:
import pandas as pd
import numpy as np

# Configuration to display all pandas columns
pd.set_option('display.max_columns', None)

In [2]:
# Define the correct path according to your folder structure
file_path = 'src/data/raw/student_performance.csv'

# Load the dataset; the file uses ';' as its field separator
try:
    df = pd.read_csv(file_path, sep=';')
    print("Dataset loaded successfully!")
    df.head()
except FileNotFoundError:
    print(f"Error: File not found at path: {file_path}")
    print("Make sure the notebook is running from the project root (HW3).")

Dataset loaded successfully!
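
When the separator of a CSV file is not known in advance, it can be detected instead of assumed. The following is a minimal sketch using the standard library's `csv.Sniffer`. The sample text is a made-up excerpt (an assumption, not the real file contents), and the candidate delimiters are restricted to `;` and `,`.

```python
import csv

# Hypothetical excerpt imitating the first lines of a ';'-separated file
sample = "school;sex;age\nGP;F;18\nGP;M;17\nMS;F;16\n"

# Ask the sniffer to choose between ';' and ',' based on the sample
dialect = csv.Sniffer().sniff(sample, delimiters=";,")
detected_sep = dialect.delimiter

print(f"Detected separator: {detected_sep!r}")
```

The detected value could then be passed as `sep=` to `pd.read_csv` in place of the hard-coded `';'`.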


## Variable Separation

To analyze correctly, we divide the columns into two groups:
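1.  **Numerical:** Data on which we can calculate mean, median, variance, etc. (Includes discrete ordinals like `Medu` or `famrel`; treating these as numeric assumes the levels are evenly spaced, so their means should be read with some care.)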
2.  **Categorical (Nominal/Binary):** Data representing a quality or label (e.g., `Mjob`, `sex`).

In [4]:
# Based on the 'student_performance.txt' dictionary

# Numerical variables (including ordinals)
numeric_cols = [
    'age', 'Medu', 'Fedu', 'traveltime', 'studytime', 'failures', 'famrel', 
    'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences', 'G1', 'G2', 'G3'
]

# Categorical variables (nominal and binary)
categorical_cols = [
    'school', 'sex', 'address', 'famsize', 'Pstatus', 'Mjob', 'Fjob', 
    'reason', 'guardian', 'schoolsup', 'famsup', 'paid', 'activities', 
    'nursery', 'higher', 'internet', 'romantic'
]

print(f"Total numerical variables: {len(numeric_cols)}")
print(f"Total categorical variables: {len(categorical_cols)}")

Total numerical variables: 16
Total categorical variables: 17
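
Because the two lists are typed by hand, it is worth confirming that no column appears twice or in both groups. The sketch below is one possible check. It repeats the lists from the cell above so that it runs on its own, and it only verifies internal consistency, not agreement with the actual file.

```python
numeric_cols = [
    'age', 'Medu', 'Fedu', 'traveltime', 'studytime', 'failures', 'famrel',
    'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences', 'G1', 'G2', 'G3'
]
categorical_cols = [
    'school', 'sex', 'address', 'famsize', 'Pstatus', 'Mjob', 'Fjob',
    'reason', 'guardian', 'schoolsup', 'famsup', 'paid', 'activities',
    'nursery', 'higher', 'internet', 'romantic'
]

all_cols = numeric_cols + categorical_cols

# Columns listed in both groups (should be empty)
overlap = set(numeric_cols) & set(categorical_cols)

# Number of repeated names across the two lists (should be 0)
duplicate_count = len(all_cols) - len(set(all_cols))

total_columns = len(set(all_cols))

print(f"Overlap: {overlap}")
print(f"Duplicates: {duplicate_count}")
print(f"Distinct columns covered: {total_columns}")
```

In the notebook, comparing `set(all_cols)` with `set(df.columns)` would additionally reveal any column that was left out of both lists.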


## a) and b): Analysis of Numerical Variables

For numerical variables, we will calculate:
* **Measures of Central Tendency:**
    * **Mean:** The arithmetic average.
    * **Median:** The central value (50th percentile).
    * **Mode:** The most frequent value.
* **Measures of Dispersion/Spread:**
    * **Standard Deviation (std):** How much the data deviates from the mean.
    * **Minimum (min) and Maximum (max):** The range of the data.
    * **Quartiles (25%, 75%):** Define the Interquartile Range (IQR), which shows the spread of the middle 50% of the data.
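
To make these definitions concrete, here is a small sketch that computes each measure for a made-up sample (the values are hypothetical, not taken from the dataset). It uses Python's `statistics` module so that it runs without the CSV file. `method="inclusive"` interpolates linearly between sorted values, which should correspond to the default behaviour of pandas' `quantile`.

```python
import statistics

# Hypothetical sample (invented values, e.g. absence counts)
values = [0, 0, 0, 2, 4, 4, 6, 8, 30]

# Central tendency
mean = statistics.mean(values)
median = statistics.median(values)
mode = statistics.mode(values)

# Dispersion
std = statistics.stdev(values)           # sample standard deviation (n - 1)
value_range = max(values) - min(values)  # max minus min
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1                            # spread of the middle 50%

print(f"mean={mean}, median={median}, mode={mode}")
print(f"std={std:.2f}, range={value_range}, Q1={q1}, Q3={q3}, IQR={iqr}")
```

In this toy sample a single large value (30) pulls the mean above the median and inflates the standard deviation, while the median and the IQR barely notice it.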

In [None]:
print("--- Measures of Central Tendency and Dispersion (Numerical) ---")
numeric_description = df[numeric_cols].describe()
print(numeric_description)

print("\n--- Mode (Numerical Variables) ---")
numeric_modes = df[numeric_cols].mode()
print(numeric_modes.iloc[0]) # Show only the first mode if there are several

--- Measures of Central Tendency and Dispersion (Numerical) ---
              age        Medu        Fedu  traveltime   studytime    failures  \
count  649.000000  649.000000  649.000000  649.000000  649.000000  649.000000   
mean    16.744222    2.514638    2.306626    1.568567    1.930663    0.221880   
std      1.218138    1.134552    1.099931    0.748660    0.829510    0.593235   
min     15.000000    0.000000    0.000000    1.000000    1.000000    0.000000   
25%     16.000000    2.000000    1.000000    1.000000    1.000000    0.000000   
50%     17.000000    2.000000    2.000000    1.000000    2.000000    0.000000   
75%     18.000000    4.000000    3.000000    2.000000    2.000000    0.000000   
max     22.000000    4.000000    4.000000    4.000000    4.000000    3.000000   

           famrel    freetime       goout        Dalc        Walc      health  \
count  649.000000  649.000000  649.000000  649.000000  649.000000  649.000000   
mean     3.930663    3.180277    3.184900   

### Interpretation of Numerical Variables (Examples)

* **`age`:**
    * **Central Tendency:** The **mean** age is ~16.7 years, the **median** is 17, and the **mode** is 16. The mean sitting slightly below the median hints at a mild negative skew, but with integer ages and a maximum of 22 this signal is weak. In general, most students are concentrated around 16 to 17 years old.
    * **Dispersion:** The **range** is from 15 to 22 years. The **standard deviation (std)** of ~1.22 years is low, meaning most ages are very close to the mean. The middle 50% of students (IQR) are between 16 (Q1) and 18 (Q3) years old.

* **`G3` (Final Grade):**
    * **Central Tendency:** The **mean** final grade is ~10.4. The **median** is 11. The median being higher than the mean suggests there are some very low grades "pulling" the average down.
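    * **Dispersion:** The **range** is wide, from 0 to 20. The **standard deviation** is high (~4.5), indicating high variability in student performance. The standard deviation alone says nothing about symmetry, and the median exceeding the mean points to a tail of low grades rather than an even split between high and low grades.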

* **`absences`:**
    * **Central Tendency:** The **mean** is ~5.7 absences, but the **median** is only 4. The **mode** is 0.
    * **Dispersion:** This is key. 75% (Q3) of students have 8 or fewer absences, but the **maximum** is 93. This, along with the mean > median, indicates the presence of *outliers*: a few students with an extreme number of absences are raising the average. The **standard deviation** of ~8 is very high (even higher than the mean), confirming this extreme dispersion.
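
The outlier argument for `absences` can be made explicit with the common 1.5 × IQR rule (Tukey's fences). This is one convention among several and not something the exercise prescribes. The sketch below applies it to a hypothetical sample; the values and the helper name `tukey_outliers` are invented for illustration.

```python
import statistics

def tukey_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical absence counts (not taken from the dataset)
sample_absences = [0, 0, 0, 2, 4, 4, 6, 8, 30]

outliers = tukey_outliers(sample_absences)

# Compare the mean with and without the flagged values
mean_all = statistics.mean(sample_absences)
mean_without = statistics.mean(v for v in sample_absences if v not in outliers)

print(f"Outliers: {outliers}")
print(f"Mean with outliers: {mean_all}, without: {mean_without}")
```

Comparing the two means shows how strongly a few extreme values can raise the average, which is the effect described for `absences`.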

---

## a) and b): Analysis of Categorical (Nominal) Variables

For categorical variables, the calculations are different:
* **Measure of Central Tendency:**
    * **Mode:** This is the *only* relevant measure of central tendency. It represents the most common category.
* **Measure of Dispersion/Spread:**
    * **Frequency Distribution:** Measured by counting the frequency (count) or percentage of each unique category. High dispersion would mean many categories have similar percentages. Low dispersion means one or two categories dominate.
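
The contrast between "similar percentages" and "one dominant category" can be summarized in a single number. Two common choices, neither of which is required by the exercise, are the variation ratio (1 minus the mode's share) and the normalized Shannon entropy (0 when one category holds every observation, 1 when all categories are equally frequent). The sketch below uses the `school` counts implied by this notebook's `describe()` output (423 of 649 students in 'GP', hence 226 in 'MS') and a hypothetical five-category variable whose counts are invented.

```python
import math

def variation_ratio(counts):
    """Share of observations that are NOT in the modal category."""
    total = sum(counts.values())
    return 1 - max(counts.values()) / total

def normalized_entropy(counts):
    """Shannon entropy divided by its maximum, log2(number of categories)."""
    if len(counts) < 2:
        return 0.0
    total = sum(counts.values())
    probs = [c / total for c in counts.values() if c > 0]
    entropy = -sum(p * math.log2(p) for p in probs)
    return entropy / math.log2(len(counts))

school_counts = {"GP": 423, "MS": 226}  # from the describe() output
even_counts = {"a": 20, "b": 20, "c": 20, "d": 20, "e": 20}  # hypothetical

school_vr = variation_ratio(school_counts)
school_entropy = normalized_entropy(school_counts)
even_entropy = normalized_entropy(even_counts)

print(f"school: variation ratio={school_vr:.3f}, entropy={school_entropy:.3f}")
print(f"evenly split variable: entropy={even_entropy:.3f}")
```

For `school`, the variation ratio is 226/649 ≈ 0.35. In the notebook, the `counts` dictionaries could be built with `df[col].value_counts().to_dict()`.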

In [None]:
print("--- Measures of Central Tendency and Dispersion (Categorical) ---")
categorical_description = df[categorical_cols].describe()
print(categorical_description)

print("\n\n--- Detailed Frequency Distribution (Dispersion) ---")

for col in categorical_cols:
    print(f"\n--- Frequency Analysis: '{col}' ---")
    percentage_distribution = df[col].value_counts(normalize=True) * 100
    print(percentage_distribution)
    print(f"Mode (Central Tendency): {percentage_distribution.idxmax()} ({percentage_distribution.max():.2f}%)")

--- Measures of Central Tendency and Dispersion (Categorical) ---
       school  sex address famsize Pstatus   Mjob   Fjob  reason guardian  \
count     649  649     649     649     649    649    649     649      649   
unique      2    2       2       2       2      5      5       4        3   
top        GP    F       U     GT3       T  other  other  course   mother   
freq      423  383     452     457     569    258    367     285      455   

       schoolsup famsup paid activities nursery higher internet romantic  
count        649    649  649        649     649    649      649      649  
unique         2      2    2          2       2      2        2        2  
top           no    yes   no         no     yes    yes      yes       no  
freq         581    398  610        334     521    580      498      410  


--- Detailed Frequency Distribution (Dispersion) ---

--- Frequency Analysis: 'school' ---
school
GP    65.177196
MS    34.822804
Name: proportion, dtype: float64
Mode (Ce

### Interpretation of Categorical Variables (Examples)

* **`school`:**
    * **Central Tendency:** The **mode** is 'GP' (Gabriel Pereira), which is the most frequent school, accounting for ~65% of students (423 of 649).
    * **Dispersion:** The dispersion is moderate. 'GP' clearly dominates, but 'MS' (Mousinho da Silveira) still represents a substantial ~35%.

* **`Mjob` (Mother's Job):**
    * **Central Tendency:** The **mode** is 'other', with ~40% (258 of 649).
    * **Dispersion:** The dispersion here is much higher than in `school`. The 5 categories ('other', 'services', 'at_home', 'teacher', 'health') all have a significant presence, indicating high variability in mothers' jobs.

* **`internet` (Internet Access):**
    * **Central Tendency:** The **mode** is 'yes', with ~77% (498 of 649).
    * **Dispersion:** Fairly low dispersion. A clear majority of students ('yes') have internet access, and a minority ('no', ~23%) do not.

* **`romantic` (Romantic Relationship):**
    * **Central Tendency:** The **mode** is 'no', with ~63% (410 of 649).
    * **Dispersion:** The data is moderately dispersed. Although the majority are not in a relationship, more than a third ('yes', ~37%) are, showing two clearly defined groups.