<h3>Load the Sample Dataset</h3>

In [1]:
import pandas as pd
df = pd.read_csv('s3://knodax-ml-datasets/demographic_scores.csv')
df.head()

Unnamed: 0,id,age,income,score,category
0,1,25,48000,78,A
1,2,30,52000,85,A
2,3,22,35000,65,B
3,4,28,49000,80,A
4,5,35,61000,88,A


<h3>Compute Descriptive Statistics</h3>
Use pandas' describe() method to generate summary statistics for numerical features: count, mean, standard deviation, minimum, maximum, and the quartiles. The 50% row is the median.

In [2]:
desc_stats = df.describe()
print(desc_stats)


             id        age      income      score
count  10.00000  10.000000     10.0000  10.000000
mean    5.50000  28.100000  48700.0000  77.400000
std     3.02765   6.297266  11450.8612  10.276186
min     1.00000  21.000000  34000.0000  60.000000
25%     3.25000  23.250000  40250.0000  70.500000
50%     5.50000  26.500000  48500.0000  79.000000
75%     7.75000  32.250000  57250.0000  85.750000
max    10.00000  40.000000  68000.0000  90.000000
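As a quick check that the 50% row of describe() is the median, the sketch below compares the two on a small made-up DataFrame. The `toy` frame and its values are hypothetical and are not the tutorial's dataset.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate describe(); not the tutorial's CSV.
toy = pd.DataFrame({
    "age": [21, 25, 30, 40],
    "income": [34000, 48000, 52000, 68000],
})

desc = toy.describe()    # count, mean, std, min, quartiles, max per numeric column
medians = toy.median()   # median of each column, computed directly

# The "50%" row of describe() should match the directly computed medians.
print(desc.loc["50%"])
print(medians)
```

If you need a statistic that describe() does not report, calling the dedicated method (for example `toy.median()` or `toy.var()`) is usually simpler than post-processing the summary table.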


<h3>Compute Correlation Matrix</h3>
Use the corr() method to measure pairwise linear (Pearson) relationships between numerical columns. This helps you spot potential multicollinearity among features.

In [3]:
# df.corr() computes Pearson correlation, which only works with numeric columns,
# so select the numeric columns first.
correlation_matrix = df.select_dtypes(include='number').corr()
print(correlation_matrix)


              id       age    income     score
id      1.000000  0.067019  0.014422 -0.107137
age     0.067019  1.000000  0.981999  0.917915
income  0.014422  0.981999  1.000000  0.960492
score  -0.107137  0.917915  0.960492  1.000000


<h4>Key Interpretations:</h4>
id has low correlation with all other variables:<br>
<li>id vs age: 0.067 → very weak positive correlation (basically no meaningful relation)</li>
<li>id vs income: 0.014 → negligible correlation</li>
<li>id vs score: -0.107 → weak negative correlation</li>
<li>Suggests id is just an index and not related to features; you can drop it for analysis.</li><br>

age vs income: 0.982
<li>Strong positive correlation: as age increases, income tends to increase.</li>
<li>One plausible explanation is that older people tend to have more work experience, although the correlation alone cannot confirm the cause.</li><br>

age vs score: 0.918
<li>Strong positive correlation — older people tend to have higher scores.</li><br>

income vs score: 0.960
<li>Very strong positive correlation — people with higher income tend to have higher scores.</li><br>

Summary: <br>
<li>Ignore id for analysis.</li>
<li>age, income, and score are strongly correlated with each other.</li>
<li>These relationships suggest redundancy — possibly relevant for PCA or feature selection.</li>

In machine learning, highly correlated features (a condition known as multicollinearity) can be a problem, depending on the model you're using.

Here's why and what to do:

<h4>Why Highly Correlated Features Can Be Problematic</h4>
Redundancy: Highly correlated features carry similar information, so each additional one gives the model little new signal.

Multicollinearity (especially in linear models like Linear Regression, Logistic Regression): It makes it hard to determine the individual effect of each feature. Model coefficients become unstable and hard to interpret.

Overfitting Risk: Redundant features can make a model more complex than necessary, increasing the risk of overfitting.

<h4>What You Can Do (Common Approaches)</h4>
Remove One of the Correlated Features: If age, income, and score have pairwise correlations above 0.9, as they do here, you might drop one or two of them. Keep the one that makes the most domain sense or is most interpretable.<br>

Dimensionality Reduction (like PCA): PCA transforms correlated variables into uncorrelated principal components. You lose interpretability, but gain compact, cleaner data for modeling. <br>

Use Tree-Based Models (Random Forest, XGBoost): These models handle multicollinearity better than linear models. But still, fewer redundant features can improve speed and simplicity. <br>
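
The first approach above, removing one feature from each highly correlated pair, can be sketched as follows. This is a minimal illustration, not a definitive recipe: the helper name `drop_highly_correlated`, the 0.9 threshold, and the `toy` data are all assumptions made for the example. It also keeps whichever column appears first, which is a convenience and not a substitute for domain judgment.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(frame, threshold=0.9):
    """Drop the later column of each pair whose absolute Pearson r exceeds threshold.

    Hypothetical helper for illustration; returns the reduced frame and the
    list of dropped column names.
    """
    corr = frame.select_dtypes(include="number").corr().abs()
    # Keep only the strict upper triangle so each pair is examined once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop), to_drop

# Hypothetical toy data: income tracks age closely, noise does not.
toy = pd.DataFrame({
    "age": [21, 25, 30, 35, 40],
    "income": [34000, 41000, 52000, 60000, 68000],
    "noise": [3, 1, 4, 1, 5],
})

reduced, dropped = drop_highly_correlated(toy)
print(dropped)
print(list(reduced.columns))
```

If one of the columns is the prediction target, exclude it before calling a helper like this so that the target is never dropped.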

In summary, in many ML workflows, highly correlated features are removed or combined to reduce redundancy. This helps improve model interpretability, generalization, and sometimes even accuracy.
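
Combining correlated features, as PCA does, can be sketched with NumPy alone. The example below is an illustration under stated assumptions: the data is synthetic, the seed and coefficients are arbitrary, and PCA is computed through an SVD of the standardized data rather than through a dedicated library class.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: x2 is almost a scaled copy of x1.
x1 = rng.normal(size=200)
x2 = 0.95 * x1 + 0.1 * rng.normal(size=200)
X = np.column_stack([x1, x2])

# Standardize each column, then decompose the standardized matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
U, S, Vt = np.linalg.svd(Z, full_matrices=False)

components = Z @ Vt.T                 # principal component scores
explained = S**2 / np.sum(S**2)       # share of variance per component

print("corr(x1, x2):", np.corrcoef(X, rowvar=False)[0, 1])
print("corr(PC1, PC2):", np.corrcoef(components, rowvar=False)[0, 1])
print("explained variance ratio:", explained)
```

The original features are strongly correlated, while the component scores are uncorrelated and the first component carries most of the variance. That is the trade described above: a more compact representation at the cost of interpretability.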

<h3>Perform Hypothesis Testing with SciPy</h3>
Use SciPy’s statistical test functions to assess whether observed differences between groups are statistically significant.

In [4]:
from scipy.stats import ttest_ind

group_a = df[df['category'] == 'A']['income']
group_b = df[df['category'] == 'B']['income']

t_stat, p_value = ttest_ind(group_a, group_b)
print(f"T-Statistic: {t_stat}, P-Value: {p_value}")


T-Statistic: 4.411064394627796, P-Value: 0.002253201469264483


The difference between the two groups is statistically significant: if the two groups really had the same mean income, a difference this large would be very unlikely to arise by chance.

<h4>1. T-Statistic</h4>
The t-statistic measures how much the means of two groups differ in terms of standard error. A higher absolute value of the t-statistic means a bigger difference between groups relative to variation.

In your case:
T = 4.411 → suggests a strong difference between group means.
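
To make the definition concrete, the sketch below computes the pooled-variance (Student's) t-statistic by hand and compares it with `ttest_ind`, whose default `equal_var=True` uses the same formula. The two samples `a` and `b` are made-up values, not the tutorial's groups.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical samples, used only to illustrate the calculation.
a = np.array([48000, 52000, 49000, 61000, 57000], dtype=float)
b = np.array([35000, 34000, 40000, 38000], dtype=float)

n_a, n_b = len(a), len(b)

# Pooled variance: a weighted average of the two sample variances.
pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)

# Standard error of the difference between the two means.
std_err = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# t-statistic: difference in means, measured in standard errors.
t_manual = (a.mean() - b.mean()) / std_err

t_scipy, p_scipy = ttest_ind(a, b)  # equal_var=True by default
print(t_manual, t_scipy, p_scipy)
```

If the two groups may have unequal variances, passing `equal_var=False` to `ttest_ind` runs Welch's t-test instead, which uses a different standard error and degrees of freedom.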

<h4>2. P-Value</h4>
The p-value tells you the probability of observing this result (or more extreme) if the null hypothesis were true. The null hypothesis typically assumes no difference between the groups. Your p-value is 0.00225, which is very small.

<h4>3. Interpretation</h4>
If you're using a common significance level of α = 0.05, you reject the null hypothesis when p < 0.05. In this case, p = 0.00225 < 0.05, so you reject the null hypothesis.
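
The decision rule can be written as a one-line check. The helper name `reject_null` is hypothetical; the p-value is the one printed by the t-test output earlier in this section.

```python
def reject_null(p_value, alpha=0.05):
    """Return True when the p-value falls below the significance level alpha."""
    return p_value < alpha

# P-value reported by the t-test on income for categories A and B.
p_value = 0.002253201469264483

print(reject_null(p_value))               # default significance level of 0.05
print(reject_null(p_value, alpha=0.001))  # a much stricter level
```

The conclusion depends on the significance level you choose, which is why α should be fixed before looking at the data.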