# Machine Learning and Analytics

## Homework Project 4

#### Marianna Kanellaki - S-001081

In [30]:
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
import numpy as np

### 1. Galton and Pearson Regression

#### 1.1. Linear Regression on heights

In [31]:
df = pd.read_csv("data/par_off.csv", header=None, names=['ParentHeight', 'OffspringHeight'])
df

,ParentHeight,OffspringHeight
0,65.049,59.778
1,63.251,63.214
2,64.955,63.342
3,65.752,62.792
4,61.137,64.281
...,...,...
1073,66.997,70.752
1074,71.332,68.268
1075,71.783,69.306
1076,70.738,69.302


In [33]:
model = ols("OffspringHeight~ParentHeight", data=df)
fit = model.fit()
fit.summary()

Dep. Variable:,OffspringHeight,R-squared:,0.251
Model:,OLS,Adj. R-squared:,0.251
Method:,Least Squares,F-statistic:,361.2
Date:,"Sun, 15 Feb 2026",Prob (F-statistic):,1.12e-69
Time:,23:37:45,Log-Likelihood:,-2488.7
No. Observations:,1078,AIC:,4981.0
Df Residuals:,1076,BIC:,4991.0
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,33.8868,1.832,18.494,0.000,30.291,37.482
ParentHeight,0.5141,0.027,19.006,0.000,0.461,0.567

Omnibus:,17.606,Durbin-Watson:,0.767
Prob(Omnibus):,0.0,Jarque-Bera (JB):,30.851
Skew:,-0.051,Prob(JB):,2e-07
Kurtosis:,3.822,Cond. No.,1670.0


The results agree with Galton and Pearson's observations. The slope of 0.5141 is positive but well below 1.0 (95% confidence interval 0.461 to 0.567). Taller parents therefore tend to have taller offspring, but the offspring inherit only about half of the parent's deviation from the average, so their predicted heights are pulled towards the population mean. The intercept of 33.8868 positions the line so that it passes through the sample means: a parent of average parent height is predicted to have offspring of average offspring height.
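To make the regression towards the mean concrete, the fitted line can be evaluated by hand. The sketch below hard-codes the two coefficients from the summary above; the helper name predict_offspring and the three example parent heights are illustrative choices, not part of the original analysis.

```python
# Coefficients copied from the OLS summary above.
INTERCEPT = 33.8868
SLOPE = 0.5141


def predict_offspring(parent_height):
    """Predicted offspring height from the fitted line (hypothetical helper)."""
    return INTERCEPT + SLOPE * parent_height


# Height at which the predicted offspring height equals the parent's height.
fixed_point = INTERCEPT / (1 - SLOPE)

for parent in (62.0, 68.0, 74.0):
    print(f"parent {parent:.1f} -> predicted offspring {predict_offspring(parent):.2f}")
print(f"prediction equals parent height at {fixed_point:.2f}")
```

A short parent (62.0) gets a prediction above their own height and a tall parent (74.0) a prediction below it, which is the pattern the slope below 1 implies.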

#### 1.2. Heritability

The function generate_data implements the idea that parents and offspring share a genetic factor, while each individual also has an independent random factor. It draws one genetic factor (g) per family, using the mean, var, and hrd (the fraction of the total variance that is due to genetic factors). It then builds the gs array by copying g for each offspring, and creates the offspring data by adding to gs a random factor drawn separately for each individual.

I created the function generate_data_diff_means so that parents and offspring can have different means. The genetic factor is drawn with mean 0. The parent data are then built around mean_parent and the offspring data around mean_offspring.
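The module generate_hereditary_data itself is not shown in this report, so the following is only a sketch of what such a generator could look like, reconstructed from the description above. The name generate_data_diff_means_sketch, the argument order (inferred from the call in section 1.4), the seed parameter, and the use of normal distributions are all assumptions; the actual function may differ.

```python
import numpy as np


def generate_data_diff_means_sketch(mean_parent, mean_offspring, var, hrd,
                                    n_per_family, n_families, seed=None):
    """Hypothetical reconstruction; the real generator may differ."""
    rng = np.random.default_rng(seed)
    # Genetic factor: one draw per family, mean 0, variance hrd * var.
    g = rng.normal(0.0, np.sqrt(hrd * var), size=n_families)
    # Copy g for every family member: shape (n_families, n_per_family).
    gs = np.repeat(g[:, np.newaxis], n_per_family, axis=1)
    # Individual factor: independent per person, variance (1 - hrd) * var.
    noise_sd = np.sqrt((1.0 - hrd) * var)
    parents = mean_parent + gs + rng.normal(0.0, noise_sd, size=gs.shape)
    offspring = mean_offspring + gs + rng.normal(0.0, noise_sd, size=gs.shape)
    return g, parents, offspring


g, p, x = generate_data_diff_means_sketch(175, 160, 100, 0.8, 2, 1000, seed=0)
print(g.shape, p.shape, x.shape)
print(f"parent mean {p.mean():.1f}, offspring mean {x.mean():.1f}, "
      f"offspring variance {x.var():.1f}")
```

Splitting the total variance into hrd * var for the shared factor and (1 - hrd) * var for the individual factor keeps the variance of each height equal to var, which is what makes hrd readable as a fraction of total variance.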

#### 1.3. Heritability and coefficients

In an idealised setting, where parents and offspring share a single genetic factor and have the same total variance, the heritability and the regression coefficient are identical. Under that assumption, the coefficient of 0.5141 from section 1.1 is an estimate of the heritability: approximately 51% of a parent's deviation from the average height is passed down to the offspring.
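The equality can be checked with a short derivation. The notation here is an assumption introduced for illustration: $p$ and $x$ are parent and offspring heights, $g$ is the shared genetic factor with variance $h\sigma^2$ (with $h$ standing for hrd and $\sigma^2$ for var), and $e_p$, $e_x$ are independent individual factors, each with variance $(1-h)\sigma^2$.

$$
p = \mu_p + g + e_p, \qquad x = \mu_x + g + e_x
$$

$$
\beta = \frac{\operatorname{Cov}(p, x)}{\operatorname{Var}(p)}
      = \frac{\operatorname{Var}(g)}{\operatorname{Var}(g) + \operatorname{Var}(e_p)}
      = \frac{h\sigma^2}{h\sigma^2 + (1-h)\sigma^2}
      = h
$$

The result depends on the individual factors being independent of $g$ and of each other. Any shared environment between parent and offspring would add to the covariance and push the slope above $h$.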

#### 1.4. Heritability and percentage of variance

In [34]:
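from generate_hereditary_data import generate_data_diff_means

# Arguments as used here: mean_parent, mean_offspring, var, hrd,
# members per family, number of families.
g, p, x = generate_data_diff_means(175, 160, 100, 0.8, 2, 1000)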
df = pd.DataFrame({
    'GeneticFactor': np.repeat(g, 2),
    'ParentHeight': p.flatten(),
    'OffspringHeight': x.flatten()
})
df

,GeneticFactor,ParentHeight,OffspringHeight
0,2.970073,175.974164,156.689439
1,2.970073,171.922533,162.902093
2,-0.332621,176.350457,168.154469
3,-0.332621,174.309121,164.981477
4,-9.441922,169.416390,149.910720
...,...,...,...
1995,2.319220,177.755337,167.889157
1996,0.240261,169.348927,166.060245
1997,0.240261,176.351759,156.826586
1998,7.797839,182.469928,172.979356


In [35]:
model = ols("OffspringHeight~ParentHeight", data=df)
fit = model.fit()
fit.summary()

Dep. Variable:,OffspringHeight,R-squared:,0.658
Model:,OLS,Adj. R-squared:,0.658
Method:,Least Squares,F-statistic:,3843.0
Date:,"Sun, 15 Feb 2026",Prob (F-statistic):,0.0
Time:,23:37:45,Log-Likelihood:,-6445.3
No. Observations:,2000,AIC:,12890.0
Df Residuals:,1998,BIC:,12910.0
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,18.1597,2.294,7.915,0.000,13.660,22.659
ParentHeight,0.8105,0.013,61.994,0.000,0.785,0.836

Omnibus:,3.312,Durbin-Watson:,1.914
Prob(Omnibus):,0.191,Jarque-Bera (JB):,3.003
Skew:,-0.03,Prob(JB):,0.223
Kurtosis:,2.82,Cond. No.,2960.0


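The fraction of variance in offspring heights explained by the parent heights (R-squared = 0.658) is approximately the square of the heritability (hrd^2 = 0.8^2 = 0.64). This is because both the parent's and the offspring's heights contain independent random noise on top of the shared genetic factor, so their correlation is hrd and the R-squared is hrd^2. The slope of 0.8105, by contrast, is close to hrd = 0.8 itself.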
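The squared relationship follows from the generative model. As a sketch, assume the following notation (introduced here for illustration): $p$ and $x$ are parent and offspring heights, $g$ is the shared genetic factor with variance $h\sigma^2$ (with $h$ standing for hrd and $\sigma^2$ for var), and each height adds an independent individual factor with variance $(1-h)\sigma^2$.

$$
\operatorname{Cov}(p, x) = \operatorname{Var}(g) = h\sigma^2,
\qquad
\operatorname{Var}(p) = \operatorname{Var}(x) = \sigma^2
$$

$$
R^2 = \operatorname{Corr}(p, x)^2
    = \left( \frac{h\sigma^2}{\sqrt{\sigma^2 \cdot \sigma^2}} \right)^2
    = h^2
$$

For a regression with a single predictor, $R^2$ is the squared correlation between predictor and response, which is why the noise in the parent's height enters the result a second time.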

In [36]:
model = ols("OffspringHeight~GeneticFactor", data=df)
fit = model.fit()
fit.summary()

Dep. Variable:,OffspringHeight,R-squared:,0.819
Model:,OLS,Adj. R-squared:,0.819
Method:,Least Squares,F-statistic:,9061.0
Date:,"Sun, 15 Feb 2026",Prob (F-statistic):,0.0
Time:,23:37:45,Log-Likelihood:,-5807.0
No. Observations:,2000,AIC:,11620.0
Df Residuals:,1998,BIC:,11630.0
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,160.0296,0.099,1620.755,0.000,159.836,160.223
GeneticFactor,1.0073,0.011,95.188,0.000,0.987,1.028

Omnibus:,2.513,Durbin-Watson:,1.947
Prob(Omnibus):,0.285,Jarque-Bera (JB):,2.498
Skew:,-0.05,Prob(JB):,0.287
Kurtosis:,3.142,Cond. No.,9.33


If the genetic factor (g) is known, the fraction of variance explained rises to 0.819, which is approximately the heritability itself (hrd = 0.8). This is because g is the exact factor shared by the family, whereas the parent's height is g plus its own random noise.
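Both R-squared values can be checked numerically without the original generator. The sketch below assumes the model described in section 1.2 (a shared normal genetic factor plus independent normal noise) and, for simplicity, uses one parent and one offspring per family with a much larger sample than the cells above. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
var, hrd, n = 100.0, 0.8, 200_000  # same var and hrd as above, larger sample

g = rng.normal(0.0, np.sqrt(hrd * var), size=n)         # shared genetic factor
noise_sd = np.sqrt((1.0 - hrd) * var)                   # individual factor
parent = 175.0 + g + rng.normal(0.0, noise_sd, size=n)
offspring = 160.0 + g + rng.normal(0.0, noise_sd, size=n)

# For a one-predictor OLS fit, R-squared equals the squared correlation.
r2_parent = np.corrcoef(parent, offspring)[0, 1] ** 2
r2_genetic = np.corrcoef(g, offspring)[0, 1] ** 2

print(f"R^2 from parent height:  {r2_parent:.3f} (hrd^2 = {hrd ** 2:.3f})")
print(f"R^2 from genetic factor: {r2_genetic:.3f} (hrd   = {hrd:.3f})")
```

With a sample this large the estimates should sit close to hrd^2 and hrd respectively, so the gap between 0.658 and 0.64 (or 0.819 and 0.8) in the fitted summaries is consistent with ordinary sampling variation.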

In [37]:
model = ols("OffspringHeight~GeneticFactor + ParentHeight", data=df)
fit = model.fit()
fit.summary()

Dep. Variable:,OffspringHeight,R-squared:,0.819
Model:,OLS,Adj. R-squared:,0.819
Method:,Least Squares,F-statistic:,4530.0
Date:,"Sun, 15 Feb 2026",Prob (F-statistic):,0.0
Time:,23:37:45,Log-Likelihood:,-5806.7
No. Observations:,2000,AIC:,11620.0
Df Residuals:,1997,BIC:,11640.0
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,163.1488,3.816,42.759,0.000,155.666,170.632
GeneticFactor,1.0252,0.024,42.249,0.000,0.978,1.073
ParentHeight,-0.0178,0.022,-0.818,0.414,-0.061,0.025

Omnibus:,2.501,Durbin-Watson:,1.948
Prob(Omnibus):,0.286,Jarque-Bera (JB):,2.49
Skew:,-0.048,Prob(JB):,0.288
Kurtosis:,3.143,Cond. No.,6780.0


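If we already know the genetic factor g, the parent height adds nothing. Once g is in the model, the ParentHeight coefficient (-0.0178, p = 0.414) is not significantly different from zero, and R-squared stays at 0.819. Parent height is only a noisy version of g, so it is statistically redundant.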
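The redundancy can also be reproduced on freshly simulated data. This is a minimal sketch under the same assumed model as in section 1.2 (shared normal genetic factor, independent normal noise), fitted with numpy's least-squares solver instead of statsmodels to keep it self-contained. The variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
var, hrd, n = 100.0, 0.8, 50_000

g = rng.normal(0.0, np.sqrt(hrd * var), size=n)
noise_sd = np.sqrt((1.0 - hrd) * var)
parent = 175.0 + g + rng.normal(0.0, noise_sd, size=n)
offspring = 160.0 + g + rng.normal(0.0, noise_sd, size=n)

# Design matrix with columns: intercept, genetic factor, parent height.
X = np.column_stack([np.ones(n), g, parent])
coef, *_ = np.linalg.lstsq(X, offspring, rcond=None)
intercept, b_genetic, b_parent = coef

print(f"genetic factor coefficient: {b_genetic:.3f}")
print(f"parent height coefficient:  {b_parent:.3f}")
```

Given g, the only part of the parent's height left over is the parent's own noise, which is independent of the offspring's height, so the parent coefficient should be close to zero and the genetic coefficient close to one.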

### 2. Regularised Polynomial Regression

#### 2.1. Comment the code and see what the new features are. How can you use it for unregularised polynomial regression? 

#### 2.2. Generate data from the polynomial distribution and compare unregularised and regularised regression (for different (usually small) values of gamma). What do you observe? What is the optimal order of the polynomial under regularisation compared to unregularised? 

#### 2.3. In particular, compare them on out-of-distribution predictions (for u outside of the [l, u] interval used for generating the data) or for data to which single outliers have been added. What do you observe?