# Homework06

Exercises to practice pandas, data analysis and regression

## Goals

- Understand the effects of pre-processing data
- Get familiar with the ML flow: encode -> normalize -> train -> evaluate
- Understand the difference between regression and classification tasks
- Build an intuition for different regression models

### Setup

Run the following 2 cells to import all necessary libraries and helpers for this homework.

In [None]:
!wget -q https://github.com/PSAM-5020-2025S-A/5020-utils/raw/main/src/data_utils.py

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.preprocessing import OrdinalEncoder

from data_utils import object_from_json_url
from data_utils import StandardScaler
from data_utils import LinearRegression, SGDRegressor
from data_utils import regression_error

from data_utils import PolynomialFeatures


### Load Dataset

Let's load up the full [ANSUR](https://www.openlab.psu.edu/ansur2/) dataset that we looked at briefly in [Week 02](https://github.com/DM-GY-9103-2024F-H/WK02).

This is the dataset that has anthropometric information about U.S. Army personnel.

In [None]:
# Load Dataset
ANSUR_FILE = "https://raw.githubusercontent.com/PSAM-5020-2025S-A/5020-utils/main/datasets/json/ansur.json"
ansur_data = object_from_json_url(ANSUR_FILE)

# Look at first 2 records
ansur_data[:2]

#### Nested data

This is that *nested* dataset from Week 02.

# 🤔

Let's load it into a `DataFrame` to see what happens.

In [None]:
# Read into DataFrame
ansur_df = pd.DataFrame.from_records(ansur_data)
ansur_df.head()


# 😓🙄

That didn't work too well. We ended up with objects in our columns.

Luckily, our `DataFrame` library has a function called [`json_normalize()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html) that can help.

In [None]:
# Read into DataFrame
ansur_df = pd.json_normalize(ansur_data)
ansur_df.head()

Much better. `DataFrames` are magic.
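
To see why the two loading calls behave differently, here is a minimal sketch on a couple of made-up nested records. The field names and values are hypothetical and only loosely shaped like the ANSUR entries.

```python
import pandas as pd

# Hypothetical nested records (not real ANSUR rows)
records = [
    {"age": 27, "gender": "M", "ear": {"length": 71, "breadth": 36}},
    {"age": 34, "gender": "F", "ear": {"length": 63, "breadth": 33}},
]

# from_records keeps each nested dict as one object-valued column
nested_df = pd.DataFrame.from_records(records)

# json_normalize flattens nested keys into dotted column names
flat_df = pd.json_normalize(records)

print(nested_df.dtypes)
print(list(flat_df.columns))
```

The dotted names, such as `ear.length`, are how the nested fields can be addressed as ordinary columns afterwards.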

#### Data Exploration

Before we start creating models, let's do a little bit of data analysis and get a feeling for the shapes, distributions and relationships of our data.

1. Print `min`, `max` and `average` values for all of the features.
2. Print `covariance` tables for `age`, `ear.length` and `head.circumference`.
3. Plot `age`, `ear.length` and `head.circumference` versus the $1$ *feature* that is most correlated to each of them.

Don't forget to *encode* and *normalize* the data.
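
As a rough illustration of what *encode* and *normalize* mean, the sketch below uses a tiny made-up table. It uses scikit-learn's `OrdinalEncoder` directly and a hand-written z-score rather than the course's `StandardScaler` helper, so treat it as an approximation of the idea, not of the helper's exact behavior.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data
toy_df = pd.DataFrame({
    "gender": ["M", "F", "F", "M"],
    "weight": [82.0, 61.0, 70.0, 95.0],
})

# Encode: map each category to a number, in the order given (M -> 0, F -> 1)
toy_encoder = OrdinalEncoder(categories=[["M", "F"]])
toy_df["gender"] = toy_encoder.fit_transform(toy_df[["gender"]]).ravel()

# Normalize: subtract each column's mean and divide by its standard deviation
# (ddof=0 is the population formula, which sklearn's StandardScaler also uses)
toy_scaled = (toy_df - toy_df.mean()) / toy_df.std(ddof=0)

print(toy_scaled)
```

After this step every column has mean 0 and standard deviation 1, so covariances between columns become comparable across features with different units.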

In [None]:
# Work on Data Exploration here
# print("raw data describe:")
# display(ansur_df.describe())

### Encode non-numerical features

# need to encode "gender" col
genders = ['M', 'F']
print(genders)
gender_encoder = OrdinalEncoder(categories=[genders])
gender_vals = gender_encoder.fit_transform(ansur_df[["gender"]].values)
ansur_df[["gender"]] = gender_vals # gender is now 0 or 1


## 1. Print min, max, avg
print("encoded data describe:")
display(ansur_df.describe())

### Normalize all data
stdScaler = StandardScaler()
ansur_scaled = stdScaler.fit_transform(ansur_df)
print("normalized data describe:")
display(ansur_scaled.describe())

In [None]:
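## 2. Print Covariances
# Compute the covariance table once and reuse it for each feature
cov_table = ansur_scaled.cov()

# Values are sorted ascending by magnitude, so the strongest relationships are at
# the bottom; the very last row is the feature's covariance with itself
cov_age = cov_table["age"]
print("covariance age:")
display(cov_age.abs().sort_values())

cov_earL = cov_table["ear.length"]
print("covariance ear length:")
display(cov_earL.abs().sort_values())

cov_headC = cov_table["head.circumference"]
print("covariance head circumference:")
display(cov_headC.abs().sort_values())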

In [None]:
## 3. Plot features most correlated to age, ear length and head circumference
ages = ansur_scaled["age"]
ear_Ls = ansur_scaled["ear.length"]
head_Cs = ansur_scaled["head.circumference"]

# age max cov: ear.length = 0.292098
plt.scatter(ear_Ls, ages, marker='o', linestyle='', alpha=0.3)
plt.xlabel("ear length")
plt.ylabel("age")
plt.show()

# ear length max cov: weight = 0.487481
weights = ansur_scaled["weight"]
plt.scatter(weights, ear_Ls,  marker='o', linestyle='', alpha=0.3)
plt.xlabel("weight")
plt.ylabel("ear length")
plt.show()

# head circumference max cov: head.height = 0.546534
head_Hs = ansur_scaled["head.height"]
plt.scatter(head_Hs, head_Cs, marker='o', linestyle='', alpha=0.3)
plt.xlabel("head height")
plt.ylabel("head circumference")
plt.show()

### Interpretation

<span style="color:hotpink;">
Does anything stand out about these graphs? Or the correlations?<br>
Are correlations symmetric? Does the feature most correlated to ear length also have ear length as its most correlated pair?
</span>

The graphs for ear length and head circumference against their most correlated features show a roughly linear upward trend. The graph for age is much less clear, which matches its covariance being the lowest of the three.

The 2 graphs that include ear length (the age graph and the ear length graph) seem to have discrete measurements, whereas the graph for head circumference is more clustered without discrete values.

The feature most correlated to age is ear length, but the feature most correlated to ear length is weight. The correlation between any two features is the same in both directions, so the values themselves are symmetric, but the "most correlated" pairing is not mutual.
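
The difference between a symmetric correlation table and a non-mutual "best partner" can be sketched with synthetic data. The columns `a`, `b` and `c` below are made up for illustration and have nothing to do with the ANSUR features.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000

# Synthetic columns: b is a noisy copy of c, and a is a noisier copy of b
c = rng.normal(size=n)
b = c + 0.6 * rng.normal(size=n)
a = b + 1.0 * rng.normal(size=n)
toy = pd.DataFrame({"a": a, "b": b, "c": c})

corr = toy.corr().abs()

# Hide the diagonal so a column cannot pick itself as its best partner
off_diag = corr.where(~np.eye(len(corr), dtype=bool))
best_partner = off_diag.idxmax()

print(corr.round(2))
print(best_partner)
```

Here `a` is closest to `b`, but `b` is closest to `c`, even though the table itself reads the same across its diagonal.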

### Regression

Now, we want to create a regression model to predict `head.circumference` from the data.

From our [Week 06](https://github.com/PSAM-5020-2025S-A/WK06) notebook, we can create a regression model by following these steps:

1. Load dataset (done! 🎉)
2. Encode label features as numbers (done! ⚡️)
3. Normalize the data (done! 🍾)
4. Separate the outcome variable and the input features
5. Create a regression model using all features
6. Run model on training data and measure error
7. Plot predictions and interpret results
8. Run model on test data, measure error, plot predictions, interpret results
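
Steps 4 to 8 above can be sketched on synthetic data as follows. This uses scikit-learn classes directly instead of the `data_utils` wrappers, and a hand-written RMSE as a stand-in error metric; `regression_error` may compute something different.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic data: the outcome depends linearly on two features plus noise
X = rng.normal(size=(400, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=400)

# Hold out the last 100 rows as a stand-in for a separate test file
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

# Fit the scaler and the model on the training rows only
toy_scaler = StandardScaler().fit(X_train)
toy_model = LinearRegression().fit(toy_scaler.transform(X_train), y_train)

def rmse(y_true, y_pred):
    """Root-mean-square error between two arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# On the test rows, only transform and predict; nothing is fit again
train_rmse = rmse(y_train, toy_model.predict(toy_scaler.transform(X_train)))
test_rmse = rmse(y_test, toy_model.predict(toy_scaler.transform(X_test)))
print(train_rmse, test_rmse)
```

Keeping every `fit` call on the training side is what makes the test error an estimate of how the model behaves on data it has not seen.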

In [None]:
# Work on Regression Model here

## Separate outcome variable and input features

# outcome variable: 
headCs = ansur_scaled["head.circumference"]
features = ansur_scaled.drop(columns=["head.circumference"])

# create extra features to improve the model
poly = PolynomialFeatures(degree=3, include_bias=False)
features_poly = poly.fit_transform(features)

## Create a regression model
model = LinearRegression()

# start with linear regression for all features
# model.fit(features, headCs)

# see if model with poly features is better
model.fit(features_poly, headCs)

## Measure error on training data
# predicted_scaled = model.predict(features)
predicted_scaled = model.predict(features_poly)
predicted = stdScaler.inverse_transform(predicted_scaled)

regression_error(predicted, ansur_df["head.circumference"])


In [None]:
# start by plotting the features that were most closely related to head circumference

for feature in ["head.height", "weight", "foot.length"]:
    feat = ansur_df[feature]
    plt.plot(feat, ansur_df["head.circumference"], marker='o', linestyle='', alpha=0.3)
    plt.plot(feat, predicted["head.circumference"], color='r', marker='o', linestyle='', alpha=0.3)
    plt.xlabel(feature)
    plt.ylabel("head.circumference")
    plt.show()

In [None]:
## Load Test Data
ANSUR_TEST_FILE = "https://raw.githubusercontent.com/PSAM-5020-2025S-A/5020-utils/main/datasets/json/ansur-test.json"

ansur_test_data = object_from_json_url(ANSUR_TEST_FILE)
ansur_test_df = pd.json_normalize(ansur_test_data)

ansur_test_encoded_df = ansur_test_df.copy()

g_vals = gender_encoder.transform(ansur_test_df[["gender"]].values)
ansur_test_encoded_df[["gender"]] = g_vals

ansur_test_scaled_df = stdScaler.transform(ansur_test_encoded_df)

In [None]:
## Run model on test data

headCs_test = ansur_test_scaled_df["head.circumference"]
features_test = ansur_test_scaled_df.drop(columns=["head.circumference"])
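# Reuse the polynomial expansion that was fit on the training features
# (assumes the data_utils PolynomialFeatures wrapper exposes transform(),
# like the StandardScaler used above)
features_poly_test = poly.transform(features_test)

## Measure error on test data
# Do not call model.fit() here: refitting on the test set would turn this
# into a second training error instead of a held-out error
predicted_scaled_test = model.predict(features_poly_test)
predicted_test = stdScaler.inverse_transform(predicted_scaled_test)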

display(regression_error(predicted_test, ansur_test_df["head.circumference"]))

## Plot predictions and interpret results

for feature in ["head.height", "weight", "foot.length"]:
    feat = ansur_test_df[feature]
    plt.plot(feat, ansur_test_df["head.circumference"], color='b', marker='o', linestyle='', alpha=0.3)
    plt.plot(feat, predicted_test["head.circumference"], color='r', marker='o', linestyle='', alpha=0.3)
    plt.xlabel(feature)
    plt.ylabel("head.circumference")
    plt.show()


### Interpretation

<span style="color:hotpink;">
How well does your model perform?<br>
How could you improve it?<br>
Are there ranges of circumferences that don't get predicted well?
</span>

A linear regression model with all of the features led to these `regression_error` results:

- training: 13.910571985204234
- test: 14.171132427777215

These numbers don't seem so bad for circumferences in the range of 502 to 653.

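The training and test errors are pretty similar, which is a good sign as long as the test error comes from the model fit on the training data and not from one refit on the test set. The model still isn't great at predicting circumference values at the low and high ends, roughly below 540 and above 600, in both the training and test data.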
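
One way to check which ranges are predicted poorly is to bin the actual values and compare the average error per bin. The sketch below uses made-up actual and predicted values in which predictions are pulled toward the mean, a pattern that would produce this kind of weakness at the extremes; the cut points 540 and 600 are taken from the observation above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Hypothetical actual values, and predictions that are pulled toward the mean
actual = rng.normal(loc=570, scale=18, size=2000)
predicted = 570 + 0.6 * (actual - 570) + rng.normal(scale=8, size=2000)

results = pd.DataFrame({"actual": actual, "predicted": predicted})
results["abs_error"] = (results["predicted"] - results["actual"]).abs()

# Bin by the actual value and compare the mean absolute error per bin
bins = [-np.inf, 540, 600, np.inf]
labels = ["below 540", "540 to 600", "above 600"]
results["range"] = pd.cut(results["actual"], bins=bins, labels=labels)
error_by_range = results.groupby("range", observed=True)["abs_error"].mean()

print(error_by_range.round(1))
```

With real predictions, the same table would show whether the low and high bins really carry most of the error.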

Adding extra features with `PolynomialFeatures` reduced the training `regression_error` a bit, to 12.626836323124518, but the model still wasn't great at predicting the high and low values.

