In [2]:
import numpy as np
import pandas as pd

In [3]:
# Use the fully qualified option name; the bare 'precision' is ambiguous in recent pandas
pd.set_option('display.precision', 1)

**Question 1** (25 points)

Simulated heights and weights of 25,000 18-year-old children are available at this URL
`http://socr.ucla.edu/docs/resources/SOCR_Data/SOCR_Data_Dinov_020108_HeightsWeights.html`

or this shortened version

`'https://rb.gy/u3dvsb'`

The original data has height in inches and weight in pounds. 

- Find the average height in cm, weight in kg, and BMI for each of the four categories defined below:

```
BMI             Category
18 or less      Underweight
18 – 22         Normal
22 – 25         Overweight
More than 25    Obese
```

- Do this in a reproducible way - that is, **all** processing of the data must be done in **code**
- Use **method chaining** to generate the summary

Your final table should look like this:

| status      |   ht_in_cm |   wt_in_kg |     bmi |
|:------------|-----------:|-----------:|--------:|
| Underweight |    173.1 |    51.4 | 17.1 |
| Normal      |    172.7 |    58.8 | 19.7 |
| Overweight  |    170.6 |    65.9 | 22.6 |
| Obese       |    166.1 |    70.0 | 25.4 |

Notes

- 1 inch = 2.54 cm
- 1 pound = 0.453592 kg
- BMI = (weight in kg) / (height in meters)^2
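
As a quick sanity check of the conversions in these notes, here is a minimal sketch for a single person. The function name `bmi_from_imperial` and the sample values (68 inches, 127 pounds) are made up for illustration and are not taken from the data set.

```python
INCH_TO_CM = 2.54
POUND_TO_KG = 0.453592

def bmi_from_imperial(height_in, weight_lb):
    """Convert inches and pounds to metric units, then compute BMI."""
    height_m = height_in * INCH_TO_CM / 100   # inches -> cm -> m
    weight_kg = weight_lb * POUND_TO_KG       # pounds -> kg
    return weight_kg / height_m ** 2

# Hypothetical person: 68 inches tall, 127 pounds
example_bmi = bmi_from_imperial(68, 127)
print(round(example_bmi, 1))  # 19.3
```

A value of about 19.3 would fall in the "Normal" category of the table above.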

In [4]:
tables = pd.read_html("http://socr.ucla.edu/docs/resources/SOCR_Data/SOCR_Data_Dinov_020108_HeightsWeights.html",header=0,index_col=0)
df = tables[0]

In [16]:
def status(x):
    if x <=18:
        return "Underweight"
    elif (x > 18) and (x <= 22):
        return "Normal"
    elif (x > 22) and (x <= 25):
        return "Overweight"
    else:
        return "Obese"
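
An alternative to a row-wise function is `pd.cut`, which bins a whole column at once. The sketch below assumes the same boundaries as `status` (intervals closed on the right) and uses a handful of made-up BMI values rather than the real data.

```python
import numpy as np
import pandas as pd

# Made-up BMI values, including the boundary cases 18, 22 and 25
bmi = pd.Series([17.1, 18.0, 18.5, 22.0, 25.0, 25.4])

labels = ["Underweight", "Normal", "Overweight", "Obese"]
# Bins are (-inf, 18], (18, 22], (22, 25], (25, inf) because right=True is the default
status_cat = pd.cut(bmi, bins=[-np.inf, 18, 22, 25, np.inf], labels=labels)

print(status_cat.tolist())
# ['Underweight', 'Underweight', 'Normal', 'Normal', 'Overweight', 'Obese']
```

Because `pd.cut` returns an ordered categorical, grouping on it lists the categories in the order of `labels`, so no extra sort is needed to get the rows in table order.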

In [40]:
# Single method chain: convert units, compute BMI, label each row, then summarize
(
    df.assign(
        ht_in_cm=lambda d: d["Height(Inches)"] * 2.54,
        wt_in_kg=lambda d: d["Weight(Pounds)"] * 0.453592,
        bmi=lambda d: d.wt_in_kg / (d.ht_in_cm / 100) ** 2,
        status=lambda d: d.bmi.apply(status),
    )
    .drop(columns=["Height(Inches)", "Weight(Pounds)"])
    .groupby("status")
    .mean()
    .sort_values("bmi")
    .reset_index()
)

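|    | status      |   ht_in_cm |   wt_in_kg |   bmi |
|---:|:------------|-----------:|-----------:|------:|
|  0 | Underweight |      173.1 |       51.4 |  17.1 |
|  1 | Normal      |      172.7 |       58.8 |  19.7 |
|  2 | Overweight  |      170.6 |       65.9 |  22.6 |
|  3 | Obese       |      166.1 |       70.0 |  25.4 |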


**Question 2** (25 points)

Use the `EtLT` data pipeline pattern:

- Using `requests`, download all people from the Star Wars REST API at https://swapi.dev/api and store the information about each person in a MongoDB database
- Extract the data from the MongoDB database
- Flatten the nested JSON structure into a `pandas` DataFrame
- Save to an SQLite3 database
- Use SQL to transform the data in the SQLite3 database <font color=red>This question is not clear and was not considered in grading</font>
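
The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference solution: the pages in `FAKE_PAGES` are hypothetical stand-ins for the live API (which returns a `results` list and a `next` link per page), the `meta` field is invented to show flattening, and the MongoDB round trip is left out because it needs a running server. With `pymongo` that step would typically be `collection.insert_many(people)` followed by `list(collection.find())`.

```python
import sqlite3
import pandas as pd

def fetch_all(first_url, get_json):
    """Follow the `next` links of a paginated endpoint and collect every result."""
    results, url = [], first_url
    while url:
        page = get_json(url)
        results.extend(page["results"])
        url = page.get("next")
    return results

# Hypothetical pages standing in for the live API responses
FAKE_PAGES = {
    "page1": {"next": "page2", "results": [
        {"name": "Person A", "height": "172", "films": ["f1", "f2"], "meta": {"source": "demo"}},
    ]},
    "page2": {"next": None, "results": [
        {"name": "Person B", "height": "96", "films": ["f1"], "meta": {"source": "demo"}},
    ]},
}

people = fetch_all("page1", FAKE_PAGES.__getitem__)
# Against the real API this would be roughly:
# people = fetch_all("https://swapi.dev/api/people/", lambda u: requests.get(u).json())

# Flatten nested dicts into columns such as meta_source
df = pd.json_normalize(people, sep="_")
# SQLite cannot store Python lists, so join each list into one delimited string
df["films"] = df["films"].str.join("|")

con = sqlite3.connect(":memory:")
df.to_sql("people", con, index=False)

# Transform in SQL: cast the text height to an integer and sort by it
rows = con.execute(
    "SELECT name, CAST(height AS INTEGER) AS height_cm "
    "FROM people ORDER BY height_cm DESC"
).fetchall()
print(rows)  # [('Person A', 172), ('Person B', 96)]
con.close()
```

Passing the fetch function in as an argument keeps the pagination logic testable without network access.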

**Question 3** (50 points)

The data set in `dm.csv` contains 11 columns as described below. The first 10 columns are features (X), and the last is the target (y). Each feature has been centered and scaled so that its mean is 0 and its sum of squares is 1.

```
age     age in years
sex
bmi     body mass index
bp      average blood pressure
s1      tc, total serum cholesterol
s2      ldl, low-density lipoproteins
s3      hdl, high-density lipoproteins
s4      tch, total cholesterol / HDL
s5      ltg, possibly log of serum triglycerides level
s6      glu, blood sugar level
target  measure of disease severity at 1 year
```
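
The feature scaling can be illustrated with a short sketch. It assumes each column is centered and then divided by its Euclidean norm, so that the squares in each column sum to 1 (the convention scikit-learn documents for its diabetes data). The raw values here are randomly generated, not read from `dm.csv`.

```python
import numpy as np

rng = np.random.default_rng(0)
raw = rng.normal(loc=50, scale=10, size=(100, 3))  # made-up raw features

# Center each column, then divide by the column's Euclidean norm
centered = raw - raw.mean(axis=0)
scaled = centered / np.sqrt((centered ** 2).sum(axis=0))

print(np.allclose(scaled.mean(axis=0), 0))         # True
print(np.allclose((scaled ** 2).sum(axis=0), 1))   # True
```

The same two checks can be run on the feature columns of the real file to confirm how it was scaled.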

- Split the data into X_train, X_test, y_train, y_test
- Plot the correlation matrix of X_train features as a heatmap using seaborn
- Perform a PCA on X_train
- Display a biplot of X_train projected on the first 2 principal components
- Write a function that returns the number of components needed to explain p% of the variance in X_train, and show the result for p=90
- Create a dummy regression model AND a proper regression model with sklearn's RandomForestRegressor to predict the target from the 10 features
    - Perform hyperparameter optimization on at least 2 parameters of the Random Forest model using grid search and create a tuned RandomForestRegressor model with the best parameters
    - Plot the learning curve for the tuned Random Forest
    - Evaluate the model performance on test data using R^2 and mean absolute error for both dummy and Random Forest models
    - Plot feature importances for the Random Forest model using permutation importance and mean absolute Shapley values
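
For the "number of components" step, one possible shape for the function is sketched below. The name `n_components_for` is hypothetical. It takes the per-component explained variance ratios, which a fitted scikit-learn `PCA` exposes as `explained_variance_ratio_`. To keep the sketch dependency-light, the ratios are computed here with a NumPy SVD on randomly generated data rather than on `X_train`.

```python
import numpy as np

def n_components_for(explained_variance_ratio, p):
    """Smallest number of leading components whose cumulative variance share reaches p percent."""
    cumulative = np.cumsum(explained_variance_ratio)
    # First index where the cumulative share is >= p/100, converted to a count
    k = int(np.searchsorted(cumulative, p / 100)) + 1
    # Guard against floating-point round-off when p is 100
    return min(k, len(cumulative))

# Hand-made ratios: cumulative shares are 0.5, 0.8, 0.95, 1.0
print(n_components_for([0.5, 0.3, 0.15, 0.05], 90))  # 3

# PCA via SVD on made-up data with unequal column variances
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)) * np.arange(1, 11)
Xc = X - X.mean(axis=0)                      # PCA requires centered data
s = np.linalg.svd(Xc, compute_uv=False)      # singular values
ratio = s ** 2 / np.sum(s ** 2)              # explained variance ratio per component
k = n_components_for(ratio, 90)
print(k)
```

Working from the ratio array keeps the function independent of how the PCA was fitted, so it can be called as `n_components_for(pca.explained_variance_ratio_, 90)` after fitting on `X_train`.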