# DX 602 Final Project

## Introduction

In this project, you will practice the skills that you have learned throughout this module with a heavy focus on building models.
Most of the problems and questions are open ended compared to your previous homeworks, and you will be asked to explain your choices.
Most of them will have a particular type of solution implied, but it is up to you to figure out the details based on what you have learned in this module.

## Instructions

Each problem asks you to build models, run a computation, or otherwise perform some analysis of the data, and usually to answer some questions about the results.
Make sure that your question answers are well supported by your analysis and explanations; simply stating an answer without support will earn minimal points.

Notebook cells for code and text have been added for your convenience, but feel free to add additional cells.

## Example Code

You may find it helpful to refer to these GitHub repositories of Jupyter notebooks for example code.

* https://github.com/bu-cds-omds/dx601-examples
* https://github.com/bu-cds-omds/dx602-examples

Any calculations demonstrated in code examples or videos may be found in these notebooks, and you are allowed to copy this example code in your homework answers.

## Submission

This project will be entirely manually graded.
However, we may rerun some or all of your code to confirm that it works as described.

### Late Policy

The normal homework late policy for OMDS does not apply to this project.
Boston University requires final grades to be submitted within 72 hours of class instruction ending, so we cannot accommodate 5 days of late submissions.

However, we have delayed the due date of this project to be substantially later than necessary given its scope, and given you more days for submission with full credit than you would have had days for submission with partial credit under the homework late policy.
The Thanksgiving holiday was also taken into account in setting the deadline.
Finally, the deadlines for DX 601 and DX 602 were coordinated to be a week apart while giving ample time for both of their projects.

## Shared Imports

For this project, you are forbidden to use modules that were not loaded in this template.
While other modules are handy in practice, modules that trivialize these problems interfere with our assessment of your own knowledge and skills.

If you believe a module covered in the course material (not live sessions) is missing, please check with your learning facilitator.

In [1]:
import math
import sys

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats
import sklearn.linear_model

## Problems

### Problem 1 (5 points)

Pick one of the following data sets to analyze in this project.
Load the data set, and show a random sample of 10 rows.

* [Wine Quality](https://archive.ics.uci.edu/dataset/186/wine+quality) ([PMLB - red subset only](https://github.com/EpistasisLab/pmlb/tree/master/datasets/wine_quality_red))
* [Body Fat](https://www.openml.org/search?type=data&status=active&id=560) ([PMLB](https://github.com/EpistasisLab/pmlb/tree/master/datasets/560_bodyfat))

The PMLB copies of the data are generally cleaner and recommended for this project, but the other links are provided to give you more context.
To load the data from the PMLB GitHub repository, navigate to the `.tsv.gz` file in GitHub and copy the link from the "Raw" button.

If the dataset has missing data, you should drop the rows with missing data before proceeding.
If the data set you choose has more than ten columns, you may limit later analysis that is requested per column to just the first ten columns.
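
The loading and cleaning steps described above can be sketched as follows. This is a minimal illustration, not the required solution: the small in-memory table is a hypothetical stand-in for a PMLB `.tsv.gz` file, and its column names and values are made up so that the example runs without a download.

```python
import gzip
import io

import pandas as pd

# Hypothetical stand-in for a PMLB file: a tiny gzip-compressed TSV built in memory.
# The second data row has a missing Weight value on purpose.
raw_text = "Age\tWeight\ttarget\n23\t154.25\t12.3\n22\t\t6.1\n26\t184.75\t10.4\n"
compressed = io.BytesIO(gzip.compress(raw_text.encode("utf-8")))

# Tab-separated, gzip-compressed input.
df = pd.read_csv(compressed, sep="\t", compression="gzip")

# Drop rows with missing data before any further analysis.
clean = df.dropna().reset_index(drop=True)

# Keep at most the first ten columns for per-column analysis.
first_ten = clean.iloc[:, :10]

# Show a random sample of rows (n=2 here because the toy table is tiny).
sample = first_ten.sample(n=2, random_state=0)
print(sample)
```

With a real file, the `io.BytesIO` object would be replaced by the local path or the copied "Raw" link, and `n=2` by `n=10`.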

In [3]:
# Read the Body Fat TSV file downloaded from the PMLB GitHub repository

bodyfat = pd.read_csv('560_bodyfat.tsv', sep='\t')

# Show a random sample of 10 rows from the bodyfat dataset 

bodyfat.sample(n=10)

Unnamed: 0,Density,Age,Weight,Height,Neck,Chest,Abdomen,Hip,Thigh,Knee,Ankle,Biceps,Forearm,Wrist,target
249,1.0328,72.0,186.75,66.0,38.900002,111.099998,111.5,101.699997,60.299999,37.299999,21.5,31.299999,27.200001,18.0,29.299999
105,1.0578,43.0,165.5,68.5,31.1,93.099998,87.300003,96.599998,54.700001,39.0,24.799999,31.0,29.4,18.799999,18.0
41,1.025,44.0,205.0,29.5,36.599998,106.0,104.300003,115.5,70.599998,42.5,23.700001,33.599998,28.700001,17.4,32.900002
168,1.018,35.0,228.25,69.5,40.400002,114.900002,115.900002,111.900002,74.400002,40.599998,24.0,36.099998,31.799999,18.799999,34.299999
76,1.079,57.0,162.5,69.5,38.700001,91.599998,78.800003,94.300003,56.700001,39.700001,24.200001,30.200001,29.200001,18.1,8.8
52,1.0807,51.0,137.25,67.75,36.5,89.699997,82.0,89.099998,49.299999,33.700001,21.4,29.6,26.0,16.9,8.0
87,1.0462,64.0,160.0,65.75,36.5,104.300003,90.900002,93.800003,57.799999,39.5,23.299999,29.200001,28.4,18.1,23.1
201,1.0484,43.0,150.0,69.25,35.200001,91.099998,85.699997,96.900002,55.5,35.700001,22.0,29.4,26.6,17.4,22.1
48,1.0678,45.0,135.75,68.5,32.799999,92.300003,83.400002,90.400002,52.0,35.799999,20.6,28.799999,25.5,16.299999,13.6
136,1.0491,39.0,166.75,70.75,37.0,92.900002,86.099998,95.599998,58.799999,36.099998,22.4,32.700001,28.299999,17.1,21.799999


In [4]:
# Check for any null values in the dataset

bodyfat.isnull().sum()

Density    0
Age        0
Weight     0
Height     0
Neck       0
Chest      0
Abdomen    0
Hip        0
Thigh      0
Knee       0
Ankle      0
Biceps     0
Forearm    0
Wrist      0
target     0
dtype: int64

In [19]:
# Set aside a 10-column subset (first 9 columns plus 'target') for later analysis

bodyfat_9_cols = bodyfat.iloc[:, :9].columns.tolist()
bodyfat_10_cols = bodyfat_9_cols + ['target']

bodyfat10 = bodyfat[bodyfat_10_cols]

Problem 1: Load File and Initial Preparation

The Body Fat dataset was downloaded from the PMLB GitHub repository and uploaded to the Final Project codespace. It was then read in with the pandas `read_csv` function. Because the file is tab-separated (.tsv), `sep='\t'` was passed so that the columns are parsed correctly. A random sample of 10 rows was displayed with the `sample` method, which shows that the dataset loaded as expected and that all 15 columns hold numeric, floating-point values. The 'target' column will be the target variable in later model development.

Although the dataset is presumed to be clean and ready to use because it was sourced from the PMLB repository, each column was checked for null values with `isnull().sum()`. The result shows no nulls in any column, so no rows needed to be dropped. Additionally, a `bodyfat10` DataFrame was created that contains only 10 columns: the first 9 columns of the dataset plus 'target'. It was built by taking the names of the first 9 columns of the `bodyfat` DataFrame as a list, appending the 'target' column name, and using that list to select a 10-column version of `bodyfat` for later use.

### Problem 2 (10 points)

List all of the columns and describe them in your own words.

In [21]:
# Get bodyfat column names as a list 

bodyfat.columns.to_list()

# Get bodyfat10 column names as a list for analysis 

bodyfat10.columns.to_list()

['Density',
 'Age',
 'Weight',
 'Height',
 'Neck',
 'Chest',
 'Abdomen',
 'Hip',
 'Thigh',
 'target']

In [22]:
# Get descriptive detail of each column 

bodyfat10.info()
bodyfat10.describe()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252 entries, 0 to 251
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Density  252 non-null    float64
 1   Age      252 non-null    float64
 2   Weight   252 non-null    float64
 3   Height   252 non-null    float64
 4   Neck     252 non-null    float64
 5   Chest    252 non-null    float64
 6   Abdomen  252 non-null    float64
 7   Hip      252 non-null    float64
 8   Thigh    252 non-null    float64
 9   target   252 non-null    float64
dtypes: float64(10)
memory usage: 19.8 KB


Unnamed: 0,Density,Age,Weight,Height,Neck,Chest,Abdomen,Hip,Thigh,target
count,252.0,252.0,252.0,252.0,252.0,252.0,252.0,252.0,252.0,252.0
mean,1.055574,44.884921,178.924405,70.14881,37.992064,100.824206,92.555952,99.904762,59.405952,19.150794
std,0.019031,12.60204,29.38916,3.662856,2.430913,8.430476,10.783077,7.164058,5.249952,8.36874
min,0.995,22.0,118.5,29.5,31.1,79.300003,69.400002,85.0,47.200001,0.0
25%,1.0414,35.75,159.0,68.25,36.400002,94.35,84.574999,95.5,56.0,12.475
50%,1.0549,43.0,176.5,70.0,38.0,99.649998,90.950001,99.300003,59.0,19.200001
75%,1.0704,54.0,197.0,72.25,39.425001,105.375002,99.324997,103.525,62.349999,25.299999
max,1.1089,81.0,363.149994,77.75,51.200001,136.199997,148.100006,147.699997,87.300003,47.5


Problem 2: List and Describe Body Fat Dataset Columns

The column names of the Body Fat dataset were listed, and the `.info()` and `.describe()` functions were used to get an initial understanding of the dataset and its columns.

There are 15 columns in the Body Fat dataset, all with the float64 data type, and all values are continuous. There are 252 rows with no null values in any column. The dataset records body density (determined by underwater weighing) and body part circumference measurements for 252 men. Together, these measures can help determine the percentage of body fat ('target') for each individual.
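
One way to make the link between 'Density' and 'target' concrete is Siri's equation, a commonly used conversion from body density to percent body fat. Whether the 'target' column was derived with exactly this formula is an assumption, not something stated in this notebook. The sketch below only checks that two of the sampled rows shown in Problem 1 are consistent with it.

```python
def siri_body_fat(density):
    """Estimate percent body fat from body density (g/cm^3) with Siri's equation."""
    return 495.0 / density - 450.0


# (Density, target) pairs copied from sampled rows 249 and 105 shown in Problem 1.
sampled_rows = [(1.0328, 29.3), (1.0578, 18.0)]

estimates = []
for density, target in sampled_rows:
    estimate = siri_body_fat(density)
    estimates.append(estimate)
    print(f"density={density:.4f}  Siri estimate={estimate:.2f}  target={target:.1f}")
```

The formula also makes the direction of the relationship explicit: as density increases, the estimated body fat percentage decreases.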

'Density' - The 'Density' column is each man's body density, determined by underwater weighing, in grams per cubic centimeter (g/cm^3). It is used to estimate body composition, because fat is less dense than fat-free mass (muscle, bone, water). A higher body density indicates a lower body fat percentage ('target'). Variability is low, with a standard deviation of 0.0190 around a mean of 1.0556 and a narrow range of approximately 0.11 (0.995 to 1.1089).

'Age' - The 'Age' column gives each individual's age in years. Ages range from 22 to 81, and 75% of individuals are 54 or younger, so the sample skews toward the younger part of this range. Although pandas loaded this column as a float, it could be stored as an integer with no loss of information. Body fat typically increases with age, so older individuals may tend to have higher values of the body fat 'target' variable.

'Weight' - This is each individual's weight in pounds, ranging from 118.5 to 363.15 lbs, with a standard deviation of about 29.4 lbs around a mean of about 178.9 lbs. Weight is the sum of all body mass, including fat, muscle, bone, and water. While it can indicate body fat percentage in tandem with other body measurements, a high or low value may be misleading on its own because other factors contribute to an individual's body weight. For example, an individual with high muscle mass will most likely have a higher weight but not necessarily high body fat.

'Height' - The 'Height' column is a continuous measure, in inches, of each individual's vertical distance from the bottom of the feet to the top of the head. Height can be used in conjunction with other measures to better understand body fat, but it is not a direct indicator on its own. For example, it can be combined with weight to calculate an individual's BMI, which may be more indicative of body fat than either measure alone. Most heights cluster around the mean of about 70.1 inches (the middle 50% fall between 68.25 and 72.25), but the minimum value of 29.5 inches is a potential outlier that warrants further investigation, as it could skew statistical analyses and model results.

'Neck' - The 'Neck' column contains each individual's neck circumference in centimeters. A larger neck circumference can be a risk factor for certain health issues and may correspond with a higher body fat percentage. Measurements range from approx 31 to 51, wi

'Chest' - 

'Abdomen' - 

'Hip' - 

'Thigh' - 

'target' - The 'target' column is each individual's body fat percentage. This value can be estimated from the other numeric values in the dataset and will be used as the output variable for predictive modeling. It is a continuous variable.
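
The concern about the 29.5-inch minimum in 'Height' can be checked with two simple rules of thumb. The sketch below uses only the summary statistics from the `describe()` output above. The choice of a z-score and Tukey's 1.5 × IQR rule is an assumption made for illustration, not a method required by the project.

```python
# Summary statistics for 'Height', copied from the describe() output above.
height_mean = 70.14881
height_std = 3.662856
height_q1 = 68.25
height_q3 = 72.25
height_min = 29.5

# z-score: how many standard deviations the minimum lies from the mean.
z_score = (height_min - height_mean) / height_std

# Tukey's rule: values below Q1 - 1.5 * IQR are flagged as potential outliers.
iqr = height_q3 - height_q1
lower_fence = height_q1 - 1.5 * iqr

print(f"z-score of minimum height: {z_score:.1f}")
print(f"IQR lower fence: {lower_fence:.2f} inches")
print("flagged by IQR rule:", height_min < lower_fence)
```

Both checks flag the value: it lies more than 11 standard deviations below the mean and well under the lower fence of 62.25 inches.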

### Problem 3 (50 points)

Perform an exploratory analysis of the data set.
After your exploratory analysis, pick 3 individual charts that you think were particularly interesting.
Repeat those charts separately from your original analysis, and after each of those charts, explain what you thought was noteworthy.

In [8]:
# YOUR CODE HERE

YOUR ANSWERS HERE

### Problem 4 (5 points)

Plot the correlation matrix of the numeric columns in the data set.
Which pair of different columns were highlighted as the most correlated?
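
As a rough illustration of the numeric part of this task, the sketch below computes a correlation matrix and finds the most correlated pair of different columns. The data and the column names `a`, `b`, and `c` are synthetic and hypothetical; plotting is left out, though the resulting matrix could be drawn with, for example, `plt.imshow`.

```python
import numpy as np
import pandas as pd

# Synthetic data: 'b' is built to track 'a' closely, while 'c' is unrelated noise.
rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
df = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=n),
    "c": rng.normal(size=n),
})

# Pearson correlation between every pair of numeric columns.
corr = df.corr()

# Ignore the diagonal (each column correlates perfectly with itself),
# then locate the largest absolute correlation.
strength = corr.abs().to_numpy().copy()
np.fill_diagonal(strength, 0.0)
i, j = np.unravel_index(np.argmax(strength), strength.shape)
pair = (corr.index[i], corr.columns[j])

print(corr.round(3))
print("most correlated pair:", pair)
```

Using the absolute value treats strong negative correlations as strongly correlated too; drop `.abs()` if only positive correlation is of interest.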

In [9]:
# YOUR CODE HERE

YOUR ANSWERS HERE

### Problem 5 (10 points)

Pick three different regression model classes to try in problem 6 from the scikit-learn documentation.
For each class, provide a link to the scikit-learn documentation, and a link to another web page describing how that kind of model works.
The second link should not be from scikit-learn, but Wikipedia is acceptable.
You do not need to understand the methods at this time, but it is good to be comfortable researching them.

In [10]:
# YOUR CODE HERE

YOUR ANSWERS HERE

### Problem 6 (50 points)

Build three different regression models using the entire data set.
Plot the actual target vs the predicted values for each in one chart.
Compute the L2 and L1 losses for each of them.
You may use any regression class provided by scikit-learn, and you may reuse one class as long as you change its parameters enough to see different results.
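
The loss computation can be sketched as follows on synthetic data. Two assumptions are made here: the L2 loss is taken to be the mean squared error and the L1 loss the mean absolute error (the course material may define them as sums instead), and the three model classes and their `alpha` values are arbitrary illustrative choices.

```python
import numpy as np
import sklearn.linear_model

# Synthetic regression data with a known linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=100)

# Three example model classes (illustrative choices, not a recommendation).
models = {
    "linear": sklearn.linear_model.LinearRegression(),
    "ridge": sklearn.linear_model.Ridge(alpha=10.0),
    "lasso": sklearn.linear_model.Lasso(alpha=0.1),
}

losses = {}
for name, model in models.items():
    # Fit on the entire data set and predict on the same rows.
    predicted = model.fit(X, y).predict(X)
    residual = y - predicted
    losses[name] = {
        "L2": float(np.mean(residual ** 2)),     # mean squared error
        "L1": float(np.mean(np.abs(residual))),  # mean absolute error
    }

for name, loss in losses.items():
    print(f"{name:>6}: L2={loss['L2']:.4f}  L1={loss['L1']:.4f}")
```

Because ordinary least squares minimizes the training squared error, the unregularized linear model cannot have a larger training L2 loss than the ridge model here.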

In [11]:
# YOUR CODE HERE

YOUR ANSWERS HERE

### Problem 7 (30 points)

Use 5-fold cross-validation to repeat building the same three kinds of regression models. Compare the L2 losses predicted by cross-validation against the L2 losses training against the whole data set. (The difference is likely from overfitting in the latter.)
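
A minimal sketch of this comparison for one model is shown below. It uses synthetic data, and it assumes that importing `sklearn.model_selection` is permitted even though that submodule is not in the shared imports; check with your learning facilitator if in doubt. L2 loss is again taken to be the mean squared error.

```python
import numpy as np
import sklearn.linear_model
import sklearn.model_selection

# Synthetic data: only the first of eight features carries signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))
y = X[:, 0] + rng.normal(scale=1.0, size=60)

model = sklearn.linear_model.LinearRegression()

# L2 loss when training and evaluating on the whole data set.
train_l2 = float(np.mean((y - model.fit(X, y).predict(X)) ** 2))

# 5-fold cross-validation: each fold is held out once for evaluation.
folds = sklearn.model_selection.KFold(n_splits=5, shuffle=True, random_state=0)
scores = sklearn.model_selection.cross_val_score(
    model, X, y, cv=folds, scoring="neg_mean_squared_error"
)

# scikit-learn reports negated errors so that larger is better; flip the sign.
cv_l2 = float(-scores.mean())

print(f"whole-data L2: {train_l2:.4f}")
print(f"5-fold CV L2:  {cv_l2:.4f}")
```

The cross-validated loss is measured on rows the model did not see during fitting, which is why it is usually the larger of the two numbers.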

In [12]:
# YOUR CODE HERE

YOUR ANSWERS HERE

### Problem 8 (25 points)

Build three different regression models as in problem 6, but preprocess the data so that each column has mean zero and standard deviation one first.
For full credit, use a scikit-learn pipeline for each model.
For each model, compare the L2 losses -- which of them performed differently from your results in problem 6?

(This process will be covered in week 13.)
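
The effect of standardization can be sketched with a scikit-learn pipeline on synthetic data. The sketch assumes that `sklearn.pipeline` and `sklearn.preprocessing` may be imported for this problem; the feature scales, the models, and the `alpha` value are illustrative choices only.

```python
import numpy as np
import sklearn.linear_model
import sklearn.pipeline
import sklearn.preprocessing

# Synthetic features on very different scales, each contributing similar signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) * np.array([1.0, 50.0, 0.01])
y = X @ np.array([1.0, 0.02, 100.0]) + rng.normal(scale=0.5, size=100)


def training_l2(model):
    """Fit on all rows and return the mean squared error on the same rows."""
    return float(np.mean((y - model.fit(X, y).predict(X)) ** 2))


def standardized(model):
    """Wrap a model in a pipeline that first scales columns to mean 0, std 1."""
    return sklearn.pipeline.make_pipeline(
        sklearn.preprocessing.StandardScaler(), model
    )


raw_ols = training_l2(sklearn.linear_model.LinearRegression())
scaled_ols = training_l2(standardized(sklearn.linear_model.LinearRegression()))
raw_ridge = training_l2(sklearn.linear_model.Ridge(alpha=10.0))
scaled_ridge = training_l2(standardized(sklearn.linear_model.Ridge(alpha=10.0)))

print(f"OLS   raw={raw_ols:.4f}  standardized={scaled_ols:.4f}")
print(f"Ridge raw={raw_ridge:.4f}  standardized={scaled_ridge:.4f}")
```

Ordinary least squares gives the same fit either way, because rescaling a column is absorbed by its coefficient. Penalized models such as ridge are sensitive to column scale, so their losses change after standardization.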

In [13]:
# YOUR CODE HERE

YOUR ANSWERS HERE

### Problem 9 (5 points)

A colleague suggests that you find better models by repeatedly building decision trees with random depth limits.
They say that trying 1000 such models will likely find an improvement as long as you use cross-validation.
Give a one sentence response to this suggestion. 

In [14]:
# YOUR CODE HERE

YOUR ANSWERS HERE

### Problem 10 (10 points)

Pick a best model from all the models that you built and otherwise described in this project.
Explain how you picked it, including what criteria you chose, and how the other models compared by those criteria.
As much as possible, justify that choice in the context of the original data set.

In [15]:
# YOUR CODE HERE

YOUR ANSWERS HERE