<div class='heading'>
    <div style='float:left;'><h1>CPSC 4300/6300: Applied Data Science</h1></div>
     <img style="float: right; padding-right: 10px" width="100" src="https://raw.githubusercontent.com/bsethwalker/clemson-cs4300/main/images/clemson_paw.png"> </div>

**Clemson University**<br>
**Fall 2024**<br>
**Instructor(s):** Aaron Masino <br>

## Homework 3: Linear Regression
This homework assesses your knowledge of linear regression concepts and their implementation using Python statsmodels and scikit-learn. As presented in class, statsmodels is a Python library that provides many tools for developing and evaluating statistical models, and scikit-learn is a Python library that provides many tools for machine learning model development and analysis. You may refer to the course lectures and labs while completing this assignment. For full details, refer to the following documentation:
-  Python documentation [here](https://www.python.org/)
-  statsmodels documentation [here](https://www.statsmodels.org/stable/index.html)
-  scikit-learn documentation [here](https://scikit-learn.org/stable/index.html)
-  Pandas documentation [here](https://pandas.pydata.org/)
-  matplotlib documentation [here](https://matplotlib.org/)
-  seaborn documentation [here](https://seaborn.pydata.org/).


# Setup Instructions
In the exercises below, you will use data from the following files. Make sure you have copied these to the appropriate location (e.g., _YOUR_COURSE_DIR/data_):
- mtcars-simple.csv
- infrared_thermography_temperature.csv

### Before beginning the exercises: 
Execute code cell 1 below to import the Python packages required for this homework.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

import statsmodels.api as sm

import numpy as np

SEED = 654321

# Exercise 1 (1 point) 
In the code cell below, develop a function that implements simple linear regression given a 1-D numpy array of observations of the predictor variable, `X`, and a 1-D numpy array of the associated outcome variable `y`. The model should assume the form:

$y = \beta_0 + \beta_1 * X$

Your function should solve for $\beta_0$ and $\beta_1$ using the equations presented in the linear regression lab:

$\beta_1 = \frac{\sum_{i=1}^n{(x_i-\bar{x})(y_i-\bar{y})}}{\sum_{i=1}^n{(x_i-\bar{x})^2}}$

$\beta_0 = \bar{y} - \beta_1 \bar{x}$

Your function should return the tuple `(beta_0, beta_1)` containing the values estimated from the data. You may use the numpy library in your solution. Do __not__ use any other Python libraries in your solution.
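
As a small worked check of the formulas above (using made-up data, not data from this assignment), take $x = (0, 1, 2, 3)$ and $y = (1, 3, 5, 7)$. Then $\bar{x} = 1.5$ and $\bar{y} = 4$, the numerator of $\beta_1$ is $4.5 + 0.5 + 0.5 + 4.5 = 10$, the denominator is $2.25 + 0.25 + 0.25 + 2.25 = 5$, so $\beta_1 = 2$ and $\beta_0 = 4 - 2 \cdot 1.5 = 1$. One possible way to sanity-check your own function is to compare its output against `np.polyfit` on the same toy data. The sketch below only produces the reference values; it is a check, not a substitute for implementing the equations yourself, and the names `x_demo`, `y_demo`, `slope_ref`, and `intercept_ref` are illustrative.

```python
import numpy as np

# Toy data lying exactly on the line y = 1 + 2x (illustrative values only).
x_demo = np.array([0.0, 1.0, 2.0, 3.0])
y_demo = 1.0 + 2.0 * x_demo

# np.polyfit returns coefficients from highest degree to lowest,
# so a degree-1 fit gives [slope, intercept].
slope_ref, intercept_ref = np.polyfit(x_demo, y_demo, deg=1)
print(intercept_ref, slope_ref)
```

Your `simple_linear_regression(x_demo, y_demo)` should return values that agree with these reference values up to floating-point error.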

In [None]:
# solution - any solution that uses numpy or built-in Python functions to calculate the coefficients is acceptable
def simple_linear_regression(X, y):
    beta_0 = None
    beta_1 = None
    ########### START YOUR CODE HERE #############
    
    ########### END YOUR CODE HERE ###############
    return beta_0, beta_1

# Exercise 2 (1 point)
In the code cell below, complete the implementation of the function `simulate_linear_data`. This function should __generate__ noisy data that is governed by the model

$ y = \beta_0 + \beta_1 X + \epsilon$

The function inputs are:
- `beta_0` and `beta_1`: the _true_ values of $\beta_0$ and $\beta_1$, respectively
- `noise_std`: the _true_ value of $\sigma_{\epsilon}$ (noise standard deviation)
- `n_points`: the number of points to generate. 

The `X` variable (the predictor) has been created for you. You can generate the noise term, $\epsilon$, using `np.random.normal(loc=0, scale=noise_std, size=n_points)` (see the [documentation here](https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html)).

__Note__: this function is __NOT__ fitting a model to data. Rather, it generates data from a specified model; that is, it simulates data that can be used to evaluate how well a fitted model recovers the true parameters. 
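
To get a feel for what `np.random.normal` returns, the following minimal sketch draws a large noise sample and prints its shape, mean, and standard deviation. The scale of `2.0`, the sample size, and the name `eps_demo` are arbitrary choices for illustration; the seed reuses the `SEED` value defined in the import cell.

```python
import numpy as np

np.random.seed(654321)  # same value as SEED in the import cell

# Draw 10,000 noise values with zero mean and standard deviation 2.0.
eps_demo = np.random.normal(loc=0, scale=2.0, size=10_000)

print(eps_demo.shape)                   # one value per requested point
print(eps_demo.mean(), eps_demo.std())  # should be close to 0 and 2.0
```

The sample mean and standard deviation will not match `loc` and `scale` exactly, but they get closer as `size` grows.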

In [None]:
def simulate_linear_data(beta_0, beta_1, n_points, noise_std, x_range=(-10, 10)):
    X = np.random.uniform(x_range[0], x_range[1], n_points)
    y = None
    ########### START YOUR CODE HERE #############
    
    ########### END YOUR CODE HERE #############
    return X, y

# Exercise 3 (3 points)
In the code cell below, use the `simple_linear_regression` and `simulate_linear_data` functions created in the previous exercises to:

- Simulate 100 noisy datasets each with 100 samples. Each dataset should be governed by the same equation: $y = 2.7 + 3.41 X + \epsilon$ where $\epsilon$ is normally distributed with zero (0) mean and unit (1) standard deviation.
- For each dataset, fit a simple linear regression model of the form $y = \beta_0 + \beta_1 x$
- Create a figure with two subplots. 
    - In one subplot, plot a histogram of the estimated values of $\beta_0$. Add a legend that includes the mean and standard deviation of the estimated values of $\beta_0$.
    - In the other subplot, plot a histogram of the estimated values of $\beta_1$. Add a legend that includes the mean and standard deviation of the estimated values of $\beta_1$. 

You may use _matplotlib_ or _seaborn_ libraries to create the histograms. Do __not__ use the statsmodels or scikit-learn libraries in your solution.
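
The plotting layout described above can be sketched with matplotlib as follows. This is only one possible approach, shown on placeholder samples (`samples_a` and `samples_b` are made-up normal draws, not regression coefficients); the legend text is built by passing a formatted string as the histogram's `label`.

```python
import matplotlib.pyplot as plt
import numpy as np

# Placeholder data standing in for two collections of estimates.
rng = np.random.default_rng(0)
samples_a = rng.normal(loc=0.0, scale=1.0, size=200)
samples_b = rng.normal(loc=5.0, scale=0.5, size=200)

# One row, two columns of axes.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, values, name in zip(axes, (samples_a, samples_b), ("a", "b")):
    # The label string becomes the legend entry for this histogram.
    label = f"mean={values.mean():.3f}, std={values.std():.3f}"
    ax.hist(values, bins=20, label=label)
    ax.set_xlabel(name)
    ax.set_ylabel("count")
    ax.legend()
fig.tight_layout()
```

In a notebook the figure is rendered when the cell finishes; in a plain script you would also call `plt.show()`.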

In [None]:
########### START YOUR CODE HERE #############


# Exercise 4 (1 point)
In the code cell below, data from the _mtcars-simple.csv_ file is loaded into a Pandas DataFrame, `df`. The first few rows of the DataFrame are displayed. The columns include information about different makes and models of vehicles:
- `car_make_model` : the make and model of the vehicle 
- `mpg` : average miles per gallon of the car
- `cyl` : number of cylinders 
- `hp` : horsepower
- `wt` : weight (tons)
- `qsec` : fastest time to complete a quarter mile from stopped position

In the code cell, use the Python _statsmodels_ library to fit a multiple linear regression model using `wt` and `hp` as the predictor variables and `qsec` as the outcome. You should add a constant term to the predictor values so that _statsmodels_ will include a _constant_ in the solution; this constant is $\beta_0$. Print a summary of the model fit results using the _statsmodels_ `summary` method.

In [None]:
df = pd.read_csv('../data/mtcars-simple.csv')
display(df.head())

########### START YOUR CODE HERE #############


# Exercise 5 (2 points)
In this exercise, you will further explore the relation between vehicle characteristics and quarter mile time. In the code cell below, data from the _mtcars-simple.csv_ file is re-loaded into a new Pandas DataFrame, `df`, which has the same variables:
- `car_make_model` : the make and model of the vehicle 
- `mpg` : average miles per gallon of the car
- `cyl` : number of cylinders 
- `hp` : horsepower
- `wt` : weight (tons)
- `qsec` : fastest time to complete a quarter mile from stopped position

In the code cell:
- Use the Pandas `get_dummies` function to convert the `cyl` variable to a set of binary variables that includes `cyl_6` and `cyl_8`. 
- Use the Python _statsmodels_ library to fit a multiple linear regression model using `wt`, `hp`, `cyl_6` and `cyl_8` as the predictor variables and `qsec` as the outcome. You should add a constant term to the predictor values so that _statsmodels_ will include a _constant_ in the solution; this constant is $\beta_0$. Print a summary of the model fit results using the _statsmodels_ `summary` method.
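
To illustrate how `get_dummies` names the columns it creates, here is a minimal sketch on a toy DataFrame. The frame `toy` and its `gear` and `speed` columns are hypothetical. Two choices here are assumptions rather than requirements: `drop_first=True` omits the first category so it acts as the reference level, and `dtype=float` requests numeric columns, since recent pandas versions produce boolean dummies by default, which may cause problems when mixed with numeric predictors in a model fit.

```python
import pandas as pd

# Toy data with a categorical-like integer column.
toy = pd.DataFrame({"gear": [3, 4, 5, 4], "speed": [90, 110, 175, 105]})

# Each new column is named <original column>_<category value>.
# drop_first=True omits the first category (gear_3), which becomes the baseline.
encoded = pd.get_dummies(toy, columns=["gear"], drop_first=True, dtype=float)
print(encoded.columns.tolist())
print(encoded)
```

The untouched columns come first, followed by the new indicator columns.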

In [None]:
df = pd.read_csv('../data/mtcars-simple.csv')

########### START YOUR CODE HERE #############


# Exercise 6 (2 points)
In this exercise you will use a cleaned and standardized version of the [Infrared Thermography Temperature](https://archive.ics.uci.edu/dataset/925/infrared+thermography+temperature+dataset) dataset. This dataset contains temperatures read from various locations of infrared images of patients ([Figure 2 in Wang et al.](https://www.semanticscholar.org/paper/Infrared-Thermography-for-Measuring-Elevated-Body-Wang-Zhou/443b9932d295ca3a014e7d874b4bd77a33a276bd/figure/3)), with the addition of oral temperatures measured for each individual. The features consist of gender, age, ethnicity, ambient temperature, humidity, distance, and other temperature readings from the thermal images.

In the cell below, using the data that has been loaded into a Pandas DataFrame, complete the following:
- Split the data into training and test sets with 15% of the data in the test set
- Use the _scikit-learn_ `LinearRegression` class to fit a multiple linear regression model to the training data where the outcome variable is `aveOralM` (oral temperature) and all remaining variables in the DataFrame, `df`, are predictor variables
- Use the trained model to predict the `aveOralM` (oral temperature) for the test data
- Plot the residual values for the test set 
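
The split, fit, predict, and residual-plot workflow listed above can be sketched on synthetic data as follows. The arrays `X_demo` and `y_demo` are made up for illustration, and plotting residuals against predicted values is one common convention, assumed here rather than required by the assignment.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: 200 samples, 3 predictors, known linear relationship plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 0.5 + X_demo @ np.array([1.0, -2.0, 0.3]) + rng.normal(scale=0.1, size=200)

# Hold out 15% of the samples; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=654321
)

# Fit on the training data only, then predict the held-out outcomes.
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Residual = observed value minus predicted value.
residuals = y_test - y_pred

fig, ax = plt.subplots()
ax.scatter(y_pred, residuals)
ax.axhline(0, color="gray", linestyle="--")
ax.set_xlabel("predicted value")
ax.set_ylabel("residual (observed - predicted)")
```

For a well-specified linear model, the residuals should scatter around zero with no obvious trend across the predicted values.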

In [None]:
df = pd.read_csv('../data/infrared_thermography_temperature.csv')
df.head()

########### START YOUR CODE HERE #############
