# Background

The fishing industry uses numerous measurements to describe a specific fish.  Our goal is to predict the weight of a fish from a number of these measurements and to determine whether any of them are insignificant in determining the weight.  The measurements are described below.

## Data Description

The data consists of the following variables:

1. **Weight**: weight of fish in g (numerical)
2. **Species**: species name of fish (categorical)
3. **Body.Height**: height of body of fish in cm (numerical)
4. **Total.Length**: length of fish from mouth to tail in cm (numerical)
5. **Diagonal.Length**: length of diagonal of main body of fish in cm (numerical)
6. **Height**: height of head of fish in cm (numerical)
7. **Width**: width of head of fish in cm (numerical)

## Read the data

In [None]:
# Load the libraries you may need
library(car)
# Read the data set
fishfull = read.csv("Fish.csv", header = TRUE, fileEncoding = 'UTF-8-BOM')
row.cnt = nrow(fishfull)
# Split the data into training and testing sets
fishtest = fishfull[(row.cnt-9):row.cnt,]
fish = fishfull[1:(row.cnt-10),]

*Please use fish as your data set for the following questions unless otherwise stated.*

# Question 1: Exploratory Data Analysis [10 points]

**(a) Create a box plot comparing the response variable, *Weight*, across the multiple *species*.  Based on this box plot, does there appear to be a relationship between the predictor and the response?**
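
As a minimal sketch of the technique in (a), the block below draws a box plot of a response against a categorical predictor. It runs on simulated toy data rather than Fish.csv; the species labels `A`, `B`, `C` and the weights are assumptions made up for illustration.

```r
# Toy data: species labels and weights are simulated, not taken from Fish.csv
set.seed(1)
toy <- data.frame(
  Species = factor(rep(c("A", "B", "C"), each = 20)),
  Weight  = c(rnorm(20, 200, 30), rnorm(20, 400, 60), rnorm(20, 800, 120))
)

# One box per species; clear shifts in the medians between boxes
# suggest that the categorical predictor is related to the response
bp <- boxplot(Weight ~ Species, data = toy,
              xlab = "Species", ylab = "Weight (g)",
              main = "Weight by species (toy data)")
```

With the real data, the same call would presumably use `data = fish`.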

**(b) Create plots of the response, *Weight*, against each quantitative predictor, namely *Body.Height*, *Total.Length*, *Diagonal.Length*, *Height*, and *Width*.  Describe the general trend of each plot.  Are there any potential outliers?**

**(c) Display the correlations between each of the variables.  Interpret the correlations in the context of the relationships of the predictors to the response and in the context of multicollinearity.**
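
One way to approach (c) is a correlation matrix restricted to the numeric columns. The sketch below uses simulated toy data in which two length measurements are deliberately made nearly collinear; the column values are assumptions, only the column names echo the data description.

```r
# Toy data: two nearly collinear length measurements plus a width (all simulated)
set.seed(2)
len <- runif(50, 10, 40)
toy <- data.frame(
  Total.Length    = len,
  Diagonal.Length = len * 1.1 + rnorm(50, sd = 0.5),
  Width           = len * 0.15 + rnorm(50, sd = 0.8)
)
toy$Weight <- 0.02 * len^3 * exp(rnorm(50, sd = 0.1))

# Keep numeric columns only, then compute pairwise Pearson correlations
num_cols <- sapply(toy, is.numeric)
cor_mat  <- round(cor(toy[, num_cols]), 2)
print(cor_mat)
```

High correlations in the *Weight* row point to useful predictors, while high correlations among the predictors themselves are a warning sign for multicollinearity.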

**(d) Based on this exploratory analysis, is it reasonable to assume a multiple linear regression model for the relationship between *Weight* and the predictor variables?**



# Question 2: Fitting the Multiple Linear Regression Model [11 points]

*Create the full model using the fish data set, without transforming the response variable or the predicting variables.  Do not use fishtest.*

**(a) Build a multiple linear regression model, called model1, using the response and all predictors.  Display the summary table of the model.**

**(b) Is the overall regression significant at an $\alpha$ level of 0.01?**
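
The following sketch illustrates fitting a model on all predictors and reading off the overall F-test, as asked in (a) and (b). It uses simulated toy data, and the object names `fit`, `s`, and `p_overall` are hypothetical.

```r
# Toy data: one categorical and one numeric predictor (simulated)
set.seed(3)
toy <- data.frame(
  Species      = factor(rep(c("A", "B", "C"), each = 20)),
  Total.Length = runif(60, 10, 40)
)
toy$Weight <- 50 * (toy$Species == "B") + 20 * toy$Total.Length +
  rnorm(60, sd = 25)

# "Weight ~ ." regresses Weight on every other column in the data frame
fit <- lm(Weight ~ ., data = toy)
s   <- summary(fit)
print(s)

# The p-value of the overall F-test, recomputed from the F statistic
f <- s$fstatistic
p_overall <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))
p_overall < 0.01
```

The same p-value appears on the last line of the printed summary; recomputing it is only useful when it needs to be compared with $\alpha$ programmatically.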


**(c) What is the coefficient estimate for *Body.Height*? Interpret this coefficient.**


**(d) What is the coefficient estimate for the *Species* category Parkki? Interpret this coefficient.**



# Question 3: Checking for Outliers and Multicollinearity [9 points]

**(a) Create a plot for the Cook's Distances. Using a threshold Cook's Distance of 1, identify the row numbers of any outliers.**
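
A minimal sketch of the Cook's distance check, on simulated toy data with one planted outlier. The variable names and the position of the outlier (row 40) are assumptions chosen for illustration.

```r
# Toy data with a single gross outlier planted at a high-leverage point
set.seed(4)
toy <- data.frame(x = runif(40, 10, 40))
toy$y <- 20 * toy$x + rnorm(40, sd = 25)
toy$x[40] <- 80
toy$y[40] <- 0

fit  <- lm(y ~ x, data = toy)
cook <- cooks.distance(fit)

# Needle plot of the distances with the threshold drawn as a dashed line
plot(cook, type = "h", ylab = "Cook's distance")
abline(h = 1, lty = 2)

# Row positions whose distance exceeds the threshold of 1
outlier_rows <- which(cook > 1)
print(outlier_rows)
```

Rows flagged this way could then be dropped with negative indexing, for example `toy[-outlier_rows, ]`, before refitting.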

**(b) Remove the outlier(s) from the data set and create a new model, called model2, using all predictors with *Weight* as the response.  Display the summary of this model.**

**(c) Display the VIF of each predictor for model2. Using a VIF threshold of $\max(10, 1/(1-R^2))$, what conclusions can you draw?**
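
To make the threshold rule concrete, the sketch below computes VIFs by hand as $1/(1-R_j^2)$ and compares them with $\max(10, 1/(1-R^2))$. It uses simulated numeric predictors only and base R, so it is an illustration of the definition rather than of `vif()` from the car package, which reports generalized VIFs when a model contains a multi-level factor such as *Species*.

```r
# Toy data: two nearly collinear lengths and an unrelated width (simulated)
set.seed(5)
len <- runif(60, 10, 40)
toy <- data.frame(
  Total.Length    = len,
  Diagonal.Length = 1.1 * len + rnorm(60, sd = 0.5),
  Width           = runif(60, 2, 8)
)
toy$Weight <- 15 * len + 30 * toy$Width + rnorm(60, sd = 25)
fit <- lm(Weight ~ ., data = toy)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 is from regressing predictor j on the rest
predictors <- setdiff(names(toy), "Weight")
vif_manual <- sapply(predictors, function(p) {
  aux <- lm(reformulate(setdiff(predictors, p), response = p), data = toy)
  1 / (1 - summary(aux)$r.squared)
})

# Threshold from the question: max(10, 1 / (1 - R^2)) of the fitted model
threshold <- max(10, 1 / (1 - summary(fit)$r.squared))
print(vif_manual)
print(vif_manual > threshold)
```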

# Question 4: Checking Model Assumptions [9 points]

*Please use the cleaned data set, which has the outlier(s) removed, and model2 for answering the following questions.*

**(a) Create scatterplots of the standardized residuals of model2 versus each quantitative predictor. Does the linearity assumption appear to hold for all predictors?**

**(b) Create a scatter plot of the standardized residuals of model2 versus the fitted values of model2.  Does the constant variance assumption appear to hold?  Do the errors appear uncorrelated?**

**(c) Create a histogram and normal QQ plot for the standardized residuals. What conclusions can you draw from these plots?**
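
The three diagnostic plots in this question can be sketched as follows. The data are simulated so that the straight-line fit is visibly inadequate; `fit` and `std_res` are hypothetical names.

```r
# Toy data: curved relationship with spread that grows with x (simulated)
set.seed(6)
toy <- data.frame(x = runif(60, 10, 40))
toy$y <- 0.02 * toy$x^3 * exp(rnorm(60, sd = 0.15))

fit     <- lm(y ~ x, data = toy)
std_res <- rstandard(fit)  # internally standardized residuals

# Linearity: residuals against a predictor should show no pattern
plot(toy$x, std_res, xlab = "x", ylab = "Standardized residuals")
abline(h = 0, lty = 2)

# Constant variance and uncorrelated errors: residuals against fitted values
plot(fitted(fit), std_res, xlab = "Fitted values",
     ylab = "Standardized residuals")
abline(h = 0, lty = 2)

# Normality: histogram and normal QQ plot
hist(std_res, main = "Standardized residuals")
qqnorm(std_res)
qqline(std_res)
```

A curved band in the first two plots suggests a violation of linearity, a funnel shape suggests non-constant variance, and tails leaving the QQ line suggest non-normal errors.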

# Question 5: Partial F-Test [6 points]

**(a) Build a third multiple linear regression model using the cleaned data set without the outlier(s), called model3, using only *Species* and *Total.Length* as predicting variables and *Weight* as the response.  Display the summary table of model3.**

**(b) Conduct a partial F-test comparing model3 with model2. What can you conclude using an $\alpha$ level of 0.01?**
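
A partial F-test compares a reduced model with a full model that nests it. The sketch below shows the mechanics with `anova()` on simulated toy data in which the extra predictor, *Width*, has no true effect by construction; the object names are hypothetical.

```r
# Toy data: Width is simulated independently of Weight
set.seed(7)
toy <- data.frame(
  Species      = factor(rep(c("A", "B", "C"), each = 20)),
  Total.Length = runif(60, 10, 40),
  Width        = runif(60, 2, 8)
)
toy$Weight <- 50 * (toy$Species == "B") + 20 * toy$Total.Length +
  rnorm(60, sd = 25)

full    <- lm(Weight ~ Species + Total.Length + Width, data = toy)
reduced <- lm(Weight ~ Species + Total.Length, data = toy)

# H0: the coefficients of the predictors dropped from the full model are all zero
pf_test <- anova(reduced, full)
print(pf_test)
p_value <- pf_test[["Pr(>F)"]][2]
```

If `p_value` is below $\alpha$, the null hypothesis is rejected and at least one of the dropped predictors adds explanatory power; otherwise the reduced model is not significantly worse.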

# Question 6: Reduced Model Residual Analysis and Multicollinearity Test [10 points]

**(a) Conduct a multicollinearity test on model3.  Comment on the multicollinearity in model3.**

**(b) Conduct residual analysis for model3 (similar to Q4). Comment on each assumption and whether it holds.**

# Question 7: Transformation [12 points]

**(a) Find the optimal lambda, rounded to the nearest 0.5, for a Box-Cox transformation on model3.  What transformation, if any, should be applied according to the lambda value?  Please ensure you use model3.**
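
One possible way to obtain the Box-Cox lambda is `boxcox()` from the MASS package, which is not loaded in the notebook's setup cell, so its use here is an assumption. The sketch runs on simulated toy data with a strictly positive response, and the grid of candidate lambdas is an arbitrary choice.

```r
library(MASS)  # assumption: MASS is installed (it ships with standard R)

# Toy data: log(y) is linear in x, and y is strictly positive (simulated)
set.seed(8)
toy <- data.frame(x = runif(80, 10, 40))
toy$y <- exp(1 + 0.1 * toy$x + rnorm(80, sd = 0.2))
fit <- lm(y ~ x, data = toy)

# Profile log-likelihood over a grid of lambda values
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)
lambda_hat     <- bc$x[which.max(bc$y)]
lambda_rounded <- round(lambda_hat / 0.5) * 0.5  # nearest multiple of 0.5
print(c(lambda_hat = lambda_hat, lambda_rounded = lambda_rounded))
```

By the usual convention, a rounded lambda of 1 means no transformation, 0.5 a square root, 0 a logarithm, and -1 a reciprocal.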

**(b) Based on the results in (a), create model4 with the appropriate transformation. Display the summary.**

**(c) Perform residual analysis on model4. Comment on each assumption.  Was the transformation successful?**

# Question 8: Model Comparison [3 points]

**(a) Using each model summary, compare and discuss the R-squared and Adjusted R-squared of model2, model3, and model4.**
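
The values needed for this comparison can be pulled out of each summary object. The sketch below does so for three hypothetical models fitted to simulated toy data; the names `full`, `reduced`, and `logged` are placeholders, not the assignment's models.

```r
# Toy data: x2 is pure noise, and the response is log-linear in x1 (simulated)
set.seed(11)
toy <- data.frame(x1 = runif(50), x2 = runif(50))
toy$y <- exp(1 + 2 * toy$x1 + rnorm(50, sd = 0.3))

models <- list(
  full    = lm(y ~ x1 + x2, data = toy),
  reduced = lm(y ~ x1, data = toy),
  logged  = lm(log(y) ~ x1, data = toy)
)

# One row per model: R-squared and adjusted R-squared
r2_table <- t(sapply(models, function(m) {
  s <- summary(m)
  c(R2 = s$r.squared, Adj.R2 = s$adj.r.squared)
}))
print(round(r2_table, 3))
```

R-squared never decreases when predictors are added to a nested model, which is why the adjusted version is the fairer comparison between models of different size. For a model with a transformed response, both measures refer to the transformed scale, so they are not directly comparable with those of an untransformed model.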



# Question 9: Estimation and Prediction [10 points]

**(a) Estimate Weight for the last 10 rows of data (fishtest) using both model3 and model4.  Compare and discuss the mean squared prediction error (MSPE) of both models.**
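
The MSPE is the mean of the squared differences between observed and predicted responses on held-out data. The sketch below computes it for two hypothetical models on simulated toy data. It assumes the transformed model uses a log response, so its predictions are exponentiated back to the original scale before comparison; a different transformation would need a different back-transform.

```r
# Toy data split into 50 training rows and 10 test rows (simulated)
set.seed(9)
toy <- data.frame(x = runif(60, 10, 40))
toy$y <- exp(1 + 0.1 * toy$x + rnorm(60, sd = 0.2))
train <- toy[1:50, ]
test  <- toy[51:60, ]

fit_raw <- lm(y ~ x, data = train)
fit_log <- lm(log(y) ~ x, data = train)  # assumed log transformation

pred_raw <- predict(fit_raw, newdata = test)
pred_log <- exp(predict(fit_log, newdata = test))  # back to the original scale

# Mean squared prediction error on the held-out rows
mspe <- function(obs, pred) mean((obs - pred)^2)
mspe_raw <- mspe(test$y, pred_raw)
mspe_log <- mspe(test$y, pred_log)
print(c(raw = mspe_raw, log = mspe_log))
```

Both errors are measured in squared units of the original response, which is what makes them comparable.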

**(b) Suppose you have found a Perch fish with a Body.Height of 28 cm, and a Total.Length of 32 cm. Using model4, predict the weight of this fish with a 90% prediction interval.  Provide an interpretation of the prediction interval.**
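
A prediction interval for a new observation can be requested from `predict()` with `interval = "prediction"`. The sketch below uses simulated toy data and assumes a log-transformed response, so the interval endpoints are exponentiated to return to grams; the species `Roach` and all simulated values are placeholders.

```r
# Toy data: two species and a log-linear weight relationship (simulated)
set.seed(10)
toy <- data.frame(
  Species      = factor(rep(c("Perch", "Roach"), each = 30)),
  Total.Length = runif(60, 10, 40)
)
toy$Weight <- exp(1 + 0.3 * (toy$Species == "Perch") +
                  0.1 * toy$Total.Length + rnorm(60, sd = 0.2))

fit <- lm(log(Weight) ~ Species + Total.Length, data = toy)

# New observation; the factor levels must match those used in fitting
new_fish <- data.frame(
  Species      = factor("Perch", levels = levels(toy$Species)),
  Total.Length = 32
)

# 90% prediction interval on the log scale, then back-transformed
pi_log    <- predict(fit, newdata = new_fish,
                     interval = "prediction", level = 0.90)
pi_weight <- exp(pi_log)
print(pi_weight)
```

Only the predictors that appear in the model formula need to be supplied in `newdata`; any other measurement of the new fish is ignored by `predict()`.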