# Welcome to Recitation 10!


# **Go Read `Rec10-Intro.pdf` if you have not already.**

This lab will not make sense without it! 

## RECITATION ASSIGNMENT

This lab project will guide you through the analysis.

## Data

To obtain data on how the mold variables affect pouring and cooling, a batch
of 100 castings was poured with random variations in the mold variables about their baseline
values. The data are available in the file `castdata.csv` on GitHub. Each row contains the parameter
values (the inputs) and the cast batch time. The first line in the file contains the header
with the names of the variables. The data start in the second row. The first row of data has
the baseline values, that is, the values of the variables used in the current casting approach.

## Variables

The following variables can be varied: `Riser Height`, `Riser Diameter`, `Riser 1 Position`, `Riser 2 Position`, `Gate Diameter`, `Cup Height`, `Sprue Height`, `Sprue Diameter Bottom`, and `Sprue Diameter Top` (see Figure 2). The response variable is “`BatchTime`”.

## Importing Data

We first need to import the libraries necessary for this recitation. Next, we have to load our data into the notebook: upload the data set `castdata.csv` from the course GitHub. We will call the entire data set `df`.

In [0]:
#math and arrays
import numpy as np
#dataframes
import pandas as pd
#plotting
import matplotlib.pyplot as plt
import seaborn as sns
#linear regression and model selection 
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error


In [0]:
# 'df' is common pandas dataframe nomenclature
df = pd.read_csv('castdata.csv')
df.head()

## Visual Analysis

To examine the relationships between all pairs of variables, plot the dataframe with the `pairplot` command from the seaborn library. The column "`feasible`" is left out by slicing with `df.loc`, which selects columns by label; unlike a regular Python list slice, a label slice includes its end point, so `:'BatchTime'` keeps every column up to and including `BatchTime`:

In [0]:
sns.pairplot(df.loc[:, :'BatchTime'])

When you pairplot a dataframe you get a scatterplot matrix. This matrix allows you to visualize all pairwise relationships between variables.
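
The label-based slice in `df.loc[:, :'BatchTime']` behaves differently from a positional slice. The sketch below illustrates this on a toy dataframe; the frame `toy` and its values are made up for illustration and are not taken from `castdata.csv`.

```python
import pandas as pd

# Hypothetical toy frame; the column names only mimic the lab's layout.
toy = pd.DataFrame({
    "GateDiam": [1.0, 1.2],
    "BatchTime": [30.0, 28.5],
    "feasible": [1, 0],
})

# Label-based slice: the end label 'BatchTime' IS included.
by_label = toy.loc[:, :"BatchTime"]

# Position-based slice: the end position 2 is NOT included.
by_position = toy.iloc[:, :2]

print(list(by_label.columns))     # ['GateDiam', 'BatchTime']
print(list(by_position.columns))  # ['GateDiam', 'BatchTime']
```

Both slices drop the trailing `feasible` column, but the label slice stops at a named column while the positional slice stops before a column number.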

## Question 1

Based on the scatterplot matrix, which variable of the first 9 in df most affects "`BatchTime`"?

Ans:

## Multiple Linear Regression

We will now fit a multiple linear regression model for batch time using the mold variables as predictors. The following code uses the statsmodels library, imported above as `sm`. We assign `batchtime` to be our dependent variable and `X` to be our independent variables by slicing our dataframe as we did before. The line `X = sm.add_constant(X)` adds a column of all ones called `const` to `X`. This is needed to fit the intercept.

Next, we create a regression model and get the results of the regression using `model.fit()`. Finally, we summarize these results. Run the code below to view the summary table.

In [0]:
batchtime = df["BatchTime"]
X = df.loc[:, :'Riser2Pos']
X = sm.add_constant(X) 

model = sm.OLS(batchtime, X)
results = model.fit()

# Print out the stats
results.summary()

This summary output may seem intimidating at first; let's focus on the section that lists each independent variable in a row, with columns "coef", "std err", and so on. Specifically, we are interested in the column titled $P>|t|$, which gives the p-value for that variable: the smaller the p-value, the stronger the evidence that the variable's coefficient is nonzero. For example, a p-value of .01 or less indicates statistical significance at the 99% level. To judge how strongly a variable affects `BatchTime`, look at the magnitude of its coefficient, which is the predicted change in `BatchTime` for a one-unit increase in that variable with the others held fixed. A large coefficient means that a small change in that variable produces a large change in our dependent variable, `BatchTime`; keep in mind that the size of a coefficient depends on the units of its variable.

## Question 2
Observe the summary output in the model for `BatchTime`. Which variables appear to be statistically significant at the 95% level, and how do they affect the `BatchTime`? Which predictor(s) have the largest effect on `BatchTime`?

Ans:

Now, for each of our variables, we would like to plot the residuals from our regression (the actual `BatchTime` minus the fitted value) against that variable using Matplotlib; run the code below and observe the graphs.

In [0]:
predictors = X.columns.drop('const') # skip the intercept column of ones
fig, ax = plt.subplots(len(predictors), figsize=(10,45))
resid = results.resid
for i, name in enumerate(predictors): # for each of the nine variables
  # create a residual plot for that variable ('i' indexes the subplots)
  ax[i].scatter(X[name], resid)
  ax[i].set_xlabel(name)
  ax[i].set_ylabel("residual")

## Question 3

Examine the nine plots. Do any of them show nonlinear patterns? That is, does the difference between the fitted and actual values change systematically with the x value? Or are the residuals scattered evenly around zero across the plot? Describe the nonlinear patterns, if any, that you see.


Ans: 

## Question 4


Are the other assumptions about the errors made by linear regression satisfied? In particular, address the following questions: Do the residuals appear mutually independent? Are there any problems with non-constant variance (heteroscedasticity)? Do the residuals appear normally distributed?

Hint 1: You may want to refer to the `test-assumptions.ipynb` demo on GitHub for ideas of visualizations that are helpful to answer these questions.

Hint 2: Normally distributed residuals will fall along a straight line on a Q-Q plot, though not necessarily the 45-degree line. The slope depends on the standard deviation of the residuals. Try running the code below with various choices of `sigma` to convince yourself of this.

In [0]:
#Code for Hint 2

sigma = 1 #rerun this code with various choices of sigma, e.g. 0.5, 5

n=500
eps = sigma * np.random.randn(n) # normal residuals 
x = 10*np.random.rand(n)
y = x + eps
model = sm.OLS(y,x).fit()

# scipy.stats.probplot(data, dist="norm", plot=plt);
plt.hist(eps)
sm.qqplot(model.resid, line='45');
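
The relationship between `sigma` and the slope of the Q-Q line can also be checked without a plot. This sketch assumes SciPy is available and uses `scipy.stats.probplot`, which returns the least-squares line through the Q-Q points; the sample size and seed are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
slopes = {}
for sigma in (0.5, 1.0, 5.0):
    sample = sigma * rng.standard_normal(2000)
    # probplot returns ((theoretical, ordered), (slope, intercept, r)).
    (_, _), (slope, intercept, r) = stats.probplot(sample, dist="norm")
    slopes[sigma] = slope
    print(sigma, round(slope, 2))
```

Each fitted slope should land close to the `sigma` used to generate the sample, which is why only residuals with standard deviation 1 follow the 45-degree line.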

Ans:

## Adding nonlinear terms

You may have noticed that `BatchTime` is nonlinearly related to `GateDiam`. Use the following commands to repeat the regression analysis with quadratic and cubic terms.

In [0]:
X['GateDiamSquared'] = X['GateDiam']**2 #add quadratic term for GateDiam
X['GateDiamCubic'] = X['GateDiam']**3  #add cubic term
model2 = sm.OLS(batchtime, X)
results2 = model2.fit()

# Print out the stats
results2.summary()

## Question 5

Based on the new model we have just fit, are the assumptions about the errors satisfied? In particular, address the following questions: Do the residuals reveal any problems with the model? Are there any problems with non-constant variance (heteroscedasticity)? Do the residuals appear normally distributed?



Ans:

## Model Selection

Now we want to see if there is a good model that is simpler, i.e., one that uses fewer features.

First we must split our data into training and testing sets so that we can test our models accurately. 

Run the following code to split your data. 


In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, batchtime, test_size=0.25, random_state=0)
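
To make the effect of `test_size` and `random_state` concrete, this sketch splits a made-up array of 100 rows; the arrays `features` and `target` are hypothetical stand-ins for `X` and `batchtime`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

features = np.arange(200).reshape(100, 2)  # 100 hypothetical rows, 2 columns
target = np.arange(100)                    # row i has target i

f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=0.25, random_state=0
)
print(f_train.shape, f_test.shape)  # (75, 2) (25, 2)

# Rows are shuffled, but each feature row stays paired with its target.
print(bool((f_train[:, 0] == 2 * t_train).all()))
```

Fixing `random_state` makes the shuffle reproducible, so rerunning the cell gives the same training and testing rows.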

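## Model Selection Using AIC

Run the following code to select your model on the training set using AIC, then test it on the testing set. The function `minAIC` performs backward elimination: starting from all the features, it repeatedly drops the feature with the largest p-value for as long as doing so lowers the AIC.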

In [0]:
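def minAIC(X,y):
    """Backward elimination: drop the largest-p-value feature while AIC improves."""
    variables = X.columns
    model = sm.OLS(y,X[variables]).fit() # start from the full model
    while True:
        # candidate model: remove the feature with the largest p-value
        maxp = np.max(model.pvalues)
        new_variables = variables[model.pvalues < maxp]
        newmodel = sm.OLS(y,X[new_variables]).fit()
        if newmodel.aic < model.aic:
            # lower AIC is better, so keep the smaller model and continue
            model = newmodel
            variables = new_variables
        else:
            # removing the feature did not lower the AIC, so stop
            break
    return model,variables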

# select on training set 
model,variables = minAIC(X_train, y_train)
print(variables)


y_pred = model.predict(X_test[variables])
print(mean_squared_error(y_test,y_pred))

## Question 6

Which features does AIC select? What is the mean squared error for the testing set?

Ans:

## Question 7

Now train a model on the training set that uses all the features. 

In [0]:
#Code here

## Question 8

What is the mean squared error for your new model? Which model do you prefer and why?

In [0]:
#Code for MSE here

Ans: