# STAT 207 Homework 6 [25 points]

## Feature Selection for Linear Models

Due: Friday, October 6, end of day (11:59 pm CT)

<hr>

## Imports 

Run the following code cell to import the necessary packages into the file.  You may import additional packages, as needed for this assignment.

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import statsmodels.formula.api as smf

## The Data

A famous study called "SUPPORT" (Study to Understand Prognoses Preferences Outcomes and Risks of Treatment) was conducted to determine what factors affected or predicted outcomes, including how long a patient remained in the hospital.

We will use a random sample of 580 seriously ill hospitalized patients from the SUPPORT study, with the following variables:

- **Days**: days to death or hospital discharge
- **Age**: age on day of hospital admission
- **Sex**: female or male
- **Comorbidity**: patient diagnosed with more than one chronic disease
- **EdYears**: years of education
- **Education**: education level: high or low
- **Income**: income level: high or low
- **Charges**: hospital charges, in dollars
- **Care**: level of care required: high or low
- **Race**: Non-white or white
- **Pressure**: Blood pressure, in mmHg
- **Blood**: white blood cell count, in gm/dL
- **Rate**: heart rate, in bpm

This is the same data studied in Homework 5.

Run the code in the cell below to read in the cleaned data for this document.  The data is saved as `df` with this code.  

In [3]:
df = pd.read_csv('hospital.csv')
df

Unnamed: 0,Days,Age,Sex,Comorbidity,EdYears,Education,Income,Charges,Care,Race,Pressure,Blood,Rate
0,8,42.258972,female,no,11,low,high,9914.0,low,non-white,84,11.298828,94
1,14,63.662994,female,no,22,high,high,283303.0,high,white,69,30.097656,108
2,21,41.521973,male,yes,18,high,high,320843.0,high,white,66,0.199982,130
3,4,41.959991,male,yes,16,high,high,4173.0,low,white,97,10.798828,88
4,11,52.089996,male,yes,8,low,high,13414.0,low,white,89,6.399414,92
...,...,...,...,...,...,...,...,...,...,...,...,...,...
575,13,70.091980,female,yes,9,low,low,14271.0,high,white,65,38.398438,104
576,4,81.721985,male,yes,8,low,low,3043.0,low,white,71,4.599609,67
577,3,74.521973,female,yes,8,low,high,2172.0,low,white,99,7.599609,110
578,6,49.116974,male,yes,8,low,low,5799.0,low,white,108,8.898438,87


## 1. Research Purpose [2 points]

One important step before beginning our analysis is to understand the purpose for the research.  This helps us to evaluate the importance of characteristics in our final model.  

For each of the questions below, indicate the **purpose** of the study (prediction or understanding structures) and what the **focus** of the final fitted model should be (small errors or low model complexity).

**a)** The hospital administration would like to estimate when a bed will be available for the next patient, so that they can report their availability for new patients and determine if they should add more beds to a ward.  To answer this question, they will search for an appropriate linear model using some of the variables above.

The purpose of this study is prediction: the administration wants to predict when a bed will become available for the next patient. The focus of the fitted model should be on small errors, because large prediction errors would produce inaccurate estimates of bed availability. Keeping the errors small makes the model reliable enough to report availability and to plan ward capacity.

**b)** A health administrator would like to develop an intervention to shorten the length of hospital stays.  They would like to determine patient characteristics associated with longer hospital stays, so that they can target their intervention to those with the most at-risk demographic and health characteristics.  To answer this question, they will search for an appropriate linear model using some of the variables above.

The purpose of this study is understanding structures: the administrator wants to determine which patient characteristics are associated with longer hospital stays, not to make specific predictions. The focus of the fitted model should be on low model complexity, because a simpler model is easier to interpret and makes it easier to see which characteristics are related to the length of stay.

## 2. Fitting our Full Model [3 points]

We will begin by fitting a model that incorporates all of our possible predictor variables (as specified below).  

**a)** Create a training and testing dataset for this data, using a random state of 9876 with 80% of the observations in the training set and 20% in the test set.  **Note:** this is different from the training and test set separation in Homework 5.

In [4]:
df_train, df_test = train_test_split(df, test_size=0.2, random_state=9876)
df_train

Unnamed: 0,Days,Age,Sex,Comorbidity,EdYears,Education,Income,Charges,Care,Race,Pressure,Blood,Rate
163,11,66.345947,female,yes,14,high,high,12506.00000,low,white,103,7.500000,100
0,8,42.258972,female,no,11,low,high,9914.00000,low,non-white,84,11.298828,94
519,10,64.245972,female,yes,12,low,low,6996.00000,low,white,70,6.399414,130
15,62,38.496979,male,yes,17,high,high,463001.00000,high,white,73,10.099609,130
189,21,57.647980,female,yes,12,low,high,45167.53125,high,white,78,11.000000,112
...,...,...,...,...,...,...,...,...,...,...,...,...,...
405,3,75.096985,male,yes,12,low,low,1680.00000,low,white,99,9.099609,93
373,7,58.400970,female,yes,14,high,low,5292.00000,low,white,99,3.299805,80
27,8,66.140991,male,yes,16,high,high,21075.00000,low,white,108,14.099609,132
391,4,62.452972,male,yes,11,low,low,3041.00000,low,non-white,106,6.699219,126


In [5]:
df_test

Unnamed: 0,Days,Age,Sex,Comorbidity,EdYears,Education,Income,Charges,Care,Race,Pressure,Blood,Rate
427,18,80.462952,female,yes,8,low,low,30255.00000,high,white,124,9.798828,45
385,18,35.060974,female,yes,14,high,high,19369.00000,high,white,72,0.149994,96
380,3,60.566986,female,yes,10,low,low,2917.00000,low,white,87,6.599609,87
2,21,41.521973,male,yes,18,high,high,320843.00000,high,white,66,0.199982,130
211,12,56.308990,female,yes,12,low,high,16926.09375,low,white,81,11.298828,118
...,...,...,...,...,...,...,...,...,...,...,...,...,...
476,6,76.823975,female,yes,12,low,low,7858.00000,low,non-white,71,18.398438,80
400,5,31.202988,male,yes,13,high,low,7541.00000,high,white,77,18.000000,114
418,40,64.859985,male,no,12,low,high,110427.00000,high,white,118,8.099609,64
354,4,67.101990,male,yes,10,low,low,3856.00000,low,non-white,77,6.500000,50


**b)** For the purposes of this question, we want to be sure that we only include variables that could be determined or anticipated upon admittance to the hospital, so that our model can be applied to new patients.

Fit a linear model to the training data to predict the length of the hospital stay (**Days**) with the following predictor variables: Age, Sex, Comorbidity, EdYears, Education, Income, Care, Race, Pressure, Blood, and Rate.

In [6]:
fit_model = smf.ols("Days ~ Age + Sex + Comorbidity + EdYears + Education + Income + Care + Race + Pressure + Blood + Rate", df_train).fit()
fit_model.summary()

0,1,2,3
Dep. Variable:,Days,R-squared:,0.15
Model:,OLS,Adj. R-squared:,0.129
Method:,Least Squares,F-statistic:,7.23
Date:,"Fri, 06 Oct 2023",Prob (F-statistic):,2.13e-11
Time:,14:45:19,Log-Likelihood:,-2042.3
No. Observations:,464,AIC:,4109.0
Df Residuals:,452,BIC:,4158.0
Df Model:,11,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,16.3417,9.863,1.657,0.098,-3.041,35.724
Sex[T.male],-2.3662,1.912,-1.238,0.216,-6.124,1.391
Comorbidity[T.yes],-2.1125,3.139,-0.673,0.501,-8.281,4.056
Education[T.low],0.1577,2.969,0.053,0.958,-5.677,5.992
Income[T.low],-1.5625,2.090,-0.748,0.455,-5.670,2.545
Care[T.low],-11.5546,2.048,-5.641,0.000,-15.580,-7.529
Race[T.white],3.4544,2.379,1.452,0.147,-1.222,8.131
Age,-0.0800,0.063,-1.270,0.205,-0.204,0.044
EdYears,-0.3958,0.392,-1.011,0.313,-1.165,0.374

0,1,2,3
Omnibus:,390.399,Durbin-Watson:,1.981
Prob(Omnibus):,0.0,Jarque-Bera (JB):,7914.321
Skew:,3.615,Prob(JB):,0.0
Kurtosis:,21.897,Cond. No.,1600.0


**c)** What proportion of the variability of the length of the hospital stay can be explained by this linear model?

In [7]:
fit_model.rsquared

0.1496322662718187

## 3. Checking Model Assumptions for our Full Model [4.5 points]

In Question 6 of Homework 5, we confirmed that two required assumptions are met (at least for the model that we fit last week).  In this question, we will check for strong multicollinearity to determine if our model is appropriate.

**a)** Do you have evidence of strong multicollinearity between your quantitative predictor variables in the training data?  You can either make a graph or calculate summary statistics to answer this question.

In [8]:
df_train.corr(numeric_only = True)

Unnamed: 0,Days,Age,EdYears,Charges,Pressure,Blood,Rate
Days,1.0,-0.097817,-0.043803,0.625903,0.142552,0.098109,0.214445
Age,-0.097817,1.0,-0.141759,-0.131381,-0.032725,0.060425,-0.170896
EdYears,-0.043803,-0.141759,1.0,0.064607,-0.041573,-0.081412,-0.050414
Charges,0.625903,-0.131381,0.064607,1.0,0.002613,0.120632,0.259968
Pressure,0.142552,-0.032725,-0.041573,0.002613,1.0,0.04519,0.07025
Blood,0.098109,0.060425,-0.081412,0.120632,0.04519,1.0,0.151982
Rate,0.214445,-0.170896,-0.050414,0.259968,0.07025,0.151982,1.0


**b)** Briefly interpret your results from **part a** to answer whether you have strong multicollinearity, and explain your reasoning.

Based on my results from part a, I don't have strong multicollinearity between my quantitative predictor variables. Strong multicollinearity would show up as a correlation close to 1 or -1 between two predictors. Among the quantitative predictors used in Question 2b, the largest correlation in absolute value is -0.17 (Age and Rate), and even the largest correlation involving Charges is only 0.26 (Charges and Rate).
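The check in part a can also be done programmatically instead of by eye. The sketch below is a minimal illustration, not part of the assignment: it copies the predictor-only entries of the printed correlation matrix (the five quantitative predictors from Question 2b) into a plain dictionary and finds the pair with the largest absolute correlation. The 0.7 cutoff is an assumed rule of thumb, not a value given in the assignment.

```python
from itertools import combinations

# Pairwise correlations copied from the training-set matrix printed in part a,
# restricted to the quantitative predictors used in Question 2b.
corr = {
    ("Age", "EdYears"): -0.141759,
    ("Age", "Pressure"): -0.032725,
    ("Age", "Blood"): 0.060425,
    ("Age", "Rate"): -0.170896,
    ("EdYears", "Pressure"): -0.041573,
    ("EdYears", "Blood"): -0.081412,
    ("EdYears", "Rate"): -0.050414,
    ("Pressure", "Blood"): 0.04519,
    ("Pressure", "Rate"): 0.07025,
    ("Blood", "Rate"): 0.151982,
}

predictors = ["Age", "EdYears", "Pressure", "Blood", "Rate"]
# Sanity check: every unordered pair of predictors has an entry.
assert all(pair in corr for pair in combinations(predictors, 2))

THRESHOLD = 0.7  # assumed rule-of-thumb cutoff for a "strong" correlation

# Pair with the largest correlation in absolute value.
strongest_pair = max(corr, key=lambda pair: abs(corr[pair]))
strongest_r = corr[strongest_pair]
# Pairs that would be flagged as strongly correlated.
flagged = [pair for pair, r in corr.items() if abs(r) >= THRESHOLD]

print(strongest_pair, strongest_r)
print("pairs at or above threshold:", flagged)
```

Pairwise correlations only detect relationships between two predictors at a time, so an empty list here does not rule out a predictor being nearly a combination of several others.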

**c)** We don't have an easy measure of multicollinearity for categorical variables, so we'll assess this one based on the meaning of our categorical variables.  Considering our categorical variables, what relationships might exist between categorical variables or between a categorical and a quantitative variable that you are concerned about?  Select at least one pair, explain why you are concerned, and examine the relationship visually or with summary statistics.

I am concerned about the relationship between Charges and Days because their correlation coefficient is 0.63, the largest off-diagonal value in the correlation table. A correlation close to 1 or -1 between two variables can indicate multicollinearity.

In [9]:
df_train[['Charges','Days']].corr(numeric_only = True)

Unnamed: 0,Charges,Days
Charges,1.0,0.625903
Days,0.625903,1.0
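Part c asks about pairs that involve a categorical variable. One natural candidate is Education (high or low) together with EdYears, since the level may simply be a coarsened version of the years. The sketch below is only an illustration of how such a pair could be summarized: it uses the nine (EdYears, Education) rows visible in the printed preview of `df`, not the full training set, so it cannot settle the question.

```python
from collections import defaultdict
from statistics import mean

# (EdYears, Education) pairs copied from the nine rows visible in the printed
# preview of df; this is NOT the full training set.
preview_rows = [
    (11, "low"), (22, "high"), (18, "high"), (16, "high"), (8, "low"),
    (9, "low"), (8, "low"), (8, "low"), (8, "low"),
]

# Group the years of education by education level.
years_by_level = defaultdict(list)
for ed_years, level in preview_rows:
    years_by_level[level].append(ed_years)

# Summary statistics of EdYears within each Education level.
summary = {
    level: {"n": len(vals), "min": min(vals), "mean": mean(vals), "max": max(vals)}
    for level, vals in years_by_level.items()
}
for level, stats in summary.items():
    print(level, stats)
```

In the notebook, the analogous summary on the real data would presumably be `df_train.groupby('Education')['EdYears'].describe()`. If the two groups barely overlap in the full data, as they do in these few rows, including both variables in one model would add largely redundant information.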


**d)** Finally, explain why multicollinearity is a concern for linear models.

Multicollinearity is a concern for linear models because it makes it harder to estimate each slope coefficient precisely: the standard errors of the coefficients for the correlated predictors increase. It can also lead to interpretations that don't make sense. For example, a coefficient may have the opposite sign from what you would expect, which makes it difficult to draw meaningful conclusions about the relationship between each predictor and the response.
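The effect on standard errors can be written out explicitly. The notation below is not from the assignment and is introduced here only for illustration. Under the usual linear model assumptions, the sampling variance of the $j$-th slope estimate is

$$
\operatorname{Var}(\hat{\beta}_j) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2} \cdot \frac{1}{1 - R_j^2},
$$

where $\sigma^2$ is the error variance and $R_j^2$ is the $R^2$ obtained by regressing predictor $j$ on all of the other predictors. The second factor is the variance inflation factor: when predictor $j$ is nearly a linear combination of the others, $R_j^2$ approaches 1 and the variance of $\hat{\beta}_j$ grows without bound.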

## 4. Metrics for Model Fit [3.5 points]

We know that our model from Question 2 has a large number of predictor variables.  We would like to select the right set of predictor variables so that our model is parsimonious.  We'll perform some model selection procedures to do so in this question and the next one.

Before we fit multiple models, let's use our theoretical understanding to answer the following questions.

**a)** How many possible models could be fit from the predictors allowed in Question 2b, assuming that we don't have any interaction terms between predictor variables?

In [10]:
2 ** 11  # 11 candidate predictors, each either included or excluded

2048
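Each predictor is either included in a model or left out, so $k$ candidate predictors give $2^k$ possible models. A quick way to double-check a count like this is to enumerate the subsets directly. The sketch below does so for the eleven predictors named in Question 2b. Counting the empty, intercept-only subset as a model is a convention assumed here.

```python
from itertools import combinations

# Predictors allowed in Question 2b.
predictors = ["Age", "Sex", "Comorbidity", "EdYears", "Education", "Income",
              "Care", "Race", "Pressure", "Blood", "Rate"]

# Every subset of the predictors (including the empty, intercept-only subset)
# corresponds to one candidate model without interaction terms.
subsets = [
    combo
    for size in range(len(predictors) + 1)
    for combo in combinations(predictors, size)
]

n_models = len(subsets)
print(len(predictors), n_models)
```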

**b)** Without fitting any models, can we determine the model that will have the largest $R^2$?  If so, explain which model will have the largest $R^2$.  If not, explain why not.

Yes. The model that includes all 11 predictors will have the largest $R^2$. Adding a predictor to a least squares model can never increase the residual sum of squares, so $R^2$ never decreases when a variable is added. The full model's $R^2$ is therefore at least as large as that of any model built from a subset of its predictors.

**c)** Without fitting any models, can we determine the model that will have the largest adjusted $R^2$, or $R^2_{\text{adj}}$?  If so, explain which model will have the largest $R^2_{\text{adj}}$.  If not, explain why not.

No. Unlike $R^2$, the adjusted $R^2$ penalizes each additional predictor, so adding a variable raises it only when the improvement in fit outweighs the penalty. Whether that happens depends on how much each predictor explains in this particular dataset, so we have to fit the models and compare their adjusted $R^2$ values to find the largest one.

**d)** What two model characteristics are balanced in the measure of the $R^2_{\text{adj}}$?

Goodness of fit and model complexity.
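The balance between the two can be seen in the usual formula, $R^2_{\text{adj}} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, where $n$ is the number of observations and $p$ is the number of slope terms (symbols introduced here for illustration). The sketch below implements the formula and checks it against the values reported for the full model in Question 2. The function name is hypothetical.

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R^2 for a model with an intercept and n_predictors slope terms."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Values reported for the full model in Question 2.
r2_full = 0.1496322662718187   # fit_model.rsquared
n_obs = 464                    # "No. Observations" in the summary
df_model = 11                  # "Df Model" in the summary

adj = adjusted_r2(r2_full, n_obs, df_model)
print(round(adj, 3))  # 0.129, the "Adj. R-squared" shown in the summary
```

A larger $R^2$ pushes the value up (goodness of fit), while a larger $p$ increases the multiplier $\frac{n - 1}{n - p - 1}$ and pushes it down (model complexity).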

## 5. Selecting a Parsimonious Model [7 points]

Now, we'll begin to search for our parsimonious model.  

**a)** First, we'll perform model selection using backwards elimination and using the $R^2_{\text{adj}}$ as our metric using our training data.  Be sure to show your work.  That is, don't delete any models that you fit.  Add as many code cells below as you need.

In [11]:
current_model = smf.ols(formula='Days~ Age+ Sex+ Comorbidity+ EdYears+ Education+ Income+ Charges+ Care+ Race+ Pressure+ Blood+ Rate', data=df_train).fit()
current_model.rsquared_adj

0.41670841685242455

In [12]:
#remove age
test_model = smf.ols(formula='Days ~ Sex+ Comorbidity+ EdYears+ Education+ Income+ Charges+ Care+ Race+ Pressure+ Blood+ Rate', data=df_train).fit()
test_model.rsquared_adj

0.4173124379412425

In [13]:
#remove sex
test_model = smf.ols(formula='Days ~ Age+ Comorbidity+ EdYears+ Education+ Income+ Charges+ Care+ Race+ Pressure+ Blood+ Rate', data=df_train).fit()
test_model.rsquared_adj

0.4166373960985531

In [14]:
#remove comorbidity
test_model = smf.ols(formula='Days ~ Age+ Sex + EdYears+ Education+ Income+ Charges+ Care+ Race+ Pressure+ Blood+ Rate', data=df_train).fit()
test_model.rsquared_adj

0.41572660981679455

In [15]:
#remove edyears
test_model = smf.ols(formula='Days ~ Age+ Comorbidity+ Sex + Education+ Income+ Charges+ Care+ Race+ Pressure+ Blood+ Rate', data=df_train).fit()
test_model.rsquared_adj

0.416572044595623

In [16]:
#remove education
test_model = smf.ols(formula='Days ~ Age+ Comorbidity+ Sex + EdYears+ Income+ Charges+ Care+ Race+ Pressure+ Blood+ Rate', data=df_train).fit()
test_model.rsquared_adj

0.4178507384285448

In [17]:
#remove income
test_model = smf.ols(formula='Days ~ Age+ Comorbidity+ Sex + EdYears+ Education+ Charges+ Care+ Race+ Pressure+ Blood+ Rate', data=df_train).fit()
test_model.rsquared_adj

0.4156311100366198

In [18]:
#remove charges
test_model = smf.ols(formula='Days ~ Age+ Comorbidity+ Sex + EdYears+ Education+ Income+ Care+ Race+ Pressure+ Blood+ Rate', data=df_train).fit()
test_model.rsquared_adj

0.12893747629170826

In [19]:
#remove Care
test_model = smf.ols(formula='Days ~ Age+ Comorbidity+ Sex + EdYears+ Education+ Income+ Charges+ Race+ Pressure+ Blood+ Rate', data=df_train).fit()
test_model.rsquared_adj

0.417907796496374

In [20]:
#remove race
test_model = smf.ols(formula='Days ~ Age+ Comorbidity+ Sex + EdYears+ Education+ Income+ Charges+ Care+ Pressure+ Blood+ Rate', data=df_train).fit()
test_model.rsquared_adj

0.4115480863926647

In [21]:
#remove pressure
test_model = smf.ols(formula='Days ~ Age+ Comorbidity+ Sex + EdYears+ Education+ Income+ Charges+ Care+ Race+ Blood+ Rate', data=df_train).fit()
test_model.rsquared_adj

0.3989871528522797

In [22]:
#remove blood
test_model = smf.ols(formula='Days ~ Age+ Comorbidity+ Sex + EdYears+ Education+ Income+ Charges+ Care+ Race+ Pressure+ Rate', data=df_train).fit()
test_model.rsquared_adj

0.41783289067385465

In [23]:
#remove rate
test_model = smf.ols(formula='Days ~ Age+ Comorbidity+ Sex + EdYears+ Education+ Income+ Charges+ Care+ Race+ Pressure+ Blood', data=df_train).fit()
test_model.rsquared_adj

0.4172637897393483

In [24]:
#remove care from current model
current_model = smf.ols(formula='Days ~ Age+ Comorbidity+ Sex + EdYears+ Education+ Income+ Charges+ Race+ Pressure+ Blood+ Rate', data=df_train).fit()
current_model.rsquared_adj

0.417907796496374

In [25]:
#remove age
test_model = smf.ols(formula='Days ~ Comorbidity+ Sex + EdYears+ Education+ Income+ Charges+ Race+ Pressure+ Blood+ Rate', data=df_train).fit()
test_model.rsquared_adj

0.4184759104555196

In [26]:
#remove comorbidity
test_model = smf.ols(formula='Days ~ Age+ Sex + EdYears+ Education+ Income+ Charges+ Race+ Pressure+ Blood+ Rate', data=df_train).fit()
test_model.rsquared_adj

0.41659953250528947

In [27]:
#remove sex
test_model = smf.ols(formula='Days ~ Age+ Comorbidity + EdYears+ Education+ Income+ Charges+ Race+ Pressure+ Blood+ Rate', data=df_train).fit()
test_model.rsquared_adj

0.41782502866468707

In [28]:
#remove edyears
test_model = smf.ols(formula='Days ~ Age+ Comorbidity + Sex+ Education+ Income+ Charges+ Race+ Pressure+ Blood+ Rate', data=df_train).fit()
test_model.rsquared_adj

0.41775924937552156

In [29]:
#remove education
test_model = smf.ols(formula='Days ~ Age+ Comorbidity + Sex+ EdYears+ Income+ Charges+ Race+ Pressure+ Blood+ Rate', data=df_train).fit()
test_model.rsquared_adj

0.41905038420578156

In [30]:
#remove income
test_model = smf.ols(formula='Days ~ Age+ Comorbidity + Sex+ EdYears+ Education+ Charges+ Race+ Pressure+ Blood+ Rate', data=df_train).fit()
test_model.rsquared_adj

0.4167869504592884

In [31]:
#remove charges
test_model = smf.ols(formula='Days ~ Age+ Comorbidity + Sex+ EdYears+ Education+ Income+ Race+ Pressure+ Blood+ Rate', data=df_train).fit()
test_model.rsquared_adj

0.06967923272540011

In [32]:
#remove race
test_model = smf.ols(formula='Days ~ Age+ Comorbidity + Sex+ EdYears+ Education+ Income+ Charges+ Pressure+ Blood+ Rate', data=df_train).fit()
test_model.rsquared_adj

0.4126971037695075

In [33]:
#remove pressure
test_model = smf.ols(formula='Days ~ Age+ Comorbidity + Sex+ EdYears+ Education+ Income+ Charges+ Race+ Blood+ Rate', data=df_train).fit()
test_model.rsquared_adj

0.4003040596784627

In [34]:
#remove blood
test_model = smf.ols(formula='Days ~ Age+ Comorbidity + Sex+ EdYears+ Education+ Income+ Charges+ Race+ Pressure+ Rate', data=df_train).fit()
test_model.rsquared_adj

0.41904097206363233

In [35]:
#remove rate
test_model = smf.ols(formula='Days ~ Age+ Comorbidity + Sex+ EdYears+ Education+ Income+ Charges+ Race+ Blood+ Pressure', data=df_train).fit()
test_model.rsquared_adj

0.4183188032761911

In [36]:
#remove education from current model
current_model = smf.ols(formula='Days ~ Age+ Comorbidity + Sex+ EdYears+ Income+ Charges+ Race+ Pressure+ Blood+ Rate', data=df_train).fit()
current_model.rsquared_adj

0.41905038420578156

In [37]:
#remove age
test_model = smf.ols(formula='Days ~  Comorbidity + Sex+ EdYears+ Income+ Charges+ Race+ Pressure+ Blood+ Rate', data=df_train).fit()
test_model.rsquared_adj

0.4195448214601327

In [38]:
#remove comorbidity
test_model = smf.ols(formula='Days ~  Age + Sex+ EdYears+ Income+ Charges+ Race+ Pressure+ Blood+ Rate', data=df_train).fit()
test_model.rsquared_adj

0.41778117780788737

In [39]:
#remove sex
test_model = smf.ols(formula='Days ~  Age + Comorbidity+ EdYears+ Income+ Charges+ Race+ Pressure+ Blood+ Rate', data=df_train).fit()
test_model.rsquared_adj

0.4189546337623413

In [40]:
#remove EdYears
test_model = smf.ols(formula='Days ~  Age + Comorbidity+ Sex+ Income+ Charges+ Race+ Pressure+ Blood+ Rate', data=df_train).fit()
test_model.rsquared_adj

0.41629288025067224

In [41]:
#remove Income
test_model = smf.ols(formula='Days ~  Age + Comorbidity+ Sex+ EdYears+ Charges+ Race+ Pressure+ Blood+ Rate', data=df_train).fit()
test_model.rsquared_adj

0.4177584790948565

In [42]:
#remove charges
test_model = smf.ols(formula='Days ~  Age + Comorbidity+ Sex+ EdYears+ Income + Race+ Pressure+ Blood+ Rate', data=df_train).fit()
test_model.rsquared_adj

0.07169042487409505

In [43]:
#remove Race
test_model = smf.ols(formula='Days ~  Age + Comorbidity+ Sex+ EdYears+ Income + Charges+ Pressure+ Blood+ Rate', data=df_train).fit()
test_model.rsquared_adj

0.4135778713431346

In [44]:
#remove pressure
test_model = smf.ols(formula='Days ~  Age + Comorbidity+ Sex+ EdYears+ Income + Charges+ Race+ Blood+ Rate', data=df_train).fit()
test_model.rsquared_adj

0.40142831746748797

In [45]:
#remove blood
test_model = smf.ols(formula='Days ~  Age + Comorbidity+ Sex+ EdYears+ Income + Charges+ Race+ Pressure+ Rate', data=df_train).fit()
test_model.rsquared_adj

0.4201723805339811

In [46]:
#remove rate
test_model = smf.ols(formula='Days ~  Age + Comorbidity+ Sex+ EdYears+ Income + Charges+ Race+ Pressure+ Blood', data=df_train).fit()
test_model.rsquared_adj

0.41950122954314306

In [47]:
#remove blood from current model
current_model = smf.ols(formula='Days ~  Age + Comorbidity+ Sex+ EdYears+ Income + Charges+ Race+ Pressure+ Rate', data=df_train).fit()
current_model.rsquared_adj

0.4201723805339811

In [48]:
#remove age
test_model = smf.ols(formula='Days ~ Comorbidity+ Sex+ EdYears+ Income + Charges+ Race+ Pressure+ Rate', data=df_train).fit()
test_model.rsquared_adj

0.4205968640042699

In [49]:
#remove comorbidity
test_model = smf.ols(formula='Days ~ Age+ Sex+ EdYears+ Income + Charges+ Race+ Pressure+ Rate', data=df_train).fit()
test_model.rsquared_adj

0.419044604152108

In [50]:
#remove sex
test_model = smf.ols(formula='Days ~ Age+ Comorbidity+ EdYears+ Income + Charges+ Race+ Pressure+ Rate', data=df_train).fit()
test_model.rsquared_adj

0.4201503299933751

In [51]:
#remove edyears
test_model = smf.ols(formula='Days ~ Age+ Comorbidity+ Sex+ Income + Charges+ Race+ Pressure+ Rate', data=df_train).fit()
test_model.rsquared_adj

0.41749586634854985

In [52]:
#remove income
test_model = smf.ols(formula='Days ~ Age+ Comorbidity+ Sex+ EdYears + Charges+ Race+ Pressure+ Rate', data=df_train).fit()
test_model.rsquared_adj

0.4189507949126139

In [53]:
#remove charges
test_model = smf.ols(formula='Days ~ Age+ Comorbidity+ Sex+ EdYears + Income+ Race+ Pressure+ Rate', data=df_train).fit()
test_model.rsquared_adj

0.07207941954353858

In [54]:
#remove race
test_model = smf.ols(formula='Days ~ Age+ Comorbidity+ Sex+ EdYears + Income+ Charges+ Pressure+ Rate', data=df_train).fit()
test_model.rsquared_adj

0.4148012932589584

In [55]:
#remove pressure
test_model = smf.ols(formula='Days ~ Age+ Comorbidity+ Sex+ EdYears + Income+ Charges+ Race+ Rate', data=df_train).fit()
test_model.rsquared_adj

0.4026945638896666

In [56]:
#remove rate
test_model = smf.ols(formula='Days ~ Age+ Comorbidity+ Sex+ EdYears + Income+ Charges+ Race+ Pressure', data=df_train).fit()
test_model.rsquared_adj

0.42068951998809423

In [57]:
#remove rate from current model
current_model = smf.ols(formula='Days ~ Age+ Comorbidity+ Sex+ EdYears + Income+ Charges+ Race+ Pressure', data=df_train).fit()
current_model.rsquared_adj

0.42068951998809423

In [58]:
#remove age
test_model = smf.ols(formula='Days ~ Comorbidity+ Sex+ EdYears + Income+ Charges+ Race+ Pressure', data=df_train).fit()
test_model.rsquared_adj

0.42083195756729097

In [59]:
#remove comorbidity
test_model = smf.ols(formula='Days ~ Age+ Sex+ EdYears + Income+ Charges+ Race+ Pressure', data=df_train).fit()
test_model.rsquared_adj

0.41954050257651243

In [60]:
#remove sex
test_model = smf.ols(formula='Days ~ Age+ Comorbidity+ EdYears + Income+ Charges+ Race+ Pressure', data=df_train).fit()
test_model.rsquared_adj

0.42039916372437125

In [61]:
#remove edyears
test_model = smf.ols(formula='Days ~ Age+ Comorbidity+ Sex + Income+ Charges+ Race+ Pressure', data=df_train).fit()
test_model.rsquared_adj

0.4178182933794341

In [62]:
#remove income
test_model = smf.ols(formula='Days ~ Age+ Comorbidity+ Sex + EdYears+ Charges+ Race+ Pressure', data=df_train).fit()
test_model.rsquared_adj

0.41925578163532684

In [63]:
#remove charges
test_model = smf.ols(formula='Days ~ Age+ Comorbidity+ Sex + EdYears+ Income+ Race+ Pressure', data=df_train).fit()
test_model.rsquared_adj

0.04216393614979874

In [64]:
#remove race
test_model = smf.ols(formula='Days ~ Age+ Comorbidity+ Sex + EdYears+ Income+ Charges+ Pressure', data=df_train).fit()
test_model.rsquared_adj

0.4153644740849932

In [65]:
#remove pressure
test_model = smf.ols(formula='Days ~ Age+ Comorbidity+ Sex + EdYears+ Income+ Charges+ Race', data=df_train).fit()
test_model.rsquared_adj

0.4026788255277326

In [66]:
#remove age from current model
current_model = smf.ols(formula='Days ~ Comorbidity+ Sex+ EdYears + Income+ Charges+ Race+ Pressure', data=df_train).fit()
current_model.rsquared_adj

0.42083195756729097

In [67]:
#remove comorbidity
test_model = smf.ols(formula='Days ~  Sex+ EdYears + Income+ Charges+ Race+ Pressure', data=df_train).fit()
test_model.rsquared_adj

0.41945154059753875

In [68]:
#remove sex
test_model = smf.ols(formula='Days ~  Comorbidity+ EdYears + Income+ Charges+ Race+ Pressure', data=df_train).fit()
test_model.rsquared_adj

0.4206659232313035

In [69]:
#remove Edyears
test_model = smf.ols(formula='Days ~  Comorbidity+ Sex + Income+ Charges+ Race+ Pressure', data=df_train).fit()
test_model.rsquared_adj

0.41849244600047686

In [70]:
#remove income
test_model = smf.ols(formula='Days ~  Comorbidity+ Sex + EdYears+ Charges+ Race+ Pressure', data=df_train).fit()
test_model.rsquared_adj

0.41930337246251626

In [71]:
#remove charges
test_model = smf.ols(formula='Days ~  Comorbidity+ Sex + EdYears+ Income+ Race+ Pressure', data=df_train).fit()
test_model.rsquared_adj

0.03283096759452342

In [72]:
#remove race
test_model = smf.ols(formula='Days ~  Comorbidity+ Sex + EdYears+ Income+ Charges+ Pressure', data=df_train).fit()
test_model.rsquared_adj

0.41630618341515957

In [73]:
#remove pressure
test_model = smf.ols(formula='Days ~  Comorbidity+ Sex + EdYears+ Income+ Charges+ Race', data=df_train).fit()
test_model.rsquared_adj

0.40252035066624625
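The manual procedure above follows a fixed pattern: fit every model with one predictor removed, keep the removal with the highest adjusted $R^2$ if it beats the current model, and stop otherwise. The sketch below expresses that loop as a function. It is not the code used in this notebook: `backward_eliminate`, `toy_score`, and the gain values are hypothetical, and the toy score merely stands in for adjusted $R^2$ so that the example runs without the data.

```python
from typing import Callable, Sequence

def backward_eliminate(
    candidates: Sequence[str],
    score: Callable[[tuple], float],
) -> tuple:
    """Greedy backward elimination: repeatedly drop the single predictor whose
    removal gives the highest score, stopping once no removal beats the
    current model's score. Returns (selected predictors, final score)."""
    current = tuple(candidates)
    current_score = score(current)
    while len(current) > 0:
        # Build every model that has exactly one predictor removed.
        trials = [
            tuple(p for p in current if p != dropped) for dropped in current
        ]
        best_trial = max(trials, key=score)
        best_score = score(best_trial)
        if best_score <= current_score:
            break  # no single removal improves the metric
        current, current_score = best_trial, best_score
    return current, current_score

# Hypothetical stand-in for adjusted R^2: each predictor adds a made-up gain
# and every included predictor costs a fixed complexity penalty.
TOY_GAIN = {"Charges": 0.40, "Pressure": 0.02, "Race": 0.005,
            "Care": 0.0002, "Blood": 0.0001}
PENALTY = 0.001

def toy_score(predictors: tuple) -> float:
    return sum(TOY_GAIN[p] for p in predictors) - PENALTY * len(predictors)

selected, final_score = backward_eliminate(list(TOY_GAIN), toy_score)
print(selected, round(final_score, 4))
```

In the notebook, the score function could instead wrap the existing calls, for example by building the formula string `"Days ~ " + " + ".join(predictors)` and returning `smf.ols(formula, data=df_train).fit().rsquared_adj`. An empty predictor tuple would need the intercept-only formula `"Days ~ 1"`.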

**b)** Report the final set of predictors selected for your parsimonious model.  

The final set of predictors for my parsimonious model is Comorbidity, Sex, EdYears, Income, Charges, Race, and Pressure.

**c)** What is the $R^2_{\text{adj}}$ for this model?

In [74]:
current_model = smf.ols(formula='Days ~ Comorbidity+ Sex+ EdYears + Income+ Charges+ Race+ Pressure', data=df_train).fit()
current_model.rsquared_adj

0.42083195756729097

**d)** Is this the best possible $R^2_{\text{adj}}$ that could be achieved from any possible combination of these predictors?  Briefly explain.

Not necessarily. Backward elimination is a greedy search: at each step it removes the single variable whose removal gives the largest adjusted $R^2$, and it stops once no single removal improves on the current model. It therefore examines only a small fraction of the possible models, and a combination of predictors that was never fitted could have a higher adjusted $R^2$. What we can say is that the selected model (Comorbidity, Sex, EdYears, Income, Charges, Race, and Pressure, with an adjusted $R^2$ of 0.4208) has the highest adjusted $R^2$ of all the models fit along the way.
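One way to check whether a stepwise result is the overall best is to score every subset of the candidates, which is feasible for the twelve predictors used in part a (4096 models). The sketch below shows such an exhaustive search on a made-up score table in which a greedy backward search would stop short of the optimum. The function name, predictor labels `A`, `B`, `C`, and all scores are hypothetical.

```python
from itertools import combinations
from typing import Callable, Sequence

def best_subset(candidates: Sequence[str], score: Callable[[tuple], float]):
    """Score every subset of the candidates and return (best subset, score)."""
    all_subsets = (
        combo
        for size in range(len(candidates) + 1)
        for combo in combinations(candidates, size)
    )
    best = max(all_subsets, key=score)
    return best, score(best)

# Made-up scores. Starting from ABC (0.40), the best single removal is A,
# giving BC (0.45); removing B or C from BC lowers the score, so a greedy
# backward search would stop at BC and never try the model with A alone.
TOY_SCORES = {
    frozenset(): 0.00,
    frozenset("A"): 0.50,
    frozenset("B"): 0.20,
    frozenset("C"): 0.20,
    frozenset("AB"): 0.30,
    frozenset("AC"): 0.30,
    frozenset("BC"): 0.45,
    frozenset("ABC"): 0.40,
}

def toy_score(subset: tuple) -> float:
    return TOY_SCORES[frozenset(subset)]

best, best_score = best_subset(["A", "B", "C"], toy_score)
print(best, best_score)
```

The cost doubles with every additional candidate predictor, which is why stepwise searches are used when the number of candidates is large.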

**e)** What was the least helpful predictor variable in your original model?

Care was the least helpful predictor variable in my original model: removing it in the first round produced the largest adjusted $R^2$ of any single removal (0.4179, compared with 0.4167 for the original model).

## 6. Exploring our Parsimonious Model [5 points] 

For this question, we'll use the parsimonious model that you selected in Question 5 above.  

**a)** Be sure that you are using the correct model selected in Question 5.  Print the coefficient estimates for this model below.  (It's ok if additional characteristics are printed, too.)

In [75]:
current_model.summary()

0,1,2,3
Dep. Variable:,Days,R-squared:,0.43
Model:,OLS,Adj. R-squared:,0.421
Method:,Least Squares,F-statistic:,49.06
Date:,"Fri, 06 Oct 2023",Prob (F-statistic):,7.61e-52
Time:,14:45:20,Log-Likelihood:,-1949.7
No. Observations:,464,AIC:,3915.0
Df Residuals:,456,BIC:,3948.0
Df Model:,7,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,2.8041,5.030,0.557,0.577,-7.081,12.689
Comorbidity[T.yes],-3.5495,2.456,-1.445,0.149,-8.375,1.276
Sex[T.male],-1.6372,1.539,-1.063,0.288,-4.663,1.388
Income[T.low],2.5287,1.702,1.485,0.138,-0.817,5.874
Race[T.white],4.0436,1.891,2.138,0.033,0.327,7.760
EdYears,-0.3850,0.228,-1.687,0.092,-0.833,0.063
Charges,0.0002,9.92e-06,17.526,0.000,0.000,0.000
Pressure,0.1138,0.029,3.931,0.000,0.057,0.171

0,1,2,3
Omnibus:,291.517,Durbin-Watson:,2.095
Prob(Omnibus):,0.0,Jarque-Bera (JB):,6109.256
Skew:,2.311,Prob(JB):,0.0
Kurtosis:,20.165,Cond. No.,651000.0


**b)** Based on the categorical predictors in this model, what features are associated with longer hospitalizations?  In other words, for what groups of patients might we want to target an intervention to reduce the hospitalization time? 

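Based on the categorical predictors, longer hospitalizations are associated with being white (coefficient 4.04 for Race[T.white]) and having low income (2.53 for Income[T.low]). The negative coefficients for Sex[T.male] (-1.64) and Comorbidity[T.yes] (-3.55) also point to longer stays for female patients and for patients without a comorbidity. Of these, only Race has a p-value below 0.05, so white patients are the group with the clearest evidence for targeting an intervention.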

**c)** Calculate the RMSE for the parsimonious model on the training data and on the test set.  

In [76]:
from sklearn.metrics import mean_squared_error

X_train = df_train.drop(['Days'], axis=1)
y_train=df_train['Days']
X_test = df_test.drop(['Days'], axis=1)
y_test=df_test['Days']

y_pred_train = current_model.predict(X_train)
train_rmse = mean_squared_error(y_train, y_pred_train) ** 0.5  # RMSE = sqrt(MSE); avoids the squared= argument dropped in newer scikit-learn
train_rmse

16.165763490001975

In [77]:
y_pred_test = current_model.predict(X_test)
test_rmse = mean_squared_error(y_test, y_pred_test) ** 0.5  # RMSE = sqrt(MSE)
test_rmse

16.941747490138525
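For reference, the RMSE is the square root of the average squared difference between the observed and predicted values. The sketch below computes it by hand with the standard library only. The function name and the three data points are made up for illustration and do not come from `hospital.csv`.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up values purely for illustration (not from hospital.csv).
y_true = [3, 5, 7]
y_hat = [2, 5, 9]

# Errors are 1, 0, and -2, so the mean squared error is 5/3.
print(rmse(y_true, y_hat))
```

Because the errors are squared before averaging, the RMSE is in the same units as the response (days here) and is pulled upward by a few very large errors.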

**d)** Now, calculate the RMSE for the full model (from Question 2b) on the training data and on the test set.

In [78]:
y_pred_train = fit_model.predict(X_train)
train_rmse = mean_squared_error(y_train, y_pred_train) ** 0.5  # RMSE = sqrt(MSE)
train_rmse

19.73810187753586

In [79]:
y_pred_test = fit_model.predict(X_test)
test_rmse = mean_squared_error(y_test, y_pred_test) ** 0.5  # RMSE = sqrt(MSE)
test_rmse

25.80024026909838

**e)** Does the RMSE show evidence that the parsimonious model reduces overfitting to the training set that could be occurring?  Briefly explain.

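Yes, it does. The full model's RMSE rises from 19.74 on the training data to 25.80 on the test data, a gap of about 6.1 days, while the parsimonious model's RMSE rises only from 16.17 to 16.94. The much larger train-to-test gap suggests that the full model is overfitting the training data, while the parsimonious model generalizes better to new data. One caveat is that the parsimonious model includes Charges, which the full model from Question 2b does not, so part of its lower RMSE comes from that extra predictor and not from reduced complexity alone.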
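Overfitting shows up as a test error that is noticeably worse than the training error, so the comparison can be made explicit by computing the train-to-test gap for each model. The sketch below uses the four RMSE values reported in parts c and d. The dictionary layout and names are only one possible way to organize them.

```python
# RMSE values reported in parts c and d.
rmse_values = {
    "parsimonious": {"train": 16.165763490001975, "test": 16.941747490138525},
    "full": {"train": 19.73810187753586, "test": 25.80024026909838},
}

# Absolute and relative increase in RMSE when moving from training to test data.
gaps = {
    name: {
        "gap": vals["test"] - vals["train"],
        "ratio": vals["test"] / vals["train"],
    }
    for name, vals in rmse_values.items()
}

for name, g in gaps.items():
    print(f"{name}: gap = {g['gap']:.2f} days, test/train = {g['ratio']:.2f}")
```

A single train/test split gives only one estimate of each gap, so the size of the difference between the two models would vary with a different random state.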

Remember to keep all your cells and hit the save icon above periodically to checkpoint (save) your results on your local computer. Once you are satisfied with your results, restart the kernel and run all (Kernel -> Restart & Run All). **Make sure nothing has changed**. Checkpoint and exit (File -> Save and Checkpoint + File -> Close and Halt). Follow the instructions on the Homework 5 Canvas Assignment to submit your notebook to GitHub.