# Step 4. Pre-Processing and Training Data Development

**The Data Science Method**  


1. [Problem Identification](https://medium.com/@aiden.dataminer/the-data-science-method-problem-identification-6ffcda1e5152)

2. [Data Wrangling](https://medium.com/@aiden.dataminer/the-data-science-method-dsm-data-collection-organization-and-definitions-d19b6ff141c4)
   * Data Collection
   * Data Organization
   * Data Definition
   * Data Cleaning

3. [Exploratory Data Analysis](https://medium.com/@aiden.dataminer/the-data-science-method-dsm-exploratory-data-analysis-bc84d4d8d3f9)
   * Build data profile tables and plots
     - Outliers & Anomalies
   * Explore data relationships
   * Identification and creation of features

4. [**Pre-processing and Training Data Development**](https://medium.com/@aiden.dataminer/the-data-science-method-dsm-pre-processing-and-training-data-development-fd2d75182967)
   * Create dummy or indicator features for categorical variables
   * Standardize the magnitude of numeric features
   * Split into testing and training datasets
   * Apply scaler to the testing set

5. [Modeling](https://medium.com/@aiden.dataminer/the-data-science-method-dsm-modeling-56b4233cad1b)
   * Create dummy or indicator features for categorical variables
   * Fit Models with Training Data Set
   * Review Model Outcomes: iterate over additional models as needed
   * Identify the Final Model

6. [Documentation](https://medium.com/@aiden.dataminer/the-data-science-method-dsm-documentation-c92c28bd45e6)
   * Review the Results
   * Present and share your findings - storytelling
   * Finalize Code
   * Finalize Documentation

Load the necessary packages as we did in step 3 and print out the current working directory just to confirm we are in the correct project directory.

In [1]:
#load python packages
import os
import pandas as pd
import datetime
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

print(os.getcwd())
print(os.listdir())

/Users/jb/Development/courses/springboard/ds/Assignments/6 Applying the Data Science Method/big-mountain-resort
['Notebook_Step4.ipynb', 'Notebook_Step6.ipynb', 'Notebook_Step2.ipynb', '.DS_Store', 'LICENSE', 'Notebook_Step5.ipynb', 'models', 'Notebook_Step3.ipynb', 'README.md', '.gitignore', 'figures', '.ipynb_checkpoints', '.git', 'data']


Load the CSV file created in step 3, which should be saved inside the `data` subfolder, and print the first five rows.

In [2]:
df = pd.read_csv('data/step3_output.csv')
df.head(5)

Unnamed: 0,Name,state,summit_elev,vertical_drop,trams,fastEight,fastSixes,fastQuads,quad,triple,...,SkiableTerrain_ac,Snow_Making_ac,daysOpenLastYear,yearsOpen,averageSnowfall,AdultWeekday,AdultWeekend,projectedDaysOpen,NightSkiing_ac,clusters
0,Hilltop Ski Area,Alaska,2090,294,0,0.0,0,0,0,1,...,30.0,30.0,150.0,36.0,69.0,30.0,34.0,152.0,30.0,0
1,Sunrise Park Resort,Arizona,11100,1800,0,0.0,0,1,2,3,...,800.0,80.0,115.0,49.0,250.0,74.0,78.0,104.0,80.0,1
2,Yosemite Ski & Snowboard Area,California,7800,600,0,0.0,0,0,0,1,...,88.0,100.0,110.0,84.0,300.0,47.0,47.0,107.0,0.0,1
3,Boreal Mountain Resort,California,7700,500,0,0.0,0,1,1,3,...,380.0,200.0,150.0,54.0,400.0,49.0,60.0,150.0,200.0,1
4,Dodge Ridge,California,8200,1600,0,0.0,0,0,1,2,...,862.0,100.0,114.0,69.0,350.0,78.0,78.0,140.0,0.0,1


In [3]:
df[df['Name'].str.contains('Whitefish')]

Unnamed: 0,Name,state,summit_elev,vertical_drop,trams,fastEight,fastSixes,fastQuads,quad,triple,...,SkiableTerrain_ac,Snow_Making_ac,daysOpenLastYear,yearsOpen,averageSnowfall,AdultWeekday,AdultWeekend,projectedDaysOpen,NightSkiing_ac,clusters
171,Whitefish Mountain Resort,Montana,6817,2353,0,0.0,0,3,2,6,...,3000.0,600.0,123.0,72.0,333.0,81.0,81.0,123.0,600.0,1


## Create dummy features for categorical variables

Create dummy variables for `state`. Add the dummies back to the dataframe and remove the original `state` column.

In [4]:
df_awe_m1 = pd.concat([df.drop(['state'], axis=1), pd.get_dummies(df[['state']])], axis=1)

In [5]:
df_awe_m1.head()

Unnamed: 0,Name,summit_elev,vertical_drop,trams,fastEight,fastSixes,fastQuads,quad,triple,double,...,state_Rhode Island,state_South Dakota,state_Tennessee,state_Utah,state_Vermont,state_Virginia,state_Washington,state_West Virginia,state_Wisconsin,state_Wyoming
0,Hilltop Ski Area,2090,294,0,0.0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,Sunrise Park Resort,11100,1800,0,0.0,0,1,2,3,1,...,0,0,0,0,0,0,0,0,0,0
2,Yosemite Ski & Snowboard Area,7800,600,0,0.0,0,0,0,1,3,...,0,0,0,0,0,0,0,0,0,0
3,Boreal Mountain Resort,7700,500,0,0.0,0,1,1,3,1,...,0,0,0,0,0,0,0,0,0,0
4,Dodge Ridge,8200,1600,0,0.0,0,0,1,2,5,...,0,0,0,0,0,0,0,0,0,0
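The same encoding step can be seen on a much smaller frame. The sketch below uses a hypothetical three-row dataframe (the names and values are made up for illustration) and passes `dtype=int` so the indicator columns show as 0/1, as in the output above, rather than as booleans.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the resort data.
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "state": ["Montana", "Utah", "Montana"],
    "summit_elev": [6817, 9000, 7000],
})

# One indicator column per distinct state; dtype=int keeps 0/1 instead of booleans.
dummies = pd.get_dummies(toy[["state"]], dtype=int)

# Drop the original column and attach the indicators.
toy_encoded = pd.concat([toy.drop(columns=["state"]), dummies], axis=1)
print(toy_encoded)
```

Each row has a 1 in exactly one `state_` column, so the original `state` value can always be recovered from the indicators.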


## Standardize the magnitude of numeric features

Using sklearn's `preprocessing` module, standardize the scale of the features in the dataframe. The resort name is not needed for modeling, so it can be dropped here as well. We also want to hold out our response variable(s) so that their true values are available when reviewing model performance. Let's set `AdultWeekend` as the y variable, our response for scaling and modeling. Later we will go back and consider `AdultWeekday`, `daysOpenLastYear`, and `projectedDaysOpen`. For now, leave them in the development dataframe.

## Predict `AdultWeekend`

In [6]:
# first we import the preprocessing package from the sklearn library
from sklearn import preprocessing

# The features are standardized many times, so the function below
# handles the standardization of features
def standardize_features(x):
    # Create a StandardScaler from the preprocessing package and fit it to x
    scaler = preprocessing.StandardScaler().fit(x)
    # Transform x with the fitted scaler to get the standardized feature array
    X_scaled = scaler.transform(x)
    return X_scaled

In [7]:
# Declare an explanatory variable, called X, and assign it the result of dropping 'Name' and 'AdultWeekend' from the df
X = df_awe_m1.drop(['Name','AdultWeekend'], axis=1)

# Declare a response variable, called y, and assign it the AdultWeekend column of the df 
y = df_awe_m1.AdultWeekend

X_scaled = standardize_features(X)
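To see what standardization does to the numbers, here is a minimal sketch on a hypothetical 4x2 array (values chosen only for illustration). After `StandardScaler`, every column has mean 0 and population standard deviation 1, regardless of its original units.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales (elevation in feet, lift count).
X_toy = np.array([[2090.0, 1.0],
                  [11100.0, 3.0],
                  [7800.0, 1.0],
                  [7700.0, 3.0]])

X_toy_scaled = StandardScaler().fit_transform(X_toy)

# Each column now has mean 0 and (population) standard deviation 1.
print(X_toy_scaled.mean(axis=0))
print(X_toy_scaled.std(axis=0))
```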

## Split into training and testing datasets

Import `train_test_split` from `sklearn.model_selection` and create a 75/25 train/test split with y = `AdultWeekend`. We will start by using the adult weekend ticket price as our response variable for modeling.

In [8]:
# Import the train_test_split function from the sklearn.model_selection utility.  
from sklearn.model_selection import train_test_split

# The dataframe is split many times, so the function below
# handles splitting the data into X_train, X_test, y_train, y_test with a 75/25 split.
def split_data(x, y):
    # Get the 1-dimensional flattened array of our response variable y with np.ravel()
    y = np.ravel(y)

    # Call the train_test_split() function with the first two parameters set to x and y 
    # Declare four variables, X_train, X_test, y_train and y_test separated by commas 
    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=1)
    return  X_train, X_test, y_train, y_test

In [9]:
X_train, X_test, y_train, y_test = split_data(X_scaled, y)
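The outline lists "Apply scaler to the testing set" as its own step, whereas `standardize_features` fits the scaler on every row before the split. A common alternative, sketched here on synthetic data (the arrays and variable names are hypothetical, not the resort data), is to split first, fit the scaler on the training rows only, and then apply that fitted scaler to the test rows, so that no test-set statistics feed into preprocessing.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 40 rows, 3 numeric features, one numeric response.
rng = np.random.default_rng(0)
X_raw = rng.normal(loc=100.0, scale=20.0, size=(40, 3))
y_raw = rng.normal(loc=60.0, scale=10.0, size=40)

# Split first, using the same 75/25 ratio and seed as split_data().
X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y_raw, test_size=0.25, random_state=1)

# Fit the scaler on the training rows only, then apply it to both sets.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0))  # zero up to rounding: the scaler was fit on these rows
print(X_te_scaled.mean(axis=0))  # generally nonzero: test rows were not used to fit
```

The results reported in this notebook come from the scale-then-split order used above, so switching to this order would be expected to change the numbers slightly.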

Here we start the actual modeling work. First let's fit a multiple linear regression model to predict the `AdultWeekend` price.

# Step 5. Modeling


This is the fifth step in the Data Science Method. In the previous steps we cleaned and prepared the datasets. Now it's time to get into the most exciting part: modeling! In this exercise, we'll build three different models and compare each model's performance. In the end, we'll choose the best model for demonstrating insights to Big Mountain management.


**The Data Science Method**  


1. [Problem Identification](https://medium.com/@aiden.dataminer/the-data-science-method-problem-identification-6ffcda1e5152)

2. [Data Wrangling](https://medium.com/@aiden.dataminer/the-data-science-method-dsm-data-collection-organization-and-definitions-d19b6ff141c4)
   * Data Collection
   * Data Organization
   * Data Definition
   * Data Cleaning

3. [Exploratory Data Analysis](https://medium.com/@aiden.dataminer/the-data-science-method-dsm-exploratory-data-analysis-bc84d4d8d3f9)
   * Build data profile tables and plots
     - Outliers & Anomalies
   * Explore data relationships
   * Identification and creation of features

4. [Pre-processing and Training Data Development](https://medium.com/@aiden.dataminer/the-data-science-method-dsm-pre-processing-and-training-data-development-fd2d75182967)
   * Create dummy or indicator features for categorical variables
   * Standardize the magnitude of numeric features
   * Split into testing and training datasets
   * Apply scaler to the testing set

5. [**Modeling**](https://medium.com/@aiden.dataminer/the-data-science-method-dsm-modeling-56b4233cad1b)
   * Create dummy or indicator features for categorical variables
   * Fit Models with Training Data Set
   * Review Model Outcomes: iterate over additional models as needed
   * Identify the Final Model

6. [Documentation](https://medium.com/@aiden.dataminer/the-data-science-method-dsm-documentation-c92c28bd45e6)
   * Review the Results
   * Present and share your findings - storytelling
   * Finalize Code
   * Finalize Documentation

## Fit Models with a Training Dataset

Using sklearn, fit the model on the training dataset.

In [10]:
from sklearn import linear_model
lm = linear_model.LinearRegression()

#### Model 1

In [11]:
# Note: fit() returns the estimator itself, so model_awe_1 refers to the same object as lm
model_awe_1 = lm.fit(X_train, y_train)

Predict on the testing dataset and score the model performance using `y_test` and the predicted values `y_pred`. The explained variance score measures the proportion of the variance in the response that the model explains. It is closely related to the R-squared value and equals it when the prediction errors average to zero.

In [12]:
# Make a variable called y_pred and assign it the result of calling predict() on our model variable with parameter X_test
y_pred = model_awe_1.predict(X_test)

## Review Model Outcomes — Iterate over additional models as needed

In [13]:
# You might want to use the explained_variance_score() and mean_absolute_error() metrics.
# To do so, you will need to import them from sklearn.metrics. 
# You can plug y_test and y_pred into the functions to evaluate the model
from sklearn.metrics import explained_variance_score, mean_absolute_error



# The models are evaluated many times, so the function below
# handles model evaluation with explained variance score and mean absolute error
def model_evaluate(y_test, y_pred):
    evs = explained_variance_score(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    print('explained variance score: ', evs)
    print('mean absolute error: ', mae)
    return evs, mae

In [14]:
evs_awe_1, mae_awe_1 = model_evaluate(y_test, y_pred)

explained variance score:  0.7756641343021198
mean absolute error:  6.145708594149419
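The explained variance score and R-squared are often treated as the same number, but they differ when predictions are systematically off. A minimal sketch with made-up values: predictions that are wrong by a constant amount leave the explained variance score at 1.0, while `r2_score` penalizes the offset.

```python
import numpy as np
from sklearn.metrics import explained_variance_score, r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Hypothetical predictions that are off by a constant +1 everywhere.
y_biased = y_true + 1.0

evs = explained_variance_score(y_true, y_biased)
r2 = r2_score(y_true, y_biased)
print(evs, r2)
```

Here the explained variance score is 1.0 and R-squared is 0.5, which is why the mean absolute error is reported alongside it in `model_evaluate`.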


Print the intercept value from the linear model.

In [15]:
print(model_awe_1.intercept_)

56.2827138903714


Because the features were standardized, each has a mean of approximately zero, so the intercept is the predicted `AdultWeekend` price for a resort whose features are all at their average values; it is close to the mean `AdultWeekend` price in the training data. Each coefficient is the adjustment, positive or negative, applied to the intercept for a one-standard-deviation change in that feature. Because the features share a common scale, we can also compare the magnitudes of the coefficients to judge feature importance.
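The link between the intercept and the mean response can be checked directly. The sketch below uses synthetic data (hypothetical features and coefficients) and standardizes on the training rows themselves, so every feature column has mean exactly zero and the fitted intercept matches the mean of the response.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical training data: 50 rows, 2 features, a noisy linear response.
rng = np.random.default_rng(1)
X_tr = rng.normal(size=(50, 2)) * [1000.0, 5.0] + [7000.0, 10.0]
y_tr = 40.0 + 0.002 * X_tr[:, 0] + 1.5 * X_tr[:, 1] + rng.normal(scale=2.0, size=50)

# Standardize using the training rows, so every column has mean exactly 0.
X_tr_scaled = StandardScaler().fit_transform(X_tr)

model = LinearRegression().fit(X_tr_scaled, y_tr)

# With centered features, the OLS intercept equals the mean of the response.
print(model.intercept_, y_tr.mean())
```

In this notebook the scaler is fit on all rows before the split, so the training columns are only approximately centered and the intercept is close to, rather than exactly, the training mean.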

Print the coefficient values from the linear model, sorted in descending order, to identify the top ten most important features. Make sure to review the absolute value of the coefficients: the adjustment may be positive or negative, but what we are looking for is the magnitude of the impact on our response variable.

In [16]:
# Build a pandas DataFrame of the absolute coefficient for each feature and show the ten largest
pd.DataFrame(abs(model_awe_1.coef_), X.columns, columns=['Coefficient']).sort_values(by='Coefficient', ascending=False).head(10)

Unnamed: 0,Coefficient
fastSixes,1243003000000.0
fastEight,578034200000.0
total_chairs,220764900000.0
trams,151279700000.0
double,114786200000.0
surface,112769700000.0
triple,100252600000.0
state_New York,81054190000.0
state_Michigan,71340170000.0
state_California,64415340000.0


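We should see that the top ten important features include several states. However, the state is not something the managers at the Big Mountain Resort can do anything about. Note also that coefficients around 1e11 to 1e12 on standardized features, for ticket prices in the tens of dollars, suggest a numerically unstable fit. This typically happens when some features are exact linear combinations of others (for example, a full set of dummy columns, or a total that sums other columns), so these magnitudes should not be read as genuine importances. Given that we care more about actionable traits associated with ticket pricing, rebuild the model without the state features and compare the results.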
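Coefficients as large as those in the table above can arise when one feature is an exact linear combination of others. One way to check for this is to compare the rank of the feature matrix with its number of columns. The sketch below uses a hypothetical lift-count frame in which `total_chairs` is built as the sum of the other columns; whether the real `total_chairs` column is constructed this way is an assumption, not something shown in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical lift counts; total_chairs is constructed as the exact sum of the others.
lifts = pd.DataFrame({
    "quad":   [0, 2, 0, 1, 1, 3],
    "triple": [1, 3, 1, 3, 2, 0],
    "double": [0, 1, 3, 1, 5, 2],
})
lifts["total_chairs"] = lifts.sum(axis=1)

# A full-rank matrix would have rank 4 here; the redundant column reduces it to 3.
rank = np.linalg.matrix_rank(lifts.to_numpy(dtype=float))
print(lifts.shape[1], rank)
```

When the rank is lower than the column count, the least-squares coefficients are not uniquely determined, so dropping the redundant column (or one dummy per categorical variable) is a common remedy.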

Hint: Try to construct another model using exactly the steps we followed above. 

#### Model 2

In [17]:
# Start fresh 
df_awe_m2 = df.copy()

# Declare an explanatory variable, called X, and assign it the result of dropping 'Name', 'state' and 'AdultWeekend' from the df
X = df_awe_m2.drop(['Name', 'state', 'AdultWeekend'], axis=1)

# Declare a response variable, called y, and assign it the AdultWeekend column of the df 
y = df_awe_m2.AdultWeekend

X_scaled = standardize_features(X) 
X_train, X_test, y_train, y_test = split_data(X_scaled, y)

In [18]:
model_awe_2 = lm.fit(X_train, y_train)

# Make a variable called y_pred and assign it the result of calling predict() on our model variable with parameter X_test
y_pred = model_awe_2.predict(X_test)

evs_awe_2, mae_awe_2 = model_evaluate(y_test, y_pred)
print(model_awe_2.intercept_)

explained variance score:  0.7927639050722355
mean absolute error:  6.539748091155944
56.28448772100999
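Because `fit()` returns the estimator itself, names such as `model_awe_1` and `model_awe_2` all refer to the single `lm` object, and each new fit replaces the previous coefficients. The sketch below shows the effect on synthetic data (all names here are hypothetical) and one way to keep each fitted model separately using `sklearn.base.clone`.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X_a = rng.normal(size=(30, 2))
y_a = 3.0 * X_a[:, 0] + rng.normal(scale=0.1, size=30)
y_b = -2.0 * X_a[:, 1] + rng.normal(scale=0.1, size=30)

est = LinearRegression()

# fit() returns the estimator itself, so both names point at the same object.
shared_1 = est.fit(X_a, y_a)
shared_2 = est.fit(X_a, y_b)
print(shared_1 is shared_2)  # True: shared_1 now holds the second fit

# clone() gives an unfitted copy with the same parameters, so each fit is kept.
kept_1 = clone(est).fit(X_a, y_a)
kept_2 = clone(est).fit(X_a, y_b)
print(kept_1.coef_, kept_2.coef_)
```

The metrics in this notebook are unaffected, since each model is evaluated right after it is fit, but any later use of an earlier model name would see the most recent fit.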


In [19]:
# Build a pandas DataFrame of the absolute coefficient for each feature and show the ten largest
pd.DataFrame(abs(model_awe_2.coef_), X.columns, columns=['Coefficient']).sort_values(by='Coefficient', ascending=False).head(10)


Unnamed: 0,Coefficient
AdultWeekday,9.730253
clusters,2.972244
daysOpenLastYear,2.615928
summit_elev,2.387304
averageSnowfall,2.072779
triple,1.674026
vertical_drop,1.486892
projectedDaysOpen,1.278542
SkiableTerrain_ac,1.205469
surface,1.13702


When reviewing our new model coefficients, we see `summit_elev` is now in the number four spot. This is also difficult to change from a management perspective, and it is highly correlated with `base_elev`. This time, rebuild the model without the state features and without `summit_elev` (`base_elev` is listed as dropped for every model in the summary table), and compare the results.

#### Model 3

In [20]:
# Start fresh
df_awe_m3 = df.copy()

# Declare an explanatory variable, called X, and assign it the result of dropping 'Name', 'state', 'AdultWeekend', and 'summit_elev' from the df
X = df_awe_m3.drop(['Name', 'state', 'AdultWeekend', 'summit_elev'], axis=1)

# Declare a response variable, called y, and assign it the AdultWeekend column of the df 
y = df_awe_m3.AdultWeekend

X_scaled = standardize_features(X) 
X_train, X_test, y_train, y_test = split_data(X_scaled, y)

In [21]:
model_awe_3 = lm.fit(X_train, y_train)

# Make a variable called y_pred and assign it the result of calling predict() on our model variable with parameter X_test
y_pred = model_awe_3.predict(X_test)

evs_awe_3, mae_awe_3 = model_evaluate(y_test, y_pred)
print(model_awe_3.intercept_)

explained variance score:  0.7777271010258031
mean absolute error:  6.5599012811726185
56.33993390471849


In [22]:
# Build a pandas DataFrame of the absolute coefficient for each feature and show the ten largest
pd.DataFrame(abs(model_awe_3.coef_), X.columns, columns=['Coefficient']).sort_values(by='Coefficient', ascending=False).head(10)


Unnamed: 0,Coefficient
AdultWeekday,9.620722
averageSnowfall,2.804015
clusters,2.546432
daysOpenLastYear,2.446166
SkiableTerrain_ac,1.701617
triple,1.43533
projectedDaysOpen,1.170337
LongestRun_mi,1.100049
Runs,1.02188
vertical_drop,0.983185


## Identify the Final Model

We review the model performances in the table below and choose the best model for providing insights to Big Mountain management about which features are driving ski resort lift ticket prices.

In [23]:
print('Model 1:')
print('Explained Variance:  ', evs_awe_1)
print('Mean Absolute Error: ', mae_awe_1)
print()
print('Model 2:')
print('Explained Variance:  ', evs_awe_2)
print('Mean Absolute Error: ', mae_awe_2)
print()
print('Model 3:')
print('Explained Variance:  ', evs_awe_3)
print('Mean Absolute Error: ', mae_awe_3)

Model 1:
Explained Variance:   0.7756641343021198
Mean Absolute Error:  6.145708594149419

Model 2:
Explained Variance:   0.7927639050722355
Mean Absolute Error:  6.539748091155944

Model 3:
Explained Variance:   0.7777271010258031
Mean Absolute Error:  6.5599012811726185


| Model | Explained Variance| Mean Absolute Error|Features Dropped|
| --- | --- | --- | --- |
| Model 1. | 0.776 | 6.146 |'base_elev'|
| Model 2. | 0.793 | 6.540 |'base_elev', 'state'|
| Model 3. | 0.778 | 6.560 |'base_elev', 'state', 'summit_elev'|
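A summary table like this can also be built from the stored metrics rather than typed by hand, which keeps it in step with the printed output. The sketch below copies the values printed above into a dataframe; in the notebook itself the `evs_awe_*` and `mae_awe_*` variables could be used in their place.

```python
import pandas as pd

# Values copied from the printed output above; rounded only for display.
results = pd.DataFrame(
    {
        "explained_variance": [0.7756641343021198, 0.7927639050722355, 0.7777271010258031],
        "mean_absolute_error": [6.145708594149419, 6.539748091155944, 6.5599012811726185],
    },
    index=["Model 1", "Model 2", "Model 3"],
)

print(results.round(3))
print("highest explained variance:", results["explained_variance"].idxmax())
print("lowest mean absolute error:", results["mean_absolute_error"].idxmin())
```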

### `AdultWeekend` prediction
Model Selection: Model 1

Model 1, with all features including the `state` dummies, has the lowest mean absolute error of the three models: about 6.15, compared with 6.54 and 6.56. On that basis it gives the most accurate price predictions to inform Big Mountain management's decision on lift ticket prices. Its explained variance score of 0.776 means that roughly 78% of the variance in `AdultWeekend` (y) is explained by the linear relationship with the input features (X); Model 2 scores slightly higher on this metric at 0.793.

While categorical features such as `state` are outside management's control, the `state` feature helped keep the mean absolute error lower because it captures which states tend to have higher or lower prices.




## Predict `AdultWeekday`

In [24]:
# Start fresh 
df_awd_m1 = df.copy()

# Make dummy `state_`s and drop `state`
df_awd_m1 = pd.concat([df_awd_m1.drop(['state'], axis=1), pd.get_dummies(df_awd_m1[['state']])], axis=1)

# Declare an explanatory variable, called X, and assign it the result of dropping 'Name' and 'AdultWeekday' from the df
X = df_awd_m1.drop(['Name', 'AdultWeekday'], axis=1)

# Declare a response variable, called y, and assign it the AdultWeekday column of the df
y = df_awd_m1.AdultWeekday

# Standardize
X_scaled = standardize_features(X) 

# Split into test and train
X_train, X_test, y_train, y_test = split_data(X_scaled, y)

#  Train model
model_awd_1 = lm.fit(X_train, y_train)

# Make a variable called y_pred and assign it the result of calling predict() on our model variable with parameter X_test
y_pred = model_awd_1.predict(X_test)

evs_awd_1, mae_awd_1 = model_evaluate(y_test, y_pred)
print(model_awd_1.intercept_)

explained variance score:  0.7422556758180926
mean absolute error:  6.896908113250268
49.222820694797726


In [25]:
# Start fresh
df_awd_m2 = df.copy()

# Declare an explanatory variable, called X, and assign it the result of dropping 'Name', 'AdultWeekday', and 'state' from the df
X = df_awd_m2.drop(['Name', 'AdultWeekday', 'state'], axis=1)

# Declare a response variable, called y, and assign it the AdultWeekday column of the df
y = df_awd_m2.AdultWeekday

# Standardize
X_scaled = standardize_features(X) 

# Split into test and train
X_train, X_test, y_train, y_test = split_data(X_scaled, y)

#  Train model
model_awd_2 = lm.fit(X_train, y_train)

# Make a variable called y_pred and assign it the result of calling predict() on our model variable with parameter X_test
y_pred = model_awd_2.predict(X_test)

evs_awd_2, mae_awd_2 = model_evaluate(y_test, y_pred)
print(model_awd_2.intercept_)

explained variance score:  0.8007501094726159
mean absolute error:  6.449138267727306
49.292181257812395


In [26]:
# Start fresh
df_awd_m3 = df.copy()

# Declare an explanatory variable, called X, and assign it the result of dropping 'Name', 'AdultWeekday', 'state', and 'summit_elev' from the df
X = df_awd_m3.drop(['Name', 'AdultWeekday', 'state', 'summit_elev'], axis=1)

# Declare a response variable, called y, and assign it the AdultWeekday column of the df
y = df_awd_m3.AdultWeekday

# Standardize
X_scaled = standardize_features(X) 

# Split into test and train
X_train, X_test, y_train, y_test = split_data(X_scaled, y)

# Train model
model_awd_3 = lm.fit(X_train, y_train)

# Make a variable called y_pred and assign it the result of calling predict() on our model variable with parameter X_test
y_pred = model_awd_3.predict(X_test)

evs_awd_3, mae_awd_3 = model_evaluate(y_test, y_pred)
print(model_awd_3.intercept_)

explained variance score:  0.7830456840197244
mean absolute error:  6.636564235573113
49.214061705603015


In [27]:
print('Model 1:')
print('Explained Variance:  ', evs_awd_1)
print('Mean Absolute Error: ', mae_awd_1)
print()
print('Model 2:')
print('Explained Variance:  ', evs_awd_2)
print('Mean Absolute Error: ', mae_awd_2)
print()
print('Model 3:')
print('Explained Variance:  ', evs_awd_3)
print('Mean Absolute Error: ', mae_awd_3)

Model 1:
Explained Variance:   0.7422556758180926
Mean Absolute Error:  6.896908113250268

Model 2:
Explained Variance:   0.8007501094726159
Mean Absolute Error:  6.449138267727306

Model 3:
Explained Variance:   0.7830456840197244
Mean Absolute Error:  6.636564235573113


| Model | Explained Variance| Mean Absolute Error|Features Dropped|
| --- | --- | --- | --- |
| Model 1. | 0.742 | 6.897 |'base_elev'|
| Model 2. | 0.801 | 6.449 |'base_elev', 'state'|
| Model 3. | 0.783 | 6.637 |'base_elev', 'state', 'summit_elev'|

### `AdultWeekday` prediction
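Model Selection: Model 2

In the printed results above, Model 2, without the `state` features, has both the highest explained variance score (0.801) and the lowest mean absolute error (6.449) of the three `AdultWeekday` models.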




## Predict `projectedDaysOpen`

In [28]:
# Start fresh 
df_pdo_m1 = df.copy()

# Make dummy `state_`s and drop `state`
df_pdo_m1 = pd.concat([df_pdo_m1.drop(['state'], axis=1), pd.get_dummies(df_pdo_m1[['state']])], axis=1)

# Declare an explanatory variable, called X, and assign it the result of dropping 'Name' and 'projectedDaysOpen' from the df
X = df_pdo_m1.drop(['Name', 'projectedDaysOpen'], axis=1)

# Declare a response variable, called y, and assign it the projectedDaysOpen column of the df 
y = df_pdo_m1.projectedDaysOpen

# Standardize
X_scaled = standardize_features(X) 

# Split into test and train
X_train, X_test, y_train, y_test = split_data(X_scaled, y)

#  Train model
model_pdo_1 = lm.fit(X_train, y_train)

# Make a variable called y_pred and assign it the result of calling predict() on our model variable with parameter X_test
y_pred = model_pdo_1.predict(X_test)

evs_pdo_1, mae_pdo_1 = model_evaluate(y_test, y_pred)
print(model_pdo_1.intercept_)

explained variance score:  0.2351528471545128
mean absolute error:  13.270347756137712
112.5715893885886


In [29]:
# Start fresh 
df_pdo_m2 = df.copy()

# Declare an explanatory variable, called X, and assign it the result of dropping 'Name', 'projectedDaysOpen', and 'state' from the df
X = df_pdo_m2.drop(['Name', 'projectedDaysOpen', 'state'], axis=1)

# Declare a response variable, called y, and assign it the projectedDaysOpen column of the df 
y = df_pdo_m2.projectedDaysOpen

# Standardize
X_scaled = standardize_features(X) 

# Split into test and train
X_train, X_test, y_train, y_test = split_data(X_scaled, y)

#  Train model
model_pdo_2 = lm.fit(X_train, y_train)

# Make a variable called y_pred and assign it the result of calling predict() on our model variable with parameter X_test
y_pred = model_pdo_2.predict(X_test)

evs_pdo_2, mae_pdo_2 = model_evaluate(y_test, y_pred)
print(model_pdo_2.intercept_)

explained variance score:  0.4172462552656919
mean absolute error:  11.864796873312795
112.48208365948472


In [30]:
# Start fresh 
df_pdo_m3 = df.copy()

# Declare an explanatory variable, called X, and assign it the result of dropping 'Name', 'projectedDaysOpen', 'state', and 'summit_elev' from the df
X = df_pdo_m3.drop(['Name', 'projectedDaysOpen', 'state', 'summit_elev'], axis=1)

# Declare a response variable, called y, and assign it the projectedDaysOpen column of the df 
y = df_pdo_m3.projectedDaysOpen

# Standardize
X_scaled = standardize_features(X) 

# Split into test and train
X_train, X_test, y_train, y_test = split_data(X_scaled, y)

#  Train model
model_pdo_3 = lm.fit(X_train, y_train)

# Make a variable called y_pred and assign it the result of calling predict() on our model variable with parameter X_test
y_pred = model_pdo_3.predict(X_test)

evs_pdo_3, mae_pdo_3 = model_evaluate(y_test, y_pred)
print(model_pdo_3.intercept_)

explained variance score:  0.4144167636696271
mean absolute error:  11.941545945074344
112.53160454382764
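The cells above repeat the same sequence (drop columns, standardize, split, fit, evaluate) for each response and feature set. One possible way to reduce the repetition is a single helper, sketched below. The function name `fit_and_score` and the synthetic frame are hypothetical; the helper keeps the notebook's scale-then-split order and its 75/25 split with `random_state=1`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import explained_variance_score, mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def fit_and_score(frame, target, drop_cols=()):
    """Scale features, split 75/25, fit a linear model, return (evs, mae)."""
    X = frame.drop(columns=[target, *drop_cols])
    y = frame[target].to_numpy()
    X_scaled = StandardScaler().fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(
        X_scaled, y, test_size=0.25, random_state=1
    )
    model = LinearRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return explained_variance_score(y_test, y_pred), mean_absolute_error(y_test, y_pred)


# Hypothetical stand-in for the resort frame.
rng = np.random.default_rng(3)
toy = pd.DataFrame({
    "Name": [f"resort_{i}" for i in range(60)],
    "vertical_drop": rng.uniform(200, 3000, size=60),
    "averageSnowfall": rng.uniform(50, 500, size=60),
})
toy["projectedDaysOpen"] = (
    90 + 0.01 * toy["vertical_drop"] + 0.05 * toy["averageSnowfall"]
    + rng.normal(scale=5.0, size=60)
)

evs, mae = fit_and_score(toy, "projectedDaysOpen", drop_cols=["Name"])
print(evs, mae)
```

With the real data, a call such as `fit_and_score(df, 'projectedDaysOpen', drop_cols=['Name', 'state'])` would correspond to the second `projectedDaysOpen` model above, assuming `df` is the frame loaded from step 3.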
