# Step 4. Pre-Processing and Training Data Development

**The Data Science Method**  


1. [Problem Identification](https://medium.com/@aiden.dataminer/the-data-science-method-problem-identification-6ffcda1e5152)

2. [Data Wrangling](https://medium.com/@aiden.dataminer/the-data-science-method-dsm-data-collection-organization-and-definitions-d19b6ff141c4)
   * Data Collection
   * Data Organization
   * Data Definition
   * Data Cleaning

3. [Exploratory Data Analysis](https://medium.com/@aiden.dataminer/the-data-science-method-dsm-exploratory-data-analysis-bc84d4d8d3f9)
   * Build data profile tables and plots
     - Outliers & Anomalies
   * Explore data relationships
   * Identification and creation of features

4. [**Pre-processing and Training Data Development**](https://medium.com/@aiden.dataminer/the-data-science-method-dsm-pre-processing-and-training-data-development-fd2d75182967)
   * Create dummy or indicator features for categorical variables
   * Standardize the magnitude of numeric features
   * Split into testing and training datasets
   * Apply scaler to the testing set

5. [Modeling](https://medium.com/@aiden.dataminer/the-data-science-method-dsm-modeling-56b4233cad1b)
   * Create dummy or indicator features for categorical variables
   * Fit Models with Training Data Set
   * Review Model Outcomes - Iterate over additional models as needed.
   * Identify the Final Model

6. [Documentation](https://medium.com/@aiden.dataminer/the-data-science-method-dsm-documentation-c92c28bd45e6)
   * Review the Results
   * Present and share your findings - storytelling
   * Finalize Code
   * Finalize Documentation

Load the necessary packages as we did in step 3 and print out the current working directory just to confirm we are in the correct project directory.

In [1]:
#load python packages
import os
import pandas as pd
import datetime
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

print(os.getcwd())
print(os.listdir())

/Users/jb/Development/courses/springboard/ds/Assignments/test/big-mountain-resort
['Notebook_Step4.ipynb', 'Notebook_Step6.ipynb', 'Notebook_Step2.ipynb', '.DS_Store', 'LICENSE', 'Notebook_Step5.ipynb', 'models', 'Notebook_Step3.ipynb', 'README.md', '.gitignore', 'figures', '.ipynb_checkpoints', '.git', 'data']


Load the CSV file created in step 3, which should be saved in the `data` subfolder, and print the first five rows.

In [2]:
df = pd.read_csv('data/step3_output.csv')
df.head(5)



Unnamed: 0,Name,state,summit_elev,vertical_drop,base_elev,trams,fastEight,fastSixes,fastQuads,quad,...,SkiableTerrain_ac,Snow_Making_ac,daysOpenLastYear,yearsOpen,averageSnowfall,AdultWeekday,AdultWeekend,projectedDaysOpen,NightSkiing_ac,clusters
0,Alyeska Resort,Alaska,3939,2500,250,1,0.0,0,2,2,...,1610.0,113.0,150.0,60.0,669.0,65.0,85.0,150.0,550.0,1
1,Eaglecrest Ski Area,Alaska,2600,1540,1200,0,0.0,0,0,0,...,640.0,60.0,45.0,44.0,350.0,47.0,53.0,90.0,0.0,1
2,Hilltop Ski Area,Alaska,2090,294,1796,0,0.0,0,0,0,...,30.0,30.0,150.0,36.0,69.0,30.0,34.0,152.0,30.0,1
3,Arizona Snowbowl,Arizona,11500,2300,9200,0,0.0,1,0,2,...,777.0,104.0,122.0,81.0,260.0,89.0,89.0,122.0,0.0,0
4,Sunrise Park Resort,Arizona,11100,1800,9200,0,0.0,0,1,2,...,800.0,80.0,115.0,49.0,250.0,74.0,78.0,104.0,80.0,0


In [3]:
df[df['Name'].str.contains('Whitefish')]

Unnamed: 0,Name,state,summit_elev,vertical_drop,base_elev,trams,fastEight,fastSixes,fastQuads,quad,...,SkiableTerrain_ac,Snow_Making_ac,daysOpenLastYear,yearsOpen,averageSnowfall,AdultWeekday,AdultWeekend,projectedDaysOpen,NightSkiing_ac,clusters
151,Whitefish Mountain Resort,Montana,6817,2353,4464,0,0.0,0,3,2,...,3000.0,600.0,123.0,72.0,333.0,81.0,81.0,123.0,600.0,2


## Create dummy features for categorical variables

Create dummy variables for `state`. Add the dummies back to the dataframe and remove the original `state` column.

In [4]:
df_awe_m1 = pd.concat([df.drop(['state'], axis=1), pd.get_dummies(df['state'])], axis=1)
# df_awe_m1 = df.drop(['summit_elev','base_elev', 'clusters'], axis=1)

In [5]:
df_awe_m1.head()

Unnamed: 0,Name,summit_elev,vertical_drop,base_elev,trams,fastEight,fastSixes,fastQuads,quad,triple,...,Rhode Island,South Dakota,Tennessee,Utah,Vermont,Virginia,Washington,West Virginia,Wisconsin,Wyoming
0,Alyeska Resort,3939,2500,250,1,0.0,0,2,2,0,...,0,0,0,0,0,0,0,0,0,0
1,Eaglecrest Ski Area,2600,1540,1200,0,0.0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Hilltop Ski Area,2090,294,1796,0,0.0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,Arizona Snowbowl,11500,2300,9200,0,0.0,1,0,2,2,...,0,0,0,0,0,0,0,0,0,0
4,Sunrise Park Resort,11100,1800,9200,0,0.0,0,1,2,3,...,0,0,0,0,0,0,0,0,0,0
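
As a small illustration of what `pd.get_dummies` produces, here is a sketch on a hypothetical three-row frame (the `toy` data is made up and is not taken from the resort dataset). It also shows the optional `drop_first=True` argument, which drops one indicator column per category so that the dummies are not perfectly collinear in a linear model. Whether that is appropriate here is a judgment call, not something this notebook prescribes.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'state': ['Alaska', 'Arizona', 'Alaska'],
})

# One indicator column per state, joined back onto the frame without `state`
dummies = pd.get_dummies(toy['state'])
toy_encoded = pd.concat([toy.drop(['state'], axis=1), dummies], axis=1)
print(toy_encoded)

# Same idea, but dropping the first category's indicator column
toy_reduced = pd.get_dummies(toy, columns=['state'], drop_first=True)
print(toy_reduced)
```

Note that passing `columns=['state']` prefixes the new column names with `state_`, while calling `pd.get_dummies` on the bare Series does not.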


## Standardize the magnitude of numeric features

Using sklearn's `preprocessing` module, standardize the scale of the features in the dataframe. The resort name is not needed for modeling, so it can be dropped here as well. We also want to hold out our response variable(s) so that their true values are available when reviewing model performance. Let's set `AdultWeekend` as the y variable, our response for scaling and modeling. Later we will go back and consider `AdultWeekday`, `daysOpenLastYear`, and `projectedDaysOpen`. For now, leave them in the development dataframe.

## Predict `AdultWeekend`

In [6]:
# First we import the preprocessing package from the sklearn library
from sklearn import preprocessing

# The features will be standardized many times, so the function below
# handles the standardization of features
def standardize_features(x):
    # Create a StandardScaler() from the preprocessing package and call its fit() method on x
    scaler = preprocessing.StandardScaler().fit(x)
    # Assign to X_scaled the result of calling the transform() method on x
    X_scaled = scaler.transform(x)
    # Return the standardized features
    return X_scaled

In [7]:
# Declare an explanatory variable, called X, and assign it the result of dropping
# 'Name', 'AdultWeekend', 'summit_elev', 'base_elev', and 'clusters' from the df
X = df_awe_m1.drop(['Name','AdultWeekend', 'summit_elev','base_elev', 'clusters'], axis=1)

# Declare a response variable, called y, and assign it the AdultWeekend column of the df
y = df_awe_m1.AdultWeekend

X_scaled = standardize_features(X)
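
To make concrete what standardization does, the following sketch applies `StandardScaler` to a small made-up array (the `X_toy` values are hypothetical) and checks that each column ends up with mean 0 and unit standard deviation. It also reproduces the result by hand as `(x - mean) / std`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales (e.g. feet vs. counts)
X_toy = np.array([[3939.0, 1.0],
                  [2600.0, 0.0],
                  [2090.0, 0.0],
                  [11500.0, 2.0]])

X_toy_scaled = StandardScaler().fit_transform(X_toy)

# Each column is now centered on 0 with unit (population) standard deviation
print(X_toy_scaled.mean(axis=0))
print(X_toy_scaled.std(axis=0))

# Equivalent manual computation: (x - column mean) / column std
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
print(np.allclose(manual, X_toy_scaled))
```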

## Split into training and testing datasets

Import `train_test_split` from `sklearn.model_selection` and create a 75/25 train/test split with y = `AdultWeekend`. We will start by using the adult weekend ticket price as our response variable for modeling.

In [8]:
# Import the train_test_split function from the sklearn.model_selection utility.
from sklearn.model_selection import train_test_split

# The data will be split many times, so the function below
# handles splitting the data into X_train, X_test, y_train, y_test with a 75/25 ratio.
def split_data(x, y):
    # Get the 1-dimensional flattened array of our response variable y
    # (np.ravel also accepts a pandas Series)
    y = np.ravel(y)

    # Call train_test_split() with x and y as the first two arguments and
    # unpack the result into X_train, X_test, y_train and y_test
    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=1)
    return X_train, X_test, y_train, y_test

In [9]:
X_train, X_test, y_train, y_test = split_data(X_scaled, y)
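
The outline for this step lists "Apply scaler to the testing set", whereas the cells above scale the full feature matrix before splitting. A common alternative, sketched here on synthetic data, is to split first, fit the scaler on the training rows only, and then apply that fitted scaler to the test rows so that no test-set statistics influence training. All names ending in `_demo` are hypothetical and are not part of this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(40, 3))   # synthetic features
y_demo = rng.normal(loc=64.0, scale=5.0, size=40)         # synthetic response

# Split first (same 75/25 ratio and random_state as in split_data)
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1)

# Fit the scaler on the training rows only...
scaler_demo = StandardScaler().fit(X_train_demo)
X_train_demo_scaled = scaler_demo.transform(X_train_demo)
# ...then apply the same fitted scaler to the test rows
X_test_demo_scaled = scaler_demo.transform(X_test_demo)

print(X_train_demo_scaled.shape, X_test_demo_scaled.shape)
```

With this ordering the training columns have mean 0 exactly, while the test columns are only approximately centered, which is expected.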

Here we start the actual modeling work. First let's fit a multiple linear regression model to predict the `AdultWeekend` price.

# Step 5. Modeling


This is the fifth step in the Data Science Method. In the previous steps we cleaned and prepared the datasets. Now it's time to get into the most exciting part: modeling! In this exercise, we'll build several models and compare each model's performance. In the end, we'll choose the best model for demonstrating insights to Big Mountain management.



## Fit Models with a Training Dataset

Using sklearn, fit the model on the training dataset.

In [10]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

lm = LinearRegression()
rfr = RandomForestRegressor(random_state=0, n_estimators=500)


#### Model 1

In [11]:
model_awe_1 = lm.fit(X_train, y_train)

Predict on the testing dataset and score the model's performance using `y_test` and the `y_pred` values. The explained variance score measures the proportion of the variation in the response that the model explains. It is closely related to the R-squared value; the two are equal when the prediction errors average to zero.

In [12]:
# Make a variable called y_pred and assign it the result of calling predict() on our model variable with parameter X_test
y_pred = model_awe_1.predict(X_test)
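
To illustrate how the explained variance score relates to R-squared, the sketch below uses made-up values (`y_true_demo` and `y_biased_demo` are hypothetical) in which every prediction is off by the same constant. The explained variance score ignores such a systematic offset, while `r2_score` penalizes it.

```python
import numpy as np
from sklearn.metrics import explained_variance_score, r2_score

y_true_demo = np.array([50.0, 60.0, 70.0, 80.0])

# Predictions that are all too high by 5: the residuals have zero variance
y_biased_demo = y_true_demo + 5.0

evs_demo = explained_variance_score(y_true_demo, y_biased_demo)
r2_demo = r2_score(y_true_demo, y_biased_demo)
print(evs_demo, r2_demo)
```

Here the explained variance score is 1.0 while R-squared is 0.8, so reporting the mean absolute error alongside the explained variance, as this notebook does, helps reveal any consistent bias.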

## Review Model Outcomes — Iterate over additional models as needed

In [13]:
# You might want to use the explained_variance_score() and mean_absolute_error() metrics.
# To do so, you will need to import them from sklearn.metrics.
# You can plug y_test and y_pred into the functions to evaluate the model.
from sklearn.metrics import explained_variance_score, mean_absolute_error

# The models will be evaluated many times, so the function below
# handles model evaluation with explained variance score and mean absolute error
def model_evaluate(y_test, y_pred):
    evs = explained_variance_score(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    print('Explained Variance Score: ', round(evs, 3))
    print('Mean Absolute Error: ', round(mae, 3))
    return evs, mae

In [14]:
evs_awe_1, mae_awe_1 = model_evaluate(y_test, y_pred)

Explained Variance Score:  0.942
Mean Absolute Error:  4.891


Print the intercept value from the linear model.

In [15]:
print(model_awe_1.intercept_)

64.09440259526306


Because the features were standardized, the intercept is the predicted `AdultWeekend` price for a resort whose features all sit at their average values, which is approximately the mean `AdultWeekend` price across the resorts. Each coefficient is a numeric adjustment, added to or subtracted from the intercept, that moves the prediction toward a particular observation's `AdultWeekend` value. Also, because we took the time to scale our x values, we can compare the coefficients of the features to gauge feature importance.
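
The relationship between standardized features and the intercept can be checked on synthetic data. In this sketch (all names and numbers are made up), the features are centered and the model is fit on the same rows, so the ordinary least squares intercept equals the mean of the response exactly, and the absolute coefficients can be compared directly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(100, 3))                         # synthetic features
noise = rng.normal(scale=0.1, size=100)
y_toy = 64.0 + X_raw @ np.array([5.0, -2.0, 0.5]) + noise  # synthetic response

X_std = StandardScaler().fit_transform(X_raw)
model_toy = LinearRegression().fit(X_std, y_toy)

# With centered features, the OLS intercept equals the mean of the response
print(model_toy.intercept_, y_toy.mean())

# Coefficients on standardized features are comparable by absolute size
print(np.abs(model_toy.coef_))
```

In the notebook the scaler is fit on all rows and the model on the training rows only, so there the intercept is close to, but not exactly, the training mean.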

Print the coefficient values from the linear model and sort them in descending order to identify the ten most important features. Make sure to review the absolute value of the coefficients: the adjustment may be positive or negative, but what we are looking for is the magnitude of the impact on our response variable.

In [16]:
# You might want to make a pandas DataFrame displaying the coefficients for each feature like so:
pd.DataFrame(abs(lm.coef_), X.columns, columns=['Coefficient']).sort_values(by='Coefficient', ascending=False).head(10)

Unnamed: 0,Coefficient
total_chairs,45172750000000.0
fastQuads,17125090000000.0
surface,16044930000000.0
double,14139380000000.0
triple,12613300000000.0
quad,10222610000000.0
fastSixes,5076742000000.0
trams,4362078000000.0
New York,4118177000000.0
Michigan,3886441000000.0


We should see that the top ten important features contain different states. However, the state is not something the managers at the Big Mountain Resort can do anything about. Given that we care more about actionable traits associated with ticket pricing, rebuild the model without the state features and compare the results.

Hint: Try to construct another model using exactly the steps we followed above. 
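
Coefficient magnitudes on the order of 10^13, as in the Model 1 output above, are a common symptom of exactly or nearly collinear columns, for instance a full set of dummy columns or a total that is the sum of other columns. Whether that is what happens in this dataset is an assumption. The sketch below only shows, on made-up lift counts, how a rank check can reveal that kind of redundancy.

```python
import numpy as np
import pandas as pd

# Hypothetical lift counts; `total` is constructed as the sum of the others
lifts = pd.DataFrame({
    'double': [1, 0, 2, 3, 1],
    'triple': [0, 1, 1, 2, 0],
    'quad':   [2, 2, 0, 1, 1],
})
lifts['total'] = lifts.sum(axis=1)

n_columns = lifts.shape[1]
rank = np.linalg.matrix_rank(lifts.to_numpy(dtype=float))
print(n_columns, rank)

# rank < number of columns means at least one column is a linear
# combination of the others, so OLS coefficients are not uniquely determined
has_redundant_columns = rank < n_columns
print(has_redundant_columns)
```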

#### Model 2

In [17]:
# Start fresh 
df_awe_m2 = df.copy()

# Declare an explanatory variable, called X,and assign it the result of dropping 'Name', 'state' and 'AdultWeekend' from the df
X = df_awe_m2.drop(['Name', 'state', 'AdultWeekend'], axis=1)

# Declare a response variable, called y, and assign it the AdultWeekend column of the df 
y = df_awe_m2.AdultWeekend

X_scaled = standardize_features(X) 
X_train, X_test, y_train, y_test = split_data(X_scaled, y)

In [18]:
model_awe_2 = lm.fit(X_train, y_train)

# Make a variable called y_pred and assign it the result of calling predict() on our model variable with parameter X_test
y_pred = model_awe_2.predict(X_test)

evs_awe_2, mae_awe_2 = model_evaluate(y_test, y_pred)
print(model_awe_2.intercept_)

Explained Variance Score:  0.924
Mean Absolute Error:  5.427
64.05984328521328


In [19]:
# You might want to make a pandas DataFrame displaying the coefficients for each feature like so:
pd.DataFrame(abs(lm.coef_), X.columns, columns=['Coefficient']).sort_values(by='Coefficient', ascending=False).head(10)


Unnamed: 0,Coefficient
AdultWeekday,20.043081
summit_elev,13.937355
base_elev,10.74881
vertical_drop,4.839792
averageSnowfall,1.525115
quad,1.521778
triple,1.418922
Runs,1.331539
surface,1.259977
daysOpenLastYear,1.037355


When reviewing our new model coefficients, we see that `summit_elev` is now in the number two spot, followed by `base_elev`. Summit elevation is also difficult to change from a management perspective and is highly correlated with `base_elev`. This time, rebuild the model without the state features and without `summit_elev`, and compare the results.
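
One way to check how strongly features are correlated before dropping one is to list the feature pairs whose absolute correlation exceeds a threshold. The sketch below does this on synthetic data; the column names mirror the resort data, but the values, the assumed relationship `summit_elev = base_elev + vertical_drop`, and the 0.8 threshold are all illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
base = rng.uniform(500, 9000, size=200)          # synthetic base elevations
drop = rng.uniform(200, 3000, size=200)          # synthetic vertical drops
demo = pd.DataFrame({
    'base_elev': base,
    'vertical_drop': drop,
    'summit_elev': base + drop,                  # assumed relationship
    'Runs': rng.integers(5, 150, size=200),
})

corr = demo.corr().abs()
# Keep only the upper triangle so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().sort_values(ascending=False)

# Pairs above the (arbitrary) 0.8 threshold
high_pairs = pairs[pairs > 0.8]
print(high_pairs)
```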

#### Model 3

In [20]:
# Start fresh 
df_awe_m3 = df.copy()

# Declare an explanatory variable, called X,and assign it the result of dropping 'Name', 'state', 'AdultWeekend', and 'summit_elev', from the df
X = df_awe_m3.drop(['Name', 'state', 'AdultWeekend', 'summit_elev'], axis=1)

# Declare a response variable, called y, and assign it the AdultWeekend column of the df 
y = df_awe_m3.AdultWeekend

X_scaled = standardize_features(X) 
X_train, X_test, y_train, y_test = split_data(X_scaled, y)

In [21]:
model_awe_3 = lm.fit(X_train, y_train)

# Make a variable called y_pred and assign it the result of calling predict() on our model variable with parameter X_test
y_pred = model_awe_3.predict(X_test)

evs_awe_3, mae_awe_3 = model_evaluate(y_test, y_pred)
print(model_awe_3.intercept_)

Explained Variance Score:  0.93
Mean Absolute Error:  5.23
64.06317276743371


In [22]:
# You might want to make a pandas DataFrame displaying the coefficients for each feature like so:
pd.DataFrame(abs(lm.coef_), X.columns, columns=['Coefficient']).sort_values(by='Coefficient', ascending=False).head(10)


Unnamed: 0,Coefficient
AdultWeekday,20.011979
vertical_drop,1.496063
averageSnowfall,1.486591
quad,1.477121
Runs,1.454351
triple,1.398586
surface,1.200811
daysOpenLastYear,1.084806
base_elev,0.975561
clusters,0.860096


#### Model 4

In [23]:
# Start fresh 
df_awe_m4 = pd.concat([df.drop(['state'], axis=1), pd.get_dummies(df['state'])], axis=1)

# Declare an explanatory variable, called X, and assign it the result of dropping 'Name', 'AdultWeekend', 'summit_elev', 'base_elev', and 'clusters' from the df
X = df_awe_m4.drop(['Name','AdultWeekend', 'summit_elev','base_elev', 'clusters'], axis=1)

# Declare a response variable, called y, and assign it the AdultWeekend column of the df 
y = df_awe_m4.AdultWeekend

X_scaled = standardize_features(X) 
X_train, X_test, y_train, y_test = split_data(X_scaled, y)


In [24]:
model_awe_4 = rfr.fit(X_train, y_train)
model_awe_4.score(X_test, y_test)

# Make a variable called y_pred and assign it the result of calling predict() on our model variable with parameter X_test
y_pred = model_awe_4.predict(X_test)

evs_awe_4, mae_awe_4 = model_evaluate(y_test, y_pred)

coff_m4 = pd.DataFrame(rfr.feature_importances_, X.columns, columns=['Importance'])
print(coff_m4.sort_values('Importance', ascending=False))

Explained Variance Score:  0.907
Mean Absolute Error:  5.808
                   Importance
AdultWeekday         0.840008
total_chairs         0.015910
Snow_Making_ac       0.015194
yearsOpen            0.011536
Runs                 0.010734
vertical_drop        0.009692
TerrainParks         0.008492
averageSnowfall      0.008453
projectedDaysOpen    0.007792
daysOpenLastYear     0.007588
LongestRun_mi        0.007206
SkiableTerrain_ac    0.007204
quad                 0.005975
fastQuads            0.005544
triple               0.005128
NightSkiing_ac       0.005055
North Carolina       0.004484
surface              0.004293
double               0.004174
Montana              0.001512
California           0.001462
Vermont              0.001128
Tennessee            0.001086
Michigan             0.001046
New York             0.000965
Idaho                0.000903
Maryland             0.000846
Connecticut          0.000778
Virginia             0.000535
Minnesota            0.000533
Pennsylva

#### Model 5

In [25]:
# Start fresh 
df_awe_m5 = df.copy()

# Declare an explanatory variable, called X, and assign it the result of dropping 'Name', 'state', and 'AdultWeekend' from the df
X = df_awe_m5.drop(['Name', 'state', 'AdultWeekend'], axis=1)

# Declare a response variable, called y, and assign it the AdultWeekend column of the df 
y = df_awe_m5.AdultWeekend

X_scaled = standardize_features(X) 
X_train, X_test, y_train, y_test = split_data(X_scaled, y)

In [26]:
model_awe_5 = rfr.fit(X_train, y_train)
print(model_awe_5.score(X_test, y_test))

# Make a variable called y_pred and assign it the result of calling predict() on our model variable with parameter X_test
y_pred = model_awe_5.predict(X_test)

evs_awe_5, mae_awe_5 = model_evaluate(y_test, y_pred)

coff_m5 = pd.DataFrame(rfr.feature_importances_, X.columns, columns=['Importance'])
print(coff_m5.sort_values('Importance', ascending=False))

0.9107895218584501
Explained Variance Score:  0.911
Mean Absolute Error:  5.746
                   Importance
AdultWeekday         0.841119
total_chairs         0.016499
Snow_Making_ac       0.015016
yearsOpen            0.010643
Runs                 0.010372
vertical_drop        0.010147
base_elev            0.009333
TerrainParks         0.008693
projectedDaysOpen    0.008162
LongestRun_mi        0.007588
averageSnowfall      0.007443
daysOpenLastYear     0.007294
SkiableTerrain_ac    0.007267
summit_elev          0.007228
quad                 0.006193
NightSkiing_ac       0.005515
fastQuads            0.005434
triple               0.005070
surface              0.004617
double               0.003979
clusters             0.001596
fastSixes            0.000593
trams                0.000199
fastEight            0.000000


#### Model 6

In [27]:
# Start fresh 
df_awe_m6 = df.copy()

# Declare an explanatory variable, called X,and assign it the result of dropping 'Name', 'state', 'AdultWeekend', and 'summit_elev', from the df
X = df_awe_m6.drop(['Name', 'state', 'AdultWeekend', 'summit_elev'], axis=1)

# Declare a response variable, called y, and assign it the AdultWeekend column of the df 
y = df_awe_m6.AdultWeekend

X_scaled = standardize_features(X) 
X_train, X_test, y_train, y_test = split_data(X_scaled, y)

In [28]:
model_awe_6 = rfr.fit(X_train, y_train)
print(model_awe_6.score(X_test, y_test))

# Make a variable called y_pred and assign it the result of calling predict() on our model variable with parameter X_test
y_pred = model_awe_6.predict(X_test)

evs_awe_6, mae_awe_6 = model_evaluate(y_test, y_pred)

coff_m6 = pd.DataFrame(rfr.feature_importances_, X.columns, columns=['Importance'])
print(coff_m6.sort_values('Importance', ascending=False))

0.9087046451030285
Explained Variance Score:  0.909
Mean Absolute Error:  5.793
                   Importance
AdultWeekday         0.839749
total_chairs         0.016978
Snow_Making_ac       0.015548
base_elev            0.012715
yearsOpen            0.011352
Runs                 0.010934
vertical_drop        0.010740
TerrainParks         0.008675
averageSnowfall      0.008098
projectedDaysOpen    0.007923
LongestRun_mi        0.007767
daysOpenLastYear     0.007489
SkiableTerrain_ac    0.006914
quad                 0.006618
fastQuads            0.005876
NightSkiing_ac       0.005533
triple               0.005442
surface              0.004967
double               0.004133
clusters             0.001698
fastSixes            0.000611
trams                0.000241
fastEight            0.000000
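
The `feature_importances_` attribute used above reports impurity-based importances computed from the training data. As a complementary view, scikit-learn also offers `permutation_importance`, which measures how much a score drops on held-out rows when one feature is shuffled. The sketch below runs both on synthetic data in which only the first column drives the response; every name in it is hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_syn = rng.normal(size=(300, 4))
# Only the first column drives the response in this synthetic setup
y_syn = 20.0 * X_syn[:, 0] + rng.normal(scale=1.0, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=1)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Impurity-based importances (what `feature_importances_` reports)
print(forest.feature_importances_)

# Permutation importances measured on the held-out rows
perm = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)
print(perm.importances_mean)
```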


## Identify the Final Model

We review the model performances in the table below and choose the best model for providing insights to Big Mountain management about which features are driving ski resort lift ticket prices.

In [29]:
print('Model 1:')
print('Explained Variance:  ', evs_awe_1)
print('Mean Absolute Error: ', mae_awe_1)
print()
print('Model 2:')
print('Explained Variance:  ', evs_awe_2)
print('Mean Absolute Error: ', mae_awe_2)
print()
print('Model 3:')
print('Explained Variance:  ', evs_awe_3)
print('Mean Absolute Error: ', mae_awe_3)
print()
print('Model 4:')
print('Explained Variance:  ', evs_awe_4)
print('Mean Absolute Error: ', mae_awe_4)
print()
print('Model 5:')
print('Explained Variance:  ', evs_awe_5)
print('Mean Absolute Error: ', mae_awe_5)
print()
print('Model 6:')
print('Explained Variance:  ', evs_awe_6)
print('Mean Absolute Error: ', mae_awe_6)

Model 1:
Explained Variance:   0.9418302256922265
Mean Absolute Error:  4.891004596047414

Model 2:
Explained Variance:   0.9244916305612476
Mean Absolute Error:  5.4273095818580135

Model 3:
Explained Variance:   0.9303448843132529
Mean Absolute Error:  5.230491183251202

Model 4:
Explained Variance:   0.9069943141371271
Mean Absolute Error:  5.8080674949259

Model 5:
Explained Variance:   0.9108519835581262
Mean Absolute Error:  5.745632334931096

Model 6:
Explained Variance:   0.9087392430754438
Mean Absolute Error:  5.793215346547441
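
Rather than reading the metrics off repeated print statements, the scores can be collected in a single DataFrame and sorted. The sketch below hard-codes the values printed above, rounded to three decimals, so that it runs on its own. In the notebook itself the `evs_awe_*` and `mae_awe_*` variables could be used instead.

```python
import pandas as pd

# Values copied from the printed output above, rounded to three decimals
results = pd.DataFrame(
    {
        'explained_variance': [0.942, 0.924, 0.930, 0.907, 0.911, 0.909],
        'mean_absolute_error': [4.891, 5.427, 5.230, 5.808, 5.746, 5.793],
    },
    index=['Model 1', 'Model 2', 'Model 3', 'Model 4', 'Model 5', 'Model 6'],
)

# Rank the models from lowest to highest error
print(results.sort_values('mean_absolute_error'))

best_model = results['mean_absolute_error'].idxmin()
print(best_model)
```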


| Model | Explained Variance | Mean Absolute Error | Features Dropped |
| --- | --- | --- | --- |
| Linear Regressor |   |   |   |
| Model 1. | 0.942 | 4.891 | 'summit_elev', 'base_elev', 'clusters' |
| Model 2. | 0.924 | 5.427 | 'state' |
| Model 3. | 0.930 | 5.230 | 'state', 'summit_elev' |
|   |   |   |   |
| Random Forest Regressor |   |   |   |
| Model 4. | 0.907 | 5.808 | 'summit_elev', 'base_elev', 'clusters' |
| Model 5. | 0.911 | 5.746 | 'state' |
| Model 6. | 0.909 | 5.793 | 'state', 'summit_elev' |

### `AdultWeekend` prediction
Model Selection: Model 1

Model 1, which keeps the `state` dummy features, performed the best and should provide the most accurate insights for Big Mountain management's decision on lift ticket prices. It produced a higher explained variance than the other models: about 94% of the variance in `AdultWeekend` (y) can be explained by the linear relationship with the input features (X). Its mean absolute error, 4.891, is also lower than that of the other models.

Although a resort's state is not something management can control, the `state` dummy features helped lower the mean absolute error because they let the model account for states where prices run higher or lower.




## Predict `AdultWeekday`

In [30]:
# Start fresh 
df_awd_m1 = df.copy()

# Make dummy `state_`s and drop `state`
df_awd_m1 = pd.concat([df_awd_m1.drop(['state'], axis=1), pd.get_dummies(df_awd_m1[['state']])], axis=1)

# Declare an explanatory variable, called X, and assign it the result of dropping
# 'Name', 'AdultWeekday', 'summit_elev', 'base_elev', and 'clusters' from the df
X = df_awd_m1.drop(['Name', 'AdultWeekday', 'summit_elev','base_elev', 'clusters'], axis=1)

# Declare a response variable, called y, and assign it the AdultWeekday column of the df
y = df_awd_m1.AdultWeekday

# Standardize
X_scaled = standardize_features(X) 

# Split into test and train
X_train, X_test, y_train, y_test = split_data(X_scaled, y)

#  Train model
model_awd_1 = lm.fit(X_train, y_train)

# Make a variable called y_pred and assign it the result of calling predict() on our model variable with parameter X_test
y_pred = model_awd_1.predict(X_test)

evs_awd_1, mae_awd_1 = model_evaluate(y_test, y_pred)
print(model_awd_1.intercept_)

Explained Variance Score:  0.932
Mean Absolute Error:  5.344
58.015599983428864


In [31]:
# Start fresh
df_awd_m2 = df.copy()

# Declare an explanatory variable, called X, and assign it the result of dropping 'Name', 'AdultWeekday', and 'state' from the df
X = df_awd_m2.drop(['Name', 'AdultWeekday', 'state'], axis=1)

# Declare a response variable, called y, and assign it the AdultWeekday column of the df
y = df_awd_m2.AdultWeekday

# Standardize
X_scaled = standardize_features(X) 

# Split into test and train
X_train, X_test, y_train, y_test = split_data(X_scaled, y)

#  Train model
model_awd_2 = lm.fit(X_train, y_train)

# Make a variable called y_pred and assign it the result of calling predict() on our model variable with parameter X_test
y_pred = model_awd_2.predict(X_test)

evs_awd_2, mae_awd_2 = model_evaluate(y_test, y_pred)
print(model_awd_2.intercept_)

Explained Variance Score:  0.915
Mean Absolute Error:  5.727
58.07050182630486


In [32]:
# Start fresh
df_awd_m3 = df.copy()

# Declare an explanatory variable, called X, and assign it the result of dropping 'Name', 'AdultWeekday', 'state', and 'summit_elev' from the df
X = df_awd_m3.drop(['Name', 'AdultWeekday', 'state', 'summit_elev'], axis=1)

# Declare a response variable, called y, and assign it the AdultWeekday column of the df
y = df_awd_m3.AdultWeekday

# Standardize
X_scaled = standardize_features(X) 

# Split into test and train
X_train, X_test, y_train, y_test = split_data(X_scaled, y)

# Train model
model_awd_3 = lm.fit(X_train, y_train)

# Make a variable called y_pred and assign it the result of calling predict() on our model variable with parameter X_test
y_pred = model_awd_3.predict(X_test)

evs_awd_3, mae_awd_3 = model_evaluate(y_test, y_pred)
print(model_awd_3.intercept_)

Explained Variance Score:  0.918
Mean Absolute Error:  5.601
58.066984208429695


In [33]:
print('Model 1:')
print('Explained Variance:  ', evs_awd_1)
print('Mean Absolute Error: ', mae_awd_1)
print()
print('Model 2:')
print('Explained Variance:  ', evs_awd_2)
print('Mean Absolute Error: ', mae_awd_2)
print()
print('Model 3:')
print('Explained Variance:  ', evs_awd_3)
print('Mean Absolute Error: ', mae_awd_3)

Model 1:
Explained Variance:   0.9319839837628828
Mean Absolute Error:  5.343511780990138

Model 2:
Explained Variance:   0.9146816770144183
Mean Absolute Error:  5.727019249006059

Model 3:
Explained Variance:   0.9182691736896524
Mean Absolute Error:  5.6006548696030976


| Model | Explained Variance | Mean Absolute Error | Features Dropped |
| --- | --- | --- | --- |
| Linear Regressor |   |   |   |
| Model 1. | 0.932 | 5.344 | 'summit_elev', 'base_elev', 'clusters' |
| Model 2. | 0.915 | 5.727 | 'state' |
| Model 3. | 0.918 | 5.601 | 'state', 'summit_elev' |


### `AdultWeekday` prediction
Model Selection: Model 1
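
Because the same scale, split, fit, and evaluate steps are repeated for every response variable, they could be wrapped in one helper. The function below, `fit_linear_model`, is a hypothetical sketch and not part of the original notebook; it is demonstrated on synthetic data. It creates a fresh `LinearRegression` on each call. This matters because `fit()` returns the estimator itself, so variables assigned from repeated `lm.fit(...)` calls all refer to the same, most recently fitted object.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import explained_variance_score, mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def fit_linear_model(frame, target, drop_cols=()):
    """Scale, split 75/25, fit a linear regression, and return (model, evs, mae)."""
    X = frame.drop(columns=[target, *drop_cols])
    y = frame[target].to_numpy()
    X_scaled = StandardScaler().fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(
        X_scaled, y, test_size=0.25, random_state=1)
    # A fresh estimator per call, so earlier fits are not overwritten
    model = LinearRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)
    evs = explained_variance_score(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    return model, evs, mae


# Synthetic stand-in for the resort data
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(rng.normal(size=(120, 3)), columns=['a', 'b', 'c'])
demo_df['price'] = (58.0 + 4.0 * demo_df['a'] - 2.0 * demo_df['b']
                    + rng.normal(scale=0.5, size=120))

model_all, evs_all, mae_all = fit_linear_model(demo_df, 'price')
model_no_c, evs_no_c, mae_no_c = fit_linear_model(demo_df, 'price', drop_cols=['c'])
print(round(evs_all, 3), round(mae_all, 3))
print(round(evs_no_c, 3), round(mae_no_c, 3))
```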




## Predict `projectedDaysOpen`

In [34]:
# Start fresh 
df_pdo_m1 = df.copy()

# Make dummy `state_`s and drop `state`
df_pdo_m1 = pd.concat([df_pdo_m1.drop(['state'], axis=1), pd.get_dummies(df_pdo_m1[['state']])], axis=1)

# Declare an explanatory variable, called X, and assign it the result of dropping 'Name', 'projectedDaysOpen', 'summit_elev', 'base_elev', and 'clusters' from the df
X = df_pdo_m1.drop(['Name', 'projectedDaysOpen', 'summit_elev','base_elev', 'clusters'], axis=1)

# Declare a response variable, called y, and assign it the projectedDaysOpen column of the df 
y = df_pdo_m1.projectedDaysOpen

# Standardize
X_scaled = standardize_features(X) 

# Split into test and train
X_train, X_test, y_train, y_test = split_data(X_scaled, y)

#  Train model
model_pdo_1 = lm.fit(X_train, y_train)

# Make a variable called y_pred and assign it the result of calling predict() on our model variable with parameter X_test
y_pred = model_pdo_1.predict(X_test)

evs_pdo_1, mae_pdo_1 = model_evaluate(y_test, y_pred)
print(model_pdo_1.intercept_)

Explained Variance Score:  -0.228
Mean Absolute Error:  15.449
120.17855603091783


In [35]:
# Start fresh 
df_pdo_m2 = df.copy()

# Declare an explanatory variable, called X, and assign it the result of dropping 'Name', 'projectedDaysOpen', and 'state' from the df
X = df_pdo_m2.drop(['Name', 'projectedDaysOpen', 'state'], axis=1)

# Declare a response variable, called y, and assign it the projectedDaysOpen column of the df 
y = df_pdo_m2.projectedDaysOpen

# Standardize
X_scaled = standardize_features(X) 

# Split into test and train
X_train, X_test, y_train, y_test = split_data(X_scaled, y)

#  Train model
model_pdo_2 = lm.fit(X_train, y_train)

# Make a variable called y_pred and assign it the result of calling predict() on our model variable with parameter X_test
y_pred = model_pdo_2.predict(X_test)

evs_pdo_2, mae_pdo_2 = model_evaluate(y_test, y_pred)
print(model_pdo_2.intercept_)

Explained Variance Score:  -0.049
Mean Absolute Error:  13.811
120.16053365315938


In [36]:
# Start fresh 
df_pdo_m3 = df.copy()

# Declare an explanatory variable, called X, and assign it the result of dropping 'Name', 'projectedDaysOpen', 'state', and 'summit_elev' from the df
X = df_pdo_m3.drop(['Name', 'projectedDaysOpen', 'state', 'summit_elev'], axis=1)

# Declare a response variable, called y, and assign it the projectedDaysOpen column of the df 
y = df_pdo_m3.projectedDaysOpen

# Standardize
X_scaled = standardize_features(X) 

# Split into test and train
X_train, X_test, y_train, y_test = split_data(X_scaled, y)

#  Train model
model_pdo_3 = lm.fit(X_train, y_train)

# Make a variable called y_pred and assign it the result of calling predict() on our model variable with parameter X_test
y_pred = model_pdo_3.predict(X_test)

evs_pdo_3, mae_pdo_3 = model_evaluate(y_test, y_pred)
print(model_pdo_3.intercept_)

Explained Variance Score:  -0.064
Mean Absolute Error:  13.856
120.16539099135309


In [37]:
print('Model 1:')
print('Explained Variance:  ', evs_pdo_1)
print('Mean Absolute Error: ', mae_pdo_1)
print()
print('Model 2:')
print('Explained Variance:  ', evs_pdo_2)
print('Mean Absolute Error: ', mae_pdo_2)
print()
print('Model 3:')
print('Explained Variance:  ', evs_pdo_3)
print('Mean Absolute Error: ', mae_pdo_3)

Model 1:
Explained Variance:   -0.22791399282251334
Mean Absolute Error:  15.44886556966154

Model 2:
Explained Variance:   -0.04938168313610203
Mean Absolute Error:  13.811375894432356

Model 3:
Explained Variance:   -0.06443561881378312
Mean Absolute Error:  13.85622688415533


| Model | Explained Variance | Mean Absolute Error | Features Dropped |
| --- | --- | --- | --- |
| Linear Regressor |   |   |   |
| Model 1. | -0.228 | 15.449 | 'summit_elev', 'base_elev', 'clusters' |
| Model 2. | -0.049 | 13.811 | 'state' |
| Model 3. | -0.064 | 13.856 | 'state', 'summit_elev' |
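
The explained variance scores for `projectedDaysOpen` are negative, which means the residuals vary more than the response itself; these models do worse on the test set than a constant prediction would. The sketch below illustrates how the score behaves, using made-up numbers (`y_obs` and both prediction arrays are hypothetical).

```python
import numpy as np
from sklearn.metrics import explained_variance_score

y_obs = np.array([100.0, 110.0, 120.0, 130.0])

# A constant prediction explains none of the variance
flat_pred = np.full_like(y_obs, y_obs.mean())
score_flat = explained_variance_score(y_obs, flat_pred)

# Predictions that move in the wrong direction do worse than the constant
reversed_pred = y_obs[::-1].copy()
score_reversed = explained_variance_score(y_obs, reversed_pred)

print(score_flat, score_reversed)
```

In this example the constant prediction scores 0.0 and the reversed prediction scores -3.0, so a negative value is a signal that the chosen features carry little usable information about the response.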