# Guided Capstone Step 4. Pre-processing and Training Data Development 

**The Data Science Method**  


1.   Problem Identification 


2.   Data Wrangling 
  
 
3.   Exploratory Data Analysis   

4.   **Pre-processing and Training Data Development**  
  * Create dummy or indicator features for categorical variables
  * Standardize the magnitude of numeric features
  * Split into testing and training datasets
  * Apply scaler to the testing set
5.   Modeling 
  * Fit Models with Training Data Set
  * Review Model Outcomes — Iterate over additional models as needed.
  * Identify the Final Model

6.   Documentation
  * Review the Results
  * Present and share your findings - storytelling
  * Finalize Code 
  * Finalize Documentation

**<font color='teal'> Start by loading the necessary packages as we did in step 3 and printing out our current working directory just to confirm we are in the correct project directory. </font>**

In [1]:
#load python packages
import os
import pandas as pd
import datetime
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
os.getcwd()

'/Users/justin/Desktop/GuidedCapstone-master-2/notebooks'

**<font color='teal'> Load the csv file you created in step 3 (remember, it should be saved inside your data subfolder) and print the first five rows.</font>**

In [17]:
#Question: why does the answer key load the step 2 output here? The prompt asks for step 3.
path = "/Users/justin/Desktop/GuidedCapstone-master-2/data/data"
os.chdir(path) #switch to the data subfolder
os.getcwd()
df = pd.read_csv('step3_output.csv')

In [18]:
df.head()

| | Name | state | summit_elev | vertical_drop | trams | fastEight | fastSixes | fastQuads | quad | triple | ... | SkiableTerrain_ac | Snow Making_ac | daysOpenLastYear | yearsOpen | averageSnowfall | AdultWeekday | AdultWeekend | projectedDaysOpen | NightSkiing_ac | clusters |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alyeska Resort | Alaska | 3939 | 2500 | 1 | 0.0 | 0 | 2 | 2 | 0 | ... | 1610.0 | 113.0 | 150.0 | 60.0 | 669.0 | 65.0 | 85.0 | 150.0 | 550.0 | 1 |
| 1 | Eaglecrest Ski Area | Alaska | 2600 | 1540 | 0 | 0.0 | 0 | 0 | 0 | 0 | ... | 640.0 | 60.0 | 45.0 | 44.0 | 350.0 | 47.0 | 53.0 | 90.0 | 0.0 | 1 |
| 2 | Hilltop Ski Area | Alaska | 2090 | 294 | 0 | 0.0 | 0 | 0 | 0 | 1 | ... | 30.0 | 30.0 | 150.0 | 36.0 | 69.0 | 30.0 | 34.0 | 152.0 | 30.0 | 1 |
| 3 | Arizona Snowbowl | Arizona | 11500 | 2300 | 0 | 0.0 | 1 | 0 | 2 | 2 | ... | 777.0 | 104.0 | 122.0 | 81.0 | 260.0 | 89.0 | 89.0 | 122.0 | 0.0 | 0 |
| 4 | Sunrise Park Resort | Arizona | 11100 | 1800 | 0 | 0.0 | 0 | 1 | 2 | 3 | ... | 800.0 | 80.0 | 115.0 | 49.0 | 250.0 | 74.0 | 78.0 | 104.0 | 80.0 | 0 |


In [19]:
df.shape

(330, 26)

In [20]:
#state has 35 unique values, so turning it into dummy variables adds 35 indicator columns
df['state'].nunique()

35

## Create dummy features for categorical variables

**<font color='teal'> Create dummy variables for `state`. Add the dummies back to the dataframe and remove the original column for `state`. </font>**
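One way to read this prompt is the `pd.get_dummies` plus `pd.concat` route. The sketch below is a minimal illustration on a made-up three-column frame (the values and the `toy` name are hypothetical, not the resort data). It only shows how the column count changes when the categorical column is replaced by its indicators.

```python
import pandas as pd

# Toy stand-in for the resort data (hypothetical values, not the real dataset)
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "state": ["Alaska", "Arizona", "Alaska", "Utah"],
    "summit_elev": [3939, 11500, 2600, 9000],
})

# One indicator column per unique state value
dummies = pd.get_dummies(toy["state"], prefix="state")

# Add the dummies back and drop the original categorical column
toy_encoded = pd.concat([toy.drop(columns="state"), dummies], axis=1)

print(toy_encoded.columns.tolist())
print(toy_encoded.shape)  # 3 original columns - 1 + 3 unique states = 5 columns
```

The same arithmetic applied to the full dataframe gives 26 - 1 + 35 = 60 columns.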

In [21]:
df.columns

Index(['Name', 'state', 'summit_elev', 'vertical_drop', 'trams', 'fastEight',
       'fastSixes', 'fastQuads', 'quad', 'triple', 'double', 'surface',
       'total_chairs', 'Runs', 'TerrainParks', 'LongestRun_mi',
       'SkiableTerrain_ac', 'Snow Making_ac', 'daysOpenLastYear', 'yearsOpen',
       'averageSnowfall', 'AdultWeekday', 'AdultWeekend', 'projectedDaysOpen',
       'NightSkiing_ac', 'clusters'],
      dtype='object')

In [34]:
#https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html
#https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
#With get_dummies the math is 26 + 35 - 1 = 60 columns/features (35 state dummies replace the 1 state column)
#Let's use ColumnTransformer with OneHotEncoder instead
#https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html
#https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn import preprocessing

ohe = OneHotEncoder(handle_unknown='ignore') #categories not seen during fit are encoded as all zeros instead of raising an error
scaler = preprocessing.StandardScaler()
#drop the resort name (not a feature) and AdultWeekend (the response)
df2 = df.drop(['Name','AdultWeekend'], axis=1)
#clusters is a categorical label, so it is one-hot encoded rather than scaled
ct = make_column_transformer(
    (ohe, ['state', 'clusters']), #encode state and clusters
    (scaler, ['summit_elev', 'vertical_drop', 'trams', 'fastEight',
       'fastSixes', 'fastQuads', 'quad', 'triple', 'double', 'surface',
       'total_chairs', 'Runs', 'TerrainParks', 'LongestRun_mi',
       'SkiableTerrain_ac', 'Snow Making_ac', 'daysOpenLastYear', 'yearsOpen',
       'averageSnowfall', 'AdultWeekday', 'projectedDaysOpen',
       'NightSkiing_ac']), #scale
    remainder='passthrough') #passthrough
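To check what a transformer like the one above produces, it can help to fit it on a tiny frame first. The following is a sketch under stated assumptions: the `toy` frame and the `toy_ct` name are made up for illustration, and only two numeric columns are included.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical miniature of the modeling frame
toy = pd.DataFrame({
    "state": ["Alaska", "Arizona", "Alaska", "Utah"],
    "clusters": [1, 0, 1, 2],
    "summit_elev": [3939.0, 11500.0, 2600.0, 9000.0],
    "vertical_drop": [2500.0, 2300.0, 1540.0, 3000.0],
})

toy_ct = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["state", "clusters"]),  # categorical columns
    (StandardScaler(), ["summit_elev", "vertical_drop"]),             # numeric columns
    remainder="passthrough",  # any unlisted column would pass through unchanged
)

encoded = toy_ct.fit_transform(toy)

# 3 state dummies + 3 cluster dummies + 2 scaled numeric columns = 8 columns
print(encoded.shape)
```

The output width depends on how many distinct categories the transformer saw when it was fit.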

## Standardize the magnitude of numeric features

**<font color='teal'> Using sklearn preprocessing, standardize the scale of the features of the dataframe, except the name of the resort, which we don't need in the dataframe for modeling, so it can be dropped here as well. Also, we want to hold out our response variable(s) so we can have their true values available for model performance review. Let's set `AdultWeekend` to the y variable as our response for scaling and modeling. Later we will go back and consider `AdultWeekday`, `daysOpenLastYear`, and `projectedDaysOpen`. For now, leave them in the development dataframe. </font>**
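As a rough illustration of what standardizing means here, the sketch below compares `StandardScaler` with the same calculation done by hand. The `features` array is made up for the example. Each column ends up with mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature columns on very different scales
features = np.array([[3939.0, 1.0],
                     [2600.0, 0.0],
                     [11500.0, 2.0],
                     [11100.0, 3.0]])

scaled = StandardScaler().fit_transform(features)

# Same result by hand: subtract the column mean, divide by the population std (ddof=0)
manual = (features - features.mean(axis=0)) / features.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```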

In [26]:
#spelling error above at don't 
#https://scikit-learn.org/stable/modules/preprocessing.html
#https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
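#Train/test split should come before scaling; otherwise the scaler sees
#test-set statistics and leaks them into training. So split first.
#define the features X and the response y (scaling happens after the split)
X = df2
y = df.AdultWeekend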

## Split into training and testing datasets

**<font color='teal'> Using sklearn's `model_selection` module, import `train_test_split` and create a 75/25 split with y = `AdultWeekend`. We will start by using the adult weekend ticket price as our response variable for modeling.</font>**
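A small sketch of the split on placeholder arrays with the same row count as the resort data (330). The `X_demo` and `y_demo` names and their contents are made up. It shows the resulting sizes and that a fixed `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with 330 rows, matching the row count of the dataframe
X_demo = np.arange(330 * 2).reshape(330, 2)
y_demo = np.arange(330)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=3
)
print(X_tr.shape, X_te.shape)  # (247, 2) (83, 2): the test size is rounded up from 82.5

# Repeating the call with the same seed yields the identical split
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=3
)
print(np.array_equal(X_te, X_te2))
```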

In [49]:
#https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
#random_state seeds the shuffle so the same split is produced on every run
#https://stackoverflow.com/questions/49147774/what-is-random-state-in-sklearn-model-selection-train-test-split-example
#https://numpy.org/doc/1.18/reference/generated/numpy.ravel.html
#ravel flattens y into a 1-D array, the shape sklearn expects for a single response
#https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html
#possibly use a pipeline instead of train test split here
from sklearn.model_selection import train_test_split
y = y.to_numpy().ravel() #convert the Series to a flat 1-D NumPy array
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=3)
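#Fit the transformer on the training set only, then reuse it on the test set to prevent leakage
X_train_scaled = ct.fit_transform(X_train)
X_test_scaled = ct.transform(X_test) #transform only: refitting here would learn a different set of columns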
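Why the fit/transform distinction matters can be seen with a toy encoder. This is a sketch on made-up frames (`train` and `test` are hypothetical). An encoder fit on the training split keeps the same columns when applied to the test split, while an encoder refit on the test split learns a different set of columns.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"state": ["Alaska", "Arizona", "Utah", "Alaska"]})
test = pd.DataFrame({"state": ["Utah", "Vermont"]})  # Vermont never appears in train

enc = OneHotEncoder(handle_unknown="ignore")
train_enc = enc.fit_transform(train)  # learns the 3 training categories
test_enc = enc.transform(test)        # reuses them; Vermont becomes an all-zero row

print(train_enc.shape, test_enc.shape)  # same number of columns in both

# Refitting on the test split instead learns only the 2 categories found there
refit = OneHotEncoder(handle_unknown="ignore").fit_transform(test)
print(refit.shape)
```

A model trained on the first matrix could not be applied to the refit one, because the columns no longer line up.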

In [50]:
#For example, chain the column transformer and the regressor into a single pipeline
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
reg = LinearRegression()
pipe = make_pipeline(ct, reg)

In [51]:
pipe.fit(X,y)

Pipeline(memory=None,
         steps=[('columntransformer',
                 ColumnTransformer(n_jobs=None, remainder='passthrough',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('onehotencoder',
                                                  OneHotEncoder(categories='auto',
                                                                drop=None,
                                                                dtype=<class 'numpy.float64'>,
                                                                handle_unknown='ignore',
                                                                sparse=True),
                                                  ['state', 'clusters']),
                                                 ('standardscaler',
                                                  StandardScaler(copy=Tr...
                                            

In [52]:
cross_val_score(pipe, X, y, cv=5, scoring='r2')

array([-3.09006839e+24, -2.95685086e+23, -5.21963721e+23, -3.73543246e+19,
       -2.38376545e+23])
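When a pipeline is passed to `cross_val_score`, the transformer is refit inside every fold, so the held-out fold never influences the encoder or scaler. The sketch below shows this on synthetic data. The `demo` frame, the `region` and `acres` columns, and the target are all invented for illustration. Scores far below zero, like the ones above, mean the model predicts much worse than a constant mean. One plausible cause (an assumption, not verified here) is an unregularized linear fit on collinear one-hot columns combined with categories missing from some training folds.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data: one categorical and one numeric feature (hypothetical names)
rng = np.random.default_rng(0)
n = 100
demo = pd.DataFrame({
    "region": rng.choice(["north", "south", "west"], size=n),
    "acres": rng.normal(size=n),
})
target = 3.0 * demo["acres"].to_numpy() + rng.normal(scale=0.1, size=n)

demo_pipe = make_pipeline(
    make_column_transformer(
        (OneHotEncoder(handle_unknown="ignore"), ["region"]),
        (StandardScaler(), ["acres"]),
    ),
    LinearRegression(),
)

# Each of the 5 folds fits the transformer and the model on its own training part
scores = cross_val_score(demo_pipe, demo, target, cv=5, scoring="r2")
print(scores.shape)
print(scores)
```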

**<font color='teal'> To complete this step, upload this notebook to your GitHub repo and share the URL with your mentor. </font>**

In [32]:
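#compare the shapes of the transformed train/test matrices and the response
#Note: the column counts below differ (60 vs 50) because this output comes from a transformer
#that was fit separately on each split; a single fit on the training data gives matching widths
print(X_train_scaled.shape)
print('\n')
print(X_test_scaled.shape)
print('\n')
print(y.shape)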

(247, 60)


(83, 50)


(330,)
