### I. Data Preparation 

The goal of this notebook is to clean the dataset we'll use for data visualizations and for training the model.

Here's what to expect in this notebook:

- Importing libraries and data
- Fixing data types
- Finding the number of nulls
- Feature engineering: creating new columns, aggregating categories, and one-hot encoding

Finally, we save the modified data, first for data visualizations and then for building the predictive model.


### 1. Import Libraries

In [1]:
# import libraries
import pandas as pd
from sklearn import preprocessing
import sklearn.model_selection as ms
from sklearn import linear_model
import sklearn.metrics as sklm
import numpy as np
import numpy.random as nr
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as ss
import math

%matplotlib inline

### 2. Import datasets and merge

In [2]:
#import train values
train=pd.read_csv('train_values.csv')
train.shape

(3198, 34)

In [3]:
#import train labels 
labels=pd.read_csv('train_labels.csv', sep=',')
labels.shape

(3198, 2)

In [4]:
# Merge the two datasets column-wise (rows are aligned by index, so both files must be in the same row order)
df= pd.concat([train,labels],axis=1)
df.shape

(3198, 36)
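Because `pd.concat(..., axis=1)` aligns rows on the index, it only pairs values with the right labels when both files list the rows in the same order. A minimal sketch of a key-based alternative, using made-up toy frames (`values`, `labels_toy`) rather than the real CSV files:

```python
import pandas as pd

# Toy stand-ins for train_values.csv and train_labels.csv (values are made up).
values = pd.DataFrame({"row_id": [0, 1, 4], "yr": ["a", "a", "b"]})
# Labels deliberately listed in a different row order.
labels_toy = pd.DataFrame({
    "row_id": [4, 0, 1],
    "heart_disease_mortality_per_100k": [195, 312, 257],
})

# Join on the shared key; validate="one_to_one" raises an error
# if row_id is repeated on either side.
merged = values.merge(labels_toy, on="row_id", how="inner", validate="one_to_one")
print(merged)
```

With a key-based merge, `row_id` appears only once in the result, so no duplicate column has to be removed afterwards.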

### 3. Remove duplicate columns and rows (if any, by row_id)

In [5]:
# Remove duplicate columns (row_id appears in both files)
# Note: np.unique returns sorted names, so the columns also end up in alphabetical order
_, i = np.unique(df.columns, return_index=True)
df=df.iloc[:, i]
df.shape

(3198, 35)
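If the original column order should be preserved, one possible alternative is `Index.duplicated`, which flags repeated column names without sorting. A small sketch on a made-up one-row frame (`toy` is hypothetical, not the notebook's data):

```python
import pandas as pd

# Toy frame with a repeated column name, as produced by a column-wise concat.
toy = pd.DataFrame([[0, "a", 0, 312]], columns=["row_id", "yr", "row_id", "label"])

# Keep the first occurrence of each column name and leave the order untouched.
deduped = toy.loc[:, ~toy.columns.duplicated()]
print(list(deduped.columns))
```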

In [6]:
# Check for duplicate rows: compare the row count with the number of unique row_id values
print(df.shape)
print(df.row_id.unique().shape)

(3198, 35)
(3198,)


In [7]:
# Drop duplicate rows by row_id, keeping the first occurrence (none are expected here)
df.drop_duplicates(subset='row_id', keep='first', inplace=True)
print(df.shape)
print(df.row_id.unique().shape)

(3198, 35)
(3198,)


In [8]:
#view data
df.head(5)

Unnamed: 0,area__rucc,area__urban_influence,demo__birth_rate_per_1k,demo__death_rate_per_1k,demo__pct_adults_bachelors_or_higher,demo__pct_adults_less_than_a_high_school_diploma,demo__pct_adults_with_high_school_diploma,demo__pct_adults_with_some_college,demo__pct_aged_65_years_and_older,demo__pct_american_indian_or_alaskan_native,...,health__pct_adult_smoking,health__pct_diabetes,health__pct_excessive_drinking,health__pct_low_birthweight,health__pct_physical_inacticity,health__pop_per_dentist,health__pop_per_primary_care_physician,heart_disease_mortality_per_100k,row_id,yr
0,Metro - Counties in metro areas of fewer than ...,Small-in a metro area with fewer than 1 millio...,12.0,12.0,0.154382,0.194223,0.424303,0.227092,0.176,0.004,...,0.23,0.131,,0.089,0.332,1650.0,1489.0,312,0,a
1,Metro - Counties in metro areas of fewer than ...,Small-in a metro area with fewer than 1 millio...,19.0,7.0,0.259372,0.164134,0.234043,0.342452,0.101,0.008,...,0.19,0.09,0.181,0.082,0.265,2010.0,2480.0,257,1,a
2,Metro - Counties in metro areas of 1 million p...,Large-in a metro area with at least 1 million ...,12.0,6.0,0.417245,0.158573,0.237859,0.186323,0.115,0.013,...,0.156,0.084,0.195,0.098,0.209,629.0,690.0,195,4,b
3,"Nonmetro - Urban population of 2,500 to 19,999...",Noncore adjacent to a small metro with town of...,11.0,12.0,0.162675,0.181637,0.407186,0.248503,0.164,0.007,...,,0.104,,0.058,0.238,1810.0,6630.0,218,5,b
4,"Nonmetro - Urban population of 2,500 to 19,999...",Noncore not adjacent to a metro/micro area and...,14.0,12.0,0.157472,0.122367,0.41324,0.306921,0.171,0.003,...,0.234,0.137,0.194,0.07,0.29,3489.0,2590.0,355,6,a


### 3. Data Types and Nulls

In [9]:
#view data types and see if there's anything wrong
df.dtypes

area__rucc                                           object
area__urban_influence                                object
demo__birth_rate_per_1k                             float64
demo__death_rate_per_1k                             float64
demo__pct_adults_bachelors_or_higher                float64
demo__pct_adults_less_than_a_high_school_diploma    float64
demo__pct_adults_with_high_school_diploma           float64
demo__pct_adults_with_some_college                  float64
demo__pct_aged_65_years_and_older                   float64
demo__pct_american_indian_or_alaskan_native         float64
demo__pct_asian                                     float64
demo__pct_below_18_years_of_age                     float64
demo__pct_female                                    float64
demo__pct_hispanic                                  float64
demo__pct_non_hispanic_african_american             float64
demo__pct_non_hispanic_white                        float64
econ__economic_typology                 

All data types appear to be correct.

In [10]:
#see how many nulls we have on data
df.isna().sum()

area__rucc                                             0
area__urban_influence                                  0
demo__birth_rate_per_1k                                0
demo__death_rate_per_1k                                0
demo__pct_adults_bachelors_or_higher                   0
demo__pct_adults_less_than_a_high_school_diploma       0
demo__pct_adults_with_high_school_diploma              0
demo__pct_adults_with_some_college                     0
demo__pct_aged_65_years_and_older                      2
demo__pct_american_indian_or_alaskan_native            2
demo__pct_asian                                        2
demo__pct_below_18_years_of_age                        2
demo__pct_female                                       2
demo__pct_hispanic                                     2
demo__pct_non_hispanic_african_american                2
demo__pct_non_hispanic_white                           2
econ__economic_typology                                0
econ__pct_civilian_labor       

I will drop the columns that have too many blanks, since filling them with the mode or any other single value would require too much inference.
For the remaining columns with blanks, I will replace the blanks with the mode.
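One way to make "too many blanks" concrete is to compute the share of missing values per column and compare it with a cutoff. The sketch below uses made-up values and a hypothetical cutoff of 50% (`max_missing`); the notebook itself does not state which threshold was used.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values) with different amounts of missingness.
toy = pd.DataFrame({
    "health__homicides_per_100k": [np.nan, np.nan, np.nan, 4.1],
    "health__pct_adult_smoking": [0.23, 0.19, np.nan, 0.234],
    "health__pct_diabetes": [0.131, 0.09, 0.084, 0.104],
})

max_missing = 0.5  # hypothetical cutoff, an assumption for illustration

# isna().mean() gives the fraction of missing values in each column.
missing_share = toy.isna().mean()
to_drop = missing_share[missing_share > max_missing].index.tolist()
reduced = toy.drop(columns=to_drop)
print(missing_share)
print(to_drop)
```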

### 4. Feature Engineering

In [11]:
#drop columns with too many blanks
df = df.drop(['health__homicides_per_100k'],axis=1)


In [12]:
# Fill the remaining missing values in each column with that column's mode
for column in df:
    df[column] = df[column].fillna(df[column].mode()[0])
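For continuous columns the mode can be a fairly arbitrary value, so a type-aware fill is sometimes preferred. The following is only a sketch of that alternative on made-up data (median for numeric columns, mode for the rest); it is not what this notebook does.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values): one numeric and one categorical column.
toy = pd.DataFrame({
    "health__pct_adult_smoking": [0.10, 0.20, np.nan, 0.60],
    "yr": ["a", None, "a", "b"],
})

filled = toy.copy()
for col in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[col]):
        # Numeric column: fill with the median of the observed values.
        filled[col] = filled[col].fillna(filled[col].median())
    else:
        # Non-numeric column: fill with the most frequent value.
        filled[col] = filled[col].fillna(filled[col].mode()[0])
print(filled)
```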

In [13]:
#Verify we have no nulls
df.isna().sum()

area__rucc                                          0
area__urban_influence                               0
demo__birth_rate_per_1k                             0
demo__death_rate_per_1k                             0
demo__pct_adults_bachelors_or_higher                0
demo__pct_adults_less_than_a_high_school_diploma    0
demo__pct_adults_with_high_school_diploma           0
demo__pct_adults_with_some_college                  0
demo__pct_aged_65_years_and_older                   0
demo__pct_american_indian_or_alaskan_native         0
demo__pct_asian                                     0
demo__pct_below_18_years_of_age                     0
demo__pct_female                                    0
demo__pct_hispanic                                  0
demo__pct_non_hispanic_african_american             0
demo__pct_non_hispanic_white                        0
econ__economic_typology                             0
econ__pct_civilian_labor                            0
econ__pct_unemployment      

In [14]:
#export data set to main directory to be used in a different jupyter notebook
df.to_csv("dfvisualizations.csv", index=False)

#### 4.1 Create categorical columns

#### 4.1.1 Year

In [15]:
# Get dummy variables for year
df = pd.concat([df,pd.get_dummies(df['yr'], prefix='Year:')],axis=1)


In [16]:
# drop year column 
df = df.drop(['yr'],axis=1)

In [17]:
# Verify year column is no longer in dataset
df.columns

Index(['area__rucc', 'area__urban_influence', 'demo__birth_rate_per_1k',
       'demo__death_rate_per_1k', 'demo__pct_adults_bachelors_or_higher',
       'demo__pct_adults_less_than_a_high_school_diploma',
       'demo__pct_adults_with_high_school_diploma',
       'demo__pct_adults_with_some_college',
       'demo__pct_aged_65_years_and_older',
       'demo__pct_american_indian_or_alaskan_native', 'demo__pct_asian',
       'demo__pct_below_18_years_of_age', 'demo__pct_female',
       'demo__pct_hispanic', 'demo__pct_non_hispanic_african_american',
       'demo__pct_non_hispanic_white', 'econ__economic_typology',
       'econ__pct_civilian_labor', 'econ__pct_unemployment',
       'econ__pct_uninsured_adults', 'econ__pct_uninsured_children',
       'health__air_pollution_particulate_matter',
       'health__motor_vehicle_crash_deaths_per_100k',
       'health__pct_adult_obesity', 'health__pct_adult_smoking',
       'health__pct_diabetes', 'health__pct_excessive_drinking',
       'health_

#### 4.1.2 Area of urban influence

In [18]:
#View Categories
df['area__urban_influence'].value_counts()

Small-in a metro area with fewer than 1 million residents                                             692
Large-in a metro area with at least 1 million residents or more                                       436
Noncore adjacent to a small metro with town of at least 2,500 residents                               346
Micropolitan adjacent to a small metro area                                                           262
Micropolitan not adjacent to a metro area                                                             254
Noncore not adjacent to a metro/micro area and does not contain a town of at least 2,500 residents    210
Noncore adjacent to micro area and does not contain a town of at least 2,500 residents                210
Noncore adjacent to micro area and contains a town of 2,500-19,999 residents                          206
Noncore adjacent to a small metro and does not contain a town of at least 2,500 residents             176
Noncore adjacent to a large metro area        

In [19]:
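# Aggregate categories. Note: the labels are not spelled consistently ('small', 'Noncore' vs. 'NonCore'), so 'Noncore' forms its own category in the counts below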
area__urban_influence_cat = {'Small-in a metro area with fewer than 1 million residents':'small', 
                        'Large-in a metro area with at least 1 million residents or more':'Large',
                        'Micropolitan adjacent to a small metro area':'Micro',     
                        'Noncore adjacent to a small metro with town of at least 2,500 residents':'NonCore',
                        'Micropolitan not adjacent to a metro area':'Micro',                     
                        'Micropolitan adjacent to a large metro area':'Micro',
                        'Noncore adjacent to micro area and contains a town of 2,500-19,999 residents':'NonCore',                               
                        'Noncore not adjacent to a metro/micro area and contains a town of 2,500  or more residents':'NonCore',
                        'Noncore adjacent to a large metro area':'Noncore',
                        'Noncore adjacent to a small metro and does not contain a town of at least 2,500 residents':'NonCore',
                        'Noncore adjacent to micro area and does not contain a town of at least 2,500 residents':'NonCore',
                        'Noncore not adjacent to a metro/micro area and does not contain a town of at least 2,500 residents':'NonCore'}

df['area__urban_influence']=[area__urban_influence_cat[x] for x in df['area__urban_influence']]
df['area__urban_influence'].value_counts()

NonCore    1270
small       692
Micro       642
Large       436
Noncore     158
Name: area__urban_influence, dtype: int64
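The same aggregation can be written with `Series.map`, which also makes it easy to list any category that is missing from the dictionary. A minimal sketch on a made-up series with a shortened, hypothetical `mapping` that uses a single spelling for the non-core label:

```python
import pandas as pd

# Toy series: two real category names plus one made-up value
# that is intentionally absent from the mapping.
toy = pd.Series([
    "Micropolitan adjacent to a small metro area",
    "Noncore adjacent to a large metro area",
    "Some category missing from the mapping",
])

# Shortened mapping for illustration, with one consistent spelling per group.
mapping = {
    "Micropolitan adjacent to a small metro area": "Micro",
    "Noncore adjacent to a large metro area": "NonCore",
}

mapped = toy.map(mapping)                         # unmapped values become NaN
unmapped = toy[mapped.isna()].unique().tolist()   # original names with no mapping
print(mapped.tolist())
print(unmapped)
```

Unlike the list comprehension, `map` does not raise on an unknown key, so checking `unmapped` is what catches a missing or misspelled category.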

In [20]:
#Turn Area of Urban influence into dummies
df = pd.concat([df,pd.get_dummies(df['area__urban_influence'], prefix='Urban_influence:')],axis=1)

In [21]:
# View columns: 'area__urban_influence' is still present here and is dropped in a later cell
df.columns

Index(['area__rucc', 'area__urban_influence', 'demo__birth_rate_per_1k',
       'demo__death_rate_per_1k', 'demo__pct_adults_bachelors_or_higher',
       'demo__pct_adults_less_than_a_high_school_diploma',
       'demo__pct_adults_with_high_school_diploma',
       'demo__pct_adults_with_some_college',
       'demo__pct_aged_65_years_and_older',
       'demo__pct_american_indian_or_alaskan_native', 'demo__pct_asian',
       'demo__pct_below_18_years_of_age', 'demo__pct_female',
       'demo__pct_hispanic', 'demo__pct_non_hispanic_african_american',
       'demo__pct_non_hispanic_white', 'econ__economic_typology',
       'econ__pct_civilian_labor', 'econ__pct_unemployment',
       'econ__pct_uninsured_adults', 'econ__pct_uninsured_children',
       'health__air_pollution_particulate_matter',
       'health__motor_vehicle_crash_deaths_per_100k',
       'health__pct_adult_obesity', 'health__pct_adult_smoking',
       'health__pct_diabetes', 'health__pct_excessive_drinking',
       'health_

#### 4.1.3 Economic_typology

In [22]:
#Economic typology
df['econ__economic_typology'].value_counts()


Nonspecialized                        1266
Manufacturing-dependent                494
Farm-dependent                         482
Federal/State government-dependent     390
Recreation                             312
Mining-dependent                       254
Name: econ__economic_typology, dtype: int64

In [23]:
# Merge the mining- and farm-dependent categories into one; keep the other values as they are
econ__economic_typology_cat = {'Nonspecialized':'Nonspecialized', 
                             'Federal/State government-dependent':'Federal/State government-dependent',
                             'Manufacturing-dependent':'Manufacturing-dependent',
                             'Recreation':'Recreation',                     
                             'Mining-dependent':'Mining_farming',
                             'Farm-dependent':'Mining_farming',                               
                            }
df['econ__economic_typology']=[econ__economic_typology_cat[x] for x in df['econ__economic_typology']]
df['econ__economic_typology'].value_counts()

Nonspecialized                        1266
Mining_farming                         736
Manufacturing-dependent                494
Federal/State government-dependent     390
Recreation                             312
Name: econ__economic_typology, dtype: int64

In [24]:
# Now create dummies (the original column is dropped in a later cell)
df = pd.concat([df,pd.get_dummies(df['econ__economic_typology'], prefix='Economic_typo:')],axis=1)

In [25]:
# View columns: 'econ__economic_typology' is still present here and is dropped in a later cell
df.columns

Index(['area__rucc', 'area__urban_influence', 'demo__birth_rate_per_1k',
       'demo__death_rate_per_1k', 'demo__pct_adults_bachelors_or_higher',
       'demo__pct_adults_less_than_a_high_school_diploma',
       'demo__pct_adults_with_high_school_diploma',
       'demo__pct_adults_with_some_college',
       'demo__pct_aged_65_years_and_older',
       'demo__pct_american_indian_or_alaskan_native', 'demo__pct_asian',
       'demo__pct_below_18_years_of_age', 'demo__pct_female',
       'demo__pct_hispanic', 'demo__pct_non_hispanic_african_american',
       'demo__pct_non_hispanic_white', 'econ__economic_typology',
       'econ__pct_civilian_labor', 'econ__pct_unemployment',
       'econ__pct_uninsured_adults', 'econ__pct_uninsured_children',
       'health__air_pollution_particulate_matter',
       'health__motor_vehicle_crash_deaths_per_100k',
       'health__pct_adult_obesity', 'health__pct_adult_smoking',
       'health__pct_diabetes', 'health__pct_excessive_drinking',
       'health_

#### 4.1.4 Rural-urban continuum code (area__rucc)

In [26]:
df['area__rucc'].value_counts()

Nonmetro - Urban population of 2,500 to 19,999, adjacent to a metro area                         608
Nonmetro - Completely rural or less than 2,500 urban population, not adjacent to a metro area    484
Metro - Counties in metro areas of 1 million population or more                                  436
Nonmetro - Urban population of 2,500 to 19,999, not adjacent to a metro area                     418
Metro - Counties in metro areas of 250,000 to 1 million population                               370
Metro - Counties in metro areas of fewer than 250,000 population                                 322
Nonmetro - Completely rural or less than 2,500 urban population, adjacent to a metro area        238
Nonmetro - Urban population of 20,000 or more, adjacent to a metro area                          222
Nonmetro - Urban population of 20,000 or more, not adjacent to a metro area                      100
Name: area__rucc, dtype: int64

In [27]:
# Aggregate categories
area__rucc_cat = {'Nonmetro - Urban population of 2,500 to 19,999, adjacent to a metro area':'NonMetro', 
                'Nonmetro - Completely rural or less than 2,500 urban population, not adjacent to a metro area':'NonMetro',
                'Nonmetro - Urban population of 2,500 to 19,999, not adjacent to a metro area':'NonMetro',
                'Nonmetro - Completely rural or less than 2,500 urban population, adjacent to a metro area':'NonMetro',                     
                'Nonmetro - Urban population of 20,000 or more, adjacent to a metro area':'NonMetro',
                'Nonmetro - Urban population of 20,000 or more, not adjacent to a metro area':'NonMetro',                               
                'Metro - Counties in metro areas of 1 million population or more':'Metro',
                'Metro - Counties in metro areas of 250,000 to 1 million population':'Metro',
                'Metro - Counties in metro areas of fewer than 250,000 population':'Metro'}
df['area__rucc']=[area__rucc_cat[x] for x in df['area__rucc']]
df['area__rucc'].value_counts()

NonMetro    2070
Metro       1128
Name: area__rucc, dtype: int64

In [28]:
# Turn it into dummies (the original column is dropped in a later cell)
df = pd.concat([df,pd.get_dummies(df['area__rucc'], prefix='Area_rucc:')],axis=1)
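The dummy columns created in this section keep every level of each category, so within a group they always sum to 1 (for example, Metro and NonMetro). For linear models this is perfectly collinear with the intercept. A possible variant, sketched here on made-up data with a simplified prefix, is to pass `drop_first=True` so that one level per category serves as the baseline:

```python
import pandas as pd

# Toy column (made-up rows) with the two aggregated levels.
toy = pd.DataFrame({"area__rucc": ["Metro", "NonMetro", "NonMetro"]})

# drop_first=True drops the first level in sorted order ('Metro'),
# dtype=int returns 0/1 instead of booleans.
dummies = pd.get_dummies(toy["area__rucc"], prefix="Area_rucc",
                         drop_first=True, dtype=int)

# Replace the source column with its dummy in one step.
encoded = pd.concat([toy.drop(columns="area__rucc"), dummies], axis=1)
print(encoded)
```

Whether dropping a level is desirable depends on the model; tree-based models, for instance, are not affected by this kind of collinearity.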

#### 4.2 Check column 'demo__pct_aged_65_years_and_older'

In [29]:
df['demo__pct_aged_65_years_and_older'].value_counts()


0.158    47
0.176    41
0.139    40
0.171    38
0.163    38
0.160    37
0.168    36
0.164    36
0.178    35
0.172    35
0.146    35
0.174    34
0.156    34
0.170    34
0.167    33
0.186    33
0.153    33
0.150    33
0.144    32
0.173    32
0.187    32
0.152    32
0.169    31
0.161    31
0.155    31
0.179    30
0.193    30
0.140    30
0.180    30
0.190    30
         ..
0.321     1
0.328     1
0.077     1
0.066     1
0.284     1
0.315     1
0.336     1
0.048     1
0.259     1
0.339     1
0.065     1
0.346     1
0.069     1
0.307     1
0.072     1
0.045     1
0.311     1
0.057     1
0.047     1
0.056     1
0.271     1
0.312     1
0.330     1
0.305     1
0.334     1
0.087     1
0.067     1
0.287     1
0.071     1
0.283     1
Name: demo__pct_aged_65_years_and_older, Length: 249, dtype: int64

In [30]:
# Group counties by the share of residents aged 65 and older: above 20% is 'old', otherwise 'young'
def demo__pct_aged_65_years_and_older_xform(al):
    if al > 0.2: return 'old'
    else: return 'young'

df["young_old_pop"] = df['demo__pct_aged_65_years_and_older'].map(demo__pct_aged_65_years_and_older_xform)


df['young_old_pop'].value_counts()

young    2535
old       663
Name: young_old_pop, dtype: int64
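The same threshold rule can be expressed without a helper function by using `np.where`, which evaluates the condition on the whole column at once. A short sketch on made-up values, using the 0.2 cutoff from the cell above:

```python
import numpy as np
import pandas as pd

# Toy shares of residents aged 65 and older (made-up values).
toy = pd.DataFrame({"demo__pct_aged_65_years_and_older": [0.176, 0.101, 0.2, 0.339]})

share_65_plus = toy["demo__pct_aged_65_years_and_older"]
# Strictly greater than 0.2 is 'old'; exactly 0.2 stays 'young', as in the function above.
toy["young_old_pop"] = np.where(share_65_plus > 0.2, "old", "young")
print(toy)
```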

In [31]:
df = pd.concat([df,pd.get_dummies(df['young_old_pop'], prefix='Age_Group:')],axis=1)

In [32]:
# View columns
df.dtypes

area__rucc                                            object
area__urban_influence                                 object
demo__birth_rate_per_1k                              float64
demo__death_rate_per_1k                              float64
demo__pct_adults_bachelors_or_higher                 float64
demo__pct_adults_less_than_a_high_school_diploma     float64
demo__pct_adults_with_high_school_diploma            float64
demo__pct_adults_with_some_college                   float64
demo__pct_aged_65_years_and_older                    float64
demo__pct_american_indian_or_alaskan_native          float64
demo__pct_asian                                      float64
demo__pct_below_18_years_of_age                      float64
demo__pct_female                                     float64
demo__pct_hispanic                                   float64
demo__pct_non_hispanic_african_american              float64
demo__pct_non_hispanic_white                         float64
econ__economic_typology 

In [33]:
# Drop the identifier and the source columns that were used to create the dummies (avoids collinearity)
df = df.drop(columns=['row_id', 'young_old_pop', 'demo__pct_aged_65_years_and_older',
                      'econ__economic_typology', 'area__rucc', 'area__urban_influence'])


In [34]:
df.dtypes

demo__birth_rate_per_1k                              float64
demo__death_rate_per_1k                              float64
demo__pct_adults_bachelors_or_higher                 float64
demo__pct_adults_less_than_a_high_school_diploma     float64
demo__pct_adults_with_high_school_diploma            float64
demo__pct_adults_with_some_college                   float64
demo__pct_american_indian_or_alaskan_native          float64
demo__pct_asian                                      float64
demo__pct_below_18_years_of_age                      float64
demo__pct_female                                     float64
demo__pct_hispanic                                   float64
demo__pct_non_hispanic_african_american              float64
demo__pct_non_hispanic_white                         float64
econ__pct_civilian_labor                             float64
econ__pct_unemployment                               float64
econ__pct_uninsured_adults                           float64
econ__pct_uninsured_chil

In [35]:
# Move the label column (heart_disease_mortality_per_100k) to the end of the dataset
cols = [col for col in df if col != 'heart_disease_mortality_per_100k']+['heart_disease_mortality_per_100k']
df = df[cols]
df.head()

Unnamed: 0,demo__birth_rate_per_1k,demo__death_rate_per_1k,demo__pct_adults_bachelors_or_higher,demo__pct_adults_less_than_a_high_school_diploma,demo__pct_adults_with_high_school_diploma,demo__pct_adults_with_some_college,demo__pct_american_indian_or_alaskan_native,demo__pct_asian,demo__pct_below_18_years_of_age,demo__pct_female,...,Economic_typo:_Federal/State government-dependent,Economic_typo:_Manufacturing-dependent,Economic_typo:_Mining_farming,Economic_typo:_Nonspecialized,Economic_typo:_Recreation,Area_rucc:_Metro,Area_rucc:_NonMetro,Age_Group:_old,Age_Group:_young,heart_disease_mortality_per_100k
0,12.0,12.0,0.154382,0.194223,0.424303,0.227092,0.004,0.011,0.235,0.516,...,0,1,0,0,0,1,0,0,1,312
1,19.0,7.0,0.259372,0.164134,0.234043,0.342452,0.008,0.015,0.272,0.503,...,0,0,1,0,0,1,0,0,1,257
2,12.0,6.0,0.417245,0.158573,0.237859,0.186323,0.013,0.085,0.179,0.522,...,0,0,0,1,0,1,0,0,1,195
3,11.0,12.0,0.162675,0.181637,0.407186,0.248503,0.007,0.001,0.2,0.525,...,0,0,0,1,0,0,1,0,1,218
4,14.0,12.0,0.157472,0.122367,0.41324,0.306921,0.003,0.0,0.237,0.511,...,0,0,0,1,0,0,1,0,1,355


The dataset is now ready for visualizations and further data wrangling.

### 5. Scale data

In [36]:
# Finally, to use this data for modeling, we need to scale it (all columns except the label)
df.columns

Index(['demo__birth_rate_per_1k', 'demo__death_rate_per_1k',
       'demo__pct_adults_bachelors_or_higher',
       'demo__pct_adults_less_than_a_high_school_diploma',
       'demo__pct_adults_with_high_school_diploma',
       'demo__pct_adults_with_some_college',
       'demo__pct_american_indian_or_alaskan_native', 'demo__pct_asian',
       'demo__pct_below_18_years_of_age', 'demo__pct_female',
       'demo__pct_hispanic', 'demo__pct_non_hispanic_african_american',
       'demo__pct_non_hispanic_white', 'econ__pct_civilian_labor',
       'econ__pct_unemployment', 'econ__pct_uninsured_adults',
       'econ__pct_uninsured_children',
       'health__air_pollution_particulate_matter',
       'health__motor_vehicle_crash_deaths_per_100k',
       'health__pct_adult_obesity', 'health__pct_adult_smoking',
       'health__pct_diabetes', 'health__pct_excessive_drinking',
       'health__pct_low_birthweight', 'health__pct_physical_inacticity',
       'health__pop_per_dentist', 'health__pop_per_p

In [37]:
quant_features = ['demo__birth_rate_per_1k',
       'demo__death_rate_per_1k', 'demo__pct_adults_bachelors_or_higher',
       'demo__pct_adults_less_than_a_high_school_diploma',
       'demo__pct_adults_with_high_school_diploma',
       'demo__pct_adults_with_some_college',
       'demo__pct_american_indian_or_alaskan_native', 'demo__pct_asian',
       'demo__pct_below_18_years_of_age', 'demo__pct_female',
       'demo__pct_hispanic', 'demo__pct_non_hispanic_african_american',
       'demo__pct_non_hispanic_white',
       'econ__pct_civilian_labor', 'econ__pct_unemployment',
       'econ__pct_uninsured_adults', 'econ__pct_uninsured_children',
       'health__air_pollution_particulate_matter',
       'health__motor_vehicle_crash_deaths_per_100k',
       'health__pct_adult_obesity', 'health__pct_adult_smoking',
       'health__pct_diabetes', 'health__pct_excessive_drinking',
       'health__pct_low_birthweight', 'health__pct_physical_inacticity',
       'health__pop_per_dentist', 'health__pop_per_primary_care_physician',
       'Year:_a', 'Year:_b', 'Urban_influence:_Large',
       'Urban_influence:_Micro', 'Urban_influence:_NonCore',
       'Urban_influence:_Noncore', 'Urban_influence:_small',
       'Economic_typo:_Federal/State government-dependent',
       'Economic_typo:_Manufacturing-dependent',
       'Economic_typo:_Mining_farming', 'Economic_typo:_Nonspecialized',
       'Economic_typo:_Recreation', 'Area_rucc:_Metro', 'Area_rucc:_NonMetro',
       'Age_Group:_old', 'Age_Group:_young']

# Standardize every feature (the dummy columns too) and store each mean and standard deviation
# in a dictionary so we can convert back later
scaled_features = {}
for each in quant_features:
    mean, std = df[each].mean(), df[each].std()
    scaled_features[each] = [mean, std]
    df[each] = (df[each] - mean) / std
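Converting back uses the stored mean and standard deviation: the original value is `scaled * std + mean`. The sketch below repeats the scaling loop on a made-up single-column frame and then inverts it; `toy`, `original` and `restored` are illustrative names only.

```python
import pandas as pd

# Toy feature column (made-up values).
toy = pd.DataFrame({"demo__birth_rate_per_1k": [12.0, 19.0, 12.0, 11.0, 14.0]})
original = toy.copy()

# Forward step: standardize and remember each column's mean and std.
scaled_features = {}
for each in toy.columns:
    mean, std = toy[each].mean(), toy[each].std()
    scaled_features[each] = [mean, std]
    toy[each] = (toy[each] - mean) / std

# Inverse step: undo the scaling with the stored statistics.
restored = toy.copy()
for each, (mean, std) in scaled_features.items():
    restored[each] = restored[each] * std + mean

print(restored)
```

The same dictionary can be reused to scale new data with the training statistics before making predictions.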

### 6. Save data for modeling

In [38]:
# Export the scaled dataset for use in the modeling notebook
df.to_csv("dfformodeling.csv", index=False)
