## Mod 1 Project Submission

**John Kline**  
**FT Data Science Online**  
James / Rafael  
_Scheduled review - Monday, Oct. 28 11AM PST_  
Blog Post URL TBD  

***

Exploratory Data Analysis Approach Summary:
1. **Import & Review Data File(s)**
    - Check columns with .describe, .info  
    - Review column data types  
    - Identify null values  
<br>
2. **Review Data Content & Plan Analysis**  
    - Check histograms - continuous, binary, categorical data  
    - Evaluate correlations between different independent variables  
    - Review data for outliers  
    - Identify dependent variable, potential independent variables  
<br>
3. **Narrow Down Variables, Transform Variables**  
    - Kill off obvious unusable variables  
    - Evaluate normality of target variables  
    - Decide on transformations / normalizations / standardizations  
<br>
4. **Run Core Analyses**  
    - Create core multivariate analysis  
    - Create grouped multivariate analyses  
    - Evaluate statistical validity (p-values, skewness, kurtosis)  
    - Evaluate applicability (size of the effect, direction, does it make sense)  
<br>
5. **Run Checks & Create Visuals**  
    - Check alternate models, check alternate explanations of effects
    - Create easy to consume graphs / maps of data
    

<div class="alert alert-block alert-info">

## 1. Import & Review Data File(s)
    - Check columns with .describe, .info  
    - Review column data types  
    - Identify null values  

<div class="alert alert-block alert-info">

The dataset includes several important home features and one obvious dependent variable: price. The most obvious line of analysis is to investigate the relationship between home features (e.g. bedrooms, living space sqft) and house price. The data was collected over a relatively narrow period (2014-2015), so only limited macro price growth in the market should be visible in the data. One of the more interesting sources of dispersion is likely the effect of geography on price.

|**Category**| Description | _Initial commentary_|
|---:|:---|:---|
|**id** | unique identifier for a house |_definite keeper - unique id for joining tables_|
|**date** | Date house was sold |_narrow date range - may be hard to establish a pattern_|
| **price** | Price is prediction target |_obvious dependent variable_|
| **bedrooms** | Number of Bedrooms/House |_important - classic one-line description of house size_|
| **bathrooms** | Number of bathrooms/bedrooms| _probably just tracks w/ bedrooms_|
| **sqft_living** | square footage of the home| _important - you can have 2 4-br homes of wildly diff size_|
| **sqft_lot** | square footage of the lot |_important probably_|
| **floors** | Total floors (levels) in house | _unclear_|
| **waterfront** | House which has a view to a waterfront |_probably a positive predictor_|
| **view** | Has been viewed  | _unclear_|
| **condition** | How good the condition is ( Overall )| _hopefully a good indicator of quality apart from size_|
| **grade** | overall grade given to the housing unit, based on King County grading system | _unclear_|
| **sqft_above** | square footage of house apart from basement |_unclear value vs sqft_living_|
| **sqft_basement** | square footage of the basement |_unclear_|
| **yr_built** | Built Year |_unclear_|
| **yr_renovated** | Year when house was renovated |_probably a positive predictor_|
| **zipcode** | zip |_useful categorical_|
| **lat** | Latitude coordinate |_binned? maybe_|
| **long** | Longitude coordinate |_binned? maybe_|
| **sqft_living15** | The square footage of interior housing living space for the nearest 15 neighbors| _unclear_|
| **sqft_lot15** | The square footage of the land lots of the nearest 15 neighbors| _unclear_|

In [9]:
# Importing key libraries
import pandas as pd
import matplotlib.pyplot as plt
import pylab as pl
%matplotlib notebook
import seaborn as sns
import numpy as np
from sklearn import preprocessing 
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Importing data file
df = pd.read_csv('kc_house_data.csv')
df.describe()

,id,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
count,21597.0,21597.0,21597.0,21597.0,21597.0,21597.0,21597.0,19221.0,21534.0,21597.0,21597.0,21597.0,21597.0,17755.0,21597.0,21597.0,21597.0,21597.0,21597.0
mean,4580474000.0,540296.6,3.3732,2.115826,2080.32185,15099.41,1.494096,0.007596,0.233863,3.409825,7.657915,1788.596842,1970.999676,83.636778,98077.951845,47.560093,-122.213982,1986.620318,12758.283512
std,2876736000.0,367368.1,0.926299,0.768984,918.106125,41412.64,0.539683,0.086825,0.765686,0.650546,1.1732,827.759761,29.375234,399.946414,53.513072,0.138552,0.140724,685.230472,27274.44195
min,1000102.0,78000.0,1.0,0.5,370.0,520.0,1.0,0.0,0.0,1.0,3.0,370.0,1900.0,0.0,98001.0,47.1559,-122.519,399.0,651.0
25%,2123049000.0,322000.0,3.0,1.75,1430.0,5040.0,1.0,0.0,0.0,3.0,7.0,1190.0,1951.0,0.0,98033.0,47.4711,-122.328,1490.0,5100.0
50%,3904930000.0,450000.0,3.0,2.25,1910.0,7618.0,1.5,0.0,0.0,3.0,7.0,1560.0,1975.0,0.0,98065.0,47.5718,-122.231,1840.0,7620.0
75%,7308900000.0,645000.0,4.0,2.5,2550.0,10685.0,2.0,0.0,0.0,4.0,8.0,2210.0,1997.0,0.0,98118.0,47.678,-122.125,2360.0,10083.0
max,9900000000.0,7700000.0,33.0,8.0,13540.0,1651359.0,3.5,1.0,4.0,5.0,13.0,9410.0,2015.0,2015.0,98199.0,47.7776,-121.315,6210.0,871200.0


**Cleaning data: working list of tasks**:
- (done) waterfront has 2,376 NaN values
- (done) yr_renovated has 3,842 NaN values
- (done) view has 63 NaN values
- (done) sqft_basement is stored as an object (string) column because of '?' placeholder values
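
The NaN counts in this list follow from comparing each column's non-null count with the total row count. A minimal sketch of that check on a small made-up frame (the toy values below are illustrative, not rows from `kc_house_data.csv`):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are made up)
toy = pd.DataFrame({
    "waterfront": [0.0, np.nan, 1.0, np.nan],
    "view": [0.0, 2.0, np.nan, 0.0],
    "sqft_basement": ["0.0", "?", "400.0", "0.0"],
})

# Count missing values per column
null_counts = toy.isna().sum()
print(null_counts)

# Non-numeric columns often hide a text placeholder such as '?'
non_numeric = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]
print(non_numeric)
```

Note that a placeholder like `'?'` is not counted as NaN, which is why `sqft_basement` needs its own check.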

In [10]:
# Cleaning Section

# view - NaNs convert to 0, the majority have not been viewed (>75%) - changing to binary - viewed or not
df['view']=df['view'].fillna(0)
df['was_viewed'] = df['view'] > 0
df['was_viewed'] = df['was_viewed'].astype(int)


# waterfront - we don't know if it's a waterfront property and the vast majority are not (<1% are waterfront)
df['waterfront']=df['waterfront'].fillna(0)

# sqft_basement- taking out NaNs - the ? values tended to be low-priced houses, similar to the 0 basement value houses
# !!! Can only run once - can't run twice without reimporting the csv file
df['sqft_basement'] = pd.to_numeric(df['sqft_basement'].replace({'?': 0}))

# Renovation status - taking out NaNs (converts to 0) and changing the column to 'recently_renovated', a binary
df['yr_renovated']=df['yr_renovated'].fillna(0)
df['recently_renovated'] = df['yr_renovated'] >= 1980
df['recently_renovated'] = df['recently_renovated'].astype(int)

# yrs_old will be a newly constructed variable
df['yrs_old']= 2019 - df['yr_built']

# Joneses_living - test variable construction
# sqft_living15 / sqft_lot15 are highly correlated with the house's own values - so seeing if being at a premium to neighbors is predictive (spoiler: it's not)
df['joneses_living'] = np.log(df['sqft_living'])-np.log(df['sqft_living15'])
df['joneses_lot'] = np.log(df['sqft_lot'])-np.log(df['sqft_lot15'])

#Dropping a single extreme 33 bedroom observation (~30 sd's away from the mean)
df = df[df.bedrooms != 33]
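
The "can only run once" caveat on the `sqft_basement` line could be sidestepped by coercing instead of replacing. This is only a sketch, assuming it is acceptable to treat every non-numeric entry (not just `'?'`) as 0; `clean_basement` is a hypothetical helper name, not part of the notebook:

```python
import pandas as pd

def clean_basement(series):
    """Convert to numeric, turning '?' (or any other non-numeric text) into 0."""
    return pd.to_numeric(series, errors="coerce").fillna(0)

raw = pd.Series(["0.0", "?", "400.0"])  # made-up values
once = clean_basement(raw)
twice = clean_basement(once)  # re-running on already-clean data is harmless

print(once.tolist())
print(twice.equals(once))
```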

<div class="alert alert-block alert-info">

## 2. Review Data Content & Plan Analysis  
    - Check histograms - continuous, binary, categorical data  
    - Evaluate correlations between different independent variables  
    - Review data for outliers  
    - Identify dependent variable, potential independent variables  

<div class="alert alert-block alert-info">

In [11]:
# Dropped all variables we aren't using (e.g. lat/long) or that are highly correlated (e.g. sqft_above, bathrooms, grade)
# 'date' is a string column, so it is dropped too (newer pandas versions raise an error in .corr() on non-numeric columns)
df_deps = df.drop(['id','date','price','lat','long','yr_renovated', 'yr_built','sqft_lot15','sqft_living15','sqft_above','bathrooms','grade','joneses_lot','view'], axis = 1)

# Check for correlation among candidate independent variables
corr = df_deps.corr()
pl.figure(figsize=(10,10))
sns.heatmap(data=np.abs(corr), cmap=sns.color_palette('Blues'), annot=True, linewidths = .5)

_[Figure: heatmap of absolute pairwise correlations among the candidate independent variables in df_deps]_

**All correlations between candidate independent variables are <0.6 in magnitude**
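
The same conclusion can be checked without reading the heatmap by listing any pair whose absolute correlation exceeds the cutoff. A minimal sketch on synthetic data; `high_corr_pairs` is a hypothetical helper and the 0.6 threshold simply mirrors the statement above:

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame, threshold=0.6):
    """Return (col_a, col_b, |r|) for each pair whose absolute correlation exceeds threshold."""
    corr = frame.corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], round(float(corr.iloc[i, j]), 3)))
    return pairs

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),  # nearly a copy of "a"
    "c": rng.normal(size=200),                 # independent noise
})

pairs = high_corr_pairs(toy)
print(pairs)
```

An empty list for `df_deps` would confirm that no candidate pair crosses the threshold.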

In [12]:
# Check for approximate normality of variables
df_deps.hist(figsize=(10,10))

_[Figure: grid of histograms, one per variable in df_deps]_

**Several variables look skewed (e.g. sqft values and price), so we can try a log transformation in the next cleanup section**
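
To put a number on "looks skewed", one option is to compare sample skewness before and after the log transformation. A small sketch using synthetic log-normal draws as a stand-in for a long-tailed variable such as price (the distribution parameters are arbitrary assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Log-normal draws mimic a long right tail like price or square footage
toy = pd.Series(rng.lognormal(mean=13.0, sigma=0.5, size=5000))

skew_raw = toy.skew()          # strongly positive for a right-tailed variable
skew_log = np.log(toy).skew()  # close to 0 once the tail is compressed

print(f"skew before log: {skew_raw:.2f}, after log: {skew_log:.2f}")
```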

<div class="alert alert-block alert-info">

## 3. Narrow Down Variables, Transform Variables 
    - Kill off obvious unusable variables  
    - Evaluate normality of target variables  
    - Decide on transformations / normalizations / standardizations  

<div class="alert alert-block alert-info">

|**Category**| Description | _Detail & Transformations_|
|---:|:---:|:---|
|**id** | dropped | |
|**date** | dropped  | |
| **price** |  **kept** |_dependent variable - log transformed_|
| **bedrooms** |  **kept** |  |
| **bathrooms** |  dropped |   |
| **sqft_living** |  **kept**| _log transformed_ |
| **sqft_lot** |  **kept** | _log transformed_ |
| **floors** |  **kept** |  |
| **waterfront** | **kept** | _binary_ |
| **view** | dropped  |  |
| **was_viewed** | **constructed** | _binary - 1 if view is between 1 and 4, otherwise 0_ |
| **condition** | **kept**| |
| **grade** | dropped | |
| **sqft_above** | dropped | |
| **sqft_basement** | **kept** | |
| **yr_built** | dropped | |
| **yrs_old** | dropped | originally included, but the condition number of the regression was 1.66e+03, indicating potential collinearity |
| **yr_renovated** | dropped | |
| **recently_renovated** | **constructed** |_binary - 1 if renovated in or after 1980, otherwise 0_|
| **zipcode** | **kept** |_categorical_|
| **lat** | dropped | |
| **long** | dropped | |
| **sqft_living15** | dropped | |
| **sqft_lot15** | dropped|  |
| **joneses_living** | **constructed** | _difference between log(sqft_living) and log(sqft_living15), or living space relative to closest 15 neighbors_ |


**Goal**: create a new DataFrame with just the variables we will use in the regressions, with zipcode as a column for future grouping

In [13]:
# Pulling in variables and transformations

#Importing several data fields as-is (converting zipcode to string from numeric)
df_reg = df[['recently_renovated','waterfront','zipcode','was_viewed','bedrooms','condition','floors']].copy()
df_reg['zipcode'] = df_reg['zipcode'].astype('str')
# 'yrs_old',

# sqft_living has a long tail for extreme high values - using log to normalize
df_reg['sqft_living'] = np.log(df['sqft_living'])

# sqft_lot has a long tail for extreme high values - using log to normalize
df_reg['sqft_lot'] = np.log(df['sqft_lot'])

# joneses_living pulling in as-is, it is already log-transformed
df_reg['joneses_living'] = df['joneses_living']

# Originally, price was the dependent variable, but using the log of price as the outcome fixed skewness / kurtosis / heteroskedasticity issues
df_reg['price'] = np.log(df['price'])

df_reg.hist()
#np.exp(df_reg['price']).describe()


_[Figure: grid of histograms, one per variable in df_reg after transformation]_

<div class="alert alert-block alert-info">

## 4. Run Core Analyses
    - Create core multivariate analysis  
    - Create grouped multivariate analyses  
    - Evaluate statistical validity (p-values, skewness, kurtosis)  
    - Evaluate applicability (size of the effect, direction, does it make sense)  

<div class="alert alert-block alert-info">

In [14]:
# Function that regresses the outcome on every other column in the data, ignoring a single group column (zipcode)
def regress_grouped(data, outcome_val, group_col):
    outcome = outcome_val
    predictors = data.drop([outcome,group_col], axis=1)
    pred_sum = "+".join(predictors.columns)
    formula = outcome + "~" + pred_sum
    model = ols(formula= formula, data=data).fit()
    return model

regress_all = regress_grouped(df_reg, 'price', 'zipcode')
regress_all.summary()



Dep. Variable:,price,R-squared:,0.547
Model:,OLS,Adj. R-squared:,0.547
Method:,Least Squares,F-statistic:,2899.0
Date:,"Sun, 27 Oct 2019",Prob (F-statistic):,0.0
Time:,19:35:07,Log-Likelihood:,-8235.6
No. Observations:,21596,AIC:,16490.0
Df Residuals:,21586,BIC:,16570.0
Df Model:,9,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,5.4500,0.065,83.834,0.000,5.323,5.577
recently_renovated,0.2464,0.014,17.226,0.000,0.218,0.274
waterfront,0.5020,0.031,16.457,0.000,0.442,0.562
was_viewed,0.2258,0.009,25.775,0.000,0.209,0.243
bedrooms,-0.0558,0.004,-15.850,0.000,-0.063,-0.049
condition,0.0910,0.004,23.430,0.000,0.083,0.099
floors,0.0585,0.005,10.699,0.000,0.048,0.069
sqft_living,1.0600,0.010,103.369,0.000,1.040,1.080
sqft_lot,-0.0716,0.003,-22.571,0.000,-0.078,-0.065

Omnibus:,38.856,Durbin-Watson:,1.973
Prob(Omnibus):,0.0,Jarque-Bera (JB):,35.197
Skew:,0.061,Prob(JB):,2.28e-08
Kurtosis:,2.844,Cond. No.,351.0
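
The `regress_grouped` helper defined above builds its formula by joining every column except the outcome and the group column. The sketch below isolates just that string-building step so the resulting formula is visible; `build_formula` is a hypothetical name and the column list is a shortened stand-in for `df_reg`:

```python
import pandas as pd

def build_formula(data, outcome, group_col):
    """Mirror the string-building step: outcome ~ every other column except group_col."""
    predictors = data.drop([outcome, group_col], axis=1)
    return outcome + "~" + "+".join(predictors.columns)

# Empty frame with a few of the regression columns; only the names matter here
toy = pd.DataFrame(columns=["waterfront", "zipcode", "bedrooms", "sqft_living", "price"])

formula = build_formula(toy, "price", "zipcode")
print(formula)  # price~waterfront+bedrooms+sqft_living
```

One consequence of this design is that any column added to the DataFrame automatically becomes a predictor, so the frame passed in should contain only the intended variables.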


### OLS Regression Results Discussion:  
- All variables exhibit very low p-values (<0.001) and the overall adjusted R-squared is high (54.7%), indicating that the independent variables together explain a high proportion of the variance in log price and that each is individually associated with it
- Skew (0.061) is close to 0 and kurtosis (2.844) is close to 3, the values expected for normally distributed residuals, indicating limited unusual residual patterning
- The value and direction of the coefficients largely follow intuition
    - **sqft_living** has the highest positive coefficient of 1.06, which follows as house size is generally a key factor in house price
    - **waterfront** surprisingly has a high coefficient of 0.5, but only a very small number of houses (<1%) have this, so it likely is a significant but not highly explanatory factor
    - **recently_renovated** and **was_viewed** both have sizeable positive coefficients of 0.25 and 0.23 respectively - these make sense as more modern and more in-demand houses likely command higher prices
    - **floors** and **condition** both had low positive coefficients, so are of relatively lower importance
- Several coefficients, however, cut against intuition
    - **joneses_living** has a significant negative coefficient of -0.39, compared to a positive sqft_living coefficient of 1.06.  My interpretation is that for a given house size, e.g. 2,000 sqft, if all of your neighbors' houses are half the size, then you live in a neighborhood of small houses.  This property of the neighborhood may correlate with income, which may in turn correlate with the desirability of the neighborhood, its schools, or its prestige, any of which might negatively impact price.  Compare this with a 2,000 sqft house in a neighborhood of 5,000 sqft houses, which might represent a small house in a nice, wealthy, prestigious neighborhood that has a halo effect on its price. We can evaluate this by looking within zip codes further on to see if the effect holds on a smaller scale. I anticipate that given small enough geographic bins, this effect would reverse or disappear.
    - **sqft_lot** also has a slightly negative coefficient (-0.07). I attribute this to houses with large lots being in more distant suburban / rural areas, further from main centers of employment.  We can check this by analyzing houses within specific zip codes further on
    - **bedrooms** is similarly small but negative (-0.06). My interpretation is that for a given house size, more bedrooms make for a more cramped space, i.e. if shoppers are looking for a certain number of bedrooms, the ratio of bedrooms to space is important, and a low ratio is better.  **see below for an evaluation**
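
Because the outcome is log(price), the coefficients above are not dollar amounts. For a 0/1 predictor, a coefficient b corresponds to a price change of roughly exp(b) - 1; for a log-transformed predictor, b acts as an elasticity. A short worked sketch using two coefficients from the regression table (the helper names are hypothetical):

```python
import math

def pct_effect_of_dummy(coef):
    """Percent change in price when a 0/1 predictor switches on, in a log(price) model."""
    return (math.exp(coef) - 1) * 100

def pct_effect_of_log_predictor(coef, pct_increase):
    """Percent change in price when a log-transformed predictor rises by pct_increase percent."""
    return ((1 + pct_increase / 100) ** coef - 1) * 100

# Coefficients taken from the regression table above
waterfront_pct = pct_effect_of_dummy(0.5020)
sqft_living_pct = pct_effect_of_log_predictor(1.0600, 10)

print(round(waterfront_pct, 1))   # price premium for waterfront, in percent
print(round(sqft_living_pct, 1))  # price change for a 10% larger house, in percent
```

Read this way, the sqft_living coefficient of about 1 says price scales roughly in proportion to living space, holding the other variables fixed.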


In [15]:
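# Testing bedroom hypothesis
# Note: sqft_living in df_reg is already log-transformed, so bedroom_ratio is log(sqft_living) per bedroom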
df_reg_bedroom = df_reg.copy()
df_reg_bedroom['bedroom_ratio'] = df_reg_bedroom['sqft_living']/df_reg_bedroom['bedrooms']
df_reg_bedroom.drop('bedrooms', axis = 1, inplace = True)

regress_bedratio = regress_grouped(df_reg_bedroom, 'price', 'zipcode')
#regress_bedratio.summary()

Evaluating my hypothesis that a high ratio of living space to bedrooms is desirable: the newly constructed 'bedroom_ratio' variable is positively associated with price in the otherwise identical regression

In [16]:
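# Prefix zip codes with 'zip' so the dummy column names are valid formula terms
df_reg_zipdummy = df_reg.copy()
df_reg_zipdummy['zipcode'] = 'zip'+df_reg_zipdummy['zipcode']
# One 0/1 column per zip code; the first zip code is dropped and becomes the baseline
zipdummies = pd.get_dummies(df_reg_zipdummy.zipcode, drop_first = True, dtype = int)
for column in zipdummies:
    df_reg_zipdummy[column]=zipdummies[column]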

regress_zipdummies = regress_grouped(df_reg_zipdummy, 'price', 'zipcode')
regress_zipdummies.summary()

Dep. Variable:,price,R-squared:,0.862
Model:,OLS,Adj. R-squared:,0.861
Method:,Least Squares,F-statistic:,1722.0
Date:,"Sun, 27 Oct 2019",Prob (F-statistic):,0.0
Time:,19:35:08,Log-Likelihood:,4588.0
No. Observations:,21596,AIC:,-9018.0
Df Residuals:,21517,BIC:,-8388.0
Df Model:,78,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,5.7559,0.044,129.715,0.000,5.669,5.843
recently_renovated,0.0859,0.008,10.748,0.000,0.070,0.102
waterfront,0.5871,0.017,34.307,0.000,0.554,0.621
was_viewed,0.1532,0.005,29.949,0.000,0.143,0.163
bedrooms,-0.0273,0.002,-13.793,0.000,-0.031,-0.023
condition,0.0396,0.002,17.850,0.000,0.035,0.044
floors,0.0655,0.003,20.624,0.000,0.059,0.072
sqft_living,0.8002,0.006,124.731,0.000,0.788,0.813
sqft_lot,0.0651,0.002,30.685,0.000,0.061,0.069

Omnibus:,1203.046,Durbin-Watson:,1.996
Prob(Omnibus):,0.0,Jarque-Bera (JB):,4527.45
Skew:,-0.149,Prob(JB):,0.0
Kurtosis:,5.223,Cond. No.,843.0


### OLS Regression With Zipcode Dummies Results Discussion:

The core independent variables continue to exhibit very low p-values (<0.001) and the overall adjusted r-squared is significantly higher (86.1%) with the inclusion of zipcode dummy variables than without (54.7%), indicating that the zipcode variables are explaining a high proportion of the variance in housing prices beyond the individual home factors.

Several of the core independent variable coefficients change with the addition of zip code dummy variables, indicating that these variables were previously capturing latent characteristics of zip codes rather than the actual incremental contribution of the variable to overall housing price:
- Generally, coefficients either slightly attenuated or slightly increased. My interpretation of the coefficients that attenuated (e.g. recently_renovated, which fell from 0.25 to 0.09) is that they were previously acting as indicators of the uncaptured quality of the neighborhood (i.e. demand factors from buyers with higher general willingness to pay).  
- By contrast, **waterfront** increased between the two regressions (from 0.50 to 0.59; **floors** also rose marginally, from 0.06 to 0.07). It is an important indicator of house price that was being slightly obscured before we could separate out the relative impact of zip code.
- One coefficient flipped direction: **sqft_lot**, from slightly negative at -0.07 to slightly positive at 0.07. My interpretation is that high sqft_lot values were previously capturing properties further from city centers, and that distance from shopping / businesses was decreasing value faster than large lot size was adding value.  By pulling the level effect of zip codes out of the analysis, we are able to isolate the positive effect of **sqft_lot**.
- Almost all of the zip code coefficients have very low p-values (<0.001), except for a few that exhibit both very low coefficient values and high p-values. Because we dropped the first zip code from the list, that zip code's level is already accounted for in the intercept. The zip codes with close-to-zero coefficients and high p-values are therefore likely statistically indistinguishable from the dropped zip code.  We should see them change to non-zero values with low p-values if we instead drop a zip code with a clearly different price level.
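
The dependence of dummy coefficients on the dropped category can be seen on a tiny made-up example. The sketch below uses plain least squares from NumPy rather than the notebook's `ols` call, and the zip labels and prices are invented for illustration:

```python
import numpy as np
import pandas as pd

def fit_with_reference(df, reference):
    """OLS of y on zip dummies, omitting `reference` as the baseline category."""
    dummies = pd.get_dummies(df["zip"], dtype=float).drop(columns=[reference])
    X = np.column_stack([np.ones(len(df)), dummies.to_numpy()])
    beta, *_ = np.linalg.lstsq(X, df["y"].to_numpy(), rcond=None)
    return pd.Series(beta, index=["Intercept"] + list(dummies.columns))

# Made-up data: zips A and B share the same average level, C is higher
toy = pd.DataFrame({
    "zip": ["A", "A", "B", "B", "C", "C"],
    "y":   [10.0, 12.0, 11.0, 11.0, 15.0, 17.0],
})

fit_ref_a = fit_with_reference(toy, "A")  # B is measured relative to A
fit_ref_c = fit_with_reference(toy, "C")  # B is measured relative to C

print(fit_ref_a)
print(fit_ref_c)
```

With A as the baseline, B's coefficient is essentially zero because the two zips share a level; with C as the baseline, the same zip B gets a clearly non-zero coefficient.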

<div class="alert alert-block alert-info">

## 5. Run Checks & Create Visuals   
    - Check alternate models, check alternate explanations of effects
    - Create easy to consume graphs / maps of data

<div class="alert alert-block alert-info">

In [42]:

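# Skip the intercept and the 9 core predictors; the remaining parameters are the zip code dummies
zip_params = regress_zipdummies.params.iloc[10:]
zip_intercept = regress_zipdummies.params.iloc[0]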
print(pd.DataFrame(zip_params))


                 0
zip98002  0.003495
zip98003  0.039541
zip98004  1.185802
zip98005  0.781812
zip98006  0.705491
zip98007  0.693628
zip98008  0.683117
zip98010  0.224986
zip98011  0.447850
zip98014  0.282065
zip98019  0.289747
zip98022  0.046536
zip98023 -0.005207
zip98024  0.422620
zip98027  0.528549
zip98028  0.413408
zip98029  0.658448
zip98030  0.054283
zip98031  0.075861
zip98032 -0.002873
zip98033  0.825519
zip98034  0.567524
zip98038  0.159741
zip98039  1.408528
zip98040  0.937948
zip98042  0.071442
zip98045  0.329939
zip98052  0.666946
zip98053  0.609461
zip98055  0.155085
...            ...
zip98092  0.030923
zip98102  1.069892
zip98103  0.890357
zip98105  1.009231
zip98106  0.407676
zip98107  0.919387
zip98108  0.393802
zip98109  1.071312
zip98112  1.129294
zip98115  0.860230
zip98116  0.822283
zip98117  0.875782
zip98118  0.495685
zip98119  1.058357
zip98122  0.890880
zip98125  0.591446
zip98126  0.616058
zip98133  0.498785
zip98136  0.729628
zip98144  0.726127
zip98146  0.

In [None]:
# Folium Map Creation for Zips Function
# table = main data frame
# zips = column name containing zips
# mapped_feature = column of values for zip heat map
# add_text = commentary for the map (used as the legend name)
# key_on = key in the geojson file that holds the zip code (still TBD - depends on Zip_Codes.json)
# https://towardsdatascience.com/visualizing-data-at-the-zip-code-level-with-folium-d07ac983db20
import json
import folium

def create_zip_heatmap(table, zips, mapped_feature, add_text, key_on):
    # read in geo file
    with open("Zip_Codes.json", "r") as f:
        kc_geo = json.load(f)
    # initiating a Folium map with the KC lat / long
    m = folium.Map(location = [47.4365, -122.1463], zoom_start = 11)
    # creating a choropleth map
    folium.Choropleth(
        geo_data = kc_geo,
        fill_opacity = 0.7,
        line_opacity = 0.2,
        data = table,
        # refers to key in the geojson file
        key_on = key_on,
        # first element contains location information, second contains feature of interest
        columns = [zips, mapped_feature],
        fill_color = 'RdYlGn',
        legend_name = add_text
    ).add_to(m)
    folium.LayerControl().add_to(m)
    m.save(outfile = mapped_feature + '_map.html')
    return m