<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 15px; height: 80px">

# Project 3

## San Francisco Data

---

[San Francisco provides a wealth of data on the city to the public.](https://data.sfgov.org/) 

Project 3 is all about exploring this data and modeling interesting relationships with regression.


---

## Notes on the data

We have gone through the above website and pulled out a variety of different datasets that we think are particularly interesting. Some of the datasets are from external sources as well, but all are related to San Francisco. A high level overview of data folders is provided after the project requirements section.

**Feel free to include any other datasets from the San Francisco data if you think they are relevant or could be useful for your analysis.**


**The uncompressed data is large.** Even the compressed data is fairly large. The data is compressed in the .7z format, which typically produces small archives. You will likely need a third-party app to extract it.

### Recommended Utilities for .7z
- For OSX, [Keka](http://www.kekaosx.com/en/) or [The Unarchiver](http://wakaba.c3.cx/s/apps/unarchiver.html).
- For Windows, [7-zip](http://www.7-zip.org/) is the standard.
- For Linux, try the `p7zip` utility: `sudo apt-get install p7zip`.

---

## Project requirements

**You will be performing 4 different sections of analysis on the San Francisco data.**

**Models must be regression models. This means that your target variable needs to be numeric/continuous.**

Do not perform classification models – this will be the topic of week 4.


<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 1. Use the San Francisco assessor dataset and perform EDA

---

1. Explain what the data is. This may include multiple CSV files. Some of this data uses hard-to-understand codes to represent the variables. Nearly all of the data is pulled from https://data.sfgov.org/, so that site is a very good resource for determining what the data is.
- Clean the data.
- Develop and clearly state a hypothesis about the data that you would want to test. (This is totally up to you.)
- Create some initial visualizations on the portions of the data relevant to your hypothesis.

In [1]:
# data modules
import numpy as np
import scipy.stats as stats
import pandas as pd

# plotting modules
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

# Stats/Regression packages
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

# make sure charts appear in the notebook:
%matplotlib inline
%config InlineBackend.figure_format ='retina'

path='/Users/joaquincunanan/desktop/dsi-sf-7-materials/datasets/proj_4/Historic_Secured_Property_Tax_Rolls.csv'
assessor=pd.read_csv(path)



In [2]:
print(assessor.shape)
assessor.head()

(1817896, 42)


Unnamed: 0,Closed Roll Fiscal Year,Property Location,Neighborhood Code,Neighborhood Code Definition,Block and Lot Number,Volume Number,Property Class Code,Property Class Code Definition,Year Property Built,Number of Bathrooms,...,Closed Roll Misc Exemption Value,Closed Roll Homeowner Exemption Value,Current Sales Date,Closed Roll Assessed Fixtures Value,Closed Roll Assessed Improvement Value,Closed Roll Assessed Land Value,Closed Roll Assessed Personal Prop Value,Supervisor District,Neighborhoods - Analysis Boundaries,Location
0,2012.0,0000 0188 MINNA ST0024C,09B,,3722273,25,Z,Z,2005.0,2.0,...,0.0,0.0,07/31/1990,0.0,760483.0,1140725.0,0.0,6.0,Financial District/South Beach,"(37.7862913318072, -122.401375181471)"
1,2014.0,0000 1006 COLE ST0000,05E,Parnassus/Ashbury Heights,1278032,9,Z,Z,1907.0,3.0,...,0.0,0.0,04/15/2011,0.0,346562.0,519843.0,0.0,5.0,Haight Ashbury,"(37.7646938184545, -122.449439257453)"
2,2007.0,0000 0000VWEBSTER ST0000,06C,Lower Pacific Heights,685050,5,V,V,1900.0,0.0,...,0.0,0.0,,0.0,17572.0,182948.0,0.0,5.0,Japantown,"(37.7860078381928, -122.430650176965)"
3,2014.0,0000 0601 VAN NESS AV0044,08F,Van Ness/ Civic Center,762044,6,Z,Z,1982.0,3.0,...,0.0,7000.0,01/31/2010,0.0,295002.0,295002.0,0.0,5.0,Western Addition,"(37.7813857775995, -122.421406328014)"
4,2008.0,0000 1221 HARRISON ST0014,09F,South of Market,3757127,25,LZ,LZ,2004.0,0.0,...,0.0,0.0,05/28/1940,0.0,249383.0,424483.0,0.0,6.0,South of Market,"(37.7731031100387, -122.4086736996)"


In [3]:
assessor.columns

Index([u'Closed Roll Fiscal Year', u'Property Location', u'Neighborhood Code',
       u'Neighborhood Code Definition', u'Block and Lot Number',
       u'Volume Number', u'Property Class Code',
       u'Property Class Code Definition', u'Year Property Built',
       u'Number of Bathrooms', u'Number of Bedrooms', u'Number of Rooms',
       u'Number of Stories', u'Number of Units',
       u'Characteristics Change Date', u'Zoning Code', u'Construction Type',
       u'Lot Depth', u'Lot Frontage', u'Property Area in Square Feet',
       u'Basement Area', u'Lot Area', u'Lot Code', u'Prior Sales Date',
       u'Recordation Date', u'Document Number', u'Document Number 2',
       u'Tax Rate Area Code', u'Percent of Ownership',
       u'Closed Roll Exemption Type Code',
       u'Closed Roll Exemption Type Code Definition',
       u'Closed Roll Status Code', u'Closed Roll Misc Exemption Value',
       u'Closed Roll Homeowner Exemption Value', u'Current Sales Date',
       u'Closed Roll Assessed Fi

In [4]:
assessor.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1817896 entries, 0 to 1817895
Data columns (total 42 columns):
Closed Roll Fiscal Year                       float64
Property Location                             object
Neighborhood Code                             object
Neighborhood Code Definition                  object
Block and Lot Number                          object
Volume Number                                 int64
Property Class Code                           object
Property Class Code Definition                object
Year Property Built                           float64
Number of Bathrooms                           float64
Number of Bedrooms                            int64
Number of Rooms                               int64
Number of Stories                             int64
Number of Units                               int64
Characteristics Change Date                   float64
Zoning Code                                   object
Construction Type                       

In [6]:
# assessor2=assessor[['Current Sales Date','Year Property Built','Closed Roll Assessed Land Value']]
assessor2=assessor[['Number of Bathrooms','Number of Bedrooms','Number of Rooms','Number of Units']] 
assessor2.head()

Unnamed: 0,Number of Bathrooms,Number of Bedrooms,Number of Rooms,Number of Units
0,2.0,2,5,0
1,3.0,3,6,1
2,0.0,0,0,0
3,3.0,3,5,1
4,0.0,0,0,0


In [7]:
# reassign instead of using inplace=True on a slice, which triggers SettingWithCopyWarning
assessor2 = assessor2.dropna()
assessor2.head()


Unnamed: 0,Number of Bathrooms,Number of Bedrooms,Number of Rooms,Number of Units
0,2.0,2,5,0
1,3.0,3,6,1
2,0.0,0,0,0
3,3.0,3,5,1
4,0.0,0,0,0


In [8]:
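# NOTE: 'Closed Roll Assessed Land Value' appears twice in this sum, so land value
# is counted double in full_value (the outputs below reflect that).
assessor['full_value'] = (assessor['Closed Roll Assessed Fixtures Value']
                          + assessor['Closed Roll Assessed Improvement Value']
                          + assessor['Closed Roll Assessed Land Value']
                          + assessor['Closed Roll Assessed Land Value']
                          + assessor['Closed Roll Assessed Personal Prop Value'])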
assessor['value_per_sqft']=assessor.full_value/assessor['Property Area in Square Feet']
# assessor.dropna(inplace=True)
assessor.head()

Unnamed: 0,Closed Roll Fiscal Year,Property Location,Neighborhood Code,Neighborhood Code Definition,Block and Lot Number,Volume Number,Property Class Code,Property Class Code Definition,Year Property Built,Number of Bathrooms,...,Current Sales Date,Closed Roll Assessed Fixtures Value,Closed Roll Assessed Improvement Value,Closed Roll Assessed Land Value,Closed Roll Assessed Personal Prop Value,Supervisor District,Neighborhoods - Analysis Boundaries,Location,full_value,value_per_sqft
0,2012.0,0000 0188 MINNA ST0024C,09B,,3722273,25,Z,Z,2005.0,2.0,...,07/31/1990,0.0,760483.0,1140725.0,0.0,6.0,Financial District/South Beach,"(37.7862913318072, -122.401375181471)",3041933.0,1821.516766
1,2014.0,0000 1006 COLE ST0000,05E,Parnassus/Ashbury Heights,1278032,9,Z,Z,1907.0,3.0,...,04/15/2011,0.0,346562.0,519843.0,0.0,5.0,Haight Ashbury,"(37.7646938184545, -122.449439257453)",1386248.0,956.033103
2,2007.0,0000 0000VWEBSTER ST0000,06C,Lower Pacific Heights,685050,5,V,V,1900.0,0.0,...,,0.0,17572.0,182948.0,0.0,5.0,Japantown,"(37.7860078381928, -122.430650176965)",383468.0,inf
3,2014.0,0000 0601 VAN NESS AV0044,08F,Van Ness/ Civic Center,762044,6,Z,Z,1982.0,3.0,...,01/31/2010,0.0,295002.0,295002.0,0.0,5.0,Western Addition,"(37.7813857775995, -122.421406328014)",885006.0,853.429122
4,2008.0,0000 1221 HARRISON ST0014,09F,South of Market,3757127,25,LZ,LZ,2004.0,0.0,...,05/28/1940,0.0,249383.0,424483.0,0.0,6.0,South of Market,"(37.7731031100387, -122.4086736996)",1098349.0,926.876793


In [9]:
assessor3=assessor.copy()
assessor4=assessor3[assessor.value_per_sqft!=np.inf]
#assessor.groupby(['Neighborhoods - Analysis Boundaries'])['value_per_sqft'].mean()

In [10]:
assessor4.head()

Unnamed: 0,Closed Roll Fiscal Year,Property Location,Neighborhood Code,Neighborhood Code Definition,Block and Lot Number,Volume Number,Property Class Code,Property Class Code Definition,Year Property Built,Number of Bathrooms,...,Current Sales Date,Closed Roll Assessed Fixtures Value,Closed Roll Assessed Improvement Value,Closed Roll Assessed Land Value,Closed Roll Assessed Personal Prop Value,Supervisor District,Neighborhoods - Analysis Boundaries,Location,full_value,value_per_sqft
0,2012.0,0000 0188 MINNA ST0024C,09B,,3722273,25,Z,Z,2005.0,2.0,...,07/31/1990,0.0,760483.0,1140725.0,0.0,6.0,Financial District/South Beach,"(37.7862913318072, -122.401375181471)",3041933.0,1821.516766
1,2014.0,0000 1006 COLE ST0000,05E,Parnassus/Ashbury Heights,1278032,9,Z,Z,1907.0,3.0,...,04/15/2011,0.0,346562.0,519843.0,0.0,5.0,Haight Ashbury,"(37.7646938184545, -122.449439257453)",1386248.0,956.033103
3,2014.0,0000 0601 VAN NESS AV0044,08F,Van Ness/ Civic Center,762044,6,Z,Z,1982.0,3.0,...,01/31/2010,0.0,295002.0,295002.0,0.0,5.0,Western Addition,"(37.7813857775995, -122.421406328014)",885006.0,853.429122
4,2008.0,0000 1221 HARRISON ST0014,09F,South of Market,3757127,25,LZ,LZ,2004.0,0.0,...,05/28/1940,0.0,249383.0,424483.0,0.0,6.0,South of Market,"(37.7731031100387, -122.4086736996)",1098349.0,926.876793
5,2012.0,0000 0517 11TH AV0000,01B,Inner Richmond,1554002,11,D,D,1913.0,0.0,...,,0.0,42819.0,35093.0,0.0,1.0,Inner Richmond,"(37.7784931441868, -122.469764099819)",113005.0,57.951282


In [11]:
haight=assessor4[assessor4['Neighborhoods - Analysis Boundaries']=='Haight Ashbury']
fin_dist=assessor4[assessor4['Neighborhoods - Analysis Boundaries']=='Financial District/South Beach']
print(haight.value_per_sqft.mean(),fin_dist.value_per_sqft.mean())
print(stats.ttest_ind(haight.value_per_sqft.dropna(), fin_dist.value_per_sqft.dropna(),equal_var = False))
# Welch's t-test gives a p-value of about 1.1e-05, so I reject the null hypothesis of equal means:
# value per square foot is lower in the Haight than in the Financial District.

(498.8776755113393, 1482.628182406018)
Ttest_indResult(statistic=-4.3880742952007949, pvalue=1.1455128605855667e-05)


<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 2. Construct and evaluate a linear regression model on the data

---

1. State the variables that are predictors in your linear regression and the target variable.
- Investigate and remove any outliers or other problems in your data. _This is a subjective process._
- Construct a linear regression model.
- Evaluate the model. How does the $R^2$ of the overall model compare to cross-validated $R^2$. What do the differences in $R^2$ mean?
  - Use test / train split
  - Use K-Folds
  - Compare and explain your results with both
- Visualize the evaluation metrics of your analysis in clear charts.
- Summarize your results in the context of your hypothesis. Frame this as if you are presenting to non-technical readers.


In [12]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression, RidgeCV, LassoCV, ElasticNetCV
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation is deprecated
X=assessor2
# X['Current Sales Date']=pd.to_datetime(assessor2['Current Sales Date'])
X.head()



Unnamed: 0,Number of Bathrooms,Number of Bedrooms,Number of Rooms,Number of Units
0,2.0,2,5,0
1,3.0,3,6,1
2,0.0,0,0,0
3,3.0,3,5,1
4,0.0,0,0,0


In [14]:
assessor2.head()

Unnamed: 0,Number of Bathrooms,Number of Bedrooms,Number of Rooms,Number of Units
0,2.0,2,5,0
1,3.0,3,6,1
2,0.0,0,0,0
3,3.0,3,5,1
4,0.0,0,0,0


In [44]:
X.isnull().sum(axis=0)

Number of Bathrooms    0
Number of Bedrooms     0
Number of Rooms        0
Number of Units        0
dtype: int64

In [39]:
# NaN never compares equal to anything, so `X == np.nan` cannot find missing values.
# Select rows that contain a null with isnull() instead.
X[X.isnull().any(axis=1)]

In [46]:
# define the target before filling missing values in it
y = assessor['Closed Roll Assessed Land Value']
X = X.fillna(value=0)
y = y.fillna(value=0)

# standardized copy of the predictors (the fit below uses the unscaled X)
ss = StandardScaler()
Xn = ss.fit_transform(X.values)

lr = LinearRegression()
lr.fit(X.values, y.values)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [47]:

from sklearn.model_selection import StratifiedKFold

def accuracy_crossvalidator(X, y, knn):
    '''Cross-validates a model with StratifiedKFold.
    Input: X - design matrix, array
           y - target, array
           knn - model object with fit() and score()
    Output: score for each fold, list
            mean score, float'''

    scores = []
    # NOTE: StratifiedKFold stratifies on class labels; with a continuous
    # target such as y here, KFold is the appropriate splitter.
    cv_indices = StratifiedKFold(n_splits=5)
    for train_i, test_i in cv_indices.split(X,y):

        X_train = X[train_i, :]
        X_test = X[test_i, :]

        y_train = y[train_i]
        y_test = y[test_i]

        knn.fit(X_train, y_train)

        acc = knn.score(X_test, y_test)
        scores.append(acc)


    return scores, np.mean(scores)

# mean_knn_n5 = KNeighborsClassifier(n_neighbors=5,
#                                    weights='uniform')

accs, mean_acc = accuracy_crossvalidator(X.values, y.values, lr)
accs, mean_acc



KeyboardInterrupt: 

In [None]:
# Cross-validation did not finish, so I interrupted the cell.
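
For a continuous target, a plain `KFold` splitter is the usual choice, since `StratifiedKFold` is designed for class labels. Below is a minimal sketch on synthetic data. The arrays `X_demo` and `y_demo` and the helper `r2_crossvalidator` are hypothetical stand-ins for illustration, not the assessor data or code from this project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

def r2_crossvalidator(X, y, model, n_splits=5):
    """Return per-fold R^2 scores and their mean using plain K-fold splits."""
    scores = []
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_i, test_i in folds.split(X):
        model.fit(X[train_i], y[train_i])
        scores.append(model.score(X[test_i], y[test_i]))  # R^2 on the held-out fold
    return scores, np.mean(scores)

# Synthetic stand-in data (assumption): 4 predictors, linear signal plus noise.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(500, 4))
y_demo = X_demo.dot(np.array([3.0, -2.0, 0.5, 0.0])) + rng.normal(scale=0.5, size=500)

fold_scores, mean_score = r2_crossvalidator(X_demo, y_demo, LinearRegression())
print(fold_scores, mean_score)
```

Comparing `mean_score` with the `R^2` from a single fit on all rows shows how much of the in-sample fit survives on held-out data.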

In [50]:
lr.score(X.values,y.values)
# R^2 of assessed land value regressed on number of bathrooms, bedrooms, rooms, and units.
# Unfortunately, the fit is poor: these predictors explain only about 4% of the variance.

0.041231174650317781

<img src="http://imgur.com/xDpSobf.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 2.2 Explain $R^2$ vs. mean squared error (MSE)

---

1. If you have negative $R^2$ values in cross-validation, what does this mean? 
2. Why can $R^2$ only be negative when the model is tested on new data?
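
As a rough illustration of these two questions, the sketch below fits ordinary least squares to pure noise and scores it on fresh noise. All data here is synthetic and the variable names are illustrative assumptions. With an intercept, the training $R^2$ cannot fall below 0, because the model can always fall back to predicting the training mean. On new data there is no such guarantee, and an overfit model can do worse than the mean of the new data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.RandomState(1)
# Pure-noise data (assumption): the predictors carry no information about y.
X_train = rng.normal(size=(10, 5))
y_train = rng.normal(size=10)
X_test = rng.normal(size=(200, 5))
y_test = rng.normal(size=200)

model = LinearRegression().fit(X_train, y_train)
r2_train = r2_score(y_train, model.predict(X_train))  # in-sample R^2
r2_test = r2_score(y_test, model.predict(X_test))     # R^2 on unseen data
print(r2_train, r2_test)
```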

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 3. Combine the crime and fire incident datasets from the San Francisco data. Build a linear regression model to predict the number of fire incidents. What are the most significant predictors?

### Evaluate the model with regularized regression.

---

**I recommend having many predictors to see benefits from regularization methods, but it's up to you.**


- Like in part 1, you should state a hypothesis and perform data cleaning and EDA _only_ on the relevant portions of your data. Don't waste time!
- Construct and evaluate different models with cross-validated $R^2$. Compare LinearRegression, Lasso, Ridge, and ElasticNet. 
- Report on which model is best after performing regularization, and why that might be the case (hint: does your data have multicollinearity? Irrelevant variables? Both?)
- Plot visuals that compare the performance of the four models.


In [103]:
# data modules
import numpy as np
import scipy.stats as stats
import pandas as pd

# plotting modules
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

# Stats/Regression packages
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

# make sure charts appear in the notebook:
%matplotlib inline
%config InlineBackend.figure_format ='retina'

path='/Users/joaquincunanan/desktop/dsi-sf-7-materials/datasets/proj_4/Fire_Incidents_-_Current_Year__2016_.csv'
fire=pd.read_csv(path)

path='/Users/joaquincunanan/desktop/dsi-sf-7-materials/datasets/proj_4/Police_Department_Incidents_-_Previous_Year__2016_.csv'
police=pd.read_csv(path)



In [104]:
fire.head()

Unnamed: 0,Incident Number,Exposure Number,Address,Incident Date,Call Number,Alarm DtTm,Arrival DtTm,Close DtTm,City,Zipcode,...,Detector Effectiveness,Detector Failure Reason,Automatic Extinguishing System Present,Automatic Extinguishing Sytem Type,Automatic Extinguishing Sytem Perfomance,Automatic Extinguishing Sytem Failure Reason,Number of Sprinkler Heads Operating,Supervisor District,Neighborhood District,Location
0,16000003,0,Precita Av/florida Street,01/01/2016,160010015,01/01/2016 12:02:57 AM,01/01/2016 12:08:05 AM,01/01/2016 12:12:51 AM,San Francisco,94110.0,...,,,,,,,,9.0,Bernal Heights,"(37.7475540000296, -122.409572)"
1,16000004,0,1620 Eucalyptus Drive,01/01/2016,160010018,01/01/2016 12:03:02 AM,01/01/2016 12:09:32 AM,01/01/2016 12:15:04 AM,San Francisco,94132.0,...,,,,,,,,7.0,Sunset/Parkside,"(37.7310980000296, -122.488151)"
2,16000023,0,171 2nd Street,01/01/2016,160010157,01/01/2016 12:35:02 AM,01/01/2016 12:40:17 AM,01/01/2016 12:53:24 AM,San Francisco,94105.0,...,,,,,,,,6.0,Financial District/South Beach,"(37.7871460000297, -122.398598)"
3,16000034,0,535 Wisconsin Street,01/01/2016,160010210,01/01/2016 12:45:36 AM,01/01/2016 12:50:00 AM,01/01/2016 01:00:47 AM,San Francisco,94107.0,...,,,,,,,,10.0,Potrero Hill,"(37.7606670000296, -122.399175)"
4,16000051,0,El Camino Del Mar/seal Rock Drive,01/01/2016,160010302,01/01/2016 01:01:59 AM,01/01/2016 01:12:01 AM,01/01/2016 01:20:16 AM,San Francisco,94121.0,...,,,,,,,,1.0,Outer Richmond,"(37.7805136379969, -122.510171681643)"


In [105]:
police.head()

Unnamed: 0,IncidntNum,Category,Descript,DayOfWeek,Date,Time,PdDistrict,Resolution,Address,X,Y,Location,PdId
0,120058272,WEAPON LAWS,POSS OF PROHIBITED WEAPON,Friday,01/29/2016 12:00:00 AM,11:00,SOUTHERN,"ARREST, BOOKED",800 Block of BRYANT ST,-122.403405,37.775421,"(37.775420706711, -122.403404791479)",12005827212120
1,120058272,WEAPON LAWS,"FIREARM, LOADED, IN VEHICLE, POSSESSION OR USE",Friday,01/29/2016 12:00:00 AM,11:00,SOUTHERN,"ARREST, BOOKED",800 Block of BRYANT ST,-122.403405,37.775421,"(37.775420706711, -122.403404791479)",12005827212168
2,141059263,WARRANTS,WARRANT ARREST,Monday,04/25/2016 12:00:00 AM,14:59,BAYVIEW,"ARREST, BOOKED",KEITH ST / SHAFTER AV,-122.388856,37.729981,"(37.7299809672996, -122.388856204292)",14105926363010
3,160002740,NON-CRIMINAL,LOST PROPERTY,Friday,01/01/2016 12:00:00 AM,00:30,MISSION,NONE,16TH ST / MISSION ST,-122.419672,37.76505,"(37.7650501214668, -122.419671780296)",16000274071000
4,160002869,ASSAULT,BATTERY,Friday,01/01/2016 12:00:00 AM,21:35,NORTHERN,NONE,1700 Block of BUSH ST,-122.426077,37.788019,"(37.788018555829, -122.426077177375)",16000286904134


In [107]:
fire.columns
# There is no shared key column for a direct join: addresses are recorded in different formats,
# and police district (PdDistrict) names do not map 1:1 to the fire data's neighborhood districts.

Index([u'Incident Number', u'Exposure Number', u'Address', u'Incident Date',
       u'Call Number', u'Alarm DtTm', u'Arrival DtTm', u'Close DtTm', u'City',
       u'Zipcode', u'Battalion', u'Station Area', u'Box', u'Suppression Units',
       u'Suppression Personnel', u'EMS Units', u'EMS Personnel',
       u'Other Units', u'Other Personnel', u'First Unit On Scene',
       u'Estimated Property Loss', u'Estimated Contents Loss',
       u'Fire Fatalities', u'Fire Injuries', u'Civilian Fatalities',
       u'Civilian Injuries', u'Number of Alarms', u'Primary Situation',
       u'Mutual Aid', u'Action Taken Primary', u'Action Taken Secondary',
       u'Action Taken Other', u'Detector Alerted Occupants', u'Property Use',
       u'Area of Fire Origin', u'Ignition Cause', u'Ignition Factor Primary',
       u'Ignition Factor Secondary', u'Heat Source', u'Item First Ignited',
       u'Human Factors Associated with Ignition', u'Structure Type',
       u'Structure Status', u'Floor of Fire Origin', 
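
One possible workaround, offered as an assumption rather than the approach taken in this notebook: both tables carry a date (`Incident Date` in the fire data, `Date` in the police data), so incidents could be aggregated to daily counts and joined on the calendar day. The sketch below uses tiny hand-made frames with those column names. The rows are made up, and `fire_demo`, `police_demo`, and `daily` are hypothetical names.

```python
import pandas as pd

# Tiny made-up frames that mimic the relevant columns of the two datasets.
fire_demo = pd.DataFrame({
    'Incident Number': [1, 2, 3],
    'Incident Date': ['01/01/2016', '01/01/2016', '01/02/2016'],
})
police_demo = pd.DataFrame({
    'Category': ['ASSAULT', 'WARRANTS', 'ASSAULT', 'ASSAULT'],
    'Date': ['01/01/2016 12:00:00 AM', '01/01/2016 12:00:00 AM',
             '01/02/2016 12:00:00 AM', '01/02/2016 12:00:00 AM'],
})

# Target: number of fire incidents per day.
fire_demo['day'] = pd.to_datetime(fire_demo['Incident Date'], format='%m/%d/%Y')
fires_per_day = fire_demo.groupby('day').size().rename('n_fires')

# Predictors: number of police incidents per day, one column per category.
police_demo['day'] = pd.to_datetime(police_demo['Date'], format='%m/%d/%Y %I:%M:%S %p')
crimes_per_day = pd.crosstab(police_demo['day'], police_demo['Category'])

# Join on the shared calendar day.
daily = crimes_per_day.join(fires_per_day, how='inner')
print(daily)
```

A daily table like this gives one row per day, with crime-category counts as candidate predictors and `n_fires` as the regression target.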

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 4. Conduct another analysis using the San Francisco Parks data to predict Park scores

---

1. Combining multiple sources of park data (csv files) is required.
- Perform EDA and cleaning on relevant data.
- Construct and compare different regression models with cross-validation.
- Plot descriptive visuals you think are useful for understanding the data.
- Report on your findings.


In [45]:
# data modules
import numpy as np
import scipy.stats as stats
import pandas as pd

# plotting modules
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

# Stats/Regression packages
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

# make sure charts appear in the notebook:
%matplotlib inline
%config InlineBackend.figure_format ='retina'

path='/Users/joaquincunanan/desktop/dsi-sf-7-materials/datasets/proj_4/Recreation___Park_Department_Park_Info_Dataset.csv'
parks_data=pd.read_csv(path)

path='/Users/joaquincunanan/desktop/dsi-sf-7-materials/datasets/proj_4/Park_Scores_2005-2014.csv'
parks_scores=pd.read_csv(path)

In [4]:
# parks_data.drop(parks_data.index[0],inplace=True)
parks_data.head()

Unnamed: 0,ParkName,ParkType,ParkServiceArea,PSAManager,email,Number,Zipcode,Acreage,SupDist,ParkID,Location 1,Lat
0,ParkName,ParkType,ParkServiceArea,PSAManager,email,Number,,,,,,
1,10TH AVE/CLEMENT MINI PARK,Mini Park,PSA 1,"Elder, Steve",steven.elder@sfgov.org,(415) 601-6501,94118.0,0.66,1.0,156.0,"351 9th Ave\nSan Francisco, CA\n(37.78184397, ...",
2,15TH AVENUE STEPS,Mini Park,PSA 4,"Sheehy, Chuck",charles.sheehy@sfgov.org,(415) 218-2226,94122.0,0.26,7.0,185.0,"15th Ave b w Kirkham\nSan Francisco, CA\n(37.7...",
3,24TH/YORK MINI PARK,Mini Park,PSA 6,"Field, Adrian",adrian.field@sfgov.org,(415) 717-2872,94110.0,0.12,9.0,51.0,"24th\nSan Francisco, CA\n(37.75306042, -122.40...",
4,29TH/DIAMOND OPEN SPACE,Neighborhood Park or Playground,PSA 5,"O'Brien, Teresa",teresa.o'brien@sfgov.org,(415) 819-2699,94131.0,0.82,8.0,194.0,"Diamond\nSan Francisco, CA\n(37.74360211, -122...",


In [5]:
parks_scores.head()

Unnamed: 0,ParkID,PSA,Park,FQ,Score
0,86,PSA4,Carl Larsen Park,FY05Q3,0.795
1,13,PSA4,Junipero Serra Playground,FY05Q3,0.957
2,9,PSA4,Rolph Nicol Playground,FY05Q3,0.864
3,117,PSA2,Alamo Square,FY05Q4,0.857
4,60,PSA6,Jose Coronado Playground,FY05Q4,0.859


In [6]:
parks_all=parks_data.set_index('ParkID').join(parks_scores.set_index('ParkID')).reset_index()

In [7]:
parks_all.head()

Unnamed: 0,ParkID,ParkName,ParkType,ParkServiceArea,PSAManager,email,Number,Zipcode,Acreage,SupDist,Location 1,Lat,PSA,Park,FQ,Score
0,1.0,GLEN PARK,Regional Park,PSA 5,"Lockwood, Darlene",darlene.lockwood@sfgov.org,(415) 717-2872,94127.0,77.94,8.0,"Elk St\nSan Francisco, CA\n(37.7400257, -122.4...",,PSA5,Glen Park,FY05Q4,0.858
1,1.0,GLEN PARK,Regional Park,PSA 5,"Lockwood, Darlene",darlene.lockwood@sfgov.org,(415) 717-2872,94127.0,77.94,8.0,"Elk St\nSan Francisco, CA\n(37.7400257, -122.4...",,PSA5,Glen Park,FY06Q2,0.99
2,1.0,GLEN PARK,Regional Park,PSA 5,"Lockwood, Darlene",darlene.lockwood@sfgov.org,(415) 717-2872,94127.0,77.94,8.0,"Elk St\nSan Francisco, CA\n(37.7400257, -122.4...",,PSA5,Glen Park,FY06Q4,0.963
3,1.0,GLEN PARK,Regional Park,PSA 5,"Lockwood, Darlene",darlene.lockwood@sfgov.org,(415) 717-2872,94127.0,77.94,8.0,"Elk St\nSan Francisco, CA\n(37.7400257, -122.4...",,PSA5,Glen Park,FY07Q2,0.876
4,1.0,GLEN PARK,Regional Park,PSA 5,"Lockwood, Darlene",darlene.lockwood@sfgov.org,(415) 717-2872,94127.0,77.94,8.0,"Elk St\nSan Francisco, CA\n(37.7400257, -122.4...",,PSA5,Glen Park,FY08Q2,1.0


In [14]:
pd.pivot_table(parks_all, values='Score', index=['PSAManager'],columns='ParkType',aggfunc=np.mean)

ParkType,Civic Plaza or Square,Community Garden,Concession,Family Camp,Mini Park,Neighborhood Park or Playground,ParkType,Parkway,Regional Park,Zoological Garden
PSAManager,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
"Castile, Steve",,,,,,,,,,
"Cleveland, Maggie",0.938122,,,,0.887306,0.925017,,,,
"Deasy, Jon",,,,,,,,0.83929,0.904969,
"Dennis, Brent",,,,,,,,,0.949182,
"Elder, Steve",,,,,0.904649,0.926429,,0.739394,0.896129,
"Field, Adrian",,,,,0.939824,0.912582,,,,
"Figone, Joe",0.9221,,,,0.933,0.947911,,,0.890625,
"Gay, Mike",,,,,,,,,,
"Giammattei, Joe",,,,,,,,,0.896533,
"Hill, Eric",0.960656,,,,0.898581,0.908384,,,,


In [23]:

parks_all=parks_all[parks_all.Score>0]#drop rows with no score

In [25]:
pd.pivot_table(parks_all, values='Score', index=['PSAManager'],columns='ParkType',aggfunc=np.mean)

ParkType,Civic Plaza or Square,Mini Park,Neighborhood Park or Playground,Parkway,Regional Park
PSAManager,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
"Cleveland, Maggie",0.938122,0.887306,0.92892,,
"Deasy, Jon",,,,0.83929,0.904969
"Dennis, Brent",,,,,0.949182
"Elder, Steve",,0.904649,0.926429,0.739394,0.896129
"Field, Adrian",,0.939824,0.912582,,
"Figone, Joe",0.9221,0.933,0.947911,,0.890625
"Giammattei, Joe",,,,,0.896533
"Hill, Eric",0.960656,0.898581,0.908384,,
"Koch-Gonzalez, Gloria",,,,,0.856814
"Lockwood, Darlene",,0.889586,0.923097,,0.914935


In [65]:
# take an explicit copy so the column assignment below does not trigger SettingWithCopyWarning
parks_all_d = parks_all[['ParkType','PSAManager','SupDist']].copy()
parks_all_d['SupDist'] = parks_all_d['SupDist'].astype(str)  # treat supervisor district as categorical
parks_all.shape


(5466, 16)

In [66]:
parks_all_d.head()

Unnamed: 0,ParkType,PSAManager,SupDist
2,Regional Park,"Lockwood, Darlene",8.0
3,Regional Park,"Lockwood, Darlene",8.0
4,Regional Park,"Lockwood, Darlene",8.0
5,Regional Park,"Lockwood, Darlene",8.0
6,Regional Park,"Lockwood, Darlene",8.0


In [67]:
parks_all_d=pd.get_dummies(parks_all_d)
parks_all_d.head()
parks_all_d.columns

Index([u'ParkType_Civic Plaza or Square', u'ParkType_Mini Park',
       u'ParkType_Neighborhood Park or Playground', u'ParkType_Parkway',
       u'ParkType_Regional Park', u'PSAManager_Cleveland, Maggie',
       u'PSAManager_Deasy, Jon', u'PSAManager_Dennis, Brent',
       u'PSAManager_Elder, Steve', u'PSAManager_Field, Adrian',
       u'PSAManager_Figone, Joe', u'PSAManager_Giammattei, Joe',
       u'PSAManager_Hill, Eric', u'PSAManager_Koch-Gonzalez, Gloria',
       u'PSAManager_Lockwood, Darlene', u'PSAManager_Martin, York (Acting)',
       u'PSAManager_McCormick, James', u'PSAManager_Miller, John',
       u'PSAManager_O'Brien, Teresa', u'PSAManager_O'Connor, Tom',
       u'PSAManager_Sheehy, Chuck', u'PSAManager_Sheets, Robert',
       u'PSAManager_Stone, Andy', u'PSAManager_Taylor, Zack',
       u'PSAManager_Watkins, Robert', u'SupDist_1.0', u'SupDist_10.0',
       u'SupDist_11.0', u'SupDist_2.0', u'SupDist_3.0', u'SupDist_4.0',
       u'SupDist_5.0', u'SupDist_6.0', u'SupDist_7.0

In [68]:
parks_all_2=pd.concat([parks_all, parks_all_d],axis=1,join='inner')

parks_all_2.head()

Unnamed: 0,ParkID,ParkName,ParkType,ParkServiceArea,PSAManager,email,Number,Zipcode,Acreage,SupDist,...,SupDist_10.0,SupDist_11.0,SupDist_2.0,SupDist_3.0,SupDist_4.0,SupDist_5.0,SupDist_6.0,SupDist_7.0,SupDist_8.0,SupDist_9.0
2,1.0,GLEN PARK,Regional Park,PSA 5,"Lockwood, Darlene",darlene.lockwood@sfgov.org,(415) 717-2872,94127.0,77.94,8.0,...,0,0,0,0,0,0,0,0,1,0
3,1.0,GLEN PARK,Regional Park,PSA 5,"Lockwood, Darlene",darlene.lockwood@sfgov.org,(415) 717-2872,94127.0,77.94,8.0,...,0,0,0,0,0,0,0,0,1,0
4,1.0,GLEN PARK,Regional Park,PSA 5,"Lockwood, Darlene",darlene.lockwood@sfgov.org,(415) 717-2872,94127.0,77.94,8.0,...,0,0,0,0,0,0,0,0,1,0
5,1.0,GLEN PARK,Regional Park,PSA 5,"Lockwood, Darlene",darlene.lockwood@sfgov.org,(415) 717-2872,94127.0,77.94,8.0,...,0,0,0,0,0,0,0,0,1,0
6,1.0,GLEN PARK,Regional Park,PSA 5,"Lockwood, Darlene",darlene.lockwood@sfgov.org,(415) 717-2872,94127.0,77.94,8.0,...,0,0,0,0,0,0,0,0,1,0


In [69]:
parks_all_2.columns

Index([u'ParkID', u'ParkName', u'ParkType', u'ParkServiceArea', u'PSAManager',
       u'email', u'Number', u'Zipcode', u'Acreage', u'SupDist', u'Location 1',
       u'Lat', u'PSA', u'Park', u'FQ', u'Score',
       u'ParkType_Civic Plaza or Square', u'ParkType_Mini Park',
       u'ParkType_Neighborhood Park or Playground', u'ParkType_Parkway',
       u'ParkType_Regional Park', u'PSAManager_Cleveland, Maggie',
       u'PSAManager_Deasy, Jon', u'PSAManager_Dennis, Brent',
       u'PSAManager_Elder, Steve', u'PSAManager_Field, Adrian',
       u'PSAManager_Figone, Joe', u'PSAManager_Giammattei, Joe',
       u'PSAManager_Hill, Eric', u'PSAManager_Koch-Gonzalez, Gloria',
       u'PSAManager_Lockwood, Darlene', u'PSAManager_Martin, York (Acting)',
       u'PSAManager_McCormick, James', u'PSAManager_Miller, John',
       u'PSAManager_O'Brien, Teresa', u'PSAManager_O'Connor, Tom',
       u'PSAManager_Sheehy, Chuck', u'PSAManager_Sheets, Robert',
       u'PSAManager_Stone, Andy', u'PSAManager_Tay

In [70]:
drop_list=['ParkName','ParkType','SupDist','Zipcode','ParkID','ParkServiceArea','PSAManager','email','Number','Acreage','Location 1','Lat','PSA','Park','FQ']
parks_all_2.drop(drop_list,axis=1,inplace=True)
parks_all_2.head()

Unnamed: 0,Score,ParkType_Civic Plaza or Square,ParkType_Mini Park,ParkType_Neighborhood Park or Playground,ParkType_Parkway,ParkType_Regional Park,"PSAManager_Cleveland, Maggie","PSAManager_Deasy, Jon","PSAManager_Dennis, Brent","PSAManager_Elder, Steve",...,SupDist_10.0,SupDist_11.0,SupDist_2.0,SupDist_3.0,SupDist_4.0,SupDist_5.0,SupDist_6.0,SupDist_7.0,SupDist_8.0,SupDist_9.0
2,0.963,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3,0.876,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,1.0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
5,0.971,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
6,0.974,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [71]:
X=parks_all_2.copy()
y=X.Score
del X['Score']
X.head()

Unnamed: 0,ParkType_Civic Plaza or Square,ParkType_Mini Park,ParkType_Neighborhood Park or Playground,ParkType_Parkway,ParkType_Regional Park,"PSAManager_Cleveland, Maggie","PSAManager_Deasy, Jon","PSAManager_Dennis, Brent","PSAManager_Elder, Steve","PSAManager_Field, Adrian",...,SupDist_10.0,SupDist_11.0,SupDist_2.0,SupDist_3.0,SupDist_4.0,SupDist_5.0,SupDist_6.0,SupDist_7.0,SupDist_8.0,SupDist_9.0
2,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
5,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
6,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [72]:

lr = linear_model.LinearRegression()
lr.fit(X.values,y.values)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [73]:
lr.score(X,y)
# Unfortunately, the fit is poor: park type, PSA manager, and supervisor district
# explain only about 10% of the variance in park score.

0.097974320639559243
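
The score above is an in-sample $R^2$. A hedged sketch of how the four model types could be compared with cross-validated $R^2$ on one-hot encoded predictors follows. The frame `demo`, its values, and the chosen penalty strengths are made-up assumptions for illustration, not the parks data or tuned settings.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.RandomState(0)
n = 400
# Made-up categorical predictors standing in for park type and supervisor district.
demo = pd.DataFrame({
    'ParkType': rng.choice(['Mini Park', 'Regional Park', 'Parkway'], size=n),
    'SupDist': rng.choice(['1', '2', '3', '4'], size=n),
})
X_demo = pd.get_dummies(demo).values.astype(float)
# Made-up scores: a park-type effect plus noise.
y_demo = (0.8 + 0.1 * (demo['ParkType'] == 'Regional Park').values
          + rng.normal(scale=0.05, size=n))

models = {
    'LinearRegression': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=0.001),
    'ElasticNet': ElasticNet(alpha=0.001, l1_ratio=0.5),
}
folds = KFold(n_splits=5, shuffle=True, random_state=0)
cv_r2 = {}
for name, model in models.items():
    # mean R^2 over the held-out folds
    cv_r2[name] = cross_val_score(model, X_demo, y_demo, cv=folds, scoring='r2').mean()
    print(name, round(cv_r2[name], 3))
```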

<img src="http://imgur.com/GCAf1UX.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 5. Bias-variance tradeoff

---

1. Use a model of your choice from any of the previous questions and construct a regularized regression model. Ideally the model should actually improve across regularization strengths.
- Gridsearch the regularization parameters to find the optimal values.
- Plot the regularization parameter against the cross-validated $R^2$.
- Explain how regularization and regularization strength are related to the bias-variance tradeoff.
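
The grid search step could be sketched as follows, assuming a Ridge model and synthetic data with few rows and many correlated predictors. `X_demo`, `y_demo`, and the alpha grid are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

rng = np.random.RandomState(0)
# Synthetic stand-in (assumption): few rows, many correlated predictors.
n_rows, n_cols = 60, 40
base = rng.normal(size=(n_rows, 5))
X_demo = base.dot(rng.normal(size=(5, n_cols))) + rng.normal(scale=0.5, size=(n_rows, n_cols))
y_demo = base[:, 0] - base[:, 1] + rng.normal(scale=1.0, size=n_rows)

alphas = np.logspace(-3, 3, 13)
search = GridSearchCV(
    Ridge(),
    param_grid={'alpha': alphas},
    scoring='r2',
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_demo, y_demo)

# One mean cross-validated R^2 per alpha, in the same order as the grid.
mean_cv_r2 = search.cv_results_['mean_test_score']
print(search.best_params_['alpha'], mean_cv_r2.max())
```

Plotting `alphas` on a log axis against `mean_cv_r2`, for example with `plt.semilogx`, gives the curve of regularization strength against cross-validated $R^2$.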


<img src="http://imgur.com/xDpSobf.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 5.1 Calculate the approximated $\text{bias}^2$ and variance across regularization strengths.

---

You can obviously use my code from the bias-variance lab to do this. 

Plot the bias and variance change _with_ the cross-validated $R^2$. 

You'll need to scale these values somehow to put them on the same chart. I recommend [MinMaxScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) to put $\text{bias}^2$ and variance on the same scale as cross-validated $R^2$.
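
A minimal sketch of one way to approximate these quantities, assuming a bootstrap over the training rows and a Ridge model. This is not the code from the bias-variance lab. The data is synthetic, so the noise-free target `f_test` is known. With real data it is not, and the observed test targets would have to stand in for it.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import MinMaxScaler

rng = np.random.RandomState(0)
n_train, n_test, n_cols = 80, 200, 30
true_coef = rng.normal(size=n_cols)
X_train = rng.normal(size=(n_train, n_cols))
X_test = rng.normal(size=(n_test, n_cols))
f_test = X_test.dot(true_coef)  # noise-free target, known only because the data is synthetic
y_train = X_train.dot(true_coef) + rng.normal(scale=3.0, size=n_train)

alphas = np.logspace(-2, 4, 13)
n_boot = 50
bias_sq, variance = [], []
for alpha in alphas:
    preds = np.empty((n_boot, n_test))
    for b in range(n_boot):
        idx = rng.randint(0, n_train, size=n_train)  # bootstrap resample of the training rows
        preds[b] = Ridge(alpha=alpha).fit(X_train[idx], y_train[idx]).predict(X_test)
    mean_pred = preds.mean(axis=0)
    bias_sq.append(np.mean((mean_pred - f_test) ** 2))  # squared gap between average prediction and truth
    variance.append(np.mean(preds.var(axis=0)))         # spread of predictions across resamples

# Rescale each curve to [0, 1] so both fit on one chart next to cross-validated R^2.
scaled = MinMaxScaler().fit_transform(np.column_stack([bias_sq, variance]))
print(scaled[:3])
```

The first column of `scaled` is the rescaled $\text{bias}^2$ and the second is the rescaled variance, each indexed by the entries of `alphas`.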

<img src="http://imgur.com/HNPKfE8.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 6. Custom regularized regression penalties

---

The $\ell_1$ and $\ell_2$ norm regularization penalties (Lasso and Ridge, respectively) are the most commonly used regularization penalties. They have a solid foundation in statistics and evidence of effectiveness. However, these are not the only possible penalties for regression. Sometimes new, customized penalties give additional performance and predictive power to models, depending on the context.
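
For reference, the two standard penalized least-squares objectives can be written as below. The notation is introduced here for illustration: $\beta$ for the coefficient vector, $x_i$ and $y_i$ for the predictors and target of row $i$, and $\alpha \ge 0$ for the regularization strength. Scaling constants in front of the sums differ between implementations.

$$
\text{Ridge } (\ell_2): \quad \min_{\beta} \; \sum_{i=1}^{n} \left(y_i - x_i^\top \beta\right)^2 + \alpha \sum_{j=1}^{p} \beta_j^2
$$

$$
\text{Lasso } (\ell_1): \quad \min_{\beta} \; \sum_{i=1}^{n} \left(y_i - x_i^\top \beta\right)^2 + \alpha \sum_{j=1}^{p} \left|\beta_j\right|
$$

A custom penalty keeps the same squared-error term and replaces the final sum with a different function of $\beta$.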


**Devise and implement a penalized regression for the San Francisco crime data.** What is your rationale? Why would this be useful? How does it perform compared to the standard Ridge, Lasso, and Elastic Net penalties?

## Statistics, Biases, and Hypothesis Testing

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 7. Biases 

---
A new food-ordering smartphone app incentivizes its users to invite their friends by offering them free orders for each friend who signs up.
- What biases are being caused here?

- How would you measure the success of such a program?

- Rephrase this question to be unbiased:
    **Many people have said that there is a need for stricter laws on dangerous weapons. Do you agree?**
   


In [None]:
1) Many people may sign up only as a favor to a friend and never order anything, so sign-up counts overstate real adoption.
2) I would measure success by checking whether the cost of acquiring a customer is less than the profit that customer generates.
3) Do you think that we should regulate weapons more strictly?

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 8. Hypothesis Testing 

---

For the health/mortality data from the following website:
http://assets.datacamp.com/blog_assets/chol.txt


- Generate summary statistics, histogram plots, CDF plots, and confidence intervals for two columns of your choice, and a correlation matrix across all columns.
- Using what you generated, provide short summaries of each column describing the data.
- Is there a difference in mortality between smokers and nonsmokers?
- If the national average weight is 85 pounds, is our average weight significantly different?
- Until now, we've only asked whether it is different. This is called a two-sided test.
    - What if we want to know whether it is less than or greater than? This is called a one-sided test. We can calculate this from the result of a two-sided test: divide the p-value in half and check whether the t-statistic is positive or negative. Greater than: p/2 < significance level and t > 0. Less than: p/2 < significance level and t < 0.

    - If the national average weight is 85 pounds, is our average weight statistically significantly less?
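
The halving rule for one-sided tests can be sketched as follows. The helper `one_sided_from_two_sided` and the `weights` sample are hypothetical. The sample is randomly generated rather than taken from the chol.txt data, and 0.05 is an assumed significance level.

```python
import numpy as np
from scipy import stats

def one_sided_from_two_sided(t_stat, p_two_sided, alpha=0.05):
    """Return (reject_less, reject_greater) for one-sided tests at level alpha."""
    p_half = p_two_sided / 2.0
    reject_less = (t_stat < 0) and (p_half < alpha)      # mean is significantly below the reference
    reject_greater = (t_stat > 0) and (p_half < alpha)   # mean is significantly above the reference
    return reject_less, reject_greater

# Made-up weights whose true mean (80) sits below the reference value of 85.
rng = np.random.RandomState(0)
weights = rng.normal(loc=80.0, scale=10.0, size=200)

# Two-sided one-sample t-test against the reference mean.
t_stat, p_two_sided = stats.ttest_1samp(weights, popmean=85.0)
reject_less, reject_greater = one_sided_from_two_sided(t_stat, p_two_sided)
print(t_stat, p_two_sided, reject_less, reject_greater)
```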