<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 15px; height: 80px">

# Project 3

## San Francisco Data

---

[San Francisco provides a wealth of data on the city to the public.](https://data.sfgov.org/) 

Project 3 is all about exploring this data and modeling interesting relationships with regression.


---

## Notes on the data

We have gone through the above website and pulled out a variety of different datasets that we think are particularly interesting. Some of the datasets are from external sources as well, but all are related to San Francisco. A high level overview of data folders is provided after the project requirements section.

**Feel free to include any other datasets from the San Francisco data if you think they are relevant or could be useful for your analysis.**


**The uncompressed data is large.** Even the compressed data is fairly large. The data is compressed in the .7z format, which produces some of the smallest archive sizes available. You will likely need a third-party app to extract it.

### Recommended Utilities for .7z
- For OSX [Keka](http://www.kekaosx.com/en/) or [The Unarchiver](http://wakaba.c3.cx/s/apps/unarchiver.html). 
- For Windows [7-zip](http://www.7-zip.org/) is the standard. 
- For Linux try the `p7zip` utility.  `sudo apt-get install p7zip`.

---

## Project requirements

**You will be performing 4 different sections of analysis on the San Francisco data.**

**Models must be regression. This means that your target variable needs to be numeric/continuous.**

Do not build classification models – this will be the topic of week 4.


<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 1. Use the San Francisco assessor dataset and perform EDA

---

1. Explain what the data is. This may include multiple csv files. Some of this data uses hard-to-understand codes to represent the variables. Nearly all of the data is pulled from https://data.sfgov.org/, so that site is a very good resource for determining what the data is.
- Clean the data.
- Develop and clearly state a hypothesis about the data that you want to test. (This is totally up to you.)
- Create some initial visualizations on the portions of the data relevant to your hypothesis.

In [43]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import scipy.stats as stats

# plotting modules
sns.set_style('whitegrid')

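# Stats/Regression Packages
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score  # needed for the cross-validated R^2 in section 2
from sklearn.preprocessing import StandardScaler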


%matplotlib inline
%config InlineBackend.figure_format ='retina'

pd.set_option('display.max_columns', 500)

In [53]:
filepath  = '/Users/manuel/desktop/dsi-sf-7-materials_manuel/datasets/sf_assessor_value/assessor_value_cleaned.csv'
assessor_df = pd.read_csv(filepath)


## EDA

In [54]:
assessor_df.head(2)

Unnamed: 0,baths,beds,lot_depth,basement_area,front_ft,owner_pct,rooms,property_class,neighborhood,tax_rate,volume,sqft,stories,year_recorded,year_built,zone,value
0,2,2,0.0,0.0,0.0,1.0,5,Z,08E,1000.0,1,1419,0,2007,1907,RH3,1002840.0
1,2,2,0.0,0.0,0.0,1.0,7,Z,08E,1000.0,1,1773,0,2007,1907,RH3,1433430.0


In [55]:
assessor_df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
baths,754147.0,1.454369,0.632256,0.0,1.0,1.0,2.0,3.0
beds,754147.0,0.754092,1.241474,0.0,0.0,0.0,2.0,4.0
lot_depth,754147.0,7.865771,20.137087,0.0,0.0,0.0,0.0,93.3
basement_area,754147.0,49.159869,144.199184,0.0,0.0,0.0,0.0,799.0
front_ft,754147.0,0.012888,0.630595,0.0,0.0,0.0,0.0,50.0
owner_pct,754147.0,0.897754,0.214527,0.0,1.0,1.0,1.0,1.0
rooms,754147.0,5.582982,1.40592,1.0,5.0,5.0,6.0,9.0
tax_rate,754147.0,1000.389953,2.345064,1000.0,1000.0,1000.0,1000.0,1019.0
volume,754147.0,23.767649,12.45832,1.0,14.0,21.0,36.0,44.0
sqft,754147.0,1399.682332,461.872881,495.0,1075.0,1320.0,1656.0,3050.0


In [56]:
assessor_df.isnull().any()

baths             False
beds              False
lot_depth         False
basement_area     False
front_ft          False
owner_pct         False
rooms             False
property_class    False
neighborhood      False
tax_rate          False
volume            False
sqft              False
stories           False
year_recorded     False
year_built        False
zone              False
value             False
dtype: bool

In [57]:
assessor_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 754147 entries, 0 to 754146
Data columns (total 17 columns):
baths             754147 non-null int64
beds              754147 non-null int64
lot_depth         754147 non-null float64
basement_area     754147 non-null float64
front_ft          754147 non-null float64
owner_pct         754147 non-null float64
rooms             754147 non-null int64
property_class    754147 non-null object
neighborhood      754147 non-null object
tax_rate          754147 non-null float64
volume            754147 non-null int64
sqft              754147 non-null int64
stories           754147 non-null int64
year_recorded     754147 non-null int64
year_built        754147 non-null int64
zone              754147 non-null object
value             754147 non-null float64
dtypes: float64(6), int64(8), object(3)
memory usage: 97.8+ MB


In [66]:
assessor_df_value = assessor_df.value
assessor_df.drop('value', axis=1, inplace=True)
# kaggle_combined = pd.concat((kaggle_train.drop('SalePrice', axis=1), kaggle_test))


In [68]:
# assessor_df.columns

## Normalize

In [69]:
ss = StandardScaler()
# Standardize only the numeric columns; the object (categorical) columns are dummy-coded later.
numeric_cols = assessor_df.select_dtypes(exclude=['object']).columns
assessor_df[numeric_cols] = ss.fit_transform(assessor_df[numeric_cols])
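The scaling cell above fits the scaler on every row, so any later cross-validation has already seen statistics from the held-out folds. A minimal sketch of one alternative is to put the scaling and dummy coding inside a pipeline, so both are refit on each training split. Everything here is an assumption for illustration: the toy frame `toy`, its made-up rows (only the first two mimic the `head(2)` output above), and the choice of `ColumnTransformer` are not part of the original notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data shaped like a few assessor columns.
toy = pd.DataFrame({
    'rooms': [5, 7, 4, 6, 5, 8],
    'sqft': [1419, 1773, 900, 1500, 1320, 2100],
    'zone': ['RH3', 'RH3', 'RH1', 'RH2', 'RH1', 'RH2'],
})
toy_value = pd.Series([1002840.0, 1433430.0, 650000.0, 1100000.0, 900000.0, 1600000.0])

# Split columns into numeric and everything else.
numeric_cols = toy.select_dtypes(include='number').columns.tolist()
categorical_cols = [c for c in toy.columns if c not in numeric_cols]

# Scale numeric columns and one-hot encode categorical ones in a single step.
preprocess = ColumnTransformer([
    ('scale', StandardScaler(), numeric_cols),
    ('onehot', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
])

# The pipeline refits the preprocessing whenever the model is fit.
model = Pipeline([('preprocess', preprocess), ('ols', LinearRegression())])
model.fit(toy, toy_value)

transformed = model.named_steps['preprocess'].transform(toy)
predictions = model.predict(toy)
```

Passing `model` (rather than a bare `LinearRegression`) to a cross-validation helper would then keep the scaler's mean and standard deviation from leaking across folds.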


In [40]:
# assessor_df.head()

## Categorize values

In [70]:
assessor_df = pd.get_dummies(data=assessor_df)

In [71]:
assessor_df.head()

Unnamed: 0,baths,beds,lot_depth,basement_area,front_ft,owner_pct,rooms,tax_rate,volume,sqft,stories,year_recorded,year_built,property_class_D,property_class_DBM,property_class_LZ,property_class_TH,property_class_Z,property_class_ZBM,neighborhood_01A,neighborhood_01B,neighborhood_01C,neighborhood_01D,neighborhood_01E,neighborhood_01F,neighborhood_01G,neighborhood_02A,neighborhood_02B,neighborhood_02C,neighborhood_02D,neighborhood_02E,neighborhood_02F,neighborhood_02G,neighborhood_03A,neighborhood_03B,neighborhood_03C,neighborhood_03D,neighborhood_03E,neighborhood_03F,neighborhood_03G,neighborhood_03H,neighborhood_03J,neighborhood_047,neighborhood_04A,neighborhood_04B,neighborhood_04C,neighborhood_04D,neighborhood_04E,neighborhood_04F,neighborhood_04G,neighborhood_04H,neighborhood_04J,neighborhood_04K,neighborhood_04M,neighborhood_04N,neighborhood_04P,neighborhood_04R,neighborhood_04S,neighborhood_04T,neighborhood_05A,neighborhood_05B,neighborhood_05C,neighborhood_05D,neighborhood_05E,neighborhood_05F,neighborhood_05G,neighborhood_05H,neighborhood_05J,neighborhood_05K,neighborhood_05M,neighborhood_06A,neighborhood_06B,neighborhood_06C,neighborhood_06D,neighborhood_06E,neighborhood_06F,neighborhood_07A,neighborhood_07B,neighborhood_07C,neighborhood_07D,neighborhood_08A,neighborhood_08B,neighborhood_08C,neighborhood_08D,neighborhood_08E,neighborhood_08F,neighborhood_08G,neighborhood_08H,neighborhood_08I,neighborhood_09A,neighborhood_09B,neighborhood_09C,neighborhood_09D,neighborhood_09E,neighborhood_09F,neighborhood_09G,neighborhood_09H,neighborhood_10A,neighborhood_10B,neighborhood_10C,neighborhood_10D,neighborhood_10E,neighborhood_10F,neighborhood_10G,neighborhood_10H,neighborhood_10J,neighborhood_10K,zone_24NOE,zone_C2,zone_C3G,zone_C3O,zone_C3S,zone_CCB,zone_CM,zone_CRNC,zone_CVR,zone_FILLMR,zone_HAYES,zone_M1,zone_M2,zone_MI,zone_MZ,zone_NBEACH,zone_NC1,zone_NC2,zone_NC3,zone_NCR,zone_NCS,zone_NCZ,zone_OTCLEM,zone_P,zone_P/RH1,zone_P/RH1D,zone_P/RH2,zone_POLK,zone_
RC1,zone_RC3,zone_RC4,zone_RC4NC3,zone_RED,zone_RH,zone_RH1,zone_RH1/CM,zone_RH1D,zone_RH1NC2,zone_RH1RH2,zone_RH1RM1,zone_RH1S,zone_RH2,zone_RH2NC3,zone_RH2RH3,zone_RH2RM1,zone_RH3,zone_RH3RM1,zone_RHDRH1,zone_RHDRH2,zone_RHI,zone_RHZ,zone_RM1,zone_RM1CM,zone_RM1RM4,zone_RM2,zone_RM2RM3,zone_RM3,zone_RM3NC2,zone_RM3NCS,zone_RM3RM4,zone_RM4,zone_RMI,zone_RMZ,zone_RSD,zone_SACTO,zone_SLI,zone_SLR,zone_SPD,zone_SSO,zone_UNION,zone_UPMKT,zone_VALEN
0,0.862991,1.003573,-0.390611,-0.340917,-0.020438,0.476611,-0.414662,-0.166287,-1.827507,0.041825,-2.058338,-1.513623,-1.441895,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0.862991,1.003573,-0.390611,-0.340917,-0.020438,0.476611,1.007894,-0.166287,-1.827507,0.80827,-2.058338,-1.513623,-1.441895,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0.862991,1.003573,-0.390611,-0.340917,-0.020438,0.476611,-0.414662,-0.166287,-1.827507,-0.341398,-2.058338,-1.513623,2.287808,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,-0.718647,-0.607417,-0.390611,-0.340917,-0.020438,-2.631021,0.296616,-0.166287,-1.827507,0.910029,-0.267808,-1.513623,-0.594235,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,-0.718647,-0.607417,-0.390611,-0.340917,-0.020438,0.476611,-1.125941,-0.166287,-1.827507,-0.82205,-0.267808,-1.513623,-0.636618,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


# The more rooms, the higher the value of the house

In [72]:
# ase_int.groupby(by=['rooms'])['baths','beds','basement_area','stories', 'value'].mean()

In [73]:
# use_cols = ['baths','beds','basement_area','stories','rooms']

In [74]:
# sns.pairplot(ase_int, x_vars=use_cols[:2], y_vars='value', size=7, aspect=0.7);

In [75]:
# sns.pairplot(ase_int, x_vars=use_cols[2:-1], y_vars='value', size=7, aspect=0.7);

In [76]:
# sns.pairplot(ase_int, x_vars=use_cols[-1], y_vars='value', size=7, aspect=0.7);

In [77]:
# fig = plt.figure(figsize=(7,7))
# ax = fig.gca()

# ax = sns.boxplot(x='rooms', y='value', data=ase_int, ax=ax, notch=True)

# ax.set_title('Rooms and Value relation', fontsize=20)

# plt.show()


<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 2. Construct and evaluate a linear regression model on the data

---

1. State the variables that are predictors in your linear regression and the target variable.
- Investigate and remove any outliers or other problems in your data. _This is a subjective process._
- Construct a linear regression model.
- Evaluate the model. How does the $R^2$ of the overall model compare to the cross-validated $R^2$? What do the differences in $R^2$ mean?
  - Use test / train split
  - Use K-Folds
  - Compare and explain your results with both
- Visualize the evaluation metrics of your analysis in clear charts.
- Summarize your results in the context of your hypothesis. Frame this as if you are presenting to non-technical readers.
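The train/test comparison asked for above can be sketched as follows. This is only an illustration on synthetic data: `X_demo`, `y_demo`, the coefficients, and the 70/30 split are assumptions, not values from the assessor dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: a linear signal in three predictors plus noise.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(300, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=300)

# Hold out 30% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

ols = LinearRegression().fit(X_train, y_train)

# R^2 on the rows the model was fit on versus rows it has never seen.
train_r2 = ols.score(X_train, y_train)
test_r2 = ols.score(X_test, y_test)
print('train R^2: %.3f, test R^2: %.3f' % (train_r2, test_r2))
```

A test $R^2$ well below the training $R^2$ would suggest overfitting; similar values suggest the model generalizes about as well as it fits.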


In [140]:
# y = pd.DataFrame(assessor_df_value)
y = assessor_df_value
X = assessor_df

In [139]:
y.shape, X.shape

((754147,), (754147, 179))

In [112]:
lm = linear_model.LinearRegression()

In [133]:
model = lm.fit(X, y)

In [131]:
# plt.scatter(X, y, color='black')
# plt.plot(X, lm.predict(X), color='blue', linewidth=3)
# plt.show()

In [150]:
####
# kfold = StratifiedKFold(y.values, n_folds=3, shuffle=True)

In [143]:
cross_val_score(lm, X, y, cv=5)

array([ -3.59173721e+10,   2.76113584e-01,   2.64593616e-01,
         3.04435932e-01,   2.98707758e-01])
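With an integer `cv` and a regressor, `cross_val_score` uses consecutive, unshuffled folds. If the rows are ordered (for example by neighborhood or year), one fold can contain dummy-variable categories the model barely saw in training, which is one plausible explanation for the extreme first score above. The sketch below shows how to pass an explicitly shuffled `KFold` instead; the data is synthetic and the names `X_syn`, `y_syn`, and `shuffled_folds` are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data with a strong linear relationship.
rng = np.random.RandomState(42)
X_syn = rng.normal(size=(200, 2))
y_syn = 3.0 * X_syn[:, 0] - 2.0 * X_syn[:, 1] + rng.normal(scale=0.1, size=200)

# Shuffle the rows before splitting so each fold is a random sample.
shuffled_folds = KFold(n_splits=5, shuffle=True, random_state=42)
shuffled_scores = cross_val_score(LinearRegression(), X_syn, y_syn, cv=shuffled_folds)
print(shuffled_scores)
```

On the real data, comparing the shuffled and unshuffled scores would show whether row ordering is behind the outlier fold.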

<img src="http://imgur.com/xDpSobf.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 2.2 Explain $R^2$ vs. mean squared error (MSE)

---

1. If you have negative $R^2$ values in cross-validation, what does this mean? 
2. Why can $R^2$ only be negative when the model is tested on new data?

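### A = A negative $R^2$ means the model predicts the held-out fold worse than a baseline that always predicts the mean of the target. On its own training data, a least-squares fit with an intercept can always match that baseline, so its $R^2$ cannot fall below 0; on new data there is no such guarantee.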
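To make the two questions above concrete, here is a small hand computation of $R^2 = 1 - SS_{res} / SS_{tot}$ in plain Python. The helper `r_squared` and the three-point example are illustrative assumptions, not part of the notebook.

```python
def r_squared(y_true, y_pred):
    """Return 1 - SS_res / SS_tot, with SS_tot measured around the mean of y_true."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


held_out = [1.0, 2.0, 3.0]

# Always predicting the mean of the held-out targets gives exactly 0.
baseline_r2 = r_squared(held_out, [2.0, 2.0, 2.0])   # -> 0.0

# Predictions that are systematically too high have a larger squared error
# than the mean baseline, so R^2 drops below 0.
bad_model_r2 = r_squared(held_out, [3.0, 3.0, 3.0])  # -> -1.5
```

Any set of predictions whose squared error exceeds that of the mean baseline yields a negative value, and there is no lower bound, which is why a single badly predicted fold can produce a score like the one in the cross-validation output.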

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 3. Combine the Crime and Fire incidents datasets from the San Francisco data. Build a linear regression model to predict the number of fire incidents. What are the most significant predictors?

### Evaluate the model with regularized regression.

---

**I recommend having many predictors to see benefits from regularization methods, but it's up to you.**


- Like in part 1, you should state a hypothesis and perform data cleaning and EDA _only_ on the relevant portions of your data. Don't waste time!
- Construct and evaluate different models with cross-validated $R^2$. Compare LinearRegression, Lasso, Ridge, and ElasticNet. 
- Report on which model is best after performing regularization, and why that might be the case (hint: does your data have multicollinearity? Irrelevant variables? Both?)
- Plot visuals that compare the performance of the four models.
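The four-model comparison described above can be sketched as follows. The data is synthetic (ten predictors, only three of which matter), and the names `X_syn`, `y_syn`, `candidates`, and the `alpha` values are assumptions chosen for illustration, not tuned settings.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data: 10 predictors, only the first 3 carry signal.
rng = np.random.RandomState(1)
X_syn = rng.normal(size=(200, 10))
y_syn = (5.0 * X_syn[:, 0] - 4.0 * X_syn[:, 1] + 3.0 * X_syn[:, 2]
         + rng.normal(size=200))

candidates = {
    'LinearRegression': LinearRegression(),
    'Lasso': Lasso(alpha=0.1),
    'Ridge': Ridge(alpha=1.0),
    'ElasticNet': ElasticNet(alpha=0.1, l1_ratio=0.5),
}

# Use the same shuffled folds for every model so the scores are comparable.
folds = KFold(n_splits=5, shuffle=True, random_state=1)
mean_r2 = {
    name: cross_val_score(estimator, X_syn, y_syn, cv=folds).mean()
    for name, estimator in candidates.items()
}

for name, score in sorted(mean_r2.items()):
    print('%-16s mean cross-validated R^2: %.3f' % (name, score))
```

On real data the predictors should be standardized first, since the Lasso, Ridge, and ElasticNet penalties depend on the scale of each column.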


## Importing files and slicing

In [158]:
file_crime = '/Users/manuel/Desktop/dsi-sf-7-materials_manuel/datasets/sf_crime/Police_Department_Incidents.csv'
file_fire = '/Users/manuel/Desktop/dsi-sf-7-materials_manuel/datasets/sf_crime/Fire_Incidents.csv'

In [159]:
crime = pd.read_csv(file_crime)
fire = pd.read_csv(file_fire)



In [167]:
crime = crime[:2000]
fire = fire[:2000]

## EDA

In [163]:
crime.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 13 columns):
IncidntNum    2000 non-null int64
Category      2000 non-null object
Descript      2000 non-null object
DayOfWeek     2000 non-null object
Date          2000 non-null object
Time          2000 non-null object
PdDistrict    2000 non-null object
Resolution    2000 non-null object
Address       2000 non-null object
X             2000 non-null float64
Y             2000 non-null float64
Location      2000 non-null object
PdId          2000 non-null int64
dtypes: float64(2), int64(2), object(9)
memory usage: 203.2+ KB


In [164]:
crime.isnull().any()

IncidntNum    False
Category      False
Descript      False
DayOfWeek     False
Date          False
Time          False
PdDistrict    False
Resolution    False
Address       False
X             False
Y             False
Location      False
PdId          False
dtype: bool

In [170]:
fire.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 63 columns):
Incident Number                                 2000 non-null int64
Exposure Number                                 2000 non-null int64
Address                                         2000 non-null object
Incident Date                                   2000 non-null object
Call Number                                     2000 non-null int64
Alarm DtTm                                      2000 non-null object
Arrival DtTm                                    2000 non-null object
Close DtTm                                      2000 non-null object
City                                            2000 non-null object
Zipcode                                         1489 non-null float64
Battalion                                       2000 non-null object
Station Area                                    2000 non-null object
Box                                             241 non-null object

In [171]:
fire.isnull().any()

Incident Number                                 False
Exposure Number                                 False
Address                                         False
Incident Date                                   False
Call Number                                     False
Alarm DtTm                                      False
Arrival DtTm                                    False
Close DtTm                                      False
City                                            False
Zipcode                                          True
Battalion                                       False
Station Area                                    False
Box                                              True
Suppression Units                               False
Suppression Personnel                           False
EMS Units                                       False
EMS Personnel                                   False
Other Units                                     False
Other Personnel             

In [173]:
# Fill missing values: 0 for numeric columns, the string 'None' for everything else.
for col in fire.columns:
    if pd.api.types.is_numeric_dtype(fire[col]):
        fire[col] = fire[col].fillna(0)
    else:
        fire[col] = fire[col].fillna('None')

In [175]:
fire.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 63 columns):
Incident Number                                 2000 non-null int64
Exposure Number                                 2000 non-null int64
Address                                         2000 non-null object
Incident Date                                   2000 non-null object
Call Number                                     2000 non-null int64
Alarm DtTm                                      2000 non-null object
Arrival DtTm                                    2000 non-null object
Close DtTm                                      2000 non-null object
City                                            2000 non-null object
Zipcode                                         2000 non-null float64
Battalion                                       2000 non-null object
Station Area                                    2000 non-null object
Box                                             2000 non-null objec

In [180]:
crime.head()

Unnamed: 0,IncidntNum,Category,Descript,DayOfWeek,Date,Time,PdDistrict,Resolution,Address,X,Y,Location,PdId
0,150060275,NON-CRIMINAL,LOST PROPERTY,Monday,01/19/2015,14:00,MISSION,NONE,18TH ST / VALENCIA ST,-122.421582,37.761701,"(37.7617007179518, -122.42158168137)",15006027571000
1,150098210,ROBBERY,"ROBBERY, BODILY FORCE",Sunday,02/01/2015,15:45,TENDERLOIN,NONE,300 Block of LEAVENWORTH ST,-122.414406,37.784191,"(37.7841907151119, -122.414406029855)",15009821003074
2,150098210,ASSAULT,AGGRAVATED ASSAULT WITH BODILY FORCE,Sunday,02/01/2015,15:45,TENDERLOIN,NONE,300 Block of LEAVENWORTH ST,-122.414406,37.784191,"(37.7841907151119, -122.414406029855)",15009821004014
3,150098210,SECONDARY CODES,DOMESTIC VIOLENCE,Sunday,02/01/2015,15:45,TENDERLOIN,NONE,300 Block of LEAVENWORTH ST,-122.414406,37.784191,"(37.7841907151119, -122.414406029855)",15009821015200
4,150098226,VANDALISM,"MALICIOUS MISCHIEF, VANDALISM OF VEHICLES",Tuesday,01/27/2015,19:00,NORTHERN,NONE,LOMBARD ST / LAGUNA ST,-122.431119,37.800469,"(37.8004687042875, -122.431118543788)",15009822628160


In [220]:
crime.columns

Index([u'IncidntNum', u'Category', u'Descript', u'DayOfWeek', u'Date', u'Time',
       u'PdDistrict', u'Resolution', u'Address', u'X', u'Y', u'Location',
       u'PdId'],
      dtype='object')

In [222]:
crime.Category.unique()

array(['NON-CRIMINAL', 'ROBBERY', 'ASSAULT', 'SECONDARY CODES',
       'VANDALISM', 'BURGLARY', 'LARCENY/THEFT', 'DRUG/NARCOTIC',
       'WARRANTS', 'VEHICLE THEFT', 'OTHER OFFENSES', 'WEAPON LAWS',
       'ARSON', 'MISSING PERSON', 'DRIVING UNDER THE INFLUENCE',
       'SUSPICIOUS OCC', 'RECOVERED VEHICLE', 'DRUNKENNESS', 'TRESPASS',
       'FRAUD', 'DISORDERLY CONDUCT', 'SEX OFFENSES, FORCIBLE',
       'FORGERY/COUNTERFEITING', 'KIDNAPPING', 'EMBEZZLEMENT',
       'STOLEN PROPERTY', 'LIQUOR LAWS', 'FAMILY OFFENSES', 'LOITERING',
       'BAD CHECKS', 'TREA', 'GAMBLING', 'RUNAWAY', 'BRIBERY'], dtype=object)

In [228]:
crime.DayOfWeek.unique()

array(['Monday', 'Sunday', 'Tuesday', 'Saturday', 'Friday', 'Wednesday',
       'Thursday'], dtype=object)

In [230]:
crime.Date.unique()

array(['01/19/2015', '02/01/2015', '01/27/2015', '01/31/2015',
       '01/04/2014', '01/30/2015', '02/04/2015', '01/05/2015',
       '01/28/2015', '01/29/2015', '01/01/2015', '01/15/2014',
       '02/02/2015', '01/20/2015', '01/26/2015', '01/16/2015',
       '01/24/2015', '02/27/2015', '01/23/2014', '01/25/2015',
       '02/01/2014', '01/21/2015', '02/02/2014', '02/12/2014',
       '01/17/2015', '01/04/2015', '01/03/2015', '01/23/2015',
       '02/03/2015', '02/27/2014', '03/05/2014', '03/07/2014',
       '01/14/2015', '03/01/2014', '01/22/2015', '01/13/2015',
       '01/02/2015', '10/26/2015', '03/17/2014', '03/16/2014',
       '03/26/2014', '03/27/2014', '01/10/2015', '04/04/2014',
       '01/29/2016', '04/18/2014', '04/19/2014', '03/30/2014',
       '04/26/2014', '05/08/2014', '02/05/2015', '10/01/2015',
       '05/15/2014', '05/16/2014', '02/06/2015', '05/27/2014',
       '01/12/2015', '06/05/2014', '06/06/2014', '06/17/2014',
       '06/22/2014', '01/18/2015', '08/22/2015', '03/10

In [233]:
crime.PdDistrict.unique()

array(['MISSION', 'TENDERLOIN', 'NORTHERN', 'RICHMOND', 'BAYVIEW',
       'CENTRAL', 'PARK', 'TARAVAL', 'SOUTHERN', 'INGLESIDE'], dtype=object)

In [235]:
crime.X.unique()

array([-122.42158168, -122.41440603, -122.43111854, ..., -122.43362342,
       -122.41717582, -122.40204258])

In [236]:
crime.Y.unique()

array([ 37.76170072,  37.78419072,  37.8004687 , ...,  37.72623583,
        37.78193108,  37.79750489])

In [238]:
crime_2 = crime[['Category', 'DayOfWeek', 'Date', 'PdDistrict', 'X', 'Y']]
crime_2.head()

Unnamed: 0,Category,DayOfWeek,Date,PdDistrict,X,Y
0,NON-CRIMINAL,Monday,01/19/2015,MISSION,-122.421582,37.761701
1,ROBBERY,Sunday,02/01/2015,TENDERLOIN,-122.414406,37.784191
2,ASSAULT,Sunday,02/01/2015,TENDERLOIN,-122.414406,37.784191
3,SECONDARY CODES,Sunday,02/01/2015,TENDERLOIN,-122.414406,37.784191
4,VANDALISM,Tuesday,01/27/2015,NORTHERN,-122.431119,37.800469


In [178]:
fire.head()

Unnamed: 0,Incident Number,Exposure Number,Address,Incident Date,Call Number,Alarm DtTm,Arrival DtTm,Close DtTm,City,Zipcode,Battalion,Station Area,Box,Suppression Units,Suppression Personnel,EMS Units,EMS Personnel,Other Units,Other Personnel,First Unit On Scene,Estimated Property Loss,Estimated Contents Loss,Fire Fatalities,Fire Injuries,Civilian Fatalities,Civilian Injuries,Number of Alarms,Primary Situation,Mutual Aid,Action Taken Primary,Action Taken Secondary,Action Taken Other,Detector Alerted Occupants,Property Use,Area of Fire Origin,Ignition Cause,Ignition Factor Primary,Ignition Factor Secondary,Heat Source,Item First Ignited,Human Factors Associated with Ignition,Structure Type,Structure Status,Floor of Fire Origin,Fire Spread,No Flame Spead,Number of floors with minimum damage,Number of floors with significant damage,Number of floors with heavy damage,Number of floors with extreme damage,Detectors Present,Detector Type,Detector Operation,Detector Effectiveness,Detector Failure Reason,Automatic Extinguishing System Present,Automatic Extinguishing Sytem Type,Automatic Extinguishing Sytem Perfomance,Automatic Extinguishing Sytem Failure Reason,Number of Sprinkler Heads Operating,Supervisor District,Neighborhood District,Location
0,9030109,0,310 Colon Av.,04/12/2009,91020273,04/12/2009 06:09:13 PM,04/12/2009 06:13:45 PM,04/12/2009 07:23:13 PM,SF,0.0,B09,15,,1,5,0,0,0,0,T15,0.0,0.0,0,0,0,0,0.0,551 - assist pd or other govern. agency,none,52 - forcible entry,-,-,-,"000 - property use, other",,,,,,,,,,0.0,,,0.0,0.0,0.0,0.0,,,,,,,,,,0.0,0.0,,
1,13067402,0,20 Lansdale Av,07/18/2013,131990117,07/18/2013 10:32:03 AM,07/18/2013 10:37:15 AM,07/18/2013 10:39:55 AM,SF,0.0,B09,39,8571.0,3,11,0,0,0,0,E39,0.0,0.0,0,0,0,0,0.0,745 - alarm system sounded/no fire-accidental,none,86 - investigate,-,-,-,429 - multifamily dwellings,,,,,,,,,,0.0,,,0.0,0.0,0.0,0.0,,,,,,,,,,0.0,0.0,,
2,12044490,0,7th St. / Folsom St.,05/13/2012,121340051,05/13/2012 03:55:37 AM,05/13/2012 04:01:57 AM,05/13/2012 04:05:44 AM,SF,94103.0,B03,1,,3,10,0,0,0,0,B03,0.0,0.0,0,0,0,0,0.0,"711 - municipal alarm system, street box false",none,86 - investigate,-,-,-,963 - street or road in commercial area,,,,,,,,,,0.0,,,0.0,0.0,0.0,0.0,,,,,,,,,,0.0,6.0,South of Market,"(37.7767460000297, -122.407844)"
3,13033326,0,2799 Pacific Av,04/09/2013,130990286,04/09/2013 04:34:07 PM,04/09/2013 04:39:31 PM,04/09/2013 05:20:27 PM,SF,0.0,B04,10,4163.0,3,10,0,0,0,0,B04,0.0,0.0,0,0,0,0,0.0,"746 - co detector activation, no co",none,86 - investigate,-,-,-,419 - 1 or 2 family dwelling,,,,,,,,,,0.0,,,0.0,0.0,0.0,0.0,,,,,,,,,,0.0,0.0,,
4,11101416,0,Polk St. / Pine St.,11/01/2011,113050357,11/01/2011 06:07:45 PM,11/01/2011 06:10:17 PM,11/01/2011 06:11:09 PM,SF,94109.0,B04,3,,2,9,0,0,0,0,E03,0.0,0.0,0,0,0,0,0.0,"711 - municipal alarm system, street box false",none,86 - investigate,-,-,-,"960 - street, other",,,,,,,,,,0.0,,,0.0,0.0,0.0,0.0,,,,,,,,,,0.0,3.0,Nob Hill,"(37.7896190000297, -122.420497)"


In [198]:
fire['Ignition Factor Primary'].unique()

array(['None', '11 - abandoned or discarded materials or p', '-',
       '12 - heat source too close to combustibles',
       '54 - equipment overloaded', '53 - equipment unattended',
       '00 - other factor contributed to ignition',
       '20 - mechanical failure, malfunction, othe', 'uu - undetermined',
       '30 - electrical failure, malfunction, othe',
       '52 - accidentally turned on, not turned of', 'nn - none',
       '55 - failure to clean',
       '33 - short cir. arc, defect/worn insulatio',
       '13 - cuttin/welding too close to combustib',
       '44 - manufacturing deficiency',
       '12 heat source too close to combustibles.',
       '58 - equipment not being operated properly',
       '53 equipment unattended'], dtype=object)

In [201]:
fire['Heat Source'].unique()

array(['None', '61 - cigarette', '40 - hot or smoldering object, other',
       '10 - heat from powered equipment, other',
       '81 - heat; direct flame or convection',
       '12 - radiated/conducted heat operating equ', '-',
       'uu - undetermined', '11 - spark/ember/flame from operating equi',
       '00 - heat source: other',
       '60 - heat; other open flame/smoking materi',
       '63 - heat from undetermined smoking materi', '66 - candle',
       '13 - arcing', '43 - hot ember or ash',
       '84 conducted heat from another fire',
       '12 radiated or conducted heat from operating equipment'], dtype=object)

In [204]:
fire['Structure Type'].unique()

array(['None', '1 -enclosed building', '3 -open structure', '-',
       '2 -fixed portable or mobile structure', '1 enclosed building',
       '0 -structure type, other'], dtype=object)

In [206]:
fire['Fire Spread'].unique()

array(['None', '-', '20 -furniture, utensils, other',
       '25 -appliance housing or casing', '00 -item first ignited, other',
       '12 -exterior wall covering or finish',
       '10 -structural component or finish, other',
       '14 -floor covering or rug/carpet/mat',
       '17 -structural member or framing',
       '81 -electrical wire, cable insulation',
       '32 -bedding; blanket, sheet, comforter',
       '23 -cabinetry (including built-in)', '31 -mattress, pillow'], dtype=object)

In [211]:
fire['Automatic Extinguishing Sytem Perfomance'].unique()

array(['None', '-', '1 -System operated and was effective'], dtype=object)

In [213]:
fire['Number of Sprinkler Heads Operating'].unique()

array([ 0.,  1.])

In [215]:
fire['Neighborhood  District'].unique()

array(['None', 'South of Market', 'Nob Hill', 'Marina',
       'Visitacion Valley', 'Sunset/Parkside', 'Potrero Hill',
       'Presidio Heights', 'Financial District/South Beach', 'Chinatown',
       'West of Twin Peaks', 'Mission', 'Russian Hill', 'Pacific Heights',
       'Excelsior', 'Tenderloin', 'Bayview Hunters Point',
       'Treasure Island', 'Inner Sunset', 'Portola', 'Lone Mountain/USF',
       'Castro/Upper Market', 'Golden Gate Park', 'Inner Richmond',
       'Hayes Valley', 'North Beach', 'Outer Mission', 'Haight Ashbury',
       'Japantown', 'Lakeshore', 'Twin Peaks', 'Seacliff', 'Noe Valley',
       'Western Addition', 'Oceanview/Merced/Ingleside', 'Outer Richmond',
       'Bernal Heights', 'Mission Bay', 'Lincoln Park', 'Glen Park',
       'Presidio', 'McLaren Park'], dtype=object)

In [216]:
fire['Location'].unique()

array(['None', '(37.7767460000297, -122.407844)',
       '(37.7896190000297, -122.420497)', ...,
       '(37.7451120000296, -122.452364)',
       '(37.7822970000297, -122.439951)', '(37.7654460000296, -122.477309)'], dtype=object)

In [231]:
fire_2 = fire[['Alarm DtTm', 'Ignition Factor Primary', 'Heat Source', 'Structure Type', 'Fire Spread', 
               'Automatic Extinguishing Sytem Perfomance', 'Number of Sprinkler Heads Operating', 
               'Neighborhood  District', 'Location']]
fire_2.head()

Unnamed: 0,Alarm DtTm,Ignition Factor Primary,Heat Source,Structure Type,Fire Spread,Automatic Extinguishing Sytem Perfomance,Number of Sprinkler Heads Operating,Neighborhood District,Location
0,04/12/2009 06:09:13 PM,,,,,,0.0,,
1,07/18/2013 10:32:03 AM,,,,,,0.0,,
2,05/13/2012 03:55:37 AM,,,,,,0.0,South of Market,"(37.7767460000297, -122.407844)"
3,04/09/2013 04:34:07 PM,,,,,,0.0,,
4,11/01/2011 06:07:45 PM,,,,,,0.0,Nob Hill,"(37.7896190000297, -122.420497)"
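One way to combine the two tables for a regression on the number of fire incidents is to count incidents per day in each and join the counts. The sketch below uses toy rows shaped like the `Date` column of `crime_2` and the `Alarm DtTm` column of `fire_2`; the frames `crime_toy` and `fire_toy`, their values, and the choice of a daily grain are assumptions, not the notebook's method.

```python
import pandas as pd

# Hypothetical rows in the same string formats as the real columns.
crime_toy = pd.DataFrame({
    'Date': ['01/19/2015', '01/19/2015', '02/01/2015'],
    'Category': ['ROBBERY', 'ASSAULT', 'VANDALISM'],
})
fire_toy = pd.DataFrame({
    'Alarm DtTm': ['01/19/2015 06:09:13 PM',
                   '02/01/2015 10:32:03 AM',
                   '02/01/2015 11:00:00 PM'],
})

# Parse both date columns and reduce the fire timestamps to calendar days.
crime_day = pd.to_datetime(crime_toy['Date'], format='%m/%d/%Y')
fire_day = pd.to_datetime(
    fire_toy['Alarm DtTm'], format='%m/%d/%Y %I:%M:%S %p'
).dt.normalize()

# Count incidents per day in each table.
crime_counts = crime_toy.groupby(crime_day).size().rename('n_crimes')
fire_counts = fire_toy.groupby(fire_day).size().rename('n_fires')

# Align the two count series on the day; days missing from one table get 0.
daily = pd.concat([crime_counts, fire_counts], axis=1).fillna(0).astype(int)
print(daily)
```

In this layout `n_fires` would be the regression target, and crime counts (possibly split by `Category` or `PdDistrict`) would be candidate predictors.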


## Evaluating models

In [247]:
fire_2.groupby(['Ignition Factor Primary']).count()

Unnamed: 0_level_0,Alarm DtTm,Heat Source,Structure Type,Fire Spread,Automatic Extinguishing Sytem Perfomance,Number of Sprinkler Heads Operating,Neighborhood District,Location
Ignition Factor Primary,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
-,30,30,30,30,30,30,30,30
00 - other factor contributed to ignition,5,5,5,5,5,5,5,5
11 - abandoned or discarded materials or p,18,18,18,18,18,18,18,18
12 - heat source too close to combustibles,25,25,25,25,25,25,25,25
12 heat source too close to combustibles.,1,1,1,1,1,1,1,1
13 - cuttin/welding too close to combustib,1,1,1,1,1,1,1,1
"20 - mechanical failure, malfunction, othe",7,7,7,7,7,7,7,7
"30 - electrical failure, malfunction, othe",5,5,5,5,5,5,5,5
"33 - short cir. arc, defect/worn insulatio",2,2,2,2,2,2,2,2
44 - manufacturing deficiency,1,1,1,1,1,1,1,1


In [244]:
lr = linear_model.LinearRegression()
lasso = linear_model.Lasso()
ridge = linear_model.Ridge()
elast = linear_model.ElasticNet()

In [None]:
model_lr = 

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 4. Conduct another analysis using the San Francisco Parks data to predict Park scores

---

1. Combining multiple sources of park data (csv files) is required.
- Perform EDA and cleaning on relevant data.
- Construct and compare different regression models with cross-validation.
- Plot descriptive visuals you think are useful for understanding the data.
- Report on your findings.


In [255]:
file_park = '/Users/manuel/Desktop/dsi-sf-7-materials_manuel/datasets/sf_crime/Recreation___Park_Department_Park_Info_Dataset.csv'
file_park2 = '/Users/manuel/Desktop/dsi-sf-7-materials_manuel/datasets/sf_crime/Park_Evaluation_Scores_starting_Fiscal_Year_2015.csv'
file_park3 = '/Users/manuel/Desktop/dsi-sf-7-materials_manuel/datasets/sf_crime/Park_Scores_2005-2014.csv'


In [381]:
park = pd.read_csv(file_park)
park_2015 = pd.read_csv(file_park2)
park_2004 = pd.read_csv(file_park3)

## EDA

In [257]:
park.head()

Unnamed: 0,ParkName,ParkType,ParkServiceArea,PSAManager,email,Number,Zipcode,Acreage,SupDist,ParkID,Location 1,Lat
0,ParkName,ParkType,ParkServiceArea,PSAManager,email,Number,,,,,,
1,10TH AVE/CLEMENT MINI PARK,Mini Park,PSA 1,"Elder, Steve",steven.elder@sfgov.org,(415) 601-6501,94118.0,0.66,1.0,156.0,"351 9th Ave\nSan Francisco, CA\n(37.78184397, ...",
2,15TH AVENUE STEPS,Mini Park,PSA 4,"Sheehy, Chuck",charles.sheehy@sfgov.org,(415) 218-2226,94122.0,0.26,7.0,185.0,"15th Ave b w Kirkham\nSan Francisco, CA\n(37.7...",
3,24TH/YORK MINI PARK,Mini Park,PSA 6,"Field, Adrian",adrian.field@sfgov.org,(415) 717-2872,94110.0,0.12,9.0,51.0,"24th\nSan Francisco, CA\n(37.75306042, -122.40...",
4,29TH/DIAMOND OPEN SPACE,Neighborhood Park or Playground,PSA 5,"O'Brien, Teresa",teresa.o'brien@sfgov.org,(415) 819-2699,94131.0,0.82,8.0,194.0,"Diamond\nSan Francisco, CA\n(37.74360211, -122...",


In [345]:
# park_2004.head()

In [346]:
# park_2015.head()

In [362]:
park_0415 = park_2004.merge(park_2015, on='Park')

In [363]:
park_0415 = park_0415.merge(park, on='ParkID')

In [364]:
park_0415.groupby(by=['Park Type', 'PSA_x']).mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,Score,Park Site Score,Supervisor District,Zipcode,Acreage,SupDist,Lat
Park Type,PSA_x,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Civic Plaza or Square,PSA1,0.930671,82.929114,2.797468,94117.278481,34.77557,2.797468,
Civic Plaza or Square,PSA2,0.955226,87.835484,4.032258,94111.612903,1.624516,4.032258,
Mini Park,PSA1,0.926076,87.694595,2.383784,94111.772973,0.204973,2.383784,
Mini Park,PSA2,0.8951,90.417857,5.775,94110.903571,0.142,5.775,
Mini Park,PSA3,0.854678,79.473333,10.355556,94122.844444,0.691333,10.355556,
Mini Park,PSA4,0.830063,83.830159,11.0,94132.0,0.323651,11.0,
Mini Park,PSA5,0.876448,85.819481,8.603896,94120.220779,0.197532,8.603896,
Mini Park,PSA6,0.939836,87.714554,9.0,94110.0,0.209155,9.0,
Neighborhood Park or Playground,PSA1,0.93844,90.00571,1.988858,94119.087744,3.464875,1.988858,
Neighborhood Park or Playground,PSA2,0.892639,86.524165,6.3222,94106.982318,3.462908,6.3222,


In [365]:
# park_0415.head(2)

In [366]:
# park_0415.drop(['PSA_x', 'FQ', 'Supervisor District', 'ParkType', 'PSAManager', 'email', 'Number', 'Zipcode'], axis=1, inplace=True)
park_0415.drop(['PSA_y', 'Park Type', 'FQ', 'PSAManager', 'email', 'Number', 'Zipcode', 'Score', 'Location 1', 'Lat'], axis=1, inplace=True)


In [367]:
park_0415.head()

Unnamed: 0,ParkID,PSA_x,Park,Park Site Score,Supervisor District,ParkName,ParkType,ParkServiceArea,Acreage,SupDist
0,86,PSA4,Carl Larsen Park,88.7,4,CARL LARSEN PARK,Neighborhood Park or Playground,PSA 4,6.58,4.0
1,86,PSA4,Carl Larsen Park,88.7,4,CARL LARSEN PARK,Neighborhood Park or Playground,PSA 4,6.58,4.0
2,86,PSA4,Carl Larsen Park,88.7,4,CARL LARSEN PARK,Neighborhood Park or Playground,PSA 4,6.58,4.0
3,86,PSA4,Carl Larsen Park,88.7,4,CARL LARSEN PARK,Neighborhood Park or Playground,PSA 4,6.58,4.0
4,86,PSA4,Carl Larsen Park,88.7,4,CARL LARSEN PARK,Neighborhood Park or Playground,PSA 4,6.58,4.0


In [368]:
# park_0415.drop(['Score', 'Lat'], axis=1, inplace=True)

In [369]:
# park_0415.shape

In [370]:
park_0415.drop_duplicates('ParkID', inplace=True)

In [371]:
park_0415.head()

Unnamed: 0,ParkID,PSA_x,Park,Park Site Score,Supervisor District,ParkName,ParkType,ParkServiceArea,Acreage,SupDist
0,86,PSA4,Carl Larsen Park,88.7,4,CARL LARSEN PARK,Neighborhood Park or Playground,PSA 4,6.58,4.0
31,13,PSA4,Junipero Serra Playground,91.1,7,JUNIPERO SERRA PLAYGROUND,Neighborhood Park or Playground,PSA 4,1.53,7.0
57,9,PSA4,Rolph Nicol Playground,73.6,7,ROLPH NICOL PLAYGROUND,Neighborhood Park or Playground,PSA 4,3.04,7.0
90,117,PSA2,Alamo Square,85.0,5,ALAMO SQUARE,Neighborhood Park or Playground,PSA 2,12.7,5.0
121,60,PSA6,Jose Coronado Playground,81.8,9,JOSE CORONADO PLAYGROUND,Neighborhood Park or Playground,PSA 6,0.78,9.0


In [380]:
park_0415.isnull().any()

ParkID                 False
PSA_x                  False
Park                   False
Park Site Score        False
Supervisor District    False
ParkName               False
ParkType               False
ParkServiceArea        False
Acreage                False
SupDist                False
dtype: bool

## Selecting variables

In [423]:
X = park_0415[['ParkType', 'PSA_x']]

In [424]:
# X_s = park_0415.Acreage

In [425]:
X = pd.get_dummies(X, columns=['ParkType', 'PSA_x'])

In [426]:
# ss = StandardScaler()

In [427]:
# X_ss = ss.fit_transform(X_s)

In [428]:
# pd.DataFrame(X_ss)
# X.join(pd.DataFrame(X_ss))

In [452]:
y = park_0415['Park Site Score']


## Regression

In [430]:
from sklearn.model_selection import train_test_split

In [431]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [445]:
lr = linear_model.LinearRegression()
# logreg = linear_model.LogisticRegression()

In [433]:
model_lr = lr.fit(X_train, y_train)

In [434]:
model_lr.score(X_test, y_test)

0.14925624627395973

In [444]:
cvs_c = cross_val_score(lr, X, y, cv=5)
cvs_c

array([-0.09154289,  0.19207844, -0.49254004,  0.32559647,  0.12853132])

<img src="http://imgur.com/GCAf1UX.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 5. Bias-variance tradeoff

---

1. Choose a model from any of the previous questions and construct a regularized regression model. Ideally the model should actually improve across regularization strengths...
- Gridsearch the regularization parameters to find the optimal values.
- Plot the regularization parameter against the cross-validated $R^2$.
- Explain how regularization and regularization strength are related to the bias-variance tradeoff.
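
The grid search step above could look roughly like the sketch below. It assumes a Ridge model and synthetic data from `make_regression`; the names `X_demo`, `y_demo`, `alphas`, and `search`, and the alpha range, are illustrative choices rather than part of the project. In general, a larger `alpha` shrinks the coefficients more, which tends to raise bias and lower variance.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data with many uninformative features (an assumption for illustration).
X_demo, y_demo = make_regression(n_samples=100, n_features=30, n_informative=5,
                                 noise=25.0, random_state=1)

# Candidate regularization strengths on a log scale.
alphas = np.logspace(-3, 3, 13)

# 5-fold cross-validated grid search, scored by R^2.
search = GridSearchCV(Ridge(), param_grid={'alpha': alphas}, cv=5, scoring='r2')
search.fit(X_demo, y_demo)

best_alpha = search.best_params_['alpha']
mean_r2 = search.cv_results_['mean_test_score']  # one mean R^2 per alpha, in grid order
print('best alpha: %.4g, best CV R^2: %.3f' % (best_alpha, search.best_score_))
```

Plotting `alphas` against `mean_r2` on a logarithmic x-axis gives the requested chart of regularization strength versus cross-validated $R^2$.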


In [446]:
def predict_from_samples(model, X, y, number_of_splits=100):

    yhat_tracker = pd.DataFrame({'ytrue':y})

    rowinds = range(X.shape[0])

    for i in range(number_of_splits):

        train_inds, test_inds = train_test_split(rowinds, test_size=0.33)

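        # Index y by position (.iloc), like X: if the index labels are not 0..n-1
        # (e.g. after drop_duplicates), y[train_inds] would look up labels, not positions.
        Xtrain, Ytrain = X.iloc[train_inds, :], y.iloc[train_inds]
        Xtest, Ytest = X.iloc[test_inds, :],    y.iloc[test_inds]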

        model.fit(Xtrain, Ytrain)
        yhats = model.predict(Xtest)

        yhat_tracker['sample'+str(i+1)] = np.nan
        yhat_tracker.iloc[test_inds, -1] = yhats

    return yhat_tracker

In [447]:
def calculate_bias_sq(yhats_df):
    # Take out the true values of y that are in the first column:
    ytrue = yhats_df.iloc[:,0].values

    # Calculate the mean of the predictions, averaged across the columns.
    # So, all of the predictions for the true y at row 0 would be averaged together
    # and so on for all the rows.
    yhat_means = yhats_df.iloc[:,1:].mean(axis=1).values

    # Subtract the true value of y from the mean of the predicted values, and square it.
    elementwise_bias_sq = (yhat_means - ytrue)**2

    # Take the mean of those squared bias values (across all y)
    mean_bias_sq = np.mean(elementwise_bias_sq)
    return mean_bias_sq

In [449]:
def calculate_variance(yhats_df):
    # Calculate the mean of the predicted y's across the columns (mean of yhat for each row)
    yhats_means = yhats_df.iloc[:,1:].mean(axis=1)

    # subtract the mean of the yhats from the original yhat values (for each row)
    # and square the result.
    yhats_devsq = yhats_df.iloc[:,1:].subtract(yhats_means, axis=0)**2

    # Take the mean of the squared deviations from the mean, then
    # take the mean of those to get the overall variance across the y observations
    yhats_devsq_means = yhats_devsq.mean(axis=1).values
    return np.mean(yhats_devsq_means)

In [462]:
yhats_full = predict_from_samples(lr, X, y)
# yhats_small = predict_from_samples(lr, X_small, gradrate)
# yhats_over = predict_from_samples(lr, X_overfit, gradrate)


ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

In [None]:
print calculate_bias_sq(yhats_full), calculate_variance(yhats_full)
# yhats_small and yhats_over are not defined above (their cells are commented out).
# print calculate_bias_sq(yhats_small), calculate_variance(yhats_small)
# print calculate_bias_sq(yhats_over), calculate_variance(yhats_over)

<img src="http://imgur.com/xDpSobf.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 5.1 Calculate the approximated $\text{bias}^2$ and variance across regularization strengths.

---

You can obviously use my code from the bias-variance lab to do this. 

Plot the bias and variance change _with_ the cross-validated $R^2$. 

You'll need to scale these values somehow to put them on the same chart (I recommend [MinMaxScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) to put $\text{bias}^2$ and variance on the same scale as cross-validated $R^2$).
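
A minimal sketch of that rescaling follows. The arrays `alphas`, `bias_sq`, `variance`, and `cv_r2` hold made-up numbers purely for illustration; in practice they would come from the bias-variance functions and the cross-validation results. The `feature_range` argument of `MinMaxScaler` is set to the range spanned by the $R^2$ values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical values, one per regularization strength (NOT real results).
alphas = np.array([0.01, 0.1, 1.0, 10.0, 100.0])
bias_sq = np.array([2.1, 2.3, 3.0, 5.5, 9.8])
variance = np.array([4.0, 3.1, 1.9, 0.8, 0.2])
cv_r2 = np.array([0.41, 0.47, 0.52, 0.44, 0.20])

# Rescale each column (bias^2, variance) into the range covered by the CV R^2.
scaler = MinMaxScaler(feature_range=(cv_r2.min(), cv_r2.max()))
scaled = scaler.fit_transform(np.column_stack([bias_sq, variance]))
bias_sq_scaled, variance_scaled = scaled[:, 0], scaled[:, 1]
```

After this step, `bias_sq_scaled`, `variance_scaled`, and `cv_r2` can be drawn against `alphas` on one axis. Only the shapes of the curves are comparable after rescaling, not their absolute magnitudes.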

<img src="http://imgur.com/HNPKfE8.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 6. Custom regularized regression penalties

---

The $\ell_1$ and $\ell_2$ norm regularization penalties (Lasso and Ridge, respectively) are the most commonly used regularization penalties. They have a solid foundation in statistics and evidence of effectiveness. However, these are not the only possible penalties for regression – sometimes new, customized penalties give additional performance and predictive power to models depending on the context.


**Devise and implement a penalized regression for San Francisco Crime data.** What is your rationale – why would this be useful? How does it perform compared to the standard Ridge, Lasso, and Elastic Net penalties?
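
As one possible starting point, the sketch below fits a linear model with a bridge-style penalty $\lambda \sum_j |\beta_j|^q$, which sits between Lasso ($q = 1$) and Ridge ($q = 2$). This is only an example of the mechanics, not a recommended penalty for the crime data. The function names, the choice $q = 1.5$, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def penalized_loss(beta, X, y, lam, q):
    """Mean squared error plus lam * sum(|coef|^q); the intercept is not penalized."""
    intercept, coefs = beta[0], beta[1:]
    residuals = y - (intercept + X.dot(coefs))
    return np.mean(residuals ** 2) + lam * np.sum(np.abs(coefs) ** q)

def fit_custom_penalty(X, y, lam=1.0, q=1.5):
    """Minimize the penalized loss numerically, starting from all zeros."""
    start = np.zeros(X.shape[1] + 1)
    result = minimize(penalized_loss, start, args=(X, y, lam, q))
    return result.x  # [intercept, coef_1, ..., coef_p]

# Synthetic data (hypothetical): two informative and two irrelevant features.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(150, 4))
y_demo = 1.0 + X_demo.dot(np.array([3.0, 0.0, -2.0, 0.0])) + rng.normal(size=150)

beta_unpenalized = fit_custom_penalty(X_demo, y_demo, lam=0.0)
beta_penalized = fit_custom_penalty(X_demo, y_demo, lam=0.5)
print(beta_unpenalized)
print(beta_penalized)
```

A custom estimator like this can be compared with Ridge, Lasso, and Elastic Net by computing the same cross-validated $R^2$ for each on identical folds.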

## Statistics, Biases, and Hypothesis Testing

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 7. Biases 

---
A new food-ordering smartphone app incentivizes its users to invite their friends by offering them free orders for each friend that signs up.
- What biases are being caused here?
 
- How would you measure the success of such a program?
  
- Rephrase this question to be unbiased:
    **Many people have said that there is a need for stricter laws on dangerous weapons. Do you agree?**
   


<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 8. Hypothesis Testing 

---

For the health/mortality data from the following website: 
http://assets.datacamp.com/blog_assets/chol.txt


- Generate summary statistics, histogram plots, CDF plots, and confidence intervals for two columns of your choice, and a correlation matrix across all columns.
- Using what you generated, provide short summaries of each column describing the data.
- Is there a difference in mortality between smokers and nonsmokers?
- If the national average weight is 85 pounds, is our average weight significantly different?
- Until now, we’ve only asked whether it is different. This is called a two-sided test.
    - What if we want to know whether it is less than or greater than? This is called a one-sided test. We can calculate this from the result of a two-sided test: divide the p-value in half and check whether the t statistic is positive or negative. Greater than: p/2 < significance level and t > 0. Less than: p/2 < significance level and t < 0.

    - If the national average weight is 85 pounds, is our average weight statistically significantly less?
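
The one-sided rule described above can be sketched as follows, using `scipy.stats.ttest_1samp`. The `weights` array is synthetic (drawn from a normal distribution) and only stands in for the weight column of the dataset; the names `national_mean` and `alpha`, and the significance level 0.05, are assumptions for the example.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the weight column (NOT the real data).
rng = np.random.RandomState(42)
weights = rng.normal(loc=80.0, scale=10.0, size=200)

national_mean = 85.0
alpha = 0.05  # significance level, chosen for illustration

# Two-sided one-sample t-test against the national mean.
t_stat, p_two_sided = stats.ttest_1samp(weights, national_mean)

# Two-sided conclusion: is the sample mean different at all?
significantly_different = p_two_sided < alpha

# One-sided conclusions: halve the p-value and check the sign of t.
significantly_less = (t_stat < 0) and (p_two_sided / 2 < alpha)
significantly_greater = (t_stat > 0) and (p_two_sided / 2 < alpha)

print('t = %.3f, two-sided p = %.3g' % (t_stat, p_two_sided))
```

The sign check matters: halving the p-value alone would report a significant result even when the sample mean lies on the opposite side of the hypothesized direction.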