# Feature selection and feature engineering

Based on the paper in the `/data/` folder.

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression

pd.set_option("display.max_columns", 200)
pd.set_option("display.max_colwidth", 200)
%matplotlib inline

# Reading in our data

## Every single person in a crash

We'll start by reading in the list of people who were involved in an accident.

**Call this dataframe `people`.**

In [2]:
people = pd.read_csv('combined-person-data.csv')
people



Unnamed: 0,AIRBAG_DEPLOYED,ALCOHOL_TESTTYPE_CODE,ALCOHOL_TEST_CODE,BAC_CODE,CDL_FLAG,CLASS,CONDITION_CODE,DATE_OF_BIRTH,DRUG_TESTRESULT_CODE,DRUG_TEST_CODE,EJECT_CODE,EMS_UNIT_LABEL,EQUIP_PROB_CODE,FAULT_FLAG,INJ_SEVER_CODE,LICENSE_STATE_CODE,MOVEMENT_CODE,OCC_SEAT_POS_CODE,PED_LOCATION_CODE,PED_OBEY_CODE,PED_TYPE_CODE,PED_VISIBLE_CODE,PERSON_ID,PERSON_TYPE,REPORT_NO,SAF_EQUIP_CODE,SEX_CODE,VEHICLE_ID
0,1.0,,0.0,,N,C,0.0,1952-04-20 00:00:00,,0.0,1.0,,1.0,N,1,PA,,,,,,,48dd00ee-e033-47e7-ad1e-0b734020301b,D,AB4284000S,13.0,F,eb6aadb8-dacb-4744-a1a7-ab812c96f27f
1,1.0,,0.0,,N,,0.0,1985-05-28 00:00:00,,0.0,0.0,,0.0,N,1,MD,,,,,,,166296bd-ffd3-4c16-aa74-4f4bf4139d8d,D,AB4313000X,13.0,F,b463eb20-2f01-4200-9d6f-b18888ce2593
2,1.0,,0.0,,N,C,0.0,1960-10-04 00:00:00,,0.0,0.0,,0.0,Y,1,MD,,,,,,,f3b2743f-fbc3-4345-9419-56a0ca29102c,D,AB4313000X,13.0,F,3c8629d0-d524-47c1-bfbc-b18e07f3087e
3,1.0,,0.0,,N,B,0.0,1971-05-28 00:00:00,,0.0,1.0,,0.0,N,1,MD,,,,,,,5bbe589b-a2db-4dbb-be9b-17a800d69a08,D,AB4313000Y,0.0,F,c4628cdb-f295-4a24-8a4b-653741ac6ae7
4,1.0,,0.0,,N,D,0.0,1955-04-23 00:00:00,,0.0,0.0,,1.0,N,1,SC,,,,,,,b914136f-5ecd-46bb-94ec-ff5d4136c3eb,D,AB4669001F,13.0,F,cdda1580-fd79-4358-8819-c2250f494591
5,1.0,,0.0,,N,C,0.0,1957-02-23 00:00:00,,0.0,0.0,,1.0,Y,1,MD,,,,,,,9d749007-0354-4d82-809e-8a3de66d6167,D,AB49430024,13.0,F,d4213216-3c25-4097-b510-8f0bfc39a1bc
6,1.0,,0.0,,N,C,0.0,1990-11-05 00:00:00,,0.0,1.0,,1.0,N,1,MD,,,,,,,21fff0f9-ed13-4b87-ac04-29d018aff9d4,D,AB5218001Y,13.0,F,4dea42c3-e02c-4c6b-8ea0-c2f8c17f147f
7,1.0,,0.0,,N,,0.0,1954-01-25 00:00:00,,0.0,1.0,,0.0,N,1,MD,,,,,,,34b771f8-40c6-410c-a8d7-9640261204fa,D,AB6133000N,0.0,F,f2888089-5fc9-44dd-ad50-8ab446cdf247
8,1.0,,0.0,,N,C,0.0,1969-06-18 00:00:00,,0.0,1.0,,0.0,N,1,MD,,,,,,,e4b74c8d-ec2c-438d-b768-aae7a0435eb4,D,AB64220007,13.0,F,5464f341-7ce6-4692-b0f1-3fdb3339753f
9,1.0,,0.0,,N,,0.0,1981-02-18 00:00:00,,0.0,1.0,,0.0,Y,1,MD,,,,,,,ba99fc8c-8a87-47f9-9e4b-a99c664d8494,D,AC12390045,99.0,F,5445db7e-a933-43c7-bef6-18f91396c111


How often did each severity of injury show up? (e.g. not injured, non-incapacitating injury, etc)

In [3]:
people.INJ_SEVER_CODE.value_counts()

1    716806
2     82642
3     76801
4     10353
5      1681
Name: INJ_SEVER_CODE, dtype: int64

We're only interested in fatalities, so let's create a new `is_fatality` column for when people received a fatal injury.

**Confirm there were 1681 people with fatal injuries.**

In [4]:
people['is_fatality'] = people.INJ_SEVER_CODE.replace({5: 1, 4: 0, 3:0, 2:0, 1:0})
people.head() 

Unnamed: 0,AIRBAG_DEPLOYED,ALCOHOL_TESTTYPE_CODE,ALCOHOL_TEST_CODE,BAC_CODE,CDL_FLAG,CLASS,CONDITION_CODE,DATE_OF_BIRTH,DRUG_TESTRESULT_CODE,DRUG_TEST_CODE,EJECT_CODE,EMS_UNIT_LABEL,EQUIP_PROB_CODE,FAULT_FLAG,INJ_SEVER_CODE,LICENSE_STATE_CODE,MOVEMENT_CODE,OCC_SEAT_POS_CODE,PED_LOCATION_CODE,PED_OBEY_CODE,PED_TYPE_CODE,PED_VISIBLE_CODE,PERSON_ID,PERSON_TYPE,REPORT_NO,SAF_EQUIP_CODE,SEX_CODE,VEHICLE_ID,is_fatality
0,1.0,,0.0,,N,C,0.0,1952-04-20 00:00:00,,0.0,1.0,,1.0,N,1,PA,,,,,,,48dd00ee-e033-47e7-ad1e-0b734020301b,D,AB4284000S,13.0,F,eb6aadb8-dacb-4744-a1a7-ab812c96f27f,0
1,1.0,,0.0,,N,,0.0,1985-05-28 00:00:00,,0.0,0.0,,0.0,N,1,MD,,,,,,,166296bd-ffd3-4c16-aa74-4f4bf4139d8d,D,AB4313000X,13.0,F,b463eb20-2f01-4200-9d6f-b18888ce2593,0
2,1.0,,0.0,,N,C,0.0,1960-10-04 00:00:00,,0.0,0.0,,0.0,Y,1,MD,,,,,,,f3b2743f-fbc3-4345-9419-56a0ca29102c,D,AB4313000X,13.0,F,3c8629d0-d524-47c1-bfbc-b18e07f3087e,0
3,1.0,,0.0,,N,B,0.0,1971-05-28 00:00:00,,0.0,1.0,,0.0,N,1,MD,,,,,,,5bbe589b-a2db-4dbb-be9b-17a800d69a08,D,AB4313000Y,0.0,F,c4628cdb-f295-4a24-8a4b-653741ac6ae7,0
4,1.0,,0.0,,N,D,0.0,1955-04-23 00:00:00,,0.0,0.0,,1.0,N,1,SC,,,,,,,b914136f-5ecd-46bb-94ec-ff5d4136c3eb,D,AB4669001F,13.0,F,cdda1580-fd79-4358-8819-c2250f494591,0


In [5]:
people['is_fatality'].value_counts()

0    886602
1      1681
Name: is_fatality, dtype: int64
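
The `.replace` call above lists every severity code by hand. As a minimal sketch of an alternative, a comparison against the fatal code (`5`, as in the cell above) gives the same 0/1 column without the lookup dictionary. The `severity` series here is a toy stand-in, not the real data.

```python
import pandas as pd

# Toy stand-in for the INJ_SEVER_CODE column (5 = fatal injury)
severity = pd.Series([1, 5, 3, 2, 5, 4])

# A comparison yields True/False; .astype(int) turns that into 1/0
is_fatality = (severity == 5).astype(int)

print(is_fatality.tolist())
print(is_fatality.sum())
```

One difference to keep in mind: with a comparison, any unexpected code silently becomes `0`, while `.replace` would leave it untouched so you would notice it in `value_counts()`.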

## Working on Features

### Starting our analysis

We're going to run a regression on the impact of being **male vs female on crash fatalities**. Prepare a dataframe called `train_df` with the appropriate information in it.

* **Tip:** What column(s) are your input, and what is your output? Aka independent and dependent variables
* **Tip:** You'll need to convert your input column into something numeric, I suggest using `.replace`
* **Tip:** We aren't interested in the "Unknown" sex - either filtering or `np.nan` + `.dropna()` might be useful ways to get rid of those rows

In [6]:
people['genders'] = people.SEX_CODE.replace({'F': 1, 'M': 0, 'U': np.nan})
people['genders'].value_counts()

0.0    460973
1.0    354854
Name: genders, dtype: int64

In [7]:
train_df = people[['genders','is_fatality']]
train_df = train_df.dropna(subset=['genders'])
train_df.head()

Unnamed: 0,genders,is_fatality
0,1.0,0
1,1.0,0
2,1.0,0
3,1.0,0
4,1.0,0


Confirm that your `train_df` has two columns and 815,827 rows.

> **Tip:** If you have more rows, make sure you dropped all of the rows with Unknown sex.
>
> **Tip:** If you have more columns, make sure you only have your input and output columns.

In [8]:
train_df.shape

(815827, 2)

### Run your regression

See the effect of sex on whether the person's injuries are fatal or not. I want to see a result dataframe that includes:

* Feature name
* Coefficient
* Odds ratio

In [9]:
X = train_df.drop(columns=['is_fatality'])
y = train_df.is_fatality

In [10]:
clf = LogisticRegression(C=1e9, solver='lbfgs')
clf.fit(X, y)

LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [11]:
feature_names = X.columns
coefficients = clf.coef_[0]

pd.DataFrame({
    'feature': feature_names,
    'coefficient (log odds ratio)': coefficients,
    'odds ratio': np.exp(coefficients).round(4)
}).sort_values(by='odds ratio', ascending=False)

Unnamed: 0,feature,coefficient (log odds ratio),odds ratio
0,genders,-0.71481,0.4893


### Use words to interpret this result

In [12]:
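# Women involved in a crash have about half the odds of dying that men do (odds ratio 0.49), aka men have roughly twice the odds of a fatal injury. This is about dying in a crash, not about getting into one.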

## Adding more features

The actual crash data has more details - whether it was snowy/icy, whether it was a highway, etc. 

Read in `combined-crash-data.csv` and merge it with our people dataset. I'll save you a lookup: the `REPORT_NO` is what matches between the two.

In [13]:
combined = pd.read_csv('combined-crash-data.csv')
combined.head()



Unnamed: 0,ACC_DATE,ACC_TIME,AGENCY_CODE,AREA_CODE,COLLISION_TYPE_CODE,COUNTY_NO,C_M_ZONE_FLAG,DISTANCE,DISTANCE_DIR_FLAG,FEET_MILES_FLAG,FIX_OBJ_CODE,HARM_EVENT_CODE1,HARM_EVENT_CODE2,JUNCTION_CODE,LANE_CODE,LATITUDE,LIGHT_CODE,LOC_CODE,LOGMILE_DIR_FLAG,LOG_MILE,LONGITUDE,MAINROAD_NAME,MUNI_CODE,RD_COND_CODE,RD_DIV_CODE,REFERENCE_NO,REFERENCE_ROAD_NAME,REFERENCE_SUFFIX,REFERENCE_TYPE_CODE,REPORT_NO,REPORT_TYPE,ROUTE_TYPE_CODE,RTE_NO,RTE_SUFFIX,SIGNAL_FLAG,SURF_COND_CODE,WEATHER_CODE
0,2018-04-10 00:00:00,01:50:00,MSP,UNK,17,18.0,N,0.0,N,F,22.03,16.0,11.0,1.0,,38.27723,4.0,,N,2.09,-76.682876,NEWTOWNE NECK RD,0.0,1.0,1.0,166.0,ROSEBANK RD,,CO,MSP6188002Q,Injury Crash,MD,243.0,,N,2.0,6.01
1,2018-05-02 00:00:00,11:06:00,ELKTON,UNK,5,7.0,N,15.0,S,F,0.0,1.0,1.0,3.0,,39.613747,1.0,,N,19.41,-75.837454,N BRIDGE ST,52.0,1.0,3.0,362.0,LAUREL DR,,MU,BK0227001M,Injury Crash,MD,213.0,,N,2.0,6.01
2,2018-05-02 00:00:00,11:54:00,WASHINGTON,UNK,88,21.0,N,1.0,S,M,10.0,16.0,9.0,0.0,1.0,39.680144,1.0,,N,2.6,-77.514781,RITCHIE RD,0.0,1.0,1.0,270.0,WISE RD,,CO,ZU8005001W,Injury Crash,CO,269.0,,N,2.0,6.01
3,2018-05-02 00:00:00,12:45:00,MONTGOMERY,UNK,7,15.0,N,0.0,E,F,0.0,1.0,1.0,2.0,2.0,39.015649,1.0,55122.0,E,0.0,-77.042728,FOREST GLEN RD,0.0,0.0,4.0,97.0,GEORGIA AVE,,MD,MCP2891005S,Property Damage Crash,CO,697.0,,Y,2.0,6.01
4,2018-05-02 00:00:00,12:02:00,MSP,UNK,11,8.0,N,0.15,N,M,0.0,1.0,1.0,9.04,2.0,38.620875,1.0,,N,1.02,-76.901087,LEONARDTOWN RD,0.0,1.0,3.0,644.0,POST OFFICE RD,,CO,MSP6743004Q,Property Damage Crash,MD,5.0,BU,N,2.0,6.01


In [14]:
merged = pd.merge(combined, people, on='REPORT_NO')
merged.head()

Unnamed: 0,ACC_DATE,ACC_TIME,AGENCY_CODE,AREA_CODE,COLLISION_TYPE_CODE,COUNTY_NO,C_M_ZONE_FLAG,DISTANCE,DISTANCE_DIR_FLAG,FEET_MILES_FLAG,FIX_OBJ_CODE,HARM_EVENT_CODE1,HARM_EVENT_CODE2,JUNCTION_CODE,LANE_CODE,LATITUDE,LIGHT_CODE,LOC_CODE,LOGMILE_DIR_FLAG,LOG_MILE,LONGITUDE,MAINROAD_NAME,MUNI_CODE,RD_COND_CODE,RD_DIV_CODE,REFERENCE_NO,REFERENCE_ROAD_NAME,REFERENCE_SUFFIX,REFERENCE_TYPE_CODE,REPORT_NO,REPORT_TYPE,ROUTE_TYPE_CODE,RTE_NO,RTE_SUFFIX,SIGNAL_FLAG,SURF_COND_CODE,WEATHER_CODE,AIRBAG_DEPLOYED,ALCOHOL_TESTTYPE_CODE,ALCOHOL_TEST_CODE,BAC_CODE,CDL_FLAG,CLASS,CONDITION_CODE,DATE_OF_BIRTH,DRUG_TESTRESULT_CODE,DRUG_TEST_CODE,EJECT_CODE,EMS_UNIT_LABEL,EQUIP_PROB_CODE,FAULT_FLAG,INJ_SEVER_CODE,LICENSE_STATE_CODE,MOVEMENT_CODE,OCC_SEAT_POS_CODE,PED_LOCATION_CODE,PED_OBEY_CODE,PED_TYPE_CODE,PED_VISIBLE_CODE,PERSON_ID,PERSON_TYPE,SAF_EQUIP_CODE,SEX_CODE,VEHICLE_ID,is_fatality,genders
0,2018-04-10 00:00:00,01:50:00,MSP,UNK,17,18.0,N,0.0,N,F,22.03,16.0,11.0,1.0,,38.27723,4.0,,N,2.09,-76.682876,NEWTOWNE NECK RD,0.0,1.0,1.0,166.0,ROSEBANK RD,,CO,MSP6188002Q,Injury Crash,MD,243.0,,N,2.0,6.01,4.0,2.0,3.0,,N,C,2.0,1996-12-13 00:00:00,,0.0,1.0,A,1.0,Y,3,MD,,,,,,,0212a82f-38a7-4a8a-969d-bce0920d3ab3,D,13.0,F,61d02432-7d40-401a-86ce-fea710234832,0,1.0
1,2018-05-02 00:00:00,11:06:00,ELKTON,UNK,5,7.0,N,15.0,S,F,0.0,1.0,1.0,3.0,,39.613747,1.0,,N,19.41,-75.837454,N BRIDGE ST,52.0,1.0,3.0,362.0,LAUREL DR,,MU,BK0227001M,Injury Crash,MD,213.0,,N,2.0,6.01,1.0,,,,,,,1967-11-19 00:00:00,,,1.0,,1.0,,3,,,3.0,,,,,006b94ad-a347-4356-a48b-9b4b0849ae7b,O,13.0,F,089ece89-1c91-4fc4-a2f9-3368a8649a96,0,1.0
2,2018-05-02 00:00:00,11:06:00,ELKTON,UNK,5,7.0,N,15.0,S,F,0.0,1.0,1.0,3.0,,39.613747,1.0,,N,19.41,-75.837454,N BRIDGE ST,52.0,1.0,3.0,362.0,LAUREL DR,,MU,BK0227001M,Injury Crash,MD,213.0,,N,2.0,6.01,1.0,,0.0,,N,A,1.0,1992-06-12 00:00:00,,0.0,1.0,,1.0,N,1,DE,,,,,,,8d4526c4-2f46-4dae-bdeb-64cd811bda25,D,13.0,M,f583dc59-17b6-43cd-a853-c1be06240566,0,0.0
3,2018-05-02 00:00:00,11:06:00,ELKTON,UNK,5,7.0,N,15.0,S,F,0.0,1.0,1.0,3.0,,39.613747,1.0,,N,19.41,-75.837454,N BRIDGE ST,52.0,1.0,3.0,362.0,LAUREL DR,,MU,BK0227001M,Injury Crash,MD,213.0,,N,2.0,6.01,1.0,,0.0,,N,C,1.0,1954-08-18 00:00:00,,0.0,1.0,B,1.0,N,3,MD,,,,,,,4fb7ba1d-4e1d-466c-a84c-de24d99ff7d0,D,13.0,M,089ece89-1c91-4fc4-a2f9-3368a8649a96,0,0.0
4,2018-05-02 00:00:00,11:06:00,ELKTON,UNK,5,7.0,N,15.0,S,F,0.0,1.0,1.0,3.0,,39.613747,1.0,,N,19.41,-75.837454,N BRIDGE ST,52.0,1.0,3.0,362.0,LAUREL DR,,MU,BK0227001M,Injury Crash,MD,213.0,,N,2.0,6.01,5.0,,0.0,,N,B,99.0,1948-04-23 00:00:00,,0.0,1.0,A,1.0,Y,3,MD,,,,,,,a6ad59e0-a578-4093-a6df-99a66c936821,D,13.0,M,3326638d-db2b-416e-974c-fc385d6552b9,0,0.0
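
`pd.merge` defaults to an inner join, which silently drops any row whose `REPORT_NO` appears in only one of the two tables. A quick way to see how much would be lost is an outer join with `indicator=True`. This is a sketch on toy frames (`crashes` and `persons` are made-up stand-ins for the two CSVs).

```python
import pandas as pd

# Toy stand-ins: one row per crash, one row per person
crashes = pd.DataFrame({'REPORT_NO': ['A1', 'B2', 'C3'],
                        'SURF_COND_CODE': [2.0, 1.0, 3.0]})
persons = pd.DataFrame({'REPORT_NO': ['A1', 'A1', 'B2', 'Z9'],
                        'SEX_CODE': ['F', 'M', 'M', 'F']})

# how='outer' keeps unmatched rows from both sides, and indicator=True adds
# a _merge column saying where each row came from
check = pd.merge(crashes, persons, on='REPORT_NO', how='outer', indicator=True)
print(check['_merge'].value_counts())

# The default inner join keeps only the rows found in both tables
inner = pd.merge(crashes, persons, on='REPORT_NO')
print(inner.shape)
```

Crash `C3` has no people and person `Z9` has no crash, so both vanish from the inner join. Also notice that crash `A1` appears twice after merging, once per person.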


### Examining more possible features

How often was it wet, dry, snowy, icy, etc? **What was the most common condition?**

In [15]:
merged.SURF_COND_CODE.value_counts()

2.00     619569
1.00     147803
0.00      30921
3.00       9418
4.00       9111
99.00      3616
6.03       1697
88.00      1064
5.00        447
7.01        258
9.88        141
8.05         36
Name: SURF_COND_CODE, dtype: int64

Do you feel that a **Dry** road condition should be the average of **Wet** and **Snow?**

In [16]:
# no?

The answer to that should be *no*, which means we can't use this data as numeric data. We want a different coefficient for each of these - I want to know the impact of dry, the impact of wet, the impact of snow, all separately.

Start by **replacing each code with a proper description.** I'll even include them here:

* `00` - Not Applicable
* `01`	- Wet
* `02`	- Dry
* `03`	- Snow
* `04`	- Ice
* `05`	- Mud, Dirt, Gravel
* `06`	- Slush
* `07`	- Water (standing/moving)
* `08`	- Sand
* `09`	- Oil
* `88`	- Other
* `99`	- Unknown

But watch out, pandas read the column in as numbers so they might have come through a little differently than their codes.

In [17]:
merged['condition'] = merged.SURF_COND_CODE.replace({2.00 : 'Dry', 1.00 : 'Wet', 0.00 : 'Not Applicable', 4.00 : 'Ice', 3.00 : 'Snow', 6.03 : 'Slush', 99.00 : 'Unknown', 88.00 : 'Other', 5.00 : 'Mud, Dirt, Gravel', 7.01 : 'Water (standing/moving)', 9.88 : 'Oil', 8.05 : 'Sand'})
merged.head()

Unnamed: 0,ACC_DATE,ACC_TIME,AGENCY_CODE,AREA_CODE,COLLISION_TYPE_CODE,COUNTY_NO,C_M_ZONE_FLAG,DISTANCE,DISTANCE_DIR_FLAG,FEET_MILES_FLAG,FIX_OBJ_CODE,HARM_EVENT_CODE1,HARM_EVENT_CODE2,JUNCTION_CODE,LANE_CODE,LATITUDE,LIGHT_CODE,LOC_CODE,LOGMILE_DIR_FLAG,LOG_MILE,LONGITUDE,MAINROAD_NAME,MUNI_CODE,RD_COND_CODE,RD_DIV_CODE,REFERENCE_NO,REFERENCE_ROAD_NAME,REFERENCE_SUFFIX,REFERENCE_TYPE_CODE,REPORT_NO,REPORT_TYPE,ROUTE_TYPE_CODE,RTE_NO,RTE_SUFFIX,SIGNAL_FLAG,SURF_COND_CODE,WEATHER_CODE,AIRBAG_DEPLOYED,ALCOHOL_TESTTYPE_CODE,ALCOHOL_TEST_CODE,BAC_CODE,CDL_FLAG,CLASS,CONDITION_CODE,DATE_OF_BIRTH,DRUG_TESTRESULT_CODE,DRUG_TEST_CODE,EJECT_CODE,EMS_UNIT_LABEL,EQUIP_PROB_CODE,FAULT_FLAG,INJ_SEVER_CODE,LICENSE_STATE_CODE,MOVEMENT_CODE,OCC_SEAT_POS_CODE,PED_LOCATION_CODE,PED_OBEY_CODE,PED_TYPE_CODE,PED_VISIBLE_CODE,PERSON_ID,PERSON_TYPE,SAF_EQUIP_CODE,SEX_CODE,VEHICLE_ID,is_fatality,genders,condition
0,2018-04-10 00:00:00,01:50:00,MSP,UNK,17,18.0,N,0.0,N,F,22.03,16.0,11.0,1.0,,38.27723,4.0,,N,2.09,-76.682876,NEWTOWNE NECK RD,0.0,1.0,1.0,166.0,ROSEBANK RD,,CO,MSP6188002Q,Injury Crash,MD,243.0,,N,2.0,6.01,4.0,2.0,3.0,,N,C,2.0,1996-12-13 00:00:00,,0.0,1.0,A,1.0,Y,3,MD,,,,,,,0212a82f-38a7-4a8a-969d-bce0920d3ab3,D,13.0,F,61d02432-7d40-401a-86ce-fea710234832,0,1.0,Dry
1,2018-05-02 00:00:00,11:06:00,ELKTON,UNK,5,7.0,N,15.0,S,F,0.0,1.0,1.0,3.0,,39.613747,1.0,,N,19.41,-75.837454,N BRIDGE ST,52.0,1.0,3.0,362.0,LAUREL DR,,MU,BK0227001M,Injury Crash,MD,213.0,,N,2.0,6.01,1.0,,,,,,,1967-11-19 00:00:00,,,1.0,,1.0,,3,,,3.0,,,,,006b94ad-a347-4356-a48b-9b4b0849ae7b,O,13.0,F,089ece89-1c91-4fc4-a2f9-3368a8649a96,0,1.0,Dry
2,2018-05-02 00:00:00,11:06:00,ELKTON,UNK,5,7.0,N,15.0,S,F,0.0,1.0,1.0,3.0,,39.613747,1.0,,N,19.41,-75.837454,N BRIDGE ST,52.0,1.0,3.0,362.0,LAUREL DR,,MU,BK0227001M,Injury Crash,MD,213.0,,N,2.0,6.01,1.0,,0.0,,N,A,1.0,1992-06-12 00:00:00,,0.0,1.0,,1.0,N,1,DE,,,,,,,8d4526c4-2f46-4dae-bdeb-64cd811bda25,D,13.0,M,f583dc59-17b6-43cd-a853-c1be06240566,0,0.0,Dry
3,2018-05-02 00:00:00,11:06:00,ELKTON,UNK,5,7.0,N,15.0,S,F,0.0,1.0,1.0,3.0,,39.613747,1.0,,N,19.41,-75.837454,N BRIDGE ST,52.0,1.0,3.0,362.0,LAUREL DR,,MU,BK0227001M,Injury Crash,MD,213.0,,N,2.0,6.01,1.0,,0.0,,N,C,1.0,1954-08-18 00:00:00,,0.0,1.0,B,1.0,N,3,MD,,,,,,,4fb7ba1d-4e1d-466c-a84c-de24d99ff7d0,D,13.0,M,089ece89-1c91-4fc4-a2f9-3368a8649a96,0,0.0,Dry
4,2018-05-02 00:00:00,11:06:00,ELKTON,UNK,5,7.0,N,15.0,S,F,0.0,1.0,1.0,3.0,,39.613747,1.0,,N,19.41,-75.837454,N BRIDGE ST,52.0,1.0,3.0,362.0,LAUREL DR,,MU,BK0227001M,Injury Crash,MD,213.0,,N,2.0,6.01,5.0,,0.0,,N,B,99.0,1948-04-23 00:00:00,,0.0,1.0,A,1.0,Y,3,MD,,,,,,,a6ad59e0-a578-4093-a6df-99a66c936821,D,13.0,M,3326638d-db2b-416e-974c-fc385d6552b9,0,0.0,Dry


Confirm you have 147,803 wet, and a few codes you can't understand, like `6.03` and `7.01`.

In [18]:
merged.condition.value_counts()

Dry                        619569
Wet                        147803
Not Applicable              30921
Snow                         9418
Ice                          9111
Unknown                      3616
Slush                        1697
Other                        1064
Mud, Dirt, Gravel             447
Water (standing/moving)       258
Oil                           141
Sand                           36
Name: condition, dtype: int64

Replace the codes you don't understand with `Other`.

In [19]:
merged['condition'] = merged.condition.replace({'Slush': 'Other', 'Water (standing/moving)' : 'Other', 'Sand' : 'Other', 'Oil': 'Other'})

Confirm you have 3,196 'Other'.

In [20]:
merged['condition'].value_counts()

Dry                  619569
Wet                  147803
Not Applicable        30921
Snow                   9418
Ice                    9111
Unknown                3616
Other                  3196
Mud, Dirt, Gravel       447
Name: condition, dtype: int64
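
The two steps above (label every code, then fold the odd ones into `Other`) can also be done in one pass. This is only a sketch: `labels` holds just the codes that came through cleanly, and it assumes that any other non-missing code should become `Other`, which is the choice made in this exercise.

```python
import numpy as np
import pandas as pd

# Toy stand-in for SURF_COND_CODE, including odd codes and a missing value
codes = pd.Series([2.0, 1.0, 6.03, 99.0, np.nan, 7.01])

# Only the codes we are sure about
labels = {
    0.0: 'Not Applicable', 1.0: 'Wet', 2.0: 'Dry', 3.0: 'Snow', 4.0: 'Ice',
    5.0: 'Mud, Dirt, Gravel', 88.0: 'Other', 99.0: 'Unknown',
}

# .map turns anything missing from the dictionary into NaN
condition = codes.map(labels)

# Codes that were present but unrecognized become 'Other';
# rows that were missing to begin with stay NaN
condition = condition.mask(codes.notna() & condition.isna(), 'Other')

print(condition.tolist())
```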

## One-hot encoding

We're going to use `pd.get_dummies` to build a variable you'll call `surf_dummies`. Each surface condition should be a `0` or `1` as to whether it was that condition (dry, icy, wet, etc).

Use a `prefix=` so we know they are **surface** conditions.

You'll want to drop the column you'll use as the reference category.

**Before we do this: which column works best as the reference?**

In [21]:
pd.get_dummies(merged.condition, prefix='surface').head()

Unnamed: 0,surface_Dry,surface_Ice,"surface_Mud, Dirt, Gravel",surface_Not Applicable,surface_Other,surface_Snow,surface_Unknown,surface_Wet
0,1,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0


Now build your `surf_dummies` variable.

In [22]:
surf_dummies = pd.get_dummies(merged.condition, prefix='surface').drop('surface_Dry', axis=1)
surf_dummies.head() 

Unnamed: 0,surface_Ice,"surface_Mud, Dirt, Gravel",surface_Not Applicable,surface_Other,surface_Snow,surface_Unknown,surface_Wet
0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0


Confirm your `surf_dummies` looks roughly like this:

|surface_Ice|surface_Mud, Dirt, Gravel|surface_Not Applicable|...|surface_Wet|
|---|---|---|---|---|
|0|0|0|...|0|
|0|0|0|...|0|
|0|0|1|...|0|
|0|0|1|...|0|
|0|0|0|...|1|
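
Here is the same idea on a tiny series, as a sketch. Picking the most frequent category as the reference is one common heuristic, not the only valid choice. The `dtype=int` argument is there because newer pandas versions return `True`/`False` columns by default.

```python
import pandas as pd

condition = pd.Series(['Dry', 'Wet', 'Dry', 'Ice', 'Dry', 'Wet'])

# The most frequent category is a common choice of reference
reference = condition.value_counts().idxmax()

# dtype=int keeps the 0/1 look on newer pandas, where the default is True/False
dummies = pd.get_dummies(condition, prefix='surface', dtype=int)

# Dropping the reference column means "all zeros" stands for that category
dummies = dummies.drop(columns=f'surface_{reference}')

print(reference)
print(dummies)
```

Every coefficient for the remaining columns is then read as "compared to the reference category".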

## Another regression

Let's run another regression to see the impact of both **sex and surface condition** on fatalities.

### Build your `train_df`

To build your `train_df`, I recommend doing it either of these two ways:

```python
train_df = pd.DataFrame({
    'col1': merged.col1,
    'col2': merged.col2,
    'col3': merged.col3,
})
train_df = train_df.join(surf_dummies)
train_df = train_df.dropna()
```

or like this:

```python
train_df = merged[['col1','col2','col3']].copy()
train_df = train_df.join(surf_dummies)
train_df = train_df.dropna()
```

The second one is shorter, but the first one makes it easier to use comments to remove columns later.
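
A small sketch of the first pattern on toy data, to show what `.join` and `.dropna()` actually do to the rows. The frames `base` and `dummies` and their values are made up for illustration.

```python
import numpy as np
import pandas as pd

base = pd.DataFrame({
    'is_fatality': [0, 1, 0, 0],
    'genders': [1.0, 0.0, np.nan, 1.0],   # NaN stands in for "Unknown" sex
    # 'some_other_feature': ...,          # easy to comment a column in or out
})
dummies = pd.DataFrame({'surface_Wet': [0, 1, 0, 1]})

# .join lines rows up by index label, so both frames must share the same index
train = base.join(dummies)

# dropna removes any row with a missing value in any column
train = train.dropna()
print(train.shape)
```

Because `.join` matches on the index, build the dummies from the same dataframe you take the other columns from, and do the `.dropna()` only after joining.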


In [23]:
train_df = pd.DataFrame({
   'is_fatality': merged['is_fatality'],
   'genders': merged['genders'],
})

train_df = train_df.join(surf_dummies)
train_df = train_df.dropna()

In [24]:
train_df = merged[['is_fatality', "genders"]].join(surf_dummies)
train_df = train_df.dropna()

### Run your regression and check your odds ratios

Actually no, wait, first - what kind of surface do you think will have the **highest fatality rate?**

In [25]:
# snow?

In [26]:
X = train_df.drop(columns=['is_fatality'])
y = train_df.is_fatality

In [27]:
clf = LogisticRegression(C=1e9, solver='lbfgs')
clf.fit(X, y)

LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [28]:
feature_names = X.columns
coefficients = clf.coef_[0]

pd.DataFrame({
    'feature': feature_names,
    'coefficient (log odds ratio)': coefficients,
    'odds ratio': np.exp(coefficients).round(4)
}).sort_values(by='odds ratio', ascending=False)

Unnamed: 0,feature,coefficient (log odds ratio),odds ratio
2,"surface_Mud, Dirt, Gravel",1.11515,3.05
4,surface_Other,0.59623,1.8153
6,surface_Unknown,0.3402,1.4052
7,surface_Wet,-0.098825,0.9059
1,surface_Ice,-0.557708,0.5725
5,surface_Snow,-0.65248,0.5208
0,genders,-0.718133,0.4877
3,surface_Not Applicable,-0.730582,0.4816


Confirm your `train_df` has 815,843 rows and 9 columns.

* **Tip:** When you run your regression, if you get an error about not knowing what to do with `U`, it's because you didn't convert your sex to numbers (or if you did, you didn't do it in your original dataframe)

In [29]:
train_df.shape

(815843, 9)

**Is this what you expected?** Why do you think this result might be the case?
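
One way to sanity-check surprising odds ratios is to look at the raw fatality rate and the number of rows behind each category. The sketch below uses made-up rows. On the real data the equivalent would be grouping `merged` by `condition`.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'condition':   ['Dry', 'Dry', 'Dry', 'Dry', 'Snow', 'Snow', 'Wet', 'Wet'],
    'is_fatality': [1,     0,     0,     0,     0,      0,      1,     0],
})

# mean of a 0/1 column = share of rows that are 1; count = rows behind it
rates = toy.groupby('condition')['is_fatality'].agg(['mean', 'count'])
print(rates.sort_values('mean', ascending=False))
```

Keep an eye on the `count` column: a category with only a few hundred rows (like `Mud, Dirt, Gravel` with 447) can produce a large odds ratio from just a handful of fatalities.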

# More features: Vehicles

Maybe whether they survived is related to the car they were in. Luckily, we have this information - **read in `combined-vehicle-data.csv` as `vehicles`.**

In [30]:
vehicles = pd.read_csv('combined-vehicle-data.csv')
vehicles.head()

Unnamed: 0,AREA_DAMAGED_CODE1,AREA_DAMAGED_CODE2,AREA_DAMAGED_CODE3,AREA_DAMAGED_CODE_IMP1,AREA_DAMAGED_CODE_MAIN,BODY_TYPE_CODE,COMMERCIAL_FLAG,CONTI_DIRECTION_CODE,CV_BODY_TYPE_CODE,DAMAGE_CODE,DRIVERLESS_FLAG,FIRE_FLAG,GOING_DIRECTION_CODE,GVW_CODE,HARM_EVENT_CODE,HAZMAT_SPILL_FLAG,HIT_AND_RUN_FLAG,HZM_NUM,MOVEMENT_CODE,NUM_AXLES,PARKED_FLAG,REPORT_NO,SPEED_LIMIT,TOWED_AWAY_FLAG,TOWED_VEHICLE_CONFIG_CODE,VEHICLE_ID,VEH_MAKE,VEH_MODEL,VEH_YEAR,VIN_NO
0,8.0,9.0,10.0,10.0,10.0,23.08,N,E,,5,N,N,E,,9.0,,N,,1.0,,N,ADJ487004H,30,Y,0,000238fd-44fa-4cd5-8eb7-41ab30500bec,CHEVY,TAHOE,2005.0,1GNEK13Q2J285593
1,12.0,,,12.0,12.0,2.0,N,E,,4,N,N,E,,1.0,,N,,1.0,,N,MCP2487000M,40,Y,0,00038116-1bf9-48cc-b317-4f4375d14b60,INFI,4S,2003.0,JNKCV51E63M013580
2,6.0,7.0,,6.0,6.0,20.0,N,N,,3,N,N,N,,1.0,,N,,6.0,,N,CB5190006B,55,N,0,0003b659-2785-4868-8877-0b786a284827,TOYT,TK,2011.0,5TFUY5F1XBX167340
3,8.0,,,8.0,8.0,2.0,N,W,,2,N,,W,,1.0,,N,,8.0,,N,ADJ4590035,5,N,0,00050484-d08f-4b6e-bc7e-9ec270e94660,HONDA,CIVIC,2015.0,2HGFG4A59FH702545
4,12.0,,,12.0,12.0,2.0,N,N,,4,N,,N,,0.0,,N,,1.0,,N,ADJ849000Z,10,N,0,00057af4-d848-4cee-b854-707f57581f4e,HONDA,ACCORD,2003.0,1HGCM66313A037175


## Weights of those cars

The car weights are stored in **another file** since the info had to come from an API. I looked up the VINs - vehicle identification numbers - in a government database to try to get data for each of them.

**Read them and build a new dataframe that is both the vehicle data along with their weights.** You can call it `vehicles` since you don't need the original weightless vehicle data any more.

In [31]:
vins_and_weights = pd.read_csv('vins_and_weights.csv')
vins_and_weights.head()

Unnamed: 0,VIN,Make,Model,ModelYear,weight
0,2FMDA5143TBB45576,FORD,WINDSTAR,1996,3733.0
1,2G1WC5E37E1120089,CHEVROLET,IMPALA,2014,3618.0
2,5J6RE4H55AL053951,HONDA,CR-V,2010,3389.0
3,1N4AA5AP0EC435185,NISSAN,MAXIMA,2014,3556.0
4,JTHCK262075010440,LEXUS,IS,2007,3527.0


In [32]:
vehicles1 = pd.merge(vehicles, vins_and_weights, left_on='VIN_NO', right_on='VIN')
vehicles1.head() 

Unnamed: 0,AREA_DAMAGED_CODE1,AREA_DAMAGED_CODE2,AREA_DAMAGED_CODE3,AREA_DAMAGED_CODE_IMP1,AREA_DAMAGED_CODE_MAIN,BODY_TYPE_CODE,COMMERCIAL_FLAG,CONTI_DIRECTION_CODE,CV_BODY_TYPE_CODE,DAMAGE_CODE,DRIVERLESS_FLAG,FIRE_FLAG,GOING_DIRECTION_CODE,GVW_CODE,HARM_EVENT_CODE,HAZMAT_SPILL_FLAG,HIT_AND_RUN_FLAG,HZM_NUM,MOVEMENT_CODE,NUM_AXLES,PARKED_FLAG,REPORT_NO,SPEED_LIMIT,TOWED_AWAY_FLAG,TOWED_VEHICLE_CONFIG_CODE,VEHICLE_ID,VEH_MAKE,VEH_MODEL,VEH_YEAR,VIN_NO,VIN,Make,Model,ModelYear,weight
0,8.0,9.0,10.0,10.0,10.0,23.08,N,E,,5,N,N,E,,9.0,,N,,1.0,,N,ADJ487004H,30,Y,0,000238fd-44fa-4cd5-8eb7-41ab30500bec,CHEVY,TAHOE,2005.0,1GNEK13Q2J285593,1GNEK13Q2J285593,CHEVROLET,GMT-400,1988,4300.0
1,12.0,,,12.0,12.0,2.0,N,E,,4,N,N,E,,1.0,,N,,1.0,,N,MCP2487000M,40,Y,0,00038116-1bf9-48cc-b317-4f4375d14b60,INFI,4S,2003.0,JNKCV51E63M013580,JNKCV51E63M013580,INFINITI,G35,2003,3468.0
2,6.0,7.0,,6.0,6.0,20.0,N,N,,3,N,N,N,,1.0,,N,,6.0,,N,CB5190006B,55,N,0,0003b659-2785-4868-8877-0b786a284827,TOYT,TK,2011.0,5TFUY5F1XBX167340,5TFUY5F1XBX167340,TOYOTA,TUNDRA,2011,5480.0
3,8.0,,,8.0,8.0,2.0,N,W,,2,N,,W,,1.0,,N,,8.0,,N,ADJ4590035,5,N,0,00050484-d08f-4b6e-bc7e-9ec270e94660,HONDA,CIVIC,2015.0,2HGFG4A59FH702545,2HGFG4A59FH702545,HONDA,CIVIC,2015,2754.0
4,12.0,,,12.0,12.0,2.0,N,N,,4,N,,N,,0.0,,N,,1.0,,N,ADJ849000Z,10,N,0,00057af4-d848-4cee-b854-707f57581f4e,HONDA,ACCORD,2003.0,1HGCM66313A037175,1HGCM66313A037175,HONDA,ACCORD,2003,3023.0


Confirm that your combined `vehicles` dataset has 534,436 rows and 35 columns. And yes, that's less than we were working with before - you haven't combined it with the people/crashes dataset yet.

In [33]:
vehicles1.shape

(534436, 35)

### Filter your data

We only want vehicles that are "normal" - somewhere between 1500 and 6000 pounds. Filter your vehicles to only include those in that weight range.

In [34]:
vehicles2 = vehicles1[(vehicles1['weight']>1500) & (vehicles1['weight']<6000)]
vehicles2.sort_values(by='weight')

Unnamed: 0,AREA_DAMAGED_CODE1,AREA_DAMAGED_CODE2,AREA_DAMAGED_CODE3,AREA_DAMAGED_CODE_IMP1,AREA_DAMAGED_CODE_MAIN,BODY_TYPE_CODE,COMMERCIAL_FLAG,CONTI_DIRECTION_CODE,CV_BODY_TYPE_CODE,DAMAGE_CODE,DRIVERLESS_FLAG,FIRE_FLAG,GOING_DIRECTION_CODE,GVW_CODE,HARM_EVENT_CODE,HAZMAT_SPILL_FLAG,HIT_AND_RUN_FLAG,HZM_NUM,MOVEMENT_CODE,NUM_AXLES,PARKED_FLAG,REPORT_NO,SPEED_LIMIT,TOWED_AWAY_FLAG,TOWED_VEHICLE_CONFIG_CODE,VEHICLE_ID,VEH_MAKE,VEH_MODEL,VEH_YEAR,VIN_NO,VIN,Make,Model,ModelYear,weight
97180,6.0,7.0,,7.0,7.0,2.00,N,,,3,Y,N,,,2.0,,N,,10.00,,Y,ADG902000T,0,N,0,42055bd3-9ce7-4f60-8643-ffac4c9fe25e,GEO,METRO,1980.0,2C1MR6467M6792313,2C1MR6467M6792313,GEO,METRO,1991,1701.0
59842,1.0,11.0,12.0,12.0,12.0,2.00,N,W,,5,N,N,W,,9.0,,N,,1.00,,N,MCP2564000Q,35,Y,0,71b323a9-1c67-4c06-8a38-7167915567db,GEO,METRO,1992.0,2C1MR2464N6705980,2C1MR2464N6705980,GEO,METRO,1992,1701.0
233396,6.0,12.0,,6.0,6.0,2.00,N,,,3,Y,,,,0.0,,N,,10.00,,Y,AE5277000V,30,N,0,bfba35b1-39aa-43d9-92d6-9278d04c10cd,GEO,METRO,1991.0,2C1MR6463M6742413,2C1MR6463M6742413,GEO,METRO,1991,1701.0
339961,12.0,,,12.0,12.0,2.00,N,E,,4,N,N,E,,1.0,,N,,1.00,,N,MSP617100F2,55,Y,0,9f073ce1-94a9-4429-92cd-299ff2467e84,GEO,METRO,1991.0,2C1MR2468M6702028,2C1MR2468M6702028,GEO,METRO,1991,1701.0
436224,6.0,,,6.0,6.0,2.00,N,W,,2,N,N,W,,1.0,,N,,4.00,,N,ADJ6960002,35,Y,0,1e960d5d-9ce5-4a48-886f-2a455b937b4a,CHEVROLET,GEO,1994.0,2C1MR2469R6770796,2C1MR2469R6770796,GEO,METRO,1994,1779.0
43038,1.0,2.0,12.0,12.0,1.0,2.00,N,W,,4,N,N,W,,1.0,,N,,3.00,,N,ADJ323000Q,35,Y,0,1cca948a-43da-45b5-9f38-b3931e932e7e,GEO,METRO,1994.0,2C1MS2463R6701901,2C1MS2463R6701901,GEO,METRO,1994,1779.0
212696,5.0,6.0,7.0,6.0,6.0,2.00,N,W,,3,N,N,W,,1.0,,N,,1.00,,N,DA39490017,30,N,0,3635e246-80f4-4e47-9ed3-8b67feea4265,GEO,2S,1994.0,2C1MR2466R6753597,2C1MR2466R6753597,GEO,METRO,1994,1779.0
219785,11.0,,,11.0,11.0,2.00,N,S,,3,N,N,S,,1.0,,N,,1.00,,N,MSP47970022,50,N,0,652c540d-b7ea-4799-a68f-9dab7694f8bf,GEO,METRO,1997.0,2C1MR2296V6769676,2C1MR2296V6769676,GEO,METRO,1997,1832.0
260486,1.0,11.0,12.0,12.0,12.0,2.00,N,N,,5,N,N,N,,1.0,,N,,1.00,,N,MSP6491001F,50,N,0,6753f811-03b6-4dda-b0f0-5b15ef692643,GEO,METRO,1997.0,2C1MR2296V6749279,2C1MR2296V6749279,GEO,METRO,1997,1832.0
14291,6.0,,,6.0,6.0,2.00,N,E,,4,N,N,E,,1.0,,N,,19.07,,N,MSP5747008D,35,Y,0,53e490a9-08df-4ebd-8921-0ec31620de23,GEO,METRO,1997.0,2C1MR2292V6755726,2C1MR2292V6755726,GEO,METRO,1997,1832.0


Confirm that you have 532,370 vehicles in the dataset.

In [35]:
vehicles2.shape

(532370, 35)
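
The two-sided comparison above can also be written with `Series.between`. This is a sketch on toy weights. It assumes pandas 1.3 or newer, where `inclusive` takes the strings `'both'`, `'neither'`, `'left'` or `'right'`.

```python
import pandas as pd

# Toy weights in pounds, including values exactly on the boundaries
weights = pd.Series([1200.0, 1500.0, 1701.0, 3389.0, 6000.0, 7200.0])

# inclusive='neither' matches the strict > and < used in the cell above
in_range = weights.between(1500, 6000, inclusive='neither')
filtered = weights[in_range]

print(filtered.tolist())
```

Whether the boundaries are included matters for the row count you are asked to confirm, so pick the version that matches the filter you intend.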

## Add this vehicle information to your merged data

Now we'll have a dataframe that contains information on:

* The people themselves and their injuries
* The crash
* The vehicles

Every person came with a `VEHICLE_ID` column that is the vehicle they were in. You'll want to merge on that.

In [36]:
merged2 = pd.merge(merged, vehicles2, on='VEHICLE_ID')
merged2.head()

Unnamed: 0,ACC_DATE,ACC_TIME,AGENCY_CODE,AREA_CODE,COLLISION_TYPE_CODE,COUNTY_NO,C_M_ZONE_FLAG,DISTANCE,DISTANCE_DIR_FLAG,FEET_MILES_FLAG,FIX_OBJ_CODE,HARM_EVENT_CODE1,HARM_EVENT_CODE2,JUNCTION_CODE,LANE_CODE,LATITUDE,LIGHT_CODE,LOC_CODE,LOGMILE_DIR_FLAG,LOG_MILE,LONGITUDE,MAINROAD_NAME,MUNI_CODE,RD_COND_CODE,RD_DIV_CODE,REFERENCE_NO,REFERENCE_ROAD_NAME,REFERENCE_SUFFIX,REFERENCE_TYPE_CODE,REPORT_NO_x,REPORT_TYPE,ROUTE_TYPE_CODE,RTE_NO,RTE_SUFFIX,SIGNAL_FLAG,SURF_COND_CODE,WEATHER_CODE,AIRBAG_DEPLOYED,ALCOHOL_TESTTYPE_CODE,ALCOHOL_TEST_CODE,BAC_CODE,CDL_FLAG,CLASS,CONDITION_CODE,DATE_OF_BIRTH,DRUG_TESTRESULT_CODE,DRUG_TEST_CODE,EJECT_CODE,EMS_UNIT_LABEL,EQUIP_PROB_CODE,FAULT_FLAG,INJ_SEVER_CODE,LICENSE_STATE_CODE,MOVEMENT_CODE_x,OCC_SEAT_POS_CODE,PED_LOCATION_CODE,PED_OBEY_CODE,PED_TYPE_CODE,PED_VISIBLE_CODE,PERSON_ID,PERSON_TYPE,SAF_EQUIP_CODE,SEX_CODE,VEHICLE_ID,is_fatality,genders,condition,AREA_DAMAGED_CODE1,AREA_DAMAGED_CODE2,AREA_DAMAGED_CODE3,AREA_DAMAGED_CODE_IMP1,AREA_DAMAGED_CODE_MAIN,BODY_TYPE_CODE,COMMERCIAL_FLAG,CONTI_DIRECTION_CODE,CV_BODY_TYPE_CODE,DAMAGE_CODE,DRIVERLESS_FLAG,FIRE_FLAG,GOING_DIRECTION_CODE,GVW_CODE,HARM_EVENT_CODE,HAZMAT_SPILL_FLAG,HIT_AND_RUN_FLAG,HZM_NUM,MOVEMENT_CODE_y,NUM_AXLES,PARKED_FLAG,REPORT_NO_y,SPEED_LIMIT,TOWED_AWAY_FLAG,TOWED_VEHICLE_CONFIG_CODE,VEH_MAKE,VEH_MODEL,VEH_YEAR,VIN_NO,VIN,Make,Model,ModelYear,weight
0,2018-04-10 00:00:00,01:50:00,MSP,UNK,17,18.0,N,0.0,N,F,22.03,16.0,11.0,1.0,,38.27723,4.0,,N,2.09,-76.682876,NEWTOWNE NECK RD,0.0,1.0,1.0,166.0,ROSEBANK RD,,CO,MSP6188002Q,Injury Crash,MD,243.0,,N,2.0,6.01,4.0,2.0,3.0,,N,C,2.0,1996-12-13 00:00:00,,0.0,1.0,A,1.0,Y,3,MD,,,,,,,0212a82f-38a7-4a8a-969d-bce0920d3ab3,D,13.0,F,61d02432-7d40-401a-86ce-fea710234832,0,1.0,Dry,1.0,11.0,12.0,12.0,12.0,2.0,N,N,,5,N,N,N,,16.0,,N,,1.0,,N,MSP6188002Q,40,Y,0,MITSUBISHI,MIRAGE,2017.0,ML32A4HJ0HH009937,ML32A4HJ0HH009937,MITSUBISHI,MIRAGE,2017,2045.0
1,2018-05-02 00:00:00,11:06:00,ELKTON,UNK,5,7.0,N,15.0,S,F,0.0,1.0,1.0,3.0,,39.613747,1.0,,N,19.41,-75.837454,N BRIDGE ST,52.0,1.0,3.0,362.0,LAUREL DR,,MU,BK0227001M,Injury Crash,MD,213.0,,N,2.0,6.01,1.0,,,,,,,1967-11-19 00:00:00,,,1.0,,1.0,,3,,,3.0,,,,,006b94ad-a347-4356-a48b-9b4b0849ae7b,O,13.0,F,089ece89-1c91-4fc4-a2f9-3368a8649a96,0,1.0,Dry,6.0,11.0,12.0,6.0,6.0,23.08,N,W,,5,N,N,N,,1.0,,N,,3.0,,N,BK0227001M,35,N,0,HONDA,CRV,2006.0,SHSRD78946U439857,SHSRD78946U439857,HONDA,CR-V,2006,3318.0
2,2018-05-02 00:00:00,11:06:00,ELKTON,UNK,5,7.0,N,15.0,S,F,0.0,1.0,1.0,3.0,,39.613747,1.0,,N,19.41,-75.837454,N BRIDGE ST,52.0,1.0,3.0,362.0,LAUREL DR,,MU,BK0227001M,Injury Crash,MD,213.0,,N,2.0,6.01,1.0,,0.0,,N,C,1.0,1954-08-18 00:00:00,,0.0,1.0,B,1.0,N,3,MD,,,,,,,4fb7ba1d-4e1d-466c-a84c-de24d99ff7d0,D,13.0,M,089ece89-1c91-4fc4-a2f9-3368a8649a96,0,0.0,Dry,6.0,11.0,12.0,6.0,6.0,23.08,N,W,,5,N,N,N,,1.0,,N,,3.0,,N,BK0227001M,35,N,0,HONDA,CRV,2006.0,SHSRD78946U439857,SHSRD78946U439857,HONDA,CR-V,2006,3318.0
3,2018-05-02 00:00:00,11:06:00,ELKTON,UNK,5,7.0,N,15.0,S,F,0.0,1.0,1.0,3.0,,39.613747,1.0,,N,19.41,-75.837454,N BRIDGE ST,52.0,1.0,3.0,362.0,LAUREL DR,,MU,BK0227001M,Injury Crash,MD,213.0,,N,2.0,6.01,5.0,,0.0,,N,B,99.0,1948-04-23 00:00:00,,0.0,1.0,A,1.0,Y,3,MD,,,,,,,a6ad59e0-a578-4093-a6df-99a66c936821,D,13.0,M,3326638d-db2b-416e-974c-fc385d6552b9,0,0.0,Dry,11.0,12.0,,12.0,12.0,23.08,N,N,,4,N,N,N,,1.0,,N,,1.0,,N,BK0227001M,35,Y,0,DODGE,DURANGO,2015.0,1C4RDJDG1FC843317,1C4RDJDG1FC843317,DODGE,DURANGO,2015,4756.0
4,2018-05-02 00:00:00,12:45:00,MONTGOMERY,UNK,7,15.0,N,0.0,E,F,0.0,1.0,1.0,2.0,2.0,39.015649,1.0,55122.0,E,0.0,-77.042728,FOREST GLEN RD,0.0,0.0,4.0,97.0,GEORGIA AVE,,MD,MCP2891005S,Property Damage Crash,CO,697.0,,Y,2.0,6.01,1.0,,0.0,,N,C,1.0,1988-09-11 00:00:00,,0.0,1.0,,0.0,N,1,MD,,,,,,,76e97f6c-9a05-462a-8a7d-833753dd48f7,D,13.0,M,0fbcd761-3428-4cad-a121-0d3696262b59,0,0.0,Dry,7.0,,,7.0,7.0,2.0,N,S,,3,N,N,W,,1.0,,N,,12.0,,N,MCP2891005S,35,N,0,NISSAN,SENTRA,2013.0,3N1AB7AP1DL615347,3N1AB7AP1DL615347,NISSAN,SENTRA,2013,2902.0


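Confirm you have 616,212 rows and about 99 columns (the count can be a little higher, such as the 101 in the run below, if you added extra helper columns to `merged` along the way). **That is a lot of possible features!**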

In [37]:
merged2.shape

(616212, 101)

## Another regression, because we can't get enough

Build another `train_df` and run another regression about **how car weight impacts the chance of fatalities**. You'll want to confirm that your dataset has 616,212 rows and 2 columns.

In [38]:
train_df = merged2.copy()
train_df = train_df[['is_fatality', 'weight']] 

In [39]:
X = train_df.drop(columns=['is_fatality'])
y = train_df.is_fatality 

In [40]:
clf = LogisticRegression(C=1e9, solver='lbfgs')
clf.fit(X, y) 

LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [41]:
feature_names = X.columns
coefficients = clf.coef_[0]

pd.DataFrame({
    'feature': feature_names,
    'coefficient (log odds ratio)': coefficients,
    'odds ratio': np.exp(coefficients)
}).sort_values(by='odds ratio', ascending=False) 

Unnamed: 0,feature,coefficient (log odds ratio),odds ratio
0,weight,-0.000147,0.999853


**Can you translate that into plain English?** Remember weight is in **pounds**.

In [42]:
# Every additional pound of car weight translates to about a 0.015% decrease in the odds of a fatality
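
A quick way to sanity-check a sentence like this is to turn the coefficient into a percent change in the odds, using `(odds ratio - 1) * 100`. The sketch below plugs in the rounded coefficient from the table above; the helper name `percent_change_in_odds` is made up for illustration.

```python
import math

def percent_change_in_odds(coefficient):
    """Convert a log odds ratio into a percent change in the odds."""
    odds_ratio = math.exp(coefficient)
    return (odds_ratio - 1) * 100

# Rounded per-pound coefficient from the regression output above
per_pound = percent_change_in_odds(-0.000147)
print(f"{per_pound:.4f}% change in the odds per extra pound")
```

Keep in mind this is a change in the *odds* of a fatality, which is not quite the same thing as a change in the probability.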


I feel like pounds isn't the best measure for something like this. Remember how we had to adjust percentages with AP and life expectancy, and then change around the way we said things? It sounded like this:

> Every 10% increase in unemployment translates to a year and a half loss of life expectancy

Instead of every single pound, maybe we could do every... some other number of pounds? One hundred? One thousand?

**Run another regression with weight in thousands of pounds.** Get another odds ratio. Give me another sentence in English.
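
Before rerunning anything, it can help to guess what rescaling should do. If weight is divided by 1,000, the coefficient should come out roughly 1,000 times larger, so the per-pound coefficient gives a rough prediction for the new odds ratio. This is back-of-the-envelope arithmetic using the rounded coefficient from the earlier table, not a substitute for refitting the model.

```python
import math

per_pound_coef = -0.000147                 # rounded coefficient from the per-pound regression
per_thousand_coef = per_pound_coef * 1000  # one unit is now 1,000 pounds
predicted_odds_ratio = math.exp(per_thousand_coef)
print(round(predicted_odds_ratio, 4))
```

If the refit odds ratio lands far from this number, something other than the units probably changed between the two runs.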

In [43]:
train_df['weight'] = train_df.weight/1000 

In [44]:
train_df = merged2.copy()
train_df = train_df[['is_fatality', 'weight']]
train_df['weight'] = train_df.weight/1000 

In [45]:
X = train_df.drop(columns=['is_fatality'])
y = train_df.is_fatality 

In [46]:
clf = LogisticRegression(C=1e9, solver='lbfgs')
clf.fit(X, y)

LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [47]:
feature_names = X.columns
coefficients = clf.coef_[0]

pd.DataFrame({
    'feature': feature_names,
    'coefficient (log odds ratio)': coefficients,
    'odds ratio': np.exp(coefficients)
}).sort_values(by='odds ratio', ascending=False) 

Unnamed: 0,feature,coefficient (log odds ratio),odds ratio
0,weight,-0.146881,0.863397


In [48]:
# Every additional thousand pounds of car weight translates to about a 14% decrease in the odds of a fatality


# Two-car accidents, struck and striker

Here's the thing, though: **it isn't just the weight of your car.** It's the weight of both cars! If I'm in a big car and I have a wreck with a smaller car, it's the smaller car that's in trouble.

To get that value, we need to do some **feature engineering**, some calculating of *new* variables from our *existing* variables.

We need to jump through some hoops to do that.

## Two-car accidents

First we're going to count how many vehicles were in each accident. Since we're looking to compare the weight of two cars hitting each other, **we only want crashes with exactly two cars.**

In [49]:
counted = vehicles.REPORT_NO.value_counts()
counted.head(10)

MDTA1229000H    69
CB53480003      35
MSP67460095     18
CE4636002C      17
MSP6063005N     17
AS0286000B      17
MSP6063006C     15
MSP637800B1     15
DA3276000V      14
CC0237003W      14
Name: REPORT_NO, dtype: int64

By using `.value_counts` I can see how many cars were in each crash, and now I'm going to filter to get a list of all of the ones with two vehicles.

In [50]:
two_car_report_nos = counted[counted == 2].index
two_car_report_nos

Index(['ADJ2990014', 'MTA03430005', 'ADJ216000J', 'CBPD0091001B',
       'MSP12810012', 'MDTA1553000Q', 'AE4747000K', 'DA3805001K',
       'MSP6593006D', 'DA3375001C',
       ...
       'MSP546600JB', 'BX5185000D', 'EJ78560032', 'MSP6171009J', 'ADD721006J',
       'ADJ465002W', 'AE57320015', 'ZR0279001D', 'ZL0846000C', 'AE5917000T'],
      dtype='object', length=253146)

And now we'll filter our vehicles so we only have those that were in two-vehicle crashes.

In [51]:
vehicles = vehicles[vehicles.REPORT_NO.isin(two_car_report_nos)]
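
To see the count-then-filter pattern on something small, here is a sketch with a made-up `toy_vehicles` table. The report numbers and vehicle IDs are invented; only the pattern matches the cells above.

```python
import pandas as pd

# Hypothetical data: R1 has two vehicles, R2 has one, R3 has three
toy_vehicles = pd.DataFrame({
    'REPORT_NO': ['R1', 'R1', 'R2', 'R3', 'R3', 'R3'],
    'VEHICLE_ID': ['a', 'b', 'c', 'd', 'e', 'f'],
})

# Count vehicles per report, then keep only reports with exactly two
toy_counted = toy_vehicles.REPORT_NO.value_counts()
toy_two_car_report_nos = toy_counted[toy_counted == 2].index
toy_two_car = toy_vehicles[toy_vehicles.REPORT_NO.isin(toy_two_car_report_nos)]
print(toy_two_car)
```

Only the two `R1` rows should survive the filter.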

### Struck and striker

To do the math correctly, we need both the risk of someone dying in the smaller car _and_ the risk of someone dying in the bigger car. To do this we need to separate our cars into two groups:

* The 'struck' vehicle: did the person die inside?
* The 'striker' vehicle: how much heavier was it than the struck car?

But we don't know which car was which, so we have to try out both versions - pretending car A was the striker, then pretending car B was the striker. It's hard to explain, but you can read `Pounds That Kill - The External Costs of Vehicle Weight.pdf` for more details on how it works.
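
The "try both versions" idea may be easier to see on a tiny example. The sketch below uses an invented `toy_vehicles` table with a hypothetical `weight` column and two crashes of two cars each; it follows the same first/last split and double merge as the cells below.

```python
import pandas as pd

# Hypothetical two-car crashes with made-up weights
toy_vehicles = pd.DataFrame({
    'REPORT_NO': ['R1', 'R1', 'R2', 'R2'],
    'VEHICLE_ID': ['a', 'b', 'c', 'd'],
    'weight': [3000, 4500, 2800, 5200],
})

# One car from each crash goes in each group
toy_first = toy_vehicles.drop_duplicates(subset='REPORT_NO', keep='first')
toy_last = toy_vehicles.drop_duplicates(subset='REPORT_NO', keep='last')

# Pair them both ways round: first-as-striker, then last-as-striker
toy_pairs = pd.concat([
    toy_first.merge(toy_last, on='REPORT_NO', suffixes=('_striker', '_struck')),
    toy_last.merge(toy_first, on='REPORT_NO', suffixes=('_striker', '_struck')),
], ignore_index=True)
print(toy_pairs)
```

Every vehicle ends up as the struck car in exactly one row and as the striker in exactly one row, so two crashes become four rows.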

In [52]:
cars_1 = vehicles.drop_duplicates(subset='REPORT_NO', keep='first')
cars_2 = vehicles.drop_duplicates(subset='REPORT_NO', keep='last')

In [53]:
cars_merged_1 = cars_1.merge(cars_2, on='REPORT_NO', suffixes=['_striker', '_struck'])
cars_merged_2 = cars_2.merge(cars_1, on='REPORT_NO', suffixes=['_striker', '_struck'])
vehicles_complete = pd.concat([cars_merged_1, cars_merged_2])
vehicles_complete.head()

Unnamed: 0,AREA_DAMAGED_CODE1_striker,AREA_DAMAGED_CODE2_striker,AREA_DAMAGED_CODE3_striker,AREA_DAMAGED_CODE_IMP1_striker,AREA_DAMAGED_CODE_MAIN_striker,BODY_TYPE_CODE_striker,COMMERCIAL_FLAG_striker,CONTI_DIRECTION_CODE_striker,CV_BODY_TYPE_CODE_striker,DAMAGE_CODE_striker,DRIVERLESS_FLAG_striker,FIRE_FLAG_striker,GOING_DIRECTION_CODE_striker,GVW_CODE_striker,HARM_EVENT_CODE_striker,HAZMAT_SPILL_FLAG_striker,HIT_AND_RUN_FLAG_striker,HZM_NUM_striker,MOVEMENT_CODE_striker,NUM_AXLES_striker,PARKED_FLAG_striker,REPORT_NO,SPEED_LIMIT_striker,TOWED_AWAY_FLAG_striker,TOWED_VEHICLE_CONFIG_CODE_striker,VEHICLE_ID_striker,VEH_MAKE_striker,VEH_MODEL_striker,VEH_YEAR_striker,VIN_NO_striker,AREA_DAMAGED_CODE1_struck,AREA_DAMAGED_CODE2_struck,AREA_DAMAGED_CODE3_struck,AREA_DAMAGED_CODE_IMP1_struck,AREA_DAMAGED_CODE_MAIN_struck,BODY_TYPE_CODE_struck,COMMERCIAL_FLAG_struck,CONTI_DIRECTION_CODE_struck,CV_BODY_TYPE_CODE_struck,DAMAGE_CODE_struck,DRIVERLESS_FLAG_struck,FIRE_FLAG_struck,GOING_DIRECTION_CODE_struck,GVW_CODE_struck,HARM_EVENT_CODE_struck,HAZMAT_SPILL_FLAG_struck,HIT_AND_RUN_FLAG_struck,HZM_NUM_struck,MOVEMENT_CODE_struck,NUM_AXLES_struck,PARKED_FLAG_struck,SPEED_LIMIT_struck,TOWED_AWAY_FLAG_struck,TOWED_VEHICLE_CONFIG_CODE_struck,VEHICLE_ID_struck,VEH_MAKE_struck,VEH_MODEL_struck,VEH_YEAR_struck,VIN_NO_struck
0,8.0,,,8.0,8.0,2.0,N,W,,2,N,,W,,1.0,,N,,8.0,,N,ADJ4590035,5,N,0,00050484-d08f-4b6e-bc7e-9ec270e94660,HONDA,CIVIC,2015.0,2HGFG4A59FH702545,99.0,,,99.0,99.0,2.0,N,W,,99,N,N,W,,1.0,,Y,,1.0,,N,5,N,0,3ebfe7fc-4832-414d-a5e8-391b65a2f280,HONDA,ACCORD,2013.0,1HGCT2B81DA005695
1,12.0,,,12.0,12.0,2.0,N,N,,4,N,,N,,0.0,,N,,1.0,,N,ADJ849000Z,10,N,0,00057af4-d848-4cee-b854-707f57581f4e,HONDA,ACCORD,2003.0,1HGCM66313A037175,12.0,,,12.0,12.0,2.0,N,,,3,Y,,,,1.0,,N,,10.0,,Y,0,N,0,4cf496cc-f537-447e-bb8e-3ef48c8c3ddf,HONDA,TK,2010.0,5J6RE3H73AL008454
2,1.0,11.0,12.0,12.0,12.0,2.0,N,S,,4,N,N,S,,1.0,,N,,12.0,,N,AE5207008Z,30,Y,0,00089d4a-7038-4693-9e02-b402676631af,FORD,4D,2016.0,1FADP3K28GL258987,11.0,12.0,,12.0,12.0,23.08,N,N,,4,N,N,N,,1.0,,N,,1.0,,N,30,Y,0,1e69b284-719f-429b-baae-6ae4976a546b,BUICK,TK,2002.0,3G5DB03E52S528316
3,6.0,,,6.0,6.0,2.0,N,S,,2,N,N,S,,1.0,,N,,3.0,,N,MCP27070015,25,N,0,000f45d8-bc0e-4f9c-820a-474212e669cd,ACUR,TSX,2008.0,JH4CL96848C021626,12.0,,,12.0,12.0,2.0,N,S,,2,N,N,S,,1.0,,N,,3.0,,N,25,N,0,6e89f5d9-b4c5-4571-a511-6259f408bde9,ACUR,TL,2010.0,19UUA8F56AA011310
4,2.0,,,2.0,2.0,2.0,N,S,,99,N,N,S,,1.0,,Y,,1.0,,N,ADJ859000Y,15,N,0,0010076b-0a45-45f3-8796-f387b39cd85d,HONDA,CIVIC,2009.0,2HGFA16689H357624,10.0,,,10.0,10.0,2.0,N,,,3,Y,N,,,1.0,,N,,10.0,,Y,15,N,0,b14273cb-b3e5-42c7-af59-df069e8c2f0f,HONDA,CIVIC,2011.0,2HGFG1B87BH517469


## Put people in their cars

Which car was each person in? We'll assign that now.

In [54]:
merged = people.merge(vehicles_complete, left_on='VEHICLE_ID', right_on='VEHICLE_ID_struck')
merged.head(3)

Unnamed: 0,AIRBAG_DEPLOYED,ALCOHOL_TESTTYPE_CODE,ALCOHOL_TEST_CODE,BAC_CODE,CDL_FLAG,CLASS,CONDITION_CODE,DATE_OF_BIRTH,DRUG_TESTRESULT_CODE,DRUG_TEST_CODE,EJECT_CODE,EMS_UNIT_LABEL,EQUIP_PROB_CODE,FAULT_FLAG,INJ_SEVER_CODE,LICENSE_STATE_CODE,MOVEMENT_CODE,OCC_SEAT_POS_CODE,PED_LOCATION_CODE,PED_OBEY_CODE,PED_TYPE_CODE,PED_VISIBLE_CODE,PERSON_ID,PERSON_TYPE,REPORT_NO_x,SAF_EQUIP_CODE,SEX_CODE,VEHICLE_ID,is_fatality,genders,AREA_DAMAGED_CODE1_striker,AREA_DAMAGED_CODE2_striker,AREA_DAMAGED_CODE3_striker,AREA_DAMAGED_CODE_IMP1_striker,AREA_DAMAGED_CODE_MAIN_striker,BODY_TYPE_CODE_striker,COMMERCIAL_FLAG_striker,CONTI_DIRECTION_CODE_striker,CV_BODY_TYPE_CODE_striker,DAMAGE_CODE_striker,DRIVERLESS_FLAG_striker,FIRE_FLAG_striker,GOING_DIRECTION_CODE_striker,GVW_CODE_striker,HARM_EVENT_CODE_striker,HAZMAT_SPILL_FLAG_striker,HIT_AND_RUN_FLAG_striker,HZM_NUM_striker,MOVEMENT_CODE_striker,NUM_AXLES_striker,PARKED_FLAG_striker,REPORT_NO_y,SPEED_LIMIT_striker,TOWED_AWAY_FLAG_striker,TOWED_VEHICLE_CONFIG_CODE_striker,VEHICLE_ID_striker,VEH_MAKE_striker,VEH_MODEL_striker,VEH_YEAR_striker,VIN_NO_striker,AREA_DAMAGED_CODE1_struck,AREA_DAMAGED_CODE2_struck,AREA_DAMAGED_CODE3_struck,AREA_DAMAGED_CODE_IMP1_struck,AREA_DAMAGED_CODE_MAIN_struck,BODY_TYPE_CODE_struck,COMMERCIAL_FLAG_struck,CONTI_DIRECTION_CODE_struck,CV_BODY_TYPE_CODE_struck,DAMAGE_CODE_struck,DRIVERLESS_FLAG_struck,FIRE_FLAG_struck,GOING_DIRECTION_CODE_struck,GVW_CODE_struck,HARM_EVENT_CODE_struck,HAZMAT_SPILL_FLAG_struck,HIT_AND_RUN_FLAG_struck,HZM_NUM_struck,MOVEMENT_CODE_struck,NUM_AXLES_struck,PARKED_FLAG_struck,SPEED_LIMIT_struck,TOWED_AWAY_FLAG_struck,TOWED_VEHICLE_CONFIG_CODE_struck,VEHICLE_ID_struck,VEH_MAKE_struck,VEH_MODEL_struck,VEH_YEAR_struck,VIN_NO_struck
0,1.0,,0.0,,N,C,0.0,1952-04-20 00:00:00,,0.0,1.0,,1.0,N,1,PA,,,,,,,48dd00ee-e033-47e7-ad1e-0b734020301b,D,AB4284000S,13.0,F,eb6aadb8-dacb-4744-a1a7-ab812c96f27f,0,1.0,12.0,,,12.0,12.0,2.0,N,N,,2,N,N,N,,1.0,,Y,,2.0,,N,AB4284000S,25,N,0,e95fc5fb-a269-4cce-9e09-643cd6a5bde7,TOYT,4S,2007.0,JTNBB46K073032308,6.0,,,6.0,6.0,2.0,N,N,,2,N,N,N,,1.0,,N,,6.0,,N,25,N,0,eb6aadb8-dacb-4744-a1a7-ab812c96f27f,HYND,SONATA,2015.0,5NPE34AB7FH113136
1,1.0,,0.0,,N,,0.0,1985-05-28 00:00:00,,0.0,0.0,,0.0,N,1,MD,,,,,,,166296bd-ffd3-4c16-aa74-4f4bf4139d8d,D,AB4313000X,13.0,F,b463eb20-2f01-4200-9d6f-b18888ce2593,0,1.0,3.0,,,3.0,3.0,2.0,N,E,,2,N,,E,,1.0,,Y,,2.0,,N,AB4313000X,25,N,0,3c8629d0-d524-47c1-bfbc-b18e07f3087e,LEXU,TK,2004.0,2T2HA31U24C031048,7.0,,,7.0,7.0,2.0,N,S,,2,N,N,S,,1.0,,N,,6.0,,N,25,N,0,b463eb20-2f01-4200-9d6f-b18888ce2593,CHEVY,TK,2007.0,2CNDL13F576252855
2,1.0,,0.0,,N,C,0.0,1960-10-04 00:00:00,,0.0,0.0,,0.0,Y,1,MD,,,,,,,f3b2743f-fbc3-4345-9419-56a0ca29102c,D,AB4313000X,13.0,F,3c8629d0-d524-47c1-bfbc-b18e07f3087e,0,1.0,7.0,,,7.0,7.0,2.0,N,S,,2,N,N,S,,1.0,,N,,6.0,,N,AB4313000X,25,N,0,b463eb20-2f01-4200-9d6f-b18888ce2593,CHEVY,TK,2007.0,2CNDL13F576252855,3.0,,,3.0,3.0,2.0,N,E,,2,N,,E,,1.0,,Y,,2.0,,N,25,N,0,3c8629d0-d524-47c1-bfbc-b18e07f3087e,LEXU,TK,2004.0,2T2HA31U24C031048


# Add the crash details

You did this already! I'm going to do it for you. We're merging on `REPORT_NO_x` because there are so many `REPORT_NO` columns duplicated across our files that pandas started giving them weird names.
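
If the `_x` naming seems mysterious, this small sketch reproduces it. When both dataframes have a column with the same name and it is not the merge key, pandas keeps both and adds `_x` and `_y` suffixes by default. The two toy dataframes are invented for illustration.

```python
import pandas as pd

# Both sides carry a REPORT_NO column, but we merge on the vehicle ID instead
toy_people = pd.DataFrame({'REPORT_NO': ['R1'], 'VEHICLE_ID': ['a']})
toy_cars = pd.DataFrame({'REPORT_NO': ['R1'], 'VEHICLE_ID_struck': ['a']})

toy_merged = toy_people.merge(toy_cars, left_on='VEHICLE_ID', right_on='VEHICLE_ID_struck')
print(list(toy_merged.columns))
```

The `REPORT_NO` from the left side becomes `REPORT_NO_x` and the one from the right becomes `REPORT_NO_y`.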

In [55]:
merged = merged.merge(combined, left_on='REPORT_NO_x', right_on='REPORT_NO')
merged.head(3)

Unnamed: 0,AIRBAG_DEPLOYED,ALCOHOL_TESTTYPE_CODE,ALCOHOL_TEST_CODE,BAC_CODE,CDL_FLAG,CLASS,CONDITION_CODE,DATE_OF_BIRTH,DRUG_TESTRESULT_CODE,DRUG_TEST_CODE,EJECT_CODE,EMS_UNIT_LABEL,EQUIP_PROB_CODE,FAULT_FLAG,INJ_SEVER_CODE,LICENSE_STATE_CODE,MOVEMENT_CODE,OCC_SEAT_POS_CODE,PED_LOCATION_CODE,PED_OBEY_CODE,PED_TYPE_CODE,PED_VISIBLE_CODE,PERSON_ID,PERSON_TYPE,REPORT_NO_x,SAF_EQUIP_CODE,SEX_CODE,VEHICLE_ID,is_fatality,genders,AREA_DAMAGED_CODE1_striker,AREA_DAMAGED_CODE2_striker,AREA_DAMAGED_CODE3_striker,AREA_DAMAGED_CODE_IMP1_striker,AREA_DAMAGED_CODE_MAIN_striker,BODY_TYPE_CODE_striker,COMMERCIAL_FLAG_striker,CONTI_DIRECTION_CODE_striker,CV_BODY_TYPE_CODE_striker,DAMAGE_CODE_striker,DRIVERLESS_FLAG_striker,FIRE_FLAG_striker,GOING_DIRECTION_CODE_striker,GVW_CODE_striker,HARM_EVENT_CODE_striker,HAZMAT_SPILL_FLAG_striker,HIT_AND_RUN_FLAG_striker,HZM_NUM_striker,MOVEMENT_CODE_striker,NUM_AXLES_striker,PARKED_FLAG_striker,REPORT_NO_y,SPEED_LIMIT_striker,TOWED_AWAY_FLAG_striker,TOWED_VEHICLE_CONFIG_CODE_striker,VEHICLE_ID_striker,VEH_MAKE_striker,VEH_MODEL_striker,VEH_YEAR_striker,VIN_NO_striker,AREA_DAMAGED_CODE1_struck,AREA_DAMAGED_CODE2_struck,AREA_DAMAGED_CODE3_struck,AREA_DAMAGED_CODE_IMP1_struck,AREA_DAMAGED_CODE_MAIN_struck,BODY_TYPE_CODE_struck,COMMERCIAL_FLAG_struck,CONTI_DIRECTION_CODE_struck,CV_BODY_TYPE_CODE_struck,DAMAGE_CODE_struck,DRIVERLESS_FLAG_struck,FIRE_FLAG_struck,GOING_DIRECTION_CODE_struck,GVW_CODE_struck,HARM_EVENT_CODE_struck,HAZMAT_SPILL_FLAG_struck,HIT_AND_RUN_FLAG_struck,HZM_NUM_struck,MOVEMENT_CODE_struck,NUM_AXLES_struck,PARKED_FLAG_struck,SPEED_LIMIT_struck,TOWED_AWAY_FLAG_struck,TOWED_VEHICLE_CONFIG_CODE_struck,VEHICLE_ID_struck,VEH_MAKE_struck,VEH_MODEL_struck,VEH_YEAR_struck,VIN_NO_struck,ACC_DATE,ACC_TIME,AGENCY_CODE,AREA_CODE,COLLISION_TYPE_CODE,COUNTY_NO,C_M_ZONE_FLAG,DISTANCE,DISTANCE_DIR_FLAG,FEET_MILES_FLAG,FIX_OBJ_CODE,HARM_EVENT_CODE1,HARM_EVENT_CODE2,JUNCTION_CODE,LANE_CODE,LATITUDE,LIGHT_CODE,LOC_CODE,LOGMILE_DIR_FLAG,LOG_MILE,LONGITUDE,MAINROAD_NAME,MUNI_CODE,RD_COND_CODE,RD_DIV_CODE,REFERENCE_NO,REFERENCE_ROAD_NAME,REFERENCE_SUFFIX,REFERENCE_TYPE_CODE,REPORT_NO,REPORT_TYPE,ROUTE_TYPE_CODE,RTE_NO,RTE_SUFFIX,SIGNAL_FLAG,SURF_COND_CODE,WEATHER_CODE
0,1.0,,0.0,,N,C,0.0,1952-04-20 00:00:00,,0.0,1.0,,1.0,N,1,PA,,,,,,,48dd00ee-e033-47e7-ad1e-0b734020301b,D,AB4284000S,13.0,F,eb6aadb8-dacb-4744-a1a7-ab812c96f27f,0,1.0,12.0,,,12.0,12.0,2.0,N,N,,2,N,N,N,,1.0,,Y,,2.0,,N,AB4284000S,25,N,0,e95fc5fb-a269-4cce-9e09-643cd6a5bde7,TOYT,4S,2007.0,JTNBB46K073032308,6.0,,,6.0,6.0,2.0,N,N,,2,N,N,N,,1.0,,N,,6.0,,N,25,N,0,eb6aadb8-dacb-4744-a1a7-ab812c96f27f,HYND,SONATA,2015.0,5NPE34AB7FH113136,2018-05-25 00:00:00,10:58:00,ANNAPOLIS,UNK,3,2.0,N,20.0,S,F,0.0,1.0,0.0,3.0,1.0,38.950029,1.0,,S,0.0,-76.492578,HILLSMERE DR,3.0,1.0,5.01,2930.0,BAY RIDGE RD,,CO,AB4284000S,Property Damage Crash,CO,2885.0,,Y,2.0,6.01
1,0.0,,0.0,,N,C,1.0,1994-12-13 00:00:00,,0.0,1.0,,1.0,Y,1,MD,,,,,,,974e41b6-480e-431f-9585-4b87f4c29b19,D,AB4284000S,13.0,F,e95fc5fb-a269-4cce-9e09-643cd6a5bde7,0,1.0,6.0,,,6.0,6.0,2.0,N,N,,2,N,N,N,,1.0,,N,,6.0,,N,AB4284000S,25,N,0,eb6aadb8-dacb-4744-a1a7-ab812c96f27f,HYND,SONATA,2015.0,5NPE34AB7FH113136,12.0,,,12.0,12.0,2.0,N,N,,2,N,N,N,,1.0,,Y,,2.0,,N,25,N,0,e95fc5fb-a269-4cce-9e09-643cd6a5bde7,TOYT,4S,2007.0,JTNBB46K073032308,2018-05-25 00:00:00,10:58:00,ANNAPOLIS,UNK,3,2.0,N,20.0,S,F,0.0,1.0,0.0,3.0,1.0,38.950029,1.0,,S,0.0,-76.492578,HILLSMERE DR,3.0,1.0,5.01,2930.0,BAY RIDGE RD,,CO,AB4284000S,Property Damage Crash,CO,2885.0,,Y,2.0,6.01
2,1.0,,0.0,,N,,0.0,1985-05-28 00:00:00,,0.0,0.0,,0.0,N,1,MD,,,,,,,166296bd-ffd3-4c16-aa74-4f4bf4139d8d,D,AB4313000X,13.0,F,b463eb20-2f01-4200-9d6f-b18888ce2593,0,1.0,3.0,,,3.0,3.0,2.0,N,E,,2,N,,E,,1.0,,Y,,2.0,,N,AB4313000X,25,N,0,3c8629d0-d524-47c1-bfbc-b18e07f3087e,LEXU,TK,2004.0,2T2HA31U24C031048,7.0,,,7.0,7.0,2.0,N,S,,2,N,N,S,,1.0,,N,,6.0,,N,25,N,0,b463eb20-2f01-4200-9d6f-b18888ce2593,CHEVY,TK,2007.0,2CNDL13F576252855,2018-04-19 00:00:00,15:50:00,ANNAPOLIS,UNK,3,2.0,N,0.0,S,F,0.0,1.0,0.0,0.0,1.0,38.976882,1.0,,N,0.01,-76.486787,COMPROMISE ST,3.0,0.0,1.0,960.0,DUKE OF GLOUCHESTER ST,,MU,AB4313000X,Property Damage Crash,MU,740.0,,N,0.0,7.01


## Filter

We already filtered out vehicles by weight, so we don't have to do that again.

# Calculated features

I'm sure you forgot what all the features are, so we'll bring back whether there was a fatality or not.

## Feature: Accident was fatal

In [56]:
merged['had_fatality'] = (merged.INJ_SEVER_CODE == 5).astype(int)
merged.had_fatality.value_counts()

0    601107
1       598
Name: had_fatality, dtype: int64

## Feature: Weight difference

**Remove everything missing weights for strikers or struck vehicles.** You might need to `merged.columns` to remind yourself what the column names are.

In [57]:
merged.columns

Index(['AIRBAG_DEPLOYED', 'ALCOHOL_TESTTYPE_CODE', 'ALCOHOL_TEST_CODE',
       'BAC_CODE', 'CDL_FLAG', 'CLASS', 'CONDITION_CODE', 'DATE_OF_BIRTH',
       'DRUG_TESTRESULT_CODE', 'DRUG_TEST_CODE',
       ...
       'REFERENCE_TYPE_CODE', 'REPORT_NO', 'REPORT_TYPE', 'ROUTE_TYPE_CODE',
       'RTE_NO', 'RTE_SUFFIX', 'SIGNAL_FLAG', 'SURF_COND_CODE', 'WEATHER_CODE',
       'had_fatality'],
      dtype='object', length=127)

In [59]:
# merged = merged.dropna(subset=['weight_striker', 'weight_struck'])
# merged.head() 

Confirm your dataset has 334,396 rows.

Create a new feature called `weight_diff` about how much heavier the striking car was compared to the struck car. **Make sure you've done the math correctly!**

### Feature adjustment

Make all of your weight columns in **thousands of pounds** instead of just in pounds. It'll help you interpret your results much better.
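
As a rough sketch of the two steps above, here is the arithmetic on a toy dataframe. It assumes the weight columns are called `weight_striker` and `weight_struck` (the names used in the commented-out `dropna` cell); the weights themselves are invented.

```python
import pandas as pd

# Hypothetical weights in pounds: the same two cars, paired both ways round
toy = pd.DataFrame({
    'weight_striker': [4500.0, 3000.0],
    'weight_struck': [3000.0, 4500.0],
})

# Convert to thousands of pounds first, so the difference is in the same units
toy['weight_striker'] = toy.weight_striker / 1000
toy['weight_struck'] = toy.weight_struck / 1000

# Positive means the striking car was heavier than the struck car
toy['weight_diff'] = toy.weight_striker - toy.weight_struck
print(toy)
```

The sign is the part that is easy to get backwards: a positive `weight_diff` should mean the striker outweighs the car the person was sitting in.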

# Another regression!!!

**What is the impact of weight difference on fatality rate?** Create your `train_df`, drop missing values, run your regression, analyze your odds ratios.

Please translate your odds ratio into plain English.

## Adding in more features

How about speed limit? That's important, right? We can add the speed limit of the striking vehicle with `SPEED_LIMIT_striker`.

Can you translate the speed limit odds ratio into plain English?

### Feature engineering: Speed limits

Honestly, that's a pretty bad way to go about things. What's more fun is if we **translate speed limits into bins.**

First, we'll use `pd.cut` to assign each speed limit a category.

In [None]:
speed_bins = [-np.inf, 10, 20, 30, 40, 50, np.inf]
merged['speed_bin'] = pd.cut(merged.SPEED_LIMIT_striker, bins=speed_bins)
merged[['SPEED_LIMIT_striker', 'speed_bin']].head(10)
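
To get a feel for where the bin edges fall, here is `pd.cut` on a handful of made-up speed limits, using the same bin edges as above. By default the bins are closed on the right, so a limit of exactly 30 lands in `(20.0, 30.0]` rather than `(30.0, 40.0]`.

```python
import numpy as np
import pandas as pd

speed_bins = [-np.inf, 10, 20, 30, 40, 50, np.inf]

# Invented speed limits, including one sitting exactly on a bin edge
toy_limits = pd.Series([5, 25, 30, 35, 65])
toy_binned = pd.cut(toy_limits, bins=speed_bins)
print(pd.DataFrame({'limit': toy_limits, 'bin': toy_binned}))
```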

Then we'll one-hot encode the bins, dropping the 20-30 mph bin so it becomes the reference category that every other bin is compared against.

In [None]:
speed_dummies = pd.get_dummies(merged.speed_bin, 
                               prefix='speed').drop('speed_(20.0, 30.0]', axis=1)
speed_dummies.head()
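
Here is the same one-hot encoding on a toy series, to show what dropping the reference category does. The speed limits are invented, one per bin.

```python
import numpy as np
import pandas as pd

speed_bins = [-np.inf, 10, 20, 30, 40, 50, np.inf]

# One made-up speed limit in each of the six bins
toy_binned = pd.cut(pd.Series([5, 15, 25, 35, 45, 65]), bins=speed_bins)

# Six bins become six 0/1 columns, then we drop the reference bin
toy_dummies = pd.get_dummies(toy_binned, prefix='speed')
toy_dummies = toy_dummies.drop('speed_(20.0, 30.0]', axis=1)
print(toy_dummies)
```

The row for the 25 mph limit has no 1 in any remaining column, which is how the regression knows it belongs to the reference group.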

## Running a regression

I like this layout for creating `train_df`. It lets us easily add dummies and do a little replacing/encoding when we're building binary features, like the one for sex.

> If the below gives you an error, it's because `SEX_CODE` is already a number. In that case, just remove `.replace({'M': 1, 'F': 0, 'U': np.nan })`.

In [None]:
# Start with our normal features
train_df = pd.DataFrame({
    'weight_diff': merged.weight_diff,
    'sex': merged.SEX_CODE,#.replace({'M': 1, 'F': 0, 'U': np.nan }),
    'had_fatality': merged.had_fatality,
})
# Add the one-hot encoded features
train_df = train_df.join(speed_dummies)
train_df = train_df.join(surf_dummies)
# Drop missing values
train_df = train_df.dropna()
train_df.head()

Describe the impact of the different variables in simple language. What has the largest impact?

## Now you pick the features

Up above you have examples of:

* Creating features from numbers (speed limits)
* Creating features from 0/1 (sex)
* Creating features from binning numbers that are one-hot encoded (speed limit bins - `speed_bins`)
* Creating features from categories that are one-hot encoded (surface - `surf_dummies`)
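
For the last pattern in the list, a category-based dummy set such as `surf_dummies` could be built along these lines. This is a guess at the shape of it rather than the notebook's actual cell: the surface codes are invented, and the choice of code `2` as the reference category is an assumption made for the example.

```python
import pandas as pd

# Hypothetical surface condition codes
toy = pd.DataFrame({'SURF_COND_CODE': [2, 1, 2, 3]})

# One 0/1 column per code, then drop the code chosen as the reference
toy_surf_dummies = pd.get_dummies(toy.SURF_COND_CODE, prefix='surface')
toy_surf_dummies = toy_surf_dummies.drop('surface_2', axis=1)
print(toy_surf_dummies)
```

The same two lines should work for most categorical columns; the main decision is which category makes a sensible baseline to compare against.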

What else do you think matters? Try to plug in more features and see if you can get anything interesting.

> * **Hot tip:** The funniest/most interesting feature you can add is also the dumbest. Ask me about it in #algorithms if you end up getting down here.