<a href="https://colab.research.google.com/github/jamesluttringer2019/DS-Unit-2-Applied-Modeling/blob/master/module2/LS_DS_232_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 3, Module 2*

---


# Wrangle ML datasets

- [ ] Continue to clean and explore your data. 
- [ ] For the evaluation metric you chose, what score would you get just by guessing?
- [ ] Can you make a fast, first model that beats guessing?

**We recommend that you use your portfolio project dataset for all assignments this sprint.**

**But if you aren't ready yet, or you want more practice, then use the New York City property sales dataset for today's assignment.** Follow the instructions below to keep only the Tribeca neighborhood subset and to remove outliers or dirty data. [Here's a video walkthrough](https://youtu.be/pPWFw8UtBVg?t=584) you can refer to if you get stuck or want hints!

- Data Source: [NYC OpenData: NYC Citywide Rolling Calendar Sales](https://data.cityofnewyork.us/dataset/NYC-Citywide-Rolling-Calendar-Sales/usep-8jbt)
- Glossary: [NYC Department of Finance: Rolling Sales Data](https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page)

Your code starts here:

In [167]:
pip install category_encoders

Collecting category_encoders
Installing collected packages: category-encoders
Successfully installed category-encoders-2.1.0


In [0]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
import category_encoders as ce
import re

In [0]:
df = pd.read_csv('https://github.com/jamesluttringer2019/unit2build/raw/master/trafficsample.csv', low_memory=False)

In [220]:
train, val = train_test_split(df, train_size=.8, test_size=.2, random_state=13)
train.shape, val.shape

((80000, 26), (20000, 26))

In [0]:
def wrangle(X):
  #avoid copy warning
  X = X.copy()

  #round times to nearest 5 minute interval
  #(vectorized so minutes carry into the next hour, e.g. 08:58 -> 09:00;
  # missing or unparseable times become NaN)
  times = pd.to_datetime(X['stop_time'], format='%H:%M', errors='coerce')
  X['stop_time'] = times.dt.round('5min').dt.strftime('%H:%M')

  #create individual columns for year, month, and day
  X['stop_date'] = pd.to_datetime(X['stop_date'])
  X['year'] = X['stop_date'].dt.year
  X['month'] = X['stop_date'].dt.month
  X['day'] = X['stop_date'].dt.day

  #Separate vehicle make and year into their own columns
  make = []
  car_year = []
  for vehicle in X['vehicle_type']:
    make.append(re.search(r'^\w+|$', vehicle)[0])
    car_year.append(re.findall(r'\d+|$', vehicle)[0])

  X['vehicle_make'] = make
  X['vehicle_year'] = car_year

  #assume a null drugs_related_stop means False
  #(assign the result back; inplace=True on a selected column is unreliable)
  X['drugs_related_stop'] = X['drugs_related_stop'].fillna(False)

  #fill nulls in fine_grained_location
  X['fine_grained_location'] = X['fine_grained_location'].fillna(0)

  #drop repetitive or unwanted columns
  drop_cols = ['id','state','location_raw', 'county_name',
             'county_fips','police_department','driver_race',
             'violation', 'district', 'search_type_raw',
             'stop_date', 'vehicle_type', 'is_arrested', 'search_type']
  X = X.drop(drop_cols, axis=1) 
  return X
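
The five-minute rounding step in `wrangle` needs care near the top of the hour: `08:58` should become `09:00`, not `08:60`. Here is a minimal standalone sketch using only the standard library. The helper name `round_to_5min` is hypothetical, and it assumes every time is a zero-padded `HH:MM` string.

```python
from datetime import datetime

def round_to_5min(hhmm):
    """Round an 'HH:MM' string to the nearest 5 minutes (hypothetical helper)."""
    t = datetime.strptime(hhmm, '%H:%M')
    total = t.hour * 60 + t.minute
    # Whole minutes never fall exactly halfway between multiples of 5,
    # so adding 2 and floor-dividing by 5 picks the nearest multiple.
    rounded = ((total + 2) // 5) * 5
    # Wrap around midnight so that 23:58 becomes 00:00.
    rounded %= 24 * 60
    return f'{rounded // 60:02d}:{rounded % 60:02d}'

for raw in ['08:42', '08:43', '08:58', '23:58']:
    print(raw, '->', round_to_5min(raw))
```

Working in total minutes since midnight means the carry into the next hour needs no special case.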

In [0]:
train = wrangle(train)
val = wrangle(val)
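
The vehicle parsing inside `wrangle` relies on the `|$` alternative in each regex, which matches the empty string at the end of the input, so the lookup never fails. The sketch below isolates that trick. The helper name `split_vehicle` is hypothetical, and the `'Toyo 1998'` layout is an assumption based on the `vehicle_make` and `vehicle_year` values shown in this notebook.

```python
import re

def split_vehicle(vehicle):
    """Return (make, year) from a string such as 'Toyo 1998' (hypothetical helper)."""
    # '^\w+' grabs the leading word; '|$' falls back to an empty match
    # at the end of the string when there is no leading word.
    make = re.search(r'^\w+|$', vehicle)[0]
    # findall returns every run of digits plus the trailing empty match,
    # so element 0 is the first number if present, otherwise ''.
    year = re.findall(r'\d+|$', vehicle)[0]
    return make, year

print(split_vehicle('Toyo 1998'))
print(split_vehicle('Toyo'))
```

The fallback only covers strings. A missing value stored as a float `NaN` would still raise a `TypeError`, so nulls need to be filled before this step runs.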

In [217]:
#check nulls to make sure they will be ok to impute
train.isnull().sum(), val.isnull().sum()

(stop_time                   47
 fine_grained_location        0
 driver_gender                0
 driver_age_raw               0
 driver_age                  61
 driver_race_raw              0
 violation_raw                0
 search_conducted             0
 contraband_found             0
 stop_outcome                 0
 stop_duration            21889
 drugs_related_stop           0
 year                         0
 month                        0
 day                          0
 vehicle_make                 0
 vehicle_year                 0
 dtype: int64, stop_time                  13
 fine_grained_location       0
 driver_gender               0
 driver_age_raw              0
 driver_age                 18
 driver_race_raw             0
 violation_raw               0
 search_conducted            0
 contraband_found            0
 stop_outcome                0
 stop_duration            5433
 drugs_related_stop          0
 year                        0
 month                       0
 day    

In [222]:
train.head()

Unnamed: 0,stop_time,fine_grained_location,driver_gender,driver_age_raw,driver_age,driver_race_raw,violation_raw,search_conducted,contraband_found,stop_outcome,stop_duration,drugs_related_stop,year,month,day,vehicle_make,vehicle_year
61975,08:40,7,M,1947.0,61.0,Caucasian,"Moving Violation,Speed",False,False,Written Warning,15.0,False,2008,9,28,Toyo,1998
75905,21:15,10,F,1982.0,26.0,Caucasian,"Moving Violation,Speed",False,False,Written Warning,9.0,False,2008,4,21,Chev,2001
90174,00:20,17,M,1976.0,31.0,African American,"Moving Violation,Speed",False,False,Citation,14.0,False,2007,9,30,Linc,2003
60156,22:35,10,M,1990.0,24.0,Asian/Pacific Islander,"Equipment,Not applicable",False,False,Written Warning,7.0,False,2014,5,17,Kia,2004
62373,02:10,15,M,1986.0,20.0,African American,"Moving Violation,SeatBelt",False,False,Citation,,False,2006,7,2,Satu,1994


In [0]:
#break into X matrices and y vectors
target = 'driver_gender'
features = train.columns.drop(target)

X_train = train[features]
y_train = train[target]

X_val = val[features]
y_val = val[target]

In [226]:
#get baseline f1_score by guessing majority class
y_pred = ['M'] * len(y_val)
f1_score(y_val, y_pred, average='weighted')


0.5889985113041443
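
The majority-class baseline can also be worked out by hand, which shows where the number comes from. If the majority class makes up a share `p` of the labels, always predicting it gives that class precision `p` and recall 1, so its F1 is `2p / (1 + p)`. Every other class is never predicted, so its F1 counts as 0 (scikit-learn typically warns that precision is ill-defined in that case). Weighting by class share gives `2p² / (1 + p)`. Here is a minimal sketch with a hypothetical helper and toy labels, taking the majority class from the training labels rather than hard-coding it:

```python
from collections import Counter

def majority_baseline_weighted_f1(y_fit, y_eval):
    """Weighted F1 from always predicting the most common label in y_fit
    (hypothetical helper, not part of scikit-learn)."""
    majority = Counter(y_fit).most_common(1)[0][0]
    # Share of the evaluation labels that belong to the majority class.
    p = sum(1 for label in y_eval if label == majority) / len(y_eval)
    # Majority class: F1 = 2p / (1 + p), weighted by its share p.
    # All other classes contribute 0 because they are never predicted.
    return majority, 2 * p * p / (1 + p)

# Toy labels, made up for illustration only.
y_fit_toy = ['M'] * 7 + ['F'] * 3
y_eval_toy = ['M'] * 7 + ['F'] * 3
label, score = majority_baseline_weighted_f1(y_fit_toy, y_eval_toy)
print(label, round(score, 4))
```

Any first model should be compared against this number on the same validation labels.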

In [0]:
#create pipeline to fit model
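
One way this cell could be filled in is sketched below. It is not the author's pipeline. It assumes a `LogisticRegression` classifier (the target `driver_gender` is categorical) and swaps in scikit-learn's `OrdinalEncoder` for `category_encoders` to keep the example self-contained. The toy frame and its values are hypothetical stand-ins for the wrangled data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy rows standing in for the wrangled traffic-stop data.
toy = pd.DataFrame({
    'violation_raw': ['Speed', 'Equipment', 'Speed', 'SeatBelt'] * 10,
    'driver_age': [61.0, 26.0, np.nan, 24.0] * 10,
    'driver_gender': ['M', 'F', 'M', 'F'] * 10,
})
X_toy = toy.drop(columns='driver_gender')
y_toy = toy['driver_gender']

# Encode the categorical column as integers and impute the numeric one.
preprocess = make_column_transformer(
    (OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1),
     ['violation_raw']),
    (SimpleImputer(strategy='median'), ['driver_age']),
)
model = make_pipeline(preprocess, LogisticRegression(max_iter=1000))
model.fit(X_toy, y_toy)

# Scored on the same toy rows only to show the mechanics.
toy_score = f1_score(y_toy, model.predict(X_toy), average='weighted')
print(toy_score)
```

In the notebook itself the same pattern would be fit on `X_train`, `y_train` and scored on `X_val`, `y_val`, so the result can be compared against the majority-class baseline.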
