# WAZE TRAFFIC ALERT DATA INITIAL EXPLORATION

**Note to Nateé**: I signed a document stating that I would not publish this dataset. Please take a look at it here. If you're interested in working with this data I'll give it to you on Monday. 

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
waze = pd.read_csv('../../data/01_raw/waze.csv')

## WHOLE DATASET

### Discovered Issues: 

1. **country** - There are 15 unique values in the country variable (14 strings plus NaN). On closer inspection, the 355 entries with a value other than US have NaN values in nearly every other column. These rows should be dropped during cleaning.
1. **nTHumbsUp** - The vast majority of this column is 0. It does not look like it will add useful information, so it should be dropped during data cleaning.
1. **type** - WEATHERHAZARD needs to be split into hazard and weather hazard.
1. **street** - About 2% of street names in the dataset are missing (roughly 230K out of 10 million).
1. **pubMillis** - This needs to be converted to a timestamp (Unix time, milliseconds since epoch).
1. **scrape_dt** - This is the date and time at which the data was scraped from the site.
1. **subtype** - Change all NaN values to NO_SUBTYPE.
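
As a rough sketch of how the country, nTHumbsUp, and subtype items above might be handled during cleaning, the snippet below works on a tiny made-up frame rather than the real file. The helper name `basic_clean` and the toy values are assumptions for illustration only, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "country": ["US", "US", " s", np.nan],
    "nTHumbsUp": [0.0, 0.0, 17.0, np.nan],
    "type": ["JAM", "ACCIDENT", np.nan, np.nan],
    "subtype": ["JAM_HEAVY_TRAFFIC", np.nan, np.nan, np.nan],
})

def basic_clean(df):
    """Hypothetical helper: keep US rows, drop nTHumbsUp, label missing subtypes."""
    # Comparing NaN to 'US' is False, so rows with a missing country are dropped too.
    cleaned = df.loc[df["country"] == "US"].drop(columns=["nTHumbsUp"]).copy()
    cleaned["subtype"] = cleaned["subtype"].fillna("NO_SUBTYPE")
    return cleaned.reset_index(drop=True)

cleaned = basic_clean(toy)
print(cleaned)
```

Filtering on `country == "US"` removes both the garbled country values and the rows where country is missing, which together make up the 355 rows flagged above.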

In [4]:
waze.head()

Unnamed: 0,country,nTHumbsUp,city,reportRating,confidence,reliability,type,uuid,roadType,magvar,subtype,street,location_x,location_y,pubMillis,reportDescription,scrape_dt
0,US,0.0,"Joliet, IL",0.0,0.0,7.0,ROAD_CLOSED,12d90f41-fd58-3d73-9bac-d2d24a4e1dbb,0.0,0.0,ROAD_CLOSED_EVENT,Briggs St,-88.04366,41.54111,1510536000000.0,Road Closed,2017-11-15T09:21:00Z
1,US,0.0,,0.0,3.0,10.0,ROAD_CLOSED,6bd6a1ff-55b3-3e57-8a57-13ef89e4b391,0.0,0.0,ROAD_CLOSED_EVENT,Smith Rd,-88.02769,41.64151,1504565000000.0,Construction,2017-11-15T09:21:00Z
2,US,0.0,"Lemont, IL",0.0,0.0,7.0,ROAD_CLOSED,0e957bc8-cdcc-395e-913c-facbef9a4494,0.0,0.0,ROAD_CLOSED_EVENT,135th St,-88.00416,41.64081,1510366000000.0,Road Closed,2017-11-15T09:21:00Z
3,US,0.0,"Joliet, IL",0.0,0.0,6.0,ROAD_CLOSED,dfa91aba-dc86-365d-b270-0572fca5ccab,0.0,0.0,ROAD_CLOSED_EVENT,Briggs St,-88.04355,41.53795,1510536000000.0,Road Closed,2017-11-15T09:21:00Z
4,US,0.0,"Lemont, IL",0.0,0.0,7.0,ROAD_CLOSED,4e9af18a-fde4-3f5b-9aeb-1d471988b735,0.0,0.0,ROAD_CLOSED_EVENT,135th St,-88.00494,41.64137,1510366000000.0,Road Closed,2017-11-15T09:21:00Z


In [3]:
waze.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10055207 entries, 0 to 10055206
Data columns (total 17 columns):
country              object
nTHumbsUp            float64
city                 object
reportRating         float64
confidence           float64
reliability          float64
type                 object
uuid                 object
roadType             float64
magvar               float64
subtype              object
street               object
location_x           float64
location_y           float64
pubMillis            float64
reportDescription    object
scrape_dt            object
dtypes: float64(9), object(8)
memory usage: 1.3+ GB


In [65]:
waze.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
nTHumbsUp,10055199.0,0.0006029717,0.1040078,0.0,0.0,0.0,0.0,18.0
reportRating,10054852.0,1.968055,1.529827,0.0,0.0,2.0,3.0,5.0
confidence,10054852.0,0.6834754,1.406939,-2.0,0.0,0.0,1.0,5.0
reliability,10054852.0,6.416892,1.788209,5.0,5.0,6.0,7.0,10.0
roadType,10054852.0,3.026482,2.55685,0.0,0.0,3.0,4.0,20.0
magvar,10054852.0,145.7354,121.359,0.0,0.0,131.0,268.0,359.0
location_x,10054852.0,-87.78549,0.1426022,-88.13767,-87.88483,-87.74752,-87.6624,-87.55329
location_y,10054852.0,41.86983,0.12727,41.51103,41.80239,41.88302,41.96277,42.11177
pubMillis,10054852.0,1515847000000.0,4185774000.0,1500986000000.0,1512685000000.0,1516810000000.0,1519243000000.0,1522050000000.0


In [6]:
print('Number of Observations in the Dataset: ',len(waze))

Number of Observations in the Dataset:  10055207


In [7]:
waze.isnull().sum()

country                  175
nTHumbsUp                  8
city                 2761258
reportRating             355
confidence               355
reliability              355
type                     355
uuid                     355
roadType                 355
magvar                   355
subtype               403764
street                230610
location_x               355
location_y               355
pubMillis                355
reportDescription    8031443
scrape_dt                488
dtype: int64

## INDIVIDUAL VARIABLES

Let's take a look at each individual variable and see what we can learn about them.

### Country Variable

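The vast majority of values in country are US; 180 entries hold short garbled strings and another 175 are missing. Since this dataset is focused on Illinois, these look like mistakes rather than real countries. The inspection further down shows that these rows are empty in the other columns, so they should be dropped rather than relabeled as US.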

In [13]:
print('Number of Unique Values in Country: ',len(waze.country.unique()))

Number of Unique Values in Country:  15


In [12]:
waze.country.value_counts()

US    10054852
 s          54
 w          48
."          12
 C           8
 r           8
,"           8
d"           8
gh           7
ri           6
my           6
on           6
âœ           5
se           4
Name: country, dtype: int64

The entries other than US have NaN values in nearly every other column. These rows should be dropped.

In [15]:
waze.loc[waze['country'] != 'US'].head()

Unnamed: 0,country,nTHumbsUp,city,reportRating,confidence,reliability,type,uuid,roadType,magvar,subtype,street,location_x,location_y,pubMillis,reportDescription,scrape_dt
2371069,âœ,17.0,,,,,,,,,,,,,,,
2371974,âœ,17.0,,,,,,,,,,,,,,,
2372903,âœ,17.0,,,,,,,,,,,,,,,
2373883,âœ,17.0,,,,,,,,,,,,,,,
2374911,âœ,17.0,,,,,,,,,,,,,,,
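
One way to check the claim for every non-US row, and not just the first five, is sketched below on a small made-up frame. The toy values and variable names are assumptions. nTHumbsUp is excluded from the check because it still holds values in these rows.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "country": ["US", "âœ", np.nan],
    "nTHumbsUp": [0.0, 17.0, np.nan],
    "type": ["JAM", np.nan, np.nan],
    "street": ["Briggs St", np.nan, np.nan],
})

# `!= 'US'` is True for NaN as well, so missing countries are included here.
non_us = toy.loc[toy["country"] != "US"]

# Check every remaining column apart from country and nTHumbsUp.
other_cols = non_us.drop(columns=["country", "nTHumbsUp"])
all_empty = bool(other_cols.isna().all().all())

print("Non-US rows:", len(non_us))
print("All other columns empty:", all_empty)
```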


### nTHumbsUp

This column is almost entirely 0: only 338 entries are nonzero, and all of those are 17 or 18. It adds little information and should be dropped.

In [19]:
waze.nTHumbsUp.unique()

array([ 0., 17., 18., nan])

In [20]:
waze.nTHumbsUp.sum()

6063.0

In [21]:
waze.nTHumbsUp.value_counts()

0.0     10054861
18.0         317
17.0          21
Name: nTHumbsUp, dtype: int64

### city

All locations are in the Chicago area. 27% of the city column is empty. The dataset also has coordinate columns (longitude in location_x, latitude in location_y), so we may not need this column anyway. We may also be able to infer the exact city from those coordinates later.

In [45]:
waze.city.value_counts().head()

Chicago, IL        5155542
Niles, IL           186732
Westmont, IL        138338
Des Plaines, IL      92935
Dixmoor, IL          84635
Name: city, dtype: int64

In [29]:
print('Ratio of Empty City Observation to All observations: ', waze.city.isnull().sum()/len(waze))

Ratio of Empty City Observation to All observations:  0.27460976188754743


### Report Rating

In [34]:
waze.reportRating.value_counts()

0.0    2653131
2.0    2341233
3.0    2230505
1.0    1158867
4.0    1099923
5.0     571193
Name: reportRating, dtype: int64

In [36]:
waze.reportRating.isnull().sum()

355

### confidence

In [37]:
waze.confidence.isnull().sum()

355

In [38]:
waze.confidence.value_counts()

 0.0    7356841
 1.0    1097641
 5.0     640066
 2.0     501556
 3.0     263412
 4.0     195253
-1.0         79
-2.0          4
Name: confidence, dtype: int64

### Reliability

The only null values are in the 355 observations that we will drop. 

In [42]:
print('Number of Null Values: ',waze.reliability.isnull().sum())

Number of Null Values:  355


In [39]:
waze.reliability.value_counts()

5.0     4568429
6.0     2264024
10.0    1487991
7.0      944369
8.0      506231
9.0      283808
Name: reliability, dtype: int64

### type

The only null values are in the 355 observations that we will drop. 

In [44]:
waze.type.isnull().sum()

355

In [43]:
waze.type.value_counts()

WEATHERHAZARD    4193220
JAM              3381627
ROAD_CLOSED      2243331
ACCIDENT          236674
Name: type, dtype: int64

### uuid

In [46]:
waze.uuid.isnull().sum()

355

In [48]:
waze.uuid.nunique()

931216

Examples of uuid

In [50]:
waze.uuid.head()

0    12d90f41-fd58-3d73-9bac-d2d24a4e1dbb
1    6bd6a1ff-55b3-3e57-8a57-13ef89e4b391
2    0e957bc8-cdcc-395e-913c-facbef9a4494
3    dfa91aba-dc86-365d-b270-0572fca5ccab
4    4e9af18a-fde4-3f5b-9aeb-1d471988b735
Name: uuid, dtype: object

### roadType

In [54]:
print('Number of Unique Roadtypes: ', waze.roadType.nunique())

Number of Unique Roadtypes:  11


In [55]:
waze.roadType.value_counts()

3.0     3192477
0.0     2575496
6.0     1455295
7.0     1001827
2.0      809507
4.0      590278
1.0      392640
20.0      33870
17.0       3444
8.0          12
5.0           6
Name: roadType, dtype: int64

In [57]:
print('Number of Null Values: ',waze.roadType.isnull().sum())

Number of Null Values:  355


### magvar

In [59]:
print('Number of Unique Values: ', waze.magvar.nunique())

Number of Unique Values:  360


In [67]:
print('Number of Nan values: ',waze.magvar.isnull().sum())

Number of Nan values:  355


### subtype

In [69]:
print('Number of Unique Subtypes: ', waze.subtype.nunique())

Number of Unique Subtypes:  25


It looks like the WEATHERHAZARD type mixes weather hazards (subtypes starting with HAZARD_WEATHER) with regular hazards on the road or shoulder.

In [76]:
waze.loc[waze['type']=='WEATHERHAZARD'].subtype.unique()

array(['HAZARD_ON_SHOULDER_CAR_STOPPED', 'HAZARD_ON_ROAD_OBJECT',
       'HAZARD_ON_ROAD_CAR_STOPPED', 'HAZARD_ON_ROAD_POT_HOLE',
       'HAZARD_ON_ROAD_TRAFFIC_LIGHT_FAULT', 'HAZARD_WEATHER_FLOOD',
       'HAZARD_ON_ROAD_CONSTRUCTION', nan, 'HAZARD_ON_ROAD_ROAD_KILL',
       'HAZARD_WEATHER_FOG', 'HAZARD_ON_SHOULDER_MISSING_SIGN',
       'HAZARD_ON_ROAD', 'HAZARD_ON_SHOULDER',
       'HAZARD_ON_SHOULDER_ANIMALS', 'HAZARD_WEATHER',
       'HAZARD_ON_ROAD_ICE', 'HAZARD_WEATHER_HAIL',
       'HAZARD_WEATHER_HEAVY_SNOW'], dtype=object)
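
A minimal sketch of one possible split, based only on the subtype prefix, is shown below on made-up rows. The new column name `type_split` and the labels `WEATHER_HAZARD` and `HAZARD` are assumptions for illustration. Under this rule, WEATHERHAZARD rows with a missing subtype fall back to `HAZARD`, and a subtype such as HAZARD_ON_ROAD_ICE is also treated as a regular hazard, which may or may not be what we want.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "type": ["WEATHERHAZARD", "WEATHERHAZARD", "WEATHERHAZARD", "JAM"],
    "subtype": ["HAZARD_WEATHER_FOG", "HAZARD_ON_ROAD_POT_HOLE", np.nan, "JAM_HEAVY_TRAFFIC"],
})

is_wh = toy["type"] == "WEATHERHAZARD"
# na=False treats a missing subtype as "not weather".
is_weather = toy["subtype"].str.startswith("HAZARD_WEATHER", na=False).astype(bool)

# Start from the original type and overwrite only the WEATHERHAZARD rows.
toy["type_split"] = toy["type"]
toy.loc[is_wh & is_weather, "type_split"] = "WEATHER_HAZARD"
toy.loc[is_wh & ~is_weather, "type_split"] = "HAZARD"

print(toy)
```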

In [77]:
waze.loc[waze['type']=='JAM'].subtype.unique()

array(['JAM_MODERATE_TRAFFIC', 'JAM_HEAVY_TRAFFIC',
       'JAM_STAND_STILL_TRAFFIC', nan], dtype=object)

In [78]:
waze.loc[waze['type']=='ROAD_CLOSED'].subtype.unique()

array(['ROAD_CLOSED_EVENT', nan, 'ROAD_CLOSED_CONSTRUCTION',
       'ROAD_CLOSED_HAZARD'], dtype=object)

In [79]:
waze.loc[waze['type']=='ACCIDENT'].subtype.unique()

array([nan, 'ACCIDENT_MINOR', 'ACCIDENT_MAJOR'], dtype=object)

In [107]:
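# NaN never compares equal to anything (including itself), so this filter always returns an empty frame.
# Use pd.isnull() / .isna() to select the missing subtypes instead, as in the next cell.
waze.loc[waze['subtype']==np.nan]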

Unnamed: 0,country,nTHumbsUp,city,reportRating,confidence,reliability,type,uuid,roadType,magvar,subtype,street,location_x,location_y,pubMillis,reportDescription,scrape_dt


In [108]:
bool_series = pd.isnull(waze["subtype"])

In [113]:
waze[bool_series].type.unique()

array(['ACCIDENT', 'ROAD_CLOSED', 'JAM', 'WEATHERHAZARD', nan],
      dtype=object)

### street

Around 2% of the street names are missing from the dataset.

In [82]:
print('Rate of missing street names: ',waze.street.isnull().sum()/len(waze))

Rate of missing street names:  0.022934386134467445


In [83]:
waze.street.isnull().sum()

230610

### location_x & location_y

In [84]:
waze.location_x.isnull().sum()

355

In [85]:
waze.location_y.isnull().sum()

355

### pubMillis

In [90]:
waze.pubMillis.isnull().sum()

355

In [89]:
type(waze.pubMillis[0])

numpy.float64
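
Since pubMillis is stored as a float holding Unix time in milliseconds, one way to turn it into a timestamp is `pd.to_datetime` with `unit="ms"`. The sketch below uses two values from the head of the dataset plus a NaN. Treating the result as UTC is an assumption, and the name `pub_time` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Two pubMillis values as displayed in waze.head(), plus a missing value.
pub_millis = pd.Series([1510536000000.0, 1504565000000.0, np.nan], name="pubMillis")

# unit="ms" interprets the numbers as milliseconds since the Unix epoch;
# NaN becomes NaT. utc=True marks the result as UTC (an assumption about the source).
pub_time = pd.to_datetime(pub_millis, unit="ms", utc=True)

print(pub_time)
```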

### reportDescription

In [98]:
print('Percentage of Nan values in reportDescription variable:',(waze.reportDescription.isnull().sum()/len(waze))*100,'%')

Percentage of Nan values in reportDescription variable: 79.87347252025742 %


### scrape_dt

In [99]:
waze.scrape_dt.isnull().sum()

488

In [103]:
waze.scrape_dt.unique()

array(['2017-11-15T09:21:00Z', '2017-11-15T09:30:00Z',
       '2017-11-15T10:05:00Z', ..., '2018-04-01T15:15:00Z',
       '2018-04-01T15:20:00Z', '2018-04-01T15:25:00Z'], dtype=object)
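
The scrape_dt strings look like ISO 8601 timestamps with a trailing Z, so they could be parsed into datetimes along these lines. This is a sketch on three sample values. It assumes every non-null entry follows the same pattern, and the name `scrape_time` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample values copied from the unique() output above, plus a missing value.
scrape_dt = pd.Series(
    ["2017-11-15T09:21:00Z", "2018-04-01T15:25:00Z", np.nan], name="scrape_dt"
)

# An explicit format keeps parsing strict; errors="coerce" turns anything
# that does not match (and NaN) into NaT instead of raising.
scrape_time = pd.to_datetime(
    scrape_dt, format="%Y-%m-%dT%H:%M:%SZ", utc=True, errors="coerce"
)

print(scrape_time)
```

After parsing, any extra NaT values beyond the original nulls would point to entries that do not match the expected format.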