**The first dataset I'll explore is the Uber Fares dataset. Let's start by importing the data:**

In [73]:
import pandas as pd

with open('uber.csv') as f:
    uber = pd.DataFrame(pd.read_csv(f))

uber

Unnamed: 0,id,key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,24238194,2015-05-07 19:52:06.0000003,7.5,2015-05-07 19:52:06 UTC,-73.999817,40.738354,-73.999512,40.723217,1
1,27835199,2009-07-17 20:04:56.0000002,7.7,2009-07-17 20:04:56 UTC,-73.994355,40.728225,-73.994710,40.750325,1
2,44984355,2009-08-24 21:45:00.00000061,12.9,2009-08-24 21:45:00 UTC,-74.005043,40.740770,-73.962565,40.772647,1
3,25894730,2009-06-26 08:22:21.0000001,5.3,2009-06-26 08:22:21 UTC,-73.976124,40.790844,-73.965316,40.803349,3
4,17610152,2014-08-28 17:47:00.000000188,16.0,2014-08-28 17:47:00 UTC,-73.925023,40.744085,-73.973082,40.761247,5
...,...,...,...,...,...,...,...,...,...
199995,42598914,2012-10-28 10:49:00.00000053,3.0,2012-10-28 10:49:00 UTC,-73.987042,40.739367,-73.986525,40.740297,1
199996,16382965,2014-03-14 01:09:00.0000008,7.5,2014-03-14 01:09:00 UTC,-73.984722,40.736837,-74.006672,40.739620,1
199997,27804658,2009-06-29 00:42:00.00000078,30.9,2009-06-29 00:42:00 UTC,-73.986017,40.756487,-73.858957,40.692588,2
199998,20259894,2015-05-20 14:56:25.0000004,14.5,2015-05-20 14:56:25 UTC,-73.997124,40.725452,-73.983215,40.695415,1


**Let's clean up the data and for any missing variables, etc.**

In [74]:
uber.set_index('id', inplace=True)

uber.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 200000 entries, 24238194 to 11951496
Data columns (total 8 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   key                200000 non-null  object 
 1   fare_amount        200000 non-null  float64
 2   pickup_datetime    200000 non-null  object 
 3   pickup_longitude   200000 non-null  float64
 4   pickup_latitude    200000 non-null  float64
 5   dropoff_longitude  199999 non-null  float64
 6   dropoff_latitude   199999 non-null  float64
 7   passenger_count    200000 non-null  int64  
dtypes: float64(5), int64(1), object(2)
memory usage: 13.7+ MB


**It looks line only one row is missing entries in the `dropoff_longitude` and `dropoff_latitude` columns. Let's get rid of that now:**

In [75]:
uber.dropna(axis=0, inplace=True)
uber.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 199999 entries, 24238194 to 11951496
Data columns (total 8 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   key                199999 non-null  object 
 1   fare_amount        199999 non-null  float64
 2   pickup_datetime    199999 non-null  object 
 3   pickup_longitude   199999 non-null  float64
 4   pickup_latitude    199999 non-null  float64
 5   dropoff_longitude  199999 non-null  float64
 6   dropoff_latitude   199999 non-null  float64
 7   passenger_count    199999 non-null  int64  
dtypes: float64(5), int64(1), object(2)
memory usage: 13.7+ MB


**The attributes `pickup_longitude`, `pickup_latitude`, `dropoff_longitude`, and `dropoff_latitude` don't help us much on their own... let's calculate the distance using the coordinates and create a new column in the dataframe:**

In [76]:
import haversine as hs
from haversine import Unit

distance = []

for i in range(len(uber)):
    row = uber.iloc[[i]]
    distance.append(hs.haversine( (row['pickup_longitude'], row['pickup_latitude'] ), (row['dropoff_longitude'], row['dropoff_latitude'] ), unit=Unit.MILES))

uber['distance'] = distance

In [77]:
uber.describe()

Unnamed: 0,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,distance
count,199999.0,199999.0,199999.0,199999.0,199999.0,199999.0,199999.0
mean,11.359892,-72.527631,39.935881,-72.525292,39.92389,1.684543,12.736598
std,9.90176,11.437815,7.720558,13.117408,6.794829,1.385995,241.979518
min,-52.0,-1340.64841,-74.015515,-3356.6663,-881.985513,0.0,0.0
25%,6.0,-73.992065,40.734796,-73.991407,40.733823,1.0,0.512261
50%,8.5,-73.981823,40.752592,-73.980093,40.753042,1.0,0.948325
75%,12.5,-73.967154,40.767158,-73.963658,40.768001,2.0,1.744637
max,499.0,57.418457,1644.421482,1153.572603,872.697628,208.0,10141.618345


**Some values for the trip distance are 0 (i.e., the pickup coordinates and the dropoff coordinates are the same), which is obviously a mistake. Let's get rid of those rows and drop the coordinates from the table while we're at it:**

In [78]:
uber = uber[uber.distance != 0]
uber.drop(columns=['pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude'], inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(


**Looking forward to our model, we care more about the time of day than the month, year, etc. Let's make a column specifically for the time of day:**

In [79]:
from datetime import time

times = []

for datetime in uber['pickup_datetime']:
    time_string = datetime.split(' ')[1]
    time_string = time_string.split(':')
    time_object = time(int(time_string[0]), int(time_string[1]), int(time_string[2]))
    times.append(time_object)

uber['pickup_time'] = times
uber

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  uber['pickup_time'] = times


Unnamed: 0_level_0,key,fare_amount,pickup_datetime,passenger_count,distance,pickup_time
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
24238194,2015-05-07 19:52:06.0000003,7.5,2015-05-07 19:52:06 UTC,1,0.289051,19:52:06
27835199,2009-07-17 20:04:56.0000002,7.7,2009-07-17 20:04:56 UTC,1,0.421742,20:04:56
44984355,2009-08-24 21:45:00.00000061,12.9,2009-08-24 21:45:00 UTC,1,2.997201,21:45:00
25894730,2009-06-26 08:22:21.0000001,5.3,2009-06-26 08:22:21 UTC,3,0.783947,08:22:21
17610152,2014-08-28 17:47:00.000000188,16.0,2014-08-28 17:47:00 UTC,5,3.336707,17:47:00
...,...,...,...,...,...,...
42598914,2012-10-28 10:49:00.00000053,3.0,2012-10-28 10:49:00 UTC,1,0.039878,10:49:00
16382965,2014-03-14 01:09:00.0000008,7.5,2014-03-14 01:09:00 UTC,1,1.517527,01:09:00
27804658,2009-06-29 00:42:00.00000078,30.9,2009-06-29 00:42:00 UTC,2,8.863743,00:42:00
20259894,2015-05-20 14:56:25.0000004,14.5,2015-05-20 14:56:25 UTC,1,1.118528,14:56:25


In [80]:
from sklearn import preprocessing as pp

uber['distance'] = pp.minmax_scale(uber['distance'])
uber

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  uber['distance'] = pp.minmax_scale(uber['distance'])


Unnamed: 0_level_0,key,fare_amount,pickup_datetime,passenger_count,distance,pickup_time
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
24238194,2015-05-07 19:52:06.0000003,7.5,2015-05-07 19:52:06 UTC,1,0.000028,19:52:06
27835199,2009-07-17 20:04:56.0000002,7.7,2009-07-17 20:04:56 UTC,1,0.000042,20:04:56
44984355,2009-08-24 21:45:00.00000061,12.9,2009-08-24 21:45:00 UTC,1,0.000296,21:45:00
25894730,2009-06-26 08:22:21.0000001,5.3,2009-06-26 08:22:21 UTC,3,0.000077,08:22:21
17610152,2014-08-28 17:47:00.000000188,16.0,2014-08-28 17:47:00 UTC,5,0.000329,17:47:00
...,...,...,...,...,...,...
42598914,2012-10-28 10:49:00.00000053,3.0,2012-10-28 10:49:00 UTC,1,0.000004,10:49:00
16382965,2014-03-14 01:09:00.0000008,7.5,2014-03-14 01:09:00 UTC,1,0.000150,01:09:00
27804658,2009-06-29 00:42:00.00000078,30.9,2009-06-29 00:42:00 UTC,2,0.000874,00:42:00
20259894,2015-05-20 14:56:25.0000004,14.5,2015-05-20 14:56:25 UTC,1,0.000110,14:56:25
