# Extra Exercises: Preprocessing

In this notebook, we extract time-related features (hour, minute, etc.) from a column of timestamps.

We start by reading a dataset of taxi rides in New York (file "**mini_taxi.csv**") and inspecting its contents with ***info*** and ***head***.

In [1]:
import pandas as pnd

dfTaxi = pnd.read_csv('datasets/mini_taxi.csv', index_col=[0])

dfTaxi.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5999 entries, 2009-06-15 17:26:21.0000001 to 2014-12-12 11:33:00.00000015
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   fare_amount        5999 non-null   float64
 1   pickup_datetime    5999 non-null   object 
 2   pickup_longitude   5999 non-null   float64
 3   pickup_latitude    5999 non-null   float64
 4   dropoff_longitude  5999 non-null   float64
 5   dropoff_latitude   5999 non-null   float64
 6   passenger_count    5999 non-null   int64  
dtypes: float64(5), int64(1), object(1)
memory usage: 374.9+ KB


In [2]:
dfTaxi.head()

key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
2009-06-15 17:26:21.0000001,4.5,2009-06-15 17:26:21 UTC,-73.844311,40.721319,-73.84161,40.712278,1
2010-01-05 16:52:16.0000002,16.9,2010-01-05 16:52:16 UTC,-74.016048,40.711303,-73.979268,40.782004,1
2011-08-18 00:35:00.00000049,5.7,2011-08-18 00:35:00 UTC,-73.982738,40.76127,-73.991242,40.750562,2
2012-04-21 04:30:42.0000001,7.7,2012-04-21 04:30:42 UTC,-73.98713,40.733143,-73.991567,40.758092,1
2010-03-09 07:51:00.000000135,5.3,2010-03-09 07:51:00 UTC,-73.968095,40.768008,-73.956655,40.783762,1
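
To illustrate why ***info*** reports 7 data columns while each row carries 8 fields, here is a minimal sketch on a two-row inline sample. The rows are copied from the output above; `io.StringIO` stands in for the real file and is only an assumption for illustration. With `index_col=[0]`, the first CSV column, `key`, becomes the index rather than a data column.

```python
import io
import pandas as pnd

# Two rows copied from the output above; StringIO stands in for the file on disk.
csv_text = """key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
2009-06-15 17:26:21.0000001,4.5,2009-06-15 17:26:21 UTC,-73.844311,40.721319,-73.84161,40.712278,1
2010-01-05 16:52:16.0000002,16.9,2010-01-05 16:52:16 UTC,-74.016048,40.711303,-73.979268,40.782004,1
"""

sample = pnd.read_csv(io.StringIO(csv_text), index_col=[0])

print(sample.index.name)  # name of the column used as the index
print(sample.shape)       # (rows, data columns)
print(type(sample["pickup_datetime"].iloc[0]))  # values are still plain strings
```

At this stage the timestamps are ordinary strings, which is why the column shows up with dtype ***object***.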


Note in particular the column "*pickup_datetime*", which holds the date and time at which the taxi picked up the passenger. According to ***info***, this column has dtype ***object***, meaning the values are stored as plain strings. We therefore convert it to a *datetime* dtype with ***to_datetime***.

In [3]:
dfTaxi['pickup_datetime'] = pnd.to_datetime(dfTaxi['pickup_datetime'])
dfTaxi.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5999 entries, 2009-06-15 17:26:21.0000001 to 2014-12-12 11:33:00.00000015
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype              
---  ------             --------------  -----              
 0   fare_amount        5999 non-null   float64            
 1   pickup_datetime    5999 non-null   datetime64[ns, UTC]
 2   pickup_longitude   5999 non-null   float64            
 3   pickup_latitude    5999 non-null   float64            
 4   dropoff_longitude  5999 non-null   float64            
 5   dropoff_latitude   5999 non-null   float64            
 6   passenger_count    5999 non-null   int64              
dtypes: datetime64[ns, UTC](1), float64(5), int64(1)
memory usage: 374.9+ KB
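
As a small self-contained check of what ***to_datetime*** does, the sketch below parses three hand-written strings. Both `utc=True` and `errors="coerce"` are additions that the notebook cell does not use: the first forces a timezone-aware UTC result, the second turns unparseable values into `NaT` instead of raising an error.

```python
import pandas as pnd

raw = pnd.Series([
    "2009-06-15 17:26:21 UTC",
    "2010-01-05 16:52:16 UTC",
    "not a date",  # deliberately invalid
])

# utc=True and errors="coerce" are illustrative additions, not part of the notebook's call.
parsed = pnd.to_datetime(raw, utc=True, errors="coerce")

print(parsed.dtype)         # a timezone-aware datetime dtype
print(parsed.isna().sum())  # number of values that could not be parsed
```

Counting `NaT` values after the conversion is a quick way to confirm that every row was parsed; in the notebook's data, ***info*** reports 5999 non-null values, so nothing was lost.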


In [4]:
dfTaxi.head()

key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
2009-06-15 17:26:21.0000001,4.5,2009-06-15 17:26:21+00:00,-73.844311,40.721319,-73.84161,40.712278,1
2010-01-05 16:52:16.0000002,16.9,2010-01-05 16:52:16+00:00,-74.016048,40.711303,-73.979268,40.782004,1
2011-08-18 00:35:00.00000049,5.7,2011-08-18 00:35:00+00:00,-73.982738,40.76127,-73.991242,40.750562,2
2012-04-21 04:30:42.0000001,7.7,2012-04-21 04:30:42+00:00,-73.98713,40.733143,-73.991567,40.758092,1
2010-03-09 07:51:00.000000135,5.3,2010-03-09 07:51:00+00:00,-73.968095,40.768008,-73.956655,40.783762,1
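
The `+00:00` suffix indicates that the values are timezone-aware and expressed in UTC, so any hour extracted from them is a UTC hour. If local New York hours are wanted instead, one possible approach is to convert with `dt.tz_convert` first. This is an assumption for illustration (the notebook itself keeps UTC), and the column name `pickup_local` is hypothetical.

```python
import pandas as pnd

demo = pnd.DataFrame({
    "pickup_datetime": pnd.to_datetime(
        ["2009-06-15 17:26:21", "2010-01-05 16:52:16"], utc=True
    )
})

# Hypothetical extra column: the same instants shown in New York local time.
demo["pickup_local"] = demo["pickup_datetime"].dt.tz_convert("America/New_York")

print(demo["pickup_datetime"].dt.hour.tolist())  # UTC hours: [17, 16]
print(demo["pickup_local"].dt.hour.tolist())     # local hours: [13, 11]
```

The offset differs between the two rows (4 hours in June, 5 hours in January) because of daylight saving time, which `tz_convert` handles automatically.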


Now that the column has the right dtype, we can extract information about the date and time of each ride.

We start by extracting time-of-day information and storing it in new columns:
- time of day (hour:minute:second) --> column "*time*"
- hour --> column "*hour*"
- minute --> column "*minute*"

Then we inspect the result with ***sample***.

In [5]:
dfTaxi['time'] = dfTaxi['pickup_datetime'].dt.time
dfTaxi['hour'] = dfTaxi['pickup_datetime'].dt.hour
dfTaxi['minute'] = dfTaxi['pickup_datetime'].dt.minute

dfTaxi.sample(5)

key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,time,hour,minute
2011-12-18 03:27:00.000000168,6.9,2011-12-18 03:27:00+00:00,-73.98186,40.76894,-73.972612,40.785017,2,03:27:00,3,27
2013-02-08 14:39:00.00000019,10.5,2013-02-08 14:39:00+00:00,-73.999545,40.749222,-73.974397,40.736762,1,14:39:00,14,39
2011-07-10 02:08:08.0000002,10.1,2011-07-10 02:08:08+00:00,-73.993027,40.725563,-73.958591,40.717069,1,02:08:08,2,8
2009-01-07 22:11:00.000000156,6.9,2009-01-07 22:11:00+00:00,-73.983948,40.725547,-74.007088,40.733038,1,22:11:00,22,11
2011-05-02 22:39:00.000000108,4.5,2011-05-02 22:39:00+00:00,-73.991445,40.731617,-73.99381,40.72075,5,22:39:00,22,39
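
A short sketch of what each `.dt` attribute returns, using two hand-written timestamps taken from the sample above: `dt.time` yields `datetime.time` objects (seconds included), while `dt.hour` and `dt.minute` yield integers.

```python
import datetime
import pandas as pnd

stamps = pnd.Series(pnd.to_datetime(
    ["2011-12-18 03:27:00", "2011-07-10 02:08:08"], utc=True
))

times = stamps.dt.time      # datetime.time objects, seconds included
hours = stamps.dt.hour      # integers in 0..23
minutes = stamps.dt.minute  # integers in 0..59

print(times.tolist())
print(hours.tolist())    # [3, 2]
print(minutes.tolist())  # [27, 8]
```

Because "*hour*" and "*minute*" are numeric, they can be compared and aggregated directly, whereas "*time*" is mainly useful for display.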


Using the information extracted into the "*hour*" column, we can determine whether a ride is a "*night*" ride.
A ride counts as a night ride if it took place either before 7:00 in the morning or after 19:00 in the evening.

We create a new column "**night**" whose value is "**True**" when the hour is below 7 ( *<7* ) or above 19 ( *>19* ). Because "*hour*" is an integer, *>19* only matches pickups from 20:00 onward: a ride at 19:38 has hour 19 and is not flagged. The ***apply*** operation lets us build this new column.


In [6]:
dfTaxi['night'] = dfTaxi['hour'].apply(lambda x: (x<7 or x>19) )

dfTaxi.sample(5)

key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,time,hour,minute,night
2009-02-05 18:27:00.00000017,7.3,2009-02-05 18:27:00+00:00,-73.99062,40.757548,-73.998488,40.74021,5,18:27:00,18,27,False
2010-08-05 19:38:00.000000210,5.3,2010-08-05 19:38:00+00:00,-73.987588,40.744018,-73.990492,40.732958,1,19:38:00,19,38,False
2010-11-09 11:19:00.000000181,49.57,2010-11-09 11:19:00+00:00,-73.944293,40.782992,-73.795263,40.644525,1,11:19:00,11,19,False
2012-10-23 08:09:25.0000006,12.5,2012-10-23 08:09:25+00:00,-73.982332,40.768633,-73.983343,40.744372,1,08:09:25,8,9,False
2015-02-01 09:23:48.0000002,10.0,2015-02-01 09:23:48+00:00,-73.956123,40.775703,-73.982201,40.778912,1,09:23:48,9,23,False
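
For reference, the same flag can be computed without ***apply*** by combining two vectorized comparisons with `|`. This is an alternative sketch, not the notebook's method; the toy hours are chosen to show the boundaries of the rule.

```python
import pandas as pnd

hours = pnd.Series([3, 6, 7, 12, 19, 20, 23], name="hour")

# Same rule as the lambda: hour < 7 or hour > 19.
night_apply = hours.apply(lambda x: (x < 7 or x > 19))
night_vectorized = (hours < 7) | (hours > 19)

print(night_vectorized.tolist())
# [True, True, False, False, False, True, True]
print(night_apply.tolist() == night_vectorized.tolist())  # True
```

The parentheses around each comparison are required, since `|` binds more tightly than `<` and `>` in Python. Hours 7 and 19 both come out as `False`, matching the strict inequalities in the rule.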
