### Data Cleaning

It's commonly said that data scientists spend 80% of their time cleaning and manipulating data and only 20% of their time analyzing it. That time is well spent: without properly cleaned data, the results of any data analysis or machine learning model can be inaccurate. In this course, you will learn how to identify, diagnose, and treat a variety of data cleaning problems in Python, ranging from simple to advanced. You will deal with improper data types, check that your data is in the correct range, handle missing data, perform record linkage, and more!

### 1. Common data problems 

- Inconsistent column names
- Missing data
- Outliers
- Duplicate rows
- Untidiness
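
A quick first pass over a DataFrame can surface most of these problems. The sketch below is only an illustration: `df` and its columns are made-up toy data, not part of the ride sharing dataset, and the checks shown are one reasonable starting point rather than a complete diagnosis.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data showing several of the problems listed above
df = pd.DataFrame({
    'Ride ID': [1, 2, 2, 3, 4],
    'duration ': [12.0, 8.0, 8.0, np.nan, 900.0],
    'user_type': [1, 2, 2, 3, 2],
})

# Inconsistent column names: normalize whitespace, case, and separators
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')

# Missing data: number of missing values per column
missing_counts = df.isna().sum()

# Duplicate rows: number of rows that repeat an earlier row
duplicate_count = df.duplicated().sum()

# Outliers: the maximum (or describe()) makes extreme values easy to spot
duration_max = df['duration'].max()

print(list(df.columns))
print(missing_counts)
print(duplicate_count, duration_max)
```

Each check only reports a problem. How to treat it (drop, fill, cap, or keep) depends on the dataset and the analysis.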

In [1]:
#Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import random
from random import randint

In [2]:
ride_sharing = pd.read_csv('../Datasets/ride_sharing_new.csv')
ride_sharing.head()

Unnamed: 0.1,Unnamed: 0,duration,station_A_id,station_A_name,station_B_id,station_B_name,bike_id,user_type,user_birth_year,user_gender
0,0,12 minutes,81,Berry St at 4th St,323,Broadway at Kearny,5480,2,1959,Male
1,1,24 minutes,3,Powell St BART Station (Market St at 4th St),118,Eureka Valley Recreation Center,5193,2,1965,Male
2,2,8 minutes,67,San Francisco Caltrain Station 2 (Townsend St...,23,The Embarcadero at Steuart St,3652,3,1993,Male
3,3,4 minutes,16,Steuart St at Market St,28,The Embarcadero at Bryant St,1883,1,1979,Male
4,4,11 minutes,22,Howard St at Beale St,350,8th St at Brannan St,4626,2,1994,Male


### Numeric data or ... ?
You'll be working with a DataFrame called ride_sharing that holds bicycle ride sharing data from San Francisco. It contains information on the start and end stations, the trip duration, and some user information for a bike sharing service.

The user_type column contains information on whether a user is taking a free ride and takes on the following values:

- 1 for free riders.
- 2 for pay per ride.
- 3 for monthly subscribers.

In this exercise, you will print the information of ride_sharing using .info() and see a firsthand example of how an incorrect data type can distort your analysis of the dataset. The pandas package is imported as pd.

In [3]:
# Print the information of ride_sharing
print(ride_sharing.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25760 entries, 0 to 25759
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Unnamed: 0       25760 non-null  int64 
 1   duration         25760 non-null  object
 2   station_A_id     25760 non-null  int64 
 3   station_A_name   25760 non-null  object
 4   station_B_id     25760 non-null  int64 
 5   station_B_name   25760 non-null  object
 6   bike_id          25760 non-null  int64 
 7   user_type        25760 non-null  int64 
 8   user_birth_year  25760 non-null  int64 
 9   user_gender      25760 non-null  object
dtypes: int64(6), object(4)
memory usage: 2.0+ MB
None


In [4]:
# Print summary statistics of user_type column
print(ride_sharing['user_type'].describe())

count    25760.000000
mean         2.008385
std          0.704541
min          1.000000
25%          2.000000
50%          2.000000
75%          3.000000
max          3.000000
Name: user_type, dtype: float64


In [5]:
# Convert user_type from integer to category
ride_sharing['user_type_cat'] = ride_sharing['user_type'].astype('category')

In [6]:
# Write an assert statement confirming the change
assert ride_sharing['user_type_cat'].dtype == 'category'


In [7]:
# Print new summary statistics 
print(ride_sharing['user_type_cat'].describe())

count     25760
unique        3
top           2
freq      12972
Name: user_type_cat, dtype: int64
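
The integer codes are easier to read once they carry their meaning. The following sketch maps the codes to the labels given in the description of user_type before converting to a category. The short `user_type` Series is a toy stand-in for the real column, and the exact label strings are an assumption chosen for illustration.

```python
import pandas as pd

# Toy stand-in for the user_type column (the real column has 25,760 rows)
user_type = pd.Series([2, 2, 3, 1, 2], name='user_type')

# Labels based on the description of user_type; the wording is illustrative
labels = {1: 'free rider', 2: 'pay per ride', 3: 'monthly subscriber'}

# Map the integer codes to labels, then store the result as a category
user_type_cat = user_type.map(labels).astype('category')

# Frequency counts are meaningful for a category; a mean is not
counts = user_type_cat.value_counts()
print(counts)
```

With labels in place, `describe()` and `value_counts()` report the rider types directly instead of codes that have to be looked up.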


### Summing strings and concatenating numbers
In the previous exercise, you were able to identify that category is the correct data type for user_type and convert it in order to extract relevant statistical summaries that shed light on the distribution of user_type.

Another common data type problem is importing what should be numerical values as strings. Mathematical operations such as addition and multiplication then concatenate or repeat the strings instead of producing numerical results.
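
A small example makes the difference concrete. The values below are hypothetical and are stored with the object dtype on purpose, so that they behave like strings read from a file.

```python
import pandas as pd

# Numbers that were read in as strings (hypothetical values)
as_text = pd.Series(['12', '24', '8'], dtype='object')

# Summing strings concatenates them
text_total = as_text.sum()

# Converting to integers first gives the arithmetic sum
numeric_total = as_text.astype('int').sum()

print(repr(text_total))
print(numeric_total)
```

The first result is a single long string, while the second is the number you would expect from adding the three values.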

In this exercise, you'll be converting the string column duration to the type int. Before that, however, you will need to strip "minutes" from the column so that pandas can read the values as numbers.

In [8]:
# Strip duration of minutes
ride_sharing['duration_trim'] = ride_sharing['duration'].str.strip('minutes')

# Convert duration to integer
ride_sharing['duration_time'] = ride_sharing['duration_trim'].astype('int')

# Write an assert statement making sure of conversion
assert ride_sharing['duration_time'].dtype == 'int'

# Print formed columns and calculate average ride duration 
print(ride_sharing[['duration','duration_trim','duration_time']])
print('Average ride sharing duration time is {:.2f}'.format(ride_sharing['duration_time'].mean()))

         duration duration_trim  duration_time
0      12 minutes           12              12
1      24 minutes           24              24
2       8 minutes            8               8
3       4 minutes            4               4
4      11 minutes           11              11
...           ...           ...            ...
25755  11 minutes           11              11
25756  10 minutes           10              10
25757  14 minutes           14              14
25758  14 minutes           14              14
25759  29 minutes           29              29

[25760 rows x 3 columns]
Average ride sharing duration time is 11.39
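
One detail worth knowing: `str.strip` treats its argument as a set of characters to remove from both ends, not as a literal suffix. It works here because the digits and the space are not in that set, and the integer conversion tolerates the leftover trailing space. The sketch below, using a few hypothetical values, compares it with two alternatives that remove the unit more explicitly.

```python
import pandas as pd

# A few sample values in the same format as the duration column
duration = pd.Series(['12 minutes', '24 minutes', '8 minutes'])

# str.strip removes the characters m, i, n, u, t, e, s; the space survives
stripped = duration.str.strip('minutes')

# Removing the exact suffix leaves only the digits
without_suffix = duration.str.replace(' minutes', '', regex=False)

# Extracting the first run of digits is another option
digits = duration.str.extract(r'(\d+)', expand=False)

# Either alternative converts cleanly to integers
duration_time = without_suffix.astype('int')

print(stripped.tolist())
print(duration_time.tolist())
```

Which variant to prefer depends on how uniform the column is. The suffix replacement is the most literal, while the digit extraction is more forgiving of stray text.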


In [9]:
# Generate a random tire size (26 to 29 inches) for each row in the dataset
tire_sizes = [random.randint(26, 29) for _ in range(len(ride_sharing))]

# Create a tire_sizes column in the dataset
ride_sharing['tire_sizes'] = tire_sizes

In [10]:
ride_sharing.head()

Unnamed: 0.1,Unnamed: 0,duration,station_A_id,station_A_name,station_B_id,station_B_name,bike_id,user_type,user_birth_year,user_gender,user_type_cat,duration_trim,duration_time,tire_sizes
0,0,12 minutes,81,Berry St at 4th St,323,Broadway at Kearny,5480,2,1959,Male,2,12,12,26
1,1,24 minutes,3,Powell St BART Station (Market St at 4th St),118,Eureka Valley Recreation Center,5193,2,1965,Male,2,24,24,28
2,2,8 minutes,67,San Francisco Caltrain Station 2 (Townsend St...,23,The Embarcadero at Steuart St,3652,3,1993,Male,3,8,8,29
3,3,4 minutes,16,Steuart St at Market St,28,The Embarcadero at Bryant St,1883,1,1979,Male,1,4,4,28
4,4,11 minutes,22,Howard St at Beale St,350,8th St at Brannan St,4626,2,1994,Male,2,11,11,26


In [11]:
ride_sharing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25760 entries, 0 to 25759
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   Unnamed: 0       25760 non-null  int64   
 1   duration         25760 non-null  object  
 2   station_A_id     25760 non-null  int64   
 3   station_A_name   25760 non-null  object  
 4   station_B_id     25760 non-null  int64   
 5   station_B_name   25760 non-null  object  
 6   bike_id          25760 non-null  int64   
 7   user_type        25760 non-null  int64   
 8   user_birth_year  25760 non-null  int64   
 9   user_gender      25760 non-null  object  
 10  user_type_cat    25760 non-null  category
 11  duration_trim    25760 non-null  object  
 12  duration_time    25760 non-null  int64   
 13  tire_sizes       25760 non-null  int64   
dtypes: category(1), int64(8), object(5)
memory usage: 2.6+ MB


In [12]:
#Changing the datatype of tire sizes from integer to category
ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('category')
assert ride_sharing['tire_sizes'].dtype == 'category'

In [13]:
#Checking if the data type change really worked
assert ride_sharing['tire_sizes'].dtype == 'category'

### Tire size constraints
In this lesson, you're going to build on top of the work you've been doing with the ride_sharing DataFrame. You'll be working with the tire_sizes column which contains data on each bike's tire size.

In this dataset, bicycle tire sizes range from 26″ to 29″ and are correctly stored as a categorical value. In an effort to cut maintenance costs, the ride sharing provider decided to set the maximum tire size to be 27″.

In this exercise, you will make sure the tire_sizes column has the correct range by first converting it to an integer, then setting and testing the new upper limit of 27″ for tire sizes.

In [21]:
# Convert tire_sizes to integer
ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('int')

# Set all values above 27 to 27
ride_sharing.loc[ride_sharing['tire_sizes'] > 27, 'tire_sizes'] = 27

# Test the new upper limit
assert ride_sharing['tire_sizes'].max() <= 27

# Reconvert tire_sizes back to categorical
ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('category')

# Print tire size description
print(ride_sharing['tire_sizes'].describe())

count     25760
unique        2
top          27
freq      19316
Name: tire_sizes, dtype: int64
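
The same range constraint can also be expressed with `Series.clip`, which caps values in one step. This is an alternative to the `.loc` assignment used above, not what the exercise itself does, and the short Series is a toy stand-in for the tire_sizes column.

```python
import pandas as pd

# Toy stand-in for the tire_sizes column, stored as a category
tire_sizes = pd.Series([26, 28, 29, 27, 26]).astype('category')

# Convert to integer, cap at 27, then convert back to category
capped = tire_sizes.astype('int').clip(upper=27).astype('category')

# Test the new upper limit
assert capped.astype('int').max() <= 27

print(capped.tolist())
print(list(capped.cat.categories))
```

After capping, only the sizes 26 and 27 remain as categories, which matches the two unique values reported by `describe()`.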


In [22]:
ride_sharing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25760 entries, 0 to 25759
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   Unnamed: 0       25760 non-null  int64   
 1   duration         25760 non-null  object  
 2   station_A_id     25760 non-null  int64   
 3   station_A_name   25760 non-null  object  
 4   station_B_id     25760 non-null  int64   
 5   station_B_name   25760 non-null  object  
 6   bike_id          25760 non-null  int64   
 7   user_type        25760 non-null  int64   
 8   user_birth_year  25760 non-null  int64   
 9   user_gender      25760 non-null  object  
 10  user_type_cat    25760 non-null  category
 11  duration_trim    25760 non-null  object  
 12  duration_time    25760 non-null  int64   
 13  tire_sizes       25760 non-null  category
dtypes: category(2), int64(7), object(5)
memory usage: 2.4+ MB


In [23]:
# Add a synthetic ride_date column to the DataFrame; the next exercise needs it
import random
from datetime import datetime, timedelta

min_year = 2017
max_year = datetime.now().year

# Sample from a window that extends past today so that some dates fall in the future
start = datetime(min_year, 1, 1)
years = max_year - min_year + 2
end = start + timedelta(days=365 * years)

# Draw one random date per row and collect them, instead of printing each one
random_dates = [start + (end - start) * random.random() for _ in range(len(ride_sharing))]

In [24]:
# Create the ride_date column, one random date per ride
ride_sharing['ride_date'] = random_dates

### Back to the future
An update to the data pipeline feeding into the ride_sharing DataFrame now registers each ride's date. This information is stored in the ride_date column of the type object, which represents strings in pandas.

A bug was discovered which was relaying rides taken today as taken next year. To fix this, you will find all values of the ride_date column that fall in the future and set them to today's date, so that today becomes the maximum possible value of this column. Before doing so, you will need to convert ride_date to a datetime object.

The datetime package has been imported as dt, alongside all the packages you've been using till now.

In [25]:
import datetime as dt
# Convert ride_date to datetime
ride_sharing['ride_dt'] = pd.to_datetime(ride_sharing['ride_date'])

# Save today's date
today = pd.Timestamp('today')

# Set all in the future to today's date
ride_sharing.loc[ride_sharing['ride_dt'] > today, 'ride_dt'] = today

# Print maximum of ride_dt column
print(ride_sharing['ride_dt'].max())

2020-06-21 09:06:36.519770
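
The same capping logic can be checked on a tiny example. The sketch below uses a fixed reference date so the result is reproducible, which is an assumption made for illustration (the exercise uses `pd.Timestamp('today')`), and the ride dates are hypothetical strings. It uses `Series.where` as an alternative to the `.loc` assignment.

```python
import pandas as pd

# Fixed reference date for reproducibility (assumption; not the live "today")
today = pd.Timestamp('2020-06-21')

# Hypothetical ride dates stored as strings, one of them in the future
ride_date = pd.Series(['2019-03-01', '2020-06-20', '2021-01-15'])

# Convert the strings to datetime values
ride_dt = pd.to_datetime(ride_date)

# Keep dates up to the reference date; replace later ones with it
ride_dt = ride_dt.where(ride_dt <= today, today)

# Confirm that no ride is dated after the reference date
assert ride_dt.max() <= today

print(ride_dt.max())
```

An assert like the one above is a cheap way to confirm that the range constraint holds after the fix.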
