### Data Cleaning

It's commonly said that data scientists spend 80% of their time cleaning and manipulating data and only 20% of their time analyzing it. That time is well spent: without properly cleaned data, the results of any data analysis or machine learning model could be inaccurate. In this course, you will learn how to identify, diagnose, and treat a variety of data cleaning problems in Python, ranging from simple to advanced. You will deal with improper data types, check that your data is in the correct range, handle missing data, perform record linkage, and more!

### 1. Common data problems 

- Inconsistent column names
- Missing data
- Outliers
- Duplicate rows
- Untidiness
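
As a rough illustration, three of the problems listed above (inconsistent column names, missing data, and duplicate rows) can be screened for with a few pandas calls. The toy DataFrame and its column names below are invented for this sketch and are not part of the ride_sharing data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data that shows several common problems at once
toy = pd.DataFrame({
    "Ride ID": [1, 2, 2, 4],                      # inconsistent naming, one repeated row
    "duration_min": [12.0, 24.0, 24.0, np.nan],   # one missing value
    "bike id": [10, 11, 11, 12],
})

# Inconsistent column names: normalize to lowercase snake_case
toy.columns = toy.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)

# Missing data: count missing values per column
missing_per_column = toy.isna().sum()

# Duplicate rows: count rows that exactly repeat an earlier row
n_duplicates = toy.duplicated().sum()

print(list(toy.columns))
print(missing_per_column)
print(n_duplicates)
```

Outliers and untidiness usually need a closer look at the data itself, so they are left out of this quick check.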

In [1]:
#Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import random
from random import randint

import extra # Just a file containing useful lists

In [2]:
ride_sharing = pd.read_csv('../Datasets/ride_sharing_new.csv')
ride_sharing.head()

Unnamed: 0.1,Unnamed: 0,duration,station_A_id,station_A_name,station_B_id,station_B_name,bike_id,user_type,user_birth_year,user_gender
0,0,12 minutes,81,Berry St at 4th St,323,Broadway at Kearny,5480,2,1959,Male
1,1,24 minutes,3,Powell St BART Station (Market St at 4th St),118,Eureka Valley Recreation Center,5193,2,1965,Male
2,2,8 minutes,67,San Francisco Caltrain Station 2 (Townsend St...,23,The Embarcadero at Steuart St,3652,3,1993,Male
3,3,4 minutes,16,Steuart St at Market St,28,The Embarcadero at Bryant St,1883,1,1979,Male
4,4,11 minutes,22,Howard St at Beale St,350,8th St at Brannan St,4626,2,1994,Male


### Numeric data or ... ?
You'll be working with bicycle ride sharing data in San Francisco called ride_sharing. It contains information on the start and end stations, the trip duration, and some user information for a bike sharing service.

The user_type column records what kind of rider each user is and takes on the following values:

- 1 for free riders.
- 2 for pay per ride.
- 3 for monthly subscribers.

In this exercise, you will print the information of ride_sharing using .info() and see a firsthand example of how an incorrect data type can flaw your analysis of the dataset. The pandas package is imported as pd.

In [3]:
# Print the information of ride_sharing
print(ride_sharing.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25760 entries, 0 to 25759
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Unnamed: 0       25760 non-null  int64 
 1   duration         25760 non-null  object
 2   station_A_id     25760 non-null  int64 
 3   station_A_name   25760 non-null  object
 4   station_B_id     25760 non-null  int64 
 5   station_B_name   25760 non-null  object
 6   bike_id          25760 non-null  int64 
 7   user_type        25760 non-null  int64 
 8   user_birth_year  25760 non-null  int64 
 9   user_gender      25760 non-null  object
dtypes: int64(6), object(4)
memory usage: 2.0+ MB
None


In [4]:
# Print summary statistics of user_type column
print(ride_sharing['user_type'].describe())

count    25760.000000
mean         2.008385
std          0.704541
min          1.000000
25%          2.000000
50%          2.000000
75%          3.000000
max          3.000000
Name: user_type, dtype: float64


In [5]:
# Convert user_type from integer to category
ride_sharing['user_type_cat'] = ride_sharing['user_type'].astype('category')

In [6]:
# Write an assert statement confirming the change
assert ride_sharing['user_type_cat'].dtype == 'category'


In [7]:
# Print new summary statistics 
print(ride_sharing['user_type_cat'].describe())

count     25760
unique        3
top           2
freq      12972
Name: user_type_cat, dtype: int64
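
The same contrast can be reproduced on a small scale. The sketch below uses a made-up six-value Series as a stand-in for user_type, and the label strings are taken from the value descriptions above. Renaming the categories is an optional extra step, not something the exercise requires.

```python
import pandas as pd

# Toy stand-in for the user_type column (values are made up)
user_type = pd.Series([1, 2, 2, 3, 2, 1], name="user_type")

# As integers, describe() reports a mean, which has no meaning for category codes
numeric_summary = user_type.describe()

# As a category, describe() reports count, unique, top and freq instead
user_type_cat = user_type.astype("category")
categorical_summary = user_type_cat.describe()

# Optional: replace the integer codes with readable labels
labels = {1: "free rider", 2: "pay per ride", 3: "monthly subscriber"}
user_type_labeled = user_type_cat.cat.rename_categories(labels)

print(numeric_summary["mean"])
print(categorical_summary)
print(user_type_labeled.value_counts())
```

Checking the dtype with isinstance(series.dtype, pd.CategoricalDtype) is an alternative to comparing against the string 'category'.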


### Summing strings and concatenating numbers
In the previous exercise, you were able to identify that category is the correct data type for user_type and convert it in order to extract relevant statistical summaries that shed light on the distribution of user_type.

Another common data type problem is importing what should be numerical values as strings. Mathematical operations on strings, such as summing and multiplication, concatenate or repeat the strings instead of producing numerical results.

In this exercise, you'll be converting the string column duration to the type int. Before that, however, you will need to strip "minutes" from the column so that pandas can read it as numerical.
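
To see the problem in isolation, here is a minimal sketch on a made-up Series of duration strings. The values are illustrative only.

```python
import pandas as pd

# Durations stored as strings (object dtype), as they might be after a CSV import
as_strings = pd.Series(["12", "24", "8"], dtype=object)

# Summing strings concatenates them instead of adding them
string_total = as_strings.sum()

# After conversion to integers, the sum is arithmetic
numeric_total = as_strings.astype("int").sum()

print(repr(string_total))
print(numeric_total)
```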

In [8]:
# Strip duration of minutes
ride_sharing['duration_trim'] = ride_sharing['duration'].str.strip('minutes')

# Convert duration to integer
ride_sharing['duration_time'] = ride_sharing['duration_trim'].astype('int')

# Write an assert statement making sure of conversion
assert ride_sharing['duration_time'].dtype == 'int'

# Print formed columns and calculate average ride duration 
print(ride_sharing[['duration','duration_trim','duration_time']])
print('Average ride sharing duration time is {:.2f}'.format(ride_sharing['duration_time'].mean()))

         duration duration_trim  duration_time
0      12 minutes           12              12
1      24 minutes           24              24
2       8 minutes            8               8
3       4 minutes            4               4
4      11 minutes           11              11
...           ...           ...            ...
25755  11 minutes           11              11
25756  10 minutes           10              10
25757  14 minutes           14              14
25758  14 minutes           14              14
25759  29 minutes           29              29

[25760 rows x 3 columns]
Average ride sharing duration time is 11.39
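
One detail worth knowing about the cell above: str.strip('minutes') does not remove the word "minutes" as a unit. It removes any of the characters m, i, n, u, t, e, s from both ends of each string. That works here because the values start with digits, but it leaves the space before "minutes" in place. The sketch below, on made-up values, shows this and one possible alternative that captures the leading digits with a regular expression. It is an option, not the method the exercise prescribes.

```python
import pandas as pd

duration = pd.Series(["12 minutes", "24 minutes", "8 minutes"])

# strip("minutes") treats its argument as a set of characters to remove
# from both ends, so the trailing space is left behind
stripped = duration.str.strip("minutes")

# Alternative sketch: capture the leading digits explicitly, then convert
extracted = duration.str.extract(r"^(\d+)", expand=False)
duration_time = extracted.astype("int")

print(stripped.tolist())
print(duration_time.tolist())
```

The conversion in the exercise still succeeds because integer parsing tolerates surrounding whitespace.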


In [9]:
# Create a random tire size for each bike in the dataset
tire_sizes = []
for s in range(0, 25760):
    n = random.randint(26, 29)  # randint includes both endpoints: 26, 27, 28 or 29
    tire_sizes.append(n)

# Create a tire_sizes column in the dataset
ride_sharing['tire_sizes'] = tire_sizes
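
Because the loop above is unseeded, the tire sizes change on every run. If reproducible values are wanted, one option is a seeded NumPy generator, sketched below. The seed value and the n_rows variable are assumptions made for this example.

```python
import numpy as np

n_rows = 25760  # number of rows reported by ride_sharing.info()

# Seeded generator so the same tire sizes are produced on every run
rng = np.random.default_rng(seed=42)  # the seed value is arbitrary

# endpoint=True makes 29 inclusive, matching random.randint(26, 29)
tire_sizes = rng.integers(26, 29, size=n_rows, endpoint=True)

print(tire_sizes[:5])
```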

In [10]:
ride_sharing.head()

Unnamed: 0.1,Unnamed: 0,duration,station_A_id,station_A_name,station_B_id,station_B_name,bike_id,user_type,user_birth_year,user_gender,user_type_cat,duration_trim,duration_time,tire_sizes
0,0,12 minutes,81,Berry St at 4th St,323,Broadway at Kearny,5480,2,1959,Male,2,12,12,26
1,1,24 minutes,3,Powell St BART Station (Market St at 4th St),118,Eureka Valley Recreation Center,5193,2,1965,Male,2,24,24,29
2,2,8 minutes,67,San Francisco Caltrain Station 2 (Townsend St...,23,The Embarcadero at Steuart St,3652,3,1993,Male,3,8,8,26
3,3,4 minutes,16,Steuart St at Market St,28,The Embarcadero at Bryant St,1883,1,1979,Male,1,4,4,28
4,4,11 minutes,22,Howard St at Beale St,350,8th St at Brannan St,4626,2,1994,Male,2,11,11,29


In [11]:
ride_sharing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25760 entries, 0 to 25759
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   Unnamed: 0       25760 non-null  int64   
 1   duration         25760 non-null  object  
 2   station_A_id     25760 non-null  int64   
 3   station_A_name   25760 non-null  object  
 4   station_B_id     25760 non-null  int64   
 5   station_B_name   25760 non-null  object  
 6   bike_id          25760 non-null  int64   
 7   user_type        25760 non-null  int64   
 8   user_birth_year  25760 non-null  int64   
 9   user_gender      25760 non-null  object  
 10  user_type_cat    25760 non-null  category
 11  duration_trim    25760 non-null  object  
 12  duration_time    25760 non-null  int64   
 13  tire_sizes       25760 non-null  int64   
dtypes: category(1), int64(8), object(5)
memory usage: 2.6+ MB


In [12]:
#Changing the datatype of tire sizes from integer to category
ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('category')
assert ride_sharing['tire_sizes'].dtype == 'category'

In [13]:
#Checking if the data type change really worked
assert ride_sharing['tire_sizes'].dtype == 'category'

### Tire size constraints
In this lesson, you're going to build on top of the work you've been doing with the ride_sharing DataFrame. You'll be working with the tire_sizes column which contains data on each bike's tire size.

Bicycle tire sizes in this dataset are 26″, 27″, 28″ or 29″ (the range generated above with random.randint) and are here correctly stored as a categorical value. In an effort to cut maintenance costs, the ride sharing provider decided to set the maximum tire size to be 27″.

In this exercise, you will make sure the tire_sizes column has the correct range by first converting it to an integer, then setting and testing the new upper limit of 27″ for tire sizes.

In [14]:
# Convert tire_sizes to integer
ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('int')

# Set all values above 27 to 27
ride_sharing.loc[ride_sharing['tire_sizes'] > 27, 'tire_sizes'] = 27

# Test the new upper limit: no value should exceed 27
assert ride_sharing['tire_sizes'].max() <= 27

# Reconvert tire_sizes back to categorical
ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('category')

# Print tire size description
print(ride_sharing['tire_sizes'].describe())

count     25760
unique        2
top          27
freq      19273
Name: tire_sizes, dtype: int64
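
The same capping step can also be written with Series.clip, shown below on a small made-up Series. This is an alternative sketch, not the approach used in the cell above.

```python
import pandas as pd

# Toy tire sizes stored as a category (values are made up)
tire_sizes = pd.Series([26, 29, 27, 28, 26], dtype="category")

# Unordered categories do not support > comparisons, so convert to integers first
as_int = tire_sizes.astype("int")

# Cap every value at the new upper limit of 27
capped = as_int.clip(upper=27)

# Test the constraint, then convert back to a category
assert capped.max() <= 27
capped_cat = capped.astype("category")

print(capped_cat.cat.categories.tolist())
```

clip returns a new Series rather than modifying the column in place, so the result has to be assigned back if it is used on a DataFrame.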


In [15]:
ride_sharing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25760 entries, 0 to 25759
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   Unnamed: 0       25760 non-null  int64   
 1   duration         25760 non-null  object  
 2   station_A_id     25760 non-null  int64   
 3   station_A_name   25760 non-null  object  
 4   station_B_id     25760 non-null  int64   
 5   station_B_name   25760 non-null  object  
 6   bike_id          25760 non-null  int64   
 7   user_type        25760 non-null  int64   
 8   user_birth_year  25760 non-null  int64   
 9   user_gender      25760 non-null  object  
 10  user_type_cat    25760 non-null  category
 11  duration_trim    25760 non-null  object  
 12  duration_time    25760 non-null  int64   
 13  tire_sizes       25760 non-null  category
dtypes: category(2), int64(7), object(5)
memory usage: 2.4+ MB


In [16]:
# Generate random dates for a date column, which is needed for the next exercise (this loop only prints them)
import random
from datetime import datetime, timedelta

min_year = 2017
max_year = datetime.now().year

start = datetime(min_year, 1, 1, 00, 00, 00)
years = max_year - min_year + 2
end = start + timedelta(days=365 * years)

for i in range(25760):
    random_date = start + (end - start) * random.random()
    print(random_date)

2019-08-09 20:29:02.665361
2020-08-27 11:44:27.960259
2020-07-16 09:23:10.165561
2022-08-01 03:22:56.738207
2021-07-07 16:11:27.509373
...
2020-08-30 15:57:31.144121
2017-09-04 00:54:44.647878
2019-04-13 13:41:16.519121
2019-08-06 14:35:02.451663
2021-03-16 23:43:05.957416
2017-03-20 18:11:41.491849
2022-12-23 01:26:14.875154
2022-04-14 14:30:47.451000
2018-04-08 23:43:30.885371
2019-07-13 11:08:54.259805
2021-03-09 21:53:32.733445
2019-05-28 05:46:29.365854
2019-01-09 04:58:08.728839
2021-12-14 12:11:08.894729
2018-06-18 18:24:36.779880
2019-01-03 16:04:39.941251
2021-07-29 21:18:33.958903
2020-07-21 10:49:22.409509
2018-10-06 01:12:56.461746
2021-11-15 14:35:07.943261
2022-05-18 07:16:14.271021
2020-10-14 12:02:39.214900
2021-02-24 21:21:34.764639
2022-04-24 00:51:20.344144
2018-07-21 16:32:58.706666
2

2019-08-08 06:42:42.754386
2021-06-16 09:53:52.972709
2020-08-22 09:43:31.893935
2020-08-12 12:22:34.570192
2022-09-08 11:41:57.202628
2018-03-28 23:04:45.661200
2020-09-17 04:52:17.140697
2020-11-15 23:57:39.641521
2018-03-27 13:27:50.497238
2022-09-02 21:33:54.752219
2019-10-30 11:54:41.859643
2018-01-16 07:45:59.118467
2021-06-11 12:43:51.009509
2020-05-21 11:25:07.424904
2020-01-15 17:08:41.406656
2018-05-29 07:24:23.261558
2017-03-25 12:52:30.554220
2021-09-02 10:53:28.491894
2017-01-01 02:20:57.692670
2018-04-12 08:00:58.717523
2017-03-02 01:34:03.203719
2020-05-11 17:25:31.839101
2021-01-25 19:46:57.473957
2019-01-09 23:23:49.617280
2018-05-23 01:07:42.514197
2020-11-17 00:03:02.303569
2017-08-13 10:27:13.746377
2019-04-23 03:51:13.301514
2021-08-14 16:26:28.710204
2021-03-30 04:49:05.650463
2019-06-01 00:26:14.888195
2022-12-04 23:19:20.210750
2021-04-09 08:03:00.058820
2022-02-09 00:33:33.986504
2018-10-30 12:20:06.948353
2022-05-15 11:10:30.590357
2019-04-15 05:44:51.781982
2

2017-05-23 09:41:57.061593
2021-02-17 21:07:33.064842
2021-11-05 03:06:24.325106
2017-11-26 18:49:52.917806
2022-02-16 11:43:54.533898
2020-11-20 02:32:48.359292
2019-03-12 18:55:18.114672
2018-09-07 23:00:36.682798
2020-05-26 12:22:58.838386
2020-05-26 11:31:48.723212
2019-04-10 13:32:42.939902
2019-07-28 17:26:51.407465
2017-10-01 03:15:53.888912
2020-10-07 14:17:00.218556
2019-04-25 09:13:09.966736
2021-06-04 04:15:01.162830
2017-09-01 14:46:15.480475
2021-10-22 18:15:55.442529
2018-02-03 04:40:12.773270
2017-02-01 07:06:34.238533
2020-04-09 20:12:16.651229
2017-08-30 09:11:48.077249
2018-11-06 01:34:18.608555
2020-02-04 17:18:20.762355
2021-02-06 17:34:52.272479
2019-10-09 04:21:39.216336
2021-02-01 12:16:59.695683
2018-12-08 02:12:39.151897
2017-09-16 18:07:23.503954
2018-01-27 14:42:30.489677
2017-07-19 14:16:52.776145
2020-09-03 04:22:22.744941
2019-12-07 11:29:29.604898
2018-03-31 23:38:30.016666
2020-04-09 04:30:22.807157
2021-10-05 11:57:12.242763
2022-11-10 03:11:29.006034
2

2021-02-01 23:30:40.487237
2017-09-23 10:44:15.353571
2020-07-04 18:28:45.671552
2017-10-27 22:55:26.816045
2018-01-25 09:32:42.321289
2019-05-11 11:00:00.515341
2022-03-25 11:07:38.132760
2018-03-27 15:49:33.082958
2020-08-21 11:02:09.728520
2020-02-01 16:29:10.382915
2020-11-03 20:28:42.470998
2019-10-12 11:07:49.190937
2022-06-11 01:57:28.374163
2019-12-10 22:07:25.174460
2017-03-03 23:39:25.171768
2021-07-30 16:19:37.012290
2017-02-10 17:19:35.406347
2021-05-26 10:28:03.061098
2021-05-15 02:32:33.151692
2021-02-08 07:39:21.794020
2022-01-10 05:00:04.056629
2019-01-19 09:07:23.866834
2020-04-22 18:07:59.833624
2019-09-20 12:48:44.861401
2021-08-26 18:37:19.656863
2022-01-20 12:38:29.662390
2019-05-20 03:12:21.593745
2020-10-04 02:07:28.310423
2017-10-13 05:00:21.654178
2017-08-11 01:39:55.817640
2022-12-21 06:23:56.850809
2017-01-26 17:19:33.376813
2022-11-21 23:05:25.272693
2022-10-18 06:00:37.506106
2020-08-26 21:48:10.927157
2022-07-27 01:10:19.592990
2017-12-10 22:34:14.185947
2

2017-08-09 02:42:02.638618
2018-07-01 21:41:45.702532
2020-08-02 07:05:47.568476
2019-01-22 13:11:56.459665
2020-08-23 17:05:24.726814
2017-03-17 11:30:13.609419
2022-04-10 07:53:21.788035
2019-04-25 07:52:01.397125
2022-03-09 14:50:59.171037
2021-02-18 18:27:44.641317
2019-01-10 08:59:43.404722
2019-04-18 15:17:17.206836
2018-03-25 19:38:24.359297
2020-06-12 18:04:31.727053
2019-03-26 07:06:08.590502
2022-08-31 10:40:37.142284
2021-06-04 10:10:18.950908
2019-08-20 12:19:49.115640
2021-11-05 13:39:50.342265
2019-12-03 12:58:34.433689
2021-11-20 07:54:56.813621
2018-09-19 14:06:42.934819
2018-01-01 13:18:02.149380
2018-06-20 01:55:50.331730
2021-10-01 10:36:22.457945
2020-03-29 16:45:50.017965
2022-09-23 19:10:22.174798
2020-07-21 23:50:59.882221
2017-10-12 19:52:36.960507
2022-05-16 06:59:39.149303
2020-06-04 05:58:58.154706
2022-07-27 00:04:52.253891
2021-07-31 21:45:25.378682
2022-07-30 23:07:16.136871
2020-06-27 21:08:09.953352
2018-11-25 04:37:04.201785
2021-01-08 15:48:46.406842
2

2022-06-16 23:22:48.086453
2019-12-04 08:44:22.258241
2022-01-04 12:53:55.727989
2021-01-08 04:25:11.774295
2022-01-23 04:21:35.477053
2022-05-01 04:23:05.323975
2022-08-10 14:25:20.330991
2019-09-27 18:52:06.294790
2018-09-08 19:57:08.991745
2020-01-13 10:53:08.342334
2019-10-14 04:48:03.421471
2022-09-01 00:05:11.665631
2017-03-20 22:52:50.065449
2018-05-28 21:36:54.998544
2018-06-06 12:36:37.638899
2018-09-23 01:40:49.853572
2021-01-02 00:25:50.856283
2021-04-23 20:36:18.720020
2021-11-21 00:27:09.826527
2017-05-27 22:04:53.501681
2017-02-10 09:07:18.762886
2021-08-13 06:26:57.996236
2020-11-25 06:56:54.374979
2021-08-11 15:33:11.013211
2018-03-06 14:37:20.752511
2019-04-15 00:18:26.173248
2019-12-28 00:27:53.470102
2019-04-19 04:17:16.415460
2020-10-23 22:48:51.073958
2017-04-25 17:43:26.817858
2019-06-05 23:59:11.689779
2019-04-01 15:43:14.863213
2018-03-29 07:02:56.500689
2020-12-30 01:42:26.837417
2018-10-06 12:55:30.239792
2021-09-06 04:14:25.921932
2022-07-26 10:44:21.431229
2

2018-12-30 16:55:22.748044
2018-03-21 16:00:02.282606
2017-03-28 15:37:42.348502
2022-07-15 01:18:49.395682
2019-01-13 21:51:15.795077
2021-12-13 05:44:31.697002
2018-11-26 10:02:52.567487
2018-08-09 02:20:58.621092
2022-10-22 23:58:20.796047
2022-01-11 17:33:57.149350
2017-05-02 22:03:59.616936
2017-05-21 07:52:32.168361
2020-01-08 00:13:22.404143
2018-10-30 18:55:51.087977
2022-10-31 13:15:53.621497
2022-06-10 22:01:31.219635
2021-10-26 02:13:18.662039
2022-12-10 17:50:52.468451
2021-07-06 16:40:41.033804
2021-03-15 10:32:49.935948
2018-01-01 12:25:31.785348
2017-09-23 12:13:05.760913
2020-02-03 15:01:13.815294
2021-04-11 18:40:58.695354
2020-09-05 01:03:25.116304
2017-02-09 01:51:40.082178
2019-09-25 11:39:21.608486
2017-05-24 05:40:06.851664
2018-04-26 20:04:42.555268
2017-01-01 22:13:08.272168
2019-09-01 15:31:45.113817
2020-09-15 04:20:55.938544
2018-02-22 15:47:52.383808
2019-06-02 20:52:24.908879
2021-10-28 08:47:32.184281
2022-12-03 05:13:51.280093
2022-04-30 07:10:39.857355
2

2020-04-17 08:37:21.915029
2019-11-16 12:55:30.058297
2022-09-26 02:23:34.044179
2017-07-06 12:53:04.902758
2020-08-27 02:45:43.379228
2020-12-02 09:02:21.646509
2017-07-15 10:16:06.603942
2021-10-17 00:43:42.217947
2018-04-19 11:07:56.026148
2021-07-03 11:52:37.521248
2020-01-27 14:42:38.975152
2020-10-13 04:22:37.501731
2019-12-30 22:57:19.077876
2021-06-05 10:02:41.113675
2019-01-15 11:40:54.044573
2021-07-02 01:00:46.768896
2020-10-30 16:40:13.489757
2017-02-08 21:09:07.970446
2022-06-23 14:07:43.370499
2020-04-04 09:30:09.339221
2021-04-01 09:06:43.597394
2017-01-11 15:39:15.234679
2019-02-22 21:17:33.124131
2018-07-31 17:38:42.521716
2021-05-25 17:14:01.999116
2018-11-22 04:10:18.481930
2020-04-25 16:32:03.478661
2021-03-27 22:45:14.358944
2020-03-17 11:53:48.888206
2022-10-07 14:36:01.269291
2021-05-17 00:40:27.329862
2022-02-15 18:11:25.734925
2017-11-16 17:00:05.999888
2020-08-23 22:39:06.911630
2018-08-29 14:52:52.107040
2021-12-09 06:28:25.428703
2019-09-30 23:04:52.356632
2

2020-08-23 02:30:24.977184
2018-03-06 06:49:49.691038
2019-02-13 04:30:28.671197
2020-04-14 18:30:08.810789
2022-05-20 07:41:30.433669
2020-07-14 02:38:24.319976
2020-07-03 21:58:27.128898
2020-05-23 19:35:08.624260
2020-07-26 08:20:37.914055
2020-05-09 09:01:31.282816
2021-06-09 00:01:02.887596
2017-08-15 08:25:07.567084
2019-09-11 01:57:04.669395
2021-08-10 17:43:09.920728
2017-04-09 22:12:42.744605
2019-07-05 05:36:28.765512
2021-03-25 18:29:30.217039
2022-02-27 19:29:09.811565
2020-06-19 02:24:18.771345
2020-10-29 12:52:17.021531
2020-09-14 08:00:55.888485
2018-10-29 02:32:47.086581
2017-06-28 10:57:17.848958
2019-04-23 08:18:35.148594
2022-10-06 21:58:59.112947
2018-11-13 13:08:45.306745
2020-09-18 02:01:59.224765
2017-07-21 15:51:34.323797
2017-07-22 21:19:11.718669
2019-09-09 01:46:37.317886
2018-02-23 12:45:06.927567
2020-05-22 03:13:14.366565
2019-07-01 21:47:50.693360
2022-02-27 03:08:38.773420
2019-11-01 00:09:08.349919
2021-02-13 20:15:16.042605
2018-11-26 21:46:09.722084
2

2018-09-04 01:16:16.320463
2018-11-22 01:57:04.944012
2020-04-08 03:32:04.448336
2018-08-20 18:20:27.253415
2021-12-08 18:47:21.733035
2019-01-31 13:32:55.377657
2017-12-01 14:27:37.390150
2022-09-19 07:45:40.031420
2022-02-05 12:36:41.427222
2019-02-20 21:57:15.359496
2018-05-29 10:21:36.954294
2019-07-24 23:27:52.991956
2018-11-01 06:03:45.297849
2021-02-06 04:14:34.451905
2017-07-25 09:24:43.165774
2018-04-17 21:55:34.530581
2019-09-02 18:27:44.053254
2018-10-06 01:23:36.849530
2019-08-16 19:03:40.623329
2019-10-31 21:27:37.702499
2019-06-04 15:32:32.523137
2017-05-29 06:27:23.103355
2020-06-20 06:22:50.192778
2019-11-15 14:01:47.091428
2022-12-14 21:59:00.905089
2022-12-15 10:59:49.113363
2021-07-12 01:03:57.060727
2017-07-05 04:03:36.127421
2018-03-19 07:01:32.331396
2022-04-24 20:01:40.173853
2022-07-23 09:20:49.324027
2021-11-09 08:05:18.769202
2017-04-10 02:16:06.097420
2018-08-14 22:51:53.669983
2021-10-01 10:09:54.610176
2018-11-19 19:37:42.622837
2018-06-19 19:37:41.228586
2

2018-10-13 23:22:36.827492
2022-07-31 07:09:22.720117
2020-11-11 10:30:56.000138
2021-08-02 05:16:15.689196
2019-08-16 09:57:13.826458
2018-01-22 01:04:17.091055
2022-09-12 19:40:35.703992
2020-09-30 09:57:03.336834
2018-04-23 14:37:23.815030
2022-05-21 02:03:42.580230
2022-09-01 14:33:31.898935
2020-04-17 14:34:58.490869
2018-05-05 17:20:35.085115
2019-07-15 19:25:48.023107
2020-08-15 01:52:16.009501
2018-06-18 02:16:05.048926
2020-12-21 10:13:34.147970
2022-01-13 13:03:21.093650
2022-08-24 08:10:08.036917
2017-04-04 01:06:47.084419
2020-02-01 15:26:19.350635
2017-02-12 01:15:37.868797
2020-09-15 07:48:26.287973
2021-10-31 10:42:55.552702
2021-07-14 07:29:17.732421
2017-03-29 02:27:29.319137
2017-01-25 05:44:28.234693
2022-10-27 04:40:46.564035
2021-05-28 05:32:41.949639
2021-12-25 07:29:29.020258
2017-05-19 19:14:41.681968
2022-07-05 17:10:49.116470
2018-03-30 23:27:48.077409
2018-09-26 11:04:37.695375
2018-02-17 16:28:56.314443
2019-01-17 06:55:54.387576
2021-10-01 00:38:49.523323
2

2021-05-29 10:14:27.124914
2022-09-16 01:00:29.777099
2020-01-01 06:34:21.311577
2019-05-26 10:49:39.050032
2019-04-27 09:10:52.188264
2018-03-15 03:36:05.792804
2019-08-06 21:19:31.077060
2018-06-15 22:30:17.691147
2017-10-23 05:52:11.541257
2021-05-27 10:30:04.917472
2017-05-19 14:40:19.261076
2021-04-02 21:32:38.104577
2022-06-20 22:46:18.897830
2022-02-22 18:33:24.890580
2022-02-19 09:56:21.643191
2020-03-16 20:17:26.071074
2021-01-24 21:20:18.319272
2019-02-20 17:39:02.275647
2020-07-09 12:18:06.055409
2021-06-20 14:28:32.620327
2021-04-20 10:46:11.279750
2020-03-23 20:43:04.854199
2018-09-04 02:28:43.946430
2019-07-19 15:19:51.021984
2022-10-30 19:05:15.753288
2021-02-16 22:41:08.703171
2017-11-21 04:05:57.383907
2018-03-02 14:21:10.901856
2017-05-25 12:28:51.174377
2019-05-18 00:56:00.986147
2017-10-03 05:15:22.086360
2020-06-26 11:16:41.566962
2019-04-03 07:47:49.032861
2021-07-26 14:50:35.433148
2019-01-05 16:12:01.691183
2021-01-03 13:33:17.295419
2021-07-07 19:50:19.234050
2

2019-03-10 23:10:46.429385
2020-07-07 05:24:08.968455
2018-12-10 14:45:12.647122
2021-09-16 12:21:50.775389
2022-01-06 15:40:29.467042
2018-07-24 23:05:14.446093
2021-02-11 05:12:56.601300
2021-11-01 16:09:50.245085
2020-05-23 01:19:21.676055
2017-06-01 08:14:24.701288
2020-11-19 18:36:15.798618
2022-06-29 20:32:01.447532
2021-02-03 14:56:57.340258
2021-06-29 18:40:09.847166
2021-06-30 22:15:27.733697
2021-03-24 15:39:25.556664
2020-02-06 08:30:24.707150
2022-03-16 04:23:06.457991
2020-01-13 01:29:43.562487
2017-02-03 13:00:01.162949
2019-10-18 03:41:27.272034
2022-11-19 12:15:22.153464
2021-12-23 10:37:59.645441
2022-11-09 18:42:46.307485
2017-02-02 07:24:33.228426
2017-08-11 17:19:15.784937
2020-11-11 13:20:18.970515
2022-03-11 23:14:53.009135
2018-11-10 23:44:49.052100
2019-07-24 13:38:14.969899
2021-03-13 04:13:58.814006
2020-09-27 19:09:48.657976
2019-09-11 00:49:45.029638
2020-07-14 06:49:19.566811
2021-06-15 14:44:18.197302
2021-01-23 19:37:42.507244
2019-03-24 19:02:47.621768
2

2017-01-28 14:35:35.690405
2019-08-24 16:06:14.020264
2021-10-09 10:56:39.340822
2017-01-01 03:06:19.918794
2021-10-23 21:36:48.823242
2021-07-26 20:55:34.895312
2017-04-25 15:22:37.352459
2018-01-23 11:40:25.381280
2019-11-15 07:43:06.876499
2022-06-07 08:48:33.195141
2021-05-05 17:24:43.593319
2019-06-16 10:54:13.140848
2017-02-24 14:50:52.274239
2021-10-31 16:13:42.783320
2021-12-01 18:33:11.457722
2017-05-28 14:20:22.540518
2022-11-24 22:39:22.428800
2021-10-25 11:45:24.906720
2018-10-06 23:36:53.917931
2021-09-21 21:44:56.068257
2022-04-29 11:26:05.858081
2019-07-27 07:26:10.990543
2019-10-09 06:29:35.001339
2021-08-25 02:22:16.425483
2020-05-12 07:17:45.749311
2019-02-21 01:03:55.602815
2017-03-10 13:06:27.040383
2018-03-21 06:29:12.445946
2020-08-20 13:15:54.595850
2019-01-28 21:59:12.086624
2020-05-21 11:51:17.476805
2022-08-24 06:33:09.740247
2018-05-31 18:59:09.911830
2019-05-26 13:18:52.753279
2019-09-09 14:38:46.264932
2017-12-02 09:06:22.251088
2019-10-07 07:11:39.347002
2

2020-01-08 00:04:13.797656
2020-10-26 20:27:50.352387
2018-12-02 04:08:45.233799
2019-08-02 01:56:14.069439
2019-08-03 20:12:04.955696
2020-03-23 18:39:27.578449
2017-01-26 18:17:16.225793
2018-10-02 14:33:50.494028
2022-04-19 21:41:51.694705
2019-03-28 00:32:07.587503
2020-02-15 07:49:49.174458
2021-02-03 08:11:39.278348
2020-01-18 17:25:18.454223
2018-03-15 09:32:45.547510
2018-04-03 01:09:00.244771
2019-09-11 06:27:44.601113
2022-09-15 08:03:26.457524
2017-02-11 14:45:42.793735
2017-11-07 17:42:58.021095
2019-04-23 16:41:21.227846
2021-07-28 15:32:01.511894
2017-01-25 06:58:05.588756
2020-12-16 00:43:21.828990
2022-11-07 01:21:30.275335
2022-09-22 21:27:16.700782
2020-11-17 18:03:54.269136
2017-11-25 14:21:42.562489
2021-06-02 05:51:48.226177
2017-08-26 16:30:35.361328
2020-07-30 02:23:15.004466
2018-12-25 09:35:58.442944
2018-06-16 15:12:59.771502
2022-11-14 02:55:54.417478
2019-05-25 15:23:13.399445
2022-09-29 12:08:15.665829
2020-02-28 17:03:37.228759
2020-10-08 07:24:45.623624
2

2020-10-28 03:37:42.397929
2021-03-28 23:48:03.276240
2018-03-07 20:02:25.953030
2019-04-22 11:53:09.310940
2020-04-12 11:37:21.553731
2017-06-11 05:22:17.495272
2018-11-09 02:25:10.583870
2017-12-10 11:41:25.614989
2020-03-29 06:35:02.795963
2018-04-18 18:56:48.192669
2019-04-12 13:54:52.488913
2018-03-04 08:33:56.628039
2021-09-04 01:31:40.799594
2018-12-25 19:54:06.636237
2018-08-15 12:49:30.201793
2018-04-19 13:26:32.804354
2021-03-07 17:21:16.624203
2019-08-25 21:49:53.425443
2022-01-03 11:17:30.238550
2017-10-05 23:35:11.015426
2019-06-15 09:47:22.498260
2020-05-22 05:11:45.123956
2017-12-12 00:09:06.442068
2022-05-29 03:04:02.129763
2022-01-16 22:11:30.722627
2021-06-19 15:05:23.396472
2021-03-06 22:53:16.911879
2020-10-26 17:19:13.835950
2017-09-04 13:03:59.582929
2019-11-19 11:16:22.539647
2021-07-17 20:29:45.225637
2017-08-08 18:58:28.601972
2022-03-20 02:19:52.208492
2017-01-27 15:30:51.772224
2019-07-28 18:57:57.514069
2021-11-02 00:19:15.550782
2019-03-07 23:58:23.106597
2

In [17]:
#Creating a ride date column
ride_sharing['ride_date'] = random_date

### Back to the future
An update to the data pipeline feeding into the ride_sharing DataFrame now registers each ride's date. This information is stored in the ride_date column with the object dtype, which is how pandas represents strings.

A bug was discovered that recorded rides taken today as taken next year. To fix this, you will find all values in the ride_date column that fall in the future and set them to today's date, so that today becomes the maximum possible value of the column. Before doing so, you need to convert ride_date to a datetime object.

The datetime package has been imported as dt, alongside all the packages you've been using till now.

In [18]:
import datetime as dt
# Convert ride_date to datetime
ride_sharing['ride_dt'] = pd.to_datetime(ride_sharing['ride_date'])

# Save today's date
today = pd.Timestamp('today')

# Set all in the future to today's date
ride_sharing.loc[ride_sharing['ride_dt'] > today, 'ride_dt'] = today

# Print maximum of ride_dt column
print(ride_sharing['ride_dt'].max())

2020-03-15 01:31:30.875379
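
The same idea can be tried on a small, self-contained example. This is only a sketch: the three dates and the fixed `cutoff` are made-up values chosen so the result is reproducible (the cell above uses `pd.Timestamp('today')` instead), and `Series.mask` is used here as one alternative to the `.loc` assignment.

```python
import pandas as pd

# Toy ride dates stored as strings (object dtype); values are made up for illustration
rides = pd.DataFrame({
    'ride_date': ['2019-05-01 08:15:00', '2020-02-27 16:03:27', '2021-09-10 12:53:51']
})

# Convert the strings to datetimes; errors='coerce' turns unparseable values into NaT
rides['ride_dt'] = pd.to_datetime(rides['ride_date'], errors='coerce')

# Assumption: a fixed reference date stands in for "today" to keep the output stable
cutoff = pd.Timestamp('2020-03-15')

# Replace every date later than the cutoff with the cutoff itself
rides['ride_dt'] = rides['ride_dt'].mask(rides['ride_dt'] > cutoff, cutoff)

# Sanity check: no date is later than the cutoff any more
assert rides['ride_dt'].max() <= cutoff
print(rides['ride_dt'].max())
```

Asserting on the maximum after the fix is a cheap way to confirm the range constraint holds before moving on.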


In [19]:
#Creating a subset of the dataset (copy() avoids SettingWithCopyWarning when columns are modified later)
ride_sharing_sub = ride_sharing.loc[0:77, :].copy()
ride_sharing_sub.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 78 entries, 0 to 77
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   Unnamed: 0       78 non-null     int64         
 1   duration         78 non-null     object        
 2   station_A_id     78 non-null     int64         
 3   station_A_name   78 non-null     object        
 4   station_B_id     78 non-null     int64         
 5   station_B_name   78 non-null     object        
 6   bike_id          78 non-null     int64         
 7   user_type        78 non-null     int64         
 8   user_birth_year  78 non-null     int64         
 9   user_gender      78 non-null     object        
 10  user_type_cat    78 non-null     category      
 11  duration_trim    78 non-null     object        
 12  duration_time    78 non-null     int64         
 13  tire_sizes       78 non-null     category      
 14  ride_date        78 non-null     datetime64[

In [20]:
ride_sharing_sub.columns

Index(['Unnamed: 0', 'duration', 'station_A_id', 'station_A_name',
       'station_B_id', 'station_B_name', 'bike_id', 'user_type',
       'user_birth_year', 'user_gender', 'user_type_cat', 'duration_trim',
       'duration_time', 'tire_sizes', 'ride_date', 'ride_dt'],
      dtype='object')

In [21]:
#Dropping unnecessary columns
cols_to_go = ['Unnamed: 0', 'user_type_cat', 'duration_trim', 'duration_time', 'ride_dt']
ride_sharing_sub = ride_sharing_sub.drop(cols_to_go, axis = 1)

In [22]:
#Creating an id for each row (named ride_ids to avoid shadowing the built-in id function)
ride_ids = extra.id
ride_sharing_sub.insert(loc = 0, column = 'ride_id', value = ride_ids)

In [23]:
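#Removing the word 'minutes' from the duration column
#(str.strip('minutes') strips individual characters, not the word, and leaves the trailing space)
ride_sharing_sub['duration'] = ride_sharing_sub['duration'].str.replace('minutes', '', regex = False).str.strip()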

In [24]:
#Overwriting duration with the prepared values in extra.duration
duration = extra.duration
ride_sharing_sub['duration'] = duration

In [25]:
#Overwriting user_birth_year with the prepared values in extra.user_birth_year
user_birth_year = extra.user_birth_year
ride_sharing_sub['user_birth_year'] = user_birth_year

In [26]:
ride_sharing_sub.head()

Unnamed: 0,ride_id,duration,station_A_id,station_A_name,station_B_id,station_B_name,bike_id,user_type,user_birth_year,user_gender,tire_sizes,ride_date
0,0,11,81,Berry St at 4th St,323,Broadway at Kearny,5480,2,1988,Male,26,2020-03-15 01:31:30.875379
1,1,8,3,Powell St BART Station (Market St at 4th St),118,Eureka Valley Recreation Center,5193,2,1988,Male,27,2020-03-15 01:31:30.875379
2,2,11,67,San Francisco Caltrain Station 2 (Townsend St...,23,The Embarcadero at Steuart St,3652,3,1988,Male,26,2020-03-15 01:31:30.875379
3,3,7,16,Steuart St at Market St,28,The Embarcadero at Bryant St,1883,1,1969,Male,27,2020-03-15 01:31:30.875379
4,4,11,22,Howard St at Beale St,350,8th St at Brannan St,4626,2,1986,Male,27,2020-03-15 01:31:30.875379


### Finding duplicates
A new update to the data pipeline feeding into ride_sharing has added the ride_id column, which represents a unique identifier for each ride.

The update, however, coincided with radically shorter average ride durations and irregular user birth years set in the future. Most importantly, the number of rides taken has increased by 20% overnight, leading you to think there might be both complete and incomplete duplicates in the ride_sharing DataFrame.

In this exercise, you will confirm this suspicion by finding those duplicates. A sample of ride_sharing is in your environment, as well as all the packages you've been working with thus far.

In [27]:
# Find duplicates
duplicates = ride_sharing_sub.duplicated(subset = 'ride_id', keep = False)
print(duplicates)

0     False
1     False
2     False
3     False
4     False
      ...  
73    False
74     True
75     True
76     True
77     True
Length: 78, dtype: bool
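
To see what the `subset` and `keep` arguments of `duplicated()` change, here is a minimal sketch on a made-up five-row DataFrame (the values are illustrative and not taken from ride_sharing).

```python
import pandas as pd

# Made-up rides: ride_id 2 is a complete duplicate, ride_id 3 an incomplete one
rides = pd.DataFrame({
    'ride_id': [1, 2, 2, 3, 3],
    'duration': [10, 9, 9, 11, 4],
})

# keep=False flags every row whose ride_id appears more than once
all_dupes = rides.duplicated(subset='ride_id', keep=False)

# keep='first' (the default) leaves the first occurrence unflagged
later_dupes = rides.duplicated(subset='ride_id', keep='first')

# Without subset, only rows identical in every column count as duplicates
complete_dupes = rides.duplicated(keep=False)

print(all_dupes.tolist())       # [False, True, True, True, True]
print(later_dupes.tolist())     # [False, False, True, False, True]
print(complete_dupes.tolist())  # [False, True, True, False, False]
```

Using `keep=False` is convenient for inspection because it shows all copies side by side, while `keep='first'` matches what `drop_duplicates()` removes by default.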


In [28]:
# Sort your duplicated rides
duplicated_rides = ride_sharing_sub[duplicates].sort_values(by = 'ride_id')
print(duplicated_rides.head())

    ride_id  duration  station_A_id  \
22       33        10             5   
39       33         2            30   
53       55         9            21   
65       55         9            16   
74       71        11            67   

                                       station_A_name  station_B_id  \
22       Powell St BART Station (Market St at 5th St)           356   
39     San Francisco Caltrain (Townsend St at 4th St)           130   
53   Montgomery St BART Station (Market St at 2nd St)            78   
65                            Steuart St at Market St            93   
74  San Francisco Caltrain Station 2  (Townsend St...            90   

                  station_B_name  bike_id  user_type  user_birth_year  \
22   Valencia St at Clinton Park     2165          2             1979   
39      22nd St Caltrain Station     5213          1             1979   
53           Folsom St at 9th St     1502          2             1985   
65  4th St at Mission Bay Blvd S     5392     

In [29]:
# Print relevant columns of duplicated_rides
print(duplicated_rides[['ride_id','duration','user_birth_year']])

    ride_id  duration  user_birth_year
22       33        10             1979
39       33         2             1979
53       55         9             1985
65       55         9             1985
74       71        11             1997
75       71        11             1997
76       89         9             1986
77       89         9             2060


In [30]:
# Drop complete duplicates from ride_sharing
ride_dup = ride_sharing_sub.drop_duplicates()
ride_dup[ride_dup.duplicated(subset = 'ride_id', keep = False)]

Unnamed: 0,ride_id,duration,station_A_id,station_A_name,station_B_id,station_B_name,bike_id,user_type,user_birth_year,user_gender,tire_sizes,ride_date
22,33,10,5,Powell St BART Station (Market St at 5th St),356,Valencia St at Clinton Park,2165,2,1979,Male,27,2020-03-15 01:31:30.875379
39,33,2,30,San Francisco Caltrain (Townsend St at 4th St),130,22nd St Caltrain Station,5213,1,1979,Male,26,2020-03-15 01:31:30.875379
53,55,9,21,Montgomery St BART Station (Market St at 2nd St),78,Folsom St at 9th St,1502,2,1985,Female,27,2020-03-15 01:31:30.875379
65,55,9,16,Steuart St at Market St,93,4th St at Mission Bay Blvd S,5392,2,1985,Male,27,2020-03-15 01:31:30.875379
74,71,11,67,San Francisco Caltrain Station 2 (Townsend St...,90,Townsend St at 7th St,1920,2,1997,Male,26,2020-03-15 01:31:30.875379
75,71,11,21,Montgomery St BART Station (Market St at 2nd St),58,Market St at 10th St,316,2,1997,Female,26,2020-03-15 01:31:30.875379
76,89,9,22,Howard St at Beale St,72,Page St at Scott St,5162,2,1986,Female,27,2020-03-15 01:31:30.875379
77,89,9,21,Montgomery St BART Station (Market St at 2nd St),64,5th St at Brannan St,1299,2,2060,Male,27,2020-03-15 01:31:30.875379


In [31]:
# Create statistics dictionary for aggregation function
statistics = {'user_birth_year': 'min', 'duration': 'mean'}

In [32]:
# Group by ride_id and compute new statistics
ride_unique = ride_dup.groupby('ride_id').agg(statistics).reset_index()
ride_unique

Unnamed: 0,ride_id,user_birth_year,duration
0,0,1988,11
1,1,1988,8
2,2,1988,11
3,3,1969,7
4,4,1986,11
...,...,...,...
69,94,1993,25
70,95,1959,11
71,96,1991,7
72,98,1989,21


In [33]:
# Find duplicated values again
duplicates = ride_unique.duplicated(subset = 'ride_id', keep = False)
duplicated_rides = ride_unique[duplicates]

# Assert duplicates are processed
assert duplicated_rides.shape[0] == 0
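
Note that aggregating with a dictionary returns only the columns named in it, which is why ride_unique above has just three columns. The sketch below repeats the two-step treatment (drop complete duplicates, then aggregate incomplete ones) on made-up data and, as an assumption about what you might want, keeps one extra column by taking its first value per ride.

```python
import pandas as pd

# Made-up rides: id 33 is an incomplete duplicate, id 55 a complete duplicate
rides = pd.DataFrame({
    'ride_id': [33, 33, 55, 55, 71],
    'duration': [10, 2, 9, 9, 11],
    'user_birth_year': [1979, 1979, 1985, 1985, 1997],
    'user_gender': ['Male', 'Male', 'Female', 'Female', 'Male'],
})

# Step 1: drop rows that are identical in every column
rides = rides.drop_duplicates()

# Step 2: collapse the remaining rows that share a ride_id
rules = {
    'user_birth_year': 'min',
    'duration': 'mean',
    'user_gender': 'first',  # assumption: keep the first value seen for other columns
}
rides_unique = rides.groupby('ride_id').agg(rules).reset_index()

# Every ride_id should now appear exactly once
assert not rides_unique['ride_id'].duplicated().any()
print(rides_unique)
```

Which statistic to use for each column is a judgment call that depends on the data; `'min'`, `'mean'`, and `'first'` here are illustrative choices.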

### 2. Text & Categorical Data Problems
Categorical and text data can often be some of the messiest parts of a dataset due to their unstructured nature. We will fix whitespace and capitalization inconsistencies in category labels, collapse multiple categories into one, and reformat strings for consistency.

#### Finding consistency
In this exercise, we'll be working with the airlines DataFrame, which contains survey responses about San Francisco Airport from airline customers.

The DataFrame contains flight metadata such as the airline, the destination, waiting times as well as answers to key questions regarding cleanliness, safety, and satisfaction. Another DataFrame named categories was created, containing all correct possible values for the survey columns.

In this exercise, we will use both of these DataFrames to find survey answers with inconsistent values and drop them, effectively performing an anti join (values in airlines that are not in categories) and an inner join (values present in both) on these DataFrames.
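
The two steps, finding values that are absent from the reference table and keeping only rows whose values are present in it, can be sketched on toy data as follows. The `allowed` and `survey` DataFrames and the 'Unacceptable' label are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical reference table of allowed values
allowed = pd.DataFrame({'cleanliness': ['Clean', 'Average', 'Dirty']})

# Hypothetical survey answers; 'Unacceptable' is not an allowed value
survey = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'cleanliness': ['Clean', 'Unacceptable', 'Dirty', 'Average'],
})

# Values that occur in the survey but not in the reference table
unexpected = set(survey['cleanliness']).difference(allowed['cleanliness'])

# Boolean mask marking the rows that carry an unexpected value
is_unexpected = survey['cleanliness'].isin(unexpected)

# Rows only in the survey (anti join) and rows consistent with the reference (inner join)
anti_join = survey[is_unexpected]
inner_join = survey[~is_unexpected]

print(unexpected)                 # {'Unacceptable'}
print(inner_join['id'].tolist())  # [1, 3, 4]
```

Working with a boolean mask keeps both views available: the inconsistent rows for inspection and the consistent rows for analysis.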

In [34]:
airlines = pd.read_csv('../Datasets/airlines_final.csv')
airlines.head()

Unnamed: 0.1,Unnamed: 0,id,day,airline,destination,dest_region,dest_size,boarding_area,dept_time,wait_min,cleanliness,safety,satisfaction
0,0,1351,Tuesday,UNITED INTL,KANSAI,Asia,Hub,Gates 91-102,2018-12-31,115.0,Clean,Neutral,Very satisfied
1,1,373,Friday,ALASKA,SAN JOSE DEL CABO,Canada/Mexico,Small,Gates 50-59,2018-12-31,135.0,Clean,Very safe,Very satisfied
2,2,2820,Thursday,DELTA,LOS ANGELES,West US,Hub,Gates 40-48,2018-12-31,70.0,Average,Somewhat safe,Neutral
3,3,1157,Tuesday,SOUTHWEST,LOS ANGELES,West US,Hub,Gates 20-39,2018-12-31,190.0,Clean,Very safe,Somewhat satsified
4,4,2992,Wednesday,AMERICAN,MIAMI,East US,Hub,Gates 50-59,2018-12-31,559.0,Somewhat clean,Very safe,Somewhat satsified


In [35]:
#Creating the categories dataframe
data = {'cleanliness' : ['Clean', 'Average', 'Somewhat clean', 'Somewhat dirty', 'Dirty'],
        'safety': ['Neutral', 'Very Safe', 'Somewhat safe', 'Very unsafe', 'Somewhat unsafe'],
        'satisfaction': ['Very satisfied', 'neutral', 'Somewhat satisfied', 'Somewhat unsatisfied', 'Very unsatisfied']
       }

categories = pd.DataFrame(data)
# Print categories DataFrame
print(categories)

# Print unique values of survey columns in airlines
print('Cleanliness: ', airlines['cleanliness'].unique(), "\n")
print('Safety: ', airlines['safety'].unique(), "\n")
print('Satisfaction: ', airlines['satisfaction'].unique(), "\n")

      cleanliness           safety          satisfaction
0           Clean          Neutral        Very satisfied
1         Average        Very Safe               neutral
2  Somewhat clean    Somewhat safe    Somewhat satisfied
3  Somewhat dirty      Very unsafe  Somewhat unsatisfied
4           Dirty  Somewhat unsafe      Very unsatisfied
Cleanliness:  ['Clean' 'Average' 'Somewhat clean' 'Somewhat dirty' 'Dirty'] 

Safety:  ['Neutral' 'Very safe' 'Somewhat safe' 'Very unsafe' 'Somewhat unsafe'] 

Satisfaction:  ['Very satisfied' 'Neutral' 'Somewhat satsified' 'Somewhat unsatisfied'
 'Very unsatisfied'] 



In [36]:
airlines.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2477 entries, 0 to 2476
Data columns (total 13 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Unnamed: 0     2477 non-null   int64  
 1   id             2477 non-null   int64  
 2   day            2477 non-null   object 
 3   airline        2477 non-null   object 
 4   destination    2477 non-null   object 
 5   dest_region    2477 non-null   object 
 6   dest_size      2477 non-null   object 
 7   boarding_area  2477 non-null   object 
 8   dept_time      2477 non-null   object 
 9   wait_min       2477 non-null   float64
 10  cleanliness    2477 non-null   object 
 11  safety         2477 non-null   object 
 12  satisfaction   2477 non-null   object 
dtypes: float64(1), int64(2), object(10)
memory usage: 251.7+ KB


In [37]:
#Finding the inconsistencies within the cleanliness, safety, and satisfaction columns.
inconsistent_categories = set(airlines['cleanliness']).difference(categories['cleanliness'])
print('Inconsistent values in the cleanliness column:', inconsistent_categories)

inconsistent_categories1 = set(airlines['safety']).difference(categories['safety'])
print('Inconsistent values in the safety column:', inconsistent_categories1)

inconsistent_categories2 = set(airlines['satisfaction']).difference(categories['satisfaction'])
print('Inconsistent values in the satisfaction column:', inconsistent_categories2)

Inconsistent values in the cleanliness column: set()
Inconsistent values in the safety column: {'Very safe'}
Inconsistent values in the satisfaction column: {'Neutral', 'Somewhat satsified'}
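
Some of the differences reported above come from capitalization alone ('Very safe' vs 'Very Safe', 'Neutral' vs 'neutral'), while 'Somewhat satsified' is a genuine misspelling. One way to tell the two apart, sketched here on a handful of labels copied from the outputs above, is to compare normalized labels. The `normalize` helper is a hypothetical name introduced for this example.

```python
import pandas as pd

# A few labels as they appear in airlines and in categories (mixed across columns for brevity)
observed = pd.Series(['Neutral', 'Very safe', 'Somewhat satsified', 'Very satisfied'])
reference = pd.Series(['neutral', 'Very Safe', 'Somewhat satisfied', 'Very satisfied'])

def normalize(labels):
    """Strip surrounding whitespace and lowercase each label (hypothetical helper)."""
    return labels.str.strip().str.lower()

# Exact comparison flags capitalization mismatches and misspellings alike
exact_diff = set(observed).difference(reference)

# After normalization only the genuine misspelling remains
normalized_diff = set(normalize(observed)).difference(normalize(reference))

print(sorted(exact_diff))       # ['Neutral', 'Somewhat satsified', 'Very safe']
print(sorted(normalized_diff))  # ['somewhat satsified']
```

Labels that disappear after normalization can usually be fixed with `.str.strip()` and `.str.lower()` (or `.str.upper()`), whereas the ones that remain need an explicit mapping or removal.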


In [38]:
# Find the cleanliness category in airlines not in categories
cat_clean = set(airlines['cleanliness']).difference(categories['cleanliness'])
print(cat_clean)

# Find rows with that category
cat_clean_rows = airlines['cleanliness'].isin(cat_clean)

# Print rows with inconsistent category
print(airlines[cat_clean_rows])

# Print rows with consistent categories only
print(airlines[~cat_clean_rows])


set()
Empty DataFrame
Columns: [Unnamed: 0, id, day, airline, destination, dest_region, dest_size, boarding_area, dept_time, wait_min, cleanliness, safety, satisfaction]
Index: []
      Unnamed: 0    id        day        airline        destination  \
0              0  1351    Tuesday    UNITED INTL             KANSAI   
1              1   373     Friday         ALASKA  SAN JOSE DEL CABO   
2              2  2820   Thursday          DELTA        LOS ANGELES   
3              3  1157    Tuesday      SOUTHWEST        LOS ANGELES   
4              4  2992  Wednesday       AMERICAN              MIAMI   
...          ...   ...        ...            ...                ...   
2472        2804  1475    Tuesday         ALASKA       NEW YORK-JFK   
2473        2805  2222   Thursday      SOUTHWEST            PHOENIX   
2474        2806  2684     Friday         UNITED            ORLANDO   
2475        2807  2549    Tuesday        JETBLUE         LONG BEACH   
2476        2808  2162   Saturday  CHIN

In [39]:
#Find the safety categories in airlines not in categories
safety_clean = set(airlines['safety']).difference(categories['safety'])

#Find rows with that category
safety_clean_rows = airlines['safety'].isin(safety_clean)

#Print rows with inconsistent category
print(airlines[safety_clean_rows])

#Print rows with consistent categories only
print(airlines[~safety_clean_rows])

      Unnamed: 0    id        day        airline        destination  \
1              1   373     Friday         ALASKA  SAN JOSE DEL CABO   
3              3  1157    Tuesday      SOUTHWEST        LOS ANGELES   
4              4  2992  Wednesday       AMERICAN              MIAMI   
5              5   634   Thursday         ALASKA             NEWARK   
6              6  2578   Saturday        JETBLUE         LONG BEACH   
...          ...   ...        ...            ...                ...   
2466        2798  3099     Sunday         ALASKA             NEWARK   
2470        2802   394     Friday         ALASKA        LOS ANGELES   
2473        2805  2222   Thursday      SOUTHWEST            PHOENIX   
2474        2806  2684     Friday         UNITED            ORLANDO   
2476        2808  2162   Saturday  CHINA EASTERN            QINGDAO   

        dest_region dest_size boarding_area   dept_time  wait_min  \
1     Canada/Mexico     Small   Gates 50-59  2018-12-31     135.0   
3        

### Inconsistent categories
We'll be revisiting the airlines DataFrame from the previous lesson.

As a reminder, the DataFrame contains flight metadata such as the airline, the destination, and waiting times, as well as answers to key survey questions about cleanliness, safety, and satisfaction at San Francisco Airport.

In this exercise, you will examine two categorical columns from this DataFrame, dest_region and dest_size, assess how to address their inconsistencies, and make sure they are cleaned and ready for analysis.

In [40]:
# Print unique values of both columns
print(airlines['dest_region'].unique(), '\n')
print(airlines['dest_size'].unique())

['Asia' 'Canada/Mexico' 'West US' 'East US' 'Midwest US' 'EAST US'
 'Middle East' 'Europe' 'eur' 'Central/South America'
 'Australia/New Zealand' 'middle east'] 

['Hub' 'Small' '    Hub' 'Medium' 'Large' 'Hub     ' '    Small'
 'Medium     ' '    Medium' 'Small     ' '    Large' 'Large     ']


In [41]:
# Lowercase the dest_region column, then replace "eur" with "europe"
airlines['dest_region'] = airlines['dest_region'].str.lower() 
print(airlines['dest_region'].unique(), '\n')

airlines['dest_region'] = airlines['dest_region'].replace({'eur':'europe'})
print(airlines['dest_region'].unique())

['asia' 'canada/mexico' 'west us' 'east us' 'midwest us' 'middle east'
 'europe' 'eur' 'central/south america' 'australia/new zealand'] 

['asia' 'canada/mexico' 'west us' 'east us' 'midwest us' 'middle east'
 'europe' 'central/south america' 'australia/new zealand']


In [42]:
# Remove white spaces from `dest_size`
airlines['dest_size'] = airlines['dest_size'].str.strip()

# Verify that the changes took effect
print(airlines['dest_size'].unique())

['Hub' 'Small' 'Medium' 'Large']
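
The capitalization and whitespace fixes above can also be chained into a single expression. This is a minimal sketch on a made-up Series (`regions` is hypothetical and only mimics the kinds of problems seen in dest_region and dest_size), not a replacement for the cells above.

```python
import pandas as pd

# Hypothetical values showing mixed case, stray whitespace, and an abbreviation
regions = pd.Series(["Europe", "eur", "EAST US", "  East US", "middle east "])

cleaned = (
    regions
    .str.strip()                 # drop leading and trailing whitespace
    .str.lower()                 # collapse case variants
    .replace({"eur": "europe"})  # map the abbreviation to its full category
)

unique_cleaned = sorted(cleaned.unique())
print(unique_cleaned)
```

Stripping and lowercasing before the `replace` call means the mapping only needs one key per abbreviation, rather than one per spelling variant.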


In [43]:
# Create ranges for categories
label_ranges = [0, 60, 180, np.inf]
label_names = ['short', 'medium', 'long']

# Create wait_type column
airlines['wait_type'] = pd.cut(airlines['wait_min'], bins = label_ranges, 
                                labels = label_names)

# Create mappings and replace
mappings = {'Monday':'weekday', 'Tuesday':'weekday', 'Wednesday': 'weekday', 
            'Thursday': 'weekday', 'Friday': 'weekday', 
            'Saturday': 'weekend', 'Sunday': 'weekend'}

airlines['day_week'] = airlines['day'].replace(mappings)

In [44]:
print(airlines.head(3))

   Unnamed: 0    id       day      airline        destination    dest_region  \
0           0  1351   Tuesday  UNITED INTL             KANSAI           asia   
1           1   373    Friday       ALASKA  SAN JOSE DEL CABO  canada/mexico   
2           2  2820  Thursday        DELTA        LOS ANGELES        west us   

  dest_size boarding_area   dept_time  wait_min cleanliness         safety  \
0       Hub  Gates 91-102  2018-12-31     115.0       Clean        Neutral   
1     Small   Gates 50-59  2018-12-31     135.0       Clean      Very safe   
2       Hub   Gates 40-48  2018-12-31      70.0     Average  Somewhat safe   

     satisfaction wait_type day_week  
0  Very satisfied    medium  weekday  
1  Very satisfied    medium  weekday  
2         Neutral    medium  weekday  
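
One detail of `pd.cut` worth keeping in mind is that bins are right-inclusive by default, so with the edges used above a value of exactly 0 would fall outside the first bin. The sketch below uses a made-up `wait` Series to show this; the airlines data itself has no missing wait_type values, so it does not appear to be affected.

```python
import numpy as np
import pandas as pd

# Hypothetical wait times chosen to sit on the bin edges
wait = pd.Series([0, 30, 60, 61, 180, 400])
bins = [0, 60, 180, np.inf]
labels = ["short", "medium", "long"]

# Default intervals are (0, 60], (60, 180], (180, inf]: 0 is excluded and becomes NaN
default = pd.cut(wait, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left, so 0 is labeled "short"
with_lowest = pd.cut(wait, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"wait": wait, "default": default, "with_lowest": with_lowest}))
```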


In [45]:
airlines.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2477 entries, 0 to 2476
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   Unnamed: 0     2477 non-null   int64   
 1   id             2477 non-null   int64   
 2   day            2477 non-null   object  
 3   airline        2477 non-null   object  
 4   destination    2477 non-null   object  
 5   dest_region    2477 non-null   object  
 6   dest_size      2477 non-null   object  
 7   boarding_area  2477 non-null   object  
 8   dept_time      2477 non-null   object  
 9   wait_min       2477 non-null   float64 
 10  cleanliness    2477 non-null   object  
 11  safety         2477 non-null   object  
 12  satisfaction   2477 non-null   object  
 13  wait_type      2477 non-null   category
 14  day_week       2477 non-null   object  
dtypes: category(1), float64(1), int64(2), object(11)
memory usage: 273.6+ KB


In [46]:
# Take the first 200 rows as an independent copy, so later edits do not touch airlines
airlines_sub = airlines[0:200].copy()
airlines_sub.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   Unnamed: 0     200 non-null    int64   
 1   id             200 non-null    int64   
 2   day            200 non-null    object  
 3   airline        200 non-null    object  
 4   destination    200 non-null    object  
 5   dest_region    200 non-null    object  
 6   dest_size      200 non-null    object  
 7   boarding_area  200 non-null    object  
 8   dept_time      200 non-null    object  
 9   wait_min       200 non-null    float64 
 10  cleanliness    200 non-null    object  
 11  safety         200 non-null    object  
 12  satisfaction   200 non-null    object  
 13  wait_type      200 non-null    category
 14  day_week       200 non-null    object  
dtypes: category(1), float64(1), int64(2), object(11)
memory usage: 22.3+ KB


In [47]:
#Adding passenger names to the dataset from a list
full_name = extra.names
airlines_sub.insert(loc = 2, column = 'full_name', value = full_name)
print(airlines_sub.head(2))

   Unnamed: 0    id        full_name      day      airline        destination  \
0           0  1351   Melodie Stuart  Tuesday  UNITED INTL             KANSAI   
1           1   373  Dominic Shannon   Friday       ALASKA  SAN JOSE DEL CABO   

     dest_region dest_size boarding_area   dept_time  wait_min cleanliness  \
0           asia       Hub  Gates 91-102  2018-12-31     115.0       Clean   
1  canada/mexico     Small   Gates 50-59  2018-12-31     135.0       Clean   

      safety    satisfaction wait_type day_week  
0    Neutral  Very satisfied    medium  weekday  
1  Very safe  Very satisfied    medium  weekday  


### Removing titles and taking names
While collecting survey respondent metadata in the airlines DataFrame, the full name of respondents was saved in the full_name column. However, on closer inspection, you find that many of the names are prefixed by honorifics such as "Dr.", "Mr.", "Ms." and "Miss".

The ultimate objective is to create two new columns named first_name and last_name, containing the first and last names of respondents respectively. Before doing so, however, you need to remove the honorifics.

In [54]:
# regex=False makes "." a literal period instead of the regex "any character" wildcard

# Replace "Dr." with empty string ""
airlines_sub['full_name'] = airlines_sub['full_name'].str.replace('Dr.', '', regex=False)

# Replace "Mr." with empty string ""
airlines_sub['full_name'] = airlines_sub['full_name'].str.replace('Mr.', '', regex=False)

# Replace "Miss" with empty string ""
airlines_sub['full_name'] = airlines_sub['full_name'].str.replace('Miss', '', regex=False)

# Replace "Ms." with empty string ""
airlines_sub['full_name'] = airlines_sub['full_name'].str.replace('Ms.', '', regex=False)
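
As an alternative to one `str.replace` call per honorific, a single anchored regular expression can remove any of them, after which the name can be split into first_name and last_name. This is a sketch under stated assumptions: the `names` Series is made up (only "Melodie Stuart" and "Dominic Shannon" appear in the data above, and the honorifics attached to them here are invented), and it assumes each name has exactly one first and one last name.

```python
import pandas as pd

# Hypothetical names; the last one checks that "Dr" inside a word is left alone
names = pd.Series([
    "Dr. Melodie Stuart",
    "Mr. Dominic Shannon",
    "Miss Ada Drake",
    "Ms. Joan Smith",
    "Drew Missner",
])

# ^ anchors the match to the start of the string; \. matches a literal period
pattern = r"^(?:Dr\.|Mr\.|Ms\.|Miss)\s+"
stripped = names.str.replace(pattern, "", regex=True)

# Split on the first space only, giving one column per name part
parts = stripped.str.split(" ", n=1, expand=True)
people = pd.DataFrame({"first_name": parts[0], "last_name": parts[1]})
print(people)
```

Escaping the period and requiring whitespace after the honorific keeps names such as "Drew" intact, which an unescaped pattern like `Dr.` would not when it is treated as a regular expression.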