Prepare and Process

kk 2022-05-19

Prepare

  • Where is your data located? A link is provided to a website where the data can be downloaded.
  • How is the data organized? There are zip files with tripdata for each month from April 2020 to April 2022. There are also files which seem to contain quarterly summarized tripdata from 2013 to 2018. The naming is not consistent. The task is to look only at the last 12 months, so I download only the files from May 2021 to April 2022.
  • Are there issues with bias or credibility in this data? Does your data ROCCC? In this case, without having looked at the data itself yet, I’d say that this data is credible, because in this scenario it has been provided to me by the company itself. I would also say that it is unbiased, because (normally) it should contain the automatically recorded trip data from all clients. In terms of ROCCC, I would say that the data is Reliable (accurate, complete and unbiased), because it is collected automatically and there are files for each month. It is also Original, because it comes first-hand from the company. I cannot say yet whether the data is Comprehensive (containing all information needed to answer the question), because I haven’t looked at it yet. The data is Current, because it contains up-to-date information from the last 12 months. It is also Cited, because I know where it is coming from.
  • How are you addressing licensing, privacy, security, and accessibility? The data comes with a license that states that the data is available to the public and can be included as source material in analyses, reports or studies published for non-commercial purposes. This applies to my case, so I see no issue with uploading the data here together with my analysis.
  • How did you verify the data’s integrity? First I checked what source the data comes from and whether it is a reliable one. I’d say yes, because it comes directly from the company that rents out the bikes (in our fictional case, but also in reality it is data from a bike rental company in Chicago). I read all the csv files from the last 12 months into data frames and summarized them using R’s glimpse() and skim() functions to get an overview. I checked whether all files had the same columns and data types, whether there were missing values, and whether there were any hints of bias.
  • How does it help answer your question? The question to address is “How do annual members and casual riders use Cyclistic bikes differently?”. Consequently, a first requirement the data must fulfill is a distinction between casual riders and annual members. This requirement is fulfilled by the column “member_casual”, which sorts the users into these categories. Apart from that, the data contains information on the type of bike used, the duration of trips, and the start and end stations with their locations. With this data, it is possible to address the question in several ways: I could determine whether annual or casual users ride different types of bikes, whether their trip durations differ, and whether they differ in the variety of stations they use. Maybe there are even location-based hints, e.g. casual users riding more in their free time (to parks, for example) and annual members riding more to commute. But this could be complicated, because I don’t know Chicago very well, and it might require additional research on my side.
  • Are there any problems with the data? In terms of integrity the data looks fine (see above). After getting an overview of the data, however, I do see some problems. There seems to be a systematic issue with the recording of the end-station location (latitude and longitude), because every monthly dataset has missing values in these columns. Moreover, there are lots of missing values (not NA, but empty strings) for start and end station name and id. Another thing I noticed is that the datetime values for start and end time are of type “character” and not “date” or “datetime”.

Note: I commented out all skim functions in the code chunks because their execution took too long.

library(readr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✔ ggplot2 3.3.6     ✔ dplyr   1.0.9
## ✔ tibble  3.1.7     ✔ stringr 1.4.0
## ✔ tidyr   1.2.0     ✔ forcats 0.5.1
## ✔ purrr   0.3.4

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(skimr)
# list all the csv files, then read them into a list of data frames
filenames <- list.files(path="raw-data")
print(filenames)
##  [1] "202105-divvy-tripdata.csv" "202106-divvy-tripdata.csv"
##  [3] "202107-divvy-tripdata.csv" "202108-divvy-tripdata.csv"
##  [5] "202109-divvy-tripdata.csv" "202110-divvy-tripdata.csv"
##  [7] "202111-divvy-tripdata.csv" "202112-divvy-tripdata.csv"
##  [9] "202201-divvy-tripdata.csv" "202202-divvy-tripdata.csv"
## [11] "202203-divvy-tripdata.csv" "202204-divvy-tripdata.csv"
setwd("raw-data")
filelist <- lapply(filenames, read.csv)
print(length(filelist)) # check if length is 12
## [1] 12
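As a side note, the setwd() call above could be avoided by listing the files with their full paths; a minimal sketch of that alternative (not the code actually run here):

# sketch (not run): read the same files without changing the working directory
# filenames_full <- list.files(path = "raw-data", full.names = TRUE)
# filelist <- lapply(filenames_full, read.csv)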
# change names of files to something shorter for the dataframes
names(filelist) <- c("df_05_21","df_06_21","df_07_21","df_08_21","df_09_21","df_10_21","df_11_21","df_12_21","df_01_22","df_02_22","df_03_22","df_04_22")
# check if all data frames have the same number of columns
for (i in 1:length(filelist)) {
  print(dim(filelist[[i]]))
}
## [1] 531633     13
## [1] 729595     13
## [1] 822410     13
## [1] 804352     13
## [1] 756147     13
## [1] 631226     13
## [1] 359978     13
## [1] 247540     13
## [1] 103770     13
## [1] 115609     13
## [1] 284042     13
## [1] 371249     13
# create overviews and summaries of all data frames to get a first impression
for (i in 1:length(filelist)) {
  glimpse(filelist[[i]])
}
## Rows: 531,633
## Columns: 13
## $ ride_id            <chr> "C809ED75D6160B2A", "DD59FDCE0ACACAF3", "0AB83CB88C…
## $ rideable_type      <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at         <chr> "2021-05-30 11:58:15", "2021-05-30 11:29:14", "2021…
## $ ended_at           <chr> "2021-05-30 12:10:39", "2021-05-30 12:14:09", "2021…
## $ start_station_name <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ start_station_id   <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ end_station_name   <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ end_station_id     <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ start_lat          <dbl> 41.90000, 41.88000, 41.92000, 41.92000, 41.94000, 4…
## $ start_lng          <dbl> -87.63000, -87.62000, -87.70000, -87.70000, -87.690…
## $ end_lat            <dbl> 41.89000, 41.79000, 41.92000, 41.94000, 41.94000, 4…
## $ end_lng            <dbl> -87.61000, -87.58000, -87.70000, -87.69000, -87.700…
## $ member_casual      <chr> "casual", "casual", "casual", "casual", "casual", "…
## Rows: 729,595
## Columns: 13
## $ ride_id            <chr> "99FEC93BA843FB20", "06048DCFC8520CAF", "9598066F68…
## $ rideable_type      <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at         <chr> "2021-06-13 14:31:28", "2021-06-04 11:18:02", "2021…
## $ ended_at           <chr> "2021-06-13 14:34:11", "2021-06-04 11:24:19", "2021…
## $ start_station_name <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ start_station_id   <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ end_station_name   <chr> "", "", "", "", "", "", "", "", "", "Michigan Ave &…
## $ end_station_id     <chr> "", "", "", "", "", "", "", "", "", "13042", "", ""…
## $ start_lat          <dbl> 41.80, 41.79, 41.80, 41.78, 41.80, 41.78, 41.79, 41…
## $ start_lng          <dbl> -87.59, -87.59, -87.60, -87.58, -87.59, -87.58, -87…
## $ end_lat            <dbl> 41.80000, 41.80000, 41.79000, 41.80000, 41.79000, 4…
## $ end_lng            <dbl> -87.6000, -87.6000, -87.5900, -87.6000, -87.5900, -…
## $ member_casual      <chr> "member", "member", "member", "member", "member", "…
## Rows: 822,410
## Columns: 13
## $ ride_id            <chr> "0A1B623926EF4E16", "B2D5583A5A5E76EE", "6F264597DD…
## $ rideable_type      <chr> "docked_bike", "classic_bike", "classic_bike", "cla…
## $ started_at         <chr> "2021-07-02 14:44:36", "2021-07-07 16:57:42", "2021…
## $ ended_at           <chr> "2021-07-02 15:19:58", "2021-07-07 17:16:09", "2021…
## $ start_station_name <chr> "Michigan Ave & Washington St", "California Ave & C…
## $ start_station_id   <chr> "13001", "17660", "SL-012", "17660", "17660", "1766…
## $ end_station_name   <chr> "Halsted St & North Branch St", "Wood St & Hubbard …
## $ end_station_id     <chr> "KA1504000117", "13432", "KA1503000044", "13196", "…
## $ start_lat          <dbl> 41.88398, 41.90036, 41.86038, 41.90036, 41.90035, 4…
## $ start_lng          <dbl> -87.62468, -87.69670, -87.62581, -87.69670, -87.696…
## $ end_lat            <dbl> 41.89937, 41.88990, 41.89017, 41.89456, 41.88659, 4…
## $ end_lng            <dbl> -87.64848, -87.67147, -87.62619, -87.65345, -87.658…
## $ member_casual      <chr> "casual", "casual", "member", "member", "casual", "…
## Rows: 804,352
## Columns: 13
## $ ride_id            <chr> "99103BB87CC6C1BB", "EAFCCCFB0A3FC5A1", "9EF4F46C57…
## $ rideable_type      <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at         <chr> "2021-08-10 17:15:49", "2021-08-10 17:23:14", "2021…
## $ ended_at           <chr> "2021-08-10 17:22:44", "2021-08-10 17:39:24", "2021…
## $ start_station_name <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ start_station_id   <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ end_station_name   <chr> "", "", "", "", "", "", "", "Clark St & Grace St", …
## $ end_station_id     <chr> "", "", "", "", "", "", "", "TA1307000127", "", "",…
## $ start_lat          <dbl> 41.77000, 41.77000, 41.95000, 41.97000, 41.79000, 4…
## $ start_lng          <dbl> -87.68000, -87.68000, -87.65000, -87.67000, -87.600…
## $ end_lat            <dbl> 41.77000, 41.77000, 41.97000, 41.95000, 41.77000, 4…
## $ end_lng            <dbl> -87.68000, -87.63000, -87.66000, -87.65000, -87.620…
## $ member_casual      <chr> "member", "member", "member", "member", "member", "…
## Rows: 756,147
## Columns: 13
## $ ride_id            <chr> "9DC7B962304CBFD8", "F930E2C6872D6B32", "6EF7213790…
## $ rideable_type      <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at         <chr> "2021-09-28 16:07:10", "2021-09-28 14:24:51", "2021…
## $ ended_at           <chr> "2021-09-28 16:09:54", "2021-09-28 14:40:05", "2021…
## $ start_station_name <chr> "", "", "", "", "", "", "", "", "", "", "Clark St &…
## $ start_station_id   <chr> "", "", "", "", "", "", "", "", "", "", "TA13070001…
## $ end_station_name   <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ end_station_id     <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ start_lat          <dbl> 41.89000, 41.94000, 41.81000, 41.80000, 41.88000, 4…
## $ start_lng          <dbl> -87.68000, -87.64000, -87.72000, -87.72000, -87.740…
## $ end_lat            <dbl> 41.89, 41.98, 41.80, 41.81, 41.88, 41.88, 41.74, 41…
## $ end_lng            <dbl> -87.67, -87.67, -87.72, -87.72, -87.71, -87.74, -87…
## $ member_casual      <chr> "casual", "casual", "casual", "casual", "casual", "…
## Rows: 631,226
## Columns: 13
## $ ride_id            <chr> "620BC6107255BF4C", "4471C70731AB2E45", "26CA69D43D…
## $ rideable_type      <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at         <chr> "2021-10-22 12:46:42", "2021-10-21 09:12:37", "2021…
## $ ended_at           <chr> "2021-10-22 12:49:50", "2021-10-21 09:14:14", "2021…
## $ start_station_name <chr> "Kingsbury St & Kinzie St", "", "", "", "", "", "",…
## $ start_station_id   <chr> "KA1503000043", "", "", "", "", "", "", "", "", "",…
## $ end_station_name   <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ end_station_id     <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ start_lat          <dbl> 41.88919, 41.93000, 41.92000, 41.92000, 41.89000, 4…
## $ start_lng          <dbl> -87.63850, -87.70000, -87.70000, -87.69000, -87.710…
## $ end_lat            <dbl> 41.89000, 41.93000, 41.94000, 41.92000, 41.89000, 4…
## $ end_lng            <dbl> -87.63000, -87.71000, -87.72000, -87.69000, -87.690…
## $ member_casual      <chr> "member", "member", "member", "member", "member", "…
## Rows: 359,978
## Columns: 13
## $ ride_id            <chr> "7C00A93E10556E47", "90854840DFD508BA", "0A7D10CDD1…
## $ rideable_type      <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at         <chr> "2021-11-27 13:27:38", "2021-11-27 13:38:25", "2021…
## $ ended_at           <chr> "2021-11-27 13:46:38", "2021-11-27 13:56:10", "2021…
## $ start_station_name <chr> "", "", "", "", "", "Michigan Ave & Oak St", "", ""…
## $ start_station_id   <chr> "", "", "", "", "", "13042", "", "", "", "", "", ""…
## $ end_station_name   <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ end_station_id     <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ start_lat          <dbl> 41.93000, 41.96000, 41.96000, 41.94000, 41.90000, 4…
## $ start_lng          <dbl> -87.72000, -87.70000, -87.70000, -87.79000, -87.630…
## $ end_lat            <dbl> 41.96, 41.92, 41.96, 41.93, 41.88, 41.90, 41.80, 41…
## $ end_lng            <dbl> -87.73, -87.70, -87.70, -87.79, -87.62, -87.63, -87…
## $ member_casual      <chr> "casual", "casual", "casual", "casual", "casual", "…
## Rows: 247,540
## Columns: 13
## $ ride_id            <chr> "46F8167220E4431F", "73A77762838B32FD", "4CF4245205…
## $ rideable_type      <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at         <chr> "2021-12-07 15:06:07", "2021-12-11 03:43:29", "2021…
## $ ended_at           <chr> "2021-12-07 15:13:42", "2021-12-11 04:10:23", "2021…
## $ start_station_name <chr> "Laflin St & Cullerton St", "LaSalle Dr & Huron St"…
## $ start_station_id   <chr> "13307", "KP1705001026", "KA1504000117", "KA1504000…
## $ end_station_name   <chr> "Morgan St & Polk St", "Clarendon Ave & Leland Ave"…
## $ end_station_id     <chr> "TA1307000130", "TA1307000119", "13137", "KP1705001…
## $ start_lat          <dbl> 41.85483, 41.89441, 41.89936, 41.89939, 41.89558, 4…
## $ start_lng          <dbl> -87.66366, -87.63233, -87.64852, -87.64854, -87.682…
## $ end_lat            <dbl> 41.87197, 41.96797, 41.93758, 41.89488, 41.93125, 4…
## $ end_lng            <dbl> -87.65097, -87.65000, -87.64410, -87.63233, -87.644…
## $ member_casual      <chr> "member", "casual", "member", "member", "member", "…
## Rows: 103,770
## Columns: 13
## $ ride_id            <chr> "C2F7DD78E82EC875", "A6CF8980A652D272", "BD0F91DFF7…
## $ rideable_type      <chr> "electric_bike", "electric_bike", "classic_bike", "…
## $ started_at         <chr> "2022-01-13 11:59:47", "2022-01-10 08:41:56", "2022…
## $ ended_at           <chr> "2022-01-13 12:02:44", "2022-01-10 08:46:17", "2022…
## $ start_station_name <chr> "Glenwood Ave & Touhy Ave", "Glenwood Ave & Touhy A…
## $ start_station_id   <chr> "525", "525", "TA1306000016", "KA1504000151", "TA13…
## $ end_station_name   <chr> "Clark St & Touhy Ave", "Clark St & Touhy Ave", "Gr…
## $ end_station_id     <chr> "RP-007", "RP-007", "TA1307000001", "TA1309000021",…
## $ start_lat          <dbl> 42.01280, 42.01276, 41.92560, 41.98359, 41.87785, 4…
## $ start_lng          <dbl> -87.66591, -87.66597, -87.65371, -87.66915, -87.624…
## $ end_lat            <dbl> 42.01256, 42.01256, 41.92533, 41.96151, 41.88462, 4…
## $ end_lng            <dbl> -87.67437, -87.67437, -87.66580, -87.67139, -87.627…
## $ member_casual      <chr> "casual", "casual", "member", "casual", "member", "…
## Rows: 115,609
## Columns: 13
## $ ride_id            <chr> "E1E065E7ED285C02", "1602DCDC5B30FFE3", "BE7DD2AF4B…
## $ rideable_type      <chr> "classic_bike", "classic_bike", "classic_bike", "cl…
## $ started_at         <chr> "2022-02-19 18:08:41", "2022-02-20 17:41:30", "2022…
## $ ended_at           <chr> "2022-02-19 18:23:56", "2022-02-20 17:45:56", "2022…
## $ start_station_name <chr> "State St & Randolph St", "Halsted St & Wrightwood …
## $ start_station_id   <chr> "TA1305000029", "TA1309000061", "TA1305000029", "13…
## $ end_station_name   <chr> "Clark St & Lincoln Ave", "Southport Ave & Wrightwo…
## $ end_station_id     <chr> "13179", "TA1307000113", "13011", "13323", "TA13070…
## $ start_lat          <dbl> 41.88462, 41.92914, 41.88462, 41.94815, 41.88462, 4…
## $ start_lng          <dbl> -87.62783, -87.64908, -87.62783, -87.66394, -87.627…
## $ end_lat            <dbl> 41.91569, 41.92877, 41.87926, 41.95283, 41.88584, 4…
## $ end_lng            <dbl> -87.63460, -87.66391, -87.63990, -87.64999, -87.635…
## $ member_casual      <chr> "member", "member", "member", "member", "member", "…
## Rows: 284,042
## Columns: 13
## $ ride_id            <chr> "47EC0A7F82E65D52", "8494861979B0F477", "EFE527AF80…
## $ rideable_type      <chr> "classic_bike", "electric_bike", "classic_bike", "c…
## $ started_at         <chr> "2022-03-21 13:45:01", "2022-03-16 09:37:16", "2022…
## $ ended_at           <chr> "2022-03-21 13:51:18", "2022-03-16 09:43:34", "2022…
## $ start_station_name <chr> "Wabash Ave & Wacker Pl", "Michigan Ave & Oak St", …
## $ start_station_id   <chr> "TA1307000131", "13042", "13109", "TA1307000131", "…
## $ end_station_name   <chr> "Kingsbury St & Kinzie St", "Orleans St & Chestnut …
## $ end_station_id     <chr> "KA1503000043", "620", "15578", "TA1305000025", "13…
## $ start_lat          <dbl> 41.88688, 41.90100, 41.97835, 41.88688, 41.91172, 4…
## $ start_lng          <dbl> -87.62603, -87.62375, -87.65975, -87.62603, -87.626…
## $ end_lat            <dbl> 41.88918, 41.89820, 41.98404, 41.87771, 41.87794, 4…
## $ end_lng            <dbl> -87.63851, -87.63754, -87.66027, -87.63532, -87.662…
## $ member_casual      <chr> "member", "member", "member", "member", "member", "…
## Rows: 371,249
## Columns: 13
## $ ride_id            <chr> "3564070EEFD12711", "0B820C7FCF22F489", "89EEEE3229…
## $ rideable_type      <chr> "electric_bike", "classic_bike", "classic_bike", "c…
## $ started_at         <chr> "2022-04-06 17:42:48", "2022-04-24 19:23:07", "2022…
## $ ended_at           <chr> "2022-04-06 17:54:36", "2022-04-24 19:43:17", "2022…
## $ start_station_name <chr> "Paulina St & Howard St", "Wentworth Ave & Cermak R…
## $ start_station_id   <chr> "515", "13075", "TA1307000121", "13075", "TA1307000…
## $ end_station_name   <chr> "University Library (NU)", "Green St & Madison St",…
## $ end_station_id     <chr> "605", "TA1307000120", "TA1307000120", "KA170600500…
## $ start_lat          <dbl> 42.01913, 41.85308, 41.87184, 41.85308, 41.87181, 4…
## $ start_lng          <dbl> -87.67353, -87.63193, -87.64664, -87.63193, -87.646…
## $ end_lat            <dbl> 42.05294, 41.88189, 41.88189, 41.86749, 41.88224, 4…
## $ end_lng            <dbl> -87.67345, -87.64879, -87.64879, -87.63219, -87.641…
## $ member_casual      <chr> "member", "member", "member", "casual", "member", "…
# skim each df manually, because it did not seem to work in a loop (I couldn't figure out why)
#skim(filelist[1])
#skim(filelist[2])
#skim(filelist[3])
#skim(filelist[4])
#skim(filelist[5])
#skim(filelist[6])
#skim(filelist[7])
#skim(filelist[8])
#skim(filelist[9])
#skim(filelist[10])
#skim(filelist[11])
#skim(filelist[12])
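# Note (a sketch, not run): skim() probably "did not work" in the loop because results
# are not auto-printed inside a for loop; wrapping the call in print() should display them:
# for (i in seq_along(filelist)) print(skim(filelist[[i]]))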

# make one combined data frame from all the data frames
df_all <- bind_rows(filelist, .id = "column_label")
glimpse(df_all)
## Rows: 5,757,551
## Columns: 14
## $ column_label       <chr> "df_05_21", "df_05_21", "df_05_21", "df_05_21", "df…
## $ ride_id            <chr> "C809ED75D6160B2A", "DD59FDCE0ACACAF3", "0AB83CB88C…
## $ rideable_type      <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at         <chr> "2021-05-30 11:58:15", "2021-05-30 11:29:14", "2021…
## $ ended_at           <chr> "2021-05-30 12:10:39", "2021-05-30 12:14:09", "2021…
## $ start_station_name <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ start_station_id   <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ end_station_name   <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ end_station_id     <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ start_lat          <dbl> 41.90000, 41.88000, 41.92000, 41.92000, 41.94000, 4…
## $ start_lng          <dbl> -87.63000, -87.62000, -87.70000, -87.70000, -87.690…
## $ end_lat            <dbl> 41.89000, 41.79000, 41.92000, 41.94000, 41.94000, 4…
## $ end_lng            <dbl> -87.61000, -87.58000, -87.70000, -87.69000, -87.700…
## $ member_casual      <chr> "casual", "casual", "casual", "casual", "casual", "…
#skim(df_all)

Process

  • What tools are you choosing and why? I’m choosing R because it is powerful and flexible, works well with large datasets, and I would like to practice using it.
  • Have you ensured your data’s integrity? Yes, see ‘Prepare’ section above.
  • What steps have you taken to ensure that your data is clean? See below.
  • How can you verify that your data is clean and ready to analyse? I looked through all columns and checked whether their values and data types made sense and whether there were impossible or missing values, and fixed them where necessary. I did this by inspecting samples manually and by looking at summary statistics.
  • Have you documented your cleaning process so you can review and share those results? See below.

Steps:

  • Back up original data: the raw files are kept in the folder “raw-data”, and cleaned files will be saved in the folder “output-data”
  • Check for missing values and decide how to deal with them:
# make new df for cleaned / wip version
df_all_clean <- df_all
## check for missing values and decide how to deal with them
# skim(df_all_clean)
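Since the skim() call is commented out, here is a minimal sketch for getting the NA and empty-string counts mentioned below directly (not run here):

# sketch (not run): count NA values per column, and empty strings per column
# colSums(is.na(df_all_clean))
# sapply(df_all_clean, function(x) sum(x == "", na.rm = TRUE))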

  • There are 4766 missing values each for end_lat and end_lng. There are also empty strings (but not NA) in other columns: start_station_name (790,207), start_station_id (790,204), end_station_name (843,361), and end_station_id (843,361). For now I will leave them as they are, but keep that in mind.
  • Check for duplicates: the output from skim() contains an n_unique column. It states that there are 5,757,551 unique ride ids, which equals the number of rows. Thus, there seem to be no duplicates, assuming that the ride id is trustworthy.
  • Remove extra spaces and blanks:

## removing extra spaces and blanks
#df_all_clean <- trimws(df_all_clean, "both") # did not end up doing this because the function did not finish in a reasonable amount of time
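A sketch of how the trimming could be restricted to the character columns, which should be much faster than calling trimws() on the whole data frame (not run here):

# sketch (not run): trim leading/trailing whitespace only in character columns
# df_all_clean <- df_all_clean %>%
#   mutate(across(where(is.character), trimws))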
  • check for irrelevant data: there are not many columns, and I would not judge any of them as irrelevant per se. The columns with many missing values could be considered less relevant because they are incomplete, but the rows that do have values in these columns could still give insights, so I will not remove any data at this point. Moreover, I downloaded only data from the last 12 months, so there is no old data in my dataset that could be irrelevant.
  • check if columns are named meaningfully
## are the columns named meaningfully?
colnames(df_all_clean) # yes
##  [1] "column_label"       "ride_id"            "rideable_type"     
##  [4] "started_at"         "ended_at"           "start_station_name"
##  [7] "start_station_id"   "end_station_name"   "end_station_id"    
## [10] "start_lat"          "start_lng"          "end_lat"           
## [13] "end_lng"            "member_casual"
  • check if strings are consistent and meaningful
## are strings consistent and meaningful? check if there are spelling errors or formatting issues.
### for 'factors': I looked at the possible values of 'rideable_type' and 'member_casual' and counted their occurrences.
df_all %>%  count(rideable_type)
##   rideable_type       n
## 1  classic_bike 3202784
## 2   docked_bike  291391
## 3 electric_bike 2263376
unique(df_all$rideable_type)
## [1] "electric_bike" "classic_bike"  "docked_bike"
df_all %>% count(member_casual)
##   member_casual       n
## 1        casual 2536358
## 2        member 3221193
unique(df_all$member_casual)
## [1] "casual" "member"
  • The values of the factors seem to make sense. The number of unique ride ids equals the number of rows, and the sample I checked seemed to be formatted reasonably. For the remaining columns, I also checked a sample manually. The station ids seem to have two different formats. The station names look reasonably formatted, but I’m not an expert on the area and its naming conventions.
  • Check for misfielded values: I only looked at a subset, but that looks fine.
  • Check for mismatched data types: the only data types used are character and numeric. The columns with a small number of distinct values (rideable_type, member_casual) could be converted to factors, and the start and end times should be converted to datetime.
# check for mismatched data types
## "should be transformed into 'factor', start and end times to 'datetime'
#df_all_clean["member_casual"] <- as.factor(df_all_clean["member_casual"]) # didn't do for now because it took long and returned NA, TODO: find out why
#df_all_clean["rideable_type"] <- as.factor(df_all_clean["rideable_type"]) # see above. Also: do I really need to transform chr to factor?
df_all_clean$started_at <- as.POSIXct(df_all_clean$started_at, format="%Y-%m-%d %H:%M:%S") #2021-05-25 17:28:11
df_all_clean$ended_at <- as.POSIXct(df_all_clean$ended_at, format="%Y-%m-%d %H:%M:%S")
glimpse(df_all_clean)
## Rows: 5,757,551
## Columns: 14
## $ column_label       <chr> "df_05_21", "df_05_21", "df_05_21", "df_05_21", "df…
## $ ride_id            <chr> "C809ED75D6160B2A", "DD59FDCE0ACACAF3", "0AB83CB88C…
## $ rideable_type      <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at         <dttm> 2021-05-30 11:58:15, 2021-05-30 11:29:14, 2021-05-…
## $ ended_at           <dttm> 2021-05-30 12:10:39, 2021-05-30 12:14:09, 2021-05-…
## $ start_station_name <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ start_station_id   <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ end_station_name   <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ end_station_id     <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ start_lat          <dbl> 41.90000, 41.88000, 41.92000, 41.92000, 41.94000, 4…
## $ start_lng          <dbl> -87.63000, -87.62000, -87.70000, -87.70000, -87.690…
## $ end_lat            <dbl> 41.89000, 41.79000, 41.92000, 41.94000, 41.94000, 4…
## $ end_lng            <dbl> -87.61000, -87.58000, -87.70000, -87.69000, -87.700…
## $ member_casual      <chr> "casual", "casual", "casual", "casual", "casual", "…
  • create new columns:
# create new column ride length
df_all_clean$ride_length <- df_all_clean$ended_at - df_all_clean$started_at
## convert from seconds to a period (days, hours, minutes, seconds)
library(lubridate)
## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
df_all_clean$ride_length_period <- seconds_to_period(df_all_clean$ride_length) 
glimpse(df_all_clean)
## Rows: 5,757,551
## Columns: 16
## $ column_label       <chr> "df_05_21", "df_05_21", "df_05_21", "df_05_21", "df…
## $ ride_id            <chr> "C809ED75D6160B2A", "DD59FDCE0ACACAF3", "0AB83CB88C…
## $ rideable_type      <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at         <dttm> 2021-05-30 11:58:15, 2021-05-30 11:29:14, 2021-05-…
## $ ended_at           <dttm> 2021-05-30 12:10:39, 2021-05-30 12:14:09, 2021-05-…
## $ start_station_name <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ start_station_id   <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ end_station_name   <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ end_station_id     <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ start_lat          <dbl> 41.90000, 41.88000, 41.92000, 41.92000, 41.94000, 4…
## $ start_lng          <dbl> -87.63000, -87.62000, -87.70000, -87.70000, -87.690…
## $ end_lat            <dbl> 41.89000, 41.79000, 41.92000, 41.94000, 41.94000, 4…
## $ end_lng            <dbl> -87.61000, -87.58000, -87.70000, -87.69000, -87.700…
## $ member_casual      <chr> "casual", "casual", "casual", "casual", "casual", "…
## $ ride_length        <drtn> 744 secs, 2695 secs, 72 secs, 913 secs, 413 secs, …
## $ ride_length_period <Period> 12M 24S, 44M 55S, 1M 12S, 15M 13S, 6M 53S, 23M 3…
# create day of week column for start date
df_all_clean$day_of_week <- weekdays(df_all_clean$started_at)
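An alternative sketch (not used here): lubridate’s wday() with label = TRUE returns an ordered factor instead of plain character strings, which keeps the weekdays in calendar order when grouping and plotting later:

# sketch (not run): ordered weekday labels instead of character strings
# df_all_clean$day_of_week <- wday(df_all_clean$started_at, label = TRUE, abbr = FALSE, week_start = 1)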
  • Check for irregularities:
## check for data irregularities (invalid values or outliers)
summary(df_all_clean)
##  column_label         ride_id          rideable_type     
##  Length:5757551     Length:5757551     Length:5757551    
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##    started_at                        ended_at                    
##  Min.   :2021-05-01 00:00:11.00   Min.   :2021-05-01 00:03:26.0  
##  1st Qu.:2021-07-07 14:51:40.00   1st Qu.:2021-07-07 15:15:33.0  
##  Median :2021-08-31 17:16:28.00   Median :2021-08-31 17:33:04.0  
##  Mean   :2021-09-18 18:29:49.88   Mean   :2021-09-18 18:50:35.5  
##  3rd Qu.:2021-11-03 20:19:42.00   3rd Qu.:2021-11-03 20:31:13.0  
##  Max.   :2022-04-30 23:59:54.00   Max.   :2022-05-02 00:35:01.0  
##  NA's   :70                       NA's   :78                     
##  start_station_name start_station_id   end_station_name   end_station_id    
##  Length:5757551     Length:5757551     Length:5757551     Length:5757551    
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##    start_lat       start_lng         end_lat         end_lng      
##  Min.   :41.64   Min.   :-87.84   Min.   :41.39   Min.   :-88.97  
##  1st Qu.:41.88   1st Qu.:-87.66   1st Qu.:41.88   1st Qu.:-87.66  
##  Median :41.90   Median :-87.64   Median :41.90   Median :-87.64  
##  Mean   :41.90   Mean   :-87.65   Mean   :41.90   Mean   :-87.65  
##  3rd Qu.:41.93   3rd Qu.:-87.63   3rd Qu.:41.93   3rd Qu.:-87.63  
##  Max.   :45.64   Max.   :-73.80   Max.   :42.17   Max.   :-87.49  
##                                   NA's   :4766    NA's   :4766    
##  member_casual      ride_length       ride_length_period            
##  Length:5757551     Length:5757551    Min.   :-58M -2S              
##  Class :character   Class :difftime   1st Qu.:6M 27S                
##  Mode  :character   Mode  :numeric    Median :11M 31S               
##                                       Mean   :21M 8.4333348617813S  
##                                       3rd Qu.:20M 57S               
##                                       Max.   :38d 20H 24M 9S        
##                                       NA's   :89                    
##  day_of_week       
##  Length:5757551    
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
  • There are outliers in the “ride_length_period” column: Min. :-58M -2S, Max. :38d 20H 24M 9S. I will check them manually by sorting, then remove rows with values that make no sense.
# sort and check
df_sorted <- arrange(df_all_clean, ride_length_period)
#View(df_sorted)
head(df_all_clean)
##   column_label          ride_id rideable_type          started_at
## 1     df_05_21 C809ED75D6160B2A electric_bike 2021-05-30 11:58:15
## 2     df_05_21 DD59FDCE0ACACAF3 electric_bike 2021-05-30 11:29:14
## 3     df_05_21 0AB83CB88C43EFC2 electric_bike 2021-05-30 14:24:01
## 4     df_05_21 7881AC6D39110C60 electric_bike 2021-05-30 14:25:51
## 5     df_05_21 853FA701B4582BAF electric_bike 2021-05-30 18:15:39
## 6     df_05_21 F5E63DFD96B2A737 electric_bike 2021-05-30 11:33:41
##              ended_at start_station_name start_station_id end_station_name
## 1 2021-05-30 12:10:39                                                     
## 2 2021-05-30 12:14:09                                                     
## 3 2021-05-30 14:25:13                                                     
## 4 2021-05-30 14:41:04                                                     
## 5 2021-05-30 18:22:32                                                     
## 6 2021-05-30 11:57:17                                                     
##   end_station_id start_lat start_lng end_lat end_lng member_casual ride_length
## 1                    41.90    -87.63   41.89  -87.61        casual    744 secs
## 2                    41.88    -87.62   41.79  -87.58        casual   2695 secs
## 3                    41.92    -87.70   41.92  -87.70        casual     72 secs
## 4                    41.92    -87.70   41.94  -87.69        casual    913 secs
## 5                    41.94    -87.69   41.94  -87.70        casual    413 secs
## 6                    41.88    -87.63   41.89  -87.62        casual   1416 secs
##   ride_length_period day_of_week
## 1            12M 24S      Sunday
## 2            44M 55S      Sunday
## 3             1M 12S      Sunday
## 4            15M 13S      Sunday
## 5             6M 53S      Sunday
## 6            23M 36S      Sunday
df_all_clean$ride_length_num <- as.numeric(df_all_clean$ride_length)
glimpse(df_all_clean)
## Rows: 5,757,551
## Columns: 18
## $ column_label       <chr> "df_05_21", "df_05_21", "df_05_21", "df_05_21", "df…
## $ ride_id            <chr> "C809ED75D6160B2A", "DD59FDCE0ACACAF3", "0AB83CB88C…
## $ rideable_type      <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at         <dttm> 2021-05-30 11:58:15, 2021-05-30 11:29:14, 2021-05-…
## $ ended_at           <dttm> 2021-05-30 12:10:39, 2021-05-30 12:14:09, 2021-05-…
## $ start_station_name <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ start_station_id   <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ end_station_name   <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ end_station_id     <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ start_lat          <dbl> 41.90000, 41.88000, 41.92000, 41.92000, 41.94000, 4…
## $ start_lng          <dbl> -87.63000, -87.62000, -87.70000, -87.70000, -87.690…
## $ end_lat            <dbl> 41.89000, 41.79000, 41.92000, 41.94000, 41.94000, 4…
## $ end_lng            <dbl> -87.61000, -87.58000, -87.70000, -87.69000, -87.700…
## $ member_casual      <chr> "casual", "casual", "casual", "casual", "casual", "…
## $ ride_length        <drtn> 744 secs, 2695 secs, 72 secs, 913 secs, 413 secs, …
## $ ride_length_period <Period> 12M 24S, 44M 55S, 1M 12S, 15M 13S, 6M 53S, 23M 3…
## $ day_of_week        <chr> "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "…
## $ ride_length_num    <dbl> 744, 2695, 72, 913, 413, 1416, 883, 1075, 157, 1581…
print(sum(df_all_clean$ride_length_period < 0, na.rm = TRUE))
## [1] 140
#View(df_all_clean)
print(sum(df_all_clean$ride_length_period == 0, na.rm = TRUE))
## [1] 512
# I'm wondering why skim() did not show the missing values for started_at and ended_at before; presumably they only became NA during the as.POSIXct conversion, when some strings failed to parse.
# skim(df_all_clean)
  • There are 140 negative values. I should drop them when I work with ride length later on. There are also rides with a duration of 0 seconds (512) or only a few seconds. In a real-world scenario, I would ask the stakeholders what this could mean, whether they know what could have caused it, and whether they want to keep the data. In this fictional case I will only remove the negative rides.
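As an additional check (not acted on here), implausibly long rides could be counted as well; the 24-hour threshold below is just a hypothetical example:

# sketch (not run): count rides longer than 24 hours
# sum(df_all_clean$ride_length_num > 24 * 60 * 60, na.rm = TRUE)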
# remove all rows that have negative or missing values for ride length
df_all_no_na <- subset(df_all_clean, (!is.na(df_all_clean["ride_length"])))
df_all_no_neg <- subset(df_all_no_na, df_all_no_na["ride_length"]>= 0)
glimpse(df_all_no_neg)
## Rows: 5,757,322
## Columns: 18
## $ column_label       <chr> "df_05_21", "df_05_21", "df_05_21", "df_05_21", "df…
## $ ride_id            <chr> "C809ED75D6160B2A", "DD59FDCE0ACACAF3", "0AB83CB88C…
## $ rideable_type      <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at         <dttm> 2021-05-30 11:58:15, 2021-05-30 11:29:14, 2021-05-…
## $ ended_at           <dttm> 2021-05-30 12:10:39, 2021-05-30 12:14:09, 2021-05-…
## $ start_station_name <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ start_station_id   <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ end_station_name   <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ end_station_id     <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ start_lat          <dbl> 41.90000, 41.88000, 41.92000, 41.92000, 41.94000, 4…
## $ start_lng          <dbl> -87.63000, -87.62000, -87.70000, -87.70000, -87.690…
## $ end_lat            <dbl> 41.89000, 41.79000, 41.92000, 41.94000, 41.94000, 4…
## $ end_lng            <dbl> -87.61000, -87.58000, -87.70000, -87.69000, -87.700…
## $ member_casual      <chr> "casual", "casual", "casual", "casual", "casual", "…
## $ ride_length        <drtn> 744 secs, 2695 secs, 72 secs, 913 secs, 413 secs, …
## $ ride_length_period <Period> 12M 24S, 44M 55S, 1M 12S, 15M 13S, 6M 53S, 23M 3…
## $ day_of_week        <chr> "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "…
## $ ride_length_num    <dbl> 744, 2695, 72, 913, 413, 1416, 883, 1075, 157, 1581…
# skim(df_all_no_neg)
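The same filtering can also be written with dplyr; a sketch equivalent to the subset() calls above (not run here):

# sketch (not run): keep only rows with a non-missing, non-negative ride length
# df_all_no_neg <- df_all_clean %>%
#   filter(!is.na(ride_length), ride_length >= 0)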
  • Save clean file:
## save to csv for storage / for later
getwd()
## [1] "/Users/kat/Dev/Google-Data-Analytics-R/bike-sharing-capstone"
#write.csv(df_all_clean, "../output-data/allbikedata-clean.csv", row.names = FALSE)
#write.csv(df_all_no_neg, "../output-data/allbikedata-clean_onlyvalidridelength.csv", row.names = FALSE)
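Since readr is loaded anyway, write_csv() would be a faster alternative for data frames of this size; a sketch using the same paths (also left commented out):

# faster alternative using readr
#write_csv(df_all_clean, "../output-data/allbikedata-clean.csv")
#write_csv(df_all_no_neg, "../output-data/allbikedata-clean_onlyvalidridelength.csv")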