# Bike Sharing Data Cleanup Lab

### Introduction

In this lesson, we'll use our knowledge of working with data to analyze the Seoul bike sharing system.

The Seoul bike sharing system has over 800 stations across Seoul, and many people use it to get around the city. We'll explore this data to better understand how these bikes are used, which can serve as a case study for understanding bike sharing systems more broadly.

> **Note**: Please do not use list comprehension at this point -- we will practice that in a later lesson.

### Loading our data

Let's read in the data.

In [1]:
import pandas as pd
url = "https://raw.githubusercontent.com/python-fundamentals-jigsaw/review-datatypes/main/SeoulBikeData.csv"
df = pd.read_csv(url, encoding='unicode_escape')
df[:2]

Unnamed: 0,Date,Rented Bike Count,Hour,Temperature(°C),Humidity(%),Wind speed (m/s),Visibility (10m),Dew point temperature(°C),Solar Radiation (MJ/m2),Rainfall(mm),Snowfall (cm),Seasons,Holiday,Functioning Day
0,01/12/2017,254,0,-5.2,37,2.2,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Yes
1,01/12/2017,204,1,-5.5,38,0.8,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Yes


And then we can coerce this to a list of dictionaries.

In [6]:
bike_hours = df.to_dict('records')
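
To see what `to_dict('records')` produces, here is a small sketch on a hand-built DataFrame. The names `sample_df` and `sample_records` are assumptions for illustration, and the values are copied from the first two rows shown above.

```python
import pandas as pd

# A tiny stand-in for the real DataFrame: two rows, three columns.
sample_df = pd.DataFrame({
    'Date': ['01/12/2017', '01/12/2017'],
    'Hour': [0, 1],
    'Rented Bike Count': [254, 204],
})

# orient='records' returns a list with one dictionary per row,
# where each dictionary is keyed by column name.
sample_records = sample_df.to_dict('records')

print(type(sample_records))
print(sample_records[0])
```

Each row becomes its own dictionary, which is why we can later index into `bike_hours` and look up values by column name.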

### Exploring our data

Ok, let's start off by exploring our dataset.  Select the first element from the list of dictionaries below.

In [15]:
first_record = bike_hours[0]

print(first_record)

{'Date': '01/12/2017', 'Rented Bike Count': 254, 'Hour': 0, 'Temperature(°C)': -5.2, 'Humidity(%)': 37, 'Wind speed (m/s)': 2.2, 'Visibility (10m)': 2000, 'Dew point temperature(°C)': -17.6, 'Solar Radiation (MJ/m2)': 0.0, 'Rainfall(mm)': 0.0, 'Snowfall (cm)': 0.0, 'Seasons': 'Winter', 'Holiday': 'No Holiday', 'Functioning Day': 'Yes'}


And then to make this easier, let's just display the keys of our `first_record`.

In [16]:
record_keys = first_record.keys()

record_keys

# dict_keys(['Date', 'Rented Bike Count', 'Hour', 'Temperature(°C)', 'Humidity(%)', 'Wind speed (m/s)', 'Visibility (10m)',
# 'Dew point temperature(°C)', 'Solar Radiation (MJ/m2)', 'Rainfall(mm)', 'Snowfall (cm)',
# 'Seasons', 'Holiday', 'Functioning Day'])

dict_keys(['Date', 'Rented Bike Count', 'Hour', 'Temperature(°C)', 'Humidity(%)', 'Wind speed (m/s)', 'Visibility (10m)', 'Dew point temperature(°C)', 'Solar Radiation (MJ/m2)', 'Rainfall(mm)', 'Snowfall (cm)', 'Seasons', 'Holiday', 'Functioning Day'])

Ok so we can see that we have keys of `Date`, `Rented Bike Count`, and `Hour` among others.  At this point we can identify the **grain** of the data.

By the grain of the data, we mean what each record represents.  In this case, each record indicates the number of bikes rented in a given hour.  The other attributes, like wind speed and rainfall, describe the weather conditions during that hour.
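
One way to test a proposed grain is to check that no two records share the same `Date` and `Hour`. The sketch below assumes that the grain is one record per date and hour, and it runs on a hypothetical `sample_hours` list (the third record is invented for illustration) rather than on the full dataset.

```python
# Hypothetical sample records that share keys with the real `bike_hours` list.
sample_hours = [
    {'Date': '01/12/2017', 'Hour': 0, 'Rented Bike Count': 254},
    {'Date': '01/12/2017', 'Hour': 1, 'Rented Bike Count': 204},
    {'Date': '02/12/2017', 'Hour': 0, 'Rented Bike Count': 100},
]

seen_pairs = set()       # every (Date, Hour) pair seen so far
duplicate_pairs = []     # pairs that show up more than once

for record in sample_hours:
    pair = (record['Date'], record['Hour'])
    if pair in seen_pairs:
        duplicate_pairs.append(pair)
    seen_pairs.add(pair)

# If there are no duplicates, the data is consistent with one record per date and hour.
grain_holds = len(duplicate_pairs) == 0
print(grain_holds)
```

Swapping `sample_hours` for `bike_hours` would run the same check on the real data.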

* A discrepancy?

It looks like there may be some overlap in our keys.  Notice that the last two attributes are `'Holiday'` and `'Functioning Day'`. Do these represent the same thing?

Begin by creating a list of all of the values for `Holiday`.  And then create a list for all of the values for `Functioning Day`.  Slice the first five records.

> Remember: Please do not use list comprehension at this point -- we will practice that in a later lesson.

In [17]:
# declare a list to hold the Holiday values
holidays = []

# iterate through all of the hourly records
for bike in bike_hours:
  # append the record's Holiday value to the holidays list
  holidays.append(bike['Holiday'])

holidays[:5]
# ['No Holiday', 'No Holiday', 'No Holiday', 'No Holiday', 'No Holiday']

['No Holiday', 'No Holiday', 'No Holiday', 'No Holiday', 'No Holiday']

And then let's do the same thing for `'Functioning Day'`.

In [19]:
functioning_days = []

for bike in bike_hours:
  functioning_days.append(bike['Functioning Day'])

functioning_days[:5]

# ['Yes', 'Yes', 'Yes', 'Yes', 'Yes']

['Yes', 'Yes', 'Yes', 'Yes', 'Yes']

Ok, so it looks like, perhaps, when the value of `Holiday` is `No Holiday`, the value of `Functioning Day` is `Yes`.

Let's do a little more digging --  find the unique values relating to `Holiday`.

In [25]:
set(holidays)

# {'Holiday', 'No Holiday'}

{'Holiday', 'No Holiday'}

And find all of the unique values relating to `Functioning Day`.

In [26]:
# {'No', 'Yes'}
set(functioning_days)

{'No', 'Yes'}

Ok, so they both have two values.  And remember, our thought is that when `Holiday` is `No Holiday`, the value of `Functioning Day` is always `Yes`.

In [27]:
holidays[:4]
# ['No Holiday', 'No Holiday', 'No Holiday', 'No Holiday']
functioning_days[:4]
# ['Yes', 'Yes', 'Yes', 'Yes']

['Yes', 'Yes', 'Yes', 'Yes']

So below, let's check this by creating a list of all of the records where there is a mismatch.  That is, where we have:
* `No Holiday` and `No`, or
* `Holiday` and `Yes`.

In [28]:
mismatches = []

for bike in bike_hours:
  if (bike['Holiday'] == 'Holiday' and bike['Functioning Day'] == 'Yes') or (bike['Holiday'] == 'No Holiday' and bike['Functioning Day'] == 'No'):
    mismatches.append(bike)



mismatches[:1]

# [{'Date': '22/12/2017', 'Rented Bike Count': 196, 'Hour': 0, 'Temperature(°C)': -1.7, 'Humidity(%)': 79, 'Wind speed (m/s)': 0.5, 'Visibility (10m)': 794, 'Dew point temperature(°C)': -4.8, 'Solar Radiation (MJ/m2)': 0.0, 'Rainfall(mm)': 0.0, 'Snowfall (cm)': 0.8, 'Seasons': 'Winter', 'Holiday': 'Holiday', 'Functioning Day': 'Yes'}]

[{'Date': '22/12/2017',
  'Rented Bike Count': 196,
  'Hour': 0,
  'Temperature(°C)': -1.7,
  'Humidity(%)': 79,
  'Wind speed (m/s)': 0.5,
  'Visibility (10m)': 794,
  'Dew point temperature(°C)': -4.8,
  'Solar Radiation (MJ/m2)': 0.0,
  'Rainfall(mm)': 0.0,
  'Snowfall (cm)': 0.8,
  'Seasons': 'Winter',
  'Holiday': 'Holiday',
  'Functioning Day': 'Yes'}]

Ok, so we can see that we do have some mismatches.  That is, `Holiday` and `Functioning Day` do not contain the same information.
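
To see how the two attributes relate overall, we could also tally every combination of `Holiday` and `Functioning Day`. This is a minimal sketch using `collections.Counter` from the standard library. The `sample_hours` records are hypothetical, so the counts say nothing about the real dataset.

```python
from collections import Counter

# Hypothetical records holding only the two attributes we want to compare.
sample_hours = [
    {'Holiday': 'No Holiday', 'Functioning Day': 'Yes'},
    {'Holiday': 'No Holiday', 'Functioning Day': 'Yes'},
    {'Holiday': 'Holiday', 'Functioning Day': 'Yes'},
    {'Holiday': 'No Holiday', 'Functioning Day': 'No'},
]

# Count how often each (Holiday, Functioning Day) pair occurs.
combination_counts = Counter()
for record in sample_hours:
    combination = (record['Holiday'], record['Functioning Day'])
    combination_counts[combination] += 1

for combination, count in combination_counts.items():
    print(combination, count)
```

If the two attributes carried the same information, only two of the four possible combinations would ever appear.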

### Checking Representativeness

Now that we've used the keys to identify the grain of the data, the next step is to check the completeness of the dataset and the time range that it covers.

* Completeness

A good first step in checking the completeness of the data is to make sure that we have records for all seasons.  Assign a set representing the unique seasons to the variable below.

In [29]:
unique_seasons = set()
for bike in bike_hours:
  unique_seasons.add(bike['Seasons'])
unique_seasons

{'Autumn', 'Spring', 'Summer', 'Winter'}
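
Knowing that a season appears does not tell us how many records it has. As a sketch of a slightly stronger check, we could count the records per season with a plain dictionary. The `sample_hours` list below is hypothetical and much smaller than the real data.

```python
# Hypothetical records holding only the Seasons attribute.
sample_hours = [
    {'Seasons': 'Winter'},
    {'Seasons': 'Winter'},
    {'Seasons': 'Spring'},
    {'Seasons': 'Summer'},
    {'Seasons': 'Autumn'},
]

season_counts = {}
for record in sample_hours:
    season = record['Seasons']
    # Start from 0 the first time we see a season, then add one.
    season_counts[season] = season_counts.get(season, 0) + 1

print(season_counts)
```

A season with far fewer records than the others would be a hint that part of the year is missing.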

Ok, so we do have records from each season.  Let's also find the range of dates that our data spans.  First return the earliest date in the dataset, and then return the latest date in the dataset.  But before we do, let's take a look at the data.

In [30]:
bike_hours[1]['Date']

'01/12/2017'

Ok, so this is a little tricky, because each date is currently listed as day/month/year. So if we try to sort the data, or use a function like `max`, it will return results in the wrong order.

Let's see this below.

In [31]:
max(['03/12/2016', '01/12/2017'])

'03/12/2016'

This is because the strings are compared alphabetically, character by character, so the string beginning with `03` is larger than the string beginning with `01`, regardless of the year.  So let's change our data to be `year/month/day`.

First perform this for one record.

In [36]:
sample_date = '03/12/2016'

date = sample_date.split('/')

formatted_date = '/'.join(date[::-1])
formatted_date
# '2016/12/03'

'2016/12/03'
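
As an alternative to rearranging the string, the dates could be parsed into real date objects, which compare chronologically. This sketch swaps in `datetime.strptime` from the standard library. It is not the approach this lab follows, and the names `sample_dates` and `parsed_dates` are assumptions for illustration.

```python
from datetime import datetime

# The same two day/month/year strings used in the max() example above.
sample_dates = ['03/12/2016', '01/12/2017']

parsed_dates = []
for date_string in sample_dates:
    # '%d/%m/%Y' tells strptime to read day, then month, then four-digit year.
    parsed_dates.append(datetime.strptime(date_string, '%d/%m/%Y'))

# datetime objects compare by actual date, not by character order.
latest = max(parsed_dates)
print(latest.strftime('%Y/%m/%d'))  # prints 2017/12/01
```

Here `max` picks the 2017 date, which is the result we wanted from the string comparison.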

And now create a list that has all of the dates in this format.

In [37]:
formatted_dates = []

for bike in bike_hours:
  date = bike['Date'].split('/')
  formatted_dates.append('/'.join(date[::-1]))


formatted_dates[:3]

['2017/12/01', '2017/12/01', '2017/12/01']

And now from here, find the max date.

In [38]:
max_date = max(formatted_dates)
max_date

'2018/11/30'

And the min date.

In [39]:
min_date = min(formatted_dates)
min_date

'2017/12/01'
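
With the earliest and latest dates in hand, we can sketch one more completeness check: how many days the range covers, and how many hourly records that would imply. The code assumes one record per hour with no gaps, which is an assumption and not something we have verified. The two date strings are copied from the outputs above.

```python
from datetime import datetime

min_date = '2017/12/01'
max_date = '2018/11/30'

first_day = datetime.strptime(min_date, '%Y/%m/%d')
last_day = datetime.strptime(max_date, '%Y/%m/%d')

# Add 1 so that both the first and the last day are counted.
days_covered = (last_day - first_day).days + 1

# Assumption: exactly 24 records per day if no hours are missing.
expected_hourly_records = days_covered * 24

print(days_covered, expected_hourly_records)  # prints 365 8760
```

Comparing `expected_hourly_records` with `len(bike_hours)` would show whether any hours are missing from the range.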

### Summary

In this lesson, we moved through some of the steps in exploring and checking a dataset.  

We began by identifying *the grain* of the data, to determine what each record represents.  We then moved on to seeing whether some of our attributes represented the same information.  And finally, we began to check our dataset for completeness (by exploring the seasons), as well as the recency and range of the data.