# 3. Visualisation: Washington DC Daily Bike Rental Figures

Date: 2020-10-30

## About the Notebook

This is an extended visualisation exercise, where a dataset is processed and visualised using the different graphical methods outlined in **M248 Unit 1**.

In **section 1**, the dataset and its attributes are introduced.

In **section 2**:

- Import the data from SQLite database
- Process the data
    - **Filtering** on a specific variable
        - Uses Method: `query()`
    - **Recasting** variables
        - Uses Method: `astype()`
    - **Descriptive columns** appended for indexed variables
        - Uses Function: `addDescCol()`

In **section 3**, four visualisations are produced

- Side-by-side bar plot
    - Uses method: `pivot_table()`
- Frequency histogram
- Comparative boxplots
- Scatterplot

Note that as this a visualisation exercise so there is no comment on the graphics.

## 1. About the Data

Source: BikeRentalDaily

> Capital Bikeshare scheme which runs in the Washington DC area (e.g. similar to the London bikeshare scheme)

| Attribute        | Unit       | Description                                                  |
| ---------------- | ---------- | ------------------------------------------------------------ |
| Dteday           | Ordinal    | Date (YYYY-MM-DD)                                            |
| Season           | Ordinal    | Season (1: Winter, 2: Spring, 3: Summer, 4: Fall)            |
| yr               | Ordinal    | Year (0: 2011, 1: 2012)                                      |
| Mnth             | Ordinal    | Month (1 - 12)                                               |
| Holiday          | Boolean    | Whether day is holiday or not (0, 1)                         |
| Weekday          | Ordinal    | Day of the week (0 - 6)                                      |
| Workingday       | Boolean    | If day is neither weekend nor holiday is 1, otherwise is 0   |
| Weathersit       | Nominal    | Weather situation (1: Clear, Few clouds, Partly cloudy, Partly cloudy; 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist; 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds, 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog)) |
| Temp Normalized  | Continuous | Temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-8, t_max=+39 (only in hourly scale) |
| Atemp Normalized | Continuous | Atemp Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-16, t_max=+50 (only in hourly scale) |
| Hom Normalized   | Continuous | Normalized humidity. The values are divided to 100 (max)     |
| Windspeed        | Continuous | Normalized wind speed. The values are divided to 67 (max)    |
| Casual           | Discrete   | Count of casual users                                        |
| Registered       | Discrete   | Count of registered users                                    |
| Daily count      | Discrete   | Count of total rental bikes including both casual and registered |

## 2. Importing and Processing the Data

In [36]:
library(ggplot2)

In [2]:
library(RSQLite)

In [3]:
library(plyr) # used for mapvalue

In [41]:
library(dplyr)


Attaching package: 'dplyr'


The following objects are masked from 'package:plyr':

    arrange, count, desc, failwith, id, mutate, rename, summarise,
    summarize


The following objects are masked from 'package:stats':

    filter, lag


The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union




### Importing

We first create a connection to the database, return the table as an object, and then send the table to a DataFrame.

In [5]:
# connect to the db
con <- dbConnect(RSQLite::SQLite(), "./data/sets.db3")

In [6]:
# return the table as a data.frame
df_bike_daily <- dbReadTable(con, "BikeRentalDailyDC")

### Filtering the DataFrame

In [7]:
# filter for f(x) < 0
df_filtered <- subset(df_bike_daily, yr == 0)

### Recasting attributes

The table has been imported without the proper data types, as expected.

`type.convert()` converts chars $\to$ numeric/bool.

`as.Date()` coverts `String` dates to `date`.

 See [strptime](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/strptime) for date formats.

In [8]:
df_filtered$instant <- type.convert(df_filtered$instant)

In [9]:
df_filtered$season <- type.convert(df_filtered$season)

In [10]:
df_filtered$yr <- type.convert(df_filtered$yr)

In [11]:
df_filtered$mnth <- type.convert(df_filtered$mnth)

In [12]:
df_filtered$weekday <- type.convert(df_filtered$weekday)

In [13]:
df_filtered$holiday <- type.convert(df_filtered$holiday)

In [14]:
df_filtered$casual <- type.convert(df_filtered$casual)

In [15]:
df_filtered$registered <- type.convert(df_filtered$registered)

In [16]:
df_filtered$dailycount <- type.convert(df_filtered$dailycount)

In [17]:
df_filtered$temp <- type.convert(df_filtered$temp)

In [18]:
df_filtered$atemp <- type.convert(df_filtered$atemp)

In [19]:
df_filtered$hum <- type.convert(df_filtered$hum)

In [20]:
df_filtered$windspeed <- type.convert(df_filtered$windspeed)

In [21]:
df_filtered$dteday <- as.Date(df_filtered$dteday, "%d/%m/%y")

### Adding descriptive columns

The data includes some index columns, so we will add a companion description column.

Source: [mapvalues()](https://www.rdocumentation.org/packages/plyr/versions/1.8.6/topics/mapvalues)

In [22]:
df_filtered$desc_yr <- mapvalues(x = df_filtered$yr, from = c(0), to = c(2011))

In [23]:
# define lists for season
oldSeason <- c(1:4)
newSeason <- c("Winter", "Spring", "Summer", "Fall")

# replace values
df_filtered$desc_season <- mapvalues(x = df_filtered$season, from = oldSeason, to = newSeason)

In [24]:
# define lists for mnth
oldMnth <- c(1:12)
newMnth <- c("JAN", "FEB", "MAR", "APR",
             "MAY", "JUN", "JUL", "AUG",
             "SEP", "OCT", "NOV", "DEC")

# replace values
df_filtered$desc_mnth <- mapvalues(x = df_filtered$mnth, from = oldMnth, to = newMnth)

In [25]:
# define lists for workingday
oldWeekday <- c(0:6)
newWeekday <- c("Sun", "Mon", "Tues", "Wed",
                "Thurs", "Fri", "Sat")

# replace values
df_filtered$desc_weekday <- mapvalues(x = df_filtered$weekday, from = oldWeekday, to = newWeekday)

In [26]:
# define lists for workingday
oldWorkingDay <- c(0, 1)
newWorkingDay <- c("Non-working day", "Working day")

# replace values
df_filtered$desc_workingday <- mapvalues(x = df_filtered$workingday, from = oldWorkingDay, to = newWorkingDay)

In [27]:
# define lists for holiday
oldHoliday <- c(0, 1)
newHoliday <- c("Non-holiday", "Holiday")

# replace values
df_filtered$desc_holiday <- mapvalues(x = df_filtered$holiday, from = oldHoliday, to = newHoliday)

## 3. Visualising

### Average daily users by month

In [47]:
df_grouped <- group_by(.data = df_filtered, 'mnth')

In [48]:
df_grouped

instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,...,casual,registered,dailycount,desc_yr,desc_season,desc_mnth,desc_weekday,desc_workingday,desc_holiday,"""mnth"""
<int>,<date>,<int>,<int>,<int>,<int>,<int>,<chr>,<chr>,<dbl>,...,<int>,<int>,<int>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,2020-01-01,1,0,1,0,6,0,2,0.3441670,...,331,654,985,2011,Winter,JAN,Sat,Non-working day,Non-holiday,mnth
2,2020-01-02,1,0,1,0,0,0,2,0.3634780,...,131,670,801,2011,Winter,JAN,Sun,Non-working day,Non-holiday,mnth
3,2020-01-03,1,0,1,0,1,1,1,0.1963640,...,120,1229,1349,2011,Winter,JAN,Mon,Working day,Non-holiday,mnth
4,2020-01-04,1,0,1,0,2,1,1,0.2000000,...,108,1454,1562,2011,Winter,JAN,Tues,Working day,Non-holiday,mnth
5,2020-01-05,1,0,1,0,3,1,1,0.2269570,...,82,1518,1600,2011,Winter,JAN,Wed,Working day,Non-holiday,mnth
6,2020-01-06,1,0,1,0,4,1,1,0.2043480,...,88,1518,1606,2011,Winter,JAN,Thurs,Working day,Non-holiday,mnth
7,2020-01-07,1,0,1,0,5,1,2,0.1965220,...,148,1362,1510,2011,Winter,JAN,Fri,Working day,Non-holiday,mnth
8,2020-01-08,1,0,1,0,6,0,2,0.1650000,...,68,891,959,2011,Winter,JAN,Sat,Non-working day,Non-holiday,mnth
9,2020-01-09,1,0,1,0,0,0,1,0.1383330,...,54,768,822,2011,Winter,JAN,Sun,Non-working day,Non-holiday,mnth
10,2020-01-10,1,0,1,0,1,1,1,0.1508330,...,41,1280,1321,2011,Winter,JAN,Mon,Working day,Non-holiday,mnth


In [33]:
fig1 = temp_pivot.plot(kind='bar')
fig1.set(title='Average daily users by month',
         xlabel='Month',
         ylabel='Average users')

ERROR: Error in temp_pivot.plot(kind = "bar"): could not find function "temp_pivot.plot"


### Histogram of dailycount

In [None]:
fig2 = sns.histplot(data=df_filtered,
                    x='dailycount',
                    binwidth=500)

### Comparison of dailycount by season

In [None]:
fig4 = sns.boxplot(data=df_filtered,
                   x='dailycount',
                   y='desc_season')

### Scatterplot of apparent temperate against casual user daily count

In [None]:
fig6 = sns.scatterplot(data=df_filtered,
                       x='atemp',
                       y='casual',
                       hue='desc_workingday')