## Exploration of the dataset
The TFL website hosts the **Santander cycling data**, organised into several directories.
In this notebook, we read a single journey extract (10 to 16 February 2021) as an example of the cycling journey data.

We also read the **docking stations data**, which comes from a source outside the main TFL website.
The stations data lists the departure and destination stations referenced by each cycling journey.

The third dataset is the **historical weather data** for London for 2021, with one record per day and 36 weather attributes. It was originally retrieved from www.visualcrossing.com and then stored in Google Drive for easy access.

In [1]:
# import packages
import pandas as pd
import json

### Cycling journey data

In [2]:
# download an example file
!wget https://cycling.data.tfl.gov.uk/usage-stats/252JourneyDataExtract10Feb2021-16Feb2021.csv -O journey10Feb2021-16Feb2021.csv

--2023-06-25 14:14:01--  https://cycling.data.tfl.gov.uk/usage-stats/252JourneyDataExtract10Feb2021-16Feb2021.csv
Resolving cycling.data.tfl.gov.uk (cycling.data.tfl.gov.uk)... 104.16.101.13, 104.16.100.13
Connecting to cycling.data.tfl.gov.uk (cycling.data.tfl.gov.uk)|104.16.101.13|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11036049 (11M) [text/csv]
Saving to: 'journey10Feb2021-16Feb2021.csv'


2023-06-25 14:14:01 (42.8 MB/s) - 'journey10Feb2021-16Feb2021.csv' saved [11036049/11036049]



In [3]:
# load the journey extract into a DataFrame
df = pd.read_csv('journey10Feb2021-16Feb2021.csv')
df.head()

,Rental Id,Duration,Bike Id,End Date,EndStation Id,EndStation Name,Start Date,StartStation Id,StartStation Name
0,105401285,3360,17497,15/02/2021 20:55,785,"Aquatic Centre, Queen Elizabeth Olympic Park",15/02/2021 19:59,785,"Aquatic Centre, Queen Elizabeth Olympic Park"
1,105322226,1020,4677,10/02/2021 08:03,194,"Hop Exchange, The Borough",10/02/2021 07:46,14,"Belgrove Street , King's Cross"
2,105351846,480,18046,12/02/2021 15:26,27,"Bouverie Street, Temple",12/02/2021 15:18,196,"Union Street, The Borough"
3,105324229,180,19785,10/02/2021 10:46,195,"Milroy Walk, South Bank",10/02/2021 10:43,196,"Union Street, The Borough"
4,105350696,720,14243,12/02/2021 14:17,274,"Warwick Road, Olympia",12/02/2021 14:05,219,"Bramham Gardens, Earl's Court"
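
The `Start Date` and `End Date` columns are day-first strings, and `Duration` looks like it is expressed in seconds. The sketch below parses the two date columns with an explicit format and checks that reading of `Duration`. It rebuilds two rows from the output above rather than reading the full file, and the name `sample` is introduced only for this illustration.

```python
import pandas as pd

# Two rows copied from the df.head() output above.
sample = pd.DataFrame({
    "Rental Id": [105322226, 105351846],
    "Duration": [1020, 480],
    "Start Date": ["10/02/2021 07:46", "12/02/2021 15:18"],
    "End Date": ["10/02/2021 08:03", "12/02/2021 15:26"],
})

# The strings are day-first, so an explicit format avoids any
# day/month ambiguity during parsing.
for col in ["Start Date", "End Date"]:
    sample[col] = pd.to_datetime(sample[col], format="%d/%m/%Y %H:%M")

# Compare the recorded Duration with End Date - Start Date in seconds.
elapsed = (sample["End Date"] - sample["Start Date"]).dt.total_seconds()
duration_matches = bool((elapsed == sample["Duration"]).all())
print(duration_matches)
```

On the full DataFrame the same loop could be applied to `df` directly; whether every row satisfies the duration check is something to verify rather than assume.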


In [4]:
df.shape

(89405, 9)

In [5]:
# infer a SQL table schema for the journey data
journey_table = pd.io.sql.get_schema(frame=df, name='journey_staging', keys='Rental Id')
print(journey_table)

CREATE TABLE "journey_staging" (
"Rental Id" INTEGER,
  "Duration" INTEGER,
  "Bike Id" INTEGER,
  "End Date" TEXT,
  "EndStation Id" INTEGER,
  "EndStation Name" TEXT,
  "Start Date" TEXT,
  "StartStation Id" INTEGER,
  "StartStation Name" TEXT,
  CONSTRAINT journey_staging_pk PRIMARY KEY ("Rental Id")
)
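
In the schema above, both date columns are inferred as `TEXT` because they are still plain strings. As a minimal sketch, assuming the day-first format seen in the data, converting them to datetimes before calling `get_schema` should change the inferred type. The frame `sample` is a two-row stand-in for `df`, and the resulting type name depends on pandas' default fallback when no connection is passed.

```python
import pandas as pd

# Two rows copied from the df.head() output earlier in this section.
sample = pd.DataFrame({
    "Rental Id": [105322226, 105351846],
    "Duration": [1020, 480],
    "Start Date": ["10/02/2021 07:46", "12/02/2021 15:18"],
    "End Date": ["10/02/2021 08:03", "12/02/2021 15:26"],
})

# Convert the date strings to datetime64 so the schema inference
# no longer sees them as generic text.
for col in ["Start Date", "End Date"]:
    sample[col] = pd.to_datetime(sample[col], format="%d/%m/%Y %H:%M")

# Same call as in the cell above, applied to the converted frame.
ddl = pd.io.sql.get_schema(frame=sample, name="journey_staging", keys="Rental Id")
print(ddl)
```

A target database reached through a real connection (the `con` argument) may map datetimes to a different type name, so the printed DDL is best treated as a starting point.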


### Docking stations

In [6]:
!wget https://www.whatdotheyknow.com/request/664717/response/1572474/attach/3/Cycle%20hire%20docking%20stations.csv.txt -O stations.csv

--2023-06-25 14:14:02--  https://www.whatdotheyknow.com/request/664717/response/1572474/attach/3/Cycle%20hire%20docking%20stations.csv.txt
Resolving www.whatdotheyknow.com (www.whatdotheyknow.com)... 93.93.128.121, 93.93.130.118
Connecting to www.whatdotheyknow.com (www.whatdotheyknow.com)|93.93.128.121|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: 'stations.csv'

stations.csv            [ <=>                ]  57.09K  --.-KB/s    in 0.03s   

2023-06-25 14:14:02 (1.89 MB/s) - 'stations.csv' saved [58461]



In [7]:
df_stations = pd.read_csv('stations.csv')
df_stations.head()

,Station.Id,StationName,longitude,latitude,Easting,Northing
0,1,"River Street, Clerkenwell",-0.109971,51.5292,531202.52,182832.02
1,2,"Phillimore Gardens, Kensington",-0.197574,51.4996,525207.07,179391.86
2,3,"Christopher Street, Liverpool Street",-0.084606,51.5213,532984.81,182001.53
3,4,"St. Chad's Street, King's Cross",-0.120974,51.5301,530436.76,182911.99
4,5,"Sedding Street, Sloane Square",-0.156876,51.4931,528051.649,178742.097
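
Because the stations file comes from a different source than the journeys, it is worth checking that every station id used in a journey actually appears in it. The sketch below shows one way to do that with a left merge and `indicator=True`. The stations are the first three rows shown above; the `journeys` frame and the id `999` are invented for illustration and are not taken from the data.

```python
import pandas as pd

# First three stations from the df_stations.head() output above.
stations = pd.DataFrame({
    "Station.Id": [1, 2, 3],
    "StationName": [
        "River Street, Clerkenwell",
        "Phillimore Gardens, Kensington",
        "Christopher Street, Liverpool Street",
    ],
})

# Hypothetical journeys, invented for illustration only; 999 is a made-up id.
journeys = pd.DataFrame({
    "Rental Id": [1, 2, 3],
    "StartStation Id": [1, 3, 999],
})

# A left merge keeps every journey; the _merge column records whether
# a matching station row was found.
merged = journeys.merge(
    stations,
    how="left",
    left_on="StartStation Id",
    right_on="Station.Id",
    indicator=True,
)

missing_ids = merged.loc[merged["_merge"] == "left_only", "StartStation Id"].tolist()
print(missing_ids)
```

The same check would need to be repeated for `EndStation Id` when run against the real `df` and `df_stations`.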


In [8]:
# infer a SQL table schema for the stations data
stations_table = pd.io.sql.get_schema(frame=df_stations, name='stations_staging', keys='Station.Id')
print(stations_table)

CREATE TABLE "stations_staging" (
"Station.Id" INTEGER,
  "StationName" TEXT,
  "longitude" REAL,
  "latitude" REAL,
  "Easting" REAL,
  "Northing" REAL,
  CONSTRAINT stations_staging_pk PRIMARY KEY ("Station.Id")
)
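
Column names such as `Station.Id` or `Rental Id` must be quoted in every SQL statement because of the dot and the space. One option, sketched here under the assumption that snake_case names are acceptable in the staging tables, is to normalise the names before inferring the schema. The helper `to_snake_case` is hypothetical and not part of pandas.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Hypothetical helper: 'Station.Id' -> 'station_id'."""
    # Insert an underscore at lower-to-upper boundaries ("StationName").
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    # Collapse dots, spaces and other non-alphanumerics into one underscore.
    name = re.sub(r"[^0-9A-Za-z]+", "_", name)
    return name.strip("_").lower()


# Empty frame with the column names seen in the stations data.
df_stations = pd.DataFrame(
    columns=["Station.Id", "StationName", "longitude", "latitude", "Easting", "Northing"]
)

# DataFrame.rename accepts a callable that is applied to each column name.
renamed = df_stations.rename(columns=to_snake_case)
print(list(renamed.columns))
```

The same helper would turn `EndStation Id` into `end_station_id`, so the journey table could follow the same convention.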


### Historical weather data in 2021

In [9]:
!wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=13LWAH93xxEvOukCnPhrfXH7rZZq_-mss' -O weather-2021.json

--2023-06-25 14:14:02--  https://docs.google.com/uc?export=download&id=13LWAH93xxEvOukCnPhrfXH7rZZq_-mss
Resolving docs.google.com (docs.google.com)... 172.217.169.46
Connecting to docs.google.com (docs.google.com)|172.217.169.46|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-0s-9c-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/04gar5sb4pd3au0o2mai4dmi5af7n9oe/1687698825000/11576894146992100236/*/13LWAH93xxEvOukCnPhrfXH7rZZq_-mss?e=download&uuid=0c00f0ad-ece8-4256-9323-2ae486cc3106 [following]
--2023-06-25 14:14:03--  https://doc-0s-9c-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/04gar5sb4pd3au0o2mai4dmi5af7n9oe/1687698825000/11576894146992100236/*/13LWAH93xxEvOukCnPhrfXH7rZZq_-mss?e=download&uuid=0c00f0ad-ece8-4256-9323-2ae486cc3106
Resolving doc-0s-9c-docs.googleusercontent.com (doc-0s-9c-docs.googleusercontent.com)... 142.250.187.225
Connecting to doc-0s-9c-docs.googleusercontent.

In [10]:
!head -n 20 weather-2021.json

{
  "latitude" : 51.5064,
  "longitude" : -0.12721,
  "resolvedAddress" : "London, England, United Kingdom",
  "address" : "London,UK",
  "timezone" : "Europe/London",
  "tzoffset" : 0.0,
  "name" : "London,UK",
  "days" : [ {
    "datetime" : "2021-01-01",
    "datetimeEpoch" : 1609459200,
    "tempmax" : 5.0,
    "tempmin" : -0.5,
    "temp" : 2.1,
    "feelslikemax" : 2.9,
    "feelslikemin" : -3.6,
    "feelslike" : -0.2,
    "dew" : 0.8,
    "humidity" : 91.03,
    "precip" : 0.22,


In [11]:
# keep only the per-day records stored under the "days" key
with open('weather-2021.json', 'r') as f:
    weather = json.load(f)

# weather["days"] is a list of dicts, one per day
df_weather = pd.DataFrame(weather["days"])
df_weather.head()

,datetime,datetimeEpoch,tempmax,tempmin,temp,feelslikemax,feelslikemin,feelslike,dew,humidity,...,sunset,sunsetEpoch,moonphase,conditions,description,icon,stations,source,tzoffset,severerisk
0,2021-01-01,1609459200,5.0,-0.5,2.1,2.9,-3.6,-0.2,0.8,91.03,...,16:02:22,1609516942,0.53,Rain,Clear conditions throughout the day with late ...,rain,"[03769099999, 03680099999, D5621, 03672099999,...",obs,,
1,2021-01-02,1609545600,5.1,1.5,3.8,3.1,-1.5,1.5,1.0,82.51,...,16:03:28,1609603408,0.56,Rain,Clear conditions throughout the day with rain.,rain,"[03680099999, D5621, 03672099999, 03781099999,...",obs,,
2,2021-01-03,1609632000,6.0,1.1,3.8,5.6,-2.5,0.9,1.7,86.02,...,16:04:36,1609689876,0.6,Rain,Clear conditions throughout the day with rain.,rain,"[03680099999, D5621, 03672099999, 03781099999,...",obs,,
3,2021-01-04,1609718400,5.6,3.5,4.3,4.1,-0.7,0.5,1.4,81.43,...,16:05:46,1609776346,0.65,Rain,Clear conditions throughout the day with rain.,rain,"[03680099999, D5621, 03672099999, 03781099999,...",obs,,
4,2021-01-05,1609804800,4.6,2.5,3.7,0.8,-1.8,-0.4,1.0,82.39,...,16:06:59,1609862819,0.7,Rain,Clear conditions throughout the day with rain.,rain,"[03680099999, D5621, 03672099999, 03781099999,...",obs,,
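
Extracting only `weather["days"]` drops the top-level fields such as `latitude`, `longitude` and `timezone`. If those are wanted alongside each day, `pd.json_normalize` can flatten the records and repeat selected top-level fields as columns. The sketch below uses a trimmed copy of the JSON structure shown earlier, keeping a few fields and the first two days; the name `df_days` is introduced here for illustration.

```python
import pandas as pd

# Trimmed copy of the structure printed by `head` above: only a few
# per-day fields and the first two days are kept.
weather = {
    "latitude": 51.5064,
    "longitude": -0.12721,
    "resolvedAddress": "London, England, United Kingdom",
    "timezone": "Europe/London",
    "days": [
        {"datetime": "2021-01-01", "tempmax": 5.0, "tempmin": -0.5, "temp": 2.1},
        {"datetime": "2021-01-02", "tempmax": 5.1, "tempmin": 1.5, "temp": 3.8},
    ],
}

# record_path selects the list of day records; meta names the top-level
# fields to repeat on every row.
df_days = pd.json_normalize(
    weather,
    record_path="days",
    meta=["latitude", "longitude", "timezone"],
)
df_days["datetime"] = pd.to_datetime(df_days["datetime"])
print(df_days)
```

In the real file `tzoffset` appears both at the top level and inside each day, so adding it to `meta` would likely need a `meta_prefix` to avoid a name clash.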


In [12]:
df_weather.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 396 entries, 0 to 395
Data columns (total 37 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   datetime        396 non-null    object 
 1   datetimeEpoch   396 non-null    int64  
 2   tempmax         396 non-null    float64
 3   tempmin         396 non-null    float64
 4   temp            396 non-null    float64
 5   feelslikemax    396 non-null    float64
 6   feelslikemin    396 non-null    float64
 7   feelslike       396 non-null    float64
 8   dew             396 non-null    float64
 9   humidity        396 non-null    float64
 10  precip          396 non-null    float64
 11  precipprob      22 non-null     float64
 12  precipcover     374 non-null    float64
 13  preciptype      10 non-null     object 
 14  snow            22 non-null     float64
 15  snowdepth       31 non-null     float64
 16  windgust        167 non-null    float64
 17  windspeed       396 non-null    flo

In [13]:
print('Columns: ', df_weather.columns, '\nShape: ', df_weather.shape)

Columns:  Index(['datetime', 'datetimeEpoch', 'tempmax', 'tempmin', 'temp',
       'feelslikemax', 'feelslikemin', 'feelslike', 'dew', 'humidity',
       'precip', 'precipprob', 'precipcover', 'preciptype', 'snow',
       'snowdepth', 'windgust', 'windspeed', 'winddir', 'pressure',
       'cloudcover', 'visibility', 'solarradiation', 'solarenergy', 'uvindex',
       'sunrise', 'sunriseEpoch', 'sunset', 'sunsetEpoch', 'moonphase',
       'conditions', 'description', 'icon', 'stations', 'source', 'tzoffset',
       'severerisk'],
      dtype='object') 
Shape:  (396, 37)
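
The shape reports 396 rows, which is more than the 365 days of 2021, so the file may contain days outside 2021 or repeated days; the output above does not show which. The sketch below shows one way to inspect the date range, count duplicates and keep only 2021. The small `df_weather` frame is a hypothetical stand-in with invented values, used so the example runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for df_weather: dates that spill past 2021
# and one repeated day. The values are invented for illustration.
df_weather = pd.DataFrame({
    "datetime": ["2021-12-30", "2021-12-31", "2021-12-31", "2022-01-01"],
    "temp": [7.0, 8.0, 8.0, 9.0],
})

dates = pd.to_datetime(df_weather["datetime"])

# Inspect the covered range and how many dates repeat an earlier row.
print(dates.min(), dates.max())
n_duplicates = int(dates.duplicated().sum())

# Keep one row per day, restricted to calendar year 2021.
only_2021 = df_weather[dates.dt.year == 2021].drop_duplicates(subset="datetime")
print(n_duplicates, len(only_2021))
```

This matters for the schema that follows, because `datetime` can only serve as a primary key if each day appears once.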


In [14]:
# infer a SQL table schema for the weather data
weather_table = pd.io.sql.get_schema(frame=df_weather, name='weather_staging', keys='datetime')
print(weather_table)

CREATE TABLE "weather_staging" (
"datetime" TEXT,
  "datetimeEpoch" INTEGER,
  "tempmax" REAL,
  "tempmin" REAL,
  "temp" REAL,
  "feelslikemax" REAL,
  "feelslikemin" REAL,
  "feelslike" REAL,
  "dew" REAL,
  "humidity" REAL,
  "precip" REAL,
  "precipprob" REAL,
  "precipcover" REAL,
  "preciptype" TEXT,
  "snow" REAL,
  "snowdepth" REAL,
  "windgust" REAL,
  "windspeed" REAL,
  "winddir" REAL,
  "pressure" REAL,
  "cloudcover" REAL,
  "visibility" REAL,
  "solarradiation" REAL,
  "solarenergy" REAL,
  "uvindex" REAL,
  "sunrise" TEXT,
  "sunriseEpoch" INTEGER,
  "sunset" TEXT,
  "sunsetEpoch" INTEGER,
  "moonphase" REAL,
  "conditions" TEXT,
  "description" TEXT,
  "icon" TEXT,
  "stations" TEXT,
  "source" TEXT,
  "tzoffset" REAL,
  "severerisk" REAL,
  CONSTRAINT weather_staging_pk PRIMARY KEY ("datetime")
)
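
The schema maps `stations` to `TEXT`, but in the DataFrame each cell of that column holds a Python list, which a database driver is unlikely to accept as a bound value. A minimal sketch, assuming the list should be stored as a JSON string, is shown below with an in-memory SQLite database. The one-row frame is modelled on the `df_weather.head()` output with a shortened station list, and treating the station ids as strings is an assumption.

```python
import json
import sqlite3

import pandas as pd

# One row modelled on df_weather.head(); the station list is shortened
# and the ids are assumed to be strings.
df_weather = pd.DataFrame({
    "datetime": ["2021-01-01"],
    "temp": [2.1],
    "stations": [["03769099999", "03680099999", "D5621"]],
})

# Serialise each list to a JSON string so it fits a TEXT column.
staged = df_weather.copy()
staged["stations"] = staged["stations"].apply(json.dumps)

con = sqlite3.connect(":memory:")
try:
    staged.to_sql("weather_staging", con, index=False)
    stored = con.execute("SELECT stations FROM weather_staging").fetchone()[0]
finally:
    con.close()

# The original list can be recovered from the stored text.
restored = json.loads(stored)
print(restored)
```

An alternative design would be a separate table with one row per day and station, which is easier to query but needs an extra load step.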
