In [1]:
# Connect to PostgreSQL database
import os
import psycopg2
import pandas as pd
from dotenv import load_dotenv

# Load DB credentials from .env
load_dotenv()

conn = psycopg2.connect(
    host=os.getenv("DB_HOST", "localhost"),
    port=os.getenv("DB_PORT", "5439"),
    user=os.getenv("DB_USER", "postgres"),
    password=os.getenv("DB_PASS"),
    database=os.getenv("DB_NAME", "tfl")
)

# Helper function to run SQL queries
def run_query(sql: str):
    return pd.read_sql(sql, conn)

# Test connection
test_df = run_query("SELECT COUNT(*) as total_journeys FROM journeys")
print(f"Successfully connected! Total journey records: {test_df['total_journeys'].iloc[0]}")
print("\nAvailable journey types:")
types_df = run_query("SELECT DISTINCT journey_type FROM journeys ORDER BY journey_type")
display(types_df)


Successfully connected! Total journey records: 936

Available journey types:



| | journey_type |
|---|--------------|
| 0 | Bus |
| 1 | Emirates Airline |
| 2 | Overground |
| 3 | TfL Rail |
| 4 | Tram |
| 5 | Underground & DLR |
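
The `run_query` helper in the cell above passes a raw `psycopg2` connection to `pd.read_sql`. pandas documents SQLAlchemy connectables and `sqlite3` connections as its supported inputs and may warn about other DB-API objects. If plain rows are enough, one alternative is to go through a DB-API cursor and build dictionaries from `cursor.description`. The sketch below is a minimal illustration under stated assumptions: `run_query_rows` is a hypothetical name, and an in-memory `sqlite3` table with invented rows stands in for the real `tfl` database so the block runs without credentials.

```python
import sqlite3


def run_query_rows(conn, sql, params=()):
    """Run a query through a DB-API cursor and return a list of dicts.

    Hypothetical helper, not part of the original notebook.
    """
    cur = conn.cursor()
    try:
        cur.execute(sql, params)
        # The first item of each description entry is the column name.
        columns = [col[0] for col in cur.description]
        return [dict(zip(columns, row)) for row in cur.fetchall()]
    finally:
        cur.close()


# Stand-in database with invented rows; the notebook itself queries `journeys`.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE journeys (journey_type TEXT, journeys_millions REAL)"
)
demo_conn.executemany(
    "INSERT INTO journeys VALUES (?, ?)",
    [("Bus", 2.0), ("Bus", 3.0), ("Tram", 1.5)],
)

# sqlite3 uses `?` placeholders; psycopg2 uses `%s` instead.
rows = run_query_rows(
    demo_conn,
    "SELECT COUNT(*) AS total_journeys FROM journeys WHERE journey_type = ?",
    ("Bus",),
)
print(rows)
```

Passing values as parameters rather than formatting them into the SQL string also keeps filters such as the journey type safe from quoting mistakes.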


![tower bridge](london.jpg)

London, or as the Romans called it, "Londinium"! Home to [over 8.5 million residents](https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates/bulletins/populationandhouseholdestimatesenglandandwales/census2021unroundeddata#population-and-household-estimates-england-and-wales-data) who speak over [300 languages](https://web.archive.org/web/20080924084621/http://www.cilt.org.uk/faqs/langspoken.htm). While the City of London is a little over one square mile (hence its nickname "The Square Mile"), Greater London has grown to encompass 32 boroughs spanning a total area of 606 square miles!

![underground train leaving a platform](tube.jpg)

Because the city's roads were originally designed for horse and cart, this growth in area and population has required an efficient public transport system! Since 2000, that system has been the responsibility of a local government body called **Transport for London**, or *TfL*, which is managed by the London Mayor's office. Its remit covers the London Underground, Overground, Docklands Light Railway (DLR), buses, trams, river services (clipper and [Emirates Airline cable car](https://en.wikipedia.org/wiki/London_cable_car)), roads, and even taxis.

The Mayor of London's office makes its data available to the public on the [London Datastore](https://data.london.gov.uk/dataset). In this project, you will work with a slightly modified version of a dataset containing information about public transport journey volume by transport type.

The data has been loaded into an **AWS Redshift** database called `tfl` with a single table called `journeys`, including the following data:

## tfl.journeys

| Column | Definition | Data type |
|--------|------------|-----------|
| `month`| Month in number format, e.g., `1` equals January | `INTEGER` |
| `year` | Year | `INTEGER` |
| `days` | Number of days in the given month | `INTEGER` |
| `report_date` | Date that the data was reported | `DATE` |
| `journey_type` | Method of transport used | `VARCHAR` |
| `journeys_millions` | Number of journeys in millions, as a decimal | `FLOAT` |

You will execute SQL queries to answer three questions, as listed in the instructions.

In [3]:
# Most popular transport types
query = """
SELECT 
    journey_type,
    ROUND(CAST(SUM(journeys_millions) AS NUMERIC), 2) AS total_journeys_millions
FROM journeys
GROUP BY journey_type
ORDER BY total_journeys_millions DESC;
"""

df_popular = run_query(query)
print("Most popular transport types by total journeys:")
display(df_popular)


Most popular transport types by total journeys:



| | journey_type | total_journeys_millions |
|---|--------------|-------------------------|
| 0 | Bus | 24905.19 |
| 1 | Underground & DLR | 15020.47 |
| 2 | Overground | 1666.85 |
| 3 | TfL Rail | 411.31 |
| 4 | Tram | 314.69 |
| 5 | Emirates Airline | 14.58 |
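
Raw totals are hard to compare at a glance, so it can help to express each transport type as a share of all journeys. The sketch below recomputes those shares in plain Python from the totals displayed above; the variable names are illustrative, and the figures are simply copied from the output rather than fetched again.

```python
# Totals copied from the query output above (millions of journeys).
totals = {
    "Bus": 24905.19,
    "Underground & DLR": 15020.47,
    "Overground": 1666.85,
    "TfL Rail": 411.31,
    "Tram": 314.69,
    "Emirates Airline": 14.58,
}

grand_total = sum(totals.values())

# Percentage share of each journey type, largest first.
shares = {
    name: 100 * value / grand_total
    for name, value in sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
}

for name, pct in shares.items():
    print(f"{name:<20}{pct:6.2f}%")
```

On these figures, buses account for well over half of all recorded journeys, while the cable car is a small fraction of one percent.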


In [4]:
# Emirates Airline popularity - top 5 months
query = """
SELECT 
    month,
    year,
    ROUND(CAST(journeys_millions AS NUMERIC), 2) AS rounded_journeys_millions
FROM journeys
WHERE journey_type ILIKE 'Emirates Airline%'
  AND journeys_millions IS NOT NULL
ORDER BY rounded_journeys_millions DESC
LIMIT 5;
"""

df_emirates = run_query(query)
print("Top 5 months for Emirates Airline usage:")
display(df_emirates)


Top 5 months for Emirates Airline usage:



| | month | year | rounded_journeys_millions |
|---|-------|------|---------------------------|
| 0 | 5 | 2012 | 0.53 |
| 1 | 6 | 2012 | 0.38 |
| 2 | 4 | 2012 | 0.24 |
| 3 | 5 | 2013 | 0.19 |
| 4 | 5 | 2015 | 0.19 |
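
Two rows in this output share the rounded value 0.19. Because the query sorts only on `rounded_journeys_millions`, the database is free to return tied rows in either order, and with `LIMIT 5` a tie at the cut-off could change which rows appear at all. A deterministic tie-break removes that ambiguity. The sketch below applies one in Python to the displayed rows; using `year` and then `month` as secondary keys is an assumption made for illustration. The SQL equivalent would be to append `year, month` to the `ORDER BY` clause.

```python
# Rows copied from the output above: (month, year, rounded_journeys_millions).
# Deliberately shuffled to show that the sort restores a stable order.
rows = [
    (5, 2015, 0.19),
    (5, 2012, 0.53),
    (5, 2013, 0.19),
    (4, 2012, 0.24),
    (6, 2012, 0.38),
]

# Sort by journeys descending, then by year and month ascending to break ties.
ranked = sorted(rows, key=lambda r: (-r[2], r[1], r[0]))

for month, year, journeys in ranked:
    print(f"{year}-{month:02d}: {journeys:.2f}")
```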


In [5]:
# Least popular years for Underground & DLR
query = """
SELECT
    year,
    journey_type,
    ROUND(CAST(SUM(journeys_millions) AS NUMERIC), 2) AS total_journeys_millions
FROM journeys
WHERE journey_type = 'Underground & DLR'
GROUP BY year, journey_type
ORDER BY total_journeys_millions ASC
LIMIT 5;
"""

df_tube_least = run_query(query)
print("Least popular years for Underground & DLR:")
display(df_tube_least)

# Also show most popular years for comparison
query_most = """
SELECT
    year,
    journey_type,
    ROUND(CAST(SUM(journeys_millions) AS NUMERIC), 2) AS total_journeys_millions
FROM journeys
WHERE journey_type = 'Underground & DLR'
GROUP BY year, journey_type
ORDER BY total_journeys_millions DESC
LIMIT 5;
"""

df_tube_most = run_query(query_most)
print("\nMost popular years for Underground & DLR (for comparison):")
display(df_tube_most)


Least popular years for Underground & DLR:



| | year | journey_type | total_journeys_millions |
|---|------|--------------|-------------------------|
| 0 | 2020 | Underground & DLR | 310.18 |
| 1 | 2021 | Underground & DLR | 748.45 |
| 2 | 2022 | Underground & DLR | 1064.86 |
| 3 | 2010 | Underground & DLR | 1096.15 |
| 4 | 2011 | Underground & DLR | 1156.65 |



Most popular years for Underground & DLR (for comparison):



| | year | journey_type | total_journeys_millions |
|---|------|--------------|-------------------------|
| 0 | 2019 | Underground & DLR | 1386.44 |
| 1 | 2016 | Underground & DLR | 1384.64 |
| 2 | 2018 | Underground & DLR | 1382.42 |
| 3 | 2015 | Underground & DLR | 1363.46 |
| 4 | 2017 | Underground & DLR | 1362.29 |
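
One way to read the two result sets together is to express each of the least popular years as a percentage change from the busiest year. The sketch below does this with the displayed totals. Treating 2019 as the baseline is an assumption for illustration, and the comparison is only rough if any year is not fully covered by the data, since the yearly totals are not adjusted for the number of days reported.

```python
# Figures copied from the two outputs above (millions of journeys).
peak_year, peak_total = 2019, 1386.44
least_popular = {
    2020: 310.18,
    2021: 748.45,
    2022: 1064.86,
    2010: 1096.15,
    2011: 1156.65,
}


def pct_change(value, baseline):
    """Percentage change of `value` relative to `baseline`."""
    return 100 * (value - baseline) / baseline


changes = {
    year: pct_change(total, peak_total) for year, total in least_popular.items()
}

for year, change in sorted(changes.items()):
    print(f"{year}: {change:+.1f}% vs {peak_year}")
```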
