# **NB02 - Data Analysis**

**OBJECTIVE:**
Visualise and analyse the weather data from the OpenMeteo API to answer the research question and draw a conclusion. The data visualisations include the following:
- Bar chart visualising the number of days of rain in 2003 and 2023 
- Bar chart visualising the total precipitation in mm in 2003 and 2023
- Boxplots visualising the daily precipitation in mm for each city in 2003 and 2023

**AUTHOR:** 
@nadiabegic on GitHub

**LAST EDITED:**
11-Nov-2024

--------------------

**Imports:**

In [42]:
import pandas as pd
import json
from lets_plot import *
LetsPlot.setup_html()

# 1. Visualise the number of days of rainfall in 2003 vs. 2023

1.1 Load the JSON file _days_rain_2023.json_ as a pandas dataframe

In [43]:
with open('../data/days_rain_2023.json', 'r') as file:
    days_2023_data = json.load(file)

df_days_rain_2023 = pd.DataFrame(list(days_2023_data.items()), columns=['City', 'Days of Rain in 2023'])
df_days_rain_2023

Unnamed: 0,City,Days of Rain in 2023
0,London,228
1,Edinburgh,270
2,Sarajevo,200
3,Amsterdam,266
4,Paris,225
5,Madrid,123
6,Damascus,64
7,New York City,187
8,Los Angeles,103
9,Dubai,28
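
The same load-and-convert pattern is repeated for every JSON file in this notebook, so it could be wrapped in a small helper. The sketch below is one possible way to do that, not part of the original analysis: `load_city_json` is a hypothetical name, and the demo file is a throwaway holding two values copied from the table above.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def load_city_json(path, value_name):
    """Read a {city: value} JSON file into a two-column dataframe (hypothetical helper)."""
    with open(path, 'r') as file:
        data = json.load(file)
    return pd.DataFrame(list(data.items()), columns=['City', value_name])


# Demo on a temporary file so the sketch does not depend on ../data/
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / 'demo_days_rain.json'
    demo_path.write_text(json.dumps({'London': 228, 'Dubai': 28}))
    df_demo = load_city_json(demo_path, 'Days of Rain in 2023')

print(df_demo)
```

With such a helper, each loading cell would reduce to one call that passes the file path and the column name.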


1.2 Load the JSON file _days_rain_2003.json_ as a pandas dataframe

In [44]:
with open('../data/days_rain_2003.json', 'r') as file:
    days_2003_data = json.load(file)

df_days_rain_2003 = pd.DataFrame(list(days_2003_data.items()), columns=['City', 'Days of Rain in 2003'])

df_days_rain_2003
    

Unnamed: 0,City,Days of Rain in 2003
0,London,159
1,Edinburgh,191
2,Sarajevo,169
3,Amsterdam,163
4,Paris,158
5,Madrid,120
6,Damascus,75
7,New York City,162
8,Los Angeles,49
9,Dubai,16


1.3 Merge the two dataframes on the `City` column in order to plot them on a single bar chart

In [45]:
# join on the shared 'City' column (stated explicitly rather than left to the default)
df_days_rain = df_days_rain_2023.merge(df_days_rain_2003, on='City')
df_days_rain

Unnamed: 0,City,Days of Rain in 2023,Days of Rain in 2003
0,London,228,159
1,Edinburgh,270,191
2,Sarajevo,200,169
3,Amsterdam,266,163
4,Paris,225,158
5,Madrid,123,120
6,Damascus,64,75
7,New York City,187,162
8,Los Angeles,103,49
9,Dubai,28,16
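
When `on` is omitted, `merge` joins on every column name the two dataframes share, which here is only `City`. A minimal sketch of a more defensive merge is shown below, using three rows copied from the tables above. The `validate` check and the row-count assertion are optional additions, not something the original cells do.

```python
import pandas as pd

# Three rows copied from the tables above, enough to illustrate the merge
df_2023 = pd.DataFrame({'City': ['London', 'Madrid', 'Dubai'],
                        'Days of Rain in 2023': [228, 123, 28]})
df_2003 = pd.DataFrame({'City': ['London', 'Madrid', 'Dubai'],
                        'Days of Rain in 2003': [159, 120, 16]})

# validate='one_to_one' raises a MergeError if a city appears twice in either frame
merged = df_2023.merge(df_2003, on='City', how='inner', validate='one_to_one')

# An inner join silently drops cities missing from one frame, so check the row count
assert len(merged) == len(df_2023)
print(merged)
```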


1.4 Visualise the days of rain as a grouped bar chart using the `ggplot` API of the lets-plot library

In [46]:
# This code cell is partially AI-generated: AI was used to melt the dataframe

df_days_melted = df_days_rain.melt(id_vars=['City'], 
                              value_vars=['Days of Rain in 2003', 'Days of Rain in 2023'], 
                              var_name='Year', 
                              value_name='Days of Rain')

(
     ggplot(df_days_melted, aes(x='City', y='Days of Rain', fill='Year')) +
     geom_bar(stat='identity', position='dodge') + 
     scale_fill_manual(values={'Days of Rain in 2003': '#505fd6', 'Days of Rain in 2023': '#08189b'}) +
     scale_y_continuous() +
     labs(x="City",
          y="Days of Rain",
          title="Days of Rain in 2003 and 2023",
          subtitle="Collected with OpenMeteo API")
)
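
To make the melt step above easier to follow, here is a minimal sketch of what `melt` does to a two-city version of the merged dataframe (values copied from the table in 1.3). The extra `Year label` column is an optional assumption of mine for a shorter legend and is not used in the original plot.

```python
import pandas as pd

# Wide format: one row per city, one column per year
wide = pd.DataFrame({
    'City': ['London', 'Dubai'],
    'Days of Rain in 2023': [228, 28],
    'Days of Rain in 2003': [159, 16],
})

# Long format: one row per (city, year) pair, which is what fill='Year' needs
long = wide.melt(id_vars=['City'],
                 value_vars=['Days of Rain in 2003', 'Days of Rain in 2023'],
                 var_name='Year',
                 value_name='Days of Rain')

# Optional: pull the four-digit year out of the old column name
long['Year label'] = long['Year'].str.extract(r'(\d{4})', expand=False)
print(long)
```

The melted dataframe has twice as many rows as the wide one, with the old column names stored as values in `Year`.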

# 2. Visualise the total precipitation in 2003 vs. 2023

2.1 Load the JSON file _total_precipitation_2023.json_ as a pandas dataframe

In [47]:
with open('../data/total_precipitation_2023.json', 'r') as file:
    total_prec_2023_data = json.load(file)

df_total_prec_2023 = pd.DataFrame(list(total_prec_2023_data.items()), columns=['City', 'Total Precipitation in 2023 (mm)'])
df_total_prec_2023

Unnamed: 0,City,Total Precipitation in 2023 (mm)
0,London,769.5
1,Edinburgh,1109.2
2,Sarajevo,1262.3
3,Amsterdam,1199.2
4,Paris,963.8
5,Madrid,516.4
6,Damascus,182.0
7,New York City,1492.5
8,Los Angeles,979.7
9,Dubai,79.6


2.2 Load the JSON file _total_precipitation_2003.json_ as a pandas dataframe

In [48]:
with open('../data/total_precipitation_2003.json', 'r') as file:
    total_prec_2003_data = json.load(file)

df_total_prec_2003 = pd.DataFrame(list(total_prec_2003_data.items()), columns=['City', 'Total Precipitation in 2003 (mm)'])
df_total_prec_2003

Unnamed: 0,City,Total Precipitation in 2003 (mm)
0,London,546.6
1,Edinburgh,515.4
2,Sarajevo,787.9
3,Amsterdam,587.5
4,Paris,502.1
5,Madrid,558.0
6,Damascus,236.8
7,New York City,1225.4
8,Los Angeles,307.8
9,Dubai,22.6


2.3 Merge the two dataframes on the `City` column in order to plot them on a single bar chart

In [49]:
# join on the shared 'City' column (stated explicitly rather than left to the default)
df_total_prec = df_total_prec_2023.merge(df_total_prec_2003, on='City')
df_total_prec

Unnamed: 0,City,Total Precipitation in 2023 (mm),Total Precipitation in 2003 (mm)
0,London,769.5,546.6
1,Edinburgh,1109.2,515.4
2,Sarajevo,1262.3,787.9
3,Amsterdam,1199.2,587.5
4,Paris,963.8,502.1
5,Madrid,516.4,558.0
6,Damascus,182.0,236.8
7,New York City,1492.5,1225.4
8,Los Angeles,979.7,307.8
9,Dubai,79.6,22.6
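
The merged table also makes it straightforward to quantify the change between the two years. The sketch below is an optional extra step, not part of the original analysis: it uses three rows copied from the table above, and the column names `Change (mm)` and `Change (%)` are my own.

```python
import pandas as pd

col_2023 = 'Total Precipitation in 2023 (mm)'
col_2003 = 'Total Precipitation in 2003 (mm)'

# Three rows copied from the merged table above
df = pd.DataFrame({
    'City': ['London', 'Madrid', 'Los Angeles'],
    col_2023: [769.5, 516.4, 979.7],
    col_2003: [546.6, 558.0, 307.8],
})

# Absolute change in mm and relative change as a percentage of the 2003 total
df['Change (mm)'] = (df[col_2023] - df[col_2003]).round(1)
df['Change (%)'] = ((df[col_2023] - df[col_2003]) / df[col_2003] * 100).round(1)

# Sort so the largest relative increases come first
print(df.sort_values('Change (%)', ascending=False))
```

A negative value means the city received less precipitation in 2023 than in 2003.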


2.4 Visualise the total precipitation as a grouped bar chart using the `ggplot` API of the lets-plot library

In [50]:
# This code cell is also partially AI-generated: AI was used to melt the dataframe

df_total_prec_melted = df_total_prec.melt(id_vars=['City'], 
                              value_vars=['Total Precipitation in 2003 (mm)', 'Total Precipitation in 2023 (mm)'], 
                              var_name='Year', 
                              value_name='Total Precipitation (mm)')

(
     ggplot(df_total_prec_melted, aes(x='City', y='Total Precipitation (mm)', fill='Year')) +
     geom_bar(stat='identity', position='dodge') +
     # the keys must match the values in the melted 'Year' column
     scale_fill_manual(values={'Total Precipitation in 2003 (mm)': '#505fd6', 'Total Precipitation in 2023 (mm)': '#08189b'}) +
     scale_y_continuous() +
     labs(x="City",
          y="Total Precipitation (mm)",
          title="Total Precipitation in 2003 and 2023",
          subtitle="Collected with OpenMeteo API")
)

# 3. Visualise the daily precipitation in 2003 vs. 2023

3.1 Load the JSON file _daily_precipitation_2023.json_ as a pandas dataframe

In [51]:
with open('../data/daily_precipitation_2023.json', 'r') as file:
    daily_prec_2023_data = json.load(file)

df_daily_prec_2023 = pd.DataFrame(list(daily_prec_2023_data.items()), columns=['City', 'Daily Precipitation in 2023 (mm)'])
df_daily_prec_2023

Unnamed: 0,City,Daily Precipitation in 2023 (mm)
0,London,"[4.0, 0.2, 3.2, 0.9, 0.1, 1.2, 5.0, 1.8, 0.3, ..."
1,Edinburgh,"[6.0, 0.0, 3.7, 1.8, 5.9, 0.5, 7.6, 0.5, 0.1, ..."
2,Sarajevo,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 18.4,..."
3,Amsterdam,"[6.0, 3.5, 2.6, 11.0, 1.7, 2.2, 0.5, 0.8, 4.3,..."
4,Paris,"[2.2, 7.2, 2.3, 2.1, 0.4, 0.0, 1.4, 4.2, 0.8, ..."
5,Madrid,"[0.0, 1.1, 0.0, 0.0, 0.0, 0.0, 1.9, 12.4, 0.0,..."
6,Damascus,"[0.0, 0.0, 0.4, 0.5, 0.0, 0.5, 0.1, 0.0, 0.0, ..."
7,New York City,"[6.3, 0.2, 9.6, 1.3, 1.2, 9.6, 0.0, 0.0, 1.2, ..."
8,Los Angeles,"[21.8, 0.0, 6.4, 14.2, 35.3, 1.8, 0.0, 0.2, 14..."
9,Dubai,"[0.0, 0.0, 0.5, 0.0, 0.1, 0.1, 0.9, 0.0, 0.0, ..."


In [52]:
# explode the list column so that each daily value gets its own row
plot_daily_prec_2023_df = df_daily_prec_2023.explode('Daily Precipitation in 2023 (mm)')

# explode leaves the column with dtype object, so cast it to float
plot_daily_prec_2023_df['Daily Precipitation in 2023 (mm)'] = plot_daily_prec_2023_df['Daily Precipitation in 2023 (mm)'].astype(float)
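
As a small illustration of the explode step, the sketch below applies it to the first three daily values for London and Dubai from the table above. The shortened column name and the `reset_index` call are assumptions of mine; the original cells keep the repeated index.

```python
import pandas as pd

# First three daily values for two cities, copied from the 2023 table above
nested = pd.DataFrame({
    'City': ['London', 'Dubai'],
    'Daily Precipitation (mm)': [[4.0, 0.2, 3.2], [0.0, 0.0, 0.5]],
})

# One row per list element; the original row index is repeated for each element
flat = nested.explode('Daily Precipitation (mm)')

# The exploded column has dtype object, so cast it before computing statistics
flat['Daily Precipitation (mm)'] = flat['Daily Precipitation (mm)'].astype(float)

# Optional: replace the repeated index (0, 0, 0, 1, 1, 1) with a fresh one
flat = flat.reset_index(drop=True)
print(flat)
```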

In [53]:
# check that the exploded column is float64 and that there are 3650 rows (10 cities x 365 days)
plot_daily_prec_2023_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3650 entries, 0 to 9
Data columns (total 2 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   City                              3650 non-null   object 
 1   Daily Precipitation in 2023 (mm)  3650 non-null   float64
dtypes: float64(1), object(1)
memory usage: 85.5+ KB


In [54]:
# view the summary statistics of the dataframe
plot_daily_prec_2023_df.groupby('City')['Daily Precipitation in 2023 (mm)'].describe()

City,count,mean,std,min,25%,50%,75%,max
Amsterdam,365.0,3.285479,5.17472,0.0,0.0,1.1,4.6,34.5
Damascus,365.0,0.49863,2.164467,0.0,0.0,0.0,0.0,18.4
Dubai,365.0,0.218082,1.867463,0.0,0.0,0.0,0.0,28.1
Edinburgh,365.0,3.038904,5.239948,0.0,0.0,0.8,3.8,55.1
London,365.0,2.108219,3.679015,0.0,0.0,0.4,2.6,27.4
Los Angeles,365.0,2.68411,10.130406,0.0,0.0,0.0,0.2,97.5
Madrid,365.0,1.414795,5.343981,0.0,0.0,0.0,0.3,62.7
New York City,365.0,4.089041,10.189312,0.0,0.0,0.1,2.9,84.9
Paris,365.0,2.640548,4.682672,0.0,0.0,0.4,3.2,25.5
Sarajevo,365.0,3.458356,7.273285,0.0,0.0,0.1,3.1,45.3
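
These summary statistics can be cross-checked against section 2: multiplying each city's mean daily precipitation by 365 should reproduce its total for 2023. The sketch below does this for three cities using values copied from the tables above. The column names and the 0.05 mm tolerance are my own choices.

```python
import pandas as pd

# Means from the summary statistics above, totals from the table in 2.1
check = pd.DataFrame({
    'City': ['London', 'Edinburgh', 'Amsterdam'],
    'mean_daily_mm': [2.108219, 3.038904, 3.285479],
    'reported_total_mm': [769.5, 1109.2, 1199.2],
})

# mean x number of days = annual total
check['reconstructed_total_mm'] = check['mean_daily_mm'] * 365

# Allow a small tolerance because the displayed means are rounded
difference = (check['reconstructed_total_mm'] - check['reported_total_mm']).abs()
check['matches'] = difference < 0.05
print(check)
```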


3.2 Load the JSON file _daily_precipitation_2003.json_ as a pandas dataframe

In [55]:
with open('../data/daily_precipitation_2003.json', 'r') as file:
    daily_prec_2003_data = json.load(file)

df_daily_prec_2003 = pd.DataFrame(list(daily_prec_2003_data.items()), columns=['City', 'Daily Precipitation in 2003 (mm)'])
df_daily_prec_2003

Unnamed: 0,City,Daily Precipitation in 2003 (mm)
0,London,"[16.3, 10.6, 0.4, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0..."
1,Edinburgh,"[10.0, 5.8, 0.0, 0.0, 0.0, 0.0, 0.0, 2.3, 0.4,..."
2,Sarajevo,"[2.4, 0.0, 0.3, 0.1, 3.9, 0.0, 15.3, 0.0, 11.6..."
3,Amsterdam,"[13.6, 15.7, 3.7, 0.0, 0.3, 0.0, 0.0, 0.0, 0.0..."
4,Paris,"[6.5, 9.3, 3.0, 4.9, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
5,Madrid,"[0.7, 0.9, 0.9, 0.5, 10.9, 0.7, 5.5, 1.6, 5.3,..."
6,Damascus,"[0.0, 0.1, 0.0, 5.8, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
7,New York City,"[15.3, 15.7, 3.4, 5.9, 0.0, 0.0, 0.0, 0.0, 0.0..."
8,Los Angeles,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.1, ..."
9,Dubai,"[0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."


In [56]:
# explode the list column so that each daily value gets its own row
plot_daily_prec_2003_df = df_daily_prec_2003.explode('Daily Precipitation in 2003 (mm)')

# explode leaves the column with dtype object, so cast it to float
plot_daily_prec_2003_df['Daily Precipitation in 2003 (mm)'] = plot_daily_prec_2003_df['Daily Precipitation in 2003 (mm)'].astype(float)

In [58]:
# view the summary statistics of the dataframe
plot_daily_prec_2003_df.groupby('City')['Daily Precipitation in 2003 (mm)'].describe()

City,count,mean,std,min,25%,50%,75%,max
Amsterdam,365.0,1.609589,3.275473,0.0,0.0,0.0,1.7,21.5
Damascus,365.0,0.648767,2.270518,0.0,0.0,0.0,0.0,18.4
Dubai,365.0,0.061918,0.542381,0.0,0.0,0.0,0.0,8.0
Edinburgh,365.0,1.412055,2.512139,0.0,0.0,0.1,1.8,16.1
London,365.0,1.497534,3.383268,0.0,0.0,0.0,1.0,27.0
Los Angeles,365.0,0.843288,4.514615,0.0,0.0,0.0,0.0,42.4
Madrid,365.0,1.528767,4.35532,0.0,0.0,0.0,0.4,34.8
New York City,365.0,3.35726,7.487402,0.0,0.0,0.0,2.6,51.5
Paris,365.0,1.375616,3.215575,0.0,0.0,0.0,1.3,28.9
Sarajevo,365.0,2.15863,5.89042,0.0,0.0,0.0,1.6,75.7


3.3 Visualise the data with boxplots

3.3.1 Visualise a boxplot for each city's daily precipitation in 2023

In [101]:
# zoom into the boxplot to better visualise the IQR; the upper limit must be at least 4.6, the largest Q3 in the dataframe
# coord_cartesian only zooms the view, so the quartiles are still computed from all the data
daily_prec_2023_plot = (ggplot(plot_daily_prec_2023_df, aes(x='City', y='Daily Precipitation in 2023 (mm)', fill='City')) +
     geom_boxplot() +
     scale_fill_brewer(palette='Set3') +
     coord_cartesian(ylim=[0, 6]))

daily_prec_2023_plot
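
A note on axis limits: in ggplot-style libraries, limits set on a scale (which is what `ylim` does) typically discard out-of-range observations before the boxplot statistics are computed, whereas limits set through `coord_cartesian` only zoom the view. This is worth confirming in the lets-plot documentation. The pandas-only sketch below shows why the difference matters, using a hypothetical series of daily values that is not taken from the dataset.

```python
import pandas as pd

# Hypothetical daily precipitation values (mm), including a few heavy-rain days
daily = pd.Series([0.0, 0.0, 0.4, 1.2, 2.6, 3.1, 5.0, 9.6, 21.8, 35.3])
upper_limit = 6

# Q3 computed from all observations (what a zoomed view would still show)
q3_all = daily.quantile(0.75)

# Q3 computed after dropping values above the limit (what a scale limit would show)
q3_truncated = daily[daily <= upper_limit].quantile(0.75)

print(f'Q3 from all data:       {q3_all:.2f}')
print(f'Q3 after dropping > {upper_limit}:  {q3_truncated:.2f}')
```

Dropping the heavy-rain days pulls the upper quartile down, so a box drawn from truncated data would understate the spread.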

3.3.2 Visualise a boxplot for each city's daily precipitation in 2003

In [103]:
# zoom into the boxplot to better visualise the IQR; the upper limit must be at least 2.6, the largest Q3 in the dataframe
# coord_cartesian only zooms the view, so the quartiles are still computed from all the data
daily_prec_2003_plot = (ggplot(plot_daily_prec_2003_df, aes(x='City', y='Daily Precipitation in 2003 (mm)', fill='City')) +
     geom_boxplot() +
     scale_fill_brewer(palette='Set3') +
     coord_cartesian(ylim=[0, 6]))

daily_prec_2003_plot
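
To compare the two years in a single figure rather than two separate ones, the exploded dataframes could be stacked into one long dataframe with a `Year` column. The sketch below shows only the pandas step, on two daily values per city copied from the tables above. `tag_year` and the shared column name `Daily Precipitation (mm)` are hypothetical names of mine.

```python
import pandas as pd

# Two daily values per city, copied from the 2023 and 2003 tables above
df_2023 = pd.DataFrame({'City': ['London', 'London', 'Dubai', 'Dubai'],
                        'Daily Precipitation in 2023 (mm)': [4.0, 0.2, 0.0, 0.0]})
df_2003 = pd.DataFrame({'City': ['London', 'London', 'Dubai', 'Dubai'],
                        'Daily Precipitation in 2003 (mm)': [16.3, 10.6, 0.1, 0.0]})


def tag_year(df, value_col, year):
    """Rename the year-specific column to a shared name and add a Year column."""
    return (df.rename(columns={value_col: 'Daily Precipitation (mm)'})
              .assign(Year=str(year)))


# Stack both years into one long dataframe
combined = pd.concat(
    [tag_year(df_2003, 'Daily Precipitation in 2003 (mm)', 2003),
     tag_year(df_2023, 'Daily Precipitation in 2023 (mm)', 2023)],
    ignore_index=True,
)
print(combined)
```

The combined dataframe could then be passed to `ggplot` with `fill='Year'` so that each city gets a pair of boxes, one per year.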

# 4. Conclusion