**OBJECTIVE:** Visualise data collected for further analysis and summarisation.

**AUTHOR:** Joshua Xu

**LAST EDITED:** 2024-10-31

---


# Data Visualisation 

In this notebook, I visualise the data I have collected in more detail to support my [analysis](./DataAnalysis/DataAnalysis.ipynb).

Let's start off by importing both the data we need and the modules we require:

In [288]:
import json
import requests
import pandas as pd
import numpy as np
from lets_plot import *
LetsPlot.setup_html()

Running the saved [.py file](df_weather_code.py), I will be working with the same data frame and variables as before: 

In [289]:
%run df_weather_code.py

dfWC

Unnamed: 0,Light to Moderate Rain,Heavy Rain,Snow,Total Rain,Others
London,204,19,6,223,137
Porto Novo,62,0,0,62,304
Kigali,223,28,0,251,115
Apia,304,30,0,334,32
Tiraspol,123,14,19,137,210
Dublin,255,15,11,270,85
Ljubljana,159,21,24,180,162
Kuwait City,37,5,0,42,324
Bangkok,224,36,0,260,106
Santo Domingo,304,44,0,348,18


## 1. Plotting Total Rain Percentage

For this graph we have the following requirements:

1. A horizontal bar graph, with cities and total rain percentage as the axes
2. Descending order - from most to least 
3. Data is labelled 
4. London is highlighted

In [290]:
# We first change the structure of the Data Frame
dfWC = dfWC.reset_index(names = 'City')

# Add a new column which we will use as the x-axis of our graph
dfWC['Total Rain Percentage (%)'] = round(dfWC['Total Rain']/366*100, 2)

# Sort the column in descending order
dfWC = dfWC.sort_values(by='Total Rain Percentage (%)', ascending = False)

# Define 'London' and 'Others' to help highlight our graph
dfWC['Cities'] = dfWC['City'].apply(lambda x: 'London' if x == 'London' else 'Others')
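
The reshaping steps above can be tried on a small frame. This is a minimal sketch using three rows copied from the `dfWC` table; the constant `DAYS_SAMPLED` is a name introduced here for illustration, and 366 is assumed to be the sample length because Light to Moderate Rain, Heavy Rain, Snow and Others add up to 366 for the rows shown.

```python
import pandas as pd

# Three rows copied from the dfWC table above (rainy days per city)
df = pd.DataFrame(
    {"Total Rain": [223, 270, 42]},
    index=["London", "Dublin", "Kuwait City"],
)

DAYS_SAMPLED = 366  # assumption: number of days in the sample

# Move the city names from the index into a regular column
df = df.reset_index(names="City")

# Share of sampled days with rain, as a percentage rounded to 2 d.p.
df["Total Rain Percentage (%)"] = (df["Total Rain"] / DAYS_SAMPLED * 100).round(2)

# Rainiest city first
df = df.sort_values(by="Total Rain Percentage (%)", ascending=False)

# Keep 'London' as-is and relabel every other city as 'Others'
df["Cities"] = df["City"].where(df["City"] == "London", "Others")

print(df)
```

`Series.where` keeps values where the condition holds and substitutes the second argument elsewhere, which gives the same labels as the `apply` with a lambda used in the notebook.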


I have a feeling we are going to need to categorise the cities into London and others quite a lot...

In [291]:
def london_others(df):
    df['London'] = df['City'].apply(lambda x: 'London' if x == 'London' else 'Others')
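
Note that `london_others` edits the frame it is given and returns nothing. As a hedged alternative, the sketch below shows a variant that returns a tagged copy instead; `london_others_copy` and its parameters `city_col` and `out_col` are hypothetical names, not part of the notebook.

```python
import pandas as pd

def london_others_copy(df, city_col="City", out_col="London"):
    """Hypothetical variant: return a tagged copy and leave the input untouched."""
    out = df.copy()
    # Keep 'London' where the city is London, otherwise write 'Others'
    out[out_col] = out[city_col].where(out[city_col] == "London", "Others")
    return out

cities = pd.DataFrame({"City": ["London", "Apia", "Dublin"]})
tagged = london_others_copy(cities)
print(tagged)
```

Returning a copy can be handy when the input is a filtered slice of another frame, since assigning a column to a slice is what usually triggers pandas' `SettingWithCopyWarning`.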

Having shaped the data into the structure we want we can now plot the graph!

In [292]:
# Plot the graph!
(
ggplot(data=dfWC, mapping = aes(x = 'Total Rain Percentage (%)', y = 'City', fill = 'Cities')) +
       geom_bar(stat='identity', width = 0.8, size = 1) +
       scale_y_discrete_reversed() +
       scale_x_continuous(limits = (0,110)) +
       scale_fill_manual(values = {'London': 'darkblue', 'Others': 'lightblue'}) +
       geom_text(mapping = aes(label='Total Rain Percentage (%)'), colour = 'grey', nudge_x = 8, size = 5) +
       labs(x = 'Percentage of Rain for 2023 - 2024 (%)', y = 'Sampled City', title = 'London was below average!', subtitle = 'London was raining for 61% of the days, ranking 7th out of 11 cities.') + 
       coord_fixed(ratio=8/1) +
       theme_minimal()
)

## 1* We can do better

Just one bar per city is boring. Let's directly show the share of each category of weather for every city!

In [293]:
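# Reshape to long format: one row per (city, weather code) pair.
# Columns 1, 2, 3 and 5 are Light to Moderate Rain, Heavy Rain, Snow and Others
# ('Total Rain' at position 4 is skipped so rainy days are not counted twice).
dfWClong = pd.melt(dfWC, id_vars=['City', 'Total Rain Percentage (%)'], value_vars = dfWC.columns[[1,2,3,5]], var_name = 'Weather Code', value_name = 'Count')
dfWClong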

Unnamed: 0,City,Total Rain Percentage (%),Weather Code,Count
0,Santo Domingo,95.08,Light to Moderate Rain,304
1,Apia,91.26,Light to Moderate Rain,304
2,Brasiléia,76.78,Light to Moderate Rain,250
3,Dublin,73.77,Light to Moderate Rain,255
4,Bangkok,71.04,Light to Moderate Rain,224
5,Kigali,68.58,Light to Moderate Rain,223
6,London,60.93,Light to Moderate Rain,204
7,Ljubljana,49.18,Light to Moderate Rain,159
8,Tiraspol,37.43,Light to Moderate Rain,123
9,Porto Novo,16.94,Light to Moderate Rain,62
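
To make the wide-to-long reshaping concrete, here is a minimal sketch of `pd.melt` on two rows copied from the `dfWC` table. The names `wide` and `long_df` are introduced here for illustration, and the weather columns are listed by name rather than by position.

```python
import pandas as pd

# Two rows copied from the dfWC table, in wide format (one column per weather code)
wide = pd.DataFrame({
    "City": ["London", "Dublin"],
    "Light to Moderate Rain": [204, 255],
    "Heavy Rain": [19, 15],
    "Snow": [6, 11],
    "Others": [137, 85],
})

# Long format: each (city, weather code) pair becomes its own row
long_df = pd.melt(
    wide,
    id_vars=["City"],
    value_vars=["Light to Moderate Rain", "Heavy Rain", "Snow", "Others"],
    var_name="Weather Code",
    value_name="Count",
)

print(long_df)
# Sanity check: the counts for each city should still add up to the days sampled
print(long_df.groupby("City")["Count"].sum())
```

The long layout is what lets a single `fill = 'Weather Code'` mapping colour each segment of a stacked bar.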


We can now plot a better bar chart...

In [294]:
(
    ggplot(data = dfWClong, mapping = aes(x = 'Count', y = 'City', group = 'Weather Code', fill = 'Weather Code'))
    + geom_bar(stat = 'identity', width = 0.8)
    + scale_y_discrete_reversed()
    + scale_fill_manual(values = {'Light to Moderate Rain': 'blue', 'Heavy Rain': 'darkblue', 'Snow': 'grey', 'Others': 'lightgrey'})
    + coord_fixed(ratio = 40/1)
    + theme_minimal()
)

To highlight London, we can tweak our Data Frame slightly...

In [295]:
# Changing entries for London's rainy days on Data Frame to help with colour coding
dfWClong.loc[
    (dfWClong['City'] == 'London') & (dfWClong['Weather Code'].isin(['Light to Moderate Rain', 'Heavy Rain'])),
    'Weather Code'
] += ' - London'

# Colour coding
colour = {'Light to Moderate Rain': 'lightblue', 'Heavy Rain': 'blue', 'Snow': 'grey', 'Others': 'lightgrey',
        'Light to Moderate Rain - London': 'pink',
        'Heavy Rain - London': 'red'
}

# First 11 rows = one row per city, used to place a single percentage label on each bar
dfWClongp = dfWClong.iloc[0:11,:]
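
The two steps in this cell (retagging London's rainy categories and picking one label row per city) can be sketched on a small frame. The data below are copied from the tables above for two cities; using `drop_duplicates` instead of `iloc[0:11, :]` is a swapped-in alternative that does not depend on knowing the number of cities.

```python
import pandas as pd

# Small long-format frame for two cities
long_df = pd.DataFrame({
    "City": ["London", "Dublin"] * 3,
    "Weather Code": ["Light to Moderate Rain"] * 2 + ["Heavy Rain"] * 2 + ["Snow"] * 2,
    "Count": [204, 255, 19, 15, 6, 11],
})

# Boolean mask: London's rows for the two rain categories
is_london_rain = (long_df["City"] == "London") & long_df["Weather Code"].isin(
    ["Light to Moderate Rain", "Heavy Rain"]
)

# Append a suffix so these rows can get their own colours in a manual fill scale
long_df.loc[is_london_rain, "Weather Code"] += " - London"

# One row per city (the first occurrence), to carry a single text label per bar
labels = long_df.drop_duplicates(subset="City", keep="first")

print(long_df)
print(labels)
```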

The final graph will thus look like:

In [296]:
(
    ggplot(data = dfWClong, mapping = aes(x = 'Count', y = 'City', group = 'Weather Code', fill = 'Weather Code'))
    + geom_bar(stat = 'identity', width = 0.8, size = 1)
    + geom_text(data = dfWClongp,
                mapping = aes(label = 'Total Rain Percentage (%)'),
                nudge_x = 52,
                size = 6,
                colour = 'white')
    + scale_y_discrete_reversed()
    + scale_fill_manual(values = colour)
    + theme(legend_position = 'bottom')
    + labs(x = 'Days',
           title = 'London is below average in total days of rain!', 
           subtitle = 'Also it snows a lot in Ljubljana?',
           caption = 'The labels show Total Rain Percentage (%)')
    + ggsize(1200, 800)
)


**Analysis** - Both graphs show that London is not an outlier in the number of rainy days during 2023-2024, in either Light to Moderate Rain or Heavy Rain.
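
The "not an outlier" reading can also be checked numerically. The sketch below applies the common 1.5 × IQR rule to the Total Rain values of the ten cities listed in the `dfWC` table above; this rule is an assumption about what counts as an outlier, not a test the notebook itself runs.

```python
import pandas as pd

# Total Rain (days) for the ten cities shown in the dfWC table above
total_rain = pd.Series({
    "London": 223, "Porto Novo": 62, "Kigali": 251, "Apia": 334,
    "Tiraspol": 137, "Dublin": 270, "Ljubljana": 180,
    "Kuwait City": 42, "Bangkok": 260, "Santo Domingo": 348,
}, name="Total Rain")

# Tukey's rule: flag values more than 1.5 * IQR outside the quartiles
q1, q3 = total_rain.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = total_rain[(total_rain < lower) | (total_rain > upper)]

print(f"Fences: {lower:.1f} to {upper:.1f} days")
print("Outliers:", list(outliers.index))
print("London below the median:", total_rain["London"] < total_rain.median())
```

Under this rule none of the listed cities is flagged, and London sits below the median of the listed cities.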


Alternatively, we can plot the bars side by side to compare categories of weather more directly.

In [297]:
(
    ggplot(data = dfWClong, mapping = aes(x = 'Count', y = 'City', group = 'Weather Code', fill = 'Weather Code'))
    + geom_bar(stat = 'identity', 
               position = 'dodge', 
               width = 1, 
               size = 1)
    + scale_y_discrete_reversed()
    + scale_fill_manual(values = colour)
    + theme(legend_position = 'bottom')
    + labs(x = 'Days',
           title = 'London is below average in total days of rain!', 
           subtitle = 'Also it snows a lot in Ljubljana?')
    + ggsize(1200, 800)
)

## 2. Plotting Cumulative Rain Sum

To plot this, we have the following requirements for the graph:
1. A line graph, with days and cumulative rain sum as the axes and one line per city
2. London is highlighted
3. Each city is named

We begin by transforming the data to the structure that allows us to plot the above graph.

In [298]:
# Running total of daily rain per city, with the day index moved into a 'Days' column
dfRScum = dfRS.cumsum().reset_index(names = 'Days')

dfRScumlong = pd.melt(dfRScum, id_vars = ['Days'], value_vars = dfRScum.columns[1:], var_name = 'City', value_name = 'Cumulative Rain Sum')

dfRScumlong

Unnamed: 0,Days,City,Cumulative Rain Sum
0,0,London,4.0
1,1,London,4.2
2,2,London,7.4
3,3,London,8.3
4,4,London,8.4
...,...,...,...
4021,361,Brasiléia,1680.5
4022,362,Brasiléia,1681.0
4023,363,Brasiléia,1684.8
4024,364,Brasiléia,1703.2
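
As a small illustration of this transformation, the sketch below runs `cumsum` and `pd.melt` on three days of data. London's daily values are back-calculated from the first three cumulative values in the table above (4.0, 4.2, 7.4); the Dublin values are made up for illustration.

```python
import pandas as pd

# Daily rain sums; London derived from the table above, Dublin made up
daily = pd.DataFrame({
    "London": [4.0, 0.2, 3.2],
    "Dublin": [1.0, 0.0, 2.5],
})

# Running total down each column, then expose the row index as 'Days'
cum = daily.cumsum().reset_index(names="Days")

# Long format: one row per (day, city), ready for one line per city
cum_long = pd.melt(cum, id_vars=["Days"], var_name="City", value_name="Cumulative Rain Sum")

print(cum_long)
```

When `value_vars` is omitted, `pd.melt` uses every column that is not listed in `id_vars`, which here matches what `dfRScum.columns[1:]` selects.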


In [299]:
(
    ggplot(data=dfRScumlong, mapping = aes(x = 'Days', y = 'Cumulative Rain Sum', group = 'City')) + 
    geom_line(color = 'grey') +
    labs(x = '2023-2024 (Days)', y = 'Cumulative Rain Sum (Litres per Square Meter)', title = 'London accumulated a meager amount of rain last year', subtitle = 'London ranks 8th in terms of total rain sum') +
    theme_minimal()
)

Let us highlight London by tweaking the Data Frame:

In [300]:
# Adding column to categorise data into 'London' and 'Others'
london_others(dfRScumlong)

# Creating Data Frames for data input in plotting functions
line_label = dfRScumlong.groupby('City').tail(1)
dfRScumlongL = dfRScumlong[dfRScumlong['London'] == 'London']
dfRScumlongO = dfRScumlong[dfRScumlong['London'] != 'London']

dfRScumlong  

Unnamed: 0,Days,City,Cumulative Rain Sum,London
0,0,London,4.0,London
1,1,London,4.2,London
2,2,London,7.4,London
3,3,London,8.3,London
4,4,London,8.4,London
...,...,...,...,...
4021,361,Brasiléia,1680.5,Others
4022,362,Brasiléia,1681.0,Others
4023,363,Brasiléia,1684.8,Others
4024,364,Brasiléia,1703.2,Others
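
The label frame above takes the last row of each city with `groupby('City').tail(1)`. A minimal sketch of that step follows, with made-up numbers; the `.copy()` call and the `offsets` dictionary are additions for illustration, showing one way to add or adjust columns on the label frame without pandas warning about writing to a slice.

```python
import pandas as pd

# Made-up cumulative values for three cities over two days
cum_long = pd.DataFrame({
    "Days": [0, 1, 0, 1, 0, 1],
    "City": ["London", "London", "Bangkok", "Bangkok", "Kigali", "Kigali"],
    "Cumulative Rain Sum": [4.0, 4.2, 10.0, 30.0, 9.0, 31.0],
})

# Last row per city = end point of each line; copy so it can be edited safely
labels = cum_long.groupby("City").tail(1).copy()

# Separate column for the label position, starting at the line's end point
labels["y-coordinate"] = labels["Cumulative Rain Sum"]

# Hypothetical per-city vertical offsets, e.g. to separate labels that overlap
offsets = {"Kigali": 20}
labels["y-coordinate"] += labels["City"].map(offsets).fillna(0)

print(labels)
```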


In [301]:
(
    ggplot(dfRScumlong, mapping = aes(x = 'Days',
                                      y = 'Cumulative Rain Sum', 
                                      group = 'City',
                                      colour='London'))
    + geom_line(data = dfRScumlongL, 
                linetype = 1)
    + geom_line(data =dfRScumlongO, 
                linetype = 4,
                alpha = 0.6)
    + geom_text(data = line_label, 
                mapping = aes(label = 'City'), 
                size = 6,
                nudge_x = 20,
                position = position_jitterdodge(dodge_width = 1, jitter_width = 1, jitter_height = 1, seed = 42)
              ) 
 
    + labs(x = '2023-2024 (Days)', 
           y = 'Cumulative Rain Sum (Litres per Square Meter)', 
          title = 'London accumulated a meager amount of rain last year', 
          subtitle = 'London ranks 8th in terms of total rain sum')
    + scale_x_continuous(limits = (0, 380))
    + theme_minimal()
    + theme(legend_position = 'none')
    + ggsize(1200, 800)
)

To fix the labels overlapping, we have to adjust the Data Frame further.

In [302]:
# line_label was created as a slice of dfRScumlong, so we work on an explicit copy
# to avoid pandas' SettingWithCopyWarning when adding a column
line_label = line_label.copy()

# Constructing a new column for corrected y-coordinates
line_label['y-coordinate'] = line_label['Cumulative Rain Sum']

# Choosing rows with corresponding labels to move
rows_to_modify = line_label['City'].isin(['Bangkok', 'Kigali'])

# Changing the coordinates for chosen labels
line_label.loc[rows_to_modify,'y-coordinate'] += 20


Now the final graph looks like:

In [303]:
(
    ggplot(dfRScumlong, mapping = aes(x = 'Days',
                                      y = 'Cumulative Rain Sum', 
                                      group = 'City',
                                      colour='London'))
    + geom_line(data = dfRScumlongL, 
                linetype = 1,
                size = 1)
    + geom_line(data =dfRScumlongO, 
                linetype = 2,
                alpha = 0.6,
                size = 1)
    + geom_text(data = line_label, 
                mapping = aes(y = 'y-coordinate', # insert new changed coordinate
                             label = 'City'), 
                size = 6,
                nudge_x = 20
              ) 
 
    + labs(x = '2023-2024 (Days)', 
           y = 'Cumulative Rain Sum (Litres per Square Meter)', 
          title = 'London accumulated a meager amount of rain last year', 
          subtitle = 'London ranks 8th in terms of total rain sum')
    + scale_x_continuous(limits = (0, 365))
    + theme_minimal()
    + theme(legend_position = 'none')
    + ggsize(1200, 800)
)

**Analysis** - Again, London does not look like an outlier in rain accumulated over 2023-2024, ranking below average once again.

## 3. Plotting Precipitation Hours Sum

Our requirements for the graph include:

1. A density graph for London compared against all cities, ranked
2. London is clearly highlighted
3. Data is labelled to show sum

We begin a similar process of shaping our Data Frame into a structure that enables us to plot the desired graph.

In [304]:
dfPH = dfPH.reset_index(names = 'Days')
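# Creating smaller Data Frame for labelling later AND reordering columns in descending order
# Note: cumsum() also accumulates the 'Days' column, so its final value is not a real day index;
# being the largest total, it sorts first and 'Days' stays the leading column
dfPHcum = dfPH.cumsum()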
cumsum = dfPHcum.iloc[-1]
cumsum_col = cumsum.sort_values(ascending=False).index
dfPH = dfPH[cumsum_col]
dfPH

Unnamed: 0,Days,Apia,Santo Domingo,Dublin,Brasiléia,Bangkok,Ljubljana,London,Kigali,Tiraspol,Porto Novo,Kuwait City
0,0,10.0,8.0,5.0,8.0,0.0,0.0,12.0,0.0,0.0,0.0,0.0
1,1,17.0,9.0,0.0,1.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0
2,2,17.0,8.0,18.0,4.0,0.0,1.0,14.0,1.0,0.0,0.0,0.0
3,3,23.0,1.0,1.0,10.0,0.0,2.0,5.0,6.0,0.0,0.0,22.0
4,4,22.0,9.0,7.0,22.0,0.0,0.0,1.0,7.0,2.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
361,361,4.0,11.0,6.0,2.0,0.0,0.0,4.0,6.0,0.0,0.0,2.0
362,362,9.0,6.0,3.0,2.0,0.0,2.0,4.0,5.0,0.0,0.0,0.0
363,363,12.0,14.0,11.0,6.0,0.0,0.0,10.0,9.0,0.0,0.0,1.0
364,364,11.0,6.0,11.0,13.0,0.0,8.0,8.0,8.0,0.0,0.0,0.0
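
The column reordering can be sketched on a few rows. The values below are the first three days for London, Apia and Kigali copied from the table above; summing each column before adding the `Days` column is an alternative to taking the last row of `cumsum()`, shown here as an assumption-labelled variant rather than the notebook's own method.

```python
import pandas as pd

# First three days of precipitation hours for three cities (from the table above)
hours = pd.DataFrame({
    "London": [12.0, 2.0, 14.0],
    "Apia": [10.0, 17.0, 17.0],
    "Kigali": [0.0, 0.0, 1.0],
})

# Total hours per city, largest first (equal to the last row of hours.cumsum())
totals = hours.sum().sort_values(ascending=False)

# Reorder the city columns by total, then expose the row index as 'Days'
ordered = hours[totals.index].reset_index(names="Days")

print(totals)
print(ordered)
```

Because the totals are computed before `Days` exists as a column, the day index never takes part in the sum or the sort.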


We now melt this Data Frame like before and categorise it before plotting the final graph.

In [305]:
dfPHlong = pd.melt(dfPH, id_vars = ['Days'], value_vars = dfPH.columns[1:], var_name = 'City', value_name = 'Precipitation Hours')
london_others(dfPHlong)
# Data Input for labels
cumlabel = pd.melt(dfPHcum,
                      id_vars = ['Days'],
                      value_vars = dfPH.columns[1:], 
                      var_name = 'City', 
                      value_name = 'Precipitation Hours').groupby('City').tail(1)

cumlabel['London'] = cumlabel['City'].apply(lambda x: 'London' if x == 'London' else 'Others')

In [306]:
cumlabel

Unnamed: 0,Days,City,Precipitation Hours,London
365,66795,Apia,3496.0,Others
731,66795,Santo Domingo,3152.0,Others
1097,66795,Dublin,2058.0,Others
1463,66795,Brasiléia,1901.0,Others
1829,66795,Bangkok,1783.0,Others
2195,66795,Ljubljana,1694.0,Others
2561,66795,London,1546.0,London
2927,66795,Kigali,1296.0,Others
3293,66795,Tiraspol,873.0,Others
3659,66795,Porto Novo,390.0,Others


In [307]:
(
    ggplot(data=dfPHlong, mapping=aes(x='Days', y='Precipitation Hours', group = 'City', colour = 'London'))
    + geom_density(data = dfPHlong, stat='identity')
    + geom_text(data = cumlabel,
                 mapping = aes(label = 'Precipitation Hours'),
                 inherit_aes = False,
                 size = 4,
                 nudge_x = 340,
                 nudge_y = 23,
                 color = 'grey'
                )
    + scale_color_manual(values = {
        'London': 'blue',
        'Others': 'lightblue'
    })
    + facet_wrap(facets = 'City', order = 0)
    + labs(title = "London didn't rain for THAT long between 2023 - 2024.",
           subtitle = "Ranking 7th overall, summing up to 1546 hours in total.",
           caption = 'Cumulative Rain Hours is labelled on the top right of each graph (Hours)')
    + ggsize(3200, 1000)
)

**Analysis** - Finally, London disappoints in rain duration too, with less than half of Apia's total rain time.
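
The comparison with Apia and London's rank can be checked directly from the totals in the `cumlabel` table above. This is a quick sketch in plain Python; the dictionary name `total_hours` is introduced here, and only the ten cities shown in that table are included.

```python
# Total precipitation hours per city, copied from the cumlabel table above
total_hours = {
    "Apia": 3496.0, "Santo Domingo": 3152.0, "Dublin": 2058.0,
    "Brasiléia": 1901.0, "Bangkok": 1783.0, "Ljubljana": 1694.0,
    "London": 1546.0, "Kigali": 1296.0, "Tiraspol": 873.0, "Porto Novo": 390.0,
}

# London's total as a fraction of Apia's
ratio = total_hours["London"] / total_hours["Apia"]

# Rank cities from most to least hours (1 = most)
ranking = sorted(total_hours, key=total_hours.get, reverse=True)
london_rank = ranking.index("London") + 1

print(f"London / Apia = {ratio:.1%}")  # 44.2%
print(f"London rank: {london_rank}")   # 7
```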

## Conclusion

To summarise our findings, London is clearly **not** an outlier on any of the measures of 'raininess' in our sample.

<ins>In other words, London is NOT as rainy as the movies make it out to be.</ins>

Of course, our findings are limited to this particular sample, which is small in both the number of cities and the period covered, and may therefore be misleading. To improve accuracy, we can repeat the process with different batches of samples drawn with our [randomiser](../Sampling/Code/Randomiser.ipynb) and extend the period sampled. This will, of course, make the visualisation and analysis more time-consuming and difficult.