# Example Notebook 1 - Basic Visualizations

This notebook shows some visualizations that might come in handy when working with the data provided in the competition.

## Table of Contents
* [Read data](#Read_data)
* [Data pre-processing](#Data_pre-processing)
  * [Check dates](#Check_dates)
  * [Check weather data](#Check_weather_data)
  * [Map watch survey questions to question ID](#Map_question_id)
* [Plot weather data](#Plot_weather_data)
  * [Plot weather station locations](#Plot_weather_station_locations)
  * [Plot weather data time-series](#Plot_weather_data_time-series)
  * [Map of all weather stations that measure air temperature](#Map_of_all_weather_stations_temperature)
  * [Map of all weather stations that measure X](#Map_of_all_weather_stations_x)
* [Plot Cozie data](#Plot_Cozie_data)
  * [Plot locations of watch survey responses](#Plot_locations_watch_surveys)
  * [Plot heart rate of all participants](#Plot_heart_rate_of_all_participants)
  * [Plot all health kit data for one participant](#Plot_all_health_kit_data_for_one)

In [5]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import math

<a id="Read_data"></a>
## Read data

In [6]:
df_rain = pd.read_csv('./data/weather_rainfall.csv', parse_dates=True, index_col=[0])
df_temperature = pd.read_csv('./data/weather_air-temperature.csv', parse_dates=True, index_col=[0])
df_humidity = pd.read_csv('./data/weather_relative-humidity.csv', parse_dates=True, index_col=[0])
df_wind_speed = pd.read_csv('./data/weather_wind-speed.csv', parse_dates=True, index_col=[0])
df_wind_direction = pd.read_csv('./data/weather_wind-direction.csv', parse_dates=True, index_col=[0])
df_stations = pd.read_csv('./data/weather_stations.csv').drop(columns=['Unnamed: 0'])

In [7]:
df_stations.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74 entries, 0 to 73
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   id         74 non-null     object 
 1   name       74 non-null     object 
 2   latitude   74 non-null     float64
 3   longitude  74 non-null     float64
dtypes: float64(2), object(2)
memory usage: 2.4+ KB


<a id="Data_pre-processing"></a>
# Data pre-processing
<a id="Check_dates"></a>
## Check dates

Checking the last date in the different dataframes shows that the watch survey responses only run until Jan 21, 2023, while the weather data runs until July 3, 2023.

In [8]:
print(df_rain.index.max())

2023-07-03 23:55:00+08:00


We don't really need the weather data after Jan 21, 2023, so we might just as well get rid of it.

In [9]:
df_rain = df_rain[df_rain.index<'2023-01-22']
df_temperature = df_temperature[df_temperature.index<'2023-01-22']
df_humidity = df_humidity[df_humidity.index<'2023-01-22']
df_wind_speed = df_wind_speed[df_wind_speed.index<'2023-01-22']
df_wind_direction = df_wind_direction[df_wind_direction.index<'2023-01-22']
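The five filter lines above all apply the same cutoff, so they could also be written as a single loop over a dictionary of dataframes. The following is only a sketch on made-up data: the dictionary `frames_demo`, the station values, and the explicit `+08:00` cutoff timestamp are assumptions for illustration, not part of the competition files.

```python
import pandas as pd

# Explicit, timezone-aware cutoff (assumes the data uses a +08:00 offset)
cutoff = pd.Timestamp('2023-01-22 00:00:00+08:00')

# Toy stand-ins for the weather dataframes (made-up values)
idx = pd.to_datetime(['2023-01-21 23:55:00+08:00', '2023-01-22 00:00:00+08:00'])
frames_demo = {'rain': pd.DataFrame({'S24': [0.0, 0.2]}, index=idx),
               'temperature': pd.DataFrame({'S24': [26.1, 26.0]}, index=idx)}

# Keep only the rows strictly before the cutoff in every dataframe
frames_trimmed = {name: df[df.index < cutoff] for name, df in frames_demo.items()}
```

Keeping the dataframes in a dictionary means a later change of the cutoff date only has to be made in one place.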

<a id="Check_weather_data"></a>
## Check weather data

In [10]:
# Show new shortened air temperature dataframe
df_temperature.describe()

| | S24 | S43 | S44 | S50 | S60 | S100 | S102 | S104 | S106 | S107 | S108 | S109 | S111 | S115 | S116 | S117 | S121 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 119012.0 | 134440.0 | 134429.0 | 135560.0 | 132078.0 | 133850.0 | 97921.0 | 134568.0 | 126010.0 | 135360.0 | 105771.0 | 133852.0 | 131970.0 | 128186.0 | 124786.0 | 0.0 | 127438.0 |
| mean | 27.245389 | 27.373185 | 26.519823 | 26.885249 | 27.537616 | 27.176477 | 27.538939 | 26.824677 | 26.588148 | 27.747836 | 27.650423 | 27.079041 | 26.793322 | 27.340461 | 27.388705 | NaN | 26.993606 |
| std | 2.017736 | 1.934505 | 2.024704 | 2.168521 | 1.876663 | 2.339254 | 1.16579 | 2.045194 | 2.260571 | 1.490635 | 2.05402 | 2.206582 | 2.001261 | 1.745694 | 1.628094 | NaN | 2.244568 |
| min | 22.4 | 22.7 | 21.8 | 22.0 | 22.9 | 22.1 | 23.4 | 22.0 | 22.2 | 22.8 | 22.6 | 22.2 | 22.1 | 22.8 | 22.4 | NaN | 22.1 |
| 25% | 25.7 | 25.9 | 24.9 | 25.1 | 26.0 | 25.3 | 26.7 | 25.2 | 24.8 | 26.6 | 26.1 | 25.4 | 25.3 | 26.0 | 26.1 | NaN | 25.2 |
| 50% | 26.9 | 27.0 | 26.1 | 26.4 | 27.3 | 26.6 | 27.5 | 26.4 | 25.9 | 27.6 | 27.2 | 26.6 | 26.4 | 27.1 | 27.2 | NaN | 26.5 |
| 75% | 28.7 | 28.7 | 28.0 | 28.4 | 28.9 | 28.8 | 28.4 | 28.2 | 28.2 | 28.9 | 29.0 | 28.6 | 28.2 | 28.6 | 28.6 | NaN | 28.6 |
| max | 34.6 | 34.1 | 33.0 | 34.0 | 34.5 | 35.4 | 31.2 | 34.2 | 34.5 | 32.4 | 34.3 | 35.6 | 34.4 | 33.1 | 32.8 | NaN | 34.4 |


It looks like weather station S117 has no temperature data in the remaining period (its count is 0), so we might just as well remove that column.

In [11]:
df_temperature = df_temperature.drop(columns=['S117'])
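The same clean-up can be done without naming the station by hand. The sketch below uses a small made-up dataframe (`df_demo` is not part of the competition data) and drops every column that contains no measurements at all.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a weather dataframe (values are made up for illustration)
df_demo = pd.DataFrame({'S24': [27.1, 26.8, np.nan],
                        'S117': [np.nan, np.nan, np.nan]})

# Drop every column in which all values are missing
df_demo_clean = df_demo.dropna(axis='columns', how='all')

print(list(df_demo_clean.columns))
```

Columns that are only partly empty, such as `S24` here, are kept.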

In [12]:
# Get dimensions of the air temperature data
df_temperature.shape

(136133, 16)

The weather dataframes are rather large. This can make the plotting of the data slow and the output images large. To make it a bit lighter, we are going to resample the data hourly.

In [13]:
# Resample weather data
df_temperature_average = df_temperature.resample('60min').mean()
df_rain_sum = df_rain.resample('60min').sum() # note: for rain the data is aggregated with a sum instead of the mean
df_humidity_average = df_humidity.resample('60min').mean()
df_wind_speed_average = df_wind_speed.resample('60min').mean()
df_wind_direction_average = df_wind_direction.resample('60min').mean()

# Check size of resulting dataframe
df_temperature_average.shape

(2496, 16)
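One caveat about aggregating rainfall with a sum: by default, `sum()` returns 0 for an hour that has no valid readings at all, so a gap in the data looks like a dry hour. The sketch below shows this on a made-up series and uses the `min_count` argument of `sum()` to keep such hours as missing values. Whether this matters for the competition data depends on how many gaps the rainfall files contain.

```python
import numpy as np
import pandas as pd

# Toy 30-minute rainfall series; the second hour has no valid readings (made-up values)
idx = pd.date_range('2023-01-01 00:00', periods=4, freq='30min')
rain_demo = pd.Series([0.2, 0.4, np.nan, np.nan], index=idx)

# Default behaviour: the hour without data is reported as 0.0
rain_sum_default = rain_demo.resample('60min').sum()

# With min_count=1 an hour needs at least one valid reading, otherwise it stays NaN
rain_sum_gaps = rain_demo.resample('60min').sum(min_count=1)

print(rain_sum_default)
print(rain_sum_gaps)
```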

<a id="Map_question_id"></a>
## Map watch survey questions to question ID

In [14]:
ws_questions = {"q_noise_nearby": "Noise distractions nearby? (without earphones)",
                "q_noise_kind": "What kind of noise?",
                "q_earphones": "Wearing earphones?",
                "q_thermal_preference": "Thermally, what do you prefer now?",
                "q_location": "Where are you?",
                "q_location_office": "What kind of office?",
                "q_location_transport": "What kind of transport?",
                "q_alone_group": "Alone or in a group?",
                "q_activity_category_alone": "Category of activity? (alone)",
                "q_activity_category_group": "Category of activity? (group)"}

<a id="Plot_weather_data"></a>
# Plot weather data

<a id="Plot_weather_data_time-series"></a>
## Plot weather data time-series

Below we are going to plot the air temperature data from all weather stations.
I used the Matplotlib method `plt.plot()` instead of `sns.lineplot()` because `plt.plot()` shows the gaps in the data, while `sns.lineplot()` draws straight lines across the gaps.

In [15]:
cols_num = 1
rows_num = len(df_temperature_average.columns)
col_current = 0
row_current = 0 
fig, axs = plt.subplots(rows_num, cols_num, figsize=(20,3*rows_num))
plt.subplots_adjust(hspace = 0.5)

for i, col in enumerate(df_temperature_average.columns):
    row_current = i
    axs[row_current].plot(df_temperature_average.index, df_temperature_average[col])
    station_name = df_stations[df_stations['id']==col]['name'].values[0]
    axs[row_current].set_title(f'Weather Station {col}: {station_name}')
    axs[row_current].set_ylabel('Air Temperature [°C]')

Since we are going to plot more weather data, we might just as well wrap the above plotting code into a function.

In [None]:
# Function that plots weather data from all weather stations
def plot_weather(df, ylabel):
    cols_num = 1
    rows_num = len(df.columns)
    col_current = 0
    row_current = 0 
    fig, axs = plt.subplots(rows_num, cols_num, figsize=(20,3*rows_num))
    plt.subplots_adjust(hspace = 0.5)

    for i, col in enumerate(df.columns):
        row_current = i
        axs[row_current].plot(df.index, df[col])
        station_name = df_stations[df_stations['id']==col]['name'].values[0]
        axs[row_current].set_title(f'Weather Station {col}: {station_name}')
        axs[row_current].set_ylabel(ylabel)
    
    return fig, axs
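One thing to keep in mind with this pattern: when only one subplot is requested, `plt.subplots` returns a single Axes object instead of an array, so indexing with `axs[row_current]` fails for a dataframe with a single column. The following variant is a sketch of one way around that, using `squeeze=False`. The name `plot_columns` and the toy dataframe are made up, and the station-name lookup is left out to keep the example self-contained.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_columns(df, ylabel):
    """Plot each column of df in its own row; works for one or many columns."""
    rows_num = len(df.columns)
    # squeeze=False always returns a 2D array of Axes, even for a single subplot
    fig, axs = plt.subplots(rows_num, 1, figsize=(20, 3 * rows_num), squeeze=False)
    fig.subplots_adjust(hspace=0.5)
    for ax, col in zip(axs[:, 0], df.columns):
        ax.plot(df.index, df[col])
        ax.set_title(f'Weather Station {col}')
        ax.set_ylabel(ylabel)
    return fig, axs

# Toy dataframe with a single station (made-up values)
idx = pd.date_range('2023-01-01', periods=3, freq='60min')
df_single = pd.DataFrame({'S24': [27.0, 26.5, 26.0]}, index=idx)
fig_demo, axs_demo = plot_columns(df_single, 'Air temperature [°C]')
```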

### Temperature

In [None]:
fig, axs = plot_weather(df_temperature_average, 'Temperature [°C]')

### Rainfall

In [None]:
fig, axs = plot_weather(df_rain_sum, 'Rainfall [mm/h]')

### Relative humidity

In [None]:
fig, axs = plot_weather(df_humidity_average, 'Relative humidity [%]')

### Wind speed

In [None]:
fig, axs = plot_weather(df_wind_speed_average, 'Wind speed [knots]')

### Wind direction

In [None]:
fig, axs = plot_weather(df_wind_direction_average, 'Wind direction [°]')
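A word of caution on the hourly wind direction: directions are angles, so an arithmetic mean can be misleading near north. Readings of 350° and 10° average to 180°, although the wind came from roughly 0° in both cases. The sketch below shows a circular mean as one possible alternative. It assumes the directions are given in degrees; the helper `circular_mean_deg` and the toy series are made up for illustration.

```python
import numpy as np
import pandas as pd

def circular_mean_deg(angles_deg):
    """Mean direction in degrees, ignoring missing values."""
    angles = np.deg2rad(pd.Series(angles_deg, dtype='float64').dropna())
    if angles.empty:
        return np.nan
    # Average the unit vectors, then convert the mean vector back to an angle
    mean_angle = np.arctan2(np.sin(angles).mean(), np.cos(angles).mean())
    # Round away floating-point noise before wrapping the angle to 0..360
    return float(np.round(np.rad2deg(mean_angle), 6) % 360)

# Toy 30-minute wind direction series (made-up values)
idx = pd.date_range('2023-01-01 00:00', periods=4, freq='30min')
wind_demo = pd.Series([350.0, 10.0, 90.0, 100.0], index=idx)

wind_arithmetic = wind_demo.resample('60min').mean()
wind_circular = wind_demo.resample('60min').apply(circular_mean_deg)

print(wind_arithmetic)
print(wind_circular)
```

This version weights every reading equally. Weighting by wind speed would be another option, but it needs the speed data aligned with the direction data.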

The temperature plots above show some significant gaps in January 2023.

<a id="Map_of_all_weather_stations_temperature"></a>
## Map of all weather stations that measure air temperature

In [None]:
# Have a look at the weather station meta data
df_stations.head()

In [None]:
df_stations.shape

In [None]:
# Have a look at the air temperature data
df_temperature.columns

Note that the list of all weather stations is much longer than the list of weather stations that measure air temperature. Hence, we need to pre-process the data a bit and pick only the stations that appear in the temperature data.

In [None]:
# Initialize a new dataframe for creating the map
df_map = pd.DataFrame(columns=['latitude', 'longitude', 'text'])


# Go through each column of the temperature dataframe
for col in df_temperature.columns:
    # Get the weather station name based on the station id
    station_name = df_stations.loc[df_stations.id==col].name.values[0]
    
    # Create a new row, with latitude, longitude, hover text, and marker color as columns
    df_row = pd.DataFrame([{'latitude': df_stations.loc[df_stations.id==col].latitude.values[0], 
                            'longitude': df_stations.loc[df_stations.id==col].longitude.values[0], 
                            'text': f'Station ID: {col} <br>Station name: {station_name} <br>Measurement: Air temperature',
                            'color': 'rgb(0, 128, 255)'}])
    
    # Append new row to map dataframe
    df_map = pd.concat([df_map, df_row])
    
df_map.head()
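The row-by-row loop above is easy to follow, but the same table can also be built in one step by filtering the station table. This is only a sketch on made-up data: `stations_demo` and `temperature_demo` stand in for `df_stations` and `df_temperature`, and the station names and coordinates are invented.

```python
import pandas as pd

# Toy stand-ins for df_stations and df_temperature (made-up values)
stations_demo = pd.DataFrame({'id': ['S24', 'S43', 'S900'],
                              'name': ['Station A', 'Station B', 'Station C'],
                              'latitude': [1.37, 1.34, 1.30],
                              'longitude': [103.98, 103.89, 103.80]})
temperature_demo = pd.DataFrame({'S24': [27.1], 'S43': [26.9]})

# Keep only the stations that appear as columns in the measurement dataframe
mask = stations_demo['id'].isin(temperature_demo.columns)
map_demo = stations_demo.loc[mask, ['latitude', 'longitude']].copy()

# Build the hover text for all rows at once
map_demo['text'] = ('Station ID: ' + stations_demo.loc[mask, 'id']
                    + ' <br>Station name: ' + stations_demo.loc[mask, 'name'])
map_demo['color'] = 'rgb(0, 128, 255)'
map_demo = map_demo.reset_index(drop=True)

print(map_demo)
```

Note that the rows follow the order of the station table here, not the column order of the measurement dataframe, and that the result has a clean 0..n-1 index.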

The pre-processing for the maps is done. Now, we can go and actually create the map chart.

In [None]:
# Create trace and add markers to map
trace = go.Scattermapbox(lat=df_map.latitude, 
                         lon=df_map.longitude,
                         mode='markers',
                         marker=dict(size=10,
                                     color=df_map['color'],
                                     opacity=0.8),
                         hovertemplate = "",
                         text=df_map['text'])

# Define the layout for the map
layout = go.Layout(
    margin={"l": 0, "r": 0, "t": 0, "b": 0},
    mapbox=dict(
        style='carto-positron', # Alternative: 'open-street-map',
        center=dict(lat=df_map.latitude.median(),
                    lon=df_map.longitude.median()),
        zoom=10,
        ),
    height=700, width=900)

# Create figure
fig = go.Figure(data=[trace], layout=layout)
fig.show()

Note that hovering the cursor over a marker reveals some metadata about the weather station.

<a id="Map_of_all_weather_stations_x"></a>
## Map of all weather stations that measure X
Again, we can wrap up the above code into a function for convenience.

In [None]:
def plot_map_stations(df_measurements, df_stations, measurement=None):
    # Optional label for the hover text, e.g. 'Wind speed'; left out of the text if None
    measurement_text = f' <br>Measurement: {measurement}' if measurement else ''

    # Initialize a new dataframe for creating the map
    df_map = pd.DataFrame(columns=['latitude', 'longitude', 'text'])


    # Go through each column of the measurement dataframe
    for col in df_measurements.columns:
        # Get the weather station name based on the station id
        station_name = df_stations.loc[df_stations.id==col].name.values[0]

        # Create a new row, with latitude, longitude, hover text, and marker color as columns
        df_row = pd.DataFrame([{'latitude': df_stations.loc[df_stations.id==col].latitude.values[0], 
                                'longitude': df_stations.loc[df_stations.id==col].longitude.values[0], 
                                'text': f'Station ID: {col} <br>Station name: {station_name}{measurement_text}',
                                'color': 'rgb(0, 128, 255)'}])

        # Append new row to map dataframe
        df_map = pd.concat([df_map, df_row])

    # Create trace and add markers to map
    trace = go.Scattermapbox(lat=df_map.latitude, 
                             lon=df_map.longitude,
                             mode='markers',
                             marker=dict(size=10,
                                         color=df_map['color'],
                                         opacity=0.8),
                             hovertemplate = "",
                             text=df_map['text'])

    # Define the layout for the map
    layout = go.Layout(
        margin={"l": 0, "r": 0, "t": 0, "b": 0},
        mapbox=dict(
            style='carto-positron',
            center=dict(lat=df_map.latitude.median(),
                        lon=df_map.longitude.median()),
            zoom=10,
            ),
        height=700, width=900)

    # Create figure
    fig = go.Figure(data=[trace], layout=layout)
    
    # Return figure
    return fig

In [None]:
fig = plot_map_stations(df_wind_speed, df_stations)
fig.show()

Feel free to plot the maps for the other weather data. You could also try to modify the `plot_map_stations` function to show the station locations for all weather data. (Tip: Some of the weather stations measure more than one type of weather data. Hence, they will have the same coordinates. Adding a small offset to the weather station coordinates can help with the visualization ;-)).
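As a starting point for the offset tip, here is a minimal sketch. The helper `offset_coordinates`, the step size of 0.002 degrees, and the toy coordinates are all assumptions chosen for illustration; a suitable step depends on the zoom level of the map.

```python
import pandas as pd

def offset_coordinates(df_map, measurement_index, step=0.002):
    """Return a copy of df_map shifted east by measurement_index * step degrees."""
    df_shifted = df_map.copy()
    df_shifted['longitude'] = df_shifted['longitude'] + measurement_index * step
    return df_shifted

# Toy marker table (made-up coordinates)
markers_demo = pd.DataFrame({'latitude': [1.37], 'longitude': [103.98]})

# One marker set per measurement type, each shifted by a different amount
markers_temperature = offset_coordinates(markers_demo, 0)
markers_rain = offset_coordinates(markers_demo, 1)

print(markers_temperature)
print(markers_rain)
```

Each shifted table could then be passed to its own map trace with its own marker color, so that stations measuring several quantities show several markers side by side.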

<a id="Plot_Cozie_data"></a>
# Plot Cozie data

Until now, we've only seen weather data. It is time to look into the Cozie data.

<a id="Plot_locations_watch_surveys"></a>
## Plot locations of watch survey responses
We can reuse most of the map code above to show the locations of the watch survey responses.

In [None]:
df_map.columns

In [None]:
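# Note: df_train (the Cozie data, indexed by timestamp) is not loaded in the "Read data" section above;
# it has to be read into a dataframe before this cell is run
df_map = df_train.copy()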

# Filter for micro-survey response data only
df_map = df_map[df_map.ws_survey_count.notna()]

# Create text for hover
df_map["text"] = df_map['id_participant'] + ", "
# Use the timestamp of each row (not only the first one) in the hover text
df_map["text"] = df_map["text"] + df_map.index.strftime('%a, %d.%m.%Y, %H:%M').to_numpy() + "<br>"
for key in ws_questions:
    question = ws_questions[key]
    df_map[key] = df_map[key].fillna('-')
    df_map["text"] = df_map["text"] + question + "  " + df_map[key] +"<br>"

# Add markers to map
trace = go.Scattermapbox(lat=df_map.ws_latitude, 
                            lon=df_map.ws_longitude,
                            mode='markers',
                            marker=dict(
                                size=10,
                                color='rgb(0, 128, 255)',
                                opacity=0.8),
                            hovertemplate = "",
                            text=df_map['text'])

# Define the layout for the map
layout = go.Layout(
    #title='Scatter Map using Plotly',
    margin={"l": 0, "r": 0, "t": 0, "b": 0},
    mapbox=dict(
        style='carto-positron', #'open-street-map',
        center=dict(lat=df_map.ws_latitude.median(),
                    lon=df_map.ws_longitude.median()),
        zoom=10,
        ),
    height=700, width=900)

fig = go.Figure(data=[trace], layout=layout)
fig.update_layout(title_text='')

<a id="Plot_heart_rate_of_all_participants"></a>
## Plot heart rate of all participants

Again, Matplotlib was chosen over Seaborn because it takes much less time to create the charts.
Further, this time only the markers are plotted, without connecting lines. This way only real data is shown and gaps in the data are preserved. Due to the irregular logging interval of the health data, it is much more difficult to determine when to draw a line between two markers and when to omit it.

In [None]:
list_participants = df_train.id_participant.unique()
list_participants.sort()
cols_num = 1
rows_num = len(list_participants)
col_current = 0
row_current = 0 
fig, axs = plt.subplots(rows_num, cols_num, figsize=(20,3*rows_num))
plt.subplots_adjust(hspace = 0.5)

for i, id_participant in enumerate(list_participants):
    row_current = i
    df_participant = df_train[df_train.id_participant == id_participant]
    df_participant = df_participant[df_participant['ts_heart_rate'].notna()]
    axs[row_current].scatter(df_participant.index, df_participant['ts_heart_rate'], marker='.')
    axs[row_current].set_title(f'Participant ID: {id_participant}')
    axs[row_current].set_ylabel('Heart Rate [bpm]')
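If connecting lines are wanted after all, one possible rule is to connect two readings only when they are closer together than a chosen threshold. The sketch below places a missing value in the middle of every longer gap, so that a line plot such as `plt.plot()` breaks there. The helper `insert_gap_markers`, the 10-minute threshold, and the toy heart rate values are assumptions for illustration; a sensible threshold depends on the logging behaviour of the watch.

```python
import numpy as np
import pandas as pd

def insert_gap_markers(series, max_gap):
    """Return a copy of series with a NaN placed in the middle of every gap longer than max_gap."""
    series = series.sort_index()
    gaps = series.index.to_series().diff()
    # True for readings that follow an overly long gap
    mask = (gaps > max_gap).to_numpy()
    # Timestamp halfway between the two readings on either side of each long gap
    midpoints = series.index[mask] - gaps[mask].to_numpy() / 2
    breaks = pd.Series(np.nan, index=midpoints)
    return pd.concat([series, breaks]).sort_index()

# Toy heart rate readings with one long gap (made-up values)
idx = pd.to_datetime(['2023-01-01 10:00', '2023-01-01 10:05', '2023-01-01 12:00'])
heart_rate_demo = pd.Series([72.0, 75.0, 80.0], index=idx)
heart_rate_lines = insert_gap_markers(heart_rate_demo, pd.Timedelta('10min'))

print(heart_rate_lines)
```

The returned series is meant for plotting only, since the added rows are not real measurements.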

<a id="Plot_all_health_kit_data_for_one"></a>
## Plot all health kit data for one participant

In [None]:
list_health_cols = ['ts_audio_exposure_environment', 
                    'ts_heart_rate',
                    'ts_oxygen_saturation', 
                    'ts_resting_heart_rate', 
                    'ts_stand_time',
                    'ts_step_count', 
                    'ts_walking_distance']

id_participant_current = 'xesh001'
df_participant = df_train[df_train.id_participant == id_participant_current]
df_participant = df_participant.reset_index()
cols_num = 1
rows_num = len(list_health_cols)
col_current = 0
row_current = 0 
fig, axs = plt.subplots(rows_num, cols_num, figsize=(20,3*rows_num))
plt.subplots_adjust(hspace = 0.5)

for i, col in enumerate(list_health_cols):
    row_current = i
    axs[row_current].plot(df_participant['time'], df_participant[col], marker='.', linestyle = 'None')
    axs[row_current].set_title(f'Participant ID: {id_participant_current}')
    axs[row_current].set_ylabel(col)