# Plausibility check of the measurements from the weather stations of the Wasserschutzpolizei Zürich (Zurich water police)
Date: 12.01.2022

**Background:**

Mr. Namnick delivers the annual data of the weather stations by e-mail once a year. Until now, we read the newly delivered annual data into SAS, standardized and correctly ordered the variable names, and finally appended the new year to the previous years in chronological order.

Odi additionally wrote a [**server-side Python script**](https://github.com/opendatazurich/ogd-data-processing/blob/main/sid_wapo_wetterstationen/convert_csv.py) that computes the `cet` date from the `utc` date field supplied with the data. With this notebook, however, we can do this directly during the update.
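The `utc` to `cet` conversion can be sketched with pandas' time zone support. This is a minimal sketch, not the linked server-side script: the helper name `add_cet_timestamp` and the zone name `Europe/Zurich` are assumptions, chosen because the `timestamp_cet` values in the data carry `+01:00` and `+02:00` offsets.

```python
import pandas as pd


def add_cet_timestamp(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with `timestamp_cet` derived from `timestamp_utc` (hypothetical helper)."""
    out = df.copy()
    # Parse the UTC column as time zone aware timestamps
    utc = pd.to_datetime(out["timestamp_utc"], utc=True)
    # Assumption: local time means the Europe/Zurich zone (CET in winter, CEST in summer)
    local = utc.dt.tz_convert("Europe/Zurich")
    # Store the local time as ISO 8601 strings including the UTC offset
    out["timestamp_cet"] = local.map(lambda ts: ts.isoformat())
    return out


sample = pd.DataFrame({"timestamp_utc": ["2007-04-22T19:20:00Z", "2022-11-15T07:40:00Z"]})
result = add_cet_timestamp(sample)
print(result["timestamp_cet"].tolist())
```

The two sample timestamps are taken from the first and one of the last rows of the Mythenquai series shown further down, so the result can be compared with the published `timestamp_cet` values.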

I did the data preparation in 2022 with JupyterLab, see [**GitHub**](https://github.com/DonGoginho/myPy/blob/main/update_ogd/update_sid_wapo_wetterstationen.ipynb)



**Dataset in the PROD data catalogue**: https://data.stadt-zuerich.ch/dataset/sid_wapo_wetterstationen

## Settings
### Import the required packages

In [1]:
#%pip install openpyxl geopandas altair fiona requests folium mplleaflet contextily seaborn datetime plotly

In [2]:
import pandas as pd
import datetime
import time
import numpy as np

#import lux

import requests
import io

import pivottablejs
from pivottablejs import pivot_ui
import altair as alt
import matplotlib.pyplot as plt
#from datetime import datetime
import geopandas as gpd
import folium 

Define the settings. First, whether SSL certificates are verified for HTTPS requests.

In [3]:
SSL_VERIFY = False
# SSL verification is switched off here; this can be necessary when the HTTPS connection fails (e.g. because of a proxy).
# To switch SSL verification back on, uncomment the next line (remove the "#")
# SSL_VERIFY = True

In [4]:
if not SSL_VERIFY:
    import urllib3
    urllib3.disable_warnings()

Define the display settings. Here the number format of float values (e.g. *'{:,.2f}'.format* with a comma as the thousands separator).

In [5]:
#pd.options.display.float_format = lambda x : '{:,.1f}'.format(x) if (np.isnan(x) | np.isinf(x)) else '{:,.0f}'.format(x) if int(x) == x else '{:,.1f}'.format(x)
pd.options.display.float_format = '{:.1f}'.format
pd.set_option('display.width', 100)
pd.set_option('display.max_columns', 15)

### Time variables
Determine the most recently loaded month. Here it is the state as of two months ago.
Also determine further time variables that may be useful.

On the difference between `import datetime` and `from datetime import datetime`, see https://stackoverflow.com/questions/15707532/import-datetime-v-s-from-datetime-import-datetime

First the time variables as strings

In [6]:
#today_date = datetime.date.today()
#date_time = datetime.datetime.strptime(date_time_string, '%Y-%m-%d %H:%M')
now = datetime.date.today()
date_today = now.strftime("%Y-%m-%d")
year_today = now.strftime("%Y")
month_today = now.strftime("%m")
day_today = now.strftime("%d")

date_day_a_week_ago = now - datetime.timedelta(days=7)
day_a_week_ago = date_day_a_week_ago.strftime('%Y-%m-%d')

# Previous calendar year as a string; subtracting 365 days would stay in the same year on 31 December of a leap year
lastYear = str(now.year - 1)

print(now, " a week ago: ", day_a_week_ago)


2022-11-15  a week ago:  2022-11-08


And here the time variables as integers:
- `aktuellesJahr`: the current year
- `aktuellerMonat`: the month that is current right now
- `selectedMonat`: the most recent month in the data, usually two months back

In [7]:
int_times = now.timetuple()

aktuellesJahr = int_times[0]
aktuellerMonat = int_times[1]
# Two months back, wrapping around the year boundary (January -> 11, February -> 12)
selectedMonat = (aktuellerMonat - 3) % 12 + 1

print(lastYear, aktuellesJahr,
      aktuellerMonat,
      'data as of: ',
      selectedMonat,
      int_times)


2021 2022 11 data as of:  9 time.struct_time(tm_year=2022, tm_mon=11, tm_mday=15, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=1, tm_yday=319, tm_isdst=-1)
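Going back a number of months has to wrap around the year boundary, otherwise January and February yield month numbers of -1 and 0. A minimal sketch of this arithmetic, assuming a hypothetical helper `month_minus` that is not part of the notebook:

```python
def month_minus(year: int, month: int, n: int) -> tuple:
    """Return (year, month) lying n months before the given year and month."""
    # Count months from year 0 so that integer division handles the wrap-around
    idx = year * 12 + (month - 1) - n
    return idx // 12, idx % 12 + 1


print(month_minus(2022, 11, 2))  # (2022, 9)
print(month_minus(2023, 1, 2))   # (2022, 11)
```

Returning the year together with the month keeps the two consistent when the selected month falls into the previous year.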


## Import the already published and the current time series of the measuring stations

- Note the notation of the path...
- For now, do not define any further parameters for the import

**Dataset in the INTEG data catalogue**: https://data.integ.stadt-zuerich.ch/dataset/sid_wapo_wetterstationen

**Dataset in the PROD data catalogue**: https://data.stadt-zuerich.ch/dataset/sid_wapo_wetterstationen

### Set some path variables

- The package name is actually the **directory name** under which the data and metadata are stored on the dropzone.
- For SASA processes it is defined on the **products SharePoint ([Link](https://kollaboration.intranet.stzh.ch/orga/ssz-produkte/Lists/SASA_Outputs/PersonalViews.aspx?PageView=Personal&ShowWebPart={6087A3E7-8AC8-40BA-8278-DECFACE124FF}))**.
- On CKAN the package name becomes part of the URL, so the exact spelling matters.

Note: all letters in the package name must be **lower case**, because CKAN converts upper-case letters to lower case.
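The lower-case rule can be checked before the name is used in a path or URL. A minimal sketch: the helper `is_valid_package_name` is hypothetical, and the allowed character set (lower-case letters, digits, `_` and `-`) is an assumption about CKAN's naming rules, not something stated in this notebook.

```python
import re


def is_valid_package_name(name: str) -> bool:
    """Check a package name against the assumed rule: lower-case letters, digits, '_' and '-'."""
    return re.fullmatch(r"[a-z0-9_-]+", name) is not None


print(is_valid_package_name("sid_wapo_wetterstationen"))
print(is_valid_package_name("SID_WAPO_Wetterstationen"))
```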

**Static paths in the DWH dropzones**

In [8]:
dropzone_path_integ = r"\\szh\ssz\applikationen\OGD_Dropzone\INT_DWH"

In [9]:
dropzone_path_prod = r"\\szh\ssz\applikationen\OGD_Dropzone\DWH"

**Static paths of the CKAN URLs**

In [10]:
ckan_integ_url ="https://data.integ.stadt-zuerich.ch/dataset/int_dwh_"

In [11]:
ckan_prod_url ="https://data.stadt-zuerich.ch/dataset/"

**PLEASE ADJUST HERE**

In [12]:
package_name = "sid_wapo_wetterstationen"

In [13]:
messstationen = ["mythenquai_", "tiefenbrunnen_"]


In [14]:
endings = ["2007-"+lastYear,year_today]

In [15]:
dataset_name = "messwerte_"+messstationen[0]+endings[0]+".csv"
print(dataset_name)

messwerte_mythenquai_2007-2021.csv


### Import the datasets

First define the following values:
1) Does the dataset come from PROD or INTEG?
2) Do you read the dataset directly from the DROPZONE or from the INTERNET?


In [16]:
# The datasets are on the INT-DWH dropzone only for testing. Once the test is over, they are on PROD.
# The status makes it easy to switch between the two

status = "prod"  # "prod" vs. anything else
data_source = "web"  # "dropzone" vs. anything else
print(status + " - " + data_source)

prod - web


In [17]:
# File path
if status == "prod":
    if data_source == "dropzone":
        fp = dropzone_path_prod + "\\" + package_name + "\\" + "messwerte_"
    else:
        fp = ckan_prod_url + package_name + '/download/' + "messwerte_"
else:
    if data_source == "dropzone":
        fp = dropzone_path_integ + "\\" + package_name + "\\" + "messwerte_"
    else:
        fp = ckan_integ_url + package_name + '/download/' + "messwerte_"

print("fp is: " + fp)


fp is: https://data.stadt-zuerich.ch/dataset/sid_wapo_wetterstationen/download/messwerte_
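The four-way branching on `status` and `data_source` can also be expressed as a lookup. The following is a sketch under the assumption that only the two environments and two sources used above exist; the function name `build_base_path` and the two dictionaries are hypothetical.

```python
# Base locations copied from the path variables defined in this notebook
DROPZONE_PATHS = {
    "prod": r"\\szh\ssz\applikationen\OGD_Dropzone\DWH",
    "integ": r"\\szh\ssz\applikationen\OGD_Dropzone\INT_DWH",
}
CKAN_URLS = {
    "prod": "https://data.stadt-zuerich.ch/dataset/",
    "integ": "https://data.integ.stadt-zuerich.ch/dataset/int_dwh_",
}


def build_base_path(status: str, data_source: str, package_name: str) -> str:
    """Return the common prefix of all resource paths (hypothetical helper)."""
    # As in the notebook: everything that is not "prod" is treated as INTEG
    env = "prod" if status == "prod" else "integ"
    if data_source == "dropzone":
        return DROPZONE_PATHS[env] + "\\" + package_name + "\\" + "messwerte_"
    # Everything that is not "dropzone" is read from the web
    return CKAN_URLS[env] + package_name + "/download/" + "messwerte_"


print(build_base_path("prod", "web", "sid_wapo_wetterstationen"))
```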


#### Import several datasets automatically

This dataset contains four resources, two per measuring station: one is the combined time series of the complete past years, the other is the current year.
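The four resource names follow from combining the two stations with the two endings. A small sketch using `itertools.product`; the fixed endings `2007-2021` and `2022` are the values from this run, and the variable name `resource_names` is an assumption.

```python
from itertools import product

stations = ["mythenquai_", "tiefenbrunnen_"]
file_endings = ["2007-2021", "2022"]

# One file per (station, ending) pair, in the same order as the nested loops in the notebook
resource_names = [f"messwerte_{station}{ending}.csv" for station, ending in product(stations, file_endings)]

for name in resource_names:
    print(name)
```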

In [18]:
#print(fp+messstation+ending+".csv")

In [19]:
# Read the data
dfs = []

if data_source == "dropzone":
    # This branch is currently obsolete and could be removed.
    print("There is currently no data on the dropzone any more. It is pushed via the CKAN API.")

# The loop is identical for both data sources, only the base path fp differs
for messstation in messstationen:
    for ending in endings:
        print(fp + messstation + ending + ".csv")
        df = pd.read_csv(
            fp + messstation + ending + ".csv",
            sep=',',
            parse_dates=['timestamp_utc'],
            low_memory=False,
        )
        # Give the imported data a matching DataFrame name
        new_df_name = 'df_' + messstation + ending.replace('-', '_')
        print(new_df_name)
        exec(f'{new_df_name} = df.copy()')
        # Collect the names of all DataFrames in a list; they are used in the next step
        dfs.append(new_df_name)

print("End of loop for " + data_source + ": ")

#https://stackoverflow.com/questions/40973687/create-new-dataframe-in-pandas-with-dynamic-names-also-add-new-column

https://data.stadt-zuerich.ch/dataset/sid_wapo_wetterstationen/download/messwerte_mythenquai_2007-2021.csv
df_mythenquai_2007_2021
https://data.stadt-zuerich.ch/dataset/sid_wapo_wetterstationen/download/messwerte_mythenquai_2022.csv
df_mythenquai_2022
https://data.stadt-zuerich.ch/dataset/sid_wapo_wetterstationen/download/messwerte_tiefenbrunnen_2007-2021.csv
df_tiefenbrunnen_2007_2021
https://data.stadt-zuerich.ch/dataset/sid_wapo_wetterstationen/download/messwerte_tiefenbrunnen_2022.csv
df_tiefenbrunnen_2022
End of loop for web: 
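Dynamically named variables created with `exec` are hard to inspect and to loop over. A common alternative is to keep the DataFrames in a dictionary. The sketch below uses small in-memory CSV strings as hypothetical stand-ins for the four downloaded resources, so the values are invented for illustration only.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the downloaded files, keyed by (station, ending)
csv_sources = {
    ("mythenquai", "2007-2021"): "timestamp_utc,air_temperature\n2021-12-31T23:50:00+00:00,2.1\n",
    ("mythenquai", "2022"): "timestamp_utc,air_temperature\n2022-01-01T00:00:00+00:00,2.0\n",
    ("tiefenbrunnen", "2007-2021"): "timestamp_utc,air_temperature\n2021-12-31T23:50:00+00:00,2.4\n",
    ("tiefenbrunnen", "2022"): "timestamp_utc,air_temperature\n2022-01-01T00:00:00+00:00,2.3\n",
}

# One dictionary entry per resource instead of one variable per resource
frames = {
    key: pd.read_csv(io.StringIO(text), parse_dates=["timestamp_utc"])
    for key, text in csv_sources.items()
}

# Combine all parts of a station into one time series
series_per_station = {}
for station in ("mythenquai", "tiefenbrunnen"):
    parts = [df for (name, _ending), df in frames.items() if name == station]
    series_per_station[station] = pd.concat(parts, ignore_index=True)

print(series_per_station["mythenquai"])
```

With this layout the station name is the key, so no positional counter is needed to decide which file belongs to which station.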


In [20]:
  
# Read the data
dfs = []

if data_source == "web":
    for messstation in messstationen:
        for ending in endings:
            url = fp + messstation + ending + ".csv"
            # Use the SSL setting defined above instead of a hard-coded value
            r = requests.get(url, verify=SSL_VERIFY)
            r.encoding = 'utf-8'
            df = pd.read_csv(
                io.StringIO(r.text),
                parse_dates=['timestamp_utc'],
                sep=',',
                low_memory=False,
            )
            new_df_name = 'df_' + messstation + ending.replace('-', '_')
            print(new_df_name)
            exec(f'{new_df_name} = df.copy()')
            dfs.append(new_df_name)
    print("End of loop for web: ")
else:
    print("There is currently no data on the dropzone any more. It is pushed via the CKAN API.")

#https://stackoverflow.com/questions/40973687/create-new-dataframe-in-pandas-with-dynamic-names-also-add-new-column

df_mythenquai_2007_2021
df_mythenquai_2022
df_tiefenbrunnen_2007_2021
df_tiefenbrunnen_2022
End of loop for web: 


In [21]:
# I want one overall DataFrame per station, i.e. the time series plus the current year
# DataFrame.append was removed in pandas 2.0, so the parts are collected and combined with pd.concat

frames_my = []
frames_tb = []

for i, name in enumerate(dfs):
    if i <= 1:
        print('first dataset: ', name)
        frames_my.append(eval(name))
    else:
        print('second dataset: ', name)
        frames_tb.append(eval(name))

df_zeitreihe_my = pd.concat(frames_my)
df_zeitreihe_tb = pd.concat(frames_tb)
print('done')

first dataset:  df_mythenquai_2007_2021
first dataset:  df_mythenquai_2022
second dataset:  df_tiefenbrunnen_2007_2021
second dataset:  df_tiefenbrunnen_2022
done


In [22]:
df_zeitreihe_my.shape

(804798, 15)

In [23]:
df_zeitreihe_tb.shape

(852105, 15)

Describe the individual attributes

In [30]:
df_zeitreihe_my.describe()
#data2bextended_tb.describe()

,air_temperature,water_temperature,wind_gust_max_10min,wind_speed_avg_10min,wind_force_avg_10min,wind_direction,windchill,barometric_pressure_qfe,precipitation,dew_point,global_radiation,humidity,water_level
count,804798.0,704401.0,804798.0,804798.0,804798.0,804798.0,804798.0,800057.0,704401.0,804798.0,704401.0,804798.0,704401.0
mean,11.7,13.4,3.5,1.9,1.8,184.1,10.5,975.4,0.0,6.9,135.3,74.9,405.9
std,8.0,7.1,2.6,1.4,1.3,106.2,8.6,17.4,0.2,6.5,293.8,16.5,0.2
min,-13.4,0.1,-0.1,0.0,0.0,0.0,-25.6,930.7,0.0,-17.2,0.0,16.0,405.2
25%,5.3,6.3,1.8,0.9,1.0,101.0,3.8,966.3,0.0,1.9,0.0,64.0,405.9
50%,11.5,13.2,3.0,1.7,1.7,175.0,10.4,971.1,0.0,6.9,6.0,79.0,405.9
75%,17.6,19.8,4.8,2.6,2.4,286.0,17.0,977.1,0.0,12.2,156.0,87.0,406.0
max,37.7,28.2,32.0,17.1,16.8,360.0,37.8,1037.5,17.0,24.6,4293.0,100.0,406.5


How many missing values (NaN) are there in the dataset?

In [31]:
df_zeitreihe_my.isnull().sum()

timestamp_utc                   0
timestamp_cet                   0
air_temperature                 0
water_temperature          100397
wind_gust_max_10min             0
wind_speed_avg_10min            0
wind_force_avg_10min            0
wind_direction                  0
windchill                       0
barometric_pressure_qfe      4741
precipitation              100397
dew_point                       0
global_radiation           100397
humidity                        0
water_level                100397
dtype: int64

In [32]:
df_zeitreihe_tb.isnull().sum()

timestamp_utc                   0
timestamp_cet                   0
air_temperature                 0
water_temperature              96
wind_gust_max_10min             0
wind_speed_avg_10min            0
wind_force_avg_10min            0
wind_direction                  0
windchill                       0
barometric_pressure_qfe     52549
precipitation              852105
dew_point                       0
global_radiation           852105
humidity                        0
water_level                852105
dtype: int64
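The absolute counts above are easier to judge as shares and broken down by year, for example to see whether a sensor was added later or failed for a period. A minimal sketch on a small invented sample; the helper `missing_summary` is hypothetical.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column (hypothetical helper)."""
    return pd.DataFrame({
        "missing": df.isnull().sum(),
        "share_pct": (df.isnull().mean() * 100).round(1),
    })


# Invented sample with a 10-minute interval that crosses a year boundary
idx = pd.date_range("2021-12-31 23:40", periods=4, freq="10min", tz="UTC")
sample = pd.DataFrame(
    {"air_temperature": [1.0, 1.1, 1.2, 1.3], "water_level": [np.nan, np.nan, 405.9, 406.0]},
    index=idx,
)

summary = missing_summary(sample)
# Missing values per calendar year; requires a DatetimeIndex
per_year = sample.isnull().groupby(sample.index.year).sum()

print(summary)
print(per_year)
```

Applied to the real series, `per_year` requires the timestamp to be the index, which is set in the section on graphical analyses.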

### Check the metadata on the CKAN INTEG or PROD website

Apparently variables cannot currently be evaluated inside Markdown cells, so we proceed as shown below. See: https://data-dive.com/jupyterlab-markdown-cells-include-variables
Instead of setting the cell to Markdown, create the Markdown from within a code cell. Python string formatting then makes the text dynamic.

In [33]:
from IPython.display import Markdown as md

In [34]:
md(" **1. Dataset in the INTEG data catalogue:** Link {} ".format(ckan_integ_url+package_name))

 **1. Dataset in the INTEG data catalogue:** Link https://data.integ.stadt-zuerich.ch/dataset/int_dwh_sid_wapo_wetterstationen 

In [36]:
md(" **2. Dataset in the PROD data catalogue:** Link {} ".format(ckan_prod_url+package_name))

 **2. Dataset in the PROD data catalogue:** Link https://data.stadt-zuerich.ch/dataset/sid_wapo_wetterstationen 

## Graphical analyses
### Use the date as index

While we already parsed the `timestamp_utc` column into a datetime type, it is currently just a regular column.
**To enable quick and convenient queries and aggregations, we need to turn it into the index of the DataFrame.**
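What the datetime index makes possible can be shown on a small invented sample: selecting a whole day with a partial date string and aggregating with `resample`. The values are made up for illustration.

```python
import pandas as pd

# Invented 10-minute values around midnight, indexed by a UTC timestamp
idx = pd.date_range("2022-11-14 23:40", periods=6, freq="10min", tz="UTC")
sample = pd.DataFrame(
    {"timestamp_utc": idx, "air_temperature": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}
).set_index("timestamp_utc")

# Partial string indexing: all rows of one calendar day
one_day = sample.loc["2022-11-15"]

# Aggregation by time slice: daily mean
daily = sample.resample("D").mean()

print(one_day)
print(daily)
```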

In [37]:
df_zeitreihe_my = df_zeitreihe_my.set_index("timestamp_utc")

In [38]:
df_zeitreihe_my 

timestamp_utc,timestamp_cet,air_temperature,water_temperature,wind_gust_max_10min,wind_speed_avg_10min,wind_force_avg_10min,wind_direction,windchill,barometric_pressure_qfe,precipitation,dew_point,global_radiation,humidity,water_level
2007-04-22 19:20:00+00:00,2007-04-22T21:20:00+02:00,18.9,15.2,1.6,0.7,0.7,321,18.9,973.5,0.0,4.4,3.0,38.0,405.9
2007-04-22 19:30:00+00:00,2007-04-22T21:30:00+02:00,18.1,15.2,1.3,0.8,0.8,346,18.1,973.7,0.0,4.8,3.0,41.0,405.9
2007-04-22 19:40:00+00:00,2007-04-22T21:40:00+02:00,17.7,15.1,0.9,0.2,0.2,4,17.7,973.7,0.0,5.1,3.0,43.0,405.9
2007-04-22 19:50:00+00:00,2007-04-22T21:50:00+02:00,17.6,15.3,0.6,0.1,0.1,235,17.6,973.8,0.0,5.7,3.0,45.0,405.9
2007-04-22 20:00:00+00:00,2007-04-22T22:00:00+02:00,17.4,15.4,0.7,0.3,0.3,178,17.4,973.9,0.0,6.3,3.0,48.0,405.9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022-11-15 07:40:00+00:00,2022-11-15T08:40:00+01:00,7.9,13.1,0.0,0.0,0.0,0,7.9,967.3,0.0,7.1,59.0,94.0,406.0
2022-11-15 07:50:00+00:00,2022-11-15T08:50:00+01:00,8.2,13.1,1.8,0.3,1.0,62,8.2,967.3,0.0,7.5,116.0,95.0,406.0
2022-11-15 08:00:00+00:00,2022-11-15T09:00:00+01:00,9.3,13.1,1.4,0.8,1.0,95,9.3,967.4,0.0,7.8,199.0,90.0,406.0
2022-11-15 08:10:00+00:00,2022-11-15T09:10:00+01:00,10.2,13.1,1.8,1.3,1.0,134,10.2,967.3,0.0,7.7,176.0,85.0,406.0


In [39]:
df_zeitreihe_my.info()
df_zeitreihe_my.index.year.unique()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 804798 entries, 2007-04-22 19:20:00+00:00 to 2022-11-15 08:20:00+00:00
Data columns (total 14 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   timestamp_cet            804798 non-null  object 
 1   air_temperature          804798 non-null  float64
 2   water_temperature        704401 non-null  float64
 3   wind_gust_max_10min      804798 non-null  float64
 4   wind_speed_avg_10min     804798 non-null  float64
 5   wind_force_avg_10min     804798 non-null  float64
 6   wind_direction           804798 non-null  int64  
 7   windchill                804798 non-null  float64
 8   barometric_pressure_qfe  800057 non-null  float64
 9   precipitation            704401 non-null  float64
 10  dew_point                804798 non-null  float64
 11  global_radiation         704401 non-null  float64
 12  humidity                 804798 non-null  float64
 13  water_level  

Int64Index([2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020,
            2021, 2022],
           dtype='int64', name='timestamp_utc')

In [40]:
df_zeitreihe_tb = df_zeitreihe_tb.set_index("timestamp_utc")

In [41]:
df_zeitreihe_tb.info()
df_zeitreihe_tb.index.year.unique()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 852105 entries, 2007-04-15 09:30:00+00:00 to 2022-11-15 08:20:00+00:00
Data columns (total 14 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   timestamp_cet            852105 non-null  object 
 1   air_temperature          852105 non-null  float64
 2   water_temperature        852009 non-null  float64
 3   wind_gust_max_10min      852105 non-null  float64
 4   wind_speed_avg_10min     852105 non-null  float64
 5   wind_force_avg_10min     852105 non-null  float64
 6   wind_direction           852105 non-null  int64  
 7   windchill                852105 non-null  float64
 8   barometric_pressure_qfe  799556 non-null  float64
 9   precipitation            0 non-null       float64
 10  dew_point                852105 non-null  float64
 11  global_radiation         0 non-null       float64
 12  humidity                 852105 non-null  float64
 13  water_level  

Int64Index([2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020,
            2021, 2022],
           dtype='int64', name='timestamp_utc')

In [42]:
# first we compute the weekly means
weekly_means = df_zeitreihe_tb.resample("W").mean()
# then the means and medians per quarter
quarterly_means = df_zeitreihe_tb.resample("Q").mean()
quarterly_median = df_zeitreihe_tb.resample("Q").median()
# for readability the values could be cast to integers
#weekly_means.dropna().astype(int).head(5)
#weekly_means
#quarterly_means
#quarterly_median

### Simple visualizations for the plausibility check

Explore the data with PivotTable.js

The data are too large for this... it crashes

In [43]:
#from pivottablejs import pivot_ui

#pivot_ui(df_zeitreihe_tb)

### Query points in time and time ranges


#### Aggregate values by time slices

With the time functions it is easy to switch between hours, days, months, etc. and to aggregate accordingly.

Here, for example, to check whether the transitions are smooth after appending the annual data, or whether something looks suspicious.
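Besides looking at aggregated values, the transitions between the appended years can be checked directly on the timestamps: gaps, duplicates and the sort order. A minimal sketch, assuming the 10-minute measuring interval that the data shown in this notebook suggest; the helper `find_gaps` and the sample timestamps are hypothetical.

```python
import pandas as pd


def find_gaps(index: pd.DatetimeIndex, expected: pd.Timedelta = pd.Timedelta(minutes=10)) -> pd.Series:
    """Return the time differences that exceed the expected interval, labelled by the later timestamp."""
    deltas = index.to_series().diff()
    return deltas[deltas > expected]


# Invented timestamps with a one-hour gap right after the turn of the year
index = pd.to_datetime(
    ["2021-12-31 23:40", "2021-12-31 23:50", "2022-01-01 00:50", "2022-01-01 01:00"],
    utc=True,
)

gaps = find_gaps(index)
n_duplicates = int(index.duplicated().sum())
is_sorted = index.is_monotonic_increasing

print(gaps)
print(n_duplicates, is_sorted)
```

On the real data the same function can be called with `df_zeitreihe_tb.index` or `df_zeitreihe_my.index` once the timestamp is the index.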


In [44]:
min_date_tb = df_zeitreihe_tb.reset_index().timestamp_utc.min().strftime("%Y-%m-%d")
print(min_date_tb, date_today)
min_date_my = df_zeitreihe_my.reset_index().timestamp_utc.min().strftime("%Y-%m-%d")
print(min_date_my, date_today)

2007-04-15 2022-11-15
2007-04-22 2022-11-15


In [45]:
#data2betested_my.loc["2017-06-30"]
df_zeitreihe_tb.loc[min_date_tb:date_today].resample("H").mean()
df_zeitreihe_my.loc[min_date_my:date_today].resample("H").mean()
#df_zeitreihe_tb.reset_index().sort_values('timestamp_utc', ascending=False)

timestamp_utc,air_temperature,water_temperature,wind_gust_max_10min,wind_speed_avg_10min,wind_force_avg_10min,wind_direction,windchill,barometric_pressure_qfe,precipitation,dew_point,global_radiation,humidity,water_level
2007-04-22 19:00:00+00:00,18.1,15.2,1.1,0.5,0.5,226.5,18.1,973.7,0.0,5.0,3.0,41.8,405.9
2007-04-22 20:00:00+00:00,16.8,15.3,1.3,0.9,0.9,159.0,16.8,974.1,0.0,6.7,3.0,51.0,405.9
2007-04-22 21:00:00+00:00,15.2,15.2,1.3,0.6,0.6,171.2,15.2,974.5,0.0,7.7,3.0,60.7,405.9
2007-04-22 22:00:00+00:00,14.4,15.2,1.6,1.1,1.1,174.0,14.4,975.1,0.0,8.2,3.0,66.8,405.9
2007-04-22 23:00:00+00:00,13.7,15.2,1.6,1.2,1.2,165.0,13.7,975.3,0.0,7.8,3.0,67.7,405.9
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022-11-15 04:00:00+00:00,6.6,13.2,1.4,0.8,0.8,201.3,6.6,968.1,0.0,6.5,0.0,98.7,406.0
2022-11-15 05:00:00+00:00,6.3,13.1,1.0,0.4,0.7,112.0,6.3,967.7,0.0,6.2,0.0,99.3,406.0
2022-11-15 06:00:00+00:00,6.6,13.1,1.7,0.9,1.0,119.5,6.6,967.4,0.0,6.5,7.3,99.0,406.0
2022-11-15 07:00:00+00:00,7.7,13.1,1.3,0.6,0.7,95.5,7.7,967.3,0.0,7.0,62.7,95.5,406.0


In [46]:
# first we compute the weekly medians
weekly_medians = df_zeitreihe_tb.resample("W").median()
# then the medians per quarter
quarterly_medians = df_zeitreihe_tb.resample("Q").median()
# for readability we cast the values to integers; the result is empty here because dropna() drops every row:
# precipitation, global_radiation and water_level are NaN throughout for Tiefenbrunnen
quarterly_medians.dropna().astype(int).head(2)

timestamp_utc,air_temperature,water_temperature,wind_gust_max_10min,wind_speed_avg_10min,wind_force_avg_10min,wind_direction,windchill,barometric_pressure_qfe,precipitation,dew_point,global_radiation,humidity,water_level


### Visualizations by time slices

Line chart
[Link to the documentation](https://altair-viz.github.io/gallery/multiline_highlight.html)

In [47]:
#weekly_medians.dtypes
days_tb = df_zeitreihe_tb.resample("D").median()
months_tb = df_zeitreihe_tb.resample("M").median()
years_tb = df_zeitreihe_tb.resample("Y").median()
years_tb.dtypes


air_temperature            float64
water_temperature          float64
wind_gust_max_10min        float64
wind_speed_avg_10min       float64
wind_force_avg_10min       float64
wind_direction             float64
windchill                  float64
barometric_pressure_qfe    float64
precipitation              float64
dew_point                  float64
global_radiation           float64
humidity                   float64
water_level                float64
dtype: object

#### Testing 
**I would like a chart in which the years are colour-coded and the course of each year is shown**

In [48]:
# add a column for month names and month number
months_tb['month_number'] = months_tb.index.month
months_tb['month_name'] = months_tb.index.month_name()
months_tb['year'] =months_tb.index.year
years_tb['year']=years_tb.index.year
#months_tb.head(2)

In [49]:
months_tb[['air_temperature','water_temperature','month_number','month_name','year']].reset_index().head(2)
years_tb[['air_temperature','water_temperature','year']].reset_index().head(2)

,timestamp_utc,air_temperature,water_temperature,year
0,2007-12-31 00:00:00+00:00,14.5,17.3,2007
1,2008-12-31 00:00:00+00:00,11.4,13.7,2008


In [50]:
chart1 = alt.Chart(years_tb[['air_temperature','water_temperature','year']].loc["2007":date_today].reset_index()).mark_line( strokeWidth=1.5, opacity=0.9).encode(
    x='year:T',
    y='air_temperature',
    color=alt.Color('year', legend=alt.Legend(title="Years chart 1"), scale=alt.Scale(scheme='category20'))
).properties(width=800, height=400).interactive()

chart2 = alt.Chart(months_tb[['air_temperature','water_temperature','month_number','month_name','year']].loc[min_date_tb:date_today].reset_index()).mark_line(interpolate="basis", opacity=0.6, strokeWidth=0.8).encode(
    x='month_number',
    y='air_temperature',
    color=alt.Color('year', legend=alt.Legend(title="Years"), scale=alt.Scale(scheme='cividis'))
).properties(width=800, height=400).interactive()

chart1 + chart2

For further customization, see https://altair-viz.github.io/user_guide/customization.html


**--------------- end test ----------------------**

In [51]:
months_tb[['air_temperature','water_temperature']]

timestamp_utc,air_temperature,water_temperature
2007-04-30 00:00:00+00:00,17.0,15.2
2007-05-31 00:00:00+00:00,15.7,16.1
2007-06-30 00:00:00+00:00,18.6,20.3
2007-07-31 00:00:00+00:00,18.6,20.1
2007-08-31 00:00:00+00:00,18.0,21.0
...,...,...
2022-07-31 00:00:00+00:00,22.5,24.0
2022-08-31 00:00:00+00:00,21.2,24.0
2022-09-30 00:00:00+00:00,13.1,19.1
2022-10-31 00:00:00+00:00,14.0,16.4


In [52]:
chart1 = alt.Chart(months_tb[['air_temperature','water_temperature']].reset_index().melt("timestamp_utc")).mark_line(strokeWidth=1, opacity=0.25).encode(
    x='timestamp_utc',
    y='value',
    color='variable',
).properties(width=800, height=400)

chart2 = alt.Chart(years_tb[['air_temperature','water_temperature']].reset_index().melt("timestamp_utc")).mark_line(interpolate="basis", opacity=1).encode(
    x='timestamp_utc',
    y='value',
    color='variable',
)

chart1 + chart2

In [53]:
rolling = months_tb[['air_temperature','water_temperature']].rolling(60, center=True, win_type="triang").mean()

chart1 = alt.Chart(rolling.reset_index().melt("timestamp_utc")).mark_line(strokeWidth=1.5, opacity=1).encode(
    x='timestamp_utc', y='value', color='variable',
).properties(width=800, height=400)

# same as the two charts in previous code cell, except more transparent
chart2 = alt.Chart(months_tb[['air_temperature','water_temperature']].reset_index().melt("timestamp_utc")).mark_line(strokeWidth=1.6, opacity=0.25).encode(
    x='timestamp_utc', y='value', color='variable',
)

chart3 = alt.Chart(years_tb[['air_temperature','water_temperature']].reset_index().melt("timestamp_utc")).mark_line(interpolate="basis", opacity=0.25).encode(
    x='timestamp_utc', y='value', color='variable',
)

chart1 + chart2 + chart3

In [54]:
# add a column for month names and month number
months_tb['month_number'] = months_tb.index.month
months_tb['month_name'] = months_tb.index.month_name()
months_tb.head(2)

timestamp_utc,air_temperature,water_temperature,wind_gust_max_10min,wind_speed_avg_10min,wind_force_avg_10min,wind_direction,windchill,...,dew_point,global_radiation,humidity,water_level,month_number,month_name,year
2007-04-30 00:00:00+00:00,17.0,15.2,1.6,0.8,0.8,117.0,16.8,...,8.9,,60.0,,4,April,2007
2007-05-31 00:00:00+00:00,15.7,16.1,2.6,1.1,1.1,192.0,14.1,...,9.1,,68.0,,5,May,2007


In [55]:
grp_months_tb = months_tb[['air_temperature','water_temperature', 'month_number', 'month_name']].groupby("month_name").mean()

In [56]:
alt.Chart(grp_months_tb.reset_index()).mark_bar(width=20).encode(
    x='month_number:O',
    y='air_temperature:Q',
    color='month_name:O'
).properties(width=300, height=300).interactive()

In [57]:
months_tb = df_zeitreihe_tb.resample("M").median()
years_tb = df_zeitreihe_tb.resample("Y").median()
#months = df_zeitreihe_tb.resample("M").min()
#years = df_zeitreihe_tb.resample("Y").min()

brush = alt.selection(type='interval', encodings=['x'])

upper = alt.Chart(years_tb[['air_temperature','water_temperature']].reset_index().melt("timestamp_utc")).mark_area(interpolate="basis").encode(
    x = alt.X('timestamp_utc:T', axis=None),
    y = alt.Y('value:Q', axis=None),
    color='variable'
).properties(width=800, height=50).add_selection(brush)

lower = alt.Chart(months_tb[['air_temperature','water_temperature']].reset_index().melt("timestamp_utc")).mark_line(strokeWidth=1).encode(
    x = alt.X('timestamp_utc:T', scale=alt.Scale(domain=brush)),
    y='value',
    color='variable',
).properties(width=800, height=300)

upper & lower

In [58]:
days_tb

timestamp_utc,air_temperature,water_temperature,wind_gust_max_10min,wind_speed_avg_10min,wind_force_avg_10min,wind_direction,windchill,barometric_pressure_qfe,precipitation,dew_point,global_radiation,humidity,water_level
2007-04-15 00:00:00+00:00,21.3,11.5,2.8,1.3,1.3,55.0,20.9,973.1,,9.2,,41.0,
2007-04-16 00:00:00+00:00,16.8,11.6,2.2,1.2,1.2,58.5,16.6,973.1,,7.6,,52.5,
2007-04-17 00:00:00+00:00,16.5,12.6,1.0,0.5,0.5,173.5,16.5,972.5,,6.8,,53.5,
2007-04-18 00:00:00+00:00,14.6,12.4,2.7,1.2,1.2,63.0,14.2,973.2,,8.7,,64.0,
2007-04-19 00:00:00+00:00,13.0,9.8,2.5,1.1,1.1,303.0,12.9,972.4,,5.4,,61.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022-11-11 00:00:00+00:00,9.2,13.8,1.2,0.2,0.0,175.0,9.2,984.6,,8.3,,92.0,
2022-11-12 00:00:00+00:00,8.7,13.8,1.5,0.4,1.0,100.0,8.7,984.4,,7.4,,93.0,
2022-11-13 00:00:00+00:00,7.0,13.7,1.8,0.6,1.0,59.0,7.0,974.9,,5.7,,92.0,
2022-11-14 00:00:00+00:00,7.4,13.5,1.7,0.4,1.0,111.0,7.4,968.9,,6.8,,94.0,


In [59]:
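# currently the time series does not include the current year.
# note: dropna(axis=1) removes water_temperature because it contains NaN values, which leads to the KeyError in this cell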

days_tb = df_zeitreihe_tb[['air_temperature','water_temperature']].dropna(axis=1).loc["2017-01-01 00:00":date_today].dropna(axis=1).resample("D").median()
months_tb =df_zeitreihe_tb[['air_temperature','water_temperature']].dropna(axis=1).loc["2017-01-01 00:00":date_today].resample("M").median()

years_tb = df_zeitreihe_tb.resample("Y").median()
#months = df_zeitreihe_tb.resample("M").min()
#years = df_zeitreihe_tb.resample("Y").min()

brush = alt.selection(type='interval', encodings=['x'])

upper = alt.Chart(months_tb[['air_temperature','water_temperature']].reset_index().melt("timestamp_utc")).mark_area(interpolate="basis").encode(
    x = alt.X('timestamp_utc:T', axis=None),
    y = alt.Y('value:Q', axis=None),
    color='variable'
).properties(width=800, height=50).add_selection(brush)

lower = alt.Chart(days_tb[['air_temperature','water_temperature']].reset_index().melt("timestamp_utc")).mark_line(strokeWidth=1).encode(
    x = alt.X('timestamp_utc:T', scale=alt.Scale(domain=brush)),
    y='value',
    color='variable',
).properties(width=800, height=300)

upper & lower

KeyError: "['water_temperature'] not in index"
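The `KeyError` is consistent with how `dropna(axis=1)` works: it removes every column that contains at least one missing value, and `water_temperature` has 96 missing values at Tiefenbrunnen. The following sketch reproduces the effect on an invented sample and shows two alternatives; it is an illustration, not a tested replacement for the cell above.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "air_temperature": [1.0, 2.0, 3.0],
    "water_temperature": [10.0, np.nan, 12.0],
})

# Default how="any": a single NaN is enough to drop the whole column
cols_any = sample.dropna(axis=1).columns.tolist()

# how="all": only columns that are entirely NaN are dropped
cols_all = sample.dropna(axis=1, how="all").columns.tolist()

# Alternative: keep all columns; median() skips NaN values by default
medians = sample.median()

print(cols_any)
print(cols_all)
print(medians)
```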

In [None]:
#df_zeitreihe_tb[['air_temperature','water_temperature']].dropna(axis=1).loc["2018-12-31 21:50":"2022-01-01 04:30"].resample("M").median()

In [60]:
months_tb.reset_index().sort_values("timestamp_utc", ascending=False)

,timestamp_utc,air_temperature
70,2022-11-30 00:00:00+00:00,9.1
69,2022-10-31 00:00:00+00:00,14.0
68,2022-09-30 00:00:00+00:00,13.1
67,2022-08-31 00:00:00+00:00,21.2
66,2022-07-31 00:00:00+00:00,22.5
...,...,...
4,2017-05-31 00:00:00+00:00,15.5
3,2017-04-30 00:00:00+00:00,10.4
2,2017-03-31 00:00:00+00:00,8.7
1,2017-02-28 00:00:00+00:00,4.1


## Data export

If all tests are positive and plausible, the newly updated datasets can be exported as CSV and published later.

First check briefly whether anything wrong has slipped in during the plausibility checks.

In [None]:
df_zeitreihe_tb.shape
#df_zeitreihe_tb.describe()

In [None]:
df_zeitreihe_my.shape
#df_zeitreihe_my.describe()
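
Beyond `.shape`, a few more quick checks could be run before exporting. The following is only a sketch on toy data: the helper `quick_sanity_report` is hypothetical and not part of the notebook, and it assumes a DataFrame with a datetime index like `df_zeitreihe_tb` or `df_zeitreihe_my`.

```python
import numpy as np
import pandas as pd

def quick_sanity_report(df):
    """Collect a few basic facts about a time-indexed DataFrame (hypothetical helper)."""
    return {
        "shape": df.shape,
        # Timestamps that occur more than once (every repeat is counted).
        "duplicate_timestamps": int(df.index.duplicated().sum()),
        # True if the index never goes backwards in time.
        "index_sorted": bool(df.index.is_monotonic_increasing),
        # Columns that contain no data at all.
        "all_nan_columns": [c for c in df.columns if df[c].isna().all()],
    }

# Toy data (made up): one duplicated timestamp and one empty column.
idx = pd.to_datetime(
    ["2022-11-14 00:00", "2022-11-14 00:10", "2022-11-14 00:10"], utc=True
)
toy = pd.DataFrame(
    {"air_temperature": [7.4, 7.3, 7.3], "water_level": [np.nan, np.nan, np.nan]},
    index=idx,
)
report = quick_sanity_report(toy)
print(report)
```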

### Define export paths

#### Dynamically computed min and max years

In [None]:
#years.index.year
years = df_zeitreihe_tb.resample("Y").median()
min_year = years.index.year.min()
max_year= years.index.year.max()

print(min_year, max_year, r"\\szh\ssz\applikationen\OGD\Daten\Quelldaten\SID\WAPO\02_veroeffentlichte_zeitreihe\messwerte_mythenquai_"+str(min_year)+"-"+str(max_year)+".csv")

#### Assemble the paths

In [None]:
export_fp_my = r"\\szh\ssz\applikationen\OGD\Daten\Quelldaten\SID\WAPO\02_veroeffentlichte_zeitreihe\messwerte_mythenquai_"+str(min_year)+"-"+str(max_year)+".csv" #Mythenquai
export_fp_tb= r"\\szh\ssz\applikationen\OGD\Daten\Quelldaten\SID\WAPO\02_veroeffentlichte_zeitreihe\messwerte_tiefenbrunnen_"+str(min_year)+"-"+str(max_year)+".csv" #Tiefenbrunnen
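
The two paths differ only in the station name, so the concatenation could be wrapped in a small function. This is a sketch, not the notebook's method: `build_export_path` and the base directory `C:\tmp` are hypothetical, and only the file name pattern is taken from the cell above.

```python
def build_export_path(base_dir, station, min_year, max_year):
    """Build the CSV path for one station (hypothetical helper)."""
    # File name pattern as in the notebook: messwerte_<station>_<min>-<max>.csv
    return f"{base_dir}\\messwerte_{station}_{min_year}-{max_year}.csv"

# Example values are made up; in the notebook they come from min_year / max_year.
example_my = build_export_path(r"C:\tmp", "mythenquai", 2007, 2022)
example_tb = build_export_path(r"C:\tmp", "tiefenbrunnen", 2007, 2022)
print(example_my)
print(example_tb)
```

Note that in the notebook `min_year` and `max_year` are derived from `df_zeitreihe_tb` only and are then reused for both stations.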

#### Set export options

Options:
`DataFrame.to_csv(path_or_buf=None, sep=',', na_rep='', float_format=None, columns=None, header=True, index=True, index_label=None, mode='w', encoding=None, compression='infer', quoting=None, quotechar='"', line_terminator=None, chunksize=None, date_format=None, doublequote=True, escapechar=None, decimal='.', errors='strict', storage_options=None)`

### Run the export

1. Reset the indexes. In the process, the previously indexed value loses its ISO date format.
2. Redefine the time format of the previously indexed date so that it is ISO-compliant again. This only matters for utc, since cet was never used before and is still present as a string in ISO format.
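
The two steps can be illustrated on toy data. This is a minimal sketch with made-up values: it writes to an in-memory buffer instead of a file and uses a `lambda` around `Timestamp.isoformat()`, which is an equivalent variant of the `datetime.datetime.isoformat` call used in the cells below.

```python
import io
import pandas as pd

# Toy frame with a timezone-aware UTC index, named as in the notebook.
idx = pd.to_datetime(["2022-11-14 00:00", "2022-11-14 00:10"], utc=True)
df = pd.DataFrame({"air_temperature": [7.4, 7.3]}, index=idx)
df.index.name = "timestamp_utc"

# Step 1: reset the index so the timestamp becomes a regular column.
out = df.reset_index()

# Step 2: turn the timestamps into ISO 8601 strings (with "T" separator).
out["timestamp_utc"] = out["timestamp_utc"].apply(lambda ts: ts.isoformat())

# Write to a buffer instead of a file for demonstration purposes.
buf = io.StringIO()
out.to_csv(buf, sep=",", index=False)
csv_lines = buf.getvalue().splitlines()
print(csv_lines)
```

Without step 2, pandas would write the timestamps with a space instead of the `T` separator, for example `2022-11-14 00:00:00+00:00`.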

##### Mythenquai


In [None]:
df_my = df_zeitreihe_my.reset_index()
df_my['timestamp_utc'] = df_my.timestamp_utc.apply(datetime.datetime.isoformat)
df_my.head(2)

In [None]:
df_my.to_csv(
    export_fp_my,
    sep=',',
    encoding='utf-8-sig',
    index=False,
)

##### Tiefenbrunnen

In [None]:
df_tb = df_zeitreihe_tb.reset_index()

# Note: the order matters here.
df_tb['timestamp_utc'] = df_tb.timestamp_utc.apply(datetime.datetime.isoformat)
df_tb.head(2)

In [None]:
df_tb.to_csv(
    export_fp_tb,
    sep=',',
    encoding='utf-8-sig',
    index=False,
)

### Time formatting

Alternatively, `timestamp_cet` could be left out during the imports, or dropped from the existing data, and everything computed only at the very end.

Currently I proceed as follows:
1. Import the new data, parse the date field as utc, then compute cet.
2. Import the existing data, but read the date fields as strings only. utc is used for calculations later on, cet is not. In the end the value is effectively saved as a string, but this is not noticeable because the export writes strings without quotation marks.
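
Step 1 could look roughly like the sketch below. It rests on an assumption that should be checked against the published data: that "cet" means local time in Zurich including daylight saving time, rather than a fixed UTC+1 offset. The column names follow the notebook; the values are made up.

```python
import pandas as pd

# Toy frame with made-up timestamps, one in winter and one in summer.
df = pd.DataFrame(
    {"timestamp_utc": ["2022-01-01T12:00:00Z", "2022-07-01T12:00:00Z"]}
)

# Parse the strings as timezone-aware UTC timestamps.
df["timestamp_utc"] = pd.to_datetime(df["timestamp_utc"], utc=True)

# Assumption: "cet" is Zurich local time, so summer values get +02:00.
local = df["timestamp_utc"].dt.tz_convert("Europe/Zurich")

# Store cet as an ISO 8601 string, as described in the list above.
df["timestamp_cet"] = local.apply(lambda ts: ts.isoformat())
cet_values = df["timestamp_cet"].tolist()
print(cet_values)
```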



### Test global radiation and precipitation

In [86]:
#weekly_medians.dtypes
days_my = df_zeitreihe_my.resample("D").mean()
months_my = df_zeitreihe_my.resample("M").mean()
years_my = df_zeitreihe_my.resample("Y").mean()
years_tb.dtypes


air_temperature            float64
water_temperature          float64
wind_gust_max_10min        float64
wind_speed_avg_10min       float64
wind_force_avg_10min       float64
wind_direction             float64
windchill                  float64
barometric_pressure_qfe    float64
precipitation              float64
dew_point                  float64
global_radiation           float64
humidity                   float64
water_level                float64
dtype: object

In [95]:
#days_my[['precipitation','global_radiation']].loc['2019':'2022']
days_my[['precipitation','global_radiation']].loc['2022-01-01':'2022-11-14']


timestamp_utc,precipitation,global_radiation
2022-05-05 00:00:00+00:00,0.2,63.4
2022-06-05 00:00:00+00:00,0.2,99.3
2022-06-07 00:00:00+00:00,0.1,122.6
2022-06-24 00:00:00+00:00,0.1,88.9
2022-06-30 00:00:00+00:00,0.2,181.9
2022-08-17 00:00:00+00:00,0.1,185.9
2022-08-19 00:00:00+00:00,0.1,32.6
2022-09-15 00:00:00+00:00,0.1,63.8
2022-09-28 00:00:00+00:00,0.2,27.6


In [97]:
df_zeitreihe_my[['precipitation','global_radiation']].loc['2022-11-05 10:00:00':'2022-11-14 10:00:00'].query("precipitation>0.1")

timestamp_utc,precipitation,global_radiation
2022-11-05 13:40:00+00:00,0.6,45.0
2022-11-05 13:50:00+00:00,0.3,42.0
2022-11-05 14:00:00+00:00,0.3,15.0
2022-11-09 05:50:00+00:00,0.2,0.0
2022-11-09 06:10:00+00:00,0.3,0.0
2022-11-09 06:20:00+00:00,0.4,0.0
2022-11-09 06:40:00+00:00,0.2,4.0
2022-11-09 06:50:00+00:00,0.2,7.0
2022-11-09 13:00:00+00:00,0.2,46.0
2022-11-09 13:30:00+00:00,0.2,24.0
