Add a run script for data_collection.py with fire and add a table of available data per TimeResolution
meteoDaniel authored and meteoDaniel committed Jun 20, 2020
1 parent 90e96b7 commit 9a5928d
Showing 6 changed files with 57 additions and 9 deletions.
2 changes: 1 addition & 1 deletion DWD_FTP_STRUCTURE.md
@@ -1,6 +1,6 @@
# Folder structure of the DWD FTP server

[LINK TO DWD](ftp://opendata.dwd.de/climate_environment/CDC/observations_germany/climate)
[LINK TO DWD](https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate)

| Timescale | Variable | Abbreviation | Period | Filename |
| --- | --- | --- | --- | --- |
4 changes: 3 additions & 1 deletion Dockerfile
@@ -7,4 +7,6 @@ ENV TERM linux
COPY ./requirements.txt /opt/requirements.txt
RUN pip install -r /opt/requirements.txt

WORKDIR /app
WORKDIR /app

ENV PYTHONPATH /app/
38 changes: 34 additions & 4 deletions README.md
@@ -98,19 +98,43 @@ metadata = python_dwd.metadata_for_dwd_data(parameter=Parameter.PRECIPITATION_MO
```


## 4. Listing server files
## 4. Availability table

It is also possible to use enumeration keywords. The availability table below maps each enumeration keyword to the availability of data via python_dwd and on the CDC server.

|Parameter/Granularity |1_minute | 10_minutes |hourly | daily |monthly | annual|
|----------------|-------------------------------|-----------------------------|-----------------------------|-----------------------------|-----------------------------|-----------------------------|
| `TEMPERATURE_SOIL = "soil_temperature"` | :x: | :x: | :heavy_check_mark:|:heavy_check_mark: |:x: | :x:|
| `TEMPERATURE_AIR = "air_temperature"` |:x: | :heavy_check_mark:| :heavy_check_mark:| :x:|:x: |:x: |
| `PRECIPITATION = "precipitation"` | :heavy_check_mark: | :heavy_check_mark: |:x: | :x:| :x:|:x: |
| `TEMPERATURE_EXTREME = "extreme_temperature"` | :x:|:heavy_check_mark: | :x:|:x: | :x:|:x: |
| `WIND_EXTREME = "extreme_wind" ` |:x: | :heavy_check_mark: | :x:| :x:|:x: |:x: |
| `SOLAR = "solar"` | :x: | :heavy_check_mark: | :heavy_check_mark:| :heavy_check_mark:| :x:|:x: |
| `WIND = "wind" ` |:x: |:heavy_check_mark: | :heavy_check_mark:|:x: |:x: |:x: |
| `CLOUD_TYPE = "cloud_type"` |:x: | :x: | :heavy_check_mark:|:x: |:x: |:x: |
| `CLOUDINESS = "cloudiness" ` | :x: | :x: |:heavy_check_mark: | :x:| :x:| :x:|
| `SUNSHINE_DURATION = "sun"` |:x: |:x: | :heavy_check_mark:| :x:|:x:|:x: |
| `VISBILITY = "visibility"` | :x:| :x:|:heavy_check_mark: |:x: | :x:| :x:|
| `WATER_EQUIVALENT = "water_equiv"` | :x:| :x: |:x: |:heavy_check_mark: |:x: | :x:|
| `PRECIPITATION_MORE = "more_precip" ` | :x: | :x: |:x: | :heavy_check_mark:|:heavy_check_mark: | :heavy_check_mark:|
| `PRESSURE = "pressure"` | :x:| :x: | :heavy_check_mark:|:x: |:x:|:x: |
| `CLIMATE_SUMMARY = "kl"` |:x: | :x: |:x: | :heavy_check_mark:|:heavy_check_mark: |:heavy_check_mark: |
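The enumeration keywords in the table can be thought of as a lookup. The following is an illustrative sketch only: the names mirror python_dwd's enumerations, but the classes and the hand-picked excerpt of the availability mapping are written here for demonstration, not taken from the library's code.

```python
from enum import Enum


# Illustrative stand-ins for python_dwd's enumerations (not the library's code).
class Parameter(Enum):
    CLIMATE_SUMMARY = "kl"
    PRECIPITATION_MORE = "more_precip"


class TimeResolution(Enum):
    HOURLY = "hourly"
    DAILY = "daily"
    MONTHLY = "monthly"


# A small excerpt of the availability table as a lookup; missing pairs mean :x:.
AVAILABILITY = {
    (Parameter.CLIMATE_SUMMARY, TimeResolution.DAILY): True,
    (Parameter.CLIMATE_SUMMARY, TimeResolution.MONTHLY): True,
    (Parameter.PRECIPITATION_MORE, TimeResolution.DAILY): True,
    (Parameter.PRECIPITATION_MORE, TimeResolution.MONTHLY): True,
}


def is_available(parameter: Parameter, resolution: TimeResolution) -> bool:
    """Return whether the combination is marked available in the table."""
    return AVAILABILITY.get((parameter, resolution), False)
```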


## 5. Listing server files

The server is constantly updated with new values. Existing station data is extended with newly measured data approximately once a year, shortly after New Year. On this occasion the toolset has to retrieve a new **filelist**, which has to be initiated by the user when an error about this is raised. For this purpose a function scans the server folder for a given parameter set if requested.

The created filelist is also used for the metadata, namely for the column **HAS_FILE**. This is necessary because not every station listed in the given metadata also has a corresponding file. With this information one can simply filter the metadata with **HAS_FILE == True** to get only those stations that really have a file on the server.
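The **HAS_FILE** filter can be sketched with pandas as follows; the column names follow the README, while the station ids and flags are made up for illustration.

```python
import pandas as pd

# Made-up metadata with the HAS_FILE column described above.
metadata = pd.DataFrame({
    "STATION_ID": [1048, 1050, 2290],
    "HAS_FILE": [True, False, True],
})

# Keep only stations that really have a file on the server.
stations_with_files = metadata[metadata["HAS_FILE"]]
```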

## 5. About the metadata
## 6. About the metadata

The metadata for a set of parameters is not stored in a usual .csv but in a .txt file next to the station data. This file has to be parsed first, as unfortunately there is no regular separator in those files. After parsing the text, a .csv is created which can then be read easily. There is one exception to this: for 1-minute precipitation data, the metadata is stored in separate zip files, which contain more detailed information. For this reason, calling metadata_dwd with those parameters will download and read in all the files and store them in a similar DataFrame to provide seamless functionality across all parameter types.
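One common way to parse such files without a regular separator is to split on runs of whitespace; the following is a sketch under that assumption, not the library's actual parser, and the sample line is made up. Columns are set apart by at least two spaces, while station names may contain single spaces.

```python
import re

# A made-up line in the style of the metadata .txt files described above.
line = "01048  19340101  Dresden Klotzsche   Sachsen"

# Split on runs of two or more whitespace characters so that single spaces
# inside station names are preserved.
fields = re.split(r"\s{2,}", line.strip())
```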

This data also doesn't include the **STATE** information, which can be useful to filter the data for a certain region. To get this data into our metadata, we run another metadata request for the parameters of historical daily precipitation data, as we expect it to have the most information, because it is the most common station type in Germany. In some cases there may still be no STATE information, as some stations are only run to individually measure the performance of certain values at a special site.
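The enrichment step can be sketched as a left join with pandas; the station ids and state names below are made up for illustration, and the real column names in the library may differ.

```python
import pandas as pd

# Metadata for the requested parameter set, lacking STATE information.
metadata = pd.DataFrame({"STATION_ID": [1048, 2290, 9999]})

# Metadata from the second request (historical daily precipitation),
# which carries the STATE column.
precip_metadata = pd.DataFrame({
    "STATION_ID": [1048, 2290],
    "STATE": ["Sachsen", "Bayern"],
})

# A left join keeps every station; stations without STATE information get NaN.
metadata = metadata.merge(precip_metadata, on="STATION_ID", how="left")
```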

## 6. Conclusion
## 7. Conclusion

Feel free to use the library if you want to automate data access and analyze the German climate. Be aware that the server may block the FTP client once in a while. It can therefore be useful to wrap the request in a try-except block and retry. For further examples of this library check the notebook **python_dwd_example.ipynb** in the **example** folder!
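The suggested try-except retry can be sketched as follows; `fetch` here is a stand-in for any flaky server call, e.g. a call to `collect_dwd_data`.

```python
import time


def fetch_with_retries(fetch, attempts: int = 3, delay: float = 0.0):
    """Call fetch up to attempts times, re-raising the last error on failure."""
    last_error = None
    for _ in range(attempts):
        try:
            return fetch()
        except IOError as error:  # e.g. the server temporarily blocks the client
            last_error = error
            time.sleep(delay)
    raise last_error
```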

@@ -130,5 +154,11 @@ To run the tests in the given environment, just call
docker run -ti -v $(pwd):/app python_dwd:latest pytest tests/
```
from the main directory. To work in an iPython shell you just have to change the command `pytest tests/` to `ipython`.
Soon there will be a `fire` based command line script.

#### Command line script
You can download data as CSV files after building the Docker container. Currently only `collect_dwd_data` is supported by this service.

```
docker run -ti -v $(pwd):/app python_dwd:latest python3 python_dwd/run.py collect_dwd_data "[1048]" "kl" "daily" "historical" /app/dwd_data/ False False True False True True
```
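Under the hood, fire parses each positional token into a Python value before calling the function: list and boolean literals are evaluated, everything else stays a string. The converter below is a simplified stand-in to illustrate the idea, not fire's actual implementation.

```python
import ast


def parse_cli_token(token: str):
    """Parse a CLI token as a Python literal where possible."""
    try:
        return ast.literal_eval(token)
    except (ValueError, SyntaxError):
        return token  # plain strings such as "kl" or "daily" stay strings


# The first few positional arguments from the docker command above.
args = [parse_cli_token(t) for t in ["[1048]", "kl", "daily", "False", "True"]]
```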

11 changes: 8 additions & 3 deletions python_dwd/data_collection.py
@@ -1,7 +1,7 @@
""" Data collection pipeline """
import logging
from pathlib import Path
from typing import List, Union
from typing import List, Union, Optional
import pandas as pd

from python_dwd.constants.column_name_mapping import GERMAN_TO_ENGLISH_COLUMNS_MAPPING_HUMANIZED
@@ -26,7 +26,8 @@ def collect_dwd_data(station_ids: List[int],
parallel_download: bool = False,
write_file: bool = False,
create_new_filelist: bool = False,
humanize_column_names: bool = False) -> pd.DataFrame:
humanize_column_names: bool = False,
run_download_only: bool = False) -> Optional[pd.DataFrame]:
"""
Function that organizes the complete pipeline of data collection, either
from the internet or from a local file. It therefor goes through every given
@@ -45,6 +46,7 @@ def collect_dwd_data(station_ids: List[int],
write_file: boolean if to write data to local storage
create_new_filelist: boolean if to create a new filelist for the data selection
humanize_column_names: boolean to yield column names better for human consumption
run_download_only: boolean to run only the download and storing process
Returns:
a pandas DataFrame with all the data given by the station ids
@@ -86,7 +88,10 @@ def collect_dwd_data(station_ids: List[int],
station_data, station_id, parameter, time_resolution, period_type, folder)

data.append(station_data)


if run_download_only:
return None

data = pd.concat(data)

# Assign meaningful column names (humanized).
10 changes: 10 additions & 0 deletions python_dwd/run.py
@@ -0,0 +1,10 @@
""" entrypoints ro tun scripts via Docker or command line """
import fire

from python_dwd.data_collection import collect_dwd_data


if __name__ == '__main__':
fire.Fire({
'collect_dwd_data': collect_dwd_data
})
1 change: 1 addition & 0 deletions requirements.txt
@@ -14,3 +14,4 @@ fire==0.3.1
docopt==0.6.2
munch==2.5.0
dateparser==0.7.4
fire==0.3.1
