# Data Analysis in Geoscience Remote Sensing Projects: Exercises
## Final task handed in by: YOURNAME, on DATE
Hendrik Andersen, contact: hendrik.andersen@kit.edu

## Part one: Regression and sensitivity estimation with remote sensing data

For this part, you are provided with a data set in the file 'data_final_task_regression.csv'. It contains regional averages of low-cloud occurrence and meteorological factors in the Southeast Atlantic (10°S-20°S, 0°E-10°E, i.e. a study area of about 1000 km x 1000 km). The DataFrame contains the following variables:
- sst: sea surface temperature
- eis: estimated inversion strength
- t_adv: temperature advection
- w700: vertical pressure velocity at 700 hPa (this is the vertical wind speed, given in Pa/s: positive numbers mean subsiding air masses)
- rhft: relative humidity in the free troposphere (free troposphere is above the cloud layer)
- clf: liquid water cloud fraction

The cloud data come from the MODIS sensor on board NASA's Terra satellite. The product is MOD08_M3, downloaded from https://ladsweb.modaps.eosdis.nasa.gov/. For more information, see https://ladsweb.modaps.eosdis.nasa.gov/missions-and-measurements/products/MOD08_M3/#overview

The meteorological data are ERA5 reanalysis data on meteorological factors thought to be important for low cloud cover. The data were downloaded from https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-single-levels?tab=overview

For more information on how reanalysis data are generated, check out the 2-minute video from Copernicus ECMWF here: https://www.youtube.com/watch?v=FAGobvUGl24

Your task is to analyze the data to
1. quantitatively and visually describe the data 
2. analyze relationships between cloud fraction and the cloud-controlling factors using
    - regression analysis
    - a machine learning model (e.g. feature importance, SHAP)
3. describe the results of your analyses in the provided markdown cells. Do your results agree with the relationships described in the following study? Klein et al. (2017): Low-Cloud Feedbacks from Cloud-Controlling Factors: A Review, Surveys in Geophysics, doi: 10.1007/s10712-017-9433-3

## Part two: Classification in a remote sensing retrieval setting

For the second part of the exercise, your task is to develop a machine learning method that detects fog and low clouds at a location in the Namib Desert from geostationary satellite observations. You are provided with a data set (ILIAS: 'data_final_task_classification.csv') of night-time satellite observations at different wavelengths over a meteorological measurement station. The sensor on the geostationary satellite (Spinning Enhanced Visible and Infrared Imager; SEVIRI) scans every 15 minutes at a spatial resolution of 3 km x 3 km. For the exact time steps of the satellite observations provided here, a boolean (True/False) data set on the presence of fog and low clouds from the measurement station is available; use it as the labeled target data. The data set contains the following variables:
- IR_016: Measurements at the 1.6 µm channel
- IR_039: Measurements at the 3.9 µm channel
- IR_087: Measurements at the 8.7 µm channel
- IR_097: Measurements at the 9.7 µm channel
- IR_108: Measurements at the 10.8 µm channel
- IR_120: Measurements at the 12.0 µm channel
- IR_134: Measurements at the 13.4 µm channel
- station_fls: Boolean (True/False) flag indicating whether fog or low clouds are present at the given time

Your task is to analyze the data to
1. Train and optimize a machine learning classifier (e.g. GradientBoostingClassifier) to detect fog and low clouds, and analyze the results using a confusion matrix and performance metrics.
2. Compare the results to a logistic regression approach, and discuss in the provided markdown cell which method is better at classifying fog and low cloud presence/absence.
3. Compare the results to a dedicated detection approach developed for the region in Andersen and Cermak (2018): First fully diurnal fog and low cloud satellite detection reveals life cycle in the Namib, Atmospheric Measurement Techniques, doi: 10.5194/amt-11-5461-2018. Use the provided markdown cell for the discussion.

You can find more specific tasks in the cells below.


## Part one: Regression and sensitivity estimation with remote sensing data
__Task__: 
In a typical scientific workflow, the first step is to get an overview of the data. Typically, visualizations and descriptive statistics are very useful to achieve this.  
1. Calculate the mean and standard deviation of cloud fraction and plot the distribution of cloud fraction in a histogram.
2. Plot the CLF time series and describe seasonal patterns.
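
As a rough illustration of these two steps, the sketch below computes the mean and standard deviation, a histogram, the time series, and a mean seasonal cycle. It runs on a synthetic stand-in (`demo_df`, a hypothetical name) with an artificial annual cycle, not on the real data; with the real data, `data_df` would take its place.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for data_df: 10 years of monthly values (assumption, not real data)
rng = np.random.default_rng(0)
time = pd.date_range("2001-01-01", periods=120, freq="MS")
month = time.month.to_numpy()
clf = 0.5 + 0.15 * np.cos(2 * np.pi * (month - 9) / 12) + rng.normal(0, 0.03, len(time))
demo_df = pd.DataFrame({"clf": clf}, index=time)

# Descriptive statistics
clf_mean = demo_df["clf"].mean()
clf_std = demo_df["clf"].std()  # sample standard deviation (ddof=1)

# Mean seasonal cycle: average of all Januaries, all Februaries, ...
monthly_clim = demo_df["clf"].groupby(demo_df.index.month).mean()

fig, (ax_hist, ax_ts, ax_clim) = plt.subplots(1, 3, figsize=(14, 3.5))
ax_hist.hist(demo_df["clf"], bins=20)
ax_hist.set_xlabel("CLF")
ax_hist.set_ylabel("count")
ax_ts.plot(demo_df.index, demo_df["clf"])
ax_ts.set_xlabel("time")
ax_ts.set_ylabel("CLF")
ax_clim.plot(monthly_clim.index, monthly_clim.values, marker="o")
ax_clim.set_xlabel("month")
ax_clim.set_ylabel("mean CLF")
fig.tight_layout()

print(f"mean = {clf_mean:.3f}, std = {clf_std:.3f}")
```

Grouping by calendar month is one simple way to make a seasonal pattern visible that is harder to read off the raw time series.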

In [9]:
# use this cell for your code, make sure to comment your code to make it understandable
# just to get you started:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
data_df = pd.read_csv("data_final_task_regression.csv", index_col='time', parse_dates=True)
data_df.head()

| time | sst | eis | t_adv | w700 | rhft | clf |
| --- | --- | --- | --- | --- | --- | --- |
| 2001-01-01 | 296.1685 | 4.824931 | -1.682254 | 0.022467 | 35.642094 | 0.663396 |
| 2001-02-01 | 297.11526 | 3.476827 | -1.93279 | 0.026573 | 37.940147 | 0.60588 |
| 2001-03-01 | 298.159 | 2.265783 | -2.19912 | 0.03156 | 33.834457 | 0.52929 |
| 2001-04-01 | 298.17062 | 2.402812 | -2.532412 | 0.039651 | 21.56533 | 0.464151 |
| 2001-05-01 | 297.06927 | 3.534788 | -2.962148 | 0.037318 | 7.997189 | 0.318764 |


In [None]:
# your code here

Describe the results here

__Task__: 
1. Compute a regression analysis of CLF with each meteorological predictor
2. Describe the sensitivities of CLF to the meteorological predictors:
    - How sensitive are low clouds to changes in meteorological predictors (in individual simple regression models, and in a multiple regression framework)?
    - Are the relationships significant?
    - Are there strong correlations between the different predictors that could influence the sensitivity estimates?
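
One possible way to approach the simple regressions, their significance, and the correlations among predictors is sketched below with `scipy.stats.linregress` and `DataFrame.corr`. The data are synthetic (`demo_df` is a hypothetical stand-in with only two predictors), and the numbers in the generating equations are arbitrary assumptions chosen so that the two predictors are correlated.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in data (assumption): eis is anti-correlated with sst,
# and clf depends directly on eis only.
rng = np.random.default_rng(1)
n = 200
sst = rng.normal(297.0, 1.0, n)
eis = 4.0 - 0.8 * (sst - 297.0) + rng.normal(0, 0.5, n)
clf = 0.4 + 0.05 * eis + rng.normal(0, 0.03, n)
demo_df = pd.DataFrame({"sst": sst, "eis": eis, "clf": clf})

# One simple linear regression per predictor
rows = []
for name in ["sst", "eis"]:
    res = stats.linregress(demo_df[name], demo_df["clf"])
    rows.append({
        "predictor": name,
        "slope": res.slope,      # sensitivity: change in clf per unit of predictor
        "r": res.rvalue,         # Pearson correlation coefficient
        "p_value": res.pvalue,   # two-sided test of slope == 0
    })
simple_fits = pd.DataFrame(rows).set_index("predictor")

# Correlations among the predictors themselves
predictor_corr = demo_df[["sst", "eis"]].corr()

print(simple_fits)
print(predictor_corr)
```

In this toy setup, `sst` shows a significant negative slope even though `clf` was generated from `eis` alone. This is the kind of effect that correlated predictors can have on simple-regression sensitivity estimates.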
  
For the multiple linear regression, you want to be able to compare the sensitivity estimates of the different predictors (to see which ones are most important). To do this, the predictors need to be on the same scale. The code cell below does this with scikit-learn's StandardScaler, which subtracts each predictor's mean and divides by its standard deviation, so that all predictors in X have zero mean and unit variance.
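
To see why the scaling matters, the following sketch fits a multiple linear regression once on raw and once on standardized predictors. It uses synthetic data (`X_demo` and `y_demo` are hypothetical names, and the coefficients in the generating equation are arbitrary assumptions), so it only illustrates the mechanics.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic predictors on very different scales (assumption, not real data)
rng = np.random.default_rng(2)
n = 300
X_demo = pd.DataFrame({
    "sst": rng.normal(297.0, 1.0, n),    # spread of about 1 unit
    "w700": rng.normal(0.03, 0.01, n),   # spread of about 0.01 units
})
y_demo = (
    0.5
    - 0.04 * (X_demo["sst"] - 297.0)
    + 1.0 * (X_demo["w700"] - 0.03)
    + rng.normal(0, 0.01, n)
)

# Standardize: zero mean and unit variance per column
X_std = StandardScaler().fit_transform(X_demo)

# Raw coefficients are "per unit of the predictor",
# standardized coefficients are "per standard deviation of the predictor"
raw_coefs = LinearRegression().fit(X_demo, y_demo).coef_
std_coefs = LinearRegression().fit(X_std, y_demo).coef_

print(pd.DataFrame({"raw": raw_coefs, "standardized": std_coefs}, index=X_demo.columns))
```

On the raw scale, `w700` has the larger coefficient only because its values are numerically small. After standardization, the coefficients are comparable and `sst` turns out to contribute more to the variability of `y_demo` in this toy example.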

In [10]:
from sklearn.preprocessing import StandardScaler

X = data_df.drop(['clf'],axis=1)
y = data_df.clf

X_standardized = StandardScaler().fit(X).transform(X)

In [3]:
# use this cell for your code, make sure to comment your code to make it understandable

Describe the results here

__Task__:
Use a machine learning model to 
1. Predict CLF as accurately as possible [low validation error (e.g. MSE or RMSE) and high explained variance (R²), tuning of hyperparameters]
2. Create a scatter plot of observed CLF vs. model-predicted CLF for both the training and test data sets to visualize model performance and check for overfitting. Do the same for a multiple linear regression model. Is the machine learning model better than the multiple regression?
3. Which predictors are most important for the model to predict CLF? [feature importance]
4. Analyze the two most important meteorological features in more detail: How do they influence the prediction of CLF? [partial dependence, SHAP]

In [4]:
# use this cell for your code, make sure to comment your code to make it understandable
# some code to help you get started:
X = data_df.drop(['clf'],axis=1)
y = data_df['clf'] # select the target as a 1-D Series, as scikit-learn estimators expect

# you can start here by separating training and test data sets


Describe the results here and compare your findings with Klein et al. (2017), specifically Table 1 of that paper.

## Part two: Classification in a remote sensing retrieval setting

__Task__
1. Divide data into test and training data sets
2. Train and optimize (hyperparameter tuning) a machine learning model 
3. Train a logistic regression model
4. Analyze both classifiers with a confusion matrix and performance metrics
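
A minimal sketch of these four steps is given below. It runs on synthetic data: the column names `IR_039` and `IR_108` are borrowed from the data set description, but the values and the rule used to generate the labels are arbitrary assumptions, so the scores say nothing about the real retrieval problem.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption, not real satellite observations)
rng = np.random.default_rng(4)
n = 600
X_demo = pd.DataFrame({
    "IR_039": rng.normal(285.0, 3.0, n),
    "IR_108": rng.normal(287.0, 3.0, n),
})
# Hypothetical labelling rule with some noise; boolean labels cast to 0/1
diff = X_demo["IR_108"] - X_demo["IR_039"]
y_demo = ((diff + rng.normal(0, 1.0, n)) > 2.0).astype(int)

# 1. Split, keeping the class balance similar in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# 2. Gradient boosting with a small cross-validated grid search
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 3]},
    cv=5,
    scoring="f1",
)
search.fit(X_train, y_train)

# 3. Logistic regression on standardized inputs
logreg = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)

# 4. Confusion matrix and metrics on the held-out test data
results = {}
for name, model in [("gradient boosting", search.best_estimator_),
                    ("logistic regression", logreg)]:
    pred = model.predict(X_test)
    results[name] = {
        # rows: true class (0, 1); columns: predicted class (0, 1)
        "confusion_matrix": confusion_matrix(y_test, pred, labels=[0, 1]),
        "accuracy": accuracy_score(y_test, pred),
        "f1": f1_score(y_test, pred),
    }

for name, res in results.items():
    print(name, res["confusion_matrix"].tolist(), round(res["accuracy"], 3), round(res["f1"], 3))
```

Evaluating both classifiers on the same held-out test set keeps the comparison fair; the grid search only ever sees the training data.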

In [5]:
import xarray as xr

data = xr.open_dataset('data_final_task_classification.nc') # load the data set
X = data[['IR_016','IR_039','IR_087','IR_097','IR_108','IR_120','IR_134']] # define X to be the satellite observations from different channels
X['hour'] = data.time.dt.hour # use the hour of the observation as an additional predictor
X = X.to_dataframe() # convert to a pandas data frame

y = data.station_fls # define y to be the True/False labels from the meteorological station

# continue your code here


Use this cell to describe and compare the classification results of the machine learning and logistic regression approaches.

Use this cell to compare your results to the results in Andersen and Cermak (2018).