# Table of Contents
* [Data overview](#data)
* [Init](#init)
* [Import and Overview (subset)](#import)
* [Targets and Features](#features_target)
* [Correlations](#corr)
* [Other Explorations](#other)
* [Import individual columns (full data)](#import_col)

<a id='data'></a>
# Data overview

## Inputs:
### Arrays (dimension 60):
* state_t: air temperature
* state_q0001: specific humidity
* state_q0002: cloud liquid mixing ratio
* state_q0003: cloud ice mixing ratio
* state_u: zonal wind speed
* state_v: meridional wind speed
* pbuf_ozone: ozone volume mixing ratio
* pbuf_CH4: methane volume mixing ratio
* pbuf_N2O: nitrous oxide volume mixing ratio

### Scalars:
* state_ps: surface pressure
* pbuf_SOLIN: solar insolation
* pbuf_LHFLX: surface latent heat flux
* pbuf_SHFLX: surface sensible heat flux
* pbuf_TAUX: zonal surface stress
* pbuf_TAUY: meridional surface stress
* pbuf_COSZRS: cosine of solar zenith angle
* cam_in_ALDIF: albedo for diffuse longwave radiation
* cam_in_ALDIR: albedo for direct longwave radiation
* cam_in_ASDIF: albedo for diffuse shortwave radiation
* cam_in_ASDIR: albedo for direct shortwave radiation
* cam_in_LWUP: upward longwave flux
* cam_in_ICEFRAC: sea-ice areal fraction
* cam_in_LANDFRAC: land areal fraction
* cam_in_OCNFRAC: ocean areal fraction
* cam_in_SNOWHLAND: snow depth over land

## Targets:
### Arrays (dimension 60):
* ptend_t: heating tendency
* ptend_q0001: moistening tendency
* ptend_q0002: cloud liquid mixing ratio change over time
* ptend_q0003: cloud ice mixing ratio change over time
* ptend_u: zonal wind acceleration
* ptend_v: meridional wind acceleration

### Scalars:
* cam_out_NETSW: net shortwave flux at surface
* cam_out_FLWDS: downward longwave flux at surface
* cam_out_PRECSC: snow rate (liquid water equivalent)
* cam_out_PRECC: rain rate
* cam_out_SOLS: downward visible direct solar flux to surface
* cam_out_SOLL: downward near-infrared direct solar flux to surface
* cam_out_SOLSD: downward diffuse visible solar flux to surface
* cam_out_SOLLD: downward diffuse near-infrared solar flux to surface
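
Each array variable is stored as 60 separate columns, one per level, with the level index appended to the name (e.g. `state_t_0`, the naming used later in this notebook). The following is a minimal sketch of the resulting column counts, assuming every array variable listed above expands the same way. The helper `expand_columns` is hypothetical and only for illustration, and the column order in the actual CSV files may differ.

```python
N_LEVELS = 60  # entries per array variable, as listed above

input_arrays = ['state_t', 'state_q0001', 'state_q0002', 'state_q0003',
                'state_u', 'state_v', 'pbuf_ozone', 'pbuf_CH4', 'pbuf_N2O']
input_scalars = ['state_ps', 'pbuf_SOLIN', 'pbuf_LHFLX', 'pbuf_SHFLX',
                 'pbuf_TAUX', 'pbuf_TAUY', 'pbuf_COSZRS',
                 'cam_in_ALDIF', 'cam_in_ALDIR', 'cam_in_ASDIF', 'cam_in_ASDIR',
                 'cam_in_LWUP', 'cam_in_ICEFRAC', 'cam_in_LANDFRAC',
                 'cam_in_OCNFRAC', 'cam_in_SNOWHLAND']
target_arrays = ['ptend_t', 'ptend_q0001', 'ptend_q0002', 'ptend_q0003',
                 'ptend_u', 'ptend_v']
target_scalars = ['cam_out_NETSW', 'cam_out_FLWDS', 'cam_out_PRECSC',
                  'cam_out_PRECC', 'cam_out_SOLS', 'cam_out_SOLL',
                  'cam_out_SOLSD', 'cam_out_SOLLD']

def expand_columns(array_vars, scalar_vars, n_levels=N_LEVELS):
    """Hypothetical helper: one column per level for arrays, one per scalar."""
    cols = [f'{v}_{k}' for v in array_vars for k in range(n_levels)]
    return cols + list(scalar_vars)

feature_cols = expand_columns(input_arrays, input_scalars)
target_cols = expand_columns(target_arrays, target_scalars)

# 9*60 + 16 = 556 feature columns, 6*60 + 8 = 368 target columns
print(len(feature_cols), len(target_cols))
```

With 4 plots per row, 368 targets fill exactly 92 rows of subplots.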


<a id='init'></a>
# Init

In [None]:
# packages

# standard
import numpy as np
import pandas as pd
import time

# plots
import matplotlib.pyplot as plt
import seaborn as sns

# faster alternative to pandas
import polars as pl

In [None]:
# configs
pd.set_option('display.max_columns', None) # we want to display all columns in this notebook

# aesthetics
default_color_1 = 'darkblue'
default_color_2 = 'darkgreen'
default_color_3 = 'darkred'

# random seed
my_random_seed = 111

<a id='import'></a>
# Import and Overview (subset)

In [None]:
# file overview
!ls -l '../input/leap-atmospheric-physics-ai-climsim/'

#### The datasets are huge, so let's start with a small subset:

In [None]:
# import SUBSET of data
n_rows = 50000
folder = 'leap-atmospheric-physics-ai-climsim'
t1 = time.time()
df_train = pl.read_csv('../input/'+folder+'/train.csv', n_rows=n_rows).to_pandas()
df_test = pl.read_csv('../input/'+folder+'/test.csv', n_rows=n_rows).to_pandas()
df_sub = pl.read_csv('../input/'+folder+'/sample_submission.csv', n_rows=n_rows).to_pandas()
t2 = time.time()
print('Elapsed time [s]: ', np.round(t2-t1,2))
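
Since memory is the main constraint with data this size, one option is to store numeric columns as `float32` instead of the default `float64`. The sketch below uses synthetic data (the name `df_demo` is made up for illustration) and only shows the effect on memory. Whether `float32` precision is sufficient for this dataset is an assumption that would need to be checked.

```python
import numpy as np
import pandas as pd

# synthetic stand-in for a numeric data frame (not the competition data)
rng = np.random.default_rng(111)
df_demo = pd.DataFrame(rng.normal(size=(1000, 5)),
                       columns=[f'col_{i}' for i in range(5)])

# memory of the float64 columns in bytes (index excluded)
mem64 = df_demo.memory_usage(index=False).sum()

# downcast all columns to float32 and measure again
df_demo32 = df_demo.astype(np.float32)
mem32 = df_demo32.memory_usage(index=False).sum()

print('float64 [bytes]:', mem64)
print('float32 [bytes]:', mem32)
```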

In [None]:
# preview - train
df_train.head(10)

In [None]:
# train set overview
df_train.info(verbose=True, show_counts=True)

In [None]:
# preview - test
df_test.head(10)

In [None]:
# test set overview
df_test.info(verbose=True, show_counts=True)

<a id='features_target'></a>
# Targets and Features

In [None]:
# targets (extract from submission file)
targets = [x for x in df_sub.columns.tolist() if x not in ['sample_id']]

# numerical features
features_num = [x for x in df_train.columns.tolist() if x not in ['sample_id']+targets]

# categorical features
features_cat = []

# all features combined
features = features_num + features_cat

In [None]:
# output of dimensions
print("Number of numerical features:", len(features_num))
print("Number of categorical features:", len(features_cat))
print("Number of targets:", len(targets))
print("Size of subset:", n_rows)

### Targets

In [None]:
# basic stats - targets
df_train[targets].describe()

In [None]:
# plot target distributions in compact matrix form
n_plot_cols = 4
n_plot_rows = int(np.ceil(len(targets) / n_plot_cols)) # rows needed to fit all targets
fig, axs = plt.subplots(n_plot_rows, n_plot_cols, figsize=(16,350))
for current_ax, t in zip(axs.flat, targets):
    current_ax.hist(df_train[t], bins=100, color=default_color_3)
    current_ax.set_title('Target ' + str(t))
    current_ax.grid()
plt.show()

### Features

In [None]:
# basic stats - train
df_train[features_num].describe()

In [None]:
# basic stats - test
df_test[features_num].describe()

In [None]:
# plot histograms for numerical features (train and test)
for f in features_num:
    plt.figure(figsize=(12,2))
    ax1 = plt.subplot(1,2,1)
    df_train[f].plot(kind='hist', bins=100, color=default_color_1)
    plt.title(f + ' - Train')
    plt.grid()
    ax2 = plt.subplot(1,2,2, sharex=ax1)
    df_test[f].plot(kind='hist', bins=100, color=default_color_2)
    plt.title(f + ' - Test')
    plt.grid()
    plt.show()

In [None]:
# compact boxplot of all features - train only
n_plot_cols = 60 # number of features per plot
n = len(features_num)
n_plot_rows = int(np.ceil(n / n_plot_cols))
for i in range(n_plot_rows):
    a = n_plot_cols*i
    b = min(a + n_plot_cols, n)
    print('Columns', a+1, 'to', b)
    # DataFrame.plot creates its own figure, so no separate plt.figure() call is needed
    df_train[features_num[a:b]].plot(kind='box', figsize=(15,5))
    plt.xticks(rotation=90)
    plt.grid()
    plt.show()

In [None]:
# boxplots (train and test)
for f in features_num:
    plt.figure(figsize=(14,0.5))
    ax1 = plt.subplot(1,2,1)
    df_temp = df_train[f].dropna() # boxplot does not like missings...
    plt.boxplot(df_temp, vert=False)
    plt.title(f + ' - Train')
    plt.grid()
    ax2 = plt.subplot(1,2,2, sharex=ax1)
    df_temp = df_test[f].dropna()
    plt.boxplot(df_temp, vert=False)
    plt.title(f + ' - Test')
    plt.grid()
    plt.show()

<a id='corr'></a>
# Correlations

### Targets

In [None]:
# calc and plot correlation matrix
cor_p_target = df_train[targets].corr(method='pearson')
plt.figure(figsize=(14,12))
sns.heatmap(cor_p_target, annot=False, cmap='RdYlGn',
            vmin=-1, vmax=+1)
plt.title('Targets - Pearson Correlation')
plt.show()

### Features (train)

In [None]:
# calc and plot correlation matrix
cor_p_train = df_train[features_num].corr(method='pearson')
plt.figure(figsize=(14,12))
sns.heatmap(cor_p_train, annot=False, cmap='RdYlGn',
            vmin=-1, vmax=+1)
plt.title('Features - Pearson Correlation (train)')
plt.show()

### Features (test)

In [None]:
# calc and plot correlation matrix
cor_p_test = df_test[features_num].corr(method='pearson')
plt.figure(figsize=(14,12))
sns.heatmap(cor_p_test, annot=False, cmap='RdYlGn',
            vmin=-1, vmax=+1)
plt.title('Features - Pearson Correlation (test)')
plt.show()
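
With several hundred columns, the heatmaps are hard to read at the level of individual pairs. A possible complement is to list the most strongly correlated pairs from a correlation matrix. The sketch below runs on synthetic data; the helper `top_correlated_pairs` and the frame `df_demo` are hypothetical names introduced for illustration, not part of the analysis above.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(cor, n=5):
    """Hypothetical helper: the n pairs with the largest absolute correlation."""
    # indices of the strict upper triangle, so each pair appears once
    # and the diagonal (self-correlation) is excluded
    rows, cols = np.triu_indices_from(cor.values, k=1)
    pairs = pd.DataFrame({
        'var_1': cor.index.to_numpy()[rows],
        'var_2': cor.columns.to_numpy()[cols],
        'corr': cor.values[rows, cols],
    })
    # constant columns yield NaN correlations; drop those pairs
    pairs = pairs.dropna(subset=['corr'])
    pairs = pairs.sort_values('corr', key=np.abs, ascending=False)
    return pairs.head(n).reset_index(drop=True)

# synthetic example: 'b' is almost a multiple of 'a', 'c' is independent
rng = np.random.default_rng(111)
x = rng.normal(size=500)
df_demo = pd.DataFrame({
    'a': x,
    'b': 2*x + 0.01*rng.normal(size=500),
    'c': rng.normal(size=500),
    'd': -x + rng.normal(size=500),
})
cor_demo = df_demo.corr(method='pearson')
top = top_correlated_pairs(cor_demo, n=3)
print(top)
```

The same call could be applied to `cor_p_target`, `cor_p_train` or `cor_p_test`.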

In [None]:
# export results
df_train.to_csv('df_train_subset.csv')
cor_p_target.to_csv('cor_p_target.csv')
cor_p_train.to_csv('cor_p_train.csv')
cor_p_test.to_csv('cor_p_test.csv')

<a id='other'></a>
# Other Explorations

### Target vs row index

In [None]:
# plot target values
for t in targets:
    plt.figure(figsize=(14,2))
    plt.scatter(df_train.index, df_train[t], color=default_color_3,
                alpha=0.25, s=1)
    plt.title(t)
    plt.grid()
    plt.show()

### Features vs row index

In [None]:
# plot feature values
for f in features_num:
    plt.figure(figsize=(14,2))
    plt.scatter(df_train.index, df_train[f], color=default_color_1,
                alpha=0.25, s=1)
    plt.title(f)
    plt.grid()
    plt.show()

<a id='import_col'></a>
# Import individual columns (full data)

#### To work with the full dataset, we can import just a subset of columns at a time, as done in this section.

In [None]:
# define columns (has to be a list)
n_max = 20 # columns with index 0..n_max
cols_select = ['state_t_' + str(t) for t in range(0,n_max+1)]
print(cols_select)

In [None]:
# load only selected columns
t1 = time.time()
df_col = pl.read_csv('../input/'+folder+'/train.csv', columns=cols_select).to_pandas()
t2 = time.time()
print('Elapsed time [s]: ', np.round(t2-t1,2))
print('Number of rows: ', df_col.shape[0])
print('Number of cols: ', df_col.shape[1])
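
The same idea can be repeated for one variable after another, so that only a single group of columns is in memory at any time. The sketch below illustrates this on a tiny in-memory CSV and uses the `usecols` parameter of pandas' `read_csv` rather than polars, so that it runs without the competition files. The CSV content, the two-level layout and the `summary` dictionary are made up for illustration.

```python
import io
import pandas as pd

# tiny synthetic CSV standing in for train.csv (values are made up)
csv_text = (
    "sample_id,state_t_0,state_t_1,state_u_0,state_u_1\n"
    "a,200.0,210.0,1.0,-1.0\n"
    "b,210.0,220.0,2.0,-2.0\n"
    "c,220.0,230.0,3.0,-3.0\n"
)

prefixes = ['state_t', 'state_u']  # variables to process one by one
n_levels = 2                       # 60 in the real data

summary = {}
for p in prefixes:
    # build the column names of this variable and read only those
    cols = [f'{p}_{k}' for k in range(n_levels)]
    df_part = pd.read_csv(io.StringIO(csv_text), usecols=cols)
    # keep only a small summary; df_part is replaced in the next iteration
    summary[p] = df_part.describe().loc[['mean', 'min', 'max']]

print(summary['state_t'])
```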

In [None]:
# basic stats
df_col.describe(percentiles=[0.01,0.1,0.25,0.5,0.75,0.9,0.99])

In [None]:
# plot distributions
for f in cols_select:
    plt.figure(figsize=(10,3))
    plt.hist(df_col[f], bins=1000, color=default_color_1)
    plt.title(f + ' - full data')
    plt.grid()
    plt.show()

#### 💡 state_t_0 shows some unusually high values, let's have a closer look:

In [None]:
# boxplot
plt.figure(figsize=(10,0.5))
plt.boxplot(df_col.state_t_0, vert=False)
plt.title('state_t_0')
plt.grid()
plt.show()

In [None]:
# let's check the most extreme outliers
df_col[df_col.state_t_0>400]
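
The threshold of 400 above was chosen by eye. A data-driven alternative is the interquartile range (IQR) rule, which is also what the boxplot whiskers are based on (matplotlib's default factor is 1.5). The sketch below applies it to synthetic data with a few injected extreme values. The helper `iqr_outlier_mask`, the factor `k=3.0` and all numbers are assumptions for illustration, not findings about the dataset.

```python
import numpy as np
import pandas as pd

def iqr_outlier_mask(s, k=1.5):
    """Hypothetical helper: True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k*iqr) | (s > q3 + k*iqr)

# synthetic series (not the real state_t_0) with three injected extremes
rng = np.random.default_rng(111)
demo = pd.Series(rng.normal(loc=215, scale=5, size=1000))
demo.iloc[:3] = [450.0, 480.0, 500.0]

# a larger factor than 1.5 flags only the most extreme values
mask = iqr_outlier_mask(demo, k=3.0)
print('Number of flagged values:', mask.sum())
print(demo[mask])
```

On the real data this would be called as `iqr_outlier_mask(df_col.state_t_0, k=3.0)`.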

### Correlations:

In [None]:
# calc and plot correlation matrix (full data, selected columns)
cor_p_train_few_cols = df_col[cols_select].corr(method='pearson')
plt.figure(figsize=(14,10))
sns.heatmap(cor_p_train_few_cols, annot=True, cmap='RdYlGn',
            fmt='.2f', linecolor='black', linewidth=.5,
            vmin=-1, vmax=+1)
plt.title('Features - Pearson Correlation (train/selected columns)')
plt.show()