# Correlation Analysis with Python

This example uses **pandas** and **scikit-learn** for data analysis and **Vega-Altair** for plotting.
Install the libraries with the following commands.

```shell
pip install pandas
pip install "altair[all]"
pip install -U scikit-learn
```

In [30]:
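import pandas as pd
import altair as alt
from sklearn.linear_model import LinearRegression

# The dataset has more than 5,000 rows (Altair's default limit),
# so enable the VegaFusion data transformer before plotting
alt.data_transformers.enable("vegafusion")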

In [10]:
# Load the dataset
file_path = './data/CAMELS_CH_obs_based_2152.csv'  # Adjust to your actual file path
data = pd.read_csv(file_path, sep=';')

# Preview the first few rows to understand the structure
data.head()

|   | date | discharge_vol(m3/s) | discharge_spec(mm/d) | waterlevel(m) | precipitation(mm/d) | temperature_min(°C) | temperature_mean(°C) | temperature_max(°C) | rel_sun_dur(%) | swe(mm) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1981-01-01 | 35.556 | 1.363 | 430.839 | 5.28 | -8.99 | -3.34 | 1.86 | 0.87 | NaN |
| 1 | 1981-01-02 | 35.459 | 1.359 | 430.838 | 8.95 | -7.52 | -5.08 | -0.91 | 0.06 | NaN |
| 2 | 1981-01-03 | 35.619 | 1.365 | 430.84 | 29.21 | -4.55 | -0.59 | 2.91 | 5.41 | NaN |
| 3 | 1981-01-04 | 38.814 | 1.488 | 430.888 | 35.71 | -7.88 | -2.91 | 0.25 | 0.04 | NaN |
| 4 | 1981-01-05 | 41.653 | 1.597 | 430.93 | 7.14 | -9.62 | -8.12 | -6.91 | 0.56 | NaN |
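
`read_csv` loads the `date` column as plain text unless told otherwise. The following is a minimal sketch of parsing it at load time, assuming the real file uses the same `;` separator and `date` column name as the preview above. The inline sample stands in for the CSV and reproduces only three of its columns.

```python
import io
import pandas as pd

# Inline sample standing in for the real file (assumption: same ';' separator
# and 'date' column name; only three columns are reproduced here).
sample = io.StringIO(
    "date;discharge_vol(m3/s);precipitation(mm/d)\n"
    "1981-01-01;35.556;5.28\n"
    "1981-01-02;35.459;8.95\n"
    "1981-01-03;35.619;29.21\n"
)

# parse_dates converts the text column to datetime64 while loading
sample_data = pd.read_csv(sample, sep=';', parse_dates=['date'])

# A datetime index allows selecting by period, e.g. a single month
sample_data = sample_data.set_index('date')
january = sample_data.loc['1981-01']
print(january)
```

With a datetime index in place, the same `.loc` pattern can restrict the analysis to a season or a range of years before computing correlations.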


In [11]:
# Basic summary statistics
data.describe()

|   | discharge_vol(m3/s) | discharge_spec(mm/d) | waterlevel(m) | precipitation(mm/d) | temperature_min(°C) | temperature_mean(°C) | temperature_max(°C) | rel_sun_dur(%) | swe(mm) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 14610.0 | 14610.0 | 14610.0 | 14610.0 | 14610.0 | 14610.0 | 14610.0 | 14610.0 | 8157.0 |
| mean | 109.042704 | 4.179872 | 431.51922 | 4.615357 | 0.860379 | 4.514891 | 8.321746 | 41.838642 | 122.031984 |
| std | 67.337396 | 2.581206 | 0.578028 | 8.268318 | 6.524576 | 6.924003 | 7.680192 | 36.104614 | 124.521328 |
| min | 22.392 | 0.858 | 430.557 | 0.0 | -26.75 | -22.71 | -17.72 | 0.03 | 0.0 |
| 25% | 54.379 | 2.08425 | 431.037 | 0.01 | -3.99 | -0.72 | 2.4725 | 4.3925 | 9.85 |
| 50% | 89.192 | 3.419 | 431.405 | 0.67 | 1.04 | 4.55 | 8.24 | 36.45 | 81.51 |
| 75% | 152.37825 | 5.84125 | 431.953 | 5.93 | 6.09 | 9.95 | 14.21 | 76.61 | 209.21 |
| max | 470.26 | 18.026 | 433.792 | 93.34 | 16.69 | 22.1 | 28.94 | 99.91 | 597.32 |


## Creating a Correlation Matrix

A correlation matrix holds a correlation coefficient (Pearson's r by default) for every pair of numeric columns. It does not tell you whether a correlation is statistically significant, that is, whether a correlation of this size could plausibly have arisen by chance in unrelated data.
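
To make the coefficient less of a black box, here is a minimal sketch of Pearson's r computed from its definition: the sum of products of deviations from the mean, divided by the square root of the product of the summed squared deviations. The helper name `pearson_r` and the sample values are made up for illustration and are not part of the dataset.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r from its definition (covariance scaled by both spreads)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()   # deviations of x from its mean
    dy = y - y.mean()   # deviations of y from its mean
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Made-up illustration values, not taken from the dataset
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

r_manual = pearson_r(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]   # library result for comparison
print(f"manual: {r_manual:.4f}, numpy: {r_numpy:.4f}")
```

Both values should agree up to floating-point rounding. The result always lies between -1 (perfect decreasing line) and 1 (perfect increasing line).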

In [12]:
# Calculate correlation matrix (default = Pearson)
correlation_matrix = data.corr(numeric_only=True)
correlation_matrix

|   | discharge_vol(m3/s) | discharge_spec(mm/d) | waterlevel(m) | precipitation(mm/d) | temperature_min(°C) | temperature_mean(°C) | temperature_max(°C) | rel_sun_dur(%) | swe(mm) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| discharge_vol(m3/s) | 1.0 | 1.0 | 0.990703 | 0.117266 | 0.569567 | 0.561815 | 0.544722 | -0.059546 | -0.090138 |
| discharge_spec(mm/d) | 1.0 | 1.0 | 0.990703 | 0.117267 | 0.569568 | 0.561816 | 0.544723 | -0.059546 | -0.090137 |
| waterlevel(m) | 0.990703 | 0.990703 | 1.0 | 0.122967 | 0.593996 | 0.586448 | 0.56924 | -0.061233 | -0.102539 |
| precipitation(mm/d) | 0.117266 | 0.117267 | 0.122967 | 1.0 | 0.053461 | 0.016646 | -0.004874 | -0.43292 | -0.036128 |
| temperature_min(°C) | 0.569567 | 0.569568 | 0.593996 | 0.053461 | 1.0 | 0.984455 | 0.957767 | 0.110849 | -0.463726 |
| temperature_mean(°C) | 0.561815 | 0.561816 | 0.586448 | 0.016646 | 0.984455 | 1.0 | 0.991201 | 0.217544 | -0.432833 |
| temperature_max(°C) | 0.544722 | 0.544723 | 0.56924 | -0.004874 | 0.957767 | 0.991201 | 1.0 | 0.292959 | -0.406579 |
| rel_sun_dur(%) | -0.059546 | -0.059546 | -0.061233 | -0.43292 | 0.110849 | 0.217544 | 0.292959 | 1.0 | -0.034477 |
| swe(mm) | -0.090138 | -0.090137 | -0.102539 | -0.036128 | -0.463726 | -0.432833 | -0.406579 | -0.034477 | 1.0 |


In [20]:
# Display correlation for selected pair (precipitation vs discharge)
correlation = correlation_matrix.loc['precipitation(mm/d)', 'discharge_vol(m3/s)']
print(f"Correlation between precipitation and discharge: {correlation:.2f}")

Correlation between precipitation and discharge: 0.12
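
Instead of looking up one pair at a time, it can help to rank every variable by the strength of its correlation with discharge. The sketch below copies the discharge column of the matrix above into a Series so that it runs on its own. In the notebook you would presumably start from `correlation_matrix['discharge_vol(m3/s)']` instead.

```python
import pandas as pd

# Discharge column of the correlation matrix, copied from the output above
discharge_corr = pd.Series({
    'discharge_vol(m3/s)': 1.0,
    'discharge_spec(mm/d)': 1.0,
    'waterlevel(m)': 0.990703,
    'precipitation(mm/d)': 0.117266,
    'temperature_min(°C)': 0.569567,
    'temperature_mean(°C)': 0.561815,
    'temperature_max(°C)': 0.544722,
    'rel_sun_dur(%)': -0.059546,
    'swe(mm)': -0.090138,
})

# Drop the trivial self-correlation, then sort by strength while keeping the sign
ranked = (
    discharge_corr
    .drop('discharge_vol(m3/s)')
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(ranked)
```

Sorting by absolute value keeps strong negative correlations near the top, where a plain descending sort would push them to the bottom.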


### Statistical Significance

Significance can be tested with a p-value, which requires another library: SciPy (installed automatically as a dependency of scikit-learn). We will not cover this during the introduction.

```python
from scipy.stats import pearsonr

# Use the DataFrame loaded above (named `data` in this notebook)
corr, p_value = pearsonr(data['precipitation(mm/d)'], data['discharge_vol(m3/s)'])

# The correlation coefficient gives the strength and direction of the linear relationship.
# The p-value is the probability of seeing a correlation at least this strong
# if the two variables were in fact uncorrelated.
# By convention, p < 0.05 is usually treated as statistically significant.
print(f'Pearson correlation: {corr:.2f}')
print(f'p-value: {p_value:.5f}')
```
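
The idea behind a p-value can also be illustrated without a formula. The sketch below uses a permutation test, which is a different technique from the analytic p-value that `pearsonr` reports: shuffling one variable destroys any real link, so the shuffled correlations show what chance alone produces. The data is synthetic and the sample size, effect strength and number of permutations are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic data with a weak linear relationship (illustration only)
n = 500
x = rng.normal(size=n)
y = 0.15 * x + rng.normal(size=n)

observed = np.corrcoef(x, y)[0, 1]

# Shuffling y breaks any link to x, so each shuffled r is a draw from "pure chance"
n_permutations = 2000
shuffled_r = np.array([
    np.corrcoef(x, rng.permutation(y))[0, 1]
    for _ in range(n_permutations)
])

# Two-sided p-value: share of shuffled correlations at least as extreme as the observed one
p_value = (np.abs(shuffled_r) >= abs(observed)).mean()
print(f"observed r = {observed:.3f}, permutation p-value = {p_value:.4f}")
```

A small p-value here means that very few shuffled datasets produced a correlation as strong as the observed one.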

## Linear Regression

In [36]:
# Example: predict discharge (y) from precipitation (X)
precipitation = data['precipitation(mm/d)']
discharge = data['discharge_vol(m3/s)']

X = precipitation.values.reshape(-1, 1)
y = discharge.values

In [37]:
model = LinearRegression()
model.fit(X, y)
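
`LinearRegression.fit` raises an error when the input contains missing values. Precipitation and discharge are complete (count 14610 in the summary statistics), but `swe(mm)` has only 8157 values, so rows with gaps have to be dropped before fitting. A minimal sketch, using a hypothetical helper `fit_pair` and made-up sample values:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def fit_pair(frame, x_col, y_col):
    """Fit y_col on x_col after dropping rows where either value is missing."""
    clean = frame.dropna(subset=[x_col, y_col])
    X = clean[[x_col]].to_numpy()   # 2-D array, shape (n_rows, 1)
    y = clean[y_col].to_numpy()
    return LinearRegression().fit(X, y), len(clean)

# Made-up values with gaps, standing in for the real columns
toy = pd.DataFrame({
    'swe(mm)': [0.0, 10.0, np.nan, 30.0, 40.0, np.nan],
    'discharge_vol(m3/s)': [50.0, 48.0, 47.0, 44.0, 42.0, 40.0],
})

toy_model, n_used = fit_pair(toy, 'swe(mm)', 'discharge_vol(m3/s)')
print(f"rows used: {n_used}, slope: {toy_model.coef_[0]:.3f}")
```

Returning the number of rows used makes it obvious how much data was discarded, which matters when almost half of a column is missing.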

In [38]:
regression_data = pd.DataFrame({
    'Precipitation': precipitation,
    'Discharge': discharge,
    'Predicted Discharge': model.predict(X)
})

In [39]:
# Scatter plot with regression line using Altair
scatter_plot = alt.Chart(regression_data).mark_point().encode(
    x='Precipitation',
    y='Discharge',
    tooltip=['Precipitation', 'Discharge']
)

regression_line = alt.Chart(regression_data).mark_line(color='red').encode(
    x='Precipitation',
    y='Predicted Discharge'
)

(scatter_plot + regression_line).properties(
    width=600,
    height=400,
    title=f'Linear Regression: Precipitation vs Discharge (r={correlation:.2f})'
).show()

# Summary of regression
print(f"Regression Coefficient (Slope): {model.coef_[0]:.3f}")
print(f"Intercept: {model.intercept_:.3f}")

# Encourage students to experiment with other pairs of variables
print("\nTry analyzing other pairs like temperature vs discharge, or SWE vs discharge!")

Regression Coefficient (Slope): 0.955
Intercept: 104.635

Try analyzing other pairs like temperature vs discharge, or SWE vs discharge!
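
The slope and intercept printed above can be cross-checked against the correlation matrix and the summary statistics. For a regression with a single predictor, the least-squares slope equals r times the ratio of the two standard deviations, and the fitted line passes through the point of means. A short sketch with the values copied from the earlier outputs:

```python
# Values copied from the describe() and corr() outputs earlier in this document
r = 0.117266                 # correlation of precipitation and discharge_vol
std_precip = 8.268318        # std of precipitation(mm/d)
std_discharge = 67.337396    # std of discharge_vol(m3/s)
mean_precip = 4.615357       # mean of precipitation(mm/d)
mean_discharge = 109.042704  # mean of discharge_vol(m3/s)

# For simple (one-predictor) least squares: slope = r * s_y / s_x
slope = r * std_discharge / std_precip

# The fitted line passes through (mean_x, mean_y)
intercept = mean_discharge - slope * mean_precip

# Share of the variance in discharge explained by the linear fit
r_squared = r ** 2

print(f"slope: {slope:.3f}, intercept: {intercept:.3f}, r^2: {r_squared:.3f}")
```

The results should agree with the scikit-learn fit to about three decimals. The small r² (roughly 0.014) is a reminder that same-day precipitation alone explains very little of the variation in daily discharge, even though the slope is positive.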
