# CS5830 - Group15 - Project 6

## Baseflow Dataset Analysis + Linear Regression

#### Dataset

- `Date` – number of days since 01/01/0000
- `Segment id` – an identifier of the segment of river; it can be treated as a categorical variable
- `x/y` – the spatial location of the gaging station at which observations are obtained
- `Evapotranspiration` – the evapotranspiration amount of an area adjacent to the river segment in the given month
- `Precipitation` – the precipitation amount of an area adjacent to the river segment in the given month
- `Irrigation pumping` – the amount of groundwater pumped out for irrigation in an area adjacent to the river segment in the given month
- `Observed` – observed baseflow \[target\]

#### Imports

In [None]:
import pandas as pd
import numpy as np
from scipy import stats

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

#### Data Preparation

In [None]:
df = pd.read_csv("./data/RRCA_baseflow.csv")

df['Date'] = df['Date'] - 693963    # Shift Date by a fixed offset of 693963 days

df.info()
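
The dataset description gives `Date` as a day count starting at 01/01/0000. A minimal sketch of turning such raw counts (before the offset above is subtracted) into calendar dates follows. It assumes the counts use the MATLAB `datenum` convention, where day 1 is 0000-01-01; that convention, the helper `datenum_to_date`, and the sample values are illustrative assumptions, not something the dataset documentation confirms.

```python
from datetime import date

import pandas as pd

# Assumption: raw values follow the MATLAB datenum convention (day 1 is
# 0000-01-01). Python's proleptic ordinal starts at 0001-01-01 and year 0
# has 366 days, hence the 366-day offset.
DATENUM_TO_ORDINAL_OFFSET = 366


def datenum_to_date(datenum: int) -> date:
    """Convert a whole-day, datenum-style count to a calendar date."""
    return date.fromordinal(int(datenum) - DATENUM_TO_ORDINAL_OFFSET)


# Hypothetical raw values, used only to illustrate the conversion.
raw_dates = pd.Series([730486, 730517])
calendar_dates = pd.to_datetime(raw_dates.map(datenum_to_date))
print(calendar_dates.dt.strftime("%Y-%m-%d").tolist())
```

If the assumption holds, a datetime column like this could be used for the monthly grouping that a look at the data over time would need.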

### Analysis

In [None]:
target = 'Observed'

X = df.drop(columns=[target])    # Features: every column except the target
y = df[target]

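# Plot each feature against the target and print summary statistics
for col in ['Date', 'x', 'y', 'Evapotranspiration', 'Precipitation', 'Irrigation_pumping']:
    data = df[col]
    plt.figure(col)
    # Scatter plot with a fitted regression line
    ax = sns.regplot(x=data, y=y, scatter_kws={'s':2})
    ax.set_title(f'{col} vs Observed')
    ax.set_ylabel("Observed")
    ax.set_xlabel(col)
    # plt.savefig(f'figures/{col}-plot.pdf')
    plt.show()
    print(col)
    # pearsonr returns the correlation coefficient and its p-value
    print(f'Pearson Test: {stats.pearsonr(data, y)}')
    print(f'Average: {data.mean()}')
    # tstd computes the sample standard deviation (ddof=1)
    print(f'Standard Deviation: {stats.tstd(data)}')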

# TODO: Look at data over time

### Linear Regression

In [None]:
train, test = train_test_split(df, test_size=0.2, random_state=123)

X_train = train.drop(columns=[target])
y_train = train[target]

X_test = test.drop(columns=[target])
y_test = test[target]

model = LinearRegression()
model.fit(X_train, y_train)

print("Original:", model.score(X_test, y_test))

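# Columns to standardize. Columns not listed here (e.g. Date) are dropped,
# because make_column_transformer defaults to remainder='drop', so this
# model sees fewer features than the unscaled one above.
numerical_data = ['x', 'y', 'Evapotranspiration', 'Precipitation', 'Irrigation_pumping']

ct = make_column_transformer(
    (StandardScaler(), numerical_data),
)

pipe = make_pipeline(
    ct,
    LinearRegression()
)
pipe.fit(X_train, y_train)

# score() reports R^2 on the held-out test set
print("Scaled:", pipe.score(X_test, y_test))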
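
The dataset description notes that the segment identifier can be treated as a categorical variable. The sketch below shows one way that could look: a column transformer that scales numeric columns, one-hot encodes the segment identifier, and passes any remaining columns through. It runs on synthetic data; the column name `Segment_id`, the segment values, and the effect sizes are all made-up assumptions for illustration, and the scores are in-sample rather than a result for the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data. "Segment_id" is an assumed column spelling and
# every value below is invented for the example.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "Segment_id": rng.choice([40, 51, 62], size=n),
    "Precipitation": rng.normal(size=n),
    "Irrigation_pumping": rng.normal(size=n),
})
# Per-segment offsets that are deliberately not linear in the id itself.
segment_effect = demo["Segment_id"].map({40: 0.0, 51: 5.0, 62: -3.0})
demo["Observed"] = (
    segment_effect
    + 2.0 * demo["Precipitation"]
    + rng.normal(scale=0.1, size=n)
)

X_demo = demo.drop(columns=["Observed"])
y_demo = demo["Observed"]

# Scale numeric columns, one-hot encode the segment id, keep anything else.
encoded = make_column_transformer(
    (StandardScaler(), ["Precipitation", "Irrigation_pumping"]),
    (OneHotEncoder(handle_unknown="ignore"), ["Segment_id"]),
    remainder="passthrough",
)
encoded_pipe = make_pipeline(encoded, LinearRegression())
encoded_pipe.fit(X_demo, y_demo)

# Baseline: the segment id is fed in as an ordinary number.
numeric_only = LinearRegression().fit(X_demo, y_demo)

# In-sample R^2 for both variants (no train/test split in this sketch).
print("Segment_id as number:", numeric_only.score(X_demo, y_demo))
print("Segment_id one-hot:  ", encoded_pipe.score(X_demo, y_demo))
```

Setting `remainder="passthrough"` keeps unlisted columns in the model instead of silently dropping them, which makes comparisons between pipelines easier to interpret.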