# Things to do: 
- Lag Variable function updates
- Plotting total demand by 30min and group by week day
- Outlier handling: How we want to approach outliers found (Trimming, Capping, Discretization) 
- Plot demand vs season
- CDD & HDD and what does it do?

In [147]:
import pandas as pd
import os
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings

In [148]:
warnings.filterwarnings('ignore')

In [149]:
# pd.options.display.max_columns = 50
# pd.options.display.max_rows = 50
# pd.options.display.width = 120
# pd.options.display.float_format = '{:.2f}'.format

# Loading Data
Loading data from the processed and combined csv file into the dataframe to commence preprocessing and cleansing.

In [150]:
source_data = r'./../data/NSW/processed_data.csv'  
# todo: this comes from my earlier stuff, need to add this work to that. no point in having 2 files

In [151]:
source_df = pd.read_csv(source_data).set_index('Unnamed: 0')
source_df

Checking whether there are any NA values that we need to take into consideration, or whether the affected columns can be dropped completely since the data has been smoothed.

In [152]:
source_df.isna().sum()

In [153]:
source_df.columns

In [154]:
source_df.dtypes

In [155]:
updated_df = source_df.copy()  # copy so the in-place edits below do not also change source_df

In [156]:
updated_df.index

In [157]:
updated_df.index=pd.to_datetime(updated_df.index)

In [218]:
updated_df.isna().sum()

Dropping `FORECASTDEMAND_daily` and `TOTALDEMAND_daily`, which contained 33 NA values.

In [158]:
updated_df.drop(['FORECASTDEMAND_daily', 'TOTALDEMAND_daily'], axis=1, inplace=True)

In [159]:
updated_df.isna().sum()

In [160]:
updated_df

In [161]:
updated_df.head()

# Feature Engineering

## DateTime Features
The following section creates date time features.

In [162]:
demand = updated_df.copy()
demand = demand[['totaldemand']]
demand.loc[:, 'dow'] = demand.index.dayofweek
demand.loc[:, 'doy'] = demand.index.dayofyear
demand.loc[:, 'year'] = demand.index.year
demand.loc[:, 'month'] = demand.index.month
demand.loc[:, 'quarter'] = demand.index.quarter
demand.loc[:, 'hour'] = demand.index.hour


In [163]:
demand.head()

In [164]:
demand.iloc[45: 60]

Merge the date-time features into `updated_df`.

In [165]:
demand.isna().sum()

In [166]:
demand.index

In [167]:
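# drop 'totaldemand' from the right-hand side: it is already in updated_df and would be duplicated with _x/_y suffixes
final_df = pd.merge(updated_df, demand.drop(columns='totaldemand'), left_index=True, right_index=True)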

In [168]:
final_df

In [169]:
final_df.isna().sum()

In [170]:
final_df.index

## Adding Season Data
A numeric `season` column is derived from `month` before `final_df` is exported to CSV: 1 = Dec to Feb (summer), 2 = Mar to May (autumn), 3 = Jun to Aug (winter), 4 = Sep to Nov (spring).

In [171]:
final_df['season'] = final_df['month'].apply(lambda month: 1 if month in [12, 1, 2] else
                                  (2 if month in [3, 4, 5] else
                                  (3 if month in [6, 7, 8] else
                                  (4 if month in [9, 10, 11] else None))))

In [172]:
final_df.index

In [173]:
final_df.head()

# Exporting Dataframe to CSV
The final dataframe is being exported to csv so that it can be used for additional analysis and modelling.

In [174]:
# final_df.to_csv(os.path.join('./../data/NSW', 'final_df.csv'))
# todo: this is what i have been working on but i need to add the later work

## Degree Days

In [175]:
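def Degree_Days2(df, HDD_ct=17, CDD_ct=19.5):
    # Mean temperature over each window from 9pm (day i-1) to 9pm (day i)
    Tbar = df.resample('24H', offset='21H').mean()
    # Repeat each window's mean for every 30-minute sample inside that window.
    # Matching on timestamps (rather than blocks of 48 rows) keeps the values
    # aligned when a day has missing samples or the first window is partial.
    Tbar_30min = Tbar.reindex(df.index, method='ffill')
    DD = pd.DataFrame(index=df.index, columns=['HDD', 'CDD'], dtype=float)
    DD['HDD'] = (HDD_ct - Tbar_30min).clip(lower=0)  # max(0, HDD_ct - Tbar)
    DD['CDD'] = (Tbar_30min - CDD_ct).clip(lower=0)  # max(0, Tbar - CDD_ct)
    return Tbar, DD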
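
Heating degree days (HDD) and cooling degree days (CDD) measure how far the daily mean temperature falls below, or rises above, a threshold. With the defaults above, HDD = max(0, 17 - Tbar) and CDD = max(0, Tbar - 19.5). The sketch below works through this on synthetic half-hourly data. The temperatures and dates are made up for illustration, and `degree_days_sketch` is a hypothetical helper that uses calendar days (midnight to midnight) for simplicity rather than the 9pm windows used above.

```python
import numpy as np
import pandas as pd

def degree_days_sketch(temp, hdd_ct=17, cdd_ct=19.5):
    """Hypothetical helper: per-sample HDD/CDD from the calendar-day mean temperature."""
    # Mean temperature of each calendar day, repeated for every sample of that day
    tbar = temp.groupby(temp.index.normalize()).transform('mean')
    return pd.DataFrame({
        'Tbar': tbar,
        'HDD': (hdd_ct - tbar).clip(lower=0),
        'CDD': (tbar - cdd_ct).clip(lower=0),
    })

# Two synthetic days of 30-minute samples: a cold day (10 °C) and a hot day (30 °C)
idx = pd.date_range('2020-01-01', periods=96, freq='30min')
temp = pd.Series(np.repeat([10.0, 30.0], 48), index=idx)

dd = degree_days_sketch(temp)
print(dd.iloc[[0, 95]])
```

On the cold day only HDD is non-zero (17 - 10 = 7) and on the hot day only CDD is non-zero (30 - 19.5 = 10.5), which is why one of the two columns is zero for most rows.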

In [176]:
Degree_Days2(final_df['TEMPERATURE'])

In [177]:
Tbar, DD = Degree_Days2(final_df['TEMPERATURE'])
final_df['HDD'] = DD['HDD']
final_df['CDD'] = DD['CDD']
# fix_me: this is a daily metric for now. repeat same value for each 30 min sample on that day
# fix_me: can we combine these so we don't have 0 values?

In [178]:
Tbar.isna().sum()

In [179]:
DD.isna().sum()

In [180]:
DD['HDD']

In [181]:
Tbar

In [182]:
plt.plot(DD.iloc[1: 10000])
plt.title("Degree Days")
plt.xlabel("Time")
plt.ylabel("Degrees °C")
plt.legend(list(DD.columns))  # name the two lines; a bare legend() finds no labels and warns
plt.show()
# todo: fix warnings
# todo: these are the plotted CDD and HDD vars

# Outliers
Outlier detection finds data points that are unusual relative to the rest of the data set. 
Methods of treating outliers:
- **Trimming**: removing the outlying rows from the dataset.
- **Capping**: choosing an upper and a lower limit and replacing any value beyond a limit with that limit.
- **Discretization**: binning values into groups so that outliers fall into a group and are treated like the other points in that group.
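
As a rough illustration of the three treatments, the sketch below applies each one to a toy series. The values, the limits (10 and 30) and the bin edges are assumptions chosen for the example, not values taken from our data.

```python
import pandas as pd

# Toy series with one obvious high value; the limits are assumed for illustration
s = pd.Series([18.0, 20.0, 22.0, 21.0, 45.0])
lower, upper = 10.0, 30.0

# Trimming: drop the rows that fall outside the limits
trimmed = s[(s >= lower) & (s <= upper)]

# Capping: keep every row but pull extreme values back to the nearest limit
capped = s.clip(lower=lower, upper=upper)

# Discretization: replace values with bins so extremes share a group with their neighbours
binned = pd.cut(s, bins=[-float('inf'), 19, 21, float('inf')], labels=['low', 'mid', 'high'])

print(len(trimmed), capped.max(), binned.iloc[-1])
```

Trimming shortens the series, which breaks a regular 30-minute index, whereas capping and discretization keep every timestamp.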

## Temperature
The following section looks at the outlier identification within the Temperature column of our dataframe.

In [183]:
temp_df = final_df[['TEMPERATURE', 'month', 'season']]

### ScatterPlot of Temperature

In [184]:
plt.figure(figsize=(20, 10))
sns.scatterplot(data=temp_df, x='Unnamed: 0', y='TEMPERATURE', hue='season')
plt.title('Temperature ScatterPlot')
plt.xlabel('Date')
plt.ylabel('Temperature')
plt.show()

### Boxplot of Temperature
A boxplot makes outliers within a dataset easy to see. Several temperature readings sit at the extremes of the distribution and could be considered outliers.

In [185]:
plt.figure(figsize=(20, 10))
sns.boxplot(data=temp_df, x='season', y='TEMPERATURE', hue='season')
plt.show()

## Temperature Outliers
The following section looks further into the outliers identified in the temperature column of our dataset.

In [186]:
temp_highest_allowed = round(temp_df['TEMPERATURE'].mean() + 3 * temp_df['TEMPERATURE'].std(), 2)
temp_lowest_allowed = round(temp_df['TEMPERATURE'].mean() - 3 * temp_df['TEMPERATURE'].std(), 2)
print('Highest Allowed:', temp_highest_allowed)
print('Lowest Allowed:', temp_lowest_allowed)

In [187]:
temp_outliers = temp_df[(temp_df['TEMPERATURE'] > temp_highest_allowed) | (temp_df['TEMPERATURE'] < temp_lowest_allowed)]
print('Total Rows:', len(temp_outliers))
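
The same mean ± 3 standard deviations rule is applied again to price and total demand further down, so it could be wrapped in a small helper. The sketch below is one possible shape for that; `three_sigma_bounds` and `flag_outliers` are hypothetical names, and the data is synthetic.

```python
import pandas as pd

def three_sigma_bounds(series, k=3):
    """Hypothetical helper: return (lowest_allowed, highest_allowed) as mean -/+ k * std."""
    mean, std = series.mean(), series.std()
    return round(mean - k * std, 2), round(mean + k * std, 2)

def flag_outliers(series, k=3):
    """Boolean mask that is True where a value lies outside the k-sigma bounds."""
    low, high = three_sigma_bounds(series, k)
    return (series < low) | (series > high)

# Synthetic example: 99 readings at 20 plus a single reading at 60
s = pd.Series([20.0] * 99 + [60.0])
low, high = three_sigma_bounds(s)
mask = flag_outliers(s)
print(low, high, int(mask.sum()))
```

Note that the mean and standard deviation are themselves pulled towards extreme values, so this rule can miss outliers when there are many of them.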

How do we treat these?
- Cap

In [188]:
# todo: create a separate temp column, check with if statement, cap values if outside threshold
# final_df.loc['2013-02-01']  # problem
# final_df.loc['2016-07-15']  # there is a problem from 16 - 18
# todo: somehow we need to check that that each day has 48 samples
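
One way to approach the 48-samples-per-day check from the todo above is to count rows per calendar day and list the days that differ. This is only a sketch on synthetic data; `incomplete_days` is a hypothetical helper and it assumes a `DatetimeIndex` at 30-minute resolution.

```python
import pandas as pd

def incomplete_days(df, expected=48):
    """Hypothetical helper: count samples per calendar day and return the days that differ from `expected`."""
    counts = df.groupby(df.index.normalize()).size()
    return counts[counts != expected]

# Synthetic half-hourly index covering two days, with three samples removed from day 2
idx = pd.date_range('2020-01-01', periods=96, freq='30min')
demo = pd.DataFrame({'TEMPERATURE': 20.0}, index=idx).drop(idx[[50, 51, 52]])

bad = incomplete_days(demo)
print(bad)
```

Days that are missing entirely do not appear in the counts at all, so they would need a separate comparison against a full `pd.date_range`.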

In [189]:
plt.figure(figsize=(20, 10))
sns.scatterplot(data=temp_outliers, x='Unnamed: 0', y='TEMPERATURE', hue='season')
plt.title('Temperature Outliers ScatterPlot')
plt.xlabel('Date')
plt.ylabel('Temperature')
plt.show()

In [190]:
temp_month_df = temp_df[[]]
# fix_me: what was the plan here? 

### High Temperature Values
High temperature outliers are more frequent from November to March, which suggests that predictions may be less accurate during this period.

In [191]:
high_outliers = temp_outliers.loc[temp_outliers['TEMPERATURE'] > temp_highest_allowed]
t_high_df = high_outliers['month'].value_counts()
t_high_df = t_high_df.reset_index()
t_high_df.columns = ['Month', 'Count']
t_high_df_sorted = t_high_df.sort_values(by='Month', ascending=True)
print(t_high_df_sorted)

In [192]:
# histogram
plt.figure(figsize=(15, 10))
sns.barplot(data=t_high_df_sorted, x='Month', y='Count')
plt.title('Distribution of High Temperature Outliers')
plt.show()

### Low Temperature Values
Low temperature outliers are far fewer than the high temperature outliers. They occur in June and July (winter).

In [193]:
low_outliers = temp_outliers.loc[temp_outliers['TEMPERATURE'] < temp_lowest_allowed]
t_low_df = low_outliers['month'].value_counts()
t_low_df = t_low_df.reset_index()
t_low_df.columns = ['Month','Count']
t_low_df_sorted = t_low_df.sort_values(by='Month', ascending=True)
print(t_low_df_sorted)

In [194]:
# Histogram
plt.figure(figsize=(15, 10))
sns.barplot(data=t_low_df_sorted, x='Month', y='Count')
plt.title('Distribution of Low Temperature Outliers')
plt.show()

Now that we know the upper and lower limits, we can cap the outliers: any value above the upper limit is replaced with the upper limit, and any value below the lower limit is replaced with the lower limit.

In [195]:
# final_df['TEMPERATURE'] = np.where(final_df['TEMPERATURE']>temp_highest_allowed,temp_highest_allowed, np.where(final_df['TEMPERATURE']<temp_lowest_allowed,temp_lowest_allowed,final_df['TEMPERATURE']))

In [196]:
# final_df_temp_highest = round(final_df['TEMPERATURE'].mean() + 3*final_df['TEMPERATURE'].std(),2)
# final_df_temp_lowest = round(final_df['TEMPERATURE'].mean() - 3*final_df['TEMPERATURE'].std(),2)
# print(final_df_temp_highest)
# print(final_df_temp_lowest)

## Price
The following section looks at the outliers that might exist in the price data, which we are going to use for future modelling.

In [197]:
price_outlier_df = final_df[['rrp', 'month', 'season']]
# print(price_outlier_df)

### Scatterplot of Price

In [198]:
plt.figure(figsize=(20, 10))
sns.scatterplot(data=price_outlier_df, x='Unnamed: 0', y='rrp', hue='season')
plt.title('Price ScatterPlot')
plt.xlabel('Date')
plt.ylabel('Price')
plt.show()

### Price Outlier Analysis
Copying the values which are identified as outliers to a dataframe to analyse a little further.

In [199]:
price_highest_allowed = round(price_outlier_df['rrp'].mean() + 3 * price_outlier_df['rrp'].std(), 2)
price_lowest_allowed = round(price_outlier_df['rrp'].mean() - 3 * price_outlier_df['rrp'].std(), 2)

In [200]:
price_outliers = price_outlier_df[(price_outlier_df['rrp']>price_highest_allowed) | (price_outlier_df['rrp']<price_lowest_allowed)]
print(len(price_outliers))

In [201]:
price_outliers

In [202]:
plt.figure(figsize=(20, 10))
sns.scatterplot(data=price_outliers,x='month', y='rrp', hue='season')
plt.title('Scatterplot of Price by Month')
plt.legend()
plt.show()

#### Price Outlier by Month

In [203]:
# low_outliers = temp_outliers.loc[temp_outliers['TEMPERATURE']<temp_lowest_allowed]
price_out_df = price_outliers['month'].value_counts()
price_out_df = price_out_df.reset_index()
price_out_df.columns = ['Month', 'Count']
price_out_df_sorted = price_out_df.sort_values(by='Month', ascending=True)
# print(price_out_df_sorted)

In [204]:
# Histogram
plt.figure(figsize=(15, 10))
sns.barplot(data=price_out_df_sorted, x='Month', y='Count')
plt.title('Distribution of Price Outliers')
plt.show()

## Total Demand

In [205]:
total_demand_highest_allowed = round(final_df['TOTALDEMAND'].mean() + 3 * final_df['TOTALDEMAND'].std(), 2)
total_demand_lowest_allowed = round(final_df['TOTALDEMAND'].mean() - 3 * final_df['TOTALDEMAND'].std(), 2)
print(total_demand_highest_allowed)
print(total_demand_lowest_allowed)

In [206]:
total_demand_outliers = final_df[(final_df['TOTALDEMAND'] > total_demand_highest_allowed) | (final_df['TOTALDEMAND'] < total_demand_lowest_allowed)]
print(len(total_demand_outliers))

In [207]:
plt.figure(figsize=(20, 10))
plt.scatter(
    total_demand_outliers.index, 
    total_demand_outliers['TOTALDEMAND'], 
    c=total_demand_outliers['month']
)
plt.colorbar(label='Month')  # points are coloured by month; legend() has no labelled artists to show here
plt.show()

In [208]:
print(total_demand_outliers['month'].value_counts().sort_index())  # counts sorted by month
# reluctant to mess with TOTALDEMAND unless there is a really solid justification

Total Demand outliers can be seen during the months from Nov to Feb. 

## Outlier Handling
Outlier handling is done using an Isolation Forest. The model below is fitted on `TEMPERATURE` only and flags the anomalous rows.

In [209]:
from sklearn.ensemble import IsolationForest

In [210]:
# copy so that adding score columns later does not trigger SettingWithCopyWarning
handle_outliers = final_df[['TEMPERATURE', 'rrp', 'TOTALDEMAND', 'month', 'season']].copy()

In [220]:
final_df

In [211]:
random_state = np.random.RandomState(42)

model = IsolationForest(
    n_estimators=100, 
    max_samples='auto', 
    contamination=0.003,
    random_state=random_state,  # pass the seed so the results are reproducible
)

model.fit(handle_outliers[['TEMPERATURE']])
print(model.get_params())
# decision_function: lower scores are more anomalous (negative = outlier)
handle_outliers['Iso_forest_scores'] = model.decision_function(handle_outliers[['TEMPERATURE']])
# predict: -1 for anomalies, 1 for normal points
handle_outliers['anomaly_score'] = model.predict(handle_outliers[['TEMPERATURE']])
anomaly_df = handle_outliers[handle_outliers['anomaly_score'] == -1]
no_anomaly_df = handle_outliers[handle_outliers['anomaly_score'] == 1]

In [212]:
anomaly_df.columns

In [213]:
anomaly_df

In [214]:
print('Total Anomalies:', len(anomaly_df))
print('Total non-Anomaly:', len(no_anomaly_df))

## Lag Variables (Data Engineering)


In [215]:
# def lag_variable(df, n, var_name):
#     """
#     Creates a lag variable, linear interpolation for NA values created in the shift
#     Parameters:
#         df (dataframe): pandas df
#         n (int): lag size
#         var_name (str): name of variable
        
#     Returns:
#         pd.series with index matching df
#     """
#     #Shifting the variable
#     varShifted = df[var_name].shift(n)
    
#     #Dealing With NA - linear interpolation; the leading NAs are filled with the first available value
#     varShifted = varShifted.interpolate(method='linear', limit_direction='both', axis=0)
#     print(varShifted)
    
#     return varShifted

In [216]:
lag_test = final_df['TEMPERATURE'].shift(2)
lag_test
# todo: new var?

In [217]:
varShifted = final_df['TEMPERATURE'].shift(2)
varShifted = varShifted.interpolate(method='linear', limit_direction='both', axis=0)
varShifted
# todo: new var?
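
The two cells above could be folded back into the commented-out `lag_variable` function and stored as a new column. The sketch below shows one way this might look on a toy dataframe; the column name `TEMPERATURE_lag2` is a hypothetical choice, not a name used elsewhere in this notebook.

```python
import pandas as pd

def lag_variable(df, n, var_name):
    """Return df[var_name] shifted by n rows, with the gaps from the shift filled by interpolation."""
    shifted = df[var_name].shift(n)
    # Linear interpolation cannot extrapolate, so with limit_direction='both'
    # the leading NaNs are filled with the first available value
    return shifted.interpolate(method='linear', limit_direction='both')

demo = pd.DataFrame({'TEMPERATURE': [10.0, 12.0, 14.0, 16.0, 18.0]})
demo['TEMPERATURE_lag2'] = lag_variable(demo, 2, 'TEMPERATURE')
print(demo)
```

Because the first `n` rows are back-filled rather than truly lagged, it may be safer to drop them before modelling.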