## MIT - Safi Project Documentation
This documentation serves the purpose to share current running codes and results, which can be turned into a beta version for testing phase. 

Individual functions are coded in separate python files: 
1. 'data_preparation.py' is the file containing functions to process measurements dataframe (data obtained by Plum air device) as well as forecast dataframe (official forecast from the weather agency). Specifically, it includes functions to: 
    - read raw CSV files for measurement(two per year from 2015S1 to 2020S1) and official government forecast data
    - clean data, fill missing values 
    - smooth wind angle into continuous data using cos and sin functions
    - extract some key features such as (time of the day, seasonality feature)  
    - concatenate everything  into one dataframe 
    
    
2. 'data_process.py' is the file reading the previous cleaned dataframes and outputs data ready to be used for prediction algorithms:
    - at each present time, enrich past $n$ time steps of measurement data
    - at each prediction time, enrich forecast data for the forecasted time
    
    
3. 'main_XGB.py' is the main function to be called to perform XGB regression and classification. You can amend the following arguments:
    - steps_in: number of past data for prediction, default = 48 
    - t_list: prediction time steps, default = [1,3,6,9,12,15,18,21,24,27,30,33,36,39,42,45,48]
    
4. 'main_OCT.py' is the main function to be called to perform Optimal Trees for regression and classification. You can amend in particular the following arguments:
    - steps-in: number of past data for prediction, default = 48 (hours of data)
    - steps-out: what specific hour do you want to build a model for.
    
    
5. 'utils_scenario.py' contains miscellaneous functions
    - in particular from a wind speed and wind direction it outputs the scenario, and dangerous or not. This can be changed easily to reflect any change of policy defining scenarios.

Note: this function can take a long time to run in Jupyter notebook. We have written a python file called 'main_XGB.py' which could be called and ran on a cluster to speed up running time.  

In [5]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import metrics
from xgboost import XGBRegressor
import warnings
warnings.filterwarnings('ignore')

#import local functions
from utils import utils_scenario as utils, data_preparation as prep, data_process as proc
import main_XGB
import pickle

In [10]:
#first let's define some functions 
def run_xgb(steps_in, steps_out):
    """
    This function construct three XGB models to predict speed, cos_wind_direction and sin_wind_direction
    inputs:
        steps_in: number of past measurement data included
        steps_out: prediction time 
    output: 
        predict: predicted dataframe for [speed, cos_wind_direction and sin_wind_direction, angle(reconstructed)]
        true: true dataframe [speed, cos_wind_direction and sin_wind_direction, angle]
        base: official forecast dataframe [speed, cos_wind_direction and sin_wind_direction, angle]
    """
    
    #Parameter list:
    param_list =['speed','cos_wind_dir','sin_wind_dir']

    predict = pd.DataFrame(columns={'speed','cos_wind_dir','sin_wind_dir'})
    true = pd.DataFrame(columns={'speed','cos_wind_dir','sin_wind_dir'})
    baseline = pd.DataFrame(columns={'speed','cos_wind_dir','sin_wind_dir'})

    for param in param_list:
        x_df, y_df, x, y = proc.prepare_x_y(measurement, forecast, steps_in, steps_out, param)
        
        
#         X_train, X_test, y_train, y_test= train_test_split(x, y, test_size=0.2, shuffle = False)
#         xg = XGBRegressor(max_depth = 5)
#         xg.fit(X_train, y_train)
#         y_baseline = x_df
#         y_hat = xg.predict(X_test)

#         predict[param] = pd.Series(y_hat)
#         true[param] = pd.Series(y_test.flatten())
#         baseline[param] = x_df[param+'_forecast'][-len(y_hat):]
    
    with open('model_cols_past_48.pkl', 'wb') as f:
        pickle.dump(x_df.columns, f)
    #reset index
    baseline.reset_index(inplace=True)
    return predict, true, baseline


def scenario_accuracy(predict, true, baseline):
    """
    This function calculates accuracy of scenario prediction based on predicted speed, cos and sin of wind direction
    """
    pred = utils.get_all_scenarios(predict['speed'], predict['cos_wind_dir'],predict['sin_wind_dir'], b_scenarios=True)
    true = utils.get_all_scenarios(true['speed'], true['cos_wind_dir'],true['sin_wind_dir'], b_scenarios=True)
    base = utils.get_all_scenarios(baseline['speed'], baseline['cos_wind_dir'],baseline['sin_wind_dir'], b_scenarios=True)

    #calculate prediction accuracies
    pred_score = metrics.accuracy_score(pred, true).round(3)
    base_score = metrics.accuracy_score(base, true).round(3)

    return  pred_score, base_score

def binary_accuracy(predict, true, baseline):
    """
    This function calculates accuracy of binary (dangerous vs. not dangerous) prediction based on predicted speed, cos and sin of wind direction
    """
    pred = utils.get_all_dangerous_scenarios(predict['speed'], predict['cos_wind_dir'],predict['sin_wind_dir'])
    true = utils.get_all_dangerous_scenarios(true['speed'], true['cos_wind_dir'],true['sin_wind_dir'])
    base = utils.get_all_dangerous_scenarios(baseline['speed'], baseline['cos_wind_dir'],baseline['sin_wind_dir'])

    #calculate prediction accuracies
    pred_score = metrics.accuracy_score(pred, true).round(3)
    base_score = metrics.accuracy_score(base, true).round(3)
    #calculate auc
    pred_auc = metrics.roc_auc_score(pred, true).round(3)
    base_auc = metrics.roc_auc_score(base, true).round(3)
    return  pred_score, base_score, pred_auc, base_auc

def get_mae(predict, true, baseline):
    """
    This function calculates the mean squared error of wind speed and angle prediction 
    """
    speed = metrics.mean_absolute_error(predict['speed'], true['speed'])
    speed_base=metrics.mean_absolute_error(baseline['speed'], true['speed'])
    angle = metrics.mean_absolute_error(predict['angle'], true['angle'])
    angle_base=metrics.mean_absolute_error(baseline['angle'], true['angle'])
    return speed, speed_base, angle, angle_base 
    

In [11]:
#arguments: 
## t_list: list of prediction steps,
## steps_in: how many past n steps are included for each prediction
t_list=[1]
steps_in=48

#get data
measurement=prep.prepare_measurement()
forecast = prep.prepare_forecast()
#keep useful columns
measurement= measurement[['speed', 'cos_wind_dir', 'sin_wind_dir', 'temp', 'radiation', 'precip','season', 'am']]

#set up empty dataframes to record results 
accuracy = pd.DataFrame(columns={})
pred_speed=pd.DataFrame(columns={})
pred_angle=pd.DataFrame(columns={})

#construct XGB model for each prediction time step 
for t in t_list:
    print(t)
    #run model
    predict, true, base = run_xgb(steps_in, steps_out=t)
    
    #calculate angles from sine and cosine  
    predict['angle'] = predict.apply(lambda row : utils.get_angle_in_degree(row['cos_wind_dir'],row['sin_wind_dir']), axis = 1)
    true['angle'] = true.apply(lambda row : utils.get_angle_in_degree(row['cos_wind_dir'],row['sin_wind_dir']), axis = 1)
    base['angle'] = base.apply(lambda row : utils.get_angle_in_degree(row['cos_wind_dir'],row['sin_wind_dir']), axis = 1)
    
    #calculate mae for regression 
    mae_speed, mae_speed_base, mae_angle, mae_angle_base = get_mae(predict, true, base) 
    #calculate accuracy & auc for scenario prediction 
    pred_scenario, base_scenario  = scenario_accuracy(predict, true, base)
    pred_bin_accu, base_bin_accu, pred_bin_auc, base_bin_auc= binary_accuracy(predict, true, base)
    
    
    #record accuracy
    accuracy = accuracy.append({'past_n_steps': str(steps_in),
                                      'pred_n_steps': str(t),
                                      'xgb_scenario_accu': pred_scenario,
                                      'base_scenario_accu': base_scenario,
                                      'xbg_binary_accu':pred_bin_accu,
                                      'base_binary_accu':base_bin_accu,
                                        'xbg_binary_auc':pred_bin_auc,
                                      'base_binary_auc':base_bin_auc,
                                        'xbg_speed_mae': mae_speed,
                                        'base_speed_mae': mae_speed_base,
                                        'xgb_angle_mae': mae_angle,
                                        'base_angle_mae': mae_angle_base}, ignore_index=True)
    #record predicted speed
    pred_speed = pd.concat([pred_speed, predict['speed'].rename('speed_t+'+str(t))], axis=1)
    #record predicted angle
    pred_angle = pd.concat([pred_angle, predict['angle'].rename('angle_t+'+str(t))], axis=1)

    
#return accuracy results and prediction results : 
print(pred_speed)
print(pred_angle)
print(accuracy)

read csv semester csv files from 2015s2 to 2020s1
smooth wind direction
generate seasonality categorical feature
generate am/pm categorical feature
reading forecast data
smooth wind direction
1


ValueError: Cannot set a frame with no defined index and a value that cannot be converted to a Series

In [None]:
#arguments: 
## t_list: list of prediction steps,
## steps_in: how many past n steps are included for each prediction
t_list=[1,3,6,9,12,15,18,21,24,27,30,33,36,39,42,45,48]
steps_in=48

#get data
measurement=prep.prepare_measurement()
forecast = prep.prepare_forecast()
#keep useful columns
measurement= measurement[['speed', 'cos_wind_dir', 'sin_wind_dir', 'temp', 'radiation', 'precip','season', 'am']]

#set up empty dataframes to record results 
accuracy = pd.DataFrame(columns={})
pred_speed=pd.DataFrame(columns={})
pred_angle=pd.DataFrame(columns={})

#construct XGB model for each prediction time step 
for t in t_list:
    print(t)
    #run model
    predict, true, base = run_xgb(steps_in, steps_out=t)
    
    #calculate angles from sine and cosine  
    predict['angle'] = predict.apply(lambda row : utils.get_angle_in_degree(row['cos_wind_dir'],row['sin_wind_dir']), axis = 1)
    true['angle'] = true.apply(lambda row : utils.get_angle_in_degree(row['cos_wind_dir'],row['sin_wind_dir']), axis = 1)
    base['angle'] = base.apply(lambda row : utils.get_angle_in_degree(row['cos_wind_dir'],row['sin_wind_dir']), axis = 1)
    
    #calculate mae for regression 
    mae_speed, mae_speed_base, mae_angle, mae_angle_base = get_mae(predict, true, base) 
    #calculate accuracy & auc for scenario prediction 
    pred_scenario, base_scenario  = scenario_accuracy(predict, true, base)
    pred_bin_accu, base_bin_accu, pred_bin_auc, base_bin_auc= binary_accuracy(predict, true, base)
    
    
    #record accuracy
    accuracy = accuracy.append({'past_n_steps': str(steps_in),
                                      'pred_n_steps': str(t),
                                      'xgb_scenario_accu': pred_scenario,
                                      'base_scenario_accu': base_scenario,
                                      'xbg_binary_accu':pred_bin_accu,
                                      'base_binary_accu':base_bin_accu,
                                        'xbg_binary_auc':pred_bin_auc,
                                      'base_binary_auc':base_bin_auc,
                                        'xbg_speed_mae': mae_speed,
                                        'base_speed_mae': mae_speed_base,
                                        'xgb_angle_mae': mae_angle,
                                        'base_angle_mae': mae_angle_base}, ignore_index=True)
    #record predicted speed
    pred_speed = pd.concat([pred_speed, predict['speed'].rename('speed_t+'+str(t))], axis=1)
    #record predicted angle
    pred_angle = pd.concat([pred_angle, predict['angle'].rename('angle_t+'+str(t))], axis=1)

    
#return accuracy results and prediction results : 
print(pred_speed)
print(pred_angle)
print(accuracy)