In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn import metrics



# Energy Forecast

## Part I. Feature engineering

The goal of the second part of the project is to use past data to predict the peak in advance. We currently have the monthwide average energy consumption for each time interval of the day. This dataset was generated in the earlier peak analysis part and is stored in "monthwide_average.csv". We begin the feature engineering step by clarifying some assumptions.
## 1. Assumptions on generating the feature space:
<ol>
<li> Since we want to predict the peak in advance (by at least 15 min), the main idea behind the feature space is to use only data from before the current time interval as features. We assume that the data from the past hour has an impact on whether there will be a peak in the current interval.</li>
<li> We assume that weather conditions have an impact on whether a peak occurs during the current interval, and that all the weather data we use can be obtained before the current time interval starts. For example, the weather information for 8:00 can be obtained before 8:00. This assumption ensures that every value in the feature space is "seen" data that we already have, with no unknown data included.</li>
</ol>
  
## 2. Components of the feature space:
<ol>
<li> Since we use the past hour of data to predict the peak, the first group of features is the average energy use of each circuit in each of the past 4 time intervals. </li>
<li> In **time series analysis**, the **historical change** of a variable is often very important, so we also add the change in average energy use for each circuit over the past hour. In addition, we add the average energy use of each circuit during the past hour.</li>
<li> The third part of our feature space is the weather information for that time. </li>
</ol>
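As a minimal sketch of the three feature groups above, the snippet below builds lagged values, the one-hour change, and the one-hour average for a single toy circuit with pandas `shift`. The column name `X2` mirrors the dataset; the toy values and the helper name `add_lag_features` are illustrative assumptions, not the notebook's actual code.

```python
import pandas as pd

def add_lag_features(df, col, n_lags=4):
    """Return lag, change, and average features for one circuit column.

    Lag k holds the reading k intervals (k * 15 min) before the current row.
    """
    out = pd.DataFrame(index=df.index)
    for k in range(n_lags, 0, -1):
        out[f"{col}_lag{k}"] = df[col].shift(k)
    lag_cols = list(out.columns)
    # absolute change between the oldest and the most recent lag
    out[f"{col}_change"] = (out[f"{col}_lag{n_lags}"] - out[f"{col}_lag1"]).abs()
    # mean over the lagged readings only
    out[f"{col}_avg"] = out[lag_cols].mean(axis=1)
    # the first n_lags rows lack a full history and are dropped
    return out.iloc[n_lags:]

toy = pd.DataFrame({"X2": [10.0, 12.0, 11.0, 15.0, 14.0, 13.0]})
features = add_lag_features(toy, "X2")
print(features)
```

Every feature in a row is computed from earlier rows only, which matches the assumption that the feature space contains no unseen data.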

### Step 1: Add the past 1 hour data into our feature space

In [2]:
# Import the monthwide average data
# Note: X2, X3,...X72 represent the 71 circuits

data_monthwide= pd.read_csv("monthwide_average.csv")
data_monthwide.head(5)

Unnamed: 0,hour_min,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X69,X70,X71,X72,Label,hour,min,year_mon,month,year
0,0,2569.999878,2779.999817,1769.999969,1521.445237,1378.346362,2132.499939,2814.999939,2770.0,332.999985,...,26474.998535,27224.998535,26899.999512,2624.999939,0,0,0,20160700,7,2016
1,15,2569.999878,2772.133468,1767.53586,1513.517472,1384.441351,2146.986023,2817.76593,2745.710136,333.002252,...,26158.991927,26925.121973,26569.421354,2607.673588,0,0,15,20160700,7,2016
2,30,2569.999878,2762.499939,1471.119179,1268.127825,1205.714195,2147.499878,2819.999939,2722.836491,332.999985,...,22015.902421,22685.22627,22404.523958,2604.999939,0,0,30,20160700,7,2016
3,45,2571.323625,2762.499939,1715.946231,1484.120264,1362.346784,2149.033541,2819.999939,2710.0,333.177592,...,25749.999023,26549.998535,26174.999023,2604.999939,0,0,45,20160700,7,2016
4,100,2574.999817,2762.499939,1401.33408,1200.172838,1162.360573,2149.999817,2819.999939,2711.592212,333.249992,...,20893.533801,21533.086019,21265.799626,2604.999939,0,1,0,20160700,7,2016


**Note** 
1. For each circuit, we want to generate 4 feature vectors: the energy use 1 h, 45 min, 30 min, and 15 min before the current interval.
2. We exclude the first 4 time points because they do not have enough past data.
3. So the final matrix generated in this step has **668 (672 - 4) rows** and **71 * 4 = 284 columns**.
4. 71 * 4 means that each of the 71 circuits has 4 past points recorded.
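
The dimensions in the note can be checked on a toy example. The sketch below assumes a made-up matrix with 10 time points and 3 circuits (the real data has 672 and 71) and builds the lag matrix with plain slicing; `build_lag_matrix` is a hypothetical helper name.

```python
import numpy as np

def build_lag_matrix(circuit_matrix, n_lags=4):
    """Stack the n_lags previous readings of every circuit side by side.

    Output column n_lags * j + k holds circuit j, lagged by (n_lags - k) steps.
    """
    n_rows, n_circuits = circuit_matrix.shape
    out = np.zeros((n_rows - n_lags, n_circuits * n_lags))
    for j in range(n_circuits):
        for k in range(n_lags):
            out[:, n_lags * j + k] = circuit_matrix[k:n_rows - n_lags + k, j]
    return out

toy = np.arange(30, dtype=float).reshape(10, 3)  # 10 time points, 3 circuits
lagged = build_lag_matrix(toy)
print(lagged.shape)  # (10 - 4, 3 * 4)
```

With 672 time points and 71 circuits, the same construction gives the 668 x 284 matrix described above.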

In [3]:
# construct an empty matrix to hold the new features
past_1h_matrix=np.zeros((668,284))

# the previous circuit data matrix (columns X2..X72, one per circuit)
circuit_cols=["X"+str(k) for k in range(2,73)]
circuit_matrix=data_monthwide[circuit_cols].values

# create the feature space: for circuit j, columns 4j..4j+3 hold the
# readings 1 h, 45 min, 30 min and 15 min before the current interval
for j in range(71):
    i=4*j
    past_1h_matrix[:,i]=circuit_matrix[:-4,j]
    past_1h_matrix[:,i+1]=circuit_matrix[1:-3,j]
    past_1h_matrix[:,i+2]=circuit_matrix[2:-2,j]
    past_1h_matrix[:,i+3]=circuit_matrix[3:-1,j]


### Step 2: Add historical change & average for each time interval point during past 1 hour

In [4]:
# construct an empty matrix to hold the "change & average over an hour" variables
# The dimension is 668 * 142: 71 "change" columns (one per circuit) followed by 71 "average" columns
change_1h = np.zeros((668,142))
for i in range(71):
    k=4*i
    # absolute change between the 1 h and the 15 min readings of circuit i
    change_1h[:,i]= np.absolute(past_1h_matrix[:,k]- past_1h_matrix[:,k+3])
    # mean of the four past readings of circuit i
    change_1h[:,i+71]=np.mean(past_1h_matrix[:,k:k+4], axis=1)


In [5]:
feature_matrix=np.hstack((change_1h,past_1h_matrix))

# Generate the column names in the same order as the stacked matrix:
# 71 "change" columns, 71 "average" columns, then 4 past readings per circuit
col_name=[]
for i in range(1,72):
    col_name.append("C"+str(i)+"_change")
for i in range(1,72):
    col_name.append("C"+str(i)+"_avg")
for i in range(1,72):
    col_name.append("C"+str(i)+"_1h")
    col_name.append("C"+str(i)+"_45m")
    col_name.append("C"+str(i)+"_30m")
    col_name.append("C"+str(i)+"_15m")
#col_name

# Construct feature space dataframe
df_feature=pd.DataFrame(feature_matrix, columns=col_name)
df_feature['year_mon']=data_monthwide['year_mon'].values[4:]
df_feature['year']=data_monthwide['year'].values[4:]
df_feature['month']=data_monthwide['month'].values[4:]
df_feature['hour']=data_monthwide['hour'].values[4:]
df_feature['minute']=data_monthwide['min'].values[4:]
#df_feature['label']=data_monthwide['Label'].values[4:]
df_feature=df_feature[["year_mon","year","month","hour","minute"]+col_name]
df_feature.head()

Unnamed: 0,year_mon,year,month,hour,minute,C1_change,C2_change,C3_change,C4_change,C5_change,...,C69_30m,C69_15m,C70_1h,C70_45m,C70_30m,C70_15m,C71_1h,C71_45m,C71_30m,C71_15m
0,20160700,2016,7,1,0,1.323747,17.499878,54.053738,37.324973,15.999578,...,22685.22627,26549.998535,26899.999512,26569.421354,22404.523958,26174.999023,2624.999939,2607.673588,2604.999939,2604.999939
1,20160700,2016,7,1,15,4.999939,9.633529,366.20178,313.344634,222.080779,...,26549.998535,21533.086019,26569.421354,22404.523958,26174.999023,21265.799626,2607.673588,2604.999939,2604.999939,2604.999939
2,20160700,2016,7,1,30,4.999939,0.0,161.285451,145.53202,91.639445,...,21533.086019,20740.028044,22404.523958,26174.999023,21265.799626,20437.985954,2604.999939,2604.999939,2604.999939,2604.999939
3,20160700,2016,7,1,45,3.37487,0.0,338.933895,264.666146,179.289524,...,20740.028044,21725.356083,26174.999023,21265.799626,20437.985954,21374.249394,2604.999939,2604.999939,2604.999939,2604.999939
4,20160700,2016,7,2,0,7.499939,0.0,118.474115,98.481982,57.822576,...,21725.356083,19803.037988,21265.799626,20437.985954,21374.249394,19502.152584,2604.999939,2604.999939,2604.999939,2604.999939


### Step 3: Add the weather conditions to the feature space

The weather data was preprocessed in R, where missing values were imputed with a random forest algorithm. The imputed data is stored in "weather_imputed.csv".

In [6]:
#load the imputed weather data
df_weather=pd.read_csv("weather_imputed.csv")
del df_weather['date']
df_weather.head()


Unnamed: 0,year,month,hour,apparentTemperature,cloudCover,dewPoint,humidity,icon,precipIntensity,precipProbability,pressure,summary,temperature,time,visibility,windBearing,windSpeed
0,2016,7,7,55.2,0.0,48.68,0.79,clear-night,0.0,0.0,1010.21,Clear,55.2,1467356000.0,8.28,222.0,4.17
1,2016,7,8,55.06,0.0,49.47,0.81,clear-night,0.0,0.0,1010.05,Clear,55.06,1467360000.0,8.28,228.0,4.38
2,2016,7,9,54.56,0.15,48.45,0.8,clear-night,0.0,0.0,1009.81,Clear,54.56,1467364000.0,7.94,225.0,5.31
3,2016,7,10,54.62,0.31,49.67,0.83,partly-cloudy-night,0.0,0.0,1009.77,Partly Cloudy,54.62,1467367000.0,8.06,221.0,4.3
4,2016,7,11,54.46,0.15,49.87,0.84,clear-night,0.0,0.0,1009.83,Clear,54.46,1467371000.0,7.62,235.0,3.64


In [7]:
#Check any missing value ---- No
df_weather.isnull().values.any()

False

In [8]:
# Calculate monthwide average statistics for each variable at each hour
# In this part, we focus only on the numeric columns
# (the text columns "icon" and "summary" are dropped).
df_weather_avg=df_weather.groupby(['year','month','hour']).mean(numeric_only=True)

# construct the monthwide weather statistics average dataframe:
# reset_index() turns the (year, month, hour) group keys back into columns
df_weather_avg=df_weather_avg.reset_index()
df_weather_avg.head()

Unnamed: 0,year,month,hour,apparentTemperature,cloudCover,dewPoint,humidity,precipIntensity,precipProbability,pressure,temperature,time,visibility,windBearing,windSpeed
0,2016,7,0,64.600333,0.329333,52.124667,0.646,0.0,0.0,1014.185333,64.600333,1468670000.0,9.3,259.133333,10.575667
1,2016,7,1,64.182667,0.249084,52.982,0.676333,0.0,0.0,1013.887333,64.182667,1468674000.0,9.337,260.033333,9.641667
2,2016,7,2,62.015,0.243764,52.487667,0.714667,0.0,0.0,1013.81,62.015,1468678000.0,9.1,258.333333,9.463667
3,2016,7,3,59.809667,0.247511,52.555333,0.773,0.0,0.0,1013.907333,59.809667,1468681000.0,8.985667,251.4,7.860333
4,2016,7,4,58.130333,0.241958,52.231,0.810667,0.0,0.0,1014.068667,58.130333,1468685000.0,9.085333,248.066667,7.167


In [9]:
# join the weather condition dataframe with the previous feature data frame 
# on year, month and hour

df_feature=pd.merge(df_feature,df_weather_avg, on=["year","month","hour"],how='left' )
df_feature['label']=data_monthwide['Label'].values[4:]

# Final feature space
df_feature.head()

Unnamed: 0,year_mon,year,month,hour,minute,C1_change,C2_change,C3_change,C4_change,C5_change,...,humidity,precipIntensity,precipProbability,pressure,temperature,time,visibility,windBearing,windSpeed,label
0,20160700,2016,7,1,0,1.323747,17.499878,54.053738,37.324973,15.999578,...,0.676333,0.0,0.0,1013.887333,64.182667,1468674000.0,9.337,260.033333,9.641667,0
1,20160700,2016,7,1,15,4.999939,9.633529,366.20178,313.344634,222.080779,...,0.676333,0.0,0.0,1013.887333,64.182667,1468674000.0,9.337,260.033333,9.641667,0
2,20160700,2016,7,1,30,4.999939,0.0,161.285451,145.53202,91.639445,...,0.676333,0.0,0.0,1013.887333,64.182667,1468674000.0,9.337,260.033333,9.641667,0
3,20160700,2016,7,1,45,3.37487,0.0,338.933895,264.666146,179.289524,...,0.676333,0.0,0.0,1013.887333,64.182667,1468674000.0,9.337,260.033333,9.641667,0
4,20160700,2016,7,2,0,7.499939,0.0,118.474115,98.481982,57.822576,...,0.714667,0.0,0.0,1013.81,62.015,1468678000.0,9.1,258.333333,9.463667,0
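
A left join silently leaves `NaN` in the weather columns of any feature row whose (year, month, hour) key has no match, so it can be worth checking the join. The sketch below uses two small made-up frames and the `indicator=True` option of `pd.merge`; the frame contents are illustrative assumptions.

```python
import pandas as pd

features = pd.DataFrame({
    "year": [2016, 2016, 2016],
    "month": [7, 7, 7],
    "hour": [1, 1, 2],
    "minute": [0, 15, 0],
})
weather = pd.DataFrame({
    "year": [2016, 2016],
    "month": [7, 7],
    "hour": [1, 2],
    "temperature": [64.18, 62.02],
})

merged = pd.merge(features, weather, on=["year", "month", "hour"],
                  how="left", indicator=True)
# "_merge" is "both" for matched rows and "left_only" for unmatched ones
unmatched = int((merged["_merge"] == "left_only").sum())
print(unmatched)
```

Because the weather data is hourly, every 15-minute row within the same hour receives the same weather values, as in the output above.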


## Part II. Model building

## First Glance and Basic Ideas

This part is divided into two parts. The first explores **supervised** learning solutions to find a good classifier for predicting the peak. The second tries an **unsupervised** solution. We fit each model on both the whole feature set and the subset of the 50 most important features.

## Main Challenge
Since the peak is defined as the largest average energy use over a 15-min interval in a month, there is only one peak per month. Our data covers seven months, so the dataset contains only 7 peaks and is highly imbalanced. I therefore came up with three models to try: **AdaBoost, weighted logistic regression, and one-class SVM.**
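
To make the imbalance concrete, the following sketch computes the peak share and the accuracy of a trivial classifier that never predicts a peak. The counts (668 rows, 7 peaks) come from the feature matrix above; the variable names are illustrative.

```python
n_rows = 668      # rows in the feature matrix (672 - 4)
n_peaks = 7       # one peak per month over seven months

peak_share = n_peaks / n_rows
# accuracy of a classifier that always predicts "no peak"
baseline_accuracy = (n_rows - n_peaks) / n_rows

print(f"peak share: {peak_share:.2%}")
print(f"always-normal accuracy: {baseline_accuracy:.2%}")
```

A model can therefore reach roughly 99% accuracy without finding a single peak, so accuracy alone says little here.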

## Model Assumptions:
1. The model uses the data from the previous hour to predict whether there will be a peak in a 15-min time period. 
2. Time in our model is encoded in 15-min periods. For example, the model could predict whether there will be a peak between 4:00 am and 4:15 am.
3. The weather data for a specific hour should be obtained beforehand. That is, all the data in our feature space should be known data.

## Performance measure
**Confusion Matrix**: A confusion matrix is one way to evaluate the performance of a classifier. It is a table that shows how often the system mislabelled each class, that is, confused the two classes. Both the rows and the columns are the categories of the response variable: the rows correspond to the actual classes and the columns to the predicted classes, which is also the layout of the confusion matrices computed below.

| Actual (rows) / Predicted (columns) | Class 0         | Class 1         |
|-------------------------------------|-----------------|-----------------|
| Class 0                             | True Negatives  | False Positives |
| Class 1                             | False Negatives | True Positives  |
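
As a small worked example, the sketch below derives recall, precision, and the false alarm rate from a 2 x 2 confusion matrix in scikit-learn's layout (rows are actual classes, columns are predicted classes). The numbers are the ones reported later in this notebook for weighted logistic regression on the 50-feature dataset; the metric names are standard, but their use here is an addition for illustration.

```python
import numpy as np

# rows = actual class, columns = predicted class (scikit-learn's layout)
cm = np.array([[193, 6],
               [2, 1]])
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)        # share of actual peaks that were caught
precision = tp / (tp + fp)     # share of predicted peaks that were real
false_alarm_rate = fp / (fp + tn)

print(f"recall={recall:.3f}, precision={precision:.3f}, "
      f"false alarm rate={false_alarm_rate:.3f}")
```

Since peaks are the main concern, recall on class 1 is the number to watch, with the false alarm rate as its cost.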


## Adaboost

The **reason** we chose AdaBoost is that it is an ensemble method and, most importantly, it **reweights the training points adaptively**, which makes it behave somewhat like a cost-sensitive method. AdaBoost starts from a weak learner, say a decision stump, and subsequently adds weak learners to the ensemble to build a strong learner. After each round, the weight vector is adjusted so that misclassified points receive more weight.
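
The reweighting step can be sketched in a few lines. The example below follows the textbook discrete AdaBoost update for labels in {-1, +1}; it illustrates the idea rather than scikit-learn's exact implementation, and the toy labels and predictions are made up.

```python
import numpy as np

y_true = np.array([1, -1, -1, -1, -1])   # one "peak" among four normal points
y_weak = np.array([-1, -1, -1, -1, -1])  # a weak learner that misses the peak

w = np.full(len(y_true), 1 / len(y_true))   # start with uniform weights
err = np.sum(w[y_true != y_weak])           # weighted error of the weak learner
alpha = 0.5 * np.log((1 - err) / err)       # weight of this learner in the ensemble

# misclassified points are scaled up, correct ones scaled down, then renormalized
w = w * np.exp(-alpha * y_true * y_weak)
w = w / w.sum()
print(w.round(3))
```

After one round, the missed peak carries half of the total weight, so the next weak learner is pushed to classify it correctly.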

## Whole datasets

In [29]:
# Split the data into train and test
# First, we split the peaks randomly into train and test
# Since the data is highly imbalanced and the number of peaks is very small, we split each class separately
df_peak=df_feature[df_feature['label']==1]
x_train_1, x_test_1, y_train_1, y_test_1 = train_test_split(df_peak.iloc[:,:-1].values, df_peak.iloc[:,-1].values, test_size=0.3)

# Then we split the normal data, whose labels are 0
df_normal=df_feature[df_feature['label']==0]
x_train_2, x_test_2, y_train_2, y_test_2 = train_test_split(df_normal.iloc[:,:-1].values, df_normal.iloc[:,-1].values, test_size=0.3)

In [30]:
y_train_ib=np.concatenate((y_train_1, y_train_2), axis=0)
x_train_ib=np.concatenate((x_train_1, x_train_2), axis=0)
y_test_ib=np.concatenate((y_test_1, y_test_2), axis=0)
x_test_ib=np.concatenate((x_test_1, x_test_2), axis=0)
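
The per-class split above can also be expressed with the `stratify` argument of `train_test_split`, which keeps the class proportions in both parts. The sketch below uses synthetic data with the same class counts as this notebook (661 normal points, 7 peaks); the random features and the fixed `random_state` are assumptions added for reproducibility. Stratifying the combined data allocates test peaks proportionally, so the number of peaks in the test set may differ from the two-step split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(668, 5))              # synthetic stand-in features
y = np.array([1] * 7 + [0] * 661)          # 7 peaks, 661 normal points

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

print(y_train.sum(), y_test.sum())  # peaks in train and test
```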

In [31]:
ada=AdaBoostClassifier()
ada.fit(x_train_ib,y_train_ib)
y_pred=ada.predict_proba(x_test_ib)
y_score=y_pred[:,1]
y_p_ada=ada.predict(x_test_ib)
fpr, tpr, threshold = metrics.roc_curve(y_test_ib, y_score)
roc_auc = metrics.auc(fpr, tpr)
roc_auc

0.75711892797319935

In [32]:
# build confusion matrix
cm_ada=metrics.confusion_matrix(y_test_ib,y_p_ada)
df_cm_ada= pd.DataFrame(cm_ada, columns=['0','1'])
df_cm_ada

Unnamed: 0,0,1
0,199,0
1,3,0


## Top 50 most important feature dataset

In [33]:
# Calculate the feature importance scores with an extremely randomized trees (ExtraTrees) classifier
X=df_feature.iloc[:,:-1].values
y=df_feature.iloc[:,-1].values
forest = ExtraTreesClassifier(n_estimators=250,random_state=0)
forest.fit(X, y)
importances = forest.feature_importances_
df_feature_importance=pd.DataFrame({'Feature':list(df_feature.columns)[:-1], 'Importance':importances})
df_feature_importance=df_feature_importance.sort_values(by='Importance',ascending=False)

# keep the names of the 50 highest-ranked features
feature_50=list(df_feature_importance['Feature'].head(50).values)
df_feature_50=df_feature[feature_50+['label']]

# Split the data into train and test
# First, we split the peaks randomly into train and test
# Since the data is highly imbalanced and the number of peaks is very small, we split each class separately
df_peak_50=df_feature_50[df_feature_50['label']==1]
x_train_1, x_test_1, y_train_1, y_test_1 = train_test_split(df_peak_50.iloc[:,:-1].values, df_peak_50.iloc[:,-1].values, test_size=0.3)

# Then we split the normal data, whose labels are 0
df_normal_50=df_feature_50[df_feature_50['label']==0]
x_train_2, x_test_2, y_train_2, y_test_2 = train_test_split(df_normal_50.iloc[:,:-1].values, df_normal_50.iloc[:,-1].values, test_size=0.3)

y_train_50=np.concatenate((y_train_1, y_train_2), axis=0)
x_train_50=np.concatenate((x_train_1, x_train_2), axis=0)
y_test_50=np.concatenate((y_test_1, y_test_2), axis=0)
x_test_50=np.concatenate((x_test_1, x_test_2), axis=0)

In [34]:
ada=AdaBoostClassifier()
ada.fit(x_train_50,y_train_50)
y_pred=ada.predict_proba(x_test_50)
y_score=y_pred[:,1]
y_p_ada=ada.predict(x_test_50)
fpr, tpr, threshold = metrics.roc_curve(y_test_50, y_score)
roc_auc = metrics.auc(fpr, tpr)
roc_auc

0.79396984924623115

In [35]:
# build confusion matrix
cm_ada=metrics.confusion_matrix(y_test_50,y_p_ada)
df_cm_ada= pd.DataFrame(cm_ada, columns=['0','1'])
df_cm_ada

Unnamed: 0,0,1
0,198,1
1,3,0


## Result analysis:

As we can see from the confusion matrices, AdaBoost fails to predict any of the peaks in the test set, for both the whole dataset and the 50-feature dataset. Peaks are the main concern in our problem, so I believe AdaBoost is not a good method here.

## Weighted logistic regression
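
With `class_weight='balanced'`, scikit-learn weights each class by `n_samples / (n_classes * count)`, so each sample of the rare peak class counts far more than a normal one. The sketch below reproduces that formula with NumPy on the class counts of the full dataset (661 normal points, 7 peaks); the training split would give slightly different numbers.

```python
import numpy as np

y = np.array([0] * 661 + [1] * 7)
counts = np.bincount(y)                       # samples per class
weights = len(y) / (len(counts) * counts)     # the "balanced" heuristic

print(dict(zip([0, 1], weights.round(3))))
```

Each peak is weighted roughly 94 times as heavily as a normal point, which is what lets the model trade some false alarms for catching peaks.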


In [36]:
wlr=LogisticRegression(penalty='l1',solver='liblinear',class_weight='balanced')
wlr.fit(x_train_ib,y_train_ib)
y_pred=wlr.predict_proba(x_test_ib)
y_score=y_pred[:,1]
y_p_lr=wlr.predict(x_test_ib)
fpr, tpr, threshold = metrics.roc_curve(y_test_ib, y_score)
roc_auc = metrics.auc(fpr, tpr)
roc_auc

0.74204355108877718

In [37]:
cm_lr=metrics.confusion_matrix(y_test_ib,y_p_lr)
df_cm_lr= pd.DataFrame(cm_lr, columns=['0','1'])
df_cm_lr

Unnamed: 0,0,1
0,199,0
1,3,0


In [38]:
wlr=LogisticRegression(penalty='l1',solver='liblinear',class_weight='balanced')
wlr.fit(x_train_50,y_train_50)
y_pred=wlr.predict_proba(x_test_50)
y_score=y_pred[:,1]
y_p_lr=wlr.predict(x_test_50)
fpr, tpr, threshold = metrics.roc_curve(y_test_50, y_score)
roc_auc = metrics.auc(fpr, tpr)
roc_auc

0.6649916247906198

In [39]:
cm_lr=metrics.confusion_matrix(y_test_50,y_p_lr)
df_cm_lr= pd.DataFrame(cm_lr, columns=['0','1'])
df_cm_lr

Unnamed: 0,0,1
0,193,6
1,2,1


## Result analysis:

As we can see from the confusion matrices, weighted logistic regression fails to predict any of the peaks on the whole dataset, and on the 50-feature dataset it catches only one of the three test peaks while raising six false alarms. Peaks are the main concern in our problem, so I believe weighted logistic regression is not a good method here.

## One-class SVM

One-class SVM is widely applied in time-series anomaly detection. It is an unsupervised learning method: it assumes that all training samples share the same label and learns a soft boundary around the normal data. Any observation that falls outside the soft boundary is declared an anomaly. One can choose different kernels, such as RBF, polynomial, and linear kernels. In our project, I found that the **linear kernel** outperformed the other kernels.
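
A small sketch of how scikit-learn encodes one-class SVM predictions may help when reading the outputs: `predict` returns +1 for points inside the learned boundary and -1 for points outside it. The synthetic Gaussian cluster, the RBF kernel, and `nu=0.1` below are assumptions chosen for illustration and are not the settings used in this notebook.

```python
import numpy as np
from sklearn import svm

rng = np.random.default_rng(0)
normal_points = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

clf = svm.OneClassSVM(kernel="rbf", gamma="scale", nu=0.1)
clf.fit(normal_points)

queries = np.array([[0.0, 0.0],     # centre of the training cluster
                    [10.0, 10.0]])  # far away from every training point
raw = clf.predict(queries)          # +1 = inlier, -1 = outlier

# recode so that 1 means "anomaly" (outside the boundary) and 0 means "normal"
is_anomaly = (raw == -1).astype(int)
print(raw, is_anomaly)
```

When peaks are labelled 1 in the ground truth, this recoding keeps predictions and labels on the same scale before building a confusion matrix.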

### Whole Dataset

In [40]:
clf = svm.OneClassSVM(kernel="linear")
clf.fit(x_train_ib)

OneClassSVM(cache_size=200, coef0=0.0, degree=3, gamma='auto',
      kernel='linear', max_iter=-1, nu=0.5, random_state=None,
      shrinking=True, tol=0.001, verbose=False)

In [41]:
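# OneClassSVM.predict returns +1 for inliers and -1 for outliers.
# Here -1 is recoded to 0, so 1 marks a point inside the learned boundary.
y_pred_test = clf.predict(x_test_ib)
y_pred_test[y_pred_test==-1]=0
y_pred_test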

array([ 0.,  0.,  1.,  0.,  0.,  1.,  0.,  1.,  0.,  1.,  1.,  0.,  1.,
        0.,  0.,  1.,  0.,  1.,  1.,  1.,  0.,  1.,  0.,  0.,  0.,  1.,
        1.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,
        0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,  1.,  1.,
        0.,  1.,  1.,  1.,  0.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,  0.,
        0.,  1.,  1.,  1.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  1.,  0.,
        0.,  0.,  1.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,
        1.,  0.,  0.,  1.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  0.,  1.,  1.,  0.,  0.,  0.,  1.,  1.,  1.,
        1.,  1.,  1.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,  0.,  1.,  1.,
        1.,  1.,  0.,  0.,  1.,  1.,  0.,  0.,  1.,  1.,  0.,  0.,  1.,
        0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,
        1.,  1.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,
        0.,  1.,  1.,  1.,  1.,  1.,  0.,  0.,  1.,  1.,  0.,  1

In [42]:
y_test_ib

array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [43]:
#confusion matrix
cm_oneSVM=metrics.confusion_matrix(y_test_ib,y_pred_test)
df_cm_oneSVM= pd.DataFrame(cm_oneSVM, columns=['0','1'])
df_cm_oneSVM

Unnamed: 0,0,1
0,105,94
1,2,1


### 50-Feature Dataset

In [44]:
clf = svm.OneClassSVM(kernel="linear")
clf.fit(x_train_50)

OneClassSVM(cache_size=200, coef0=0.0, degree=3, gamma='auto',
      kernel='linear', max_iter=-1, nu=0.5, random_state=None,
      shrinking=True, tol=0.001, verbose=False)

In [45]:
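# OneClassSVM.predict returns +1 for inliers and -1 for outliers.
# Here -1 is recoded to 0, so 1 marks a point inside the learned boundary.
y_pred_test = clf.predict(x_test_50)
y_pred_test[y_pred_test==-1]=0
y_pred_test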

array([ 1.,  1.,  1.,  1.,  0.,  0.,  0.,  1.,  1.,  1.,  1.,  1.,  0.,
        1.,  1.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,  0.,  1.,  0.,  1.,
        0.,  0.,  0.,  1.,  1.,  0.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  1.,  1.,
        1.,  1.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  1.,  0.,  1.,  1.,  1.,  1.,  0.,  0.,  1.,
        1.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,  0.,  1.,  1.,  1.,
        0.,  1.,  1.,  1.,  1.,  0.,  1.,  0.,  0.,  1.,  1.,  1.,  0.,
        0.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  1.,
        1.,  1.,  0.,  1.,  0.,  1.,  1.,  0.,  0.,  0.,  1.,  0.,  1.,
        0.,  1.,  1.,  0.,  1.,  1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,
        1.,  0.,  1.,  1.,  1.,  0.,  1.,  1.,  0.,  0.,  0.,  1.,  0.,
        1.,  0.,  0.,  1.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,
        1.,  1.,  1.,  0.,  1.,  0.,  0.,  1.,  1.,  1.,  1.,  0

In [46]:
y_test_50

array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [47]:
#confusion matrix
cm_oneSVM=metrics.confusion_matrix(y_test_50,y_pred_test)
df_cm_oneSVM= pd.DataFrame(cm_oneSVM, columns=['0','1'])
df_cm_oneSVM

Unnamed: 0,0,1
0,104,95
1,0,3


## Result Analysis:
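The one-class SVM trained on the 50 most important features assigns prediction 1 to all three peaks in the (unseen) test set, at the cost of also assigning 1 to 95 of the 199 normal points. One caveat on interpretation: scikit-learn's `OneClassSVM.predict` returns -1 for outliers and +1 for inliers, and the code above recodes -1 to 0, so a prediction of 1 marks a point inside the learned boundary rather than an anomaly. Reading these predictions as peak detections, we conclude that **one-class SVM outperformed the other models and could be our final model for predicting the monthly peak energy use**, but the label mapping should be verified before relying on this result.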

## Summary and Conclusion:

We found that **One-Class SVM** trained on the 50 most important features performs well at flagging the peaks in the test set. One-class SVM assumes that all training samples share the same label and learns a soft boundary around the normal data. Any observation that falls outside the soft boundary is declared a peak, or anomaly. 

However, there are still some problems with this final model. It does not perform well on normal points: in the test set, 95 of the 199 normal points receive the same prediction as the peaks. According to the customer, if they can find peaks in advance, they can take preventative measures. If the preventative measures are cheap, then this is a good model, because it can detect all the peaks and could help the customer save a lot of money. However, if the preventative measures are expensive, then this model may not be an appropriate option. In that case, we may want to collect more data to train our model, or resample our highly imbalanced data with methods like **bootstrapping**. Through bootstrapping, we can resample more abnormal points (i.e. peaks) and re-fit all the models above to see whether there is any improvement.
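
The resampling idea can be sketched as random oversampling, with replacement, of the peak class. The example below uses NumPy on synthetic training data with the training-split class counts from this notebook (4 peaks, 462 normal points); the helper name `oversample_minority`, the target count, and the seeds are illustrative assumptions. Resampling should be applied to the training split only, so that duplicated peaks do not leak into the test set.

```python
import numpy as np

def oversample_minority(X, y, minority_label=1, target_count=100, seed=0):
    """Append bootstrap copies of minority rows until they number target_count."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == minority_label)
    n_extra = max(target_count - len(minority_idx), 0)
    # draw row indices with replacement from the existing minority rows
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    X_new = np.vstack([X, X[extra_idx]])
    y_new = np.concatenate([y, y[extra_idx]])
    return X_new, y_new

X_train = np.random.default_rng(1).normal(size=(466, 3))  # synthetic features
y_train = np.array([1] * 4 + [0] * 462)                   # 4 peaks, 462 normal
X_bal, y_bal = oversample_minority(X_train, y_train)
print(X_bal.shape, int(y_bal.sum()))
```

Because the extra rows are exact copies of only four peaks, this adds weight rather than new information, so collecting more months of data remains the stronger remedy.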
