# <strong>Analysis Notebook</strong>

<em>By: Loh Zhi Shen</em>

<em>Last updated: 25 August 2022</em>

<strong>Summary:</strong>
* Analysed solar power generation data in Python to identify underperforming solar panels.

---

## <strong>Analysis Of The Problem</strong>

<strong>Problem statement:</strong>

    >> Can we identify faulty or suboptimally performing equipment?

In the context of energy generation, faulty or suboptimally performing equipment refers to equipment that produces significantly less power than normally functioning equipment.

Thus, the problem is an anomaly detection problem.

However, an anomaly could be a solar panel that is either overproducing or underproducing. Therefore, a secondary method is needed to filter out the overproducing solar panels.

This is where statistical hypothesis testing comes in. We can perform a hypothesis test with the following hypotheses:

H0 (null hypothesis): power generated is equal to the mean power generated.

H1 (alternative hypothesis): power generated is less than the mean power generated.

If we were to reject the null hypothesis, there would be strong evidence to show that the solar panel is underperforming.
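A minimal sketch of such a one-sided test is shown below. It assumes the expected power and its standard deviation are already known; the function name `is_underperforming` and the readings are hypothetical and are not taken from the dataset.

```python
import numpy as np
import scipy.stats as st

def is_underperforming(observed, expected, std, significance_level=0.0001):
    """One-sided (lower-tail) z-test: flag readings significantly below the expected value."""
    # standardised distance of each reading from the expected value
    z = (np.asarray(observed, dtype=float) - expected) / std
    # lower-tail critical value of the standard normal (roughly -3.72 for 0.0001)
    critical_value = st.norm.ppf(significance_level)
    return z < critical_value

# hypothetical readings: normal, slightly low, very low, very high
flags = is_underperforming([100.0, 95.0, 40.0, 160.0], expected=100.0, std=10.0)
print(flags)
```

Because only the lower tail is tested, the very high reading is not flagged, which is how the test filters out overproducing panels.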

## <strong>Imports</strong>


In [None]:
# linear algebra library
import numpy as np

# data processing library
import pandas as pd

# data visualization library
import plotly.express as px
import plotly.graph_objects as go

# outlier detection model
from sklearn.linear_model import HuberRegressor

# hypothesis testing
import scipy.stats as st

In [None]:
# import the dataset
generation_df1 = pd.read_csv("dataset/Plant_1_Generation_Data.csv", parse_dates = ["DATE_TIME"], dayfirst = True)
weather_df1 = pd.read_csv("dataset/Plant_1_Weather_Sensor_Data.csv", parse_dates = ["DATE_TIME"], dayfirst = True)

generation_df2 = pd.read_csv("dataset/Plant_2_Generation_Data.csv", parse_dates = ["DATE_TIME"], dayfirst = True)
weather_df2 = pd.read_csv("dataset/Plant_2_Weather_Sensor_Data.csv", parse_dates = ["DATE_TIME"], dayfirst = True)

In [None]:
generation_df1.head()

In [None]:
weather_df1.head()

In [None]:
generation_df2.head()

In [None]:
weather_df2.head()

## <strong>Univariate Analysis</strong>

In [None]:
generation_df1.describe()

In [None]:
generation_df2.describe()

In [None]:
weather_df1.describe()

In [None]:
weather_df2.describe()

Based on these summary statistics, the datasets from the 2 plants should be analysed separately, as their summary statistics differ significantly.

From here on, the analysis focuses on plant 1. A similar method should also work on the data from plant 2.

## <strong>Bivariate Analysis</strong>

In [None]:
generation_df1['TIME'] = generation_df1['DATE_TIME'].dt.time
generation_df1['DATE'] = generation_df1['DATE_TIME'].dt.date
generation_df1['POWER'] = generation_df1['DC_POWER'] + generation_df1['AC_POWER']

In [None]:
# power generated against time of the day
fig = px.line(generation_df1, x = 'TIME', y = 'POWER', color = 'DATE', symbol = 'SOURCE_KEY')
fig.update_layout(showlegend = False)
fig.show()

Based on this plot, we can see that solar power generation occurs from 6am to 6.30pm. Thus, we should only feed our models data from these hours to reduce the computational cost.
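One way to apply such a time-of-day filter is sketched below. The frame `toy` and its timestamps are made up for illustration, and the 6am to 6.30pm window is taken from the observation above.

```python
import pandas as pd

# hypothetical timestamps standing in for the DATE_TIME column
toy = pd.DataFrame({
    'DATE_TIME': pd.to_datetime([
        '2020-05-15 05:45', '2020-05-15 06:00', '2020-05-15 12:15',
        '2020-05-15 18:30', '2020-05-15 18:45',
    ])
})

# minutes elapsed since midnight, computed without a Python-level loop
minutes = toy['DATE_TIME'].dt.hour * 60 + toy['DATE_TIME'].dt.minute

# keep rows between 06:00 and 18:30 (both ends inclusive)
daylight = toy[minutes.between(6 * 60, 18 * 60 + 30)]
print(daylight)
```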

In [None]:
# power generated against date
fig = px.line(generation_df1, x = 'DATE', y = 'POWER', color = 'TIME', symbol = 'SOURCE_KEY')
fig.update_layout(showlegend = False)
fig.show()

It seems like the power does not vary predictably with the date, so the date should be excluded from the anomaly detection model.

However, the data only spans a 34-day period, and there could be long-term trends which are not evident in the current set of data. If such a trend were to exist, we ought to include the date in the model.

In [None]:
# joining the 2 dataframes
df = generation_df1.merge(weather_df1, on= 'DATE_TIME', how = 'left', suffixes = ('_GENERATION', '_WEATHER'))
df.head(10)

In [None]:
# power against ambient temperature
fig = px.scatter(df, x = 'AMBIENT_TEMPERATURE', y = 'POWER', color = 'TIME')
fig.update_layout(showlegend = False)
fig.show()

In [None]:
px.scatter(df, x = 'TIME', y = 'AMBIENT_TEMPERATURE')

Although it looks like there is a relationship between power and ambient temperature, its effect appears to be highly correlated with time.

Because of this strong correlation, it might not be a good variable to use.

In [None]:
# power against module temperature
fig = px.scatter(df, x = 'MODULE_TEMPERATURE', y = 'POWER', color = 'TIME')
fig.update_layout(showlegend = False)
fig.show()

In [None]:
px.scatter(df, x = 'TIME', y = 'MODULE_TEMPERATURE')

All that was said about ambient temperature can also be applied to module temperature.

In [None]:
# power against irradiation
fig = px.scatter(df, x = 'IRRADIATION', y = 'POWER', color = 'TIME')
fig.update_layout(showlegend = False)
fig.show()

In [None]:
px.scatter(df, x = 'TIME', y = 'IRRADIATION')

Just like the 2 temperature readings, irradiation also seems to be correlated with time.

However, it exhibits the most linear relationship with power, making it the best variable to use to predict power generation.
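The visual comparison could be backed up numerically by computing the Pearson correlation of each candidate variable with power. The sketch below uses synthetic data, so its numbers say nothing about the real dataset; it only illustrates how the check might be done.

```python
import numpy as np
import pandas as pd

# synthetic stand-in for the merged dataframe (all values are made up)
rng = np.random.default_rng(0)
n = 500
irradiation = rng.uniform(0.0, 1.0, n)
toy = pd.DataFrame({
    'IRRADIATION': irradiation,
    # temperature only loosely follows irradiation in this toy example
    'AMBIENT_TEMPERATURE': 25 + 5 * irradiation + rng.normal(0, 3, n),
    # power is close to linear in irradiation in this toy example
    'POWER': 1000 * irradiation + rng.normal(0, 30, n),
})

# Pearson correlation of each candidate predictor with power
correlations = toy[['AMBIENT_TEMPERATURE', 'IRRADIATION']].corrwith(toy['POWER'])
print(correlations.sort_values(ascending=False))
```

A higher correlation coefficient indicates a more linear relationship, which is the property the linear model relies on.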

## <strong>Modelling</strong>

Based on the exploratory data analysis above, the model will consist of a linear regression model to predict the expected power generation given an irradiation level.

> power = a1 * irradiation + b + error

The difference between the actual and predicted values will be the outlier score that we will use in our hypothesis testing to find underperforming solar panels.

The error will have a normal distribution with mean 0 and unknown variance.
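Under these assumptions, the test can be written out as follows. The symbols $r_i$, $\hat{\sigma}$ and $z_i$ are introduced here for illustration; the variance estimate mirrors the one used in the `Model` class below.

$$
r_i = \text{power}_i - (a_1 \cdot \text{irradiation}_i + b), \qquad
\hat{\sigma}^2 = \frac{1}{n - 1} \sum_{i=1}^{n} r_i^2, \qquad
z_i = \frac{r_i}{\hat{\sigma}}
$$

$$
\text{reject } H_0 \text{ for point } i \text{ if } z_i < \Phi^{-1}(\alpha), \qquad \alpha = 1 - \text{confidence level}
$$

With a confidence level of 0.9999, $\alpha = 10^{-4}$ and $\Phi^{-1}(\alpha) \approx -3.72$, so a point is flagged only when its power is more than about 3.72 estimated standard deviations below the regression line.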

In [None]:
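class Model:
    """Flags points that fall significantly below a robust regression line."""

    def __init__(self, confidence_level = 0.9999):
        # Huber regression is less sensitive to outliers than ordinary least squares
        self.model = HuberRegressor()
        self.significance_level = 1 - confidence_level

    def fit(self, X, Y):
        # lower-tail critical value of the standard normal distribution
        self.critical_value = st.norm.ppf(self.significance_level)

        self.model.fit(X, Y)
        residuals = Y - self.model.predict(X)

        # estimate the error variance from the residuals
        self.variance = np.sum(residuals**2) / (len(X) - 1)

    def predict(self, X, Y):
        # standardise the residuals and flag those below the critical value
        residuals = Y - self.model.predict(X)
        test_statistic = residuals / self.variance**0.5
        outlier = test_statistic < self.critical_value
        return outlier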

In [None]:
# clean up dataframe
def to_minutes(value):
    # minutes elapsed since midnight
    return int(value.hour) * 60 + int(value.minute)

# keep only the columns needed by the model
df = df[['DATE_TIME', 'PLANT_ID_GENERATION', 'SOURCE_KEY_GENERATION', 'POWER', 'IRRADIATION']]

df = df.dropna()
# drop readings taken before 6am or after 6pm
df = df.drop(index = df.loc[df['DATE_TIME'].apply(to_minutes) < 6 * 60].index)
df = df.drop(index = df.loc[df['DATE_TIME'].apply(to_minutes) > 18 * 60].index)
df.head()

In [None]:
model = Model()
model.fit(df[['IRRADIATION']], df['POWER'])
results = model.predict(df[['IRRADIATION']], df['POWER'])
px.scatter(df, x = 'IRRADIATION', y = 'POWER', color = ["outlier" if result else "inlier" for result in results], 
    hover_data=['DATE_TIME', 'SOURCE_KEY_GENERATION'])

Based on the graph, the model can identify data points which are too far below the regression line. These data points are likely to be from underperforming solar panels, so the goal of identifying faulty or suboptimal solar panels has been achieved.
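To move from flagged data points to specific pieces of equipment, the flags could be aggregated per source key. The sketch below is one possible follow-up, not part of the original analysis; the frame `flagged`, its `OUTLIER` column and the keys A, B and C are hypothetical stand-ins for `df` and `results`.

```python
import pandas as pd

# hypothetical flags; in the notebook these would come from `results`
flagged = pd.DataFrame({
    'SOURCE_KEY_GENERATION': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'OUTLIER': [True, True, False, False, False, False, False, True, False],
})

# number and share of flagged readings for each source key
summary = (
    flagged.groupby('SOURCE_KEY_GENERATION')['OUTLIER']
    .agg(outlier_count='sum', outlier_rate='mean')
    .sort_values('outlier_rate', ascending=False)
)
print(summary)
```

Source keys with a persistently high outlier rate would be the first candidates for inspection.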