In [None]:
"""
Data Source:
https://github.com/mohammad-q-cells/test-data

Items to address:

1.	Data preprocessing and cleaning: What are the preprocessing steps you have taken to clean the data?
How do you handle missing values? Please describe and list all the methods and the rationale behind them.
Include images where necessary.

2.	Explore the data and generate insights from data. It is open ended. Please describe and list all the methods.
Include image where it is necessary

3.	What statistical test did you perform on the data to check its stationarity, co-integration etc.
Please state your reasoning behind a particular test. (please include results and pictures)

4.	Develop a prediction model to forecast energy production for the next 30 days.
a. what type of model you used and describe the reasons behind using this particular model 
b. what are features, how do you come up with these features 
c. how do you validate the results?

"""

In [None]:
# Importing needed modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import csv
import warnings
warnings.filterwarnings('ignore')

In [None]:
"""
Before cleaning the data and handling missing data points, I do a quick exploration to get a feel for the data.
This gives me useful hints for data cleansing and ideas for feature engineering.
pandas makes this preliminary exploration quick and easy.
"""

In [None]:
# Read the CSV file with pandas and check its format
url = 'https://raw.githubusercontent.com/mohammad-q-cells/test-data/main/data-for-test2.csv'
df = pd.read_csv(url)
print(df.head(20))

In [None]:
# Check the column data types and the number of entries
df.info()

In [None]:
df.plot()
plt.show()

In [None]:
"""
Outliers were detected!
"""

In [None]:
# I also visualize the data to get a feeling about the data (features and the target variable(s))
f, axs = plt.subplots(2, 2, figsize=(12, 12))
axs[0,0].hist(df['irradiance'])
axs[0,0].set(xlabel='irradiance (W/m^2)', ylabel='count')
axs[0,1].hist(df['humidity'])
axs[0,1].set(xlabel='humidity', ylabel='count')
axs[1,0].hist(df['temperature'])
axs[1,0].set(xlabel='temperature (C)', ylabel='count')
axs[1,1].hist(df['energy'])
axs[1,1].set(xlabel='energy (kW)', ylabel='count')

In [None]:
"""
A few very large values (outliers) are preventing the histograms from effectively representing the population distribution.
"""

In [None]:
df.describe()

In [None]:
"""
I always check the minimum and maximum values and use my domain knowledge to see if the data makes sense.
(If I do not have the domain knowledge, I try to gain some insight into the quantities represented in the data and their limits.)
As the histograms also suggest, the maximum values of irradiance, temperature, and energy are far from the corresponding means,
so those data points need to be taken care of. Other measures of central tendency, such as the median and mode, can also be used for this comparison.
In addition, if we assume the data is normally distributed, we can flag data points more than 3 standard deviations from the mean as outliers.

To handle missing or wrong data values, different methods can be used, such as:
1) filling them with a specific value which is not in the range of the data (for example -99999)
2) filling them with mean or median values
3) filling them with the previous values or next values
4) filling them with interpolation (I used linear interpolation to handle some wrong measurements in this project)
5) keeping them, but adding a column with a binary value (true or false) showing that something was not right about this data point
6) removing the row/column with at least one missing/wrong value

"""

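In [None]:
"""
The six options listed above can be sketched in pandas as follows. This is only an illustration on a made-up
toy series (the values and the column name 'energy_missing' are assumptions, not part of the dataset).
"""

```python
import numpy as np
import pandas as pd

# Toy series with two gaps; the values are made up for illustration.
s = pd.Series([1.0, np.nan, 3.0, 4.0, np.nan, 6.0], name="energy")

sentinel = s.fillna(-99999)                     # 1) out-of-range marker value
mean_filled = s.fillna(s.mean())                # 2) mean (s.median() works the same way)
forward = s.ffill()                             # 3) previous value (s.bfill() uses the next one)
interpolated = s.interpolate(method="linear")   # 4) linear interpolation between valid neighbours
flagged = pd.DataFrame({"energy": s,            # 5) keep the gap, but flag it in a boolean column
                        "energy_missing": s.isna()})
dropped = s.dropna()                            # 6) remove the incomplete entries

print(pd.DataFrame({"original": s, "ffill": forward, "linear": interpolated}))
```
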
In [None]:
# Show the rows that contain at least one missing value
df[df.isnull().any(axis=1)]

In [None]:
"""
Here we do not have to deal with NaN or null values.
"""

In [None]:
index1=df['irradiance'].idxmax(axis=0, skipna=True)
index2=df['temperature'].idxmax(axis=0, skipna=True)
index3=df['energy'].idxmax(axis=0, skipna=True)

print(df[index1:index1+1])
print(df[index2:index2+1])
print(df[index3:index3+1])

In [None]:
"""
Based on the earlier comments about the maximum values, the data points in the vicinity of 12/2/16 10:15 and 4/28/16 11:50 need attention.
"""
df[index1-5:index1+5]


In [None]:
df[index3-5:index3+5]

In [None]:
"""
At this stage, 5 values need to be fixed: 2 temperature and 2 irradiance values around 12/2/16 10:15, and
1 energy value at 4/28/16 11:50.
As discussed above, previous values, next values, or linear interpolation can be used to estimate the right values.
Here I used the average of the nearest valid previous and next values.
This is an iterative process: I will visualize the data and check the min and max values again to catch other possible data issues.
"""

# Assign through .loc; chained indexing (df[col][i] = ...) may write to a temporary copy instead of df.
# Rows index1 and index1+1 are both wrong, so average the nearest good neighbours on either side.
for col in ['temperature', 'irradiance']:
    df.loc[[index1, index1 + 1], col] = (df.loc[index1 - 1, col] + df.loc[index1 + 2, col]) / 2

df.loc[index3, 'energy'] = (df.loc[index3 - 1, 'energy'] + df.loc[index3 + 1, 'energy']) / 2

In [None]:
f, axs = plt.subplots(2, 2, figsize=(12, 12))
axs[0,0].hist(df['irradiance'])
axs[0,0].set(xlabel='irradiance (W/m^2)', ylabel='count')
axs[0,1].hist(df['humidity'])
axs[0,1].set(xlabel='humidity', ylabel='count')
axs[1,0].hist(df['temperature'])
axs[1,0].set(xlabel='temperature (C)', ylabel='count')
axs[1,1].hist(df['energy'])
axs[1,1].set(xlabel='energy (kW)', ylabel='count')

In [None]:
df.describe()

In [None]:
"""
Another large energy value is observed and needs to be taken care of.
"""
index4=df['energy'].idxmax(axis=0, skipna=True)

print(df[index4:index4+1])

In [None]:
df[index4-5:index4+5]

In [None]:
df.loc[index4, 'energy'] = (df.loc[index4 - 1, 'energy'] + df.loc[index4 + 1, 'energy']) / 2
df[index4-5:index4+5]

In [None]:
f, axs = plt.subplots(2, 2, figsize=(12, 12))
axs[0,0].hist(df['irradiance'])
axs[0,0].set(xlabel='irradiance (W/m^2)', ylabel='count')
axs[0,1].hist(df['humidity'])
axs[0,1].set(xlabel='humidity', ylabel='count')
axs[1,0].hist(df['temperature'])
axs[1,0].set(xlabel='temperature (C)', ylabel='count')
axs[1,1].hist(df['energy'])
axs[1,1].set(xlabel='energy (kW)', ylabel='count')

In [None]:
df.describe()

In [None]:
"""
The manual process of fixing wrong data points one at a time does not scale. It is therefore a good idea to evaluate the
situation and select an automatic approach if the number of wrong data points is considerable.
"""

df[df['energy'] > 30]

In [None]:
"""
I picked 30 by checking how many values exceed it, which is only a small number.
It is a design choice and other values could be used. A common alternative is the 1.5*IQR rule for detecting outliers.
"""

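In [None]:
"""
As a rough sketch of the two automatic rules mentioned in this notebook (1.5*IQR and 3 standard deviations),
the helpers below return a boolean mask of suspected outliers. The function names and the synthetic series
are hypothetical; they are not the data used above.
"""

```python
import numpy as np
import pandas as pd

def iqr_outliers(s, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

def sigma_outliers(s, n_sigma=3.0):
    """Flag values more than n_sigma standard deviations from the mean."""
    return (s - s.mean()).abs() > n_sigma * s.std()

# Synthetic 'energy' readings in a plausible range, plus two injected spikes.
rng = np.random.default_rng(0)
energy = pd.Series(rng.uniform(0, 25, size=1000))
energy.iloc[[100, 500]] = [400.0, 900.0]

iqr_hits = energy[iqr_outliers(energy)].index.tolist()
sigma_hits = energy[sigma_outliers(energy)].index.tolist()
print(iqr_hits, sigma_hits)
```
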
In [None]:
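# Replace each flagged value with the average of its neighbours (assign through .loc, not chained indexing)
outlier_idx = df[df['energy'] > 30].index
for i in outlier_idx:
    df.loc[i, 'energy'] = (df.loc[i - 1, 'energy'] + df.loc[i + 1, 'energy']) / 2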
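
In [None]:
"""
A vectorized alternative to the loop, sketched on made-up numbers: mask the values above the threshold and let
linear interpolation fill them. For an isolated outlier this gives the same neighbour average as the loop; for a
run of consecutive outliers it interpolates between the nearest valid values instead of averaging with another outlier.
The helper name replace_above_threshold is hypothetical.
"""

```python
import pandas as pd

def replace_above_threshold(s, threshold):
    """Set values above `threshold` to NaN, then fill them by linear interpolation."""
    masked = s.astype(float).mask(s > threshold)
    return masked.interpolate(method="linear", limit_direction="both")

# One isolated spike (500) and two consecutive spikes (700, 800).
energy = pd.Series([10.0, 12.0, 500.0, 16.0, 18.0, 700.0, 800.0, 24.0])
cleaned = replace_above_threshold(energy, threshold=30)
print(cleaned.tolist())
```
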

In [None]:
f, axs = plt.subplots(2, 2, figsize=(12, 12))
axs[0,0].hist(df['irradiance'])
axs[0,0].set(xlabel='irradiance (W/m^2)', ylabel='count')
axs[0,1].hist(df['humidity'])
axs[0,1].set(xlabel='humidity', ylabel='count')
axs[1,0].hist(df['temperature'])
axs[1,0].set(xlabel='temperature (C)', ylabel='count')
axs[1,1].hist(df['energy'])
axs[1,1].set(xlabel='energy (kW)', ylabel='count')

In [None]:
df.describe()

In [None]:
"""
At this point the outlier issue seems to be largely resolved. Let's visualize the data again.
"""

In [None]:
df.plot()
plt.show()

In [None]:
df[149500:156500].plot()
plt.show()

In [None]:
"""
Now we can see that the feature vector is missing over a period of time. All of the methods for dealing with missing data points can be used here.
However, given the size of the gap, I start by removing (dropping) that period.
"""

In [None]:
df.drop(df.loc[150000:155999].index, inplace=True)

In [None]:
df.describe()

In [None]:
"""
It seems we can now start working with this data. Previously we checked for null values and there were none.
At this stage data cleansing can be considered done (of course, we may come back and make changes if needed).
"""

In [None]:
"""
Again, by using domain knowledge or checking the correlation coefficients, we can figure out the relationship
between the features and the target variable (PV generation). This part addresses the following question:

"2.	Explore the data and generate insights from data. It is open ended. Please describe and list all the methods.
Include image where it is necessary."
"""

In [None]:
df['irradiance'].corr(df['energy']) 

In [None]:
df['humidity'].corr(df['energy'])

In [None]:
df['temperature'].corr(df['energy']) 

In [None]:
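# Restrict to numeric columns; recent pandas versions raise an error if corr() meets a non-numeric column
sns.heatmap(df.select_dtypes(include='number').corr(), annot=True)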

In [None]:
"""
The results make sense. Higher irradiance goes with higher PV generation.
Higher humidity goes with lower PV generation.
Higher temperature (due to sun position) goes with higher PV generation. The impact of temperature on solar panel efficiency is not
significant compared to the impact of irradiance on PV generation.
"""

"""
The heatmap above uses the Pearson correlation (the pandas default). Rank-based Spearman or Kendall correlation can also be used, depending on the distribution of the features.
"""

In [None]:
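# Restrict to numeric columns; recent pandas versions raise an error if corr() meets a non-numeric column
sns.heatmap(df.select_dtypes(include='number').corr(method='spearman'), annot=True)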
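
In [None]:
"""
To illustrate why the choice of correlation method matters, here is a small sketch on synthetic data (not the
PV dataset): y grows monotonically but non-linearly with x, so the rank-based Spearman coefficient is 1 while
the Pearson coefficient, which measures linear association, is lower.
"""

```python
import numpy as np
import pandas as pd

# Synthetic, strictly increasing but non-linear relationship.
x = np.linspace(0, 5, 50)
toy = pd.DataFrame({"x": x, "y": np.exp(x)})

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]
print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```
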

In [None]:
"""
In many cases, feature scaling speeds up learning algorithms that rely on, for example, gradient descent.
Standardization and normalization can both be used for feature scaling.
However, for this example, I am going to keep the original data scale.
"""

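In [None]:
"""
For reference, the two scaling options can be sketched in plain pandas as below. The toy values are made up.
In a forecasting setup, the statistics (mean, std, min, max) would normally be computed on the training period
only and then reused for later data.
"""

```python
import pandas as pd

# Made-up feature values on very different scales.
features = pd.DataFrame({
    "irradiance": [0.0, 250.0, 500.0, 1000.0],
    "temperature": [5.0, 15.0, 20.0, 30.0],
})

# Standardization: zero mean and unit (population) standard deviation per column.
standardized = (features - features.mean()) / features.std(ddof=0)

# Min-max normalization: rescale each column to the [0, 1] range.
normalized = (features - features.min()) / (features.max() - features.min())

print(standardized.round(3))
print(normalized.round(3))
```
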
In [None]:
"""
What statistical test did you perform on the data to check its stationarity, co-integration etc.
Please state your reasoning behind a particular test. (please include results and pictures)
"""

"""
Working with stationary data is much easier in time-series analysis. Due to the seasonality of
irradiance, temperature, and humidity, the data does not appear to be stationary across different parts of the year. Generally,
climate data is considered cyclo-stationary: for example, if we only consider the data within one month, it can be treated as stationary.
If there is a trend, the data is again not stationary. Here I do not consider climate change as a trend.
"""

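In [None]:
"""
A simple, informal way to look at this is to compare monthly statistics: if the monthly means differ by much more
than the typical spread within a month, the yearly series is not stationary in the mean, while each month taken
alone may be close to it. The sketch below uses a synthetic seasonal series (an assumption, not the PV data),
and it is a descriptive check rather than a formal test such as the augmented Dickey-Fuller test.
"""

```python
import numpy as np
import pandas as pd

# Synthetic daily 'irradiance' with a yearly cycle plus noise.
days = pd.date_range("2016-01-01", "2016-12-31", freq="D")
rng = np.random.default_rng(0)
day_of_year = days.dayofyear.to_numpy()
seasonal = 500 + 300 * np.sin(2 * np.pi * (day_of_year - 80) / 366)
irradiance = pd.Series(seasonal + rng.normal(0, 20, len(days)), index=days)

# Mean and standard deviation per calendar month.
monthly = irradiance.groupby(irradiance.index.month).agg(["mean", "std"])

spread_between_months = monthly["mean"].max() - monthly["mean"].min()
typical_within_month = monthly["std"].mean()
print(monthly.round(1))
print(f"between-month spread: {spread_between_months:.1f}, "
      f"typical within-month std: {typical_within_month:.1f}")
```
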

In [None]:
"""
4.	Develop a prediction model to forecast energy production for the next 30 days.
a. what type of model you used and describe the reasons behind using this particular model 
b. what are features, how do you come up with these features 
c. how do you validate the results?
"""

In [None]:
"""
To validate the results, cross-validation on time series data has to respect time order: each fold trains on the past and tests on the period that follows it.
"""
from IPython.display import Image
Image("https://habrastorage.org/files/f5c/7cd/b39/f5c7cdb39ccd4ba68378ca232d20d864.png")
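
In [None]:
"""
A minimal sketch of such a time-ordered split, assuming an expanding training window and fixed-size test folds.
The generator name expanding_window_splits and its parameters are hypothetical; only positional indices are produced,
so it can be applied to any time-sorted DataFrame with .iloc.
"""

```python
import numpy as np

def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_idx, test_idx) pairs where each test fold follows its training data in time."""
    first_train_end = n_samples - n_splits * test_size
    if first_train_end <= 0:
        raise ValueError("Not enough samples for the requested number of splits.")
    for i in range(n_splits):
        train_end = first_train_end + i * test_size
        yield np.arange(train_end), np.arange(train_end, train_end + test_size)

# Example: 10 time-ordered samples, 3 folds of 2 samples each.
splits = list(expanding_window_splits(n_samples=10, n_splits=3, test_size=2))
for train_idx, test_idx in splits:
    print("train:", train_idx.tolist(), "test:", test_idx.tolist())
```
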