## __Solar Generation Prediction (Linear Regression model)__

### __1. Importing the data__

In [1]:
import pandas as pd

In [2]:
# Import the merged dataset of historic weather and solar generation
merged_df = pd.read_csv("merged_df.csv")

In [3]:
merged_df.head(5)

Unnamed: 0,Temperature,Humidity,Shortwave_Radiation,Date,Sunlight_hours,Time,gen
0,6,81,30.33,2018-01-01,7.95,2018-01-01,23.362676
1,7,86,8.92,2018-01-02,7.97,2018-01-02,18.812548
2,8,70,40.96,2018-01-03,7.98,2018-01-03,29.057803
3,8,82,25.0,2018-01-04,8.0,2018-01-04,22.020129
4,5,88,36.08,2018-01-05,8.03,2018-01-05,29.631941


In [4]:
#Importing the forecast data
new_weather_df = pd.read_csv("weather_forecast.csv")

In [5]:
new_weather_df.head()

Unnamed: 0,Temperature,Humidity,Shortwave_Radiation,Date,Sunlight_hours
0,20.333333,73.5,125.541667,2023-07-08,16.4
1,18.604167,66.791667,298.541667,2023-07-09,16.366667
2,16.775,82.083333,111.75,2023-07-10,16.333333
3,17.254167,84.875,227.791667,2023-07-11,16.316667
4,16.579167,73.333333,219.958333,2023-07-12,16.283333


### __2. Recap of the Correlations__

The previous project examined how each of the following variables correlates with solar generation (Gen):

| Variable              | Gen Correlation |
|-----------------------|-----------------|
| Temperature           | 0.549054        |
| Humidity              | -0.666227       |
| Shortwave Radiation   | 0.841948        |
| Sunlight Hours        | 0.788099        |

For the machine learning prediction model, I use all four variables as predictors of Gen in a multiple linear regression.
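A table like the one above can be produced with `DataFrame.corr()`. The following is a minimal sketch on synthetic data, not the project's dataset: the frame `toy_df` and the coefficients used to generate it are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: radiation drives generation, humidity falls as radiation rises
rng = np.random.default_rng(0)
n = 200
radiation = rng.uniform(0, 300, n)
humidity = 90 - 0.1 * radiation + rng.normal(0, 5, n)
gen = 0.4 * radiation + rng.normal(0, 15, n)

toy_df = pd.DataFrame(
    {"Shortwave_Radiation": radiation, "Humidity": humidity, "gen": gen}
)

# Pearson correlation of every predictor with the target, strongest first
gen_corr = toy_df.corr()["gen"].drop("gen").sort_values(ascending=False)
print(gen_corr)
```

On the real data, the same call would be applied to the numeric columns of `merged_df`.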

### __3. Multicollinearity Check__

To ensure reliable model interpretation, I checked for multicollinearity using the Variance Inflation Factor (VIF) to identify highly correlated variables that might affect the stability of the regression model.

In [6]:
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Create a DataFrame with the predictor variables
df = new_weather_df[["Shortwave_Radiation", "Sunlight_hours", "Humidity", "Temperature"]]

# Calculate VIF for each predictor variable
vif = pd.DataFrame()
vif["Variable"] = df.columns
vif["VIF"] = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]

vif

Unnamed: 0,Variable,VIF
0,Shortwave_Radiation,13.632239
1,Sunlight_hours,434.370141
2,Humidity,130.237479
3,Temperature,191.685948


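#### __Interpretation__
All four VIF values exceed the common rule-of-thumb threshold of 10, which indicates strong multicollinearity among the predictor variables. Note that these values were computed on the forecast data (`new_weather_df`) rather than on the historic data used to fit the model.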
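One caveat about the magnitudes: statsmodels' `variance_inflation_factor` regresses each column on the other columns exactly as they are passed in and does not add an intercept. For variables whose mean is large relative to their spread (such as `Sunlight_hours` here), this can inflate the result. The sketch below computes the intercept-based definition, VIF = 1 / (1 - R²), directly with NumPy. The helper `vif_with_intercept` and the synthetic columns are hypothetical and only illustrate the calculation.

```python
import numpy as np

def vif_with_intercept(X):
    """Return VIF_j = 1 / (1 - R_j^2) for each column j, where R_j^2 comes from
    regressing column j on the remaining columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])  # add intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = np.sum(residuals ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r_squared = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Synthetic example: b is nearly a copy of a, c is independent of both
rng = np.random.default_rng(0)
a = rng.normal(10, 1, 300)
b = a + rng.normal(0, 0.1, 300)
c = rng.normal(50, 5, 300)

vifs = vif_with_intercept(np.column_stack([a, b, c]))
print(vifs)  # large for a and b, close to 1 for c
```

With statsmodels, a comparable result would typically be obtained by adding a constant column to the design matrix before calling `variance_inflation_factor`.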

### __4. PCA__

To address multicollinearity, I applied PCA to `merged_df` (the historic weather and solar generation dataset). PCA transforms the correlated variables into a smaller set of uncorrelated principal components, reducing dimensionality while retaining most of the variance in the data. I kept the smallest number of components whose cumulative explained variance reaches 95%.

In [7]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np


# Create a DataFrame with the predictor variables
df = merged_df[["Shortwave_Radiation", "Sunlight_hours", "Humidity", "Temperature"]]

# Standardize the variables
scaler = StandardScaler()
scaled_df = scaler.fit_transform(df)

# Perform PCA
pca = PCA()
principal_components = pca.fit_transform(scaled_df)

# Explained variance ratio
explained_variance_ratio = pca.explained_variance_ratio_

# Cumulative explained variance
cumulative_variance = np.cumsum(explained_variance_ratio)

# Determine the number of components to retain
num_components = np.argmax(cumulative_variance >= 0.95) + 1

# Retain the desired number of components
final_principal_components = principal_components[:, :num_components]

# Construct the DataFrame with the retained components
principal_df = pd.DataFrame(data=final_principal_components, columns=["PC{}".format(i) for i in range(1, num_components + 1)])

# Add the target variable to the principal components DataFrame, if applicable
principal_df["gen"] = merged_df["gen"]

# Print the updated DataFrame
print(principal_df)

           PC1       PC2       PC3         gen
0    -1.746986  0.612923  0.704883   23.362676
1    -2.052412  0.062855  0.582532   18.812548
2    -0.922561  1.141297  1.515872   29.057803
3    -1.650162  0.248363  0.857921   22.020129
4    -2.152991  0.265642  0.152362   29.631941
...        ...       ...       ...         ...
1978  2.875188  0.948877 -0.260325   97.246218
1979  2.950528  0.655541 -0.424783  121.742163
1980  3.258441  0.707077 -0.272353  120.950493
1981  2.824124  0.673289 -0.687518   68.336305
1982  2.280247  0.532311 -0.197712   48.491734

[1983 rows x 4 columns]
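As a side note, scikit-learn's `PCA` also accepts a fraction for `n_components` and then keeps just enough components to explain that share of the variance, which can replace the manual cumulative-sum step. The sketch below compares both routes on synthetic data; the array `X_toy` is an assumption for illustration and does not reproduce the three components found above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Four hypothetical features: columns 1 and 3 are noisy copies of columns 0 and 2
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 2))
X_toy = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.1 * rng.normal(size=500),
    base[:, 1],
    base[:, 1] + 0.1 * rng.normal(size=500),
])
X_scaled = StandardScaler().fit_transform(X_toy)

# Route 1: fit all components, then pick the count from the cumulative variance
full_pca = PCA().fit(X_scaled)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)
manual_k = int(np.argmax(cumulative >= 0.95) + 1)

# Route 2: let PCA choose the count for a 95% variance target
auto_pca = PCA(n_components=0.95, svd_solver="full").fit(X_scaled)

print(manual_k, auto_pca.n_components_)
```

The two routes can differ only in the edge case where the cumulative variance equals the threshold exactly, because scikit-learn requires the explained share to exceed the given fraction.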


### __5. Building the Linear Regression Model on the Principal Components__

With the transformed principal components as predictors, I built a multiple linear regression model to predict the target variable (gen). I split the data into training (80%) and testing (20%) sets and trained the model on the training data.

In [8]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Separate the principal components (features) from the target
X = principal_df.drop("gen", axis=1)
y = principal_df["gen"]

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the model on the training set and predict on the test set
regression_model = LinearRegression()
regression_model.fit(X_train, y_train)
y_pred = regression_model.predict(X_test)

In [9]:
y_pred

array([ 94.214692  , 123.66919408, 106.80462704,  15.26563568,
       116.26378985,  40.48759888,  42.74288129,  51.83609371,
       117.6016612 , 105.79461422, 155.97832031, 113.69162359,
        11.97532249,  37.85631851,  13.30166008,  12.89109495,
       135.54062413,  63.66415709,  67.62506886,  96.10307703,
        81.63132189,  91.19443787,  25.44024323,  66.04391116,
       102.58844375,  65.03501818,  55.04104279, 102.60053385,
        54.63085623,  31.62010555, 134.69944755,  65.00345322,
        28.45702973,  36.36621709,  73.99282614, 150.85200553,
       109.07046543,  74.72057441,  66.10330781,  95.61947909,
        44.01684368,  41.03055433,  51.85343069,  58.91669673,
       106.32916043,  11.91195707, 105.28981738,  84.39873117,
        36.64463419, 132.3164786 ,  25.22824467,  82.66246405,
        16.79496451,  34.46825241,  14.3726291 ,  32.60777534,
       151.08777267,  35.89790292,  78.07036601,  95.42025397,
        73.24557552,  26.03954172,  83.23627063,  44.64
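In the cells above, the scaler and PCA are fitted on all rows of `merged_df` before the train/test split, so the test rows influence the transformation. A possible alternative, not the approach used in this notebook, is to chain the steps in a scikit-learn `Pipeline` so that scaling and PCA are fitted on the training rows only. The following is a minimal sketch on synthetic data; `X_raw`, `y_raw` and the coefficients are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical raw predictors and target
rng = np.random.default_rng(42)
X_raw = rng.normal(size=(400, 4))
y_raw = 3.0 * X_raw[:, 0] - 2.0 * X_raw[:, 2] + rng.normal(scale=0.5, size=400)

# Split first, so the test rows never reach the scaler or PCA during fitting
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42
)

pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, svd_solver="full"),
    LinearRegression(),
)
pipeline.fit(X_tr, y_tr)           # scaler, PCA and regression all fitted on training rows
test_r2 = pipeline.score(X_te, y_te)  # R-squared on the held-out rows
print(test_r2)
```

A fitted pipeline can then be applied to new raw weather rows with a single `pipeline.predict(...)` call, since it carries the fitted scaler and PCA with it.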

### __6. Evaluation of the prediction model__

I evaluated the performance of the linear regression model by calculating the R-squared value on the test data. The R-squared value measures the proportion of variance in the target variable explained by the predictors.

In [10]:
from sklearn.metrics import r2_score

r2 = r2_score(y_test, y_pred)
print("R-squared:", r2)

R-squared: 0.744412626901324


#### __Interpretation__
The R-squared of 0.7444 indicates that the predictors (principal components) explain approximately 74.44% of the variance in the target variable on the held-out test set.
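For reference, R-squared is defined as 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the observed values from their mean. The sketch below checks this against `r2_score` and also reports RMSE and MAE, which express the error in the units of the target. The two small arrays are made-up values, not results from this model.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical observed and predicted generation values
y_true = np.array([20.0, 35.0, 50.0, 80.0, 120.0])
y_hat = np.array([25.0, 30.0, 55.0, 70.0, 110.0])

# R-squared from its definition
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# Error metrics in the same units as the target
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))
mae = mean_absolute_error(y_true, y_hat)

print(r2_manual, r2_score(y_true, y_hat), rmse, mae)
```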

### __7. Cross validation__

To further assess the model's performance and generalization ability, I employed cross-validation, which splits the data into multiple folds and evaluates the model's performance on each fold. I calculated the mean R-squared score to estimate how well the model is expected to perform on unseen data.

In [11]:
from sklearn.model_selection import cross_val_score

# Create the regression model
regression_model = LinearRegression()

# Perform cross-validation
cv_scores = cross_val_score(regression_model, X_train, y_train, cv=5, scoring='r2')

# Print the cross-validation scores
print("Cross-Validation R-squared scores:", cv_scores)
print("Mean R-squared score:", cv_scores.mean())


Cross-Validation R-squared scores: [0.71958235 0.71818134 0.73790061 0.71523099 0.75129599]
Mean R-squared score: 0.7284382566094132


#### __Interpretation__
The mean R-squared score of 0.7284 is the average performance across the five folds of the training data. Cross-validation does not prevent overfitting by itself, but it gives a more stable estimate of predictive performance than a single split and helps reveal overfitting when fold scores vary widely. Here the fold scores range from about 0.715 to 0.751, which is a narrow spread.

Taken together, the test-set R-squared (0.7444) and the cross-validated mean (0.7284) are close, which suggests the model should perform at a similar level on unseen data.
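A variation worth considering is to cross-validate the whole preprocessing chain, so that scaling and PCA are refitted inside every fold, and to report the spread of the fold scores alongside the mean. This is a minimal sketch on synthetic data under those assumptions; `X_cv`, `y_cv` and the shuffled `KFold` settings are illustrative choices, not the configuration used above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical predictors and target
rng = np.random.default_rng(0)
X_cv = rng.normal(size=(300, 4))
y_cv = 2.0 * X_cv[:, 0] + X_cv[:, 1] + rng.normal(scale=1.0, size=300)

model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, svd_solver="full"),
    LinearRegression(),
)

# Shuffled folds with a fixed seed so the scores are reproducible
folds = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X_cv, y_cv, cv=folds, scoring="r2")

print("Fold scores:", scores)
print("Mean:", scores.mean(), "Std:", scores.std())
```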

### __8. Predicting Solar Generation from the 14-Day Weather Forecast__

In [12]:
new_df = new_weather_df[["Shortwave_Radiation", "Sunlight_hours", "Humidity", "Temperature"]]
scaled_new_df = scaler.transform(new_df)
principal_components_new = pca.transform(scaled_new_df)
final_principal_components_new = principal_components_new[:, :num_components]

In [13]:
principal_df_new = pd.DataFrame(data=final_principal_components_new, columns=["PC{}".format(i) for i in range(1, num_components + 1)])
regression_model.fit(X_train, y_train)  # Refit: the model re-created in cell [11] has not been fitted yet
y_pred_new = regression_model.predict(principal_df_new)
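The refit in the cell above is needed because `cross_val_score` fits clones of the estimator it receives and leaves the original object untouched. The short sketch below demonstrates this behaviour on synthetic data; `demo_model`, `X_demo` and `y_demo` are hypothetical names used only here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical data
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

demo_model = LinearRegression()
cv_demo_scores = cross_val_score(demo_model, X_demo, y_demo, cv=5, scoring="r2")

# The estimator passed to cross_val_score has no learned coefficients yet
fitted_after_cv = hasattr(demo_model, "coef_")

# An explicit fit is required before calling predict
demo_model.fit(X_demo, y_demo)
fitted_after_fit = hasattr(demo_model, "coef_")

print(fitted_after_cv, fitted_after_fit)
```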

In [14]:
y_pred_new

array([ 99.27972054, 142.12536173,  90.49853492, 113.17424215,
       120.2681687 , 132.79254776,  99.36770282, 132.8272434 ,
       143.42910081,  77.21888482, 121.84901444,  94.39887879,
       112.95033779,  85.46807993])

In [16]:
# Add the predicted values to the DataFrame
new_weather_df['Gen Prediction'] = y_pred_new

# Print the updated DataFrame
new_weather_df

Unnamed: 0,Temperature,Humidity,Shortwave_Radiation,Date,Sunlight_hours,Gen Prediction
0,20.333333,73.5,125.541667,2023-07-08,16.4,99.279721
1,18.604167,66.791667,298.541667,2023-07-09,16.366667,142.125362
2,16.775,82.083333,111.75,2023-07-10,16.333333,90.498535
3,17.254167,84.875,227.791667,2023-07-11,16.316667,113.174242
4,16.579167,73.333333,219.958333,2023-07-12,16.283333,120.268169
5,16.8875,69.208333,264.916667,2023-07-13,16.25,132.792548
6,16.6625,78.833333,144.875,2023-07-14,16.216667,99.367703
7,16.170833,67.875,261.916667,2023-07-15,16.166667,132.827243
8,15.495833,68.375,312.75,2023-07-16,16.133333,143.429101
9,15.333333,88.0,76.333333,2023-07-17,16.1,77.218885


#### __Final Interpretation__
The linear regression model I utilized offers predictions for the target variable (gen). It's important to acknowledge that these predictions come with a degree of uncertainty and should not be regarded as entirely definitive.

The model's performance is assessed using the R-squared value, which measures the proportion of variance in the target variable that is explained by the model's predictors: the principal components derived from Temperature, Humidity, Shortwave Radiation, and Sunlight Hours.

Here, the test-set R-squared of 0.7444 means that about 74% of the variance in gen is explained by those components, and the cross-validated mean of 0.7284 is consistent with this. The remaining variance, roughly a quarter, is not explained by the model.

In summary, the linear regression model provides valuable insights and predictions, but it's important to exercise caution and recognize that there are factors beyond the model's scope that may influence the predicted variable. Regular evaluation and updates to the model can help ensure its continued reliability and performance in practical applications.
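The uncertainty mentioned above can be roughly quantified by attaching an error band to each prediction. The sketch below uses about two standard deviations of the test-set residuals as that band. It assumes residuals that are roughly normal with constant variance and ignores uncertainty in the fitted coefficients, so it is a rough guide rather than a formal prediction interval. All data and names (`X_u`, `y_u`, `X_future`) are synthetic and hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical predictors and target with a known noise level
rng = np.random.default_rng(1)
X_u = rng.normal(size=(500, 3))
y_u = 60 + 25 * X_u[:, 0] + rng.normal(scale=10.0, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_u, y_u, test_size=0.2, random_state=1)
lr = LinearRegression().fit(X_tr, y_tr)

# Spread of the errors on rows the model has not seen
residual_std = np.std(y_te - lr.predict(X_te), ddof=1)

# Point predictions for 14 hypothetical future days, with a +/- 2 std band
X_future = rng.normal(size=(14, 3))
point = lr.predict(X_future)
lower = point - 2 * residual_std
upper = point + 2 * residual_std

for p, lo, hi in zip(point[:3], lower[:3], upper[:3]):
    print(f"{p:7.2f}  [{lo:7.2f}, {hi:7.2f}]")
```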