# Linear regression


Step 1 (Understanding the Dataset):
- Import the CSV file containing the time series of a PV station.
- Inspect the correlation matrix of the dataset to check the correlation between variables.
- Save the x and y variables.

In [4]:
import pandas as pd
import numpy as np

In [5]:
data = pd.read_csv('station00.csv')
print(data.head())

             date_time  nwp_globalirrad  nwp_directirrad  nwp_temperature   
0  2018-08-15 16:00:00              0.0              0.0            22.78  \
1  2018-08-15 16:15:00              0.0              0.0            22.75   
2  2018-08-15 16:30:00              0.0              0.0            22.71   
3  2018-08-15 16:45:00              0.0              0.0            22.64   
4  2018-08-15 17:00:00              0.0              0.0            22.57   

   nwp_humidity  nwp_windspeed  nwp_winddirection  nwp_pressure   
0         96.85           4.28             339.41       1007.27  \
1         96.91           4.30             337.27       1007.27   
2         96.95           4.28             334.47       1007.48   
3         97.12           4.28             331.52       1007.39   
4         97.15           4.33             329.78       1007.09   

   lmd_totalirrad  lmd_diffuseirrad  lmd_temperature  lmd_pressure   
0               0                 0        25.900000   1006.2999

In [6]:
#data.corr()

- We focus on the last row of the correlation matrix, which holds the correlation of power with every other variable. As we can see, power has a close to perfectly linear correlation with some variables.
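
To illustrate reading the last row of a correlation matrix, here is a minimal sketch on synthetic data. The column names `irrad`, `temperature` and `power`, and all generated values, are assumptions made up for this example and are not taken from station00.csv. Non-numeric columns such as the timestamp are dropped before the correlations are computed.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the station data (hypothetical columns and values).
rng = np.random.default_rng(0)
irrad = rng.uniform(0, 1000, size=200)
temperature = rng.uniform(10, 35, size=200)
df = pd.DataFrame({
    "date_time": pd.date_range("2018-08-15 16:00", periods=200, freq="15min"),
    "irrad": irrad,
    "temperature": temperature,
    # Power depends linearly on irradiance, plus a little noise.
    "power": 0.02 * irrad + rng.normal(0, 0.5, size=200),
})

# Keep numeric columns only, then compute the Pearson correlation matrix.
corr = df.select_dtypes("number").corr()

# The last row holds the correlation of the last column (power) with the rest.
power_corr = corr.iloc[-1].drop("power")
print(power_corr)
```

Values close to 1 or -1 indicate a strong linear relationship with power, while values near 0 indicate a weak one.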

In [7]:
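# Features: every column except the first (date_time) and the last (the target)
x = pd.DataFrame(data.iloc[:, 1:-1])
# Target: the last column
y = pd.DataFrame(data.iloc[:, -1])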
print(x)
print(y)

       nwp_globalirrad  nwp_directirrad  nwp_temperature  nwp_humidity   
0                  0.0              0.0            22.78         96.85  \
1                  0.0              0.0            22.75         96.91   
2                  0.0              0.0            22.71         96.95   
3                  0.0              0.0            22.64         97.12   
4                  0.0              0.0            22.57         97.15   
...                ...              ...              ...           ...   
28891              0.0              0.0            26.04         49.55   
28892              0.0              0.0            25.80         51.03   
28893              0.0              0.0            25.54         52.43   
28894              0.0              0.0            25.23         53.84   
28895              0.0              0.0            24.88         55.21   

       nwp_windspeed  nwp_winddirection  nwp_pressure  lmd_totalirrad   
0               4.28             339.4

Step 2 (Splitting for training and testing the model):
- Let's split the data with the train_test_split() function from the sklearn.model_selection module.
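
As a rough illustration of what a shuffled 80/20 split does, the sketch below splits row indices by hand with NumPy. The helper `shuffle_split` is hypothetical and is not scikit-learn's implementation. It assumes the test set size is the ceiling of `test_size * n_samples`, with the remaining rows going to the training set.

```python
import math
import numpy as np

def shuffle_split(n_samples, test_size=0.2, seed=1):
    """Return (train_idx, test_idx) after shuffling the row indices."""
    # Assumption: the test set gets ceil(test_size * n_samples) rows.
    n_test = math.ceil(test_size * n_samples)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    return idx[n_test:], idx[:n_test]

# 28896 is the number of rows in the dataset used in this notebook.
train_idx, test_idx = shuffle_split(28896)
print(len(train_idx), len(test_idx))  # 23116 5780
```

Under that assumption, the sizes agree with the shapes printed for the train and test sets in this notebook.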

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)      

In [10]:
# see the shape of our train and test sets
print(f'x_train shape: {x_train.shape}',
      f'x_test shape: {x_test.shape}',
      f'y_train shape: {y_train.shape}',
      f'y_test shape: {y_test.shape}',
     )

x_train shape: (23116, 13) x_test shape: (5780, 13) y_train shape: (23116, 1) y_test shape: (5780, 1)


Step 3 (Fit the model):
- We fit the model with the LinearRegression class from the sklearn.linear_model module.
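
To make the fitting step less of a black box, here is a minimal sketch of ordinary least squares solved directly with NumPy on synthetic data. The data, `true_coef` and `true_intercept` are assumptions invented for the example. It shows the idea of finding the intercept and coefficients that minimize the squared error, not the internals of any particular library.

```python
import numpy as np

# Synthetic data with known parameters (hypothetical values).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
true_coef = np.array([2.0, -1.0, 0.5])
true_intercept = -2.0
y = X @ true_coef + true_intercept + rng.normal(0, 0.01, size=500)

# Prepend a column of ones so the first fitted parameter is the intercept.
A = np.column_stack([np.ones(len(X)), X])

# Solve min ||A @ beta - y||^2 in the least-squares sense.
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, coef = beta[0], beta[1:]

print(intercept)
print(coef)
```

With so little noise, the recovered intercept and coefficients should land very close to the values used to generate the data.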

In [11]:
from sklearn.linear_model import LinearRegression

In [12]:
regressor = LinearRegression()
regressor.fit(x_train, y_train)

In [13]:
# Check the fit: print the intercept and the coefficients of the linear regression
print(regressor.intercept_)
print(regressor.coef_)

[-2.1192357]
[[ 2.07915268e-03 -1.65279386e-03 -2.65739360e-03  1.16026962e-03
   5.67826591e-03 -8.41491748e-05  1.19413712e-02  4.38162643e-03
   4.98592285e-06  2.26782295e-03 -9.94917922e-03  1.30087638e-04
   8.57860189e-03]]
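
The coefficients are printed as a 2-D array with one row, which makes them hard to match to columns by eye. A small sketch of pairing each coefficient with its feature name follows. The feature names and values here are hypothetical placeholders. In the notebook, the names would presumably come from `x.columns`.

```python
import numpy as np
import pandas as pd

# Hypothetical feature names and coefficients, shaped (1, n_features)
# like the output of a model fitted on a 2-D target.
feature_names = ["feature_a", "feature_b", "feature_c"]
coef = np.array([[0.002, -0.0017, 0.0119]])

# Flatten to 1-D and label each coefficient with its feature.
coef_by_feature = pd.Series(coef.ravel(), index=feature_names)
print(coef_by_feature)
```

Raw coefficients depend on the scale of each feature, so their magnitudes are not directly comparable unless the features are standardized.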


Step 4 (Evaluate the model):
- First, we predict the values with x_test.

In [14]:
y_pred = regressor.predict(x_test)

In [15]:
from sklearn import metrics
mae = metrics.mean_absolute_error(y_test, y_pred)
mse = metrics.mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

print(f'Mean absolute error: {mae:.2f}')
print(f'Mean squared error: {mse:.2f}')
print(f'Root mean squared error: {rmse:.2f}')

Mean absolute error: 0.11
Mean squared error: 0.08
Root mean squared error: 0.28


- The MSE is close to 0. Good!
- The RMSE of our model is 0.28, which is small. Since the RMSE is in the same units as the target, this means the predictions are typically off by about 0.28 above or below the actual value.
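
For reference, the three error metrics can be computed by hand. The sketch below uses a tiny made-up pair of arrays (`y_true` and `y_hat` are hypothetical, not values from the dataset) to show how each metric is defined.

```python
import numpy as np

# Hypothetical actual and predicted values.
y_true = np.array([0.0, 1.0, 2.0, 3.0])
y_hat = np.array([0.0, 1.5, 2.0, 2.0])

err = y_true - y_hat           # [0.0, -0.5, 0.0, 1.0]

mae = np.mean(np.abs(err))     # mean of absolute errors
mse = np.mean(err ** 2)        # mean of squared errors
rmse = np.sqrt(mse)            # back in the units of the target

print(f'Mean absolute error: {mae:.4f}')      # 0.3750
print(f'Mean squared error: {mse:.4f}')       # 0.3125
print(f'Root mean squared error: {rmse:.4f}') # 0.5590
```

The same relation holds for the values reported in this notebook: the square root of an MSE of 0.08 is about 0.28.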

In [16]:
regressor.score(x_test, y_test)

0.9516386787302777

- The R^2 is also high: the predictor variables explain about 95% of the variance in the response variable.
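
The R^2 score can also be written out explicitly as one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch with made-up numbers (`y_true` and `y_hat` are hypothetical):

```python
import numpy as np

# Hypothetical actual and predicted values.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])

ss_res = np.sum((y_true - y_hat) ** 2)          # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2 = 1 - ss_res / ss_tot

print(round(r2, 4))  # 0.98
```

A value of 1 means the predictions match the actual values exactly, while a value of 0 means the model does no better than always predicting the mean.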

Rest of the work:
- save the model with dump
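
The note above only says "dump", so the following is a sketch under the assumption that `joblib.dump` is meant (`pickle.dump` would be another possibility). The model here is fitted on synthetic data, and the file name `regressor.joblib` is a hypothetical choice.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Small synthetic problem so the example is self-contained.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + 0.5

model = LinearRegression().fit(X, y)

# Save to a temporary directory, then load the model back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "regressor.joblib")  # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should give the same predictions as the original.
same = np.allclose(model.predict(X), restored.predict(X))
print(same)
```

A saved model should be loaded with the same library versions that were used to save it, and only from files you trust.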

In [17]:
print(regressor.coef_)

[[ 2.07915268e-03 -1.65279386e-03 -2.65739360e-03  1.16026962e-03
   5.67826591e-03 -8.41491748e-05  1.19413712e-02  4.38162643e-03
   4.98592285e-06  2.26782295e-03 -9.94917922e-03  1.30087638e-04
   8.57860189e-03]]
