In [1]:
'''
This is the list of all the libraries used to built Linear Regression from the scratch.
pandas is imported to Read the csv file and do some preprocessing before actually training the model.
train_test_split is imported to split the dataset into two parts :  1. Training Dataset
                                                                    2. Test Dataset
mean_absolute_error is imported to calculate the error in the predictions of our Linear Regression Model developed.
'''
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

In [2]:
'''
The dataset is downloaded from Kaggle website link : 
The dataset has one small problem that the decimal points are actually ',' and not '.' thus I change it.
'''
df = pd.read_csv('beer-consumption.csv',decimal=',')
df.columns

Index(['Data', 'Temperatura Media (C)', 'Temperatura Minima (C)',
       'Temperatura Maxima (C)', 'Precipitacao (mm)', 'Final de Semana',
       'Consumo de cerveja (litros)'],
      dtype='object')

In [3]:
'''
Since actual data has more than one column (features) but we are doing Simple Linear Regression, we only need ONE feature. 
I chose AvgTemp as my feature or x as you will realise in the later code.
Since this is a regression proble we need to find a Column that we will be predicting. For this purpose we chose BeerConsumption.
BeerConsumption column is going to are y.  

Actual data can consist of Null values in certain columns.
There are different ways to handle it. 
    One of the value is to drop it.  
    Another is to replace the Null values with a different value. Something like Mean or Median or Min/Max values of the column.
Whatever we chose to do affects our Model training to a significant amount. 
For the sake on simplicity I am replacing the null values with Mean.
'''

df.columns = ['Date','AvgTemp','MinTemp','MaxTemp', 'Precipitation','Weekend','BeerConsumption']
df.AvgTemp = df.AvgTemp.fillna(df.AvgTemp.mean())
df.BeerConsumption =df.BeerConsumption.astype('float')
df.BeerConsumption = df.BeerConsumption.fillna(df.BeerConsumption.mean())

In [4]:
'''
Over here, i am defining a variable and loading the required data into it. (That is the AvgTemp and BeerConsumption)
Note i have done this is two different ways. Its just a way to write code. Either one is acceptable.
Since we have only one column for X and y  the datatype of X and y is not DataFrame but Series.
Check the Datatype of X and y but uncommenting the last time
'''
x = df['AvgTemp']
y = df.BeerConsumption
X_train, X_test, y_train, y_test = train_test_split(x,y, test_size = 0.2)
#print('this type of x is {} and for y is {}'.format(type(X_test),type(y_test)))

In [5]:
#Cost Function : For Calculating about the cost (distance/error) from the line to the points (x)
'''
Cost functions help us to understand how we need to change the parameters so as to build a line which has the minimum cost of all the points from the Line which we are builing.

Equation for the line is : y = mx+c
y -> output 
m -> slope
x -> input
c -> intercept
You can print the Cost Function by uncommenting the last line
However while using the cost function later on I am feeding in the value of Slope and intercept
'''
def cost_function(X_train,y_train,slope=1.17,intercept=4.15):
    N = len(x)
    error_sum = 0.0
    for i in range(N):
        error_sum += (y[i]-(slope*x[i]+intercept))**2
    avg_error = (error_sum/N)
    return avg_error  
print(cost_function(X_train,y_train))

18.405954152620264


In [6]:
# Gradient Descent : For finding the correct value of m (slope) and c (intercept) 
'''
For the sake of simplicity we are keeping the Learning rate as 0.001 (lr)
Equation for Gradient Descent is : 
    For slope m      : 1/N Σ -2*x (y- (m*x+c))
    For intercept    : 1/N Σ -2*(y- (m*x+c)) 
lr or Learning rate is the change along the line we make to find the optimum value of the Slope and Intercept.
You can print the Gradient Descent by uncommenting the last line
'''

def gradient_descent(X_train,y_train,m=1.17,c=4.15,lr=0.001):
    intercept_derivative = 0
    slope_derivative = 0
    N = len(x)
    for i in range(N):
        slope_derivative +=  -2*x[i]*(y[i]-(m*x[i]+c))
        intercept_derivative +=  -2*(y[i]-(m*x[i]+c))
    m -= (slope_derivative/N) * lr
    c -= (intercept_derivative/N)*lr
    return (m,c)
#print(gradient_descent(X_train,y_train))

In [7]:
'''
Training the linear regression model is about finding the values of Slope m and Intercept c,
such that the cost function for the line formed with the slope and intercept is the lowest.
To observe how the cost function, slope and intercept get changed, uncomment the if statement.
Over here we are creating a function that will train our data by calling the cost function and gradient descent function internally.
'''
def train(X_train,y_train,slope,intercept,lr,iters):
    cost_history = []
    for i in range(iters):
        slope,intercept = gradient_descent(x,y,slope,intercept,lr)
        cost = cost_function(x,y,slope,intercept)
        cost_history.append([slope,intercept,cost])
#        if i%50==0:
#            print("The error is :{:.4f}, slope is: {:.4f} and intercept is : {:.4f}".format(cost,slope,intercept))
    traindf = pd.DataFrame(cost_history, columns=['slope','intercept','error'])
    slope_val = traindf.error.idxmin()
    (traindf.slope.iloc[slope_val],traindf.intercept.iloc[slope_val])
    return ((traindf.slope.iloc[slope_val],traindf.intercept.iloc[slope_val]))

In [11]:
'''
Here we are calling the train function to get the calculated value of Slope and Intercept,
which i found out by processing the Data.
You can uncomment the print statement to find the values of Slope and Intercept returned.
'''
m,c = train(X_train,y_train,0.805,8.35,0.002,1000)
#print("m : {} and c : {}".format(m,c))

In [12]:
'''Prediction Function
Based on the values returned by the train function we will now predict the Beer Consumption based on equating the follownig equation:
y = m*x +c
x -> Now the x that we will pass is X_test. We have trained the data and want to do predictions

'''
def predict_sales(m,x,c):
    print("print m :{} and c :{}".format(m,c))
    return (m*x+c)
lets_predict = predict_sales(m,X_test,c)

print m :0.8029601790134907 and c :8.355962955264562


In [13]:
'''
We are now comparing our predictions based on calculations performed by the predict_sales with the actual values of the 
Mean Absolute Error is one way of doing it. 
'''
my_mae = mean_absolute_error(lets_predict,y_test)
print(my_mae)

1.3078108342324049
