# Linear Regression

Term 1 2020 - Instructor: Teerapong Leelanupab

Teaching Assistants: 
1. Tiwipab Meephruek (Mil)
2. Jiratkul Wangsiripaisarn (Brooklyn)
3. Hataichanok Sakkara (Pond)
***

*Regression analysis* is a common statistical process for estimating the relationships between variables. This can allow us to make numeric predictions based on past data. *Simple Linear Regression* predicts a numeric response variable based on a single input variable (feature).

### Example 1: Simple Linear Regression

To demonstrate the use of simple linear regression with scikit-learn, we will first create sample data in the form of NumPy arrays.

In [None]:
import numpy as np
np.random.seed(0)
x = np.random.random(size=(15, 1))
y = 3 * x.flatten() + 2 + np.random.randn(15)

First, let's plot the data using Matplotlib:

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(x, y, 'o')

Apply simple linear regression to learn (fit) the model, where *x* is our input variable and *y* is the target variable that we would like to learn how to predict:

In [None]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x, y)

Display the model parameters that we have learned: 

In [None]:
print("Model intercept is", model.intercept_)
print("Model slope is", model.coef_[0])
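
As a cross-check, the learned parameters can be reproduced by hand. The sketch below regenerates the same synthetic data as above and applies the standard closed-form least squares formulas for a single feature (slope = covariance of x and y divided by variance of x). The names `x_flat`, `slope` and `intercept` are introduced here for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same synthetic data as in Example 1
np.random.seed(0)
x = np.random.random(size=(15, 1))
y = 3 * x.flatten() + 2 + np.random.randn(15)

model = LinearRegression().fit(x, y)

# Closed-form least squares for one feature:
#   slope     = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   intercept = mean_y - slope * mean_x
x_flat = x.flatten()
slope = np.sum((x_flat - x_flat.mean()) * (y - y.mean())) / np.sum((x_flat - x_flat.mean()) ** 2)
intercept = y.mean() - slope * x_flat.mean()

print("Closed-form slope:", slope, "| scikit-learn slope:", model.coef_[0])
print("Closed-form intercept:", intercept, "| scikit-learn intercept:", model.intercept_)
```

Both pairs of values should agree up to floating-point rounding.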

This model can now be used to make predictions for *y* given new values of *x*:

In [None]:
x_unseen =  np.array([0.78])
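# Note: scikit-learn expects a 2D array of shape (n_samples, n_features).
# With a single feature, array.reshape(-1, 1) works for one sample or for many samples.
# array.reshape(1, -1) also works for one sample, but it would treat several values as several features of one sample.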
model.predict(x_unseen.reshape(-1, 1)) 

In [None]:
x_unseen =  np.array([0.78, 0.92])
model.predict(x_unseen.reshape(-1, 1))
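
To make the effect of the two reshape calls concrete, here is a small sketch that only inspects array shapes. The variable names are illustrative and no model is involved.

```python
import numpy as np

single = np.array([0.78])          # one sample with one feature
several = np.array([0.78, 0.92])   # two samples with one feature each

col_single = single.reshape(-1, 1)    # shape (1, 1)
col_several = several.reshape(-1, 1)  # shape (2, 1): one row per sample
row_several = several.reshape(1, -1)  # shape (1, 2): one sample with two features

print(col_single.shape, col_several.shape, row_several.shape)
```

Because our model was fitted on a single feature, only the column form `(n_samples, 1)` is appropriate when predicting for several values at once.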

Plot the data and the model prediction (i.e. the regression line):

In [None]:
# create predictions which we will use to generate our line
X_fit = np.linspace(0, 1, 100)[:, np.newaxis]
y_fit = model.predict(X_fit)
# plot the data
plt.plot(x.flatten(), y, 'o')
# plot the line
plt.plot(X_fit, y_fit)

### Example 2: Simple Linear Regression
As a second example, we will examine a dataset of 244 meals, with details of total meal bill and tip amount.

In [None]:
import pandas as pd
df = pd.read_csv("data/tips.csv")
len(df)

In [None]:
df.head(5)

First, let's plot the data using Matplotlib:

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
p = df.plot.scatter("total_bill","tip", s=60, figsize=(8,6), fontsize=14)
plt.xlabel('Total Bill', fontsize=14)
plt.ylabel('Tip', fontsize=14);

From the above, it seems there is a reasonably strong relationship. Let's quantify the level of correlation between the two variables.

In [None]:
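# restrict to the two numeric columns of interest; recent pandas versions raise an error if text columns are included
df[["total_bill", "tip"]].corr()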
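
By default, the pandas correlation is the Pearson coefficient. The sketch below computes it by hand on a few hypothetical values (a stand-in for the tips data, not taken from the real file) and compares the result with pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values standing in for the tips dataset
toy = pd.DataFrame({"total_bill": [10.0, 20.0, 30.0, 40.0, 50.0],
                    "tip": [1.5, 3.0, 3.5, 6.0, 7.0]})

# Pearson correlation as computed by pandas
r_pandas = toy["total_bill"].corr(toy["tip"])

# The same quantity from its definition:
# sum of products of centred values divided by the product of their norms
xc = toy["total_bill"] - toy["total_bill"].mean()
yc = toy["tip"] - toy["tip"].mean()
r_manual = (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

print(r_pandas, r_manual)
```

A value close to 1 indicates a strong positive linear relationship, while a value close to 0 indicates little linear relationship.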

We could also look at a boxplot of the data, to see if there are outlying values:

In [None]:
df.boxplot(figsize=(5,6), fontsize=14);

Now, apply simple linear regression to learn (fit) the model, where *x* (the total bill) is our independent variable and *y* (the tip amount) is the target variable that we would like to learn how to predict:

In [None]:
# Note: scikit-learn expects 2D inputs, so we select each column with a list of names
# (which keeps two dimensions) and then take the underlying NumPy array
x = df[["total_bill"]].values
y = df[["tip"]].values

In [None]:
# Now build the regression model
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x, y)

Look at the parameters of the model we learned (the regression line):

In [None]:
print("Model intercept is", model.intercept_)
print("Model slope is", model.coef_[0])

Now let's plot the data again, adding our regression line:

In [None]:
# plot the data
plt.figure(figsize=(8,6))
plt.scatter(x, y)
# plot the regression line between the smallest and largest bill
m = np.ravel(model.coef_)[0]
b = np.ravel(model.intercept_)[0]
plt.plot([x.min(), x.max()], [m*x.min() + b, m*x.max() + b], 'r')
plt.xlabel('Total Bill', fontsize=14)
plt.ylabel('Tip', fontsize=14);

We can make predictions from this model:

In [None]:
bills = np.arange( 10, 70, 5 )
# flatten the predictions so that each entry is a plain number
predict_tips = model.predict(bills.reshape(-1, 1)).ravel()
for i in range(bills.size):
    print("Predicted tip for meal costing %.2f = %.2f" % ( bills[i],  predict_tips[i] ) )

In [None]:
#Or    
for bill in bills:
    # .item() extracts the single predicted value as a Python float
    predict_tip = model.predict(bill.reshape(-1,1)).item()
    print("Predicted tip for meal costing %.2f = %.2f" % ( bill,  predict_tip ) )

We can also compare the outputs of our model with the original data to see if they agree (note: normally we would use a separate test dataset in a real evaluation).

In [None]:
# Let's just look at the first few rows
for i in range(10):
    test_x = x[i][0]
    actual_y = y[i][0]
    # .item() extracts the single predicted value as a Python float
    predict_y = model.predict(test_x.reshape(-1,1)).item()
    print("For meal costing %.2f. Predicted tip = %.2f, Actual tip = %.2f"  %( test_x, predict_y, actual_y ) )

### Example 3: Simple Linear Regression
As a third example of simple linear regression, we will load a CSV dataset related to product advertising. We would like to analyse the relationship between the budget spent on different advertising media and product sales.

In [None]:
import pandas as pd
df = pd.read_csv("data/advertising.csv", index_col=0)
df.head()

We will try building a simple linear model to predict Sales based on the TV budget spend:

In [None]:
model = LinearRegression()
# create a copy of the data frame, with a single input variable
x = df[["TV"]]
# fit the model based on the original response variable
model.fit(x,df["Sales"])

In [None]:
print("Model intercept is", model.intercept_)
print("Model slope is", model.coef_[0])

Let's try to predict the first five values of the original data (note: normally we would use a separate test dataset in a real evaluation).

In [None]:
test_x = x[0:5]
model.predict(test_x)

When we compare the predictions to the actual sales values for the first 5 rows, we see there are some errors:

In [None]:
df["Sales"][0:5]

We can create a plot that shows how the regression line fits to the data for this feature:

In [None]:
plt.figure(figsize=(8,6))
plt.scatter(df["TV"], df["Sales"])
plt.xlabel("TV Budget Spend",fontsize=14)
plt.ylabel("Sales",fontsize=14)
# add the predictions from regression
plt.plot(df["TV"], model.predict(x), color="red")
plt.show()

We can now use this model to make predictions:

In [None]:
budgets = np.arange( 0, 400, 50 )
for spend in budgets:
    # .item() extracts the single predicted value as a Python float
    predict_sales = model.predict(spend.reshape(-1,1)).item()
    print("Predicted sales for TV advertising spend of %.2f = %.2f" % ( spend,  predict_sales ) )

We can calculate the overall *mean squared error* between the predictions and the actual sales data. This gives us an idea of how well the model based on TV budget predicts sales.

In [None]:
np.mean((df["Sales"] - model.predict(x)) ** 2)

We can repeat the same process using a different feature, such as newspaper budget spend:

In [None]:
# extract the relevant column
x = df[["Newspaper"]]
# build the model
model = LinearRegression()
model.fit(x,df["Sales"])

When we calculate the overall *mean squared error* between the predictions and the actual sales data, we see that making predictions based on the newspaper spend leads to a higher error - i.e. this feature is a less reliable predictor.

In [None]:
np.mean((df["Sales"] - model.predict(x)) ** 2)

For real evaluations we would use a separate *test set* to measure the quality of predictions:

In [None]:
# separate the training and test data - normally we would do this randomly
train_df = df[0:160]
test_df = df[160:200]
train_x = train_df[["Newspaper"]]
test_x = test_df[["Newspaper"]]
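
The split above simply takes the first 160 rows for training. A random split could be sketched as follows with scikit-learn's `train_test_split`. The data frame `toy_df` is a hypothetical stand-in with random values, used only so that the example runs without the CSV file.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the advertising data: 200 rows of random values
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({"Newspaper": rng.uniform(0, 100, 200),
                       "Sales": rng.uniform(1, 27, 200)})

# Randomly assign 80% of the rows to training and 20% to testing;
# random_state fixes the shuffle so that the split is reproducible
train_df, test_df = train_test_split(toy_df, test_size=0.2, random_state=0)
train_x = train_df[["Newspaper"]]
test_x = test_df[["Newspaper"]]

print(len(train_df), len(test_df))
```

A random split avoids problems that arise when the rows of a file are stored in some meaningful order.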

In [None]:
# only build a model on the training set
model = LinearRegression()
model.fit(train_x,train_df["Sales"])

In [None]:
model.predict(test_x)

In [None]:
np.mean((test_df["Sales"] - model.predict(test_x)) ** 2)

### Example 4: Multiple Linear Regression

Simple linear regression can easily be extended to include multiple features, where we try to learn a model with one coefficient per input feature.

We will use the previous advertising dataset, which had 3 independent features: TV, Radio, Newspaper.

In [None]:
df = pd.read_csv("data/advertising.csv", index_col=0)
# we remove the sales column that we are going to predict
x = df.drop("Sales",axis=1)
x.head()

Now use all 3 input variables to fit a linear regression model:

In [None]:
model = LinearRegression()
model.fit(x,df["Sales"])

When we build the model, note that each input feature has its own slope coefficient:

In [None]:
print("Model intercept is", model.intercept_)
print("Model coefficients are", model.coef_)
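
To illustrate how the intercept and the per-feature coefficients combine, the following sketch fits a model on synthetic data (random values with three features, not the advertising data) and reproduces the predictions as the intercept plus a weighted sum of the features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with three features; the "true" weights are chosen arbitrarily
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(50, 3))
y = 2.0 + X @ np.array([0.05, 0.2, 0.0]) + rng.normal(0, 0.5, 50)

model = LinearRegression().fit(X, y)

# A prediction is the intercept plus the sum of coefficient * feature value
manual = model.intercept_ + X @ model.coef_

print(model.coef_.shape)                      # one coefficient per feature
print(np.allclose(manual, model.predict(X)))
```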

Again we can make predictions for sales based on new values for the 3 input features:

In [None]:
test_x = x[0:1]
print(test_x)
print("Predicted Sales = %.2f" % model.predict(test_x)[0])
print("Actual Sales = %.2f" % df["Sales"].iloc[0])

We can make predictions for multiple new unseen examples in the same way:

In [None]:
unseen_X = np.array( [ [ 140.0, 45.3, 70.5 ], [ 70.0, 84.62, 98.95 ] ] )
unseen_X

In [None]:
model.predict( unseen_X )

### Example 5.1: Simple Linear Regression

A further example from [Towards Data Science](https://towardsdatascience.com/a-beginners-guide-to-linear-regression-in-python-with-scikit-learn-83a8f7ae2b4f), using weather data.

In [None]:
import seaborn as sb 
from sklearn.model_selection import train_test_split 
from sklearn import metrics

dataset = pd.read_csv('data/Weather.csv')

Let’s explore the data a little bit by checking the number of rows and columns in our dataset:

In [None]:
dataset.shape

To see the statistical details of the dataset, we can use describe():

In [None]:
dataset.describe()

Finally, let’s plot our data points on a 2-D graph to eyeball the dataset and see whether we can spot any relationship by eye, using the script below:

In [None]:
dataset.plot(x='MinTemp', y='MaxTemp', style='o')  
plt.title('MinTemp vs MaxTemp')  
plt.xlabel('MinTemp')  
plt.ylabel('MaxTemp')  
plt.show()

Let’s check the distribution of the maximum temperature. Once we plot it, we can observe that most maximum temperatures lie between roughly 25 and 35.

In [None]:
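plt.figure(figsize=(15,10))
# distplot is deprecated in recent seaborn releases; histplot with kde=True
# draws a comparable histogram with a density curve
sb.histplot(dataset['MaxTemp'], kde=True)
plt.tight_layout()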

Our next step is to divide the data into “attributes” and “labels”.
Attributes are the independent variables, while labels are the dependent variables whose values are to be predicted. In our dataset, we only use two columns. We want to predict MaxTemp from the recorded MinTemp. Therefore, our attribute set consists of the “MinTemp” column, which is stored in the X variable, and the label is the “MaxTemp” column, which is stored in the y variable.

In [None]:
X = dataset['MinTemp'].values.reshape(-1,1)
y = dataset['MaxTemp'].values.reshape(-1,1)

Next, we assign 80% of the data to the training set and 20% to the test set using the code below.
The test_size parameter is where we specify the proportion of the test set.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

After splitting the data into training and testing sets, it is finally time to train our algorithm. For that, we need to import the LinearRegression class, instantiate it, and call the fit() method with our training data.

In [None]:
regressor = LinearRegression()  
regressor.fit(X_train, y_train) #training the algorithm

As discussed, the linear regression model finds the best values for the intercept and slope, which results in a line that best fits the data. To see the values of the intercept and slope calculated by the linear regression algorithm for our dataset, execute the following code.

In [None]:
#To retrieve the intercept:
print(regressor.intercept_)
#For retrieving the slope:
print(regressor.coef_)

The results should be approximately 10.66185201 and 0.92033997 respectively.

This means that for every one-unit change in the minimum temperature, the maximum temperature changes by about 0.92 degrees.
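
As a small worked example of this interpretation, the sketch below plugs the intercept and slope quoted above into the line equation. The helper `predict_max_temp` is a hypothetical name introduced only for this illustration.

```python
# Intercept and slope as quoted above
intercept = 10.66185201
slope = 0.92033997

def predict_max_temp(min_temp):
    """Evaluate the fitted line: MaxTemp = intercept + slope * MinTemp."""
    return intercept + slope * min_temp

# Prediction for a minimum temperature of 20 degrees
print(predict_max_temp(20.0))

# Raising MinTemp by one degree changes the prediction by exactly the slope
print(predict_max_temp(21.0) - predict_max_temp(20.0))
```

With these values, a minimum temperature of 20 degrees gives a predicted maximum of about 29.07 degrees.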

Now that we have trained our algorithm, it’s time to make some predictions. To do so, we will use our test data and see how accurately our algorithm predicts the maximum temperature. To make predictions on the test data, execute the following script:

In [None]:
y_pred = regressor.predict(X_test)

Now compare the actual output values for X_test with the predicted values by executing the following script:

In [None]:
df = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': y_pred.flatten()})
df

We can also visualize the comparison as a bar graph using the script below.

Note: As the number of records is huge, for representation purposes I’m taking just 25 records.

In [None]:
df1 = df.head(25)
df1.plot(kind='bar',figsize=(16,10))
plt.grid(which='major', linestyle='-', linewidth='0.5', color='green')
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.show()

Though our model is not very precise, the predicted temperatures are close to the actual ones.

Let's plot our straight line with the test data :

In [None]:
plt.scatter(X_test, y_test,  color='gray')
plt.plot(X_test, y_pred, color='red', linewidth=2)
plt.show()

The straight line in the above graph follows the overall trend of the test points, which suggests that a linear model is a reasonable fit for this data.

The final step is to evaluate the performance of the algorithm. This step is particularly important to compare how well different algorithms perform on a particular dataset. For regression algorithms, three evaluation metrics are commonly used:

1. **Mean Absolute Error (MAE)** is the mean of the absolute value of the errors. It is calculated as:
![Mean Absolute Error Formula](images/MAE.png)
#### <center>Mean Absolute Error</center>


2. **Mean Squared Error (MSE)** is the mean of the squared errors and is calculated as:
![Mean Squared Error Formula](images/MSE.png)
#### <center>Mean Squared Error</center>

3. **Root Mean Squared Error (RMSE)** is the square root of the mean of the squared errors:
![Root Mean Squared Error Formula](images/RMSE.gif )
#### <center>Root Mean Squared Error</center>

We don’t have to perform these calculations manually. The Scikit-Learn library comes with pre-built functions that can be used to find out these values for us.

Let’s find the values for these metrics using our test data.

In [None]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
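
To see what these functions compute, the three metrics can also be worked out directly with NumPy. The sketch below uses a few hypothetical actual and predicted values (not the weather data) and compares the hand-computed results with scikit-learn.

```python
import numpy as np
from sklearn import metrics

# Hypothetical actual and predicted values
y_true = np.array([25.0, 30.0, 28.0, 35.0])
y_hat = np.array([27.0, 29.0, 31.0, 33.0])

errors = y_true - y_hat            # [-2, 1, -3, 2]

mae = np.mean(np.abs(errors))      # (2 + 1 + 3 + 2) / 4 = 2.0
mse = np.mean(errors ** 2)         # (4 + 1 + 9 + 4) / 4 = 4.5
rmse = np.sqrt(mse)                # square root of 4.5

print("MAE:", mae, "| sklearn:", metrics.mean_absolute_error(y_true, y_hat))
print("MSE:", mse, "| sklearn:", metrics.mean_squared_error(y_true, y_hat))
print("RMSE:", rmse)
```

RMSE is expressed in the same units as the target variable, which makes it easier to compare against the typical size of the values being predicted.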

As you can see, the value of root mean squared error is 4.19, which is more than 10% of the mean of all the temperature values, i.e. 22.41. This means that our algorithm was not very accurate but can still make reasonably good predictions.

### Example 5.2: Multiple Linear Regression

![Multiple Linear Regression](images/MultipleLinearRegression.png )
#### <center>Multiple Linear Regression</center>


We just performed linear regression in the above section (5.1) involving two variables. Almost all the real-world problems that you are going to encounter will have more than two variables. Linear regression involving multiple variables is called “multiple linear regression” (sometimes loosely called multivariate linear regression). The steps to perform multiple linear regression are almost the same as those for simple linear regression. The difference lies in the evaluation: you can use it to find out which factor has the highest impact on the predicted output and how different variables relate to each other.

In this section, I have downloaded the red wine quality dataset. The dataset is related to red variants of the Portuguese “Vinho Verde” wine. Due to privacy and logistic issues, only physicochemical (input) and sensory (output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

We will take into account various input features like fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol. Based on these features we will predict the quality of the wine.

Now, let's start our coding :

In [None]:
#The following command imports the dataset from the file:
dataset = pd.read_csv('data/winequality.csv')

Let’s explore the data a little bit by checking the number of rows and columns in it.

In [None]:
dataset.shape

To see the statistical details of the dataset, we can use describe():

In [None]:
dataset.describe()

Let us clean our data a little bit. First, check which columns contain NaN values:

In [None]:
dataset.isnull().any()

Once the above code is executed, all the columns should give False. If any column gives True, fill the missing values in that column using the code below, which carries the last valid value forward:

In [None]:
# forward fill: replace each missing value with the last valid value above it
dataset = dataset.ffill()
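
The following sketch shows the missing-value check and a forward fill on a tiny hypothetical data frame (the values are made up and are not taken from the wine dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in each column
toy = pd.DataFrame({"pH": [3.5, np.nan, 3.2, 3.3],
                    "alcohol": [9.4, 9.8, np.nan, 10.0]})

# True for every column that contains at least one NaN
has_missing = toy.isnull().any()
print(has_missing)

# Forward fill copies the last valid value downwards into each gap
filled = toy.ffill()
print(filled)
```

Note that a forward fill cannot fill a missing value in the very first row, because there is no earlier value to copy.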

Our next step is to divide the data into “attributes” and “labels”. X variable contains all the attributes/features and y variable contains labels.

In [None]:
X = dataset[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates','alcohol']]
y = dataset['quality']

Let's check the distribution of the “quality” column.

In [None]:
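plt.figure(figsize=(15,10))
# distplot is deprecated in recent seaborn releases; histplot with kde=True
# draws a comparable histogram with a density curve
sb.histplot(dataset['quality'], kde=True)
plt.tight_layout()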

As we can observe, most of the time the value is either 5 or 6.

Next, we assign 80% of the data to the training set and 20% to the test set using the code below.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Now let's train our model.

In [None]:
regressor = LinearRegression()  
regressor.fit(X_train, y_train)

As mentioned earlier, in the case of multiple linear regression, the regression model has to find the optimal coefficients for all the attributes. To see what coefficients our regression model has chosen, execute the following script:

In [None]:
coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])  
coeff_df

This means that for a unit increase in “density”, there is a decrease of 31.51 units in the quality of the wine. Similarly, a unit decrease in “chlorides” results in an increase of 1.87 units in the quality of the wine. The remaining coefficients are much smaller in magnitude. Note, however, that each coefficient is expressed per unit of its own feature, so coefficient sizes are only directly comparable when the features are on similar scales.
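
When reading coefficients like these, it helps to remember that their size depends on the units of each feature. The sketch below uses synthetic data (the feature names and values are hypothetical, not the wine dataset) to show that multiplying a feature by 1000 divides its coefficient by 1000 without changing the model's predictions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, hypothetical data: density varies over a tiny range, alcohol over a wide one
rng = np.random.default_rng(0)
density = rng.uniform(0.990, 1.004, 200)
alcohol = rng.uniform(8.0, 14.0, 200)
quality = 5.0 - 30.0 * (density - 0.997) + 0.3 * (alcohol - 10.0) + rng.normal(0, 0.1, 200)

X = np.column_stack([density, alcohol])
model = LinearRegression().fit(X, quality)

# Same data, with density expressed in units that are 1000 times smaller
X_rescaled = np.column_stack([density * 1000, alcohol])
model_rescaled = LinearRegression().fit(X_rescaled, quality)

print("Original coefficients:", model.coef_)
print("Rescaled coefficients:", model_rescaled.coef_)
```

The density coefficient shrinks by a factor of 1000 after rescaling, while the alcohol coefficient and the predictions stay the same. A large coefficient therefore does not by itself mean that a feature is the most influential one.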

Now let's make predictions on the test data.

In [None]:
y_pred = regressor.predict(X_test)

Check the difference between the actual values and the predicted values.

In [None]:
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df.head(25)

Now let's plot the comparison of the actual and predicted values:

In [None]:
df.head(25).plot(kind='bar',figsize=(10,8))
plt.grid(which='major', linestyle='-', linewidth='0.5', color='green')
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.show()

As we can observe here, the predictions are reasonably close to the actual values for these records.

The final step is to evaluate the performance of the algorithm. We’ll do this by finding the values for MAE, MSE, and RMSE. Execute the following script:

In [None]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

You can see that the value of root mean squared error is 0.62, which is slightly greater than 10% of the mean value, which is 5.63. This means that our algorithm was not very accurate but can still make reasonably good predictions.

There are many factors that may have contributed to this inaccuracy, for example :

**Need more data**: We need to have a huge amount of data to get the best possible prediction.

**Bad assumptions**: We made the assumption that this data has a linear relationship, but that might not be the case. Visualizing the data may help you determine that.

**Poor features**: The features we used may not have had a high enough correlation to the values we were trying to predict.

#### Conclusion
In this article, we studied one of the most fundamental machine learning algorithms, linear regression. We implemented both simple linear regression and multiple linear regression with the help of the Scikit-Learn machine learning library.