# Linear Regression



We start with the usual imports, all of which we have seen before.

In [38]:
#Dataframe and array manipulation
import pandas as pd
import numpy as np

#For visualization
import plotly
import plotly.express as px

Here are the imports that we will need to implement a linear regression model.

In [39]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

Most of the imports above should look familiar, namely `train_test_split`. Note that we import `mean_squared_error` rather than the `accuracy_score` function used for classification.

The `LinearRegression` class lives in `sklearn.linear_model`. This is the actual model we will be working with.

## Importing the Income Data

To start, let's import the income data that we were looking at earlier. For simplicity, let's *only* look at the data ranging from $15k to $70k.

Let's start with creating a dataframe to visualize the data.

In [40]:
# Importing the data into a pd dataframe
URL = "https://raw.githubusercontent.com/ishaandey/node/master/week-8/workshop/lin-reg/income.csv"
data = pd.read_csv(URL)
data.head()

Unnamed: 0.1,Unnamed: 0,income,happiness
0,1,3.862647,2.314489
1,2,4.979381,3.43349
2,3,4.923957,4.599373
3,4,3.214372,2.791114
4,5,7.196409,5.596398


Let's go ahead and drop the unnecessary `Unnamed: 0` column and multiply the income by 10000 so that it is expressed in dollars. (Throwback to data cleaning/manipulation)

In [41]:
# Dropping the first column
data = data.drop("Unnamed: 0", axis=1)
data['income'] = data['income']*10000
data.head()

Unnamed: 0,income,happiness
0,38626.474184,2.314489
1,49793.813825,3.43349
2,49239.569362,4.599373
3,32143.724388,2.791114
4,71964.092511,5.596398


That's much better.

## Let's visualize the data

In [42]:
# Simple scatter plot
px.scatter(data, x='income', y='happiness',
    labels = {"income" : "Income (in dollars)",
              'happiness' : 'Happiness Score (0 to 10)'
              }
)

Linear regression works well when the two variables are roughly linearly correlated, which the scatter plot suggests is the case here. Since both columns are already numerical, we don't have to do much in terms of cleaning/modifying the data.
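
One quick way to put a number on "correlated" is the Pearson correlation coefficient. The sketch below uses made-up data as a stand-in for the income and happiness columns (the values and the noise level are assumptions, not the workshop data), just to show the idea.

```python
import numpy as np

# Synthetic stand-in for the income/happiness columns (made-up numbers,
# not the workshop data): happiness rises roughly linearly with income.
rng = np.random.default_rng(0)
income = rng.uniform(15_000, 75_000, size=200)
happiness = 7e-5 * income + rng.normal(0, 0.7, size=200)

# Pearson correlation coefficient: values near +1 or -1 suggest a strong
# linear relationship, values near 0 suggest little linear relationship.
r = np.corrcoef(income, happiness)[0, 1]
print(f"Pearson r = {r:.3f}")
```

On the real dataframe, the same check would be something like `data['income'].corr(data['happiness'])`.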

Let's get it ready for the model now.

## Implementing Linear Regression

In [43]:
# Split into X and y and do train_test_split (this should be familiar)
X = data.drop(columns=['happiness'])
y = data.happiness

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
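
As a refresher on what `test_size=0.2` does, here is a small sketch on a toy array of 100 rows (the toy data is an assumption for illustration only). It should hold out 20% of the rows for testing.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows with a single feature (illustrative only)
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.arange(100)

# Same arguments as the cell above: 20% test, fixed seed for reproducibility
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print("train rows:", X_tr.shape[0])
print("test rows: ", X_te.shape[0])
```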

In [44]:
# Fit a LinearRegression to the data
clf = LinearRegression()
clf.fit(X_train, y_train)

LinearRegression()

In [45]:
# Predict on the testing data and compare it with the actual data
predicted = clf.predict(X_test)
actual = np.array(y_test)

print('Look at first 5 predictions:')
print('Predicted: ',predicted[:5])
print('Actual:    ',actual[:5])

Look at first 5 predictions:
Predicted:  [3.21752987 2.21792257 3.20280332 1.79020704 1.74394749]
Actual:     [4.7541684  4.1596093  2.29570014 2.31155424 2.86127444]


As the predicted and actual values show, none of the predictions is exactly right. That's expected: it's unrealistic to expect a regression model to hit a continuous value exactly. To see what the model looks like, recall that it is simply a line of the form:

$$ y = \beta_{1} x + \beta_{0} $$

Here $\beta_{1}$ is the slope and $\beta_{0}$ is the y-intercept. We can read both off the fitted model through its `coef_` and `intercept_` attributes.

In [75]:
# Get the coefficients and y-intercept
coef = clf.coef_
intercept = clf.intercept_
print("beta_1 = ", coef)
print("beta_0 = ", intercept)

# Find two points on the fitted line so we can draw it
# Evaluate the line at x = 15000 and x = 75000, near the ends of the income range
x_0 = 15000
y_0 = coef[0]*x_0 + intercept
x_1 = 75000
y_1 = coef[0]*x_1 + intercept
print("Point 1: [", x_0, ",", y_0, "]")
print("Point 2: [", x_1, ",", y_1, "]")

beta_1 =  [7.24768702e-05]
beta_0 =  0.14170387223712044
Point 1: [ 15000 , 1.228856924664379 ]
Point 2: [ 75000 , 5.5774691343734135 ]
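
To make the connection between the coefficients and the predictions concrete, here is a minimal sketch that plugs the printed slope and intercept back into the line equation by hand. The helper name `predict_happiness` is made up for this example, and the slope is the rounded value from the output above, so the results match only up to rounding.

```python
# Slope and intercept as printed by the cell above (the slope is rounded)
beta_1 = 7.24768702e-05
beta_0 = 0.14170387223712044

def predict_happiness(income):
    """Evaluate the fitted line y = beta_1 * x + beta_0 at a given income."""
    return beta_1 * income + beta_0

# Reproduce the two endpoints used to draw the line
for income in (15000, 75000):
    print(income, predict_happiness(income))
```

This is the same calculation that `clf.predict` performs for a single-feature model.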


In [76]:
# Graphs the data and the line on plotly
fig = px.scatter(data, x="income", y="happiness")
fig.add_shape(type='line', xref="x", yref="y",
    x0 = x_0, y0 = y_0, x1 = x_1, y1 = y_1,
    line = dict(
        color = "red",
        width = 4,
    )   
)
fig.show()

## Metrics

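Accuracy doesn't really apply to a regression model the way it does to a classification model, because the predictions are continuous values rather than class labels. A common way to measure how good a regression model is, is the **Mean Squared Error** (MSE):

$$ \text{MSE} = \frac{1}{n}\sum_{i=1}^{n}{\Big(y_{a,i} - y_{p,i}\Big)^2} $$

where $y_{a,i}$ is the actual value and $y_{p,i}$ is the predicted value for the $i$-th test sample, and $n$ is the number of test samples.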

In [85]:
# Get the mean squared error of the linear regression model
predicted = clf.predict(X_test)
actual = np.array(y_test)
# scikit-learn's argument order is (y_true, y_pred)
mse = mean_squared_error(actual, predicted)
print(mse)

0.5901463943118468
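
To see what the metric is doing under the hood, here is a small sketch that computes the mean squared error by hand with NumPy. It uses only the first five predicted/actual values printed earlier, so it will not reproduce the full test-set figure above.

```python
import numpy as np

# First five predicted/actual values printed earlier in this notebook
predicted_5 = np.array([3.21752987, 2.21792257, 3.20280332, 1.79020704, 1.74394749])
actual_5 = np.array([4.7541684, 4.1596093, 2.29570014, 2.31155424, 2.86127444])

# Square each error, then average over the n samples
squared_errors = (actual_5 - predicted_5) ** 2
mse_by_hand = squared_errors.sum() / len(actual_5)

print(mse_by_hand)
```

Calling `mean_squared_error` on the same two arrays should give the same number.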


# Trying it out on real-world data

Let's check out a different dataset. This one looks at the medical charges that patients incurred from their visits to the hospital.

In [107]:
# Importing data into a dataframe
URL = "https://raw.githubusercontent.com/ishaandey/node/master/week-8/workshop/lin-reg/medical_charges.csv"
med = pd.read_csv(URL)
med.head()

In [104]:
px.scatter(med, x = "age", y = "charges", color = "children", opacity=.5)