# Linear Regression



We start with the usual imports, all of which we have seen before.

In [None]:
# DataFrame and array manipulation
import pandas as pd
import numpy as np

# For visualization
import plotly
import plotly.express as px

Here are the imports that we will need to implement a linear regression model.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

Everything above should look familiar, especially the `train_test_split` function. Notice that we import `mean_squared_error` from `sklearn.metrics` rather than the `accuracy_score` metric. More on that later...

The `LinearRegression` class lives in `sklearn.linear_model`. This is the actual model we will be working with.

## Importing the Income Data

To start, let's import the income data that we were looking at earlier. For simplicity, let's look *only* at incomes ranging from \$15k to \$70k.

Let's start with creating a dataframe to visualize the data.

In [None]:
# Import the data into a pandas DataFrame and take a look at it
URL = "https://raw.githubusercontent.com/ishaandey/node/master/week-8/workshop/lin-reg/income.csv"
data = pd.read_csv(URL)
data.head()


Let's go ahead and drop the unnecessary column and multiply the income by 10000 so that it is expressed in dollars. (Throwback to data cleaning/manipulation.)

In [None]:
# Basic Cleaning


That's much better.
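
As a rough sketch of that cleaning step, the snippet below works on a tiny made-up frame rather than the real CSV. The name of the leftover index column (`Unnamed: 0`) and the sample values are assumptions; check `data.columns` for the actual name before dropping anything.

```python
import pandas as pd

# Hypothetical stand-in for the loaded CSV; the values are made up for illustration
toy = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3],        # assumed name of the leftover index column
    "income": [1.5, 4.0, 7.0],      # income in units of 10,000
    "happiness": [2.1, 4.3, 6.8],
})

# Drop the leftover index column, then rescale income
toy_clean = toy.drop("Unnamed: 0", axis=1)
toy_clean["income"] = toy_clean["income"] * 10000

print(toy_clean)
```

The same two lines should carry over to the real `data` frame once the column name is confirmed.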

## Let's visualize the data

In [None]:
# Simple scatter plot
px.scatter(data, x="income", y="happiness",
    labels = {"income" : "Income (in dollars)",
              "happiness" : "Happiness Score (0 to 10)"
              }
)

Linear regression works well when the two variables are linearly correlated, as they appear to be here. Since both columns are already numerical, we don't have to do much in terms of cleaning/modifying the data.
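
One quick way to put a number on how linearly related two columns are is the Pearson correlation coefficient. The sketch below uses synthetic values (not the workshop data) and NumPy's `np.corrcoef`; the slope and noise level are arbitrary choices made for illustration.

```python
import numpy as np

# Synthetic, roughly linear data with some noise (all values made up)
rng = np.random.default_rng(0)
x = np.linspace(15000, 70000, 50)
y = 0.0001 * x + rng.normal(0, 0.5, size=x.size)

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the correlation
r = np.corrcoef(x, y)[0, 1]
print("Pearson r:", round(r, 3))
```

A value close to 1 or -1 suggests a straight line is a reasonable model, while a value near 0 suggests it is not.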

Let's get it ready for the model now.

## Implementing Linear Regression

In [None]:
# Split into X and y and do train_test_split (this should be familiar)


In [None]:
# Fit a LinearRegression to the data


In [None]:
# Predict on the testing data and compare it with the actual data
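
If you get stuck on the three cells above, here is one possible sketch of the split, fit, and predict steps. It runs on synthetic data, and the names `toy`, `model`, and `comparison` are placeholders rather than anything the workshop prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the income data (values are made up)
rng = np.random.default_rng(42)
income = rng.uniform(15000, 70000, size=200)
happiness = 0.00007 * income + 0.2 + rng.normal(0, 0.7, size=200)
toy = pd.DataFrame({"income": income, "happiness": happiness})

# X must be 2-D (double brackets keep it a DataFrame); y is 1-D
X = toy[["income"]]
y = toy["happiness"]

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit on the training rows only
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the held-out rows and line the results up against the truth
y_pred = model.predict(X_test)
comparison = pd.DataFrame({"actual": y_test.to_numpy(), "predicted": y_pred})
print(comparison.head())
```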


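As you can see from comparing the predicted and actual values, none of the predictions is exactly right. It's unrealistic to expect the model to hit every value exactly, since a straight line can only capture the overall trend. Let's check out what the model looks like on a scatter plot. With a single feature, the model is just the equation of a line: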

$$ y = \beta_{1} x + \beta_{0} $$

After fitting, the model stores the values it learned: the slope $\beta_{1}$ is in the `coef_` attribute and the intercept $\beta_{0}$ is in `intercept_`.

In [None]:
# Get the coefficients and y-intercept

print("beta_1 = ", coef[0])
print("beta_0 = ", intercept)

# Find the first and second point
# The first point will just be (0, intercept)

print("Point 1: [", x_0, ",", y_0, "]")
print("Point 2: [", x_1, ",", y_1, "]")
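
As a small sanity check of how `coef_` and `intercept_` behave, the sketch below fits a model to noise-free points on the line $y = 2x + 1$, so the recovered slope and intercept are easy to verify. The data and the name `line_model` are made up for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noise-free toy data on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1

# scikit-learn expects a 2-D feature array, hence the reshape
line_model = LinearRegression().fit(x.reshape(-1, 1), y)

slope = line_model.coef_[0]          # beta_1
intercept = line_model.intercept_    # beta_0

# Two points are enough to draw the line
x_0, y_0 = 0.0, intercept
x_1 = x.max()
y_1 = slope * x_1 + intercept

print("beta_1 =", slope, "beta_0 =", intercept)
print("Point 1:", (x_0, y_0), "Point 2:", (x_1, y_1))
```

Picking the largest x value for the second point keeps the drawn line within the range of the data.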

In [None]:
# Graph the data and the fitted line with plotly
fig = px.scatter(data, x="income", y="happiness")
fig.add_shape(type='line', xref="x", yref="y",
    x0 = x_0, y0 = y_0, x1 = x_1, y1 = y_1,
    line = dict(
        color = "red",
        width = 4,
    )   
)
fig.show()

## Metrics

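Accuracy doesn't really apply to a regression model the way it does to a classification model, because a continuous prediction is almost never exactly equal to the actual value. A common way to measure how good a regression model is instead is the **Mean Squared Error** (MSE): the average of the squared differences between the actual values $y_{a,i}$ and the predicted values $y_{p,i}$.

$$ \text{MSE} = \frac{1}{n}\sum_{i=1}^{n}{\Big(y_{a,i} - y_{p,i}\Big)^2} $$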

In [None]:
# Get the mean squared error of the linear regression model
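
To see what the metric actually computes, the sketch below works out the MSE by hand on four made-up values and checks it against `mean_squared_error` from `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted values
y_actual = np.array([3.0, 5.0, 2.0, 7.0])
y_predicted = np.array([2.5, 5.0, 3.0, 6.0])

# Square each error, then average: (0.25 + 0 + 1 + 1) / 4
mse_manual = np.mean((y_actual - y_predicted) ** 2)
mse_sklearn = mean_squared_error(y_actual, y_predicted)

print(mse_manual, mse_sklearn)
```

Squaring makes every error positive and penalizes large misses more heavily than small ones.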


# Trying it out on complex data

Let's check out a different dataset. This one contains the medical charges that patients were billed for their hospital visits.

In [None]:
# Import the data into a DataFrame
URL = "https://raw.githubusercontent.com/ishaandey/node/master/week-8/workshop/lin-reg/med_charges.csv"
med = pd.read_csv(URL)


# Clean the dataset
med = med.drop("Unnamed: 0", axis=1)
med.head()

Let's try to predict the medical charge for a patient with certain characteristics (age, bmi, etc.).

In [None]:
# Correlation matrix
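
One way to build a correlation matrix with pandas is sketched below on a handful of made-up rows. The column names are assumed to mirror the medical dataset, and selecting the numeric columns first avoids errors on text columns such as `smoker`.

```python
import pandas as pd

# Made-up rows; column names are assumed to mirror the medical dataset
toy_med = pd.DataFrame({
    "age": [19, 33, 45, 52, 61],
    "bmi": [27.9, 22.7, 30.1, 25.3, 29.0],
    "smoker": ["yes", "no", "no", "yes", "no"],
    "charges": [16884.0, 4449.0, 8240.0, 27000.0, 13000.0],
})

# Correlation only makes sense for numeric columns, so select those first
corr = toy_med.select_dtypes("number").corr()
print(corr)
```

Each entry is the Pearson correlation between two columns, so the diagonal is always 1.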


In [None]:
# Let's look at age vs. charges


It looks like there is a clear separation between smokers and non-smokers. It would be a good idea to split the dataset on smoking status and fit a separate, more accurate model for each group. We'll focus on the non-smokers here.

In [None]:
# Get only the non-smokers


# Go ahead and drop the smoker column (it is now redundant)
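
A possible sketch of the filtering step, on made-up rows. The `"yes"`/`"no"` encoding of the `smoker` column is an assumption; check it with `med["smoker"].unique()` on the real data.

```python
import pandas as pd

# Made-up rows for illustration
toy_med = pd.DataFrame({
    "age": [19, 33, 45, 52],
    "smoker": ["yes", "no", "no", "yes"],   # assumed encoding of the column
    "charges": [16884.0, 4449.0, 8240.0, 27000.0],
})

# Keep only the rows where smoker == "no"
non_smokers = toy_med[toy_med["smoker"] == "no"]

# Every remaining row has the same value, so the column carries no information
non_smokers = non_smokers.drop("smoker", axis=1)
print(non_smokers)
```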


There's some categorical data in there. Let's convert it to numerical columns with the `pd.get_dummies` function.

In [None]:
# Change the categorical data to numerical
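
To show what `pd.get_dummies` does, the sketch below encodes a tiny made-up frame. Each category of the `sex` column becomes its own indicator column, named `<column>_<category>`.

```python
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({
    "age": [19, 33, 45],
    "sex": ["female", "male", "female"],
})

# Each category of "sex" becomes its own indicator column
encoded = pd.get_dummies(toy, columns=["sex"])
print(encoded)
```

Depending on the pandas version, the indicator columns hold `True`/`False` or `1`/`0`; `LinearRegression` accepts either.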


Now, let's go ahead and put this into a model, starting with only one variable: age. The code here should look quite familiar.

In [None]:
# Divide into X and y


# Split the data

# Create and train the model



In [None]:
# Generate Predictions


# Get MSE
print("MSE:", mse_age)

# Get RMSE
print("RMSE:", rmse_age)
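
The RMSE is simply the square root of the MSE, which brings the error back into the same units as the target. A minimal sketch on made-up numbers:

```python
import numpy as np

# Made-up actual and predicted charges
y_actual = np.array([4000.0, 8000.0, 12000.0])
y_predicted = np.array([5000.0, 7000.0, 14000.0])

mse = np.mean((y_actual - y_predicted) ** 2)   # units: charges squared
rmse = np.sqrt(mse)                            # units: same as charges

print("MSE:", mse)
print("RMSE:", rmse)
```

Because it shares units with `charges`, the RMSE can be read as a typical size of the prediction error, which makes it easier to interpret than the MSE.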

## Let's visualize the trend

In [None]:
# Getting coefficients and intercept


# Creating the line


# Graph the data and the fitted line with plotly
fig = px.scatter(num_ns, x="age", y="charges")
fig.add_shape(type='line', xref="x", yref="y",
    x0 = x_0, y0 = y_0, x1 = x_1, y1 = y_1,
    line = dict(
        color = "red",
        width = 4,
    )   
)
fig.show()

# Using Multiple Features for Linear Regression

What if we wanted to look at ALL of the different columns within the dataset (age, bmi, children, sex)?

We just add one coefficient per feature!

$$ \mathrm{charges} = \beta_{1} \cdot \mathrm{age} + \beta_{2} \cdot \mathrm{bmi} + \beta_{3} \cdot \mathrm{children} + \beta_{4} \cdot \mathrm{sex\_male} + \beta_{5} \cdot \mathrm{sex\_female} + \beta_{0} $$

In [None]:
# Divide into X and y


# Split the data

# Create and train the model
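
To illustrate how the coefficients line up with the columns, the sketch below fits a model on noise-free synthetic data generated from known coefficients. The feature values, the coefficients, and the name `multi_model` are all made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Noise-free synthetic data built from known coefficients (all values made up)
rng = np.random.default_rng(1)
features = pd.DataFrame({
    "age": rng.integers(18, 65, size=100),
    "bmi": rng.uniform(18, 40, size=100),
    "children": rng.integers(0, 4, size=100),
})
charges = (250 * features["age"] + 30 * features["bmi"]
           + 500 * features["children"] + 1000)

multi_model = LinearRegression().fit(features, charges)

# One coefficient per column, in the same order as features.columns
for name, beta in zip(features.columns, multi_model.coef_):
    print(name, round(beta, 2))
print("intercept", round(multi_model.intercept_, 2))
```

The only change from the single-feature case is that `X` has several columns, so `coef_` holds several values instead of one.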


In [None]:
# Generate Predictions


# Get MSE for both
print("MSE for model with just age:", mse_age)
print("MSE for model with all features:", mse_mult)

# Get RMSE for both
print("RMSE for model with just age:", rmse_age)
print("RMSE for model with all features:", rmse_mult)
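
For a feel of what this comparison can look like, here is a sketch on synthetic data in which the target depends on both `age` and `bmi`. All numbers are made up, so the results say nothing about the real medical dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data where charges depend on both age and bmi (all values made up)
rng = np.random.default_rng(7)
n = 300
age = rng.uniform(18, 65, size=n)
bmi = rng.uniform(18, 40, size=n)
charges = 250 * age + 300 * bmi + rng.normal(0, 500, size=n)

X_all = np.column_stack([age, bmi])
X_train, X_test, y_train, y_test = train_test_split(
    X_all, charges, test_size=0.2, random_state=0
)

# Model 1 sees only the age column; model 2 sees both columns
age_only = LinearRegression().fit(X_train[:, [0]], y_train)
both = LinearRegression().fit(X_train, y_train)

mse_one = mean_squared_error(y_test, age_only.predict(X_test[:, [0]]))
mse_two = mean_squared_error(y_test, both.predict(X_test))

print("RMSE, age only:", np.sqrt(mse_one))
print("RMSE, age and bmi:", np.sqrt(mse_two))
```

Here the second feature carries real signal, so the error drops. On real data an extra feature may help very little, which is why the comparison is made on the test split.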

It's going to be a little hard to graph this... Why do you think that is?