# Week 4 Lab (Linear Regression)
COSC 3337 Dr. Rizk

## About The Data
Our goal for this lab is to construct a model that can take a certain set of housing features and give us back a price estimate. Since price is a continuous variable, linear regression is a good place to start.
The dataset that we'll be using for this task comes from kaggle.com and contains the following attributes:
* 'Avg. Area Income': Average income of residents in the city where the house is located
* 'Avg. Area House Age': Average age of houses in the same city
* 'Avg. Area Number of Rooms': Average number of rooms for houses in the same city
* 'Avg. Area Number of Bedrooms': Average number of bedrooms for houses in the same city
* 'Area Population': Population of the city where the house is located
* 'Price': Price that the house sold at (target)
* 'Address': Address of the house

## Exploratory Data Analysis
Let's begin by importing some necessary libraries that we'll be using to explore the data.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Our first step is to load the data into a pandas DataFrame.

In [None]:
housing_data = pd.read_csv('USA_Housing.csv')
housing_data.head()

In [None]:
housing_data.describe()

In [None]:
housing_data.info()

A quick pairplot lets us get an idea of the distributions and relationships in our dataset. From here, we could choose any interesting features that we'd like to later explore in greater depth. Warning: The more features in our dataset, the harder our pairplot will be to interpret.

In [None]:
sns.pairplot(housing_data)
plt.show()

Taking a closer look at price, we see that it's approximately normally distributed and centered around 1.232073e+06, and 75% of houses sold for 1.471210e+06 or less.

In [None]:
sns.histplot(housing_data['Price'])
plt.show()
print(housing_data['Price'].describe())
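
As a rough check on statements like the ones above, the sketch below shows how the '75%' row of describe() relates to a quantile and how skewness hints at whether a column is roughly bell-shaped. It uses a synthetic stand-in for the 'Price' column (the values and the name `price` are made up for illustration), so the printed numbers will not match the lab data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the 'Price' column (hypothetical values, not the lab data).
rng = np.random.default_rng(0)
price = pd.Series(rng.normal(loc=1_200_000, scale=350_000, size=5000), name='Price')

summary = price.describe()

# The '75%' row of describe() is the 0.75 quantile: about three quarters
# of the values are at or below it.
q75 = price.quantile(0.75)
share_below = (price <= q75).mean()

# For a roughly symmetric, bell-shaped distribution the sample skewness is near zero.
skewness = price.skew()

print(summary)
print('Share at or below the 75% mark:', share_below)
print('Skewness:', skewness)
```

On the real data, the same calls can be made on housing_data['Price'] directly.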

In [None]:
sns.scatterplot(x='Price', y='Avg. Area Income', data=housing_data)
plt.show()

In [None]:
sns.boxplot(x='Avg. Area Number of Bedrooms', data=housing_data)
plt.show()

In [None]:
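# 'Address' is text, so restrict the correlation to numeric columns
# (numeric_only requires pandas >= 1.5)
sns.heatmap(housing_data.corr(numeric_only=True), annot=True)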
plt.show()

## Creating Our Linear Model
We're now ready to begin creating and training our model. We first need to split our data into training and testing sets. This can be done using sklearn's train_test_split(X, y, test_size) function, which takes your features (X), the target variable (y), and the fraction of the data to hold out for testing (test_size; around 0.3 is a common choice). It returns the X_train, X_test, y_train, and y_test sets. We will train our model on the training set and then use the test set to evaluate it.

In [None]:
from sklearn.model_selection import train_test_split

X = housing_data[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms', 'Avg. Area Number of Bedrooms', 'Area Population']]
y = housing_data['Price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
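
To see what the split produces, here is a minimal sketch on a tiny made-up dataset (the names `X_toy`, `y_toy`, and the feature columns are hypothetical). It also passes random_state, which the cell above does not, so the shuffle is the same on every run; that is optional, but it makes results reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 features (not the housing dataset).
X_toy = pd.DataFrame({'feature_a': np.arange(10), 'feature_b': np.arange(10) * 2})
y_toy = pd.Series(np.arange(10) * 3, name='target')

# test_size=0.3 holds out 30% of the rows; random_state fixes the shuffle
# so that the same split is produced on every run.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

print(X_tr.shape, X_te.shape)  # (7, 2) (3, 2)

# Features and target are split together, so their row indices stay aligned.
print((X_tr.index == y_tr.index).all())
```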

In [None]:
from sklearn.linear_model import LinearRegression

lm = LinearRegression()
lm.fit(X_train, y_train)

In [None]:
predictions = lm.predict(X_test)
plt.scatter(y_test, predictions)
plt.show()

In [None]:
residuals = y_test - predictions
sns.histplot(residuals)
plt.show()

Here are the most common evaluation metrics for regression problems:
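* **Mean Absolute Error (MAE)** is the mean of the absolute value of the errors: $\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$
* **Mean Squared Error (MSE)** is the mean of the squared errors: $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
* **Root Mean Squared Error (RMSE)** is the square root of the mean of the squared errors: $\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$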

Comparing these metrics:
* MAE is the easiest to understand, because it's the average absolute error.
* MSE is more popular than MAE, because MSE "punishes" larger errors, which tends to be useful in the real world.
* RMSE is even more popular than MSE, because RMSE is interpretable in the "y" units.

All of these are loss functions, because we want to minimize them.
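
The three metrics above can also be worked out by hand with NumPy. The sketch below uses a handful of made-up true values and predictions (`y_true` and `y_pred` are hypothetical, chosen so the arithmetic is easy to follow), and shows how a single large error dominates MSE and RMSE more than MAE.

```python
import numpy as np

# Hypothetical true values and predictions, chosen so the arithmetic is easy to follow.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 300.0, 360.0])

errors = y_true - y_pred        # [-10, 10, 0, 40]

mae = np.mean(np.abs(errors))   # (10 + 10 + 0 + 40) / 4 = 15
mse = np.mean(errors ** 2)      # (100 + 100 + 0 + 1600) / 4 = 450
rmse = np.sqrt(mse)             # sqrt(450), roughly 21.21

print('MAE:', mae)
print('MSE:', mse)
print('RMSE:', rmse)
```

The one error of 40 contributes 1600 of the 1800 total squared error, which is why RMSE (about 21.2) sits well above MAE (15) here.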

Luckily, sklearn can calculate all of these metrics for us. All we need to do is pass the true labels (y_test) and our predictions to the functions below. What's more important is that we understand what each of these means. Root Mean Squared Error (RMSE) is what we'll most commonly use. It is essentially the standard deviation of the residuals (prediction errors), and exactly so when the residuals average to zero. Residuals measure how far the data points are from the regression line; RMSE measures how spread out these residuals are. In other words, it tells us how concentrated the data is around the line of best fit. What counts as a good RMSE depends on your data and on the scale of the target. You can find a great example here, or refer back to the PowerPoint slides.
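
The link between RMSE and the standard deviation of the residuals can be checked on a small example. This is a sketch with made-up residuals (the arrays and the helper `rmse_from_residuals` are hypothetical): the two quantities agree when the residuals average to zero, and RMSE grows when the predictions are systematically off.

```python
import numpy as np

# Hypothetical residuals (true value minus prediction), not taken from the lab data.
residuals_centered = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])  # mean is 0
residuals_biased = residuals_centered + 2.0                  # mean is 2

def rmse_from_residuals(res):
    """Root mean squared error computed directly from residuals."""
    return np.sqrt(np.mean(res ** 2))

# With zero-mean residuals, RMSE equals the (population) standard deviation.
print(rmse_from_residuals(residuals_centered), np.std(residuals_centered))

# With a systematic bias, RMSE is larger: RMSE^2 = variance + mean^2.
print(rmse_from_residuals(residuals_biased), np.std(residuals_biased))
```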

In [None]:
from sklearn import metrics

print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))

In [None]:
from sklearn.metrics import r2_score

print('R2 Score:', r2_score(y_test, predictions))

In [None]:
# One row per feature, showing its fitted coefficient
coeff_df = pd.DataFrame(lm.coef_, index=X.columns, columns=['Coefficient'])
coeff_df