# Dataset

This is the [dataset](https://people.sc.fsu.edu/~jburkardt/data/csv/homes.csv) used for the linear regression.

---

## 1. Describing the Data

#### a. Description
The dataset describes a set of homes, with statistics for each one such as the selling price, the number of rooms, and the age of the house.

#### b. Features
The columns of the dataset are: the selling price of the house, the listed price of the house, the size of the living area, the number of rooms, the number of bedrooms, the number of bathrooms, the age of the house, the lot size in acres, and the taxes on the house. Every column except the selling price is used as a feature.

#### c. Target Variable 
The target variable is the selling price (the `Sell` column).

## 2. Splitting into Sets

In [19]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from matplotlib.colors import ListedColormap
from sklearn import neighbors, datasets, linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score as accuracy

### Importing the Data

In [20]:
dataset = pd.read_csv("https://people.sc.fsu.edu/~jburkardt/data/csv/homes.csv")

dataset["Bias"] = 1

dataset.head()

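| | Sell | "List" | "Living" | "Rooms" | "Beds" | "Baths" | "Age" | "Acres" | "Taxes" | Bias |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 142 | 160 | 28 | 10 | 5 | 3 | 60 | 0.28 | 3167 | 1 |
| 1 | 175 | 180 | 18 | 8 | 4 | 1 | 12 | 0.43 | 4033 | 1 |
| 2 | 129 | 132 | 13 | 6 | 3 | 1 | 41 | 0.33 | 1471 | 1 |
| 3 | 138 | 140 | 17 | 7 | 3 | 1 | 22 | 0.46 | 3204 | 1 |
| 4 | 232 | 240 | 25 | 8 | 4 | 3 | 5 | 2.05 | 3613 | 1 |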
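The column names in the output above keep their quotation marks (for example `"List"`), which makes them awkward to reference in code. One likely cause, assumed here, is a space after each comma in the file, so that pandas does not treat the quotes as field delimiters. The sketch below retypes a few rows as an inline sample (so it runs offline) and shows that `skipinitialspace=True` yields clean names; the sample text is an assumption about the file layout, not a copy of it.

```python
import io
import pandas as pd

# A few rows retyped from the output above; the spacing after the commas
# is an assumption about how the real file is laid out.
sample_csv = '''"Sell", "List", "Living", "Rooms", "Beds", "Baths", "Age", "Acres", "Taxes"
142, 160, 28, 10, 5, 3,  60, 0.28,  3167
175, 180, 18,  8, 4, 1,  12, 0.43,  4033
129, 132, 13,  6, 3, 1,  41, 0.33,  1471
'''

# Default parsing: quotes that follow a space are kept as part of the name.
raw = pd.read_csv(io.StringIO(sample_csv))

# skipinitialspace=True drops the spaces after each delimiter,
# so the quotes are recognised and stripped.
clean = pd.read_csv(io.StringIO(sample_csv), skipinitialspace=True)

print(list(raw.columns))
print(list(clean.columns))
```

With clean names, the target can be selected as `clean["Sell"]` and a feature as `clean["List"]` without typing embedded quotes.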


In [21]:
X = dataset.drop(["Sell"], axis = 1)
Y = dataset.Sell

In [22]:
X_train, X_middle, Y_train, Y_middle = train_test_split(X, Y, test_size = 0.3)

X_train.shape, X_middle.shape, Y_train.shape, Y_middle.shape

((35, 9), (15, 9), (35,), (15,))

In [23]:
X_test, X_val, Y_test, Y_val = train_test_split(X_middle, Y_middle, test_size = 0.66)

X_test.shape, X_val.shape, Y_test.shape, Y_val.shape

((5, 9), (10, 9), (5,), (10,))
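Because neither call to `train_test_split` sets `random_state`, every run produces a different split and therefore different scores. A minimal sketch of a repeatable version of the same two-step split follows; it uses random stand-in data with the same shape as the homes data (50 rows, 9 feature columns), and the seed value `0` is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data with the same shape as the homes data (not the real values).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(50, 9)))
Y = pd.Series(rng.normal(size=50))

# Step 1: 70% train, 30% held out. random_state makes the split repeatable.
X_train, X_rest, Y_train, Y_rest = train_test_split(
    X, Y, test_size=0.3, random_state=0
)

# Step 2: divide the held-out rows. train_test_split returns the "train"
# piece first, so the smaller piece (used here as the test set) comes
# before the larger one (used here as the validation set).
X_test, X_val, Y_test, Y_val = train_test_split(
    X_rest, Y_rest, test_size=0.66, random_state=0
)

print(X_train.shape, X_test.shape, X_val.shape)
# (35, 9) (5, 9) (10, 9)
```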

## 3. Linear Regression with SkLearn

In [38]:
regr = linear_model.LinearRegression()

regr.fit(X_train, Y_train)

prediction = regr.predict(X_test)

# Learned weights, one per column of X
theta = regr.coef_

MSE = mean_squared_error(Y_test, prediction)
# r2_score returns the coefficient of determination (R^2), not a variance
r2 = r2_score(Y_test, prediction)

print("R^2 score: ", r2)
print("Mean Squared Error: ", MSE)

R^2 score:  0.9926806685424128
Mean Squared Error:  8.708247794979012


In [37]:
# Printing the weights of the regression
print(theta)

[ 1.00208587e+00 -2.03573715e-01 -3.69620301e-01  1.15245794e+00
 -1.58360154e-01 -3.32537316e-02 -4.05963004e-01 -1.14992621e-03
  0.00000000e+00]
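The printed array is easier to read when each weight is paired with its column. The sketch below copies the weights printed above and assumes the column order is that of `X`, meaning the dataset columns with `Sell` dropped and `Bias` appended last; the names are written without their quotation marks for readability.

```python
import pandas as pd

# Column order of X, assuming "Sell" was dropped and "Bias" was added last.
feature_names = ["List", "Living", "Rooms", "Beds", "Baths",
                 "Age", "Acres", "Taxes", "Bias"]

# Weights copied from the printed output above.
theta = [1.00208587e+00, -2.03573715e-01, -3.69620301e-01, 1.15245794e+00,
         -1.58360154e-01, -3.32537316e-02, -4.05963004e-01, -1.14992621e-03,
         0.00000000e+00]

weights = pd.Series(theta, index=feature_names)

# Rank the features by the absolute size of their weight.
ranked = weights.reindex(weights.abs().sort_values(ascending=False).index)
print(ranked)
```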


## 4. Describing the Data

#### a. Does the data have a linear relationship?
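Comparing the list price with the selling price shows a linear relationship between the two: as the list price increases, the selling price increases as well. The value of about 0.99 printed above is the model's R² score on the test set, not a variance, and it means the linear model explains almost all of the variation in selling price. Since the test set holds only 5 homes, this is a rough estimate.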
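As a reminder of what the R² score measures, it can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks that formula against `r2_score`; the true values are the first five selling prices shown earlier, while the predictions are made up for illustration and are not the model's actual output.

```python
import numpy as np
from sklearn.metrics import r2_score

# First five selling prices from the dataset preview.
y_true = np.array([142.0, 175.0, 129.0, 138.0, 232.0])
# Made-up predictions, for illustration only.
y_pred = np.array([145.0, 172.0, 131.0, 135.0, 230.0])

# Residual sum of squares: error left over after the model's predictions.
ss_res = np.sum((y_true - y_pred) ** 2)
# Total sum of squares: error of always predicting the mean.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```

A score near 1 means the predictions leave very little of the spread in the target unexplained.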

#### b. Coefficients
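The coefficients are printed above. The list price and the number of bedrooms carry the most weight, whereas the bias column carries no weight: its coefficient is 0 because `LinearRegression` fits its own intercept by default, which makes a constant column redundant. Because the features are on different scales, the magnitudes are only a rough guide to how much each feature matters.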
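The zero weight on the bias column can be reproduced on a small synthetic example. The sketch below uses made-up data with a known intercept of 5 and fits it twice: once with scikit-learn's default `fit_intercept=True`, and once with `fit_intercept=False` so that the constant column has to carry the offset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 3*x1 - 2*x2 + 5, with no noise.
rng = np.random.default_rng(0)
features = rng.normal(size=(40, 2))
y = 3.0 * features[:, 0] - 2.0 * features[:, 1] + 5.0

# Append a constant column of ones, like dataset["Bias"] = 1.
X = np.column_stack([features, np.ones(len(features))])

# Default: the model fits its own intercept, so the ones column is redundant.
with_intercept = LinearRegression().fit(X, y)

# No built-in intercept: the ones column must absorb the offset.
without_intercept = LinearRegression(fit_intercept=False).fit(X, y)

print("default:          ", with_intercept.coef_, with_intercept.intercept_)
print("fit_intercept=False:", without_intercept.coef_)
```

In the first fit the offset appears in `intercept_` and the weight on the ones column is (numerically) zero; in the second fit the same offset appears as the last coefficient.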

## 5. L2 Regularization

To apply L2 regularization to the model, we would swap `LinearRegression` for scikit-learn's `Ridge` estimator, where `alpha` sets the strength of the penalty.

In [15]:
from sklearn.linear_model import Ridge

clf = Ridge(alpha=0.1)
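The cell above only constructs the estimator. A minimal sketch of fitting it and seeing the effect of the penalty follows; it uses synthetic data rather than the homes data, and the second penalty value of `100.0` is an arbitrary choice made only to show the shrinkage clearly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic regression problem: 35 rows and 8 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(35, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.5, size=35)

ols = LinearRegression().fit(X, y)      # no penalty
weak = Ridge(alpha=0.1).fit(X, y)       # same alpha as the cell above
strong = Ridge(alpha=100.0).fit(X, y)   # much stronger L2 penalty

# The L2 penalty shrinks the weight vector as alpha grows.
for name, model in [("OLS", ols), ("Ridge alpha=0.1", weak), ("Ridge alpha=100", strong)]:
    print(name, np.linalg.norm(model.coef_))
```

On the real data, `alpha` would normally be chosen by comparing the error on the validation set across several candidate values.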

## 6. Statement of Collaboration 

#### a. Whom you worked with

For this assignment, I worked a lot with Matt, Kolby, and Tucker on most of it. However, for most of the by-hand portion I got help from you as well as James.

#### b. Resources Used

The resources I used were a few websites that explained linear regression in more detail and some that helped me with sklearn, but I did not keep a record of which websites they were.

## 7. Linear Regression by Hand

In [None]:
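# Work with NumPy arrays; targets are reshaped to column vectors so that
# y - yHat stays (n, 1) instead of broadcasting to (n, n).
X_tr, y_tr = np.array(X_train, dtype=float), np.array(Y_train, dtype=float).reshape(-1, 1)
X_v, y_v = np.array(X_val, dtype=float), np.array(Y_val, dtype=float).reshape(-1, 1)

mse_record = []
thetaMatrix = np.random.normal(size = (X_tr.shape[1], 1))
# The features are unscaled (Taxes is in the thousands), so this step size may
# need to be much smaller, or the features standardized, for the loop to converge.
alpha = 0.01

for t in range(1, 10):
    yHat = np.dot(X_tr, thetaMatrix)
    err = y_tr - yHat
    gradient = -1 * (np.dot(X_tr.T, err))
    thetaMatrix = thetaMatrix - (alpha * gradient)
    # Average over the actual number of rows rather than a hard-coded 50
    mse = (np.dot(err.T, err) / len(y_tr)).item()

    # Validation error with the updated weights
    err_val = y_v - np.dot(X_v, thetaMatrix)
    mse_val = (np.dot(err_val.T, err_val) / len(y_v)).item()
    mse_record.append([t, mse, mse_val])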
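Gradient descent on the raw columns is sensitive to the step size, because a feature measured in the thousands dominates the gradient. One common remedy, sketched below on synthetic data rather than the homes data, is to standardize the features first and average the gradient over the rows. The feature ranges, the step size `0.1`, and the 500 iterations are all illustrative assumptions; the result is checked against the least-squares solution from `np.linalg.lstsq`.

```python
import numpy as np

# Synthetic data with two features on very different scales.
rng = np.random.default_rng(0)
n = 35
X_raw = np.column_stack([
    rng.uniform(100, 300, n),    # a "list price"-like column
    rng.uniform(1000, 5000, n),  # a "taxes"-like column
])
y = (X_raw @ np.array([0.9, 0.002]) + 10 + rng.normal(0, 1, n)).reshape(-1, 1)

# Standardize each feature to zero mean and unit variance, then add a bias column.
mu, sigma = X_raw.mean(axis=0), X_raw.std(axis=0)
X = np.column_stack([(X_raw - mu) / sigma, np.ones(n)])

theta = np.zeros((X.shape[1], 1))
alpha = 0.1
mse_record = []

for t in range(500):
    err = y - X @ theta                   # residuals, shape (n, 1)
    mse_record.append((err.T @ err).item() / n)
    gradient = -(2 / n) * (X.T @ err)     # gradient of the mean squared error
    theta = theta - alpha * gradient

# Reference solution from ordinary least squares on the same matrix.
theta_exact, *_ = np.linalg.lstsq(X, y, rcond=None)

print("first MSE:", mse_record[0], "last MSE:", mse_record[-1])
print("max difference from least squares:", np.abs(theta - theta_exact).max())
```

Any new data passed to the fitted weights has to be standardized with the same `mu` and `sigma` computed from the training rows.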