## Importing the Libraries

This cell imports pandas, NumPy, and `matplotlib.pyplot` so that they can be used to manipulate and analyze the data from the CSV file.

In [376]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Reading the CSV File

`housing` is created by reading the CSV file with pandas, which loads all of the data into a single DataFrame.

In [377]:
housing = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/CS 430 Machine Learning/InClass_Assignment2/newhousing.csv")

## Exploring the Dataset

`housing.keys()` returns the column labels of the dataset as an `Index`. `housing.head()` shows the first 5 rows. `housing.info()` shows the total number of entries, each column with its non-null count and datatype, and the memory usage. `housing.value_counts()` counts how often each distinct row occurs; every row in the output shown has a count of 1, so those rows are not duplicated. Because `price` is a continuous target, `y` is not stratified in the split later.

In [378]:
housing.keys()

Index(['price', 'area', 'bedrooms', 'bathrooms', 'stories', 'mainroad',
       'guestroom', 'basement', 'hotwaterheating', 'airconditioning',
       'parking', 'prefarea', 'semi-furnished', 'unfurnished',
       'areaperbedroom', 'bbratio'],
      dtype='object')

In [379]:
housing.head()

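| | price | area | bedrooms | bathrooms | stories | mainroad | guestroom | basement | hotwaterheating | airconditioning | parking | prefarea | semi-furnished | unfurnished | areaperbedroom | bbratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5250000 | 5500 | 3 | 2 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1833.333333 | 0.666667 |
| 1 | 4480000 | 4040 | 3 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1346.666667 | 0.333333 |
| 2 | 3570000 | 3640 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1820.0 | 0.5 |
| 3 | 2870000 | 3040 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1520.0 | 0.5 |
| 4 | 3570000 | 4500 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2250.0 | 0.5 |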


In [380]:
housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545 entries, 0 to 544
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   price            545 non-null    int64  
 1   area             545 non-null    int64  
 2   bedrooms         545 non-null    int64  
 3   bathrooms        545 non-null    int64  
 4   stories          545 non-null    int64  
 5   mainroad         545 non-null    int64  
 6   guestroom        545 non-null    int64  
 7   basement         545 non-null    int64  
 8   hotwaterheating  545 non-null    int64  
 9   airconditioning  545 non-null    int64  
 10  parking          545 non-null    int64  
 11  prefarea         545 non-null    int64  
 12  semi-furnished   545 non-null    int64  
 13  unfurnished      545 non-null    int64  
 14  areaperbedroom   545 non-null    float64
 15  bbratio          545 non-null    float64
dtypes: float64(2), int64(14)
memory usage: 68.2 KB


In [381]:
print(housing.value_counts())

price     area   bedrooms  bathrooms  stories  mainroad  guestroom  basement  hotwaterheating  airconditioning  parking  prefarea  semi-furnished  unfurnished  areaperbedroom  bbratio 
1750000   2910   3         1          1        0         0          0         0                0                0        0         0               0            970.000000      0.333333    1
5229000   7085   3         1          1        1         1          1         0                0                2        1         1               0            2361.666667     0.333333    1
5110000   11410  2         1          2        1         0          0         0                0                0        1         0               0            5705.000000     0.500000    1
5145000   3410   3         1          2        0         0          0         0                1                0        0         1               0            1136.666667     0.333333    1
          7980   3         1          1        1       
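
Called on a whole DataFrame, `value_counts()` counts distinct rows, so it mostly reveals duplicates. To look at how a single column is distributed, it can be called on that column instead. The following is a minimal sketch on a small made-up frame (the `demo` values are illustrative assumptions, not the real dataset):

```python
import pandas as pd

# Hypothetical miniature frame; the notebook itself would use `housing`.
demo = pd.DataFrame({
    "price": [1750000, 5229000, 5110000, 5145000, 2870000, 3570000],
    "bedrooms": [3, 3, 2, 3, 2, 2],
})

# Row-level counts: a maximum of 1 means no row is duplicated.
row_counts = demo.value_counts()

# Column-level counts: how many houses have each bedroom count.
bedroom_counts = demo["bedrooms"].value_counts()

# For a continuous column such as price, quantile bins give a rough view of the spread.
price_bins = pd.qcut(demo["price"], q=3).value_counts()

print(row_counts.max())
print(bedroom_counts.to_dict())
print(price_bins.sort_index())
```

On the real data, the same calls would be made on `housing` and `housing["price"]`.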

## Splitting and Training the Dataset

`x` is created by dropping `price` from `housing`, and `y` is the `price` column on its own. `StandardScaler()` from `sklearn.preprocessing` then standardizes `x` and `y` so that every column has zero mean and unit variance. Finally, `train_test_split` from sklearn splits the scaled data 70/30 (`test_size = 0.3`) into 4 parts: `x_train`, `x_test`, `y_train`, and `y_test`. Scaling is useful because PCA is sensitive to the scale of the features; without it, the components would be dominated by large-valued columns such as `area`. Scaling should have little to no effect on the scores of the other regression techniques, only on PCA. Note that the scalers are fitted on the full dataset before the split, so the test rows influence the scaling statistics.

In [382]:
x = housing.drop(['price'], axis = 1)
y = housing[['price']]

In [383]:
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
sc_y = StandardScaler()
x_scaled = sc_x.fit_transform(x)
y_scaled = sc_y.fit_transform(y)

In [384]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_scaled, y_scaled, test_size = 0.3, random_state = 42, shuffle = True)
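
A common alternative is to split first and fit the scalers on the training rows only, so that the test rows do not influence the scaling statistics. The sketch below illustrates that ordering on synthetic data; the arrays `x_demo` and `y_demo` and the `*_demo` scaler names are assumptions made for illustration and are not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing features and price (not the real data).
rng = np.random.default_rng(42)
x_demo = rng.normal(loc=5000, scale=2000, size=(100, 3))
y_demo = (x_demo @ np.array([300.0, 150.0, 50.0])
          + rng.normal(scale=1e5, size=100)).reshape(-1, 1)

# Split first: 70% train, 30% test, matching test_size = 0.3.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.3, random_state=42, shuffle=True
)

# Fit the scalers on the training rows only, then apply them to both parts.
sc_x_demo = StandardScaler().fit(x_tr)
sc_y_demo = StandardScaler().fit(y_tr)
x_tr_s, x_te_s = sc_x_demo.transform(x_tr), sc_x_demo.transform(x_te)
y_tr_s, y_te_s = sc_y_demo.transform(y_tr), sc_y_demo.transform(y_te)

print(x_tr_s.shape, x_te_s.shape)
print(x_tr_s.mean(axis=0).round(6))  # training columns are centered at zero
```

With this ordering, the test columns are close to, but not exactly, zero mean and unit variance, because they are scaled with the training statistics.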

## Linear Regression

`LinearRegression()` is imported from sklearn and created with its default parameters. It is then fitted to the x and y training sets and scored. For a regressor, `score` returns the coefficient of determination (R²) rather than an accuracy. The test R² is about 0.68, which is decent but could be better. The training R² is nearly identical (about 0.68), so the model does not appear to be overfitted.

In [385]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(x_train, y_train)

LinearRegression()

In [386]:
lr.score(x_test, y_test)

0.6814896416143945

In [387]:
lr.score(x_train, y_train)

0.6760953066606616
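
The value returned by `score` for a regressor is R², which is 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on synthetic data and compares it with `score` and `sklearn.metrics.r2_score`; the data and the names `x_demo`, `y_demo`, and `model` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic regression problem (not the housing data).
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 4))
y_demo = x_demo @ np.array([1.5, -2.0, 0.5, 0.0]) + rng.normal(scale=1.0, size=200)

# Simple 70/30 split by position.
model = LinearRegression().fit(x_demo[:140], y_demo[:140])
x_te, y_te = x_demo[140:], y_demo[140:]
pred = model.predict(x_te)

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_te - pred) ** 2)
ss_tot = np.sum((y_te - y_te.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# All three values should agree up to floating-point error.
print(r2_manual, model.score(x_te, y_te), r2_score(y_te, pred))
```

An R² of 0.68 therefore means the model explains about 68% of the variance in the scaled price, not that 68% of predictions are correct.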

## Random Forest Regression

`RandomForestRegressor()` is imported from sklearn and fitted to the x and y training sets. This model is better than the Linear Regression model, with a test R² of about 0.71. It is also overfitted, since the training R² is about 0.93.

In [388]:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor()
rf.fit(x_train, y_train)

RandomForestRegressor()

In [389]:
rf.score(x_test, y_test)

0.7148308328917126

In [390]:
rf.score(x_train, y_train)

0.9330081177414645
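
Two details can make this comparison easier to reproduce: `RandomForestRegressor` expects a 1-D target, and it gives different scores on each run unless `random_state` is set. The sketch below shows both on synthetic data and prints the gap between training and test R²; the data and the names `x_demo`, `y_demo`, and `rf_demo` are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data with a column-vector target, like y_scaled in the notebook.
rng = np.random.default_rng(1)
x_demo = rng.normal(size=(300, 5))
y_demo = (2 * x_demo[:, 0] + x_demo[:, 1]
          + rng.normal(scale=1.0, size=300)).reshape(-1, 1)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.3, random_state=42
)

# ravel() flattens the (n, 1) target to shape (n,); random_state fixes the forest.
rf_demo = RandomForestRegressor(random_state=42)
rf_demo.fit(x_tr, y_tr.ravel())

train_r2 = rf_demo.score(x_tr, y_tr.ravel())
test_r2 = rf_demo.score(x_te, y_te.ravel())
print(f"train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}, "
      f"gap = {train_r2 - test_r2:.3f}")
```

A large positive gap between the two scores is the overfitting signal described above.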

## PCA with Linear Regression

`PCA` is imported from sklearn's `decomposition` module. It is created with `n_components = 10`, fitted to `x_scaled`, and used to transform `x_scaled` into `x_pca`, which has 10 columns instead of the original 15. The data is then split again using `x_pca` with the same split parameters. The `lr` variable is fitted again with the new x and y training sets. The scores show that PCA worked: the test and training R² (about 0.68 and 0.67) stayed roughly the same, and the model is not overfitted.

In [391]:
from sklearn.decomposition import PCA
pca = PCA(n_components = 10)
pca.fit(x_scaled)
x_pca = pca.transform(x_scaled)

In [392]:
x_pca.shape

(545, 10)

In [393]:
x_pca

array([[ 0.6608588 , -0.17104646,  0.29030255, ..., -0.72534517,
         0.1355855 , -0.8905853 ],
       [-0.75750595, -0.45325575, -1.24020117, ..., -0.19650464,
        -0.59505964, -0.46421049],
       [-1.91995217,  1.54975914,  1.24787064, ...,  0.09681975,
        -0.41720109, -0.27603045],
       ...,
       [-0.09605631,  1.65985667, -0.49436167, ...,  0.14449279,
         0.74299408, -1.15045941],
       [ 2.54323222,  4.34098798,  0.0854805 , ...,  0.19610311,
         2.89864034, -0.45194424],
       [-1.94832471, -0.90014852, -0.83130394, ..., -0.0165853 ,
         1.4428123 ,  0.0454547 ]])

In [394]:
x_train, x_test, y_train, y_test = train_test_split(x_pca, y_scaled, test_size = 0.3, random_state = 42, shuffle = True)

In [395]:
lr.fit(x_train, y_train)

LinearRegression()

In [396]:
lr.score(x_test, y_test)

0.6771424578041668

In [397]:
lr.score(x_train, y_train)

0.6699990539740093
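
One way to judge a choice such as 10 components is to look at how much variance the components retain, using the fitted PCA's `explained_variance_ratio_` attribute. The sketch below does this on synthetic data; the data, the 95% threshold, and the names `x_demo`, `pca_demo`, and `n_for_95` are assumptions for illustration, not values from the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 15 columns built from 5 underlying signals plus small noise.
rng = np.random.default_rng(2)
base = rng.normal(size=(200, 5))
x_demo = np.hstack([base, base @ rng.normal(size=(5, 10))])
x_demo = x_demo + rng.normal(scale=0.1, size=x_demo.shape)
x_demo_scaled = StandardScaler().fit_transform(x_demo)

pca_demo = PCA(n_components=10).fit(x_demo_scaled)

# Cumulative share of the total variance kept by the first k components.
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

# Smallest k whose cumulative share reaches 95% (hypothetical threshold).
# This assumes the threshold is reached within the fitted components.
n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)

print(cumulative.round(3))
print(n_for_95)
```

In the notebook, the same attribute is available on `pca` after `pca.fit(x_scaled)`.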

## PCA with Random Forest Regression

Since the data was already re-split with the PCA features in the previous section, only `rf` needs to be fitted again. This model was actually worse, since PCA caused the test R² to drop from about 0.71 to about 0.67. It is also very overfitted, with a training R² of about 0.95, so this is not the best model for this dataset.

In [398]:
rf.fit(x_train, y_train)

RandomForestRegressor()

In [399]:
rf.score(x_test, y_test)

0.6661216872941594

In [400]:
rf.score(x_train, y_train)

0.945211478800498

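## Conclusion

Random Forest without PCA had the highest test R² (about 0.71), but it was heavily overfitted, with a training R² of about 0.93. The best balanced results for this dataset came from the Linear Regression model, whose training and test R² were nearly identical (about 0.68). PCA worked well with Linear Regression, since reducing the features to 10 components kept its scores roughly the same.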