# Data Exploration, Preprocessing, and Visualization

**Week04, Data Preparation**

ISM6136

&copy; 2023 Dr. Tim Smith


<a target="_blank" href="https://colab.research.google.com/github/prof-tcsmith/dm-f23/blob/main/W04/W04.07-Data-Exploration-and-Proprocessing.ipynb#offline=1">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

---

Introduction:
    
The dataset we are using in this notebook is a set of data on house sales in West Roxbury, MA.

We are using this dataset to demonstrate the following common tasks:
1. Loading data into a dataframe
2. Exploring the number of rows and columns found in the data
3. Renaming columns
4. Dropping any columns we are not interested in
5. Identifying missing data
6. Dropping rows with too many missing values
7. Imputing the missing values for rows that are missing only a very small number of values
8. Normalizing (or standardizing) values in the dataframe



## 0. Import required packages

We use pandas, the Python data analysis library, for handling data. The API of this library is very similar to R data frames. See https://pandas.pydata.org/ for details.

In [235]:
# if running on colab, you may need to uncomment the following lines
# if you get an error saying that the module is not found
#!pip install matplotlib
#!pip install numpy
#!pip install pandas
#!pip install scikit-learn
#!pip install pathlib  # part of the Python standard library, so normally not needed
#!pip install ipympl

In [236]:
%matplotlib inline
from pathlib import Path
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression

import matplotlib.pylab as plt

np.random.seed(42)

## 1. Load data

The data is given in a CSV file. Let's try loading the data into a dataframe.

In [237]:
housing_df = pd.read_csv('https://raw.githubusercontent.com/prof-tcsmith/data/master/WestRoxbury.csv') 

> Note that sometimes you may need to look at other read_csv parameters to deal with data that has anomalies, such as commas separating thousands. See the documentation for read_csv found [here](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)
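
As a minimal sketch of the thousands-separator case mentioned in the note, the example below reads a small made-up CSV (the `ID` and `PRICE` columns and their values are hypothetical, not part of the West Roxbury data) with and without the `thousands` parameter of `read_csv`.

```python
import io
import pandas as pd

# Hypothetical CSV text: the PRICE column uses commas as thousands separators
csv_text = 'ID,PRICE\n1,"1,250"\n2,"3,400,000"\n'

# Without the `thousands` argument, PRICE is read as text
raw_df = pd.read_csv(io.StringIO(csv_text))

# With thousands=',' pandas strips the separators and parses the values as numbers
clean_df = pd.read_csv(io.StringIO(csv_text), thousands=',')

print(raw_df['PRICE'].tolist())
print(clean_df['PRICE'].tolist())
```

Checking `dtypes` right after loading is a quick way to spot columns that were read as text when numbers were expected.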

Check if data is loaded into the dataframe by looking at the first 10 rows...

In [238]:
housing_df.head(10)

Unnamed: 0,TOTAL VALUE,TAX,LOT SQFT,YR BUILT,GROSS AREA,LIVING AREA,FLOORS,ROOMS,BEDROOMS,FULL BATH,HALF BATH,KITCHEN,FIREPLACE,REMODEL
0,344.2,4330,9965,1880,2436,1352,2.0,6,3,1,1,1,0,
1,412.6,5190,6590,1945,3108,1976,2.0,10,4,2,1,1,0,Recent
2,330.1,4152,7500,1890,2294,1371,2.0,8,4,1,1,1,0,
3,498.6,6272,13773,1957,5032,2608,1.0,9,5,1,1,1,1,
4,331.5,4170,5000,1910,2370,1438,2.0,7,3,2,0,1,0,
5,337.4,4244,5142,1950,2124,1060,1.0,6,3,1,0,1,1,Old
6,359.4,4521,5000,1954,3220,1916,2.0,7,3,1,1,1,0,
7,320.4,4030,10000,1950,2208,1200,1.0,6,3,1,0,1,0,
8,333.5,4195,6835,1958,2582,1092,1.0,5,3,1,0,1,1,Recent
9,409.4,5150,5093,1900,4818,2992,2.0,8,4,2,0,1,0,


## 2. Explore number of rows and columns 

Let's look at the shape of the data frame. 

In [239]:
housing_df.shape

(5802, 14)

As we can see, the data consists of 5802 rows (observations) and 14 columns (variables). If we wanted to store these values into variables, we could do this as follows:

In [240]:
rows = housing_df.shape[0]
cols = housing_df.shape[1]
print(f"Rows={rows} and Cols={cols}")

Rows=5802 and Cols=14
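
Since `shape` is an ordinary tuple, the two values can also be unpacked in one step. The sketch below uses a small made-up dataframe (`demo_df` is a hypothetical stand-in for `housing_df`).

```python
import pandas as pd

# Small hypothetical dataframe standing in for housing_df
demo_df = pd.DataFrame({'ROOMS': [6, 10, 8], 'BEDROOMS': [3, 4, 4]})

# shape is a (rows, columns) tuple, so it can be unpacked directly
n_rows, n_cols = demo_df.shape
print(f"Rows={n_rows} and Cols={n_cols}")
```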


## 3. Rename Columns

Column names often contain blank spaces or have other issues that make them awkward to work with, so we frequently rename them.

We can get the column names by displaying the columns attribute of the dataframe...

In [241]:
housing_df.columns

Index(['TOTAL VALUE ', 'TAX', 'LOT SQFT ', 'YR BUILT', 'GROSS AREA ',
       'LIVING AREA', 'FLOORS ', 'ROOMS', 'BEDROOMS ', 'FULL BATH',
       'HALF BATH', 'KITCHEN', 'FIREPLACE', 'REMODEL'],
      dtype='object')

Note that some of the column titles end with spaces and some consist of two space-separated words. For further analysis it's more convenient to have column names that are single words with no leading or trailing blank spaces.

If you want, you can rename one column name at a time using the rename command.

In the rename command you can specify individual columns by name and provide their new name using a dictionary. 

> Note that `rename` does not modify the data frame by default. The modification is done on a copy, and the copy is returned by the method, so we assign the result back to `housing_df`. Alternatively, passing `inplace=True` modifies the data frame directly.

In [242]:
housing_df = housing_df.rename(columns={'TOTAL VALUE ': 'TOTAL_VALUE'})
housing_df.columns

Index(['TOTAL_VALUE', 'TAX', 'LOT SQFT ', 'YR BUILT', 'GROSS AREA ',
       'LIVING AREA', 'FLOORS ', 'ROOMS', 'BEDROOMS ', 'FULL BATH',
       'HALF BATH', 'KITCHEN', 'FIREPLACE', 'REMODEL'],
      dtype='object')
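
To make the difference between the two calling styles of `rename` concrete, here is a small sketch on a made-up one-row dataframe (`demo_df` is hypothetical). It shows that the default call leaves the original untouched, while `inplace=True` changes it and returns `None`.

```python
import pandas as pd

# Hypothetical one-row dataframe with a trailing space in a column name
demo_df = pd.DataFrame({'TOTAL VALUE ': [344.2], 'TAX': [4330]})

# Default behaviour: rename returns a new dataframe; demo_df is unchanged
renamed_df = demo_df.rename(columns={'TOTAL VALUE ': 'TOTAL_VALUE'})
print(list(demo_df.columns))
print(list(renamed_df.columns))

# With inplace=True the dataframe is modified directly and None is returned
result = demo_df.rename(columns={'TOTAL VALUE ': 'TOTAL_VALUE'}, inplace=True)
print(result)
print(list(demo_df.columns))
```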

But often we may want to process all the column names at once based on a pattern. The problem we have with the current column names is the spaces. Let's change all the spaces to underscores using a list comprehension (for more on list comprehensions, see DataCamp Python DataScience Toolbox Part 2).

Below, instead of using the `rename` method, we create a modified copy of `columns` and assign it to the `columns` attribute of the dataframe.

In [243]:
housing_df.columns = [s.strip().replace(' ', '_') for s in housing_df.columns] # note that we can change column names by simply assigning a new list of column names
housing_df.columns

Index(['TOTAL_VALUE', 'TAX', 'LOT_SQFT', 'YR_BUILT', 'GROSS_AREA',
       'LIVING_AREA', 'FLOORS', 'ROOMS', 'BEDROOMS', 'FULL_BATH', 'HALF_BATH',
       'KITCHEN', 'FIREPLACE', 'REMODEL'],
      dtype='object')

> NOTE: ```s.strip()``` returns a string with any trailing or leading whitespace removed. ```s.strip().replace(' ', '_')``` takes this returned string and replaces any remaining blank spaces (the ones between words are the only ones left) with underscores.
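
The same cleanup can also be written with the `.str` accessor on the column index instead of a list comprehension. This is an alternative sketch, not the approach used in the cell above, and `demo_df` is a hypothetical empty dataframe that only carries column names.

```python
import pandas as pd

# Hypothetical dataframe with messy column names and no rows
demo_df = pd.DataFrame(columns=['TOTAL VALUE ', 'LOT SQFT ', 'TAX'])

# Index.str applies string methods to every column name at once
demo_df.columns = demo_df.columns.str.strip().str.replace(' ', '_')
print(list(demo_df.columns))
```

Both forms give the same names; the list comprehension is plain Python, while the `.str` version stays within pandas.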

## 4. Drop any columns we're not interested in

In this dataset, we've decided to remove the tax variable. To remove a column, use drop.


In [244]:
housing_df.drop(columns=['TAX']) # note: we can drop multiple columns by including more column names in the list

Unnamed: 0,TOTAL_VALUE,LOT_SQFT,YR_BUILT,GROSS_AREA,LIVING_AREA,FLOORS,ROOMS,BEDROOMS,FULL_BATH,HALF_BATH,KITCHEN,FIREPLACE,REMODEL
0,344.2,9965,1880,2436,1352,2.0,6,3,1,1,1,0,
1,412.6,6590,1945,3108,1976,2.0,10,4,2,1,1,0,Recent
2,330.1,7500,1890,2294,1371,2.0,8,4,1,1,1,0,
3,498.6,13773,1957,5032,2608,1.0,9,5,1,1,1,1,
4,331.5,5000,1910,2370,1438,2.0,7,3,2,0,1,0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
5797,404.8,6762,1938,2594,1714,2.0,9,3,2,1,1,1,Recent
5798,407.9,9408,1950,2414,1333,2.0,6,3,1,1,1,1,
5799,406.5,7198,1987,2480,1674,2.0,7,3,1,1,1,1,
5800,308.7,6890,1946,2000,1000,1.0,5,2,1,0,1,0,


An important thing to notice is that housing_df has not been altered by the above drop command. The drop method returns a new dataframe with the column dropped, and this new dataframe is displayed but not stored over the original. So, if we look at the dataframe, we see that the TAX column still exists...

In [245]:
housing_df

Unnamed: 0,TOTAL_VALUE,TAX,LOT_SQFT,YR_BUILT,GROSS_AREA,LIVING_AREA,FLOORS,ROOMS,BEDROOMS,FULL_BATH,HALF_BATH,KITCHEN,FIREPLACE,REMODEL
0,344.2,4330,9965,1880,2436,1352,2.0,6,3,1,1,1,0,
1,412.6,5190,6590,1945,3108,1976,2.0,10,4,2,1,1,0,Recent
2,330.1,4152,7500,1890,2294,1371,2.0,8,4,1,1,1,0,
3,498.6,6272,13773,1957,5032,2608,1.0,9,5,1,1,1,1,
4,331.5,4170,5000,1910,2370,1438,2.0,7,3,2,0,1,0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5797,404.8,5092,6762,1938,2594,1714,2.0,9,3,2,1,1,1,Recent
5798,407.9,5131,9408,1950,2414,1333,2.0,6,3,1,1,1,1,
5799,406.5,5113,7198,1987,2480,1674,2.0,7,3,1,1,1,1,
5800,308.7,3883,6890,1946,2000,1000,1.0,5,2,1,0,1,0,


> **When altering pandas dataframes, you often need to 'save' the results over the original. So, in order for us to drop the TAX column from the housing_df, we need to do the following...**

In [246]:
housing_df = housing_df.drop(columns=['TAX'])

In [247]:
housing_df

Unnamed: 0,TOTAL_VALUE,LOT_SQFT,YR_BUILT,GROSS_AREA,LIVING_AREA,FLOORS,ROOMS,BEDROOMS,FULL_BATH,HALF_BATH,KITCHEN,FIREPLACE,REMODEL
0,344.2,9965,1880,2436,1352,2.0,6,3,1,1,1,0,
1,412.6,6590,1945,3108,1976,2.0,10,4,2,1,1,0,Recent
2,330.1,7500,1890,2294,1371,2.0,8,4,1,1,1,0,
3,498.6,13773,1957,5032,2608,1.0,9,5,1,1,1,1,
4,331.5,5000,1910,2370,1438,2.0,7,3,2,0,1,0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
5797,404.8,6762,1938,2594,1714,2.0,9,3,2,1,1,1,Recent
5798,407.9,9408,1950,2414,1333,2.0,6,3,1,1,1,1,
5799,406.5,7198,1987,2480,1674,2.0,7,3,1,1,1,1,
5800,308.7,6890,1946,2000,1000,1.0,5,2,1,0,1,0,
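
As noted in the comment on the first drop cell, several columns can be dropped in one call. A minimal sketch on a made-up dataframe (`demo_df` and the choice of columns to drop are hypothetical):

```python
import pandas as pd

# Hypothetical one-row dataframe
demo_df = pd.DataFrame({'TOTAL_VALUE': [344.2], 'TAX': [4330], 'KITCHEN': [1]})

# drop returns a new dataframe, so assign the result to keep the change
trimmed_df = demo_df.drop(columns=['TAX', 'KITCHEN'])

print(list(demo_df.columns))     # the original still has all three columns
print(list(trimmed_df.columns))  # only TOTAL_VALUE remains
```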


## 5. Identify any categorical data in the dataframe, and be sure that the data is loaded correctly

Categorical data is data that represents categories. Categorical data shouldn't be treated like a number, as it isn't one.

Notice in our data we have the variable (column) REMODEL, which takes one of three possible values: None, Recent, or Old. As the output below shows, pandas displays the None case as nan (a missing value).

First, let's look at this REMODEL variable...

In [248]:
housing_df['REMODEL'].unique()

array([nan, 'Recent', 'Old'], dtype=object)

Let's see what data types pandas has chosen for our variables (columns).

In [249]:
housing_df.dtypes

TOTAL_VALUE    float64
LOT_SQFT         int64
YR_BUILT         int64
GROSS_AREA       int64
LIVING_AREA      int64
FLOORS         float64
ROOMS            int64
BEDROOMS         int64
FULL_BATH        int64
HALF_BATH        int64
KITCHEN          int64
FIREPLACE        int64
REMODEL         object
dtype: object

The REMODEL column is set as an 'object'. This is a general type, and we want it to be treated as a categorical type, so we need to change the column REMODEL from object to categorical.

In [250]:
print(housing_df.REMODEL.dtype) # double check and confirm that the column is an 'object' type
housing_df.REMODEL = housing_df.REMODEL.astype('category') # change the column type to a categorical variable
print(housing_df.REMODEL.cat.categories)  # Print the categories found
print(housing_df.REMODEL.dtype)  # now, let's confirm that the column type has been changed to categorical

object
Index(['Old', 'Recent'], dtype='object')
category
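
The sketch below, on a made-up series, shows what the categorical type stores: a sorted list of categories plus an integer code per row, with missing values coded as -1. The second part is one possible way to keep the "no remodel" case as an explicit category by filling missing values first; that step is an assumption for illustration and is not what the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical REMODEL values, including missing entries
remodel = pd.Series([np.nan, 'Recent', 'Old', np.nan], name='REMODEL')

# Converting to category: missing values are not a category and get code -1
remodel_cat = remodel.astype('category')
print(list(remodel_cat.cat.categories))
print(remodel_cat.cat.codes.tolist())

# Optional alternative (assumption): make the missing case an explicit category
remodel_filled = remodel.fillna('None').astype('category')
print(list(remodel_filled.cat.categories))
```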


## 6. Identify and Handle Any Missing Data

First, let's identify any columns that contain missing data...

In [251]:
housing_df.isna().sum()

TOTAL_VALUE       0
LOT_SQFT          0
YR_BUILT          0
GROSS_AREA        0
LIVING_AREA       0
FLOORS            0
ROOMS             0
BEDROOMS          0
FULL_BATH         0
HALF_BATH         0
KITCHEN           0
FIREPLACE         0
REMODEL        4346
dtype: int64

Apart from REMODEL, whose 4346 nan entries represent the None category (homes with no remodel), there don't seem to be any missing values in this data. For demonstration purposes, let's randomly select 10 rows and set their BEDROOMS values to nan (which stands for 'not a number', but is used by pandas to indicate a missing value).

In [252]:
print(f"Number of rows with valid BEDROOMS values before: {housing_df['BEDROOMS'].count()}") 
add_missing_rows = housing_df.sample(10).index # create a random selection of rows whose BEDROOMS values we will overwrite with NAN
housing_df.loc[add_missing_rows, 'BEDROOMS'] = np.nan  # set the BEDROOMS values of these rows to NAN
print(f"Number of rows with valid BEDROOMS values after setting to NAN: {housing_df['BEDROOMS'].count()}") 


Number of rows with valid BEDROOMS values before: 5802
Number of rows with valid BEDROOMS values after setting to NAN: 5792


There are many ways we could identify any missing values (nan). If we want to check all variables to find if any have missing values, we can do the following.

In [253]:
housing_df.isnull().sum()

TOTAL_VALUE       0
LOT_SQFT          0
YR_BUILT          0
GROSS_AREA        0
LIVING_AREA       0
FLOORS            0
ROOMS             0
BEDROOMS         10
FULL_BATH         0
HALF_BATH         0
KITCHEN           0
FIREPLACE         0
REMODEL        4346
dtype: int64

As we expected, we find that BEDROOMS has 10 missing values.

In [254]:
housing_df['BEDROOMS'].count() # note that we can count the number of bedroom values, and find that it's less than the number of rows in the dataframe, therefore must now contain missing values

5792
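
Beyond raw counts, it is often useful to see the share of missing values per column and the list of affected columns. A minimal sketch on a made-up dataframe (`demo_df` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe with two missing BEDROOMS values
demo_df = pd.DataFrame({
    'ROOMS': [6, 10, 8, 9],
    'BEDROOMS': [3, np.nan, 4, np.nan],
})

missing_counts = demo_df.isna().sum()   # number of missing values per column
missing_share = demo_df.isna().mean()   # fraction of missing values per column
cols_with_missing = missing_counts[missing_counts > 0].index.tolist()

print(missing_share)
print(cols_with_missing)
```

Taking the mean of the boolean mask works because `True` counts as 1 and `False` as 0.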

When we find NAN's (Missing numbers/values), we have two general approaches we can take. 

1. Replace the NAN with the mean of the other values in the column (this is often referred to as **'imputing'** a value).
2. Drop the row (observation) that contains the NAN

### Imputing missing values

Often, a row has only a small portion of its values missing. For instance, if a row has 10 values and only one of them is empty, it would be advantageous to keep the other 9 values and 'impute' the missing value. Imputing a value is typically done by replacing the missing value with the mean (average) or the median of all values for the variable.

In the code below, I replace the missing values using the median of the remaining values.

NOTE: By default, the `median` method of a pandas dataframe ignores NA values. This is in contrast to R where this must be specified explicitly.

In [255]:
medianBedrooms = housing_df['BEDROOMS'].median()
housing_df.BEDROOMS = housing_df.BEDROOMS.fillna(value=medianBedrooms)
print(f"Number of rows with valid BEDROOMS values after filling NA values: {housing_df['BEDROOMS'].count()}")

Number of rows with valid BEDROOMS values after filling NA values: 5802
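
scikit-learn offers the same median imputation through `SimpleImputer`, which can be convenient when the fill value learned from one dataset has to be reused on another. This is an alternative sketch on made-up data, not the method used in the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical BEDROOMS column with one missing value
demo_df = pd.DataFrame({'BEDROOMS': [3, np.nan, 4, 3, 5]})

# strategy='median' learns the median of the non-missing values during fit
imputer = SimpleImputer(strategy='median')
demo_df[['BEDROOMS']] = imputer.fit_transform(demo_df[['BEDROOMS']])

print(imputer.statistics_)          # the learned median per column
print(demo_df['BEDROOMS'].tolist())
```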


### Dropping rows that have over a certain number of missing values

For demonstration purposes, let's add more NAN's into our data. This time we will pick a sample of rows and change 3 of the values to NAN.

In [256]:
add_missing_values = housing_df.iloc[0:5,:].index # let's overwrite the first 5 rows with NAN's for bedrooms, fireplace and rooms
housing_df.loc[add_missing_values, 'BEDROOMS'] = np.nan  # change these rows to have BEDROOMS values NAN
housing_df.loc[add_missing_values, 'FIREPLACE'] = np.nan  # change these rows to have FIREPLACE values NAN
housing_df.loc[add_missing_values, 'ROOMS'] = np.nan  # change these rows to have ROOMS values NAN

In [257]:
housing_df.head(10)

Unnamed: 0,TOTAL_VALUE,LOT_SQFT,YR_BUILT,GROSS_AREA,LIVING_AREA,FLOORS,ROOMS,BEDROOMS,FULL_BATH,HALF_BATH,KITCHEN,FIREPLACE,REMODEL
0,344.2,9965,1880,2436,1352,2.0,,,1,1,1,,
1,412.6,6590,1945,3108,1976,2.0,,,2,1,1,,Recent
2,330.1,7500,1890,2294,1371,2.0,,,1,1,1,,
3,498.6,13773,1957,5032,2608,1.0,,,1,1,1,,
4,331.5,5000,1910,2370,1438,2.0,,,2,0,1,,
5,337.4,5142,1950,2124,1060,1.0,6.0,3.0,1,0,1,1.0,Old
6,359.4,5000,1954,3220,1916,2.0,7.0,3.0,1,1,1,0.0,
7,320.4,10000,1950,2208,1200,1.0,6.0,3.0,1,0,1,0.0,
8,333.5,6835,1958,2582,1092,1.0,5.0,3.0,1,0,1,1.0,Recent
9,409.4,5093,1900,4818,2992,2.0,8.0,4.0,2,0,1,0.0,


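Now, let's remove any rows that have three or more missing values (that is, keep only the rows with fewer than three). Note that a nan in REMODEL counts toward this total.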

In [258]:
housing_df = housing_df[housing_df.isnull().sum(axis=1) < 3]
housing_df

Unnamed: 0,TOTAL_VALUE,LOT_SQFT,YR_BUILT,GROSS_AREA,LIVING_AREA,FLOORS,ROOMS,BEDROOMS,FULL_BATH,HALF_BATH,KITCHEN,FIREPLACE,REMODEL
5,337.4,5142,1950,2124,1060,1.0,6.0,3.0,1,0,1,1.0,Old
6,359.4,5000,1954,3220,1916,2.0,7.0,3.0,1,1,1,0.0,
7,320.4,10000,1950,2208,1200,1.0,6.0,3.0,1,0,1,0.0,
8,333.5,6835,1958,2582,1092,1.0,5.0,3.0,1,0,1,1.0,Recent
9,409.4,5093,1900,4818,2992,2.0,8.0,4.0,2,0,1,0.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
5797,404.8,6762,1938,2594,1714,2.0,9.0,3.0,2,1,1,1.0,Recent
5798,407.9,9408,1950,2414,1333,2.0,6.0,3.0,1,1,1,1.0,
5799,406.5,7198,1987,2480,1674,2.0,7.0,3.0,1,1,1,1.0,
5800,308.7,6890,1946,2000,1000,1.0,5.0,2.0,1,0,1,0.0,
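
The same row filter can be expressed with `dropna(thresh=...)`, which keeps rows that have at least `thresh` non-missing values. The sketch below compares both forms on a made-up dataframe; `demo_df` and `max_missing` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe: rows have 3, 0, 1 and 2 missing values respectively
demo_df = pd.DataFrame({
    'ROOMS':     [np.nan, 7, 6, np.nan],
    'BEDROOMS':  [np.nan, 3, np.nan, 3],
    'FIREPLACE': [np.nan, 0, 1, np.nan],
    'KITCHEN':   [1, 1, 1, 1],
})

max_missing = 2  # keep rows with at most this many missing values

# Boolean-mask version (equivalent to "fewer than 3 missing values")
mask_df = demo_df[demo_df.isna().sum(axis=1) <= max_missing]

# dropna version: require at least (number of columns - max_missing) valid values
thresh_df = demo_df.dropna(thresh=demo_df.shape[1] - max_missing)

print(mask_df.index.tolist())
print(thresh_df.index.tolist())
```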


## 7. Converting categorical data into dummy variables 

Pandas provides a method to convert factors into dummy variables. 

If you're not familiar with what a dummy variable is, notice that in our data the REMODEL variable is categorical and has three categories. These categories shouldn't be interpreted as ordered, or as being an equal distance apart. When we find such variables, we need to encode them using either a dummy variable method, or 'one hot encoding'. This is a topic we will discuss more later. For now, let's look at how we can encode the REMODEL variable into dummy variables...

In [259]:
# recall that we have three values, recent, old, or nan (which means no data)
housing_df['REMODEL'].unique()

['Old', NaN, 'Recent']
Categories (2, object): ['Old', 'Recent']

In [260]:
housing_df = pd.get_dummies(housing_df, prefix_sep='_', dummy_na=False, drop_first=True, columns=['REMODEL'])
housing_df.columns

Index(['TOTAL_VALUE', 'LOT_SQFT', 'YR_BUILT', 'GROSS_AREA', 'LIVING_AREA',
       'FLOORS', 'ROOMS', 'BEDROOMS', 'FULL_BATH', 'HALF_BATH', 'KITCHEN',
       'FIREPLACE', 'REMODEL_Recent'],
      dtype='object')

Note that we had three possible categories for REMODEL: Old, Recent or None (stored as nan).

Because we passed `drop_first=True`, pandas drops the dummy column for the first category (Old), and because `dummy_na=False`, no column is created for missing values. Only one dummy variable remains: REMODEL_Recent.

If REMODEL is Recent, REMODEL_Recent will be True.

Otherwise, if REMODEL is Old or None, REMODEL_Recent will be False, so this encoding does not distinguish between those two cases.


In [261]:
print(housing_df.loc[:,'REMODEL_Recent'].head(5))

5    False
6    False
7    False
8     True
9    False
Name: REMODEL_Recent, dtype: bool
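
To see how the `drop_first` and `dummy_na` arguments change the result, the sketch below encodes a made-up three-row REMODEL column in three ways (`demo_df` is hypothetical). Only the resulting column names are compared, since the dtype of the dummy columns differs between pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical REMODEL column: one missing, one Recent, one Old
demo_df = pd.DataFrame({'REMODEL': [np.nan, 'Recent', 'Old']})

# One dummy column per category; missing values get no column
all_levels = pd.get_dummies(demo_df, columns=['REMODEL'])

# drop_first=True removes the first category (Old)
dropped_first = pd.get_dummies(demo_df, columns=['REMODEL'], drop_first=True)

# dummy_na=True adds an extra indicator column for missing values
with_na = pd.get_dummies(demo_df, columns=['REMODEL'], dummy_na=True)

print(list(all_levels.columns))
print(list(dropped_first.columns))
print(list(with_na.columns))
```

If Old and the missing case need to be told apart in a model, either of the last two variants without `drop_first` keeps that information.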


## 8. Partitioning data into training and validation sets

Split the dataset into training (70%) and validation (30%) sets. 

Randomly sample 70% of the dataset into a new data frame `train_data`. The remaining 30% serve as validation.

In [262]:
# random_state is set to a defined value to get the same partitions when re-running the code
train_data= housing_df.sample(frac=0.7, random_state=1)
# assign rows that are not already in the training set, into validation 
valid_data = housing_df.drop(train_data.index)

print('Training   : ', train_data.shape)
print('Validation : ', valid_data.shape)
print()

# alternative way using scikit-learn (note: test_size=0.40 gives a 60/40 split, which replaces the 70/30 split created above)
train_data, valid_data = train_test_split(housing_df, test_size=0.40, random_state=1)
print('Training   : ', train_data.shape)
print('Validation : ', valid_data.shape)

Training   :  (4058, 13)
Validation :  (1739, 13)

Training   :  (3478, 13)
Validation :  (2319, 13)
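
For reference, a 70/30 split with `train_test_split` uses `test_size=0.30`. The sketch below runs it on a made-up ten-row dataframe (`demo_df` is hypothetical) and checks that the two partitions do not overlap.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataframe with ten rows
demo_df = pd.DataFrame({'x': range(10)})

# test_size is the validation share; random_state makes the split repeatable
train_part, valid_part = train_test_split(demo_df, test_size=0.30, random_state=1)

print(len(train_part), len(valid_part))

# Every row lands in exactly one of the two partitions
overlap = set(train_part.index) & set(valid_part.index)
print(overlap)
```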


## 9. Table - scaling data (standardizing)

We often rescale data within a dataframe. Rescaling is often necessary to remove the influence of different measurement scales. Some variables have very large values, while others have rather small ones. This is a difference in scale, and some machine learning algorithms work best if we 'normalize' the data so that scale is not a factor.
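
As a small illustration of what standardizing does, the sketch below computes z-scores by hand, z = (x - mean) / std, and compares them with scikit-learn's `StandardScaler` on made-up values (`demo_df` is hypothetical). `StandardScaler` uses the population standard deviation, which corresponds to `ddof=0` in pandas.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical values on very different scales
demo_df = pd.DataFrame({'LOT_SQFT': [5000.0, 6590.0, 9965.0, 13773.0],
                        'ROOMS': [6.0, 10.0, 8.0, 9.0]})

# Manual z-score: subtract the column mean, divide by the population std (ddof=0)
manual = (demo_df - demo_df.mean()) / demo_df.std(ddof=0)

# Same computation with scikit-learn
scaled = pd.DataFrame(StandardScaler().fit_transform(demo_df),
                      index=demo_df.index, columns=demo_df.columns)

print(np.allclose(manual, scaled))
print(scaled.mean().round(6).tolist())
```

After standardizing, each column has mean 0 and standard deviation 1, so both variables contribute on a comparable scale.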

In [263]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, MinMaxScaler

In [264]:
# let's create a list of column names, excluding the REMODEL_Recent column (which is our dummy encoded column)
col_names_2b_transformed = train_data.columns.drop('REMODEL_Recent')
col_names_2b_transformed

Index(['TOTAL_VALUE', 'LOT_SQFT', 'YR_BUILT', 'GROSS_AREA', 'LIVING_AREA',
       'FLOORS', 'ROOMS', 'BEDROOMS', 'FULL_BATH', 'HALF_BATH', 'KITCHEN',
       'FIREPLACE'],
      dtype='object')

In [265]:
train_data_scaled = train_data[col_names_2b_transformed]
valid_data_scaled = valid_data[col_names_2b_transformed]

In [266]:
train_data_scaled.head()

Unnamed: 0,TOTAL_VALUE,LOT_SQFT,YR_BUILT,GROSS_AREA,LIVING_AREA,FLOORS,ROOMS,BEDROOMS,FULL_BATH,HALF_BATH,KITCHEN,FIREPLACE
4595,695.6,5805,1987,5442,3667,2.5,10.0,5.0,3,0,1,1.0
2029,365.6,6617,1920,2233,1350,2.0,7.0,3.0,1,0,1,0.0
5141,492.6,6139,1935,2864,1852,2.0,8.0,4.0,1,1,1,1.0
4954,307.1,5835,1957,2418,1306,1.5,7.0,3.0,1,1,1,0.0
426,339.0,10370,1960,2836,1435,1.5,7.0,3.0,2,0,1,0.0


In [267]:
# A common scaling technique is to standardize the data: each variable is centered on its mean and divided by its standard deviation.
# The scaler is fit on the training data only and then applied to the validation data.
scaler = StandardScaler()
train_data_scaled = pd.DataFrame(scaler.fit_transform(train_data_scaled),
                       index=train_data_scaled.index, columns=train_data_scaled.columns)
valid_data_scaled = pd.DataFrame(scaler.transform(valid_data_scaled),
                       index=valid_data_scaled.index, columns=valid_data_scaled.columns)

# Other rescaling techniques - min_max scaling
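
The comment above mentions min-max scaling without showing it. A minimal sketch, assuming the same fit-on-training, transform-on-validation pattern and using made-up values (`train_demo` and `valid_demo` are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training and validation values for a single variable
train_demo = pd.DataFrame({'ROOMS': [5.0, 7.0, 10.0]})
valid_demo = pd.DataFrame({'ROOMS': [6.0, 12.0]})

# Min-max scaling maps the training minimum to 0 and the training maximum to 1
mm_scaler = MinMaxScaler()
train_mm = pd.DataFrame(mm_scaler.fit_transform(train_demo),
                        index=train_demo.index, columns=train_demo.columns)

# The validation data reuses the training min and max, so values can fall outside [0, 1]
valid_mm = pd.DataFrame(mm_scaler.transform(valid_demo),
                        index=valid_demo.index, columns=valid_demo.columns)

print(train_mm['ROOMS'].tolist())
print(valid_mm['ROOMS'].tolist())
```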


In [268]:
train_data_scaled['REMODEL_Recent'] = train_data['REMODEL_Recent']
valid_data_scaled['REMODEL_Recent'] = valid_data['REMODEL_Recent']


In [269]:
train_data_scaled.head()

Unnamed: 0,TOTAL_VALUE,LOT_SQFT,YR_BUILT,GROSS_AREA,LIVING_AREA,FLOORS,ROOMS,BEDROOMS,FULL_BATH,HALF_BATH,KITCHEN,FIREPLACE,REMODEL_Recent
4595,3.117638,-0.176282,1.20802,2.900597,3.79996,1.842958,2.102601,2.153959,3.28176,-1.160655,-0.123199,0.468685,False
2029,-0.263624,0.130247,-0.403536,-0.779716,-0.563553,0.714272,0.015403,-0.255261,-0.569104,-1.160655,-0.123199,-1.32262,False
5141,1.03765,-0.050198,-0.042739,-0.05604,0.381844,0.714272,0.711136,0.949349,-0.569104,0.73989,-0.123199,0.468685,False
4954,-0.86303,-0.164957,0.486428,-0.567545,-0.646416,-0.414414,0.015403,-0.255261,-0.569104,0.73989,-0.123199,-1.32262,False
426,-0.536175,1.547,0.558587,-0.088152,-0.403476,-0.414414,0.015403,-0.255261,1.356328,-1.160655,-0.123199,-1.32262,False


In [270]:
train_data.head()

Unnamed: 0,TOTAL_VALUE,LOT_SQFT,YR_BUILT,GROSS_AREA,LIVING_AREA,FLOORS,ROOMS,BEDROOMS,FULL_BATH,HALF_BATH,KITCHEN,FIREPLACE,REMODEL_Recent
4595,695.6,5805,1987,5442,3667,2.5,10.0,5.0,3,0,1,1.0,False
2029,365.6,6617,1920,2233,1350,2.0,7.0,3.0,1,0,1,0.0,False
5141,492.6,6139,1935,2864,1852,2.0,8.0,4.0,1,1,1,1.0,False
4954,307.1,5835,1957,2418,1306,1.5,7.0,3.0,1,1,1,0.0,False
426,339.0,10370,1960,2836,1435,1.5,7.0,3.0,2,0,1,0.0,False


In [271]:
train_data_scaled.head()

Unnamed: 0,TOTAL_VALUE,LOT_SQFT,YR_BUILT,GROSS_AREA,LIVING_AREA,FLOORS,ROOMS,BEDROOMS,FULL_BATH,HALF_BATH,KITCHEN,FIREPLACE,REMODEL_Recent
4595,3.117638,-0.176282,1.20802,2.900597,3.79996,1.842958,2.102601,2.153959,3.28176,-1.160655,-0.123199,0.468685,False
2029,-0.263624,0.130247,-0.403536,-0.779716,-0.563553,0.714272,0.015403,-0.255261,-0.569104,-1.160655,-0.123199,-1.32262,False
5141,1.03765,-0.050198,-0.042739,-0.05604,0.381844,0.714272,0.711136,0.949349,-0.569104,0.73989,-0.123199,0.468685,False
4954,-0.86303,-0.164957,0.486428,-0.567545,-0.646416,-0.414414,0.015403,-0.255261,-0.569104,0.73989,-0.123199,-1.32262,False
426,-0.536175,1.547,0.558587,-0.088152,-0.403476,-0.414414,0.015403,-0.255261,1.356328,-1.160655,-0.123199,-1.32262,False


## Developing a Predictive Model for House Prices

In [272]:
## Developing a predictive model for House Price

# Let's use a linear regression model to predict the house price,
# starting from the number of rooms and then adding more predictors

# First, let's create a linear regression model using scikit-learn

# Step 1: Create a model object
model = LinearRegression()

# The value noted after each option comes from mean_squared_error (see the last cell),
# so it is an MSE on the standardized target rather than an RMSE
#predictors = ['ROOMS'] # MSE 0.6465176523942427
#predictors = ['ROOMS', 'GROSS_AREA'] # MSE 0.35713382179628483
#predictors = ['ROOMS', 'GROSS_AREA', 'LOT_SQFT']  # MSE 0.30844613251318903
predictors = ['ROOMS', 'GROSS_AREA', 'LOT_SQFT', 'BEDROOMS', 'FULL_BATH', 'REMODEL_Recent'] # MSE 0.28657130119558755

target = ['TOTAL_VALUE']
predicted = ['PREDICTED']

X_train = train_data_scaled[predictors].copy()
y_train = train_data_scaled[target].copy()

X_valid = valid_data_scaled[predictors].copy()
y_valid = valid_data_scaled[target].copy()

# Step 2: Train the model (on the training data)
model.fit(X_train, y_train)

# Step 3: Use the model to predict the target values
y_valid[predicted] = model.predict(X_valid)
y_valid.head()

Unnamed: 0,TOTAL_VALUE,PREDICTED
1028,-0.656056,-0.644447
3497,-0.763641,-1.055954
4833,1.679065,0.792201
1912,0.500746,0.196845
1756,-0.530027,-0.412276


In [273]:
from sklearn.metrics import mean_squared_error

mean_squared_error(y_valid[target], y_valid[predicted]) 


0.28657130119558755
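
`mean_squared_error` returns the MSE. If the root mean squared error (RMSE) or R² is wanted instead, both can be derived as in the sketch below, which uses made-up values rather than the housing predictions (`y_true` and `y_pred` are hypothetical).

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true and predicted values
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 4.0])

mse = mean_squared_error(y_true, y_pred)  # mean of the squared errors
rmse = np.sqrt(mse)                       # same units as the target
r2 = r2_score(y_true, y_pred)             # share of variance explained

print(mse, rmse, r2)
```

RMSE is expressed in the units of the target, which here would be standard deviations of TOTAL_VALUE because the target was standardized.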