### Welcome to the introductory notebook to the real estate predictor!

If you haven't installed the library yet, you can find the installation steps on the main GitHub page [here](https://github.com/julian-fong/real-estate-predictor).

In this notebook, we'll give an overview of the project and walk through some sample code.

In [6]:
import warnings
warnings.filterwarnings('ignore')

## Quick prediction sample

Note:

For the section below, you'll need a [Repliers API key](https://repliers.com/) to generate a prediction with the sample models, because the prediction function accepts an `mlsNumber` as a parameter.

In [7]:
from real_estate_predictor.predict import predict_sale_listing

Let's take a sample listing in the GTA market. The listing "277 Calvert Rd" is a detached house with 4 bedrooms, 5 bathrooms, and 2 parking spots. It is currently listed for 2,180,000 dollars on the open market. Its corresponding mlsNumber is "N12174739".

We can pass the mlsNumber directly to `predict_sale_listing` to have our machine learning model estimate its closing price.

In [8]:
prediction = predict_sale_listing("N12174739")
prediction

{'mlsNumber': 'N12174739', 'prediction': 2352996.25}
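
To put the estimate in context, it can be compared against the list price quoted above. The snippet below is only a small sketch: it hard-codes the dictionary shown in the output, so it does not call the API, and the variable names are illustrative.

```python
# Values copied from the notebook output and the listing description above.
prediction = {"mlsNumber": "N12174739", "prediction": 2352996.25}
list_price = 2_180_000

# Absolute and relative gap between the model's estimate and the list price.
difference = prediction["prediction"] - list_price
pct_over_list = difference / list_price * 100

print(f"Estimate is {difference:,.2f} dollars ({pct_over_list:.1f}%) above list price")
```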

## Exploring some of the features in the project

Below is a quick list of some of the things you can do with this library.

- **Data Cleaning**: Handles missing values and removes bad/invalid values via pandas.

- **Feature Engineering**: Creates new features from existing predictors via pandas.

- **Data Preprocessing**: Scales numerical predictors, encodes categorical variables, and splits the dataset into training, validation, and test sets via sklearn.

We'll use some mock data to try out some of the functions in the library.

### Data Cleaning

In [21]:
from real_estate_predictor.tests.test_preprocessor import data
import pandas as pd

mock_data = pd.DataFrame(data)
mock_data.head()

Unnamed: 0,class,type,listPrice,listDate,soldPrice,soldDate,city,area,district,neighborhood,...,numBathrooms,numBedrooms,style,numKitchens,numRooms,numParkingSpaces,sqft,propertyType,numGarageSpaces,numDrivewaySpaces
0,CondoProperty,Lease,2200.0,2023-12-01 00:00:00+00:00,2200.0,2023-12-31 00:00:00+00:00,Toronto,Toronto,Toronto C01,Neighborhood A,...,1,1,Apartment,1,5.0,0,500-599,Condo Apt,0,
1,CondoProperty,Lease,2400.0,2023-12-02 00:00:00+00:00,2400.0,2023-12-31 00:00:00+00:00,Toronto,Toronto,Toronto C02,Neighborhood B,...,2,2,2-Storey,1,4.0,1,700-799,Detached,1,
2,ResidentialProperty,Sale,850000.0,2023-11-05 00:00:00+00:00,820000.0,2023-12-31 00:00:00+00:00,Toronto,Toronto,Toronto C03,Neighborhood C,...,3,3,Bungalow,1,8.0,4,600-699,Comm Element Condo,1,
3,CondoProperty,Lease,2700.0,2023-10-15 00:00:00+00:00,2600.0,2023-12-31 00:00:00+00:00,Toronto,Toronto,Toronto C04,Neighborhood D,...,1,1,Apartment,1,4.0,2,0-499,Condo Apt,0,
4,ResidentialProperty,Sale,,2023-10-20 00:00:00+00:00,570000.0,2023-12-31 00:00:00+00:00,Toronto,Toronto,Toronto C05,Neighborhood E,...,2,2,Apartment,1,5.0,3,,Detached,1,


In [22]:
from real_estate_predictor.processing.processor import DataCleaner

In [23]:
cleaner = DataCleaner(mock_data)
cleaner.handle_missing_values(strategy="columns", columns=["numDrivewaySpaces"])
# the 'numDrivewaySpaces' column, which contained missing values, has been dropped
cleaner.df.head()

Unnamed: 0,class,type,listPrice,listDate,soldPrice,soldDate,city,area,district,neighborhood,...,longitude,numBathrooms,numBedrooms,style,numKitchens,numRooms,numParkingSpaces,sqft,propertyType,numGarageSpaces
0,CondoProperty,Lease,2200.0,2023-12-01 00:00:00+00:00,2200.0,2023-12-31 00:00:00+00:00,Toronto,Toronto,Toronto C01,Neighborhood A,...,-79.4,1,1,Apartment,1,5.0,0,500-599,Condo Apt,0
1,CondoProperty,Lease,2400.0,2023-12-02 00:00:00+00:00,2400.0,2023-12-31 00:00:00+00:00,Toronto,Toronto,Toronto C02,Neighborhood B,...,-79.41,2,2,2-Storey,1,4.0,1,700-799,Detached,1
2,ResidentialProperty,Sale,850000.0,2023-11-05 00:00:00+00:00,820000.0,2023-12-31 00:00:00+00:00,Toronto,Toronto,Toronto C03,Neighborhood C,...,-79.42,3,3,Bungalow,1,8.0,4,600-699,Comm Element Condo,1
3,CondoProperty,Lease,2700.0,2023-10-15 00:00:00+00:00,2600.0,2023-12-31 00:00:00+00:00,Toronto,Toronto,Toronto C04,Neighborhood D,...,-79.43,1,1,Apartment,1,4.0,2,0-499,Condo Apt,0
4,ResidentialProperty,Sale,,2023-10-20 00:00:00+00:00,570000.0,2023-12-31 00:00:00+00:00,Toronto,Toronto,Toronto C05,Neighborhood E,...,-79.44,2,2,Apartment,1,5.0,3,,Detached,1


In [25]:
# keep only the rows whose listPrice and soldPrice are less than the threshold value
cleaner.filter_rows_by_threshold(columns=["listPrice", "soldPrice"], threshold=100000, strategy="lt")
cleaner.df.head()

Unnamed: 0,class,type,listPrice,listDate,soldPrice,soldDate,city,area,district,neighborhood,...,longitude,numBathrooms,numBedrooms,style,numKitchens,numRooms,numParkingSpaces,sqft,propertyType,numGarageSpaces
0,CondoProperty,Lease,2200.0,2023-12-01 00:00:00+00:00,2200.0,2023-12-31 00:00:00+00:00,Toronto,Toronto,Toronto C01,Neighborhood A,...,-79.4,1,1,Apartment,1,5.0,0,500-599,Condo Apt,0
1,CondoProperty,Lease,2400.0,2023-12-02 00:00:00+00:00,2400.0,2023-12-31 00:00:00+00:00,Toronto,Toronto,Toronto C02,Neighborhood B,...,-79.41,2,2,2-Storey,1,4.0,1,700-799,Detached,1
3,CondoProperty,Lease,2700.0,2023-10-15 00:00:00+00:00,2600.0,2023-12-31 00:00:00+00:00,Toronto,Toronto,Toronto C04,Neighborhood D,...,-79.43,1,1,Apartment,1,4.0,2,0-499,Condo Apt,0
7,CondoProperty,Lease,2900.0,2023-12-07 00:00:00+00:00,2800.0,2023-12-31 00:00:00+00:00,Toronto,Toronto,Toronto C08,Neighborhood H,...,-79.47,2,2,Bungalow,1,,3,500-599,Condo Apt,1
9,CondoProperty,Lease,3100.0,2023-12-08 00:00:00+00:00,3000.0,2023-12-31 00:00:00+00:00,Toronto,Toronto,Toronto C10,Neighborhood J,...,-79.49,2,3,Apartment,1,4.0,1,600-699,Condo Apt,0
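
For readers who want to see what these two cleaning steps amount to, here is a rough plain-pandas sketch. It is not the `DataCleaner` implementation; the behaviour is inferred from the outputs above, and the toy frame and variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame loosely modelled on the mock data (values are illustrative).
toy = pd.DataFrame(
    {
        "listPrice": [2200.0, 2400.0, 850000.0, np.nan],
        "soldPrice": [2200.0, 2400.0, 820000.0, 570000.0],
        "numDrivewaySpaces": [np.nan, np.nan, np.nan, np.nan],
    }
)

# Step 1 (assumed behaviour): drop the named columns if they contain missing values.
candidates = ["numDrivewaySpaces"]
to_drop = [col for col in candidates if toy[col].isna().any()]
without_na_cols = toy.drop(columns=to_drop)

# Step 2: keep rows where every price column is strictly below the threshold.
# Comparisons against NaN evaluate to False, so rows with a missing price are dropped too.
price_cols = ["listPrice", "soldPrice"]
below = (without_na_cols[price_cols] < 100_000).all(axis=1)
filtered = without_na_cols[below]

print(filtered)
```

In this sketch only the two lease-priced rows survive, which mirrors how the sale listings disappear from the output above.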


### Feature Engineering

The GTA datasets contain base columns from which other columns can be derived. For example, from the `sqft` column we can engineer a predictor that many agents and realtors use to gauge value: the price per square foot, i.e. `ppsqft`.

In [26]:
from real_estate_predictor.processing.processor import FeatureEngineering

In [29]:
feature = FeatureEngineering(cleaner.df)
feature.create_features_old(["sqft_avg", "ppsqft"])

In [30]:
feature.df.head()

Unnamed: 0,class,type,listPrice,listDate,soldPrice,soldDate,city,area,district,neighborhood,...,numBedrooms,style,numKitchens,numRooms,numParkingSpaces,sqft,propertyType,numGarageSpaces,sqft_avg,ppsqft
0,CondoProperty,Lease,2200.0,2023-12-01 00:00:00+00:00,2200.0,2023-12-31 00:00:00+00:00,Toronto,Toronto,Toronto C01,Neighborhood A,...,1,Apartment,1,5.0,0,500-599,Condo Apt,0,549.5,4.00364
1,CondoProperty,Lease,2400.0,2023-12-02 00:00:00+00:00,2400.0,2023-12-31 00:00:00+00:00,Toronto,Toronto,Toronto C02,Neighborhood B,...,2,2-Storey,1,4.0,1,700-799,Detached,1,749.5,3.202135
3,CondoProperty,Lease,2700.0,2023-10-15 00:00:00+00:00,2600.0,2023-12-31 00:00:00+00:00,Toronto,Toronto,Toronto C04,Neighborhood D,...,1,Apartment,1,4.0,2,0-499,Condo Apt,0,249.5,10.821643
7,CondoProperty,Lease,2900.0,2023-12-07 00:00:00+00:00,2800.0,2023-12-31 00:00:00+00:00,Toronto,Toronto,Toronto C08,Neighborhood H,...,2,Bungalow,1,,3,500-599,Condo Apt,1,549.5,5.277525
9,CondoProperty,Lease,3100.0,2023-12-08 00:00:00+00:00,3000.0,2023-12-31 00:00:00+00:00,Toronto,Toronto,Toronto C10,Neighborhood J,...,3,Apartment,1,4.0,1,600-699,Condo Apt,0,649.5,4.772902
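
The `sqft_avg` and `ppsqft` values in the output are consistent with taking the midpoint of the `sqft` range and dividing `listPrice` by it. The sketch below reproduces that arithmetic in plain pandas; it is inferred from the numbers shown, not taken from the library's source, and the toy frame is an assumption for illustration.

```python
import pandas as pd

# Three rows copied from the output above (listPrice and sqft only).
toy = pd.DataFrame(
    {
        "listPrice": [2200.0, 2400.0, 2700.0],
        "sqft": ["500-599", "700-799", "0-499"],
    }
)

# Split each "low-high" range into two numeric columns and take the midpoint.
bounds = toy["sqft"].str.split("-", expand=True).astype(float)
toy["sqft_avg"] = bounds.mean(axis=1)

# Price per square foot, using the list price as the numerator.
toy["ppsqft"] = toy["listPrice"] / toy["sqft_avg"]

print(toy)
```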


### Preprocessing

Before we can train our model, we need to convert the cleaned dataset into a format that the model accepts.

We can use the `Processor` class to help us.

In [31]:
from real_estate_predictor.tests.test_preprocessor import data as preprocessor_data

In [32]:
preprocessor_mock_data = pd.DataFrame(preprocessor_data)
preprocessor_mock_data.head()

Unnamed: 0,class,type,listPrice,listDate,soldPrice,soldDate,city,area,district,neighborhood,...,numBathrooms,numBedrooms,style,numKitchens,numRooms,numParkingSpaces,sqft,propertyType,numGarageSpaces,numDrivewaySpaces
0,CondoProperty,Lease,2200.0,2023-12-01 00:00:00+00:00,2200.0,2023-12-31 00:00:00+00:00,Toronto,Toronto,Toronto C01,Neighborhood A,...,1,1,Apartment,1,5.0,0,500-599,Condo Apt,0,
1,CondoProperty,Lease,2400.0,2023-12-02 00:00:00+00:00,2400.0,2023-12-31 00:00:00+00:00,Toronto,Toronto,Toronto C02,Neighborhood B,...,2,2,2-Storey,1,4.0,1,700-799,Detached,1,
2,ResidentialProperty,Sale,850000.0,2023-11-05 00:00:00+00:00,820000.0,2023-12-31 00:00:00+00:00,Toronto,Toronto,Toronto C03,Neighborhood C,...,3,3,Bungalow,1,8.0,4,600-699,Comm Element Condo,1,
3,CondoProperty,Lease,2700.0,2023-10-15 00:00:00+00:00,2600.0,2023-12-31 00:00:00+00:00,Toronto,Toronto,Toronto C04,Neighborhood D,...,1,1,Apartment,1,4.0,2,0-499,Condo Apt,0,
4,ResidentialProperty,Sale,,2023-10-20 00:00:00+00:00,570000.0,2023-12-31 00:00:00+00:00,Toronto,Toronto,Toronto C05,Neighborhood E,...,2,2,Apartment,1,5.0,3,,Detached,1,


First, let's set up the columns that we want to transform before feeding them into the model.

In [33]:
feature_transform_numerical1 = ["listPrice"]
feature_impute_encode_categorical1 = ["sqft"]

Now we can initialize our processor.

In [34]:
from real_estate_predictor.processing.processor import Processor
processor = Processor(preprocessor_mock_data)

In [35]:
processor.transform_numerical(
    strategy="default", columns=feature_transform_numerical1
)
processor.encode_categorical(
    columns=feature_impute_encode_categorical1, strategy="onehot"
)

The code above only assigns each column to a particular strategy; nothing has been transformed yet. Once we call the `apply_transformer` function, the `processor` will automatically create the necessary `sklearn` pipelines to transform the data.

In [36]:
processor.apply_transformer()

In [37]:
processor.fit(preprocessor_mock_data)

We can see the newly processed data below.

In [39]:
processed_data = processor.transform(preprocessor_mock_data)
processed_data.head()

Unnamed: 0,listPrice,sqft_0-499,sqft_500-599,sqft_600-699,sqft_700-799,sqft_800-899,sqft_nan,class,type,listDate,...,longitude,numBathrooms,numBedrooms,style,numKitchens,numRooms,numParkingSpaces,propertyType,numGarageSpaces,numDrivewaySpaces
0,-0.553172,0.0,1.0,0.0,0.0,0.0,0.0,CondoProperty,Lease,2023-12-01 00:00:00+00:00,...,-79.4,1,1,Apartment,1,5.0,0,Condo Apt,0,
1,-0.55294,0.0,0.0,0.0,1.0,0.0,0.0,CondoProperty,Lease,2023-12-02 00:00:00+00:00,...,-79.41,2,2,2-Storey,1,4.0,1,Detached,1,
2,0.429098,0.0,0.0,1.0,0.0,0.0,0.0,ResidentialProperty,Sale,2023-11-05 00:00:00+00:00,...,-79.42,3,3,Bungalow,1,8.0,4,Comm Element Condo,1,
3,-0.552593,1.0,0.0,0.0,0.0,0.0,0.0,CondoProperty,Lease,2023-10-15 00:00:00+00:00,...,-79.43,1,1,Apartment,1,4.0,2,Condo Apt,0,
4,,0.0,0.0,0.0,0.0,0.0,1.0,ResidentialProperty,Sale,2023-10-20 00:00:00+00:00,...,-79.44,2,2,Apartment,1,5.0,3,Detached,1,
