# Modelling

### PreProcessing

#### Data Preparation: Importing Libraries and Loading the Dataset

In [48]:
## Importing Libraries ##

import pandas as pd                    # For data manipulation and analysis
import matplotlib.pyplot as plt        # For data visualization
import seaborn as sns                  # For enhanced data visualization
from sklearn.model_selection import train_test_split  # For data splitting
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer  # For text feature extraction
import nltk                            # Natural Language Toolkit for text processing
from nltk.stem import WordNetLemmatizer  # Lemmatization for text data
from nltk.corpus import stopwords      # Stopword removal for text data
from nltk.stem import PorterStemmer     # Text stemming for text data
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler  # Data scaling
from sklearn.feature_selection import f_regression, SelectKBest, f_classif  # Feature selection
from sklearn.metrics import r2_score    # Evaluation metric for regression models
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV  # Hyperparameter tuning
import string                          # String manipulation functions
import transformers
from sklearn.decomposition import PCA  # Import PCA
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor 


In [2]:
## Loading the Dataset ##

file_path = '../../Datasets/cleaned_data.csv'
real_estate = pd.read_csv(file_path)

# A quick check by sampling 10 rows from the 'real_estate' DataFrame
real_estate = real_estate.sample(10000)

  exec(code_obj, self.user_global_ns, self.user_ns)


Checking the Initial Dataset:

In [3]:
# Dataset Overview

print(real_estate.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 665415 to 365264
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Unnamed: 0        10000 non-null  int64  
 1   Serial Number     10000 non-null  int64  
 2   List Year         10000 non-null  int64  
 3   Date Recorded     10000 non-null  object 
 4   Town              10000 non-null  object 
 5   Address           10000 non-null  object 
 6   Assessed Value    10000 non-null  float64
 7   Sale Amount       10000 non-null  float64
 8   Sales Ratio       10000 non-null  float64
 9   Property Type     6193 non-null   object 
 10  Residential Type  6137 non-null   object 
 11  Non Use Code      2425 non-null   float64
 12  Assessor Remarks  1535 non-null   object 
 13  OPM remarks       90 non-null     object 
 14  Location          1971 non-null   object 
 15  Full Address      9847 non-null   object 
 16  latitude          4907 non-null   

Dataset Overview:
- The dataset contains a substantial 997,193 entries across 20 columns.
- The columns encompass a variety of data types, such as integers, floats, and objects.
- It's important to note that several columns have missing values, as indicated by non-null counts.
- The total memory usage for this dataset is approximately 152.2 MB.

#### Streamlining the Dataset

Columns to be dropped:
- 'Unnamed: 0': unnecessary column from Geocoding
- 'Serial Number': not applicable in modelling
- 'Date Recorded': feature-engineered as month and year
- 'Sales Ratio': since it is a ratio between assessed and sales values, it would cause data-leakage
- 'Address' & 'Full Address': used in geocoding, not applicable in modelling
- 'Property Type': Redundant with "Residential Type"
- 'Location': feature-engineered as longitude and latitude

In [4]:
# Drop the following unnecessary columns

real_estate_nonull = real_estate.drop(columns=['Unnamed: 0', 'Serial Number', 'Date Recorded', 'Sales Ratio', 'Address', 'Full Address', 'Property Type', 'Location'], axis=1)

# Check the DataFrame's structure after removing the columns

print(real_estate_nonull.info())

# Remove rows with missing values in the 'latitude' and 'Residential Type' columns
real_estate_nonull = real_estate_nonull.dropna(subset=["latitude", "Residential Type"])

# Check the DataFrame's structure after dropping rows
print(real_estate_nonull.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 665415 to 365264
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   List Year         10000 non-null  int64  
 1   Town              10000 non-null  object 
 2   Assessed Value    10000 non-null  float64
 3   Sale Amount       10000 non-null  float64
 4   Residential Type  6137 non-null   object 
 5   Non Use Code      2425 non-null   float64
 6   Assessor Remarks  1535 non-null   object 
 7   OPM remarks       90 non-null     object 
 8   latitude          4907 non-null   float64
 9   longitude         4907 non-null   float64
 10  month_recorded    10000 non-null  float64
 11  year_recorded     10000 non-null  float64
dtypes: float64(7), int64(1), object(4)
memory usage: 1015.6+ KB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3281 entries, 542698 to 931958
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtyp

#### Filling Missing Text Data

Address missing values in text columns by inserting "na."

In [5]:
# Fill missing values in text columns with "na"

real_estate_nonull = real_estate_nonull.fillna("na")

Examine the 'real_estate_nonull' DataFrame to ensure that the columns are dropped and "na" is inserted in place of missing values.

In [6]:
# Sample 5 random rows from the DataFrame

sample_rows = real_estate_nonull.sample(5)
print(sample_rows)

        List Year       Town  Assessed Value  Sale Amount Residential Type  \
750015       2008  Waterbury        146790.0      90000.0     Three Family   
891853       2016     Berlin        121200.0     204000.0    Single Family   
735443       2007   Cheshire        213860.0     390000.0    Single Family   
496942       2018    Shelton        210280.0     329500.0            Condo   
510576       2019   Hartford        163065.0     582500.0    Single Family   

       Non Use Code Assessor Remarks OPM remarks   latitude  longitude  \
750015         14.0               na          na  41.546869 -73.046510   
891853           na   qualified sale          na -72.797350  41.590300   
735443           na               na          na -72.853030  41.535790   
496942           na               na          na  47.627007   2.177747   
510576           na               na          na  41.784468 -72.710912   

        month_recorded  year_recorded  
750015             9.0         2009.0  
891853

Missing Values Check:

In [7]:
# To confirm that all text columns have been filled with "na," check the DataFrame's info.

print(".info of All Columns\n", real_estate_nonull.info())
print("Null Values in Total\n", real_estate_nonull.isna().sum())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3281 entries, 542698 to 931958
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   List Year         3281 non-null   int64  
 1   Town              3281 non-null   object 
 2   Assessed Value    3281 non-null   float64
 3   Sale Amount       3281 non-null   float64
 4   Residential Type  3281 non-null   object 
 5   Non Use Code      3281 non-null   object 
 6   Assessor Remarks  3281 non-null   object 
 7   OPM remarks       3281 non-null   object 
 8   latitude          3281 non-null   float64
 9   longitude         3281 non-null   float64
 10  month_recorded    3281 non-null   float64
 11  year_recorded     3281 non-null   float64
dtypes: float64(6), int64(1), object(5)
memory usage: 333.2+ KB
.info of All Columns
 None
Null Values in Total
 List Year           0
Town                0
Assessed Value      0
Sale Amount         0
Residential Type    0
Non U

#### Data Preprocessing and Feature Engineering

##### 1. Feature Matrix and Target Extraction: create the feature matrix 'X' and extract the target variable 'y.'

In [8]:
# Create the feature matrix 'X' by dropping the 'Sale Amount' column
X = real_estate_nonull.drop('Sale Amount', axis=1)

# Extract the target variable 'y'
y = real_estate_nonull["Sale Amount"]

##### 2. One-hot Encoding Categorical Columns: perform one-hot encoding on categorical columns to prepare the data for modeling.
 

In [9]:
# Perform one-hot encoding on categorical columns in X
X_coded = pd.get_dummies(X, columns=['List Year', 'Town', 'Residential Type', 'Non Use Code'])

##### 3. Data Splitting into Training, Validation, and Test Sets: split into well-defined training, validation, and test sets for model evaluation and generalization


Split the data into well-defined training, validation, and test sets to ensure reliable model evaluation and generalization.


In [10]:
# Split the data into train, validation, and test sets
X_rem, X_test, y_rem, y_test = train_test_split(X_coded, y, test_size=0.3, random_state=22)
X_train, X_val, y_train, y_val = train_test_split(X_rem, y_rem, test_size=0.5, random_state=22)

4. Splitting Features and Printing Shapes (Text and Numerical Data Separation): the division allows for specialized preprocessing of different feature types.

Split the features into text and numerical components for the training set.

In [11]:
X_train_Assessor_Remarks = X_train['Assessor Remarks']
X_train_OPM_remarks = X_train['OPM remarks']
X_train_numerical = X_train[["Assessed Value", "latitude", "longitude"]]
X_train_coded = X_train.drop(["Assessor Remarks", "OPM remarks", "Assessed Value", "latitude", "longitude"], axis=1)

Repeat the above steps for the validation set.

In [12]:
X_val_Assessor_Remarks = X_val['Assessor Remarks']
X_val_OPM_remarks = X_val['OPM remarks']
X_val_numerical = X_val[["Assessed Value", "latitude", "longitude"]]
X_val_coded = X_val.drop(["Assessor Remarks", "OPM remarks", "Assessed Value", "latitude", "longitude"], axis=1)

Define a function to split features into text, numerical, and coded components and print their shapes

In [13]:
def split_features(dataset, prefix, text_features, numerical_features):
    text_data = dataset[text_features]
    numerical_data = dataset[numerical_features]
    coded_data = dataset.drop(text_features + numerical_features, axis=1)
    # Extract text and numerical data
    print(f"{prefix}_Assessor_Remarks.shape: {text_data.shape}")
    print(f"{prefix}_OPM_remarks.shape: {text_data.shape}")
    print(f"{prefix}_numerical.shape: {numerical_data.shape}")
    print(f"{prefix}_coded.shape: {coded_data.shape}")

Define the text and numerical features.

In [14]:
text_features = ["Assessor Remarks", "OPM remarks"]
numerical_features = ["Assessed Value", "latitude", "longitude"]

Call the function for both training and validation sets.

In [15]:
split_features(X_train, 'train', text_features, numerical_features)
split_features(X_val, 'val', text_features, numerical_features)

train_Assessor_Remarks.shape: (1148, 2)
train_OPM_remarks.shape: (1148, 2)
train_numerical.shape: (1148, 3)
train_coded.shape: (1148, 213)
val_Assessor_Remarks.shape: (1148, 2)
val_OPM_remarks.shape: (1148, 2)
val_numerical.shape: (1148, 3)
val_coded.shape: (1148, 213)


Checking the shape of each dataframe is important for validation and quality control:
- It helps ensure that the data splitting process has been executed correctly.- It allows us to verify that the data dimensions align with our expectations.
- It allows us to verify that the data dimensions align with our expectations.
- Identifying unexpected shapes can be an early indicator of data issues or errors.

In [16]:
# Check the shape of each dataframe:
print(f"X_train text shape: {X_train[text_features].shape}")
print(f"X_val text shape: {X_val[text_features].shape}")
print(f"X_train numerical shape: {X_train[numerical_features].shape}")
print(f"X_val numerical shape: {X_val[numerical_features].shape}")
print(f"X_train_coded shape: {X_train_coded.shape}")
print(f"X_val_coded shape: {X_val_coded.shape}")

X_train text shape: (1148, 2)
X_val text shape: (1148, 2)
X_train numerical shape: (1148, 3)
X_val numerical shape: (1148, 3)
X_train_coded shape: (1148, 213)
X_val_coded shape: (1148, 213)


#### Feature Engineering for Natural Language Processing

In [17]:
# Feature Engineering for Natural Language Processing

## 1. Create Tokenizer
# Download necessary NLTK data
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

# Define a custom tokenizer function
from nltk.stem import PorterStemmer
from transformers import BertTokenizer, BertModel
import torch

def custom_tokenizer(s):
    # Remove punctuation
    for char in s:
        if char in string.punctuation:
            s = s.replace(char, '')

    # Make the string lowercase
    s = s.lower()
    
    # Split the string at each space
    tokens = s.split()

    # Create a set of NLTK stop words for faster lookup
    stop_words = set(stopwords.words('english'))

    # Stem tokens and filter out stop words
    stemmer = PorterStemmer()
    stemmed_tokens = [stemmer.stem(token) for token in tokens if token not in stop_words]

    return stemmed_tokens
    
# Initialize a BERT tokenizer
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def custom_tokenizer_bert(s, max_length=64):  # Add 'max_length' parameter
    # Tokenize the input text using the BERT tokenizer
    tokens = bert_tokenizer(s, padding='max_length', truncation=True, max_length=max_length, return_tensors='pt')

    # Convert tokens to numerical IDs
    input_ids = tokens['input_ids']
    attention_mask = tokens['attention_mask']

    return input_ids, attention_mask  # Return both input_ids and attention_mask

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/haesunjung/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/haesunjung/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/haesunjung/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [18]:
## 2. Instantiate CountVectorizer, TfidfVectorizer, and BERT Model
# Initialize CountVectorizer, TfidfVectorizer, and BERT model
count_vectorizer = CountVectorizer(
    tokenizer=custom_tokenizer,
    stop_words='english',
    min_df=10,
    max_features=1000
)

tfidf_vectorizer = TfidfVectorizer(
    tokenizer=custom_tokenizer,
    stop_words='english',
    min_df=10,
    max_features=1000
)

# Load a pre-trained BERT model
bert_model = BertModel.from_pretrained("bert-base-uncased")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [19]:
## 3. Vectorize & Fit the Values
# Vectorize text data

# For Assessor Remarks (using CountVectorizer)
X_train_Assessor_Remarks_vectorized = count_vectorizer.fit_transform(X_train_Assessor_Remarks.values)
X_val_Assessor_Remarks_vectorized = count_vectorizer.transform(X_val_Assessor_Remarks.values)


X_train_Assessor_Remarks_df = pd.DataFrame(X_train_Assessor_Remarks_vectorized.toarray(), columns=count_vectorizer.get_feature_names_out())
X_val_Assessor_Remarks_df = pd.DataFrame(X_val_Assessor_Remarks_vectorized.toarray(), columns=count_vectorizer.get_feature_names_out())

# For OPM Remarks (using BERT)
# Tokenize and encode OPM Remarks using BERT
max_length = 32  # Define the maximum sequence length
X_train_OPM_remarks_tokenized = X_train_OPM_remarks.apply(lambda x: custom_tokenizer_bert(x, max_length))
X_val_OPM_remarks_tokenized = X_val_OPM_remarks.apply(lambda x: custom_tokenizer_bert(x, max_length))

  % sorted(inconsistent)


In [20]:
# Convert tokenized data to tensors
X_train_OPM_remarks_input = torch.stack([item[0] for item in X_train_OPM_remarks_tokenized])
X_train_OPM_remarks_mask = torch.stack([item[1] for item in X_train_OPM_remarks_tokenized])
X_val_OPM_remarks_input = torch.stack([item[0] for item in X_val_OPM_remarks_tokenized])
X_val_OPM_remarks_mask = torch.stack([item[1] for item in X_val_OPM_remarks_tokenized])

In [21]:
print("X_train_OPM_remarks_input shape:", X_train_OPM_remarks_input.shape)
print("X_train_OPM_remarks_mask shape:", X_train_OPM_remarks_mask.shape)
print("X_train_OPM_remarks_input data type:", X_train_OPM_remarks_input.dtype)
print("X_train_OPM_remarks_mask data type:", X_train_OPM_remarks_mask.dtype)

# Remove the extra dimension with .squeeze(1)
X_train_OPM_remarks_input = X_train_OPM_remarks_input.squeeze(1)
X_train_OPM_remarks_mask = X_train_OPM_remarks_mask.squeeze(1)

X_train_OPM_remarks_input shape: torch.Size([1148, 1, 32])
X_train_OPM_remarks_mask shape: torch.Size([1148, 1, 32])
X_train_OPM_remarks_input data type: torch.int64
X_train_OPM_remarks_mask data type: torch.int64


In [22]:
from torch.utils.data import TensorDataset
from torch.utils.data import DataLoader


# Assuming you have the tensors 'X_train_OPM_remarks_input' and 'X_train_OPM_remarks_mask' defined
# You can create a dataset like this:
dataset = TensorDataset(X_train_OPM_remarks_input, X_train_OPM_remarks_mask)

# Then, you can use the DataLoader as shown in your code:
batch_size = 16
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

# Now, the 'dataset' variable is defined and can be used in the DataLoader.

In [23]:
# Extract embeddings from the model's output
X_train_OPM_remarks_embeddings = []

# Iterate through the data in batches
for batch in dataloader:
    input_ids, attention_mask = batch
    
    # Process the batch with BERT
    batch_outputs = bert_model(input_ids=input_ids, attention_mask=attention_mask)
    
    # Extract embeddings for this batch
    batch_embeddings = batch_outputs.last_hidden_state
    
    # Append the batch embeddings to the list
    X_train_OPM_remarks_embeddings.append(batch_embeddings)


In [24]:
# Concatenate the batch embeddings to obtain the embeddings for the entire dataset
X_train_OPM_remarks_embeddings = torch.cat(X_train_OPM_remarks_embeddings, dim=0)

# Calculate the mean of embeddings along the sequence length
X_train_OPM_remarks_vectorized = X_train_OPM_remarks_embeddings.mean(dim=1).detach().numpy()


In [25]:
from torch.utils.data import TensorDataset
from torch.utils.data import DataLoader


# Assuming you have the tensors 'X_train_OPM_remarks_input' and 'X_train_OPM_remarks_mask' defined
# You can create a dataset like this:
val_dataset = TensorDataset(X_val_OPM_remarks_input, X_val_OPM_remarks_mask)

# Then, you can use the DataLoader as shown in your code:
batch_size = 8
val_dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)



In [26]:

# Now, the 'dataset' variable is defined and can be used in the DataLoader.

# Assuming you have a DataLoader for validation data named 'val_dataloader'
X_val_OPM_remarks_embeddings = []

# Iterate through the data in batches for validation data
for batch in val_dataloader:
    input_ids, attention_mask = batch
    
    # Process the batch with BERT for validation data
    batch_outputs = bert_model(input_ids=input_ids, attention_mask=attention_mask)
    
    # Extract embeddings for this batch
    batch_embeddings = batch_outputs.last_hidden_state
    
    # Append the batch embeddings to the list for validation data
    X_val_OPM_remarks_embeddings.append(batch_embeddings)

In [27]:

# Concatenate the batch embeddings for the entire validation dataset
X_val_OPM_remarks_embeddings = torch.cat(X_val_OPM_remarks_embeddings, dim=0)

# Calculate the mean of embeddings along the sequence length for validation data
X_val_OPM_remarks_vectorized = X_val_OPM_remarks_embeddings.mean(dim=1).detach().numpy()

In [28]:
import pandas as pd

# Assuming X_val_OPM_remarks_vectorized is a numpy array
data = X_val_OPM_remarks_vectorized  # Replace with your actual data

# Create a DataFrame with columns if you have them
# For example, you can create columns like 'feature_1', 'feature_2', 'feature_3', etc.
column_names = [f'feature_{i}' for i in range(X_train_OPM_remarks_vectorized.shape[1])]

X_train_OPM_remarks_df = pd.DataFrame(data, columns=column_names)

# Now, df is a DataFrame containing your vectorized data


In [29]:
import pandas as pd

# Assuming X_val_OPM_remarks_vectorized is a numpy array
data = X_val_OPM_remarks_vectorized  # Replace with your actual data

# Create a DataFrame with columns if you have them
# For example, you can create columns like 'feature_1', 'feature_2', 'feature_3', etc.
column_names = [f'feature_{i}' for i in range(X_val_OPM_remarks_vectorized.shape[1])]

X_val_OPM_remarks_df = pd.DataFrame(data, columns=column_names)

# Now, df is a DataFrame containing your vectorized data

In [30]:
## 4. Scaling Numerical Data (Scaling is important for various machine learning algorithms)
# For scaling numerical data, you can use StandardScaler, MinMaxScaler, or RobustScaler.

# Example using StandardScaler:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_numerical_scaled = scaler.fit_transform(X_train_numerical)
X_val_numerical_scaled = scaler.transform(X_val_numerical)

# Scaling numerical data is essential in machine learning for reasons like maintaining consistent units and preventing features with large values from dominating the model.

# Additional documentation and model training steps are typically required after this feature engineering process.

##### 1. Create Tokenizer

Instantiate the count_vectorizer and tf-idf_vectorizer.

Vectorize & Fit the values:

Scaling numerical data is essential in machine learning for the following reasons:

1. To ensure that all features have the same units, preventing features with larger values from dominating the model.
2. Speed up optimization algorithms like gradient descent.
3. Regularization is sensitive to scales.
4. For distance-based algorithms.

##### Data Integration and Scaling

In [31]:
# Scale numerical data
my_minmax_scaler = MinMaxScaler()
X_train_numerical_scaled = my_minmax_scaler.fit_transform(X_train_numerical)
X_val_numerical_scaled = my_minmax_scaler.transform(X_val_numerical)

# Convert scaled numerical data to DataFrames
X_train_numerical_scaled_df = pd.DataFrame(X_train_numerical_scaled, columns=X_train_numerical.columns)
X_val_numerical_scaled_df = pd.DataFrame(X_val_numerical_scaled, columns=X_val_numerical.columns)

Prepare the dataframes for concatenation by resetting their indices.

In [32]:
# Reset the indices for all DataFrames
X_train_Assessor_Remarks_df = X_train_Assessor_Remarks_df.reset_index(drop=True)
X_train_OPM_remarks_df = X_train_OPM_remarks_df.reset_index(drop=True)
X_train_numerical_scaled_df = X_train_numerical_scaled_df.reset_index(drop=True)
X_train_coded = X_train_coded.reset_index(drop=True)

X_val_Assessor_Remarks_df = X_val_Assessor_Remarks_df.reset_index(drop=True)
X_val_OPM_remarks_df = X_val_OPM_remarks_df.reset_index(drop=True)
X_val_numerical_scaled_df = X_val_numerical_scaled_df.reset_index(drop=True)
X_val_coded = X_val_coded.reset_index(drop=True)

Combining Data Components for Training and Validation Sets: Merging Textual, Numerical, and Dummy-Coded Categorical Data:

In [33]:
# Concatenate data for training and validation sets
X_train_remarks_merged = pd.concat([X_train_Assessor_Remarks_df, X_train_OPM_remarks_df], axis=1)
X_train_all_merged = pd.concat([X_train_remarks_merged, X_train_numerical_scaled_df, X_train_coded], axis=1)

X_val_remarks_merged = pd.concat([X_val_Assessor_Remarks_df, X_val_OPM_remarks_df], axis=1)
X_val_all_merged = pd.concat([X_val_remarks_merged, X_val_numerical_scaled_df, X_val_coded], axis=1)

Print the shape of the final merged data:

In [34]:
# Print the shapes of the merged data
print("Shape after merging all components (train set):", X_train_all_merged.shape)
print("Shape after merging all components (val set):", X_val_all_merged.shape)

Shape after merging all components (train set): (1148, 1005)
Shape after merging all components (val set): (1148, 1005)


Calculate the total count of missing values in the merged training and validation datasets.

In [35]:
# Check for missing values
train_missing_values = X_train_all_merged.isna().sum().sum()
val_missing_values = X_val_all_merged.isna().sum().sum()

# Display the total count of missing values in both datasets
print(f"Total missing values in X_train_all_merged: {train_missing_values}")
print(f"Total missing values in X_val_all_merged: {val_missing_values}")

Total missing values in X_train_all_merged: 0
Total missing values in X_val_all_merged: 0


Perform a sanity check by examining data types since there are numerous columns, and .info() is not used.

In [36]:
# Check the data types in the merged training dataset
data_types_train = X_train_all_merged.select_dtypes(include=['object'])
print("Data Types in Merged Training Dataset:\n", data_types_train)

# Check the data types in the merged validation dataset
data_types_val = X_val_all_merged.select_dtypes(include=['object'])
print("\nData Types in Merged Validation Dataset:\n", data_types_val)

# Check for missing values in the merged validation dataset
val_missing_values = X_val_all_merged.isna().sum().sum()
print(f"\nTotal missing values in Merged Validation Dataset: {val_missing_values}")

Data Types in Merged Training Dataset:
 Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...]

[1148 rows x 0 columns]

Data Types in Merged Validation Dataset:
 Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...]

[1148 rows x 0 columns]

Tota

In [37]:
import numpy as np

# Check for NaN or infinity values in X_val_all_merged
problematic_rows = np.isnan(X_val_all_merged).any(axis=1) | np.isinf(X_val_all_merged).any(axis=1)

# Get the indices of the rows with problematic values
problematic_row_indices = np.where(problematic_rows)[0]

# Display the indices of the problematic rows
print("Indices of problematic rows:", problematic_row_indices)

Indices of problematic rows: []


### Lasso Baseline Model & Random Search

In [44]:
# Baseline Lasso Model
from sklearn.linear_model import Lasso

# Create a Lasso model with alpha (regularization strength) set to 1.0
lasso_model = Lasso(alpha=1.0)

# Fit the Lasso model to the training data
lasso_model.fit(X_train_all_merged, y_train)

# Calculate the R-squared (R2) score for the training data
train_score = lasso_model.score(X_train_all_merged, y_train)
print(f"R-squared (R2) score on the training data: {train_score}")

# Calculate the R-squared (R2) score for the validation data
val_score = lasso_model.score(X_val_all_merged, y_val)
print(f"R-squared (R2) score on the validation data: {val_score}")

R-squared (R2) score on the training data: 0.5709270591499089
R-squared (R2) score on the validation data: 0.3624268519128706


  coef_, l1_reg, l2_reg, X, y, max_iter, tol, rng, random, positive


Estimator + random search for Lasso:

In [45]:
# Define a pipeline with multiple components: scaling, dimensionality reduction, and regression model
estimators = [
    ('scaling', StandardScaler()),  # Scale features
    ('reduce_dim', PCA()),          # Reduce dimensionality using PCA
    ('model', LinearRegression())   # Use Linear Regression as the base model
]

pipe = Pipeline(estimators)

# Define the hyperparameter search space
param_dist = [
    {
        'scaling': [StandardScaler(), RobustScaler(), MinMaxScaler()],  # Scaling options
        'reduce_dim': [PCA()],          # Use PCA for dimensionality reduction
        'reduce_dim__n_components': [1, 100, 10],  # Number of PCA components
        'model': [Lasso()],             # Lasso regression as the model
        'model__alpha': [0.1, 1, 10, 100]  # Alpha values for Lasso
    }
]


# Create a Randomized Search CV object with 10 iterations, 5-fold cross-validation
random_search = RandomizedSearchCV(
    estimator=pipe,
    param_distributions=param_dist,
    n_iter=10, 
    cv=5,
    verbose=2,
    n_jobs=-1
)

# Fit the random search to the training data to find the best hyperparameters
random_search.fit(X_train_all_merged, y_train)

# Get the best estimator with optimized hyperparameters
best_estimator = random_search.best_estimator_

# Use the best estimator to make predictions on the training and validation data
y_pred_train = best_estimator.predict(X_train_all_merged)
y_pred_val = best_estimator.predict(X_val_all_merged)

# Calculate the R-squared (R2) scores for training and validation data
r2_train = r2_score(y_train, y_pred_train)
r2_val = r2_score(y_val, y_pred_val)

print(f"R-squared (R2) score on the training data: {r2_train}")
print(f"R-squared (R2) score on the validation data: {r2_val}")

# Get the best parameters and best estimator
print("Best Parameters:", random_search.best_params_)
print("Best Estimator:", best_estimator)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
R-squared (R2) score on the training data: 0.4324375063562932
R-squared (R2) score on the validation data: 0.18754913702653375
Best Parameters: {'scaling': RobustScaler(), 'reduce_dim__n_components': 1, 'reduce_dim': PCA(n_components=1), 'model__alpha': 100, 'model': Lasso(alpha=100)}
Best Estimator: Pipeline(steps=[('scaling', RobustScaler()),
                ('reduce_dim', PCA(n_components=1)),
                ('model', Lasso(alpha=100))])


### Decision Tree Regressor Baseline Mode:

In [46]:
from sklearn.tree import DecisionTreeRegressor

# Create a DecisionTreeRegressor model with specified hyperparameters
# Hyperparameters: max_depth, min_samples_split, min_samples_leaf, and random_state
DT_regressor_model = DecisionTreeRegressor(
    max_depth=20,
    min_samples_split=100,
    min_samples_leaf=100,
    random_state=22
)

# Fit the DecisionTreeRegressor model to the training data
DT_regressor_model.fit(X_train_all_merged, y_train)

# Make predictions on the training data
y_pred_train = DT_regressor_model.predict(X_train_all_merged)

# Calculate the R-squared (R2) score for the training data
DT_train_score = DT_regressor_model.score(X_train_all_merged, y_train)

print(f"R-squared (R2) Score for Training Data: {DT_train_score}")

# Calculate the R-squared (R2) score for the validation data
DT_val_score = DT_regressor_model.score(X_val_all_merged, y_val)
print(f"R-squared (R2) Score for Validation Data: {DT_val_score}")

R-squared (R2) Score for Training Data: 0.3062033853154149
R-squared (R2) Score for Validation Data: 0.17132077786390998


### KNN baseline model & Random Search Model:

In [49]:
# Create a KNeighborsRegressor instance with specified hyperparameters
# Hyperparameters: n_neighbors, weights, and leaf_size
KNN_regressor_model = KNeighborsRegressor(
    n_neighbors=30,
    weights="distance",
    leaf_size=100
)

# Fit the KNeighborsRegressor model to the training data
KNN_regressor_model.fit(X_train_all_merged, y_train)

# Make predictions on the training data
y_pred_train = KNN_regressor_model.predict(X_train_all_merged)

# Calculate the R-squared (R2) score for the training data
KNN_train_score = KNN_regressor_model.score(X_train_all_merged, y_train)

print(f"R-squared (R2) Score for Training Data: {KNN_train_score}")

# Calculate the R-squared (R2) score for the validation data
KNN_val_score = KNN_regressor_model.score(X_val_all_merged, y_val)
print(f"R-squared (R2) Score for Validation Data: {KNN_val_score}")

R-squared (R2) Score for Training Data: 0.999999999998265
R-squared (R2) Score for Validation Data: 0.6557134839536438


In [51]:
# Create an estimator using a pipeline with data preprocessing steps (scaling, dimension reduction) and KNeighborsRegressor as the model.
estimators = [
    ('scaling', StandardScaler()),
    ('reduce_dim', PCA()),
    ('model', KNeighborsRegressor())
]

pipe = Pipeline(estimators)

# Define the hyperparameter search space
param_dist = [
    {
        'scaling': [StandardScaler(), RobustScaler(), MinMaxScaler()],
        'reduce_dim': [PCA()],
        'reduce_dim__n_components': [1, 100, 10],
        'model': [KNeighborsRegressor()],
        'model__n_neighbors': [3, 10, 30, 50],
        'model__weights': ['uniform', 'distance'],
        'model__leaf_size': [10, 50, 100],
        'model__n_jobs': [-1]
    }
]

# Perform a randomized search with cross-validation
random_search = RandomizedSearchCV(
    estimator=pipe,
    param_distributions=param_dist,
    n_iter=10,
    cv=5,
    verbose=2,
    n_jobs=-1
)

# Fit the grid search to the training data
random_search.fit(X_train_all_merged, y_train)

# Get the best estimator
best_estimator = random_search.best_estimator_

# Make predictions on the training data and validation data using the best estimator
y_pred_train_KNN_estimator = best_estimator.predict(X_train_all_merged)
y_pred_val_KNN_estimator = best_estimator.predict(X_val_all_merged)

# Calculate the R-squared (R2) score for both training and validation data
r2_train_KNN_estimator = r2_score(y_train, y_pred_train_KNN_estimator)
r2_val_KNN_estimator = r2_score(y_val, y_pred_val_KNN_estimator)


Fitting 5 folds for each of 10 candidates, totalling 50 fits


In [52]:

print(f"R-squared (R2) score on the training data: {r2_train}")
print(f"R-squared (R2) score on the validation data: {r2_val}")

# Get the best parameters and best estimator
print("Best Parameters:", random_search.best_params_)
print("Best Estimator:", best_estimator)

R-squared (R2) score on the training data: 0.4324375063562932
R-squared (R2) score on the validation data: 0.18754913702653375
Best Parameters: {'scaling': RobustScaler(), 'reduce_dim__n_components': 1, 'reduce_dim': PCA(n_components=1), 'model__weights': 'distance', 'model__n_neighbors': 10, 'model__n_jobs': -1, 'model__leaf_size': 10, 'model': KNeighborsRegressor(leaf_size=10, n_jobs=-1, n_neighbors=10, weights='distance')}
Best Estimator: Pipeline(steps=[('scaling', RobustScaler()),
                ('reduce_dim', PCA(n_components=1)),
                ('model',
                 KNeighborsRegressor(leaf_size=10, n_jobs=-1, n_neighbors=10,
                                     weights='distance'))])
[CV] END model=Lasso(), model__alpha=0.1, reduce_dim=PCA(), reduce_dim__n_components=1, scaling=StandardScaler(); total time=   0.3s
[CV] END model=Lasso(), model__alpha=0.1, reduce_dim=PCA(), reduce_dim__n_components=1, scaling=MinMaxScaler(); total time=   0.4s
[CV] END model=Lasso(), mode

### Conclusion

The analysis reveals that none of the models achieved a satisfactory level of performance in capturing the underlying patterns within the variance of the data, whether applied to the training dataset or the validation dataset. This outcome is particularly noteworthy, given that one of the model features was "Assessed Value," which, in principle, should exhibit a correlation with the sale price. Had any of the models exhibited significant R-squared (R^2) values, the subsequent steps would have entailed further refinement and optimization, possibly including ensemble techniques such as model bagging.