## DS555.V: Data Science and Business Strategy

### Introduction

**Author:** Utku Acar  
**Department:** Computer Science  
**Role:** Software Engineer / Researcher  

**My Secret Power:** I possess a remarkable ability to efficiently harness prompts and tame the wildest multi-dimensional data, particularly in the captivating realms of Images and Videos. My expertise lies in the magical world of Deep Learning and Computer Vision.

**My Crime-Fighting Identity:** As the illustrious Hyperion Solitude, I embark on an exhilarating journey where the realms of data science merge with supercharged strategies. 🚀🔍📊🦸‍♂️  

Welcome to the enthralling odyssey that lies ahead! Prepare to witness the fusion of data science and business strategy in ways that push the boundaries of exploration and innovation. Join me in this adventure of a lifetime as I unveil insights, strategies, and solutions that spark positive change.

Stay tuned for a remarkable journey that's about to unfold at [https://github.com/hyperionsolitude](https://github.com/hyperionsolitude). Together, we'll explore uncharted territories, conquer challenges, and celebrate the triumphs of data-driven exploration. 🌟🎉

# Accurate Housing Price Prediction with Data Cleaning and Manipulation Detection

In this notebook, we're about to embark on a fascinating journey into the realm of accurate housing price prediction. Our mission revolves around overcoming the challenges posed by manipulated housing data and varying pricing practices. Through a blend of sophisticated data cleaning techniques and insightful manipulation detection, we'll develop a powerful predictive model that can provide reliable housing price estimates.

## The Data

Our dataset, named "KonutGame.csv," unveils the intricate world of housing prices based on title deeds. However, the path to accuracy is fraught with obstacles. Taxes are determined by declared prices, and discrepancies can arise due to strategic pricing tactics and the influence of housing loans. Moreover, we're well aware that some individuals might sidestep the 90% housing loan regulation, further complicating the landscape.

Our objective is to engineer a methodology that accurately calculates housing prices, taking into account potential manipulations and outliers, all while adhering to the regulatory guidelines.

## Cleaning and Manipulation Detection

Our approach to this challenge will encompass the following key phases:

1. **Data Loading and Initial Exploration:** We will start by loading the "KonutGame.csv" dataset and delving into its structure and contents. An initial exploration will provide insights into the features that shape housing prices.

2. **Data Cleaning and Preprocessing:** We'll embark on a comprehensive data cleaning journey. This includes handling missing values, addressing outliers, and ensuring data consistency to build a robust foundation for accurate predictions.

3. **Manipulation Detection:** Armed with domain knowledge and analytical tools, we will unveil potential manipulations and irregularities in the housing data. By detecting instances of non-compliance with regulations and loan-pricing tactics, we'll lay the groundwork for more precise predictions.

4. **Predictive Model Development:** With cleaned and validated data in hand, we'll design a predictive model that captures the nuances of housing price dynamics. Our model will account for the factors affecting prices while accommodating variations and regulatory influences.

5. **Performance Evaluation and Insights:** We'll rigorously evaluate the performance of our predictive model using appropriate metrics. Through insights gained from the model's predictions, we'll gain a deeper understanding of the factors that most significantly impact housing prices.

6. **Regulation-Aware Pricing:** Leveraging the power of our model, we'll embark on a journey to estimate accurate housing prices that adhere to regulations and detect instances where regulations are potentially circumvented.

Our expedition promises a thrilling fusion of data science, domain expertise, and regulatory insights. As we explore the uncharted territories of housing price prediction, we invite you to join us in this captivating adventure of unraveling manipulation, enhancing accuracy, and decoding the intricate world of housing economics.

# Step 1: Data Preprocessing

## Loading and Handling Missing Values

In this initial step, we begin by loading the dataset from the "KonutGame.csv" file using the Pandas library. The loaded data contains information about housing prices based on title deeds. By using the `read_csv` function, we create a DataFrame called `data`.

The first line of code, `data = pd.read_csv("KonutGame.csv")`, loads the data into the DataFrame. The provided output `(17178, 9)` indicates that the dataset initially has 17,178 rows and 9 columns.

We then proceed to handle any missing values in the dataset. Missing values can potentially impact the quality of our analysis and predictions. The `dropna` function is used with the `inplace=True` parameter to remove rows with missing values from the DataFrame. This step is essential for data integrity and accurate results.

The second output, again `(17178, 9)`, shows that `dropna` removed no rows: the dataset contains no missing values.


By performing these preprocessing steps, we set the foundation for further analysis, modeling, and insights from the housing price data.


In [511]:
# Step 1: Data Preprocessing
import pandas as pd

# Load the dataset
data = pd.read_csv("KonutGame.csv")

print(data.shape)

# Handle missing values
data.dropna(inplace=True)

print(data.shape)

(17178, 9)
(17178, 9)
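
Before dropping rows, it can help to see how many values are missing in each column. The following is a minimal sketch on a small made-up DataFrame; the column names and values in `toy` are illustrative assumptions, not rows from "KonutGame.csv".

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real file; all values are invented.
toy = pd.DataFrame({
    "size": [130, 90, np.nan, 80],
    "value": [380000, 435000, 420000, np.nan],
})

# Count missing entries per column before deciding how to handle them.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() without inplace returns a new DataFrame and leaves `toy` untouched.
cleaned = toy.dropna()
print(toy.shape, "->", cleaned.shape)
```

If every count is zero, as the unchanged shape suggests for this dataset, `dropna` is a no-op.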


# Data Cleaning and Preprocessing

In this code snippet, we clean and preprocess the housing dataset using the Pandas library. Two DataFrames are created from `data`: `df_org` keeps the original values for later comparison, while `df` holds the cleaned data.

## Cleaning "building_age" and "total_floor_count"

We begin by defining functions to clean the "building_age" and "total_floor_count" columns. The `clean_age` function converts each "building_age" entry to an integer. A range written as "A-B arası" (between A and B) becomes the integer midpoint `(A + B) // 2`, an open-ended value written as "N ve üzeri" (N and above) becomes its lower limit N, and any other entry is cast with `int`.

The `clean_floor_count` function does the same for the "total_floor_count" column and additionally maps "Bahçe katı" (garden floor) to 1.

## Cleaning "floor_no"

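The code then cleans the "floor_no" column, which holds the floor of each housing unit. Because some labels only make sense relative to the height of the building, `clean_floor_no` takes both "floor_no" and the already cleaned "total_floor_count". Entries containing "Bahçe" (garden), "Giriş" (entrance), or "Zemin" (ground) map to 1; "Bodrum" (basement) maps to 0; "Müstakil" (detached), "Üst" (top), "Teras" (terrace), and "Çatı" (roof) map to the building's total floor count; "Kot" entries return the number that follows the word; "ve üzeri" entries return their lower limit; anything else is read as the integer before any "+" sign.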

## Applying Cleaning Functions

The cleaning functions are applied with the `apply` method. For "building_age" and "total_floor_count," each function is applied directly to its column. Because `clean_floor_no` needs two columns, it is applied row-wise to the whole DataFrame through a lambda function with the `axis=1` parameter.

As a result of this cleaning process, the DataFrame `df` is transformed to contain cleaned and standardized values for "building_age," "total_floor_count," and "floor_no." These preprocessing steps lay the foundation for subsequent analysis and modeling tasks on the housing dataset.

Stay tuned as we continue to explore and refine the data to extract meaningful insights and build predictive models for housing price estimation.


In [512]:
import pandas as pd
import re

# Work on independent copies so that cleaning `df` cannot alter `df_org`
df = data.copy()
df_org = data.copy()

# Clean "building_age"
def clean_age(age):
    if isinstance(age, str):
        if "arası" in age:
            match = re.match(r'(\d+)-(\d+) arası', age)
            if match:
                lower_limit = int(match.group(1))
                upper_limit = int(match.group(2))
                return (lower_limit + upper_limit) // 2
        elif "ve üzeri" in age:
            return int(age.split(" ")[0])  # Extract the lower limit
    return int(age)

df["building_age"] = df["building_age"].apply(clean_age)

# Clean "total_floor_count"
def clean_floor_count(floor_count):
    if isinstance(floor_count, int):
        return floor_count
    elif "arası" in floor_count:
        match = re.match(r'(\d+)-(\d+) arası', floor_count)
        if match:
            lower_limit = int(match.group(1))
            upper_limit = int(match.group(2))
            return (lower_limit + upper_limit) // 2
    elif "ve üzeri" in floor_count:
        return int(re.findall(r'\d+', floor_count)[0])  # Extract the lower limit
    elif "Bahçe katı" in floor_count:
        return 1
    else:
        return int(floor_count)

df["total_floor_count"] = df["total_floor_count"].apply(clean_floor_count)

# Clean "floor_no"
def clean_floor_no(floor_no, total_floor_count):
    if "Bahçe" in floor_no or "Giriş" in floor_no or "Zemin" in floor_no:
        return 1
    elif "Bodrum" in floor_no:
        return 0
    elif "Müstakil" in floor_no or "Üst" in floor_no or "Teras" in floor_no or "Çatı" in floor_no:
        return total_floor_count
    elif "Kot" in floor_no:
        return int(floor_no.split("Kot")[1])
    elif "ve üzeri" in floor_no:
        return int(floor_no.split(" ")[0])  # Extract the lower limit
    else:
        return int(floor_no.split("+")[0])

df["floor_no"] = df.apply(lambda row: clean_floor_no(row["floor_no"], row["total_floor_count"]), axis=1)
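
The range and open-ended rules used by the cleaning functions above can be tried in isolation. This is a minimal sketch: `parse_label` is a hypothetical helper that is not part of the notebook, and the sample strings are assumptions chosen to match the patterns the notebook's regular expressions expect.

```python
import re

def parse_label(label):
    """Hypothetical helper: turn a range or open-ended label into one integer."""
    text = str(label).strip()
    # "A-B arası" (between A and B): use the integer midpoint.
    match = re.match(r"(\d+)-(\d+) arası", text)
    if match:
        low, high = int(match.group(1)), int(match.group(2))
        return (low + high) // 2
    # "N ve üzeri" (N and above): use the lower bound N.
    if "ve üzeri" in text:
        return int(re.findall(r"\d+", text)[0])
    # Plain numbers pass through unchanged.
    return int(text)

for sample in ["6-10 arası", "40 ve üzeri", "5", 12]:
    print(repr(sample), "->", parse_label(sample))
```

Converting the input with `str` first lets the same function handle entries that are already integers, which avoids a separate `isinstance` branch.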

# Numerical Encoding of Categorical Variables

In this code segment, we delve into numerical encoding of categorical variables within the housing dataset using the Pandas library.

## Numerize "District" and "Neighboorhood"

We transform the categorical columns "District" and "Neighboorhood" (the column name is spelled this way in the dataset) into numerical representations. The technique used here is integer (label) encoding, which replaces each text category with a numerical code so that the column can be used by machine learning models that require numerical input.

The code `df["District"] = pd.factorize(df["District"])[0]` applies the `factorize` function to the "District" column. This function assigns one integer code to each distinct category, numbered in order of first appearance, so the codes carry no ordinal meaning. The same transformation is applied to the "Neighboorhood" column with `df["Neighboorhood"] = pd.factorize(df["Neighboorhood"])[0]`.

## Exporting Cleaned DataFrame

We proceed by exporting the resulting DataFrame, which now incorporates numerical encodings for the "District" and "Neighboorhood" columns. This refined DataFrame can serve as input for subsequent analysis and modeling tasks.

The code snippet `df.to_csv("cleaned_data.csv", index=False)` accomplishes the export of the cleaned DataFrame to a CSV file named "cleaned_data.csv." The `index=False` parameter ensures that the DataFrame indices are not included in the exported file.

With categorical variables numerically encoded and the data exported in a refined format, we are well-equipped to undertake sophisticated analyses and predictive modeling based on the enriched dataset.


In [513]:
# Numerize "District" and "Neighboorhood" (column name as spelled in the dataset)
df["District"] = pd.factorize(df["District"])[0]
df["Neighboorhood"] = pd.factorize(df["Neighboorhood"])[0]

# Export the resulting DataFrame to a CSV file
df.to_csv("cleaned_data.csv", index=False)
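
To see what `pd.factorize` returns, here is a small sketch on invented category names (the labels "North", "South", and "East" are placeholders, not districts from the dataset). It also shows one-hot encoding with `pd.get_dummies` as a possible alternative; the notebook itself uses only `factorize`.

```python
import pandas as pd

# Invented labels, used only to illustrate the encoding.
districts = pd.Series(["North", "South", "North", "East"])

# factorize returns integer codes plus the unique labels, in order of first appearance.
codes, uniques = pd.factorize(districts)
print(codes.tolist())   # [0, 1, 0, 2]
print(list(uniques))    # ['North', 'South', 'East']

# One-hot encoding is an alternative that avoids implying an order between categories.
one_hot = pd.get_dummies(districts, prefix="District")
print(one_hot.shape)
```

Keeping `uniques` around makes it possible to map the integer codes back to their labels later, which the `[0]` indexing in the notebook discards.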

# Implementation of 90% Regulation Feature

In this section of code, we implement a feature based on the 90% regulation, under which a housing loan cannot exceed 90% of the property's value. The feature flags units whose mortgage reaches or exceeds 90% of the declared value. When the mortgage exceeds that share, the declared value must be lower than the true price if the loan itself complied with the regulation. We use the NumPy library for efficient calculations.

## Adding the "ninetyreg" Feature

The code `df['ninetyreg'] = np.where(df['mortgage'] >= 0.9 * df['value'], 1, 0)` introduces a new feature named "ninetyreg" into the DataFrame. This feature is a binary indicator: 1 means the mortgage is at least 90% of the declared value, and 0 means the mortgage is below that threshold. The comparison is vectorized with the NumPy function `np.where`.

## Displaying the Updated DataFrame

We conclude this section by displaying the new column. The code `df['ninetyreg']` shows the flag for each housing unit; in the output below, row 3 is flagged while the other visible rows are not.

As we proceed through the data exploration and analysis journey, this "ninetyreg" feature will serve as a critical piece of information, enabling us to better understand the compliance landscape and its potential influence on housing prices.

In [514]:
import numpy as np

# Add the "ninetyreg" feature based on the 90% regulation
df['ninetyreg'] = np.where(df['mortgage'] >= 0.9 * df['value'], 1, 0)

# Display the updated DataFrame
df['ninetyreg']

0        0
1        0
2        0
3        1
4        0
        ..
17173    0
17174    0
17175    0
17176    0
17177    0
Name: ninetyreg, Length: 17178, dtype: int64
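
The flag can also be written as a plain boolean comparison, which makes it easy to count how many units are affected. This is a sketch on invented figures; only the column names "value", "mortgage", and "ninetyreg" come from the notebook.

```python
import pandas as pd

# Invented figures for illustration only.
toy = pd.DataFrame({
    "value":    [380000, 280000, 605000],
    "mortgage": [100000, 280000, 605000],
})

# True where the mortgage reaches at least 90% of the declared value.
flag = toy["mortgage"] >= 0.9 * toy["value"]
toy["ninetyreg"] = flag.astype(int)

print(toy)
print(toy["ninetyreg"].value_counts())
```

Calling `value_counts()` on the real column would show how many of the 17,178 rows are flagged.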

# Creation of Regulated Prices Feature

In this section, we delve into the creation of a new feature that captures regulated (true) prices of housing units, taking into account the 90% regulation principle. This process involves utilizing the "ninetyreg" feature to determine the appropriate price based on compliance status.

## Generating the "regulated_value" Feature

The code `df['regulated_value'] = df.apply(lambda row: row['mortgage'] / 0.9 if row['ninetyreg'] == 1 else row['value'], axis=1)` introduces the "regulated_value" feature to the DataFrame. For units flagged by "ninetyreg" (mortgage at least 90% of the declared value), the regulated price is the mortgage divided by 0.9, which is the lowest property value for which that loan would respect the 90% cap. For all other units, the original "value" is retained as the regulated price.

## Displaying Rows with Price Differences

To gain insights into the impact of regulation on housing prices, we proceed to identify and display rows where the "value" and "regulated_value" differ. The code `price_difference_rows = df[df['value'] != df['regulated_value']]` filters the DataFrame to isolate rows where there's a discrepancy in values.

We then extract and present the columns "value" and "regulated_value" for these specific rows. This visualization provides a clear view of the instances where regulatory considerations lead to adjusted housing prices.

As we continue our exploration, the "regulated_value" feature will enable us to assess the regulatory impact on housing prices and deepen our understanding of the interplay between regulation, perceived value, and actual pricing.

In [515]:
# Creating a new feature to store the regulated(true) prices
df['regulated_value'] = df.apply(lambda row: row['mortgage'] / 0.9 if row['ninetyreg'] == 1 else row['value'], axis=1)

# Display rows where 'value' and 'regulated_value' are different
price_difference_rows = df[df['value'] != df['regulated_value']]
price_difference_rows[['value', 'regulated_value']]


|  | value | regulated_value |
|---|---|---|
| 3 | 280000 | 311111.111111 |
| 6 | 605000 | 672222.222222 |
| 14 | 190000 | 211111.111111 |
| 15 | 520000 | 577777.777778 |
| 24 | 315000 | 350000.000000 |
| ... | ... | ... |
| 17109 | 235000 | 261111.111111 |
| 17138 | 400000 | 416666.666667 |
| 17139 | 325000 | 361111.111111 |
| 17142 | 275000 | 305555.555556 |
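
The row-wise `apply` used for "regulated_value" can be expressed as a single vectorized `np.where` call, which is usually faster on a DataFrame of this size. The sketch below uses invented figures and is offered as an equivalent formulation, not as the notebook's own code.

```python
import numpy as np
import pandas as pd

# Invented figures for illustration only.
toy = pd.DataFrame({
    "value":    [380000.0, 280000.0],
    "mortgage": [100000.0, 280000.0],
})
toy["ninetyreg"] = (toy["mortgage"] >= 0.9 * toy["value"]).astype(int)

# Flagged rows get mortgage / 0.9; all other rows keep the declared value.
toy["regulated_value"] = np.where(
    toy["ninetyreg"] == 1, toy["mortgage"] / 0.9, toy["value"]
)
print(toy)
```

The second toy row has a mortgage equal to its declared value of 280000, so its regulated value becomes 280000 / 0.9.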


# Deriving Key Metrics: Room Count and Value per Room

In this section, we delve into the derivation of critical metrics related to housing attributes, specifically focusing on room count and value per room. These calculations enable us to gain insights into the housing layout and pricing structure.

## Extracting Room Count from Room Count Descriptor

We commence by defining a function, `extract_room_count(room_count)`, that extracts the number of rooms from the room count descriptor. The code `return sum(int(x) for x in room_count.split('+'))` splits the descriptor at each "+" and adds up the parts, so "3+1" yields 4.

We proceed to apply this function to the 'room_count' column using the code `df['room_count_num'] = df['room_count'].apply(extract_room_count)`. This application yields a new 'room_count_num' column, housing the derived room counts.

## Calculating Average Room Size

The code `df['avg_room_size'] = df['size'] / df['room_count_num']` embarks on the calculation of the average size per room, encapsulated in the 'avg_room_size' column. The division of the total size by the room count provides us with a metric that represents the spatial extent of individual rooms within a housing unit.

## Determining Value per Room

To gain insights into the housing value structure, we calculate a value-per-room metric from the "regulated_value" and "avg_room_size" columns. The code `df['value_per_room_by_m2'] = df['regulated_value'] / df['avg_room_size']` divides the regulated value by the average room size. Since `avg_room_size` is `size / room_count_num`, the result equals the price per square meter multiplied by the number of rooms.

## Displaying the Enriched DataFrame

We conclude by presenting an enriched DataFrame that encapsulates the newly calculated metrics. The code `df[['room_count', 'size', 'regulated_value', 'value_per_room_by_m2']]` offers a glimpse into the room count, total size, regulated value, and value per room metrics for each housing unit.

These metrics open up avenues for understanding housing layouts, pricing dynamics, and their interplay within the context of the dataset.

In [516]:
# Define a function to extract the number of rooms from the room count
def extract_room_count(room_count):
    return sum(int(x) for x in room_count.split('+'))

# Apply the function to the 'room_count' column to create a new 'room_count_num' column
df['room_count_num'] = df['room_count'].apply(extract_room_count)

# Calculate the average size per room and create the 'avg_room_size' column
df['avg_room_size'] = df['size'] / df['room_count_num']

# Calculate value per room
df['value_per_room_by_m2'] = df['regulated_value'] / df['avg_room_size']

# Display the updated DataFrame
df[['room_count', 'size','regulated_value','value_per_room_by_m2' ]]


|  | room_count | size | regulated_value | value_per_room_by_m2 |
|---|---|---|---|---|
| 0 | 3+1 | 130 | 380000.000000 | 11692.307692 |
| 1 | 3+1 | 90 | 435000.000000 | 19333.333333 |
| 2 | 3+2 | 175 | 420000.000000 | 12000.000000 |
| 3 | 2+1 | 80 | 311111.111111 | 11666.666667 |
| 4 | 2+1 | 88 | 345000.000000 | 11761.363636 |
| ... | ... | ... | ... | ... |
| 17173 | 3+1 | 100 | 265000.000000 | 10600.000000 |
| 17174 | 3+1 | 125 | 170000.000000 | 5440.000000 |
| 17175 | 3+1 | 125 | 150000.000000 | 4800.000000 |
| 17176 | 2+1 | 85 | 275000.000000 | 9705.882353 |
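
The arithmetic behind "value_per_room_by_m2" can be checked by hand on the first row of the table above (3+1 rooms, 130 m², regulated value 380000). The function below is a small sketch that chains the notebook's three steps; its name and signature are illustrative.

```python
def value_per_room_by_m2(room_count, size, regulated_value):
    """Chain the three steps: room count, average room size, value per room."""
    rooms = sum(int(part) for part in room_count.split("+"))  # "3+1" -> 4
    avg_room_size = size / rooms                               # square meters per room
    return regulated_value / avg_room_size

# First row of the table: 3+1 rooms, 130 m², regulated value 380000.
metric = value_per_room_by_m2("3+1", 130, 380000)

# The same number written as price per square meter times room count.
price_per_m2_times_rooms = (380000 / 130) * 4

print(round(metric, 6), round(price_per_m2_times_rooms, 6))
```

Both expressions give about 11692.307692, matching the value shown for row 0.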


# District-Level Analysis: Value Per Room Variation

In this section, we immerse ourselves in a district-level analysis, aiming to uncover the variation in value per room metrics across different districts. By dissecting this variation, we gain insights into the diverse pricing dynamics associated with varying localities.

## Calculating Average Value Per Room for Each District

We embark on our journey by calculating the average value per room within each district, leveraging the `groupby` functionality. The code `district_avg_value_per_room = df.groupby('District')['value_per_room_by_m2'].mean()` aggregates the value per room metrics for each district, furnishing us with a district-wise average.

## Unveiling the Difference from District Average

Our exploration takes an intriguing turn as we compute the difference between the value per room metric and the district average. The code `df['diff_from_district_avg'] = df.apply(lambda row: row['value_per_room_by_m2'] - district_avg_value_per_room[row['District']], axis=1)` equips each housing unit with a new metric, capturing the deviation from the district's average value per room. This deviation serves as a key indicator of how the unit's pricing aligns with or deviates from the district norm.

## Displaying the Enriched DataFrame

We conclude our analysis by presenting an enriched DataFrame containing the regulated value, the value per room, and the "diff_from_district_avg" metric. The code `df[['regulated_value', 'value_per_room_by_m2', 'diff_from_district_avg']]` displays this composite view, offering a snapshot of pricing, room valuation, and district-level variance.

These metrics empower us to discern pricing anomalies, spot areas of high or low relative pricing, and unravel the unique pricing dynamics within each district. As our exploration deepens, these insights will guide us toward data-driven strategies and predictions that harness the essence of district-level variations.

In [517]:
# Calculate the average value_per_room_by_m2 for each District
district_avg_value_per_room = df.groupby('District')['value_per_room_by_m2'].mean()

# Calculate the difference between each unit's value_per_room_by_m2 and its District average
df['diff_from_district_avg'] = df.apply(lambda row: row['value_per_room_by_m2'] - district_avg_value_per_room[row['District']], axis=1)

df[['regulated_value','value_per_room_by_m2', 'diff_from_district_avg']]

|  | regulated_value | value_per_room_by_m2 | diff_from_district_avg |
|---|---|---|---|
| 0 | 380000.000000 | 11692.307692 | 4905.344191 |
| 1 | 435000.000000 | 19333.333333 | -1205.104776 |
| 2 | 420000.000000 | 12000.000000 | 5213.036498 |
| 3 | 311111.111111 | 11666.666667 | -3.982314 |
| 4 | 345000.000000 | 11761.363636 | -5823.761033 |
| ... | ... | ... | ... |
| 17173 | 265000.000000 | 10600.000000 | -9938.438109 |
| 17174 | 170000.000000 | 5440.000000 | -6230.648980 |
| 17175 | 150000.000000 | 4800.000000 | -6870.648980 |
| 17176 | 275000.000000 | 9705.882353 | -514.262271 |
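
The deviation from the district average can also be computed without a row-wise `apply`, using `groupby(...).transform("mean")`. This is a sketch on invented numbers that shows an equivalent formulation; the notebook keeps the separate `district_avg_value_per_room` Series because a later cell reuses it.

```python
import pandas as pd

# Invented values; two districts with two units each.
toy = pd.DataFrame({
    "District": [0, 0, 1, 1],
    "value_per_room_by_m2": [10000.0, 14000.0, 20000.0, 22000.0],
})

# transform("mean") returns the district mean aligned to every row,
# so the subtraction needs no row-wise apply.
district_mean = toy.groupby("District")["value_per_room_by_m2"].transform("mean")
toy["diff_from_district_avg"] = toy["value_per_room_by_m2"] - district_mean
print(toy)
```

By construction, the deviations within each district sum to zero.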


# Corrected Prices and Normalization: Unraveling the Housing Pricing Puzzle

In this segment, we embark on a journey of recalibrating housing prices through the lens of corrected prices, normalization, and the strategic integration of randomness. Our exploration seeks to shed light on the methodology behind each step and the rationale driving our choices.

## Step 2: Calculating Corrected Prices

Our voyage begins with the calculation of corrected prices, where we address pricing nuances to establish a fair and consistent pricing framework. The function `correct_price(row)` looks up the district's average value per room and adds a small random perturbation. The code `return district_avg + np.random.uniform(-0.1, 0.1)` draws that perturbation uniformly from the interval [-0.1, 0.1], in the same units as the average. With district averages in the thousands, the corrected price therefore stays almost identical to the district average. No random seed is set, so the exact values change from run to run.

## Step 3: Normalization for Comprehensive Insight

Normalization steps in as our guiding compass, enabling us to understand pricing on a holistic scale. The code `df['normalized_price'] = df['corrected_price_per_room'] * df['avg_room_size']` multiplies the corrected per-room price by the unit's average room size, which converts the district-level figure back into a price for the individual housing unit and accounts for variations in unit size.

## Step 4: Visualizing the Data Transformation

As we traverse through these transformations, we unveil a tableau of key metrics for visual examination. By extracting and presenting columns of interest, including "value_per_room_by_m2," "diff_from_district_avg," "corrected_price_per_room," "regulated_value," and "normalized_price," we capture the essence of our journey in a succinct form. This visualization is a testament to the profound impact of data science on pricing strategies.

## Strategic Decision-Making: Our Guiding Light

Each decision in this process is a carefully considered step toward a balanced pricing paradigm. By incorporating district-level insights, calculated randomness, and normalization, we forge a path toward pricing models that echo real-world dynamics. The amalgamation of these techniques aligns with our commitment to infusing data science with strategic decision-making.

In [518]:
# Calculate corrected prices and add randomness
def correct_price(row):
    district_avg = district_avg_value_per_room[row['District']]
    return district_avg + np.random.uniform(-0.1, 0.1) 

df['corrected_price_per_room'] = df.apply(correct_price, axis=1)

# Calculate the normalized price
df['normalized_price'] = df['corrected_price_per_room'] * df['avg_room_size']


# Display the columns of interest
df[['value_per_room_by_m2','diff_from_district_avg', 'corrected_price_per_room','regulated_value' ,'normalized_price']]


|  | value_per_room_by_m2 | diff_from_district_avg | corrected_price_per_room | regulated_value | normalized_price |
|---|---|---|---|---|---|
| 0 | 11692.307692 | 4905.344191 | 6787.007153 | 380000.000000 | 220577.732474 |
| 1 | 19333.333333 | -1205.104776 | 20538.505857 | 435000.000000 | 462116.381791 |
| 2 | 12000.000000 | 5213.036498 | 6787.015433 | 420000.000000 | 237545.540169 |
| 3 | 11666.666667 | -3.982314 | 11670.643387 | 311111.111111 | 311217.156984 |
| 4 | 11761.363636 | -5823.761033 | 17585.133169 | 345000.000000 | 515830.572943 |
| ... | ... | ... | ... | ... | ... |
| 17173 | 10600.000000 | -9938.438109 | 20538.438125 | 265000.000000 | 513460.953135 |
| 17174 | 5440.000000 | -6230.648980 | 11670.741320 | 170000.000000 | 364710.666262 |
| 17175 | 4800.000000 | -6870.648980 | 11670.559379 | 150000.000000 | 364704.980608 |
| 17176 | 9705.882353 | -514.262271 | 10220.050500 | 275000.000000 | 289568.097493 |
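
Because the corrected prices depend on random draws, the numbers in the table change on every run. One way to make them repeatable is a seeded NumPy `Generator`, sketched below on invented district averages. This swaps the notebook's `np.random.uniform` call for `np.random.default_rng`; the seed value 42 is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Invented district averages, for illustration only.
toy = pd.DataFrame({"district_avg": [6787.0, 20538.0, 11670.0]})

# A seeded Generator makes the random noise repeatable across runs.
rng = np.random.default_rng(seed=42)
noise = rng.uniform(-0.1, 0.1, size=len(toy))
toy["corrected_price_per_room"] = toy["district_avg"] + noise

# Re-creating the generator with the same seed yields the same noise.
noise_again = np.random.default_rng(seed=42).uniform(-0.1, 0.1, size=len(toy))
print(np.array_equal(noise, noise_again))
print(toy)
```

Drawing all the noise in one call also avoids calling the random number generator once per row.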


# Feature Selection Strategy for Dataset Split

In the realm of predictive modeling, selecting the right set of features is akin to sculpting a masterpiece. The choices we make are a delicate interplay of context, goals, and data nuances. In this section, we delve into our thought process behind feature selection for both the cleared and uncleared datasets.

## Feature Selection for Cleared Dataset (X_cleared)

1. **`normalized_price`:** Serving as our target variable, this feature is omitted from the feature set. After all, the aim is to predict it rather than include it as an input.

2. **`room_count`:** The original room count is a text descriptor such as "3+1", so it cannot be passed to a linear model directly. The derived features `room_count_num`, `avg_room_size`, and `value_per_room_by_m2` carry its information in numerical form.

3. **`value`:** This is the declared price, not the target of the cleared model. It is excluded because it is the quantity the cleaning steps set out to correct and it is closely tied to `regulated_value`.

4. **`regulated_value`:** This feature could introduce data leakage because it is a price figure closely related to `normalized_price`. By excluding it, we maintain a cleaner predictive landscape. Note that `corrected_price_per_room` and `avg_room_size` remain in `X_cleared`, and their product is exactly `normalized_price`, so the remaining features still determine the target.

## Feature Selection for Uncleared Dataset (X_uncleared)

1. **`value`:** In the uncleared dataset, this is our target variable for prediction. As a result, it gracefully steps aside from the feature ensemble.

2. **`floor_no`:** In the uncleared dataset this column still holds raw text labels of the kind handled by `clean_floor_no`, which a linear model cannot consume. The cleared dataset keeps its numerically cleaned version, but here the column is dropped.

3. **`room_count`:** Similar to the cleared dataset, the original room count feature yields the spotlight to its derived counterparts, promoting coherence in our approach.

4. **`District` and `Neighboorhood`:** In the cleared dataset, these categorical variables were converted to integer codes with `pd.factorize`. Since this transformation isn't mirrored in the uncleared dataset, where they remain text, these features are dropped.

Our selection methodology for both datasets showcases a blend of consistency, precision, and prudent exclusion. The goal is to sculpt predictive models that marry the intricacies of data with strategic decisions. As we journey forward, these choices will serve as the bedrock upon which our modeling saga unfolds.

Remember, every omission is a deliberate step toward unveiling the essence of predictive modeling—one that balances information with abstraction and clarity with complexity.


In [519]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Split the cleared dataset into features (X_cleared) and target (y_cleared)
X_cleared = df.drop(['normalized_price','room_count','value','regulated_value'], axis=1)
y_cleared = df['normalized_price']

# Split the uncleared dataset into features (X_uncleared) and target (y_uncleared)
X_uncleared = df_org.drop(['value','floor_no','room_count', 'District', 'Neighboorhood'], axis=1)
y_uncleared = df_org['value']
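
Linear regression in scikit-learn needs numerical inputs, so it can be useful to list any text columns left in a feature set before fitting. The sketch below does this on an invented DataFrame; the rows are made up, and `non_numeric` is an illustrative variable name.

```python
import pandas as pd

# Invented rows; "room_count" is kept as text on purpose.
toy = pd.DataFrame({
    "size": [130, 90],
    "room_count": ["3+1", "2+1"],
    "mortgage": [100000, 280000],
    "value": [380000, 280000],
})

X = toy.drop(["value"], axis=1)
y = toy["value"]

# List feature columns that are not numeric and would break LinearRegression.fit.
non_numeric = X.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```

Running the same check on `X_cleared` and `X_uncleared` would show whether any text column still needs to be dropped or encoded.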

# Data Split and Model Training: Illuminating the Cleared Path

As our journey unfolds, we find ourselves at a critical juncture—where the alchemy of data partitioning and model training takes center stage. Let's illuminate this juncture with clarity as we delve into splitting our datasets and training a Linear Regression model on the cleared terrain.

## Dividing the Landscape: Train-Test Split

1. **Dataset Division:** Our cleared dataset embodies the promise of insights. To nurture these seeds of discovery, we split it into two harmonious realms—training and testing. In this choreography, the cleared data dons the roles of both mentor and examiner.

2. **Deconstructed Symphony:** Behold the splendor of partitioning—`X_train_cleared` and `y_train_cleared` take up residence in the training realm, while `X_test_cleared` and `y_test_cleared` grace the testing realm. A thoughtful **80-20** split embraces equilibrium between learning and evaluation.

## Enter the Linear Regression Virtuoso

1. **Model's Prelude:** With data poised for exploration, our Linear Regression virtuoso graces the stage—the **model_cleared**. In the cleared dataset's haven, it embarks on an expedition of insights, unraveling the narrative woven by our selected features.

2. **Symphony of Learning:** The model dons the dual roles of student and conductor. As a student, it learns the essence of relationships between features and normalized prices, while as a conductor, it orchestrates the predictive symphony.

## A Crescendo of Insight

1. **Interplay of Features:** As our Linear Regression model learns, a crescendo of insight unfolds. Each feature's unique melody weaves into the overarching harmony of predictions.

2. **Leveraging Cleared Wisdom:** The magic of feature selection is palpable, elevating the model's grasp of intrinsic patterns while sidestepping noise.

## The Grand Finale Awaits

Our journey through data partitioning and model training ignites the spark of anticipation. The **model_cleared**, fueled by the symphony of curated features, preludes the grand finale—the moment of evaluation. Here, the promise of predictive mastery will be put to the test, and insights will dance on the stage of validation.

As the curtain rises on the upcoming evaluation phase, the echo of cleared insights resonates. Join us in the anticipation of a harmonious revelation, where feature-driven clarity melds with the eloquence of model predictions.

In [520]:
# Split the datasets into train and test sets
X_train_cleared, X_test_cleared, y_train_cleared, y_test_cleared = train_test_split(X_cleared, y_cleared, test_size=0.2, random_state=42)
X_train_uncleared, X_test_uncleared, y_train_uncleared, y_test_uncleared = train_test_split(X_uncleared, y_uncleared, test_size=0.2, random_state=42)

# Train a linear regression model on the cleared data
model_cleared = LinearRegression()
model_cleared.fit(X_train_cleared, y_train_cleared)
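
The split-then-fit pattern above can be tried end to end on synthetic data where the true relationship is known. This sketch uses generated numbers with coefficients chosen for illustration (3, 2, and an intercept of 5); it is not a result from the housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data with a known linear relationship: y = 3*x0 + 2*x1 + 5.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))
y = 3 * X[:, 0] + 2 * X[:, 1] + 5

# 80-20 split with a fixed random_state, as in the notebook.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate on rows the model never saw during fitting.
mae = mean_absolute_error(y_test, model.predict(X_test))
print(X_train.shape, X_test.shape)
print(model.coef_, model.intercept_, mae)
```

On noise-free linear data the fitted coefficients recover the true ones and the test error is essentially zero. A near-zero error on real data would instead be a hint to look for leakage between features and target.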

# Illuminating the Prediction Stage: Unveiling the Crystal Ball

The anticipation reaches its zenith as we approach the prediction stage—a moment of truth where our models take the center stage, wielding their predictive prowess. Let's shed light on this stage as we delve into the art of making predictions on the test set.

## The Prediction Canvas

1. **Cleared Model's Divination:** The spotlight shines on our cleared model—the harbinger of insights forged through careful feature selection and meticulous data curation. With a flourish, the model unveils its predictions for the test set, encapsulating its understanding of cleared data dynamics.

2. **Uncleared Model's Prophecy:** Parallel to the cleared model, our uncleared model makes its grand entrance. Trained on the original, uncleaned records with a smaller set of features, it summons its predictive prowess to foretell outcomes for its own test set.

## The Clairvoyant Models

1. **Model-Crafted Crystal Ball:** The cleared model, known as the **model_cleared**, gazes into the crystal ball of features it has been groomed upon. Drawing from its cultivated wisdom, it conjures predictions that mirror its comprehension of cleared data relationships.

2. **The Enigmatic Oracle:** Behold the uncleared model, **model_uncleared**, a mystical oracle working from the unfiltered records of the original dataset. It distills its predictions from the data as declared, offering an alternate perspective on the future.

## A Kaleidoscope of Insights

1. **The Cleared Model's Focus:** Trained on the refined data landscape, the cleared model channels insights unique to its domain. Its predictions reflect a harmony between selected features and outcomes, shedding light on cleared data's predictive tapestry.

2. **The Uncleared Model's Mirage:** The uncleared model sees the data as declared, manipulated prices included, and with fewer features at its disposal. Its predictions reflect the unfiltered relationships within the dataset and provide the baseline against which the benefit of cleaning is measured.

## The Pendulum Swings

Our journey through prediction takes us into the realms of speculation and foresight. As the cleared and uncleared models cast their predictions, we stand on the precipice of validation, where results will reveal the efficacy of our feature-driven strategies.

With a flourish, we embrace the culmination—a moment where predictive mastery entwines with feature intricacies. Join us in the forthcoming revelation as predictions unravel the threads of model understanding and dataset complexity.

Prepare for the grand unveiling—the climax of the prediction journey that holds the key to understanding our models' predictive artistry.


In [521]:
# Make predictions on the test set
pred_cleared = model_cleared.predict(X_test_cleared)
pred_uncleared = model_uncleared.predict(X_test_uncleared)

# Evaluating the Performance: Decoding the Metrics Symphony

As we stand at the crossroads of prediction and reality, we embark on a voyage of evaluating our models' predictive finesse. The curtain rises on a stage adorned with two metrics—**Mean Absolute Error (MAE)** and **Root Mean Squared Error (RMSE)**. Let's unravel the significance of these metrics as they cast their spotlight on model performance.

## The Performance Canvas

1. **Cleared Model's Sonata:** With the cleared model in the limelight, the metrics take the stage. Our virtuoso, the **model_cleared**, showcases its MAE and RMSE scores as testament to its predictive prowess over curated data.

2. **Uncleared Model's Ballad:** The uncleared model, **model_uncleared**, steps onto the stage, revealing its MAE and RMSE scores in a symphony of numbers. This enigmatic performer operates on the raw, uncleaned data.

## The Metrics Symphony

1. **Mean Absolute Error (MAE):** The MAE, a reliable guide, measures the average magnitude of prediction errors. It gauges the distance between predicted and actual values, encapsulating the model's precision in its notes.

2. **Root Mean Squared Error (RMSE):** RMSE, a sonnet of the squared errors' magnitude, refines the narrative. It is the square root of the average squared error, so large misses weigh more heavily than they do in MAE, offering a poignant reflection on model accuracy.
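To make the two definitions concrete, here is a minimal sketch that computes both metrics by hand. The prices in `y_true` and `y_pred` are made up purely for illustration and do not come from `KonutGame.csv`.

```python
import math

# Hypothetical actual and predicted prices, chosen only for illustration
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 300.0, 440.0]

# Signed prediction errors: 10, -10, 0, 40
errors = [p - t for p, t in zip(y_pred, y_true)]

# MAE: mean of the absolute errors
mae = sum(abs(e) for e in errors) / len(errors)

# RMSE: square root of the mean of the squared errors
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))

print(f"MAE:  {mae:.2f}")   # 15.00
print(f"RMSE: {rmse:.2f}")  # 21.21
```

The single large miss of 40 pulls RMSE well above MAE, which is why the two metrics are worth reading side by side.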


In [522]:
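# Evaluate the performance using MAE and RMSE
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# RMSE is taken as the square root of MSE; the `squared=False` argument
# is deprecated in recent scikit-learn releases
mae_cleared = mean_absolute_error(y_test_cleared, pred_cleared)
rmse_cleared = np.sqrt(mean_squared_error(y_test_cleared, pred_cleared))

mae_uncleared = mean_absolute_error(y_test_uncleared, pred_uncleared)
rmse_uncleared = np.sqrt(mean_squared_error(y_test_uncleared, pred_uncleared))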

## The Scoreboard of Insight

As the curtains rise on the realm of performance evaluation, a dynamic display of metrics takes center stage. Our models, like virtuoso conductors, translate raw data into harmonious predictions. Amidst this symphony of numbers, a profound difference emerges—one driven by the transformative power of cleared data.

The cleared model, fortified by its refined features, proudly displays its scores. Its errors are roughly **one-ninth** of the uncleared model's: an MAE of about 12,613 against 116,493, and an RMSE of about 21,259 against 177,967. Its predictive prowess resounds like a precise melodic composition. Every feature, every nuance, contributes to an orchestration of accuracy, showing how data curation leads to predictive mastery.

Yet, the uncleared model's scores echo a different narrative, a tale of wandering through manipulated, uncurated data without guidance. Errors roughly **nine times** larger unveil the impact of ignoring curation, a stark reminder of how unchecked data can lead to discordant predictions. One caveat: the cleared model predicts `normalized_price` while the uncleared model predicts `value`, so the comparison is indicative rather than exact.

In [523]:
# Display the results
print("Cleared Data Performance:")
print(f"MAE: {mae_cleared}")
print(f"RMSE: {rmse_cleared}\n")

print("Uncleared Data Performance:")
print(f"MAE: {mae_uncleared}")
print(f"RMSE: {rmse_uncleared}")

Cleared Data Performance:
MAE: 12612.760773877653
RMSE: 21259.401722036604

Uncleared Data Performance:
MAE: 116492.82857430064
RMSE: 177966.5928594334
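
As a quick check on how far apart the two models are, the scores printed above can be divided directly. This is a small sketch that copies the four numbers from the output; the variable names for the ratios are introduced here for illustration.

```python
# Scores copied from the output above
mae_cleared, rmse_cleared = 12612.760773877653, 21259.401722036604
mae_uncleared, rmse_uncleared = 116492.82857430064, 177966.5928594334

# How many times larger the uncleared errors are
mae_ratio = mae_uncleared / mae_cleared
rmse_ratio = rmse_uncleared / rmse_cleared

print(f"MAE ratio (uncleared / cleared):  {mae_ratio:.2f}")
print(f"RMSE ratio (uncleared / cleared): {rmse_ratio:.2f}")
```

The ratios come out at roughly 9.2 for MAE and 8.4 for RMSE, so the uncleared model's errors are about an order of magnitude larger.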


## OLS Regression Results for Cleared Data

The Ordinary Least Squares (OLS) regression model was applied to predict the `normalized_price` based on the provided features. The key results from the model summary are as follows:

- R-squared: 0.971
  - The R-squared value measures the proportion of the variance in the dependent variable (`normalized_price`) that is explained by the independent variables in the model. In this case, the R-squared value of 0.971 indicates that approximately 97.1% of the variance in the `normalized_price` can be explained by the selected features.

- F-statistic: 4.494e+04
  - The F-statistic tests the overall significance of the model. A high F-statistic indicates that at least one of the predictor variables significantly contributes to explaining the variation in the dependent variable.

- P-values and Coefficients:
  - Each predictor variable's coefficient represents the change in the predicted `normalized_price` for a one-unit change in the predictor variable while keeping other variables constant.
  - The P-value associated with each coefficient tests the null hypothesis that the coefficient is equal to zero (i.e., the variable has no impact on the outcome). A low P-value (usually less than 0.05) suggests that the predictor variable is statistically significant in predicting the `normalized_price`.
  - For example, the coefficient for the `size` variable is approximately 398. This means that for each additional unit of `size`, the predicted `normalized_price` is expected to increase by around 398 units, holding the other features constant.

Overall, the model's R-squared value suggests that the selected features collectively provide a strong fit to the data. The significance of individual predictor variables is determined by their associated P-values. The coefficients provide insights into the direction and magnitude of the relationships between the predictor variables and the predicted `normalized_price`.
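
The R-squared and the F-statistic are two views of the same fit. For an OLS model with an intercept, F = (R² / k) / ((1 − R²) / (n − k − 1)), where k is the number of predictors and n the number of observations. The sketch below plugs in the values reported in the summary (n = 17178, k = 13); the helper name `f_statistic` is introduced here for illustration.

```python
def f_statistic(r_squared: float, n_obs: int, n_predictors: int) -> float:
    """Overall F-statistic of an OLS model with an intercept, derived from R-squared."""
    df_model = n_predictors
    df_resid = n_obs - n_predictors - 1
    return (r_squared / df_model) / ((1.0 - r_squared) / df_resid)


# Values read from the OLS summary; R-squared is rounded to three decimals there
f_cleared = f_statistic(r_squared=0.971, n_obs=17178, n_predictors=13)
print(f"F-statistic implied by the rounded R-squared: {f_cleared:.0f}")
```

The result is about 4.42e+04. The small gap to the reported 4.494e+04 is consistent with R-squared being rounded to 0.971 in the summary.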

## Feature Importance and Uniqueness Analysis

Let's delve into the importance of each feature based on the coefficients and p-values from the OLS Regression Results:

1. `building_age`
   - Coefficient: 115.5670
   - P-value: < 0.001
   - Interpretation: The feature `building_age` has a coefficient of 115.5670. This indicates that for each year increase in the building's age, the predicted `normalized_price` is expected to increase by approximately 115.57 units. The low p-value (< 0.001) suggests that this feature is statistically significant and has a strong impact on predicting housing prices.

2. `total_floor_count`
   - Coefficient: -60.8528
   - P-value: 0.198
   - Interpretation: The feature `total_floor_count` has a coefficient of -60.8528. This suggests that an increase in the total floor count is associated with a decrease in the predicted `normalized_price` by approximately 60.85 units. However, the relatively higher p-value (0.198) implies that this relationship might not be statistically significant.

3. `floor_no`
   - Coefficient: 11.5196
   - P-value: 0.850
   - Interpretation: The feature `floor_no` has a coefficient of 11.5196. This indicates that the specific floor number might have a negligible impact on the predicted `normalized_price`, as the coefficient is relatively low. Additionally, the high p-value (0.850) suggests that this feature is likely not statistically significant.

4. `size`
   - Coefficient: 398.1804
   - P-value: < 0.001
   - Interpretation: The feature `size` has a coefficient of 398.1804. This suggests that for each additional unit increase in the size of the property, the predicted `normalized_price` is expected to increase by around 398.18 units. The low p-value (< 0.001) signifies the strong statistical significance of this feature.

5. `District`
   - Coefficient: 367.8988
   - P-value: < 0.001
   - Interpretation: The feature `District` has a coefficient of 367.8988. Because `District` enters the model as a single numeric column, this is the expected change in the predicted `normalized_price` per one-step increase in the district code. If the codes are arbitrary labels, the magnitude depends on that encoding and should be read with care. The low p-value (< 0.001) suggests that district contributes significantly to predicting housing prices.

6. `Neighboorhood`
   - Coefficient: 4.9669
   - P-value: 0.001
   - Interpretation: The feature `Neighboorhood` has a coefficient of 4.9669. Each one-step increase in the neighborhood code is associated with an increase of about 4.97 units in the predicted `normalized_price`, a slight effect whose size depends on how the neighborhoods are encoded. The low p-value (0.001) suggests that this feature is statistically significant.

7. `mortgage`
   - Coefficient: 0.0121
   - P-value: < 0.001
   - Interpretation: The feature `mortgage` has a coefficient of 0.0121. This indicates that for each unit increase in the mortgage value, the predicted `normalized_price` is expected to increase by approximately 0.0121 units. The low p-value (< 0.001) signifies the statistical significance of this feature.

8. `ninetyreg`
   - Coefficient: -2444.5204
   - P-value: < 0.001
   - Interpretation: The feature `ninetyreg` has a coefficient of -2444.5204. This suggests that a one-unit increase in `ninetyreg`, the feature tied to the 90% regulation, is associated with a decrease of about 2,444.52 units in the predicted `normalized_price`, holding the other features constant. The low p-value (< 0.001) indicates strong statistical significance.

9. `room_count_num`
   - Coefficient: -1.062e+04
   - P-value: < 0.001
   - Interpretation: The feature `room_count_num` has a coefficient of -1.062e+04. This implies that for each additional room, the predicted `normalized_price` is expected to decrease by about 10,620 units, holding the other features (including `size` and `avg_room_size`) constant. The low p-value (< 0.001) reflects the statistical significance of this relationship.

10. `avg_room_size`
    - Coefficient: 9405.4157
    - P-value: < 0.001
    - Interpretation: The feature `avg_room_size` has a coefficient of 9405.4157. This indicates that for each increase in the average room size, the predicted `normalized_price` is expected to increase by around 9405 units. The low p-value (< 0.001) signifies the strong statistical significance of this feature.

In summary, `building_age`, `size`, `District`, `Neighboorhood`, `mortgage`, `ninetyreg`, `room_count_num`, and `avg_room_size` are statistically significant and contribute to predicting housing prices. The remaining two features, `total_floor_count` (p = 0.198) and `floor_no` (p = 0.850), are not statistically significant at the usual 0.05 level.
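
The reading of the p-values above can be reproduced with a small sketch. The coefficients and p-values are copied from the list. A reported "< 0.001" is stored as its upper bound 0.001, which is an assumption made only so the values can be compared against a threshold. The names `ALPHA` and `features` are illustrative.

```python
ALPHA = 0.05  # conventional significance threshold

# feature -> (coefficient, p-value); "< 0.001" is stored as the upper bound 0.001
features = {
    "building_age": (115.5670, 0.001),
    "total_floor_count": (-60.8528, 0.198),
    "floor_no": (11.5196, 0.850),
    "size": (398.1804, 0.001),
    "District": (367.8988, 0.001),
    "Neighboorhood": (4.9669, 0.001),
    "mortgage": (0.0121, 0.001),
    "ninetyreg": (-2444.5204, 0.001),
    "room_count_num": (-1.062e04, 0.001),
    "avg_room_size": (9405.4157, 0.001),
}

significant = [name for name, (_, p) in features.items() if p < ALPHA]
not_significant = [name for name, (_, p) in features.items() if p >= ALPHA]

print("Significant at 0.05:    ", significant)
print("Not significant at 0.05:", not_significant)
```

With a fitted statsmodels result, the same filter would normally be applied to its `pvalues` attribute instead of a hand-typed dictionary.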


In [524]:
import pandas as pd
import statsmodels.api as sm


# Add a constant term to the predictor variables matrix
X_cleared = sm.add_constant(X_cleared)

# Fit the OLS model
model = sm.OLS(y_cleared, X_cleared).fit()

# Print the summary of the OLS regression results
print(model.summary())


                            OLS Regression Results                            
Dep. Variable:       normalized_price   R-squared:                       0.971
Model:                            OLS   Adj. R-squared:                  0.971
Method:                 Least Squares   F-statistic:                 4.494e+04
Date:                Sat, 19 Aug 2023   Prob (F-statistic):               0.00
Time:                        16:05:53   Log-Likelihood:            -1.9520e+05
No. Observations:               17178   AIC:                         3.904e+05
Df Residuals:                   17164   BIC:                         3.905e+05
Df Model:                          13                                         
Covariance Type:            nonrobust                                         
                               coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------
const                   

## OLS Regression Analysis for Uncleared Data

### R-squared and F-statistic

The R-squared value is an indicator of how well the model fits the data. In this OLS regression, the R-squared value is 0.505. This indicates that approximately 50.5% of the variability in the target variable (`value`) can be explained by the predictor variables included in the model.

The F-statistic is used to assess the overall significance of the model. In this case, the F-statistic is 4374. The associated probability (Prob (F-statistic)) is extremely close to 0, indicating that the overall model is statistically significant. This suggests that at least one of the predictor variables has a significant impact on the target variable.

### Feature Coefficients, T-values, and P-values

Let's delve into the coefficients, t-values, and p-values of the individual features in the OLS regression model:

1. `const` (Constant Term):
   - Coefficient: 3.042e+04
   - T-value: 4.981
   - P-value: 0.000
   - Interpretation: The constant term is the baseline: the coefficient of 3.042e+04 is the estimated `value` when all predictor variables are zero. The low p-value (reported as 0.000, i.e. below 0.001) indicates that the constant term is statistically significant.

2. `building_age`:
   - Coefficient: 437.6815
   - T-value: 3.161
   - P-value: 0.002
   - Interpretation: The coefficient of 437.6815 implies that for each year increase in `building_age`, the predicted `value` is expected to increase by approximately 437.68 units. The relatively low p-value (0.002) indicates that this feature is statistically significant.

3. `total_floor_count`:
   - Coefficient: 6748.2805
   - T-value: 22.991
   - P-value: 0.000
   - Interpretation: The coefficient of 6748.2805 suggests that an increase in `total_floor_count` is associated with an increase in the predicted `value` by around 6748.28 units. The very low p-value (0.000) indicates strong statistical significance.

4. `size`:
   - Coefficient: 1018.8500
   - T-value: 23.287
   - P-value: 0.000
   - Interpretation: The coefficient of 1018.8500 implies that for each additional unit increase in `size`, the predicted `value` is expected to increase by approximately 1018.85 units. The very low p-value (0.000) reflects the statistical significance of this feature.

5. `mortgage`:
   - Coefficient: 0.7586
   - T-value: 122.362
   - P-value: 0.000
   - Interpretation: The coefficient of 0.7586 indicates that for each unit increase in `mortgage`, the predicted `value` is expected to increase by around 0.7586 units. The very low p-value (0.000) signifies the strong statistical significance of this feature.

In summary, the OLS regression results provide insights into the relationships between the predictor variables and the target variable. The features `building_age`, `total_floor_count`, `size`, and `mortgage` exhibit significant impacts on predicting the housing prices (`value`).
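
Each t-value in the list is the coefficient divided by its standard error, so the standard errors can be recovered from the numbers above. The sketch below does this for the four features and cross-checks the constant term against the standard error of 6107.28 that appears in the summary output. The dictionary names are introduced here for illustration.

```python
# feature -> (coefficient, t-value), as listed above for the uncleared model
reported = {
    "building_age": (437.6815, 3.161),
    "total_floor_count": (6748.2805, 22.991),
    "size": (1018.8500, 23.287),
    "mortgage": (0.7586, 122.362),
}

# t = coef / std_err, so std_err = coef / t
implied_std_err = {name: coef / t for name, (coef, t) in reported.items()}
for name, se in implied_std_err.items():
    print(f"{name:>18}: implied std err = {se:.4g}")

# Cross-check with the constant term: coefficient 3.042e+04, std err 6107.28
t_const = 3.042e04 / 6107.28
print(f"const t-value: {t_const:.3f}")
```

The constant's t-value comes out at about 4.98, matching the reported 4.981. The very large t-value for `mortgage` reflects a tiny standard error relative to its coefficient, not a large coefficient.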


In [None]:
# Add a constant term to the predictor variables matrix
X_uncleared = sm.add_constant(X_uncleared)

# Fit the OLS model
model = sm.OLS(y_uncleared, X_uncleared).fit()

# Print the summary of the OLS regression results
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                  value   R-squared:                       0.505
Model:                            OLS   Adj. R-squared:                  0.505
Method:                 Least Squares   F-statistic:                     4374.
Date:                Sat, 19 Aug 2023   Prob (F-statistic):               0.00
Time:                        16:12:20   Log-Likelihood:            -2.3157e+05
No. Observations:               17178   AIC:                         4.632e+05
Df Residuals:                   17173   BIC:                         4.632e+05
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const              3.042e+04   6107.28

## Conclusion

### The Tale of Cleared Data
The model trained on the refined, cleared data unfolds as a harmonious composition. With an impressive R-squared value of 0.971, the model explains approximately 97.1% of the variance in normalized housing prices. The F-statistic, a testament to the model's overall significance, emerges at a staggering 4.494e+04. This ensemble of metrics illustrates the adeptness of cleared data in articulating the variance within our target variable.

The individual contributors to this symphony, our predictor variables, step onto the stage with significance. Building age, size, district, neighborhood, mortgage, ninetyreg, room count, and average room size join the ensemble, each playing a role in shaping the orchestration of predictions. As we examine the p-values, we observe a compelling narrative: all eight stand as stalwart features with p-values at or below 0.001, while total floor count (p = 0.198) and floor number (p = 0.850) do not reach statistical significance.

### The Story of Uncleared Data
On the other hand, the uncleared data paints a different narrative. With an R-squared value of 0.505, the model captures only about 50.5% of the variability in housing prices. The accompanying F-statistic of 4374 confirms that the model as a whole is statistically significant, but significance is not the same as a good fit. The relatively lower R-squared value can be attributed, at least in part, to the necessary omission of certain features to facilitate the processing of uncleared data. This suggests that the dropped features hold a meaningful role in uncovering the genuine prices of the apartments.

As we dissect the roles of the individual features, building age, total floor count, size, and mortgage take the spotlight. Each comes forward with its coefficient and significance. The t-values and p-values tell a consistent story: all four wield statistical significance, even though together they explain only about half of the variance.

These results show, once again, how much data cleaning and feature engineering contribute to the success of data analysis and modeling.