<a href="https://colab.research.google.com/github/kthanikonda/DataWithPython/blob/main/Real_Estate_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Dataset Selection and Initial Exploration

**Dataset:** Real Estate Sales Data (2001-2022)

**Source:** https://www.kaggle.com/datasets/omniamahmoudsaeed/real-estate-sales-2001-2022

**About Dataset:** This dataset provides detailed information about property sales, including various property features and sale statistics. The data spans multiple years and includes information about towns, property types, sale amounts, assessed values, and additional remarks from assessors.

### About the Columns in the Dataset

The dataset has 14 columns. Below is what each one represents:

1. **Serial Number -** A unique ID for each property.  
2. **List Year -** The year the property was listed for sale.  
3. **Date Recorded -** When the sale was officially recorded.  
4. **Town -** The city or town where the property is located.  
5. **Address -** The street address of the property.  
6. **Assessed Value -** The value assigned for tax purposes.  
7. **Sale Amount -** The actual price the property sold for.  
8. **Sales Ratio -** The assessed value divided by the sale amount (e.g., 133,000 / 248,400 ≈ 0.5354).  
9. **Property Type -** The type of property (e.g., Residential, Commercial).  
10. **Residential Type -** If residential, the specific type (e.g., Single Family).  
11. **Non Use Code -** A code flagging sales whose price is not considered reliable for determining property value.  
12. **Assessor Remarks -** Additional notes from the assessor.  
13. **OPM remarks -** Notes from the Office of Policy and Management (OPM).  
14. **Location -** The geographic coordinates, stored as text in the form `POINT (longitude latitude)`.
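
As a quick sanity check on the `Sales Ratio` column, the sketch below recomputes it for two rows that appear in the `df.head()` preview later in this notebook. It assumes the ratio is the assessed value divided by the sale amount; the name `sample` is only for illustration.

```python
import pandas as pd

# Two rows copied from the df.head() preview of this dataset.
sample = pd.DataFrame(
    {
        "Assessed Value": [133000.0, 110500.0],
        "Sale Amount": [248400.0, 239900.0],
        "Sales Ratio": [0.5354, 0.4606],
    }
)

# Assumption: Sales Ratio = Assessed Value / Sale Amount.
recomputed = (sample["Assessed Value"] / sample["Sale Amount"]).round(4)

# Compare with the stored ratio, allowing for rounding in the published values.
matches = (recomputed - sample["Sales Ratio"]).abs() < 1e-4
print(recomputed.tolist(), bool(matches.all()))
```

If the assumption were reversed (sale amount over assessed value), the recomputed values would be close to 2 rather than 0.5, so the check distinguishes the two definitions.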



# 2. Loading the Dataset into Colab

*   To begin the analysis, we upload the dataset to Google Colab with the `files.upload()` method from `google.colab`, then read it into a DataFrame with `pandas`.
*   We then call `df.head()` to preview the first few rows.
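
Reading a CSV this large can produce a mixed-type `DtypeWarning` from pandas when a sparsely filled column is inferred differently across chunks. One way to avoid that, sketched below under the assumption that the sparse remark and code columns should be treated as text, is to pass explicit dtypes to `read_csv`. The in-memory CSV is a stand-in built from two preview rows, and `text_columns` is an illustrative name.

```python
import io

import pandas as pd

# Small in-memory stand-in for the CSV; the real file would be read by path.
csv_text = """Serial Number,List Year,Date Recorded,Town,Non Use Code,OPM remarks
2020177,2020,04/14/2021,Ansonia,,
210288,2021,06/20/2022,Avon,,
"""

# Assumption: these sparse columns hold free text or codes, so read them as strings.
text_columns = {"Non Use Code": "string", "OPM remarks": "string"}

sample = pd.read_csv(io.StringIO(csv_text), dtype=text_columns)
print(sample.dtypes)
```

With the real file, the same `dtype` mapping can be passed alongside the file path in place of the `StringIO` buffer.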


In [27]:
from google.colab import files

# Opens a file picker in Colab; returns a dict mapping each filename to its bytes.
uploaded = files.upload()

Saving Real_Estate_Sales_2001-2022_GL.csv to Real_Estate_Sales_2001-2022_GL (1).csv


In [28]:
import pandas as pd

df = pd.read_csv("Real_Estate_Sales_2001-2022_GL.csv")

df.head()



|  | Serial Number | List Year | Date Recorded | Town | Address | Assessed Value | Sale Amount | Sales Ratio | Property Type | Residential Type | Non Use Code | Assessor Remarks | OPM remarks | Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020177 | 2020 | 04/14/2021 | Ansonia | 323 BEAVER ST | 133000.0 | 248400.0 | 0.5354 | Residential | Single Family |  |  |  | POINT (-73.06822 41.35014) |
| 1 | 2020225 | 2020 | 05/26/2021 | Ansonia | 152 JACKSON ST | 110500.0 | 239900.0 | 0.4606 | Residential | Three Family |  |  |  |  |
| 2 | 2020348 | 2020 | 09/13/2021 | Ansonia | 230 WAKELEE AVE | 150500.0 | 325000.0 | 0.463 | Commercial |  |  |  |  |  |
| 3 | 2020090 | 2020 | 12/14/2020 | Ansonia | 57 PLATT ST | 127400.0 | 202500.0 | 0.6291 | Residential | Two Family |  |  |  |  |
| 4 | 210288 | 2021 | 06/20/2022 | Avon | 12 BYRON DRIVE | 179990.0 | 362500.0 | 0.4965 | Residential | Condo |  |  |  | POINT (-72.879115982 41.773452988) |


# 3. Data Import and Cleaning

In this section, we examine the dataset for missing values, inconsistent data, and incorrect data types, then apply cleaning steps to prepare it for analysis.
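
A compact way to examine missing values is to report both the count and the percentage per column. The helper below is a minimal sketch: `missing_summary` and the toy frame are illustrative names and values, not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for the real data; the values are illustrative only.
toy = pd.DataFrame(
    {
        "Town": ["Ansonia", "Ansonia", "Avon", "Avon"],
        "Property Type": ["Residential", None, "Residential", None],
        "Location": [None, None, None, "POINT (-72.879115982 41.773452988)"],
    }
)


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages per column, worst first."""
    counts = frame.isnull().sum()
    percent = (counts / len(frame) * 100).round(1)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    return summary.sort_values("missing", ascending=False)


summary = missing_summary(toy)
print(summary)
```

Seeing the percentage next to the raw count makes it easier to decide whether to drop a column entirely or only the affected rows.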



In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1097629 entries, 0 to 1097628
Data columns (total 14 columns):
 #   Column            Non-Null Count    Dtype  
---  ------            --------------    -----  
 0   Serial Number     1097629 non-null  int64  
 1   List Year         1097629 non-null  int64  
 2   Date Recorded     1097627 non-null  object 
 3   Town              1097629 non-null  object 
 4   Address           1097578 non-null  object 
 5   Assessed Value    1097629 non-null  float64
 6   Sale Amount       1097629 non-null  float64
 7   Sales Ratio       1097629 non-null  float64
 8   Property Type     715183 non-null   object 
 9   Residential Type  699240 non-null   object 
 10  Non Use Code      313451 non-null   object 
 11  Assessor Remarks  171228 non-null   object 
 12  OPM remarks       13031 non-null    object 
 13  Location          298111 non-null   object 
dtypes: float64(3), int64(2), object(9)
memory usage: 117.2+ MB


In [30]:
df.describe()

|  | Serial Number | List Year | Assessed Value | Sale Amount | Sales Ratio |
|---|---|---|---|---|---|
| count | 1097629.0 | 1097629.0 | 1097629.0 | 1097629.0 | 1097629.0 |
| mean | 537035.7 | 2011.218 | 281801.6 | 405314.6 | 9.603926 |
| std | 7526074.0 | 6.773485 | 1657890.0 | 5143492.0 | 1801.664 |
| min | 0.0 | 2001.0 | 0.0 | 0.0 | 0.0 |
| 25% | 30713.0 | 2005.0 | 89090.0 | 145000.0 | 0.4778667 |
| 50% | 80706.0 | 2011.0 | 140580.0 | 233000.0 | 0.6105663 |
| 75% | 170341.0 | 2018.0 | 228270.0 | 375000.0 | 0.77072 |
| max | 2000500000.0 | 2022.0 | 881510000.0 | 5000000000.0 | 1226420.0 |
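
The summary shows a minimum `Sale Amount` and `Assessed Value` of 0 and a maximum `Sales Ratio` of 1,226,420 against a median near 0.61, which points to inconsistent records. The sketch below flags such rows on a toy frame. The first two rows come from the preview, the rest are made up for illustration, and the `MAX_PLAUSIBLE_RATIO` threshold is a hypothetical choice rather than a rule taken from the dataset.

```python
import pandas as pd

# First two rows are from the preview; the last three are invented examples.
toy = pd.DataFrame(
    {
        "Assessed Value": [133000.0, 110500.0, 0.0, 90000.0, 500000.0],
        "Sale Amount": [248400.0, 239900.0, 150000.0, 0.0, 1.0],
        "Sales Ratio": [0.5354, 0.4606, 0.0, 0.0, 500000.0],
    }
)

# Hypothetical cutoff: ratios far above 1 suggest a nominal or erroneous price.
MAX_PLAUSIBLE_RATIO = 10.0

suspicious = (
    (toy["Assessed Value"] <= 0)
    | (toy["Sale Amount"] <= 0)
    | (toy["Sales Ratio"] > MAX_PLAUSIBLE_RATIO)
)
print(f"Suspicious rows: {int(suspicious.sum())} of {len(toy)}")
```

Flagging rather than deleting keeps the decision reversible, since some zero-value or nominal sales may be legitimate transfers.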


In [31]:
df.isnull().sum()

| Column | Missing values |
|---|---|
| Serial Number | 0 |
| List Year | 0 |
| Date Recorded | 2 |
| Town | 0 |
| Address | 51 |
| Assessed Value | 0 |
| Sale Amount | 0 |
| Sales Ratio | 0 |
| Property Type | 382446 |
| Residential Type | 398389 |


In [32]:
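# Drop columns that are mostly empty (each is over 70% missing in df.info() above).
columns_to_drop = ['Non Use Code', 'Assessor Remarks', 'OPM remarks']
df.drop(columns=columns_to_drop, inplace=True)

# Drop the few rows that lack a recorded date or an address.
df_cleaned = df.dropna(subset=['Date Recorded', 'Address'])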

print("Remaining rows after cleaning:", len(df_cleaned))
df_cleaned.info()


Remaining rows after cleaning: 1097578
<class 'pandas.core.frame.DataFrame'>
Index: 1097578 entries, 0 to 1097628
Data columns (total 11 columns):
 #   Column            Non-Null Count    Dtype  
---  ------            --------------    -----  
 0   Serial Number     1097578 non-null  int64  
 1   List Year         1097578 non-null  int64  
 2   Date Recorded     1097578 non-null  object 
 3   Town              1097578 non-null  object 
 4   Address           1097578 non-null  object 
 5   Assessed Value    1097578 non-null  float64
 6   Sale Amount       1097578 non-null  float64
 7   Sales Ratio       1097578 non-null  float64
 8   Property Type     715179 non-null   object 
 9   Residential Type  699236 non-null   object 
 10  Location          298106 non-null   object 
dtypes: float64(3), int64(2), object(6)
memory usage: 100.5+ MB


In [33]:
# Stricter cleaning: also require Location, Residential Type, and Property Type.
df_cleaned = df.dropna(subset=['Date Recorded', 'Address', 'Location', 'Residential Type', 'Property Type'])

# 'Non Use Code', 'Assessor Remarks', and 'OPM remarks' were already dropped from df in the previous cell.

# Display final info
print(f"Remaining rows after cleaning: {df_cleaned.shape[0]}")
df_cleaned.info()


Remaining rows after cleaning: 214091
<class 'pandas.core.frame.DataFrame'>
Index: 214091 entries, 0 to 1097628
Data columns (total 11 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Serial Number     214091 non-null  int64  
 1   List Year         214091 non-null  int64  
 2   Date Recorded     214091 non-null  object 
 3   Town              214091 non-null  object 
 4   Address           214091 non-null  object 
 5   Assessed Value    214091 non-null  float64
 6   Sale Amount       214091 non-null  float64
 7   Sales Ratio       214091 non-null  float64
 8   Property Type     214091 non-null  object 
 9   Residential Type  214091 non-null  object 
 10  Location          214091 non-null  object 
dtypes: float64(3), int64(2), object(6)
memory usage: 19.6+ MB
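
Requiring `Location`, `Residential Type`, and `Property Type` cuts the data from 1,097,578 rows to 214,091, about 19.5% of the original 1,097,629. The earlier `info()` output shows `Location` with only 298,106 non-null values, so it is the most restrictive column. The sketch below, on a toy frame with illustrative values, shows one way to see how many rows each required column would keep on its own before committing to a combined `dropna`.

```python
import pandas as pd

# Toy frame; addresses and coordinates are borrowed from the preview for realism.
toy = pd.DataFrame(
    {
        "Address": ["323 BEAVER ST", "152 JACKSON ST", "230 WAKELEE AVE", None],
        "Property Type": ["Residential", "Residential", "Commercial", "Residential"],
        "Residential Type": ["Single Family", "Three Family", None, "Condo"],
        "Location": ["POINT (-73.06822 41.35014)", None, None, None],
    }
)

required = ["Address", "Property Type", "Residential Type", "Location"]

# Rows that would survive if each column were the only requirement.
kept_per_column = {col: int(toy[col].notna().sum()) for col in required}

# Rows that survive when every column is required at once (what dropna does).
kept_all = len(toy.dropna(subset=required))
print(kept_per_column, kept_all)
```

If an analysis does not need coordinates, leaving `Location` out of `subset` would retain far more rows.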


In [34]:
df_cleaned.to_csv("cleaned_property_dataset.csv", index=False)

# Download the file to your local system
files.download("cleaned_property_dataset.csv")


In [35]:
df_cleaned.isnull().sum()

| Column | Missing values |
|---|---|
| Serial Number | 0 |
| List Year | 0 |
| Date Recorded | 0 |
| Town | 0 |
| Address | 0 |
| Assessed Value | 0 |
| Sale Amount | 0 |
| Sales Ratio | 0 |
| Property Type | 0 |
| Residential Type | 0 |


In [36]:
df_cleaned.describe()

|  | Serial Number | List Year | Assessed Value | Sale Amount | Sales Ratio |
|---|---|---|---|---|---|
| count | 214091.0 | 214091.0 | 214091.0 | 214091.0 | 214091.0 |
| mean | 1200888.0 | 2017.34036 | 254614.2 | 415593.9 | 7.203829 |
| std | 12143850.0 | 5.037375 | 914930.3 | 1114359.0 | 2668.575 |
| min | 21.0 | 2006.0 | 0.0 | 0.0 | 0.0 |
| 25% | 90361.5 | 2014.0 | 110390.0 | 175000.0 | 0.4851 |
| 50% | 190895.0 | 2020.0 | 160080.0 | 275000.0 | 0.5922741 |
| 75% | 212191.5 | 2021.0 | 248880.0 | 425000.0 | 0.7353994 |
| max | 1710011000.0 | 2022.0 | 68646970.0 | 318790000.0 | 1226420.0 |


In [37]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 214091 entries, 0 to 1097628
Data columns (total 11 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Serial Number     214091 non-null  int64  
 1   List Year         214091 non-null  int64  
 2   Date Recorded     214091 non-null  object 
 3   Town              214091 non-null  object 
 4   Address           214091 non-null  object 
 5   Assessed Value    214091 non-null  float64
 6   Sale Amount       214091 non-null  float64
 7   Sales Ratio       214091 non-null  float64
 8   Property Type     214091 non-null  object 
 9   Residential Type  214091 non-null  object 
 10  Location          214091 non-null  object 
dtypes: float64(3), int64(2), object(6)
memory usage: 19.6+ MB
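
The `info()` output still lists `Date Recorded` and `Location` as `object` columns. A minimal sketch of converting them follows, assuming dates are in month/day/year form and `Location` is text of the form `POINT (longitude latitude)`, as the preview rows suggest. The `Longitude` and `Latitude` column names are illustrative.

```python
import pandas as pd

# Two rows copied from the df.head() preview.
sample = pd.DataFrame(
    {
        "Date Recorded": ["04/14/2021", "06/20/2022"],
        "Location": [
            "POINT (-73.06822 41.35014)",
            "POINT (-72.879115982 41.773452988)",
        ],
    }
)

# Assumption: dates are month/day/year, as the preview rows suggest.
sample["Date Recorded"] = pd.to_datetime(sample["Date Recorded"], format="%m/%d/%Y")

# Assumption: Location is text of the form "POINT (longitude latitude)".
coords = sample["Location"].str.extract(
    r"POINT \((?P<Longitude>-?\d+\.?\d*) (?P<Latitude>-?\d+\.?\d*)\)"
).astype(float)
sample = sample.join(coords)
print(sample.dtypes)
```

With a real datetime column, grouping by year or month becomes straightforward, and numeric coordinates can be plotted or filtered directly.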
