
# 1. Dataset Selection and Initial Exploration

**Dataset:** Real Estate Sales Data (2001-2022)

**Source:** https://www.kaggle.com/datasets/omniamahmoudsaeed/real-estate-sales-2001-2022

**About Dataset:** This dataset provides detailed information about property sales, including various property features and sale statistics. The data spans multiple years and includes information about towns, property types, sale amounts, assessed values, and additional remarks from assessors.

### About the Columns in the Dataset

Each column in the dataset is described below:

1. **Serial Number -** A unique ID for each property.  
2. **List Year -** The year the property was listed for sale.  
3. **Date Recorded -** When the sale was officially recorded.  
4. **Town -** The city or town where the property is located.  
5. **Address -** The street address of the property.  
6. **Assessed Value -** The value assigned for tax purposes.  
7. **Sale Amount -** The actual price the property sold for.  
8. **Sales Ratio -** The assessed value divided by the sale amount (e.g., 133,000 / 248,400 ≈ 0.5354).  
9. **Property Type -** The type of property (e.g., Residential, Commercial).  
10. **Residential Type -** If residential, the specific type (e.g., Single Family).  
11. **Non Use Code -** A code marking sales whose price is not considered reliable for determining property value.  
12. **Assessor Remarks -** Additional notes from the assessor.  
13. **OPM Remarks -** Notes from the Office of Policy and Management (OPM).  
14. **Location -** The exact geographic coordinates (latitude and longitude).
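
The sample rows shown in this notebook suggest that `Sales Ratio` is the assessed value divided by the sale amount. The sketch below checks that reading against three rows copied from the data preview. The tolerance of `1e-4` is an assumption that allows for the ratio being stored with four decimals.

```python
import numpy as np
import pandas as pd

# Three sample rows copied from the data preview in this notebook.
sample = pd.DataFrame({
    "Assessed Value": [133000.0, 110500.0, 127400.0],
    "Sale Amount": [248400.0, 239900.0, 202500.0],
    "Sales Ratio": [0.5354, 0.4606, 0.6291],
})

# Hypothesis: Sales Ratio = Assessed Value / Sale Amount.
recomputed = sample["Assessed Value"] / sample["Sale Amount"]

# Compare with the stored ratio using an absolute tolerance only.
matches = np.isclose(recomputed, sample["Sales Ratio"], rtol=0, atol=1e-4)
print(matches.all())
```

If the check holds on a larger slice of the data, the stored ratio can be recomputed from the two value columns whenever needed.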



# 2. Loading the Dataset into Colab

*   To begin the analysis, we upload the dataset to Google Colab with the `files.upload()` function from `google.colab`, then read it into a DataFrame with `pandas`.
*   We then call `df.head()` to display the first few rows and get an initial sense of the data.
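
On a large CSV with sparsely populated text columns, `pandas` can warn about mixed column types while reading. One possible way to avoid that, sketched here under assumptions, is to declare those columns as strings and read the file in a single pass. The in-memory `csv_text` is a stand-in for the uploaded file, and the choice of `text_columns` is an assumption rather than something the notebook does.

```python
import io
import pandas as pd

# Stand-in for the uploaded file: two rows copied from the preview in this notebook.
csv_text = (
    "Serial Number,List Year,Date Recorded,Town,Address,Assessed Value,"
    "Sale Amount,Sales Ratio,Property Type,Residential Type,Non Use Code,"
    "Assessor Remarks,OPM remarks,Location\n"
    "2020177,2020,04/14/2021,Ansonia,323 BEAVER ST,133000.0,248400.0,0.5354,"
    "Residential,Single Family,,,,POINT (-73.06822 41.35014)\n"
    "2020225,2020,05/26/2021,Ansonia,152 JACKSON ST,110500.0,239900.0,0.4606,"
    "Residential,Three Family,,,,\n"
)

# Assumed set of free-text columns that are mostly empty.
text_columns = ["Non Use Code", "Assessor Remarks", "OPM remarks", "Location"]

# Declaring the dtype up front removes type guessing for these columns;
# low_memory=False makes pandas process the file as a whole.
sample_df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={col: str for col in text_columns},
    low_memory=False,
)
print(sample_df.shape)
```

With the real file, the same call would take the CSV filename in place of `io.StringIO(csv_text)`.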


In [16]:
from google.colab import files

# Opens a file picker and returns a dict mapping each uploaded filename to its bytes
uploaded = files.upload()

Saving Real_Estate_Sales_2001-2022_GL.csv to Real_Estate_Sales_2001-2022_GL (1).csv


In [17]:
import pandas as pd

df = pd.read_csv("Real_Estate_Sales_2001-2022_GL.csv")

df.head()



| | Serial Number | List Year | Date Recorded | Town | Address | Assessed Value | Sale Amount | Sales Ratio | Property Type | Residential Type | Non Use Code | Assessor Remarks | OPM remarks | Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020177 | 2020 | 04/14/2021 | Ansonia | 323 BEAVER ST | 133000.0 | 248400.0 | 0.5354 | Residential | Single Family | NaN | NaN | NaN | POINT (-73.06822 41.35014) |
| 1 | 2020225 | 2020 | 05/26/2021 | Ansonia | 152 JACKSON ST | 110500.0 | 239900.0 | 0.4606 | Residential | Three Family | NaN | NaN | NaN | NaN |
| 2 | 2020348 | 2020 | 09/13/2021 | Ansonia | 230 WAKELEE AVE | 150500.0 | 325000.0 | 0.463 | Commercial | NaN | NaN | NaN | NaN | NaN |
| 3 | 2020090 | 2020 | 12/14/2020 | Ansonia | 57 PLATT ST | 127400.0 | 202500.0 | 0.6291 | Residential | Two Family | NaN | NaN | NaN | NaN |
| 4 | 210288 | 2021 | 06/20/2022 | Avon | 12 BYRON DRIVE | 179990.0 | 362500.0 | 0.4965 | Residential | Condo | NaN | NaN | NaN | POINT (-72.879115982 41.773452988) |


# 3. Data Import and Cleaning

In this section, we examine the dataset for missing values, inconsistent data, and incorrect data types. We then apply cleaning steps so that the dataset is ready for analysis.



In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1097629 entries, 0 to 1097628
Data columns (total 14 columns):
 #   Column            Non-Null Count    Dtype  
---  ------            --------------    -----  
 0   Serial Number     1097629 non-null  int64  
 1   List Year         1097629 non-null  int64  
 2   Date Recorded     1097627 non-null  object 
 3   Town              1097629 non-null  object 
 4   Address           1097578 non-null  object 
 5   Assessed Value    1097629 non-null  float64
 6   Sale Amount       1097629 non-null  float64
 7   Sales Ratio       1097629 non-null  float64
 8   Property Type     715183 non-null   object 
 9   Residential Type  699240 non-null   object 
 10  Non Use Code      313451 non-null   object 
 11  Assessor Remarks  171228 non-null   object 
 12  OPM remarks       13031 non-null    object 
 13  Location          298111 non-null   object 
dtypes: float64(3), int64(2), object(9)
memory usage: 117.2+ MB


In [19]:
df.describe()

| | Serial Number | List Year | Assessed Value | Sale Amount | Sales Ratio |
|---|---|---|---|---|---|
| count | 1097629.0 | 1097629.0 | 1097629.0 | 1097629.0 | 1097629.0 |
| mean | 537035.7 | 2011.218 | 281801.6 | 405314.6 | 9.603926 |
| std | 7526074.0 | 6.773485 | 1657890.0 | 5143492.0 | 1801.664 |
| min | 0.0 | 2001.0 | 0.0 | 0.0 | 0.0 |
| 25% | 30713.0 | 2005.0 | 89090.0 | 145000.0 | 0.4778667 |
| 50% | 80706.0 | 2011.0 | 140580.0 | 233000.0 | 0.6105663 |
| 75% | 170341.0 | 2018.0 | 228270.0 | 375000.0 | 0.77072 |
| max | 2000500000.0 | 2022.0 | 881510000.0 | 5000000000.0 | 1226420.0 |


In [20]:
df.isnull().sum()

| Column | Missing values |
|---|---|
| Serial Number | 0 |
| List Year | 0 |
| Date Recorded | 2 |
| Town | 0 |
| Address | 51 |
| Assessed Value | 0 |
| Sale Amount | 0 |
| Sales Ratio | 0 |
| Property Type | 382446 |
| Residential Type | 398389 |


In [21]:
# See % of missing values
missing_percent = (df.isnull().sum() / len(df)) * 100
missing_percent.sort_values(ascending=False)


| Column | Missing (%) |
|---|---|
| OPM remarks | 98.812805 |
| Assessor Remarks | 84.400194 |
| Location | 72.840459 |
| Non Use Code | 71.442901 |
| Residential Type | 36.295415 |
| Property Type | 34.842921 |
| Address | 0.004646 |
| Date Recorded | 0.000182 |
| Serial Number | 0.0 |
| List Year | 0.0 |
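
The count and the percentage of missing values can also be reported side by side. The sketch below does this on a small toy frame; `toy` and `missing_summary` are illustrative names, and the values are made up for the example.

```python
import pandas as pd

# Toy frame standing in for `df`; the column names mirror the dataset.
toy = pd.DataFrame({
    "Town": ["Ansonia", "Ansonia", "Avon", "Avon"],
    "Property Type": ["Residential", None, "Commercial", None],
    "OPM remarks": [None, None, None, "note"],
})

# isna().sum() counts missing cells; isna().mean() gives the missing fraction.
missing_summary = pd.DataFrame({
    "missing": toy.isna().sum(),
    "percent": toy.isna().mean() * 100,
}).sort_values("percent", ascending=False)
print(missing_summary)
```

Having both figures in one table makes it easier to judge whether a column with a high missing percentage still holds enough rows to be useful.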


In [22]:
# Step 1: Copy the original DataFrame to preserve raw data
cleaned_df = df.copy()
print("✅ Original DataFrame copied to 'cleaned_df'.")

# Step 2: Drop sparse remark/code columns that this analysis does not use
columns_to_drop = ['Non Use Code', 'Assessor Remarks', 'OPM remarks']
cleaned_df = cleaned_df.drop(columns=columns_to_drop, errors='ignore')
print(f"✅ Dropped columns: {columns_to_drop}")

# Step 3: Fill missing values in useful categorical columns
cleaned_df['Residential Type'] = cleaned_df['Residential Type'].fillna('Unknown')
cleaned_df['Property Type'] = cleaned_df['Property Type'].fillna('Unknown')
cleaned_df['Address'] = cleaned_df['Address'].fillna('Unknown')
print("✅ Filled missing values in 'Residential Type', 'Property Type', and 'Address'.")

# Step 4: Convert 'Date Recorded' to datetime format (unparseable values become NaT)
cleaned_df['Date Recorded'] = pd.to_datetime(cleaned_df['Date Recorded'], errors='coerce')
print("✅ Converted 'Date Recorded' to datetime format.")

# Step 5: Extract coordinates from 'Location' (format: "POINT (longitude latitude)")
cleaned_df['Longitude'] = cleaned_df['Location'].str.extract(r'POINT \((-?\d+\.\d+)', expand=False)
# Allow an optional minus sign on latitude as well
cleaned_df['Latitude'] = cleaned_df['Location'].str.extract(r'POINT \(-?\d+\.\d+ (-?\d+\.\d+)', expand=False)

# Convert extracted coordinates to numeric (float)
cleaned_df['Longitude'] = pd.to_numeric(cleaned_df['Longitude'], errors='coerce')
cleaned_df['Latitude'] = pd.to_numeric(cleaned_df['Latitude'], errors='coerce')
print("✅ Extracted and converted 'Longitude' and 'Latitude' from 'Location' column.")

# Step 6: Show result of cleaning
print("\n📊 Cleaned DataFrame Overview:")
cleaned_df.info()  # info() prints directly and returns None, so it is not wrapped in print()
print("\nFirst few rows of cleaned data:")
print(cleaned_df.head())



✅ Original DataFrame copied to 'cleaned_df'.
✅ Dropped columns: ['Non Use Code', 'Assessor Remarks', 'OPM remarks']
✅ Filled missing values in 'Residential Type', 'Property Type', and 'Address'.
✅ Converted 'Date Recorded' to datetime format.
✅ Extracted and converted 'Longitude' and 'Latitude' from 'Location' column.

📊 Cleaned DataFrame Overview:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1097629 entries, 0 to 1097628
Data columns (total 13 columns):
 #   Column            Non-Null Count    Dtype         
---  ------            --------------    -----         
 0   Serial Number     1097629 non-null  int64         
 1   List Year         1097629 non-null  int64         
 2   Date Recorded     1097627 non-null  datetime64[ns]
 3   Town              1097629 non-null  object        
 4   Address           1097629 non-null  object        
 5   Assessed Value    1097629 non-null  float64       
 6   Sale Amount       1097629 non-null  float64       
 7   Sales Ratio       1097629 n

In [23]:
cleaned_df.head()



| | Serial Number | List Year | Date Recorded | Town | Address | Assessed Value | Sale Amount | Sales Ratio | Property Type | Residential Type | Location | Longitude | Latitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020177 | 2020 | 2021-04-14 | Ansonia | 323 BEAVER ST | 133000.0 | 248400.0 | 0.5354 | Residential | Single Family | POINT (-73.06822 41.35014) | -73.06822 | 41.35014 |
| 1 | 2020225 | 2020 | 2021-05-26 | Ansonia | 152 JACKSON ST | 110500.0 | 239900.0 | 0.4606 | Residential | Three Family | NaN | NaN | NaN |
| 2 | 2020348 | 2020 | 2021-09-13 | Ansonia | 230 WAKELEE AVE | 150500.0 | 325000.0 | 0.463 | Commercial | Unknown | NaN | NaN | NaN |
| 3 | 2020090 | 2020 | 2020-12-14 | Ansonia | 57 PLATT ST | 127400.0 | 202500.0 | 0.6291 | Residential | Two Family | NaN | NaN | NaN |
| 4 | 210288 | 2021 | 2022-06-20 | Avon | 12 BYRON DRIVE | 179990.0 | 362500.0 | 0.4965 | Residential | Condo | POINT (-72.879115982 41.773452988) | -72.879116 | 41.773453 |
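
`Location` has 298111 non-null values in `df.info()`, so it is worth checking how many rows yield parsed coordinates. A pattern that requires a decimal point would skip a value such as `POINT (-73 41.5)`. This is one possible cause of a mismatch, offered as an assumption and not verified against the data. The sketch below uses a single pattern with named groups, an optional decimal part, and an optional minus sign on both coordinates.

```python
import pandas as pd

locations = pd.Series([
    "POINT (-73.06822 41.35014)",
    "POINT (-72.879115982 41.773452988)",
    "POINT (-73 41.5)",  # hypothetical value with no decimal part
    None,
])

# Named groups become the column names of the extracted DataFrame.
pattern = (
    r"POINT \((?P<Longitude>-?\d+(?:\.\d+)?) "
    r"(?P<Latitude>-?\d+(?:\.\d+)?)\)"
)

# One extract call yields both columns; rows that do not match stay missing.
coords = locations.str.extract(pattern).apply(pd.to_numeric, errors="coerce")
print(coords)
```

Comparing `coords["Longitude"].notna().sum()` with `locations.notna().sum()` shows how many non-empty locations failed to parse.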


In [24]:
num_full_duplicates = cleaned_df.duplicated().sum()
print(f"Number of fully duplicated rows: {num_full_duplicates}")




Number of fully duplicated rows: 0
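
A count of zero fully duplicated rows does not rule out repeated records that differ in only one or two fields. A possible follow-up, sketched on toy data, is to check for duplicates on a subset of identifying columns. Treating `Serial Number` plus `List Year` as a key is an assumption made for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Serial Number": [2020177, 2020177, 2020225],
    "List Year": [2020, 2020, 2020],
    "Sale Amount": [248400.0, 250000.0, 239900.0],
})

# Rows identical in every column.
full_dupes = toy.duplicated().sum()

# Rows sharing the assumed key; keep=False marks every member of a duplicate group.
key_dupes = toy.duplicated(subset=["Serial Number", "List Year"], keep=False).sum()
print(full_dupes, key_dupes)
```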


In [25]:
cleaned_df.isnull().sum()

| Column | Missing values |
|---|---|
| Serial Number | 0 |
| List Year | 0 |
| Date Recorded | 2 |
| Town | 0 |
| Address | 0 |
| Assessed Value | 0 |
| Sale Amount | 0 |
| Sales Ratio | 0 |
| Property Type | 0 |
| Residential Type | 0 |


In [26]:
cleaned_df.describe()

| | Serial Number | List Year | Date Recorded | Assessed Value | Sale Amount | Sales Ratio | Longitude | Latitude |
|---|---|---|---|---|---|---|---|---|
| count | 1097629.0 | 1097629.0 | 1097627 | 1097629.0 | 1097629.0 | 1097629.0 | 298110.0 | 298110.0 |
| mean | 537035.7 | 2011.218 | 2012-06-28 08:23:23.755556864 | 281801.6 | 405314.6 | 9.603926 | -72.876112 | 41.497602 |
| min | 0.0 | 2001.0 | 1999-04-05 00:00:00 | 0.0 | 0.0 | 0.0 | -121.23091 | 34.34581 |
| 25% | 30713.0 | 2005.0 | 2005-11-04 00:00:00 | 89090.0 | 145000.0 | 0.4778667 | -73.192288 | 41.287632 |
| 50% | 80706.0 | 2011.0 | 2012-08-13 00:00:00 | 140580.0 | 233000.0 | 0.6105663 | -72.90117 | 41.502569 |
| 75% | 170341.0 | 2018.0 | 2018-10-16 00:00:00 | 228270.0 | 375000.0 | 0.77072 | -72.629131 | 41.715729 |
| max | 2000500000.0 | 2022.0 | 2023-09-29 00:00:00 | 881510000.0 | 5000000000.0 | 1226420.0 | -71.18755 | 44.93459 |
| std | 7526074.0 | 6.773485 | NaN | 1657890.0 | 5143492.0 | 1801.664 | 0.444885 | 0.259391 |
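
The summary shows a minimum longitude of -121.23 and latitudes ranging from 34.35 to 44.93, while the towns in the preview (Ansonia, Avon) are in Connecticut, so some coordinates look implausible. One way to flag them is a bounding-box check, sketched below. The box limits are rough assumed values for Connecticut, and the toy rows pair summary extremes purely for illustration.

```python
import pandas as pd

# Approximate bounding box for Connecticut (assumed values; adjust as needed).
LON_MIN, LON_MAX = -73.8, -71.7
LAT_MIN, LAT_MAX = 40.9, 42.1

toy = pd.DataFrame({
    "Longitude": [-73.06822, -121.23091, None],
    "Latitude": [41.35014, 34.34581, None],
})

has_coords = toy[["Longitude", "Latitude"]].notna().all(axis=1)
in_box = (
    toy["Longitude"].between(LON_MIN, LON_MAX)
    & toy["Latitude"].between(LAT_MIN, LAT_MAX)
)

# Flag only rows that have coordinates but fall outside the box.
toy["Suspicious Location"] = has_coords & ~in_box
print(toy)
```

Flagging instead of dropping keeps the rows available for analyses that do not depend on location.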


In [27]:
cleaned_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1097629 entries, 0 to 1097628
Data columns (total 13 columns):
 #   Column            Non-Null Count    Dtype         
---  ------            --------------    -----         
 0   Serial Number     1097629 non-null  int64         
 1   List Year         1097629 non-null  int64         
 2   Date Recorded     1097627 non-null  datetime64[ns]
 3   Town              1097629 non-null  object        
 4   Address           1097629 non-null  object        
 5   Assessed Value    1097629 non-null  float64       
 6   Sale Amount       1097629 non-null  float64       
 7   Sales Ratio       1097629 non-null  float64       
 8   Property Type     1097629 non-null  object        
 9   Residential Type  1097629 non-null  object        
 10  Location          298111 non-null   object        
 11  Longitude         298110 non-null   float64       
 12  Latitude          298110 non-null   float64       
dtypes: datetime64[ns](1), float64(5), int64(2)

In [29]:
zero_sales_count = (cleaned_df['Sale Amount'] == 0).sum()
print(f"❗ Rows with Sale Amount = 0: {zero_sales_count}")



❗ Rows with Sale Amount = 0: 1810


In [30]:
cleaned_df['Valid Sale'] = cleaned_df['Sale Amount'] > 0


In [31]:
print("Count of valid and invalid sales:")
print(cleaned_df['Valid Sale'].value_counts())


Count of valid and invalid sales:
Valid Sale
True     1095819
False       1810
Name: count, dtype: int64


In [32]:
cleaned_df[['Sale Amount', 'Valid Sale']].head(10)


| | Sale Amount | Valid Sale |
|---|---|---|
| 0 | 248400.0 | True |
| 1 | 239900.0 | True |
| 2 | 325000.0 | True |
| 3 | 202500.0 | True |
| 4 | 362500.0 | True |
| 5 | 400000.0 | True |
| 6 | 775000.0 | True |
| 7 | 415000.0 | True |
| 8 | 243000.0 | True |
| 9 | 100000.0 | True |


In [33]:
print(f"Total records in cleaned_df: {len(cleaned_df)}")


Total records in cleaned_df: 1097629


In [34]:
cleaned_df['Valid Assessed'] = cleaned_df['Assessed Value'] > 0
print(cleaned_df['Valid Assessed'].value_counts())


Valid Assessed
True     1090440
False       7189
Name: count, dtype: int64
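
The two flags can be combined to select rows where both values are positive, which also avoids dividing by zero when a ratio is recomputed. The sketch below shows this on toy data; the recomputed ratio assumes `Sales Ratio` is assessed value divided by sale amount, as the sample rows suggest.

```python
import pandas as pd

toy = pd.DataFrame({
    "Assessed Value": [133000.0, 0.0, 150500.0, 127400.0],
    "Sale Amount": [248400.0, 239900.0, 0.0, 202500.0],
})
toy["Valid Sale"] = toy["Sale Amount"] > 0
toy["Valid Assessed"] = toy["Assessed Value"] > 0

# Keep only rows where both flags are True.
usable = toy[toy["Valid Sale"] & toy["Valid Assessed"]]

# The median is less sensitive to extreme ratios than the mean.
median_ratio = (usable["Assessed Value"] / usable["Sale Amount"]).median()
print(len(usable), round(median_ratio, 4))
```

The median is used here because the `Sales Ratio` summary shows a mean (9.60) far above the 75th percentile (0.77), which points to a few extreme values.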


In [35]:
for col in ['Town', 'Property Type', 'Residential Type']:
    cleaned_df[col] = cleaned_df[col].str.strip().str.title()


In [36]:
cleaned_df.head()


| | Serial Number | List Year | Date Recorded | Town | Address | Assessed Value | Sale Amount | Sales Ratio | Property Type | Residential Type | Location | Longitude | Latitude | Valid Sale | Valid Assessed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020177 | 2020 | 2021-04-14 | Ansonia | 323 BEAVER ST | 133000.0 | 248400.0 | 0.5354 | Residential | Single Family | POINT (-73.06822 41.35014) | -73.06822 | 41.35014 | True | True |
| 1 | 2020225 | 2020 | 2021-05-26 | Ansonia | 152 JACKSON ST | 110500.0 | 239900.0 | 0.4606 | Residential | Three Family | NaN | NaN | NaN | True | True |
| 2 | 2020348 | 2020 | 2021-09-13 | Ansonia | 230 WAKELEE AVE | 150500.0 | 325000.0 | 0.463 | Commercial | Unknown | NaN | NaN | NaN | True | True |
| 3 | 2020090 | 2020 | 2020-12-14 | Ansonia | 57 PLATT ST | 127400.0 | 202500.0 | 0.6291 | Residential | Two Family | NaN | NaN | NaN | True | True |
| 4 | 210288 | 2021 | 2022-06-20 | Avon | 12 BYRON DRIVE | 179990.0 | 362500.0 | 0.4965 | Residential | Condo | POINT (-72.879115982 41.773452988) | -72.879116 | 41.773453 | True | True |
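
The effect of `.str.strip().str.title()` is easiest to see on values that differ only in case or surrounding whitespace. The sketch below uses made-up variants of the town names from the preview and compares the number of distinct values before and after normalization.

```python
import pandas as pd

# Made-up variants of the same town names.
towns = pd.Series(["  ansonia", "ANSONIA ", "Avon"])

# strip() removes surrounding whitespace; title() capitalizes each word.
normalized = towns.str.strip().str.title()

print(towns.nunique(), normalized.nunique())
print(normalized.tolist())
```

Running the same before-and-after `nunique()` comparison on `Town`, `Property Type`, and `Residential Type` would show whether the normalization merged any categories in the real data.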
