# Data cleaning

This notebook performs data cleaning for an analysis of real estate market trends in Connecticut.


The raw data file was obtained from https://catalog.data.gov/dataset/real-estate-sales-2001-2018. On the website, the file is described as including

>town, property address, date of sale, property type (residential, apartment, commercial, industrial or vacant land), sales price, and property assessment. 

>Annual real estate sales are reported by grand list year (October 1 through September 30 each year). For instance, sales from 2018 GL are from 10/01/2018 through 9/30/2019 (Data.gov).

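The grand list year convention quoted above can be sketched as a pair of small helpers. The function names are illustrative and not part of the dataset or its documentation; the logic only encodes the October 1 through September 30 window described in the quote.

```python
from datetime import date
from typing import Tuple


def grand_list_window(list_year: int) -> Tuple[date, date]:
    """Return the first and last sale dates covered by a grand list year."""
    return date(list_year, 10, 1), date(list_year + 1, 9, 30)


def grand_list_year(sale_date: date) -> int:
    """Map a sale date to the grand list year it would be reported under."""
    return sale_date.year if sale_date.month >= 10 else sale_date.year - 1


start, end = grand_list_window(2018)
print(start, end)  # 2018-10-01 2019-09-30
print(grand_list_year(date(2021, 9, 13)))  # 2020
```

A helper like `grand_list_year` could be used to check that `List Year` agrees with `Date Recorded` for each row.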
### Table of Contents

1. [Import Libraries and the dataset](#import)

#### 1. Import Libraries and the dataset <a id = 'import'></a>

In [6]:
#Libraries imported
import pandas as pd
import re
import seaborn as sns
import numpy as np

In [7]:
#Dataset imported
correct_real_estate = pd.read_csv('../Datasets/correct_real_estate.csv')
# Preview with a fresh 0..n-1 index; the result is not assigned, so correct_real_estate is unchanged
correct_real_estate.reset_index(drop=True)



Unnamed: 0.1,Unnamed: 0,Serial Number,List Year,Date Recorded,Town,Address,Assessed Value,Sale Amount,Sales Ratio,Property Type,Residential Type,Non Use Code,Assessor Remarks,OPM remarks,Location,Full Address,latitude,longitude
0,0,2020348,2020,09/13/2021,Ansonia,230 WAKELEE AVE,150500.0,325000.0,0.463000,Commercial,,,,,,"230, Wakelee Ave, Ansonia",,
1,1,20002,2020,10/02/2020,Ashford,390 TURNPIKE RD,253000.0,430000.0,0.588300,Residential,Single Family,,,,,"390, Turnpike Rd, Ashford",,
2,2,200212,2020,03/09/2021,Avon,5 CHESTNUT DRIVE,130400.0,179900.0,0.724800,Residential,Condo,,,,,"5, Chestnut Drive, Avon",,
3,3,200243,2020,04/13/2021,Avon,111 NORTHINGTON DRIVE,619290.0,890000.0,0.695800,Residential,Single Family,,,,,"111, Northington Drive, Avon",,
4,4,200377,2020,07/02/2021,Avon,70 FAR HILLS DRIVE,862330.0,1447500.0,0.595700,Residential,Single Family,,,,,"70, Far Hills Drive, Avon",,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
997190,997208,190272,2019,06/24/2020,New London,4 BISHOP CT,60410.0,53100.0,1.137665,Single Family,Single Family,14 - Foreclosure,,,,"4, Bishop Ct, New London",,
997191,997209,190284,2019,11/27/2019,Waterbury,126 PERKINS AVE,68280.0,76000.0,0.898400,Single Family,Single Family,25 - Other,PRIVATE SALE,,,"126, Perkins Ave, Waterbury",,
997192,997210,190129,2019,04/27/2020,Windsor Locks,19 HATHAWAY ST,121450.0,210000.0,0.578300,Single Family,Single Family,,,,,"19, Hathaway St, Windsor Locks",,
997193,997211,190504,2019,06/03/2020,Middletown,8 BYSTREK DR,203360.0,280000.0,0.726300,Single Family,Single Family,,,,,"8, Bystrek Dr, Middletown",,


In [8]:
correct_real_estate.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 997195 entries, 0 to 997194
Data columns (total 18 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Unnamed: 0        997195 non-null  int64  
 1   Serial Number     997195 non-null  int64  
 2   List Year         997195 non-null  int64  
 3   Date Recorded     997193 non-null  object 
 4   Town              997195 non-null  object 
 5   Address           997144 non-null  object 
 6   Assessed Value    997195 non-null  float64
 7   Sale Amount       997195 non-null  float64
 8   Sales Ratio       997195 non-null  float64
 9   Property Type     614751 non-null  object 
 10  Residential Type  608889 non-null  object 
 11  Non Use Code      289663 non-null  object 
 12  Assessor Remarks  149858 non-null  object 
 13  OPM remarks       9916 non-null    object 
 14  Location          197691 non-null  object 
 15  Full Address      970503 non-null  object 
 16  latitude          19

Findings:

- `Serial Number` does not match the index.
- Specific `Address` and `Town` values are available.
- `Sales Ratio` is `Assessed Value` / `Sale Amount`.
- Several columns contain NaN values.
- `Date Recorded` is the date of sale, according to the website.
- Some rows in `Location` include longitude and latitude.
- `Date Recorded` can be converted to datetime format.
- `Date Recorded`, `Town`, `Address`, `Assessed Value`, `Sale Amount`, and `Sales Ratio` have very few missing values.
- `Address` has 51 missing values.
- About 38% of rows are missing `Property Type`, and about 39% are missing `Residential Type`.
- About 20% of rows have `Location` or longitude and latitude.
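The relationship between `Sales Ratio`, `Assessed Value`, and `Sale Amount` can be checked directly. The sketch below uses three rows copied from the preview above; the tolerance is an assumption chosen to absorb the rounding visible in the published ratios.

```python
import numpy as np
import pandas as pd

# Three rows copied from the preview of correct_real_estate
sample = pd.DataFrame({
    "Assessed Value": [150500.0, 253000.0, 60410.0],
    "Sale Amount": [325000.0, 430000.0, 53100.0],
    "Sales Ratio": [0.4630, 0.5883, 1.137665],
})

recomputed = sample["Assessed Value"] / sample["Sale Amount"]

# The stored ratios are rounded, so compare with a small absolute tolerance (assumed value)
matches = np.isclose(recomputed, sample["Sales Ratio"], atol=5e-4, rtol=0)
print(matches.all())  # True
```

On the full dataset, rows with a `Sale Amount` of zero would produce infinite ratios and would need separate handling.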

The dataset is now imported.

Because the dataset is continually updated as coordinates are retrieved from the OpenStreetMap API, the latest version of the dataset should be imported rather than the raw dataset.

The latest version can be downloaded from the `Dataset` folder in the GitHub repository.

In [9]:

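# Keep rows that have a non-use code, a remark, or a location
data_with_remarks = correct_real_estate[
    correct_real_estate['Non Use Code'].notnull() |
    correct_real_estate['Assessor Remarks'].notnull() |
    correct_real_estate['OPM remarks'].notnull() |
    correct_real_estate['Location'].notnull()
].copy()  # copy so later column assignments do not trigger SettingWithCopyWarning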
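The same "at least one of these columns is non-null" filter can also be written over a list of column names. This is a minimal sketch on a hypothetical toy frame standing in for `correct_real_estate`; `annotation_columns` is an illustrative name.

```python
import pandas as pd

# Hypothetical toy frame standing in for correct_real_estate
toy = pd.DataFrame({
    "Non Use Code": ["14 - Foreclosure", None, None],
    "Assessor Remarks": [None, None, "PRIVATE SALE"],
    "OPM remarks": [None, None, None],
    "Location": [None, None, None],
})

annotation_columns = ["Non Use Code", "Assessor Remarks", "OPM remarks", "Location"]

# True where at least one of the listed columns is non-null
has_annotation = toy[annotation_columns].notna().any(axis=1)
subset = toy[has_annotation].copy()
print(subset.index.tolist())  # [0, 2]
```

Listing the columns once makes it easier to add or remove a column from the filter without editing a chain of `|` conditions.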



In [10]:
print(data_with_remarks["latitude"].isna().sum())
print(data_with_remarks["longitude"].isna().sum())

264203
264203


In [11]:
data_with_remarks.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 461893 entries, 6 to 997191
Data columns (total 18 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Unnamed: 0        461893 non-null  int64  
 1   Serial Number     461893 non-null  int64  
 2   List Year         461893 non-null  int64  
 3   Date Recorded     461893 non-null  object 
 4   Town              461893 non-null  object 
 5   Address           461845 non-null  object 
 6   Assessed Value    461893 non-null  float64
 7   Sale Amount       461893 non-null  float64
 8   Sales Ratio       461893 non-null  float64
 9   Property Type     300553 non-null  object 
 10  Residential Type  296116 non-null  object 
 11  Non Use Code      289663 non-null  object 
 12  Assessor Remarks  149858 non-null  object 
 13  OPM remarks       9916 non-null    object 
 14  Location          197691 non-null  object 
 15  Full Address      444176 non-null  object 
 16  latitude          19

In [12]:
data_with_remarks[data_with_remarks["Location"].notnull() & (data_with_remarks["latitude"].isna() | data_with_remarks["longitude"].isna())]

Unnamed: 0.1,Unnamed: 0,Serial Number,List Year,Date Recorded,Town,Address,Assessed Value,Sale Amount,Sales Ratio,Property Type,Residential Type,Non Use Code,Assessor Remarks,OPM remarks,Location,Full Address,latitude,longitude
932227,932239,19055,2019,04/22/2020,Canterbury,171 LISBON RD,230800.0,450000.0,0.5129,Single Family,Single Family,,,,POINT (-72 41.67984),"171, Lisbon Rd, Canterbury",,
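The row above has a `Location` string but no `latitude` or `longitude`. As a sketch, coordinates could be read straight from such strings, assuming they always follow the `POINT (longitude latitude)` layout seen here. `parse_point` and `POINT_PATTERN` are illustrative names, not part of the notebook.

```python
import re
from typing import Optional, Tuple

# Matches "POINT (<x> <y>)" where each number may or may not have a decimal part
POINT_PATTERN = re.compile(
    r"POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)"
)


def parse_point(location: str) -> Optional[Tuple[float, float]]:
    """Return (longitude, latitude) from a 'POINT (lon lat)' string, or None."""
    match = POINT_PATTERN.fullmatch(location.strip())
    if match is None:
        return None
    longitude, latitude = (float(group) for group in match.groups())
    return longitude, latitude


print(parse_point("POINT (-72 41.67984)"))  # (-72.0, 41.67984)
```

The pattern accepts integer coordinates such as `-72` as well as decimals, so a row like this one would not need to be dropped for lack of coordinates.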


In [13]:
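# drop() takes index labels: the row shown above has index 932227 (932239 is its `Unnamed: 0` value)
data_with_remarks = data_with_remarks.drop(index=932227)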
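`DataFrame.drop` removes rows by index label, not by the value of a column such as `Unnamed: 0`. A mask-based drop avoids hard-coding a label at all. The toy frame below is hypothetical and is built so that its index labels differ from its `Unnamed: 0` values.

```python
import pandas as pd

# Hypothetical toy frame: index labels differ from the 'Unnamed: 0' values
toy = pd.DataFrame(
    {
        "Unnamed: 0": [12, 15, 19],
        "Location": ["POINT (-72 41.67984)", None, "POINT (-72.5 41.5)"],
        "latitude": [None, None, 41.5],
        "longitude": [None, None, -72.5],
    },
    index=[10, 11, 12],
)

# Rows that have a Location string but are missing a coordinate
unresolved = toy["Location"].notna() & (toy["latitude"].isna() | toy["longitude"].isna())

# Drop exactly the rows selected by the mask
cleaned = toy.drop(index=toy.index[unresolved])
print(cleaned.index.tolist())  # [11, 12]

# Passing the 'Unnamed: 0' value of the first row removes a different row
print(toy.drop(12).index.tolist())  # [10, 11]
```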


In [14]:
# Check for duplicate rows
duplicates = data_with_remarks[data_with_remarks.duplicated()]
duplicates


Unnamed: 0.1,Unnamed: 0,Serial Number,List Year,Date Recorded,Town,Address,Assessed Value,Sale Amount,Sales Ratio,Property Type,Residential Type,Non Use Code,Assessor Remarks,OPM remarks,Location,Full Address,latitude,longitude
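An empty result here does not necessarily mean there are no repeated sales. `duplicated()` compares every column, so if `Unnamed: 0` is a leftover row identifier (an assumption), two otherwise identical rows would never be flagged. The sketch below shows the difference on a hypothetical toy frame.

```python
import pandas as pd

# Hypothetical toy frame: two identical sales that differ only in the leftover id column
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Serial Number": [20002, 20002, 200212],
    "Town": ["Ashford", "Ashford", "Avon"],
    "Sale Amount": [430000.0, 430000.0, 179900.0],
})

# Comparing all columns finds nothing because the id column is unique per row
print(toy.duplicated().sum())  # 0

# Comparing only the content columns flags the repeated sale
content_columns = [c for c in toy.columns if not c.startswith("Unnamed")]
print(toy.duplicated(subset=content_columns).sum())  # 1
```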


In [15]:
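# Strip letters and hyphens so only the numeric code remains (regex=True is required in pandas >= 2.0)
data_with_remarks['Non Use Code'] = data_with_remarks['Non Use Code'].str.replace(r'[a-zA-Z-]', '', regex=True)

# Drop the leading zero from zero-padded single-digit codes
data_with_remarks['Non Use Code'] = data_with_remarks['Non Use Code'].str.replace(r'0([1-9])\b', r'\1', regex=True)

# Remove any remaining whitespace
data_with_remarks['Non Use Code'] = data_with_remarks['Non Use Code'].str.replace(r'\s', '', regex=True)

data_with_remarks['Non Use Code'].unique()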



array(['8', nan, '14', '25', '1', '12', '7', '28', '24', '3', '18', '6',
       '17', '2', '15', '9', '10', '16', '26', '11', '22', '', '19', '4',
       '5', '27', '13', '30', '21', '23', '20', '29', '31', '34', '74',
       '32', '75'], dtype=object)
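The three substitutions can be traced on a few sample values. In the sketch below, `'14 - Foreclosure'` and `'25 - Other'` are taken from the preview, while `'07'` and the integer `8` are hypothetical. `normalize_codes` is an illustrative name. The integer is included because `.str` methods return NaN for entries that are not strings, which may be worth checking given that the non-null count for `Non Use Code` is 289663 before this cell and 230630 in the later `info()` output.

```python
import pandas as pd


def normalize_codes(codes: pd.Series) -> pd.Series:
    """Reduce labels such as '14 - Foreclosure' to the bare numeric code."""
    # Drop letters and hyphens
    cleaned = codes.str.replace(r"[a-zA-Z-]", "", regex=True)
    # Drop the leading zero of zero-padded single digits, e.g. '07' -> '7'
    cleaned = cleaned.str.replace(r"0([1-9])\b", r"\1", regex=True)
    # Drop whitespace left behind by the first step
    return cleaned.str.replace(r"\s", "", regex=True)


# The last two values are hypothetical
codes = pd.Series(["14 - Foreclosure", "25 - Other", "07", 8], dtype=object)
result = normalize_codes(codes)
print(result.tolist()[:3])     # ['14', '25', '7']
print(result.isna().tolist())  # [False, False, False, True]
```

If the column does hold numeric entries, converting them to strings before the substitutions would keep them from turning into NaN.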

In [16]:


# Convert 'Date Recorded' to datetime
data_with_remarks['Date Recorded'] = pd.to_datetime(data_with_remarks['Date Recorded'])

# Extract the month and year into new columns 'month_recorded' and 'year_recorded'
data_with_remarks['month_recorded'] = data_with_remarks['Date Recorded'].dt.month
data_with_remarks['year_recorded'] = data_with_remarks['Date Recorded'].dt.year
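As a sketch of a stricter variant, the conversion can be given an explicit format. This assumes every `Date Recorded` value follows the `MM/DD/YYYY` layout seen in the preview, which has not been verified for the full dataset. The third value below is a hypothetical bad entry.

```python
import pandas as pd

# Two date strings from the preview plus one hypothetical unparseable value
raw_dates = pd.Series(["09/13/2021", "10/02/2020", "not a date"])

# errors="coerce" turns values that do not match the format into NaT instead of raising
parsed = pd.to_datetime(raw_dates, format="%m/%d/%Y", errors="coerce")

# Extract month and year from the rows that parsed successfully
valid = parsed.dropna()
print(valid.dt.month.tolist())  # [9, 10]
print(valid.dt.year.tolist())   # [2021, 2020]
print(parsed.isna().sum())      # 1
```

Counting the NaT values after conversion gives a quick check on how many rows did not match the expected layout.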




In [17]:
data_with_remarks.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 461892 entries, 6 to 997191
Data columns (total 20 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   Unnamed: 0        461892 non-null  int64         
 1   Serial Number     461892 non-null  int64         
 2   List Year         461892 non-null  int64         
 3   Date Recorded     461892 non-null  datetime64[ns]
 4   Town              461892 non-null  object        
 5   Address           461844 non-null  object        
 6   Assessed Value    461892 non-null  float64       
 7   Sale Amount       461892 non-null  float64       
 8   Sales Ratio       461892 non-null  float64       
 9   Property Type     300552 non-null  object        
 10  Residential Type  296115 non-null  object        
 11  Non Use Code      230630 non-null  object        
 12  Assessor Remarks  149857 non-null  object        
 13  OPM remarks       9916 non-null    object        
 14  Loca

In [18]:
data_with_remarks.to_csv('../Datasets/real_estate_with_remarks.csv')
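The `Unnamed: 0` column in the previews is consistent with a CSV that was written with its index and read back without `index_col`, although that origin is an assumption. The sketch below shows the effect on a small in-memory frame and how `index=False` avoids it.

```python
import io

import pandas as pd

toy = pd.DataFrame({"Town": ["Ansonia", "Avon"], "Sale Amount": [325000.0, 179900.0]})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)  # ['Unnamed: 0', 'Town', 'Sale Amount']

# index=False keeps the file limited to the real columns
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)  # ['Town', 'Sale Amount']
```

Whether to switch the export to `index=False` depends on whether later notebooks rely on the saved index column.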