This SQL script cleans and standardizes the data in the NashvilleHousingData table. The process involves updating and formatting various columns, handling missing or duplicate data, and enforcing consistency in line with data-handling best practices.
- Standardizing Data Formats: Ensuring consistent formats for dates and text fields.
- Populating Missing Data: Using joins and update statements to fill in missing values.
- Breaking Down Compound Data: Splitting address fields into separate components (address, city, state).
- Standardizing Categorical Data: Converting 'Y' and 'N' values to 'Yes' and 'No'.
- Identifying and Handling Duplicates: Detecting duplicate records and isolating unique records.
- Creating Views for Reporting: Setting up views for easy visualization and reporting of cleaned data.
Standardizing Date Format
- Objective: Alter the `SaleDate` column to ensure it has a `DATE` data type.
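In T-SQL, this alteration might look like the following sketch (the original script may instead use `CONVERT` into a new column; this assumes the existing values are castable):

```sql
-- Change SaleDate from DATETIME to DATE, dropping the time component.
-- Assumes every existing value converts cleanly to DATE.
ALTER TABLE NashvilleHousingData
ALTER COLUMN SaleDate DATE;
```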
Populating Property Address
- Objective: Use an `UPDATE` statement to fill in missing property addresses.
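A common way to do this is a self-join: rows that share a parcel identifier usually share an address, so one row's address can fill another's null. This is a sketch; the `ParcelID` and `UniqueID` column names are assumptions, not confirmed by the text above:

```sql
-- Fill null PropertyAddress values from another row with the same ParcelID.
-- ParcelID (join key) and UniqueID (to exclude self-matches) are assumed names.
UPDATE a
SET a.PropertyAddress = ISNULL(a.PropertyAddress, b.PropertyAddress)
FROM NashvilleHousingData a
JOIN NashvilleHousingData b
  ON a.ParcelID = b.ParcelID
 AND a.UniqueID <> b.UniqueID
WHERE a.PropertyAddress IS NULL;
```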
Verifying Property Address Update
- Objective: Check if any `PropertyAddress` fields are still null.
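The verification itself is a simple query; an empty result set means the previous update filled every address:

```sql
-- Should return zero rows after the update above succeeds.
SELECT *
FROM NashvilleHousingData
WHERE PropertyAddress IS NULL;
```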
Breaking Down Address into Specific Columns
- Objective: Split the `PropertyAddress` into separate `address` and `city` columns.
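Assuming the address and city are separated by a comma, `CHARINDEX` and `SUBSTRING` can split them. The new column names below (`PropertySplitAddress`, `PropertySplitCity`) are illustrative assumptions:

```sql
-- Add target columns, then split on the first comma.
ALTER TABLE NashvilleHousingData ADD PropertySplitAddress NVARCHAR(255);
ALTER TABLE NashvilleHousingData ADD PropertySplitCity NVARCHAR(255);

UPDATE NashvilleHousingData
SET PropertySplitAddress = SUBSTRING(PropertyAddress, 1, CHARINDEX(',', PropertyAddress) - 1),
    PropertySplitCity    = LTRIM(SUBSTRING(PropertyAddress, CHARINDEX(',', PropertyAddress) + 1, LEN(PropertyAddress)));
```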
Breaking Down Owner Address
- Objective: Split the `OwnerAddress` into `address`, `city`, and `state` columns.
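For a three-part split, T-SQL's `PARSENAME` is a common trick: it splits on periods, so replacing commas with periods first lets it pull out each component. A sketch, with assumed target column names:

```sql
-- PARSENAME counts parts right-to-left: 3 = street, 2 = city, 1 = state.
ALTER TABLE NashvilleHousingData ADD OwnerSplitAddress NVARCHAR(255);
ALTER TABLE NashvilleHousingData ADD OwnerSplitCity NVARCHAR(255);
ALTER TABLE NashvilleHousingData ADD OwnerSplitState NVARCHAR(255);

UPDATE NashvilleHousingData
SET OwnerSplitAddress = PARSENAME(REPLACE(OwnerAddress, ',', '.'), 3),
    OwnerSplitCity    = PARSENAME(REPLACE(OwnerAddress, ',', '.'), 2),
    OwnerSplitState   = PARSENAME(REPLACE(OwnerAddress, ',', '.'), 1);
```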
Standardizing Categorical Data
- Objective: Convert 'Y' and 'N' in the `SoldAsVacant` column to 'Yes' and 'No'.
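A `CASE` expression handles this conversion while leaving values that are already 'Yes' or 'No' untouched:

```sql
-- Normalize the SoldAsVacant column to 'Yes'/'No'.
UPDATE NashvilleHousingData
SET SoldAsVacant = CASE
    WHEN SoldAsVacant = 'Y' THEN 'Yes'
    WHEN SoldAsVacant = 'N' THEN 'No'
    ELSE SoldAsVacant
END;
```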
Creating Save Points
- Objective: Create save points to facilitate rollback in case of errors.
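The transaction and save-point names `datacleaning` and `savepoint` are mentioned later in this document; the surrounding usage here is a sketch:

```sql
BEGIN TRANSACTION datacleaning;
SAVE TRANSACTION savepoint;

-- ... cleaning statements run here ...

-- If a step goes wrong, undo back to the save point:
-- ROLLBACK TRANSACTION savepoint;

-- Once the results are verified, make them permanent:
-- COMMIT TRANSACTION datacleaning;
```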
Identifying Duplicates
- Objective: Identify duplicate records based on specific fields.
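`ROW_NUMBER()` partitioned by the fields that define a duplicate is a standard approach: any row with `row_num > 1` is a repeat. The partition columns below are assumptions about which fields identify a duplicate sale:

```sql
-- Rows with row_num > 1 share all listed fields with an earlier row.
SELECT *,
       ROW_NUMBER() OVER (
           PARTITION BY ParcelID, PropertyAddress, SaleDate, SalePrice
           ORDER BY ParcelID
       ) AS row_num
FROM NashvilleHousingData;
```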
Filtering Duplicates into a Temporary Table
- Objective: Use a Common Table Expression (CTE) to filter out duplicates and store unique records in a temporary table.
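Wrapping the row-numbering query in a CTE lets you keep only the first row of each duplicate group and select it into a temporary table. The temp table name and partition columns are illustrative:

```sql
-- Keep one row per duplicate group in a temp table (#UniqueHousingData is an assumed name).
WITH RowNumCTE AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY ParcelID, PropertyAddress, SaleDate, SalePrice
               ORDER BY ParcelID
           ) AS row_num
    FROM NashvilleHousingData
)
SELECT *
INTO #UniqueHousingData
FROM RowNumCTE
WHERE row_num = 1;
```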
Creating a View for Visualization
- Objective: Create a view of the cleaned data for easier access and reporting.
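A view over the cleaned columns gives reporting tools a stable, tidy interface without duplicating the data. The view and column names here are assumptions for illustration:

```sql
-- Expose only the cleaned, split columns for reporting.
CREATE VIEW CleanedNashvilleHousing AS
SELECT ParcelID,
       PropertySplitAddress,
       PropertySplitCity,
       SaleDate,
       SalePrice,
       SoldAsVacant
FROM NashvilleHousingData;
```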
- Use of Transactions: The script uses transactions (`BEGIN TRANSACTION datacleaning` and `SAVE TRANSACTION savepoint`) to ensure data integrity and provide rollback capability if errors occur during the cleaning process.
- Interim Verification Steps: The script includes several `SELECT` statements to verify the intermediate results of updates and transformations, ensuring that each step of the cleaning process is executed correctly.
- Handling of Missing and Duplicated Data: The script addresses common data issues such as missing values and duplicate records comprehensively, ensuring a clean and reliable dataset.