Nashville Housing Data Cleaning Project Detailed Explanation of Key SQL Techniques

By: Mahmoud Ghanem

Introduction

The Nashville Housing Data Cleaning Project aims to ensure the quality and usability of a dataset containing housing information for Nashville. This document provides a detailed explanation of the key SQL techniques and methods used to clean and preprocess the data, showcasing the benefits of each step in preparing the data for analysis. Objectives

Data Cleaning: Identify and correct inaccuracies and inconsistencies in the dataset to ensure it is accurate and reliable.
Data Transformation: Convert raw data into a structured format suitable for analysis.
Preparation for Analysis: Prepare the data for various analytical and machine learning models for future studies.

Data Overview

The dataset contains housing information for Nashville and is provided in both CSV and Excel formats. The data includes fields such as property addresses, sale dates, prices, and other relevant details crucial for understanding the housing market. Methodology and Code Explanation

Importing Data Imported all records from the NashvilleHousing table to understand the data structure and its contents. Analyzing the data and planning next steps of data cleaning.

SELECT *
FROM `NashvilleHousing`;

Formatting Dates Standardizing data formats Ex: reformats the SaleDate field into a standard date format.

UPDATE NashvilleHousing
SET SaleDate = STR_TO_DATE(SaleDate, '%M %e, %Y');

Populating Missing Addresses Missing data can lead to incomplete analysis, Ex: PropertyAddress, The SQL code below fills in missing addresses by leveraging existing data with the same ParcelID.

-- Identify and fill missing Property Address data
SELECT a.UniqueID, a.ParcelID, a.PropertyAddress, b.UniqueID, b.ParcelID, b.PropertyAddress,
       IF(a.PropertyAddress = '', b.PropertyAddress, a.PropertyAddress) as address
FROM `NashvilleHousing` a
JOIN `NashvilleHousing` b
ON a.ParcelID = b.ParcelID
AND a.UniqueID <> b.UniqueID
WHERE a.PropertyAddress = '';

UPDATE `NashvilleHousing` a
JOIN `NashvilleHousing` b
ON a.ParcelID = b.ParcelID AND a.UniqueID <> b.UniqueID
SET a.PropertyAddress = IF(a.PropertyAddress = '', b.PropertyAddress, a.PropertyAddress)
WHERE a.PropertyAddress = '';

Breaking Down Address Components For detailed analysis, I break down PropertyAddress fields into separate parts such as street address and city.

-- Extracting individual parts of the address (Address, City)
SELECT SUBSTRING(PropertyAddress, 1, LOCATE(',', PropertyAddress) - 1) AS Address,
       SUBSTRING(PropertyAddress, LOCATE(',', PropertyAddress) + 1, LENGTH(PropertyAddress)) AS City
FROM `NashvilleHousing`;

ALTER TABLE `NashvilleHousing`
ADD PropertSplitAddress NVARCHAR(255);

UPDATE `NashvilleHousing`
SET PropertSplitAddress = SUBSTRING(PropertyAddress, 1, LOCATE(',', PropertyAddress) - 1);

Remove Duplicate Removing duplicte data point (Carefully) to avoid inflating the numbers

SELECT *,
		row_number() OVER (partition by 
        ParcelID,
        PropertyAddress,
        SalePrice,
        SaleDate,
        LegalReference
        ORDER BY UniqueID) as row_num
FROM `NashvilleHousing`
ORDER BY ParcelID DESC
)
SELECT COUNT(*)
FROM row_num
WHERE row_num > 1
ORDER BY ParcelID;

DROP TABLE IF EXISTS row_num;
CREATE TEMPORARY TABLE row_num AS
SELECT *,
    ROW_NUMBER() OVER (PARTITION BY 
        ParcelID,
        PropertyAddress,
        SalePrice,
        SaleDate,
        LegalReference
        ORDER BY UniqueID) AS row_num
FROM `NashvilleHousing`
ORDER BY ParcelID DESC;

DELETE t1
FROM `NashvilleHousing` t1
JOIN row_num t2 ON t1.UniqueID = t2.UniqueID
WHERE t2.row_num > 1;

Key Findings and Benefits

The cleaned dataset provides a reliable foundation for further analysis, uncovering patterns and trends in the Nashville housing market. The data cleaning process addressed issues such as missing values, inconsistent formats, and provided a structured format that supports accurate and insightful analysis.

Conclusion

This project sets the foundation for deeper exploration and understanding of the Nashville housing market. By ensuring the data is clean and structured, subsequent analyses will be more accurate and insightful, driving better decision-making in the housing sector.

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
Data		Data
Scripts		Scripts
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Nashville Housing Data Cleaning Project Detailed Explanation of Key SQL Techniques

Introduction

Data Overview

Key Findings and Benefits

Conclusion

About

Uh oh!

Releases

Packages

mmaghanem/SQL_data_cleaning

Folders and files

Latest commit

History

Repository files navigation

Nashville Housing Data Cleaning Project Detailed Explanation of Key SQL Techniques

Introduction

Data Overview

Key Findings and Benefits

Conclusion

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Packages