In this post, we'll dive into the data preprocessing steps carried out on the Nashville Housing dataset. The dataset was downloaded and loaded into Microsoft SQL Server Management Studio (SSMS) for analysis. The goal was to prepare the data for further analysis and modeling by standardizing dates, handling missing values, transforming categorical variables, removing duplicates, and eliminating unused columns.
The first step involved downloading the Nashville Housing dataset and loading it into Microsoft SQL Server Management Studio. This allowed the data to be queried and manipulated efficiently with SQL.
One of the initial challenges in the dataset was dealing with various date formats. To ensure consistency, all date columns were standardized to a common format using SQL queries. This standardized format facilitated subsequent analysis and visualization.
The current format of the SaleDate column is highlighted below: it contains the year, month, day, hour, minute, and second.
The CONVERT function is used to reduce SaleDate to just the year, month, and day. A new column, SaleDateConverted, is added and populated with the converted values.
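The step above can be sketched in T-SQL as follows (the table and column names come from the post; the exact script is an assumption):

```sql
-- Add a new column to hold the date-only values.
ALTER TABLE NashvilleHousing
ADD SaleDateConverted DATE;

-- Converting to the DATE type drops the time portion (hours, minutes, seconds).
UPDATE NashvilleHousing
SET SaleDateConverted = CONVERT(DATE, SaleDate);
```

Adding a new column rather than overwriting SaleDate keeps the original values available in case the conversion needs to be verified.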
Address information is crucial in housing datasets, but the PropertyAddress column had missing data. To address this, missing values in the PropertyAddress column were populated using various data sources and matching techniques. This step ensured that address-related information was available for further analysis.
To begin, we observe that the PropertyAddress column contains missing data. These rows are located by filtering on an IS NULL condition.
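A minimal query for this check might look like the following (a sketch, not the post's exact script):

```sql
-- Rows with a missing PropertyAddress are located with an IS NULL filter.
SELECT ParcelID, PropertyAddress
FROM NashvilleHousing
WHERE PropertyAddress IS NULL;
```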
Subsequently, we populate the missing values in the PropertyAddress column using the address from another row that shares the same ParcelID.
The dataset is then rechecked for missing PropertyAddress values among rows with matching ParcelIDs.
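One common way to implement this fill is a self-join on ParcelID combined with ISNULL; the sketch below assumes the table has a per-row identifier (here called UniqueID) to keep a row from matching itself:

```sql
-- Self-join on ParcelID: rows in alias a with a NULL address borrow the
-- address from another row b that shares the same ParcelID.
UPDATE a
SET PropertyAddress = ISNULL(a.PropertyAddress, b.PropertyAddress)
FROM NashvilleHousing a
JOIN NashvilleHousing b
  ON a.ParcelID = b.ParcelID
 AND a.UniqueID <> b.UniqueID  -- assumed row identifier; adjust to the actual column
WHERE a.PropertyAddress IS NULL;
```

ISNULL returns its second argument only when the first is NULL, so rows that already have an address are left untouched.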
A quick view of the PropertyAddress column is observed.
We employed the SUBSTRING and CHARINDEX functions to extract the address and city from the PropertyAddress column.
New columns were created for the extracted address and city, and then populated using the functions above.
The UpdatedAddress and UpdatedCity columns are highlighted.
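The split can be sketched as follows, assuming PropertyAddress holds values of the form "address, city" with a single comma separating the two parts:

```sql
-- New columns for the extracted parts.
ALTER TABLE NashvilleHousing
ADD UpdatedAddress NVARCHAR(255),
    UpdatedCity    NVARCHAR(255);

-- CHARINDEX finds the comma; SUBSTRING takes everything before it as the
-- address and everything after it as the city.
UPDATE NashvilleHousing
SET UpdatedAddress = SUBSTRING(PropertyAddress, 1,
                               CHARINDEX(',', PropertyAddress) - 1),
    UpdatedCity    = SUBSTRING(PropertyAddress,
                               CHARINDEX(',', PropertyAddress) + 1,
                               LEN(PropertyAddress));
```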
Alternatively, the PARSENAME and REPLACE functions can be used to split the column values instead of SUBSTRING.
The results are shown.
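A sketch of the PARSENAME approach, under the same "address, city" assumption:

```sql
-- PARSENAME splits on periods from right to left, so commas are first
-- replaced with periods; part 2 is then the street address, part 1 the city.
SELECT
    PARSENAME(REPLACE(PropertyAddress, ',', '.'), 2) AS UpdatedAddress,
    PARSENAME(REPLACE(PropertyAddress, ',', '.'), 1) AS UpdatedCity
FROM NashvilleHousing;
```

Note that this trick assumes the address text contains no periods of its own, since any existing period would also become a split point.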
The PropertyAddress column contained complete address information, including city and state. To enhance data organization and querying, the address information was broken down into separate columns for address, city, and state using SQL string manipulation functions.
The SoldAsVacant field contained values "Y" and "N," which were transformed to "Yes" and "No" for better interpretability and consistency. This transformation improved the clarity of the data and its subsequent analysis.
Displaying the first few rows of the SoldAsVacant Column.
We then check for all the unique values contained in the column.
A CASE expression is used to replace all 'Y' and 'N' values with 'Yes' and 'No'.
The UPDATE statement is then used to update the SoldAsVacant column entries.
The column is rechecked for unique values after the update.
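Together, the check and the update can be sketched like this:

```sql
-- Inspect the distinct values and their counts before and after the update.
SELECT SoldAsVacant, COUNT(*) AS cnt
FROM NashvilleHousing
GROUP BY SoldAsVacant;

-- CASE maps the single-letter codes; any existing 'Yes'/'No' values pass through.
UPDATE NashvilleHousing
SET SoldAsVacant = CASE
                       WHEN SoldAsVacant = 'Y' THEN 'Yes'
                       WHEN SoldAsVacant = 'N' THEN 'No'
                       ELSE SoldAsVacant
                   END;
```

The ELSE branch is what makes the statement safe to run more than once: rows already holding 'Yes' or 'No' are simply written back unchanged.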
Step 6: Removing Duplicates
Duplicate records can distort analysis results and lead to biased insights. Therefore, duplicate entries were identified and removed from the dataset using SQL queries, ensuring that each observation was unique.
A Common Table Expression (CTE) is employed to group rows on their identifying columns so that recurring values can be flagged.
104 rows contain duplicate values.
These 104 duplicate rows are then removed from the dataset.
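A common pattern for this step pairs a CTE with ROW_NUMBER; the sketch below assumes a particular set of columns identifies a sale, so adjust the PARTITION BY list to the dataset's actual schema:

```sql
-- ROW_NUMBER assigns 1 to the first row in each group of identifying
-- columns; any row numbered greater than 1 is a duplicate.
WITH RowNumCTE AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY ParcelID, PropertyAddress, SaleDate, SalePrice  -- assumed identifying columns
               ORDER BY ParcelID
           ) AS row_num
    FROM NashvilleHousing
)
DELETE FROM RowNumCTE
WHERE row_num > 1;
```

Running the same CTE with a SELECT instead of the DELETE first is a useful way to confirm the duplicate count before removing anything.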
Some columns in the dataset were not relevant to the analysis goals or contained redundant information. These unused columns were identified and subsequently deleted to streamline the dataset and reduce unnecessary complexity.
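For example, once SaleDateConverted, UpdatedAddress, and UpdatedCity are in place, the raw columns they supersede could be dropped; exactly which columns to remove depends on the analysis goals, so the list below is illustrative:

```sql
-- Drop the original columns that have been replaced by cleaned versions.
ALTER TABLE NashvilleHousing
DROP COLUMN SaleDate, PropertyAddress;
```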
In this post, we explored the critical data preprocessing steps carried out on the Nashville Housing dataset. By loading the dataset into Microsoft SQL Server Management Studio, standardizing dates, populating missing address information, transforming categorical variables, removing duplicates, and eliminating unused columns, we prepared the data for further analysis and modeling. These preprocessing steps are essential to ensure the accuracy, reliability, and effectiveness of subsequent data analysis and machine learning work.