kuntathegreat/PortfolioProjects

Data Preprocessing for Nashville Housing Dataset

Introduction:

In this post, we'll walk through the data preprocessing steps carried out on the Nashville Housing dataset. The dataset was downloaded and loaded into Microsoft SQL Server, then queried through SQL Server Management Studio (SSMS). The goal was to prepare the data for further analysis and modeling by standardizing dates, handling missing values, transforming categorical variables, removing duplicates, and eliminating unused columns.

Step 1: Loading the Data into SQL Server

The first step involved downloading the Nashville Housing dataset and importing it into Microsoft SQL Server. SQL Server Management Studio (SSMS) was then used to query and manipulate the data efficiently with SQL.

Step 2: Standardizing Dates

One of the initial challenges in the dataset was dealing with various date formats. To ensure consistency, all date columns were standardized to a common format using SQL queries. This standardized format facilitated subsequent analysis and visualization.

The current format of the SaleDate column is highlighted below: it contains the year, month, day, hour, minute, and second.

[screenshot]

The CONVERT function is used to transform SaleDate so that it contains only the year, month, and day.

[screenshot]

A new column, SaleDateConverted, is added and populated using the conversion above.

[screenshot]
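The queries themselves appear only as screenshots in the repository; a minimal sketch of the kind of T-SQL involved, assuming the table is named NashvilleHousing, might look like:

```sql
-- Sketch: add a date-only column and populate it
-- (table name NashvilleHousing is an assumption).
ALTER TABLE NashvilleHousing
ADD SaleDateConverted DATE;

UPDATE NashvilleHousing
SET SaleDateConverted = CONVERT(DATE, SaleDate);  -- drops the time portion
```

Converting to the DATE type discards the hour, minute, and second components in one step, rather than formatting the value as a string.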

Step 3: Handling Missing Data

Address information is crucial in housing datasets, but the PropertyAddress column had missing data. To address this, missing values in the PropertyAddress column were populated by matching records that share the same ParcelID. This step ensured that address-related information was available for further analysis.

[screenshot]

To begin, we observe that the PropertyAddress column contains missing data. Rows where the value IS NULL are identified first.

[screenshot]

Subsequently, we populate the missing values in the PropertyAddress column using the address associated with the same ParcelID in another row.

[screenshots]
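The screenshots show the check and the update; a sketch of the typical self-join approach, assuming the table is named NashvilleHousing and has a UniqueID row identifier, might look like:

```sql
-- Find rows with a missing address.
SELECT ParcelID, PropertyAddress
FROM NashvilleHousing
WHERE PropertyAddress IS NULL;

-- Self-join on ParcelID and copy the address from a matching row.
-- UniqueID is assumed to be the table's row identifier.
UPDATE a
SET PropertyAddress = ISNULL(a.PropertyAddress, b.PropertyAddress)
FROM NashvilleHousing a
JOIN NashvilleHousing b
  ON  a.ParcelID = b.ParcelID
  AND a.UniqueID <> b.UniqueID
WHERE a.PropertyAddress IS NULL;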

The dataset is then rechecked for missing values in PropertyAddress among rows with matching ParcelID values.

[screenshot]

Step 4: Breaking Out Address Information

A quick view of the PropertyAddress column is shown. [screenshot]

We employed the SUBSTRING and CHARINDEX functions to extract the address and city from the PropertyAddress column.

[screenshot]

New columns, UpdatedAddress and UpdatedCity, were created and populated with the extracted values using the functions above.

[screenshot]

The UpdatedAddress and UpdatedCity columns are highlighted.

[screenshot]
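A sketch of the split, assuming the table is named NashvilleHousing and PropertyAddress holds values in "address, city" form, might look like:

```sql
-- Split "address, city" at the comma.
ALTER TABLE NashvilleHousing
ADD UpdatedAddress NVARCHAR(255),
    UpdatedCity    NVARCHAR(255);

UPDATE NashvilleHousing
SET UpdatedAddress = SUBSTRING(PropertyAddress, 1,
                               CHARINDEX(',', PropertyAddress) - 1),
    UpdatedCity    = SUBSTRING(PropertyAddress,
                               CHARINDEX(',', PropertyAddress) + 1,
                               LEN(PropertyAddress));
```

CHARINDEX locates the comma, and SUBSTRING takes everything before it as the address and everything after it as the city.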

Alternatively, the PARSENAME and REPLACE functions can be used to split the values in a column instead of SUBSTRING. PARSENAME splits on periods, so REPLACE is first used to turn the commas into periods.

[screenshot]

The results are shown below.

[screenshots]
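A sketch of the PARSENAME alternative (table name and output column names are illustrative) might look like:

```sql
-- PARSENAME splits on periods from right to left, so commas are
-- first replaced with periods. Part 2 is the left piece, part 1
-- the right piece of an "address, city" value.
SELECT
    PARSENAME(REPLACE(PropertyAddress, ',', '.'), 2) AS SplitAddress,
    PARSENAME(REPLACE(PropertyAddress, ',', '.'), 1) AS SplitCity
FROM NashvilleHousing;
```

This avoids the index arithmetic of the SUBSTRING/CHARINDEX approach, at the cost of the comma-to-period substitution.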

Breaking the full address information down into separate columns for address, city, and state with these string manipulation functions improves data organization and makes querying easier.

Step 5: Handling Categorical Data

The SoldAsVacant field contained values "Y" and "N," which were transformed to "Yes" and "No" for better interpretability and consistency. This transformation improved the clarity of the data and its subsequent analysis.

Displaying the first few rows of the SoldAsVacant Column.

[screenshot]

We then check all the unique values contained in the column.

[screenshot]

A CASE expression is used to replace all the 'Y' and 'N' values with 'Yes' and 'No'.

[screenshot]

An UPDATE statement is then used to rewrite the SoldAsVacant column entries.

[screenshot]

The column is rechecked for unique values after the update.

[screenshot]
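A sketch of the CASE-based update, assuming the table is named NashvilleHousing, might look like:

```sql
-- Normalize the flag in place; values already spelled out pass through.
UPDATE NashvilleHousing
SET SoldAsVacant = CASE
                       WHEN SoldAsVacant = 'Y' THEN 'Yes'
                       WHEN SoldAsVacant = 'N' THEN 'No'
                       ELSE SoldAsVacant
                   END;
```

The ELSE branch leaves rows that already contain 'Yes' or 'No' untouched, so the statement is safe to rerun.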

Step 6: Removing Duplicates

Duplicate records can distort analysis results and lead to biased insights. Therefore, duplicate entries were identified and removed from the dataset using SQL queries, ensuring that each observation was unique.

A Common Table Expression (CTE) is employed to group rows on the columns that should uniquely identify a record, so that recurring values can be flagged.

[screenshot]

104 rows are found to be duplicates.

[screenshot]

These 104 duplicate rows are then removed from the dataset.

[screenshot]
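A common way to implement this is a CTE over ROW_NUMBER(); the exact partition columns used in the screenshots are not visible, so the list below (and the UniqueID ordering column) is an assumption:

```sql
-- Number rows that share the same identifying values; any row with
-- row_num > 1 is a duplicate. The PARTITION BY column list and
-- UniqueID are assumptions.
WITH RowNumCTE AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY ParcelID, PropertyAddress, SaleDate, SalePrice
               ORDER BY UniqueID
           ) AS row_num
    FROM NashvilleHousing
)
DELETE FROM RowNumCTE
WHERE row_num > 1;
```

In T-SQL, deleting from the CTE removes the underlying duplicate rows while keeping the first copy in each partition.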

Step 7: Eliminating Unused Columns

Some columns in the dataset were not relevant to the analysis goals or contained redundant information. These unused columns were identified and subsequently deleted to streamline the dataset and reduce unnecessary complexity.

[screenshot]
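A sketch of the cleanup, with illustrative column names (the originals that were superseded by the converted and split columns), might look like:

```sql
-- Drop columns that are no longer needed now that cleaned
-- replacements exist (column names are illustrative).
ALTER TABLE NashvilleHousing
DROP COLUMN PropertyAddress, SaleDate;
```

Dropping superseded columns keeps the table lean and avoids confusion between raw and cleaned versions of the same field.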

Conclusion:

In this post, we explored the critical data preprocessing steps carried out on the Nashville Housing dataset. By loading the dataset into Microsoft SQL Server, standardizing dates, populating missing address information, transforming categorical variables, removing duplicates, and eliminating unused columns, we successfully prepared the data for further analysis and modeling. These preprocessing steps are essential to ensure the accuracy, reliability, and effectiveness of subsequent data analysis and machine learning work.
