In this post, we'll dive into the data preprocessing steps carried out on the Nashville Housing dataset. The dataset was downloaded and loaded into Microsoft SQL Server Management Studio (SSMS) for analysis. The goal was to prepare the data for further analysis and modeling by standardizing dates, handling missing values, transforming categorical variables, removing duplicates, and eliminating unused columns.
The first step involved downloading the Nashville Housing dataset and loading it into Microsoft SQL Server Management Studio. This allowed the data to be queried and manipulated efficiently with SQL.
One of the initial challenges in the dataset was dealing with various date formats. To ensure consistency, all date columns were standardized to a common format using SQL queries. This standardized format facilitated subsequent analysis and visualization.
The current format of the SaleDate column is highlighted below: it contains the year, month, day, hour, minute, and second.
The CONVERT function is used to reduce SaleDate to just the year, month, and day. A new column, SaleDateConverted, is added and populated with the converted values.
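The step above can be sketched in T-SQL as follows (the table and column names come from the post; the exact script is an assumption):

```sql
-- Add a new column to hold the date-only values.
ALTER TABLE NashvilleHousing
ADD SaleDateConverted DATE;

-- Converting to the DATE type drops the time portion (hours, minutes, seconds).
UPDATE NashvilleHousing
SET SaleDateConverted = CONVERT(DATE, SaleDate);
```

Adding a new column rather than overwriting SaleDate keeps the original values available in case the conversion needs to be verified.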
Address information is crucial in housing datasets, but the PropertyAddress column had missing data. To address this, missing values in the PropertyAddress column were populated using various data sources and matching techniques. This step ensured that address-related information was available for further analysis.
To begin, we observe that the PropertyAddress column contains missing data. These rows are located by filtering on an IS NULL condition.
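A minimal query for this check might look like the following (a sketch, not the post's exact script):

```sql
-- Rows with a missing PropertyAddress are located with an IS NULL filter.
SELECT ParcelID, PropertyAddress
FROM NashvilleHousing
WHERE PropertyAddress IS NULL;
```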
Subsequently, we populate the missing values in the PropertyAddress column using the address from another row that shares the same ParcelID.
The dataset is then rechecked for missing PropertyAddress values among rows with matching ParcelIDs.
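One common way to implement this fill is a self-join on ParcelID combined with ISNULL; the sketch below assumes the table has a per-row identifier (here called UniqueID) to keep a row from matching itself:

```sql
-- Self-join on ParcelID: rows in alias a with a NULL address borrow the
-- address from another row b that shares the same ParcelID.
UPDATE a
SET PropertyAddress = ISNULL(a.PropertyAddress, b.PropertyAddress)
FROM NashvilleHousing a
JOIN NashvilleHousing b
  ON a.ParcelID = b.ParcelID
 AND a.UniqueID <> b.UniqueID  -- assumed row identifier; adjust to the actual column
WHERE a.PropertyAddress IS NULL;
```

ISNULL returns its second argument only when the first is NULL, so rows that already have an address are left untouched.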
A quick view of the PropertyAddress column is observed.
We employed the SUBSTRING and CHARINDEX functions to extract the address and city from the PropertyAddress column.
New columns were created for the extracted address and city, and then populated using the functions above.
The UpdatedAddress and UpdatedCity columns are highlighted.
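The split can be sketched as follows, assuming PropertyAddress holds values of the form "address, city" with a single comma separating the two parts:

```sql
-- New columns for the extracted parts.
ALTER TABLE NashvilleHousing
ADD UpdatedAddress NVARCHAR(255),
    UpdatedCity    NVARCHAR(255);

-- CHARINDEX finds the comma; SUBSTRING takes everything before it as the
-- address and everything after it as the city.
UPDATE NashvilleHousing
SET UpdatedAddress = SUBSTRING(PropertyAddress, 1,
                               CHARINDEX(',', PropertyAddress) - 1),
    UpdatedCity    = SUBSTRING(PropertyAddress,
                               CHARINDEX(',', PropertyAddress) + 1,
                               LEN(PropertyAddress));
```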
Alternatively, the PARSENAME and REPLACE functions can be used to split the column values instead of SUBSTRING.
The results are shown.
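A sketch of the PARSENAME approach, under the same "address, city" assumption:

```sql
-- PARSENAME splits on periods from right to left, so commas are first
-- replaced with periods; part 2 is then the street address, part 1 the city.
SELECT
    PARSENAME(REPLACE(PropertyAddress, ',', '.'), 2) AS UpdatedAddress,
    PARSENAME(REPLACE(PropertyAddress, ',', '.'), 1) AS UpdatedCity
FROM NashvilleHousing;
```

Note that this trick assumes the address text contains no periods of its own, since any existing period would also become a split point.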
The PropertyAddress column contained complete address information, including city and state. To enhance data organization and querying, the address information was broken down into separate columns for address, city, and state using SQL string manipulation functions.
The SoldAsVacant field contained values "Y" and "N," which were transformed to "Yes" and "No" for better interpretability and consistency. This transformation improved the clarity of the data and its subsequent analysis.
Displaying the first few rows of the SoldAsVacant Column.
We then check for all the unique values contained in the column.
A CASE expression is used to replace all 'Y' and 'N' values with 'Yes' and 'No'.
The UPDATE statement is then used to update the SoldAsVacant column entries.
The column is rechecked for unique values after the update.
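Together, the check and the update can be sketched like this:

```sql
-- Inspect the distinct values and their counts before and after the update.
SELECT SoldAsVacant, COUNT(*) AS cnt
FROM NashvilleHousing
GROUP BY SoldAsVacant;

-- CASE maps the single-letter codes; any existing 'Yes'/'No' values pass through.
UPDATE NashvilleHousing
SET SoldAsVacant = CASE
                       WHEN SoldAsVacant = 'Y' THEN 'Yes'
                       WHEN SoldAsVacant = 'N' THEN 'No'
                       ELSE SoldAsVacant
                   END;
```

The ELSE branch is what makes the statement safe to run more than once: rows already holding 'Yes' or 'No' are simply written back unchanged.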
Step 6: Removing Duplicates
Duplicate records can distort analysis results and lead to biased insights. Therefore, duplicate entries were identified and removed from the dataset using SQL queries, ensuring that each observation was unique.
A Common Table Expression (CTE) is employed to group rows on their identifying columns so that recurring values can be flagged.
104 rows contain duplicate values.
These 104 duplicate rows are then removed from the dataset.
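A common pattern for this step pairs a CTE with ROW_NUMBER; the sketch below assumes a particular set of columns identifies a sale, so adjust the PARTITION BY list to the dataset's actual schema:

```sql
-- ROW_NUMBER assigns 1 to the first row in each group of identifying
-- columns; any row numbered greater than 1 is a duplicate.
WITH RowNumCTE AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY ParcelID, PropertyAddress, SaleDate, SalePrice  -- assumed identifying columns
               ORDER BY ParcelID
           ) AS row_num
    FROM NashvilleHousing
)
DELETE FROM RowNumCTE
WHERE row_num > 1;
```

Running the same CTE with a SELECT instead of the DELETE first is a useful way to confirm the duplicate count before removing anything.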
Some columns in the dataset were not relevant to the analysis goals or contained redundant information. These unused columns were identified and subsequently deleted to streamline the dataset and reduce unnecessary complexity.
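For example, once SaleDateConverted, UpdatedAddress, and UpdatedCity are in place, the raw columns they supersede could be dropped; exactly which columns to remove depends on the analysis goals, so the list below is illustrative:

```sql
-- Drop the original columns that have been replaced by cleaned versions.
ALTER TABLE NashvilleHousing
DROP COLUMN SaleDate, PropertyAddress;
```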
In this post, we explored the critical data preprocessing steps carried out on the Nashville Housing dataset. By loading the dataset into Microsoft SQL Server Management Studio, standardizing dates, populating missing address information, transforming categorical variables, removing duplicates, and eliminating unused columns, we prepared the data for further analysis and modeling. These preprocessing steps are essential to ensure the accuracy, reliability, and effectiveness of subsequent data analysis and machine learning work.