This repository contains a set of SQL scripts designed to perform data cleaning and analysis on layoff-related data. The project aims to process raw data by cleaning it, handling duplicates, standardizing values, and dealing with missing data. The analysis part explores the cleaned data, generating insights such as the total number of layoffs by company, industry, and country, along with other key metrics.
DATA_CLEANING.sql: Contains SQL scripts for cleaning and preparing the data.DATA_ANALYSIS.sql: Contains SQL scripts for performing exploratory data analysis (EDA) and generating insights.layoffs.csv: Contains RAW data .
- Database: MySQL
- Input Data: The project works with a table named
layoffs, which contain raw layoff data.
The cleaning process involves several key steps to ensure that the data is ready for analysis:
We begin by identifying and removing duplicate records. Duplicates are identified based on several columns such as company, location, industry, and other relevant fields. Once identified, duplicate rows are deleted to ensure unique data.
Next, we standardize the data to ensure consistency across the dataset:
- Whitespace is trimmed from text fields such as company, location, and industry.
- Certain columns, like industry, are corrected for known inconsistencies (e.g., grouping similar values like "Crypto" into a single category).
- The
datecolumn is converted into a proper date format to facilitate time-based analysis.
Rows with missing or blank values in key columns are identified. If appropriate, missing values are filled or replaced by information from other records. In cases where a value cannot be determined, rows with crucial missing data are deleted to maintain data integrity.
Any rows that contain irrelevant or non-essential data (e.g., rows with no layoffs recorded) are removed. Additionally, columns that are not necessary for analysis are dropped to streamline the dataset.
Once the data is cleaned, we proceed with exploratory data analysis (EDA) to generate meaningful insights. The analysis includes:
We calculate the maximum number of layoffs, the highest percentage of layoffs, and the maximum amount of funds raised across all records. This gives a quick overview of the most significant layoff events in the dataset.
The total number of layoffs per month is calculated. Additionally, we calculate a rolling total of layoffs over time, which helps visualize trends in layoff occurrences across different months.
We aggregate the data by company to determine which companies have had the highest number of layoffs. This helps to identify trends or patterns specific to certain companies.
The dataset is analyzed by industry and country to understand the distribution of layoffs across different sectors and geographical locations.
The data is grouped by year to analyze the overall trend of layoffs over time. This provides insights into how layoffs have evolved from year to year.
- Set Up Your Database: Ensure that you have a MySQL database set up.
- Import the Raw Data: Import the
layoffstable containing the raw layoff data. - Execute Data Cleaning: Run the
DATA_CLEANING.sqlscript to clean and prepare the data. - Execute Data Analysis: Once the data is cleaned, run the
DATA_ANALYSIS.sqlscript to analyze the data and generate insights.