# MYSQL DATA CLEANING - Layoffs Dataset

## Introduction

This first project covers the cleaning of a dataset downloaded directly from kaggle.com.
The dataset records layoffs at tech companies from the start of the COVID-19 pandemic through 2022.

Dataset link: https://www.kaggle.com/datasets/swaptr/layoffs-2022/data

#### Data Import

The first step is to import the dataset into MySQL.
We create a database called "world_layoffs" with a table "layoffs". Right-clicking the table opens a context menu that includes the "Table Data Import Wizard" option, where we can select our file in CSV format.
In this wizard, we could change the column formats, but we will instead modify them directly in MySQL using queries.
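
As a minimal sketch of the SQL side of this step, the statements below create the database and then inspect what the wizard imported. The names `world_layoffs` and `layoffs` come from the text above; the column types that `DESCRIBE` reports depend on the choices made in the wizard, so they are not assumed here.

```sql
# Create the database if it does not exist yet, then make it the default schema
CREATE DATABASE IF NOT EXISTS world_layoffs;
USE world_layoffs;

# After running the Table Data Import Wizard, inspect the imported columns and their types
DESCRIBE layoffs;

# Count the imported rows to compare against the number of data rows in the CSV file
SELECT COUNT(*) AS imported_rows
FROM layoffs;
```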

#### First display of data

In [None]:
SELECT * 
FROM layoffs;

#### Next steps

For this data cleaning project, we follow these steps:
- Create a copy of the layoffs table, called "layoffs_test", as a backup in case errors are made.
- Find and remove duplicates.
- Standardize the data so it can be used efficiently and without errors.
- Delete useless rows and columns.

#### Creating a backup copy

In [None]:
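SELECT *
FROM layoffs;

#Create an empty table with the same structure as the original
CREATE TABLE world_layoffs.layoffs_test
LIKE world_layoffs.layoffs;

#Copy every row of the original into the new table
INSERT INTO world_layoffs.layoffs_test
SELECT * FROM world_layoffs.layoffs;

SELECT * 
FROM layoffs_test;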
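
One quick way to confirm that the copy is complete is to compare the row counts of the two tables. This is only a sanity check, sketched here with illustrative aliases (`original_rows`, `copied_rows`); the two numbers should be equal.

```sql
# Compare the number of rows in the original table and in its copy
SELECT
    (SELECT COUNT(*) FROM world_layoffs.layoffs) AS original_rows,
    (SELECT COUNT(*) FROM world_layoffs.layoffs_test) AS copied_rows;
```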

### Duplicates

In [None]:
#Searching for duplicates

SELECT *, COUNT(*) AS duplicate_count
FROM world_layoffs.layoffs_test
GROUP BY company, location, industry, total_laid_off, percentage_laid_off, `date`, stage, country, funds_raised
HAVING COUNT(*) > 1;

#This query groups rows whose columns are all identical and counts how many times each one appears.
#In our dataset, we found 2 duplicated rows: one for the company Cazoo and one for Beyond Meat.

#Reminder: a row is considered a duplicate only if EVERY column is identical.
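
The same search can also be written with a window function, which lists the surplus copies themselves rather than one summary line per group. This is an alternative sketch, not the query used above, and it assumes MySQL 8.0 or later, where `ROW_NUMBER()` is available. The alias `row_num` is illustrative.

```sql
# Number the rows inside each group of identical rows;
# any row numbered 2 or higher is a surplus copy
SELECT *
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY company, location, industry, total_laid_off,
                            percentage_laid_off, `date`, stage, country, funds_raised
           ) AS row_num
    FROM world_layoffs.layoffs_test
) AS numbered
WHERE row_num > 1;
```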

In [None]:
#To find our way around easily, we add an "id" column that auto-increments in ascending order.

ALTER TABLE world_layoffs.layoffs_test
ADD COLUMN id INT AUTO_INCREMENT PRIMARY KEY;

In [None]:
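#MySQL does not allow a DELETE to select from the table it is deleting from in a subquery (error 1093),
#so we first store the ids to remove in a helper table.
#For each group of duplicates we store the smallest id, then delete the rows with those ids.
#This removes one copy per group, which is enough here because each duplicated row appears exactly twice.

CREATE TABLE temp_ids (id INT);
INSERT INTO temp_ids (id)
SELECT MIN(id)
FROM world_layoffs.layoffs_test
GROUP BY company, location, industry, total_laid_off, percentage_laid_off, `date`, stage, country, funds_raised
HAVING COUNT(*) > 1;

DELETE FROM world_layoffs.layoffs_test
WHERE id IN (SELECT id FROM temp_ids);

#The helper table is no longer needed
DROP TABLE temp_ids;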
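
If a row could appear three or more times, deleting a single id per group would leave duplicates behind. The sketch below is an alternative that keeps the row with the smallest `id` in each group and deletes every other copy. It assumes MySQL 8.0 or later and relies on the `id` column added above; the aliases `t`, `ranked` and `row_num` are illustrative. If the previous cell has already been run, this statement finds nothing left to delete.

```sql
# Rank the rows of each duplicate group by id, then delete every row ranked 2 or higher.
# The ranked subquery uses a window function, so MySQL evaluates it as a separate
# derived table before the delete is applied.
DELETE t
FROM world_layoffs.layoffs_test AS t
JOIN (
    SELECT id,
           ROW_NUMBER() OVER (
               PARTITION BY company, location, industry, total_laid_off,
                            percentage_laid_off, `date`, stage, country, funds_raised
               ORDER BY id
           ) AS row_num
    FROM world_layoffs.layoffs_test
) AS ranked ON ranked.id = t.id
WHERE ranked.row_num > 1;
```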

In [None]:
#Verification of companies "Cazoo" and "Beyond Meat"

SELECT * 
FROM layoffs_test
WHERE company = "Cazoo";

SELECT * 
FROM layoffs_test
WHERE company = "Beyond Meat"; 

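#Each of these companies had one duplicate; after verification, no duplicates remain.
#Note: the result grid also showed a row made entirely of NULL values. DELETE removes whole rows and never blanks them out,
#so this is most likely the empty placeholder row that MySQL Workbench displays at the bottom of an editable result grid, not real data.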

In [None]:
#We then drop the id column, which was only needed as a reference for deleting duplicates.
ALTER TABLE world_layoffs.layoffs_test
DROP COLUMN id;


In [None]:
#Every real record in the dataset has a company name, so a row in which every column is NULL carries no data.
#As a precaution, we delete any such row.
DELETE FROM world_layoffs.layoffs_test
WHERE company IS NULL 
  AND location IS NULL 
  AND industry IS NULL 
  AND total_laid_off IS NULL 
  AND percentage_laid_off IS NULL 
  AND `date` IS NULL 
  AND stage IS NULL 
  AND country IS NULL 
  AND funds_raised IS NULL;

#Verification of companies "Cazoo" and "Beyond Meat" again: no all-NULL row is displayed anymore.

SELECT * 
FROM layoffs_test
WHERE company = "Cazoo";

SELECT * 
FROM layoffs_test
WHERE company = "Beyond Meat";

#### Standardization

In [None]:
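#We create another copy of layoffs_test, called layoffs_test2, this time for standardization.
CREATE TABLE world_layoffs.layoffs_test2
LIKE world_layoffs.layoffs_test;

INSERT INTO world_layoffs.layoffs_test2
SELECT * FROM world_layoffs.layoffs_test;

SELECT *
FROM world_layoffs.layoffs_test2;

#With the TRIM() function, we remove the invisible spaces before and after the company name.
UPDATE world_layoffs.layoffs_test2
SET company = TRIM(company);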

# Verifying company: nothing to report
SELECT DISTINCT company
FROM world_layoffs.layoffs_test2
ORDER BY company;

# Verifying industry: one row has nothing assigned and another one contains a URL
SELECT DISTINCT industry
FROM world_layoffs.layoffs_test2
ORDER BY industry;

# Replacing any industry value that starts with https: (a URL) with 'N/A'

UPDATE world_layoffs.layoffs_test2
SET industry = 'N/A'
WHERE industry LIKE 'https:%';

# Verifying eBay, the company whose industry contained the URL: it now shows N/A
SELECT *
FROM world_layoffs.layoffs_test2
WHERE company LIKE 'ebay';

# Verifying location: nothing to report. Some characters display differently, but nothing important.
# Examples: Fayetteville, Düsseldorf

SELECT DISTINCT location
FROM world_layoffs.layoffs_test2
ORDER BY location;

# Verifying country: nothing to report
SELECT DISTINCT country
FROM world_layoffs.layoffs_test2
ORDER BY country;
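
Two read-only checks can back up the standardization above. The first looks for company names that still carry leading or trailing spaces and should return no rows once the TRIM() update has run. The second lists the rows whose industry is missing, which the industry check above reported but did not treat. Both are sketches that only read from `layoffs_test2`; how to fill a missing industry is left open.

```sql
# Company names that still have leading or trailing spaces.
# Lengths are compared because some collations ignore trailing spaces in string comparisons.
SELECT company
FROM world_layoffs.layoffs_test2
WHERE CHAR_LENGTH(company) <> CHAR_LENGTH(TRIM(company));

# Rows whose industry is missing, either NULL or blank
SELECT company, location, industry
FROM world_layoffs.layoffs_test2
WHERE industry IS NULL OR TRIM(industry) = '';
```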



#### Useless blank and null rows and columns

In [None]:
SELECT *
FROM layoffs_test2;

#Missing values appear in total_laid_off and percentage_laid_off, either as blank strings or as NULL.
#We replace the blank strings in these columns with NULL so that every missing value is represented the same way.

UPDATE world_layoffs.layoffs_test2
SET total_laid_off = NULL
WHERE total_laid_off = '';

UPDATE world_layoffs.layoffs_test2
SET percentage_laid_off = NULL
WHERE percentage_laid_off = '';

#Since we are analyzing layoffs, rows with no data in both total_laid_off and percentage_laid_off are useless.
#We can delete every row where both columns are NULL.

DELETE FROM world_layoffs.layoffs_test2
WHERE total_laid_off IS NULL
AND percentage_laid_off IS NULL;

SELECT * 
FROM world_layoffs.layoffs_test2;
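
A short summary query can confirm the result of this last step. It is a sketch with illustrative aliases: it counts the remaining rows and how many of them still lack one of the two values. After the deletion above, `missing_both` should be 0.

```sql
# In MySQL a comparison evaluates to 1 or 0, so SUM() counts the rows where the condition is true
SELECT
    COUNT(*) AS total_rows,
    SUM(total_laid_off IS NULL) AS missing_total,
    SUM(percentage_laid_off IS NULL) AS missing_percentage,
    SUM(total_laid_off IS NULL AND percentage_laid_off IS NULL) AS missing_both
FROM world_layoffs.layoffs_test2;
```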