*  DSC 540-T302 Data Preparation
*  Term Project: Milestone 2
*  Peter Lozano

# Cleaning/Preparing Flat File Source

## Load packages

In [1]:
import pandas as pd

## Load flat file data into a DataFrame

In [2]:
flat_file_df = pd.read_csv("layoffs.csv")
flat_file_df.head()

Unnamed: 0,company,location,total_laid_off,date,percentage_laid_off,industry,source,stage,funds_raised,country,date_added
0,PowerSchool,Sacramento,,12/12/2025,,Education,https://www.bizjournals.com/sacramento/news/20...,Acquired,31.0,United States,12/15/2025
1,Believer Meats,"Jerusalem, Non-U.S.",,12/11/2025,,Food,https://www.calcalistech.com/ctechnews/article...,Series B,389.0,Israel,12/12/2025
2,EyeEm,"Berlin, Non-U.S.",,12/11/2025,1.0,Other,https://petapixel.com/2025/12/11/eyeem-is-shut...,Acquired,24.0,Germany,12/12/2025
3,Lusha,Boston,24.0,12/10/2025,0.08,Sales,https://www.calcalistech.com/ctechnews/article...,Series B,245.0,United States,12/12/2025
4,Tenstorrent,"Toronto, Non-U.S.",80.0,12/9/2025,0.07,Hardware,https://www.eetimes.com/layoffs-at-tenstorrent...,Series D,1000.0,Canada,12/12/2025


Now that I can see what the data looks like, I will check the data types and look for null values.

In [3]:
flat_file_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4248 entries, 0 to 4247
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   company              4248 non-null   object 
 1   location             4247 non-null   object 
 2   total_laid_off       2778 non-null   float64
 3   date                 4248 non-null   object 
 4   percentage_laid_off  2686 non-null   float64
 5   industry             4246 non-null   object 
 6   source               4245 non-null   object 
 7   stage                4243 non-null   object 
 8   funds_raised         3766 non-null   float64
 9   country              4246 non-null   object 
 10  date_added           4248 non-null   object 
dtypes: float64(3), object(8)
memory usage: 365.2+ KB


There are some null values in the columns with data type 'float64'. This may require some cleaning later on. I will keep this in mind.

Also, some records are from outside the USA, which is out of scope for my project. I will need to filter these out.
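As a complement to `info()`, the null counts and the country mix can be read off directly. The sketch below uses a small made-up DataFrame (`sample` and its values are illustrative stand-ins, not the project data), so treat it as one possible way to run this check rather than the method used in this notebook.

```python
import pandas as pd

# Toy stand-in for layoffs.csv; the values are made up for illustration.
sample = pd.DataFrame({
    "company": ["A", "B", "C", "D"],
    "total_laid_off": [24.0, None, 80.0, None],
    "funds_raised": [245.0, 31.0, None, 90.0],
    "country": ["United States", "Israel", "Canada", "United States"],
})

# Number of nulls per column, largest first
null_counts = sample.isna().sum().sort_values(ascending=False)
print(null_counts)

# Share of rows per country, counting missing countries as their own group
country_share = sample["country"].value_counts(normalize=True, dropna=False)
print(country_share)
```

Running the same two calls on `flat_file_df` would show how many rows the later filters are likely to remove.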

# Transformations

## Step 1: Filter by Country (USA only)

In [4]:
flat_file_usa = flat_file_df[flat_file_df['country'] == 'United States']
flat_file_usa.head()

Unnamed: 0,company,location,total_laid_off,date,percentage_laid_off,industry,source,stage,funds_raised,country,date_added
0,PowerSchool,Sacramento,,12/12/2025,,Education,https://www.bizjournals.com/sacramento/news/20...,Acquired,31.0,United States,12/15/2025
3,Lusha,Boston,24.0,12/10/2025,0.08,Sales,https://www.calcalistech.com/ctechnews/article...,Series B,245.0,United States,12/12/2025
5,VSCO,SF Bay Area,24.0,12/9/2025,,Consumer,https://techcrunch.com/2025/12/09/vsco-lays-of...,Series B,90.0,United States,12/10/2025
8,Payoneer,New York City,60.0,12/8/2025,0.06,Finance,https://www.calcalistech.com/ctechnews/article...,Post-IPO,,United States,12/8/2025
9,Teads,New York City,,12/4/2025,,Marketing,https://www.businessinsider.com/layoffs-at-adt...,Unknown,123.0,United States,12/10/2025


Now that I have filtered the data to only include records from the USA, I need to focus on the type of companies I'm working with.

## Step 2: Filter by Company Type (Post-IPO only)

Since my project is focused on publicly traded companies, I will use the `stage` column to filter out any companies that are private or subsidiaries, keeping only records marked 'Post-IPO'.

In [5]:
flat_file_public = flat_file_usa[flat_file_usa['stage'] == 'Post-IPO']

## Step 3: Date Formatting

Looking back at the columns and data types, I see that both `date` and `date_added` are of type 'object'. I will convert these to datetime format for easier analysis later on.

In [12]:
# Date formatting avoiding SettingWithCopyWarning
flat_file_public.loc[:, 'date'] = pd.to_datetime(flat_file_public['date'], errors='coerce')
flat_file_public.loc[:, 'date_added'] = pd.to_datetime(flat_file_public['date_added'], errors='coerce')
flat_file_public.info()

<class 'pandas.core.frame.DataFrame'>
Index: 550 entries, 8 to 4148
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   company              550 non-null    string 
 1   location             550 non-null    string 
 2   total_laid_off       550 non-null    Int64  
 3   date                 550 non-null    object 
 4   percentage_laid_off  385 non-null    Float64
 5   industry             550 non-null    string 
 6   source               550 non-null    string 
 7   stage                550 non-null    string 
 8   funds_raised         489 non-null    Float64
 9   country              550 non-null    string 
 10  date_added           550 non-null    object 
dtypes: Float64(2), Int64(1), object(2), string(6)
memory usage: 53.2+ KB


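I used `.loc` to avoid the SettingWithCopyWarning. The indexer takes the **row** selection first, which in this case is all rows (`:`), and then the **column** selection, which I pass as the column name string. The result of `pd.to_datetime` is then assigned to that selection.

Two caveats apply. First, `flat_file_public` is itself a filtered selection of `flat_file_usa`, so `.loc` alone does not guarantee that pandas treats it as an independent DataFrame; calling `.copy()` on the filtered result is the more reliable way to do that. Second, the `info()` output above still lists `date` and `date_added` as 'object'. In recent pandas versions, assigning through `.loc[:, column]` writes the new values into the existing column instead of replacing it, so the column keeps its original dtype. The `convert_dtypes()` call in Step 4 appears to be what finally turns these columns into datetime columns.

Once converted, the date columns carry a time component (midnight) that is irrelevant for my analysis. Therefore, I will eventually convert them to plain dates using the `.dt.date` accessor. This removes the time component but also changes the data type back to 'object'.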
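The dtype behavior discussed here can be checked on a toy example. This is a minimal sketch under assumptions: the `raw` and `public` names and their values are illustrative, the `%m/%d/%Y` format is inferred from the preview rows, and `.dt.normalize()` is shown only as a possible alternative to `.dt.date`, not as something this notebook uses.

```python
import pandas as pd

# Toy rows; "not a date" is included to show what errors="coerce" does.
raw = pd.DataFrame({
    "stage": ["Post-IPO", "Series B", "Post-IPO"],
    "date": ["12/8/2025", "12/10/2025", "not a date"],
})

# .copy() makes the filtered result an independent DataFrame,
# so later assignments do not trigger SettingWithCopyWarning.
public = raw[raw["stage"] == "Post-IPO"].copy()

# Plain column assignment replaces the column, so the datetime dtype is kept.
public["date"] = pd.to_datetime(public["date"], format="%m/%d/%Y", errors="coerce")
print(public.dtypes)

# .dt.date yields Python date objects in an object-dtype Series.
date_only = public["date"].dt.date

# .dt.normalize() sets the time to midnight but keeps the datetime dtype.
normalized = public["date"].dt.normalize()
```

Keeping the datetime dtype preserves access to the `.dt` accessor and date arithmetic, while `.dt.date` is mainly useful for display or export.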

## Step 4: Handling Missing Values

Right now, my focus is on companies with layoff activity. Therefore, I will filter out any records that do not have a value in the `total_laid_off` column. This will help me focus on the relevant data for my analysis.

In [13]:
# Filter out records with null values in 'total_laid_off' column
flat_file_public = flat_file_public.loc[flat_file_public['total_laid_off'].notnull()]

# Convert remaining columns to appropriate data types to avoid deprecation warnings
flat_file_public = flat_file_public.convert_dtypes()

flat_file_public.describe(include='all')

Unnamed: 0,company,location,total_laid_off,date,percentage_laid_off,industry,source,stage,funds_raised,country,date_added
count,550,550,550.0,550,385.0,550,550,550,489.0,550,550
unique,279,46,,,,28,538,1,,1,
top,Salesforce,SF Bay Area,,,,Other,Internal memo,Post-IPO,,United States,
freq,11,258,,,,63,6,550,,550,
mean,,,736.587273,2023-07-03 03:32:04.363636224,0.133009,,,,2601.121063,,2023-07-12 13:39:29.454545664
min,,,11.0,2020-03-30 00:00:00,0.0,,,,1.0,,2020-04-02 00:00:00
25%,,,95.0,2022-11-14 06:00:00,0.05,,,,122.0,,2022-12-01 12:00:00
50%,,,200.0,2023-04-24 00:00:00,0.1,,,,410.0,,2023-04-28 00:00:00
75%,,,500.0,2024-04-01 12:00:00,0.17,,,,1200.0,,2024-04-10 06:00:00
max,,,22000.0,2025-12-08 00:00:00,1.0,,,,121900.0,,2025-12-08 00:00:00


I passed the `include='all'` parameter to the `describe()` method to get a comprehensive overview of all columns, including those with non-numeric data types. This will help me understand the distribution and characteristics of the data after filtering.

The filtered DataFrame holds 550 records, and the `count` for `total_laid_off` is also 550, which confirms that no null values remain in that column. Every remaining record therefore has a reported layoff count, which is essential for my analysis.
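To make the effect of this step easier to audit, the row count can be compared before and after the filter. The sketch below is an illustration on made-up rows (`df`, `kept`, and the values are assumptions); it uses `dropna(subset=...)`, which should select the same rows as the `notnull()` mask above.

```python
import pandas as pd

# Toy rows; values are made up for illustration.
df = pd.DataFrame({
    "company": ["A", "B", "C"],
    "total_laid_off": [60.0, None, 4000.0],
    "percentage_laid_off": [0.06, 0.1, None],
})

before = len(df)

# Drop rows without a layoff count, then let pandas pick nullable dtypes.
kept = df.dropna(subset=["total_laid_off"]).convert_dtypes()

dropped = before - len(kept)
print(f"Dropped {dropped} of {before} rows")
print(kept.dtypes)
```

Because the remaining `total_laid_off` values are whole numbers, `convert_dtypes()` stores them as nullable `Int64`, while `percentage_laid_off` becomes `Float64` and keeps its missing value.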

## Step 5: String Standardization

I will strip out any leading or trailing whitespace from string columns to ensure consistency in the data. This will help avoid issues during analysis, such as mismatched values due to extra spaces. This will be essential for accurate grouping and filtering later on. For example, the `company` column may consist of ` amazon ` and `amazon`, which should be treated as the same entity.

In [8]:
# Passing a regex pattern to identify leading or trailing whitespace
# \s matches any whitespace character
# ^ asserts position at start of a string
# $ asserts position at end of a string
mask = flat_file_public['company'].str.contains(r'^\s|\s$', na=False)
flat_file_public.loc[mask]

Unnamed: 0,company,location,total_laid_off,date,percentage_laid_off,industry,source,stage,funds_raised,country,date_added
2166,WeWork,New York City,300,2023-01-19,,Real Estate,https://investors.wework.com/news-and-events/p...,Post-IPO,22200.0,United States,2023-01-19
2606,Science 37,Los Angeles,90,2022-11-14,,Healthcare,https://www.bizjournals.com/triangle/news/2022...,Post-IPO,347.0,United States,2022-11-14
3105,Pear Therapeutics,Boston,25,2022-07-25,0.09,Healthcare,https://www.massdevice.com/pear-therapeutics-a...,Post-IPO,409.0,United States,2022-07-26


I can see that 3 records have leading or trailing whitespace in the `company` column. I pass a regex pattern to the `str.contains()` method to identify these records.

The argument `^\s|\s$` is used to match any string that starts (`^`) or ends (`$`) with whitespace characters (`\s`). The `na=False` parameter ensures that any NaN values are treated as False, preventing errors during the matching process.

I use the `mask` variable to display the records that match the condition for review before making any changes.
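The pattern can be tried on a few hand-made strings to see what it catches. The values below are illustrative, not taken from the dataset; wrapping the matches in `repr` is one way to make the otherwise invisible whitespace show up in the output.

```python
import pandas as pd

# Hand-made examples: trailing space, leading space, clean, missing, trailing tab
companies = pd.Series(["WeWork ", " Science 37", "Amazon", None, "Pear Therapeutics\t"])

# True where the string starts or ends with whitespace; missing values count as False
mask = companies.str.contains(r"^\s|\s$", na=False)

# repr() puts quotes around each value, so the stray whitespace becomes visible
print(companies[mask].map(repr))
```

Interior spaces, as in `Science 37`, do not match because the pattern is anchored to the start and end of the string.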

In [9]:
# Copy data to new DataFrame for cleaning
flat_file_cleaned = flat_file_public.copy()

# Strip leading/trailing whitespace and standardize capitalization
flat_file_cleaned.loc[:, 'company'] = flat_file_cleaned['company'].str.strip().str.title()

# Verification
flat_file_cleaned.loc[mask]

Unnamed: 0,company,location,total_laid_off,date,percentage_laid_off,industry,source,stage,funds_raised,country,date_added
2166,Wework,New York City,300,2023-01-19,,Real Estate,https://investors.wework.com/news-and-events/p...,Post-IPO,22200.0,United States,2023-01-19
2606,Science 37,Los Angeles,90,2022-11-14,,Healthcare,https://www.bizjournals.com/triangle/news/2022...,Post-IPO,347.0,United States,2022-11-14
3105,Pear Therapeutics,Boston,25,2022-07-25,0.09,Healthcare,https://www.massdevice.com/pear-therapeutics-a...,Post-IPO,409.0,United States,2022-07-26


The removed whitespace is not visible in the DataFrame preview above, but the leading and trailing whitespace has been stripped from the `company` column.

I also applied the `str.title()` method to standardize the capitalization of company names, so that each word starts with an uppercase letter followed by lowercase letters. This keeps the data consistent, with one visible side effect: mixed-case and all-caps names lose their original styling. `WeWork` became `Wework` in the output above, and `HP` appears as `Hp` in the final preview.

I will need to ensure I follow this same process for any other data sources I use for my project to maintain consistency across datasets.
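One way to keep this process identical across data sources is to wrap it in a small function. The sketch below is a suggestion only: `standardize_company` is a hypothetical helper name, and the lowercase `match_key` is an assumed alternative for matching names without changing how they are displayed.

```python
import pandas as pd

def standardize_company(names: pd.Series) -> pd.Series:
    """Strip surrounding whitespace and title-case names (hypothetical helper)."""
    return names.str.strip().str.title()

# Illustrative values, not taken from the dataset
names = pd.Series([" amazon ", "WeWork ", "HP"])

cleaned = standardize_company(names)
print(cleaned.tolist())  # ['Amazon', 'Wework', 'Hp']

# Assumed alternative: a lowercase key for matching that leaves the display name alone
match_key = names.str.strip().str.lower()
print(match_key.tolist())  # ['amazon', 'wework', 'hp']
```

Applying the same function to every source means that two datasets will agree on `Wework` and `Hp`, even though those forms differ from the companies' own styling.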

# Verification

In [10]:
print("Transformation Complete.")
print(f"Original Row Count: {len(flat_file_df)}")
print(f"Final Row Count: {len(flat_file_cleaned)}")

Transformation Complete.
Original Row Count: 4248
Final Row Count: 550


In [11]:
# Print cleaned DataFrame preview
flat_file_cleaned.head()

Unnamed: 0,company,location,total_laid_off,date,percentage_laid_off,industry,source,stage,funds_raised,country,date_added
8,Payoneer,New York City,60,2025-12-08,0.06,Finance,https://www.calcalistech.com/ctechnews/article...,Post-IPO,,United States,2025-12-08
16,Hp,SF Bay Area,4000,2025-11-25,,Hardware,https://www.wsj.com/tech/hp-to-cut-up-to-10-of...,Post-IPO,,United States,2025-03-01
28,Synopsys,SF Bay Area,2000,2025-11-12,0.1,Other,https://www.reuters.com/business/world-at-work...,Post-IPO,,United States,2025-11-14
34,Hewlett Packard Enterprise,SF Bay Area,52,2025-11-05,,Hardware,https://www.sfchronicle.com/tech/article/layof...,Post-IPO,1400.0,United States,2025-11-07
39,Amazon,Seattle,14000,2025-10-27,0.01,Retail,https://www.cnbc.com/2025/10/28/amazon-layoffs...,Post-IPO,8100.0,United States,2025-10-28
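Beyond comparing row counts, the cleaning rules themselves can be re-checked on the final DataFrame. The sketch below is a hedged illustration: `failed_checks` is a hypothetical helper, and the rows are a small sample shaped like the preview above.

```python
import pandas as pd

def failed_checks(df: pd.DataFrame) -> list:
    """Return the names of the cleaning rules that df violates (hypothetical helper)."""
    checks = {
        "usa_only": (df["country"] == "United States").all(),
        "post_ipo_only": (df["stage"] == "Post-IPO").all(),
        "layoffs_present": df["total_laid_off"].notna().all(),
        "company_trimmed": not df["company"].str.contains(r"^\s|\s$", na=False).any(),
    }
    return [name for name, passed in checks.items() if not passed]

# Two rows shaped like the cleaned preview
good = pd.DataFrame({
    "company": ["Payoneer", "Amazon"],
    "stage": ["Post-IPO", "Post-IPO"],
    "country": ["United States", "United States"],
    "total_laid_off": [60, 14000],
})

# Same rows with a trailing space and a non-public stage introduced on purpose
bad = good.assign(company=["Payoneer ", "Amazon"], stage=["Series B", "Post-IPO"])

print(failed_checks(good))  # []
print(failed_checks(bad))   # ['post_ipo_only', 'company_trimmed']
```

Calling `failed_checks(flat_file_cleaned)` and expecting an empty list would confirm that all four transformations still hold after any re-run of the notebook.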


# Ethical Implications

When working with layoff data, it's crucial to handle the information with sensitivity and respect for the affected individuals and companies.

## What changes were made to the data?

I have filtered the dataset to include only records from publicly traded companies in the USA with layoff activity. I have also standardized string formats and ensured date columns are in the correct format for analysis.

## Are there any legal or regulatory guidelines for your data or project topic?

When dealing with employment and layoff data, it's important to comply with data privacy laws such as the California Consumer Privacy Act (CCPA). These regulations govern how personal data should be collected, stored, and shared, ensuring that individuals' privacy rights are protected. Additionally, ethical considerations should be taken into account to avoid misrepresentation or harm to the companies and individuals involved.

## What risks could be created based on the transformations done?

The risks associated with the transformations include potential misinterpretation of layoff data, which could lead to negative perceptions of the companies involved. Additionally, if the data is not handled securely, there could be privacy concerns for individuals affected by layoffs. It's essential to ensure that the data is used responsibly and ethically to avoid harm to both companies and individuals.

## Did you make any assumptions in cleaning/transforming the data?

I had to assume that the records with null values in the `total_laid_off` column indicated no layoff activity, which is why I filtered them out. Additionally, I assumed that standardizing string formats would not alter the meaning of the data but rather enhance its consistency for analysis.

## How was your data sourced/verified for credibility?

My data was sourced from the [Layoffs.fyi](https://www.layoffs.fyi) dataset, which compiles layoff information from various credible sources, including news articles, company announcements, and government reports. The dataset is regularly updated and maintained to ensure accuracy and reliability.

## Was your data acquired in an ethical way?

Yes, the data was acquired ethically from a publicly available source, specifically a [Kaggle](https://www.kaggle.com/datasets/swaptr/layoffs-2022) dataset that reports on layoffs. The dataset does not contain personally identifiable information (PII) about individuals affected by layoffs, focusing instead on company-level data. This approach helps maintain privacy while still providing valuable insights into employment trends.

## How would you mitigate any of the ethical implications you have identified?

I would mitigate ethical implications by ensuring that the data stays anonymized and aggregated, as it is in the data source, so that individuals cannot be identified. Additionally, I would implement strict data governance policies to control access to the data and ensure it is used solely for research and analysis purposes. Transparency about the data sources and methodologies used in the analysis would also help build trust and credibility.