---
title: Data Analysis
subtitle: Comprehensive Data Cleaning & Exploratory Analysis of Job Market Trends
author:
  - name: Group 11
    affiliations:
      - name: Boston University
        city: Boston
        state: MA
  
bibliography: references.bib
csl: csl/econometrica.csl
format:
  html:
    toc: true
    number-sections: true
    df-print: paged
jupyter: python3
---






# Introduction
This document outlines the data cleaning process, including:

- Handling missing values
- Dropping unnecessary columns
- Deduplicating records


::: {.callout-note collapse="true"}
#### **Code: Extracting the Dataset**


```{python}
# Download, extract, and load the dataset
import os
import zipfile

import gdown
import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px

file_id = "1VNBTxArDMN2o9fJBDImaON6YUAyJGOU6"
zip_file = "lightcast_job_postings.zip"  # Name of the downloaded ZIP file
csv_file = "./data/lightcast_job_postings.csv"  # Path to the extracted CSV file

# Step 1: Download the dataset
print("Downloading the dataset...")
gdown.download(f"https://drive.google.com/uc?id={file_id}", zip_file, quiet=False)

# Step 2: Unzip the file
print("Extracting files...")
with zipfile.ZipFile(zip_file, "r") as zip_ref:
    zip_ref.extractall("./data")  # Extracts to the 'data' directory

# Step 3: Read the CSV file
print("Reading the CSV file...")
df = pd.read_csv(csv_file)

# Display dataset info (df.info() prints directly and returns None)
print("Dataset loaded successfully!")
df.info()

print("Available columns in dataset:", df.columns.tolist())
```

```{python}
# Drop columns that are not needed for the analysis
columns_to_drop = [
    "ID", "URL", "ACTIVE_URLS", "DUPLICATES", "LAST_UPDATED_TIMESTAMP",
    "NAICS2", "NAICS3", "NAICS4", "NAICS5", "NAICS6",
    "SOC_2", "SOC_3", "SOC_5"
]

df.drop(columns=columns_to_drop, inplace=True)
print("Dropped unnecessary columns.")
print(df.columns)

# Count missing values per column before cleaning
print("Missing values before cleaning:")
print(df.isnull().sum())
```

```{python}
import math

import matplotlib.pyplot as plt
import missingno as msno

# Normalize column names: uppercase, no surrounding whitespace
df.columns = df.columns.str.upper().str.strip()
print(df.columns)  # Debugging step

# Visualize missing data
msno.heatmap(df)
plt.title("Missing Values Heatmap")
plt.show()

# Drop columns with >50% missing values
# (thresh is the minimum number of non-null values a column needs to be kept)
df.dropna(thresh=math.ceil(len(df) * 0.5), axis=1, inplace=True)

# Check if "SALARY" exists before filling missing values with the median.
# Assigning the result back avoids chained in-place modification.
if "SALARY" in df.columns:
    df["SALARY"] = df["SALARY"].fillna(df["SALARY"].median())
else:
    print("⚠️ Warning: 'SALARY' column not found in dataframe!")

# Check if "INDUSTRY" exists before filling missing values with a placeholder
if "INDUSTRY" in df.columns:
    df["INDUSTRY"] = df["INDUSTRY"].fillna("Unknown")
else:
    print("⚠️ Warning: 'INDUSTRY' column not found in dataframe!")

print("✅ Missing value handling complete.")
```
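
The missing-value rules above can be checked on a small example. This is a minimal sketch, not part of the pipeline: the `toy` frame, its values, and the `REMOTE_NOTE` column are made up for illustration. It shows that a column with exactly 50% missing values survives the threshold, and that the median used for filling is computed from the non-missing salaries only.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the job postings dataset
toy = pd.DataFrame({
    "TITLE": ["Analyst", "Engineer", "Manager", "Scientist"],
    "SALARY": [50000.0, np.nan, 70000.0, np.nan],       # 50% missing
    "REMOTE_NOTE": [np.nan, np.nan, np.nan, "hybrid"],  # 75% missing
})

# A column is kept only if it has at least this many non-null values
min_non_null = math.ceil(len(toy) * 0.5)
cleaned = toy.dropna(thresh=min_non_null, axis=1)
kept_columns = cleaned.columns.tolist()

# Fill the remaining gaps in SALARY with the median of the observed values
filled = cleaned.assign(SALARY=cleaned["SALARY"].fillna(cleaned["SALARY"].median()))

print(kept_columns)
print(filled)
```

Median imputation keeps every row but compresses the salary distribution toward its center, so any later salary statistics should be read with that in mind.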

```{python}
# Remove duplicate postings that share the same title, company, location, and posting date
df = df.drop_duplicates(subset=["TITLE", "COMPANY", "LOCATION", "POSTED"])
print("Duplicates removed.")
```
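
To illustrate what the deduplication step treats as a duplicate, here is a small sketch on made-up rows (the `toy` frame and its values are hypothetical). Only the four key columns are compared, so two rows that differ in another column, such as `SALARY`, still count as duplicates, and the first occurrence is the one that is kept.

```python
import pandas as pd

# Hypothetical toy postings, not taken from the job postings dataset
toy = pd.DataFrame({
    "TITLE": ["Data Analyst", "Data Analyst", "Data Analyst"],
    "COMPANY": ["Acme", "Acme", "Acme"],
    "LOCATION": ["Boston, MA", "Boston, MA", "Boston, MA"],
    "POSTED": ["2024-05-01", "2024-05-01", "2024-06-01"],
    "SALARY": [70000, 72000, 70000],
})

key = ["TITLE", "COMPANY", "LOCATION", "POSTED"]

# keep="first" is the pandas default; it is spelled out here for clarity
deduped = toy.drop_duplicates(subset=key, keep="first")
removed = len(toy) - len(deduped)

print(f"Removed {removed} duplicate row(s); {len(deduped)} remain.")
```

The second row is dropped even though its salary differs, while the third row is kept because its posting date is different. Reporting the number of removed rows, as done here, is a cheap way to confirm that the key columns are not too broad or too narrow.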





## Research