# JBI100 Visualization 
### Academic year 2024-2025

## Incidents and Accidents
Data sources:

- Work-related Injury and Illness (https://www.osha.gov/Establishment-Specific-Injury-and-Illness-Data)


In [None]:
# Import libraries
import pandas as pd
import plotly.express as px
import numpy as np
import os

# Do not truncate tables
pd.set_option("display.max_columns", None)

# Assignment 1

## Exercise 1 – Data Set 

### (a) What is the information you can obtain from the data set/ data sets?

The OSHA Injury Tracking Application (ITA) Case Detail dataset contains detailed information on work-related injuries and illnesses (each row is a case of work-related injuries or illnesses). The data includes the following types of information:

#### 1. Establishment-Level Information
- **Unique Identifiers**: Establishment ID, Employer Identification Number (EIN).
- **Demographic Information**: Establishment name, company name, street address, city, state, zip code.
- **Industry Classification**: North American Industry Classification System (NAICS) code, year of NAICS code used, industry description.
- **Establishment Type**: Private industry, state government entity, or local government entity.
- **Workforce Data**: Size of the establishment, annual average employees, total hours worked.

#### 2. Incident-Level Information
- **Incident Identifiers**: Unique case number, establishment ID linkable to 300A data.
- **Incident Details**: Date of incident, type of incident (injury, skin disorder, etc.), time of incident, time started work prior to incident, and whether time was unknown.
- **Outcomes**: Most serious outcome (e.g., death, days away from work, job transfer/restriction), number of days away from work, number of restricted duty or transfer days.
- **Fatalities**: Date of death (if applicable).

#### 3. Narrative Descriptions
- **Incident Details**: What the employee was doing before the incident, how the incident happened, injury/illness description, and the object/substance directly harming the employee.

#### 4. Occupational Coding Information
- **Job Information**: Job title of the injured/ill employee.
- **Standard Occupation Code (SOC)**:
  - SOC Code and Description: Assigned using NIOCCS.
  - SOC Probability: Confidence score for SOC coding.
  - SOC Reviewed: Indicates whether the SOC code was reviewed or reassigned.

#### 5. System Metadata
- **Submission Information**: Created timestamp, year of filing.
- **Data Quality Indicators**: Codes for missing or invalid entries (e.g., "9999" for SOC code when unassignable).


### (b) What are the attributes in the data and what is their meaning?

The OSHA Injury Tracking Application (ITA) Case Detail dataset includes the following attributes, categorized by their context and meaning:

#### 1. **Establishment Information**
- **`establishment_id`**: Unique identifier for each establishment.
- **`establishment_name`**: Name of the establishment reporting the data.
- **`ein`**: Employer Identification Number (Federal Tax Identification Number).
- **`company_name`**: Name of the parent company of the establishment.
- **`street_address`**: Street address of the establishment.
- **`city`**: City where the establishment is located.
- **`state`**: State or territory where the establishment is located.
- **`zip_code`**: Full zip code of the establishment.
- **`naics_code`**: North American Industry Classification System (NAICS) code for the establishment.
- **`naics_year`**: Year version of NAICS code used.
- **`industry_description`**: Industry description based on the NAICS code.
- **`establishment_type`**: Type of establishment:
  - 1 = Private industry
  - 2 = State government entity
  - 3 = Local government entity
- **`size`**: Size of the establishment based on maximum employees:
  - 1 = <20 employees
  - 21 = 20-99 employees
  - 22 = 100-249 employees
  - 3 = 250+ employees
- **`annual_average_employees`**: Annual average number of employees.
- **`total_hours_worked`**: Total hours worked by all employees at the establishment.

#### 2. **Incident Information**
- **`case_number`**: Employer-assigned unique case number for each injury/illness.
- **`date_of_incident`**: Date when the incident occurred.
- **`incident_outcome`**: Most serious outcome of the incident:
  - 1 = Death
  - 2 = Days away from work (DAFW)
  - 3 = Job transfer or restriction
  - 4 = Other recordable case
- **`dafw_num_away`**: Number of days away from work due to the incident.
- **`djtr_num_tr`**: Number of days on restricted duty or job transfer due to the incident.
- **`type_of_incident`**: Type of incident:
  - 1 = Injury
  - 2 = Skin disorder
  - 3 = Respiratory condition
  - 4 = Poisoning
  - 5 = Hearing loss
  - 6 = All other illness
- **`time_started_work`**: Time the employee began work prior to the incident.
- **`time_of_incident`**: Time the incident occurred.
- **`time_unknown`**: Indicator if the time of the incident is unknown:
  - 0 = No
  - 1 = Yes
- **`date_of_death`**: Date of death, if applicable.

#### 3. **Narrative Descriptions**
- **`incident_description`**: Description of the incident.
- **`nar_before_incident`**: Description of what the employee was doing before the incident.
- **`nar_what_happened`**: Description of what happened during the incident.
- **`nar_injury_illness`**: Description of the injury or illness.
- **`nar_object_substance`**: Description of the object or substance directly harming the employee.

#### 4. **Occupational Information**
- **`job_description`**: Job title of the injured/ill employee.
- **`soc_code`**: Standard Occupation Code assigned by NIOCCS or OSHA.
- **`soc_description`**: Text description of the SOC code.
- **`soc_probability`**: Confidence score for the SOC coding (5 indicates a manually reassigned code).
- **`soc_reviewed`**: Indicator of whether the SOC code was reviewed:
  - 0 = Not reviewed, NIOCCS coded
  - 1 = Reviewed by OSHA
  - 2 = Not SOC coded (SOC = "9999")

#### 5. **System Metadata**
- **`created_timestamp`**: Timestamp when the record was submitted.
- **`year_filing_for`**: Year in which the reported injuries/illnesses occurred.

### (c) Write a small parsing function that can read the data position (column, row) from the file format you selected. 

In [None]:
def parse_dataset():
    dataset_path = os.path.join(
        "Work-related Injury and Illness",
        "ITA Case Detail Data 2023 through 8-31-2023.csv",
    )
    return pd.read_csv(
        dataset_path,
        delimiter=",",
        low_memory=False,
        encoding="utf-8",
        dtype={"zip_code": "string", "naics_code": "string", "naics_year": "string"},
        index_col="id",
    )


df = parse_dataset()

In [None]:
df.info()

In [None]:
df.shape

In [None]:
df.sample(5)
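
Because the exercise asks for reading a data position (column, row), a minimal sketch of such a lookup is given below. The helper name `read_cell` and the toy frame are assumptions made for illustration; they are not part of the original notebook. The lookup is positional, so it does not depend on the labels of the `id` index.

```python
import pandas as pd


def read_cell(frame: pd.DataFrame, column: str, row: int):
    """Return the value at the given column name and 0-based row position."""
    if column not in frame.columns:
        raise KeyError(f"Column '{column}' not found.")
    if not 0 <= row < len(frame):
        raise IndexError(f"Row {row} is outside 0..{len(frame) - 1}.")
    # iloc selects by position, independent of the index labels
    return frame.iloc[row, frame.columns.get_loc(column)]


# Toy stand-in for the real data (illustrative values only)
toy = pd.DataFrame(
    {"state": ["TX", "CA", "NY"], "size": [3, 22, 21]},
    index=pd.Index([101, 102, 103], name="id"),
)
print(read_cell(toy, "state", 1))  # CA
```

On the real data the same call would look like `read_cell(df, "state", 0)`.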

### (d) Write another function that outputs the distribution of the attributes, and counts the frequencies of the different values

In [None]:
df.describe(include="all")

In [None]:
def get_frequency_distribution(df):
    """
    Creates a DataFrame where:
    - Primary index: column names (attributes)
    - Secondary index: unique values in each column
    - Value column: frequency of each unique value

    Parameters:
    - df (pd.DataFrame): Input DataFrame.

    Returns:
    - pd.DataFrame: Frequency distribution as described.
    """
    frequency_df = pd.concat(
        {col: df[col].value_counts() for col in df.columns},
        names=["Attribute", "Value"],
    ).reset_index(name="Frequency")

    # Set the index as required
    return frequency_df.set_index(["Attribute", "Value"])


df_frequency = get_frequency_distribution(df)
df_frequency

In [None]:
def attribute_distribution(dataframe, attribute, plot_distribution=False):
    """
    Calculates the distribution of values for a specified attribute in the dataset
    and optionally plots the distribution using Plotly Express.

    Parameters:
        dataframe (pd.DataFrame): The DataFrame containing the dataset.
        attribute (str): Column name of the attribute to analyze.
        plot_distribution (bool): Whether to plot the distribution of the attribute.

    Returns:
        pd.DataFrame: A DataFrame with value counts and percentage distribution.
    """
    if attribute not in dataframe.columns:
        raise ValueError(f"Attribute '{attribute}' not found in the dataset.")

    # Calculate value counts and percentage
    counts = dataframe[attribute].value_counts()
    percentages = (counts / counts.sum()) * 100

    # Combine counts and percentages into a DataFrame
    distribution = pd.DataFrame(
        {
            "Value": counts.index.astype(
                str
            ),  # Ensure all values are strings for categorical plotting
            "Frequency": counts.values,
            "Percentage": percentages.values,
        }
    ).set_index("Value")

    if not plot_distribution:
        return distribution

    fig = px.bar(
        distribution.reset_index(),
        x="Value",
        y="Percentage",
        text="Percentage",
        title=f"Distribution of {attribute} (Percentage)",
        labels={"Value": "Attribute Value", "Percentage": "Percentage (%)"},
    )
    fig.update_traces(texttemplate="%{text:.2f}%", textposition="outside")
    fig.update_layout(
        xaxis=dict(title="Values"),
        yaxis=dict(title="Percentage (%)"),
        uniformtext_minsize=8,
        uniformtext_mode="hide",
    )
    fig.show()

    return distribution

In [None]:
# Single attribute
attribute_distribution(df, "size", True)

In [None]:
# # All attributes
# for column_name in df.columns:
#     print(attribute_distribution(df, column_name, False))



### (e) Try to describe the data set in just a few sentences. How is the data provided? Which kind of attributes are contained in the data set? How large is the data set in terms of the number of those elements (teams, matches, players, historic data, extra records, and so on)?

The OSHA Injury Tracking Application (ITA) Case Detail dataset is a structured collection of work-related injury and illness records reported by establishments with 100 or more employees in high-hazard industries. The data is provided as a single CSV file (here `ITA Case Detail Data 2023 through 8-31-2023.csv`) in which each row is one recorded case. Its attributes describe establishments (e.g., name, location, industry), incidents (e.g., date, type, outcome, days away from work), and employee roles (e.g., job title, Standard Occupation Codes); additional narrative fields describe incidents and injuries in free text. The exact number of rows and columns is reported by `df.shape` in part (c). Because every case carries its establishment identifier, the data can be analyzed per incident, per establishment, or per occupation.
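
The size figures mentioned above can be collected in one place. The following is a minimal sketch; the helper name `summarize_size` and the toy frame are assumptions for illustration, and on the real data the function would be called with `df`.

```python
import pandas as pd


def summarize_size(frame: pd.DataFrame) -> dict:
    """Collect basic size figures for a DataFrame."""
    return {
        "rows": len(frame),
        "columns": frame.shape[1],
        "cells": frame.size,
        # deep=True also counts the memory used by string objects
        "memory_mb": frame.memory_usage(deep=True).sum() / 1024**2,
    }


# Toy stand-in: three cases reported by two establishments
toy = pd.DataFrame({"establishment_id": [1, 1, 2], "case_number": ["a", "b", "c"]})
summary = summarize_size(toy)
print(summary["rows"], summary["columns"])  # 3 2
# Number of distinct establishments behind the cases
print(toy["establishment_id"].nunique())  # 2
```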

### (f) Analyze the errors and missing values. Write a function to count how many missing values per attribute and per entry you have. Analyze what are the most relevant missing values that might hinder the analysis according to you.

In [None]:
def analyze_and_plot_missing_values(dataframe):
    """
    Analyzes and plots the missing values in the dataset.

    Parameters:
        dataframe (pd.DataFrame): The DataFrame containing the dataset.

    Returns:
        tuple[pd.DataFrame, pd.DataFrame]:
              - Missing count and percentage per attribute (only attributes with missing values)
              - Number of rows for each count of missing values per entry
    """
    # Count missing values per attribute
    missing_per_attribute = dataframe.isnull().sum()
    percent_missing_per_attribute = (missing_per_attribute / len(dataframe)) * 100

    # Combine counts and percentages into a DataFrame
    attribute_analysis = (
        pd.DataFrame(
            {
                "Attribute": dataframe.columns,
                "Missing_Count": missing_per_attribute,
                "Percentage_Missing": percent_missing_per_attribute,
            }
        )
        .query("Missing_Count > 0")
        .sort_values(by="Percentage_Missing", ascending=False)
        .set_index("Attribute")
    )  # Sort in descending order

    # Count missing values per entry (row)
    missing_per_entry = dataframe.isnull().sum(axis=1)
    # Distribution of rows by the number of missing values
    row_missing_distribution = missing_per_entry.value_counts().reset_index()
    row_missing_distribution.columns = ["Missing_Count", "Row_Count"]
    row_missing_distribution = row_missing_distribution.sort_values(
        by="Missing_Count", ascending=True
    ).set_index("Missing_Count")

    # Plot missing values per attribute
    fig_attr = px.bar(
        attribute_analysis.reset_index(),
        x="Attribute",
        y="Percentage_Missing",
        text="Percentage_Missing",
        title="Missing Values Per Attribute (Sorted by Percentage)",
        labels={
            "Attribute": "Attribute",
            "Percentage_Missing": "Percentage Missing (%)",
        },
    )
    fig_attr.update_traces(texttemplate="%{text:.2f}%", textposition="outside")
    fig_attr.update_layout(
        xaxis=dict(title="Attributes", tickangle=45),
        yaxis=dict(title="Percentage Missing (%)"),
        uniformtext_minsize=8,
        uniformtext_mode="hide",
        showlegend=False,
    )
    fig_attr.show()

    # Plot distribution of missing values per row
    fig_row = px.bar(
        row_missing_distribution.reset_index(),
        x="Missing_Count",
        y="Row_Count",
        text="Row_Count",
        title="Distribution of Missing Values Per Row",
        labels={
            "Missing_Count": "Number of Missing Values",
            "Row_Count": "Number of Rows",
        },
    )
    fig_row.update_traces(texttemplate="%{text}", textposition="outside")
    fig_row.update_layout(
        xaxis=dict(title="Number of Missing Values"),
        yaxis=dict(title="Number of Rows"),
        uniformtext_minsize=8,
        uniformtext_mode="hide",
        showlegend=False,
    )
    fig_row.show()

    return attribute_analysis, row_missing_distribution


# Example usage
df_missing_attributes, df_row_missing_distribution = analyze_and_plot_missing_values(df)

In [None]:
df_missing_attributes

In [None]:
df_row_missing_distribution.query("Missing_Count > 0")

##### Analysis
1. **`date_of_death` (99.97% missing)**
- **Impact**: This field is crucial for analyzing fatalities but is practically unusable due to the high missing percentage.
- **Recommendation**: Use the `incident_outcome` field, which includes death as a category, to indirectly analyze fatality-related trends.

2. **`time_started_work` (12.58% missing) and `time_of_incident` (12.48% missing)**
- **Impact**: These fields are essential for analyzing temporal trends, such as incidents occurring shortly after starting work. Missing values reduce the reliability of time-dependent analyses.
- **Recommendation**: Focus analyses on the available data or consider imputing missing values based on similar cases or statistical methods.

3. **`ein` (8.60% missing)**
- **Impact**: The EIN uniquely identifies establishments and is vital for merging datasets or conducting establishment-specific studies. Missing values hinder these analyses.
- **Recommendation**: Use `establishment_id` as an alternative identifier if it is complete.

4. **`industry_description` (6.72% missing)**
- **Impact**: Industry classification is critical for sector-specific risk analysis. Missing values hinder comparisons of workplace safety across industries.
- **Recommendation**: Use `naics_code` for industry-level analysis or group missing values into an "Unknown" category.

5. **`job_description` (0.41% missing)**
- **Impact**: This field is important for analyzing risks associated with specific job roles. Missing data limits occupation-specific insights.
- **Recommendation**: Exclude rows with missing `job_description` from job-specific analyses or impute values based on similar cases.
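
Two of the recommendations above can be sketched in a few lines: deriving a fatality flag from `incident_outcome` instead of relying on `date_of_death`, and grouping missing industry descriptions into an "Unknown" category. The toy frame and the new column names `is_fatal` and `industry_filled` are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the raw data (illustrative values only)
toy = pd.DataFrame(
    {
        "incident_outcome": [1, 2, 4, 1],
        "date_of_death": [None, None, None, "03/14/2023"],
        "industry_description": ["Poultry Processing", None, "General Warehousing", None],
    }
)

# incident_outcome == 1 encodes "Death", so fatal cases are identifiable without a date
toy["is_fatal"] = toy["incident_outcome"].eq(1)

# Fatal cases whose date of death was not filled in
fatal_without_date = int((toy["is_fatal"] & toy["date_of_death"].isna()).sum())

# Keep rows without an industry text by assigning them to an explicit category
toy["industry_filled"] = toy["industry_description"].fillna("Unknown")

print(int(toy["is_fatal"].sum()))  # 2
print(fatal_without_date)  # 1
```

Keeping the "Unknown" group visible in charts, rather than dropping it, makes the amount of missing information explicit to the viewer.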


## Exercise 2 – Goal - Data (Domain specific)


#### General Overall Goal
The primary goal of the visualization tool is to **enable workplace safety analysts, policymakers, and industry leaders** to:
1. Identify trends in workplace injuries and illnesses across industries, establishments, and job roles.
2. Explore the temporal, geographic, and sector-specific distribution of incidents to identify patterns and potential risk factors.
3. Facilitate decision-making by highlighting areas that require safety interventions, such as industries with high incident rates or recurring issues in specific job roles.

#### Target Users
The visualization tool is designed for:
- **Workplace Safety Analysts**: To understand patterns in workplace incidents and investigate contributing factors.
- **Policymakers**: To design and evaluate regulations that mitigate risks in high-hazard industries.
- **Industry Leaders/Managers**: To assess their establishments’ performance compared to industry benchmarks and implement targeted safety measures.

#### Overall Goal and High-Level Actions
The primary goal is **"Exploratory Analysis"**, focusing on:
1. **Comparative Analysis**: Compare incidents across industries, job roles, and geographic regions to identify high-risk categories.
2. **Trend Identification**: Examine temporal trends in incident occurrences and severity (e.g., time of day, seasonality).
3. **Insight Generation**: Drill down into specific establishments or job types to identify recurring patterns or anomalies.
4. **Communication and Awareness**: Present findings in an intuitive, interactive format to raise awareness and drive action.

#### Why This Goal is Suitable for the Available Data
- The dataset contains a wealth of detailed information about workplace incidents, including establishment-level, incident-level, and job-level attributes. These can be visualized to uncover patterns and correlations that would be hard to identify otherwise.
- While some attributes have missing values (e.g., `date_of_death`, `time_started_work`), the remaining data is sufficient to provide meaningful insights at industry, establishment, and incident levels.

#### Why Visualization is the Right Means
1. **Pattern Recognition**: Visualization allows users to recognize patterns and outliers, such as industries with unusually high incident rates.
2. **Exploration and Interaction**: An interactive tool enables users to explore the dataset from various perspectives (e.g., filtering by industry, geographic location, or incident severity).
3. **Decision Support**: Visualizing data enables managers and policymakers to make informed decisions quickly by presenting complex data in an understandable format.
4. **Communication**: Visualizations can convey insights effectively to diverse stakeholders, including non-technical audiences.

This tool is ideal for leveraging the available data to inform workplace safety improvements, reduce incident rates, and ensure compliance with regulations.


## Exercise 3 – Data (What) Domain specific


### (a) Write in section What (Data) the description of the data. You can base it on the analysis you have done in exercise 1. What are the general properties of the data you want to use? 


#### Attributes Needed for the Analysis and Their Relevance

1. **Establishment-Level Attributes**:
   - **`naics_code` and `industry_description`**: Essential for identifying high-risk industries and understanding sector-specific trends.
   - **`state`, `city`, `zip_code`**: Crucial for geographic analysis to detect regional patterns or disparities in workplace safety.
   - **`size`, `annual_average_employees`, and `total_hours_worked`**: Provide context for scaling incident data (e.g., incidents per employee or hours worked) to make comparisons meaningful across establishments of different sizes.

2. **Incident-Level Attributes**:
   - **`date_of_incident`**: Key for temporal analysis, such as identifying trends over time or seasonal variations.
   - **`type_of_incident`**: Important for categorizing and understanding the nature of incidents (e.g., injuries vs. illnesses).
   - **`incident_outcome`**: Crucial for evaluating the severity of incidents and prioritizing interventions.
   - **`dafw_num_away` and `djtr_num_tr`**: Provide metrics to assess the impact of incidents on productivity and employee health.

3. **Narrative and Job-Level Attributes**:
   - **`job_description`**: Helps identify which roles are most vulnerable to workplace incidents, enabling targeted interventions.
   - **`soc_code` and `soc_description`**: Provide standard classifications for jobs, supporting cross-industry comparisons.
   - **`incident_description` and `nar_what_happened`**: Offer qualitative insights into incident causes and circumstances, which are valuable for designing preventative measures.

4. **System Metadata**:
   - **`year_filing_for`**: Allows analysis of data trends across multiple reporting years.
   - **`created_timestamp`**: Ensures timeliness and relevance of the data used for analysis.

#### Why These Attributes Are Relevant
These attributes enable comprehensive analyses aligned with the goals of the visualization tool:
- **Comparative Analysis**: Attributes like `naics_code`, `state`, and `incident_outcome` allow comparisons across industries, regions, and severity levels.
- **Trend Identification**: Temporal attributes such as `date_of_incident` and `year_filing_for` help identify trends in workplace safety over time.
- **Actionable Insights**: Narrative and job-level attributes (`job_description`, `soc_code`, and `incident_description`) provide detailed insights into specific incident causes, enabling targeted interventions.
- **Scalability**: Workforce metrics (`size`, `annual_average_employees`, `total_hours_worked`) ensure that analyses are normalized, allowing for meaningful comparisons across establishments of varying sizes.

By focusing on these attributes, the visualization tool can deliver actionable insights to workplace safety analysts, policymakers, and industry leaders.
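
One common way to normalize case counts by exposure is the incidence rate per 100 full-time workers, computed as cases × 200,000 / hours worked, where 200,000 is the number of hours worked by 100 employees at 40 hours a week for 50 weeks. The sketch below is an illustration on toy values; it assumes that `total_hours_worked` is repeated on every case row of an establishment, and the column name `rate_per_100_fte` is hypothetical.

```python
import pandas as pd

# Toy stand-in: three cases at establishment A, one case at establishment B
toy = pd.DataFrame(
    {
        "establishment_id": ["A", "A", "A", "B"],
        "total_hours_worked": [400_000, 400_000, 400_000, 100_000],
    }
)

grouped = toy.groupby("establishment_id")
rates = pd.DataFrame(
    {
        "cases": grouped.size(),  # one row per case
        "hours": grouped["total_hours_worked"].first(),  # establishment-level value
    }
)
# Cases per 200,000 hours worked (equivalent to 100 full-time workers per year)
rates["rate_per_100_fte"] = rates["cases"] * 200_000 / rates["hours"]
print(rates)
```

In this toy example establishment B has fewer cases than A (1 versus 3) but a higher rate (2.0 versus 1.5), which is the kind of difference that raw counts would hide.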


### (b) Most of the data sets contain noise, missing data values, and relations, or measurement errors. The data of this course is no exception. In exercise 1, you already looked at the missing values. How will you handle missing data values or measurement errors? Think of multiple ways and their pros and cons.
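
Three common ways of handling missing values, with their main trade-offs, are sketched below on a toy frame. The values are illustrative only and the variable names are assumptions; the preprocessing function that follows mostly relies on the second strategy (placeholder values).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "job_description": ["Welder", None, "Nurse", "Driver"],
        "dafw_num_away": [2.0, 10.0, np.nan, 0.0],
    }
)

# Strategy 1: listwise deletion.
# Pro: simple, no invented values. Con: every incomplete row is lost,
# which biases results if values are not missing at random.
dropped = toy.dropna()

# Strategy 2: placeholder category.
# Pro: keeps all rows. Con: the placeholder must be excluded or reported
# separately in per-category charts.
placeholder = toy.assign(job_description=toy["job_description"].fillna("Not provided"))

# Strategy 3: median imputation for a numeric column.
# Pro: keeps all rows and is robust to outliers. Con: shrinks the variance
# and can hide real patterns.
median_value = toy["dafw_num_away"].median()
imputed = toy.assign(dafw_num_away=toy["dafw_num_away"].fillna(median_value))

print(len(dropped), len(placeholder), len(imputed))  # 2 4 4
print(median_value)  # 2.0
```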

In [None]:
def preprocess_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """
    Preprocesses the given DataFrame by performing data cleaning, type conversions,
    and mapping of categorical variables for better interpretability and analysis.

    Parameters:
        df (pd.DataFrame): The input DataFrame containing raw data.

    Returns:
        pd.DataFrame: A cleaned and preprocessed DataFrame with the following transformations:
            - String columns are stripped of whitespace, tabs, and excess spaces.
            - Numeric columns are converted to appropriate numeric types with downcasting.
            - Categorical columns are mapped to meaningful labels and converted to 'category' type.
            - Date and time columns are converted to datetime or time objects as needed.
            - Missing values are imputed based on column-specific logic or filled with default placeholders.
            - Invalid or placeholder values in specific columns (e.g., EIN, soc_code) are replaced with standardized values.
            - Columns with redundant or irrelevant information (e.g., 'year_filing_for') are dropped.

    Key Transformations:
        - String Cleaning: Strips leading/trailing whitespace, removes tabs, and normalizes spacing.
        - Numeric Conversion: Downcasts numeric columns to optimize memory usage.
        - Categorical Mapping: Maps numeric or placeholder codes to meaningful labels.
        - Date/Time Conversion: Parses date and time strings into appropriate formats.
        - Missing Value Imputation:
            - 'case_number', 'company_name': Filled with "Not provided".
            - 'street_address': Filled with an empty string.
            - 'job_description': Filled with "No job description".
            - 'industry_description': Filled with "No description given".
            - 'naics_year': Filled with "Invalid NAICS codes".
            - 'ein': "Enter EIN" and missing values replaced with -1.
            - Mapped categorical columns (e.g., 'time_unknown'): Unmapped or missing codes become "Not stated".
        - Dropped Columns: 'year_filing_for'.

    Notes:
        - Assumes specific formats for date and time columns.
        - Handles invalid or missing data gracefully using pandas' built-in capabilities (e.g., `errors='coerce'`).
    """

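    # Drop rows with these two malformed EINs (a valid EIN has 9 digits)
    codes_to_drop = ['36-83962', '74-187392']
    df_copy = df[~df['ein'].isin(codes_to_drop)].copy()
    # Encode the "Enter EIN" placeholder and missing EINs as -1
    df_copy["ein"] = (
        df_copy["ein"].str.replace("Enter EIN", "-1").fillna("-1")
    )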
    # Define mappings and preprocessing rules
    to_string = [
        "establishment_id",
        "company_name",
        "street_address",
        "city",
        "zip_code",
        "industry_description",
        "case_number",
        "job_description",
        "soc_code",
        "soc_description",
        "establishment_name",
        "naics_code",
    ]
    to_numeric = [
        "annual_average_employees",
        "total_hours_worked",
        "dafw_num_away",
        "djtr_num_tr",
        "ein",
    ]
    categorical_mappings = {
        "establishment_type": {
            0.0: "Invalid entry",
            1.0: "Private industry",
            2.0: "State government entity",
            3.0: "Local government entity",
        },
        "size": {1: "<20", 2: "20-249", 21: "20-99", 22: "100-249", 3: "250+"},
        "incident_outcome": {
            1: "Death",
            2: "Days away from work (DAFW)",
            3: "Job transfer or restriction",
            4: "Other recordable case",
        },
        "type_of_incident": {
            1: "Injury",
            2: "Skin disorder",
            3: "Respiratory condition",
            4: "Poisoning",
            5: "Hearing Loss",
            6: "All other illness",
        },
        "time_unknown": {0: "No", 1: "Yes"},
        "soc_reviewed": {0: "Not reviewed", 1: "Reviewed", 2: "Not SOC coded"},
        "state": {
            "PA": "Pennsylvania",
            "GA": "Georgia",
            "VA": "Virginia",
            "TX": "Texas",
            "UT": "Utah",
            "AZ": "Arizona",
            "IN": "Indiana",
            "TN": "Tennessee",
            "WI": "Wisconsin",
            "NC": "North Carolina",
            "NY": "New York",
            "OH": "Ohio",
            "IA": "Iowa",
            "AK": "Alaska",
            "OK": "Oklahoma",
            "MN": "Minnesota",
            "MO": "Missouri",
            "IL": "Illinois",
            "CT": "Connecticut",
            "NE": "Nebraska",
            "LA": "Louisiana",
            "WV": "West Virginia",
            "NM": "New Mexico",
            "CO": "Colorado",
            "FL": "Florida",
            "CA": "California",
            "MD": "Maryland",
            "AL": "Alabama",
            "KY": "Kentucky",
            "MI": "Michigan",
            "SC": "South Carolina",
            "ID": "Idaho",
            "KS": "Kansas",
            "MS": "Mississippi",
            "AR": "Arkansas",
            "NV": "Nevada",
            "NH": "New Hampshire",
            "VT": "Vermont",
            "NJ": "New Jersey",
            "DE": "Delaware",
            "MA": "Massachusetts",
            "ND": "North Dakota",
            "WA": "Washington",
            "OR": "Oregon",
            "ME": "Maine",
            "SD": "South Dakota",
            "MT": "Montana",
            "PR": "Puerto Rico",
            "RI": "Rhode Island",
            "WY": "Wyoming",
            "HI": "Hawaii",
            "DC": "District of Columbia",
            "VI": "U.S. Virgin Islands",
            "GU": "Guam",
            "MP": "Northern Mariana Islands",
            "AS": "American Samoa",
        },
    }

    # Convert to string and clean text
    df_copy[to_string] = (
        df_copy[to_string]
        .astype("string")
        .apply(
            lambda col: col.str.strip()
            .str.replace(r"\t", "", regex=True)
            .str.replace(r"\s+", " ", regex=True)
        )
    )

    # Convert to numeric with downcasting
    df_copy[to_numeric] = df_copy[to_numeric].apply(
        pd.to_numeric, errors="coerce", downcast="integer"
    )
    # Convert the whole column at once so that downcasting applies to the Series
    df_copy["soc_probability"] = pd.to_numeric(
        df_copy["soc_probability"], errors="coerce", downcast="float"
    )

    # Map categorical columns
    for col, mapping in categorical_mappings.items():
        if col in df_copy:
            df_copy[col] = (
                df_copy[col].map(mapping).fillna("Not stated").astype("category")
            )

    # Handle specific columns
    df_copy["case_number"] = df_copy["case_number"].fillna("Not provided")
    df_copy["company_name"] = df_copy["company_name"].fillna("Not provided")
    df_copy["street_address"] = df_copy["street_address"].fillna("")
    df_copy["naics_year"] = (
        df_copy["naics_year"].fillna("Invalid NAICS codes").astype("category")
    )
    df_copy["industry_description"] = df_copy["industry_description"].fillna(
        "No description given"
    )
    df_copy["job_description"] = df_copy["job_description"].fillna("No job description")
    df_copy["soc_code"] = (
        df_copy["soc_code"].replace("0000", "00-0000").replace("9999", "99-9999")
    )

    # Date and time conversions
    date_columns = {
        "date_of_incident": "%m/%d/%Y",
        "date_of_death": "%m/%d/%Y",
        "created_timestamp": "%d%b%y:%H:%M:%S",
    }
    for col, fmt in date_columns.items():
        df_copy[col] = pd.to_datetime(df_copy[col], format=fmt, errors="coerce")


    time_columns = ["time_started_work", "time_of_incident"]

    for col in time_columns:
        # Convert to datetime
        df_copy[col] = pd.to_datetime(df_copy[col], format="%H:%M:%S.%f", errors="coerce")
        # TODO: potentially duplicated data, but decided to leave both
        # Add hours and minutes columns
        df_copy[f"{col}_hours"] = df_copy[col].dt.hour
        df_copy[f"{col}_minutes"] = df_copy[col].dt.minute

    # Every record is filed for 2023, so this column carries no information
    df_copy = df_copy.drop(columns="year_filing_for")
    # Outlier values are contained by default [172307584.0, 126.0]
    # and [307584.0, 273751.0]. The number of employees is taken from
    # https://www.zippia.com/golden-state-foods-careers-24869/demographics/
    # and then hours are imputed for roughly same scale companies
    # df_copy.loc[
    #     df["company_name"] == "Golden State Foods",
    #     ["annual_average_employees", "total_hours_worked"],
    # ] = [
    #     4000,
    #     df_copy.query("3000 < annual_average_employees < 5000")[
    #         "total_hours_worked"
    #     ].median(),
    # ]

    return df_copy


df_preproc = preprocess_dataframe(df)

In [None]:
# df["street_address"].sample(50)
# TODO: using the soc_code classifier find all the soc_descriptions
df[["soc_code", "soc_description"]].sample(50)

In [None]:
df_preproc.sample(5)

In [None]:
df_preproc.info()

In [None]:
df_preproc.query("soc_probability != 5 & soc_reviewed == 'Reviewed'")

In [None]:
output_directory = "datasets"
os.makedirs(output_directory, exist_ok=True)
# Save the preprocessed frame (not the raw one) under the "processed" file name
df_preproc.to_parquet(os.path.join(output_directory, "processed_data.parquet"), index=False)

### (c) (Data (What)) Choose one of the methods and implement it for the data set. Describe it in the section and mention what is the effect on the data.

In [None]:
df_numeric = df_preproc.select_dtypes(include=[np.number])

for column in df_numeric.columns:
    fig_incident = px.histogram(
        df_numeric, x=column, title=f"Histogram of {column}", nbins=50, text_auto=True
    )
    fig_incident.show()

In [None]:
df_time = df_preproc[df_preproc.select_dtypes(include=["datetime"]).columns]

for column in df_time.columns:
    fig_incident = px.histogram(
        df_time[df_time[column].notna()],
        x=column,
        title=f"Histogram of {column}",
        nbins=100,
    )
    fig_incident.show()

In [None]:
df_categorical = df_preproc.select_dtypes(include=["category"])

# Generate histograms for categorical data
for column in df_categorical.columns:
    fig = px.histogram(
        df_categorical[column].dropna(),
        x=column,
        title=f"Distribution of {column}",
        text_auto=True,
    )
    fig.show()

# Problem: implausible values for company employees and hours worked


In [None]:
# Rows reporting more than 3.5 million employees
df_preproc.query("annual_average_employees > 3_500_000")[["annual_average_employees", "total_hours_worked"]]

In [None]:
# Rows reporting more than 2 billion total hours worked
df_preproc.query("total_hours_worked > 2_000_000_000")

In [None]:
ratio_df = df_preproc.copy()
ratio_df["hours_employee_ratio"] = (
    ratio_df["total_hours_worked"] / ratio_df["annual_average_employees"]
)
# Number of rows per company with an implausible hours-per-employee ratio
ratio_df.query("hours_employee_ratio < 1 | hours_employee_ratio > 4000").groupby(
    "company_name"
)["establishment_id"].count()
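
The ratio check above can be sketched on toy data to make the edge cases explicit. This is a minimal sketch, not the notebook's own method: the toy rows are made up, and treating a zero head count as missing (so the division yields `NaN` instead of `inf`) is an assumption added here. The thresholds of 1 and 4000 hours per employee are the ones used in the cell above.

```python
import numpy as np
import pandas as pd

# Toy rows: values are invented to cover each edge case once.
toy = pd.DataFrame(
    {
        "company_name": ["A", "A", "B", "C"],
        "establishment_id": ["1", "2", "3", "4"],
        "annual_average_employees": [100, 0, 50, 10],
        "total_hours_worked": [200_000, 5_000, 20, 90_000],
    }
)

# Assumption: a head count of zero is treated as missing rather than dividing by zero.
employees = toy["annual_average_employees"].replace(0, np.nan)
toy["hours_employee_ratio"] = toy["total_hours_worked"] / employees

ratio = toy["hours_employee_ratio"]
# Flag ratios below 1, above 4000, or undefined.
implausible = toy[(ratio < 1) | (ratio > 4000) | ratio.isna()]

# Count distinct establishments per company, since one establishment
# can appear in many case rows.
flagged_per_company = implausible.groupby("company_name")["establishment_id"].nunique()
print(flagged_per_company)
```

For scale, a calendar year has 8760 hours, so a ratio above that cannot be a single employee's real working time.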

In [None]:
analyze_and_plot_missing_values(df_preproc)

## Exercise 4 – Data (What) Abstraction

| id | Variable Name               | Description                                                                                                                                                                        | Data Type         |
|----|-----------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------|
| 1  | establishment_id           | Identifier for the establishment.                                                                                                                                                  | string            |
| 2  | establishment_name         | The name of the establishment reporting data.                                                                                                                                       | string            |
| 3  | ein                        | Employer Identification Number (EIN), also known as the Federal Tax Identification Number. Has a 9-digit format. If not given, states “No EIN Given”.                              | float64           |
| 4  | company_name               | The name of the company that owns the establishment.                                                                                                                               | string            |
| 5  | street_address             | The street address of the establishment. If not given, states “Not provided”.                                                                                                      | string            |
| 6  | city                       | The city where the establishment is located.                                                                                                                                       | string            |
| 7  | state                      | Full name of the state or territory where the establishment is located.                                                                                                            | category          |
| 8  | zip_code                   | The full zip code for the establishment. Can be converted to numbers, but stored as a string for interpretability.                                                                 | string            |
| 9  | naics_code                 | The North American Industry Classification System (NAICS) code for the establishment. Data use a 2012, 2017, or 2022 NAICS code.                                                  | string            |
| 10 | naics_year                 | The calendar year reflecting the version of NAICS codes used by the establishment [2012, 2017, or 2022]. Invalid NAICS codes are shown as “Invalid NAICS codes”.                   | category          |
| 11 | industry_description       | The industry description for the establishment.                                                                                                                                    | string            |
| 12 | establishment_type         | Type of establishment: Private industry, State government entity, Local government entity.                                                                                         | category          |
| 13 | size                       | The size of the establishment is employer-reported and based on the maximum number of employees who worked there at any point in the year: <20, 20-249, 20-99, 100-249, 250+.      | category          |
| 14 | annual_average_employees   | The annual average number of employees at the establishment. Note: This field should not be summed across cases in an establishment.                                               | int32             |
| 15 | total_hours_worked         | The total hours worked by all employees at the establishment. Note: This field should not be summed across cases in an establishment.                                              | int64             |
| 16 | case_number                | An employer-assigned case number for each unique case (i.e., injured/ill employee).                                                                                                | string            |
| 17 | job_description            | The job title of the injured/ill employee.                                                                                                                                          | string            |
| 18 | soc_code                   | The 2018 Standard Occupation Code (SOC) assigned by the NIOSH Industry and Occupation Computerized Coding System (NIOCCS) or OSHA.                                                  | string            |
| 19 | soc_description            | Text description of the 2018 SOC Code.                                                                                                                                              | string            |
| 20 | soc_reviewed               | Indicator variable as to whether the SOC code was manually reviewed before posting: Not reviewed, Reviewed, Not SOC coded.                                                         | category          |
| 21 | soc_probability            | The score given by the NIOSH Industry and Occupation Computerized Coding System (NIOCCS) for the expected accuracy of the SOC code. Codes assigned directly by OSHA are given a score of 5. | float64           |
| 22 | date_of_incident           | The date the incident occurred.                                                                                                                                                    | datetime64[ns]    |
| 23 | incident_outcome           | The most serious outcome that occurred: Death, Days away from work (DAFW), Job transfer or restriction, Other recordable case.                                                     | category          |
| 24 | dafw_num_away              | The number of days away from work the employee required to recover from the incident before returning to work.                                                                      | int16             |
| 25 | djtr_num_tr                | The number of days the employee needed to be transferred or reassigned to another job or placed on restricted duty due to the incident.                                             | int16             |
| 26 | type_of_incident           | The type of incident that occurred: Injury, Skin disorder, Respiratory condition, Poisoning, Hearing Loss, All other illness.                                                      | category          |
| 27 | time_started_work          | The time the affected employee started work prior to the incident.                                                                                                                 | datetime64[ns]    |
| 28 | time_of_incident           | The time the incident occurred. May contain missing values.                                                                                                                        | datetime64[ns]    |
| 29 | time_unknown               | Was the time of the incident unknown? Yes, No.                                                                                                                                      | category          |
| 30 | date_of_death              | The date the death occurred, if applicable. May contain missing values.                                                                                                            | datetime64[ns]    |
| 31 | created_timestamp          | Timestamp when the record was created. May contain missing values.                                                                                                                 | datetime64[ns]    |
| 32 | time_started_work_minutes  | The minute component of the time the affected employee started work prior to the incident. Can have missing values.                           | float64           |
| 33 | time_of_incident_hours     | The hour component of the time the incident occurred. Can have missing values.                                                                | float64           |
| 34 | time_of_incident_minutes   | The minute component of the time the incident occurred. Can have missing values.                                                              | float64           |
