# Data Cleaning Utilities

This section defines all helper functions used to prepare OSMI Mental Health in Tech survey data from multiple years (2014–2023).
The goal is to create a clean, unified dataset specifically for Goal 2, which focuses on predicting mental-health disorder likelihood using Random Forest and Sentence-BERT embeddings.


In [1]:
import pandas as pd
import numpy as np

## Combine all free-text fields into a single column
Combining all free-text responses into a single field lets a transformer model embed everything a respondent wrote as one passage, instead of treating each small text box separately. The aim is a richer text signal per respondent, which should help the model pick up subtle patterns related to mental health.


In [2]:
def clean_val(v):
    bad_values = {"-1", ".", " ", "", "nan"}

    if pd.isna(v):
        return None
    v = str(v).strip()
    return None if v in bad_values else v
    

def free_text(df):
    df = df.replace("-1", np.nan)
    df = df.replace(-1, np.nan)

    exclude_free_text = {
    "What country do you live in?",
    "What country do you work in?",
    "If you live in the United States, which state or territory do you live in?",
    "If yes, what condition(s) have you been diagnosed with?", # this would be data leakage
    "If maybe, what condition(s) do you believe you have?", # this also
    "What US state or territory do you work in?",
    "What is your age?",
    "What is your gender?",
    "SurveyID",
    "UserID",
    "Start Date (UTC)",
    "Submit Date (UTC)",
    "Network ID",
    }
    UNIQUE_THRESHOLD = 50       # columns with > 50 unique values are likely free text
    AVG_LENGTH_THRESHOLD = 20   # average length > 20 characters are likely free text

    # collect the names of the columns that look like free text
    free_text_cols = []
    for col in df.columns:
        if col in exclude_free_text:
            continue
            
        num_unique = df[col].nunique()
        avg_len = df[col].astype(str).apply(len).mean()
        if num_unique > UNIQUE_THRESHOLD or avg_len > AVG_LENGTH_THRESHOLD:
            free_text_cols.append(col)

    print("Dataset free-text columns are: ")
    for col in free_text_cols:
        print("-", col)
        
    # add all free text into one column & drop the others
    df["all_text"] = df[free_text_cols].apply(
        lambda row: " ".join(
            [clean_val(v) for v in row if clean_val(v) is not None]),
        axis=1)
    
    df["all_text"] = df["all_text"].replace("", np.nan)
    df.drop(columns=free_text_cols, inplace=True)

    return df
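
One side effect of the length check is worth illustrating: `astype(str)` turns missing cells into the string `"nan"`, so a sparsely answered text column has its average length pulled towards 3 and can slip under the threshold. The sketch below is not part of the pipeline; `looks_like_free_text` is a hypothetical helper and the toy data is invented. It measures only the answered cells.

```python
import numpy as np
import pandas as pd


def looks_like_free_text(series, unique_threshold=50, avg_length_threshold=20):
    """Hypothetical variant of the heuristic that ignores missing cells."""
    answered = series.dropna().astype(str)
    if answered.empty:
        return False
    many_values = answered.nunique() > unique_threshold
    long_answers = answered.str.len().mean() > avg_length_threshold
    return bool(many_values or long_answers)


# Invented toy data: one long comment among mostly empty cells.
toy = pd.DataFrame({
    "comment": ["My manager was very supportive when I asked for leave.",
                np.nan, np.nan, np.nan],
    "answer": ["Yes", "No", "Yes", np.nan],
})

# Average length as computed in free_text(): NaN counts as the 3-character "nan".
naive_avg = toy["comment"].astype(str).apply(len).mean()

flags = {col: looks_like_free_text(toy[col]) for col in toy.columns}
print(naive_avg, flags)
```

In this toy case the original average stays below 20 while the variant still flags `comment`, so the two approaches can disagree on sparse columns.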

## Combine free-text (2023 version)
The 2023 dataset uses different schemas and column names. This function performs the same free-text merging logic while adapting to the updated structure.


In [3]:
def free_text_2023(df):
    df = df.replace("-1", np.nan)
    df = df.replace(-1, np.nan)

    free_text_cols = [
        "Describe the conversation you had with your employer about your mental health, including their reactions and what actions were taken to address your mental health issue/questions.",
        "Describe the conversation with coworkers you had about your mental health including their reactions.",
        "Describe the conversation your coworker had with you about their mental health (please do not use names).",
        "Describe the conversation you had with your previous employer about your mental health, including their reactions and actions taken to address your mental health issue/questions.",
        "Describe the conversation you had with your previous coworkers about your mental health including their reactions.",
        "Describe the conversation your coworker had with you about their mental health (please do not use names)..1",
        "Describe the circumstances of the supportive or well handled response.",
        "Describe the circumstances of the badly handled or unsupportive response.",
        "Briefly describe what you think the industry as a whole and/or employers could do to improve mental health support for employees.",
        "Would you have felt more comfortable talking to your previous employer about your physical health or your mental health?",
        "Were you aware of the options for mental health care provided by your previous employers?",
        "Would you have been willing to discuss your mental health with your direct supervisor(s)?",
        "Would you have been willing to discuss your mental health with your coworkers at previous employers?",
        "If there is anything else you would like to tell us that has not been covered by the survey questions, please use this space to do so.",
        "Why or why not?"
    ]
    
    df["all_text"] = df[free_text_cols].apply(
        lambda row: " ".join(
            [clean_val(v) for v in row if clean_val(v) is not None]),
        axis=1)
    
    df["all_text"] = df["all_text"].replace("", np.nan)
    df.drop(columns=free_text_cols, inplace=True)

    return df

## Remove columns & rows with too many missing values

To improve dataset reliability:

- remove columns where 50% or more of the values are missing
- remove respondents with missing values in 50% or more of the remaining columns

This ensures that the final dataset contains enough meaningful information for classification models.

In [4]:
def clean_cols_and_rows(df):
    # Keep only columns where more than 50% of respondents answered
    non_null_counts = df.notna().sum()

    valid_cols = non_null_counts[non_null_counts > df.shape[0] / 2].index
    df = df[valid_cols]
    print("Remaining columns:", df.shape[1])

    # Remove respondents (rows) that have too many missing values
    row_nan_counts = df.isna().sum(axis=1)
    
    threshold = df.shape[1] * 0.5
    
    print("Threshold is", threshold, "missing values per row")
    df = df[row_nan_counts < threshold].reset_index(drop=True)

    print("Remaining rows:", df.shape[0])

    return df
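
Because both comparisons are strict, a column or row that is missing exactly half of its values is dropped. A toy check on invented data, using the same expressions as `clean_cols_and_rows`:

```python
import numpy as np
import pandas as pd

# Invented data: column "b" is missing exactly half of its values.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [1, np.nan, np.nan, 4],
    "c": [1, 2, np.nan, 4],
})

# Column filter: keep columns answered by strictly more than half of the rows.
non_null = toy.notna().sum()
kept_cols = non_null[non_null > len(toy) / 2].index.tolist()
reduced = toy[kept_cols]

# Row filter: keep rows with strictly fewer missing values than half the columns.
row_nan = reduced.isna().sum(axis=1)
kept_rows = reduced[row_nan < reduced.shape[1] * 0.5]

print(kept_cols)
print(kept_rows.index.tolist())
```

Column `b` is removed, and the row that is missing one of the two remaining columns is removed as well.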


## Clean the gender column

Gender entries vary across survey years (e.g., `"Male"`, `"male"`). This function standardizes them into consistent categories.

In [5]:
def clean_gender(value):
    if pd.isna(value):
        return np.nan
    
    v = str(value).strip().lower()

    # Male
    male = {
        "male", "cishet male", "male-ish", "ostensibly male",
        "masculine", "male/androgynous", "masculino", "m", 
        "cisgender male", "mail", "cis male", "cis-het male",
        "male/he/him", "dude", "man"
    }
    if v in male:
        return "Male"

    # Female
    female = {
        "female", "female-ish", "female-identified",
        "female assigned at birth", "woman-identified",
        "fm", "femmina", "female-bodied; no feelings about gender",
        "gender non-conforming woman", "f", "woman", "female, she/her",
        "female (cis)", "female, cis-gendered"
    }
    if v in female:
        return "Female"

    # Transgender
    trans = [
        "trans woman", "trans man", "trans female", "transgender woman",
        "transitioned, m2f", "mtf", "male (trans, ftm)", "transgender",
        "transfeminine", "other/transfeminine", "queer",
        "trans non-binary/genderfluid"
    ]
    if v in trans:
        return "Transgender"

    # Non-binary
    # entries must be lower-case because the input is lower-cased before lookup
    nb = [
        "non-binary", "non binary", "nonbinary", "nb", "enby",
        "genderfluid", "genderqueer", "agender", "androgynous",
        "demiguy", "bigender", "nonbinary/femme", 
        "genderflux demi-girl", "genderqueer demigirl",
        "genderqueer/non-binary", "non-binary and gender fluid",
        "genderfluid (born female)", "genderqueer woman",
        "nb masculine", "non-binary/agender"
    ]
    if v in nb:
        return "Non-binary"

    # Everything else to Other
    return "Other"
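
Because the input is stripped and lower-cased before lookup, every vocabulary entry has to be stripped and lower-case as well, otherwise it can never match. A small self-check along these lines can catch such entries; `find_unreachable` and the toy vocabulary are illustrative names, not part of the notebook.

```python
def find_unreachable(vocab):
    """Return entries that a stripped, lower-cased input can never equal."""
    return sorted(entry for entry in vocab if entry != entry.strip().lower())


# Invented toy vocabulary with two entries that would never match.
toy_vocab = ["non-binary", "enby", "Non-Binary", " agender"]
unreachable = find_unreachable(toy_vocab)
print(unreachable)
```

Running the same check over the four vocabularies in `clean_gender` is a cheap way to confirm that each entry is reachable.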

## Clean and validate respondent age

Survey participants occasionally enter unrealistic ages (e.g., 2, 500, “twenty”). We convert the column to numeric form and treat clearly invalid values as missing:
- ages below 15  
- ages above 80

Missing values, whether originally empty, non-numeric, or out of range, are then filled with the median age.

In [6]:
def clean_age(df, col, min_age=15, max_age=80):
    
    df[col] = pd.to_numeric(df[col], errors="coerce")
    
    df.loc[
        (df[col] < min_age) | 
        (df[col] > max_age), 
        col
    ] = np.nan
    
    median_age = df[col].median()
    df[col] = df[col].fillna(median_age)
    
    return df
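
A toy run of the same three steps (coerce to numeric, blank out-of-range values, fill with the median), written inline so it can be executed on its own. The sample values are invented.

```python
import pandas as pd

# Invented raw answers: strings, numbers, an implausible age and a missing value.
raw = pd.Series(["25", "twenty", 2, 500, 40, None])

ages = pd.to_numeric(raw, errors="coerce")   # non-numeric text becomes NaN
ages = ages.where(ages.between(15, 80))      # values outside 15..80 become NaN
ages = ages.fillna(ages.median())            # median of the remaining valid ages

print(ages.tolist())
```

Here four of the six values end up as the median, which shows how strongly the fill can flatten the age distribution when many entries are invalid.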

<br>

# Load Raw Survey Data (2014–2023)

We load all raw survey files into memory:

- `kaggle_uncleaned.csv` — contains the OSMI survey data for 2014 and 2016–2019  
- `osmi_mental_health_20XX.csv` — individual survey exports for 2020–2023  

These files differ significantly in schema, so each year must be processed **separately** before merging.

---

## Clean Each Dataset Individually Before Merging

Each survey year has:

- different column names  
- different free-text fields  
- different missing-value conventions  
- different age and gender formats  
- different response structures  

For this reason, **we clean each dataset independently** before merging them into a unified format.

For every dataset:

1. Add a `SurveyID` field representing the year  
2. Apply year-appropriate free-text consolidation  
3. Drop columns and rows with excessive missing values  
4. Normalize gender into unified categories  
5. Validate and clean age values  
6. Store the cleaned dataframe back into a list  

After all datasets are cleaned, they are merged in a later step.

In [7]:
df_kaggle = pd.read_csv("../data/raw/kaggle_uncleaned.csv")
df20 = pd.read_csv("../data/raw/osmi_mental_health_2020.csv")
df21 = pd.read_csv("../data/raw/osmi_mental_health_2021.csv")
df22 = pd.read_csv("../data/raw/osmi_mental_health_2022.csv")
df23 = pd.read_csv("../data/raw/osmi_mental_health_2023.csv")


osmi = [df_kaggle, df20, df21, df22, df23]
years = ["kaggle", "2020", "2021", "2022", "2023"]

for i, df in enumerate(osmi):
    print("Dataset:", i+1)
    
    if "#" in df.columns:
        df = df.drop(columns=["#"])
        
    if i != 0:  # the Kaggle dataset already has a SurveyID column
        df["SurveyID"] = years[i]
    
    if years[i] == "2023":  # the 2023 export uses its own free-text column list
        df = free_text_2023(df)
    else: 
        df = free_text(df)

    df = clean_cols_and_rows(df)

    df["Gender_cleaned"] = df["What is your gender?"].apply(clean_gender)
    df = df.drop(columns=["What is your gender?"])

    print()
    
    df = clean_age(df, "What is your age?")
    osmi[i] = df


osmi_dict = {years[i]: osmi[i] for i in range(len(osmi))}

Dataset: 1
Dataset free-text columns are: 
- Any additional notes or comments
- Briefly describe what you think the industry as a whole and/or employers could do to improve mental health support for employees.
- Describe the circumstances of the badly handled or unsupportive response.
- Describe the conversation with coworkers you had about your mental health including their reactions.
- Describe the conversation you had with your employer about your mental health, including their reactions and what actions were taken to address your mental health issue/questions.
- Describe the conversation you had with your previous coworkers about your mental health including their reactions.
- Describe the conversation you had with your previous employer about your mental health, including their reactions and actions taken to address your mental health issue/questions.
- Describe the conversation your coworker had with you about their mental health (please do not use names).
- If there is anything 

In [8]:
for name, df in osmi_dict.items():
    print(name, df.shape)

for year in osmi_dict:
    print(year, "→", "all_text" in osmi_dict[year].columns)

kaggle (2941, 37)
2020 (172, 53)
2021 (125, 56)
2022 (154, 57)
2023 (6, 63)
kaggle → True
2020 → True
2021 → True
2022 → True
2023 → True


<br>

# Find Similarity Between Column Names and Harmonize with a Canonical Name
This step compares column names across the datasets to find names that are worded almost identically.  
The goal is to detect columns that represent the same concept but are written differently (e.g. `"FirstName"`, `"first_name"`, `"fname"`) and unify them under a single canonical name.

A basic string similarity metric (`SequenceMatcher`) is used to measure how closely two column names match.  
Based on these similarity scores, column names that refer to the same concept are grouped together, and one canonical name is chosen for each group.

Result: a standardized set of column names across all sources, making further cleaning, analysis, and merging much easier.


In [9]:
from difflib import SequenceMatcher

In [10]:
# Check fuzzy similarity between column names
def similar(a,b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()
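
To get a feel for the scores, the sketch below applies the same ratio to two question pairs taken from column names that appear earlier in this notebook. It is an illustration only; `similar` is repeated so the snippet runs on its own.

```python
from difflib import SequenceMatcher


def similar(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Same question, differing only by the asterisks added in later exports.
plain = ("Have you observed or experienced an unsupportive or badly handled "
         "response to a mental health issue in your current or previous workplace?")
starred = ("Have you observed or experienced an *unsupportive or badly handled "
           "response* to a mental health issue in your current or previous workplace?")
same_question = similar(plain, starred)

# Two different questions that share most of their wording.
different_question = similar("What country do you live in?",
                             "What country do you work in?")

print(round(same_question, 3), round(different_question, 3))
```

Both pairs score above a cut-off such as 0.80. The second pair is why matched names deserve a manual look before renaming: a high ratio means similar wording, not necessarily the same question.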

## Compute similarity between column names
This step calculates a similarity score for every possible pair of column names across all datasets. For each pair, a string similarity metric (`SequenceMatcher`) is applied to quantify how closely the two names match.

In [11]:
dfs = osmi_dict.copy()
similar_pairs = []
threshold = 0.80

all_cols = []
for name, df in dfs.items():
    for col in df.columns:
        all_cols.append((name, col))

for i in range(len(all_cols)):
    name1, c1 = all_cols[i]
    for j in range(i+1, len(all_cols)):
        name2, c2 = all_cols[j]
        
        score = similar(c1, c2)
        if score >= threshold and c1 != c2:
            similar_pairs.append((score, c1, c2, name1, name2))
            
# Sort strongest matches first
similar_pairs_sorted = sorted(similar_pairs, key=lambda x: -x[0])

# Preview top matches
for s, c1, c2, n1, n2 in similar_pairs_sorted[:40]:
    print(f"{s:.3f} | {n1}:{c1} <--> {n2}:{c2}")

0.993 | kaggle:Have you observed or experienced an unsupportive or badly handled response to a mental health issue in your current or previous workplace? <--> 2020:Have you observed or experienced an *unsupportive or badly handled response* to a mental health issue in your current or previous workplace?
0.993 | kaggle:Have you observed or experienced an unsupportive or badly handled response to a mental health issue in your current or previous workplace? <--> 2021:Have you observed or experienced an *unsupportive or badly handled response* to a mental health issue in your current or previous workplace?
0.993 | kaggle:Have you observed or experienced an unsupportive or badly handled response to a mental health issue in your current or previous workplace? <--> 2022:Have you observed or experienced an *unsupportive or badly handled response* to a mental health issue in your current or previous workplace?
0.993 | kaggle:Have you observed or experienced an unsupportive or badly handled resp

## Select the best canonical name for each group of similar columns
Using the similarity scores, we identify which column names belong together. For each group of similar names, the shortest one is chosen as the canonical version. This ensures consistent naming across all datasets.


In [12]:
# Finding the shortest column name among each similar group as canonical name
canonical_map = {}

# All pairs are put into one cluster-building process
clusters = []

for score, c1, c2, n1, n2 in similar_pairs_sorted:
    placed = False
    for cluster in clusters:
        if c1 in cluster or c2 in cluster:
            cluster.add(c1)
            cluster.add(c2)
            placed = True
            break
    if not placed:
        clusters.append(set([c1, c2]))

# Choose canonical names
for cluster in clusters:
    canonical = min(cluster, key=len)  # shortest name
    for col in cluster:
        canonical_map[col] = canonical

print("Number of harmonized columns:", len(canonical_map))
#canonical_map

Number of harmonized columns: 46
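
The loop above attaches each pair to the first cluster that contains either name, so two clusters that are only linked by a later pair are never merged. A possible alternative, sketched here with a small union-find structure, groups every chain of similar names into one cluster. The function name `build_clusters` and the toy pairs are assumptions for illustration, and the alphabetical tie-break is an added choice, not something the notebook does.

```python
def build_clusters(pairs):
    """Group names so that any chain of similar pairs ends up in one cluster."""
    parent = {}

    def find(name):
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for first, second in pairs:
        root_a, root_b = find(first), find(second)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for name in list(parent):
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())


# Toy pairs: the third pair links the two clusters created by the first two.
toy_pairs = [("a", "bb"), ("ccc", "dddd"), ("bb", "ccc")]
clusters = build_clusters(toy_pairs)

# Break length ties alphabetically so the canonical name is deterministic.
canonical = {name: min(group, key=lambda s: (len(s), s))
             for group in clusters for name in group}
print(clusters, canonical)
```

With the real data the pairs would come from `similar_pairs_sorted`, passing only the two column names of each tuple.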


## Apply harmonization to all datasets
In this step, the canonical column-name mapping is applied to every dataset. Each dataset's columns are renamed using the `canonical_map`, ensuring that all equivalent columns across different sources share the same standardized name.

After renaming, some datasets may end up with duplicated column names (e.g., when two different original columns are mapped to the same canonical name). To prevent conflicts, duplicated columns are removed by keeping only the first occurrence.

In [13]:
dfs_harmonized = {}

for name, df in dfs.items():           # dfs includes "kaggle", "2020", "2021", "2022", "2023"
    # rename columns via canonical_map
    new_cols = {col: canonical_map.get(col, col) for col in df.columns}
    tmp = df.rename(columns=new_cols)

    # drop duplicated column names (keep first occurrence)
    tmp = tmp.loc[:, ~tmp.columns.duplicated()]

    dfs_harmonized[name] = tmp

# sanity check
for name, df in dfs_harmonized.items():
    dups = df.columns[df.columns.duplicated()]
    print(name, "num duplicate cols after fix:", len(dups))

kaggle num duplicate cols after fix: 0
2020 num duplicate cols after fix: 0
2021 num duplicate cols after fix: 0
2022 num duplicate cols after fix: 0
2023 num duplicate cols after fix: 0
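
Dropping duplicates by position discards the data of every later column that was mapped to the same canonical name. The toy example below, with an invented mapping, lists the colliding names before they are dropped so that they can be reviewed.

```python
import pandas as pd

# Invented example: two different questions mapped to one canonical name.
toy = pd.DataFrame({
    "What country do you live in?": ["US", "UK"],
    "What country do you work in?": ["US", "DE"],
})
toy_map = {col: "What country do you live in?" for col in toy.columns}
renamed = toy.rename(columns=toy_map)

# Names that occur more than once after renaming.
collisions = sorted(set(renamed.columns[renamed.columns.duplicated(keep=False)]))

# Same de-duplication as in the cell above: only the first column survives.
deduped = renamed.loc[:, ~renamed.columns.duplicated()]
print(collisions)
print(deduped)
```

In this toy case the work-country answers are lost, which is harmless only when the colliding columns really hold the same information.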


# Compute Common Columns across Harmonized Datasets
After column names have been harmonized, this step identifies which columns are present in *every* dataset. The result is the list of canonical column names that appear consistently in Kaggle and all OSMI survey years.


In [14]:
common_semantic = None
for name, df in dfs_harmonized.items():
    cols = set(df.columns)
    if common_semantic is None:
        common_semantic = cols
    else:
        common_semantic &= cols

common_semantic = sorted(common_semantic)
print("Common columns across Kaggle + all OSMI years:", len(common_semantic))
print(common_semantic)

Common columns across Kaggle + all OSMI years: 26
['Are you self-employed?', 'Did your previous employers ever formally discuss mental health (as part of a wellness campaign or other official communication)?', 'Do you currently have a mental health disorder?', 'Do you have a family history of mental illness?', 'Do you have previous employers?', 'Do you know the options for mental health care available under your employer-provided health coverage?', 'Does your employer offer resources to learn more about mental health disorders and options for seeking help?', 'Does your employer provide mental health benefits as part of healthcare coverage?', 'Gender_cleaned', 'Have you ever sought treatment for a mental health disorder from a mental health professional?', 'Have you had a mental health disorder in the past?', 'Have you observed or experienced a *supportive or well handled response* to a mental health issue in your current or previous workplace?', 'Have your previous employers provided m
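
The same intersection can be written in one expression, and it is often useful to see which columns each source loses in the process. A sketch on invented column lists (the dictionary below is a stand-in for the harmonized dataframes):

```python
# Invented stand-in for {name: df.columns} of the harmonized datasets.
columns_by_source = {
    "kaggle": ["Gender_cleaned", "all_text", "What is your age?"],
    "2020": ["Gender_cleaned", "all_text", "SurveyID"],
    "2021": ["Gender_cleaned", "What is your age?", "all_text"],
}

# Columns present in every source.
common = sorted(set.intersection(*(set(cols) for cols in columns_by_source.values())))

# Columns each source has to give up when only the common ones are kept.
dropped = {name: sorted(set(cols) - set(common))
           for name, cols in columns_by_source.items()}

print(common)
print(dropped)
```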

# Merge All Datasets Using the Expanded Common Columns
In this step, all harmonized datasets are merged into a single consolidated dataframe. Only the `common_semantic` columns are selected, which keeps the structure consistent across sources. This produces a fully aligned dataset suitable for combined analysis across Kaggle and all OSMI survey years.


In [15]:
merged_parts = [
    df[common_semantic].assign(source=name).copy() for name, df in dfs_harmonized.items()
]

final = pd.concat(merged_parts, ignore_index=True)
print("Final merged dataset shape before dropping 'source':", final.shape)

final = final.drop(columns="source")
print("Final dataset shape:", final.shape)

print("\nCanonical columns:")
for col in final.columns:
    print("-", col)

Final merged dataset shape before dropping 'source': (3398, 27)
Final dataset shape: (3398, 26)

Canonical columns:
- Are you self-employed?
- Did your previous employers ever formally discuss mental health (as part of a wellness campaign or other official communication)?
- Do you currently have a mental health disorder?
- Do you have a family history of mental illness?
- Do you have previous employers?
- Do you know the options for mental health care available under your employer-provided health coverage?
- Does your employer offer resources to learn more about mental health disorders and options for seeking help?
- Does your employer provide mental health benefits as part of healthcare coverage?
- Gender_cleaned
- Have you ever sought treatment for a mental health disorder from a mental health professional?
- Have you had a mental health disorder in the past?
- Have you observed or experienced a *supportive or well handled response* to a mental health issue in your current or previou

<br>

# Data Cleaning Methods for Improving Machine Learning
These cleaning steps reduce noise and make categories consistent across the dataset. By standardizing uncertain responses and converting ordinal text answers into numeric values, the data becomes easier for machine learning models to learn from. This typically leads to better model accuracy and more reliable predictions.

### Binary 

In [16]:
def clean_binary(df, cols):
    mapping = {"Yes": 1, "No": 0,
                "1": 1, "0": 0,
                1: 1, 0: 0
    }
    for col in cols:
        df[col] = df[col].replace(mapping)
        df[col] = pd.to_numeric(df[col], errors="coerce") 
    return df
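
As an alternative sketch (not the notebook's method), `Series.map` applies the same lookup and turns every value that is absent from the mapping into NaN in one step, so no separate `pd.to_numeric` pass is needed. The sample values are invented.

```python
import pandas as pd

mapping = {"Yes": 1, "No": 0, "1": 1, "0": 0, 1: 1, 0: 0}

# Invented answers, including one value ("Maybe") outside the mapping.
raw = pd.Series(["Yes", "No", "1", 0, "Maybe", None])

# Values not found in the mapping, and missing values, become NaN.
encoded = raw.map(mapping)
print(encoded.tolist())
```

The trade-off is that unexpected answers disappear silently, so it can help to inspect the unique raw values before encoding.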

### Uncertainty

In [17]:
def clean_uncertainty(df):
    mapping = {
    "Yes": "Yes",
    "No": "No",
    "Maybe": "Maybe",
    "Maybe/Not sure": "Maybe",
    "Possibly": "Maybe",
    "Don't Know": "Unknown",
    "I am not sure": "Unknown",
    "Unknown": "Unknown",
    "I don't know": "Unknown"
    }
    for col in df:
        df[col] = df[col].replace(mapping)

    return df

### Openness

In [18]:
def clean_openness(df, col="How willing would you be to share with friends and family that you have a mental illness?"):
    mapping = {
    "Very open": 10,
    "Somewhat open": 7,
    "Neutral": 5,
    "Somewhat not open": 3,
    "Not open at all": 0,
    "Not applicable to me (I do not have a mental illness)": 0
    }
    df[col] = df[col].replace(mapping)
    df[col] = pd.to_numeric(df[col], errors="coerce")  # convert to numeric
    
    return df

### Medical leave

In [19]:
def clean_medical_leave(df, col="If a mental health issue prompted you to request a medical leave from work, how easy or difficult would it be to ask for that leave?"):
    mapping = {
        "Very easy": 5,
        "Somewhat easy": 4,
        "Neither easy nor difficult": 3,
        "Somewhat difficult": 2,
        "Very difficult": 1,
        "Difficult": 1,   
        "Unknown": 0,
        "I don't know": 0
    }
    
    df[col] = df[col].replace(mapping).fillna(0)
    df[col] = pd.to_numeric(df[col], errors="coerce")
    return df

### Frequency

In [20]:
def clean_frequency(df, cols):
    mapping = {
        "Never": 1,
        "Rarely": 2,
        "Sometimes": 3,
        "Often": 4,
        "Not applicable to me": 0
    }
    for col in cols:
        df[col] = df[col].replace(mapping)
        df[col] = pd.to_numeric(df[col], errors="coerce")
    return df

### Discussion

In [21]:
def clean_discussion(df, col):
    mapping = {
        'Yes, they all did':'Yes',
        'Yes': 'Yes',
        'Some did' :'Partial',
        'No': 'No',
        'None did': 'No',
        'Unknown':'Unknown',
        'No, none did': 'No'
    }
    df[col] = df[col].replace(mapping)
    df[col] = df[col].fillna("Unknown")   # NaN -> "Unknown"
    return df


### Supportive response

In [22]:
def clean_supportive_response(df, col = "Have you observed or experienced a *supportive or well handled response* to a mental health issue in your current or previous workplace?"):
    mapping = {
        "Yes, I experienced": "Yes",
        "Yes, I observed": "Yes",
        "No": "No",
        "Maybe": "Maybe",
        "Unknown": "Unknown",
        "I've always been self-employed": "NA"
    }
    df[col] = df[col].replace(mapping).fillna("Unknown")
    return df

In [23]:
binary_cols = [
    "Are you self-employed?",
    "Do you have previous employers?",
    'Have you ever sought treatment for a mental health disorder from a mental health professional?',
    "Is your employer primarily a tech company/organization?"
]

freq_cols = [
    "If you have a mental health disorder, how often do you feel that it interferes with your work when being treated effectively?",
    "If you have a mental health disorder, how often do you feel that it interferes with your work when not being treated effectively (i.e., when you are experiencing symptoms)?"
]

ordinal_cols = [
    "How willing would you be to share with friends and family that you have a mental illness?",
    "If a mental health issue prompted you to request a medical leave from work, how easy or difficult would it be to ask for that leave?",
    "If you have a mental health disorder, how often do you feel that it interferes with your work when being treated effectively?",
    "If you have a mental health disorder, how often do you feel that it interferes with your work when not being treated effectively (i.e., when you are experiencing symptoms)?"
]

categorical_cols = [
    "Do you currently have a mental health disorder?",
    "Do you have a family history of mental illness?",
    "Have you had a mental health disorder in the past?",
    "Was your anonymity protected if you chose to take advantage of mental health or substance abuse treatment resources with previous employers?",
    "Would you feel comfortable discussing a mental health issue with your coworkers?",
]

### Applying all of the functions

In [24]:
df_cleaned = final.copy()
df_cleaned = clean_binary(df_cleaned, binary_cols)
df_cleaned = clean_uncertainty(df_cleaned)
df_cleaned = clean_openness(df_cleaned)
df_cleaned = clean_medical_leave(df_cleaned)
df_cleaned = clean_frequency(df_cleaned, freq_cols)
df_cleaned = clean_discussion(df_cleaned, "Did your previous employers ever formally discuss mental health (as part of a wellness campaign or other official communication)?")
df_cleaned = clean_discussion(df_cleaned, "Does your employer offer resources to learn more about mental health disorders and options for seeking help?")
df_cleaned = clean_discussion(df_cleaned, "Have your previous employers provided mental health benefits?")
df_cleaned = clean_supportive_response(df_cleaned, "Have you observed or experienced a *supportive or well handled response* to a mental health issue in your current or previous workplace?")



### Ordinal: Convert "Unknown" and NaN to 0
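Ordinal features must be numeric and follow a meaningful order.  
Replacing "Unknown" and missing values with 0 keeps every ordinal column fully numeric. On the frequency and medical-leave scales, 0 lies below the lowest real level and acts as a "no answer" code. On the openness scale, 0 is also the code for "Not open at all", so missing answers become indistinguishable from that response.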


In [25]:
for col in ordinal_cols:
    df_cleaned[col] = df_cleaned[col].replace("Unknown", 0).fillna(0)
    df_cleaned[col] = pd.to_numeric(df_cleaned[col], errors="coerce")
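
Since 0 stands for "no answer" here, a model cannot tell it apart from a genuine low score unless that information is kept elsewhere. One common option, sketched below with invented data and hypothetical variable names, is to store a separate indicator next to the encoded value.

```python
import pandas as pd

frequency_levels = {"Never": 1, "Rarely": 2, "Sometimes": 3, "Often": 4}

# Invented answers: two real levels, one "Unknown" and one missing value.
raw = pd.Series(["Often", "Unknown", None, "Never"])

encoded = raw.map(frequency_levels)          # answers outside the scale become NaN
was_missing = encoded.isna().astype(int)     # 1 where no real level was given
encoded = encoded.fillna(0).astype(int)      # 0 is the "no answer" code

print(encoded.tolist(), was_missing.tolist())
```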

### Convert NaN to "Unknown"
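Fill all remaining missing values with "Unknown" so that every categorical 
column is complete and encodable, which prevents errors and avoids row loss during
machine learning preprocessing. The fill applies to every column, so numeric columns
that still contain NaN (for example the tech-company column inspected in In [28])
end up holding a mix of numbers and the string "Unknown".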

In [26]:
df_cleaned = df_cleaned.fillna("Unknown")
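
If the numeric columns should stay numeric, one option is to restrict the fill to the non-numeric columns. This is a sketch on invented data, not what the notebook does, and the column names are placeholders.

```python
import numpy as np
import pandas as pd

# Invented data: one numeric column and one text column, both with a gap.
toy = pd.DataFrame({
    "is_tech": [1.0, np.nan, 0.0],
    "family_history": ["Yes", None, "No"],
})

# Fill only the columns that are not numeric.
text_cols = toy.select_dtypes(exclude="number").columns
filled = toy.copy()
filled[text_cols] = filled[text_cols].fillna("Unknown")

print(filled)
```

The numeric gap is left as NaN and can then be handled by an imputer or by a model that accepts missing values.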

In [27]:
print(str(df_cleaned.shape) + "\n")

for col in df_cleaned.columns:
    uniques = df_cleaned[col].unique()
    print(f"\n--- {col} ---")
    print(f"Unique values ({len(uniques)}): {uniques}")
    

(3398, 26)


--- Are you self-employed? ---
Unique values (2): [0. 1.]

--- Did your previous employers ever formally discuss mental health (as part of a wellness campaign or other official communication)? ---
Unique values (4): ['Unknown' 'No' 'Partial' 'Yes']

--- Do you currently have a mental health disorder? ---
Unique values (4): ['No' 'Yes' 'Maybe' 'Unknown']

--- Do you have a family history of mental illness? ---
Unique values (3): ['No' 'Yes' 'Unknown']

--- Do you have previous employers? ---
Unique values (2): [1. 0.]

--- Do you know the options for mental health care available under your employer-provided health coverage? ---
Unique values (3): ['Unknown' 'Yes' 'No']

--- Does your employer offer resources to learn more about mental health disorders and options for seeking help? ---
Unique values (4): ['No' 'Partial' 'Unknown' 'Yes']

--- Does your employer provide mental health benefits as part of healthcare coverage? ---
Unique values (4): ['Not eligible for coverage / 

In [28]:
df_cleaned["Is your employer primarily a tech company/organization?"]

0           1.0
1           1.0
2           1.0
3       Unknown
4           0.0
         ...   
3393        1.0
3394    Unknown
3395        0.0
3396    Unknown
3397        1.0
Name: Is your employer primarily a tech company/organization?, Length: 3398, dtype: object

In [29]:
# This column is the prediction target for Goal 2.
df_cleaned["Do you currently have a mental health disorder?"].value_counts()

Do you currently have a mental health disorder?
Yes        1368
No         1134
Maybe       719
Unknown     177
Name: count, dtype: int64

## Export into a CSV file

In [30]:
df_cleaned.to_csv("../data/processed/combined_clean_data.csv", index=False)
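
One thing to keep in mind when loading this file again: `pd.read_csv` treats the string `"NA"` as a missing value by default, and `clean_supportive_response` writes exactly that string for self-employed respondents. The round trip below uses an in-memory buffer and invented rows to show the effect and one way to avoid it.

```python
import io

import pandas as pd

# Invented rows containing the literal category "NA".
toy = pd.DataFrame({"response": ["Yes", "NA", "Unknown"]})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Default parsing turns "NA" into NaN.
default_read = pd.read_csv(io.StringIO(csv_text))

# Disabling the default NA strings keeps "NA" as a normal category.
literal_read = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default_read["response"].isna().tolist())
print(literal_read["response"].tolist())
```

Renaming the category to something other than `"NA"` before export would avoid the issue as well.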