# Data completion for Mental Health surveys

#### This notebook creates a .csv file from the 2014-2019 OSMI Mental Health Survey SQLite dataset. The resulting wide dataset is then combined with the datasets from 2020, 2021, 2022 and 2023.

## Let's first convert the SQLite file to .csv

In [1]:
import sqlite3
import pandas as pd
import numpy as np

In [2]:
# 1. Connect to SQLite database
db_path = "../data/raw/mental_health.sqlite"
conn = sqlite3.connect(db_path)

In [3]:
# 2. Define the join query (long format: one row per answer)
query = """
SELECT 
    A.SurveyID,
    A.UserID,
    A.QuestionID,
    Q.questiontext,
    A.AnswerText,
    S.Description AS SurveyDescription
FROM Answer A
LEFT JOIN Question Q ON A.QuestionID = Q.questionID
LEFT JOIN Survey S ON A.SurveyID = S.SurveyID;
"""

In [4]:
# 3. Load into pandas
df_long = pd.read_sql_query(query, conn)
conn.close()
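To see what the join produces without the real database, here is a minimal sketch on an in-memory SQLite database. The table layout mirrors the query above, but the rows are invented and the `demo_` names are only for illustration. It also shows that the `LEFT JOIN` keeps answers whose question is missing from `Question`.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# Toy stand-in for the real database; the rows are made up for illustration.
with closing(sqlite3.connect(":memory:")) as demo_conn:
    demo_conn.executescript(
        """
        CREATE TABLE Survey (SurveyID INTEGER, Description TEXT);
        CREATE TABLE Question (questionid INTEGER, questiontext TEXT);
        CREATE TABLE Answer (AnswerText TEXT, SurveyID INTEGER, UserID INTEGER, QuestionID INTEGER);
        INSERT INTO Survey VALUES (2014, 'demo survey');
        INSERT INTO Question VALUES (1, 'What is your age?');
        INSERT INTO Answer VALUES ('37', 2014, 1, 1);
        INSERT INTO Answer VALUES ('Yes', 2014, 1, 99);
        """
    )
    # Read while the connection is still open; closing() closes it afterwards.
    demo_long = pd.read_sql_query(
        """
        SELECT A.SurveyID, A.UserID, A.QuestionID, Q.questiontext, A.AnswerText
        FROM Answer A
        LEFT JOIN Question Q ON A.QuestionID = Q.questionid
        """,
        demo_conn,
    )

print(demo_long)
```

The answer with `QuestionID` 99 has no matching question, so its `questiontext` is empty. Such rows would silently disappear in the pivot step, so it can be worth checking for them first.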

In [5]:
df = df_long.pivot_table(
    index=["SurveyID", "UserID"],          # one row per user per survey
    columns="questiontext",                # one column per question
    values="AnswerText",
    aggfunc=lambda x: " | ".join(
        pd.Series(x).dropna().astype(str).unique()
    ),                                     # handle multi-answers
).reset_index()
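A small sketch of what this pivot does, on made-up data (`demo_long` and the question texts are hypothetical). User 1 gave two answers to the same question, so the `aggfunc` joins them with `" | "`.

```python
import pandas as pd

# Long format: one row per answer. User 1 answered "Condition?" twice.
demo_long = pd.DataFrame(
    {
        "SurveyID": [2016, 2016, 2016, 2016],
        "UserID": [1, 1, 1, 2],
        "questiontext": ["Condition?", "Condition?", "Age?", "Age?"],
        "AnswerText": ["Anxiety", "Depression", "30", "41"],
    }
)

# Wide format: one row per (SurveyID, UserID), one column per question.
demo_wide = demo_long.pivot_table(
    index=["SurveyID", "UserID"],
    columns="questiontext",
    values="AnswerText",
    aggfunc=lambda x: " | ".join(pd.Series(x).dropna().astype(str).unique()),
).reset_index()
demo_wide.columns.name = None

print(demo_wide)
```

User 2 never answered "Condition?", so that cell is NaN in the wide table.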

In [6]:
# Remove the column name from header row.
df.columns.name = None

In [7]:
pd.set_option("display.max_columns", None)
df.head()

Unnamed: 0,SurveyID,UserID,Any additional notes or comments,Are you openly identified at work as a person with a mental health issue?,Are you self-employed?,Briefly describe what you think the industry as a whole and/or employers could do to improve mental health support for employees.,Describe the circumstances of the badly handled or unsupportive response.,Describe the circumstances of the supportive or well handled response.,Describe the conversation with coworkers you had about your mental health including their reactions.,"Describe the conversation you had with your employer about your mental health, including their reactions and what actions were taken to address your mental health issue/questions.",Describe the conversation you had with your previous coworkers about your mental health including their reactions.,"Describe the conversation you had with your previous employer about your mental health, including their reactions and actions taken to address your mental health issue/questions.",Describe the conversation your coworker had with you about their mental health (please do not use names).,Describe the conversation your coworker had with you about their mental health (please do not use names)..1,Did you ever discuss your mental health with a previous coworker(s)?,Did you ever discuss your mental health with your previous employer?,Did you ever have a previous coworker discuss their or another coworker's mental health with you?,Did you feel that your previous employers took mental health as seriously as physical health?,Did you hear of or observe negative consequences for co-workers with mental health issues in your previous workplaces?,Did your previous employers ever formally discuss mental health (as part of a wellness campaign or other official communication)?,Did your previous employers provide resources to learn more about mental health disorders and how to seek help?,Do you believe your productivity is ever affected by a mental health issue?,Do you 
currently have a mental health disorder?,Do you feel that being identified as a person with a mental health issue would hurt your career?,Do you feel that your employer takes mental health as seriously as physical health?,Do you have a family history of mental illness?,Do you have medical coverage (private insurance or state-provided) that includes treatment of mental health disorders?,Do you have previous employers?,Do you know local or online resources to seek help for a mental health issue?,Do you know the options for mental health care available under your employer-provided health coverage?,Do you know the options for mental health care your employer provides?,Do you think that discussing a mental health disorder with previous employers would have negative consequences?,Do you think that discussing a mental health disorder with your employer would have negative consequences?,Do you think that discussing a mental health issue with your employer would have negative consequences?,Do you think that discussing a physical health issue with previous employers would have negative consequences?,Do you think that discussing a physical health issue with your employer would have negative consequences?,Do you think that team members/co-workers would view you more negatively if they knew you suffered from a mental health issue?,Do you work remotely (outside of an office) at least 50% of the time?,Do you work remotely?,Does your employer offer resources to learn more about mental health disorders and options for seeking help?,Does your employer provide mental health benefits as part of healthcare coverage?,Does your employer provide resources to learn more about mental health issues and how to seek help?,Has being identified as a person with a mental health issue affected your career?,Has your employer ever discussed mental health as part of an employee wellness program?,"Has your employer ever formally discussed mental health (for example, as part of a wellness campaign or 
other official communication)?",Have you ever been diagnosed with a mental health disorder?,Have you ever discussed your mental health with coworkers?,Have you ever discussed your mental health with your employer?,Have you ever had a coworker discuss their or another coworker's mental health with you?,Have you ever sought treatment for a mental health disorder from a mental health professional?,Have you had a mental health disorder in the past?,Have you heard of or observed negative consequences for co-workers who have been open about mental health issues in your workplace?,Have you heard of or observed negative consequences for coworkers with mental health conditions in your workplace?,Have you observed or experienced an unsupportive or badly handled response to a mental health issue in your current or previous workplace?,Have you observed or experienced supportive or well handled response to a mental health issue in your current or previous workplace?,Have your observations of how another individual who discussed a mental health disorder made you less likely to reveal a mental health issue yourself in your current workplace?,Have your previous employers provided mental health benefits?,How easy is it for you to take medical leave for a mental health condition?,How has it affected your career?,How many employees does your company or organization have?,How willing would you be to share with friends and family that you have a mental illness?,"If a mental health issue prompted you to request a medical leave from work, how easy or difficult would it be to ask for that leave?","If maybe, what condition(s) do you believe you have?","If there is anything else you would like to tell us that has not been covered by the survey questions, please use this space to do so.","If they knew you suffered from a mental health disorder, how do you think that your team members/co-workers would react?","If yes, what condition(s) have you been diagnosed with?","If yes, what percentage 
of your work time (time performing primary or secondary job functions) is affected by a mental health issue?","If you have a mental health condition, do you feel that it interferes with your work?","If you have a mental health disorder, how often do you feel that it interferes with your work when being treated effectively?","If you have a mental health disorder, how often do you feel that it interferes with your work when not being treated effectively (i.e., when you are experiencing symptoms)?","If you have been diagnosed or treated for a mental health disorder, do you ever reveal this to clients or business contacts?","If you have been diagnosed or treated for a mental health disorder, do you ever reveal this to coworkers or employees?","If you have revealed a mental health disorder to a client or business contact, how has this affected you or the relationship?","If you have revealed a mental health disorder to a coworker or employee, how has this impacted you or the relationship?","If you have revealed a mental health issue to a client or business contact, do you believe this has impacted you negatively?","If you have revealed a mental health issue to a coworker or employee, do you believe this has impacted you negatively?","If you live in the United States, which state or territory do you live in?",Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources provided by your employer?,Is your employer primarily a tech company/organization?,Is your primary role within your company related to tech/IT?,"Overall, how much importance did your previous employer place on mental health?","Overall, how much importance did your previous employer place on physical health?","Overall, how much importance does your employer place on mental health?","Overall, how much importance does your employer place on physical health?","Overall, how well do you think the tech industry supports employees with mental health 
issues?",Was your anonymity protected if you chose to take advantage of mental health or substance abuse treatment resources with previous employers?,Was your employer primarily a tech company/organization?,Were you aware of the options for mental health care provided by your previous employers?,What US state or territory do you work in?,What country do you live in?,What country do you work in?,What is your age?,What is your gender?,What is your race?,Which of the following best describes your work position?,Would you be willing to bring up a physical health issue with a potential employer in an interview?,Would you be willing to discuss a mental health issue with your coworkers?,Would you be willing to discuss a mental health issue with your direct supervisor(s)?,Would you be willing to talk to one of us more extensively about your experiences with mental health issues in the tech industry? (Note that all interview responses would be used _anonymously_ and only with your permission.),Would you bring up a mental health issue with a potential employer in an interview?,Would you bring up a physical health issue with a potential employer in an interview?,Would you feel comfortable discussing a mental health issue with your coworkers?,Would you feel comfortable discussing a mental health issue with your direct supervisor(s)?,Would you feel more comfortable talking to your coworkers about your physical health or your mental health?,Would you have been willing to discuss a mental health issue with your previous co-workers?,Would you have been willing to discuss your mental health with your direct supervisor(s)?,Would you have felt more comfortable talking to your previous employer about your physical health or your mental health?
0,2014,1,-1,,-1,,,,,,,,,,,,,,,,,,,,Yes,No,,,,,Not sure,,,No,,No,,No,,,Yes,Yes,,No,,,,,,1,,,No,,,,,Somewhat easy,,6-25,,,,,,,,Often,,,,,,,,,Illinois,Yes,1,,,,,,,,,,,United States,,37,Female,,,,Some of them,Yes,,No,Maybe,,,,,,
1,2014,2,-1,,-1,,,,,,,,,,,,,,,,,,,,Don't know,No,,,,,No,,,Maybe,,No,,No,,,Don't know,Don't know,,Don't know,,,,,,0,,,No,,,,,Don't know,,More than 1000,,,,,,,,Rarely,,,,,,,,,Indiana,Don't know,0,,,,,,,,,,,United States,,44,Male,,,,No,No,,No,No,,,,,,
2,2014,3,-1,,-1,,,,,,,,,,,,,,,,,,,,No,No,,,,,No,,,No,,No,,No,,,No,No,,No,,,,,,0,,,No,,,,,Somewhat difficult,,6-25,,,,,,,,Rarely,,,,,,,,,-1,Don't know,1,,,,,,,,,,,Canada,,32,Male,,,,Yes,Yes,,Yes,Yes,,,,,,
3,2014,4,-1,,-1,,,,,,,,,,,,,,,,,,,,No,Yes,,,,,Yes,,,Yes,,Yes,,No,,,No,No,,No,,,,,,1,,,Yes,,,,,Somewhat difficult,,26-100,,,,,,,,Often,,,,,,,,,-1,No,1,,,,,,,,,,,United Kingdom,,31,Male,,,,Some of them,No,,Maybe,Maybe,,,,,,
4,2014,5,-1,,-1,,,,,,,,,,,,,,,,,,,,Don't know,No,,,,,No,,,No,,No,,Yes,,,Yes,Don't know,,Don't know,,,,,,0,,,No,,,,,Don't know,,100-500,,,,,,,,Never,,,,,,,,,Texas,Don't know,1,,,,,,,,,,,United States,,31,Male,,,,Some of them,Yes,,Yes,Yes,,,,,,


In [8]:
print("Rows:", len(df))
print("Columns:", df.shape[1])

Rows: 4218
Columns: 107


## Clean the Kaggle dataset

In [9]:
df_filtered = df.copy()

In [10]:
# In the Kaggle dataset "-1" acts as a missing value.
df_filtered = df_filtered.replace("-1", np.nan)
df_filtered = df_filtered.replace(-1, np.nan)


In [11]:
# Drop duplicate column.
df_filtered.drop(columns=["Describe the conversation your coworker had with you about their mental health (please do not use names)..1"], inplace=True)

### Combine all free-text columns into one text field for a better embedding
Combining all free-text responses into a single field lets a transformer model embed everything a respondent wrote at once, instead of treating each small text box separately. The intent is to give the embedding more context per respondent, which may help the model pick up subtle patterns related to mental health.

In [12]:
UNIQUE_THRESHOLD = 50       # columns with more than 50 unique values are treated as free text
AVG_LENGTH_THRESHOLD = 20   # intended average-length cutoff (characters); not applied in the filter below

exclude_free_text = {
    "What country do you live in?",
    "What country do you work in?",
    "If you live in the United States, which state or territory do you live in?",
    "If yes, what condition(s) have you been diagnosed with?", # this would be data leakage
    "If maybe, what condition(s) do you believe you have?", # this also
    "What US state or territory do you work in?",
    "What is your age?",
    "What is your gender?",
    "SurveyID",
    "UserID"
}
free_text_cols = []

for col in df_filtered.columns:
    if col in exclude_free_text:
        continue
        
    num_unique = df_filtered[col].nunique()
    avg_len = df_filtered[col].astype(str).apply(len).mean()
    
    if num_unique > UNIQUE_THRESHOLD:
        free_text_cols.append(col)

for col in free_text_cols:
    print("-", col)

- Any additional notes or comments
- Briefly describe what you think the industry as a whole and/or employers could do to improve mental health support for employees.
- Describe the circumstances of the badly handled or unsupportive response.
- Describe the conversation with coworkers you had about your mental health including their reactions.
- Describe the conversation you had with your employer about your mental health, including their reactions and what actions were taken to address your mental health issue/questions.
- Describe the conversation you had with your previous coworkers about your mental health including their reactions.
- Describe the conversation you had with your previous employer about your mental health, including their reactions and actions taken to address your mental health issue/questions.
- Describe the conversation your coworker had with you about their mental health (please do not use names).
- If there is anything else you would like to tell us that has not
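The cell above computes `avg_len` and defines `AVG_LENGTH_THRESHOLD`, but only the unique-value count decides which columns are selected. Below is a sketch of how both criteria could be combined. `looks_like_free_text` is a hypothetical helper, not part of the notebook, and using it would probably change the list of selected columns. It measures length on non-null answers only, so sparse columns are not penalised by the string `"nan"`.

```python
import numpy as np
import pandas as pd


def looks_like_free_text(series, unique_threshold=50, avg_length_threshold=20):
    """Hypothetical helper: True when a column has many distinct, long answers."""
    answers = series.dropna().astype(str)
    if answers.empty:
        return False
    many_values = answers.nunique() > unique_threshold
    long_values = answers.str.len().mean() > avg_length_threshold
    return bool(many_values and long_values)


# Toy data: 60 distinct long comments vs. 60 distinct short codes.
comments = pd.Series([f"My employer handled situation number {i} poorly" for i in range(60)])
codes = pd.Series([f"C{i}" for i in range(60)])

# The same comments in a column where most respondents left the box empty.
sparse_comments = pd.concat([comments, pd.Series([np.nan] * 500)], ignore_index=True)

print(looks_like_free_text(comments))         # many distinct, long answers
print(looks_like_free_text(codes))            # many distinct, but short answers
print(looks_like_free_text(sparse_comments))  # missing values are ignored
```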

In [13]:
bad_values = {"-1", ".", " ", "", "nan"}

def clean_val(v):
    if pd.isna(v):
        return None
    v = str(v).strip()
    return None if v in bad_values else v

def join_free_text(row):
    # Clean each value once, then keep only the non-empty results.
    cleaned = (clean_val(v) for v in row)
    return " ".join(v for v in cleaned if v is not None)

df_filtered["all_text"] = df_filtered[free_text_cols].apply(join_free_text, axis=1)
# Assign the result back instead of calling replace(..., inplace=True) on a
# column selection, which pandas warns about and which stops working in pandas 3.0.
df_filtered["all_text"] = df_filtered["all_text"].replace("", np.nan)

df_filtered.drop(columns=free_text_cols, inplace=True)

print("Column 'all_text' created and free-text columns dropped.")

Column 'all_text' created and free-text columns dropped.


### Keep only columns where more than 50% of respondents answered

In [14]:
non_null_counts = df_filtered.notna().sum()

print("Non-null counts per column:")
print(non_null_counts.sort_values(ascending=False))

Non-null counts per column:
SurveyID                                                                                                                         4218
UserID                                                                                                                           4218
Do you have a family history of mental illness?                                                                                  4218
Would you bring up a mental health issue with a potential employer in an interview?                                              4218
Have you ever sought treatment for a mental health disorder from a mental health professional?                                   4218
                                                                                                                                 ... 
Has being identified as a person with a mental health issue affected your career?                                                 180
If you have revealed a mental heal

In [15]:
valid_cols = non_null_counts[non_null_counts > df_filtered.shape[0] / 2].index

df_filtered = df_filtered[valid_cols]
print("Remaining columns:", df_filtered.shape[1])

Remaining columns: 38
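The same column filter can also be written with `DataFrame.dropna` and its `thresh` argument. This is a sketch on toy data (the `demo` frame is made up). `thresh` means "at least this many non-null values", so "more than half" becomes `len(df) // 2 + 1`.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(
    {
        "mostly_answered": ["Yes", "No", "Yes", np.nan],       # 3 of 4 answered
        "half_answered": ["a", "b", np.nan, np.nan],           # 2 of 4 answered
        "rarely_answered": [np.nan, np.nan, np.nan, "x"],      # 1 of 4 answered
    }
)

# "More than half answered" means at least floor(n / 2) + 1 non-null values.
min_answers = len(demo) // 2 + 1
demo_kept = demo.dropna(axis=1, thresh=min_answers)

# Same selection written the way the notebook does it.
counts = demo.notna().sum()
same_cols = counts[counts > len(demo) / 2].index

print(list(demo_kept.columns), list(same_cols))
```

A column answered by exactly half of the respondents is dropped in both versions, because the comparison is strict.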


### Remove respondents (rows) that have too many missing values

In [16]:
row_nan_counts = df_filtered.isna().sum(axis=1)

# Maximum number of missing answers (columns) tolerated per respondent.
threshold = df_filtered.shape[1] * 0.5
print("Threshold is", threshold, "missing values per row")

df_filtered = df_filtered[row_nan_counts < threshold].reset_index(drop=True)

print("Remaining rows:", df_filtered.shape[0])

Threshold is 19.0 missing values per row
Remaining rows: 2911
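The row filter can also be expressed as a share of missing answers instead of an absolute count, which makes the 50% rule easier to read. A minimal sketch on made-up data (`demo` and `missing_share` are illustrative names):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(
    {
        "q1": ["Yes", np.nan, "No"],
        "q2": ["No", np.nan, np.nan],
        "q3": ["Maybe", np.nan, "Yes"],
        "q4": ["Yes", "No", np.nan],
    }
)

# Share of unanswered questions per respondent (0.0 = complete, 1.0 = empty row).
missing_share = demo.isna().mean(axis=1)

# Keep respondents with less than half of their answers missing.
demo_rows = demo[missing_share < 0.5].reset_index(drop=True)

print(missing_share.tolist())
print(len(demo_rows))
```

The third respondent is missing exactly half of the answers and is dropped, which matches the strict `<` used in the notebook.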


### Data Quality Fixes

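The Kaggle dataset already encodes some yes/no questions as 1 and 0 (for example "Are you self-employed?"), while others keep text answers such as "Yes", "No" and "I don't know", as the check below shows.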

In [17]:
unique_values = {}  # column name -> sorted unique answers, handy for spot checks

for col in df_filtered.columns:
    vals = df_filtered[col].unique()
    unique_values[col] = sorted([str(v) for v in vals if pd.notna(v)])

unique_values["Do you have a family history of mental illness?"]

["I don't know", 'No', 'Yes']

In [18]:
def clean_gender(value):
    if pd.isna(value):
        return np.nan
    
    v = str(value).strip().lower()

    # Male
    male = {
        "male", "cishet male", "male-ish", "ostensibly male",
        "masculine", "male/androgynous", "masculino"
    }
    if v in male:
        return "Male"

    # Female
    female = {
        "female", "female-ish", "female-identified",
        "female assigned at birth", "woman-identified",
        "fm", "femmina", "female-bodied; no feelings about gender",
        "gender non-conforming woman"
    }
    if v in female:
        return "Female"

    # Transgender
    trans = [
        "trans woman", "trans man", "trans female", "transgender woman",
        "transitioned, m2f", "mtf", "male (trans, ftm)", "transgender",
        "transfeminine", "other/transfeminine",
        "trans non-binary/genderfluid"
    ]
    if v in trans:
        return "Transgender"

    # Non-binary
    nb = [
        "non-binary", "non binary", "nonbinary", "nb", "enby",
        "genderfluid", "genderqueer", "agender", "androgynous",
        "demiguy", "bigender", "nonbinary/femme", 
        "genderflux demi-girl", "genderqueer demigirl",
        "genderqueer/non-binary", "non-binary and gender fluid",
        "genderfluid (born female)", "genderqueer woman",
        "nb masculine"
    ]
    if v in nb:
        return "Non-binary"

    # Everything else to Other
    return "Other"

df_filtered["Gender_cleaned"] = df_filtered["What is your gender?"].apply(clean_gender)
df_filtered.drop(columns=["What is your gender?"], inplace=True)
df_filtered["Gender_cleaned"].value_counts()

Gender_cleaned
Male           2022
Female          777
Non-binary       46
Other            31
Transgender      13
Name: count, dtype: int64
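If the variant lists keep growing, a table-driven version might be easier to maintain than a chain of `if` blocks. The sketch below is an alternative, not the notebook's method. `GENDER_VARIANTS`, `GENDER_LOOKUP` and `clean_gender_lookup` are hypothetical names, and the variant sets are shortened for illustration.

```python
import numpy as np
import pandas as pd

# Abbreviated, illustrative variant sets; the full lists live in clean_gender above.
GENDER_VARIANTS = {
    "Male": {"male", "cishet male", "masculino"},
    "Female": {"female", "femmina", "fm"},
    "Transgender": {"trans woman", "mtf"},
    "Non-binary": {"non-binary", "nonbinary", "enby"},
}

# Invert to a flat {variant: category} lookup.
GENDER_LOOKUP = {
    variant: category
    for category, variants in GENDER_VARIANTS.items()
    for variant in variants
}


def clean_gender_lookup(value):
    """Hypothetical table-driven variant of clean_gender."""
    if pd.isna(value):
        return np.nan
    # Normalise case and whitespace before the lookup; unknown answers map to "Other".
    return GENDER_LOOKUP.get(str(value).strip().lower(), "Other")


demo = pd.Series(["  Male ", "FEMMINA", "enby", "unicorn", np.nan])
demo_cleaned = demo.apply(clean_gender_lookup)
print(demo_cleaned.tolist())
```

Adding a new spelling then only means adding one string to the right set.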

In [19]:
df_filtered["What is your age?"] = pd.to_numeric(
    df_filtered["What is your age?"], errors="coerce"
)
df_filtered.loc[
    (df_filtered["What is your age?"] < 15) |
    (df_filtered["What is your age?"] > 80),
    "What is your age?"
] = np.nan

# Replace missing ages with median
median_age = df_filtered["What is your age?"].median()
df_filtered["What is your age?"] = df_filtered["What is your age?"].fillna(median_age)
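The three steps of the age cleaning (coerce to numbers, blank out implausible values, fill with the median) are easier to follow on a tiny example. The values in `demo_age` are made up, and `Series.where` with `between` is used here as a compact equivalent of the `.loc` mask above.

```python
import numpy as np
import pandas as pd

demo_age = pd.Series(["37", "323", "-29", "abc", "44", None])

# Non-numeric strings become NaN instead of raising an error.
demo_age = pd.to_numeric(demo_age, errors="coerce")

# Keep ages inside [15, 80]; everything else becomes NaN.
demo_age = demo_age.where(demo_age.between(15, 80))

# The median is computed on the surviving values only (37 and 44 here).
demo_age = demo_age.fillna(demo_age.median())

print(demo_age.tolist())
```

Note that median imputation hides how many ages were missing or implausible, so it may be worth counting the NaN values before the `fillna` step.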

In [20]:
# Replace remaining NaN values with "Unknown" so that categorical encoders and ML models 
# can handle missing categories consistently without dropping rows.
df_filtered = df_filtered.replace(to_replace=np.nan, value="Unknown")
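One thing to keep in mind: a blanket replacement also touches numeric columns, and a numeric column that receives the string "Unknown" becomes a mixed-type column. If that matters later, a possible alternative is to fill only the text columns. This is a sketch on toy data, assuming numeric columns should keep their dtype.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(
    {
        "Do you work remotely?": ["Always", np.nan, "Never"],
        "What is your age?": [31.0, np.nan, 27.0],
    }
)

# Select text columns by excluding numeric dtypes, then fill only those.
text_cols = demo.select_dtypes(exclude="number").columns
demo[text_cols] = demo[text_cols].fillna("Unknown")

print(demo)
print(demo.dtypes)
```

In this notebook the ages are already imputed at this point, so the difference only shows up for other numeric columns that still contain NaN.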

In [21]:
df_filtered.tail()

Unnamed: 0,SurveyID,UserID,Are you self-employed?,Did your previous employers ever formally discuss mental health (as part of a wellness campaign or other official communication)?,Did your previous employers provide resources to learn more about mental health disorders and how to seek help?,Do you currently have a mental health disorder?,Do you feel that your employer takes mental health as seriously as physical health?,Do you have a family history of mental illness?,Do you have previous employers?,Do you know the options for mental health care available under your employer-provided health coverage?,Do you think that discussing a physical health issue with your employer would have negative consequences?,Does your employer offer resources to learn more about mental health disorders and options for seeking help?,Does your employer provide mental health benefits as part of healthcare coverage?,"Has your employer ever formally discussed mental health (for example, as part of a wellness campaign or other official communication)?",Have you ever sought treatment for a mental health disorder from a mental health professional?,Have you had a mental health disorder in the past?,Have you observed or experienced an unsupportive or badly handled response to a mental health issue in your current or previous workplace?,Have your previous employers provided mental health benefits?,How many employees does your company or organization have?,How willing would you be to share with friends and family that you have a mental illness?,"If a mental health issue prompted you to request a medical leave from work, how easy or difficult would it be to ask for that leave?","If you have a mental health disorder, how often do you feel that it interferes with your work when being treated effectively?","If you have a mental health disorder, how often do you feel that it interferes with your work when not being treated effectively (i.e., when you are experiencing symptoms)?","If you live in the 
United States, which state or territory do you live in?",Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources provided by your employer?,Is your employer primarily a tech company/organization?,Was your anonymity protected if you chose to take advantage of mental health or substance abuse treatment resources with previous employers?,Were you aware of the options for mental health care provided by your previous employers?,What country do you live in?,What country do you work in?,What is your age?,Would you be willing to bring up a physical health issue with a potential employer in an interview?,Would you bring up a mental health issue with a potential employer in an interview?,Would you feel comfortable discussing a mental health issue with your coworkers?,Would you feel comfortable discussing a mental health issue with your direct supervisor(s)?,Would you have been willing to discuss your mental health with your direct supervisor(s)?,all_text,Gender_cleaned
2906,2019,4213,0,Unknown,Unknown,No,Unknown,No,0,No,Unknown,No,No,No,0,No,No,Unknown,6-25,Neutral,Very easy,Not applicable to me,Often,California,I don't know,1,Unknown,Unknown,United States of America,United States of America,31.0,No,No,No,Maybe,Unknown,Supervisors should try and break the ice with ...,Female
2907,2019,4214,0,I don't know,None did,Possibly,Unknown,No,1,No,Unknown,I don't know,I don't know,No,0,No,"Yes, I observed","No, none did",More than 1000,Very open,Somewhat difficult,Sometimes,Sometimes,Unknown,I don't know,0,I don't know,N/A (none offered),India,India,27.0,Maybe,No,No,No,"No, none of my previous supervisors",Unknown,Male
2908,2019,4215,1,None did,None did,No,Unknown,No,1,Unknown,Unknown,Unknown,Unknown,Unknown,0,No,"Yes, I observed","No, none did",Unknown,Not open at all,Unknown,Not applicable to me,Not applicable to me,Louisiana,Unknown,Unknown,I don't know,N/A (none offered),United States of America,United States of America,48.0,Maybe,Maybe,Unknown,Unknown,"No, none of my previous supervisors",IDK employee was ostracized and employer discu...,Male
2909,2019,4217,0,None did,None did,Possibly,Unknown,No,1,No,Unknown,No,No,No,0,No,Maybe/Not sure,"No, none did",More than 1000,Not open at all,Difficult,Often,Often,Unknown,Yes,1,I don't know,N/A (was not aware),India,India,30.0,No,No,Maybe,No,I don't know,raise awareness no,Female
2910,2019,4218,0,Some did,Some did,Yes,Unknown,Yes,1,Yes,Unknown,Yes,Yes,No,1,Yes,Maybe/Not sure,"No, none did",More than 1000,Very open,Somewhat difficult,Sometimes,Often,Oregon,Yes,1,I don't know,N/A (none offered),United States of America,United States of America,24.0,Maybe,Yes,No,Maybe,Some of my previous supervisors,reduce stigma. offer options for part time wor...,Female


In [22]:
# This will be the prediction target.
df_filtered["Do you currently have a mental health disorder?"].value_counts()

Do you currently have a mental health disorder?
Yes           1223
No             951
Maybe          319
Possibly       296
Don't Know     122
Name: count, dtype: int64

In [23]:
df_filtered.to_csv("../data/processed/kaggle_cleaned.csv", index=False)

In [24]:
df_filtered.shape

(2911, 38)

<br>
<br>

## Let's combine the datasets from different years
# <span style="color:red">TODO: the datasets share about 22 columns. What should happen to the other columns, and can we interpret them somehow? Cleaning also needs to be added; the Kaggle cleaning steps above could be turned into functions so they can be reused here.</span>
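One possible starting point for the TODO: the 2020 headers contain emphasis markers such as `*Are you self-employed?*`, so the number of shared columns depends on how the headers are normalised. The sketch below uses toy frames, and `normalize_header` is a hypothetical helper. It strips `*` and `_` from the newer headers, finds the shared columns and stacks the two frames. It does not handle case differences (for example "NOT" vs "not"), so some columns would still need a manual mapping.

```python
import re

import pandas as pd


def normalize_header(name):
    """Hypothetical helper: strip emphasis markers (* and _) and extra spaces."""
    cleaned = re.sub(r"[*_]", "", str(name))
    return re.sub(r"\s+", " ", cleaned).strip()


# Toy frames; the real inputs would be df_filtered and df_2020.
demo_old = pd.DataFrame(
    {"Are you self-employed?": [0], "What is your age?": [31.0], "all_text": ["x"]}
)
demo_2020 = pd.DataFrame(
    {"*Are you self-employed?*": [1], "What is your age?": [28], "Why or why not?": ["y"]}
)

# Only the raw 2020 headers are normalised, so names like all_text stay intact.
demo_2020 = demo_2020.rename(columns=normalize_header)

# Shared columns, in the order of the older dataset.
common_cols = [c for c in demo_old.columns if c in demo_2020.columns]

demo_combined = pd.concat(
    [demo_old[common_cols], demo_2020[common_cols]], ignore_index=True
)

print(common_cols)
print(demo_combined)
```

The columns that are not shared are simply left out here; deciding what to do with them is still open.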


#### Adding 2020 to 2014-2019


In [25]:
df_2020 = pd.read_csv("../data/raw/osmi_mental_health_2020.csv")

In [26]:
df_combined = df_filtered.copy()

In [27]:
df_2020.head(2)

Unnamed: 0,#,*Are you self-employed?*,How many employees does your company or organization have?,Is your employer primarily a tech company/organization?,Is your primary role within your company related to tech/IT?,Does your employer provide mental health benefits as part of healthcare coverage?,Do you know the options for mental health care available under your employer-provided health coverage?,"Has your employer ever formally discussed mental health (for example, as part of a wellness campaign or other official communication)?",Does your employer offer resources to learn more about mental health disorders and options for seeking help?,Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources provided by your employer?,"If a mental health issue prompted you to request a medical leave from work, how easy or difficult would it be to ask for that leave?",Would you feel more comfortable talking to your coworkers about your physical health or your mental health?,Would you feel comfortable discussing a mental health issue with your direct supervisor(s)?,Have you ever discussed your mental health with your employer?,"Describe the conversation you had with your employer about your mental health, including their reactions and what actions were taken to address your mental health issue/questions.",Would you feel comfortable discussing a mental health issue with your coworkers?,Have you ever discussed your mental health with coworkers?,Describe the conversation with coworkers you had about your mental health including their reactions.,Have you ever had a coworker discuss their or another coworker's mental health with you?,Describe the conversation your coworker had with you about their mental health (please do not use names).,"Overall, how much importance does your employer place on physical health?","Overall, how much importance does your employer place on mental health?",Do you have medical coverage (private 
insurance or state-provided) that includes treatment of mental health disorders?,Do you know local or online resources to seek help for a mental health issue?,"If you have been diagnosed or treated for a mental health disorder, do you ever reveal this to clients or business contacts?","If you have revealed a mental health disorder to a client or business contact, how has this affected you or the relationship?","If you have been diagnosed or treated for a mental health disorder, do you ever reveal this to coworkers or employees?","If you have revealed a mental health disorder to a coworker or employee, how has this impacted you or the relationship?",Do you believe your productivity is ever affected by a mental health issue?,"If yes, what percentage of your work time (time performing primary or secondary job functions) is affected by a mental health issue?",*Do you have previous employers?*,Was your employer primarily a tech company/organization?,Have your previous employers provided mental health benefits?,Were you aware of the options for mental health care provided by your previous employers?,Did your previous employers ever formally discuss mental health (as part of a wellness campaign or other official communication)?,Did your previous employers provide resources to learn more about mental health disorders and how to seek help?,Was your anonymity protected if you chose to take advantage of mental health or substance abuse treatment resources with previous employers?,Would you have felt more comfortable talking to your previous employer about your physical health or your mental health?,Would you have been willing to discuss your mental health with your direct supervisor(s)?,Did you ever discuss your mental health with your previous employer?,"Describe the conversation you had with your previous employer about your mental health, including their reactions and actions taken to address your mental health issue/questions.",Would you have been willing to discuss your 
mental health with your coworkers at previous employers?,Did you ever discuss your mental health with a previous coworker(s)?,Describe the conversation you had with your previous coworkers about your mental health including their reactions.,Did you ever have a previous coworker discuss their or another coworker's mental health with you?,Describe the conversation your coworker had with you about their mental health (please do not use names)..1,"Overall, how much importance did your previous employer place on physical health?","Overall, how much importance did your previous employer place on mental health?",Do you *currently* have a mental health disorder?,Have you ever been *diagnosed* with a mental health disorder?,"Anxiety Disorder (Generalized, Social, Phobia, etc)","Mood Disorder (Depression, Bipolar Disorder, etc)","Psychotic Disorder (Schizophrenia, Schizoaffective, etc)","Eating Disorder (Anorexia, Bulimia, etc)",Attention Deficit Hyperactivity Disorder,"Personality Disorder (Borderline, Antisocial, Paranoid, etc)",Obsessive-Compulsive Disorder,Post-Traumatic Stress Disorder,Stress Response Syndromes,Dissociative Disorder,Substance Use Disorder,Addictive Disorder,Other,"Anxiety Disorder (Generalized, Social, Phobia, etc).1","Mood Disorder (Depression, Bipolar Disorder, etc).1","Psychotic Disorder (Schizophrenia, Schizoaffective, etc).1","Eating Disorder (Anorexia, Bulimia, etc).1",Attention Deficit Hyperactivity Disorder.1,"Personality Disorder (Borderline, Antisocial, Paranoid, etc).1",Obsessive-Compulsive Disorder.1,Post-traumatic Stress Disorder,Stress Response Syndromes.1,Dissociative Disorder.1,Substance Use Disorder.1,Addictive Disorder.1,Other.1,"Anxiety Disorder (Generalized, Social, Phobia, etc).2","Mood Disorder (Depression, Bipolar Disorder, etc).2","Psychotic Disorder (Schizophrenia, Schizoaffective, etc).2","Eating Disorder (Anorexia, Bulimia, etc).2",Attention Deficit Hyperactivity Disorder.2,"Personality Disorder (Borderline, Antisocial, 
Paranoid, etc).2",Obsessive-Compulsive Disorder.2,Post-traumatic Stress Disorder.1,Stress Response Syndromes.2,Dissociative Disorder.2,Substance Use Disorder.2,Addictive Disorder.2,Other.2,Have you had a mental health disorder in the past?,Have you ever sought treatment for a mental health disorder from a mental health professional?,Do you have a family history of mental illness?,"If you have a mental health disorder, how often do you feel that it interferes with your work *when being treated effectively?*","If you have a mental health disorder, how often do you feel that it interferes with your work *when* _*NOT*_* being treated effectively (i.e., when you are experiencing symptoms)?*",Have your observations of how another individual who discussed a mental health issue made you less likely to reveal a mental health issue yourself in your current workplace?,How willing would you be to share with friends and family that you have a mental illness?,Would you be willing to bring up a physical health issue with a potential employer in an interview?,Why or why not?,Would you bring up your *mental* health with a potential employer in an interview?,Why or why not?.1,Are you openly identified at work as a person with a mental health issue?,Has being identified as a person with a mental health issue affected your career?,How has it affected your career?,"If they knew you suffered from a mental health disorder, how do you think that your team members/co-workers would react?",Have you observed or experienced an *unsupportive or badly handled response* to a mental health issue in your current or previous workplace?,Describe the circumstances of the badly handled or unsupportive response.,Have you observed or experienced a *supportive or well handled response* to a mental health issue in your current or previous workplace?,Describe the circumstances of the supportive or well handled response.,"Overall, how well do you think the tech industry supports employees with mental health 
issues?",Briefly describe what you think the industry as a whole and/or employers could do to improve mental health support for employees.,"If there is anything else you would like to tell us that has not been covered by the survey questions, please use this space to do so.",Would you be willing to talk to one of us more extensively about your experiences with mental health issues in the tech industry? (Note that all interview responses would be used _anonymously_ and only with your permission.),What is your age?,What is your gender?,What country do you *live* in?,What US state or territory do you *live* in?,What is your race?,Other.3,What country do you *work* in?,What US state or territory do you *work* in?
0,zwrffw6ykfo82ft1twvzwrffw6c6wsfv,1,,,,,,,,,,,,,,,,,,,,,1.0,"No, I don't know any","No, because it doesn't matter",,"No, because it doesn't matter",I'm not sure,Yes,1-25%,1,1.0,I don't know,N/A (none offered),None did,None did,I don't know,Physical health,"No, none of my previous supervisors",0.0,,At some of my previous employers,0.0,,0.0,,6.0,0.0,Don't Know,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Don't Know,0,Yes,Not applicable to me,Sometimes,,5,Maybe,It depend on if the discussion would have an i...,Maybe,Same reason. But I believe this would have a g...,0,,,5,Maybe/Not sure,,"Yes, I experienced",,2,,,0,45,Male,United States of America,Connecticut,White,,United States of America,Connecticut
1,zhdmhaa8r0125c4zmoi7qzhdmtjrakhm,1,,,,,,,,,,,,,,,,,,,,,1.0,"Yes, I know several","Yes, always",Positively,"Yes, always",Positively,Yes,1-25%,1,1.0,"Yes, they all did","Yes, I was aware of all of them","Yes, they all did","Yes, they all did","Yes, always",Physical health,"Yes, all of my previous supervisors",1.0,,"Yes, at all of my previous employers",1.0,,1.0,,10.0,10.0,Yes,Yes,,,,,,,,,,,,,,,,,,,,,,,,,,,"Anxiety Disorder (Generalized, Social, Phobia,...",,,,,,,,,,,,,Yes,1,Yes,Never,Never,Yes,10,Yes,,Yes,,1,1.0,10.0,10,"Yes, I observed",,"Yes, I observed",,5,,,1,24,female,Russia,,,,Russia,


In [28]:
print(df_combined.shape, df_2020.shape)

(2911, 38) (180, 120)


In [29]:
df_combined.columns = df_combined.columns.str.strip()
df_2020.columns = df_2020.columns.str.strip()

In [30]:
# Identify shared columns
common_cols = sorted(list(set(df_combined.columns).intersection(df_2020.columns)))


In [31]:
print("Shared columns:", len(common_cols))

Shared columns: 22
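The count alone does not show which questions are lost when only the shared columns are kept. As a minimal sketch, the set differences can be listed next to the intersection. The two toy frames below stand in for `df_combined` and `df_2020`, and their column names are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for df_combined and df_2020; column names are illustrative only.
left = pd.DataFrame({"SurveyID": [2014], "What is your age?": [30], "Old question": ["a"]})
right = pd.DataFrame({"What is your age? ": [41], "New question": ["b"]})

# Strip stray whitespace so otherwise identical headers match.
left.columns = left.columns.str.strip()
right.columns = right.columns.str.strip()

shared = sorted(set(left.columns) & set(right.columns))
only_left = sorted(set(left.columns) - set(right.columns))
only_right = sorted(set(right.columns) - set(left.columns))

print("Shared:", shared)
print("Only in left:", only_left)
print("Only in right:", only_right)
```

Without the `str.strip()` step, the trailing space in the second frame's header would keep the age question out of the shared set.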


#### Let's ensure both datasets have a SurveyID column
The 2014-2019 data already has one, so let's add it to the 2020 data.

In [32]:
df_2020["SurveyID"] = 2020

if "SurveyID" not in common_cols:
    common_cols = ["SurveyID"] + common_cols

In [33]:
df_14_19_common = df_combined[common_cols].copy()
df_2020_common  = df_2020[common_cols].copy()

In [34]:
# Combine into one dataset
df_combined = pd.concat(
    [df_14_19_common, df_2020_common],
    ignore_index=True
)

#df_combined.head(), df_combined.shape
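The same strip, tag, intersect and concatenate steps are repeated for every later year, so they could be wrapped in one function. The sketch below is only one possible way to do that: `append_survey_year` is a hypothetical helper that is not part of this notebook, and it assumes the running dataset already has a `SurveyID` column.

```python
import pandas as pd


def append_survey_year(combined: pd.DataFrame, new: pd.DataFrame, year: int) -> pd.DataFrame:
    """Append one survey year, keeping only the columns both frames share.

    Hypothetical helper; assumes `combined` already has a SurveyID column.
    """
    # Work on copies so the caller's frames are left untouched.
    combined = combined.copy()
    new = new.copy()

    # Strip stray whitespace so otherwise identical headers match.
    combined.columns = combined.columns.str.strip()
    new.columns = new.columns.str.strip()

    # Tag the new rows with their survey year.
    new["SurveyID"] = year

    # Keep the shared columns in the order they appear in `combined`.
    common_cols = [c for c in combined.columns if c in new.columns]
    return pd.concat([combined[common_cols], new[common_cols]], ignore_index=True)


# Toy frames with made-up column names.
combined = pd.DataFrame({"SurveyID": [2019], "Age ": [30], "Dropped": ["x"]})
df_new = pd.DataFrame({" Age": [41], "Extra": ["y"]})

result = append_survey_year(combined, df_new, 2020)
print(result)
```

Unlike the cells in this notebook, the sketch keeps the column order of the running dataset instead of sorting the shared names alphabetically. Either choice works as long as it is applied consistently.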

#### Let's continue with 2021


In [35]:
df_2021 = pd.read_csv("../data/raw/osmi_mental_health_2021.csv")

In [36]:
print(df_combined.shape, df_2021.shape)

(3091, 23) (131, 124)


In [37]:
df_combined.columns = df_combined.columns.str.strip()
df_2021.columns = df_2021.columns.str.strip()

In [38]:
# Identify shared columns
common_cols = sorted(list(set(df_combined.columns).intersection(df_2021.columns)))
print("Shared columns:", len(common_cols))

Shared columns: 22


In [39]:
df_2021["SurveyID"] = 2021

if "SurveyID" not in common_cols:
    common_cols = ["SurveyID"] + common_cols

In [40]:
df_combined_common = df_combined[common_cols].copy()
df_2021_common  = df_2021[common_cols].copy()

In [41]:
# Combine into one dataset
df_combined = pd.concat(
    [df_combined_common, df_2021_common],
    ignore_index=True
)

#df_combined.head(), df_combined.shape

#### 2022

In [42]:
df_2022 = pd.read_csv("../data/raw/osmi_mental_health_2022.csv")

In [43]:
print(df_combined.shape, df_2022.shape)

(3222, 23) (164, 126)


In [44]:
df_combined.columns = df_combined.columns.str.strip()
df_2022.columns = df_2022.columns.str.strip()

In [45]:
# Identify shared columns
common_cols = sorted(list(set(df_combined.columns).intersection(df_2022.columns)))
print("Shared columns:", len(common_cols))

Shared columns: 22


In [46]:
df_2022["SurveyID"] = 2022

if "SurveyID" not in common_cols:
    common_cols = ["SurveyID"] + common_cols

In [47]:
df_combined_common = df_combined[common_cols].copy()
df_2022_common  = df_2022[common_cols].copy()

In [48]:
# Combine into one dataset
df_combined = pd.concat(
    [df_combined_common, df_2022_common],
    ignore_index=True
)

#df_combined.head(), df_combined.shape

#### 2023

In [49]:
df_2023 = pd.read_csv("../data/raw/osmi_mental_health_2023.csv")

In [50]:
print(df_combined.shape, df_2023.shape)

(3386, 23) (6, 126)


In [51]:
df_combined.columns = df_combined.columns.str.strip()
df_2023.columns = df_2023.columns.str.strip()

In [52]:
# Identify shared columns
common_cols = sorted(list(set(df_combined.columns).intersection(df_2023.columns)))
print("Shared columns:", len(common_cols))

Shared columns: 22


In [53]:
df_2023["SurveyID"] = 2023

if "SurveyID" not in common_cols:
    common_cols = ["SurveyID"] + common_cols

In [54]:
df_combined_common = df_combined[common_cols].copy()
df_2023_common  = df_2023[common_cols].copy()

In [55]:
# Combine into one dataset
df_combined = pd.concat(
    [df_combined_common, df_2023_common],
    ignore_index=True
)

#df_combined.head(), df_combined.shape
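Before saving, it may be worth checking how many rows each survey contributed. Going by the shapes printed above, the real `df_combined` should hold 180 rows for 2020, 131 for 2021, 164 for 2022 and 6 for 2023. The sketch below shows the check on a toy frame with made-up values.

```python
import pandas as pd

# Toy stand-in for df_combined; values are illustrative only.
toy = pd.DataFrame(
    {"SurveyID": [2019, 2020, 2020, 2023], "answer": ["a", "b", "c", "d"]}
)

# Count rows per survey and list them in chronological order.
rows_per_survey = toy["SurveyID"].value_counts().sort_index()
print(rows_per_survey)

# Every row should belong to exactly one survey.
assert rows_per_survey.sum() == len(toy)
```

On the real data the same two lines would be run against `df_combined` instead of `toy`.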

### Let's save the combined dataset to a .csv file.

In [56]:
df_combined.to_csv("../data/processed/combined_data.csv", index=False)