# Data completion for Mental Health surveys

#### This notebook converts the 2014-2019 OSMI Mental Health Survey SQLite dataset into a wide .csv file, then combines it with the 2020, 2021, 2022 and 2023 survey datasets.

## Let's first convert the SQLite file to .csv

In [33]:
import sqlite3
import pandas as pd

In [None]:
# 1. Connect to SQLite database
db_path = "../../data/datasets/mental_health.sqlite"
conn = sqlite3.connect(db_path)

In [35]:
# 2. Define the join query (long format: one row per answer)
query = """
SELECT 
    A.SurveyID,
    A.UserID,
    A.QuestionID,
    Q.QuestionText,
    A.AnswerText,
    S.Description AS SurveyDescription
FROM Answer A
LEFT JOIN Question Q ON A.QuestionID = Q.QuestionID
LEFT JOIN Survey S ON A.SurveyID = S.SurveyID;
"""

In [36]:
# 3. Load into pandas
df_long = pd.read_sql_query(query, conn)
conn.close()
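To make the shape of the long-format result concrete, here is a minimal sketch that runs the same kind of join against an in-memory SQLite database. The toy tables, their rows, and the names `conn_demo`, `demo_query` and `demo_long` are assumptions made for illustration; only the table and column names are taken from the query above. The sketch also aliases `QuestionText` explicitly, because SQLite does not guarantee the name of a result column unless an `AS` clause is given.

```python
import sqlite3
import pandas as pd

# Hypothetical toy database mirroring the table/column names used in the query above.
conn_demo = sqlite3.connect(":memory:")
conn_demo.executescript("""
CREATE TABLE Survey (SurveyID INTEGER, Description TEXT);
CREATE TABLE Question (QuestionID INTEGER, QuestionText TEXT);
CREATE TABLE Answer (SurveyID INTEGER, UserID INTEGER, QuestionID INTEGER, AnswerText TEXT);
INSERT INTO Survey VALUES (2014, 'toy survey 2014');
INSERT INTO Question VALUES (1, 'What is your age?');
INSERT INTO Question VALUES (2, 'Are you self-employed?');
INSERT INTO Answer VALUES (2014, 1, 1, '37');
INSERT INTO Answer VALUES (2014, 1, 2, '-1');
INSERT INTO Answer VALUES (2014, 2, 1, '44');
""")

# Same join as above; the explicit aliases pin the resulting column names.
demo_query = """
SELECT A.SurveyID AS SurveyID,
       A.UserID AS UserID,
       A.QuestionID AS QuestionID,
       Q.QuestionText AS QuestionText,
       A.AnswerText AS AnswerText,
       S.Description AS SurveyDescription
FROM Answer A
LEFT JOIN Question Q ON A.QuestionID = Q.QuestionID
LEFT JOIN Survey S ON A.SurveyID = S.SurveyID
ORDER BY A.UserID, A.QuestionID;
"""
demo_long = pd.read_sql_query(demo_query, conn_demo)
conn_demo.close()

# One row per answer: three answers in, three rows out.
print(demo_long)
```

Each respondent appears once per answered question, which is why the next step pivots on the question text.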


In [37]:
# 4. Pivot to wide format (one row per respondent, one column per question)
df = df_long.pivot_table(
    index=["SurveyID", "UserID"],          # one row per user per survey
    columns="questiontext",                # one column per question
    values="AnswerText",
    aggfunc=lambda x: " | ".join(
        pd.Series(x).dropna().astype(str).unique()
    ),                                     # handle multi-answers
).reset_index()

In [38]:
# Drop the columns-axis name that the pivot leaves behind in the header row.
df.columns.name = None
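The custom `aggfunc` matters for questions that allow several answers per respondent. The sketch below shows its effect on a toy long-format frame; the data and the names `toy_long`, `toy_wide` and `join_unique` are made up for illustration and are not part of the survey.

```python
import pandas as pd

# Hypothetical long-format data: user 1 gave two answers to the same question.
toy_long = pd.DataFrame({
    "SurveyID": [2016, 2016, 2016, 2016],
    "UserID": [1, 1, 1, 2],
    "QuestionText": ["Condition?", "Condition?", "Age?", "Age?"],
    "AnswerText": ["Anxiety", "Depression", "30", "41"],
})

def join_unique(values):
    """Join the distinct non-null answers of one respondent into a single string."""
    return " | ".join(pd.Series(values).dropna().astype(str).unique())

toy_wide = toy_long.pivot_table(
    index=["SurveyID", "UserID"],
    columns="QuestionText",
    values="AnswerText",
    aggfunc=join_unique,
).reset_index()
toy_wide.columns.name = None

print(toy_wide)
```

User 1's two answers end up in one cell separated by `" | "`, while user 2, who never answered the second question, gets a missing value there.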

In [39]:
df.head()

Unnamed: 0,SurveyID,UserID,Any additional notes or comments,Are you openly identified at work as a person with a mental health issue?,Are you self-employed?,Briefly describe what you think the industry as a whole and/or employers could do to improve mental health support for employees.,Describe the circumstances of the badly handled or unsupportive response.,Describe the circumstances of the supportive or well handled response.,Describe the conversation with coworkers you had about your mental health including their reactions.,"Describe the conversation you had with your employer about your mental health, including their reactions and what actions were taken to address your mental health issue/questions.",...,Would you be willing to discuss a mental health issue with your direct supervisor(s)?,Would you be willing to talk to one of us more extensively about your experiences with mental health issues in the tech industry? (Note that all interview responses would be used _anonymously_ and only with your permission.),Would you bring up a mental health issue with a potential employer in an interview?,Would you bring up a physical health issue with a potential employer in an interview?,Would you feel comfortable discussing a mental health issue with your coworkers?,Would you feel comfortable discussing a mental health issue with your direct supervisor(s)?,Would you feel more comfortable talking to your coworkers about your physical health or your mental health?,Would you have been willing to discuss a mental health issue with your previous co-workers?,Would you have been willing to discuss your mental health with your direct supervisor(s)?,Would you have felt more comfortable talking to your previous employer about your physical health or your mental health?
0,2014,1,-1,,-1,,,,,,...,Yes,,No,Maybe,,,,,,
1,2014,2,-1,,-1,,,,,,...,No,,No,No,,,,,,
2,2014,3,-1,,-1,,,,,,...,Yes,,Yes,Yes,,,,,,
3,2014,4,-1,,-1,,,,,,...,No,,Maybe,Maybe,,,,,,
4,2014,5,-1,,-1,,,,,,...,Yes,,Yes,Yes,,,,,,


In [40]:

df.tail()

Unnamed: 0,SurveyID,UserID,Any additional notes or comments,Are you openly identified at work as a person with a mental health issue?,Are you self-employed?,Briefly describe what you think the industry as a whole and/or employers could do to improve mental health support for employees.,Describe the circumstances of the badly handled or unsupportive response.,Describe the circumstances of the supportive or well handled response.,Describe the conversation with coworkers you had about your mental health including their reactions.,"Describe the conversation you had with your employer about your mental health, including their reactions and what actions were taken to address your mental health issue/questions.",...,Would you be willing to discuss a mental health issue with your direct supervisor(s)?,Would you be willing to talk to one of us more extensively about your experiences with mental health issues in the tech industry? (Note that all interview responses would be used _anonymously_ and only with your permission.),Would you bring up a mental health issue with a potential employer in an interview?,Would you bring up a physical health issue with a potential employer in an interview?,Would you feel comfortable discussing a mental health issue with your coworkers?,Would you feel comfortable discussing a mental health issue with your direct supervisor(s)?,Would you feel more comfortable talking to your coworkers about your physical health or your mental health?,Would you have been willing to discuss a mental health issue with your previous co-workers?,Would you have been willing to discuss your mental health with your direct supervisor(s)?,Would you have felt more comfortable talking to your previous employer about your physical health or your mental health?
4213,2019,4214,,0,0,-1,-1,-1,-1,-1,...,,0,No,,No,No,Physical health,,"No, none of my previous supervisors",Physical health
4214,2019,4215,,0,1,IDK,employee was ostracized and employer discussed...,-1,-1,-1,...,,0,Maybe,,-1,-1,-1,,"No, none of my previous supervisors",Physical health
4215,2019,4216,,0,1,sdsdsdsdsdsd,-1,-1,-1,-1,...,,0,No,,-1,-1,-1,,-1,-1
4216,2019,4217,,0,0,raise awareness,-1,-1,-1,-1,...,,0,No,,Maybe,No,Physical health,,I don't know,Same level of comfort for each
4217,2019,4218,,1,0,reduce stigma. offer options for part time wor...,-1,-1,I've discussed mental illness with coworkers n...,My current manager will be retiring in the nex...,...,,1,Yes,,No,Maybe,Physical health,,Some of my previous supervisors,Physical health


In [41]:
output_path = "../../data/datasets/osmi_combined_wide.csv"

In [42]:
df.to_csv(output_path, index=False)

In [44]:
print("Saved CSV to:", output_path)
print("Rows:", len(df))
print("Columns:", df.shape[1])

Saved CSV to: ../../data/datasets/osmi_combined_wide.csv
Rows: 4218
Columns: 107


## Let's combine the datasets from different years.

#### Adding 2020 to 2014-2019


In [50]:
df_2020 = pd.read_csv("../../data/datasets/osmi_mental_health_2020.csv")

In [54]:
print(df.shape, df_2020.shape)

(4218, 107) (180, 120)


In [55]:
df.columns = df.columns.str.strip()
df_2020.columns = df_2020.columns.str.strip()

In [56]:
# Identify shared columns
common_cols = sorted(list(set(df.columns).intersection(df_2020.columns)))


In [58]:
print("Shared columns:", len(common_cols))

Shared columns: 62
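Only the shared columns survive the merge, so it can be useful to see which columns are dropped on each side, for example to spot questions whose wording changed slightly between years. A minimal sketch, assuming a hypothetical helper `column_overlap` and toy frames that are not part of the notebook:

```python
import pandas as pd

def column_overlap(left, right):
    """Return shared and side-specific column names after stripping whitespace."""
    left_cols = set(left.columns.str.strip())
    right_cols = set(right.columns.str.strip())
    return {
        "shared": sorted(left_cols & right_cols),
        "left_only": sorted(left_cols - right_cols),
        "right_only": sorted(right_cols - left_cols),
    }

# Toy frames: note the trailing space in one of toy_a's column names.
toy_a = pd.DataFrame(columns=["What is your age?", "What is your gender? ", "UserID"])
toy_b = pd.DataFrame(columns=["What is your age?", "What is your gender?", "Country"])

overlap = column_overlap(toy_a, toy_b)
for key, names in overlap.items():
    print(key, names)
```

Without the `str.strip()` step the gender question would count as two different columns, which is the reason the notebook strips the headers before intersecting.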


#### Let's ensure both datasets have a SurveyID column
The 2014-2019 data already has one, so we only need to add it to the 2020 data.

In [59]:
df_2020["SurveyID"] = 2020

if "SurveyID" not in common_cols:
    common_cols = ["SurveyID"] + common_cols

In [60]:
df_14_19_common = df[common_cols].copy()
df_2020_common  = df_2020[common_cols].copy()

In [62]:
# Combine into one dataset
df_combined = pd.concat(
    [df_14_19_common, df_2020_common],
    ignore_index=True
)

df_combined.head(), df_combined.shape

(   SurveyID  \
 0      2014   
 1      2014   
 2      2014   
 3      2014   
 4      2014   
 
   Are you openly identified at work as a person with a mental health issue?  \
 0                                                NaN                          
 1                                                NaN                          
 2                                                NaN                          
 3                                                NaN                          
 4                                                NaN                          
 
   Briefly describe what you think the industry as a whole and/or employers could do to improve mental health support for employees.  \
 0                                                NaN                                                                                  
 1                                                NaN                                                                                  
 2          
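After each concatenation, a quick sanity check is to confirm that no rows were lost and that every survey year is present. The sketch below does this on toy frames; `toy_old`, `toy_new`, `toy_combined` and `rows_per_year` are illustrative names, not objects from the notebook.

```python
import pandas as pd

# Toy stand-ins for the existing combined data and a newly added year.
toy_old = pd.DataFrame({"SurveyID": [2014, 2014, 2019], "Q1": ["Yes", "No", "Maybe"]})
toy_new = pd.DataFrame({"SurveyID": [2020, 2020], "Q1": ["No", "Yes"]})

toy_combined = pd.concat([toy_old, toy_new], ignore_index=True)

# The combined row count should equal the sum of the inputs.
assert len(toy_combined) == len(toy_old) + len(toy_new)

# Rows per survey year, sorted by year.
rows_per_year = toy_combined["SurveyID"].value_counts().sort_index()
print(rows_per_year)
```

`ignore_index=True` gives the result a fresh 0..n-1 index, so rows from different years never share an index label.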

#### Adding 2021


In [64]:
df_2021 = pd.read_csv("../../data/datasets/osmi_mental_health_2021.csv")

In [65]:
# df_combined has shape (4398, 63) after adding 2020
print(df_combined.shape, df_2021.shape)


In [66]:
df_combined.columns = df_combined.columns.str.strip()
df_2021.columns = df_2021.columns.str.strip()

In [68]:
# Identify shared columns
common_cols = sorted(list(set(df_combined.columns).intersection(df_2021.columns)))
print("Shared columns:", len(common_cols))

Shared columns: 62


In [69]:
df_2021["SurveyID"] = 2021

if "SurveyID" not in common_cols:
    common_cols = ["SurveyID"] + common_cols

In [70]:
df_combined_common = df_combined[common_cols].copy()
df_2021_common  = df_2021[common_cols].copy()

In [71]:
# Combine into one dataset
df_combined = pd.concat(
    [df_combined_common, df_2021_common],
    ignore_index=True
)

df_combined.head(), df_combined.shape

(   SurveyID  \
 0      2014   
 1      2014   
 2      2014   
 3      2014   
 4      2014   
 
   Are you openly identified at work as a person with a mental health issue?  \
 0                                                NaN                          
 1                                                NaN                          
 2                                                NaN                          
 3                                                NaN                          
 4                                                NaN                          
 
   Briefly describe what you think the industry as a whole and/or employers could do to improve mental health support for employees.  \
 0                                                NaN                                                                                  
 1                                                NaN                                                                                  
 2          

#### Adding 2022

In [72]:
df_2022 = pd.read_csv("../../data/datasets/osmi_mental_health_2022.csv")

In [73]:
print(df_combined.shape, df_2022.shape)

(4529, 63) (180, 121)


In [74]:
df_combined.columns = df_combined.columns.str.strip()
df_2022.columns = df_2022.columns.str.strip()

In [75]:
# Identify shared columns
common_cols = sorted(list(set(df_combined.columns).intersection(df_2022.columns)))
print("Shared columns:", len(common_cols))

Shared columns: 62


In [76]:
df_2022["SurveyID"] = 2022

if "SurveyID" not in common_cols:
    common_cols = ["SurveyID"] + common_cols

In [77]:
df_combined_common = df_combined[common_cols].copy()
df_2022_common  = df_2022[common_cols].copy()

In [78]:
# Combine into one dataset
df_combined = pd.concat(
    [df_combined_common, df_2022_common],
    ignore_index=True
)

df_combined.head(), df_combined.shape

(   SurveyID  \
 0      2014   
 1      2014   
 2      2014   
 3      2014   
 4      2014   
 
   Are you openly identified at work as a person with a mental health issue?  \
 0                                                NaN                          
 1                                                NaN                          
 2                                                NaN                          
 3                                                NaN                          
 4                                                NaN                          
 
   Briefly describe what you think the industry as a whole and/or employers could do to improve mental health support for employees.  \
 0                                                NaN                                                                                  
 1                                                NaN                                                                                  
 2          

#### Adding 2023

In [79]:
df_2023 = pd.read_csv("../../data/datasets/osmi_mental_health_2023.csv")

In [80]:
print(df_combined.shape, df_2023.shape)

(4693, 63) (6, 126)


In [81]:
df_combined.columns = df_combined.columns.str.strip()
df_2023.columns = df_2023.columns.str.strip()

In [82]:
# Identify shared columns
common_cols = sorted(list(set(df_combined.columns).intersection(df_2023.columns)))
print("Shared columns:", len(common_cols))

Shared columns: 62


In [83]:
df_2023["SurveyID"] = 2023

if "SurveyID" not in common_cols:
    common_cols = ["SurveyID"] + common_cols

In [84]:
df_combined_common = df_combined[common_cols].copy()
df_2023_common  = df_2023[common_cols].copy()

In [85]:
# Combine into one dataset
df_combined = pd.concat(
    [df_combined_common, df_2023_common],
    ignore_index=True
)

df_combined.head(), df_combined.shape

(   SurveyID  \
 0      2014   
 1      2014   
 2      2014   
 3      2014   
 4      2014   
 
   Are you openly identified at work as a person with a mental health issue?  \
 0                                                NaN                          
 1                                                NaN                          
 2                                                NaN                          
 3                                                NaN                          
 4                                                NaN                          
 
   Briefly describe what you think the industry as a whole and/or employers could do to improve mental health support for employees.  \
 0                                                NaN                                                                                  
 1                                                NaN                                                                                  
 2          

## Let's save the combined dataset to a .csv file.

In [86]:
df_combined.to_csv("../../data/datasets/osmi_mental_health_combined_data.csv", index=False)
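The same four steps (strip headers, tag the year, intersect columns, concatenate) are repeated for every year, which makes copy-paste slips easy. One possible way to fold them into a single function is sketched below. The helper name `append_survey_year` and the toy frames are assumptions for illustration; this is not the code that produced the file saved above.

```python
import pandas as pd

def append_survey_year(combined, new_df, year):
    """Append one survey year to the combined frame, keeping only shared columns.

    Assumes `combined` already has a SurveyID column.
    """
    combined = combined.copy()
    new_df = new_df.copy()

    # Strip stray whitespace so equal questions compare equal.
    combined.columns = combined.columns.str.strip()
    new_df.columns = new_df.columns.str.strip()

    # Tag the new rows with their survey year.
    new_df["SurveyID"] = year

    # Keep SurveyID first, followed by the other shared columns in sorted order.
    shared = set(combined.columns) & set(new_df.columns)
    common_cols = ["SurveyID"] + sorted(shared - {"SurveyID"})

    return pd.concat([combined[common_cols], new_df[common_cols]], ignore_index=True)

# Toy data: the 2020 frame has a trailing space in its header and an extra column.
base = pd.DataFrame({"SurveyID": [2019], "UserID": [1], "What is your age?": ["30"]})
y2020 = pd.DataFrame({"What is your age? ": ["41"], "Extra": ["x"]})
y2021 = pd.DataFrame({"What is your age?": ["25", "52"]})

result = base
for year, frame in [(2020, y2020), (2021, y2021)]:
    result = append_survey_year(result, frame, year)

print(result)
```

Because each call intersects against the running result, the set of columns can only shrink as years are added, which matches the behavior of the cells above.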