# Mental Health Analysis 

KATE expects your code to define variables with specific names that correspond to certain things we are interested in.

KATE will run your notebook from top to bottom and check the latest value of those variables, so make sure you don't overwrite them.

* Remember to uncomment the line assigning the variable to your answer and don't change the variable or function names.
* Use copies of the original or previous DataFrames to make sure you do not overwrite them by mistake.

You will find instructions below about how to define each variable.

Once you're happy with your code, upload your notebook to KATE to check your feedback.

# ☑️ How to complete and submit
Each exercise will look something like this:

```python
example_query = """
SELECT *
FROM sample
"""
#example_result = pd.read_sql(example_query, conn)
#example_result
```

In each exercise you will need to define a query variable by writing the SQL code that you think will solve the problem. SQL code should be enclosed in three double or single quotes.

Once you have your query, uncomment the last two code lines. This will execute the query and load the resulting data into a DataFrame.

Nothing else needs to be changed in the last two code lines besides uncommenting them.

After running the cell, inspect the result to check that it is what you expect. KATE looks for variables with the names defined in this notebook, so it is important not to rename them.

Once you've completed the exercises upload this notebook to **KATE** to get feedback. You can also upload the notebook when you only have parts of it completed - if you do so, make sure you do not uncomment the `pd.read_sql` lines for which you don't have a query yet.

Refer to the instructions on **KATE** for more details on the dataset.

# ☑️ Setting up the database connection

Run the following code cell to import `pandas` and `sqlite3` libraries and create the connection to the `mental_health.sqlite` database.

**Do not change this code!** The `conn` variable will be used throughout the notebook to query the database.

In [3]:
import pandas as pd
import sqlite3

conn = sqlite3.connect('data/mental_health.sqlite')

# ☑️ Introduction to the Mental Health dataset 

This dataset contains survey data from Open Source Mental Illness (OSMI).

It was collected through surveys run in 2014, 2016, 2017, 2018 and 2019.

The surveys are meant to give a picture of the mental health situation and the frequency of mental health disorders in the tech industry.

The dataset is available in SQLite format and can be downloaded from [here](https://www.kaggle.com/anth7310/mental-health-in-the-tech-industry).

Some preprocessing was performed before making the dataset available: similar questions were merged together, values for answers were made consistent (for example, `1` and `1.0` were treated as the same value), and spelling errors were fixed.
The raw data was cleaned and manipulated using Python, SQL and Excel.


The database contains three tables: `Survey`, `Question`, and `Answer`.

  1. **Survey**, containing columns:
    - `PRIMARY KEY INT SurveyID`
    - `TEXT Description`


  2. **Question**, containing columns: 
    - `PRIMARY KEY QuestionID`
    - `TEXT QuestionText`


  3. **Answer**, containing columns:
    - `PRIMARY/FOREIGN KEY SurveyID`
    - `PRIMARY KEY UserID`
    - `PRIMARY/FOREIGN KEY QuestionID`
    - `TEXT AnswerText`


The `SurveyID` column contains the survey year (2014, 2016, 2017, 2018 or 2019), and the same question can be used in multiple surveys.

The `Answer` table has a composite primary key made up of several columns. Two of them, `SurveyID` and `QuestionID`, are also [`FOREIGN KEYS`](https://www.w3schools.com/sql/sql_foreignkey.asp) that reference the `Survey` and `Question` tables.

Some questions allow multiple answers, so the same user can appear more than once for a given `QuestionID`.

You can find more information [here](https://www.kaggle.com/anth7310/mental-health-in-the-tech-industry).
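
The structure described above can be tried out on a small in-memory database. The sketch below is only an illustration: the `demo` connection, the toy rows and the question texts are invented, and the toy `Answer` table deliberately declares no primary key so that a user can give two answers to the same question. How the real file declares its keys is not checked here.

```python
import sqlite3

# Hypothetical in-memory stand-in for the schema described above (toy data only).
demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE Survey   (SurveyID INTEGER PRIMARY KEY, Description TEXT);
CREATE TABLE Question (QuestionID INTEGER PRIMARY KEY, QuestionText TEXT);
CREATE TABLE Answer (
    AnswerText TEXT,
    SurveyID   INTEGER REFERENCES Survey(SurveyID),
    UserID     INTEGER,
    QuestionID INTEGER REFERENCES Question(QuestionID)
);
INSERT INTO Survey VALUES (2014, 'toy survey 2014'), (2016, 'toy survey 2016');
INSERT INTO Question VALUES (1, 'What is your age?'), (2, 'Which conditions apply to you?');
INSERT INTO Answer VALUES
    ('37', 2014, 1, 1),
    ('A',  2016, 2, 2),
    ('B',  2016, 2, 2);   -- user 2 gave two answers to question 2
""")

# Count how many answer rows each user has for each question.
rows = demo.execute("""
SELECT UserID, QuestionID, COUNT(*) AS n_answers
FROM Answer
GROUP BY UserID, QuestionID
ORDER BY UserID
""").fetchall()
print(rows)  # [(1, 1, 1), (2, 2, 2)]
```

The second row shows one user appearing twice for a single question, which is why counting rows in `Answer` is not the same as counting users.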

Run the following code cell to show all the tables in the `mental_health.sqlite` database:

In [4]:
query = """
SELECT name 
FROM sqlite_master 
WHERE type='table';
"""
df = pd.read_sql_query(query, conn)
df

Referencing these tables and their respective columns will be useful in answering the following questions. Run the following code to show column names and data types within each table:

In [5]:
for table in ['Answer','Question','Survey']:
    
    query = f"""
    PRAGMA table_info({table});
    """
    df = pd.read_sql_query(query, conn)
    print(df[['name','type']])
    print('='*40)

# ☑️ Queries

**1. Write a SQL query that finds all the records within the `Question` table where the `questionid` is equal to 2 or 3. The columns should be called `Question` and `ID`**

- Using the SQL `AS` keyword, you can assign the aliases `Question` and `ID` to the `questiontext` and `questionid` columns, respectively

- To specify the filtering criteria, you may want to use the `WHERE` keyword

See below code syntax for some guidance:
```SQL
SELECT column_name1 AS <alias>, column_name2 AS <alias>
FROM Question
WHERE <condition1> OR <condition2>;
```

In [6]:
#add your code below


question_2_3_query = """
SELECT questiontext AS Question, questionid AS ID
FROM Question
WHERE questionid = 2 OR questionid = 3
"""
question_2_3_result = pd.read_sql(question_2_3_query, conn)
question_2_3_result



**2. Refer to the `Survey` table. Write a SQL query to retrieve the surveys from 2014 and 2017. The columns should be called `Year` and `Year_Description`**

- Using the SQL `AS` keyword, you can assign the aliases `Year` and `Year_Description` to the `SurveyID` and `Description` columns, respectively

- Consider using `IN` with the `WHERE` keyword to compare a column against a set of values. `IN` can be thought of as shorthand for multiple `OR` conditions


See below code syntax for some guidance:
```SQL
SELECT column_name1 AS <alias>, column_name2 AS <alias>
FROM Survey
WHERE <condition>;
```
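
The equivalence between `IN` and chained `OR` conditions can be checked on a toy table. This is a minimal sketch that assumes an in-memory `Survey` table with invented descriptions; the `demo` connection is hypothetical and separate from the notebook's `conn`.

```python
import sqlite3

# Hypothetical toy Survey table; the descriptions are placeholders.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Survey (SurveyID INTEGER PRIMARY KEY, Description TEXT)")
demo.executemany(
    "INSERT INTO Survey VALUES (?, ?)",
    [(year, f"toy survey {year}") for year in (2014, 2016, 2017, 2018, 2019)],
)

# Same filter written twice: once with OR, once with IN.
with_or = demo.execute(
    "SELECT SurveyID FROM Survey WHERE SurveyID = 2014 OR SurveyID = 2017 ORDER BY SurveyID"
).fetchall()
with_in = demo.execute(
    "SELECT SurveyID FROM Survey WHERE SurveyID IN (2014, 2017) ORDER BY SurveyID"
).fetchall()
print(with_or == with_in, with_in)  # True [(2014,), (2017,)]
```

`IN` tends to stay readable as the list of values grows, whereas the `OR` version needs the column name repeated for every value.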

In [9]:
#add your code below

survey_years_query = """
SELECT SurveyID AS Year, Description AS Year_Description
FROM Survey
WHERE SurveyID IN (2014, 2017)
"""

survey_years_result = pd.read_sql(survey_years_query, conn)
survey_years_result



**3. Refer to the `AnswerText` column in `Answer` table. Write a SQL query to find out how many answers in total have been given throughout the years. Your result should contain one column, called `answers_count`**

- Consider using SQL `COUNT()` function to calculate total number of answers, and assign the alias `answers_count` to the result using the `AS` keyword


See below code syntax for some guidance:
```SQL
SELECT COUNT(column_name) AS <alias>
FROM Answer;
```
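
One detail worth knowing about `COUNT()`: `COUNT(column_name)` skips `NULL` values, while `COUNT(*)` counts every row. The sketch below uses an invented in-memory table to show the difference; whether the real `Answer` table contains any `NULL` answers is not stated here, so treat this as background rather than a fact about the dataset.

```python
import sqlite3

# Hypothetical toy Answer table with one missing (NULL) answer.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Answer (AnswerText TEXT, SurveyID INTEGER)")
demo.executemany(
    "INSERT INTO Answer VALUES (?, ?)",
    [("Yes", 2014), ("No", 2014), (None, 2016)],
)

# COUNT(AnswerText) ignores the NULL row, COUNT(*) does not.
count_column, count_rows = demo.execute(
    "SELECT COUNT(AnswerText), COUNT(*) FROM Answer"
).fetchone()
print(count_column, count_rows)  # 2 3
```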

In [11]:
#add your code below
number_of_answers_query = """
SELECT COUNT(AnswerText) AS answers_count 
FROM Answer;
"""
number_of_answers_result = pd.read_sql(number_of_answers_query, conn)
number_of_answers_result



**4. Refer to the `AnswerText` column in `Answer` table. Write a SQL query to find out how many answers have been given in 2017 and 2019. Your result should contain one column, called `answers_count`**

- Make use of SQL `COUNT()` function to calculate total number of answers, and assign the alias `answers_count` to the result using the `AS` keyword

- Consider using `IN` with `WHERE` keyword to specify a set of values for comparison within a column. The `IN` can be thought of as a shorthand for multiple `OR` conditions


See below code syntax for some guidance:
```SQL
SELECT COUNT(column_name) AS <alias>
FROM Answer
WHERE <condition>;
```

In [15]:
#add your code below
number_of_answers_17_19_query = """
SELECT COUNT(AnswerText) AS answers_count
FROM Answer
WHERE SurveyID IN (2017, 2019);
"""
number_of_answers_17_19_result = pd.read_sql(number_of_answers_17_19_query, conn)
number_of_answers_17_19_result



**5. Refer to the `AnswerText` column in `Answer` table. Write a SQL query to extract the first 100 answers for the year 2014. Your result should contain one column (the `AnswerText`)**

- To specify the filtering criteria, you may want to use the `WHERE` keyword
- Also, consider using `LIMIT` to restrict the results to 100 rows


See below code syntax for some guidance:
```SQL
SELECT column_name
FROM Answer
WHERE <condition>
LIMIT <number_of_rows>;
```
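
SQL does not guarantee any row order unless the query contains an `ORDER BY`, so "the first 100 rows" depends on how the database happens to return them. The sketch below is an optional illustration, not something the exercise asks for: it uses an invented in-memory table and SQLite's built-in `rowid` to make the order explicit, on the assumption that rows were inserted in the order you care about.

```python
import sqlite3

# Hypothetical toy Answer table: even-numbered answers belong to 2014.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Answer (AnswerText TEXT, SurveyID INTEGER)")
demo.executemany(
    "INSERT INTO Answer VALUES (?, ?)",
    [(f"answer {i}", 2014 if i % 2 == 0 else 2016) for i in range(10)],
)

# ORDER BY rowid pins down which rows count as "first" before LIMIT is applied.
first_three = [
    row[0]
    for row in demo.execute(
        "SELECT AnswerText FROM Answer WHERE SurveyID = 2014 ORDER BY rowid LIMIT 3"
    )
]
print(first_three)  # ['answer 0', 'answer 2', 'answer 4']
```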

In [18]:
#add your code below
answer_2014_query = """
SELECT AnswerText
FROM Answer
WHERE SurveyID == 2014
LIMIT 100;
"""
answer_2014_result = pd.read_sql(answer_2014_query, conn)
answer_2014_result



**6. Refer to the `Answer` table. For each year of the survey, how many questions have been asked? Return a table containing the survey year and the number of unique questions that have been asked for each year. Call the survey year column `year` and the second column `survey_answers`**


- Use the `DISTINCT` keyword inside the `COUNT()` function to count the unique values in the `QuestionID` column

- Grouping the data by the `SurveyID` column will be helpful

- Using the SQL `AS` keyword, you can assign the aliases `year` and `survey_answers` to the `SurveyID` column and to the count column, respectively


See below code syntax for some guidance:
```SQL
SELECT column_name AS <alias>, COUNT(DISTINCT(column_name)) AS <alias>
FROM Answer
GROUP BY <column_name>;
```
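
The difference between counting rows and counting unique values per group is easy to see on a few toy rows. The sketch below assumes an invented in-memory `Answer` table in which two users answered the same question in 2014.

```python
import sqlite3

# Hypothetical toy Answer table: (SurveyID, UserID, QuestionID).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Answer (SurveyID INTEGER, UserID INTEGER, QuestionID INTEGER)")
demo.executemany(
    "INSERT INTO Answer VALUES (?, ?, ?)",
    [(2014, 1, 1), (2014, 2, 1), (2014, 1, 2), (2016, 3, 1)],
)

# Per year: number of answer rows versus number of distinct questions.
rows = demo.execute("""
SELECT SurveyID AS year,
       COUNT(QuestionID) AS answer_rows,
       COUNT(DISTINCT QuestionID) AS unique_questions
FROM Answer
GROUP BY SurveyID
ORDER BY year
""").fetchall()
print(rows)  # [(2014, 3, 2), (2016, 1, 1)]
```

For 2014 there are three answer rows but only two distinct questions, which is why `DISTINCT` is needed to count questions rather than answers.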

In [23]:
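# Exploratory cell: load the whole Answer table for inspection.
# Separate names are used for the query string and the DataFrame
# so that neither overwrites the other.
answer_preview_query = """
SELECT *
FROM Answer
"""
answer_preview = pd.read_sql(answer_preview_query, conn)
answer_preview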

In [25]:
#add your code below
answer_per_survey_query = """
SELECT SurveyID AS year, COUNT(DISTINCT(QuestionID)) AS survey_answers
FROM Answer
GROUP BY SurveyID;
"""
answer_per_survey_result = pd.read_sql(answer_per_survey_query, conn)
answer_per_survey_result



In [26]:
question_query = """
SELECT *
FROM question;
"""
question_result = pd.read_sql(question_query, conn)
question_result

**7. Refer to the `Answer` table. Select the maximum age of the participants for each survey year. Return a table containing the survey year and the maximum age of participants for that year. Your result should contain two columns: one called `year` and one called `max_age`**

- Have a look at the Question table first to find which question asks participants about their age
- To calculate the maximum age of participants, consider using the `MAX()` function along with the `CAST()` function
- To specify the filtering criteria, you may want to use the `WHERE` keyword
- Grouping the data by the `SurveyID` column will be helpful
- Use `AS` keyword to assign the aliases

See below code syntax for some guidance:
```SQL
SELECT column_name AS <alias>, MAX(CAST(column_name as int)) AS <alias>
FROM Answer
WHERE <condition>
GROUP BY <column_name>;
```
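
The `CAST()` matters because `AnswerText` is a `TEXT` column, and text values are compared character by character rather than numerically. The sketch below shows the effect on an invented in-memory table with three toy ages.

```python
import sqlite3

# Hypothetical toy Answer table with ages stored as text.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Answer (AnswerText TEXT)")
demo.executemany("INSERT INTO Answer VALUES (?)", [("9",), ("10",), ("45",)])

# MAX on text compares strings, so '9' beats '45'; casting compares numbers.
text_max, numeric_max = demo.execute(
    "SELECT MAX(AnswerText), MAX(CAST(AnswerText AS int)) FROM Answer"
).fetchone()
print(text_max, numeric_max)  # 9 45
```

Without the cast, the "maximum" is the string `'9'`, because `'9'` sorts after `'4'` when compared as text.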

In [28]:
#add your code below
max_age_query = """
SELECT SurveyID AS year, MAX(CAST(AnswerText as int)) AS max_age
FROM Answer
WHERE QuestionID = 1
GROUP BY SurveyID;
"""
max_age_result = pd.read_sql(max_age_query, conn)
max_age_result



**8. Refer to the `Answer` table. Write a SQL query that finds out how many people always, never, or sometimes work remotely. Your result should have one column called `answer`, and one called `count`**

- Have a look at the Question table first to find which question asks participants about how often they work remotely. Note that always, never, and sometimes are the three possible answers.
- Consider using SQL `COUNT()` function to calculate total number of people, and assign the alias `count` to the result using the `AS` keyword
- Assign the alias `answer` to the column `AnswerText`
- To specify the filtering criteria, you may want to use the `WHERE` keyword
- Grouping the data by the `AnswerText` column will be helpful


See below code syntax for some guidance:
```SQL
SELECT column_name AS <alias>, COUNT(column_name) AS <alias>
FROM Answer
WHERE <condition>
GROUP BY <column_name>;
```
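
Rather than scanning the whole `Question` table by eye, the relevant question can be searched for with `LIKE`. This is a minimal sketch on an invented in-memory table: the question texts and IDs below are placeholders and do not match the IDs in the real database.

```python
import sqlite3

# Hypothetical toy Question table; IDs and texts are invented.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Question (QuestionID INTEGER PRIMARY KEY, QuestionText TEXT)")
demo.executemany(
    "INSERT INTO Question VALUES (?, ?)",
    [(1, "What is your age?"), (2, "What is your gender?"), (3, "Do you work remotely?")],
)

# % is a wildcard that matches any run of characters on either side of the keyword.
matches = demo.execute(
    "SELECT QuestionID, QuestionText FROM Question WHERE QuestionText LIKE '%remote%'"
).fetchall()
print(matches)  # [(3, 'Do you work remotely?')]
```

The same pattern with a different keyword (for example `'%age%'`) helps locate the question used in other exercises, although short keywords can match unrelated questions too.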

In [30]:
#add your code below
work_remotely_query = """
SELECT AnswerText AS answer, COUNT(AnswerText) AS count
FROM Answer
WHERE QuestionID = 118
GROUP BY AnswerText;
"""
work_remotely_result = pd.read_sql(work_remotely_query, conn)
work_remotely_result



**9. Refer to the `Answer` table. Write a SQL query that returns the given age of 2016 survey participants as well as the count of participants for each age. Call the age column `participant_age` and the count column `number_of_participants`**

- Consider using SQL `COUNT()` function, and assign the alias `number_of_participants` to the result using the `AS` keyword
- Assign the alias `participant_age` to the column `AnswerText`
- To specify the filtering criteria, you may want to use the `WHERE` keyword
- Grouping the data by the `AnswerText` column will be helpful


See below code syntax for some guidance:
```SQL
SELECT column_name AS <alias>, COUNT(column_name) AS <alias>
FROM Answer
WHERE <condition1> AND <condition2>
GROUP BY <column_name>;
```

In [31]:
# Exploratory cell: load the whole Answer table for inspection.
answer_query = """
SELECT *
FROM Answer;
"""
answer_result = pd.read_sql(answer_query, conn)
answer_result

In [32]:
#add your code below
age_freq_query = """
SELECT AnswerText AS participant_age, COUNT(UserID) AS number_of_participants
FROM Answer
WHERE SurveyID == 2016 AND QuestionID == 1
GROUP BY AnswerText;
"""
age_freq_result = pd.read_sql(age_freq_query, conn)
age_freq_result


**10. This question refers to the query you wrote in Question 6. Now let's make Question 6 a little bit more complicated and order the year in descending order. Call the survey year column `year` and the count column `survey_answers`**

- To sort the data in descending order, use the `ORDER BY` keyword with `DESC`

See below code syntax for some guidance:
```SQL
SELECT column_name AS <alias>, COUNT(DISTINCT(column_name)) AS <alias>
FROM Answer
GROUP BY <column_name>
ORDER BY <column_name> DESC;
```
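
The effect of `ORDER BY ... DESC` on a grouped result can be seen on a few toy rows. The sketch assumes an invented in-memory `Answer` table; note that SQLite lets `ORDER BY` refer to the alias `year` defined in the `SELECT` list.

```python
import sqlite3

# Hypothetical toy Answer table: (SurveyID, UserID, QuestionID).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Answer (SurveyID INTEGER, UserID INTEGER, QuestionID INTEGER)")
demo.executemany(
    "INSERT INTO Answer VALUES (?, ?, ?)",
    [(2014, 1, 1), (2014, 2, 1), (2014, 1, 2), (2016, 3, 1)],
)

# Group by year, then sort the grouped rows from the latest year to the earliest.
rows = demo.execute("""
SELECT SurveyID AS year, COUNT(DISTINCT QuestionID) AS survey_answers
FROM Answer
GROUP BY SurveyID
ORDER BY year DESC
""").fetchall()
print(rows)  # [(2016, 1), (2014, 2)]
```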

In [33]:
#add your code below

answer_per_survey_advanced_query = """
SELECT SurveyID AS year, COUNT(DISTINCT(QuestionID)) AS survey_answers
FROM Answer
GROUP BY SurveyID
ORDER BY year DESC;
"""
answer_per_survey_advanced_result = pd.read_sql(answer_per_survey_advanced_query, conn)
answer_per_survey_advanced_result


