# Introduction to Python for Health Research

# Assessment

## How this assessment works

This assessment consists of 10 questions, each of which is delivered in a sequence of notebook cells. Each question requires you to write some Python code to solve a problem, and you can create as many new cells as you like to help you work out the answer. You may find it useful to have some paper and a pen to hand, in order to help you work out the answers, and you may find it helpful to look things up on the Internet<br><br> 

For each question, there is a cell that you need to run in order to submit your answer. You can submit an answer as many times as you want, and the score will be saved<br><br>

You can restart this assessment any time you like, simply by launching it from the Moodle page. The results of all questions you have answered already will have been saved. Be aware that each time you restart this assessment, you will get a different version of the data you are required to solve problems with<br><br>

In order to close the assessment and submit your overall result, you must run the final cell given in this notebook<br><br>

**Important:** In order to save the result of each question you will need to enter your `assessment key`. This is displayed to you when you click on *Click here to start a new attempt and get your assessment key* on the Moodle page 

**If you have not already obtained an assessment key please quit this script now, return to the Moodle page and request one**<br><br>

Please start by running the next cell:

In [None]:
#
# This cell imports data and functions that are used later on
#
!wget -nv https://github.com/kcl-bhi-is-01/I_to_P_anc/raw/main/I_to_P_supp_funcs_v05.pyc
import I_to_P_supp_funcs_v05 as sf

You can use the function call in the following cell to check your progress in this assessment

In the cell, replace the string `my_key` with the assessment key you obtained earlier from the Moodle page. If you have not already obtained an assessment key please quit this script now, return to the Moodle page and request one

You can run `sf.check_my_progress(my_key="my_key")` whenever you like

In [None]:
#
# Run this cell to check your progress
#
sf.check_my_progress(my_key="my_key")
#

## Question 1 

### Read a .csv with Pandas

A subset of a (manicured) Thyroid condition diagnosis dataset *(Thanks to Ross Quinlan, Garavan Institute, Sydney, Australia)* is available in `thyroid.csv`

*Use the Pandas package to read this dataset into a dataframe variable*
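As a reminder of the general pattern, here is a minimal sketch of reading comma-separated data with `pd.read_csv`. The column names and values are invented for illustration, and `io.StringIO` stands in for a file on disk so that the example is self-contained:

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk. With a real
# file you would pass its path to pd.read_csv instead.
csv_text = "patient_id,age,sex\n1,34,F\n2,58,M\n3,41,F\n"

toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df.shape)   # (number of rows, number of columns)
print(toy_df.dtypes)  # the dtype Pandas inferred for each column
```

Checking `shape` and `dtypes` straight after loading is a quick way to confirm that the data was read as you expected.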

You can use the next cell to work out your answer, and the cell below that to submit it

In [None]:
#
# Use this cell to figure out your answer
#

*Use the cell below to submit your answer. Change the string `my_key` to your assessment key, and set the parameter `my_dataframe` to the variable referencing your dataframe*

*If you run the cell with `submit=False`, then your result will **not** be stored in the assessments database. When you are ready, change this to `submit=True` and re-run the cell*

In [None]:
#
# Run this cell to submit your answer
#
sf.check_my_dataframe(my_key="my_key", my_dataframe=None, submit=False)

## Question 2

### Create a dictionary from a .csv

The `outcome` column in `thyroid.csv` is a code which gives the diagnosis condition for each observation (i.e. row). The data set `thyroid_outcomes.csv` contains a textual explanation of each coded condition.

*Create a Python dictionary object, from `thyroid_outcomes.csv`, which can be used to look up the textual explanation of each coded condition. I.e. the `key` of each dictionary item will be the code, and the `value` will be the textual explanation*

HINT: It's easiest to use Pandas to load the data into a dataframe, and then create the dictionary from this.
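One possible way of turning two dataframe columns into a dictionary is sketched below on toy data. The dataframe and its column names (`code`, `meaning`) are hypothetical and are not the columns of `thyroid_outcomes.csv`:

```python
import pandas as pd

# Toy lookup table; the column names are invented for this example
codes_df = pd.DataFrame({
    "code": ["A", "B", "C"],
    "meaning": ["alpha", "beta", "gamma"],
})

# Pair each code with its meaning, then build a dictionary from the pairs
lookup = dict(zip(codes_df["code"], codes_df["meaning"]))

print(lookup["B"])
```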

You can use the next cell to work out your answer, and the cell below that to submit it

In [None]:
#
# Use this cell to figure out your answer
#

*Use the cell below to submit your answer. Change the string `my_key` to your assessment key, and set the argument `my_dictionary` to the variable referencing your dictionary. Change `submit=False` to `submit=True` to save your score*

In [None]:
#
# Run this cell to submit your answer
#
sf.check_my_dictionary(my_key="my_key", my_dictionary=None, submit=False)

## Question 3

### Summarise the `outcome` column in `thyroid.csv`

*Add a new column, called `count`, to a dataframe loaded from `thyroid_outcomes.csv`. For each `outcome` code, the new column should contain the number of rows in `thyroid.csv` with that code*

*You should ensure that the new column, `count`, has a `dtype` of `int64`. This can be done with `thyroid_outcomes_df = thyroid_outcomes_df.astype({"count": int})` (where `thyroid_outcomes_df` is your data frame)* 


HINT: It's quite likely that `thyroid.csv` does not contain all of the `outcome` codes, since it is a heavily reduced subset of the original dataset. For any code that does not appear, the corresponding value in the new `count` column should be `0` 
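The zero-count case can be illustrated with a small sketch on invented data. The names `codes_df` and `obs` are hypothetical, and this is only one of several ways to attach counts to a lookup table:

```python
import pandas as pd

# Toy lookup table of codes, and a toy column of observed codes.
# Code "B" never occurs in the observations.
codes_df = pd.DataFrame({"code": ["A", "B", "C"]})
obs = pd.Series(["A", "A", "C", "A"])

# value_counts() returns a Series indexed by the observed codes
counts = obs.value_counts()

# map() looks each code up in `counts`; unseen codes become NaN,
# which fillna(0) then replaces with zero
codes_df["count"] = codes_df["code"].map(counts).fillna(0)

# The NaN step leaves a float column, so convert it to an integer dtype
codes_df = codes_df.astype({"count": int})

print(codes_df)
```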

You can use the next cell to work out your answer, and the cell below that to submit it

In [None]:
#
# Use this cell to figure out your answer
#

*Use the cell below to submit your answer. Change the string `my_key` to your assessment key, and set the argument  `my_outcome_counts_df` to the variable referencing your dataframe. Change `submit=False` to `submit=True` to save your score*

In [None]:
#
# Run this cell to submit your answer
#
sf.check_outcome_counts(my_key="my_key", my_outcome_counts_df=None, submit=False)

## Question 4

### Write a function to perform `mean imputation`

In `thyroid.csv`, question marks (`?`) are used to denote missing values, both in columns used to hold numeric values (e.g. `TSH`, `T3`, `TT4`, `T4U`, `FTI`, `TBG` - measures of hormones and related substances), and string (categorical) values (e.g. `sex`, `on_thyroxine`, `query_on_thyroxine`, `on_antithyroid_medication`, `sick`, `pregnant`, `thyroid_surgery`, `I131_treatment`)

The objective of this question is to write a function that performs a simple form of **imputation** on the numeric columns in `thyroid.csv`

*Define a function that takes as an argument any column from a Pandas dataframe loaded from `thyroid.csv`, that is used to hold numeric values, e.g. `thyroid_df["FTI"]`. Note that a column specified in this way is a Pandas Series*

*Your function should make a copy of the column and replace any `?`'s with the **mean** of all the non `?` values, rounded to 2 decimal places. It should **return** a Pandas Series with a `dtype` (i.e. data type) of `float64`*

HINT: Import the `copy` module and use `copy.deepcopy()` to make a copy of your function's argument

HINT: `pd.Series([5, 6, 7], dtype="float64")` is one way of forcing a `dtype` of `float64` in a Series

*For some **bonus marks**, write your function so that you can run it with a column argument that is not a numeric column (i.e. contains data other than numbers and `?`). In this case your function should return a Series of `float64` zeros*

HINT: You can use `try:` and `except:` blocks to help with this

*For some **further bonus marks**, write your function so that if you run it with a column argument that is all `?`'s, it will return a Series of `float64` zeros*

HINT: You can use `math.isnan` from the `math` module to check if your calculated mean is `NaN`
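The building blocks mentioned in the hints above can be tried out separately on toy data before you combine them. The sketch below is not a solution to the question, and the Series it uses are invented:

```python
import copy
import math

import pandas as pd

# 1. copy.deepcopy() gives an independent copy: changing the copy
#    leaves the original Series untouched
raw = pd.Series(["1.5", "?", "2.5"])
working = copy.deepcopy(raw)
working.iloc[1] = "9.9"
print(raw.iloc[1])  # still "?"

# 2. Converting non-numeric text to float64 raises a ValueError,
#    which a try/except block can catch
try:
    pd.Series(["1.5", "abc"]).astype("float64")
    converted_ok = True
except ValueError:
    converted_ok = False
print(converted_ok)

# 3. The mean of a Series with no values is NaN, which math.isnan() detects
empty_mean = pd.Series([], dtype="float64").mean()
print(math.isnan(empty_mean))
```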

You can use the next cell to define and test your function, and the cell below that to submit it

In [None]:
#
# Use this cell to define and test your function
#

*Use the cell below to submit your answer. Change the string `my_key` to your assessment key, and set the argument  `my_mean_function` to the name of your function. Change `submit=False` to `submit=True` to save your score*

In [None]:
#
# Run this cell to submit your answer
#
sf.check_my_mean_function(my_key="my_key", my_mean_function=None, submit=False)

## Question 5

### Obtain summary statistics of numeric columns

This question uses a version of `thyroid.csv` that has had **mean imputation** applied to the numeric columns. You can get a copy of this data in a Pandas dataframe by running the function `get_imputed_df()` as given in the next cell.

*There is a Pandas function that produces a set of summary statistics from dataframes. Find out what this function is, and run it on the imputed dataframe, to create a new dataframe summarising the `numeric` columns only*

*Round the values in the summary statistics dataframe to 4 decimal places*

Use the next cell to get the dataframe, and work out your answer

In [None]:
#
# Use this cell to figure out your answer
#
# Run the next line to get the data
#
imputed_df = sf.get_imputed_df()

*Use the cell below to submit your answer. Change the string `my_key` to your assessment key, and set the argument `my_summary_stats_df` to refer to your dataframe. Change `submit=False` to `submit=True` to save your score*

In [None]:
#
# Run this cell to submit your answer
#
sf.check_summary_stats(my_key="my_key", my_summary_stats_df=None, submit=False)

## Question 6

### Obtain frequency distributions of categorical columns

*The objective of this question is to build a Python dictionary containing the frequency distributions of the **categorical** columns of `thyroid.csv`*

*Each dictionary entry `key` should be the column name. Each dictionary entry `value` should be **another dictionary** containing the frequency distribution of the categorical column*

*Each "inner" dictionary should have as its `key` a categorical value (or "label"), and its `value` should be the corresponding frequency count. E.g. the dictionary for `sex` would be in this format `'sex': {'F': 294, 'M': 126, '?': 16}`*

HINT: These are the numeric columns `["age", "TSH", "T3", "TT4", "T4U", "FTI", "TBG"]`, the remainder are categorical

HINT: `pandas.Series.value_counts()` is a useful function for getting a frequency distribution from a dataframe column
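A minimal sketch of `value_counts()` on an invented Series, showing how its result can be converted into a plain dictionary with `to_dict()`:

```python
import pandas as pd

# Toy categorical column, including a "?" label for a missing value
colours = pd.Series(["red", "blue", "red", "?", "red"])

# value_counts() counts each distinct label; to_dict() turns the
# resulting Series into a {label: count} dictionary
freq = colours.value_counts().to_dict()

print(freq)
```

Note that `?` is counted like any other label here, because it is stored as an ordinary string.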

Use the cell below to  work out your answer

In [None]:
#
# Use this cell to figure out your answer
#

*Use the cell below to submit your answer. Change the string `my_key` to your assessment key, and set the argument  `my_frequency_dict` to refer to your dictionary. Change `submit=False` to `submit=True` to save your score*

In [None]:
#
# Run this cell to submit your answer
#
sf.check_frequency_dists(my_key="my_key", my_frequency_dict=None, submit=False)

## Question 7

### Class for thyroid.csv outcomes (1)

*The objective of this question is to build a Python dictionary containing a `key` for each `outcome` code present in `thyroid.csv`*

*Each dictionary entry `value` should be an object from the class `Outcome`, as given in the cell below*

*You can see from the code that the `Outcome` class contains the `outcome` code, the `outcome` description (which you can get from `thyroid_outcomes.csv`), and a dataframe formed from all observations in `thyroid.csv` with that `outcome`*

*Each dataframe within an `Outcome` object must contain observations in the same sequence as they occur in `thyroid.csv`*

HINT: `outcome_subset_df = thyroid_df.loc[thyroid_df["outcome"] == "-"]` creates a dataframe containing only those observations in a dataframe loaded from `thyroid.csv`, with an outcome code of `-` 

HINT: `thyroid_outcomes_df.loc[thyroid_outcomes_df["outcome"] == "-", "outcome_description"].values[0]` fetches the `outcome_description` from a dataframe loaded from `thyroid_outcomes.csv`, with an outcome code of `-`
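The general pattern of filling a dictionary with objects, one per distinct value in a column, might look like the sketch below. The `Group` class and the toy dataframe are hypothetical stand-ins and are deliberately simpler than the `Outcome` class used in this question:

```python
import pandas as pd


class Group:
    # Hypothetical toy class holding a label and the rows that carry it
    def __init__(self, label, rows_df):
        self.label = label
        self.rows_df = rows_df


toy_df = pd.DataFrame({
    "label": ["x", "y", "x", "y", "x"],
    "value": [1, 2, 3, 4, 5],
})

groups = {}
for label in toy_df["label"].unique():
    # Boolean selection with .loc keeps rows in their original order
    subset_df = toy_df.loc[toy_df["label"] == label]
    groups[label] = Group(label, subset_df)

print(groups["x"].rows_df)
```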

Run the cell below to create the class

In [None]:
#
# Class for thyroid outcomes
#
class Outcome:
    #
    # Constructor function
    #
    def __init__(self, outcome_code, obs_df, outcome_desc): 
        #
        # Store data in the new object
        #
        self.outcome_code = outcome_code
        self.outcome_desc = outcome_desc
        self.obs_df       = obs_df
        #
    #
    #
    def mean_age(self):
        #
        # Return the mean age of observations belonging to an outcome
        #
        return round(self.obs_df["age"].mean(), 2)
        #
    #
    #
#
#

Use the cell below to work out your answer

In [None]:
#
# Use this cell to work out your answer
#

*Use the cell below to submit your answer. Change the string `my_key` to your assessment key, and set the argument `my_outcomes_dict` to refer to your dictionary. Change `submit=False` to `submit=True` to save your score*

In [None]:
#
# Run this cell to submit your answer
#
sf.check_outcomes_dict(my_key="my_key", my_outcomes_dict=None, submit=False)

## Question 8

### Class for thyroid.csv outcomes (2)


*The objective of this question is to build a dictionary of tuples, where each `key` is an `outcome` code from `thyroid.csv`*

*Each `value` of your dictionary should be a `tuple` containing firstly, the `outcome` description (which you can get from `thyroid_outcomes.csv`), and secondly, the average `age` of those rows in `thyroid.csv` which belong to the `outcome`*

*The average age should be rounded to 2 decimal places*

*You can use the `Outcome` class, which is given in the cell below. When you run the cell, you will also obtain a list called `outcome_objects_list`. Each element in this list is an object of class `Outcome` derived from `thyroid.csv`*

HINT: If you use `outcome_objects_list`, do you need to read `thyroid.csv` and/or `thyroid_outcomes.csv`?
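Building a dictionary of tuples from a list of objects can be done with a dictionary comprehension. The sketch below uses a hypothetical `Reading` class and invented values, not the `Outcome` class or the thyroid data:

```python
class Reading:
    # Hypothetical toy class with a name, a unit and some measurements
    def __init__(self, name, unit, values):
        self.name = name
        self.unit = unit
        self.values = values

    def mean_value(self):
        # Mean of the measurements, rounded to 2 decimal places
        return round(sum(self.values) / len(self.values), 2)


readings = [
    Reading("temp", "C", [36.6, 37.1, 38.0]),
    Reading("pulse", "bpm", [60, 72]),
]

# One dictionary entry per object: key is the name, value is a tuple
summary = {r.name: (r.unit, r.mean_value()) for r in readings}

print(summary)
```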

Run the cell below to create the class, and obtain `outcome_objects_list`

In [None]:
#
# Class for thyroid outcomes
#
class Outcome:
    #
    # Constructor function
    #
    def __init__(self, outcome_code, obs_df, outcome_desc): 
        #
        # Store data in the new object
        #
        self.outcome_code = outcome_code
        self.outcome_desc = outcome_desc
        self.obs_df       = obs_df
        #
    #
    #
    def mean_age(self):
        #
        # Return the mean age of observations belonging to an outcome
        #
        return round(self.obs_df["age"].mean(), 2)
        #
    #
    #
#
# Get list of Outcome instances
#
outcome_objects_list = sf.get_outcome_objects_list()

Use the cell below to figure out your answer

In [None]:
#
# Use this cell to work out your answer
#

*Use the cell below to submit your answer. Change the string `my_key` to your assessment key, and set the argument `my_tuples_dict` to refer to your dictionary. Change `submit=False` to `submit=True` to save your score*

In [None]:
#
# Run this cell to submit your answer
#
sf.check_tuples_dict(my_key="my_key", my_tuples_dict=None, submit=False)

## Question 9

### Selection using boolean operations

In health research, when we are working with structured data, we work predominantly with Pandas dataframes. When we are working with less structured data e.g. words making up a chunk of free text, Pandas is used less

Although Python has a clear syntax for the evaluation of `Boolean Operations` (i.e. `not`, `and`, `or`) in general, the same syntax cannot be used for applying compound (or multiple) boolean expressions to select rows from Pandas dataframes

The following cell illustrates general Python features, plus 2 approaches for Pandas

In [None]:
#
# 3 approaches for boolean expression evaluation
#
# Python general boolean expressions (i.e. Pandas is not involved). Use parentheses to control precedence, since `and` binds more tightly than `or`
#
temperature      = 22.4
wind_speed       = 8
rain_probability = 0.25
#
if (temperature > 13 and wind_speed <= 9) or (temperature > 19 and wind_speed < 14) and rain_probability < 0.3:
    print("OK for tennis today")
else:
    print("Not OK for tennis today")
#
print()
#
# Pandas boolean expression #1 - basic. You must put () round each comparison
#
max_age = 60
print(thyroid_df.loc[(thyroid_df["outcome"] != "-") & ((thyroid_df["sex"] == "F") | (thyroid_df["age"] >= max_age))].head())
print()
#
# Pandas boolean expression #2 - using query(). Care needed with quotes + odd way of citing variables
# 
max_age = 60
print(thyroid_df.query("outcome != '-' and (sex == 'F' or age >= @max_age)").head())
print()
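
The precedence rule behind the first approach in the cell above can be seen in a small sketch with invented values. In plain Python, `and` binds more tightly than `or`, so leaving out parentheses can change the result:

```python
a, b, c = True, False, False

# Without parentheses this is parsed as: a or (b and c)
implicit = a or b and c

# Parentheses force the `or` to be evaluated first
explicit = (a or b) and c

print(implicit, explicit)
```

Here `implicit` is `True` and `explicit` is `False`, even though both expressions use the same three values.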

*The objective of this question is to generate 5 subsets of `thyroid.csv` as follows:*

*a) Male patients where **all** binary indicators from `on_thyroxine` to `psych` are `f`*

*b) Patients under 60 years of age who have `outcome` codes in the range `A` to `J`*

*c) Observations with `?` in any column, **excluding** the `TBG` column*

*d) Female patients under 60 years of age with `referral_source` = `other`, or Male patients over 65 years of age whose `referral_source` is **not** `other`*

*e) Any patient whose `age` is over the **mean + 1 standard deviation** of `age` and who has any binary indicator, from `on_thyroxine` to `psych`, set to `t`*

HINT: An advantage of using `query()` is that you can write code to build the query as a string variable
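As an illustration of the hint above, the sketch below builds a query string from a list of column names and passes it to `query()`. The dataframe and its `flag_*` columns are invented for this example:

```python
import pandas as pd

# Toy dataframe with three hypothetical indicator columns
toy_df = pd.DataFrame({
    "flag_a": ["f", "t", "f"],
    "flag_b": ["f", "f", "f"],
    "flag_c": ["f", "f", "t"],
})

flag_columns = ["flag_a", "flag_b", "flag_c"]

# Build one comparison per column, then join them with "and"
query_string = " and ".join(f"{col} == 'f'" for col in flag_columns)
print(query_string)

# Only rows where every listed column equals 'f' are selected
all_false_df = toy_df.query(query_string)
print(all_false_df)
```

Joining with `" or "` instead would select rows where at least one of the comparisons holds.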

Use the cell below to  work out your answer

In [None]:
#
# Use this cell to work out your answer
#

*Use the cell below to submit your answer. Change the string `my_key` to your assessment key, and set the arguments  `my_part_a_df` ... `my_part_e_df` to refer to the dataframes containing your answers. Change `submit=False` to `submit=True` to save your score*

In [None]:
#
# Run this cell to submit your answer
#
sf.check_my_booleans(my_key="my_key", my_part_a_df=None, my_part_b_df=None, my_part_c_df=None, 
                     my_part_d_df=None, my_part_e_df=None, submit=False)

## Question 10

### Read and analyse text


*a) Read the text file `declaration_of_independence.txt` such that you can count the number of lines*

*b) Use the `word_tokenize` function from the `nltk` library to count the number of words in the text file. The `import` statement is given in the code cell, below* 

*c) Create a `set` containing all the unique words in `declaration_of_independence.txt`* 

HINT: If two words are not separated by a space or some punctuation, then they will be counted as one word
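A minimal sketch of reading a text file line by line and collecting unique words is given below. It writes a small invented file to a temporary directory so that it is self-contained, and it uses `str.split()` as a simple stand-in for `word_tokenize`, so the counts it gives will differ from those `word_tokenize` produces (for example, `split()` leaves punctuation attached to words, whereas `word_tokenize` separates it into its own tokens):

```python
import os
import tempfile

# Invented text standing in for a real file
text = "First line of text.\nSecond line, with more words.\nThird line.\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # readlines() returns a list with one string per line
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()

line_count = len(lines)

# Whitespace splitting only; this is NOT the same as word_tokenize
tokens = " ".join(lines).split()

# A set keeps one copy of each distinct token
unique_tokens = set(tokens)

print(line_count, len(tokens), len(unique_tokens))
```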

Use the next cell to import `word_tokenize`, and to work out your answer

In [None]:
#
# Use this cell to work out your answer
#
from nltk.tokenize import word_tokenize
#

*Use the cell below to submit your answer. Change the string `my_key` to your assessment key, and set `my_line_count` to the number of lines, `my_word_count` to the number of words, and `my_word_set` to the variable containing your set of uniquely occurring words. Change `submit=False` to `submit=True` to save your score*

In [None]:
#
# Run this cell to submit your answer
#
sf.check_my_words(my_key="my_key", my_line_count=0, my_word_count=0, my_word_set=None, submit=False)

## Submit and end

Please run the following cell to save the overall result for this assessment attempt

You must run this cell even if you have run out of time

*Change the string `my_key` to your assessment key, and run the cell. Change `submit=False` to `submit=True` to save your result*

In [None]:
#
# Run this function to submit your overall result
#
sf.overall_submission(my_key="my_key", submit=False)