###  Islands: Python Foundations - Chapter 4

[Back to Main Page](0_main_page.ipynb)

[How to use this book interactively on Deepnote](99_how_to_use_this_book.ipynb)

[Download this book](99_how_to_use_this_book_local.ipynb)
<br>

<h1> <center> Functions & Plotting </center> </h1> 

## Importing Libraries

As before, the cell below imports the libraries we need. 

Once again, <b> it is very important you run each cell in this notebook in the order in which they appear. </b> Later cells depend on the activity of earlier cells.

<br>
<br>
<center> ↓↓↓ <b> Before reading on, please run the cell below</b>. Click on the cell and press `shift` and `Enter` together.↓↓↓ </center>

In [1]:
# run this cell (by pressing 'Control' and 'Enter' together) to import the libraries needed for 
# this page

# 'import' tells python to get a set of functions (which is called a library), in the first case this is 
# the numpy library. The 'as' tells python to name the library something (to save us typing out 'numpy'); in this case
# we name the library 'np'
import numpy as np

# in this case we import the pandas library and name it 'pd'
import pandas as pd

# here we import the matplotlib.pyplot library and name it 'plt'
import matplotlib.pyplot as plt

# import the py_found library, containing a set of custom functions for this page
import py_found

# make plots look like R
py_found.r_ify()

# generate the data for this page
psychosis_status_observations, observations_sex, psychosis_scores, names, psychosis_score_200_patients, num_hospitalisations = py_found.function_plot_page_setup()

# this imports the machinery for marking answers to questions
from client.api.notebook import Notebook
ok = Notebook('ok_tests/4_functions_plotting.ok')

Assignment: 4_functions_plotting
OK, version v1.18.1



## Functions

<br>

<center> <img src="https://github.com/pxr687/islands_python_foundations/blob/master/images/psychotic_island.png?raw=true" width="400"> </center>

<br>

Our [psychiatrist friend](2_lists_indexing.ipynb#A-psychotic-island) is being asked by their co-worker to report how many individuals they have sampled so far on the island. 

As a result, the psychiatrist wants to double-check how many people there are in total in the sample we took. (Remember that each element of our ```psychosis_status_observations``` list shows the psychosis status of one person, so the total number of elements in the list equals the number of people we observed):

In [2]:
# run this cell to view the contents of the psychosis_status_observations list
psychosis_status_observations

['psychotic', 'not_psychotic', 'not_psychotic', 'not_psychotic', 'psychotic']

We could check this by counting the number of elements in the list by hand. This might work OK for small samples, but will get cumbersome if our sample has many people in it. (Imagine if our sample consisted of all 10000 people on the island!)

We can get python to count the number of elements in our list. To do this we use a *function*. You can think of a [function as a recipe that takes an ingredient, or set of ingredients, performs a procedure on them and returns a "meal"](https://matthew-brett.github.io/cfd2020/functions-conditionals/functions.html).

In this case we want to use the ```len()``` function. The ```len()``` function will take whatever ingredient it is given - technically, the 'ingredient' is called an 'argument' - and it will count how many elements are in that ingredient. 

```len``` is the name of the function; inside the ```()``` we put the 'ingredient' that we want the function to operate on. In this case, we want to count the number of elements in our ```psychosis_status_observations``` list. So we type:

In [3]:
len(psychosis_status_observations)

5
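As a rough illustration of how `len()` treats different 'ingredients', here is a small sketch. The list and the string in it are made up for this example (they are not part of the island data), and it uses nothing beyond in-built python:

```python
# a hypothetical list, invented for this illustration (not the island data)
example_statuses = ['psychotic', 'not_psychotic', 'psychotic']

# len() counts the elements of a list
number_of_statuses = len(example_statuses)
print('number of elements in the list =', number_of_statuses)

# len() also works on a string, where it counts the characters
number_of_characters = len('psychotic')
print('number of characters in the string =', number_of_characters)
```

In both cases the argument goes inside the `()`, and the function hands back a single number.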

### Question 1

Here is the `observations_sex` list, which shows the biological sex of the five islanders in our sample:

In [4]:
# run this cell to view the contents of the list
observations_sex

['male', 'male', 'female', 'female', 'female']

Use the `len()` function to count the number of elements in the `observations_sex` list. Store the result in a variable called `len_obs_sex`:

In [5]:
# your code here
len_obs_sex = len(observations_sex) # !!! replace with ...


len_obs_sex  # this line just makes the cell output whatever value you have saved in the len_obs_sex variable

5

In [6]:
_ = ok.grade('q1')

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed



There are many other functions in python. ```print()``` is one of them. ```print()``` shows whatever argument we pass to it as an output when we run the cell:

In [7]:
print('The prevalence of psychotic disorders on this island is unusually high...')

The prevalence of psychotic disorders on this island is unusually high...


The `print()` function has many uses. For instance, we can use it to check the value of a variable before and after we've done something to it (this is very useful for checking that our code is doing what we want it to do):

In [8]:
# a variable containing the value 0
my_variable = 0

# showing the value of the variable
print('my_variable =', my_variable)

# adding 1000 to the variable
my_variable = my_variable + 1000

# showing the value of the variable after adding 1000
print('my_variable =', my_variable)

my_variable = 0
my_variable = 1000


Try using the ```print()``` function to print out the phrase `We took a sample of 5 people on the island...`.

(Remember to include the quotation marks around the phrase, so that python recognises it as a string). 

In [9]:
print('We took a sample of 5 people on the island...')

We took a sample of 5 people on the island...

As mentioned previously, the phrase you just printed out is a string. You can check what type of thing something is in python by using the ```type()``` function. Let's use the ```type()``` function to confirm what sort of things we've been dealing with so far:

In [None]:
type(psychosis_status_observations)

Remember that the ```psychosis_status_observations``` list contains a collection of strings, which show whether each person we observed was psychotic or not: ```['psychotic', 'not_psychotic', 'not_psychotic', 'not_psychotic', 'psychotic']```. Let's confirm this with ```type()```. Run the two cells below:

In [None]:
# printing the first element of the psychosis_status_observations list
print(psychosis_status_observations[0])

In [None]:
# using the type() function to see what type of python object the first element of the list is
type(psychosis_status_observations[0]) 

Remember the list containing the psychosis scores for the 5 individuals we observed?

In [None]:
# run this cell to see the contents of the list
psychosis_scores

Let's use ```type()``` to see what type of data is in that list. First, let's look at the first element of the `psychosis_scores` list:

In [None]:
# printing the first element of the psychosis_scores list
print(psychosis_scores[0])

Now, in the cell below, use the `type()` function to reveal what type of data the first element of the `psychosis_scores` list is. If you do this correctly, then the word `int` should be printed out by the cell.

# CHANGE THIS INTO Q2, just store as variable so okpy can mark, and then adjust other question numbers

In [None]:
# using the type() function to see what type of python object the first element of the list is
type(psychosis_scores[0])

It is very important, when analysing data, to be aware of what type of data it is (both in the theoretical sense (e.g. the type of variable) and in the pythonic sense (e.g. the `type` of data in python)). This is because certain functions expect certain types of data as input, and will generate errors if they are given the wrong type of data.
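To make the idea of python `type`s a little more concrete, here is a minimal sketch. All of the values are invented for this illustration, and the variable names are hypothetical:

```python
# hypothetical values, invented for this illustration
whole_number = 12
decimal_number = 12.5
word = 'psychotic'
collection = [12, 12.5, 'psychotic']

# type() tells us what sort of python object each one is
print(type(whole_number))    # an int (a whole number)
print(type(decimal_number))  # a float (a number with a decimal point)
print(type(word))            # a str (a string)
print(type(collection))      # a list

# a string and an int cannot be joined directly with +,
# so we first convert the int to a string using str()
message = word + ' score: ' + str(whole_number)
print(message)
```

Note that a single list can hold elements of several different types, which is one reason it is worth checking the type of the elements as well as the type of the list itself.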

All the functions we have used so far are in-built python functions - they are part of the python language itself. 

However, there are many additional functions we can use which are stored in libraries. We imported the numpy library earlier, and we named it ```np``` (scroll to the first cell at the top of the notebook if you don't remember).

We can use the function ```np.mean()``` to calculate the [average](https://www.investopedia.com/terms/a/arithmeticmean.asp) of a set of numbers:

In [None]:
# run this cell to calculate the average value of the scores in the psychosis_scores list
np.mean(psychosis_scores)
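As a sketch of what `np.mean()` is doing, the cell below calculates a mean 'by hand' (adding the numbers up and dividing by how many there are) and compares it to the `np.mean()` result. The scores are made up for this illustration, and `sum()` is an in-built python function that adds up the elements of a list:

```python
import numpy as np

# hypothetical scores, invented for this illustration
example_scores = [10, 20, 30, 40]

# the mean 'by hand': the total of the scores divided by the number of scores
mean_by_hand = sum(example_scores) / len(example_scores)

# the same calculation using numpy
mean_from_numpy = np.mean(example_scores)

print('mean by hand =', mean_by_hand)
print('mean from numpy =', mean_from_numpy)
```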

### Question 2

The psychiatrist also would like to know the [median](https://www.investopedia.com/terms/m/median.asp) of the psychosis scores.

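The median is the middle value of a set of numbers, once the numbers have been sorted from smallest to largest. If there is an even number of values, the median is the average of the two middle values.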
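The sketch below shows how a median behaves on a few small, made-up lists (none of these numbers come from the island data). It also shows one reason the median can be useful: a single extreme value shifts the mean a lot, but leaves the median where it was:

```python
import numpy as np

# hypothetical scores, invented for this illustration
odd_scores = [7, 1, 5]               # sorted: 1, 5, 7
even_scores = [7, 1, 5, 3]           # sorted: 1, 3, 5, 7
with_outlier = [7, 1, 5, 3, 1000]    # sorted: 1, 3, 5, 7, 1000

# with an odd number of values, the median is the middle value
median_odd = np.median(odd_scores)

# with an even number of values, it is the average of the two middle values
median_even = np.median(even_scores)

# an extreme value pulls the mean upwards, but not the median
median_outlier = np.median(with_outlier)
mean_outlier = np.mean(with_outlier)

print('median (odd number of values) =', median_odd)
print('median (even number of values) =', median_even)
print('median with an extreme value =', median_outlier)
print('mean with an extreme value =', mean_outlier)
```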

Have a go at using the function ```np.median()``` to find the median of the ```psychosis_scores``` list, and store the result as a variable called `psychosis_median`:

In [None]:
psychosis_median = np.median(psychosis_scores) #!!! replace with ...

psychosis_median  # this line just makes the cell output whatever value you have saved in the psychosis_median variable

In [None]:
_ = ok.grade('q2')

# Another clinical trial 

A hospital on the island is conducting a clinical trial of a drug called Fentacriptine. Fentacriptine is hypothesized to reduce psychotic symptoms. 

200 islanders participated in the study. 100 of them were randomly assigned to receive Fentacriptine and 100 were randomly assigned to receive a placebo.

The graph below shows the results from the trial:

In [None]:
placebo, drug = py_found.another_trial_gen()

### Question 3

As part of the process of gathering descriptive statistics from the trial - that is, statistics that merely describe the characteristics of the sample, without making inferences about the underlying population - the psychiatrist wants to calculate the difference between the medians of the two groups.

Here are the arrays containing the scores of each group:

In [None]:
# run this cell to see the scores
placebo

In [None]:
# run this cell to see the scores
drug

In the cell below, use the function `np.median()` to subtract the median of the `drug` group from the median of the `placebo` group. Save the difference in a variable called `difference_in_medians`:

In [None]:
difference_in_medians = np.median(placebo) - np.median(drug) #!!! replace with ..

# show the difference in medians
difference_in_medians

In [None]:
_ = ok.grade('q3')

So is the difference in the medians convincing? 

In [None]:
py_found.another_trial_pop_illustration(placebo, drug)

If the Null World is true, then the result is a fluke. The drug doesn't make a difference to psychosis scores, and the difference between the two groups is due to random sampling. It just happened that in this sample more people in the drug group came from the lower tail of the distribution (the left hand side, lower psychosis scores).

The result is a fluke because it is not what you would typically find if you drew two random samples from the Null World.

Below is more typical:

In [None]:
py_found.another_trial_pop_resample(placebo, drug)

[Re-visit what a median is]

In [None]:
py_found.another_trial_pop_resample_with_medians(placebo, drug)

In [None]:
# demonstration of np.append()

a = [1,2,3]
b = ['x', 'y', 'z']

np.append(a, b)
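As a further sketch of `np.append()`, the cell below joins two small arrays of numbers. The arrays are invented for this illustration. The point to notice is that `np.append()` gives back a new, longer array, and leaves the original arrays unchanged:

```python
import numpy as np

# hypothetical arrays, invented for this illustration
first_group = np.array([1, 2, 3])
second_group = np.array([4, 5])

# np.append() returns a new array: the elements of the first argument,
# followed by the elements of the second argument
both_groups = np.append(first_group, second_group)

print('combined array =', both_groups)
print('number of elements in the combined array =', len(both_groups))

# the original arrays have not been changed
print('first group is still =', first_group)
```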

### Question 4

Use `np.append()` to combine the scores of the `placebo` and the `drug` group into one array. Store the combined array as a variable called `combined_groups`:

In [None]:
combined_groups = np.append(drug, placebo) #!!! replace with ...

# show the array

combined_groups

In [None]:
_ = ok.grade('q4')

### Question 5

Calculate the median of the `combined_groups` array, and store the result in a variable called `grand_median` ('gets the rather grand title...'):

In [None]:
grand_median = np.median(combined_groups) #!!! replace with ...

# show the grand median

grand_median

In [None]:
_ = ok.grade('q5')

In [None]:
# introduce np.count_nonzero()

np.count_nonzero([0,1,2,3])

In [None]:
np.count_nonzero(True)

In [None]:
np.count_nonzero([True, False, True, False, False])

In [None]:
placebo > grand_median

In [None]:
np.count_nonzero(placebo > grand_median)
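The cells above can be summarised in a small sketch, using a made-up array and a made-up threshold (neither comes from the trial data). Comparing an array to a number gives a Boolean array, and because python treats `True` as 1 and `False` as 0, `np.count_nonzero()` ends up counting the `True` values:

```python
import numpy as np

# a hypothetical array and threshold, invented for this illustration
example_scores = np.array([4, 9, 6, 2, 8])
threshold = 5

# the comparison is applied to each element, giving an array of True/False values
is_above = example_scores > threshold
print(is_above)

# True counts as 1 and False counts as 0, so this counts the True values
number_above = np.count_nonzero(is_above)
print('number of scores above the threshold =', number_above)
```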

### Question 6

Calculate the proportion of the `placebo` group change scores which are larger than the grand median. Store in a variable called `placebo_greater_grand_median`.

$ \Large \text{proportion} = \frac{\text{number of elements of interest}} {\text{total number of elements}} $

Do you think this provides evidence that the drug is effective? [Explain null hypothesis, maybe illustrate with a pre-coded simulation]
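The proportion formula above can be sketched on some made-up numbers (the array and the threshold below are invented for this illustration, and are not the trial data). The last line shows an alternative route to the same answer: because `True` counts as 1 and `False` as 0, the mean of a Boolean array is the proportion of `True` values:

```python
import numpy as np

# a hypothetical array and threshold, invented for this illustration
example_scores = np.array([4, 9, 6, 2, 8, 7, 1, 3])
threshold = 5

# the number of elements of interest (here, scores above the threshold)
number_of_interest = np.count_nonzero(example_scores > threshold)

# the total number of elements
total_number = len(example_scores)

# the proportion is one divided by the other
proportion = number_of_interest / total_number
print('proportion above the threshold =', proportion)

# the mean of the Boolean array gives the same proportion
proportion_via_mean = np.mean(example_scores > threshold)
print('proportion via np.mean() =', proportion_via_mean)
```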

In [None]:
placebo_greater_grand_median = np.count_nonzero(placebo > grand_median)/len(placebo) #!!! replace with ...

# show placebo_greater_grand_median

placebo_greater_grand_median

In [None]:
_ = ok.grade('q6')

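### Question 7

Calculate the proportion of the `drug` group scores which are larger than the grand median. Store the result in a variable called `drug_greater_grand_median`.
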
In [None]:
drug_greater_grand_median = np.count_nonzero(drug > grand_median)/len(drug) #!!! replace with ...

# show drug_greater_grand_median

drug_greater_grand_median

In [None]:
_ = ok.grade('q7')

The result we have observed is unlikely if the world is like the Null World... but we need a measure of *how* unlikely. We will come to this later:

In [None]:
py_found.another_trial_pop_resample_with_medians(placebo, drug)

[Show median test function, explain with reference to these distributions above, and working out if the difference in prop above grand median is large enough to be convincing]

In [None]:
# show the median test 
import scipy.stats 

scipy.stats.median_test(placebo, drug)
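As a hedged sketch of how to read the output of `scipy.stats.median_test()`, the cell below runs the test on two made-up groups (randomly generated here, and not the trial data). The function returns four things: a test statistic, a p-value, the grand median of all the scores combined, and a table counting how many scores in each group fall above the grand median (first row) and at or below it (second row). The variable names are hypothetical:

```python
import numpy as np
import scipy.stats

# two hypothetical groups of 50 scores each, randomly generated for this illustration
rng = np.random.default_rng(0)
group_a = rng.normal(loc=10, scale=2, size=50)
group_b = rng.normal(loc=9, scale=2, size=50)

# the function returns four things, which we store in four variables
stat, p_value, grand_med, table = scipy.stats.median_test(group_a, group_b)

# the grand median is the median of all the scores, ignoring group
print('grand median =', grand_med)

# first row: number of scores above the grand median in each group
# second row: number of scores at or below the grand median in each group
print(table)

print('p-value =', p_value)
```

The counts in the first row of the table are the same counts we get from `np.count_nonzero()` with a comparison against the grand median.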

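### Question 8

Use `len()` to record the number of participants in the `placebo` group in a variable called `n_placebo`, and the number of participants in the `drug` group in a variable called `n_drug`.

Make an array called `fake_scores_1`...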

In [None]:
n_placebo = len(placebo) #!!! replace with ...
n_drug = len(drug) #!!! replace with ..

# show the numbers in each group
print('The number of participants in the placebo group = ', n_placebo)
print('The number of participants in the drug group = ', n_drug)

In [None]:
_ = ok.grade('q8')

In [None]:
# demonstrate np.random.normal()
np.random.normal(loc = 10, size = len(placebo))
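Here is a small sketch of what the arguments to `np.random.normal()` do. `loc` sets the centre (the mean) of the distribution the numbers are drawn from, `scale` sets the spread (the standard deviation, which is 1 if we do not specify it), and `size` sets how many numbers we get back. The particular values below are chosen only for illustration, and `np.random.seed()` is used so that the same 'random' numbers are generated each time the cell is run:

```python
import numpy as np

# fix the random number generator, so the results are the same on every run
np.random.seed(0)

# draw 1000 random numbers from a normal distribution centred on 10,
# with a standard deviation of 2 (values chosen for illustration only)
fake_scores = np.random.normal(loc=10, scale=2, size=1000)

# the number of values matches the size argument
print('number of fake scores =', len(fake_scores))

# the mean and standard deviation are close to (but not exactly) loc and scale
print('mean of the fake scores =', np.mean(fake_scores))
print('standard deviation of the fake scores =', np.std(fake_scores))
```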

In [None]:
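# generate one fake score for each participant in the placebo group
fake_placebo = np.random.normal(loc = 10, size = len(placebo))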


As mentioned earlier, it is important to know what type of data you are dealing with, because different functions expect different types of data.

You will get an error if you pass some data of the wrong type to a function that cannot operate on that type.

Remember that the ```psychosis_status_observations``` list contains the following strings: ```['psychotic', 'not_psychotic', 'not_psychotic', 'not_psychotic', 'psychotic']```. See what happens if you try to use ```np.mean()``` on the ```psychosis_status_observations``` list:

In [None]:
np.mean(psychosis_status_observations)

Whilst error messages can look fairly verbose and intimidating, they are useful. The information they give allows us to work out where our code has gone wrong. Googling the error message is often a good way of seeing where other coders have come across it, and how to solve it. 

If you look at the error message above, you can see the phrase `TypeError` on the first and final lines. This tells us (albeit a bit cryptically!) that somewhere in our code a function has been given an argument of the wrong type (e.g. the wrong type of data).

If you look at the top of the error message, we can see that the line of code that generated the error is:

> `----> 1 np.mean(psychosis_status_observations)`

We tried to calculate the average of a set of words; `np.mean()` expects numerical data (`ints` or `floats`) as input, and has generated the error because we gave it the `psychosis_status_observations` list, which contains strings...
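One way to avoid this sort of error is to check the type of the data before passing it to a function. The cell below is a minimal sketch of that idea, using two made-up lists. `isinstance()` is an in-built python function that returns `True` if a value is one of the listed types, and the `if` line means the indented code only runs when the check is `True`:

```python
import numpy as np

# hypothetical lists, invented for this illustration
example_words = ['psychotic', 'not_psychotic']
example_numbers = [12, 15, 9]

# check the type of the first element of each list
print(type(example_words[0]))    # a string, so np.mean() is not appropriate
print(type(example_numbers[0]))  # an int, so np.mean() is appropriate

# isinstance() checks whether a value is an int or a float
first_is_number = isinstance(example_numbers[0], (int, float))
first_is_word_a_number = isinstance(example_words[0], (int, float))
print(first_is_number, first_is_word_a_number)

# only calculate the mean where the data are numerical
if first_is_number:
    mean_of_numbers = np.mean(example_numbers)
    print('mean =', mean_of_numbers)
```

This only checks the first element of each list, so it is a quick sanity check rather than a guarantee that every element is numerical.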

## More functions


Aside from the main hypothesis you are on the island to test ([about the high prevalence of psychotic disorders](2_lists_indexing.ipynb#A-psychotic-island)), your group wishes to collect a variety of data relevant to the study of psychotic disorders.

You obtain some data from a hospital on the island. The dataset contains the scores on a psychosis questionnaire for 200 patients, as well as the number of times each patient was hospitalised during a psychotic episode in the last two years.

The psychosis scores are stored in a variable called `psychosis_score_200_patients` and are shown in the cell below (this variable was assigned 'behind the scenes' at the beginning of the notebook...):

In [None]:
psychosis_score_200_patients

The number of hospitalisations for each patient is stored in a variable called `num_hospitalisations` and is shown in the cell below:

In [None]:
num_hospitalisations

The cells above show the data in its raw form. Look at both of the arrays: can you see any patterns?


It is very hard to see patterns in raw data when it is presented in this format. 


[A BASIC PLOTTING EXERCISE HERE TO KEEP FUNCTIONS RELEVANT BEFORE THE NEXT EXERCISE, e.g. avoid the feeling of 'but why would I need this'?]

In [None]:
plt.scatter(psychosis_score_200_patients, num_hospitalisations)



plt.xlabel('Psychosis Score')
plt.ylabel('Number of Hospitalisations in the Last Two Years')
plt.show()

Let's go through some useful functions which we might want to use on this data. Here are some other useful numpy functions, alongside a brief description of what they do:
<br>
<br>
`np.negative()` - when given a list or array, this function will flip the sign of every element (positive numbers become negative, and negative numbers become positive)
<br>
<br>
`np.round()` - when given a list or array, this function will round each element of the list or array to the given number of decimals
<br>
<br>
`np.max()` - when given a list or array, this function will return the largest element of the list or array
<br>
<br>
`np.min()` - when given a list or array, this function will return the smallest element of the list or array
<br>
<br>
`np.sqrt()` - when given a list or array, this function will return the square root of each element of the list or array
<br>
<br>
`np.sort()` - when given a list or array, this function will order the elements of the list or array from smallest to largest
<br>
<br>
`np.count_nonzero()` - when given a list or array, this function will count the number of elements which are <b>not</b> equal to 0.
<br>
<br>
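As a quick sketch of the functions listed above, the cell below applies each of them to small made-up arrays (invented for this illustration, not the hospital data):

```python
import numpy as np

# a hypothetical array, invented for this illustration
example_array = np.array([4.0, 0.0, 2.25, 9.0])

# flip the sign of every element
negated = np.negative(example_array)

# round each element to 2 decimal places
rounded = np.round(np.array([1.2345, 2.6789]), 2)

# the largest and smallest elements
largest = np.max(example_array)
smallest = np.min(example_array)

# the square root of each element
roots = np.sqrt(example_array)

# the elements ordered from smallest to largest
in_order = np.sort(example_array)

# the number of elements which are not equal to 0
number_not_zero = np.count_nonzero(example_array)

print('negated =', negated)
print('rounded =', rounded)
print('largest =', largest, 'and smallest =', smallest)
print('square roots =', roots)
print('sorted =', in_order)
print('number of non-zero elements =', number_not_zero)
```

Notice that some of these functions return a single number (`np.max()`, `np.min()`, `np.count_nonzero()`), whilst the others return a new array with the same number of elements as the input.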

### Question 5

Using any of the functions shown above, find the <b> lowest </b> psychosis score in the `psychosis_score_200_patients` array, store this in a variable called `lowest_score`.

In [None]:
# answer

lowest_score = np.min(psychosis_score_200_patients) #!!! replace with ...

# show the lowest score
lowest_score

In [None]:
_ = ok.grade('q5')

### Question 6

Using any of the functions shown above, find the <b>highest</b> psychosis score in the `psychosis_score_200_patients` array, store this in a variable called `highest_score`.

In [None]:
highest_score = np.max(psychosis_score_200_patients) #!!! replace with ...

# show the highest score
highest_score

In [None]:
_ = ok.grade('q6')

### Question 7

Using any of the functions listed above, sort the  `psychosis_score_200_patients` from lowest to highest, and store the result in a variable called `sorted_scores`:

In [None]:
sorted_scores = np.sort(psychosis_score_200_patients) #!!! replace with ...


#show the sorted scores
sorted_scores

In [None]:
_ = ok.grade('q7')

## LINK BACK TO MEDIAN DIFFERENCE QUESTION EARLIER, TO USE ROUND FUNCTION, EXTRA QUESTION

## Plotting Functions

One useful type of function is the plotting function, that is, a function for creating graphical displays of data:

<center><i>'The dominant task of the human cortex is to extract visual information from the activity patterns on the retina. Our visual system is therefore exceedingly good at detecting patterns in visualized data sets. As a result, one can almost always see what is happening before it can be demonstrated through a quantitative analysis of the data. Visual data displays are also helpful at finding extreme data values, which are often caused by [...] mistakes in the data acquisition.'</i></center>
<center>(page 51, Haslwanter, 2016, An Introduction to Statistics with Python)</center>


[SET UP GREATER DATA COLLECTION SCENARIO, SHOW SOME NICE PLOTS AND EXPLAIN HOW THEY ARE USEFUL FOR DIFFERENT TYPES OF DATA]

[RANDOMLY GENERATE DATA, SHOW THESE GRAPHS:

* Bar plot (categorical)
* Bar plot (ordinal)
* Histogram (quantitative)
* Scatterplot


*MARK USING IMAGES, EG GENERATE CORRECT GRAPH YOURSELF, define pre-existing hint functions to help users who get stuck*

In [None]:
plt.hist(psychosis_score_200_patients)
plt.show()
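A histogram works by dividing the range of the data into intervals (called 'bins') and counting how many values fall into each one. The sketch below illustrates this using randomly generated, made-up scores (not the hospital data). The `bins` argument sets the number of bins, and `plt.hist()` also returns the counts and the edges of the bins, which we can inspect:

```python
import numpy as np
import matplotlib.pyplot as plt

# 200 hypothetical scores, randomly generated for this illustration
np.random.seed(1)
example_scores = np.random.normal(loc=50, scale=10, size=200)

# plot the histogram with 10 bins, storing the counts and the bin edges
counts, bin_edges, patches = plt.hist(example_scores, bins=10)

# label the axes, so the reader knows what the plot shows
plt.xlabel('Psychosis Score (made-up data)')
plt.ylabel('Number of Patients')
plt.show()

# the counts across all the bins add up to the number of scores
print('number of scores in each bin =', counts)
print('total number of scores =', counts.sum())
```

With 10 bins there are 11 bin edges, because each bin needs a left-hand and a right-hand edge.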

In [None]:
plt.scatter(psychosis_score_200_patients, num_hospitalisations)
plt.show()

## Bar plot here

In [None]:
# write a data generating function, and import it here

# show the data in one cell..

# then a plot in the next

# write a list of 'if this sort of data, then this plot' for the user to check against, have a 'show_plot' to 
# show the correct graph

In [None]:
# For your convenience, you can run this cell to run all the tests at once!
import os
_ = [ok.grade(q[:-3]) for q in os.listdir("ok_tests/4_functions_plotting") if q.startswith('q')]

## Summary

[summary here]

Or, you can [return to the main page](0_main_page.ipynb).

To navigate to any other page, the table of contents is below:

## Other Chapters

1. [Populations, Samples & Questions: Why Learn Python?](1_populations_samples_questions.ipynb)
2. [Lists & Indexing](2_lists_indexing.ipynb)
3. [Arrays & Boolean Indexing](3_arrays_booleans.ipynb)
4. [Functions & Plotting](4_functions_plotting.ipynb)
5. [For Loops - doing things over (and over and over...)](5_for_loops.ipynb)
6. [Testing via Simulation: Psychosis Prevalence](6_simulation_psychosis_prevalence.ipynb)

***
By [pxr687](99_about_the_author.ipynb) 