<center><h1>Data-specific problems in Python</h1></center>

This notebook is a collection of problems using Python to perform tasks relevant to data analysis. The aim is to develop your general Python skills in a data context; we will look at using Python libraries to simplify a lot of this work in the next DS bootcamp.

This version of the notebook uses single examples to illustrate a code structure, then gives you problems to try yourself. If you'd prefer more independence in your practice, try the "no examples" version for a challenge. If you'd prefer more support or find this too difficult, try the "faded examples" version.

<h3>Simple <em>for</em> loops</h3>

The following code computes the length of a list of numbers. Try adding more entries to data, or removing some, and re-running the cell.

In [1]:
data = [1,2,3,4,5]

result = 0
for number in data:
    result += 1 # this is equivalent to "result = result + 1" - i.e., add 1 to the number stored in result
print(result)

5


Note: there is a built-in function, len(), to find the length of a list - so in practice you'd use that rather than writing out your own code like above.
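As a quick check, the count from the loop can be compared with len(). This is only a small sketch; the names loop_count and builtin_count are made up for illustration.

```python
data = [1, 2, 3, 4, 5]

# count the entries by hand, as in the loop above
loop_count = 0
for number in data:
    loop_count += 1

# the built-in len() gives the same answer in a single call
builtin_count = len(data)
print(loop_count, builtin_count)  # prints: 5 5
```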

Write a <em>for</em> loop below to calculate the <em>sum</em> of the numbers in data. Modify the list data a few times to check the output.

In [1]:
data = [1,2,3,4,5]

# your code here

Note: again, there is a built-in sum() function to do this for us - but this is practice writing simple for loops.

Now write a for loop that calculates the <em>sum of squares</em> of a list of numbers. E.g., if $data=[1,2,3]$, your code should calculate $1^2+2^2+3^2=1+4+9=14$, so it should print 14 as the output. Calculating a sum of squares like this is important for finding the <em>standard deviation</em> of a dataset, which we will do later in this workbook.

In [2]:
# your code here


Test your code with the following datasets. The answer you should get is in a comment after each line.

In [2]:
test1 = [1,2,3,4,5,6,7,8,9] # answer = 285
test2 = [-3,-7,12,4,-8] # answer = 282
test3 = list(range(1, 100, 2)) # answer = 166650
# (note: range(1, 100, 2) means "start at 1 and go up to - but not including - 100, in steps of size 2")
# so test3 = [1, 3, 5, ..., 95, 97, 99]
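
If the range() call in test3 is unfamiliar, one way to get a feel for it is to look at the ends of the list it produces. A minimal sketch (the name odds is just for illustration):

```python
odds = list(range(1, 100, 2))

print(odds[:3])     # first three entries: [1, 3, 5]
print(odds[-3:])    # last three entries: [95, 97, 99]
print(len(odds))    # 50 entries in total
print(100 in odds)  # False - the stop value is never included
```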

<h3>Functions for reusable code</h3>
You've now written some <em>for</em> loops, but every time you want to run them on a new dataset, you have to either redefine the "data" list (or whatever you chose to call it) or put the name of your new dataset into the loop in place of the old one. Functions let you reuse code much more elegantly.

The code below takes the <em>for</em> loop from before for calculating the length of a list, but packages it up in a reusable function. Compare it with the original <em>for</em> loop.

In [3]:
def length(data):
    result = 0
    for number in data:
        result += 1 # this is equivalent to "result = result + 1" - i.e., add 1 to the number stored in result
    return result

Note that we have changed from "print(result)" to "return result." We could still use a print() statement if we wanted, but a return statement lets us pass the output from this function into another function, which is very useful. Below are some examples of using the function:

In [4]:
data = [1,2,3,4,5]
print(length(data)) # we can name a list data, and then pass the word "data" into the function

numbers = [100,-1]
print(length(numbers)) # we don't have to use the name "data", just because that was what we called it when defining
# the function - any name for the list will do

print(length([3,9,7,4,-2,-9,12])) # we don't need to give the input list a name at all, we can just put it in directly

print(length(['this', 'is', 'a', 'list', 'of', 'strings', 'but', 'that', 'doesn\'t', 'matter'])) # the length function
# works on any list - not just lists of numbers

5
2
7
10
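
To see why return is usually more useful than print(), here is a small sketch comparing the two. The functions length_printed() and length_returned() are hypothetical names used only for this comparison.

```python
def length_printed(data):
    result = 0
    for number in data:
        result += 1
    print(result)   # shows the answer, but hands nothing back to the caller

def length_returned(data):
    result = 0
    for number in data:
        result += 1
    return result   # hands the answer back so other code can use it

printed = length_printed([1, 2, 3])    # prints 3, but printed is None
returned = length_returned([1, 2, 3])  # prints nothing, and returned is 3

print(printed)       # None
print(returned * 2)  # 6 - the returned value can be used in further calculations
```

A function with no return statement gives back None, which is why the printed version can't be fed into another calculation.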


Write a function that contains your <em>for</em> loop for the sum of numbers. Call the function "adder()", as we will refer to it by that name later on:

In [3]:
# your code here

Now create a function that contains your <em>for</em> loop for the sum of squares. Call the function "sum_squares()"

In [None]:
# your code goes here


Test your sum_squares() function with the same three test sets as earlier. This should be easier now - all you need to do is call sum_squares(test1), etc.

We can combine functions into new functions to do more complicated things. For example, the <em>mean</em> (average) of a list of numbers is found by adding them up and dividing by how many there are; we have an adder() function that adds them up, and a length() function that finds how many there are, so we can combine them like so:

In [None]:
def mean(data):
    output = adder(data) / length(data)
    return output

Note: we didn't need to define output on one line and return it on a separate line - we could have just written a single-line definition of the function with "return adder(data) / length(data)". When the calculations are more complicated though, it's usually better to break them down into steps.

We also could use the built-in functions: "sum(data) / len(data)" - instead of using the ones we defined.
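
As a rough illustration of the built-in route, the sketch below computes a mean with sum() and len(), and cross-checks it against statistics.mean() from Python's standard library. The statistics module is not used elsewhere in this notebook; it is included here only as an independent check.

```python
import statistics

data = [1, 2, 3, 4]

builtin_mean = sum(data) / len(data)   # built-in sum() and len()
library_mean = statistics.mean(data)   # standard-library helper, used as a cross-check

print(builtin_mean)                  # 2.5
print(builtin_mean == library_mean)  # True
```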

We also could have written this function without referring to other functions at all - we could have put the <em>for</em> loop(s) finding the length and sum directly into the definition of this function, like so:

In [5]:
def mean2(data):
    length = 0
    total = 0
    for number in data:
        length += 1
        total += number
    output = total / length
    return output

Note that this takes more effort to read and understand than the first version. Where possible, it is often good practice to write complicated functions in stages, giving each stage a sensible name, so that each definition is as short and easy to understand as possible.

Try out the mean() function on a few sample datasets. You might want to use mean2() as well and check they give the same answers.

We will now work towards writing a function that computes the <em>sample standard deviation</em> of a dataset. Note: the standard deviation is found by subtracting the mean from each number, adding up the squares of the results from that, dividing by <em>one less than</em> how many there are, and then square-rooting.
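
The same recipe can be written as a formula. The symbols are notation introduced here for convenience, not names used in the notebook: $x_1, \dots, x_n$ are the $n$ numbers in the dataset, $\bar{x}$ is their mean, and $s$ is the sample standard deviation.

$$
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}
$$

For example, with the data $[1,2,3]$ the mean is $2$, the differences are $-1, 0, 1$, their squares add up to $2$, dividing by $3-1=2$ gives $1$, and the square root of that is $1$.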

We'll start with the first part of this calculation - subtracting the mean from each number. Write a function meeting the specification below:

In [None]:
# this function should take a list of numbers and output a list of numbers, where the output list is made by taking each number
# in the input list and subtracting the mean of all the input numbers
# e.g., with input [1,2,3], the mean is 2, so the output should be [-1, 0, 1]
def subtract_mean(data):
    # your code here

A typical solution sets up an empty list (differences = []) and then uses a <em>for</em> loop with differences.append() inside it. This is a common code pattern for building up a list one element at a time, and is similar to the pattern we saw earlier of setting result = 0 and then having a for loop where we kept adding something to result.
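
Here is the "empty list, then append in a loop" pattern on a different task, so as not to give away the exercise: converting proportions to percentages. The variable names are made up for this sketch.

```python
proportions = [0.25, 0.5, 0.125]

percentages = []                     # start with an empty list
for value in proportions:
    percentages.append(value * 100)  # add one new entry on each pass through the loop

print(percentages)  # [25.0, 50.0, 12.5]
```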

There is a more elegant and "Pythonic" way to build up a list like this, called "list comprehension." It looks like this:

In [None]:
differences = [number - average for number in data]

So instead of creating an empty list and then adding to it in a <em>for</em> loop, we just write the calculation that we would have done in the for loop "number - average", then indicate the for loop by writing "for number in data" and put the whole thing in square brackets to show it's supposed to be a list.

List comprehensions can make your code a little shorter, but they can be confusing. There's nothing wrong with sticking to the "empty list and <em>for</em> loop" pattern if you prefer that.
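
To make the comparison concrete, the sketch below writes the same calculation both ways, using absolute values as an arbitrary example, and checks that the two lists match.

```python
data = [3, -8, 15, -2]

# "empty list and for loop" pattern
absolutes_loop = []
for number in data:
    absolutes_loop.append(abs(number))

# the same calculation as a list comprehension
absolutes_comprehension = [abs(number) for number in data]

print(absolutes_loop)                             # [3, 8, 15, 2]
print(absolutes_loop == absolutes_comprehension)  # True
```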

If you'd like more practice at list comprehension, try using it to double every number in a list:

In [None]:
data = [1,7,-9,12,-3]
doubles = # your code here

Now, back to calculating the standard deviation! Our subtract_mean() function takes a list of numbers and subtracts their mean from each of them. Now we need to square them all and add them together (remember your sum_squares() function from earlier?), divide by the length minus 1, and then square root the output (you can take a square root in Python with "number ** 0.5").

Write a function to do this:

In [None]:
def st_dev(data):
    # your code here!

Test your code with the following datasets:

In [6]:
test1 = [1,2,3,4,5,6,7,8,9] # answer = 2.738...
test2 = [-3,-7,12,4,-8] # answer = 8.384...
test3 = list(range(1, 100, 2)) # answer = 29.154...
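
If you'd like an independent check of your st_dev() function, one option is statistics.stdev() from Python's standard library, which also computes the sample standard deviation. This is only a cross-check, not a replacement for writing the function yourself; the name library_answers is made up for this sketch.

```python
import statistics

test1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
test2 = [-3, -7, 12, 4, -8]
test3 = list(range(1, 100, 2))

# statistics.stdev() divides by (n - 1), matching the definition used in this notebook
library_answers = [statistics.stdev(test) for test in [test1, test2, test3]]

for answer in library_answers:
    print(answer)
```

The printed values should agree with the answers in the comments above to the digits shown.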

<h3><em>If</em> statements</h3>
The following function checks the sign (positive, negative, or zero) of a number - this can be important for data validation. Check out the use of list comprehension when testing it.

In [7]:
def check_sign(number):
    if number > 0:
        return 'positive'
    elif number < 0:
        return 'negative'
    else:
        return 'zero'

numbers = [-3, 1, 724.6, -11, 0, 2, 200, -2]
print([check_sign(number) for number in numbers])

['negative', 'positive', 'positive', 'negative', 'zero', 'positive', 'positive', 'negative']


Suppose we want to compare Covid cases around the world. We might have a list of case numbers in each country as a proportion of population and we want to classify countries as "green", "amber", or "red" based on their case numbers.

One useful way to do this is with the mean and standard deviation. For Normally distributed data, about 68% of data points are between (mean - stdev) and (mean + stdev), leaving about 32% outside that range - 16% above and 16% below. So the bottom 16% (roughly one sixth) of the data are less than (mean - stdev), 68% are between (mean - stdev) and (mean + stdev), and the top 16% are above (mean + stdev).
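
The 68% and 16% figures can be checked numerically. The sketch below uses NormalDist from Python's standard statistics module (available from Python 3.8), which is not otherwise used in this notebook; the variable names are made up for illustration.

```python
from statistics import NormalDist

# a Normal distribution with mean 0 and standard deviation 1
standard_normal = NormalDist(mu=0, sigma=1)

# proportion of the distribution within one standard deviation of the mean
within_one_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)

# proportion more than one standard deviation below the mean
below_one_sd = standard_normal.cdf(-1)

print(round(within_one_sd, 2))  # 0.68
print(round(below_one_sd, 2))   # 0.16
```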

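Write a function that classifies a number as red, amber, or green based on how it compares to two thresholds. To match the test cell further down, return 'R' if the number is below the lower threshold, 'G' if it is above the upper threshold, and 'A' otherwise: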

In [None]:
def RAG_classifier(number, lower_threshold, upper_threshold):
    # your code here

Now write a function that uses this and the mean and standard deviation functions from before to convert a list of numbers into a list of RAG ratings, where a point is red if it is less than (mean - stdev), green if it is greater than (mean + stdev), and amber if it's in between. Hint: you can use a <em>for</em> loop for this, or a list comprehension; you won't need a new <em>if</em> statement for this, your RAG_classifier() function will handle that.

In [None]:
def stdev_classifier(data):
    # your code here

Test your code by running the following cell:

In [None]:
data = [20.09852562, 23.78812558, 17.16023224, 12.98506367, 21.14158154,
       21.49384295, 20.92017253, 20.20764812, 10.07496259, 23.80358356,
       24.41430014, 25.4310784 , 19.77776613, 22.08102474, 13.08124938,
       25.54101538, 22.02317736, 24.09559675, 18.52046219, 18.53615184]
results = ['A','A','A','R','A','A','A','A','R','A','A','G','A','A','R','G','A','A','A','A']
if stdev_classifier(data) == results:
    print("Correct")
else:
    print("Not quite there yet")

Often we have data that is updated regularly (every month, say) and are interested in flagging changes in the data. Write a function that will take two lists of RAG ratings (of the same size), and return a list of strings "same", "up", or "down" depending on whether the rating for an entry has stayed the same, gone up, or gone down. Treat 'R' as the lowest rating and 'G' as the highest, so a change from 'R' to 'A' counts as "up". Hint: you will need to use either a for loop or a list comprehension, with an if statement inside!

In [None]:
def detect_changes(old_ratings, new_ratings):
    # your code here

Test your code by running the following cell. Note the use of the enumerate() function in the <em>for</em> loop - this can be useful when looping over a list if you want to know both what index you're at in the list and what's in the list at that index. You might want to look into this more yourself!

In [None]:
old_ratings = ['A', 'R', 'A', 'A', 'G', 'R', 'A', 'G', 'G', 'A', 'A', 'A']
new_ratings = ['A', 'G', 'A', 'R', 'G', 'A', 'A', 'A', 'G', 'A', 'G', 'A']
correct_answers = ['same', 'up', 'same', 'down', 'same', 'up', 'same', 'down', 'same', 'same', 'up', 'same']
changes = detect_changes(old_ratings, new_ratings)
is_correct = True
for index, rating in enumerate(correct_answers):
    if rating != changes[index]:
        print(f"There's an error at index {index}")
        is_correct = False
if is_correct:
    print("Well done, all correct")
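
As a smaller illustration of enumerate(), the sketch below loops over a short list and prints each index alongside the item stored there. The list ratings and the name pairs are made up for this example.

```python
ratings = ['A', 'R', 'G']

# enumerate() yields (index, item) pairs, with the index starting at 0
for index, rating in enumerate(ratings):
    print(index, rating)

# wrapping it in list() shows all the pairs at once
pairs = list(enumerate(ratings))
print(pairs)  # [(0, 'A'), (1, 'R'), (2, 'G')]
```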

<h3>Nested <em>for</em> loops</h3>
We have looked at simple <em>for</em> loops; now we will consider <em>nesting</em> one <em>for</em> loop inside another. An example of when this is useful is when you have multiple datasets (or columns within a single dataset) and want to iterate over each in turn.

The function below uses this to find the largest entry in each column. The data is given as a list of lists, where each of the sub-lists should be viewed as a single column.

In [8]:
def find_column_maxes(data):
    maxes = []
    for column in data:
        column_max = column[0]
        for number in column:
            if number > column_max:
                column_max = number
        maxes.append(column_max)
    return maxes

If you're not used to coding with nested loops, you might need a few passes to understand this code.

Start by looking at the middle bit of the code: everything inside the first <em>for</em> loop. We create a variable called column_max and set it equal to the first entry in the column. Then we loop through all the numbers in the column: what do we do with each number? What will the value of column_max be at the end of this for loop? That value is then appended to the list "maxes".

Then we just repeat this process for each different column in the dataset, so maxes builds up as a list of the different values of column_max for each column.

If you don't fully understand how this code works yet, try manually stepping through the code for the following example - pretend you're the computer and walk through what you do at each line of the code.

In [9]:
# this dataset should be interpreted as four columns, each with 3 rows
data = [[1,2,3], [-7, -4, -9], [12, 18, -6], [0, 0, 2]]
find_column_maxes(data)

[3, -4, 18, 2]
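
If stepping through by hand is hard to follow, one option is to let the code narrate itself. The sketch below is the same function with print() calls added; find_column_maxes_traced() is a hypothetical name, and the printed messages are only there to show what happens on each pass.

```python
def find_column_maxes_traced(data):
    maxes = []
    for column in data:
        column_max = column[0]
        print(f"new column {column}: start with column_max = {column_max}")
        for number in column:
            if number > column_max:
                column_max = number
                print(f"  {number} is larger, so column_max = {column_max}")
        maxes.append(column_max)
        print(f"  finished column, maxes = {maxes}")
    return maxes

data = [[1, 2, 3], [-7, -4, -9], [12, 18, -6], [0, 0, 2]]
traced_result = find_column_maxes_traced(data)
```

The returned list is the same as before; only the printed commentary is new.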

Write a function that counts the number of zeroes in each column:

In [None]:
# your code here

Test your code on the following data:

In [None]:
test = [[0,0,0,0], [7, 0, -1, 12], [3, 0, 9, 0], [-1,5,11,9]]
# answer = [4,1,2,0]

Write a function, find_column_means(), that will take a list of lists of lists (a collection of several datasets, where each dataset is a list of columns, and each column is a list of numbers) and return a list of lists, recording the mean of each column in each dataset. To help you understand this, first look at the test data and expected output:

In [None]:
dataset1 = [[1,2,3], [4,5,6], [7,8,9]] # 3 columns, 3 rows
dataset2 = [[5,-1,12,-4], [8,9,-1,0]] # 2 columns, 4 rows
dataset3 = [[0,3,6,9,-12], [1,3,5,-7,4], [-1,-8,-6,3,-2], [0,0,1,0,1]] # 4 columns, 5 rows

test_data = [dataset1, dataset2, dataset3] # a collection of 3 datasets, each with multiple columns

# correct output of find_column_means(test_data) is [[2.0, 5.0, 8.0], [3.0, 4.0], [1.2, 1.2, -2.8, 0.4]]
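
If the three levels of nesting are hard to picture, it can help to index into the structure one level at a time. The sketch below uses the first two datasets from the cell above; the variable names are made up for illustration.

```python
dataset1 = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
dataset2 = [[5, -1, 12, -4], [8, 9, -1, 0]]
test_data = [dataset1, dataset2]

second_dataset = test_data[1]      # one dataset: [[5, -1, 12, -4], [8, 9, -1, 0]]
first_column = test_data[1][0]     # one column of that dataset: [5, -1, 12, -4]
third_number = test_data[1][0][2]  # one number in that column: 12

print(len(test_data))       # 2 datasets
print(len(second_dataset))  # 2 columns in the second dataset
print(len(first_column))    # 4 rows in that column
print(third_number)         # 12
```

Each extra pair of square brackets goes one level deeper, which mirrors how each extra nested for loop goes one level deeper.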

In [None]:
# your code here