# Python Basics for Data Analytics
**NCG613: Data Analytics Project - Practical 1**

This notebook covers fundamental Python operations you'll need for data analytics and causal inference.

**Topics Covered:**
- Python as a calculator
- Variables and data types
- Basic operations

*Note: Setup and installation instructions are in the repository README.*

## Python as a Calculator 

First off,  Python can do simple arithmetic. 

In [None]:
print(3 + 7)

In [None]:
print(4.5 * 6.1)

In [None]:
# 
# A hashtag makes a comment - try some division
# 
print(4 / 2)

In [None]:
print(5.7 / 8.8)

Powers are possible, using `**`

In [None]:
print(2 ** 6)

In [None]:
print(2 ** 0.5)

**Self-test:** what do you think `%` does here?

In [None]:
print(8 % 3)

In [None]:
print(9 % 3)

In [None]:
print(17 % 8)

In [None]:
print(17 % 6)

In [None]:
print(17 % 4)

In [None]:
9 % 4

### More Than One Line of Code

Up until now, most attention has been on representing information in Python, rather than doing very much with it. In this section Python programming will be introduced. From one viewpoint, Python programs can simply be lists of instructions such as those introduced above, stored in a text file and executed in sequence - although some more sophisticated ideas will need to be considered. In any case, rather than just typing things into Jupyter line by line, it is a good idea to create your code in a single chunk and store the programs.
You can do this in Jupyter. You need to press shift+enter to send the Python code to be run; pressing *enter* on its own just creates a new line of code in the box, without running it.
As an example of multiple line entry, note that you can also store the results of calculations in variables:

In [None]:
y = 7
x = 6
print(y)
print(x)

In [None]:
print(x * 6)

In [None]:
y = 9 / 4
print(y)

Can you explain the following?  What do you think `//` does?

In [None]:
z = 9 // 4
print(z)

There are different data types in Python - the two you saw there were *float* and *int* types. To write an int in Python, simply write an integer without any decimal point - e.g., `25` or `-6`. To write a float, enter a decimal (even if the value being written is a whole number) - e.g., `3.0` or `-9.54`. If you mix an int and a float in a calculation, the result is a float.

In [None]:
print(8 + 1.0)

If both numbers are floats,  then unsurprisingly the result here is a float.

In [None]:
print(8.0 - 7.0)

If both numbers are ints then the result is an int.

In [None]:
print(8 - 12)

Note that this is not the case for division: the `/` operator always returns a float, even when one int divides exactly into another (so `4 / 2` gives `2.0`):

In [None]:
print(23 / 8)

The answer to the simple 'how many times does 8 go into 23' version of division is provided by the `//` operator:

In [None]:
print(23 // 8)

This time the result is an `int`.
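
The relationship between `/`, `//` and `%` can be checked directly. The sketch below is an illustration only; the variable names `a`, `b`, `quotient` and `remainder` are chosen here and do not appear elsewhere in the notebook. It uses the built-in `type` function to show which data type each operator returns.

```python
a = 23
b = 8

quotient = a // b    # whole number of times b goes into a
remainder = a % b    # what is left over

print(quotient, remainder)
# Floor division and remainder together recover the original number
print(quotient * b + remainder == a)

# The built-in type() function reports the data type of a value
print(type(a / b))    # true division gives a float
print(type(a // b))   # floor division of two ints gives an int
print(type(4 / 2))    # still a float, even though the answer is whole
```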

Python also has functions - these are similar to those in R - for example

In [None]:
a_number = -7
another_number = 67
print(abs(a_number))

In [None]:
print(abs(another_number))

As well as int and float types,  Python handles quite a few other data types.  *String* is another commonly used one. 

In [None]:
my_name = "Kevin Credit"
print(my_name)

The operator `+` also works for strings - it joins them together. 

In [None]:
first_name = "Kevin"
surname = "Credit"
print(first_name + surname)
# Note that '+' doesn't insert spaces between strings 
# unless you explicitly tell it to

In [None]:
print(first_name + " " + surname)

The `*` operator works when it has one term as a string and the other as an int,  where it repeats the string argument `n` times,  if `n` is the int:

In [None]:
print("Ha " * 3)
# Order of int and string doesn't matter

In [None]:
underline = 20 * "-"
print(underline)

Note that the numerical term here *must* be an int - floats give an error. This makes sense,  as you can't repeat a string 3.4 times,  for example.  However,  note that you get an error even for a float value that is a whole number, such as `4.0`.
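
If a repeat count arrives as a float, one option is to convert it with the built-in `int` function first. The following is a minimal sketch; the variable name `repeats` is illustrative, and note that `int` discards the fractional part rather than rounding.

```python
repeats = 4.0

# "Ha " * repeats would raise a TypeError, because repeats is a float.
# Converting to an int first makes the repetition valid.
print("Ha " * int(repeats))

# int() discards the fractional part, so 3.9 becomes 3
print("Ha " * int(3.9))
```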

The `%` operator also has a use if the first argument is a string.  Here it represents a printing format.   For example

In [None]:
print("%6.4f" % (1/3))

Reformats the result of `1/3` as a floating point number (hence the `f`) taking up 6 characters in total,  with 4  digits after the decimal point.  To find out more about possible formats,  see this link: https://pyformat.info.
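
A few more format specifications, as a hedged illustration (the values below are made up for the example). The `d` code formats an int, `s` formats a string, and several values can be supplied at once by putting them in parentheses.

```python
third = 1 / 3

print("%6.4f" % third)     # 4 digits after the decimal point
print("%8.2f" % third)     # 2 digits, padded on the left to 8 characters
print("%d items" % 12)     # d formats an int
print("%s has %d neighbours" % ("Leinster", 3))   # several values at once
```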

## Python Packages
In most languages, other functions - such as log, sin and so on - are provided. They are available in Python as well, but they are part of a library rather than in the core Python made available when you start up the Python interpreter - as you did earlier on. Individual items in the library are called *packages* (you will also see the term *modules*) and you can access them via the `import` statement. Functions called `sin`, `cos` and so on are available in a package called `math` - a number of constants are also provided, such as `pi`. Here is an example of their use:

In [None]:
import math
theta = math.pi / 3.0
print(math.cos(theta))

In [None]:
print(math.sin(theta))

Note that although the functions for `sin`, `cos` and so on are quite accurate, they aren't perfect, because floating point numbers have limited precision - hence the slight error in `cos(theta)` here. It's probably a good idea to round the number of decimals when printing out results like this - for example:

In [None]:
print('%7.4f' % math.cos(theta))

From the example, you can see that functions imported from `math` are named `math.<fn>` where `<fn>` is the function name. Using functions from packages generally works this way. However, if a lot of functions from a package are used this gets cumbersome - particularly if the package has a long name. One way to get round this is to use the `import ... as` variation:

In [None]:
import math as m
theta = m.pi / 3.0
print('%7.4f' % m.cos(theta))

In [None]:
print('%7.4f' % m.sin(theta))

Here we tell Python to import the `math` package,  but to refer to it as `m` once it is imported.  Another approach is to use the `from ... import` variation.  Here the functions to be used from the package are directly stated,  and afterwards they may be referred to directly.

In [None]:
from math import sin, cos, pi
theta = pi / 3.0
print('%7.4f' % cos(theta))

In [None]:
print('%7.4f' % sin(theta))

However,  the complication with the above approach is that there is nothing to stop several packages having functions with the same names - it then becomes hard to distinguish between which one is to be used.  Thus the last approach, although the most convenient in some ways,  is best confined to short snippets of Python.
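
The name clash described above can be seen with two standard-library packages, `math` and `cmath` (complex-number maths), which both provide a function called `sqrt`. This is a small sketch to illustrate the problem, not something needed for the practical.

```python
import math
import cmath

# With the package prefix, it is always clear which function is meant
print(math.sqrt(4))
print(cmath.sqrt(-4))

# With 'from ... import', the second import silently replaces the first
from math import sqrt
from cmath import sqrt
print(sqrt(4))   # this is now cmath.sqrt, so the result is a complex number
```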

## Python Lists

Until now,  you have seen Python variables containing a single value - either float, int or string.  Python also has variables that contain lists of values.  Items in a list are separated by commas,  and enclosed in square brackets:

In [None]:
ages = [52,21,43,23,19]
print(ages)

A function that applies to lists is `len` - returning the length of the list (ie the number of items):

In [None]:
print(len(ages))

You can also pick out individual items in the list like an array in R:

In [None]:
print(ages[1], ages[0])

However, unlike R (but like C++ and Java) the first element in the list is indexed at zero,  not one.  Thus,  the last element in `ages` is `ages[4]` **not** `ages[5]` - forgetting this is a common source of errors in Python, particularly if you are used to coding in R or MATLAB.  

You can also use negative numbers to select items relative to the end of the list: `ages[-1]` refers to the last item in `ages`, `ages[-2]` the one before that,  and so on:

In [None]:
print(ages[-1], ages[-3])

You can also pick out sub-lists using *slicing* - specifying a sequential list of indices:

In [None]:
print(ages[0:2])

Note that this picks out elements 0 and 1 - but *not* 2 - slicing operators do not include the final element.  What this does mean is that, for example, `ages[0:len(ages)]` selects the entire list `ages` -

In [None]:
print(ages[0:len(ages)])

However,  a classic Python 'gotcha' is that `ages[len(ages)]` without slicing gives an error,  as it refers to `ages[5]`.
It is also possible to omit the expression before or after the `:`.  If the left hand is omitted it is assumed to be element zero;  if the last is omitted, it is assumed to be the last element (plus one).

In [None]:
print(ages[:2])

In [None]:
print(ages[1:])

In [None]:
print(ages[-2:])

In [None]:
print(ages[:-1])

Note the last example returns all of the elements in ages *except* the last one - the term to the right of the `:` isn't included,  as with the `ages[:2]` example.  You can also mix positive and negative indexes in a slice:

In [None]:
# Print all elements except the first and last
print(ages[1:-1])

The expression `[]` refers to an empty list.  Sometimes you can get this as a result of a slice in which the left hand term exceeds the right.

In [None]:
dead_list = ages[2:1]
print(dead_list)

In [None]:
print(len(dead_list))

Empty lists are useful in other situations - as will be seen later.
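
The difference between indexing and slicing beyond the end of a list can be checked with a short sketch. The list is redefined here so the example stands alone, and `try`/`except` (standard Python, but not otherwise covered in this notebook) is used only so that the error is reported without stopping the cell.

```python
ages = [52, 21, 43, 23, 19]

# Slices that run past the end are quietly trimmed
print(ages[3:100])
# A slice that starts past the end gives an empty list
print(ages[10:])

# Indexing a single item past the end raises an IndexError
try:
    print(ages[len(ages)])
except IndexError as err:
    print("IndexError:", err)
```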

Another function for lists is `sorted`.  This sorts the items in the list.

In [None]:
print(sorted(ages))

You can apply slicing to results of functions,  provided they are also lists.  The following prints all values of `ages` except the largest and smallest.

In [None]:
print(sorted(ages)[1:-1])

### Lists of Lists

It is possible for the individual elements of lists to be of different types.

In [None]:
mixed_up = ['Kevin','Credit',8,'Nov',2024]
print(mixed_up)
print(mixed_up[0:2])

Interestingly,  it is also possible to have other lists as elements of lists.

In [None]:
my_details = [['Kevin','Credit'],[8,'Nov',2024]]
print(my_details[0])
# Use successive indexing to access elements
# of lists inside other lists
print(my_details[1][2])

You can use this to represent contiguity between geographical zones - if the provinces of Ireland are indexed by the numbers 0 to 3 for Ulster, Connaught, Leinster, and Munster respectively then the province contiguities (ie the information as to which pairs of provinces share a boundary) can be represented by a list of lists:

In [None]:
province_nbrs = [[1,2],[0,2,3],[0,1,3],[1,2]]
print(province_nbrs[0])

The list of neighbours of province 0 (Ulster) is the first element of the list `province_nbrs` and is itself a list - `[1,2]` - meaning that the neighbouring provinces are Connaught and Leinster.
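
To turn the neighbour indices back into names, a second list holding the province names in the same order can be used. This is a sketch: `province_names` and `ulster_nbrs` are names introduced here for illustration.

```python
province_names = ['Ulster', 'Connaught', 'Leinster', 'Munster']
province_nbrs = [[1, 2], [0, 2, 3], [0, 1, 3], [1, 2]]

# Neighbours of province 0 (Ulster), as index numbers
ulster_nbrs = province_nbrs[0]
print(ulster_nbrs)

# Look up the name for each neighbour index
print(province_names[ulster_nbrs[0]], province_names[ulster_nbrs[1]])

# Number of neighbours of Munster (index 3)
print(len(province_nbrs[3]))
```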


### Slicing and Strings
Python treats strings as lists of a kind - it is possible to access individual characters in a string via list item indexing.

In [None]:
name = 'Kevin Credit'
print(name[0])
print(name[0:5])
print(name[-1])

Functions that work on lists often work on strings by treating them as a list of characters:

In [None]:
print(len(name))
print(sorted(name))

Note that in the `sorted` example,  the result is literally a list of characters - they are sorted,  but not reconstructed into a string. As before,  the slicing operator could be applied to the right of any expression resulting in a string - for example to pick out the initial for my first name:

In [None]:
print(my_details[0][0][0])

and to create a list with both my initials:

In [None]:
my_inits = [my_details[0][0][0],my_details[0][1][0]]
print(my_inits)

### List Methods

As well as functions that return values as lists, there are also *methods*. Methods differ from functions in a number of ways - but in some ways are similar. If `x` is some kind of Python object, then a method is called by `x.<meth>()` where `<meth>` is the method name. For example, the `append` method adds a new item to the end of a list.

In [None]:
print(ages)
ages.append(34)
print(ages)

A key point here is that `append` modified the actual list.  Whereas functions such as `sorted` left the variable `ages` unaltered,  append actually changed it. This is not always the case with methods,  but quite often it is.  There is also a `sort` method which sorts the items in a list,  but actually changes the list,  rather than providing a new list.

In [None]:
# Use the 'sort' method on 'ages'
ages.sort()
# Check 'ages' has actually been permanently altered
print(ages)
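
A common slip is to write `ages = ages.sort()`. Because `sort` changes the list in place, it returns `None` rather than a new list. The sketch below uses a fresh list (`scores`, a made-up name) to show the difference.

```python
scores = [52, 21, 43, 23, 19]

# sorted() returns a new list and leaves the original alone
in_order = sorted(scores)
print(in_order)
print(scores)

# sort() changes the list itself and returns None
returned = scores.sort()
print(returned)
print(scores)
```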

Some methods also return values - `pop` returns the value at a particular index in a list, but then removes that value from the list.

In [None]:
# Pop the last value from 'ages'
oldest = ages.pop(-1)
print(oldest)

# Show that the last value has been removed
print(ages)

Another useful method is `insert` - this places a new value inside an existing list *before* a specified position.

In [None]:
# Put the oldest value back in the 'ages' list, but just before position 2
ages.insert(2,oldest)
print(ages)
# Now add a new age at the beginning
ages.insert(0,63)
print(ages)
# Sort it to keep it in order
ages.sort()
print(ages)

### Joining Lists

Recall that `+` was used for joining strings together.  It can also be used for joining lists

In [None]:
list1 = ['Kevin','Credit']
list2 = ['NUIM',2024]
print(list1 + list2)

## Python Dictionaries

A `dictionary` is similar to a list,  but a key difference is that items are referred to by a name,  rather than by location:

In [None]:
new_car = {'colour':'blue','cylinders':4, 'capacity':1200}

The variable `new_car` is a kind of list, but each of the three items is referred to by name. The items in the dictionary are accessed in a similar way to lists, except that a string containing the name is used, rather than an int, as with lists.

In [None]:
print(new_car['colour'])
print(new_car['cylinders'])

These are useful data types when you wish to associate items of information with a list of people, places and so on.  For example they can associate geographical data with the names of locations

In [None]:
population = {'Ulster':294803,'Connaught':542547,'Leinster':2504814,'Munster':1246088}
print(population['Leinster'])

The two entries in each dictionary element (ie the look-up name and the associated value) are called the `key` and `value` respectively. As with lists, the value can take any form - including a list or another dictionary. For example, the contiguity information for provinces stored earlier as a list could also be stored as a dictionary:

In [None]:
neighbours = {'Ulster':['Connaught','Leinster'],'Connaught':['Ulster','Leinster','Munster'],'Leinster':['Ulster','Connaught','Munster'],'Munster':['Connaught','Leinster']}
print(neighbours['Munster'])

As there is the idea of an empty list,  there is also an empty dictionary,  represented by `{}`. This is useful,  as new items in a dictionary can be created by statements of the form `dict[key]=value`.  If `key` already exists in the dictionary the item will be overwritten,  but if it isn't, a new key/value pair is added.  Thus,  another way to enter the neighbours of the provinces in Ireland is

In [None]:
# Start with an empty dictionary
neighbours = {}
# Add entries one by one
neighbours['Ulster'] = ['Connaught','Leinster']
neighbours['Connaught'] = ['Ulster','Leinster','Munster']
neighbours['Leinster'] = ['Ulster','Connaught','Munster']
neighbours['Munster'] = ['Connaught','Leinster']
# Prove the dictionary works as before
print(neighbours['Munster'])
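
The overwrite-or-add behaviour of `dict[key] = value` can be seen in a short sketch. The `new_car` dictionary is redefined so the example stands alone; the keys `'doors'` and `'owner'` are made up, and the `get` method is standard Python but is not otherwise used in this notebook.

```python
new_car = {'colour': 'blue', 'cylinders': 4, 'capacity': 1200}

# 'colour' already exists, so its value is overwritten
new_car['colour'] = 'red'
# 'doors' does not exist yet, so a new key/value pair is added
new_car['doors'] = 5
print(new_car)

# Asking for a missing key with [] raises a KeyError;
# 'in' and get() are two ways to check first
print('owner' in new_car)
print(new_car.get('owner', 'unknown'))
```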

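Two methods that are sometimes helpful for dictionaries are `keys`, which extracts all of the keys in a dictionary, and `values`, which extracts all of the values. In Python 3 both return list-like *view* objects, which can be looped through or converted to an ordinary list with `list`.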

In [None]:
print(population.keys())
print(population.values())

Note that the order of the values corresponds to the order of the keys. In Python 3.7 and later, dictionaries keep their items in the order in which they were added, so `keys` and `values` return items in insertion order. Even so, unlike lists, dictionary items are looked up by key rather than by position.
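
Because `keys` and `values` line up pairwise, they can be converted to ordinary lists with `list` and matched up again. A minimal sketch, with the `population` dictionary repeated so that the example is self-contained (`names` and `counts` are names introduced here):

```python
population = {'Ulster': 294803, 'Connaught': 542547,
              'Leinster': 2504814, 'Munster': 1246088}

names = list(population.keys())
counts = list(population.values())

# The first key goes with the first value, and so on
print(names[0], counts[0])
print(population[names[0]] == counts[0])

# Keys can be sorted alphabetically, like any list of strings
print(sorted(names))
```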


## Python Programs



### Program Loops

You will have already encountered loops in other languages.  In Python a basic `for` loop looks like this:

In [None]:
for n in [2,4,6,9]:
    print(n)

The main ingredients are the looping variable `n` and the list to loop through - here `[2,4,6,9]`. Note that the 'body' of the loop is indented by a tab. Typing the tab is essential - it's actually part of Python's syntax. In this case, the loop takes every value in this list and prints it out. Also note that this is two lines of Python, and that each line on its own is insufficient to create the loop. Both lines must be entered in a box in Jupyter (lines separated by 'enter') and then shift+enter to run the loop.

For each cycle of the loop, `n` refers to the corresponding item in the list.  Looping through lists is a useful tool if you want to add up their values. Consider the following code:

In [None]:
age_total = 0.0
for age in ages:
    age_total = age_total + age
print(age_total)

Note that the last line in the code has no indent (ie the first character is not tab) - this tells Python that the line is executed after the loop is completed - if it had been indented, Python would execute it on every cycle of the loop. This would result in all of the running totals being printed as well as the final result. In your editor, add a tab to the last line, copy the modified code to the clipboard and paste and run again.

In [None]:
age_total = 0.0
for age in ages:
    age_total = age_total + age
    print(age_total)

The output now shows the running total as predicted.  If you had wanted an average age rather than a total age,  the code is relatively easy to modify - again do this by editing the code in your text editor, and run it.

In [None]:
age_total = 0.0
for age in ages:
    age_total = age_total + age
age_average = age_total / len(ages)
print(age_average)

Finally on this code snippet,  a useful tool is the use of  `+=` - this operator adds the right hand value to the left hand variable,  and stores it in that variable. For example, `x = x + 1` can be replaced by `x += 1`.  The code now becomes

In [None]:
age_total = 0.0
for age in ages:
    age_total += age
age_average = age_total / len(ages)
print(age_average)
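
Python's built-in `sum` function offers a quick cross-check on the loop. The sketch below uses the original five ages (redefined here, since `ages` has been modified by earlier cells) and compares the two approaches; `loop_average` and `builtin_average` are names chosen for this example.

```python
ages = [52, 21, 43, 23, 19]

# Loop version, as in the cell above
age_total = 0.0
for age in ages:
    age_total += age
loop_average = age_total / len(ages)

# Built-in version
builtin_average = sum(ages) / len(ages)

print(loop_average, builtin_average)
print(loop_average == builtin_average)
```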

Now you may see why the `keys` method for dictionaries is useful - it provides a list of keys in a dictionary to loop through.

In [None]:
for province in neighbours.keys():
    print(province,"has", len(neighbours[province]), "bordering provinces")

However, the output is rather messy. The `%` operator seen earlier can be used as a formatting tool to tidy it up. The expression ` fmt % x ` creates a string in which the variable `x` is formatted according to a specification in the string `fmt` - the range of possible formats is large, but for now note that the format `'%10s'` takes a string and pads it out with spaces to have a length of 10, if it is shorter than this beforehand. The code below adds a statement in the loop to do this.

In [None]:
for province in neighbours.keys():
    fmt_prov = '%10s' % province
    print(fmt_prov,"has", len(neighbours[province]), "bordering provinces")

Also, it is possible to left-justify the province names by replacing `10` with `-10` in the format statement.

In [None]:
for province in neighbours.keys():
    fmt_prov = '%-10s' % province
    print(fmt_prov,"has", len(neighbours[province]), "bordering provinces")

### Defining Functions

As well as the functions that are built in to Python (such as `len`) it is possible to define your own functions.  For example,  to define a function to compute the average of a list of numbers,  the following can be used:

In [None]:
def average(data_set):
    total = 0.0
    for item in data_set:
        total += item
    return total / len(data_set)

As with `for` loops the indents (via tabbing) are actually part of the syntax.  The indented code under the `def` statement is part of the function - and when the indenting stops,  the function definition is complete.  Note also the loop inside the function.  The loop body is doubly indented, because 

  1. It is in a loop; and
  2. The loop is inside the body of a function.

If you enter this into a Jupyter box and send it to Python by hitting shift+enter, you have then added a new function to Python, called `average`. At this stage you will see no print-out because you have only *defined* the function - not used it. In the next box enter the following to test it out:

In [None]:
height = [160.0,163.0,157.0,171.0,168.0,176.0]
print(average(height))

You can see that `average` now works like any other Python function.
This could be made to look neater by formatting the result:

In [None]:
print('%8.2f' % average(height))
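
One limitation of `average` as written is that an empty list causes a division by zero. A possible guard is sketched below; `safe_average` is a hypothetical name, returning `None` for an empty list is just one choice among several, and the `if` statement it relies on is introduced later in this notebook.

```python
def safe_average(data_set):
    # An empty list has no average, so return None instead of dividing by zero
    if len(data_set) == 0:
        return None
    total = 0.0
    for item in data_set:
        total += item
    return total / len(data_set)

print(safe_average([160.0, 163.0, 157.0, 171.0, 168.0, 176.0]))
print(safe_average([]))
```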

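Note that functions can return lists and dictionaries as well as single values. A related example is the built-in function `range`: `range(n)` produces the sequence of numbers from 0 to `n - 1`. In Python 3 it returns a `range` object rather than a list, so wrap it in `list` to see the values:

In [None]:
print(list(range(5)))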

This is useful in basic loops that count through the value of some index:

In [None]:
for i in range(10):
    print(i, i*i)

Again,  watch out for the zero indexing,  people often expect the index to run from `1` to `10` not `0` to `9`.

`range` is also useful for loops that have to be run a fixed number of times,  but without actually referring to the index variable.  For example,  to create a list with 10 entries,  all equal to zero, use

In [None]:
l = []
for i in range(10):
    l += [0]
print(l)

The `+=` here works in list mode,  ie the statement is equivalent to `l = l + [0]`,  which appends a new element `0` to the existing list `l`.  Because the list is initially empty,  doing this 10 times gives a list of 10 zeroes.  The value of `i` is not used in the loop,  but because it loops over 10 values,  the desired effect is achieved.
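
For this particular task there is also a shorthand: the `*` operator repeats lists in the same way that it repeats strings. A small sketch (the name `zeros` is introduced here) comparing the two approaches:

```python
# Loop version, as above
l = []
for i in range(10):
    l += [0]

# Repetition with *, as used for strings earlier
zeros = [0] * 10

print(zeros)
print(l == zeros)
```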

Putting some of these ideas together, the function below returns a list of length `n` where each element in the list is twice the value of its predecessor. Enter the following in a new Jupyter box and run it.

In [None]:
def doubling_up(n):
    latest = 1
    result = []
    for i in range(n):
        result += [latest]
        latest *= 2
    return result
    
print(doubling_up(4))

You can now combine the two functions you have written:

In [None]:
print(average(doubling_up(12)))
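
The result can be checked by hand. The doubling sequence of length `n` is 1, 2, 4, and so on up to `2 ** (n - 1)`, and its sum is `2 ** n - 1`, so the average should be `(2 ** n - 1) / n`. The sketch below repeats both functions so that it runs on its own; `expected` is a name introduced for the comparison.

```python
def average(data_set):
    total = 0.0
    for item in data_set:
        total += item
    return total / len(data_set)

def doubling_up(n):
    latest = 1
    result = []
    for i in range(n):
        result += [latest]
        latest *= 2
    return result

n = 12
expected = (2 ** n - 1) / n   # closed-form value for comparison
print(average(doubling_up(n)), expected)
```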

### Local Variables

When you create a function via `def` you often use variables inside the function. These are known as `local` variables. An interesting characteristic of these is that they only exist inside the function definition - so in the function `doubling_up` the variables `latest` and `result` do not exist once the function has been run. The other characteristic of local variables is that if in the main program there are variables with the same names, they won't be altered when you call the function.

Hence:

In [None]:
latest = 'Hello'
print(doubling_up(6))
print(latest)

You can also verify that `print(result)` leads to an error, as `result` is undefined outside of the function.
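
Both properties of local variables can be demonstrated in one cell. This is a sketch: the function is repeated so the cell runs on its own, and `try`/`except` (standard Python, not otherwise covered here) is used only to display the error without stopping execution. It assumes that no variable called `result` has been created outside the function.

```python
def doubling_up(n):
    latest = 1
    result = []
    for i in range(n):
        result += [latest]
        latest *= 2
    return result

latest = 'Hello'
doubled = doubling_up(6)

# The outer 'latest' is untouched by the function's local 'latest'
print(latest)

# 'result' only existed inside the function, so this raises a NameError
# (assuming no variable called 'result' was created in another cell)
try:
    print(result)
except NameError as err:
    print("NameError:", err)
```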

### If - Then - Else

Python also supports *conditional statements* - these are sections of code that are only run if some condition is true or false.  To begin with,  note that Python can also evaluate *logical expressions*:

In [None]:
x = 6
y = 9
print(x < 8)
print(x == 6)
print(x >= y)

These are expressions that have the value `True` or `False` depending on the truth of the statement. The general comparison operators are:

| Operator | Meaning |
| :- | :- |
| == | Equal to |
| != | Not equal to |
| < | Less than |
| > | Greater than |
| <= | Less than or equal to |
| >= | Greater than or equal to |

Note the difference between `=` and `==`. `x == 6` is an expression having the value `True` or `False` depending on whether `x` has the value `6` or not,  but `x = 6` assigns the value `6` to `x`,  overwriting any previous value.  
  
In addition, these can be combined using `not`, `and` and `or`.

For example:

In [None]:
print(not x < 8)
print(x == 6 or y == 5)
print(x > 5 or y > 15)

Another useful operator is `in`:

In [None]:
print(x in [5,7,9])
print(x in range(12))

These can be used in conjunction with the `if` statement - this works using the tabbing approach,  in the same way as `def` and `for`:

In [None]:
z = 4
if z < 8 :
    print('z is less than 8')
    print('So it must be pretty small...')

The lines that are indented with tabs after the `if` statement are only executed if the logical expression is true. Once the tabbing stops,  the lines are executed regardless of the test condition:

In [None]:
if z < 8 :
    print('z is less than 8')
    print('So it must be pretty small...')
print('This gets printed anyway')

In [None]:
z = 12
if z < 8 :
    print('z is less than 8')
    print('So it must be pretty small...')
print('This gets printed anyway')

There is also an `else` statement - this specifies code to be executed if the test in the `if` statement *isn't* true.  Its use is demonstrated here:

In [None]:
z = 12

if z < 8 :
    print('z is less than 8')
    print('So it must be pretty small...')
else:
    print('z is at least 8')
    print('So it is fairly big')
print('This gets printed anyway')

Once again,  the code associated with the `else` statement is indented with a tab. 
Now set `z` to some value less than `8` and try running the code above again.

As before, you can incorporate all of the ideas together. For example, define a function to compute factorials. The factorial of a number `n` is defined as `n * (n - 1) * (n - 2) * ... * 2 * 1`, so `factorial(4)` is `4 * 3 * 2 * 1 = 24`. An exception is that the factorial of zero is one. A function to compute factorials is

In [None]:
def factorial(n):
    if n == 0:
        return 1
    else:
        result = 1
        for i in range(n):
            result *= i + 1
        return result

If you define this in a Jupyter box,  you can then try it out.

In [None]:
print(factorial(5))
print(factorial(0))

There are a few things to note about the function definition. Firstly, note the multiple tab nesting - there is a `for` loop inside an `else` statement inside a `def` of a new function. Another thing to note is the `result *= i + 1` statement - this keeps a running result of the multiplications, but because `range(n)` runs from `0` to `n - 1`, it is necessary to use `i + 1` as the multiplier. As a self test question, what would happen if you used `result *= i` instead?
 
Next,  an interesting aside - factorials can get very large quite rapidly - for example the factorial of 20 is 2,432,902,008,176,640,000. Try to compute the factorial of 50:

In [None]:
print(factorial(50))

In Python 3, the `int` type can hold integers of arbitrary length, so an integer calculation never overflows in the way it would with the fixed-size (for example 4-byte) integers used in many other languages. (Python 2 had a separate `long` type for this purpose.) Here is a more extreme example

In [None]:
print(factorial(1000))
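
Two quick checks on the results, offered as a sketch: the standard library's `math.factorial` provides an independent answer to compare against, and converting a result to a string with `str` lets `len` count its digits. The function is repeated so that the cell is self-contained.

```python
import math

def factorial(n):
    if n == 0:
        return 1
    else:
        result = 1
        for i in range(n):
            result *= i + 1
        return result

# Compare with the version supplied in the math package
print(factorial(50) == math.factorial(50))

# Count the digits in a very large result
print(len(str(factorial(1000))))
```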

Factorials are not defined for negative numbers.  However the current function does not check for this - 

In [None]:
print(factorial(-4))

so the answer does not make sense. It might be better to modify the function to test whether the number is negative,  and instead of returning a numerical result, return the string `'Undefined'`.  One way to do this is to test whether the number is negative, return `'Undefined'` if that is true, and then put the existing code in an `else` clause.

In [None]:
def factorial(n):
    if n < 0 :
        return 'Undefined'
    else:
        if n == 0:
            return 1
        else:
            result = 1
            for i in range(n):
                result *= i + 1
            return result

It is possible to click on a Jupyter cell you have already entered and edit the code. When you have done that, pressing shift+enter re-submits the function to Python. Once the factorial function has been replaced by the edited version above, it behaves differently:

In [None]:
print(factorial(-4))
print(factorial(4))

Note that the code above is checking for three cases of `n` - either `n < 0`, or `n == 0` or `n > 0`. The above approach deals with this, but requires that you nest several `if` statements. A shorthand version uses `elif` - the template here is

  1. `if` first condition to test
  2. 'tabbed in' code to execute if above is true
  3. `elif` Next condition (if first condition not true)
  4. 'tabbed in' code to execute (if above condition is true)
  5. repeat steps 3 and 4 for any further conditions (each is only tested if no previous condition was true)
  6. `else` Do this if none of the above conditions are true - this is the catch-all
  7. 'tabbed in' code to execute  

Steps 6 and 7 can be omitted if no catch-all code is required.

In [None]:
def factorial(n):
    if n < 0 :
        return 'Undefined'
    elif n == 0:
        return 1
    else:
        result = 1
        for i in range(n):
            result *= i + 1
        return result

The factorial function now checks for negative numbers - but an alternative to returning a value if one is found is to cause an error to be raised.  Python has its own errors,  but it is also possible to create new ones, via the `raise` statement - as in this code:

In [None]:
def factorial(n):
    if n < 0 :
        raise Exception('Factorial not defined for negative integers')
    elif n == 0:
        return 1
    else:
        result = 1
        for i in range(n):
            result *= i + 1
        return result

If you edit the function again then you can test it out: 

In [None]:
print(factorial(-6))

You will see an error of the form

`Exception: Factorial not defined for negative integers`

returned. This works in the same way as a Python built-in error - but is more helpful in identifying the problem, since it relates directly to the function you are defining.
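
An error raised in this way can also be caught by the calling code using `try` and `except`, which are standard Python but not otherwise covered here. The sketch below repeats the function so that it runs on its own. In practice a more specific error type such as `ValueError` is often preferred to the general `Exception`; that is a matter of style rather than something this notebook requires.

```python
def factorial(n):
    if n < 0:
        raise Exception('Factorial not defined for negative integers')
    elif n == 0:
        return 1
    else:
        result = 1
        for i in range(n):
            result *= i + 1
        return result

for value in [4, -6]:
    try:
        print(value, factorial(value))
    except Exception as err:
        # The message passed to raise is available from the caught error
        print(value, "Error:", err)
```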

## The While Loop

The while loop is another kind of loop, making use of a logical expression. The number of times a `for` loop cycles is determined when the loop is started - it is just the length of the list in the expression `for i in list :`. A while loop begins with a logical expression and loops as long as the expression is true. In this case, the number of times the loop cycles is not known in advance. Below a while loop is used to construct a 'doubling up' sequence, as before, but this time instead of going for a fixed length it carries on until the value exceeds 1000:

In [None]:
latest = 1
result = []
while latest <= 1000:
    result += [latest]
    latest *= 2
print(result)

This can also be used in functions - again note the use of tabbing. In this code block,  the function `double_up_until` is defined,  and then also run.

In [None]:
def double_up_until(n_max):
    latest = 1
    result = []
    while latest <= n_max:
        result += [latest]
        latest *= 2
    return result
print(double_up_until(10000))
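
A `while` loop is also a natural fit when the question is 'how many steps until...?'. The sketch below counts how many doublings are needed before a value exceeds a limit; `doublings_needed` is a hypothetical helper name, not part of the practical.

```python
def doublings_needed(n_max):
    latest = 1
    count = 0
    # Keep doubling until the value exceeds n_max
    while latest <= n_max:
        latest *= 2
        count += 1
    return count

print(doublings_needed(1000))
print(doublings_needed(10000))
```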

## Saving your Python Session,  and Coming Back Later
When you have finished the exercise (or at any stage during the exercise) it is possible to save your status. You do this by clicking on **File** and then **Save Notebook as...** from the menu. Choose a suitable name for your notebook (it should have the file ending `.ipynb`) and click **Save**. When you load this notebook again (through Jupyter Notebook), all of the commands you entered will still be there, but at this stage they won't have been run. To run all of the Jupyter cells in the order you entered them, click on **Kernel** and then **Restart & Run All**. You can now see everything you already entered, and any printouts they produced. Also any variables you created will exist for use in future code you put in.

## Making Python more like R - working with `numpy`
Now you will look at more data types and tools in Python for advanced data analysis. In particular you will see how Python can handle matrices, and some basic 2D and 3D graphics tools. One of the key packages to enable this is called `numpy`. This package should be available in a standard Anaconda Python installation. One of the key things that `numpy` does is to provide matrix and array data types. Although similar to the lists above, in some ways they are more powerful. The following code demonstrates this:

In [None]:
# First import numpy
import numpy as np
# Create two 1D arrays using np.array
x = np.array([5.0,6.0,8.0,10.0])
y = np.array([8.5,8.4,8.1,7.2])
# Note that adding up arrays works element by element
print(x + y)
# and numpy functions also work like this - so eg np.log 
# is better with arrays than math.log

In [None]:
print(np.log(x))
# Two dimensional arrays (ie matrices) are also possible

In [None]:
z1 = np.array([[2.0, 3.1, -2.0],[1.0,-1.0,6.3]])
z2 = np.array([[-2.0, 4.7, 3.1],[-1.0,-1.4,1.8]])
print(z1 + z2)

In [None]:
print(np.sin(z2))

The data types provided by `numpy` are quite similar to the arrays and matrices in R - and are useful for data analysis in the same way.  Some of the statistical functions are also provided:

In [None]:
# mean, std are supported
print(np.mean(x))

In [None]:
z_score = (x - np.mean(x))/np.std(x)
print(z_score)

You can use these functions in your own functions:

In [None]:
def zscore(dat):
    return (dat - np.mean(dat)) / np.std(dat)
print(zscore(y))
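As a quick sanity check on the idea, a standardised array should have a mean of 0 and a standard deviation of 1. The sketch below redefines `zscore` so that it runs on its own; the `heights` values are made up purely for illustration.

```python
import numpy as np

def zscore(dat):
    # Subtract the mean, then divide by the standard deviation
    return (dat - np.mean(dat)) / np.std(dat)

# Made-up values, purely for illustration
heights = np.array([150.0, 160.0, 170.0, 180.0])
z = zscore(heights)
print(z)

# The standardised values should have mean ~0 and standard deviation ~1
print(np.isclose(np.mean(z), 0.0), np.isclose(np.std(z), 1.0))
```

Note that `np.std` divides by `n` by default, whereas R's `sd` divides by `n - 1`. Passing `ddof=1` to `np.std` gives the R-style result.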

## 2D Arrays and Matrices

It is also possible to use `np.mean` and so on with higher dimensional data. Functions that work in this way include `np.mean`, `np.std`, `np.sum`, `np.min` and `np.max`. You can use them to compute values for the entire matrix, or by specifying `axis=0` provide column-wise results, or row-wise results for `axis=1`:

In [None]:
print(np.mean(z1))
print(np.sum(z1,axis=0))
print(np.min(z1,axis=1))
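To make the `axis` argument easier to follow, here is a minimal sketch using a small made-up array `m` whose sums are easy to check by hand:

```python
import numpy as np

# A 2-row, 3-column array chosen for easy mental arithmetic
m = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

total = np.sum(m)             # whole array: 21
col_sums = np.sum(m, axis=0)  # one value per column: 5, 7, 9
row_sums = np.sum(m, axis=1)  # one value per row: 6, 15
print(total)
print(col_sums)
print(row_sums)
```

One way to remember this: `axis=0` collapses the rows (leaving one value per column), and `axis=1` collapses the columns (leaving one value per row).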

Vector and array objects also have **attributes** - these are basically properties of the objects - such as the number of rows or columns a matrix has.  That particular attribute is called `shape`:

In [None]:
print(y.shape)
print(z2.shape)
print(np.min(z1,axis=0).shape)

The result of each of these expressions is a `tuple` - a Python object similar to a list. One useful property of tuples is that although it is possible to assign a whole tuple to a variable, it is also possible to do this element by element:

In [None]:
num_rows, num_cols = z1.shape
print("Rows =", num_rows, "; Cols=", num_cols)

It is also possible to use *slicing* on arrays. For example,  to pick out columns 0 and 1 from `z1` use:

In [None]:
part_z1 = z1[:,0:2]
print(part_z1)

Note - a slice expression is needed in both dimensions, hence the `:` in the row dimension - stating that *all* rows are wanted. Also remember that the right-hand value in a slice expression refers to the element *after* the last one to be selected.
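A few more slicing patterns, sketched on a small made-up array `m` rather than on `z1`:

```python
import numpy as np

m = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

first_two_cols = m[:, 0:2]  # all rows, columns 0 and 1 (2 is excluded)
last_row = m[1, :]          # row 1, all columns
single = m[0, 2]            # one element: row 0, column 2

print(first_two_cols)
print(last_row)
print(single)
```

Using a single index (as in `m[1, :]`) rather than a slice drops that dimension, so `last_row` is a 1D array.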

Note that for 2D arrays,  such as `z1` and `z2`,  arithmetic operators `+`, `-`, `*`, `/` and so on work on an element by element basis.   Thus,  for example:

In [None]:
v1 = np.array([[2.0, 3.1, -2.0],[1.0,-1.0,6.3],[2.0,-1.0,3.5]])
v2 = np.array([[-7.3, 6.1, -2.2],[-1.3,1.4,1.7],[-3.0,2.0,-1.0]])
print(v1 * v2)

Thus, the `[0,0]` element of the result above is `v1[0,0]` times `v2[0,0]` - this is different from matrix multiplication. `numpy` also offers a data type called `matrix` for which multiplication is actually matrix multiplication:

In [None]:
m1 = np.matrix([[2.0, 3.1, -2.0],[1.0,-1.0,6.3],[2.0,-1.0,3.5]])
m2 = np.matrix([[-7.3, 6.1, -2.2],[-1.3,1.4,1.7],[-3.0,2.0,-1.0]])
print(m1 * m2)
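The `numpy` documentation no longer recommends the `matrix` class for new code. A minimal alternative sketch, reusing the values of `v1` and `v2` from above, keeps ordinary arrays and uses the `@` operator for matrix multiplication:

```python
import numpy as np

v1 = np.array([[2.0, 3.1, -2.0], [1.0, -1.0, 6.3], [2.0, -1.0, 3.5]])
v2 = np.array([[-7.3, 6.1, -2.2], [-1.3, 1.4, 1.7], [-3.0, 2.0, -1.0]])

# Matrix product of two ordinary arrays
prod = v1 @ v2
print(prod)

# Element [0,0] is row 0 of v1 combined with column 0 of v2:
# 2.0*(-7.3) + 3.1*(-1.3) + (-2.0)*(-3.0), which is about -12.63
print(prod[0, 0])
```

This way `*` always means element by element and `@` always means matrix multiplication, whichever arrays are involved.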

## Data via `pandas`

Another helpful Python package is `pandas`.  This provides a new data type which is very similar to an R **data frame**. This package is used for a number of data manipulation tasks - one of the most practical is its ability to read in and manipulate data stored in `csv` files.  For this part of the practical,  you will need some data relating to house prices.  Currently this file, called 'hpdemo.csv', is included in the repository's `data/` folder. If you open this with a text editor, the first few lines look like this:

```
ID,price,year,type,xcoord,ycoord
1,420000,2023,T,533096.228451334871352,168133.980206388747320
2,491000,2023,T,534812.198977316147648,168261.043896099610720
3,370500,2023,T,527355.000154171139002,168640.001432864984963
4,465000,2023,S,525864.000856507453136,164316.002183664124459
5,750000,2023,S,523119.409212186641525,166051.474063268280588
```
Respectively,  the columns are:

| Variable | Description |
| :- | :- |
| ID | An ID number for each sold house |
| price | The sale price of the house |
| year | The year the house was sold (2023) |
| type | The type of building: "D" = Detached, "S" = Semi-Detached, "T" = Terraced, "F" = Flats, "O" = Other |
| xcoord | The x-coordinate (easting) of the house in EPSG 7405 (British National Grid) |
| ycoord | The y-coordinate (northing) of the house in EPSG 7405 (British National Grid) |

The `pandas` package has a useful function for reading `csv` files. The path below assumes your notebook sits in a folder alongside the repository's `data/` folder; adjust it if your copy of the house price file is somewhere else, and then enter:

In [None]:
import pandas as pd
hp = pd.read_csv('../data/hpdemo.csv')
print(hp)

You can access the contents of the columns in the same way as you access items in dictionaries:

In [None]:
print(hp['price'])

This lists all of the items in the column `price`.  Sometimes rescaling is useful - particularly to obtain 'neater' axes.

In [None]:
print(hp['price']/1000)
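If you want to experiment without the file to hand, the sketch below types the five sample rows shown earlier into a small data frame (leaving out the coordinate columns). The name `sample` is just for illustration; it is not part of the practical's data.

```python
import pandas as pd

# The five sample rows shown above, typed in by hand (coordinates omitted)
sample = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'price': [420000, 491000, 370500, 465000, 750000],
    'year': [2023, 2023, 2023, 2023, 2023],
    'type': ['T', 'T', 'T', 'S', 'S'],
})

print(sample.head(3))    # first three rows
print(sample.shape)      # (rows, columns), a tuple as with numpy arrays
mean_price = sample['price'].mean()
print(mean_price / 1000) # mean of these five prices, in thousands of pounds
```

The same calls (`head`, `shape`, and column methods such as `mean`) should work on `hp` once the full file is loaded.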

## Basic Statistical Graphics

Having read in this data, we can start to analyse and visualise it in Python. Firstly, to visualise it the `pyplot` package (a sub-package of `matplotlib`) can be used - this produces Python plots that look quite like ones that MATLAB produces. You won't have seen sub-packages yet, but it really just means a package called `matplotlib.pyplot` that can be imported without the rest of `matplotlib`.

In [None]:
import matplotlib.pyplot as pl
# Try to draw a histogram
pl.hist(hp['price'])
pl.show()

You will see that at the `pl.hist` stage no graphic appears. `pyplot` works by building up the layers in an image and then, when everything is in place, using the `show` function to actually create the plot.

This would perhaps look better if the prices were in thousands of pounds and some titles and labels were provided.  A nicer result is obtained using:

In [None]:
# Try to draw a neater histogram
pl.close()
pl.hist(hp['price']/1000)
pl.xlabel('Price (1000s Pounds)')
pl.ylabel('Frequency')
pl.title('2023 London House Prices')
pl.show()

Note the way the graphic is 'built up' by putting the data,  and then various text labels on step by step.  Finally, the `show` command is added,  to draw the completed graphic. The initial `pl.close()` function closes the previous plot (otherwise the new plot would be overplotted on the original axes).

You can use some of the `numpy` operations on data frames - for example it might be useful to see the distribution of log house prices - this often makes the distribution look more like a normal distribution.

In [None]:
# histogram of log house price
pl.close()
pl.hist(np.log10(hp['price']))
pl.xlabel('Log10 of Price (Pounds)')
pl.ylabel('Frequency')
pl.title('Logged 2023 house prices')
pl.show()

Here the function `log10` is used - taking logs to the base 10 of the prices - so that $\log_{10}(10,000) = 4$, $\log_{10}(100,000) = 5$ and $\log_{10}(1,000,000) = 6$.
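These values can be checked directly. The short sketch below uses a made-up array `prices` (not the house price data) and also shows that rescaling to thousands of pounds before taking logs simply shifts every value down by 3:

```python
import numpy as np

# Made-up round-number prices in pounds
prices = np.array([10_000, 100_000, 1_000_000])

logs = np.log10(prices)
print(logs)  # approximately 4, 5 and 6

# Logging prices in thousands of pounds shifts each value by log10(1000) = 3
logs_thousands = np.log10(prices / 1000)
print(logs - logs_thousands)
```

So the shape of a log-price histogram is the same whichever unit is used; only the axis labels change.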

It might also be useful to look at the breakdown in price by property type. At a very basic level, we can view the first several (in this case, 3) rows of the data for the two relevant columns:

In [None]:
print(hp[['type','price']][:3])

To find the average price within each group (i.e., category) in the data, we can use the `groupby` method in `pandas`, here drawing the result as a bar chart:

In [None]:
hp.groupby('type')['price'].mean().plot.bar()
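To see the numbers behind such a chart, here is a minimal sketch of `groupby` on a hand-typed data frame holding only the five sample rows shown earlier. The name `sample` is illustrative, and the means are for those five rows only, not for the full data set.

```python
import pandas as pd

# Types and prices of the five sample rows, typed in by hand
sample = pd.DataFrame({
    'type': ['T', 'T', 'T', 'S', 'S'],
    'price': [420000, 491000, 370500, 465000, 750000],
})

# Mean price within each property type; the result is indexed by type
means = sample.groupby('type')['price'].mean()
print(means)
print(means['S'])  # mean of the two semi-detached prices
```

Other summaries, such as `median()` or `count()`, can be used in place of `mean()` in the same way.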

It is also possible to create scatter plots between numeric variables using `scatter`. To do this, we're going to first calculate a new variable called `dist` which will be the Euclidean distance (in metres) between each house and the "[Kilometre Zero](https://en.wikipedia.org/wiki/Kilometre_zero#Great_Britain)" of London, which is a central point near Trafalgar Square ((530034.8622179187, 180378.69960442273) in the British National Grid coordinate system).

The formula to calculate Euclidean distance (which works in this coordinate system) is found by simply solving for the hypotenuse in the Pythagorean theorem, i.e., $\sqrt {(x_2-x_1)^2 + (y_2-y_1)^2 } $:

In [None]:
hp['dist'] = np.sqrt((hp.xcoord-530034.8622179187)**2 + (hp.ycoord-180378.69960442273)**2)
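As a check on the formula, the sketch below applies it to two made-up points at known offsets from the centre (a 3-4-5 triangle scaled to metres, and the centre itself). It also shows `np.hypot`, a `numpy` function that computes the same hypotenuse in one call; the variable names are illustrative.

```python
import numpy as np

centre_x, centre_y = 530034.8622179187, 180378.69960442273

# Made-up points: one 3000 m east and 4000 m north of the centre, one at the centre
x = np.array([centre_x + 3000.0, centre_x])
y = np.array([centre_y + 4000.0, centre_y])

dist_sqrt = np.sqrt((x - centre_x)**2 + (y - centre_y)**2)
dist_hypot = np.hypot(x - centre_x, y - centre_y)

print(dist_sqrt)   # approximately 5000 and 0
print(dist_hypot)  # the same values
```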

To create the scatterplot:

In [None]:
pl.close()
pl.scatter(hp['dist'],hp['price']/1000,color='r',marker='.')
pl.xlabel('Distance from London Centre (metres)')
pl.ylabel('Price (1000s Pounds)')
pl.title('House Price v. Distance (2023 London)')
pl.show()

Interestingly, although there is some linkage, the relationship is not excessively strong. Note that the dot density is quite high in the lower left hand corner, so it is hard to see relative numbers of points. One way of addressing this is to use semi-transparent points. In the `scatter` function, the colour was specified as `r` (for red), but it is also possible to denote colour as a 4-tuple of red, green, blue and transparency (often called *alpha*) levels. All four levels range from 0 to 1; for transparency, 0 is completely transparent and 1 is not transparent at all.

In [None]:
pl.close()
pl.scatter(hp['dist'],hp['price']/1000,color=(1,0,0,0.1),marker='.')
pl.xlabel('Distance from London Centre (metres)')
pl.ylabel('Price (1000s Pounds)')
pl.title('House Price v. Distance (2023 London)')
pl.grid(True)
pl.show()

Note a grid was also added here.  The log transform is also interesting here.  Suppose we take the log of  the price and plot the result.

In [None]:
pl.close()
pl.scatter(hp['dist'],np.log10(hp['price']),color='r',marker='.')
pl.xlabel('Distance from London Centre (metres)')
pl.ylabel('Log10 of Price (Pounds)')
pl.title('Log of House Price v. Distance (2023 London)')
pl.grid(True)
pl.show()

Here it is worth noting that the distribution is elliptical - suggesting a correlation,  but also a possible bivariate normal shape.

## Some Statistical Analysis

Having plotted this data, some statistical analysis may be appropriate. A further package, called `scipy` (Scientific Python) can be useful here. It is probably useful to fit a regression line to the price and distance data plotted above. In `scipy` there is a sub-package called `stats`. This has a function called `linregress`:

In [None]:
import scipy.stats as st
print(st.linregress(hp['dist'],hp['price']))

The five numbers returned by this function as a 5-tuple are as follows:

1. The slope of the regression line
2. The intercept of the regression line
3. The correlation coefficient ($r$) between the two variables
4. The $p$-value of a significance test for the null hypothesis that the slope is zero
5. The standard error of the estimated slope

As with the dimensions of arrays with `shape` earlier,  you can assign all of these to five variables in a single statement:

In [None]:
# Here are the 5 variables being assigned
a,b,r,pval,sderr = st.linregress(hp['dist'],hp['price']/1000)
# Print some out neatly...
print("a = %6.3f (slope)" % a)
print("b = %6.3f (intercept)" % b)
print("r = %6.3f (correlation)" % r)
print("p = %6.3f (p-value)" % pval)

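This just fitted a model of the form $price = b + a \times dist$, where $a$ is the slope and $b$ is the intercept (matching the order in which the variables were assigned above), with price in thousands of pounds and distance in metres. A negative slope $a$ and correlation $r$ indicate a negative relation between price and distance from the centre; the size of $r$ and the $p$-value show how strong that relation is.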
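To confirm which returned value is which, here is a small sketch using made-up points that lie exactly on the line $y = 1 + 2x$ (not the house price data). It also shows that the result can be read by name, which avoids mixing up slope and intercept.

```python
import numpy as np
import scipy.stats as st

# Made-up points lying exactly on y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

fit = st.linregress(x, y)

# Access by name: slope should be about 2, intercept about 1, r about 1
print(fit.slope, fit.intercept, fit.rvalue)

# Unpacking still works, in the order slope, intercept, r, p-value, standard error
slope, intercept, r, pval, stderr = fit
print(slope, intercept)
```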

You can also add the fitted line to the scatter plot. The `plot` command in `pyplot` can do this. As before, the plot is 'built up' and the line is added after the scatter points and labels. `plot` is a general line drawing function which will join together a set of $(x,y)$ points, but here the $x$-values are a set of regular points on the line, and the $y$-values are the corresponding fitted regression values. The function `np.linspace(lowest,highest,n)` creates a set of `n` linearly-spaced values from `lowest` to `highest`, including both end points. In the `plot` function, the colour `'b'` means blue. Also, the `lw` parameter controls the line width.

In [None]:
pl.close()
pl.scatter(hp['dist'],hp['price']/1000,color=(1,0,0,0.1),marker='.')
pl.xlabel('Distance from London Centre (metres)')
pl.ylabel('Price (1000s Pounds)')
pl.title('House Price v. Distance (2023 London)')
pl.grid(True)
x_reg = np.linspace(0,30000,100)
y_reg = b + a * x_reg
pl.plot(x_reg,y_reg,color='b',lw=2.0)
pl.show()
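To see what `np.linspace` and the fitted-line calculation produce, here is a minimal sketch with only four points. The `slope` and `intercept` values are hypothetical placeholders chosen for easy arithmetic, not values fitted to the house price data.

```python
import numpy as np

# Four evenly spaced x-values from 0 to 30000, both ends included
x_reg = np.linspace(0, 30000, 4)
print(x_reg)  # 0, 10000, 20000, 30000

# Hypothetical regression coefficients (NOT fitted to the real data)
slope, intercept = -0.01, 600.0

# Corresponding points on the line: intercept + slope * x
y_reg = intercept + slope * x_reg
print(y_reg)  # approximately 600, 500, 400, 300
```

Because a straight line is fully determined by two points, even a small `n` draws the same line; more points only matter for curves.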

## Self-Test Exercises

If you have gotten this far, you will have a reasonable grasp of the key ideas of programming in Python.  To finish,  here are a few exercises that you can use to practice your programming skills for next week.  Also,  to learn more Python ideas, try visiting http://en.wikibooks.org/wiki/Python_Programming.

## Exercise 1
**Without external assistance**, write a Python function to add up all of the numbers from `1` to `n`.   

## Exercise 2
Euclid's algorithm to find the greatest common divisor (GCD) of two integers is one of the oldest documented algorithms in the world,  dating back to around 300BC. The GCD is the largest number that divides exactly into the two numbers supplied - so for example the GCD of 18 and 15 is 3. The algorithm can be described as follows:


1. Take two numbers, `a` and `b` - the aim is to find their GCD
2. While `b` is greater than zero,  repeat the following steps:
    1. Replace `b` with the remainder when `a` is divided by `b`
    2. Replace `a` with the old value of `b`
3. When `b` is zero, `a` is the GCD

**Without external assistance**, write a Python function, `gcd(a,b)` that returns the GCD of `a` and `b`.