# INFO 204 Lab 1 - Getting Started with Python & Jupyter

INFO 204 will make extensive use of Python for examples and labwork. You are not expected to become proficient in Python (indeed, most of your work will involve following a fairly set template), but you should have a working grasp of the basics (declaring variables and assigning values, looping, branching, working with lists). We will also be working with the following libraries:
* numpy (for manipulating numbers)
* pandas (for loading and manipulating tables of data)
* scikit-learn (for applying machine learning algorithms to our data)
* matplotlib (for graphing)

We will give you plenty of examples to work from during labs, but you should also feel free to ask for help/clarification as required!

If you have experience with another programming language (e.g., R, C, Java), then many of the concepts you know from these languages should transfer directly to learning Python. More often than not, the challenge in solving a problem is **understanding the problem**; the particular nuances of the programming language used to solve it can be googled! During INFO 204, if you find yourself struggling to solve a particular problem, first ask yourself this: *what am I trying to do in this particular step?*


## Part A. Review

This lab is for us to get familiar with our lab environment and go over some basic Python/NumPy skills.

We are using Jupyter Notebook, an interactive environment based on IPython - a high-performance command shell for Python. If you want to install Jupyter on your own laptop, choose the **Python 3.7** version (or later) from [Anaconda](https://www.anaconda.com/distribution/). Otherwise you can always upload your notebook to [Google Colab](https://research.google.com/colaboratory/) and work from there.

To proceed, read through and run the code cells along the way (if not otherwise instructed). 

If you need to put in some comment cells, use the plus sign - TEXT button above. A locally run Jupyter notebook has more formats to play with, e.g., "Raw NBConvert" or "Markdown", if you want more fun :).

Complete the scripts (for "TO-DO" tasks) and verify their outcome. Submit the completed notebook before the ***<span style="color: red">deadline posted on Blackboard</span>***.



Some useful tips for you to use the Jupyter system on Colab:

*   Double-click into a TEXT cell and you can edit it (but don't change the instruction texts!).
*   Clicking into a CODE cell will allow you to run it (by clicking on the PLAY button on the left, or pressing CTRL-Enter), or to edit it (just type away).
*   Unlike a conventional piece of code that runs from the beginning to the end, a Jupyter notebook can be played with more flexibility - you can click into cells in any arbitrary order, change them and run them. Of course, this will lead to errors if the order doesn't make sense!

Okay, buckle your safety belt, and let's drive on ;-)

In [1]:
# first, checking the Python version... run the cell to ensure that we're running a Python 3.X variant.
import sys
sys.version

'3.8.2 (default, Apr  8 2021, 23:19:18) \n[Clang 12.0.5 (clang-1205.0.22.9)]'

### Basics

In [2]:
# universal greetings from all emerging programmers ;-)
print("Hello World!")

Hello World!


Variables in Python are assigned a value with the '=' operator (just like C and Java):

In [3]:
a = 3
b = 4

Arithmetic in Python is more or less the same as most other programming languages. For example: 

In [4]:
print(a + b)
print(b - a)
print(a / b)
print((a + 1) / b)
print(b * a)

7
1
0.75
1.0
12
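
As a small aside on the division results above: in Python 3 the `/` operator always performs true division and returns a float, even when both operands are integers. The sketch below (the variable names `true_div`, `floor_div` and `remainder` are purely illustrative) contrasts it with floor division `//` and the remainder operator `%`:

```python
a = 3
b = 4

true_div = a / b      # true division: always produces a float
floor_div = a // b    # floor division: rounds down to a whole number
remainder = a % b     # remainder left over after floor division

print(true_div, floor_div, remainder)   # 0.75 0 3
print(type(8 / 4))                      # <class 'float'>, even though 8 divides evenly by 4
```

If you are coming from C or Java, this is probably the one arithmetic difference worth remembering.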


Raising a number $a$ to the $n$-th power (i.e., $a^{n}$) is done in Python with the '\*\*' operator:

In [5]:
# Assign a to 11, 111, 1111, ... each time click back to this cell, make change, and run; repeat
a = 1
print(a**2)

1


In the preceding cell, 'a\*\*2' gives $a$ to the power of 2: $a^2$. Change the value of 'a' and re-run the cell a few times. Hopefully, you'll find this a tedious task! We can use a *for* loop to save some time. Run the following code and see how the variable 'a' is updated:

In [6]:
a = 1
for i in range(9):
    print(a)
    a = a*10 + 1

1
11
111
1111
11111
111111
1111111
11111111
111111111
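
The loop above runs nine times because `range(9)` produces the integers 0 through 8. A minimal sketch of how `range` behaves (the names `first_nine` and `odds` are just for illustration):

```python
# range(stop) counts from 0 up to, but not including, stop
first_nine = list(range(9))
print(first_nine)            # [0, 1, 2, 3, 4, 5, 6, 7, 8]

# range(start, stop, step) lets you choose the start and the step size
odds = list(range(1, 10, 2))
print(odds)                  # [1, 3, 5, 7, 9]
```

Wrapping `range(...)` in `list(...)` is only needed here to display the values; a `for` loop can iterate over the range directly.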


**<span style="color: red">TO-DO</span>**: complete the code below to get all $a^2$ printed for a=1,11,111,.... 

In [7]:
a = 1
for i in range(9):
    print(a**2)
    a = a*10 + 1 # update a value

1
121
12321
1234321
123454321
12345654321
1234567654321
123456787654321
12345678987654321


To automate things, it is often a good idea to wrap some code into a function for reuse. 

Here's an example - a function that calculates the hypotenuse (the side opposite the right angle) of a right-angled triangle, given the lengths of its other two sides. Suppose the two sides are $a$ and $b$; according to Pythagoras' theorem, the hypotenuse is $c=\sqrt{a^2+b^2}$. We import the "math" package for the purpose of using the function sqrt(), and then define a function. Run the following cell - it will produce '5.0' (of course):

In [4]:
import math
def pythagoras(a, b):
    return math.sqrt(a**2 + b**2)

print(pythagoras(3,4)) # An example. 

5.0


**<span style="color: red">TO-DO</span>**: modify the function definition above so that it checks whether either a or b is negative; if either argument is negative, print an error message and return None. Test it using (-3, 4):

In [5]:
# Modified version - with a careful check on parameter values
def pythagoras(a, b):
    if a < 0 or b < 0:
        print("Error: parameter cannot be negative!")
        return None
    return math.sqrt(a**2 + b**2)

print(pythagoras(-3,4))
print(pythagoras(3, 5))

Error: parameter cannot be negative!
None
5.830951894845301
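
Printing a message and returning None is what this exercise asks for. As an alternative sketch (not something the lab requires), a function can also signal bad input by raising an exception. The function name `checked_hypotenuse` below is hypothetical; `math.hypot` is the standard-library function that computes $\sqrt{a^2+b^2}$:

```python
import math

def checked_hypotenuse(a, b):
    """Return the hypotenuse, raising ValueError for negative side lengths."""
    if a < 0 or b < 0:
        raise ValueError("side lengths cannot be negative")
    return math.hypot(a, b)  # equivalent to sqrt(a**2 + b**2)

print(checked_hypotenuse(3, 4))   # 5.0

# the caller decides how to handle the error
try:
    checked_hypotenuse(-3, 4)
except ValueError as err:
    print("Error:", err)
```

The trade-off is that an exception cannot be silently ignored, whereas a returned None may slip through into later calculations.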


### Lists

Revisiting the $111...1^2$ task - we can use a "for ... in ..." loop to iterate through a list and generate the outcome:

In [10]:
# iterate through a list
alist=[1, 11, 111, 1111, 11111]
for item in alist:
    print(item**2)

1
121
12321
1234321
123454321


The list is a very useful data structure in Python. As in C/Java, the indices of an N-element list start at 0 and end at $N-1$. The $N-1$ index has a handy shorthand: -1. Similarly, a -2 index is equivalent to $N-2$... Next time you see an expression accessing a list or array using a negative index, you won't be surprised ;-)

In [11]:
print('Length of the list:', len(alist))
print('First entry:', alist[0])
print('Third entry:', alist[2])
print('Last entry:', alist[-1])
print('Second-last entry:', alist[-2])

Length of the list: 5
First entry: 1
Third entry: 111
Last entry: 11111
Second-last entry: 1111
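
To make the negative-index shorthand concrete, here is a small sketch that re-creates `alist` and checks the equivalence directly (the variable `n` is just a helper for the list length):

```python
alist = [1, 11, 111, 1111, 11111]
n = len(alist)

# a negative index -k refers to the same element as index n - k
print(alist[-1] == alist[n - 1])   # True
print(alist[-2] == alist[n - 2])   # True

# indexing past either end raises an IndexError
try:
    alist[n]
except IndexError as err:
    print("IndexError:", err)
```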


It's often useful to extract a range of elements from a list. As in R's style of list indexing, you can use the ':' operator to define a range of indices to extract from the list. Note that the start index is included but the stop index is not:

In [12]:
print('First two elements of the list:', alist[0:2])
print('Third and fourth elements of list:', alist[2:4])
print('Everything but the last element of list:', alist[:-1])
print('Everything but the last two elements of list:', alist[:-2])


First two elements of the list: [1, 11]
Third and fourth elements of list: [111, 1111]
Everything but the last element of list: [1, 11, 111, 1111]
Everything but the last two elements of list: [1, 11, 111]
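
A few further slicing forms are worth knowing. The sketch below re-creates `alist` and uses illustrative variable names; it shows an omitted stop index, a step value, and a negative step:

```python
alist = [1, 11, 111, 1111, 11111]

from_third = alist[2:]        # omit the stop index to run to the end
every_other = alist[::2]      # a third value sets the step
reversed_copy = alist[::-1]   # a step of -1 walks the list backwards

print(from_third)      # [111, 1111, 11111]
print(every_other)     # [1, 111, 11111]
print(reversed_copy)   # [11111, 1111, 111, 11, 1]
print(alist)           # slicing returns a new list; the original is unchanged
```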


An *often-used* trick is to use an empty list to collect data or statistics progressively. The following snippet imports the 'random' package and generates 10 random numbers between 0 and 1.0. Note we use .append() to add new items onto the *end* of the list:

In [13]:
# import the 'random' package 
import random
rlst=[] # create an empty list
for i in range(10):
    rlst.append(random.random()) # add a random number between zero and one to the end of the list
print(rlst)

[0.46309679182886165, 0.21210409342614878, 0.22815764698397578, 0.6074721959079757, 0.23308499235687052, 0.8290668760588986, 0.9732335114069225, 0.9678043676276739, 0.08999625345364137, 0.8620217811255416]


Python also provides an alternative way to construct lists using *[list comprehension](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions)*:

In [14]:
rlist = [ random.random() for i in range(9) ]
print(rlist)

[0.9066424440751952, 0.4057200468151294, 0.09004230933172752, 0.9383313068970119, 0.7080572919395901, 0.24376324022037943, 0.33621359558302, 0.8125797479126562, 0.573172248763454]


We 100% do not care which of these you use in INFO 204 (it's the result that counts!), so use what you're most comfortable with. (Like most things involving programming languages, you'll probably end up with using a mixture of both approaches where most convenient)
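
To compare the two styles side by side, here is a sketch that computes the same squares both ways, using the `alist` values from earlier (the other variable names are illustrative):

```python
alist = [1, 11, 111, 1111, 11111]

# loop-and-append version
squares_loop = []
for item in alist:
    squares_loop.append(item**2)

# list comprehension version
squares_comp = [item**2 for item in alist]

print(squares_loop == squares_comp)   # True

# a comprehension can also filter with an optional "if" clause
big_squares = [item**2 for item in alist if item > 100]
print(big_squares)   # [12321, 1234321, 123454321]
```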

**<span style="color: red">TO-DO</span>**: generate a list that includes the outcome of 100 dice rolls. (Tips: you can also use the [random.randint(.)](https://docs.python.org/3/library/random.html) function). 

In [9]:
import random
lst = []
for i in range(0, 100):
    lst.append(random.randint(1, 6))
print(lst)

[6, 4, 6, 3, 3, 5, 6, 2, 4, 1, 6, 4, 1, 1, 2, 4, 4, 2, 5, 2, 2, 4, 2, 2, 3, 6, 5, 6, 2, 4, 1, 1, 6, 5, 6, 6, 1, 6, 1, 6, 1, 3, 4, 4, 1, 3, 2, 3, 6, 4, 4, 5, 5, 6, 2, 6, 5, 2, 2, 2, 3, 4, 2, 5, 3, 2, 5, 6, 6, 6, 3, 1, 4, 6, 3, 2, 1, 5, 6, 5, 1, 6, 5, 2, 6, 3, 5, 3, 3, 5, 6, 1, 2, 2, 2, 1, 4, 3, 1, 2]
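
Once you have a list of rolls, a natural sanity check is to count how often each face appears. One possible sketch uses `collections.Counter` from the standard library; the seed value 204 is arbitrary and only there so that repeated runs give the same rolls:

```python
import random
from collections import Counter

random.seed(204)  # arbitrary seed, chosen only for reproducibility
rolls = [random.randint(1, 6) for _ in range(100)]

counts = Counter(rolls)   # maps each face to the number of times it was rolled
for face in range(1, 7):
    print(face, counts[face])
```

With a fair die, each face should appear roughly 100/6 (about 17) times, though individual runs will vary.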


**<span style="color: red">TO-DO</span>**: there's an even easier way to generate random numbers. Use the [integers()](https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.integers.html) function from the numpy library, which includes a size parameter to indicate how many samples you would like. Note that, unlike random.randint(), the `high` value is excluded by default, so `high = 7` yields values from 1 to 6:

In [6]:
import numpy as np
rng = np.random.default_rng()
lst = rng.integers(low = 1, high = 7, size = 100)
print(lst)
print(len(lst))

[1 3 2 3 2 4 6 1 5 4 5 1 1 1 5 4 2 4 2 5 4 3 3 2 5 5 4 3 2 3 6 3 3 1 5 5 3
 6 1 2 6 2 4 6 6 1 6 6 6 6 1 6 1 2 1 1 2 1 2 5 5 1 6 1 2 3 2 5 3 3 2 1 6 5
 2 2 3 1 2 6 6 4 6 6 6 5 3 1 1 2 6 1 6 2 3 5 6 2 3 6]
100


### Strings

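Strings behave much like lists of characters: they support the same indexing and slicing, although (unlike lists) they cannot be modified in place. String manipulation in Python is flexible and easy.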

In [None]:
# treated as a list
hello='hello world'
print(hello[1])

e


In [None]:
# split by a specified separator
tokens = hello.split(' ')
print(tokens)

['hello', 'world']


In [None]:
# query / search etc.
print(hello.startswith('he'))

True


In [None]:
print(hello.find('world'))

6
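
A few more string operations, sketched on the same `hello` string (the name `rejoined` is illustrative). The last part shows one way strings differ from lists: they are immutable, so assigning to an index raises a TypeError.

```python
hello = 'hello world'

print(hello[0:5])       # hello
print(hello.upper())    # HELLO WORLD

# join() is the inverse of split(): it glues a list of strings together
tokens = hello.split(' ')
rejoined = '-'.join(tokens)
print(rejoined)         # hello-world

# strings cannot be changed in place
try:
    hello[0] = 'H'
except TypeError as err:
    print("TypeError:", err)
```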


**<span style="color: red">TO-DO</span>**: find and count all the words that start with 'dream' in the following "lyrics" string (tips: use split() and a for loop for matching): 

In [7]:
# the lyrics string
lyrics = 'I dreamed a dream in times gone by \
When hope was high and life worth living \
I dreamed, that love would never die \
I dreamed that God would be forgiving \
Then I was young and unafraid \
And dreams were made and used and wasted \
There was no ransom to be paid \
No song unsung, no wine untasted'

### perform your task here
keyWord = 'dream'
counter = 0
words = lyrics.split(" ")
print(words)
for word in words:
    if word.startswith(keyWord):
        counter += 1
print(counter)


['I', 'dreamed', 'a', 'dream', 'in', 'times', 'gone', 'by', 'When', 'hope', 'was', 'high', 'and', 'life', 'worth', 'living', 'I', 'dreamed,', 'that', 'love', 'would', 'never', 'die', 'I', 'dreamed', 'that', 'God', 'would', 'be', 'forgiving', 'Then', 'I', 'was', 'young', 'and', 'unafraid', 'And', 'dreams', 'were', 'made', 'and', 'used', 'and', 'wasted', 'There', 'was', 'no', 'ransom', 'to', 'be', 'paid', 'No', 'song', 'unsung,', 'no', 'wine', 'untasted']
5
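
Splitting on spaces leaves punctuation attached to words (for example 'dreamed,') and the match is case-sensitive. That does not matter for this task, but if it did, one possible sketch is to strip punctuation and lower-case each word first. The `sample` sentence below is made up for illustration and is not part of the lyrics:

```python
import string

# hypothetical sample text, not taken from the lab's lyrics
sample = 'I dreamed a dream. Dreams, they say, are made of dreamy things'

matches = []
for word in sample.split():
    # remove leading/trailing punctuation and ignore case before matching
    cleaned = word.strip(string.punctuation).lower()
    if cleaned.startswith('dream'):
        matches.append(cleaned)

print(matches)        # ['dreamed', 'dream', 'dreams', 'dreamy']
print(len(matches))   # 4
```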


## Part B. Exercises

### Birthday problem
Have you ever been to a party and met a person with exactly the same birthday as yours? How likely is this to happen? Let's find out using a bit of Python...

We consider the opposite situation, i.e., everyone at the party has a unique birthday. For simplicity we assume that every day in the year is equally likely to be a birthday, i.e., the distribution of birthdays is uniform throughout the year.

Suppose we have four people in the room. Person 1 can have any of the 365 days as her birthday (for the sake of simplicity, let's ignore leap years). Denote her chance as $p_1=\frac{365}{365}$. For person 2 to avoid the day taken by person 1, her birthday must fall on one of the remaining 364 days. Denote her chance as $p_2=\frac{364}{365}$. And so on for persons 3 and 4.

So the chance of everybody having a unique birthday is 
$$P=\frac{365}{365}\times \frac{364}{365}\times \frac{363}{365}\times \frac{362}{365}\approx 0.984.$$
This means that at a four-person party, the chance of at least one birthday clash is about $1-0.984=0.016$, i.e., only about 1.6 percent.
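
As a quick check of the arithmetic, the four-person product can be evaluated directly in Python. This is only a worked instance of the formula above, with illustrative variable names:

```python
# probability that four people all have different birthdays
p_unique = (365 / 365) * (364 / 365) * (363 / 365) * (362 / 365)

# probability that at least two of them share a birthday
p_clash = 1 - p_unique

print(round(p_unique, 4))   # 0.9836
print(round(p_clash, 4))    # 0.0164
```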

**<span style="color: red">TO-DO</span>**: Write a function that, given $n$ people at a party, calculates the probability that at least two of them share the same birthday. Test it on the cases of 4 and 23. (Tips: use a "for" loop to implement the calculation outlined above.)

In [13]:
# your code here
def birthday_clash_prob(n):
    p = 1  # probability that all n birthdays are unique
    for i in range(0, n):
        p = p * (365 - i)/365
    return 1 - p  # probability of at least one clash


# compute the probabilities of birthday clashes from n=4 to n=23
probs = []
for i in range(4, 23 + 1):
    probs.append(birthday_clash_prob(i))
print(probs)

[0.016355912466550326, 0.02713557369979369, 0.04046248364911165, 0.05623570309597559, 0.07433529235166925, 0.09462383388916695, 0.1169481777110779, 0.14114137832173335, 0.1670247888380647, 0.19441027523242982, 0.22310251200497344, 0.2529013197636867, 0.28360400525285023, 0.3150076652965609, 0.3469114178717895, 0.37911852603153684, 0.41143838358058016, 0.4436883351652059, 0.4756953076625502, 0.5072972343239855]
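
The last value in the output above is the first to exceed 0.5, which suggests that 23 people are enough for a clash to be more likely than not. A minimal sketch that searches for this threshold, assuming the same 365-day uniform model (the function name `clash_probability` is illustrative and simply repeats the calculation so the block runs on its own):

```python
def clash_probability(n):
    """Probability that at least two of n people share a birthday (365-day year)."""
    p_unique = 1.0
    for i in range(n):
        p_unique *= (365 - i) / 365
    return 1 - p_unique

# increase n until a clash becomes more likely than not
n = 1
while clash_probability(n) <= 0.5:
    n += 1

print(n)                                # 23
print(round(clash_probability(n), 4))   # 0.5073
```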


######  End of Lab 1.

* *Congratulations on completing your (maybe first) Python notebook!* Remember to **rename your file, download it (.ipynb), and submit** it through Blackboard.

Not in a hurry? Take a look at the [Formatting colab guide](https://colab.research.google.com/notebooks/markdown_guide.ipynb) and make good use of the tips there to beautify your notebook submission.