## Recording of this lecture

A recording covering (most of) this Python content:

::: {.content-hidden unless-format="revealjs"}


{{< video https://youtu.be/kLpXVzpDGps width="1068px" height="600px" >}}




:::
::: {.content-visible unless-format="revealjs"}


{{< video https://youtu.be/kLpXVzpDGps >}}




:::

# Data Science & Python {visibility="uncounted"}

## About Python

::: columns
::: {.column width="40%"}
![Free book [Automate the Boring Stuff with Python](https://automatetheboringstuff.com/)](automate-the-boring-stuff-with-python.jpeg)
:::

::: {.column width="60%"}
It is a _general-purpose_ language.

Python powers:

- Instagram
- Spotify
- Netflix
- Uber
- Reddit...

Python is on Mars.

:::
:::

::: footer
Sources: [Blog post](https://learn.onemonth.com/10-famous-websites-built-using-python/) and [GitHub](https://docs.github.com/en/account-and-profile/setting-up-and-managing-your-github-profile/customizing-your-profile/personalizing-your-profile#list-of-qualifying-repositories-for-mars-2020-helicopter-contributor-badge).
:::


## Stack Overflow [2021 Dev. Survey](https://insights.stackoverflow.com/survey/2021)

::: columns
::: {.column width="55%"}

- Python is [3rd most popular language](https://insights.stackoverflow.com/survey/2021#section-most-popular-technologies-programming-scripting-and-markup-languages)
- Python is the [most wanted language](https://insights.stackoverflow.com/survey/2021#section-most-loved-dreaded-and-wanted-programming-scripting-and-markup-languages)
- In 'Other frameworks and libraries', they note that ["several data science libraries for Python make strong showings"](https://insights.stackoverflow.com/survey/2021#section-most-popular-technologies-other-frameworks-and-libraries).

:::
::: {.column width="38%"}

![Popular languages.](so-popular-languages-all.png)

:::
:::

## GitHub's [2021 State of the Octoverse](https://octoverse.github.com/#top-languages-over-the-years)

![Top languages over the years](github-programming-languages.svg)


::: footer
Source: Kaggle (2021), [State of Machine Learning and Data Science](https://www.kaggle.com/kaggle-survey-2021).
:::


In [None]:
#| echo: false
#| eval: false
from IPython.display import display, HTML

html = '<div class="r-stack">'

html += '<img src="kaggle/kaggle-0.png" width="1000px">'

for i in range(1, 18):
    if i in [1, 2, 6, 13, 14, 15, 16]:
        continue
    html += f'<img src="kaggle/kaggle-{i}.png" class="fragment" width="1000px">'

html += "</div>"

display(HTML(html))

## Python and machine learning


> ...[T]he entire machine learning and data science industry has been dominated by these two approaches: **deep learning** and **gradient boosted trees**...
> Users of gradient boosted trees tend to use Scikit-learn, XGBoost, or LightGBM. Meanwhile, most practitioners of deep learning use Keras, often in combination with its parent framework TensorFlow.
> The common point of these tools is **they're all Python libraries**: Python is by far the most widely used language for machine learning and data science.

::: footer
Source: François Chollet (2021), _Deep Learning with Python, Second Edition_, Section 1.2.7.
:::

## Python for data science

::: columns
::: {.column width="50%"}
In R you can run:
```r
pchisq(3, 10)
```
:::
::: {.column width="50%"}
In Python it is:
```python
from scipy import stats
stats.chi2(10).cdf(3)
```
:::
:::

![In Leganto](python-data-analysis.jpeg)

## Google Colaboratory

![An example notebook in Google Colaboratory.](google-colab.png)

[http://colab.research.google.com](http://colab.research.google.com)

# Python Data Types {visibility="uncounted"}

## Variables and basic types

::: columns
::: {.column width="50%"}


In [None]:
1 + 2

In [None]:
x = 1
x + 2.0

In [None]:
type(2.0)

In [None]:
type(1), type(x)

:::
::: {.column width="50%"}


In [None]:
does_math_work = 1 + 1 == 2
print(does_math_work)
type(does_math_work)

In [None]:
contradiction = 1 != 1
contradiction

:::
:::

## Shorthand assignments

If we want to add 2 to a variable `x`:

::: columns
::: {.column width="50%"}


In [None]:
x = 1
x = x + 2
x

:::

::: {.column width="50%"}


In [None]:
x = 1
x += 2
x

:::
:::

Same for:

- `x -= 2` : take 2 from the current value of `x`,
- `x *= 2` : double the current value of `x`,
- `x /= 2` : halve the current value of `x`.
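
The same pattern extends to Python's other arithmetic operators. A small sketch (the variable `total` is only for illustration); note that `/=` always produces a float, even when the division is exact:

```python
total = 10
total -= 2   # 8
total *= 2   # 16
total /= 2   # 8.0, true division always gives a float
print(total, type(total))

total //= 3  # floor division: 2.0 (still a float, because total was a float)
total **= 3  # exponentiation: 8.0
total %= 5   # remainder: 3.0
print(total)
```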

## Strings


In [None]:
name = "Patrick"
surname = "Laub"

In [None]:
coffee = "This is Patrick's coffee"
quote = 'And then he said "I need a coffee!"'

In [None]:
name + surname

In [None]:
greeting = f"Hello {name} {surname}"
greeting

In [None]:
"Patrick" in greeting

## `and` & `or`


In [None]:
name = "Patrick"
surname = "Laub"
name.istitle() and surname.istitle()

In [None]:
full_name = "Dr Patrick Laub"
full_name.startswith("Dr ") or full_name.endswith(" PhD")

::: {.callout-important}
The dot is used to denote methods; it can't be used inside a variable name.


In [None]:
#| error: true
i.am.an.unfortunate.R.users = True

:::

## `help` to get more details


In [None]:
help(name.istitle)

## f-strings


In [None]:
print(f"Five squared is {5*5} and five cubed is {5**3}")
print("Five squared is {5*5} and five cubed is {5**3}")

::: {.callout-aside}
Use f-strings and avoid the older alternatives:


In [None]:
print(f"Hello {name} {surname}")
print("Hello " + name + " " + surname)
print("Hello {} {}".format(name, surname))
print("Hello %s %s" % (name, surname))

:::
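
f-strings can also control how a value is formatted, by adding a format specifier after a colon. A short sketch, with `price` and `share` as made-up example values:

```python
price = 1234.5678
share = 0.256

print(f"Price: {price:.2f}")   # two decimal places
print(f"Price: {price:,.2f}")  # with a thousands separator
print(f"Share: {share:.1%}")   # as a percentage
print(f"{price = }")           # Python 3.8+: shows the expression and its value
```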

## Converting types


In [None]:
digit = 3
digit

In [None]:
type(digit)

In [None]:
num = float(digit)
num

In [None]:
type(num)

In [None]:
num_str = str(num)
num_str
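
Conversions also work in the other direction, from strings to numbers, though not every string is accepted. A minimal sketch (the variable names are illustrative):

```python
from_text = int("3")         # 3
from_decimal = float("3.5")  # 3.5
truncated = int(3.9)         # 3: int() drops the fractional part, it does not round
rounded = round(3.9)         # 4

try:
    int("3.5")  # a string with a decimal point is not a valid int
except ValueError as error:
    print(f"Conversion failed: {error}")

whole = int(float("3.5"))  # going via float first works
print(from_text, from_decimal, truncated, rounded, whole)
```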

## Quiz

What is the output of:


In [None]:
#| eval: false
x = 1
y = 1.0
print(f"{x == y} and {type(x) == type(y)}")

::: fragment


In [None]:
#| echo: false
x = 1
y = 1.0
print(f"{x == y} and {type(x) == type(y)}")

:::

::: fragment
What would you add before line 3 to get "True and True"?
:::

::: fragment


In [None]:
x = 1
y = 1.0
x = float(x)  # or y = int(y)
print(f"{x == y} and {type(x) == type(y)}")

:::

# Collections {visibility="uncounted"}

## Lists


In [None]:
desires = ["Coffee", "Cake", "Sleep"]
desires

In [None]:
len(desires)

In [None]:
desires[0]

In [None]:
desires[-1]

In [None]:
desires[2] = "Nap"
desires

## Slicing lists


In [None]:
print([0, 1, 2])
desires

In [None]:
desires[0:2]

In [None]:
desires[0:1]

In [None]:
desires[:2]
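
A few more slicing patterns that come up often, sketched on the same list (redefined here so the block runs on its own):

```python
desires = ["Coffee", "Cake", "Nap"]

print(desires[1:])    # from index 1 to the end
print(desires[-2:])   # the last two items
print(desires[::2])   # every second item
print(desires[::-1])  # a reversed copy
print(desires[:10])   # slices past the end are fine (unlike indexing)
```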

## A common indexing error


In [None]:
#| error: true
desires[1.0]

In [None]:
#| error: true
desires[: len(desires) / 2]

In [None]:
len(desires) / 2, len(desires) // 2

In [None]:
desires[: len(desires) // 2]

## Editing lists


In [None]:
desires = ["Coffee", "Cake", "Sleep"]
desires.append("Gadget")
desires

In [None]:
desires.pop()

In [None]:
desires

In [None]:
desires.sort()
desires

In [None]:
#| error: true
desires[3] = "Croissant"

## `None`


In [None]:
desires = ["Coffee", "Cake", "Sleep", "Gadget"]
sorted_list = desires.sort()
sorted_list

In [None]:
type(sorted_list)

In [None]:
sorted_list is None

In [None]:
bool(sorted_list)

In [None]:
desires = ["Coffee", "Cake", "Sleep", "Gadget"]
sorted_list = sorted(desires)
print(desires)
sorted_list
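
The same split applies to other list operations: methods that change the list in place return `None`, while the built-in functions return a new object. A small sketch using `reverse`, assuming the same `desires` list:

```python
desires = ["Coffee", "Cake", "Sleep", "Gadget"]

# In-place methods change the list and return None.
result = desires.reverse()
print(result)
print(desires)

# Built-ins like sorted() leave the original alone and return a new list.
alphabetical = sorted(desires)
print(alphabetical)

# Compare against None with `is`, not `==`.
if result is None:
    print("Nothing was returned; use the list itself.")
```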

## Tuples ('immutable' lists)


In [None]:
weather = ("Sunny", "Cloudy", "Rainy")
print(type(weather))
print(len(weather))
print(weather[-1])

In [None]:
#| error: true
weather.append("Snowy")

In [None]:
#| error: true
weather[2] = "Snowy"

## One-length tuples


In [None]:
using_brackets_in_math = (2 + 4) * 3
using_brackets_to_simplify = (1 + 1 == 2)

In [None]:
failure_of_a_tuple = ("Snowy")
type(failure_of_a_tuple)

In [None]:
happy_solo_tuple = ("Snowy",)
type(happy_solo_tuple)

In [None]:
cheeky_solo_list = ["Snowy"]
type(cheeky_solo_list)

## Dictionaries


In [None]:
phone_book = {"Patrick": "+61 1234", "Café": "(02) 5678"}
phone_book["Patrick"]

In [None]:
phone_book["Café"] = "+61400 000 000"
phone_book

In [None]:
phone_book.keys()

In [None]:
phone_book.values()

In [None]:
factorial = {0: 1, 1: 1, 2: 2, 3: 6, 4: 24, 5: 120, 6: 720, 7: 5040}
factorial[4]
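
Looking up a key that isn't in the dictionary raises a `KeyError`. A brief sketch of the usual ways around that, where `"Library"` is a made-up missing key:

```python
phone_book = {"Patrick": "+61 1234", "Café": "(02) 5678"}

print("Patrick" in phone_book)         # membership tests look at the keys
print(phone_book.get("Library"))       # None rather than a KeyError
print(phone_book.get("Library", "?"))  # or a fallback value of your choice

pairs = list(phone_book.items())       # (key, value) tuples
print(pairs)

del phone_book["Café"]                 # remove an entry
print(len(phone_book))
```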

## Quiz


In [None]:
#| eval: false
animals = ["dog", "cat", "bird"]
animals.append("teddy bear")
animals.pop()
animals.pop()
animals.append("koala")
animals.append("kangaroo")
print(f"{len(animals)} and {len(animals[-2])}")

::: fragment


In [None]:
#| echo: false
animals = ["dog", "cat", "bird"]
animals.append("teddy bear")
animals.pop()
animals.pop()
animals.append("koala")
animals.append("kangaroo")
print(f"{len(animals)} and {len(animals[-2])}")

:::

# Control Flow {visibility="uncounted"}

## `if` and `else`


In [None]:
age = 50

In [None]:
if age >= 30:
    print("Gosh you're old")

In [None]:
if age >= 30:
    print("Gosh you're old")
else:
    print("You're still young")

## The weird part about Python...


In [None]:
#| error: true
if age >= 30:
    print("Gosh you're old")
else:
print("You're still young")

::: {.callout-warning}
Watch out for mixing tabs and spaces!
:::

## An example of aging


In [None]:
age = 16

if age < 18:
    friday_evening_schedule = "School things"
if age < 30:
    friday_evening_schedule = "Party 🥳🍾"
if age >= 30:
    friday_evening_schedule = "Work"

::: fragment


In [None]:
print(friday_evening_schedule)

:::

## Using `elif` {auto-animate=true}


In [None]:
age = 16

if age < 18:
    friday_evening_schedule = "School things"
elif age < 30:
    friday_evening_schedule = "Party 🥳🍾"
else:
    friday_evening_schedule = "Work"

print(friday_evening_schedule)

## `for` Loops


In [None]:
desires = ["coffee", "cake", "sleep"]
for desire in desires:
    print(f"Patrick really wants a {desire}.")

::: columns
::: {.column width="50%"}


In [None]:
for i in range(3):
    print(i)

In [None]:
for i in range(3, 6):
    print(i)

:::
::: {.column width="50%"}


In [None]:
range(5)

In [None]:
type(range(5))

In [None]:
list(range(5))

:::
:::

## Advanced `for` loops


In [None]:
for i, desire in enumerate(desires):
    print(f"Patrick wants a {desire}, it is priority #{i+1}.")

In [None]:
desires = ["coffee", "cake", "nap"]
times = ["in the morning", "at lunch", "during a boring lecture"]

for desire, time in zip(desires, times):
    print(f"Patrick enjoys a {desire} {time}.")
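
Two small variations on the loops above, as a sketch: `enumerate` accepts a `start` argument, and `zip` can be combined with `dict` to build a lookup table. The name `schedule` is only for illustration.

```python
desires = ["coffee", "cake", "nap"]
times = ["in the morning", "at lunch", "during a boring lecture"]

# enumerate can start counting from 1, avoiding the i + 1 arithmetic.
for priority, desire in enumerate(desires, start=1):
    print(f"Priority #{priority}: {desire}")

# zip pairs items up; wrapping it in dict() builds a lookup table.
schedule = dict(zip(desires, times))
print(schedule["cake"])

# zip stops at the shortest input, so the third desire is silently dropped here.
shortened = list(zip(desires, ["a", "b"]))
print(shortened)
```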

## List comprehensions


In [None]:
[x**2 for x in range(10)]

In [None]:
[x**2 for x in range(10) if x % 2 == 0]

They can get more complicated:


In [None]:
[x * y for x in range(4) for y in range(4)]

In [None]:
[[x * y for x in range(4)] for y in range(4)]

but I'd recommend just using `for` loops at that point.
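
For comparison, here is a sketch of the last comprehension written out with explicit loops (`table` and `table_loop` are illustrative names):

```python
# The nested comprehension from above.
table = [[x * y for x in range(4)] for y in range(4)]

# The same thing written with explicit loops.
table_loop = []
for y in range(4):
    row = []
    for x in range(4):
        row.append(x * y)
    table_loop.append(row)

print(table_loop == table)
print(table_loop[3])
```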

## `while` Loops


In [None]:
#| echo: false
import numpy.random as rnd

rnd.seed(1234)
simulate_pareto = lambda: rnd.pareto(1)

Say that we want to simulate $(X \,\mid\, X \ge 100)$ where $X \sim \mathrm{Pareto}(1)$.
Assuming we have `simulate_pareto`,
a function to generate $\mathrm{Pareto}(1)$ variables:


In [None]:
samples = []
while len(samples) < 5:
    x = simulate_pareto()
    if x >= 100:
        samples.append(x)

samples
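
To run this loop without NumPy, here is a self-contained sketch that swaps in the standard library's `random.paretovariate` as a stand-in for `simulate_pareto`. This is an assumption for illustration: `paretovariate(1)` returns values of at least 1, whereas NumPy's `pareto` draws from the shifted (Lomax) form starting at 0, so the numbers will differ from the slide. The `attempts` counter is also added here to show how many draws get rejected.

```python
import random

rng = random.Random(1234)  # seeded so the run is reproducible


def simulate_pareto():
    # Stand-in for the slides' helper, using the standard library.
    return rng.paretovariate(1)


samples = []
attempts = 0
while len(samples) < 5:
    attempts += 1
    x = simulate_pareto()
    if x >= 100:  # keep only the draws that satisfy the condition
        samples.append(x)

print(f"Kept {len(samples)} of {attempts} draws")
```

Only about 1 in 100 draws from this distribution is 100 or larger, so the loop typically needs several hundred attempts, and a `while` loop fits because the number of iterations isn't known in advance.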

## Breaking out of a loop


In [None]:
#| eval: false
while True:
    user_input = input(">> What would you like to do? ")

    if user_input == "order cake":
        print("Here's your cake! 🎂")

    elif user_input == "order coffee":
        print("Here's your coffee! ☕️")

    elif user_input == "quit":
        break

In [None]:
#| echo: false
inputs = ["order cake", "order coffee", "order cake", "quit"]

for user_input in inputs:
    print(f">> What would you like to do? {user_input}")
    if user_input == "order cake":
        print("Here's your cake! 🎂")

    elif user_input == "order coffee":
        print("Here's your coffee! ☕️")

    elif user_input == "quit":
        break

## Quiz

What does this print out?


In [None]:
#| eval: false
if 1 / 3 + 1 / 3 + 1 / 3 == 1:
    if 2**3 == 6:
        print("Math really works!")
    else:
        print("Math sometimes works..")
else:
    print("Math doesn't work")

::: fragment


In [None]:
#| echo: false
if 1 / 3 + 1 / 3 + 1 / 3 == 1:
    if 2**3 == 6:
        print("Math really works!")
    else:
        print("Math sometimes works..")
else:
    print("Math doesn't work")

:::



What does this print out?


In [None]:
#| eval: false
count = 0
for i in range(1, 10):
    count += i
    if i > 3:
        break
print(count)

::: fragment


In [None]:
#| echo: false
count = 0
for i in range(1, 10):
    count += i
    if i > 3:
        break
print(count)

:::

## Debugging the quiz code


In [None]:
count = 0
for i in range(1, 10):
    count += i
    print(f"After i={i} count={count}")
    if i > 3:
        break

# Python Functions {visibility="uncounted"}

## Making a function


In [None]:
def add_one(x):
    return x + 1


def greet_a_student(name):
    print(f"Hi {name}, welcome to the AI class!")

In [None]:
add_one(10)

In [None]:
greet_a_student("Josephine")

In [None]:
greet_a_student("Joseph")

::: {.callout-aside}
Here, `name` is a _parameter_ and the value supplied is an _argument_.
:::
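
One difference between the two functions above is that `add_one` returns a value while `greet_a_student` only prints. A short sketch of what that means for the caller (the functions are repeated so the block runs on its own):

```python
def add_one(x):
    return x + 1


def greet_a_student(name):
    print(f"Hi {name}, welcome to the AI class!")


result = add_one(10)
print(result + 1)  # the returned value can be used in later calculations

greeting = greet_a_student("Josephine")  # prints, but hands nothing back
print(greeting is None)
```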

## Default arguments


In [None]:
#| echo: false
import numpy.random as rnd

rnd.seed(1234)
simulate_standard_normal = rnd.normal

Assuming we have `simulate_standard_normal`,
a function to generate $\mathrm{Normal}(0, 1)$ variables:


In [None]:
def simulate_normal(mean=0, std=1):
    return mean + std * simulate_standard_normal()

In [None]:
simulate_normal()  # same as 'simulate_normal(0, 1)'

In [None]:
simulate_normal(1_000)  # same as 'simulate_normal(1_000, 1)'

::: {.callout-note}
We'll cover random numbers next week (using `numpy`).
:::

## Use explicit parameter names


In [None]:
simulate_normal(mean=1_000)  # same as 'simulate_normal(1_000, 1)'

In [None]:
simulate_normal(std=1_000)  # same as 'simulate_normal(0, 1_000)'

In [None]:
simulate_normal(10, std=0.001)  # same as 'simulate_normal(10, 0.001)'

In [None]:
#| error: true
simulate_normal(std=10, 1_000)
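
The last call fails because positional arguments must come before keyword arguments; Python rejects it as a `SyntaxError` before running anything. A self-contained sketch of the rules, assuming `random.gauss(0, 1)` from the standard library as a stand-in for `simulate_standard_normal`:

```python
import random

random.seed(1234)


def simulate_normal(mean=0, std=1):
    # random.gauss(0, 1) stands in for the slides' simulate_standard_normal.
    return mean + std * random.gauss(0, 1)


# Keyword arguments can be given in any order...
a = simulate_normal(std=0.001, mean=10)

# ...but any positional arguments must come first.
b = simulate_normal(10, std=0.001)

# A parameter can't be given twice (once by position, once by name).
try:
    simulate_normal(10, mean=10)
except TypeError as error:
    print(f"TypeError: {error}")

print(a, b)
```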

## Why would we need that?

E.g. to fit a Keras model, we use the `.fit` method:


In [None]:
#| eval: false
model.fit(x=None, y=None, batch_size=None, epochs=1, verbose='auto',
        callbacks=None, validation_split=0.0, validation_data=None,
        shuffle=True, class_weight=None, sample_weight=None,
        initial_epoch=0, steps_per_epoch=None, validation_steps=None,
        validation_batch_size=None, validation_freq=1,
        max_queue_size=10, workers=1, use_multiprocessing=False)

Say we want all the defaults except changing `use_multiprocessing=True`:


In [None]:
#| eval: false
model.fit(None, None, None, 1, 'auto', None, 0.0, None, True, None,
        None, 0, None, None, None, 1, 10, 1, True)

but it is _much nicer_ to just have:


In [None]:
#| eval: false
model.fit(use_multiprocessing=True)

## Quiz

What does the following print out?


In [None]:
#| eval: false
def get_half_of_list(numbers, first=True):
    if first:
        return numbers[: len(numbers) // 2]
    else:
        return numbers[len(numbers) // 2 :]

nums = [1, 2, 3, 4, 5, 6]
chunk = get_half_of_list(nums, False)
second_chunk = get_half_of_list(chunk)
print(second_chunk)

::: fragment


In [None]:
#| echo: false
def get_half_of_list(numbers, first=True):
    if first:
        return numbers[: len(numbers) // 2]
    else:
        return numbers[len(numbers) // 2 :]

nums = [1, 2, 3, 4, 5, 6]
chunk = get_half_of_list(nums, False)
second_chunk = get_half_of_list(chunk)
print(second_chunk)

:::

::: fragment


In [None]:
f"nums ~> {nums[:len(nums)//2]} and {nums[len(nums)//2:]}"

In [None]:
f"chunk ~> {chunk[:len(chunk)//2]} and {chunk[len(chunk)//2:]}"

:::

## Multiple return values


In [None]:
def limits(numbers):
    return min(numbers), max(numbers)

limits([1, 2, 3, 4, 5])

In [None]:
type(limits([1, 2, 3, 4, 5]))

In [None]:
min_num, max_num = limits([1, 2, 3, 4, 5])
print(f"The numbers are between {min_num} and {max_num}.")

In [None]:
_, max_num = limits([1, 2, 3, 4, 5])
print(f"The maximum is {max_num}.")

In [None]:
print(f"The maximum is {limits([1, 2, 3, 4, 5])[1]}.")

## Tuple unpacking


In [None]:
lims = limits([1, 2, 3, 4, 5])
smallest_num = lims[0]
largest_num = lims[1]
print(f"The numbers are between {smallest_num} and {largest_num}.")

In [None]:
smallest_num, largest_num = limits([1, 2, 3, 4, 5])
print(f"The numbers are between {smallest_num} and {largest_num}.")

This doesn't just work for functions with multiple return values:


In [None]:
RESOLUTION = (1920, 1080)
WIDTH, HEIGHT = RESOLUTION
print(f"The resolution is {WIDTH} wide and {HEIGHT} tall.")
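
Two more uses of unpacking, as a brief sketch: swapping variables, and collecting leftover values with a starred name. The number of names must otherwise match the number of values.

```python
a, b = 1, 2
a, b = b, a  # swap without a temporary variable
print(a, b)

first, *rest = [1, 2, 3, 4, 5]  # the starred name collects the leftovers in a list
print(first, rest)

try:
    width, height = (1920, 1080, 60)  # three values, two names
except ValueError as error:
    print(f"ValueError: {error}")
```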

## Short-circuiting


In [None]:
def is_positive(x):
    print("Called is_positive")
    return x > 0

def is_negative(x):
    print("Called is_negative")
    return x < 0

x = 10

::: columns
::: column


In [None]:
x_is_positive = is_positive(x)
x_is_positive

:::
::: column


In [None]:
x_is_negative = is_negative(x)
x_is_negative

:::
:::


In [None]:
x_not_zero = is_positive(x) or is_negative(x)
x_not_zero
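
To make the short-circuit visible without reading printed output, this sketch records each call in a list (`calls` is an illustrative name). It also shows a common use of `and`: guarding an expression that would otherwise raise an error.

```python
calls = []


def is_positive(x):
    calls.append("is_positive")
    return x > 0


def is_negative(x):
    calls.append("is_negative")
    return x < 0


x = 10
x_not_zero = is_positive(x) or is_negative(x)
print(calls)  # only is_positive ran, since `or` already had its answer

# Guarding: numbers[0] would raise an IndexError on an empty list.
numbers = []
has_big_first = len(numbers) > 0 and numbers[0] > 100
print(has_big_first)  # numbers[0] was never evaluated
```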

# Import syntax {visibility="uncounted"}

## Python standard library


In [None]:
import os
import time

In [None]:
time.sleep(0.1)

In [None]:
os.getlogin()

In [None]:
os.getcwd()

## Import a few functions


In [None]:
from os import getcwd, getlogin
from time import sleep

In [None]:
sleep(0.1)

In [None]:
getlogin()

In [None]:
getcwd()

## Timing using pure Python


In [None]:
from time import time

start_time = time()

counting = 0
for i in range(1_000_000):
    counting += 1

end_time = time()

elapsed = end_time - start_time
print(f"Elapsed time: {elapsed} secs")
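
For timing short pieces of code, the standard library also offers `time.perf_counter`, which is documented as using the highest-resolution clock available. A sketch of the same measurement with it:

```python
from time import perf_counter

start_time = perf_counter()

counting = 0
for i in range(1_000_000):
    counting += 1

elapsed = perf_counter() - start_time
print(f"Elapsed time: {elapsed:.4f} secs")
```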

## Data science packages

![Common data science packages](python-data-science-packages.png)

::: footer
Source: Learnbay.co, [Python libraries for data analysis and modeling in Data science](https://medium.com/@learnbay/python-libraries-for-data-analysis-and-modeling-in-data-science-c5c994208385), Medium.
:::

## Importing using `as`

::: columns
::: column

In [None]:
import pandas

pandas.DataFrame(
    {
        "x": [1, 2, 3],
        "y": [4, 5, 6],
    }
)

:::
::: column

In [None]:
import pandas as pd

pd.DataFrame(
    {
        "x": [1, 2, 3],
        "y": [4, 5, 6],
    }
)

:::
:::



## Importing from a subdirectory

Suppose we want `keras.models.Sequential()`.


In [None]:
#| output: false
import keras

model = keras.models.Sequential()

Alternatives using `from`:


In [None]:
from keras import models

model = models.Sequential()

In [None]:
from keras.models import Sequential

model = Sequential()
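
The same three import styles work for any nested module. Here is a sketch using `os.path` from the standard library, so it runs without Keras installed; the file path is a made-up example.

```python
# Style 1: import the top-level package and spell out the full path.
import os

print(os.path.basename("slides/python.qmd"))

# Style 2: import the submodule.
from os import path

print(path.basename("slides/python.qmd"))

# Style 3: import just the function.
from os.path import basename

print(basename("slides/python.qmd"))
```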

# Lambda functions {visibility="uncounted"}

## Anonymous 'lambda' functions {auto-animate="true"}

Example: how to sort strings by their second letter?


In [None]:
names = ["Josephine", "Patrick", "Bert"]

If you try `help(sorted)` you'll find the `key` parameter.


In [None]:
for name in names:
    print(f"The length of '{name}' is {len(name)}.")

In [None]:
sorted(names, key=len)

## Anonymous 'lambda' functions {auto-animate="true"}

Example: how to sort strings by their second letter?

In [None]:
names = ["Josephine", "Patrick", "Bert"]

If you try `help(sorted)` you'll find the `key` parameter.


In [None]:
def second_letter(name):
    return name[1]

In [None]:
for name in names:
    print(f"The second letter of '{name}' is '{second_letter(name)}'.")

In [None]:
sorted(names, key=second_letter)

## Anonymous 'lambda' functions {auto-animate="true"}

Example: how to sort strings by their second letter?

In [None]:
names = ["Josephine", "Patrick", "Bert"]

If you try `help(sorted)` you'll find the `key` parameter.


In [None]:
sorted(names, key=lambda name: name[1])

::: fragment

::: callout-caution
Don't use `lambda` as a variable name!
You commonly see `lambd` or `lambda_` or `λ`.
:::

:::
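
The `key` parameter isn't specific to `sorted`; `max` and `min` accept it too. A few more sketches on the same list (the result names are illustrative):

```python
names = ["Josephine", "Patrick", "Bert"]

by_last_letter = sorted(names, key=lambda name: name[-1])
longest = max(names, key=lambda name: len(name))
# Returning a tuple sorts by length first, then alphabetically to break ties.
by_length = sorted(names, key=lambda name: (len(name), name))

print(by_last_letter)
print(longest)
print(by_length)
```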



## `with` keyword

Example, opening a file:

::: columns
::: column
The most basic way is:


In [None]:
f = open("haiku1.txt", "r")
print(f.read())
f.close()

:::
::: column
Instead, use:


In [None]:
with open("haiku2.txt", "r") as f:
    print(f.read())

:::
:::

::: footer
Haikus from http://www.libertybasicuniversity.com/lbnews/nl107/haiku.htm
:::
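
The benefit of `with` is that the file is closed automatically when the block ends, even if an error occurs inside it. A self-contained sketch that writes and then reads a temporary file (the file name and its contents are placeholders, not the haiku files above):

```python
import os
import tempfile

# Build a path inside a fresh temporary directory.
file_path = os.path.join(tempfile.mkdtemp(), "example.txt")

with open(file_path, "w") as f:
    f.write("Hello from a temporary file\n")

with open(file_path, "r") as f:
    text = f.read()

print(text)
print(f.closed)  # the file was closed when the `with` block ended
```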

## Package Versions {.appendix data-visibility="uncounted"}


In [None]:
from watermark import watermark
print(watermark(python=True, packages="keras,matplotlib,numpy,pandas,seaborn,scipy,torch,tensorflow,tf_keras"))

## Links {.appendix data-visibility="uncounted"}

If you come from C (e.g. you are a joint computer science student) and are curious about Python's internals, you may enjoy this video: [How variables work in Python](https://youtu.be/0Om2gYU6clE?si=fdy_YpWbvfti8ZoD).

## Glossary {.appendix data-visibility="uncounted"}

::: columns
::: column

- default arguments
- dictionaries
- f-strings
- function definitions
- Google Colaboratory
- `help`
- list

:::
::: column

- `pip install ...`
- `range`
- slicing
- tuple
- `type`
- whitespace indentation
- zero-indexing

:::
:::