# Introduction to Python and Natural Language Technologies

__Lecture 01-2, Type system and built-in types__

__Sept 16, 2020__

__Judit Ács__

# Type system

__Dynamic__:
- No need to declare variables
- The `=` operator binds a reference to any arbitrary object

In [None]:
i = 2
type(i), id(i)

In [None]:
i = "foo"
type(i), id(i)

__Strongly typed__:
- Most implicit conversions are disallowed such as between strings and numeric types:

In [None]:
# print("I am " + 20 + " years old")

We must explicitly convert it:

In [None]:
print("I am " + str(20) + " years old")

Note that many other languages, such as JavaScript, allow implicit conversion:

In [None]:
%%javascript

element.text("I am " + 20 + " years old")

In [None]:
%%javascript

element.text("1" + 1)

In [None]:
%%javascript

element.text(1 + "1")

Conversions between numeric types are OK:

In [None]:
i = 2
f = 1.2
s = i + f
print(type(i), type(f))
print(type(i + f))
print(s)

# Boolean operators and type

Three boolean operators: `and`, `or` and `not`

In [None]:
x = 2

x < 2 or x >= 2

In [None]:
x > 0 and x < 10

In [None]:
not x < 0

Two boolean values: `True` and `False` (must be capitalized)

In [None]:
x = True
type(x)

In [None]:
True and False

In [None]:
True or False

In [None]:
not True

# Numeric types

Three numeric types: __`int`__, __`float`__ and __`complex`__ <br/>

An object's type is derived from its initial value:

In [None]:
i = 2
f = 1.2
c = 1+2j

type(i), type(f), type(c)

Implicit conversion between numeric types is supported in arithmetic operations. The result takes the wider of the operand types (`int`, then `float`, then `complex`), which minimizes data loss:

In [None]:
c2 = i + c
print(c2, type(c2))

Floats can be defined using __scientific notation__:

In [None]:
1e2, 1E2

## Underscores in numeric literals

Examples from [PEP 0515](https://www.python.org/dev/peps/pep-0515/):

In [None]:
# grouping decimal numbers by thousands
amount = 10_000_000.0

# grouping hexadecimal addresses by words
addr = 0xCAFE_F00D

# grouping bits into nibbles in a binary literal
flags = 0b_0011_1111_0100_1110

# same, for string conversions
flags = int('0b_1111_0000', 2)

# two or more consecutive underscores are not allowed
# 2__4  # SyntaxError

## Precision and range

Integers have unlimited precision.

In [None]:
type(2**63 + 1)

More information in `sys.int_info`:

In [None]:
import sys

sys.int_info

Floats are usually implemented using C's double.<br />
Complex numbers use two floats for their real and imaginary parts.<br/>
Check `sys.float_info` for more information:

In [None]:
import sys
sys.float_info
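
What double precision means in practice can be sketched as follows, assuming the usual IEEE 754 double implementation. `math.isclose` is the standard-library helper for tolerant comparison:

```python
import math
import sys

# Decimal fractions such as 0.1 have no exact binary representation,
# so rounding errors show up even in simple sums.
total = 0.1 + 0.2
exactly_equal = total == 0.3              # False with IEEE 754 doubles
close_enough = math.isclose(total, 0.3)   # tolerant comparison

# Multiplying past the largest finite float yields infinity, not an error.
overflowed = sys.float_info.max * 2

print(exactly_equal, close_enough, overflowed)
```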

## Arithmetic operators

Addition, subtraction and multiplication return the wider of the operand types, which minimizes information loss:

In [None]:
i = 2
f = 4.2
c = 4.1-3j

s1 = i + f
s2 = f - c
s3 = i * c
print(s1, type(s1))
print(s2, type(s2))
print(s3, type(s3))

__quotient operator__

In Python 3, the `/` operator computes the float quotient even if both operands are integers, unlike in C++ or Python 2:

In [None]:
3 / 2

Explicit __floor quotient operator__

In [None]:
-3.0 // 2, 3 // 2
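
Floor division differs from truncation for negative operands. A short sketch comparing `//` with `int()` and the built-in `divmod`:

```python
# Floor division rounds toward negative infinity, not toward zero.
floor_neg = -3 // 2        # -2
trunc_neg = int(-3 / 2)    # -1, since int() truncates toward zero

# divmod returns the floor quotient and the remainder in one call;
# the invariant a == q * b + r always holds.
q, r = divmod(-3, 2)

print(floor_neg, trunc_neg, (q, r))
```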

## Comparison operators

In [None]:
x = 23
x < 24, x >= 22

Operators can be chained:

In [None]:
23 < x < 100

In [None]:
23 <= x < 100

## Other operators for numeric types

__remainder__

In [None]:
5 % 3

__power__

In [None]:
2 ** 3

Using it for square root:

In [None]:
2 ** 0.5

__absolute value__

In [None]:
abs(-2 - 1j), abs(1+1j)

__round__

In [None]:
round(2.3456), round(2.3456, 2)

## Explicit conversions between numeric types

In [None]:
float(2)

In [None]:
int(2.5)
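
A few more explicit conversions, as a minimal sketch. Note that `int()` truncates rather than rounds, and that parsing strings is always explicit:

```python
# int() truncates toward zero; it does not round.
truncated = int(2.9), int(-2.9)      # (2, -2)

# Strings can be parsed explicitly, which Python never does implicitly.
parsed = int("42"), float("1.5")

# int() cannot parse a string that represents a float.
try:
    int("2.5")
except ValueError as error:
    message = str(error)

print(truncated, parsed, message)
```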

## `math` and `cmath`

Additional operations for real and complex numbers:

In [None]:
import math

math.log(16), math.log(16, 2), math.exp(2), \
math.exp(math.log(10))
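
The `cmath` module mirrors `math` for complex numbers. A small sketch of where the two differ:

```python
import cmath
import math

# math.sqrt rejects negative input...
try:
    math.sqrt(-1)
except ValueError:
    real_sqrt_failed = True

# ...while cmath.sqrt returns a complex result.
imaginary_unit = cmath.sqrt(-1)      # 1j

# Polar form: modulus and phase (in radians) of a complex number.
modulus, phase = cmath.polar(1 + 1j)

print(imaginary_unit, modulus, phase)
```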

# Mutable vs. immutable types

- Instances of mutable types can be modified in place
- Immutable objects have the same value during their lifetime
- Are numeric types mutable or immutable?

In [None]:
x = 2
old_id = id(x)
x += 1
print(id(x) == old_id)

Booleans are singleton immutable objects.

In [None]:
x = True
y = False
print(x is y)
x = False
print(x is y)

In [None]:
(2 == 2) is (3 == 3)

Lists are mutable:

In [None]:
l1 = [0, 1]
old_id = id(l1)
l1.append(2)
old_id == id(l1)

# Sequence types

All sequences support the following basic operations:

| operation | behaviour |
| :----- | :----- |
| `x in s` | 	True if an item of s is equal to x, else False |
| `x not in s` | 	False if an item of s is equal to x, else True |
| `s + t` | 	the concatenation of s and t |
| `s * n or n * s` | 	equivalent to adding s to itself n times |
| `s[i]` | 	ith item of s, origin 0 |
| `s[i:j]` | 	slice of s from i to j |
| `s[i:j:k]` | 	slice of s from i to j with step k |
| `len(s)` | 	length of s |
| `min(s)` | 	smallest item of s |
| `max(s)` | 	largest item of s |
| `s.index(x[, i[, j]])` | 	index of the first occurrence of x in s (at or after index i and before index j) |
| `s.count(x)` | 	total number of occurrences of x in s |

[Table source](https://docs.python.org/3/library/stdtypes.html#common-sequence-operations)
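
The operations in the table can be tried on any sequence type. The sketch below applies them to a list, a tuple and a string; the sample values are arbitrary:

```python
# The same operations work on a list, a tuple and a str.
sequences = [[3, 1, 2, 1], (3, 1, 2, 1), "cabc"]

for s in sequences:
    doubled = s * 2            # repetition keeps the sequence type
    middle = s[1:3]            # slices keep the sequence type too
    print(type(s).__name__, len(s), len(doubled), min(s), max(s), s[0], s[-1], middle)

# Membership, index and count on a concrete example.
t = (3, 1, 2, 1)
summary = (1 in t, 5 not in t, t.index(1), t.count(1))
print(summary)
```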

## Traversing sequences

All sequence types can be traversed with for loops:

In [None]:
l = [1, -1, "foo", 2, "bar"]
for element in l:
    print(element)

__`enumerate`__

If we need the indices too, the built-in `enumerate` function iterates over index-element pairs:

In [None]:
for i, element in enumerate(l):
    print(i, element)

## `list`

Mutable sequence type:

In [None]:
l = [1, 2, 2, 3]
print(l, l[1])

In [None]:
# l[4]  # raises IndexError

Negative indexing supported:

In [None]:
l[-1], l[len(l)-1]

But it can also be out of range:

In [None]:
# l[-5]  # raises IndexError

__`append`__ adds an element to the end of the list:

In [None]:
l = [1, 2, 3]
l.append(3)
l

__`insert`__ inserts an element at a specific index:

In [None]:
l = [1, 2, 3]
l.insert(1, 5)
l

__`extend`__ appends every element of an iterable to the end of the list:

In [None]:
l = [1, 2]
l.extend([3, 4, 5])
len(l), l

### Under the hood

`append` is amortized O(1), `insert` is O(n):

In [None]:
l = list(range(100))

In [None]:
%%timeit -n 100

l.append(0)

In [None]:
l = list(range(100))

In [None]:
%%timeit -n 100

l.insert(50, 0)

More about time complexity [here](https://wiki.python.org/moin/TimeComplexity).

__Advanced indexing, ranges__

In [None]:
l = []
for i in range(20):
    l.append(2*i + 1)
l[2:5]

In [None]:
l[:3], l[-4:]

This can be used for sliding windows:

In [None]:
for i in range(10):
    print(l[i:i+3])

Step argument:

In [None]:
l[2:10:3] 

Reversing a list:

In [None]:
l[::-1]

Lists are __mutable__, elements can be added or changed:

In [None]:
l = []
old_id = id(l)

for element in range(1, 3):
    l.append(element)
    print(id(l) == old_id)
l

The `=` operator binds another reference to the same object. No new object is created and nothing is copied:

In [None]:
l2 = l
print(l is l2)

l2.append(42)
l
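
For contrast, a minimal sketch of how plain assignment differs from an actual shallow copy made with `list.copy`:

```python
original = [1, [2, 3]]

alias = original            # same object, no copy
shallow = original.copy()   # new outer list; list(original) behaves the same

alias.append(4)
# The alias sees the change, the copy does not.
print(original, shallow)

# A shallow copy still shares nested objects with the original.
shallow[1].append(99)
print(original[1])
```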

Elements don't need to be of the same type:

In [None]:
l = [1, -1, "foo", 2, "bar"]
l

__Sorting lists__

Lists can be sorted using the built-in `sorted` function which returns a new list object.

In [None]:
l = [3, -1, 2, 11]

for e in sorted(l):
    print(e)

The `list.sort` method sorts the list in place:

In [None]:
l = [3, -1, 2, 11]
l.sort()
l

In [None]:
l1 = [3, -1, 2]
l2 = sorted(l1)
print(f"{(l1 is l2) = }")
print(f"{l1 = }")
print(f"{l2 = }")

The sorting key can be specified using the `key` argument:

In [None]:
shopping_list = [
    ["apple", 5],
    ["pear", 2],
    ["milk", 1],
    ["bread", 3],
]

for product in sorted(shopping_list, key=lambda x: -x[1]):
    print(product)

## tuple

Tuples are immutable sequences:

In [None]:
t = ()  # empty tuple
print(f"{type(t) = }")
print(f"{len(t) = }")
print(f"{t = }")

In [None]:
t = (1, 2, 3, "foo")
print(f"{type(t) = }")
print(f"{len(t) = }")
print(f"{t = }")

Tuples can be indexed the same way as lists:

In [None]:
t[1], t[-1]

A tuple with one element needs a trailing comma:

In [None]:
t = (2,)
type(t), len(t)

Tuples contain immutable references. However, the underlying objects may be mutable, such as lists:

In [None]:
t = ([1, 2, 3], "foo")
# t[0]= "bar"  # this raises a TypeError

In [None]:
t = ([1, 2, 3], "foo")

# save the id of the first element of the tuple (the list)
old_id = id(t[0])

# change an element of the list (not the tuple!)
t[0][1] = 11

# did the id of the list change?
id(t[0]) == old_id

# Strings

- Strings are **immutable** sequences of Unicode code points
- These code points range from 0 to 0x10FFFF
- No separate character type. Some functions can only be used on characters, i.e. strings of length 1
- Strings can be constructed in various ways

In [None]:
single = 'ab\'c'
double = "ab\"c"
multiline = """
sdfajfklasj;
sdfsdfs
sdfsdf
"""
single == double

Immutability makes it impossible to change a string in place (unlike C-style strings or C++'s `std::string`):

In [None]:
s = "abc"

# s[1] = "c"  # TypeError

All string operations create new objects:

In [None]:
s = "abc"
old_id = id(s)
s += "def"
id(s) == old_id

Strings support most operations available for lists such as __advanced indexing__:

In [None]:
s = "abcdefghijkl"
s[::2]

## Character encodings - Unicode

Unicode provides a mapping from letters to code points or numbers:
  
| character | Unicode code point |
| ---- | ---- |
| a | U+0061 |
| ő | U+0151 | 
| ش | U+0634 |
| گ | U+06AF |
| ¿ | U+00BF |
| ư | U+01B0 |
| Ң | U+04A2 |
| ⛵ | U+26F5 |

- These are abstract code points
- Actual text needs to be stored as a byte array/sequence (byte strings)
- Character encoding: code point - byte array correspondence

- **Encoding**: Unicode code point $\rightarrow$ byte sequence
- **Decoding**: byte sequence $\rightarrow$ Unicode code point
- Most popular encoding: UTF-8. Make sure that your terminal uses UTF-8. This can be verified with the `locale` command on Linux/macOS.

| character | Unicode code point | UTF-8 byte sequence |
| ---- | ---- | ---- |
| a | U+0061 | 61 |
| ő | U+0151 | C5 91 |
| ش | U+0634 | D8 B4 |
| گ | U+06AF | DA AF |
| ¿ | U+00BF | C2 BF |
| ư | U+01B0 | C6 B0 |
| Ң | U+04A2 | D2 A2 |
| ⛵ | U+26F5 | E2 9B B5 |

__chr__ and __ord__

We can look up code points with `ord` and similarly convert them back to characters with `chr`:

In [None]:
ship = chr(0x26f5)
ship

In [None]:
ord(ship), hex(ord(ship)), oct(ord(ship)), bin(ord(ship))

Python 3 automatically **encodes** Unicode strings when:
- Writing to a text file
- Printing
- Performing any other operation that requires conversion to a byte string

It automatically **decodes** byte sequences when reading from a text file.

**IMPORTANT** Python2 does not do this automatically.
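
The automatic encoding and decoding can be observed by writing a string in text mode and reading it back in binary mode. A minimal sketch; the temporary file name `example.txt` is an assumption made only for illustration:

```python
import os
import tempfile

text = "őz ⛵"

# Hypothetical scratch file, created only for this illustration.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.txt")

    # Text mode: the str is encoded on write using the given encoding.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Binary mode: we see the raw bytes that were stored.
    with open(path, "rb") as f:
        raw = f.read()

    # Text mode again: the bytes are decoded back to str on read.
    with open(path, encoding="utf-8") as f:
        decoded = f.read()

print(raw, decoded == text)
```

Passing `encoding` explicitly avoids depending on the platform's default encoding.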

## `bytes` in Python 3

- Immutable sequence of bytes.
    - This corresponds to Python2's `str` type.
    - Python3's `str` was `unicode` in Python2.
- Python3 strings can be encoded resulting in a bytes object:

In [None]:
unicode_string = "ábc"
utf8_string = unicode_string.encode("utf8")
type(utf8_string), utf8_string

Different encodings result in different byte sequences:

In [None]:
unicode_string = "ábc"
utf8_string = unicode_string.encode("utf8")
utf16_string = unicode_string.encode("utf16")
latin2_string = unicode_string.encode("latin2")

type(unicode_string), type(utf8_string), type(utf16_string), type(latin2_string)

Their length is different too:

In [None]:
len(unicode_string), len(utf8_string), len(utf16_string), len(latin2_string)
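
Decoding is the inverse step and only works reliably with the codec that was used for encoding. A short sketch of the three possible outcomes:

```python
original = "ábc"
utf8_bytes = original.encode("utf8")

# Decoding with the codec used for encoding restores the string.
round_trip = utf8_bytes.decode("utf8")

# Decoding with a different codec may silently produce the wrong text...
mojibake = utf8_bytes.decode("latin2")

# ...or fail outright when the bytes are invalid for that codec.
try:
    original.encode("latin2").decode("utf8")
except UnicodeDecodeError:
    decode_failed = True

print(round_trip == original, len(mojibake), decode_failed)
```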

## String operations

Large variety of basic string manipulation: lower, upper, title

In [None]:
"abC".upper(), "ABC".lower(), "abc".title()

Concatenation with `+`:

In [None]:
s = "\tabc  \n"
print("<START>" + s + "<STOP>")

In [None]:
s.strip()

In [None]:
s.rstrip()

In [None]:
s.lstrip()

In [None]:
"abca".strip("ba")

Since each method returns a new string, calls can be chained one after another:

In [None]:
" abcd abc".strip().rstrip("c").lstrip("ab")

__Binary predicates (i.e. yes-no questions)__

In [None]:
"abc".startswith("ab"), "abc".endswith("cd")

In [None]:
"abc".istitle(), "Abc".istitle()

In [None]:
"  \t\n".isspace()

In [None]:
"989".isdigit(), "1.5".isdigit()

__`split`__

In [None]:
s = "the quick brown fox jumps over the lazy dog"
words = s.split()
words

In [None]:
s = "R.E.M."
s.split(".")

__`maxsplit` and `partition`__

In [None]:
config = "name=namewith=sign"

config.split("=", maxsplit=1)

In [None]:
"name=namewith=sign".partition("=")

__`join`__

In [None]:
"-".join(words)

Use explicit token separators:

In [None]:
" <W> ".join(words)

## String formatting

Python features several string formatting options.

__`str.format`__

- non-str objects are automatically converted to str
  - under the hood: the object's `__format__` method is called; the default implementation inherited from `object` falls back to `__str__`

In [None]:
name = "John"
age = 25

print("My name is {0} and I'm {1} years old. "
      "I turned {1} last December".format(name, age))
print("My name is {} and I'm {} years old.".format(name, age))
# print("My name is {} and I'm {} years old. I turned {} last December".format(name, age))  # raises IndexError
print("My name is {name} and I'm {age} years old. I turned {age} last December".format(
    name=name, age=age))

[Format specification mini language](https://docs.python.org/3/library/string.html#formatspec)
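
A few common format specifications, as a minimal sketch. The variable names and values are chosen only for illustration:

```python
price = 3.14159
count = 42
name = "plum"

# Fixed-point notation with two decimals.
fixed = "{:.2f}".format(price)
# Zero-padded integer of width 5.
padded = "{:05d}".format(count)
# Right-aligned and centered text in a field of width 8.
right = "{:>8}".format(name)
centered = "{:^8}".format(name)
# Thousands separator.
grouped = "{:,}".format(10000000)

print(fixed, padded, repr(right), repr(centered), grouped)
```

The same specifications work after the colon in f-strings, e.g. `f"{price:.2f}"`.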

__% operator__

- note that multiple arguments must be passed as a tuple, hence the parentheses

In [None]:
print("My name is %s and I'm %d years old" % (name, age))

__f-strings or string interpolation__

f-strings were added in Python 3.6 by [PEP 498](https://www.python.org/dev/peps/pep-0498/)

In [None]:
import sys

name = "John"
age = 42
age2 = 12

if sys.version_info >= (3, 6):
    print(f"My name is {name} and I'm {age} years old {age2}")

Aside from variables, expressions can be evaluated in f-strings:

In [None]:
s = "abc"
f"Length of string: {len(s)}, contents: {s}"

Self-documenting expressions (the `=` specifier) were added in Python 3.8:

In [None]:
s1 = "abc"
s2 = "aBc"

print(f"{s1 = }")
print(f"{s2 = }")
print(f"{(s1 == s2) =}")
print(f"{(s1.lower() == s2.lower()) = }")

# dictionary

The basic and only built-in mapping type; it maps keys to values.

Dictionaries can be defined in a number of ways:

In [None]:
d = {}  # empty dictionary  same as d = dict()
d["apple"] = 12
d["plum"] = 2
d

Equivalent to:

In [None]:
d = {"apple": 12, "plum": 2}
d

Or:

In [None]:
d = dict(apple=12, plum=2)
d

__removing keys__

In [None]:
del d["apple"]
d

__Iterating dictionaries__

Keys and values can be iterated separately or together.

Iterating keys:

In [None]:
d = {"apple": 12, "plum": 2}
for key in d.keys():
    print(key, d[key])

Iterating values:

In [None]:
for value in d.values():
    print(value)

Iterating both:

In [None]:
for key, value in d.items():
    print(key, value)

__Under the hood__

Dictionaries are hash tables (same as C++'s `std::unordered_map`).
- Constraints on key values: they must be hashable i.e. they cannot be or contain mutable objects
- Keys can be mixed type.

In [None]:
d = {}
d[1] = "a"  # numeric types are immutable
d[3+2j] = "b"
d["c"] = 1.0
d

- tuples are immutable too

In [None]:
d[("apple", 1)] = -2
d

- however lists are not

In [None]:
# d[["apple", 1]] = 12  # raises TypeError

__Q. Can these be dictionary keys?__

In [None]:
key1 = (2, (3, 4))
key2 = (2, [], (3, 4))

d = {}
d[key1] = 1
# d[key2] = 2
d
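
Hashability can also be tested directly with the built-in `hash` function. A small sketch; `is_hashable` is a hypothetical helper, not a standard-library function:

```python
def is_hashable(obj):
    """Return True if obj can be used as a dictionary key (hypothetical helper)."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


key1 = (2, (3, 4))
key2 = (2, [], (3, 4))

# A tuple is hashable only if all of its elements are hashable.
print(is_hashable(key1), is_hashable(key2))
```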

__Dictionaries preserve insertion order__

In [None]:
d1 = {}
d1['apple'] = 12
d1['plum'] = 3
for key, value in d1.items():
    print(key, value)

In [None]:
d2 = {}
d2['plum'] = 3
d2['apple'] = 12
for key, value in d2.items():
    print(key, value)

Regardless of insertion order, `d1` and `d2` compare equal, but they are different objects:

In [None]:
d1 == d2, d1 is d2

# set

- Collection of unique, hashable elements
- Implements basic set operations (intersection, union, difference)
- Sets are mutable

In [None]:
s = set()
s.add(2)
s.add(3)
s.add(2)
s

In [None]:
s = {2, 3, 2}  # set literal; braces with key: value pairs create a dict instead, e.g. d = {'a': 2}
type(s), s

__Deleting elements__

In [None]:
s.add(2)
s.remove(2)
# s.remove(2)  # raises KeyError, since we already removed this element
s.discard(2)  # removes if present, does not raise exception

## frozenset

Immutable counterpart of set:

In [None]:
fs = frozenset([1, 2])
# fs.add(2)  # raises AttributeError

In [None]:
fs = frozenset([1, 2])
s = {1, 2}

d = dict()
d[fs] = 1
# d[s] = 2  # raises TypeError
d

## set operations

- implemented as
  1. methods
  2. overloaded operators

In [None]:
s1 = {1, 2, 3, 4, 5}
s2 = {2, 5, 6, 7}

s1 & s2  # s1.intersection(s2) or s2.intersection(s1)

In [None]:
s1 | s2  # s1.union(s2) OR s2.union(s1)

In [None]:
s1 - s2, s2 - s1  # s1.difference(s2), s2.difference(s1)

These operations return new sets

In [None]:
s3 = s1 & s2
type(s3), s3 is s1, s3 is s2

Subset testing returns a boolean:

In [None]:
s1 < s2  # proper subset test; s1 <= s2 is s1.issubset(s2) OR s2.issuperset(s1)
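
The difference between the subset and proper subset operators is easiest to see on small sets. A minimal sketch with arbitrary example sets:

```python
a = {1, 2}
b = {1, 2, 3}

subset = a <= b, a.issubset(b)          # subset test
proper_subset = a < b                   # subset AND a != b
# A set is a subset of itself, but not a proper subset.
self_checks = a <= a, a < a
superset = b >= a, b.issuperset(a)

print(subset, proper_subset, self_checks, superset)
```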

## Useful set properties

Creating a set is a convenient way of getting the unique elements of a sequence

In [None]:
l = [1, 2, 3, -1, 1, 2, 1, 0]
uniq = set(l)
uniq

## Under the hood

- sets and dictionaries provide O(1) lookup on average
- in contrast, lists provide O(n) lookup

In [None]:
import random

# let's define our alphabet
letters = "abcdef"
# we generate strings of length 1 to 5
word_len = [1, 2, 3, 4, 5]
# we generate 10000 examples
N = 10000
samples = []

for i in range(N):
    word = []
    # sample a word length
    this_len = random.choice(word_len)
    for j in range(this_len):
        # sample a character from the alphabet and add it to the 'word'
        word.append(random.choice(letters))
    samples.append("".join(word))
    
samples = list(set(samples))

__list lookup__

In [None]:
%%timeit

word = []
for j in range(random.choice(word_len)):
    word.append(random.choice(letters))
word = "".join(word)
word in samples

__set lookup__

In [None]:
samples_set = set(samples)
len(samples_set), len(samples)

In [None]:
%%timeit

word = []
for j in range(random.choice(word_len)):
    word.append(random.choice(letters))
word = "".join(word)
word in samples_set

# Miscellaneous

## Mutable default arguments

Mutable default arguments are bound to the function object so they are shared across all function calls:

In [None]:
def insert_value(value, l=[]):
    l.append(value)
    print(l)
    
l1 = []
insert_value(12, l1)
l2 = []
insert_value(14, l2)

In [None]:
insert_value(-1)

In [None]:
insert_value(-3)
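
The shared default can be inspected through the function's `__defaults__` attribute. A minimal sketch; `append_to_default` is a hypothetical stand-in for the function above:

```python
def append_to_default(value, l=[]):
    l.append(value)
    return l


# The default list is created once, when the function is defined.
print(append_to_default.__defaults__)

first = append_to_default(-1)
second = append_to_default(-3)

# Both calls returned the very same list object stored on the function.
shared = first is second is append_to_default.__defaults__[0]
print(append_to_default.__defaults__, shared)
```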

It's best to avoid using mutable defaults.

One solution is to create a new list inside the function if no list is provided:

In [None]:
def insert_value(value, l=None):
    if l is None:
        l = []
    l.append(value)
    return l

l = insert_value(2)
l

In [None]:
insert_value(12)

## Lambda expressions

- Unnamed functions
- May take parameters
- Can access local scope
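
The properties above can be sketched as follows; `make_multiplier` is a hypothetical helper used only to show scope access:

```python
def make_multiplier(factor):
    # The lambda reads `factor` from the enclosing function's local scope.
    return lambda x: x * factor


double = make_multiplier(2)
triple = make_multiplier(3)
print(double(5), triple(5))

# Lambdas may take several parameters, or none at all.
add = lambda a, b: a + b
constant = lambda: 42
print(add(1, 2), constant())
```

Binding a lambda to a name, as done here for brevity, is discouraged by PEP 8; a `def` is preferred in that case.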

In [None]:
words = ["Plum", "pear", "Apple", "peach"]
sorted(words)

Let's sort this in a case insensitive way:

In [None]:
sorted(words, key=lambda w: w.lower())

Let's sort them by word length AND alphabetically (case insensitive).

We use the fact that tuples are compared elementwise:

In [None]:
(2, "b") < (3, "a")

In [None]:
(2, "b") < (2, "a")

We need two keys: the lengths and the lowercase word form:

In [None]:
sorted(words, key=lambda w: (len(w), w.lower()))

# Mandatory reading

- [A4 page printable PEP8 cheat sheet](https://www.kbsoftware.co.uk/docs/_downloads/pep8_cheat.pdf)
- [Introduction to character encodings](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/) by Joel Spolsky, co-founder of Stack Overflow

# Suggested reading

- [Time complexity of various operations under CPython](https://wiki.python.org/moin/TimeComplexity)
- [String formatting mini language](https://docs.python.org/3/library/string.html#formatspec)

# Reference

- [Official documentation of built-in types](https://docs.python.org/3/library/stdtypes.html)