# A Tour of Python
## by Pierre Nugues
A quick introduction to Python’s syntax for readers with some knowledge of programming.

## Elementary flow control

### Variables

We create variables and assign them values, such as numbers and strings, with the equal sign. Using them, we carry out a few arithmetic operations:

In [1]:
a = 1
b = 2
c = a / (b 
         + 1)
text = 'Result:'
print(text, c)

Result: 0.3333333333333333


### The `for` loop

In Python, blocks are delimited by their indentation, as in this `for` loop:

In [2]:
for i in [1, 2, 3, 4, 5, 6]:
    print(i)
print('Done')

1
2
3
4
5
6
Done


### Conditionals

Conditionals use the `if` and `else` keywords:

In [3]:
for i in [1, 2, 3, 4, 5, 6]:
    if i % 2 == 0:
        print('Even:', i)
    else:
        print('Odd:', i)
print('Done')

Odd: 1
Even: 2
Odd: 3
Even: 4
Odd: 5
Even: 6
Done


## Strings
We create strings with single quotes and multiline strings with triple double quotes:

In [4]:
iliad = """Sing, O goddess, the anger of Achilles son of
Peleus, that brought countless ills upon the Achaeans."""
iliad

'Sing, O goddess, the anger of Achilles son of\nPeleus, that brought countless ills upon the Achaeans.'

In [5]:

iliad2 = 'Sing, O goddess, the anger of Achilles son of \
Peleus, that brought countless ills upon the Achaeans.'
iliad2

'Sing, O goddess, the anger of Achilles son of Peleus, that brought countless ills upon the Achaeans.'

We access the characters in a string through their index in square brackets:

In [6]:
alphabet = 'abcdefghijklmnopqrstuvwxyz'
alphabet[0]  # 'a'
alphabet[1]  # 'b'
alphabet[25] # 'z'

'z'

Indices can also be negative; they then count from the end of the string:

In [7]:
alphabet[-1]  # the last character of a string: 'z'
alphabet[-2]  # the second last: 'y'
alphabet[-26] # 'a'

'a'

An index outside the range of the string raises an `IndexError`:

In [8]:
alphabet[27]

IndexError: string index out of range

We get the length with `len()`

In [9]:
len(alphabet)  # 26

26

Strings are immutable. Trying to change their value throws an error:

In [10]:
alphabet[0] = 'b'  # throws an error

TypeError: 'str' object does not support item assignment
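
Since strings are immutable, a modified string has to be built as a new object. The following is a minimal sketch of two ways to do it; the names `changed` and `replaced` are illustrative and not part of the original notebook, and `alphabet[1:]` is a slice, a notation described in the Slices section:

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

# Build a new string from a new first character and the rest of the old one
changed = 'b' + alphabet[1:]

# replace() also returns a new string; the third argument, 1, limits
# the replacement to the first occurrence
replaced = alphabet.replace('a', 'b', 1)

print(changed)
print(replaced)
print(alphabet)  # the original string is unchanged
```

Both approaches leave `alphabet` untouched and bind the result to a new name.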

### String Operations and Functions

We can concatenate and repeat strings with `+` and `*`:

In [11]:
'abc' + 'def'  # 'abcdef'

'abcdef'

In [12]:
'abc' * 3  # 'abcabcabc'

'abcabcabc'

Some string functions

In [13]:
# join()
''.join(['abc', 'def', 'ghi'])  # equivalent to 'abc' + 'def' + 'ghi':
# 'abcdefghi'

'abcdefghi'

In [14]:
' '.join(['abc', 'def', 'ghi'])  # places a space between the
# elements: 'abc def ghi'

'abc def ghi'

In [15]:
', '.join(['abc', 'def', 'ghi'])  # 'abc, def, ghi'

'abc, def, ghi'

In [16]:
# upper() and lower()
accented_e = 'eéèêë'
accented_e.upper()  # 'EÉÈÊË'

'EÉÈÊË'

In [17]:
accented_E = 'EÉÈÊË'
accented_E.lower()  # 'eéèêë'

'eéèêë'

In [18]:
alphabet.find('def')  # 3

3

In [19]:
alphabet.find('é')  # -1, not found

-1

In [20]:
alphabet.replace('abc', 'αβγ')  # 'αβγdefghijklmnopqrstuvwxyz'

'αβγdefghijklmnopqrstuvwxyz'

A program to extract the vowels

In [21]:
text_vowels = ''
for c in iliad:
    if c in 'aeiou':
        text_vowels = text_vowels + c
print(text_vowels)   # 'ioeeaeoieooeeuaououeiuoeaea' 


ioeeaeoieooeeuaououeiuoeaea


### Slices

Slices are substrings of strings

In [22]:
# Slices
alphabet[0:3]  # the three first letters of the alphabet: 'abc'

'abc'

In [23]:
alphabet[:3]  # equivalent to alphabet[0:3]

'abc'

In [24]:
alphabet[3:6]  # substring from index 3 to index 5: 'def'

'def'

In [25]:
alphabet[-3:]  # the three last letters of the alphabet: 'xyz'

'xyz'

In [26]:
alphabet[10:-10]  # 'klmnop'

'klmnop'

In [27]:
alphabet[:]  # all the letters: 'a...z'

'abcdefghijklmnopqrstuvwxyz'

Two complementary slices rebuild the whole string

In [28]:
i = 10
alphabet[:i] + alphabet[i:]

'abcdefghijklmnopqrstuvwxyz'

Slices with a step

In [29]:
alphabet[0::2]  # acegikmoqsuwy

'acegikmoqsuwy'
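
The step can also start from another index or be negative. A short sketch, with the illustrative names `every_third` and `reversed_alphabet` (not in the original notebook):

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

# Start at index 1 and take every third letter
every_third = alphabet[1::3]

# A step of -1 traverses the string backwards
reversed_alphabet = alphabet[::-1]

print(every_third)
print(reversed_alphabet)
```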

### Special characters

Two characters have a special meaning in strings: the quote and the backslash. They need to be escaped: `\'` and `\\`.

In [30]:
'Python\'s strings'  # "Python's strings"

"Python's strings"

In [31]:
"Python's strings"  # "Python's strings"

"Python's strings"

Python defines escape sequences to denote characters by their name or code point. It uses the Unicode standard:

In [32]:
'\N{COMMERCIAL AT}'  # '@'

'@'

In [33]:
'\x40'  # '@'

'@'

In [34]:
'\u0152'  # 'Œ'

'Œ'

We use the `r` prefix to treat the backslashes as normal characters:

In [35]:
r'\N{COMMERCIAL AT}'  # '\\N{COMMERCIAL AT}'

'\\N{COMMERCIAL AT}'

In [36]:
r'\x40'  # '\\x40'

'\\x40'

In [37]:
r'\u0152'  # '\\u0152'

'\\u0152'

### Formatting strings

In [38]:
begin = 'my'
'{} string {}'.format(begin, 'is empty')
# 'my string is empty'

'my string is empty'

In [39]:
begin = 'my'
'{1} string {0}'.format('is empty', begin)
# 'my string is empty'

'my string is empty'
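
Formatted string literals, or f-strings, available from Python 3.6, are a more compact alternative to `format()`. A minimal sketch; the variables `end`, `sentence`, and `ratio` are introduced here for illustration:

```python
begin = 'my'
end = 'is empty'

# An f-string evaluates the expressions inside the braces
sentence = f'{begin} string {end}'

# A format specification follows a colon: here, two digits after
# the decimal point
ratio = f'{1 / 3:.2f}'

print(sentence)
print(ratio)
```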

## Data identities and types

In [40]:
12


12

In [41]:
id(12)

140450262608528

In [42]:
print(12)


12


In [43]:
a = 12
id(a)

140450262608528

In [44]:
print(type(a))  # <class 'int'>

<class 'int'>


In [45]:
print(type(12.0))  # <class 'float'>

<class 'float'>


In [46]:
print(type(True))  # <class 'bool'>

<class 'bool'>


In [47]:
print(type(1 < 2))  # <class 'bool'>

<class 'bool'>


In [48]:
print(type(None))  # <class 'NoneType'>

<class 'NoneType'>


In [49]:
id('12')

140448385654512

In [50]:
print(type('12'))

<class 'str'>


In [51]:
alphabet       # abcdefghijklmnopqrstuvwxyz

'abcdefghijklmnopqrstuvwxyz'

In [52]:
id(alphabet)  

140448652709680

In [53]:

type(alphabet)     # <class 'str'>

str

Type conversions

In [54]:
int('12')  # 12

12

In [55]:
str(12)  # '12'

'12'

In [56]:
int('12.0')  # ValueError

ValueError: invalid literal for int() with base 10: '12.0'

In [57]:
int(alphabet)  # ValueError

ValueError: invalid literal for int() with base 10: 'abcdefghijklmnopqrstuvwxyz'

In [58]:
int(True)  # 1

1

In [59]:
int(False)  # 0

0

In [60]:
bool(7)  # True

True

In [61]:
bool(0)  # False

False

In [62]:
bool(None)  # False

False

## Data structures

### Lists

Lists are data structures that can hold any type of elements. We can read and write data in a list using indexes

In [63]:
list1 = []  # An empty list
list1 = list()  # Another way to create an empty list
list2 = [1, 2, 3]  # List containing 1, 2, and 3

Their Python type

In [64]:
print(type(list2))

<class 'list'>


In [65]:
list2[1]  # 2

2

In [66]:
list2[1] = 8
list2  # [1, 8, 3]

[1, 8, 3]

In [67]:
list2[4]  # Index error

IndexError: list index out of range

In [68]:
var1 = 3.14
var2 = 'my string'

In [69]:
list3 = [1, var1, 'Prolog', var2]
list3  # [1, 3.14, 'Prolog', 'my string']

[1, 3.14, 'Prolog', 'my string']

Slices

In [70]:
list3[1:3]  # [3.14, 'Prolog']
list3[1:3] = [2.72, 'Perl', 'Python']
list3  # [1, 2.72, 'Perl', 'Python', 'my string']

[1, 2.72, 'Perl', 'Python', 'my string']

Lists of lists

In [71]:
list4 = [list2, list3]
# [[1, 8, 3], [1, 2.72, 'Perl', 'Python', 'my string']]
list4

[[1, 8, 3], [1, 2.72, 'Perl', 'Python', 'my string']]

In [72]:
list4[0][1]  # 8

8

In [73]:
list4[1][3]  # 'Python'

'Python'

In [74]:
list5 = list2
[v1, v2, v3] = list5

In [75]:
[v1, v2, v3]

[1, 8, 3]

### List Copy
#### Shallow copy

In [76]:
list2                                    

[1, 8, 3]

In [77]:
list5 

[1, 8, 3]

In [78]:
print(id(list2))               
print(id(list5))

140450001547136
140450001547136


In [79]:
list6 = list2.copy()
id(list6)        

140448930426496

#### Identity and equality

In [80]:
list2 == list5    # True

True

In [81]:
list2 == list6    # True

True

In [82]:
list2 is list5    # True

True

In [83]:
list2 is list6    # False

False

In [84]:
list2[1] = 2  

In [85]:
print(list2)          
print(list5)                  
print(list6)                  

[1, 2, 3]
[1, 2, 3]
[1, 8, 3]


In [86]:
id(list2)

140450001547136

#### Deep copy

In [87]:
id(list4.copy()[0])

140450001547136

In [88]:
import copy

id(copy.deepcopy(list4)[0])

140448930371008
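
The identifiers above show that a shallow copy shares its inner lists with the original, while a deep copy duplicates them. The sketch below makes the consequence visible; the names `inner`, `outer`, `shallow`, and `deep` are illustrative and not taken from the notebook:

```python
import copy

inner = [1, 8, 3]
outer = [inner, ['Perl', 'Python']]

shallow = outer.copy()        # new outer list, same inner lists
deep = copy.deepcopy(outer)   # new outer list and new inner lists

inner[1] = 2                  # modify an inner list in place

print(shallow[0])  # shares the inner list with outer
print(deep[0])     # keeps its own copy
```

After the assignment, the change is visible through `shallow` but not through `deep`.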

### List operations and functions

In [89]:
list2

[1, 2, 3]

In [90]:
  
list3[:-1]  # [1, 2.72, 'Perl', 'Python']

[1, 2.72, 'Perl', 'Python']

In [91]:
[1, 2, 3] + ['a', 'b']  # [1, 2, 3, 'a', 'b']

[1, 2, 3, 'a', 'b']

In [92]:
list2[:2] + list3[2:-1]

[1, 2, 'Perl', 'Python']

In [93]:
list2 * 2  

[1, 2, 3, 1, 2, 3]

In [94]:
[0.0] * 4  # Initializes a list of four 0.0s
# [0.0, 0.0, 0.0, 0.0]

[0.0, 0.0, 0.0, 0.0]

In [95]:
len(list2)  # 3

3

In [96]:
list2.extend([4, 5])  # [1, 2, 3, 4, 5]
list2

[1, 2, 3, 4, 5]

In [97]:
list2.append(6)  # [1, 2, 3, 4, 5, 6]
list2

[1, 2, 3, 4, 5, 6]

In [98]:
list2.append([7, 8])  # [1, 2, 3, 4, 5, 6, [7, 8]]
list2

[1, 2, 3, 4, 5, 6, [7, 8]]

In [99]:
list2.pop(-1)  # [1, 2, 3, 4, 5, 6]
list2

[1, 2, 3, 4, 5, 6]

In [100]:
list2.remove(1)  # [2, 3, 4, 5, 6]
list2

[2, 3, 4, 5, 6]

In [101]:
list2.insert(0, 'a')  # ['a', 2, 3, 4, 5, 6]
list2

['a', 2, 3, 4, 5, 6]

In [102]:
list5

['a', 2, 3, 4, 5, 6]

In [103]:
list6

[1, 8, 3]

### Tuples

Tuples are similar to lists, but they are immutable

In [104]:
tuple1 = ()  # An empty tuple
tuple1 = tuple()  # Another way to create an empty tuple
tuple2 = (1, 2, 3, 4)

In [105]:
tuple2[3]  # 4

4

In [106]:
tuple2[1:4]  # (2, 3, 4)

(2, 3, 4)

In [107]:
tuple2[3] = 8  # Type error: Tuples are immutable

TypeError: 'tuple' object does not support item assignment

Tuples can include elements of different types, including lists that can be changed (not a good programming practice)

In [108]:
list7 = ['a', 'b', 'c']
tuple3 = tuple(list7)  # conversion to a tuple: ('a', 'b', 'c')
tuple3

('a', 'b', 'c')

In [109]:
type(tuple3)  # <class 'tuple'>

tuple

In [110]:
list8 = list(tuple2)  # [1, 2, 3, 4]

In [111]:
tuple([1])

(1,)

In [112]:
list((1,))

[1]

In [113]:
tuple4 = (tuple2, list7)  # ((1, 2, 3, 4), ['a', 'b', 'c'])
tuple4[0]  # (1, 2, 3, 4)

(1, 2, 3, 4)

In [114]:
tuple4[1]  # ['a', 'b', 'c']

['a', 'b', 'c']

In [115]:
tuple4[0][2]  # 3

3

In [116]:
tuple4[1][1]  # 'b'

'b'

In [117]:
tuple4[1][1] = 'β'  # ((1, 2, 3, 4), ['a', 'β', 'c'])
tuple4

((1, 2, 3, 4), ['a', 'β', 'c'])

### Sets

Sets are collections that have no duplicates

In [118]:
set1 = set()  # An empty set
set2 = {'a', 'b', 'c', 'c', 'b'}  # {'a', 'b', 'c'}
set2

{'a', 'b', 'c'}

In [119]:
print(type(set2))

<class 'set'>


In [120]:
set2.add('d')  # {'a', 'b', 'c', 'd'}
set2

{'a', 'b', 'c', 'd'}

In [121]:
set2.remove('a')  # {'b', 'c', 'd'}
set2

{'b', 'c', 'd'}

In [122]:
list9 = ['a', 'b', 'c', 'c', 'b']

In [123]:
set3 = set(list9)  # {'a', 'b', 'c'}
set3

{'a', 'b', 'c'}

In [124]:
iliad_chars = set(iliad.lower())
# The set of unique characters of the iliad string
iliad_chars

{'\n',
 ' ',
 ',',
 '.',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'l',
 'n',
 'o',
 'p',
 'r',
 's',
 't',
 'u'}

We can create a sorted list from a set

In [125]:
sorted(iliad_chars)

['\n',
 ' ',
 ',',
 '.',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'l',
 'n',
 'o',
 'p',
 'r',
 's',
 't',
 'u']

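By default, `sorted()` compares characters by their Unicode code points. Passing `key=locale.strxfrm` makes the sort follow the current locale, which relies on the underlying operating system.
This means that it can produce different results on different systems.
It did not work properly on earlier versions of macOS. (Update for macOS 11.5.1: Apparently it does.)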

In [126]:
import locale

loc = locale.getdefaultlocale()
loc

(None, 'UTF-8')

In [127]:
locale.setlocale(locale.LC_ALL, loc)
accented = 'aàäeéèêëiîïoôöœuûüαβγ'
print("Without locale:\t", sorted(accented))
print("With locale ", loc, '\t', sorted(accented, key=locale.strxfrm))

Without locale:	 ['a', 'e', 'i', 'o', 'u', 'à', 'ä', 'è', 'é', 'ê', 'ë', 'î', 'ï', 'ô', 'ö', 'û', 'ü', 'œ', 'α', 'β', 'γ']
With locale  (None, 'UTF-8') 	 ['a', 'à', 'ä', 'e', 'é', 'è', 'ê', 'ë', 'i', 'î', 'ï', 'o', 'ô', 'ö', 'u', 'û', 'ü', 'œ', 'α', 'β', 'γ']


With an English locale

In [128]:
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
print("With an English locale:\t", sorted(accented, key=locale.strxfrm))

With an English locale:	 ['a', 'à', 'ä', 'e', 'é', 'è', 'ê', 'ë', 'i', 'î', 'ï', 'o', 'ô', 'ö', 'u', 'û', 'ü', 'œ', 'α', 'β', 'γ']


With a Swedish locale

In [129]:
locale.setlocale(locale.LC_ALL, 'sv_SE.UTF-8')
accented_sv = 'aåäeéioöuαβγ'
print("With a Swedish locale:\t", sorted(accented_sv, key=locale.strxfrm))

With a Swedish locale:	 ['a', 'e', 'i', 'o', 'u', 'ä', 'å', 'é', 'ö', 'α', 'β', 'γ']


Operations on sets

In [130]:
set2.intersection(set3)  # {'c', 'b'}

{'b', 'c'}

In [131]:
set2.union(set3)  # {'d', 'b', 'a', 'c'}

{'a', 'b', 'c', 'd'}

In [132]:
set2.symmetric_difference(set3)  # {'a', 'd'}

{'a', 'd'}

In [133]:
set2.issubset(set3)  # False

False

In [134]:
iliad_chars.intersection(set(alphabet))
# characters of the iliad string that are letters:
# {'a', 's', 'g', 'p', 'u', 'h', 'c', 'l', 'i',
#  'd', 'o', 'e', 'b', 't', 'f', 'r', 'n'}

{'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'l',
 'n',
 'o',
 'p',
 'r',
 's',
 't',
 'u'}
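
The set methods above have operator equivalents. A short sketch that redefines `set2` and `set3` with the values they have at this point; the result names are illustrative:

```python
set2 = {'b', 'c', 'd'}
set3 = {'a', 'b', 'c'}

common = set2 & set3        # same as set2.intersection(set3)
either = set2 | set3        # same as set2.union(set3)
exclusive = set2 ^ set3     # same as set2.symmetric_difference(set3)
only_in_set2 = set2 - set3  # same as set2.difference(set3)
is_subset = set2 <= set3    # same as set2.issubset(set3)

print(sorted(common), sorted(either), sorted(exclusive),
      sorted(only_in_set2), is_subset)
```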

### Dictionaries

Dictionaries are collections of values indexed by keys:

In [135]:
wordcount = {}  # We create an empty dictionary
wordcount = dict()  # Another way to create an empty dictionary
wordcount['a'] = 21  # The key 'a' has value 21
wordcount['And'] = 10  # 'And' has value 10
wordcount['the'] = 18

In [136]:
wordcount

{'a': 21, 'And': 10, 'the': 18}

In [137]:
print(type(wordcount))

<class 'dict'>


In [138]:
wordcount['a']  # 21

21

In [139]:
wordcount['And']  # 10

10

In [140]:
'And' in wordcount  # True

True

In [141]:
'is' in wordcount  # False

False

In [142]:
wordcount['is']  # Key error

KeyError: 'is'

### Dictionary functions

In [143]:
wordcount.get('And')  # 10

10

In [144]:
wordcount.get('is', 0)  # 0

0

In [145]:
wordcount.get('is')  # None

In [146]:
wordcount.keys()  # dict_keys(['a', 'And', 'the'])

dict_keys(['a', 'And', 'the'])

In [147]:
wordcount.values()  # dict_values([21, 10, 18])

dict_values([21, 10, 18])

In [148]:
wordcount.items()  # dict_items([('a', 21), ('And', 10),
# ('the', 18)])

dict_items([('a', 21), ('And', 10), ('the', 18)])

Keys must be hashable, which in practice means immutable. We can use tuples, but not lists

In [149]:
my_dict = {}
my_dict[('And', 'the')] = 3  # OK, we use a tuple

In [150]:
my_dict[['And', 'the']] = 3  # Type error:
# unhashable type: 'list'

TypeError: unhashable type: 'list'

### Counting letters with a dictionary

In [151]:
letter_count = {}
for letter in iliad.lower():
    if letter in alphabet:
        if letter in letter_count:
            letter_count[letter] += 1
        else:
            letter_count[letter] = 1

print('Iliad')
letter_count

Iliad


{'s': 10,
 'i': 3,
 'n': 6,
 'g': 4,
 'o': 8,
 'd': 2,
 'e': 9,
 't': 6,
 'h': 6,
 'a': 6,
 'r': 2,
 'f': 2,
 'c': 3,
 'l': 6,
 'p': 2,
 'u': 4,
 'b': 1}

In [152]:
for letter in sorted(letter_count.keys()):
    print(letter, letter_count[letter])

a 6
b 1
c 3
d 2
e 9
f 2
g 4
h 6
i 3
l 6
n 6
o 8
p 2
r 2
s 10
t 6
u 4


Sorting the letters by frequency

In [153]:
for letter in sorted(letter_count.keys(),
                     key=letter_count.get, reverse=True):
    print(letter, letter_count[letter])

s 10
e 9
o 8
n 6
t 6
h 6
a 6
l 6
g 4
u 4
i 3
c 3
d 2
r 2
f 2
p 2
b 1
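
The standard library offers `collections.Counter`, a dictionary subclass designed for this kind of counting. The sketch below is an alternative to the explicit loop, not the method used in the rest of this notebook; it redefines `iliad` and `alphabet` to be self-contained:

```python
from collections import Counter

iliad = """Sing, O goddess, the anger of Achilles son of
Peleus, that brought countless ills upon the Achaeans."""
alphabet = 'abcdefghijklmnopqrstuvwxyz'

# Count only the characters that are lowercase Latin letters
letter_count = Counter(c for c in iliad.lower() if c in alphabet)

# most_common(n) returns the n most frequent items with their counts
for letter, count in letter_count.most_common(3):
    print(letter, count)
```

A `Counter` returns 0 for a missing key instead of raising a `KeyError`, which removes the need for the `if ... else` test.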


## Control structures

### Conditionals

In [154]:
digits = '0123456789'
punctuation = '.,;:?!'

In [155]:
char = '.'

In [156]:
if char in alphabet:
    print('Letter')
elif char in digits:
    print('Number')
elif char in punctuation:
    print('Punctuation')
else:
    print('Other')

Punctuation


### The `for...in` loop

In [157]:
total = 0  # named total to avoid hiding the built-in sum() function
for i in range(100):
    total += i
print(total)  # Sum of integers from 0 to 99: 4950
# Using the built-in sum() function,
# sum(range(100)) would produce the same result.

4950


Useful functions for `for`

In [158]:
list10 = list(range(5))  # [0, 1, 2, 3, 4]
list10

[0, 1, 2, 3, 4]

In [159]:
for inx, letter in enumerate(alphabet):
    print(inx, letter)

0 a
1 b
2 c
3 d
4 e
5 f
6 g
7 h
8 i
9 j
10 k
11 l
12 m
13 n
14 o
15 p
16 q
17 r
18 s
19 t
20 u
21 v
22 w
23 x
24 y
25 z


Assigning a new value to the iteration variable does not change the list we iterate over

In [160]:
for i in list10:
    if i == 0:
        i = 10
list10    # [0, 1, 2, 3, 4]

[0, 1, 2, 3, 4]
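
To actually modify the list, we can assign through the index, or build a new list. A minimal sketch; `list11` is an illustrative name not used elsewhere in the notebook:

```python
list10 = list(range(5))

# Assigning through the index modifies the list itself
for inx, value in enumerate(list10):
    if value == 0:
        list10[inx] = 10

# A comprehension builds a new list and leaves list10 untouched
list11 = [value * 2 for value in list10]

print(list10)
print(list11)
```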

### The `while` loop

A `while` loop

In [161]:
total, i = 0, 0
while i < 100:
    total += i
    i += 1
total

4950

Another loop

In [162]:
total, i = 0, 0
while True:
    total += i
    i += 1
    if i >= 100:
        break
total

4950

### Exceptions

A bare `except` clause catches all the exceptions in one block

In [163]:
try:
    int(alphabet)
    int('12.0')
except:
    pass
print('Cleared the exception!')

Cleared the exception!


Specific exceptions

In [164]:
try:
    int(alphabet)
    int('12.0')
except ValueError:
    print('Caught a value error!')
except TypeError:
    print('Caught a type error!')

Caught a value error!
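
A `try` statement can also bind the exception to a name and carry `else` and `finally` clauses. The function `to_int()` below is a hypothetical helper written for this illustration:

```python
def to_int(text):
    """Convert text to an integer, or return None when this fails."""
    try:
        value = int(text)
    except ValueError as error:
        # The exception object carries the error message
        print('Caught a value error:', error)
        return None
    else:
        # Runs only when the try block raised no exception
        return value
    finally:
        # Runs in every case, before the function returns
        print('Tried to convert', repr(text))


print(to_int('12'))
print(to_int('12.0'))
```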


## Functions

We define a function with the `def` keyword:

In [165]:
def count_letters(text, lc=True):  # lc stands for lowercase: if True, the text is set in lowercase before counting
    letter_count = {}
    if lc:
        text = text.lower()
    for letter in text:
        if letter.lower() in alphabet:
            if letter in letter_count:
                letter_count[letter] += 1
            else:
                letter_count[letter] = 1
    return letter_count

We call the function with its default arguments

In [166]:
odyssey = """Tell me, O Muse, of that many-sided hero who
traveled far and wide after he had sacked the famous town
of Troy."""
print('Start')
od = count_letters(odyssey)
for letter in sorted(od.keys()):
    print(letter, od[letter])

Start
a 9
c 1
d 7
e 12
f 5
h 6
i 2
k 1
l 3
m 4
n 3
o 8
r 5
s 4
t 8
u 2
v 1
w 3
y 2


Or with lower case set to `False`

In [167]:
od = count_letters(odyssey, False)
for letter in sorted(od.keys()):
    print(letter, od[letter])

M 1
O 1
T 2
a 9
c 1
d 7
e 12
f 5
h 6
i 2
k 1
l 3
m 3
n 3
o 7
r 5
s 4
t 6
u 2
v 1
w 3
y 2


In [168]:
print(type(count_letters))

<class 'function'>


## Comprehensions and Generators

Comprehensions and generators are alternatives to loops

### Comprehensions

Generating a set of edits for a string with a comprehension:

In [169]:
word = 'acress'
splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
splits

[('', 'acress'),
 ('a', 'cress'),
 ('ac', 'ress'),
 ('acr', 'ess'),
 ('acre', 'ss'),
 ('acres', 's'),
 ('acress', '')]

In [170]:
deletes = [a + b[1:] for a, b in splits if b]
deletes

['cress', 'aress', 'acess', 'acrss', 'acres', 'acres']

And the same with loops. Comprehensions are more compact

In [171]:
splits = []
for i in range(len(word) + 1):
    splits.append((word[:i], word[i:]))
splits

[('', 'acress'),
 ('a', 'cress'),
 ('ac', 'ress'),
 ('acr', 'ess'),
 ('acre', 'ss'),
 ('acres', 's'),
 ('acress', '')]

In [172]:
deletes = []
for a, b in splits:
    if b:
        deletes.append(a + b[1:])
deletes

['cress', 'aress', 'acess', 'acrss', 'acres', 'acres']

### Generators

Generators are similar to comprehensions, but they create the elements on demand

In [173]:
splits_generator = ((word[:i], word[i:])
                    for i in range(len(word) + 1))

for i in splits_generator: print(i)

('', 'acress')
('a', 'cress')
('ac', 'ress')
('acr', 'ess')
('acre', 'ss')
('acres', 's')
('acress', '')


We can traverse a generator only once

In [174]:
for i in splits_generator: print(i) # Nothing
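
A generator can also be written as a function with the `yield` keyword. The sketch below is an assumed equivalent of the generator expression above; `generate_splits()` is an illustrative name:

```python
def generate_splits(word):
    """Yield the (prefix, suffix) pairs of word one at a time."""
    for i in range(len(word) + 1):
        yield word[:i], word[i:]


splits_generator = generate_splits('acress')
first = next(splits_generator)      # the first pair
second = next(splits_generator)     # the second pair
remaining = list(splits_generator)  # consumes the rest of the generator
exhausted = list(splits_generator)  # nothing is left

print(first, second, len(remaining), exhausted)
```

As with the generator expression, the elements are produced on demand and only once.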

### Iterators and `zip()`

In [175]:
latin_alphabet = 'abcdefghijklmnopqrstuvwxyz'
len(latin_alphabet)  # 26

26

In [176]:
greek_alphabet = 'αβγδεζηθικλμνξοπρστυφχψω'
len(greek_alphabet)  # 24

24

In [177]:
cyrillic_alphabet = 'абвгдеёжзийклмнопрстуфхцчшщъыьэюя'
len(cyrillic_alphabet)  # 33

33

In [178]:
la_gr = zip(latin_alphabet[:3], greek_alphabet[:3])
la_gr

<zip at 0x7fbcf0ec2080>

In [179]:
list(la_gr)

[('a', 'α'), ('b', 'β'), ('c', 'γ')]

In [180]:
list(la_gr) # You can traverse it only once

[]

In [181]:
la_gr_cy = zip(latin_alphabet[:3], greek_alphabet[:3],
               cyrillic_alphabet[:3])
la_gr_cy

<zip at 0x7fbcd09807c0>

Iterators have a `__next__()` method, which the built-in `next()` function calls:

We recreate the iterator

In [182]:
la_gr = zip(latin_alphabet[:3], greek_alphabet[:3]) # We recreate the iterator

In [183]:
la_gr.__next__()  # ('a', 'α')

('a', 'α')

In [184]:
la_gr.__next__()  # ('b', 'β')

('b', 'β')

In [185]:
la_gr.__next__()  # ('c', 'γ')

('c', 'γ')

Until we reach the end

In [186]:
la_gr.__next__()

StopIteration: 

We can traverse an iterator only once. To traverse it two or more times, we convert it to a list

In [187]:
la_gr_cy_list = list(la_gr_cy)

First time

In [188]:
la_gr_cy_list  # [('a', 'α', 'а'), ('b', 'β', 'б'), ('c', 'γ', 'в')]

[('a', 'α', 'а'), ('b', 'β', 'б'), ('c', 'γ', 'в')]

Second time, etc.

In [189]:
la_gr_cy_list  # [('a', 'α', 'а'), ('b', 'β', 'б'), ('c', 'γ', 'в')]

[('a', 'α', 'а'), ('b', 'β', 'б'), ('c', 'γ', 'в')]

In [190]:
la_gr_cy_list = list(la_gr_cy)  # []
la_gr_cy_list

[]

Zipping

In [191]:
la_gr_cy = zip(latin_alphabet[:3], greek_alphabet[:3],
               cyrillic_alphabet[:3])

And unzipping

In [192]:
list(zip(*la_gr_cy))  # [('a', 'b', 'c'), ('α', 'β', 'γ'), ('а', 'б', 'в')]

[('a', 'b', 'c'), ('α', 'β', 'γ'), ('а', 'б', 'в')]
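
A common use of `zip()` is to build a dictionary from two sequences. The sketch below pairs the letters by position only, which is an assumption made for illustration and not a transliteration; `pairs` and `la2gr` are illustrative names:

```python
latin_alphabet = 'abcdefghijklmnopqrstuvwxyz'
greek_alphabet = 'αβγδεζηθικλμνξοπρστυφχψω'

# zip() stops at the end of the shortest sequence: 24 pairs here
pairs = list(zip(latin_alphabet, greek_alphabet))

# The dictionary maps each Latin letter to the Greek letter
# at the same position
la2gr = dict(pairs)

print(len(pairs))
print(la2gr['a'], la2gr['b'])
```

The last two Latin letters have no counterpart and are silently dropped.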

## Modules

The `math` module

In [193]:
import math

math.sqrt(2)  # 1.4142135623730951

1.4142135623730951

In [194]:
math.sin(math.pi / 2)  # 1.0

1.0

In [195]:
math.log(8, 2)  # 3.0

3.0

In [196]:
print(type(math))

<class 'module'>


The `statistics` module

In [197]:
import statistics as stats

stats.mean([1, 2, 3, 4, 5])  # 3

3

In [198]:
stats.stdev([1, 2, 3, 4, 5])  # 1.5811388300841898

1.5811388300841898

Running the program or importing it

In [199]:
if __name__ == '__main__':
    print("Running the program")
    # Other statements
else:
    print("Importing the program")
    # Other statements

Running the program


## Basic File Input/Output

Before you run the code below, you will need a file. To follow the example, download Homer's _Iliad_ and _Odyssey_ from the department of classics at the Massachusetts Institute of Technology (MIT): http://classics.mit.edu and store them on your computer. Adjust the `CORPUS_PATH` variable.

In [200]:
CORPUS_PATH = '../../corpus/'

In [201]:
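try:
    f_iliad = open(CORPUS_PATH + 'iliad.mb.txt', 'r', encoding='utf-8')  # We open a file and we get a file object
    iliad_txt = f_iliad.read()  # We read the whole file
    f_iliad.close()  # We close the file
except OSError as error:
    # A missing or unreadable file is reported instead of silently ignored
    print('Could not read the file:', error)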

In [202]:
iliad_stats = count_letters(iliad_txt)  # We count the letters
iliad_stats

{'t': 54177,
 'h': 50194,
 'e': 77466,
 'i': 38151,
 'l': 25311,
 'a': 51020,
 'd': 28333,
 'b': 8941,
 'y': 11908,
 'o': 51270,
 'm': 16648,
 'r': 36457,
 'n': 42194,
 's': 41243,
 'u': 18409,
 'k': 4413,
 'g': 12595,
 'f': 16114,
 'c': 11558,
 'p': 9104,
 'v': 6060,
 'w': 15665,
 'j': 1624,
 'q': 283,
 'z': 284,
 'x': 597}

In [203]:
with open('iliad_stats.txt', 'w') as f:
    f.write(str(iliad_stats))
    # we automatically close the file
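
A file object is also an iterator over its lines. The sketch below writes a small table and reads it back line by line; the dictionary `stats`, the file name `stats.txt`, and the tab-separated layout are assumptions made for this example, and the temporary directory keeps it from touching existing files:

```python
import os
import tempfile

stats = {'a': 6, 'b': 1}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'stats.txt')

    # Write one tab-separated (letter, count) pair per line
    with open(path, 'w', encoding='utf-8') as f:
        for letter, count in stats.items():
            f.write(f'{letter}\t{count}\n')

    # Iterate over the lines and split each of them into fields
    with open(path, 'r', encoding='utf-8') as f:
        lines = [line.rstrip('\n').split('\t') for line in f]

print(lines)
```

The values read back are strings; `int()` converts the counts when numbers are needed.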

## Collecting a Corpus

We create a dictionary with URLs

In [204]:
classics_url = {'iliad':'http://classics.mit.edu/Homer/iliad.mb.txt',
 'odyssey':'http://classics.mit.edu/Homer/odyssey.mb.txt',
 'eclogue':'http://classics.mit.edu/Virgil/eclogue.mb.txt',
 'georgics':'http://classics.mit.edu/Virgil/georgics.mb.txt',
 'aeneid':'http://classics.mit.edu/Virgil/aeneid.mb.txt'}

We read the texts from the URLs

In [205]:
import requests

classics = {}
for key in classics_url:
    classics[key] = requests.get(classics_url[key]).text

We remove the license information to keep only the text

In [206]:
text_bounds = {'iliad':(136, -486), 'odyssey':(138, -486), 
'eclogue':(139, -486), 'georgics': (140, -486), 'aeneid':(138, -486)}

In [207]:
for key in classics:
    classics[key] = classics[key][text_bounds[key][0]:text_bounds[key][1]]

In [208]:
classics['iliad'][:50]

'The Iliad\nBy Homer\n\n\nTranslated by Samuel Butler\n\n'

We additionally write the Iliad and the Odyssey in two text files

In [209]:
with open('iliad.txt', 'w') as f_il, open('odyssey.txt', 'w') as f_od:
    f_il.write(classics['iliad'])
    f_od.write(classics['odyssey'])

We store the corpus in a JSON file

In [210]:
import json

with open('classics.json', 'w') as f:
    json.dump(classics, f)

We read it again

In [211]:
with open('classics.json', 'r') as f:
    classics = json.load(f)

## Decorators and memo-functions

In [212]:
__author__ = "Pierre Nugues"


def memo_function(f):
    cache = {}

    def memo(x):
        if x in cache:
            return cache[x]
        else:
            cache[x] = f(x)
            return cache[x]

    return memo


@memo_function
def fibonacci(n):
    """
    Fibonacci with memo function
    :param n:
    :return:
    """
    if n == 1:
        return 1
    elif n == 2:
        return 1
    else:
        return fibonacci(n - 1) + fibonacci(n - 2)


f_numbers = {}


def fibonacci2(n):
    """
    Fibonacci with memoization. Ad hoc implementation
    :param n:
    :return:
    """
    if n == 1:
        return 1
    elif n == 2:
        return 1
    elif n in f_numbers:
        return f_numbers[n]
    else:
        f_numbers[n] = fibonacci2(n - 1) + fibonacci2(n - 2)
        return f_numbers[n]


print(fibonacci(400))
print(fibonacci2(900))

176023680645013966468226945392411250770384383304492191886725992896575345044216019675
54877108839480000051413673948383714443800519309123592724494953427039811201064341234954387521525390615504949092187441218246679104731442473022013980160407007017175697317900483275246652938800
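
The standard library provides a ready-made memoization decorator, `functools.lru_cache`. The sketch below is an alternative to the hand-written `memo_function`, not a replacement for the author's code; `fibonacci3` is an illustrative name:

```python
import functools


@functools.lru_cache(maxsize=None)  # None lets the cache grow without bound
def fibonacci3(n):
    """Fibonacci with the standard library memoization decorator."""
    if n in (1, 2):
        return 1
    return fibonacci3(n - 1) + fibonacci3(n - 2)


print(fibonacci3(90))
print(fibonacci3.cache_info())  # hits, misses, and current cache size
```

As with the two versions above, the recursion depth still grows with `n`, so very large arguments can exceed Python's recursion limit.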


## Classes and Objects
Defining a class

In [213]:
class Text:
    """Text class to hold and process text"""

    alphabet = 'abcdefghijklmnopqrstuvwxyz'

    def __init__(self, text=None):
        """The constructor called when an object
        is created"""

        self.content = text
        self.length = len(text) if text is not None else 0  # len(None) would raise a TypeError
        self.letter_counts = {}

    def count_letters(self, lc=True):
        """Function to count the letters of a text"""

        letter_counts = {}
        if lc:
            text = self.content.lower()
        else:
            text = self.content
        for letter in text:
            if letter.lower() in self.alphabet:
                if letter in letter_counts:
                    letter_counts[letter] += 1
                else:
                    letter_counts[letter] = 1
        self.letter_counts = letter_counts
        return letter_counts


In [214]:
print(type(Text))

<class 'type'>


Creating objects and calling methods

In [215]:
txt = Text("""Tell me, O Muse, of that many-sided hero who
traveled far and wide after he had sacked the famous town
of Troy.""")
print(type(txt))

<class '__main__.Text'>


In [216]:
print(txt.length)

print(txt.count_letters())
print(txt.count_letters(False))

111
{'t': 8, 'e': 12, 'l': 3, 'm': 4, 'o': 8, 'u': 2, 's': 4, 'f': 5, 'h': 6, 'a': 9, 'n': 3, 'y': 2, 'i': 2, 'd': 7, 'r': 5, 'w': 3, 'v': 1, 'c': 1, 'k': 1}
{'T': 2, 'e': 12, 'l': 3, 'm': 3, 'O': 1, 'M': 1, 'u': 2, 's': 4, 'o': 7, 'f': 5, 't': 6, 'h': 6, 'a': 9, 'n': 3, 'y': 2, 'i': 2, 'd': 7, 'r': 5, 'w': 3, 'v': 1, 'c': 1, 'k': 1}


Assigning the object variables

In [217]:
txt.my_var = 'a'
txt.content = classics['iliad']
print(txt.count_letters())
print(txt.my_var)

{'t': 54177, 'h': 50194, 'e': 77466, 'i': 38151, 'l': 25311, 'a': 51020, 'd': 28333, 'b': 8941, 'y': 11908, 'o': 51270, 'm': 16648, 'r': 36457, 'n': 42194, 's': 41243, 'u': 18409, 'k': 4413, 'g': 12595, 'f': 16114, 'c': 11558, 'p': 9104, 'v': 6060, 'w': 15665, 'j': 1624, 'q': 283, 'z': 284, 'x': 597}
a


### Subclassing

In [218]:
class Word(Text):
    def __init__(self, word=None):
        super().__init__(word)
        self.part_of_speech = None

    def annotate(self, part_of_speech):
        self.part_of_speech = part_of_speech


In [219]:
type(Word)

type

In [220]:
word = Word('Muse')

In [221]:
type(word)

__main__.Word

In [222]:
word.length        

4

In [223]:
word.count_letters(lc=False)

{'M': 1, 'u': 1, 's': 1, 'e': 1}

In [224]:
word.annotate('Noun')
word.part_of_speech

'Noun'

## Functional programming

`map()`

In [225]:
text_lengths = map(len, [iliad, odyssey])
list(text_lengths)  # [100, 111]

[100, 111]

In [226]:
def file_length(file):
    with open(file) as f:  # the file is closed when the block ends
        return len(f.read())

file_length('iliad.txt')

807676

In [227]:
files = ['iliad.txt', 'odyssey.txt']
files = [file for file in files]

text_lengths = map(lambda x: len(open(x).read()), files)
list(text_lengths)  # [807676, 610676]

[807676, 610676]

In [228]:
text_lengths = (
    map(lambda x: (open(x).read(), len(open(x).read())),
        files))
text_lengths = list(text_lengths)
[text_lengths[0][1], text_lengths[1][1]]  # [807676, 610676]

[807676, 610676]

In [229]:
text_lengths = (
    map(lambda x: (x, len(x)),
        map(lambda x: open(x).read(), files)))
text_lengths = list(text_lengths)
[text_lengths[0][1], text_lengths[1][1]]  # [807676, 610676]

[807676, 610676]

`reduce()`

In [230]:
import functools

# The accumulator starts at 0, so the sum works for any number of files
char_count = functools.reduce(
    lambda total, pair: total + pair[1],
    map(lambda x: (x, len(x)),
        map(lambda x: open(x).read(), files)),
    0)

char_count

1418352

In [231]:
iliad = """Sing, O goddess, the anger of Achilles son of
Peleus, that brought countless ills upon the Achaeans."""
iliad

'Sing, O goddess, the anger of Achilles son of\nPeleus, that brought countless ills upon the Achaeans.'

In [232]:
''.join(filter(lambda x: x in 'aeiou', iliad))

'ioeeaeoieooeeuaououeiuoeaea'

In [233]:
''.join(filter(lambda x: x in 'aeiou',
               open('iliad.txt').read()))[:100]

'eiaoeaaeaueueioeeaeoieooeeuaououeiuoeaeaaaaeouiieuiooaeaaaeoiiieaeooauueooeeeoueooeuieoeaoieooeuioea'

In [234]:
map(lambda y:
    ''.join(filter(lambda x: x in 'aeiou',
                   open(y).read())),
    files)

<map at 0x7fbcd0cf0c70>

In [235]:
list(map(len,
               map(lambda y:
                   ''.join(filter(lambda x: x in 'aeiou',
                                  open(y).read())),
                   files)))

# print(list(map(lambda x: x if x in 'aeiuo' else '', map(lambda x: open(x).read(), files))))

[230637, 176073]

## Numerical Computations and NumPy

In [236]:
import numpy as np

Let us read Homer's _Iliad_ and _Odyssey_ and Virgil's _Eclogue_, _Georgics_, and _Aeneid_.

In [237]:
titles = ['iliad', 'odyssey', 'eclogue', 'georgics', 'aeneid']
titles

['iliad', 'odyssey', 'eclogue', 'georgics', 'aeneid']

In [238]:
texts = []
for title in titles:
    texts += [classics[title]]

In [239]:
cnt_dicts = []
for text in texts:
    cnt_dicts += [Text(text).count_letters()]

In [240]:
cnt_lists = []
for cnt_dict in cnt_dicts:
    cnt_lists += [list(map(cnt_dict.get, alphabet))]

In [241]:
cnt_lists[0][:3]

[51020, 8941, 11558]

### Creating arrays
#### Vectors
Vectors of letter counts

In [242]:
iliad_cnt = np.array(cnt_lists[0])
odyssey_cnt = np.array(cnt_lists[1])
eclogue_cnt = np.array(cnt_lists[2])
georgics_cnt = np.array(cnt_lists[3])
aeneid_cnt = np.array(cnt_lists[4])

In [243]:
iliad_cnt

array([51020,  8941, 11558, 28333, 77466, 16114, 12595, 50194, 38151,
        1624,  4413, 25311, 16648, 42194, 51270,  9104,   283, 36457,
       41243, 54177, 18409,  6060, 15665,   597, 11908,   284])

In [244]:
odyssey_cnt

array([37630,  6598,  8580, 20738, 59783, 10449,  9803, 34787, 28793,
         424,  3631, 18951, 13060, 31889, 38778,  6679,   256, 25668,
       31352, 40483, 15406,  4803, 12989,   350, 10974,   124])

The vector dimension

In [245]:
odyssey_cnt.shape

(26,)

In [246]:
np.sum(odyssey_cnt)

472978

#### Matrices

We create a matrix from the list of lists

In [247]:
hv_cnts = np.array(cnt_lists)
hv_cnts

array([[51020,  8941, 11558, 28333, 77466, 16114, 12595, 50194, 38151,
         1624,  4413, 25311, 16648, 42194, 51270,  9104,   283, 36457,
        41243, 54177, 18409,  6060, 15665,   597, 11908,   284],
       [37630,  6598,  8580, 20738, 59783, 10449,  9803, 34787, 28793,
          424,  3631, 18951, 13060, 31889, 38778,  6679,   256, 25668,
        31352, 40483, 15406,  4803, 12989,   350, 10974,   124],
       [ 2716,   578,   723,  1440,  4366,   846,   808,  2509,  2252,
           22,   268,  1809,  1043,  2248,  2948,   569,    12,  2236,
         2618,  2940,  1031,   361,  1023,    38,   906,    22],
       [ 6841,  1619,  2017,  4027, 12112,  2424,  2150,  6988,  6038,
           59,   782,  4309,  2027,  6552,  6958,  1669,    53,  6704,
         7143,  8713,  2583,   903,  2480,    85,  1458,    64],
       [36678,  6869, 10023, 23866, 55372, 11618,  9607, 33057, 30579,
          908,  2702, 18768, 10201, 32258, 32595,  8343,   530, 32077,
        36430, 39481, 13714,  

In [248]:
np.size(hv_cnts)

130

The shape of the matrix, i.e. its numbers of rows and columns:

In [249]:
hv_cnts.shape

(5, 26)
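
As a small illustration on a made-up matrix `m` (not part of the notebook data), the sketch below shows how the number of axes, the shape, and the size relate to each other:

```python
import numpy as np

# A made-up 2 x 3 matrix
m = np.array([[1, 2, 3],
              [4, 5, 6]])

print(m.ndim)      # number of axes
print(m.shape)     # (number of rows, number of columns)
print(m.size)      # total number of elements: rows * columns
print(np.size(m))  # same value as m.size
```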

### Indices and slices

In [250]:
iliad_cnt[2]

11558

In [251]:
hv_cnts[1, 2]

8580

In [252]:
hv_cnts[:, 2]

array([11558,  8580,   723,  2017, 10023])

In [253]:
hv_cnts[1, :]

array([37630,  6598,  8580, 20738, 59783, 10449,  9803, 34787, 28793,
         424,  3631, 18951, 13060, 31889, 38778,  6679,   256, 25668,
       31352, 40483, 15406,  4803, 12989,   350, 10974,   124])

In [254]:
hv_cnts[1, :2]

array([37630,  6598])

In [255]:
hv_cnts[3, 2:4]

array([2017, 4027])

In [256]:
hv_cnts[3:, 2:4]

array([[ 2017,  4027],
       [10023, 23866]])
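
One point worth keeping in mind is that a basic slice of a NumPy array is a view on the original data, not a copy. The sketch below uses a made-up matrix `m` to show the difference; the variable names are ours:

```python
import numpy as np

m = np.arange(12).reshape(3, 4)  # 3 x 4 matrix holding 0..11

col = m[:, 1]                    # second column: a view on m
block = m[1:, 2:4].copy()        # explicit copy of a 2 x 2 block

col[0] = 100                     # also changes m[0, 1]
block[0, 0] = -1                 # leaves m untouched

print(m[0, 1])
print(m[1, 2])
```

When we need an independent array, we call `copy()` on the slice.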

### Simple Operations

In [257]:
iliad_cnt + odyssey_cnt

array([ 88650,  15539,  20138,  49071, 137249,  26563,  22398,  84981,
        66944,   2048,   8044,  44262,  29708,  74083,  90048,  15783,
          539,  62125,  72595,  94660,  33815,  10863,  28654,    947,
        22882,    408])

In [258]:
iliad_cnt - odyssey_cnt

array([13390,  2343,  2978,  7595, 17683,  5665,  2792, 15407,  9358,
        1200,   782,  6360,  3588, 10305, 12492,  2425,    27, 10789,
        9891, 13694,  3003,  1257,  2676,   247,   934,   160])

In [259]:
hv_cnts - 2 * hv_cnts

array([[-51020,  -8941, -11558, -28333, -77466, -16114, -12595, -50194,
        -38151,  -1624,  -4413, -25311, -16648, -42194, -51270,  -9104,
          -283, -36457, -41243, -54177, -18409,  -6060, -15665,   -597,
        -11908,   -284],
       [-37630,  -6598,  -8580, -20738, -59783, -10449,  -9803, -34787,
        -28793,   -424,  -3631, -18951, -13060, -31889, -38778,  -6679,
          -256, -25668, -31352, -40483, -15406,  -4803, -12989,   -350,
        -10974,   -124],
       [ -2716,   -578,   -723,  -1440,  -4366,   -846,   -808,  -2509,
         -2252,    -22,   -268,  -1809,  -1043,  -2248,  -2948,   -569,
           -12,  -2236,  -2618,  -2940,  -1031,   -361,  -1023,    -38,
          -906,    -22],
       [ -6841,  -1619,  -2017,  -4027, -12112,  -2424,  -2150,  -6988,
         -6038,    -59,   -782,  -4309,  -2027,  -6552,  -6958,  -1669,
           -53,  -6704,  -7143,  -8713,  -2583,   -903,  -2480,    -85,
         -1458,    -64],
       [-36678,  -6869, -10023, -238
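
These operations are applied element by element, and a scalar is combined with every element of the array. A minimal sketch on two made-up vectors, `u` and `v`:

```python
import numpy as np

u = np.array([1, 2, 3])
v = np.array([10, 20, 30])

total = u + v        # elementwise sum
scaled = 2 * u       # the scalar multiplies every element
negated = u - 2 * u  # same values as -u

print(total, scaled, negated)
```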

### NumPy Functions

In [260]:
np.set_printoptions(precision=3)

In [261]:
np.sqrt(iliad_cnt)

array([225.876,  94.557, 107.508, 168.324, 278.327, 126.941, 112.227,
       224.04 , 195.323,  40.299,  66.43 , 159.094, 129.027, 205.412,
       226.429,  95.415,  16.823, 190.937, 203.084, 232.76 , 135.68 ,
        77.846, 125.16 ,  24.434, 109.124,  16.852])

In [262]:
np.cos(hv_cnts)

array([[ 0.86 ,  1.   , -0.997, -0.52 ,  0.821, -0.717, -0.938, -0.715,
         0.877, -0.979, -0.592, -0.688, -0.765, -0.745,  0.702,  0.944,
         0.967, -0.378,  0.985, -0.973,  0.743, -0.991,  0.524,  0.995,
         0.205,  0.309],
       [ 1.   ,  0.793, -0.952, -0.94 ,  0.063,  0.998,  0.333, -0.99 ,
        -0.954, -0.993,  0.777,  0.611, -0.921, -0.261, -0.246,  1.   ,
        -0.04 ,  0.373,  0.458,  0.906,  0.932, -0.88 , -0.085, -0.284,
        -0.914, -0.093],
       [-0.093,  0.999,  0.907,  0.408,  0.687, -0.613, -0.819, -0.424,
        -0.867, -1.   , -0.57 ,  0.849,  1.   ,  0.189,  0.375, -0.932,
         0.844,  0.687, -0.495,  0.862,  0.849, -0.96 ,  0.4  ,  0.955,
         0.342, -1.   ],
       [ 0.181, -0.472,  0.995,  0.867, -0.399,  0.258,  0.408,  0.455,
         0.99 , -0.771, -0.967,  0.301, -0.782,  0.207, -0.809, -0.686,
        -0.918,  0.987,  0.556, -0.206,  0.819, -0.206, -0.283, -0.984,
         0.955,  0.392],
       [-0.996,  0.092,  0.249, -0.7

In [263]:
math.sqrt(iliad_cnt)

TypeError: only size-1 arrays can be converted to Python scalars

In [264]:
np_sqrt = np.vectorize(math.sqrt)
np_sqrt(iliad_cnt)

array([225.876,  94.557, 107.508, 168.324, 278.327, 126.941, 112.227,
       224.04 , 195.323,  40.299,  66.43 , 159.094, 129.027, 205.412,
       226.429,  95.415,  16.823, 190.937, 203.084, 232.76 , 135.68 ,
        77.846, 125.16 ,  24.434, 109.124,  16.852])
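
The sketch below, on a made-up array `counts`, checks that the vectorized `math.sqrt()` and `np.sqrt()` return the same values. According to the NumPy documentation, `np.vectorize()` is provided for convenience and is essentially a loop, so a native NumPy function is preferable when one exists.

```python
import math
import numpy as np

counts = np.array([1, 4, 9, 16])

# math.sqrt() only accepts scalars; np.vectorize() wraps it in a loop
np_sqrt = np.vectorize(math.sqrt)

same = np.allclose(np_sqrt(counts), np.sqrt(counts))
print(same)
```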

In [265]:
np.sum(hv_cnts)

1706121

In [266]:
np.sum(hv_cnts, axis=0)

array([134885,  24605,  32901,  78404, 209099,  41451,  34963, 127535,
       105813,   3037,  11796,  69148,  42979, 115141, 132549,  26364,
         1134, 103142, 118786, 145794,  51143,  16502,  43254,   1630,
        33302,    764])

In [267]:
np.sum(hv_cnts, axis=1)

array([630019, 472978,  36332,  96758, 470034])
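
The `axis` argument names the axis that the sum collapses: `axis=0` sums down the rows and yields one value per column, while `axis=1` sums across the columns and yields one value per row. A minimal sketch on a made-up matrix `m`:

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

col_sums = np.sum(m, axis=0)                    # one sum per column
row_sums = np.sum(m, axis=1)                    # one sum per row
row_sums_2d = np.sum(m, axis=1, keepdims=True)  # same sums, kept as a column

print(col_sums)
print(row_sums)
print(row_sums_2d.shape)
```

With `keepdims=True`, the result keeps two axes, here a shape of (2, 1).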

### Transposing and Reshaping Arrays

In [268]:
hv_cnts.T

array([[51020, 37630,  2716,  6841, 36678],
       [ 8941,  6598,   578,  1619,  6869],
       [11558,  8580,   723,  2017, 10023],
       [28333, 20738,  1440,  4027, 23866],
       [77466, 59783,  4366, 12112, 55372],
       [16114, 10449,   846,  2424, 11618],
       [12595,  9803,   808,  2150,  9607],
       [50194, 34787,  2509,  6988, 33057],
       [38151, 28793,  2252,  6038, 30579],
       [ 1624,   424,    22,    59,   908],
       [ 4413,  3631,   268,   782,  2702],
       [25311, 18951,  1809,  4309, 18768],
       [16648, 13060,  1043,  2027, 10201],
       [42194, 31889,  2248,  6552, 32258],
       [51270, 38778,  2948,  6958, 32595],
       [ 9104,  6679,   569,  1669,  8343],
       [  283,   256,    12,    53,   530],
       [36457, 25668,  2236,  6704, 32077],
       [41243, 31352,  2618,  7143, 36430],
       [54177, 40483,  2940,  8713, 39481],
       [18409, 15406,  1031,  2583, 13714],
       [ 6060,  4803,   361,   903,  4375],
       [15665, 12989,  1023,  24

In [269]:
iliad_cnt.T

array([51020,  8941, 11558, 28333, 77466, 16114, 12595, 50194, 38151,
        1624,  4413, 25311, 16648, 42194, 51270,  9104,   283, 36457,
       41243, 54177, 18409,  6060, 15665,   597, 11908,   284])

In [270]:
np.array([iliad_cnt])

array([[51020,  8941, 11558, 28333, 77466, 16114, 12595, 50194, 38151,
         1624,  4413, 25311, 16648, 42194, 51270,  9104,   283, 36457,
        41243, 54177, 18409,  6060, 15665,   597, 11908,   284]])

In [271]:
np.array([iliad_cnt]).shape

(1, 26)

In [272]:
np.array([iliad_cnt]).T

array([[51020],
       [ 8941],
       [11558],
       [28333],
       [77466],
       [16114],
       [12595],
       [50194],
       [38151],
       [ 1624],
       [ 4413],
       [25311],
       [16648],
       [42194],
       [51270],
       [ 9104],
       [  283],
       [36457],
       [41243],
       [54177],
       [18409],
       [ 6060],
       [15665],
       [  597],
       [11908],
       [  284]])

In [273]:
iliad_cnt.reshape(1, -1)

array([[51020,  8941, 11558, 28333, 77466, 16114, 12595, 50194, 38151,
         1624,  4413, 25311, 16648, 42194, 51270,  9104,   283, 36457,
        41243, 54177, 18409,  6060, 15665,   597, 11908,   284]])

In [274]:
iliad_cnt.reshape(-1, 1)

array([[51020],
       [ 8941],
       [11558],
       [28333],
       [77466],
       [16114],
       [12595],
       [50194],
       [38151],
       [ 1624],
       [ 4413],
       [25311],
       [16648],
       [42194],
       [51270],
       [ 9104],
       [  283],
       [36457],
       [41243],
       [54177],
       [18409],
       [ 6060],
       [15665],
       [  597],
       [11908],
       [  284]])
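
To summarize the cells above on a made-up vector `v`: transposing a one-dimensional array has no effect, and we need two axes to obtain a row or a column vector. The sketch also shows `np.newaxis`, an alternative to `reshape()`:

```python
import numpy as np

v = np.array([1, 2, 3])

same_shape = v.T.shape      # a 1-D array has a single axis: nothing to swap
row = v.reshape(1, -1)      # row vector, shape (1, 3)
col = v.reshape(-1, 1)      # column vector, shape (3, 1)
col_bis = v[:, np.newaxis]  # the same column vector with np.newaxis

print(same_shape, row.shape, col.shape)
```

In `reshape()`, the value -1 asks NumPy to infer this dimension from the number of elements.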

### Elementwise and Hadamard Products

Relative frequencies of the letter counts

In [275]:
iliad_dist = (1/np.sum(iliad_cnt)) * iliad_cnt
odyssey_dist = (1/np.sum(odyssey_cnt)) * odyssey_cnt

In [276]:
iliad_cnt / np.sum(iliad_cnt)

array([0.081, 0.014, 0.018, 0.045, 0.123, 0.026, 0.02 , 0.08 , 0.061,
       0.003, 0.007, 0.04 , 0.026, 0.067, 0.081, 0.014, 0.   , 0.058,
       0.065, 0.086, 0.029, 0.01 , 0.025, 0.001, 0.019, 0.   ])

In [277]:
odyssey_cnt / np.sum(odyssey_cnt)

array([0.08 , 0.014, 0.018, 0.044, 0.126, 0.022, 0.021, 0.074, 0.061,
       0.001, 0.008, 0.04 , 0.028, 0.067, 0.082, 0.014, 0.001, 0.054,
       0.066, 0.086, 0.033, 0.01 , 0.027, 0.001, 0.023, 0.   ])

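We can apply an elementwise multiplication or division to a whole matrix. Here, we divide each row of `hv_cnts` by the total of this row, which we store in a column vector: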

In [278]:
np.array([np.sum(hv_cnts, axis=1)]).T

array([[630019],
       [472978],
       [ 36332],
       [ 96758],
       [470034]])

In [279]:
hv_dist = hv_cnts / np.array([np.sum(hv_cnts, axis=1)]).T
hv_dist

array([[0.081, 0.014, 0.018, 0.045, 0.123, 0.026, 0.02 , 0.08 , 0.061,
        0.003, 0.007, 0.04 , 0.026, 0.067, 0.081, 0.014, 0.   , 0.058,
        0.065, 0.086, 0.029, 0.01 , 0.025, 0.001, 0.019, 0.   ],
       [0.08 , 0.014, 0.018, 0.044, 0.126, 0.022, 0.021, 0.074, 0.061,
        0.001, 0.008, 0.04 , 0.028, 0.067, 0.082, 0.014, 0.001, 0.054,
        0.066, 0.086, 0.033, 0.01 , 0.027, 0.001, 0.023, 0.   ],
       [0.075, 0.016, 0.02 , 0.04 , 0.12 , 0.023, 0.022, 0.069, 0.062,
        0.001, 0.007, 0.05 , 0.029, 0.062, 0.081, 0.016, 0.   , 0.062,
        0.072, 0.081, 0.028, 0.01 , 0.028, 0.001, 0.025, 0.001],
       [0.071, 0.017, 0.021, 0.042, 0.125, 0.025, 0.022, 0.072, 0.062,
        0.001, 0.008, 0.045, 0.021, 0.068, 0.072, 0.017, 0.001, 0.069,
        0.074, 0.09 , 0.027, 0.009, 0.026, 0.001, 0.015, 0.001],
       [0.078, 0.015, 0.021, 0.051, 0.118, 0.025, 0.02 , 0.07 , 0.065,
        0.002, 0.006, 0.04 , 0.022, 0.069, 0.069, 0.018, 0.001, 0.068,
        0.078, 0.084, 0.029, 0
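
This division relies on broadcasting: NumPy stretches the column of row totals across the columns of the matrix. As a sketch on a made-up count matrix `cnts`, we can build the column of totals directly with `keepdims=True` and check that each row of the result sums to 1:

```python
import numpy as np

cnts = np.array([[2, 1, 1],
                 [1, 3, 6]])

row_totals = np.sum(cnts, axis=1, keepdims=True)  # shape (2, 1)
dist = cnts / row_totals                          # broadcast along the columns

print(dist)
print(np.sum(dist, axis=1))
```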

The Hadamard product, i.e. the elementwise product of two matrices of the same shape:

In [280]:
hv_dist * hv_dist

array([[6.558e-03, 2.014e-04, 3.366e-04, 2.022e-03, 1.512e-02, 6.542e-04,
        3.997e-04, 6.347e-03, 3.667e-03, 6.645e-06, 4.906e-05, 1.614e-03,
        6.983e-04, 4.485e-03, 6.622e-03, 2.088e-04, 2.018e-07, 3.349e-03,
        4.285e-03, 7.395e-03, 8.538e-04, 9.252e-05, 6.182e-04, 8.979e-07,
        3.572e-04, 2.032e-07],
       [6.330e-03, 1.946e-04, 3.291e-04, 1.922e-03, 1.598e-02, 4.881e-04,
        4.296e-04, 5.409e-03, 3.706e-03, 8.036e-07, 5.893e-05, 1.605e-03,
        7.624e-04, 4.546e-03, 6.722e-03, 1.994e-04, 2.930e-07, 2.945e-03,
        4.394e-03, 7.326e-03, 1.061e-03, 1.031e-04, 7.542e-04, 5.476e-07,
        5.383e-04, 6.873e-08],
       [5.588e-03, 2.531e-04, 3.960e-04, 1.571e-03, 1.444e-02, 5.422e-04,
        4.946e-04, 4.769e-03, 3.842e-03, 3.667e-07, 5.441e-05, 2.479e-03,
        8.241e-04, 3.828e-03, 6.584e-03, 2.453e-04, 1.091e-07, 3.788e-03,
        5.192e-03, 6.548e-03, 8.053e-04, 9.873e-05, 7.928e-04, 1.094e-06,
        6.218e-04, 3.667e-07],
       [4.999e-03, 

### Dot Products

Between two vectors

In [281]:
np.dot(iliad_dist, odyssey_dist)

0.06581109734214702

Or two rows of a matrix

In [282]:
np.dot(hv_dist[0, :], hv_dist[1, :])

0.065811097342147

Finally, we compute the cosine of the angle between two vectors:
$$
\frac{\mathbf{x} \cdot \mathbf{y}}{||\mathbf{x}|| \cdot ||\mathbf{y}||}.
$$

In [283]:
np.dot(hv_dist[0, :], hv_dist[1, :]) / (
        np.linalg.norm(hv_dist[0, :]) *
        np.linalg.norm(hv_dist[1, :]))

0.9990782943375431
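
We can wrap this formula in a small function. The name `cosine()` and the test vectors below are ours, and the sketch assumes that neither vector is the zero vector:

```python
import numpy as np


def cosine(x, y):
    """Cosine of the angle between two nonzero vectors."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))


cos_same = cosine(np.array([1., 2.]), np.array([2., 4.]))  # parallel vectors
cos_orth = cosine(np.array([1., 0.]), np.array([0., 3.]))  # orthogonal vectors

print(cos_same, cos_orth)
```

Parallel vectors have a cosine of 1 and orthogonal vectors a cosine of 0, whatever their lengths.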

### Matrix Products

The product of a matrix $\mathbf{X}$ by a vector $\mathbf{y}$, $\mathbf{X}\mathbf{y}$, is a sequence of dot products between the matrix rows and the vector, resulting in a column vector:
$$
\mathbf{X}\mathbf{y} = 
\begin{bmatrix*}
\mathbf{X}_{1 .} \cdot \mathbf{y} \\
\mathbf{X}_{2 .} \cdot \mathbf{y} \\
...\\
\mathbf{X}_{n .} \cdot \mathbf{y} \\
\end{bmatrix*},
$$
where $\mathbf{X}_{i .}$ denotes the $i^\text{th}$ row of matrix $\mathbf{X}$. If $\mathbf{X}$ consists of only one row, we have a matrix product of a row vector by a column vector, which is equivalent to a dot product:
$$
\mathbf{x} \cdot \mathbf{y} =
\begin{bmatrix*}
x_1,&x_2,& ...& x_n\\
\end{bmatrix*}
\begin{bmatrix*}
y_1\\
y_2\\ 
...\\
y_n\\
\end{bmatrix*}
= \sum_{i = 1}^n x_i y_i.
$$

In [284]:
hv_dist[0, :].reshape(1, -1) @ hv_dist[1, :].reshape(-1, 1)

array([[0.066]])
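
As a sketch of the definition above, we can check on a made-up matrix `X` and vector `y` that the product `X @ y` contains the dot products of the rows of `X` with `y`:

```python
import numpy as np

X = np.array([[1, 2],
              [3, 4],
              [5, 6]])
y = np.array([10, 1])

product = X @ y                                    # matrix-vector product
by_rows = np.array([np.dot(row, y) for row in X])  # one dot product per row

print(product)
print(by_rows)
```

Note that with a one-dimensional `y`, NumPy returns a one-dimensional array of shape (3,) and not a column of shape (3, 1).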

We will now compute the cosine of all the pairs of vectors representing the works in the `hv_dist` matrix, _i.e._ the rows of the matrix. For this, we will first compute the dot products of all the pairs, $\mathbf{u} \cdot \mathbf{v}$, then the norms $||\mathbf{u}||$ and  $||\mathbf{v}||$, the products of the norms, $||\mathbf{u}|| \cdot||\mathbf{v}||$, and finally the cosines, $\displaystyle{\frac{\mathbf{u} \cdot \mathbf{v}}{||\mathbf{u}|| \cdot||\mathbf{v}||}}$.

The dot products, $\mathbf{u} \cdot \mathbf{v}$, of all the pairs of rows of a matrix $\mathbf{X}$ are simply the elements of $\mathbf{X} \mathbf{X}^\intercal$:

In [285]:
hv_dot = hv_dist @ hv_dist.T
hv_dot

array([[0.066, 0.066, 0.065, 0.066, 0.065],
       [0.066, 0.066, 0.065, 0.066, 0.065],
       [0.065, 0.065, 0.064, 0.065, 0.064],
       [0.066, 0.066, 0.065, 0.066, 0.065],
       [0.065, 0.065, 0.064, 0.065, 0.065]])

For the vector norms, $||\mathbf{u}||$ and $||\mathbf{v}||$, we can use `np.linalg.norm()`. Here we will break down the computation into elementary operations. We will apply the Hadamard product to obtain the squares of the coordinates, then sum along the rows, and finally extract the square root:

In [286]:
hv_norm = np.sqrt(np.sum(hv_dist * hv_dist, axis=1))
hv_norm

array([0.257, 0.257, 0.253, 0.257, 0.255])

We compute the products of the norms, $||\mathbf{u}|| \cdot||\mathbf{v}||$, for all the pairs as the matrix product of a column vector by a row vector:
$$
\begin{bmatrix*}
x_1\\
x_2\\
 ...\\
 x_n\\
\end{bmatrix*}
\begin{bmatrix*}
y_1, y_2, ..., y_n\\
\end{bmatrix*}
=
\begin{bmatrix*}
x_1 y_1& x_1 y_2&...&x_1 y_n\\
x_2 y_1& x_2 y_2&...&x_2 y_n\\
 ...\\
 x_ny_1& x_n y_2&...&x_n y_n \\
\end{bmatrix*}. 
$$


In [287]:
hv_norm_pairs = hv_norm.reshape(-1, 1) @ hv_norm.reshape(1, -1)
hv_norm_pairs

array([[0.066, 0.066, 0.065, 0.066, 0.065],
       [0.066, 0.066, 0.065, 0.066, 0.065],
       [0.065, 0.065, 0.064, 0.065, 0.064],
       [0.066, 0.066, 0.065, 0.066, 0.065],
       [0.065, 0.065, 0.064, 0.065, 0.065]])

We are now nearly done with the cosine. We only need to divide the dot products elementwise by the norm products, $\displaystyle{\frac{\mathbf{u} \cdot \mathbf{v}}{||\mathbf{u}|| \cdot||\mathbf{v}||}}$:

In [288]:
hv_cos = hv_dot / hv_norm_pairs
hv_cos

array([[1.   , 0.999, 0.997, 0.996, 0.995],
       [0.999, 1.   , 0.997, 0.995, 0.994],
       [0.997, 0.997, 1.   , 0.996, 0.995],
       [0.996, 0.995, 0.996, 1.   , 0.998],
       [0.995, 0.994, 0.995, 0.998, 1.   ]])
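
The same pairwise cosines can be obtained more compactly with `np.linalg.norm()` along the rows and `np.outer()` for the products of the norms. The sketch below applies this to a small made-up matrix `X`, whose cosines are easy to check by hand; it assumes that no row is the zero vector:

```python
import numpy as np

X = np.array([[1., 0.],
              [1., 1.],
              [0., 2.]])

dots = X @ X.T                           # dot products of all the pairs of rows
norms = np.linalg.norm(X, axis=1)        # one norm per row
cosines = dots / np.outer(norms, norms)  # elementwise division

print(cosines)
```

The diagonal contains the cosine of each vector with itself, 1, and the matrix is symmetric.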

### Matrices and Rotations

To finish this notebook, we will have a look at vector rotation. From algebra courses, we know that we can use a matrix to compute a rotation of angle $\theta$. For a two-dimensional vector, the rotation matrix is:
$$
\mathbf{R}_{\theta} =
\begin{bmatrix*}
\cos \theta &-\sin \theta \\
\sin \theta & \cos \theta \\
\end{bmatrix*}.
$$

In [289]:
theta_45 = np.pi/4
rot_mat_45 = np.array([[np.cos(theta_45), -np.sin(theta_45)],
          [np.sin(theta_45), np.cos(theta_45)]])
rot_mat_45

array([[ 0.707, -0.707],
       [ 0.707,  0.707]])

We rotate the vector (1, 1) by this angle:

In [290]:
rot_mat_45 @ np.array([1, 1])

array([1.110e-16, 1.414e+00])
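
The first coordinate, 1.110e-16, is a floating-point rounding error; the exact result is $(0, \sqrt{2})$. As a further sketch with a value of our choosing, we rotate the vector (1, 0) by $\pi/2$ and compare the result to (0, 1) with `np.allclose()`, which tolerates such rounding errors. We also check that the rotation preserves the norm:

```python
import numpy as np

theta_90 = np.pi / 2
rot_mat_90 = np.array([[np.cos(theta_90), -np.sin(theta_90)],
                       [np.sin(theta_90), np.cos(theta_90)]])

v = np.array([1, 0])
rotated = rot_mat_90 @ v

print(np.allclose(rotated, [0, 1]))
print(np.isclose(np.linalg.norm(rotated), np.linalg.norm(v)))
```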

The matrix of a sequence of rotations, for instance a rotation of $\pi/6$ followed by a rotation of $\pi/4$, is simply the matrix product of the individual rotations, $\mathbf{R}_{{\theta}_1} \mathbf{R}_{{\theta}_2} = \mathbf{R}_{{\theta}_1 + {\theta}_2}$; here $\mathbf{R}_{\pi/4} \mathbf{R}_{\pi/6} = \mathbf{R}_{5\pi/12}$. In two dimensions, the order of the rotations does not matter.

In [291]:
theta_30 = np.pi/6
rot_mat_30 = np.array([[np.cos(theta_30), -np.sin(theta_30)],
          [np.sin(theta_30), np.cos(theta_30)]])
rot_mat_30

array([[ 0.866, -0.5  ],
       [ 0.5  ,  0.866]])

In [292]:
rot_mat_30 @ rot_mat_45

array([[ 0.259, -0.966],
       [ 0.966,  0.259]])

In [293]:
rot_mat_45 @ rot_mat_30

array([[ 0.259, -0.966],
       [ 0.966,  0.259]])

In [294]:
np.arccos(0.25881905)

1.3089969339255036

In [295]:
np.pi/4 + np.pi/6

1.308996938995747
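
The two values agree up to the eight decimals of the cosine we typed in. A more direct check, sketched below with a helper function `rotation_matrix()` of our own, compares the product of the two matrices with the matrix of the sum of the angles. We also recover the angle with `np.arctan2()`, which uses both the sine and the cosine:

```python
import numpy as np


def rotation_matrix(theta):
    """Two-dimensional rotation matrix of angle theta, in radians."""
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta), np.cos(theta)]])


a, b = np.pi / 4, np.pi / 6
composed = rotation_matrix(a) @ rotation_matrix(b)

same_as_sum = np.allclose(composed, rotation_matrix(a + b))
commutes = np.allclose(composed, rotation_matrix(b) @ rotation_matrix(a))
angle = np.arctan2(composed[1, 0], composed[0, 0])  # sin and cos of the angle

print(same_as_sum, commutes, angle)
```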