The learning outcomes of this session are

-   to have an introductory understanding of Jupyter notebooks;

-   to have an introductory understanding of the library NumPy,
    particularly with respect to vectors;

-   to examine a case where vectors with a dimension much larger than 3
    can be analysed.

Notebooks
=========

If you haven’t encountered one before, this is an example of a notebook
(specifically a Jupyter notebook). It brings together text and code
in one place.

The parts of this screen which have "In \[ \]:" to the left are referred
to as cells. Cells are pieces of code that can be run. You can run the
cells by clicking on the cell and pressing Control + Enter.

If the cell generates output it will appear immediately after the cell
in question when you do this.

Do not just click through all the cells without thinking about what each
cell is going to do.

**There will be cells that are empty - you will need to fill them!**

It’s important to note that the cells are in order and you should run
them in that order.

For this section, you will be largely running things interactively, but
it’s best to set up an editor window, type in there and then run the
script (the F5 key will very much be your friend here).

If you click on Help and Keyboard Shortcuts you will get a list of
shortcuts for doing particular operations (I find the one for creating a
cell below the one you are using is very useful).

If you want to save your notebook you can click on the diskette icon, but
remember that if you restart your binder session the information will be
lost. To keep a permanent version on your machine, click on File, then
Download as, and then Notebook. When you restart your binder session,
select Upload on the directory page and then upload your file. You can
then run it from there.

You may also wish to run Jupyter from your own computer. Instructions on
this can be found at
<https://jupyter.org/install>.

NumPy Introduction
==================

We will make use of NumPy for this course to work with vectors and
matrices. It’s worth keeping an eye on the QuickStart tutorial to NumPy
(which we copy from a great deal here)

<https://docs.scipy.org/doc/numpy/user/>

and the NumPy manual at

<https://docs.scipy.org/doc/numpy/index.html>.

The following will import this library into your session

In [None]:
import numpy as np

You can define a vector as follows

In [None]:
a = np.array([1,2,3])

(Note the ordering of round and square brackets).

Inspect the contents of `a` by executing

In [None]:
a

You can access individual components directly, as you would with any
other array

In [None]:
a[0]

**CHECKPOINT Enter in what you would need to inspect the second entry in
`a`.**

The vector `a` has an associated set of methods including a nicer print
statement.

In [None]:
print(a)

The entries in `a` can be integers, floats and so on. Since Python is a
dynamically typed language, NumPy works out the type on the fly from the
values you give it. One can check the type as follows

In [None]:
a.dtype.name

*Question* - what type did you get?

**CHECKPOINT Create another vector `b` with same dimensions and size of
entries as `a`, but where the entries are of type float.**

There is a transpose operation, but for vectors defined this way it
doesn’t really do anything. We’ll get back to this in the second lab.

In [None]:
a.T
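
As a rough sketch of why the transpose has no visible effect here: an array built from a flat list is one-dimensional, so there are no rows and columns to swap. The names `v` and `m` below are illustrative only and are not used elsewhere in this lab.

```python
import numpy as np

v = np.array([1, 2, 3])      # one-dimensional array, shape (3,)
print(v.shape, v.T.shape)    # the two shapes are identical

m = np.array([[1, 2, 3]])    # two-dimensional array with a single row, shape (1, 3)
print(m.shape, m.T.shape)    # here the transpose swaps the two axes
```
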

NumPy - Operations
==================

We define the following set of new vectors.

In [None]:
c = np.array([1.,2.])

In [None]:
d = np.array([2.,1.])

We can do many of the simple operations such as multiplication by a
scalar, adding and subtracting vectors. **Ask yourself what the output
should be before pressing enter.**

Type

In [None]:
2. * c

In [None]:
-.123 * d

In [None]:
c + d

In [None]:
c - d

As these vectors are objects, one needs to be careful when copying them,
as using `=` only passes the reference.

Type

In [None]:
u = c

In [None]:
print(u)

In [None]:
c *= 3

In [None]:
print(u)

If you use the `copy` method then the values of the vector are copied
across.

Type

In [None]:
u = c.copy()

In [None]:
print(u)

In [None]:
c *= 3

**What will `u` look like now?**

In [None]:
print(u)
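
If you want to confirm which of the two situations you are in, one option is to ask whether two names refer to the same data. The following is a minimal sketch; the names `p`, `alias` and `clone` are illustrative and not part of the lab.

```python
import numpy as np

p = np.array([1., 2.])
alias = p           # a second name for the same array
clone = p.copy()    # an independent array holding the same values

print(alias is p)                   # True: both names refer to one object
print(np.shares_memory(p, clone))   # False: the copy has its own data

p *= 3              # in-place update, visible through every alias
print(alias)        # [3. 6.]
print(clone)        # [1. 2.]
```
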

One can compute the dot product between any two vectors. Type

In [None]:
c = np.array([1.,2.])

In [None]:
c.dot(d)
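
As a reminder of what the dot product computes, the sketch below checks that `dot` agrees with multiplying the components pairwise and summing. The vectors `p` and `q` are illustrative values chosen for easy arithmetic.

```python
import numpy as np

p = np.array([3., 4.])
q = np.array([2., 5.])

by_method = p.dot(q)       # 3*2 + 4*5
by_hand = np.sum(p * q)    # elementwise product, then sum

print(by_method, by_hand)  # both are 26.0
```
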

**CHECKPOINT Create a new vector that is orthogonal to `c`. Demonstrate
numerically that it is orthogonal.**

NumPy provides `np.linalg.norm` for the length of a vector, but you can easily construct one yourself from the dot product.

Type

In [None]:
import math

In [None]:
math.sqrt(d.dot(d))
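
As a cross-check, the hand-built length can be compared with `np.linalg.norm` from NumPy's linear algebra module. This is only a sketch on an illustrative vector `w`.

```python
import math
import numpy as np

w = np.array([3., 4.])

by_hand = math.sqrt(w.dot(w))   # square root of the dot product of w with itself
by_numpy = np.linalg.norm(w)    # Euclidean norm, the default for a 1-D array

print(by_hand, by_numpy)        # both are 5.0
```
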

**CHECKPOINT Write a short function to compute the length of a vector.**

One can determine a unit vector `du` for a vector `d`.

Type

In [None]:
du=d/math.sqrt(d.dot(d))

**CHECKPOINT Write a short function to compute the unit vector that is
parallel to an input vector. Include a check to make sure that the input
vector is not the null vector.**

Higher dimensional example - character frequencies in different languages.
==========================================================================

If you are given a piece of text, then it is possible to compute the
frequency of occurrence of each of the lower case letters in that text.
It turns out that the frequencies of these letters are characteristic of
the language the text is written in (although this should not be seen as
an absolutely fixed set of frequencies, as every text can represent a
different style). These frequencies are surprisingly constant between
different texts taken from the same language. We can even use them to
decrypt texts.

Put more mathematically, for each text $\mathcal T$ we can compute the
frequency of each letter $f_a^{\mathcal T}$, $f_b^{\mathcal T}$,
$\dots$ where $f_a^{\mathcal T}$ is the frequency of lower case “a”’s in
$\mathcal T$ and so on. For the English alphabet there are 26 letters,
so we can construct a vector $\underline{f}^{\mathcal T}$ of those
frequencies. So, $\underline{f}^{\mathcal T}$ is a vector *with 26
entries*.

We can then compare how similar two texts $A$ and $B$ are by comparing
their frequencies. Let’s suppose that their corresponding frequency
vectors are $\underline{f}^A$ and $\underline{f}^B$. We can easily
compute the cosine of the angle between these two vectors by computing
$$\cos \theta \;\; = \;\; \frac{\underline{f}^A \cdot \underline{f}^B}{|\underline{f}^A|\,|\underline{f}^B|} \;\; .$$

Remember -

1.  if $\cos \theta = 1$ then $\underline{f}^A$ and $\underline{f}^B$
    are parallel to each other, which indicates their frequencies are
    very similar.

2.  It doesn’t matter about the number of dimensions, the above
    relationship still holds (not just for two or three dimensions).
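
To illustrate that nothing in the formula depends on the number of dimensions, here is a sketch using five-dimensional vectors. The vectors `p`, `q` and `r` are illustrative values, not frequency vectors from any text, and `np.linalg.norm` is used here for the lengths.

```python
import numpy as np

p = np.array([1., 2., 0., 1., 3.])
q = 2.5 * p                            # parallel to p
r = np.array([2., -1., 7., 0., 0.])    # p.dot(r) = 2 - 2 = 0, so orthogonal to p

cos_pq = p.dot(q) / (np.linalg.norm(p) * np.linalg.norm(q))
cos_pr = p.dot(r) / (np.linalg.norm(p) * np.linalg.norm(r))

print(cos_pq)   # close to 1 for parallel vectors
print(cos_pr)   # 0.0 for orthogonal vectors
```
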

We can test this in Python. To do this you will write Python code to
compute the frequency of letters for a text and then to compute the dot
product between different frequency vectors. Without looking at the
files one can check if texts are from the same language or not.

In the cell below are two functions: one that will read in a text file
and return a string, and another that will return a vector with the
*incidence* of lower case letters (i.e. how many times the letter “a”
occurs and so on) in a string.

In [None]:

import re

def getString(fn):
    # Read (at most) the first 10000 characters of the file
    with open(fn, 'r') as f:
        myString = f.read(10000)
    return myString

def getIncidence(string):
    letters = "abcdefghijklmnopqrstuvwxyz"
    Nc = len(letters)
    n = np.zeros(Nc, dtype=np.float64)
    i = 0
    for c in letters:
        # Count how many times the lower case letter c occurs
        x = re.findall(c, string)
        n[i] = len(x) * 1.0
        i += 1
    return n
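
The counting step in `getIncidence` relies on `re.findall` returning a list of every match, so the length of that list is the number of occurrences. A small sketch on an illustrative string `sample`:

```python
import re

sample = "banana bread"

matches = re.findall("a", sample)   # list of every non-overlapping match of "a"
print(matches)                      # ['a', 'a', 'a', 'a']
print(len(matches))                 # 4
print(sample.count("a"))            # 4, the built-in string method agrees
```
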


**CHECKPOINT Write a function that will compute the frequency vector of
the text found in a text file. The input is the name of the text file
and the output is the frequency vector for that file. If
$\underline{N}$ is the vector of incidences then
$$f_i \;\; = \;\; \frac{N_i}{\sum_j N_j} \;\; .$$ If you look in the NumPy
documentation you will find a function to compute the denominator of this.
*This is not the same as computing the unit vector - prove this to
yourself*.**

**CHECKPOINT Write a function to compute the cosine of the angle between
two text frequency vectors. You will need the `math` library.**

**CHECKPOINT In the Jupyter folder there are three files, Dutch.txt,
English.txt and unknown.txt. The first two files are the same text in
Dutch and English respectively. Compute the cosines between the
resulting frequency vectors of the three texts and try and determine the
language of the file unknown.txt.**

Extension work
==============

Rotation encryption is one of the oldest methods for encrypting
texts.[1] The idea is relatively simple. Each letter in a text is
exchanged with a letter that is a specific number down from it in the
alphabetical sequence. So an encryption rotation with an offset of
one would exchange

`a` $\rightarrow$ `b`,

`b` $\rightarrow$ `c`,

$\dots$

`z` $\rightarrow$ `a`.

The cipher rotation with an offset of 13 (or ROT-13) would be `a`
$\rightarrow$ `n` and so on.

If one knows the offset, then one can decrypt the text by either
performing the reverse transformation on the encrypted text or,
equivalently, applying the transformation with a new offset where the sum
of the offsets is 26 (the number of letters in the alphabet). So if a
text has an encryption rotation with an offset of 1 applied to it,
followed by an encryption rotation with an offset of 25, we end up with
the same text. Two applications of ROT-13 likewise return you to the
same text.
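
The ROT-13 case can be checked quickly with the `rot_13` text transform in Python's standard `codecs` module. This is a sketch on an illustrative string and is separate from the rotation function used later in this lab.

```python
import codecs

text = "Hello, World"

once = codecs.encode(text, "rot_13")    # apply ROT-13
twice = codecs.encode(once, "rot_13")   # apply it again: 13 + 13 = 26

print(once)    # Uryyb, Jbeyq
print(twice)   # Hello, World
```
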

This encryption method can be very easily broken using the letter
frequency method that was discussed in the last set of lab materials. In
particular, *the frequencies of the letters will remain the same,
regardless of how they are transformed*. Furthermore if you have a
reference text, which you know is from the same language as the
encrypted text, then you can compare the frequencies to determine what
the offset was.

If one applies the encryption rotation algorithm with the right offset on
the encrypted text then the cosine of the angle between the two
frequency vectors should be close to 1.
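
One way to see why this works: rotating the text does not change the set of counts, it only moves each count along the alphabet by the offset. The sketch below assumes lower case input and uses illustrative helpers (`count_letters`, `shift_text`) built on `str.translate`, rather than the functions given in this lab.

```python
import numpy as np

letters = "abcdefghijklmnopqrstuvwxyz"

def count_letters(s):
    # Number of occurrences of each lower case letter, as a 26-entry vector
    return np.array([s.count(ch) for ch in letters], dtype=np.float64)

def shift_text(s, k):
    # Rotate lower case letters by k places, wrapping around after "z"
    table = str.maketrans(letters, letters[k:] + letters[:k])
    return s.translate(table)

plain = "the quick brown fox jumps over the lazy dog"
k = 3
cipher = shift_text(plain, k)

n_plain = count_letters(plain)
n_cipher = count_letters(cipher)

# The count for "a" in the plain text becomes the count for "d" in the cipher text
print(np.array_equal(n_cipher, np.roll(n_plain, k)))   # True
```
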

The following function takes a string and an offset and carries out a
rotation of the string.

[1] The fact that it’s sometimes called the “Caesar Cipher” gives you an
idea of how old it is!

In [None]:

import sys

def rot(s, shift):
    # Rotate every letter in s by shift places, wrapping around the alphabet
    result = ""
    numLetters = 26
    if shift < -(numLetters - 1) or shift > (numLetters - 1):
        print("Error in rot function - shift must be between -25 and 25")
        sys.exit(-1)

    for v in s:
        c = ord(v)

        if c >= ord('a') and c <= ord('z'):
            # Lower case letter: shift within 'a'..'z'
            deltaO = c - ord('a')
            deltaN = (deltaO + shift) % numLetters
            c = ord('a') + deltaN

        elif c >= ord('A') and c <= ord('Z'):
            # Upper case letter: shift within 'A'..'Z'
            deltaO = c - ord('A')
            deltaN = (deltaO + shift) % numLetters
            c = ord('A') + deltaN

        result += chr(c)

    return result


**CHECKPOINT In the Jupyter folder there is a piece of English text
encrypted using this approach called Englishencrypt.txt. Using the above
function, try to decrypt the text just by comparing the cosine of the
frequency vectors between the rotated text and the reference English.txt
file.**