# Homework 01

**1**. (25 points)

The code below gives five "documents" with titles in `titles` and text in `contents`. 

- Convert each text into "words" by converting to lower case, removing punctuation and splitting on whitespace
- Make a list of all unique "words" in any of the texts
- Create an pandas DataFrame whose rows are words, columns are titles, and values are counts of the word in the document
- Add a column `total` that counts the total number of occurrences for each word across all documents
- Show the rows for the 5 most commonly used words

In [1]:
import sklearn
from sklearn.datasets import fetch_20newsgroups
twenty = fetch_20newsgroups(subset='train')
target_names = twenty['target_names']
titles = [target_names[i] for i in twenty['target'][2:7]]
contents = twenty['data'][2:7]

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [19]:
import string
import numpy as np
import pandas as pd
words = [content.lower().translate(str.maketrans("","",string.punctuation)).split() for content in contents]
vocab = set([word for i in words for word in i])
table = np.zeros(shape = (len(vocab),len(titles)), dtype = int)
for i,col in enumerate(contents):
    for j,word in enumerate(vocab):
        table[j,i] = col.count(word)
data = pd.DataFrame(table, columns = titles, index = vocab)
data["total"] = data.sum(axis = 1)
data.sort_values("total", ascending=False).head(3)

Unnamed: 0,comp.sys.mac.hardware,comp.graphics,sci.space,talk.politics.guns,sci.med,total
e,155,64,88,216,48,571
a,103,40,63,141,38,385
i,116,47,53,97,25,338


**2**. (75 points)

A Caesar cipher is a very simple method of encoding and decoding data. The cipher simply replaces characters with the character offset by $k$ places. For example, if the offset is 3, we replace `a` with `d`, `b` with `e` etc. The cipher wraps around so we replace `y` with `b`, `z` with `c` and so on. Punctuation, spaces and numbers are left unchanged.

- Write a function `encode` that takes as arguments a string and an integer offset and returns the encoded cipher.
- Write a function `decode` that takes as arguments a cipher and an integer offset and returns the decoded string. 
- Write a function `auto_decode` that takes as argument a cipher and uses a statistical method to guess the optimal offset to decode the cipher, assuming the original string is in English which has the following letter frequency:

```python
freq = {
 'a': 0.08167,
 'b': 0.01492,
 'c': 0.02782,
 'd': 0.04253,
 'e': 0.12702,
 'f': 0.02228,
 'g': 0.02015,
 'h': 0.06094,
 'i': 0.06966,
 'j': 0.00153,
 'k': 0.00772,
 'l': 0.04025,
 'm': 0.02406,
 'n': 0.06749,
 'o': 0.07507,
 'p': 0.01929,
 'q': 0.00095,
 'r': 0.05987,
 's': 0.06327,
 't': 0.09056,
 'u': 0.02758,
 'v': 0.00978,
 'w': 0.0236,
 'x': 0.0015,
 'y': 0.01974,
 'z': 0.00074
}
```

- Encode the following nursery rhyme using a random offset from 10 to 20, then recover the original using `auto_decode`:

```text
Baa, baa, black sheep,
Have you any wool?
Yes, sir, yes, sir,
Three bags full;
One for the master,
And one for the dame,
And one for the little boy
Who lives down the lane.
```

In [20]:
freq = {
 'a': 0.08167,
 'b': 0.01492,
 'c': 0.02782,
 'd': 0.04253,
 'e': 0.12702,
 'f': 0.02228,
 'g': 0.02015,
 'h': 0.06094,
 'i': 0.06966,
 'j': 0.00153,
 'k': 0.00772,
 'l': 0.04025,
 'm': 0.02406,
 'n': 0.06749,
 'o': 0.07507,
 'p': 0.01929,
 'q': 0.00095,
 'r': 0.05987,
 's': 0.06327,
 't': 0.09056,
 'u': 0.02758,
 'v': 0.00978,
 'w': 0.0236,
 'x': 0.0015,
 'y': 0.01974,
 'z': 0.00074
}
txt = """Baa, baa, black sheep,
Have you any wool?
Yes, sir, yes, sir,
Three bags full;
One for the master,
And one for the dame,
And one for the little boy
Who lives down the lane.
"""

In [22]:
def encode(text, offset):
    new = []
    for i in text:
        if i in string.ascii_lowercase:
            if ord(i) + offset > ord("z"):
                new.append(chr(ord(i) + offset - 26))
            else:
                new.append(chr(ord(i) + offset))
        elif i in string.ascii_uppercase:
            if ord(i) + offset > ord("Z"):
                new.append(chr(ord(i) + offset - 26))
            else:
                new.append(chr(ord(i) + offset))
        else:
            new.append(i)
    return "".join(new)

In [24]:
def decode(text, offset):
    new = []
    for i in text:
        if i in string.ascii_lowercase:
            if ord(i) - offset < ord("a"):
                new.append(chr(ord(i) - offset + 26))
            else:
                new.append(chr(ord(i) - offset))
        elif i in string.ascii_uppercase:
            if ord(i) - offset < ord("A"):
                new.append(chr(ord(i) - offset + 26))
            else:
                new.append(chr(ord(i) - offset))
        else:
            new.append(i)
    return "".join(new)

In [81]:
for offset in range(26): print(decode(txt,offset))

Baa, baa, black sheep,
Have you any wool?
Yes, sir, yes, sir,
Three bags full;
One for the master,
And one for the dame,
And one for the little boy
Who lives down the lane.

Azz, azz, akzbj rgddo,
Gzud xnt zmx vnnk?
Xdr, rhq, xdr, rhq,
Sgqdd azfr etkk;
Nmd enq sgd lzrsdq,
Zmc nmd enq sgd czld,
Zmc nmd enq sgd khsskd anx
Vgn khudr cnvm sgd kzmd.

Zyy, zyy, zjyai qfccn,
Fytc wms ylw ummj?
Wcq, qgp, wcq, qgp,
Rfpcc zyeq dsjj;
Mlc dmp rfc kyqrcp,
Ylb mlc dmp rfc bykc,
Ylb mlc dmp rfc jgrrjc zmw
Ufm jgtcq bmul rfc jylc.

Yxx, yxx, yixzh pebbm,
Exsb vlr xkv tlli?
Vbp, pfo, vbp, pfo,
Qeobb yxdp crii;
Lkb clo qeb jxpqbo,
Xka lkb clo qeb axjb,
Xka lkb clo qeb ifqqib ylv
Tel ifsbp altk qeb ixkb.

Xww, xww, xhwyg odaal,
Dwra ukq wju skkh?
Uao, oen, uao, oen,
Pdnaa xwco bqhh;
Kja bkn pda iwopan,
Wjz kja bkn pda zwia,
Wjz kja bkn pda heppha xku
Sdk herao zksj pda hwja.

Wvv, wvv, wgvxf nczzk,
Cvqz tjp vit rjjg?
Tzn, ndm, tzn, ndm,
Ocmzz wvbn apgg;
Jiz ajm ocz hvnozm,
Viy jiz ajm ocz yvhz,
Viy jiz a

In [97]:
def auto_decode(new):
    ref = list(map(lambda item:item[1],freq.items()))
    result = []
    for offset in range(26):
        text = decode(new, offset)
        alpha = text.lower().translate(str.maketrans("","",string.punctuation)).translate(str.maketrans("","",string.whitespace))
        count = np.array([[i,alpha.count(i)] for i in string.ascii_lowercase])[:,1]
        propo = np.array(count, dtype=int)/np.array(count, dtype = int).sum()
        result.append(sum((ref-propo)**2))
    best_off = np.argmin(result)
    return decode(new, best_off)

In [99]:
auto_decode(encode(txt,3))

'Baa, baa, black sheep,\nHave you any wool?\nYes, sir, yes, sir,\nThree bags full;\nOne for the master,\nAnd one for the dame,\nAnd one for the little boy\nWho lives down the lane.\n'