# Laboratory 02

## Requirements

For the second part of the exercises you will need the `wikipedia` package. On Windows machines, use the following command in the Anaconda Prompt (`Start --> Anaconda --> Anaconda Prompt`):

    conda install -c conda-forge wikipedia
    
This command should work with other Anaconda environments (OSX, Linux).

If you are using virtualenv directly instead of Anaconda, the following command installs it in your virtualenv:

    pip install wikipedia

or

    sudo pip install wikipedia
    
installs it system-wide.

You are encouraged to reuse functions that you defined in earlier exercises.

## 1.1 Define a function that takes a sequence as its input and returns whether the sequence is symmetric. A sequence is symmetric if it is equal to its reverse.

In [1]:
def is_symmetric(l):
    for x, y in zip(l, l[::-1]):
        if x != y:
            return False
    return True

assert(is_symmetric([1]) == True)
assert(is_symmetric([]) == True)
assert(is_symmetric([1, 2, 3, 1]) == False)
assert(is_symmetric([1, "foo", "bar", "foo", 1]) == True)
assert(is_symmetric("abcba") == True)
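
The definition can also be checked by comparing each position with its mirror, stopping at the midpoint. The sketch below is one possible variant, not the reference solution; the name `is_symmetric_halves` is hypothetical, and it assumes the input supports `len` and integer indexing.

```python
def is_symmetric_halves(seq):
    """Return True if seq reads the same forwards and backwards."""
    n = len(seq)
    # Compare position i with its mirror n - 1 - i; the middle element
    # (if any) needs no check, and no reversed copy is built.
    return all(seq[i] == seq[n - 1 - i] for i in range(n // 2))


print(is_symmetric_halves("abcba"))       # True
print(is_symmetric_halves([1, 2, 3, 1]))  # False
```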

## 1.2 Define a function that takes a sequence and an integer $k$ as its input and returns the $k$ largest elements. Do not use the built-in `max` function. Do not change the original sequence. If $k$ is not specified, return one element in a list.

In [2]:
def k_largest(l, k=1, key=None):
    if key is None:
        return sorted(l, reverse=True)[:k]
    else:
        return sorted(l, reverse=True, key= lambda x: x[key])[:k]

l = [-1, 0, 3, 2]

assert(k_largest(l) == [3])
assert(k_largest(l, 2) == [2, 3] or k_largest(l, 2) == [3, 2])
assert(k_largest([{'a': 1}, {'a': 2}], key='a') == [{'a': 2}] )

## \*1.3 Add an optional `key` argument that works analogously to the built-in `sorted`'s key argument.
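
The built-in `sorted` expects `key` to be a one-argument function that is applied to each element, whereas the cell above treats `key` as a dictionary key. A minimal sketch of a callable-style `key` follows. It swaps in `heapq.nlargest` from the standard library instead of `sorted`, and the name `k_largest_by` is hypothetical.

```python
import heapq


def k_largest_by(seq, k=1, key=None):
    """Return the k largest elements of seq, largest first."""
    # heapq.nlargest takes the same kind of callable key as sorted()
    # and leaves seq unchanged.
    return heapq.nlargest(k, seq, key=key)


words = ["pear", "fig", "banana"]
print(k_largest_by(words, 2, key=len))                           # ['banana', 'pear']
print(k_largest_by([{"a": 1}, {"a": 2}], key=lambda d: d["a"]))  # [{'a': 2}]
```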

Define a function that takes a matrix as an input represented as a list of lists (you can assume that the input is a valid matrix). Return its transpose without changing the original matrix.

In [3]:
import numpy as np

def transpose(M):
    return np.array(M).T.tolist()

m1 = [[1, 2, 3], [4, 5, 6]]
m2 = [[1, 4], [2, 5], [3, 6]]

assert(transpose(m1) == m2)
assert(transpose(transpose(m1)) == m1)
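
If NumPy is not available, the transpose can also be written with the built-in `zip`. This is a sketch under the same assumption as the exercise (the input is a valid, non-ragged matrix); the name `transpose_zip` is hypothetical.

```python
def transpose_zip(M):
    """Return the transpose of a list-of-lists matrix as a new list of lists."""
    # zip(*M) groups the i-th elements of every row, i.e. the i-th column;
    # each column tuple is converted back to a list.
    return [list(column) for column in zip(*M)]


m1 = [[1, 2, 3], [4, 5, 6]]
print(transpose_zip(m1))  # [[1, 4], [2, 5], [3, 6]]
print(m1)                 # [[1, 2, 3], [4, 5, 6]] (unchanged)
```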

## 2.1 Define a function that takes a string as its input and returns a dictionary with the character frequencies.

In [4]:
def char_freq(s):
    freq = {}
    for char in s:
        if char in freq:
            freq[char] += 1
        else:
            freq[char] = 1
    return freq
    
assert(char_freq("aba") == {"a": 2, "b": 1})
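
The same counting pattern is available in the standard library as `collections.Counter`. The sketch below shows how it could replace the explicit loop; the name `char_freq_counter` is hypothetical.

```python
from collections import Counter


def char_freq_counter(s):
    """Return a plain dict mapping each character of s to its count."""
    # Counter iterates over the string and tallies each character.
    return dict(Counter(s))


print(char_freq_counter("aba") == {"a": 2, "b": 1})  # True
```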

## 2.2 Add an optional `skip_symbols` argument to the `char_freq` function. `skip_symbols` is the set of symbols that should be excluded from the frequency dictionary. If this argument is not specified, the function should include every symbol.

In [5]:
def char_freq_with_skip(s, skip_symbols=None):
    # With no skip_symbols given, nothing is excluded.
    if skip_symbols is None:
        skip_symbols = ""
    freq = {}
    for char in s:
        if char not in skip_symbols:
            if char in freq:
                freq[char] += 1
            else:
                freq[char] = 1
    return freq
    
assert(char_freq_with_skip("ab.abc?", skip_symbols=".?") == {"a": 2, "b": 2, "c": 1})

## 2.2 Define a function that computes word frequencies in a text.

In [6]:
def word_freq(s):
    words = s.split(' ')
    freq = {}
    for word in words:
        if word in freq:
            freq[word] += 1
        else:
            freq[word] = 1
    return freq
    
    
s = "the green tea and the black tea"

assert(word_freq(s) == {"the": 2, "tea": 2, "green": 1, "black": 1, "and": 1})

## 2.3 Define a function that counts the uppercase letters in a string.

In [7]:
def count_upper_case(s):
    upper_count = 0
    for char in s:
        if char.isupper():
            upper_count += 1
    return upper_count
    
assert(count_upper_case("A") == 1)
assert(count_upper_case("abA bcCa") == 2)

## 2.4 Define a function that takes two strings and decides whether they are anagrams. A string is an anagram of another string if its letters can be rearranged so that it equals the other string.

For example:

```
abc -- bac
aabb -- abab
```

Counter examples:

```
abc -- aabc
abab -- aaab
```

In [8]:
def anagram(s1, s2):
    return sorted(s1) == sorted(s2)

assert(anagram("abc", "bac") == True)
assert(anagram("aabb", "abab") == True)
assert(anagram("abab", "aaab") == False)

## 2.5. Define a sentence splitter function that takes a string and splits it into a list of sentences. Sentences end with `.` and the new sentence must start with a whitespace (`str.isspace`) or be the end of the string. See the examples below.

In [9]:
def sentence_splitter(s):
    dot_splitted = s.split('.')
    result = []
    string = ''
    for i in range(len(dot_splitted)-1):
        # Slicing with [:1] avoids an IndexError on empty pieces (e.g. "A..b.").
        if dot_splitted[i][:1].isspace():
            result.append(string)
            string = dot_splitted[i].strip()
        else:
            if len(string) > 0:
                string += '.'
            string += dot_splitted[i]
    result.append(string)
    return result


assert(sentence_splitter("A.b. acd.") == ['A.b', 'acd'])
assert(sentence_splitter("A. b. acd.") == ['A', 'b', 'acd'])
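
The splitting rule (a `.` ends a sentence only when whitespace or the end of the string follows) can also be expressed as a regular expression. The sketch below is one possible formulation with the standard `re` module; the name `sentence_splitter_re` is hypothetical.

```python
import re


def sentence_splitter_re(s):
    """Split s at every '.' that is followed by whitespace or ends the string."""
    # (?=\s|$) is a lookahead: the dot is a boundary only when the next
    # character is whitespace or there is no next character.
    parts = re.split(r"\.(?=\s|$)", s)
    # Strip the whitespace that started each new sentence and drop empty pieces.
    return [part.strip() for part in parts if part.strip()]


print(sentence_splitter_re("A.b. acd."))   # ['A.b', 'acd']
print(sentence_splitter_re("A. b. acd."))  # ['A', 'b', 'acd']
```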


## Wikipedia module

The following exercises use the `wikipedia` package. The basic usage is illustrated below.

The documentation is available [here](https://pypi.python.org/pypi/wikipedia/).

Searching for pages:

In [10]:
import wikipedia

results = wikipedia.search("Budapest")
results

['Budapest',
 'The Grand Budapest Hotel',
 'Vilmos Kondor',
 'Siege of Budapest',
 'Budapest (song)',
 'List of films shot in Budapest',
 'Hungarian Parliament Building',
 'Budapest Metro',
 'Budapest Ferenc Liszt International Airport',
 'Budapest Highflyer']

Downloading an article:

In [11]:
article = wikipedia.page("Budapest")

article.summary[:100]

'Budapest (Hungarian: [ˈbudɒpɛʃt]) is the capital and by far the most populous city of Hungary and on'

The `content` attribute contains the full text:

In [12]:
type(article.content), len(article.content)

(str, 83831)

By default, the module downloads pages from the English Wikipedia. The language can be changed as follows:

In [13]:
wikipedia.set_lang("fr")

In [14]:
wikipedia.search("Budapest")

['Budapest',
 'Budapest Honvéd',
 'Gare de Budapest-Nyugati',
 'Arrondissements de Budapest',
 'Métro de Budapest',
 'Bataille de Budapest',
 'Papp László Budapest Sportaréna',
 'MTK Budapest FC',
 'Gare de Budapest-Déli',
 'MTK Budapest']

In [15]:
fr_article = wikipedia.page("Budapest")
fr_article.summary[:100]

'Budapest (prononcé [by.da.ˈpɛst] , hongrois : Budapest [ˈbu.dɒ.pɛʃt]  ; allemand : Budapest ou ancie'

## 3.0 Change the language back to English and test the package with a few other pages.

In [16]:
wikipedia.set_lang("en")

## 3.1 Download 4-5 arbitrary pages from the English Wikipedia (they should exceed 100000 characters combined) and compute the word frequencies using your previously defined function(s). Print the most common 20 words in the following format (the example is not the correct answer):

```
unintelligent <TAB>  123456
moribund <TAB>   123451
...
```

The words and their frequency are separated by TABS and no additional whitespace should be added.
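
The required output format can be sketched as follows. The helper name `format_most_common` and the toy dictionary are hypothetical; the input is assumed to be a word-to-count dictionary like the one returned by `word_freq`.

```python
from collections import Counter


def format_most_common(freq, n=20):
    """Return the n most frequent items as 'word<TAB>count' lines."""
    top = Counter(freq).most_common(n)
    # A single tab between word and count, no padding.
    return "\n".join(f"{word}\t{count}" for word, count in top)


toy_freq = {"the": 3, "tea": 2, "green": 1}
print(format_most_common(toy_freq, 2))
```

Returning the text rather than printing it inside the helper makes the exact format easy to verify.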

In [17]:
import operator

def load_articles(pages):
    articles = []
    for page in pages:
        articles.append(wikipedia.page(page))
    return articles

def get_content_of_articles(articles):
    content = ''
    for article in articles:
        content += article.content + '\n'
    return content

def n_most_frequent(d, n=1, rev=True):
    most_freq = sorted(d.items(), reverse=rev, key=operator.itemgetter(1))[:n]
    for element in most_freq:
        print(element)

def word_freq_of_pages(pages):
    freqs = {}

    for page in pages:
        article = wikipedia.page(page)
        article_freq = word_freq(article.content)
        for word in article_freq.keys():
            if word in freqs:
                freqs[word] += article_freq[word]
            else:
                freqs[word] = article_freq[word]


    most_freq = sorted(freqs.items(), reverse=True, key=operator.itemgetter(1))[:20]
    for element in most_freq:
        print(element)
    
wikipedia.set_lang("en")
pages = ["Hungary", "Budapest", "New York City", "United Kingdom", "London"]
english_articles = load_articles(pages)
english_content = get_content_of_articles(english_articles)
n_most_frequent(word_freq(english_content), 20)
print(len(english_content))

# word_freq_of_pages(pages)
    

('the', 5759)
('of', 3196)
('and', 2766)
('in', 2327)
('to', 1113)
('a', 1018)
('is', 1009)
('The', 743)
('by', 530)
('as', 525)
('was', 478)
('for', 475)
('are', 453)
('with', 449)
('New', 385)
('from', 335)
('has', 329)
('York', 323)
('London', 307)
('on', 297)
482812


## 3.2 Repeat the same exercise for your native language if it denotes word boundaries with spaces. If it doesn't, choose an arbitrary language other than English.

In [18]:
wikipedia.set_lang("hu")
pages = ["Magyarország", "Budapest", "New York", "Egyesült Királyság", "London"]
hungarian_articles = load_articles(pages)
hungarian_content = get_content_of_articles(hungarian_articles)
n_most_frequent(word_freq(hungarian_content), 20)
print(len(hungarian_content))

# word_freq_of_pages(pages)

('a', 2694)
('és', 920)
('az', 816)
('A', 535)
('is', 255)
('Az', 179)
('város', 113)
('New', 103)
('ország', 99)
('mint', 99)
('legnagyobb', 94)
('magyar', 91)
('egy', 91)
('több', 87)
('nem', 82)
('–', 79)
('vagy', 78)
('York', 76)
('London', 73)
('Magyarország', 71)
246592


## 3.3 Define a function that takes a string and returns its bigram frequencies as a dictionary.

Character bigrams are pairs of subsequent characters. For example, the word `apple` contains the following bigrams: `ap, pp, pl, le`.

They are used for language modeling. 
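
As a small illustration of the language-modeling use, bigram counts can be turned into an estimate of which character follows a given one. This is a minimal sketch using relative frequencies; the name `next_char_probs` and the sample text are hypothetical.

```python
from collections import Counter


def next_char_probs(text, prev):
    """Estimate P(next character | prev) from the bigram counts of text."""
    bigrams = Counter(zip(text, text[1:]))
    # Keep only the bigrams whose first character is prev.
    following = {b: c for (a, b), c in bigrams.items() if a == prev}
    total = sum(following.values())
    return {ch: count / total for ch, count in following.items()}


# In "apple pie", 'p' is followed once each by 'p', 'l' and 'i',
# so each of them gets probability 1/3.
print(next_char_probs("apple pie", "p"))
```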

In [19]:
def bigram_freq(s, bi_freq=None):
    if bi_freq is None:
        bi_freq = {}
    for i in range(len(s)-1):
        sub = s[i:i+2]
        if sub in bi_freq:
            bi_freq[sub] += 1
        else:
            bi_freq[sub] = 1
    return bi_freq
    
print(bigram_freq('apple', bigram_freq('apple')))

{'ap': 2, 'pp': 2, 'pl': 2, 'le': 2}


## 3.4 Using your previous English collection compute bigram frequencies.

What are the 10 most common and 10 least common bigrams?

In [20]:
english_bifreq = bigram_freq(english_content)
print("most common:")
n_most_frequent(english_bifreq, 10, True)
print("least common:")
n_most_frequent(english_bifreq, 10, False)

most common:
('e ', 12734)
(' t', 9166)
('th', 8991)
('he', 8425)
('s ', 8262)
('n ', 7533)
(' a', 7413)
('in', 7384)
('an', 7180)
('d ', 6956)
least common:
('(;', 1)
('ˈm', 1)
('mɒ', 1)
('ɒɟ', 1)
('ɟɒ', 1)
('ɒr', 1)
('ːɡ', 1)
('ɡ]', 1)
('o–', 1)
('H"', 1)


## \*3.5 Define a function that takes two parameters: a string and an integer N and returns the N-gram frequencies of the string. For $N=2$ the function works the same as in the previous example.

Try the function for $N=1..5$. How many unique N-grams are in your collection?

In [21]:
def ngram_freq(s, n, n_freq=None):
    if n_freq is None:
        n_freq = {}
    for i in range(len(s)-n+1):
        sub = s[i:i+n]
        if sub in n_freq:
            n_freq[sub] += 1
        else:
            n_freq[sub] = 1
    return n_freq

def unique_ngrams(s, n):
    unique_count = 0
    stats = ngram_freq(s, n)
    for key, value in stats.items():
        if value == 1:
            unique_count += 1
    return unique_count
    
for i in range(1, 6):
    print("Count of unique", i, "-grams: ", unique_ngrams(english_content, i))
    

Count of unique 1 -grams:  34
Count of unique 2 -grams:  619
Count of unique 3 -grams:  4993
Count of unique 4 -grams:  20776
Count of unique 5 -grams:  52681
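
The counts above are the numbers of N-grams that occur exactly once. If "unique" is read as "distinct" instead, the answer is the number of keys in the frequency dictionary. The sketch below reports both readings side by side; the name `ngram_type_counts` is hypothetical.

```python
from collections import Counter


def ngram_type_counts(s, n):
    """Return (number of distinct n-grams, number occurring exactly once)."""
    counts = Counter(s[i:i + n] for i in range(len(s) - n + 1))
    distinct = len(counts)
    once_only = sum(1 for c in counts.values() if c == 1)
    return distinct, once_only


# "banana" has the bigrams ba, an, na, an, na:
# three distinct ones, of which only "ba" occurs once.
print(ngram_type_counts("banana", 2))  # (3, 1)
```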


## 3.6 Compute the same statistics for your native language.

In [22]:
for i in range(1, 6):
    print("Count of unique", i, "-grams: ", unique_ngrams(hungarian_content, i))

Count of unique 1 -grams:  16
Count of unique 2 -grams:  462
Count of unique 3 -grams:  5175
Count of unique 4 -grams:  22580
Count of unique 5 -grams:  50922


In [33]:
from numpy.random import choice

text = '. '
N=4
ngrams = ngram_freq(english_content, N)

for i in range(2):
    possible_next = []
    ending = text[-N+1:]
    if len(ending) < N:
        char_num = len(ending)
    else:
        char_num = N
    for ngram in ngrams:
        if (ngram[:char_num] == ending):
            possible_next.append(ngram)
    print(possible_next)


['. It', '. Wi', '. Th', '. Hu', '. Ma', '. Hi', '. By', '. Fo', '. On', '. As', '. We', '. Ac', '. Fr', '. He', '. A ', '. La', '. Af', '. Bo', '. In', '. Un', '. Ap', '. Kö', '. Up', '. Ki', '. Ov', '. Si', '. Ab', '. Am', '. Or', '. Co', '. To', '. Ec', '. Au', '. Ro', '. De', '. Ká', '. Le', '. Sz', '. Du', '. Ne', '. Ru', '. 20', '. So', '. Na', '. Rá', '. Be', '. Al', '. Jó', '. Li', '. Av', '. Te', '. 19', '. Tr', '. Im', '. Ho', '. El', '. Se', '. Bu', '. 13', '. Fa', '. Mo', '. Ot', '. Lo', '. Po', '. 1,', '. Ou', '. En', '. Sm', '. St', '. 10', '. Je', '. Ca', '. No', '. Wh', '. Pr', '. An', '. Pl', '. 28', '. Ta', '. Bé', '. Is', '. 79', '. Ol', '. Ja', '. Zr', '. (S', '. Já', '. Di', '. Pe', '. Pa', '. Gu', '. Sp', '. Gr', '. Wo', '. Ea', '. Fl', '. At', '. Ce', '. \nS', '. GD', '. Op', '. Fu', '. Re', '. 18', '. Et', '. Sw', '. Wa', '. Fi', '. Of', '. Sn', '. Su', '. 31', '. Va', '. Mi', '. Ar', '. Pu', '. Nu', '. 30', '. 39', '. BS', '. Ga', '. Go', '. Ri', '. Sa', '. Ad'
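
The cell above collects the N-grams that could continue the current text. One way such candidates might be used to generate text is sketched below. This is an assumption about where the cell is heading, not part of the original solution: the function name `generate`, its parameters, and the toy corpus are hypothetical, and it uses the standard library's `random.choices` instead of `numpy.random.choice`. It assumes `n >= 2`.

```python
import random
from collections import Counter


def generate(corpus, seed, n=4, length=40, rng=None):
    """Extend seed one character at a time using n-gram counts from corpus."""
    rng = rng or random.Random()
    ngrams = Counter(corpus[i:i + n] for i in range(len(corpus) - n + 1))
    text = seed
    while len(text) < length:
        context = text[-(n - 1):]  # at most the last n-1 characters
        # Candidate n-grams are those that start with the current context.
        candidates = [(g, c) for g, c in ngrams.items() if g.startswith(context)]
        if not candidates:
            break  # no known continuation
        grams, weights = zip(*candidates)
        # Sample a candidate with probability proportional to its count.
        chosen = rng.choices(grams, weights=weights, k=1)[0]
        text += chosen[len(context)]  # append only the next character
    return text


toy_corpus = "the cat sat on the mat. the cat ate. "
print(generate(toy_corpus, "th", n=3, length=20, rng=random.Random(0)))
```

Passing a seeded `random.Random` instance makes a run reproducible.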