# Exercise 1


We have seen how to implement a new iterator: it is written just like a function, but with a `yield` statement in place of `return`. This model of computation is called **continuation**. It is very useful in combinatorics, especially when combined with recursion (*Computational Mathematics with SageMath, SIAM, 2019, p. 346*). Below is an iterator called `generateWords(alphabet,L)` that generates all words of a given length `L` on a given `alphabet`.

Your task is simple! 

- Understand what the following iterator does, using the comments in the code and the earlier explanations.
- Understand how we compute the number of words of length `L`, for `L` equal to 3 and then 23, using `sum`:
  - via list comprehension
  - via generator expression
- You *don't need to change any of the code in the next 4 cells; just understand it*.
- Finally, explain, by choosing the right answer below, why the list comprehension takes longer to compute than the generator expression, as is evident from the `Wall time` (see [Wall Time](https://en.wikipedia.org/wiki/Elapsed_real_time); it is just the elapsed real time from the start to the end of a computation).

---

```
%%time 
# time for list comprehension to compute the sum of 2^23 ones, i.e. sum([1,1,...,1])
sumFromListCom = sum( [ 1 for w in generateWords(['H','T'], 23) ]  ) 
```
will result in output:
```
CPU times: user 6.94 s, sys: 200 ms, total: 7.14 s
Wall time: 7.11 s
```
---

---
```
%%time 
# time for generator expression to compute the sum of 2^23 ones, produced one at a time
sumFromGenEx = sum( ( 1 for w in generateWords(['H','T'], 23) )  ) 
```
will result in output:
```
CPU times: user 5.51 s, sys: 0 ns, total: 5.51 s
Wall time: 5.52 s
```
---

(You may get slightly different numbers for `CPU times` and `Wall time`, depending on your machine and its load at the time of computation.)

**Multiple-choice Question:**

- Why is the `Wall time` for the generator expression (genex) smaller than for the list comprehension (listcomp) here?

**Answer Choices**

- **A.** genex is faster because the individual words are not allocated space in memory, i.e., materialised in memory
- **B.** listcomp is slower because the list of all words is allocated space in memory
- **C.** both **A** and **B** are true


In [1]:
choiceForProblem0 = 'C' 


- **List Comprehension** (`listcomp`): A list comprehension builds its entire result list in memory before `sum` sees any of it. Here that list has one entry for each of the $2^{23} = 8388608$ words of length `23` on the alphabet `['H', 'T']`, and allocating and filling it takes a significant amount of memory and time.

- **Generator Expression** (`genex`): A generator expression, on the other hand, produces one entry at a time and hands it straight to `sum`, so the full list is never stored. Each word is computed on demand, which uses less memory and, in the timings above, gives a smaller wall time.

- **C. both A and B are true**: The generator expression is faster because it does not allocate space in memory for all entries at once, and the list comprehension is slower because it stores them all in memory before summing.
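The memory difference described above can be checked directly. The following is a minimal sketch, assuming a smaller word length of `10` so that it runs quickly, and using `sys.getsizeof`, which reports only the size of the container object itself (not of the objects it refers to). The variable names `as_list`, `as_gen`, `list_bytes` and `gen_bytes` are illustrative and not part of the exercise.

```python
import sys

def generateWords(alphabet, L):
    '''Yield all words of length L on the given alphabet (same iterator as in the exercise).'''
    if L == 0:
        yield []
    else:
        for word in generateWords(alphabet, L - 1):
            for letter in alphabet:
                yield word + [letter]

n = 10  # assumed small length so the sketch runs quickly

as_list = [1 for w in generateWords(['H', 'T'], n)]  # all 2**n entries are stored at once
as_gen = (1 for w in generateWords(['H', 'T'], n))   # nothing is computed until it is consumed

list_bytes = sys.getsizeof(as_list)  # grows with the number of entries
gen_bytes = sys.getsizeof(as_gen)    # small and independent of n

print(len(as_list), list_bytes, gen_bytes)
print(sum(as_list) == sum(as_gen) == 2**n)  # both approaches count the same number of words
```

The exact byte counts depend on the Python version and platform, so only their relative sizes are meaningful here.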


In [3]:
# This cell is to help you make the right choice between A, B and C
def generateWords(alphabet, L):
    if L == 0:
        yield []
    else:
        for word in generateWords(alphabet, L-1): # here is the recursion when we call the iterator again on L-1
            for letter in alphabet: # 'letter' avoids re-using the name of the parameter L
                yield word + [letter]

print( [ w for w in generateWords(['H','T'], 3) ] )# now call the iterator to find all words of length 3 in ['H','T']

print( sum( [ 1 for w in generateWords(['H','T'], 3) ]  )) # these words can then be counted by list comprehension
print( sum( ( 1 for w in generateWords(['H','T'], 3) )  )) # these words can then be counted by generator expression

print( 'The number of words of length 3 from an alphabet of size 2 is 2^3 = ', 2**3) # the above sums make sense; note that ** is the power operator in Python, while ^ is bitwise XOR

[['H', 'H', 'H'], ['H', 'H', 'T'], ['H', 'T', 'H'], ['H', 'T', 'T'], ['T', 'H', 'H'], ['T', 'H', 'T'], ['T', 'T', 'H'], ['T', 'T', 'T']]
8
8
The number of words of length 3 from an alphabet of size 2 is 2^3 =  8


# Exercise 2


Recall how we downloaded *Pride and Prejudice* and processed it as a String and split it by `Chapter`s. These code snippets are at our disposal now - all we need to do is copy-paste the right set of cells from earlier into the cells below here to have the string from that Book for more refined processing.

Think about what algorithmic constructs and methods one will need to `split` each sentence by the **English words** it contains and then count the number of each distinct word.

You have now understood `for` loops, `list` comprehensions and anonymous `function`s, and you have already seen the `dictionary` data structure and how to count the number of ball labels. You can also learn about the needed methods on strings for splitting: add a `.` after a `str` and hit the `Tab` button to look through the existing methods, and follow a method name by `?` to see its docstring. With these, you are ready for the problem stated below. If you attended the lab then you have an advantage if you tried to work on this with some help from your instructors.
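As a warm-up for the constructs mentioned above, here is a minimal sketch of splitting a string into words and counting each distinct word with a `dict`. The sample sentence and the names `sentence`, `words` and `counts` are made up for illustration; they are not taken from the book or from earlier cells.

```python
# a made-up sample sentence (hypothetical input, not from the book)
sentence = "The cat saw the dog, and the dog saw the cat."

# lower-case, strip the two punctuation marks that occur here, and split on whitespace
words = sentence.lower().replace(',', '').replace('.', '').split()

# count each distinct word with a dictionary
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1  # get returns 0 for a word not seen before

print(counts)
```

Here `str.split()` with no argument splits on any run of whitespace, and `dict.get(key, default)` avoids a separate `if word in counts` test.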

**Problem:** Process the English words in a text file, such as those in the book *Pride and Prejudice* by Jane Austen, and obtain the top `K` most frequent *words that are longer than* a given parameter `wordLongerThan`, which can be any value in $\mathbb{Z}_+ := \{ 0, 1, 2, 3, 4, \ldots \}$, i.e., *words that are more than* `wordLongerThan` characters in length.

Your function must be generic and named as follows including input parameter order and names: 

- `frequencyOftheKMostCommonWordsIn(thisTextFile, wordLongerThan, K)`

This function must be capable of:
- reading any available text file in the `data/` directory that can be passed as the parameter `thisTextFile` 
- and return a `dict` type whose:
  - key is a word whose character length is greater than the parameter `wordLongerThan` and 
  - value is the frequency of this word in the text file. 
  - Your returned `dict` should contain only the top `K` most frequent words longer than `wordLongerThan`, already sorted in descending order of frequency.

Use the next cell to submit your answer and for rough-work use more cells as needed in order to copy-paste code snippets from earlier content to get this working. But please remove the cells for rough-work when done.

*Note that you may not import libraries that have not been introduced in the course so far.*

In [6]:
import string
# Report these variables so the exam can be calibrated fairly - your report will be used to justify exam-difficulty
timeToCompleteThisProblemInMinutes = 0 # replace 0 by a positive integer if it applies

# Do NOT change the name of the function and names of parameters !

thisTextFile = 'data/pride_and_prejudice.txt' # try a text file in data/ directory
wordLongerThan = 0 # this can be any larger integer also
K = 20 # this can be any integer larger than 0 also

def frequencyOftheKMostCommonWordsIn(thisTextFile, wordLongerThan, K):
    '''Return a dict of the K most frequent words longer than wordLongerThan characters in thisTextFile,
    mapping each word to its frequency, in descending order of frequency.'''
    with open(thisTextFile, 'r') as file:
        text = file.read()
    
    # Remove punctuation and split by spaces
    translator = str.maketrans('', '', string.punctuation)
    words = text.translate(translator).lower().split()
    
    # Filter words longer than 'wordLongerThan'
    filtered_words = [word for word in words if len(word) > wordLongerThan]
    
    # Count the frequency of each word
    word_freq = {}
    for word in filtered_words:
        if word in word_freq:
            word_freq[word] += 1
        else:
            word_freq[word] = 1
    
    # Sort the words by frequency in descending order and get top K
    sorted_word_freq = sorted(word_freq.items(), key=lambda x: x[1], reverse=True)
    
    # Return top K words and their frequencies as a dictionary
    top_k_words = dict(sorted_word_freq[:K])
    return top_k_words

print(frequencyOftheKMostCommonWordsIn(thisTextFile, wordLongerThan, K))

{'the': 4479, 'to': 4189, 'of': 3707, 'and': 3507, 'her': 2191, 'a': 1983, 'in': 1904, 'was': 1835, 'i': 1751, 'she': 1684, 'that': 1517, 'it': 1431, 'not': 1424, 'he': 1307, 'you': 1271, 'his': 1259, 'be': 1251, 'as': 1175, 'had': 1173, 'with': 1090}


# Exercise 3


Recall the problem above on counting the number of votes by party across all of Sweden from the **Swedish 2018 National Election Data**.

Your task is to adapt the code snippets there and others we have encountered thus far to count the total number of votes by each **district** and return a `list` of `Integers` giving the number of votes for the top `K` districts with the most votes. Your function `numberOfVotesInKMostVotedDistrictsInSE('data/final.csv', K)` should work for any valid integer `K`. 

*Note that you may not import libraries that have not been introduced in the course so far.*

---
*unzip issues:* If you are unable to call `unzip final.csv.zip` on your Windows laptop, you can either do it in the computer lab or, with internet access, download the large `final.csv` file as follows:

```
%%sh
cd data
 
curl -O http://lamastex.org/datasets/public/elections/2018/sv/final.csv
```

Then you should have the needed `data/final.csv` file.

---

In [7]:
import csv
# Report these variables so the exam can be calibrated fairly - your report will be used to justify exam-difficulty
timeToCompleteThisProblemInMinutes = 0 # replace 0 by a positive integer if it applies

# Do NOT change the name of the function and names of parameters !

K = 20 # this can be any integer larger than 0 also, change K and make sure your function works
filename = 'data/final.csv' # this has to be a csv file with the same structure as our final.csv

def numberOfVotesInKMostVotedDistrictsInSE(filename, K):
    '''Return a list with the total number of votes in each of the K districts with the most votes,
    in descending order, computed from the csv file filename.'''
    # Read the CSV file and extract the necessary columns
    district_votes = {}
    
    with open(filename, 'r') as file:
        reader = csv.DictReader(file)
        
        for row in reader:
            district = row['district']  # Assuming the 'district' column exists
            votes = int(row['votes'])  # Assuming the 'votes' column exists
            
            # Sum the votes by district
            if district in district_votes:
                district_votes[district] += votes
            else:
                district_votes[district] = votes
    
    # Sort the districts by the total number of votes in descending order
    sorted_districts = sorted(district_votes.items(), key=lambda x: x[1], reverse=True)
    
    # Extract the top K districts and their vote counts
    top_k_votes = [votes for _, votes in sorted_districts[:K]]
    return top_k_votes

print(numberOfVotesInKMostVotedDistrictsInSE(filename, K))

[13435, 10625, 7910, 7094, 6182, 6118, 6022, 5454, 5286, 4919, 4916, 4839, 4728, 4709, 4411, 4399, 4300, 4146, 4137, 4132]



A disadvantage of using a list comprehension is that we cannot generate a very large number of random numbers, because the whole returned list has to be stored. Since you know about generators, your task is to use the following warm-up on generating natural numbers to write an iterator version, called `lcg`, of the function `LinConGen` we have been using thus far.

In [9]:
def naturals():
    '''define the countably infinite set of natural numbers using an iterator'''
    n = 1 # the first natural number 1
    while True: # an infinite while loop
        yield n # output n
        n = n + 1 # increment n by 1   

In [10]:
# Example run - keep printing the natural numbers using the iterator until we hit 5
for n in naturals():
    print(n)
    if n >= 5:
        break

1
2
3
4
5


In [11]:
# printing next from our iterator
generateNaturals = naturals() # let's assign our iterator
print(next(generateNaturals))
print(next(generateNaturals))

1
2


In [12]:
list(zip(naturals(), ['a', 'b', 'c', 'd'])) # the second list stops at 4 to give an enumeration that ends

[(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]

In [14]:
# Here is the actual task 
# just replace XXX with the right values to make an iterator of function LinConGen
#def lcg(m, a, c, x0):
    #x = XXX
    #while True:
        #yield XXX
        #x = XXX

def lcg(m, a, c, x0):
    '''Linear Congruential Generator (LCG) as an iterator using yield'''
    x = x0  # Initialize with the seed (x0)
    while True:
        yield x  # Yield the current value of x
        x = (a * x + c) % m  # Update x using the LCG formula
