# DSGA1007 - Programming for Data Science Lab
- Comprehensions (List, Dictionaries, Sets)
- Lambda Functions
- Functional Programming
- Verify program correctness (some test cases / assertions)

In [1]:
import numpy as np
import string
from numpy import random
from functools import reduce
from operator import add
from collections import Counter

# Let's get to work right away!
### <font color='blue'>Exercise 1 – Alphabet count</font>
In this exercise you will implement a counter using the intuition behind the MapReduce paradigm, which is inspired by the **map** and **reduce** functions from functional programming that you saw in lecture.

Follow the instructions in the code comments where necessary.
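
As a quick refresher, the sketch below applies `map` and `reduce` to a toy list. The names `numbers`, `squares`, and `total` are illustrative only and are not part of the exercise.

```python
from functools import reduce
from operator import add

numbers = [1, 2, 3, 4]

# map applies a function to every element (lazily, so wrap it in list to inspect it)
squares = list(map(lambda x: x * x, numbers))

# reduce folds the sequence into a single value, here by repeated addition
total = reduce(add, squares, 0)

print(squares)  # [1, 4, 9, 16]
print(total)    # 30
```

The counter you build below follows the same shape: a map step that emits per-element values, then a reduce step that adds them up.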

In [2]:
alphabet = string.ascii_lowercase
# TODO: initialize with a random integer between 1 and 10 for each symbol (letter) in the alphabet
original_count_letters = None
# TODO: build a list in which each letter is included as many times as its count indicates
# e.g. counts [1, 2, 3] ==> ['a', 'b', 'b', 'c', 'c', 'c'] (shown for three letters only; do this for all letters)
random_letters = None

How long is the sequence of letters? Is this sequence always the same length? Why?

In [3]:
# Get length of sequence of random letters

134

Implement the assert_generation function as described by its docstring.

The idea is to validate that we generated the data correctly.
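
One possible way to think about this check, shown on a toy three-letter alphabet: tally the generated letters and compare each tally with the intended count. The helper name `check_counts` and the toy data are assumptions made for illustration, not the expected solution.

```python
from collections import Counter

def check_counts(counts, symbols, letters):
    """Hypothetical helper: True if letters holds counts[i] copies of symbols[i]."""
    observed = Counter(letters)  # missing symbols count as 0
    per_symbol_ok = all(observed[s] == c for s, c in zip(symbols, counts))
    # The total length must match too, which rules out stray extra symbols
    return per_symbol_ok and len(letters) == sum(counts)

toy_letters = ['a', 'b', 'b', 'c', 'c', 'c']
assert check_counts([1, 2, 3], 'abc', toy_letters)
assert not check_counts([1, 2, 2], 'abc', toy_letters)
```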

In [4]:
def assert_generation(original_count_letters, alphabet, random_letters):
    """Make sure that the generation process is correct.

    There should be original_count_letters[i] elements for alphabet[i]
    symbol in the random_letters list.
    
    Args:
        original_count_letters (list): contains number of elements for each alphabet symbol
        alphabet (str): Alphabet string
        random_letters (list): generated random letters
    
    Returns:
        bool: True if the assertion holds
    """
    pass

In [5]:
# This should run without raising exceptions
assert_generation(original_count_letters, alphabet, random_letters)

True

Shuffle the order of random_letters randomly.

In [6]:
# Get a shuffled version of random_letters
# Assign the original (unshuffled) version to a variable called original_random_letters

How would you assert that the result is different from the original list? What would the assert_generation function's result be for the shuffled version?
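
A minimal sketch of the idea on toy data, assuming NumPy's `Generator.permutation` is used for shuffling (the exercise does not prescribe a method). The fixed seed is there only to make the sketch repeatable.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(seed=0)  # seeded only for repeatability
original = ['a', 'b', 'b', 'c', 'c', 'c']

# permutation returns a shuffled copy and leaves its input untouched
shuffled = rng.permutation(original).tolist()

# The order may differ, but the multiset of letters is unchanged
assert Counter(shuffled) == Counter(original)
assert sorted(shuffled) == sorted(original)
assert original == ['a', 'b', 'b', 'c', 'c', 'c']
```

Note that a plain `shuffled != original` assertion can fail by chance, because a shuffle may return the original order. For a long list this is extremely unlikely, but it is not impossible.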

In [7]:
# Your assertion goes here!

Now we will start playing with the map, filter and reduce functions. The idea is to count the number of times each letter of the alphabet occurs in the shuffled list of generated random letters.

In the MapReduce programming model, the Map step maps input elements to (key, value) pairs. Each key is then assigned to a specific Reduce job, which processes all the elements sharing that key and normally computes a single result.

Inspired by this programming model, we will build:
- A **mapper** function that outputs (key, value) pairs, where each key is a letter of the alphabet and each value is a count that corresponds to that key.
- A **split_by_key** function that builds a list of lists in which each inner list corresponds to a specific key (letter).
- A **reducer** function that calculates the sum of the partial counts for each key.
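
The three steps can be sketched end to end on a toy input. This is only one possible shape, under stated assumptions: the `toy_` function names are hypothetical, and the inner lists are assumed to hold just the values (the exercise leaves open whether they hold values or full pairs).

```python
from functools import reduce
from operator import add

def toy_mapper(letters):
    # Emit a (key, value) pair with a count of 1 for every letter
    return map(lambda letter: (letter, 1), letters)

def toy_split_by_key(symbols, pairs):
    pairs = list(pairs)  # materialize: a map object can only be consumed once
    # One inner list per symbol, holding the values emitted for that key
    return [[value for key, value in pairs if key == symbol] for symbol in symbols]

def toy_reducer(groups):
    # Sum the partial counts in each group; 0 is the start value for empty groups
    return [reduce(add, group, 0) for group in groups]

letters = list('banana')
groups = toy_split_by_key('abn', toy_mapper(letters))
print(groups)               # [[1, 1, 1], [1], [1, 1]]
print(toy_reducer(groups))  # [3, 1, 2]
```

Materializing the mapper output inside `toy_split_by_key` matters: without it, the first symbol would exhaust the `map` iterator and every later group would come out empty.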

In [8]:
def mapper(letters):
    pass

In [9]:
# Check out what the result of the mapper is (or should be)
for (i, (k, v)) in enumerate(mapper(random_letters)):
    if i >= 10:
        break
    print('i:', i, '(key, value):',(k, v))

i: 0 (key, value): ('s', 1)
i: 1 (key, value): ('c', 1)
i: 2 (key, value): ('v', 1)
i: 3 (key, value): ('e', 1)
i: 4 (key, value): ('g', 1)
i: 5 (key, value): ('n', 1)
i: 6 (key, value): ('s', 1)
i: 7 (key, value): ('p', 1)
i: 8 (key, value): ('z', 1)
i: 9 (key, value): ('i', 1)


In [10]:
def split_by_key(alphabet, map_persist):
    """Try to solve it by using a list comprehension."""
    pass

In [11]:
splitted_by_keys = split_by_key(alphabet, mapper(random_letters))

In [12]:
def reducer(splitted_by_keys):
    pass

In [13]:
calculated_count_letters = reducer(splitted_by_keys)

Assert in two different ways that the result you obtained is correct.
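
For reference, the two styles look like this on toy data. The lists `expected` and `calculated` are placeholders for illustration, not the variables from the exercise.

```python
import numpy as np

expected = [3, 1, 2]
calculated = [3, 1, 2]

# Plain Python: list equality compares element by element
assert calculated == expected, "counts differ"

# NumPy: raises AssertionError with a report of the mismatched elements
np.testing.assert_array_equal(np.array(calculated), np.array(expected))
```

The NumPy variant is convenient for longer sequences because its error message shows where the arrays disagree.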

In [14]:
# Normal assertion goes here

In [15]:
# NumPy assertion goes here

### <font color='blue'>Exercise 2 – Science Fiction words</font>
The file 'pg31547.txt' contains the science fiction short story "Youth" by Isaac Asimov.

Read the file and assign a list of its lines to the variable story.

In [16]:
# Your code goes here, the variable story should hold a list with file lines

Use a list comprehension to build a list of words in the story.
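
A nested comprehension is one way to flatten lines into words. The sketch below uses a short in-memory string instead of the story file, and it assumes words are separated by whitespace, so punctuation stays attached to each word.

```python
text = "The quick brown fox\njumps over\n\nthe lazy dog."
lines = text.splitlines()

# Outer loop over lines, inner loop over whitespace-separated tokens;
# empty lines contribute nothing because ''.split() returns []
words = [word for line in lines for word in line.split()]
print(words)
```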

In [17]:
words = None  # TODO: get the words

In [18]:
unique_words = None  # TODO: from the words list, get a list containing the distinct/unique words
count_words = None  # TODO: use a comprehension to build a dictionary with word counts

In [19]:
# Check out what the result should look like
count_words

{'eBooks,': 2,
 'mold."': 1,
 'muttered,': 2,
 'alone,"': 1,
 'Dad': 11,
 'RIGHT': 1,
 "'Come": 1,
 'lad."': 2,
 'tradition.': 1,
 'XIV': 1,
 'brood."': 1,
 'each': 4,
 'Industry': 1,
 'height': 1,
 'position': 2,
 '"There\'ll': 1,
 'Lunch': 1,
 'Before': 1,
 'money': 3,
 'due': 1,
 'Real': 1,
 'appearance': 1,
 'License': 8,
 'especially,': 1,
 'denser': 1,
 'felt': 9,
 'methodically': 1,
 'Arcturian': 2,
 'used': 5,
 'laughed': 1,
 'workmanship': 1,
 'obtain': 3,
 'best': 3,
 'landed,': 1,
 'species,': 1,
 'build': 1,
 'vessel.': 1,
 'worlds,"': 1,
 'ashes': 1,
 'variety': 1,
 'Did': 1,
 'around.': 2,
 'chose': 1,
 'who': 10,
 'circus."': 1,
 'rimmed': 1,
 'read,': 1,
 'land."': 1,
 'controls': 1,
 'sure': 4,
 'charger': 1,
 'dry': 1,
 'Business!': 1,
 'cautious': 1,
 'on,': 3,
 'word': 2,
 'critical': 1,
 'violates': 1,
 'refund.': 2,
 'mastered': 1,
 'inner': 1,
 'annoyed.': 1,
 'berries': 2,
 '(any': 1,
 'publicity': 1,
 'repaired."': 1,
 'early?': 1,
 'right,': 3,
 'peace': 1,
 ...

Achieve the same result by using a Counter object.
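
To illustrate why the two approaches are interchangeable, here is a small comparison on a toy word list. The name `sample_words` and its contents are made up for this sketch.

```python
from collections import Counter

sample_words = ['to', 'be', 'or', 'not', 'to', 'be']

# Dictionary comprehension over the distinct words (list.count rescans the list per word)
by_comprehension = {word: sample_words.count(word) for word in set(sample_words)}

# Counter tallies in a single pass and is a dict subclass
by_counter = Counter(sample_words)

assert by_comprehension == by_counter
print(by_counter.most_common(2))  # [('to', 2), ('be', 2)]
```

Because `Counter` subclasses `dict`, it compares equal to a plain dictionary with the same keys and counts.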

In [20]:
count_words_alt = None  # TODO: alternative way to calculate the word counts

In [21]:
# This assertion should hold
assert count_words == count_words_alt