# Wordcount

- [Wikipedia](https://en.wikipedia.org/wiki/Word_count)

- The word count example reads text files and counts how often each word occurs.
- Translators commonly use word counts to determine the price of a translation job.
- This is the "Hello World" program of Big Data.

## Create sample text file

In [1]:
from lorem import text

with open("sample.txt", "w") as f:
    for i in range(2):
        f.write(text())

### Exercise 4.1

Write a Python program that counts the number of lines, different words and characters in that file.

In [2]:
%%bash
wc sample.txt
du -h sample.txt

  14  388 2758 sample.txt
4.0K	sample.txt


In [8]:
import os
 
def counter(fname):    
    num_words = 0
    num_lines = 0
    num_charc = 0
    num_spaces = 0
     
    with open(fname, 'r') as f:
        # line by line
        for line in f:
            line = line.strip(os.linesep) 
            
            wordslist = line.split() 
            
            num_lines = num_lines + 1
              
            num_words = num_words + len(wordslist)
              
            num_charc = num_charc + sum(1 for c in line if c not in (os.linesep, ' '))
              
            num_spaces = num_spaces + sum(1 for s in line if s in (os.linesep, ' '))
     
    print("Number of words in text file: ", num_words)
    print("Number of lines in text file: ", num_lines)
    print("Number of characters in text file: ", num_charc)
    print("Number of spaces in text file: ", num_spaces)

In [9]:
counter('sample.txt')

Number of words in text file:  388
Number of lines in text file:  15
Number of characters in text file:  2350
Number of spaces in text file:  380
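
Exercise 4.1 also asks for the number of different words, which the `counter` function above does not report. The following is a minimal sketch of one way to get it with a `set`. The name `file_stats` and the throwaway demo file are assumptions made for illustration, and punctuation is not stripped, so `amet` and `amet.` would count as two different words.

```python
import os
import tempfile

def file_stats(fname):
    """Return (lines, distinct words, characters) for a text file."""
    num_lines = 0
    num_chars = 0
    distinct_words = set()
    with open(fname, "r") as f:
        for line in f:
            num_lines += 1
            num_chars += len(line)  # includes the newline character
            # Lowercase so that "Dolor" and "dolor" count as one word.
            distinct_words.update(line.lower().split())
    return num_lines, len(distinct_words), num_chars

# Small demonstration on a throwaway file (hypothetical content).
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "demo.txt")
    with open(path, "w") as f:
        f.write("Dolor sit\ndolor amet\n")
    stats = file_stats(path)

print(stats)  # (2, 3, 21)
```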


### Exercise 4.2

Create a function called `map_words` that takes a file name as argument and returns a list containing all words as items.

```python
map_words("sample.txt")[:5] # first five words
['adipisci', 'adipisci', 'adipisci', 'adipisci', 'adipisci']
```

In [10]:
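import re

def map_words(filename):
    with open(filename, "r") as file:
        data = file.read()
    # Replace every run of characters other than a-z, space and apostrophe
    # with a space. Uppercase letters are not in the class, so they are
    # replaced too. The raw string avoids the invalid "\ " escape.
    text = re.sub(r"[^a-z ']+", " ", data)
    return text.split()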

In [11]:
map_words("sample.txt")[:5]

['t', 'consectetur', 'numquam', 'quaerat', 'st']
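
Fragments such as `'t'` and `'st'` in this output come from capitalised words: the character class only keeps lowercase letters, so the leading capital of `Ut` or `Est` is turned into a space. A minimal sketch of a variant that lowercases the text first is shown below. The helper name `split_words` is an assumption for illustration, and it works on a string rather than a file so that it can be tried directly.

```python
import re

def split_words(data):
    """Lowercase the text, then split it on anything but a-z, space and apostrophe."""
    return re.sub(r"[^a-z ']+", " ", data.lower()).split()

sentence = "Ut labore. Est dolorem"

# Without lowercasing, the capitals are dropped.
print(re.sub(r"[^a-z ']+", " ", sentence).split())  # ['t', 'labore', 'st', 'dolorem']

# With lowercasing, whole words are kept.
print(split_words(sentence))                        # ['ut', 'labore', 'est', 'dolorem']
```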

## Sorting a dictionary by value

By default, calling the `sorted` function on a `dict` sorts its keys.
 - To sort by values, you can use [operator](https://docs.python.org/3.6/library/operator.html).itemgetter(1), which returns a callable object that fetches item 1 from its operand using the operand’s `__getitem__()` method. Passed as the `key` argument of `sorted`, it orders key-value pairs by value.
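
As a quick check of the two statements above, this short sketch shows what `sorted` returns for a plain `dict` and what the callable built by `itemgetter(1)` does on a single pair. The dictionary `counts` is made up for the illustration.

```python
import operator

counts = {"pear": 5, "apple": 3, "orange": 1}

# Iterating over a dict yields its keys, so sorted() returns the sorted keys.
keys_in_order = sorted(counts)
print(keys_in_order)          # ['apple', 'orange', 'pear']

# itemgetter(1) behaves like `lambda pair: pair[1]`.
second = operator.itemgetter(1)
print(second(("apple", 3)))   # 3
```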

In [24]:
import operator
fruits = [('apple', 3), ('banana', 2), ('pear', 5), ('orange', 1)]
getcount = operator.itemgetter(1)
dict(sorted(fruits, key=getcount))

{'orange': 1, 'banana': 2, 'apple': 3, 'pear': 5}

The `sorted` function also has an optional `reverse` argument.

In [25]:
dict(sorted(fruits, key=getcount, reverse=True))

{'pear': 5, 'apple': 3, 'banana': 2, 'orange': 1}

### Exercise 4.3

Create a function `reduce` to reduce the list of words returned by `map_words` and return a dictionary containing all words as keys and number of occurrences as values.

```python
reduce(map_words('sample.txt'))
{'tempora': 2, 'non': 1, 'quisquam': 1, 'amet': 1, 'sit': 1}
```

You have probably noticed that this simple function is not so easy to implement. The Python standard library provides some features that can help.

In [15]:
import operator
words = map_words("sample.txt")
def reduce(a):
    k = {}
    for j in a:
        if j in k:
            k[j] += 1
        else:
            k[j] = 1
    return k

word_counts = reduce(words)

In [21]:
word_counts

{'t': 2,
 'consectetur': 18,
 'numquam': 18,
 'quaerat': 17,
 'st': 4,
 'labore': 15,
 'aliquam': 14,
 'olorem': 2,
 'adipisci': 15,
 'magnam': 14,
 'porro': 10,
 'neque': 15,
 'dolore': 15,
 'dolor': 12,
 'modi': 8,
 'amet': 10,
 'ut': 10,
 'met': 3,
 'dolorem': 8,
 'uaerat': 4,
 'ipsum': 16,
 'eius': 13,
 'it': 6,
 'tempora': 13,
 'non': 11,
 'voluptatem': 10,
 'dipisci': 1,
 'quisquam': 7,
 'oluptatem': 2,
 'quiquia': 8,
 'uiquia': 1,
 'velit': 5,
 'on': 3,
 'psum': 4,
 'empora': 2,
 'etincidunt': 12,
 'olore': 2,
 'orro': 5,
 'sed': 13,
 'olor': 2,
 'sit': 8,
 'est': 11,
 'ius': 3,
 'odi': 2,
 'elit': 1,
 'ed': 3,
 'uisquam': 1,
 'agnam': 4,
 'umquam': 2,
 'liquam': 3,
 'abore': 1}

In [23]:
getcount = operator.itemgetter(1)
dict(sorted(word_counts.items(), key=getcount,reverse=True))

{'consectetur': 18,
 'numquam': 18,
 'quaerat': 17,
 'ipsum': 16,
 'labore': 15,
 'adipisci': 15,
 'neque': 15,
 'dolore': 15,
 'aliquam': 14,
 'magnam': 14,
 'eius': 13,
 'tempora': 13,
 'sed': 13,
 'dolor': 12,
 'etincidunt': 12,
 'non': 11,
 'est': 11,
 'porro': 10,
 'amet': 10,
 'ut': 10,
 'voluptatem': 10,
 'modi': 8,
 'dolorem': 8,
 'quiquia': 8,
 'sit': 8,
 'quisquam': 7,
 'it': 6,
 'velit': 5,
 'orro': 5,
 'st': 4,
 'uaerat': 4,
 'psum': 4,
 'agnam': 4,
 'met': 3,
 'on': 3,
 'ius': 3,
 'ed': 3,
 'liquam': 3,
 't': 2,
 'olorem': 2,
 'oluptatem': 2,
 'empora': 2,
 'olore': 2,
 'olor': 2,
 'odi': 2,
 'umquam': 2,
 'dipisci': 1,
 'uiquia': 1,
 'elit': 1,
 'uisquam': 1,
 'abore': 1}

## Container datatypes

The `collections` module implements specialized container datatypes providing alternatives to Python’s general purpose built-in containers, `dict`, `list`, `set`, and `tuple`.

- `defaultdict`: dict subclass that calls a factory function to supply missing values
- `Counter`: dict subclass for counting hashable objects

### defaultdict

When you implemented the `reduce` function, you probably had some trouble adding key-value pairs to your `dict`. If you try to update the value of a key that is not present in the dict (for example with `+= 1`), the key is not created automatically and a `KeyError` is raised.

You can use a `try`/`except` block, but `defaultdict` is a simpler solution. This container is a `dict` subclass that calls a factory function to supply missing values.
For example, using `list` as the `default_factory`, it is easy to group a sequence of key-value pairs into a dictionary of lists:

In [32]:
from collections import defaultdict
s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]
d = defaultdict(list)
for k, v in s:
    d[k].append(v)

dict(d)

{'yellow': [1, 3], 'blue': [2, 4], 'red': [1]}
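
For comparison, the `try`/`except` approach mentioned above could look like the sketch below. The function name `reduce_try` is hypothetical. It relies only on the fact that incrementing a missing key raises `KeyError`.

```python
def reduce_try(words):
    """Count occurrences with a plain dict, handling missing keys explicitly."""
    counts = {}
    for word in words:
        try:
            counts[word] += 1      # fails the first time a word is seen
        except KeyError:
            counts[word] = 1       # so the key is created here
    return counts

print(reduce_try(["sit", "amet", "sit"]))  # {'sit': 2, 'amet': 1}
```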

### Exercise 4.4

- Modify the `reduce` function you wrote above by using a defaultdict with the most suitable factory.

In [38]:
from collections import defaultdict
def reduce_update(a):
    result = defaultdict(int)
    for v in a:
        result[v] = result[v] + 1
    return dict(result)

In [39]:
word_counts = reduce_update(words)

In [41]:
dict(sorted(word_counts.items(), key=getcount,reverse=True))

{'consectetur': 18,
 'numquam': 18,
 'quaerat': 17,
 'ipsum': 16,
 'labore': 15,
 'adipisci': 15,
 'neque': 15,
 'dolore': 15,
 'aliquam': 14,
 'magnam': 14,
 'eius': 13,
 'tempora': 13,
 'sed': 13,
 'dolor': 12,
 'etincidunt': 12,
 'non': 11,
 'est': 11,
 'porro': 10,
 'amet': 10,
 'ut': 10,
 'voluptatem': 10,
 'modi': 8,
 'dolorem': 8,
 'quiquia': 8,
 'sit': 8,
 'quisquam': 7,
 'it': 6,
 'velit': 5,
 'orro': 5,
 'st': 4,
 'uaerat': 4,
 'psum': 4,
 'agnam': 4,
 'met': 3,
 'on': 3,
 'ius': 3,
 'ed': 3,
 'liquam': 3,
 't': 2,
 'olorem': 2,
 'oluptatem': 2,
 'empora': 2,
 'olore': 2,
 'olor': 2,
 'odi': 2,
 'umquam': 2,
 'dipisci': 1,
 'uiquia': 1,
 'elit': 1,
 'uisquam': 1,
 'abore': 1}

### Counter

A Counter is a dict subclass for counting hashable objects. It is an unordered collection where elements are stored as dictionary keys and their counts are stored as dictionary values. Counts are allowed to be any integer value including zero or negative counts.

Elements are counted from an iterable or initialized from another mapping (or counter):

In [45]:
from collections import Counter
Counter([1, 4, 3, 2, 3, 3, 2, 1, 3, 4, 1, 2])

Counter({1: 3, 4: 2, 3: 4, 2: 3})
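
The remark that counts may be zero or negative can be checked with `Counter.subtract`. This is a small sketch with made-up data. It also shows that reading a missing key gives zero without adding the key.

```python
from collections import Counter

stock = Counter(apples=2, pears=1)
stock.subtract(apples=3, pears=1)   # counts may drop to zero or below

print(stock["apples"])    # -1
print(stock["pears"])     # 0
print(stock["kiwis"])     # 0: a missing key reads as zero
print("kiwis" in stock)   # False: the lookup did not create the key
```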

In [42]:
from collections import Counter

violet = dict(r=23,g=13,b=23)
print(violet)
cnt = Counter(violet)  # or Counter(r=23, g=13, b=23)
print(cnt['c'])
print(cnt['r'])

{'r': 23, 'g': 13, 'b': 23}
0
23


In [43]:
print(*cnt.elements())

r r r r r r r r r r r r r r r r r r r r r r r g g g g g g g g g g g g g b b b b b b b b b b b b b b b b b b b b b b b


In [44]:
print(cnt.elements())

<itertools.chain object at 0x0000022ED67A13D0>


In [35]:
cnt.most_common(2)

[('r', 23), ('b', 23)]

In [36]:
cnt.values()

dict_values([23, 13, 23])

### Exercise 4.5

Use a `Counter` object to count word occurrences in the sample text file.

In [46]:
def map_words_counter(filename):
    with open(filename, "r") as file:
        data = file.read()
    # Same cleaning as in map_words. The raw string avoids the invalid "\ " escape.
    text = re.sub(r"[^a-z ']+", " ", data)
    return Counter(text.split())

In [47]:
result = map_words_counter('sample.txt')

In [49]:
result.most_common(5)

[('consectetur', 18),
 ('numquam', 18),
 ('quaerat', 17),
 ('ipsum', 16),
 ('labore', 15)]

The Counter class is similar to bags or multisets in some Python libraries or other languages. We will see later how to use Counter-like objects in a parallel context.
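
The multiset behaviour can be illustrated with the arithmetic operators that `Counter` supports. The sketch below uses two small counters built from strings. The data is made up for the example.

```python
from collections import Counter

a = Counter("aab")    # a: 2, b: 1
b = Counter("abcc")   # a: 1, b: 1, c: 2

total = a + b         # add counts
difference = a - b    # subtract, keeping only positive counts
common = a & b        # intersection: minimum of each count
union = a | b         # union: maximum of each count

print(dict(total))       # {'a': 3, 'b': 2, 'c': 2}
print(dict(difference))  # {'a': 1}
print(dict(common))      # {'a': 1, 'b': 1}
print(dict(union))       # {'a': 2, 'b': 1, 'c': 2}
```

Adding counters in this way is one option for merging the word counts of several files.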

## Process multiple files

- Create several files containing `lorem` text named 'sample01.txt', 'sample02.txt'...
- Processing these files returns one dictionary per file.
- You have to loop over them to sum the occurrences and return the resulting dict. To iterate over several mappings, the Python standard library provides some useful features in the `itertools` module.
- [itertools.chain(*mapped_values)](https://docs.python.org/3.6/library/itertools.html#itertools.chain) can be used to treat consecutive sequences as a single sequence.

In [37]:
import itertools, operator
fruits = [('apple', 3), ('banana', 2), ('pear', 5), ('orange', 1)]
vegetables = [('endive', 2), ('spinach', 1), ('celery', 5), ('carrot', 4)]
getcount = operator.itemgetter(1)
dict(sorted(itertools.chain(fruits,vegetables), key=getcount))

{'orange': 1,
 'spinach': 1,
 'banana': 2,
 'endive': 2,
 'apple': 3,
 'carrot': 4,
 'pear': 5,
 'celery': 5}

### Exercise 4.6

- Write a program that creates the files, processes them and uses `itertools.chain` to get the merged word count dictionary.
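
One possible approach to this exercise is sketched below, under a few assumptions: the files are written to a temporary folder, a short fixed string stands in for the output of `lorem.text()` so that the result is reproducible, and `map_words` is redefined here in a lowercasing variant. The name `merged_word_count` is hypothetical.

```python
import itertools
import re
import tempfile
from collections import Counter
from pathlib import Path

def map_words(filename):
    """Return the lowercase words of a text file as a list."""
    return re.findall(r"[a-z']+", Path(filename).read_text().lower())

def merged_word_count(filenames):
    """Chain the word lists of all files and count them in a single pass."""
    word_lists = [map_words(name) for name in filenames]
    return dict(Counter(itertools.chain(*word_lists)))

# Placeholder texts standing in for lorem.text().
samples = ["Dolor sit amet. Dolor ipsum.", "Ipsum dolor est."]

with tempfile.TemporaryDirectory() as folder:
    filenames = []
    for i, content in enumerate(samples, start=1):
        path = Path(folder) / f"sample{i:02d}.txt"
        path.write_text(content)
        filenames.append(path)
    merged = merged_word_count(filenames)

print(merged)  # {'dolor': 3, 'sit': 1, 'amet': 1, 'ipsum': 2, 'est': 1}
```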

### Exercise 4.7

- Create the `wordcount` function so that it accepts several files as arguments and returns the resulting dict.

```
wordcount(file1, file2, file3, ...)
```

[Hint: arbitrary argument lists](https://docs.python.org/3/tutorial/controlflow.html#arbitrary-argument-lists)

- Example of an arbitrary argument list and arbitrary keyword arguments.

In [38]:
def func(*args, **kwargs):
    for arg in args:
        print(arg)

    print(kwargs)

func("3", [1, 2], "bonjour", x=4, y="y")

3
[1, 2]
bonjour
{'x': 4, 'y': 'y'}
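
Building on the `*args` example, a `wordcount` function for Exercise 4.7 might be sketched as follows. This is one possible solution, not the only one: it accumulates into a single `Counter` with `update` instead of chaining lists, lowercases the text before splitting, and is demonstrated on two throwaway files with made-up content.

```python
import os
import re
import tempfile
from collections import Counter

def wordcount(*filenames):
    """Count words across any number of files and return a plain dict."""
    total = Counter()
    for name in filenames:
        with open(name, "r") as f:
            # Lowercase first, then keep runs of letters and apostrophes.
            total.update(re.findall(r"[a-z']+", f.read().lower()))
    return dict(total)

with tempfile.TemporaryDirectory() as folder:
    contents = {"a.txt": "Sit amet sit", "b.txt": "amet dolor"}
    paths = []
    for name, content in contents.items():
        path = os.path.join(folder, name)
        with open(path, "w") as f:
            f.write(content)
        paths.append(path)
    result = wordcount(*paths)

print(result)  # {'sit': 2, 'amet': 2, 'dolor': 1}
```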
