# `collections` Module

## Days 1/3 (My data used with the examples)

In [5]:
# Imports and settings

from collections import defaultdict, namedtuple, Counter, deque
import csv
import random
from urllib.request import urlretrieve


### 1. `namedtuple`
> A convenient way to define a class without methods; it stores record-like data whose fields can be accessed as attributes with dot notation

#### For contrast, here are some samples of a classic tuple and f-strings

In [6]:
# example of a classic tuple
user = ('katie', 'coder')

# We have to use ugly code to output it; see below for an example of an f-string (Python 3.6 and later)

f'{user[0]} is a {user[1]}'



'katie is a coder'

In [7]:
# More f-string fun

f'this is an easier way to print'

'this is an easier way to print'

In [8]:
f'{user[0]} loves dogs'

'katie loves dogs'

#### Now, named tuples below

In [11]:
# First, set up the named tuple
User = namedtuple('User', 'name role')

# Now, define an instance of the named tuple

user = User(name='Katie', role='Coder')


In [12]:
# Now we can access using dot notation

user.name

'Katie'

In [13]:
user.role

'Coder'

In [14]:
# More practice

user_2 = User(name='Scooter', role='Dog')

user_2.name

'Scooter'

In [16]:
# f-string output with named tuples

f'{user.name} has a dog named {user_2.name}'

'Katie has a dog named Scooter'
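
A named tuple is still an ordinary tuple underneath, so indexing and unpacking keep working next to the dot notation. The sketch below reuses the `User` definition from above; the `'Lead Coder'` and `'Manager'` values are made up for illustration.

```python
from collections import namedtuple

User = namedtuple('User', 'name role')
user = User(name='Katie', role='Coder')

# A namedtuple is still a tuple: indexing and unpacking keep working.
name, role = user
first_field = user[0]

# _asdict() returns the fields as a dictionary.
as_dict = user._asdict()

# Instances are immutable; _replace() returns a modified copy instead.
promoted = user._replace(role='Lead Coder')

# Assigning to a field raises AttributeError.
try:
    user.role = 'Manager'
except AttributeError:
    assignment_failed = True
else:
    assignment_failed = False
```

Because instances cannot be changed in place, `_replace` is the usual way to get an updated record.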

### 2. `defaultdict`
> Eliminates the `KeyError` raised when you look up a key that is not in the dictionary, without having to use the `get` method

Useful for nested data structures or any situation where a key may not be there

In [17]:
# Create a dictionary from a collection

# define collection
challenges_complete = [('Katie', 11), ('Scooter', 12), ('Claire', 9),
                      ('Jeff', 4), ('Matt', 7), ('Chris', 1)]

challenges_complete


[('Katie', 11),
 ('Scooter', 12),
 ('Claire', 9),
 ('Jeff', 4),
 ('Matt', 7),
 ('Chris', 1)]

In [18]:
# blank dictionary
challenges = {}

# This fails with a KeyError; the reason is explained below
for name, challenge in challenges_complete:
    challenges[name].append(challenge)

challenges
    

KeyError: 'Katie'

We get the `KeyError` above because on the first pass through the loop the dictionary `challenges` is empty: 'Katie' is not in it yet, so `challenges[name]` raises before `append` is ever called. We can get around this with `defaultdict`.

##### Now do the same with `defaultdict`, which works

In [19]:
challenges = defaultdict(list)

for name, challenge in challenges_complete:
    challenges[name].append(challenge)

challenges

defaultdict(list,
            {'Katie': [11],
             'Scooter': [12],
             'Claire': [9],
             'Jeff': [4],
             'Matt': [7],
             'Chris': [1]})
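
To make the mechanism a bit more concrete, here is a small sketch of what happens on a missing key. The extra score `3` and the names `'Nobody'` and `'Ghost'` are made up for illustration.

```python
from collections import defaultdict

challenges = defaultdict(list)

# Looking up a missing key calls list() and stores the result under that key.
challenges['Katie'].append(11)
challenges['Katie'].append(3)

# Side effect: merely reading a missing key also inserts it.
empty = challenges['Nobody']
keys_after_read = list(challenges)

# Membership tests and .get() do not trigger the factory.
has_ghost = 'Ghost' in challenges
ghost = challenges.get('Ghost')

# With a plain dict, setdefault() gives the same grouping behaviour.
plain = {}
for name, done in [('Katie', 11), ('Katie', 3), ('Jeff', 4)]:
    plain.setdefault(name, []).append(done)
```

The insert-on-read behaviour is worth keeping in mind: use `in` or `.get()` when you only want to check for a key.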

### 3. `Counter`

In [21]:
medium_article = """ One of the problems with early game streaming services was that they were, in essence, streaming video of a game directly from a powerful computer and nothing more. They relied on streaming technologies designed for one-way video streams of content to your machine — think Netflix, for example. But streaming video games requires a fast, two-way connection — the game needs to make its way from the server to your screen, and any inputs you make must then be communicated back to the server. Then it streams back the reaction to what you just did, and the cycle continues.

The internet has come a long way since these first services were imagined. Stadia, for example, uses a standard called WebRTC, which is largely known for making video calls faster and reducing latency by connecting people’s computers directly to one another, rather than routing them through a centralized server. You’ll still need a good internet connection for this to work properly (more on that in a second), but the technology is better equipped to mitigate latency issues than previous efforts like OnLive were.

By using WebRTC along with other standards like QUIC and BBR, which further reduce latency and connection congestion, Stadia is able to make an end run around a lot of the problems encountered by early services, because the video stream and controller commands aren’t intricately linked together.

When your connection gets flaky — as even the best broadband definitely will sometimes — the stream will still feel responsive. Instead of glitching around the screen like in traditional multiplayer games — itself a major problem that can make games unplayable, as you miss your targets and those with better connections destroy you — the stream will simply drop video quality to compensate, becoming pixelated or distorted as when YouTube has buffering issues. Because the congestion is being managed by these new protocols and your controller commands are processed separately from the video stream, your character still responds, and as the connection recovers, the video quality ramps back up again.

The real magic, however — and what makes me so optimistic about Stadia — is that no additional software will be required. Google promises that you’ll be able to play on any device that has a Chrome browser, with a single click. If it works — and early reports indicate it very well could — Stadia will completely transform how games are experienced on every level."""

In [22]:
# use split() to break the text into a list of words on whitespace
medium_article = medium_article.split()
medium_article[:5]

['One', 'of', 'the', 'problems', 'with']

In [26]:
# Last item

medium_article[-1]

'level.'

##### Easily get the five most common words

In [27]:
Counter(medium_article).most_common(5)

[('the', 17), ('a', 12), ('and', 12), ('to', 11), ('—', 10)]
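
Note that a plain `split()` keeps punctuation attached (`'level.'`) and counts the dash as a word. One possible cleanup, sketched on a short stand-in sentence rather than the article above, is to strip punctuation and lowercase each token before counting.

```python
import string
from collections import Counter

# Short stand-in text (not the article above) so the result is easy to check.
sample = "The stream drops quality. The stream recovers \u2014 the game stays responsive."

# Punctuation to trim from the ends of each token, including the long dash.
strip_chars = string.punctuation + '\u2014'

words = []
for token in sample.split():
    word = token.strip(strip_chars).lower()
    if word:  # a lone dash becomes an empty string and is skipped
        words.append(word)

word_counts = Counter(words)
top_two = word_counts.most_common(2)
```

Here `top_two` comes out as `[('the', 3), ('stream', 2)]`, since `'The'` and `'the'` are now counted together.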

### 4. `deque`

> `deque` stands for double-ended queue; it is a generalization of stacks and queues
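
As a quick illustration of the stack and queue behaviour, here is a minimal sketch with made-up values. It only uses the standard `deque` methods `append`, `appendleft`, `pop`, `popleft` and the `maxlen` argument.

```python
from collections import deque

# Queue (first in, first out): append on the right, pop from the left.
queue = deque(['a', 'b'])
queue.append('c')
first_out = queue.popleft()

# Stack (last in, first out): append and pop on the same side.
stack = deque()
stack.append(1)
stack.append(2)
last_in = stack.pop()

# appendleft() adds to the front without shifting the other items.
queue.appendleft('z')

# With maxlen set, the oldest items fall off the opposite end.
recent = deque(maxlen=3)
for n in range(5):
    recent.append(n)
```

After this runs, `queue` holds `'z', 'b', 'c'` and `recent` holds only the last three numbers, `2, 3, 4`.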

#### Time it!
> use IPython's `%timeit` magic (built on the `timeit` module) to measure performance

In [28]:
# create two sequences of 10 million integers with range, storing one in a list and the other in a deque:

lst = list(range(10000000))

deq = deque(range(10000000))

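Below we write a function that removes and re-inserts items at random positions among the first 100 elements. A list is slow at this because every removal or insertion near the front shifts the rest of its 10 million items, whereas a deque is fast near either end (in the middle of a deque these operations are O(n) as well). See [Grokking Algorithms](https://pybit.es/grokking_algorithms.html) for an explanation.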

In [30]:
def insert_and_delete(i):
    for _ in range(10):
        index = random.choice(range(100))
        i.remove(index)
        i.insert(index, index)
    
%timeit insert_and_delete(lst)

%timeit insert_and_delete(deq)
        

136 ms ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
23.7 µs ± 924 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


`deque` is much faster here. Also worth checking out: `ChainMap`, which groups several dictionaries into a single view.
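
A minimal `ChainMap` sketch, assuming two hypothetical settings dictionaries (`defaults` and `user_settings` are invented names, not part of the examples above):

```python
from collections import ChainMap

# Hypothetical settings dictionaries, used only for illustration.
defaults = {'theme': 'light', 'font_size': 12}
user_settings = {'theme': 'dark'}

# Lookups search the maps left to right and return the first match.
settings = ChainMap(user_settings, defaults)
theme = settings['theme']
font_size = settings['font_size']

# It is a view, not a copy: changes to the underlying dicts show through.
defaults['language'] = 'en'
language = settings['language']

# Writes go to the first map only.
settings['font_size'] = 14
```

After the last line, `user_settings` contains `font_size` 14 while `defaults` still has 12, which is what makes this pattern handy for layering overrides on top of defaults.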

# Day 2 - Code Challenge

See [Code Challenge 13 - Highest Rated Movie Directors](https://pybit.es/codechallenge13.html). Let's import a movie data set here to practice `collections`.

In [31]:
movie_data = 'https://raw.githubusercontent.com/pybites/challenges/solutions/13/movie_metadata.csv'

In [37]:
movies_csv = 'movies.csv'


In [33]:
urlretrieve(movie_data, movies_csv)


('movies.csv', <http.client.HTTPMessage at 0x1394b8908>)

Here we create a `namedtuple` to describe a movie so we can access each movie's attributes

In [41]:
Movie = namedtuple('Movie', 'title year score')

Parse the CSV

In [45]:
# data=movies_csv means the function takes a parameter named data, with our movies_csv as its default value

def get_movies_by_director(data=movies_csv):
    """Extract all movies from the CSV and store them in a dictionary
    where keys are directors and values are lists of movies (named tuples)."""
    directors = defaultdict(list)
    # The Pythonic way to open a file is to use a context manager ('with' statement)
    with open(data, encoding='utf-8') as f:
        # csv.DictReader parses every line into a dict keyed by column name
        for line in csv.DictReader(f):
            try:
                # In this try block we extract the four values below
                director = line['director_name']
                # get rid of non-breaking spaces ('\xa0') with the replace method
                movie = line['movie_title'].replace('\xa0', '')
                # make sure to convert year to an integer
                year = int(line['title_year'])

                # convert score to float as it has decimals
                score = float(line['imdb_score'])
            # incomplete rows raise a ValueError; we are not interested in
            # incomplete data, so we catch the error and continue with the next row
            except ValueError:
                continue

            # put the values into the namedtuple
            m = Movie(title=movie, year=year, score=score)

            # append to the defaultdict created above (directors)
            directors[director].append(m)

    return directors

In [46]:
# we don't have to specify the data as we set movies.csv as the default
directors = get_movies_by_director()
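
To see how the `try`/`except ValueError` step skips incomplete rows without downloading anything, here is a small self-contained sketch that runs the same logic on an in-memory CSV. The rows, director names and the `directors_sample` variable are made up for illustration.

```python
import csv
import io
from collections import defaultdict, namedtuple

Movie = namedtuple('Movie', 'title year score')

# Made-up rows for illustration; the second row has no year.
sample_csv = io.StringIO(
    "director_name,movie_title,title_year,imdb_score\n"
    "Director A,First Film\xa0,2001,7.5\n"
    "Director A,Lost Film\xa0,,6.0\n"
    "Director B,Other Film\xa0,1999,8.1\n"
)

directors_sample = defaultdict(list)
for row in csv.DictReader(sample_csv):
    try:
        year = int(row['title_year'])      # int('') raises ValueError
        score = float(row['imdb_score'])
    except ValueError:
        continue                           # skip the incomplete row
    title = row['movie_title'].replace('\xa0', '')
    directors_sample[row['director_name']].append(Movie(title, year, score))
```

Only the two complete rows end up in `directors_sample`; the row with the empty year is dropped.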

#### Now we can look up directors and get their movies stored in our `namedtuple` objects.

In [47]:
directors['Christopher Nolan']

[Movie(title='The Dark Knight Rises', year=2012, score=8.5),
 Movie(title='The Dark Knight', year=2008, score=9.0),
 Movie(title='Interstellar', year=2014, score=8.6),
 Movie(title='Inception', year=2010, score=8.8),
 Movie(title='Batman Begins', year=2005, score=8.3),
 Movie(title='Insomnia', year=2002, score=7.2),
 Movie(title='The Prestige', year=2006, score=8.5),
 Movie(title='Memento', year=2000, score=8.5)]

Let's get the top 5 directors with the most movies using `Counter`

In [49]:
# Counter comes from the collections module
cnt = Counter()

# you can loop over a dictionary's key/value pairs with items()
for director, movies in directors.items():
    # Here we store the director in the Counter object and add the number of their movies
    cnt[director] += len(movies)
    
cnt.most_common(5)

[('Steven Spielberg', 26),
 ('Woody Allen', 22),
 ('Martin Scorsese', 20),
 ('Clint Eastwood', 20),
 ('Ridley Scott', 17)]
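
As an alternative sketch, a `Counter` can also be built directly from a mapping of key to count, which folds the loop into one expression. The `directors_demo` dictionary below is a made-up stand-in for the real `directors` data.

```python
from collections import Counter

# Hypothetical stand-in for the `directors` mapping built above.
directors_demo = {
    'Director A': ['film 1', 'film 2', 'film 3'],
    'Director B': ['film 4'],
    'Director C': ['film 5', 'film 6'],
}

# Counter accepts a mapping of key -> count, so the loop can be folded in.
cnt_demo = Counter({name: len(films) for name, films in directors_demo.items()})
top_two = cnt_demo.most_common(2)

# Missing keys count as zero instead of raising KeyError.
unknown = cnt_demo['Director Z']
```

Unlike `defaultdict`, looking up a missing key in a `Counter` returns 0 without adding the key.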