# Meet the collections module
> The collections module is part of Python's standard library and holds some more advanced data containers. You'll learn how to use Counter, defaultdict, OrderedDict and namedtuple in the context of answering questions about the Chicago transit dataset. This is a summary of the lecture "Data Types for Data Science in Python", via DataCamp.

- toc: true 
- badges: true
- comments: true
- author: Chanseok Kang
- categories: [Python, Datacamp, Data_Science]
- image: 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = (10, 8)

## Counting made easy
- `collections` module
    - Part of Standard Library
    - Advanced data containers
- Counter
    - Special dictionary used for counting data, measuring frequency

### Using Counter on lists
`Counter`, found in the `collections` module, is a powerful tool for counting, validating, and learning more about the elements within a dataset. You pass an iterable (such as a list, set, or tuple) or a dictionary to `Counter`. You can also use a `Counter` object much like a dictionary, with key/value assignment such as `counter[key] = value`.
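As a minimal sketch of the points above, the snippet below builds a `Counter` from a short list and then updates it with dictionary-style assignment. The `day_types` list is hypothetical sample data, not the CTA dataset.

```python
from collections import Counter

# Hypothetical sample data, not taken from the CTA dataset
day_types = ["WEEKDAY", "SATURDAY", "WEEKDAY", "SUNDAY/HOLIDAY", "WEEKDAY"]

# Build a Counter from an iterable
day_type_count = Counter(day_types)
print(day_type_count["WEEKDAY"])  # 3

# A missing key returns 0 instead of raising KeyError,
# and looking it up does not add the key
print(day_type_count["MONDAY"])  # 0
print("MONDAY" in day_type_count)  # False

# Key/value assignment works as it does for a regular dictionary
day_type_count["SATURDAY"] = 10
day_type_count["WEEKDAY"] += 1
print(day_type_count)
```

Returning 0 for unseen keys is what makes `counter[key] += 1` safe without first checking whether the key exists.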

A common usage for `Counter` is checking data for consistency prior to using it, so let's do just that. In this exercise, you'll be using data from the Chicago Transit Authority on ridership.

In [2]:
df = pd.read_csv('./dataset/cta_daily_station_totals.csv')
df.head()

| Unnamed: 0 | station_id | stationname | date | daytype | rides |
|---|---|---|---|---|---|
| 0 | 40010 | Austin-Forest Park | 01/01/2015 | SUNDAY/HOLIDAY | 587 |
| 1 | 40010 | Austin-Forest Park | 01/02/2015 | WEEKDAY | 1386 |
| 2 | 40010 | Austin-Forest Park | 01/03/2015 | SATURDAY | 785 |
| 3 | 40010 | Austin-Forest Park | 01/04/2015 | SUNDAY/HOLIDAY | 625 |
| 4 | 40010 | Austin-Forest Park | 01/05/2015 | WEEKDAY | 1752 |


In [4]:
stations = df['stationname'].tolist()

In [5]:
from collections import Counter

# Print the first ten items from the stations list
print(stations[:10])

# Create a Counter of the stations list: station_count
station_count = Counter(stations)

# Print the station_count
print(station_count)

['Austin-Forest Park', 'Austin-Forest Park', 'Austin-Forest Park', 'Austin-Forest Park', 'Austin-Forest Park', 'Austin-Forest Park', 'Austin-Forest Park', 'Austin-Forest Park', 'Austin-Forest Park', 'Austin-Forest Park']
Counter({'Austin-Forest Park': 700, 'Harlem-Lake': 700, 'Pulaski-Lake': 700, 'Quincy/Wells': 700, 'Davis': 700, "Belmont-O'Hare": 700, 'Jackson/Dearborn': 700, 'Sheridan': 700, 'Damen-Brown': 700, 'Morse': 700, '35th/Archer': 700, '51st': 700, 'Dempster-Skokie': 700, 'Pulaski-Cermak': 700, 'LaSalle/Van Buren': 700, 'Ashland-Lake': 700, 'Oak Park-Forest Park': 700, 'Sox-35th-Dan Ryan': 700, 'Randolph/Wabash': 700, 'Damen-Cermak': 700, 'Western-Forest Park': 700, 'Cumberland': 700, '79th': 700, 'Kedzie-Homan-Forest Park': 700, 'State/Lake': 700, 'Main': 700, 'Central-Lake': 700, 'Ashland/63rd': 700, 'Indiana': 700, 'Western-Orange': 700, 'Division/Milwaukee': 700, 'Grand/State': 700, 'Berwyn': 700, 'UIC-Halsted': 700, 'Southport': 700, 'Washington/Dearborn': 700, 'Clark/

### Finding most common elements
Another powerful usage of `Counter` is finding the most common elements in a list. This can be done with the `.most_common()` method.

Practice using this now to find the most common stations in a stations list.

In [6]:
# Find the 5 most common elements
print(station_count.most_common(5))

[('Austin-Forest Park', 700), ('Harlem-Lake', 700), ('Pulaski-Lake', 700), ('Quincy/Wells', 700), ('Davis', 700)]
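Every station in the output above has the same count, so the "top five" are simply the first five stations encountered: since Python 3.7, `.most_common()` lists elements with equal counts in the order they were first seen. The sketch below shows this on a small list whose counts are made up for illustration.

```python
from collections import Counter

# Hypothetical sample list; the counts are illustrative only
stops = ["Davis", "Main", "Davis", "Noyes", "Main", "Davis", "Morse"]
stop_count = Counter(stops)

# The two most common elements
top_two = stop_count.most_common(2)
print(top_two)  # [('Davis', 3), ('Main', 2)]

# With no argument, every element is returned from most to least common;
# 'Noyes' and 'Morse' tie at 1, so 'Noyes' (seen first) comes first
everything = stop_count.most_common()
print(everything)

# Slicing the full list backwards gives the least common elements
least_two = stop_count.most_common()[:-3:-1]
print(least_two)  # [('Morse', 1), ('Noyes', 1)]
```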


## Dictionaries of unknown structure - Defaultdict
- Using defaultdict
    - Pass it a default type; any key that doesn't exist yet gets a fresh value of that type the first time it is accessed
    - Otherwise works like a regular dictionary

### Creating dictionaries of an unknown structure
Occasionally, you'll need a structure to hold nested data, and you may not be certain that the keys will all actually exist. This can be an issue if you're trying to append items to a list for that key. You might remember the NYC data that we explored in the video. In order to solve the problem with a regular dictionary, you'll need to test that the key exists in the dictionary, and if not, add it with an empty list.

You'll be working with a list of entries that contains ridership details on the Chicago transit system. You're going to solve this same type of problem with a much easier solution in the next exercise.

In [23]:
entries = df[['date', 'stationname', 'rides']].apply(tuple, axis=1).values.tolist()
entries[:10]

[('01/01/2015', 'Austin-Forest Park', 587),
 ('01/02/2015', 'Austin-Forest Park', 1386),
 ('01/03/2015', 'Austin-Forest Park', 785),
 ('01/04/2015', 'Austin-Forest Park', 625),
 ('01/05/2015', 'Austin-Forest Park', 1752),
 ('01/06/2015', 'Austin-Forest Park', 1777),
 ('01/07/2015', 'Austin-Forest Park', 1269),
 ('01/08/2015', 'Austin-Forest Park', 1435),
 ('01/09/2015', 'Austin-Forest Park', 1631),
 ('01/10/2015', 'Austin-Forest Park', 771)]

In [16]:
# Create an empty dictionary: ridership
ridership = {}

# Iterate over the entries
for date, stop, riders in entries:
    # Check to see if date is already in the ridership dictionary
    if date not in ridership:
        # Create an empty list for any missing date
        ridership[date] = []
    # Append the stop and riders as a tuple to the date key's list
    ridership[date].append((stop, riders))
    
# Print the ridership for '03/09/2016'
print(ridership['03/09/2016'])

[('Austin-Forest Park', 2128), ('Harlem-Lake', 3769), ('Pulaski-Lake', 1502), ('Quincy/Wells', 8139), ('Davis', 3656), ("Belmont-O'Hare", 5294), ('Jackson/Dearborn', 8369), ('Sheridan', 5823), ('Damen-Brown', 3048), ('Morse', 4826), ('35th/Archer', 3450), ('51st', 1033), ('Dempster-Skokie', 1697), ('Pulaski-Cermak', 1259), ('LaSalle/Van Buren', 3104), ('Ashland-Lake', 2486), ('Oak Park-Forest Park', 1882), ('Sox-35th-Dan Ryan', 4967), ('Randolph/Wabash', 9659), ('Damen-Cermak', 1572), ('Western-Forest Park', 1819), ('Cumberland', 4589), ('79th', 7476), ('Kedzie-Homan-Forest Park', 2256), ('State/Lake', 10594), ('Main', 1129), ('Central-Lake', 2145), ('Ashland/63rd', 1302), ('Indiana', 919), ('Western-Orange', 3958), ('Division/Milwaukee', 6580), ('Grand/State', 10949), ('Berwyn', 3539), ('UIC-Halsted', 7523), ('Southport', 3467), ('Washington/Dearborn', 12365), ('Clark/Lake', 21640), ('Forest Park', 3636), ('Noyes', 941), ('Cicero-Cermak', 1271), ('Clinton-Forest Park', 4016), ('Califo
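For comparison, a regular dictionary can also do the check-and-initialize step in a single call with `dict.setdefault()`. This is a sketch of that alternative, not the approach used in the lecture, and `sample_entries` holds hypothetical values in the same `(date, stop, riders)` shape as `entries`.

```python
# Hypothetical entries in the same (date, stop, riders) shape as `entries`
sample_entries = [
    ("01/01/2015", "Austin-Forest Park", 587),
    ("01/01/2015", "Harlem-Lake", 1000),
    ("01/02/2015", "Austin-Forest Park", 1386),
]

by_date = {}
for date, stop, riders in sample_entries:
    # setdefault returns the existing list for the date,
    # or stores and returns a new empty list if the date is missing
    by_date.setdefault(date, []).append((stop, riders))

print(by_date["01/01/2015"])
```

Note that `setdefault` builds its default argument on every call, even when the key already exists.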

### Safely appending to a key's value list
Often when working with dictionaries, you will need to initialize a data type before you can use it. A prime example of this is a list, which has to be initialized on each key before you can append to that list.

A `defaultdict` allows you to define what each uninitialized key will contain. When creating a `defaultdict`, you pass it a factory: a type such as `list`, `tuple`, `set`, `int`, `str`, or `dict`, or any other callable that takes no arguments. The factory is called to build the value whenever a missing key is accessed.
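The factory does not have to be `list`. As a small sketch, `defaultdict(int)` starts every missing key at 0, which is convenient for running totals. The `sample_entries` values below are hypothetical.

```python
from collections import defaultdict

# Hypothetical entries in the same (date, stop, riders) shape as `entries`
sample_entries = [
    ("01/01/2015", "Austin-Forest Park", 587),
    ("01/01/2015", "Harlem-Lake", 1000),
    ("01/02/2015", "Austin-Forest Park", 1386),
]

# int() returns 0, so every missing stop starts with a total of 0
total_by_stop = defaultdict(int)
for _, stop, riders in sample_entries:
    total_by_stop[stop] += riders

print(total_by_stop["Austin-Forest Park"])  # 1973

# .get() does not trigger the factory, so nothing is inserted
print(total_by_stop.get("Main"))  # None
print("Main" in total_by_stop)  # False

# Reading a missing key with [] does insert it, using the default value
missing_total = total_by_stop["Davis"]
print("Davis" in total_by_stop)  # True
```

The last lines show a common surprise: merely looking up a missing key with square brackets adds it to the dictionary, so use `in` or `.get()` when you only want to check.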

In [21]:
from collections import defaultdict

# Create a defaultdict with a default type of list: ridership
ridership = defaultdict(list)

# Iterate over the entries
for date, stop, riders in entries:
    # Use the stop as the key of ridership and append the riders to its value
    ridership[stop].append(riders)
    
# Print the first 10 items of the ridership dictionary
print(list(ridership.items())[:10])

[('Austin-Forest Park', [587, 1386, 785, 625, 1752, 1777, 1269, 1435, 1631, 771, 588, 2065, 2108, 2012, 2069, 2003, 953, 706, 1216, 2115, 2132, 2185, 2072, 854, 585, 2095, 2251, 2133, 2083, 2074, 953, 596, 1583, 2263, 2179, 2105, 2076, 1049, 612, 2095, 2191, 2117, 1931, 1943, 800, 584, 1434, 2078, 1869, 1455, 1830, 841, 621, 1884, 2100, 2046, 2066, 2016, 875, 615, 1975, 2391, 2058, 2035, 2008, 989, 635, 2105, 2148, 2152, 2155, 2182, 1340, 718, 2191, 2220, 2154, 2248, 2183, 1073, 664, 1924, 2060, 2049, 2138, 1930, 972, 693, 2059, 2060, 2120, 2062, 1751, 928, 664, 2047, 2032, 2030, 1899, 2096, 1012, 688, 2090, 2160, 2182, 2184, 2235, 1060, 732, 2090, 2161, 2115, 2203, 2180, 885, 738, 2152, 2175, 2230, 2218, 2320, 1207, 773, 2171, 2090, 2225, 2333, 2098, 1042, 678, 2048, 2097, 2118, 2198, 2273, 1095, 779, 2103, 2119, 2090, 2206, 2081, 1095, 767, 795, 2025, 2171, 2271, 2175, 910, 668, 2148, 2110, 2198, 2152, 2138, 1129, 773, 2041, 2156, 2172, 2093, 2010, 1225, 843, 2006, 2126, 2062, 2341, 

## Maintaining Dictionary Order with OrderedDict



### Working with OrderedDictionaries
Since Python 3.7, regular dictionaries are guaranteed to maintain the order in which keys were inserted (CPython 3.6 already behaved this way as an implementation detail). In earlier versions you need an `OrderedDict` to maintain insertion order.
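Even on recent Python versions, `OrderedDict` still differs from a regular dictionary in a few ways. The sketch below, using made-up values, shows two of them: equality between two `OrderedDict` objects is order-sensitive, and `.move_to_end()` lets you reorder keys.

```python
from collections import OrderedDict

# Two regular dicts with the same items in a different insertion order
a = {"01/01/2015": 1, "01/02/2015": 2}
b = {"01/02/2015": 2, "01/01/2015": 1}
print(a == b)  # True: regular dict equality ignores order

# The same items as OrderedDicts compare unequal because the order differs
oa = OrderedDict(a)
ob = OrderedDict(b)
equal_before = oa == ob
print(equal_before)  # False

# move_to_end() moves an existing key to the end
oa.move_to_end("01/01/2015")
print(list(oa))  # ['01/02/2015', '01/01/2015']

equal_after = oa == ob
print(equal_after)  # True: the order now matches
```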

Let's build an `OrderedDict` that totals the riders across all stations for each date, then use it to see how ridership changes from day to day.

In [24]:
from collections import OrderedDict

# Create an OrderedDict called: ridership_date
ridership_date = OrderedDict()

# Iterate over the entries
for date, _, riders in entries:
    # If a key does not exist in ridership_date, set it to 0
    if date not in ridership_date:
        ridership_date[date] = 0
    
    # Add riders to the date key in ridership_date
    ridership_date[date] += riders
    
# Print the first 31 records
print(list(ridership_date.items())[:31])

[('01/01/2015', 233956), ('01/02/2015', 432144), ('01/03/2015', 273207), ('01/04/2015', 217632), ('01/05/2015', 538868), ('01/06/2015', 556918), ('01/07/2015', 416984), ('01/08/2015', 475074), ('01/09/2015', 524144), ('01/10/2015', 282850), ('01/11/2015', 227240), ('01/12/2015', 605068), ('01/13/2015', 609226), ('01/14/2015', 608109), ('01/15/2015', 622792), ('01/16/2015', 612833), ('01/17/2015', 335555), ('01/18/2015', 244490), ('01/19/2015', 411497), ('01/20/2015', 618377), ('01/21/2015', 619945), ('01/22/2015', 623914), ('01/23/2015', 612177), ('01/24/2015', 333440), ('01/25/2015', 226964), ('01/26/2015', 605287), ('01/27/2015', 626168), ('01/28/2015', 625531), ('01/29/2015', 622695), ('01/30/2015', 618395), ('01/31/2015', 337018)]


### Powerful Ordered popping
Where an `OrderedDict` really shines is when you need to access the data in the dictionary in the order you added it. `OrderedDict` has a `.popitem()` method that removes and returns items in reverse insertion order, most recently added first. You can also pass `.popitem()` the `last=False` keyword argument to remove and return items in the order they were added.

In [27]:
# Print the first key in ridership_date
print(list(ridership_date.keys())[0])

# Pop the last item from ridership_date and print it
print(ridership_date.popitem())

# Print the last key in ridership_date
print(list(ridership_date.keys())[-1])

# Pop the first item from ridership_date and print it
print(ridership_date.popitem(last=False))

01/01/2015
('11/30/2016', 631904)
11/29/2016
('01/01/2015', 233956)
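The same behavior is easier to follow on a three-item dictionary. This is a minimal sketch with made-up totals; `daily_totals` is a hypothetical name.

```python
from collections import OrderedDict

# Hypothetical daily totals, inserted in date order
daily_totals = OrderedDict()
daily_totals["01/01/2015"] = 100
daily_totals["01/02/2015"] = 200
daily_totals["01/03/2015"] = 300

# popitem() removes and returns the most recently added item (LIFO)
newest = daily_totals.popitem()
print(newest)  # ('01/03/2015', 300)

# popitem(last=False) removes and returns the earliest added item (FIFO)
oldest = daily_totals.popitem(last=False)
print(oldest)  # ('01/01/2015', 100)

# Only the middle item remains
remaining = list(daily_totals.items())
print(remaining)  # [('01/02/2015', 200)]
```

A regular dictionary's `.popitem()` takes no arguments and always removes the most recently added item, so popping from the front is specific to `OrderedDict`.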


## What do you mean I don't have any class? Namedtuple
- namedtuple
    - A tuple where each position (column) has a name
    - Ensure each one has the same properties
    - Alternative to a pandas DataFrame row

### Creating namedtuples for storing data
Often when working with data, you will use a dictionary just so that key names make the code easier to read and the data easier to access. Python has another container called a `namedtuple`, which is a tuple that has a name for each position. You create one by passing a name for the tuple type and a list of field names.

For example, `Cookie = namedtuple("Cookie", ['name', 'quantity'])` creates a new tuple type. You can then create instances with `Cookie('chocolate chip', 1)`, reading the name through the `name` attribute and the quantity through the `quantity` attribute.
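Written out as runnable code, the `Cookie` example looks like the sketch below. It also shows that a namedtuple still behaves like a plain tuple (indexing and unpacking work) and that its fields cannot be reassigned.

```python
from collections import namedtuple

# Define the tuple type with two named fields
Cookie = namedtuple("Cookie", ["name", "quantity"])

# Create an instance and read its fields by name
cookie = Cookie("chocolate chip", 1)
print(cookie.name)  # chocolate chip
print(cookie.quantity)  # 1

# It is still a tuple, so indexing and unpacking work too
print(cookie[0])  # chocolate chip
name, quantity = cookie

# Like any tuple it is immutable: assigning to a field raises AttributeError
try:
    cookie.quantity = 2
    was_mutated = True
except AttributeError:
    was_mutated = False
print(was_mutated)  # False
```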

In this exercise, you're going to restructure the transit data you've been working with into namedtuples for more descriptive code.



In [28]:
from collections import namedtuple

# Create the namedtuple: DateDetails
DateDetails = namedtuple('DateDetails', ['date', 'stop', 'riders'])

# Create the empty list: labeled_entries
labeled_entries = []

# Iterate over the entries list
for date, stop, riders in entries:
    # Append a new DateDetails namedtuple instance for each entry to labeled_entries
    labeled_entries.append(DateDetails(date, stop, riders))
    
# Print the first 5 items in labeled_entries
print(labeled_entries[:5])

[DateDetails(date='01/01/2015', stop='Austin-Forest Park', riders=587), DateDetails(date='01/02/2015', stop='Austin-Forest Park', riders=1386), DateDetails(date='01/03/2015', stop='Austin-Forest Park', riders=785), DateDetails(date='01/04/2015', stop='Austin-Forest Park', riders=625), DateDetails(date='01/05/2015', stop='Austin-Forest Park', riders=1752)]


### Leveraging attributes on namedtuples
Once you have a namedtuple, you can write more expressive code that is easier to understand. Remember, you can access the elements in the tuple by their name as an attribute. For example, you can access the date of the namedtuples in the previous exercise using the `.date` attribute.

Here, you'll use the tuples you made in the previous exercise to see how this works.

In [30]:
# Iterate over the first twenty items in labeled_entries
for item in labeled_entries[:20]:
    # Print each item's stop
    print(item.stop)
    
    # Print each item's date
    print(item.date)
    
    # Print each item's riders
    print(item.riders)

Austin-Forest Park
01/01/2015
587
Austin-Forest Park
01/02/2015
1386
Austin-Forest Park
01/03/2015
785
Austin-Forest Park
01/04/2015
625
Austin-Forest Park
01/05/2015
1752
Austin-Forest Park
01/06/2015
1777
Austin-Forest Park
01/07/2015
1269
Austin-Forest Park
01/08/2015
1435
Austin-Forest Park
01/09/2015
1631
Austin-Forest Park
01/10/2015
771
Austin-Forest Park
01/11/2015
588
Austin-Forest Park
01/12/2015
2065
Austin-Forest Park
01/13/2015
2108
Austin-Forest Park
01/14/2015
2012
Austin-Forest Park
01/15/2015
2069
Austin-Forest Park
01/16/2015
2003
Austin-Forest Park
01/17/2015
953
Austin-Forest Park
01/18/2015
706
Austin-Forest Park
01/19/2015
1216
Austin-Forest Park
01/20/2015
2115
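Beyond printing, named attributes make aggregation code easier to read. The sketch below rebuilds `DateDetails` and uses the first three Austin-Forest Park rows shown earlier as a small sample. `sample_entries`, `total_riders`, `busiest`, and `adjusted` are hypothetical names, and the `600` passed to `_replace` is a made-up value.

```python
from collections import namedtuple

DateDetails = namedtuple("DateDetails", ["date", "stop", "riders"])

# A small sample: the first three rows of the entries shown earlier
sample_entries = [
    DateDetails("01/01/2015", "Austin-Forest Park", 587),
    DateDetails("01/02/2015", "Austin-Forest Park", 1386),
    DateDetails("01/03/2015", "Austin-Forest Park", 785),
]

# Attribute access reads more clearly than positional indexes like item[2]
total_riders = sum(item.riders for item in sample_entries)
print(total_riders)  # 2758

# Find the entry with the highest ridership
busiest = max(sample_entries, key=lambda item: item.riders)
print(busiest.date)  # 01/02/2015

# _fields lists the field names; _asdict() converts an entry to a dictionary
print(DateDetails._fields)  # ('date', 'stop', 'riders')
first_as_dict = sample_entries[0]._asdict()

# _replace() returns a new namedtuple with some fields changed;
# the original entry is left untouched
adjusted = sample_entries[0]._replace(riders=600)
print(adjusted.riders, sample_entries[0].riders)  # 600 587
```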
