# Goal 1

Firstly, let's produce the field headers as well as a sample row. This will inform us of what data type each field should be.

In [8]:
file_name = 'nyc_parking_tickets_extract.csv'

with open(file_name) as f:
    column_headers = next(f).strip('\n').split(',')
    sample_data = next(f).strip('\n').split(',')

print(column_headers)
print(sample_data)

['Summons Number', 'Plate ID', 'Registration State', 'Plate Type', 'Issue Date', 'Violation Code', 'Vehicle Body Type', 'Vehicle Make', 'Violation Description']
['4006478550', 'VAD7274', 'VA', 'PAS', '10/5/2016', '5', '4D', 'BMW', 'BUS LANE VIOLATION']
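
One caveat on the approach above: splitting on `','` assumes that no field contains a comma inside quotes. The sample row cannot tell us whether that holds for the whole file. As a hedged sketch of an alternative, the standard library's `csv.reader` handles quoted fields; the two-line sample below is made up for illustration and is not taken from the extract.

```python
import csv
import io

# Hypothetical sample (not from the real file): the quoted
# description contains a comma.
sample = io.StringIO(
    'Summons Number,Violation Description\n'
    '4006478550,"BUS LANE, VIOLATION"\n'
)

reader = csv.reader(sample)
headers = next(reader)     # first line: the column headers
first_row = next(reader)   # second line: the first data row

# The naive approach splits inside the quoted field.
naive_split = '4006478550,"BUS LANE, VIOLATION"'.split(',')

print(first_row)    # ['4006478550', 'BUS LANE, VIOLATION']
print(naive_split)  # ['4006478550', '"BUS LANE', ' VIOLATION"']
```

If the data never contains quoted commas, the simpler `split(',')` used in this notebook gives the same result.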


We need to convert these `column_headers` to something more Pythonic, e.g. `Summons Number -> summons_number`, so that we can access each field via `<namedtuple_obj>.summons_number`.

In [9]:
from collections import namedtuple

column_names = [header.replace(' ', '_').lower() for header in column_headers]
                
Ticket = namedtuple('Ticket', column_names)
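
To see what the renaming buys us, here is a small self-contained sketch that uses only three of the headers (the shortened header list and the name `MiniTicket` are chosen just for this illustration):

```python
from collections import namedtuple

# A shortened header list, for illustration only.
column_headers = ['Summons Number', 'Plate ID', 'Issue Date']
column_names = [header.replace(' ', '_').lower() for header in column_headers]

MiniTicket = namedtuple('MiniTicket', column_names)
ticket = MiniTicket('4006478550', 'VAD7274', '10/5/2016')

print(column_names)           # ['summons_number', 'plate_id', 'issue_date']
print(ticket.summons_number)  # 4006478550
print(ticket._fields)         # ('summons_number', 'plate_id', 'issue_date')
```

Field names must be valid Python identifiers, which is why the spaces have to go.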

It would be fair to assume the following:

0. summons_number: **looks like integers**
1. plate_id: string
2. registration_state: string
3. plate_type: string
4. issue_date: **looks like valid dates**
5. violation_code: **looks like integers**
6. vehicle_body_type: string
7. vehicle_make: string
8. violation_description: string

Since the data from the CSV will automatically come back as strings, we only need to worry about casting three of the columns to something other than a string.

Our approach will then be to cast each value in a row to its expected type. If the casting fails, we will replace that value with either:

- `None` if a valid value is essential.
- An empty string if a valid value is nonessential.

Then, we'll want to throw away any rows that contain a `None` value and create a named tuple for each of the rest, which will be yielded.

#### Parsing Utilities

For fields that are expected to be integers, try to cast the default string type to an integer; if it fails, replace with `None`, ready for the entire row to be thrown away.

In [10]:
def parse_int(value, *, default=None):
    try:
        return int(value)
    except ValueError:
        return default
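
A quick check of how this behaves on a few inputs (the function is repeated so the snippet runs on its own; the input values are made up):

```python
def parse_int(value, *, default=None):
    try:
        return int(value)
    except ValueError:
        return default

print(parse_int('5'))            # 5
print(parse_int(' 42 '))         # 42 (int() ignores surrounding whitespace)
print(parse_int('5.0'))          # None ('5.0' is not an integer literal)
print(parse_int('', default=0))  # 0
```

Note that only `ValueError` is caught. Passing `None` would raise a `TypeError` instead, which is not a concern here because every field read from the file is a string.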

For fields that are expected to conform to `%m/%d/%Y`, cast the default string type to a `datetime` object; if it fails, replace with `None`, ready for the entire row to be thrown away.

In [11]:
from datetime import datetime
def parse_date(value, *, default=None):
    date_format='%m/%d/%Y'
    try:
        return datetime.strptime(value, date_format).date()
    except ValueError:
        return default
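
As with `parse_int`, a few made-up inputs show what is accepted and what is rejected (the function is repeated so the snippet is self-contained):

```python
from datetime import datetime

def parse_date(value, *, default=None):
    date_format = '%m/%d/%Y'
    try:
        return datetime.strptime(value, date_format).date()
    except ValueError:
        return default

print(parse_date('10/5/2016'))   # 2016-10-05
print(parse_date('2016-10-05'))  # None (wrong format)
print(parse_date('2/30/2016'))   # None (February 30th does not exist)
```

So the parser rejects both badly formatted strings and well-formatted strings that are not real calendar dates.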

For fields that are expected to be a **valid, essential** string, ensure validity by stripping leading/trailing spaces and rejecting values such as `"   "` or `""`; if the check fails, replace the value with `None`, ready for the row to be thrown away.

For fields that are expected to be a **valid** string but are **not necessarily essential**, apply the same check; if it fails, replace the value with `""`, but don't throw the row away.

In [12]:
def parse_string(value, *, default=None):
    try:
        cleaned = str(value).strip()
        if not cleaned:
            # empty string
            return default
        else:
            return cleaned
    except ValueError:
        return default
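
The difference between the essential and nonessential cases comes down to the `default` argument. The sketch below uses a condensed version of `parse_string` with the same behaviour for string inputs; the input values are made up:

```python
def parse_string(value, *, default=None):
    # Condensed form of the function above: strip, then fall back
    # to the default if nothing is left.
    cleaned = str(value).strip()
    return cleaned if cleaned else default

print(repr(parse_string('  BMW ')))           # 'BMW'
print(repr(parse_string('   ')))              # None (essential field: row gets dropped)
print(repr(parse_string('   ', default='')))  # ''   (nonessential field: row is kept)
```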

Let's create a tuple which stores the functions that will be mapped onto each row's values.

Note: We are passing callables here, i.e. `parse_int` rather than `parse_int()`. If we want to supply a kwarg such as a default value, we can't write `parse_string(default='')`: that is a call rather than a callable, and it would fail immediately because the required `value` argument is missing. There are two ways around this:

1. Use a `lambda` function: `lambda x: parse_string(x, default='')`. When a value is passed to the `lambda`, it calls `parse_string` with that value and the chosen default.
2. Use a `partial` function: `partial(parse_string, default='')`. This behaves the same as the above; `partial` takes the desired function as its first argument, followed by any arguments or kwargs that we want pre-filled for that function.
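
A short sketch confirming that the two options behave the same way. The names `via_lambda` and `via_partial` are made up for this illustration, and `parse_string` is a condensed stand-in for the function defined earlier:

```python
from functools import partial

def parse_string(value, *, default=None):
    cleaned = str(value).strip()
    return cleaned if cleaned else default

via_lambda = lambda x: parse_string(x, default='')
via_partial = partial(parse_string, default='')

# Both are callables that take a single value.
print(via_lambda('   ') == via_partial('   ') == '')  # True
print(via_lambda(' NY ') == via_partial(' NY '))      # True

# A partial object remembers what it wraps, which can help when debugging.
print(via_partial.func is parse_string)  # True
print(via_partial.keywords)              # {'default': ''}
```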

In [13]:
from functools import partial

column_parsers = (parse_int,  # summons_number, default is None
                  parse_string,  # plate_id, default is None
                  partial(parse_string, default=''),  # state
                  partial(parse_string, default=''),  # plate_type
                  parse_date,  # issue_date, default is None
                  parse_int,  # violation_code
                  partial(parse_string, default=''),  # body type
                  parse_string,  # make, default is None
                  lambda x: parse_string(x, default='')  # description
                 )

#### Utility Iterators/Generators

This generator skips the header line and then provides us with one raw row (a single string with `\n` at the end) each time we ask it for the next item.

In [14]:
def read_data():
    with open(file_name) as f:
        next(f)
        yield from f

This will take each raw row (`row`) from above (a single string with `\n` at the end), strip the `\n` off, and split it into a list of values (`fields`).

Then, we'll apply the appropriate function to each value in row depending on its expected type.

This will give us a clean list (`parsed_data`) which may contain `None` or `""` if any of the values were invalid+essential or invalid+nonessential, respectively. 

If there are no `None`s, then let's make a namedtuple with that data. Otherwise, we'll return `None`; we can use this later to only `yield` parsed rows that are `not None`.

In [15]:
def parse_row(row, *, default=None):
    fields = row.strip('\n').split(',')
    parsed_data = [func(field) for func, field in zip(column_parsers, fields)]
                   
    if all(item is not None for item in parsed_data):
        return Ticket(*parsed_data)
    else:
        return default
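
To make the keep-or-drop rule concrete, here is a scaled-down sketch with only three columns. All the names in it (`MiniTicket`, `to_int`, `essential`, `optional`, `parse_mini_row`) and the sample rows are invented for the illustration; the logic mirrors `parse_row` above.

```python
from collections import namedtuple

MiniTicket = namedtuple('MiniTicket',
                        ['summons_number', 'plate_id', 'registration_state'])

def to_int(value):
    try:
        return int(value)
    except ValueError:
        return None

def essential(value):
    # Empty after stripping -> None, which marks the row as invalid.
    return value.strip() or None

def optional(value):
    # Empty after stripping -> '', which is tolerated.
    return value.strip()

mini_parsers = (to_int, essential, optional)

def parse_mini_row(row):
    fields = row.strip('\n').split(',')
    parsed = [func(field) for func, field in zip(mini_parsers, fields)]
    if all(item is not None for item in parsed):
        return MiniTicket(*parsed)
    return None

good = parse_mini_row('4006478550,VAD7274,VA\n')
no_state = parse_mini_row('4006478550,VAD7274,\n')
bad_number = parse_mini_row('abc,VAD7274,VA\n')

print(good)        # all three fields valid: row is kept
print(no_state)    # missing nonessential field: kept, with ''
print(bad_number)  # None: essential field failed to parse
```

One thing to keep in mind is that `zip` stops at the shorter of its inputs, so a row with extra fields is silently truncated, while a row with too few fields makes the namedtuple constructor raise a `TypeError`.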

#### Main Iterator/Generator

This serves as our main iterator/generator. It will **`yield`** each raw row **lazily**, parse it **mostly eagerly**, and if the row is valid, **`yield`** it **lazily**.

In [16]:
def parsed_data():
    for row in read_data():
        parsed = parse_row(row)
        if parsed:
            yield parsed

Now we can iterate through it!

In [17]:
parsed_rows = parsed_data()

In [18]:
for _ in range(5):
    print(next(parsed_rows))
    print('')

Ticket(summons_number=4006478550, plate_id='VAD7274', registration_state='VA', plate_type='PAS', issue_date=datetime.date(2016, 10, 5), violation_code=5, vehicle_body_type='4D', vehicle_make='BMW', violation_description='BUS LANE VIOLATION')

Ticket(summons_number=4006462396, plate_id='22834JK', registration_state='NY', plate_type='COM', issue_date=datetime.date(2016, 9, 30), violation_code=5, vehicle_body_type='VAN', vehicle_make='CHEVR', violation_description='BUS LANE VIOLATION')

Ticket(summons_number=4007117810, plate_id='21791MG', registration_state='NY', plate_type='COM', issue_date=datetime.date(2017, 4, 10), violation_code=5, vehicle_body_type='VAN', vehicle_make='DODGE', violation_description='BUS LANE VIOLATION')

Ticket(summons_number=4006265037, plate_id='FZX9232', registration_state='NY', plate_type='PAS', issue_date=datetime.date(2016, 8, 23), violation_code=5, vehicle_body_type='SUBN', vehicle_make='FORD', violation_description='BUS LANE VIOLATION')

Ticket(summons_numb

#### Explanation

By the end of the code we have two generators:
```python
def read_data():
    with open(file_name) as f:
        next(f)
        yield from f
```
and 
```python
def parsed_data():
    for row in read_data():
        parsed = parse_row(row)
        if parsed:
            yield parsed
```

(Technically, we have one more: the generator expression inside `all(item is not None for item in parsed_data)`. It's irrelevant for my question, though.)

Then we start yielding with `next`:
```python
parsed_rows = parsed_data()
for _ in range(5):
    print(next(parsed_rows))
```
For the code block directly above, here's my understanding of the order of operations:

1. We call `parsed_data()`. It has a `yield` statement so Python doesn't execute **any** lines within it - it just returns a generator object.

2. We call `next`, so Python starts executing the body of `parsed_data()`. The `for` statement calls `read_data()`, which is itself a generator function, so the call only returns a generator object and nothing inside it runs yet. The `for` loop then asks that generator for its first item. Only now does Python `open` the file, skip the header with `next(f)`, and `yield from f` the first line, which gets stored in `row`.

3. Python applies `parse_row` to `row`. If the result is a valid `Ticket`, `parsed_data()` will `yield parsed` back to the `next()` call, ready to be printed; otherwise the loop moves straight on to the next `row`.

4. For the next iteration, we continue from `read_data()`, asking it to give us the next `row` in the iteration, and so on and so forth...
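
One way to check this ordering is to log what happens at each step. The sketch below is an assumption-laden stand-in for the real pipeline: it reads from an in-memory list instead of the file, uppercases each line instead of calling `parse_row`, and uses made-up names (`read_lines`, `parsed_lines`, `events`).

```python
events = []

def read_lines():
    events.append('read_lines: started')
    # Stand-in for the file; the slice skips the header line.
    for line in ['header\n', 'row 1\n', 'row 2\n'][1:]:
        events.append(f'read_lines: yielding {line!r}')
        yield line

def parsed_lines():
    events.append('parsed_lines: started')
    for line in read_lines():
        events.append(f'parsed_lines: got {line!r}')
        yield line.strip('\n').upper()

gen = parsed_lines()                 # step 1: nothing inside runs yet
events.append('generator created')

first = next(gen)                    # steps 2 and 3
events.append(f'main: received {first!r}')

for event in events:
    print(event)
```

The log starts with `generator created`, and only afterwards do `parsed_lines: started` and `read_lines: started` appear, which matches steps 1 and 2 above. Calling `next(gen)` again resumes both generators where they paused, as in step 4.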

# Goal 2

In [70]:
makes_counts = {}

for data in parsed_data():
    if data.vehicle_make in makes_counts:
        makes_counts[data.vehicle_make] += 1
    else:
        makes_counts[data.vehicle_make] = 1

print(makes_counts)

{'BMW': 34, 'CHEVR': 76, 'DODGE': 45, 'FORD': 104, 'FRUEH': 44, 'HONDA': 106, 'LINCO': 12, 'TOYOT': 112, 'CADIL': 9, 'CHRYS': 12, 'FIR': 1, 'GMC': 35, 'HYUND': 35, 'JAGUA': 3, 'JEEP': 22, 'LEXUS': 26, 'ME/BE': 38, 'MERCU': 4, 'MITSU': 11, 'NISSA': 70, 'HIN': 6, 'NS/OT': 18, 'WORKH': 2, 'ACURA': 12, 'AUDI': 12, 'INTER': 25, 'ISUZU': 10, 'KENWO': 5, 'KIA': 8, 'OLDSM': 1, 'SUBAR': 18, 'VOLVO': 12, 'SATUR': 2, 'SMART': 3, 'INFIN': 13, 'PETER': 1, 'CITRO': 1, 'ROVER': 5, 'BUICK': 5, 'GEO': 1, 'MAZDA': 5, 'PORSC': 3, 'VOLKS': 8, 'YAMAH': 1, 'BSA': 1, 'MINI': 1, 'PONTI': 1, 'SPRI': 1, 'PLYMO': 1, 'SCION': 2, 'UPS': 1, 'FIAT': 1, 'UD': 1, 'UTILI': 1, 'GMCQ': 1, 'SAAB': 2, 'HINO': 2, 'STAR': 1, 'AM/T': 1, 'MI/F': 1}


In [71]:
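# Sort (make, count) pairs by count, largest first.
# `item` is used as the parameter name to avoid shadowing the built-in `tuple`.
sorted_data = sorted(makes_counts.items(),
                     key=lambda item: item[1],
                     reverse=True)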
print(sorted_data)

[('TOYOT', 112), ('HONDA', 106), ('FORD', 104), ('CHEVR', 76), ('NISSA', 70), ('DODGE', 45), ('FRUEH', 44), ('ME/BE', 38), ('GMC', 35), ('HYUND', 35), ('BMW', 34), ('LEXUS', 26), ('INTER', 25), ('JEEP', 22), ('NS/OT', 18), ('SUBAR', 18), ('INFIN', 13), ('LINCO', 12), ('CHRYS', 12), ('ACURA', 12), ('AUDI', 12), ('VOLVO', 12), ('MITSU', 11), ('ISUZU', 10), ('CADIL', 9), ('KIA', 8), ('VOLKS', 8), ('HIN', 6), ('KENWO', 5), ('ROVER', 5), ('BUICK', 5), ('MAZDA', 5), ('MERCU', 4), ('JAGUA', 3), ('SMART', 3), ('PORSC', 3), ('WORKH', 2), ('SATUR', 2), ('SCION', 2), ('SAAB', 2), ('HINO', 2), ('FIR', 1), ('OLDSM', 1), ('PETER', 1), ('CITRO', 1), ('GEO', 1), ('YAMAH', 1), ('BSA', 1), ('MINI', 1), ('PONTI', 1), ('SPRI', 1), ('PLYMO', 1), ('UPS', 1), ('FIAT', 1), ('UD', 1), ('UTILI', 1), ('GMCQ', 1), ('STAR', 1), ('AM/T', 1), ('MI/F', 1)]


In [72]:
print({make: cnt for make, cnt in sorted_data})

{'TOYOT': 112, 'HONDA': 106, 'FORD': 104, 'CHEVR': 76, 'NISSA': 70, 'DODGE': 45, 'FRUEH': 44, 'ME/BE': 38, 'GMC': 35, 'HYUND': 35, 'BMW': 34, 'LEXUS': 26, 'INTER': 25, 'JEEP': 22, 'NS/OT': 18, 'SUBAR': 18, 'INFIN': 13, 'LINCO': 12, 'CHRYS': 12, 'ACURA': 12, 'AUDI': 12, 'VOLVO': 12, 'MITSU': 11, 'ISUZU': 10, 'CADIL': 9, 'KIA': 8, 'VOLKS': 8, 'HIN': 6, 'KENWO': 5, 'ROVER': 5, 'BUICK': 5, 'MAZDA': 5, 'MERCU': 4, 'JAGUA': 3, 'SMART': 3, 'PORSC': 3, 'WORKH': 2, 'SATUR': 2, 'SCION': 2, 'SAAB': 2, 'HINO': 2, 'FIR': 1, 'OLDSM': 1, 'PETER': 1, 'CITRO': 1, 'GEO': 1, 'YAMAH': 1, 'BSA': 1, 'MINI': 1, 'PONTI': 1, 'SPRI': 1, 'PLYMO': 1, 'UPS': 1, 'FIAT': 1, 'UD': 1, 'UTILI': 1, 'GMCQ': 1, 'STAR': 1, 'AM/T': 1, 'MI/F': 1}


This solution is good, but what if we wanted to get around the `if-else` block when building the original dictionary?

We can, using `defaultdict` from `collections`.

With regular dictionaries, if we request a value for a key that doesn't exist, a `KeyError` is raised.

With `defaultdict`, if we request a value for a key that doesn't exist, Python calls the factory we passed in (e.g. `str`, `int`, `list` or `dict`) with no arguments, stores the result under that key, and returns it. That gives us a default of an empty string, the integer 0, an empty list, an empty dict, or whatever else the factory produces.

As an example:

In [73]:
from collections import defaultdict

example_d = defaultdict(int)
example_d['idontexist']

0
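
The same idea works with other factories. The sketch below uses made-up `(make, state)` pairs to count with `int` and to group with `list`, and also shows a side effect worth knowing about: merely reading a missing key inserts it.

```python
from collections import defaultdict

counts = defaultdict(int)   # missing keys start at 0
groups = defaultdict(list)  # missing keys start as []

# Hypothetical (make, state) pairs, for illustration only.
for make, state in [('BMW', 'VA'), ('FORD', 'NY'), ('BMW', 'NY')]:
    counts[make] += 1
    groups[make].append(state)

print(dict(counts))  # {'BMW': 2, 'FORD': 1}
print(dict(groups))  # {'BMW': ['VA', 'NY'], 'FORD': ['NY']}

# Looking up a missing key with [] creates it...
counts['TOYOT']
print('TOYOT' in counts)  # True

# ...whereas .get() does not call the factory.
print(counts.get('HONDA'))  # None
print('HONDA' in counts)    # False
```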

In our case, we can just increment by 1 whether or not the key already exists, because if it doesn't, a default value of 0 will be set first.

We'll also wrap it into a function.

In [74]:
def violation_count_by_make():
    makes_counts = defaultdict(int)
    for data in parsed_data():
        makes_counts[data.vehicle_make] += 1

    # Sort (make, count) pairs by count, largest first.
    sorted_data = sorted(makes_counts.items(),
                         key=lambda item: item[1],
                         reverse=True)

    return {make: cnt for make, cnt in sorted_data}

violation_count_by_make()

{'TOYOT': 112,
 'HONDA': 106,
 'FORD': 104,
 'CHEVR': 76,
 'NISSA': 70,
 'DODGE': 45,
 'FRUEH': 44,
 'ME/BE': 38,
 'GMC': 35,
 'HYUND': 35,
 'BMW': 34,
 'LEXUS': 26,
 'INTER': 25,
 'JEEP': 22,
 'NS/OT': 18,
 'SUBAR': 18,
 'INFIN': 13,
 'LINCO': 12,
 'CHRYS': 12,
 'ACURA': 12,
 'AUDI': 12,
 'VOLVO': 12,
 'MITSU': 11,
 'ISUZU': 10,
 'CADIL': 9,
 'KIA': 8,
 'VOLKS': 8,
 'HIN': 6,
 'KENWO': 5,
 'ROVER': 5,
 'BUICK': 5,
 'MAZDA': 5,
 'MERCU': 4,
 'JAGUA': 3,
 'SMART': 3,
 'PORSC': 3,
 'WORKH': 2,
 'SATUR': 2,
 'SCION': 2,
 'SAAB': 2,
 'HINO': 2,
 'FIR': 1,
 'OLDSM': 1,
 'PETER': 1,
 'CITRO': 1,
 'GEO': 1,
 'YAMAH': 1,
 'BSA': 1,
 'MINI': 1,
 'PONTI': 1,
 'SPRI': 1,
 'PLYMO': 1,
 'UPS': 1,
 'FIAT': 1,
 'UD': 1,
 'UTILI': 1,
 'GMCQ': 1,
 'STAR': 1,
 'AM/T': 1,
 'MI/F': 1}
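
As a possible alternative to the count-then-sort steps above (not what this notebook uses), `collections.Counter` does the counting and its `most_common()` method does the sorting. The list of makes below is made up so that the snippet runs on its own; in the notebook the input could plausibly be a generator expression such as `(t.vehicle_make for t in parsed_data())`.

```python
from collections import Counter

# Hypothetical data standing in for the vehicle_make column.
makes = ['TOYOT', 'HONDA', 'TOYOT', 'FORD', 'HONDA', 'TOYOT']

counts = Counter(makes)
ranked = counts.most_common()  # (make, count) pairs, highest count first
as_dict = dict(ranked)         # dicts preserve insertion order

print(ranked)   # [('TOYOT', 3), ('HONDA', 2), ('FORD', 1)]
print(as_dict)  # {'TOYOT': 3, 'HONDA': 2, 'FORD': 1}
```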