# Lecture 4.6 - Basics of cleaning messy text files 
## Part 2 - Grouping blocks of data and extracting information

In this lecture, we will go over a number of cases of messy data, and how to use Python to fix these problems.  This includes

1. Removing unwanted lines.
2. Parsing lines with regular expressions.
3. Working with data blocks spread across multiple lines.

## Reading in current progress

In [37]:
with open('911_Deaths_Grouped.csv') as f:
    content = f.read() # if clean lines, use readlines()
content[:500]

"Gordon M. Aamoth, Jr., 32, Sandler O'Neill + Partners, World Trade Center.\nEdelmiro Abad, 54, Brooklyn, N.Y., Fiduciary Trust Company International, World Trade Center.\nMarie Rose Abad, 49, Keefe, Bruyette&Woods, Inc., World Trade Center.\nAndrew Anthony Abate, 37, Melville, N.Y., Cantor Fitzgerald, World Trade Center.\nVincent Paul Abate, 40, Brooklyn, N.Y., Cantor Fitzgerald, World Trade Center.\nLaurence Christopher Abel, 37, New York City, Cantor Fitzgerald, World Trade Center.\nAlona Abraham, 3"

In [38]:
grouped_lines = content.split('\n')
grouped_lines

["Gordon M. Aamoth, Jr., 32, Sandler O'Neill + Partners, World Trade Center.",
 'Edelmiro Abad, 54, Brooklyn, N.Y., Fiduciary Trust Company International, World Trade Center.',
 'Marie Rose Abad, 49, Keefe, Bruyette&Woods, Inc., World Trade Center.',
 'Andrew Anthony Abate, 37, Melville, N.Y., Cantor Fitzgerald, World Trade Center.',
 'Vincent Paul Abate, 40, Brooklyn, N.Y., Cantor Fitzgerald, World Trade Center.',
 'Laurence Christopher Abel, 37, New York City, Cantor Fitzgerald, World Trade Center.',
 'Alona Abraham, 30, Ashdod, Israel, Passenger, United 175, World Trade Center.',
 'William F. Abrahamson, 55, Westchester County, N.Y., Marsh&McLennan Companies, Inc., World Trade Center.',
 'Richard Anthony Aceto, 42, Marsh&McLennan Companies, Inc., World Trade Center.',
 'Heinrich Bernhard Ackermann, 38, Aon Corporation, World Trade Center.',
 'Paul Acquaviva, 29, Glen Rock, N.J., Cantor Fitzgerald, World Trade Center.',
 'Christian Adams, 37, Passenger, United 93, Shanksville, Pa.',


## Preprocessing 

Below I have transfered over the preprocessing functions and applied them to the data.

In [39]:
# Imports
from composable import pipeable
from composable.strict import map

In [40]:
# Helper functions
add_missing_period = pipeable(lambda line: line if line.endswith('.') else line + '.' )
fix_world_trade = pipeable(lambda line: line.replace('WorldTrade', 'World Trade'))

In [41]:
(grouped_lines
>> map(add_missing_period)
>> map(fix_world_trade)
)

["Gordon M. Aamoth, Jr., 32, Sandler O'Neill + Partners, World Trade Center.",
 'Edelmiro Abad, 54, Brooklyn, N.Y., Fiduciary Trust Company International, World Trade Center.',
 'Marie Rose Abad, 49, Keefe, Bruyette&Woods, Inc., World Trade Center.',
 'Andrew Anthony Abate, 37, Melville, N.Y., Cantor Fitzgerald, World Trade Center.',
 'Vincent Paul Abate, 40, Brooklyn, N.Y., Cantor Fitzgerald, World Trade Center.',
 'Laurence Christopher Abel, 37, New York City, Cantor Fitzgerald, World Trade Center.',
 'Alona Abraham, 30, Ashdod, Israel, Passenger, United 175, World Trade Center.',
 'William F. Abrahamson, 55, Westchester County, N.Y., Marsh&McLennan Companies, Inc., World Trade Center.',
 'Richard Anthony Aceto, 42, Marsh&McLennan Companies, Inc., World Trade Center.',
 'Heinrich Bernhard Ackermann, 38, Aon Corporation, World Trade Center.',
 'Paul Acquaviva, 29, Glen Rock, N.J., Cantor Fitzgerald, World Trade Center.',
 'Christian Adams, 37, Passenger, United 93, Shanksville, Pa.',


In [42]:
# For convenience I will give these a name
prepped_lines = (grouped_lines 
                >> map(add_missing_period)
                >> map(fix_world_trade)
                )
prepped_lines

["Gordon M. Aamoth, Jr., 32, Sandler O'Neill + Partners, World Trade Center.",
 'Edelmiro Abad, 54, Brooklyn, N.Y., Fiduciary Trust Company International, World Trade Center.',
 'Marie Rose Abad, 49, Keefe, Bruyette&Woods, Inc., World Trade Center.',
 'Andrew Anthony Abate, 37, Melville, N.Y., Cantor Fitzgerald, World Trade Center.',
 'Vincent Paul Abate, 40, Brooklyn, N.Y., Cantor Fitzgerald, World Trade Center.',
 'Laurence Christopher Abel, 37, New York City, Cantor Fitzgerald, World Trade Center.',
 'Alona Abraham, 30, Ashdod, Israel, Passenger, United 175, World Trade Center.',
 'William F. Abrahamson, 55, Westchester County, N.Y., Marsh&McLennan Companies, Inc., World Trade Center.',
 'Richard Anthony Aceto, 42, Marsh&McLennan Companies, Inc., World Trade Center.',
 'Heinrich Bernhard Ackermann, 38, Aon Corporation, World Trade Center.',
 'Paul Acquaviva, 29, Glen Rock, N.J., Cantor Fitzgerald, World Trade Center.',
 'Christian Adams, 37, Passenger, United 93, Shanksville, Pa.',


## Regular expression from lab 2

Below I have attempted to combine all of the regular expressions from lab 2

In [43]:
import re
line_parts = re.compile('^(.+), (\?\?|\d{1,3}),(.*?)( Passenger,| Flight Crew,)?( United \d{2,3},| American \d{2,3},)?( World Trade Center| Pentagon| Shanksville, Pa)(, died \d{1,2}/\d{1,2}/\d{1,2})?\.$')

In [44]:
prepped_lines[2402]

'Jesus Sanchez, 45, Flight Crew, United 175, World Trade Center.'

In [45]:
line_parts.search(prepped_lines[2402]).groups() # pull things apart using groups()

('Jesus Sanchez',
 '45',
 '',
 ' Flight Crew,',
 ' United 175,',
 ' World Trade Center',
 None)

#### Always check for non-matches

In [46]:
help(enumerate)

Help on class enumerate in module builtins:

class enumerate(object)
 |  enumerate(iterable, start=0)
 |  
 |  Return an enumerate object.
 |  
 |    iterable
 |      an object supporting iteration
 |  
 |  The enumerate object yields pairs containing a count (from start, which
 |  defaults to zero) and a value yielded by the iterable argument.
 |  
 |  enumerate is useful for obtaining an indexed list:
 |      (0, seq[0]), (1, seq[1]), (2, seq[2]), ...
 |  
 |  Methods defined here:
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 |  __iter__(self, /)
 |      Implement iter(self).
 |  
 |  __next__(self, /)
 |      Implement next(self).
 |  
 |  __reduce__(...)
 |      Return state information for pickling.
 |  
 |  ----------------------------------------------------------------------
 |  Static methods defined here:
 |  
 |  __new__(*args, **kwargs) from builtins.type
 |      Create and return a new object.  See help(type) for accurate signature.



In [47]:
[(i, l) for i, l in enumerate(prepped_lines) if not line_parts.search(l)] 

[]

## Verbose regular expressions

**Pros:**
* Spread over multiple lines
* Allow comments

**Cons:**
* Ignore white space outside `()`
* Require escaping spaces `\ `

In [48]:
# Without Using VERBOSE 
regex_email = re.compile(r'^([a-z0-9_\.-]+)@([0-9a-z\.-]+)\.([a-z\.]{2, 6})$')

In [49]:
# Using VERBOSE 
regex_email = re.compile(r""" 
                        ^([a-z0-9_\.-]+)			 # local Part 
                        @							 # single @ sign 
                        ([0-9a-z\.-]+)			 	 # Domain name 
                        \.						 	 # single Dot . 
                        ([a-z]{2,6})$				 # Top level Domain 
                        """,re.VERBOSE)

## Another example.

This example, from the Python docs, shows how to space out an OR section across multiple lines.

In [50]:
charref = re.compile(r"""
 &[#]                # Start of a numeric entity reference
 (
     0[0-7]+         # Octal form
   | [0-9]+          # Decimal form
   | x[0-9a-fA-F]+   # Hexadecimal form
 )
 ;                   # Trailing semicolon
""", re.VERBOSE)

## Cleaning up our regular expr

<h2> <font color="red"> Exercise 4.6.1 - Clean up the regular expression </font> </h2>

To clean up the regular expression, 

1. Replace all spaces with `\ ` or `\s` (I prefer the second)
2. Turn the string into a multi-line string.
3. Spread the parts over many lines
4. Add comments.

> Describe the bug here

In [51]:
# Your fix here
line_parts = re.compile('^(.+), (\?\?|\d{1,3}),(.*?)( Passenger,| Flight Crew,)?( United \d{2,3},| American \d{2,3},)?( World Trade Center| Pentagon| Shanksville, Pa)(, died \d{1,2}/\d{1,2}/\d{1,2})?\.$')

In [52]:
line_parts = re.compile(r"""
^(.+),
(
      \s\?\?                          # ??(missing age)
    | \s\d{1,3}                       # age
),
(.*?)                                 # hometown of victim
(
      \sPassenger,                    # optional flight status
    | \sFlight Crew,
)?
(
      \sUnited \d{2,3},               # optional flight carrier
    | \sAmerican \d{2,3},
)?
(
      \sWorld Trade Center            # location of death
    | \sPentagon
    | \sShanksville, Pa
)
(
    ,\sdied\s\d{1,2}/\d{1,2}/\d{1,2}  # date of death
)?
\.$
""", re.VERBOSE)

## Progress so far

In [53]:
# Imports
from composable import pipeable
from composable.strict import map

In [54]:
# Reg Ex for a line
line_parts = re.compile(r'''^(.+),
(
      \s\?\?                          # ??
    | \s\d{1,3}                       # or age
),
(.*?)                                 # Includes hometown and 
(
        \sPassenger,                  # Optional flight status
    |   \sFlightsCrew,
)?
(
      \sUnited\s\d{2,3},              # Optional flight
    | \sAmericans\d{2,3},
)?
(
       \sWorld\sTrade\sCenter         # Location
    |  \sPentagon
    |  \sShanksville,\sPa
)
(
    ,\sdied\s\d{1,2}/\d{1,2}/\d{1,2}  # Optional date of death
)?
\.$''', re.VERBOSE)

In [55]:
# Helper functions
add_missing_period = pipeable(lambda line: line if line.endswith('.') else line + '.' )
fix_world_trade = pipeable(lambda line: line.replace('WorldTrade', 'World Trade'))
# New
get_line_parts = pipeable(lambda line: line_parts.search(line).groups(default='')) # replace None with ''

In [56]:
[(i, l) for i, l in enumerate(prepped_lines) if not line_parts.search(l)] # check for non matches

[]

In [57]:
split_lines =  (grouped_lines
                >> map(add_missing_period)
                >> map(fix_world_trade)
                >> map(get_line_parts)
                )
split_lines # with .groups()

[('Gordon M. Aamoth, Jr.',
  ' 32',
  " Sandler O'Neill + Partners,",
  '',
  '',
  ' World Trade Center',
  ''),
 ('Edelmiro Abad',
  ' 54',
  ' Brooklyn, N.Y., Fiduciary Trust Company International,',
  '',
  '',
  ' World Trade Center',
  ''),
 ('Marie Rose Abad',
  ' 49',
  ' Keefe, Bruyette&Woods, Inc.,',
  '',
  '',
  ' World Trade Center',
  ''),
 ('Andrew Anthony Abate',
  ' 37',
  ' Melville, N.Y., Cantor Fitzgerald,',
  '',
  '',
  ' World Trade Center',
  ''),
 ('Vincent Paul Abate',
  ' 40',
  ' Brooklyn, N.Y., Cantor Fitzgerald,',
  '',
  '',
  ' World Trade Center',
  ''),
 ('Laurence Christopher Abel',
  ' 37',
  ' New York City, Cantor Fitzgerald,',
  '',
  '',
  ' World Trade Center',
  ''),
 ('Alona Abraham',
  ' 30',
  ' Ashdod, Israel,',
  ' Passenger,',
  ' United 175,',
  ' World Trade Center',
  ''),
 ('William F. Abrahamson',
  ' 55',
  ' Westchester County, N.Y., Marsh&McLennan Companies, Inc.,',
  '',
  '',
  ' World Trade Center',
  ''),
 ('Richard Anthony Ac

## Pulling out and cleaning up names

Sometimes it is useful to pull the various columns apart and clean them up separately.  To illustrate, will will pull out and clean up the names. We can do this using the `get` function from `toolz.curried` which *gets* the value from a list at a given index.

In [58]:
from toolz.curried import get

In [59]:
(split_lines
>> map(get(0)) # get only names
)

['Gordon M. Aamoth, Jr.',
 'Edelmiro Abad',
 'Marie Rose Abad',
 'Andrew Anthony Abate',
 'Vincent Paul Abate',
 'Laurence Christopher Abel',
 'Alona Abraham',
 'William F. Abrahamson',
 'Richard Anthony Aceto',
 'Heinrich Bernhard Ackermann',
 'Paul Acquaviva',
 'Christian Adams',
 'Donald LaRoy Adams',
 'Patrick Adams',
 'Shannon Lewis Adams',
 'Stephen George Adams',
 'Ignatius Udo Adanga',
 'Christy A. Addamo',
 'Terence Edward Adderley, Jr.',
 'Sophia B. Addo',
 'Lee Adler',
 'Daniel Thomas Afflitto',
 'Emmanuel Akwasi Afuakwah',
 'Alok Agarwal',
 'Mukul Kumar Agarwala',
 'Joseph Agnello',
 'David Scott Agnes',
 'Joao Alberto da Fonseca Aguiar, Jr.',
 'Brian G. Ahearn',
 'Jeremiah Joseph Ahern',
 'Joanne Marie Ahladiotis',
 'Shabbir Ahmed',
 'Terrance Andre Aiken',
 'Godwin O. Ajala',
 'Trudi M. Alagero',
 'Andrew Alameno',
 'Margaret Ann Alario',
 'Gary M. Albero',
 'Jon Leslie Albert',
 'Peter Craig Alderman',
 'Jacquelyn Delaine Aldridge-Frederick',
 'David D. Alger',
 'Ernest 

Now we can clean up these name by removing commas.

In [60]:
# helper function
remove_commas = lambda s: s.replace(',', '')

(split_lines
>> map(get(0))
>> map(remove_commas) 
)

['Gordon M. Aamoth Jr.',
 'Edelmiro Abad',
 'Marie Rose Abad',
 'Andrew Anthony Abate',
 'Vincent Paul Abate',
 'Laurence Christopher Abel',
 'Alona Abraham',
 'William F. Abrahamson',
 'Richard Anthony Aceto',
 'Heinrich Bernhard Ackermann',
 'Paul Acquaviva',
 'Christian Adams',
 'Donald LaRoy Adams',
 'Patrick Adams',
 'Shannon Lewis Adams',
 'Stephen George Adams',
 'Ignatius Udo Adanga',
 'Christy A. Addamo',
 'Terence Edward Adderley Jr.',
 'Sophia B. Addo',
 'Lee Adler',
 'Daniel Thomas Afflitto',
 'Emmanuel Akwasi Afuakwah',
 'Alok Agarwal',
 'Mukul Kumar Agarwala',
 'Joseph Agnello',
 'David Scott Agnes',
 'Joao Alberto da Fonseca Aguiar Jr.',
 'Brian G. Ahearn',
 'Jeremiah Joseph Ahern',
 'Joanne Marie Ahladiotis',
 'Shabbir Ahmed',
 'Terrance Andre Aiken',
 'Godwin O. Ajala',
 'Trudi M. Alagero',
 'Andrew Alameno',
 'Margaret Ann Alario',
 'Gary M. Albero',
 'Jon Leslie Albert',
 'Peter Craig Alderman',
 'Jacquelyn Delaine Aldridge-Frederick',
 'David D. Alger',
 'Ernest Ali

## Pulling out and cleaning up ages

NExt, we will pull out and clean the ages.  In this case, we should replace the missing values, currently `'??'`, to blanks.

In [61]:
remove_quest_mark = lambda s: s.replace('??', '')

(split_lines
>> map(get(1)) # get age only
>> map(remove_quest_mark)
)

[' 32',
 ' 54',
 ' 49',
 ' 37',
 ' 40',
 ' 37',
 ' 30',
 ' 55',
 ' 42',
 ' 38',
 ' 29',
 ' 37',
 ' 28',
 ' 61',
 ' 25',
 ' 51',
 ' 62',
 ' 28',
 ' 22',
 ' 36',
 ' 48',
 ' 32',
 ' 37',
 ' 36',
 ' 37',
 ' 35',
 ' 46',
 ' 30',
 ' 43',
 ' 74',
 ' 27',
 ' 47',
 ' 30',
 ' 33',
 ' 37',
 ' 37',
 ' 41',
 ' 39',
 ' 46',
 ' 25',
 ' 46',
 ' 57',
 ' 43',
 ' 51',
 ' 44',
 ' 39',
 ' 31',
 ' 30',
 ' 36',
 ' 48',
 ' 41',
 ' 31',
 ' 23',
 ' 38',
 ' 25',
 ' 60',
 ' 40',
 ' 60',
 ' 43',
 ' 41',
 ' 32',
 ' 29',
 ' 28',
 ' 42',
 ' 35',
 ' 26',
 ' 57',
 ' 53',
 ' 52',
 ' 34',
 ' 43',
 ' 37',
 ' 63',
 ' 38',
 ' 54',
 ' 52',
 ' 23',
 ' 44',
 ' 32',
 ' 48',
 ' 26',
 ' 55',
 ' 26',
 ' 26',
 ' 36',
 ' 45',
 ' 32',
 ' 38',
 ' 37',
 ' 34',
 ' 52',
 ' 29',
 ' 48',
 ' 50',
 ' 49',
 ' 37',
 ' 47',
 ' 53',
 ' 25',
 ' 21',
 ' 35',
 ' 44',
 ' 38',
 ' 38',
 ' 58',
 ' 41',
 ' 36',
 ' 47',
 ' 48',
 ' 35',
 ' 43',
 ' 44',
 ' 29',
 ' 28',
 ' 53',
 ' 43',
 ' 36',
 ' 43',
 ' 33',
 ' 44',
 ' 35',
 ' 42',
 ' 53',
 ' 35',
 ' 53',


## Progress so far

In [62]:
# Imports
from composable import pipeable
from composable.strict import map

In [63]:
# Reg Ex for a line
line_parts = re.compile(r'''^(.+),
(
      \s\?\?                          # ??
    | \s\d{1,3}                       # or age
),
(.*?)                                 # Includes hometown and 
(
        \sPassenger,                  # Optional flight status
    |   \sFlightsCrew,
)?
(
      \sUnited\s\d{2,3},              # Optional flight
    | \sAmericans\d{2,3},
)?
(
       \sWorld\sTrade\sCenter         # Location
    |  \sPentagon
    |  \sShanksville,\sPa
)
(
    ,\sdied\s\d{1,2}/\d{1,2}/\d{1,2}  # Optional date of death
)?
\.$''', re.VERBOSE)

In [67]:
# Helper functions
add_missing_period = pipeable(lambda line: line if line.endswith('.') else line + '.' )
fix_world_trade = pipeable(lambda line: line.replace('WorldTrade', 'World Trade'))
get_line_parts = pipeable(lambda line: line_parts.search(line).groups(default=''))
# New
remove_commas = lambda s: s.replace(',', '') # in names
remove_quest_mark = lambda s: s.replace('??', '') # in ages
strip = lambda s: s.strip() # strip whitespace

In [68]:
[(i, l) for i, l in enumerate(prepped_lines) if not line_parts.search(l)]

[]

In [69]:
names = (split_lines 
         >> map(get(0))
         >> map(remove_commas)
         >> map(strip)
        )
names

['Gordon M. Aamoth Jr.',
 'Edelmiro Abad',
 'Marie Rose Abad',
 'Andrew Anthony Abate',
 'Vincent Paul Abate',
 'Laurence Christopher Abel',
 'Alona Abraham',
 'William F. Abrahamson',
 'Richard Anthony Aceto',
 'Heinrich Bernhard Ackermann',
 'Paul Acquaviva',
 'Christian Adams',
 'Donald LaRoy Adams',
 'Patrick Adams',
 'Shannon Lewis Adams',
 'Stephen George Adams',
 'Ignatius Udo Adanga',
 'Christy A. Addamo',
 'Terence Edward Adderley Jr.',
 'Sophia B. Addo',
 'Lee Adler',
 'Daniel Thomas Afflitto',
 'Emmanuel Akwasi Afuakwah',
 'Alok Agarwal',
 'Mukul Kumar Agarwala',
 'Joseph Agnello',
 'David Scott Agnes',
 'Joao Alberto da Fonseca Aguiar Jr.',
 'Brian G. Ahearn',
 'Jeremiah Joseph Ahern',
 'Joanne Marie Ahladiotis',
 'Shabbir Ahmed',
 'Terrance Andre Aiken',
 'Godwin O. Ajala',
 'Trudi M. Alagero',
 'Andrew Alameno',
 'Margaret Ann Alario',
 'Gary M. Albero',
 'Jon Leslie Albert',
 'Peter Craig Alderman',
 'Jacquelyn Delaine Aldridge-Frederick',
 'David D. Alger',
 'Ernest Ali

In [70]:
ages = (split_lines 
        >> map(get(1)) 
        >> map(remove_quest_mark)
        >> map(strip)
       )
ages

['32',
 '54',
 '49',
 '37',
 '40',
 '37',
 '30',
 '55',
 '42',
 '38',
 '29',
 '37',
 '28',
 '61',
 '25',
 '51',
 '62',
 '28',
 '22',
 '36',
 '48',
 '32',
 '37',
 '36',
 '37',
 '35',
 '46',
 '30',
 '43',
 '74',
 '27',
 '47',
 '30',
 '33',
 '37',
 '37',
 '41',
 '39',
 '46',
 '25',
 '46',
 '57',
 '43',
 '51',
 '44',
 '39',
 '31',
 '30',
 '36',
 '48',
 '41',
 '31',
 '23',
 '38',
 '25',
 '60',
 '40',
 '60',
 '43',
 '41',
 '32',
 '29',
 '28',
 '42',
 '35',
 '26',
 '57',
 '53',
 '52',
 '34',
 '43',
 '37',
 '63',
 '38',
 '54',
 '52',
 '23',
 '44',
 '32',
 '48',
 '26',
 '55',
 '26',
 '26',
 '36',
 '45',
 '32',
 '38',
 '37',
 '34',
 '52',
 '29',
 '48',
 '50',
 '49',
 '37',
 '47',
 '53',
 '25',
 '21',
 '35',
 '44',
 '38',
 '38',
 '58',
 '41',
 '36',
 '47',
 '48',
 '35',
 '43',
 '44',
 '29',
 '28',
 '53',
 '43',
 '36',
 '43',
 '33',
 '44',
 '35',
 '42',
 '53',
 '35',
 '53',
 '23',
 '32',
 '26',
 '34',
 '37',
 '27',
 '55',
 '38',
 '41',
 '35',
 '60',
 '48',
 '28',
 '44',
 '29',
 '43',
 '23',
 '48',

<h2> <font color="red"> Exercise 4.6.2 - Separating and cleaning other columns. </font> </h2>

To clean up the following columns 

1. Grab the date of death and replace the missing values with `9/11/2001`
2. Grab the locations (e.g. `World Trade Center`) and remove the comma from `'Shanksville, Pa.`
3. Grab the flights.
4. Grab the passenger status.

**Note:** Be sure to strip whitespace from all of them.

In [71]:
split_lines

[('Gordon M. Aamoth, Jr.',
  ' 32',
  " Sandler O'Neill + Partners,",
  '',
  '',
  ' World Trade Center',
  ''),
 ('Edelmiro Abad',
  ' 54',
  ' Brooklyn, N.Y., Fiduciary Trust Company International,',
  '',
  '',
  ' World Trade Center',
  ''),
 ('Marie Rose Abad',
  ' 49',
  ' Keefe, Bruyette&Woods, Inc.,',
  '',
  '',
  ' World Trade Center',
  ''),
 ('Andrew Anthony Abate',
  ' 37',
  ' Melville, N.Y., Cantor Fitzgerald,',
  '',
  '',
  ' World Trade Center',
  ''),
 ('Vincent Paul Abate',
  ' 40',
  ' Brooklyn, N.Y., Cantor Fitzgerald,',
  '',
  '',
  ' World Trade Center',
  ''),
 ('Laurence Christopher Abel',
  ' 37',
  ' New York City, Cantor Fitzgerald,',
  '',
  '',
  ' World Trade Center',
  ''),
 ('Alona Abraham',
  ' 30',
  ' Ashdod, Israel,',
  ' Passenger,',
  ' United 175,',
  ' World Trade Center',
  ''),
 ('William F. Abrahamson',
  ' 55',
  ' Westchester County, N.Y., Marsh&McLennan Companies, Inc.,',
  '',
  '',
  ' World Trade Center',
  ''),
 ('Richard Anthony Ac

In [74]:
# Helper functions
add_missing_period = pipeable(lambda line: line if line.endswith('.') else line + '.' )
fix_world_trade = pipeable(lambda line: line.replace('WorldTrade', ' World Trade'))
get_line_parts = pipeable(lambda line: line_parts.search(line).groups(default=''))
remove_commas = lambda s: s.replace(',', '') # in names
remove_quest_mark = lambda s: s.replace('??', '') # in ages
replace_date = lambda s: s.replace('', ', died 9/11/2001') if s == '' else s # for dates
strip = lambda s: s.strip() # strip whitespace

In [75]:
[(i, l) for i, l in enumerate(prepped_lines) if not line_parts.search(l)]

[]

In [76]:
# Your fix here
replace_date = lambda s: s.replace('', ' died 9/11/2001') if s == '' else s

dates = (split_lines 
         >> map(get(6))
         >> map(replace_date)
         >> map(strip)
        )
dates

['died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 ', died 9/15/01',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/2001',
 'died 9/11/

In [77]:
# location
locations = (split_lines 
             >> map(get(5)) 
             >> map(remove_commas)
             >> map(strip)
            )
locations

['World Trade Center',
 'World Trade Center',
 'World Trade Center',
 'World Trade Center',
 'World Trade Center',
 'World Trade Center',
 'World Trade Center',
 'World Trade Center',
 'World Trade Center',
 'World Trade Center',
 'World Trade Center',
 'Shanksville Pa',
 'World Trade Center',
 'World Trade Center',
 'World Trade Center',
 'World Trade Center',
 'World Trade Center',
 'World Trade Center',
 'World Trade Center',
 'World Trade Center',
 'World Trade Center',
 'World Trade Center',
 'World Trade Center',
 'World Trade Center',
 'World Trade Center',
 'World Trade Center',
 'World Trade Center',
 'World Trade Center',
 'World Trade Center',
 'World Trade Center',
 'World Trade Center',
 'World Trade Center',
 'World Trade Center',
 'World Trade Center',
 'World Trade Center',
 'World Trade Center',
 'World Trade Center',
 'World Trade Center',
 'World Trade Center',
 'World Trade Center',
 'World Trade Center',
 'World Trade Center',
 'World Trade Center',
 'World Trade C

In [78]:
# flights
flights = (split_lines 
           >> map(get(4))
           >> map(remove_commas)
           >> map(strip)
          )
flights

['',
 '',
 '',
 '',
 '',
 '',
 'United 175',
 '',
 '',
 '',
 '',
 'United 93',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'United 11',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'United 11',
 'United 11',
 '',
 '',
 '',
 'United 11',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'United 11',
 '',
 '',
 '',
 '',
 '',
 'United 11',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'United 175',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'United 11',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'United 175',
 '',
 'United 93',
 '',
 'United 93',
 '',
 '',
 'United 93',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 

In [79]:
# passenger status
passenger_status = (split_lines 
                    >> map(get(3))
                    >> map(remove_commas)
                    >> map(strip)
                   )
passenger_status

['',
 '',
 '',
 '',
 '',
 '',
 'Passenger',
 '',
 '',
 '',
 '',
 'Passenger',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'Passenger',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'Passenger',
 'Passenger',
 '',
 '',
 '',
 'Passenger',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'Passenger',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'Passenger',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'Passenger',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'Passenger',
 '',
 '',
 '',
 'Passenger',
 '',
 '',
 'Passenger',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '

## Grabbing the troubling bit

We have made significant progress, but still need to work on the third entry, which contains the hometown and employment information.  Again, we can do this using the `get` function from `toolz.curried` which *gets* the value from a list at a given index.

In [80]:
troubling_bit = (split_lines
                >> map(get(2))
                )
troubling_bit

[" Sandler O'Neill + Partners,",
 ' Brooklyn, N.Y., Fiduciary Trust Company International,',
 ' Keefe, Bruyette&Woods, Inc.,',
 ' Melville, N.Y., Cantor Fitzgerald,',
 ' Brooklyn, N.Y., Cantor Fitzgerald,',
 ' New York City, Cantor Fitzgerald,',
 ' Ashdod, Israel,',
 ' Westchester County, N.Y., Marsh&McLennan Companies, Inc.,',
 ' Marsh&McLennan Companies, Inc.,',
 ' Aon Corporation,',
 ' Glen Rock, N.J., Cantor Fitzgerald,',
 '',
 ' Cantor Fitzgerald,',
 ' Fuji Bank, Ltd. security,',
 ' Cantor Fitzgerald,',
 ' New York City, Windows on the World,',
 ' Bronx, N.Y., New York Metropolitan Transportation Council,',
 ' New Hyde Park, N.Y., Marsh&McLennan Companies, Inc.,',
 ' New York City, Fred Alger Management, Inc.,',
 ' Bronx, N.Y., Windows on the World,',
 ' Cantor Fitzgerald,',
 ' Manalapan, N.J., Cantor Fitzgerald,',
 ' Windows on the World,',
 ' Cantor Fitzgerald,',
 ' Fiduciary Trust Company International,',
 ' Belle Harbor, N.Y., New York City Fire Department,',
 ' Port Washingto

## Progressively filtering out states

We will start by matching two of the most common states, NY and NJ.

In [81]:
state = re.compile(', (N\.Y\.|N\.J\.),?')
# Rows that match
[(l, state.search(l)) for l in troubling_bit]

[(" Sandler O'Neill + Partners,", None),
 (' Brooklyn, N.Y., Fiduciary Trust Company International,',
  <re.Match object; span=(9, 16), match=', N.Y.,'>),
 (' Keefe, Bruyette&Woods, Inc.,', None),
 (' Melville, N.Y., Cantor Fitzgerald,',
  <re.Match object; span=(9, 16), match=', N.Y.,'>),
 (' Brooklyn, N.Y., Cantor Fitzgerald,',
  <re.Match object; span=(9, 16), match=', N.Y.,'>),
 (' New York City, Cantor Fitzgerald,', None),
 (' Ashdod, Israel,', None),
 (' Westchester County, N.Y., Marsh&McLennan Companies, Inc.,',
  <re.Match object; span=(19, 26), match=', N.Y.,'>),
 (' Marsh&McLennan Companies, Inc.,', None),
 (' Aon Corporation,', None),
 (' Glen Rock, N.J., Cantor Fitzgerald,',
  <re.Match object; span=(10, 17), match=', N.J.,'>),
 ('', None),
 (' Cantor Fitzgerald,', None),
 (' Fuji Bank, Ltd. security,', None),
 (' Cantor Fitzgerald,', None),
 (' New York City, Windows on the World,', None),
 (' Bronx, N.Y., New York Metropolitan Transportation Council,',
  <re.Match object;

and inspecting all rows that don't match for additional states or problems

In [82]:
[(i, l) for i, l in enumerate(troubling_bit) if not state.search(l)]

[(0, " Sandler O'Neill + Partners,"),
 (2, ' Keefe, Bruyette&Woods, Inc.,'),
 (5, ' New York City, Cantor Fitzgerald,'),
 (6, ' Ashdod, Israel,'),
 (8, ' Marsh&McLennan Companies, Inc.,'),
 (9, ' Aon Corporation,'),
 (11, ''),
 (12, ' Cantor Fitzgerald,'),
 (13, ' Fuji Bank, Ltd. security,'),
 (14, ' Cantor Fitzgerald,'),
 (15, ' New York City, Windows on the World,'),
 (18, ' New York City, Fred Alger Management, Inc.,'),
 (20, ' Cantor Fitzgerald,'),
 (22, ' Windows on the World,'),
 (23, ' Cantor Fitzgerald,'),
 (24, ' Fiduciary Trust Company International,'),
 (29, ' New Jersey, New York State Department of Taxation and Finance,'),
 (32, ' Marsh&McLennan consultant,'),
 (33, ' Summit Security Services, Inc.,'),
 (34, ' New York City, Marsh&McLennan Companies, Inc.,'),
 (39,
  ' New York City, Risk Waters Group conference attendee from Bloomberg L.P.,'),
 (41, ' New York City, Fred Alger Management, Inc.,'),
 (43, ' Cantor Fitzgerald,'),
 (45, ' Cantor Fitzgerald,'),
 (49, ' Stoneha

## Fixing a common problem.

Notice that many rows simply contain ` New York City,` without the state.  Let's fix this problem in our preprocessing step.

In [83]:
grouped_lines[41]

'David D. Alger, 57, New York City, Fred Alger Management, Inc., World Trade Center.'

In [84]:
fix_nyc = pipeable(lambda line: line.replace(', New York City,', ', New York City, N.Y.,'))
grouped_lines[41] >> fix_nyc

'David D. Alger, 57, New York City, N.Y., Fred Alger Management, Inc., World Trade Center.'

## Progress so far

In [85]:
# Imports
from composable import pipeable
from composable.strict import map

In [86]:
# Reg Ex for a line
line_parts = re.compile(r'''^(.+),
(
      \s\?\?                          # ??
    | \s\d{1,3}                       # or age
),
(.*?)                                 # Includes hometown and 
(
        \sPassenger,                  # Optional flight status
    |   \sFlightsCrew,
)?
(
      \sUnited\s\d{2,3},              # Optional flight
    | \sAmericans\d{2,3},
)?
(
       \sWorld\sTrade\sCenter         # Location
    |  \sPentagon
    |  \sShanksville,\sPa
)
(
    ,\sdied\s\d{1,2}/\d{1,2}/\d{1,2}  # Optional date of death
)?
\.$''', re.VERBOSE)

In [87]:
# Helper functions
add_missing_period = pipeable(lambda line: line if line.endswith('.') else line + '.' )
fix_world_trade = pipeable(lambda line: line.replace('WorldTrade', 'World Trade'))
get_line_parts = pipeable(lambda line: line_parts.search(line).groups(default=''))
remove_commas = lambda s: s.replace(',', '')
# New
fix_nyc = pipeable(lambda line: line.replace(', New York City,', ', New York City, N.Y.,'))

In [88]:
[(i, l) for i, l in enumerate(prepped_lines) if not line_parts.search(l)]

[]

In [89]:
split_lines =  (grouped_lines
                >> map(add_missing_period)
                >> map(fix_world_trade)
                >> map(fix_nyc)
                >> map(get_line_parts)
                )
split_lines

[('Gordon M. Aamoth, Jr.',
  ' 32',
  " Sandler O'Neill + Partners,",
  '',
  '',
  ' World Trade Center',
  ''),
 ('Edelmiro Abad',
  ' 54',
  ' Brooklyn, N.Y., Fiduciary Trust Company International,',
  '',
  '',
  ' World Trade Center',
  ''),
 ('Marie Rose Abad',
  ' 49',
  ' Keefe, Bruyette&Woods, Inc.,',
  '',
  '',
  ' World Trade Center',
  ''),
 ('Andrew Anthony Abate',
  ' 37',
  ' Melville, N.Y., Cantor Fitzgerald,',
  '',
  '',
  ' World Trade Center',
  ''),
 ('Vincent Paul Abate',
  ' 40',
  ' Brooklyn, N.Y., Cantor Fitzgerald,',
  '',
  '',
  ' World Trade Center',
  ''),
 ('Laurence Christopher Abel',
  ' 37',
  ' New York City, N.Y., Cantor Fitzgerald,',
  '',
  '',
  ' World Trade Center',
  ''),
 ('Alona Abraham',
  ' 30',
  ' Ashdod, Israel,',
  ' Passenger,',
  ' United 175,',
  ' World Trade Center',
  ''),
 ('William F. Abrahamson',
  ' 55',
  ' Westchester County, N.Y., Marsh&McLennan Companies, Inc.,',
  '',
  '',
  ' World Trade Center',
  ''),
 ('Richard Anth

In [90]:
names =  (split_lines
        >> map(get(0))
        >> map(remove_commas)
        )
names

['Gordon M. Aamoth Jr.',
 'Edelmiro Abad',
 'Marie Rose Abad',
 'Andrew Anthony Abate',
 'Vincent Paul Abate',
 'Laurence Christopher Abel',
 'Alona Abraham',
 'William F. Abrahamson',
 'Richard Anthony Aceto',
 'Heinrich Bernhard Ackermann',
 'Paul Acquaviva',
 'Christian Adams',
 'Donald LaRoy Adams',
 'Patrick Adams',
 'Shannon Lewis Adams',
 'Stephen George Adams',
 'Ignatius Udo Adanga',
 'Christy A. Addamo',
 'Terence Edward Adderley Jr.',
 'Sophia B. Addo',
 'Lee Adler',
 'Daniel Thomas Afflitto',
 'Emmanuel Akwasi Afuakwah',
 'Alok Agarwal',
 'Mukul Kumar Agarwala',
 'Joseph Agnello',
 'David Scott Agnes',
 'Joao Alberto da Fonseca Aguiar Jr.',
 'Brian G. Ahearn',
 'Jeremiah Joseph Ahern',
 'Joanne Marie Ahladiotis',
 'Shabbir Ahmed',
 'Terrance Andre Aiken',
 'Godwin O. Ajala',
 'Trudi M. Alagero',
 'Andrew Alameno',
 'Margaret Ann Alario',
 'Gary M. Albero',
 'Jon Leslie Albert',
 'Peter Craig Alderman',
 'Jacquelyn Delaine Aldridge-Frederick',
 'David D. Alger',
 'Ernest Ali

In [91]:
troubling_bit = (grouped_lines
                >> map(add_missing_period)
                >> map(fix_world_trade)
                >> map(fix_nyc)
                >> map(get_line_parts)
                >> map(get(2))
                >> map(strip)
                )
troubling_bit

["Sandler O'Neill + Partners,",
 'Brooklyn, N.Y., Fiduciary Trust Company International,',
 'Keefe, Bruyette&Woods, Inc.,',
 'Melville, N.Y., Cantor Fitzgerald,',
 'Brooklyn, N.Y., Cantor Fitzgerald,',
 'New York City, N.Y., Cantor Fitzgerald,',
 'Ashdod, Israel,',
 'Westchester County, N.Y., Marsh&McLennan Companies, Inc.,',
 'Marsh&McLennan Companies, Inc.,',
 'Aon Corporation,',
 'Glen Rock, N.J., Cantor Fitzgerald,',
 '',
 'Cantor Fitzgerald,',
 'Fuji Bank, Ltd. security,',
 'Cantor Fitzgerald,',
 'New York City, N.Y., Windows on the World,',
 'Bronx, N.Y., New York Metropolitan Transportation Council,',
 'New Hyde Park, N.Y., Marsh&McLennan Companies, Inc.,',
 'New York City, N.Y., Fred Alger Management, Inc.,',
 'Bronx, N.Y., Windows on the World,',
 'Cantor Fitzgerald,',
 'Manalapan, N.J., Cantor Fitzgerald,',
 'Windows on the World,',
 'Cantor Fitzgerald,',
 'Fiduciary Trust Company International,',
 'Belle Harbor, N.Y., New York City Fire Department,',
 'Port Washington, N.Y.,

## Adding more states

Next, we will start adding start to our pattern, and again looking for additional states/problems.  For example, let's add the `Mass.` and `D.C.` patterns.

In [92]:
state = re.compile(', (N\.Y\.|N\.J\.|Mass\.|D\.C\.),?')
# rows that match
[(l, state.search(l)) for l in troubling_bit if state.search(l)]

[('Brooklyn, N.Y., Fiduciary Trust Company International,',
  <re.Match object; span=(8, 15), match=', N.Y.,'>),
 ('Melville, N.Y., Cantor Fitzgerald,',
  <re.Match object; span=(8, 15), match=', N.Y.,'>),
 ('Brooklyn, N.Y., Cantor Fitzgerald,',
  <re.Match object; span=(8, 15), match=', N.Y.,'>),
 ('New York City, N.Y., Cantor Fitzgerald,',
  <re.Match object; span=(13, 20), match=', N.Y.,'>),
 ('Westchester County, N.Y., Marsh&McLennan Companies, Inc.,',
  <re.Match object; span=(18, 25), match=', N.Y.,'>),
 ('Glen Rock, N.J., Cantor Fitzgerald,',
  <re.Match object; span=(9, 16), match=', N.J.,'>),
 ('New York City, N.Y., Windows on the World,',
  <re.Match object; span=(13, 20), match=', N.Y.,'>),
 ('Bronx, N.Y., New York Metropolitan Transportation Council,',
  <re.Match object; span=(5, 12), match=', N.Y.,'>),
 ('New Hyde Park, N.Y., Marsh&McLennan Companies, Inc.,',
  <re.Match object; span=(13, 20), match=', N.Y.,'>),
 ('New York City, N.Y., Fred Alger Management, Inc.,',
  <re

In [93]:
[(i, l) for i, l in enumerate(troubling_bit) if not state.search(l)] # non-match

[(0, "Sandler O'Neill + Partners,"),
 (2, 'Keefe, Bruyette&Woods, Inc.,'),
 (6, 'Ashdod, Israel,'),
 (8, 'Marsh&McLennan Companies, Inc.,'),
 (9, 'Aon Corporation,'),
 (11, ''),
 (12, 'Cantor Fitzgerald,'),
 (13, 'Fuji Bank, Ltd. security,'),
 (14, 'Cantor Fitzgerald,'),
 (20, 'Cantor Fitzgerald,'),
 (22, 'Windows on the World,'),
 (23, 'Cantor Fitzgerald,'),
 (24, 'Fiduciary Trust Company International,'),
 (29, 'New Jersey, New York State Department of Taxation and Finance,'),
 (32, 'Marsh&McLennan consultant,'),
 (33, 'Summit Security Services, Inc.,'),
 (43, 'Cantor Fitzgerald,'),
 (45, 'Cantor Fitzgerald,'),
 (51, 'Cantor Fitzgerald, Forte Food Service,'),
 (52, 'Windows on the World,'),
 (54, 'Windows on the World,'),
 (55, 'Marsh&McLennan Companies, Inc.,'),
 (56, 'Fiduciary Trust Company International,'),
 (57, 'ABM Industries Inc.,'),
 (58, 'New York City Fire Department,'),
 (59, 'Port Authority of New York and New Jersey first responders,'),
 (61, 'Port Authority Police Depa

<h2> <font color="red"> Exercise 4.6.2 - Continue the process. </font> </h2>

Now it is your turn.  You should

1. Keep adding states to the pattern.
2. Add preprocessing steps to fix any issues.

In [94]:
# Helper functions
add_missing_period = pipeable(lambda line: line if line.endswith('.') else line + '.' )
fix_world_trade = pipeable(lambda line: line.replace('WorldTrade', 'World Trade'))
get_line_parts = pipeable(lambda line: line_parts.search(line).groups(default=''))
remove_commas = lambda s: s.replace(',', '') # in names
remove_quest_mark = lambda s: s.replace('??', '') # in ages
replace_date = lambda s: s.replace('', ', died 9/11/2001') if s == '' else s # for dates
strip = lambda s: s.strip() # strip whitespace
fix_nyc = pipeable(lambda line: line.replace(', New York City,', ', New York City, N.Y.,')) # add missing N.Y.
#fix_ny = pipeable(lambda line: line.replace(', New York,', ', New York, N.Y.,'))
fix_newjersey = pipeable(lambda line: line.replace(', New Jersey,', ', New Jersey, N.J.,'))
fix_virginia = pipeable(lambda line: line.replace(', Virginia,', ', Virginia, Va.,'))
fix_penn = pipeable(lambda line: line.replace(', Pennsylvania,', ', Pennsylvania, Pa.,'))
fix_mary = pipeable(lambda line: line.replace(', Maryland,', ', Maryland, Md.,'))

In [95]:
[(i, l) for i, l in enumerate(prepped_lines) if not line_parts.search(l)]

[]

In [96]:
# Your code here
state = re.compile(', (N\.Y\.|N\.J\.|N\.H\.|N\.C\.|N\.M\.|D\.C\.|R\.I\.|Ky\.|Va\.|Md\.|Ga\.|La\.|Pa\.|Ill\.|Ariz\.|Calif\.|Conn\.|Fla\.|Mass\.|Mich\.|Tenn\.|New\sYork|New\sHampshire|New\sJersey|Hawaii|Iowa|Maine|Ohio|Utah|Texas|India|Japan|Germany|Philippines|Ontario,\sCanada|Manitoba,\sCanada|New\sSouth\sWales,\sAustralia|England,\sUnited\sKingdom),?')
# lines that match
[(l, state.search(l)) for l in troubling_bit if state.search(l)]

[('Brooklyn, N.Y., Fiduciary Trust Company International,',
  <re.Match object; span=(8, 15), match=', N.Y.,'>),
 ('Melville, N.Y., Cantor Fitzgerald,',
  <re.Match object; span=(8, 15), match=', N.Y.,'>),
 ('Brooklyn, N.Y., Cantor Fitzgerald,',
  <re.Match object; span=(8, 15), match=', N.Y.,'>),
 ('New York City, N.Y., Cantor Fitzgerald,',
  <re.Match object; span=(13, 20), match=', N.Y.,'>),
 ('Westchester County, N.Y., Marsh&McLennan Companies, Inc.,',
  <re.Match object; span=(18, 25), match=', N.Y.,'>),
 ('Glen Rock, N.J., Cantor Fitzgerald,',
  <re.Match object; span=(9, 16), match=', N.J.,'>),
 ('New York City, N.Y., Windows on the World,',
  <re.Match object; span=(13, 20), match=', N.Y.,'>),
 ('Bronx, N.Y., New York Metropolitan Transportation Council,',
  <re.Match object; span=(5, 12), match=', N.Y.,'>),
 ('New Hyde Park, N.Y., Marsh&McLennan Companies, Inc.,',
  <re.Match object; span=(13, 20), match=', N.Y.,'>),
 ('New York City, N.Y., Fred Alger Management, Inc.,',
  <re

In [97]:
# check individual line for new fixes
grouped_lines[682]

'Alberto Dominguez, 66, New South Wales, Australia, Passenger, United 11, World Trade Center.'

In [98]:
[(i, l) for i, l in enumerate(troubling_bit) if not state.search(l)] # non-match

[(0, "Sandler O'Neill + Partners,"),
 (2, 'Keefe, Bruyette&Woods, Inc.,'),
 (6, 'Ashdod, Israel,'),
 (8, 'Marsh&McLennan Companies, Inc.,'),
 (9, 'Aon Corporation,'),
 (11, ''),
 (12, 'Cantor Fitzgerald,'),
 (13, 'Fuji Bank, Ltd. security,'),
 (14, 'Cantor Fitzgerald,'),
 (20, 'Cantor Fitzgerald,'),
 (22, 'Windows on the World,'),
 (23, 'Cantor Fitzgerald,'),
 (24, 'Fiduciary Trust Company International,'),
 (32, 'Marsh&McLennan consultant,'),
 (33, 'Summit Security Services, Inc.,'),
 (43, 'Cantor Fitzgerald,'),
 (45, 'Cantor Fitzgerald,'),
 (51, 'Cantor Fitzgerald, Forte Food Service,'),
 (52, 'Windows on the World,'),
 (54, 'Windows on the World,'),
 (55, 'Marsh&McLennan Companies, Inc.,'),
 (56, 'Fiduciary Trust Company International,'),
 (57, 'ABM Industries Inc.,'),
 (58, 'New York City Fire Department,'),
 (59, 'Port Authority of New York and New Jersey first responders,'),
 (61, 'Port Authority Police Department,'),
 (69, 'Cantor Fitzgerald,'),
 (71, 'Marsh&McLennan Companies, 

In [99]:
troubling_bit = (grouped_lines
                >> map(add_missing_period)
                >> map(fix_world_trade)
                >> map(fix_nyc) # fix New York as well?
                >> map(fix_newjersey)
                >> map(fix_virginia)
                >> map(fix_penn)
                >> map(fix_mary)
                >> map(get_line_parts)
                >> map(get(2))
                >> map(strip)
                )
troubling_bit

["Sandler O'Neill + Partners,",
 'Brooklyn, N.Y., Fiduciary Trust Company International,',
 'Keefe, Bruyette&Woods, Inc.,',
 'Melville, N.Y., Cantor Fitzgerald,',
 'Brooklyn, N.Y., Cantor Fitzgerald,',
 'New York City, N.Y., Cantor Fitzgerald,',
 'Ashdod, Israel,',
 'Westchester County, N.Y., Marsh&McLennan Companies, Inc.,',
 'Marsh&McLennan Companies, Inc.,',
 'Aon Corporation,',
 'Glen Rock, N.J., Cantor Fitzgerald,',
 '',
 'Cantor Fitzgerald,',
 'Fuji Bank, Ltd. security,',
 'Cantor Fitzgerald,',
 'New York City, N.Y., Windows on the World,',
 'Bronx, N.Y., New York Metropolitan Transportation Council,',
 'New Hyde Park, N.Y., Marsh&McLennan Companies, Inc.,',
 'New York City, N.Y., Fred Alger Management, Inc.,',
 'Bronx, N.Y., Windows on the World,',
 'Cantor Fitzgerald,',
 'Manalapan, N.J., Cantor Fitzgerald,',
 'Windows on the World,',
 'Cantor Fitzgerald,',
 'Fiduciary Trust Company International,',
 'Belle Harbor, N.Y., New York City Fire Department,',
 'Port Washington, N.Y.,

In [100]:
split_lines =  (grouped_lines
                >> map(add_missing_period)
                >> map(fix_world_trade)
                >> map(fix_nyc)
                >> map(get_line_parts)
                )
split_lines

[('Gordon M. Aamoth, Jr.',
  ' 32',
  " Sandler O'Neill + Partners,",
  '',
  '',
  ' World Trade Center',
  ''),
 ('Edelmiro Abad',
  ' 54',
  ' Brooklyn, N.Y., Fiduciary Trust Company International,',
  '',
  '',
  ' World Trade Center',
  ''),
 ('Marie Rose Abad',
  ' 49',
  ' Keefe, Bruyette&Woods, Inc.,',
  '',
  '',
  ' World Trade Center',
  ''),
 ('Andrew Anthony Abate',
  ' 37',
  ' Melville, N.Y., Cantor Fitzgerald,',
  '',
  '',
  ' World Trade Center',
  ''),
 ('Vincent Paul Abate',
  ' 40',
  ' Brooklyn, N.Y., Cantor Fitzgerald,',
  '',
  '',
  ' World Trade Center',
  ''),
 ('Laurence Christopher Abel',
  ' 37',
  ' New York City, N.Y., Cantor Fitzgerald,',
  '',
  '',
  ' World Trade Center',
  ''),
 ('Alona Abraham',
  ' 30',
  ' Ashdod, Israel,',
  ' Passenger,',
  ' United 175,',
  ' World Trade Center',
  ''),
 ('William F. Abrahamson',
  ' 55',
  ' Westchester County, N.Y., Marsh&McLennan Companies, Inc.,',
  '',
  '',
  ' World Trade Center',
  ''),
 ('Richard Anth

<h2> <font color="red"> Exercise 4.6.3 - Make your solution verbose </font> </h2>

Now make your solution to the last problem verbose.  Also reorder the cases so that similar cases are close and add comments.  Finally, change the regular expression to capture the parts before and after the state.

In [101]:
# Your code here
state = re.compile('''
^(.*?)
,?\s
(
       N\.Y\.
    |  N\.J\.
    |  N\.H\.
    |  N\.C\.
    |  N\.M\.
    |  D\.C\.
    |  R\.I\.
    |  Ga\.
    |  Ky\.
    |  La\.
    |  Md\.
    |  Pa\.
    |  Va\.
    |  Ariz\.
    |  Calif\.
    |  Conn\.
    |  Fla\.
    |  Ill\.
    |  Mass\.
    |  Mich\.
    |  Tenn\.
    |  New\sYork
    |  New\sHampshire
    |  New\sJersey
    |  Hawaii
    |  Iowa
    |  Maine
    |  Ohio
    |  Utah
    |  Texas
    |  India
    |  Japan
    |  Germany
    |  Philippines
    |  Ontario,\sCanada
    |  Manitoba,\sCanada
    |  New\sSouth\sWales,\sAustralia
    |  England,\sUnited\sKingdom
)
,
(.*?)$
''', re.VERBOSE)

## Splitting the troubling bit

Now that we have a way to identify rows that have home addresses (through matching the state), we will split up this data.  We will do this by considering three cases.

1. Blank entry become three blanks (for town, state, employer).
2. Lines that match the states regex will get split by this pattern.
3. The remaining lines hold only the employer and become `'','',entry`

In [102]:
def split_troubling_bit(entry):
    if len(entry) == 0:
        return ('', '', '')
    elif state.search(entry):
        return state.search(entry).groups(default='')
    else:
        return ('', '', entry)

In [103]:
( troubling_bit
 >> map(split_troubling_bit)
)

[('', '', "Sandler O'Neill + Partners,"),
 ('Brooklyn', 'N.Y.', ' Fiduciary Trust Company International,'),
 ('', '', 'Keefe, Bruyette&Woods, Inc.,'),
 ('Melville', 'N.Y.', ' Cantor Fitzgerald,'),
 ('Brooklyn', 'N.Y.', ' Cantor Fitzgerald,'),
 ('New York City', 'N.Y.', ' Cantor Fitzgerald,'),
 ('', '', 'Ashdod, Israel,'),
 ('Westchester County', 'N.Y.', ' Marsh&McLennan Companies, Inc.,'),
 ('', '', 'Marsh&McLennan Companies, Inc.,'),
 ('', '', 'Aon Corporation,'),
 ('Glen Rock', 'N.J.', ' Cantor Fitzgerald,'),
 ('', '', ''),
 ('', '', 'Cantor Fitzgerald,'),
 ('', '', 'Fuji Bank, Ltd. security,'),
 ('', '', 'Cantor Fitzgerald,'),
 ('New York City', 'N.Y.', ' Windows on the World,'),
 ('Bronx', 'N.Y.', ' New York Metropolitan Transportation Council,'),
 ('New Hyde Park', 'N.Y.', ' Marsh&McLennan Companies, Inc.,'),
 ('New York City', 'N.Y.', ' Fred Alger Management, Inc.,'),
 ('Bronx', 'N.Y.', ' Windows on the World,'),
 ('', '', 'Cantor Fitzgerald,'),
 ('Manalapan', 'N.J.', ' Cantor Fi

## Progress so far

In [104]:
# Imports
from composable import pipeable
from composable.strict import map

In [105]:
# Reg Ex for a line
line_parts = re.compile(r'''^(.+),
(
      \s\?\?                          # ??
    | \s\d{1,3}                       # or age
),
(.*?)                                 # Includes hometown and 
(
        \sPassenger,                  # Optional flight status
    |   \sFlightsCrew,
)?
(
      \sUnited\s\d{2,3},              # Optional flight
    | \sAmericans\d{2,3},
)?
(
       \sWorld\sTrade\sCenter         # Location
    |  \sPentagon
    |  \sShanksville,\sPa
)
(
    ,\sdied\s\d{1,2}/\d{1,2}/\d{1,2}  # Optional date of death
)?
\.$''', re.VERBOSE)

In [106]:
# Helper functions
add_missing_period = pipeable(lambda line: line if line.endswith('.') else line + '.' )
fix_world_trade = pipeable(lambda line: line.replace('WorldTrade', 'World Trade'))
get_line_parts = pipeable(lambda line: line_parts.search(line).groups(default=''))
remove_commas = lambda s: s.replace(',', '')
# New
fix_nyc = pipeable(lambda line: line.replace(', New York City,', ', New York City, N.Y.,'))

In [107]:
[(i, l) for i, l in enumerate(prepped_lines) if not line_parts.search(l)]

[]

In [108]:
split_lines =  (grouped_lines
                >> map(add_missing_period)
                >> map(fix_world_trade)
                >> map(fix_nyc)
                >> map(get_line_parts)
                )
split_lines

[('Gordon M. Aamoth, Jr.',
  ' 32',
  " Sandler O'Neill + Partners,",
  '',
  '',
  ' World Trade Center',
  ''),
 ('Edelmiro Abad',
  ' 54',
  ' Brooklyn, N.Y., Fiduciary Trust Company International,',
  '',
  '',
  ' World Trade Center',
  ''),
 ('Marie Rose Abad',
  ' 49',
  ' Keefe, Bruyette&Woods, Inc.,',
  '',
  '',
  ' World Trade Center',
  ''),
 ('Andrew Anthony Abate',
  ' 37',
  ' Melville, N.Y., Cantor Fitzgerald,',
  '',
  '',
  ' World Trade Center',
  ''),
 ('Vincent Paul Abate',
  ' 40',
  ' Brooklyn, N.Y., Cantor Fitzgerald,',
  '',
  '',
  ' World Trade Center',
  ''),
 ('Laurence Christopher Abel',
  ' 37',
  ' New York City, N.Y., Cantor Fitzgerald,',
  '',
  '',
  ' World Trade Center',
  ''),
 ('Alona Abraham',
  ' 30',
  ' Ashdod, Israel,',
  ' Passenger,',
  ' United 175,',
  ' World Trade Center',
  ''),
 ('William F. Abrahamson',
  ' 55',
  ' Westchester County, N.Y., Marsh&McLennan Companies, Inc.,',
  '',
  '',
  ' World Trade Center',
  ''),
 ('Richard Anth

In [109]:
names =  (split_lines
        >> map(get(0))
        >> map(remove_commas)
        )
names

['Gordon M. Aamoth Jr.',
 'Edelmiro Abad',
 'Marie Rose Abad',
 'Andrew Anthony Abate',
 'Vincent Paul Abate',
 'Laurence Christopher Abel',
 'Alona Abraham',
 'William F. Abrahamson',
 'Richard Anthony Aceto',
 'Heinrich Bernhard Ackermann',
 'Paul Acquaviva',
 'Christian Adams',
 'Donald LaRoy Adams',
 'Patrick Adams',
 'Shannon Lewis Adams',
 'Stephen George Adams',
 'Ignatius Udo Adanga',
 'Christy A. Addamo',
 'Terence Edward Adderley Jr.',
 'Sophia B. Addo',
 'Lee Adler',
 'Daniel Thomas Afflitto',
 'Emmanuel Akwasi Afuakwah',
 'Alok Agarwal',
 'Mukul Kumar Agarwala',
 'Joseph Agnello',
 'David Scott Agnes',
 'Joao Alberto da Fonseca Aguiar Jr.',
 'Brian G. Ahearn',
 'Jeremiah Joseph Ahern',
 'Joanne Marie Ahladiotis',
 'Shabbir Ahmed',
 'Terrance Andre Aiken',
 'Godwin O. Ajala',
 'Trudi M. Alagero',
 'Andrew Alameno',
 'Margaret Ann Alario',
 'Gary M. Albero',
 'Jon Leslie Albert',
 'Peter Craig Alderman',
 'Jacquelyn Delaine Aldridge-Frederick',
 'David D. Alger',
 'Ernest Ali

In [110]:
troubling_bit = (grouped_lines
                >> map(add_missing_period)
                >> map(fix_world_trade)
                >> map(fix_nyc)
                >> map(get_line_parts)
                >> map(get(2))
                >> map(strip)
                )
troubling_bit

["Sandler O'Neill + Partners,",
 'Brooklyn, N.Y., Fiduciary Trust Company International,',
 'Keefe, Bruyette&Woods, Inc.,',
 'Melville, N.Y., Cantor Fitzgerald,',
 'Brooklyn, N.Y., Cantor Fitzgerald,',
 'New York City, N.Y., Cantor Fitzgerald,',
 'Ashdod, Israel,',
 'Westchester County, N.Y., Marsh&McLennan Companies, Inc.,',
 'Marsh&McLennan Companies, Inc.,',
 'Aon Corporation,',
 'Glen Rock, N.J., Cantor Fitzgerald,',
 '',
 'Cantor Fitzgerald,',
 'Fuji Bank, Ltd. security,',
 'Cantor Fitzgerald,',
 'New York City, N.Y., Windows on the World,',
 'Bronx, N.Y., New York Metropolitan Transportation Council,',
 'New Hyde Park, N.Y., Marsh&McLennan Companies, Inc.,',
 'New York City, N.Y., Fred Alger Management, Inc.,',
 'Bronx, N.Y., Windows on the World,',
 'Cantor Fitzgerald,',
 'Manalapan, N.J., Cantor Fitzgerald,',
 'Windows on the World,',
 'Cantor Fitzgerald,',
 'Fiduciary Trust Company International,',
 'Belle Harbor, N.Y., New York City Fire Department,',
 'Port Washington, N.Y.,

In [111]:
state = re.compile('''
^(.*?)
,?\s                    # Optional comman
(
       N\.Y\.           
    |  N\.J\.
    |  D\.C\.
    |  N\.H\.
    |  N\.M\.
    |  N\.C\.
    |  R.I.
    |  Md\.
    |  Pa\.
    |  Va\.
    |  Ga\.
    |  La\.
    |  Mass\.
    |  Calif\.
    |  Ariz\.
    |  Fla\.
    |  Ill\.
    |  Conn\.
    |  Hawaii
    |  Iowa
    |  Maine
    |  New\sHampshire
    |  New\sJersey
    |  New\sYork
    |  Ohio
    |  Pennsylvania
    |  Texas
    |  Utah
    |  Virginia
    |  Japan
    |  India
    |  Israel
    |  Germany
    |  Manitoba,\sCanada
    |  New\sSouth\sWales,\sAustralia
    |  England,\sUnited\sKingdom
)
,
(.*?)$
''', re.VERBOSE)

In [112]:
( troubling_bit
 >> map(split_troubling_bit)
)

[('', '', "Sandler O'Neill + Partners,"),
 ('Brooklyn', 'N.Y.', ' Fiduciary Trust Company International,'),
 ('', '', 'Keefe, Bruyette&Woods, Inc.,'),
 ('Melville', 'N.Y.', ' Cantor Fitzgerald,'),
 ('Brooklyn', 'N.Y.', ' Cantor Fitzgerald,'),
 ('New York City', 'N.Y.', ' Cantor Fitzgerald,'),
 ('Ashdod', 'Israel', ''),
 ('Westchester County', 'N.Y.', ' Marsh&McLennan Companies, Inc.,'),
 ('', '', 'Marsh&McLennan Companies, Inc.,'),
 ('', '', 'Aon Corporation,'),
 ('Glen Rock', 'N.J.', ' Cantor Fitzgerald,'),
 ('', '', ''),
 ('', '', 'Cantor Fitzgerald,'),
 ('', '', 'Fuji Bank, Ltd. security,'),
 ('', '', 'Cantor Fitzgerald,'),
 ('New York City', 'N.Y.', ' Windows on the World,'),
 ('Bronx', 'N.Y.', ' New York Metropolitan Transportation Council,'),
 ('New Hyde Park', 'N.Y.', ' Marsh&McLennan Companies, Inc.,'),
 ('New York City', 'N.Y.', ' Fred Alger Management, Inc.,'),
 ('Bronx', 'N.Y.', ' Windows on the World,'),
 ('', '', 'Cantor Fitzgerald,'),
 ('Manalapan', 'N.J.', ' Cantor Fitzg

<h2> <font color="red"> Exercise 4.5.4 </font> </h2>

Clean up each part of the troubling bits, then comma join this section into 1 string.

**Hint:** Be sure to remove any problematic commas.

In [113]:
# Your code here
# Helper functions
add_missing_period = pipeable(lambda line: line if line.endswith('.') else line + '.' )
fix_world_trade = pipeable(lambda line: line.replace('WorldTrade', 'World Trade'))
get_line_parts = pipeable(lambda line: line_parts.search(line).groups(default=''))
remove_commas = lambda s: s.replace(',', '') # in names
remove_quest_mark = lambda s: s.replace('??', '') # in ages
replace_date = lambda s: s.replace('', ', died 9/11/2001') if s == '' else s # for dates
strip = lambda s: s.strip() # strip whitespace
fix_nyc = pipeable(lambda line: line.replace(', New York City,', ', New York City, N.Y.,')) # add missing N.Y.
fix_newjersey = pipeable(lambda line: line.replace(', New Jersey,', ', New Jersey, N.J.,'))
fix_virginia = pipeable(lambda line: line.replace(', Virginia,', ', Virginia, Va.,'))
fix_penn = pipeable(lambda line: line.replace(', Pennsylvania,', ', Pennsylvania, Pa.,'))
fix_mary = pipeable(lambda line: line.replace(', Maryland,', ', Maryland, Md.,'))

In [114]:
fixed_troubling_bits = (grouped_lines
                        >> map(add_missing_period)
                        >> map(fix_world_trade)
                        >> map(fix_nyc)
                        >> map(fix_newjersey)
                        >> map(fix_virginia)
                        >> map(fix_penn)
                        >> map(fix_mary)
                        >> map(get_line_parts)
                        >> map(get(2))
                        >> map(remove_commas)
                        >> map(strip)
                )
fixed_troubling_bits


["Sandler O'Neill + Partners",
 'Brooklyn N.Y. Fiduciary Trust Company International',
 'Keefe Bruyette&Woods Inc.',
 'Melville N.Y. Cantor Fitzgerald',
 'Brooklyn N.Y. Cantor Fitzgerald',
 'New York City N.Y. Cantor Fitzgerald',
 'Ashdod Israel',
 'Westchester County N.Y. Marsh&McLennan Companies Inc.',
 'Marsh&McLennan Companies Inc.',
 'Aon Corporation',
 'Glen Rock N.J. Cantor Fitzgerald',
 '',
 'Cantor Fitzgerald',
 'Fuji Bank Ltd. security',
 'Cantor Fitzgerald',
 'New York City N.Y. Windows on the World',
 'Bronx N.Y. New York Metropolitan Transportation Council',
 'New Hyde Park N.Y. Marsh&McLennan Companies Inc.',
 'New York City N.Y. Fred Alger Management Inc.',
 'Bronx N.Y. Windows on the World',
 'Cantor Fitzgerald',
 'Manalapan N.J. Cantor Fitzgerald',
 'Windows on the World',
 'Cantor Fitzgerald',
 'Fiduciary Trust Company International',
 'Belle Harbor N.Y. New York City Fire Department',
 'Port Washington N.Y. Cantor Fitzgerald',
 'Hoboken N.J. Keefe Bruyette&Woods Inc.

In [115]:
comma_join = pipeable(lambda s: ','.join(s))
fixed_troubling_bits

["Sandler O'Neill + Partners",
 'Brooklyn N.Y. Fiduciary Trust Company International',
 'Keefe Bruyette&Woods Inc.',
 'Melville N.Y. Cantor Fitzgerald',
 'Brooklyn N.Y. Cantor Fitzgerald',
 'New York City N.Y. Cantor Fitzgerald',
 'Ashdod Israel',
 'Westchester County N.Y. Marsh&McLennan Companies Inc.',
 'Marsh&McLennan Companies Inc.',
 'Aon Corporation',
 'Glen Rock N.J. Cantor Fitzgerald',
 '',
 'Cantor Fitzgerald',
 'Fuji Bank Ltd. security',
 'Cantor Fitzgerald',
 'New York City N.Y. Windows on the World',
 'Bronx N.Y. New York Metropolitan Transportation Council',
 'New Hyde Park N.Y. Marsh&McLennan Companies Inc.',
 'New York City N.Y. Fred Alger Management Inc.',
 'Bronx N.Y. Windows on the World',
 'Cantor Fitzgerald',
 'Manalapan N.J. Cantor Fitzgerald',
 'Windows on the World',
 'Cantor Fitzgerald',
 'Fiduciary Trust Company International',
 'Belle Harbor N.Y. New York City Fire Department',
 'Port Washington N.Y. Cantor Fitzgerald',
 'Hoboken N.J. Keefe Bruyette&Woods Inc.

## Combining the parts back together.

We can combine the parts back together using the `zip` function.

In [121]:
from composable.strict import zipOnto
from composable.sequence import to_list
(zip(names, ages, fixed_troubling_bits)
 >> to_list
 >> map(comma_join)
)

["Gordon M. Aamoth Jr.,32,Sandler O'Neill + Partners",
 'Edelmiro Abad,54,Brooklyn N.Y. Fiduciary Trust Company International',
 'Marie Rose Abad,49,Keefe Bruyette&Woods Inc.',
 'Andrew Anthony Abate,37,Melville N.Y. Cantor Fitzgerald',
 'Vincent Paul Abate,40,Brooklyn N.Y. Cantor Fitzgerald',
 'Laurence Christopher Abel,37,New York City N.Y. Cantor Fitzgerald',
 'Alona Abraham,30,Ashdod Israel',
 'William F. Abrahamson,55,Westchester County N.Y. Marsh&McLennan Companies Inc.',
 'Richard Anthony Aceto,42,Marsh&McLennan Companies Inc.',
 'Heinrich Bernhard Ackermann,38,Aon Corporation',
 'Paul Acquaviva,29,Glen Rock N.J. Cantor Fitzgerald',
 'Christian Adams,37,',
 'Donald LaRoy Adams,28,Cantor Fitzgerald',
 'Patrick Adams,61,Fuji Bank Ltd. security',
 'Shannon Lewis Adams,25,Cantor Fitzgerald',
 'Stephen George Adams,51,New York City N.Y. Windows on the World',
 'Ignatius Udo Adanga,62,Bronx N.Y. New York Metropolitan Transportation Council',
 'Christy A. Addamo,28,New Hyde Park N.Y. M

<h2> <font color="red"> Exercise 4.5.4 </font> </h2>

Use `zip` to combine all part of the data and write the result out to a file called `911_Deaths_Fixed.csv` 

In [122]:
# Your code here
combined = (zip(names, ages, fixed_troubling_bits, dates, locations, flights, passenger_status) 
            >> map(comma_join)
           )
combined

["Gordon M. Aamoth Jr.,32,Sandler O'Neill + Partners,died 9/11/2001,World Trade Center,,",
 'Edelmiro Abad,54,Brooklyn N.Y. Fiduciary Trust Company International,died 9/11/2001,World Trade Center,,',
 'Marie Rose Abad,49,Keefe Bruyette&Woods Inc.,died 9/11/2001,World Trade Center,,',
 'Andrew Anthony Abate,37,Melville N.Y. Cantor Fitzgerald,died 9/11/2001,World Trade Center,,',
 'Vincent Paul Abate,40,Brooklyn N.Y. Cantor Fitzgerald,died 9/11/2001,World Trade Center,,',
 'Laurence Christopher Abel,37,New York City N.Y. Cantor Fitzgerald,died 9/11/2001,World Trade Center,,',
 'Alona Abraham,30,Ashdod Israel,died 9/11/2001,World Trade Center,United 175,Passenger',
 'William F. Abrahamson,55,Westchester County N.Y. Marsh&McLennan Companies Inc.,died 9/11/2001,World Trade Center,,',
 'Richard Anthony Aceto,42,Marsh&McLennan Companies Inc.,died 9/11/2001,World Trade Center,,',
 'Heinrich Bernhard Ackermann,38,Aon Corporation,died 9/11/2001,World Trade Center,,',
 'Paul Acquaviva,29,Glen Roc

In [118]:
output = '\n'.join(combined)
output[:500]

"Gordon M. Aamoth Jr.,32,Sandler O'Neill + Partners,died 9/11/2001,World Trade Center,,\nEdelmiro Abad,54,Brooklyn N.Y. Fiduciary Trust Company International,died 9/11/2001,World Trade Center,,\nMarie Rose Abad,49,Keefe Bruyette&Woods Inc.,died 9/11/2001,World Trade Center,,\nAndrew Anthony Abate,37,Melville N.Y. Cantor Fitzgerald,died 9/11/2001,World Trade Center,,\nVincent Paul Abate,40,Brooklyn N.Y. Cantor Fitzgerald,died 9/11/2001,World Trade Center,,\nLaurence Christopher Abel,37,New York City N."

In [119]:
with open('911_Deaths_Fixed.csv', 'w') as out_file:
    out_file.write(output)
!cat 911_Deaths_Fixed.csv | head -n 10

Gordon M. Aamoth Jr.,32,Sandler O'Neill + Partners,died 9/11/2001,World Trade Center,,
Edelmiro Abad,54,Brooklyn N.Y. Fiduciary Trust Company International,died 9/11/2001,World Trade Center,,
Marie Rose Abad,49,Keefe Bruyette&Woods Inc.,died 9/11/2001,World Trade Center,,
Andrew Anthony Abate,37,Melville N.Y. Cantor Fitzgerald,died 9/11/2001,World Trade Center,,
Vincent Paul Abate,40,Brooklyn N.Y. Cantor Fitzgerald,died 9/11/2001,World Trade Center,,
Laurence Christopher Abel,37,New York City N.Y. Cantor Fitzgerald,died 9/11/2001,World Trade Center,,
Alona Abraham,30,Ashdod Israel,died 9/11/2001,World Trade Center,United 175,Passenger
William F. Abrahamson,55,Westchester County N.Y. Marsh&McLennan Companies Inc.,died 9/11/2001,World Trade Center,,
Richard Anthony Aceto,42,Marsh&McLennan Companies Inc.,died 9/11/2001,World Trade Center,,
Heinrich Bernhard Ackermann,38,Aon Corporation,died 9/11/2001,World Trade Center,,
cat: stdout: Broken pipe
