# UNIT 3: Reuse, Modularity, and External Resources

## Exercise

In this exercise, you'll use what you've learned so far to write a programme to clean and re-structure a complex dataset.

### INPUT: `crew_manifest.csv`

Use the [Dataset](https://canvas.harvard.edu/courses/113131/files/17333429?wrap=1) link on the Canvas site to download the csv file. Here's a sample of the data to show how it is structured:

Vessel | Rig | Departure | Name | Age | Height | Residence | Rank | Voyage_number | Vessel_number
:------|:----|:----------|:-----|:---:|:------:|:----------|:-----|:-------------:|:-------------:
Mary and Susan | Bark | 9/9/1867 | Frates, John A. | 18 | 5'0 3/4 | Azores | Seaman, Boatsteerer | 9240 | 481
Andrew Hicks | Bark | 1867.9.9 | Hamblin, Otis F. |  |  |  | Master | 941 | 703
Mary and Susan | Bark | 9-sep-1867 |  Herendeen, A.  O. |  |  |  | Master | 9240 | 481
Andrew Hicks | Bark | September 9 1867 | Jenkins,  Thomas H. | 22 | 5'7 | Dartmouth | Master | 941 | 703
Mary and Susan | Bark | 09/09/1867 | Gonsalves, Frank |  |  |  | Seaman | 9240 | 481
Sarah | Bark | 9/9/1867 | Avola, Antone | 23 | 5'9 | Azores |  | 9240 | 637
Mary and Susan | Bark | 9/9/1867 | Baptista, Manuel J. | 30 | 5'4 | Azores |  | 9240 | 481
Mary and Susan | Bark | 9/9/1867 | Berry, William, Jr. | 22 | 5'4 | Boston |  | 9240 | 481
Mary and Susan | Bark | 9/9/1867 | Bettencurt, Antonio | 18 | 5'2 1/4 | Azores |  | 9240 | 481

This is a *very real*—and *very messy*—dataset. A few things to keep in mind:
- You should expect to have **leading, trailing, and double spaces** in all columns. 
- Formatting is **consistent for the following columns**: `Vessel`, `Rig`, `Age`, `Voyage_number`, and `Vessel_number`.
- **Not so much for the rest**. In the next cell, I've included *representative variations* of the different formats used in each of the other columns. I used Python lists so that you can **use them directly for testing your algorithms** as you design them (you're welcome!):

In [None]:
departure_variations = [
    '09/06/1844', # in context, this should be 6 September 1844
    '9/30/1885',
    '1867.9.9',
    '30-Sep-1873', 
    '9-sep-1867', 
    '9/3/1883',
    '9/APR/1846',
    '9/August/1865', 
    'April 9 1855'
]

name_variations = [
    "A. L. E. Benton",
    "A.j. Harvey",
    "Aaron F. Hussey",
    "Aaron Dean",
    "Adams, Charles",
    "Adams, Charles C.",
    "Alden Jr Rounseville",
    "Almada, Peter Antonie",
    "Amos F.",
    "Bennett,william",
    "Berry, William, Jr.",
    "Borden, Joshua G Jr",
    "O'brien, John"
]

height_variations = [
    "4' 6\"", # note escaped double-quotes used as inches marker
    "4'6",
    "5'",
    "5' 10",
    "5' 4 1/2\"", # note escaped double-quotes used as inches marker
    "5' 7 3/4\"", # note escaped double-quotes used as inches marker
    "5'0",
    "5'0 1/2",
    "5'10+",
    "5'11.5"
]

residence_variations = [
    "Albany, Ny",
    "Azores",
    "Brava, Cape Verde",
    "Brookirlee, Canada",
    "Corvo, Azores",
    "Havre De Grace",
    "Lyons Ma",
    "Marshall, Mich",
    "Martha's Vineyard",
    "Saint Croix, West Indies",
    "Saint Paul (Africa)"
]
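
One way to put lists like these to work is a tiny harness that runs a candidate function over every variation and records either the result or the error it raised. The sketch below is only an illustration: `try_all` and the toy `feet_only` parser are hypothetical names that are not part of the exercise, and the last sample (`'six feet'`) is made up purely to show what a failure looks like.

```python
def try_all(func, samples):
    """Run func over every sample and collect (sample, result-or-error) pairs."""
    results = []
    for sample in samples:
        try:
            results.append((sample, func(sample)))
        except Exception as error:
            # keep going, so that one unexpected format doesn't hide the others
            results.append((sample, f'ERROR: {type(error).__name__}'))
    return results


def feet_only(height):
    """Hypothetical stand-in parser: returns only the feet component."""
    return int(height.split("'")[0])


# three real variations plus one invented value that cannot be parsed
sample_heights = ["4'6", "5'", "5' 10", "six feet"]
report = try_all(feet_only, sample_heights)

for sample, result in report:
    print(f'{sample!r:12} -> {result}')
```

Swapping `feet_only` for your own function and `sample_heights` for one of the lists above gives a quick overview of which formats are still unhandled.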

### OUTPUT:  `crew_manifest.json`

Besides cleaning the dataset, you'll also have to re-structure it. The **original csv file tracks three entities**: `vessels`, `voyages`, and `crew members`. In principle, **only the crew member data is unique for each row**; the vessel and voyage data repeat. So, we'll restructure the dataset as follows (*Entity-Relationship diagramme* on the left, *JSON sample* on the right):

<div style="margin-top: 20px"><div style="width: 49%; float: left;">
<img src="https://mermaid.ink/img/pako:eNqFk8Fu4jAQhl9l5EsvpA_AjYXsbiUCUlhRVcrFaw9h1Niu7IlQBH332iHQ0K5YX6JM_pl_5ovnKJTTKKYC_YJk7aWpLMSzzTebfAmnU5adjrBdv8x-5TCFKsotS7IBJDQUGNyuEjcpx_NbOmQZVq35i_4zFtjDShq8jZRUfwb6ulvXyRpDdBzMPSrndbi4vQ-m54__MdWSERb4Jj23HiEWlQFe4smKIlssLjWv5nOPhyial_lzVuTFj7y8bz9gGuvvsxorj9_hnBv8ST4mFY9Pj7CUgcddpiFnNd4GfiPVe465ZEGhZTLIHsM4T5NiKDGQRqsGm-V6PvvztF59o1C6pv8Byhkjs5DwRY4aiNEE2Hln4KGU9vXhC5TxcD2Z09Wjx0JWNa3GhCX1Q85K38GBeH8pdJV_YbN0SjbEXWoqPibA7mDHfSfR3LWWu9vYhtMNqMSubRqw6fr9K8l39yURqY1ch2HFRBj0RpKO69M3WgneY0pMU2rpX1OJpJMtu01nlZiyb3Ei2rd0I4eFE9OdbEKMoiZ2vjjvY7-W7x8UuxWi" width=300>

</div><div style="width: 49%; float: right; border-left-color: rgb(208, 208, 208); border-left-style: solid">

```json
{
    "481": {
        "name": "Mary and Susan",
        "rig": "Bark",
        "voyages": {
            "9240": {
                "departure": "1867-09-09",
                "crew": [
                    {
                        "name": "John A. Frates",
                        "age": 18,
                        "height": 155,
                        "residence": {
                            "locality": "Azores"
                        },
                        "roles": [
                            "seaman",
                            "boatsteerer"
                        ]
                    },
                    {
                        "name": "A. O. Herendeen",
                        "age": null,
                        "height": null,
                        "residence": {},
                        "roles": [
                            "master"
                        ]
                    },
                    { other crew members}
                ]
            },
            { other voyages }
        }
    }
    { other vessels }
}
```
</div></div>
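
To make the target shape concrete, here is a minimal sketch of one possible way to fold flat rows into the vessel → voyage → crew nesting, using `dict.setdefault`. It assumes only three of the columns and keeps the raw strings. The names `rows` and `nested` are illustrative, and this is not necessarily the approach a full solution should take.

```python
# three rows taken from the sample table above, reduced to three columns
rows = [
    {'Vessel_number': '481', 'Voyage_number': '9240', 'Name': 'Frates, John A.'},
    {'Vessel_number': '703', 'Voyage_number': '941', 'Name': 'Hamblin, Otis F.'},
    {'Vessel_number': '481', 'Voyage_number': '9240', 'Name': 'Gonsalves, Frank'},
]

nested = {}
for row in rows:
    # setdefault returns the existing entry, or stores and returns the default
    vessel = nested.setdefault(row['Vessel_number'], {'voyages': {}})
    voyage = vessel['voyages'].setdefault(row['Voyage_number'], {'crew': []})
    voyage['crew'].append({'name': row['Name']})

print(nested)
```

Because vessel and voyage data repeat across rows, each vessel and voyage is created once and every later row only appends to the crew list.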

## Gabe's Solution

### 1. Import External Modules

We'll need to use a number of external modules, so let's start by importing them:

In [1]:
import os # for creating universal paths
import json # for writing the JSON file at the end 
from csv import DictReader # we'll access the CSV data as a dictionary, so we only import DictReader
from datetime import date # for working with date objects
import calendar # to save ourselves from having to type month names


### 2. Define General Constants

We'll also need a few dictionaries and lists to help us process state and month names, so we define them here as global variables:

In [2]:
STATES = { # nicked from https://gist.github.com/JeffPaine/3083347
    'AK': 'Alaska',
    'AL': 'Alabama',
    'AR': 'Arkansas',
    'AZ': 'Arizona',
    'CA': 'California',
    'CO': 'Colorado',
    'CT': 'Connecticut',
    'DC': 'District of Columbia',
    'DE': 'Delaware',
    'FL': 'Florida',
    'GA': 'Georgia',
    'HI': 'Hawaii',
    'IA': 'Iowa',
    'ID': 'Idaho',
    'IL': 'Illinois',
    'IN': 'Indiana',
    'KS': 'Kansas',
    'KY': 'Kentucky',
    'LA': 'Louisiana',
    'MA': 'Massachusetts',
    'MD': 'Maryland',
    'ME': 'Maine',
    'MI': 'Michigan',
    'MN': 'Minnesota',
    'MO': 'Missouri',
    'MS': 'Mississippi',
    'MT': 'Montana',
    'NC': 'North Carolina',
    'ND': 'North Dakota',
    'NE': 'Nebraska',
    'NH': 'New Hampshire',
    'NJ': 'New Jersey',
    'NM': 'New Mexico',
    'NV': 'Nevada',
    'NY': 'New York',
    'OH': 'Ohio',
    'OK': 'Oklahoma',
    'OR': 'Oregon',
    'PA': 'Pennsylvania',
    'RI': 'Rhode Island',
    'SC': 'South Carolina',
    'SD': 'South Dakota',
    'TN': 'Tennessee',
    'TX': 'Texas',
    'UT': 'Utah',
    'VA': 'Virginia',
    'VT': 'Vermont',
    'WA': 'Washington',
    'WI': 'Wisconsin',
    'WV': 'West Virginia',
    'WY': 'Wyoming'
}

STATE_NAMES = list(STATES.values()) # a list of state names
STATE_ABBR = list(STATES.keys()) # a list of state abbreviations

# we'll also need continents
# Australia is a country and nobody comes from Antarctica
# so we'll skip those two
CONTINENTS = ['africa', 'asia', 'europe', 'north america', 'south america'] 

# Lastly, months from the 'calendar' module
# note the empty string at the beginning, this is so we can use indices as month number 
# i.e. january == 1, otherwise it would be 0
MONTHS = list(calendar.month_name)
print(MONTHS)

['', 'January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
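
As an aside, the same `calendar` module also exposes abbreviated month names, so a lookup table from lowercase names and abbreviations to month numbers can be built in a few lines. This is a sketch rather than part of the solution: `MONTH_LOOKUP` is a hypothetical name, and it assumes the default English locale (both `calendar.month_name` and `calendar.month_abbr` follow the current locale).

```python
import calendar

# map both full names and three-letter abbreviations to month numbers
MONTH_LOOKUP = {}
for number in range(1, 13):  # skip index 0, which is the empty string
    MONTH_LOOKUP[calendar.month_name[number].lower()] = number
    MONTH_LOOKUP[calendar.month_abbr[number].lower()] = number

for text in ['sep', 'APR', 'August', 'foo']:
    # .get() returns None for anything that is not a known month
    print(text, MONTH_LOOKUP.get(text.lower()))
```

A dictionary lookup only matches exact names and abbreviations, whereas a `startswith` test over `MONTHS` also accepts other prefixes; which is preferable depends on how much you trust the data.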


### 3. Load the dataset

We'll load the CSV dataset as a list of dictionaries, one per row (working with named keys makes keeping track of the columns easier than if we were working with list indices). Let's also print the first three entries to get an idea of what we're dealing with:

In [3]:
csv_file = os.path.join(os.curdir, 'dataset', 'crew_manifest.csv') # define the file

with open(csv_file, 'r', newline='') as file: # newline='' is what the csv module docs recommend
    csv_data = list(DictReader(file))

for row in csv_data[:3]:
    print(f'{row}\n')

{'Vessel': 'Mary and Susan', 'Rig': 'Bark', 'Departure': '9/9/1867', 'Name': 'Frates, John A.', 'Age': '18', 'Height': "5'0 3/4", 'Residence': 'Azores', 'Rank': 'Seaman, Boatsteerer', 'Voyage_number': '9240', 'Vessel_number': '481'}

{'Vessel': 'Andrew Hicks', 'Rig': 'Bark', 'Departure': '1867.9.9', 'Name': 'Hamblin, Otis F.', 'Age': '', 'Height': '', 'Residence': '', 'Rank': 'Master', 'Voyage_number': '941', 'Vessel_number': '703'}

{'Vessel': 'Mary and Susan', 'Rig': 'Bark', 'Departure': '9-sep-1867', 'Name': ' Herendeen, A.  O.', 'Age': '', 'Height': '', 'Residence': '', 'Rank': 'Master', 'Voyage_number': '9240', 'Vessel_number': '481'}
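
Two details of `DictReader` matter for the cleaning that follows: quoted fields keep their commas and spaces untouched, and a blank cell comes back as an empty string rather than `None`. The sketch below shows both on a small in-memory sample. The three-column layout and the quoting are assumptions made for illustration; they are not a copy of the real file.

```python
import io
from csv import DictReader

# in-memory stand-in for a CSV file (hypothetical, three columns only)
sample = io.StringIO(
    'Vessel,Name,Age\n'
    'Mary and Susan,"Frates, John A.",18\n'
    'Mary and Susan,"  Herendeen, A.  O. ",\n'
)

rows = list(DictReader(sample))

for row in rows:
    # repr() makes stray spaces and empty strings visible
    print(repr(row['Name']), repr(row['Age']))
```

This is why the cleaning code has to test for empty strings and strip whitespace itself: the reader passes values through exactly as they appear in the file.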



### 4. Declare Processing Functions

This is where most of the logic is going to be. We're going to loop over the rows in the dataset and for each column do whatever cleaning and manipulation operations are needed using reusable functions:

In [27]:
def clean_integer(integer_string):
    """Takes a string and, if it contains only digits (ignoring surrounding spaces), 
    returns it as an integer; otherwise it returns 'None'"""
    
    integer_string = integer_string.strip() # leading/trailing spaces would make isdigit() fail
    if integer_string.isdigit():
        return int(integer_string)
    else:
        return None

    
def clean_strings(target, *args, fp=False):
    """Takes any number of strings and a boolean keyword argument, fp (fingerprint). 
    Returns a cleaned string or, if more than one was passed, a tuple containing all 
    cleaned strings. If fp is True, then it also converts the string(s) to all lowercase."""
    
    # the first string passed goes to the 'target' parameter
    # if any more are supplied, they're added to args
    if args: # so if there are args
        target_strings = [target, *args] # we create a list of target and args
    else: # if not
        target_strings = [target] # the list contains just target
        
    # the reason we create a list, even if there is just one string
    # is so that we can iterate with a 'for' loop
    # this way it doesn't matter how many strings are passed
    
    output = [] # to store results of processing
    for target_string in target_strings: 
        # split() with no arguments drops leading/trailing whitespace and splits on runs of it,
        # so re-joining with single spaces collapses double (or triple) spaces and trims both ends
        target_string = ' '.join(target_string.split())

        if len(target_string) == 0: # we check that there is some actual content, if not
            output.append(None) # we just add 'None' to the output list
            continue # and bounce

        if fp: # if the function was asked to normalize the string(s)
            target_string = target_string.lower() # we convert them to lowercase
        
        output.append(target_string) # lastly, we add the clean string to the output list
    
    if len(output) == 1: # now we check whether we have just one string or many, if just one
        return output[0] # we return it naked
    else: # if many
        return tuple(output) # we return the results as an n-tuple


def process_date(date_string):
    """Takes a string representing a date, processes the different components to 
    create a valid date object (to confirm it's a date) and then returns a string 
    in the desired format. If the string passed is empty, it returns 'None'"""
    if date_string is None: # if the cleaned string is 'None'
        return None # the date is 'None' as well
    
    else:
        # we declare a nested function to get the month from whatever we get
        def get_month(month_string):
            """Takes a string and returns an integer representing the equivalent month"""
            
            if month_string.isdigit(): # we first check if the string is a number, if it is
                return clean_integer(month_string) # we return it as an integer
            
            else: # otherwise we assume that it is a month name
                # we capitalize it (because that's how our global list is setup)
                month_string = month_string.capitalize() 

                for month in MONTHS: # then we go over our list 
                    # and test whether the string we have is either a month name
                    # or the beginning of one (e.g. Jan for January)
                    if month == month_string or month.startswith(month_string): # if it is
                        return MONTHS.index(month) # we return the index of that value, which is the month number

            return None # if by this point we haven't returned anything, then we failed, so return 'None'
                
        separators = ['/', '-', '.', ' '] # these are the separators used in the dataset
        date_items = [] # list to store date components
        
        for separator in separators: # we go over the separators
            if separator in date_string: # until we find one that's present
                date_items = date_string.split(separator) # so we split the string on it
                break # job done, so we bounce

        if len(date_items) == 3: # if we got 3 items from the split, then we proceed
            item1, item2, item3 = date_items # we unpack the list
            item1, item2, item3 = clean_strings(item1, item2, item3, fp=True) # and clean the strings
            
            # now we reason our way through the different possible formats
            # if the first item is a number of length 4, then we know it's the year
            # which means the format is YYYY DD MM, so:
            if item1.isdigit() and len(item1) == 4:
                year = clean_integer(item1) # year is item1
                day = clean_integer(item2) # day is item2
                month = get_month(item3) # month is item3

            # if item1 is not a number, then it has to be the month
            # so the format is MM DD YYYY (year is only first or last), so:
            elif not item1.isdigit():
                month = get_month(item1) # month is item1
                day = clean_integer(item2) # day is item2
                year = clean_integer(item3) # year is item3
            
            # next we see if item2 is a month name
            elif not item2.isdigit():
                month = get_month(item2) # month is item2
                day = clean_integer(item1) # day is item1
                year = clean_integer(item3) # year is item3

            # if none of the above, then the format is either
            # DD MM YYYY or MM DD YYYY
            else:
                year = clean_integer(item3) # year is item3 regardless
                item1 = clean_integer(item1)
                item2 = clean_integer(item2)
                
                # if either item1 or item2 is greater than 12, then that's the day
                if item1 > 12 or item2 > 12:
                    if item1 > 12:
                        day = item1
                        month = item2
                    else:
                        day = item2
                        month = item1
                else: # if neither is, we assume MM DD YYYY (American) format, given the source
                    month = item1
                    day = item2

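            # we have our components, so we can create a date object (which confirms it's a real date)
            # if a component is missing or out of range, we return 'None' instead of crashing
            if None in (year, month, day):
                return None
            try:
                # strftime() formats the date as YYYY-MM-DD, and we return the resulting string
                return date(year, month, day).strftime('%Y-%m-%d')
            except ValueError:
                return None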
                
        else: # if we did not get 3 items from the split, it's not a valid date
            return None # so we return 'None'


def process_name(name_string):
    """Takes a string containing a personal name in any of the formats present
    in the dataset, and returns it formatted as 'Given M. I. Surname, Jr.'. If 
    the string passed is empty, it returns 'None'."""
    
    if name_string is None: # if the cleaned string is 'None'
        return None # the name is 'None'
    
    else:
        junior = False # boolean to track presence of "Jr." suffix
        name_components = [] # list to hold name components
        
        if 'jr' in name_string: # if the string 'jr' is present
            junior = True # we set junior to True 
            
            # then we remove it from the string (it can appear in weird order)
            # we do so by replacing variants (with or without period, with or without space before, etc)
            # we try all the variants in decreasing order of complexity
            # that way we'll always get the most inclusive version
            for variant in [', jr.', ',jr.', ' jr.', 'jr.', ', jr', ',jr', ' jr', 'jr']:
                name_string = name_string.replace(variant, '')
        
        # if there is a comma in the string, then the format must be
        # 'Surname, Given' (we removed Jr and assoc. commas already)
        if ',' in name_string:  # if there is a comma
            components = name_string.split(',') # we split on it
            if len(components) == 2: # we double-check that we got just two parts
                c1, c2 = components # and unpack the list we got from splitting
                c1, c2 = clean_strings(c1, c2) # then we clean the tokens
                name_string = f'{c2} {c1}' # and compose them back into the 'Given Surname' format
            
            else: # if we don't get two items
                return 'FCK' # something's gone pear-shaped, help future you work out what
        
        # at this point the format of 'name_string' should be a combination of 
        # initials and given name followed by surname(s), whether it came like that 
        # or we built it ourselves above, so we split on space:
        components = name_string.split()
        expanded_components = [] # list to store name components while processing
        
        # let's get any initials next:
        for component in components: # we iterate over the components we currently have
            if '.' in component: # if it contains a period, it must be an initial
                expanded_components += component.split('.') # so we split on it and add the results to the list
            else: # if there is no period, it must be a name
                expanded_components.append(component) # so we append it to the list
        
        # now we format the components
        for component in expanded_components: # we iterate over the new list
            component = clean_strings(component) # and clean each string
            
            if component: # if it's still there after cleaning (i.e. it's not None)
                if len(component) == 1: # if it is a single letter, i.e. an initial
                    name_components.append(f'{component.upper()}.') # we capitalize it, add a period, and append
                else: # if not, it's a name
                    name_components.append(component.capitalize()) # we capitalize the first letter and append
        
        # 'name_components' is now a list with all relevant parts
        # so we join those using spaces to get the full name
        full_name = ' '.join(name_components)
        
        if junior: # lastly, we check whether it had a junior suffix, if so: 
            full_name = f'{full_name}, Jr.' # we add it back, at the end, properly formatted
    
        return full_name # and we return the result


def process_height(height_string):
    """Takes a string containing a height in one of the formats present in the 
    dataset and returns an integer representing the same height in centimetres. 
    If the string passed is empty, it returns 'None'."""
    
    if height_string is None: # if the cleaned string is 'None'
        return None # the height is 'None'
    
    else:
        feet = inches = fraction = 0 # create variables to store components and set all to 0
        
        if "+" in height_string: # if there is a plus sign
            height_string = height_string.replace('+', '') # we nuke the bastard

        if "'" in height_string: # if the feet indicator is present
            feet, rest = height_string.split("'") # we split on it to get feet + the rest
        
        else: # if not:
            feet = height_string[:1] # we slice the first character as feet (can't be more than one digit!)
            rest = height_string[1:] # and the rest 
        
        # either way we should now have the feet component
        feet = clean_integer(feet) # so we convert it to an integer
        # we remove the inches marker from whatever is left
        rest = rest.replace('"', '') # it's all inches regardless (even with a fraction)
        rest = clean_strings(rest) # and we clean the string, just to be sure!
        
        if rest: # if we have something other than feet
            if '/' in rest: # if we have a fraction
                # we check whether it's just a fraction or also a number
                if len(rest.split()) > 1: # if we split on space and get more than one item, it's a number + fraction
                    inches, fraction = rest.split() # so we get both
                    inches = clean_integer(inches) # turn the number into an integer and assign it to 'inches'

                else: # if we only get one item when we split, then it's just a fraction
                    fraction = rest # so we assign the full string to the 'fraction' variable

            elif '.' in rest: # check if we've got a decimal, if so:
                inches = float(rest) # convert it to a float and assign it to 'inches'

            else: # if we don't have a fraction or a decimal, then we have a whole number
                inches = clean_integer(rest) # we convert it to an integer and assign it to 'inches'
            
        if fraction: # now we check whether we've got a fraction, if so:
            numerator, denominator = fraction.split('/') # split to get numerator and denominator
            fraction = int(numerator) / int(denominator) # and get the decimal
            
        # to calculate the height in centimetres we convert the feet to inches
        # then add it to the 'inches' value and to the fraction (as decimal)
        # then multiply by 2.54
        return round(((feet * 12) + inches + fraction) * 2.54) # we round the value and return it


def process_residence(residence_string):
    """Takes a string representing a geographic location and returns a dictionary 
    including key-value pairs for whatever keys are pertinent from: locality, state, 
    country, and continent. If the string passed is empty, it returns 'None'."""
    
    if residence_string is None: # if the cleaned string is 'None'
        return None # the residence is 'None'
    
    else:
        locality = state = country = continent = None # create variables to hold components and set to None
        
        if ',' in residence_string: # if we have a comma in the string, there are at least 2 parts to the loc
            item1, item2 = residence_string.split(',') # we split on the comma to get them
        
        elif '(' in residence_string: # same with brackets: two parts (one in brackets)
            item1, item2 = residence_string.split('(') # split on opening one
            item2 = item2.replace(')', '') # and remove closing one from item2
        
        # if we split on space and the last item is just two letters
        # then it is a state abbreviation
        elif len(residence_string.split()[-1]) == 2: 
            item1 = residence_string[:-2] # so we slice the rest for item1
            item2 = residence_string[-2:] # and the last two letters for item2
            
        else: # if none of the above, then we have just one component
            item1 = residence_string # we assign it to item1
            item2 = '' # and make item2 an empty string
        
        item1, item2 = clean_strings(item1, item2) # next we clean both items
        locality = item1 # and we assign item1 to locality (yes, this is too simplistic, but I'm in a rush!)
        
        # now we need to deal with item2
        if item2 is not None: # if there is an item2 (i.e. clean string is not None)
            if len(item2) == 2: # if it's just two letters, then it's a state abbreviation
                for abbr in STATE_ABBR: # we take our list of abbreviations and iterate
                    if item2.upper() == abbr: # if item2 matches
                        state = abbr # we have our state, so we assign it to the variable
                        break # and bounce
                
            else: # if item2 is not two letters, it *could* still be a state, full name or non-standard abbr.
                for name in STATE_NAMES: # so we iterate over our list of state names
                    if name.lower().startswith(item2.lower()): # if item2 matches or it matches the beginning
                        for abbr, state_name in STATES.items(): # we iterate over the full states dictionary
                            if state_name == name: # until the state name matches
                                state = abbr # then we assign the abbreviation to the corresponding variable
                                break # and bounce
                        break # twice, as we're in two loops deep
            
            if not state: # if after all of the above we didn't get a state
                if item2.lower() in CONTINENTS: # we check if we have a continent
                    continent = item2 # if so, we assign it to the corresponding variable
                else: # if not, then it's a country
                    country = item2 # so we assign it to the country variable
        
        # now we create our location object - we fill it with just the locality first since we'll 
        # always have one (unless the string is empty, but then we'd never have made it this far)
        location = {'locality': locality}
        
        # then we check the other variables to see if we got them
        if state: # got state?
            location['state'] = state # add it to the dict.
            
        if country: # got country?
            location['country'] = country # add it to the dict.
            
        if continent: # got continent?
            location['continent'] = continent # idem.
        
        return location # and we return the location dict.


def process_rank(rank_string):
    """Takes a string representing one or multiple ranks (comma separated) and returns 
    a list containing all entries. If the string passed is empty, it returns an empty list."""
    
    if rank_string is None: # if the cleaned string is 'None'
        return [] # the rank list is empty
    
    elif ',' in rank_string: # if there is a comma in the string
        # we split on it and strip each piece, so 'a, b' and 'a,b' both give ['a', 'b']
        return [rank.strip() for rank in rank_string.split(',')]
    
    else: # if there is no comma, then we just add the full string to a list and return it
        return [rank_string]


def get_crew_member(name, age, height, residence, rank):
    """Takes strings for the name, age, height, residence, and rank, and returns 
    a 'crew_member object' in the form of a dictionary containing the results of
    processing those items. If the name string is empty, it returns 'None'."""
    
    name = process_name(clean_strings(name, fp=True)) # we start by cleaning and normalizing the name
    
    if name is None: # if it turns out to be empty (i.e. cleaning returns None)
        return None # then there is no name in this record, we return None
    
    else: # otherwise we return the 'crew_member' dictionary
        return {
            'name': name, # we already have the formatted name
            'age': clean_integer(age), # the age is just a number, so we convert it to integer
            'height': process_height(clean_strings(height)), # we process the height (we clean it first)
            'residence': process_residence(clean_strings(residence)), # same for residence
            'roles': process_rank(clean_strings(rank, fp=True)) # and rank
        }
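
For comparison, here is an alternative sketch of date parsing that leans on `datetime.strptime` and a list of candidate formats. It is not the approach used above, and it rests on assumptions that are worth stating loudly: the format list is my own guess at covering the departure variations, slash-separated numeric dates are tried as month/day first (the American reading), the year-first format is assumed to be `YYYY.MM.DD`, and month names are matched under the default English locale. `FORMATS` and `parse_date` are hypothetical names.

```python
from datetime import datetime

# candidate formats, tried in order; the order encodes the assumptions above
FORMATS = [
    '%m/%d/%Y',  # 09/06/1844, 9/30/1885 (American order first)
    '%d/%m/%Y',  # fallback when the first number cannot be a month
    '%d/%b/%Y',  # 9/APR/1846
    '%d/%B/%Y',  # 9/August/1865
    '%d-%b-%Y',  # 30-Sep-1873, 9-sep-1867
    '%Y.%m.%d',  # 1867.9.9 (assumed year.month.day)
    '%B %d %Y',  # April 9 1855
]


def parse_date(text):
    """Return the date as 'YYYY-MM-DD', or None if no format matches."""
    text = text.strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # this format did not fit, so try the next one
    return None


for sample in ['09/06/1844', '30-Sep-1873', '9/APR/1846', 'April 9 1855', 'no date']:
    print(sample, '->', parse_date(sample))
```

The trade-off is that the format list is explicit and easy to extend, but every new variation needs its own entry, whereas the component-by-component reasoning above can cope with combinations nobody listed.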


### 5. Process the Dataset

Now we iterate over the dataset and process the values for each column as necessary.

In [28]:
vessels = {} # our top level entity is the vessel, so we create a container for them (as a dictionary)

for row in csv_data: # now we iterate over the rows
    vessel_number = clean_integer(row['Vessel_number']) # we extract the vessel_number as an integer
    voyage_number = clean_integer(row['Voyage_number']) # same for the voyage number
    crew_member = get_crew_member( # and we get the crew member data, which is the entity that's unique to a row
                row['Name'],
                row['Age'],
                row['Height'],
                row['Residence'],
                row['Rank']
            )
    
    if crew_member is None: # we check if there is a crew member, if not:
        continue # we skip this row
    
    if vessel_number in vessels: # if an entry for this vessel already exists
        voyages = vessels[vessel_number]['voyages'] # we extract the dictionary of voyages from it
        
        if voyage_number in voyages: # if this row's voyage is already there
            crew_list = voyages[voyage_number]['crew'] # we get the list of crew members
            crew_list.append(crew_member) # and append our chap
        
        else: # if the voyage is not already listed under the vessel
            voyages[voyage_number] = { # we add it
                'departure': process_date(clean_strings(row['Departure'])), # include the departure date
                'crew': [crew_member] # and a crew list, to which we add our man
            }
        
    else: # if there isn't an entry for this vessel      
        vessels[vessel_number] = { # we add it
            'name': clean_strings(row['Vessel']), # we get the vessel's name
            'rig': clean_strings(row['Rig']), # the rig configuration
            'voyages': { # create a dictionary for voyages
                voyage_number: { # add the voyage in this row
                    'departure': process_date(clean_strings(row['Departure'])), # including departure date
                    'crew': [crew_member] # and a crew list, to which we add our man
                }
            }
        }

print('Processing completed!')

Processing completed!
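
A quick sanity check after a restructuring like this is to count the entities at each level and compare the crew total against the number of rows you expected to keep. The sketch below does that on a hand-made miniature of the nested structure. `summarise` and `sample_vessels` are hypothetical names, and the sample data is abbreviated from the tables above rather than produced by the real run.

```python
def summarise(vessels):
    """Return (number of vessels, number of voyages, number of crew members)."""
    voyage_count = 0
    crew_count = 0
    for vessel in vessels.values():
        for voyage in vessel['voyages'].values():
            voyage_count += 1
            crew_count += len(voyage['crew'])
    return len(vessels), voyage_count, crew_count


# miniature stand-in for the nested vessel -> voyage -> crew structure
sample_vessels = {
    481: {'name': 'Mary and Susan', 'rig': 'Bark', 'voyages': {
        9240: {'departure': '1867-09-09',
               'crew': [{'name': 'John A. Frates'}, {'name': 'A. O. Herendeen'}]},
    }},
    703: {'name': 'Andrew Hicks', 'rig': 'Bark', 'voyages': {
        941: {'departure': '1867-09-09', 'crew': [{'name': 'Otis F. Hamblin'}]},
    }},
}

print(summarise(sample_vessels))
```

If the crew count is lower than the number of input rows, the difference should equal the number of rows skipped for having no name.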


### 6. Save the Results

In the last step we save the results of our data processing to a JSON file on the hard drive.

In [29]:
json_file = os.path.join(os.curdir, 'dataset', 'crew_manifest.json') # create destination path

with open(json_file, 'w') as file: # open file in write mode
    json.dump(vessels, file, indent=4)
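
One thing to be aware of when reading the file back: JSON object keys are always strings, so integer keys such as vessel and voyage numbers are written as `"481"` and `"9240"`. The round trip below illustrates this on a small stand-in dictionary (`vessels_sample` is a hypothetical name, not the real result).

```python
import json

# miniature stand-in with integer keys at both levels
vessels_sample = {
    481: {
        'name': 'Mary and Susan',
        'voyages': {9240: {'departure': '1867-09-09', 'crew': []}},
    }
}

text = json.dumps(vessels_sample, indent=4)  # integer keys are converted to strings here
restored = json.loads(text)

print(list(restored))  # the keys as they come back from JSON
print(481 in restored, '481' in restored)
```

Code that loads `crew_manifest.json` later therefore needs to look vessels up by string, or convert the keys back with `int()` after loading.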