# UNIT 3: Reuse, Modularity, and External Resources

## Exercise

In this exercise, you'll use what you've learned so far to write a programme to clean and re-structure a complex dataset.

### INPUT: `crew_manifest.csv`

Use the [Dataset](https://canvas.harvard.edu/courses/113131/files/17333429?wrap=1) link on the Canvas site to download the CSV file. Here's a sample of the data showing how it is structured:

Vessel | Rig | Departure | Name | Age | Height | Residence | Rank | Voyage_number | Vessel_number
:------|:----|:----------|:-----|:---:|:------:|:----------|:-----|:-------------:|:-------------:
Mary and Susan | Bark | 9/9/1867 | Frates, John A. | 18 | 5'0 3/4 | Azores | Seaman, Boatsteerer | 9240 | 481
Andrew Hicks | Bark | 1867.9.9 | Hamblin, Otis F. |  |  |  | Master | 941 | 703
Mary and Susan | Bark | 9-sep-1867 |  Herendeen, A.  O. |  |  |  | Master | 9240 | 481
Andrew Hicks | Bark | September 9 1867 | Jenkins,  Thomas H. | 22 | 5'7 | Dartmouth | Master | 941 | 703
Mary and Susan | Bark | 09/09/1867 | Gonsalves, Frank |  |  |  | Seaman | 9240 | 481
Sarah | Bark | 9/9/1867 | Avola, Antone | 23 | 5'9 | Azores |  | 9240 | 637
Mary and Susan | Bark | 9/9/1867 | Baptista, Manuel J. | 30 | 5'4 | Azores |  | 9240 | 481
Mary and Susan | Bark | 9/9/1867 | Berry, William, Jr. | 22 | 5'4 | Boston |  | 9240 | 481
Mary and Susan | Bark | 9/9/1867 | Bettencurt, Antonio | 18 | 5'2 1/4 | Azores |  | 9240 | 481

This is a *very real*—and *very messy*—dataset. A few things to keep in mind:
- Expect **leading, trailing, and double spaces** in any column.
- Formatting is **consistent for the following columns**: `Vessel`, `Rig`, `Age`, `Voyage_number`, and `Vessel_number`.
- **Not so much for the rest**. In the next cell, I've included *representative variations* of the different formats used in each of the other columns. I used Python lists so that you can **use them directly for testing your algorithms** as you design them (you're welcome!):

```python
departure_variations = [
    '09/06/1844', # in context, this should be 6 September 1844
    '9/30/1885',
    '1867.9.9',
    '30-Sep-1873',
    '9-sep-1867',
    '9/3/1883',
    '9/APR/1846',
    '9/August/1865',
    'April 9 1855'
]

name_variations = [
    "A. L. E. Benton",
    "A.j. Harvey",
    "Aaron F. Hussey",
    "Aaron Dean",
    "Adams, Charles",
    "Adams, Charles C.",
    "Alden Jr Rounseville",
    "Almada, Peter Antonie",
    "Amos F.",
    "Bennett,william",
    "Berry, William, Jr.",
    "Borden, Joshua G Jr",
    "O'brien, John"
]

height_variations = [
    "4' 6\"", # note escaped double-quotes used as inches marker
    "4'6",
    "5'",
    "5' 10",
    "5' 4 1/2\"", # note escaped double-quotes used as inches marker
    "5' 7 3/4\"", # note escaped double-quotes used as inches marker
    "5'0",
    "5'0 1/2",
    "5'10+",
    "5'11.5"
]

residence_variations = [
    "Albany, Ny",
    "Azores",
    "Brava, Cape Verde",
    "Brookirlee, Canada",
    "Corvo, Azores",
    "Havre De Grace",
    "Lyons Ma",
    "Marshall, Mich",
    "Martha's Vineyard",
    "Saint Croix, West Indies",
    "Saint Paul (Africa)"
]
```
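
As one possible starting point (not the expected solution), the sketch below handles whitespace, departure dates, and heights for the variations listed above. The names `clean_whitespace`, `parse_departure`, `parse_height_inches`, `DEPARTURE_FORMATS`, and `HEIGHT_PATTERN` are hypothetical, and the format list is inferred from the samples only. It assumes that all-numeric slashed dates are month-first (as the `'09/06/1844'` comment suggests), that month names are in English, and that a trailing `+` on a height can be dropped.

```python
import re
from datetime import datetime

# Hypothetical format list, inferred from departure_variations.
DEPARTURE_FORMATS = [
    "%m/%d/%Y",  # 09/06/1844, 9/30/1885 (assumed month-first)
    "%Y.%m.%d",  # 1867.9.9
    "%d-%b-%Y",  # 30-Sep-1873, 9-sep-1867
    "%d/%b/%Y",  # 9/APR/1846
    "%d/%B/%Y",  # 9/August/1865
    "%B %d %Y",  # April 9 1855
]

# Hypothetical pattern, inferred from height_variations.
HEIGHT_PATTERN = re.compile(
    r"""^(?P<feet>\d+)'\s*                 # feet, then the foot mark
        (?P<inches>\d+(?:\.\d+)?)?\s*      # optional whole or decimal inches
        (?:(?P<num>\d+)/(?P<den>\d+))?\s*  # optional fraction such as 3/4
        ["+]?$                             # optional inch mark or trailing plus
    """,
    re.VERBOSE,
)


def clean_whitespace(text):
    """Strip leading/trailing spaces and collapse runs of whitespace."""
    return " ".join(text.split())


def parse_departure(raw):
    """Return an ISO date string (YYYY-MM-DD), or None if no format matches."""
    text = clean_whitespace(raw)
    for fmt in DEPARTURE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next candidate format
    return None


def parse_height_inches(raw):
    """Return the height in inches as a float, or None if it cannot be parsed."""
    match = HEIGHT_PATTERN.match(clean_whitespace(raw))
    if match is None:
        return None
    total = int(match["feet"]) * 12.0
    if match["inches"]:
        total += float(match["inches"])
    if match["num"] and int(match["den"]) != 0:
        total += int(match["num"]) / int(match["den"])
    return total
```

Running `[parse_departure(d) for d in departure_variations]` and `[parse_height_inches(h) for h in height_variations]` is a quick way to spot any variation that still comes back as `None`. The sketch stops at inches because the unit and rounding of the final `height` value are not stated here.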

### OUTPUT:  `crew_manifest.json`

Besides cleaning the dataset, you'll also have to re-structure it. The **original CSV file tracks three entities**: `vessels`, `voyages`, and `crew members`. In principle, **only the crew member data is unique to each row**; vessel and voyage data repeat across rows. So, we'll restructure the dataset as follows (*Entity-Relationship diagramme* on the left, *JSON sample* on the right):

<div style="margin-top: 20px"><div style="width: 49%; float: left;">
<img src="https://mermaid.ink/img/pako:eNqFk8Fu4jAQhl9l5EsvpA_AjYXsbiUCUlhRVcrFaw9h1Niu7IlQBH332iHQ0K5YX6JM_pl_5ovnKJTTKKYC_YJk7aWpLMSzzTebfAmnU5adjrBdv8x-5TCFKsotS7IBJDQUGNyuEjcpx_NbOmQZVq35i_4zFtjDShq8jZRUfwb6ulvXyRpDdBzMPSrndbi4vQ-m54__MdWSERb4Jj23HiEWlQFe4smKIlssLjWv5nOPhyial_lzVuTFj7y8bz9gGuvvsxorj9_hnBv8ST4mFY9Pj7CUgcddpiFnNd4GfiPVe465ZEGhZTLIHsM4T5NiKDGQRqsGm-V6PvvztF59o1C6pv8Byhkjs5DwRY4aiNEE2Hln4KGU9vXhC5TxcD2Z09Wjx0JWNa3GhCX1Q85K38GBeH8pdJV_YbN0SjbEXWoqPibA7mDHfSfR3LWWu9vYhtMNqMSubRqw6fr9K8l39yURqY1ch2HFRBj0RpKO69M3WgneY0pMU2rpX1OJpJMtu01nlZiyb3Ei2rd0I4eFE9OdbEKMoiZ2vjjvY7-W7x8UuxWi" width=300>

</div><div style="width: 49%; float: right; border-left-color: rgb(208, 208, 208); border-left-style: solid">

```json
{
    "481": {
        "name": "Mary and Susan",
        "rig": "Bark",
        "voyages": {
            "9240": {
                "departure": "1867-09-09",
                "crew": [
                    {
                        "name": "John A. Frates",
                        "age": 18,
                        "height": 155,
                        "residence": {
                            "locality": "Azores"
                        },
                        "roles": [
                            "seaman",
                            "boatsteerer"
                        ]
                    },
                    {
                        "name": "A. O. Herendeen",
                        "age": null,
                        "height": null,
                        "residence": {},
                        "roles": [
                            "master"
                        ]
                    },
                    { other crew members}
                ]
            },
            { other voyages }
        }
    },
    { other vessels }
}
```
</div></div>
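
The nesting shown in the JSON sample can be sketched with `dict.setdefault`, which creates a vessel or voyage entry the first time it is seen and reuses it afterwards. This is a minimal illustration rather than the expected solution. The function name `restructure` and the `sample_rows` data are hypothetical, and the rows are assumed to be already-cleaned dictionaries keyed by the CSV column names (for example, as produced by `csv.DictReader`). Height, residence, and name reordering are deliberately left out.

```python
import json


def restructure(rows):
    """Group flat manifest rows into vessels -> voyages -> crew."""
    vessels = {}
    for row in rows:
        # Create the vessel entry on first sight, reuse it afterwards.
        vessel = vessels.setdefault(row["Vessel_number"], {
            "name": row["Vessel"],
            "rig": row["Rig"],
            "voyages": {},
        })
        # Same idea for the voyage nested under its vessel.
        voyage = vessel["voyages"].setdefault(row["Voyage_number"], {
            "departure": row["Departure"],
            "crew": [],
        })
        # Only the crew member data is unique to the row.
        voyage["crew"].append({
            "name": row["Name"],
            "age": int(row["Age"]) if row["Age"] else None,
            "roles": [r.strip().lower() for r in row["Rank"].split(",") if r.strip()],
        })
    return vessels


# Hypothetical, already-cleaned rows based on the sample table above.
sample_rows = [
    {"Vessel": "Mary and Susan", "Rig": "Bark", "Departure": "1867-09-09",
     "Name": "John A. Frates", "Age": "18", "Rank": "Seaman, Boatsteerer",
     "Voyage_number": "9240", "Vessel_number": "481"},
    {"Vessel": "Mary and Susan", "Rig": "Bark", "Departure": "1867-09-09",
     "Name": "A. O. Herendeen", "Age": "", "Rank": "Master",
     "Voyage_number": "9240", "Vessel_number": "481"},
    {"Vessel": "Andrew Hicks", "Rig": "Bark", "Departure": "1867-09-09",
     "Name": "Otis F. Hamblin", "Age": "", "Rank": "Master",
     "Voyage_number": "941", "Vessel_number": "703"},
]

manifest = restructure(sample_rows)
print(json.dumps(manifest, indent=4))
```

Keying the outer dictionary by `Vessel_number` and the inner one by `Voyage_number` mirrors the sample, and `json.dumps` writes Python's `None` as JSON `null`.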

## Your Solution

```python
# your code starts here
```



