In [None]:
#######################################################

- Provide a summary of the data
    - What is the source of the data?
        - How is the data generated?
    - What is the original format of the data?
    - How large is the entire dataset?
    - Describe the features of the data that you are working with
        - Provide a statistical summary of numerical columns
        - Provide a detailed breakdown of how you choose to clean the data
        - What is your methodology for dealing with errors, nulls, and outliers?
            - What quality standards are enforced for your cleaned data?
            - Does the data need to be augmented with any external data?
    - Proposed solution
       - ERD for raw tables
       - A complete Entity Relationship Diagram of your final normalized schema
       - The partitioning schema for your dataset

In [None]:
#############################################################

## Data Summary

##### Data Source: 
- US Imports - Automated Manifest System (AMS) Shipments 2018-2020
- Collected by US Customs and Border Protection (CBP)
- CBP requires that all ships entering or passing through US waters provide details about their cargo contents
##### Original format:
- CSV files, separated by tables
    - [Bill of Lading Headers, General Bill Information, Shippers, Shipment Consignees, Notified Parties, Cargo Descriptions, Containers, Marks and Numbers, Tariff Codes, Toxic and Hazardous Materials, Toxic and Hazardous Materials Classification]
- Original Size of all 3 years of data: 63.22GB

In [None]:
#######################

## Use Case

#### Big Question
- Which shippers are most/least reliable (arrival time delta between estimated and actual)?

#### Sub Questions
- Identify most reliable shippers per country/region/subregion
- Identify the most reliable carrier companies
- Identify the reliability changes over years
    - How did covid affect reliability metrics of shipment times?
- Which consignees chose their shippers most wisely?
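
As a rough illustration of the reliability metric behind these questions, here is a minimal pandas sketch. It assumes the header and shipper tables have already been joined on `identifier`; the toy rows, the `delay_days` / `abs_delay_days` columns, and the `reliability_by_shipper` helper are all invented for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy rows standing in for the joined header/shipper data (values are invented).
shipments = pd.DataFrame({
    "shipper_party_name": ["ACME", "ACME", "GLOBEX", "GLOBEX"],
    "estimated_arrival_date": ["2019-01-01", "2019-02-01", "2019-01-10", "2019-03-05"],
    "actual_arrival_date": ["2019-01-03", "2019-02-02", "2019-01-10", "2019-03-04"],
})


def reliability_by_shipper(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize the arrival delta (actual minus estimated, in days) per shipper."""
    est = pd.to_datetime(df["estimated_arrival_date"])
    act = pd.to_datetime(df["actual_arrival_date"])
    delay = (act - est).dt.days
    out = df.assign(delay_days=delay, abs_delay_days=delay.abs())
    summary = out.groupby("shipper_party_name").agg(
        shipments=("delay_days", "size"),
        mean_delay_days=("delay_days", "mean"),
        mean_abs_delay_days=("abs_delay_days", "mean"),
    )
    # Smallest average absolute delta first, i.e. "most reliable" first.
    return summary.sort_values("mean_abs_delay_days")


summary = reliability_by_shipper(shipments)
print(summary)
```

Ranking by the mean absolute delta treats early and late arrivals as equally unreliable; that is one possible definition, and the signed mean is kept alongside it in case lateness alone turns out to matter more.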

In [None]:
###################################################

#### Brief
Initially, I intended to study shipping companies (carriers) and their reliability, measured as the time delta between estimated and actual delivery, along with the reliability of their countries of origin. I ran into two issues: relying on carrier_code alone would limit the tables I could draw on, and many carrier_codes appeared to be linked to multiple countries.

That proved to be troublesome in more ways than one. I had hoped to be able to separate out ships or voyages into their own SQL tables, but continued to encounter issues of non-unique data -- voyages including multiple estimated times due to multiple shipments onboard, companies running under up to 5 different countries, ships belonging to multiple carrier companies, etc. 
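
A check along these lines is one way to surface the carrier/country non-uniqueness. This is only a sketch on invented rows; the column names come from the Header table, but the values and the `multi_country` name are assumptions for illustration.

```python
import pandas as pd

# Invented sample of header rows: one carrier code appears under two countries.
header = pd.DataFrame({
    "carrier_code": ["AAAA", "AAAA", "BBBB", "BBBB", "BBBB"],
    "vessel_country_code": ["PA", "LR", "SG", "SG", None],
})

# Count distinct vessel countries per carrier code (nulls are ignored by nunique).
countries_per_carrier = header.groupby("carrier_code")["vessel_country_code"].nunique()

# Carrier codes that map to more than one country.
multi_country = countries_per_carrier[countries_per_carrier > 1]
print(multi_country)
```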

So I went back and switched the main question to pertain to the shippers. This allowed for more static information by country and, based on quick pandas explorations, showed enough repeat shippers to make meaningful insights.

From there, I narrowed down my pertinent fields, leaving some extraneous fields that may, as I explore and transform the data, have the potential to prove useful. From header, the most important fields concerned dates, the carrier code, and the identifier to connect with the consignee and shipper. From shipper and consignee, I pulled the identifier, name, and associated address information.

Moving forward, the most challenging aspect of the cleaning process will definitely concern the address, as the input data is far from uniform.



In [None]:
#############################################################

## Raw Data Info

*Total size of pertinent raw data tables: 23.93GB*

- I am currently using 3 tables from the raw data: Header, Shipper, and Consignee.

- I did research and used deductive reasoning to come up with the explanations of columns
- Using DataFrame.info() on the first file of each table, I determined the percentage of non-null values in each column to help identify which columns had too few values to be useful in my dataset
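
The non-null percentages can also be computed directly rather than read off `DataFrame.info()`. The sketch below uses a small in-memory CSV as a stand-in for the first file of a table; the sample values and the `non_null_percent` helper are made up for illustration.

```python
import io

import pandas as pd

# Stand-in for the first CSV file of a table (values are invented).
sample_csv = io.StringIO(
    "identifier,port_of_destination,measurement_unit\n"
    "1,,CM\n"
    "2,Houston,\n"
    "3,,CM\n"
    "4,,CM\n"
)
df = pd.read_csv(sample_csv)


def non_null_percent(frame: pd.DataFrame) -> pd.Series:
    """Percentage of non-null values in each column, rounded to 2 decimals."""
    return (frame.notna().mean() * 100).round(2)


pct = non_null_percent(df)
print(pct)
```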

##### Header
| Column | Data Type | Explanation | % non-null in 1st file |
| ------ | --------- | ----------- | ---------------------- |
| identifier | int64 | id of manifest shipment | 100% |
| carrier_code | String | Standard Carrier Alpha Code (SCAC) to identify Vessel Operating Common Carriers (VOCC) | 100% |
| vessel_country_code | String | Carrier vessel country of origin | 99.99% |
| vessel_name | String | Ship's name | 100% |
| port_of_unlading | String | Place where vessel unloads shipment | 100% |
| estimated_arrival_date | String | Time given as estimated shipment arrival | 100% |
| foreign_port_of_lading_qualifier | String | Designating seaports handling waterborne shipment of foreign trade from | 100% |
| foreign_port_of_lading | String | Foreign port where shipment was loaded | 100% |
| manifest_quantity | int64 | Number of items in shipment manifest | 100% |
| manifest_unit | String | Unit in which the manifest items are packaged and counted | 100% |
| weight | int64 | total weight of manifest items | 100% |
| weight_unit | String | Weight unit of measure | 100% |
| measurement | int64 | dimensional measurement | 100% |
| measurement_unit | String | Unit of dimensional measurement | 68.73% |
| record_status_indicator | String | Designation of record status: new, deleted, or amended | 100% |
| place_of_receipt | String | Originating port of manifest items | 99.79% |
| port_of_destination | String | Eventual destination point | 9.4% |
| foreign_port_of_destination_qualifier | String | Designating seaports handling waterborne shipment of foreign trade to | 0.64% |
| foreign_port_of_destination | String | Eventual destination point if foreign to US | 0.64% |
| conveyance_id_qualifier | String | Designation of International Maritime Organization ship ID number | 100% |
| conveyance_id | String | IMO ID | 100% |
| in_bond_entry_type | String | Type for in-bond merchandise in shipment | 9.4% |
| mode_of_transportation | String | Container in which the material is being transported | 100% |
| secondary_notify_party_N | String | Shipping company to notify as secondary 1-10 | [56.65%, 6.48%, 0.4%, ... 0%] |
| actual_arrival_date | String | Real date of arrival | 100% |

##### Consignee
| Column | Data Type | Explanation | % non-null 1st file |
| ------ | --------- | ----------- | ------------------- |
| identifier | int64 | id of manifest shipment | 100% |
| consignee_name | String | Name of company receiving manifest items | 99.99% |
| consignee_address_1 | String | Top level address | 99.99% |
| consignee_address_2 | String | 2nd level address | 87.80% |
| consignee_address_3 | String | 3rd level address | 55.05% |
| consignee_address_4  | String | 4th level address | 11.52% |
| city | String | City name | 22.07% |
| state_province | String | 2-digit state or province code | 19.35% |
| zip_code | String | Address zip code | 19.41% |
| country_code | String | 2-digit country code | 20.36% |
| contact_name | String | Name of contact | 0.014% |
| comm_number_qualifier | String | Type of communication used for comm_number | 18.01% |
| comm_number | String | Communication number/website/email | 18% |

##### Shipper
| Column | Data Type | Explanation | % non-null 1st file |
| ------ | --------- | ----------- | ------------------- |
| identifier | int64 | id of manifest shipment | 100% |
| shipper_party_name | String | Name of company shipping manifest items | 99.99% |
| shipper_party_address_1 | String | Top level address | 99.99% |
| shipper_party_address_2 | String | 2nd level address | 91.55% |
| shipper_party_address_3 | String | 3rd level address | 62.89% |
| shipper_party_address_4 | String | 4th level address | 14.36% |
| city | String | City name | 22.19% |
| state_province | String | 2-digit state or province code | 8.61% |
| zip_code | String | Zip code | 12.32% |
| country_code | String | 2-digit country code | 21.21% |
| contact_name | String | Name of company contact | 1.50% |
| comm_number_qualifier | String | Type of communication used for comm_number | 17.89% |
| comm_number | String | Communication number/website/email | 17.86% |

In [None]:
######################################

## Curated Raw Data Info

Here are the columns from these tables that I determined to be pertinent to my use case:

#### Selected Columns

##### Header
| Column | Data Type | Explanation | % non-null in 1st file |
| ------ | --------- | ----------- | ---------------------- |
| identifier | int64 | id of manifest shipment | 100% |
| carrier_code | String | Standard Carrier Alpha Code (SCAC) to identify Vessel Operating Common Carriers (VOCC) | 100% |
| vessel_country_code | String | Carrier vessel country of origin | 99.99% |
| vessel_name | String | Ship's name | 100% |
| estimated_arrival_date | String | Time given as estimated shipment arrival | 100% |
| actual_arrival_date | String | Real date of arrival | 100% |

##### Consignee
(Addresses in need of cleaning, non-uniform)
| Column | Data Type | Explanation | % non-null 1st file |
| ------ | --------- | ----------- | ------------------- |
| identifier | int64 | id of manifest shipment | 100% |
| consignee_name | String | Name of company receiving manifest items | 99.99% |
| consignee_address_1 | String | Top level address | 99.99% |
| consignee_address_2 | String | 2nd level address | 87.80% |
| consignee_address_3 | String | 3rd level address | 55.05% |
| consignee_address_4  | String | 4th level address | 11.52% |
| country_code | String | 2-digit country code | 20.36% |

##### Shipper
(Addresses in need of cleaning, non-uniform)
| Column | Data Type | Explanation | % non-null 1st file |
| ------ | --------- | ----------- | ------------------- |
| identifier | int64 | id of manifest shipment | 100% |
| shipper_party_name | String | Name of company shipping manifest items | 99.99% |
| shipper_party_address_1 | String | Top level address | 99.99% |
| shipper_party_address_2 | String | 2nd level address | 91.55% |
| shipper_party_address_3 | String | 3rd level address | 62.89% |
| shipper_party_address_4 | String | 4th level address | 14.36% |
| country_code | String | 2-digit country code | 21.21% |

In [None]:
#############################################

## Preliminary ERD

<a href='https://app.quickdatabasediagrams.com/#/d/I1yY7B'>Quick Database Diagrams Link</a>


![ERD](/Users/jesseputnam/cs-learning/skillstorm/project01/erd.png)

#### Written Description of ERD

I will have four tables. The country table will require outside sources for converting country code to country name.
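
To sanity-check that the four tables and their foreign keys fit together, here is a sketch that builds them in an in-memory SQLite database. SQLite is swapped in only to keep the example self-contained: `IDENTITY` is replaced by `INTEGER PRIMARY KEY AUTOINCREMENT`, and the declared column types are treated loosely by SQLite. The target database may well differ.

```python
import sqlite3

# SQLite rendering of the planned schema (an assumption, not the final DDL).
SCHEMA = """
CREATE TABLE country (
    country_code CHAR(2) PRIMARY KEY,
    country_name VARCHAR(50),
    region VARCHAR(20),
    sub_region VARCHAR(20)
);
CREATE TABLE shipper (
    shipper_id INTEGER PRIMARY KEY AUTOINCREMENT,
    shipper_name VARCHAR(100),
    country_code CHAR(2) REFERENCES country(country_code)
);
CREATE TABLE consignee (
    consignee_id INTEGER PRIMARY KEY AUTOINCREMENT,
    consignee_name VARCHAR(100),
    country_code CHAR(2) REFERENCES country(country_code)
);
CREATE TABLE shipment (
    shipment_id INTEGER PRIMARY KEY,
    carrier_code VARCHAR(4),
    vessel_name VARCHAR(25),
    vessel_country CHAR(2) REFERENCES country(country_code),
    port_of_unlading VARCHAR(100),
    foreign_port_of_lading VARCHAR(100),
    estimated_arrival_date DATETIME,
    actual_arrival_date DATETIME,
    shipper_id INTEGER REFERENCES shipper(shipper_id),
    consignee_id INTEGER REFERENCES consignee(consignee_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript(SCHEMA)

# List the user tables, skipping SQLite's internal bookkeeping tables.
tables = sorted(
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
    )
)
print(tables)
```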

In [None]:
country
-
country_code CHAR(2) PRIMARY KEY
country_name VARCHAR(50)
region VARCHAR(20)
sub_region VARCHAR(20)

shipper
-
shipper_id INT PRIMARY KEY IDENTITY
shipper_name VARCHAR(100)
country_code CHAR(2) FOREIGN KEY REFERENCES country(country_code)

consignee
-
consignee_id INT PRIMARY KEY IDENTITY
consignee_name VARCHAR(100)
country_code CHAR(2) FOREIGN KEY REFERENCES country(country_code)

shipment
-
shipment_id INT PRIMARY KEY
carrier_code VARCHAR(4)
vessel_name VARCHAR(25)
vessel_country CHAR(2) FOREIGN KEY REFERENCES country(country_code)
port_of_unlading VARCHAR(100)
foreign_port_of_lading VARCHAR(100)
estimated_arrival_date DATETIME
actual_arrival_date DATETIME
shipper_id INT FOREIGN KEY REFERENCES shipper(shipper_id)
consignee_id INT FOREIGN KEY REFERENCES consignee(consignee_id)

In [None]:
###############################################

## Data Cleaning Outline/notes

#### Header
- Cleaning
    - header.estimated_arrival_date & header.actual_arrival_date will need to be converted to datetime
    - vessel_country_code will need outside reference for full country name
        - another table to designate continents?
- Null Values
    - The only null values of true significance are in identifier and the estimated and actual arrival dates. Any record missing one of those will have to be dropped from the data set.
    - Of secondary importance are vessel_country_code and carrier_code, which feed my secondary questions. I will keep records missing them for the primary question, but drop those records when answering the secondary questions.
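
The Header rules above could be sketched roughly as follows. The sample rows, the `REQUIRED` / `SECONDARY` lists, and the `clean_header` helper are hypothetical. The raw date format has not been confirmed, so unparseable dates are coerced to `NaT` and then dropped along with the true nulls.

```python
import pandas as pd

# Invented header rows covering each failure case described above.
header = pd.DataFrame({
    "identifier": [1, 2, 3, None, 5],
    "estimated_arrival_date": ["2019-01-01", "not a date", "2019-03-01", "2019-04-01", "2019-05-01"],
    "actual_arrival_date": ["2019-01-02", "2019-02-02", None, "2019-04-01", "2019-05-03"],
    "carrier_code": ["AAAA", "BBBB", "CCCC", "DDDD", None],
    "vessel_country_code": ["PA", "LR", "SG", "PA", "LR"],
})

DATE_COLS = ["estimated_arrival_date", "actual_arrival_date"]
REQUIRED = ["identifier"] + DATE_COLS               # needed for the primary question
SECONDARY = ["carrier_code", "vessel_country_code"]  # needed only for sub questions


def clean_header(df: pd.DataFrame) -> pd.DataFrame:
    """Parse the date columns and drop rows missing any required field."""
    out = df.copy()
    for col in DATE_COLS:
        # errors="coerce" turns unparseable strings into NaT instead of raising.
        out[col] = pd.to_datetime(out[col], errors="coerce")
    return out.dropna(subset=REQUIRED)


primary = clean_header(header)               # rows usable for the primary question
secondary = primary.dropna(subset=SECONDARY)  # stricter subset for the sub questions
print(len(primary), len(secondary))
```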


#### Shipper & Consignee
- Cleaning
    - shipper_party_address_1...4 / consignee_address_1...4, city, state_province, zip_code, country_code
        - All will need a lot of cleaning to standardize
        - Work will include conditionals looking at the address_1-4 fields
            - address_1 is generally the street address, but can also be a C/O or Attn line
            - Will still need to drill down on the scenarios involved
            - At roughly 20-22% non-null, the city and country_code fields (state_province is sparser still, at 9-19%) will be useful first checks, but they are too sparse to rely on alone
        - Final product will hopefully split into 4 columns:
            - address, city_province, zip_code, country
- Null Values
    - A null identifier is the primary concern; any record missing it must be dropped
    - Records with missing or erroneous shipper/consignee names will be dropped for the name-based questions, but kept for the region-specific questions if region data is present
    - Likewise, records missing region data will be dropped for the region-specific questions
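
As a first pass at the address conditionals, something like the sketch below could collapse the four address lines into one string while skipping C/O and Attn lines. The sample rows, the `CARE_OF` pattern, and the `street_lines` helper are assumptions for illustration; the real data will need many more cases than this handles.

```python
import re

import pandas as pd

ADDRESS_COLS = [f"shipper_party_address_{i}" for i in range(1, 5)]

# Lines that start with C/O or ATTN are treated as routing notes, not address parts.
CARE_OF = re.compile(r"^\s*(C/O|ATTN)\b", re.IGNORECASE)


def street_lines(row: pd.Series) -> list:
    """Return the non-empty address lines of a row, minus C/O and Attn lines."""
    lines = [str(v).strip() for v in row if pd.notna(v) and str(v).strip()]
    return [line for line in lines if not CARE_OF.match(line)]


# Invented shipper rows; the first has a C/O line in address_1.
shipper = pd.DataFrame({
    "shipper_party_address_1": ["C/O ACME LOGISTICS", "12 HARBOUR RD"],
    "shipper_party_address_2": ["88 PORT ST", "SHENZHEN"],
    "shipper_party_address_3": ["ROTTERDAM", None],
    "shipper_party_address_4": [None, None],
})

# Join the surviving lines into a single string per record.
shipper["address"] = [
    ", ".join(street_lines(row)) for _, row in shipper[ADDRESS_COLS].iterrows()
]
print(shipper["address"].tolist())
```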