# Proposal

## Use Case

#### Big Question
- Which shippers are most/least reliable (arrival time delta between estimated and actual)?

#### Sub Questions
- Which are the most reliable shippers per country/region/subregion?
- Which carrier companies are the most reliable?
- What, if any, were the reliability changes over the years?
    - How did covid affect reliability metrics of shipment times?
- Which consignees chose their shippers most wisely?
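
All of these questions rest on one metric: the difference between the actual and the estimated arrival date. The following is a minimal sketch of that calculation, not the project's final implementation. It assumes both date columns parse with `pd.to_datetime`; the sample rows, the `shipper_id` grouping column, and the `add_arrival_delay` helper are made up for illustration.

```python
import pandas as pd

# Hypothetical sample rows; the real values come from the header table.
sample = pd.DataFrame({
    'shipper_id': [1, 1, 2, 2],
    'estimated_arrival_date': ['2019-01-01', '2019-02-01', '2019-01-10', '2019-03-05'],
    'actual_arrival_date': ['2019-01-03', '2019-02-01', '2019-01-09', '2019-03-15'],
})


def add_arrival_delay(df):
    """Return a copy with arrival_delay_days = actual - estimated (positive means late)."""
    out = df.copy()
    estimated = pd.to_datetime(out['estimated_arrival_date'], errors='coerce')
    actual = pd.to_datetime(out['actual_arrival_date'], errors='coerce')
    out['arrival_delay_days'] = (actual - estimated).dt.days
    return out


delays = add_arrival_delay(sample)

# Mean absolute delay per shipper: a lower value means a more reliable shipper.
reliability = (
    delays.groupby('shipper_id')['arrival_delay_days']
    .agg(lambda s: s.abs().mean())
)
```

Using the absolute delay treats early and late arrivals as equally unreliable; keeping the sign instead would separate the two cases.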

## Raw Denormalized Tables

- These are the tables and columns primarily required for the use case and for cleaning

##### Header
| Column | Data Type | Explanation | % non-null in 1st file |
| ------ | --------- | ----------- | ---------------------- |
| identifier | int64 | id of manifest shipment | 100% |
| carrier_code | String | Standard Carrier Alpha Code (SCAC) to identify Vessel Operating Common Carriers (VOCC) | 100% |
| vessel_country_code | String | Carrier vessel country of origin | 99.99% |
| vessel_name | String | Ship's name | 100% |
| estimated_arrival_date | String | Time given as estimated shipment arrival | 100% |
| actual_arrival_date | String | Real date of arrival | 100% |

##### Consignee
| Column | Data Type | Explanation | % non-null in 1st file |
| ------ | --------- | ----------- | ------------------- |
| identifier | int64 | id of manifest shipment | 100% |
| consignee_name | String | Name of company receiving manifest items | 99.99% |
| consignee_address_1 | String | Top level address | 99.99% |
| consignee_address_2 | String | 2nd level address | 87.80% |
| consignee_address_3 | String | 3rd level address | 55.05% |
| consignee_address_4 | String | 4th level address | 11.52% |
| country_code | String | 2-letter country code (ISO 3166-1 alpha-2) | 20.36% |

##### Shipper
| Column | Data Type | Explanation | % non-null in 1st file |
| ------ | --------- | ----------- | ------------------- |
| identifier | int64 | id of manifest shipment | 100% |
| shipper_party_name | String | Name of company shipping manifest items | 99.99% |
| shipper_party_address_1 | String | Top level address | 99.99% |
| shipper_party_address_2 | String | 2nd level address | 91.55% |
| shipper_party_address_3 | String | 3rd level address | 62.89% |
| shipper_party_address_4 | String | 4th level address | 14.36% |
| country_code | String | 2-letter country code (ISO 3166-1 alpha-2) | 21.21% |

## Proposed Normalized Schema
- For answering use case questions

![ERD](/Users/jesseputnam/cs-learning/skillstorm/project01/erd.png)

# ETL Pipeline

In [1]:
import pandas as pd

from lib.utils import get_id_nums, clean_row, remove_incorrect_codes

## Set up countries table

- Get table of countries with alpha-2 code that includes region from repository
    - https://github.com/lukes/ISO-3166-Countries-with-Regional-Codes

In [2]:
countries_cols = ['name', 'alpha-2', 'region', 'sub-region']
countries = pd.read_csv('/Users/jesseputnam/cs-learning/skillstorm/project01/data/bronze_layer/country_data.csv', usecols=countries_cols, keep_default_na=False)

# Add extra codes found in the data set that are outdated or non-standard
# (e.g. DD, the former German Democratic Republic)
extra_codes = pd.DataFrame(
    {'name': ['Czechia', 'Netherland Antilles', 'Germany', 'European Union'], 
    'alpha-2': ['XC', 'AN', 'DD', 'EU'], 
    'region': ['Europe', 'Americas', 'Europe', 'Europe'], 
    'sub-region': ['Eastern Europe', 'Latin America and the Caribbean', 'Western Europe', 'Western Europe']}
    )
# keys= has no effect when ignore_index=True, so it is omitted
countries = pd.concat([countries, extra_codes], ignore_index=True)

# Change countries index column to be alpha-2 values and rename to id
countries.set_index('alpha-2', inplace=True)
countries.index.name = 'id'
countries.sort_index(inplace=True)

# Create country code set with O(1) lookup for table cleaning
alpha_2_set = set(countries.index)

# Create country name dictionary with O(1) lookup for table cleaning
country_dict = {x[1].upper(): x[0] for x in countries.itertuples()}

# Add some statistically significant outliers, including common 'typos'
country_dict['TAIWAN'] = 'TW'
country_dict['SOUTH KOREA'] = 'KR'
country_dict['SHANGHAI CN'] = 'CN'
country_dict['SHANGHAI'] = 'CN'
country_dict['SHANGHAI .'] = 'CN'
country_dict['HONG KONG .'] = 'CN'
country_dict['TAIPEI .'] = 'TW'
country_dict['USA'] = 'US'
country_dict['U.S.A.'] = 'US'

- Write the extended countries DataFrame to CSV

In [20]:
countries_silver_path = '/Users/jesseputnam/cs-learning/skillstorm/project01/data/silver_layer/countries.csv'
countries.to_csv(countries_silver_path, mode='w')


## Shipper/Consignee
- Both entity tables are handled the same way; only the shipper table is shown here

- Read CSVs to DataFrames

In [8]:
shipper_0 = pd.read_csv('/Users/jesseputnam/cs-learning/skillstorm/project01/data/bronze_layer/shipper_2018_part_0.csv')
shipper_1 = pd.read_csv('/Users/jesseputnam/cs-learning/skillstorm/project01/data/bronze_layer/shipper_2018_part_1.csv')
shipper_2 = pd.read_csv('/Users/jesseputnam/cs-learning/skillstorm/project01/data/bronze_layer/shipper_2019_part_0.csv')
shipper_3 = pd.read_csv('/Users/jesseputnam/cs-learning/skillstorm/project01/data/bronze_layer/shipper_2019_part_1.csv')
shipper_4 = pd.read_csv('/Users/jesseputnam/cs-learning/skillstorm/project01/data/bronze_layer/shipper_2020_part_0.csv')
shipper_5 = pd.read_csv('/Users/jesseputnam/cs-learning/skillstorm/project01/data/bronze_layer/shipper_2020_part_1.csv')

- Concatenate shippers DataFrames to single DataFrame

In [9]:
shippers = pd.concat([shipper_0, shipper_1, shipper_2, shipper_3, shipper_4, shipper_5], ignore_index=True)

In [8]:
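# Replace NaN in name with 'N/A'
# (assign the result back; inplace fillna on a selected column is deprecated in recent pandas)
shippers['shipper_party_name'] = shippers['shipper_party_name'].fillna('N/A')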

- Clean shipper rows and remove the remaining unnecessary columns (see lib/utils.py for the **clean_row** function)

In [9]:
shipper_clean = shippers.apply(lambda row: clean_row(row, 'shipper_party', alpha_2_set, country_dict), axis=1)
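
Since `lib/utils.py` is not reproduced here, the following is only a guess at the kind of logic `clean_row` might contain, inferred from its arguments. The function name `clean_row_sketch`, the choice to scan address lines from last to first, and the set of returned columns are all assumptions, not the project's actual implementation.

```python
def clean_row_sketch(row, prefix, alpha_2_set, country_dict):
    """Hypothetical stand-in for clean_row: keep id and name, infer country_code."""
    code = row.get('country_code')
    if isinstance(code, str) and code.strip().upper() in alpha_2_set:
        # The existing code is already valid; just normalize it.
        code = code.strip().upper()
    else:
        code = None
        # Assumption: the country tends to appear in the last filled address line.
        for n in (4, 3, 2, 1):
            text = row.get(f'{prefix}_address_{n}')
            if not isinstance(text, str):
                continue  # NaN or missing address line
            text = text.strip().upper()
            if text in alpha_2_set:
                code = text
            elif text in country_dict:
                code = country_dict[text]
            if code is not None:
                break
    return {
        'identifier': row['identifier'],
        f'{prefix}_name': row[f'{prefix}_name'],
        'country_code': code,
    }


# Tiny lookups standing in for alpha_2_set and country_dict.
codes = {'CN', 'TW', 'US'}
names = {'CHINA': 'CN', 'SHANGHAI .': 'CN', 'USA': 'US'}
raw = {
    'identifier': 1,
    'shipper_party_name': 'ACME LTD',
    'shipper_party_address_1': '1 HARBOUR ROAD',
    'shipper_party_address_2': 'shanghai .',
    'country_code': None,
}
cleaned = clean_row_sketch(raw, 'shipper_party', codes, names)
```

To use a function like this with `DataFrame.apply(..., axis=1)`, the returned dict would need to be wrapped in `pd.Series` so that the result is a DataFrame rather than a Series of dicts.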

- **Result:** out of 40,240,366 values

| country_code values | Before Cleaning | After Cleaning |
| ------------------- | --------------- | -------------- |
| Count | 9,911,774 | 13,100,153 |
| Share | 24.63% | 32.55% |

- Create a shipper id column and map IDs by name (see lib/utils.py for the **get_id_nums** function)

In [13]:
shipper_id_dict = get_id_nums(shipper_clean['shipper_party_name'])
shipper_clean['shipper_id'] = shipper_clean['shipper_party_name'].map(shipper_id_dict)
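
The mapping step needs a dictionary from each distinct name to an integer id. The real `get_id_nums` lives in `lib/utils.py` and is not shown, so the sketch below is an assumption about its behavior: `get_id_nums_sketch` is a hypothetical name, and assigning ids from 1 in first-seen order is an illustrative choice.

```python
def get_id_nums_sketch(names):
    """Hypothetical stand-in for get_id_nums: map each distinct name to an integer id."""
    id_by_name = {}
    for name in names:
        if name not in id_by_name:
            # Ids start at 1 and follow first-seen order.
            id_by_name[name] = len(id_by_name) + 1
    return id_by_name


names = ['ACME LTD', 'N/A', 'ACME LTD', 'BOLT CO']
ids = get_id_nums_sketch(names)
shipper_ids = [ids[name] for name in names]
```

Note that every row whose name was filled with 'N/A' shares a single id under this scheme, so unknown shippers collapse into one entity.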

- Write the cleaned data to CSV (this is time-consuming, so check the data for mistakes first)
- The cleaned data will be used to create the normalized tables for the silver layer

In [None]:
shipper_final_path = '/Users/jesseputnam/cs-learning/skillstorm/project01/data/final/shipper_clean.csv'
shipper_clean.to_csv(shipper_final_path, mode='w')

## Header

- Read header CSV files

In [None]:
header_0 = pd.read_csv('/Users/jesseputnam/cs-learning/skillstorm/project01/data/bronze_layer/header_2018_part_0.csv')
header_1 = pd.read_csv('/Users/jesseputnam/cs-learning/skillstorm/project01/data/bronze_layer/header_2018_part_1.csv')
header_2 = pd.read_csv('/Users/jesseputnam/cs-learning/skillstorm/project01/data/bronze_layer/header_2018_part_2.csv')
header_3 = pd.read_csv('/Users/jesseputnam/cs-learning/skillstorm/project01/data/bronze_layer/header_2018_part_3.csv')
header_4 = pd.read_csv('/Users/jesseputnam/cs-learning/skillstorm/project01/data/bronze_layer/header_2019_part_0.csv')
header_5 = pd.read_csv('/Users/jesseputnam/cs-learning/skillstorm/project01/data/bronze_layer/header_2019_part_1.csv')
header_6 = pd.read_csv('/Users/jesseputnam/cs-learning/skillstorm/project01/data/bronze_layer/header_2019_part_2.csv')
header_7 = pd.read_csv('/Users/jesseputnam/cs-learning/skillstorm/project01/data/bronze_layer/header_2019_part_3.csv')
header_8 = pd.read_csv('/Users/jesseputnam/cs-learning/skillstorm/project01/data/bronze_layer/header_2020_part_0.csv')
header_9 = pd.read_csv('/Users/jesseputnam/cs-learning/skillstorm/project01/data/bronze_layer/header_2020_part_1.csv')
header_10 = pd.read_csv('/Users/jesseputnam/cs-learning/skillstorm/project01/data/bronze_layer/header_2020_part_2.csv')


- Concat to a single DataFrame

In [37]:
headers = pd.concat([header_0, header_1, header_2, header_3, header_4, header_5, header_6, header_7, header_8, header_9, header_10], ignore_index=True)
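
The read-then-concatenate pattern used for both the shipper and header files can also be written as one helper that collects the parts by file name. This is a sketch only: `load_parts` is a hypothetical helper, and it assumes the parts follow the `<prefix>_<year>_part_<n>.csv` naming seen above, with single-digit part numbers so that sorting by name preserves the original order.

```python
from pathlib import Path

import pandas as pd


def load_parts(directory, prefix):
    """Read every '<prefix>_*.csv' in directory, sorted by name, into one DataFrame."""
    paths = sorted(Path(directory).glob(f'{prefix}_*.csv'))
    if not paths:
        raise FileNotFoundError(f'no {prefix}_*.csv files in {directory}')
    frames = [pd.read_csv(path) for path in paths]
    return pd.concat(frames, ignore_index=True)
```

With this helper, a call such as `load_parts(bronze_dir, 'header')` would stand in for the eleven separate reads, where `bronze_dir` is whatever path holds the bronze layer files.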

- Remove erroneous or unknown country codes

In [6]:
# Series.apply passes one value at a time, so the argument is a single code, not a row
headers['vessel_country_code'] = headers['vessel_country_code'].apply(lambda code: remove_incorrect_codes(code, alpha_2_set))
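
The real `remove_incorrect_codes` is defined in `lib/utils.py` and is not shown here. As a rough guess at its behavior, the sketch below keeps a code only if it appears in the known alpha-2 set. The name `remove_incorrect_codes_sketch`, the normalization to upper case, and the use of `None` for rejected values are assumptions.

```python
def remove_incorrect_codes_sketch(code, alpha_2_set):
    """Hypothetical stand-in: return the normalized code if known, otherwise None."""
    if isinstance(code, str):
        normalized = code.strip().upper()
        if normalized in alpha_2_set:
            return normalized
    # NaN, non-string values, and unknown codes are all treated as missing.
    return None


known = {'PA', 'LR', 'US'}
results = [remove_incorrect_codes_sketch(c, known) for c in ['pa', 'ZZ', float('nan'), ' LR ']]
```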

- Save cleaned dataset to CSV

In [7]:
header_final_path = '/Users/jesseputnam/cs-learning/skillstorm/project01/data/final/header_clean.csv'
headers.to_csv(header_final_path, mode='w')

# In Progress

## Junction Tables

### Shipper_Shipment

- From **shipper_clean**: identifier, shipper_id

### Consignee_Shipment

- From **consignee_clean**: identifier, consignee_id
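
Each junction table is the set of unique (shipment, entity) pairs taken from the cleaned table. A minimal sketch under that reading, with made-up sample rows and a hypothetical `build_junction` helper:

```python
import pandas as pd

# Hypothetical slice of shipper_clean; real rows come from the cleaning step.
shipper_clean_sample = pd.DataFrame({
    'identifier': [100, 101, 101, 102],
    'shipper_id': [1, 2, 2, 1],
    'shipper_party_name': ['ACME LTD', 'BOLT CO', 'BOLT CO', 'ACME LTD'],
})


def build_junction(df, entity_id_col):
    """Return the unique (identifier, entity id) pairs linking shipments to entities."""
    pairs = df[['identifier', entity_id_col]].drop_duplicates()
    return pairs.reset_index(drop=True)


shipper_shipment = build_junction(shipper_clean_sample, 'shipper_id')
```

The same helper would apply to the consignee table by passing `'consignee_id'` as the id column.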

## Silver Layer Creation

- Remove identifier from consignee and shipper, change index to IDs

- Change index on header to identifier

- Upload shipper, header, and consignee as CSV for SQL batch loading

- Write SQL DDL table creation file
- Write SQL DML insert file
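
As a starting point for the DDL step, the sketch below creates a possible set of tables in an in-memory SQLite database. The ERD is not reproduced in this text, so every table name, column name, and key here is an assumption drawn from the columns listed earlier, and SQLite is used only to keep the example runnable; the target database may differ.

```python
import sqlite3

# Assumed schema; adjust names and types to match the ERD.
DDL = """
CREATE TABLE country (
    id TEXT PRIMARY KEY,          -- alpha-2 code
    name TEXT NOT NULL,
    region TEXT,
    sub_region TEXT
);
CREATE TABLE shipper (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    country_id TEXT REFERENCES country(id)
);
CREATE TABLE header (
    identifier INTEGER PRIMARY KEY,
    carrier_code TEXT,
    vessel_country_code TEXT REFERENCES country(id),
    vessel_name TEXT,
    estimated_arrival_date TEXT,
    actual_arrival_date TEXT
);
CREATE TABLE shipper_shipment (
    identifier INTEGER REFERENCES header(identifier),
    shipper_id INTEGER REFERENCES shipper(id),
    PRIMARY KEY (identifier, shipper_id)
);
"""

conn = sqlite3.connect(':memory:')
conn.executescript(DDL)

# List the created tables to confirm that the script ran.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
conn.close()
```

A consignee table and its junction table would presumably mirror the shipper pair.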