In [250]:
import pandas as pd
import datetime as dt
import pyarrow as pa
import pyarrow.parquet as pq

### Making Phase One Product List

The first step reads in the Excel file. I took the official document and used Adobe to extract Annex 6-1, which contains the product lists from which purchases should be made. I then converted the extracted .pdf file to an Excel file. Some simple cleaning remains. The end product has (i) an HS code, (ii) a description, (iii) a low category, and (iv) a high category. The high category is the category for which purchase commitments are made.

In [251]:
file_path = "./data/annex-6-1.xlsx"

df = pd.read_excel(file_path, skiprows = 1, header = None, dtype = {1: str})

df.columns = ["catagory", "hs4", "description"]

#### Some basic cleaning.

The first steps create the low and high product categories and separate them into their own columns. The remaining steps drop headings and similar rows. The only issues are that some HS codes are garbled (e.g. ones read as lowercase Ls), one code has a stray space, and one cell holds two six-digit codes. I replace these manually. Then we are done.
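To make the forward-fill idea concrete, here is a minimal sketch on a toy frame. The rows and values are hypothetical and only mimic the layout described above: a high-category heading row has neither an HS code nor a description, a low-category heading row carries its name in the `hs4` column, and product rows have both fields.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the annex itself.
toy = pd.DataFrame(
    {
        "catagory": ["1. Manufactured Goods", None, None, None],
        "hs4": [None, "Vehicles", "8703", "8704"],
        "description": [None, None, "Motor cars", "Goods vehicles"],
    }
)

# Push the high-level heading down onto every row beneath it.
toy["catagory"] = toy["catagory"].ffill()

# Heading rows are the ones with no description.
is_heading = toy["description"].isnull()

# Low category: the text sitting in hs4 on heading rows, pushed down.
toy["low_catagory"] = toy["hs4"].where(is_heading).ffill()

# High category: the heading on rows that have neither hs4 nor description.
is_top_heading = is_heading & toy["hs4"].isnull()
toy["high_catagory"] = toy["catagory"].where(is_top_heading).ffill()

# Keep only the real product rows.
products = toy.dropna(subset=["hs4", "description"])
print(products[["hs4", "low_catagory", "high_catagory"]])
```

Using `Series.where(...).ffill()` is one way to express the step; assigning a filtered column and then forward-filling, as in the cell that follows, gives the same result on this toy data.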

In [252]:
df["catagory"] = df["catagory"].ffill()
# Forward-fill the catagory column so every row below a heading carries that heading.

df["low_catagory"] = df[df.description.isnull()].hs4

df["high_catagory"] = df[df.description.isnull() & df.hs4.isnull()].catagory

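# Assign the result back rather than calling ffill(inplace = True) on a column
# selection, which newer pandas versions warn about and may silently ignore.
df["low_catagory"] = df["low_catagory"].ffill()

df["high_catagory"] = df["high_catagory"].ffill()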

In [253]:
df.dropna(axis = 0, subset = ["hs4"], inplace = True)

df.dropna(axis = 0, subset = ["description"], inplace = True)

replace_list = ['85ll', '72ll', '220 1', '290543/\n290544']

df.replace(to_replace = replace_list, value = ["8511", "7211", "2201", "2905"], inplace = True)

# Helper column: NaN wherever hs4 is not a number (i.e. a leftover heading row).
df["foo"] = pd.to_numeric(df.hs4, errors = "coerce")

In [254]:
df.dropna(axis = 0, subset = ["foo"], inplace = True)

df.drop(axis = 1, labels = ["foo", "catagory"], inplace = True)
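Before dropping the rows that fail numeric conversion, it can help to look at them, since that is how garbled codes such as `85ll` show up. The sketch below uses a hypothetical sample of raw codes; it is an illustration of the check, not the notebook's actual data.

```python
import pandas as pd

# Hypothetical sample of raw codes, including OCR-style errors like those fixed above.
codes = pd.Series(["8511", "85ll", "220 1", "7211", "Vehicles"])

# Entries that cannot be parsed as numbers become NaN under errors="coerce".
as_number = pd.to_numeric(codes, errors="coerce")

# These are the entries to inspect: either headings to drop or codes to repair by hand.
suspect = codes[as_number.isna()]
print(suspect.tolist())
```

Anything in `suspect` that looks like a damaged code rather than a heading is a candidate for the manual replacement list.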

The final issue is that the leading "0" on some hs4 codes is not read in. Excel displays it, but it is lost on import. This fixes those entries.

In [255]:
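def add_zero(x):
    # Restore a single dropped leading zero, e.g. "101" -> "0101".
    if len(x) < 4:
        x = "0" + x

    if len(x) > 4:
        # Some codes are more detailed than the 4-digit level; not sure what
        # to do with these, so just truncate to 4 digits and aggregate.
        x = x[0:4]

    return x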

In [256]:
df.hs4 = df.hs4.apply(add_zero)
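As an alternative sketch, the same padding and truncation can be written with pandas string methods. Note one difference: `str.zfill(4)` pads to the full width, whereas `add_zero` adds only a single zero, so the two differ for codes shorter than three characters. The sample codes here are hypothetical.

```python
import pandas as pd

# Hypothetical codes: one missing its leading zero, one fine, one too detailed.
codes = pd.Series(["101", "8703", "290543"])

# Left-pad with zeros to four characters, then keep only the first four.
normalized = codes.str.zfill(4).str[:4]
print(normalized.tolist())
```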

#### And we are done

Check out the end product. **To Do** make a .csv file for posting.

In [273]:
grp = df.groupby("hs4")

outdf = grp.agg({"description": "first", "low_catagory": "first", "high_catagory": "first"})
# Because of the truncation above, this collapses rows so that each hs4 code is unique.
# The description may be off, but the low and high category will be right.

outdf.reset_index(inplace = True)
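The collapse relies on every row that shares an hs4 code also sharing its categories. A minimal sketch of how that could be checked, on a hypothetical toy frame with placeholder labels:

```python
import pandas as pd

# Hypothetical rows: two truncated codes collapse onto "2905".
toy = pd.DataFrame(
    {
        "hs4": ["2905", "2905", "8703"],
        "description": ["Description A", "Description B", "Description C"],
        "low_catagory": ["low X", "low X", "low Y"],
        "high_catagory": ["high 1", "high 1", "high 1"],
    }
)

# Count distinct high categories per hs4; more than one would mean a conflict.
n_high = toy.groupby("hs4")["high_catagory"].nunique()
conflicts = n_high[n_high > 1]

# Keep the first row's values for each hs4 code.
collapsed = toy.groupby("hs4", as_index=False).first()
print(collapsed)
```

If `conflicts` is empty, taking the first category per code loses nothing; only the description reflects whichever row happened to come first.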

In [275]:
# Forward slashes work on every platform and match the input path used above.
out_file = "./data/annex-6-1.parquet"

pq.write_table(pa.Table.from_pandas(outdf), out_file)

outdf.to_csv("./data/annex-6-1.csv", index = False)
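One caveat for anyone reading the .csv back: the leading-zero problem noted earlier returns unless the column is read as a string. The sketch below shows this with an in-memory buffer and hypothetical codes, so no files are touched.

```python
import io

import pandas as pd

# Hypothetical two-row frame with one zero-padded code.
out = pd.DataFrame({"hs4": ["0101", "8703"]})

# Write to an in-memory buffer instead of disk.
buffer = io.StringIO()
out.to_csv(buffer, index=False)

# Default read: pandas infers integers and the leading zero is lost.
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Reading hs4 as a string keeps the code intact.
buffer.seek(0)
string_read = pd.read_csv(buffer, dtype={"hs4": str})

print(default_read["hs4"].tolist())
print(string_read["hs4"].tolist())
```

The parquet file stores the column type alongside the data, so it does not need this extra argument.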

In [266]:
foo.tail(20)

| Unnamed: 0 | hs4 | description | low_catagory | high_catagory |
|---|---|---|---|---|
| 511 | 8541 | Diodes, transistors and similar semiconductor ... | Electrical equipment and mac.hioery | 1. Manufactured Goods |
| 512 | 8542 | Electronic integrated circuits; parts thereof | Other manufactured goods | 1. Manufactured Goods |
| 513 | 8543 | Electrical machines and apparatu, s having in... | Electrical equipment and mac.hioery | 1. Manufactured Goods |
| 514 | 8544 | Insulated (including enameled or anodized) wi,... | Electrical equipment and mac.hioery | 1. Manufactured Goods |
| 515 | 8545 | Carbon electrodes, carbon brushes, lamp carbo,... | Electrical equipment and mac.hioery | 1. Manufactured Goods |
| 516 | 8546 | Electrical insulators of any material | Electrical equipment and mac.hioery | 1. Manufactured Goods |
| 517 | 8547 | Insulating fittings for electrical machines, a... | Electrical equipment and mac.hioery | 1. Manufactured Goods |
| 518 | 8548 | Waste and scrap of primary cells, primary batt... | Electrical equipment and mac.hioery | 1. Manufactured Goods |
| 519 | 8703 | Motor carsand other motor vehicles principally... | Vehicles | 1. Manufactured Goods |
| 520 | 8704 | Motor vehicles for the transport of goods | Vehicles | 1. Manufactured Goods |
