# Normalization Exercise

In [123]:
import pandas as pd

# load data
df = pd.read_csv('dataset.csv')

df.head()

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,...,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
0,1,CA-2016-152156,11/8/2016,11/11/2016,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,...,42420,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96,2,0.0,41.9136
1,2,CA-2016-152156,11/8/2016,11/11/2016,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,...,42420,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94,3,0.0,219.582
2,3,CA-2016-138688,6/12/2016,6/16/2016,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,...,90036,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.62,2,0.0,6.8714
3,4,US-2015-108966,10/11/2015,10/18/2015,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,...,33311,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775,5,0.45,-383.031
4,5,US-2015-108966,10/11/2015,10/18/2015,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,...,33311,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.368,2,0.2,2.5164


In [124]:
# drop unused columns (only for this example)
df.drop(df.columns[[1,2,5,6,7,8,9,10,11,12,13,14,15]], axis=1, inplace=True)

In [125]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9994 entries, 0 to 9993
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Row ID        9994 non-null   int64  
 1   Ship Date     9994 non-null   object 
 2   Ship Mode     9994 non-null   object 
 3   Product Name  9994 non-null   object 
 4   Sales         9994 non-null   float64
 5   Quantity      9994 non-null   int64  
 6   Discount      9994 non-null   float64
 7   Profit        9994 non-null   float64
dtypes: float64(3), int64(2), object(3)
memory usage: 624.8+ KB


- Every column has 9,994 non-null entries, which matches the number of rows, so `df` has no missing values.
- Next, we look for repeated values by listing the unique values of some columns. In this example we inspect only two: `Ship Mode` and `Product Name`.
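
The two checks described above can also be made explicit instead of being read off `df.info()`. The following is a minimal sketch on a small made-up frame (`toy` and its values are hypothetical, not taken from `dataset.csv`). It counts missing values per column and compares the number of distinct values with the number of rows.

```python
import pandas as pd

# Hypothetical stand-in for df; the values are made up for illustration.
toy = pd.DataFrame({
    "Ship Mode": ["Second Class", "Second Class", "Standard Class", "First Class"],
    "Product Name": ["Bookcase", "Chair", "Bookcase", "Table"],
    "Sales": [261.96, 731.94, 14.62, 957.58],
})

# Missing values per column; all zeros means nothing is missing.
missing = toy.isna().sum()
print(missing)

# Distinct values per column; a count below len(toy) means values repeat,
# which is what a lookup table can factor out.
distinct = toy[["Ship Mode", "Product Name"]].nunique()
print(distinct)
```

Applied to the real `df`, the same two calls would be `df.isna().sum()` and `df[['Ship Mode', 'Product Name']].nunique()`.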

In [126]:
# unique values ship mode
print('ship mode:', df['Ship Mode'].unique(), sep='\n')

# unique values product name
print('product name:', df['Product Name'].unique(), sep='\n')

ship mode:
['Second Class' 'Standard Class' 'First Class' 'Same Day']
product name:
['Bush Somerset Collection Bookcase'
 'Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back'
 'Self-Adhesive Address Labels for Typewriters by Universal' ...
 'Eureka Hand Vacuum, Bagless' 'LG G2'
 'Eldon Jumbo ProFile Portable File Boxes Graphite/Black']


- Both columns hold values that repeat across the 9,994 rows: `Ship Mode` has only four distinct values, and `Product Name` also has far fewer distinct values than there are rows.
- We can use this as the basis for normalization, which will give us three tables: `Shipments`, `Products`, and `Orders` (the main table).
- Next, we work through the normalization process, from 1NF to 3NF.

## Table Shipments

In [131]:
# create a new DataFrame
dfShipment = pd.DataFrame()

# create the Id column: one sequential id (starting at 1) per unique value of df['Ship Mode']
dfShipment['Id'] = [idx+1 for idx, value in enumerate(df['Ship Mode'].unique())]

# create the Mode column from the unique values of df['Ship Mode']
dfShipment['Mode'] = df['Ship Mode'].unique()

dfShipment

Unnamed: 0,Id,Mode
0,1,Second Class
1,2,Standard Class
2,3,First Class
3,4,Same Day


## Table Products

In [133]:
# create a new DataFrame
dfProducts = pd.DataFrame()

# create the Id column: one sequential id (starting at 1) per unique value of df['Product Name']
dfProducts['Id'] = [idx+1 for idx, value in enumerate(df['Product Name'].unique())]

# create the Name column from the unique values of df['Product Name']
dfProducts['Name'] = df['Product Name'].unique()

dfProducts

Unnamed: 0,Id,Name
0,1,Bush Somerset Collection Bookcase
1,2,"Hon Deluxe Fabric Upholstered Stacking Chairs,..."
2,3,Self-Adhesive Address Labels for Typewriters b...
3,4,Bretford CR4500 Series Slim Rectangular Table
4,5,Eldon Fold 'N Roll Cart System
...,...,...
1845,1846,RCA ViSYS 25425RE1 Corded phone
1846,1847,Cisco 8961 IP Phone Charcoal
1847,1848,"Eureka Hand Vacuum, Bagless"
1848,1849,LG G2
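
Both lookup tables above are built the same way, so the pattern could be factored into a small helper. This is only a sketch: `build_lookup` is a hypothetical name that does not appear in the notebook, and the sample series is made up. It relies on `Series.unique()` returning values in order of first appearance, which is what keeps the ids aligned with the names.

```python
import pandas as pd

def build_lookup(values: pd.Series, name_col: str) -> pd.DataFrame:
    """Return a lookup table with a 1-based Id and one row per distinct value."""
    # unique() keeps the order in which values first appear
    distinct = values.unique()
    return pd.DataFrame({
        "Id": list(range(1, len(distinct) + 1)),
        name_col: distinct,
    })

# Hypothetical sample standing in for df['Ship Mode']
modes = pd.Series(["Second Class", "Second Class", "Standard Class", "First Class"])
lookup = build_lookup(modes, "Mode")
print(lookup)
```

With the real data, `build_lookup(df['Ship Mode'], 'Mode')` and `build_lookup(df['Product Name'], 'Name')` should produce the same tables as the two cells above.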


## Table Orders

In [135]:
# create a new DataFrame as a copy of the existing df
dfOrders = df.copy()

# rename the Row ID column to Id
dfOrders.rename(columns={"Row ID": "Id"}, inplace=True)

# rename the Order ID column to Order Code
# (Order ID was dropped in cell 124, so this is a no-op here; rename ignores missing labels by default)
dfOrders.rename(columns={"Order ID": "Order Code"}, inplace=True)

# replace each Ship Mode value with the matching Id from the Shipments table
shipModes = []
for value in dfOrders['Ship Mode']:
    modeId = dfShipment[dfShipment['Mode'] == value]['Id'].iloc[0] # take the Id of the first matching row
    shipModes.append(modeId)

dfOrders['Ship Mode'] = shipModes

# rename the Ship Mode column to Ship Id
dfOrders.rename(columns={"Ship Mode": "Ship Id"}, inplace=True)

# replace each Product Name value with the matching Id from the Products table
productIds = []
for value in dfOrders['Product Name']:
    productId = dfProducts[dfProducts['Name'] == value]['Id'].iloc[0] # take the Id of the first matching row
    productIds.append(productId)

dfOrders['Product Name'] = productIds

# rename the Product Name column to Product Id
dfOrders.rename(columns={"Product Name": "Product Id"}, inplace=True)

dfOrders

Unnamed: 0,Id,Ship Date,Ship Id,Product Id,Sales,Quantity,Discount,Profit
0,1,11/11/2016,1,1,261.9600,2,0.00,41.9136
1,2,11/11/2016,1,2,731.9400,3,0.00,219.5820
2,3,6/16/2016,1,3,14.6200,2,0.00,6.8714
3,4,10/18/2015,2,4,957.5775,5,0.45,-383.0310
4,5,10/18/2015,2,5,22.3680,2,0.20,2.5164
...,...,...,...,...,...,...,...,...
9989,9990,1/23/2014,1,1159,25.2480,3,0.20,4.1028
9990,9991,3/3/2017,2,1752,91.9600,2,0.00,15.6332
9991,9992,3/3/2017,2,298,258.5760,2,0.20,19.3932
9992,9993,3/3/2017,2,950,29.6000,4,0.00,13.3200
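
The row-by-row loops above filter the lookup table once per order, which gets slow as the data grows. A vectorized alternative is `Series.map` with a name-to-id mapping. The sketch below uses small made-up frames (`orders`, `shipments`, and `mode_to_id` are hypothetical names) and then joins the tables back together to check that no information was lost.

```python
import pandas as pd

# Hypothetical miniature versions of the Orders and Shipments tables
orders = pd.DataFrame({
    "Id": [1, 2, 3],
    "Ship Mode": ["Second Class", "Standard Class", "Second Class"],
})
shipments = pd.DataFrame({
    "Id": [1, 2],
    "Mode": ["Second Class", "Standard Class"],
})

# Build a Mode -> Id mapping: a Series indexed by Mode whose values are the ids
mode_to_id = shipments.set_index("Mode")["Id"]

# Replace every Ship Mode with its id in one vectorized step
orders["Ship Id"] = orders["Ship Mode"].map(mode_to_id)
orders = orders.drop(columns="Ship Mode")

# Round trip: joining on the foreign key recovers the original mode names
restored = orders.merge(shipments.rename(columns={"Id": "Ship Id"}), on="Ship Id")
print(restored.sort_values("Id"))
```

The same pattern would apply to `Product Name` with a mapping built from `dfProducts`. A value missing from the lookup table would map to `NaN` rather than raise, so a quick `isna()` check after mapping is a reasonable safeguard.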


- We have now normalized the data into the `Orders`, `Products`, and `Shipments` tables.
- The next step is to write a SQL script so we can migrate them into PostgreSQL.
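
One possible shape for that script is sketched below. This is not the notebook's own migration: the column types and constraints in `SHIPMENTS_DDL` are assumptions, and `sql_literal` and `insert_statements` are hypothetical helpers. The sketch turns a lookup table into `INSERT` statements and doubles single quotes, which matters for product names such as `Eldon Fold 'N Roll Cart System`.

```python
import pandas as pd

# Assumed schema for the Shipments table; the types are a guess, not from the source.
SHIPMENTS_DDL = (
    'CREATE TABLE "Shipments" ('
    '"Id" INTEGER PRIMARY KEY, '
    '"Mode" TEXT NOT NULL UNIQUE);'
)

def sql_literal(value) -> str:
    """Render a value as a SQL literal; strings are quoted with ' doubled."""
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return str(value)

def insert_statements(table: str, frame: pd.DataFrame) -> list:
    """Build one INSERT statement per row of the frame."""
    columns = ", ".join(f'"{c}"' for c in frame.columns)
    statements = []
    for row in frame.itertuples(index=False, name=None):
        values = ", ".join(sql_literal(v) for v in row)
        statements.append(f'INSERT INTO "{table}" ({columns}) VALUES ({values});')
    return statements

# Two product rows as they appear in the Products table
products = pd.DataFrame({
    "Id": [1, 5],
    "Name": ["Bush Somerset Collection Bookcase", "Eldon Fold 'N Roll Cart System"],
})

stmts = insert_statements("Products", products)
print(SHIPMENTS_DDL)
print("\n".join(stmts))
```

Generating literal SQL text is convenient for a one-off script file. When loading directly from Python, parameterized queries through a database driver are generally the safer choice.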