# Import libraries 

Import the libraries you will use in your notebook by running:

```python
# It is common to give a library an alias (a shortened name) to make it easier to reference in your code.
import pandas as pd
```

If a library is not installed on your computer, you will see a `ModuleNotFoundError`. In that case, install it by running:

```
pip install pandas 
```

In [1]:
import pandas as pd 

# Ingesting data

Ingest the following datasets: 

1. Customers dataset: `customers.csv`
2. Orders dataset: `orders.csv` 
3. Order items dataset: `order_items.csv` 


In [12]:
# ingest customers.csv and display the first 5 rows
customers_df = pd.read_csv("../resources/customers.csv")
customers_df.head()

Unnamed: 0,Customer ID,Customer Name,Segment,City,State,Country,Postal Code,Market,Region
0,AA-10315,Alex Avila,Consumer,Round Rock,Texas,United States,78664.0,US,Central
1,AA-10375,Allen Armold,Consumer,Santa Catarina,Nuevo León,Mexico,30318.0,LATAM,North
2,AA-10480,Andrew Allen,Consumer,Bangkok,Bangkok,Thailand,48234.0,APAC,Southeast Asia
3,AA-10645,Anna Andreadi,Consumer,Jakarta,Jakarta,Indonesia,94109.0,APAC,Southeast Asia
4,AA-315,Alex Avila,Consumer,Hrodna,Hrodna,Belarus,,EMEA,EMEA


In [24]:
# ingest orders.csv and display the first 5 rows 
orders_df = pd.read_csv("../resources/orders.csv")
orders_df.head()

Unnamed: 0,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Order Priority
0,AE-2011-9160,03-10-2011,07-10-2011,Standard Class,PO-8865,Medium
1,AE-2013-1130,14-10-2013,14-10-2013,Same Day,EB-4110,High
2,AE-2013-1530,31-12-2013,03-01-2014,Second Class,MY-7380,High
3,AE-2014-2840,05-11-2014,08-11-2014,First Class,PG-8820,Critical
4,AE-2014-3830,13-12-2014,19-12-2014,Standard Class,GH-4665,Medium
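
The `Order Date` and `Ship Date` values in the output above look like day-first strings (for example `14-10-2013`), and `pd.read_csv` leaves such columns as text unless told otherwise. Below is a minimal sketch of converting them to datetimes, assuming every value follows the `DD-MM-YYYY` pattern seen in these first rows. The `sample_orders_df` DataFrame and the `Days to Ship` column are hypothetical and used for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for orders_df, built from three rows of the output above.
sample_orders_df = pd.DataFrame({
    "Order ID": ["AE-2011-9160", "AE-2013-1130", "AE-2013-1530"],
    "Order Date": ["03-10-2011", "14-10-2013", "31-12-2013"],
    "Ship Date": ["07-10-2011", "14-10-2013", "03-01-2014"],
})

# Assumption: every value follows the day-month-year pattern.
for column in ["Order Date", "Ship Date"]:
    sample_orders_df[column] = pd.to_datetime(sample_orders_df[column], format="%d-%m-%Y")

# With real datetimes, date arithmetic works, e.g. days between order and shipment.
sample_orders_df["Days to Ship"] = (
    sample_orders_df["Ship Date"] - sample_orders_df["Order Date"]
).dt.days

print(sample_orders_df)
```

Passing an explicit `format` avoids pandas guessing month-first for ambiguous values such as `03-10-2011`.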


In [8]:
# ingest order_items.csv and display the first 5 rows 
order_items_df = pd.read_csv("../resources/order_items.csv")
order_items_df.head()

Unnamed: 0,Order Item ID,Order ID,Order Item Number,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit,Shipping Cost
0,AE-2011-9160-1,AE-2011-9160,1,OFF-FEL-10001405,Office Supplies,Storage,"Fellowes File Cart, Industrial",82.674,2,0.7,-157.086,5.69
1,AE-2011-9160-2,AE-2011-9160,2,TEC-EPS-10004171,Technology,Machines,"Epson Calculator, Red",78.408,6,0.7,-88.992,3.87
2,AE-2013-1130-1,AE-2013-1130,1,FUR-BUS-10003055,Furniture,Bookcases,"Bush Stackable Bookrack, Pine",224.748,6,0.7,-232.272,60.08
3,AE-2013-1130-2,AE-2013-1130,2,OFF-ACC-10004278,Office Supplies,Fasteners,"Accos Paper Clips, Bulk Pack",4.248,1,0.7,-4.692,0.1
4,AE-2013-1530-1,AE-2013-1530,1,OFF-STI-10000114,Office Supplies,Supplies,"Stiletto Letter Opener, High Speed",16.668,2,0.7,-29.472,1.41


# Transforming data 

Once we have ingested the raw data, the next step is to enrich it so that **Data Analysts** can use it to answer business questions such as:
- "What are my most popular products?"
- "Which countries generate the highest revenue?"



## Step 1: Merging DataFrames

Refer to the diagram below to understand how the DataFrames relate to one another.

<img src="../resources/database-schema.png" alt="database-schema.png" style="width:600px;"/>

Use the following pandas function to merge two DataFrames:

```python
pd.merge(left=left_df, right=right_df, on="<a shared column>", how="<choose one of: inner, left, right, outer >")
```
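
To make the `how` options concrete, here is a small sketch on toy data. The DataFrames and their values are hypothetical and are not taken from the datasets in this notebook. It counts the rows each join type returns when one customer has no orders and one order refers to an unknown customer.

```python
import pandas as pd

# Hypothetical toy data: customer C3 has no orders,
# and order O4 refers to an unknown customer C9.
left_df = pd.DataFrame({
    "Customer ID": ["C1", "C2", "C3"],
    "Customer Name": ["Ann", "Bob", "Cy"],
})
right_df = pd.DataFrame({
    "Order ID": ["O1", "O2", "O3", "O4"],
    "Customer ID": ["C1", "C1", "C2", "C9"],
})

row_counts = {}
for how in ["inner", "left", "right", "outer"]:
    result = pd.merge(left=left_df, right=right_df, on="Customer ID", how=how)
    row_counts[how] = len(result)

print(row_counts)  # {'inner': 3, 'left': 4, 'right': 4, 'outer': 5}
```

An `inner` merge keeps only keys present on both sides, `left` and `right` keep every row from the named side, and `outer` keeps everything, filling the missing values with `NaN`.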


In [26]:
# merge customers with orders and store the result in a new DataFrame
customer_orders_df = pd.merge(left=customers_df, right=orders_df, on="Customer ID", how="inner")
customer_orders_df.head()

Unnamed: 0,Customer ID,Customer Name,Segment,City,State,Country,Postal Code,Market,Region,Order ID,Order Date,Ship Date,Ship Mode,Order Priority
0,AA-10315,Alex Avila,Consumer,Round Rock,Texas,United States,78664.0,US,Central,CA-2011-128055,31-03-2011,05-04-2011,Standard Class,Medium
1,AA-10315,Alex Avila,Consumer,Round Rock,Texas,United States,78664.0,US,Central,CA-2011-138100,15-09-2011,20-09-2011,Standard Class,Medium
2,AA-10315,Alex Avila,Consumer,Round Rock,Texas,United States,78664.0,US,Central,CA-2012-121391,04-10-2012,07-10-2012,First Class,Critical
3,AA-10315,Alex Avila,Consumer,Round Rock,Texas,United States,78664.0,US,Central,CA-2013-103982,04-03-2013,09-03-2013,Standard Class,Medium
4,AA-10315,Alex Avila,Consumer,Round Rock,Texas,United States,78664.0,US,Central,CA-2014-147039,30-06-2014,05-07-2014,Standard Class,Medium


In [28]:
# merge customer_orders with order_items and store the result in a new DataFrame
merged_df = pd.merge(left=customer_orders_df, right=order_items_df, on="Order ID", how="inner")
merged_df.head()

Unnamed: 0,Customer ID,Customer Name,Segment,City,State,Country,Postal Code,Market,Region,Order ID,...,Order Item Number,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit,Shipping Cost
0,AA-10315,Alex Avila,Consumer,Round Rock,Texas,United States,78664.0,US,Central,CA-2011-128055,...,1,OFF-AP-10002765,Office Supplies,Appliances,Fellowes Advanced Computer Series Surge Protec...,52.98,2,0.0,14.8344,3.17
1,AA-10315,Alex Avila,Consumer,Round Rock,Texas,United States,78664.0,US,Central,CA-2011-128055,...,2,OFF-BI-10004390,Office Supplies,Binders,GBC DocuBind 200 Manual Binding Machine,673.568,2,0.2,252.588,54.96
2,AA-10315,Alex Avila,Consumer,Round Rock,Texas,United States,78664.0,US,Central,CA-2011-138100,...,1,FUR-FU-10002456,Furniture,Furnishings,"Master Caster Door Stop, Large Neon Orange",14.56,2,0.0,6.2608,1.31
3,AA-10315,Alex Avila,Consumer,Round Rock,Texas,United States,78664.0,US,Central,CA-2011-138100,...,2,OFF-PA-10000349,Office Supplies,Paper,Staples,14.94,3,0.0,7.0218,0.99
4,AA-10315,Alex Avila,Consumer,Round Rock,Texas,United States,78664.0,US,Central,CA-2012-121391,...,1,OFF-ST-10001590,Office Supplies,Storage,Tenex Personal Project File with Scoop Front D...,26.96,2,0.0,7.0096,5.23
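
Because one order can contain several order items, a merge on `Order ID` is expected to be one-to-many. The sketch below shows one way to have pandas check that expectation with the `validate` parameter of `pd.merge`. The toy DataFrames are hypothetical, and whether `Order ID` is actually unique in `customer_orders_df` is not confirmed here, so treat this as an optional sanity check rather than a required step.

```python
import pandas as pd

# Hypothetical toy data: each order appears once on the left
# and may have several items on the right.
toy_orders_df = pd.DataFrame({
    "Order ID": ["O1", "O2"],
    "Ship Mode": ["Same Day", "First Class"],
})
toy_items_df = pd.DataFrame({
    "Order ID": ["O1", "O1", "O2"],
    "Order Item Number": [1, 2, 1],
})

# validate="one_to_many" raises pandas.errors.MergeError
# if "Order ID" is duplicated on the left side.
checked_df = pd.merge(
    left=toy_orders_df,
    right=toy_items_df,
    on="Order ID",
    how="inner",
    validate="one_to_many",
)

# Every item matched exactly one order, so the row counts are equal here.
print(len(checked_df), len(toy_items_df))
```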


## Step 2: Add new columns 

Add the following **new** columns to the DataFrame: 
- `Unit Price = Sales / Quantity` 
- `Total Sales = Sales + Shipping Cost`  
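
As a quick illustration of the two formulas, the sketch below applies them to a small hypothetical DataFrame named `sample_df`, using the `Sales`, `Quantity` and `Shipping Cost` values from two of the merged rows shown earlier. Arithmetic between columns is applied element-wise, so each row gets its own result.

```python
import pandas as pd

# Hypothetical stand-in for merged_df with only the columns the formulas need.
sample_df = pd.DataFrame({
    "Sales": [52.98, 14.94],
    "Quantity": [2, 3],
    "Shipping Cost": [3.17, 0.99],
})

# Column arithmetic is applied row by row (element-wise).
sample_df["Unit Price"] = sample_df["Sales"] / sample_df["Quantity"]
sample_df["Total Sales"] = sample_df["Sales"] + sample_df["Shipping Cost"]

print(sample_df.round(2))
```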

In [29]:
# create new column for unit price with the calculation: Sales / Quantity 
merged_df["Unit Price"] = merged_df["Sales"] / merged_df["Quantity"]

# create new column for total sales with the calculation: Sales + Shipping Cost 
merged_df["Total Sales"] = merged_df["Sales"] + merged_df["Shipping Cost"]

merged_df.head()

Unnamed: 0,Customer ID,Customer Name,Segment,City,State,Country,Postal Code,Market,Region,Order ID,...,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit,Shipping Cost,Unit Price,Total Sales
0,AA-10315,Alex Avila,Consumer,Round Rock,Texas,United States,78664.0,US,Central,CA-2011-128055,...,Office Supplies,Appliances,Fellowes Advanced Computer Series Surge Protec...,52.98,2,0.0,14.8344,3.17,26.49,56.15
1,AA-10315,Alex Avila,Consumer,Round Rock,Texas,United States,78664.0,US,Central,CA-2011-128055,...,Office Supplies,Binders,GBC DocuBind 200 Manual Binding Machine,673.568,2,0.2,252.588,54.96,336.784,728.528
2,AA-10315,Alex Avila,Consumer,Round Rock,Texas,United States,78664.0,US,Central,CA-2011-138100,...,Furniture,Furnishings,"Master Caster Door Stop, Large Neon Orange",14.56,2,0.0,6.2608,1.31,7.28,15.87
3,AA-10315,Alex Avila,Consumer,Round Rock,Texas,United States,78664.0,US,Central,CA-2011-138100,...,Office Supplies,Paper,Staples,14.94,3,0.0,7.0218,0.99,4.98,15.93
4,AA-10315,Alex Avila,Consumer,Round Rock,Texas,United States,78664.0,US,Central,CA-2012-121391,...,Office Supplies,Storage,Tenex Personal Project File with Scoop Front D...,26.96,2,0.0,7.0096,5.23,13.48,32.19


# Saving data

After completing your transformation steps, you are ready to save your data so that others can consume the datasets you've produced. 

To save a DataFrame as a CSV file with pandas, run:

```python
df.to_csv("your file path here", index=False)  # index=False leaves the DataFrame index out of the saved file
```
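
To see what `index=False` changes, here is a minimal sketch on a hypothetical two-row DataFrame. It writes the CSV to a string instead of a file, which `to_csv` does when no path is given, and then reads the indexed version back.

```python
import io

import pandas as pd

# Hypothetical example data.
example_df = pd.DataFrame({"Order ID": ["O1", "O2"], "Sales": [52.98, 14.94]})

# to_csv returns the CSV text as a string when no path is given.
with_index = example_df.to_csv()
without_index = example_df.to_csv(index=False)

print(with_index.splitlines()[0])     # ,Order ID,Sales
print(without_index.splitlines()[0])  # Order ID,Sales

# Reading the indexed version back produces an extra "Unnamed: 0" column.
reloaded_df = pd.read_csv(io.StringIO(with_index))
print(list(reloaded_df.columns))  # ['Unnamed: 0', 'Order ID', 'Sales']
```

Saving without the index keeps the file limited to the real data columns, so anyone who loads it later does not have to drop an unnamed column first.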


In [30]:
merged_df.to_csv("final.csv", index=False)