### Introduction: Multiple DataFrames
In order to efficiently store data, we often spread related information across multiple tables.

In script.py, we've loaded in three DataFrames: orders, products, and customers.

In [6]:
import pandas as pd

orders = pd.read_csv('orders1.csv')
products = pd.read_csv('products.csv')
customers = pd.read_csv('customers.csv')

Start by inspecting orders

In [7]:
print(orders)

   order_id  customer_id  product_id  quantity   timestamp
0         1            2           3         1  2017-01-01
1         2            2           2         3  2017-01-01
2         3            3           1         1  2017-01-01
3         4            3           2         2  2017-02-01
4         5            3           3         3  2017-02-01
5         6            1           4         2  2017-03-01
6         7            1           1         1  2017-02-02
7         8            1           4         1  2017-02-02


Now inspect products

In [8]:
print(products)

   product_id         description  price
0           1      thing-a-ma-jig      5
1           2  whatcha-ma-call-it     10
2           3          doo-hickey      7
3           4               gizmo      3


Now inspect customers

In [9]:
print(customers)

   customer_id customer_name        address  phone_number
0            1    John Smith   123 Main St.  212-123-4567
1            2      Jane Doe  456 Park Ave.  949-867-5309
2            3     Joe Schmo   798 Broadway  112-358-1321


Examine the orders and products tables.

What is the description of the product that was ordered in Order 3?

Give your answer as a string assigned to the variable order_3_description.

In [11]:
order_3_description = 'thing-a-ma-jig'

Examine the orders and customers tables.

What is the phone_number of the customer in Order 5?

Give your answer as a string assigned to the variable order_5_phone_number.

In [12]:
order_5_phone_number = '112-358-1321'

### Inner Merge II
It is easy to do this kind of matching for one row, but hard to do it for multiple rows.

Luckily, Pandas can efficiently do this for the entire table. We use the .merge method.

The .merge method looks for columns that are common between two DataFrames and then looks for rows where those columns' values are the same. It then combines the matching rows into a single row in a new table.

We can call the pd.merge function with two tables like this:

In [13]:
# new_df = pd.merge(orders, customers)
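To make the matching concrete, here is a minimal sketch using a few made-up rows (orders_demo and customers_demo are illustrative names, not the contents of the lesson's CSV files). The only column name the two tables share is customer_id, so that is the column pd.merge matches on.

```python
import pandas as pd

# Made-up rows for illustration; not loaded from the lesson's CSV files.
orders_demo = pd.DataFrame({
    'order_id': [1, 2, 3],
    'customer_id': [2, 2, 3],
})
customers_demo = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'customer_name': ['John Smith', 'Jane Doe', 'Joe Schmo'],
})

# customer_id is the only shared column name, so it becomes the merge key.
merged_demo = pd.merge(orders_demo, customers_demo)
print(merged_demo)
```

Each order row picks up the customer_name of the customer with the same customer_id. John Smith has no orders in this made-up data, so he does not appear in the result.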

You are an analyst at Cool T-Shirts Inc. You are going to help them analyze some of their sales data.

There are two DataFrames defined in the file script.py:

    sales contains the monthly revenue for Cool T-Shirts Inc. It has two columns: month and revenue.
    targets contains the goals for monthly revenue for each month. It has two columns: month and target.

Create a new DataFrame sales_vs_targets which contains the merge of sales and targets.

In [15]:
sales = pd.read_csv('sales.csv')
print(sales)
targets = pd.read_csv('targets.csv')
print(targets)

sales_vs_targets = pd.merge(sales, targets)
print(sales_vs_targets)

      month  revenue
0   January      300
1  February      290
2     March      310
3     April      325
4       May      475
5      June      495
      month  target
0   January     310
1  February     270
2     March     300
3     April     350
4       May     475
5      June     500
      month  revenue  target
0   January      300     310
1  February      290     270
2     March      310     300
3     April      325     350
4       May      475     475
5      June      495     500


Cool T-Shirts Inc. wants to know the months when they crushed their targets.

Select the rows from sales_vs_targets where revenue is greater than target. Save these rows to the variable crushing_it.

In [16]:
crushing_it = sales_vs_targets[sales_vs_targets.revenue > sales_vs_targets.target]
print(crushing_it)

      month  revenue  target
1  February      290     270
2     March      310     300


### Inner Merge III
In addition to using pd.merge, each DataFrame has its own merge method. For instance, if you wanted to merge orders with customers, you could use:

In [17]:
# new_df = orders.merge(customers)

This produces the same DataFrame as if we had called pd.merge(orders, customers).

We generally use this when we are joining more than two DataFrames together because we can "chain" the commands. The following command would merge orders to customers, and then the resulting DataFrame to products:

In [19]:
# big_df = orders.merge(customers).merge(products)
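As a rough illustration of chaining, the sketch below uses small made-up tables (the _demo names are assumptions, not the lesson's files) and checks that the chained form gives the same result as nesting two pd.merge calls.

```python
import pandas as pd

# Made-up rows for illustration only.
orders_demo = pd.DataFrame({
    'order_id': [1, 2],
    'customer_id': [2, 3],
    'product_id': [3, 1],
})
customers_demo = pd.DataFrame({
    'customer_id': [2, 3],
    'customer_name': ['Jane Doe', 'Joe Schmo'],
})
products_demo = pd.DataFrame({
    'product_id': [1, 3],
    'description': ['thing-a-ma-jig', 'doo-hickey'],
})

# Chained: merge orders with customers, then merge that result with products.
chained = orders_demo.merge(customers_demo).merge(products_demo)

# Equivalent nested form using the pd.merge function.
nested = pd.merge(pd.merge(orders_demo, customers_demo), products_demo)

print(chained)
print(chained.equals(nested))
```

The chained version reads left to right in the order the merges happen, which is the main reason to prefer it for three or more tables.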

We have some more data from Cool T-Shirts Inc. The number of men's and women's t-shirts sold per month is in a file called men_women_sales.csv. Load this data into a DataFrame called men_women.

In [22]:
sales = pd.read_csv('sales.csv')
targets = pd.read_csv('targets.csv')
men_women = pd.read_csv('men_women_sales.csv')
print(men_women)

      month  men  women
0   January   30     35
1  February   29     35
2     March   31     29
3     April   32     28
4       May   47     50
5      June   49     45


Merge all three DataFrames (sales, targets, and men_women) into one big DataFrame called all_data.

In [23]:
all_data = sales.merge(targets).merge(men_women)
print(all_data)

      month  revenue  target  men  women
0   January      300     310   30     35
1  February      290     270   29     35
2     March      310     300   31     29
3     April      325     350   32     28
4       May      475     475   47     50
5      June      495     500   49     45


Cool T-Shirts Inc. thinks that they have more revenue in months where they sell more women's t-shirts.

Select the rows of all_data where:

    revenue is greater than target
    AND
    women is greater than men

Save your answer to the variable results.

In [24]:
results = all_data[(all_data.revenue > all_data.target)]
results = results[results.women > results.men]
print(results)

      month  revenue  target  men  women
1  February      290     270   29     35
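The two filtering steps can also be written as a single boolean mask. The sketch below rebuilds all_data by hand from the table printed above (instead of reading the CSV files) and combines both conditions with &. Each condition needs its own parentheses because & binds more tightly than >.

```python
import pandas as pd

# Rebuilt by hand from the all_data table printed above.
all_data = pd.DataFrame({
    'month': ['January', 'February', 'March', 'April', 'May', 'June'],
    'revenue': [300, 290, 310, 325, 475, 495],
    'target': [310, 270, 300, 350, 475, 500],
    'men': [30, 29, 31, 32, 47, 49],
    'women': [35, 35, 29, 28, 50, 45],
})

# Combine both conditions into one mask; & is the element-wise "and".
results = all_data[(all_data.revenue > all_data.target) &
                   (all_data.women > all_data.men)]
print(results)
```

This selects the same single row (February) as filtering in two steps.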


### Merge on Specific Columns
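In the previous example, the merge function "knew" how to combine tables based on the columns that were the same between two tables. For instance, products and orders both had a column called product_id. This won't always be true when we want to perform a merge. Suppose the key column in customers were called id instead of customer_id. One way to handle this is to use rename to give that column the name that orders uses before merging: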

In [25]:
'''
pd.merge(
    orders,
    customers.rename(columns={'id': 'customer_id'}))
'''
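Here is a runnable sketch of the rename approach. It assumes a hypothetical customers_demo table whose key column is called id; the tables loaded earlier in this notebook already use customer_id, so the data below is made up for illustration.

```python
import pandas as pd

# Made-up rows; customers_demo deliberately uses 'id' as its key column.
orders_demo = pd.DataFrame({
    'order_id': [1, 2, 3],
    'customer_id': [2, 2, 3],
})
customers_demo = pd.DataFrame({
    'id': [1, 2, 3],
    'customer_name': ['John Smith', 'Jane Doe', 'Joe Schmo'],
})

# rename returns a new DataFrame, so customers_demo itself is unchanged.
merged_demo = pd.merge(
    orders_demo,
    customers_demo.rename(columns={'id': 'customer_id'}))
print(merged_demo)
print(customers_demo.columns.tolist())
```

Because rename does not modify the original DataFrame, the renamed copy exists only for the duration of the merge.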

Merge orders and products using rename. Save your results to the variable orders_products.

In [26]:
orders_products = pd.merge(
    orders,
    products.rename(columns={'id': 'product_id'}))
print(orders_products)

   order_id  customer_id  product_id  quantity   timestamp  \
0         1            2           3         1  2017-01-01   
1         5            3           3         3  2017-02-01   
2         2            2           2         3  2017-01-01   
3         4            3           2         2  2017-02-01   
4         3            3           1         1  2017-01-01   
5         7            1           1         1  2017-02-02   
6         6            1           4         2  2017-03-01   
7         8            1           4         1  2017-02-02   

          description  price  
0          doo-hickey      7  
1          doo-hickey      7  
2  whatcha-ma-call-it     10  
3  whatcha-ma-call-it     10  
4      thing-a-ma-jig      5  
5      thing-a-ma-jig      5  
6               gizmo      3  
7               gizmo      3  


### Merge on Specific Columns II
In the previous exercise, we learned how to use rename to merge two DataFrames whose columns don't match.

If we don't want to do that, we have another option. We could use the keywords left_on and right_on to specify which columns we want to perform the merge on. In the example below, the "left" table is the one that comes first (orders), and the "right" table is the one that comes second (customers). This syntax says that we should match the customer_id from orders to the id in customers.

In [27]:
'''
pd.merge(
    orders,
    customers,
    left_on='customer_id',
    right_on='id')
'''

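If both tables also have a column with the same name (for example, if orders and customers each have their own id column), the merged table would end up with two columns called id, one from the first table and one from the second. To keep them apart, merge appends the default suffixes _x and _y, giving id_x and id_y.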

We could use the following code to make the suffixes reflect the table names:

In [28]:
'''
pd.merge(
    orders,
    customers,
    left_on='customer_id',
    right_on='id',
    suffixes=['_order', '_customer']
)
'''
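The following sketch shows both behaviors side by side. It assumes made-up tables in which orders_demo and customers_demo each have their own id column, which is not the case for the CSV files loaded earlier in this notebook.

```python
import pandas as pd

# Made-up rows; both tables have a column named 'id'.
orders_demo = pd.DataFrame({
    'id': [1, 2, 3],            # order id
    'customer_id': [2, 2, 3],
})
customers_demo = pd.DataFrame({
    'id': [1, 2, 3],            # customer id
    'customer_name': ['John Smith', 'Jane Doe', 'Joe Schmo'],
})

# Default suffixes: the clashing 'id' columns become id_x and id_y.
default_suffixes = pd.merge(
    orders_demo,
    customers_demo,
    left_on='customer_id',
    right_on='id')
print(default_suffixes.columns.tolist())

# Custom suffixes make it clear which table each 'id' came from.
named_suffixes = pd.merge(
    orders_demo,
    customers_demo,
    left_on='customer_id',
    right_on='id',
    suffixes=['_order', '_customer'])
print(named_suffixes.columns.tolist())
```

Note that customer_id keeps its name in both results, since only column names that appear in both tables receive a suffix.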

Merge orders and products using left_on and right_on. Use the suffixes _orders and _products. Save your results to the variable orders_products.

In [32]:
'''
orders_products = pd.merge(
                          orders,
                          products,
                          left_on='product_id',
                          right_on='id',
                          suffixes=['_orders', '_products'])
print(orders_products)
'''

### Mismatched Merges
In our previous examples, there were always matching values when we were performing our merges. What happens when that isn't true?

Let's imagine that our products table is out of date and is missing the newest product: Product 5. What happens when someone orders it?

We've just released a new product with product_id equal to 5. People are ordering this product, but we haven't updated the products table.

In script.py, you'll find two DataFrames: products and orders. Inspect these DataFrames using print.

Notice that the third order in orders is for the mysterious new product, but that there is no product_id 5 in products.

Merge orders and products and save it to the variable merged_df.

In [33]:
print(orders)
print(products)

   order_id  customer_id  product_id  quantity   timestamp
0         1            2           3         1  2017-01-01
1         2            2           2         3  2017-01-01
2         3            3           1         1  2017-01-01
3         4            3           2         2  2017-02-01
4         5            3           3         3  2017-02-01
5         6            1           4         2  2017-03-01
6         7            1           1         1  2017-02-02
7         8            1           4         1  2017-02-02
   product_id         description  price
0           1      thing-a-ma-jig      5
1           2  whatcha-ma-call-it     10
2           3          doo-hickey      7
3           4               gizmo      3


In [34]:
merged_df = pd.merge(orders, products)
print(merged_df)

   order_id  customer_id  product_id  quantity   timestamp  \
0         1            2           3         1  2017-01-01   
1         5            3           3         3  2017-02-01   
2         2            2           2         3  2017-01-01   
3         4            3           2         2  2017-02-01   
4         3            3           1         1  2017-01-01   
5         7            1           1         1  2017-02-02   
6         6            1           4         2  2017-03-01   
7         8            1           4         1  2017-02-02   

          description  price  
0          doo-hickey      7  
1          doo-hickey      7  
2  whatcha-ma-call-it     10  
3  whatcha-ma-call-it     10  
4      thing-a-ma-jig      5  
5      thing-a-ma-jig      5  
6               gizmo      3  
7               gizmo      3  
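The orders table printed above has no row with product_id 5, so every order finds a match there. To see what the default (inner) merge does with an unmatched row, here is a small sketch with made-up data in which the third order refers to a product that is missing from products_demo. The _demo names and rows are assumptions for illustration.

```python
import pandas as pd

# Made-up rows; order 3 refers to product_id 5, which products_demo lacks.
orders_demo = pd.DataFrame({
    'order_id': [1, 2, 3],
    'product_id': [3, 2, 5],
})
products_demo = pd.DataFrame({
    'product_id': [1, 2, 3, 4],
    'description': ['thing-a-ma-jig', 'whatcha-ma-call-it',
                    'doo-hickey', 'gizmo'],
})

# The default merge keeps only rows whose key appears in both tables.
merged_demo = pd.merge(orders_demo, products_demo)
print(merged_demo)
```

Order 3 is silently dropped from merged_demo, as are the products that nobody ordered.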


### Outer Merge
In the previous exercise, we saw that when we merge two DataFrames whose rows don't match perfectly, we lose the unmatched rows.

This type of merge (where we only include matching rows) is called an inner merge. There are other types of merges that we can use when we want to keep information from the unmatched rows.

There are two hardware stores in town: Store A and Store B. Store A's inventory is in DataFrame store_a and Store B's inventory is in DataFrame store_b. They have decided to merge into one big Super Store!

Combine the inventories of Store A and Store B using an outer merge. Save the results to the variable store_a_b_outer.

In [35]:
store_a = pd.read_csv('store_a.csv')
print(store_a)
store_b = pd.read_csv('store_b.csv')
print(store_b)

store_a_b_outer = pd.merge(store_a, store_b, how='outer')
print(store_a_b_outer)

          item  store_a_inventory
0       hammer                 12
1  screwdriver                 15
2        nails                200
3       screws                350
4          saw                  6
5    duct tape                150
6       wrench                 12
7     pvc pipe                 54
            item  store_b_inventory
0         hammer                  6
1          nails                250
2            saw                  6
3      duct tape                150
4       pvc pipe                 54
5           rake                 10
6         shovel                 15
7  wooden dowels                192
             item  store_a_inventory  store_b_inventory
0          hammer               12.0                6.0
1     screwdriver               15.0                NaN
2           nails              200.0              250.0
3          screws              350.0                NaN
4             saw                6.0                6.0
5       duct tape              150
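One way to see which table each row of an outer merge came from is the indicator=True argument of pd.merge, which adds a _merge column. The sketch below uses tiny made-up inventories (demo_a and demo_b are illustrative names, not the stores' CSV files).

```python
import pandas as pd

# Made-up inventories for illustration only.
demo_a = pd.DataFrame({
    'item': ['hammer', 'screws'],
    'a_inventory': [12, 350],
})
demo_b = pd.DataFrame({
    'item': ['hammer', 'rake'],
    'b_inventory': [6, 10],
})

# how='outer' keeps every item from both tables; gaps are filled with NaN.
# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'.
outer_demo = pd.merge(demo_a, demo_b, how='outer', indicator=True)
print(outer_demo)
print(outer_demo.dtypes)
```

The inventory columns become floats in the merged table because NaN cannot be stored in an integer column, which is why values such as 12 are displayed as 12.0 after an outer merge.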

### Left Merge
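Consider the merge of two companies, Company A and Company B, where each keeps its own customer table: company_a with email addresses and company_b with phone numbers.

Suppose we want to identify which customers are missing phone information. We would want a list of all customers who have an email address but no phone number.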

We could get this by performing a Left Merge. A Left Merge includes all rows from the first (left) table, but only rows from the second (right) table that match the first table.

For this command, the order of the arguments matters. If the first DataFrame is company_a and we do a left join, we'll only end up with rows that appear in company_a.

By listing company_a first, we get all customers from Company A, and only customers from Company B who are also customers of Company A.

In [36]:
# pd.merge(company_a, company_b, how='left')

Let's return to the two hardware stores, Store A and Store B. They're not quite sure if they want to merge into a big Super Store just yet.

Store A wants to find out what products they carry that Store B does not carry. Using a left merge, combine store_a to store_b and save the results to store_a_b_left.

The items with null in store_b_inventory are carried by Store A, but not Store B.

In [37]:
store_a_b_left = pd.merge(store_a, store_b, how='left')

Now, Store B wants to find out what products they carry that Store A does not carry. Use a left join to combine the two DataFrames, but in the reverse order (i.e., store_b followed by store_a), and save the results to the variable store_b_a_left.

Which items are not carried by Store A, but are carried by Store B?

In [38]:
store_b_a_left = pd.merge(store_b, store_a, how='left')

print(store_a_b_left)
print(store_b_a_left)

          item  store_a_inventory  store_b_inventory
0       hammer                 12                6.0
1  screwdriver                 15                NaN
2        nails                200              250.0
3       screws                350                NaN
4          saw                  6                6.0
5    duct tape                150              150.0
6       wrench                 12                NaN
7     pvc pipe                 54               54.0
            item  store_b_inventory  store_a_inventory
0         hammer                  6               12.0
1          nails                250              200.0
2            saw                  6                6.0
3      duct tape                150              150.0
4       pvc pipe                 54               54.0
5           rake                 10                NaN
6         shovel                 15                NaN
7  wooden dowels                192                NaN
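Rather than reading the answer off the printed tables, we can filter for the null values directly. The sketch below rebuilds store_a and store_b by hand from the inventories printed earlier (instead of reading the CSV files) and uses isnull to pick out the unmatched rows.

```python
import pandas as pd

# Rebuilt by hand from the inventories printed above.
store_a = pd.DataFrame({
    'item': ['hammer', 'screwdriver', 'nails', 'screws',
             'saw', 'duct tape', 'wrench', 'pvc pipe'],
    'store_a_inventory': [12, 15, 200, 350, 6, 150, 12, 54],
})
store_b = pd.DataFrame({
    'item': ['hammer', 'nails', 'saw', 'duct tape',
             'pvc pipe', 'rake', 'shovel', 'wooden dowels'],
    'store_b_inventory': [6, 250, 6, 150, 54, 10, 15, 192],
})

store_a_b_left = pd.merge(store_a, store_b, how='left')
store_b_a_left = pd.merge(store_b, store_a, how='left')

# Rows where the right-hand table had no match hold NaN in its column.
only_in_a = store_a_b_left[store_a_b_left.store_b_inventory.isnull()]
only_in_b = store_b_a_left[store_b_a_left.store_a_inventory.isnull()]

print(only_in_a.item.tolist())
print(only_in_b.item.tolist())
```

Store A alone carries the screwdriver, screws, and wrench, while Store B alone carries the rake, shovel, and wooden dowels.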


### Concatenate DataFrames
Sometimes, a dataset is broken into multiple tables. For instance, data is often split into multiple CSV files so that each download is smaller.

When we need to reconstruct a single DataFrame from multiple smaller DataFrames, we can use the function pd.concat([df1, df2, df3, ...]). This works best when all of the DataFrames have the same columns; any column missing from one of them is filled with NaN in the result.

An ice cream parlor and a bakery have decided to merge.

The bakery's menu is stored in the DataFrame bakery, and the ice cream parlor's menu is stored in DataFrame ice_cream.

Create their new menu by concatenating the two DataFrames into a DataFrame called menu.

In [39]:
bakery = pd.read_csv('bakery.csv')
print(bakery)
ice_cream = pd.read_csv('ice_cream.csv')
print(ice_cream)

menu = pd.concat([bakery, ice_cream])
print(menu)

                  item  price
0               cookie   2.50
1              brownie   3.50
2        slice of cake   4.75
3  slice of cheesecake   4.75
4         slice of pie   5.00
                              item  price
0     scoop of chocolate ice cream   3.00
1       scoop of vanilla ice cream   2.95
2    scoop of strawberry ice cream   3.05
3  scoop of cookie dough ice cream   3.25
                              item  price
0                           cookie   2.50
1                          brownie   3.50
2                    slice of cake   4.75
3              slice of cheesecake   4.75
4                     slice of pie   5.00
0     scoop of chocolate ice cream   3.00
1       scoop of vanilla ice cream   2.95
2    scoop of strawberry ice cream   3.05
3  scoop of cookie dough ice cream   3.25
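Notice that the index in the printed menu restarts at 0 for the ice cream rows, because pd.concat keeps each DataFrame's original index by default. If a fresh 0 to 8 index is preferred, one option is the ignore_index=True argument. The sketch below rebuilds the two menus by hand from the tables printed above.

```python
import pandas as pd

# Rebuilt by hand from the menus printed above.
bakery = pd.DataFrame({
    'item': ['cookie', 'brownie', 'slice of cake',
             'slice of cheesecake', 'slice of pie'],
    'price': [2.50, 3.50, 4.75, 4.75, 5.00],
})
ice_cream = pd.DataFrame({
    'item': ['scoop of chocolate ice cream', 'scoop of vanilla ice cream',
             'scoop of strawberry ice cream',
             'scoop of cookie dough ice cream'],
    'price': [3.00, 2.95, 3.05, 3.25],
})

# ignore_index=True discards the original row labels and renumbers from 0.
menu = pd.concat([bakery, ice_cream], ignore_index=True)
print(menu)
```

With a unique index, a lookup such as menu.loc[0] returns a single row instead of one row from each original table.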


### Review
This lesson introduced some methods for combining multiple DataFrames:

    Creating a DataFrame made by matching the common columns of two DataFrames is called a merge
    We can specify which columns should be matched by using the keyword arguments left_on and right_on
    We can combine DataFrames whose rows don't all match using left, right, and outer merges and the how keyword argument
    We can stack or concatenate DataFrames with the same columns using pd.concat

Cool T-Shirts Inc. just created a website for ordering their products. They want you to analyze two datasets for them:

visits contains information on all visits to their landing page
checkouts contains all users who began to checkout on their website
Use print to inspect each DataFrame.

In [40]:
visits = pd.read_csv('visits.csv',
                        parse_dates=[1])
checkouts = pd.read_csv('checkouts.csv',
                        parse_dates=[1])
print(visits)
print(checkouts)

                                 user_id          visit_time
0   319350b4-9951-47ef-b3a7-6b252099905f 2017-02-21 07:16:00
1   7435ec9f-576d-4ebd-8791-361b128fca77 2017-05-16 08:37:00
2   0b061e73-f709-42fa-8d1a-5f68176ff154 2017-04-12 19:32:00
3   9133d6f0-e68b-4c8d-bafd-ff2825e8dafe 2017-08-18 04:32:00
4   08d13edb-071c-4cfb-9ee4-8f377d0e932a 2017-07-08 06:24:00
5   c7192ab9-e033-4b69-971d-4bd92631342e 2017-10-05 09:16:00
6   c4dac0f2-2fa9-48a8-b056-c3b2a5a5c683 2017-07-09 14:19:00
7   f028e9dd-77d0-4002-83f6-372a4837fda6 2017-10-27 08:46:00
8   e43cf28f-7d08-4019-bd66-ddf7dfd2e034 2017-11-12 01:47:00
9   746631d2-35d5-441e-a21b-e5f39442f981 2017-06-19 23:34:00
10  a0fc94a2-4a80-4a33-994b-75783066ac62 2017-05-11 13:07:00
11  e2c24ee0-7fdf-4400-abde-b36378fe5ce6 2017-07-04 15:33:00
12  78751233-c0de-44fb-bc2f-822bd9dd9be7 2017-01-23 05:38:00
13  fbcec4bc-f191-4c0c-870b-d22728ad1b18 2017-01-24 17:41:00
14  e6c7ecb9-4710-4cbd-ad02-c43971ebbe7f 2017-09-27 16:10:00
15  0c682ddd-144a-4743-9

We want to know the amount of time from a user's initial visit to the website to when they start to check out.

Use merge to combine visits and checkouts and save it to the variable v_to_c.

In [41]:
v_to_c = pd.merge(visits, checkouts)

In order to calculate the time between visiting and checking out, define a column of v_to_c called time.

In [42]:
v_to_c['time'] = v_to_c.checkout_time - v_to_c.visit_time

Get the average time to checkout

In [43]:
# Call mean() with parentheses; without them, print shows the bound method
# and the whole column instead of the average.
print(v_to_c.time.mean())
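For reference, here is a minimal sketch of the whole visit-to-checkout calculation on made-up timestamps (the _demo tables and user ids are assumptions, not rows from visits.csv or checkouts.csv). Subtracting two datetime columns gives a timedelta column, and calling mean() on it returns a single Timedelta.

```python
import pandas as pd

# Made-up rows; user 'c' visited but never started to check out.
visits_demo = pd.DataFrame({
    'user_id': ['a', 'b', 'c'],
    'visit_time': pd.to_datetime(['2017-01-01 10:00:00',
                                  '2017-01-02 12:00:00',
                                  '2017-01-03 09:00:00']),
})
checkouts_demo = pd.DataFrame({
    'user_id': ['a', 'b'],
    'checkout_time': pd.to_datetime(['2017-01-01 10:10:00',
                                     '2017-01-02 12:20:00']),
})

# The inner merge on user_id keeps only users who appear in both tables.
v_to_c_demo = pd.merge(visits_demo, checkouts_demo)

# Datetime minus datetime gives a timedelta for each row.
v_to_c_demo['time'] = v_to_c_demo.checkout_time - v_to_c_demo.visit_time

# mean() must be called (with parentheses) to get the average.
average_time = v_to_c_demo['time'].mean()
print(average_time)
```

In this made-up data the two differences are 10 and 20 minutes, so the average is 15 minutes.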
