## Introduction

Cross products (also called Cartesian joins) are an important concept in SQL because they underlie how tables are joined together. However, they are seldom used as-is, for reasons we will investigate in this exercise.

The files used in this exercise were pre-processed from data taken from the Government of Alberta's Open Data Portal and are licensed under the [Open Government License](https://open.alberta.ca/licence).

In [3]:
import pandas as pd

oil_records = pd.read_csv("OilProductionByMunicipality_2018.csv")
oil_records.head()

gas_records = pd.read_csv("NaturalGasProductionByMunicipality_2018.csv")
gas_records.head()

well_records = pd.read_csv("WellCountByMunicipality_2018.csv")
well_records.head()


Unnamed: 0,CSDUID,CSD,Period,IndicatorSummaryDescription,UnitOfMeasure,OriginalValue
0,4805026,Drumheller,2018,Well Count,,5.0
1,4805031,Starland County,2018,Well Count,,17.0
2,4805041,Kneehill County,2018,Well Count,,36.0
3,4802001,Warner County No. 5,2018,Well Count,,11.0
4,4807049,Wainwright No. 61,2018,Well Count,,32.0


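Recall that a cross product between tables $A = (A_1, A_2, ..., A_n)$ and $B = (B_1, B_2, ..., B_m)$, where each $A_i$ and $B_j$ is a row, takes the form

$A \times B = (A_1 B_1, A_1 B_2, ..., A_1 B_m, A_2 B_1, ..., A_n B_m)$

where $A_i B_j$ denotes row $A_i$ concatenated with row $B_j$, so the result has $n \cdot m$ rows.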
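As a small illustration of this definition, the sketch below builds the cross product of two toy tables with `itertools.product`. The tables `A` and `B` and their column names are made up for illustration and are not part of the exercise data.

```python
import itertools

import pandas as pd

# Hypothetical toy tables (not the exercise data): n = 2 rows and m = 3 rows.
A = pd.DataFrame({"a_id": [1, 2], "a_name": ["x", "y"]})
B = pd.DataFrame({"b_id": [10, 20, 30]})

# Pair every row of A with every row of B, in the order A_1 B_1, A_1 B_2, ...
pairs = [
    row_a + row_b  # tuple concatenation: one combined row per pair
    for row_a, row_b in itertools.product(
        A.itertuples(index=False, name=None),
        B.itertuples(index=False, name=None),
    )
]
cross = pd.DataFrame(pairs, columns=list(A.columns) + list(B.columns))

print(len(cross))  # one row per (A_i, B_j) pair, i.e. n * m rows
```

The row count grows multiplicatively with the sizes of the inputs, which is one reason a plain cross product is rarely the final result one wants.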

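Given this, write a naive implementation of the cross product of `oil_records` and `well_records` by looping over the rows of both tables and appending each combined row to a result. Note that [`DataFrame.append()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html) was removed in pandas 2.0, so the solution below collects the rows in a Python list with `list.append()` instead.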

In [4]:
resultTuples = []

for oil_tuple in oil_records.itertuples():
    for well_tuple in well_records.itertuples():
        resultTuples.append(oil_tuple[1:] + well_tuple[1:])
     
resultFrame = pd.DataFrame(resultTuples)

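# Label the columns with the concatenated column names of both inputs.
# (set_axis(..., inplace=True) no longer works in pandas 2.0, so assign directly.)
resultFrame.columns = oil_records.columns.append(well_records.columns)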
resultFrame.describe()

Unnamed: 0,CSDUID,Period,OriginalValue,CSDUID.1,Period.1,UnitOfMeasure,OriginalValue.1
count,6068.0,6068.0,6068.0,6068.0,6068.0,0.0,6068.0
mean,4810379.0,2018.0,1607874.0,4810565.0,2018.0,,64.219512
std,5263.328,0.0,5854682.0,5181.746,0.0,,131.286085
min,4801003.0,2018.0,56.4,4801003.0,2018.0,,0.0
25%,4806014.0,2018.0,77370.6,4806016.0,2018.0,,1.0
50%,4811006.0,2018.0,196453.5,4811033.0,2018.0,,13.5
75%,4814003.0,2018.0,897298.4,4815013.0,2018.0,,41.0
max,4819071.0,2018.0,47438230.0,4819071.0,2018.0,,671.0


pandas was created to provide more efficient ways to handle tabular data. Before `merge()` gained the `how="cross"` option in pandas 1.2, no Cartesian product had been [officially implemented](https://github.com/pandas-dev/pandas/issues/5401), and a frequently mentioned workaround works as follows:

- create a dummy column on both A and B (the column should hold the same single value in every row of A and B)
- use `merge()` to join both tables on the dummy column
- drop the dummy column from the resulting table

Try this now on `gas_records` and `well_records`. 
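For comparison, pandas 1.2 and later accept `how="cross"` in `merge()`, which should produce the same rows without a dummy column. The sketch below runs both approaches on two small made-up frames; `left` and `right` are hypothetical stand-ins, not the exercise tables.

```python
import pandas as pd

# Hypothetical stand-ins for the exercise tables.
left = pd.DataFrame({"CSDUID": [1, 2], "OriginalValue": [5.0, 7.5]})
right = pd.DataFrame({"CSDUID": [1, 3, 4], "OriginalValue": [10.0, 20.0, 30.0]})

# Dummy-column workaround: every row shares the key, so every pair matches.
via_dummy = (
    left.assign(dummy=1)
    .merge(right.assign(dummy=1), on="dummy")
    .drop(columns="dummy")
)

# Built-in cross join (assumes pandas >= 1.2).
via_cross = left.merge(right, how="cross")

# Both results have len(left) * len(right) rows and the same suffixed columns.
print(via_dummy.shape, via_cross.shape)
print(list(via_cross.columns))
```

In both cases the overlapping column names receive the default `_x` and `_y` suffixes.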


In [5]:
# Add a constant key to both tables, merge on it, then drop the helper column.
cartesianFrame = gas_records.assign(dummy = 1).merge(well_records.assign(dummy = 1), on = 'dummy').drop(columns = 'dummy')

cartesianFrame.describe()

Unnamed: 0,CSDUID_x,Period_x,OriginalValue_x,CSDUID_y,Period_y,UnitOfMeasure_y,OriginalValue_y
count,6232.0,6232.0,6232.0,6232.0,6232.0,0.0,6232.0
mean,4810343.0,2018.0,1760190.0,4810565.0,2018.0,,64.219512
std,5288.762,0.0,5228492.0,5181.734,0.0,,131.285801
min,4801003.0,2018.0,57.9,4801003.0,2018.0,,0.0
25%,4806011.0,2018.0,91166.87,4806016.0,2018.0,,1.0
50%,4811006.0,2018.0,319867.9,4811033.0,2018.0,,13.5
75%,4814256.0,2018.0,1002641.0,4815013.0,2018.0,,41.0
max,4819071.0,2018.0,35682380.0,4819071.0,2018.0,,671.0


## merge() and join() methods

Now use the pandas `merge()` and `join()` methods to accomplish the following tasks:

Use `merge()` to combine `gas_records` and `well_records` on the `CSDUID` column. What is the cardinality of the resulting table?

Use `join()` to combine `oil_records` and `well_records` on the `CSD` column. What is the cardinality of the resulting table?

Which tables (and columns) would be appropriate for a left or right SQL-style join? What would be the intended result?
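One way to explore the last question is a left merge with `indicator=True`, which keeps every row of the left table and marks the rows that found no partner. The sketch below uses small made-up frames; `production` and `wells` are hypothetical stand-ins for tables such as `oil_records` and `well_records`, and their values are invented.

```python
import pandas as pd

# Hypothetical data: CSDUID 2 has production but no well record,
# and CSDUID 4 has a well record but no production.
production = pd.DataFrame({"CSDUID": [1, 2, 3], "oil_m3": [100.0, 250.0, 75.0]})
wells = pd.DataFrame({"CSDUID": [1, 3, 4], "well_count": [5, 2, 9]})

# Inner merge (the default): only keys present in both tables survive.
inner = production.merge(wells, on="CSDUID")

# Left merge: every production row survives; missing well data becomes NaN.
left = production.merge(wells, on="CSDUID", how="left", indicator=True)

print(len(inner), len(left))
print(left[left["_merge"] == "left_only"])
```

When the key is unique in the right table, a left merge has the same cardinality as the left table, which makes it a natural choice when no row of that table should be lost.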

In [11]:
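# Inner merge on the shared CSDUID key. This cell merges oil_records, which is
# what the output below shows; swap in gas_records for the first task.
oil_well_merged = pd.merge(oil_records, well_records, on = 'CSDUID')
display(oil_well_merged.head())

# join() matches against the other table's index, so index well_records by CSD;
# the suffixes disambiguate the column names the two tables share.
oil_well = oil_records.join(well_records.set_index('CSD'), on = 'CSD', lsuffix = '_oil', rsuffix = '_well')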

Unnamed: 0,CSDUID,CSD_x,Period_x,IndicatorSummaryDescription_x,UnitOfMeasure_x,OriginalValue_x,CSD_y,Period_y,IndicatorSummaryDescription_y,UnitOfMeasure_y,OriginalValue_y
0,4805026,Drumheller,2018,Oil Production,m3,6969.6,Drumheller,2018,Well Count,,5.0
1,4805031,Starland County,2018,Oil Production,m3,112882.5,Starland County,2018,Well Count,,17.0
2,4805041,Kneehill County,2018,Oil Production,m3,280762.7,Kneehill County,2018,Well Count,,36.0
3,4802001,Warner County No. 5,2018,Oil Production,m3,195193.9,Warner County No. 5,2018,Well Count,,11.0
4,4807049,Wainwright No. 61,2018,Oil Production,m3,1086087.1,Wainwright No. 61,2018,Well Count,,32.0


In [None]:
# code using join() here