## Introduction

Cross products (or Cartesian joins) are an important concept in SQL for joining tables together. However, they are seldom used as-is for reasons we will investigate in this exercise.

The files we will use in this exercise are pre-processed from data taken from the Government of Alberta's Open Data Portal and are licensed under the [Open Government License](https://open.alberta.ca/licence).

In [3]:
import pandas as pd

oil_records = pd.read_csv("OilProductionByMunicipality_2018.csv")
print(oil_records.count())
display(oil_records.head())

gas_records = pd.read_csv("NaturalGasProductionByMunicipality_2018.csv")
print(gas_records.count())
display(gas_records.head())

well_records = pd.read_csv("WellCountByMunicipality_2018.csv")
print(well_records.count())
display(well_records.head())

CSDUID                         74
CSD                            74
Period                         74
IndicatorSummaryDescription    74
UnitOfMeasure                  74
OriginalValue                  74
dtype: int64


Unnamed: 0,CSDUID,CSD,Period,IndicatorSummaryDescription,UnitOfMeasure,OriginalValue
0,4805026,Drumheller,2018,Oil Production,m3,6969.6
1,4805031,Starland County,2018,Oil Production,m3,112882.5
2,4805041,Kneehill County,2018,Oil Production,m3,280762.7
3,4802001,Warner County No. 5,2018,Oil Production,m3,195193.9
4,4807049,Wainwright No. 61,2018,Oil Production,m3,1086087.1


CSDUID                         76
CSD                            76
Period                         76
IndicatorSummaryDescription    76
UnitOfMeasure                  76
OriginalValue                  76
dtype: int64


Unnamed: 0,CSDUID,CSD,Period,IndicatorSummaryDescription,UnitOfMeasure,OriginalValue
0,4805026,Drumheller,2018,Natural Gas Production,m3,45710.9
1,4805031,Starland County,2018,Natural Gas Production,m3,562042.6
2,4805041,Kneehill County,2018,Natural Gas Production,m3,2452594.5
3,4802001,Warner County No. 5,2018,Natural Gas Production,m3,229669.1
4,4807049,Wainwright No. 61,2018,Natural Gas Production,m3,251671.0


CSDUID                         82
CSD                            82
Period                         82
IndicatorSummaryDescription    82
UnitOfMeasure                   0
OriginalValue                  82
dtype: int64


Unnamed: 0,CSDUID,CSD,Period,IndicatorSummaryDescription,UnitOfMeasure,OriginalValue
0,4805026,Drumheller,2018,Well Count,,5.0
1,4805031,Starland County,2018,Well Count,,17.0
2,4805041,Kneehill County,2018,Well Count,,36.0
3,4802001,Warner County No. 5,2018,Well Count,,11.0
4,4807049,Wainwright No. 61,2018,Well Count,,32.0


Recall that the cross product of tables $A = (A_1, A_2, \ldots, A_n)$ and $B = (B_1, B_2, \ldots, B_m)$, where $A_i$ and $B_j$ denote rows, takes the form

$A \times B = (A_1 B_1, A_1 B_2, \ldots, A_1 B_m, A_2 B_1, \ldots, A_n B_m)$
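To make the definition concrete, here is a minimal sketch using two small made-up lists. The names `A` and `B` and their values are purely illustrative and are not part of the exercise data. Every element of `A` is paired with every element of `B`, so the result has $n \times m$ entries.

```python
from itertools import product

# Hypothetical toy "tables": each element stands for one row.
A = ["A1", "A2", "A3"]   # n = 3
B = ["B1", "B2"]         # m = 2

# Pair every row of A with every row of B, in the order used by the definition:
# all pairs for A1 first, then all pairs for A2, and so on.
cross = list(product(A, B))

for pair in cross:
    print(pair)

# The number of pairs always equals n * m.
print(len(cross) == len(A) * len(B))
```

The size of the result grows multiplicatively, which is one reason a bare cross product is rarely what you want on real tables.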

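Given this, write a naive implementation of the cross product of `oil_records` and `well_records`, using the [`append()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html) method in pandas. Note that `DataFrame.append()` was deprecated in pandas 1.4 and removed in pandas 2.0; on recent versions, use `pd.concat()` instead.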

In [19]:
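# Pair every oil row with every well row and collect the small frames.
empty_list = []
for i, oil in oil_records.iterrows():
    for j, well in well_records.iterrows():
        # Caution: concat(axis=1) aligns on index labels (i and j), so the two
        # rows only land side by side when i == j; otherwise they are NaN-padded.
        empty_list.append(pd.concat([pd.DataFrame(oil).T, pd.DataFrame(well).T], axis=1))
# Stack all collected pieces into a single frame.
result = pd.concat(empty_list)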

In [20]:
display(result.head())

Unnamed: 0,CSDUID,CSD,Period,IndicatorSummaryDescription,UnitOfMeasure,OriginalValue,CSDUID.1,CSD.1,Period.1,IndicatorSummaryDescription.1,UnitOfMeasure.1,OriginalValue.1
0,4805026.0,Drumheller,2018.0,Oil Production,m3,6969.6,4805026.0,Drumheller,2018.0,Well Count,,5.0
0,4805026.0,Drumheller,2018.0,Oil Production,m3,6969.6,,,,,,
1,,,,,,,4805031.0,Starland County,2018.0,Well Count,,17.0
0,4805026.0,Drumheller,2018.0,Oil Production,m3,6969.6,,,,,,
2,,,,,,,4805041.0,Kneehill County,2018.0,Well Count,,36.0
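The NaN-padded rows in this output come from index alignment: `pd.concat(..., axis=1)` matches rows by index label, and the oil row and the well row only share a label when `i == j`. The following is a minimal sketch of the effect and one possible fix, using two one-row toy frames. The column names `OilCSD`, `OilValue`, `WellCSD`, and `WellCount` are hypothetical and chosen only to keep the two sides apart.

```python
import pandas as pd

# Toy one-row frames with deliberately different index labels (0 and 1).
oil = pd.DataFrame({"OilCSD": ["Drumheller"], "OilValue": [6969.6]}, index=[0])
well = pd.DataFrame({"WellCSD": ["Starland County"], "WellCount": [17.0]}, index=[1])

# Labels 0 and 1 do not match, so the result has two rows padded with NaN.
misaligned = pd.concat([oil, well], axis=1)
print(misaligned.shape)

# Resetting the index gives both frames the label 0, so the rows line up.
aligned = pd.concat(
    [oil.reset_index(drop=True), well.reset_index(drop=True)], axis=1
)
print(aligned.shape)
```

Applying the same `reset_index(drop=True)` to each one-row frame inside the nested loop should yield one combined row per pair instead of two half-empty ones.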


In [18]:
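# Attempt: write each (oil, well) pair into row k of an initially empty frame.
oil_well = pd.DataFrame()
k = 0
for i in range(len(oil_records)):
    for j in range(len(well_records)):
        # .iloc cannot enlarge a frame: oil_well has no rows, so position k is
        # out of bounds and this assignment raises the IndexError shown below.
        # DataFrame.append() also stacks rows vertically, not side by side.
        oil_well.iloc[[k]] = oil_records.iloc[[i]].append(well_records.iloc[[j]])
        k += 1
print(oil_well.count())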

IndexError: positional indexers are out-of-bounds
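The error occurs because `.iloc` can only address positions that already exist, and `oil_well` starts out empty. One way around this, sketched below on small stand-in frames, is to collect the combined rows in a plain Python list and build the DataFrame once at the end. The frames `oil_toy` and `well_toy` and the `oil_` / `well_` prefixes are assumptions made for illustration; they are not part of the exercise files.

```python
import pandas as pd

# Small hypothetical stand-ins for oil_records and well_records.
oil_toy = pd.DataFrame({"CSD": ["Drumheller", "Starland County"],
                        "OriginalValue": [6969.6, 112882.5]})
well_toy = pd.DataFrame({"CSD": ["Drumheller", "Starland County", "Kneehill County"],
                         "OriginalValue": [5.0, 17.0, 36.0]})

rows = []
for _, oil in oil_toy.iterrows():
    for _, well in well_toy.iterrows():
        # Prefix the labels so the two sides stay distinguishable,
        # then combine both rows into one flat dictionary.
        combined = {**oil.add_prefix("oil_").to_dict(),
                    **well.add_prefix("well_").to_dict()}
        rows.append(combined)

# Build the result once instead of growing a DataFrame inside the loop.
oil_well_toy = pd.DataFrame(rows)
print(len(oil_well_toy))  # 2 * 3 = 6 rows
```

This is still a naive, row-by-row approach, so it will be slow on large tables, but it produces exactly one row per pair.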

pandas was created to provide more efficient ways to handle tabular data. When this exercise was written, no Cartesian product had been [officially implemented](https://github.com/pandas-dev/pandas/issues/5401); pandas 1.2 and later support `merge(..., how="cross")`. A frequently mentioned workaround for older versions works as follows:

- create a dummy column on both A and B that holds the same single value in every row
- use `merge()` to join both tables on the dummy column
- drop the dummy column from the resulting table

Try this now on `gas_records` and `well_records`. 
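The three steps above can be sketched as follows. This is a minimal illustration on toy frames rather than the real files: `gas_toy`, `well_toy`, the dummy column name `_key`, and the `_gas` / `_well` suffixes are all hypothetical choices.

```python
import pandas as pd

# Small hypothetical stand-ins for gas_records and well_records.
gas_toy = pd.DataFrame({"CSD": ["Drumheller", "Starland County"],
                        "OriginalValue": [45710.9, 562042.6]})
well_toy = pd.DataFrame({"CSD": ["Drumheller", "Starland County", "Kneehill County"],
                         "OriginalValue": [5.0, 17.0, 36.0]})

# Step 1: add a constant dummy column to a copy of each table.
gas_keyed = gas_toy.assign(_key=1)
well_keyed = well_toy.assign(_key=1)

# Step 2: merge on the dummy column; every row matches every row.
# The suffixes keep the identically named columns apart.
crossed = gas_keyed.merge(well_keyed, on="_key", suffixes=("_gas", "_well"))

# Step 3: drop the dummy column from the result.
crossed = crossed.drop(columns="_key")
print(crossed.shape)
```

Using `assign()` leaves the original frames untouched, which avoids having to clean up the dummy column on the inputs afterwards. On pandas 1.2 or later, `gas_toy.merge(well_toy, how="cross")` should give the same pairs without a dummy column.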


How many records would result from the Cartesian join of all three tables? 
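As a quick check of the arithmetic: the size of a Cartesian product is the product of the sizes of its inputs. The sketch below assumes the row counts printed by `count()` earlier in this exercise (74, 76, and 82) and multiplies them, without building the join itself.

```python
import math

# Row counts as printed by count() earlier in this exercise.
row_counts = {"oil_records": 74, "gas_records": 76, "well_records": 82}

# A Cartesian join pairs every row with every other row, so sizes multiply.
total = math.prod(row_counts.values())
print(total)  # 461168
```

Three tables of fewer than 100 rows each already produce close to half a million records, which illustrates why unrestricted cross products are seldom used as-is.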