<a href="https://colab.research.google.com/github/joseeden/joeden/blob/master/docs/021-Software-Engineering/021-Jupyter-Notebooks/001-Sample-Notebooks/007-merging-data-with-pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Chicago Data Portal Dataset  

Chicago is divided into 50 wards, each representing a local neighborhood. We will use the following tables:  

- **The Ward Data**  

  - Local government data, including office addresses.
  - The wards table contains 50 rows and 4 columns.
  - One row for each ward's government details.
  - Download the file here: [chicago_wards.p](/docs/021-Software-Engineering/021-Jupyter-Notebooks/000-Sample-Datasets/joining-data-with-pandas/chicago_wards.p)

- **Census table** 
  
  - Population data for 2000 and 2010, percentage changes, and ward center addresses.
  - The census table has 50 rows and 6 columns.
  - Download the file here: [chicago_census.p](/docs/021-Software-Engineering/021-Jupyter-Notebooks/000-Sample-Datasets/joining-data-with-pandas/chicago_census.p)

- **Licenses Table**

  - Holds information such as business addresses and the wards.
  - Also related to the wards data through the `ward` column.
  - Download the file here: [chicago_business_licenses.p](/docs/021-Software-Engineering/021-Jupyter-Notebooks/000-Sample-Datasets/joining-data-with-pandas/chicago_business_licenses.p)

- **Business Owners**
  - Contains records of business owners in different wards.
  - Download the file here: [chicago_business_owners.p](/docs/021-Software-Engineering/021-Jupyter-Notebooks/000-Sample-Datasets/joining-data-with-pandas/chicago_business_owners.p)

# Importing the Datasets

Import the wards data:

In [None]:
import pandas as pd 

url = 'https://raw.githubusercontent.com/joseeden/joeden/refs/heads/master/docs/021-Software-Engineering/021-Jupyter-Notebooks/000-Sample-Datasets/joining-data-with-pandas/chicago_wards.p'

wards = pd.read_pickle(url)
print(wards.head())
print(wards.shape)

Import the census data:

In [None]:
import pandas as pd 

url = 'https://raw.githubusercontent.com/joseeden/joeden/refs/heads/master/docs/021-Software-Engineering/021-Jupyter-Notebooks/000-Sample-Datasets/joining-data-with-pandas/chicago_census.p'

census = pd.read_pickle(url)
print(census.head())
print(census.shape)

Import the licenses data:

In [None]:
import pandas as pd 

url = 'https://raw.githubusercontent.com/joseeden/joeden/refs/heads/master/docs/021-Software-Engineering/021-Jupyter-Notebooks/000-Sample-Datasets/joining-data-with-pandas/chicago_business_licenses.p'

licenses = pd.read_pickle(url)
print(licenses.head())
print(licenses.shape)

Import the business owners data:

In [14]:
import pandas as pd 

url = 'https://raw.githubusercontent.com/joseeden/joeden/refs/heads/master/docs/021-Software-Engineering/021-Jupyter-Notebooks/000-Sample-Datasets/joining-data-with-pandas/chicago_business_owners.p'

biz_owners = pd.read_pickle(url)
print(biz_owners.head())
print(biz_owners.shape)

  account first_name  last_name      title
0      10      PEARL    SHERMAN  PRESIDENT
1      10      PEARL    SHERMAN  SECRETARY
2   10002     WALTER     MROZEK    PARTNER
3   10002     CELINA     BYRDAK    PARTNER
4   10005      IRENE  ROSENFELD  PRESIDENT
(21352, 4)


# Merging Tables  

The tables are related by the **ward** column. Using this column, we can merge the wards table with the census table, matching rows based on ward numbers.

Using the pandas `merge` method, we can combine the two DataFrames. The `on` argument specifies the **ward** column as the key for the merge. The resulting DataFrame has 50 rows and 9 columns and contains only the rows whose ward values appear in both tables. This is called an **inner join**, which is the default behavior of `merge`.

The wards table is considered the "left table", so its columns appear first, followed by the columns of the census table (the "right table").

In [None]:
wards_census = wards.merge(census, on='ward')
print(wards_census.head(6))

When both tables have overlapping column names other than the merge key (e.g., `address` or `zip`), pandas keeps both columns and appends the suffix `_x` to the left table's column and `_y` to the right table's column.
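To see the inner join and the default suffixes in isolation, here is a minimal sketch on two small made-up tables. The names `left_toy` and `right_toy` and all of their values are hypothetical and are not taken from the Chicago data.

```python
import pandas as pd

# Hypothetical toy tables (not the Chicago data): both share the key
# column 'ward' and an overlapping non-key column 'address'.
left_toy = pd.DataFrame({
    "ward": [1, 2, 3],
    "address": ["A St", "B St", "C St"],
})
right_toy = pd.DataFrame({
    "ward": [2, 3, 4],
    "address": ["X Ave", "Y Ave", "Z Ave"],
    "pop_2010": [100, 200, 300],
})

# Default merge is an inner join: only wards 2 and 3 appear in both tables.
merged_toy = left_toy.merge(right_toy, on="ward")

print(merged_toy)
print(merged_toy.columns.tolist())
```

Ward 1 and ward 4 are dropped because each appears in only one table, and the two `address` columns come back as `address_x` (left) and `address_y` (right).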

# Controlling Suffixes  

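The `suffixes` argument of `merge` lets you customize these suffixes. It takes a pair of strings: the first is applied to the left table and the second to the right table. For instance, setting it to `('_ward', '_cen')` makes it easier to tell which table each overlapping column came from.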

In [None]:
wards_census = wards.merge(census, on='ward', suffixes=('_ward', '_cen'))
print(wards_census.head())
print(wards_census.shape)

# Types of Relationships

- **One-to-One**

  - Each row in the left table matches exactly one row in the right table.
  - Example: The relationship between the *wards* and *census* tables.
    - Each ward in the *wards* table corresponds to one population entry in the *census* table.
    - For instance, ward 3 appears only once in both tables, ensuring a single population value for each ward.

- **One-to-Many**

  - Each row in the left table can relate to multiple rows in the right table.
  - Example: The relationship between the *wards* table and a *licenses* table containing business licenses.
    - Each ward can have many businesses, so rows in the *wards* table are repeated when merged with the *licenses* table.
    - For instance, ward 1 and its alderman appear multiple times due to many businesses in the 1st ward.
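One way to check which relationship a merge actually has is the `validate` argument of `merge`, which raises `pandas.errors.MergeError` when the keys do not match the declared relationship. The sketch below uses small made-up tables; `wards_toy`, `licenses_toy`, and their values are hypothetical stand-ins for the real data.

```python
import pandas as pd

# Hypothetical toy tables: each ward appears once on the left,
# but ward 1 appears twice on the right.
wards_toy = pd.DataFrame({"ward": [1, 2], "alderman": ["A", "B"]})
licenses_toy = pd.DataFrame({
    "ward": [1, 1, 2],
    "business": ["Cafe", "Bakery", "Garage"],
})

# Passes: keys are unique in the left table, so this is one-to-many.
one_to_many = wards_toy.merge(licenses_toy, on="ward", validate="one_to_many")
print(one_to_many)

# Fails: the right table repeats ward 1, so it is not one-to-one.
is_one_to_one = True
try:
    wards_toy.merge(licenses_toy, on="ward", validate="one_to_one")
except pd.errors.MergeError as err:
    is_one_to_one = False
    print("Not one-to-one:", err)
```

In the one-to-many result, the row for ward 1 and its alderman is repeated once for each matching business.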

# Merging One-to-Many

A one-to-many merge uses the same `merge` method, with the `on` argument set to the shared column, e.g., *ward*. The resulting table combines each ward's data with all matching rows from the business license data.

Example: The *wards* table has 50 rows.

- After merging with the *licenses* table, the new table has 10,000 rows.
- The number of rows in the resulting table is often larger than in the left table, because each ward row is repeated once for every matching license.

In [None]:
wards_licenses = wards.merge(licenses, on='ward', suffixes=('_ward', '_lic'))
print(wards_licenses.head())
print(wards_licenses.shape)

Next, merge the licenses table with the business owners table to find the most common business owner title. After merging, group the results by title, count the accounts in each group, and sort the counts in descending order.

In [15]:
licenses_owners = licenses.merge(biz_owners, on='account')

counted_df = licenses_owners.groupby("title").agg({'account':'count'})
sorted_df = counted_df.sort_values(by='account', ascending=False)
print(sorted_df.head())

                 account
title                   
PRESIDENT           6259
SECRETARY           5205
SOLE PROPRIETOR     1658
OTHER               1200
VICE PRESIDENT       970
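The merge, group, count, and sort steps above can be sketched on a small made-up example, alongside `Series.value_counts`, which gives the same counts already sorted in descending order. The tables `licenses_toy` and `owners_toy` and their values are hypothetical, not the real Chicago data.

```python
import pandas as pd

# Hypothetical toy tables keyed by 'account'.
licenses_toy = pd.DataFrame({"account": [10, 20, 30], "ward": [1, 1, 2]})
owners_toy = pd.DataFrame({
    "account": [10, 10, 20, 30],
    "title": ["PRESIDENT", "SECRETARY", "PRESIDENT", "PARTNER"],
})

# Account 10 has two owners, so its license row appears twice after merging.
merged_toy = licenses_toy.merge(owners_toy, on="account")

# Same pattern as above: group by title, count accounts, sort descending.
by_groupby = (
    merged_toy.groupby("title")
    .agg({"account": "count"})
    .sort_values(by="account", ascending=False)
)

# Alternative: count occurrences of each title directly.
by_value_counts = merged_toy["title"].value_counts()

print(by_groupby)
print(by_value_counts)
```

Both approaches count one entry per merged row, so an account with several owners contributes once for each owner.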
