<a href="https://colab.research.google.com/github/joseeden/joeden/blob/master/docs/021-Software-Engineering/021-Jupyter-Notebooks/001-Sample-Notebooks/007-merging-data-with-pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Chicago Data Portal Dataset  

Chicago is divided into 50 wards, each representing a local neighborhood. We will use the following tables:  

- **The Ward Data**  

  - Local government data, including office addresses.
  - The wards table contains 50 rows and 4 columns.
  - One row for each ward's government details.
  - Download the file here: [chicago_wards.p](/docs/021-Software-Engineering/021-Jupyter-Notebooks/000-Sample-Datasets/joining-data-with-pandas/chicago_wards.p)

- **Census table** 
  
  - Population data for 2000 and 2010, percentage changes, and ward center addresses.
  - The census table contains 50 rows and 6 columns.
  - One row for each ward.
  - Download the file here: [chicago_census.p](/docs/021-Software-Engineering/021-Jupyter-Notebooks/000-Sample-Datasets/joining-data-with-pandas/chicago_census.p)

- **Licenses Table**

  - Holds business license information, such as the business name, its address, and the ward it is in.
  - Related to the wards table through the `ward` column.
  - Download the file here: [chicago_business_licenses.p](/docs/021-Software-Engineering/021-Jupyter-Notebooks/000-Sample-Datasets/joining-data-with-pandas/chicago_business_licenses.p)

- **Business Owners**
  - Contains records of business owners in different wards.
  - Download the file here: [chicago_business_owners.p](/docs/021-Software-Engineering/021-Jupyter-Notebooks/000-Sample-Datasets/joining-data-with-pandas/chicago_business_owners.p)

# Importing the Datasets

Import the wards data:

In [16]:
import pandas as pd 

url = 'https://raw.githubusercontent.com/joseeden/joeden/refs/heads/master/docs/021-Software-Engineering/021-Jupyter-Notebooks/000-Sample-Datasets/joining-data-with-pandas/chicago_wards.p'

wards = pd.read_pickle(url)
print(wards.head())
print(wards.shape)

  ward            alderman                          address    zip
0    1  Proco "Joe" Moreno        2058 NORTH WESTERN AVENUE  60647
1    2       Brian Hopkins       1400 NORTH  ASHLAND AVENUE  60622
2    3          Pat Dowell          5046 SOUTH STATE STREET  60609
3    4    William D. Burns  435 EAST 35TH STREET, 1ST FLOOR  60616
4    5  Leslie A. Hairston            2325 EAST 71ST STREET  60649
(50, 4)


Import the census data:

In [17]:
import pandas as pd 

url = 'https://raw.githubusercontent.com/joseeden/joeden/refs/heads/master/docs/021-Software-Engineering/021-Jupyter-Notebooks/000-Sample-Datasets/joining-data-with-pandas/chicago_census.p'

census = pd.read_pickle(url)
print(census.head())
print(census.shape)

  ward  pop_2000  pop_2010 change                                  address  \
0    1     52951     56149     6%              2765 WEST SAINT MARY STREET   
1    2     54361     55805     3%                 WM WASTE MANAGEMENT 1500   
2    3     40385     53039    31%                      17 EAST 38TH STREET   
3    4     51953     54589     5%  31ST ST HARBOR BUILDING LAKEFRONT TRAIL   
4    5     55302     51455    -7%  JACKSON PARK LAGOON SOUTH CORNELL DRIVE   

     zip  
0  60647  
1  60622  
2  60653  
3  60653  
4  60637  
(50, 6)


Import the licenses data:

In [18]:
import pandas as pd 

url = 'https://raw.githubusercontent.com/joseeden/joeden/refs/heads/master/docs/021-Software-Engineering/021-Jupyter-Notebooks/000-Sample-Datasets/joining-data-with-pandas/chicago_business_licenses.p'

licenses = pd.read_pickle(url)
print(licenses.head())
print(licenses.shape)

  account ward  aid                   business               address    zip
0  307071    3  743       REGGIE'S BAR & GRILL       2105 S STATE ST  60616
1      10   10  829                 HONEYBEERS   13200 S HOUSTON AVE  60633
2   10002   14  775                CELINA DELI     5089 S ARCHER AVE  60632
3   10005   12  NaN  KRAFT FOODS NORTH AMERICA        2005 W 43RD ST  60609
4   10044   44  638  NEYBOUR'S TAVERN & GRILLE  3651 N SOUTHPORT AVE  60613
(10000, 6)


Import the business owners data:

In [19]:
import pandas as pd 

url = 'https://raw.githubusercontent.com/joseeden/joeden/refs/heads/master/docs/021-Software-Engineering/021-Jupyter-Notebooks/000-Sample-Datasets/joining-data-with-pandas/chicago_business_owners.p'

biz_owners = pd.read_pickle(url)
print(biz_owners.head())
print(biz_owners.shape)

  account first_name  last_name      title
0      10      PEARL    SHERMAN  PRESIDENT
1      10      PEARL    SHERMAN  SECRETARY
2   10002     WALTER     MROZEK    PARTNER
3   10002     CELINA     BYRDAK    PARTNER
4   10005      IRENE  ROSENFELD  PRESIDENT
(21352, 4)


# Merging Tables  

The tables are related by the **ward** column. Using this column, we can merge the wards table with the census table, matching rows based on ward numbers.

Using pandas' `merge` method, we can combine the two DataFrames. The `on` argument specifies the **ward** column for the merge. The resulting DataFrame includes 50 rows and 9 columns, showing only rows with matching ward values in both tables. This is called an **inner join**.

The wards table is the "left table", so its columns appear first in the result, followed by the columns of the census table (the "right table").

In [20]:
wards_census = wards.merge(census, on='ward')
print(wards_census.head(6))

  ward            alderman                         address_x  zip_x  pop_2000  \
0    1  Proco "Joe" Moreno         2058 NORTH WESTERN AVENUE  60647     52951   
1    2       Brian Hopkins        1400 NORTH  ASHLAND AVENUE  60622     54361   
2    3          Pat Dowell           5046 SOUTH STATE STREET  60609     40385   
3    4    William D. Burns   435 EAST 35TH STREET, 1ST FLOOR  60616     51953   
4    5  Leslie A. Hairston             2325 EAST 71ST STREET  60649     55302   
5    6  Roderick T. Sawyer  8001 S. MARTIN LUTHER KING DRIVE  60619     54989   

   pop_2010 change                                address_y  zip_y  
0     56149     6%              2765 WEST SAINT MARY STREET  60647  
1     55805     3%                 WM WASTE MANAGEMENT 1500  60622  
2     53039    31%                      17 EAST 38TH STREET  60653  
3     54589     5%  31ST ST HARBOR BUILDING LAKEFRONT TRAIL  60653  
4     51455    -7%  JACKSON PARK LAGOON SOUTH CORNELL DRIVE  60637  
5     52341    -5%

Merged tables may include columns with suffixes like `_x` or `_y` when both tables have overlapping column names (e.g., address or zip).
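
To make the inner join and the default suffixes easier to see, here is a minimal sketch on two small made-up tables. The `left` and `right` DataFrames and their values are hypothetical and are not part of the Chicago datasets.

```python
import pandas as pd

# Hypothetical toy tables (not the Chicago data)
left = pd.DataFrame({'ward': ['1', '2', '3'],
                     'address': ['A St', 'B St', 'C St']})
right = pd.DataFrame({'ward': ['2', '3', '4'],
                      'address': ['X Ave', 'Y Ave', 'Z Ave']})

# how='inner' is the default, so only wards present in both tables are kept
merged = left.merge(right, on='ward')

print(merged)
print(merged.columns.tolist())
```

Wards 1 and 4 are dropped because each appears in only one table, and the shared `address` column is split into `address_x` (left) and `address_y` (right).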

# Controlling Suffixes  

The `suffixes` argument in `merge` lets you customize these suffixes. For instance, you can set them to `'_ward'` for the left table and `'_cen'` for the right, which makes it easier to tell which table each column came from.

In [21]:
wards_census = wards.merge(census, on='ward', suffixes=('_ward', '_cen'))
print(wards_census.head())
print(wards_census.shape)

  ward            alderman                     address_ward zip_ward  \
0    1  Proco "Joe" Moreno        2058 NORTH WESTERN AVENUE    60647   
1    2       Brian Hopkins       1400 NORTH  ASHLAND AVENUE    60622   
2    3          Pat Dowell          5046 SOUTH STATE STREET    60609   
3    4    William D. Burns  435 EAST 35TH STREET, 1ST FLOOR    60616   
4    5  Leslie A. Hairston            2325 EAST 71ST STREET    60649   

   pop_2000  pop_2010 change                              address_cen zip_cen  
0     52951     56149     6%              2765 WEST SAINT MARY STREET   60647  
1     54361     55805     3%                 WM WASTE MANAGEMENT 1500   60622  
2     40385     53039    31%                      17 EAST 38TH STREET   60653  
3     51953     54589     5%  31ST ST HARBOR BUILDING LAKEFRONT TRAIL   60653  
4     55302     51455    -7%  JACKSON PARK LAGOON SOUTH CORNELL DRIVE   60637  
(50, 9)


# Types of Relationships

- **One-to-One**

  - Each row in the left table matches exactly one row in the right table.
  - Example: The relationship between the *wards* and *census* tables.
    - Each ward in the *wards* table corresponds to one population entry in the *census* table.
    - For instance, ward 3 appears only once in both tables, ensuring a single population value for each ward.

- **One-to-Many**

  - Each row in the left table can relate to multiple rows in the right table.
  - Example: The relationship between the *wards* table and a *licenses* table containing business licenses.
    - Each ward can have many businesses, so rows in the *wards* table are repeated when merged with the *licenses* table.
    - For instance, ward 1 and its alderman appear multiple times due to many businesses in the 1st ward.
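
One way to check which relationship two tables actually have is the `validate` argument of `merge`, which raises `pandas.errors.MergeError` when the keys do not match the declared relationship. The sketch below uses hypothetical toy tables (`wards_toy`, `licenses_toy`), not the Chicago data.

```python
import pandas as pd

# Hypothetical toy tables (not the Chicago data)
wards_toy = pd.DataFrame({'ward': ['1', '2'],
                          'alderman': ['A', 'B']})
licenses_toy = pd.DataFrame({'ward': ['1', '1', '2'],
                             'business': ['Cafe', 'Bar', 'Deli']})

# Passes: each ward appears once on the left and may repeat on the right
one_to_many = wards_toy.merge(licenses_toy, on='ward', validate='one_to_many')
print(one_to_many.shape)

# Fails: ward '1' appears twice on the right, so this is not one-to-one
try:
    wards_toy.merge(licenses_toy, on='ward', validate='one_to_one')
    is_one_to_one = True
except pd.errors.MergeError as err:
    is_one_to_one = False
    print(f"Not one-to-one: {err}")
```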

# Merging One-to-Many

A one-to-many merge uses the same `merge` method, with the `on` argument set to the shared column, here *ward*. The resulting table combines the ward data with all matching rows from the business license data.

Example: The *wards* table has 50 rows. 

- After merging with the *licenses* table, the new table has 10,000 rows.
- The number of rows in the result is often larger than in the left table, because each left row is repeated once for every matching row in the right table.

In [22]:
wards_licenses = wards.merge(licenses, on='ward', suffixes=('_ward', '_lic'))
print(wards_licenses.head())
print(wards_licenses.shape)

  ward            alderman               address_ward zip_ward account  aid  \
0    1  Proco "Joe" Moreno  2058 NORTH WESTERN AVENUE    60647   12024  NaN   
1    1  Proco "Joe" Moreno  2058 NORTH WESTERN AVENUE    60647   14446  743   
2    1  Proco "Joe" Moreno  2058 NORTH WESTERN AVENUE    60647   14624  775   
3    1  Proco "Joe" Moreno  2058 NORTH WESTERN AVENUE    60647   14987  NaN   
4    1  Proco "Joe" Moreno  2058 NORTH WESTERN AVENUE    60647   15642  814   

               business              address_lic zip_lic  
0   DIGILOG ELECTRONICS       1038 N ASHLAND AVE   60622  
1      EMPTY BOTTLE INC   1035 N WESTERN AVE 1ST   60622  
2  LITTLE MEL'S HOT DOG    2205 N CALIFORNIA AVE   60647  
3    MR. BROWN'S LOUNGE   2301 W CHICAGO AVE 1ST   60622  
4          Beat Kitchen  2000-2100 W DIVISION ST   60622  
(10000, 9)
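
The row count of a one-to-many merge can be reasoned about on a small example. The sketch below uses hypothetical toy tables rather than the Chicago data, and counts how many merged rows each ward produces.

```python
import pandas as pd

# Hypothetical toy tables (not the Chicago data)
wards_toy = pd.DataFrame({'ward': ['1', '2', '3'],
                          'alderman': ['A', 'B', 'C']})
licenses_toy = pd.DataFrame({'ward': ['1', '1', '1', '2', '9'],
                             'business': ['Cafe', 'Bar', 'Deli', 'Shop', 'Gym']})

merged = wards_toy.merge(licenses_toy, on='ward')

# One merged row per matching license; wards without a match are dropped
rows_per_ward = merged.groupby('ward').size()
print(rows_per_ward)
print(len(merged))
```

Ward 1 contributes three rows and ward 2 one row. Ward 3 (no licenses) and the license in ward 9 (no matching ward) do not appear, because the default join is an inner join.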


Next, merge the licenses table with the business owners table to find the most common business owner title. After merging, group the results by title, count the number of accounts in each group, and sort the counts in descending order.

In [23]:
licenses_owners = licenses.merge(biz_owners, on='account')

counted_df = licenses_owners.groupby("title").agg({'account':'count'})
sorted_df = counted_df.sort_values(by='account', ascending=False)
print(sorted_df.head())

                 account
title                   
PRESIDENT           6259
SECRETARY           5205
SOLE PROPRIETOR     1658
OTHER               1200
VICE PRESIDENT       970
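
As an alternative to `groupby` followed by `sort_values`, `Series.value_counts()` counts and sorts in one step. The sketch below shows the idea on hypothetical toy tables (`licenses_toy`, `owners_toy`), not the Chicago data.

```python
import pandas as pd

# Hypothetical toy tables (not the Chicago data)
licenses_toy = pd.DataFrame({'account': ['10', '20'],
                             'business': ['Cafe', 'Deli']})
owners_toy = pd.DataFrame({'account': ['10', '10', '20'],
                           'title': ['PRESIDENT', 'SECRETARY', 'PRESIDENT']})

merged = licenses_toy.merge(owners_toy, on='account')

# value_counts() returns counts sorted from most to least frequent
title_counts = merged['title'].value_counts()
print(title_counts.head())
```

Note that `value_counts()` returns a Series indexed by title, whereas the `groupby(...).agg(...)` approach returns a DataFrame with an `account` column.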
