---
format:
  revealjs:
    include-in-header:
    - text: <script src='https://toolness.github.io/p5.js-widget/p5-widget.js'></script>
title: Set Operations
---


# Set Operations



## Introduction to Set Theory and Relational Algebra


Set theory is a fundamental concept in relational algebra, providing the basis for operations that manipulate relations. These operations enable combining and filtering data effectively in relational databases.


:::: {.columns}
::: {.column width=47%}
- Set theory deals with the mathematical concept of sets, collections of distinct elements.
- Relational algebra applies set theory to relations (tables) in databases.
- Common operations include union, intersection, difference, and Cartesian product.
- These operations allow manipulation of data across multiple relations.
- Union, intersection, and difference are performed on relations that have the same schema; Cartesian product has no such requirement.
:::
::: {.column width=6%}
:::
::: {.column width=47%}
[![](.//assets/codd_acm_article.png)](https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf)
:::
::::


*Set theory is essential for understanding how relational algebra manipulates and combines relations.*
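
As a minimal sketch of this idea, a relation can be modeled in Python as a set of tuples, so the built-in set operators mirror union, intersection, and difference. The relation names `r1` and `r2` and their rows are illustrative assumptions, not data from a real database.

```python
from itertools import product

# Hypothetical relations with schema (Course, Term), modeled as sets of tuples.
r1 = {("CMSC301", "Fall 2024"), ("CMSC408", "Fall 2024")}
r2 = {("CMSC408", "Fall 2024"), ("CMSC475", "Fall 2023")}

union = r1 | r2          # tuples in either relation; duplicates collapse
intersection = r1 & r2   # tuples present in both relations
difference = r1 - r2     # tuples in r1 that are not in r2

# Cartesian product pairs every tuple of r1 with every tuple of r2,
# concatenating each pair into one wider tuple.
cartesian = {a + b for a, b in product(r1, r2)}

print(len(union), len(intersection), len(difference), len(cartesian))
```

Because sets hold only distinct elements, the shared row appears once in the union, which matches the relational model's treatment of duplicates.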

<!-- -->

In [None]:
#| echo: false
import pandas as pd
from tabulate import tabulate
from IPython.display import display, Markdown, HTML


def show_df( df, width="80%" ):
   """Render df as a centered HTML table, dropping duplicate rows first."""
   html_table = df.drop_duplicates().to_html(index=False)

   # Wrap the table in a centered container of the requested width (default 80%)
   html_content = f"""
   <div style="text-align: center;">
      <div style="display: inline-block; width: {width};">
         {html_table}
      </div>
   </div>
"""
   display(HTML(html_content))


d1 = {
    'Course': ['CMSC301', 'CMSC408',  'CMSC408'],
    'Term': ['Fall 2024','Fall 2024', 'Fall 2023'],
}
d2 = {
    'Course': ['CMSC110', 'CMSC201',  'CMSC475', 'CMSC408'],
    'Term': ['Fall 2024','Fall 2024', 'Fall 2023','Fall 2024'],
}
d3 = {
   'Term': ['Fall 2022','Fall 2023','Fall 2024'],
   'Term_code': ['202310','202410','202510']
}

df1 = pd.DataFrame( d1 )
df2 = pd.DataFrame( d2 )
df3 = pd.DataFrame( d3 )

## Intersection Operation in Relational Algebra


The intersection operation retrieves rows that are common to two relations. It is used to find data that appears in both relations, making it useful when comparing datasets or finding shared entries between relations.


:::: {.columns}
::: {.column width=47%}
**∩ - Intersection**

- Intersection finds common tuples between two relations.
- The result includes only those tuples that appear in both relations.
- Denoted as Relation1 ∩ Relation2
- It's a binary operation, meaning it operates on two relations.
- Both relations must be *union-compatible*, meaning they have the same set of attributes and data types.
- Intersection is often used in conjunction with other set-based operations like union and difference.
:::
::: {.column width=6%}
:::
::: {.column width=47%}

In [None]:
#| echo: false
import matplotlib.pyplot as plt
from matplotlib_venn import venn2

# Define the two sets
set_A = {1, 2, 3, 4}
set_B = {4, 5, 6, 7}

# Create the Venn diagram with custom colors
venn = venn2([set_A, set_B], set_labels=('Set A', 'Set B'))

venn.get_patch_by_id('10').set_color('white')
venn.get_patch_by_id('11').set_color('lightblue')
venn.get_patch_by_id('01').set_color('white')

# Give every region a black outline
for subset in ('10', '01', '11'):
    venn.get_patch_by_id(subset).set_edgecolor('black')  # Thin black borders
    venn.get_patch_by_id(subset).set_linewidth(1.5)      # Adjust the border width


# Remove the default labels for A and B
venn.get_label_by_id('10').set_text('')
venn.get_label_by_id('01').set_text('')
venn.get_label_by_id('11').set_text('')

# Make the background transparent
plt.gca().set_facecolor('none')

# Display the plot
plt.title('A ∩ B')
plt.show()

**Examples**

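Given two union-compatible relations *Students(ID,Name,Major)* and *Registered(ID,Name,Major)*, the following are valid uses of $\cap$ (written here with the Unicode symbol ∩). The second and third examples assume that their operands are union-compatible as well: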

1. Students ∩ Registered

1. Courses ∩ OfferedCourses

1. Employees ∩ Managers
:::
::::

<!-- -->

*Intersection is useful for finding commonality between two sets of data in relational databases.*



## Properties of the Intersection Operator



:::: {.columns}
::: {.column width=47%}
**Definition**

$$
R_1 \cap R_2 = \{\, t \mid t \in R_1 \text{ and } t \in R_2 \,\}
$$

- where $t$ is a row (tuple),
- $R_1$ and $R_2$ are relations (tables) with the same attributes,
- The intersection operation returns a new relation containing only the rows that are present in both $R_1$ and $R_2$,
- The result consists of distinct rows that satisfy the condition of being in both relations.
:::
::: {.column width=6%}
:::
::: {.column width=47%}
**Properties**

- **Idempotent** – Intersecting a relation with itself returns the relation unchanged:

$$
R \cap R = R
$$

- **Commutative** – The order of relations in an intersection operation doesn't matter:

$$
R_1 \cap R_2 = R_2 \cap R_1
$$

- **Associative** – The grouping of intersection operations doesn't affect the result:

$$
(R_1 \cap R_2) \cap R_3 = R_1 \cap (R_2 \cap R_3)
$$

- **Intersection with an empty set** – The intersection of a relation with an empty set is the empty set:

$$
R \cap \emptyset = \emptyset
$$

- **Intersection distributes over union** – The intersection of two relations distributes over their union:

$$
R_1 \cap (R_2 \cup R_3) = (R_1 \cap R_2) \cup (R_1 \cap R_3)
$$
:::
::::

<!-- -->



## Explanation of Intersection Properties


- **Idempotent**: When a relation is intersected with itself, every row is trivially present in both operands, so the result is the relation itself.
- **Commutative**: The order of the relations in the intersection operation does not affect the result.
- **Associative**: You can group intersection operations in any way, and the result will be the same.
- **Intersection with an empty set**: Intersecting with an empty relation results in an empty set because no row can be in both the original relation and an empty set.
- **Distributes over union**: Intersecting $R_1$ with a union gives the same rows as intersecting $R_1$ with each operand separately and then taking the union of the results.
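
These properties can be checked on small examples. The sketch below models relations as Python sets of tuples; the three relations and their values are arbitrary illustrations, and passing assertions on one example is a sanity check rather than a proof.

```python
# Illustrative single-attribute relations; the values are arbitrary.
r1 = {(1,), (2,), (3,), (4,)}
r2 = {(3,), (4,), (5,)}
r3 = {(4,), (5,), (6,)}
empty = set()

assert r1 & r1 == r1                             # idempotent
assert r1 & r2 == r2 & r1                        # commutative
assert (r1 & r2) & r3 == r1 & (r2 & r3)          # associative
assert r1 & empty == empty                       # intersection with the empty relation
assert r1 & (r2 | r3) == (r1 & r2) | (r1 & r3)   # distributes over union

print("all intersection properties hold for this example")
```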



## Intersection - ∩ - Example 1



:::: {.columns}
::: {.column width=47%}
Given *Courses1( Course,Term)*:

In [None]:
#| echo: false
show_df(df1)

and *Courses2( Course,Term )*:

In [None]:
#| echo: false
show_df(df2)

:::
::: {.column width=6%}
:::
::: {.column width=47%}
*Courses1* $\cap$ *Courses2* returns:

In [None]:
#| echo: false
# An inner merge on all shared columns keeps only rows present in both relations
new_df = pd.merge(df1, df2, how='inner')
show_df( new_df )

:::
::::
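
The same result can be reproduced with plain Python sets of tuples. This is a sketch that copies the rows of *Courses1* and *Courses2* from the tables above; the variable names `courses1` and `courses2` are assumptions made for the example.

```python
# Rows of Courses1(Course, Term) and Courses2(Course, Term) as sets of tuples.
courses1 = {
    ("CMSC301", "Fall 2024"),
    ("CMSC408", "Fall 2024"),
    ("CMSC408", "Fall 2023"),
}
courses2 = {
    ("CMSC110", "Fall 2024"),
    ("CMSC201", "Fall 2024"),
    ("CMSC475", "Fall 2023"),
    ("CMSC408", "Fall 2024"),
}

# A tuple is in the intersection only if the whole row matches in both relations.
common = courses1 & courses2
print(sorted(common))  # [('CMSC408', 'Fall 2024')]
```

Note that *(CMSC408, Fall 2023)* is excluded: the course appears in both relations, but the full row does not.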

<!-- -->



## Difference in Relational Algebra


Difference in relational algebra subtracts one relation from another, returning the rows that are present in the first relation but not the second.


:::: {.columns}
::: {.column width=47%}
- The difference operation returns tuples that are in the first relation but not in the second.
- It is often used to filter out unwanted data from a larger dataset.
- The relations must have the same schema for the difference operation to be valid.
- This operation can help isolate unique data points in a relation.
- The result is a relation that includes only the data exclusive to the first set.
:::
::: {.column width=6%}
:::
::: {.column width=47%}

In [None]:
#| echo: false
import matplotlib.pyplot as plt
from matplotlib_venn import venn2

# Define the two sets
set_A = {1, 2, 3, 4}
set_B = {4, 5, 6, 7}

# Create the Venn diagram with custom colors
venn = venn2([set_A, set_B], set_labels=('Set A', 'Set B'))

venn.get_patch_by_id('10').set_color('lightblue')
venn.get_patch_by_id('11').set_color('white')
venn.get_patch_by_id('01').set_color('white')

# Give every region a black outline
for subset in ('10', '01', '11'):
    venn.get_patch_by_id(subset).set_edgecolor('black')  # Thin black borders
    venn.get_patch_by_id(subset).set_linewidth(1.5)      # Adjust the border width


# Remove the default labels for A and B
venn.get_label_by_id('10').set_text('')
venn.get_label_by_id('01').set_text('')
venn.get_label_by_id('11').set_text('')

# Make the background transparent
plt.gca().set_facecolor('none')

# Display the plot
plt.title('A - B')
plt.show()

:::
::::

<!-- -->

*Difference is a powerful tool for excluding data from one relation that is present in another.*



## Properties of the Difference Operator



:::: {.columns}
::: {.column width=47%}
**Definition**

$$
R_1 - R_2 = \{\, t \mid t \in R_1 \text{ and } t \notin R_2 \,\}
$$

- where $t$ is a row (tuple),
- $R_1$ and $R_2$ are relations (tables) with the same attributes,
- The difference operation returns a new relation containing only the rows that are in $R_1$ but not in $R_2$,
- The result consists of distinct rows that exist in $R_1$ and are absent from $R_2$.
:::
::: {.column width=6%}
:::
::: {.column width=47%}
**Properties**

- **Non-commutative** – The order of relations in the difference operation matters; in general:

$$
R_1 - R_2 \neq R_2 - R_1
$$

- **Not associative** – Grouping difference operations affects the result; in general:

$$
(R_1 - R_2) - R_3 \neq R_1 - (R_2 - R_3)
$$

- **Difference with an empty set** – The difference between a relation and an empty set is the relation itself:

$$
R_1 - \emptyset = R_1
$$

- **Difference with itself** – The difference between a relation and itself is the empty set:

$$
R_1 - R_1 = \emptyset
$$

- **Difference over intersection** – Subtracting an intersection yields the union of the individual differences (De Morgan's law):

$$
R_1 - (R_2 \cap R_3) = (R_1 - R_2) \cup (R_1 - R_3)
$$
:::
::::

<!-- -->



## Explanation of Difference Properties


- **Non-commutative**: The order in which the relations are used in the difference matters because the result will include rows from $R_1$ that are not in $R_2$, but not vice versa.
- **Not associative**: The grouping of relations in a difference operation affects the outcome: $(R_1 - R_2) - R_3$ removes the rows of both $R_2$ and $R_3$ from $R_1$, while $R_1 - (R_2 - R_3)$ removes only the rows of $R_2$ that are not in $R_3$.
- **Difference with an empty set**: Subtracting an empty set from a relation has no effect since there are no rows to remove.
- **Difference with itself**: Subtracting a relation from itself results in an empty set, as no rows are left.
- **Difference over intersection**: A row of $R_1$ survives $R_1 - (R_2 \cap R_3)$ whenever it is missing from at least one of $R_2$ or $R_3$, so the result is the union $(R_1 - R_2) \cup (R_1 - R_3)$, not the intersection of the two differences.
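
A quick way to see how difference behaves is to test it on small sets. The sketch below again models relations as Python sets of tuples with arbitrary illustrative values. It also checks how difference interacts with intersection: by De Morgan's law, subtracting an intersection equals the union of the two individual differences.

```python
# Illustrative single-attribute relations; the values are arbitrary.
r1 = {(1,), (2,), (3,), (4,)}
r2 = {(3,), (4,), (5,)}
r3 = {(4,), (5,), (6,)}
empty = set()

assert r1 - r2 != r2 - r1                        # order matters for these relations
assert (r1 - r2) - r3 != r1 - (r2 - r3)          # grouping matters for these relations
assert r1 - empty == r1                          # subtracting nothing changes nothing
assert r1 - r1 == empty                          # a relation minus itself is empty
assert r1 - (r2 & r3) == (r1 - r2) | (r1 - r3)   # De Morgan: union of the differences

print(sorted(r1 - (r2 & r3)))
```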



## Difference - $-$ - Example 1



:::: {.columns}
::: {.column width=47%}
Given *Courses1( Course,Term)*:

In [None]:
#| echo: false
show_df(df1)

and *Courses2( Course,Term )*:

In [None]:
#| echo: false
show_df(df2)

:::
::: {.column width=6%}
:::
::: {.column width=47%}
*Courses1* $-$ *Courses2* returns:

In [None]:
#| echo: false
difference = pd.merge(df1, df2, how='left', indicator=True)

# Keep only the rows that are unique to df1
df1_minus_df2 = difference[difference['_merge'] == 'left_only'].drop(columns=['_merge'])

show_df( df1_minus_df2 )

:::
::::

<!-- -->

## Difference - $-$ - Example 2

:::: {.columns}
::: {.column width=47%}
Given *Courses1( Course,Term)*:

In [None]:
#| echo: false
show_df(df1)

and *Courses2( Course,Term )*:

In [None]:
#| echo: false
show_df(df2)

:::
::: {.column width=6%}
:::
::: {.column width=47%}
*Courses2* $-$ *Courses1* returns:

In [None]:
#| echo: false
difference = pd.merge(df2, df1, how='left', indicator=True)

# Keep only the rows that are unique to df2
df2_minus_df1 = difference[difference['_merge'] == 'left_only'].drop(columns=['_merge'])

show_df( df2_minus_df1 )

:::
::::
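
Both difference examples can be reproduced with plain Python sets of tuples, which makes the effect of operand order easy to compare side by side. This is a sketch using the rows of *Courses1* and *Courses2* shown above; the variable names are assumptions made for the example.

```python
# Rows of Courses1(Course, Term) and Courses2(Course, Term) as sets of tuples.
courses1 = {
    ("CMSC301", "Fall 2024"),
    ("CMSC408", "Fall 2024"),
    ("CMSC408", "Fall 2023"),
}
courses2 = {
    ("CMSC110", "Fall 2024"),
    ("CMSC201", "Fall 2024"),
    ("CMSC475", "Fall 2023"),
    ("CMSC408", "Fall 2024"),
}

only_in_1 = courses1 - courses2   # Courses1 - Courses2
only_in_2 = courses2 - courses1   # Courses2 - Courses1

print(sorted(only_in_1))  # [('CMSC301', 'Fall 2024'), ('CMSC408', 'Fall 2023')]
print(sorted(only_in_2))  # [('CMSC110', 'Fall 2024'), ('CMSC201', 'Fall 2024'), ('CMSC475', 'Fall 2023')]
```

The shared row *(CMSC408, Fall 2024)* is removed in both directions, and the two results have no rows in common.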

<!-- -->





## Combining Relations Using Set Operations


Set operations allow the combination of multiple relations in a variety of ways, depending on the desired outcome of the query.


:::: {.columns}
::: {.column width=98%}
- Relational algebra supports various set operations like union, intersection, and difference.
- These operations allow filtering and merging data across relations.
- Union, intersection, and difference are only valid when the schemas of the involved relations match.
- Use cases include combining multiple tables, finding common data, or filtering out specific records.
- Understanding these operations is key to effective data manipulation in relational databases.
:::
::: {.column width=1%}
:::
::: {.column width=1%}

:::
::::

<!-- -->

*Set operations provide flexible tools for combining and comparing datasets in relational databases.*



## Set Operation Requirements and Considerations


When performing set operations, it's essential to ensure that both relations have compatible schemas and understand how each operation behaves.


:::: {.columns}
::: {.column width=98%}
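- Union, intersection, and difference can only be performed on relations with identical schemas.
- The number of attributes and their types must match for the operation to succeed.
- Result size depends on the inputs: union returns at most $|R_1| + |R_2|$ rows, intersection at most $\min(|R_1|, |R_2|)$, difference at most $|R_1|$, and Cartesian product exactly $|R_1| \times |R_2|$.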
- Performance considerations include the size of relations and efficiency of the operation.
- Proper indexing can improve the speed of set operations in large databases.
:::
::: {.column width=1%}
:::
::: {.column width=1%}

:::
::::

<!-- -->

*Understanding the requirements of set operations ensures successful and efficient data manipulation.*
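
The schema requirement can be enforced before an operation runs. The sketch below is one possible approach, not a standard API: `Relation`, `union_compatible`, and `intersect` are hypothetical helpers, and compatibility is taken to mean identical attribute names and types in the same order, following the definition used in these slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """Hypothetical helper: a schema of (attribute, type) pairs plus a set of rows."""
    schema: tuple
    rows: frozenset


def union_compatible(a, b):
    # Assumption: compatible means the same attribute names and types, in the same order.
    return a.schema == b.schema


def intersect(a, b):
    # Refuse to combine relations whose schemas differ.
    if not union_compatible(a, b):
        raise ValueError("relations are not union-compatible")
    return Relation(a.schema, a.rows & b.rows)


courses = Relation((("Course", str), ("Term", str)),
                   frozenset({("CMSC408", "Fall 2024")}))
terms = Relation((("Term", str), ("Term_code", str)),
                 frozenset({("Fall 2024", "202510")}))

print(union_compatible(courses, courses))  # True
print(union_compatible(courses, terms))    # False
```

Calling `intersect(courses, terms)` raises `ValueError` instead of silently returning an empty relation, which makes schema mistakes visible early.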



## Examples of Set Operations in Relational Queries


Relational algebra operations like union, intersection, and difference can be directly applied in database queries to filter and combine data.


:::: {.columns}
::: {.column width=98%}
- Example: Union of two employee tables to combine employee records from two departments.
- Example: Intersection of student and graduate tables to find students who have graduated.
- Example: Difference between a product catalog and inventory to find out-of-stock items.
- Example: Cartesian product of customer and order tables to pair every customer with every order for further filtering.
- Practical queries often combine set operations with other relational algebra operations.
:::
::: {.column width=1%}
:::
::: {.column width=1%}

:::
::::

<!-- -->

*Set operations are applied in real-world scenarios to efficiently manipulate and query data.*
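
In SQL, intersection and difference correspond to the `INTERSECT` and `EXCEPT` keywords. The sketch below runs the catalog and inventory example against an in-memory SQLite database using Python's standard `sqlite3` module; the table names, column names, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE catalog   (sku TEXT, name TEXT);
    CREATE TABLE inventory (sku TEXT, name TEXT);
    INSERT INTO catalog   VALUES ('A1', 'Widget'), ('B2', 'Gadget'), ('C3', 'Gizmo');
    INSERT INTO inventory VALUES ('A1', 'Widget'), ('C3', 'Gizmo');
""")

# Difference: catalog rows with no identical row in inventory (out of stock).
out_of_stock = conn.execute(
    "SELECT sku, name FROM catalog EXCEPT SELECT sku, name FROM inventory"
).fetchall()

# Intersection: rows that appear in both tables (in stock).
in_stock = conn.execute(
    "SELECT sku, name FROM catalog INTERSECT SELECT sku, name FROM inventory ORDER BY sku"
).fetchall()

conn.close()

print(out_of_stock)  # [('B2', 'Gadget')]
print(in_stock)      # [('A1', 'Widget'), ('C3', 'Gizmo')]
```

Both queries select the same columns in the same order from each table, which is the SQL counterpart of the union-compatibility requirement.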



## Set Operations in Real-World Databases


Set operations play a vital role in real-world databases, helping manage and analyze large datasets effectively.


:::: {.columns}
::: {.column width=98%}
- Used to merge large datasets across departments or organizations.
- Helpful in financial reporting, where records from different periods or regions are combined.
- Set operations can aid in data cleaning by removing duplicates or irrelevant records.
- They are fundamental in multi-relational databases where data is distributed across tables.
- Often used in cloud environments for large-scale data analysis and processing.
:::
::: {.column width=1%}
:::
::: {.column width=1%}

:::
::::

<!-- -->

*In practice, set operations streamline data integration and analysis across various industries.*



## Summary of Set Operations in Relational Algebra


Set operations, including union, intersection, difference, and Cartesian product, are key tools in relational algebra for manipulating and combining relations. They enable powerful queries that form the basis of relational database functionality.


:::: {.columns}
::: {.column width=98%}
- Set theory provides the foundation for combining and filtering relations.
- Intersection finds common records, while difference keeps the rows of the first relation that do not appear in the second.
- Cartesian product creates all possible combinations of tuples from two relations.
- Union merges two relations, removing duplicates.
- These operations are essential for querying and managing relational databases.
:::
::: {.column width=1%}
:::
::: {.column width=1%}

:::
::::

<!-- -->

*Mastery of set operations in relational algebra allows for complex and efficient database queries.*
