## Python Collections (Arrays)
There are four collection data types in the Python programming language:

1. **List** is a collection which is ordered and changeable. Allows duplicate members.
2. **Tuple** is a collection which is ordered and unchangeable. Allows duplicate members.
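3. **Set** is a collection which is unordered and unindexed. Individual items cannot be changed in place, but items can be added or removed. No duplicate members.
4. **Dictionary** is a collection of key-value pairs which is changeable and, as of Python 3.7, ordered by insertion. No duplicate keys.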

### Example demonstrating all 4 collection types

In [None]:
# Python program demonstrating all 4 collection types

# 1. List - ordered, changeable, allows duplicates
my_list = ["apple", "banana", "cherry", "apple"]
print("List example:")
print(f"Original list: {my_list}")
my_list.append("orange")  # Lists are changeable
print(f"After adding 'orange': {my_list}")
print(f"First item: {my_list[0]}")  # Lists are ordered (indexed)
print()

# 2. Tuple - ordered, unchangeable, allows duplicates
my_tuple = ("apple", "banana", "cherry", "apple")
print("Tuple example:")
print(f"Tuple: {my_tuple}")
print(f"Second item: {my_tuple[1]}")  # Tuples are ordered (indexed)
# my_tuple[0] = "orange"  # This would cause an error - tuples are unchangeable
print()

# 3. Set - unordered, unchangeable items, no duplicates
my_set = {"apple", "banana", "cherry", "apple"}  # Note: duplicate "apple" will be removed
print("Set example:")
print(f"Set: {my_set}")  # Notice only one "apple" appears
my_set.add("orange")  # You can add items to a set
print(f"After adding 'orange': {my_set}")
# print(my_set[0])  # This would cause an error - sets are unordered and unindexed
print()

# 4. Dictionary - ordered, changeable, no duplicate keys
my_dict = {
    "name": "John",
    "age": 30,
    "city": "New York"
}
print("Dictionary example:")
print(f"Dictionary: {my_dict}")
my_dict["email"] = "john@example.com"  # Dictionaries are changeable
print(f"After adding email: {my_dict}")
print(f"Value of 'name' key: {my_dict['name']}")  # Access by key

## Tuples in Python

**What it is:** A tuple is an ordered, immutable collection of items.

**Ordered:** The items are stored in a specific sequence, and that sequence will not change. You can access items by their index (e.g., the first item, the second item).

**Immutable:** Once a tuple is created, you cannot change its contents. You can't add new items, remove existing items, or change an item at a specific position.

Tuples are defined using parentheses `()`.
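
A small sketch of two points that are easy to miss here: it is the comma, not the parentheses, that makes a tuple (which matters for one-item tuples), and item assignment on a tuple raises `TypeError`. The variable names are illustrative only.

```python
# Parentheses alone do not create a tuple; the trailing comma does.
not_a_tuple = ("apple")      # just a string in parentheses
single_item = ("apple",)     # a one-item tuple
empty = ()                   # the empty tuple is the one case with no comma

print(type(not_a_tuple).__name__)  # str
print(type(single_item).__name__)  # tuple

# Tuples are immutable: item assignment raises TypeError.
coords = (40.7128, -74.0060)
try:
    coords[0] = 41.0
except TypeError as exc:
    error_message = str(exc)
    print(f"TypeError: {error_message}")
```
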

**In what situations will a Data Engineer/Data Analyst use it?**

**Representing Fixed Collections of Data:** When you have a small, fixed set of related items where the order matters and the items shouldn't change.
Example: A latitude/longitude pair such as `(40.7128, -74.0060)`.

**Data Integrity:** When you want to pass around a collection of items and be sure they won't be accidentally modified.
Example: Defining a set of constant column names that should be used consistently across a script.

**Returning Multiple Values from a Function:** A Python function returns a single object. Returning a tuple is a common, Pythonic way to bundle several values together.
Example: A data processing function might return `(number_of_records_processed, number_of_errors)`.
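
As a minimal sketch of the last use case, the function below returns two counts as one tuple. The function name `process_records` and the rule that a `None` record counts as an error are assumptions made for illustration.

```python
def process_records(records):
    """Count usable records and errors; a None record is treated as an error (an assumption)."""
    processed = 0
    errors = 0
    for record in records:
        if record is None:
            errors += 1
        else:
            processed += 1
    # The comma builds a tuple, so the function still returns one object.
    return processed, errors


result = process_records([{"id": 1}, None, {"id": 2}])
print(result)  # (2, 1)

# Unpack the tuple into two names at the call site.
number_processed, number_of_errors = result
print(number_processed, number_of_errors)  # 2 1
```
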

In [6]:
location_coords = (40.7128, -74.0060) # New York City
another_location = (34.0522, -118.2437) # Los Angeles

print(f"NYC Latitude: {location_coords[0]}")
print(f"NYC Longitude: {location_coords[1]}")

# Attempting to change a value will raise an error
#location_coords[0] = 41.0

# You can unpack the values of a tuple into separate variables
lat, lon = location_coords
print(f"Unpacked - Lat: {lat}, Lon: {lon}")

print(location_coords)

NYC Latitude: 40.7128
NYC Longitude: -74.006
Unpacked - Lat: 40.7128, Lon: -74.006
(40.7128, -74.006)


In [7]:

expected_columns = ('transaction_id', 'customer_id', 'amount', 'timestamp')

print(f"Expected column order: {expected_columns}")
print(f"The first expected column is: {expected_columns[0]}")

# Imagine you receive data and check if its columns match
received_columns = ['transaction_id', 'customer_id', 'timestamp', 'amount'] # This is a list

if tuple(received_columns) == expected_columns:
    print("Column order matches the expectation.")
else:
    print("Column order mismatch!")

Expected column order: ('transaction_id', 'customer_id', 'amount', 'timestamp')
The first expected column is: transaction_id
Column order mismatch!


## Sets in Python

**What is a Set?**
A set is an unordered collection of unique items. Think of it like a mathematical set. Key characteristics:

**Unordered:** The elements in a set do not have a specific order. You cannot refer to an element by an index number.

**Unique:** A set cannot contain duplicate values. If you try to add an element that's already present, the set remains unchanged.

**Mutable:** You can add or remove elements from a set after it's created. (Note: the elements themselves must be hashable, which in practice means immutable types such as numbers, strings, or tuples of hashable items.)
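
The note about element types can be illustrated with a short sketch. The values are arbitrary; `frozenset` is Python's built-in immutable counterpart of `set`.

```python
# Hashable values such as numbers, strings, and tuples can be set elements.
valid_set = {1, "apple", (40.7128, -74.0060)}
print(len(valid_set))  # 3

# A list is mutable and unhashable, so it cannot be a set element.
try:
    invalid_set = {1, [2, 3]}
except TypeError as exc:
    set_error = str(exc)
    print(f"TypeError: {set_error}")

# A frozenset is hashable, so it can be nested inside another set.
nested = {frozenset({1, 2}), frozenset({3})}
print(frozenset({2, 1}) in nested)  # True
```
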

-----------

In [8]:
# A simple example to demonstrate sets

# Creating sets
set1 = {1, 2, 3}  # Set created using curly braces
set2 = set([3, 4, 5])  # Set created from a list

# Adding elements
set1.add(4)  # Adds the element 4 to set1
print("Added 4 to set1:", set1)

set2.update([6, 7])  # Adds multiple elements (6 and 7) to set2
print("Added 6 and 7 to set2:", set2)

# Removing elements
set1.remove(1)  # Removes the element 1 from set1 (raises error if not present)
print("Removed 1 from set1:", set1)

set2.discard(3)  # Removes the element 3 from set2 (does nothing if not present)
print("Discarded 3 from set2:", set2)

# Printing sets
print("Current Set 1:", set1)
print("Current Set 2:", set2)

# Clearing a set
set1.clear()  # Removes all elements from set1
print("Set 1 after clearing:", set1)

Added 4 to set1: {1, 2, 3, 4}
Added 6 and 7 to set2: {3, 4, 5, 6, 7}
Removed 1 from set1: {2, 3, 4}
Discarded 3 from set2: {4, 5, 6, 7}
Current Set 1: {2, 3, 4}
Current Set 2: {4, 5, 6, 7}
Set 1 after clearing: set()


## Set Operations

**Use Cases for Data Engineers/Data Analysts:**

Data engineers and data analysts find sets extremely useful for several common tasks:

**De-duplication/Finding Uniqueness:**

**Use Case:** Quickly getting a list of unique customer IDs from a transaction log, unique product SKUs, or unique IP addresses from web server logs.

**Why sets?** Their inherent property of storing only unique items makes this trivial. Converting a list to a set and back to a list is a common way to de-duplicate, although the round trip does not preserve the original order.
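
A minimal sketch of list de-duplication, using made-up customer IDs. The second variant, `dict.fromkeys`, is not a set operation; it is shown only as a common alternative when the original order matters.

```python
customer_ids = ["C3", "C1", "C3", "C2", "C1"]  # hypothetical transaction log values

# set() drops duplicates, but the resulting order is not guaranteed.
unique_ids = list(set(customer_ids))
print(sorted(unique_ids))  # ['C1', 'C2', 'C3']

# dict.fromkeys keeps the first occurrence of each value in its original position.
unique_in_order = list(dict.fromkeys(customer_ids))
print(unique_in_order)  # ['C3', 'C1', 'C2']
```
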

**Membership Testing (Fast Lookups):**

**Use Case:** Checking if a particular user ID exists in a large list of VIP customers, if a specific product is part of a promotional campaign, or if an encountered error code is in a list of known critical errors.

**Why sets?** Sets are optimized for membership checks with the `in` operator: a lookup takes roughly constant time on average, whereas a list has to be scanned item by item. The gap widens as the collection grows.
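
A rough way to see the difference is to time the same membership check against a list and a set. The IDs are synthetic and the absolute timings depend on the machine, so treat this as a sketch rather than a benchmark.

```python
import timeit

vip_list = list(range(100_000))   # hypothetical VIP customer IDs
vip_set = set(vip_list)
target = 99_999                   # worst case for the list: the last element

# Both containers answer the same question with the `in` operator.
print(target in vip_list, target in vip_set)  # True True

# Rough timing comparison; absolute numbers vary from machine to machine.
list_seconds = timeit.timeit(lambda: target in vip_list, number=50)
set_seconds = timeit.timeit(lambda: target in vip_set, number=50)
print(f"list: {list_seconds:.4f}s, set: {set_seconds:.6f}s")
```
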

**Comparing Collections (Set Operations):** This is where sets shine for analytical tasks:

**Intersection (finding common elements):**
Use Case: "Which customers bought product A and product B?" or "Which users are active on our website and our mobile app?"

**Union (combining all unique elements):**
Use Case: "Get a list of all unique customers who interacted with marketing campaign X or campaign Y."

**Difference (finding elements in one but not the other):**
Use Case: "Which customers made a purchase last month but not this month?" or "Which products are in warehouse A but not in warehouse B (data reconciliation)?"

**Symmetric Difference (finding elements in either, but not both):**
Use Case: "Identify items that are in list A or list B, but not common to both (highlighting discrepancies)."
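
The four use cases above can be sketched with two small sets of made-up customer IDs. The names `last_month` and `this_month` are illustrative.

```python
# Hypothetical customer IDs seen in each month
last_month = {"C1", "C2", "C3", "C4"}
this_month = {"C3", "C4", "C5"}

both_months = last_month & this_month   # intersection: bought in both months
any_month = last_month | this_month     # union: bought in at least one month
lapsed = last_month - this_month        # difference: last month only
changed = last_month ^ this_month       # symmetric difference: exactly one month

# sorted() is used only to get a stable display order.
print(sorted(both_months))  # ['C3', 'C4']
print(sorted(any_month))    # ['C1', 'C2', 'C3', 'C4', 'C5']
print(sorted(lapsed))       # ['C1', 'C2']
print(sorted(changed))      # ['C1', 'C2', 'C5']
```
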

---

### Can Set Operations Be Used with Lists or Tuples?

Of lists, tuples, and sets, only sets natively support the operations described above: **intersection, union, difference, and symmetric difference**.

---

### Why Only Sets?

- **Sets** are designed for mathematical set operations.  
  - They ensure all elements are unique.
  - They provide optimized methods for these operations.
- **Lists** and **tuples** can contain duplicates and maintain order, so they do **not** have built-in support for these set operations.

---

### What About Lists and Tuples?

- **Lists** and **tuples** do **not** have methods like `.union()`, `.intersection()`, etc.
- If you want to perform set operations on lists or tuples, you need to **convert them to sets first**.

> **Example:**  
> To find common elements between two lists, convert them to sets and then use set operations.

---

### Summary Table

| Operation           | Sets | Lists | Tuples |
|---------------------|:----:|:-----:|:------:|
| Intersection        |  ✔️  |   ❌  |   ❌   |
| Union               |  ✔️  |   ❌  |   ❌   |
| Difference          |  ✔️  |   ❌  |   ❌   |
| Symmetric Difference|  ✔️  |   ❌  |   ❌   |

---

### In Practice

- **Use sets** when you need to perform these types of comparisons or analyses.
- **Convert lists/tuples to sets** if you need to use set operations, then convert back if you need the result as a list or tuple.
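
A short sketch of the convert-then-operate pattern, using made-up skill names. It also shows that the method forms (`.difference()`, `.intersection()`, and so on) accept any iterable, so only the left-hand side has to be a set.

```python
list_a = ["SQL", "Python", "Spark", "SQL"]
tuple_b = ("Python", "Kafka", "SQL")

# Lists and tuples do not define the set operators, so `&` raises TypeError.
operator_supported = True
try:
    list_a & tuple_b
except TypeError:
    operator_supported = False
print(operator_supported)  # False

# Convert first, then convert back (sorted here to get a stable order).
common = sorted(set(list_a) & set(tuple_b))
print(common)  # ['Python', 'SQL']

# The method forms accept any iterable, so only the left side needs to be a set.
only_in_a = sorted(set(list_a).difference(tuple_b))
print(only_in_a)  # ['Spark']
```
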

In [12]:
# Creating sets
set1 = {1, 2, 3}
set2 = {3, 4, 5}

# Set operations
# Union combines all unique elements from both sets
union_set = set1 | set2
print("Union of set1 and set2 (all unique elements from both sets):", union_set)

# Intersection finds common elements between sets
intersection_set = set1 & set2
print("Intersection of set1 and set2 (elements common to both):", intersection_set)

# Difference finds elements in set1 but not in set2
difference_set = set1 - set2
print("Difference of set1 and set2 (elements in set1 but not in set2):", difference_set)

# Symmetric difference finds elements in either set, but not in both
symmetric_difference_set = set1 ^ set2
print("Symmetric difference of set1 and set2 (elements in either set, but not both):", symmetric_difference_set)

# Checking for membership
print("Is 4 in set1?", 4 in set1)  # Checks if 4 is an element of set1

# Iterating through a set
print("Elements in set2:")
for item in set2:
    print(item)

Union of set1 and set2 (all unique elements from both sets): {1, 2, 3, 4, 5}
Intersection of set1 and set2 (elements common to both): {3}
Difference of set1 and set2 (elements in set1 but not in set2): {1, 2}
Symmetric difference of set1 and set2 (elements in either set, but not both): {1, 2, 4, 5}
Is 4 in set1? False
Elements in set2:
3
4
5


## Dictionary

A dictionary is a collection of key-value pairs. As of Python 3.7 it preserves insertion order; in earlier versions the order was not guaranteed.

**Key-Value Pairs:** Each item in a dictionary consists of a unique key and its associated value, for example `{'key1': 'value1', 'key2': 'value2'}`.

**Unique Keys:** Keys within a dictionary must be unique. If you assign to a key that already exists, its value is updated.

**Immutable Keys:** Keys must be hashable, which in practice means immutable types (e.g., strings, numbers, tuples of hashable items). Values can be of any type (including lists or other dictionaries).

**Mutable:** Dictionaries themselves are mutable. You can add, remove, or modify key-value pairs after creation.

**Fast Lookups:** Dictionaries are highly optimized for retrieving a value when you know its key.

Dictionaries are defined using curly braces `{}`.
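
The key rules above can be sketched in a few lines. The `inventory` and `capitals_by_coords` dictionaries are illustrative only.

```python
inventory = {"apple": 3, "banana": 5}

# Assigning to an existing key overwrites its value instead of adding a duplicate key.
inventory["apple"] = 10
print(inventory)  # {'apple': 10, 'banana': 5}

# A tuple of hashable items can be a key; a list cannot.
capitals_by_coords = {(52.52, 13.405): "Berlin"}
try:
    bad = {[52.52, 13.405]: "Berlin"}
except TypeError as exc:
    key_error_message = str(exc)
    print(f"TypeError: {key_error_message}")

# Indexing a missing key raises KeyError; .get() returns a default instead.
print(inventory.get("cherry", 0))  # 0
```
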


**Historically (Python < 3.7):** If you created `{'name': 'Alice', 'age': 30, 'city': 'New York'}` and looped through it, you might get `age` first, then `city`, then `name`, or some other arbitrary order. The order could even change between runs of the same script or between Python versions. (CPython 3.6 already preserved insertion order, but only as an implementation detail.)

**Currently (Python 3.7+):** If you create `{'name': 'Alice', 'age': 30, 'city': 'New York'}` and loop through it, you will reliably get `name` first, then `age`, then `city`.
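
A minimal sketch of the insertion-order guarantee, assuming Python 3.7 or later. The added `email` value is made up.

```python
person = {"name": "Alice", "age": 30, "city": "New York"}
person["email"] = "alice@example.com"   # new keys go to the end (hypothetical value)
person["name"] = "Alicia"               # updating a key keeps its original position

keys_in_order = list(person)
print(keys_in_order)  # ['name', 'age', 'city', 'email']

# Order does not affect equality between dictionaries.
same_content = {"a": 1, "b": 2} == {"b": 2, "a": 1}
print(same_content)  # True
```
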

In [14]:
# creating a dictionary
country_capitals = {
  "Germany": "Berlin", 
  "Canada": "Ottawa", 
  "England": "London"
}

# printing the dictionary
print(country_capitals)

{'Germany': 'Berlin', 'Canada': 'Ottawa', 'England': 'London'}


![image.png](attachment:bbc2fd28-0925-434d-82c0-2e904617155a.png)

In [15]:
country_capitals = {
  "Germany": "Berlin", 
  "Canada": "Ottawa", 
  "England": "London"
}

# access the value of keys
print(country_capitals["Germany"])
print(country_capitals["England"])

Berlin
London


In [16]:
# add an item with "Italy" as key and "Rome" as its value
country_capitals["Italy"] = "Rome"

print(country_capitals)

{'Germany': 'Berlin', 'Canada': 'Ottawa', 'England': 'London', 'Italy': 'Rome'}


In [17]:
# delete item having "Germany" key
del country_capitals["Germany"]

print(country_capitals)

{'Canada': 'Ottawa', 'England': 'London', 'Italy': 'Rome'}


In [18]:
# clear the dictionary
country_capitals.clear()

print(country_capitals)  

{}


In [20]:
job_roles = [
    {'role': 'Data Engineer - AWS', 'skills': ['Python', 'AWS Glue', 'AWS Lambda', 'S3', 'Redshift', 'SQL', 'Apache Spark']},
    {'role': 'Data Engineer - Azure', 'skills': ['Python', 'Azure Data Factory', 'Apache Spark', 'Kubernetes', 'Apache Synapse Analytics', 'SQL', 'Pandas']},
    {'role': 'Data Engineer - GCP', 'skills': ['Python', 'Cloud Dataflow', 'Cloud Functions', 'Cloud Storage', 'BigQuery', 'SQL']},
    {'role': 'Data Engineer - Batch Processing', 'skills': ['Python', 'Apache Spark', 'Hadoop', 'SQL', 'Java', 'Scala']},
    {'role': 'Data Engineer - Cloud Native', 'skills': ['Python', 'Kubernetes', 'Docker', 'Serverless', 'AWS Lambda', 'Azure Data Factory', 'Google Cloud Functions', 'Pandas']},
    {'role': 'Data Engineer - Real-time Data', 'skills': ['Python', 'Apache Kafka', 'Apache Flink', 'Spark Streaming', 'SQL', 'NoSQL']},
    {'role': 'Data Engineer - Data Warehousing', 'skills': ['SQL', 'ETL Tools', 'Data Modeling', 'Snowflake', 'Redshift', 'Azure Synapse Analytics']},
    {'role': 'Data Engineer - Data Lake', 'skills': ['Python', 'Hadoop', 'Spark', 'AWS S3', 'Azure Data Lake Storage', 'Google Cloud Storage']},
    {'role': 'Data Engineer - Database', 'skills': ['SQL', 'Database Administration', 'Data Modeling', 'Oracle', 'MySQL', 'PostgreSQL', 'Kubernetes']},
    {'role': 'Data Engineer - MLOps', 'skills': ['Python', 'Kubernetes', 'Docker', 'MLflow', 'CI/CD', 'Cloud Platforms (AWS, Azure, GCP)']}
]

my_skills = ['Azure Data Factory', 'Apache Spark', 'SQL', 'Kubernetes', 'Pandas']

In [21]:
type(job_roles)

list

In [24]:
for job in job_roles:
    print(job)

{'role': 'Data Engineer - AWS', 'skills': ['Python', 'AWS Glue', 'AWS Lambda', 'S3', 'Redshift', 'SQL', 'Apache Spark']}
{'role': 'Data Engineer - Azure', 'skills': ['Python', 'Azure Data Factory', 'Apache Spark', 'Kubernetes', 'Apache Synapse Analytics', 'SQL', 'Pandas']}
{'role': 'Data Engineer - GCP', 'skills': ['Python', 'Cloud Dataflow', 'Cloud Functions', 'Cloud Storage', 'BigQuery', 'SQL']}
{'role': 'Data Engineer - Batch Processing', 'skills': ['Python', 'Apache Spark', 'Hadoop', 'SQL', 'Java', 'Scala']}
{'role': 'Data Engineer - Cloud Native', 'skills': ['Python', 'Kubernetes', 'Docker', 'Serverless', 'AWS Lambda', 'Azure Data Factory', 'Google Cloud Functions', 'Pandas']}
{'role': 'Data Engineer - Real-time Data', 'skills': ['Python', 'Apache Kafka', 'Apache Flink', 'Spark Streaming', 'SQL', 'NoSQL']}
{'role': 'Data Engineer - Data Warehousing', 'skills': ['SQL', 'ETL Tools', 'Data Modeling', 'Snowflake', 'Redshift', 'Azure Synapse Analytics']}
{'role': 'Data Engineer - Data 

In [26]:
for job in job_roles:
    print(job['skills'])

['Python', 'AWS Glue', 'AWS Lambda', 'S3', 'Redshift', 'SQL', 'Apache Spark']
['Python', 'Azure Data Factory', 'Apache Spark', 'Kubernetes', 'Apache Synapse Analytics', 'SQL', 'Pandas']
['Python', 'Cloud Dataflow', 'Cloud Functions', 'Cloud Storage', 'BigQuery', 'SQL']
['Python', 'Apache Spark', 'Hadoop', 'SQL', 'Java', 'Scala']
['Python', 'Kubernetes', 'Docker', 'Serverless', 'AWS Lambda', 'Azure Data Factory', 'Google Cloud Functions', 'Pandas']
['Python', 'Apache Kafka', 'Apache Flink', 'Spark Streaming', 'SQL', 'NoSQL']
['SQL', 'ETL Tools', 'Data Modeling', 'Snowflake', 'Redshift', 'Azure Synapse Analytics']
['Python', 'Hadoop', 'Spark', 'AWS S3', 'Azure Data Lake Storage', 'Google Cloud Storage']
['SQL', 'Database Administration', 'Data Modeling', 'Oracle', 'MySQL', 'PostgreSQL', 'Kubernetes']
['Python', 'Kubernetes', 'Docker', 'MLflow', 'CI/CD', 'Cloud Platforms (AWS, Azure, GCP)']


In [27]:
for skill in my_skills:
    print(skill)

Azure Data Factory
Apache Spark
SQL
Kubernetes
Pandas


In [None]:
for job in job_roles:
    for skill in my_skills:
        print(skill)
        print(job['skills'])

In [None]:
for job in job_roles:
    for skill in my_skills:
        if skill in job['skills']:
            print(skill)
            print(job)

In [None]:
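for job in job_roles:
    qualified = True  # flag reserved for a qualification check; not used yet in this cell

    for skill in my_skills:
        # Print each of my skills that the role lists, along with the role's full skill list
        if skill in job['skills']:
            print(skill)
            print(job['skills'])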
In [37]:
# Note: this list is named myskills (no underscore) and differs from my_skills defined earlier
myskills = ['Azure Data Factory', 'Python', 'SQL', 'Apache Spark', 'Pandas']

# Function to calculate skill match percentage
def calculate_match(required_skills, candidate_skills):
    """Return the percentage of required_skills that appear in candidate_skills."""
    if not required_skills:  # avoid division by zero for an empty skill list
        return 0.0
    matching_skills = set(required_skills) & set(candidate_skills)
    return (len(matching_skills) / len(required_skills)) * 100

# Loop through each job role and report its match percentage
for job in job_roles:
    match_percentage = calculate_match(job['skills'], myskills)

    print(f"\nRole: {job['role']} - Match Percentage: {match_percentage:.1f}%")


Role: Data Engineer - AWS - Match Percentage: 42.9%

Role: Data Engineer - Azure - Match Percentage: 71.4%

Role: Data Engineer - GCP - Match Percentage: 33.3%

Role: Data Engineer - Batch Processing - Match Percentage: 50.0%

Role: Data Engineer - Cloud Native - Match Percentage: 37.5%

Role: Data Engineer - Real-time Data - Match Percentage: 33.3%

Role: Data Engineer - Data Warehousing - Match Percentage: 16.7%

Role: Data Engineer - Data Lake - Match Percentage: 16.7%

Role: Data Engineer - Database - Match Percentage: 14.3%

Role: Data Engineer - MLOps - Match Percentage: 16.7%
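
A natural follow-up is to rank roles by match percentage. The sketch below does that with `sorted` on a small, hypothetical data set; the role names, skills, and the helper `match_percentage` are all illustrative and not part of the notebook above.

```python
# Hypothetical roles in the same shape as job_roles
sample_roles = [
    {"role": "Role A", "skills": ["Python", "SQL", "Docker", "Kafka"]},
    {"role": "Role B", "skills": ["Python", "SQL"]},
    {"role": "Role C", "skills": ["Java", "Scala", "Hadoop"]},
]
sample_skills = {"Python", "SQL", "Pandas"}


def match_percentage(required_skills, candidate_skills):
    """Share of required skills the candidate has, as a percentage."""
    required = set(required_skills)
    if not required:  # avoid dividing by zero for an empty requirement list
        return 0.0
    return len(required & set(candidate_skills)) / len(required) * 100


# Sort roles from best match to worst.
ranked = sorted(
    sample_roles,
    key=lambda job: match_percentage(job["skills"], sample_skills),
    reverse=True,
)
for job in ranked:
    print(f"{job['role']}: {match_percentage(job['skills'], sample_skills):.1f}%")
# Role B: 100.0%
# Role A: 50.0%
# Role C: 0.0%
```
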
