## Python for Data Engineering

Python plays a crucial role in the world of data engineering, offering versatile and powerful libraries. It has been adopted in various domains, including data science, machine learning, AI, data visualization, and data engineering. Python is widely used in big data processing through frameworks like Apache Spark, workflow orchestration, web scraping, and more. In this post, I’ll present the useful elements of this language and its use cases for data engineering.

Let's start with a short walkthrough of some of the built-in data structures that Python provides:
- list
- tuple
- dictionary
- set

### List

Mutable: Lists are mutable, meaning you can modify their elements after the list is created. You can add, remove, or modify elements.

In [1]:
my_list = [1, 2, 3, 'a', 'b', 'c']
my_list.append(4)     # Adds 4 to the end of the list
my_list.remove('a')  # Removes the element 'a'

In [2]:
print(my_list)

[1, 2, 3, 'b', 'c', 4]
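
The cell above adds and removes elements; modifying existing elements works through index or slice assignment. A minimal sketch, where the list `readings` is a made-up example rather than something from this post:

```python
# Hypothetical example list (not from the cells above)
readings = [10, 20, 30, 40]

readings[0] = 11          # replace a single element by index
readings[1:3] = [21, 31]  # replace a slice of elements
readings.insert(0, 1)     # insert the value 1 at position 0

print(readings)
```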


### List Operations
**Some useful operations you can use with lists:**
- list comprehension
- filter
- map
- reduce

#### List Comprehension
List comprehension offers a shorter syntax when you want to create a new list based on the values of an existing list. This allows you to simplify code.

In [9]:
cubes = []

# normal syntax
for i in range(10):
    cubes.append(i**3)
print(cubes)

[0, 1, 8, 27, 64, 125, 216, 343, 512, 729]


In [10]:
# with list comprehension

cubes = [i**3 for i in range(10)]
print(cubes)

[0, 1, 8, 27, 64, 125, 216, 343, 512, 729]
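
A comprehension can also carry a condition, and the same syntax builds dictionaries. A small sketch; the variable names `even_cubes` and `cube_by_number` are only illustrative:

```python
# Keep only the cubes of even numbers by adding an `if` clause
even_cubes = [i**3 for i in range(10) if i % 2 == 0]
print(even_cubes)

# The same idea as a dictionary comprehension (number -> cube)
cube_by_number = {i: i**3 for i in range(4)}
print(cube_by_number)
```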


#### Filter
As the name suggests, `filter` keeps only the elements of a list that satisfy a given condition.

In [11]:
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Filter even numbers
even_numbers = list(filter(lambda x: x % 2 == 0, numbers))
# Output: [2, 4, 6, 8]

In [12]:
print(even_numbers)

[2, 4, 6, 8]


#### Lambda Function
Let's take a quick detour, since we've just encountered an anonymous function (a lambda function).

```
lambda x: x % 2 == 0
```
This is an anonymous function (lambda function) that takes an input `x` and returns `True` if `x` is even (`x % 2 == 0`), otherwise `False`.

The `filter` function works hand in hand with it: it takes the lambda function as input and applies it to each element in `numbers`, keeping only the elements for which the lambda function returns `True` (i.e., the even numbers).

In [13]:
# we could use a regular function instead of a lambda function

def is_even(x):
    return x%2 == 0

In [14]:
print(is_even(10))

True


In [22]:
result = filter(is_even, numbers)  # filter() returns a filter object (an iterator)

N.B: **By definition, an iterator can be iterated through once, producing all values, and is then exhausted.**

In [23]:
result

<filter at 0x7939b814d5a0>

In [24]:
for i in result:
    print(i)

2
4
6
8


In [25]:
# or we can use list() to convert it into a list
even_num = list(filter(is_even, numbers))
print(even_num)

[2, 4, 6, 8]
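
The note about iterators being exhausted can be checked directly. This sketch repeats the `is_even` function and `numbers` list from the cells above, and consumes the same filter object twice:

```python
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9]

def is_even(x):
    return x % 2 == 0

result = filter(is_even, numbers)

first_pass = list(result)   # consumes every value from the iterator
second_pass = list(result)  # the iterator has nothing left to produce

print(first_pass)   # [2, 4, 6, 8]
print(second_pass)  # []
```

If the filtered values are needed more than once, store them in a list or call `filter()` again.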


#### Map
The `map` function applies a function to every element of a list (or any other iterable).

In [26]:
def power2(val: int) -> int:
    return val*val

numbers = [1, 2, 3, 4, 5]

power_numbers = list(map(power2, numbers))
# OR
squared_numbers = list(map(lambda x: x**2, numbers))

# Output: [1, 4, 9, 16, 25]

In [27]:
print(numbers)

[1, 2, 3, 4, 5]


In [28]:
print(squared_numbers)

[1, 4, 9, 16, 25]
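
`map()` also accepts several iterables and passes one item from each to the function on every call. A short sketch with made-up `prices` and `quantities` lists, next to the equivalent list comprehension:

```python
# Made-up values for illustration
prices = [10.0, 20.0, 30.0]
quantities = [1, 2, 3]

# map() pulls one item from each iterable per call
totals = list(map(lambda p, q: p * q, prices, quantities))
print(totals)

# Equivalent list comprehension using zip()
totals_lc = [p * q for p, q in zip(prices, quantities)]
print(totals_lc)
```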


#### Reduce

The `reduce()` function is a powerful tool in Python that operates on a list (or any iterable), applies a function to its elements, and ‘reduces’ them to a single output.

In [39]:
from functools import reduce

words = ["apple", "banana", "orange", "apple", "grape", "banana"]

# Count the occurrences of each word
word_counts = reduce(lambda counts, word: {**counts, word: counts.get(word, 0) + 1}, words, {})

print(word_counts)
# Output: {'apple': 2, 'banana': 2, 'orange': 1, 'grape': 1}


numbers = [1, 2, 3, 4, 5]
product = reduce((lambda x, y: x * y), numbers)
print(product)

# Output: 120

{'apple': 2, 'banana': 2, 'orange': 1, 'grape': 1}
120
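
For these two particular tasks, the standard library also offers dedicated helpers. The sketch below is an alternative for comparison, not a replacement for `reduce()`: `collections.Counter` counts items, and `math.prod` (Python 3.8+) multiplies them.

```python
from collections import Counter
from math import prod  # available since Python 3.8

words = ["apple", "banana", "orange", "apple", "grape", "banana"]

# Counter does the word counting in a single call
word_counts = Counter(words)
print(dict(word_counts))

numbers = [1, 2, 3, 4, 5]
print(prod(numbers))
```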


### Dictionary
Dictionaries are used to store data values in key-value pairs. A dictionary is a collection that is changeable and does not allow duplicates in the keys.

In [4]:
thisdict = {
  "brand": "Ford",
  "model": "Mustang",
  "year": 1964
}

In [5]:
print(thisdict)

{'brand': 'Ford', 'model': 'Mustang', 'year': 1964}
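
A few common dictionary operations, sketched on the same `thisdict`. The `"color"` and `"price"` keys are made up for illustration:

```python
thisdict = {"brand": "Ford", "model": "Mustang", "year": 1964}

thisdict["year"] = 1965          # update the value of an existing key
thisdict["color"] = "red"        # add a new key ("color" is a made-up field)
missing = thisdict.get("price")  # .get() returns None instead of raising KeyError

# Iterate over key-value pairs
for key, value in thisdict.items():
    print(key, "->", value)
```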


### Tuple
Immutable: Tuples are immutable, meaning that once they are created, their elements cannot be changed. Unlike lists, you can’t add or remove elements from a tuple.

In [3]:
my_tuple = (1, 2, 3, 'a', 'b', 'c')

### Set
Unordered and Unique Elements: Sets are unordered collections of unique elements. They do not allow duplicate values, and the order of elements is not guaranteed.

Syntax: Defined using curly braces {} or by using the set() constructor. Note that empty braces {} create an empty dictionary, so an empty set must be created with set().

In [7]:
my_set = {1, 2, 3}
my_set.add(1)
my_set.add(1) # won't add 1 to the set
print(my_set)

{1, 2, 3}
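
Because a set keeps only unique elements, it is handy for de-duplicating values and for comparing collections. A small sketch with made-up ID values:

```python
# Made-up IDs for illustration
ids_today = {101, 102, 103}
ids_yesterday = {102, 103, 104}

print(ids_today & ids_yesterday)  # intersection: present in both sets
print(ids_today | ids_yesterday)  # union: present in either set
print(ids_today - ids_yesterday)  # difference: only in ids_today

# De-duplicate a list (sorted() gives a predictable order)
unique_ids = sorted(set([101, 101, 102, 103, 103]))
print(unique_ids)
```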

### Generators
Python provides generators as a simple way to write your own iterators. A generator is a special type of function that does not return a single value; instead, it returns an iterator object that produces a sequence of values. In a generator function, the `yield` keyword is used instead of `return`. This lets you iterate over data without loading the entire dataset into memory, which makes generators suitable for processing huge files.

Benefits of generators:
- Memory efficient: Only the current value is held in memory, not the entire list.
- Faster start: Values are generated on demand, so processing can begin before the whole input has been produced.
- Lazy evaluation: Useful for large datasets or infinite sequences.
- Better performance on large inputs: Avoids creating large lists in memory.

When to use generators?
- When dealing with large datasets (e.g., reading big files line by line).
- When you don't need to store all values at once.
- When you need lazy evaluation (e.g., streaming data).
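
To make the points above concrete, here is a minimal sketch. `count_up_to` is a hypothetical helper written for this illustration; the size comparison uses `sys.getsizeof`, which reports the size of the container object itself.

```python
import sys

def count_up_to(limit):
    """Yield 1, 2, ..., limit one value at a time (hypothetical helper)."""
    n = 1
    while n <= limit:
        yield n
        n += 1

gen = count_up_to(3)
print(next(gen))  # 1
print(next(gen))  # 2
print(list(gen))  # [3] (only the remaining values)

# A generator expression stays small no matter how many values it will produce
as_list = [i for i in range(100_000)]
as_gen = (i for i in range(100_000))
print(sys.getsizeof(as_list) > sys.getsizeof(as_gen))  # True
```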

In [42]:
text = """
transaction_id,user\r
1,aaa\r
\r
2,xx\r
3,ccc\r
\r
"""

In [44]:
print(text) # this is how it looks


transaction_id,user
1,aaa

2,xx
3,ccc




In [49]:
text.split("\r")[0].strip().upper() # e.g. of a processed line

'TRANSACTION_ID,USER'

In [63]:
# creating a generator function
def process_large_files(text):
    for line in text.split("\r"):
        processed_line = line.strip().upper()
        # print("begin processing")
        if processed_line != "":
            # Yield the processed line
            yield processed_line

In [64]:
for processed_line in process_large_files(text):
    print(processed_line)

TRANSACTION_ID,USER
1,AAA
2,XX
3,CCC
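
The function above still receives the whole text as a single string. When the data lives in a file, iterating over the file object reads it line by line. The sketch below assumes a hypothetical helper `read_clean_lines` and writes a small throwaway file so that it can run on its own:

```python
import os
import tempfile

def read_clean_lines(path):
    """Yield non-empty, upper-cased lines from a file, one at a time."""
    with open(path) as handle:
        for line in handle:  # the file object is itself a lazy iterator
            cleaned = line.strip().upper()
            if cleaned:
                yield cleaned

# Write a small throwaway file so the sketch is runnable
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("transaction_id,user\n1,aaa\n\n2,xx\n3,ccc\n\n")
    path = tmp.name

lines = list(read_clean_lines(path))
print(lines)

os.remove(path)  # clean up the temporary file
```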


### Enumerate
The `enumerate` function adds a counter to an iterable and returns an enumerate object that yields `(index, value)` pairs.

In [41]:
# normal way
lista = ['a','b','c','d','e']
count = 0
for l in lista:
    print('Index:', count,' Value:', l)
    count+=1
print('')
# enumerate way
for count, l in enumerate(lista):
    print('Index:', count,' Value:', l)

Index: 0  Value: a
Index: 1  Value: b
Index: 2  Value: c
Index: 3  Value: d
Index: 4  Value: e

Index: 0  Value: a
Index: 1  Value: b
Index: 2  Value: c
Index: 3  Value: d
Index: 4  Value: e
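
`enumerate` also takes an optional `start` argument, which helps when the counter should begin at 1 (for example, for line numbers). A short sketch:

```python
lista = ['a', 'b', 'c']

# start=1 makes the counter begin at 1 instead of 0
for line_number, value in enumerate(lista, start=1):
    print('Line:', line_number, ' Value:', value)

# enumerate() yields (counter, value) tuples
pairs = list(enumerate(lista, start=1))
print(pairs)
```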


### Decorators
Python decorators are used to add functionality to objects such as functions without having to write additional code inside them. For example, a decorator can display the value returned by the original function and then modify it.

In [65]:
def make_upper(function):
    def upper():
        f = function()
        print(f"this is from origin value: {f}")
        return f.upper()
    return upper

In [66]:
def helloworld():
    return "hello world"

In [67]:
helloworld()

'hello world'

In [68]:
@make_upper
def helloworld():
    return "hello world"

In [69]:
helloworld()

this is from origin value: hello world


'HELLO WORLD'
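
The `make_upper` decorator only works for functions that take no arguments. A more general pattern, sketched here with a hypothetical `timed` decorator, forwards `*args` and `**kwargs` and uses `functools.wraps` so the decorated function keeps its original name and docstring:

```python
import functools
import time

def timed(function):
    """Hypothetical decorator: report how long each call takes."""
    @functools.wraps(function)  # keep the original name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = function(*args, **kwargs)  # forward all arguments
        elapsed = time.perf_counter() - start
        print(f"{function.__name__} took {elapsed:.6f}s")
        return result
    return wrapper

@timed
def add(a, b):
    return a + b

print(add(2, 3))
print(add.__name__)  # still "add" thanks to functools.wraps
```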