# Some Python Modules for Data Engineering/Science Work
## Greg Placencia, PhD

# File Handling
## csv - Handle CSV files
- https://docs.python.org/3/library/csv.html

- Parse and process large CSV files as part of ETL pipelines
- Convert CSV data into other formats, like JSON or database tables
- Write processed or transformed data back into CSV format

CSV Module - How to Read, Parse, and Write CSV Files tutorial
- https://www.youtube.com/watch?v=q5uM4VKywbA
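
The three tasks above can be sketched in a few lines. This is a minimal example, not a full pipeline: the sample data, the column names (`city`, `temp_f`, `temp_c`), and the helper `fahrenheit_to_celsius` are all made up for illustration.

```python
import csv
import json
import tempfile
from pathlib import Path

# Hypothetical sample data; the column names are invented for this sketch.
SAMPLE = "city,temp_f\nBoston,68\nPhoenix,104\n"


def fahrenheit_to_celsius(temp_f):
    """Convert a Fahrenheit reading to Celsius, rounded to one decimal."""
    return round((temp_f - 32) * 5 / 9, 1)


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "readings.csv"
    dst = Path(tmp) / "readings_c.csv"
    src.write_text(SAMPLE, encoding="utf-8")

    # Extract: DictReader yields one dict per row, keyed by the header line.
    # For very large files, loop over the reader instead of calling list().
    with src.open(newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    # Transform: csv values are always strings, so convert before doing math.
    transformed = [
        {"city": r["city"], "temp_c": fahrenheit_to_celsius(float(r["temp_f"]))}
        for r in rows
    ]

    # Load: write the transformed rows back out as CSV.
    with dst.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["city", "temp_c"])
        writer.writeheader()
        writer.writerows(transformed)

    csv_output = dst.read_text(encoding="utf-8")

# The same rows converted to another format (JSON).
as_json = json.dumps(transformed)
print(as_json)
```

Opening files with `newline=""` follows the recommendation in the `csv` documentation, so the module can handle line endings itself.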

---
## json - Handle JSON data
- https://docs.python.org/3/library/json.html

- Convert API responses into Python objects for further processing
- Store config info or metadata in a structured format
- Handle complex, nested data structures often found in big data applications

Work with JSON Data tutorial
- https://www.youtube.com/watch?v=9N6a-VLBa2I
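
As a rough illustration of the tasks above, the sketch below parses a JSON string into Python objects, reads a nested value, and round-trips a configuration dictionary. The response body and the configuration keys are hypothetical and stand in for whatever a real API or pipeline would provide.

```python
import json

# Hypothetical API response body; the field names are invented for illustration.
response_text = '{"user": {"id": 7, "tags": ["etl", "python"]}, "active": true}'

# json.loads turns JSON text into dicts, lists, strings, numbers, and booleans.
payload = json.loads(response_text)
first_tag = payload["user"]["tags"][0]  # walk the nested structure
is_active = payload["active"]           # JSON true becomes Python True

# Hypothetical pipeline configuration stored as structured data.
config = {"batch_size": 500, "source": {"format": "csv", "delimiter": ","}}

# json.dumps serializes back to text; indent and sort_keys make it readable.
config_text = json.dumps(config, indent=2, sort_keys=True)
round_tripped = json.loads(config_text)

print(first_tag, is_active)
print(config_text)
```

To work with files rather than strings, `json.load` and `json.dump` take an open file object in place of the text.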

---
## pickle - Save and reload Python objects to and from a binary format
- https://docs.python.org/3/library/pickle.html
### Save complex data structures, such as lists, dictionaries, or custom objects, to disk and reload them later.
Useful for the following tasks:
- Cache transformed data to speed up repetitive tasks in data pipelines
- Persist trained models or data transformation steps for reproducibility
- Store/reload complex configurations or datasets between processing stages

Python Pickle Module for saving objects (serialization) tutorial
- https://www.youtube.com/watch?v=2Tw39kZIbhs
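
The caching use case above might look something like the following sketch. The functions `expensive_transform` and `load_or_compute` and the file name `squares.pkl` are hypothetical. Note that the `pickle` documentation warns against unpickling data from untrusted sources, so this pattern suits only files your own code wrote.

```python
import pickle
import tempfile
from pathlib import Path


def expensive_transform(values):
    """Stand-in for a slow transformation step (hypothetical)."""
    return {v: v * v for v in values}


def load_or_compute(cache_path, values):
    """Return the cached result if present; otherwise compute and cache it."""
    if cache_path.exists():
        # Pickle files are binary, so open them in "rb" / "wb" mode.
        with cache_path.open("rb") as f:
            return pickle.load(f)
    result = expensive_transform(values)
    with cache_path.open("wb") as f:
        pickle.dump(result, f)
    return result


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "squares.pkl"
    first = load_or_compute(cache, [1, 2, 3])   # computes and writes the cache
    cache_created = cache.exists()
    second = load_or_compute(cache, [1, 2, 3])  # reloads from the cache file

print(first == second)
```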

In [None]:
# csv, json, and pickle ship with Python's standard library, so no pip install is needed.
import csv
import json
import pickle

# System Related
## os - Interact with operating systems
- https://docs.python.org/3/library/os.html
### Perform the following data engineering tasks with the os module’s functionalities:
- Automate creation and deletion of directories for temporary or output data storage  
- Manipulate file paths when organizing large datasets across different directories  
- Handle environment variables to manage configuration settings in data pipelines  
  
OS Module - Use Underlying Operating System Functionality tutorial
- https://www.youtube.com/watch?v=tJxcKyFMTGo
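
A minimal sketch of the tasks above: build a path, create and remove directories, and read an environment variable with a fallback. The directory names and the variable name `PIPELINE_ENV` are assumptions made for this example.

```python
import os
import tempfile

# Work inside a throwaway directory so the example leaves nothing behind.
base = tempfile.mkdtemp()

# os.path.join builds a path with the separator of the current platform.
out_dir = os.path.join(base, "output", "daily")

# makedirs creates intermediate directories; exist_ok avoids an error on reruns.
os.makedirs(out_dir, exist_ok=True)
dir_created = os.path.isdir(out_dir)

# PIPELINE_ENV is a hypothetical setting; "dev" is used when it is not set.
env_name = os.environ.get("PIPELINE_ENV", "dev")

# os.rmdir removes one empty directory at a time, innermost first.
os.rmdir(out_dir)
os.rmdir(os.path.join(base, "output"))
os.rmdir(base)
dir_removed = not os.path.exists(base)

print(dir_created, env_name, dir_removed)
```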

---
## pathlib - Handle file system paths
- https://docs.python.org/3/library/pathlib.html
### Manipulate file and directory paths with an intuitive and readable syntax to manage file tasks.

The pathlib module can come in handy in the following data engineering tasks:
- Streamline the process of iterating over and validating large datasets
- Simplify management of paths when moving or copying files during ETL (Extract, Transform, Load) processes
- Ensure cross-platform compatibility, especially in multi-environment data engineering workflows

How To Navigate the Filesystem with Python’s Pathlib tutorial
- https://www.kdnuggets.com/how-to-navigate-the-filesystem-with-pythons-pathlib

Organize, Search, and Back Up Files with Python’s Pathlib
- https://www.kdnuggets.com/organize-search-and-back-up-files-with-pythons-pathlib
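
Iterating over and validating a directory of files, as described above, could be sketched as follows. The file names are invented, and "valid" here simply means "not empty", which is an assumption chosen to keep the example short.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # The / operator joins path segments in a platform-independent way.
    data_dir = Path(tmp) / "raw"
    data_dir.mkdir()

    # Hypothetical input files for the example.
    (data_dir / "a.csv").write_text("x\n1\n", encoding="utf-8")
    (data_dir / "b.csv").write_text("", encoding="utf-8")
    (data_dir / "notes.txt").write_text("ignore me", encoding="utf-8")

    # glob selects files by pattern; sorting makes the order predictable.
    csv_files = sorted(data_dir.glob("*.csv"))

    # Simple validation: keep only the files that actually contain data.
    non_empty = [p.name for p in csv_files if p.stat().st_size > 0]

    # with_suffix derives an output name without any string slicing.
    renamed = [p.with_suffix(".json").name for p in csv_files]

print(non_empty, renamed)
```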

---
## shutil - Copy, move, and delete files and directories
- https://docs.python.org/3/library/shutil.html

### In data engineering projects, shutil can help with:
- Efficiently move or copy large datasets across different storage locations
- Automate the cleanup of temporary files and directories after processing data
- Create backups of critical datasets before processing or analysis

The Ultimate Python File Management Toolkit tutorial
- https://www.youtube.com/watch?v=sXzezIK0d7c
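
The backup, move, and cleanup steps listed above can be illustrated with a short sketch. The directory layout (`backup`, `archive`) and the file `dataset.csv` are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path

# Temporary working area standing in for real storage locations.
work = Path(tempfile.mkdtemp())
dataset = work / "dataset.csv"
dataset.write_text("id,value\n1,10\n", encoding="utf-8")

# Back up before processing: copy2 copies the contents and tries to
# preserve file metadata such as modification times.
backup_dir = work / "backup"
backup_dir.mkdir()
backup = Path(shutil.copy2(dataset, backup_dir / "dataset.csv"))
backup_matches = backup.read_text(encoding="utf-8") == dataset.read_text(encoding="utf-8")

# Move the processed file to another location.
archive_dir = work / "archive"
archive_dir.mkdir()
moved = Path(shutil.move(str(dataset), str(archive_dir / "dataset.csv")))
moved_ok = moved.exists() and not dataset.exists()

# Clean up: rmtree deletes a directory and everything inside it.
shutil.rmtree(work)
cleaned = not work.exists()

print(backup_matches, moved_ok, cleaned)
```

Because `shutil.rmtree` deletes recursively and without confirmation, it is worth double-checking the path passed to it.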

---
## subprocess - Run shell commands and interact with the system shell within Python scripts
- https://docs.python.org/3/library/subprocess.html
### Essential to automate system tasks, call command-line tools, or capture output from external processes:
- Automate execution of shell scripts or data processing commands
- Capture output from command-line tools to integrate with Python workflows
- Orchestrate complex data processing pipelines that involve multiple tools and commands

Calling External Commands Using the Subprocess Module tutorial
- https://www.youtube.com/watch?v=2Fp1N6dof0Y
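
As a sketch of capturing output from an external command, the example below runs a second Python interpreter as the "command-line tool" so that it works on any platform. The printed message is made up; in a real pipeline the argument list would name whatever tool is being called.

```python
import subprocess
import sys

# Passing the command as a list avoids shell quoting problems.
# sys.executable is the path of the Python interpreter running this code.
result = subprocess.run(
    [sys.executable, "-c", "print('rows processed: 3')"],
    capture_output=True,  # collect stdout and stderr instead of printing them
    text=True,            # decode the output to str rather than bytes
    check=True,           # raise CalledProcessError on a non-zero exit code
)

output = result.stdout.strip()
print(result.returncode, output)
```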

In [None]:
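# os, pathlib, shutil, and subprocess ship with Python's standard library, so no pip install is needed.
# Running `pip install os` fails with "No matching distribution found for os".
import os
import pathlib
import shutil
import subprocess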

# Data Related

## datetime - Work with date and time data
- https://docs.python.org/3/library/datetime.html

- Parse and format timestamps in logs or event data
- Manage date ranges and calculate time intervals when working with real-world datasets

Datetime Module - How to work with Dates, Times, Timedeltas, and Timezones tutorial
- https://www.youtube.com/watch?v=eirjjyP2qcQ
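
Parsing timestamps and computing intervals, as listed above, might look like this sketch. The log line, the start time, and the timestamp format are assumptions. Real logs will need a format string that matches their own layout.

```python
from datetime import datetime, timedelta

# Hypothetical log line whose first 19 characters are the timestamp.
log_line = "2024-03-01 23:15:00 job finished"
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"

# strptime parses text into a datetime object.
started = datetime.strptime("2024-03-01 22:45:30", TIMESTAMP_FORMAT)
finished = datetime.strptime(log_line[:19], TIMESTAMP_FORMAT)

# Subtracting two datetimes gives a timedelta (a time interval).
elapsed = finished - started
elapsed_seconds = elapsed.total_seconds()

# isoformat produces a standard text representation of the timestamp.
finished_iso = finished.isoformat()

# Build a three-day date range starting on the day the job began.
date_range = [(started.date() + timedelta(days=i)).isoformat() for i in range(3)]

print(elapsed_seconds, finished_iso, date_range)
```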
---

## re - Work with regular expressions for text processing
- https://docs.python.org/3/library/re.html

- Extract specific patterns from logs, raw data, or unstructured text
- Validate data formats, such as dates, emails, or phone numbers, during ETL processes
- Clean raw text data for further analysis

re Module - How to Write and Match Regular Expressions (Regex) tutorial
- https://www.youtube.com/watch?v=K8L6KVGG-7o
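
The extract, validate, and clean tasks above can each be shown in a line or two. The log text and the helper `looks_like_iso_date` are hypothetical, and the validation checks only the shape of the string (four digits, two digits, two digits), not whether the date exists on a calendar.

```python
import re

# Hypothetical log text for the example.
log = "2024-03-01 ERROR disk full; 2024-03-02 INFO ok; 2024-03-03 ERROR timeout"

# Extract: with one capture group, findall returns just the captured dates.
error_dates = re.findall(r"(\d{4}-\d{2}-\d{2}) ERROR", log)

# Validate: compile once, then reuse; fullmatch requires the whole string to match.
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def looks_like_iso_date(text):
    """Return True if text has the shape YYYY-MM-DD (format check only)."""
    return DATE_RE.fullmatch(text) is not None


# Clean: collapse every run of whitespace into a single space.
cleaned = re.sub(r"\s+", " ", "  too   many\tspaces \n").strip()

print(error_dates, looks_like_iso_date("2024-03-01"), cleaned)
```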

In [None]:
# datetime and re ship with Python's standard library, so no pip install is needed.
import datetime
import re

# Database
## sqlite3 - Store structured data without the overhead of a database server
- https://docs.python.org/3/library/sqlite3.html

- Prototype ETL pipelines before scaling them to fully fledged database systems
- Store metadata, logging information, or intermediate results during data processing
- Quickly query and manage structured data without setting up a database server

A Guide to Working with SQLite Databases in Python tutorial
- https://www.kdnuggets.com/a-guide-to-working-with-sqlite-databases-in-python
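
A minimal sketch of storing and querying intermediate results, assuming a hypothetical `runs` table with made-up rows. The special name `:memory:` creates a temporary in-memory database. Passing a file path instead would persist the data to disk.

```python
import sqlite3

# ":memory:" keeps the database in RAM, which is handy for prototyping.
conn = sqlite3.connect(":memory:")

# Hypothetical table that logs how many rows each job loaded.
conn.execute("CREATE TABLE runs (job TEXT, rows_loaded INTEGER)")

# The ? placeholders let sqlite3 bind values safely, without string formatting.
conn.executemany(
    "INSERT INTO runs VALUES (?, ?)",
    [("ingest", 120), ("ingest", 80), ("export", 50)],
)
conn.commit()

# Query the structured data with ordinary SQL.
totals = conn.execute(
    "SELECT job, SUM(rows_loaded) FROM runs GROUP BY job ORDER BY job"
).fetchall()

conn.close()
print(totals)
```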

In [None]:
# sqlite3 ships with Python's standard library, so no pip install is needed.
import sqlite3