## **Some Advanced Software Engineering Principles for Clean & Reusable Python Code: Part 2**   

You've done everything possible to make your code reproducible and reusable. You've followed best practices, named your variables like a responsible adult, and even added comments that you *might* understand six months from now. But now you’ve realised that some of these tasks are getting a bit too repetitive, like déjà vu with less excitement.

Writing the same block of code over and over every time you encounter a familiar objective? Yes, that’s not exactly the dream. To make life easier (and to protect your sanity), it's time to streamline the process. And guess what? You couldn’t find any existing helper libraries or packages tailored specifically for your needs. So what do you do? You build your own `Python` package, of course – from scratch, fully customisable, and exactly how *you* like it.

You can keep it for yourself, use it locally like a secret weapon, or share it with the world by indexing it on [PyPI](https://pypi.org/) (the Python Package Index) and bask in the glory of open-source contribution.

In this article, we won’t just dump code and call it a day. Instead, we’ll walk through the basic structure of a Python package designed for analytics tasks. We'll build it from scratch, test it, format it, and finally publish it to `PyPI`. Our package will be simple but functional, and will focus on:

#### An Extract, Transform, and Load (ETL) Pipeline for CSV Data

1. The package will list and transform any number of CSV files from a local directory.
2. It will load the integrated and transformed dataset into a [PostgreSQL](https://www.postgresql.org/) database.

Let’s dive in and have some fun building your own package. Who knows, this might be the start of your open-source empire.


![](https://plus.unsplash.com/premium_photo-1720287601300-cf423c3d6760?q=80&w=2070&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D)

Photo by [Philip Oroni](https://unsplash.com/@philipsfuture) on [Unsplash](https://unsplash.com/)

---

## **1. Developing a Python Package**  
A package is a **collection of modules**.

Difference between a `script`, `module`, and `package`:

- A **script** is generally a standalone Python file (`.py`) intended to be executed directly. It might contain plain lines of code, functions, or objects. Scripts often include the following line of code to specify that certain code should only run when the file is executed directly, not when imported as a module:

```python
if __name__ == "__main__":
    print("Running some code!") # The function(s)/code intended to be run goes in this block
```

- A **module** is a Python file (`.py`) containing functions, classes, and variables that can be imported and used in other scripts or modules.

- A **package** is a collection of one or more modules, sub-packages (packages within packages), and scripts. It is typically organised as a directory containing Python files and optionally an `__init__.py` file to mark it as a package.

Below is an example structure of the package we're going to build, showing only the key modules, subpackage, and scripts (with a brief description of each in comments using `#`).

### **Example: An ETL (Extract, Transform, and Load) Package**

```
my_etl_package (top-level/parent directory of the package)
├── my_etl_package (main package)  
│   ├── __init__.py            # Organises imports of modules and their functions for the main package  
│   ├── load_data.py           # Handles loading data into the database  
│   ├── read_data.py           # Reads raw CSV files from local storage  
│   ├── transform_data.py      # Applies transformations to raw data  
│   ├── write_data.py          # Writes cleaned data to intermediate storage or output  
│   ├── utils (subpackage) 
│   │   ├── __init__.py        # Organises imports of modules and their functions within the subpackage  
│   │   ├── connect_db.py      # Manages database connections  
│   │   └── fetch_files.py     # Fetches file paths and performs file-level checks  
├── setup.py                   # Package configuration and metadata  
├── main.py                    # Example script demonstrating usage of the package and subpackage  
└── ... [other files for testing, formatting, documentation, etc.]
```

__Note:__ It is a common convention in Python projects for the top-level project folder and the importable package (main package) folder to share the same name.

While it is beyond the scope of this article to discuss each module in detail, the source code for all of them is provided at the end. The code is well-documented, type-annotated, formatted, and tested for production-level use. In this section, we will only briefly discuss how to structure imports within the `__init__.py` files, as well as at the top-level directory (i.e. outside the package itself), which reflects how the package would be used if installed in an external environment or from a different directory (more on this later).

---

#### 1.1. Structuring Imports

#### a) Importing Modules and Functions from Packages and Sub-packages in Python

When working with a structured Python project, you often want to import modules or specific functions from a package or sub-package into a top-level script like `main.py`. This can be done in multiple ways, depending on your preference for clarity, scope, and modularity.

#### **Method 1: Direct Module Import**

You can import entire modules from a package or sub-package and then access their functions or classes using dot notation.

**Example Import Structure:**

```
my_etl_package/
│
├── main.py
└── my_etl_package/
    ├── __init__.py
    ├── load_data.py
    ├── read_data.py
    ├── transform_data.py
    ├── write_data.py   
    └── utils/
        ├── __init__.py
        ├── connect_db.py 
        └── fetch_files.py
```

**Usage in `main.py`:**

```python
# Importing the module from the package
from my_etl_package import read_data
read_data.read_csv(...)

# or
import my_etl_package.read_data
my_etl_package.read_data.read_csv(...)

# Importing the module from the sub-package
from my_etl_package.utils import fetch_files
fetch_files.list_csv_files(...)

# or
import my_etl_package.utils.fetch_files
my_etl_package.utils.fetch_files.list_csv_files(...)
```

This approach makes it easy to trace where a function or class came from.

#### **Method 2: Direct Function Import via `__init__.py`**

You can expose specific functions or classes at the package level by importing them in the `__init__.py` file of the package or sub-package. This allows direct access to those functions without having to go through the module path.

**Inside `my_etl_package/__init__.py`:**

```python
from my_etl_package.read_data import read_csv
```

**Inside `my_etl_package/utils/__init__.py`:**

```python
from my_etl_package.utils.fetch_files import list_csv_files
```

**Usage in `main.py`:**

```python
# Direct function access after exposing via __init__.py
from my_etl_package import read_csv  
read_csv(...)

from my_etl_package.utils import list_csv_files  
list_csv_files(...)
```

This method helps encapsulate complexity within the package.

Direct function imports via `__init__.py` are better for keeping imports short and clean, especially in top-level scripts. They make the code easier to read and maintain by avoiding long module paths.

These are called `absolute` imports, which are generally preferred. You can also use `relative` imports, where you don't spell out the package name but use `.` (dot) notation: one `.` means the current package, and two (`..`) mean its parent package. For example:

**Inside `my_etl_package/__init__.py`:**

```python
from .read_data import read_csv
```

**Inside `my_etl_package/utils/__init__.py`:**

```python
from .fetch_files import list_csv_files
```

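This is shorter; however, a module that uses relative imports cannot be run directly as a script (Python raises an `ImportError` because there is no known parent package), so absolute imports are the safer default.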

#### **Method 3: Chained Exposure via `__init__.py`**

To allow top-level access to deeply nested modules or functions, you can chain imports through the `__init__.py` files of each package or sub-package. To do that, first import the modules, functions, and sub-packages into the `__init__.py` file of the package, and then import the modules and functions of the sub-package into the `__init__.py` file of the sub-package.

**Inside `my_etl_package/__init__.py`:**

```python
from my_etl_package import utils
from my_etl_package import read_data
from my_etl_package.read_data import read_csv
```

**Inside `my_etl_package/utils/__init__.py`:**

```python
from my_etl_package.utils import fetch_files
from my_etl_package.utils.fetch_files import list_csv_files
```

**Usage in `main.py`:**

```python
import my_etl_package

my_etl_package.read_data.read_csv(...)
# or
my_etl_package.read_csv(...)

my_etl_package.utils.fetch_files.list_csv_files(...)
# or
my_etl_package.utils.list_csv_files(...)
```
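
The three import styles above can be tried end to end without touching the real project. The sketch below writes a throwaway package to a temporary directory and imports it; the package name `demo_etl_package` and the one-line function bodies are assumptions made purely for illustration, not the actual source of `my_etl_package`.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical throwaway package name, chosen to avoid clashing with real packages
PKG = "demo_etl_package"

# File contents for the demo package (function bodies are placeholders)
FILES = {
    f"{PKG}/read_data.py": "def read_csv(path):\n    return 'read ' + path\n",
    f"{PKG}/utils/fetch_files.py": (
        "def list_csv_files(folder):\n    return [folder + '/a.csv']\n"
    ),
    # Chained exposure: the sub-package exposes its module and function...
    f"{PKG}/utils/__init__.py": (
        f"from {PKG}.utils import fetch_files\n"
        f"from {PKG}.utils.fetch_files import list_csv_files\n"
    ),
    # ...and the top-level package exposes the sub-package, module, and function
    f"{PKG}/__init__.py": (
        f"from {PKG} import utils\n"
        f"from {PKG} import read_data\n"
        f"from {PKG}.read_data import read_csv\n"
    ),
}


def import_demo_package():
    """Write the demo package to a temp directory, import it, and call it."""
    with tempfile.TemporaryDirectory() as tmp:
        for relative_path, source in FILES.items():
            target = Path(tmp) / relative_path
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(source)
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            pkg = importlib.import_module(PKG)
            return {
                "module path": pkg.read_data.read_csv("data.csv"),
                "package level": pkg.read_csv("data.csv"),
                "sub-package module": pkg.utils.fetch_files.list_csv_files("raw"),
                "sub-package level": pkg.utils.list_csv_files("raw"),
            }
        finally:
            sys.path.remove(tmp)


results = import_demo_package()
for style, value in results.items():
    print(f"{style}: {value}")
```

Every access path resolves to the same function object, so the choice between them is about readability rather than behaviour.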

#### b) Importing Between Modules (Sibling Imports) or Modules in Different Levels

Similarly, both `absolute` and `relative` imports can be used. For example:

**Inside `my_etl_package/transform_data.py`:**

```python
from .read_data import read_csv
# or
from my_etl_package.read_data import read_csv
```

**Inside a module in `my_etl_package/utils/`, importing from a module (`test.py`) one level up (in `my_etl_package/`):**

```python
from ..test import test_func
```

#### Important Note: Avoid Circular Imports When Importing Between Modules

When importing functions between modules within the same package, **do not import them from the package level** (i.e., avoid accessing them via the `__init__.py`-exposed shortcut). For example:

In `my_etl_package/__init__.py`:

```python
from my_etl_package.read_data import read_csv
```

**Incorrect – will likely cause a circular import error:**

In `transform_data.py`:

```python
from my_etl_package import read_csv  # Causes circular import
```

This causes a circular import because:

* `transform_data.py` depends on a function (`read_csv`) exposed at the **package level** via `__init__.py`.
* But the package level (`__init__.py`) may itself depend on `transform_data.py` or other modules — creating an import loop.

**Correct – use direct modular imports between modules:**

In `transform_data.py`:

```python
from my_etl_package.read_data import read_csv  # Safe and clear
```

This avoids unnecessary coupling and prevents circular dependencies.
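
The failure mode can be reproduced in isolation. The following sketch builds two tiny throwaway packages whose `__init__.py` imports `transform_data` before `read_csv` is exposed at package level; the package names `circ_demo_bad` and `circ_demo_ok` and the helper `import_with` are hypothetical. Whether the error appears in a real project depends on the order of the imports in `__init__.py`, which is exactly why the package-level shortcut is fragile inside the package.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def import_with(package_name, transform_import):
    """Build a tiny package whose transform_data.py starts with
    `transform_import`, then try to import it and report the outcome."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg_dir = Path(tmp) / package_name
        pkg_dir.mkdir()
        (pkg_dir / "read_data.py").write_text(
            "def read_csv():\n    return 'rows'\n"
        )
        (pkg_dir / "transform_data.py").write_text(
            f"{transform_import}\n\n"
            "def transform():\n    return read_csv().upper()\n"
        )
        # __init__.py imports transform_data BEFORE read_csv is exposed
        (pkg_dir / "__init__.py").write_text(
            f"from {package_name}.transform_data import transform\n"
            f"from {package_name}.read_data import read_csv\n"
        )
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            return importlib.import_module(package_name).transform()
        except ImportError as exc:
            return f"ImportError: {exc}"
        finally:
            sys.path.remove(tmp)


# Package-level shortcut inside the package: fails while __init__.py is still running
bad = import_with("circ_demo_bad", "from circ_demo_bad import read_csv")

# Direct module import: works regardless of what __init__.py has exposed so far
good = import_with("circ_demo_ok", "from circ_demo_ok.read_data import read_csv")

print(bad)
print(good)
```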

__Note:__ `main.py` is not part of the package (added optionally here). It serves as a testing ground/implementation for various parts of the package from the package’s parent or top-level directory.

---

#### 1.2. Setting up the Package

You've developed all the modules and sub-packages in your package, but right now it's only available in the package's top-level directory. This means it's inaccessible outside that directory even within the same environment on your local machine. To be able to import and use it in any other directory or script within the same environment, you must first install it locally within that environment.

Once installed, the import structure will remain the same regardless of the script's location, as long as you're running the code within the same environment. If you want to use the package in a different environment, you'll need to install it there separately.

To install the package locally in your working environment, you need a `setup.py` file. This file tells Python what to install and includes metadata about your package. Below is a sample `setup.py` file:

```python
from setuptools import setup, find_packages
import io
import os


# Read README.md for the long description
with io.open(
    os.path.join(os.path.dirname(__file__), "README.md"), encoding="utf-8"
) as f:
    long_description = f.read()

setup(
    name="my_etl_package",
    version="1.0.0",
    description="A package for ETL pipeline operations",
    long_description=long_description,
    long_description_content_type="text/markdown",
    author="Khaled Ahmed",
    author_email="khhaledahmaad@gmail.com",
    packages=find_packages(include=["my_etl_package", "my_etl_package.*"]),
    python_requires=">=3.10",
    install_requires=[
        "python-dotenv",
        "numpy",
        "pandas",
        "sqlalchemy",
    ],
    classifiers=[
        "Development Status :: 4 - Beta",
        "Environment :: Other Environment",
        "Intended Audience :: Developers",
        "License :: OSI Approved :: MIT License",
        "Operating System :: OS Independent",
        "Programming Language :: Python",
        "Programming Language :: Python :: 3.10",
        "Programming Language :: Python :: 3.11",
        "Topic :: Software Development",
        "Topic :: Software Development :: Libraries :: Python Modules",
    ],
)
```

You can then install the package in **editable mode**, which allows you to modify the package source code without needing to reinstall it each time:

```bash
pip install -e .
```
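
To confirm that the install is visible from the environment you are currently running in, one option is to query the installed distribution metadata with the standard library. This is a minimal sketch assuming Python 3.8 or newer, where `importlib.metadata` is available; `installed_version` is a hypothetical helper name, not part of the package.

```python
from importlib import metadata


def installed_version(distribution_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


# Prints the version declared in setup.py if the package is installed in
# this environment, otherwise None
print(installed_version("my_etl_package"))
```

Running the same snippet from a different virtual environment should print `None` until the package is installed there as well.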

Once you've finalised your package, you can list and save all the dependencies by running the following command from the top-level directory:

```bash
pip freeze > requirements.txt
```

This is different from the `install_requires` parameter in the `setup.py` file:

* **`install_requires`** is intended for **users** of the package. It specifies the core dependencies with version flexibility to ensure compatibility.
* **`requirements.txt`** is typically for **developers**. It lists **exact versions** of all dependencies in the current environment, ensuring consistent builds and reproducibility.

#### **What are classifiers?**

Classifiers (also called **Trove classifiers**) are **standardised metadata tags** for Python packages. They are a set of predefined strings that describe:

* The **maturity** of the project
* Its **audience**
* The **programming language versions** it supports
* The **environment** it runs in
* Its **topic or purpose**

You can see the full list here: [PyPI Classifiers](https://pypi.org/classifiers/)

#### **Purpose of classifiers**

1. **Help users discover your package**

   * On PyPI (where developers usually publish their package to make it usable for other users—more on this later), users can filter or search packages by these tags.
   * Example: If someone searches for “Python 3.11 libraries for data processing,” valid classifiers make your package show up in the results.

2. **Give clear metadata about your package**

   * Developers and tools can immediately understand:

     * Supported Python versions (`Programming Language :: Python :: 3.10`)
     * License (`License :: OSI Approved :: MIT License`)
     * Intended audience (`Intended Audience :: Developers`)

3. **Prevent metadata errors during upload**

   * PyPI validates these classifiers when you upload a package.
   * **Invalid classifiers** (like `"Topic :: ETL"`) cause **HTTP 400 Bad Request errors**.


#### **Example of what classifiers tell people**

```python
classifiers=[
    "Development Status :: 4 - Beta",  # Package is in beta
    "Intended Audience :: Developers", # For software developers
    "Programming Language :: Python :: 3.11", # Supports Python 3.11
    "License :: OSI Approved :: MIT License", # Open-source license
    "Topic :: Software Development :: Libraries :: Python Modules", # It’s a library
]
```

* This **instantly tells PyPI users** that your package is a beta Python library for developers, open-source, and works on Python 3.11.


For more information on writing a setup script and all the options and metadata you can add, see: [Writing the Setup Script](https://docs.python.org/3.11/distutils/setupscript.html#)

---

## **2. Testing Your Package**
You’ve spent days, maybe weeks, building your Python package. It’s clean, modular, and efficient. But how do you know it actually works? Not just on your machine—but in real-world use cases, across different components, under stress?

This section is a practical guide to testing your Python package properly using tools like `pytest` and `unittest`. Whether you’re working on a personal project, a data pipeline, or preparing to publish your package to `PyPI`, the techniques here will help you ensure reliability and confidence in your code.

#### 2.1. Why Test?

Testing isn't just about finding bugs. It’s about validating assumptions, preventing regressions, and making collaboration sustainable. A well-tested package:

* Catches issues early
* Makes refactoring safer
* Serves as documentation for intended behavior
* Increases confidence in your work

#### 2.2. Understanding Testing Methodologies

Before we dive into writing code for testing, let's briefly understand the different types of testing.

#### **Testing types in a broader context:**

1. **Black Box Testing**
   You test functionality *without* knowing the internal code. Think of using a well-documented external library; you're only concerned about input and output.

2. **White Box Testing**
   You test your *own* code with full knowledge of its logic and structure. You know what the functions are doing internally.

3. **Gray Box Testing**
   A hybrid approach. You have partial knowledge of the system. This is common in integration testing between third-party tools and your own logic.

#### **Testing types at code level:**

1. **Unit Testing**
   Test individual units of functionality, typically functions or classes.

2. **Feature Testing**
   Tests that cover multiple units working together to achieve a specific feature or goal (e.g., a data import workflow).

3. **Integration Testing**
   Tests that check if different components (modules, packages, external services) are working together correctly.

4. **Performance Testing**
   Validates if the functions/modules perform efficiently under specific time constraints or large datasets.

#### 2.3. Tools for Testing

There are several tools you can use for testing in Python. Here's a quick comparison:

`unittest`
A built-in Python module that uses an object-oriented structure for writing test cases.

`pytest`
A third-party testing framework that is simpler, more expressive, and supports fixtures, plugins, and better error reporting out of the box.

`tox`
Used to automate testing across multiple Python environments. Ideal for package developers who want to maintain compatibility across versions.

> Both pytest and unittest can be used interchangeably in many projects. You can mix both styles within the same test directory if needed, though most teams prefer to stick with one for consistency.

| Tool           | Purpose                                         |
| -------------- | ----------------------------------------------- |
| **`unittest`** | Built-in testing module, class-based, OOP-style |
| **`pytest`**   | Lightweight, more readable, feature-rich        |
| **`tox`**      | Automates testing across Python versions        |

In this article, we’ll focus primarily on `pytest`, but also show its close equivalents using `unittest` when relevant.

#### 2.4. Structuring the Test Directory

Organizing your test files clearly is key to maintainability. Ideally, your test directory should mirror the structure of your package, which makes tests intuitive to locate as the codebase grows. There is no strict rule enforcing this structure, but following it is considered good practice.

**Example directory structure:**

```
my_etl_package/
├── my_etl_package/              # Main package with modules/sub-packages
│   ├── __init__.py              
│   ├── load_data.py           
│   ├── read_data.py             
│   ├── transform_data.py      
│   ├── write_data.py           
│   ├── utils 
│   │   ├── __init__.py         
│   │   ├── connect_db.py        
│   │   └── fetch_files.py      
├── test_my_etl_package/          # Test suite root
│   ├── __init__.py
│   ├── conftest.py             # Fixtures and global test setup
│   ├── test_integration.py
│   ├── test_performance.py
│   └── test_utils/
│       ├── __init__.py
│       ├── test_connect_db.py
│       └── test_fetch_files.py
│
├── setup.py
├── main.py
├── ... [other files for testing, formatting, documentation, etc.]
└── tox.ini
```
Any test module or script inside the test directory can be run directly using the pytest command:
```bash
pytest test_my_etl_package/test_integration.py
```
#### Naming conventions:

* Prefix all test folders and files with `test_`
* Use descriptive but concise names: `test_fetch_files.py`, `test_integration.py`
* Follow the same import structure in `__init__.py` files as in your main package

#### 2.5. Writing Test Cases

Let’s say you have a utility function that squares a number:

```python
# your_package/math_utils.py
def square(x):
    return x * x
```

#### Using `pytest`

```python
# test_utils/test_math_utils.py
from your_package.math_utils import square

def test_square():
    assert square(2) == 4
    assert square(5) == 25
    assert square(-3) == 9
```

#### Using `unittest`

```python
# test_utils/test_math_utils_unittest.py
import unittest
from your_package.math_utils import square

class TestMathUtils(unittest.TestCase):
    def test_square(self):
        self.assertEqual(square(2), 4)
        self.assertEqual(square(5), 25)
        self.assertEqual(square(-3), 9)

if __name__ == '__main__':
    unittest.main()
```

Both styles work effectively, and the choice depends on your project preference. `pytest` offers simpler syntax and advanced features, while `unittest` provides a structured, class-based approach.

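**Note**: Libraries such as `pandas` and `NumPy` ship their own assertion helpers (for example `pandas.testing.assert_frame_equal` and `numpy.testing.assert_allclose`), which can be used in both `pytest` and `unittest` tests. See: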

* [Python unittest assertions](https://docs.python.org/3/library/unittest.html#test-cases)
* [Pandas testing assertions](https://pandas.pydata.org/docs/reference/testing.html)


#### 2.6. Key Testing Features
The following features are commonly used inside test modules and scripts:

#### a) Fixtures

Fixtures in `pytest` help prepare preconditions for tests, such as sample data, database connections, or configuration. They improve code reuse and test readability.

```python
import pytest

@pytest.fixture
def sample_data():
    return [1, 2, 3]

def test_sample_data_length(sample_data):
    assert len(sample_data) == 3
```

#### b) Markers

Markers help categorise, skip, or handle tests conditionally.

```python
import pytest
import sys

# This test is always skipped with the given reason
@pytest.mark.skip(reason="Not implemented yet")
def test_not_ready():
    assert False

# This test is expected to fail due to a known bug (e.g. division by zero)
# If it fails, pytest reports it as XFAIL; if it unexpectedly passes, pytest reports it as XPASS
@pytest.mark.xfail(reason="Known bug")
def test_bug_behavior():
    assert 1 / 0 == 0

# This test is skipped if the condition is True
# Here, it will be skipped on non-Windows platforms
@pytest.mark.skipif(sys.platform != "win32", reason="Runs only on Windows")
def test_windows_only_feature():
    assert True
```
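
For projects that use `unittest`, the standard library offers close counterparts to these markers: `unittest.skip`, `unittest.expectedFailure`, and `unittest.skipUnless` (the inverse of a skip-if condition). The sketch below mirrors the three `pytest` tests above; the class name `TestMarkerEquivalents` is chosen only for illustration.

```python
import io
import sys
import unittest


class TestMarkerEquivalents(unittest.TestCase):
    # Always skipped with the given reason
    @unittest.skip("Not implemented yet")
    def test_not_ready(self):
        self.fail("never runs")

    # Expected to fail because of a known bug (division by zero)
    @unittest.expectedFailure
    def test_bug_behavior(self):
        self.assertEqual(1 / 0, 0)

    # Runs only when the condition is True, i.e. skipped on non-Windows platforms
    @unittest.skipUnless(sys.platform == "win32", "Runs only on Windows")
    def test_windows_only_feature(self):
        self.assertTrue(True)


# Run the tests programmatically and inspect how each one was classified
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestMarkerEquivalents)
result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)

print("skipped:", [reason for _, reason in result.skipped])
print("expected failures:", len(result.expectedFailures))
print("run successful:", result.wasSuccessful())
```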

#### c) Temporary Files & Directories

When testing file operations, use `pytest`'s `tmp_path` fixture (a `pathlib.Path`), or the legacy `tmpdir` fixture, to work with temporary paths:

```python
def test_file_write(tmp_path):
    test_file = tmp_path / "output.txt"
    test_file.write_text("sample content")
    assert test_file.read_text() == "sample content"
```

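pytest creates a fresh directory for each test and cleans up old ones automatically; by default it keeps the temporary directories from the last three test runs so you can inspect them.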

#### d) Setup and Teardown Methods

Some tests require setting up resources before running (e.g., creating a database connection), and cleaning them afterward. These steps can be handled using **setup** and **teardown** methods.

**Using `pytest` fixtures:**

```python
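import pytest

@pytest.fixture
def db_connection():
    conn = create_connection()  # placeholder for your own connection helper
    yield conn                  # the test runs at this point
    conn.close()                # teardown: runs after the test finishes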
```

**Using `unittest`:**

```python
import unittest

class TestDB(unittest.TestCase):
    def setUp(self):
        self.conn = create_connection()

    def tearDown(self):
        self.conn.close()
```

These methods help ensure tests are isolated and don’t affect each other, especially when shared resources (like databases, files, or APIs) are involved. They also clean up unnecessary memory and resource usage after each test run.

**Notes:**

* **In `unittest`:**

  * `setUp()` is called **before each test method** in the class to prepare the environment (e.g., open DB connections).
  * `tearDown()` is called **after each test method** to clean up (e.g., close connections, release memory).
  * These methods are automatically run for every test, regardless of whether the test needs the resource or not.

* **In `pytest`:**

  * `@pytest.fixture` functions run **before each test** that explicitly uses the fixture (by passing it as a parameter).
  * Code placed **after `yield`** in the fixture acts as the teardown; it runs **after the test finishes**.
  * Fixtures give more control: they can be shared, scoped (per function/module/session), and reused across multiple tests or files.

This helps ensure test isolation and avoids shared state side effects.

#### e) Mocking

Mocking is used to simulate external dependencies or environment conditions, allowing tests to run in isolation and reliably without needing real external resources.

In this example, we use `patch.dict` to mock environment variables and test that the `PostgresConnector` utility from our package creates the correct database connection string without requiring an actual environment setup:

```python
import os
import pytest
from my_etl_package.utils import PostgresConnector
from unittest.mock import patch
from sqlalchemy.engine import Engine


# Connector to the database
connector = PostgresConnector()


def test_get_db_connection():
    """
    Test that PostgresConnector generates the correct SQLAlchemy engine
    when all required environment variables are present.
    """
    # Load env vars from the connector
    ENV_VARS = {
        "DB_HOST": connector.host,
        "DB_NAME": connector.database,
        "DB_USER": connector.user,
        "DB_PASSWORD": connector.password,
        "DB_PORT": connector.port,
    }

    with patch.dict(os.environ, ENV_VARS):
        engine = connector.get_db_connection()

        # Assert type
        assert isinstance(engine, Engine)

        # Assert connection string
        actual = engine.url.render_as_string(hide_password=False)
        expected = f"postgresql://{connector.user}:{connector.password}@{connector.host}:{connector.port}/{connector.database}"
        assert actual == expected
        
```

**Notes:**

* The `patch.dict(os.environ, ENV_VARS)` temporarily **overrides environment variables** just for this test.
* This allows testing the behavior of `get_db_connection()` **without relying on real environment setup**.
* Mocking environment variables ensures tests are **deterministic, isolated, and safe to run anywhere**.
* The patch only applies within the `with` block scope, so it doesn't affect other tests or the global environment.

Mocking lets you control the return values of functions and track how they're called, enabling precise unit tests without invoking real dependencies.
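
Controlling return values and tracking calls can be illustrated with `unittest.mock.Mock` from the standard library. The sketch below is independent of the package: `build_db_url`, `load_rows`, and the values in `FAKE_ENV` are hypothetical and exist only to show the pattern.

```python
import os
from unittest.mock import Mock, patch


def build_db_url():
    """Hypothetical helper: build a connection URL from environment variables."""
    env = os.environ
    return (
        f"postgresql://{env['DB_USER']}:{env['DB_PASSWORD']}"
        f"@{env['DB_HOST']}:{env['DB_PORT']}/{env['DB_NAME']}"
    )


def load_rows(rows, engine_factory):
    """Hypothetical loader: create an engine for the URL and insert the rows."""
    engine = engine_factory(build_db_url())
    return engine.insert(rows)


# Fake credentials, used only inside the test
FAKE_ENV = {
    "DB_USER": "tester",
    "DB_PASSWORD": "secret",
    "DB_HOST": "localhost",
    "DB_PORT": "5432",
    "DB_NAME": "etl_test",
}


def test_load_rows_uses_mocked_dependencies():
    rows = [{"id": 1}, {"id": 2}]
    fake_engine = Mock()
    fake_engine.insert.return_value = 2  # control what the dependency returns
    engine_factory = Mock(return_value=fake_engine)

    with patch.dict(os.environ, FAKE_ENV):
        inserted = load_rows(rows, engine_factory)

    assert inserted == 2
    # Track how the dependencies were called
    engine_factory.assert_called_once_with(
        "postgresql://tester:secret@localhost:5432/etl_test"
    )
    fake_engine.insert.assert_called_once_with(rows)
```

No database is contacted at any point: the test only checks that the loader passes the right URL and rows to whatever engine it is given.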

#### f) Performance Benchmarking

Using the `pytest-benchmark` plugin, you can test how efficiently your code runs:

```python
def test_sorting_speed(benchmark):
    data = list(range(1000, 0, -1))  # Create a reversed list of 1000 numbers
    benchmark(sorted, data)           # Benchmark the built-in sorted function on this data
```

* `benchmark` is a pytest fixture that runs the given callable multiple times to measure performance.
* The first argument to `benchmark` is the callable (`sorted`), and the remaining arguments are passed on to it (`data`).
* This runs `sorted(data)` repeatedly and reports the timing statistics.

If you wanted to do the same with the decorator style, it would look like:

```python
def test_sorting_speed_decorator(benchmark):
    data = list(range(1000, 0, -1))

    @benchmark
    def run_sort():
        sorted(data)
```
Both are valid and produce similar benchmarking results.

This is particularly useful when optimising data transformations or algorithms for large-scale use.

#### 2.7. Pytest Configuration File (`conftest.py`)

The `conftest.py` file is used by pytest to set up configuration and fixtures shared across multiple test modules. In this example, it loads environment variables from a `.env` file located at the root of your project directory. This setup ensures that environment-dependent configurations, such as database credentials or API keys, are available during test runs.

The environment variables are loaded using the `python-dotenv` package, with the `.env` file path explicitly resolved relative to the location of the `conftest.py` file:

```python
from dotenv import load_dotenv
from pathlib import Path

# Load .env file from the root directory of the project
load_dotenv(dotenv_path=Path(__file__).resolve().parents[1] / ".env")
```

Here, `Path(__file__).resolve().parents[1]` is the directory two levels above the `conftest.py` file itself: `parents[0]` is the test directory that contains it, and `parents[1]` is the project root, where the `.env` file is stored. This approach ensures consistent loading of environment variables regardless of the current working directory when tests are executed.
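
The indexing of `parents` can be checked on a plain path object. This sketch uses `PurePosixPath` with a made-up absolute path (`/home/user/...` is an assumption) so that it behaves the same on every operating system.

```python
from pathlib import PurePosixPath

# Hypothetical location of conftest.py inside the project
conftest_path = PurePosixPath(
    "/home/user/my_etl_package/test_my_etl_package/conftest.py"
)

test_dir = conftest_path.parents[0]      # directory containing conftest.py
project_root = conftest_path.parents[1]  # one directory above the test directory
env_file = project_root / ".env"

print(test_dir)      # /home/user/my_etl_package/test_my_etl_package
print(project_root)  # /home/user/my_etl_package
print(env_file)      # /home/user/my_etl_package/.env
```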

#### 2.8. Running Tests

You can run all your tests from the command line in the top-level directory of the package:

```bash
pytest               # discovers and runs all tests
pytest -v            # verbose output
pytest -k "pattern"  # run tests matching pattern
pytest --maxfail=2   # stop after two failures
```

To run the tests across Python versions with `tox`, run the following from the top-level directory of the package:

```bash
tox
```

Your `tox.ini` would look like:

```ini
[tox]
envlist = py310, py311

[testenv]
deps =
    pytest
    python-dotenv
    sqlalchemy
    psycopg2-binary
    pytest-benchmark
commands =
    pytest --doctest-modules
```

This configuration:

* Creates virtual environments for Python 3.10 and 3.11
* Installs necessary dependencies for the test suite
* Runs all `pytest` tests and, through the `--doctest-modules` flag, also executes **doctests** embedded in the docstrings of your modules, functions, and classes

#### Notes:

* **`pytest`** can run both standard test scripts and doctests. So even if your code uses `unittest`, you can still run those tests via `pytest` since it’s compatible with `unittest`-style tests. `pytest` will pick up and run `unittest` test cases by default.
* **`doctest`** allows you to validate example code embedded in your documentation. These are often used in function or class docstrings to demonstrate expected usage and output.
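
As an illustration of what `--doctest-modules` picks up, here is the earlier `square` function with usage examples in its docstring. Running the examples programmatically through `doctest.DocTestParser` and `doctest.DocTestRunner` is only one way to show the mechanism in a self-contained snippet; in the project itself, `pytest --doctest-modules` does the collecting for you.

```python
import doctest


def square(x):
    """Return x multiplied by itself.

    >>> square(3)
    9
    >>> square(-2)
    4
    """
    return x * x


# Extract the examples from the docstring and execute them
parser = doctest.DocTestParser()
test = parser.get_doctest(square.__doc__, {"square": square}, "square", None, 0)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)

print(f"attempted={results.attempted}, failed={results.failed}")
# attempted=2, failed=0
```

If the implementation changes so that an example no longer produces the documented output, the doctest fails, which keeps the documentation honest.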

This setup enables a robust, scalable, and maintainable testing strategy for your Python package. Whether you choose `pytest`, `unittest`, or both, the key is to stay consistent and prioritize clarity and coverage. Let your tests guide your development, not just validate it afterward.

Testing is not just a chore, it’s a superpower. With a proper testing framework in place, you can:

- Avoid regressions
- Ship faster with confidence
- Build reliable, scalable, maintainable packages

Start small. Cover critical paths first. Add more as you go. The tools are powerful and the payoff is huge.

---

## **3. Increasing Package Quality**

In [**Part 1**](https://medium.com/@khhaledahmaad/some-advanced-software-engineering-principles-for-writing-clean-reusable-python-code-part-1-cc518b97e422), we discussed documentation, type annotations, and enforcing code standards using linters in line with the [PEP 8 – Style Guide for Python Code](https://peps.python.org/pep-0008/). Please refer to sections 2, 3, and 12 for a recap.

You can apply these practices manually or automate them using tools like:

* [`pyment`](https://github.com/dadadel/pyment): to generate docstrings
* [`monkeytype`](https://github.com/Instagram/MonkeyType): to infer and apply type hints
* [`flake8`](https://flake8.pycqa.org): to check Python code style and standards

In this part, we’ll include a **configuration file** for `flake8` to automatically check all scripts and modules in the project. This ensures consistent formatting and helps catch style violations across the package. Below is an example configuration file (`setup.cfg`), placed in the top-level directory (the same location as `setup.py`):

```ini
[flake8]

ignore =
    E501

exclude =
    .git,
    __pycache__,
    build,
    dist,
    .tox,
    .eggs,
    my_etl_package.egg-info,
    .benchmarks,
    .pytest_cache

per-file-ignores =
    __init__.py: F401

```

With this configuration, running `flake8` in the terminal from the top-level directory will recursively check all relevant Python files while skipping unwanted directories, files, and style warnings.

* **ignore**: Specifies which error codes to skip reporting. In this case, it tells flake8 to ignore warnings about lines being too long (E501).

* **exclude**: Lists files and directories that flake8 should skip entirely, such as version control folders, build output, package metadata, and the caches left behind by test tools.

* **per-file-ignores**: Defines specific warnings or errors to ignore but only for certain files. Here, it tells flake8 not to warn about unused imports in all `__init__.py` files, which is common since those files often import modules to make them available without using them directly.

To check an individual file, run the following from the package's top-level directory:
```bash
flake8 <package/module.py>
```

To check every file according to the configuration in `setup.cfg`, run `flake8` without arguments from the package's top-level directory:
```bash
flake8
```
__Example Output:__

```bash
.\my_etl_package\transform_data.py:14:80: E501 line too long (81 > 79 characters)
.\my_etl_package\transform_data.py:19:80: E501 line too long (90 > 79 characters)
.\my_etl_package\transform_data.py:20:80: E501 line too long (85 > 79 characters)
.\my_etl_package\transform_data.py:24:80: E501 line too long (87 > 79 characters)
.\my_etl_package\transform_data.py:26:80: E501 line too long (92 > 79 characters)
.\my_etl_package\utils\connect_db.py:14:80: E501 line too long (95 > 79 characters)
.\my_etl_package\utils\connect_db.py:26:80: E501 line too long (84 > 79 characters)
.\my_etl_package\utils\connect_db.py:32:80: E501 line too long (83 > 79 characters)
.\my_etl_package\utils\connect_db.py:34:80: E501 line too long (85 > 79 characters)
.\my_etl_package\utils\connect_db.py:48:80: E501 line too long (102 > 79 characters)
.\test_my_etl_package\test_performance.py:40:80: E501 line too long (83 > 79 characters)
.\test_my_etl_package\test_utils\test_connect_db.py:34:80: E501 line too long (127 > 79 characters)
.\test_my_etl_package\test_utils\test_connect_db.py:50:80: E501 line too long (84 > 79 characters)
```
After adding the following lines to the `[flake8]` section of `setup.cfg`, running `flake8` again returns nothing, because the code now satisfies every configured standard:

```ini
ignore =
    E501
```
As simple as that, `flake` it, fix it, and `flake` it again until there’s nothing left to `flake`!

To learn more about the error codes mentioned here, see the [`PEP8 Error Codes`](https://pep8.readthedocs.io/en/release-1.7.x/intro.html#error-codes) reference.
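
When a report is long, it can help to see which codes dominate before deciding what to fix or ignore. The sketch below parses the `path:line:column: CODE message` format shown in the example output above and counts occurrences per code. The helper name `count_by_code` is an assumption for illustration; flake8's own `--statistics` flag gives a similar summary.

```python
import re
from collections import Counter

# Matches flake8's default report format: path:line:column: CODE message
REPORT_LINE = re.compile(
    r"^(?P<path>.+):(?P<line>\d+):(?P<col>\d+): (?P<code>[A-Z]+\d+) (?P<message>.*)$"
)


def count_by_code(report: str) -> Counter:
    """Count how often each error code appears in flake8's text output."""
    counts = Counter()
    for line in report.splitlines():
        match = REPORT_LINE.match(line.strip())
        if match:  # Blank or unrelated lines are skipped.
            counts[match.group("code")] += 1
    return counts


# Two lines copied from the example output; the raw string keeps the backslashes intact.
sample = r"""
.\my_etl_package\transform_data.py:14:80: E501 line too long (81 > 79 characters)
.\my_etl_package\utils\connect_db.py:14:80: E501 line too long (95 > 79 characters)
"""
print(count_by_code(sample))  # Counter({'E501': 2})
```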

---

### 4. Publishing your Package 

__So You Built a Python Package, Tested and Improved it… Now What?__

You’ve written the code; you’ve tested it until your terminal cried; you’ve improved the quality until even your linter gave you a thumbs up. But now what?

Let’s be honest, if you're the only one using your beautiful piece of software, then congratulations, you’ve just made a very fancy personal tool. But what if your package could be *the next pandas*, *the next requests*, *the next... whatever solves a real problem*?

Welcome to the final and often forgotten boss level of package development: **making it open-source ready**. That means sharing it in a way that’s usable, reproducible, and maintainable by others, not just future-you at 2am in three months.

Here’s a walk-through of the most essential files and steps you need before you can proudly `pip install` your package from `PyPI`, and maybe, just maybe, watch your GitHub repo gain some stars.

#### What Is PyPI?

[**PyPI**](https://pypi.org/), short for the *Python Package Index*, is the **official third-party software repository** for Python. It’s where developers publish open-source Python packages so others can install them easily using `pip`, the standard command-line tool for installing and managing Python packages from sources like `PyPI`.

Whether it’s `pandas`, `numpy`, or that oddly named CLI tool that turns YAML into poetry, it probably lives on `PyPI`.

#### Why Use PyPI?

* Makes your package **installable with one command**:

  ```bash
  pip install your-package-name
  ```

* Gives your project **visibility** in the Python community

* Enables **versioned releases** and proper dependency management

* Allows others to **integrate, extend, and contribute** to your work

In short, if you want your project to be more than just a GitHub repo, publishing it on `PyPI` is the way to turn it into an open-source tool for the world.

Below are some of the key steps you might need to follow before you publish your package to `PyPI`.


#### Step 0: Start With a Cookie (Template)

Before diving into the individual files, let’s not reinvent the wheel. Use [`cookiecutter`](https://cookiecutter.readthedocs.io/en/latest/), a command-line utility that helps you generate a professional Python package scaffold with all the essential pieces baked in.

Instead of manually creating `setup.py`, `README.md`, `tests/`, and more, Cookiecutter gives you a clean, modular structure right out of the box. All you need to do is copy your package code, including sub-packages, modules, and tests, into the generated template.

That’s it, no more fiddling with boilerplate. Of course, you’ll likely need to tweak some of the generated files to match your codebase and project goals, but the heavy lifting is already done.

To get started, run the following in the project's top-level directory:

```bash
pip install cookiecutter
cookiecutter https://github.com/audreyfeldroy/cookiecutter-pypackage
```

Follow the prompts and you’ll get a clean, production-ready project structure with many of the files we’re about to talk about.

#### 4.1. Key Project Files (And What They’re For)

These files aren’t just “nice to have”; they’re essential for a professional, open-source-ready project. Each one is briefly described below (see the GitHub repository linked in the Source Code section for more details and examples). The list is not exhaustive, but it covers the basics you need to get started (use a `cookiecutter` template for a more modern package structure with these files generated automatically).

#### a) `CONTRIBUTING.md`: How Others Can Help

Outlines how contributors should get started. Include:

* Steps to clone and set up the environment
* Code style guidelines
* How to run tests and submit pull requests

#### b) `LICENSE.md`: Legal Permissions

Specifies how your code can be used. Popular choices for open-source:

* **MIT License**: Simple and permissive
* **Apache License 2.0**: Permissive with patent protection
* **GNU General Public License v3.0**: Requires derived works to be open-source

Place this file in the root directory, where GitHub detects it automatically. Alternatively, add a license when you create the GitHub repository.

#### c) `MANIFEST.in`: Package Non-Python Files

Ensures required data (e.g. README, configs, assets) are included during distribution.

Example:

```txt
include CONTRIBUTING.md
include HISTORY.md
include LICENSE
include README.md
```

#### d) `README.md`: The First Impression

Explains what the project does, how to install it, and how to use it.
Essential sections:

* Overview
* Installation
* Basic usage
* Contribution guidelines

#### e) `HISTORY.md`: What Changed

Keeps track of version history. Helps users understand what's new, fixed, or removed in each release.

Format:

```md
# Changelog

All notable changes to this project will be documented in this file.

## [1.0.0] - 2025-06-03
### Added
- Initial release of `my_etl_package` package.
- Included core ETL pipeline functionality.
- Added subpackage `my_etl_package.utils` with utility functions.
- Added dependencies: numpy, pandas, sqlalchemy, psycopg2-binary, python-dotenv.
- Configured basic setup.py, tox.ini, setup.cfg for testing and linting.
```

#### f) `.env` and `.env.example`: Keep Secrets Safe

Your `.env` file contains sensitive environment variables and must **not** be committed.

Instead:

* Create a `.env.example` with placeholder values
* Add `.env` to `.gitignore`
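
A minimal sketch of how code can consume those variables without hard-coding secrets. The variable names (`DB_USER`, `DB_PASSWORD`, `DB_HOST`, `DB_PORT`, `DB_NAME`) and the helper `build_database_url` are assumptions for illustration, so the package's actual names may differ. Loading the `.env` file into the environment is assumed to happen beforehand, for example with `python-dotenv`'s `load_dotenv()`.

```python
import os
from urllib.parse import quote_plus

# Hypothetical variable names; mirror whatever your .env.example defines.
REQUIRED_VARS = ("DB_USER", "DB_PASSWORD", "DB_HOST", "DB_PORT", "DB_NAME")


def build_database_url(env=None) -> str:
    """Build a PostgreSQL URL from environment variables, failing fast if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    # Escape the password so characters such as '@' or ':' cannot break the URL.
    password = quote_plus(env["DB_PASSWORD"])
    return (
        f"postgresql+psycopg2://{env['DB_USER']}:{password}"
        f"@{env['DB_HOST']}:{env['DB_PORT']}/{env['DB_NAME']}"
    )
```

Failing early with a list of the missing names tends to be friendlier for contributors than a connection error deep inside the pipeline, and it points them straight at `.env.example`.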

#### 4.2. Uploading to GitHub

#### Why Upload to GitHub?

Uploading your code to GitHub helps you:

* **Keep track of changes** to your code (version control)
* **Collaborate** with others or get help
* **Showcase your project** to the community (especially if you publish to PyPI)
* **Link your code** on the PyPI page (so users can read it or contribute)

It’s not required for PyPI, but **strongly recommended**.

#### Steps to Upload a Python Project to GitHub

1. **Create a new repo on GitHub**
   Go to [https://github.com/new](https://github.com/new), give it a name, and click "Create repository". 
   
__Note:__ Do not forget to add a Python `.gitignore` and a `license` file while creating the new repo. The versions GitHub generates cover most needs, which saves you from listing every file for Git to ignore and from copying a license text by hand.

2. **Initialize Git locally (if not already)**
   In the project's top-level directory:

```bash
   git init
```

3. **Connect to GitHub**

```bash
   git remote add origin https://github.com/your-username/my_etl_package.git
```

4. **Pull from GitHub**
This pulls any files created on GitHub (such as `.gitignore` and the license file) into your local repository without overwriting any local files.

```bash
    git pull origin main
```

5. **Add all your files**

```bash
   git add .
   git commit -m "Initial commit"
```

6. **Rename the local branch to `main` to match GitHub’s default**
```bash
   git branch -M main
```

7. **Push your code**

```bash
   git push -u origin main
```

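__Note:__ Pushing from the `CLI` requires authentication. The HTTPS remote URL used above expects a personal access token (or a credential manager) rather than your account password. If you prefer `SSH`, use the SSH form of the remote URL (`git@github.com:your-username/my_etl_package.git`) and add an `SSH` key to GitHub by following this official guidance: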
[Generating a new SSH key and adding it to the ssh-agent](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent)

If you have [GitHub Desktop](https://github.com/apps/desktop) installed, you can use it to clone and push repositories without dealing with tokens or `SSH` keys yourself.

#### 4.3. Uploading to PyPI
Before you can publish your code to `PyPI`, you must build the distribution for your package.

#### What Does "Build the Distribution" Mean, and Why Is It Necessary?

When we say "build" in the context of publishing a Python package, we’re referring to **creating a distributable version** of your code that tools like `pip` can understand and install.

This step transforms your raw project files into a **packaged format**, like a `.tar.gz` source archive or a `.whl` (wheel) binary, which can then be uploaded to **PyPI** and installed via `pip`.


#### Why Do You Need to Build?

1. **`pip` installs built packages**

When you install a package by name, `pip` doesn’t clone your repo or read your raw `.py` files; it installs from a `.whl` or `.tar.gz` found on PyPI.

2. **Uploading to PyPI requires built files**

[`twine`](https://twine.readthedocs.io/en/stable/) (a utility for publishing Python packages on PyPI) doesn’t upload your code directly; it uploads the package files in `dist/`.

3. **Each build is version-specific**

If you update your code or bump the version, you must rebuild so the distribution matches the new state.

4. **Packaging tools use metadata**

Building includes project metadata (`setup.py` or `pyproject.toml`, dependencies, etc.) which `pip` needs. Make sure the package name doesn’t conflict with existing packages already published and available.
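
On the naming point, PyPI compares *normalised* names, so `my_etl_package`, `my-etl-package`, and `My.ETL.Package` all count as the same project. The sketch below applies the normalisation rule from PEP 503; the function name `normalise_project_name` is just an illustrative choice.

```python
import re


def normalise_project_name(name: str) -> str:
    """Apply the PEP 503 rule: runs of '-', '_' and '.' become a single '-', lowercased."""
    return re.sub(r"[-_.]+", "-", name).lower()


print(normalise_project_name("my_etl_package"))  # my-etl-package
```

This is also why a distribution built as `my_etl_package-1.0.0` can be installed with `pip install my-etl-package`.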

#### What Happens If You Skip It?

* You’ll try to upload files from an earlier build that don’t match your current version, and PyPI will reject any file whose name and version already exist.
* Your new code won’t be included, so users will install an outdated or broken package.
* Your upload process (`twine`) will literally have nothing to send.
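
A small pre-upload check can guard against stale or missing build output. The sketch below is a hypothetical helper (the name `built_files` and the file-name patterns are assumptions, based on the `.tar.gz` and `.whl` formats described earlier); it lists the files in a `dist/` folder that match a given package name and version.

```python
from pathlib import Path


def built_files(dist_dir: str, name: str, version: str) -> list[str]:
    """Return the sdist and wheel file names in dist_dir that match name and version."""
    folder = Path(dist_dir)
    if not folder.is_dir():
        return []  # Nothing has been built yet.
    sdist_name = f"{name}-{version}.tar.gz"
    wheel_prefix = f"{name}-{version}-"
    return sorted(
        path.name
        for path in folder.iterdir()
        if path.name == sdist_name
        or (path.name.startswith(wheel_prefix) and path.suffix == ".whl")
    )
```

An empty result from `built_files("dist", "my_etl_package", "1.0.0")` would suggest that the package still needs to be rebuilt for that version before uploading.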

#### a) Uploading to TestPyPI (Dry Run)

Before going live, test your distribution on [TestPyPI](https://test.pypi.org/). This lets you catch issues without releasing to the real PyPI. You must register an account with a verified email address and two-factor authentication (2FA) enabled in order to create an API token, which is required to upload your package to TestPyPI using `twine`.

#### Step-by-Step: 
Run the following commands in the project's top-level directory:

1. **Build the distribution:**

```bash
python setup.py sdist bdist_wheel
```

#### So What’s Actually Being Built?

When you run:

```bash
python setup.py sdist bdist_wheel
```

You get:

| File Type                           | Purpose                                                      |
| ----------------------------------- | ------------------------------------------------------------ |
| `dist/my_etl_package-1.0.0.tar.gz`           | Source distribution, like a zipped version of your codebase |
| `dist/my_etl_package-1.0.0-py3-none-any.whl` | Wheel file, a faster-to-install binary format used by pip   |

2. **Check it’s valid:**

```bash
twine check dist/*
```

3. **Upload to TestPyPI:**

```bash
twine upload --repository-url https://test.pypi.org/legacy/ -u __token__ -p "<your_api_token>" --verbose dist/*
```
Simply replace the placeholder (`<your_api_token>`) with your API token. If the upload fails, e.g. due to an invalid classifier in your `setup.py` file, the `--verbose` argument will show exactly what went wrong. If the upload succeeds, the command prints the link to your uploaded package.

4. **Install from TestPyPI:**

```bash
pip install -i https://test.pypi.org/simple/ my-etl-package==1.0.0 --extra-index-url https://pypi.org/simple/
```

The extra `--extra-index-url https://pypi.org/simple/` argument ensures that any dependencies your package needs are downloaded from `PyPI` when they are not available on `TestPyPI`. `TestPyPI` does not mirror the full `PyPI` index, so many common packages (like pandas) aren’t there.
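
To confirm that the install actually landed in the active environment, you can ask the standard library for the installed version. This is a minimal sketch; the helper name `installed_version` is an assumption for illustration, and the distribution name is the one used in the `pip install` command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints the version string if the package is installed in this environment, else None.
print(installed_version("my-etl-package"))
```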

#### b) Uploading to PyPI (Production)

Once the test goes well, you’re ready for the real deal.

1. **Register on PyPI:**
   [https://pypi.org/account/register/](https://pypi.org/account/register/)

2. **Verify your email and enable 2FA:**
   Both are needed to create an API token, which is required to upload your package to PyPI using `twine`.

3. **Upload:**

```bash
twine upload -u __token__ -p "<your_api_token>" --verbose dist/*
```
Simply replace the placeholder (`<your_api_token>`) with your API token. If the upload fails, e.g. due to an invalid classifier in your `setup.py` file, the `--verbose` argument will show exactly what went wrong. If the upload succeeds, the command prints the link to your uploaded package.


#### c) Automating with a Makefile

Why type ten commands when one will do? Here's a `Makefile` to automate everything:

```makefile
.PHONY: all build check upload upload-test clean

build:
	python setup.py sdist bdist_wheel

check:
	twine check dist/*

upload:
	twine upload -u __token__ -p "<your_api_token>" --verbose dist/*

upload-test:
	twine upload --repository-url https://test.pypi.org/legacy/ -u __token__ -p "<your_api_token>" --verbose dist/*

clean:
	python -c "import shutil, glob; [shutil.rmtree(d) for d in glob.glob('dist')]"
	python -c "import shutil, glob; [shutil.rmtree(d) for d in glob.glob('build')]"
	python -c "import shutil, glob; [shutil.rmtree(d) for d in glob.glob('*.egg-info')]"

all: clean build check
```

Put this file in your project root as `Makefile` (no extension).

__Note:__ Please ensure that command lines in the `Makefile` start with tabs, not spaces. Also, do not commit the `Makefile` to GitHub if it contains your TestPyPI/PyPI API tokens. Either add it to `.gitignore` or replace the token with a placeholder.

#### To use it: Run the following commands in the package's top-level directory

Before you can use a Makefile, you need to make sure that `make` is already installed. If it’s not, you can either download it from the official site or install it via Conda:

```bash
make --version                      # Check whether make is already installed
conda install -c conda-forge make   # Install it via Conda if it is missing
```

Once `make` is installed, you can run any of the targets defined in your `Makefile`:
```bash
make all         # Clean, build, and check
make upload      # Upload to PyPI
make upload-test # Upload to TestPyPI
make clean       # Remove build artifacts
```

#### 4.4. Managing Version Numbers

#### Understanding Version Numbers (Semantic Versioning)

Python packages usually use **semantic versioning**, which uses a `MAJOR.MINOR.PATCH` format like `1.2.3`. Each number means something specific:

* **Major version** (first number):
  Increase this when you make backward-incompatible changes, i.e. changes that break how existing users call the package.
  Example: `1.0.0` becomes `2.0.0`

* **Minor version** (second number):
  Increase this when you add new features or improvements that don’t break anything.
  Example: `1.2.0` becomes `1.3.0`

* **Patch version** (third number):
  Increase this when you fix bugs or make tiny changes.
  Example: `1.2.3` becomes `1.2.4`
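
The three rules above can be sketched as a small function. This is a simplified illustration only (the name `bump` is hypothetical, and it ignores pre-release or build suffixes); note how bumping a part resets every part to its right.

```python
def bump(version: str, part: str) -> str:
    """Return the next MAJOR.MINOR.PATCH version after bumping the given part."""
    major, minor, patch = (int(number) for number in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"  # Breaking change: reset minor and patch.
    if part == "minor":
        return f"{major}.{minor + 1}.0"  # New feature: reset patch.
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"  # Bug fix.
    raise ValueError(f"Unknown part: {part!r}")


print(bump("1.0.0", "major"))  # 2.0.0
print(bump("1.2.0", "minor"))  # 1.3.0
print(bump("1.2.3", "patch"))  # 1.2.4
```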

#### Tip:

Every time you make a change and want to release it, bump the version number. Then rebuild the package and upload it again.

#### Managing Version Numbers Automatically with `bumpversion`

Manually editing your version in multiple places (`setup.py`, `__init__.py`, `HISTORY.md`, etc.) gets tedious and error-prone. That’s where [`bumpversion`](https://github.com/c4urself/bump2version) comes in.

It automatically updates version numbers across your project files in a consistent and trackable way, making releases clean and efficient.

#### Installation

```bash
pip install bump2version
```

> ⚠️ Note: The command is still `bumpversion` even though the package is called `bump2version`.

#### Step 1: Configure `.bumpversion.cfg`

Create a `.bumpversion.cfg` file in your project root:

```ini
[bumpversion]
current_version = 1.0.0
commit = True
tag = True

[bumpversion:file:setup.py]
```

This tells `bumpversion` what your current version is, where to update it, and whether to auto-commit and tag.

#### Step 2: Bump the Version

Use one of these commands based on what you want to change:

```bash
bumpversion patch   # 0.1.0 → 0.1.1
bumpversion minor   # 0.1.1 → 0.2.0
bumpversion major   # 0.2.0 → 1.0.0
```

It will:

* Update version numbers in all configured files
* Commit the changes
* Tag the commit (e.g. `v1.0.0`)

__Note:__ By default, the `bumpversion` command only works when the Git working directory is clean, i.e. there are no uncommitted changes. Therefore, please make sure that you commit any changes before bumping the version.

#### Best Practice

Run `make all` and `make upload` right after bumping to build and release the new version, then `push` the commit and its tag to GitHub.

Example:

```bash
bumpversion patch
make all
make upload
git push origin main
git push origin --tags
```

#### 4.5. How to Remove a Package from PyPI/TestPyPI

PyPI **discourages deleting entire packages** after upload, because removing one can break dependencies for others.

However, you **can delete a specific release version**:

1. Go to [https://pypi.org/manage/projects/](https://pypi.org/manage/projects/)
2. Click the version dropdown → Manage version
3. Scroll to the bottom → Delete this version

Be cautious: once a version is deleted, its files cannot be re-uploaded, so you have to bump the version number before uploading again.

You can follow the same steps above for `TestPyPI` by going to the `TestPyPI` website ([https://test.pypi.org/manage/projects/](https://test.pypi.org/manage/projects/)).

For full deletion (rare and requires good reasons), email: `admin@pypi.org`

---

### 5. Source Code
[My ETL Package](https://github.com/khhaledahmaad/my_etl_package)

---

### 6. Conclusion 

Building a Python package from scratch is more than just writing code, it’s about creating reusable, maintainable, and professional-grade software. From structuring modules and sub-packages to implementing robust testing strategies, applying code quality standards, and finally publishing to GitHub and PyPI, each step reinforces best practices in software engineering.

By following the principles outlined in this guide, you ensure that your package is not only functional but also scalable, reliable, and user-friendly. Proper testing, modular design, and adherence to Python standards like PEP 8 make your code easier to maintain and extend, while publishing to PyPI and GitHub allows others to benefit from your work, fostering collaboration and open-source contribution.

Ultimately, creating your own Python package transforms repetitive tasks into clean, reusable tools, empowering you to focus on solving problems rather than rewriting code. Whether for personal projects or professional applications, the skills and practices covered here are an investment in long-term code quality, efficiency, and impact.

---

### 7. References

1. Python Software Foundation, 2023. *The Python Tutorial.* \[online] Available at: [https://docs.python.org/3/tutorial/](https://docs.python.org/3/tutorial/) \[Accessed 3 April 2025].

2. Python Software Foundation, 2023. *Python Modules.* \[online] Available at: [https://docs.python.org/3/tutorial/modules.html](https://docs.python.org/3/tutorial/modules.html) \[Accessed 4 April 2025].

3. Python Software Foundation, 2023. *Packages.* \[online] Available at: [https://docs.python.org/3/tutorial/modules.html#packages](https://docs.python.org/3/tutorial/modules.html#packages) \[Accessed 5 May 2025].

4. Python Software Foundation, 2023. *The `__init__.py` file.* \[online] Available at: [https://docs.python.org/3/reference/import.html#packages](https://docs.python.org/3/reference/import.html#packages) \[Accessed 6 May 2025].

5. Python Software Foundation, 2023. *The Python Standard Library: unittest — Unit testing framework.* \[online] Available at: [https://docs.python.org/3/library/unittest.html](https://docs.python.org/3/library/unittest.html) \[Accessed 2 June 2025].

6. Python Software Foundation, 2023. *Distutils: Writing the Setup Script.* \[online] Available at: [https://docs.python.org/3.11/distutils/setupscript.html](https://docs.python.org/3.11/distutils/setupscript.html) \[Accessed 3 June 2025].

7. Python Software Foundation, 2023. *Installing Python Modules.* \[online] Available at: [https://packaging.python.org/en/latest/tutorials/installing-packages/](https://packaging.python.org/en/latest/tutorials/installing-packages/) \[Accessed 4 July 2025].

8. Python Software Foundation, 2023. *Packaging Python Projects.* \[online] Available at: [https://packaging.python.org/en/latest/tutorials/packaging-projects/](https://packaging.python.org/en/latest/tutorials/packaging-projects/) \[Accessed 5 July 2025].

9. Python Software Foundation, 2023. *PyPI — the Python Package Index.* \[online] Available at: [https://pypi.org/](https://pypi.org/) \[Accessed 3 August 2025].

10. Python Software Foundation, 2023. *Using Python on Different Platforms.* \[online] Available at: [https://docs.python.org/3/using/index.html](https://docs.python.org/3/using/index.html) \[Accessed 4 August 2025].

11. Python Software Foundation, 2023. *doctest — Test interactive Python examples.* \[online] Available at: [https://docs.python.org/3/library/doctest.html](https://docs.python.org/3/library/doctest.html) \[Accessed 5 August 2025].

12. Python Software Foundation, 2023. *Absolute and Relative Imports.* \[online] Available at: [https://docs.python.org/3/reference/import.html#package-relative-imports](https://docs.python.org/3/reference/import.html#package-relative-imports) \[Accessed 10 August 2025].

13. Python Software Foundation, 2023. *Python Style Guide (PEP 8).* \[online] Available at: [https://peps.python.org/pep-0008/](https://peps.python.org/pep-0008/) \[Accessed 15 August 2025].

14. TestPyPI, 2023. *TestPyPI — the Python Package Index for testing.* \[online] Available at: [https://test.pypi.org/](https://test.pypi.org/) \[Accessed 20 August 2025].

15. Twine Project, 2023. *Twine Documentation.* \[online] Available at: [https://twine.readthedocs.io/en/stable/](https://twine.readthedocs.io/en/stable/) \[Accessed 3 September 2025].

---

### Author: [Khaled Ahmed](https://www.linkedin.com/in/ahmedkhaled40/)

### Date Created: _04/09/2025_

<center>~</center>