# Session 2 Activities
## Objectives

* Be comfortable mixing async and sync code using asyncio.
* Understand how Python uses classmethod, staticmethod, and abstract base classes for object design.
* Use dataclasses to simplify boilerplate.
* Know about native Python modules and binary extensions.
* Write code with type hints and validate it with mypy.
* Run tests with pytest and unittest.
* Use the classic Python packaging tools (pip, venv, requirements.txt, setup.py) and know why modern projects move toward poetry.

## Mixing sync and async

Sometimes it is necessary to call sync code from async code, e.g. to use a legacy library or one that is not async-compatible.

If the sync call is "fast enough", one possibility is to call its methods/functions directly, acknowledging that it may block other tasks in the loop. In general, if the sync call involves IO operations or considerable CPU use, it may be useful to run it in a separate thread (IO-bound tasks, where the GIL is not an issue) or in a separate process (CPU-bound tasks).
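
As a minimal sketch of the thread-based option, the snippet below offloads a blocking call with `loop.run_in_executor()` while another coroutine keeps running. The names `blocking_io`, `ticker` and `run_demo` are hypothetical and only illustrate the pattern; `time.sleep()` stands in for a real blocking IO call.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_io(delay: float) -> str:
    """Hypothetical stand-in for a blocking IO call; time.sleep() releases the GIL."""
    time.sleep(delay)
    return "done"


async def ticker(ticks: list[int]) -> None:
    """Records a few ticks to show the event loop is not blocked."""
    for i in range(3):
        ticks.append(i)
        await asyncio.sleep(0.01)


async def run_demo() -> tuple[str, list[int]]:
    loop = asyncio.get_running_loop()
    ticks: list[int] = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        # run_in_executor() returns an awaitable future, so it can be gathered
        # together with ordinary coroutines.
        result, _ = await asyncio.gather(
            loop.run_in_executor(pool, blocking_io, 0.1),
            ticker(ticks),
        )
    return result, ticks
```

From a plain script, `asyncio.run(run_demo())` drives the sketch. For CPU-bound work, a `concurrent.futures.ProcessPoolExecutor` can be passed in place of the thread pool, provided the function and its arguments are picklable.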

In [None]:
# Do not run this example in Jupyter: Jupyter already runs its own event loop, so asyncio.run() would fail.
# Install pandas and fastparquet in the venv before running it.
import asyncio
import pandas as pd
import time
import sys
from itertools import cycle


def load_taxi_data() -> tuple[pd.DataFrame, pd.DataFrame]:
    """Mostly IO-Bound blocking operations"""
    # NY yellow taxi trips Jan 2023
    url = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet"
    print("Loading dataset...")
    df = pd.read_parquet(url, engine="fastparquet")
    print("Dataset loaded:", df.shape)

    # NY taxi zones data
    zones_url = "https://d37ci6vzurychx.cloudfront.net/misc/taxi+_zone_lookup.csv"
    zones = pd.read_csv(zones_url)
    return df, zones


def trips_data_by_borough(trips: pd.DataFrame, zones: pd.DataFrame) -> pd.DataFrame:
    """CPU-Bound blocking operations"""
    start = time.time()

    # Join trips with zone data
    merged = trips.merge(zones, left_on="PULocationID", right_on="LocationID", how="left")

    # Group by Borough and sort by total trips
    agg = (
        merged.groupby("Borough")
        .agg(
            total_trips=("VendorID", "count"),
            avg_distance=("trip_distance", "mean"),
            avg_fare=("fare_amount", "mean"),
        )
        .sort_values("total_trips", ascending=False)
    )

    print("Execution time:", time.time() - start, "seconds")
    return agg


def sync_code() -> pd.DataFrame:
    trips, zones = load_taxi_data()
    res = trips_data_by_borough(trips, zones)
    return res


async def call_sync_code(done: asyncio.Event) -> pd.DataFrame:
    # FIXME: Make the call to sync_code() non-blocking so the spinner can run concurrently
    # and stay visible at all times.
    # Notice that load_taxi_data() and trips_data_by_borough() will block the spinner in this implementation.
    # See the `run_in_executor()` function and pick the most appropriate executor for maximum concurrency.
    # A video showing how the spinner should look is attached.
    res = sync_code()
    done.set()
    return res


async def spinner(done: asyncio.Event) -> None:
    symbols = cycle("/-\\|")
    print()
    for symbol in symbols:
        if done.is_set():
            break
        sys.stdout.write(f"\033[3D[{symbol}]")
        sys.stdout.flush()
        await asyncio.sleep(0.3)
    print("X")


async def main() -> None:
    done = asyncio.Event()
    result, _ = await asyncio.gather(call_sync_code(done), spinner(done))
    print(f"{result=}")


asyncio.run(main())

## ABC, static methods, class methods and properties

1. Modify the `Document` class to enforce that the `author` property and the `text` method are consistently implemented in subclasses.
2. Modify the `Document` class to require subclasses to have a `type_name` class method. This method should return a nice string representation of the document type, e.g. "pdf" for `PDFDocument` and "docx" for `DocXDocument`.
3. Add a method to `Document` to get the MIME type of a file without creating an instance of a `Document` subclass.

In [None]:
from pathlib import Path
import mimetypes
import pdfplumber
import docx


class Document:
    def __init__(self, filepath: Path):
        self.filepath = filepath

    @property
    def mime_type(self) -> tuple[str|None, str|None]:
        """Returns a tuple of the form (mimetype, encoding). If either the type or the encoding
        cannot be guessed the value will be None."""
        return mimetypes.guess_type(self.filepath)


class PDFDocument(Document):
    def __init__(self, filepath: Path):
        super().__init__(filepath)
        self.document = pdfplumber.open(self.filepath)

    @property
    def author(self) -> str:
        return self.document.metadata.get("Author", "")

    def text(self) -> str:
        return "\n".join(page.extract_text_simple() for page in self.document.pages)


class DocXDocument(Document):
    def __init__(self, filepath: Path):
        super().__init__(filepath)
        self.document = docx.Document(self.filepath)

    @property
    def docAuthor(self) -> str:
        return self.document.core_properties.author

    def textExtract(self) -> str:
        return "\n".join(para.text for para in self.document.paragraphs)

In [None]:
pdf = PDFDocument(Path("... .pdf"))
doc = DocXDocument(Path("... .docx"))

In [None]:
pdf.author

In [None]:
doc.docAuthor

In [None]:
pdf.mime_type

In [None]:
doc.mime_type

In [None]:
pdf.text()

In [None]:
doc.textExtract()

## Create a simple Python package and build a wheel

Create a simple Python package project using Poetry. The package is a library to extract text from different document types (use the previous example code and split it into modules).

* Use Poetry to create and manage the project.
* Add the latest versions of pdfplumber and python-docx as dependencies.
* Add black (or a similar formatter) and pytest as dev dependencies.
* Run `poetry install`.
* Implement some basic tests.
* Run the tests with `poetry run pytest`; check the `-v` and `--pdb` arguments.
* Introduce a bug in the code or a failing assertion in a test to drop into the debugger with `pytest`'s `--pdb` argument.
* Make the code work.
* Build with Poetry and check the generated files in `dist`.
* If possible, publish to **PyPI** and add it as a dependency in another test project.

Suggested project layout:
```
docreader/
├── README.md
├── docreader
│   ├── __init__.py
│   ├── document.py
│   ├── docx.py
│   └── pdf.py
├── pyproject.toml
└── tests
    ├── file_examples
    │   ├── example.docx
    │   └── example.pdf
    └── test_pdf.py
```

In [None]:
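# Example test
from pathlib import Path

from docreader import pdf


def test_author():
    # PDFDocument is annotated to take a Path, so pass one to keep mypy happy
    d = pdf.PDFDocument(Path("tests/file_examples/example.pdf"))
    assert d.author == "Eric Idle"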