# Text Mining Test

This notebook contains my solution to the problem. I used a Jupyter notebook so I could document my thinking alongside the code.

The approach I chose (more on this below) is one I previously used in another project, which you can see [here](https://github.com/hp0404/speeches).

Writing the code without comments took me less than an hour. Adding the comments took some extra time, but the whole thing, including setting up the repository, took under 2 hours.

The real challenge for me was not implementing the solution (I reused the approach from my previous projects) but working out the best long-term design (paragraphs and tables, discussed below).

In [1]:
import json
from pathlib import Path

import pandas as pd
import spacy

data_dir = Path("./data").resolve()
input_dir = data_dir / "raw"
input_file = (
    input_dir
    / "solomon-islands-pm-has-millions-in-property-raising-questions-around-wealth.txt"
)

output_dir = data_dir / "processed"

## Loading the model

I chose **spaCy** because it's the simplest tool available. As long as it gets the job done, I prefer to avoid using LLMs (spaCy is faster to implement and more reliable in terms of output).

---

But in a real-life scenario, we would probably write an abstract class with shared attributes (such as model name, version, date used) and methods (`extract_named_entities`, etc.). With this design, we could swap between different backends: spaCy, NLTK, and API calls to LLM services.

An example (in terms of design) is below:

```python
from abc import ABC, abstractmethod
from typing import Any


class NamedEntitiesExtractor(ABC):
    # other shared attributes (e.g. date used) would be passed here as well
    def __init__(self, model_name: str, model_version: str) -> None: ...
    @abstractmethod
    def load_model(self, name: str) -> Any: ...
    @abstractmethod
    def extract_named_entities(self, text: str) -> dict[str, Any]: ...

class SpacyNamedEntitiesExtractor(NamedEntitiesExtractor):
    def load_model(self, name: str) -> Any: ...
    def extract_named_entities(self, text: str) -> dict[str, Any]: ...

class LLMNamedEntitiesExtractor(NamedEntitiesExtractor):
    def load_model(self, name: str) -> Any: ...
    def extract_named_entities(self, text: str) -> dict[str, Any]: ...
```

In [2]:
def load_model(name):
    """Loads spaCy model.

    Doesn't make sense to do more for a simple test.
    """
    return spacy.load(name)

## Parsing the input

In real life, the implementation would depend on our needs (e.g. HTML files need a different approach than PDF files), and we'd probably use a separate class for each kind of input (as the LangChain library does).
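
To make the idea concrete, here is a minimal sketch of dispatching on file type. I'm using plain functions in a registry rather than classes to keep it short, and the names `PARSERS`, `register`, and `parse_any` are hypothetical, not part of this notebook's pipeline.

```python
from pathlib import Path
from typing import Callable

# Hypothetical registry: maps a file suffix to a function returning plain text.
PARSERS: dict[str, Callable[[Path], str]] = {}


def register(suffix: str):
    """Registers a parser function for a given file suffix."""

    def decorator(func: Callable[[Path], str]) -> Callable[[Path], str]:
        PARSERS[suffix] = func
        return func

    return decorator


@register(".txt")
def parse_txt(path: Path) -> str:
    """Plain text needs no conversion."""
    return path.read_text(encoding="utf-8")


# Parsers for .html, .pdf, etc. would be registered the same way.


def parse_any(path: Path) -> str:
    """Picks a parser based on the file suffix."""
    try:
        parser = PARSERS[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"No parser registered for {path.suffix!r}") from None
    return parser(path)
```

Adding a new input format then means registering one more function, without touching the rest of the pipeline.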

A simpler version reads a file and returns its contents *split into **paragraphs***.
I would keep paragraphs as metadata so that we have more **context** about the actual entities,
e.g. we might want to know what other entities were mentioned in the given context (a sentence,
a paragraph, or a whole document).

It's not very relevant in this example as the input is fairly short, but long-term it's better to include as much metadata as possible (and we could always aggregate from sentence to paragraph, from paragraph to document, but not the other way around).
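
As a rough illustration of why the finer level is the safer one to store, the sketch below rolls sentence-level records up to paragraphs with the standard library only. The records are made-up placeholders that follow the same `document_id` / `paragraph_id` / `sent_id` keys used in this notebook, and `to_paragraphs` is a hypothetical helper.

```python
from itertools import groupby
from operator import itemgetter

# Made-up sentence-level records using this notebook's key names.
sentences = [
    {"document_id": "doc", "paragraph_id": 1, "sent_id": 1, "text": "First sentence."},
    {"document_id": "doc", "paragraph_id": 1, "sent_id": 2, "text": "Second sentence."},
    {"document_id": "doc", "paragraph_id": 2, "sent_id": 1, "text": "New paragraph."},
]


def to_paragraphs(records):
    """Rolls sentence-level records up to one record per paragraph."""
    paragraph_key = itemgetter("document_id", "paragraph_id")
    # groupby only merges adjacent items, so sort by the full key first.
    ordered = sorted(
        records, key=itemgetter("document_id", "paragraph_id", "sent_id")
    )
    return [
        {
            "document_id": document_id,
            "paragraph_id": paragraph_id,
            "text": " ".join(record["text"] for record in group),
        }
        for (document_id, paragraph_id), group in groupby(ordered, key=paragraph_key)
    ]


paragraphs = to_paragraphs(sentences)
```

Going the other way (recovering sentence boundaries from stored paragraphs) would mean re-running the sentence splitter.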

In [3]:
def parse_input(input_file):
    """Parses the input file and converts it into a suitable format.

    This function would've been an iterator if we worked with large files.
    """

    def read_file(input_file):
        """Reads a file."""
        with open(input_file, "r", encoding="utf-8") as f:
            contents = f.read()
        return contents

    contents = read_file(input_file)

    # returned as a list of (text, id) tuples
    # so that it's compatible with nlp.pipe(as_tuples=True) used later
    return [
        (paragraph, paragraph_id)
        for paragraph_id, paragraph in enumerate(contents.split("\n\n"), start=1)
    ]
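
For large files, the iterator variant mentioned in the docstring could look roughly like the sketch below. `iter_paragraphs` is a hypothetical name, and it differs slightly from `contents.split("\n\n")`: runs of several blank lines are treated as a single paragraph break, so no empty paragraphs are produced.

```python
from pathlib import Path
from typing import Iterator


def iter_paragraphs(input_file: Path) -> Iterator[tuple[str, int]]:
    """Yields (paragraph, paragraph_id) tuples without loading the whole file."""
    paragraph_id = 0
    buffer: list[str] = []
    with open(input_file, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                buffer.append(line.rstrip("\n"))
            elif buffer:
                # A blank line closes the current paragraph.
                paragraph_id += 1
                yield "\n".join(buffer), paragraph_id
                buffer = []
    if buffer:
        # Flush the last paragraph if the file doesn't end with a blank line.
        paragraph_id += 1
        yield "\n".join(buffer), paragraph_id
```

Because it yields the same `(text, id)` tuples, the generator could be passed to `nlp.pipe(..., as_tuples=True)` in place of the list.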

## Extracting named entities

Below is the most straightforward implementation with spaCy. At a minimum we want to capture each entity's type and its location within the text, plus a lemmatized form of the entity.

For an API call, we'd authenticate with the service of our choice and validate the response with a library like Pydantic to ensure it is valid JSON in the expected shape.
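
A minimal sketch of that validation step is below. To keep it dependency-free I'm using `json` and `dataclasses` from the standard library as a stand-in for Pydantic, and I'm assuming the service returns a JSON array of objects with the fields shown. `Entity` and `parse_llm_response` are hypothetical names.

```python
import json
from dataclasses import dataclass

ALLOWED_LABELS = {"ORG", "PERSON", "GPE"}


@dataclass(frozen=True)
class Entity:
    entity: str
    label: str
    start_char: int
    end_char: int


def parse_llm_response(raw: str) -> list[Entity]:
    """Parses a raw response string and rejects anything off-schema."""
    # json.JSONDecodeError is a subclass of ValueError, so callers
    # only need to handle ValueError.
    payload = json.loads(raw)
    if not isinstance(payload, list):
        raise ValueError("expected a JSON array of entities")

    entities = []
    for item in payload:
        try:
            # TypeError covers non-objects and missing or unexpected keys.
            ent = Entity(**item)
        except TypeError as exc:
            raise ValueError(f"malformed entity: {item!r}") from exc
        if not (isinstance(ent.entity, str) and isinstance(ent.label, str)):
            raise ValueError(f"entity and label must be strings: {item!r}")
        if not (isinstance(ent.start_char, int) and isinstance(ent.end_char, int)):
            raise ValueError(f"offsets must be integers: {item!r}")
        if ent.label not in ALLOWED_LABELS:
            raise ValueError(f"unexpected label: {ent.label!r}")
        if not 0 <= ent.start_char < ent.end_char:
            raise ValueError(f"invalid offsets: {item!r}")
        entities.append(ent)
    return entities
```

With Pydantic, the manual type checks would be replaced by a model definition, but the flow (parse, validate, reject) stays the same.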

In [4]:
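def extract_named_entities(
    doc,
    allowed_entity_types=("ORG", "PERSON", "GPE"),
):
    """Extracts relevant entities from a given Doc or sentence Span."""
    # spaCy reports entity offsets relative to the parent Doc, so subtract
    # the span's own start to make them index into the text passed in
    # (a Doc has no start_char, hence the default of 0).
    offset = getattr(doc, "start_char", 0)
    return [
        {
            "entity": e.text,
            "entity_lemmatized": e.lemma_.lower(),
            "label": e.label_,
            "start_char": e.start_char - offset,
            "end_char": e.end_char - offset,
        }
        for e in doc.ents
        if e.label_ in allowed_entity_types
    ]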


Below, the entity extraction function is integrated into the pipeline, so that each result is not just a named entity found somewhere in the text but an entity found at a specific location within it:

In [5]:
def process_text(input_file, model="en_core_web_sm"):
    """Processes the input text with a model of choice."""
    data = []
    nlp = load_model(model)
    contents = parse_input(input_file)

    # nlp.pipe is used here to make the processing faster for larger documents;
    # its batch_size argument could be tuned as well
    pipeline = nlp.pipe(contents, as_tuples=True)

    for paragraph, paragraph_id in pipeline:
        for sent_id, sent in enumerate(paragraph.sents, start=1):
            entities = extract_named_entities(sent)
            data.append(
                {
                    # metadata - compound key:
                    # document + paragraph + sent
                    "document_id": input_file.stem,
                    "paragraph_id": paragraph_id,
                    "sent_id": sent_id,
                    # actual text split on sentences
                    "text": sent.text,
                    # entities
                    "entities": entities,
                }
            )
    return data

## Bringing it all together

In [6]:
data = process_text(input_file, model="en_core_web_sm")

In [7]:
# raw data
data[:2]

[{'document_id': 'solomon-islands-pm-has-millions-in-property-raising-questions-around-wealth',
  'paragraph_id': 1,
  'sent_id': 1,
  'text': 'Solomon Islands PM Has Millions in Property, Raising Questions Around Wealth',
  'entities': []},
 {'document_id': 'solomon-islands-pm-has-millions-in-property-raising-questions-around-wealth',
  'paragraph_id': 2,
  'sent_id': 1,
  'text': 'The prime minister of the Pacific Island nation has recently gone on a home-building spree, constructing eight valuable new houses in and around the capital city of Honiara.',
  'entities': [{'entity': 'Honiara',
    'entity_lemmatized': 'honiara',
    'label': 'GPE',
    'start_char': 165,
    'end_char': 172}]}]

In [8]:
print(f'original text: {data[1]["text"]}')

start = data[1]["entities"][0]["start_char"] 
end = data[1]["entities"][0]["end_char"]

# found within text
print(f'found match: {data[1]["text"][start:end]}')

original text: The prime minister of the Pacific Island nation has recently gone on a home-building spree, constructing eight valuable new houses in and around the capital city of Honiara.
found match: Honiara


In [9]:
output_dir.mkdir(parents=True, exist_ok=True)  # make sure the folder exists
with open(output_dir / "0_processed_data.json", "w", encoding="utf-8") as f:
    json.dump(data, f)

## Transformation

I'm using a pandas DataFrame to illustrate the output and .csv as the storage format. In real life, a database such as Postgres would be a better choice. Assume these .csv files are tables within the same database schema.

### Table 1: metadata + text
- one sentence per row (the lowest useful level of detail we could use here)

In [10]:
tabular = pd.DataFrame(data)
document = tabular.drop("entities", axis=1)
document

| | document_id | paragraph_id | sent_id | text |
|---|---|---|---|---|
| 0 | solomon-islands-pm-has-millions-in-property-ra... | 1 | 1 | Solomon Islands PM Has Millions in Property, R... |
| 1 | solomon-islands-pm-has-millions-in-property-ra... | 2 | 1 | The prime minister of the Pacific Island natio... |
| 2 | solomon-islands-pm-has-millions-in-property-ra... | 3 | 1 | Manasseh Sogavare, the four-time prime ministe... |
| 3 | solomon-islands-pm-has-millions-in-property-ra... | 4 | 1 | The 69-year-old premier is modestly paid by th... |
| 4 | solomon-islands-pm-has-millions-in-property-ra... | 5 | 1 | That relatively low paycheck, however, hasn't ... |
| 5 | solomon-islands-pm-has-millions-in-property-ra... | 6 | 1 | The couple have built at least eight new house... |
| 6 | solomon-islands-pm-has-millions-in-property-ra... | 6 | 2 | The construction costs are estimated to run as... |
| 7 | solomon-islands-pm-has-millions-in-property-ra... | 7 | 1 | Rick Houenipwela, himself a former prime minis... |
| 8 | solomon-islands-pm-has-millions-in-property-ra... | 8 | 1 | Houenipwela's own home is a sprawling structur... |
| 9 | solomon-islands-pm-has-millions-in-property-ra... | 8 | 2 | He told reporters his own renovations had take... |


In [11]:
document.to_csv(output_dir / "1_text_and_metadata.csv", index=False)

In [12]:
# we can always go back to one document = one row
document.groupby("document_id", as_index=False)["text"].apply(" ".join)

| | document_id | text |
|---|---|---|
| 0 | solomon-islands-pm-has-millions-in-property-ra... | Solomon Islands PM Has Millions in Property, R... |


### Table 2: entities
- one entity per row
- each record contains a foreign key (document_id, paragraph_id, sent_id) so it can be joined to the sentences
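
To show what the two tables and the compound key could look like in a database, here is a small sketch using `sqlite3` from the standard library as a stand-in for Postgres. The table names `sentences` and `entities` are my assumption; the sample row is the Honiara sentence from the raw output above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite doesn't enforce FKs by default
conn.executescript(
    """
    CREATE TABLE sentences (
        document_id  TEXT    NOT NULL,
        paragraph_id INTEGER NOT NULL,
        sent_id      INTEGER NOT NULL,
        text         TEXT    NOT NULL,
        PRIMARY KEY (document_id, paragraph_id, sent_id)
    );
    CREATE TABLE entities (
        entity            TEXT    NOT NULL,
        entity_lemmatized TEXT    NOT NULL,
        label             TEXT    NOT NULL,
        start_char        INTEGER NOT NULL,
        end_char          INTEGER NOT NULL,
        document_id       TEXT    NOT NULL,
        paragraph_id      INTEGER NOT NULL,
        sent_id           INTEGER NOT NULL,
        FOREIGN KEY (document_id, paragraph_id, sent_id)
            REFERENCES sentences (document_id, paragraph_id, sent_id)
    );
    """
)

doc_id = "solomon-islands-pm-has-millions-in-property-raising-questions-around-wealth"
text = (
    "The prime minister of the Pacific Island nation has recently gone on a "
    "home-building spree, constructing eight valuable new houses in and around "
    "the capital city of Honiara."
)
conn.execute("INSERT INTO sentences VALUES (?, ?, ?, ?)", (doc_id, 2, 1, text))
conn.execute(
    "INSERT INTO entities VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    ("Honiara", "honiara", "GPE", 165, 172, doc_id, 2, 1),
)

# Join each entity back to the sentence it was found in.
rows = conn.execute(
    """
    SELECT e.entity, e.label, e.start_char, e.end_char, s.text
    FROM entities AS e
    JOIN sentences AS s USING (document_id, paragraph_id, sent_id)
    """
).fetchall()
```

The same three columns act as the primary key of the sentence table and as the foreign key of the entity table.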

In [13]:
# build new records instead of mutating the dicts inside `data`;
# sentences without entities contribute nothing
flattened_entities = [
    {
        **ent,
        "document_id": sent["document_id"],
        "paragraph_id": sent["paragraph_id"],
        "sent_id": sent["sent_id"],
    }
    for sent in data
    for ent in sent["entities"]
]

# table 2 - entities
entities = pd.DataFrame(flattened_entities)
entities

| | entity | entity_lemmatized | label | start_char | end_char | document_id | paragraph_id | sent_id |
|---|---|---|---|---|---|---|---|---|
| 0 | Honiara | honiara | GPE | 165 | 172 | solomon-islands-pm-has-millions-in-property-ra... | 2 | 1 |
| 1 | Manasseh Sogavare | manasseh sogavare | ORG | 0 | 17 | solomon-islands-pm-has-millions-in-property-ra... | 3 | 1 |
| 2 | Solomon Islands | solomon islands | GPE | 51 | 66 | solomon-islands-pm-has-millions-in-property-ra... | 3 | 1 |
| 3 | Solomon Islands dollars | solomon islands dollar | ORG | 172 | 195 | solomon-islands-pm-has-millions-in-property-ra... | 4 | 1 |
| 4 | SBD | sbd | ORG | 200 | 203 | solomon-islands-pm-has-millions-in-property-ra... | 4 | 1 |
| 5 | Sogavare | sogavare | ORG | 54 | 62 | solomon-islands-pm-has-millions-in-property-ra... | 5 | 1 |
| 6 | Emmy | emmy | PERSON | 77 | 81 | solomon-islands-pm-has-millions-in-property-ra... | 5 | 1 |
| 7 | OCCRP | occrp | ORG | 208 | 213 | solomon-islands-pm-has-millions-in-property-ra... | 5 | 1 |
| 8 | In-Depth Solomons | in-depth solomons | ORG | 218 | 235 | solomon-islands-pm-has-millions-in-property-ra... | 5 | 1 |
| 9 | Honiara | honiara | PERSON | 102 | 109 | solomon-islands-pm-has-millions-in-property-ra... | 6 | 1 |


In [14]:
entities.to_csv(output_dir / "2_entities.csv", index=False)

### Table 3: aggregated entities

The output here isn't impressive to be honest (it would be a 'working'/analytical table, not part of the 'database'). 

We can see that lemmatization didn't help much: we still have duplicates with different word endings, e.g., Sogavare and Sogavares. We also have some questionable labels (Honiara, Lungga: are these GPE or PERSON?), hinting at the limits of the smaller spaCy models.

But as far as the approach goes, I think it's okay.
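
One cheap way to surface the questionable labels is to flag every entity that received more than one label. The sketch below does that with the standard library; the pairs are copied from a few rows of the entities table above, and `find_label_conflicts` is a hypothetical helper.

```python
from collections import defaultdict

# (entity_lemmatized, label) pairs taken from the entities table above.
observations = [
    ("honiara", "GPE"),
    ("manasseh sogavare", "ORG"),
    ("solomon islands", "GPE"),
    ("sogavare", "ORG"),
    ("emmy", "PERSON"),
    ("honiara", "PERSON"),
]


def find_label_conflicts(pairs):
    """Returns entities that were assigned more than one label."""
    labels = defaultdict(set)
    for name, label in pairs:
        labels[name].add(label)
    return {name: sorted(found) for name, found in labels.items() if len(found) > 1}


conflicts = find_label_conflicts(observations)
```

This only catches inconsistencies, not labels that are consistently wrong (Sogavare is tagged ORG every time), so it would complement rather than replace a manual review or a larger model.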

In [15]:
agg = (
    entities
    .groupby(["entity_lemmatized", "label"], as_index=False)
    .agg({
        # list each document once, however many times the entity occurs in it
        "document_id": ["size", lambda x: ",".join(sorted(set(x)))]
    })
)
agg.columns = ["entity_lemmatized", "label", "frequency", "documents_found"]
agg.sort_values("frequency", ascending=False)

| | entity_lemmatized | label | frequency | documents_found |
|---|---|---|---|---|
| 12 | sogavare | ORG | 2 | solomon-islands-pm-has-millions-in-property-ra... |
| 0 | anz | ORG | 1 | solomon-islands-pm-has-millions-in-property-ra... |
| 9 | occrp | ORG | 1 | solomon-islands-pm-has-millions-in-property-ra... |
| 15 | taiwan | GPE | 1 | solomon-islands-pm-has-millions-in-property-ra... |
| 14 | solomon islands dollar | ORG | 1 | solomon-islands-pm-has-millions-in-property-ra... |
| 13 | solomon islands | GPE | 1 | solomon-islands-pm-has-millions-in-property-ra... |
| 11 | sbd | ORG | 1 | solomon-islands-pm-has-millions-in-property-ra... |
| 10 | rick houenipwela | PERSON | 1 | solomon-islands-pm-has-millions-in-property-ra... |
| 8 | manasseh sogavare | ORG | 1 | solomon-islands-pm-has-millions-in-property-ra... |
| 1 | emmy | PERSON | 1 | solomon-islands-pm-has-millions-in-property-ra... |


In [16]:
agg.to_csv(output_dir / "3_aggregated_entities.csv", index=False)