# Finding Significance Classification

## Database Content Preparation

To classify the contents of the database, I first query the content IDs and abstracts. I store the results in a Parquet file so that the actual inference can run on the HPI cluster.

In [1]:
base_path = '../..'

In [2]:
import sys
sys.path.insert(0, base_path)

In [3]:
from pathlib import Path

In [4]:
from sqlalchemy import select
from integration.orm import aact, civic, pubmed, trialstreamer
import pandas as pd

In [5]:
import os
os.chdir(base_path)

In [6]:
PATH_INFERENCE_DATASET = Path("output/ids_to_abstracts_for_inference.parquet")

In [7]:
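from api.app import app
from fastapi.testclient import TestClient

# Entering the TestClient context runs the app's startup handlers,
# which initialize app.state (the session factory is read from it below).
with TestClient(app) as client:
    pass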

session = app.state.session()

### PubMed

In [8]:
query = (
    select(pubmed.Trial.pm_id, pubmed.Trial.abstract)
    .where(pubmed.Trial.abstract.isnot(None))
    .where(pubmed.Trial.abstract != "")
)
results = session.execute(query).all()

In [9]:
df_pm = pd.DataFrame(results)
df_pm["source"] = "pubmed"
df_pm

Unnamed: 0,pm_id,abstract,source
0,12411355,OBJECTIVE\n\n\nTo measure the effect of giving...,pubmed
1,12411356,OBJECTIVES\n\n\nTo identify which type of smok...,pubmed
2,12411354,OBJECTIVE\n\n\nTo compare the effects and side...,pubmed
3,12411290,Antithymocyte globulin (ATG) has recently been...,pubmed
4,12411282,Mucociliary clearance is determined by ciliary...,pubmed
...,...,...,...
774052,39145668,BACKGROUND AND AIMS\n\n\nSphingosine 1-phospha...,pubmed
774053,39145606,BACKGROUND AND PURPOSE\n\n\nIt is still debata...,pubmed
774054,39145520,BACKGROUND\n\n\nIron and folic acid supplement...,pubmed
774055,39145517,BACKGROUND\n\n\nStroke patients often face dis...,pubmed


### CIViC

In [10]:
query = (
    select(civic.Source.pm_id, civic.Source.abstract)
    .where(civic.Source.pm_id.isnot(None))
    .where(civic.Source.abstract.isnot(None))
    .where(civic.Source.abstract != "")
)
results = session.execute(query).all()

In [11]:
df_cv = pd.DataFrame(results)
df_cv["source"] = "civic"
df_cv

Unnamed: 0,pm_id,abstract,source
0,24569458,"Targeted cancer therapies often induce ""outlie...",civic
1,23111194,Non-small cell lung cancer (NSCLC) occurs most...,civic
2,19357394,"Recently the World Health Organization (WHO), ...",civic
3,24578576,Fibrolamellar hepatocellular carcinoma (FL-HCC...,civic
4,24185510,Breast cancer is the most prevalent cancer in ...,civic
...,...,...,...
3501,19412164,"Diffuse large B-cell lymphoma (DLBCL), the mos...",civic
3502,28327945,The field of cancer genomics has demonstrated ...,civic
3503,28833375,Clear cell sarcoma of the kidney (CCSK) is a r...,civic
3504,31876361,Clear cell sarcoma of the kidney (CCSK) is the...,civic


### ClinicalTrials

For ClinicalTrials, we rely on the results data when it is available. See the `classify_clinicaltrials` notebook for details.

### Combining and saving the results

In [12]:
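df_combined = pd.concat(
    [
        df_pm,
        df_cv,
    ]
)
# An ordered categorical makes "pubmed" sort before "civic".
df_combined["source"] = pd.Categorical(
    values=df_combined["source"],
    categories=["pubmed", "civic"],
    ordered=True,
)
# After sorting by source, keep="first" retains the PubMed row whenever
# the same pm_id appears in both sources.
df_combined = (
    df_combined.sort_values("source")
    .drop_duplicates(subset=["pm_id"], keep="first")
    .reset_index(drop=True)
    .drop(columns="source")
)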
df_combined

Unnamed: 0,pm_id,abstract
0,12411355,OBJECTIVE\n\n\nTo measure the effect of giving...
1,20386478,The goal of this research project was to inves...
2,20386477,The objectives of the present investigation we...
3,20386476,The present study investigated the influence o...
4,20386475,The purpose of the present study was to examin...
...,...,...
776980,21594665,Trastuzumab (T) is effective in metastatic bre...
776981,19513541,"PI-103, the first synthetic multitargeted comp..."
776982,23094721,Regular use of aspirin after a diagnosis of co...
776983,18772396,Glioblastoma multiforme (GBM) is the most comm...
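
The deduplication above depends on the ordered categorical: PubMed rows sort ahead of CIViC rows, so `keep="first"` prefers the PubMed abstract when a `pm_id` occurs in both sources. The following is a minimal sketch of that behaviour on toy data. The helper name `combine_with_priority` and the toy frames are hypothetical and not part of the notebook, and `kind="stable"` is an added assumption that only makes the row order within a source deterministic.

```python
import pandas as pd


def combine_with_priority(frames, priority):
    """Concatenate frames and keep one row per pm_id, preferring earlier sources.

    Hypothetical helper for illustration; `priority` lists the source names
    from most to least preferred.
    """
    combined = pd.concat(frames, ignore_index=True)
    # Ordered categorical: sorting follows `priority`, not alphabetical order.
    combined["source"] = pd.Categorical(
        combined["source"], categories=priority, ordered=True
    )
    return (
        combined.sort_values("source", kind="stable")
        .drop_duplicates(subset=["pm_id"], keep="first")
        .reset_index(drop=True)
    )


# Toy data: pm_id 2 appears in both sources.
toy_pm = pd.DataFrame(
    {"pm_id": [1, 2], "abstract": ["pubmed A", "pubmed B"], "source": "pubmed"}
)
toy_cv = pd.DataFrame(
    {"pm_id": [2, 3], "abstract": ["civic B", "civic C"], "source": "civic"}
)

toy = combine_with_priority([toy_pm, toy_cv], ["pubmed", "civic"])
print(toy)
```

In this toy case the row for `pm_id` 2 carries the PubMed abstract, and the CIViC-only `pm_id` 3 is still kept.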


In [13]:
df_combined.to_parquet(PATH_INFERENCE_DATASET, compression="gzip")
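
Because the file is consumed on a different machine, a quick sanity check before the write above can catch problems early. The sketch below checks the properties the earlier cells are meant to guarantee: both columns are present, `pm_id` is unique, and no abstract is missing or empty. The function name `check_inference_frame` is hypothetical and not part of the project code.

```python
import pandas as pd


def check_inference_frame(df: pd.DataFrame) -> None:
    """Raise ValueError if the frame violates the expected invariants.

    Hypothetical helper for illustration only.
    """
    missing = {"pm_id", "abstract"} - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    if df["pm_id"].duplicated().any():
        raise ValueError("duplicate pm_id values found")
    abstracts = df["abstract"]
    # Mirrors the query filters: abstracts must be neither NULL nor "".
    if abstracts.isna().any() or abstracts.eq("").any():
        raise ValueError("missing or empty abstracts found")


# Toy frame that satisfies all checks, so the call returns silently.
example = pd.DataFrame({"pm_id": [1, 2], "abstract": ["text A", "text B"]})
check_inference_frame(example)
```

In the notebook this would be called as `check_inference_frame(df_combined)` just before `to_parquet`.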