## PMC15 Pipeline

This notebook runs the PMC15 pipeline in three steps:

1. Download the PMC Open Access file list
2. Download and extract the article archives
3. Parse all the articles and create a `_results/data/pubmed_parsed_data.jsonl` file

In the `pubmed_parsed_data.jsonl` file, each line is a JSON object with the following shape:

```json
{
    "pmid": "PMID_VALUE like 11178228",
    "pmc": "PMC_VALUE like 15015",
    "location": "LOCATION_PATH: path to where the article is stored on disk",
    "figures": [
        {
            "fig_caption": "FIGURE_CAPTION: the caption of the figure in the article",
            "fig_id": "FIGURE_ID: F1, F2, etc",
            "fig_label": "FIGURE_LABEL: Figure 1, Figure 2, etc.; the label used to reference the figure in the article",
            "graphic_ref": "GRAPHIC_REFERENCE_PATH: path to where the image is stored on disk",
            "pair_id": "PAIR_ID: {pmid}_{fig_id}"
        }
    ]
}
```
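
As a minimal sketch of how a file with this shape could be consumed, the helpers below read the JSONL file line by line and walk the figures of each article. The function names `iter_records` and `iter_figures` are hypothetical and not part of `pmc15_pipeline`; only the field names come from the shape shown above.

```python
import json
from pathlib import Path
from typing import Iterator, Tuple


def iter_records(jsonl_path: Path) -> Iterator[dict]:
    """Yield one parsed article record per non-empty line (hypothetical helper)."""
    with jsonl_path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. a trailing newline
                yield json.loads(line)


def iter_figures(jsonl_path: Path) -> Iterator[Tuple[str, dict]]:
    """Yield (pmid, figure) pairs for every figure of every article."""
    for record in iter_records(jsonl_path):
        for figure in record.get("figures", []):
            yield record["pmid"], figure
```

Reading lazily keeps memory use flat even when the full PMC Open Access list is processed, since only one article record is held at a time.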

In [1]:
# this controls how many articles will be downloaded and processed. Set to `None` to process all articles in the PMCOA list
MAX_ITEMS_TO_PROCESS = 100

In [2]:
from pmc15_pipeline import data
from pmc15_pipeline.utils import fs_utils

In [3]:
repo_root = fs_utils.get_repo_root_path()

In [4]:
list_output_path = repo_root / "_results" / "data" / "pubmed_open_access_file_list.txt"

data.download_pubmed_file_list(
    output_file_path=list_output_path,
)

Downloading OpenAccess file list from: https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_file_list.txt to /workspaces/biomedclip_data_pipeline/_results/data/pubmed_open_access_file_list.txt
Saved to: /workspaces/biomedclip_data_pipeline/_results/data/pubmed_open_access_file_list.txt
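
Before downloading any archives, it can help to peek at the saved list. The sketch below is not the pipeline's own code: `read_archive_paths` is a hypothetical helper, and it assumes each data line is tab-separated with the archive path (ending in `.tar.gz`) in the first column. Lines that do not match, such as a header or timestamp line, are skipped.

```python
from pathlib import Path
from typing import List, Optional


def read_archive_paths(list_path: Path, limit: Optional[int] = None) -> List[str]:
    """Return archive paths from the file list; `limit=None` returns all of them."""
    paths: List[str] = []
    with list_path.open(encoding="utf-8") as handle:
        for line in handle:
            if limit is not None and len(paths) >= limit:
                break
            # Assumed layout: tab-separated columns, archive path first.
            first_column = line.rstrip("\n").split("\t", 1)[0]
            if first_column.endswith(".tar.gz"):
                paths.append(first_column)
    return paths
```

Calling `read_archive_paths(list_output_path, limit=5)` would then show the first five archive paths without loading the whole list into memory.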


In [5]:
# Remove the subset_size argument (or set MAX_ITEMS_TO_PROCESS to None) to download all files
downloaded_articles_output_path = repo_root / "_results" / "data" / "pubmed_open_access_files_compressed"

data.download_pubmed_files_from_list(
    file_list_path=list_output_path,
    output_folder_path=downloaded_articles_output_path,
    subset_size=MAX_ITEMS_TO_PROCESS,
)

  0%|          | 0/100 [00:00<?, ?it/s]

File: PMC13900.tar.gz size: 108943 bytes


  1%|          | 1/100 [00:01<01:44,  1.06s/it]

File: PMC13901.tar.gz size: 1913305 bytes


  2%|▏         | 2/100 [00:02<02:04,  1.27s/it]

File: PMC13902.tar.gz size: 1090539 bytes


  3%|▎         | 3/100 [00:04<01:59,  1.23s/it]

File: PMC13911.tar.gz size: 100236 bytes


  4%|▍         | 4/100 [00:04<01:47,  1.12s/it]

File: PMC13912.tar.gz size: 454405 bytes


  5%|▌         | 5/100 [00:06<01:46,  1.12s/it]

File: PMC13913.tar.gz size: 283665 bytes


  6%|▌         | 6/100 [00:07<01:46,  1.13s/it]

File: PMC13914.tar.gz size: 179788 bytes


  7%|▋         | 7/100 [00:08<01:46,  1.15s/it]

File: PMC13915.tar.gz size: 1759577 bytes


  8%|▊         | 8/100 [00:10<02:20,  1.52s/it]

File: PMC13916.tar.gz size: 912644 bytes


  9%|▉         | 9/100 [00:12<02:14,  1.48s/it]

File: PMC13917.tar.gz size: 2247749 bytes


 10%|█         | 10/100 [00:13<02:09,  1.43s/it]

File: PMC13918.tar.gz size: 117041 bytes


 11%|█         | 11/100 [00:14<01:58,  1.33s/it]

File: PMC13919.tar.gz size: 904607 bytes


 12%|█▏        | 12/100 [00:15<01:55,  1.31s/it]

File: PMC13920.tar.gz size: 116567 bytes


 13%|█▎        | 13/100 [00:16<01:47,  1.23s/it]

File: PMC13921.tar.gz size: 102676 bytes


 14%|█▍        | 14/100 [00:17<01:40,  1.17s/it]

File: PMC13922.tar.gz size: 1621237 bytes


 15%|█▌        | 15/100 [00:19<01:44,  1.23s/it]

File: PMC13923.tar.gz size: 1378984 bytes


 16%|█▌        | 16/100 [00:20<01:46,  1.27s/it]

File: PMC13924.tar.gz size: 199975 bytes


 17%|█▋        | 17/100 [00:21<01:42,  1.23s/it]

File: PMC14752.tar.gz size: 685837 bytes


 18%|█▊        | 18/100 [00:23<01:42,  1.25s/it]

File: PMC15015.tar.gz size: 8032007 bytes


 19%|█▉        | 19/100 [00:25<02:07,  1.58s/it]

File: PMC15016.tar.gz size: 1307756 bytes


 20%|██        | 20/100 [00:26<01:59,  1.50s/it]

File: PMC15023.tar.gz size: 9004765 bytes


 21%|██        | 21/100 [00:29<02:19,  1.76s/it]

File: PMC15024.tar.gz size: 170657 bytes


 22%|██▏       | 22/100 [00:30<01:58,  1.52s/it]

File: PMC15025.tar.gz size: 844786 bytes


 23%|██▎       | 23/100 [00:31<01:50,  1.44s/it]

File: PMC15026.tar.gz size: 754450 bytes


 24%|██▍       | 24/100 [00:32<01:46,  1.40s/it]

File: PMC15027.tar.gz size: 859007 bytes


 25%|██▌       | 25/100 [00:33<01:42,  1.36s/it]

File: PMC15028.tar.gz size: 630671 bytes


 26%|██▌       | 26/100 [00:35<01:46,  1.44s/it]

File: PMC16139.tar.gz size: 491655 bytes


 27%|██▋       | 27/100 [00:36<01:38,  1.35s/it]

File: PMC16141.tar.gz size: 2269913 bytes


 28%|██▊       | 28/100 [00:37<01:36,  1.34s/it]

File: PMC16144.tar.gz size: 888478 bytes


 29%|██▉       | 29/100 [00:39<01:33,  1.31s/it]

File: PMC16145.tar.gz size: 395543 bytes


 30%|███       | 30/100 [00:40<01:31,  1.31s/it]

File: PMC17597.tar.gz size: 394598 bytes


 31%|███       | 31/100 [00:41<01:26,  1.26s/it]

File: PMC17598.tar.gz size: 1081239 bytes


 32%|███▏      | 32/100 [00:42<01:25,  1.26s/it]

File: PMC17599.tar.gz size: 2282535 bytes


 33%|███▎      | 33/100 [00:44<01:27,  1.31s/it]

File: PMC17774.tar.gz size: 735629 bytes


 34%|███▍      | 34/100 [00:47<02:00,  1.83s/it]

File: PMC17776.tar.gz size: 1855391 bytes


 35%|███▌      | 35/100 [00:50<02:28,  2.28s/it]

File: PMC17779.tar.gz size: 2136644 bytes


 36%|███▌      | 36/100 [00:52<02:11,  2.06s/it]

File: PMC17803.tar.gz size: 1073430 bytes


 37%|███▋      | 37/100 [00:53<01:54,  1.82s/it]

File: PMC17804.tar.gz size: 563240 bytes


 38%|███▊      | 38/100 [00:54<01:41,  1.63s/it]

File: PMC17805.tar.gz size: 832447 bytes


 39%|███▉      | 39/100 [00:55<01:31,  1.50s/it]

File: PMC17806.tar.gz size: 496353 bytes


 40%|████      | 40/100 [00:57<01:25,  1.42s/it]

File: PMC17807.tar.gz size: 1634044 bytes


 41%|████      | 41/100 [00:58<01:20,  1.37s/it]

File: PMC17808.tar.gz size: 879653 bytes


 42%|████▏     | 42/100 [00:59<01:17,  1.33s/it]

File: PMC17809.tar.gz size: 829564 bytes


 43%|████▎     | 43/100 [01:00<01:13,  1.29s/it]

File: PMC17810.tar.gz size: 1407569 bytes


 44%|████▍     | 44/100 [01:02<01:12,  1.30s/it]

File: PMC17811.tar.gz size: 102895 bytes


 45%|████▌     | 45/100 [01:03<01:07,  1.22s/it]

File: PMC17812.tar.gz size: 2648793 bytes


 46%|████▌     | 46/100 [01:04<01:09,  1.29s/it]

File: PMC17813.tar.gz size: 2404403 bytes


 47%|████▋     | 47/100 [01:05<01:09,  1.31s/it]

File: PMC17814.tar.gz size: 938058 bytes


 48%|████▊     | 48/100 [01:07<01:06,  1.28s/it]

File: PMC17815.tar.gz size: 129469 bytes


 49%|████▉     | 49/100 [01:08<01:01,  1.20s/it]

File: PMC17816.tar.gz size: 251650 bytes


 50%|█████     | 50/100 [01:09<00:58,  1.18s/it]

File: PMC17817.tar.gz size: 971664 bytes


 51%|█████     | 51/100 [01:10<00:58,  1.19s/it]

File: PMC17818.tar.gz size: 321170 bytes


 52%|█████▏    | 52/100 [01:11<00:55,  1.15s/it]

File: PMC17819.tar.gz size: 1144961 bytes


 53%|█████▎    | 53/100 [01:13<00:57,  1.23s/it]

File: PMC17820.tar.gz size: 1641712 bytes


 54%|█████▍    | 54/100 [01:14<00:57,  1.25s/it]

File: PMC17821.tar.gz size: 997089 bytes


 55%|█████▌    | 55/100 [01:15<00:56,  1.27s/it]

File: PMC17822.tar.gz size: 551699 bytes


 56%|█████▌    | 56/100 [01:16<00:55,  1.27s/it]

File: PMC17823.tar.gz size: 363269 bytes


 57%|█████▋    | 57/100 [01:17<00:51,  1.20s/it]

File: PMC17824.tar.gz size: 599619 bytes


 58%|█████▊    | 58/100 [01:19<00:51,  1.22s/it]

File: PMC17825.tar.gz size: 1152324 bytes


 59%|█████▉    | 59/100 [01:20<00:49,  1.22s/it]

File: PMC17826.tar.gz size: 835561 bytes


 60%|██████    | 60/100 [01:21<00:49,  1.24s/it]

File: PMC17827.tar.gz size: 2468947 bytes


 61%|██████    | 61/100 [01:27<01:37,  2.49s/it]

File: PMC17828.tar.gz size: 1343971 bytes


 62%|██████▏   | 62/100 [01:28<01:22,  2.17s/it]

File: PMC17829.tar.gz size: 603916 bytes


 63%|██████▎   | 63/100 [01:29<01:10,  1.90s/it]

File: PMC25774.tar.gz size: 6364434 bytes


 64%|██████▍   | 64/100 [01:31<01:06,  1.85s/it]

File: PMC25775.tar.gz size: 1694695 bytes


 65%|██████▌   | 65/100 [01:32<00:58,  1.67s/it]

File: PMC25776.tar.gz size: 172405 bytes


 66%|██████▌   | 66/100 [01:33<00:49,  1.45s/it]

File: PMC28985.tar.gz size: 1165259 bytes


 67%|██████▋   | 67/100 [01:34<00:45,  1.39s/it]

File: PMC28986.tar.gz size: 302608 bytes


 68%|██████▊   | 68/100 [01:36<00:42,  1.31s/it]

File: PMC28987.tar.gz size: 118333 bytes


 69%|██████▉   | 69/100 [01:37<00:37,  1.22s/it]

File: PMC28988.tar.gz size: 121496 bytes


 70%|███████   | 70/100 [01:38<00:35,  1.17s/it]

File: PMC28989.tar.gz size: 146736 bytes


 71%|███████   | 71/100 [01:39<00:32,  1.12s/it]

File: PMC28990.tar.gz size: 99355 bytes


 72%|███████▏  | 72/100 [01:40<00:31,  1.11s/it]

File: PMC28991.tar.gz size: 6534 bytes


 73%|███████▎  | 73/100 [01:41<00:27,  1.02s/it]

File: PMC28992.tar.gz size: 10531 bytes


 74%|███████▍  | 74/100 [01:41<00:24,  1.05it/s]

File: PMC28993.tar.gz size: 10119 bytes


 75%|███████▌  | 75/100 [01:42<00:23,  1.07it/s]

File: PMC28994.tar.gz size: 9055 bytes


 76%|███████▌  | 76/100 [01:43<00:21,  1.14it/s]

File: PMC28995.tar.gz size: 130264 bytes


 77%|███████▋  | 77/100 [01:44<00:21,  1.07it/s]

File: PMC28996.tar.gz size: 76186 bytes


 78%|███████▊  | 78/100 [01:45<00:20,  1.06it/s]

File: PMC28997.tar.gz size: 11338 bytes


 79%|███████▉  | 79/100 [01:46<00:18,  1.15it/s]

File: PMC28998.tar.gz size: 7731 bytes


 80%|████████  | 80/100 [01:46<00:16,  1.20it/s]

File: PMC28999.tar.gz size: 11962 bytes


 81%|████████  | 81/100 [01:47<00:15,  1.26it/s]

File: PMC29000.tar.gz size: 38067 bytes


 82%|████████▏ | 82/100 [01:48<00:14,  1.23it/s]

File: PMC29001.tar.gz size: 14723 bytes


 83%|████████▎ | 83/100 [01:49<00:13,  1.22it/s]

File: PMC29002.tar.gz size: 57256 bytes


 84%|████████▍ | 84/100 [01:50<00:14,  1.14it/s]

File: PMC29003.tar.gz size: 13664 bytes


 85%|████████▌ | 85/100 [01:51<00:12,  1.21it/s]

File: PMC29004.tar.gz size: 104722 bytes


 86%|████████▌ | 86/100 [01:52<00:13,  1.02it/s]

File: PMC29005.tar.gz size: 314438 bytes


 87%|████████▋ | 87/100 [01:53<00:12,  1.01it/s]

File: PMC29006.tar.gz size: 117859 bytes


 88%|████████▊ | 88/100 [01:54<00:11,  1.01it/s]

File: PMC29007.tar.gz size: 222331 bytes


 89%|████████▉ | 89/100 [01:55<00:11,  1.02s/it]

File: PMC29008.tar.gz size: 103557 bytes


 90%|█████████ | 90/100 [01:56<00:09,  1.00it/s]

File: PMC29009.tar.gz size: 954097 bytes


 91%|█████████ | 91/100 [01:57<00:09,  1.06s/it]

File: PMC29010.tar.gz size: 146881 bytes


 92%|█████████▏| 92/100 [01:58<00:08,  1.04s/it]

File: PMC29011.tar.gz size: 145990 bytes


 93%|█████████▎| 93/100 [01:59<00:07,  1.05s/it]

File: PMC29012.tar.gz size: 203809 bytes


 94%|█████████▍| 94/100 [02:00<00:06,  1.04s/it]

File: PMC29013.tar.gz size: 666149 bytes


 95%|█████████▌| 95/100 [02:02<00:05,  1.11s/it]

File: PMC29014.tar.gz size: 235216 bytes


 96%|█████████▌| 96/100 [02:03<00:04,  1.12s/it]

File: PMC29015.tar.gz size: 273495 bytes


 97%|█████████▋| 97/100 [02:04<00:03,  1.13s/it]

File: PMC29016.tar.gz size: 244114 bytes


 98%|█████████▊| 98/100 [02:05<00:02,  1.15s/it]

File: PMC29017.tar.gz size: 138825 bytes


 99%|█████████▉| 99/100 [02:06<00:01,  1.18s/it]

File: PMC29018.tar.gz size: 137266 bytes


100%|██████████| 100/100 [02:07<00:00,  1.27s/it]

Skipped 0 files.

In [6]:
decompressed_folder_path = repo_root / "_results" / "data" / "pubmed_open_access_files"

data.decompress_pubmed_files(
    input_folder_path=downloaded_articles_output_path,
    output_folder_path=decompressed_folder_path,
)

Found 100 files that match *.tar.gz in /workspaces/biomedclip_data_pipeline/_results/data/pubmed_open_access_files_compressed


100%|██████████| 100/100 [00:08<00:00, 11.32it/s]

Finished extracting 100 files

In [7]:
pipeline_input_file_path = repo_root / "_results" / "data" / "pubmed_parsed_data.jsonl"

data.generate_pmc15_pipeline_outputs(
    decompressed_folder=decompressed_folder_path,
    output_file_path=pipeline_input_file_path,
)

/workspaces/biomedclip_data_pipeline/_results/data/pubmed_open_access_files/PMC13900/BCR-3-1-055.nxml
starting...
parsed /workspaces/biomedclip_data_pipeline/_results/data/pubmed_open_access_files/PMC13900/BCR-3-1-055.nxml
no output
/workspaces/biomedclip_data_pipeline/_results/data/pubmed_open_access_files/PMC13901/BCR-3-1-061.nxml
starting...
parsed /workspaces/biomedclip_data_pipeline/_results/data/pubmed_open_access_files/PMC13901/BCR-3-1-061.nxml
/workspaces/biomedclip_data_pipeline/_results/data/pubmed_open_access_files/PMC13902/BCR-3-1-066.nxml
starting...
parsed /workspaces/biomedclip_data_pipeline/_results/data/pubmed_open_access_files/PMC13902/BCR-3-1-066.nxml
/workspaces/biomedclip_data_pipeline/_results/data/pubmed_open_access_files/PMC13911/bcr-2-1-059.nxml
starting...
parsed /workspaces/biomedclip_data_pipeline/_results/data/pubmed_open_access_files/PMC13911/bcr-2-1-059.nxml
no output
/workspaces/biomedclip_data_pipeline/_results/data/pubmed_open_access_files/PMC13912/bcr

In [8]:
num_lines = fs_utils.get_line_count(pipeline_input_file_path)
print(f"Number of lines in pipeline output file: {num_lines}")

Number of lines in pipeline output file: 78
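
The output file has 78 lines for 100 articles, and the parsing log prints `no output` for some of them. That is consistent with articles that yield no figures being left out of the file, although the notebook does not confirm this. To illustrate what figure extraction from an `.nxml` file might look like, here is a minimal sketch. It assumes JATS-style markup (`<fig>` elements with `<label>`, `<caption>`, and `<graphic xlink:href="...">`); `extract_figures` is a hypothetical helper and not the implementation used by `pmc15_pipeline`.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Attribute name for xlink:href in ElementTree's namespace notation.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def extract_figures(nxml_text: str, pmid: str) -> List[Dict[str, str]]:
    """Collect figure metadata from article XML (assumes JATS-style <fig> elements)."""
    root = ET.fromstring(nxml_text)
    figures = []
    for fig in root.iter("fig"):
        fig_id = fig.get("id", "")
        caption_el = fig.find("caption")
        # Flatten nested markup (italics, etc.) and normalize whitespace.
        caption = (
            " ".join("".join(caption_el.itertext()).split())
            if caption_el is not None
            else ""
        )
        graphic = fig.find("graphic")
        figures.append(
            {
                "fig_caption": caption,
                "fig_id": fig_id,
                "fig_label": fig.findtext("label", default=""),
                # Raw href only; resolving it to a path on disk is left out here.
                "graphic_ref": graphic.get(XLINK_HREF, "") if graphic is not None else "",
                "pair_id": f"{pmid}_{fig_id}",
            }
        )
    return figures
```

An article without any `<fig>` element returns an empty list here, which a caller could use to decide whether to write a line for that article.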
