<a href="https://colab.research.google.com/github/kimdarwin/ule/blob/main/CIKM_DocIU_workshop_dataset_loading_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction

This tutorial introduces the dataset format and structure of the PDFVQA competition, a sub-activity of the first [Workshop on Document Intelligence and Understanding (DocIU)](https://doc-iu.github.io/), held in conjunction with the **32nd ACM International Conference on Information and Knowledge Management (CIKM 2023)** in Birmingham, UK. Please visit the DocIU workshop website for more detailed information.

<img src="https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3738436%2F600acecdfc60ba5ef8ea05e6aea7dc7b%2Ftask_samples.png?generation=1690768397248148&alt=media" width = "400" style="display: block; margin: 0 auto" />


## Dataset Loading

Load the dataset from the Google Drive link (it is the same dataset as the one downloaded from the Kaggle competition). The dataset consists of three main splits: training, validation, and testing. Detailed statistics for each split can be found in the table below. Please refer to the [PDFVQA paper](https://arxiv.org/abs/2304.06447) for more detailed information.

<img src="https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3738436%2Fb2e3cfc548480b53b4ac174d00183524%2Fdataset_statistics.png?generation=1690700510714073&alt=media" width = "400" style="display: block; margin: 0 auto" />

For each split, we provide two types of files:

`CSV File`: This file contains questions and their corresponding ground truth answers. It is used specifically for training, allowing you to develop a model to predict answers based on the given questions accurately.

`PKL File`: The pkl file is a valuable resource as it contains essential document information, including bounding box coordinates of each document layout component, textual contents inside each bounding box, and parent-child relationships (structural relationships between document components). This information is crucial for understanding the structure of the visually-rich documents and identifying relevant Regions of Interest (RoIs) that might contain the answers to the questions.




In [None]:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
import pandas as pd

# Authenticate
drive = None
def authenticate():
  global drive

  auth.authenticate_user()
  gauth = GoogleAuth()
  gauth.credentials = GoogleCredentials.get_application_default()
  drive = GoogleDrive(gauth)

#Download files
def downloadFiles(fileIds):
  authenticate()

  for fileId in fileIds:

    downloaded = drive.CreateFile({"id": fileId[1]})
    downloaded.GetContentFile(fileId[0])

In [None]:
import os

# [local file name, Google Drive file id] for each split
DATASET_FILES = [
  # Training dataframe and pickle file
  ["train_dataframe.csv", "1-hU0-rt31_ZUZcTRZ3tXuUL0d_L5ALQM"],
  ["train_doc_info.pkl", "1-AhYdD-FvbgfgGvMsi8u4eVlGi5Y2M7V"],
  # Validation dataframe and pickle file
  ["val_dataframe.csv", "1-xOr1-cCel-LPd75n5W68e_gptzFquWB"],
  ["val_doc_info.pkl", "1-az6pNeteeUaPlFCiEzkz73Z5G3rVG-m"],
  # Testing dataframe and pickle file
  ["test_dataframe.csv", "1-searmQar_snnf3Aiz5dEmUDKZDHQj_c"],
  ["test_doc_info.pkl", "1-2D0Bn23GGdD7bdq7jLE7uc8kC7EIThG"],
]

# Download only the files that are not already present locally
missing_files = [entry for entry in DATASET_FILES if not os.path.exists(entry[0])]
if missing_files:
  downloadFiles(missing_files)

## Exploring Dataset Formats

As mentioned in the Kaggle dataset description, this dataset contains three splits: training, validation, and testing. For each split, we provide a CSV file containing the document PMCID and the natural language questions with their question types. Notably, the ground truth answers with their object ids (unique **global_id**) are provided only for training and validating models.


In [None]:
# Import essential
import json, pickle
import pandas as pd

### CSV file exploring
Let's load the training dataframe to check the column names.

In [None]:
train_df = pd.read_csv('train_dataframe.csv')
train_df.head(5)

Unnamed: 0.1,Unnamed: 0,pmcid,question,answer,global_id,question_type
0,0,27923395,Name out the section that include the Fig. 2.,['Expression of Ascl1 and neuronal markers in ...,"[243838, 243824]",parent_relationship_understanding
1,1,27923395,"When you search for the description of Fig. 3,...",['Differential expression of proneural genes w...,"[243867, 243824, 243849, 243866, 243878]",parent_relationship_understanding
2,2,27923395,Which section does include the description of ...,"['Discussion', 'Results', 'Expression of neuro...","[243877, 243824, 243825, 243884, 243849]",parent_relationship_understanding
3,3,27923395,Where can you find the Table 3?,['Differential expression of proneural genes w...,"[243867, 243849, 243824]",parent_relationship_understanding
4,4,27923395,"Where is the 'Oka C et al,1995' cited in the d...",['Expression of Ascl1 and neuronal markers in ...,"[243838, 243824]",parent_relationship_understanding


From the samples above, the training dataframe contains five main columns:

1. `pmcid`: The document identifier. It is used to look up the matching document in the 'train_doc_info.pkl' file (which is keyed by strings such as `'27923395'`) and so gives access to the input RoI (Region of Interest) information required for developing the model.

2. `question`: This column contains the text of the natural language questions that are used as input for the model.

3. `question_type`: This column contains the type of each question. The two types mentioned are parent-relationship understanding (stored as `parent_relationship_understanding` in the samples above) and child-relationship understanding.

4. `answer` (ground truth; not provided for the test split): This column stores the answer text contents that correspond to the questions in the `question` column.

5. `global_id` (ground truth; not provided for the test split): This column contains the global identifiers that locate the answer Regions of Interest (RoIs) associated with each question.

Together, these columns support training a question-answering model that predicts answers from the provided input RoIs, questions, and question types.
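
The two `Unnamed` columns in the preview appear to be leftover index columns rather than dataset fields. The following is a minimal sketch of dropping them and counting question types. The inline CSV is a shortened stand-in for `train_dataframe.csv`, and `drop_index_columns` is a hypothetical helper name, not part of the competition code.

```python
import io

import pandas as pd

# Shortened stand-in for train_dataframe.csv (illustrative rows only).
sample_csv = io.StringIO(
    "Unnamed: 0.1,Unnamed: 0,pmcid,question,question_type\n"
    "0,0,27923395,Name out the section that include the Fig. 2.,"
    "parent_relationship_understanding\n"
    "1,1,27923395,Where can you find the Table 3?,"
    "parent_relationship_understanding\n"
)
df = pd.read_csv(sample_csv)


def drop_index_columns(frame):
    """Return a copy of the frame without the 'Unnamed: ...' columns."""
    keep = [col for col in frame.columns if not col.startswith("Unnamed")]
    return frame[keep].copy()


clean_df = drop_index_columns(df)
type_counts = clean_df["question_type"].value_counts()
print(list(clean_df.columns))
print(type_counts)
```

The same helper can be applied to `train_df` directly once the real file is loaded.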


In [None]:
train_df.head(1)

Unnamed: 0.1,Unnamed: 0,pmcid,question,answer,global_id,question_type
0,0,27923395,Name out the section that include the Fig. 2.,['Expression of Ascl1 and neuronal markers in ...,"[243838, 243824]",parent_relationship_understanding
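
Because the data comes from a CSV file, the `answer` and `global_id` cells are read as strings that merely look like Python lists. A minimal sketch of converting them into real lists with `ast.literal_eval` is given here; `parse_list_cell` is a hypothetical helper name, and the two raw values are copied from the first training row.

```python
import ast

# Cell values as they arrive from the CSV: plain strings, not lists.
raw_global_id = "[243838, 243824]"
raw_answer = (
    "['Expression of Ascl1 and neuronal markers in the early neuronal "
    "populations in the brain was regulated by Notch signalling in mouse', "
    "'Results']"
)


def parse_list_cell(cell):
    """Convert a stringified Python list into a real list."""
    value = ast.literal_eval(cell)
    if not isinstance(value, list):
        raise ValueError(f"expected a list, got {type(value).__name__}")
    return value


global_ids = parse_list_cell(raw_global_id)
answers = parse_list_cell(raw_answer)
print(global_ids)
print(len(answers), answers[-1])
```

To convert a whole column, something like `train_df["global_id"].apply(parse_list_cell)` should work, assuming every cell in that column follows the same format.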


In [None]:
# Get the attributes of a specific row
pmcid = train_df.loc[0, "pmcid"]
print('"pmcid": {}'.format(pmcid))
question = train_df.loc[0, "question"]
print('"question": {}'.format(question))
question_type = train_df.loc[0, "question_type"]
print('"question_type": {}'.format(question_type))
answer = train_df.loc[0, "answer"]
print('"answer": {}'.format(answer))
global_id = train_df.loc[0, "global_id"]
print('"global_id": {}'.format(global_id))

"pmcid": 27923395
"question": Name out the section that include the Fig. 2.
"question_type": parent_relationship_understanding
"answer": ['Expression of Ascl1 and neuronal markers in the early neuronal populations in the brain was regulated by Notch signalling in mouse', 'Results']
"global_id": [243838, 243824]


### PKL file exploring
Let's load the train_doc_info.pkl file to check what kind of information can be used for developing your solution.

In [None]:
with open('train_doc_info.pkl','rb') as f:
  train_doc_info = pickle.load(f)

The loaded `train_doc_info` object is a dictionary containing the RoI information for model development. Using the unique pmcid (e.g., 27923395) as the key, we can easily retrieve specific details of the target document, such as its pages and their content.


In [None]:
print(train_doc_info['27923395'].keys())
print(train_doc_info['27923395']['pages'].keys())
print(train_doc_info['27923395']['pages'][0].keys())

dict_keys(['pages'])
dict_keys([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])
dict_keys(['name', 'height', 'width', 'objects', 'ordered_box', 'ordered_id', 'ordered_label'])
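
Note that the dictionary is indexed with the string `'27923395'`, while the `pmcid` column of the dataframe is likely parsed as an integer. A minimal sketch of a lookup that accepts either form follows. `mock_doc_info` only mimics the nesting shown above with illustrative values, and `get_document` is a hypothetical helper name.

```python
# Tiny mock with the same nesting as train_doc_info; all values are illustrative.
mock_doc_info = {
    "27923395": {
        "pages": {
            0: {"name": "page_0", "height": 1000, "width": 800, "objects": {}},
            1: {"name": "page_1", "height": 1000, "width": 800, "objects": {}},
        }
    }
}


def get_document(doc_info, pmcid):
    """Look up a document by pmcid, accepting either an int or a str."""
    return doc_info[str(pmcid)]


doc = get_document(mock_doc_info, 27923395)  # int, as it may come from the dataframe
page_sizes = {
    idx: (page["width"], page["height"]) for idx, page in doc["pages"].items()
}
print(len(doc["pages"]), page_sizes[0])
```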


Each page dictionary includes the following important keys:

- `name`: This key represents the name of the document image page.
- `height/width`: These keys contain the height and width of the current document image, providing information about its size.
- `objects`: This key holds comprehensive information about the different components (Regions of Interest - RoIs) present in the document layout. Each object is described with multiple aspects.
- `ordered_box`, `ordered_id`, `ordered_label`: These keys store the RoIs that have been re-ordered based on the human reading order. The reading order ID is assigned to each object on the document page, following the methodology presented in the [V-Doc paper (CVPR 2022)](https://openaccess.thecvf.com/content/CVPR2022/html/Ding_V-Doc_Visual_Questions_Answers_With_Documents_CVPR_2022_paper.html). The re-ordering ensures that the RoIs are arranged in a manner that facilitates the model in learning proper context representation.

Together, these keys support effective training and representation learning for the model, particularly in visual question-answering tasks involving documents.

Additionally, under the `objects` key, we can check the information provided for each layout component detected by Mask-RCNN.

In [None]:
print(train_doc_info['27923395']['pages'][0]['objects']['0'].keys())

dict_keys(['bbox', 'category_id', 'score', 'name', 'id', 'gap', 'relations', 'global_id', 'text', 'near_gap'])


As shown above, each layout component comes with a set of attributes, including:

- `bbox`: bounding box coordinates of the layout component.

- `category_id`: category id of the component. There are seven category types provided.

- `name`: document image name, matching the image name in the corresponding image folders.

- `id`: local object id on this document page. Note that the local id is randomly assigned. If you want to feed the components in proper reading order, you need to use the reading order information provided above or find an alternative.

- `gap`: the gap distance between this component and the other components on this document page.

- `relations`: the parent-child relations between components.

- `global_id`: the global id of each object, which is aligned with the CSV file. **Participants can use this ID to locate the answers to each question.**

- `text`: the text content of each layout component.
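
To connect the CSV file to the pickle file, one option is to build a lookup table from `global_id` to the object that carries it. The following is a minimal sketch under stated assumptions: `mock_doc_info` is a shortened, illustrative stand-in for `train_doc_info`, `build_global_index` and `texts_for` are hypothetical helper names, and ids are cast with `int()` because the tutorial does not state whether the pickle stores them as integers or strings.

```python
# Illustrative stand-in for train_doc_info (texts shortened, page layout made up).
mock_doc_info = {
    "27923395": {
        "pages": {
            0: {"objects": {"0": {"global_id": 243796, "text": "In all neuronal tissue"}}},
            1: {"objects": {"0": {"global_id": 243824, "text": "Results"}}},
        }
    }
}


def build_global_index(doc_info):
    """Map global_id -> (pmcid, page index, object dict) across all documents."""
    index = {}
    for pmcid, doc in doc_info.items():
        for page_idx, page in doc["pages"].items():
            for obj in page["objects"].values():
                index[int(obj["global_id"])] = (pmcid, page_idx, obj)
    return index


def texts_for(index, global_ids):
    """Return the text of each object referenced by a list of global ids."""
    return [index[int(gid)][2]["text"] for gid in global_ids]


index = build_global_index(mock_doc_info)
print(index[243796][:2])
print(texts_for(index, [243824]))
```

Building the index once avoids rescanning every page for each question.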


In [None]:
print(train_doc_info['27923395']['pages'][0]['objects']['0']['bbox'])

[304.7059326171875, 449.4801025390625, 234.154052734375, 252.0694580078125]


In [None]:
print(train_doc_info['27923395']['pages'][0]['objects']['0']['category_id'])

1
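
A quick way to get a feel for a document's layout is to count its components per `category_id`. This is a minimal sketch; `mock_doc` is an illustrative stand-in for one entry of `train_doc_info`, `count_categories` is a hypothetical helper name, and the category ids in the mock are made up.

```python
from collections import Counter

# Illustrative stand-in for one document of train_doc_info.
mock_doc = {
    "pages": {
        0: {"objects": {"0": {"category_id": 1}, "1": {"category_id": 2}}},
        1: {"objects": {"0": {"category_id": 1}}},
    }
}


def count_categories(doc):
    """Count layout components per category_id over all pages of a document."""
    counts = Counter()
    for page in doc["pages"].values():
        for obj in page["objects"].values():
            counts[obj["category_id"]] += 1
    return counts


category_counts = count_categories(mock_doc)
print(category_counts)
```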


In [None]:
print(train_doc_info['27923395']['pages'][0]['objects']['0']['gap'])

{'1': 14.063140869140625, '2': 126.1593017578125, '3': 1.287261962890625, '4': 60.6949462890625, '5': 88.15640258789062, '6': 43.418426513671875, '7': 231.18887329101562, '8': 344.1037902832031, '9': 252.55052185058594, '10': 304.53655405685475, '11': 340.26596525030857}
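
The `gap` dictionary can be used, for example, to find the components closest to a given one. The sketch below assumes that the keys are the local ids of the other components on the same page, which matches the description above but is not confirmed further. `nearest_components` is a hypothetical helper name, and the values are copied from the output above.

```python
# Gap values of object '0' on page 0, copied from the printed output.
gap = {
    "1": 14.063140869140625,
    "2": 126.1593017578125,
    "3": 1.287261962890625,
    "4": 60.6949462890625,
    "5": 88.15640258789062,
    "6": 43.418426513671875,
    "7": 231.18887329101562,
    "8": 344.1037902832031,
    "9": 252.55052185058594,
    "10": 304.53655405685475,
    "11": 340.26596525030857,
}


def nearest_components(gap_dict, k=3):
    """Return the k (component id, gap) pairs with the smallest gap."""
    return sorted(gap_dict.items(), key=lambda item: item[1])[:k]


nearest = nearest_components(gap, k=3)
print([component_id for component_id, _ in nearest])
```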


In [None]:
print(train_doc_info['27923395']['pages'][0]['objects']['0']['relations'])

[['child', '10']]
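
Each relation is a `[label, id]` pair. A minimal sketch of grouping the pairs by label and resolving the referenced components is given here. It assumes the second element is the local object key on the same page, and it deliberately does not interpret the direction of the relation, since the tutorial does not spell it out. `page_objects` is an illustrative mock and `group_relations` is a hypothetical helper name.

```python
from collections import defaultdict

# Illustrative mock of one page's objects; only 'relations' mirrors the real output.
page_objects = {
    "0": {"relations": [["child", "10"]], "text": "mock paragraph"},
    "10": {"relations": [], "text": "mock section title"},
}


def group_relations(obj):
    """Group the related local ids of an object by relation label."""
    grouped = defaultdict(list)
    for label, other_id in obj.get("relations", []):
        grouped[label].append(other_id)
    return dict(grouped)


grouped = group_relations(page_objects["0"])
# Resolve the referenced components on the same page (assumed local keys).
related_texts = [page_objects[other_id]["text"] for other_id in grouped.get("child", [])]
print(grouped)
print(related_texts)
```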


In [None]:
print(train_doc_info['27923395']['pages'][0]['objects']['0']['global_id'])

243796


In [None]:
print(train_doc_info['27923395']['pages'][0]['objects']['0']['text'])

In all neuronal tissue, expression of specific neuronal
transcription factors needs to be tightly controlled to
ensure the correct patterning of neuronal populations
both temporally and spatially [3]. This patterning is
regulated in part by the Notch signalling pathway,
which has remained highly conserved throughout verte-
brate evolution. Lateral inhibition with feedback regula-
tion allows Notch signalling to maintain the number of
neural progenitor cells (NPCS) by controlling the num-
ber of neighbouring cells that can exit the cell cycle
and subsequently undergo neural differentiation [14).
Cell cycle exit is controlled by a limited number of
basic helix-loop-helix (BHLH) proneural genes that are
both necessary and sufficient to activate neurogenesis
[5, 28). Loss of function studies indicate that proneural
transcription factors direct not only general aspects of
neuronal differentiation, but also specific aspects of
neuronal identity within NPCS (23, 39, 60). These pro-
neural tra
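
The `text` field keeps the line breaks of the page, including words hyphenated across lines (e.g. "verte-" / "brate"). If a single clean string is preferred as model input, one possible approach is sketched here; `normalise_text` is a hypothetical helper name, and the sample is an excerpt of the text printed above.

```python
import re


def normalise_text(text):
    """Join words hyphenated across line breaks and collapse whitespace."""
    # "verte-\nbrate" -> "vertebrate"
    joined = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Replace remaining line breaks and whitespace runs with single spaces.
    return re.sub(r"\s+", " ", joined).strip()


sample = (
    "which has remained highly conserved throughout verte-\n"
    "brate evolution. Lateral inhibition with feedback regula-\n"
    "tion allows Notch signalling to maintain the number of"
)
cleaned = normalise_text(sample)
print(cleaned)
```

This simple rule also merges genuine hyphenated compounds that happen to break at the end of a line, so it may need adjusting depending on how the text is used.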

Not all of the information provided is necessary to build a question-answering model. When developing their frameworks, participants are free to select and use the information most relevant to their specific requirements.