Unmasking the Limits of Large Language Models: A Systematic Evaluation of Masked Text Processing Ability through MskQA and MskCal
Created: 2024-10-24
- AQuA.csv (AQuA-RAT)
  - original: https://github.com/google-deepmind/AQuA
- RQA.csv
  - Processing
    - Use data from 2023/11/1 onwards.
    - Removed the a tag from evidence and created a column evidence_wo_url.
    - Used the following prompt to have gpt-4o-mini determine whether the question could be answered based solely on the given context. Only questions deemed answerable were used:

          Follow the instructions and output your response.

          ## Instructions
          - Return True if you can answer the question based solely on the context; otherwise, return False.
          - Respond in JSON format as {{'output': bool}}.

          ## Context
          {context}

          ## Question
          {question}

          ## Answer to the Question
          {answers}
- UQA.csv
  - 100 questions generated by an LLM from universal documents, such as the Universal Declaration of Human Rights, as data that potentially contains background knowledge.
- MskCal.txt
  - Calculation problems presented as descriptive questions.
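The RQA.csv preprocessing described above can be sketched roughly as follows. This is a minimal illustration, not the repository's code. The helper names `strip_a_tags` and `build_prompt` are hypothetical, the regex assumes that only the tags are dropped while the link text is kept, and the line breaks in the template are a guess at how the quoted prompt was laid out. The doubled braces in the prompt suggest it was filled in with Python string formatting, which is also an assumption.

```python
import re

# Matches opening and closing <a> tags only (not <abbr>, <article>, ...).
A_TAG = re.compile(r"</?a\b[^>]*>", flags=re.IGNORECASE)


def strip_a_tags(evidence: str) -> str:
    """Remove <a ...> and </a> tags; the link text is kept (assumption)."""
    return A_TAG.sub("", evidence)


# Prompt text quoted from the Processing notes; the layout is assumed.
PROMPT_TEMPLATE = """Follow the instructions and output your response.

## Instructions
- Return True if you can answer the question based solely on the context; otherwise, return False.
- Respond in JSON format as {{'output': bool}}.

## Context
{context}

## Question
{question}

## Answer to the Question
{answers}"""


def build_prompt(context: str, question: str, answers: str) -> str:
    """Fill the answerability-check prompt; '{{' and '}}' render as literal braces."""
    return PROMPT_TEMPLATE.format(context=context, question=question, answers=answers)
```

The resulting string would then be sent to gpt-4o-mini, and only rows judged answerable would be kept. The API call itself is omitted here.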
- Common File Contents for Each Folder
  - [QA_TYPE]_gemma2_9b_*.csv ... Response results for each masking rate (MR) under regular masking.
  - [QA_TYPE]_gemma2_9b_*_filtered.csv ... Response results for each MR under partial masking.
  - [QA_TYPE]_gemma2_9b_*_no_meaning.csv ... Response results for each MR under strict masking.
  - [QA_TYPE]_gemma2_9b_*_no_verb.csv ... Response results for each MR under lenient masking.
  - [QA_TYPE]_gemma2_9b_tmp.csv, [QA_TYPE]_step1.csv, [QA_TYPE]_step2.csv, [QA_TYPE]_step2_use.csv, [QA_TYPE]_step4.csv ... Intermediate processing files used in the notebook steps.
- AQuA_case1
  - results of notebook/AQA_case1_gpt-4o-mini.ipynb
    - Case 1: remove the last line of rationale text and do not perform numerical conversion
- AQuA_case2
  - results of notebook/AQA_case2_gpt-4o-mini.ipynb
    - Case 2: provide rationale and perform numerical conversion
- AQuA_case3_gpt-4o-mini-2024-07-18
  - results of notebook/AQA_case3_gpt-4o-mini.ipynb
    - Case 3: without providing rationale and without numerical conversion
- AQuA_case3_gpt-4o-2024-08-06
  - results of notebook/AQA_case3_gpt-4o.ipynb
    - Case 3: without providing rationale and without numerical conversion
    - Use gpt-4o-2024-08-06
    - Use files from AQuA_case3_gpt-4o-mini-2024-07-18 for steps 1 to 3
- MskCal
  - results of notebook/MskCal.ipynb
  - The introduction part is also masked.
- MSKCAL_gpt-4o-2024-08-06
- MSKCAL_gpt-4o-mini-2024-07-18
  - results of notebook/MskCal.ipynb
    - results for MR: 0%
- MskCal_2
  - results of notebook/MskCal_2.ipynb
  - The introduction part is not masked.
- RQA
  - results of notebook/RQA.ipynb
- RQA_llama3.2
  - results of notebook/RQA_llama3.2.ipynb
- UQA
  - results of notebook/UQA.ipynb
- UQA_llama3.2
  - results of notebook/UQA_llama3.2.ipynb
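The file-name suffixes listed under "Common File Contents for Each Folder" can be mapped back to masking variants with a small helper. This is a sketch under assumptions: `masking_variant` is a hypothetical function, not part of the repository, and the example file names in the usage note are made up, since the text matched by `*` in the patterns is not specified here.

```python
import re

# Suffix -> masking variant, following the "Common File Contents" list.
SUFFIX_TO_VARIANT = {
    "_filtered": "partial",
    "_no_meaning": "strict",
    "_no_verb": "lenient",
}

# Intermediate files: *_tmp.csv, *_step1.csv, *_step2.csv, *_step2_use.csv, *_step4.csv
INTERMEDIATE = re.compile(r"_(tmp|step\d+(_use)?)$")


def masking_variant(filename: str) -> str:
    """Classify a result file by its suffix (hypothetical helper)."""
    stem = filename[:-4] if filename.endswith(".csv") else filename
    if INTERMEDIATE.search(stem):
        return "intermediate"
    for suffix, variant in SUFFIX_TO_VARIANT.items():
        if stem.endswith(suffix):
            return variant
    # No special suffix: treated as regular masking.
    return "regular"
```

For instance, a made-up name such as `RQA_gemma2_9b_20_no_verb.csv` would be classified as lenient masking, and `RQA_step2_use.csv` as an intermediate file.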
- Used Ollama
  - Model: gemma2:9b-instruct-q4_0 https://ollama.com/library/gemma2:9b/blobs/ff1d1fc78170
- Used OpenAI API https://platform.openai.com/docs/models/
  - Models: gpt-4o-mini-2024-07-18, gpt-4o-2024-08-06
- Used Ollama
  - Model: llama3.2:3b-instruct-q4_K_M https://ollama.com/library/llama3.2:3b/blobs/dde5aa3fc5ff
- Step 1
  - Create a word list by using spaCy to segment the question, context, and options into morphemes.
- Step 2
  - Generate the meanings of the words in the word list using the masking task LLM.
- Step 3
  - Convert numerical values.
- Step 4
  - Select words to be masked based on the masking rate.
- Step 5
  - Generate answers to the masked questions using the decoding task LLM.
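Steps 1 and 4 above can be illustrated with a small, dependency-free sketch. Several things here are assumptions rather than the notebooks' actual behavior: a regex tokenizer stands in for spaCy, the masking rate is applied to the set of unique alphabetic words, numbers are left untouched, and the `[MASK]` placeholder is hypothetical (the notebooks may substitute something else, for example the output of Step 2). All function names are illustrative.

```python
import random
import re


def tokenize(text: str) -> list[str]:
    """Stand-in for spaCy: split text into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def select_words_to_mask(words: list[str], masking_rate: float, seed: int = 0) -> set[str]:
    """Pick a share of the unique alphabetic words (lowercased) to mask."""
    vocab = sorted({w.lower() for w in words if w.isalpha()})
    k = round(len(vocab) * masking_rate)
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    return set(rng.sample(vocab, k))


def apply_mask(words: list[str], masked: set[str], placeholder: str = "[MASK]") -> list[str]:
    """Replace every occurrence of a selected word with the placeholder."""
    return [placeholder if w.lower() in masked else w for w in words]


if __name__ == "__main__":
    tokens = tokenize("Tom has 3 apples and Tom eats 1 apple.")
    masked = select_words_to_mask(tokens, masking_rate=0.5)
    print(" ".join(apply_mask(tokens, masked)))
```

With a masking rate of 0 nothing is replaced, which corresponds to the MR: 0% baseline. Higher rates replace a growing share of the vocabulary consistently across the question, context, and options.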
- AQA_case1_gpt-4o-mini.ipynb
  - Case 1: remove the last line of rationale text and do not perform numerical conversion
- AQA_case2_gpt-4o-mini.ipynb
  - Case 2: provide rationale and perform numerical conversion
- AQA_case3_gpt-4o-mini.ipynb
  - Case 3: without providing rationale and without numerical conversion
- AQA_case3_gpt-4o.ipynb
  - Case 3: without providing rationale and without numerical conversion
  - Use gpt-4o-2024-08-06
  - Use files from AQuA_case3_gpt-4o-mini-2024-07-18 for steps 1 to 3
- MskCal.ipynb
  - The introduction part is also masked.
  - MR: 0% (using gpt-4o-mini and gpt-4o), and MR: 20–80%.
- MskCal_2.ipynb
  - The introduction part is not masked.
- RQA.ipynb
- RQA_llama3.2.ipynb
  - Use files from RQA.ipynb for steps 1 to 3
- UQA.ipynb
- UQA_llama3.2.ipynb
  - Use files from UQA.ipynb for steps 1 to 3
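The Case 1 preprocessing used in AQA_case1_gpt-4o-mini.ipynb (removing the last line of the rationale text) amounts to a one-line string operation. The sketch below is illustrative only: `drop_last_line` is a hypothetical helper name, and treating a trailing newline as insignificant is an assumption.

```python
def drop_last_line(rationale: str) -> str:
    """Return the rationale without its final line (hypothetical helper)."""
    lines = rationale.rstrip("\n").split("\n")
    return "\n".join(lines[:-1])
```

A single-line rationale becomes an empty string under this definition, so such rows may need separate handling.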