# FEVER Fine-tuning Tabular Dataset Update

**Henry Zelenak | Last updated: 05/12/2023**

This notebook updates the FEVER fine-tuning tabular datasets "tabular_clf_paper_dev_train," "tabular_clf_paper_dev_valid," "tabular_sentEx_paper_dev_train," and "tabular_sentEx_paper_dev_valid." In addition to the sentence text, the updated datasets include the page title, sentence ID (index), and entities for each evidence sentence, taken from the June 2017 Wikipedia dump (Thorne et al., 2018, April).

See [FEVER_set_creation.ipynb](https://github.com/hz-zh/Modular-Fact-Checking-with-GPT-and-RAG/blob/main/FEVER_set_creation.ipynb) for the original dataset creation code.

## Setup

In [1]:
# Mount Google Drive
from google.colab import drive
import gc

drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
import openai
from openai import OpenAI

import json
import os
import io
import pandas as pd
import numpy as np
import scipy.stats as stats
import re
import time
import datetime
from random import random

In [3]:
%cd ./drive/My Drive/SUNY_Poly_DSA598/

/content/drive/My Drive/SUNY_Poly_DSA598


In [4]:
!ls -a

archive			  .git				  presentation
datasets		  .gitignore			  transcribe_voice_notes.ipynb
FEVER_set_creation.ipynb  liar_gpt4omini_base_eval.ipynb  work_documents
FEVER_set_update.ipynb	  Module_2_dev.ipynb


In [5]:

def load_jsonl(file_path, encoding='utf-8'):
    """Loads a JSON Lines file into a list of Python objects."""
    data = []
    with open(file_path, 'r', encoding=encoding) as f:  # Specify encoding for safety
        for line in f:
            data.append(json.loads(line))  # Parse each line individually
    return data
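
`load_jsonl` reads the whole file into a list and would raise on a blank line. As a minimal sketch (not part of the original notebook), a streaming variant could yield one record at a time and skip blank lines. The name `iter_jsonl` is hypothetical, and the demo uses an in-memory stand-in for an open file.

```python
import io
import json


def iter_jsonl(lines):
    """Yield one parsed object per non-blank line (hypothetical helper)."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines, e.g. a stray empty line in the file
            yield json.loads(line)


# Demo on an in-memory stand-in for an open file handle (toy data)
sample = io.StringIO('{"id": 1, "label": "SUPPORTS"}\n\n{"id": 2, "label": "REFUTES"}\n')
records = list(iter_jsonl(sample))
print(len(records), records[0]["label"])
```

With a real file, the same generator can be fed the handle from `open(file_path, 'r', encoding='utf-8')`.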

In [6]:
# Data paths (replace with your actual paths if different)
fever_path = './datasets/FEVER/'
train_clf_path = f"{fever_path}tabular_sets/tabular_clf_paper_dev_train/v1_segmented_n3461_03-29_001.csv"
valid_clf_path = f"{fever_path}tabular_sets/tabular_clf_paper_dev_valid/v1_segmented_n1482_03-29_001.csv"
train_sentEx_path = f"{fever_path}tabular_sets/tabular_sentEx_paper_dev_train/v1_segmented_n3461_03-29_001.csv"
valid_sentEx_path = f"{fever_path}tabular_sets/tabular_sentEx_paper_dev_valid/v1_segmented_n1482_03-29_001.csv"
test_path = f"{fever_path}paper_test.jsonl"
train_path = f"{fever_path}paper_dev.jsonl"

# Load datasets
train_clf = pd.read_csv(train_clf_path)
valid_clf = pd.read_csv(valid_clf_path)
train_sentEx = pd.read_csv(train_sentEx_path)
valid_sentEx = pd.read_csv(valid_sentEx_path)
test_jsonl = load_jsonl(test_path)
train_jsonl = load_jsonl(train_path)

## Dataset Updates

### Sentence extraction from Wikipedia text, now returning sentence text, page titles, sentence IDs, and entities for each evidence sentence

In [7]:
def load_wiki_pages_minimal_memory(wiki_dir, num_files_to_load=50):
  """
  Loads Wikipedia pages from JSONL files into a dictionary.
  Keeps only the 'text' and 'lines' fields of each page and reads at most 'num_files_to_load' files.

  Args:
      wiki_dir (str): Path to the directory containing wiki-*.jsonl files.
      num_files_to_load (int): Number of wiki files to load (default: 50).

  Returns:
      dict: A dictionary where keys are page titles (the 'id' field) and values are dicts holding the page's 'text' and 'lines'.
  """
  print(f"Attempting to load {num_files_to_load} Wikipedia pages from {wiki_dir}...")
  wiki_pages = {}
  loaded_files_count = 0
  for filename in sorted(os.listdir(wiki_dir)):
      if filename.startswith('wiki-') and filename.endswith('.jsonl'):
        filepath = os.path.join(wiki_dir, filename)
        print(f"Loading file: {filepath}")
        with open(filepath, 'r', encoding='utf-8') as f:  # Specify encoding for safety
          for line in f:
            data = json.loads(line)
            # Store the page's text and lines under its title (the 'id' field)
            wiki_pages[data['id']] = {
              'text': data['text'],
              'lines': data['lines']
            }

        loaded_files_count += 1
      if loaded_files_count >= num_files_to_load:
        break
  return wiki_pages


wiki_pages_dir = './datasets/FEVER/wiki-pages'
num_to_load = len(os.listdir(wiki_pages_dir))
wiki_page_list_dicts = load_wiki_pages_minimal_memory(wiki_pages_dir, num_to_load)

print(f"Loaded texts from {len(wiki_page_list_dicts)} Wikipedia pages (minimal memory).")

Attempting to load 109 Wikipedia pages from ./datasets/FEVER/wiki-pages...
Loading file: ./datasets/FEVER/wiki-pages/wiki-001.jsonl
Loading file: ./datasets/FEVER/wiki-pages/wiki-002.jsonl
Loading file: ./datasets/FEVER/wiki-pages/wiki-003.jsonl
Loading file: ./datasets/FEVER/wiki-pages/wiki-004.jsonl
Loading file: ./datasets/FEVER/wiki-pages/wiki-005.jsonl
Loading file: ./datasets/FEVER/wiki-pages/wiki-006.jsonl
Loading file: ./datasets/FEVER/wiki-pages/wiki-007.jsonl
Loading file: ./datasets/FEVER/wiki-pages/wiki-008.jsonl
Loading file: ./datasets/FEVER/wiki-pages/wiki-009.jsonl
Loading file: ./datasets/FEVER/wiki-pages/wiki-010.jsonl
Loading file: ./datasets/FEVER/wiki-pages/wiki-011.jsonl
Loading file: ./datasets/FEVER/wiki-pages/wiki-012.jsonl
Loading file: ./datasets/FEVER/wiki-pages/wiki-013.jsonl
Loading file: ./datasets/FEVER/wiki-pages/wiki-014.jsonl
Loading file: ./datasets/FEVER/wiki-pages/wiki-015.jsonl
Loading file: ./datasets/FEVER/wiki-pages/wiki-016.jsonl
Loading file:
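
Loading every wiki file keeps the full dump in memory. If only the pages cited as evidence are needed, one option is to collect those titles first and keep just the matching pages. The sketch below is an assumption rather than the notebook's method: `referenced_titles` and `keep_pages` are hypothetical helpers, and the records are toy data shaped like FEVER items (`[annotation_id, evidence_id, page_title, sentence_id]`) and wiki-page rows.

```python
def referenced_titles(fever_items):
    """Collect every page title named in the evidence of the given claims."""
    titles = set()
    for item in fever_items:
        for annotation in item.get("evidence", []):
            for piece in annotation:
                # Each piece is [annotation_id, evidence_id, page_title, sentence_id]
                title = piece[2]
                if title is not None:
                    titles.add(title)
    return titles


def keep_pages(page_records, titles):
    """Build the page dictionary, keeping only pages whose id is in `titles`."""
    return {
        rec["id"]: {"text": rec["text"], "lines": rec["lines"]}
        for rec in page_records
        if rec["id"] in titles
    }


# Toy data for illustration only
items = [
    {"claim": "c1", "label": "SUPPORTS", "evidence": [[[1, 2, "Page_A", 0]]]},
    {"claim": "c2", "label": "NOT ENOUGH INFO", "evidence": [[[3, None, None, None]]]},
]
pages = [
    {"id": "Page_A", "text": "A .", "lines": "0\tA ."},
    {"id": "Page_B", "text": "B .", "lines": "0\tB ."},
]
needed = referenced_titles(items)
subset = keep_pages(pages, needed)
print(sorted(subset))
```

In the loader, the same membership test could be applied to each parsed line before it is stored.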

In [8]:
for page, i in zip(wiki_page_list_dicts, range(1, 10)):
  print(f"Page {i}: {page}")

Page 1: 
Page 2: 1928_in_association_football
Page 3: 1986_NBA_Finals
Page 4: 1901_Villanova_Wildcats_football_team
Page 5: 1992_Northwestern_Wildcats_football_team
Page 6: 1897_Princeton_Tigers_football_team
Page 7: 1536_in_philosophy
Page 8: ...Di_terra
Page 9: 1967–68_MJHL_season


In [9]:
import random
import requests
import re

def extract_evidence_text_debug(fever_item, wiki_page_dict, verbose=False, debug=False):
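    """
    Extracts the evidence sentences for a FEVER claim, with optional debugging prints.

    Returns a tuple (evidence_sentences, text_str): a list of
    [sentence_text, page_title, sentence_id, entities] entries and the concatenated
    text of every page that was used. NOT ENOUGH INFO claims return ([], '').
    """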
    if verbose:
      print(f"Starting evidence extraction for claim: {fever_item['claim']}")

    if fever_item['label'] == 'NOT ENOUGH INFO':
      if verbose:
        print(f"NOT ENOUGH INFO found for claim: {fever_item['claim']}, STOP PROCESSING AND REMOVE NOT ENOUGH INFO CLAIMS")

      return [], ''

    evidence_sentences = []
    text_str = ''

    # flatten the evidence set
    evidence_set = fever_item['evidence'] # This is a list of lists of lists, where each sublist contains evidence pieces that an annotator has selected for a claim
    evidence_set = [item for sublist in evidence_set for item in sublist] # Flatten the evidence set
    # Trim the first two elements of each evidence piece (annotation_id, evidence_id)
    evidence_set = [evidence_piece[2:] for evidence_piece in evidence_set] # Trim the first two elements of each evidence piece
    if debug:
      print(f"DEBUG 0: Evidence set: {evidence_set}") # DEBUG 0
    # Convert inner lists to tuples before creating a set
    evidence_set = list(set(tuple(item) for item in evidence_set))
    # Convert back to list of lists (optional, based on your later usage)
    evidence_set = [list(item) for item in evidence_set]
    if debug:
      print(f"DEBUG 1: Length of evidence_set: {len(evidence_set)}\nSet: {evidence_set}") # DEBUG 1

    pages_loaded = set() # Set to keep track of loaded pages
    for evidence_piece in evidence_set: # Looping over evidence pieces (there will be 1 or more, these are the evidence sentences from different pages, for the same claim)
        if debug:
          print(f"DEBUG 2: Evidence piece: {evidence_piece}") # DEBUG 2
        if len(evidence_piece) == 2:  # After trimming, each evidence piece should hold exactly (page_title, sentence_id)
            page_title, sentence_id = evidence_piece # Unpack the evidence piece
            if debug:
              print(f"DEBUG 2.1: Processing evidence: page_title={page_title}, sentence_id={sentence_id}") # DEBUG 2.1
            if page_title is not None and sentence_id is not None: # Check if page_title and sentence_id are not None
                wiki_page = wiki_page_dict.get(page_title) # Retrieve the wiki page from the dictionary
                if debug:
                  print(f"DEBUG 3: Wiki page retrieved: {wiki_page is not None}") # DEBUG 3
                if wiki_page: # Check that a wiki page was found for this title
                    if debug:
                      print(f"DEBUG 4: Wiki page keys: {wiki_page.keys()}") # DEBUG 4
                    if 'lines' in wiki_page and isinstance(wiki_page['lines'], str): # Check if 'lines' key exists and is a string
                        lines_str = wiki_page['lines']
                        # Check if page has already been loaded and added to the text_str
                        if page_title not in pages_loaded:
                            pages_loaded.add(page_title)
                            text_str += "\n" + wiki_page['text']
                        elif debug:
                            print(f"DEBUG 4.1: Page {page_title} already loaded, skipping text addition.") # DEBUG 4.1
                        sentences = lines_str.strip().split('\n')
                        if debug:
                          print(f"DEBUG 5: Number of sentences found: {len(sentences)}") # DEBUG 5
                        if sentence_id < len(sentences):  # Check if sentence_id is within the range of sentences
                          for line_idx, sentence in enumerate(sentences): # Loop over the sentences and their positions
                            if len(sentence.split('\t')) < 2 or sentence.split('\t')[1].strip() == '': # Skip lines that carry no sentence text
                              if debug:
                                print(f"DEBUG 6: Skipping blank line at index {line_idx}") # DEBUG 6
                              continue
                            if debug:
                              print(f"DEBUG 7: Line retrieved: {sentence} with ID: " +  sentence.split('\t')[0].strip()) # DEBUG 7
                            if int(sentence.split('\t')[0].strip()) == sentence_id:
                              sentence_text = sentence.split('\t')[1].strip() # Extract the text after the tab character (To skip the index at the beginning and the entities after: "0	The 1905 Tempe Normal Owls football team was an American football team that represented Tempe Normal School -LRB- later renamed Arizona State University -RRB- as an independent during the 1905 college football season .	American football	American football	Arizona State University	Arizona State University	1905 college football season	1905 college football season")
                              # Entities are everything after the second item when splitting by tab
                              entities = sentence.split('\t')[2:] # Extract the entities
                              # Remove any leading or trailing whitespace from the entities
                              entities = [entity.strip() for entity in entities if entity.strip()] # Remove empty entities
                              entities = set(entities) # Remove duplicates
                              entities = list(entities) # Convert back to list
                              if debug:
                                print(f"DEBUG 8: Sentence text for index {sentence_id} / " + sentence.split('\t')[0].strip() + f" extracted: {sentence_text}") # DEBUG 8
                                print(f"DEBUG 9: Entities extracted: {entities}") # DEBUG 9
                              # Append a list [sentence_text, page_title, sentence_id, entities] to the evidence_sentences list
                              evidence_sentences.append([sentence_text, page_title, sentence_id, entities])
                            #else:
                                #if verbose:
                                  #print(f"\tWarning: Sentence index does not match for page: {page_title}, sentence_id: {sentence_id}")
                        else:
                          if verbose:
                            print(f"\tWarning: Sentence index out of range for page: {page_title}, sentence_id: {sentence_id} (Number of sentences: {len(sentences)})")
                    else:
                      if verbose:
                        print(f"\tWarning: Could not retrieve page or lines for title: {page_title}")
                else:
                  if verbose:
                    print(f"\tWarning: Could not retrieve wiki page for title: {page_title}")
        else:
          if verbose:
            print(f"\tWarning: Unexpected evidence format: {evidence_piece}")
        if verbose:
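          # Read the title from the evidence piece so this also works when the piece was malformed and page_title was never set
          print(f"DEBUG 10: Evidence sentences for page {evidence_piece[0] if evidence_piece else None} have been extracted.")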
          print("_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-")
    if verbose:
      print(f"Evidence extraction completed for claim: {fever_item['claim']}")
      print(f"Number of evidence sentences found: {len(evidence_sentences)}")
      print('--------------------------------------------------------------')
    return evidence_sentences, text_str
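
The function deduplicates evidence by passing it through a `set`, which does not guarantee the order of the resulting pairs. As a hedged alternative (a sketch only, with the hypothetical name `unique_evidence_pairs` and toy input), the flattening and deduplication can be done in one pass that keeps first-seen order.

```python
def unique_evidence_pairs(evidence):
    """Flatten annotator groups into unique (page_title, sentence_id) pairs."""
    seen = set()
    pairs = []
    for annotation in evidence:
        for piece in annotation:
            pair = tuple(piece[2:])  # drop annotation_id and evidence_id
            if pair not in seen:
                seen.add(pair)
                pairs.append(pair)
    return pairs


# Toy evidence: two annotators, one sentence selected by both
evidence = [
    [[10, 20, "Murda_Beatz", 0]],
    [[11, 21, "Murda_Beatz", 0], [11, 22, "Other_Page", 3]],
]
print(unique_evidence_pairs(evidence))
```

Keeping a stable order makes the stored `evidence_sentences` reproducible from run to run, which can simplify diffing dataset versions.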


In [11]:
# Test run evidence extraction
claim_item = train_jsonl[13]
evidence, full_text = extract_evidence_text_debug(claim_item, wiki_page_list_dicts, verbose=True, debug=True)

print(f"\nClaim: {claim_item['claim']}")
print("\nEvidence Sentences:")
for item in evidence:
    sentence, page_title, sentence_id, entities = item
    print(f"- {sentence}\n")
    print(f"Page: {page_title}, Sentence ID: {sentence_id}, Entities: {entities}")
print("\nFull Text:")
print(full_text)

Starting evidence extraction for claim: Murda Beatz's real name is Marshall Mathers.
DEBUG 0: Evidence set: [['Murda_Beatz', 0]]
DEBUG 1: Length of evidence_set: 1
Set: [['Murda_Beatz', 0]]
DEBUG 2: Evidence piece: ['Murda_Beatz', 0]
DEBUG 2.1: Processing evidence: page_title=Murda_Beatz, sentence_id=0
DEBUG 3: Wiki page retrieved: True
DEBUG 4: Wiki page keys: dict_keys(['text', 'lines'])
DEBUG 5: Number of sentences found: 3
DEBUG 7: Line retrieved: 0	Shane Lee Lindstrom -LRB- born February 11 , 1994 -RRB- , professionally known as Murda Beatz , is a Canadian hip hop record producer from Fort Erie , Ontario .	Fort Erie	Fort Erie, Ontario	Ontario	Ontario	hip hop	Hip hop music	record producer	record producer with ID: 0
DEBUG 8: Sentence text for index 0 / 0 extracted: Shane Lee Lindstrom -LRB- born February 11 , 1994 -RRB- , professionally known as Murda Beatz , is a Canadian hip hop record producer from Fort Erie , Ontario .
DEBUG 9: Entities extracted: ['Hip hop music', 'Ontario', 'r
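
The debug output shows the layout of each row in a page's `lines` field: a sentence index, a tab, the sentence text, then tab-separated entity mentions. A minimal sketch of parsing that layout once per page follows. `parse_lines` is a hypothetical helper, the sample string is toy data, and sorting the entities is a choice made here for stable output rather than something the notebook does.

```python
def parse_lines(lines_str):
    """Map sentence_id -> (sentence_text, sorted unique entities)."""
    parsed = {}
    for row in lines_str.strip().split("\n"):
        fields = row.split("\t")
        # Skip rows without text, or whose first field is not a sentence index
        if len(fields) < 2 or not fields[1].strip() or not fields[0].strip().isdigit():
            continue
        entities = sorted({f.strip() for f in fields[2:] if f.strip()})
        parsed[int(fields[0].strip())] = (fields[1].strip(), entities)
    return parsed


# Toy 'lines' value: sentence 0 has entities, sentence 1 is blank, sentence 2 has none
sample_lines = "0\tFirst sentence .\tAlpha\tAlpha page\tAlpha\tAlpha page\n1\t\n2\tSecond sentence ."
parsed = parse_lines(sample_lines)
print(parsed[0])
```

Looking up `parsed.get(sentence_id)` then replaces the inner loop over all sentences, and a missing ID simply returns `None`.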

### Update tabular training files

In [12]:
# Collect the SUPPORTS and REFUTES claims from the JSONL data
jsonl_supports = [item for item in train_jsonl if item['label'] == 'SUPPORTS']
jsonl_refutes = [item for item in train_jsonl if item['label'] == 'REFUTES']

# Iterate through the rows of each dataframe, matching the claim text with the claim in the jsonl file, and extract the evidence sentences for that item
def update_evidence_sentences(df, jsonl_supports, jsonl_refutes, wiki_page_dict, verbose=0):
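    """
    Returns a copy of the dataframe whose 'evidence_sentences' column is refreshed from the JSONL data.

    Rows labelled SUPPORTS or REFUTES are matched to a JSONL item by claim text; all other rows are left unchanged.
    """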
    new_df = df.copy()  # Create a copy of the dataframe to avoid modifying the original
    for index, row in new_df.iterrows():
        claim = row['claim']
        label = row['label']
        if label == 'SUPPORTS':
            # Find the corresponding JSONL item
            jsonl_item = next((item for item in jsonl_supports if item['claim'] == claim), None)
        elif label == 'REFUTES':
            # Find the corresponding JSONL item
            jsonl_item = next((item for item in jsonl_refutes if item['claim'] == claim), None)
        else:
            # Add none if the label is not SUPPORTS or REFUTES
            if verbose:
                print(f"Adding nothing for claim: {claim} with label: {label}")
            jsonl_item = None
        if jsonl_item:
            # Extract evidence sentences and full text
            evidence_sentences, _ = extract_evidence_text_debug(jsonl_item, wiki_page_dict, verbose=verbose)
            # Update the dataframe with the evidence sentences, directly in the new_df
            new_df.at[index, 'evidence_sentences'] = evidence_sentences
        else:
            if verbose:
                print(f"NOT ENOUGH INFO found for claim: {claim}, not adding anything")

    return new_df

# Update the evidence sentences in the train and valid dataframes
train_sentEx_up = update_evidence_sentences(train_sentEx, jsonl_supports=jsonl_supports, jsonl_refutes=jsonl_refutes, wiki_page_dict=wiki_page_list_dicts, verbose=1)
valid_sentEx_up = update_evidence_sentences(valid_sentEx, jsonl_supports=jsonl_supports, jsonl_refutes=jsonl_refutes, wiki_page_dict=wiki_page_list_dicts, verbose=1)
train_clf_up = update_evidence_sentences(train_clf, jsonl_supports=jsonl_supports, jsonl_refutes=jsonl_refutes, wiki_page_dict=wiki_page_list_dicts, verbose=1)
valid_clf_up = update_evidence_sentences(valid_clf, jsonl_supports=jsonl_supports, jsonl_refutes=jsonl_refutes, wiki_page_dict=wiki_page_list_dicts, verbose=1)

Streaming output truncated to the last 5000 lines.
Evidence extraction completed for claim: The Columbia River is too narrow for ships.
Number of evidence sentences found: 2
--------------------------------------------------------------
Adding nothing for claim: Capsicum chinense originates in the Americas. with label: NOT ENOUGH INFO
NOT ENOUGH INFO found for claim: Capsicum chinense originates in the Americas., not adding anything
Starting evidence extraction for claim: The 17th was the day Billie Joe Armstrong was born.
DEBUG 10: Evidence sentences for page Billie_Joe_Armstrong have been extracted.
_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-
Evidence extraction completed for claim: The 17th was the day Billie Joe Armstrong was born.
Number of evidence sentences found: 1
--------------------------------------------------------------
Starting evidence extraction for claim: The Pelican Brief is based solely on a poem.
DEBUG 10: Evidence sentences for page The_Pelic
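
`update_evidence_sentences` scans the JSONL lists once per dataframe row to find a matching claim. One possible speed-up, sketched here under the assumption that label plus claim text is a sufficient key, is to build a dictionary once and look each row up directly. `build_claim_index` is a hypothetical name and the items are toy data.

```python
def build_claim_index(jsonl_items):
    """Index FEVER items by (label, claim); the first occurrence wins."""
    index = {}
    for item in jsonl_items:
        # setdefault keeps the earliest item, matching a first-match linear scan
        index.setdefault((item["label"], item["claim"]), item)
    return index


# Toy items: the same claim text appears under two labels, plus one duplicate
items = [
    {"claim": "A is B.", "label": "SUPPORTS", "evidence": []},
    {"claim": "A is B.", "label": "REFUTES", "evidence": []},
    {"claim": "A is B.", "label": "SUPPORTS", "evidence": [["duplicate"]]},
]
index = build_claim_index(items)
match = index.get(("SUPPORTS", "A is B."))
print(match is items[0])
```

Inside the row loop, `index.get((label, claim))` would return `None` for NOT ENOUGH INFO rows as long as the index is built only from SUPPORTS and REFUTES items.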

In [13]:
pd.set_option('display.max_colwidth', None)
train_clf_up.head(20)


Unnamed: 0,claim,evidence_sentences,label,syntactic_complexity
0,Riddick is in a science fiction film.,"[[Actor Vin Diesel has played the title role in all of the Riddick-based films and video games so far ., Riddick_-LRB-character-RRB-, 1, [Riddick (film), Vin Diesel, title role, Riddick]], [Within the canon of the series , Riddick is shown to be a highly skilled predator -- he is extremely mobile and stealthy - especially for someone of his size , has a vast knowledge of how to kill almost any humanoid in a variety of ways , is an extreme survivalist , and is notoriously hard to contain ., Riddick_-LRB-character-RRB-, 4, [Riddick (film), Riddick]], [Richard B. Riddick , more commonly known as Riddick , is a fictional character and the antihero of four films in the Riddick series -LRB- Pitch Black , The Chronicles of Riddick , the animated movie The Chronicles of Riddick : Dark Fury , and Riddick -RRB- , as well as the two video games The Chronicles of Riddick : Escape from Butcher Bay and The Chronicles of Riddick : Assault on Dark Athena ., Riddick_-LRB-character-RRB-, 0, [antihero, Pitch Black (film), Riddick, Character (arts), fictional character, Riddick (film), The Chronicles of Riddick, Pitch Black]], [Riddick is a 2013 American science fiction thriller film , the third installment in the Riddick film series ., Riddick_-LRB-film-RRB-, 0, [thriller, science fiction film, science fiction, Thriller (genre)]], [Riddick was once a mercenary , then part of a security force , and later a soldier ., Riddick_-LRB-character-RRB-, 13, [mercenary, Riddick (film), Riddick]], [One of his most defining features are his eyes , a characteristic inherent in a certain caste of his species -LRB- the Alpha-Furyans -RRB- , although he implies in Pitch Black that they were `` shined '' by a back-alley surgical operation ., Riddick_-LRB-character-RRB-, 9, [Pitch Black (film), Pitch Black]], [Pitch Black -LRB- titled The Chronicles of Riddick : Pitch Black on its DVD re-release -RRB- is a 2000 American science fiction action horror film 
co-written and directed by David Twohy ., Pitch_Black_-LRB-film-RRB-, 0, [horror, Horror (genre), Action genre, science fiction film, action, Riddick (film), science fiction, David Twohy, The Chronicles of Riddick, Riddick]], [Riddick is a Furyan , a member of a warrior race obliterated by a military campaign that left Furya desolate , and is one of the last of his kind ., Riddick_-LRB-character-RRB-, 8, [Riddick (film), Riddick]]]",SUPPORTS,0.56
1,Season 2 of Fargo takes place in 1982.,"[[A prequel to the events in its first season , season two of Fargo takes place in the Midwestern United States in March 1979 ., Fargo_-LRB-season_2-RRB-, 6, [Midwestern United States, Fargo (season 1), Fargo (TV series), its first season, Fargo]]]",REFUTES,0.93
2,In the Cretaceous non-avian dogs died out.,"(奇跡の香りダンス。, Kiseki no Kaori Dansu., ""Miraculous Fragrance Dance."")\nis the 12th single from Aya Matsuura, who was a Hello!\n""Kiseki no Kaori Dance.""\nProject solo artist at the time.\nIt was released on January 28, 2 under the Zetima label.",NOT ENOUGH INFO,1.0
3,Tremont Street Subway is a tunnel.,"[[The tunnel originally served five closely spaced stations : Boylston , Park Street , Scollay Square , Adams Square , and Haymarket , with branches to the Public Garden Portal and Pleasant Street Incline south of Boylston ., Tremont_Street_Subway, 5, [Park Street (MBTA station), Adams Square, Boylston, Scollay Square (BERy station), Pleasant Street Incline, Scollay Square, Boylston (MBTA station), Public Garden Portal, Adams Square (BERy station), Haymarket, Park Street, Haymarket (MBTA station), Green Line (MBTA)#Public Garden and Boylston Street]], [The Tremont Street Subway in Boston 's MBTA Subway system is the oldest subway tunnel in North America and the third oldest worldwide to exclusively use electric traction -LRB- after the City and South London Railway in 1890 , and the Budapest Metro 's Line 1 in 1896 -RRB- , opening on September 1 , 1897 ., Tremont_Street_Subway, 0, [electric traction, City and South London Railway, MBTA Subway]]]",SUPPORTS,0.49
4,"Albert S. Ruddy is born on March 29, 1930.","[[Albert S. Ruddy -LRB- born March 28 , 1930 -RRB- is a Canadian-born film and television producer ., Albert_S._Ruddy, 0, [Canadians, Canadian]]]",REFUTES,1.07
5,Annette Badland was in an American soap opera.,"Like the closely related litter moths, the adults have long, upturned labial palps, and the caterpillars have fully or mostly developed prolegs on the abdomen.\nThe Aganainae are distributed across the tropics and subtropics of the Old World.1\nThe adults and caterpillars of this subfamily are typically large and brightly colored, like the related tiger moths.\nThe Aganainae are a small subfamily of moths in the family Erebidae.\nMany of the caterpillars feed on poisonous host plants and acquire toxic cardenolides that make them unpleasant to predators.",NOT ENOUGH INFO,0.63
6,Eddie Guerrero had problems with substance abuse.,"[[He experienced various substance abuse problems , including alcoholism and an addiction to painkillers ; these real-life issues were sometimes incorporated into his storylines ., Eddie_Guerrero, 10, [alcoholism, painkillers, Glossary of professional wrestling terms#A, Analgesic, substance abuse, storylines]]]",SUPPORTS,0.86
7,Highway to Heaven is something other than a drama series.,"[[Highway to Heaven is an American television drama series which ran on NBC from 1984 to 1989 ., Highway_to_Heaven, 0, [NBC]]]",REFUTES,0.77
8,Dreamer (2005 film) was reviewed by John Gatins.,"USS Carmita (IX-152) was a Trefoil-class concrete barge - a supply ship made of concrete - during World War II.\nConsidered an unclassified miscellaneous vessel, she was acquired and placed in service on 1 May 1944.\nShe was attached to Service Force, Pacific Fleet, until 2 September 1 when she was stricken from the Naval Vessel Register.\nThe IX-1 was originally known as Slate.\nThe IX-1 was the second ship of the United States Navy to have the name Carmita and was named for the first Carmita, a schooner captured during the American Civil War.",NOT ENOUGH INFO,1.07
9,Humphrey Bogart is a professional actor for films.,"[[The next year , his performance in Casablanca -LRB- 1943 ; Oscar nomination -RRB- raised him to the peak of his profession and , at the same time , cemented his trademark film persona , that of the hard-boiled cynic who ultimately shows his noble side ., Humphrey_Bogart, 10, [Academy Award for Best Actor, Oscar, Casablanca (film), Casablanca]], [Humphrey DeForest Bogart -LRB- -LSB- ˈboʊgɑrt -RSB- December 25 , 1899January 14 , 1957 -RRB- was an American screen and stage actor whose performances in 1940s films noir such as The Maltese Falcon , Casablanca , and The Big Sleep earned him status as a cultural icon ., Humphrey_Bogart, 0, [The Big Sleep (1946 film), The Maltese Falcon, Casablanca, Mob film, The Big Sleep, films noir, cultural icon, The Maltese Falcon (1941 film), Casablanca (film), films, Film noir]], [Casablanca is a 1942 American romantic drama film directed by Michael Curtiz and based on Murray Burnett and Joan Alison 's unproduced stage play Everybody Comes to Rick 's ., Casablanca_-LRB-film-RRB-, 0, [Murray Burnett, Romance film, Casablanca, Joan Alison, Michael Curtiz, romantic drama]]]",SUPPORTS,0.63


In [14]:
train_sentEx_up.to_csv(f"{fever_path}tabular_sets/tabular_sentEx_paper_dev_train/v1_segmented_sentIDs_n3461_04-04_002.csv", index=False)
valid_sentEx_up.to_csv(f"{fever_path}tabular_sets/tabular_sentEx_paper_dev_valid/v1_segmented_sentIDs_n1482_04-04_002.csv", index=False)

In [15]:
train_clf_up.to_csv(f"{fever_path}tabular_sets/tabular_clf_paper_dev_train/v1_segmented_sentIDs_n3461_04-04_002.csv", index=False)
valid_clf_up.to_csv(f"{fever_path}tabular_sets/tabular_clf_paper_dev_valid/v1_segmented_sentIDs_n1482_04-04_002.csv", index=False)
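
Once written to CSV, the nested lists in `evidence_sentences` come back as plain strings when the file is read again, while NOT ENOUGH INFO rows hold ordinary passage text. A hedged sketch of restoring the lists on load follows. `parse_evidence_cell` is a hypothetical helper, and it assumes the list cells were written as Python list literals.

```python
import ast


def parse_evidence_cell(cell):
    """Return the nested list stored in a CSV cell, or the cell unchanged."""
    if not isinstance(cell, str):
        return cell  # e.g. a missing value
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return cell  # ordinary text, not a Python literal
    return value if isinstance(value, list) else cell


# Toy cells: one stored list and one plain-text passage
stored = str([["Some sentence .", "Some_Page", 0, ["Entity"]]])
restored = parse_evidence_cell(stored)
passage = parse_evidence_cell("Plain passage text, not a list.")
print(type(restored).__name__, type(passage).__name__)
```

After `pd.read_csv`, the helper could be applied with `df['evidence_sentences'].apply(parse_evidence_cell)`.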

## References


Sheffieldnlp. (2021). FEVER-scorer. SHEFFIELDNLP/Fever-scorer at release-v2.0. https://github.com/sheffieldnlp/fever-scorer/tree/release-v2.0 

Thorne, J., Vlachos, A., Christodoulopoulos, C., & Mittal, A. (2018, April). Fever dataset. Fact Extraction and VERification. https://fever.ai/dataset/fever.html 

Thorne, J., Vlachos, A., Christodoulopoulos, C., & Mittal, A. (2018, June). FEVER: A large-scale dataset for fact extraction and VERification. In M. Walker, H. Ji, & A. Stent (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) (pp. 809–819). Association for Computational Linguistics. https://doi.org/10.18653/v1/N18-1074