## Baseline EDA

This notebook takes the predictions from an initial training run of the ELECTRA model on the SQuAD dataset (epochs=4) and performs a baseline exploration of them.

In [12]:
import json
import os
from pprint import pprint
from typing import Dict

import pandas as pd

In [57]:
BASE_DIR = "/Users/marktorres/Documents/school/ut_austin/2022_fall/nlp/final_project/fp-dataset-artifacts/" # noqa
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 10)
pd.set_option('display.width', None)

In [7]:
preds_location = os.path.join(BASE_DIR, "data", "eval_predictions.jsonl")

# Each line of the JSONL file is one JSON-encoded prediction record.
with open(preds_location, 'r') as f:
  res_list = [json.loads(json_str) for json_str in f]

In [8]:
def is_answer_exact_match(pred: Dict) -> bool:
  """Evaluates if a prediction is within the set of possible answer choices.

  This is a strict string comparison. The standard SQuAD "exact match"
  metric normalizes both strings first (lowercasing, stripping punctuation
  and articles), so the value computed here can be lower than the one
  reported during the model's evaluation.
  
  Returns boolean.
  """

  possible_answers = pred["answers"]["text"]
  predicted_answer = pred["predicted_answer"]
  return predicted_answer in possible_answers
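
A strict string comparison treats "a fee per unit of connection time" and "fee per unit of connection time" as different answers. The sketch below shows a normalized comparison modeled on the normalization used by the standard SQuAD v1.1 evaluation (lowercase, remove punctuation, remove the articles a/an/the, collapse whitespace). The names `normalize_answer` and `is_answer_normalized_match` are hypothetical helpers, not part of the original notebook, and it is an assumption that the evaluation run behind these predictions used this metric.

```python
import re
import string
from typing import Dict

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
  """Lowercase, drop punctuation and articles, and collapse whitespace."""
  text = text.lower()
  text = "".join(ch for ch in text if ch not in _PUNCTUATION)
  text = re.sub(r"\b(a|an|the)\b", " ", text)
  return " ".join(text.split())


def is_answer_normalized_match(pred: Dict) -> bool:
  """Hypothetical variant of is_answer_exact_match using normalized strings."""
  gold = {normalize_answer(ans) for ans in pred["answers"]["text"]}
  return normalize_answer(pred["predicted_answer"]) in gold


# Illustrative record with the same shape as one line of the predictions file.
example = {
  "answers": {"text": ["fee per unit of connection time"]},
  "predicted_answer": "a fee per unit of connection time",
}
print(example["predicted_answer"] in example["answers"]["text"])  # strict check
print(is_answer_normalized_match(example))  # normalized check
```

Here the strict check prints `False` and the normalized check prints `True`, because the leading article is dropped before comparing.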

### Splitting and processing the dataset

As a first step, let's split the examples into those the model answered correctly and those it did not, in order to establish some baselines.

In [17]:
is_correct_bool_list = [is_answer_exact_match(res) for res in res_list]

In [16]:
preds_df = pd.DataFrame(res_list)

In [18]:
preds_df["is_exact_match_bool"] = is_correct_bool_list

As a first pass, let's see how correct and incorrect predictions break down by the document used for context:

In [44]:
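# Count exact-match outcomes per document; margins=True adds the "All" totals.
num_rows_by_title_and_correctness_df = pd.crosstab(index=preds_df["title"], columns=preds_df["is_exact_match_bool"], margins=True)
# Despite its name, proportion_correct is a percentage (0-100), not a fraction.
num_rows_by_title_and_correctness_df['proportion_correct'] = num_rows_by_title_and_correctness_df[True]/num_rows_by_title_and_correctness_df['All']*100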


In [46]:
num_rows_by_title_and_correctness_df.sort_values(by="proportion_correct")

| title | False | True | All | proportion_correct (%) |
|---|---|---|---|---|
| Packet_switching | 48 | 58 | 106 | 54.716981 |
| Civil_disobedience | 83 | 114 | 197 | 57.86802 |
| Martin_Luther | 173 | 306 | 479 | 63.88309 |
| Computational_complexity_theory | 70 | 127 | 197 | 64.467005 |
| Pharmacy | 43 | 86 | 129 | 66.666667 |
| Chloroplast | 97 | 198 | 295 | 67.118644 |
| Intergovernmental_Panel_on_Climate_Change | 34 | 70 | 104 | 67.307692 |
| Economic_inequality | 95 | 202 | 297 | 68.013468 |
| European_Union_law | 76 | 167 | 243 | 68.72428 |
| Rhine | 88 | 203 | 291 | 69.75945 |
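
The same per-document breakdown can also be computed with a `groupby`, which keeps the row count next to the accuracy so that small documents with noisy percentages are easy to spot. This is a minimal sketch on a made-up toy frame; `toy_df`, `per_title`, and the column names `num_rows`, `fraction_correct`, and `percent_correct` are illustrative and do not come from the notebook.

```python
import pandas as pd

# Toy stand-in for preds_df; the values are made up for illustration.
toy_df = pd.DataFrame({
  "title": ["A", "A", "A", "B", "B"],
  "is_exact_match_bool": [True, False, True, False, False],
})

# The mean of a boolean column is the fraction of True values.
per_title = toy_df.groupby("title")["is_exact_match_bool"].agg(
  num_rows="size",
  fraction_correct="mean",
)
per_title["percent_correct"] = per_title["fraction_correct"] * 100
per_title = per_title.sort_values(by="percent_correct")
print(per_title)
```

Replacing `toy_df` with `preds_df` should give the same percentages as the crosstab, without the "All" margin row.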


Exact-match accuracy varies noticeably by document: Packet_switching is lowest at about 55%, followed by Civil_disobedience at about 58%. Let's look at the incorrect predictions for the weakest document to see what about those samples could explain why the model does so poorly on them.

In [50]:
def return_df_subset(df: pd.DataFrame, column: str, query: str) -> pd.DataFrame:
    """Return a subset of the df based on a certain column value."""
    return df.loc[df[column] == query]

In [51]:
packet_switching_rows = return_df_subset(preds_df, "title", "Packet_switching")

In [58]:
packet_switching_rows.loc[~packet_switching_rows["is_exact_match_bool"]]

,id,title,context,question,answers,predicted_answer,is_exact_match_bool
4809,5725d34089a1e219009abf52,Packet_switching,"Starting in the late 1950s, American computer scientist Paul Baran developed the concept Distributed Adaptive Message Block Switching with the goal to provide a fault-tolerant, efficient routing method for telecommunication messages as part of a research program at the RAND Corporation, funded by the US Department of Defense. This concept contrasted and contradicted the theretofore established principles of pre-allocation of network bandwidth, largely fortified by the development of telecommunications in the Bell System. The new concept found little resonance among network implementers until the independent work of Donald Davies at the National Physical Laboratory (United Kingdom) (NPL) in the late 1960s. Davies is credited with coining the modern name packet switching and inspiring numerous packet switching networks in Europe in the decade following, including the incorporation of the concept in the early ARPANET in the United States.",What did this concept contradict,"{'text': ['This concept contrasted and contradicted the theretofore established principles of pre-allocation of network bandwidth', 'theretofore established principles of pre-allocation of network bandwidth', 'principles of pre-allocation of network bandwidth'], 'answer_start': [328, 373, 397]}",established principles of pre-allocation of network bandwidth,False
4815,5725d52f89a1e219009abf79,Packet_switching,"Packet switching contrasts with another principal networking paradigm, circuit switching, a method which pre-allocates dedicated network bandwidth specifically for each communication session, each having a constant bit rate and latency between nodes. In cases of billable services, such as cellular communication services, circuit switching is characterized by a fee per unit of connection time, even when no data is transferred, while packet switching may be characterized by a fee per unit of information transmitted, such as characters, packets, or messages.",What is circuit switching characterized by,"{'text': ['circuit switching is characterized by a fee per unit of connection time', 'a method which pre-allocates dedicated network bandwidth specifically for each communication session', 'fee per unit of connection time'], 'answer_start': [323, 90, 363]}",a fee per unit of connection time,False
4816,5725d52f89a1e219009abf7a,Packet_switching,"Packet switching contrasts with another principal networking paradigm, circuit switching, a method which pre-allocates dedicated network bandwidth specifically for each communication session, each having a constant bit rate and latency between nodes. In cases of billable services, such as cellular communication services, circuit switching is characterized by a fee per unit of connection time, even when no data is transferred, while packet switching may be characterized by a fee per unit of information transmitted, such as characters, packets, or messages.",How is packet switching charecterized,"{'text': ['by a fee per unit of information transmitted', 'a fee per unit of information transmitted', 'fee per unit of information transmitted'], 'answer_start': [474, 477, 479]}","packet switching may be characterized by a fee per unit of information transmitted, such as characters, packets, or messages.",False
4819,572634a789a1e219009ac56e,Packet_switching,"Packet switching contrasts with another principal networking paradigm, circuit switching, a method which pre-allocates dedicated network bandwidth specifically for each communication session, each having a constant bit rate and latency between nodes. In cases of billable services, such as cellular communication services, circuit switching is characterized by a fee per unit of connection time, even when no data is transferred, while packet switching may be characterized by a fee per unit of information transmitted, such as characters, packets, or messages.",How is circuit switching charecterized,"{'text': ['by a fee per unit of connection time, even when no data is transferred', 'a fee per unit of connection time', 'fee per unit of connection time'], 'answer_start': [358, 361, 363]}","pre-allocates dedicated network bandwidth specifically for each communication session, each having a constant bit rate and latency between nodes",False
4820,572634a789a1e219009ac56f,Packet_switching,"Packet switching contrasts with another principal networking paradigm, circuit switching, a method which pre-allocates dedicated network bandwidth specifically for each communication session, each having a constant bit rate and latency between nodes. In cases of billable services, such as cellular communication services, circuit switching is characterized by a fee per unit of connection time, even when no data is transferred, while packet switching may be characterized by a fee per unit of information transmitted, such as characters, packets, or messages.",How is packet switching characterized,"{'text': ['by a fee per unit of information transmitted, such as characters, packets, or messages', 'a fee per unit of information transmitted', 'fee per unit of information transmitted'], 'answer_start': [474, 477, 479]}",by a fee per unit of connection time,False
...,...,...,...,...,...,...,...
4896,57264586f1498d1400e8dac9,Packet_switching,"Datanet 1 was the public switched data network operated by the Dutch PTT Telecom (now known as KPN). Strictly speaking Datanet 1 only referred to the network and the connected users via leased lines (using the X.121 DNIC 2041), the name also referred to the public PAD service Telepad (using the DNIC 2049). And because the main Videotex service used the network and modified PAD devices as infrastructure the name Datanet 1 was used for these services as well. Although this use of the name was incorrect all these services were managed by the same people within one department of KPN contributed to the confusion.",Was the Use of the DATANET 1 name correct,"{'text': ['use of the name was incorrect all these services were managed by the same people within one department of KPN contributed to the confusion', 'the name was incorrect', 'Dutch PTT Telecom'], 'answer_start': [476, 483, 63]}",incorrect,False
4899,5726462b708984140094c119,Packet_switching,"The Computer Science Network (CSNET) was a computer network funded by the U.S. National Science Foundation (NSF) that began operation in 1981. Its purpose was to extend networking benefits, for computer science departments at academic and research institutions that could not be directly connected to ARPANET, due to funding or authorization limitations. It played a significant role in spreading awareness of, and access to, national networking and was a major milestone on the path to development of the global Internet.",Funding limitations allowed CSNET to be what,"{'text': ['role in spreading awareness of, and access to, national networking and was a major milestone on the path to development of the global Internet', 'not be directly connected to ARPANET', 'not be directly connected to ARPANET'], 'answer_start': [379, 272, 272]}","extend networking benefits, for computer science departments at academic and research institutions that could not be directly connected to ARPANET",False
4907,572647e2dd62a815002e805e,Packet_switching,"The National Science Foundation Network (NSFNET) was a program of coordinated, evolving projects sponsored by the National Science Foundation (NSF) beginning in 1985 to promote advanced research and education networking in the United States. NSFNET was also the name given to several nationwide backbone networks operating at speeds of 56 kbit/s, 1.5 Mbit/s (T1), and 45 Mbit/s (T3) that were constructed to support NSF's networking initiatives from 1985-1995. Initially created to link researchers to the nation's NSF-funded supercomputing centers, through further public funding and private industry partnerships it developed into a major part of the Internet backbone.",What did NSFNET eventually provide,"{'text': ['it developed into a major part of the Internet backbone', 'a major part of the Internet backbone', 'major part of the Internet backbone'], 'answer_start': [615, 633, 635]}",Initially created to link researchers to the nation's NSF-funded supercomputing centers,False
4908,572648d1708984140094c15d,Packet_switching,"The Very high-speed Backbone Network Service (vBNS) came on line in April 1995 as part of a National Science Foundation (NSF) sponsored project to provide high-speed interconnection between NSF-sponsored supercomputing centers and select access points in the United States. The network was engineered and operated by MCI Telecommunications under a cooperative agreement with the NSF. By 1998, the vBNS had grown to connect more than 100 universities and research and engineering institutions via 12 national points of presence with DS-3 (45 Mbit/s), OC-3c (155 Mbit/s), and OC-12c (622 Mbit/s) links on an all OC-12c backbone, a substantial engineering feat for that time. The vBNS installed one of the first ever production OC-48c (2.5 Gbit/s) IP links in February 1999 and went on to upgrade the entire backbone to OC-48c.",what does vBNS stand for,"{'text': ['The Very high-speed Backbone Network Service', 'Very high-speed Backbone Network Service', 'Very high-speed Backbone Network Service'], 'answer_start': [0, 4, 4]}",high-speed Backbone Network Service,False
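
Several of the misses above are span-boundary errors rather than wrong answers: for id 572648d1708984140094c15d the prediction "high-speed Backbone Network Service" sits inside a reference answer, while for id 572634a789a1e219009ac56f the prediction points at a different part of the passage altogether. A rough way to separate these cases is a substring check in both directions. The helper `classify_miss` and its label strings are hypothetical, not part of the original notebook, and substring containment is only a crude proxy for overlap.

```python
from typing import Dict


def classify_miss(pred: Dict) -> str:
  """Label a non-matching prediction by how it overlaps the reference answers.

  Hypothetical helper; intended for rows where is_exact_match_bool is False.
  """
  predicted = pred["predicted_answer"].strip()
  golds = [gold.strip() for gold in pred["answers"]["text"]]
  if not predicted:
    return "empty"
  if any(predicted in gold for gold in golds):
    return "prediction_inside_gold"  # predicted span is too short
  if any(gold in predicted for gold in golds):
    return "gold_inside_prediction"  # predicted span is too long
  return "no_containment"  # different span, or only partial overlap


# Record copied from the vBNS row in the table above.
vbns_row = {
  "answers": {"text": [
    "The Very high-speed Backbone Network Service",
    "Very high-speed Backbone Network Service",
  ]},
  "predicted_answer": "high-speed Backbone Network Service",
}
print(classify_miss(vbns_row))
```

Assuming the `answers` column holds dicts as in the loaded JSON, the helper could be applied to the incorrect rows with `preds_df.loc[~preds_df["is_exact_match_bool"]].apply(classify_miss, axis=1).value_counts()`.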


Is there a difference in average answer length between the examples the model got right and those it got wrong?