# Data Formatting 

Goal: For each answer in column "A1", use regex to locate the answer within the context and record its "answer_start" value. A question with multiple answers gets a list of "answer_start" values, one per answer. These lists are stored in a new column, "answer_start".



* Note: in the SQuAD development set, answer_start is the character index in the context at which the answer begins. In our original experiment, the answer_start values that BERT produced were off.
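
A minimal sketch of the convention described in the note: slicing the context at answer_start for the length of the answer should give back the answer text. The helper name `answer_matches` and the sample sentence are invented for illustration and are not taken from the dataset.

```python
def answer_matches(context, answer, answer_start):
    """Return True if `answer` sits in `context` exactly at character index `answer_start`."""
    return context[answer_start:answer_start + len(answer)] == answer

# Invented sample context (not from the dataset)
sample_context = "No. Antibiotics do not work against viruses."

print(answer_matches(sample_context, "No", 0))            # True: "No" begins at index 0
print(answer_matches(sample_context, "Antibiotics", 4))   # True: "Antibiotics" begins at index 4
print(answer_matches(sample_context, "Antibiotics", 5))   # False: index is off by one
```

A check like this is one way to spot answer_start values that are off, since any wrong index makes the slice differ from the answer.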

## Data 

In [3]:
import pandas as pd
import re
data = pd.read_csv('divided_questions11.csv')
data = data.loc[:, ~data.columns.str.contains('^Unnamed')]  #drop unnamed columns

data.head()

,category,question,context,A1
0,About the Virus FAQs,Am I at risk for COVID-19 from a package or pr...,There is still a lot that is unknown about the...,"""very low risk"""
1,About the Virus FAQs,Am I likely to get sicker if I'm exposed to mu...,Symptom severity can be influenced by many dif...,"""stay at least 6 feet away from everyone and a..."
2,About the Virus FAQs,Are antibiotics effective in preventing or tre...,No. According to the World Health Organization...,"""No"""
3,About the Virus FAQs,Are there therapies available to treat COVID-19?,Scientists are currently testing different typ...,"""Remdesivir"",\n""Dexamethasone"",\n ""favipiravir..."
4,About the Virus FAQs,Are there two strains of the COVID-19 virus?,The existence of an S strain and an L strain r...,"""The existence of an S strain and an L strain ..."


Split each entry in 'A1' into its individual answers, producing a list of answers for each question.

## Preprocess

In [4]:
def split_answers(raw):
    parts = re.split('"\n,|"\n ,|",\n|", \n', raw)             # split on the quote/comma/newline separators between answers
    answers = [p.replace('"', '').strip() for p in parts]      # remove quotes surrounding answers
    # escape parentheses: they are regex metacharacters and would otherwise be read as groups
    return [a.replace('(', r'\(').replace(')', r'\)') for a in answers]

data['A1'] = data['A1'].apply(split_answers)                   # one list of answers per question

print(data['A1'])



0                                        [very low risk]
1      [stay at least 6 feet away from everyone and a...
2                                                   [No]
3      [Remdesivir, Dexamethasone, favipiravir, ribav...
4      [The existence of an S strain and an L strain ...
                             ...                        
406    [defer all cruise ship travel worldwide, repor...
407    [the remains must meet the standards for impor...
408    [A list of destinations with coronavirus disea...
409                           [we should all wear masks]
410    [pilots must report all illnesses and deaths t...
Name: A1, Length: 411, dtype: object
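
The parentheses are escaped because an unescaped pair is read as a regex group rather than as literal characters, which can shift the reported start index without raising any error. The sketch below illustrates this on an invented sentence. The helper `find_answer_start` is hypothetical, and it uses `re.escape`, which escapes every metacharacter rather than only parentheses, so it is an alternative to the cell above and not the method used there.

```python
import re

def find_answer_start(answer, context):
    """Return the index of the first literal occurrence of `answer` in `context`, or -1."""
    match = re.search(re.escape(answer), context)   # re.escape makes every character literal
    return match.start() if match else -1

# Invented example (not from the dataset)
sample_context = "Wash your hands (for at least 20 seconds) with soap."
sample_answer = "(for at least 20 seconds)"

unescaped_start = re.search(sample_answer, sample_context).start()
escaped_start = find_answer_start(sample_answer, sample_context)

print(unescaped_start)   # 17: the parentheses act as a group, so the match begins at "f"
print(escaped_start)     # 16: the match begins at the literal "("
```

The same helper applies to answers containing other metacharacters such as `?`, `+` or `.`.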


## Compute Answer_Start Scores

In [5]:
import re
answer = data['A1']
context = data['context']
starts = []                                  # one list of answer_start values per question
for i in range(len(data['A1'])):
    my_list = []                             # answer_start values for this question
    text = context[i]                        # context for this question
    for pattern in answer[i]:                # one pattern per answer
        # try the answer as a regex first, then fall back to a fully escaped literal
        match = re.search(pattern, text) or re.search(re.escape(pattern), text)
        if match:
            my_list.append(match.start())    # character index where the answer begins
        else:
            print(pattern)                   # print any unmatched patterns
            print('#####################################')
    starts.append(my_list)

data['answer_start'] = starts                # assign the whole column at once (avoids chained assignment)
        
    

In [6]:
data = data.loc[:, ~data.columns.str.contains('^Unnamed')]  #drop unnamed columns
data.head()

,category,question,context,A1,answer_start
0,About the Virus FAQs,Am I at risk for COVID-19 from a package or pr...,There is still a lot that is unknown about the...,[very low risk],[620]
1,About the Virus FAQs,Am I likely to get sicker if I'm exposed to mu...,Symptom severity can be influenced by many dif...,[stay at least 6 feet away from everyone and a...,[679]
2,About the Virus FAQs,Are antibiotics effective in preventing or tre...,No. According to the World Health Organization...,[No],[0]
3,About the Virus FAQs,Are there therapies available to treat COVID-19?,Scientists are currently testing different typ...,"[Remdesivir, Dexamethasone, favipiravir, ribav...","[167, 530, 811, 824, 838]"
4,About the Virus FAQs,Are there two strains of the COVID-19 virus?,The existence of an S strain and an L strain r...,[The existence of an S strain and an L strain ...,[0]


In [16]:
data.to_csv('Desktop/Interactions/formatted_questions.csv', index=False)  # omit the index so no 'Unnamed: 0' column appears on reload

## Check for any empty lists in "Answer_Start"

In [10]:
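# collect the row positions whose answer_start list is empty
empty_rows = [ind for ind, starts in enumerate(data['answer_start']) if not starts]
if len(empty_rows) == 0:
    print('No Empty Lists in Answer Start Column')
else:
    print('Rows with empty answer_start lists:', empty_rows)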
        

No Empty Lists in Answer Start Column
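
An empty-list check does not catch a question where only some of its answers were matched. Since the goal is one answer_start value per answer, a stricter check compares the two list lengths row by row. The sketch below is one possible way to do that. The function name `mismatched_rows` and the toy values are invented for illustration.

```python
def mismatched_rows(answers_col, starts_col):
    """Return positions where the number of answers differs from the number of start indices."""
    return [
        pos
        for pos, (answers, starts) in enumerate(zip(answers_col, starts_col))
        if len(answers) != len(starts)
    ]

# Toy stand-ins for data['A1'] and data['answer_start']; the second row is missing one start
toy_answers = [['very low risk'], ['Remdesivir', 'Dexamethasone']]
toy_starts = [[620], [167]]

print(mismatched_rows(toy_answers, toy_starts))   # [1]
```

On the real frame this would be called as `mismatched_rows(data['A1'], data['answer_start'])`, assuming both columns hold lists as produced above.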


## Check individual matches (for testing purposes)

In [14]:
import re

pattern = data['A1'][76][0]
text = data['context'][76]
match = re.search(pattern, text) #search for a pattern; only the first match is recorded
s = match.start() # start of match (index)

print(s)
print(pattern)
print(text)
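
Because `re.search` records only the first match, an answer that occurs more than once in the context always gets the earliest index. When testing a single row, it can help to list every occurrence. The sketch below uses `re.finditer` for that. The helper `all_starts` and the sample sentence are invented for illustration, and the answer is treated as a literal string via `re.escape`.

```python
import re

def all_starts(answer, context):
    """Return the start index of every literal, non-overlapping occurrence of `answer`."""
    return [m.start() for m in re.finditer(re.escape(answer), context)]

# Invented sample context (not from the dataset)
sample_text = "No. No antibiotic treats a virus."

print(all_starts("No", sample_text))      # [0, 4]
print(all_starts("virus", sample_text))   # [27]
```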