#### Indexing Dataset 2022

Download the 2022 dataset, split it into sentences, delete duplicates, extract the stance, and index the sentences.

Final outputs: **index22sent** (the sentence index) and **merged_retreived_sent22.csv** (written after loading the retrieved sentences from notebook 2).

In [3]:
import warnings
warnings.filterwarnings("ignore")

In [4]:
import pyterrier as pt
import pandas as pd
import os
import matplotlib.pyplot as plt
import string
import re

In [5]:
if not pt.started():
  pt.init()


PyTerrier 0.10.0 has loaded Terrier 5.8 (built by craigm on 2023-11-01 18:05) and terrier-helper 0.0.8

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


The corpus is a pre-processed version of the args.me corpus (version 2020-04-01) where each argument is split into sentences.
These sentences and the complete arguments will be indexed. 
Duplicate sentences are an expected part of this dataset. This is because in some of the debate portals from which the args.me corpus was derived, individual arguments are mapped to the same conclusion (which is essentially the discussion topic/title on the corresponding portal). Moreover, we opted to preserve the duplicates to reflect the common information retrieval scenario where filtering out such documents is part of the pipeline.

In [6]:
dataset = pt.get_dataset('irds:argsme/2020-04-01/processed/touche-2022-task-1')

In [7]:
dataset.info_url() #information source

'https://ir-datasets.com/argsme.html#argsme/2020-04-01/processed/touche-2022-task-1'

In [8]:
dataset.get_corpus_lang()  # corpus language: English

'en'

In [9]:
#iterator over the documents
corpus_iter = dataset.get_corpus_iter()

# Convert the iterator to a list 
corpus_list = list(corpus_iter)

# Print the first few elements of the list
print(corpus_list[:5])

argsme/2020-04-01/processed/touche-2022-task-1 documents:   0%|          | 0/365408 [00:00<?, ?it/s]

argsme/2020-04-01/processed/touche-2022-task-1 documents: 100%|██████████| 365408/365408 [04:29<00:00, 1355.84it/s]

[{'conclusion': 'the War in Iraq was Worth the Cost', 'premises': [ArgsMePremise(text='His removal provides stability and security not only for Iraq but for the Middle East as a region', stance=<ArgsMeStance.PRO: 1>, annotations=[])], 'premises_texts': 'His removal provides stability and security not only for Iraq but for the Middle East as a region', 'aspects': [], 'aspects_names': '', 'source_id': 'Sf9294c83', 'source_title': 'This House Believes that the War in Iraq was Worth the Cost | idebate.org', 'source_url': 'https://idebate.org/debatabase/international-middle-east-politics-terrorism-warpeace/house-believes-war-iraq-was-worth', 'source_previous_argument_id': None, 'source_next_argument_id': None, 'source_domain': <ArgsMeSourceDomain.idebate: 4>, 'source_text': "idebate.org Educational and informative news and resources on debate, advocacy and activism for youth. \xa0 \xa0 News Debatabase Events Community Media About Search form Search This House Believes that the War in Iraq w




In [10]:
#Convert the list of documents to a Pandas DataFrame
df = pd.DataFrame(corpus_list)


In [11]:
print(df.columns)

Index(['conclusion', 'premises', 'premises_texts', 'aspects', 'aspects_names',
       'source_id', 'source_title', 'source_url',
       'source_previous_argument_id', 'source_next_argument_id',
       'source_domain', 'source_text', 'source_text_conclusion_start',
       'source_text_conclusion_end', 'source_text_premise_start',
       'source_text_premise_end', 'topic', 'acquisition', 'date', 'author',
       'author_image_url', 'author_organization', 'author_role', 'mode',
       'sentences', 'docno'],
      dtype='object')


In [116]:
# df.to_csv('dataset2022.csv', index=False) #saving the entire dataset 2022 containing 365k documents to retrieve

In [None]:
# pd.set_option('display.max_colwidth', None)

##### **Queries**

In [8]:
topics = dataset.get_topics() #queries 
print(f'There are', len(topics),' topics')
display(topics.head(2))

There are multiple query fields available: ('title', 'description', 'narrative'). To use with pyterrier, provide variant or modify dataframe to add query column.
There are 50  topics


Unnamed: 0,qid,title,description,narrative
0,1,Should teachers get tenure?,A user has heard that some countries do give t...,Highly relevant arguments make a clear stateme...
1,2,Is vaping with e-cigarettes safe?,When considering to switch from smoking to vap...,Highly relevant arguments support or deny the ...
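
The loader message above suggests either requesting a variant or adding a `query` column by hand. A minimal sketch of the second option, assuming a plain lowercase-and-strip-punctuation normalisation; the helper `title_to_query` is hypothetical and may not match exactly what `get_topics('title')` does internally:

```python
import re

import pandas as pd


def title_to_query(title: str) -> str:
    """Lowercase a title and replace every non-alphanumeric run with one space."""
    return re.sub(r'[^a-z0-9]+', ' ', title.lower()).strip()


# Two titles copied from the topics table above
topics_demo = pd.DataFrame({
    'qid': ['1', '2'],
    'title': ['Should teachers get tenure?', 'Is vaping with e-cigarettes safe?'],
})
topics_demo['query'] = topics_demo['title'].map(title_to_query)
print(topics_demo[['qid', 'query']])
```

Stripping punctuation matters because characters such as `?` can be interpreted by the query parser rather than treated as text.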


##### **Query relevance judgment**

In [7]:
qrels = dataset.get_qrels() 
qrels.shape

There are multiple qrel fields available: ['relevance', 'quality', 'coherence']. Defaulting to "relevance", but to use a different one, supply variant


(6841, 6)

In [8]:
qrels.head()

Unnamed: 0,qid,docno,label,quality,coherence,iteration
0,1,"Sc065954f-Ae72bc9c6__PREMISE__41,Sc065954f-Ae7...",2,2,2,0
1,1,"S51530f3f-Ad9a140f__PREMISE__3,S1b03f390-A22af...",2,2,0,0
2,1,"S51530f3f-Ae32a4a1b__PREMISE__13,Sff0947ec-A46...",2,1,1,0
3,1,"S51530f3f-Ae32a4a1b__PREMISE__7,Sff0947ec-A46d...",2,2,0,0
4,1,"S80d1e58b-A5923d626__PREMISE__11,S37b8bc05-A7d...",0,0,1,0


Each qrels row stores a comma-separated pair of sentence IDs in the docno column. We split each pair into separate rows, one sentence ID per row.


In [12]:
qrels.docno.iloc[0].split(",")

['Sc065954f-Ae72bc9c6__PREMISE__41', 'Sc065954f-Ae72bc9c6__CONC__1']

In [13]:
qrels["docno"] = qrels["docno"].apply(lambda x: x.split(","))

In [14]:
qrels = qrels.explode("docno")

In [15]:
print("New shape:", qrels.shape[0])
qrels

New shape: 13682


Unnamed: 0,qid,docno,label,quality,coherence,iteration
0,1,Sc065954f-Ae72bc9c6__PREMISE__41,2,2,2,0
0,1,Sc065954f-Ae72bc9c6__CONC__1,2,2,2,0
1,1,S51530f3f-Ad9a140f__PREMISE__3,2,2,0,0
1,1,S1b03f390-A22aff8a0__PREMISE__57,2,2,0,0
2,1,S51530f3f-Ae32a4a1b__PREMISE__13,2,1,1,0
...,...,...,...,...,...,...
6838,50,S8baeda0e-A13ad333__CONC__1,1,2,0,0
6839,50,S4d1037d1-Ab00d54e7__PREMISE__1,0,0,0,0
6839,50,S4d1037d1-Ab00d54e7__CONC__1,0,0,0,0
6840,50,Sffdf2e2e-A20e9dd06__PREMISE__4,1,1,1,0
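
The split-and-explode step can be reproduced on a small frame. This is a sketch on two rows copied from the qrels preview, not the full table; `qrels_demo` and `exploded` are illustrative names. It also resets the index, which the cells above do not do, so that row labels are unique again:

```python
import pandas as pd

# Two rows copied from the qrels preview above
qrels_demo = pd.DataFrame({
    'qid': ['1', '1'],
    'docno': ['Sc065954f-Ae72bc9c6__PREMISE__41,Sc065954f-Ae72bc9c6__CONC__1',
              'S51530f3f-Ad9a140f__PREMISE__3,S1b03f390-A22aff8a0__PREMISE__57'],
    'label': [2, 2],
})

# Split each pair into a list, then give every sentence ID its own row
exploded = (qrels_demo
            .assign(docno=qrels_demo['docno'].str.split(','))
            .explode('docno')
            .reset_index(drop=True))
print(exploded)
```

Every other column is copied to both new rows, which is why the row count doubles from 6841 to 13682.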


#### **Indexing whole corpus - NO** 
Index the entire corpus at argument level and retrieve with the BM25 model, which returns 50,000 rows (1,000 per query for 50 queries). 
However, this run cannot be evaluated: the qrels judge sentences, so sentence-level IDs are needed.


In [12]:
os.getcwd()

'/Users/juliabuixuan/Desktop/TOUCHE'

In [13]:
pt_index_path = '/Users/juliabuixuan/Desktop/TOUCHE/indexing22'

if not os.path.exists(pt_index_path + "/data.properties"):
  # create the index, using the IterDictIndexer indexer 
  indexer = pt.index.IterDictIndexer(pt_index_path, meta = {'docno': 100}) 

  # we give the dataset get_corpus_iter() directly to the indexer
  # while specifying the fields to index and the metadata to record
  index_ref = indexer.index(dataset.get_corpus_iter(), 
                            fields=['conclusion', 'premises_texts', 'aspects_names', 'source_id', 'source_title', 'topic','sentences'])

else:
  # if you already have the index, use it.
  index_ref = pt.IndexRef.of(pt_index_path + "/data.properties")
index2 = pt.IndexFactory.of(index_ref) #load the index

[INFO] If you have a local copy of https://zenodo.org/record/6873574/files/args_processed_04_01.tar.gz, you can symlink it here to avoid downloading it again: /Users/juliabuixuan/.ir_datasets/downloads/43bfce957df69bf59b3d59744eb73ded
[INFO] [starting] https://zenodo.org/record/6873574/files/args_processed_04_01.tar.gz
[INFO] [finished] https://zenodo.org/record/6873574/files/args_processed_04_01.tar.gz: [03:54] [1.55GB] [6.58MB/s]
argsme/2020-04-01/processed/touche-2022-task-1 documents: 100%|██████████| 365408/365408 [06:48<00:00, 895.51it/s] 


14:53:41.770 [ForkJoinPool-1-worker-1] ERROR org.terrier.structures.indexing.Indexer - Could not finish MetaIndexBuilder: 
java.io.IOException: Key S2db48a61-A430a7cb1 is not unique: 272650,142020
For MetaIndex, to suppress, set metaindex.compressed.reverse.allow.duplicates=true
	at org.terrier.structures.collections.FSOrderedMapFile$MultiFSOMapWriter.mergeTwo(FSOrderedMapFile.java:1374)
	at org.terrier.structures.collections.FSOrderedMapFile$MultiFSOMapWriter.close(FSOrderedMapFile.java:1308)
	at org.terrier.structures.indexing.BaseMetaIndexBuilder.close(BaseMetaIndexBuilder.java:321)
	at org.terrier.structures.indexing.classical.BasicIndexer.indexDocuments(BasicIndexer.java:270)
	at org.terrier.structures.indexing.classical.BasicIndexer.createDirectIndex(BasicIndexer.java:388)
	at org.terrier.structures.indexing.Indexer.index(Indexer.java:377)
	at org.terrier.python.ParallelIndexer$3.apply(ParallelIndexer.java:131)
	at org.terrier.python.ParallelIndexer$3.apply(ParallelIndexer.java:1
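
The `Key ... is not unique` error above indicates that at least one `docno` occurs more than once in the corpus iterator. One way to avoid it, sketched here under the assumption that keeping only the first copy is acceptable, is to wrap the iterator in a generator that skips IDs it has already seen. `unique_by_docno` is a hypothetical helper, and `toy_corpus` stands in for `dataset.get_corpus_iter()`:

```python
from typing import Dict, Iterable, Iterator


def unique_by_docno(docs: Iterable[Dict]) -> Iterator[Dict]:
    """Yield each document once, skipping any docno already seen."""
    seen = set()
    for doc in docs:
        if doc['docno'] in seen:
            continue
        seen.add(doc['docno'])
        yield doc


# Toy stand-in for the corpus iterator; the repeated ID is the one named in the error
toy_corpus = [
    {'docno': 'S2db48a61-A430a7cb1', 'conclusion': 'first copy'},
    {'docno': 'Sf9294c83-Aa036c8f7', 'conclusion': 'another argument'},
    {'docno': 'S2db48a61-A430a7cb1', 'conclusion': 'second copy'},
]
deduped = list(unique_by_docno(toy_corpus))
print([d['docno'] for d in deduped])
```

In the notebook the wrapped iterator would be handed to the indexer in place of `dataset.get_corpus_iter()`; that call is not exercised here.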

The index is a Terrier index object; its methods expose, among other things, the collection statistics and the lexicon.

In [18]:
print(index2.getCollectionStatistics().toString()) 

Number of documents: 365408
Number of terms: 484997
Number of postings: 38738382
Number of fields: 7
Number of tokens: 130357220
Field names: [conclusion, premises_texts, aspects_names, source_id, source_title, topic, sentences]
Positions:   false



In [28]:
index2.getLexicon()['death'].toString()

'term416 Nt=31659 TF=170722 maxTF=2147483647 @{0 30937638 2} TFf=5068,75645,0,0,5167,5167,79675'

##### Retrieval

BatchRetrieve represents a retrieval transformation, in which queries are mapped to retrieved documents.
As BatchRetrieve is a retrieval transformation, it takes as input dataframes with columns [“qid”, “query”], and returns dataframes with columns [“qid”, “query”, “docno”, “score”, “rank”].

In [75]:
topics.columns

Index(['qid', 'title', 'description', 'narrative'], dtype='object')

In [39]:
dataset.get_topics('title').head() #ad hoc query

Unnamed: 0,qid,query
0,1,should teachers get tenure
1,2,is vaping with e cigarettes safe
2,3,should insider trading be allowed
3,4,should corporal punishment be used in schools
4,5,should social security be privatized


In [71]:
br = pt.BatchRetrieve(index_ref, wmodel='BM25')
#Retrieve docs for each query 
ret_q = br.transform(dataset.get_topics('title'))



In [35]:
ret_q

Unnamed: 0,qid,docid,docno,rank,score,query
0,1,140220,S51530f3f-Ae32a4a1b,0,33.086531,should teachers get tenure
1,1,100070,Sb0680508-Aa5189771,1,33.067429,should teachers get tenure
2,1,315753,Sc065954f-A24a16870,2,33.034253,should teachers get tenure
3,1,315755,Sc065954f-Ae72bc9c6,3,33.024700,should teachers get tenure
4,1,324218,Sff0947ec-A46d54897,4,32.913585,should teachers get tenure
...,...,...,...,...,...,...
49995,50,319737,S8c866652-A90722f69,995,12.102334,should everyone get a universal basic income
49996,50,30590,Se3bb0258-Abb0466ac,996,12.099727,should everyone get a universal basic income
49997,50,71782,Sb002605-Ab8bb6569,997,12.097363,should everyone get a universal basic income
49998,50,134837,S8fe5a288-Af877be51,998,12.095580,should everyone get a universal basic income


In [73]:
ret_q[ret_q.duplicated(subset=['docno', 'qid'], keep=False)] #boolean series denoting duplicate rows, marking all duplicates as true 

Unnamed: 0,qid,docid,docno,rank,score,query
110,1,8595,S5b50ff9c-Ac6506987,110,14.756716,should teachers get tenure
111,1,8602,S5b50ff9c-Ac6506987,111,14.756716,should teachers get tenure
459,1,179446,Sedf8916f-A55d4b034,459,11.296937,should teachers get tenure
460,1,179448,Sedf8916f-A55d4b034,460,11.296937,should teachers get tenure
487,1,63315,S52024653-A6b07bd3e,487,11.280282,should teachers get tenure
...,...,...,...,...,...,...
49041,50,128492,Sd6044911-Ac9563cfc,41,19.398640,should everyone get a universal basic income
49130,50,265626,S2e294e85-A8d3b1d7a,130,16.035621,should everyone get a universal basic income
49131,50,265628,S2e294e85-A8d3b1d7a,131,16.035621,should everyone get a universal basic income
49791,50,107635,Sde791ad8-A3072f8a4,791,12.554577,should everyone get a universal basic income
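
The repeated `docno` values within a query come from distinct `docid`s that share an ID. If such a run had to be cleaned before evaluation, one option is to keep the best-scoring row per (qid, docno) pair and recompute the ranks. A sketch on toy data; `run` and `deduped_run` are illustrative names:

```python
import pandas as pd

# Toy run: docno 'A' appears twice for query 1 with the same score
run = pd.DataFrame({
    'qid':   ['1', '1', '1', '2'],
    'docno': ['A', 'A', 'B', 'A'],
    'score': [14.7, 14.7, 11.3, 9.0],
})

deduped_run = (run
               .sort_values(['qid', 'score'], ascending=[True, False])
               .drop_duplicates(subset=['qid', 'docno'], keep='first')
               .copy())
# Recompute 0-based ranks within each query after dropping rows
deduped_run['rank'] = deduped_run.groupby('qid').cumcount()
print(deduped_run)
```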


In [34]:
ret_q[ret_q['qid'] == '50'] 

Unnamed: 0,qid,docid,docno,rank,score,query
49000,50,226772,S4d1037f0-Ae5978524,0,26.276591,should everyone get a universal basic income
49001,50,92049,S4d103793-Addbd3205,1,26.244844,should everyone get a universal basic income
49002,50,146792,Sb7051d6f-A5b500408,2,25.968263,should everyone get a universal basic income
49003,50,137122,S4d103774-A137bd529,3,25.938285,should everyone get a universal basic income
49004,50,137123,S4d103774-A1d44fd,4,25.863005,should everyone get a universal basic income
...,...,...,...,...,...,...
49995,50,319737,S8c866652-A90722f69,995,12.102334,should everyone get a universal basic income
49996,50,30590,Se3bb0258-Abb0466ac,996,12.099727,should everyone get a universal basic income
49997,50,71782,Sb002605-Ab8bb6569,997,12.097363,should everyone get a universal basic income
49998,50,134837,S8fe5a288-Af877be51,998,12.095580,should everyone get a universal basic income


In [None]:
# ret_q.groupby('qid').count()
#ok - every query retrieved 1000 docs

In [53]:
ret_q[['docno']].nunique()  # about 44k unique docs were retrieved out of 365k

docno    44318
dtype: int64

##### Experiment

In [71]:
from pyterrier.measures import *

In [46]:
pd.set_option('display.max_colwidth', None)

In [47]:
qrels  # the qrels judge sentence pairs, not entire arguments


Unnamed: 0,qid,docno,label,quality,coherence,iteration
0,1,"Sc065954f-Ae72bc9c6__PREMISE__41,Sc065954f-Ae72bc9c6__CONC__1",2,2,2,0
1,1,"S51530f3f-Ad9a140f__PREMISE__3,S1b03f390-A22aff8a0__PREMISE__57",2,2,0,0
2,1,"S51530f3f-Ae32a4a1b__PREMISE__13,Sff0947ec-A46d54897__CONC__1",2,1,1,0
3,1,"S51530f3f-Ae32a4a1b__PREMISE__7,Sff0947ec-A46d54897__CONC__1",2,2,0,0
4,1,"S80d1e58b-A5923d626__PREMISE__11,S37b8bc05-A7d9efcae__PREMISE__26",0,0,1,0
...,...,...,...,...,...,...
6836,50,"S542cf477-Aeae6d1ea__PREMISE__10,S8baeda0e-A13ad333__CONC__1",0,2,2,0
6837,50,"S57eef5d2-A9666abe7__PREMISE__14,S4d103774-A1d44fd__PREMISE__3",1,2,0,0
6838,50,"S5c0f5e60-Ae4d4f67f__PREMISE__8,S8baeda0e-A13ad333__CONC__1",1,2,0,0
6839,50,"S4d1037d1-Ab00d54e7__PREMISE__1,S4d1037d1-Ab00d54e7__CONC__1",0,0,0,0


In [91]:
# pt.Experiment(
#              [ret_q], 
#              adhoc,
#              qrels,
#              eval_metrics=[AP@1000, P@5, P@10])

Unnamed: 0,name,AP@1000,P@5,P@10
0,qid docid docno rank ...,0.0,0.0,0.0


In [None]:
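from pyterrier.measures import *

# Assumption: `adhoc` holds the title-only queries and `tfidf` is a TF-IDF
# retriever over the same index; neither is defined earlier in this notebook.
adhoc = dataset.get_topics('title')
tfidf = pt.BatchRetrieve(index_ref, wmodel='TF_IDF')

results = pt.Experiment(
    [tfidf],
    adhoc,
    qrels,
    eval_metrics=[AP@1000, P@5, P@10])
display(results)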

Getting the dataframe of documents retrieved

In [54]:
for i, doc in enumerate(dataset.get_corpus_iter()):  # quick test: print the first 5 documents
    print(doc)
    if i == 4:
        break

argsme/2020-04-01/processed/touche-2022-task-1 documents:   0%|          | 4/365408 [00:00<11:20, 536.79it/s]

{'conclusion': 'the War in Iraq was Worth the Cost', 'premises': [ArgsMePremise(text='His removal provides stability and security not only for Iraq but for the Middle East as a region', stance=<ArgsMeStance.PRO: 1>, annotations=[])], 'premises_texts': 'His removal provides stability and security not only for Iraq but for the Middle East as a region', 'aspects': [], 'aspects_names': '', 'source_id': 'Sf9294c83', 'source_title': 'This House Believes that the War in Iraq was Worth the Cost | idebate.org', 'source_url': 'https://idebate.org/debatabase/international-middle-east-politics-terrorism-warpeace/house-believes-war-iraq-was-worth', 'source_previous_argument_id': None, 'source_next_argument_id': None, 'source_domain': <ArgsMeSourceDomain.idebate: 4>, 'source_text': "idebate.org Educational and informative news and resources on debate, advocacy and activism for youth. \xa0 \xa0 News Debatabase Events Community Media About Search form Search This House Believes that the War in Iraq wa




In [55]:
doc.keys()

dict_keys(['conclusion', 'premises', 'premises_texts', 'aspects', 'aspects_names', 'source_id', 'source_title', 'source_url', 'source_previous_argument_id', 'source_next_argument_id', 'source_domain', 'source_text', 'source_text_conclusion_start', 'source_text_conclusion_end', 'source_text_premise_start', 'source_text_premise_end', 'topic', 'acquisition', 'date', 'author', 'author_image_url', 'author_organization', 'author_role', 'mode', 'sentences', 'docno'])

In [56]:
doc['conclusion']

'the War in Iraq was Worth the Cost'

In [57]:
doc['docno']

'Sf9294c83-Aa036c8f7'

In [58]:
doc['aspects_names']

''

In [59]:
doc['source_id']

'Sf9294c83'

In [60]:
doc['source_title']

'This House Believes that the War in Iraq was Worth the Cost | idebate.org'

In [61]:
doc['premises_texts']

'Even if the outcome is a stable democratic Iraq, the war was still a costly, illegal, ideologically-driven mistake'

In [62]:
doc['premises']

[ArgsMePremise(text='Even if the outcome is a stable democratic Iraq, the war was still a costly, illegal, ideologically-driven mistake', stance=<ArgsMeStance.CON: 2>, annotations=[])]

In [104]:
doc['sentences']  # list with two elements

[ArgsMeSentence(id='Sf9294c83-Aa036c8f7__PREMISE__1', text='Even if the outcome is a stable democratic Iraq, the war was still a costly, illegal, ideologically-driven mistake'),
 ArgsMeSentence(id='Sf9294c83-Aa036c8f7__CONC__1', text='the War in Iraq was Worth the Cost')]

In [64]:
premises_str = str(doc['premises'])  # named so it does not shadow the imported `string` module

In [65]:
premises_str

"[ArgsMePremise(text='Even if the outcome is a stable democratic Iraq, the war was still a costly, illegal, ideologically-driven mistake', stance=<ArgsMeStance.CON: 2>, annotations=[])]"

In [66]:
stance = re.search(r'(?<=stance=<ArgsMeStance\.)\w+(?=:)', premises_str).group(0)

print(stance)  # Output: CON


CON


Get the text of the retrieved documents

In [None]:
# Collect the text fields of every retrieved document in one pass over the corpus.
# A set gives constant-time membership tests; rows are gathered in a list because
# DataFrame.append was removed in pandas 2.0.
retrieved_docnos = set(ret_q['docno'])
stance_re = re.compile(r'(?<=stance=<ArgsMeStance\.)\w+(?=:)')

rows = []
for doc in dataset.get_corpus_iter():
    if doc['docno'] in retrieved_docnos:
        rows.append({
            'topic': doc['topic'],
            'docno': doc['docno'],
            'conclusion': doc['conclusion'],
            'premises_text': doc['premises_texts'],
            'stance': stance_re.search(str(doc['premises'])).group(0),
            'sentences': doc['sentences']})

docs_collection = pd.DataFrame(rows, columns=['topic', 'docno', 'conclusion', 'premises_text', 'stance', 'sentences'])

display(docs_collection.sample(20))
print(len(docs_collection))


In [None]:
# Save the dataframe to CSV
docs_collection.to_csv('/content/drive/MyDrive/Second year/TOUCHE/docs_collection22_s.csv',
                       index=False, encoding='utf-8-sig')


In [None]:
ret_q.head()

Unnamed: 0,qid,docid,docno,rank,score,query
0,1,140220,S51530f3f-Ae32a4a1b,0,33.086531,should teachers get tenure
1,1,100070,Sb0680508-Aa5189771,1,33.067429,should teachers get tenure
2,1,315753,Sc065954f-A24a16870,2,33.034253,should teachers get tenure
3,1,315755,Sc065954f-Ae72bc9c6,3,33.0247,should teachers get tenure
4,1,324218,Sff0947ec-A46d54897,4,32.913585,should teachers get tenure


In [None]:
# Assumption: the undefined `prova` was the docs_collection dataframe built above
merged_df = pd.merge(ret_q, docs_collection, on='docno', how='left')



In [None]:
pd.reset_option('display.max_colwidth')

In [None]:
merged_df.head()

Unnamed: 0,qid,docid,docno,rank,score,query,topic,conclusion,premises_text,stance,sentences
0,1,140220,S51530f3f-Ae32a4a1b,0,33.086531,should teachers get tenure,Should Tenures Be Taken Away,Should Tenures Be Taken Away,Prevent Arbitrary Firings:If teachers did not ...,CON,[ArgsMeSentence(id='S51530f3f-Ae32a4a1b__PREMI...
1,1,100070,Sb0680508-Aa5189771,1,33.067429,should teachers get tenure,Teacher Tenure,Teacher Tenure,Here are some facts against Teacher Tenure: Te...,CON,[ArgsMeSentence(id='Sb0680508-Aa5189771__PREMI...
2,1,315753,Sc065954f-A24a16870,2,33.034253,should teachers get tenure,There should not be a teacher tenure.,There should not be a teacher tenure.,I will include my first round arguments down b...,PRO,[ArgsMeSentence(id='Sc065954f-A24a16870__PREMI...
3,1,315755,Sc065954f-Ae72bc9c6,3,33.0247,should teachers get tenure,There should not be a teacher tenure.,There should not be a teacher tenure.,Thank you for accepting my debate (This round ...,PRO,[ArgsMeSentence(id='Sc065954f-Ae72bc9c6__PREMI...
4,1,324218,Sff0947ec-A46d54897,4,32.913585,should teachers get tenure,Colleges should abolish the ability for teache...,Colleges should abolish the ability for teache...,I thank Pro for challenging me to this debate....,CON,[ArgsMeSentence(id='Sff0947ec-A46d54897__PREMI...


In [None]:
merged_df['docno'].nunique()  

44318

In [None]:
merged_df.shape

(50196, 11)

In [None]:
ret_q.loc[1,'docno']

'Sb0680508-Aa5189771'

In [None]:
ret_q.iloc[1,2]

'Sb0680508-Aa5189771'

In [None]:
merged_df.groupby('qid').count() 

Unnamed: 0_level_0,docid,docno,rank,score,query,topic,conclusion,premises_text,stance,sentences
qid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
1,1020,1020,1020,1020,1020,1020,1020,1020,1020,1020
10,1002,1002,1002,1002,1002,1002,1002,1002,1002,1002
11,1002,1002,1002,1002,1002,1002,1002,1002,1002,1002
12,1000,1000,1000,1000,1000,1000,1000,1000,1000,1000
13,1000,1000,1000,1000,1000,1000,1000,1000,1000,1000
14,1002,1002,1002,1002,1002,1002,1002,1002,1002,1002
15,1000,1000,1000,1000,1000,1000,1000,1000,1000,1000
16,1000,1000,1000,1000,1000,1000,1000,1000,1000,1000
17,1006,1006,1006,1006,1006,1006,1006,1006,1006,1006
18,1002,1002,1002,1002,1002,1002,1002,1002,1002,1002
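
The merged frame has 50,196 rows and several queries show more than 1,000 rows. This is consistent with duplicated `docno` values on the right-hand side of the merge: each left row is repeated once per matching right row. A sketch of the effect and of one way to guard against it, using toy frames (`left`, `right`, `inflated`, and `clean` are illustrative names):

```python
import pandas as pd

left = pd.DataFrame({'qid': ['1', '1'], 'docno': ['A', 'B']})
# 'A' appears twice on the right, as a duplicated docno would in the corpus
right = pd.DataFrame({'docno': ['A', 'A', 'B'], 'stance': ['PRO', 'PRO', 'CON']})

inflated = pd.merge(left, right, on='docno', how='left')
print(len(left), len(inflated))

# Dropping duplicate keys first keeps the merge many-to-one;
# validate='many_to_one' raises MergeError if the right keys are not unique
clean = pd.merge(left, right.drop_duplicates(subset='docno'),
                 on='docno', how='left', validate='many_to_one')
print(len(clean))
```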


#### **Indexing sentences**

To index sentences, we first extract them from the corpus df so that each row holds one sentence; these rows are what we index and retrieve.
The qrels df must be split in the same way, with one sentence per row.
The retrieved sentence IDs can then be matched against the judgments, which makes evaluation possible.

In [4]:
# df = pd.read_csv("dataset2022.csv")
dataset = pt.get_dataset('irds:argsme/2020-04-01/processed/touche-2022-task-1')

In [5]:
# #iterator over the documents
corpus_iter = dataset.get_corpus_iter()

# Convert the iterator to a list 
corpus_list = list(corpus_iter)


argsme/2020-04-01/processed/touche-2022-task-1 documents:   0%|          | 0/365408 [00:00<?, ?it/s]

argsme/2020-04-01/processed/touche-2022-task-1 documents: 100%|██████████| 365408/365408 [04:16<00:00, 1423.31it/s]


In [6]:
#Convert the list of documents to a Pandas DataFrame
df = pd.DataFrame(corpus_list)

In [10]:
type(df.sentences.iloc[0])  # each cell of the sentences column is a list that we need to split

list

In [12]:
df.sentences.iloc[0][0][1]

'His removal provides stability and security not only for Iraq but for the Middle East as a region'

The sentences column holds a list of premise and conclusion sentences, each represented as an ArgsMeSentence object. We first split the list so that each sentence gets its own row, then extract the sentence ID and the sentence text into two separate columns.

In [13]:
type(df.sentences.iloc[0][0])

ir_datasets.formats.argsme.ArgsMeSentence

In [14]:
# Limit the displayed column width to 20 characters
pd.set_option('display.max_colwidth', 20)

In [13]:
print('The corpus contains', df.shape[0], 'documents and', df.shape[1], 'columns')

The corpus contains 365408 documents and 26 columns


In [14]:
df = df.dropna(axis=1, how='any')  # drop columns that contain any NA values
df = df.drop(['aspects',  # drop columns that are not needed for retrieval
       'source_domain', 'source_text',
       'source_text_conclusion_start', 'source_text_conclusion_end',
       'source_text_premise_start', 'source_text_premise_end', 'acquisition','mode'], axis=1)

In [15]:
print('After removing the columns with any NA values, the corpus contains', df.shape[0], 'documents and', df.shape[1], 'columns')

After removing the columns with any NA values, the corpus contains 365408 documents and 9 columns


In [16]:
df.head(2)

Unnamed: 0,conclusion,premises,premises_texts,aspects_names,source_id,source_title,topic,sentences,docno
0,the War in Iraq was Worth the Cost,[(His removal provides stability and security ...,His removal provides stability and security no...,,Sf9294c83,This House Believes that the War in Iraq was W...,the War in Iraq was Worth the Cost,"[(Sf9294c83-Af186e851__PREMISE__1, His removal...",Sf9294c83-Af186e851
1,Saddam Hussein is gone and Iraq is now functio...,[(It's important to be clear that this debate ...,It's important to be clear that this debate is...,Donald Rumsfeld Genocide Saddam Hussein Torture,Sf9294c83,This House Believes that the War in Iraq was W...,the War in Iraq was Worth the Cost,"[(Sf9294c83-A9a4e056e__PREMISE__1, It's import...",Sf9294c83-A9a4e056e


In [17]:
df[df['docno'] == 'Sd5c342bd-A6718cbc0']

Unnamed: 0,conclusion,premises,premises_texts,aspects_names,source_id,source_title,topic,sentences,docno
196152,we should return to the gold standard,"[(https://docs.google.com..., ArgsMeStance.PRO...",https://docs.google.com...,,Sd5c342bd,Debate: we should return to the gold standard ...,we should return to the gold standard,"[(Sd5c342bd-A6718cbc0__PREMISE__1, https://doc...",Sd5c342bd-A6718cbc0
196154,we should return to the gold standard,"[(https://docs.google.com..., ArgsMeStance.PRO...",https://docs.google.com...,,Sd5c342bd,Debate: we should return to the gold standard ...,we should return to the gold standard,"[(Sd5c342bd-A6718cbc0__PREMISE__1, https://doc...",Sd5c342bd-A6718cbc0


In [18]:
df.docno.duplicated().sum()

1216

In [19]:
import pandas

In [None]:
# df.sentences.duplicated().sum()

In [20]:
df = df.drop_duplicates(subset='docno', keep='first')  # drop duplicates, keep only the first occurrence


Create a new df with the sentences split into rows, so that the sentence id and text can then be extracted into two columns

In [21]:
sent = df.explode("sentences")

Create two new columns, sentno (the sentence ID) and sentext (the sentence text). We will index the sentence text and use the sentence ID as the docno for evaluation.

In [22]:
sent['sentno'] = sent.sentences.apply(lambda x: x[0])
sent['sentext'] = sent.sentences.apply(lambda x: x[1])

In [23]:
sent.head(2)

Unnamed: 0,conclusion,premises,premises_texts,aspects_names,source_id,source_title,topic,sentences,docno,sentno,sentext
0,the War in Iraq was Worth the Cost,[(His removal provides stability and security ...,His removal provides stability and security no...,,Sf9294c83,This House Believes that the War in Iraq was W...,the War in Iraq was Worth the Cost,"(Sf9294c83-Af186e851__PREMISE__1, His removal ...",Sf9294c83-Af186e851,Sf9294c83-Af186e851__PREMISE__1,His removal provides stability and security no...
0,the War in Iraq was Worth the Cost,[(His removal provides stability and security ...,His removal provides stability and security no...,,Sf9294c83,This House Believes that the War in Iraq was W...,the War in Iraq was Worth the Cost,"(Sf9294c83-Af186e851__CONC__1, the War in Iraq...",Sf9294c83-Af186e851,Sf9294c83-Af186e851__CONC__1,the War in Iraq was Worth the Cost


In [24]:
sent.shape

(6118957, 11)
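
The explode-then-extract steps can be tried on a single argument. The sketch below uses a hypothetical `Sentence` named tuple as a stand-in for `ArgsMeSentence`, assuming the same two fields (`id`, `text`) shown in the output earlier; field access by name is equivalent to the positional `x[0]` and `x[1]` used in the cells above:

```python
from typing import NamedTuple

import pandas as pd


class Sentence(NamedTuple):
    """Hypothetical stand-in for ir_datasets' ArgsMeSentence (fields: id, text)."""
    id: str
    text: str


# One argument with its two sentences, copied from the document inspected earlier
docs = pd.DataFrame({
    'docno': ['Sf9294c83-Aa036c8f7'],
    'sentences': [[
        Sentence('Sf9294c83-Aa036c8f7__PREMISE__1',
                 'Even if the outcome is a stable democratic Iraq, the war was '
                 'still a costly, illegal, ideologically-driven mistake'),
        Sentence('Sf9294c83-Aa036c8f7__CONC__1', 'the War in Iraq was Worth the Cost'),
    ]],
})

# One row per sentence, then pull the ID and the text out by field name
demo_sent = docs.explode('sentences').reset_index(drop=True)
demo_sent['sentno'] = demo_sent['sentences'].map(lambda s: s.id)
demo_sent['sentext'] = demo_sent['sentences'].map(lambda s: s.text)
print(demo_sent[['docno', 'sentno', 'sentext']])
```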

In [21]:
pd.set_option('display.max_colwidth', None)

Some sentext values are just a link:


In [25]:
sent[sent['sentno'] == "Sd5c342bd-A6718cbc0__PREMISE__1"]

Unnamed: 0,conclusion,premises,premises_texts,aspects_names,source_id,source_title,topic,sentences,docno,sentno,sentext
196152,we should return to the gold standard,"[(https://docs.google.com..., ArgsMeStance.PRO...",https://docs.google.com...,,Sd5c342bd,Debate: we should return to the gold standard ...,we should return to the gold standard,"(Sd5c342bd-A6718cbc0__PREMISE__1, https://docs...",Sd5c342bd-A6718cbc0,Sd5c342bd-A6718cbc0__PREMISE__1,https://docs.google.com...


Some sentext values are just a single word:

In [26]:
sent[sent['sentno'] == 'S444c08ff-A29c534a3__PREMISE__8']

Unnamed: 0,conclusion,premises,premises_texts,aspects_names,source_id,source_title,topic,sentences,docno,sentno,sentext
249,Judiciary are undermined,[(Should Guinea-Bissau become the new front of...,Should Guinea-Bissau become the new front of t...,,S444c08ff,This House believes Guinea-Bissau should not l...,Guinea-Bissau should not let itself be turned ...,"(S444c08ff-A29c534a3__PREMISE__8, ‘U.S.)",S444c08ff-A29c534a3,S444c08ff-A29c534a3__PREMISE__8,‘U.S.


Extract the stance.
The premises column holds a list of ArgsMePremise objects with 3 elements each (text, stance, annotations); we take the stance, at position 1, of the first premise.

In [27]:
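# Extract the stance of the first premise from the premises column
def extract_args_me_stance(premises):
    """Return a one-element list such as ['PRO'], taken from the first premise."""
    pattern = re.compile(r'ArgsMeStance\.([a-zA-Z]+)')
    # premises[0] is the first premise; its element at position 1 is the stance
    s = str(premises[0][1])
    return pattern.findall(s)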

In [28]:
sent['stance'] = sent['premises'].apply(extract_args_me_stance)

In [29]:
sent.stance.value_counts()

stance
[PRO]    3099937
[CON]    3019020
Name: count, dtype: int64

In [30]:
def remove_punctuation(input_string):
    return ''.join(char for char in input_string if char not in string.punctuation)

In [31]:
# 'stance' holds one-element lists such as ['PRO']; joining their elements yields the plain string 'PRO'
sent['stance'] = sent['stance'].apply(remove_punctuation)

In [32]:
sent.head(2)

Unnamed: 0,conclusion,premises,premises_texts,aspects_names,source_id,source_title,topic,sentences,docno,sentno,sentext,stance
0,the War in Iraq was Worth the Cost,[(His removal provides stability and security ...,His removal provides stability and security no...,,Sf9294c83,This House Believes that the War in Iraq was W...,the War in Iraq was Worth the Cost,"(Sf9294c83-Af186e851__PREMISE__1, His removal ...",Sf9294c83-Af186e851,Sf9294c83-Af186e851__PREMISE__1,His removal provides stability and security no...,PRO
0,the War in Iraq was Worth the Cost,[(His removal provides stability and security ...,His removal provides stability and security no...,,Sf9294c83,This House Believes that the War in Iraq was W...,the War in Iraq was Worth the Cost,"(Sf9294c83-Af186e851__CONC__1, the War in Iraq...",Sf9294c83-Af186e851,Sf9294c83-Af186e851__CONC__1,the War in Iraq was Worth the Cost,PRO
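
The stance is printed as `<ArgsMeStance.PRO: 1>`, which is how Python enum members are displayed, so the label can probably be read from the member's `.name` without a regular expression or a punctuation pass. The sketch below assumes that and uses hypothetical stand-in classes (`Stance`, `Premise`) rather than the real `ir_datasets` types:

```python
from enum import Enum
from typing import List, NamedTuple


class Stance(Enum):
    """Hypothetical stand-in for ir_datasets' ArgsMeStance."""
    PRO = 1
    CON = 2


class Premise(NamedTuple):
    """Hypothetical stand-in for ArgsMePremise (text, stance, annotations)."""
    text: str
    stance: Stance
    annotations: list


def first_stance(premises: List[Premise]) -> str:
    """Return the stance name of the first premise, e.g. 'PRO' or 'CON'."""
    return premises[0].stance.name


premises = [Premise('Even if the outcome is a stable democratic Iraq, the war was '
                    'still a costly, illegal, ideologically-driven mistake',
                    Stance.CON, [])]
print(first_stance(premises))
```

Like the function above, this reads only the first premise, so an argument with several premises of different stances would still get a single label.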


Rename columns for indexing purposes

In [33]:
new_columns = {'docno': 'documentno', 'sentno': 'docno', 'sentext':'text'} 
sent.rename(columns=new_columns, inplace=True)

In [34]:
sent.head(1)

Unnamed: 0,conclusion,premises,premises_texts,aspects_names,source_id,source_title,topic,sentences,documentno,docno,text,stance
0,the War in Iraq was Worth the Cost,[(His removal provides stability and security ...,His removal provides stability and security no...,,Sf9294c83,This House Believes that the War in Iraq was W...,the War in Iraq was Worth the Cost,"(Sf9294c83-Af186e851__PREMISE__1, His removal ...",Sf9294c83-Af186e851,Sf9294c83-Af186e851__PREMISE__1,His removal provides stability and security no...,PRO


Eliminate duplicates with respect to 'text' column:

In [35]:
sent.docno.duplicated().sum()

0

In [36]:
type(sent.text.iloc[0])

str

In [37]:
print("Shape of sentences:", sent.shape)
print("Duplicated sentences:", sent.text.duplicated().sum())

Shape of sentences: (6118957, 12)
Duplicated sentences: 781548


In [38]:
sent.drop_duplicates(subset=['text'], inplace=True)

In [39]:
print("Shape of sentences:", sent.shape)

Shape of sentences: (5337409, 12)


##### Next, **index** the dataframe containing the sentences with DFIndexer

In [43]:
pt_index_path_sent = '/Users/juliabuixuan/Desktop/TOUCHE/index22sent'  # about 25 min; built on the sentences with duplicates removed

indexer = pt.DFIndexer(pt_index_path_sent, overwrite=True)

index_ref = indexer.index(sent["text"], sent["docno"])
print(index_ref.toString())


11:38:19.906 [main] WARN org.terrier.structures.indexing.Indexer - Indexed 15971 empty documents
/Users/juliabuixuan/Desktop/TOUCHE/index22sent/data.properties


In [None]:
#   # if you already have the index, use it.
# index_ref = pt.IndexRef.of(pt_index_path_sent + "/data.properties")
# index = pt.IndexFactory.of(index_ref)

In [45]:
index = pt.IndexFactory.of(index_ref)

In [46]:
print(index.getCollectionStatistics())

Number of documents: 5337409
Number of terms: 327395
Number of postings: 45957251
Number of fields: 0
Number of tokens: 49800997
Field names: []
Positions:   false



#### Merge the retrieved df with the sent df to attach each sentence's text to its retrieved "docno".

In [40]:
retrieved = pd.read_csv("retrieved_sent22.csv")

In [41]:
retrieved

Unnamed: 0,qid,docid,docno,rank,score,query
0,1,5530830,Sc065954f-A6deb09b6__CONC__1,0,31.819347,should teachers get tenure
1,1,5530887,Sc065954f-A24a16870__CONC__1,1,31.819347,should teachers get tenure
2,1,5530930,Sc065954f-A39b0539e__CONC__1,2,31.819347,should teachers get tenure
3,1,5530975,Sc065954f-Ae72bc9c6__CONC__1,3,31.819347,should teachers get tenure
4,1,5530977,Sc065954f-Ac3a1cfc1__CONC__1,4,31.819347,should teachers get tenure
...,...,...,...,...,...,...
49995,50,3318286,S338b77ee-A992365c8__PREMISE__48,995,11.940570,should everyone get a universal basic income
49996,50,3994914,S3b365097-Ad26213e1__PREMISE__12,996,11.940570,should everyone get a universal basic income
49997,50,3994963,S3b365097-A52f34d04__PREMISE__10,997,11.940570,should everyone get a universal basic income
49998,50,4304731,Sf87405ad-Ae94dd302__PREMISE__3,998,11.940570,should everyone get a universal basic income


In [42]:
sent2 = sent[['docno','text','stance']]

In [43]:
sent2

Unnamed: 0,docno,text,stance
0,Sf9294c83-Af186e851__PREMISE__1,His removal provides stability and security no...,PRO
0,Sf9294c83-Af186e851__CONC__1,the War in Iraq was Worth the Cost,PRO
1,Sf9294c83-A9a4e056e__PREMISE__1,It's important to be clear that this debate is...,PRO
1,Sf9294c83-A9a4e056e__PREMISE__2,Whatever one thinks of the initial justificati...,PRO
1,Sf9294c83-A9a4e056e__PREMISE__3,It is easy to criticize the allies but it is w...,PRO
...,...,...,...
365406,S148bb110-A63b9848c__PREMISE__7,This can depend on local features such as shel...,PRO
365406,S148bb110-A63b9848c__PREMISE__8,This can mean that a lot of places just aren't...,PRO
365406,S148bb110-A63b9848c__PREMISE__9,When planning the location major consideration...,PRO
365407,S148bb110-A119d66b0__PREMISE__1,"Barages are fairly massive objects, like Dams,...",PRO


In [44]:
merged = pd.merge(retrieved, sent2, on="docno", how = 'left')

In [45]:
merged

Unnamed: 0,qid,docid,docno,rank,score,query,text,stance
0,1,5530830,Sc065954f-A6deb09b6__CONC__1,0,31.819347,should teachers get tenure,There should not be a teacher tenure.,CON
1,1,5530887,Sc065954f-A24a16870__CONC__1,1,31.819347,should teachers get tenure,,
2,1,5530930,Sc065954f-A39b0539e__CONC__1,2,31.819347,should teachers get tenure,,
3,1,5530975,Sc065954f-Ae72bc9c6__CONC__1,3,31.819347,should teachers get tenure,,
4,1,5530977,Sc065954f-Ac3a1cfc1__CONC__1,4,31.819347,should teachers get tenure,,
...,...,...,...,...,...,...,...,...
49995,50,3318286,S338b77ee-A992365c8__PREMISE__48,995,11.940570,should everyone get a universal basic income,,
49996,50,3994914,S3b365097-Ad26213e1__PREMISE__12,996,11.940570,should everyone get a universal basic income,is outside the basics.,CON
49997,50,3994963,S3b365097-A52f34d04__PREMISE__10,997,11.940570,should everyone get a universal basic income,x+y=z is basic.,PRO
49998,50,4304731,Sf87405ad-Ae94dd302__PREMISE__3,998,11.940570,should everyone get a universal basic income,"Also, we have basically the same username.",CON
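
Many rows above come back with empty `text` and `stance`. A likely explanation is that `retrieved_sent22.csv` was produced from an index built before duplicates were dropped: the `docid` values (e.g. 5530830) exceed the 5337409 documents of the deduplicated index, so some retrieved `docno` values are no longer in `sent`. This is an inference from the numbers, not something the notebook states. A minimal sketch on toy data of how `indicator=True` makes such unmatched rows countable:

```python
import pandas as pd

# Toy stand-ins for `retrieved` and `sent2`; values are illustrative only.
retrieved_toy = pd.DataFrame({
    "qid": [1, 1, 1],
    "docno": ["A__CONC__1", "B__CONC__1", "C__PREMISE__2"],
    "rank": [0, 1, 2],
})
sent_toy = pd.DataFrame({
    "docno": ["A__CONC__1", "C__PREMISE__2"],
    "text": ["There should be no tenure.", "Tenure protects freedom."],
    "stance": ["CON", "PRO"],
})

# indicator=True adds a `_merge` column: "both" or "left_only" for a left join
checked = pd.merge(retrieved_toy, sent_toy, on="docno", how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]

print(len(unmatched), "of", len(checked), "retrieved rows have no sentence text")
```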


In [92]:
merged.to_csv("merged_retreived_sent22.csv", index = False)

No duplicates: repeat the merge with the results retrieved after duplicated sentences were removed.

In [47]:
retrieved = pd.read_csv("unique_retrieved_sent22.csv")

In [47]:
sent2 = sent[['docno','text','stance']]

In [50]:
merged = pd.merge(retrieved, sent2, on="docno", how = 'left')

In [51]:
merged

Unnamed: 0,qid,docid,docno,rank,score,query,text,stance
0,1,4852096,Sc065954f-A6deb09b6__CONC__1,0,31.873663,should teachers get tenure,There should not be a teacher tenure.,CON
1,1,2130092,S51530f3f-Ab10cafd7__PREMISE__1,1,29.939285,should teachers get tenure,This is a debate of tenures for teachers.,PRO
2,1,1476828,Sb0680508-Aa5189771__PREMISE__1,2,29.274197,should teachers get tenure,Here are some facts against Teacher Tenure: Te...,CON
3,1,2074507,Sbfe05689-Ac1b8b63e__PREMISE__51,3,29.152149,should teachers get tenure,"If only competent teachers are given tenure, t...",PRO
4,1,4852159,Sc065954f-A39b0539e__PREMISE__14,4,28.332152,should teachers get tenure,Teacher tenure protects the academic freedom o...,CON
...,...,...,...,...,...,...,...,...
49995,50,712040,Sd8d74905-Ab7eb61ad__PREMISE__35,995,11.372135,should everyone get a universal basic income,"Under Reagan, the African-American community h...",PRO
49996,50,753520,Sd57a7309-Ac186179a__PREMISE__18,996,11.372135,should everyone get a universal basic income,It is a ratio between the threshold of a famil...,PRO
49997,50,768460,Scba94472-Ac52ad00b__PREMISE__7,997,11.372135,should everyone get a universal basic income,You're much more likely to have a 6 figure inc...,PRO
49998,50,844101,Sb50ede6b-Aedfb3f89__PREMISE__51,998,11.372135,should everyone get a universal basic income,The income acquired by low-income households i...,CON


These are the sentences retrieved with BM25 after duplicated sentences were removed.

In [52]:
merged.to_csv("unique_merged_retreived_sent22.csv", index = False) 

### No duplicates, merged with the Dirichlet-retrieved sentences:

In [46]:
retrieved = pd.read_csv("DMunique_retrieved_sent22.csv")
retrieved.head()

Unnamed: 0,qid,docid,docno,rank,score,query
0,1,4985300,Sff0947ec-A46d54897__PREMISE__34,0,9.095028,should teachers get tenure
1,1,4852132,Sc065954f-A24a16870__PREMISE__38,1,9.076845,should teachers get tenure
2,1,4852137,Sc065954f-A24a16870__PREMISE__43,2,9.06441,should teachers get tenure
3,1,4852192,Sc065954f-Ae72bc9c6__PREMISE__43,3,9.06441,should teachers get tenure
4,1,4852109,Sc065954f-A24a16870__PREMISE__15,4,9.00282,should teachers get tenure


In [None]:
sent2 = sent[['docno','text','stance']]

In [48]:
merged = pd.merge(retrieved, sent2, on="docno", how = 'left')

In [49]:
merged.head()

Unnamed: 0,qid,docid,docno,rank,score,query,text,stance
0,1,4985300,Sff0947ec-A46d54897__PREMISE__34,0,9.095028,should teachers get tenure,"[3]"" [C3]: Research supports tenure Not only h...",CON
1,1,4852132,Sc065954f-A24a16870__PREMISE__38,1,9.076845,should teachers get tenure,This quote further proves why tenure is pretty...,PRO
2,1,4852137,Sc065954f-A24a16870__PREMISE__43,2,9.06441,should teachers get tenure,"Stephey, ""A Brief History of Tenure,"" www.time...",PRO
3,1,4852192,Sc065954f-Ae72bc9c6__PREMISE__43,3,9.06441,should teachers get tenure,"Stephey, ""A Brief History of Tenure,"" www.time...",PRO
4,1,4852109,Sc065954f-A24a16870__PREMISE__15,4,9.00282,should teachers get tenure,(http://teachertenure.procon.org......) This m...,PRO


In [None]:
merged.to_csv("DMunique_merged_retreived_sent22.csv", index = False) 
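
The same merge is repeated for each retrieval run, so it could be wrapped in a small helper. The sketch below is one possible way to do that; `merge_with_sentences` is a hypothetical name, and the toy frames stand in for `sent` and for the CSV files read above.

```python
import pandas as pd

def merge_with_sentences(retrieved: pd.DataFrame, sentences: pd.DataFrame) -> pd.DataFrame:
    """Left-join retrieval results with sentence text and stance on `docno`."""
    cols = sentences[["docno", "text", "stance"]]
    return retrieved.merge(cols, on="docno", how="left")

# Toy stand-in for `sent`; the extra column is dropped by the helper.
sentences_toy = pd.DataFrame({
    "docno": ["s1", "s2", "s3"],
    "text": ["First sentence.", "Second sentence.", "Third sentence."],
    "stance": ["PRO", "CON", "PRO"],
    "conclusion": ["c1", "c1", "c2"],
})

# Toy stand-ins for the retrieval runs loaded with pd.read_csv
runs = {
    "bm25": pd.DataFrame({"qid": [1, 1], "docno": ["s1", "s3"], "score": [2.0, 1.5]}),
    "dirichlet": pd.DataFrame({"qid": [1, 1], "docno": ["s2", "s9"], "score": [9.1, 9.0]}),
}

merged_runs = {name: merge_with_sentences(df, sentences_toy) for name, df in runs.items()}
print(merged_runs["dirichlet"])
```

In the notebook, each merged frame would then be written out with `to_csv(..., index=False)` as in the cells above.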