<a href="https://colab.research.google.com/github/pragmalingu/experiments/blob/master/00_Data/MSMARCO_Corpus.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MS MARCO (Microsoft Machine Reading Comprehension)

You can get the corpus from the [MSMARCO-Document-Ranking repository](https://github.com/microsoft/MSMARCO-Document-Ranking).
For detailed information about the format of the files, see the PragmaLingu [Data Sets](https://pragmalingu.de/docs/guides/data-comparison). You can learn about parsing in general by reading our [parsing guide](https://pragmalingu.de/docs/guides/how-to-parse). 

Since MS MARCO has over 3 million documents to index, it is not possible to download and parse the corpus in Google Colab without running into a timeout. To index MS MARCO in your own Elasticsearch instance, copy this code into a file, set the path to the downloaded MS MARCO file, and run it on your own machine.

## Set Up an Elasticsearch Instance in Google Colab

The cells below contain everything needed to connect to Elasticsearch. For a detailed explanation, see [this notebook](https://).

Download:

In [None]:
import os
from subprocess import Popen, PIPE, STDOUT
# download elasticsearch
!wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.1-linux-x86_64.tar.gz -q
!tar -xzf elasticsearch-7.9.1-linux-x86_64.tar.gz
!chown -R daemon:daemon elasticsearch-7.9.1

Start a local server:

In [None]:
# start server
es_server = Popen(['elasticsearch-7.9.1/bin/elasticsearch'], 
                  stdout=PIPE, stderr=STDOUT,
                  preexec_fn=lambda: os.setuid(1)  # as daemon
                 )
# client-side; pin the 7.x client so that it matches the 7.9.1 server
!pip install "elasticsearch>=7,<8" -q
from elasticsearch import Elasticsearch
from datetime import datetime
es = Elasticsearch(["localhost:9200/"])
# wait a bit for the server to start
import time
time.sleep(30)
es.ping()  # got True

[?25l[K     |█                               | 10kB 14.4MB/s eta 0:00:01[K     |██                              | 20kB 17.5MB/s eta 0:00:01[K     |███                             | 30kB 4.3MB/s eta 0:00:01[K     |████                            | 40kB 5.4MB/s eta 0:00:01[K     |█████                           | 51kB 4.3MB/s eta 0:00:01[K     |██████                          | 61kB 5.0MB/s eta 0:00:01[K     |███████▏                        | 71kB 5.1MB/s eta 0:00:01[K     |████████▏                       | 81kB 5.2MB/s eta 0:00:01[K     |█████████▏                      | 92kB 4.7MB/s eta 0:00:01[K     |██████████▏                     | 102kB 4.3MB/s eta 0:00:01[K     |███████████▏                    | 112kB 4.3MB/s eta 0:00:01[K     |████████████▏                   | 122kB 4.3MB/s eta 0:00:01[K     |█████████████▎                  | 133kB 4.3MB/s eta 0:00:01[K     |██████████████▎                 | 143kB 4.3MB/s eta 0:00:01[K     |███████████████▎         

True
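
As a possible alternative to the fixed 30-second sleep, the readiness check could be sketched as a polling loop. `wait_until_ready` is a hypothetical helper and not part of elasticsearch-py; it only assumes a callable, such as `es.ping`, that returns `True` once the server answers.

```python
import time


def wait_until_ready(ping, timeout=60.0, interval=1.0):
    """Call ping() repeatedly until it returns True or `timeout` seconds pass.

    Hypothetical helper: `ping` is any zero-argument callable, e.g. es.ping.
    Returns True if the service answered in time, otherwise False.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            if ping():
                return True
        except Exception:
            # treat connection errors as "not ready yet" and keep polling
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

With the client from the cell above, a call might look like `wait_until_ready(es.ping, timeout=120)`. This returns as soon as the server responds instead of always waiting the full 30 seconds.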

## Parsing

In [None]:
#!/usr/bin/python3

# set paths to the downloaded data as variables
PATH_TO_MARCO_TXT = './msmarco-docs.tsv' # enter path to where your MS MARCO file is saved
from collections import defaultdict
import re
import json
import csv
import sys

# raise the csv field size limit, since document bodies can exceed the default
csv.field_size_limit(sys.maxsize)

# read the TSV file; the columns are used as: ID, link, title, text
marco_txt_data = defaultdict(dict)
with open(PATH_TO_MARCO_TXT, encoding="utf-8", newline="") as tsv_file:
  read_tsv = csv.reader(tsv_file, delimiter="\t")
  for row in read_tsv:
    marco_txt_data[row[0]]['link'] = row[1]
    marco_txt_data[row[0]]['title'] = row[2]
    marco_txt_data[row[0]]['text'] = row[3]

# create index, see https://elasticsearch-py.readthedocs.io/en/master/api.html
marco_index = "pragmalingu-marco-corpus"
es.indices.create(marco_index)
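
The parsing cell above keeps every document in one dictionary, which may need a lot of memory for more than 3 million documents. A minimal sketch of a streaming variant follows, assuming the same four-column layout (ID, link, title, text). `iter_marco_docs` is a hypothetical helper name, and skipping rows with fewer than four fields is an assumption about how malformed lines should be handled.

```python
import csv
import sys


def iter_marco_docs(tsv_lines):
    """Yield (doc_id, fields) pairs from an iterable of tab-separated lines.

    Hypothetical helper: `tsv_lines` can be an open file object or any
    iterable of strings. Rows with fewer than four fields are skipped.
    """
    # document bodies can exceed the default csv field size limit
    csv.field_size_limit(sys.maxsize)
    for row in csv.reader(tsv_lines, delimiter="\t"):
        if len(row) < 4:
            continue  # assumption: malformed rows are ignored
        doc_id, link, title, text = row[:4]
        yield doc_id, {"link": link, "title": title, "text": text}
```

It could be used as `with open(PATH_TO_MARCO_TXT, encoding="utf-8", newline="") as f:` followed by `for doc_id, doc in iter_marco_docs(f): ...`, so that only one row is held in memory at a time.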

## Indexing

In [None]:
# create index, see https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.client.IndicesClient.create
marco_index = "marco-corpus"

es.indices.create(marco_index)

# index documents, see https://elasticsearch-py.readthedocs.io/en/master/#example-usage
for ID, doc_data in marco_txt_data.items():
  es.index(index=marco_index, id=ID, body=doc_data)

# print new index list
create_response = es.cat.indices()
print(create_response)
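
Indexing one document per request, as in the loop above, can be slow for a corpus of this size. A hedged sketch of batching the requests with the `bulk` helper from `elasticsearch.helpers` follows. `bulk_actions` and `index_in_bulk` are hypothetical names, and the `chunk_size` of 500 is an illustrative value rather than a tuned setting.

```python
def bulk_actions(docs, index_name):
    """Turn (doc_id, fields) pairs into action dicts for the bulk helper."""
    for doc_id, doc in docs:
        yield {"_index": index_name, "_id": doc_id, "_source": doc}


def index_in_bulk(es, docs, index_name, chunk_size=500):
    """Send documents to Elasticsearch in batches of `chunk_size`.

    Hypothetical wrapper around elasticsearch.helpers.bulk; returns its
    result, a tuple of (number of successful actions, list of errors).
    """
    # imported here so that bulk_actions works without the client installed
    from elasticsearch import helpers

    return helpers.bulk(es, bulk_actions(docs, index_name), chunk_size=chunk_size)
```

With the variables from the cells above, a call might look like `index_in_bulk(es, marco_txt_data.items(), marco_index)`.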