# Simple Doc2Vec

- Standard preprocessing
- One network per grounding document
- Predict the best matching span
- Use the RCQuestion set as is (the question is passed in with its history, with no processing other than the same preprocessing)

## 1. Preprocessing Grounding Documents
- uses the documents dataset and returns a GroundingDocument object for each document
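
The notebook does not spell out what the standard preprocessing consists of. As a rough illustration only, assuming it amounts to lowercasing plus simple tokenization, the sketch below shows the one property the notes rely on: the same function is applied to document spans and to questions. `preprocess` and `TOKEN_PATTERN` are hypothetical names, not functions from `src.preprocessing_documents`.

```python
import re
from typing import List

# Assumption: tokens are runs of lowercase letters and digits.
# The real pipeline may also remove stop words, stem, etc.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+")


def preprocess(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_PATTERN.findall(text.lower())


# The same function handles a grounding span and a question,
# so both end up in the same token space.
span_tokens = preprocess("You must renew your license in person.")
question_tokens = preprocess("How do I renew my license?")
```

Sharing one function keeps the question tokens comparable with the span tokens the models were trained on.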

In [1]:
from src.preprocessing_documents import load_documents_df, grounding_documents_for_dataframe
from src.train_and_predict import BatchTrainer

NUMBER_OF_GROUNDING_DOCUMENTS = 488

# load all grounding documents
df = load_documents_df("train")
assert df.shape[0] == NUMBER_OF_GROUNDING_DOCUMENTS

# pre-process grounding documents ready for training
grounding_documents = grounding_documents_for_dataframe(df)
assert len(grounding_documents) == NUMBER_OF_GROUNDING_DOCUMENTS

Reusing dataset doc2dial (./data_cache_src/doc2dial/document_domain/1.0.1/765cb4d9af421b599d910080fd61b4a43440c1232693876470ef3245daa5fa4c)


## 2. Train Doc2Vec
- uses the BatchTrainer and trains a network per grounding document

In [2]:
# train a model for each grounding document
trainer = BatchTrainer(grounding_documents)
assert len(trainer.trainedModels) == NUMBER_OF_GROUNDING_DOCUMENTS

## 3. Predictions
- uses the RC dataset to predict the grounding text for each question
- no special preprocessing of the question other than what has been done for the grounding document spans
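
How `BatchTrainer` selects a span is not shown in this notebook. A common approach with Doc2Vec-style models, assumed here purely for illustration, is to infer a vector for the question and return the span whose vector has the highest cosine similarity. The vectors and the names `cosine_similarity` and `best_matching_span` below are hypothetical toy stand-ins, not part of `src.train_and_predict`.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def best_matching_span(
    question_vector: Sequence[float],
    span_vectors: Dict[str, Sequence[float]],
) -> str:
    """Return the id of the span most similar to the question."""
    return max(
        span_vectors,
        key=lambda span_id: cosine_similarity(question_vector, span_vectors[span_id]),
    )


# Toy 3-dimensional vectors; real Doc2Vec vectors are much larger.
span_vectors = {
    "span_1": [0.9, 0.1, 0.0],
    "span_2": [0.1, 0.8, 0.3],
    "span_3": [0.0, 0.2, 0.9],
}
question_vector = [0.2, 0.7, 0.4]
best_span = best_matching_span(question_vector, span_vectors)
```

With one model per grounding document, the search only runs over the spans of the document the question belongs to.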


In [7]:
from src.preprocessing_rc import load_rc_dataset

train_predictions = trainer.predict_answers_for(load_rc_dataset("train"))

Reusing dataset doc2dial (./data_cache_src/doc2dial/doc2dial_rc/1.0.1/765cb4d9af421b599d910080fd61b4a43440c1232693876470ef3245daa5fa4c)


## 4. Evaluations
- uses the 'squad2' metric to calculate the scores for the predictions


In [8]:
train_score = train_predictions.squad2_score()
train_score

{'exact': 1.375360971073369,
 'f1': 15.220391689165405,
 'total': 20431,
 'HasAns_exact': 1.375360971073369,
 'HasAns_f1': 15.220391689165405,
 'HasAns_total': 20431,
 'best_exact': 1.375360971073369,
 'best_exact_thresh': 0.0,
 'best_f1': 15.220391689165405,
 'best_f1_thresh': 0.0}
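
The large gap between `exact` and `f1` above is what the metric yields when predicted spans overlap the reference only partially. The sketch below is a simplified re-implementation of the SQuAD-style exact match and token-level F1 for a single prediction and reference, written for illustration. It is not the code behind `squad2_score()`, it omits the handling of multiple references and unanswerable questions, and the example strings are made up.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Made-up example: the prediction shares 5 of 7 tokens with the reference.
prediction = "You must renew your license in person."
reference = "renew your license online or in person"
em = exact_match(prediction, reference)
f1 = token_f1(prediction, reference)
```

Here the exact match is 0 while the F1 is 5/7, so a model that often lands near the right span without matching it exactly scores low on `exact` and noticeably higher on `f1`. The reported scores are these per-example values averaged and expressed as percentages.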

## 5. Validation
Since the models are trained on the grounding documents only, we can simply re-run the predictions and evaluation on the validation dataset.

In [9]:
validation_predictions = trainer.predict_answers_for(load_rc_dataset("validation"))
validation_score = validation_predictions.squad2_score()
validation_score

Reusing dataset doc2dial (./data_cache_src/doc2dial/doc2dial_rc/1.0.1/765cb4d9af421b599d910080fd61b4a43440c1232693876470ef3245daa5fa4c)


{'exact': 1.283987915407855,
 'f1': 15.611065094429934,
 'total': 3972,
 'HasAns_exact': 1.283987915407855,
 'HasAns_f1': 15.611065094429934,
 'HasAns_total': 3972,
 'best_exact': 1.283987915407855,
 'best_exact_thresh': 0.0,
 'best_f1': 15.611065094429934,
 'best_f1_thresh': 0.0}