<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>

<i>Licensed under the MIT License.</i>

# Data load and prep for MSR-PC
This notebook shows how to download the [Microsoft Research Paraphrase Corpus](https://www.microsoft.com/en-us/download/details.aspx?id=52398) (MSR-PC) and install it. We also provide utilities to load, clean, and transform the data into a pandas DataFrame.


This dataset contains 5800 pairs of sentences extracted from news sources on the web, along with human annotations indicating whether each pair is a paraphrase (that is, whether the two sentences are semantically equivalent). The dataset is 1.3 MB in total. More details are available on the download page: https://www.microsoft.com/en-us/download/details.aspx?id=52398

## 00 Global Settings

In [1]:
import sys
import os

sys.path.append("../../")

import pandas as pd

from utils_nlp.dataset.msrpc import load_pandas_df
from utils_nlp.dataset.preprocess import to_spacy_tokens
from utils_nlp.dataset.url_utils import maybe_download, download_path

INSTALLER_PATH = '../../data/'
print("System version: {}".format(sys.version))

System version: 3.6.8 |Anaconda, Inc.| (default, Feb 21 2019, 18:30:04) [MSC v.1916 64 bit (AMD64)]


## 01 Download dataset

In [None]:
url = "https://download.microsoft.com/download/D/4/6/D46FF87A-F6B9-4252-AA8B" \
          "-3604ED519838/MSRParaphraseCorpus.msi"
data_path = maybe_download(url, work_directory=INSTALLER_PATH)
print("Data downloaded to {}".format(data_path)) 

Data downloaded to ../../data/MSRParaphraseCorpus.msi
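
The installer only needs to be fetched once. The general "download if missing" pattern can be sketched with the standard library as follows. `download_if_missing` is a hypothetical helper written for illustration; it is not the `utils_nlp` implementation of `maybe_download`, whose internals are not shown in this notebook.

```python
import os
from urllib.request import urlretrieve


def download_if_missing(url, work_directory=".", filename=None):
    """Fetch `url` into `work_directory` unless the file is already there.

    Hypothetical helper for illustration; not the utils_nlp code.
    """
    os.makedirs(work_directory, exist_ok=True)
    # Default to the last path component of the URL as the file name.
    filename = filename or url.rstrip("/").split("/")[-1]
    file_path = os.path.join(work_directory, filename)
    # Skip the network call when a previous run already saved the file.
    if not os.path.exists(file_path):
        urlretrieve(url, file_path)
    return file_path
```

With the settings above, it would be called as `download_if_missing(url, work_directory=INSTALLER_PATH)`.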


## 02 Install dataset

In [None]:
print("The Windows Installer for Microsoft Paraphrase Corpus has been downloaded at ", data_path)
data_directory = input("Please install and provide the installed directory. Thanks! \n")
if os.path.exists(data_directory):
    print("Dataset successfully installed at ", data_directory)
else:
    print("Directory not found: ", data_directory)

The Windows Installer for Microsoft Paraphrase Corpus has been downloaded at  ../../data/MSRParaphraseCorpus.msi
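
Checking that the directory exists does not confirm that the corpus files are in it. A slightly stricter check, sketched below, looks for the three text files listed in the load step of this notebook. `missing_corpus_files` is a hypothetical helper added for illustration.

```python
import os

# File names used by the load step of this notebook.
EXPECTED_FILES = (
    "msr_paraphrase_train.txt",
    "msr_paraphrase_test.txt",
    "msr_paraphrase_data.txt",
)


def missing_corpus_files(directory, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in `directory`."""
    return [
        name
        for name in expected
        if not os.path.isfile(os.path.join(directory, name))
    ]
```

An empty list from `missing_corpus_files(data_directory)` suggests the path points at the right install location.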


## 03 Load data
In this step we load and preview the data.

The MSR Paraphrase Corpus ships already split into train and test sets. A third file containing all the sentences is also provided. We load the train set below.

In [None]:
DATASET_DICT = {
    "train": "msr_paraphrase_train.txt",
    "test": "msr_paraphrase_test.txt",
    "all": "msr_paraphrase_data.txt",
}

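file_path = os.path.join(data_directory, DATASET_DICT['train'])
# error_bad_lines=False skips malformed rows instead of raising an error.
# pandas 1.3+ deprecates it in favor of on_bad_lines="skip"; pandas 2.0 removes it.
df = pd.read_csv(file_path, delimiter="\t", error_bad_lines=False)
df.head(5)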
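
Because `error_bad_lines=False` drops malformed rows silently, it can help to know roughly how many lines look irregular. The sketch below counts data lines whose number of tab-separated fields differs from the header's, using only the standard library. `count_irregular_lines` is a hypothetical helper, and it treats quote characters as ordinary text, which is an assumption; it is a rough diagnostic, not a reproduction of how `pd.read_csv` parses the file.

```python
import csv


def count_irregular_lines(file_path, delimiter="\t", encoding="utf-8"):
    """Count data lines whose field count differs from the header's."""
    with open(file_path, newline="", encoding=encoding) as handle:
        # QUOTE_NONE: treat quote characters as plain text (assumption).
        reader = csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
        header = next(reader)
        # Blank lines come back as empty lists and are ignored.
        return sum(1 for row in reader if row and len(row) != len(header))
```

For example, `count_irregular_lines(file_path)` could be run before the `pd.read_csv` call to decide whether skipped rows are worth investigating.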

## 04 Clean data
From the cell above we can see that each of the two sentences comes with an associated ID. The `Quality` column holds a binary similarity score for the pair. The IDs are unimportant to our use case, so we drop those columns and any rows with missing values, then rename `Quality`, `#1 String`, and `#2 String` to `score`, `sentence1`, and `sentence2` for clarity.

In [None]:
df = (
    df.drop(columns=["#1 ID", "#2 ID"])
    .dropna()
    .rename(index=str, columns={"Quality": "score", "#1 String": "sentence1", "#2 String": "sentence2"})
)
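
To see what each step of the chained call does, here is a small sketch on a toy frame that reuses the corpus column names. The three rows are invented for illustration and are not taken from the corpus.

```python
import pandas as pd

# Toy rows (invented) using the corpus column names.
toy = pd.DataFrame(
    {
        "Quality": [1, 0, 1],
        "#1 ID": [101, 102, 103],
        "#2 ID": [201, 202, 203],
        "#1 String": ["A cat sat.", "It rained.", "He left."],
        "#2 String": ["A cat was sitting.", None, "He departed."],
    }
)

cleaned = (
    toy.drop(columns=["#1 ID", "#2 ID"])  # IDs are not needed
    .dropna()  # removes the row with the missing sentence
    .rename(
        index=str,  # row labels become strings
        columns={
            "Quality": "score",
            "#1 String": "sentence1",
            "#2 String": "sentence2",
        },
    )
)
print(cleaned)
```

Note that `index=str` converts the row labels to strings, so positional access with `iloc` is unaffected, but label-based access with `loc` needs `"0"` rather than `0`.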

## 05 Preview the cleaned dataset

In [None]:
df.head(5)

## 06 One shot dataset loading

You can use our `load_pandas_df` utility to download, install, load, and clean the MSR-PC dataset in a single call, which abstracts away the steps above.

In [None]:
df = load_pandas_df(local_cache_path=INSTALLER_PATH)
df.head(5)