<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>

<i>Licensed under the MIT License.</i>

# Data load and prep for MSR-PC
This notebook downloads the installer for the [Microsoft Research Paraphrase Corpus](https://www.microsoft.com/en-us/download/details.aspx?id=52398) (MSR-PC) and walks through installing it. It also shows the utilities we provide to load, clean, and transform the data into a pandas DataFrame.

You can read more details about the dataset here: https://www.microsoft.com/en-us/download/details.aspx?id=52398

## 00 Global Settings

In [1]:
import sys
import os

sys.path.append("../../../")

import pathlib
import pandas as pd

from utils_nlp.dataset.msrpc import load_pandas_df
from utils_nlp.dataset.preprocess import to_spacy_tokens
from utils_nlp.dataset.url_utils import maybe_download, download_path

INSTALLER_PATH = '../../../data/'
print("System version: {}".format(sys.version))

System version: 3.6.8 |Anaconda, Inc.| (default, Feb 21 2019, 18:30:04) [MSC v.1916 64 bit (AMD64)]


## 01 Download dataset

In [2]:
def download_msrpc(download_dir):
    """Download the MSR Paraphrase Corpus installer into download_dir."""
    url = "https://download.microsoft.com/download/D/4/6/D46FF87A-F6B9-4252-AA8B" \
          "-3604ED519838/MSRParaphraseCorpus.msi"
    return maybe_download(url, work_directory=download_dir)

In [3]:
data_path = download_msrpc(INSTALLER_PATH)
print("Data downloaded to {}".format(data_path)) 

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.36M/1.36M [00:00<00:00, 2.21MB/s]


Data downloaded to ../../../data/MSRParaphraseCorpus.msi 
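For readers who want to see the idea behind a "download only if missing" helper, here is a minimal sketch. It is not the implementation of `maybe_download` in `utils_nlp`; the name `download_if_missing` and the injectable `fetch` parameter are assumptions made for illustration, and the sketch omits the progress bar.

```python
import os
from urllib.parse import urlparse
from urllib.request import urlretrieve


def download_if_missing(url, work_directory=".", fetch=urlretrieve):
    """Download url into work_directory unless the file already exists.

    Hypothetical stand-in for a maybe_download-style helper. `fetch` is any
    callable taking (url, destination_path); it defaults to urlretrieve.
    """
    url = url.strip()  # tolerate stray whitespace around the URL
    filename = os.path.basename(urlparse(url).path)
    os.makedirs(work_directory, exist_ok=True)
    filepath = os.path.join(work_directory, filename)
    if not os.path.exists(filepath):
        fetch(url, filepath)  # only hit the network when the file is absent
    return filepath
```

Calling the function a second time with the same URL and directory returns the cached path without downloading again, which keeps repeated notebook runs cheap.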


## 02 Install dataset

In [4]:
print("The Windows Installer for Microsoft Paraphrase Corpus has been downloaded at ", data_path)
data_directory = input("Please install and provide the installed directory. Thanks! \n")
data_directory = pathlib.Path(data_directory)
if os.path.exists(data_directory):
    print("Dataset successfully installed at ", data_directory)
else:
    print("Directory not found: ", data_directory)

The Windows Installer for Microsoft Paraphrase Corpus has been downloaded at  ../../../data/MSRParaphraseCorpus.msi 
Please install and provide the installed directory. Thanks! 
C:\MSRParaphraseCorpus
Dataset successfully installed at  C:\MSRParaphraseCorpus
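The check above only confirms that the directory exists. As a stricter option, the sketch below also verifies that the three text files loaded in the next section are present. The helper name `missing_corpus_files` is hypothetical, and the file names are copied from `DATASET_DICT` in section 03.

```python
from pathlib import Path

# File names taken from DATASET_DICT in the "Load data" section.
EXPECTED_FILES = (
    "msr_paraphrase_train.txt",
    "msr_paraphrase_test.txt",
    "msr_paraphrase_data.txt",
)


def missing_corpus_files(data_directory, expected=EXPECTED_FILES):
    """Return the expected corpus files that are absent from data_directory."""
    root = Path(data_directory)
    return [name for name in expected if not (root / name).is_file()]
```

An empty list means the install directory looks complete; otherwise the returned names indicate which files to look for, for example `missing_corpus_files(data_directory)`.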


## 03 Load data
In this step we load and preview the data.

The MSR Paraphrase Corpus ships already split into train and test sets. A third file containing all the sentences is also provided. We load the train set below.

In [5]:
DATASET_DICT = {
    "train": "msr_paraphrase_train.txt",
    "test": "msr_paraphrase_test.txt",
    "all": "msr_paraphrase_data.txt",
}

file_path = os.path.join(data_directory, DATASET_DICT['train'])
df = pd.read_csv(file_path, delimiter="\t", error_bad_lines=False)
df.head(5)

b'Skipping line 102: expected 5 fields, saw 6\nSkipping line 656: expected 5 fields, saw 6\nSkipping line 867: expected 5 fields, saw 6\nSkipping line 880: expected 5 fields, saw 6\nSkipping line 980: expected 5 fields, saw 6\nSkipping line 1439: expected 5 fields, saw 6\nSkipping line 1473: expected 5 fields, saw 6\nSkipping line 1822: expected 5 fields, saw 6\nSkipping line 1952: expected 5 fields, saw 6\nSkipping line 2009: expected 5 fields, saw 6\nSkipping line 2230: expected 5 fields, saw 6\nSkipping line 2506: expected 5 fields, saw 6\nSkipping line 2523: expected 5 fields, saw 6\nSkipping line 2809: expected 5 fields, saw 6\nSkipping line 2887: expected 5 fields, saw 6\nSkipping line 2920: expected 5 fields, saw 6\nSkipping line 2944: expected 5 fields, saw 6\nSkipping line 3241: expected 5 fields, saw 6\nSkipping line 3358: expected 5 fields, saw 6\nSkipping line 3459: expected 5 fields, saw 6\nSkipping line 3491: expected 5 fields, saw 6\nSkipping line 3643: expected 5 fields

| | Quality | #1 ID | #2 ID | #1 String | #2 String |
|---|---|---|---|---|---|
| 0 | 1 | 702876 | 702977 | Amrozi accused his brother, whom he called "th... | Referring to him as only "the witness", Amrozi... |
| 1 | 0 | 2108705 | 2108831 | Yucaipa owned Dominick's before selling the ch... | Yucaipa bought Dominick's in 1995 for $693 mil... |
| 2 | 1 | 1330381 | 1330521 | They had published an advertisement on the Int... | On June 10, the ship's owners had published an... |
| 3 | 0 | 3344667 | 3344648 | Around 0335 GMT, Tab shares were up 19 cents, ... | Tab shares jumped 20 cents, or 4.6%, to set a ... |
| 4 | 1 | 1236820 | 1236712 | The stock rose $2.11, or about 11 percent, to ... | PG&E Corp. shares jumped $1.63 or 8 percent to... |
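The `Skipping line ...` warnings mean that `error_bad_lines=False` silently dropped those rows. One plausible cause, which we have not verified against this file, is that some sentences contain literal double quotes that the parser interprets as CSV quoting. If that is the case, disabling quote handling with `quoting=csv.QUOTE_NONE` keeps such rows intact. The sketch below demonstrates the option on a synthetic two-row sample in the corpus layout; `read_msrpc_tsv` and `SAMPLE_TSV` are names made up for illustration. Note also that `error_bad_lines` was deprecated in pandas 1.3 and removed in pandas 2.0, where `on_bad_lines="skip"` is the replacement.

```python
import csv
import io

import pandas as pd

# Synthetic sample (not real corpus data): the first sentence opens with a
# double quote that is never closed.
SAMPLE_TSV = (
    "Quality\t#1 ID\t#2 ID\t#1 String\t#2 String\n"
    '1\t1\t2\t"He said hi.\tShe left.\n'
    "0\t3\t4\tPlain one.\tPlain two.\n"
)


def read_msrpc_tsv(source):
    """Read a tab-separated file, treating quote characters as plain text."""
    return pd.read_csv(source, delimiter="\t", quoting=csv.QUOTE_NONE)


sample_df = read_msrpc_tsv(io.StringIO(SAMPLE_TSV))
print(sample_df.shape)
```

To try this on the real data, pass `file_path` instead of the in-memory sample and compare the row count with the result of the cell above.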


## 04 Clean data
The output above shows that each of the two sentences comes with an associated ID. The `Quality` column is a binary similarity score for the sentence pair (1 for a paraphrase, 0 otherwise). The IDs are not needed for our use case, so we drop those columns and rename `Quality`, `#1 String`, and `#2 String` to `score`, `sentence1`, and `sentence2` for clarity.

In [6]:
df = (
    df.drop(columns=["#1 ID", "#2 ID"])
    .dropna()
    .rename(index=str, columns={"Quality": "score", "#1 String": "sentence1", "#2 String": "sentence2"})
)

## 05 Preview the cleaned dataset

In [7]:
df.head(5)

| | score | sentence1 | sentence2 |
|---|---|---|---|
| 0 | 1 | Amrozi accused his brother, whom he called "th... | Referring to him as only "the witness", Amrozi... |
| 1 | 0 | Yucaipa owned Dominick's before selling the ch... | Yucaipa bought Dominick's in 1995 for $693 mil... |
| 2 | 1 | They had published an advertisement on the Int... | On June 10, the ship's owners had published an... |
| 3 | 0 | Around 0335 GMT, Tab shares were up 19 cents, ... | Tab shares jumped 20 cents, or 4.6%, to set a ... |
| 4 | 1 | The stock rose $2.11, or about 11 percent, to ... | PG&E Corp. shares jumped $1.63 or 8 percent to... |
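Beyond eyeballing the first rows, a few summary numbers can confirm that the cleaning step behaved as intended. The sketch below is one possible check, assuming the cleaned columns are named `score`, `sentence1`, and `sentence2` as above; `summarize_pairs` is a hypothetical helper and the `toy` frame is made-up data, not corpus content.

```python
import pandas as pd


def summarize_pairs(df):
    """Return basic sanity statistics for a cleaned sentence-pair frame."""
    return {
        "rows": len(df),
        # Should be 0 after dropna().
        "null_cells": int(df.isna().sum().sum()),
        # Should contain only 0 and 1 for a binary score.
        "labels": sorted(df["score"].unique().tolist()),
        # Share of pairs labeled as paraphrases.
        "positive_fraction": float((df["score"] == 1).mean()),
    }


# Made-up rows for illustration only.
toy = pd.DataFrame(
    {
        "score": [1, 0, 1, 1],
        "sentence1": ["a b", "c d", "e f", "g h"],
        "sentence2": ["a b.", "x y", "e f!", "g h?"],
    }
)
summary = summarize_pairs(toy)
print(summary)
```

Running `summarize_pairs(df)` on the cleaned train set gives a quick view of the label balance before any modeling.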


## 06 One shot dataset loading

The `load_pandas_df` utility wraps the download, install prompt, load, and clean steps shown above in a single call.

In [8]:
df = load_pandas_df(local_cache_path=INSTALLER_PATH)
df.head(5)

The Windows Installer for Mircosoft Paraphrase Corpus has been downloaded at  C:\Projects\NLP-BP\NLP\data\MSRParaphraseCorpus.msi  

Please install and provide the installed directory. Thanks! 
C:\MSRParaphraseCorpus


b'Skipping line 102: expected 5 fields, saw 6\nSkipping line 656: expected 5 fields, saw 6\nSkipping line 867: expected 5 fields, saw 6\nSkipping line 880: expected 5 fields, saw 6\nSkipping line 980: expected 5 fields, saw 6\nSkipping line 1439: expected 5 fields, saw 6\nSkipping line 1473: expected 5 fields, saw 6\nSkipping line 1822: expected 5 fields, saw 6\nSkipping line 1952: expected 5 fields, saw 6\nSkipping line 2009: expected 5 fields, saw 6\nSkipping line 2230: expected 5 fields, saw 6\nSkipping line 2506: expected 5 fields, saw 6\nSkipping line 2523: expected 5 fields, saw 6\nSkipping line 2809: expected 5 fields, saw 6\nSkipping line 2887: expected 5 fields, saw 6\nSkipping line 2920: expected 5 fields, saw 6\nSkipping line 2944: expected 5 fields, saw 6\nSkipping line 3241: expected 5 fields, saw 6\nSkipping line 3358: expected 5 fields, saw 6\nSkipping line 3459: expected 5 fields, saw 6\nSkipping line 3491: expected 5 fields, saw 6\nSkipping line 3643: expected 5 fields

| | score | sentence1 | sentence2 |
|---|---|---|---|
| 0 | 1 | Amrozi accused his brother, whom he called "th... | Referring to him as only "the witness", Amrozi... |
| 1 | 0 | Yucaipa owned Dominick's before selling the ch... | Yucaipa bought Dominick's in 1995 for $693 mil... |
| 2 | 1 | They had published an advertisement on the Int... | On June 10, the ship's owners had published an... |
| 3 | 0 | Around 0335 GMT, Tab shares were up 19 cents, ... | Tab shares jumped 20 cents, or 4.6%, to set a ... |
| 4 | 1 | The stock rose $2.11, or about 11 percent, to ... | PG&E Corp. shares jumped $1.63 or 8 percent to... |
