## Getting Data

If you are in this notebook, you chose in the previous section to go ahead with **Collected-Original-Data**. Your intuition may have been that, since BERT large uncased is a case-insensitive model, annotating the data might not help make the model better.

This directory contains the collected original data, with normalization applied to numbers, dates, etc. It holds the pre-designed human-robot interaction questions and the user answers. The entire dataset is stored in CSV format.

------------------------------------------------------------------------

### Importing libraries

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
seed = 123  # fixed seed so the split and the sampling are reproducible

### Fetching and storing the data

In [None]:
url = "https://raw.githubusercontent.com/xliuhw/NLU-Evaluation-Data/master/Collected-Original-Data/paraphrases_and_intents_26k_normalised_all.csv"
dataframe = pd.read_csv(url, delimiter=";")

### Extracting the relevant columns

In [None]:
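# Build a combined label of the form "<scenario>_<intent>"
dataframe["merged"] = dataframe["scenario"] + "_" + dataframe["intent"]
# Keep only the utterance and its label; copy so renaming does not touch a view of `dataframe`
new_df = dataframe[["answer", "merged"]].copy()
new_df.columns = ["speech_text", "intent"]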

In [None]:
# Hold out 10% of the rows as the test set
train, test = train_test_split(new_df, test_size=0.10, random_state=seed)

### Extracting the exact number of samples mentioned in the paper

In [None]:
train = train.sample(9960, random_state=seed)
test = test.sample(1076, random_state=seed)
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)

### Exporting the data to CSV

In [None]:
train.to_csv("train.csv")
test.to_csv("test.csv")

The above cell produces two CSV files as output:

-   train.csv

-   test.csv
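Because `to_csv` is called above without `index=False`, pandas also writes the row index as an unnamed first column. The sketch below shows this on a toy DataFrame (the row content is made up for illustration) and one way to read such a file back, using an in-memory buffer instead of a file on disk.

```python
import io

import pandas as pd

# Toy data for illustration only; not taken from the real dataset.
toy = pd.DataFrame({"speech_text": ["wake me at ten"], "intent": ["alarm_set"]})

buffer = io.StringIO()
toy.to_csv(buffer)  # same call style as above: the index is written as well
header = buffer.getvalue().splitlines()[0]
print(header)  # the first field is empty because the index has no name

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)  # treat the first column as the index
print(list(restored.columns))
```

Reading with `index_col=0` keeps the unnamed index column from showing up as an extra data column.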

------------------------------------------------------------------------

Once the data is ready, we can focus on augmenting it.

As discussed in the paper, the authors follow a synonym replacement strategy with the formula n = α \* l, where n is the number of words to be replaced, α is a constant between 0 and 1, and l is the length of the sentence.

Now let’s understand how this synonym replacement works.

Suppose we have the sentence “is there an alarm for ten am”. Step 1 is to remove the stopwords from the sentence, which leaves “there alarm ten am”.

The length of the sentence is now 4, which is l.
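The stopword step can be sketched as follows. The `STOPWORDS` set and the `remove_stopwords` helper are hypothetical, chosen only to reproduce the example above; a real stopword list (for example the one shipped with NLTK) is larger and may remove more words.

```python
# Hypothetical stopword set, picked to match the worked example only.
STOPWORDS = {"is", "an", "for"}


def remove_stopwords(sentence):
    """Return the words of `sentence` that are not in STOPWORDS."""
    return [word for word in sentence.split() if word.lower() not in STOPWORDS]


words = remove_stopwords("is there an alarm for ten am")
length = len(words)  # this is l in the formula n = alpha * l
print(words, length)
```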

Let’s take α as 0.75. Performing the calculation, we get n = 0.75 \* 4, which is equal to 3. So we pick three random words and replace them with their synonyms.
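A minimal sketch of the replacement step, assuming n = 3 and the four words left after stopword removal. The `SYNONYMS` table and the `replace_synonyms` helper are hypothetical stand-ins; a real pipeline would look synonyms up in a thesaurus such as WordNet, and the paper may choose the words differently.

```python
import random

# Hypothetical synonym table for illustration only.
SYNONYMS = {
    "there": ["present"],
    "alarm": ["alert", "reminder"],
    "ten": ["10"],
    "am": ["morning"],
}


def replace_synonyms(words, n, rng):
    """Replace n randomly chosen words that have an entry in SYNONYMS."""
    candidates = [i for i, word in enumerate(words) if word in SYNONYMS]
    chosen = rng.sample(candidates, min(n, len(candidates)))
    result = list(words)
    for i in chosen:
        result[i] = rng.choice(SYNONYMS[words[i]])
    return result


rng = random.Random(123)  # seeded so the sketch is reproducible
original = ["there", "alarm", "ten", "am"]
augmented = replace_synonyms(original, n=3, rng=rng)
changed = sum(a != b for a, b in zip(original, augmented))
print(augmented, changed)
```

Sampling positions rather than words keeps the sentence order intact and guarantees that exactly n distinct positions are replaced when enough candidates exist.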

When calculating n, the result will often be a decimal value, but n can only be an integer. The authors never specified which rounding to use (ceil or floor). Intent classification inputs contain few words, so even a one-word difference in the augmented text caused by this ceil/floor ambiguity may lead to different results.
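To make the ambiguity concrete, the sketch below computes n with both roundings for a sentence of length 4. The helper name `words_to_replace` and the α values are illustrative assumptions, not taken from the paper.

```python
import math


def words_to_replace(alpha, length, rounding):
    """Compute n = alpha * length, rounded with the given function."""
    return rounding(alpha * length)


for alpha in (0.6, 0.75):
    n_floor = words_to_replace(alpha, 4, math.floor)
    n_ceil = words_to_replace(alpha, 4, math.ceil)
    print(alpha, n_floor, n_ceil)
```

With α = 0.6 the product is 2.4, so floor gives 2 and ceil gives 3; with α = 0.75 the product is exactly 3 and both roundings agree.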

For data preprocessing we have two notebooks, one for each scenario: taking the floor value for n and taking the ceil value for n.

-   [Notebook(DataPreProcess_floor(n))](/3_data_preprocessing_1.ipynb)

-   [Notebook(DataPreProcess_ceil(n))](/3_data_preprocessing_2.ipynb)