## Dataset

If you are in this notebook, you chose **CrossValidation-Data** in the previous section. Let’s look more closely at what this dataset directory contains.

The CrossValidation-Data directory contains these 5 folders:

-   autoGeneFromRealAnno/autoGene_2018_03_22-13_01_25_169/CrossValidation

-   out4ApiaiReal/Apiai_trainset_2018_03_22-13_01_25_169/CrossValidation

-   out4LuisReal/Luis_trainset_2018_03_22-13_01_25_169/CrossValidation

-   out4RasaReal/rasa_json_2018_03_22-13_01_25_169_80Train/CrossValidation

-   out4WatsonReal/Watson_2018_03_22-13_01_25_169_trainset/CrossValidation

The first directory (autoGeneFromRealAnno/) contains the train and test sets generated from the annotated csv file. The other four subdirectories (out4ApiaiReal, out4LuisReal, out4RasaReal and out4WatsonReal) in CrossValidation/ contain the same data converted into the input formats of the Dialogflow, LUIS, Rasa and Watson NLU services respectively. We will therefore use only the first directory for our use case.

Inside the directory “autoGeneFromRealAnno” the data is split into 10 folds. Any of the folds can be used for training, but for now we go ahead with the first fold. The first fold contains a train and a test directory, and each of these directories holds 64 csv files, one for each of the 64 intents.
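
The fold layout described above can be captured in a small path helper. This is only a sketch: `fold_dirs` and `KFOLD_PATTERN` are hypothetical names, the `repository/NLU-Evaluation-Data-master` prefix is where the download step in this notebook extracts the archive, and the assumption that folds 2 to 10 follow the same `KFold_<k>` naming as `KFold_1` should be checked against the extracted directory.

```python
import os

# Base path as used in this notebook after extracting the archive.
BASE_DIR = os.path.join(
    "repository", "NLU-Evaluation-Data-master", "CrossValidation",
    "autoGeneFromRealAnno", "autoGene_2018_03_22-13_01_25_169", "CrossValidation",
)
# Assumption: every fold directory is named like "KFold_1".
KFOLD_PATTERN = "KFold_{k}"


def fold_dirs(k, base_dir=BASE_DIR):
    """Return (train_dir, test_dir) for fold k, where k runs from 1 to 10."""
    if not 1 <= k <= 10:
        raise ValueError("k must be between 1 and 10")
    fold = os.path.join(base_dir, KFOLD_PATTERN.format(k=k))
    return os.path.join(fold, "trainset"), os.path.join(fold, "testset", "csv")


train_dir, test_dir = fold_dirs(1)
print(train_dir)
print(test_dir)
```

Switching to another fold then only requires changing the argument, for example `fold_dirs(2)`.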

Next we will collect the data, store it in pandas dataframes and then write it to csv files.

------------------------------------------------------------------------

### Importing libraries

In [None]:
import os
import pandas as pd
import requests
from zipfile import ZipFile

### Download the GitHub repository

In [None]:
repository_url = 'https://github.com/xliuhw/NLU-Evaluation-Data/archive/refs/heads/master.zip'

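# Download the repository as a zip archive and stop early on an HTTP error
response = requests.get(repository_url)
response.raise_for_status()
with open('repository.zip', 'wb') as file:
    file.write(response.content)

# Extract the archive into the 'repository' directory
with ZipFile('repository.zip', 'r') as zip_ref:
    zip_ref.extractall('repository')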

### Arranging the relevant data

The fold directory contains two subfolders (trainset and testset/csv). Each of them holds many csv files, one per intent, named after the intent they contain.

We will loop through all the csv files and merge them into a single dataframe per subfolder, one for the train data and one for the test data.

In [None]:
base_dir = 'repository/NLU-Evaluation-Data-master/CrossValidation/autoGeneFromRealAnno/autoGene_2018_03_22-13_01_25_169/CrossValidation/KFold_1'
data = []  # data[0] will hold the train split, data[1] the test split
for folder in ["trainset", "testset/csv"]:
  folder_path = os.path.join(base_dir, folder)
  # Sort the file names so the row order is reproducible across runs
  csv_files = sorted(file for file in os.listdir(folder_path) if file.endswith('.csv'))
  # The source files are semicolon-delimited
  frames = [pd.read_csv(os.path.join(folder_path, csv_file), delimiter=";") for csv_file in csv_files]
  data.append(pd.concat(frames, ignore_index=True))

### Extracting the relevant columns and saving the dataframes to csv files

In [None]:
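for i, merged_df in enumerate(data):
  # Build the label as "<scenario>_<intent>"
  merged_df["merged"] = merged_df["scenario"] + "_" + merged_df["intent"]
  # Keep only the utterance and its label, under clearer column names
  merged_df = merged_df[["answer_from_user", "merged"]].rename(
      columns={"answer_from_user": "speech_text", "merged": "intent"})
  # data[0] is the train split, data[1] is the test split
  merged_df.to_csv('train.csv' if i == 0 else 'test.csv')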

The above cell will produce two csv files as output.

-   train.csv

-   test.csv
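
As a quick sanity check on the generated files, the number of utterances per intent label can be counted with the standard library alone. This is a minimal sketch: `intent_counts` is a hypothetical helper, and it assumes the files carry the `intent` column written by the cell above.

```python
import csv
import os
from collections import Counter


def intent_counts(csv_path):
    """Count how many utterances each intent label has in a csv file."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return Counter(row["intent"] for row in csv.DictReader(handle))


# Only runs once train.csv has been produced by the cell above.
if os.path.exists("train.csv"):
    counts = intent_counts("train.csv")
    print(len(counts), "distinct intent labels")
    print(counts.most_common(5))
```

The labels counted here are the merged scenario and intent strings, so the exact count and the balance between labels are worth checking before training.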

------------------------------------------------------------------------

Once the data is ready, we can focus on augmenting it.

As discussed in the paper, the authors follow a synonym replacement strategy governed by the formula n = α \* l, where n is the number of words to be replaced, α is a constant whose value lies between 0 and 1, and l is the length of the sentence.

Now let’s understand how this synonym replacement works.

Suppose we have the sentence “is there an alarm for ten am”. Step 1 is to remove the stopwords from the sentence, which leaves “there alarm ten am”.

The length of the sentence is now 4, which is l.

Let’s take α as 0.75. The calculation then gives n = 0.75 \* 4, which is equal to 3, so we pick three random words and replace them with their synonyms.
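
The walkthrough above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: `STOPWORDS` and `SYNONYMS` are tiny hypothetical stand-ins (the stopword list and synonym source used in the paper are not given here), `synonym_replace` is a hypothetical helper, and returning the sentence without its stopwords is also an assumption.

```python
import random

# Hypothetical stand-ins chosen to reproduce the example sentence above.
STOPWORDS = {"is", "an", "for"}
SYNONYMS = {
    "alarm": ["alert", "reminder"],
    "ten": ["10"],
    "am": ["morning"],
}


def synonym_replace(sentence, n, rng=random):
    """Replace up to n non-stopword tokens with a randomly chosen synonym."""
    tokens = [t for t in sentence.split() if t not in STOPWORDS]
    # Only tokens that have at least one known synonym can be replaced.
    candidates = [i for i, t in enumerate(tokens) if t in SYNONYMS]
    for i in rng.sample(candidates, min(n, len(candidates))):
        tokens[i] = rng.choice(SYNONYMS[tokens[i]])
    return " ".join(tokens)


# n = 3 matches the worked example; a seeded generator keeps runs repeatable.
print(synonym_replace("is there an alarm for ten am", n=3, rng=random.Random(0)))
```

Passing a seeded `random.Random` instance makes the augmentation reproducible, which helps when comparing results across preprocessing variants.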

When calculating n, the result will often not be a whole number. Since n can only be an integer and the authors never specify whether to round it up (ceil) or down (floor), the choice is left open. Inputs to an intent classification task contain few words, so even a one-word difference in the augmented text caused by this ceil/floor ambiguity may lead to different results.
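
To make the ambiguity concrete, the sketch below computes n = α \* l with both rounding choices. `words_to_replace` is a hypothetical helper, and the small rounding step before floor/ceil is an added guard against floating-point noise, not something taken from the paper.

```python
import math


def words_to_replace(alpha, length, rounding="floor"):
    """Compute n = alpha * l, rounded down ("floor") or up ("ceil")."""
    if not 0 <= alpha <= 1:
        raise ValueError("alpha must lie between 0 and 1")
    # Guard against float noise, e.g. 0.7 * 10 evaluating slightly above 7.
    raw = round(alpha * length, 9)
    return math.floor(raw) if rounding == "floor" else math.ceil(raw)


# Compare both choices for a sentence of length 4.
for alpha in (0.25, 0.6, 0.75):
    n_floor = words_to_replace(alpha, 4, "floor")
    n_ceil = words_to_replace(alpha, 4, "ceil")
    print(f"alpha={alpha}: floor -> {n_floor}, ceil -> {n_ceil}")
```

Whenever α \* l is a whole number the two choices agree; otherwise they differ by exactly one word, which is a large share of a four-word sentence.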

For data preprocessing we have two notebooks, one for each scenario: taking the floor value of n and taking the ceil value of n.

-   [Notebook(DataPreProcess_floor(n))](/)

-   [Notebook(DataPreProcess_ceil(n))](/)