AIM: Remove duplicate rows and split the data into train, valid, and test files.

In [None]:
# import libraries
from sklearn.model_selection import train_test_split

from google.colab import drive, files
import pandas as pd
import os

In [None]:
# mount Google Drive and switch to the dataset directory
drive.mount('/content/gdrive', force_remount=True)
dataset_dir = '/content/gdrive/My Drive/Colab Helper/ICData/Dataset'  # absolute path, so the cell can be re-run
print(sorted(os.listdir(dataset_dir)))
os.chdir(dataset_dir)

Mounted at /content/gdrive
['data_file.csv']


Observations: 

1.   Number of features: 2 (`text` and `intent`)
2.   Number of total observations: 14484
3.   Number of duplicate observations: 268
4.   Number of observations left after removing duplicates: 14216

In [None]:
# read file and print its shape
data_df = pd.read_csv('data_file.csv', usecols=['text','intent'])
print('Shape: ', data_df.shape)

Shape:  (14484, 2)


In [None]:
# check for duplicate rows
duplicate = data_df[data_df.duplicated()]
print("Duplicate Rows :", duplicate)

Duplicate Rows :                                        text                intent
546                        find movie times  SearchScreeningEvent
810                   find a movie schedule  SearchScreeningEvent
1080                   find movie schedules  SearchScreeningEvent
1218           what are the movie schedules  SearchScreeningEvent
1620                       find movie times  SearchScreeningEvent
...                                     ...                   ...
14419             can i get the movie times  SearchScreeningEvent
14472              what are the movie times  SearchScreeningEvent
14473  where can i find the movie schedules  SearchScreeningEvent
14476              what are the movie times  SearchScreeningEvent
14478       what films are playing close by  SearchScreeningEvent

[268 rows x 2 columns]


In [None]:
# drop duplicate rows
data_df.drop_duplicates(inplace=True)
print('Shape (without duplicates): ', data_df.shape)

Shape (without duplicates):  (14216, 2)
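
With the default `keep='first'`, `duplicated()` flags only the second and later copies of a row, which is why the 268 flagged rows equal the number of rows removed (14484 - 268 = 14216). A minimal sketch on a toy frame; the rows below are invented for illustration and are not taken from `data_file.csv`:

```python
import pandas as pd

# Toy data, invented for illustration only.
toy_df = pd.DataFrame(
    {
        "text": [
            "find movie times",
            "play some jazz",
            "find movie times",
            "find movie times",
        ],
        "intent": [
            "SearchScreeningEvent",
            "PlayMusic",
            "SearchScreeningEvent",
            "SearchScreeningEvent",
        ],
    }
)

# keep='first' (the default) marks every copy except the first occurrence.
dup_mask = toy_df.duplicated()
n_flagged = int(dup_mask.sum())

# drop_duplicates() removes exactly the rows flagged above.
deduped_df = toy_df.drop_duplicates()

print("Flagged as duplicate:", n_flagged)
print("Rows before / after:", len(toy_df), "/", len(deduped_df))
```

The surviving rows keep their original index labels, so the index is no longer contiguous after dropping; `reset_index(drop=True)` may be worth calling if contiguous labels matter later.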


Observations:

1.   Number of observations in train set: 9951
2.   Number of observations in valid set: 2985
3.   Number of observations in test set: 1280



In [None]:
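# split into train (70%), valid (21%) and test (9%) sets
random_seed = 70  # fixed seed for reproducible splits
# first split: hold out 30% of the rows, stratified by intent
train_df, holdout_df = train_test_split(
    data_df, test_size=0.3, stratify=data_df["intent"].values, random_state=random_seed
)
# second split: 70% of the held-out rows become valid, 30% become test
valid_df, test_df = train_test_split(
    holdout_df, test_size=0.3, stratify=holdout_df["intent"].values, random_state=random_seed
)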
print('Train Shape: ', train_df.shape)
print('Valid Shape: ', valid_df.shape)
print('Test Shape: ', test_df.shape)

Train Shape:  (9951, 2)
Valid Shape:  (2985, 2)
Test Shape:  (1280, 2)
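
The three sizes follow from applying `test_size=0.3` twice: once to the full data, then again to the held-out 30%. The arithmetic can be sketched in plain Python. The rounding rule used here (round the held-out part up, give the remainder to the other part) is an assumption about how scikit-learn treats a fractional `test_size`; it reproduces the shapes printed above.

```python
import math

n_total = 14216      # rows left after removing duplicates
test_fraction = 0.3  # value passed as test_size in both calls

# Assumed rounding: the held-out part is rounded up,
# and the remaining rows go to the other part.
n_holdout = math.ceil(test_fraction * n_total)  # first split: rows held out
n_train = n_total - n_holdout

n_test = math.ceil(test_fraction * n_holdout)   # second split, on the held-out rows
n_valid = n_holdout - n_test

for name, n in [("train", n_train), ("valid", n_valid), ("test", n_test)]:
    print(f"{name}: {n} rows ({n / n_total:.1%} of the data)")
```

The resulting split is therefore roughly 70% train, 21% valid, and 9% test, not 70/15/15.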


In [None]:
# save train-test-valid dataframes into separate csv files
# (by default, to_csv also writes the row index as an unnamed first column)
train_df.to_csv('train_.csv')
test_df.to_csv('test_.csv')
valid_df.to_csv('valid_.csv')

# download the csv files
files.download('train_.csv')
files.download('test_.csv')
files.download('valid_.csv')
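
Because `to_csv` writes the DataFrame index by default, the saved files carry the original row labels as an unnamed first column. A minimal sketch of what that looks like when a file is read back; the toy rows and the file name `toy_split.csv` are hypothetical:

```python
import os
import tempfile

import pandas as pd

# Toy frame with a non-contiguous index, as is left behind after dropping rows.
toy_df = pd.DataFrame(
    {
        "text": ["find movie times", "play some jazz"],
        "intent": ["SearchScreeningEvent", "PlayMusic"],
    },
    index=[3, 7],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_split.csv")  # hypothetical file name
    toy_df.to_csv(path)  # index=True by default: the index becomes the first column

    naive_df = pd.read_csv(path)                   # index shows up as an extra column
    restored_df = pd.read_csv(path, index_col=0)   # first column used as the index again

print(list(naive_df.columns))
print(list(restored_df.columns))
```

If the original row labels are not needed downstream, passing `index=False` to `to_csv` would avoid the extra column altogether.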


**Conclusion:**

1.   Total observations (with duplicates): 14484
2.   Total observations (without duplicates): 14216
    *   Total train: 9951
    *   Total valid: 2985
    *   Total test: 1280
3.   Columns: 2 (`text` and `intent`)
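
The totals above add up (9951 + 2985 + 1280 = 14216). A further sanity check is that the three splits share no rows and together cover the deduplicated data. The helper below, `check_partition`, is a hypothetical name introduced for illustration; it assumes the full frame has unique index labels and is shown on toy data only.

```python
import pandas as pd


def check_partition(full_df, parts):
    """Return True if the parts are disjoint by index label and cover full_df."""
    all_labels = [label for part in parts for label in part.index]
    no_overlap = len(all_labels) == len(set(all_labels))
    full_coverage = set(all_labels) == set(full_df.index)
    return no_overlap and full_coverage


# Counts reported in the conclusion.
assert 9951 + 2985 + 1280 == 14216

# Toy data, invented for illustration only.
toy_df = pd.DataFrame({"text": list("abcdefghij"), "intent": ["x", "y"] * 5})
toy_train, toy_valid, toy_test = toy_df.iloc[:7], toy_df.iloc[7:9], toy_df.iloc[9:]

good_split = check_partition(toy_df, [toy_train, toy_valid, toy_test])
bad_split = check_partition(toy_df, [toy_train, toy_df.iloc[6:]])  # row 6 appears twice

print("Clean split is a partition:", good_split)
print("Overlapping split is a partition:", bad_split)
```

On the real data the same call would take `data_df` and `[train_df, valid_df, test_df]`.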