In [1]:
import string
import numpy as np
import pandas as pd
import random

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [3]:
data = pd.read_csv('/home/pratiyushp/Downloads/datasets/train.csv')
data_copy = data.copy()
data.drop(data.columns[0], axis=1, inplace = True)

In [4]:
data['len_hindi'] = data['hindi'].apply(lambda x: len(x.split(' ')))
data['len_english'] = data['english'].apply(lambda x: len(x.split(' ')))

**Checking for null values**

Checking for null values is usually the first step in data analysis. Here we check whether any column of the dataset contains missing entries.

In [5]:
data.isnull().sum()

hindi          0
english        0
len_hindi      0
len_english    0
dtype: int64

**Dropping duplicates from the data**

The dataset may contain duplicate entries, so we remove any that are present.

In [10]:
print(f"Shape of the dataset before removing the duplicates: {data.shape}")
data.drop_duplicates(inplace = True)
print(f"Shape of the dataset after removing the duplicates: {data.shape}")

Shape of the dataset before removing the duplicates: (102296, 4)
Shape of the dataset after removing the duplicates: (102296, 4)
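The shapes above are identical, so no rows were dropped at this point. To make the count explicit rather than comparing shapes by eye, a minimal sketch on a toy frame could look like this (the `toy` data and the names `n_duplicates` and `deduped` are illustrative and not part of the original notebook):

```python
import pandas as pd

# Toy frame with one repeated row, standing in for the real dataset.
toy = pd.DataFrame({
    'hindi': ['क्योंकि यह एक खुशियों भरी फ़िल्म है',
              'क्योंकि यह एक खुशियों भरी फ़िल्म है',
              'मैं उनके साथ कोई लेना देना नहीं है'],
    'english': ['because its a happy film',
                'because its a happy film',
                'i have nothing to do with them'],
})

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(f"Duplicate rows found: {n_duplicates}")
print(f"Rows remaining: {len(deduped)}")
```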


In [11]:
# Lowercase both columns; this only affects Latin letters, since Devanagari has no case.
data['hindi'] = data['hindi'].str.lower()
data['english'] = data['english'].str.lower()

**Removing punctuations**

Both the Hindi and English sentences contain various punctuation marks. We strip them as part of data cleaning.

In [12]:
def remove_punctuations(sentence):
    # string.punctuation holds the ASCII punctuation characters.
    punctuations = set(string.punctuation)
    # Keep every character that is not in that set.
    return ''.join(letter for letter in sentence if letter not in punctuations)

In [13]:
data['hindi'] = data['hindi'].apply(remove_punctuations)
data['english'] = data['english'].apply(remove_punctuations)
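Note that `string.punctuation` contains only ASCII punctuation, so Devanagari marks such as the danda (`।`) and double danda (`॥`) pass through the function above unchanged. A minimal sketch of one way to cover them, assuming these two marks are the ones that matter for this dataset (the name `remove_punctuations_hi` and the extra character set are illustrative and not from the original notebook):

```python
import string

# Assumption: danda and double danda are the Hindi-specific marks worth stripping.
EXTRA_HINDI_PUNCTUATION = '।॥'

# Map every listed character to None so that str.translate deletes it.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation + EXTRA_HINDI_PUNCTUATION)

def remove_punctuations_hi(sentence):
    """Return the sentence with ASCII punctuation and dandas removed."""
    return sentence.translate(_PUNCT_TABLE)

print(remove_punctuations_hi('यह एक फ़िल्म है।'))
print(remove_punctuations_hi("it's a happy film!"))
```

Building the translation table once and reusing it avoids rebuilding the punctuation list for every row.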

**Removing mixed sentences (samples that have English words in the Hindi sentence)**

Inspecting the data shows that some samples contain English words inside the Hindi sentence. We treat these samples as outliers and flag them so they can be removed from the dataset.

In [14]:
def is_mixed(sentence):
    letters = string.ascii_lowercase  # 'abcdefghijklmnopqrstuvwxyz'
    for ch in letters:
        if ch in sentence:
            return True
    return False
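The same kind of check can also be written with a regular expression, which scans the sentence once instead of once per letter. This is only a sketch: `has_latin_letters` is a hypothetical name, and matching uppercase letters as well is an added assumption so the check does not depend on the text having been lowercased first.

```python
import re

# Matches a single ASCII letter, lowercase or uppercase.
_LATIN_LETTER = re.compile(r'[a-zA-Z]')

def has_latin_letters(sentence):
    """Return True if the sentence contains at least one ASCII letter."""
    return _LATIN_LETTER.search(sentence) is not None

print(has_latin_letters('the thought reaching the eyes'))
print(has_latin_letters('मैं उनके साथ कोई लेना देना नहीं है'))
```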

In [17]:
data['is_mixed'] = data['hindi'].apply(lambda x: is_mixed(x))
data.head()

Unnamed: 0,hindi,english,len_hindi,len_english,is_mixed
0,एल सालवाडोर मे जिन दोनो पक्षों ने सिविलयुद्ध स...,in el salvador both sides that withdrew from t...,22,23,False
1,मैं उनके साथ कोई लेना देना नहीं है,i have nothing to do with them,8,7,False
2,हटाओ रिक,fuck them rick,2,3,False
3,क्योंकि यह एक खुशियों भरी फ़िल्म है,because its a happy film,7,5,False
4,the thought reaching the eyes,the thought reaching the eyes,5,5,True
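Row 4 above is flagged because its `hindi` field is actually English text. Once the `is_mixed` flag exists, removing the flagged rows is a boolean filter. A minimal sketch on a toy frame (the `toy` and `cleaned` names are illustrative, and dropping the helper column afterwards is an assumption about what later steps need):

```python
import pandas as pd

# Toy frame built from two rows of the output above.
toy = pd.DataFrame({
    'hindi': ['मैं उनके साथ कोई लेना देना नहीं है', 'the thought reaching the eyes'],
    'english': ['i have nothing to do with them', 'the thought reaching the eyes'],
    'is_mixed': [False, True],
})

# Keep only the rows that are not flagged, then discard the helper column.
cleaned = toy[~toy['is_mixed']].drop(columns='is_mixed').reset_index(drop=True)

print(cleaned.shape)
```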
