# Cleaning the Dataset

In [1]:
import os
import re
import nltk
import pandas as pd

**NOTE:** If this is your first time using the `nltk` library, uncomment the line in the following cell to download the necessary data. For this notebook, you will need the *stopwords* corpus.

In [2]:
# nltk.download('stopwords')

Link to [dataset](https://www.kaggle.com/zarajamshaid/language-identification-datasst) on Kaggle.

The dataset contains 1,000 rows for each of 22 languages (22,000 data points in total).

In [3]:
df = pd.read_csv("dataset.csv")

df.head()

| | Text | language |
|---|---|---|
| 0 | klement gottwaldi surnukeha palsameeriti ning ... | Estonian |
| 1 | sebes joseph pereira thomas på eng the jesuit... | Swedish |
| 2 | ถนนเจริญกรุง อักษรโรมัน thanon charoen krung เ... | Thai |
| 3 | விசாகப்பட்டினம் தமிழ்ச்சங்கத்தை இந்துப் பத்திர... | Tamil |
| 4 | de spons behoort tot het geslacht haliclona en... | Dutch |


## Quick overview of the dataset

In [4]:
df.language.value_counts()

English       1000
Spanish       1000
Korean        1000
Japanese      1000
Latin         1000
Indonesian    1000
Urdu          1000
Pushto        1000
Chinese       1000
Romanian      1000
Tamil         1000
Estonian      1000
Turkish       1000
Persian       1000
Dutch         1000
Arabic        1000
Thai          1000
Swedish       1000
Portugese     1000
Hindi         1000
French        1000
Russian       1000
Name: language, dtype: int64

## Notes on the dataset
- `nltk.word_tokenize` is designed primarily for English. This is not a major issue, since most of the languages in the dataset separate words with spaces and can still be tokenized.
- Languages that do not use spaces to separate words (such as Chinese) are greatly affected by this limitation.
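
To make the second point concrete, here is a minimal sketch that uses plain whitespace splitting as a stand-in for tokenization. The Dutch sample is the visible start of a row in the dataset preview; the Chinese sentence is a hypothetical example, not taken from the dataset.

```python
# Whitespace splitting as a rough stand-in for word tokenization.
samples = {
    "Dutch": "de spons behoort tot het geslacht haliclona",
    "Chinese": "我喜欢学习语言",  # hypothetical sample sentence
}

# Count the "tokens" that a whitespace split produces for each sample.
token_counts = {language: len(text.split()) for language, text in samples.items()}

print(token_counts)  # {'Dutch': 7, 'Chinese': 1}
```

The Dutch text splits into individual words, while the whole Chinese sentence comes back as a single token.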

For this project, I will only keep the languages for which `nltk` currently provides a stopword list. An idea for a future project would be to include other languages in the dataset with packages that specialize in those languages.

### List of languages in the dataset

In [5]:
languages = df['language'].str.lower().unique().tolist()

languages

['estonian',
 'swedish',
 'thai',
 'tamil',
 'dutch',
 'japanese',
 'turkish',
 'latin',
 'urdu',
 'indonesian',
 'portugese',
 'french',
 'chinese',
 'korean',
 'hindi',
 'spanish',
 'pushto',
 'persian',
 'romanian',
 'russian',
 'english',
 'arabic']

### List of languages supported by `nltk`

**NOTE:** For this to work on your machine, replace the path in quotes with the location where `nltk` saved the stopwords on your system.

In [6]:
# # Windows
# supported_languages = os.listdir("C:\\users\\johng\\appdata\\roaming\\nltk_data\\corpora\\stopwords")

# Mac OS
supported_languages = os.listdir('/Users/johngonzalez/nltk_data/corpora/stopwords')

supported_languages

['dutch',
 'german',
 'slovene',
 'hungarian',
 'romanian',
 'kazakh',
 'turkish',
 'russian',
 'README',
 'italian',
 'english',
 'greek',
 'tajik',
 'norwegian',
 'portuguese',
 'finnish',
 'danish',
 'french',
 'swedish',
 'azerbaijani',
 'spanish',
 'indonesian',
 'arabic',
 'nepali']
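
The hard-coded path above only works on one machine, and the listing includes the `README` file. A possible workaround is sketched below: it builds the path from the home directory and skips `README`. The helper name `list_stopword_languages` and the `~/nltk_data` location are assumptions for illustration. On Windows, the commented-out line in the original cell points at the roaming AppData folder instead. If the corpus is already downloaded, `nltk.corpus.stopwords.fileids()` should give a similar list without any path handling.

```python
from pathlib import Path


def list_stopword_languages(stopwords_dir):
    """Return the sorted file names in stopwords_dir, skipping README."""
    return sorted(
        entry.name
        for entry in Path(stopwords_dir).iterdir()
        if entry.is_file() and entry.name != "README"
    )


# Assumed per-user download location on macOS/Linux; adjust for your system.
default_dir = Path.home() / "nltk_data" / "corpora" / "stopwords"

if default_dir.is_dir():
    print(list_stopword_languages(default_dir))
else:
    print(f"No stopwords directory found at {default_dir}")
```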

### Intersection between the two lists

In [7]:
# Find the intersection between lists
final_languages = list(set(languages) & set(supported_languages))
final_languages = [lang.capitalize() for lang in final_languages]

final_languages

['Romanian',
 'Indonesian',
 'Spanish',
 'Turkish',
 'Dutch',
 'French',
 'English',
 'Russian',
 'Swedish',
 'Arabic']

In [8]:
print(f'We will be using {len(final_languages)} languages from the {len(languages)} languages in the dataset')

We will be using 10 languages from the 22 languages in the dataset
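
One detail in the lists above: the dataset spells the label `portugese`, while `nltk` names its file `portuguese`, so Portuguese drops out of the intersection even though `nltk` has stopwords for it. The rest of this notebook continues with the 10 languages shown. If you wanted to keep Portuguese, a sketch along these lines could normalize the spelling first. `SPELLING_FIXES` is a hypothetical name, and both lists are copied from the outputs above.

```python
# Lowercased language labels from the dataset, as printed above.
dataset_languages = [
    'estonian', 'swedish', 'thai', 'tamil', 'dutch', 'japanese', 'turkish',
    'latin', 'urdu', 'indonesian', 'portugese', 'french', 'chinese', 'korean',
    'hindi', 'spanish', 'pushto', 'persian', 'romanian', 'russian', 'english',
    'arabic',
]

# Entries found in the nltk stopwords directory, as printed above.
nltk_languages = [
    'dutch', 'german', 'slovene', 'hungarian', 'romanian', 'kazakh', 'turkish',
    'russian', 'README', 'italian', 'english', 'greek', 'tajik', 'norwegian',
    'portuguese', 'finnish', 'danish', 'french', 'swedish', 'azerbaijani',
    'spanish', 'indonesian', 'arabic', 'nepali',
]

# Hypothetical mapping from dataset spellings to nltk spellings.
SPELLING_FIXES = {'portugese': 'portuguese'}

normalized = [SPELLING_FIXES.get(lang, lang) for lang in dataset_languages]
matched = sorted(set(normalized) & set(nltk_languages))

print(len(matched))  # 11
```

Filtering the DataFrame would still need the dataset's own spelling (`Portugese`), since that is what the `language` column contains.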


### Keep the rows with the supported languages

In [9]:
df = df[df.language.isin(final_languages)]
df = df.reset_index(drop=True)

In [10]:
df.head()

| | Text | language |
|---|---|---|
| 0 | sebes joseph pereira thomas på eng the jesuit... | Swedish |
| 1 | de spons behoort tot het geslacht haliclona en... | Dutch |
| 2 | tsutinalar i̇ngilizce tsuutina kanadada albert... | Turkish |
| 3 | kemunculan pertamanya adalah ketika mencium ka... | Indonesian |
| 4 | association de recherche et de sauvegarde de l... | French |


In [11]:
df.language.value_counts()

English       1000
Indonesian    1000
Spanish       1000
Arabic        1000
Dutch         1000
French        1000
Turkish       1000
Romanian      1000
Russian       1000
Swedish       1000
Name: language, dtype: int64

In [12]:
print(f'There are a total of {len(df)} data points left.')

There are a total of 10000 data points left.


# Data Cleaning
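- Remove special characters (brackets, question marks, em dashes, quotation marks, zero-width spaces) from the text
- Replace hyphens with spaces
- Collapse any extra whitespace in the text
- Convert the text to lowercase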

In [13]:
def preprocess(text):
    # Remove brackets, question marks, em dashes, and quotation marks
    text = re.sub(r'[\[\]?—"«»]', "", text)
    text = text.replace('\u200b', '')  # Remove zero-width spaces
    text = text.replace("-", " ")  # Replace '-' with a space
    text = " ".join(text.split())  # Collapse any extra whitespace
    text = text.lower()

    return text
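
As a quick sanity check, the sketch below restates `preprocess` with the same character set and applies it to a made-up string (the sample is hypothetical, not a row from the dataset).

```python
import re


def preprocess(text):
    # Remove brackets, question marks, em dashes, and quotation marks
    text = re.sub(r'[\[\]?—"«»]', "", text)
    text = text.replace('\u200b', '')  # Remove zero-width spaces
    text = text.replace("-", " ")  # Replace hyphens with spaces
    text = " ".join(text.split())  # Collapse runs of whitespace
    return text.lower()


# Hypothetical input exercising each cleaning step
sample = '  The «Jesuit» mission [1] was well-known\u200b?  '
cleaned = preprocess(sample)

print(cleaned)  # the jesuit mission 1 was well known
```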

In [14]:
df['Cleaned_Text'] = df['Text'].apply(preprocess)

df.head()

| | Text | language | Cleaned_Text |
|---|---|---|---|
| 0 | sebes joseph pereira thomas på eng the jesuit... | Swedish | sebes joseph pereira thomas på eng the jesuits... |
| 1 | de spons behoort tot het geslacht haliclona en... | Dutch | de spons behoort tot het geslacht haliclona en... |
| 2 | tsutinalar i̇ngilizce tsuutina kanadada albert... | Turkish | tsutinalar i̇ngilizce tsuutina kanadada albert... |
| 3 | kemunculan pertamanya adalah ketika mencium ka... | Indonesian | kemunculan pertamanya adalah ketika mencium ka... |
| 4 | association de recherche et de sauvegarde de l... | French | association de recherche et de sauvegarde de l... |


# Save DataFrame
Save the cleaned DataFrame as a pickle file.

In [15]:
os.makedirs("./saved-items", exist_ok=True)  # Create the output folder if it is missing
df.to_pickle("./saved-items/df.pkl")
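
To pick the data up again in a later notebook, `pd.read_pickle` reads the file back. The sketch below shows the round trip with a small stand-in DataFrame and a temporary directory; both are assumptions for illustration, and in practice you would read `./saved-items/df.pkl` directly.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical two-row stand-in for the cleaned DataFrame
sample_df = pd.DataFrame({
    "Text": ["De Spons", "Well-known"],
    "language": ["Dutch", "English"],
    "Cleaned_Text": ["de spons", "well known"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = Path(tmp_dir) / "df.pkl"
    sample_df.to_pickle(pickle_path)           # Save, as in the cell above
    restored_df = pd.read_pickle(pickle_path)  # Load it back

# The restored frame should match the original values and dtypes
print(restored_df.equals(sample_df))
```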