# Text Preprocessing

This Jupyter notebook performs text preprocessing and language detection on a corpus of sentences. The aim is to prepare the text data for further analysis or modeling by applying several preprocessing steps and filtering out non-English sentences. Specifically, the code does the following:

- **Data Cleaning**: The code counts the duplicated and null data points and removes them if needed.

- **Language Detection**: The code detects the language of each sentence in the given corpus and retains only the English sentences for further processing. This step ensures that subsequent analysis or modeling is focused on English text data.

- **Text Preprocessing**: The code applies a series of text preprocessing steps to each English sentence. This includes converting the text to lowercase, removing punctuation, tokenizing the text into individual words, removing stop words, expanding contractions, and lemmatizing the words. These preprocessing steps help clean and normalize the text data, making it more suitable for downstream tasks such as sentiment analysis, topic modeling, or information retrieval.

## Data Cleaning

In [1]:
import pandas as pd
import numpy as np

In [2]:
disney = pd.read_csv("DisneylandReviews.csv", encoding='latin-1')

disney.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42656 entries, 0 to 42655
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Review_ID          42656 non-null  int64 
 1   Rating             42656 non-null  int64 
 2   Year_Month         42656 non-null  object
 3   Reviewer_Location  42656 non-null  object
 4   Review_Text        42656 non-null  object
 5   Branch             42656 non-null  object
dtypes: int64(2), object(4)
memory usage: 2.0+ MB


In [3]:
# This analysis only deals with the text column
df = disney["Review_Text"]

In [14]:
# Keep the first three raw reviews so they can be compared with the preprocessed text at the end
raw_sentences = disney['Review_Text'][:3]

# Wrap each review in single quotes for display
formatted_raw_sentences = ["'" + sentence + "'" for sentence in raw_sentences]

# Print the formatted raw sentences
for sentence in formatted_raw_sentences:
    print(sentence)

'If you've ever been to Disneyland anywhere you'll find Disneyland Hong Kong very similar in the layout when you walk into main street! It has a very familiar feel. One of the rides  its a Small World  is absolutely fabulous and worth doing. The day we visited was fairly hot and relatively busy but the queues moved fairly well. '
'Its been a while since d last time we visit HK Disneyland .. Yet, this time we only stay in Tomorrowland .. AKA Marvel land!Now they have Iron Man Experience n d Newly open Ant Man n d Wasp!!Ironman .. Great feature n so Exciting, especially d whole scenery of HK (HK central area to Kowloon)!Antman .. Changed by previous Buzz lightyear! More or less d same, but I'm expecting to have something most!!However, my boys like it!!Space Mountain .. Turns into Star Wars!! This 1 is Great!!!For cast members (staffs) .. Felt bit MINUS point from before!!! Just dun feel like its a Disney brand!! Seems more local like Ocean Park or even worst!!They got no SMILING face, b

In [4]:
# Count the duplicated reviews
df.duplicated().sum()

24

In [5]:
# Count the null reviews
df.isnull().sum()

0

In [6]:
# Remove duplicated reviews (reassign instead of modifying the column slice in place)
df = df.drop_duplicates()

## Language Detection

In [7]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from langdetect import detect
import string
import re
import contractions

nltk.download('vader_lexicon')
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download('punkt')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/ijeonghyeon/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/ijeonghyeon/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/ijeonghyeon/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/ijeonghyeon/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Text Preprocessing

In [8]:
# Detect the language of each review and keep only the English ones
corpus = []

for text in df:
    try:
        lang = detect(text)
        if lang == "en":
            corpus.append(text)
    except Exception:
        # Detection can fail on text with no usable features; skip those reviews
        pass
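
The filter above silently skips any text for which detection fails, so there is no record of how many reviews were dropped or why. The following is a minimal sketch of the same loop that also counts the two kinds of drops. It takes the detector as a parameter so that it runs without `langdetect` installed; `keep_english` and `fake_detect` are hypothetical names introduced only for this illustration, and in the notebook `langdetect.detect` would be passed in place of the stand-in.

```python
from typing import Callable, Iterable, List, Tuple


def keep_english(
    texts: Iterable[str], detect_fn: Callable[[str], str]
) -> Tuple[List[str], int, int]:
    """Return (english_texts, n_other_language, n_failed)."""
    english: List[str] = []
    n_other = 0
    n_failed = 0
    for text in texts:
        try:
            lang = detect_fn(text)
        except Exception:
            # The detector could not classify this text; count it and move on
            n_failed += 1
            continue
        if lang == "en":
            english.append(text)
        else:
            n_other += 1
    return english, n_other, n_failed


def fake_detect(text: str) -> str:
    # Hypothetical stand-in for a real detector, used only to make the sketch runnable
    if not text.strip():
        raise ValueError("no features in text")
    return "en" if text.isascii() else "other"


english, n_other, n_failed = keep_english(["Great park", "", "Très bien"], fake_detect)
print(len(english), n_other, n_failed)
```

The `langdetect` README also notes that detection is non-deterministic by default and that setting `DetectorFactory.seed` before the first call makes results repeatable, which may matter if the retained count needs to be reproducible.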

In [9]:
# Define the general text preprocessing function
def text_preprocessing(text):
    # Ensure the input is a string, then convert it to lowercase
    text = str(text).lower()
    # Remove punctuation (characters are deleted, so words joined only by punctuation merge)
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Tokenize the text
    text_tokens = nltk.word_tokenize(text)
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    filtered_text = [word for word in text_tokens if word not in stop_words]
    # Join the filtered words back into a string
    text = ' '.join(filtered_text)
    # Expand contractions; this runs after stop-word removal, so expansions such as "do not" stay in the output
    text = contractions.fix(text)
    return text

# Create the lemmatizer once and reuse it for every sentence
lemmatizer = WordNetLemmatizer()

# Take a sentence as input and return a list of lemmas
def lemmatize_nltk(sentence):
    tokens = nltk.word_tokenize(sentence)
    # Perform part-of-speech tagging on the tokens
    pos_tags = nltk.pos_tag(tokens)
    lemmas = []
    for token, tag in pos_tags:
        # Map the POS tag to the corresponding WordNet POS tag
        tag = get_wordnet_pos(tag)
        if tag:
            lemma = lemmatizer.lemmatize(token, tag)
        else:
            lemma = lemmatizer.lemmatize(token)
        lemmas.append(lemma)
    return lemmas

# Define a function that maps NLTK POS tags to WordNet POS tags
def get_wordnet_pos(tag):
    if tag.startswith('N'):
        return 'n'
    elif tag.startswith('V'):
        return 'v'
    elif tag.startswith('J'):
        return 'a'
    elif tag.startswith('R'):
        return 'r'
    else:
        return None
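
Because `text_preprocessing` deletes punctuation outright, words separated only by punctuation are fused, which is where tokens such as `landnow` in the sample output further down come from. A minimal sketch of one possible alternative is to map each punctuation character to a space instead. `PUNCT_TO_SPACE` and `strip_punctuation` are hypothetical names, not part of the notebook, and this variant assumes contractions are expanded beforehand, since apostrophes would also become spaces.

```python
import string

# Map every punctuation character to a space instead of deleting it
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def strip_punctuation(text: str) -> str:
    # split() followed by join() collapses the runs of spaces left behind
    return " ".join(text.translate(PUNCT_TO_SPACE).split())


sample = "AKA Marvel land!Now they have Iron Man"

# Behaviour of the notebook's approach: punctuation is deleted
deleted = sample.translate(str.maketrans("", "", string.punctuation))
# Behaviour of the sketch: punctuation becomes a word boundary
spaced = strip_punctuation(sample)

print(deleted)
print(spaced)
```

Adopting this would change the processed sentences shown in the rest of the notebook, so it is offered only as a point of comparison.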

In [10]:
# Apply text preprocessing and lemmatization to each English sentence
processed_corpus = []

for text in corpus:
    result = text_preprocessing(text)
    result = lemmatize_nltk(result)
    processed_corpus.append(result)

In [11]:
# Join each list of lemmas back into a single string
sentences = [" ".join(doc) for doc in processed_corpus]

In [12]:
# Print sample sentences after the text preprocessing
sentences[:3]

['you have ever disneyland anywhere you will find disneyland hong kong similar layout walk main street familiar feel one rid small world absolutely fabulous worth day visit fairly hot relatively busy queue move fairly well',
 'since last time visit hk disneyland yet time stay tomorrowland aka marvel landnow iron man experience n newly open ant man n waspironman great feature n excite especially whole scenery hk hk central area kowloonantman change previous buzz lightyear less i be expect something mosthowever boys like itspace mountain turn star war 1 greatfor cast member staff felt bit minus point dun feel like disney brand seem local like ocean park even worstthey get smile face wan na you enter n attraction n leavehello suppose happy place earth brand really do not feel itbakery main street attractive delicacy n disney theme sweet good pointslast also starbucks inside theme park',
 'thanks god hot humid visit park otherwise would big issue lot shadei arrive around 1030am leave 6pm u

In [13]:
# Check the number of sentences after text preprocessing
len(sentences)

42626
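
The final count can be reconciled with the earlier outputs. The sketch below does the arithmetic using only numbers printed in this notebook, under the assumption that the 24 duplicates were actually removed and that the language filter is the only other step that drops rows; the variable names are illustrative.

```python
total_rows = 42656           # RangeIndex size reported by disney.info()
duplicates = 24              # result of df.duplicated().sum()
after_preprocessing = 42626  # result of len(sentences)

# Rows left once duplicates are removed
after_dedup = total_rows - duplicates
# Rows dropped afterwards, i.e. non-English text or failed detection
dropped_by_language_filter = after_dedup - after_preprocessing

print(after_dedup)
print(dropped_by_language_filter)
```

Under those assumptions, only a handful of reviews were removed by the language filter.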

In [14]:
disney_sentences = sentences

In [15]:
# Store variables for further analysis
%store disney disney_sentences

Stored 'disney' (DataFrame)
Stored 'disney_sentences' (list)
