<a href="https://colab.research.google.com/github/mertcan-basut/nlp/blob/main/pre_processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# install packages
!pip install -q opendatasets

In [None]:
import opendatasets as od # download datasets

import pandas as pd

import re # regular expressions
from string import punctuation
from unicodedata import normalize

In [None]:
# get data and load it into a dataframe
od.download("https://www.kaggle.com/datasets/nicapotato/womens-ecommerce-clothing-reviews")

data = pd.read_csv("womens-ecommerce-clothing-reviews/Womens Clothing E-Commerce Reviews.csv")
reviews = data[["Review Text", "Recommended IND"]].dropna().rename(columns={"Review Text": "review", "Recommended IND": "label"})

reviews.head()

| | review | label |
|---|---|---|
| 0 | Absolutely wonderful - silky and sexy and comf... | 1 |
| 1 | Love this dress! it's sooo pretty. i happene... | 1 |
| 2 | I had such high hopes for this dress and reall... | 0 |
| 3 | I love, love, love this jumpsuit. it's fun, fl... | 1 |
| 4 | This shirt is very flattering to all due to th... | 1 |


# Text Cleaning and Normalization

## Converting to lowercase

String comparison in Python is case-sensitive, so `Dress` and `dress` would be treated as two different words. Converting text to lowercase is therefore a common pre-processing step for text normalization.

In [None]:
text = reviews.sample(1)["review"].values[0]

def to_lowercase(text):
  return text.lower()

transformed_text = to_lowercase(text)

print(f"{'Text':=^50}\n{text}\n\n{'Transformed Text':=^50}\n{transformed_text}")

reviews["review"] = reviews["review"].str.lower()

Beautiful dress; i ended up getting it in both the red and the green. the fit is absolutely perfect and it flatters pretty much any figure. one thing i would say is that you should get your exact size if possible; i ordered it online because it was so pretty in the 4p. i happened to be in the store the next week and i ttried on the 4r in the store just to see how it looked and it was awful! i was so disappointed but then my dress arrived in the mail and the 4p fit me perfectly. i'm so glad retailer

beautiful dress; i ended up getting it in both the red and the green. the fit is absolutely perfect and it flatters pretty much any figure. one thing i would say is that you should get your exact size if possible; i ordered it online because it was so pretty in the 4p. i happened to be in the store the next week and i ttried on the 4r in the store just to see how it looked and it was awful! i was so disappointed but then my dress arrived in the mail and the 4p fit me perfectly. i'm so glad 
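For text that is not purely English, `str.lower()` and `str.casefold()` can give different results. The sketch below is only an illustration with sample strings chosen for the purpose; it is not part of the original pipeline.

```python
# Compare str.lower() with str.casefold() on a few illustrative strings.
samples = ["AMAZING", "Straße", "İstanbul"]

for s in samples:
    # casefold() applies more aggressive folding intended for caseless comparison
    print(f"{s!r}: lower={s.lower()!r}, casefold={s.casefold()!r}")

# The German sharp s is kept by lower() but expanded to "ss" by casefold()
lowered = "Straße".lower()
folded = "Straße".casefold()

# The Turkish dotted capital I lowercases to "i" plus a combining dot (two code points)
dotted_i = "İ".lower()
print(len(dotted_i))
```

If the reviews may contain such characters, it is worth deciding which behavior is wanted before lowercasing the whole column.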

## Removing URLs

URLs rarely carry useful signal for text analysis and add many unique tokens to the vocabulary, so they are usually removed from the text data.

In [None]:
text = "This is the dataset URL: https://www.kaggle.com/datasets/nicapotato/womens-ecommerce-clothing-reviews"

def remove_urls(text, replacement_text=''):
  regex_pattern = re.compile(r'https?://\S+|www\.\S+') # define a regex pattern to match URLs
  transformed_text = regex_pattern.sub(replacement_text, text) # replace URLs with the specified replacement text

  return transformed_text

transformed_text = remove_urls(text)

print(f"{'Text':=^50}\n{text}\n\n{'Transformed Text':=^50}\n{transformed_text}")

reviews["review"] = reviews["review"].str.replace(r'https?://\S+|www\.\S+', '', regex=True)

This is the dataset URL: https://www.kaggle.com/datasets/nicapotato/womens-ecommerce-clothing-reviews

This is the dataset URL: 
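Instead of deleting URLs outright, it is sometimes useful to keep a marker showing that a link was present. The sketch below reuses the same regex pattern; the function name `replace_urls` and the `<url>` token are assumptions made for illustration. It also shows that `\S+` consumes punctuation attached to a URL.

```python
import re

# Same pattern as in the cell above
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

def replace_urls(text, placeholder="<url>"):
    """Replace every URL match with a placeholder token (the token is an assumption)."""
    return URL_PATTERN.sub(placeholder, text)

first = replace_urls("sizing chart: https://example.com/sizes and www.example.com/fit")
# \S+ is greedy, so the comma attached to the URL is consumed as well
second = replace_urls("see www.example.com, thanks")

print(first)
print(second)
```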


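## Removing punctuation

Punctuation, symbols, and emojis are replaced with spaces. Because `\w` is Unicode-aware in Python 3, letters such as `ğ`, `é`, and `ㅋ` are kept.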

In [None]:
text = "This İs an✨, example-text_ 1 docğument.\n :) ❤️ I've felt %100 AMAZING!!!  ㅋㅋㅋㅋ <3é "

def remove_punctuation(text):
  transformed_text = re.sub(r'[^\w\s]|_', " ", text) # replace underscores and every character that is neither a word character nor whitespace with a space

  return transformed_text

transformed_text = remove_punctuation(text)

print(f"{'Text':=^50}\n{text}\n\n{'Transformed Text':=^50}\n{transformed_text}")

This İs an✨, example-text_ 1 docğument.
 :) ❤️ I've felt %100 AMAZING!!!  ㅋㅋㅋㅋ <3é 

This İs an   example text  1 docğument 
       I ve felt  100 AMAZING     ㅋㅋㅋㅋ  3é 
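An alternative that touches only ASCII punctuation is a translation table built from `string.punctuation`. This is a minimal sketch; the function name `remove_ascii_punctuation` is hypothetical. Unlike the regex above, it leaves emojis and other non-ASCII symbols in place.

```python
from string import punctuation

# Map each ASCII punctuation character to a single space
PUNCT_TABLE = str.maketrans(punctuation, " " * len(punctuation))

def remove_ascii_punctuation(text):
    """Replace ASCII punctuation with spaces; everything else is left unchanged."""
    return text.translate(PUNCT_TABLE)

result = remove_ascii_punctuation("I've felt %100 AMAZING!!! ✨")
print(repr(result))
```

Which variant fits better depends on whether emojis should survive the cleaning step.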


In [None]:
# TODO notes for the cleaning pipeline:
# remove tabs, newlines, etc.
# punctuation (remove)
# numbers (convert to words, leave, or remove); dropped
# ~~emojis or special words (preserve while doing all cleaning)
# non-ASCII characters (lossy decode and remove words containing them at the start) (removing accents, strict or lossy)
# compound words joined with an underscore
# whitespace removal comes last
# expand contractions such as "i've" into "i have" (custom look-up table or dict) (a mapping can be built with maketrans; it works for both converting and removing)
# typos (correct at the end)
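The contraction step from the notes above could look roughly like the following. This is a sketch under stated assumptions: the `CONTRACTIONS` table is hypothetical and deliberately tiny, and the text is assumed to be lowercased already. It has to run before punctuation removal, because that step replaces the apostrophe with a space.

```python
import re

# Hypothetical, deliberately tiny look-up table; a real one would be much larger
CONTRACTIONS = {
    "i've": "i have",
    "i'm": "i am",
    "it's": "it is",  # assumption: "it's" can also mean "it has"
    "don't": "do not",
}

# Build one alternation from the keys, longest first, bounded by word boundaries
_keys = sorted(CONTRACTIONS, key=len, reverse=True)
_pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in _keys) + r")\b")

def expand_contractions(text):
    """Expand known contractions in already-lowercased text."""
    return _pattern.sub(lambda m: CONTRACTIONS[m.group(1)], text)

print(expand_contractions("i've felt amazing and it's true"))
```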

In [None]:
from string import punctuation
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [None]:
# TODO: explain NFKD, ASCII, and UTF-8
import unicodedata
unicodedata.normalize('NFKD', "Ğafİrㅋㅋㅋㅋ").encode('ascii', 'ignore').decode('utf-8')

'GafIr'
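The ASCII round-trip above is lossy: after NFKD decomposition, everything that cannot be encoded as ASCII is dropped, which is why `ㅋㅋㅋㅋ` disappears. If only the accents should go, one option is to remove combining marks and keep the base characters. The following is a sketch, not the notebook's method; the function name `strip_accents` is hypothetical. It also removes combining marks in other scripts, which may or may not be desired.

```python
import unicodedata

def strip_accents(text):
    """Remove combining marks but keep base characters, including non-Latin ones."""
    decomposed = unicodedata.normalize("NFD", text)  # split letters from their accents
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", without_marks)  # recompose what remains

print(strip_accents("Ğafİr é"))
print(strip_accents("ㅋㅋ"))
```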

## Removing excess whitespace

Strip leading and trailing whitespace, and collapse runs of repeated whitespace into a single space.

In [None]:
text = "  This is   an,  exampl e text document. "

def strip_whitespace(text):
  transformed_text = text.strip() # strip leading and trailing whitespace
  transformed_text = re.sub(r'\s\s+', " ", transformed_text) # collapse runs of two or more whitespace characters into a single space

  return transformed_text

transformed_text = strip_whitespace(text)

print(f"{'Text':=^50}\n{text}\n\n{'Transformed Text':=^50}\n{transformed_text}")

reviews["review"] = reviews["review"].str.strip()
reviews["review"] = reviews["review"].str.replace(r'\s\s+', " ", regex=True)

  This is   an,  exampl e text document. 

This is an, exampl e text document.
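Note that the pattern `\s\s+` only matches runs of two or more whitespace characters, so a single tab or newline between words is left as it is. If every kind of whitespace should become a plain space, splitting and re-joining is a simple alternative. The function name `normalize_whitespace` below is hypothetical.

```python
import re

def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return " ".join(text.split())

sample = "  one\ttwo \n three  "

# Pattern from the cell above: the single tab survives
with_regex = re.sub(r'\s\s+', " ", sample.strip())
# split() without arguments splits on any whitespace and drops empty pieces
with_split = normalize_whitespace(sample)

print(repr(with_regex))
print(repr(with_split))
```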


# Tokenization/Lemmatization

In [None]:
# tokenization/lemmatization
# stop word removal
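As a placeholder for the planned steps, tokenization and stop word removal could be sketched as follows. This is only an assumption about how the section might proceed: the `STOP_WORDS` set is hypothetical and far from complete, and lemmatization is left out because it requires a dedicated NLP library.

```python
import re

# Hypothetical, very small stop word set used only for illustration
STOP_WORDS = {"the", "and", "is", "it", "in", "a", "i", "this"}

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters, digits, underscores)."""
    return re.findall(r"\w+", text.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop word set."""
    return [t for t in tokens if t not in stop_words]

tokens = tokenize("The fit is absolutely perfect and it flatters")
filtered = remove_stop_words(tokens)

print(tokens)
print(filtered)
```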
