# Cleaning Text with a Subset of the Data

Text cleaning is an important step in Natural Language Processing (NLP). For better downstream results, reduce the words in the corpus to their root forms and remove text that carries no useful signal.

Stemming strips the last few characters from a word, which often produces tokens that are misspelled or are not real words. Lemmatization considers the context and converts the word to its meaningful base form, called the lemma.
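The difference can be seen with a toy sketch. The suffix list and the lookup table below are hypothetical stand-ins, not a real stemmer or lemmatizer; they only illustrate why chopping suffixes gives non-words while a dictionary lookup returns a valid base form.

```python
# Toy illustration only: real projects would use a proper stemmer and lemmatizer.

SUFFIXES = ("ing", "ies", "ed", "es", "s")  # hypothetical, deliberately crude

# Hypothetical lookup table standing in for a real lemmatizer's dictionary
LEMMA_TABLE = {"studies": "study", "moving": "move", "floods": "flood"}


def naive_stem(word):
    """Chop off the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lookup_lemma(word):
    """Return the dictionary base form, or the word itself if it is unknown."""
    return LEMMA_TABLE.get(word, word)


for w in ["studies", "moving", "floods"]:
    print(w, "-> stem:", naive_stem(w), "| lemma:", lookup_lemma(w))
```

The stems for "studies" and "moving" are not English words, while the looked-up lemmas are.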

Common cleaning steps include:

- Converting words into lowercase
- Removing leading and trailing whitespace
- Removing punctuation
- Removing stopwords
- Expanding contractions
- Removing special characters (numbers, emojis, etc.)
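The steps above can be sketched with the standard library alone. This is a minimal illustration, not the cleaning function used later in the notebook: `basic_clean`, `CONTRACTIONS`, and `STOP_WORDS` are hypothetical names, and the two lookup tables are deliberately tiny.

```python
import re

# Hypothetical, deliberately tiny lookup tables for illustration
CONTRACTIONS = {"it's": "it is", "i'm": "i am", "don't": "do not"}
STOP_WORDS = {"i", "am", "is", "it", "the", "a", "to", "do"}


def basic_clean(text):
    text = text.lower().strip()                 # lowercase, trim whitespace
    text = text.replace("\u2019", "'")          # normalize curly apostrophes
    for short, full in CONTRACTIONS.items():    # expand contractions
        text = re.sub(rf"\b{re.escape(short)}\b", full, text)
    text = re.sub(r"[^a-z\s]", " ", text)       # drop punctuation, digits, emojis
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # remove stopwords
    return " ".join(tokens)


print(basic_clean("  It\u2019s incredibly annoying!! I don't like the 2 lids \U0001F621 "))
```

Contractions are expanded before punctuation is removed; otherwise the apostrophe is stripped first and "it's" silently becomes "its".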

In [1]:
import json
import glob
import time
import re

import numpy as np 
import pandas as pd

from nltk import sent_tokenize, word_tokenize, regexp_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

## Function that processes chunks of a JSON Lines file and adds them to a DataFrame

In [None]:
# def load_jsonl(file):
#     data_list = []  # Initialize an empty list to collect the JSON objects
    
#     with open(file, 'r') as fp:
#         for line in fp:
#             data_list.append(json.loads(line.strip()))  # Append each JSON object to the list
    
#     df = pd.DataFrame(data_list)  
#     return df

In [2]:
n = 1 
total_rows = 0

def process_chunks(file, chunksize = 10000):

    # Module-level counters: they keep accumulating across calls and cell re-runs
    global n, total_rows  
    
    chunks = pd.read_json(file, lines=True, chunksize = chunksize)
    dfs = []  
    n_chunks = 0

    for chunk in chunks:
        dfs.append(chunk)
        n_chunks += 1  # Count the number of chunks processed
        print(len(chunk), " rows added")
        n += 1 
        total_rows += len(chunk)
        if n_chunks >= 10:  # Process only the first 10 chunks
            break  
            
    print("Done")
    print("Total rows:", total_rows)
    return pd.concat(dfs, ignore_index=True)
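As a point of comparison, here is a sketch of the same idea without global counters, using only the standard library. It is a variant rather than the notebook's function: `read_jsonl_subset` and its `max_chunks` parameter are hypothetical names, and it returns a list of dicts instead of a DataFrame.

```python
import json
from itertools import islice


def read_jsonl_subset(path, chunksize=10000, max_chunks=10):
    """Read at most chunksize * max_chunks records from a JSON Lines file."""
    records = []
    with open(path, "r", encoding="utf-8") as fp:
        for _ in range(max_chunks):
            # islice pulls the next `chunksize` lines without loading the whole file
            lines = list(islice(fp, chunksize))
            if not lines:
                break  # end of file reached
            chunk = [json.loads(line) for line in lines if line.strip()]
            records.extend(chunk)
            print(len(chunk), "rows added")
    # The total is local to this call, so it does not grow across calls
    print("Total rows:", len(records))
    return records
```

The result can be passed to `pd.DataFrame(...)` when a DataFrame is needed. Because the counters are local, calling the function for the reviews file and then the meta file reports each total separately.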

In [3]:
reviews = "../data/Home_and_Kitchen.jsonl"
meta = "../data/meta_Home_and_Kitchen.jsonl"

## Loading a subset of the Reviews and Meta data

In [5]:
start = time.process_time()

reviews_subset = process_chunks(reviews)

end = time.process_time()
elapsed_time = end - start
print('Created a subset of the reviews dataset')
print('Execution time:', elapsed_time, 'seconds')

print('--------------')
start = time.process_time()

meta_subset = process_chunks(meta)

end = time.process_time()
elapsed_time = end - start
print('Created a subset of the meta dataset')
print('Execution time:', elapsed_time, 'seconds')

10000  rows added
10000  rows added
10000  rows added
10000  rows added
10000  rows added
10000  rows added
10000  rows added
10000  rows added
10000  rows added
10000  rows added
Done
Total rows: 300000
Created a subset of the reviews dataset
Execution time: 2.265625 seconds
--------------
10000  rows added
10000  rows added
10000  rows added
10000  rows added
10000  rows added
10000  rows added
10000  rows added
10000  rows added
10000  rows added
10000  rows added
Done
Total rows: 400000
Created a subset of the meta dataset
Execution time: 9.40625 seconds


## Exploring the reviews data

In [6]:
reviews_subset.head(50)

| | rating | title | text | images | asin | parent_asin | user_id | timestamp | helpful_vote | verified_purchase |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | Received Used & scratched item! Purchased new! | Livid. Once again received an obviously used ... | [] | B007WQ9YNO | B09XWYG6X1 | AFKZENTNBQ7A7V7UXW5JJI6UGRYQ | 2023-02-26 01:03:29.298 | 1 | True |
| 1 | 5 | Excellent for moving & storage & floods! | I purchased these for multiple reasons. The ma... | [] | B09H2VJW6K | B0BXDLF8TW | AFKZENTNBQ7A7V7UXW5JJI6UGRYQ | 2022-12-26 08:30:10.846 | 0 | True |
| 2 | 2 | Lid very loose- needs a gasket imo. Small base. | [[VIDEOID:c87e962bc893a948856b0f1b285ce6cc]] I... | [{'small_image_url': 'https://m.media-amazon.c... | B07RL297VR | B09G2PW8ZG | AFKZENTNBQ7A7V7UXW5JJI6UGRYQ | 2022-05-25 02:54:56.788 | 0 | True |
| 3 | 5 | Best purchase ever! | If you live at a higher elevation like me (5k ... | [{'small_image_url': 'https://m.media-amazon.c... | B09CQF4SWV | B08CSZDXZY | AFKZENTNBQ7A7V7UXW5JJI6UGRYQ | 2022-05-06 16:38:16.178 | 0 | True |
| 4 | 5 | Excellent for yarn! | I use these to store yarn. They easily hold 12... | [{'small_image_url': 'https://images-na.ssl-im... | B003U6A3EY | B0C6V27S6N | AFKZENTNBQ7A7V7UXW5JJI6UGRYQ | 2020-05-20 00:28:45.940 | 1 | True |
| 5 | 5 | Perfect for tea leaves! | This is the perfect size for me being single a... | [] | B004506V8Q | B00GUAURXY | AFKZENTNBQ7A7V7UXW5JJI6UGRYQ | 2020-03-07 05:09:47.276 | 0 | True |
| 6 | 5 | Excellent! | Lost my old one during my last move. This work... | [] | B01CX1RIMQ | B08K8N5FB2 | AFKZENTNBQ7A7V7UXW5JJI6UGRYQ | 2020-02-15 20:50:06.318 | 0 | True |
| 7 | 3 | Works, but weird stuff floating on water | I like the temp settings, but does anybody kno... | [{'small_image_url': 'https://images-na.ssl-im... | B005YR0F40 | B0CGY43Y3P | AFKZENTNBQ7A7V7UXW5JJI6UGRYQ | 2019-03-18 02:59:43.695 | 2 | True |
| 8 | 5 | 100% cotton! Yay!!!!! | This duvet set is a serious bubble gum girly p... | [{'small_image_url': 'https://images-na.ssl-im... | B077BYFY8J | B077BYFY8J | AFKZENTNBQ7A7V7UXW5JJI6UGRYQ | 2019-01-19 17:49:25.545 | 1 | True |
| 9 | 5 | Soft and sturdy at the same time | I purchased this after having purchased some o... | [] | B01GEDH4KU | B01GEDH4KU | AFKZENTNBQ7A7V7UXW5JJI6UGRYQ | 2019-01-19 17:35:32.140 | 0 | True |


In [7]:
reviews_subset[['rating', 'title', 'text']]

| | rating | title | text |
| --- | --- | --- | --- |
| 0 | 1 | Received Used & scratched item! Purchased new! | Livid. Once again received an obviously used ... |
| 1 | 5 | Excellent for moving & storage & floods! | I purchased these for multiple reasons. The ma... |
| 2 | 2 | Lid very loose- needs a gasket imo. Small base. | [[VIDEOID:c87e962bc893a948856b0f1b285ce6cc]] I... |
| 3 | 5 | Best purchase ever! | If you live at a higher elevation like me (5k ... |
| 4 | 5 | Excellent for yarn! | I use these to store yarn. They easily hold 12... |
| ... | ... | ... | ... |
| 99995 | 5 | Five Stars | wish that made a bigger size to cover the enti... |
| 99996 | 3 | I really like these - I bought one for me and ... | I really like these - I bought one for me and ... |
| 99997 | 3 | I used this product in the past with no proble... | I used this product in the past with no proble... |
| 99998 | 5 | VERY SHARP! BEWARE YOUR FINGERS!!! | VERY SHARP! BEWARE YOUR FINGERS!!! |


In [8]:
i = 1
print(f'Title: {reviews_subset.loc[i,"title"]}\n')

print(f'Text: {reviews_subset.loc[i,"text"]}')

Title: Excellent for moving & storage & floods!

Text: I purchased these for multiple reasons. The main reason was that I was moving. I was moving bc my apt kept flooding.  Luckily having been through other floods I generally store all my stuff in plastic containers anyways. Sadly a couple of my vintage Singer’s in wooden boxes had been left out during the last move & the last flood got them moldy & ruined them.  I am a bibliophile & these bags held not only huge stacks of my hardcover books, but my 40 pound Singer sewing machines. I also carried hundreds of pounds of yarn (50 per bag) and thousands of pounds of quilting fabric during my move.  It held up to paints, plants, garden pits, & pretty much anything I threw at it. I even put my pots & pans in these bags.  While I still thought they were a bit pricey I did buy some on sale & I think I bought 18 total that I was able to pack & take to my new place & unload & go pack again. Tho I now intend to leave some of my yarn in them as it

## Function to show text before and after cleaning

In [9]:
def print_text(sample, clean):
    print(f"Before: {sample}")
    print("------------------")
    print(f"After: {clean}")

## Text Feature Extraction

In [10]:
# Number of words 
def word_count(text):
    words = text.split()
    return len(words)

In [11]:
sample_text = reviews_subset.loc[1, "text"]
clean_text = word_count(sample_text) 
print_text(sample_text, clean_text)

Before: I purchased these for multiple reasons. The main reason was that I was moving. I was moving bc my apt kept flooding.  Luckily having been through other floods I generally store all my stuff in plastic containers anyways. Sadly a couple of my vintage Singer’s in wooden boxes had been left out during the last move & the last flood got them moldy & ruined them.  I am a bibliophile & these bags held not only huge stacks of my hardcover books, but my 40 pound Singer sewing machines. I also carried hundreds of pounds of yarn (50 per bag) and thousands of pounds of quilting fabric during my move.  It held up to paints, plants, garden pits, & pretty much anything I threw at it. I even put my pots & pans in these bags.  While I still thought they were a bit pricey I did buy some on sale & I think I bought 18 total that I was able to pack & take to my new place & unload & go pack again. Tho I now intend to leave some of my yarn in them as it’s waterproof.  Tho I intend to place it on a t

In [12]:
# Average word length 
def avg_word_length(text):
    # Split the string into words
    words = text.split()
    # Guard against empty text to avoid dividing by zero
    if not words:
        return 0.0
    # Sum the length of every word and divide by the number of words
    return sum(len(word) for word in words) / len(words)

In [13]:
sample_text = reviews_subset.loc[1, "text"]
clean_text = avg_word_length(sample_text) 
print_text(sample_text, clean_text)

Before: I purchased these for multiple reasons. The main reason was that I was moving. I was moving bc my apt kept flooding.  Luckily having been through other floods I generally store all my stuff in plastic containers anyways. Sadly a couple of my vintage Singer’s in wooden boxes had been left out during the last move & the last flood got them moldy & ruined them.  I am a bibliophile & these bags held not only huge stacks of my hardcover books, but my 40 pound Singer sewing machines. I also carried hundreds of pounds of yarn (50 per bag) and thousands of pounds of quilting fabric during my move.  It held up to paints, plants, garden pits, & pretty much anything I threw at it. I even put my pots & pans in these bags.  While I still thought they were a bit pricey I did buy some on sale & I think I bought 18 total that I was able to pack & take to my new place & unload & go pack again. Tho I now intend to leave some of my yarn in them as it’s waterproof.  Tho I intend to place it on a t

## Readability Test Examples

Flesch reading ease
- The greater the average sentence length, the harder the text is to read.
- The greater the average number of syllables per word, the harder the text is to read.
- The higher the score, the easier the text is to read.

Gunning fog index
- The greater the percentage of complex words, the harder the text is to read.
- The higher the index, the lower the readability.
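Both scores reduce to short formulas over simple counts. The sketch below uses the standard published formulas and takes the counts as inputs; the function names and the example counts are hypothetical, and a library may count sentences, syllables, and complex words slightly differently, so its scores will not necessarily match these exactly.

```python
def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Flesch reading ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)


def gunning_fog(n_words, n_sentences, n_complex_words):
    """Gunning fog index: higher values mean harder text."""
    return 0.4 * ((n_words / n_sentences) + 100 * (n_complex_words / n_words))


# Hypothetical counts for a short passage
print(flesch_reading_ease(n_words=100, n_sentences=5, n_syllables=150))
print(gunning_fog(n_words=100, n_sentences=5, n_complex_words=10))
```

Longer sentences raise the words-per-sentence term in both formulas, which lowers the Flesch score and raises the fog index.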

In [14]:
# textatistic library
from textatistic import Textatistic

# Compute readability scores for the sample text
readability_scores = Textatistic(sample_text).scores

# Print the Flesch reading ease and Gunning fog scores
print(readability_scores['flesch_score'])
print(readability_scores['gunningfog_score'])

88.81993710691825
8.123270440251574


## Cleaning Text - NLTK

In [16]:
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    text = text.lower()  # Lowercase text
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Keep only letters and whitespace (drops punctuation, apostrophes, numbers, emojis)
    tokens = text.split()  # Create tokens 
    clean_tokens = [lemmatizer.lemmatize(t) for t in tokens if t not in stop_words]  # Lemmatize and remove stop words
    clean_text = " ".join(clean_tokens)  # Join clean tokens
    clean_text = " ".join(clean_text.split())  # Remove extra spaces, tabs, and new lines
    return clean_text

In [17]:
sample_text = reviews_subset.loc[1, "text"]
clean_text = preprocess_text(sample_text) 
print_text(sample_text, clean_text)
print('--------')
print('Original word count:', word_count(sample_text))
print('Word count:', word_count(clean_text))

Before: I purchased these for multiple reasons. The main reason was that I was moving. I was moving bc my apt kept flooding.  Luckily having been through other floods I generally store all my stuff in plastic containers anyways. Sadly a couple of my vintage Singer’s in wooden boxes had been left out during the last move & the last flood got them moldy & ruined them.  I am a bibliophile & these bags held not only huge stacks of my hardcover books, but my 40 pound Singer sewing machines. I also carried hundreds of pounds of yarn (50 per bag) and thousands of pounds of quilting fabric during my move.  It held up to paints, plants, garden pits, & pretty much anything I threw at it. I even put my pots & pans in these bags.  While I still thought they were a bit pricey I did buy some on sale & I think I bought 18 total that I was able to pack & take to my new place & unload & go pack again. Tho I now intend to leave some of my yarn in them as it’s waterproof.  Tho I intend to place it on a t

## Cleaning Text - spaCy

In [18]:
import spacy 
nlp = spacy.load('en_core_web_sm')

In [48]:
# Note: this rebinds the name `stopwords`, which was imported from nltk.corpus earlier
stopwords = spacy.lang.en.stop_words.STOP_WORDS

# Change lowercase 'i' to uppercase 'I'
stopwords = stopwords - {'i'}
stopwords.add('I')

# Keep 'move' as a content word by dropping it from the stop-word set
stopwords = stopwords - {'move'}

def clean_data(doc):
    doc = doc.lower()
    doc = nlp(doc)
    # Lemmatize words 
    lemmas = [token.lemma_ for token in doc]
    # Removing non-alphabetic characters and stopwords
    tokens = [lemma for lemma in lemmas if lemma.isalpha() and lemma not in stopwords]
    cleaned_doc = " ".join(tokens)
    
    return cleaned_doc

In [46]:
stopwords

{"'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 'I',
 'a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amount',
 'an',
 'and',
 'another',
 'any',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'are',
 'around',
 'as',
 'at',
 'back',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'both',
 'bottom',
 'but',
 'by',
 'ca',
 'call',
 'can',
 'cannot',
 'could',
 'did',
 'do',
 'does',
 'doing',
 'done',
 'down',
 'due',
 'during',
 'each',
 'eight',
 'either',
 'eleven',
 'else',
 'elsewhere',
 'empty',
 'enough',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'everywhere',
 'except',
 'few',
 'fifteen',
 'fifty',
 'first',
 'five',
 'for',
 'former',
 'formerly',
 'forty',
 'four',
 'from',


In [61]:
sample_text = reviews_subset.loc[2, "text"]
clean_text = clean_data(sample_text) 
print_text(sample_text, clean_text)
print('--------')
print('Original word count:', word_count(sample_text))
print('Word count:', word_count(clean_text))

Before: [[VIDEOID:c87e962bc893a948856b0f1b285ce6cc]] I wanted to love this bc I previously bought a matching turquoise teapot, but the loose lid (defective or design flaw? Idk) on the cups is driving me batty. I’m disabled so my gait is not great to begin with & the lid just bangs non-stop while I walk from my kitchen to wherever I’m going with my tea. It’s incredibly annoying.  I had hoped it was just a one-off so I purchased it in another color & sadly it has the same exact problem.  They could fix the problem by adding a rubber gasket or flange to the lid imo & I even thought of doing so myself until I accidentally knocked the cup over due to a design flaw that has a small base on the cup.  I like the lid bc I run a fan continuously & I live with 2 service dogs so I like to keep my drinks covered beyond just the brew times so I really hope they update this cup bc it does keep the tea warm & the size is perfect for a 2 cup brew.<br /><br />I wish they would fix the obvious design fla

## Part-of-speech tagging - spaCy

In [58]:
def tagging_parts_of_speech(doc):
    doc = nlp(doc)
    pos = [(token.text, token.pos_) for token in doc]

    return pos

In [60]:
sample_text = reviews_subset.loc[2, "text"]
clean_text = tagging_parts_of_speech(sample_text) 
print_text(sample_text, clean_text)

Before: [[VIDEOID:c87e962bc893a948856b0f1b285ce6cc]] I wanted to love this bc I previously bought a matching turquoise teapot, but the loose lid (defective or design flaw? Idk) on the cups is driving me batty. I’m disabled so my gait is not great to begin with & the lid just bangs non-stop while I walk from my kitchen to wherever I’m going with my tea. It’s incredibly annoying.  I had hoped it was just a one-off so I purchased it in another color & sadly it has the same exact problem.  They could fix the problem by adding a rubber gasket or flange to the lid imo & I even thought of doing so myself until I accidentally knocked the cup over due to a design flaw that has a small base on the cup.  I like the lid bc I run a fan continuously & I live with 2 service dogs so I like to keep my drinks covered beyond just the brew times so I really hope they update this cup bc it does keep the tea warm & the size is perfect for a 2 cup brew.<br /><br />I wish they would fix the obvious design fla

## Expanding slang abbreviations

1. Using BeautifulSoup, extract the list of text abbreviations and acronyms from slicktext.com (https://www.slicktext.com/blog/2019/02/text-abbreviations-guide/).
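Once a mapping of abbreviations is available, applying it to review text could look like the sketch below. The `SLANG` dictionary is a hypothetical hand-written stand-in for the scraped list, and `expand_slang` is an illustrative name; the abbreviations are ones that appear in the sample reviews above.

```python
import re

# Hypothetical mapping; in the notebook this would come from the scraped table
SLANG = {"bc": "because", "imo": "in my opinion", "idk": "I don't know", "tho": "though"}

# One case-insensitive pattern that matches any abbreviation as a whole word
SLANG_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, SLANG)) + r")\b", re.IGNORECASE)


def expand_slang(text):
    """Replace each known abbreviation with its expanded form."""
    return SLANG_PATTERN.sub(lambda m: SLANG[m.group(0).lower()], text)


print(expand_slang("I wanted to love this bc the lid needs a gasket imo. Idk, tho."))
```

The word boundaries keep the pattern from rewriting letters inside longer words, and running this before lowercasing and punctuation removal keeps the expanded forms intact for the later cleaning steps.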

In [63]:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from bs4 import BeautifulSoup
import logging

In [None]:
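def retrieve_info(url):
    """
    This function retrieves data from the website and creates a DataFrame.
    """
    retry = Retry(total = 5,
                  backoff_factor = 0.5,
                  status_forcelist = [429, 500, 502, 503, 504])
    adapter = HTTPAdapter(max_retries = retry)
    # Mount the adapter on a session so the retry policy is actually applied
    session = requests.Session()
    session.mount('https://', adapter)
    session.mount('http://', adapter)

    results = []
    try:
        response = session.get(url, timeout = 5)
        if response.status_code == 200:
            # Name the parser explicitly instead of relying on bs4's default choice
            soup = BeautifulSoup(response.text, 'html.parser')

            # Main html element where data is located 
            content = soup.find('div', id="content")

            if content:
                tables = content.find_all('table')
                # Assumption: each table row keeps its values in <td> cells
                for table in tables:
                    for row in table.find_all('tr'):
                        cells = [cell.get_text(strip=True) for cell in row.find_all('td')]
                        if cells:
                            results.append(cells)
            else: 
                print(f'Error retrieving table element from {url}')
        else: 
            print(f'Error retrieving {url}')

    except requests.RequestException as e:
        print(f'Error retrieving data from {url}: {e}')

    # Returns an empty DataFrame if nothing could be retrieved
    return pd.DataFrame(results)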

In [67]:
URL = 'https://www.slicktext.com/blog/2019/02/text-abbreviations-guide/'

response = requests.get(URL)
print(response.status_code)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())

200
<!DOCTYPE html>
<html lang="en-US" prefix="og: http://ogp.me/ns#" xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width" name="viewport"/>
  <link href="http://gmpg.org/xfn/11" rel="profile"/>
  <link href="https://www.slicktext.com/blog/xmlrpc.php" rel="pingback"/>
  <link href="/images/favicon.ico" rel="icon" type="image/x-icon"/>
  <meta content="en-US" http-equiv="content-language"/>
  <meta content="en-US" name="language"/>
  <title>
   100+ Text Abbreviations and How To Use Them [UPDATED] | SlickText
  </title>
  <!-- Social Warfare v3.3.3 https://warfareplugins.com -->
  <style>
   @font-face {font-family: "sw-icon-font";src:url("https://www.slicktext.com/blog/wp-content/plugins/social-warfare/assets/fonts/sw-icon-font.eot?ver=3.3.3");src:url("https://www.slicktext.com/blog/wp-content/plugins/social-warfare/assets/fonts/sw-icon-font.eot?ver=3.3.3#iefix") format("embedded-opentype"),url("https://www.slicktext.com/blog/wp-c