# Data Preprocessing

Data preprocessing is an essential step in natural language processing (NLP). It covers cleaning, formatting, and transforming raw text so that NLP algorithms can work with it. Common preprocessing tasks include:
- Tokenization: breaking a piece of text into individual words or tokens. This is typically done by splitting the text on whitespace and punctuation, and then applying additional rules to handle words that are joined together (such as contractions or compound nouns).
- Removing stop words: stop words are very common words (such as "the," "and," or "but") that are usually filtered out before analysis because they add little meaning to the text. Removing them reduces the size of the dataset and makes it easier to identify the most important words and phrases.
- Stemming and lemmatization: both techniques reduce inflected words (such as verbs in different tenses) to a base form. This groups together the different forms of the same word, which can improve the performance of NLP algorithms.
- Encoding the text: NLP algorithms typically require text to be encoded as numbers rather than as raw strings. Options include one-hot encoding, which creates a binary feature for each word in the vocabulary, and word embeddings, which represent each word as a dense vector of numbers.

Taken together, these steps reduce noise and vocabulary size, which makes the text more suitable for analysis and can improve the accuracy of NLP algorithms.
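
The tasks listed above can be sketched with the standard library alone. This is a minimal illustration rather than the pipeline used in this notebook: the stop-word list, the example sentence, and the helper names `tokenize`, `remove_stop_words`, and `one_hot` are all assumptions made for the example.

```python
import re

# Tiny illustrative stop-word list (an assumption, not the gensim list used later)
STOP_WORDS = {"the", "and", "but", "is"}

def tokenize(text):
    """Lower-case the text and split it into runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def remove_stop_words(tokens):
    """Drop every token that appears in STOP_WORDS."""
    return [token for token in tokens if token not in STOP_WORDS]

def one_hot(tokens, vocabulary):
    """Encode each token as a binary vector with a single 1 at its vocabulary index."""
    index = {word: i for i, word in enumerate(vocabulary)}
    vectors = []
    for token in tokens:
        vector = [0] * len(vocabulary)
        vector[index[token]] = 1
        vectors.append(vector)
    return vectors

tokens = tokenize("The latte is hot, but the muffin is cold.")
content = remove_stop_words(tokens)    # ['latte', 'hot', 'muffin', 'cold']
vocabulary = sorted(set(content))      # ['cold', 'hot', 'latte', 'muffin']
encoded = one_hot(content, vocabulary)
print(encoded[0])                      # vector for 'latte': [0, 0, 1, 0]
```

One-hot vectors grow with the vocabulary size, which is one reason dense word embeddings are often preferred for larger corpora.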

In [1]:
# Load required packages
import re
import pandas as pd
import gensim
import spacy
import swifter
from nltk.tokenize import word_tokenize
from gensim.parsing.preprocessing import STOPWORDS

from tqdm import tqdm

nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])
tqdm.pandas()

To reduce processing time, only a subset of the tweets is analyzed: 4,000 tweets are selected at random and used for the rest of the analysis.

In [2]:
# Load the data and select a subset of 4000 observations
df = pd.read_hdf('./../../code/data/starbucks/data.h5', key='starbucks')
df = df.sample(n=4000, random_state=42)
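
Fixing `random_state` makes the random subset reproducible across runs. The sketch below shows this on a made-up frame; the `population` data and the `tweet_id` column are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the full tweet table
population = pd.DataFrame({"tweet_id": range(100)})

# Sampling twice with the same seed selects the same rows in the same order
first = population.sample(n=10, random_state=42)
second = population.sample(n=10, random_state=42)

print(first.equals(second))          # True
print(first["tweet_id"].is_unique)   # True: sampling is without replacement by default
```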

In [3]:
# Confirm the shape of the data
df.shape

(4000, 28)

Based on the observations from the exploratory data analysis, the following preliminary preprocessing steps are performed:
1. Remove duplicate tweets
2. Remove blank tweets
3. Keep only the tweets posted in English
4. Remove tweets from official company handles

In [4]:
# Remove duplicates
df.drop_duplicates('tweet', inplace=True)

In [5]:
# Remove rows whose tweet text is missing
df.dropna(subset=['tweet'], inplace=True)

In [6]:
print(f"After removing duplicate and blank tweets, {df.shape[0]} tweets remain in the data.")

After removing duplicate and blank tweets, 3658 tweets remain in the data.


In [7]:
# Select observations posted in the English language
df = df[df['language'] == 'en']

In [8]:
# Remove tweets posted by official company handles
df = df[~df['username'].isin(['starbucks_es', 'starbucks_cstm', 'starbucks_j_cpg', 'starbuckspoho',
                                'starbuckshomeme', 'starbucksperu'])]

In [9]:
print(f"After removing duplicate and blank tweets, keeping only English tweets, and removing tweets from official handles, {df.shape[0]} tweets remain in the data.")

After removing duplicate and blank tweets, keeping only English tweets, and removing tweets from official handles, 2770 tweets remain in the data.
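
The effect of each filtering step can be checked on a small made-up frame. The sketch below reuses the column names from this notebook (`tweet`, `language`, `username`), but the rows and the single-entry `official_handles` list are invented for illustration.

```python
import pandas as pd

# Toy data: one duplicate, one missing tweet, one non-English tweet, one official handle
toy = pd.DataFrame({
    "tweet": ["Love my latte", "Love my latte", None,
              "Me encanta el café", "New menu today", "Too sweet"],
    "language": ["en", "en", "en", "es", "en", "en"],
    "username": ["alice", "bob", "carol", "dave", "starbucksperu", "erin"],
})
official_handles = ["starbucksperu"]

step1 = toy.drop_duplicates(subset="tweet")                   # drop repeated tweet text
step2 = step1.dropna(subset=["tweet"])                        # drop missing tweet text
step3 = step2[step2["language"] == "en"]                      # keep English tweets
step4 = step3[~step3["username"].isin(official_handles)]      # drop official handles

counts = [len(toy), len(step1), len(step2), len(step3), len(step4)]
print(counts)                   # [6, 5, 4, 3, 2]
print(step4["tweet"].tolist())  # ['Love my latte', 'Too sweet']
```

Printing the row count after every step, as done here, makes it easy to see which filter removes the most data.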


Before any algorithm is applied to the data, the tweet text needs to be cleaned. The following preprocessing steps make the tweets usable:
1. Convert the tweets to lower case
2. Remove any tagged users
3. Remove hashtags from the tweets
4. Remove links from the tweets
5. Remove any special characters and retain only alphanumeric data
6. Perform lemmatization
7. Remove the stop words

In [10]:
def preprocess_tweet(text):
    """Clean a raw tweet: lower-case it and strip mentions, hashtags, links and special characters"""
    text = str(text).lower() # Convert the tweet to lower case
    text = re.sub(r'@ *\w*', '', text) # Remove tagged usernames
    text = re.sub(r'#\w+', '', text) # Remove hashtags
    text = re.sub(r'\n', ' ', text) # Replace newline characters with spaces
    text = re.sub('\xa0', ' ', text) # Replace non-breaking spaces ("\xa0") with regular spaces
    text = re.sub(r'&amp;?', ' ', text) # Replace the HTML entity "&amp;" with a space
    text = re.sub(r'http\S+', '', text) # Remove links
    text = re.sub(r'[^A-Za-z0-9 ]+', '', text) # Keep only alphanumeric characters and spaces
    return text

df.loc[:, 'preprocessed_tweet'] = df['tweet'].apply(preprocess_tweet)
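
To see what the cleaning step does to a single tweet, the sketch below follows the same sequence of substitutions on a made-up example. The function name `clean_tweet` and the sample text are assumptions, and the final whitespace collapse is an extra step added here only to make the result easier to read.

```python
import re

def clean_tweet(text):
    """Illustrative cleaner following the same order of substitutions as the notebook."""
    text = str(text).lower()                    # lower-case
    text = re.sub(r'@ *\w*', '', text)          # tagged usernames
    text = re.sub(r'#\w+', '', text)            # hashtags
    text = re.sub(r'\n', ' ', text)             # newlines become spaces
    text = re.sub('\xa0', ' ', text)            # non-breaking spaces become spaces
    text = re.sub(r'&amp;?', ' ', text)         # HTML-escaped ampersand
    text = re.sub(r'http\S+', '', text)         # links
    text = re.sub(r'[^A-Za-z0-9 ]+', '', text)  # anything that is not alphanumeric or a space
    return " ".join(text.split())               # extra step (assumption): collapse whitespace

sample = "@Starbucks I LOVE the new #PumpkinSpice latte &amp; cake!\nhttps://t.co/abc123"
cleaned = clean_tweet(sample)
print(cleaned)  # i love the new latte cake
```

Note that removing hashtags also discards the words inside them (here "PumpkinSpice"), which may or may not be desirable depending on the analysis.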

In [11]:
# Remove tweets that are empty or contain only whitespace after preprocessing
df = df[df['preprocessed_tweet'].str.strip().astype(bool)]

Lemmatization reduces inflected words to their base form, or lemma. It works by identifying the part of speech and the context of a word and then applying rules to map the word to its lemma. For example, the lemma of "was" is "be," and the lemma of "better" is "good."

Lemmatization is similar to stemming, which also reduces words to a base form. The difference is that stemming simply strips suffixes and may produce a string that is not a valid word, whereas lemmatization takes the context of the word into account and always returns a valid lemma, which generally makes it more accurate. Grouping the different forms of a word under one lemma can improve the performance of NLP algorithms.
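
The contrast between the two approaches can be illustrated with a deliberately crude sketch. Neither function below is a real stemmer or lemmatizer: `naive_stem` strips a handful of suffixes, and `toy_lemma` looks words up in a hand-written table that stands in for the dictionary and part-of-speech analysis a library such as spaCy performs. Both names and the word list are assumptions for illustration.

```python
def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in ("ing", "ies", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hand-written lookup table standing in for a real lemmatizer
TOY_LEMMAS = {"studies": "study", "running": "run", "was": "be", "better": "good"}

def toy_lemma(word):
    """Return the lemma from the lookup table, or the word itself if it is unknown."""
    return TOY_LEMMAS.get(word, word)

for word in ("studies", "running", "was", "better"):
    print(f"{word:8} stem={naive_stem(word):8} lemma={toy_lemma(word)}")
# "studies" stems to "stud" and "running" to "runn", neither of which is a word,
# while the lookup returns "study" and "run". Suffix stripping leaves "was" and
# "better" unchanged, so it cannot relate them to "be" and "good".
```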

In [12]:
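# Perform lemmatization
def lemmatization(text):
    """Lemmatize a sentence and return the lemmas joined by spaces"""
    # spaCy v2 returns '-PRON-' as the lemma of pronouns; keep the lower-cased pronoun instead
    return " ".join([token.lemma_ if token.lemma_ != '-PRON-' else token.lower_ for token in nlp(text)])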

df['preprocessed_tweet'] = df['preprocessed_tweet'].swifter.apply(lemmatization)

In [13]:
# Remove stopwords
# Extend gensim's stop-word list with the brand name; built once instead of on every call
CUSTOM_STOPWORDS = STOPWORDS.union({'starbucks', 'starbuck'})

def remove_stopwords(text):
    """Remove stop words from the tweet text"""
    text_tokens = word_tokenize(text)
    tokens_without_sw = [word for word in text_tokens if word not in CUSTOM_STOPWORDS]
    return " ".join(tokens_without_sw)

df['preprocessed_tweet'] = df['preprocessed_tweet'].swifter.apply(remove_stopwords)

In [14]:
# Check the sample data after performing all the preprocessing steps
df[['tweet', 'preprocessed_tweet']]

| Unnamed: 0 | tweet | preprocessed_tweet |
| --- | --- | --- |
| 477970 | So now you up 3am for work. These boomers 🤬 th... | 3 a.m. work boomer wake early bullshxt stop bu... |
| 362474 | @ScottPalmer61 Yes. It’s a special Starbucks a... | yes special attachment |
| 312427 | I like the caramel frappe from Starbucks https... | like caramel frappe |
| 71318 | Why is No Time To Die playing on repeat at #St... | time die playing repeat notice |
| 412468 | https://t.co/tgR9Z5p8Ts \| Criminals steal 200 ... | criminal steal 200 000 customer datum singapor... |
| ... | ... | ... |
| 38551 | @ShevyShevrolet @HAWTToys @McFaul @elonmusk On... | month solve profs dilemma |
| 441914 | If @Starbucks and @Delta are doing this I real... | hope |
| 744590 | #Starbucks is taking its first steps into the ... | step partnership build blockchainbase loyalty ... |
| 31846 | @notany1youllkno @THETOMMYDREAMER @Starbucks @... | feel ok sure maga trump proud rock maga brother |


In [15]:
# Save the preprocessed data for further use
df.to_hdf('./../../code/data/starbucks/data.h5', key='preprocessed_starbucks')

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed-integer,key->block1_values] [items->Index(['Unnamed: 0', 'tweet', 'conversation_id', 'date', 'hashtags',
       'inReplyToTweetId', 'reply_to', 'language', 'likes_count', 'media',
       'mentions', 'quoted_tweet', 'retweets_count', 'link',
       'user_status_count', 'location', 'name', 'description', 'verified',
       'url', 'user_id', 'username', 'preprocessed_tweet'],
      dtype='object')]

  df.to_hdf('./../../code/data/starbucks/data.h5', key='preprocessed_starbucks')
