https://www.kaggle.com/code/nishkoder/customer-support-on-twitter-data-preprocessing

# Data Description 
The dataset is a CSV, where each row is a tweet. The different columns are described below. Every conversation included has at least one request from a consumer and at least one response from a company. Which user IDs are company user IDs can be calculated using the inbound field.

- **tweet_id**

A unique, anonymized ID for the Tweet. Referenced by response_tweet_id and in_response_to_tweet_id.

- **author_id**

A unique, anonymized user ID. @s in the dataset have been replaced with their associated anonymized user ID.

- **inbound**

Whether the tweet is "inbound" to a company doing customer support on Twitter. This feature is useful when re-organizing data for training conversational models.

- **created_at**

Date and time when the tweet was sent.

- **text**

Tweet content. Sensitive information such as phone numbers and email addresses is replaced with mask values like `__email__`.

- **response_tweet_id**

IDs of tweets that are responses to this tweet, comma-separated.

- **in_response_to_tweet_id**

ID of the tweet this tweet is in response to, if any.
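
To make the role of `inbound` and `in_response_to_tweet_id` concrete, here is a minimal sketch that pairs each company reply with the customer tweet it answers. The rows are invented toy data (not taken from the dataset), and `pair_requests_with_replies` is a hypothetical helper written only for this illustration.

```python
def pair_requests_with_replies(rows):
    """Return (customer_text, company_text) pairs from a list of tweet dicts."""
    by_id = {row["tweet_id"]: row for row in rows}
    pairs = []
    for row in rows:
        # Only company tweets (inbound == False) can be replies to customers
        if row["inbound"]:
            continue
        parent = by_id.get(row["in_response_to_tweet_id"])
        # Keep the pair only if the parent exists and was written by a customer
        if parent is not None and parent["inbound"]:
            pairs.append((parent["text"], row["text"]))
    return pairs


# Toy rows, made up for illustration
toy_rows = [
    {"tweet_id": 1, "inbound": True, "text": "my phone has no signal",
     "in_response_to_tweet_id": None},
    {"tweet_id": 2, "inbound": False, "text": "sorry to hear that, please DM us",
     "in_response_to_tweet_id": 1},
    {"tweet_id": 3, "inbound": True, "text": "done",
     "in_response_to_tweet_id": 2},
]

pairs = pair_requests_with_replies(toy_rows)
print(pairs)
```

The same idea could be expressed as a pandas self-merge on `tweet_id` and `in_response_to_tweet_id`; the plain-dict version is used here only to keep the sketch dependency-free.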

In [1]:
import re
import string
import numpy as np
import pandas as pd

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from bs4 import BeautifulSoup
from textblob import TextBlob
from sklearn.preprocessing import OneHotEncoder
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer



def read_csv_with_limit(filename, nrows=500):
    """
    Read a CSV file with pandas and limit the number of rows.

    Parameters:
    filename (str): The path to the CSV file.
    nrows (int): The maximum number of rows to read.

    Returns:
    pandas.DataFrame: The DataFrame containing the data from the CSV file.
    """
    twcs = pd.read_csv(filename, nrows=nrows)
    return twcs

# Example usage:
twcs = read_csv_with_limit('twcs.csv')


- Data Pre-processing for Text Column in twcs dataset

In [2]:
twcs.tail()

Unnamed: 0,tweet_id,author_id,inbound,created_at,text,response_tweet_id,in_response_to_tweet_id
495,802,Delta,False,Tue Oct 31 22:23:07 +0000 2017,"@115884 Oh, no! Please speak to a member of th...",,803.0
496,803,115884,True,Tue Oct 31 21:33:27 +0000 2017,.@delta this has been my inflight studio exper...,802.0,
497,804,Delta,False,Tue Oct 31 22:22:01 +0000 2017,@115885 2/2 https://t.co/6iDGBJAc2m,805806.0,807.0
498,805,115885,True,Tue Oct 31 22:39:33 +0000 2017,@Delta Is that not what I’ve done already?,,804.0
499,806,115885,True,Tue Oct 31 22:39:44 +0000 2017,@Delta Can you reply on the DM thread?,,804.0


In [3]:
# checking __email__ masking in the text column
rows_with_email_string = twcs[twcs['text'].apply(lambda x: re.search(r'__email__', x) is not None)]
rows_with_email_string.head()

Unnamed: 0,tweet_id,author_id,inbound,created_at,text,response_tweet_id,in_response_to_tweet_id
254,347,115797,True,Tue Oct 31 01:57:03 +0000 2017,@AirAsiaSupport lost my booking # chua loreen ...,343,348.0


 - Convert text column into string type

In [4]:
def preprocess_dataframe(df, text_column):
    """
    Keep only the given text column of the DataFrame and convert it to string type.

    Parameters:
    df (pandas.DataFrame): The DataFrame to preprocess.
    text_column (str): The name of the text column to keep.

    Returns:
    pandas.DataFrame: A copy containing only the text column, converted to string type.
    """
    df = df[[text_column]].copy()
    df[text_column] = df[text_column].astype(str)  # Convert the text column to string type
    return df


twcs = preprocess_dataframe(twcs,'text')
twcs.head()


Unnamed: 0,text
0,@115712 I understand. I would like to assist y...
1,@sprintcare and how do you propose we do that
2,@sprintcare I have sent several private messag...
3,@115712 Please send us a Private Message so th...
4,@sprintcare I did.


- Remove HTML, URLs, and punctuation
- Convert all text to lowercase

In [5]:
def preprocess_text(text):
    """
    Preprocess the text by removing HTML tags, URLs, punctuation, and converting to lowercase.

    Parameters:
    text (str): The text to preprocess.

    Returns:
    str: The preprocessed text.
    """
    # Remove HTML tags
    text = BeautifulSoup(text, "html.parser").get_text()
    
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    
    # Convert to lowercase
    text = text.lower()
    
    return text

def preprocess_dataframe(df, text_column):
    """
    Preprocess the specified text column of the DataFrame.

    Parameters:
    df (pandas.DataFrame): The DataFrame to preprocess.
    text_column (str): The name of the text column to preprocess.

    Returns:
    pandas.DataFrame: The DataFrame with the specified text column preprocessed.
    """
    df[text_column] = df[text_column].apply(preprocess_text)
    return df

twcs_preprocessed = preprocess_dataframe(twcs, 'text')
twcs_preprocessed.head(10)




Unnamed: 0,text
0,115712 i understand i would like to assist you...
1,sprintcare and how do you propose we do that
2,sprintcare i have sent several private message...
3,115712 please send us a private message so tha...
4,sprintcare i did
5,115712 can you please send us a private messag...
6,sprintcare is the worst customer service
7,115713 this is saddening to hear please shoot ...
8,sprintcare you gonna magically change your con...
9,115713 we understand your concerns and wed lik...
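
One side effect of the punctuation step is visible in row 9, where `we'd` has become `wed`. `string.punctuation` covers ASCII punctuation only, so straight apostrophes are stripped while curly ones (`’`) are left in place. The sketch below illustrates both cases; `normalize_apostrophes` is a hypothetical helper, not part of the notebook.

```python
import string


def strip_ascii_punctuation(text):
    # Same technique as preprocess_text: delete every ASCII punctuation character
    return text.translate(str.maketrans('', '', string.punctuation))


def normalize_apostrophes(text):
    # Hypothetical helper: map curly single quotes to the straight apostrophe
    return text.replace('\u2018', "'").replace('\u2019', "'")


straight = strip_ascii_punctuation("we'd like to help")
curly = strip_ascii_punctuation("is that not what i\u2019ve done")
normalized = normalize_apostrophes("i\u2019ve done")

print(straight)    # the straight apostrophe is removed
print(curly)       # the curly apostrophe survives
print(normalized)  # curly apostrophe replaced by a straight one
```

If contractions are meant to be expanded later, it may be worth normalizing apostrophes and expanding contractions before punctuation is removed.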


- Remove Numbers 

In [6]:
def remove_numbers(text):
    # Remove numbers using regular expression
    text_without_numbers = re.sub(r'\d+', '', text)
    return text_without_numbers

# Apply the remove_numbers function to the 'text' column of the DataFrame
twcs_preprocessed['text'] = twcs_preprocessed['text'].apply(remove_numbers)

# Display the updated DataFrame
print(twcs_preprocessed['text'])

0       i understand i would like to assist you we wo...
1           sprintcare and how do you propose we do that
2      sprintcare i have sent several private message...
3       please send us a private message so that we c...
4                                       sprintcare i did
                             ...                        
495     oh no please speak to a member of the flt cre...
496    delta this has been my inflight studio experie...
497                                                     
498             delta is that not what i’ve done already
499                 delta can you reply on the dm thread
Name: text, Length: 500, dtype: object


- Handle Contractions

In [7]:
contraction_mapping = {
    "ain't": "is not",
    "aren't": "are not",
    "can't": "cannot",
    "can't've": "cannot have",
    "'cause": "because",
    "could've": "could have",
    "couldn't": "could not",
    "couldn't've": "could not have",
    "didn't": "did not",
    "doesn't": "does not",
    "don't": "do not",
    "hadn't": "had not",
    "hadn't've": "had not have",
    "hasn't": "has not",
    "haven't": "have not",
    "he'd": "he would",
    "he'd've": "he would have",
    "he'll": "he will",
    "he'll've": "he will have",
    "he's": "he is",
    "how'd": "how did",
    "how'd'y": "how do you",
    "how'll": "how will",
    "how's": "how is",
    "I'd": "I would",
    "I'd've": "I would have",
    "I'll": "I will",
    "I'll've": "I will have",
    "I'm": "I am",
    "I've": "I have",
    "i'd": "i would",
    "i'd've": "i would have",
    "i'll": "i will",
    "i'll've": "i will have",
    "i'm": "i am",
    "i've": "i have",
    "isn't": "is not",
    "it'd": "it would",
    "it'd've": "it would have",
    "it'll": "it will",
    "it'll've": "it will have",
    "it's": "it is",
    "let's": "let us",
    "ma'am": "madam",
    "mayn't": "may not",
    "might've": "might have",
    "mightn't": "might not",
    "mightn't've": "might not have",
    "must've": "must have",
    "mustn't": "must not",
    "mustn't've": "must not have",
    "needn't": "need not",
    "needn't've": "need not have",
    "o'clock": "of the clock",
    "oughtn't": "ought not",
    "oughtn't've": "ought not have",
    "shan't": "shall not",
    "sha'n't": "shall not",
    "shan't've": "shall not have",
    "she'd": "she would",
    "she'd've": "she would have",
    "she'll": "she will",
    "she'll've": "she will have",
    "she's": "she is",
    "should've": "should have",
    "shouldn't": "should not",
    "shouldn't've": "should not have",
    "so've": "so have",
    "so's": "so is",
    "that'd": "that would",
    "that'd've": "that would have",
    "that's": "that is",
    "there'd": "there would",
    "there'd've": "there would have",
    "there's": "there is",
    "they'd": "they would",
    "they'd've": "they would have",
    "they'll": "they will",
    "they'll've": "they will have",
    "they're": "they are",
    "they've": "they have",
    "to've": "to have",
    "wasn't": "was not",
    "we'd": "we would",
    "we'd've": "we would have",
    "we'll": "we will",
    "we'll've": "we will have",
    "we're": "we are",
    "we've": "we have",
    "weren't": "were not",
    "what'll": "what will",
    "what'll've": "what will have",
    "what're": "what are",
    "what's": "what is",
    "what've": "what have",
    "when's": "when is",
    "when've": "when have",
    "where'd": "where did",
    "where's": "where is",
    "where've": "where have",
    "who'll": "who will",
    "who'll've": "who will have",
    "who's": "who is",
    "who've": "who have",
    "why's": "why is",
    "why've": "why have",
    "will've": "will have",
    "won't": "will not",
    "won't've": "will not have",
    "would've": "would have",
    "wouldn't": "would not",
    "wouldn't've": "would not have",
    "y'all": "you all",
    "y'all'd": "you all would",
    "y'all'd've": "you all would have",
    "y'all're": "you all are",
    "y'all've": "you all have",
    "you'd": "you would",
    "you'd've": "you would have",
    "you'll": "you will",
    "you'll've": "you will have",
    "you're": "you are",
    "you've": "you have",
    "tis": "it is",
    "twas": "it was"
}

def expand_contractions(text, contraction_mapping):
    # Try longer keys first so that "can't've" is not cut short by "can't"
    keys = sorted(contraction_mapping, key=len, reverse=True)
    # Escape each key and require a non-word character on both sides,
    # so that "tis" does not match inside words such as "artist"
    contractions_pattern = re.compile(
        r'(?<!\w)(?:{})(?!\w)'.format('|'.join(re.escape(key) for key in keys)),
        flags=re.IGNORECASE)

    def expand_match(contraction):
        match = contraction.group(0)
        # Look the match up in lowercase; unknown matches are left unchanged
        return contraction_mapping.get(match.lower(), match)

    # Replace contractions with their expansions
    expanded_text = contractions_pattern.sub(expand_match, text)
    return expanded_text

# Apply the expand_contractions function to the 'text' column of the DataFrame
twcs_preprocessed['text'] = twcs_preprocessed['text'].apply(expand_contractions, contraction_mapping=contraction_mapping)

# Display the updated DataFrame
print(twcs_preprocessed['text'])


0       i understand i would like to assist you we wo...
1           sprintcare and how do you propose we do that
2      sprintcare i have sent several private message...
3       please send us a private message so that we c...
4                                       sprintcare i did
                             ...                        
495     oh no please speak to a member of the flt cre...
496    delta this has been my inflight studio experie...
497                                                     
498             delta is that not what i’ve done already
499                 delta can you reply on the dm thread
Name: text, Length: 500, dtype: object
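
The output above is identical to the output of the previous step: straight apostrophes were already stripped with the punctuation, and curly ones (as in row 498) are not keys of the mapping. As a rough sketch of an alternative ordering, the snippet below expands contractions before removing punctuation. The three-entry `small_mapping` and the function `expand_then_clean` are assumptions made for this example only.

```python
import re
import string

# Tiny illustrative subset of a contraction mapping (hypothetical)
small_mapping = {"can't": "cannot", "i've": "i have", "we'd": "we would"}

# Longest keys first, with non-word characters required on both sides
keys = sorted(small_mapping, key=len, reverse=True)
pattern = re.compile(
    r"(?<!\w)(?:" + "|".join(re.escape(k) for k in keys) + r")(?!\w)")


def expand_then_clean(text):
    # 1. Normalize curly apostrophes and lowercase the text
    text = text.replace("\u2019", "'").lower()
    # 2. Expand contractions while the apostrophes are still present
    text = pattern.sub(lambda m: small_mapping[m.group(0)], text)
    # 3. Only now remove ASCII punctuation
    return text.translate(str.maketrans('', '', string.punctuation))


result = expand_then_clean("We'd love to help, but I\u2019ve said we can't.")
print(result)
```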


- Checking emoji in text column of dataset

In [8]:

def has_emoji(text):
    """
    Check if the text contains emojis.

    Parameters:
    text (str): The text to check.

    Returns:
    bool: True if the text contains emojis, False otherwise.
    """
    emoji_pattern = re.compile("["
                               "\U0001F600-\U0001F64F"  # emoticons
                               "\U0001F300-\U0001F5FF"  # symbols & pictographs
                               "\U0001F680-\U0001F6FF"  # transport & map symbols
                               "\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
                               "\U00002500-\U00002BEF"  # box drawing, shapes, misc symbols, arrows
                               "\U00002702-\U000027B0"  # dingbats
                               "\U000024C2-\U0001F251"  # very broad: also covers CJK and other non-emoji characters
                               "\U0001f926-\U0001f937"
                               "\U00010000-\U0010ffff"  # all supplementary planes
                               "\u2640-\u2642"
                               "\u2600-\u2B55"
                               "\u200d"  # zero-width joiner
                               "\u23cf"
                               "\u23e9"
                               "\u231a"
                               "\ufe0f"  # variation selector-16
                               "\u3030"
                               "]+", flags=re.UNICODE)
    return bool(emoji_pattern.search(text))



twcs_preprocessed['has_emoji'] = twcs_preprocessed['text'].apply(has_emoji)
print(twcs_preprocessed[twcs_preprocessed['has_emoji'] == True])


                                                  text  has_emoji
8    sprintcare you gonna magically change your con...       True
31   somebody from verizonsupport please help meeee...       True
62   chipotletweets messed up today and didn’t give...       True
75   happy halloween since im too old to trick or t...       True
77   chipotletweets thank you chipotletweets for re...       True
78   so frustrated with chipotletweets 😡 ordered di...       True
80   btw chipotletweets giving out  burritos if you...       True
90    burritos and i’m nowhere near a chipotletweet...       True
91                                       noted 😊 becky       True
95   considering walking to chipotletweets in my ll...       True
108  we had to have a count colin 💛 marksandspencer...       True
110  ‘ere marksandspencer never mind avocado 🥑  how...       True
116  marksandspencer aren’t require charge  a bag\n...       True
118   love the aesthetic of your colin the creeperp...       True
119  excel

- Remove emojis 

In [9]:
def remove_emoji(text):
    """
    Remove emojis from the text.

    Parameters:
    text (str): The text to remove emojis from.

    Returns:
    str: The text with emojis removed.
    """
    emoji_pattern = re.compile("["
                               "\U0001F600-\U0001F64F"  # emoticons
                               "\U0001F300-\U0001F5FF"  # symbols & pictographs
                               "\U0001F680-\U0001F6FF"  # transport & map symbols
                               "\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
                               "\U00002500-\U00002BEF"  # box drawing, shapes, misc symbols, arrows
                               "\U00002702-\U000027B0"  # dingbats
                               "\U000024C2-\U0001F251"  # very broad: also covers CJK and other non-emoji characters
                               "\U0001f926-\U0001f937"
                               "\U00010000-\U0010ffff"  # all supplementary planes
                               "\u2640-\u2642"
                               "\u2600-\u2B55"
                               "\u200d"  # zero-width joiner
                               "\u23cf"
                               "\u23e9"
                               "\u231a"
                               "\ufe0f"  # variation selector-16
                               "\u3030"
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub('', text)


# Remove emojis from the 'text' column of twcs_preprocessed using the function above
twcs_preprocessed['text'] = twcs_preprocessed['text'].apply(remove_emoji)
print(twcs_preprocessed)


                                                  text  has_emoji
0     i understand i would like to assist you we wo...      False
1         sprintcare and how do you propose we do that      False
2    sprintcare i have sent several private message...      False
3     please send us a private message so that we c...      False
4                                     sprintcare i did      False
..                                                 ...        ...
495   oh no please speak to a member of the flt cre...      False
496  delta this has been my inflight studio experie...      False
497                                                         False
498           delta is that not what i’ve done already      False
499               delta can you reply on the dm thread      False

[500 rows x 2 columns]
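
A caveat on the character class used above: the range `\U000024C2-\U0001F251` spans far more than emojis, so non-Latin text (for example Chinese characters) is also flagged and removed. The sketch below contrasts that broad range with a narrower one. The narrower range is an assumption chosen for illustration and is not a complete definition of emoji.

```python
import re

# The broad range taken from the pattern above
broad_pattern = re.compile("[\U000024C2-\U0001F251]")

# A narrower, illustrative pattern (assumption: not an exhaustive emoji list)
narrow_pattern = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def matches(pattern, text):
    return bool(pattern.search(text))


cjk_text = "\u8b1d\u8b1d"               # two Chinese characters, no emoji
emoji_text = "noted \U0001F60A becky"   # contains a smiling-face emoji

print(matches(broad_pattern, cjk_text))    # flagged by the broad range
print(matches(narrow_pattern, cjk_text))   # not flagged by the narrow range
print(matches(narrow_pattern, emoji_text))
```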


- Remove stopwords

In [10]:

# Download the NLTK resources once here, not on every function call
nltk.download('stopwords')
nltk.download('punkt')

def remove_stopwords(text, language='english'):
    """
    Remove stopwords from a given text.

    Args:
        text (str): The input string from which to remove stopwords.
        language (str): The language of the text and the stopwords to be removed. Defaults to 'english'.

    Returns:
        str: A string with stopwords removed.

    Example:
        >>> sample_text = "This is an example showing off stop word filtration."
        >>> remove_stopwords(sample_text)
        'example showing stop word filtration .'
    """
    # Load the list of stopwords for the specified language
    stop_words = set(stopwords.words(language))

    # Tokenize the text into words
    word_tokens = word_tokenize(text)

    # Filter out the stopwords (the comparison is case-insensitive)
    filtered_text = [word for word in word_tokens if word.lower() not in stop_words]

    # Join the filtered words back into a string
    filtered_text = ' '.join(filtered_text)

    return filtered_text

# Example usage:

# Apply the remove_stopwords function to each element of the 'text' column
twcs_preprocessed['text'] = twcs_preprocessed['text'].apply(remove_stopwords)

# Print the preprocessed DataFrame
print(twcs_preprocessed.head())


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\prema\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\prema\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!

                                                text  has_emoji
0  understand would like assist would need get pr...      False
1                                 sprintcare propose      False
2  sprintcare sent several private messages one r...      False
3  please send us private message assist click ‘ ...      False
4                                         sprintcare      False
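
Note that NLTK's English stopword list includes negations such as "not" and "no", so a tweet like "is that not what i’ve done already" loses its negation at this step. If negation matters for a downstream task (for example sentiment analysis), one option is to subtract those words from the stopword set. The sketch below uses a small hand-written stopword set as a stand-in for the NLTK list; both sets are assumptions for illustration.

```python
# Hand-written stand-in for a stopword list (hypothetical, not the NLTK list)
toy_stop_words = {"i", "is", "that", "not", "no", "what", "have", "the"}

# Words we may want to keep even though they are stopwords
negations = {"not", "no"}


def filter_tokens(tokens, stop_words):
    # Case-insensitive stopword filtering, as in remove_stopwords
    return [tok for tok in tokens if tok.lower() not in stop_words]


tokens = "is that not what i have done already".split()

default_result = filter_tokens(tokens, toy_stop_words)
keep_negation_result = filter_tokens(tokens, toy_stop_words - negations)

print(default_result)
print(keep_negation_result)
```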


- Expand abbreviations/slang

In [11]:
# slang dictionary
abb_dict = {
    "AFAIK": "As Far As I Know",
    "AFK": "Away From Keyboard",
    "ASAP": "As Soon As Possible",
    "ATK": "At The Keyboard",
    "ATM": "At The Moment",
    "A3": "Anytime, Anywhere, Anyplace",
    "BAK": "Back At Keyboard",
    "BBL": "Be Back Later",
    "BBS": "Be Back Soon",
    "BFN": "Bye For Now",
    "B4N": "Bye For Now",
    "BRB": "Be Right Back",
    "BRT": "Be Right There",
    "BTW": "By The Way",
    "B4": "Before",
    "B4N": "Bye For Now",
    "CU": "See You",
    "CUL8R": "See You Later",
    "CYA": "See You",
    "FAQ": "Frequently Asked Questions",
    "FC": "Fingers Crossed",
    "FWIW": "For What It's Worth",
    "FYI": "For Your Information",
    "GAL": "Get A Life",
    "GG": "Good Game",
    "GN": "Good Night",
    "GMTA": "Great Minds Think Alike",
    "GR8": "Great!",
    "G9": "Genius",
    "IC": "I See",
    "ICQ": "I Seek you (also a chat program)",
    "ILU": "I Love You",
    "IMHO": "In My Honest/Humble Opinion",
    "IMO": "In My Opinion",
    "IOW": "In Other Words",
    "IRL": "In Real Life",
    "KISS": "Keep It Simple, Stupid",
    "LDR": "Long Distance Relationship",
    "LMAO": "Laugh My A.. Off",
    "LOL": "Laughing Out Loud",
    "LTNS": "Long Time No See",
    "L8R": "Later",
    "MTE": "My Thoughts Exactly",
    "M8": "Mate",
    "NRN": "No Reply Necessary",
    "OIC": "Oh I See",
    "PITA": "Pain In The A..",
    "PRT": "Party",
    "PRW": "Parents Are Watching",
    "QPSA?": "Que Pasa?",
    "ROFL": "Rolling On The Floor Laughing",
    "ROFLOL": "Rolling On The Floor Laughing Out Loud",
    "ROTFLMAO": "Rolling On The Floor Laughing My A.. Off",
    "SK8": "Skate",
    "STATS": "Your sex and age",
    "ASL": "Age, Sex, Location",
    "THX": "Thank You",
    "TTFN": "Ta-Ta For Now!",
    "TTYL": "Talk To You Later",
    "U": "You",
    "U2": "You Too",
    "U4E": "Yours For Ever",
    "WB": "Welcome Back",
    "WTF": "What The F...",
    "WTG": "Way To Go!",
    "WUF": "Where Are You From?",
    "W8": "Wait...",
    "7K": "Sick:-D Laugher",
    "TFW": "That feeling when",
    "MFW": "My face when",
    "MRW": "My reaction when",
    "IFYP": "I feel your pain",
    "TNTL": "Trying not to laugh",
    "JK": "Just kidding",
    "IDC": "I don’t care",
    "ILY": "I love you",
    "IMU": "I miss you",
    "ADIH": "Another day in hell",
    "ZZZ": "Sleeping, bored, tired",
    "WYWH": "Wish you were here",
    "TIME": "Tears in my eyes",
    "BAE": "Before anyone else",
    "FIMH": "Forever in my heart",
    "BSAAW": "Big smile and a wink",
    "BWL": "Bursting with laughter",
    "LMAO": "Laughing my a** off",
    "BFF": "Best friends forever",
    "CSL": "Can’t stop laughing",
    "IMO": "In My Opinion",
    "IMHO": "In My Humble Opinion",
    "IIRC": "If I Remember Correctly",
    "AF": "As F**k",
    "FTW": "For The Win",
    "ICYMI": "In Case You Missed It",
    "SMH": "Shaking My Head",
    "TBH": "To Be Honest",
    "ICYDK": "In Case You Didn't Know",
    "TBT": "Throwback Thursday",
    "FOMO": "Fear Of Missing Out",
    "OOTD": "Outfit Of The Day",
    "AMA": "Ask Me Anything",
    "TL;DR": "Too Long; Didn't Read",
    "TMI": "Too Much Information",
    "DIY": "Do It Yourself",
    "ETA": "Estimated Time of Arrival",
    "SFW": "Safe For Work",
    "NSFW": "Not Safe For Work",
    "DM": "Direct Message",
    "RT": "Retweet",
    "MT": "Modified Tweet",
    "HBD": "Happy Birthday",
    "IMK": "In My Knowledge",
    "FTFY": "Fixed That For You",
    "ISO": "In Search Of",
    "NSFL": "Not Safe For Life",
    "BRB": "Be Right Back",
    "NM": "Never Mind",
    "YMMV": "Your Mileage May Vary",
    "RTFM": "Read The F***ing Manual",
    "OOTD": "Outfit Of The Day",
    "YOLO": "You Only Live Once",
    "OMG": "Oh My God",
    "OMW": "On My Way",
    "OOMF": "One Of My Friends/Followers",
    "STFU": "Shut The F*** Up",
    "WTH": "What The Hell",
    "WYD": "What You Doing"
}


- First, check for slang/abbreviations in the twcs dataset

In [12]:
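def check_slang(text):
    """
    Check whether any word in the text is a key in the internet slang dictionary.

    Parameters:
        text (str): The text to check.

    Returns:
        bool: True if at least one word is found in the dictionary, False otherwise.
    """
    # The dictionary keys are uppercase while the text was lowercased earlier,
    # so compare word by word in uppercase
    return any(word.upper() in abb_dict for word in text.split())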

# Apply the check_slang function to a column in the DataFrame
twcs_preprocessed['is_slang_present'] = twcs_preprocessed['text'].apply(check_slang)
print(twcs_preprocessed)

                                                  text  has_emoji  \
0    understand would like assist would need get pr...      False   
1                                   sprintcare propose      False   
2    sprintcare sent several private messages one r...      False   
3    please send us private message assist click ‘ ...      False   
4                                           sprintcare      False   
..                                                 ...        ...   
495  oh please speak member flt crew immediate assi...      False   
496  delta inflight studio experience today nothing...      False   
497                                                         False   
498                               delta ’ done already      False   
499                              delta reply dm thread      False   

     is_slang_present  
0               False  
1               False  
2               False  
3               False  
4               False  
..                ...  
495

- Expand Slangs 

In [13]:
def expand_text(text, abbr_dict):
    """
    Expand abbreviations and slang in a given text based on a predefined dictionary.
    
    Args:
        text (str): The input string containing abbreviations and/or slang.
        abbr_dict (dict): A dictionary containing mappings of abbreviations to their expansions.
    
    Returns:
        str: The processed string with abbreviations and slang expanded.
    """
    # Make the search case-insensitive
    lowercase_dict = {key.lower(): value for key, value in abbr_dict.items()}
    
    # Split the text into words
    words = text.split()
    
    # Expand abbreviations and slang
    expanded_words = [lowercase_dict.get(word.lower(), word) for word in words]
    
    # Join expanded words back into a string
    expanded_text = ' '.join(expanded_words)
    
    return expanded_text

# Apply the expand_text function to the 'text' column of the DataFrame
twcs_preprocessed['text'] = twcs_preprocessed['text'].apply(expand_text, abbr_dict=abb_dict)

# Display the updated DataFrame
twcs_preprocessed.head()


Unnamed: 0,text,has_emoji,is_slang_present
0,understand would like assist would need get pr...,False,False
1,sprintcare propose,False,False
2,sprintcare sent several private messages one r...,False,False
3,please send us private message assist click ‘ ...,False,False
4,sprintcare,False,False
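
Because the lookup is case-insensitive, dictionary keys that are also ordinary English words (for example "TIME" or "ATM") are expanded wherever the plain word appears. A possible safeguard is to skip such keys. The sketch below uses a three-entry toy dictionary, and the `ambiguous` exclusion set is a hypothetical choice, not something defined in the notebook.

```python
# Toy subset of an abbreviation dictionary (values copied from the entries above)
toy_abbr = {
    "DM": "Direct Message",
    "TIME": "Tears in my eyes",
    "ATM": "At The Moment",
}

# Hypothetical exclusion list: keys that collide with everyday words
ambiguous = {"time", "atm"}


def expand_safely(text, abbr_dict, skip):
    # Build a lowercase lookup, leaving out the keys we do not trust
    lookup = {key.lower(): value for key, value in abbr_dict.items()
              if key.lower() not in skip}
    return ' '.join(lookup.get(word.lower(), word) for word in text.split())


sample = "long wait time reply dm"
naive = expand_safely(sample, toy_abbr, set())
safe = expand_safely(sample, toy_abbr, ambiguous)

print(naive)
print(safe)
```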


- Perform Lemmatization 
  --Reduces words to their lemma or dictionary form, considering the morphological analysis of the word.

In [14]:
# Initialize the WordNet lemmatizer
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

# Function to map NLTK's POS tags to WordNet's POS tags
def nltk_to_wordnet_pos(nltk_pos):
    if nltk_pos.startswith('J'):
        return wordnet.ADJ
    elif nltk_pos.startswith('V'):
        return wordnet.VERB
    elif nltk_pos.startswith('N'):
        return wordnet.NOUN
    elif nltk_pos.startswith('R'):
        return wordnet.ADV
    else:
        return None

# Function to perform lemmatization on a text
def lemmatize_text(text):
    tokens = nltk.word_tokenize(text)  # Tokenize the text
    pos_tags = nltk.pos_tag(tokens)  # Get the POS tags for the tokens
    lemmatized_tokens = []
    for token, pos_tag in pos_tags:
        wordnet_pos = nltk_to_wordnet_pos(pos_tag)  # Map NLTK POS tags to WordNet POS tags
        if wordnet_pos is not None:
            lemmatized_token = lemmatizer.lemmatize(token, pos=wordnet_pos)  # Lemmatize the token
        else:
            lemmatized_token = lemmatizer.lemmatize(token)  # Lemmatize the token without POS information
        lemmatized_tokens.append(lemmatized_token)
    lemmatized_text = ' '.join(lemmatized_tokens)  # Join the lemmatized tokens back into text
    return lemmatized_text

# Apply the lemmatize_text function to the 'text' column of the DataFrame
twcs_preprocessed['text'] = twcs_preprocessed['text'].apply(lemmatize_text)

# Display the updated DataFrame
print(twcs_preprocessed['text'])


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\prema\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\prema\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


0      understand would like assist would need get pr...
1                                     sprintcare propose
2      sprintcare send several private message one re...
3      please send u private message assist click ‘ m...
4                                             sprintcare
                             ...                        
495    oh please speak member flt crew immediate assi...
496    delta inflight studio experience today nothing...
497                                                     
498                                   delta ’ do already
499                    delta reply Direct Message thread
Name: text, Length: 500, dtype: object
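
The POS mapping above can also be written as a lookup table. In NLTK, `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN` and `wordnet.ADV` are the single-letter strings `'a'`, `'v'`, `'n'` and `'r'`, and `WordNetLemmatizer.lemmatize` defaults to `pos='n'`, so falling back to the noun tag is equivalent to calling it without a POS. A dependency-free sketch, in which `penn_to_wordnet` is a hypothetical name:

```python
# First letter of a Penn Treebank tag -> WordNet POS letter
TAG_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}


def penn_to_wordnet(tag, default="n"):
    # Unknown or empty tags fall back to the noun tag
    return TAG_PREFIX_TO_WORDNET.get(tag[:1], default)


for tag in ["JJ", "VBD", "NNS", "RB", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```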


Generally, spelling correction and whitespace removal are performed after lemmatization in the text preprocessing pipeline.

- White space removal 

In [15]:
def remove_whitespace(df, column):
    """
    Remove leading and trailing white spaces from the specified column in a DataFrame.
    
    Args:
        df (pandas.DataFrame): The DataFrame containing the column with text.
        column (str): The name of the column containing text with white spaces.
    
    Returns:
        pandas.DataFrame: The DataFrame with leading and trailing white spaces removed from the specified column.
    """
    # Apply strip() function to remove leading and trailing white spaces
    df[column] = df[column].apply(lambda x: x.strip())
    
    return df

# Call the remove_whitespace function and update the DataFrame
twcs_preprocessed = remove_whitespace(twcs_preprocessed, 'text')

# Display the updated DataFrame
print(twcs_preprocessed['text'])


0      understand would like assist would need get pr...
1                                     sprintcare propose
2      sprintcare send several private message one re...
3      please send u private message assist click ‘ m...
4                                             sprintcare
                             ...                        
495    oh please speak member flt crew immediate assi...
496    delta inflight studio experience today nothing...
497                                                     
498                                   delta ’ do already
499                    delta reply Direct Message thread
Name: text, Length: 500, dtype: object
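
`strip()` only removes leading and trailing whitespace. Earlier steps (removing numbers, URLs and emojis) can also leave double spaces or newlines inside a tweet. A minimal sketch of collapsing those as well, where `normalize_whitespace` is a hypothetical helper:

```python
def normalize_whitespace(text):
    # split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and drops empty pieces
    return ' '.join(text.split())


cleaned = normalize_whitespace("  btw chipotletweets giving out  burritos \n today ")
print(cleaned)
```

With pandas, the function could be applied with `df[column].apply(normalize_whitespace)` in place of the `strip()` lambda.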


- Spelling Correction 

In [16]:
def correct_spelling(text):
    """
    Correct the spelling in a given text using TextBlob library.
    
    Args:
        text (str): The input text with potential spelling errors.
    
    Returns:
        str: The text with corrected spelling.
    """
    # Create a TextBlob object for the input text
    blob = TextBlob(text)
    
    # Correct the spelling using TextBlob's built-in spellchecker
    corrected_text = blob.correct()
    
    return str(corrected_text)

# Apply the correct_spelling function to the 'text' column of the DataFrame
twcs_preprocessed['text'] = twcs_preprocessed['text'].apply(correct_spelling)

# Display the updated DataFrame
print(twcs_preprocessed['text'])


0      understand would like assist would need get pr...
1                                     sprintcare propose
2      sprintcare send several private message one re...
3      please send u private message assist click ‘ m...
4                                             sprintcare
                             ...                        
495    oh please speak member felt crew immediate ass...
496    felt flight studio experience today nothing wo...
497                                                     
498                                    felt ’ do already
499                     felt reply Direct Message thread
Name: text, Length: 500, dtype: object
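
The output above shows the spellchecker rewriting tokens it does not know: `delta` and `flt` both become `felt`. One possible way to limit this is to skip an allow-list of tokens before correcting. The sketch below is only an illustration: `PROTECTED_TOKENS`, `correct_spelling_protected`, and the `corrector` argument are hypothetical names that do not appear in the notebook, and the allow-list contents are an assumption that would need to match your own data.

```python
# Hypothetical allow-list (an assumption, not part of the notebook):
# tokens that should never be passed to the spellchecker.
PROTECTED_TOKENS = frozenset({"delta", "sprintcare", "flt"})


def correct_spelling_protected(text, corrector, protected=PROTECTED_TOKENS):
    """
    Correct spelling token by token, leaving protected tokens untouched.

    Args:
        text (str): The input text with potential spelling errors.
        corrector (callable): Function mapping one word to its correction.
        protected (set): Lowercase tokens to keep exactly as written.

    Returns:
        str: The text with unprotected tokens corrected.
    """
    corrected = []
    for token in text.split():
        if token.lower() in protected:
            corrected.append(token)  # keep brand names and handles as-is
        else:
            corrected.append(corrector(token))
    return " ".join(corrected)


# With TextBlob, as used in the cell above, the corrector could be:
#   corrector = lambda word: str(TextBlob(word).correct())
#   twcs_preprocessed['text'].apply(
#       lambda t: correct_spelling_protected(t, corrector))
```

Passing the corrector in as a function keeps the allow-list logic independent of the spellchecking library.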


- Feature Engineering: N-grams (2-grams, 3-grams, 4-grams)

In [17]:
def extract_ngram_features(text_data, ngram_range=(2, 2), use_tfidf=False):
    """
    Extract n-gram features from text data.

    Parameters:
    - text_data: Iterable of strings containing text data.
    - ngram_range: Tuple specifying the range of n-grams to consider (default: (2, 2)).
    - use_tfidf: Boolean indicating whether to use TfidfVectorizer instead of CountVectorizer (default: False).

    Returns:
    - feature_names: Array of n-gram feature names.
    - X: Sparse matrix containing the extracted features.
    """
    if use_tfidf:
        vectorizer = TfidfVectorizer(ngram_range=ngram_range)
    else:
        vectorizer = CountVectorizer(ngram_range=ngram_range)
    
    X = vectorizer.fit_transform(text_data)
    feature_names = vectorizer.get_feature_names_out()
    
    return feature_names, X


# Extracting n-gram features directly from the 'text' column of twcs_preprocessed
text_data = twcs_preprocessed['text']

ngram_ranges = [(2, 2), (3, 3), (4, 4)]

for ngram_range in ngram_ranges:
    feature_names, X = extract_ngram_features(text_data, ngram_range=ngram_range)
    print(f"{ngram_range}-gram features:")
    print(feature_names)
    print()


(2, 2)-gram features:
['aahhrrgh keep' 'able bring' 'able control' ... 'your still'
 'your welcome' 'zu taken']

(3, 3)-gram features:
['aahhrrgh keep sum' 'able control lockscreen' 'able export sg' ...
 'your still issue' 'your welcome william' 'zu taken schönen']

(4, 4)-gram features:
['aahhrrgh keep sum still' 'able export sg ai' 'able reach link do' ...
 'your still issue back' 'your welcome william give'
 'zu taken schönen bend']
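
To make the n-gram features above concrete, here is a minimal pure-Python sketch of how word n-grams are formed from a single tweet. It assumes plain whitespace tokenization, which is a simplification: `CountVectorizer`'s default token pattern also drops single-character tokens. The function name `word_ngrams` is hypothetical.

```python
def word_ngrams(text, n):
    """
    Return the list of word n-grams in a text.

    Args:
        text (str): Input text, tokens separated by whitespace.
        n (int): Number of consecutive words per n-gram.

    Returns:
        list: n-grams as space-joined strings, in order of appearance.
    """
    tokens = text.split()
    # Slide a window of n tokens over the text; a text shorter than n
    # tokens yields an empty list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(word_ngrams("please send private message", 2))
```

A tweet with fewer than `n` words contributes no n-grams, which is why short rows such as `sprintcare` add nothing to the bigram vocabulary.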



- Bag of Words (BOW)

In [18]:
# Extract text data from the DataFrame
text_data = twcs_preprocessed['text']

# Initialize CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the text data
X_bow = vectorizer.fit_transform(text_data)

# Get feature names
feature_names_bow = vectorizer.get_feature_names_out()

# Display results
print("Bag-of-Words (BoW) features:")
print(feature_names_bow)
print("Encoded text data:")
print(X_bow.toarray())


Bag-of-Words (BoW) features:
['aahhrrgh' 'ab' 'able' ... 'you' 'your' 'zu']
Encoded text data:
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
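
The printed matrix is mostly zeros because each tweet uses only a handful of the vocabulary terms. The following pure-Python sketch builds the same kind of count matrix on a toy corpus so that the non-zero entries are visible. It assumes whitespace tokenization rather than `CountVectorizer`'s token pattern, and `bag_of_words` is a hypothetical helper name.

```python
from collections import Counter


def bag_of_words(docs):
    """
    Build a bag-of-words count matrix for a list of documents.

    Args:
        docs (list): List of strings, tokens separated by whitespace.

    Returns:
        tuple: (vocabulary, matrix), where vocabulary is a sorted list of
        terms and matrix[i][j] counts vocabulary[j] in docs[i].
    """
    tokenized = [doc.split() for doc in docs]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    index = {term: i for i, term in enumerate(vocabulary)}

    matrix = []
    for toks in tokenized:
        row = [0] * len(vocabulary)
        for term, count in Counter(toks).items():
            row[index[term]] = count
        matrix.append(row)
    return vocabulary, matrix


vocab, counts = bag_of_words(["send private message",
                              "private message please message"])
print(vocab)
print(counts)
```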


- One Hot Encoding

In [19]:
# Extract text data from the 'text' column
text_data = twcs_preprocessed['text']

# Step 1: Tokenization and Vocabulary Creation
vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(text_data)
vocabulary = vectorizer.get_feature_names_out()

# Convert vocabulary to a 2D array with shape (n_features, 1)
vocabulary_2d = np.array(vocabulary).reshape(-1, 1)

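# Step 2: One-Hot Encoding
# `sparse_output` replaced the `sparse` argument in scikit-learn 1.2;
# on older versions, pass sparse=False instead.
one_hot_encoder = OneHotEncoder(categories='auto', sparse_output=False)
one_hot_encoded = one_hot_encoder.fit_transform(vocabulary_2d)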

# Print one-hot encoded matrix
print("One-hot encoded matrix:")
print(one_hot_encoded)

One-hot encoded matrix:
[[1. 0. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 0. 1. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 1. 0. 0.]
 [0. 0. 0. ... 0. 1. 0.]
 [0. 0. 0. ... 0. 0. 1.]]
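
The matrix above is an identity matrix because every vocabulary word is encoded exactly once. To apply those vectors to an actual tweet, each token can be looked up in the vocabulary. This is a minimal sketch assuming whitespace tokenization. `one_hot_tokens` is a hypothetical helper, and mapping out-of-vocabulary tokens to an all-zero row is a design choice made here, not something the notebook specifies.

```python
def one_hot_tokens(text, vocabulary):
    """
    One-hot encode each token of a text against a fixed vocabulary.

    Args:
        text (str): Input text, tokens separated by whitespace.
        vocabulary (list): Ordered list of known terms.

    Returns:
        list: One row per token; each row has a single 1 at the token's
        vocabulary index, or all zeros if the token is unknown.
    """
    index = {term: i for i, term in enumerate(vocabulary)}
    rows = []
    for token in text.split():
        row = [0] * len(vocabulary)
        if token in index:
            row[index[token]] = 1
        rows.append(row)
    return rows


print(one_hot_tokens("send private message", ["message", "private", "send"]))
```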




- Tokenization (Word, Sentence, and Character)

In [20]:
# Download NLTK resources (if not already downloaded)
nltk.download('punkt')

# Function for word tokenization
def word_tokenization(text):
    """
    Tokenize text into words.
    
    Args:
        text (str): Input text to tokenize.
    
    Returns:
        list: List of tokens (words).
    """
    tokens = word_tokenize(text)
    return tokens

# Function for sentence tokenization
def sentence_tokenization(text):
    """
    Tokenize text into sentences.
    
    Args:
        text (str): Input text to tokenize.
    
    Returns:
        list: List of sentences.
    """
    sentences = sent_tokenize(text)
    return sentences

# Function for character tokenization
def character_tokenization(text):
    """
    Tokenize text into characters.
    
    Args:
        text (str): Input text to tokenize.
    
    Returns:
        list: List of characters.
    """
    characters = list(text)
    return characters


# Apply word tokenization to the 'text' column
twcs_preprocessed['word_tokens'] = twcs_preprocessed['text'].apply(word_tokenization)

# Apply sentence tokenization to the 'text' column
twcs_preprocessed['sentence_tokens'] = twcs_preprocessed['text'].apply(sentence_tokenization)

# Apply character tokenization to the 'text' column
twcs_preprocessed['character_tokens'] = twcs_preprocessed['text'].apply(character_tokenization)

# Display the DataFrame with tokenized text
print(twcs_preprocessed.head())



                                                text  has_emoji  \
0  understand would like assist would need get pr...      False   
1                                 sprintcare propose      False   
2  sprintcare send several private message one re...      False   
3  please send u private message assist click ‘ m...      False   
4                                         sprintcare      False   

   is_slang_present                                        word_tokens  \
0             False  [understand, would, like, assist, would, need,...   
1             False                              [sprintcare, propose]   
2             False  [sprintcare, send, several, private, message, ...   
3             False  [please, send, u, private, message, assist, cl...   
4             False                                       [sprintcare]   

                                     sentence_tokens  \
0  [understand would like assist would need get p...   
1                               [sprintc

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\prema\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


- TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure of how important a word is to a document relative to a collection of documents (corpus). A term scores higher the more often it appears in a document and the fewer documents in the corpus contain it.

- Apply TF-IDF to word and sentence tokens

In [21]:
def calculate_tfidf(corpus):
    """
    Calculate TF-IDF scores for a corpus of documents.
    
    Args:
        corpus (iterable of str): Documents, one string per document.
    
    Returns:
        scipy.sparse.csr_matrix: TF-IDF matrix representation of the corpus.
        numpy.ndarray: Vocabulary of terms.
    """
    # Create TfidfVectorizer object
    tfidf_vectorizer = TfidfVectorizer()

    # Fit the vectorizer on the corpus and transform the documents
    tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)
    
    # Get the vocabulary of terms
    vocabulary = tfidf_vectorizer.get_feature_names_out()
    
    return tfidf_matrix, vocabulary

# Rebuild one document per tweet by joining its word tokens
corpus = [" ".join(words) for words in twcs_preprocessed['word_tokens']]

# Apply TF-IDF on word tokens
word_tfidf_matrix, word_vocabulary = calculate_tfidf(corpus)

# Rebuild one document per tweet by joining its sentence tokens.
# This reassembles essentially the same tweet text, so expect the same
# vocabulary as for the word tokens.
corpus = twcs_preprocessed['sentence_tokens'].apply(lambda x: " ".join(x))

# Apply TF-IDF on sentence tokens
sentence_tfidf_matrix, sentence_vocabulary = calculate_tfidf(corpus)

# Print TF-IDF matrix and vocabulary for word tokens
print("TF-IDF Matrix for Word Tokens:")
print(word_tfidf_matrix.toarray())
print("\nVocabulary for Word Tokens:")
print(word_vocabulary)

# Print TF-IDF matrix and vocabulary for sentence tokens
print("\nTF-IDF Matrix for Sentence Tokens:")
print(sentence_tfidf_matrix.toarray())
print("\nVocabulary for Sentence Tokens:")
print(sentence_vocabulary)


TF-IDF Matrix for Word Tokens:
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]

Vocabulary for Word Tokens:
['aahhrrgh' 'ab' 'able' ... 'you' 'your' 'zu']

TF-IDF Matrix for Sentence Tokens:
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]

Vocabulary for Sentence Tokens:
['aahhrrgh' 'ab' 'able' ... 'you' 'your' 'zu']
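
To show what the numbers in these matrices mean, the sketch below computes TF-IDF by hand on a toy corpus. It follows scikit-learn's documented default weighting: raw term counts multiplied by idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing t, followed by L2 normalization of each row. Whitespace tokenization is a simplifying assumption, since `TfidfVectorizer` also lowercases and drops single-character tokens, and `tfidf_rows` is a hypothetical helper name.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """
    Compute smoothed, L2-normalized TF-IDF weights for a list of documents.

    Args:
        docs (list): List of strings, tokens separated by whitespace.

    Returns:
        tuple: (vocabulary, rows), where rows[i][j] is the weight of
        vocabulary[j] in docs[i].
    """
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    vocabulary = sorted({t for toks in tokenized for t in toks})

    # Document frequency: in how many documents each term occurs
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}

    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        weights = [tf[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))
        # An empty document keeps an all-zero row
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocabulary, rows


vocab, rows = tfidf_rows(["delta reply", "delta send message", ""])
print(vocab)
print(rows[0])
```

In the first toy document, `reply` receives a higher weight than `delta` because `delta` also appears in the second document. The empty third document stays all zeros, as the empty tweet in row 497 does.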
