# Scenario: AI Chatbot for Customer Service

### Context:
###### You are developing an AI-powered chatbot for a large e-commerce website. Customers often ask questions like:

###### “Hey, can u plz tell me where’s my order??”

###### “i didn’t receive my parcel yet!!!!”

###### “Whr’s my ordr 😡😡”

###### “delivery late af... i want refund now”

### Question:
###### Design a preprocessing pipeline that normalizes these kinds of user inputs before they are fed into an intent classification model. What specific steps would you take to clean and standardize the text? Your solution should handle abbreviations, slang, emojis, misspellings, and emotional cues, along with any other text preprocessing techniques you consider relevant.

### 🔸 Bonus: 
###### How would you preserve emotional intensity (like frustration) in preprocessing without losing model accuracy?





In [3]:
!pip install pandas numpy nltk emoji textblob





In [6]:
import pandas as pd
import numpy as np
import re
import nltk
import emoji
from nltk import pos_tag
from nltk.corpus import stopwords, wordnet
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
from textblob import TextBlob

In [7]:
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt_tab')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [37]:
# Dictionary of slang and abbreviations.
# replace_chat_words looks up one lowercased, whitespace-separated token at a time.
chat_words_dict = {
    "u": "you",
    "brb": "be right back",
    "tbh": "to be honest",
    "idk": "I don’t know",
    "omw": "on my way",
    "afk": "away from keyboard",
    "np": "no problem",
    "thx": "thanks",
    "btw": "by the way",
    "imo": "in my opinion",
    "lol": "laugh out loud",
    "omg": "oh my god",
    "rofl": "rolling on the floor laughing",
    "icymi": "in case you missed it",
    # The next two keys are never matched by the single-token lookup:
    # "Whr’s" is capitalized and "late af" contains a space.
    "Whr’s": "Where is",
    "late af": "is too late",
    "plz": "please",
    "af": "as f***",
    "whr": "where",
    "ordr": "order"
}

In [38]:
def replace_chat_words(text, chat_words_dict):
    # Split the text into words
    words = text.split()
    
    # Replace chat words using the dictionary
    replaced_words = [chat_words_dict.get(word.lower(), word) for word in words]
    
    # Join the words back into a single string
    replaced_text = " ".join(replaced_words)
    return replaced_text
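
The word-by-word lookup above cannot match keys that contain a space (such as "late af") or tokens that still carry punctuation (such as "af..."). One possible alternative is sketched below, assuming a single case-insensitive regular expression is acceptable for a small dictionary. The names `replace_chat_phrases` and `demo_dict` are hypothetical and not part of the original notebook.

```python
import re


def replace_chat_phrases(text, mapping):
    """Replace slang keys (single words or phrases) found in text."""
    # Normalize keys to lowercase so matching can be case-insensitive.
    lowered = {key.lower(): value for key, value in mapping.items()}
    # Try longer keys first so "late af" wins over the shorter "af".
    keys = sorted(lowered, key=len, reverse=True)
    alternatives = "|".join(re.escape(key) for key in keys)
    # The lookarounds stop "u" from matching inside a word such as "you".
    pattern = re.compile(r"(?<!\w)(" + alternatives + r")(?!\w)", re.IGNORECASE)
    return pattern.sub(lambda m: lowered[m.group(1).lower()], text)


# Hypothetical demo dictionary (illustrative values only).
demo_dict = {
    "u": "you",
    "plz": "please",
    "late af": "very late",
    "af": "as f***",
    "Whr’s": "where is",
    "ordr": "order",
}

print(replace_chat_phrases("can u plz tell me", demo_dict))
print(replace_chat_phrases("delivery late af... i want refund now", demo_dict))
print(replace_chat_phrases("Whr’s my ordr", demo_dict))
```

Because the pattern tolerates apostrophes and trailing punctuation, this kind of replacement could run before punctuation is stripped rather than after it.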

In [39]:
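def correct_spelling(text):
    """Return a spell-corrected copy of text, or the input unchanged on failure."""
    try:
        return str(TextBlob(text).correct())
    except Exception:  # Avoid a bare except, which would also swallow KeyboardInterrupt
        return text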
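
A general-purpose corrector may rewrite tokens it does not recognize, which is a risk for domain terms. As a hedged alternative (a swapped-in technique, not what the notebook uses), the sketch below corrects tokens only against a small domain vocabulary using the standard library's `difflib.get_close_matches`. `DOMAIN_VOCAB`, `correct_token`, and the `0.8` cutoff are assumptions chosen for illustration.

```python
import difflib

# Hypothetical domain vocabulary; a real one would come from product and support data.
DOMAIN_VOCAB = ["order", "parcel", "refund", "delivery", "where", "please", "receive"]


def correct_token(token, vocab=DOMAIN_VOCAB, cutoff=0.8):
    """Map a token to its closest vocabulary word, or return it unchanged."""
    if token in vocab:
        return token
    # get_close_matches returns candidates whose similarity ratio is >= cutoff,
    # best match first.
    matches = difflib.get_close_matches(token, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_text(text):
    """Correct each whitespace-separated token independently."""
    return " ".join(correct_token(token) for token in text.split())


print(correct_text("my ordr and refnd"))
```

Tokens with no sufficiently close vocabulary entry pass through untouched, so the cutoff controls how aggressive the correction is.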

In [40]:
#def tag_elongation(text):
    #return re.sub(r'(\w*)(\w)\2{2,}(\w*)', r'\1\2\2\3 <elongated>', text)
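
The commented-out idea above, tagging elongated words such as "sooooo", could look like the following. This is a sketch under assumptions: runs of three or more identical letters are collapsed to two, and a `<elongated>` marker is appended after the word. The helper names are illustrative.

```python
import re

_REPEAT = re.compile(r"([a-zA-Z])\1{2,}")  # Same letter three or more times in a row
_WORD = re.compile(r"[a-zA-Z]+")


def tag_elongation(text):
    """Shorten elongated words and append an <elongated> marker after each one."""

    def _fix(match):
        word = match.group(0)
        shortened = _REPEAT.sub(r"\1\1", word)
        # Only words that actually changed receive the marker.
        return word if shortened == word else f"{shortened} <elongated>"

    return _WORD.sub(_fix, text)


print(tag_elongation("sooooo late"))
print(tag_elongation("where is my order"))
```

Collapsing to two letters rather than one leaves legitimate double letters intact; a later spelling step would still need to reduce forms such as "soo".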

In [41]:
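def clean_text(text):
    """Lowercase, strip noise, expand chat words, and stem a single message."""
    if pd.isna(text):
        return ""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", "", text)  # Remove URLs
    text = re.sub(r"\d+", "", text)  # Remove digits
    # Keep only lowercase letters and whitespace. This also removes emojis,
    # apostrophes, and all punctuation, so later steps see letters only.
    text = re.sub(r"[^a-z\s]", "", text)
    #text = re.sub(r":\(", "sad", text)
    #text = re.sub(r":\)", "happy", text)
    #text = emoji.demojize(text)  # 😡 -> :angry_face:
    #text = text.replace(':angry_face:', 'angry')
    # No-ops as written: "!" and "?" were already removed by the line above.
    # To keep these intensity markers, both substitutions would have to run
    # before that strip, and the strip would have to spare the tags.
    text = re.sub(r'[!]{2,}', ' <exclaim> ', text)
    text = re.sub(r'[?]{2,}', ' <question> ', text)
    #text = correct_spelling(text)  # <-- spell correction added here
    text = replace_chat_words(text, chat_words_dict)
    #text = tag_elongation(text)
    tokens = word_tokenize(text)
    #tokens = [word for word in tokens if word not in stopwords.words('english')]
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(word) for word in tokens]
    return " ".join(tokens)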
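
For the bonus question on preserving emotional intensity, one option is to convert emojis and repeated punctuation into marker tokens before anything is stripped. The sketch below is a minimal, standard-library-only illustration. `EMOJI_TOKENS` is a hypothetical hand-written mapping (a real pipeline might derive it from `emoji.demojize` instead), and `mark_intensity` is an assumed name.

```python
import re

# Hypothetical emoji-to-marker mapping, for illustration only.
EMOJI_TOKENS = {"😡": "<angry>", "😢": "<sad>", "😊": "<happy>"}
_TAG = re.compile(r"<[a-z]+>")


def mark_intensity(text):
    """Lowercase text and keep emotion markers while stripping other symbols."""
    text = text.lower()
    # 1. Turn emotional signals into tokens before any stripping happens.
    for symbol, token in EMOJI_TOKENS.items():
        text = text.replace(symbol, f" {token} ")
    text = re.sub(r"!{2,}", " <exclaim> ", text)
    text = re.sub(r"\?{2,}", " <question> ", text)
    # 2. Strip non-letters token by token, leaving marker tokens untouched.
    cleaned = []
    for token in text.split():
        if _TAG.fullmatch(token):
            cleaned.append(token)
            continue
        token = re.sub(r"[^a-z]", "", token)
        if token:
            cleaned.append(token)
    return " ".join(cleaned)


print(mark_intensity("i didn’t receive my parcel yet!!!!"))
print(mark_intensity("Whr’s my ordr 😡😡"))
```

The markers survive as ordinary tokens that a classifier can learn from. A tokenizer that treats `<` and `>` as separate punctuation is likely to split them, so in this sketch the text is split on whitespace only; stemming would likewise need to skip marker tokens.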

In [42]:
def preprocess_data(df):
    """Deduplicate rows, fill numeric NaNs with the column mean, and clean text columns."""
    df = df.drop_duplicates().copy()  # Remove duplicate rows; copy so the caller's frame is untouched
    df = df.fillna(df.mean(numeric_only=True))  # Replace NaN with the mean (numeric columns only)
    text_columns = df.select_dtypes(include=['object']).columns  # Identify text columns

    for col in text_columns:
        df[col] = df[col].apply(clean_text)  # Apply text cleaning (NaN text becomes "")

    return df

In [43]:
data = {
    'Text': [
        "Hey, can u plz tell me where’s my order??",
        "i didn’t receive my parcel yet!!!!",
        "Whr’s my ordr 😡😡",
        np.nan,
        "delivery late af... i want refund now",
    ],
    'Numbers': [10, np.nan, 30, 40, 12]
}
df = pd.DataFrame(data)
cleaned_df = preprocess_data(df)
print(cleaned_df)

                                         Text  Numbers
0    hey can you pleas tell me where my order     10.0
1                i didnt receiv my parcel yet     23.0
2                                whr my order     30.0
3                                                 40.0
4  deliveri late as f * * * i want refund now     12.0
