## Text Preprocessing Techniques on Email Spam Data

1. Rename columns
2. Expand contractions
3. Lower case
4. Remove punctuation
5. Remove digits and words containing digits
6. Remove stop words and specified words

In [1]:
!pip install nltk
!pip install pandas



In [2]:
import pandas as pd

data = pd.read_csv("./datasets/emails.csv", usecols=["text", "spam"])

print(data.head(10))
print(data.info())

data.rename(columns={"spam": "class"}, inplace=True)
print(data.head(10))

                                                text spam
0  Subject: naturally it's your irresistible your...    1
1  Subject: the stock trading gunslinger  fanny i...    1
2  Subject: unbelievable new homes made easy  im ...    1
3  Subject: 4 color printing special  request add...    1
4  Subject: do not have money , get software cds ...    1
5  Subject: great nnews  hello , welcome to medzo...    1
6  Subject: here ' s a hot play in motion  homela...    1
7  Subject: save your money buy getting this thin...    1
8  Subject: undeliverable : home based business f...    1
9  Subject: save your money buy getting this thin...    1
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5730 entries, 0 to 5729
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    5730 non-null   object
 1   spam    5728 non-null   object
dtypes: object(2)
memory usage: 89.7+ KB
None
                                                text class
0  Subj

## Load data

In [3]:
import pandas as pd

df = pd.read_csv("./datasets/emails.csv", usecols=["text", "spam"])
print(df)

                                                   text spam
0     Subject: naturally it's your irresistible your...    1
1     Subject: the stock trading gunslinger  fanny i...    1
2     Subject: unbelievable new homes made easy  im ...    1
3     Subject: 4 color printing special  request add...    1
4     Subject: do not have money , get software cds ...    1
...                                                 ...  ...
5725  Subject: re : research and development charges...    0
5726  Subject: re : receipts from visit  jim ,  than...    0
5727  Subject: re : enron case study update  wow ! a...    0
5728  Subject: re : interest  david ,  please , call...    0
5729  Subject: news : aurora 5 . 2 update  aurora ve...    0

[5730 rows x 2 columns]


In [5]:
# 1. Rename columns

column_mapping = {
    "text": "Email content",
    "spam": "Spam messages",
    # Add more columns as needed
}

df.rename(columns=column_mapping, inplace=True)
df.head()

# Save the dataframe back to CSV file

Unnamed: 0,Email content,Spam messages
0,Subject: naturally it's your irresistible your...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1


## 2. Expand contractions

For example: don't -> do not, it's -> it is

In [6]:
%pip install contractions

Note: you may need to restart the kernel to use updated packages.


In [7]:
import contractions

text = df["Email content"][0]
print("Original Text:\n", text)
print("Expanded Text:\n")
for i in text.split():
    print(contractions.fix(i), end=" ")

Original Text:
 Subject: naturally it's your irresistible your corporate identity  lt is really hard to recollect a company : the  market is full of suqgestions and the information isoverwhelminq ; but a good  catchy logo , stylish statlonery and outstanding website  will make the task much easier .  we don't promise that havinq ordered a iogo your  company will automaticaily become a world ieader : it isguite ciear that  without good products , effective business organization and practicable aim it  will be hotat nowadays market ; but we do promise that your marketing efforts  will become much more effective . here is the list of clear  benefits : creativeness : hand - made , original logos , specially done  to reflect your distinctive company image . convenience : logo and stationery  are provided in all formats ; easy - to - use content management system letsyou  change your website content and even its structure . promptness : you'll see logo drafts within three business days . aff
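To make the idea concrete without depending on the `contractions` package, here is a minimal dictionary-based sketch. The names `CONTRACTION_MAP` and `expand_contractions` are hypothetical and introduced only for illustration, and the mapping covers just a handful of forms, so treat it as a demonstration of the technique rather than a replacement for the library.

```python
import re

# Hypothetical, deliberately tiny mapping; a real library covers many more forms.
CONTRACTION_MAP = {
    "don't": "do not",
    "it's": "it is",
    "you'll": "you will",
    "can't": "cannot",
}

# One alternation pattern, matched case-insensitively on word boundaries.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTION_MAP) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text):
    """Replace each known contraction in text with its expanded form."""
    return _PATTERN.sub(lambda m: CONTRACTION_MAP[m.group(0).lower()], text)


print(expand_contractions("we don't promise , but you'll see it's easy"))
```

Expansion only helps if it runs before punctuation removal. Once the apostrophe is stripped, "it's" becomes "its" and can no longer be expanded. A function like this could be passed to `Series.apply` on the "Email content" column to expand every email instead of printing a single one.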

## 3. Lower case

In [8]:
df["text"] = df["Email content"].str.lower()
df["text"][0]

"subject: naturally it's your irresistible your corporate identity  lt is really hard to recollect a company : the  market is full of suqgestions and the information isoverwhelminq ; but a good  catchy logo , stylish statlonery and outstanding website  will make the task much easier .  we don't promise that havinq ordered a iogo your  company will automaticaily become a world ieader : it isguite ciear that  without good products , effective business organization and practicable aim it  will be hotat nowadays market ; but we do promise that your marketing efforts  will become much more effective . here is the list of clear  benefits : creativeness : hand - made , original logos , specially done  to reflect your distinctive company image . convenience : logo and stationery  are provided in all formats ; easy - to - use content management system letsyou  change your website content and even its structure . promptness : you'll see logo drafts within three business days . affordability : yo

## 4. Remove punctuation

In [9]:
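import re

# Delete every character that is neither a word character
# (letter, digit, underscore) nor whitespace.
df["text"] = df["text"].apply(lambda x: re.sub(r"[^\w\s]", "", x))
df["text"]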

0       subject naturally its your irresistible your c...
1       subject the stock trading gunslinger  fanny is...
2       subject unbelievable new homes made easy  im w...
3       subject 4 color printing special  request addi...
4       subject do not have money  get software cds fr...
                              ...                        
5725    subject re  research and development charges t...
5726    subject re  receipts from visit  jim   thanks ...
5727    subject re  enron case study update  wow  all ...
5728    subject re  interest  david   please  call shi...
5729    subject news  aurora 5  2 update  aurora versi...
Name: text, Length: 5730, dtype: object
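The effect of the `[^\w\s]` pattern is easier to see on a single string. The sketch below uses a made-up sample and adds one optional step that is not in the cell above: collapsing the runs of spaces that remain once the punctuation is gone.

```python
import re

# Made-up sample in the style of the dataset (tokens separated by spaces).
sample = "here ' s a hot play , save_50% !"

# Same pattern as the cell above: drop anything that is not a word character or whitespace.
no_punct = re.sub(r"[^\w\s]", "", sample)

# Optional extra step (an assumption, not part of the original cell):
# collapse the repeated spaces left behind by the removed characters.
collapsed = " ".join(no_punct.split())

print(repr(no_punct))
print(repr(collapsed))
```

The underscore in `save_50` survives because `\w` matches letters, digits, and the underscore. If underscores should also go, the pattern would need to exclude them explicitly.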

## 5. Remove digits and words containing digits

In [10]:
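# Raw string avoids invalid-escape warnings for \w and \d; the pattern matches
# any token that contains at least one digit.
df["text"] = df["text"].apply(lambda x: re.sub(r"\w*\d\w*", "", x))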
df["text"]

0       subject naturally its your irresistible your c...
1       subject the stock trading gunslinger  fanny is...
2       subject unbelievable new homes made easy  im w...
3       subject  color printing special  request addit...
4       subject do not have money  get software cds fr...
                              ...                        
5725    subject re  research and development charges t...
5726    subject re  receipts from visit  jim   thanks ...
5727    subject re  enron case study update  wow  all ...
5728    subject re  interest  david   please  call shi...
5729    subject news  aurora    update  aurora version...
Name: text, Length: 5730, dtype: object
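As a small illustration of what the digit pattern matches, the sketch below runs it on a made-up string. The name `DIGIT_TOKEN` is hypothetical. Note that whole tokens such as `v2` and `mp3` disappear, not just the digits inside them.

```python
import re

# Matches any run of word characters that contains at least one digit.
DIGIT_TOKEN = re.compile(r"\w*\d\w*")

sample = "aurora 5 2 update v2 mp3 file"
without_digits = DIGIT_TOKEN.sub("", sample)

# The removed tokens leave their surrounding spaces behind.
print(repr(without_digits))

# Splitting and re-joining normalizes the spacing.
print(" ".join(without_digits.split()))
```

The leftover spaces explain the gaps visible in the output above, for example in row 5729. They are harmless here because the stop word step later splits on whitespace.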

## 6. Remove stop words and specified words

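Stop words are very common words (such as "the", "is", and "and") that carry little information for telling spam from non-spam. In addition to NLTK's English list, the word "subject" is removed because the emails shown above all begin with it.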

In [11]:
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")

# NLTK's English stop words plus "subject", which prefixes the emails in this dataset.
stop_words = set(stopwords.words("english"))
stop_words.add("subject")


def remove_stopwords(text):
    """Return text without the whitespace-separated tokens found in stop_words."""
    return " ".join(word for word in str(text).split() if word not in stop_words)


df["text"] = df["text"].apply(remove_stopwords)
df["text"][3]

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/prathwik/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


'color printing special request additional information click click printable version order form pdf format phone fax e mail ramsey goldengraphix com request additional information click click printable version order form pdf format golden graphix printing azusa canyon rd irwindale ca e mail message advertisement solicitation'
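Steps 3 to 6 can also be chained into one function, which makes the order of operations explicit. This is a minimal sketch under stated assumptions: `preprocess` and `STOP_WORDS` are hypothetical names, and the stop word set is a small stand-in so the example runs without downloading NLTK data; the notebook itself uses NLTK's English list.

```python
import re

# Stand-in stop word list for illustration only (an assumption, not NLTK's list).
STOP_WORDS = {"subject", "the", "is", "a", "to", "and", "of", "your", "it", "do", "not"}


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation, drop digit-bearing tokens, then remove stop words."""
    text = text.lower()                      # step 3: lower case
    text = re.sub(r"[^\w\s]", "", text)      # step 4: remove punctuation
    text = re.sub(r"\w*\d\w*", "", text)     # step 5: remove tokens containing digits
    # step 6: split on whitespace and keep only tokens outside the stop word set
    return " ".join(word for word in text.split() if word not in stop_words)


print(preprocess("Subject: 4 color printing special ! Request the order form ."))
```

Lowercasing comes first so that the lowercase stop word list matches capitalized tokens such as "Subject". Contraction expansion, if used, would run before any of these steps.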