This project analyzes text message data using Natural Language Processing (NLP) techniques and libraries.

Project Objectives:

Import and inspect the dataset.
Preprocess the text.
Apply NLP techniques to analyze the data and gain insight into it.

# Import the dataset

In [1]:
import pandas as pd 
df = pd.read_csv('clean_nus_sms.csv')
df.head()

,Unnamed: 0,id,Message,length,country,Date
0,0,10120,Bugis oso near wat...,21,SG,2003/4
1,1,10121,"Go until jurong point, crazy.. Available only ...",111,SG,2003/4
2,2,10122,I dunno until when... Lets go learn pilates...,46,SG,2003/4
3,3,10123,Den only weekdays got special price... Haiz......,140,SG,2003/4
4,4,10124,Meet after lunch la...,22,SG,2003/4


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48598 entries, 0 to 48597
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  48598 non-null  int64 
 1   id          48598 non-null  int64 
 2   Message     48595 non-null  object
 3   length      48598 non-null  object
 4   country     48598 non-null  object
 5   Date        48598 non-null  object
dtypes: int64(2), object(4)
memory usage: 2.2+ MB


In [3]:
df.isnull().sum()

Unnamed: 0    0
id            0
Message       3
length        0
country       0
Date          0
dtype: int64

In [8]:
df.dropna(inplace=True)  # drop rows that contain null values

In [9]:
df.isnull().sum()  # confirm that no null values remain

Unnamed: 0    0
id            0
Message       0
length        0
country       0
Date          0
dtype: int64

In [10]:
import pandas as pd
import re
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from textblob import TextBlob

# Load SpaCy's English NLP model
nlp = spacy.load("en_core_web_sm")

# Text processing function

In [11]:
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()

    # Remove special characters, numbers, and punctuations
    text = re.sub(r"[^a-zA-Z\s]", "", text)

    # Process text using SpaCy
    doc = nlp(text)

    # Lemmatization & stopword removal (whitespace-only tokens are skipped too)
    tokens = [token.lemma_ for token in doc if not token.is_stop and not token.is_space]

    return " ".join(tokens)

# Apply preprocessing
df["cleaned_message"] = df["Message"].apply(preprocess_text)

# Display cleaned messages
print(df[["Message", "cleaned_message"]])


                                                 Message  \
0                                  Bugis oso near wat...   
1      Go until jurong point, crazy.. Available only ...   
2         I dunno until when... Lets go learn pilates...   
3      Den only weekdays got special price... Haiz......   
4                                 Meet after lunch la...   
...                                                  ...   
48593                              Come to me AFTER NOON   
48594                                     I LOVE YOU TOO   
48595                                               C-YA   
48596                                        BE MY GUEST   
48597                              MANY MANY MANY PEOPLE   

                                         cleaned_message  
0                                     bugis oso near wat  
1      jurong point crazy available bugis n great wor...  
2                                 dunno let learn pilate  
3      den weekday get special price haiz n
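
To see what the first two steps of `preprocess_text` do on their own, here is a minimal sketch that uses only the standard library. It skips the spaCy lemmatization and stop-word removal entirely, and the helper name `strip_non_letters` is hypothetical, not part of the notebook.

```python
import re

def strip_non_letters(text):
    """Lowercase the text and drop everything except letters and whitespace."""
    text = text.lower()
    # Same pattern as in preprocess_text: digits and punctuation are deleted,
    # not replaced by spaces
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    # Collapse any runs of whitespace left behind by the deletions
    return " ".join(text.split())

print(strip_non_letters("Meet after lunch la..."))  # meet after lunch la
print(strip_non_letters("C-YA"))                    # cya
print(strip_non_letters("Call me at 5pm, ok?"))     # call me at pm ok
```

Because characters are deleted rather than replaced with a space, `C-YA` becomes `cya`, which is how that message appears in the notebook's cleaned output. Substituting a space instead would yield `c ya`.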

# NLP techniques

In [12]:
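def pos_tagging(text):
    # Tag each token with its coarse-grained part of speech
    doc = nlp(text)
    return [(token.text, token.pos_) for token in doc]

# Tagging runs on the cleaned text (lemmatized, stop words removed),
# so the tagger sees less context than it would in the raw messages
df["POS_tags"] = df["cleaned_message"].apply(pos_tagging)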

# Print POS tagging
print(df[["cleaned_message", "POS_tags"]])


                                         cleaned_message  \
0                                     bugis oso near wat   
1      jurong point crazy available bugis n great wor...   
2                                 dunno let learn pilate   
3      den weekday get special price haiz not eat lia...   
4                                          meet lunch la   
...                                                  ...   
48593                                          come noon   
48594                                               love   
48595                                                cya   
48596                                              guest   
48597                                             people   

                                                POS_tags  
0      [(bugis, ADJ), (oso, NOUN), (near, ADP), (wat,...  
1      [(jurong, PROPN), (point, NOUN), (crazy, PROPN...  
2      [(dunno, NOUN), (let, AUX), (learn, VERB), (pi...  
3      [(den, VERB), (weekday, NOUN), (get,
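
Once each message has a list of `(token, tag)` pairs, one quick way to summarize the column is to count how often each tag occurs. The sketch below uses a few pairs copied by hand from the visible part of the output above rather than the real `df["POS_tags"]` column, and the function name `count_pos_tags` is illustrative only.

```python
from collections import Counter

def count_pos_tags(tagged_messages):
    """Count POS tags across a sequence of [(token, tag), ...] lists."""
    counts = Counter()
    for tagged in tagged_messages:
        counts.update(tag for _token, tag in tagged)
    return counts

# Pairs copied from the visible part of the POS output; the rest is truncated there
sample = [
    [("bugis", "ADJ"), ("oso", "NOUN"), ("near", "ADP")],
    [("jurong", "PROPN"), ("point", "NOUN")],
    [("dunno", "NOUN"), ("let", "AUX"), ("learn", "VERB")],
]

tag_counts = count_pos_tags(sample)
print(tag_counts.most_common(1))  # [('NOUN', 3)]
```

On the full dataset this could be called as `count_pos_tags(df["POS_tags"])`. Tags such as `(bugis, ADJ)` for a place name suggest the counts should be read with some caution on informal text.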

# Named Entity Recognition (NER)

In [None]:
def named_entity_recognition(text):
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]

df["NER"] = df["Message"].apply(named_entity_recognition)

# Print named entities
print(df[["Message", "NER"]])
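
The `NER` column holds a list of `(text, label)` pairs per message, so a natural follow-up is to group entity texts by label. The sketch below assumes that shape and uses made-up example pairs, since the notebook does not show the NER output. `group_entities` is a hypothetical helper.

```python
from collections import defaultdict

def group_entities(ner_lists):
    """Map each entity label to the sorted list of distinct entity texts seen with it."""
    grouped = defaultdict(set)
    for entities in ner_lists:
        for text, label in entities:
            grouped[label].add(text)
    return {label: sorted(texts) for label, texts in grouped.items()}

# Illustrative values only, not real model output
sample = [
    [("Jurong", "GPE")],
    [],  # messages without entities yield an empty list
    [("Bugis", "GPE"), ("Friday", "DATE")],
]

print(group_entities(sample))
# {'GPE': ['Bugis', 'Jurong'], 'DATE': ['Friday']}
```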


# Sentiment Analysis

In [None]:
def sentiment_analysis(text):
    return TextBlob(text).sentiment.polarity  # Returns polarity (-1 to 1)

df["sentiment_score"] = df["Message"].apply(sentiment_analysis)

# Print sentiment scores
print(df[["Message", "sentiment_score"]])
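
Raw polarity scores are easier to summarize once they are bucketed into labels. The following is a minimal sketch of one way to do that. The function `polarity_label` and the `0.05` cutoff are assumptions chosen for illustration, not values taken from the notebook or from TextBlob.

```python
def polarity_label(score, threshold=0.05):
    """Map a polarity score in [-1, 1] to 'negative', 'neutral', or 'positive'."""
    if not -1.0 <= score <= 1.0:
        raise ValueError(f"polarity out of range: {score}")
    # Scores close to zero are treated as neutral (threshold is an arbitrary choice)
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"

for score in (0.5, 0.0, -0.3):
    print(score, polarity_label(score))
```

Applied to the notebook's data, this could look like `df["sentiment_score"].apply(polarity_label)`, followed by `value_counts()` to see the share of each label.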


# Word Vectorization (TF-IDF)

In [None]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df["cleaned_message"])

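# Convert to a sparse DataFrame; X.toarray() would densify every row
# and can exhaust memory on a corpus this size
tfidf_df = pd.DataFrame.sparse.from_spmatrix(
    X, index=df.index, columns=vectorizer.get_feature_names_out()
)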

# Print TF-IDF features
print(tfidf_df)
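
To make the TF-IDF weights less of a black box, here is a small pure-Python sketch of the calculation. It assumes scikit-learn's documented defaults for `TfidfVectorizer` (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalization per document). It splits on whitespace, which is a simplification: the vectorizer's default token pattern also drops single-character tokens. The function name `tfidf` is illustrative.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {
        term: math.log((1 + n_docs) / (1 + n_containing)) + 1
        for term, n_containing in doc_freq.items()
    }
    rows = []
    for tokens in tokenized:
        # Raw term count times idf
        weights = {term: count * idf[term] for term, count in Counter(tokens).items()}
        # Scale each document vector to unit length
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows

docs = ["meet lunch la", "meet noon", "love"]
for row in tfidf(docs):
    print({term: round(weight, 3) for term, weight in row.items()})
```

In the first document, `meet` receives a lower weight than `lunch` because it also occurs in the second document, which is the sense in which TF-IDF favors terms that are distinctive for a message.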


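# Insights from the Analysis
✅ Preprocessing (lowercasing, stripping non-letter characters, lemmatization, stop-word removal) normalizes the messages for analysis.
✅ POS tagging labels each token with its grammatical role, such as noun, verb, or adjective.
✅ NER extracts named entities, such as locations, from the raw messages.
✅ Sentiment analysis assigns each message a polarity score from -1 (negative) to 1 (positive).
✅ TF-IDF weights each term by how frequent it is in a message and how rare it is across the corpus, which highlights key terms.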

