# SMS Spam Classification
- NLP Classification Problem
- Using TF-IDF, Random Forest

**About Dataset**
- Context
  - The SMS Spam Collection is a set of tagged SMS messages collected for SMS spam research. It contains 5,574 English SMS messages, each tagged as either ham (legitimate) or spam.


In [1]:
import pandas as pd
import nltk

In [2]:
pd.set_option('display.max_colwidth',100)
messages = pd.read_csv('spam.csv',encoding="latin-1")


In [3]:
messages.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives around here though",,,


**Drop unused columns**

In [4]:
columns_to_drop = ['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4']
# 'columns=' already names the axis, so the extra axis=1 argument is not needed
messages = messages.drop(columns=columns_to_drop)

- Rename the columns to 'label' and 'text'

In [5]:
messages.columns = ['label','text']
print('Number of nulls in label: {}'.format(messages['label'].isnull().sum()))
print('Number of nulls in text: {}'.format(messages['text'].isnull().sum()))

Number of nulls in label: 0
Number of nulls in text: 0


In [6]:
messages.head()

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives around here though"


In [7]:
messages['label'].value_counts()

label
ham     4825
spam     747
Name: count, dtype: int64
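
The counts above show a clear class imbalance. As a quick check, the sketch below uses only the two numbers printed by value_counts() to compute the spam share and the accuracy that a trivial "always ham" classifier would reach. This is why the evaluation later reports precision and recall rather than accuracy.

```python
# Class counts copied from the value_counts() output above.
counts = {"ham": 4825, "spam": 747}
total = sum(counts.values())

spam_share = counts["spam"] / total
# Accuracy of a trivial baseline that labels every message as 'ham'.
majority_baseline_accuracy = max(counts.values()) / total

print(f"total messages: {total}")
print(f"spam share: {spam_share:.1%}")
print(f"always-ham baseline accuracy: {majority_baseline_accuracy:.1%}")
```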

**Clean Text**

In [8]:
import string 
import re
import nltk
nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('english')
print(stopwords)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Jonathan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [9]:
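def clean_text(text):
    # Lowercase the text and drop punctuation characters
    text = "".join([char.lower() for char in text if char not in string.punctuation])
    # Split on runs of non-word characters (raw string avoids an invalid-escape warning)
    tokens = re.split(r'\W+', text)
    # Keep only tokens that are not English stopwords
    text = [word for word in tokens if word not in stopwords]
    return text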

- Define a function that takes a text input, converts it to lowercase, removes punctuation, tokenizes the text with a regular expression, removes stopwords, and returns the remaining tokens as a list.

In [10]:
clean_text('I love NLP')

['love', 'nlp']
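
To make the individual cleaning steps visible, here is a minimal sketch that returns the intermediate result of each step. It assumes a tiny, hypothetical stop-word set (DEMO_STOPWORDS) instead of the NLTK list, so it runs without any download; clean_text_steps is an illustrative helper, not part of the notebook.

```python
import re
import string

# Hypothetical, tiny stop-word set for illustration only; the notebook uses
# NLTK's full English list instead.
DEMO_STOPWORDS = {"i", "to", "a", "in", "the", "is"}

def clean_text_steps(text):
    """Return the intermediate result of each cleaning step."""
    # Step 1: lowercase and remove punctuation characters.
    no_punct = "".join(ch.lower() for ch in text if ch not in string.punctuation)
    # Step 2: split on runs of non-word characters.
    tokens = re.split(r"\W+", no_punct)
    # Step 3: drop stop words.
    kept = [tok for tok in tokens if tok not in DEMO_STOPWORDS]
    return no_punct, tokens, kept

no_punct, tokens, kept = clean_text_steps("Free entry in 2 a wkly comp to win!")
print(no_punct)
print(tokens)
print(kept)

# A trailing space leaves an empty string as the last token.
print(clean_text_steps("Hello there ")[1])
```

The last call shows that text ending in whitespace produces an empty-string token, which is one way an entry like '' can end up in a vocabulary built with this kind of tokenizer.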

**TF-IDF Vectorizer**

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer(analyzer=clean_text)
X_tfidf = tfidf_vect.fit_transform(messages['text'])


In [12]:
print(X_tfidf)

  (0, 8886)	0.1897972404129813
  (0, 1179)	0.3328768943327282
  (0, 3833)	0.15637942028705787
  (0, 2210)	0.2812157454836214
  (0, 1874)	0.3177669769222775
  (0, 3016)	0.19731991212206204
  (0, 4823)	0.2812157454836214
  (0, 9120)	0.2295545966345145
  (0, 3873)	0.18527465020848258
  (0, 5644)	0.18029851718516762
  (0, 1876)	0.2812157454836214
  (0, 1412)	0.2531259556287681
  (0, 2494)	0.2577902404588279
  (0, 6396)	0.2603613040823829
  (0, 4664)	0.3328768943327282
  (0, 3776)	0.15134204543956442
  (1, 5992)	0.5356050320303347
  (1, 8551)	0.1965077811821439
  (1, 9016)	0.4229284479747434
  (1, 4632)	0.5131236718683402
  (1, 4862)	0.4000945017495694
  (1, 5960)	0.2688344407445697
  (2, 73)	0.23183680257116945
  (2, 1265)	0.1674787110187263
  (2, 6738)	0.23183680257116945
  :	:
  (5568, 9380)	0.34959841919185314
  (5568, 3795)	0.311771694488399
  (5568, 1443)	0.37760482542763296
  (5568, 4169)	0.3138841617806791
  (5569, 7937)	0.520467167163554
  (5569, 7562)	0.520467167163554
  (5569, 63

- (0, 8886) 0.1897972404129813: In row 0 (the first message) and column 8886 (the vocabulary term with index 8886; indices are zero-based), the TF-IDF value is approximately 0.1898.
- (0, 1179) 0.3328768943327282: In row 0 and column 1179, the TF-IDF value is approximately 0.3329.

- analyzer=clean_text means that TfidfVectorizer calls the clean_text function on each raw message in the 'messages' dataset to obtain its tokens, replacing the vectorizer's built-in preprocessing and tokenization.
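
To show where such values come from, the following sketch computes TF-IDF by hand for a hypothetical, already-tokenized toy corpus. It assumes the TfidfVectorizer defaults, which the notebook does not override: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The helper names (idf, tfidf_row) are illustrative only.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized (as clean_text would return).
docs = [
    ["free", "entry", "win"],
    ["ok", "lar"],
    ["free", "win", "win"],
]

n_docs = len(docs)
vocab = sorted({tok for doc in docs for tok in doc})
# Document frequency: number of documents that contain each term.
df = Counter(tok for doc in docs for tok in set(doc))

def idf(term):
    # Smoothed IDF, the default in scikit-learn's TfidfVectorizer.
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf_row(doc):
    counts = Counter(doc)
    raw = [counts[t] * idf(t) for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]  # L2-normalize the row

for i, doc in enumerate(docs):
    row = tfidf_row(doc)
    # Print only non-zero entries as "(row, column) value", like print(X_tfidf).
    for j, value in enumerate(row):
        if value:
            print(f"({i}, {j})\t{value:.4f}  # {vocab[j]}")
```

Because each row is L2-normalized, every value lies between 0 and 1, and within a message a rarer term (here 'entry') gets a higher weight than a more common one with the same count (here 'free').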

In [13]:
print(X_tfidf.shape)
print(tfidf_vect.get_feature_names_out())

(5572, 9395)
['' '0' '008704050406' ... 'ûïharry' 'ûò' 'ûówell']


In [14]:
X_features = pd.DataFrame(X_tfidf.toarray())
X_features.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,9385,9386,9387,9388,9389,9390,9391,9392,9393,9394
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


- X_tfidf.toarray() converts the sparse TF-IDF matrix X_tfidf into a dense array, which is then wrapped in a pandas DataFrame so it is easy to inspect and manipulate. The conversion is a convenience rather than a requirement: scikit-learn estimators such as RandomForestClassifier also accept sparse matrices directly. The dense form stores every zero explicitly, so this 5572 x 9395 matrix needs roughly 419 MB as float64.
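
As a point of comparison, RandomForestClassifier can also be fit on the SciPy sparse matrix itself, which avoids the dense copy. The sketch below uses a hypothetical four-message toy corpus and default tokenization, not the SMS data or clean_text.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data for illustration; not the SMS dataset.
texts = [
    "free prize win now",
    "win cash free entry",
    "are we meeting for lunch",
    "see you at home later",
]
labels = ["spam", "spam", "ham", "ham"]

vect = TfidfVectorizer()
X_sparse = vect.fit_transform(texts)  # SciPy sparse matrix

# Fit directly on the sparse matrix; no .toarray() needed.
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(X_sparse, labels)

# Bytes a dense float64 copy would need, versus the number of stored values.
dense_bytes = X_sparse.shape[0] * X_sparse.shape[1] * 8
print("stored non-zero entries:", X_sparse.nnz)
print("dense float64 size in bytes:", dense_bytes)
print(model.predict(vect.transform(["free cash prize"])))
```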

**Random Forest for Classification**

In [15]:
from sklearn.ensemble import RandomForestClassifier

In [16]:
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

In [17]:
X_train,X_test,y_train,y_test = train_test_split(X_features,messages['label'],test_size=0.2)
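
The split above is drawn at random on every run, so the scores reported later can change between runs. A minimal sketch of a reproducible, class-balanced split follows; it uses the random_state and stratify parameters of train_test_split on hypothetical stand-in data with roughly the notebook's ham/spam ratio.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 items with roughly the notebook's class ratio.
X = [[i] for i in range(100)]
y = ["spam"] * 13 + ["ham"] * 87

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,  # fixed seed: the same split on every run
    stratify=y,       # keep the ham/spam ratio similar in both parts
)

print("train:", Counter(y_train))
print("test: ", Counter(y_test))
```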

In [18]:
X_test.shape

(1115, 9395)

In [19]:
rf = RandomForestClassifier()
rf_model = rf.fit(X_train,y_train)

In [20]:
y_pred = rf_model.predict(X_test)
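
In the cells above, the TF-IDF vectorizer is fit on all messages before the train/test split, so the vocabulary and IDF weights also reflect the test messages. One common way to avoid this is to wrap both steps in a scikit-learn Pipeline and fit it on the training texts only. The sketch below assumes a hypothetical toy corpus and default tokenization; spam_pipeline is an illustrative name.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy data for illustration; not the SMS dataset.
texts = [
    "win a free prize now",
    "free entry to win cash",
    "claim your free tickets today",
    "urgent prize waiting call now",
    "are we still meeting for lunch",
    "see you at home later",
    "ok I will call you tonight",
    "thanks for the help yesterday",
]
labels = ["spam"] * 4 + ["ham"] * 4

train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

spam_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),  # the notebook would pass analyzer=clean_text
    ("rf", RandomForestClassifier(n_estimators=20, random_state=0)),
])

# fit() learns the vocabulary and IDF weights from the training texts only.
spam_pipeline.fit(train_texts, train_labels)

# predict() takes raw strings; vectorization happens inside the pipeline.
print(spam_pipeline.predict(test_texts))
```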

In [21]:
print(X_test)

          0     1     2     3     4     5     6     7     8     9     ...  \
2046  0.000000   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   
4445  0.000000   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   
4482  0.000000   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   
5160  0.000000   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   
2873  0.000000   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   
...        ...   ...   ...   ...   ...   ...   ...   ...   ...   ...  ...   
3907  0.000000   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   
3408  0.163335   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   
4866  0.000000   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   
574   0.251263   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   
1612  0.000000   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   

      9385  9386  9387  9388  9389  9390  9391  9392  9393  9394  
2046   0

In [22]:
print(y_pred)

['ham' 'ham' 'ham' ... 'ham' 'ham' 'spam']


In [23]:
y_pred.shape

(1115,)

In [24]:
y_pred[0]

'ham'

In [25]:
y_pred[1]

'ham'

In [26]:
precision = precision_score(y_test,y_pred,pos_label='spam')
recall = recall_score(y_test,y_pred,pos_label='spam')
print('Precision: {} / Recall: {}'.format(round(precision, 3), round(recall, 3)))

Precision: 1.0 / Recall: 0.833


- Precision (or Positive Predictive Value): Precision = TP / (TP + FP). It measures how often the model is right when it predicts the positive class, in this case 'spam'. A precision score of 1.0 means that every message classified as 'spam' in this test split really was spam, so there were no false positives.

- Recall (or True Positive Rate, Sensitivity): Recall = TP / (TP + FN). It measures how many of the actual positive instances ('spam') the model finds. A recall score of 0.833 means that the model identified approximately 83.3% of the actual 'spam' messages in this test split and missed the rest (false negatives).
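
The two definitions can be checked by counting true positives, false positives and false negatives directly. The sketch below does this in plain Python on hypothetical labels chosen so that no ham is flagged and one of six spam messages is missed; precision_recall is an illustrative helper, not the scikit-learn implementation.

```python
def precision_recall(y_true, y_pred, pos_label="spam"):
    """Compute precision and recall for one positive class from label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels: 6 actual spam messages, 5 of them caught, no false alarms.
y_true = ["spam"] * 6 + ["ham"] * 4
y_pred = ["spam"] * 5 + ["ham"] * 5

precision, recall = precision_recall(y_true, y_pred)
print(f"Precision: {precision:.3f} / Recall: {recall:.3f}")
```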

**Manual Testing**

In [27]:
text = ["Free entryy in 2 a wkly comp to win FA Cup final tkts 21st May 2005"]
text_tfidf = tfidf_vect.transform(text)

# Use a separate name so the training feature matrix X_features is not overwritten
text_features = pd.DataFrame(text_tfidf.toarray())
text_features.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,9385,9386,9387,9388,9389,9390,9391,9392,9393,9394
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [28]:
y_pred = rf_model.predict(text_tfidf)

In [29]:
y_pred

array(['spam'], dtype=object)