In [2]:
import numpy as np
import pandas as pd

## Data Cleaning

In [18]:
df = pd.read_csv("data/spam.csv", encoding="latin-1")

In [19]:
df.head() 

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB


The three unnamed columns are almost entirely empty: only 50, 12, and 6 of the 5572 rows have a value in `Unnamed: 2`, `Unnamed: 3`, and `Unnamed: 4` respectively (roughly 0.9%, 0.2%, and 0.1%). Had a significant share of the rows been populated, I would have considered keeping these columns or at least looked into what they contain. With so few values, they are unlikely to add anything to the analysis or modeling, so I drop them, which also simplifies the DataFrame and reduces its memory usage.
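As a rough check of this reasoning, the share of non-null values per column can be computed before anything is dropped. The sketch below uses a small invented frame (not the real `spam.csv`) that reuses the column names from above; the 5% threshold and the names `toy`, `non_null_share`, `sparse_cols`, and `cleaned` are assumptions made for illustration only.

```python
import pandas as pd

# Toy stand-in for the raw frame; all values are invented for illustration.
toy = pd.DataFrame({
    "v1": ["ham", "spam"] * 20,
    "v2": [f"message {i}" for i in range(40)],
    "Unnamed: 2": ["x"] + [None] * 39,   # 1 of 40 rows populated
    "Unnamed: 3": [None] * 40,           # completely empty
})

# Fraction of non-null entries in each column.
non_null_share = toy.notna().mean()

# Hypothetical rule: treat columns populated in fewer than 5% of rows as sparse.
sparse_cols = non_null_share[non_null_share < 0.05].index.tolist()

# Drop the sparse columns by name rather than by position.
cleaned = toy.drop(columns=sparse_cols)
print(list(cleaned.columns))
```

Dropping by name is a little more robust than slicing by position if the column order ever changes, though both give the same result here.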

In [21]:
# Drop the last 3 columns (Unnamed: 2, Unnamed: 3, Unnamed: 4) by position using iloc
df = df.iloc[:, :-3]

In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   v1      5572 non-null   object
 1   v2      5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [23]:
df.shape

(5572, 2)

In [24]:
df.sample(5)

Unnamed: 0,v1,v2
4389,ham,Do you know why god created gap between your f...
2801,ham,And smile for me right now as you go and the w...
5515,ham,You are a great role model. You are giving so ...
3889,spam,Double Mins & 1000 txts on Orange tariffs. Lat...
999,ham,"Aight will do, thanks again for comin out"


In [25]:
# Rename the 'v1' column to 'target' and 'v2' column to 'text'
df.rename(columns={'v1':'target','v2':'text'}, inplace=True)

# Print a random sample of 5 rows to confirm the column renaming
df.sample(5)


Unnamed: 0,target,text
1794,ham,How much i gave to you. Morning.
5165,ham,ÌÏ still got lessons? ÌÏ in sch?
1076,ham,Where can download clear movies. Dvd copies.
5281,ham,"And how you will do that, princess? :)"
2080,ham,Where is it. Is there any opening for mca.


In [26]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()

In [27]:
df['target'] = encoder.fit_transform(df['target'])

In [28]:
df.head()

Unnamed: 0,target,text
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."
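The 0/1 values in the `target` column follow from how `LabelEncoder` assigns codes: it sorts the distinct labels and numbers them in that order, so `ham` becomes 0 and `spam` becomes 1. A minimal sketch on an invented list of labels (not the real dataset) shows how to inspect that mapping; the names `labels`, `codes`, `mapping`, and `restored` are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

# Invented sample labels, standing in for the 'target' column.
labels = ["ham", "spam", "ham", "ham", "spam"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ holds the sorted labels; a label's position is its integer code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)         # {'ham': 0, 'spam': 1}
print(codes.tolist())  # [0, 1, 0, 0, 1]

# inverse_transform recovers the original strings from the codes.
restored = encoder.inverse_transform(codes)
```

Keeping the fitted `encoder` around makes it possible to translate predictions back into `ham`/`spam` later.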


In [34]:
# Count the number of missing values in each column
missing_values = df.isnull().sum()

# Print the results
print("Number of missing values:\n", missing_values)


Number of missing values:
 target    0
text      0
dtype: int64


In [35]:
# Count the number of duplicate rows
num_duplicates = df.duplicated().sum()

# Print the result
print("Number of duplicate rows:", num_duplicates)


Number of duplicate rows: 403


In [36]:
# Remove duplicate rows, keeping the first occurrence
df = df.drop_duplicates(keep='first')

# Print the shape of the DataFrame to confirm the removal of duplicate rows
print("Shape of the DataFrame after removing duplicates:", df.shape)


Shape of the DataFrame after removing duplicates: (5169, 2)


In [37]:
# Re-count duplicate rows to confirm that none remain
num_duplicates = df.duplicated().sum()

# Print the result (expected: 0)
print("Number of duplicate rows:", num_duplicates)


Number of duplicate rows: 0
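To make the behaviour of `duplicated()` and `drop_duplicates(keep='first')` concrete, here is a small sketch on an invented frame. A row counts as a duplicate only when every column matches an earlier row, and the surviving rows keep their original index labels. The frame `toy` and the other variable names are assumptions for illustration.

```python
import pandas as pd

# Invented frame with two repeated rows.
toy = pd.DataFrame({
    "target": [0, 1, 0, 1, 0],
    "text": ["hi", "win cash", "hi", "win cash", "hi there"],
})

# True for every row that repeats an earlier row across all columns.
dup_mask = toy.duplicated()
num_dups = int(dup_mask.sum())
print(num_dups)                  # 2

# Keep the first occurrence of each row; index labels are not renumbered.
deduped = toy.drop_duplicates(keep="first")
print(deduped.index.tolist())    # [0, 1, 4]

# Optional: renumber the rows 0..n-1 after dropping.
deduped_reset = deduped.reset_index(drop=True)
```

In the notebook above, 5572 rows minus 403 duplicates leaves the 5169 rows reported. If later steps rely on consecutive row numbers, `reset_index(drop=True)` is one option.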
