# **Philippine Scam SMS**
**Phase 1: Preprocessing and Cleaning**

**Author/s: [Anton Reyes](https://www.github.com/AGR-yes)**

## **Introduction**

### **Requirements and Imports**

#### **Imports**

**Basic Libraries**

* `numpy` contains a large collection of mathematical functions
* `pandas` contains functions that are designed for data manipulation and data analysis

In [75]:
import numpy as np
import pandas as pd

**Visualization Libraries**

* `matplotlib.pyplot` contains functions to create interactive plots
* `seaborn` is a library based on matplotlib that allows for data visualization
* `plotly` is an open-source graphing library for Python.

In [76]:
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

**Natural Language Processing Libraries**
* `re` is a module that allows the use of regular expressions

In [77]:
import re

#### **Datasets and Files**

The following files were used for this project:

- `Scam_SMS_Reports.xlsx` contains user-submitted reports with the phone number, type of scam, proof, and whether the message included the recipient's name.
- `SPAM_SMS.csv` contains the text messages received by one person, with the sender's number, the message text, and the date and time received.
- `text-scams-incidents-philippines-2019-by-region.xlsx` contains the number of scam messages (in thousands) received per region.
- `networks.csv` contains the first 4-5 digits of a Philippine phone number, which identify the network it belongs to.

## **Data Collection**

Each dataset is imported using pandas.

In [78]:
dataset = "Raw Datasets/Scam_SMS_Reports.xlsx"

report = pd.read_excel(dataset)
report.head()

Unnamed: 0,Unnamed: 1,Number,Network (Auto-Generates),Type of Scam,Proofs,Knows your name?\nCheck if yes,Unnamed: 6,GRAPHS AND STUFF,Unnamed: 8,Unnamed: 9,...,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26,Unnamed: 27
0,1,9103239417,,Work from home,,False,,,,,...,,,,,,,,,,
1,2,95348643,,Dating Scam,,False,,,,,...,,,,,,,,,,
2,3,931804865,,work,,False,,,,,...,,,,,,,,,,
3,4,981197529,,nanalo sa lotto,,False,,,,,...,,,,,,,,,,
4,5,981369614,,Abroad Opportunity kuno,,False,,,,,...,,,,,,,,,,


In [79]:
dataset = "Raw Datasets/SPAM_SMS.csv"

spam = pd.read_csv(dataset)
spam.head()

Unnamed: 0.1,Unnamed: 0,_id,address,date,text,threadId
0,0,8787,+6396***32373,2022-11-12 14:02:10.079,"Welcome ! your have P1222 for S!ot , \nWeb: 11...",836
1,1,8788,+6398***78852,2022-11-12 14:33:48.916,"My god, at least 999P rewards waiting for you\...",837
2,2,8789,+6394***80113,2022-11-13 23:03:15.023,"DEAR VIP <REAL NAME>, No. 1 Online Sabong Site...",838
3,3,8790,+6395***34934,2022-11-14 00:07:18.715,"<REAL NAME>! Today, you can win the iphone14PR...",839
4,4,8791,+6396***74401,2022-11-15 02:28:56.636,"Welcome ! your have P1222 for S!ot , \nWeb: gr...",841


In [80]:
dataset = "Raw Datasets/text-scams-incidents-philippines-2019-by-region.xlsx"

incidents = pd.read_excel(dataset, sheet_name="Data")
incidents.head()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2
0,,,
1,,Number of SMS fraud or text scams incidents Ph...,
2,,Total number of SMS fraud or text scam inciden...,
3,,,
4,,Region 3,3484.73


## **Description of the Dataset**

Here, we find the shape of each dataset.

In [81]:
datasets = [report, spam, incidents]

# `df` is used as the loop variable to avoid shadowing the built-in `set`
for df in datasets:
    print(df.shape)

(10493, 28)
(170, 6)
(21, 3)


By looking at the `info` of each dataframe, we can see that `report` and `incidents` contain `null` values, while `spam` does not.

In [82]:
report.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10493 entries, 0 to 10492
Data columns (total 28 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0                                  10493 non-null  int64  
 1   Number                         4883 non-null   object 
 2   Network (Auto-Generates)       9974 non-null   object 
 3   Type of Scam                   4686 non-null   object 
 4   Proofs                         1257 non-null   object 
 5   Knows your name?
Check if yes  10492 non-null  object 
 6   Unnamed: 6                     0 non-null      float64
 7   GRAPHS AND STUFF               27 non-null     object 
 8   Unnamed: 8                     8 non-null      object 
 9   Unnamed: 9                     1 non-null      object 
 10  Unnamed: 10                    0 non-null      float64
 11  Unnamed: 11                    1 non-null      object 
 12  Unnamed: 12                    1 non-null     

In [83]:
spam.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 170 entries, 0 to 169
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  170 non-null    int64 
 1   _id         170 non-null    int64 
 2   address     170 non-null    object
 3   date        170 non-null    object
 4   text        170 non-null    object
 5   threadId    170 non-null    int64 
dtypes: int64(3), object(3)
memory usage: 8.1+ KB


In [84]:
incidents.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21 entries, 0 to 20
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  0 non-null      float64
 1   Unnamed: 1  19 non-null     object 
 2   Unnamed: 2  17 non-null     float64
dtypes: float64(2), object(1)
memory usage: 632.0+ bytes


## **Exploratory Data Analysis**

### **Report**

In [85]:
report.columns

Index([' ', 'Number', 'Network (Auto-Generates)', 'Type of Scam', 'Proofs',
       'Knows your name?\nCheck if yes', 'Unnamed: 6', 'GRAPHS AND STUFF',
       'Unnamed: 8', 'Unnamed: 9', 'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12',
       'Unnamed: 13', 'Unnamed: 14', 'Unnamed: 15', 'Unnamed: 16',
       'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19', 'Unnamed: 20',
       'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23', 'Unnamed: 24',
       'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27'],
      dtype='object')

In [86]:
report['Network (Auto-Generates)'].value_counts()

                         5106
Smart or Talk ‘N Text    3205
Globe or TM              1240
Smart                     289
Sun Cellular              134
Name: Network (Auto-Generates), dtype: int64

In [87]:
report['Proofs'].value_counts()

https://t.ly/stsH -compiled spam message screenshot                                                                                                                                                                                                                  62
Row 469 - Proof                                                                                                                                                                                                                                                      39
Screenshot                                                                                                                                                                                                                                                           37
'                                                                                                                                                                                                               

In [88]:
report['Knows your name?\nCheck if yes'].value_counts()

False                                                                                                                          10279
True                                                                                                                             195
linky-ph/-BingoPlus                                                                                                                2
Atin/B.e.t/  is  B.e.s.t On/l.i.n.e Ca/si.n0 in the Philippines. http://okadaonline.pics/KsN G/e/t 80 % for every de/po/sit        2
Mentioned my name and the link. Ayaw nya daw ako maging mahirap HAHHAHA                                                            1
98 Games; Cash in                                                                                                                  1
http://gcxvd2ny.com                                                                                                                1
mentioned my name                                                    

In [89]:
report.iloc[:, 0:6].dtypes

                                   int64
Number                            object
Network (Auto-Generates)          object
Type of Scam                      object
Proofs                            object
Knows your name?\nCheck if yes    object
dtype: object

In [90]:
report['Number'].value_counts()[report['Number'].value_counts() > 2]

9602956931    4
9171832274    4
9811905645    4
9813126760    4
9173211259    4
9812913402    4
9852597532    4
9317418500    4
9750223903    4
9177095646    3
9813696569    3
9504889147    3
9813810743    3
9261762496    3
9811905386    3
9270512120    3
9171050224    3
9171874218    3
9761276078    3
9813458245    3
9171473405    3
9811993224    3
9389379718    3
9811905673    3
9178255960    3
9125514092    3
9702629918    3
9813810774    3
9171453599    3
9811905410    3
9097709911    3
9812142562    3
9813458247    3
9171780185    3
9096938537    3
9171344686    3
9171342236    3
9171299857    3
Name: Number, dtype: int64

### **Spam**

In [91]:
spam.columns

Index(['Unnamed: 0', '_id', 'address', 'date', 'text', 'threadId'], dtype='object')

In [92]:
spam['date'].describe()

count                         170
unique                        170
top       2022-11-12 14:02:10.079
freq                            1
Name: date, dtype: object

In [93]:
spam['date'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 170 entries, 0 to 169
Series name: date
Non-Null Count  Dtype 
--------------  ----- 
170 non-null    object
dtypes: object(1)
memory usage: 1.5+ KB


## **Data Preprocessing**

### **Networks**

In [94]:
network = pd.read_csv("Raw Datasets/networks.csv")
network['Network'].value_counts()

Globe/TM          25
Smart             14
Sun               14
TNT                9
Globe PostPaid     9
DITO               8
Globe              1
Globe/GOMO         1
Name: Network, dtype: int64

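The network labels are inconsistent (for example, `Globe/TM`, `Globe PostPaid`, and `Globe`), so we merge them into the labels used in the report dataset.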

In [95]:
#replace the Globe variants (Globe/TM, Globe PostPaid, Globe/GOMO, Globe) with "Globe or TM", and do the same for the Smart and Sun labels
network['Network'] = network['Network'].replace(['Globe/TM', 'Globe PostPaid', 'Globe/GOMO', 'Globe'], 'Globe or TM')
network['Network'] = network['Network'].replace(['Smart', 'TNT'], "Smart or Talk 'N Text")
network['Network'] = network['Network'].replace(['Sun'], "Sun Cellular")

network['Network'].value_counts()

Globe or TM              36
Smart or Talk 'N Text    23
Sun Cellular             14
DITO                      8
Name: Network, dtype: int64
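
The label clean-up above can also be written as a single lookup table, which keeps every variant and its canonical name in one place. The following is a minimal sketch on a made-up `labels` series; the `canonical` dictionary and the sample values are assumptions for illustration, not part of the notebook. It also maps the curly-apostrophe spelling `Smart or Talk ‘N Text`, which appears in the report dataset, to the straight-apostrophe form.

```python
import pandas as pd

# Hypothetical sample of raw network labels (not the real networks.csv).
labels = pd.Series([
    "Globe/TM", "Globe PostPaid", "TNT", "Smart",
    "Sun", "DITO", "Smart or Talk ‘N Text",
])

# One dictionary from every known variant to its canonical label.
canonical = {
    "Globe/TM": "Globe or TM",
    "Globe PostPaid": "Globe or TM",
    "Globe/GOMO": "Globe or TM",
    "Globe": "Globe or TM",
    "Smart": "Smart or Talk 'N Text",
    "TNT": "Smart or Talk 'N Text",
    "Smart or Talk ‘N Text": "Smart or Talk 'N Text",  # curly apostrophe variant
    "Sun": "Sun Cellular",
}

# Values missing from the dictionary (here "DITO") are left unchanged.
normalized = labels.replace(canonical)
print(normalized.value_counts())
```

Keeping the variants in a dictionary makes it easy to spot a label that has not been handled yet, since it passes through untouched.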

In [96]:
#keep the first three digits of the number
network['Prefix'] = network['Prefix'].astype(str).str[:3]

network

Unnamed: 0,Prefix,Network
0,817,Globe or TM
1,895,DITO
2,896,DITO
3,897,DITO
4,898,DITO
...,...,...
76,925,Globe or TM
77,925,Globe or TM
78,925,Globe or TM
79,925,Globe or TM


### **Reports**

#### **Data Preprocessing**

##### **Dropping**

In [97]:
report.head()

Unnamed: 0,Unnamed: 1,Number,Network (Auto-Generates),Type of Scam,Proofs,Knows your name?\nCheck if yes,Unnamed: 6,GRAPHS AND STUFF,Unnamed: 8,Unnamed: 9,...,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26,Unnamed: 27
0,1,9103239417,,Work from home,,False,,,,,...,,,,,,,,,,
1,2,95348643,,Dating Scam,,False,,,,,...,,,,,,,,,,
2,3,931804865,,work,,False,,,,,...,,,,,,,,,,
3,4,981197529,,nanalo sa lotto,,False,,,,,...,,,,,,,,,,
4,5,981369614,,Abroad Opportunity kuno,,False,,,,,...,,,,,,,,,,


In [98]:
select = report.iloc[:, 0:6]
select.head()

Unnamed: 0,Unnamed: 1,Number,Network (Auto-Generates),Type of Scam,Proofs,Knows your name?\nCheck if yes
0,1,9103239417,,Work from home,,False
1,2,95348643,,Dating Scam,,False
2,3,931804865,,work,,False
3,4,981197529,,nanalo sa lotto,,False
4,5,981369614,,Abroad Opportunity kuno,,False


In [99]:
select = select.dropna(subset=['Number'], axis = 0)
select.tail()

Unnamed: 0,Unnamed: 1,Number,Network (Auto-Generates),Type of Scam,Proofs,Knows your name?\nCheck if yes
4885,4886,9207721859,Smart or Talk ‘N Text,JACKPOT CITY,"*Insert my name*., Just a minimum deposit, you...",True
4886,4887,9854472269,Smart or Talk ‘N Text,SBET,"*insert my name*., Experience SBET, STABLE SYS...",True
4887,4888,9811248577,Smart or Talk ‘N Text,JACKPOT CITY,"*Insert my name*., JACKPOT CITY has the best g...",True
4888,4889,9855665323,Smart or Talk ‘N Text,Dear VIP,"Why are you still waiting, DEAR VIP Get your P...",False
4890,4891,9264224386,Globe or TM,fake seller,,True


##### **Columns**

In [100]:
#rename all columns
select.columns = ['id','number', 'network', 'type', 'proof', 'name']


In [101]:
select.head()

Unnamed: 0,id,number,network,type,proof,name
0,1,9103239417,,Work from home,,False
1,2,95348643,,Dating Scam,,False
2,3,931804865,,work,,False
3,4,981197529,,nanalo sa lotto,,False
4,5,981369614,,Abroad Opportunity kuno,,False


In [102]:
#get the first 3 digits of the number
select['indicator'] = select['number'].astype(str).str[:3]
select['indicator']

0       910
1       953
2       931
3       981
4       981
       ... 
4885    920
4886    985
4887    981
4888    985
4890    926
Name: indicator, Length: 4883, dtype: object

#### **Data Cleaning**

##### **Network Column**

In [103]:
# Create a dictionary mapping prefixes to networks from the second dataframe
prefix_network_map = dict(zip(network['Prefix'], network['Network']))

# Look up each number in the prefix map; where there is no match, keep the existing network value (remaining nulls are filled with "Unknown" later)
select['network'] = select['number'].map(prefix_network_map).fillna(select['network'])


# Print the updated first dataframe
select

Unnamed: 0,id,number,network,type,proof,name,indicator
0,1,9103239417,,Work from home,,False,910
1,2,95348643,,Dating Scam,,False,953
2,3,931804865,,work,,False,931
3,4,981197529,,nanalo sa lotto,,False,981
4,5,981369614,,Abroad Opportunity kuno,,False,981
...,...,...,...,...,...,...,...
4885,4886,9207721859,Smart or Talk ‘N Text,JACKPOT CITY,"*Insert my name*., Just a minimum deposit, you...",True,920
4886,4887,9854472269,Smart or Talk ‘N Text,SBET,"*insert my name*., Experience SBET, STABLE SYS...",True,985
4887,4888,9811248577,Smart or Talk ‘N Text,JACKPOT CITY,"*Insert my name*., JACKPOT CITY has the best g...",True,981
4888,4889,9855665323,Smart or Talk ‘N Text,Dear VIP,"Why are you still waiting, DEAR VIP Get your P...",False,985
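
The cell above passes the full `number` to `map`, while the dictionary keys are three-digit prefix strings, so a lookup keyed on the `indicator` column may be closer to the intent. The sketch below illustrates that idea on miniature stand-ins; `network_demo`, `select_demo`, and the prefix assignments in them are hypothetical and are not taken from `networks.csv`.

```python
import pandas as pd

# Hypothetical miniature versions of the `network` and `select` dataframes.
network_demo = pd.DataFrame({
    "Prefix": ["917", "920", "895"],
    "Network": ["Globe or TM", "Smart or Talk 'N Text", "DITO"],
})
select_demo = pd.DataFrame({
    "number": [9171234567, 9201234567, 9001234567],
    "network": [None, "Smart or Talk 'N Text", None],
})

# Lookup from 3-digit prefix (stored as a string) to network name.
prefix_network_map = dict(zip(network_demo["Prefix"], network_demo["Network"]))

# Derive the prefix from the number, then look the prefix up.
select_demo["indicator"] = select_demo["number"].astype(str).str[:3]
looked_up = select_demo["indicator"].map(prefix_network_map)

# Keep a network that was already reported; otherwise use the looked-up
# value, and fall back to "Unknown" when the prefix is not in the map.
select_demo["network"] = (
    select_demo["network"].fillna(looked_up).fillna("Unknown")
)
print(select_demo)
```

Both sides of the lookup are strings here; mapping integer numbers against string keys would silently return only missing values.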


In [104]:
select['network'].value_counts()

Smart or Talk ‘N Text    3205
Globe or TM              1240
Smart                     289
Sun Cellular              134
Name: network, dtype: int64

In [105]:
select['network'].value_counts().sum()

4868

In [106]:
#making the Smart sim consistent
select['network'] = select['network'].replace(['Smart or Talk ‘N Text', 'Smart'], "Smart or Talk 'N Text")

#fill null values with "Unknown"
select['network'] = select['network'].fillna('Unknown')
select['network'].value_counts()

Smart or Talk 'N Text    3494
Globe or TM              1240
Sun Cellular              134
Unknown                    15
Name: network, dtype: int64

In [107]:
select['network'].value_counts().sum()

4883

##### **Type of Scam column**

In [108]:
select.head()

Unnamed: 0,id,number,network,type,proof,name,indicator
0,1,9103239417,Unknown,Work from home,,False,910
1,2,95348643,Unknown,Dating Scam,,False,953
2,3,931804865,Unknown,work,,False,931
3,4,981197529,Unknown,nanalo sa lotto,,False,981
4,5,981369614,Unknown,Abroad Opportunity kuno,,False,981


In [109]:
#get the values that occur fewer than 10 times in the type column
select['type'].value_counts()[select['type'].value_counts() < 10]

Fake COVID 19 Cash Grant    9
Cash Support                9
T1bet7                      9
Paload                      9
Verification Code           9
                           ..
6.9m                        1
5.8m                        1
5000 php                    1
good offer                  1
fake seller                 1
Name: type, Length: 249, dtype: int64

In [110]:
#show the top 25 value counts in the type column
select['type'].value_counts()[:25]

Online Games        352
Not Specified       322
Casino              273
Online Casino       211
Solar Lights        206
Lazada Kuno         150
Email               149
Work                146
Unclaimed Bonus     110
Work from home      109
Loan/Pautang        108
Register and win    106
Bank Scam            99
Raffle               97
Play to Win          82
Job Offer            72
Rewards              64
Deposit scam         63
Bonus                57
Name                 55
BINGO                51
Passive Income       51
Funds                48
Investment           45
Gcash scam           44
Name: type, dtype: int64

In [111]:
select['type'].value_counts()[26:50]

Nanalo Sa Lotto                  40
Claim Money                      40
Web Platform                     39
Libreng Pera                     37
nanalo sa lotto                  37
Online Cockfighting              36
Mybitglobal                      30
Free Gifts                       29
Gaming Platform                  29
Can assist in cash essentials    29
Play to Earn                     27
W.Plus                           27
Globe                            24
work                             24
your phone no has won            22
Netflix kuno                     22
Cryptocurrency                   21
Big J@CKP0T                      20
JILI and FC Jackpot              20
Top-up Bonus                     19
Legit Site daw                   19
Free Spin                        19
Political                        19
Abroad Opportunity kuno          19
Name: type, dtype: int64

In [112]:
#count the non-null values in the type column
select['type'].value_counts().sum()

4679

In [113]:
#lowercasing the type column
select['type'] = select['type'].str.lower()

#removing all punctuation using regex (regex=True is required in recent pandas versions)
select['type'] = select['type'].str.replace(r'[^\w\s]', '', regex=True)

#removing all numbers using regex
select['type'] = select['type'].str.replace(r'\d+', '', regex=True)

#removing leading and trailing spaces
select['type'] = select['type'].str.strip()


select['type'].value_counts()


online games      354
not specified     322
casino            275
online casino     212
solar lights      206
                 ... 
laismcom            1
irpkesocom          1
claim rebate        1
globe bernales      1
fake seller         1
Name: type, Length: 302, dtype: int64
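
To make the effect of the normalization steps concrete, here is a small sketch on a hypothetical `raw_types` series (the sample values are chosen to resemble labels seen in the report, but the series itself is an assumption). It shows why later cells refer to labels such as `loanpautang` and `big jckpt`.

```python
import pandas as pd

# Hypothetical raw labels, including a missing value.
raw_types = pd.Series(
    ["Loan/Pautang", "Big J@CKP0T", "  T1bet7 ", "Top-up Bonus", None]
)

cleaned = (
    raw_types.str.lower()
    .str.replace(r"[^\w\s]", "", regex=True)  # drop punctuation
    .str.replace(r"\d+", "", regex=True)      # drop digits
    .str.strip()                              # trim surrounding whitespace
)
print(cleaned.tolist())
```

Missing values stay missing through every `.str` step, so they do not need special handling here.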

In [114]:
#replace specific values in the type column
select['type'] = select['type'].replace(['casino','online casino','bingo', 'big jckpt','jili and fc jackpot','online cockfighting', 'bingo plus'], "casino/gambling")
select['type'] = select['type'].replace(['lazada kuno', 'amazon', 'shoppee kuno'], "online shopping")
select['type'] = select['type'].replace(['loanpautang','bank scam','investment'], "loan/bank")
select['type'] = select['type'].replace(['register and win', 'play to win', 'resgister and win', 'play to earn'], "raffle/play")
select['type'] = select['type'].replace(['nanalo sa lotto'], "lotto")
select['type'] = select['type'].replace(['gcash scam'], "gcash")
select['type'] = select['type'].replace(['claim bnus', 'claim money','your phone no has won', 'pindutin para kunin', 'salary claim'], "claiming")
select['type'] = select['type'].replace(['cryptocurrency', 'libreng pera', 'funds', 'bonus','can assist in cash essentials', 'topup bonus', 'deposit bonus', 'cash prize', 'fastcashvip', 'cashback'], "money")
select['type'] = select['type'].replace(['abroad opportunity kuno', 'work from home', 'interview'], "work")
select['type'] = select['type'].replace(['web platform', 'gaming platform','legit site daw','magbukas ng account'], "platform")
select['type'] = select['type'].replace(['free gifts','free spin', 'free luckyphil'], "free")
select['type'] = select['type'].replace(['netflix kuno'], "netflix")
select['type'] = select['type'].replace(['solar lights'], "products")

select['type'] = select['type'].replace(['mybitglobal','wplus', 'tbet','okbet','m kita', 'name', 'not specified', 'blank message', 'urgent'], "not specified")

#replace values that occur 10 times or fewer with "others"
select['type'] = select['type'].replace(select['type'].value_counts()[select['type'].value_counts() <= 10].index, "others")
select['type'] = select['type'].replace(['reliefassistance fund', 'pagcor licensed', 'dating scam'], "others")

select['type'].value_counts()

casino/gambling    632
others             597
not specified      512
online games       354
work               313
money              272
loan/bank          256
raffle/play        234
products           206
online shopping    181
email              150
claiming           130
unclaimed bonus    111
platform           102
raffle              98
lotto               78
job offer           72
rewards             64
deposit scam        63
free                61
passive income      51
gcash               44
globe               24
netflix             22
political           19
fake news           17
phone call scam     16
Name: type, dtype: int64
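
Collapsing rare labels into `others` can be sketched as follows. The `types` series and the `threshold` of 2 are hypothetical (the notebook uses 10 on the real data); the sketch computes `value_counts` once and uses `where` so that missing values are left alone.

```python
import pandas as pd

# Hypothetical labels: two frequent categories, two rare ones, one missing.
types = pd.Series(
    ["casino"] * 5 + ["work"] * 3 + ["sbet", "laismcom", None]
)
threshold = 2  # assumed cutoff for this sketch

counts = types.value_counts()             # missing values are excluded
rare = counts[counts <= threshold].index  # labels seen `threshold` times or fewer

# Keep labels that are not rare; replace the rest with "others".
bucketed = types.where(~types.isin(rare), "others")
print(bucketed.value_counts())
```

Because the counts are taken before any replacement, the `others` bucket itself can end up larger than the threshold, which matches the behaviour of the cell above.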

In [115]:
select['type'].value_counts().sum()

4679

##### **Proof column**

In [116]:
select.head()

Unnamed: 0,id,number,network,type,proof,name,indicator
0,1,9103239417,Unknown,work,,False,910
1,2,95348643,Unknown,others,,False,953
2,3,931804865,Unknown,work,,False,931
3,4,981197529,Unknown,lotto,,False,981
4,5,981369614,Unknown,work,,False,981


In [117]:
select['proof'].value_counts().sum()

1256

In [118]:
#use regex to remove " *Insert my name*., " (regex=True is required when passing flags)
select['proof'] = select['proof'].str.replace(r'\*insert my name\*\., ', '', flags=re.IGNORECASE, regex=True)


In [119]:
select['proof']

0                                                     NaN
1                                                     NaN
2                                                     NaN
3                                                     NaN
4                                                     NaN
                              ...                        
4885    Just a minimum deposit, you can double your mo...
4886    Experience SBET, STABLE SYSTEM, NO MAINTENANCE...
4887    JACKPOT CITY has the best gaming experience an...
4888    Why are you still waiting, DEAR VIP Get your P...
4890                                                  NaN
Name: proof, Length: 4883, dtype: object

##### **Name column**

In [120]:
select['name'].value_counts()

False                                                                                                                          4669
True                                                                                                                            195
linky-ph/-BingoPlus                                                                                                               2
Atin/B.e.t/  is  B.e.s.t On/l.i.n.e Ca/si.n0 in the Philippines. http://okadaonline.pics/KsN G/e/t 80 % for every de/po/sit       2
Mentioned my name and the link. Ayaw nya daw ako maging mahirap HAHHAHA                                                           1
98 Games; Cash in                                                                                                                 1
http://gcxvd2ny.com                                                                                                               1
mentioned my name                                                           

In [121]:
select['name'].value_counts().sum()

4882

In [122]:
select['name'] = select['name'].apply(lambda x: True if isinstance(x, str) and re.search(r'\bname\b', x) else x)

In [123]:
select['name'].value_counts()

False                                                                                                                          4669
True                                                                                                                            199
linky-ph/-BingoPlus                                                                                                               2
Atin/B.e.t/  is  B.e.s.t On/l.i.n.e Ca/si.n0 in the Philippines. http://okadaonline.pics/KsN G/e/t 80 % for every de/po/sit       2
98 Games; Cash in                                                                                                                 1
http://gcxvd2ny.com                                                                                                               1
https://linnki.in/WNgzX                                                                                                           1
Mybitglobal (https://t.co/9GDnInpR0M                                        

In [124]:
select['name'] = select['name'].astype(str)

# Replace non-Boolean values with False
select.loc[~select['name'].str.lower().isin(['true', 'false']), 'name'] = 'False'

select['name'].value_counts()

False    4684
True      199
Name: name, dtype: int64
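
The cells above leave `name` as the strings `'True'` and `'False'`. If a real boolean column is preferred, the same rules can be applied in one function, as in this sketch. The `name_raw` series and the helper `to_name_flag` are hypothetical names introduced for illustration.

```python
import re
import pandas as pd

# Hypothetical mix of checkbox values, free text, a link, and a missing value.
name_raw = pd.Series(
    [True, False, "mentioned my name", "http://gcxvd2ny.com", None],
    dtype=object,
)

def to_name_flag(value):
    """Return True for a ticked box or free text containing the word 'name'."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        return re.search(r"\bname\b", value) is not None
    return False  # missing or any other type counts as "not named"

name_flag = name_raw.map(to_name_flag).astype(bool)
print(name_flag.value_counts())
```

A boolean dtype lets later steps use `name_flag.sum()` or `name_flag.mean()` directly instead of comparing strings.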

In [125]:
select['name'].value_counts().sum()

4883

### **Spam**

#### **Data Preprocessing**

In [126]:
spam.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 170 entries, 0 to 169
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  170 non-null    int64 
 1   _id         170 non-null    int64 
 2   address     170 non-null    object
 3   date        170 non-null    object
 4   text        170 non-null    object
 5   threadId    170 non-null    int64 
dtypes: int64(3), object(3)
memory usage: 8.1+ KB


In [127]:
#dropping unnecessary columns
spam = spam.drop(['Unnamed: 0', '_id', 'address', 'threadId'], axis = 1)

spam.head()

Unnamed: 0,date,text
0,2022-11-12 14:02:10.079,"Welcome ! your have P1222 for S!ot , \nWeb: 11..."
1,2022-11-12 14:33:48.916,"My god, at least 999P rewards waiting for you\..."
2,2022-11-13 23:03:15.023,"DEAR VIP <REAL NAME>, No. 1 Online Sabong Site..."
3,2022-11-14 00:07:18.715,"<REAL NAME>! Today, you can win the iphone14PR..."
4,2022-11-15 02:28:56.636,"Welcome ! your have P1222 for S!ot , \nWeb: gr..."


In [128]:
#drop rows that contain '<<Content not supported.>>' (copy to avoid chained-assignment warnings later)
spam = spam[~spam['text'].str.contains(r'<<Content not supported\.>>', case=False)].copy()

#### **Data Cleaning**

In [129]:
spam['date'] = pd.to_datetime(spam['date'])

In [130]:
spam['Date'] = spam['date'].dt.date
spam['Time'] = spam['date'].dt.strftime('%H:%M')

In [131]:
spam.head()

Unnamed: 0,date,text,Date,Time
0,2022-11-12 14:02:10.079,"Welcome ! your have P1222 for S!ot , \nWeb: 11...",2022-11-12,14:02
1,2022-11-12 14:33:48.916,"My god, at least 999P rewards waiting for you\...",2022-11-12,14:33
2,2022-11-13 23:03:15.023,"DEAR VIP <REAL NAME>, No. 1 Online Sabong Site...",2022-11-13,23:03
3,2022-11-14 00:07:18.715,"<REAL NAME>! Today, you can win the iphone14PR...",2022-11-14,00:07
4,2022-11-15 02:28:56.636,"Welcome ! your have P1222 for S!ot , \nWeb: gr...",2022-11-15,02:28
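
The date and time split relies on the `.dt` accessor, which only works once the column has been parsed with `pd.to_datetime`. A minimal sketch on two hypothetical timestamps in the same format as the `date` column:

```python
import pandas as pd

# Hypothetical timestamps in the format used by the spam dataset.
stamps = pd.Series(["2022-11-12 14:02:10.079", "2022-11-14 00:07:18.715"])

parsed = pd.to_datetime(stamps)        # object (string) -> datetime64
dates = parsed.dt.date                 # Python date objects
times = parsed.dt.strftime("%H:%M")    # zero-padded hour:minute strings
hours = parsed.dt.hour                 # integer hour, handy for grouping

print(dates.tolist(), times.tolist(), hours.tolist())
```

Note that `strftime` returns strings, so `Time` sorts correctly only because the format is zero-padded.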


In [132]:
#drop date column
spam = spam.drop(['date'], axis = 1)

In [133]:
spam['text'].value_counts().sum()

159

In [134]:
spam['name'] = spam['text'].str.contains('<REAL NAME>', regex=False)

spam['name'].value_counts()

False    88
True     71
Name: name, dtype: int64

In [135]:
#remove <REAL NAME> from the text column (regex=True is required when passing flags)
spam['text'] = spam['text'].str.replace('<REAL NAME>', '', flags=re.IGNORECASE, regex=True)

### **Incidents**

#### **Data Preprocessing**

In [136]:
incidents

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2
0,,,
1,,Number of SMS fraud or text scams incidents Ph...,
2,,Total number of SMS fraud or text scam inciden...,
3,,,
4,,Region 3,3484.73
5,,Region 6,2778.36
6,,NCR,2739.52
7,,Region 4-A,2426.91
8,,Region 7,1366.43
9,,CARAGA,500.0


In [137]:
#drop unnamed: 0
incidents = incidents.drop(['Unnamed: 0'], axis = 1)

#drop rows 0-3
incidents = incidents.drop([0,1,2,3], axis = 0)

In [138]:
#rename columns
incidents = incidents.rename(columns={'Unnamed: 1': 'region', 'Unnamed: 2': 'number'})

#reorder rows alphabetically
incidents = incidents.sort_values(by=['region']).reset_index(drop=True)

In [139]:
incidents

Unnamed: 0,region,number
0,BARMM,390.48
1,CAR,112.6
2,CARAGA,500.0
3,NCR,2739.52
4,Region 1,113.3
5,Region 10,374.01
6,Region 11,189.71
7,Region 12,390.41
8,Region 2,83.33
9,Region 3,3484.73


## **Feature Selection**

In [140]:
spam

Unnamed: 0,text,Date,Time,name
0,"Welcome ! your have P1222 for S!ot , \nWeb: 11...",2022-11-12,14:02,False
1,"My god, at least 999P rewards waiting for you\...",2022-11-12,14:33,False
2,"DEAR VIP , No. 1 Online Sabong Site here in SB...",2022-11-13,23:03,True
3,"! Today, you can win the iphone14PROMAX while ...",2022-11-14,00:07,True
4,"Welcome ! your have P1222 for S!ot , \nWeb: gr...",2022-11-15,02:28,False
...,...,...,...,...
159,"Araw-araw na suweld0 5000, kai1angan lang ng 1...",2023-05-15,02:52,False
161,Goodnews! VIP perks at SUGARPLAY Sign up & Cla...,2023-05-22,04:23,False
162,Start referring & earning at SUGARPLAY Earn e...,2023-05-22,10:38,True
164,", Experience the thrill at JackpotCity! Enjoy ...",2023-05-23,05:56,True


In [141]:
select

Unnamed: 0,id,number,network,type,proof,name,indicator
0,1,9103239417,Unknown,work,,False,910
1,2,95348643,Unknown,others,,False,953
2,3,931804865,Unknown,work,,False,931
3,4,981197529,Unknown,lotto,,False,981
4,5,981369614,Unknown,work,,False,981
...,...,...,...,...,...,...,...
4885,4886,9207721859,Smart or Talk 'N Text,others,"Just a minimum deposit, you can double your mo...",True,920
4886,4887,9854472269,Smart or Talk 'N Text,others,"Experience SBET, STABLE SYSTEM, NO MAINTENANCE...",True,985
4887,4888,9811248577,Smart or Talk 'N Text,others,JACKPOT CITY has the best gaming experience an...,True,981
4888,4889,9855665323,Smart or Talk 'N Text,others,"Why are you still waiting, DEAR VIP Get your P...",False,985


In [142]:
#renaming spam `text` column to `proof`
spam = spam.rename(columns={'text': 'proof'})

spam.head()

Unnamed: 0,proof,Date,Time,name
0,"Welcome ! your have P1222 for S!ot , \nWeb: 11...",2022-11-12,14:02,False
1,"My god, at least 999P rewards waiting for you\...",2022-11-12,14:33,False
2,"DEAR VIP , No. 1 Online Sabong Site here in SB...",2022-11-13,23:03,True
3,"! Today, you can win the iphone14PROMAX while ...",2022-11-14,00:07,True
4,"Welcome ! your have P1222 for S!ot , \nWeb: gr...",2022-11-15,02:28,False


In [143]:
display(
    "select", select['name'].value_counts(), select['name'].value_counts().sum(),
    "spam", spam['name'].value_counts(), spam['name'].value_counts().sum(),
    "total count", select['name'].value_counts().sum() + spam['name'].value_counts().sum(),
)

'select'

False    4684
True      199
Name: name, dtype: int64

4883

'spam'

False    88
True     71
Name: name, dtype: int64

159

'total count'

5042

In [144]:
#adding an empty `type` label to the spam dataset so it has the same columns as `select`
spam['type'] = ""

In [145]:
#make new dataframe from the two dataframes with proof and name columns
proof = pd.concat([select[['proof', 'name', 'type']], spam[['proof', 'name', 'type']]], axis=0).reset_index(drop=True)

proof


Unnamed: 0,proof,name,type
0,,False,work
1,,False,others
2,,False,work
3,,False,lotto
4,,False,work
...,...,...,...
5037,"Araw-araw na suweld0 5000, kai1angan lang ng 1...",False,
5038,Goodnews! VIP perks at SUGARPLAY Sign up & Cla...,False,
5039,Start referring & earning at SUGARPLAY Earn e...,True,
5040,", Experience the thrill at JackpotCity! Enjoy ...",True,


## **Saving Dataframes as CSVs**

In [147]:
#save the dataframes as CSV files in a different folder
#proof.to_csv('Processed Datasets/proof.csv', index=False)
#spam.to_csv('Processed Datasets/spam.csv', index=False)
#select.to_csv('Processed Datasets/select.csv', index=False)
#incidents.to_csv('Processed Datasets/incidents.csv', index=False)
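
When the export lines are enabled, the target folder has to exist, because `to_csv` does not create missing directories. The sketch below shows one way to handle that; it writes a hypothetical one-row `demo` dataframe into a temporary directory that stands in for `Processed Datasets`, then reads the file back to check the round trip.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Hypothetical one-row dataframe with the same columns as `proof`.
demo = pd.DataFrame({"proof": ["sample text"], "name": [True], "type": ["work"]})

# A temporary directory stands in for the notebook's output location.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "Processed Datasets"
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    out_path = out_dir / "proof.csv"

    demo.to_csv(out_path, index=False)          # index=False drops the row index
    restored = pd.read_csv(out_path)

print(restored)
```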