Spam Mail Classification Using a 1D CNN<br>
Reference: https://wikidocs.net/80787

In [1]:
import nltk
nltk.download('punkt')
import urllib.request

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [2]:
import numpy as np
import pandas as pd

Use the SMS Spam Collection dataset published on Kaggle<br>
https://www.kaggle.com/uciml/sms-spam-collection-dataset

In [3]:
urllib.request.urlretrieve("https://raw.githubusercontent.com/mohitgupta-omg/Kaggle-SMS-Spam-Collection-Dataset-/master/spam.csv", filename="spam.csv")
data = pd.read_csv('spam.csv',encoding='latin1')

In [4]:
data[:5]

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


Remove the Unnamed: 2, Unnamed: 3, and Unnamed: 4 columns and encode the labels (ham as 0, spam as 1)

In [5]:
del data['Unnamed: 2']
del data['Unnamed: 3']
del data['Unnamed: 4']
data['v1']=data['v1'].replace(['ham', 'spam'], [0, 1])

In [6]:
data[:5]

Unnamed: 0,v1,v2
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


Data preprocessing<br><br>
1) Lowercasing (cleaning step)

In [8]:
# Lowercase the whole column in one assignment; this avoids the
# chained-assignment SettingWithCopyWarning raised by data['v2'][i] = ...
data['v2'] = data['v2'].str.lower()
data[:5]


Unnamed: 0,v1,v2
0,0,"go until jurong point, crazy.. available only ..."
1,0,ok lar... joking wif u oni...
2,1,free entry in 2 a wkly comp to win fa cup fina...
3,0,u dun say so early hor... u c already then say...
4,0,"nah i don't think he goes to usf, he lives aro..."


2) Punctuation removal (cleaning step)

In [9]:
import re
# Strip punctuation from every message in one vectorized call.
# A raw string keeps the regex escapes intact.
punct = r"[-=+,#/?:^$.@*\"※~&%ㆍ!』‘|()\[\]<>`'…》;]"
data['v2'] = data['v2'].str.replace(punct, '', regex=True)
data[:5]


Unnamed: 0,v1,v2
0,0,go until jurong point crazy available only in ...
1,0,ok lar joking wif u oni
2,1,free entry in 2 a wkly comp to win fa cup fina...
3,0,u dun say so early hor u c already then say
4,0,nah i dont think he goes to usf he lives aroun...
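
As a quick check of the punctuation step, the minimal sketch below applies the same character class to two of the sample messages shown earlier. The name `PUNCT` is introduced here for illustration only and is not part of the notebook.

```python
import re

# Same character class as the punctuation-removal cell, written as a raw string.
PUNCT = re.compile(r"[-=+,#/?:^$.@*\"※~&%ㆍ!』‘|()\[\]<>`'…》;]")

sample = "ok lar... joking wif u oni..."
cleaned = PUNCT.sub("", sample)
print(cleaned)  # ok lar joking wif u oni

# Apostrophes are in the class too, so contractions collapse.
contraction = PUNCT.sub("", "nah i don't think")
print(contraction)  # nah i dont think
```

Because apostrophes are removed, "don't" becomes "dont", which is why that token appears in the later outputs.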


3) Removing words of length 2 or less (cleaning step)

In [10]:
# Remove words of one or two characters, together with any
# non-word characters (such as spaces) immediately before them.
data['v2'] = data['v2'].str.replace(r'\W*\b\w{1,2}\b', '', regex=True)
data[:5]


Unnamed: 0,v1,v2
0,0,until jurong point crazy available only bugis...
1,0,lar joking wif oni
2,1,free entry wkly comp win cup final tkts 21st m...
3,0,dun say early hor already then say
4,0,nah dont think goes usf lives around here though
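
To see what the short-word pattern does on a single string, here is a minimal sketch using one of the messages above. It only uses the standard `re` module; the variable names are illustrative.

```python
import re

# \W* consumes whitespace before the word, \b...\b pins a whole word,
# and \w{1,2} restricts it to one or two word characters (letters or digits).
shortword = re.compile(r"\W*\b\w{1,2}\b")

sample = "ok lar joking wif u oni"
result = shortword.sub("", sample)
print(repr(result))  # ' lar joking wif oni'
```

Note the leading space left behind when the removed word starts the string. It is harmless here because the next step tokenizes on whitespace.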


4) Word tokenization

In [11]:
from nltk.tokenize import word_tokenize
# Tokenize each message into a list of words.
data['v2'] = data['v2'].apply(word_tokenize)
data[:5]


Unnamed: 0,v1,v2
0,0,"[until, jurong, point, crazy, available, only,..."
1,0,"[lar, joking, wif, oni]"
2,1,"[free, entry, wkly, comp, win, cup, final, tkt..."
3,0,"[dun, say, early, hor, already, then, say]"
4,0,"[nah, dont, think, goes, usf, lives, around, h..."


5) Stopword removal

In [12]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [13]:
from nltk.corpus import stopwords
# A set makes each membership test constant time.
stop_words = set(stopwords.words('english'))
data['v2'] = data['v2'].apply(lambda tokens: [w for w in tokens if w not in stop_words])
data[:5]


Unnamed: 0,v1,v2
0,0,"[jurong, point, crazy, available, bugis, great..."
1,0,"[lar, joking, wif, oni]"
2,1,"[free, entry, wkly, comp, win, cup, final, tkt..."
3,0,"[dun, say, early, hor, already, say]"
4,0,"[nah, dont, think, goes, usf, lives, around, t..."


6) Lemmatization

In [14]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [15]:
from nltk.stem import WordNetLemmatizer
lem = WordNetLemmatizer()
# Lemmatize each token, then join the tokens back into one space-separated string.
data['v2'] = data['v2'].apply(lambda tokens: ' '.join(lem.lemmatize(w) for w in tokens))
data[:5]


Unnamed: 0,v1,v2
0,0,jurong point crazy available bugis great world...
1,0,lar joking wif oni
2,1,free entry wkly comp win cup final tkts 21st m...
3,0,dun say early hor already say
4,0,nah dont think go usf life around though


7) Integer encoding

In [16]:
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer=Tokenizer()
tokenizer.fit_on_texts(data['v2'])  # tokenize each row and build a frequency-ranked vocabulary
encoded=tokenizer.texts_to_sequences(data['v2'])
print(encoded[:5])

[[3875, 239, 541, 512, 1018, 47, 215, 2523, 1019, 7, 3876, 60], [205, 1165, 313, 1551], [5, 360, 574, 709, 129, 914, 513, 1552, 1895, 162, 1896, 14, 1553, 400, 360, 2524, 22, 2525, 263, 2526], [135, 43, 240, 2527, 68, 43], [758, 3, 31, 327, 710, 78, 111, 328]]


In [17]:
word_to_index=tokenizer.word_index  # mapping from each word to its integer index
cat_num=len(word_to_index)+1
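
The idea behind frequency-based integer encoding can be sketched in plain Python. This is a simplified illustration, not the Keras `Tokenizer` implementation (which also lowercases and filters punctuation by default); the toy `texts` list is made up.

```python
from collections import Counter

# Toy corpus (hypothetical); each message is already cleaned.
texts = ["free entry win cup", "dun say early say", "free win"]

# Count word frequencies over the whole corpus.
counts = Counter(word for text in texts for word in text.split())

# The most frequent word gets index 1. Index 0 is left unused so that
# it can later serve as the padding value.
word_to_index = {
    word: i for i, (word, _) in enumerate(counts.most_common(), start=1)
}

encoded = [[word_to_index[w] for w in text.split()] for text in texts]
vocab_size = len(word_to_index) + 1  # +1 accounts for the reserved index 0

print(encoded)     # [[1, 4, 2, 5], [6, 3, 7, 3], [1, 2]]
print(vocab_size)  # 8
```

This is the reason `cat_num` adds 1 to the vocabulary length: the `Embedding` layer needs a row for index 0 as well.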

8) Padding

In [18]:
X_data=encoded
print(max(len(l) for l in X_data))

80


In [19]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
max_len=80
X_data=pad_sequences(X_data, maxlen=max_len)
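
With its default arguments, `pad_sequences` pads (and truncates) at the front of each sequence. The sketch below mimics that behavior in plain Python; `pad_pre` is a hypothetical helper written for illustration, not a Keras function.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value` up to `maxlen`.

    Sequences longer than `maxlen` keep only their last `maxlen` items,
    matching the default padding='pre', truncating='pre'.
    """
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # drop items from the front if too long
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

result = pad_pre([[5, 3], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(result)  # [[0, 0, 5, 3], [3, 4, 5, 6]]
```

Since `max_len` is set to the longest message (80 tokens), nothing is truncated in this notebook; shorter messages are simply filled with leading zeros.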

Split into training data and test data

In [21]:
y_data=data['v1']
X_test = X_data[4400:]
y_test = y_data[4400:]
X_train = X_data[:4400]
y_train = y_data[:4400]

Classifying spam mail with a 1D CNN

In [22]:
from tensorflow.keras.layers import Dense, Conv1D, GlobalMaxPooling1D, Embedding, Dropout, MaxPooling1D
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

1D CNN model (1)

In [24]:
model=Sequential()
model.add(Embedding(cat_num, 32))
model.add(Conv1D(32, 5, strides=1, padding='valid', activation='relu'))
model.add(GlobalMaxPooling1D())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
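
To follow the tensor shapes through this model, it helps to compute the sequence length after the convolution. The helper below is a hypothetical sketch of the standard output-length formula, not part of Keras.

```python
def conv1d_output_length(input_length, kernel_size, stride=1, padding="valid"):
    """Sequence length after a 1D convolution."""
    if padding == "valid":
        # The kernel must fit entirely inside the input.
        return (input_length - kernel_size) // stride + 1
    if padding == "same":
        # Output length is ceil(input_length / stride).
        return -(-input_length // stride)
    raise ValueError(f"unsupported padding: {padding}")

steps = conv1d_output_length(80, 5)  # max_len=80, kernel size 5
print(steps)  # 76
```

So each padded message of 80 tokens becomes a (76, 32) feature map after `Conv1D`, and `GlobalMaxPooling1D` takes the maximum over the 76 positions, leaving a vector of 32 values for the final `Dense` layer.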

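EarlyStopping setup<br>
Stop training when the validation loss has not improved for 3 consecutive epochs (patience = 3); ModelCheckpoint saves the model with the best validation accuracy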

In [25]:
es = EarlyStopping(monitor = 'val_loss', mode = 'min', verbose = 1, patience = 3)
# With metrics=['accuracy'], TF 2.x logs the validation metric as 'val_accuracy'.
mc = ModelCheckpoint('best_model.h5', monitor = 'val_accuracy', mode = 'max', verbose = 1, save_best_only = True)
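
The patience rule can be sketched in a few lines of plain Python. This is a simplified model of the logic (no `min_delta`, no weight restoring), and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation losses: the best value occurs at epoch 3.
losses = [0.30, 0.12, 0.08, 0.09, 0.10, 0.11, 0.12]
stopped_at = early_stopping_epoch(losses)
print(stopped_at)  # 6
```

Under this rule, a run that stops at epoch 10 with patience 3 last improved its validation loss at epoch 7.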

Train the model

In [26]:
history = model.fit(X_train, y_train, epochs = 20, batch_size=64, validation_split=0.2, callbacks=[es, mc])

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 00010: early stopping


In [27]:
print("\n Test accuracy : %.4f" % (model.evaluate(X_test, y_test)[1]))


 Test accuracy : 0.9787
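
For a single sigmoid output, the accuracy metric thresholds each predicted probability at 0.5 and compares it with the label. A minimal sketch, with made-up labels and probabilities:

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    preds = [1 if p > threshold else 0 for p in y_prob]
    correct = sum(int(p == t) for p, t in zip(preds, y_true))
    return correct / len(y_true)

# Hypothetical labels (1 = spam) and predicted spam probabilities.
y_true = [0, 0, 1, 1, 0]
y_prob = [0.10, 0.60, 0.80, 0.40, 0.20]
acc = binary_accuracy(y_true, y_prob)
print(acc)  # 0.6
```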


1D CNN model (2)

In [28]:
model2=Sequential()
model2.add(Embedding(cat_num, 32))
model2.add(Dropout(0.2))
model2.add(Conv1D(32, 5, strides=1, padding='valid', activation='relu'))
model2.add(GlobalMaxPooling1D())
model2.add(Dense(64, activation='relu'))
model2.add(Dropout(0.2))
model2.add(Dense(1, activation='sigmoid'))
model2.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
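
The second model adds `Dropout(0.2)` layers. As a rough sketch of what dropout does during training (assuming the usual inverted-dropout scheme, in which kept activations are scaled by 1 / (1 - rate)), here is a plain-Python version; the function and variable names are illustrative.

```python
import random

def dropout(values, rate, rng):
    """Zero each value with probability `rate`; scale the rest by 1 / (1 - rate)."""
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
activations = [1.0] * 10
dropped = dropout(activations, rate=0.2, rng=rng)
print(dropped)  # each entry is either 0.0 or 1.25
```

The scaling keeps the expected activation unchanged, so the layer can simply pass values through untouched at inference time.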

Train model 2

In [29]:
history = model2.fit(X_train, y_train, epochs = 20, batch_size=64, validation_split=0.2, callbacks=[es, mc])

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 00009: early stopping


In [30]:
print("\n Test accuracy : %.4f" % (model2.evaluate(X_test, y_test)[1]))


 Test accuracy : 0.9812
