# Naive Bayes: Classifying Spam Messages

This notebook filters spam: it classifies each message as either spam or not spam (ham).

## Dataset
https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

## Loading the dataset

In [1]:
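import pandas as pd
# The file is tab-separated with no header row, so this call consumes the first
# message as the header; pass header=None to keep it as data.
df = pd.read_table("SMSSpamCollection")
features = ['label', 'message']
df.columns = features  # overwrite the consumed header with real column names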
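
Because the file has no header row, how it is read affects the row count. The following is a minimal sketch, using a made-up two-line stand-in for the tab-separated file (the sample rows are not from the dataset), of how the default call and `header=None` differ:

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for the tab-separated SMSSpamCollection file.
sample = "ham\tSee you at lunch\nspam\tWin a free prize now\n"

# Default: the first line is consumed as the header, leaving one data row.
df_default = pd.read_table(io.StringIO(sample))
df_default.columns = ['label', 'message']

# header=None keeps every line as data; names= sets the column names directly.
df_all = pd.read_table(io.StringIO(sample), header=None,
                       names=['label', 'message'])

print(len(df_default), len(df_all))
```

The outputs shown in this notebook come from the default call, so they start at the second message of the file.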

## Inspecting the data

In [2]:
df.head(10)

| | label | message |
|---|---|---|
| 0 | ham | Ok lar... Joking wif u oni... |
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 2 | ham | U dun say so early hor... U c already then say... |
| 3 | ham | Nah I don't think he goes to usf, he lives aro... |
| 4 | spam | FreeMsg Hey there darling it's been 3 week's n... |
| 5 | ham | Even my brother is not like to speak with me. ... |
| 6 | ham | As per your request 'Melle Melle (Oru Minnamin... |
| 7 | spam | WINNER!! As a valued network customer you have... |
| 8 | spam | Had your mobile 11 months or more? U R entitle... |
| 9 | ham | I'm gonna be home soon and i don't want to tal... |


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5571 entries, 0 to 5570
Data columns (total 2 columns):
label      5571 non-null object
message    5571 non-null object
dtypes: object(2)
memory usage: 87.1+ KB


## Preprocessing

### Converting the labels to binary

In [4]:
df['label'] = df.label.map({"ham":0, "spam":1})
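
`Series.map` with a dict returns `NaN` for any value that is not a key, so a quick check after mapping can catch unexpected labels. A small sketch with made-up labels (the misspelled `"Spam"` is deliberate and not from the dataset):

```python
import pandas as pd

labels = pd.Series(["ham", "spam", "ham", "Spam"])  # "Spam" is a deliberate typo
encoded = labels.map({"ham": 0, "spam": 1})

# Values missing from the mapping become NaN instead of raising an error.
n_unmapped = int(encoded.isna().sum())
print(n_unmapped)
```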

In [5]:
df.head(10)

| | label | message |
|---|---|---|
| 0 | 0 | Ok lar... Joking wif u oni... |
| 1 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 2 | 0 | U dun say so early hor... U c already then say... |
| 3 | 0 | Nah I don't think he goes to usf, he lives aro... |
| 4 | 1 | FreeMsg Hey there darling it's been 3 week's n... |
| 5 | 0 | Even my brother is not like to speak with me. ... |
| 6 | 0 | As per your request 'Melle Melle (Oru Minnamin... |
| 7 | 1 | WINNER!! As a valued network customer you have... |
| 8 | 1 | Had your mobile 11 months or more? U R entitle... |
| 9 | 0 | I'm gonna be home soon and i don't want to tal... |


## How to use Bag of Words

In [6]:
messages = ['Thank you for calling.',
            'Thank you for your inquiry',
            'Thanks for keeping in touch.',
            'Thanks for getting in touch with me?']

#### Importing the library

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

In [8]:
count_vec = CountVectorizer()

#### Checking the arguments of count_vec

In [9]:
count_vec

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

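- `stop_words` sets the words to exclude. Words that are expected to hurt prediction quality can be added to the stop-word list and dropped.

- `token_pattern` is a regular expression that defines what counts as a word. The default shown above keeps only tokens of two or more word characters.

- `lowercase` converts uppercase letters to lowercase, which prevents `Thank` and `thank` from being treated as different words.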
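
The effect of these three arguments can be seen by comparing vocabularies. This is a minimal sketch on two made-up sentences; the helper `vocab` and the custom `token_pattern` are illustrative choices, not part of the original notebook:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Thank you for calling.", "thank U"]

def vocab(**kwargs):
    """Fit a CountVectorizer with the given options; return its sorted vocabulary."""
    return sorted(CountVectorizer(**kwargs).fit(docs).vocabulary_)

default_vocab = vocab()                       # lowercased, 1-char "u" dropped
no_lowercase = vocab(lowercase=False)         # "Thank" and "thank" stay separate
with_stop = vocab(stop_words=["for", "you"])  # listed words are removed
# A pattern that also accepts single-character tokens such as "u".
one_char_ok = vocab(token_pattern=r"(?u)\b\w+\b")

for name, v in [("default", default_vocab), ("lowercase=False", no_lowercase),
                ("stop_words", with_stop), ("token_pattern", one_char_ok)]:
    print(name, v)
```

Allowing one-character tokens may matter for SMS text, where abbreviations such as "U" and "c" are common, but whether it helps this classifier is untested here.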

In [10]:
count_vec.fit(messages)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

#### Checking the vocabulary with `vocabulary_`

In [11]:
count_vec.vocabulary_

{'calling': 0,
 'for': 1,
 'getting': 2,
 'in': 3,
 'inquiry': 4,
 'keeping': 5,
 'me': 6,
 'thank': 7,
 'thanks': 8,
 'touch': 9,
 'with': 10,
 'you': 11,
 'your': 12}

#### The role of underscores in Python
https://qiita.com/ikki8412/items/ab482690170a3eac8c76

#### Converting to matrix form

In [12]:
data = count_vec.transform(messages)

#### Checking the result with the todense() method

In [13]:
data.todense()

matrix([[1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0],
        [0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1],
        [0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0],
        [0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0]], dtype=int64)
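
Each column of the matrix above corresponds to the word with that index in `vocabulary_`. A short sketch, rebuilding the same four messages, that pairs the nonzero counts of the first row with their words (the variable names `words` and `present` are introduced here for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

messages = ['Thank you for calling.',
            'Thank you for your inquiry',
            'Thanks for keeping in touch.',
            'Thanks for getting in touch with me?']

count_vec = CountVectorizer().fit(messages)
data = count_vec.transform(messages)

# Order the words by their column index in vocabulary_.
words = sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get)

# Pair each nonzero count in the first row with its word.
first_row = data[0].toarray().ravel()
present = {w: int(c) for w, c in zip(words, first_row) if c > 0}
print(present)
```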

## Splitting the dataset

In [14]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['message'], df['label'], test_size=0.2, random_state=1)
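
The split above is purely random. Since spam is the minority class, one option (not used in this notebook) is the `stratify` argument, which keeps the class ratio the same in both splits. A minimal sketch on made-up labels:

```python
from sklearn.model_selection import train_test_split

# Made-up imbalanced data: 16 ham (0) and 4 spam (1).
texts = [f"message {i}" for i in range(20)]
labels = [0] * 16 + [1] * 4

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=1, stratify=labels)

# With stratify, both splits keep the 4:1 ham-to-spam ratio.
print(sum(y_tr), len(y_tr), sum(y_te), len(y_te))
```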

## Converting X_train and X_test to Bag of Words

In [15]:
count_vector = CountVectorizer()
count_vector.fit(X_train)
count_vector.vocabulary_

{'wish': 7556,
 'things': 6849,
 'were': 7474,
 'different': 2300,
 'wonder': 7594,
 'when': 7494,
 'will': 7533,
 'be': 1252,
 'able': 735,
 'to': 6939,
 'show': 6132,
 'you': 7718,
 'how': 3513,
 'much': 4644,
 'value': 7248,
 'pls': 5268,
 'continue': 1967,
 'the': 6821,
 'brisk': 1495,
 'walks': 7370,
 'no': 4812,
 'drugs': 2457,
 'without': 7567,
 'askin': 1077,
 'me': 4410,
 'please': 5261,
 'and': 944,
 'find': 2844,
 'laugh': 4033,
 'about': 737,
 'love': 4232,
 'dearly': 2179,
 'almost': 902,
 'there': 6840,
 'see': 5989,
 'in': 3640,
 'sec': 5975,
 'nights': 4794,
 'we': 7423,
 'nt': 4862,
 'staying': 6463,
 'at': 1094,
 'port': 5322,
 'step': 6473,
 'liao': 4095,
 'too': 6980,
 'ex': 2681,
 'hello': 3385,
 'my': 4675,
 'what': 7487,
 'are': 1032,
 'doing': 2383,
 'did': 2286,
 'get': 3109,
 'that': 6818,
 'interview': 3712,
 'today': 6946,
 'happy': 3321,
 'being': 1293,
 'good': 3171,
 'boy': 1455,
 'do': 2359,
 'think': 6850,
 'of': 4904,
 'missing': 4521,
 'huh': 3537,
 ...

#### Transforming both X_train and X_test with the vocabulary fitted on X_train

In [16]:
X_train= count_vector.transform(X_train)
X_test = count_vector.transform(X_test)
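
Fitting only on the training messages means that words appearing only in the test messages have no column and are silently ignored. A small sketch with made-up texts illustrates this:

```python
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["free prize now", "see you at lunch"]
test_texts = ["free lunch voucher"]  # "voucher" never appears in training

vec = CountVectorizer().fit(train_texts)
X_te = vec.transform(test_texts)

# One column per training word; only "free" and "lunch" are counted.
print(X_te.shape, int(X_te.sum()))
```

This keeps the feature space identical for training and test data, which the classifier requires.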

## Implementing and training the model
This uses `MultinomialNB`.

In [17]:
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()
naive_bayes.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
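
The `alpha=1.0` in the output above is the additive (Laplace) smoothing parameter: every word count is incremented by `alpha` so that a word unseen in one class does not get probability zero. A minimal sketch on a made-up two-word count matrix, comparing a hand calculation with the fitted model's `feature_log_prob_`:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy word-count matrix: columns are counts of ("free", "lunch").
X = np.array([[2, 0],   # spam
              [1, 0],   # spam
              [0, 3]])  # ham
y = np.array([1, 1, 0])

nb = MultinomialNB(alpha=1.0).fit(X, y)

# Smoothed estimate by hand: (count + alpha) / (total + alpha * n_features)
spam_counts = X[y == 1].sum(axis=0)  # [3, 0]
manual = (spam_counts + 1.0) / (spam_counts.sum() + 1.0 * X.shape[1])

# Row 1 of feature_log_prob_ belongs to class 1 (spam).
fitted = np.exp(nb.feature_log_prob_[1])
print(manual, fitted)
```

Without smoothing, "lunch" would have probability 0 in the spam class, and any message containing it could never be classified as spam.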

## Predictions from the model

In [18]:
predictions = naive_bayes.predict(X_test)

In [19]:
predictions

array([0, 0, 1, ..., 0, 0, 0], dtype=int64)
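
To classify new raw messages, the text has to go through the same fitted vectorizer before `predict`. One way to keep the two steps together is `make_pipeline`; this is a sketch on a made-up four-message corpus, not the notebook's trained model:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training data: 1 = spam, 0 = ham.
texts = ["win free prize now", "free cash prize",
         "see you at lunch", "are you home yet"]
labels = [1, 1, 0, 0]

# The pipeline vectorizes and classifies in one call.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

new_predictions = model.predict(["free prize", "see you at home"])
print(new_predictions)
```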

## Evaluation

In [20]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Accuracy score: ', format(accuracy_score(y_test, predictions)))
print('Precision score: ', format(precision_score(y_test, predictions)))
print('Recall score: ', format(recall_score(y_test, predictions)))
print('F1 score: ', format(f1_score(y_test, predictions)))

Accuracy score:  0.9838565022421525
Precision score:  0.9444444444444444
Recall score:  0.9315068493150684
F1 score:  0.9379310344827586
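
The four scores can be reproduced from confusion-matrix counts. The counts below are inferred from the reported scores, assuming the test split holds 1,115 messages (20% of 5,571, rounded up); they are not output from the notebook:

```python
# Counts inferred from the reported scores (an assumption, not notebook output).
tp, fp, fn, tn = 136, 8, 10, 961

accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all messages classified correctly
precision = tp / (tp + fp)                  # share of predicted spam that is really spam
recall = tp / (tp + fn)                     # share of real spam that was caught
f1 = 2 * tp / (2 * tp + fp + fn)            # harmonic mean of precision and recall

print(accuracy, precision, recall, f1)
```

Under these counts, 8 legitimate messages were flagged as spam and 10 spam messages were missed.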


For spam filtering, precision is the metric that matters most: a false positive means a legitimate message is filtered out as spam.
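
One common way to favor precision is to raise the decision threshold on the predicted spam probability (in scikit-learn, `predict_proba(X)[:, 1]`) above the default 0.5. The sketch below uses made-up probabilities and labels to show the trade-off; the helper `precision_recall` is illustrative only:

```python
import numpy as np

# Made-up spam probabilities and true labels (1 = spam), for illustration only.
proba = np.array([0.95, 0.80, 0.60, 0.55, 0.30, 0.10])
truth = np.array([1, 1, 0, 1, 0, 0])

def precision_recall(threshold):
    """Return (precision, recall) when messages with proba >= threshold are flagged."""
    pred = proba >= threshold
    tp = np.sum(pred & (truth == 1))
    return float(tp / pred.sum()), float(tp / truth.sum())

print(precision_recall(0.5))
print(precision_recall(0.7))
```

In this toy data, the higher threshold removes the false positive at the cost of missing one spam message.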