## Naive Bayes Classification

    - Naive Bayes classification uses the naive_bayes module of scikit-learn
    - Naive Bayes classifiers: (1) GaussianNB, (2) BernoulliNB, (3) MultinomialNB
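
The three variants differ mainly in the kind of feature they expect. The sketch below fits each one on a tiny made-up count matrix (the data and variable names are assumptions for illustration, not part of the notebook) to show that they share the same fit/predict interface.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB

# Toy data (made up for illustration): 4 samples, 3 features, 2 classes.
X_counts = np.array([[2, 0, 1],
                     [3, 0, 0],
                     [0, 2, 1],
                     [0, 3, 0]])
y = np.array([1, 1, 0, 0])

# GaussianNB: continuous features, modeled with a per-class normal distribution.
gnb = GaussianNB().fit(X_counts.astype(float), y)

# BernoulliNB: binary features (word present / absent); values are binarized at 0 by default.
bnb = BernoulliNB().fit(X_counts, y)

# MultinomialNB: non-negative count features such as word frequencies.
mnb = MultinomialNB().fit(X_counts, y)

sample = np.array([[1, 0, 0]])
predictions = {
    "GaussianNB": int(gnb.predict(sample.astype(float))[0]),
    "BernoulliNB": int(bnb.predict(sample)[0]),
    "MultinomialNB": int(mnb.predict(sample)[0]),
}
print(predictions)
```

For word-count features like the ones built in this notebook, MultinomialNB is the usual choice, which is why it is the variant used below.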
    

In [2]:
import pandas as pd
import numpy as np
import os

### `Data1` : Amazon product data (positive/negative sentiment analysis of Amazon product reviews)

Data: https://www.kaggle.com/datasets/sameersmahajan/reviews-of-amazon-baby-products

    - To use a Naive Bayes model, the data must first be transformed into a form that
    scikit-learn can work with
    - Create a CountVectorizer object and learn the vocabulary with its fit() method
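
As a minimal sketch of what fit() does, the example below uses a made-up three-review corpus (toy_reviews and toy_vectorizer are hypothetical names) and prints the learned vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus standing in for the review column.
toy_reviews = ["I love this seat", "this seat is bad", "love love love"]

toy_vectorizer = CountVectorizer()
toy_vectorizer.fit(toy_reviews)   # learns the vocabulary only, no counting yet

# vocabulary_ maps each learned token to its column index.
# The default tokenizer lowercases text and drops one-character tokens such as "I".
print(sorted(toy_vectorizer.vocabulary_))
```

The default tokenization explains why words like 'I' or '2' never appear in the vocabulary_ output later in the notebook.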

In [3]:
os.listdir('./data/')

['amazon_baby.csv']

In [4]:
data = pd.read_csv('./data/amazon_baby.csv')
data

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5
...,...,...,...
183526,Baby Teething Necklace for Mom Pretty Donut Sh...,Such a great idea! very handy to have and look...,5
183527,Baby Teething Necklace for Mom Pretty Donut Sh...,This product rocks! It is a great blend of fu...,5
183528,Abstract 2 PK Baby / Toddler Training Cup (Pink),This item looks great and cool for my kids.......,5
183529,"Baby Food Freezer Tray - Bacteria Resistant, B...",I am extremely happy with this product. I have...,5


In [5]:
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

In [6]:
data['rating'].value_counts(dropna=False)

5    107054
4     33205
3     16779
1     15183
2     11310
Name: rating, dtype: int64

In [7]:
data.isnull().sum()

name      318
review    829
rating      0
dtype: int64

In [8]:
data = data.dropna()
data.isnull().sum()

name      0
review    0
rating    0
dtype: int64

    - A review with a rating of 4 or higher is labeled positive; anything lower is labeled negative

In [9]:
a_data = data.copy()

In [10]:
a_data['rating'] = a_data['rating'].apply(lambda x: 'pos' if int(x) >=4 else 'neg')
a_data['rating'].value_counts(dropna=False)

pos    139318
neg     43066
Name: rating, dtype: int64

In [11]:
a_data

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",neg
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,pos
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,pos
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,pos
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,pos
...,...,...,...
183526,Baby Teething Necklace for Mom Pretty Donut Sh...,Such a great idea! very handy to have and look...,pos
183527,Baby Teething Necklace for Mom Pretty Donut Sh...,This product rocks! It is a great blend of fu...,pos
183528,Abstract 2 PK Baby / Toddler Training Cup (Pink),This item looks great and cool for my kids.......,pos
183529,"Baby Food Freezer Tray - Bacteria Resistant, B...",I am extremely happy with this product. I have...,pos


    - The dataset is large and imbalanced, so 1000 reviews per class are drawn by random sampling
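
The sampling cell below draws a different subset on every run, so the outputs shown here will not reproduce exactly. One possible variation, sketched on a made-up frame, passes random_state to sample(); the seed value and the toy data are assumptions, not something the notebook does.

```python
import pandas as pd

# Hypothetical stand-in for a_data with an imbalanced 'rating' column.
toy = pd.DataFrame({
    "review": [f"review {i}" for i in range(10)],
    "rating": ["pos"] * 7 + ["neg"] * 3,
})

n_per_class = 3  # the notebook uses 1000
# random_state is an assumed addition; it fixes the sample so reruns match.
balanced = pd.concat([
    toy.loc[toy["rating"] == label].sample(n_per_class, random_state=42)
    for label in ("pos", "neg")
]).reset_index(drop=True)

print(balanced["rating"].value_counts())
```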

In [12]:
a_data = a_data[['review','rating']]
a_p_data = a_data.loc[a_data['rating']=='pos'].sample(1000)
a_n_data = a_data.loc[a_data['rating']=='neg'].sample(1000)

In [13]:
a_df = pd.concat([a_p_data, a_n_data]).reset_index(drop=True)
a_df['rating'].value_counts(dropna=False)
display(a_df.head(3))

Unnamed: 0,review,rating
0,I bought this seat based on all the great revi...,pos
1,"This monitor is great. It doesn't have video,...",pos
2,its very soft. we love it and sure that baby w...,pos


**Step 1 `CountVectorizer`**

In [18]:
a_data = a_df['review'].values.astype('U')
a_data

array(["I bought this seat based on all the great reviews and I was not disappointed. My son just turned 2 and is small for his age and he fits perfectly on this seat. What I also love about the seat is that it doesn't move once you get it on the toilet. I don't have to worry about my son slipping and sliding around as he moves on the potty. I bought 2!",
       "This monitor is great.  It doesn't have video, but it has all the rest.  I love the tempurature feature, we heat with a pellet stove and it makes it easy for me to monitor at night that the kids rooms are warm enough.  Very sensitive, but has a control for that as well.  The only down fall of this monitor is that the lights do light up my room quite a bit, since we use a noise machine in the kids rooms.  Great product.",
       'its very soft. we love it and sure that baby will love it too. will use it in car seat and rocker..',
       ...,
       'When I received this product the box had a little chemical smell, but I aired i

    - If fitting on the raw values raises an error, convert them to Unicode strings (astype('U')) before fitting the vectorizer

In [19]:
vectorizer = CountVectorizer()
vectorizer.fit(a_data)

**Step 2 `transform`**

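    - transform() takes a list of strings and counts word frequencies against the learned vocabulary
    - Calling .transform() on the fitted vectorizer converts an array of strings into counts of the learned words
    - counts stores how many times each word occurs in each review
    
           * columns of counts : the learned words (one column per vocabulary index)
           * values of counts : the frequency of that word in the review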

In [20]:
vectorizer.vocabulary_

{'bought': 1024,
 'this': 7492,
 'seat': 6354,
 'based': 787,
 'on': 4967,
 'all': 454,
 'the': 7443,
 'great': 3317,
 'reviews': 6091,
 'and': 512,
 'was': 8081,
 'not': 4876,
 'disappointed': 2221,
 'my': 4748,
 'son': 6803,
 'just': 3981,
 'turned': 7752,
 'is': 3899,
 'small': 6700,
 'for': 3016,
 'his': 3543,
 'age': 415,
 'he': 3459,
 'fits': 2921,
 'perfectly': 5252,
 'what': 8160,
 'also': 473,
 'love': 4339,
 'about': 275,
 'that': 7439,
 'it': 3907,
 'doesn': 2309,
 'move': 4702,
 'once': 4968,
 'you': 8354,
 'get': 3201,
 'toilet': 7580,
 'don': 2320,
 'have': 3451,
 'to': 7569,
 'worry': 8291,
 'slipping': 6683,
 'sliding': 6669,
 'around': 610,
 'as': 623,
 'moves': 4709,
 'potty': 5489,
 'monitor': 4664,
 'video': 7995,
 'but': 1191,
 'has': 3437,
 'rest': 6041,
 'tempurature': 7405,
 'feature': 2832,
 'we': 8115,
 'heat': 3478,
 'with': 8245,
 'pellet': 5242,
 'stove': 7097,
 'makes': 4399,
 'easy': 2471,
 'me': 4492,
 'at': 648,
 'night': 4842,
 'kids': 4011,
 'rooms': 

In [23]:
counts = vectorizer.transform(a_data)
print(counts.shape)

(2000, 8389)


**Step 3 `naive bayes classification Training`**

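    - Train Naive Bayes by passing the counts matrix (the data points) together with the label of each data point
    - counts was built from 1000 positive reviews followed by 1000 negative reviews, so
    the first 1000 rows (indices 0 to 999) are labeled 1 and the remaining rows are labeled 0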
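
Hard-coding the labels works only while the row order stays exactly "all positive, then all negative". A possible alternative, sketched here on a made-up frame (toy_df and toy_labels are hypothetical names), derives the labels from the rating column so they always follow the rows.

```python
import pandas as pd

# Hypothetical stand-in for a_df after pd.concat([...]).reset_index(drop=True).
toy_df = pd.DataFrame({
    "review": ["great", "love it", "broke fast", "too small"],
    "rating": ["pos", "pos", "neg", "neg"],
})

# 1 for 'pos', 0 for 'neg', in the same row order as the reviews.
toy_labels = (toy_df["rating"] == "pos").astype(int).to_numpy()
print(toy_labels)
```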

In [25]:
multi_classification = MultinomialNB()

labels = [1] *1000 + [0] *1000

multi_classification.fit(counts, labels)

**Step 4 `naive bayes classification Prediction`**

    - .predict() : predicts the class for an array of data points
    - .predict_proba() : returns the probability that each data point belongs to each class
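
The columns returned by predict_proba() follow the order of the classifier's classes_ attribute, so with labels 0 and 1 the first column is the negative class. A small sketch on made-up training texts (all names here are hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data; labels follow the notebook's convention (1 = pos, 0 = neg).
texts = ["love it great", "great product love", "bad broke", "bad waste broke"]
labels = [1, 1, 0, 0]

vec = CountVectorizer().fit(texts)
clf = MultinomialNB().fit(vec.transform(texts), labels)

new_review = vec.transform(["great love"])
pred = clf.predict(new_review)          # most probable class
proba = clf.predict_proba(new_review)   # one column per entry in clf.classes_

print(clf.classes_)   # column order of predict_proba
print(pred, proba)
```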


In [26]:
print(multi_classification.predict(counts))

[1 1 1 ... 0 0 0]


In [27]:
print(multi_classification.predict_proba(counts))

[[1.43746714e-03 9.98562533e-01]
 [4.70895083e-05 9.99952910e-01]
 [1.15071165e-02 9.88492883e-01]
 ...
 [9.86803539e-01 1.31964610e-02]
 [6.66627968e-01 3.33372032e-01]
 [9.94138960e-01 5.86104013e-03]]
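
The predictions above are made on the same reviews the model was trained on, so they say little about how it behaves on unseen text. A rough sketch of a held-out check follows; the toy texts, the 25% split and the seed are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Made-up reviews; in the notebook these would be a_df['review'] and the 0/1 labels.
texts = ["love it", "great seat", "works great", "love this monitor",
         "bad quality", "broke fast", "waste of money", "bad smell"] * 5
labels = ([1] * 4 + [0] * 4) * 5

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0)

vec = CountVectorizer().fit(X_train)   # vocabulary learned from training text only
clf = MultinomialNB().fit(vec.transform(X_train), y_train)

# Mean accuracy on reviews the model did not see during fitting.
test_accuracy = clf.score(vec.transform(X_test), y_test)
print(test_accuracy)
```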


In [32]:
review_case1 = "This is bad product. I'm dissatisfied"
print(multi_classification.predict(vectorizer.transform([review_case1])))
print(multi_classification.predict_proba(vectorizer.transform([review_case1])))

[0]
[[0.92423565 0.07576435]]


In [33]:
review_case2 = "I think it's ambiguous. I'm just writing because I'm lazy to refund"
print(multi_classification.predict(vectorizer.transform([review_case2])))
print(multi_classification.predict_proba(vectorizer.transform([review_case2])))

[0]
[[0.93477025 0.06522975]]


In [35]:
review_case3 = "I guess not bad. sell a lot"
print(multi_classification.predict(vectorizer.transform([review_case3])))
print(multi_classification.predict_proba(vectorizer.transform([review_case3])))

[0]
[[0.93397293 0.06602707]]


#### Storing word frequencies

    - Create Counter objects that store the word frequencies of the positive and the negative reviews

In [48]:
# Store the word frequencies of the positive reviews

pos_lst = a_df.loc[a_df['rating']=='pos']['review'].values.tolist()
print(pos_lst[:2])

pos_words = []

for p_sentence in pos_lst:
    pos_words.extend(str(p_sentence).split(' '))

pos_counter = Counter(pos_words)
pos_counter

["I bought this seat based on all the great reviews and I was not disappointed. My son just turned 2 and is small for his age and he fits perfectly on this seat. What I also love about the seat is that it doesn't move once you get it on the toilet. I don't have to worry about my son slipping and sliding around as he moves on the potty. I bought 2!", "This monitor is great.  It doesn't have video, but it has all the rest.  I love the tempurature feature, we heat with a pellet stove and it makes it easy for me to monitor at night that the kids rooms are warm enough.  Very sensitive, but has a control for that as well.  The only down fall of this monitor is that the lights do light up my room quite a bit, since we use a noise machine in the kids rooms.  Great product."]


Counter({'I': 2088,
         'bought': 156,
         'this': 813,
         'seat': 121,
         'based': 9,
         'on': 631,
         'all': 204,
         'the': 3131,
         'great': 225,
         'reviews': 33,
         'and': 2430,
         'was': 563,
         'not': 327,
         'disappointed.': 5,
         'My': 217,
         'son': 143,
         'just': 235,
         'turned': 10,
         '2': 84,
         'is': 1305,
         'small': 73,
         'for': 1125,
         'his': 142,
         'age': 8,
         'he': 211,
         'fits': 61,
         'perfectly': 20,
         'seat.': 22,
         'What': 7,
         'also': 142,
         'love': 202,
         'about': 163,
         'that': 661,
         'it': 1529,
         "doesn't": 82,
         'move': 19,
         'once': 26,
         'you': 313,
         'get': 173,
         'toilet.': 1,
         "don't": 128,
         'have': 570,
         'to': 2016,
         'worry': 16,
         'my': 811,
         'slipping': 

In [52]:
# Count the word frequencies of the negative reviews

neg_lst = a_df.loc[a_df['rating']=='neg']['review'].values.tolist()
print(neg_lst[:2])

neg_words = []

for n_sentence in neg_lst:
    neg_words.extend(str(n_sentence).split(' '))

neg_counter = Counter(neg_words)
neg_counter


['I would have given it 5 stars because I love it, but the day I got it the pink circle on the front came off. Easily glued back on, but I was expecting higher quality. I love that it is only 1 year of statutes and info instead of 5. I never record past 1 anyway.', "THis is a pretty inexpensive night light we bought for our boy's room, it's great but the bulb went out after a few weeks."]


Counter({'I': 2432,
         'would': 418,
         'have': 657,
         'given': 16,
         'it': 2039,
         '5': 55,
         'stars': 23,
         'because': 251,
         'love': 60,
         'it,': 78,
         'but': 724,
         'the': 4519,
         'day': 30,
         'got': 113,
         'pink': 15,
         'circle': 5,
         'on': 706,
         'front': 41,
         'came': 46,
         'off.': 25,
         'Easily': 2,
         'glued': 5,
         'back': 161,
         'on,': 11,
         'was': 820,
         'expecting': 12,
         'higher': 7,
         'quality.': 16,
         'that': 922,
         'is': 1418,
         'only': 213,
         '1': 54,
         'year': 56,
         'of': 1220,
         'statutes': 1,
         'and': 2320,
         'info': 4,
         'instead': 31,
         '5.': 2,
         'never': 58,
         'record': 3,
         'past': 11,
         'anyway.': 2,
         'THis': 1,
         'a': 2096,
         'pretty': 56,
         'in

#### Computing the positive/negative prior probabilities

    P(positive), P(negative)
    - 1,000 positive and 1,000 negative reviews were sampled, so the prior probability of each class is 0.5

In [61]:
percent_pos, percent_neg = 0.5, 0.5

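#### Computing the conditional probability of a review given that it is positive

    P(review|positive)
    Assume that all words in a review are independent, and multiply the probabilities of each word appearing in positive reviews
    For example, for 'These are great': P(These|positive) * P(are|positive) * P(great|positive)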
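
Written out, with the add-one smoothing used in the cell that follows, the quantity is the product below. The symbols $N_{pos}$ (total number of word occurrences in positive reviews) and $|V_{pos}|$ (number of distinct words in positive reviews) are notation introduced here for illustration.

$$
P(\text{review} \mid \text{positive}) = \prod_{i} P(w_i \mid \text{positive}),
\qquad
P(w_i \mid \text{positive}) = \frac{\text{count}(w_i, \text{positive}) + 1}{N_{pos} + |V_{pos}|}
$$

The negative case is analogous. Note that scikit-learn's MultinomialNB uses the size of the shared vocabulary (the number of feature columns) in the denominator rather than a per-class vocabulary, so the hand-computed values are not expected to match it exactly.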
    

In [59]:
review_case4=  'These are great'

# Total number of words in the positive and the negative reviews
total_pos = sum(pos_counter.values())
total_neg = sum(neg_counter.values())

review_given_pos = 1
review_given_neg = 1

# Smoothing: a word that never occurs in the existing reviews would make the probability 0, so add 1 to every count

for word in review_case4.split(' '):
    review_given_pos *= (pos_counter[word] + 1) / (total_pos + len(pos_counter))
    review_given_neg *= (neg_counter[word]+1)/ (total_neg + len(neg_counter))

print(review_given_pos)
print(review_given_neg)

8.30035611046729e-09
2.656093835438738e-09
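
With only three words the product is already on the order of 1e-9, and for long reviews it can underflow to 0. A common workaround is to sum log-probabilities instead of multiplying probabilities. The sketch below uses small made-up counters in place of pos_counter and neg_counter; log_likelihood is a hypothetical helper.

```python
import math
from collections import Counter

# Made-up word counts standing in for pos_counter / neg_counter.
pos_counter = Counter({"These": 3, "are": 10, "great": 8, "love": 5})
neg_counter = Counter({"These": 2, "are": 9, "broke": 6, "bad": 7})

def log_likelihood(review, counter):
    """Sum of log P(word | class) with add-one (Laplace) smoothing."""
    total = sum(counter.values())
    vocab_size = len(counter)   # same denominator convention as the cell above
    return sum(
        math.log((counter[word] + 1) / (total + vocab_size))
        for word in review.split(" ")
    )

log_pos = log_likelihood("These are great", pos_counter)
log_neg = log_likelihood("These are great", neg_counter)
print(log_pos, log_neg)
```

Comparing log_pos and log_neg gives the same ranking as comparing the raw products, because the logarithm is monotonic.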


#### 분류하기

    Compute P(positive|review) and P(negative|review)
    - Given an arbitrary review, classify it as positive or negative
    - The final score is larger for pos, so the review 'These are great' is classified as positive

In [62]:
pos = review_given_pos * percent_pos
neg = review_given_neg * percent_neg

print(pos)
print(neg)

4.150178055233645e-09
1.328046917719369e-09
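
The two numbers above are unnormalized scores, P(review|class) * P(class), not probabilities that sum to 1. Dividing each by their sum gives values comparable to predict_proba(). A minimal sketch using the two printed values (the variable names evidence, posterior_pos, posterior_neg and label are introduced here):

```python
# Unnormalized scores printed by the cell above.
pos = 4.150178055233645e-09
neg = 1.328046917719369e-09

# Dividing by the sum (the evidence P(review)) gives posteriors that add up to 1.
evidence = pos + neg
posterior_pos = pos / evidence
posterior_neg = neg / evidence

label = "pos" if posterior_pos > posterior_neg else "neg"
print(label, posterior_pos, posterior_neg)
```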


---------

      - Try Naive Bayes classification on the full Amazon dataset

In [65]:
all_data = data.copy()

In [67]:
all_data['rating'] = all_data['rating'].apply(lambda x: 'pos' if int(x)>=4 else 'neg')
display(all_data.head(3))

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",neg
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,pos
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,pos


In [73]:
all_data = all_data.dropna()
all_data.isnull().sum()

name      0
review    0
rating    0
dtype: int64

In [74]:
all_data['rating'].value_counts()

pos    139318
neg     43066
Name: rating, dtype: int64

In [77]:
all_p_data = all_data.loc[all_data['rating']=='pos'].sample(43000)
all_n_data = all_data.loc[all_data['rating']=='neg'].sample(43000)

In [78]:
all_df = pd.concat([all_p_data, all_n_data]).reset_index(drop=True)
all_df.head(3)

Unnamed: 0,name,review,rating
0,DwellStudio&reg; for Target&reg;Silver Lake Co...,I really love this crib. It went together easi...,pos
1,Lamaze Play &amp; Grow Mortimer the Moose Take...,I bought this for the coming arrival of my bab...,pos
2,Trio of Sunshine&reg; Polishing Cloths for Ste...,I really like these. They work really well. Th...,pos


In [80]:
all_review = all_df['review'].values.astype('U')
all_review

array(["I really love this crib. It went together easily and is working very well for us. The material is nice and strong, and the particle board is kept to a minimum. The only issue I have is that at the time we got it, the description said it came with the conversion rails. It doesn't.Amazon seems to have corrected this issue, but finding the conversion parts was pretty tough. I had to chase down the manufacturer (Foremost Group, not Dwell Studio or Target) and order it over the phone for an extra $80.I did that here: [...]We also got the matching changing table and changed the knobs out with some decorative ones from Anthropologie, which really gives it some character and made it sort of unique. Quality stuff.",
       "I bought this for the coming arrival of my baby boy. It's really cute and colorful, but pretty big. I couldn't hang it on my infant carrier without it whacking my kid in the face.",
       'I really like these. They work really well. They have cleaned the tarnish off

In [82]:
countvectorizer = CountVectorizer()
countvectorizer.fit(all_review)
countvectorizer

In [83]:
countvectorizer.vocabulary_

{'really': 33829,
 'love': 25097,
 'this': 42365,
 'crib': 11384,
 'it': 22750,
 'went': 46422,
 'together': 42896,
 'easily': 14651,
 'and': 3538,
 'is': 22682,
 'working': 47157,
 'very': 45500,
 'well': 46391,
 'for': 17594,
 'us': 45022,
 'the': 41980,
 'material': 25874,
 'nice': 28103,
 'strong': 40355,
 'particle': 30272,
 'board': 6496,
 'kept': 23428,
 'to': 42836,
 'minimum': 26582,
 'only': 29079,
 'issue': 22726,
 'have': 20063,
 'that': 41957,
 'at': 4409,
 'time': 42667,
 'we': 46232,
 'got': 19041,
 'description': 12693,
 'said': 36008,
 'came': 7959,
 'with': 46922,
 'conversion': 10729,
 'rails': 33563,
 'doesn': 13844,
 'amazon': 3383,
 'seems': 36742,
 'corrected': 10908,
 'but': 7711,
 'finding': 16937,
 'parts': 30293,
 'was': 46066,
 'pretty': 32353,
 'tough': 43162,
 'had': 19694,
 'chase': 8688,
 'down': 14049,
 'manufacturer': 25671,
 'foremost': 17629,
 'group': 19458,
 'not': 28414,
 'dwell': 14545,
 'studio': 40412,
 'or': 29257,
 'target': 41502,
 'order': 

In [85]:
all_count = countvectorizer.transform(all_review)
print(all_count.shape)

(86000, 47803)


In [88]:
bayes_classification = MultinomialNB()

all_labels = [1]*43000 + [0]*43000

bayes_classification.fit(all_count,all_labels )


In [95]:
review_case5 = "it was not bad. it's so normal"

print(bayes_classification.predict(countvectorizer.transform([review_case5])))
print(bayes_classification.predict_proba(countvectorizer.transform([review_case5])))

[0]
[[0.88504191 0.11495809]]
