<a href="https://colab.research.google.com/github/johyunkang/python-ml-guide/blob/main/python_ml_perfect_guide_08_TextAnal_07OpinionReview.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### 07 Introduction to Document Clustering with Hands-On Practice (Opinion Review Dataset)

- Document clustering: grouping documents whose text composition is similar
- Text classification needs a labeled training dataset,
- whereas document clustering is unsupervised and works without one
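
The idea can be tried on a handful of sentences before touching the real dataset. The snippet below is only a sketch: the four `docs` strings are invented for illustration, and no labels are passed to the model at any point. It assumes scikit-learn is installed.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Four made-up mini reviews: two about hotels, two about cars (illustrative only).
docs = [
    "hotel room clean staff friendly",
    "hotel room staff helpful clean",
    "car engine mileage seat comfortable",
    "car seat engine mileage transmission",
]

# Turn the raw text into TF-IDF vectors; no target labels are involved.
vectors = TfidfVectorizer().fit_transform(docs)

# Ask K-means for two groups; n_init is set explicitly so the result
# does not depend on the library's default for that parameter.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(vectors)

# Documents with overlapping vocabulary should share a label.
print(labels)
```

The numeric label given to each group is arbitrary; what matters is which documents end up together.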


- Dataset URL : https://archive.ics.uci.edu/ml/datasets/Opinosis+Opinion+%26frasl%3B+Review

In [7]:
import nltk
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_ru is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package basque_grammars is already up-to-date!
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Pac

True

In [8]:
import pandas as pd
import glob, os

# Set the directory path
path = r'/content/drive/MyDrive/Colab Notebooks/data/text-anal/topics'

# Collect every .data file under path into a list
all_files = glob.glob(os.path.join(path, "*.data"))
filename_list = []
opinion_text = []

# Collect each file name in filename_list.
# Load each file into a DataFrame, convert it back to a string, and collect it in opinion_text.
for file_ in all_files:
    # Read the file into a DataFrame
    df = pd.read_table(file_, index_col=None, header=0, encoding='latin1')

    # Strip the directory part of the absolute path,
    # then drop everything from the first '.' onward (the extension)
    filename_ = os.path.basename(file_)
    filename = filename_.split('.')[0]

    # Append the file name and the file contents to their lists
    filename_list.append(filename)
    opinion_text.append(df.to_string())

# Build a DataFrame from the file name list and the file contents list
document_df = pd.DataFrame({'filename': filename_list, 'opinion_text': opinion_text})
document_df.head()

Unnamed: 0,filename,opinion_text
0,battery-life_ipod_nano_8gb,...
1,video_ipod_nano_8gb,...
2,directions_garmin_nuvi_255W_gps,...
3,sound_ipod_nano_8gb,headphone jack i got a clear case for it a...
4,screen_ipod_nano_8gb,...


- Vectorize the text into TF-IDF features
- The LemTokens() and LemNormalize() functions below are taken from the following link
- https://github.com/wikibook/pymldg-rev/blob/master/8%EC%9E%A5/8.7%20%EB%AC%B8%EC%84%9C%20%EA%B5%B0%EC%A7%91%ED%99%94%20%EC%86%8C%EA%B0%9C%EC%99%80%20%EC%8B%A4%EC%8A%B5(Opinion%20Review%20%EB%8D%B0%EC%9D%B4%ED%84%B0%20%EC%84%B8%ED%8A%B8).ipynb


In [9]:
from nltk.stem import WordNetLemmatizer
import nltk
import string

remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
lemmar = WordNetLemmatizer()

def LemTokens(tokens):
    return [lemmar.lemmatize(token) for token in tokens]

def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))
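
The lowercase-and-strip-punctuation part of `LemNormalize` can be checked without any NLTK downloads. In the sketch below, `normalize_without_nltk` is a hypothetical stand-in that splits on whitespace instead of calling `nltk.word_tokenize` and skips lemmatization, so it only illustrates the `str.translate` step.

```python
import string

# Map every ASCII punctuation code point to None so str.translate deletes it.
remove_punct_dict = {ord(punct): None for punct in string.punctuation}

def normalize_without_nltk(text):
    """Hypothetical stand-in for LemNormalize: lowercase, drop punctuation, split on whitespace."""
    return text.lower().translate(remove_punct_dict).split()

tokens = normalize_without_nltk("The room was clean, quiet & comfortable!")
print(tokens)
# ['the', 'room', 'was', 'clean', 'quiet', 'comfortable']
```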

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english', ngram_range=(1, 2), min_df=0.05, max_df=0.85)

# Vectorize the values of the opinion_text column
feature_vect = tfidf.fit_transform(document_df['opinion_text'])

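
To see what `ngram_range=(1, 2)` and a document-frequency cutoff do to the vocabulary, a three-sentence corpus is enough. This is a sketch with invented `docs`; it uses an integer `min_df=2` (an absolute document count) rather than the fractional `min_df=0.05` of the cell above, purely to keep the toy result easy to read.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up sentences (illustrative only).
docs = [
    "battery life is short",
    "battery life is great",
    "the screen is bright",
]

# Unigrams and bigrams together: "battery life" becomes a feature of its own.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
matrix = vectorizer.fit_transform(docs)
vocab = vectorizer.vocabulary_  # dict mapping each term to its column index

# Same setup, but keep only terms that appear in at least 2 documents.
filtered = TfidfVectorizer(ngram_range=(1, 2), min_df=2).fit(docs)
kept_terms = sorted(filtered.vocabulary_)

print('battery life' in vocab)
print(kept_terms)
```

Terms that occur in a single document, such as "short" or "the screen", drop out of `kept_terms`, which is the same pruning that `min_df` performs on the review corpus at a larger scale.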


- K-means is used as the clustering algorithm
- Broadly, the documents cover three topics: electronics, cars, and hotels
- First, check how the documents are grouped when 5 centroids are used

In [11]:
from sklearn.cluster import KMeans

# Cluster the documents into 5 groups
kmcluster = KMeans(n_clusters=5, max_iter=10000, random_state=0)
kmcluster.fit(feature_vect)
cluster_label = kmcluster.labels_
cluster_centers = kmcluster.cluster_centers_

- Add the cluster labels as a cluster_label column to the DataFrame that holds the file names and file contents

In [12]:
document_df['cluster_label'] = cluster_label
document_df.head()

Unnamed: 0,filename,opinion_text,cluster_label
0,battery-life_ipod_nano_8gb,...,2
1,video_ipod_nano_8gb,...,2
2,directions_garmin_nuvi_255W_gps,...,4
3,sound_ipod_nano_8gb,headphone jack i got a clear case for it a...,2
4,screen_ipod_nano_8gb,...,2
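
Before looking at the clusters one by one, it can help to count how many documents fell into each. The sketch below uses a small hypothetical `toy_df` (file names and labels are invented); the same `value_counts()` call would apply to `document_df['cluster_label']`.

```python
import pandas as pd

# Hypothetical stand-in for document_df with made-up file names and labels.
toy_df = pd.DataFrame({
    'filename': ['doc_a', 'doc_b', 'doc_c', 'doc_d', 'doc_e'],
    'cluster_label': [2, 2, 4, 2, 0],
})

# Count documents per cluster and order the result by cluster label.
cluster_sizes = toy_df['cluster_label'].value_counts().sort_index()
print(cluster_sizes)
```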


In [13]:
# Show every column and widen cell text so the review snippets are readable
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 500)
document_df[document_df['cluster_label'] == 0].sort_values(by='filename')

Unnamed: 0,filename,opinion_text,cluster_label
47,bathroom_bestwestern_hotel_sfo,"The room was not overly big, but clean and very comfortable beds, a great shower and very clean bathrooms .\n0 ...",0
36,food_holiday_inn_london,The room was packed to capacity with queues at the food buffets .\n0 The over zealous st...,0
20,food_swissotel_chicago,The food for our event was delicious .\n0 ...,0
50,free_bestwestern_hotel_sfo,The wine reception is a great idea as it is nice to meet other travellers and great having access to the free Internet access in our room .\n0 They also have a computer available with free internet which is a nice bonus but I didn't find that out till t...,0
42,location_bestwestern_hotel_sfo,"Good Value good location , ideal choice .\n0 ...",0
19,location_holiday_inn_london,Great location for tube and we crammed in a fair amount of sightseeing in a short time .\n0 ...,0
46,parking_bestwestern_hotel_sfo,Parking was expensive but I think this is common for San Fran .\n0 there is a fee for parking but we...,0
18,price_holiday_inn_london,"All in all, a normal chain hotel on a nice location , I will be back if I do not find anthing closer to Picadilly for a better price .\n0 ...",0
24,room_holiday_inn_london,"We arrived at 23,30 hours and they could not recommend a restaurant so we decided to go ...",0
44,rooms_bestwestern_hotel_sfo,...,0


In [14]:
document_df[document_df['cluster_label'] == 1].sort_values(by='filename')

Unnamed: 0,filename,opinion_text,cluster_label
29,buttons_amazon_kindle,"I thought it would be fitting to christen my Kindle with the Stephen King novella UR, so went to the Amazon site on my computer and clicked on the button to buy it .\n0 ...",1
28,eyesight-issues_amazon_kindle,"It feels as easy to read as the K1 but doesn't seem any crisper to my eyes .\n0 the white is really GREY, and to avoid considerable eye, strain I had to ...",1
22,fonts_amazon_kindle,Being able to change the font sizes is awesome !\n0 ...,1
25,navigation_amazon_kindle,"In fact, the entire navigation structure has been completely revised , I'm still getting used to it but it's a huge step forward .\n0 ...",1
34,price_amazon_kindle,"If a case was included, as with the Kindle 1, that would have been reflected in a higher price .\n0 ...",1
12,speed_windows7,"Windows 7 is quite simply faster, more stable, boots faster, goes to sleep faster, comes back from sleep faster, manages your files better and on top of that it's beautiful to look at and easy to use .\n0 ...",1


In [15]:
document_df[document_df['cluster_label'] == 3].sort_values(by='filename')

Unnamed: 0,filename,opinion_text,cluster_label
43,comfort_honda_accord_2008,"Drivers seat not comfortable, the car itself compared to other models of similar class .\n0 ...",3
31,comfort_toyota_camry_2007,"Ride seems comfortable and gas mileage fairly good averaging 26 city and 30 open road .\n0 Seats are fine, in fact of all the smaller sedans this is the most comfortable I found for the price as I am 6', 2 and 250# .\n1 ...",3
26,gas_mileage_toyota_camry_2007,Ride seems comfortable and gas mileage fairly good averaging 26 city and 30 open road .\n0 ...,3
39,interior_honda_accord_2008,I love the new body style and the int...,3
21,interior_toyota_camry_2007,"First of all, the interior has way too many cheap plastic parts like the cheap plastic center piece that houses the clock .\n0 ...",3
41,mileage_honda_accord_2008,"It's quiet, get good gas mileage and looks clean inside and out .\n0 The mileage is great, and I've had to get used to stopping less for gas .\n1 ...",3
37,performance_honda_accord_2008,"Very happy with my 08 Accord, performance is quite adequate it has nice looks and is a great long, distance cruiser .\n0 6, 4, 3 eco engine has poor performance and gas mileage of 22 highway .\n1 ...",3
40,quality_toyota_camry_2007,I previously owned a Toyota 4Runner which had incredible build quality and reliability .\n0 ...,3
49,seats_honda_accord_2008,Front seats are very uncomfortable .\n0 ...,3
48,transmission_toyota_camry_2007,"After slowing down, transmission has to be kicked to speed up .\n0 ...",3


In [16]:
from sklearn.cluster import KMeans

# Cluster the documents into 3 groups
km_cluster = KMeans(n_clusters=3, max_iter=10000, random_state=0)
km_cluster.fit(feature_vect)
cluster_label = km_cluster.labels_
cluster_centers = km_cluster.cluster_centers_

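# Assign each document's cluster to the cluster_label column.
# sort_values() returns a sorted copy and leaves document_df unchanged,
# so head() below still shows the original row order.
document_df['cluster_label'] = cluster_label
document_df.sort_values(by='cluster_label')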
document_df.head()

Unnamed: 0,filename,opinion_text,cluster_label
0,battery-life_ipod_nano_8gb,short battery life I moved up from an 8gb .\n0 ...,2
1,video_ipod_nano_8gb,"I bought the 8, gig Ipod Nano that has the built, in video camera .\n0 ...",2
2,directions_garmin_nuvi_255W_gps,You also get upscale features like spoken directions including street names and programmable POIs .\n0 ...,2
3,sound_ipod_nano_8gb,headphone jack i got a clear case for it and it i got a clear case for it and it like prvents me from being able to put the jack all the way in so the sound can b messsed up or i can get it in there and its playing well them go to move or something and it slides out .\n0 Picture and...,2
4,screen_ipod_nano_8gb,"As always, the video screen is sharp and bright .\n0 ...",2
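
The rows above appear in their original order because `DataFrame.sort_values` does not sort in place by default. A minimal sketch with a hypothetical three-row `toy_df` shows the difference between the original frame and the sorted copy.

```python
import pandas as pd

# Hypothetical three-row frame (values invented for illustration).
toy_df = pd.DataFrame({
    'filename': ['a', 'b', 'c'],
    'cluster_label': [2, 0, 1],
})

# sort_values returns a new, sorted DataFrame; toy_df itself is untouched.
sorted_df = toy_df.sort_values(by='cluster_label')

print(toy_df['cluster_label'].tolist())     # [2, 0, 1]
print(sorted_df['cluster_label'].tolist())  # [0, 1, 2]
```

To view the sorted rows, keep the returned frame (as `sorted_df` does here) or display the result of `sort_values` directly.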


#### Extracting the Key Words of Each Cluster

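- cluster_centers_ : provided as an array in which each row is a cluster and each column is a feature. Each value is the centroid's coordinate along that feature, that is, a number locating the cluster center in feature space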

In [17]:
cluster_centers = km_cluster.cluster_centers_
print('cluster_centers shape:', cluster_centers.shape)
print(cluster_centers)

cluster_centers shape: (3, 4611)
[[0.         0.00099499 0.00174637 ... 0.         0.00183397 0.00144581]
 [0.         0.00092551 0.         ... 0.         0.         0.        ]
 [0.01005322 0.         0.         ... 0.00706287 0.         0.        ]]


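- cluster_centers is a (3, 4611) array, meaning there are 3 clusters and 4611 word features
- Each row holds one centroid's coordinates over the 4611 features. In K-means a coordinate is the average TF-IDF weight of that feature across the cluster's documents, so here it lies between 0 and 1, and the closer it is to 1, the more strongly that word characterizes the cluster.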
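
The meaning of a centroid row can be reproduced by hand. The sketch below does not call `KMeans`; it assumes a made-up 4 x 3 matrix `X` and a made-up assignment `labels`, and computes each centroid as the mean of the vectors assigned to it, which is how K-means defines a cluster center.

```python
import numpy as np

# Hypothetical TF-IDF-like matrix: 4 documents x 3 features (values invented).
X = np.array([
    [0.8, 0.2, 0.0],
    [0.6, 0.4, 0.0],
    [0.0, 0.1, 0.9],
    [0.0, 0.3, 0.7],
])

# Hypothetical cluster assignment: first two documents in cluster 0, last two in cluster 1.
labels = np.array([0, 0, 1, 1])

# A centroid is the feature-wise mean of the documents assigned to the cluster.
centroids = np.vstack([X[labels == k].mean(axis=0) for k in range(2)])
print(centroids)
```

Feature 0 dominates the first centroid and feature 2 dominates the second, which is exactly the signal used below to pick each cluster's key words.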

In [18]:
# Return each cluster's top-n key words, their centroid coordinate values, and the member file names

def get_cluster_details(cluster_model, cluster_data, feature_names, clusters_num, top_n_features=10):
    cluster_details = {}

    # For each centroid (row), get the feature indexes ordered from the largest
    # cluster_centers_ value to the smallest, so the most heavily weighted
    # word features of each cluster come first
    centroid_feature_ordered_ind = cluster_model.cluster_centers_.argsort()[:, ::-1]

    # Loop over the clusters and record key words, their centroid values, and file names
    for cluster_num in range(clusters_num):
        # Initialize the container for this cluster
        cluster_details[cluster_num] = {}
        cluster_details[cluster_num]['cluster'] = cluster_num

        # Use the ordered indexes to pick the top-n feature words
        top_feature_indexes = centroid_feature_ordered_ind[cluster_num, :top_n_features]
        top_features = [feature_names[ind] for ind in top_feature_indexes]

        # Look up the centroid values of those features
        top_feature_values = cluster_model.cluster_centers_[cluster_num, top_feature_indexes].tolist()

        # Store the key words, their centroid values, and the file names for this cluster
        cluster_details[cluster_num]['top_features'] = top_features
        cluster_details[cluster_num]['top_features_value'] = top_feature_values
        filenames = cluster_data[cluster_data['cluster_label'] == cluster_num]['filename']
        filenames = filenames.values.tolist()

        cluster_details[cluster_num]['filenames'] = filenames

    return cluster_details
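
The `argsort()[:, ::-1]` indexing at the heart of the function is easier to follow on a tiny array. In this sketch `centers` and `feature_names` are invented stand-ins for `cluster_centers_` and the vectorizer's vocabulary.

```python
import numpy as np

# Hypothetical centroid matrix: 2 clusters x 4 features (values invented).
centers = np.array([
    [0.10, 0.70, 0.05, 0.30],
    [0.60, 0.00, 0.20, 0.40],
])
feature_names = np.array(['battery', 'room', 'screen', 'staff'])

# argsort sorts each row in ascending order; reversing the columns
# gives the feature indexes from the largest value to the smallest.
ordered_ind = centers.argsort()[:, ::-1]

# Keep the two highest-valued features per cluster and look up their names.
top2_ind = ordered_ind[:, :2]
top2_words = feature_names[top2_ind]

print(top2_ind.tolist())    # [[1, 3], [0, 3]]
print(top2_words.tolist())  # [['room', 'staff'], ['battery', 'staff']]
```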

In [19]:
def print_cluster_details(cluster_details) :
    for cluster_num, cluster_detail in cluster_details.items() :
        print('#### Cluster {0}'.format(cluster_num))
        print('Top features:', cluster_detail['top_features'])
        print('Review file names :', cluster_detail['filenames'][:7])
        print('===============================================\n\n')

- Now call get_cluster_details() and print_cluster_details() defined above

In [20]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = tfidf.get_feature_names_out()
cluster_details = get_cluster_details(cluster_model = km_cluster, cluster_data = document_df, 
                                     feature_names = feature_names, clusters_num=3, top_n_features=10)

print_cluster_details(cluster_details)

#### Cluster 0
Top features: ['room', 'hotel', 'service', 'staff', 'food', 'location', 'bathroom', 'clean', 'price', 'parking']
Review file names : ['price_holiday_inn_london', 'location_holiday_inn_london', 'food_swissotel_chicago', 'staff_swissotel_chicago', 'room_holiday_inn_london', 'rooms_swissotel_chicago', 'service_holiday_inn_london']


#### Cluster 1
Top features: ['interior', 'seat', 'mileage', 'comfortable', 'gas', 'gas mileage', 'transmission', 'car', 'performance', 'quality']
Review file names : ['interior_toyota_camry_2007', 'gas_mileage_toyota_camry_2007', 'comfort_toyota_camry_2007', 'performance_honda_accord_2008', 'interior_honda_accord_2008', 'quality_toyota_camry_2007', 'mileage_honda_accord_2008']


#### Cluster 2
Top features: ['screen', 'battery', 'keyboard', 'battery life', 'life', 'kindle', 'direction', 'video', 'size', 'voice']
Review file names : ['battery-life_ipod_nano_8gb', 'video_ipod_nano_8gb', 'directions_garmin_nuvi_255W_gps', 'sound_ipod_nano_8gb', 'screen_ipod_nano_8g

