# **Week 2 Feedback**
***  
**Embeddings need deeper study**  

1. ### **_Frequency-based representations_**  
    1. n-gram
    2. BoW
    3. DTM 
    4. TF-IDF

2. ### **_Key parameters of Count / Tfidf vectorizer_**  
    1. stop_words
    2. max_features
    3. max_df / min_df
    4. ngram_range
    5. token_pattern

3. ### **_Word Embedding (1)_**  
    1. W2V(Word2Vec)
        * CBOW
        * Skip-gram
    2. Keras Embedding()
    3. FastText


4. ### **_Classification model practice (parameter-focused)_**  
    1. Embedding-based
    2. Frequency-based


## Library

In [None]:
# base
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

# regex
import re

# stopwords
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')

# tokenizer
from nltk.tokenize import word_tokenize
from nltk.tokenize import WordPunctTokenizer
from nltk.tokenize import TreebankWordTokenizer
from keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.text import text_to_word_sequence

# lemmatization
from nltk.stem import WordNetLemmatizer

# Vectorizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# modeling
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

#scoring
from sklearn.metrics import f1_score, accuracy_score

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Study topics
1. n-gram
2. Frequency-based word representations
3. Vectorizer parameters
4. Word embeddings
5. Machine-learning classification practice (TF-IDF, embeddings)
6. Document similarity (TF-IDF)

# 1. **n-gram**
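An n-gram model is a count-based statistical language model that estimates the probability of a word from the **preceding** n-1 words.  

EX ) i was a car  
n=1(unigram) : 'i' / 'was' / 'a' / 'car' $$p(car)$$ 
n=2(bigram) : 'i was' / 'was a' / 'a car' $$p(car|a)$$ 
n=3(trigram) : 'i was a' / 'was a car' $$p(car|was\;a)$$ 
n=4(4-gram) : 'i was a car' $$p(car|i\;was\;a)$$  
  
Limitations  
1. Only a short window of words is used, so context is hard to capture and accuracy is lower than a model that conditions on the whole sentence.
2. It does not overcome sparsity.
3. Trade-off : n=2 usually performs better than n=1. As n grows, sparsity and model size both grow; as n shrinks, the approximation becomes less accurate. n < 5 is recommended.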
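The n-gram idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebook's own code: the helper names `ngrams` and `bigram_prob` are made up for this example, and the probability is a simple maximum-likelihood estimate on the single example sentence.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bigram_prob(tokens, prev, word):
    """Estimate p(word | prev) as count(prev, word) / count(prev)."""
    pair_counts = Counter(ngrams(tokens, 2))
    prev_counts = Counter(tokens[:-1])  # only positions that have a following word
    if prev_counts[prev] == 0:
        return 0.0
    return pair_counts[(prev, word)] / prev_counts[prev]

tokens = "i was a car".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(bigram_prob(tokens, "a", "car"))
```

With a realistic corpus most higher-order n-grams never occur, which is the sparsity problem listed above.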

# 2. **Frequency-based word representations**
![image](https://wikidocs.net/images/page/31767/wordrepresentation.PNG)  
Local (discrete) representations have the limitation that they cannot express a word's meaning or nuance.  

## BoW, DTM, TF-IDF
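**BoW**  
A frequency-based representation that ignores word order. It turns the word counts of a document into numbers, which can then be used for document similarity or classification.  
  
**DTM**   
The Document-Term Matrix: the BoW vectors of several documents stacked into one matrix.  
Limitations
1. It does not overcome sparsity.
2. It relies on raw counts only (a count alone does not tell whether a word is important).  

**TF-IDF**   
TF : how often a given term appears in a given document  
DF : the number of documents that contain a given term  
IDF : inversely related to DF, on a log scale (n is the total number of documents)  $$idf(t) = \log\left(\frac{n}{1+df(t)}\right)$$
TF-IDF : TF * IDF, i.e. the term frequency in a document weighted by IDF. It is used to measure how important a term is within a document: even if a term is frequent in one document, its value is low when the term appears in many documents.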
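The definitions above can be checked by hand on a tiny corpus. The sketch below uses plain Python and the formula exactly as written in this section. The four documents are made up for illustration, and whitespace splitting stands in for real tokenization.

```python
import math
from collections import Counter

# Toy corpus (an assumption for illustration, not the notebook's data)
docs = [
    "when i was 10 i was a car",
    "hi my name is car",
    "the car was very good",
    "good job",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

# DTM: one row per document, one column per vocabulary term (raw counts = TF)
counts = [Counter(doc) for doc in tokenized]
dtm = [[c[term] for term in vocab] for c in counts]

n = len(docs)
# DF: number of documents that contain each term
df = {term: sum(term in doc for doc in tokenized) for term in vocab}
# IDF as defined above: log(n / (1 + df))
idf = {term: math.log(n / (1 + df[term])) for term in vocab}

# TF-IDF: each count weighted by the IDF of its term
tfidf = [[tf * idf[term] for tf, term in zip(row, vocab)] for row in dtm]

for term in ("car", "was", "job"):
    print(term, df[term], round(idf[term], 3))
```

In this toy corpus "car" occurs in three of the four documents, so its weight drops to zero, while "job" occurs in only one document and gets the largest IDF.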

# 3. **vectorizer params**
### CountVectorizer 
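Converts text into word counts. It does everything covered so far in one step. Parameter details follow in the cells of this section.
### TfidfVectorizer 
Converts text into TF-IDF values. Compared with the IDF formula above, the scikit-learn default adds 1 to the numerator, adds 1 to the log term, and applies L2 normalization to the resulting TF-IDF vector.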
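The smoothed variant described for TfidfVectorizer can be written out in plain Python. This is a sketch based on my understanding of the scikit-learn defaults (`smooth_idf=True`, `norm='l2'`). The helper names and the toy counts are assumptions for illustration and are not part of the library.

```python
import math

def smooth_idf(n_docs, df):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df)) + 1

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else list(vec)

# Toy document: counts of three terms and the document frequency of each term
n_docs = 4
tf = [2, 1, 1]
df = [4, 2, 1]

weights = [t * smooth_idf(n_docs, d) for t, d in zip(tf, df)]
row = l2_normalize(weights)
print([round(v, 3) for v in row])
```

Because of the extra `+ 1`, a term that appears in every document keeps an IDF of 1 instead of dropping to zero or below.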


### CountVectorizer & TfidfVectorizer parameter summary

In [None]:
# 1. stop_words : stop-word handling
text = ["When I was 10, I was a car. it was very good!"]
# custom list (one-letter tokens such as "i" and "a" are already dropped by the default token_pattern)
vect = CountVectorizer(stop_words=["I"])
print('bag of words vector :',vect.fit_transform(text).toarray())
print('vocabulary :',vect.vocabulary_)
# built-in English list
vect = CountVectorizer(stop_words="english")
print('bag of words vector :',vect.fit_transform(text).toarray())
print('vocabulary :',vect.vocabulary_)
# NLTK stop words
stop_words = stopwords.words("english")
vect = CountVectorizer(stop_words=stop_words)
print('bag of words vector :',vect.fit_transform(text).toarray())
print('vocabulary :',vect.vocabulary_)

bag of words vector : [[1 1 1 1 1 3 1]]
vocabulary : {'when': 6, 'was': 5, '10': 0, 'car': 1, 'it': 3, 'very': 4, 'good': 2}
bag of words vector : [[1 1 1]]
vocabulary : {'10': 0, 'car': 1, 'good': 2}
bag of words vector : [[1 1 1]]
vocabulary : {'10': 0, 'car': 1, 'good': 2}


In [None]:
# lowercase : whether to lowercase the text
# max_df / min_df : upper / lower limit on document frequency
# max_features : keep only the most frequent terms, up to this number

# analyzer : tokenization unit (word, character, ...)
vect = CountVectorizer(analyzer='char').fit(text)
print(vect.get_feature_names_out().tolist(),'\n')
# tokenizer : custom tokenization function
vect = CountVectorizer().fit(text)
print(vect.get_feature_names_out().tolist(),'\n')
toks = WordPunctTokenizer()
vect = CountVectorizer(tokenizer=toks.tokenize).fit(text)
print(vect.get_feature_names_out().tolist(),'\n')
# ngram_range : tokenize by n-grams; (1,2) uses single words and word pairs together
# useful when two-word phrases carry meaning (good job, well done)
vect = CountVectorizer(ngram_range=(1,2)).fit(text)
print(vect.get_feature_names_out().tolist())

[' ', '!', ',', '.', '0', '1', 'a', 'c', 'd', 'e', 'g', 'h', 'i', 'n', 'o', 'r', 's', 't', 'v', 'w', 'y'] 

['10', 'car', 'good', 'it', 'very', 'was', 'when'] 

['!', ',', '.', '10', 'a', 'car', 'good', 'i', 'it', 'very', 'was', 'when'] 

['10', '10 was', 'car', 'car it', 'good', 'it', 'it was', 'very', 'very good', 'was', 'was 10', 'was car', 'was very', 'when', 'when was']


In [None]:
# token_pattern : regex used for tokenization. default : (?u)\b\w\w+\b
vect = CountVectorizer(token_pattern=r"[^\w]").fit(text)
print(vect.get_feature_names_out().tolist(),'\n\n')
# drop numeric tokens such as '10'
vect = CountVectorizer().fit(text)
print(vect.get_feature_names_out().tolist(),'\n')
vect = CountVectorizer(token_pattern=r"[^\d\W]+\w").fit(text)
print(vect.get_feature_names_out().tolist())

[' ', '!', ',', '.'] 


['10', 'car', 'good', 'it', 'very', 'was', 'when'] 

['car', 'good', 'it', 'very', 'was', 'when']


In [None]:
# fit_transform returns a compressed sparse matrix; the values are the same as calling fit() and then transform()
text = ['When I was 10, I was a car. it was very good!',
        'hi my name is car']
vect = CountVectorizer()
toks = vect.fit_transform(text) # sparse (compressed) matrix
print(toks.shape)
print(toks.toarray()) # dense array
print(sorted(vect.vocabulary_)) # vocabulary terms

(2, 11)
[[1 1 1 0 0 1 0 0 1 3 1]
 [0 1 0 1 1 0 1 1 0 0 0]]
['10', 'car', 'good', 'hi', 'is', 'it', 'my', 'name', 'very', 'was', 'when']


In [None]:
# TfidfVectorizer takes the same parameters as CountVectorizer
# norm : 'l1' or 'l2'. default is 'l2'

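# 4. **Word embeddings**
Sparse representation : most values of the vector or matrix are 0 (e.g. one-hot encoding)  
Word embedding (dense representation) : compresses the sparse representation into a dense vector. Similarity between words can be measured, so the vector carries meaning: a word's meaning is distributed over several dimensions.  
<br/>  

Word embedding methods
1. LSA
2. Word2Vec
3. FastText
4. GloVe
5. Keras Embedding()  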
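The difference between sparse and dense representations shows up directly in cosine similarity. In the sketch below the dense vectors are hand-picked toy values, not trained embeddings, and the three-word vocabulary is only an assumption for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

vocab = ["man", "woman", "car"]
# Sparse: one dimension per word, a single 1 in each vector
one_hot = {w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}

# Dense: hand-picked 2-dimensional toy vectors (not learned from data)
dense = {"man": [0.9, 0.1], "woman": [0.8, 0.3], "car": [-0.2, 0.9]}

print(cosine(one_hot["man"], one_hot["woman"]))  # any two different one-hot vectors are orthogonal
print(cosine(dense["man"], dense["woman"]))
print(cosine(dense["man"], dense["car"]))
```

Every pair of distinct one-hot vectors has similarity 0, so a sparse representation cannot say that "man" is closer to "woman" than to "car". A dense representation can.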

## CBOW(Continuous BOW)
![image](https://wikidocs.net/images/page/22660/%EB%8B%A8%EC%96%B4.PNG)  
Prediction target: the center word; window (number of context words on each side) = 2
<br/>  

![image](https://wikidocs.net/images/page/22660/word2vec_renew_1.PNG)
![image](https://wikidocs.net/images/page/22660/word2vec_renew_2.PNG)
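$$W_{V \times M} \ \ \ \ W^{\prime}_{M \times V}$$ (V: vocabulary size, M: embedding dimension) are not transposes of each other but two separate, randomly initialized matrices; CBOW learns the embedding by training both of them.  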

<br/>  

![image](https://wikidocs.net/images/page/22660/word2vec_renew_4.PNG)  

## Skip-gram  

![image](https://wikidocs.net/images/page/22660/word2vec_renew_6.PNG)
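CBOW and Skip-gram differ only in the direction of prediction, which is easiest to see from the training pairs each one builds. The sketch below is plain Python with made-up helper names (`cbow_pairs`, `skipgram_pairs`). The example sentence is an assumption chosen for illustration.

```python
def cbow_pairs(tokens, window=2):
    """(context words, center word) pairs: CBOW predicts the center from its context."""
    pairs = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, center))
    return pairs

def skipgram_pairs(tokens, window=2):
    """(center word, context word) pairs: Skip-gram predicts each context word from the center."""
    return [(center, ctx)
            for context, center in cbow_pairs(tokens, window)
            for ctx in context]

tokens = "the fat cat sat on the mat".split()

print(cbow_pairs(tokens)[2])
print(skipgram_pairs(tokens)[:2])
```

CBOW produces one training example per position, while Skip-gram produces one example per (center, context) pair, so it sees more updates for the same text.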


In [None]:
# pre-trained embedding model
import gensim
import urllib.request

# load Google's pre-trained Word2Vec model
word2vec_model = gensim.models.KeyedVectors.load_word2vec_format('/content/drive/MyDrive/GoogleNews-vectors-negative300.bin', binary=True)

In [None]:
print(word2vec_model.vectors.shape)
print(word2vec_model.similarity('social', 'network'))
print(word2vec_model.similarity('virtual', 'reality'))
#print(word2vec_model['book'])

(3000000, 300)
0.30620542
0.31206152


In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
#example data
sentences = ['nice great best amazing', 'stop lies', 'pitiful nerd', 'excellent work', 'supreme quality', 'bad', 'highly respectable']
y_train = [1, 0, 0, 1, 1, 0, 1]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)
vocab_size = len(tokenizer.word_index) + 1
X_encoded = tokenizer.texts_to_sequences(sentences)
max_len = max(len(l) for l in X_encoded)
X_train = pad_sequences(X_encoded, maxlen=max_len, padding='post')
y_train = np.array(y_train)

In [None]:
embedding_matrix = np.zeros((vocab_size, 300))

In [None]:
def get_vector(word):
    if word in word2vec_model:
        return word2vec_model[word]
    else:
        return None

In [None]:
for word, index in tokenizer.word_index.items():
    # pre-trained embedding vector mapped to this word
    vector_value = get_vector(word)
    if vector_value is not None:
        embedding_matrix[index] = vector_value

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Flatten, Input

model = Sequential()
model.add(Input(shape=(max_len,), dtype='int32'))
e = Embedding(vocab_size, 300, weights=[embedding_matrix], input_length=max_len, trainable=False)
model.add(e)
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
model.fit(X_train, y_train, epochs=10, verbose=2)

Epoch 1/10
1/1 - 1s - loss: 0.7041 - acc: 0.5714 - 602ms/epoch - 602ms/step
Epoch 2/10
1/1 - 0s - loss: 0.6854 - acc: 0.7143 - 11ms/epoch - 11ms/step
Epoch 3/10
1/1 - 0s - loss: 0.6672 - acc: 0.7143 - 6ms/epoch - 6ms/step
Epoch 4/10
1/1 - 0s - loss: 0.6496 - acc: 0.7143 - 7ms/epoch - 7ms/step
Epoch 5/10
1/1 - 0s - loss: 0.6325 - acc: 0.8571 - 5ms/epoch - 5ms/step
Epoch 6/10
1/1 - 0s - loss: 0.6160 - acc: 0.8571 - 7ms/epoch - 7ms/step
Epoch 7/10
1/1 - 0s - loss: 0.6000 - acc: 0.8571 - 6ms/epoch - 6ms/step
Epoch 8/10
1/1 - 0s - loss: 0.5846 - acc: 1.0000 - 6ms/epoch - 6ms/step
Epoch 9/10
1/1 - 0s - loss: 0.5696 - acc: 1.0000 - 6ms/epoch - 6ms/step
Epoch 10/10
1/1 - 0s - loss: 0.5552 - acc: 1.0000 - 5ms/epoch - 5ms/step


<keras.callbacks.History at 0x7f5869e80ac0>

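## GloVe (Global Vectors for Word Representation)  
An embedding method that combines the count-based and prediction-based approaches; its performance is comparable to Word2Vec.  
One-line summary : "make the dot product of the embedded center-word vector and context-word vector match the (log) co-occurrence probability over the whole corpus"  
$$w_{i}^{\top}\tilde{w}_{k} \approx \log P(k \mid i) = \log P_{ik}$$  
This part is hard...  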
<br/>
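The quantity on the right-hand side of the GloVe formula, the co-occurrence probability P(k | i), can be computed by simple counting. The sketch below is plain Python with made-up helper names. The three sentences and the window size of 1 are assumptions for illustration, and the code does not train any vectors.

```python
from collections import Counter, defaultdict

def cooccurrence(sentences, window=1):
    """Count how often word k appears within `window` positions of word i."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, center in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[center][tokens[j]] += 1
    return counts

def cooccurrence_prob(counts, i, k):
    """P(k | i) = X_ik / X_i, where X_i sums the counts of all words seen around i."""
    total = sum(counts[i].values())
    return counts[i][k] / total if total else 0.0

sentences = [s.split() for s in ["i like deep learning", "i like nlp", "i enjoy flying"]]
X = cooccurrence(sentences, window=1)

print(cooccurrence_prob(X, "i", "like"))
print(cooccurrence_prob(X, "i", "enjoy"))
```

GloVe then fits the word vectors so that their dot products approximate the logarithm of these probabilities.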
## FastText
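An extension of Word2Vec that learns from subwords, the character n-grams inside each word.  
n = 3 : `<ap`, `app`, `ppl`, `ple`, `le>`, plus the whole word `<apple>`  
n = 3 ~ 6 : 
`<ap`, `app`, `ppl`, `ple`, `le>`, `<app`, `appl`, `pple`, `ple>`, `<appl`, `apple`, `pple>`, ..., `<apple>`  
apple = `<ap` + `app` + `ppl` + `ple` + `le>` + `<app` + `appl` + `pple` + `ple>` + `<appl` + `apple` + `pple>` + ... + `<apple>`  
<br/>
Advantages 
1. An unknown word can be split into known subwords to compute similarity. (birthday -> birth, day)
2. Typos can still be handled (appple is almost identical to apple)  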
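The subword decomposition is easy to reproduce. The helper below is a plain-Python sketch with a made-up name (`char_ngrams`). It only lists the n-grams and does not implement the hashing or training that the FastText library performs.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of '<word>' plus the whole-word token."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams += [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]
    if wrapped not in grams:
        grams.append(wrapped)  # special token for the whole word
    return grams

print(char_ngrams("apple", 3, 3))

# A typo still shares most of its subwords with the correct spelling
shared = set(char_ngrams("apple")) & set(char_ngrams("appple"))
print(sorted(shared))
```

Since a word vector is the sum of its subword vectors, the shared n-grams are what keep `appple` close to `apple`.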

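## Keras Embedding()
Implements an embedding layer as part of a neural network.
Procedure : integer encoding > look up the embedding vector for each integer index in the lookup table > train > obtain the final vectors  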
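The lookup step of this procedure amounts to selecting rows of a table. The plain-Python sketch below uses a small randomly initialized table and made-up helper names. It shows only the lookup and not the training that Keras performs on the table.

```python
import random

random.seed(0)
vocab_size, embedding_dim = 5, 3
# Lookup table: one randomly initialized row per integer index (index 0 is used for padding here)
table = [[random.uniform(-1, 1) for _ in range(embedding_dim)]
         for _ in range(vocab_size)]

def embed(indices):
    """Return the table rows for a sequence of integer-encoded words."""
    return [table[i] for i in indices]

def one_hot_matmul(index):
    """The same lookup written as (one-hot vector) x (table)."""
    one_hot = [1.0 if j == index else 0.0 for j in range(vocab_size)]
    return [sum(one_hot[r] * table[r][c] for r in range(vocab_size))
            for c in range(embedding_dim)]

sequence = [1, 2, 0, 0]  # an integer-encoded, post-padded sentence
vectors = embed(sequence)
print(len(vectors), len(vectors[0]))
```

The lookup gives the same vector as multiplying a one-hot vector by the table, without ever building the one-hot vector.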



In [None]:
#keras embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = ['nice great best amazing', 'stop lies', 'pitiful nerd', 'excellent work', 'supreme quality', 'bad', 'highly respectable']
y_train = [1, 0, 0, 1, 1, 0, 1]

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)
vocab_size = len(tokenizer.word_index) + 1 # +1 to account for the padding index
print('vocabulary size :',vocab_size)

vocabulary size : 16


In [None]:
import numpy as np
X_encoded = tokenizer.texts_to_sequences(sentences)
X_train = pad_sequences(X_encoded, maxlen=4, padding='post')
y_train = np.array(y_train)
print('padding result :')
print(X_train)

padding result :
[[ 1  2  3  4]
 [ 5  6  0  0]
 [ 7  8  0  0]
 [ 9 10  0  0]
 [11 12  0  0]
 [13  0  0  0]
 [14 15  0  0]]


In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Flatten

embedding_dim = 4

model = Sequential()
model.add(Embedding(vocab_size, embedding_dim, input_length=4))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
model.fit(X_train, y_train, epochs=10, verbose=2)

Epoch 1/10
1/1 - 1s - loss: 0.7102 - acc: 0.1429 - 568ms/epoch - 568ms/step
Epoch 2/10
1/1 - 0s - loss: 0.7084 - acc: 0.1429 - 12ms/epoch - 12ms/step
Epoch 3/10
1/1 - 0s - loss: 0.7066 - acc: 0.2857 - 11ms/epoch - 11ms/step
Epoch 4/10
1/1 - 0s - loss: 0.7048 - acc: 0.2857 - 8ms/epoch - 8ms/step
Epoch 5/10
1/1 - 0s - loss: 0.7030 - acc: 0.2857 - 8ms/epoch - 8ms/step
Epoch 6/10
1/1 - 0s - loss: 0.7013 - acc: 0.2857 - 8ms/epoch - 8ms/step
Epoch 7/10
1/1 - 0s - loss: 0.6995 - acc: 0.2857 - 7ms/epoch - 7ms/step
Epoch 8/10
1/1 - 0s - loss: 0.6977 - acc: 0.4286 - 10ms/epoch - 10ms/step
Epoch 9/10
1/1 - 0s - loss: 0.6960 - acc: 0.4286 - 10ms/epoch - 10ms/step
Epoch 10/10
1/1 - 0s - loss: 0.6942 - acc: 0.4286 - 7ms/epoch - 7ms/step


<keras.callbacks.History at 0x7f58673f8df0>

## ELMo (Embeddings from Language Model)
Bidirectional, character-based embeddings (explained in the CNN chapter)

# 5. **Classification model practice**
### 1. Embedding-based 

##  Embedding

In [None]:
data = pd.read_csv('/content/drive/MyDrive/nlp_data/practice.csv')
data.columns = ['index','text','author']
data = data[['text','author']]
data.author = data.author.astype('str')

# preprocessing comes first
data.text = data.text.apply(lambda x : re.sub('o‐m','om',x))
data.text = data.text.apply(lambda x : re.sub('o-m','om',x))
data.text = data.text.apply(lambda x : re.sub('o‐d','od',x))
data.text = data.text.apply(lambda x : re.sub('o-d','od',x))

data.text = data.text.apply(lambda x : re.sub(r"-{2,}",' ',x))
data.text = data.text.apply(lambda x : re.sub(r"[^0-9a-zA-Z'\- ]",'',x))

data.text = data.text.apply(lambda x : re.sub("US", "usa" ,x)) # lowercase this specific word only

data_text = data.text.apply(lambda x : text_to_word_sequence(x)) # also lowercases everything
data_text

0        [he, was, almost, choking, there, was, so, muc...
1               [your, sister, asked, for, it, i, suppose]
2        [she, was, engaged, one, day, as, she, walked,...
3        [the, captain, was, in, the, porch, keeping, h...
4        [have, mercy, gentlemen, odin, flung, up, his,...
                               ...                        
54874    [is, that, you, mr, smith, odin, whispered, i,...
54875    [i, told, my, plan, to, the, captain, and, bet...
54876    [your, sincere, well, wisher, friend, and, sis...
54877        [then, you, wanted, me, to, lend, you, money]
54878    [it, certainly, had, not, occurred, to, me, be...
Name: text, Length: 54879, dtype: object

In [None]:
# stop-word removal
#stop_words_list = stopwords.words('english')
#data_text = data_text.apply(lambda x : [i for i in x if i not in stop_words_list])
#data_text

In [None]:
# check for empty rows
print('total number of documents :',len(data_text))
cnt=[]
for i,j in enumerate(data_text):
  if not j:
    cnt.append(i)
len(cnt)

total number of documents : 54879


44

In [None]:
idx = [i for i in data_text.index if i not in cnt]

In [None]:
data_text = data_text[idx]

In [None]:
# Is lemmatization a good idea??
# https://aclanthology.org/W19-6203.pdf
# https://stats.stackexchange.com/questions/374209/pre-processing-lemmatizing-and-stemming-make-a-better-doc2vec
# https://stackoverflow.com/questions/23877375/word2vec-lemmatization-of-corpus-before-training
# There is no settled answer, but for English it is said to be unnecessary..

# Some also argue that stop-word removal is unnecessary
# https://stackoverflow.com/questions/34721984/stopword-removing-when-using-the-word2vec
# https://datascience.stackexchange.com/questions/80227/should-i-keep-common-stop-words-when-preprocessing-for-word-embedding
# However, for text classification and sentiment analysis, stop words carry almost no information, so removing them is said to be better.
# https://www.kaggle.com/general/208662

In [None]:
# Lemmatize with POS Tag
from nltk.corpus import wordnet

def get_wordnet_pos(word):
    # take the first letter (V) of the word's POS tag (e.g. VBD), uppercased
    tag = nltk.pos_tag([word])[0][1][0].upper()

    # map POS initials to WordNet POS constants
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

# lemmatize each word
#lemmatizer = WordNetLemmatizer()
#data_text = data_text.apply(lambda x : [lemmatizer.lemmatize(w,get_wordnet_pos(w)) for w in x])

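size = number of features of a word vector, i.e. the dimension of the embedding (this is the gensim 3.x name; gensim 4.x renamed it to `vector_size`).  
window = context window size  
min_count = minimum word frequency (words that occur fewer times are not trained).  
workers = number of worker threads used for training  
sg = 0 for CBOW, 1 for Skip-gram.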

In [None]:
from gensim.models import Word2Vec
from gensim.models import KeyedVectors

model = Word2Vec(sentences=data_text, size=300, window=5, min_count=2, workers=4, sg=0)

In [None]:
model.wv.most_similar("man") # no stop-word removal applied here either

[('gentleman', 0.7922993302345276),
 ('fellow', 0.7626957297325134),
 ('woman', 0.7523154020309448),
 ('person', 0.6834543943405151),
 ('officer', 0.676331102848053),
 ('soldier', 0.6753376722335815),
 ('creature', 0.6553103923797607),
 ('lad', 0.6538023948669434),
 ('fellah', 0.6221071481704712),
 ('stranger', 0.6079340577125549)]

In [None]:
model.save("word2vec.model")
model = Word2Vec.load("word2vec.model")

In [None]:
# with lemmatization, sg=0
#[('fellah', 0.8633233308792114), means 'peasant'!
# ('woman', 0.8444805145263672),
# ('fellow', 0.7350084781646729),
# ('chap', 0.7068322896957397),
# ('men', 0.7026264667510986),
# ('fashion', 0.7018184065818787),
# ('gentleman', 0.6731172800064087),
# ('soldier', 0.6719125509262085),
# ('people', 0.6682370901107788),
# ("man's", 0.6663216352462769)]

In [None]:
# with lemmatization, sg=1
#[('fellah', 0.7122722864151001),
# ("man's", 0.6552585363388062),
# ('fellow', 0.6448282599449158),
# ('bailey', 0.638940691947937),
# ('orlick', 0.632754921913147),
# ('englishman', 0.6306297183036804),
# ('pawnbroker', 0.6290642023086548),
# ('vagabond', 0.6286340355873108),
# ('shipmate', 0.6274350881576538),
# ('rascal', 0.6273226737976074)]

In [None]:
# without lemmatization, sg=1
#[('woman', 0.684684157371521),
# ('fellah', 0.672799825668335),
# ('officer', 0.6544522643089294),
# ('fellow', 0.6477473974227905),
# ('soldier', 0.6471898555755615),
# ('mans', 0.6460458040237427),
# ("man's", 0.6431757807731628),
# ('person', 0.6349670886993408),
# ('gentleman', 0.6260026693344116),
# ('gentlemans', 0.6248079538345337)]

In [None]:
# without lemmatization, sg=0 gives this... which one is better??
#[('woman', 0.8838928937911987),
# ('fellah', 0.8600987195968628),
# ('fellow', 0.7722310423851013),
# ('mans', 0.7648662328720093),
# ('gentleman', 0.7474823594093323),
# ('asp', 0.721663236618042),
# ('men', 0.7075345516204834),
# ('creature', 0.7051717042922974),
# ('soldier', 0.705062985420227),
# ('chap', 0.6961818933486938)]

In [None]:
def get_document_vectors(document_list):
    document_embedding_list = []
    nonidx = []

    # for each document
    for idxs, line in enumerate(document_list):
        doc2vec = None
        count = 0
        for word in line:
            # membership test and lookup go through model.wv (the trained KeyedVectors)
            if word in model.wv:
                count += 1
                # sum the vectors of all in-vocabulary words of the document
                if doc2vec is None:
                    doc2vec = model.wv[word]
                else:
                    doc2vec = doc2vec + model.wv[word]

        if doc2vec is not None:
            # divide the summed vector by the number of in-vocabulary words (mean vector)
            doc2vec = doc2vec / count
            document_embedding_list.append(doc2vec)
        else:
            nonidx.append(idxs)

    # return the document vectors and the indices of documents with no known word
    return document_embedding_list, nonidx

In [None]:
document_embedding_list,aa = get_document_vectors(data_text)

In [None]:
print(len(document_embedding_list),len(data_text))

54835 54835


In [None]:
np.array(document_embedding_list).shape

(54835, 300)

In [None]:
y = data.author[idx].reset_index(drop=True)
X_train, X_test, y_train, y_test = train_test_split(np.array(document_embedding_list), y, test_size = 0.2, random_state = 156)

In [None]:
model = LogisticRegression()
model.fit(X_train, y_train)

LogisticRegression()

In [None]:
preds = model.predict(X_test)
preds

array(['0', '0', '1', ..., '0', '0', '0'], dtype=object)

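## Performance
Accuracy per setting (O = applied, X = not applied)
1. Stop-word removal O, lemmatization O : 0.53
2. Stop-word removal O, lemmatization X : 0.51
3. Stop-word removal X, lemmatization X : 0.53 size=100
4. Stop-word removal X, lemmatization X : 0.62 size=300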


In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
accuracy_score(preds, y_test) # about 54%

0.540713048235616

In [None]:
print(preds[:10],'\n',list(y_test[:10])) # class 2 is classified poorly

['0' '0' '1' '3' '3' '1' '3' '1' '2' '1'] 
 ['2', '0', '0', '2', '3', '1', '3', '1', '1', '1']


In [None]:
y_train

18186    0
13356    1
13397    3
35662    4
28840    4
        ..
6955     1
7653     2
42402    1
39628    3
24108    2
Name: author, Length: 43868, dtype: object

## Building a model pipeline
https://studying-modory.tistory.com/entry/210114-머신러닝-텍스트마이닝-두-번째-시간

## 2. Frequency-based modeling

In [None]:
data = pd.read_csv('/content/drive/MyDrive/nlp_data/practice.csv')
data.columns = ['index','text','author']
data = data[['text','author']]

# preprocessing comes first
data.text = data.text.apply(lambda x : re.sub('o‐m','om',x))
data.text = data.text.apply(lambda x : re.sub('o-m','om',x))
data.text = data.text.apply(lambda x : re.sub('o‐d','od',x))
data.text = data.text.apply(lambda x : re.sub('o-d','od',x))

data.text = data.text.apply(lambda x : re.sub(r"-{2,}",' ',x))
data.text = data.text.apply(lambda x : re.sub(r"[^0-9a-zA-Z'\- ]",'',x))

data.text = data.text.apply(lambda x : re.sub("US", "usa" ,x)) # lowercase this specific word only

In [None]:
data.text

0        He was almost choking There was so much so muc...
1                       Your sister asked for it I suppose
2         She was engaged one day as she walked in peru...
3        The captain was in the porch keeping himself c...
4        Have mercy gentlemen odin flung up his hands D...
                               ...                        
54874    Is that you Mr Smith odin whispered I hardly d...
54875    I told my plan to the captain and between us w...
54876     Your sincere well-wisher friend and sister LU...
54877                 Then you wanted me to lend you money
54878    It certainly had not occurred to me before but...
Name: text, Length: 54879, dtype: object

In [None]:
tfidf_descp = TfidfVectorizer(max_features=20000, stop_words='english', min_df=2)
x_descp = tfidf_descp.fit_transform(data.text)

In [None]:
tfidf_descp.get_feature_names_out()

array(['10', '100', '1000', ..., 'zoology', 'zossimov', 'zphyrine'],
      dtype=object)

In [None]:
y = data.author

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x_descp, y, test_size = 0.2, random_state = 156)

In [None]:
model2 = LogisticRegression()

In [None]:
model2.fit(x_train, y_train)
preds = model2.predict(x_test)

In [None]:
accuracy_score(preds, y_test) # about 73% accuracy

0.7265852769679301

In [None]:
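# note: the scikit-learn signature is f1_score(y_true, y_pred); with the arguments in this order the class weights come from the predicted labels
f1_score(preds, y_test, average='weighted')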

0.72912247624235

## How much does max_features matter?

In [None]:
max_feature = [100,500,1000,2500,5000,10000,20000,30000,50000,100000]
scores = []
for i in max_feature:
  tfidf_descp = TfidfVectorizer(max_features=i, stop_words='english', min_df=2)
  x_descp = tfidf_descp.fit_transform(data.text)
  x_train, x_test, y_train, y_test = train_test_split(x_descp, y, test_size = 0.2, random_state = 156)
  model2 = LogisticRegression()
  model2.fit(x_train, y_train)
  preds = model2.predict(x_test)
  scores.append(accuracy_score(preds, y_test))
scores # 20,000 performs best

[0.4518039358600583,
 0.5603134110787172,
 0.605502915451895,
 0.6701895043731778,
 0.7044460641399417,
 0.7223943148688047,
 0.7265852769679301,
 0.7241253644314869,
 0.7241253644314869,
 0.7241253644314869]

## Drop stop-word removal and add n-grams

In [None]:
ranges=[2,3,4]
features=[20000,40000]
scores = []
for i in ranges:
  for j in features:
    tfidf_descp = TfidfVectorizer(max_features=j, min_df=2, ngram_range=(1,i))
    x_descp = tfidf_descp.fit_transform(data.text)
    x_train, x_test, y_train, y_test = train_test_split(x_descp, y, test_size = 0.2, random_state = 156)
    model2 = LogisticRegression()
    model2.fit(x_train, y_train)
    preds = model2.predict(x_test)
    scores.append(accuracy_score(preds, y_test))
scores # ngram_range=(1,2) with max_features=40000 is the best, though only slightly

[0.7584730320699709,
 0.7652150145772595,
 0.7532798833819242,
 0.7628462099125365,
 0.7533709912536443,
 0.7629373177842566]
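The flat score list is easier to read once each value is paired with the parameters that produced it. The sketch below copies the six accuracies from the output above and rebuilds the loop order with `itertools.product`. The variable names `results`, `best_params` and `best_score` are made up for this example.

```python
from itertools import product

ranges = [2, 3, 4]
features = [20000, 40000]
# accuracies copied from the output above, in loop order
scores = [0.7584730320699709, 0.7652150145772595, 0.7532798833819242,
          0.7628462099125365, 0.7533709912536443, 0.7629373177842566]

# product() iterates in the same order as the nested loops (outer: ranges, inner: features)
results = {((1, n), m): s for (n, m), s in zip(product(ranges, features), scores)}

best_params, best_score = max(results.items(), key=lambda kv: kv[1])
for params, score in sorted(results.items(), key=lambda kv: -kv[1]):
    print(params, round(score, 4))
```

Read this way, raising `max_features` from 20000 to 40000 helps for every n-gram range, while going beyond bigrams does not improve accuracy on this data.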