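# Text Preprocessing - Text Normalization

## Tokenization

### Sentence Tokenization
- Splits text into sentences at symbols that mark the end of a sentence, such as periods and newline characters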
- nltk : https://www.nltk.org
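
As a rough illustration of the idea, the sketch below splits on sentence-ending punctuation with a regular expression. The helper `naive_sent_split` and both sample strings are made up for this example; this is not how NLTK's tokenizer works internally, and the second call shows why a plain rule is not enough.

```python
import re

def naive_sent_split(text):
    # Split after '.', '!' or '?' whenever whitespace follows.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

sample = "You can see it out your window. You feel it when you go to work."
sentences = naive_sent_split(sample)
print(sentences)

# The rule is too simple for abbreviations: "Dr." is treated as a sentence end.
tricky = naive_sent_split("Dr. Smith is here. He is early.")
print(tricky)
```

A trained tokenizer such as the one behind `sent_tokenize` is generally preferable to a hand-written rule like this.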

In [4]:
!pip install nltk



In [2]:
from nltk import sent_tokenize
import nltk 

nltk.download('punkt')

text_sample = 'The Matrix is everywhere its all around us, here even in this room. \
               You can see it out your window or on your television. \
               You feel it when you go to work, or go to church or pay your taxes.'

sentence = sent_tokenize(text_sample)
print(len(sentence))
print(sentence)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


3
['The Matrix is everywhere its all around us, here even in this room.', 'You can see it out your window or on your television.', 'You feel it when you go to work, or go to church or pay your taxes.']


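### Word Tokenization
- Splits a sentence into word tokens
- Sentence tokenization is generally used when the meaning carried by each sentence matters.
- When word order does not matter, as in BoW (Bag of Words), word tokenization alone is sufficient.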
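
The remark about word order can be checked with a small sketch. The helper `bag_of_words` is hypothetical and uses a plain whitespace split instead of a real tokenizer; it only shows that two sentences with the same words in a different order produce the same bag.

```python
from collections import Counter

def bag_of_words(text):
    # Lowercase and split on whitespace; a real pipeline would use a proper tokenizer.
    return Counter(text.lower().split())

bow_a = bag_of_words("you feel it when you go to work")
bow_b = bag_of_words("when you go to work you feel it")

print(bow_a == bow_b)   # same counts, so the two bags are equal
print(bow_a["you"])     # count of the word "you"
```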

In [6]:
from nltk import word_tokenize

sentence = 'The Matrix is everywhere its all around us, here even in this room.'
words = word_tokenize(sentence)
print(len(words))
print(words)

15
['The', 'Matrix', 'is', 'everywhere', 'its', 'all', 'around', 'us', ',', 'here', 'even', 'in', 'this', 'room', '.']


In [7]:
from nltk import sent_tokenize, word_tokenize

def tokenize_text(text):
    sentences = sent_tokenize(text)
    word_tokens = [ word_tokenize(sentence) for sentence in sentences]
    return word_tokens

word_tokens = tokenize_text(text_sample)
print(word_tokens)

[['The', 'Matrix', 'is', 'everywhere', 'its', 'all', 'around', 'us', ',', 'here', 'even', 'in', 'this', 'room', '.'], ['You', 'can', 'see', 'it', 'out', 'your', 'window', 'or', 'on', 'your', 'television', '.'], ['You', 'feel', 'it', 'when', 'you', 'go', 'to', 'work', ',', 'or', 'go', 'to', 'church', 'or', 'pay', 'your', 'taxes', '.']]


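## Removing Stopwords
- Stopwords are words that carry little meaning for the analysis.
- Words such as is, the, a, and will, which contribute little to the meaning in context, fall into this category.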

In [8]:
import nltk
nltk.download('stopwords') # download the stopword lists

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [11]:
print(nltk.corpus.stopwords.fileids())

['arabic', 'azerbaijani', 'basque', 'bengali', 'catalan', 'chinese', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hebrew', 'hinglish', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene', 'spanish', 'swedish', 'tajik', 'turkish']


In [10]:
print('Number of English stopwords: ', len(nltk.corpus.stopwords.words('english')))
print(nltk.corpus.stopwords.words('english')[:20])

Number of English stopwords:  179
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his']


In [12]:
# list of punctuation characters
import string
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [15]:
import nltk

stopwords = nltk.corpus.stopwords.words('english')
all_tokens = []
print('[Original words]')
print(word_tokens)
for sentence in word_tokens:
    filtered_words = []
    for word in sentence:
        word = word.lower()
        if word not in stopwords and word not in string.punctuation:
            filtered_words.append(word)
    all_tokens.append(filtered_words)
print()
print('[Words after stopword removal]')
print(all_tokens)

[Original words]
[['The', 'Matrix', 'is', 'everywhere', 'its', 'all', 'around', 'us', ',', 'here', 'even', 'in', 'this', 'room', '.'], ['You', 'can', 'see', 'it', 'out', 'your', 'window', 'or', 'on', 'your', 'television', '.'], ['You', 'feel', 'it', 'when', 'you', 'go', 'to', 'work', ',', 'or', 'go', 'to', 'church', 'or', 'pay', 'your', 'taxes', '.']]

[Words after stopword removal]
[['matrix', 'everywhere', 'around', 'us', 'even', 'room'], ['see', 'window', 'television'], ['feel', 'go', 'work', 'go', 'church', 'pay', 'taxes']]
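
One detail of the loop above: `word not in string.punctuation` is a substring test, because `string.punctuation` is a single string. It works for single-character tokens, but a multi-character token such as `'()'` would also match. A possible variant, sketched below, uses sets so that membership is exact. The tiny `STOPWORDS` set is a made-up stand-in for NLTK's English list, and `remove_stopwords` is a hypothetical helper name.

```python
import string

# Hypothetical mini stopword set for illustration; the notebook uses NLTK's English list.
STOPWORDS = {"the", "is", "its", "all", "here", "in", "this"}
# A set of single characters, so only exact one-character matches are removed.
PUNCT = set(string.punctuation)

def remove_stopwords(tokens):
    lowered = (t.lower() for t in tokens)
    return [t for t in lowered if t not in STOPWORDS and t not in PUNCT]

tokens = ['The', 'Matrix', 'is', 'everywhere', ',', 'here', 'in', 'this', 'room', '.']
filtered = remove_stopwords(tokens)
print(filtered)
```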


## Stemming and Lemmatization
- Both techniques `find the base form of a word` that has changed grammatically or semantically.

In [20]:
from nltk.stem import LancasterStemmer

stemmer = LancasterStemmer()
print(stemmer.stem('working'), stemmer.stem('works'), stemmer.stem('worked'))
print(stemmer.stem('amusing'), stemmer.stem('amused'), stemmer.stem('amuses'))
print(stemmer.stem('happier'), stemmer.stem('happiest'))

work work work
amus amus amus
happy happiest


In [25]:
from nltk.stem import WordNetLemmatizer
import nltk 
nltk.download('wordnet') # download the WordNet dictionary

lemma = WordNetLemmatizer()
# part-of-speech tags: noun (n), verb (v), adjective (a), adverb (r)
print(lemma.lemmatize('amusing','v'), lemma.lemmatize('amuses','v'), lemma.lemmatize('amused','v'))
print(lemma.lemmatize('happier','a'), lemma.lemmatize('happiest','a'))

amuse amuse amuse
happy happy


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
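
The outputs above differ in character: the stemmer returns truncated forms such as `amus` and leaves `happiest` unchanged, while the lemmatizer returns dictionary words. The toy sketch below contrasts the two general approaches, rule-based suffix stripping versus dictionary lookup. The suffix list, the lookup table, and both function names are invented for this example; neither function reproduces NLTK's actual algorithms.

```python
# Toy suffix list and lookup table; both are assumptions made up for this sketch.
SUFFIXES = ("ing", "ed", "es", "s")
LEMMA_DICT = {
    "amusing": "amuse", "amused": "amuse", "amuses": "amuse",
    "happier": "happy", "happiest": "happy",
}

def toy_stem(word):
    # Strip the first matching suffix, keeping at least three characters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def toy_lemmatize(word):
    # Look the word up in a table of known forms; fall back to the word itself.
    return LEMMA_DICT.get(word, word)

for w in ["amusing", "amused", "amuses", "happiest"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

The stemming rule is cheap but blind to irregular forms, whereas the lookup is only as good as its dictionary.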


# Bag of Words (BoW)

## DTM (Document-Term Matrix)

### Using CountVectorizer

In [29]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['you know I want your love. because I love you.']
vector = CountVectorizer()
print('bag of words vector:', vector.fit_transform(corpus).toarray())
# 'I' is dropped while building the BoW (by default, CountVectorizer only treats words of two or more characters as tokens)
print('vocabulary:', sorted(vector.vocabulary_.items(), key= lambda item:item[0]))

bag of words vector: [[1 1 2 1 2 1]]
vocabulary: [('because', 0), ('know', 1), ('love', 2), ('want', 3), ('you', 4), ('your', 5)]
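
The two-character rule comes from CountVectorizer's default `token_pattern`, `r"(?u)\b\w\w+\b"`, combined with its default lowercasing. As a minimal sketch using only the standard library, the same tokens and counts can be reproduced by hand for this one sentence; it is meant to illustrate the default tokenization, not to replace the vectorizer.

```python
import re
from collections import Counter

# Default token_pattern of scikit-learn's CountVectorizer: two or more word characters.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

text = 'you know I want your love. because I love you.'
tokens = re.findall(TOKEN_PATTERN, text.lower())   # 'i' is too short to match
counts = Counter(tokens)
vocab = sorted(counts)                              # alphabetical, like vocabulary_ indices

print(vocab)
print([counts[w] for w in vocab])
```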


- Building a BoW with stopwords removed

1. Using a user-defined stopword list

In [33]:
text = ["Family is not an important thing. It's everything."]
vector = CountVectorizer(stop_words=['the','an','a','is','not'])
print(vector.fit_transform(text).toarray())
print('vocabulary:', sorted(vector.vocabulary_.items(), key= lambda item:item[0]))

[[1 1 1 1 1]]
vocabulary: [('everything', 0), ('family', 1), ('important', 2), ('it', 3), ('thing', 4)]


2. Using the built-in stopword list provided by CountVectorizer

In [34]:
text = ["Family is not an important thing. It's everything."]
vector = CountVectorizer(stop_words='english')
print(vector.fit_transform(text).toarray())
print('vocabulary:', sorted(vector.vocabulary_.items(), key= lambda item:item[0]))

[[1 1 1]]
vocabulary: [('family', 0), ('important', 1), ('thing', 2)]


3. Using the stopword list provided by NLTK

In [36]:
from nltk.corpus import stopwords
text = ["Family is not an important thing. It's everything."]
stop_words = stopwords.words('english')
vector = CountVectorizer(stop_words=stop_words)
print(vector.fit_transform(text).toarray())
print('vocabulary:', sorted(vector.vocabulary_.items(), key= lambda item:item[0]))

[[1 1 1 1]]
vocabulary: [('everything', 0), ('family', 1), ('important', 2), ('thing', 3)]


### TF-IDF (Term Frequency-Inverse Document Frequency)

In [39]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['you know I want your love', # document 1
         'I like you', # document 2
         'what should I do'] # document 3
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(corpus).toarray())
print('vocabulary:', sorted(tfidf.vocabulary_.items(), key= lambda item:item[0]))

[[0.         0.46735098 0.         0.46735098 0.         0.46735098
  0.         0.35543247 0.46735098]
 [0.         0.         0.79596054 0.         0.         0.
  0.         0.60534851 0.        ]
 [0.57735027 0.         0.         0.         0.57735027 0.
  0.57735027 0.         0.        ]]
vocabulary: [('do', 0), ('know', 1), ('like', 2), ('love', 3), ('should', 4), ('want', 5), ('what', 6), ('you', 7), ('your', 8)]
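
The numbers in the matrix above can be reproduced by hand. The sketch below assumes scikit-learn's default settings for `TfidfVectorizer`: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. Tokenization is simplified to a whitespace split that keeps words of two or more characters, which is enough for this corpus. The helper name `tfidf_row` is made up for the example.

```python
import math
from collections import Counter

docs = ['you know I want your love', 'I like you', 'what should I do']

# Lowercase and keep tokens of length >= 2 (mirrors the default tokenization here).
tokenized = [[w for w in d.lower().split() if len(w) >= 2] for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(docs)

# Document frequency: number of documents containing each word.
df = {w: sum(w in doc for doc in tokenized) for w in vocab}
# Smoothed IDF, assuming smooth_idf=True.
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_row(doc):
    tf = Counter(doc)
    raw = [tf[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in raw))   # L2 norm of the row
    return [x / norm for x in raw]

matrix = [tfidf_row(doc) for doc in tokenized]
for row in matrix:
    print([round(x, 8) for x in row])
```

In document 1, `you` gets a lower weight than the other words because it also appears in document 2, so its IDF is smaller.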


# Word2Vec

## Loading Packages

In [1]:
import re
import urllib.request
from nltk.tokenize import word_tokenize, sent_tokenize

from gensim.models import Word2Vec
from gensim.models import KeyedVectors

## Loading the Data

In [2]:
urllib.request.urlretrieve("https://raw.githubusercontent.com/ukairia777/tensorflow-nlp-tutorial/main/09.%20Word%20Embedding/dataset/ted_en-20160408.xml", filename="ted_en-20160408.xml")

('ted_en-20160408.xml', <http.client.HTTPMessage at 0x20803e27190>)

In [3]:
with open("ted_en-20160408.xml", 'r', encoding='UTF8') as f:
    ted_xml = f.read()

In [5]:
print(ted_xml[:1000])

<?xml version="1.0" encoding="UTF-8"?>
<xml language="en"><file id="1">
  <head>
    <url>http://www.ted.com/talks/knut_haanaes_two_reasons_companies_fail_and_how_to_avoid_them</url>
    <pagesize>72832</pagesize>
    <dtime>Fri Apr 01 00:57:03 CEST 2016</dtime>
    <encoding>UTF-8</encoding>
    <content-type>text/html; charset=utf-8</content-type>
    <keywords>talks, business, creativity, curiosity, goal-setting, innovation, motivation, potential, success, work</keywords>
    <speaker>Knut Haanaes</speaker>
    <talkid>2470</talkid>
    <videourl>http://download.ted.com/talks/KnutHaanaes_2015S.mp4</videourl>
    <videopath>talks/KnutHaanaes_2015S.mp4</videopath>
    <date>2015/06/30</date>
    <title>Knut Haanaes: Two reasons companies fail -- and how to avoid them</title>
    <description>TED Talk Subtitles and Transcript: Is it possible to run a company and reinvent it at the same time? For business strategist Knut Haanaes, the ability to innovate after becoming successful is the 

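## Data Preprocessing
- The text to train on sits between `<content>` and `</content>`, so only that content is extracted.
- Text inside the documents that is not needed for training is removed (the `re.sub` call below strips parenthesized single-word annotations).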

In [6]:
from bs4 import BeautifulSoup
bs = BeautifulSoup(ted_xml, 'lxml')



In [11]:
content_text = bs.find_all('content')
# print(len(content_text))
# print(content_text[0])
content_text = '\n'.join( c.get_text() for c in content_text)
content_text = re.sub(r'\([\w]*\)','',content_text)
# print(content_text[:1000])

Here are two reasons companies fail: they only do more of the same, or they only do what's new.
To me the real, real solution to quality growth is figuring out the balance between two activities: exploration and exploitation. Both are necessary, but it can be too much of a good thing.
Consider Facit. I'm actually old enough to remember them. Facit was a fantastic company. They were born deep in the Swedish forest, and they made the best mechanical calculators in the world. Everybody used them. And what did Facit do when the electronic calculator came along? They continued doing exactly the same. In six months, they went from maximum revenue ... and they were gone. Gone.
To me, the irony about the Facit story is hearing about the Facit engineers, who had bought cheap, small electronic calculators in Japan that they used to double-check their calculators.

Facit did too much exploitation. But exploration can go wild, too.
A few years back, I worked closely alongside a European biotech co
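
To see what the pattern `r'\([\w]*\)'` used above does and does not remove, here is a small sketch on a made-up string (the sample text is an assumption, not taken from the dataset). The pattern only matches parentheses that contain word characters and nothing else, so an annotation containing spaces is left in place.

```python
import re

# Made-up sample; the annotations imitate the style of transcript markers.
sample = "Thank you. (Applause) That was fun. (Laughter and applause)"
cleaned = re.sub(r'\([\w]*\)', '', sample)
print(cleaned)
```

If multi-word annotations should be removed as well, a broader pattern would be needed; whether that is desirable depends on the data.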

In [12]:
# sentence tokenization
sentences = sent_tokenize(content_text)

# remove punctuation and convert to lowercase
normalized_text = []
for sentence in sentences:
    result = re.sub(r'[^\w]+', ' ', sentence.lower())
    normalized_text.append(result)

# word tokenization
result = [word_tokenize(sentence) for sentence in normalized_text]

In [14]:
print(f'Total number of samples: {len(result):,}')
print(result[0])

Total number of samples: 273,698
['here', 'are', 'two', 'reasons', 'companies', 'fail', 'they', 'only', 'do', 'more', 'of', 'the', 'same', 'or', 'they', 'only', 'do', 'what', 's', 'new']


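## Training Word2Vec
- Main Word2Vec parameters
>- vector_size : the number of features in a word vector, i.e. the dimensionality of the embedding (a value between 100 and 300 is typically a reasonable choice)
>- window : context window size
>- min_count : minimum word frequency (words that occur fewer times are excluded from training)
>- sg : 0 for CBOW, 1 for Skip-gram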
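
To make the `window` parameter concrete, the sketch below lists the context words that fall within a window of 2 around each center word. Skip-gram predicts each context word from the center word, and CBOW predicts the center word from its context words. The helper `context_pairs` is hypothetical and only illustrates the notion of a context window; gensim's actual training loop differs in its details.

```python
def context_pairs(tokens, window):
    # Pair each center word with the words up to `window` positions away on either side.
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = ['here', 'are', 'two', 'reasons', 'companies', 'fail']
pairs = context_pairs(tokens, window=2)

# Context words seen around the center word 'two'
two_context = [ctx for center, ctx in pairs if center == 'two']
print(two_context)
```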

In [15]:
model = Word2Vec(sentences=result, vector_size=100, window=5, min_count=5, workers=4, sg=0)

In [16]:
sim_word = model.wv.most_similar('man')
print(sim_word)

[('woman', 0.8412037491798401), ('guy', 0.8095774054527283), ('boy', 0.763991117477417), ('lady', 0.7517896294593811), ('girl', 0.7424473762512207), ('soldier', 0.7386289238929749), ('gentleman', 0.7186206579208374), ('kid', 0.6830040812492371), ('poet', 0.6681225895881653), ('photographer', 0.6545884013175964)]
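
The scores returned by `most_similar` are cosine similarities between word vectors. A minimal sketch of that measure follows; the three-dimensional vectors are made up for illustration and have nothing to do with the trained model.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors, not real embeddings.
man = [0.9, 0.1, 0.3]
woman = [0.8, 0.2, 0.35]
banana = [-0.2, 0.9, -0.4]

print(cosine_similarity(man, woman) > cosine_similarity(man, banana))
```

Because the measure depends only on the angle between vectors, their lengths do not affect the ranking.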


## Saving the Word2Vec Model

In [17]:
# save the trained word vectors in word2vec format
model.wv.save_word2vec_format('eng_w2v')

# load the saved word vectors
loaded_model = KeyedVectors.load_word2vec_format('eng_w2v')
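
With the default `binary=False`, `save_word2vec_format` writes a plain text file: a header line with the vocabulary size and the vector size, then one line per word with its vector components. As a rough sketch of that layout, the reader below parses a tiny made-up sample from memory. The function name `read_word2vec_text` and the sample content are assumptions for illustration; for real files, `KeyedVectors.load_word2vec_format` is the appropriate tool.

```python
import io

# A tiny made-up file in word2vec text format:
# header "vocab_size vector_size", then one "word v1 v2 ..." line per word.
sample_file = io.StringIO("2 3\nman 0.1 0.2 0.3\nwoman 0.2 0.1 0.4\n")

def read_word2vec_text(handle):
    vocab_size, vector_size = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(' ')
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    # Sanity checks against the header.
    assert len(vectors) == vocab_size
    assert all(len(v) == vector_size for v in vectors.values())
    return vectors

vectors = read_word2vec_text(sample_file)
print(vectors['man'])
```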

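## Visualizing the Embedding Vectors
- The trained embedding vectors can be visualized with Google's Embedding Projector, a visualization tool.

### Generating tsv Files from the Word Embedding Model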

In [18]:
!python -m gensim.scripts.word2vec2tensor --input eng_w2v --output eng_w2v

2024-05-02 11:38:42,867 - word2vec2tensor - INFO - running C:\Users\user\anaconda3\Lib\site-packages\gensim\scripts\word2vec2tensor.py --input eng_w2v --output eng_w2v
2024-05-02 11:38:42,867 - keyedvectors - INFO - loading projection weights from eng_w2v
2024-05-02 11:38:44,615 - utils - INFO - KeyedVectors lifecycle event {'msg': 'loaded (21638, 100) matrix of type float32 from eng_w2v', 'binary': False, 'encoding': 'utf8', 'datetime': '2024-05-02T11:38:44.569316', 'gensim': '4.3.0', 'python': '3.11.7 | packaged by Anaconda, Inc. | (main, Dec 15 2023, 18:05:47) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22631-SP0', 'event': 'load_word2vec_format'}
2024-05-02 11:38:45,998 - word2vec2tensor - INFO - 2D tensor file saved to eng_w2v_tensor.tsv
2024-05-02 11:38:45,998 - word2vec2tensor - INFO - Tensor metadata file saved to eng_w2v_metadata.tsv
2024-05-02 11:38:45,998 - word2vec2tensor - INFO - finished running word2vec2tensor.py


### Visualization with the Embedding Projector
- https://projector.tensorflow.org