# fastText

https://github.com/facebookresearch/fastText

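**fastText** is an open-source library for natural language processing (NLP) that provides fast, efficient tools for text classification and for learning word embeddings. It was developed by Facebook AI Research and delivers strong accuracy and speed even on large text datasets. Its main features are as follows: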


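**Key features**
1. **Word vector learning (word embeddings)**  
   - fastText learns a word embedding model that maps each word to a fixed-size vector, so that words with similar meanings end up close together in the vector space.
   - It is similar to Word2Vec, but fastText additionally breaks each word into **subword** units.

2. **Subword-based model**  
   - Each word is wrapped in the boundary markers `<` and `>` and split into character n-grams (for example, with n = 3, 'apple' → ['<ap', 'app', 'ppl', 'ple', 'le>']), which makes the model robust to **rare words** and **misspellings**.
   - This lets the model learn finer-grained information than whole words, such as spelling patterns, prefixes, and suffixes.

3. **Text classification**  
   - fastText includes a supervised classifier for documents and sentences that is built for speed.
   - Training is fast and the resulting models are small, while accuracy is often competitive with much heavier models.

4. **Efficient implementation**  
   - fastText is designed to perform well on CPUs alone and stays fast on large datasets.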
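
The subword decomposition described in the feature list can be sketched in a few lines of Python. This is a minimal illustration rather than the library's own code: the function name `char_ngrams` and the misspelling `appple` are hypothetical, and the sketch works on Python characters, whereas the library itself works on UTF-8 bytes.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of `word` wrapped in < and > markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]


correct = char_ngrams("apple")
typo = char_ngrams("appple")  # hypothetical misspelling

# n-grams shared by the correct spelling and the misspelling
shared = sorted(set(correct) & set(typo))

print(correct)  # ['<ap', 'app', 'ppl', 'ple', 'le>']
print(shared)
```

The misspelling keeps all five trigrams of the correct word and adds only `ppp`, which is why the two spellings tend to receive similar vectors.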

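**How fastText works**
1. **Word representation**  
   - Each word is split into character n-grams, and a vector is learned for each n-gram.
   - For example, with n-gram lengths 3 to 6 (the library defaults), "cat" is wrapped as `<cat>` and yields '<ca', 'cat', 'at>', '<cat', 'cat>', and '<cat>'.
   - The word vector is then the sum of these n-gram vectors, together with the vector of the whole word when it is in the vocabulary (the implementations average rather than sum).

2. **Model architecture**  
   - fastText builds on the Skip-gram and CBOW models.
   - Unlike the original models, which learn a single vector per word, it also learns vectors for the subwords and trains on them.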
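
To make the composition step concrete, here is a rough sketch of how a vector for an unseen word could be assembled from hashed n-gram rows. Everything in it is an assumption made for illustration: the table is random rather than trained, the sizes `DIM` and `BUCKETS` are tiny, a plain FNV-1a hash stands in for the library's hashing, and the names `char_ngrams`, `fnv1a`, and `subword_vector` are hypothetical.

```python
import random


def char_ngrams(word, minn=3, maxn=6):
    """Character n-grams of <word>, for every length from minn to maxn."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for i in range(len(wrapped))
            for n in range(minn, maxn + 1)
            if i + n <= len(wrapped)]


def fnv1a(text):
    """32-bit FNV-1a hash of the UTF-8 bytes of `text`."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        h = ((h ^ byte) * 16777619) & 0xFFFFFFFF
    return h


DIM, BUCKETS = 4, 1000  # toy sizes, chosen only for this sketch
rng = random.Random(0)
# Stand-in for the trained n-gram matrix: one random row per hash bucket
ngram_table = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(BUCKETS)]


def subword_vector(word):
    """Average the bucket rows of all n-grams of `word`."""
    rows = [ngram_table[fnv1a(g) % BUCKETS] for g in char_ngrams(word)]
    return [sum(col) / len(rows) for col in zip(*rows)]


print(char_ngrams("cat"))
print(subword_vector("abracadabra"))
```

Hashing n-grams into a fixed number of buckets keeps the table size bounded however many distinct n-grams the corpus contains, at the cost of occasional collisions.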

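**Advantages of fastText**
1. **Handling rare words**  
   - Thanks to the subword approach, it generalizes better to rare words and to words never seen during training.
2. **Fast training**  
   - The simple model architecture and optimized implementation make training very fast.
3. **Support for many languages**  
   - It works across many languages and is particularly effective for inflected languages, where a single stem appears in many surface forms.

**Use cases**
1. **Word embeddings**  
   - Computing similarity between words, learning sentence representations.
2. **Text classification**  
   - Spam filtering, sentiment analysis, news categorization.
3. **Multilingual applications**  
   - Fast training and inference on multilingual datasets.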
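
For the text classification use case, the `fasttext` package expects one example per line, with each label carrying the `__label__` prefix. The sketch below only prepares data in that shape; the helper names and the toy examples are hypothetical, and `train_classifier` is defined but not called because it needs the package and a training file on disk.

```python
def to_fasttext_line(label, text):
    """Format one example as '__label__<label> <text>' on a single line."""
    return f"__label__{label} {' '.join(text.split())}"


# Hypothetical toy examples, for illustration only
examples = [
    ("spam", "Win a free prize now"),
    ("ham", "Are we still meeting\nat noon?"),
]
lines = [to_fasttext_line(label, text) for label, text in examples]
print("\n".join(lines))


def train_classifier(train_path):
    """Train a classifier on a file of such lines (requires the fasttext package)."""
    import fasttext
    return fasttext.train_supervised(input=train_path)
```

Once `lines` has been written to a file, calling `predict` on the trained model with a piece of text returns the predicted labels and their probabilities.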

### gensim FastText

In [1]:
from gensim.models import FastText
from lxml import etree
import re
from nltk.tokenize import word_tokenize, sent_tokenize
import pandas as pd

In [2]:
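# Parse the TED talk XML; etree.parse accepts a file path directly
xml = etree.parse('ted_en.xml')

# Join the text of every <content> element and drop parenthesized
# annotations such as (Laughter) or (Applause)
corpus = '\n'.join(xml.xpath('//content/text()'))
corpus = re.sub(r'\([^)]*\)', '', corpus)

# Split into sentences (requires the NLTK 'punkt' tokenizer data)
sentences = sent_tokenize(corpus)

preprocessed_sentences = []

for sentence in sentences:
    # Lowercase, replace non-alphanumeric characters with spaces, then tokenize
    sentence = sentence.lower()
    sentence = re.sub(r'[^0-9a-zA-Z]', ' ', sentence)
    tokens = word_tokenize(sentence)
    preprocessed_sentences.append(tokens)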
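
To see what the two regular expressions in this cell do, the same steps can be applied to a single made-up sentence. The function name `clean_sentence` and the sample text are hypothetical, and a plain whitespace split stands in for `word_tokenize` so that the sketch runs without NLTK.

```python
import re


def clean_sentence(sentence):
    """Apply the same two regex steps as the preprocessing cell to one sentence."""
    sentence = re.sub(r'\([^)]*\)', '', sentence)           # drop (Laughter), (Applause), ...
    sentence = re.sub(r'[^0-9a-z]', ' ', sentence.lower())  # keep letters and digits only
    return sentence.split()  # whitespace split stands in for word_tokenize


print(clean_sentence("Thank you. (Applause) It's 2024!"))
# ['thank', 'you', 'it', 's', '2024']
```

Replacing the apostrophe with a space splits "it's" into `it` and `s`, so contractions end up as separate tokens in the training data.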

In [3]:
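from gensim.models import Word2Vec

# Baseline for comparison: Word2Vec trained with CBOW
w2v_model = Word2Vec(
    sentences=preprocessed_sentences,
    vector_size=100,  # embedding dimension
    window=5,         # context window size
    min_count=5,      # ignore words seen fewer than 5 times
    sg=0              # 0 = CBOW, 1 = Skip-gram
)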

In [4]:
w2v_model.wv.vectors.shape

(21613, 100)

In [6]:
w2v_df = pd.DataFrame(w2v_model.wv.vectors, index=w2v_model.wv.index_to_key)
w2v_df.head(10)

word,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
the,-0.464643,-0.917629,0.256912,0.3622,-1.066241,-0.931005,-0.006629,0.068477,0.372089,-0.022289,...,-1.065033,0.89769,0.997271,-0.756384,1.126539,0.511299,-0.863078,0.365835,0.224018,0.945146
and,-0.883835,0.177656,-1.057905,1.487984,1.12748,-0.175623,-1.184068,-0.57567,-1.102482,0.317463,...,-1.814558,-1.054829,1.208269,0.039478,0.397316,-0.709551,0.33469,1.015812,0.740607,-0.48465
to,0.506508,1.362625,-0.449053,-1.197003,1.186114,-1.094063,-2.516772,-0.974796,-1.893198,0.768134,...,-2.124477,1.679532,2.209489,-0.442015,0.690735,0.68094,-0.586557,-0.033951,3.570805,0.187301
of,-3.297137,1.012579,0.567303,-0.241115,-0.563605,-1.699185,-0.819826,1.167598,0.276815,-1.17233,...,-0.655139,1.276986,1.96599,-0.343236,1.674938,-2.024231,1.154721,-0.038081,0.211678,0.857612
a,-0.087081,-1.770927,0.488321,-0.76828,-1.086148,2.082346,-0.233698,-0.636904,0.266622,0.71715,...,1.074848,1.270035,1.907876,0.010054,1.432991,-0.392168,1.133561,1.32134,1.282733,3.033434
that,0.0615,-0.213951,-1.778792,-0.234738,0.511289,-0.639467,-2.041008,-0.31593,-0.6187,1.482044,...,-1.026122,1.300095,0.294023,0.443703,1.355227,-0.712531,-0.595474,1.516287,1.987599,-0.037751
i,1.255031,1.441046,0.425875,0.795668,-0.30211,-0.297489,-2.844997,1.777878,-0.59705,-1.294695,...,-2.37549,0.219939,0.576163,1.179371,-1.855322,-0.223886,0.180738,1.014588,0.648383,1.66651
in,0.497402,-0.821946,1.893463,0.710792,1.030819,0.454409,-1.084455,-0.524898,-2.186147,-1.303596,...,-1.111929,1.184903,-0.717251,1.239532,0.629372,-0.101079,-0.702943,-1.048647,0.799252,-0.390033
it,-0.271573,0.038963,-0.228575,-0.867665,0.355201,1.358448,-0.903723,0.072886,-1.366179,0.920411,...,-0.201174,-0.42796,-0.213951,1.099804,2.017107,1.78999,-0.850215,1.288615,0.838582,1.39
you,0.930557,0.549396,-0.710548,-1.599341,-2.380533,-1.110615,-3.056979,0.20667,0.440255,-1.26495,...,-1.421706,3.084721,-0.833058,0.587924,-0.41182,-1.433603,-0.564615,0.372911,1.543063,0.455937


In [10]:
w2v_model.wv.most_similar('father')
# Word2Vec has no vector for a word outside its vocabulary, so this call raises a KeyError:
# w2v_model.wv.most_similar('abracadabra')

[('son', 0.9369217753410339),
 ('husband', 0.9184561967849731),
 ('daughter', 0.9044973254203796),
 ('mother', 0.8905930519104004),
 ('grandmother', 0.8881942629814148),
 ('dad', 0.8869089484214783),
 ('brother', 0.8781124949455261),
 ('wife', 0.8769818544387817),
 ('sister', 0.8744943141937256),
 ('mom', 0.8683457374572754)]
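
The scores returned by `most_similar` are cosine similarities between word vectors. A minimal sketch of that measure follows; the three-dimensional vectors are made up for illustration and are not taken from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional vectors, for illustration only
father = [1.0, 2.0, 0.5]
son = [0.9, 2.1, 0.4]
car = [-2.0, 0.1, 1.5]

print(cosine_similarity(father, son))
print(cosine_similarity(father, car))
```

Vectors that point in nearly the same direction score close to 1, while unrelated or opposing directions score near 0 or below.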

In [12]:
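# FastText with the same hyperparameters as the Word2Vec baseline
from gensim.models import FastText

fasttext_model = FastText(
    sentences=preprocessed_sentences,
    vector_size=100,
    window=5,
    min_count=5,
    sg=0  # CBOW; n-gram lengths use gensim's defaults (min_n=3, max_n=6)
)

# Same vocabulary size as Word2Vec; n-gram vectors are stored separately in wv.vectors_ngrams
fasttext_model.wv.vectors.shape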

(21613, 100)

In [13]:
fasttext_df = pd.DataFrame(fasttext_model.wv.vectors, index=fasttext_model.wv.index_to_key)
fasttext_df.head(10)

word,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
the,-0.369635,1.070593,-2.22295,-0.520111,-1.750231,2.64449,-0.296321,2.331468,-3.747618,1.360666,...,-0.251083,-1.989456,2.857589,-0.698603,0.250717,-1.631142,-0.550104,0.156362,-0.458002,2.517551
and,-2.930699,1.840704,-0.286831,-0.232874,-0.637588,-0.593772,-2.609573,1.211713,-3.226885,1.336933,...,-0.118432,-0.110354,2.851955,0.698216,0.647782,-0.819922,-1.078549,-1.252406,-1.060564,-1.272167
to,0.520525,4.440922,-1.635941,-2.318451,-3.04406,-2.386131,-0.640111,3.309205,-3.684052,1.372206,...,-1.918518,-0.482606,1.881178,2.125586,5.475373,2.141087,0.88028,0.745256,1.94045,-2.108219
of,-1.807726,-0.494223,2.170498,-1.548933,-3.447823,4.199588,-3.089437,-0.69934,1.064016,-2.401737,...,-0.069137,-0.407513,-3.170756,0.145843,-4.474441,3.352006,-2.608699,-9.526842,4.611053,-4.045509
a,12.11083,3.430279,-2.195641,-1.139991,-4.649692,13.757376,-6.614538,6.837711,0.783609,-1.602102,...,-0.73513,-8.295216,-4.468103,-2.94383,-0.286269,2.260313,0.935888,1.010712,0.659523,3.915431
that,1.19401,0.653939,-0.58396,0.905998,-2.479112,0.942541,-1.638389,-0.717018,-2.958884,-0.065789,...,0.716063,-0.15442,2.518903,-1.240831,0.333392,-1.537385,0.185634,0.443325,1.32108,1.043293
i,-0.172524,-6.955939,-5.234448,-3.397086,6.372025,-3.071188,-6.728538,-19.662636,-6.217935,5.085392,...,-1.777961,-1.854891,9.763267,-4.068396,-1.643426,5.700501,1.699686,9.25527,-4.325672,-12.462574
in,-2.617754,-2.216851,2.590937,-2.848028,1.710477,-1.084073,2.367129,0.188109,1.588724,-0.462572,...,-1.262814,3.254043,-5.111934,4.861982,3.439079,0.116557,-2.6057,-4.601095,-3.43909,-0.35007
it,2.595818,-1.560869,-0.933022,4.705704,-2.299701,-2.576495,0.33144,0.425584,-6.975593,2.070978,...,-0.171248,-2.745023,3.503058,0.386814,-0.491321,-1.97776,0.478873,4.745159,2.378975,2.373267
you,-0.857277,0.339942,-3.226851,1.416326,-2.41953,-0.461426,-3.588753,-5.324973,-2.136588,0.994558,...,1.23411,-2.532249,0.630184,-5.305309,-0.971201,-1.118237,0.318046,2.365978,2.15045,-2.945542


In [20]:
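# fasttext_model.wv.most_similar('father')
# FastText builds a vector for an unseen word from its character n-grams,
# so neighbours tend to share substrings with the query word
fasttext_model.wv.most_similar('abracadabra')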

[('abrahamic', 0.8421013951301575),
 ('celebratory', 0.811163604259491),
 ('bra', 0.8102812170982361),
 ('braille', 0.8018088340759277),
 ('autobiography', 0.789027750492096),
 ('anthropology', 0.7859097123146057),
 ('oratory', 0.7818185091018677),
 ('aboriginal', 0.7805594205856323),
 ('zebra', 0.778374969959259),
 ('gerontology', 0.778139054775238)]

In [21]:
fasttext_model.wv['abracadabra']

array([ 0.07494709,  0.41374564, -0.1753403 , -0.04557087, -0.34857494,
        0.14115968, -0.16344196,  0.26615506,  0.02902378,  0.1390778 ,
        0.07068592, -0.11412629, -0.16274497,  0.13680775, -0.30548942,
        0.11450381, -0.14930272,  0.03161902, -0.33451527, -0.34933928,
       -0.11536528, -0.22207391,  0.06144693,  0.19084226,  0.00197143,
        0.04672071, -0.2629677 , -0.07434057,  0.02868821,  0.00550894,
       -0.1466568 ,  0.443755  , -0.14749628,  0.11703664,  0.17943701,
       -0.19181035,  0.00633159, -0.20053269, -0.05394301,  0.11791342,
        0.05931603, -0.15537938, -0.20118457,  0.07667778, -0.32272765,
       -0.14838123,  0.20484066,  0.00342387, -0.25390348,  0.5536968 ,
        0.08777578, -0.11583656, -0.40661952,  0.17372754, -0.1961997 ,
        0.1496326 ,  0.13830578, -0.14125176, -0.1097391 ,  0.11179979,
        0.66206723,  0.00257547,  0.15389438, -0.20280387,  0.44525337,
        0.03898219,  0.01975412, -0.18484637, -0.11399737, -0.22

### Installing the fasttext package

In [23]:
!pip install fasttext-wheel



In [24]:
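import fasttext
import fasttext.util

# Train Skip-gram subword vectors on the movie review text file
model = fasttext.train_unsupervised(
    'naver_movie_ratings.txt',
    model='skipgram',
    minCount=1,  # keep every word, including words seen only once
    dim=100,     # embedding dimension
    minn=3,      # shortest character n-gram
    maxn=5       # longest character n-gram
)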

In [25]:
model.get_word_vector('극장')

array([ 0.51310927, -0.3474465 , -0.5519998 ,  0.7981209 , -0.2288926 ,
        0.00205313, -0.19965076,  0.02645567,  0.07761472,  0.5785533 ,
        0.07491961,  0.4428514 ,  0.46619484, -0.3238933 , -0.61847115,
       -0.88594687,  0.08833578, -1.3669742 , -0.43506083, -0.13538094,
        0.54382074,  0.37043953,  0.18037817, -0.09259647,  0.6192734 ,
       -0.93502295, -0.26679516,  0.5349201 , -0.8440418 ,  0.41410708,
       -0.24564022,  0.47172397, -0.04313922,  0.3068818 , -0.648671  ,
        0.23869662, -0.00854011,  0.29743674, -0.40523574, -0.2744046 ,
        0.10509539,  0.3299204 ,  0.3628938 , -0.252718  ,  1.0912603 ,
       -0.22757918,  0.1799167 , -0.4914909 , -0.06720181,  0.2996303 ,
        0.26246417, -0.8771287 , -0.11714436,  0.5193382 , -0.66715133,
        0.09862232,  0.77082103,  0.27044293,  1.1250198 , -0.09882633,
        0.48046717,  0.09255551,  0.24735783,  0.43767506, -0.00854313,
       -0.04849064, -0.07767742,  0.03295455,  0.43177676, -0.00

In [26]:
# Returns (subwords, ids): the word itself, then its character n-grams of length 3 to 5
model.get_subwords('영화관')

(['영화관', '<영화', '<영화관', '<영화관>', '영화관', '영화관>', '화관>'],
 array([   2062, 1921845, 1442415, 1378913, 2245977, 1515139, 1352938]))

In [27]:
model.get_subwords('특선영화')

(['특선영화',
  '<특선',
  '<특선영',
  '<특선영화',
  '특선영',
  '특선영화',
  '특선영화>',
  '선영화',
  '선영화>',
  '영화>'],
 array([  54542,  989150,  929201, 1543251, 2496531,  878545, 1046555,
        2645177, 2342883, 2504929]))
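
The subword lists printed by `get_subwords` can be reproduced by hand, which is a useful check on how `minn=3` and `maxn=5` interact with the `<` and `>` markers. The helper below is a hypothetical sketch, not the library's code: it assumes the word is in the vocabulary (so the word itself comes first) and it does not reproduce the integer ids in the second array.

```python
def subwords(word, minn=3, maxn=5):
    """The word itself, then its character n-grams in left-to-right order."""
    wrapped = f"<{word}>"
    ngrams = [wrapped[i:i + n]
              for i in range(len(wrapped))
              for n in range(minn, maxn + 1)
              if i + n <= len(wrapped)]
    return [word] + ngrams


print(subwords('영화관'))
print(subwords('특선영화'))
```

For each start position the n-grams are emitted from shortest to longest, which matches the order of the two lists in the notebook output.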