# fastText

https://github.com/facebookresearch/fastText

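**fastText** is an open-source library for natural language processing (NLP) that provides fast, efficient tools for text classification and word embedding. It was developed by the Facebook AI Research team and is designed to train quickly even on large text corpora. Its main features are listed below.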


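**Key features**
1. **Word vector learning (Word Embeddings)**  
   - fastText trains a word embedding model that converts each word into a fixed-size vector. Word meaning is mapped into a vector space so that similar words end up with nearby vectors.
   - It is similar to Word2Vec, but fastText processes words at the **subword** level.

2. **Subword-based model**  
   - Words are broken into character n-grams, with `<` and `>` added as word boundary markers (for example, with n = 3, 'apple' → ['<ap', 'app', 'ppl', 'ple', 'le>']). This makes the model robust to **rare words** and **misspellings**.
   - Beyond whole words, this lets the model learn finer-grained information such as spelling patterns.

3. **Text classification**  
   - fastText is optimized for classifying documents and sentences quickly.
   - Training is fast, the resulting models are small, and accuracy is competitive.

4. **Efficient implementation**  
   - fastText is designed to perform well on CPUs alone and runs quickly even on large datasets.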

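**How fastText works**
1. **Word representation**  
   - Each word is split into character n-gram subwords, and a vector is learned for each subword.
   - For example, with n-gram lengths 3 to 6, the word "cat" is wrapped as `<cat>` and decomposed into '<ca', 'cat', 'at>', '<cat', 'cat>', plus the whole word '<cat>'.
   - The word vector is then the sum of its subword vectors.

2. **Model architecture**  
   - fastText is built on either the skip-gram model or the CBOW model.
   - Unlike the original models, each word is represented by its subwords (together with the word itself) rather than by a single word-level vector alone.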
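
The decomposition described above can be sketched in plain Python. This is a minimal illustration rather than the library's implementation: `char_ngrams` and `word_vector` are hypothetical helper names, the 2-dimensional subword vectors are made up, and the 3-to-6 n-gram range is assumed to match the library default.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return the character n-grams of `word` wrapped in '<' and '>'."""
    wrapped = f"<{word}>"
    ngrams = []
    for start in range(len(wrapped)):
        for n in range(minn, maxn + 1):
            if start + n > len(wrapped):
                break
            ngrams.append(wrapped[start:start + n])
    return ngrams


def word_vector(word, subword_vectors, dim):
    """Sum the vectors of every known subword of `word` (toy lookup table)."""
    total = [0.0] * dim
    for gram in char_ngrams(word):
        # Subwords missing from the toy table contribute a zero vector
        for i, value in enumerate(subword_vectors.get(gram, [0.0] * dim)):
            total[i] += value
    return total


# Hypothetical 2-dimensional subword vectors, for illustration only
toy_vectors = {"<ca": [1.0, 0.0], "cat": [0.5, 0.5], "at>": [0.0, 2.0]}

print(char_ngrams("cat"))
print(word_vector("cat", toy_vectors, dim=2))
```

Because the vector is assembled from subwords, a word that never appeared in training still gets a vector as long as some of its n-grams were seen.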

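**Advantages of fastText**
1. **Handling rare words**  
   - Thanks to the subword-based approach, it generalizes better to rare or previously unseen words.
2. **Fast training**  
   - A simple model structure and an optimized implementation make training very fast.
3. **Support for many languages**  
   - It works across many languages and is particularly effective for morphologically rich (heavily inflected) languages.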

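**Use cases**
1. **Word embeddings**  
   - Computing similarity between words, learning sentence representations.
2. **Text classification**  
   - Spam filtering, sentiment analysis, news categorization.
3. **Multilingual support**  
   - Fast training and inference on multilingual datasets.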
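
For the text classification use case, fastText reads one example per line, with each label carrying the `__label__` prefix by default. The sketch below only prepares such lines. `to_fasttext_line` and the sample messages are hypothetical; the resulting lines would be written to a file and passed to `fasttext.train_supervised(input=...)`.

```python
def to_fasttext_line(label, text, prefix="__label__"):
    """Format one example as '<prefix><label> <text>' on a single line."""
    # Collapse newlines and repeated spaces so each example stays on one line
    single_line = " ".join(text.split())
    return f"{prefix}{label} {single_line}"


# Hypothetical examples, for illustration only
samples = [("spam", "Win a free prize\nnow"), ("ham", "See you at  lunch")]
lines = [to_fasttext_line(label, text) for label, text in samples]
print("\n".join(lines))
```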

### gensim FastText

In [22]:
from gensim.models import FastText
from lxml import etree
import re
from nltk.tokenize import word_tokenize, sent_tokenize
import pandas as pd

In [44]:
# Parse the XML file and join the text of every <content> element
with open('data/ted_en.xml', 'r', encoding='UTF-8') as f:
    xml = etree.parse(f)

corpus = '\n'.join(xml.xpath('//content/text()'))
# Drop parenthesized annotations
corpus = re.sub(r'\([^)]*\)', ' ', corpus)

sentences = sent_tokenize(corpus)

preprocessed_sentences = []

for sentence in sentences:
    sentence = sentence.lower()
    # Replace everything except letters, digits, and whitespace with a space
    sentence = re.sub(r'[^0-9a-zA-Z\s]', ' ', sentence)
    tokens = word_tokenize(sentence)
    preprocessed_sentences.append(tokens)

In [45]:
from gensim.models import Word2Vec

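w2v_model = Word2Vec(
    sentences=preprocessed_sentences,
    vector_size=100,  # embedding dimension
    window=5,         # context window size
    min_count=5,      # ignore words that occur fewer than 5 times
    sg=0              # 0 = CBOW, 1 = skip-gram
)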

In [46]:
w2v_model.wv.vectors.shape

(21613, 100)

In [28]:
w2v_df = pd.DataFrame(w2v_model.wv.vectors, index=w2v_model.wv.index_to_key)
w2v_df

word,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
the,-0.776963,-0.797464,0.444491,0.191414,-0.491910,-1.424802,0.066485,0.459750,1.036876,0.019964,...,-0.583155,0.295929,1.405704,-0.471201,1.443888,-0.300509,-0.296716,0.001117,-0.070745,0.501643
and,-0.303893,0.475106,-0.338535,0.548562,-0.054172,-0.505835,-0.797833,-0.537236,-0.068170,0.331256,...,-1.419948,0.078482,-0.037150,-0.506518,0.845964,-0.218977,1.082322,0.189117,2.074550,-0.188505
to,1.272241,0.929739,-0.264578,-1.512347,0.557372,0.648326,-2.652921,-0.389107,-0.395356,1.800736,...,-3.554919,1.772559,0.914105,-0.018856,1.716710,-0.738535,0.521278,0.532989,2.901483,-0.683534
of,-2.185021,1.035434,0.798074,-0.163894,-0.280988,-1.066290,0.635226,0.009180,-0.944173,-0.198484,...,-1.007902,0.770110,1.859850,0.166012,2.134465,-1.790208,0.815877,0.059806,-0.665944,1.060775
a,-0.438809,-2.172774,0.327632,-0.935937,0.066895,1.701729,0.561861,0.954440,0.971521,1.513618,...,-0.201605,0.562909,2.075599,0.578084,1.559882,-1.872466,0.755718,1.149818,1.390264,3.256155
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
bullies,-0.006125,-0.007149,0.022816,0.026336,0.013268,-0.105609,0.029853,0.145805,-0.046724,-0.021191,...,0.006716,0.028263,0.021686,-0.022965,0.064978,0.021635,0.088181,-0.030756,0.026607,0.010633
splendor,0.010778,0.055281,0.008977,-0.008202,0.002482,-0.108669,0.021202,0.162862,-0.076732,-0.093240,...,0.057508,-0.007362,0.018966,-0.039649,0.140846,0.031968,-0.023660,-0.032566,0.032430,-0.052803
enslaving,-0.093687,0.047929,0.022094,0.007275,-0.037095,-0.111963,0.042325,0.111749,-0.053040,-0.052479,...,0.084536,-0.049987,-0.031262,-0.048201,0.056903,0.087529,0.124241,-0.084242,0.078272,-0.088292
inspirations,-0.017800,0.009915,-0.000538,-0.037676,0.008513,-0.089794,-0.029525,0.183153,-0.073258,-0.014399,...,0.076714,-0.072973,0.078307,0.053352,0.088617,0.028858,0.069231,-0.045606,0.000903,-0.034441


In [29]:
w2v_model.wv.most_similar('father')

[('son', 0.9390171766281128),
 ('husband', 0.9155017733573914),
 ('mother', 0.910145103931427),
 ('daughter', 0.8917059898376465),
 ('dad', 0.8809183239936829),
 ('sister', 0.8742847442626953),
 ('wife', 0.8714937567710876),
 ('brother', 0.8679164052009583),
 ('grandmother', 0.8619385361671448),
 ('mom', 0.8603384494781494)]
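
The scores in the output above are cosine similarities between word vectors. A minimal sketch of that measure follows, using made-up 3-dimensional vectors rather than the trained model; the function name `cosine_similarity` and the toy vectors are for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical vectors: the first two point in the same direction,
# the third is orthogonal to both
father = [1.0, 2.0, 0.0]
son = [2.0, 4.0, 0.0]
table = [0.0, 0.0, 3.0]

print(cosine_similarity(father, son))
print(cosine_similarity(father, table))
```

Vectors pointing in the same direction score close to 1 regardless of their length, while orthogonal vectors score 0.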

In [33]:
# Train FastText with the same hyperparameters as the Word2Vec model above (sg=0 selects CBOW)
from gensim.models import FastText

fasttext_model = FastText(
    sentences=preprocessed_sentences,
    vector_size=100,
    window=5,
    min_count=5,
    sg=0
)

fasttext_model.wv.vectors.shape

(21613, 100)

In [34]:
fasttext_df = pd.DataFrame(fasttext_model.wv.vectors, index=fasttext_model.wv.index_to_key)
fasttext_df.head(10)

word,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
the,-0.066028,-0.425921,-0.485485,0.304084,-1.815992,1.00711,-1.400272,1.424996,-2.258129,1.889184,...,-0.533799,-2.68227,2.220175,0.819622,-1.353169,-1.260544,-1.477636,-1.36042,-0.579382,3.459252
and,-1.076935,2.458134,-0.556498,-1.833398,-2.116755,0.591015,-1.247553,0.961236,-0.4216,-0.828085,...,0.860981,-0.038644,1.91147,0.811239,-0.506321,0.036457,-0.043981,1.900904,0.726009,-1.495924
to,-1.985636,6.917452,-4.156023,-0.891901,-2.872051,-1.590681,-4.289844,2.413342,-4.724717,3.735363,...,1.436304,-1.438557,6.41631,-0.877485,3.892961,-3.695372,0.823196,-1.1737,0.481083,1.362189
of,4.318037,-2.535341,3.265241,-0.927753,-0.872222,4.580719,-1.864809,3.280707,3.956719,-1.505844,...,2.752055,-3.141435,1.033209,-5.338972,2.482344,-4.912637,-1.993977,-8.848774,-1.001378,0.606083
a,5.676257,0.750182,1.271193,0.748281,1.569148,9.575325,-5.356816,8.108353,4.894669,1.834455,...,-0.536344,-6.734227,-3.618746,-1.04375,3.692653,4.947735,-1.98536,0.567206,-2.689448,-0.794772
that,1.725451,0.800955,2.16264,0.660587,-1.69949,0.416532,-0.624124,-0.756223,-0.619966,-0.028935,...,1.739324,0.301694,1.459222,-0.300404,-1.316466,-3.52396,0.8777,1.840534,0.73848,-0.522883
i,5.292976,-6.19838,-3.692228,3.971497,3.626101,6.033006,-2.133781,-12.945464,-3.530894,6.342506,...,4.001472,-1.753188,8.397825,-10.390478,-1.699131,-1.142891,3.306067,14.443714,-3.150985,-9.103758
in,-4.18897,-0.560295,4.646445,0.171138,3.557795,0.365687,-0.132407,2.625147,-2.119274,0.002492,...,-0.474225,-1.56501,-1.387192,0.673072,3.559374,-3.159387,-2.418353,-2.990674,-4.053271,0.215456
it,2.542565,0.693409,2.701969,1.226769,0.419028,-1.145173,0.9955,3.624137,-6.746661,2.429697,...,-0.684783,-2.151254,1.403046,-1.273402,-5.427185,-0.9727,2.141302,3.667426,3.546724,-0.463801
you,0.157724,2.577191,-2.993769,3.517256,-3.992445,-1.797449,-1.274788,-4.749404,-3.674747,4.231498,...,1.803459,-3.595713,0.140987,-5.587932,-0.964713,0.420893,-0.194007,3.683552,0.41555,-4.068974


In [35]:
fasttext_model.wv.most_similar('father')

[('godfather', 0.9675668478012085),
 ('grandfather', 0.9613574743270874),
 ('mother', 0.9392042756080627),
 ('grandmother', 0.9352418184280396),
 ('stepfather', 0.9235369563102722),
 ('brother', 0.9186810851097107),
 ('granddaughter', 0.8919588923454285),
 ('feather', 0.8903626203536987),
 ('daughter', 0.8879181742668152),
 ('motherhood', 0.8842654228210449)]

In [38]:
fasttext_model.wv.most_similar('abracadabra')

[('abrahamic', 0.8355204463005066),
 ('bra', 0.8248817920684814),
 ('gerontology', 0.8245952725410461),
 ('anthropology', 0.8084943294525146),
 ('braille', 0.807310938835144),
 ('zebra', 0.8030344247817993),
 ('alhambra', 0.8016319870948792),
 ('brace', 0.7956182360649109),
 ('brady', 0.7921019196510315),
 ('pharmacology', 0.7920774817466736)]
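
A word such as 'abracadabra' is presumably rare or absent in this corpus, so its vector would come largely from its character n-grams. That would explain why many of the neighbors above share spelling fragments with it rather than meaning. The sketch below lists the n-grams that the query shares with a few of those neighbors; `char_ngram_set` is a hypothetical helper and the 3-to-6 range is assumed to match the gensim default.

```python
def char_ngram_set(word, minn=3, maxn=6):
    """Set of character n-grams of `word` wrapped in '<' and '>'."""
    wrapped = f"<{word}>"
    return {
        wrapped[i:i + n]
        for n in range(minn, maxn + 1)
        for i in range(len(wrapped) - n + 1)
    }


query = char_ngram_set("abracadabra")
shared_by_neighbor = {}
for neighbor in ["bra", "zebra", "alhambra"]:
    # N-grams that appear in both the query and the neighbor
    shared_by_neighbor[neighbor] = sorted(query & char_ngram_set(neighbor))
    print(neighbor, shared_by_neighbor[neighbor])
```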

### Installing the fasttext package

In [39]:
!pip install fasttext-wheel

Collecting fasttext-wheel
  Downloading fasttext-wheel-0.9.2.tar.gz (71 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: fasttext-wheel
  Building wheel for fasttext-wheel (pyproject.toml) ... done
  Created wheel for fasttext-wheel: filename=fasttext_wheel-0.9.2-cp312-cp312-macosx_15_0_arm64.whl size=280452 sha256=bca78e3a5f135a31fc210afa7ee2100a53d4e20d08133b66d7b3cfcd50b74476
  Stored in directory: /Users/hwangjunho/Library/Caches/pip/wheels/69/26/b0/acc2f5f9df418ddd0ccccd4531f8a4d9740de0840f51e5aa74
Successfully built fasttext-wheel
Installing collected packages: fasttext-wheel
Successfully installed fasttext-wheel-0.9.2


In [40]:
import fasttext
import fasttext.util

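model = fasttext.train_unsupervised(
    'naver_movie_ratings.txt',
    model='skipgram',
    minCount=1,  # keep every word, even those seen only once
    dim=100,     # embedding dimension
    minn=3,      # shortest character n-gram
    maxn=5       # longest character n-gram
)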

Read 2M words
Number of words:  650541
Number of labels: 0
Progress: 100.0% words/sec/thread:   69372 lr:  0.000000 avg.loss:  2.382764 ETA:   0h 0m 0s


In [41]:
model.get_word_vector('극장')

array([ 2.96947658e-01,  1.11858696e-01, -1.16792962e-01,  8.90490413e-02,
       -3.76736104e-01, -2.69724667e-01, -3.39125484e-01,  4.49560940e-01,
        1.17842531e+00,  7.19143867e-01,  3.57253969e-01,  4.89631414e-01,
        2.34527588e-01, -2.53516138e-01, -5.67807913e-01, -6.07770681e-01,
       -8.83643031e-02, -9.55508709e-01, -1.27493000e+00, -3.07288319e-02,
        7.93063045e-01, -5.67875266e-01,  4.57900375e-01, -6.79764003e-02,
        1.84115469e-02, -2.26187870e-01,  4.15671885e-01,  3.43420625e-01,
       -1.47440600e+00,  2.19775230e-01, -7.44487196e-02,  5.32227159e-01,
       -2.04835802e-01,  3.78566980e-03, -4.94383097e-01,  1.21569954e-01,
        4.38608199e-01,  3.12539935e-01, -6.59904301e-01,  1.22686431e-01,
       -3.08894843e-01,  2.11099476e-01, -5.13445400e-02, -5.55364668e-01,
        1.00283933e+00, -1.16075122e+00,  3.47618669e-01, -5.20512521e-01,
       -1.30154282e-01,  2.08460271e-01,  1.28696665e-01,  1.19429082e-04,
        6.39921188e-01,  

In [43]:
model.get_subwords('특선영화')

(['특선영화',
  '<특선',
  '<특선영',
  '<특선영화',
  '특선영',
  '특선영화',
  '특선영화>',
  '선영화',
  '선영화>',
  '영화>'],
 array([  91992,  989150,  929201, 1543251, 2496531,  878545, 1046555,
        2645177, 2342883, 2504929]))
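
The integer array in the output above appears to hold row indices into the model's input matrix: the first entry is the word's own index, and the remaining entries fall in the range just past the vocabulary size (650541 here), which is consistent with n-grams being hashed into a fixed number of buckets. The sketch below shows the general idea with an FNV-1a style hash. The details (bytes treated as signed, 2,000,000 buckets, offset by vocabulary size) are assumptions based on the fastText source, the function names are hypothetical, and the result has not been checked against the ids printed above.

```python
def fnv1a_32(text):
    """32-bit FNV-1a over UTF-8 bytes, with bytes treated as signed (assumed)."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        # Sign-extend bytes >= 128 before XOR, then keep 32 bits
        signed = byte - 256 if byte >= 128 else byte
        h = (h ^ (signed & 0xFFFFFFFF)) & 0xFFFFFFFF
        h = (h * 16777619) & 0xFFFFFFFF
    return h


def subword_row(ngram, vocab_size, bucket=2_000_000):
    """Assumed layout: word rows first, then the hashed n-gram buckets."""
    return vocab_size + fnv1a_32(ngram) % bucket


row = subword_row("<특선", vocab_size=650541)
print(row)
```

Because distinct n-grams can land in the same bucket, some subwords may share a vector; the bucket count trades memory for fewer collisions.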