# fastText

https://github.com/facebookresearch/fastText

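**fastText** is an open-source library for natural language processing (NLP) that provides fast, efficient tools for text classification and word embedding. It was developed by the Facebook AI Research team and trains quickly even on large text corpora. Its main features are as follows: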


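**Key features**
1. **Word vector learning (word embeddings)**  
   - fastText learns a word embedding model that maps each word to a fixed-size vector, so that words with similar meanings end up close together in the vector space.
   - It is similar to Word2Vec, but fastText also models each word through its **subwords**.

2. **Subword-based model**  
   - Each word is broken into character n-grams with the boundary markers `<` and `>` (for example, with n = 3, 'apple' → ['<ap', 'app', 'ppl', 'ple', 'le>']), which makes the model robust to **rare words** and **misspellings**.
   - Beyond whole words, this lets the model learn finer-grained information such as spelling patterns.

3. **Text classification**  
   - fastText includes a supervised classifier for documents and sentences.
   - Training is fast and the models are small, while accuracy is often comparable to that of much larger neural models.

4. **Efficient implementation**  
   - fastText is designed to run well on CPUs alone and remains fast on large datasets.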

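**How fastText works**
1. **Word representation**  
   - Each word is split into character n-gram subwords, and a vector is learned for each subword.
   - For example, with n-gram lengths 3 to 6, "cat" is wrapped as "<cat>" and decomposed into '<ca', 'cat', 'at>', '<cat', 'cat>', and '<cat>', in addition to an entry for the whole word.
   - The word vector is then the sum of these subword vectors, including the vector for the word itself.

2. **Model architecture**  
   - fastText is built on either the skip-gram or the CBOW model.
   - Unlike the original models, however, it trains on subwords in addition to whole words rather than on whole words alone.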
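
The decomposition described above can be sketched in a few lines of Python. This is a minimal illustration, not fastText's actual implementation: `char_ngrams` and `word_vector` are hypothetical helper names, the toy subword table is made up, and real fastText hashes n-grams into a fixed number of buckets instead of keeping one vector per distinct n-gram.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return the character n-grams of `word` wrapped in '<' and '>'."""
    wrapped = f"<{word}>"
    ngrams = []
    for start in range(len(wrapped)):
        for n in range(minn, maxn + 1):
            if start + n > len(wrapped):
                break
            ngrams.append(wrapped[start:start + n])
    return ngrams


def word_vector(word, subword_vectors, dim):
    """Sum the vectors of all known subwords of `word` (toy lookup table)."""
    total = [0.0] * dim
    for gram in char_ngrams(word):
        vec = subword_vectors.get(gram)
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
    return total


print(char_ngrams("apple", minn=3, maxn=3))
# ['<ap', 'app', 'ppl', 'ple', 'le>']
print(char_ngrams("cat"))
# ['<ca', '<cat', '<cat>', 'cat', 'cat>', 'at>']

# Made-up 2-dimensional subword vectors, purely for illustration
toy_table = {"<ca": [1.0, 0.0], "cat": [0.0, 2.0], "at>": [0.5, 0.5]}
print(word_vector("cat", toy_table, dim=2))
# [1.5, 2.5]
```

Because the vector is assembled from subwords, a word that never appeared in training still gets a vector as long as some of its n-grams were seen.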

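**Advantages of fastText**
1. **Handling of rare words**  
   - Thanks to the subword-based approach, it generalizes better to rare words and to words never seen during training.
2. **Fast training**  
   - A simple model architecture and an optimized implementation make training very fast.
3. **Support for many languages**  
   - It works across many languages and is particularly effective for morphologically rich (highly inflected) languages.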

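**Use cases**
1. **Word embeddings**  
   - Computing word similarity; building sentence representations.
2. **Text classification**  
   - Spam filtering, sentiment analysis, news categorization.
3. **Multilingual support**  
   - Fast training and inference on multilingual datasets.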
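
For the text classification use cases, the fastText supervised trainer reads a plain-text file with one example per line, where each label carries the `__label__` prefix (the default). The sketch below prepares such lines. `to_fasttext_line` and the sample data are hypothetical, and the commented-out training call assumes the `fasttext` Python package and a file named `train.txt`.

```python
def to_fasttext_line(labels, text):
    """Format one training example as '__label__a __label__b some text'."""
    prefix = " ".join(f"__label__{label}" for label in labels)
    # Newlines would split one example across lines, so collapse whitespace
    cleaned = " ".join(text.split())
    return f"{prefix} {cleaned}"


samples = [  # made-up examples for illustration
    (["spam"], "Win a free prize now"),
    (["ham"], "Are we still meeting\nat noon?"),
]
lines = [to_fasttext_line(labels, text) for labels, text in samples]
print("\n".join(lines))

# With the fasttext package installed, training might then look like:
# import fasttext
# with open("train.txt", "w", encoding="utf-8") as f:
#     f.write("\n".join(lines) + "\n")
# classifier = fasttext.train_supervised(input="train.txt")
# labels, probabilities = classifier.predict("free prize")
```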

## gensim FastText

In [1]:
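# !pip install nltk
# sent_tokenize/word_tokenize also need NLTK's Punkt data, e.g. nltk.download('punkt')
# (recent NLTK releases name this resource 'punkt_tab')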

In [2]:
from gensim.models import FastText
from lxml import etree
import re
from nltk.tokenize import word_tokenize, sent_tokenize
import pandas as pd

In [4]:
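# Parse the TED talk XML; the context manager closes the file afterwards
with open('ted_en.xml', 'r', encoding='utf-8') as f:
    xml = etree.parse(f)

# Join the text of every <content> element, then drop parenthesized
# annotations such as (Laughter) or (Applause)
corpus = '\n'.join(xml.xpath('//content/text()'))
corpus = re.sub(r'\([^)]*\)', '', corpus)

sentences = sent_tokenize(corpus)   # sentence-level tokenization

preprocessed_sentences = []  # tokenized sentences after preprocessing

for sentence in sentences:
    sentence = sentence.lower()
    # replace everything except digits and letters with a space
    sentence = re.sub(r'[^0-9a-zA-Z]', ' ', sentence)
    tokens = word_tokenize(sentence)
    preprocessed_sentences.append(tokens)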
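
To see what the cleaning steps do to a single sentence, the sketch below applies the same regular expressions to a made-up string. It substitutes `str.split` for NLTK's `word_tokenize` so that it runs without NLTK data. That substitution is an assumption of this example and only approximates what `word_tokenize` returns.

```python
import re

def preprocess(sentence):
    """Lowercase, replace non-alphanumeric characters with spaces, split."""
    sentence = sentence.lower()
    sentence = re.sub(r'[^0-9a-zA-Z]', ' ', sentence)
    return sentence.split()  # stand-in for nltk.tokenize.word_tokenize

raw = "(Laughter) It's a small world, isn't it?"
without_notes = re.sub(r'\([^)]*\)', '', raw)  # drop parenthesized notes
print(preprocess(without_notes))
# ['it', 's', 'a', 'small', 'world', 'isn', 't', 'it']
```

Apostrophes are replaced along with other punctuation, so contractions such as "isn't" end up as separate tokens.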

In [5]:
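from gensim.models import Word2Vec

# Baseline for comparison: Word2Vec trained on the same sentences
w2v_model = Word2Vec(
    sentences=preprocessed_sentences,
    vector_size=100,  # embedding dimension
    window=5,         # context words considered on each side
    min_count=5,      # ignore words seen fewer than 5 times
    sg=0              # 0 = CBOW, 1 = skip-gram
)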
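
The `window` and `sg` parameters control how training examples are formed. The sketch below enumerates them for a tiny sentence. `training_examples` is a hypothetical helper, and it ignores details of the actual gensim implementation such as random window shrinking, subsampling, and negative sampling.

```python
def training_examples(tokens, window, sg):
    """Yield (input, target) pairs for skip-gram (sg=1) or CBOW (sg=0)."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        context = tokens[lo:i] + tokens[i + 1:i + 1 + window]
        if sg:
            # skip-gram: the center word predicts each context word
            for ctx in context:
                yield center, ctx
        else:
            # CBOW: the whole context predicts the center word
            yield context, center

sentence = ["the", "cat", "sat", "down"]
print(list(training_examples(sentence, window=1, sg=0)))
# [(['cat'], 'the'), (['the', 'sat'], 'cat'), (['cat', 'down'], 'sat'), (['sat'], 'down')]
print(list(training_examples(sentence, window=1, sg=1)))
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat'), ('sat', 'down'), ('down', 'sat')]
```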

In [6]:
w2v_model.wv.vectors.shape

(21613, 100)

In [7]:
w2v_df = pd.DataFrame(w2v_model.wv.vectors, index=w2v_model.wv.index_to_key)
w2v_df

word,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
the,-0.796338,-0.667115,0.010667,-0.048711,-0.817956,-1.039908,0.162477,-0.366594,1.522968,0.774081,...,-0.455889,0.425948,0.597530,-0.782464,1.298080,0.456934,0.419550,-0.650006,0.839398,1.055030
and,-1.411786,0.222986,0.226645,-0.076862,0.842408,0.103371,-2.127734,0.266194,-1.133336,-0.508279,...,-0.707466,0.252866,0.624714,-0.218014,0.394531,-0.774429,0.237228,0.687620,1.453149,-0.078364
to,0.712515,0.425849,-1.259865,-1.496762,1.768470,0.054263,-3.840959,-0.713046,-1.152648,1.104363,...,-0.982708,3.213798,0.826391,-0.489452,-0.114355,-0.531721,-0.800484,-0.067859,3.051630,0.028353
of,-2.544311,1.131186,-0.491778,-1.349420,0.480213,-0.881054,-1.740203,0.435761,0.336198,-0.675590,...,-1.889598,1.965618,0.735686,-0.005052,1.926310,-0.216531,0.653568,0.524149,-0.454954,0.800512
a,-1.060251,-1.065062,-0.079677,-0.987760,0.257333,2.693637,-0.461988,-1.513585,0.469920,1.138510,...,-0.410625,1.607007,1.383409,0.879824,2.531086,-1.524093,0.264770,0.435698,0.598641,3.404154
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
bullies,0.022034,-0.042698,0.026442,-0.013671,0.023486,-0.093976,-0.016941,0.103524,-0.052466,-0.052304,...,0.106841,0.008852,0.002353,0.042900,0.055617,0.031132,0.081750,-0.008910,0.019832,-0.040410
splendor,0.041067,0.002632,-0.050060,-0.010757,0.031189,-0.063312,-0.008563,0.156203,-0.034825,-0.080066,...,0.061950,0.009127,0.067130,-0.016564,0.051954,0.001374,0.000089,-0.051148,-0.013619,-0.022083
enslaving,-0.090814,0.014560,0.037404,0.084089,0.026793,-0.125393,0.037622,0.104023,-0.068871,-0.035068,...,0.024046,-0.053033,-0.020010,-0.018595,0.073161,0.029469,0.063298,-0.092950,0.028843,-0.046309
inspirations,-0.013141,-0.002378,0.010291,-0.072822,0.000122,-0.124427,-0.037502,0.127031,-0.049245,0.010140,...,0.032934,0.014206,-0.008891,0.021049,0.045627,0.067176,0.026027,0.009158,-0.012763,-0.023674


In [8]:
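w2v_model.wv.most_similar('father')
# Word2Vec has no vector for out-of-vocabulary words, so the call below
# raises a KeyError if 'abracadabra' is not in the vocabulary:
# w2v_model.wv.most_similar('abracadabra')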

[('son', 0.9123485684394836),
 ('mother', 0.9038951396942139),
 ('husband', 0.8988270163536072),
 ('daughter', 0.8947147727012634),
 ('dad', 0.8846123814582825),
 ('wife', 0.8780397772789001),
 ('sister', 0.8769536018371582),
 ('uncle', 0.8730123043060303),
 ('brother', 0.8678197264671326),
 ('grandfather', 0.8651431202888489)]
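
The scores returned by `most_similar` are cosine similarities between word vectors. Here is a minimal pure-Python sketch of that measure, using made-up 3-dimensional vectors rather than the trained ones (`cosine_similarity` and `toy_vectors` are illustrative names):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

toy_vectors = {  # made-up values for illustration only
    "father": [1.0, 2.0, 0.0],
    "mother": [1.0, 1.5, 0.5],
    "banana": [-2.0, 0.0, 1.0],
}

# Rank every other word by its similarity to the query, highest first
query = toy_vectors["father"]
ranked = sorted(
    ((word, cosine_similarity(query, vec))
     for word, vec in toy_vectors.items() if word != "father"),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```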

In [9]:
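# FastText with the same hyperparameters as the Word2Vec model
from gensim.models import FastText

fasttext_model = FastText(
    sentences=preprocessed_sentences,
    vector_size=100,
    window=5,
    min_count=5,
    sg=0
)

# wv.vectors holds one row per in-vocabulary word, so the shape matches Word2Vec
fasttext_model.wv.vectors.shape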

(21613, 100)

In [10]:
fasttext_df = pd.DataFrame(fasttext_model.wv.vectors, index=fasttext_model.wv.index_to_key)
fasttext_df

word,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
the,0.795931,2.307072,-1.775411,-0.043200,-2.205198,1.205355,-2.110960,1.518950,-2.448919,1.047636,...,-1.678700,-0.022859,3.066850,-0.641973,1.559653,-2.096075,1.508018,0.168231,0.216444,3.647699
and,-2.199786,2.089274,0.475970,0.419396,-1.001879,1.459685,-2.760160,1.462184,-1.654204,0.208814,...,-1.239211,1.013188,1.304046,0.366325,-0.552023,-2.401501,1.105280,-2.419214,-0.201186,-0.286171
to,0.053426,5.710554,-0.953427,0.730889,-3.902440,-4.632107,-0.910566,4.479760,-1.341629,1.726183,...,-3.620025,-1.419651,2.640763,1.585184,4.442782,-0.700811,0.991374,1.234336,3.999476,-3.672229
of,-3.645368,1.905113,-3.481102,-3.170648,-0.017848,7.077937,-3.080188,2.332995,4.108702,-4.413774,...,1.722271,-0.389899,-1.851145,-2.605480,-2.552389,6.104148,-0.862815,-11.387140,1.995455,0.863953
a,6.249631,5.720897,0.430787,-0.273637,-1.755657,6.328749,-5.951559,-1.119490,2.024936,-0.159320,...,4.343470,-7.346551,-5.545101,-0.701436,1.993731,1.753241,1.733654,6.832728,0.116900,2.606367
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
bullies,0.067243,0.245842,1.651779,0.290767,-0.772758,0.001221,-0.855174,0.283391,0.753297,0.921886,...,0.811756,1.257144,0.911596,-0.518997,-0.024651,-0.688455,0.148350,-0.253123,0.320930,-0.083799
splendor,0.713796,-0.114772,0.764624,0.106705,-0.463035,0.278161,-0.009974,-0.186820,-0.050478,0.261802,...,0.646715,0.260689,0.377128,0.131493,-0.247338,0.217470,0.089254,-0.237877,0.037371,0.195668
enslaving,0.081055,1.085910,0.517078,0.108046,-0.378317,0.844733,-0.369059,0.484494,-0.163265,1.344356,...,0.707870,0.861005,0.205956,-0.017586,0.867544,-0.326354,0.138738,-0.728256,0.324296,0.031473
inspirations,0.161611,1.121585,1.467443,0.249953,-1.212321,0.297023,-0.920657,0.202091,0.807664,0.675500,...,0.851693,1.052634,1.688162,-0.776508,-0.468740,-1.180213,-0.578344,0.129891,0.264549,-0.855314


In [12]:
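# fasttext_model.wv.most_similar('father')
# Unlike Word2Vec, FastText can build a vector for a word that is missing
# from the vocabulary by combining its character n-grams
fasttext_model.wv.most_similar('abracadabra')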

[('abrahamic', 0.8225451707839966),
 ('celebratory', 0.8123055100440979),
 ('bra', 0.809792697429657),
 ('pharmacology', 0.8002275824546814),
 ('braille', 0.7993728518486023),
 ('anthropology', 0.7969568371772766),
 ('brace', 0.7954248785972595),
 ('gerontology', 0.7899302244186401),
 ('trajectory', 0.7882890105247498),
 ('braun', 0.7856870889663696)]
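
Several of these neighbors share character sequences with "abracadabra" rather than meaning, which is what one would expect when a vector is assembled purely from subwords. The sketch below lists the n-grams two words have in common, assuming n-gram lengths 3 to 6 (gensim's default `min_n` and `max_n`, which the training cell does not override). `ngram_set` is a hypothetical helper, not a gensim function.

```python
def ngram_set(word, min_n=3, max_n=6):
    """Set of character n-grams of `word` wrapped in '<' and '>'."""
    wrapped = f"<{word}>"
    return {
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    }

query = ngram_set("abracadabra")
for neighbor in ["abrahamic", "braille", "bra"]:
    shared = sorted(query & ngram_set(neighbor))
    print(neighbor, shared)
# abrahamic ['<ab', '<abr', '<abra', 'abr', 'abra', 'bra']
# braille ['bra']
# bra ['bra', 'bra>', 'ra>']
```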

In [13]:
fasttext_model.wv['abracadabra']

array([ 0.3658495 ,  0.01416903,  0.2511049 ,  0.05636037, -0.32492286,
        0.05466355, -0.12639265,  0.05328081, -0.09600283,  0.3316598 ,
       -0.28714967,  0.11715566,  0.0616492 ,  0.01174701, -0.2297349 ,
        0.3077034 ,  0.00321594,  0.0047213 , -0.26885608, -0.23947601,
       -0.26997963,  0.07585284,  0.05859122,  0.20977187, -0.03192195,
        0.13659856, -0.22644264, -0.18093368,  0.10970066,  0.10003294,
        0.1080725 ,  0.4185541 , -0.09203792, -0.11750513,  0.15716758,
        0.07775126, -0.0007779 , -0.2067473 ,  0.1193454 ,  0.04365681,
        0.09222052, -0.27170417,  0.07427084,  0.16793695, -0.2673172 ,
       -0.45981944,  0.16570455, -0.14425793, -0.21834369,  0.34363705,
       -0.02929795,  0.03751097, -0.34714827,  0.1657564 , -0.20669067,
        0.15562825,  0.06888905, -0.03404712, -0.14733714,  0.17080005,
        0.44144717, -0.05274072,  0.23618264, -0.09813219,  0.4863861 ,
        0.3915351 , -0.01902009, -0.22519109, -0.03781037, -0.24

### Installing the fasttext package

In [15]:
!pip install fasttext-wheel



In [16]:
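import fasttext
import fasttext.util

# Train skip-gram subword embeddings on a plain-text corpus file
model = fasttext.train_unsupervised(
    'naver_movie_ratings.txt',
    model='skipgram',
    minCount=1,   # keep every word, even those seen only once
    dim=100,      # embedding dimension
    minn=3,       # shortest character n-gram
    maxn=5        # longest character n-gram
)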

In [23]:
model.get_word_vector('극장')

array([ 0.5133301 , -0.3918601 , -0.53698534,  0.8300549 , -0.28512287,
        0.02670072, -0.18230653,  0.06180463,  0.1193095 ,  0.6347901 ,
        0.02883446,  0.43405813,  0.53329366, -0.37888303, -0.72836095,
       -1.0086077 ,  0.06491461, -1.4531572 , -0.41711667, -0.24503992,
        0.52984005,  0.3612824 ,  0.27693176, -0.06325518,  0.621961  ,
       -0.9090264 , -0.23314627,  0.5396179 , -0.7831601 ,  0.39649904,
       -0.22723114,  0.46795216,  0.00243461,  0.305085  , -0.62191963,
        0.19338626,  0.03398333,  0.32460508, -0.38305777, -0.24791658,
        0.10217802,  0.32429725,  0.3333733 , -0.26681793,  1.0826446 ,
       -0.15730074,  0.19420673, -0.469149  , -0.06383722,  0.2761275 ,
        0.26274553, -0.84116113, -0.10727764,  0.48208693, -0.68479085,
        0.12208257,  0.7245796 ,  0.3125805 ,  1.1093127 , -0.10600656,
        0.44692105,  0.04342682,  0.22491696,  0.45959508, -0.02276977,
       -0.10401051, -0.09711914,  0.01634998,  0.41386276,  0.01

In [18]:
# Returns the subword strings and their row indices in the input matrix
model.get_subwords('영화관')

(['영화관', '<영화', '<영화관', '<영화관>', '영화관', '영화관>', '화관>'],
 array([   2062, 1921845, 1442415, 1378913, 2245977, 1515139, 1352938]))

In [24]:
model.get_subwords('특선영화')

(['특선영화',
  '<특선',
  '<특선영',
  '<특선영화',
  '특선영',
  '특선영화',
  '특선영화>',
  '선영화',
  '선영화>',
  '영화>'],
 array([  54542,  989150,  929201, 1543251, 2496531,  878545, 1046555,
        2645177, 2342883, 2504929]))
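
The integers in the second array are row indices into the model's input matrix. When the word is in the vocabulary, the first index is the word's own row; the others come from hashing each n-gram into a fixed number of buckets. The sketch below mirrors what appears to be the FNV-1a style hash in the fastText reference implementation. Treat the constants, the default of 2,000,000 buckets, and the `nwords` offset as assumptions to check against the fastText source. `nwords=0` is a placeholder, so the printed indices will not match the output above.

```python
def fasttext_hash(ngram):
    """FNV-1a over UTF-8 bytes, with each byte sign-extended like int8_t."""
    h = 2166136261
    for byte in ngram.encode("utf-8"):
        if byte >= 128:
            byte |= 0xFFFFFF00  # emulate uint32_t(int8_t(byte))
        h = ((h ^ byte) * 16777619) & 0xFFFFFFFF
    return h


def subword_row(ngram, nwords, bucket=2_000_000):
    """Row index of an n-gram: hashed bucket placed after the word rows."""
    return nwords + fasttext_hash(ngram) % bucket


# nwords=0 is a placeholder; the real value is the trained vocabulary size
for gram in ["<영화", "영화관", "화관>"]:
    print(gram, subword_row(gram, nwords=0))
```

Because distinct n-grams can fall into the same bucket, two different subwords may share one row of the matrix.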