# Feature extraction from text with scikit-learn

For text, feature extraction means converting the text into numbers. I am going to look at two methods for doing this.

**CountVectorizer, TfidfVectorizer**

### 1. CountVectorizer

1. First, it is fit on the texts to build a vocabulary dictionary of words (or, optionally, characters).
2. Then, for each text, it counts how many times each vocabulary word appears.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

text_data = ['나는 배가 고프다',
             '내일 점심 뭐먹지',
             '내일 공부 해야겠다',
             '점심 먹고 공부해야지']

count_vectorizer = CountVectorizer()
count_vectorizer.fit(text_data)
print(count_vectorizer.vocabulary_)

{'나는': 3, '배가': 7, '고프다': 0, '내일': 4, '점심': 8, '뭐먹지': 6, '공부': 1, '해야겠다': 9, '먹고': 5, '공부해야지': 2}


It assigns an index to each word; that index is the word's column in the output vector.

In [6]:
sentence = [text_data[0]]
print(count_vectorizer.transform(sentence).toarray())

[[1 0 0 1 0 0 0 1 0 0]]


Finally, it counts how many times each word occurs in the sentence. This is a very simple way to vectorize text, and it does not take the importance of each word into account.

TF-IDF is a better method when that matters.
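To make the two steps of `CountVectorizer` concrete, here is a minimal pure-Python sketch of the same idea. It assumes plain whitespace tokenization, which happens to agree with scikit-learn's default tokenizer on this data because every word has at least two characters. `build_vocabulary` and `count_vector` are hypothetical helper names used for illustration, not scikit-learn functions.

```python
from collections import Counter

text_data = ['나는 배가 고프다',
             '내일 점심 뭐먹지',
             '내일 공부 해야겠다',
             '점심 먹고 공부해야지']


def build_vocabulary(texts):
    """Map each distinct whitespace-separated token to a column index (sorted order)."""
    tokens = sorted({token for text in texts for token in text.split()})
    return {token: index for index, token in enumerate(tokens)}


def count_vector(text, vocabulary):
    """Count how often each vocabulary token occurs in a single text."""
    counts = Counter(text.split())
    vector = [0] * len(vocabulary)
    for token, index in vocabulary.items():
        vector[index] = counts[token]  # Counter returns 0 for missing tokens
    return vector


# Step 1: "fit" builds the vocabulary. Step 2: "transform" counts words.
vocabulary = build_vocabulary(text_data)
first_vector = count_vector(text_data[0], vocabulary)
print(vocabulary)
print(first_vector)
```

The mapping and the vector for the first sentence should agree with the `CountVectorizer` output shown above; only the order in which the dictionary is printed differs.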

### 2. TF-IDF, TfidfVectorizer

It extracts features from text by giving each word a weight instead of a raw count. Simply put, TF (Term Frequency) is the number of times a word appears in one document, and DF (Document Frequency) is the number of documents in the whole dataset that contain the word. IDF (Inverse Document Frequency) is calculated from the inverse of DF, so its value increases when a word appears in few documents.

**TF-IDF** is the product of TF and IDF. It is large when a word occurs frequently in one document but rarely in the rest of the dataset.

So it can down-weight words that are not very important, such as postpositions or demonstrative pronouns, because they appear in most documents.
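The definitions above can be worked through by hand. The sketch below assumes the default `TfidfVectorizer` settings in scikit-learn: a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents, followed by L2 normalization of each row. It also assumes whitespace tokenization, which matches the default tokenizer on this particular data. `tfidf_matrix` is a hypothetical helper written for illustration.

```python
import math

text_data = ['나는 배가 고프다',
             '내일 점심 뭐먹지',
             '내일 공부 해야겠다',
             '점심 먹고 공부해야지']


def tfidf_matrix(texts):
    """Return (sorted vocabulary, L2-normalized TF-IDF rows) for the texts."""
    docs = [text.split() for text in texts]
    vocabulary = sorted({token for doc in docs for token in doc})
    n_docs = len(docs)

    # DF: number of documents that contain the token at least once.
    df = {token: sum(token in doc for doc in docs) for token in vocabulary}
    # Smoothed IDF (assumed to match scikit-learn's default, smooth_idf=True).
    idf = {token: math.log((1 + n_docs) / (1 + df[token])) + 1
           for token in vocabulary}

    rows = []
    for doc in docs:
        # TF is the raw count of the token in this document.
        row = [doc.count(token) * idf[token] for token in vocabulary]
        norm = math.sqrt(sum(value * value for value in row)) or 1.0
        rows.append([value / norm for value in row])
    return vocabulary, rows


vocabulary, rows = tfidf_matrix(text_data)
for row in rows:
    print([round(value, 8) for value in row])
```

In the second sentence, '뭐먹지' appears in only one document while '내일' and '점심' each appear in two, so '뭐먹지' ends up with the largest weight in that row.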

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit(text_data)
print(tfidf_vectorizer.vocabulary_)

{'나는': 3, '배가': 7, '고프다': 0, '내일': 4, '점심': 8, '뭐먹지': 6, '공부': 1, '해야겠다': 9, '먹고': 5, '공부해야지': 2}


In [10]:
# Transform all four texts; each row is the TF-IDF vector of one text
print(tfidf_vectorizer.transform(text_data).toarray())

[[0.57735027 0.         0.         0.57735027 0.         0.
  0.         0.57735027 0.         0.        ]
 [0.         0.         0.         0.         0.52640543 0.
  0.66767854 0.         0.52640543 0.        ]
 [0.         0.61761437 0.         0.         0.48693426 0.
  0.         0.         0.         0.61761437]
 [0.         0.         0.61761437 0.         0.         0.61761437
  0.         0.         0.48693426 0.        ]]
