## **| Text Analysis Practice Problems**

- Source: Kaggle

### **1. Tokenization**

In Natural Language Processing, tokenization means splitting a larger body of text into smaller units such as sentences or words.

There are two main types of tokenization:

- Sentence Tokenization
- Word Tokenization

In [None]:
# import package
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize, word_tokenize 

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
# sample text to perform our operations
text = "Hi, My name is Amartya Nambiar. I am a Computer Science Engineer. My favourite color is black"

In [None]:
# sentence tokenization
sent_tokenize(text)

['Hi, My name is Amartya Nambiar.',
 'I am a Computer Science Engineer.',
 'My favourite color is black']

In [None]:
# word tokenization; print the number of tokens
words = word_tokenize(text)   # tokenized into words
print(len(words))
print(words)

20
['Hi', ',', 'My', 'name', 'is', 'Amartya', 'Nambiar', '.', 'I', 'am', 'a', 'Computer', 'Science', 'Engineer', '.', 'My', 'favourite', 'color', 'is', 'black']
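To make the two kinds of tokenization concrete, here is a minimal regex-based sketch that reproduces the outputs above for this sample text. The functions `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical names for illustration only. They are a simplified stand-in, not the algorithm NLTK uses, and they will mis-split abbreviations such as "Dr. Smith" or contractions such as "don't".

```python
import re

def simple_sent_tokenize(text):
    # Split on whitespace that follows '.', '!' or '?' and precedes an uppercase letter.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [p.strip() for p in parts if p.strip()]

def simple_word_tokenize(text):
    # A run of word characters is one token; any other non-space character is its own token.
    return re.findall(r"\w+|[^\w\s]", text)

sample = ("Hi, My name is Amartya Nambiar. I am a Computer Science Engineer. "
          "My favourite color is black")
sentences = simple_sent_tokenize(sample)
tokens = simple_word_tokenize(sample)
print(sentences)
print(len(tokens), tokens)
```

For real text, `sent_tokenize` and `word_tokenize` remain the better choice because they handle many cases this sketch ignores.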


### **2. Stopwords & Removing Them**

Stopwords are very common words (such as "is", "a", or "my") that add little meaning to a sentence. They can usually be dropped without losing what the sentence is about.

In [None]:
import nltk
nltk.download('stopwords') 
from nltk.corpus import stopwords 

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
# load the English stopword list and inspect the first 15 entries
stop = stopwords.words('english')
print(len(stop))
print(stop[:15])

179
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours']


In [None]:
# filter the stopwords out of the tokenized text
# (the match is case-sensitive, so 'My' and 'I' are kept)
clean = [i for i in words if i not in stop]
print(len(clean))
print(clean)

16
['Hi', ',', 'My', 'name', 'Amartya', 'Nambiar', '.', 'I', 'Computer', 'Science', 'Engineer', '.', 'My', 'favourite', 'color', 'black']


In [None]:
# convert to lowercase first, then remove stopwords
words = word_tokenize(text.lower())
clean_lower = [i for i in words if i not in stop]
print(len(clean_lower))
print(clean_lower)

13
['hi', ',', 'name', 'amartya', 'nambiar', '.', 'computer', 'science', 'engineer', '.', 'favourite', 'color', 'black']


In [None]:
# remove punctuation ('.', ',') as well

import string
punctuations = list(string.punctuation)
stop += punctuations   # extends the stopword list in place
words = word_tokenize(text.lower())
clean_lower = [i for i in words if i not in stop]
print(len(clean_lower))
print(clean_lower)

10
['hi', 'name', 'amartya', 'nambiar', 'computer', 'science', 'engineer', 'favourite', 'color', 'black']
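The cells above extend `stop` in place, so re-running the last cell keeps appending the punctuation marks to the list. A possible alternative, sketched below, wraps the filtering in a helper that builds a set and leaves the original list untouched. The name `remove_stopwords`, its parameters, and the five-word `tiny_stop` list are assumptions made for this illustration. Unlike the notebook, the helper compares in lowercase but keeps each token's original casing.

```python
import string

def remove_stopwords(tokens, stopword_list, drop_punctuation=True):
    # Build a set for fast membership tests without modifying stopword_list.
    blocked = set(stopword_list)
    if drop_punctuation:
        blocked |= set(string.punctuation)
    # Compare in lowercase so 'My' and 'I' are caught as well.
    return [t for t in tokens if t.lower() not in blocked]

# Hypothetical mini stopword list, used only to keep the example self-contained.
tiny_stop = ['i', 'my', 'is', 'am', 'a']
sample_tokens = ['Hi', ',', 'My', 'name', 'is', 'Amartya', 'Nambiar', '.',
                 'I', 'am', 'a', 'Computer', 'Science', 'Engineer', '.',
                 'My', 'favourite', 'color', 'is', 'black']
filtered = remove_stopwords(sample_tokens, tiny_stop)
print(len(filtered), filtered)
```

In the notebook, the same helper could be called with `word_tokenize(text)` and `stopwords.words('english')`.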


### **3. Stemming**

Stemming is a technique for reducing words to a base form by removing affixes. It is like cutting the branches of a tree down to the stem. For example, the words eating, eats, and eaten all share the base form eat.

Two widely used stemming algorithms are:

- Porter Stemmer (used in this exercise)
- Lancaster Stemmer

In [None]:
from nltk.stem import PorterStemmer

In [None]:
# create a PorterStemmer object and stem a few words; try at least 3 examples of your own
# e.g. example1 = ['helps', 'helping', 'helped']

ps = PorterStemmer()         
example = ['helps','helping','helped']   
stemmed_example = [ps.stem(i) for i in example]
stemmed_example

['help', 'help', 'help']

In [None]:
ps.stem('happiness')  # the stem is not always a real word

'happi'
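To show why a rule-based stemmer can return a non-word such as 'happi', here is a toy suffix stripper. It is a deliberately naive sketch, not the Porter algorithm: `toy_stem` and its suffix list are hypothetical, and the only safeguard is that at least three characters must remain.

```python
def toy_stem(word, suffixes=("ness", "ing", "ed", "s")):
    # Strip the first matching suffix, as long as at least 3 characters remain.
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

toy_results = {w: toy_stem(w) for w in ["helps", "helping", "helped", "happiness", "eaten", "bus"]}
print(toy_results)
```

The rules only look at the spelling of the word, so irregular forms such as 'eaten' pass through unchanged and 'happiness' loses its suffix without being mapped back to 'happy'.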

### **4. Lemmatization**

The PorterStemmer class chops suffixes off a word, which is not always the best way to clean our data because the result may not be a real word.

Stemming looks only at the form of a word, whereas lemmatization looks up its dictionary form (lemma), taking the part of speech into account. As a result, when the word is found in WordNet, lemmatization returns a valid word; the `pos` argument controls which part of speech is assumed (noun by default).

In [None]:
# import package
import nltk
nltk.download('wordnet')   # WordNet corpus used by WordNetLemmatizer
nltk.download('omw-1.4')
from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [None]:
# lemmatize the word 'believes' (default pos is noun; then try the pos parameter)
lem = WordNetLemmatizer()
lem.lemmatize('believes')

'belief'

In [None]:
lem.lemmatize('believes', pos='a')   # as an adjective: no lemma found, returned unchanged

'believes'

In [None]:
lem.lemmatize('believes', pos='v')   # as a verb

'believe'
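Because the result depends on `pos`, a common pattern is to derive it from a part-of-speech tag instead of hard-coding it. The sketch below maps Penn Treebank style tags (such as 'VBZ' or 'JJ') to the single-letter values that `lemmatize` accepts. The helper name `penn_to_wordnet_pos` is hypothetical, and falling back to noun for every other tag is an assumption chosen to mirror the lemmatizer's default.

```python
def penn_to_wordnet_pos(tag):
    """Map a Penn Treebank tag (e.g. 'VBZ') to a WordNet pos letter."""
    if tag.startswith("J"):
        return "a"   # adjective
    if tag.startswith("V"):
        return "v"   # verb
    if tag.startswith("R"):
        return "r"   # adverb
    return "n"       # noun, also used as the fallback

mapped = [penn_to_wordnet_pos(t) for t in ["VBZ", "JJ", "RB", "NNS", "DT"]]
print(mapped)
```

Assuming the tagger resource has been downloaded, the tags could come from `nltk.pos_tag(words)`, and each `(word, tag)` pair could then be passed on as `lem.lemmatize(word, pos=penn_to_wordnet_pos(tag))`.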