# Text Preprocessing

In this notebook, you will practice preprocessing text data before analysis. Preprocessing is a required step in virtually every NLP project.


Reference:    
* https://github.com/gilbutITbook/080289/blob/main/chap09/colab_9%EC%9E%A5.ipynb

In [1]:
!pip install konlpy

Collecting konlpy
  Downloading konlpy-0.6.0-py2.py3-none-any.whl (19.4 MB)
Collecting JPype1>=0.7.0
  Downloading JPype1-1.3.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (448 kB)
Installing collected packages: JPype1, konlpy
Successfully installed JPype1-1.3.0 konlpy-0.6.0


In [2]:
import numpy as np
import pandas as pd
import nltk

In [3]:
nltk.download("popular")
text=nltk.word_tokenize("Is it possible distinguishing cats and dogs")
text

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cmudict.zip.
[nltk_data]    | Downloading package gazetteers to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/gazetteers.zip.
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/genesis.zip.
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/gutenberg.zip.
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/inaugural.zip.
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/movie_reviews.zip.
[nltk_data]    | Downloading package names to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/names.zip.
[nltk_data]    | Downloading package shakespeare to /root/nltk_data...
[nlt

['Is', 'it', 'possible', 'distinguishing', 'cats', 'and', 'dogs']

In [4]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [5]:
nltk.pos_tag(text)

[('Is', 'VBZ'),
 ('it', 'PRP'),
 ('possible', 'JJ'),
 ('distinguishing', 'VBG'),
 ('cats', 'NNS'),
 ('and', 'CC'),
 ('dogs', 'NNS')]

In [6]:
nltk.download('punkt')
string1="my favorite subject is math"
string2="my favorite subject is math, english, economic and computer science"
nltk.word_tokenize(string1)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['my', 'favorite', 'subject', 'is', 'math']

In [7]:
nltk.word_tokenize(string2)

['my',
 'favorite',
 'subject',
 'is',
 'math',
 ',',
 'english',
 ',',
 'economic',
 'and',
 'computer',
 'science']
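
The comma tokens in the output above are often dropped before further analysis. A minimal sketch, assuming the tokens have already been produced by `word_tokenize`; the helper name `drop_punctuation` is illustrative and not part of NLTK:

```python
import string

def drop_punctuation(tokens):
    """Return tokens that are not made up solely of ASCII punctuation."""
    punct = set(string.punctuation)
    return [tok for tok in tokens if not all(ch in punct for ch in tok)]

# Token list copied from the word_tokenize output above.
tokens = ['my', 'favorite', 'subject', 'is', 'math', ',', 'english', ',',
          'economic', 'and', 'computer', 'science']
cleaned = drop_punctuation(tokens)
print(cleaned)
```

Tokens that merely contain punctuation, such as contractions, are kept; only tokens consisting entirely of punctuation are removed.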

In [12]:
# KoNLPy
from konlpy.tag import Komoran, Kkma
import time

t1 = time.time()
komoran = Komoran()
print(komoran.morphs('딥러닝이 쉽나요? 어렵나요?'))

t2 = time.time()

kkma = Kkma()
print(kkma.morphs('딥러닝이 쉽나요? 어렵나요?'))
t3 = time.time()

print(t2-t1, t3-t2)

['딥러닝이', '쉽', '나요', '?', '어렵', '나요', '?']
['딥', '러닝', '이', '쉽', '나요', '?', '어렵', '나요', '?']
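
The cell above brackets each call with `time.time()`. A small helper can make such comparisons less error-prone. The sketch below uses `time.perf_counter`, which is intended for measuring short durations. The helper name `timed` is an assumption for illustration, and the commented usage presumes the `komoran` object from the cell above.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload so the sketch runs without KoNLPy installed.
result, elapsed = timed(sorted, [3, 1, 2])
print(result, f"{elapsed:.6f}s")

# With the analyzers from the cell above, usage might look like:
# morphs, secs = timed(komoran.morphs, '딥러닝이 쉽나요? 어렵나요?')
```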


In [13]:
print(komoran.pos('소파 위에 있는 것이 고양이인가요? 강아지인가요?'))

[('소파', 'NNP'), ('위', 'NNG'), ('에', 'JKB'), ('있', 'VV'), ('는', 'ETM'), ('것', 'NNB'), ('이', 'JKS'), ('고양이', 'NNG'), ('이', 'VCP'), ('ㄴ가요', 'EF'), ('?', 'SF'), ('강아지', 'NNG'), ('이', 'VCP'), ('ㄴ가요', 'EF'), ('?', 'SF')]
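
POS tags make it easy to keep only certain word classes. The sketch below keeps the nouns from the `komoran.pos` output above, assuming that Komoran's noun tags all begin with `NN` (`NNG`, `NNP`, `NNB`). The tagged list is copied from that output so the example runs without KoNLPy; `keep_nouns` is a name chosen for illustration.

```python
# (morpheme, tag) pairs copied from the komoran.pos output above.
tagged = [('소파', 'NNP'), ('위', 'NNG'), ('에', 'JKB'), ('있', 'VV'),
          ('는', 'ETM'), ('것', 'NNB'), ('이', 'JKS'), ('고양이', 'NNG'),
          ('이', 'VCP'), ('ㄴ가요', 'EF'), ('?', 'SF'), ('강아지', 'NNG'),
          ('이', 'VCP'), ('ㄴ가요', 'EF'), ('?', 'SF')]

def keep_nouns(pairs):
    """Keep morphemes whose tag starts with 'NN' (assumed noun tags)."""
    return [morph for morph, tag in pairs if tag.startswith('NN')]

nouns = keep_nouns(tagged)
print(nouns)
```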


## Tokenizing

In [14]:
# Sentence Tokenizing
from nltk import sent_tokenize
text_sample = 'Natural Language Processing, or NLP, is the process of extracting the meaning, or intent, behind human language. In the field of Conversational artificial intelligence (AI), NLP allows machines and applications to understand the intent of human language inputs, and then generate appropriate responses, resulting in a natural conversation flow.'
tokenized_sentences = sent_tokenize(text_sample)
print(tokenized_sentences)

['Natural Language Processing, or NLP, is the process of extracting the meaning, or intent, behind human language.', 'In the field of Conversational artificial intelligence (AI), NLP allows machines and applications to understand the intent of human language inputs, and then generate appropriate responses, resulting in a natural conversation flow.']
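
To see why a dedicated sentence tokenizer is useful, compare it with a naive rule that splits after every `.`, `!`, or `?` followed by whitespace. This is only an illustrative baseline, not how `sent_tokenize` works internally; the function name `naive_sent_split` is made up for this sketch.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

sample = "Dr. Kim studies NLP. She builds chatbots."
parts = naive_sent_split(sample)
print(parts)  # the abbreviation "Dr." is wrongly treated as a sentence end
```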


In [15]:
# Word Tokenizing
from nltk import word_tokenize
sentence = " This book is for deep learning learners"
words = word_tokenize(sentence)
print(words)

['This', 'book', 'is', 'for', 'deep', 'learning', 'learners']


In [16]:
# Tokenizing text that contains apostrophes
from nltk.tokenize import WordPunctTokenizer
sentence = "it’s nothing that you don’t already know except most people aren’t aware of how their inner world works."
words = WordPunctTokenizer().tokenize(sentence)
print(words)

['it', '’', 's', 'nothing', 'that', 'you', 'don', '’', 't', 'already', 'know', 'except', 'most', 'people', 'aren', '’', 't', 'aware', 'of', 'how', 'their', 'inner', 'world', 'works', '.']
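
`WordPunctTokenizer` is documented as splitting text with the regular expression `\w+|[^\w\s]+`, which is why each apostrophe becomes its own token. The sketch below applies the same pattern with the standard `re` module so the behavior can be inspected without NLTK; treat it as an approximation for illustration.

```python
import re

# Runs of word characters, or runs of characters that are neither
# word characters nor whitespace (punctuation, including ’).
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

sentence = "it’s nothing that you don’t already know"
tokens = WORD_PUNCT.findall(sentence)
print(tokens)
```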


In [17]:
data_url = "https://raw.githubusercontent.com/gilbutITbook/080289/main/chap09/data/ratings_train.txt"

In [18]:
rating_data = pd.read_csv(data_url, sep='\t')

In [19]:
rating_data.head()

Unnamed: 0,id,document,label
0,9976970,아 더빙.. 진짜 짜증나네요 목소리,0
1,3819312,흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나,1
2,10265843,너무재밓었다그래서보는것을추천한다,0
3,9045019,교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정,0
4,6483659,사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 ...,1


In [20]:
len(rating_data)

150000

In [21]:
# Morphological analysis
from konlpy.tag import Okt

In [22]:
from tqdm import tqdm

In [23]:
# Twitter (Okt): an open-source Korean morphological analyzer
okt = Okt()

# POS tags to drop: particles, endings, and punctuation
excluded_tags = {'Josa', 'Eomi', 'Punctuation'}

result = []
for comment in tqdm(rating_data.document.values[:1000], desc='Analyzing morphemes...'):
  # norm=True normalizes the text; stem=True reduces words to their base form
  malist = okt.pos(comment, norm=True, stem=True)
  kept = [word for word, tag in malist if tag not in excluded_tags]
  result.append(' '.join(kept).strip())


Analyzing morphemes...: 100%|██████████| 1000/1000 [00:24<00:00, 40.92it/s]


In [24]:
result[:10]

['아 더빙 진짜 짜증나다 목소리',
 '흠 포스터 보고 초딩 영화 줄 오버 연기 가볍다 않다',
 '너 무재 밓었 다그 래서 보다 추천 다',
 '교도소 이야기 구먼 솔직하다 재미 없다 평점 조정',
 '사이 몬페 그 의 익살스럽다 연기 돋보이다 영화 스파이더맨 늙다 보이다 하다 커스틴 던스트 너무나도 이쁘다 보이다',
 '막 걸음 마 떼다 3 세 초등학교 1 학년 생인 8 살다 영화 ㅋㅋㅋ 별 반개 아깝다 움',
 '원작 긴장감 제대로 살리다 하다',
 '별 반개 아깝다 욕 나오다 이응경 길용우 연 기 생활 몇 년 정말 발 해도 그것 낫다 납치 감금 반복 반복 이 드라마 가족 없다 연기 못 하다 사람 모 엿 네',
 '액션 없다 재미 있다 몇 안되다 영화',
 '왜 이렇게 평점 낮다 꽤 볼 한 데 헐리우드 식 화려하다 너무 길들이다 있다']

## Removing Stopwords

In [25]:
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
from nltk.tokenize import word_tokenize

sample_text = "One of the first things that we ask ourselves is what are the pros and cons of any task we perform."
text_tokens = word_tokenize(sample_text)

# Build the stopword set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))
tokens_without_sw = [word for word in text_tokens if word not in stop_words]
print("Without stopword removal:", text_tokens, '\n')
print("With stopword removal:", tokens_without_sw)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Without stopword removal: ['One', 'of', 'the', 'first', 'things', 'that', 'we', 'ask', 'ourselves', 'is', 'what', 'are', 'the', 'pros', 'and', 'cons', 'of', 'any', 'task', 'we', 'perform', '.'] 

With stopword removal: ['One', 'first', 'things', 'ask', 'pros', 'cons', 'task', 'perform', '.']
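
NLTK's English stopword list is lowercase, so comparing tokens in lowercase avoids missing capitalized stopwords such as a sentence-initial "The". The sketch below uses a small hand-picked stopword set (an assumption standing in for `stopwords.words('english')`) so that it runs without downloading the corpus; `remove_stopwords` is an illustrative name.

```python
# A tiny illustrative stopword set; in the notebook this would be
# set(stopwords.words('english')).
STOP_WORDS = {'the', 'of', 'and', 'is', 'are', 'we', 'that', 'what', 'any'}

def remove_stopwords(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is in stop_words."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = ['The', 'pros', 'and', 'cons', 'of', 'any', 'task']
filtered = remove_stopwords(tokens)
print(filtered)
```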


## Stemming and Lemmatization

The goal is to reduce inflected forms to a common base form, e.g. writing, writes, wrote, written -> write.
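
Stemming strips suffixes by rule, while lemmatization looks words up in a vocabulary. The toy sketch below contrasts the two on the example above. The suffix list and the lookup table are deliberately tiny assumptions; they are not the Porter, Lancaster, or WordNet implementations used in the following cells.

```python
SUFFIXES = ('ing', 'es', 's', 'ed')              # toy rule set
LEMMAS = {'writing': 'write', 'writes': 'write',
          'wrote': 'write', 'written': 'write'}  # toy dictionary

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the toy dictionary; fall back to the word itself."""
    return LEMMAS.get(word, word)

words = ['writing', 'writes', 'wrote', 'written']
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]
print(stems)
print(lemmas)
```

The rule-based stemmer leaves the irregular forms "wrote" and "written" untouched, whereas the dictionary lookup maps all four words to "write".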

In [26]:
# Porter algorithm
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

print(stemmer.stem('obsesses'),stemmer.stem('obsessed'))
print(stemmer.stem('standardizes'),stemmer.stem('standardization'))
print(stemmer.stem('national'), stemmer.stem('nation'))
print(stemmer.stem('absentness'), stemmer.stem('absently'))
print(stemmer.stem('tribalical'), stemmer.stem('tribalicalized'))

obsess obsess
standard standard
nation nation
absent absent
tribal tribalic


In [27]:
from nltk.stem import LancasterStemmer
stemmer = LancasterStemmer()

print(stemmer.stem('obsesses'),stemmer.stem('obsessed'))
print(stemmer.stem('standardizes'),stemmer.stem('standardization'))
print(stemmer.stem('national'), stemmer.stem('nation'))
print(stemmer.stem('absentness'), stemmer.stem('absently'))
print(stemmer.stem('tribalical'), stemmer.stem('tribalicalized'))

obsess obsess
standard standard
nat nat
abs abs
trib trib


In [28]:
# Lemmatization

nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer
lemma = WordNetLemmatizer()

print(lemma.lemmatize('obsesses'),lemma.lemmatize('obsessed'))
print(lemma.lemmatize('standardizes'),lemma.lemmatize('standardization'))
print(lemma.lemmatize('national'), lemma.lemmatize('nation'))
print(lemma.lemmatize('absentness'), lemma.lemmatize('absently'))
print(lemma.lemmatize('tribalical'), lemma.lemmatize('tribalicalized'))

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
obsesses obsessed
standardizes standardization
national nation
absentness absently
tribalical tribalicalized


In [29]:
print(lemma.lemmatize('obsesses', 'v'),lemma.lemmatize('obsessed','a'))
print(lemma.lemmatize('standardizes','v'),lemma.lemmatize('standardization','n'))
print(lemma.lemmatize('national','a'), lemma.lemmatize('nation','n'))
print(lemma.lemmatize('absentness','n'), lemma.lemmatize('absently','r'))
print(lemma.lemmatize('tribalical','a'), lemma.lemmatize('tribalicalized','v'))

obsess obsessed
standardize standardization
national nation
absentness absently
tribalical tribalicalized
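
As the last cell shows, `WordNetLemmatizer` needs the part of speech to do its job. One common approach is to derive it from `nltk.pos_tag`, which returns Penn Treebank tags. The mapping below is a sketch: the helper name `penn_to_wordnet` is an assumption, and the single-letter codes are the ones passed to `lemmatize` in the cell above.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code ('a', 'v', 'r', 'n')."""
    if tag.startswith('J'):
        return 'a'   # adjectives: JJ, JJR, JJS
    if tag.startswith('V'):
        return 'v'   # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith('RB'):
        return 'r'   # adverbs: RB, RBR, RBS
    return 'n'       # everything else is treated as a noun

# Tags copied from the nltk.pos_tag output earlier in the notebook.
tagged = [('Is', 'VBZ'), ('possible', 'JJ'), ('cats', 'NNS')]
codes = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(codes)

# With NLTK available, usage might look like:
# lemma.lemmatize(word.lower(), penn_to_wordnet(tag))
```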


# To-do

Using the requests and bs4 libraries, fetch today's lunch and dinner menus for '두레미담', then preprocess them.

In [30]:
import requests
from bs4 import BeautifulSoup

In [31]:
url = "https://snuco.snu.ac.kr/foodmenu"

In [32]:
html = requests.get(url)

In [33]:
soup = BeautifulSoup(html.text, 'html.parser')

In [34]:
lunch = soup.find_all('td', {'class': 'views-field views-field-field-lunch'})

In [35]:
# 두레미담: 5th restaurant
lunch[4]

<td class="views-field views-field-field-lunch">
<p>&lt;셀프코너&gt; 6,500원</p>
<p>잡곡밥</p>
<p>알배추된장국</p>
<p>마늘보쌈</p>
<p>콘스프양념치킨볼</p>
<p>비빔막국수</p>
<p>상추파채무침</p>
<p>포기김치</p>
<p>오늘의차/그린샐러드</p>
<p> </p>
<p>&lt;주문식 메뉴&gt;</p>
<p>고등어 소금구이 13,000원</p>
<p>철판주꾸미 볶음 14,000원</p>
<p>떡갈비 구이 15,000원</p>
<p>도가니탕 18,000원</p>
<p>돌솥 7선 산채비빔밥 15,000원</p>
<p>차돌된장찌개 10,000원</p>
<p>초계국수/냉면+왕만두 11,000원</p>
<p>솥밥(추가) 2,000원</p>
<p> </p>
<p><span style="font-size: 14px;"><span style='color: black; font-family: "맑은 고딕"; language: en-US; mso-ascii-font-family: "맑은 고딕"; mso-fareast-font-family: "맑은 고딕"; mso-bidi-font-family: +mj-cs; mso-ascii-theme-font: major-latin; mso-fareast-theme-font: major-fareast; mso-bidi-theme-font: major-bidi; mso-color-index: 1; mso-font-kerning: 12.0pt; mso-style-textfill-type: solid; mso-style-textfill-fill-themecolor: text1; mso-style-textfill-fill-color: black; mso-style-textfill-fill-alpha: 100.0%;'>※</span><span style='color: black; font-family: "맑은 고딕"; language: ko; mso-ascii-font

In [36]:
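p_tag_list = lunch[4].find_all('p')

menus = []
# Skip the first <p>, which holds the price header rather than a dish
for item in p_tag_list[1:]:
  menu = item.get_text()
  # Skip spacer paragraphs that contain only a non-breaking space
  if menu != '\xa0':
    menus.append(menu)
  # The self-service menu ends with this item; stop before the made-to-order list
  if menu == '오늘의차/그린샐러드':
    break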

In [37]:
menus

['잡곡밥', '알배추된장국', '마늘보쌈', '콘스프양념치킨볼', '비빔막국수', '상추파채무침', '포기김치', '오늘의차/그린샐러드']
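
The same extraction is needed again for dinner, so the loop can be wrapped in a function. The sketch below takes plain strings (the text of each `<p>` tag) rather than BeautifulSoup objects so that it runs on its own; `extract_menus` and its parameters are names chosen for illustration.

```python
def extract_menus(paragraphs, last_item='오늘의차/그린샐러드'):
    """Collect menu names after the header line, stopping at last_item."""
    menus = []
    for text in paragraphs[1:]:          # skip the price header
        if text != '\xa0':               # skip non-breaking-space spacers
            menus.append(text)
        if text == last_item:            # end of the self-service corner
            break
    return menus

# Paragraph texts copied from the lunch cell output above.
paragraphs = ['<셀프코너> 6,500원', '잡곡밥', '알배추된장국', '마늘보쌈',
              '콘스프양념치킨볼', '비빔막국수', '상추파채무침', '포기김치',
              '오늘의차/그린샐러드', '\xa0', '<주문식 메뉴>']
lunch_menus = extract_menus(paragraphs)
print(lunch_menus)

# With BeautifulSoup, usage might look like:
# extract_menus([p.get_text() for p in lunch[4].find_all('p')])
```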

Now complete the rest of the task.

In [53]:
# KoNLPy: compare how Komoran and Kkma split each menu name
from konlpy.tag import Komoran, Kkma
komoran = Komoran()
kkma = Kkma()
for menu in menus:
  print(komoran.morphs(menu))
  print(kkma.morphs(menu))

['잡곡밥']
['잡곡밥']
['알', 'ㄹ', '배추', '된장국']
['알', 'ㄹ', '배추', '되', 'ㄴ', '장국']
['마늘', '보쌈']
['마늘', '보쌈']
['콘', '스프', '양념치킨', '볼']
['콘', '스프', '양념', '치킨', '보', 'ㄹ']
['비', '빔', '막국수']
['비빔', '막국수']
['상추', '파', '채무', '침']
['상추', '파', '채무', '치', 'ㅁ']
['포기', '김치']
['포기김치']
['오늘', '의', '차', '/', '그린', '샐러드']
['오늘', '의', '차', '/', '그린', '샐러드']
['잡곡밥']
['잡곡밥']
['바지락', '순두부찌개']
['바지락', '순두부', '찌개']
['고구마', '찜', '닭']
['고구마', '찜', '닭']
['찹쌀', '꾸', '어', '바', '로우']
['찹쌀', '꾸', '어', '바로', '울']
['미역', '줄기', '볶음']
['미역', '줄기', '볶음']
['슈크림', '데', '이', '니', '쉬']
['슈크림', '델', '니', '쉬']
['포기', '김치']
['포기김치']
['오늘', '의', '차', '/', '그린', '샐러드']
['오늘', '의', '차', '/', '그린', '샐러드']


In [39]:
# Morphological analysis
from konlpy.tag import Okt
from tqdm import tqdm

In [41]:
# Twitter (Okt): an open-source Korean morphological analyzer
okt = Okt()

# POS tags to drop: particles, endings, and punctuation
excluded_tags = {'Josa', 'Eomi', 'Punctuation'}

result = []
for comment in tqdm(menus, desc='Analyzing morphemes...'):
  malist = okt.pos(comment, norm=True, stem=True)
  kept = [word for word, tag in malist if tag not in excluded_tags]
  result.append(' '.join(kept).strip())


Analyzing morphemes...: 100%|██████████| 8/8 [00:00<00:00, 603.61it/s]


In [42]:
result

['잡곡 밥',
 '알 배추 된장국',
 '마늘 보쌈',
 '콘 스프 양념치킨 볼',
 '비빔 막국수',
 '상추 파 채무 침',
 '포기 김치',
 '오늘 차 그린 샐러드']

### Dinner

In [43]:
dinner = soup.find_all('td', {'class': 'views-field views-field-field-dinner'})

In [None]:
from konlpy.tag import Okt
from tqdm import tqdm

In [45]:
# 두레미담: 5th restaurant
dinner[4]

<td class="views-field views-field-field-dinner">
<p>&lt;셀프코너&gt; 6,500원</p>
<p>잡곡밥</p>
<p>바지락순두부찌개</p>
<p>고구마찜닭</p>
<p>찹쌀꿔바로우</p>
<p>미역줄기볶음</p>
<p>슈크림데니쉬</p>
<p>포기김치</p>
<p>오늘의차/그린샐러드</p>
<p> </p>
<p>&lt;주문식 메뉴&gt;</p>
<p>고등어 소금구이 13,000원</p>
<p>철판주꾸미 볶음 14,000원</p>
<p>떡갈비 구이 15,000원</p>
<p>도가니탕 18,000원</p>
<p>돌솥 7선 산채비빔밥 15,000원</p>
<p>차돌된장찌개 10,000원</p>
<p>초계국수/냉면+왕만두 11,000원</p>
<p>솥밥(추가) 2,000원</p>
<p> </p>
<p><span style="font-size: 14px;"><span style='color: black; font-family: "맑은 고딕"; language: en-US; mso-ascii-font-family: "맑은 고딕"; mso-fareast-font-family: "맑은 고딕"; mso-bidi-font-family: +mj-cs; mso-ascii-theme-font: major-latin; mso-fareast-theme-font: major-fareast; mso-bidi-theme-font: major-bidi; mso-color-index: 1; mso-font-kerning: 12.0pt; mso-style-textfill-type: solid; mso-style-textfill-fill-themecolor: text1; mso-style-textfill-fill-color: black; mso-style-textfill-fill-alpha: 100.0%;'>※</span><span style='color: black; font-family: "맑은 고딕"; language: ko; mso-ascii-f

In [48]:
p_tag_list = dinner[4].find_all('p')

dinner_menus = []
# Same extraction as for lunch: skip the price header and spacer paragraphs
for item in p_tag_list[1:]:
  menu = item.get_text()
  if menu != '\xa0':
    dinner_menus.append(menu)
  # Stop at the last self-service item
  if menu == '오늘의차/그린샐러드':
    break

In [49]:
dinner_menus

['잡곡밥',
 '바지락순두부찌개',
 '고구마찜닭',
 '찹쌀꿔바로우',
 '미역줄기볶음',
 '슈크림데니쉬',
 '포기김치',
 '오늘의차/그린샐러드']

In [50]:
# Twitter (Okt): an open-source Korean morphological analyzer
okt = Okt()

# POS tags to drop: particles, endings, and punctuation
excluded_tags = {'Josa', 'Eomi', 'Punctuation'}

result = []
for comment in tqdm(dinner_menus, desc='Analyzing morphemes...'):
  malist = okt.pos(comment, norm=True, stem=True)
  kept = [word for word, tag in malist if tag not in excluded_tags]
  result.append(' '.join(kept).strip())

Analyzing morphemes...: 100%|██████████| 8/8 [00:00<00:00, 661.65it/s]


In [51]:
result

['잡곡 밥',
 '바지락 순두부찌개',
 '고구마 찜닭',
 '찹쌀 꾸다 바 로우',
 '미역 줄기 볶음',
 '슈크림 데니 쉬',
 '포기 김치',
 '오늘 차 그린 샐러드']
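
With both meals preprocessed, the tokens can be counted to see which items recur. A minimal sketch using `collections.Counter`; the two lists are copied from the lunch and dinner outputs above. In the notebook itself `result` is overwritten by the dinner cell, so the lunch result would need its own variable (the names `lunch_result` and `dinner_result` are assumptions).

```python
from collections import Counter

lunch_result = ['잡곡 밥', '알 배추 된장국', '마늘 보쌈', '콘 스프 양념치킨 볼',
                '비빔 막국수', '상추 파 채무 침', '포기 김치', '오늘 차 그린 샐러드']
dinner_result = ['잡곡 밥', '바지락 순두부찌개', '고구마 찜닭', '찹쌀 꾸다 바 로우',
                 '미역 줄기 볶음', '슈크림 데니 쉬', '포기 김치', '오늘 차 그린 샐러드']

# Count every whitespace-separated token across both meals.
counts = Counter(tok for line in lunch_result + dinner_result
                 for tok in line.split())
# Tokens that appear more than once, i.e. shared between lunch and dinner here.
repeated = sorted(tok for tok, n in counts.items() if n > 1)
print(repeated)
```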