<a href="https://colab.research.google.com/github/hwuiwon/ML_study/blob/master/Natural_Language_Processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Required Frameworks & Packages**

## **Frameworks**
> **[TensorFlow](https://www.tensorflow.org/)**: Open-source machine learning library<br>**[Keras](https://keras.io/)**: Python deep learning library<br>**[Gensim](https://radimrehurek.com/gensim/)**: Open-source library for unsupervised topic modeling and natural language processing<br>**[Scikit-learn](https://scikit-learn.org/)**: Python machine learning library
## **Packages**
> **[NLTK](https://www.nltk.org/)**: Natural Language Toolkit for symbolic and statistical processing of **English** text<br>**[KoNLPy](https://konlpy.org/en/latest/)**: Python package for natural language processing of the **Korean language**<br>**[JPype](https://pypi.org/project/JPype1/)**: Python module that provides full access to **Java** from within Python

Reference: https://wikidocs.net/book/2155

### **Run the cells below to install**

In [0]:
%tensorflow_version 2.x

!pip install tensorflow
!pip install keras
!pip install gensim
!pip install scikit-learn

!pip install nltk
!pip install konlpy
!pip install JPype1

In [0]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True
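
After the install cells have run, a quick check that every package is importable can save debugging later. The snippet below is a minimal sketch using only the standard library; the helper name `check_installed` is hypothetical, and the module names are the import names of the packages listed above (note that `scikit-learn` is imported as `sklearn` and `JPype1` as `jpype`).

```python
import importlib.util

# Import names can differ from pip names: scikit-learn -> sklearn, JPype1 -> jpype.
MODULES = ["tensorflow", "keras", "gensim", "sklearn", "nltk", "konlpy", "jpype"]


def check_installed(names):
    """Return {module_name: True/False} without actually importing the modules."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


for name, found in check_installed(MODULES).items():
    print(f"{name:12} {'found' if found else 'MISSING'}")
```

Any module reported as `MISSING` needs its `pip install` line to be run again.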

# **Text Preprocessing**

## **Word Tokenization**

### **Using Packages**

In [0]:
# TreebankWordTokenizer follows the Penn Treebank conventions:
# 1. Hyphenated words (e.g. "home-based") are kept as a single token.
# 2. Contractions are split at the clitic (e.g. "doesn't" -> "does", "n't").

from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
text = "Starting a home-based restaurant may be an ideal. it doesn't have a food chain or restaurant of their own."
print(tokenizer.tokenize(text))

['Starting', 'a', 'home-based', 'restaurant', 'may', 'be', 'an', 'ideal.', 'it', 'does', "n't", 'have', 'a', 'food', 'chain', 'or', 'restaurant', 'of', 'their', 'own', '.']
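
The two behaviours visible in this output (hyphenated words stay whole, contractions are split before `n't`) can be illustrated with a small regex sketch. This is not how `TreebankWordTokenizer` is implemented; `toy_tokenize` is a hypothetical helper that mimics only those two rules, and it splits off every punctuation mark, so it will not reproduce tokens such as `'ideal.'` above.

```python
import re


def toy_tokenize(text):
    """Toy tokenizer (an illustration, not NLTK's algorithm).

    Rule 1: words joined by hyphens are kept as one token.
    Rule 2: the clitic "n't" is split off from the word before it.
    """
    # Put a space before "n't" so that "doesn't" becomes "does n't".
    text = re.sub(r"(\w)(n't)\b", r"\1 \2", text)
    # Match "n't", then (possibly hyphenated) words, then single punctuation marks.
    return re.findall(r"n't|\w+(?:-\w+)*|[^\w\s]", text)


print(toy_tokenize("A home-based cafe doesn't close."))
```

A real tokenizer handles many more cases (quotes, possessives, abbreviations), which is the reason to prefer the library version in practice.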


In [0]:
from tensorflow.keras.preprocessing.text import text_to_word_sequence
print(text_to_word_sequence("Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry shop."))

["don't", 'be', 'fooled', 'by', 'the', 'dark', 'sounding', 'name', 'mr', "jone's", 'orphanage', 'is', 'as', 'cheery', 'as', 'cheery', 'goes', 'for', 'a', 'pastry', 'shop']
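
The output above shows that `text_to_word_sequence` lowercases the text and strips punctuation but keeps apostrophes. A minimal standard-library sketch of that behaviour follows, assuming the documented default `filters` string of the Keras function; `simple_word_sequence` is a hypothetical name, and this is an approximation rather than the Keras source.

```python
# Default filter characters as documented for Keras' text_to_word_sequence.
# The apostrophe is not in this set, which is why "don't" survives intact.
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def simple_word_sequence(text, filters=DEFAULT_FILTERS, lower=True, split=" "):
    """Lowercase, replace every filter character with the separator, then split."""
    if lower:
        text = text.lower()
    table = str.maketrans({ch: split for ch in filters})
    return [token for token in text.translate(table).split(split) if token]


sample = ("Don't be fooled by the dark sounding name, Mr. Jone's Orphanage "
          "is as cheery as cheery goes for a pastry shop.")
print(simple_word_sequence(sample))
```

Because the hyphen is one of the filter characters, a word such as `home-based` comes out as two tokens here, unlike with `TreebankWordTokenizer`.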


## **Sentence Tokenization**

### **Using Packages**

In [0]:
from nltk.tokenize import sent_tokenize

text = "His barber kept his word. But keeping such a huge secret to himself was driving him crazy. Finally, the barber went up a mountain and almost to the edge of a cliff. He dug a hole in the midst of some reeds. He looked about, to make sure no one was near."
print(sent_tokenize(text))

['His barber kept his word.', 'But keeping such a huge secret to himself was driving him crazy.', 'Finally, the barber went up a mountain and almost to the edge of a cliff.', 'He dug a hole in the midst of some reeds.', 'He looked about, to make sure no one was near.']
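
To see why a trained sentence tokenizer is worth using, it helps to compare it with the obvious rule "split after `.`, `!` or `?` followed by whitespace". The sketch below uses only the standard library; `naive_sent_split` is a hypothetical helper, not part of NLTK. It works on simple prose but breaks on abbreviations such as `Ph.D.`.

```python
import re


def naive_sent_split(text):
    """Split wherever '.', '!' or '?' is followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


story = ("His barber kept his word. But keeping such a huge secret "
         "to himself was driving him crazy.")
print(naive_sent_split(story))

# The abbreviation "Ph.D." ends in a period followed by a space, so the
# naive rule wrongly treats it as a sentence boundary.
tricky = "I am actively looking for Ph.D. students. and you are a Ph.D. student."
print(naive_sent_split(tricky))
```

The second call returns four fragments instead of two sentences. `sent_tokenize` relies on the pre-trained Punkt model downloaded earlier, which is designed to recognise common abbreviations rather than splitting on punctuation alone.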


#### **Case: Korean**

In [0]:
# kss: Korean Sentence Splitter
!pip install kss

In [0]:
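import kss
# English gloss: "Phase 1 human clinical trials take two months. After dosing, results can be
# judged in 3 to 4 weeks. ..." (four sentences about clinical trial scheduling)
text = '인체 임상 1상에 2개월이 소요된다. 인체 투여하고 나서 임상 결과는 3~4주면 판단할 수 있다. 원래는 2개월 소요되는데 식약처와 규제기관과 상의할 예정이다. 임상 2상과 3상을 같이 할것이냐, 2상을 따로 하고 3상을 할 것이냐도 규제기관과 협의가 남아있다.'
print(kss.split_sentences(text))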

['인체 임상 1상에 2개월이 소요된다.', '인체 투여하고 나서 임상 결과는 3~4주면 판단할 수 있다.', '원래는 2개월 소요되는데 식약처와 규제기관과 상의할 예정이다.', '임상 2상과 3상을 같이 할것이냐, 2상을 따로 하고 3상을 할 것이냐도 규제기관과 협의가 남아있다.']


## **Part-of-Speech Tagging**

### **Using Packages**

In [0]:
import nltk
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [0]:
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

text = "I am actively looking for Ph.D. students. and you are a Ph.D. student."
tokens = word_tokenize(text)  # tokenize first; pos_tag expects a list of tokens
pos_tag(tokens)

[('I', 'PRP'),
 ('am', 'VBP'),
 ('actively', 'RB'),
 ('looking', 'VBG'),
 ('for', 'IN'),
 ('Ph.D.', 'NNP'),
 ('students', 'NNS'),
 ('.', '.'),
 ('and', 'CC'),
 ('you', 'PRP'),
 ('are', 'VBP'),
 ('a', 'DT'),
 ('Ph.D.', 'NNP'),
 ('student', 'NN'),
 ('.', '.')]

**POS tags** (Penn Treebank tag set): https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
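
The tag abbreviations in the output are easier to read next to their meanings. The sketch below hard-codes short glosses for only the Penn Treebank tags that appear above; `PENN_TAGS` and `describe` are hypothetical names for illustration, not part of NLTK.

```python
# Short glosses for the Penn Treebank tags seen in the output above.
PENN_TAGS = {
    "PRP": "personal pronoun",
    "VBP": "verb, non-3rd person singular present",
    "RB": "adverb",
    "VBG": "verb, gerund or present participle",
    "IN": "preposition or subordinating conjunction",
    "NNP": "proper noun, singular",
    "NNS": "noun, plural",
    "CC": "coordinating conjunction",
    "DT": "determiner",
    "NN": "noun, singular or mass",
    ".": "sentence-final punctuation",
}


def describe(tagged):
    """Attach a readable gloss to each (word, tag) pair."""
    return [(word, tag, PENN_TAGS.get(tag, "unknown tag")) for word, tag in tagged]


# A few pairs copied from the pos_tag output above.
tagged = [("I", "PRP"), ("am", "VBP"), ("actively", "RB"), ("looking", "VBG")]
for word, tag, gloss in describe(tagged):
    print(f"{word:10} {tag:4} {gloss}")
```

Tags missing from the dictionary are reported as `unknown tag`; the linked page lists the full tag set.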

#### **Case: Korean**

In [0]:
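from konlpy.tag import Okt

okt = Okt()
# English gloss: "To contribute to the nation and the world, I intend to keep
# investing to the end, even if only as a preventive measure."
sentence = "국가와 전세계 기여하기 위해 예방적 차원에서라도 끝까지 투자할 생각이다."
print(okt.morphs(sentence))  # morphemes
print(okt.pos(sentence))     # (morpheme, tag) pairs
print(okt.nouns(sentence))   # nouns only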

['국가', '와', '전세계', '기여', '하기', '위해', '예방', '적', '차원', '에서라도', '끝', '까지', '투자', '할', '생각', '이다', '.']
[('국가', 'Noun'), ('와', 'Josa'), ('전세계', 'Noun'), ('기여', 'Noun'), ('하기', 'Verb'), ('위해', 'Noun'), ('예방', 'Noun'), ('적', 'Suffix'), ('차원', 'Noun'), ('에서라도', 'Josa'), ('끝', 'Noun'), ('까지', 'Josa'), ('투자', 'Noun'), ('할', 'Verb'), ('생각', 'Noun'), ('이다', 'Josa'), ('.', 'Punctuation')]
['국가', '전세계', '기여', '위해', '예방', '차원', '끝', '투자', '생각']


Where:
> **okt.morphs(String)**: Split the text into morphemes<br>
> **okt.pos(String)**: Tag each morpheme with its part of speech<br>
> **okt.nouns(String)**: Extract only the nouns
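
In the output shown above, the result of `okt.nouns` is exactly the set of morphemes that `okt.pos` tagged as `Noun`, in the same order. The sketch below checks this on the printed output, so it runs without KoNLPy installed; `nouns_from_pos` is a hypothetical helper, and this is an observation about this one example, not a statement about how Okt computes nouns internally.

```python
# Output of okt.pos(...) copied verbatim from the cell above.
pos_output = [
    ('국가', 'Noun'), ('와', 'Josa'), ('전세계', 'Noun'), ('기여', 'Noun'),
    ('하기', 'Verb'), ('위해', 'Noun'), ('예방', 'Noun'), ('적', 'Suffix'),
    ('차원', 'Noun'), ('에서라도', 'Josa'), ('끝', 'Noun'), ('까지', 'Josa'),
    ('투자', 'Noun'), ('할', 'Verb'), ('생각', 'Noun'), ('이다', 'Josa'),
    ('.', 'Punctuation'),
]


def nouns_from_pos(tagged):
    """Keep only the morphemes whose tag is 'Noun'."""
    return [morpheme for morpheme, tag in tagged if tag == 'Noun']


print(nouns_from_pos(pos_output))
```

The same filtering pattern works for any other tag, for example collecting every `Josa` (particle) or `Verb`.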