# Natural Language Processing Demo (ver. 2021)

- This material is based mainly on the following YouTube video: https://youtu.be/xvqsFTUsOmc
    - GitHub: https://github.com/adashofdata/nlp-in-python-tutorial
- The BERT-based analysis in the second half follows the textbook 『BERTによる自然言語処理入門』.
    - https://www.amazon.co.jp/dp/B098J9M4PP/ref=dp_kinw_strp_1

## Two standard text formats:

1. **Corpus** - a collection of text (use when word order must be preserved, e.g., "not happy")
2. **Document-Term Matrix** - word counts in matrix format (use when word frequencies alone are enough, e.g., topic modeling)  
$$\begin{bmatrix} & \text{word 1} & \text{word 2} & \text{word 3} & \dots \\
\text{text 1} & \text{freq} & \text{freq} & \text{freq} & \dots \\
\text{text 2} & \text{freq} & \text{freq} & \text{freq} & \dots \\
\vdots & \vdots & \vdots & \vdots & \ddots \end{bmatrix}$$
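
The difference between the two formats can be seen on a toy example. The sketch below uses two made-up sentences (not taken from the articles) and only the standard library: as token sequences they differ, but once reduced to word counts they become indistinguishable.

```python
from collections import Counter

# Two toy sentences (illustrative only) with opposite meanings.
corpus = ["i am happy not sad", "i am sad not happy"]

# Corpus view: the token order is kept, so the two texts differ.
token_lists = [text.split() for text in corpus]

# Bag-of-words view: only the counts are kept, so the two texts look identical.
counts = [Counter(tokens) for tokens in token_lists]

print(token_lists[0] == token_lists[1])  # False
print(counts[0] == counts[1])            # True
```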

## Getting The Data

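- Natural language processing is far easier for English than for Japanese, because English words are clearly separated by spaces.
- Here we analyze English-language coronavirus-related news from Japan, taken from NHK World-Japan.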

In [1]:
# Web scraping, pickle imports
import requests
from bs4 import BeautifulSoup
import pickle

def url_to_text(url):
    page = requests.get(url).text
    soup = BeautifulSoup(page, "html.parser")
    text = [p.text for p in soup.find(class_ = "p-article__body").find_all('p')]
    # Example of the page source: view-source:https://www3.nhk.or.jp/nhkworld/en/news/20211025_13/
    print(url)
    return text

urls = ['https://www3.nhk.or.jp/nhkworld/en/news/20211025_13/',
        'https://www3.nhk.or.jp/nhkworld/en/news/20211025_08/',
        'https://www3.nhk.or.jp/nhkworld/en/news/20211025_04/',
        'https://www3.nhk.or.jp/nhkworld/en/news/20211024_11/',
        'https://www3.nhk.or.jp/nhkworld/en/news/20211024_13/',
        'https://www3.nhk.or.jp/nhkworld/en/news/20211022_32/']

# Articles: take the trailing numeric part of each URL as the article ID

articles = [url.split('/')[-2] for url in urls]

In [2]:
# Fetch the articles

# texts = [url_to_text(u) for u in urls] # Caution: too many requests in a short time can get the connection refused
# texts[1]
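
One way to reduce the risk of being cut off is to pause between requests. The following is only a sketch: `fetch_all` and `delay_seconds` are hypothetical names introduced here, and the one-second default is an arbitrary assumption rather than a limit published by the site.

```python
import time

def fetch_all(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, sleeping between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # pause before every request except the first
        results.append(fetch(url))
    return results
```

With the function defined in the first cell, this could be called as `texts = fetch_all(urls, url_to_text)`.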

In [3]:
# Pickle files for later use

# for i, c in enumerate(articles):
#     with open("articles/" + c + ".txt", "wb") as file:
#         pickle.dump(texts[i], file)
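
The save step assumes that an `articles/` folder already exists. A minimal round-trip sketch that creates the folder when needed is shown below. `save_articles` and `load_articles` are hypothetical helper names, and the sample text is a placeholder written to a temporary directory.

```python
import pickle
import tempfile
from pathlib import Path

def save_articles(texts_by_id, out_dir):
    """Pickle each article's paragraph list to <out_dir>/<id>.txt."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    for article_id, paragraphs in texts_by_id.items():
        with open(out_dir / f"{article_id}.txt", "wb") as file:
            pickle.dump(paragraphs, file)

def load_articles(article_ids, in_dir):
    """Read the pickled paragraph lists back into a dict keyed by article ID."""
    loaded = {}
    for article_id in article_ids:
        with open(Path(in_dir) / f"{article_id}.txt", "rb") as file:
            loaded[article_id] = pickle.load(file)
    return loaded

# Round trip in a temporary directory with placeholder text.
sample = {"20211025_13": ["First paragraph.", "Second paragraph."]}
with tempfile.TemporaryDirectory() as tmp:
    save_articles(sample, Path(tmp) / "articles")
    restored = load_articles(sample.keys(), Path(tmp) / "articles")

print(restored == sample)  # True
```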

In [4]:
articles

['20211025_13',
 '20211025_08',
 '20211025_04',
 '20211024_11',
 '20211024_13',
 '20211022_32']

In [5]:
# Load pickled files

data = {}
for i, c in enumerate(articles):
    with open("articles/" + c + ".txt", "rb") as file:
        data[c] = pickle.load(file)

In [6]:
data.keys()

dict_keys(['20211025_13', '20211025_08', '20211025_04', '20211024_11', '20211024_13', '20211022_32'])

## Cleaning The Data

**Common data cleaning steps on all text:**
* Make text all lower case
* Remove punctuation
* Remove numerical values
* Remove common nonsensical text (e.g., `\n`)
* Tokenize text
* Remove stop words (e.g., prepositions such as in/on/at)

**More data cleaning steps after tokenization:**
* Stemming / lemmatization (handling derived and inflected forms)
* Create bi-grams or tri-grams (grouping runs of consecutive words)
* Deal with typos
* And more...
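
Several of these steps can be sketched in a few lines of plain Python. The stop-word list below is a tiny illustrative set (the lists shipped with NLTK or scikit-learn are much longer), and `tokenize`, `remove_stop_words`, and `ngrams` are hypothetical helper names, not functions from any library.

```python
import re

# A tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "is", "in", "on", "at", "a", "of"}

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stop_words(tokens):
    """Drop every token that appears in STOP_WORDS."""
    return [t for t in tokens if t not in STOP_WORDS]

def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("The wedding industry is enduring tough times.")
content_tokens = remove_stop_words(tokens)
print(content_tokens)             # ['wedding', 'industry', 'enduring', 'tough', 'times']
print(ngrams(content_tokens, 2))  # ['wedding industry', 'industry enduring', 'enduring tough', 'tough times']
```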

In [7]:
# Each value in data is a list of strings, one per paragraph. We want to join them into a single string (str).

data['20211025_13']

["Japan's wedding industry is enduring tough times as the coronavirus pandemic has been forcing couples to cancel their nuptials, and the organization that oversees the industry has been looking at who should bear the cost.",
 'The Bridal Institutional Association has revised its guidelines for wedding contracts for the first time in 13 years, introducing a clause on what to do in the event of an infectious disease outbreak.',
 'The association is recommending that wedding companies allow rescheduling at no extra cost and reduce their fees for canceling a ceremony if the government has a business suspension request in place.',
 "Japan's Consumer Affairs Center says it has received more than 5,400 complaints about wedding contracts since the start of the pandemic. Some of those cases have developed into lawsuits."]

In [8]:
def combine_text(list_of_text):
    '''Takes a list of text and combines them into one large chunk of text.'''
    combined_text = ' '.join(list_of_text)
    return combined_text

In [9]:
# Combine it!
data_combined = {key: [combine_text(value)] for (key, value) in data.items()}
data_combined['20211025_13'] # the paragraphs are now joined into one string

["Japan's wedding industry is enduring tough times as the coronavirus pandemic has been forcing couples to cancel their nuptials, and the organization that oversees the industry has been looking at who should bear the cost. The Bridal Institutional Association has revised its guidelines for wedding contracts for the first time in 13 years, introducing a clause on what to do in the event of an infectious disease outbreak. The association is recommending that wedding companies allow rescheduling at no extra cost and reduce their fees for canceling a ceremony if the government has a business suspension request in place. Japan's Consumer Affairs Center says it has received more than 5,400 complaints about wedding contracts since the start of the pandemic. Some of those cases have developed into lawsuits."]

In [10]:
# Also prepare the data as a pandas DataFrame
import pandas as pd

data_df = pd.DataFrame.from_dict(data_combined).transpose()
data_df.columns = ['article']
data_df

| | article |
|---|---|
| 20211025_13 | Japan's wedding industry is enduring tough tim... |
| 20211025_08 | The Tokyo Metropolitan Government has lifted a... |
| 20211025_04 | Russia's first and second largest cities of Mo... |
| 20211024_11 | The government of Singapore will make COVID-19... |
| 20211024_13 | Around 130 elementary school students took par... |
| 20211022_32 | The Tokyo Metropolitan Government has opened t... |


In [11]:
# Data cleaning
import re
import string

def clean_text(text):
    '''Make text lowercase, remove selected punctuation, and remove words containing digits.'''
    text = text.lower() # lowercase
    text = re.sub(r'[.,:*!?/-]', '', text) # punctuation (no '|' separators are needed inside [...])
    text = re.sub(r'\w*\d\w*', '', text) # words containing digits (e.g., "covid19" once the hyphen is gone)
    text = re.sub(r'\d', '', text) # any remaining digits
    return text

clean = clean_text
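
The order of the steps in `clean_text` matters. Punctuation is stripped before digit-bearing words are dropped, so a term such as "COVID-19" first becomes "covid19" and then disappears entirely. The sketch below replays the steps on a made-up sentence. Its character class is a simplified assumption and may not match the notebook's pattern exactly.

```python
import re

sample = "COVID-19 vaccinations: more than 5,400 complaints."

# Same order as clean_text: lowercase, strip punctuation, drop digit-bearing words.
step1 = sample.lower()
step2 = re.sub(r'[.,:*!?/-]', '', step1)  # "covid-19" -> "covid19", "5,400" -> "5400"
step3 = re.sub(r'\w*\d\w*', '', step2)    # both digit-bearing tokens are removed

print(step2)
print(step3.split())
```

If such terms should survive, one option is to protect them (for example, by replacing "covid-19" with "covid") before the digit step.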

In [12]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data_df.article.apply(clean))
data_clean

| | article |
|---|---|
| 20211025_13 | japan's wedding industry is enduring tough tim... |
| 20211025_08 | the tokyo metropolitan government has lifted a... |
| 20211025_04 | russia's first and second largest cities of mo... |
| 20211024_11 | the government of singapore will make vaccina... |
| 20211024_13 | around elementary school students took part i... |
| 20211022_32 | the tokyo metropolitan government has opened t... |


### Corpus (コーパス)

We already created a corpus in an earlier step. The definition of a corpus is a collection of texts, and they are all put together neatly in a pandas dataframe here.

- For a definition in Japanese, see https://ja.wikipedia.org/wiki/%E3%82%B3%E3%83%BC%E3%83%91%E3%82%B9
- In short, a corpus is simply a large collection of texts.

In [13]:
# Let's take a look at our dataframe
data_df

| | article |
|---|---|
| 20211025_13 | Japan's wedding industry is enduring tough tim... |
| 20211025_08 | The Tokyo Metropolitan Government has lifted a... |
| 20211025_04 | Russia's first and second largest cities of Mo... |
| 20211024_11 | The government of Singapore will make COVID-19... |
| 20211024_13 | Around 130 elementary school students took par... |
| 20211022_32 | The Tokyo Metropolitan Government has opened t... |


In [14]:
# Let's pickle it for later use
data_df.to_pickle("corpus.pkl")

### Document-Term Matrix（単語文書行列）

For many of the techniques we'll be using in future notebooks, the text must be tokenized, meaning broken down into smaller pieces. The most common tokenization technique is to break down text into words. We can do this using scikit-learn's CountVectorizer, where every row will represent a different document and every column will represent a different word.

In addition, with CountVectorizer, we can remove stop words. Stop words are common words that add no additional meaning to text such as 'a', 'the', etc.

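- Tokenization: a token is a building block of a text, usually a single word and sometimes a run of consecutive words (n-gram). This step is hard for Japanese, where it requires morphological analysis.
- Document-term matrix: a matrix of how often each token (word) occurs. Its shape is (number of documents) × (number of words), so it is typically very wide.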
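
To make the shape of the matrix concrete, here is a minimal pure-Python sketch that builds one by hand from two toy documents. The documents and IDs are made up for illustration. A real vectorizer such as CountVectorizer also handles lowercasing, token patterns, and sparse storage, which this sketch leaves out.

```python
from collections import Counter

# Toy documents (illustrative only); the keys play the role of article IDs.
docs = {
    "doc_a": "tokyo lifts restrictions on restaurants",
    "doc_b": "singapore extends restrictions and restrictions on workplaces",
}

# Tokenize on whitespace and count the tokens of each document.
counts = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}

# The vocabulary (the matrix columns) is the sorted set of all tokens.
vocabulary = sorted(set().union(*counts.values()))

# One row per document, one column per vocabulary entry.
matrix = [[counts[doc_id][term] for term in vocabulary] for doc_id in docs]

print(vocabulary)
for doc_id, row in zip(docs, matrix):
    print(doc_id, row)
```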

In [15]:
# from nltk.corpus import stopwords
# stopwords.words('english')

In [16]:
# We are going to create a document-term matrix using CountVectorizer, and exclude common English stop words
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words = 'english')
data_cv = cv.fit_transform(data_clean.article)
data_dtm = pd.DataFrame(data_cv.toarray(), columns = cv.get_feature_names_out()) # scikit-learn >= 1.0; older versions use get_feature_names()
data_dtm.index = data_clean.index
data_dtm

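| | able | accept | accordingly | activities | activity | affairs | aim | alcohol | allow | allowed | ... | ward | wedding | won | work | workplace | workplaces | world | year | yearend | years |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20211025_13 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | ... | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 20211025_08 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 2 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 20211025_04 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 20211024_11 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 |
| 20211024_13 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 |
| 20211022_32 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |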


In [21]:
from textblob import TextBlob
#import nltk
#nltk.download('punkt') # <= needed on the first run
#nltk.download('wordnet') # <= needed on the first run
#nltk.download('averaged_perceptron_tagger') # <= needed on the first run

# Use TextBlob
def textblob_tokenizer(str_input):
    blob = TextBlob(str_input)
    tag_dict = {"J": 'a', 
                "N": 'n', 
                "V": 'v', 
                "R": 'r'}
    words_and_tags = [(w, tag_dict.get(pos[0], 'n')) for w, pos in blob.tags] # map each word's POS tag to a WordNet code (default: noun)
    words = [wd.lemmatize(tag) for wd, tag in words_and_tags] # lemmatize(): activities => activity
    return words
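
The tag mapping inside the tokenizer can be looked at in isolation. The sketch below assumes that the tagger returns Penn Treebank tags such as `NNS` or `VBD`, and that the lemmatizer expects the one-letter WordNet codes `a`, `n`, `v`, `r`. `penn_to_wordnet` is a hypothetical helper name used only for this illustration.

```python
def penn_to_wordnet(penn_tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech code, defaulting to noun."""
    first_letter_to_pos = {"J": "a",  # adjectives (JJ, JJR, JJS)
                           "N": "n",  # nouns (NN, NNS, NNP, NNPS)
                           "V": "v",  # verbs (VB, VBD, VBG, ...)
                           "R": "r"}  # adverbs (RB, RBR, RBS)
    return first_letter_to_pos.get(penn_tag[0], "n")

for tag in ["JJ", "NNS", "VBD", "RB", "IN"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Only the first letter of the tag is inspected, so any tag outside the four groups (for example the preposition tag `IN`) falls back to noun.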

In [22]:
cv = CountVectorizer(stop_words='english', tokenizer = textblob_tokenizer)
data_cv = cv.fit_transform(data_clean.article)
data_dtm = pd.DataFrame(data_cv.toarray(), columns = cv.get_feature_names_out()) # scikit-learn >= 1.0; older versions use get_feature_names()
data_dtm.index = data_clean.index
data_dtm



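| | 's | able | accept | accordingly | activity | affair | aim | alcohol | allow | announce | ... | walkin | ward | wed | wedding | win | work | workplace | world | year | yearend |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20211025_13 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 3 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 20211025_08 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | 2 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 20211025_04 | 3 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 20211024_11 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 1 | 0 |
| 20211024_13 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 2 | 0 |
| 20211022_32 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |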


In [23]:
# Let's pickle it for later use
data_dtm.to_pickle("dtm.pkl")

In [24]:
# Let's also pickle the cleaned data (before we put it in document-term matrix format) and the CountVectorizer object
data_clean.to_pickle('data_clean.pkl')
with open("cv.pkl", "wb") as file:
    pickle.dump(cv, file)