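# Machine Learning
## In this chapter, we use the News Aggregator Data Set published by Fabio Gasparetti to classify news headlines into the categories "business", "science and technology", "entertainment", and "health" (category classification).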
#### https://nlp100.github.io/ja/ch06.html

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/iamtatsuki05/NLP_100/blob/fix_all_merge/NLP_100_6.ipynb)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

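# Obtaining and formatting the data
## Download the News Aggregator Data Set and create the training data (train.txt), validation data (valid.txt), and test data (test.txt) as follows.

### Unzip the downloaded zip file and read the description in readme.txt.
### Extract only the examples (articles) whose publisher is "Reuters", "Huffington Post", "Businessweek", "Contactmusic.com", or "Daily Mail".
### Shuffle the extracted examples randomly.
### Use 80% of the extracted examples as training data and 10% each as validation and test data, and save them as train.txt, valid.txt, and test.txt. Write one example per line in tab-separated format: category name, then headline (these files are reused later in problem 70).
### After creating the training and test data, check the number of examples in each category.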

In [None]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00359/NewsAggregatorDataset.zip

In [None]:
!unzip NewsAggregatorDataset.zip

In [None]:
# Use a context manager so the file is closed after reading
with open('readme.txt', 'r') as f:
    data = f.read()
print(data)

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('/content/newsCorpora.csv', header=None, sep='\t', names=['ID', 'TITLE', 'URL', 'PUBLISHER', 'CATEGORY', 'STORY', 'HOSTNAME', 'TIMESTAMP'])
df = df.loc[df['PUBLISHER'].isin(['Reuters', 'Huffington Post', 'Businessweek', 'Contactmusic.com', 'Daily Mail']), ['TITLE', 'CATEGORY']]

dataset_train, other = train_test_split(df, test_size=0.2, shuffle=True, random_state=42 , stratify=df['CATEGORY'])
dataset_valid, dataset_test = train_test_split(other, test_size=0.5, shuffle=True, random_state=42 , stratify=other['CATEGORY'])

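# Number of examples per category in each split
print('train:', dataset_train['CATEGORY'].value_counts(), sep='\n')
print('test:', dataset_test['CATEGORY'].value_counts(), sep='\n')
print('valid:', dataset_valid['CATEGORY'].value_counts(), sep='\n')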

In [None]:
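# One example per line: category, then headline, tab-separated, no header row
dataset_train[['CATEGORY', 'TITLE']].to_csv('./train.txt', sep='\t', index=False, header=False)
dataset_test[['CATEGORY', 'TITLE']].to_csv('./test.txt', sep='\t', index=False, header=False)
dataset_valid[['CATEGORY', 'TITLE']].to_csv('./valid.txt', sep='\t', index=False, header=False)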

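The shuffle, 80/10/10 split, and per-category count described above can be sketched without any libraries. This is a minimal illustration on made-up headlines, not the notebook's method: it uses a plain shuffle rather than the stratified `train_test_split` call, and the names `examples`, `splits`, and `to_lines` are assumptions introduced for the example.

```python
import random
from collections import Counter

# Toy (category, title) pairs standing in for the filtered articles.
examples = [("b", f"business headline {i}") for i in range(10)]
examples += [("e", f"entertainment headline {i}") for i in range(10)]

# Shuffle with a fixed seed so the split is reproducible.
random.Random(42).shuffle(examples)

# 80% train, 10% valid, and the remainder as test.
n_train = int(len(examples) * 0.8)
n_valid = int(len(examples) * 0.1)
splits = {
    "train": examples[:n_train],
    "valid": examples[n_train:n_train + n_valid],
    "test": examples[n_train + n_valid:],
}


def to_lines(rows):
    """Format rows as 'category<TAB>title', one example per line."""
    return [f"{category}\t{title}" for category, title in rows]


for name, rows in splits.items():
    counts = Counter(category for category, _ in rows)
    print(name, len(rows), dict(counts))
```

Because the shuffle is not stratified, the category proportions can differ between splits. Passing `stratify=` to `train_test_split`, as the cell above does, avoids that.
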
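# Feature extraction
## Extract features from the training, validation, and test data, and save them as train.feature.txt, valid.feature.txt, and test.feature.txt. Design whatever features seem useful for category classification. The headline converted into a sequence of words is probably the minimum baseline.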

In [None]:
# CATEGORY: news category (b = business, t = science and technology, e = entertainment, m = health)
df = pd.concat([dataset_train, dataset_test, dataset_valid], axis=0)
df

In [None]:
import string
import re

# Reset the index
df.reset_index(drop=True, inplace=True)
# Lowercase the headlines
df['TITLE'] = df['TITLE'].str.lower()
# Replace punctuation with spaces
table = str.maketrans(string.punctuation, ' '*len(string.punctuation))
df['TITLE'] = df['TITLE'].str.translate(table)
df

In [None]:
# train, test, valid = df[:len(train)], df[len(train):len(train) + len(test)], df[len(train) + len(test):] 

## It is better to preprocess train and valid together in the same state and split them just before use; I lost time on this

# # Vectorize the text
# from sklearn.feature_extraction.text import TfidfVectorizer

# vec_tfidf = TfidfVectorizer(min_df=10, ngram_range=(1, 2))

# train_title = vec_tfidf.fit_transform(train['TITLE'])
# test_title = vec_tfidf.fit_transform(test['TITLE'])
# valid_title = vec_tfidf.fit_transform(valid['TITLE'])

# X_train = pd.DataFrame(train_title.toarray())
# X_test = pd.DataFrame(test_title.toarray())
# X_valid = pd.DataFrame(valid_title.toarray())
# X_test

In [None]:
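# Vectorize the text
from sklearn.feature_extraction.text import TfidfVectorizer

vec_tfidf = TfidfVectorizer(min_df=10, ngram_range=(1, 2))
# Note: this fits the vocabulary and IDF on all splits; fitting on the training split only would avoid using test data here
df_title = vec_tfidf.fit_transform(df['TITLE'])

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df = pd.DataFrame(df_title.toarray(), columns=vec_tfidf.get_feature_names_out())
X_train, X_test, X_valid = df[:len(dataset_train)], df[len(dataset_train):len(dataset_train) + len(dataset_test)], df[len(dataset_train) + len(dataset_test):]

# Save the features under the file names the task asks for
X_train.to_csv('./train.feature.txt', sep='\t', index=False)
X_valid.to_csv('./valid.feature.txt', sep='\t', index=False)
X_test.to_csv('./test.feature.txt', sep='\t', index=False)
X_train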

In [None]:
# References: https://qiita.com/Yuu94/items/ca1ad7eb1dd5c9686c5a
# https://qiita.com/tag1216/items/df6c93bdb823dd48af6c
# https://pyhoo.jp/upper-lower

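To make the TF-IDF features more concrete, here is a small pure-Python sketch of the weighting. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is my understanding of the `TfidfVectorizer` defaults. The helper names `fit_idf` and `transform` are hypothetical. Unlike the cell above, the sketch learns the IDF weights from the training headlines only and reuses them for unseen text.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Learn smoothed IDF weights from tokenized training documents."""
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(token for doc in docs for token in set(doc))
    return {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}


def transform(doc, idf):
    """Map one tokenized document to an L2-normalized TF-IDF dict.

    Tokens that were not seen while fitting are dropped.
    """
    tf = Counter(t for t in doc if t in idf)
    vec = {t: count * idf[t] for t, count in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


train_docs = [
    "stocks fall on fed news".split(),
    "fed raises rates".split(),
    "new movie tops box office".split(),
]
idf = fit_idf(train_docs)

# "shocks" and "fans" never occurred in the training headlines, so they are ignored.
vec = transform("fed news shocks movie fans".split(), idf)
print(vec)
```

Tokens that occur in many training headlines (here `fed`) get a lower weight than rarer ones (here `movie`).
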
# Training
## Train a logistic regression model using the training data built in problem 51.

In [None]:
from sklearn.linear_model import LogisticRegression

lg = LogisticRegression(penalty='l2', solver='sag', random_state=42, max_iter=100)
lg.fit(X_train, dataset_train['CATEGORY'])

In [None]:
# https://qiita.com/fujin/items/f5656afc8a40fcf55386

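# Prediction
## Using the logistic regression model trained in problem 52, implement a program that computes the category and its predicted probability for a given headline.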

In [None]:
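train_pred = lg.predict(X_train)
# Probability of the predicted category (row-wise maximum of predict_proba)
train_prob = lg.predict_proba(X_train).max(axis=1)
train_pred, train_prob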

In [None]:
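test_pred = lg.predict(X_test)
# Probability of the predicted category (row-wise maximum of predict_proba)
test_prob = lg.predict_proba(X_test).max(axis=1)
test_pred, test_prob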

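The step from feature weights to a category and its probability can be written out by hand. The sketch below assumes a multinomial (softmax) logistic regression. The weights, biases, and the helper `predict_with_proba` are made up for illustration and are not taken from the trained model.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_proba(features, weights, biases, classes):
    """Return the most probable class and its probability.

    features: dict mapping feature name -> value (e.g. TF-IDF weight)
    weights:  one dict of feature weights per class
    """
    scores = [
        sum(w.get(name, 0.0) * value for name, value in features.items()) + b
        for w, b in zip(weights, biases)
    ]
    probs = softmax(scores)
    best = max(range(len(classes)), key=probs.__getitem__)
    return classes[best], probs[best]


# Hypothetical model with four classes and a handful of weights.
classes = ["b", "e", "m", "t"]
weights = [{"stocks": 2.0, "fed": 1.5}, {"movie": 2.5}, {"ebola": 3.0}, {"google": 2.0}]
biases = [0.0, 0.0, 0.0, 0.0]

label, prob = predict_with_proba({"fed": 0.7, "stocks": 0.7}, weights, biases, classes)
print(label, round(prob, 3))
```

With scikit-learn, `predict_proba` returns the full probability row, with columns ordered as in `classes_`.
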
# Measuring accuracy
## Measure the accuracy of the logistic regression model trained in problem 52 on the training data and the test data.

In [None]:
from sklearn.metrics import accuracy_score

train_accuracy = accuracy_score(dataset_train['CATEGORY'], train_pred)
print(f'train accuracy: {train_accuracy:.3f}')

In [None]:
test_accuracy = accuracy_score(dataset_test['CATEGORY'], test_pred)
print(f'test accuracy: {test_accuracy:.3f}')

# Creating the confusion matrix
## Create the confusion matrix of the logistic regression model trained in problem 52 on the training data and the test data.

In [None]:
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

train_cm = confusion_matrix(dataset_train['CATEGORY'], train_pred)
print(train_cm)
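# Annotate the cells with counts and label both axes with the category names
sns.heatmap(train_cm, annot=True, fmt='d', cmap='Greens', xticklabels=lg.classes_, yticklabels=lg.classes_)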

In [None]:
test_cm = confusion_matrix(dataset_test['CATEGORY'], test_pred)
print(test_cm)
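# Annotate the cells with counts and label both axes with the category names
sns.heatmap(test_cm, annot=True, fmt='d', cmap='Greens', xticklabels=lg.classes_, yticklabels=lg.classes_)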

In [None]:
# Reference: https://note.nkmk.me/python-sklearn-confusion-matrix-score/

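For reference, a confusion matrix is just a table of (true label, predicted label) counts. The sketch below builds one by hand on made-up labels, with rows for the true category and columns for the predicted one, the same orientation `sklearn.metrics.confusion_matrix` uses. The function name `build_confusion_matrix` is an assumption for the example.

```python
from collections import Counter


def build_confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


labels = ["b", "e", "m", "t"]
y_true = ["b", "b", "e", "e", "m", "t", "t", "b"]
y_pred = ["b", "t", "e", "e", "b", "t", "b", "b"]

cm = build_confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, cm):
    print(label, row)
# b [2, 0, 0, 1]
# e [0, 2, 0, 0]
# m [1, 0, 0, 0]
# t [1, 0, 0, 1]

# Accuracy is the sum of the diagonal divided by the number of examples.
accuracy = sum(cm[i][i] for i in range(len(labels))) / len(y_true)
print(accuracy)  # 0.625
```
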
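# Measuring precision, recall, and F1 score
## Measure the precision, recall, and F1 score of the logistic regression model trained in problem 52 on the test data. Compute precision, recall, and F1 score for each category, and combine the per-category results with the micro-average and the macro-average.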

In [None]:
# from sklearn.metrics import precision_score, recall_score, f1_score
# import numpy as np

# # Precision
# precision = precision_score(dataset_test['CATEGORY'], test_pred, average=None, labels=['b', 'e', 't', 'm'])
# precision = np.append(precision, precision_score(dataset_test['CATEGORY'], test_pred, average='micro'))
# precision = np.append(precision, precision_score(dataset_test['CATEGORY'], test_pred, average='macro'))

# # Recall
# recall = recall_score(dataset_test['CATEGORY'], test_pred, average=None, labels=['b', 'e', 't', 'm'])
# recall = np.append(recall, recall_score(dataset_test['CATEGORY'], test_pred, average='micro'))
# recall = np.append(recall, recall_score(dataset_test['CATEGORY'], test_pred, average='macro'))

# # F1 score
# f1 = f1_score(dataset_test['CATEGORY'], test_pred, average=None, labels=['b', 'e', 't', 'm'])
# f1 = np.append(f1, f1_score(dataset_test['CATEGORY'], test_pred, average='micro'))
# f1 = np.append(f1, f1_score(dataset_test['CATEGORY'], test_pred, average='macro'))

# score_dataframe = pd.DataFrame({'precision': precision, 'recall': recall, 'f1': f1},
#                         index=['b', 'e', 't', 'm', 'micro avg', 'macro avg'])
# score_dataframe

In [None]:
from sklearn.metrics import classification_report

# target_names = ['b', 'e', 'm', 't']
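print('test_data\n')
# For single-label multiclass data, the 'accuracy' row equals the micro-averaged precision, recall, and F1
print(classification_report(dataset_test['CATEGORY'], test_pred))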

In [None]:
# References: https://qiita.com/g-k/items/14bf10cce79f0db72b24
# https://qiita.com/shoku-pan/items/cef42c9a3f472eb571b9
# https://note.nkmk.me/python-sklearn-confusion-matrix-score/
# https://gotutiyan.hatenablog.com/entry/2020/09/09/111840

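The difference between the two averages is easiest to see when computed by hand. The sketch below uses made-up labels and hypothetical helpers (`per_class_counts`, `prf`). The macro average is the mean of the per-category scores, while the micro average pools the true positives, false positives, and false negatives across categories first.

```python
def per_class_counts(y_true, y_pred, labels):
    """Return {label: (tp, fp, fn)} for each category."""
    stats = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        stats[c] = (tp, fp, fn)
    return stats


def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


labels = ["b", "e", "m", "t"]
y_true = ["b", "b", "e", "e", "m", "t", "t", "b"]
y_pred = ["b", "t", "e", "e", "b", "t", "b", "b"]

stats = per_class_counts(y_true, y_pred, labels)
per_class = {c: prf(*stats[c]) for c in labels}

# Macro average: unweighted mean of the per-category scores.
macro = tuple(sum(per_class[c][i] for c in labels) / len(labels) for i in range(3))

# Micro average: pool the counts over all categories, then compute the scores once.
micro = prf(*(sum(stats[c][i] for c in labels) for i in range(3)))

print("macro:", macro)
print("micro:", micro)
```

On this toy data the micro-averaged precision, recall, and F1 are all 0.625, the same as the accuracy. The macro average is lower because the category `m`, which is never predicted, counts as much as the others.
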
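# Checking feature weights
## For the logistic regression model trained in problem 52, check the top 10 features with the highest weights and the top 10 features with the lowest weights.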

In [None]:
import numpy as np

cols = X_train.columns.values
index = list(range(1, 11))  # ranks 1 to 10
for category, score in zip(lg.classes_, lg.coef_):
    print(f'[Category] {category}')
    high = pd.DataFrame(cols[np.argsort(score)[::-1][:10]], columns=['highest-weight features'], index=index)
    low = pd.DataFrame(cols[np.argsort(score)[:10]], columns=['lowest-weight features'], index=index)
    frame = pd.concat([high, low], axis=1)
    print(frame)

In [None]:
# References: https://gotutiyan.hatenablog.com/entry/2020/09/08/122621#classes_%E3%81%A9%E3%81%AE%E3%83%A9%E3%83%99%E3%83%AB%E3%81%8C%E3%81%A9%E3%81%AEID%E3%81%AA%E3%81%AE%E3%81%8B%E3%82%92%E5%8F%96%E5%BE%97
# https://qiita.com/rubberduck/items/c2ade1ac10c80651f4b1

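The ranking step is a sort of (feature, weight) pairs. Here is a small stand-alone sketch with invented weights for one category. The function `top_and_bottom` and the numbers are assumptions for illustration, not values from the trained model.

```python
def top_and_bottom(features, weights, k):
    """Return the k highest-weight and k lowest-weight feature names."""
    ranked = sorted(zip(features, weights), key=lambda fw: fw[1])  # ascending by weight
    lowest = [name for name, _ in ranked[:k]]
    highest = [name for name, _ in ranked[::-1][:k]]
    return highest, lowest


# Hypothetical weights of the "b" (business) classifier row.
features = ["bank", "fed", "movie", "star", "ebola", "google"]
weights_b = [1.8, 2.1, -1.2, -0.9, -0.4, 0.3]

highest, lowest = top_and_bottom(features, weights_b, k=2)
print(highest)  # ['fed', 'bank']
print(lowest)   # ['movie', 'star']
```

A large positive weight pushes a headline toward the category, and a large negative weight pushes it away.
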
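# Changing the regularization parameter
## When training a logistic regression model, the degree of overfitting can be controlled by adjusting the regularization parameter. Train logistic regression models with different regularization parameters and compute the accuracy on the training, validation, and test data. Summarize the results in a graph with the regularization parameter on the horizontal axis and accuracy on the vertical axis.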

In [None]:
from sklearn.linear_model import LogisticRegression

result = []
Cs = [10 ** i for i in range(-3, 3)]

for C in Cs:
    lg = LogisticRegression(penalty='l2', solver='sag', random_state=42, max_iter=100, C=C)
    lg.fit(X_train, dataset_train['CATEGORY'])
    # Predictions
    train_pred = lg.predict(X_train)
    valid_pred = lg.predict(X_valid)
    test_pred = lg.predict(X_test)

    # Accuracy
    train_accuracy = accuracy_score(dataset_train['CATEGORY'], train_pred)
    valid_accuracy = accuracy_score(dataset_valid['CATEGORY'], valid_pred)
    test_accuracy = accuracy_score(dataset_test['CATEGORY'], test_pred)

    # Store the results
    result.append([C, train_accuracy, valid_accuracy, test_accuracy])

In [None]:
result = np.array(result).T
plt.plot(result[0], result[1], label='train')
plt.plot(result[0], result[2], label='valid')
plt.plot(result[0], result[3], label='test')
plt.xscale('log')  # C spans several orders of magnitude
plt.ylim(0, 1.1)
plt.ylabel('accuracy')
plt.xlabel('C (inverse of regularization strength)')
plt.legend()
plt.show()

In [None]:
# References: http://harmonizedai.com/article/%E3%83%91%E3%83%A9%E3%83%A1%E3%83%BC%E3%82%BF%E3%83%BC-c%E3%80%80%E3%83%AD%E3%82%B8%E3%82%B9%E3%83%86%E3%82%A3%E3%83%83%E3%82%AF%E5%9B%9E%E5%B8%B0%E3%81%AE%E3%83%8F%E3%82%A4%E3%83%91%E3%83%BC%E3%83%91/
# https://machine-earning.net/article/numpy-logspace/

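To see why `C` controls how strongly the model fits the training data, here is a toy one-dimensional sketch. It assumes the objective `C * sum(log-loss) + 0.5 * w**2`, that is, `C` acting as the inverse of the L2 regularization strength, and minimizes it with plain gradient descent. The data, learning rate, and the function `fit_weight` are made up for illustration and do not reproduce the `sag` solver.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_weight(xs, ys, C, lr=0.01, steps=5000):
    """Minimize C * sum(log-loss) + 0.5 * w**2 for a 1-D model without bias."""
    w = 0.0
    for _ in range(steps):
        # Gradient of the data term plus gradient of the L2 penalty.
        grad = C * sum((sigmoid(w * x) - y) * x for x, y in zip(xs, ys)) + w
        w -= lr * grad
    return w


# Linearly separable toy data: negative x -> class 0, positive x -> class 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

weights = {C: fit_weight(xs, ys, C) for C in (0.01, 1, 100)}
for C, w in weights.items():
    print(f"C={C}: w={w:.3f}")
```

A small `C` keeps the weight close to zero (strong regularization), while a large `C` lets it grow to fit the training data more tightly. This is the trade-off the accuracy curves above are meant to show.
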
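# Hyperparameter search
## Train category classification models while varying the learning algorithm and its parameters. Find the algorithm and parameters that give the highest accuracy on the validation data, and report the accuracy on the test data with that algorithm and those parameters.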

In [None]:
import itertools

def calc_scores(C, class_weight):
    model = LogisticRegression(random_state=42, max_iter=100, C=C, class_weight=class_weight)
    model.fit(X_train, dataset_train['CATEGORY'])

    train_pred = model.predict(X_train)
    valid_pred = model.predict(X_valid)
    test_pred = model.predict(X_test)

    scores = []
    scores.append(accuracy_score(dataset_train['CATEGORY'], train_pred))
    scores.append(accuracy_score(dataset_valid['CATEGORY'], valid_pred))
    scores.append(accuracy_score(dataset_test['CATEGORY'], test_pred))
    return scores

In [None]:
# Search space and bookkeeping
cs = [10 ** _ for _ in range(-3, 3)]
class_weight = [None, 'balanced']
best_parameter = None
best_scores = None
max_valid_score = 0

# Grid search: keep the setting with the best validation accuracy
for c, w in itertools.product(cs, class_weight):
    print(c, w)
    scores = calc_scores(c, w)
    if scores[1] > max_valid_score:
        max_valid_score = scores[1]
        best_parameter = [c, w]
        best_scores = scores

# Results
print(f'C: {best_parameter[0]} class_weight: {best_parameter[1]}')
print(f'train score: {best_scores[0]} valid score: {best_scores[1]} test score: {best_scores[2]}')

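The search loop above generalizes to any number of parameters. The sketch below separates the grid enumeration from the scoring so that only validation accuracy drives the choice. The `fake_validation_accuracy` function is a stand-in with invented numbers, not a trained model, and `grid_search` is a hypothetical helper.

```python
import itertools
import math


def grid_search(param_grid, evaluate):
    """Try every parameter combination and keep the best validation score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score > best_score:  # ties keep the earlier combination
            best_params, best_score = params, score
    return best_params, best_score


def fake_validation_accuracy(C, class_weight):
    """Placeholder for 'train a model, return its validation accuracy'."""
    base = 0.9 - 0.05 * abs(math.log10(C))  # invented curve peaking at C=1
    return base + (0.01 if class_weight == "balanced" else 0.0)


param_grid = {
    "C": [10 ** i for i in range(-3, 3)],
    "class_weight": [None, "balanced"],
}
best_params, best_score = grid_search(param_grid, fake_validation_accuracy)
print(best_params, round(best_score, 3))
```

The test data should be scored only once, with the parameters selected this way, so that the reported test accuracy is not itself tuned.
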
In [None]:
# LightGBM
# Use the classifier: a regressor would treat the category IDs as ordered numbers,
# and its score() would return R^2 rather than accuracy
from lightgbm import LGBMClassifier

category_dict = {'b': 0, 't': 1, 'e': 2, 'm': 3}
y_train = dataset_train['CATEGORY'].map(lambda x: category_dict[x]).values
y_valid = dataset_valid['CATEGORY'].map(lambda x: category_dict[x]).values
y_test = dataset_test['CATEGORY'].map(lambda x: category_dict[x]).values

model = LGBMClassifier(random_state=42, n_jobs=-1)
model.fit(X_train, y_train)

In [None]:
model.score(X_train, y_train)

In [None]:
model.predict(X_test)

In [None]:
model.score(X_test, y_test)

In [None]:
model.predict(X_valid)

In [None]:
model.score(X_valid, y_valid)

In [None]:
# References: https://docs.python.org/ja/3/library/itertools.html
# https://techacademy.jp/magazine/46131
# https://qiita.com/kimisyo/items/f7755eb6846193e3ae23
# https://datadriven-rnd.com/lightgbm/