# Chapter 6: Machine Learning

## 50. Obtaining and Formatting the Data
Download the News Aggregator Data Set and create the training data (train.txt), validation data (valid.txt), and test data (test.txt) as follows.

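1. Unzip the downloaded zip file and read the description in readme.txt.
2. Extract only the instances (articles) whose publisher is "Reuters", "Huffington Post", "Businessweek", "Contactmusic.com", or "Daily Mail".
3. Shuffle the extracted instances randomly.
4. Split the extracted instances into 80% training data and 10% each for validation and test data, and save them as train.txt, valid.txt, and test.txt. Write one instance per line, in tab-separated format with the category name followed by the article headline (these files are reused later in problem 70).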

After creating the training and test data, check the number of instances in each category.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [2]:
file_name = "newsCorpora.csv"
# Load the corpus into a DataFrame with column labels
df = pd.read_csv(file_name, sep='\t', header=None, names=["ID", "TITLE", "URL", "PUBLISHER", "CATEGORY", "STORY", "HOSTNAME", "TIMESTAMP"])

In [3]:
# Keep only the rows whose "PUBLISHER" is "Reuters", "Huffington Post", "Businessweek", "Contactmusic.com", or "Daily Mail"
df = df[df["PUBLISHER"].isin(["Reuters", "Huffington Post", "Businessweek", "Contactmusic.com", "Daily Mail"])]
print(df)

            ID                                              TITLE  \
12          13       Europe reaches crunch point on banking union   
13          14  ECB FOCUS-Stronger euro drowns out ECB's messa...   
19          20  Euro Anxieties Wane as Bunds Top Treasuries, S...   
20          21  Noyer Says Strong Euro Creates Unwarranted Eco...   
29          30  REFILE-Bad loan triggers key feature in ECB ba...   
...        ...                                                ...   
422300  422819     UN: Ebola Could Eventually Infect 20000 People   
422301  422820  Ebola toll tops 1550, continues to accelerate ...   
422302  422821  UPDATE 1-Ebola toll tops 1550, outbreak accele...   
422309  422828  Ebola Cases May Surpass 20000, WHO Says in Upd...   
422319  422838  Air France suspends flights to Ebola-hit Sierr...   

                                                      URL        PUBLISHER  \
12      http://in.reuters.com/article/2014/03/10/eu-ba...          Reuters   
13      http://

In [4]:
train,val_test = train_test_split(df, test_size=0.2, random_state=42, shuffle=True, stratify=df['CATEGORY'])
valid, test = train_test_split(val_test, test_size=0.5, random_state=42, shuffle=True, stratify=val_test['CATEGORY'])
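
The two-stage split above first holds out 20% of the rows and then halves that hold-out, which gives the 80/10/10 proportions the task asks for. A minimal sketch on a toy DataFrame (the rows are made up for illustration) that checks the resulting sizes and the class balance preserved by `stratify`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 rows, 60% category "b" and 40% category "e".
toy = pd.DataFrame({
    "TITLE": [f"headline {i}" for i in range(100)],
    "CATEGORY": ["b"] * 60 + ["e"] * 40,
})

# Stage 1: hold out 20% of the rows. Stage 2: split the hold-out in half.
train_part, rest = train_test_split(
    toy, test_size=0.2, random_state=42, shuffle=True, stratify=toy["CATEGORY"]
)
valid_part, test_part = train_test_split(
    rest, test_size=0.5, random_state=42, shuffle=True, stratify=rest["CATEGORY"]
)

sizes = (len(train_part), len(valid_part), len(test_part))
print(sizes)
# Share of each category in the training part (should mirror the full data).
print(train_part["CATEGORY"].value_counts(normalize=True))
```

Because the second split is stratified on the hold-out's own labels, the validation and test parts keep roughly the same category mix as the training part.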

In [5]:
train[["CATEGORY", "TITLE"]].to_csv('./data/train.txt', sep='\t', index=False)
valid[["CATEGORY", "TITLE"]].to_csv('./data/valid.txt', sep='\t', index=False)
test[["CATEGORY", "TITLE"]].to_csv('./data/test.txt', sep='\t', index=False)
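
The task asks for one instance per line, while `to_csv` writes a header row by default, so the files above start with a `CATEGORY`/`TITLE` line (which the later `pd.read_csv` calls rely on). If a header-free file is wanted instead, the round trip could look like the sketch below. The in-memory buffer and the toy rows are assumptions made only to keep the example self-contained.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "CATEGORY": ["b", "e"],
    "TITLE": ["Europe reaches crunch point on banking union", "A toy headline"],
})

# Write without the header row: exactly one instance per line.
buffer = io.StringIO()
toy[["CATEGORY", "TITLE"]].to_csv(buffer, sep="\t", index=False, header=False)
text = buffer.getvalue()

# Reading back then needs header=None and explicit column names.
restored = pd.read_csv(
    io.StringIO(text), sep="\t", header=None, names=["CATEGORY", "TITLE"]
)
print(text)
print(restored)
```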

In [6]:
print("-----train-----")
print(train["CATEGORY"].value_counts())
print("-----valid-----")
print(valid["CATEGORY"].value_counts())
print("-----test-----")
print(test["CATEGORY"].value_counts())

-----train-----
b    4502
e    4223
t    1219
m     728
Name: CATEGORY, dtype: int64
-----valid-----
b    562
e    528
t    153
m     91
Name: CATEGORY, dtype: int64
-----test-----
b    563
e    528
t    152
m     91
Name: CATEGORY, dtype: int64


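## 51. Feature Extraction
Extract features from the training, validation, and test data, and save them as train.feature.txt, valid.feature.txt, and test.feature.txt. Design whatever features seem useful for category classification. The minimal baseline would be the article headline converted into a word sequence.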

In [8]:
import string
import re

In [10]:
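def preprocessing_text(text):
    # Replace every ASCII punctuation character with a space
    for p in string.punctuation:
        text = text.replace(p, " ")
    # Lowercase, then collapse each run of digits into a single "0"
    text = text.lower()
    text = re.sub('[0-9]+', '0', text)
    return text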
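
To make the effect of this normalisation concrete, here is the same function applied to a fragment of one of the headlines from the corpus printout in problem 50 (the function is repeated so the snippet runs on its own):

```python
import re
import string

def preprocessing_text(text):
    for p in string.punctuation:
        text = text.replace(p, " ")
    text = text.lower()
    text = re.sub("[0-9]+", "0", text)
    return text

# The hyphen becomes a space and both numbers collapse to "0".
print(preprocessing_text("UPDATE 1-Ebola toll tops 1550"))
# prints: update 0 ebola toll tops 0

# An apostrophe is punctuation too, so a possessive leaves a lone "s".
print(preprocessing_text("ECB's"))
# prints: ecb s
```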

In [11]:
train['TITLE'] = train['TITLE'].map(preprocessing_text)
valid['TITLE'] = valid['TITLE'].map(preprocessing_text)
test['TITLE'] = test['TITLE'].map(preprocessing_text)

### CountVectorizer

In [13]:
from sklearn.feature_extraction.text import CountVectorizer

In [110]:
vectorizer = CountVectorizer()

In [113]:
train_count = vectorizer.fit_transform(train['TITLE'])
valid_count = vectorizer.transform(valid['TITLE'])
test_count = vectorizer.transform(test['TITLE'])

In [115]:
train_count = pd.DataFrame(train_count.toarray(), columns=vectorizer.get_feature_names_out())
valid_count = pd.DataFrame(valid_count.toarray(), columns=vectorizer.get_feature_names_out())
test_count = pd.DataFrame(test_count.toarray(), columns=vectorizer.get_feature_names_out())
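
Calling `toarray()` materialises every zero of the document-term matrix, which can be memory-hungry for a large vocabulary. scikit-learn estimators such as `LogisticRegression` also accept the sparse matrix returned by the vectorizer directly. The sketch below, on made-up headlines, shows both routes producing the same predictions; `max_iter=1000` is an arbitrary assumption.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

titles = [
    "euro bank rates rise",
    "bank shares fall on euro news",
    "singer celebrates birthday",
    "film star celebrates award",
]
labels = ["b", "b", "e", "e"]

x_sparse = CountVectorizer().fit_transform(titles)  # SciPy sparse matrix
x_dense = x_sparse.toarray()                        # dense NumPy array

clf_sparse = LogisticRegression(random_state=42, max_iter=1000).fit(x_sparse, labels)
clf_dense = LogisticRegression(random_state=42, max_iter=1000).fit(x_dense, labels)

# Both representations carry the same information.
same = np.array_equal(clf_sparse.predict(x_sparse), clf_dense.predict(x_dense))
print(sparse.issparse(x_sparse), x_sparse.shape, same)
```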

In [57]:
train_count.to_csv('./data_count/train.feature.txt', sep='\t', index=False)
valid_count.to_csv('./data_count/valid.feature.txt', sep='\t', index=False)
test_count.to_csv('./data_count/test.feature.txt', sep='\t', index=False)

In [116]:
train_count.loc[0][train_count.loc[0]>0]

and           1
birthday      1
celebrates    1
daughter      1
day           1
father        1
first         1
kanye         2
west          2
Name: 0, dtype: int64

### tf-idf

In [68]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [120]:
vectorizer = TfidfVectorizer()

In [121]:
train_tfidf = vectorizer.fit_transform(train['TITLE'])
valid_tfidf = vectorizer.transform(valid['TITLE'])
test_tfidf = vectorizer.transform(test['TITLE'])


In [122]:
train_tfidf = pd.DataFrame(train_tfidf.toarray(), columns=vectorizer.get_feature_names_out())
valid_tfidf = pd.DataFrame(valid_tfidf.toarray(), columns=vectorizer.get_feature_names_out())
test_tfidf = pd.DataFrame(test_tfidf.toarray(), columns=vectorizer.get_feature_names_out())

In [84]:
train_tfidf.to_csv('./data_tfidf/train.feature.txt', sep='\t', index=False)
valid_tfidf.to_csv('./data_tfidf/valid.feature.txt', sep='\t', index=False)
test_tfidf.to_csv('./data_tfidf/test.feature.txt', sep='\t', index=False)
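
For reference, with its default settings `TfidfVectorizer` weights a term by its count times a smoothed inverse document frequency, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, and then scales each document vector to unit L2 norm. The sketch below reproduces that computation by hand on a made-up three-document corpus. The helper name `tfidf_row` is hypothetical, and the whitespace tokenisation is a simplification of what the vectorizer does.

```python
import math
from collections import Counter

docs = [
    "ebola toll tops",
    "ebola cases may surpass",
    "euro anxieties wane",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_row(tokens):
    """Return {term: weight} for one document, L2-normalised."""
    counts = Counter(tokens)
    raw = {
        t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
        for t, c in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

row = tfidf_row(tokenized[0])
print(row)
```

In the first document, `ebola` appears in two of the three documents and therefore receives a smaller weight than `toll` or `tops`, which appear in only one.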

## 52. Training
Train a logistic regression model using the training data constructed in problem 51.

In [59]:
# Extract the labels (categories)
train_y = pd.read_csv('./data/train.txt', sep='\t')
valid_y = pd.read_csv('./data/valid.txt', sep='\t')
test_y = pd.read_csv('./data/test.txt', sep='\t')

In [61]:
train_y = train_y["CATEGORY"]
valid_y = valid_y["CATEGORY"]
test_y = test_y["CATEGORY"]

In [117]:
# Train the model
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(random_state=42)
model.fit(train_count, train_y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
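
The message above is the tail of scikit-learn's convergence warning: the default `lbfgs` solver stopped at its iteration limit (`max_iter=100` by default) before converging. The model is still usable, but the usual remedy is to raise `max_iter`. A minimal sketch on made-up data follows; the value 1000 is an arbitrary assumption, not a setting taken from this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

titles = [
    "euro bank rates rise",
    "bank shares fall on euro news",
    "singer celebrates birthday",
    "film star celebrates award",
]
labels = ["b", "b", "e", "e"]

features = CountVectorizer().fit_transform(titles)

# A higher iteration cap gives the solver room to converge.
clf = LogisticRegression(random_state=42, max_iter=1000)
clf.fit(features, labels)

# n_iter_ reports how many iterations the solver actually used.
print(clf.n_iter_)
```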


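## 53. Prediction
Using the logistic regression model trained in problem 52, implement a program that computes the category of a given article headline and its predicted probability.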

In [168]:
model.predict(valid_count)

array(['b', 'b', 'b', ..., 'e', 'b', 'b'], dtype=object)

In [104]:
model.predict_proba(valid_count)



array([[0.6032107 , 0.26781757, 0.05719794, 0.0717738 ],
       [0.98315223, 0.00714263, 0.00546352, 0.00424162],
       [0.62791999, 0.1946964 , 0.04538893, 0.13199468],
       ...,
       [0.06303454, 0.87489556, 0.0308467 , 0.03122321],
       [0.98030861, 0.0100075 , 0.00357099, 0.00611291],
       [0.97984676, 0.00751316, 0.00421809, 0.00842199]])

In [105]:
import numpy as np
def score(model, features):
    # For each row of the feature matrix: highest class probability and predicted label
    probs = model.predict_proba(features).max(axis=1)
    preds = model.predict(features)
    return np.stack([probs, preds], axis=1)

In [106]:
score(model, valid_count)

array([[0.603210696029133, 'b'],
       [0.9831522260237566, 'b'],
       [0.6279199873936484, 'b'],
       ...,
       [0.8748955559406222, 'e'],
       [0.9803086054175252, 'b'],
       [0.979846763824319, 'b']], dtype=object)
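
The calls above operate on an already vectorised feature matrix. To go from a raw headline to a category and its probability, the vectorizer and the classifier can be bundled together. The sketch below does this with a scikit-learn pipeline on made-up training data. `predict_title` is a hypothetical helper, and the columns of `predict_proba` follow the order of `classes_`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

titles = [
    "euro bank rates rise",
    "bank shares fall on euro news",
    "singer celebrates birthday",
    "film star celebrates award",
]
labels = ["b", "b", "e", "e"]

# Vectorizer and classifier fitted together on raw text.
pipeline = make_pipeline(
    CountVectorizer(),
    LogisticRegression(random_state=42, max_iter=1000),
)
pipeline.fit(titles, labels)

def predict_title(title):
    """Return (predicted category, its probability) for one raw headline."""
    probs = pipeline.predict_proba([title])[0]
    best = probs.argmax()
    return pipeline.classes_[best], float(probs[best])

category, prob = predict_title("bank raises euro rates")
print(category, round(prob, 3))
```

In the notebook's setting, the headline would also need to pass through `preprocessing_text` before vectorisation so that it matches the training data.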

## 54. Measuring Accuracy
Measure the accuracy of the logistic regression model trained in problem 52 on the training data and the test data.

In [119]:
from sklearn.metrics import accuracy_score

pred_y_train = model.predict(train_count)
pred_y_test = model.predict(test_count)
print(f"train_accuracy：{accuracy_score(train_y, pred_y_train)}")
print(f"test_accuracy：{accuracy_score(test_y, pred_y_test)}")

train_accuracy：0.9960644677661169
test_accuracy：0.9265367316341829


In [123]:
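# NOTE: in the cells shown, `model` was fit on train_count (problem 52) and is not refit here,
# so these scores come from a count-trained model applied to TF-IDF features.
# Refit with model.fit(train_tfidf, train_y) first for a like-for-like comparison.
pred_y_train = model.predict(train_tfidf)
pred_y_test = model.predict(test_tfidf)
print(f"train_accuracy：{accuracy_score(train_y, pred_y_train)}")
print(f"test_accuracy：{accuracy_score(test_y, pred_y_test)}")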

train_accuracy：0.8638493253373314
test_accuracy：0.8500749625187406
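
A classifier should be evaluated on the same kind of features it was trained on. One way to keep the two representations apart is to fit a separate model per vectorizer, as in this sketch on made-up data (the names `evaluate`, `acc_count`, and `acc_tfidf` are hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

train_titles = [
    "euro bank rates rise",
    "bank shares fall on euro news",
    "singer celebrates birthday",
    "film star celebrates award",
]
train_labels = ["b", "b", "e", "e"]
test_titles = ["euro bank news", "star celebrates birthday"]
test_labels = ["b", "e"]

def evaluate(vectorizer):
    """Fit a fresh model on this vectorizer's features and score it on the test titles."""
    x_train = vectorizer.fit_transform(train_titles)
    x_test = vectorizer.transform(test_titles)
    clf = LogisticRegression(random_state=42, max_iter=1000)
    clf.fit(x_train, train_labels)
    return accuracy_score(test_labels, clf.predict(x_test))

acc_count = evaluate(CountVectorizer())
acc_tfidf = evaluate(TfidfVectorizer())
print(acc_count, acc_tfidf)
```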
