# Chapter 8: Neural Networks
This is Chapter 8 of NLP 100 Exercises (自然言語処理100本ノック, https://nlp100.github.io/ja/).


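Using the news-article category classification task from Chapter 6 as the subject, implement a category classification model with a neural network. In this chapter, make use of a machine learning platform such as PyTorch, TensorFlow, or Chainer.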

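## 70. Features from the sum of word vectors
The problem statement is long, so I omit it here. In short: for the headline-category data handled in problem 50,
build each article's feature vector by averaging the vectors of the words in its headline.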
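
To make the averaging concrete, here is a minimal sketch on toy data. The 3-dimensional vectors, the `toy_vectors` dictionary, and the `mean_feature` helper are all hypothetical and only for illustration; the notebook itself uses the 300-dimensional GoogleNews vectors. The zero-vector fallback for headlines with no known word is also an assumption of this sketch.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings; the real model uses 300 dimensions.
toy_vectors = {
    "stocks": np.array([1.0, 0.0, 2.0]),
    "fall": np.array([3.0, 2.0, 0.0]),
}

def mean_feature(text, vectors):
    """Average the vectors of the in-vocabulary words in `text`."""
    words = [w for w in text.split(" ") if w in vectors]
    if not words:
        # No known word: return a zero vector (an assumption of this sketch).
        dim = len(next(iter(vectors.values())))
        return np.zeros(dim)
    return np.mean([vectors[w] for w in words], axis=0)

# "today" is not in the toy vocabulary, so only two words are averaged.
feature = mean_feature("stocks fall today", toy_vectors)
print(feature)  # [2. 1. 1.]
```

Out-of-vocabulary words are simply skipped, so the result is the mean over the words that do have a vector.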

In [1]:
import pandas as pd

df = pd.read_csv("NewsAggregatorDataset/newsCorpora.csv", sep="\t", index_col=0, names=["id", "title", "url", "publisher", "category", "story", "hostname", "timestamp"])
df.head()

id,title,url,publisher,category,story,hostname,timestamp
1,"Fed official says weak data caused by weather,...",http://www.latimes.com/business/money/la-fi-mo...,Los Angeles Times,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.latimes.com,1394470370698
2,Fed's Charles Plosser sees high bar for change...,http://www.livemint.com/Politics/H2EvwJSK2VE6O...,Livemint,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.livemint.com,1394470371207
3,US open: Stocks fall after Fed official hints ...,http://www.ifamagazine.com/news/us-open-stocks...,IFA Magazine,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.ifamagazine.com,1394470371550
4,"Fed risks falling 'behind the curve', Charles ...",http://www.ifamagazine.com/news/fed-risks-fall...,IFA Magazine,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.ifamagazine.com,1394470371793
5,Fed's Plosser: Nasty Weather Has Curbed Job Gr...,http://www.moneynews.com/Economy/federal-reser...,Moneynews,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.moneynews.com,1394470372027


In [2]:
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin.gz', binary=True)

In [3]:
import numpy as np

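def create_seq_feature(text):
    # Split the headline on spaces and strip trailing commas.
    word_list = [w.rstrip(",") for w in text.split(" ")]
    # Keep only words that have a pretrained vector.
    # `w in model` works in both gensim 3.x and 4.x (`model.vocab` was removed in 4.x).
    word_list = [w for w in word_list if w in model]
    if len(word_list) == 0:
        # Fall back to a placeholder token when no word is in the vocabulary.
        word_list = ["%"]
    feature_array = np.array([model[w] for w in word_list])
    # Average the word vectors into a single 300-dimensional feature.
    seq_feature = feature_array.mean(axis=0)
    return seq_feature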

Encode business (b) as 0, science and technology (t) as 1, entertainment (e) as 2, and health (m) as 3.

In [4]:
def cat_encoder(cat):
    if cat == "b":
        return 0
    elif cat == "t":
        return 1
    elif cat == "e":
        return 2
    elif cat == "m":
        return 3
    else:
        # An unknown category code is bad input, not a missing feature.
        raise ValueError(f"unknown category: {cat}")

cat_list = ["m", "e", "t", "b"]
print([cat_encoder(c) for c in cat_list])

[3, 2, 1, 0]
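
The same mapping can also be written as a dictionary lookup, which keeps the label table in one place. This is only an alternative sketch; `CATEGORY_TO_ID` and `encode_category` are hypothetical names, not part of the notebook.

```python
# Label mapping from the text: b -> 0, t -> 1, e -> 2, m -> 3.
CATEGORY_TO_ID = {"b": 0, "t": 1, "e": 2, "m": 3}

def encode_category(cat):
    """Return the integer label for a category code; raise on unknown codes."""
    try:
        return CATEGORY_TO_ID[cat]
    except KeyError:
        raise ValueError(f"unknown category: {cat!r}") from None

encoded = [encode_category(c) for c in ["m", "e", "t", "b"]]
print(encoded)  # [3, 2, 1, 0]
```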


In [5]:
def table_to_X_y(df):
    text_list = list(df["title"])
    features = [create_seq_feature(t) for t in text_list]
    X = np.stack(features, axis=0)
    cat_list = list(df["category"])
    y = np.array([cat_encoder(c) for c in cat_list])
    return X, y

Split the table in the same way as in Chapter 6.

In [6]:
valid_publisher = ["Reuters", "Huffington Post", "Businessweek", "Contactmusic.com", "Daily Mail"]
df = df.loc[df["publisher"].isin(valid_publisher), :]
df = df.sample(frac=1, random_state=42)
df = df.reset_index(drop=True)

total_size = len(df)
train_size = int(0.8*total_size)
valid_size = int(0.1*total_size)
test_size = total_size - train_size - valid_size

train_df = df.iloc[:train_size, :]
valid_df = df.iloc[train_size:train_size+valid_size, :]
test_df = df.iloc[train_size+valid_size:, :]


print(len(train_df))
print(len(valid_df))
print(len(test_df))

10672
1334
1334
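
The split arithmetic can be checked in isolation. The sketch below assumes the same 80/10/10 scheme as the cell above; `split_sizes` is a hypothetical helper, and the total of 13340 is just the sum of the three sizes printed above.

```python
def split_sizes(total_size, train_frac=0.8, valid_frac=0.1):
    """Return (train, valid, test) sizes; test takes whatever is left over."""
    train_size = int(train_frac * total_size)
    valid_size = int(valid_frac * total_size)
    test_size = total_size - train_size - valid_size
    return train_size, valid_size, test_size

sizes = split_sizes(13340)
print(sizes)  # (10672, 1334, 1334)

# The three slices partition the data: no row is dropped or duplicated.
rows = list(range(10))
n_train, n_valid, n_test = split_sizes(len(rows))
train_rows = rows[:n_train]
valid_rows = rows[n_train:n_train + n_valid]
test_rows = rows[n_train + n_valid:]
```

Because the test size is computed as the remainder, rounding in `int(...)` never loses rows.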


In [7]:
import numpy as np

X_train, y_train = table_to_X_y(train_df)
X_val, y_val = table_to_X_y(valid_df)
X_test, y_test = table_to_X_y(test_df)

np.save( "train.npy", np.concatenate([X_train, y_train[:, None]], axis=1))
np.save( "val.npy", np.concatenate([X_val, y_val[:, None]], axis=1))
np.save( "test.npy", np.concatenate([X_test, y_test[:, None]], axis=1))
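
Since the labels are stored as the last column of each saved matrix, reading a file back means splitting that column off again. A minimal round-trip sketch on toy data follows; the random `X`, the labels `y`, and the temporary directory are assumptions for illustration. Note that concatenating `float32` features with integer labels yields a `float64` array, so both parts are cast back after loading.

```python
import os
import tempfile

import numpy as np

# Toy stand-ins for X_train (features) and y_train (integer labels).
X = np.random.rand(5, 300).astype(np.float32)
y = np.array([0, 1, 2, 3, 0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train.npy")
    # Same layout as above: features first, label in the last column.
    np.save(path, np.concatenate([X, y[:, None]], axis=1))
    data = np.load(path)

# Split the last column off again and restore the original dtypes.
X_loaded = data[:, :-1].astype(np.float32)
y_loaded = data[:, -1].astype(np.int64)
print(data.dtype, X_loaded.shape, y_loaded)
```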

In [11]:
X_train.shape

(10672, 300)

## 71. 単層ニューラルネットワークによる予測
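Load the matrix saved in problem 70 and run the following computation on the training data.

The problem statement is long here too, so I omit it; it asks for the output of a single fully connected layer followed by softmax.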

In [14]:
import torch
import torch.nn as nn

class SingleLayerNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(300, 4, bias=False)
        # Normalize over the class dimension; omitting `dim` is deprecated and triggers a warning.
        self.active = nn.Softmax(dim=-1)
    
    def forward(self, x):
        return self.active(self.linear(x))

In [17]:
slnn = SingleLayerNN()

x = torch.from_numpy(X_train[0:4, :].astype(np.float32)).clone()

with torch.no_grad():
    y = slnn(x).numpy().copy()

print(y)

[[0.26014006 0.2516684  0.24600601 0.24218556]
 [0.249415   0.24832833 0.25000504 0.25225163]
 [0.25495055 0.24785142 0.2584714  0.23872657]
 [0.25953215 0.25985867 0.2468626  0.23374653]]
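
For reference, the same computation, softmax applied to `X @ W`, can be sketched in plain NumPy. The random `X` and `W` below are stand-ins (assumptions), not the notebook's data or the PyTorch layer's actual initial weights, so the numbers will differ from the output above.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 300))              # four feature vectors
W = rng.normal(scale=0.01, size=(300, 4))  # small random weights, no bias

Y_hat = softmax(X @ W)
print(Y_hat.shape)        # (4, 4)
print(Y_hat.sum(axis=1))  # each row sums to 1 (up to rounding)
```

Each row is a probability distribution over the four categories. With small untrained weights the rows stay fairly close to uniform, which is consistent with the values near 0.25 printed above.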


