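# Training a Sentiment Analysis Model for Tweets

This notebook combines the techniques introduced so far. First we preprocess the tweets, then we prepare word embeddings. Finally, we classify the tweets with a logistic regression model.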

## Setup

### Installing packages

In [1]:
!pip install -q nltk==3.2.5 numpy==1.19.5 pandas==1.1.5 gensim==4.1.2 scikit-learn==0.23.2 demoji==1.1.0


### Uploading the data

Upload `sts_gold_tweet.csv` from the Chapter 8 folder of the book's repository.

In [2]:
from google.colab import files

uploaded = files.upload()

Saving sts_gold_tweet.csv to sts_gold_tweet.csv


### Uploading the module

The Chapter 8 folder of the book's repository contains `O5_smtd_preprocessing.py`, a module for preprocessing social media text. Upload it as well.

In [3]:
from google.colab import files

uploaded = files.upload()

Saving O5_smtd_preprocessing.py to O5_smtd_preprocessing.py


### Imports

In [4]:
import os
import sys
import warnings

import gensim.downloader as api
import nltk
import numpy as np
import pandas as pd
from gensim.models import Word2Vec, KeyedVectors
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import train_test_split
warnings.filterwarnings("ignore")

# Add the directory that contains the uploaded module to the import path
preprocessing_path = "O5_smtd_preprocessing.py"
sys.path.append(os.path.dirname(os.path.abspath(preprocessing_path)))
import O5_smtd_preprocessing as smtd_preprocessing

nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

### Loading the data

In [5]:
df = pd.read_csv("sts_gold_tweet.csv", error_bad_lines=False, delimiter=";")
df.drop(columns=["id"], inplace=True)
df.polarity = df.polarity.map({0: 0, 4: 1})
df.head()

| | polarity | tweet |
|---|---|---|
| 0 | 0 | the angel is going to miss the athlete this we... |
| 1 | 0 | It looks as though Shaq is getting traded to C... |
| 2 | 0 | @clarianne APRIL 9TH ISN'T COMING SOON ENOUGH |
| 3 | 0 | drinking a McDonalds coffee and not understand... |
| 4 | 0 | So dissapointed Taylor Swift doesnt have a Twi... |


## Preprocessing

In [6]:
# Preprocess the text with our own module
df["tweet"] = df["tweet"].apply(smtd_preprocessing.process_TweetText)

# Tokenize with TweetTokenizer
tweet_tokenizer = TweetTokenizer()
df["tweet_tokens"] = df["tweet"].apply(tweet_tokenizer.tokenize)

# Remove stop words (build the set once instead of reloading the list for every token)
english_stopwords = set(stopwords.words("english"))
df["tweet_no_stopwords"] = df["tweet_tokens"].apply(
    lambda x: [word for word in x if word not in english_stopwords]
)
tweets_processed = df["tweet_tokens"].values
tweets_cat = df["polarity"].values

display(df.head())
print("Number of tweets and categories")
print(len(tweets_processed), len(tweets_cat))
print("\nExample of polarity, processed tweet, processed tweet without stopwords")
print(tweets_cat[0], ",", tweets_processed[0], ",", df["tweet_no_stopwords"].values[0])

| | polarity | tweet | tweet_tokens | tweet_no_stopwords |
|---|---|---|---|---|
| 0 | 0 | the angel is going to miss the athlete this we... | [the, angel, is, going, to, miss, the, athlete... | [angel, going, miss, athlete, weekend] |
| 1 | 0 | it looks as though shaq is getting traded to c... | [it, looks, as, though, shaq, is, getting, tra... | [looks, though, shaq, getting, traded, clevela... |
| 2 | 0 | constantnonbrandmention constantdate isn't com... | [constantnonbrandmention, constantdate, isn't,... | [constantnonbrandmention, constantdate, coming... |
| 3 | 0 | drinking a mcdonalds coffee and not understand... | [drinking, a, mcdonalds, coffee, and, not, und... | [drinking, mcdonalds, coffee, understanding, s... |
| 4 | 0 | so dissapointed taylor swift doesnt have a twi... | [so, dissapointed, taylor, swift, doesnt, have... | [dissapointed, taylor, swift, doesnt, twitter] |


Number of tweets and categories
2034 2034

Example of polarity, processed tweet, processed tweet without stopwords
0 , ['the', 'angel', 'is', 'going', 'to', 'miss', 'the', 'athlete', 'this', 'weekend'] , ['angel', 'going', 'miss', 'athlete', 'weekend']


## Preparing the embeddings

Here we use `glove-twitter-200`, one of the pretrained embeddings available through gensim's downloader.

In [7]:
glove_vectors = api.load("glove-twitter-200")



In [8]:
print('cat' in glove_vectors)
print(glove_vectors['cat'])

True
[ 1.4557e-01 -4.7214e-01  4.5594e-02 -1.1133e-01 -4.4561e-01  1.6502e-02
  4.6724e-01 -1.8545e-01  4.1239e-01 -6.7263e-01 -4.8698e-01  7.2586e-01
 -2.2125e-01 -2.0023e-01  1.7790e-01  6.7062e-01  4.1636e-01  6.5783e-02
  4.8212e-01 -3.5627e-02 -4.7048e-01  7.7485e-02 -2.8296e-01 -4.9671e-01
  3.3700e-01  7.1805e-01  2.2005e-01  1.2718e-01  6.7862e-02  4.0265e-01
 -1.8210e-02  7.8379e-01 -5.2571e-01 -3.9359e-01 -5.6827e-01 -1.5662e-01
 -8.4099e-02 -2.0918e-01 -6.6157e-02  2.5114e-01 -4.0015e-01  1.5930e-01
  1.7887e-01 -3.2110e-01  9.9510e-02  5.2923e-01  4.8289e-01  1.4505e-01
  4.4368e-01  1.7365e-01  3.6350e-01 -5.1496e-01 -1.2889e-01 -1.9713e-01
  1.8096e-01 -1.1301e-02  8.4409e-01  9.8606e-01  8.3535e-01  3.5410e-01
 -2.3395e-01  3.5510e-01  4.1899e-01 -5.4763e-02  2.2902e-01 -1.9593e-01
 -5.7777e-01  2.9728e-01  3.3972e-01 -3.1119e-01 -3.2498e-01 -4.2557e-01
 -7.0302e-01 -7.2515e-01 -2.9349e-01  4.9964e-01 -3.2889e-01  2.4359e-01
  1.3243e-01  3.1164e-01  1.2156e+00  3.1241e-

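With the word embeddings loaded, we write a function that uses them to build document embeddings. Each tweet is represented by the average of the vectors of its in-vocabulary tokens; a tweet with no known tokens gets a zero vector.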

In [9]:
def embedding_feats(list_of_lists, vectors):
    """Return one averaged word vector per token list."""
    DIMENSION = 200  # dimensionality of glove-twitter-200
    feats = []
    for tokens in list_of_lists:
        feat_for_this = np.zeros(DIMENSION)
        count_for_this = 0
        for token in tokens:
            # Skip out-of-vocabulary tokens
            if token in vectors:
                feat_for_this += vectors[token]
                count_for_this += 1
        # Average over the known tokens; stays a zero vector if none were found
        feats.append(
            feat_for_this / count_for_this if count_for_this > 0 else feat_for_this
        )
    return feats


train_vectors = embedding_feats(df["tweet_no_stopwords"].values, glove_vectors)
print(len(train_vectors))

2034
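To make the averaging step concrete, here is a minimal pure-Python sketch of the same idea on a toy vocabulary. The names `average_embedding` and `toy_vectors` and the two-dimensional vectors are made up for illustration; they are not part of the notebook or of gensim.

```python
def average_embedding(tokens, vectors, dimension):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    total = [0.0] * dimension
    count = 0
    for token in tokens:
        if token in vectors:  # out-of-vocabulary tokens are skipped
            total = [t + v for t, v in zip(total, vectors[token])]
            count += 1
    if count == 0:
        return total
    return [t / count for t in total]


# Hypothetical 2-dimensional "embeddings" for two words
toy_vectors = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}

known = average_embedding(["good", "movie", "zzz"], toy_vectors, 2)
unknown = average_embedding(["zzz"], toy_vectors, 2)
print(known)    # [0.5, 0.5]
print(unknown)  # [0.0, 0.0]
```

The unknown token `zzz` does not affect the average, and a tweet made up only of unknown tokens falls back to the zero vector, just as in `embedding_feats`.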


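Next, we split the data into training and test sets. With no `test_size` given, `train_test_split` holds out 25% of the samples for testing. Because no `random_state` is passed, the split, and therefore the scores reported later, can differ between runs.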

In [10]:
x_train, x_test, y_train, y_test = train_test_split(
    train_vectors, df["polarity"].values
)
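The classes in this dataset are imbalanced, so one optional refinement, which the notebook does not use, is a stratified split that keeps the class ratio the same in both subsets. scikit-learn's `train_test_split` accepts a `stratify` argument for this purpose. The pure-Python sketch below only illustrates the idea; `stratified_split`, its parameters, and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=2021):
    """Split indices so that each class contributes test_fraction of its samples."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 12 negative and 4 positive samples
toy_labels = [0] * 12 + [1] * 4
train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx))
```

In this toy case the test set receives 3 of the 12 negative samples and 1 of the 4 positive samples, so the 3:1 class ratio is preserved.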

All that remains is to train the model.

In [11]:
classifier = LogisticRegression(random_state=2021)
classifier.fit(x_train, y_train)

LogisticRegression(random_state=2021)

In [12]:
y_pred = classifier.predict(x_test)
print(classification_report(y_test, y_pred, digits=4))
print("Accuracy: ", classifier.score(x_test, y_test))
print("f1: ", f1_score(y_test, y_pred))

              precision    recall  f1-score   support

           0     0.8803    0.9118    0.8958       363
           1     0.7594    0.6918    0.7240       146

    accuracy                         0.8487       509
   macro avg     0.8199    0.8018    0.8099       509
weighted avg     0.8456    0.8487    0.8465       509

Accuracy:  0.8487229862475442
f1:  0.7240143369175628
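As a sanity check on how these numbers relate, the sketch below recomputes the positive-class metrics by hand. The confusion counts are not printed by the notebook; they are inferred from the report above (they reproduce the supports 363 and 146 and the reported precision and recall), so treat them as an assumption. Note that `f1_score` with its default settings reports the F1 of the positive class (label 1), which is why it matches the second row of the report rather than the averages.

```python
# Confusion counts for the positive class (label 1), inferred from the report
tp, fp, fn, tn = 101, 32, 45, 331

precision = tp / (tp + fp)                 # of predicted positives, how many are correct
recall = tp / (tp + fn)                    # of actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.4f} recall={recall:.4f}")
print(f"f1={f1:.4f} accuracy={accuracy:.4f}")
```

The accuracy (about 0.85) is noticeably higher than the positive-class F1 (about 0.72) because the majority negative class is classified more reliably than the positive class.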
