## Text Sentiment Analysis

Download the dataset from Kaggle and load it. It contains S&P 500-related tweets posted between April 9, 2020 and July 16, 2020.

In [3]:
import pandas as pd
tweets = pd.read_csv('tweets.csv', sep = ';')
tweets.head()

Unnamed: 0,id,created_at,full_text
0,1,2020-04-09 23:59:51+00:00,@KennyDegu very very little volume. With $10T ...
1,2,2020-04-09 23:58:55+00:00,#ES_F achieved Target 2780 closing above 50% #...
2,3,2020-04-09 23:58:52+00:00,RT @KimbleCharting: Silver/Gold indicator crea...
3,4,2020-04-09 23:58:27+00:00,@Issaquahfunds Hedged our $MSFT position into ...
4,5,2020-04-09 23:57:59+00:00,RT @zipillinois: 3 Surprisingly Controversial ...


The data is processed in several steps. First, convert the timestamps in `created_at` to plain dates.

In [4]:
tweets["created_at"] = pd.to_datetime(tweets["created_at"]).dt.date
tweets.head()

Unnamed: 0,id,created_at,full_text
0,1,2020-04-09,@KennyDegu very very little volume. With $10T ...
1,2,2020-04-09,#ES_F achieved Target 2780 closing above 50% #...
2,3,2020-04-09,RT @KimbleCharting: Silver/Gold indicator crea...
3,4,2020-04-09,@Issaquahfunds Hedged our $MSFT position into ...
4,5,2020-04-09,RT @zipillinois: 3 Surprisingly Controversial ...
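
As a small illustration of this conversion, the sketch below applies the same `pd.to_datetime(...).dt.date` call to two timestamps copied from the preview above. The frame name `demo` is made up for this example. Note that `.dt.date` yields Python `datetime.date` objects, so the time of day and the timezone offset are both dropped.

```python
import datetime
import pandas as pd

# Hypothetical two-row frame using timestamps from the preview above.
demo = pd.DataFrame({
    "created_at": ["2020-04-09 23:59:51+00:00", "2020-04-09 23:58:55+00:00"],
})

# Parse the strings into timezone-aware timestamps, then keep only the calendar date.
demo["created_at"] = pd.to_datetime(demo["created_at"]).dt.date

first_date = demo["created_at"].iloc[0]
print(first_date, type(first_date).__name__)
```

Because the result holds plain date objects, it works well as a grouping key for daily aggregation, which is how the column is used later on.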


Next, trim the table: only the date and text columns are needed.

In [5]:
tweets.rename(columns = {'created_at' : 'date', 'full_text' : 'text'}, inplace = True)
tweets.drop(['id'],axis=1,inplace=True)
tweets.head()

Unnamed: 0,date,text
0,2020-04-09,@KennyDegu very very little volume. With $10T ...
1,2020-04-09,#ES_F achieved Target 2780 closing above 50% #...
2,2020-04-09,RT @KimbleCharting: Silver/Gold indicator crea...
3,2020-04-09,@Issaquahfunds Hedged our $MSFT position into ...
4,2020-04-09,RT @zipillinois: 3 Surprisingly Controversial ...


Check for missing values.

In [6]:
tweets.isnull().any()

date    False
text    False
dtype: bool

Define a text processor that tokenizes and cleans each tweet.

In [7]:
import pandas as pd
import matplotlib.pyplot as plt
import pickle
import seaborn as sns
import string
import nltk
import xgboost 
import time
import re
import warnings
warnings.filterwarnings("ignore")

from nltk.corpus import stopwords
from nltk.corpus import words
from nltk.corpus import wordnet
from wordcloud import WordCloud ,STOPWORDS
from nltk.stem import WordNetLemmatizer
from sklearn import metrics
from nltk.tokenize import word_tokenize 
from nltk.tag import perceptron
from nltk import word_tokenize

nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
nltk.download('words')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

def text_processor(text):
    
    # Making the text lower case
    text_lower = text.lower()
    
    # Removing the punctuation
    translator = str.maketrans('','',string.punctuation)
    text_nopunc = text_lower.translate(translator)
    
    # Splitting the strings
    tokens = text_nopunc.split()
    
    # Eliminating the digits (numbers)
    tokens_nonumbers = [t for t in tokens if not t.isdigit()]
    
    # Dropping the stopwords
    stoplist = stopwords.words('english')
    tokens_nostop = [t for t in tokens_nonumbers if t not in stoplist]
    
    # Lemmatizing every token with the default (noun) POS, which avoids a separate POS-tagging step
    lemmatizer = WordNetLemmatizer()
    text = " ".join([lemmatizer.lemmatize(t) for t in tokens_nostop])

    return text

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/rlyryanz/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/rlyryanz/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package words to /Users/rlyryanz/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package punkt to /Users/rlyryanz/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/rlyryanz/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/rlyryanz/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
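
To see what the first few stages of `text_processor` do to a tweet, here is a minimal sketch that mirrors the lowercase, punctuation, digit, and stopword steps using only the standard library. It is not the notebook's function: lemmatization is left out, and `DEMO_STOPWORDS` and `clean_without_nltk` are hypothetical stand-ins for NLTK's English stopword list and the real processor.

```python
import string

# Tiny stand-in stopword list; the notebook uses NLTK's full English list.
DEMO_STOPWORDS = {"above", "very", "with", "the", "a"}

def clean_without_nltk(text):
    """Mirror the lowercase / punctuation / digit / stopword steps, minus lemmatization."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    # isdigit() only drops tokens made up entirely of digits.
    tokens = [t for t in no_punct.split() if not t.isdigit()]
    return " ".join(t for t in tokens if t not in DEMO_STOPWORDS)

cleaned = clean_without_nltk("#ES_F achieved Target 2780 closing above 50%")
print(cleaned)  # esf achieved target closing

mixed = clean_without_nltk("With $10T volume")
print(mixed)  # 10t volume
```

Two side effects are visible here. Deleting punctuation fuses ticker-style tokens (`#ES_F` becomes `esf`), and mixed tokens such as `10t` survive the digit filter because they are not purely numeric.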


Apply the text processor to the tweets. Because the dataset is large, add a progress bar to keep track of how far along the run is.

In [8]:
from tqdm.notebook import tqdm
tqdm.pandas()

tweets['text'] = tweets['text'].progress_apply(text_processor)
tweets.head()

[Progress bar over 923673 rows]




Unnamed: 0,date,text
0,2020-04-09,kennydegu little volume 10t youd think could s...
1,2020-04-09,esf achieved target closing fibonacci level mo...
2,2020-04-09,rt kimblecharting silvergold indicator creates...
3,2020-04-09,issaquahfunds hedged msft position close seeme...
4,2020-04-09,rt zipillinois surprisingly controversial stoc...


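<p>At this point the text is cleaned. Next, I train a text sentiment classifier. The steps are:</p>
<p>1. Find a labelled Twitter dataset.</p>
<p>2. Train a machine learning model on the labelled data, then use it to label the unlabelled tweets.</p>
<p>3. Count the positive and negative tweets per day, compute a daily sentiment score as (positive - negative) / total, and assemble the scores into a time series to feed into the model.</p>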

## Loading and Processing the Labelled Dataset

In [9]:
tweets_labelled = pd.read_csv('tweets_labelled.csv')
tweets_labelled.head()

Unnamed: 0,Sentence,Sentiment
0,The GeoSolutions technology will leverage Bene...,positive
1,"$ESI on lows, down $1.50 to $2.50 BK a real po...",negative
2,"For the last quarter of 2010 , Componenta 's n...",positive
3,According to the Finnish-Russian Chamber of Co...,neutral
4,The Swedish buyout firm has sold its remaining...,neutral


In [10]:
tweets_labelled.rename(columns = {'Sentence' : 'text', 'Sentiment' : 'sentiment'}, inplace = True)
tweets_labelled.head()

Unnamed: 0,text,sentiment
0,The GeoSolutions technology will leverage Bene...,positive
1,"$ESI on lows, down $1.50 to $2.50 BK a real po...",negative
2,"For the last quarter of 2010 , Componenta 's n...",positive
3,According to the Finnish-Russian Chamber of Co...,neutral
4,The Swedish buyout firm has sold its remaining...,neutral


In [11]:
tweets_labelled['text'] = tweets_labelled['text'].progress_apply(text_processor)
tweets_labelled.head()

[Progress bar over 5842 rows]




Unnamed: 0,text,sentiment
0,geosolutions technology leverage benefon gps s...,positive
1,esi low bk real possibility,negative
2,last quarter componenta net sale doubled eur13...,positive
3,according finnishrussian chamber commerce majo...,neutral
4,swedish buyout firm sold remaining percent sta...,neutral


No emoji appeared in the rows inspected, but strip non-ASCII characters anyway as a precaution.

In [12]:
tweets_labelled = tweets_labelled.astype(str).apply(lambda x: x.str.encode('ascii', 'ignore').str.decode('ascii'))
tweets_labelled.head()

Unnamed: 0,text,sentiment
0,geosolutions technology leverage benefon gps s...,positive
1,esi low bk real possibility,negative
2,last quarter componenta net sale doubled eur13...,positive
3,according finnishrussian chamber commerce majo...,neutral
4,swedish buyout firm sold remaining percent sta...,neutral
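
The encode/decode round trip used in this cell can be sketched on single strings as follows. The helper name `to_ascii` is made up for this example. It shows that the approach removes every non-ASCII character, not only emoji, so accented letters disappear as well; that would explain spellings such as `lemminkinen` in a later preview.

```python
def to_ascii(text):
    # Encode to ASCII, silently dropping anything that cannot be represented, then decode back.
    return text.encode("ascii", "ignore").decode("ascii")

emoji_example = to_ascii("bullish \U0001F680 today")
accent_example = to_ascii("lemmink\u00e4inen profit")
print(repr(emoji_example))   # 'bullish  today'
print(repr(accent_example))  # 'lemminkinen profit'
```

If preserving accented names mattered, a Unicode-aware emoji filter would be an alternative; for the mostly English text here the simpler ASCII filter is a reasonable trade-off.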


## Applying Machine Learning

In [13]:
tweets_labelled.isnull().any()

text         False
sentiment    False
dtype: bool

In [14]:
tweets_labelled = tweets_labelled.sample(frac=1.0).reset_index(drop=True)
tweets_labelled.head()

Unnamed: 0,text,sentiment
0,consolidated unaudited result amanda capital i...,neutral
1,rt robbielolz nflx close looking good bull hol...,positive
2,also lemminkinen profit accounting period went...,positive
3,volume focus already outside finland group pro...,neutral
4,total donation amount eur,neutral


In [15]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

x=tweets_labelled['text'] # independent variable (features)
y=tweets_labelled['sentiment']  # dependent variable (labels)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1) # split into training and test sets

# TF-IDF
tfidf_vectorizer = TfidfVectorizer(max_df=0.80, min_df=2, max_features=5000)
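
One detail worth noting about this setup: the single `tfidf_vectorizer` object is passed to every pipeline that follows, and `make_pipeline` stores the object it is given rather than a copy. Since each pipeline is fitted on the same `x_train` with the same settings, every refit should rebuild the same vocabulary, so the results here are not affected. The sketch below demonstrates the sharing and shows one possible way to avoid it; `make_tfidf` is a hypothetical helper, not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def make_tfidf():
    # Hypothetical helper: same settings as the cell above, new object on each call.
    return TfidfVectorizer(max_df=0.80, min_df=2, max_features=5000)

# Passing one vectorizer to two pipelines: both hold the very same object.
shared = make_tfidf()
pipe_a = make_pipeline(shared, LogisticRegression())
pipe_b = make_pipeline(shared, MultinomialNB())
shares_vectorizer = pipe_a.steps[0][1] is pipe_b.steps[0][1]

# Building a fresh vectorizer per pipeline keeps them independent.
pipe_c = make_pipeline(make_tfidf(), LogisticRegression())
pipe_d = make_pipeline(make_tfidf(), MultinomialNB())
independent = pipe_c.steps[0][1] is not pipe_d.steps[0][1]

print(shares_vectorizer, independent)
```

Independent vectorizers matter mainly if the pipelines are ever fitted on different data, because the most recent `fit` would otherwise overwrite the vocabulary used by the earlier pipelines.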

### KNN

In [16]:
from sklearn.neighbors import KNeighborsClassifier

KNN = KNeighborsClassifier()
senti_KNN = make_pipeline(tfidf_vectorizer, KNN)
senti_KNN.fit(x_train, y_train)

y_pred = senti_KNN.predict(x_test) # predict

print(metrics.classification_report(y_test, y_pred)) # evaluate

              precision    recall  f1-score   support

    negative       0.17      0.29      0.21       157
     neutral       0.63      0.77      0.69       617
    positive       0.74      0.25      0.37       395

    accuracy                           0.53      1169
   macro avg       0.51      0.44      0.42      1169
weighted avg       0.60      0.53      0.52      1169



### Logistic Regression

In [17]:
import pandas as pd
from sklearn.linear_model import LogisticRegression

LR = LogisticRegression()
senti_LR = make_pipeline(tfidf_vectorizer, LR)
senti_LR.fit(x_train, y_train)

y_pred = senti_LR.predict(x_test) # predict

report = metrics.classification_report(y_test, y_pred, output_dict=True)

LR_report = pd.DataFrame(report).transpose() # evaluate

LR_report

Unnamed: 0,precision,recall,f1-score,support
negative,0.342857,0.152866,0.211454,157.0
neutral,0.691503,0.857374,0.765557,617.0
positive,0.772455,0.653165,0.707819,395.0
accuracy,0.693755,0.693755,0.693755,0.693755
macro avg,0.602272,0.554468,0.56161,1169.0
weighted avg,0.672032,0.693755,0.67163,1169.0


### Multinomial Naive Bayes

In [18]:
from sklearn.naive_bayes import MultinomialNB

NB = MultinomialNB()
senti_NB = make_pipeline(tfidf_vectorizer, NB)
senti_NB.fit(x_train, y_train)

y_pred = senti_NB.predict(x_test) # predict

print(metrics.classification_report(y_test, y_pred)) # evaluate

              precision    recall  f1-score   support

    negative       0.53      0.05      0.09       157
     neutral       0.66      0.94      0.78       617
    positive       0.73      0.53      0.61       395

    accuracy                           0.68      1169
   macro avg       0.64      0.51      0.49      1169
weighted avg       0.67      0.68      0.63      1169



### Random Forest

In [19]:
from sklearn.ensemble import RandomForestClassifier

RF = RandomForestClassifier()
senti_RF = make_pipeline(tfidf_vectorizer, RF)
senti_RF.fit(x_train, y_train)

y_pred = senti_RF.predict(x_test) # predict

print(metrics.classification_report(y_test, y_pred)) # evaluate

              precision    recall  f1-score   support

    negative       0.22      0.19      0.20       157
     neutral       0.68      0.78      0.73       617
    positive       0.78      0.63      0.70       395

    accuracy                           0.65      1169
   macro avg       0.56      0.54      0.54      1169
weighted avg       0.65      0.65      0.65      1169



### SVM (linear, via SGDClassifier)

In [20]:
from sklearn.linear_model import SGDClassifier

SVM = SGDClassifier()
senti_SVM = make_pipeline(tfidf_vectorizer, SVM)
senti_SVM.fit(x_train, y_train)

y_pred = senti_SVM.predict(x_test) # predict

print(metrics.classification_report(y_test, y_pred)) # evaluate

              precision    recall  f1-score   support

    negative       0.29      0.20      0.24       157
     neutral       0.71      0.76      0.74       617
    positive       0.72      0.73      0.72       395

    accuracy                           0.67      1169
   macro avg       0.57      0.56      0.57      1169
weighted avg       0.66      0.67      0.66      1169



## Labelling the Original Data

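Logistic regression gives the best overall results on the test set, with the highest accuracy (0.69) and the highest weighted-average F1 (0.67), so we use it to label the original tweets. Every model struggles on the negative class (F1 of 0.24 or lower). The first two cells below are a quick sanity check of the SVM pipeline on a few hand-written sentences; the cells after that apply the logistic regression pipeline to the tweets.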
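
To make the comparison concrete, the sketch below collects the accuracy and macro-average F1 figures from the classification reports above and ranks the models on each. The dictionary layout and variable names are only for illustration; the numbers are copied from the reports.

```python
# Accuracy and macro-average F1 copied from the classification reports above.
reported = {
    "KNN": {"accuracy": 0.53, "macro_f1": 0.42},
    "Logistic Regression": {"accuracy": 0.693755, "macro_f1": 0.56161},
    "Multinomial Naive Bayes": {"accuracy": 0.68, "macro_f1": 0.49},
    "Random Forest": {"accuracy": 0.65, "macro_f1": 0.54},
    "SVM": {"accuracy": 0.67, "macro_f1": 0.57},
}

best_by_accuracy = max(reported, key=lambda name: reported[name]["accuracy"])
best_by_macro_f1 = max(reported, key=lambda name: reported[name]["macro_f1"])
print(best_by_accuracy)  # Logistic Regression
print(best_by_macro_f1)  # SVM
```

The ranking depends on the metric: logistic regression leads on accuracy, while the SVM is marginally ahead on macro F1. The gaps among the top models are small, and the random forest and SGD-based models were fitted without a fixed seed, so a rerun could reorder them.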

In [34]:
test = pd.read_excel("Test.xlsx")
test.head()

Unnamed: 0,date,text
0,2022-09-12,I am your father
1,2022-09-13,you are mad
2,2022-09-14,too bad you are a bastard


In [36]:
test['sentiment'] = senti_SVM.predict(test['text'])
test.head()

Unnamed: 0,date,text,sentiment
0,2022-09-12,I am your father,neutral
1,2022-09-13,you are mad,neutral
2,2022-09-14,too bad you are a bastard,negative


In [21]:
print(senti_LR.predict(tweets['text'].head()))

['neutral' 'positive' 'positive' 'positive' 'positive']


In [22]:
tweets['sentiment'] = senti_LR.predict(tweets['text'])
tweets.head()

Unnamed: 0,date,text,sentiment
0,2020-04-09,kennydegu little volume 10t youd think could s...,neutral
1,2020-04-09,esf achieved target closing fibonacci level mo...,positive
2,2020-04-09,rt kimblecharting silvergold indicator creates...,positive
3,2020-04-09,issaquahfunds hedged msft position close seeme...,positive
4,2020-04-09,rt zipillinois surprisingly controversial stoc...,positive


In [23]:
tweets[(tweets['sentiment'] == 'positive')].shape[0]

440978

In [24]:
tweets[(tweets['sentiment'] == 'neutral')].shape[0]

379767

In [25]:
tweets[(tweets['sentiment'] == 'negative')].shape[0]

102928

# Computing the Sentiment Factor Score

In [26]:
tweets_senti = tweets.groupby(['date','sentiment']).size().unstack(fill_value=0).add_suffix('_posts')
tweets_senti['total_posts'] = tweets_senti.sum(axis=1)
tweets_senti.head()

date,negative_posts,neutral_posts,positive_posts,total_posts
2020-04-09,1421,6666,5854,13941
2020-04-10,527,3605,2507,6639
2020-04-11,360,2428,2520,5308
2020-04-12,669,2575,2906,6150
2020-04-13,1335,5201,6066,12602


In [27]:
tweets_senti = ((tweets_senti['positive_posts'] - tweets_senti['negative_posts']) / tweets_senti['total_posts']).to_frame('senti score')
tweets_senti.head()

date,senti score
2020-04-09,0.317983
2020-04-10,0.298238
2020-04-11,0.406933
2020-04-12,0.36374
2020-04-13,0.375417
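
The daily score computed above is (positive posts - negative posts) / total posts, so it ranges from -1 (all negative) to 1 (all positive). As a minimal sketch, the same groupby/unstack steps are repeated below on a made-up seven-row frame (`toy` and its contents are invented for this example), followed by a check of the first real row using the counts shown earlier.

```python
import pandas as pd

# Invented labels: day one has 3 positive, 1 negative, 1 neutral; day two has 1 negative, 1 neutral.
toy = pd.DataFrame({
    "date": ["2020-04-09"] * 5 + ["2020-04-10"] * 2,
    "sentiment": ["positive", "positive", "positive", "negative", "neutral",
                  "negative", "neutral"],
})

# Count posts per (date, sentiment); fill_value=0 covers days where a class is absent.
counts = toy.groupby(["date", "sentiment"]).size().unstack(fill_value=0)
total = counts.sum(axis=1)
score = (counts["positive"] - counts["negative"]) / total
print(score)

# Same formula on the real counts for 2020-04-09 shown above.
first_day = (5854 - 1421) / 13941
print(round(first_day, 6))  # 0.317983
```

Neutral posts never enter the numerator but do enlarge the denominator, so a day dominated by neutral tweets is pulled toward zero.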


# Stationarity Test

In [28]:
from statsmodels.tsa.stattools import adfuller

X_senti = tweets_senti['senti score'].values
senti_adf = adfuller(X_senti)
print('ADF Statistic: %f' % senti_adf[0])
print('p-value: %f' % senti_adf[1])
print('Critical Values:')
for key, value in senti_adf[4].items():
	print('\t%s: %.3f' % (key, value))

ADF Statistic: -5.679582
p-value: 0.000001
Critical Values:
	1%: -3.521
	5%: -2.901
	10%: -2.588
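
Reading this output: the null hypothesis of the ADF test is that the series has a unit root, meaning it is non-stationary. The null is rejected at a given significance level when the test statistic is more negative than the corresponding critical value. The short sketch below applies that rule to the numbers printed above; the variable names are only for illustration.

```python
# Values copied from the ADF output above.
adf_statistic = -5.679582
p_value = 0.000001
critical_values = {"1%": -3.521, "5%": -2.901, "10%": -2.588}

# The unit-root null is rejected at a level when the statistic lies below that critical value.
rejections = {level: adf_statistic < crit for level, crit in critical_values.items()}
print(rejections)      # {'1%': True, '5%': True, '10%': True}
print(p_value < 0.05)  # True
```

The statistic is below all three critical values and the p-value is far under 0.05, so the daily sentiment score can be treated as stationary and used without differencing.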


In [29]:
%store tweets_senti

Stored 'tweets_senti' (DataFrame)
