# Stock market sentiment analysis using Python and machine learning

## Introduction

> **Sentiment analysis** is the process of computationally identifying and categorizing the opinions expressed in a piece of text, especially to determine whether the author's attitude toward a given topic, product, etc. is positive, negative, or neutral.

## Objective

Predict whether a company's stock price will rise or fall based on the main headlines in the news media.

## Imported libraries

In [1]:
import re
import numpy as np
import pandas as pd
from textblob import TextBlob
import nltk
from nltk.corpus import stopwords
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import string
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

## Obtaining and preparing the dataset

The data used in this project come from a study _dataset_, [**Daily News for Stock Market Prediction**](https://www.kaggle.com/aaron7sun/stocknews). According to the _dataset_ author's description, two channels of data are provided:

1. News data: historical news headlines from the Reddit WorldNews channel (/r/worldnews). They are ranked by Reddit users' votes, and only the top 25 headlines are considered for a single date.
    (Range: 2008-08-08 to 2016-07-01):

In [2]:
df1 = pd.read_csv('../data/Down_Jones_Industrial_Average_News.csv', parse_dates=['Date'], index_col='Date')
df1.head(3)

Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,Top9,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
2008-08-08,0,"b""Georgia 'downs two Russian warplanes' as cou...",b'BREAKING: Musharraf to be impeached.',b'Russia Today: Columns of troops roll into So...,b'Russian tanks are moving towards the capital...,"b""Afghan children raped with 'impunity,' U.N. ...",b'150 Russian tanks have entered South Ossetia...,"b""Breaking: Georgia invades South Ossetia, Rus...","b""The 'enemy combatent' trials are nothing but...",b'Georgian troops retreat from S. Osettain cap...,...,b'Georgia Invades South Ossetia - if Russia ge...,b'Al-Qaeda Faces Islamist Backlash',"b'Condoleezza Rice: ""The US would not act to p...",b'This is a busy day: The European Union has ...,"b""Georgia will withdraw 1,000 soldiers from Ir...",b'Why the Pentagon Thinks Attacking Iran is a ...,b'Caucasus in crisis: Georgia invades South Os...,b'Indian shoe manufactory - And again in a se...,b'Visitors Suffering from Mental Illnesses Ban...,"b""No Help for Mexico's Kidnapping Surge"""
2008-08-11,1,b'Why wont America and Nato help us? If they w...,b'Bush puts foot down on Georgian conflict',"b""Jewish Georgian minister: Thanks to Israeli ...",b'Georgian army flees in disarray as Russians ...,"b""Olympic opening ceremony fireworks 'faked'""",b'What were the Mossad with fraudulent New Zea...,b'Russia angered by Israeli military sale to G...,b'An American citizen living in S.Ossetia blam...,b'Welcome To World War IV! Now In High Definit...,...,b'Israel and the US behind the Georgian aggres...,"b'""Do not believe TV, neither Russian nor Geor...",b'Riots are still going on in Montreal (Canada...,b'China to overtake US as largest manufacturer',b'War in South Ossetia [PICS]',b'Israeli Physicians Group Condemns State Tort...,b' Russia has just beaten the United States ov...,b'Perhaps *the* question about the Georgia - R...,b'Russia is so much better at war',"b""So this is what it's come to: trading sex fo..."
2008-08-12,0,b'Remember that adorable 9-year-old who sang a...,"b""Russia 'ends Georgia operation'""","b'""If we had no sexual harassment we would hav...","b""Al-Qa'eda is losing support in Iraq because ...",b'Ceasefire in Georgia: Putin Outmaneuvers the...,b'Why Microsoft and Intel tried to kill the XO...,b'Stratfor: The Russo-Georgian War and the Bal...,"b""I'm Trying to Get a Sense of This Whole Geor...","b""The US military was surprised by the timing ...",...,b'U.S. troops still in Georgia (did you know t...,b'Why Russias response to Georgia was right',"b'Gorbachev accuses U.S. of making a ""serious ...","b'Russia, Georgia, and NATO: Cold War Two'",b'Remember that adorable 62-year-old who led y...,b'War in Georgia: The Israeli connection',b'All signs point to the US encouraging Georgi...,b'Christopher King argues that the US and NATO...,b'America: The New Mexico?',"b""BBC NEWS | Asia-Pacific | Extinction 'by man..."


Number of observations (rows) and fields or variables (columns) in the `df1` _dataframe_:

In [3]:
df1.shape 

(1989, 26)

In [4]:
df1.columns

Index(['Label', 'Top1', 'Top2', 'Top3', 'Top4', 'Top5', 'Top6', 'Top7', 'Top8',
       'Top9', 'Top10', 'Top11', 'Top12', 'Top13', 'Top14', 'Top15', 'Top16',
       'Top17', 'Top18', 'Top19', 'Top20', 'Top21', 'Top22', 'Top23', 'Top24',
       'Top25'],
      dtype='object')

As for the columns of the dataset in the `df1` _dataframe_, note that there are 25 columns named "Top1" to "Top25" holding the news headlines for a given date. The "Label" column, in turn, indicates whether the adjusted closing price of the Dow Jones index fell (0) or rose (1) on that date.

In [5]:
df1['Label'].unique()

array([0, 1], dtype=int64)
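
To make the meaning of `Label` concrete, the following minimal sketch derives a rise/fall flag from consecutive adjusted closing prices. It assumes that a label of 1 means the adjusted close was higher than on the previous trading day; the exact rule used by the dataset author (for example, how an unchanged close is treated) is not stated here, so the comparison is an assumption. The three prices are the `Adj Close` values for the first three dates of the merged data in this notebook.

```python
import pandas as pd

# Adjusted closes for the first three dates of the DJIA data in this notebook.
adj_close = pd.Series(
    [11734.320312, 11782.349609, 11642.469727],
    index=pd.to_datetime(["2008-08-08", "2008-08-11", "2008-08-12"]),
    name="Adj Close",
)

# Assumption: 1 = close rose versus the previous trading day, 0 = otherwise.
change = adj_close.diff()
derived_label = (change > 0).astype(int)

# The first day has no previous close, so its flag is left undefined (NaN).
derived_label = derived_label.where(change.notna())
print(derived_label)
```

For 2008-08-11 and 2008-08-12 the derived flags agree with the `Label` values shown for those dates.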

2. Stock data: the Dow Jones Industrial Average (DJIA) is used to "prove the concept".
    (Range: 2008-08-08 to 2016-07-01):

In [6]:
df2 = pd.read_csv('../data/Down_Jones_Industrial_Average_Stock.csv', parse_dates=['Date'], index_col='Date')
df2.head(3)

| Date | Open | High | Low | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|
| 2016-07-01 | 17924.240234 | 18002.380859 | 17916.910156 | 17949.369141 | 82160000 | 17949.369141 |
| 2016-06-30 | 17712.759766 | 17930.609375 | 17711.800781 | 17929.990234 | 133030000 | 17929.990234 |
| 2016-06-29 | 17456.019531 | 17704.509766 | 17456.019531 | 17694.679688 | 106380000 | 17694.679688 |


Number of observations (rows) and fields or variables (columns) in the `df2` _dataframe_:

In [7]:
df2.shape

(1989, 6)

In [8]:
df2.columns

Index(['Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close'], dtype='object')

As shown above, the `df2` _dataframe_ contains the opening price ("Open"), highest price ("High"), lowest price ("Low"), closing price ("Close"), trading volume ("Volume"), and adjusted closing price ("Adj Close") of the Dow Jones index on a given date.

### Combining the two _dataframes_ into a single dataset

The first step is to combine all the news headlines into a single piece of text for each date in the `df1` _dataframe_:

In [9]:
nltk.download('stopwords')
stemmer = nltk.SnowballStemmer("english")
stopword=set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\igoan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [10]:
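headlines = []
for row in range(df1.shape[0]):
    # Take the 25 headline columns (skipping 'Label') as strings
    l = [str(x) for x in df1.iloc[row, 1:]]
    # Drop the leading b' (or b") and the trailing quote left by the bytes repr
    l = [x[2:-1] for x in l]
    # Join the day's headlines into a single string
    headlines.append(' '.join(l))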
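
The same combination can also be written without an explicit row loop. The sketch below is an alternative under stated assumptions, not the notebook's code: it uses a tiny hypothetical frame `toy` with only two headline columns, and the helper `strip_bytes_wrapper` (a name introduced here) checks for the `b'...'` wrapper explicitly instead of slicing at fixed positions, so a headline stored without the wrapper is left untouched.

```python
import pandas as pd

# Hypothetical miniature of df1: a label plus two headline columns.
toy = pd.DataFrame(
    {
        "Label": [0, 1],
        "Top1": ["b'Markets fall'", 'b"Oil prices rise"'],
        "Top2": ["b'Talks resume'", "No wrapper here"],
    },
    index=pd.to_datetime(["2008-08-08", "2008-08-11"]),
)

def strip_bytes_wrapper(value):
    """Remove a leading b' or b" and the matching trailing quote, if present."""
    text = str(value)
    if len(text) >= 3 and text[0] == "b" and text[1] in "'\"" and text[-1] == text[1]:
        return text[2:-1]
    return text

# Clean every headline cell, then join each row into one string.
combined = (
    toy.drop(columns="Label")
    .apply(lambda column: column.map(strip_bytes_wrapper))
    .apply(lambda row: " ".join(row), axis=1)
)
print(combined.tolist())
```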

In [11]:
headlines[0]

'Georgia \'downs two Russian warplanes\' as countries move to brink of war BREAKING: Musharraf to be impeached. Russia Today: Columns of troops roll into South Ossetia; footage from fighting (YouTube) Russian tanks are moving towards the capital of South Ossetia, which has reportedly been completely destroyed by Georgian artillery fire Afghan children raped with \'impunity,\' U.N. official says - this is sick, a three year old was raped and they do nothing 150 Russian tanks have entered South Ossetia whilst Georgia shoots down two Russian jets. Breaking: Georgia invades South Ossetia, Russia warned it would intervene on SO\'s side The \'enemy combatent\' trials are nothing but a sham: Salim Haman has been sentenced to 5 1/2 years, but will be kept longer anyway just because they feel like it. Georgian troops retreat from S. Osettain capital, presumably leaving several hundred people killed. [VIDEO] Did the U.S. Prep Georgia for War with Russia? Rice Gives Green Light for Israel to Atta

Note in the example above that the text contains many unwanted characters that do not correspond to words. The data therefore need to be cleaned, as shown below:

In [12]:
def clean(text):
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)                    # bracketed tags such as [VIDEO]
    text = re.sub(r'https?://\S+|www\.\S+', '', text)      # URLs
    text = re.sub(r'<.*?>+', '', text)                     # HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub(r'\n', '', text)                         # newlines
    text = re.sub(r'\w*\d\w*', '', text)                   # tokens containing digits
    #text = [word for word in text.split(' ') if word not in stopword]
    #text = " ".join(text)
    #text = [stemmer.stem(word) for word in text.split(' ')]
    return text.strip()
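
To see what each substitution contributes, the sketch below applies the same patterns one at a time to a made-up headline. The sample string is hypothetical and not taken from the dataset.

```python
import re
import string

# Hypothetical headline containing every kind of noise the patterns target.
sample = "BREAKING [VIDEO]: 150 tanks seen at https://example.com <b>now</b>!\nMore soon."

steps = [
    ("bracketed tags", r"\[.*?\]"),
    ("URLs", r"https?://\S+|www\.\S+"),
    ("HTML tags", r"<.*?>+"),
    ("punctuation", "[%s]" % re.escape(string.punctuation)),
    ("newlines", r"\n"),
    ("tokens containing digits", r"\w*\d\w*"),
]

text = sample.lower()
for name, pattern in steps:
    # Apply one pattern at a time and show the intermediate result.
    text = re.sub(pattern, "", text)
    print(f"after removing {name}: {text!r}")

cleaned = text.strip()
```

Two side effects are visible in the output: deleting `\n` without a replacement joins the words on either side of it, and runs of several spaces remain where tokens were removed. Collapsing whitespace with `" ".join(cleaned.split())` would be one way to tidy the latter if desired.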

In [13]:
temp = pd.DataFrame()
temp['Headlines'] = headlines
temp.index = df1.index
temp.head()

| Date | Headlines |
|---|---|
| 2008-08-08 | Georgia 'downs two Russian warplanes' as count... |
| 2008-08-11 | Why wont America and Nato help us? If they won... |
| 2008-08-12 | Remember that adorable 9-year-old who sang at ... |
| 2008-08-13 | U.S. refuses Israel weapons to attack Iran: r... |
| 2008-08-14 | All the experts admit that we should legalise ... |


In [14]:
# Data cleaning
temp['Headlines'] = temp['Headlines'].apply(clean)
temp['Headlines'].head()

Date
2008-08-08    georgia downs two russian warplanes as countri...
2008-08-11    why wont america and nato help us if they wont...
2008-08-12    remember that adorable  who sang at the openin...
2008-08-13    us refuses israel weapons to attack iran repor...
2008-08-14    all the experts admit that we should legalise ...
Name: Headlines, dtype: object

In [15]:
temp['Headlines'].iloc[0]

'georgia downs two russian warplanes as countries move to brink of war breaking musharraf to be impeached russia today columns of troops roll into south ossetia footage from fighting youtube russian tanks are moving towards the capital of south ossetia which has reportedly been completely destroyed by georgian artillery fire afghan children raped with impunity un official says  this is sick a three year old was raped and they do nothing  russian tanks have entered south ossetia whilst georgia shoots down two russian jets breaking georgia invades south ossetia russia warned it would intervene on sos side the enemy combatent trials are nothing but a sham salim haman has been sentenced to   years but will be kept longer anyway just because they feel like it georgian troops retreat from s osettain capital presumably leaving several hundred people killed  did the us prep georgia for war with russia rice gives green light for israel to attack iran says us has no veto over israeli military op

Next, the price _dataframe_ `df2` is concatenated with the _dataframe_ of grouped headlines and the `Label` column of `df1`:

In [16]:
merge = pd.concat([df2, temp, df1['Label']], axis=1)
merge.head(3)

| Date | Open | High | Low | Close | Volume | Adj Close | Headlines | Label |
|---|---|---|---|---|---|---|---|---|
| 2008-08-08 | 11432.089844 | 11759.959961 | 11388.040039 | 11734.320312 | 212830000 | 11734.320312 | georgia downs two russian warplanes as countri... | 0 |
| 2008-08-11 | 11729.669922 | 11867.110352 | 11675.530273 | 11782.349609 | 183190000 | 11782.349609 | why wont america and nato help us if they wont... | 1 |
| 2008-08-12 | 11781.700195 | 11782.349609 | 11601.519531 | 11642.469727 | 173590000 | 11642.469727 | remember that adorable who sang at the openin... | 0 |


Because both _dataframes_ contain exactly the same dates, no information was lost and no rows have missing data:

In [17]:
merge.shape

(1989, 8)

In [18]:
merge.isna().sum()

Open         0
High         0
Low          0
Close        0
Volume       0
Adj Close    0
Headlines    0
Label        0
dtype: int64
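
`pd.concat(..., axis=1)` matches rows by index label rather than by position, which matters here because `df2` is listed from newest to oldest while `df1` runs from oldest to newest. The sketch below illustrates this on two hypothetical three-row frames (`news` and `prices` are names introduced for the example) and shows one way to confirm beforehand that both indexes hold the same dates.

```python
import pandas as pd

dates = pd.to_datetime(["2008-08-08", "2008-08-11", "2008-08-12"])

# Hypothetical stand-ins: headlines in ascending date order...
news = pd.DataFrame({"Headlines": ["a", "b", "c"]}, index=dates)
# ...and prices in descending date order.
prices = pd.DataFrame({"Close": [3.0, 2.0, 1.0]}, index=dates[::-1])

# Same set of dates on both sides means no rows are lost and no NaNs appear.
same_dates = prices.index.sort_values().equals(news.index.sort_values())

# Rows are matched by date, not by position; sort for a predictable order.
merged = pd.concat([prices, news], axis=1).sort_index()
print(same_dates)
print(merged)
```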

As the cleaned sample shown earlier illustrates, the grouped headlines are now textually cleaner and more readable.

## Creating functions for sentiment analysis of text

The function below returns a measure of the subjectivity present in a given text `string`, which TextBlob reports on a scale from 0.0 (objective) to 1.0 (subjective):

In [19]:
def get_subjectivity(text):
    return TextBlob(text).sentiment.subjectivity

The following function measures the polarity of a text, which TextBlob reports on a scale from -1.0 (negative) to 1.0 (positive):

In [20]:
def get_polarity(text):
    return TextBlob(text).sentiment.polarity

Applying the functions above adds two new columns to the `merge` _dataframe_: one for the subjectivity and one for the polarity of each group of headlines:

In [21]:
# Create the 'Subjectivity' and 'Polarity' columns
merge['Subjectivity'] = merge['Headlines'].apply(get_subjectivity)
merge['Polarity'] = merge['Headlines'].apply(get_polarity)
merge.head(3)

| Date | Open | High | Low | Close | Volume | Adj Close | Headlines | Label | Subjectivity | Polarity |
|---|---|---|---|---|---|---|---|---|---|---|
| 2008-08-08 | 11432.089844 | 11759.959961 | 11388.040039 | 11734.320312 | 212830000 | 11734.320312 | georgia downs two russian warplanes as countri... | 0 | 0.267549 | -0.048568 |
| 2008-08-11 | 11729.669922 | 11867.110352 | 11675.530273 | 11782.349609 | 183190000 | 11782.349609 | why wont america and nato help us if they wont... | 1 | 0.374806 | 0.109325 |
| 2008-08-12 | 11781.700195 | 11782.349609 | 11601.519531 | 11642.469727 | 173590000 | 11642.469727 | remember that adorable who sang at the openin... | 0 | 0.536234 | -0.044302 |


The next function uses VADER's `SentimentIntensityAnalyzer` to score the sentiment of each record, returning the `neg`, `neu`, `pos`, and `compound` values:

In [22]:
def get_sia(text):
    sia = SentimentIntensityAnalyzer()
    sentiment = sia.polarity_scores(text)
    return sentiment

In [None]:
sentiments = {
    'compound' : [],
    'neg' : [],
    'pos' : [],
    'neu'  : []
}

SIA = 0
for headline in merge['Headlines']:
    SIA = get_sia(headline)
    for name in ['compound', 'neg', 'pos', 'neu']:
        sentiments[name].append(SIA[name])

In [None]:
for name in ['compound', 'neg', 'pos', 'neu']:
    merge[name] = sentiments[name]
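
Because each call returns a dictionary with the same four keys, the list-building loop could also be replaced by constructing a frame directly from the dictionaries. In the sketch below, `fake_scores` is a hypothetical stand-in for `get_sia` so that the example runs without the VADER lexicon; its numbers are invented and carry no meaning.

```python
import pandas as pd

def fake_scores(text):
    """Hypothetical stand-in for get_sia: returns a dict with VADER's four keys."""
    share = min(len(text) / 100, 1.0)
    return {"neg": 0.0, "neu": 1.0 - share, "pos": share, "compound": share}

frame = pd.DataFrame(
    {"Headlines": ["short text", "a somewhat longer piece of text"]},
    index=pd.to_datetime(["2008-08-08", "2008-08-11"]),
)

# One dict per row becomes one column per key, aligned on the original index.
scores = pd.DataFrame(
    [fake_scores(text) for text in frame["Headlines"]],
    index=frame.index,
)
frame = frame.join(scores)
print(frame.columns.tolist())
```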

In [None]:
merge.head(3)

## Predictive model

In [None]:
keep_columns = ['Open', 'High', 'Low', 'Close', 'Volume', 'Headlines', 'Subjectivity', 'Polarity', 'compound', 'neg', 'pos', 'neu', 'Label']
df = merge[keep_columns].copy()
df.columns = ['Open', 'High', 'Low', 'Close', 'Volume', 'Headlines', 'Subjectivity', 'Polarity', 'Compound', 'Negative', 'Positive', 'Neutral', 'Label']
df.head()

In [None]:
# Build the feature data (the raw 'Headlines' text is dropped because the model needs numeric inputs)
X = np.array(df.drop(['Label', 'Headlines'], axis = 1))

# Build the target data
y = np.array(df['Label'])

Next, the data are split, with 80% used to train the model and the remainder for testing.

In [None]:
# Split the data into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
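
As a quick check on what the 80/20 split produces, the sketch below runs `train_test_split` on placeholder arrays with the same number of rows as the merged data (1989). The arrays are dummies introduced for illustration; only their shapes matter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 1989  # number of observations in the merged frame

X_dummy = np.zeros((n_rows, 3))      # placeholder features
y_dummy = np.arange(n_rows) % 2      # placeholder binary labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_dummy, y_dummy, test_size=0.2, random_state=0
)
print(X_tr.shape, X_te.shape)
```

By default `train_test_split` shuffles the rows before splitting, so both the training and the test set contain dates from the whole period.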

Creating and training the model:

In [None]:
model = LinearDiscriminantAnalysis().fit(X_train, y_train)
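
The notebook imports `accuracy_score` and `classification_report` but stops after fitting. A minimal sketch of how those two functions could be used to evaluate a fitted model follows. It runs on synthetic numeric features generated here, not on the notebook's data, so the printed scores say nothing about the stock-prediction task; names ending in `_syn` are introduced for the example.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: two numeric features, label tied to their sum.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 2))
y_syn = (X_syn[:, 0] + X_syn[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0
)

# Fit on the training part, predict on the held-out part.
model_syn = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
predictions = model_syn.predict(X_te)

# Overall accuracy plus per-class precision, recall, and F1.
accuracy = accuracy_score(y_te, predictions)
print(f"accuracy: {accuracy:.3f}")
print(classification_report(y_te, predictions))
```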