# Designing your own sentiment analysis tool

Although there are many tools that will automatically give us a sentiment score for a piece of text, we learned that they don't always agree! Let's design our own to see how these tools work internally, and how we can test them to see how well they perform.



### Prep work: downloading the necessary files
Before we get started, we need to download all of the data we'll be using.
* **sentiment140-subset.csv:** cleaned subset of the Sentiment140 data - half a million tweets marked as positive or negative


In [1]:
# Make data directory if it doesn't exist
!mkdir -p data
!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/investigating-sentiment-analysis/data/sentiment140-subset.csv.zip -P data
!unzip -n -d data data/sentiment140-subset.csv.zip

--2022-04-10 17:05:48--  https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/investigating-sentiment-analysis/data/sentiment140-subset.csv.zip
Resolvendo nyc3.digitaloceanspaces.com (nyc3.digitaloceanspaces.com)... 162.243.189.2
Conectando-se a nyc3.digitaloceanspaces.com (nyc3.digitaloceanspaces.com)|162.243.189.2|:443... conectado.
A requisição HTTP foi enviada, aguardando resposta... 200 OK
Tamanho: 17927149 (17M) [application/zip]
Salvando em: “data/sentiment140-subset.csv.zip”


2022-04-10 17:05:53 (4,13 MB/s) - “data/sentiment140-subset.csv.zip” salvo [17927149/17927149]

Archive:  data/sentiment140-subset.csv.zip
  inflating: data/sentiment140-subset.csv  


In [2]:
# !pip install scikit-learn


## Training on tweets

Let's say we're going to analyze the sentiment of tweets. If we had a list of tweets scored positive vs. negative, we could see which words are usually associated with positive scores and which are usually associated with negative scores.

Luckily, we have **Sentiment140** - http://help.sentiment140.com/for-students - a list of 1.6 million tweets along with a score saying whether each one is negative or positive. We'll use it to build our own machine learning algorithm to separate positivity from negativity.

### Read in our data

In [3]:
import pandas as pd

df = pd.read_csv("data/sentiment140-subset.csv", nrows=30000)
df.head()

Unnamed: 0,polarity,text
0,0,@kconsidder You never tweet
1,0,Sick today coding from the couch.
2,1,"@ChargerJenn Thx for answering so quick,I was ..."
3,1,Wii fit says I've lost 10 pounds since last ti...
4,0,@MrKinetik Not a thing!!! I don't really have...


It isn't a very complicated dataset. `polarity` is whether the tweet is positive (`1`) or not (`0`), and `text` is the text of the tweet itself.

How many rows do we have?

In [4]:
df.shape

(30000, 2)

How many **positive** tweets do we have compared to **negative** tweets?

In [5]:
df.polarity.value_counts()

1    15064
0    14936
Name: polarity, dtype: int64

## Training our algorithm


### Vectorize our tweets

Create a `TfidfVectorizer` and use it to vectorize our tweets. Since we don't have all the time in the world, we should probably use `max_features` to take only a selection of terms - how about 1000 for now?
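
To make `max_features` concrete, here is a minimal pure-Python sketch of the idea: count how often each term appears across all documents and keep only the most frequent ones. The toy tweets, the `top_terms` helper, and the simple regex tokenizer are assumptions for illustration; scikit-learn's own tokenization and tie-breaking may differ in detail.

```python
import re
from collections import Counter

# Hypothetical toy corpus, standing in for the tweets
toy_tweets = [
    "I love this song",
    "I love love this weather",
    "this traffic is the worst",
]

def top_terms(texts, max_features):
    """Return the max_features most frequent terms, sorted alphabetically."""
    counts = Counter()
    for text in texts:
        # Lowercase and keep tokens of 2+ word characters (similar in spirit
        # to TfidfVectorizer's default token pattern)
        counts.update(re.findall(r"\b\w\w+\b", text.lower()))
    most_common = [term for term, _ in counts.most_common(max_features)]
    return sorted(most_common)

vocabulary = top_terms(toy_tweets, max_features=2)
print(vocabulary)
```

Every term that misses the cut simply gets no column, which is why the real vectorizer below ends up with exactly 1,000 columns.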

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [7]:
vectorizer = TfidfVectorizer(max_features=1000)
vectors = vectorizer.fit_transform(df.text)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
words_df = pd.DataFrame(vectors.toarray(), columns=vectorizer.get_feature_names_out())
words_df.head()

Unnamed: 0,10,100,11,12,15,1st,20,2day,2nd,30,able,about,account,actually,add,after,afternoon,again,ago,agree,ah,ahh,ahhh,air,album,all,almost,alone,already,alright,also,although,always,am,amazing,amp,an,and,annoying,another,...,work,worked,working,works,world,worried,worry,worse,worst,worth,would,wouldn,wow,write,writing,wrong,wtf,www,xd,xoxo,xx,xxx,ya,yay,yea,yeah,year,years,yep,yes,yesterday,yet,yo,you,young,your,yourself,youtube,yum,yup
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.334095,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.22101,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.427465,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
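
The decimals in the table are TF-IDF weights rather than raw counts. As a rough sketch of where such numbers come from, the snippet below computes them by hand for a toy corpus, assuming the scikit-learn defaults of smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The corpus and the `tfidf_row` helper are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized toy corpus
docs = [
    ["love", "this", "kitten"],
    ["hate", "this", "keyboard"],
    ["love", "love", "this"],
]
vocab = sorted({term for doc in docs for term in doc})
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
df = {term: sum(term in doc for doc in docs) for term in vocab}
# Smoothed inverse document frequency (assumed scikit-learn default)
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_row(doc):
    """TF-IDF weights for one document, scaled to unit length (L2 norm)."""
    counts = Counter(doc)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

for doc in docs:
    print([round(weight, 3) for weight in tfidf_row(doc)])
```

A word that appears in every document (`this`) ends up with a lower weight than a rarer word (`kitten`) in the same row, and words absent from a document stay at zero.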


### Setting up our variables

Because we want to fit in with all the other programmers, we need to create two variables: one called `X` and one called `y`.

`X` is all of our **features**, the things we use to predict positive or negative. Those will be our words.

`y` is all of our **labels**, the positive or negative rating. We'll use the `polarity` column for that.

In [8]:
X = words_df
y = df.polarity

### Picking an algorithm

What kind of algorithm do we want? Who knows, we don't know anything about machine learning! **Let's just pick ALL OF THEM.**

In [9]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB

### Training our algorithms

When we teach our algorithm what a positive or a negative tweet looks like, this is called **training**. Training can take different amounts of time depending on the kind of algorithm you're using.

In [10]:
%%time
# Create and train a logistic regression
logreg = LogisticRegression(C=1e9, solver='lbfgs', max_iter=1000)
logreg.fit(X, y)

CPU times: user 13.7 s, sys: 1.04 s, total: 14.8 s
Wall time: 7.76 s


In [11]:
%%time
# Create and train a random forest classifier
forest = RandomForestClassifier(n_estimators=50)
forest.fit(X, y)

CPU times: user 26.9 s, sys: 211 ms, total: 27.2 s
Wall time: 27 s


In [12]:
%%time
# Create and train a linear support vector classifier (LinearSVC)
svc = LinearSVC()
svc.fit(X, y)

CPU times: user 324 ms, sys: 2.07 ms, total: 326 ms
Wall time: 328 ms


In [13]:
%%time
# Create and train a multinomial naive bayes classifier (MultinomialNB)
bayes = MultinomialNB()
bayes.fit(X, y)

CPU times: user 195 ms, sys: 16 ms, total: 211 ms
Wall time: 145 ms


**How long did each one take to train?** How much faster were some compared to others?
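
One way to answer that is to divide each wall time by the fastest one. The sketch below does that arithmetic with the wall times printed above; the exact numbers will vary from machine to machine and run to run.

```python
# Wall times in seconds, copied from the %%time output above
wall_times = {
    "logistic regression": 7.76,
    "random forest": 27.0,
    "linear SVC": 0.328,
    "naive Bayes": 0.145,
}

fastest = min(wall_times.values())

# How many times slower each model was than the fastest one
slowdown = {name: seconds / fastest for name, seconds in wall_times.items()}

for name, factor in sorted(slowdown.items(), key=lambda item: item[1]):
    print(f"{name}: {factor:.0f}x the fastest training time")
```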

## Use our models

Now that we've trained our models, **they can try to predict whether some content is positive or negative**.

### Preparing the data

**Add a few more sentences below.** They should be a mix of positive and negative. They can be boring, they can be exciting, they can be short, they can be long.

In [14]:
# Create some test data

pd.set_option("display.max_colwidth", 200)

unknown = pd.DataFrame({'content': [
    "I love love love love this kitten",
    "I hate hate hate hate this keyboard",
    "I'm not sure how I feel about toast",
    "Did you see the baseball game yesterday?",
    "The package was delivered late and the contents were broken",
    "Trashy television shows are some of my favorites",
    "I'm seeing a Kubrick film tomorrow, I hear not so great things about it.",
    "I find chirping birds irritating, but I know I'm not the only one",
]})
unknown

Unnamed: 0,content
0,I love love love love this kitten
1,I hate hate hate hate this keyboard
2,I'm not sure how I feel about toast
3,Did you see the baseball game yesterday?
4,The package was delivered late and the contents were broken
5,Trashy television shows are some of my favorites
6,"I'm seeing a Kubrick film tomorrow, I hear not so great things about it."
7,"I find chirping birds irritating, but I know I'm not the only one"


First, we need to **vectorize** our sentences into numbers, so the algorithm can understand them.

Our algorithm only knows **certain words.** Run `vectorizer.get_feature_names_out()` to see the list of words it knows.

In [15]:
# .tolist() turns the NumPy array into a plain Python list for printing
print(vectorizer.get_feature_names_out().tolist())

['10', '100', '11', '12', '15', '1st', '20', '2day', '2nd', '30', 'able', 'about', 'account', 'actually', 'add', 'after', 'afternoon', 'again', 'ago', 'agree', 'ah', 'ahh', 'ahhh', 'air', 'album', 'all', 'almost', 'alone', 'already', 'alright', 'also', 'although', 'always', 'am', 'amazing', 'amp', 'an', 'and', 'annoying', 'another', 'any', 'anymore', 'anyone', 'anything', 'anyway', 'app', 'apparently', 'apple', 'appreciate', 'are', 'around', 'art', 'as', 'ask', 'asleep', 'ass', 'at', 'ate', 'aw', 'awake', 'awards', 'away', 'awesome', 'aww', 'awww', 'baby', 'back', 'bad', 'band', 'bbq', 'bday', 'be', 'beach', 'beautiful', 'because', 'bed', 'been', 'beer', 'before', 'behind', 'being', 'believe', 'best', 'bet', 'better', 'big', 'bike', 'birthday', 'bit', 'bitch', 'black', 'blip', 'blog', 'blue', 'body', 'boo', 'book', 'books', 'bored', 'boring', 'both', 'bought', 'bout', 'box', 'boy', 'boys', 'break', 'breakfast', 'bring', 'bro', 'broke', 'broken', 'brother', 'brothers', 'btw', 'bus', 'bu

Usually when we use the vectorizer, we write code like this:

```python
vectors = vectorizer.fit_transform(....)
```

That both learns all of the words **and** counts them. In this case **we already have the list of words we know, we just want to count them.** So instead of `.fit_transform`, we use just `.transform`:

```python
unknown_vectors = vectorizer.transform(unknown.content)
unknown_words_df = ......
```

Finish making your `unknown_words_df` in the cell below.

In [16]:
# Pass the sentences through the vectorizer

# transform, not fit_transform, because we already learned all our words
unknown_vectors = vectorizer.transform(unknown.content)
unknown_words_df = pd.DataFrame(unknown_vectors.toarray(), columns=vectorizer.get_feature_names_out())
unknown_words_df.head()

Unnamed: 0,10,100,11,12,15,1st,20,2day,2nd,30,able,about,account,actually,add,after,afternoon,again,ago,agree,ah,ahh,ahhh,air,album,all,almost,alone,already,alright,also,although,always,am,amazing,amp,an,and,annoying,another,...,work,worked,working,works,world,worried,worry,worse,worst,worth,would,wouldn,wow,write,writing,wrong,wtf,www,xd,xoxo,xx,xxx,ya,yay,yea,yeah,year,years,yep,yes,yesterday,yet,yo,you,young,your,yourself,youtube,yum,yup
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.417209,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.537291,0.0,0.0,0.244939,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.215967,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
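
To make the `.fit_transform` versus `.transform` distinction concrete, here is a toy count-based vectorizer in plain Python. `ToyVectorizer` is a hypothetical stand-in, not scikit-learn's implementation: `fit` learns the vocabulary, while `transform` only counts words that are already in it, so words never seen during fitting are silently dropped.

```python
class ToyVectorizer:
    """Hypothetical, simplified stand-in for a text vectorizer."""

    def fit(self, texts):
        # Learn the vocabulary: every distinct lowercase word, sorted
        self.vocabulary = sorted(
            {word for text in texts for word in text.lower().split()}
        )
        return self

    def transform(self, texts):
        # Count only words already in the vocabulary; ignore the rest
        rows = []
        for text in texts:
            words = text.lower().split()
            rows.append([words.count(term) for term in self.vocabulary])
        return rows

    def fit_transform(self, texts):
        return self.fit(texts).transform(texts)


toy = ToyVectorizer()
train_rows = toy.fit_transform(["love this kitten", "hate this keyboard"])
new_rows = toy.transform(["love love this toast"])  # "toast" was never learned

print(toy.vocabulary)
print(new_rows)
```

The number of columns is fixed by the text used for fitting, not by the new text, which is why the new sentences get the same 1,000 columns as the training tweets.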


Confirm that `unknown_words_df` has 8 rows (one per sentence, or more if you added your own) and 1,000 columns (one per word the vectorizer knows).

In [17]:
unknown_words_df.shape

(8, 1000)

### Predicting with our models

To make a prediction for each of our sentences, you can use `.predict` with each of our models. For example, it would look like this for logistic regression:

```python
unknown['pred_logreg'] = logreg.predict(unknown_words_df)
```

`.predict` gives you a `0` (negative) or a `1` (positive). For logistic regression, and for any other model that supports it, you can also **ask for the probability that the sentence is in the `1` category** instead of just the category itself. To do that, you use this code:

```python
unknown['pred_logreg_proba'] = logreg.predict_proba(unknown_words_df)[:,1]
```

**Add new columns for each of the models you trained.** If the model has a `.predict_proba`, add that as a column as well.

* **Tip:** Tab completion is useful for finding out whether `.predict_proba` is an option.
* **Tip:** don't forget the `[:,1]` after `.predict_proba`; it means "give me the probability for category `1`"
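
As a rough illustration of how `.predict` and `.predict_proba` relate for a two-class logistic regression, the sketch below turns a weighted sum of features into a probability with the sigmoid function. The weights, intercept, and feature values are invented for the example; a trained model learns its own.

```python
import math

# Hypothetical learned weights for three words, plus an intercept
weights = {"love": 2.0, "hate": -2.5, "toast": 0.1}
intercept = -0.2

def predict_proba(features):
    """Return [P(class 0), P(class 1)] for one row of word weights."""
    score = intercept + sum(weights[word] * value for word, value in features.items())
    p_positive = 1 / (1 + math.exp(-score))  # sigmoid
    return [1 - p_positive, p_positive]

def predict(features):
    """Return 1 when class 1 is more likely than class 0, else 0."""
    return 1 if predict_proba(features)[1] > 0.5 else 0

rows = [{"love": 0.9}, {"hate": 0.8}, {"toast": 0.5}]
probabilities = [predict_proba(row) for row in rows]

# Equivalent of [:, 1] on a NumPy array: keep only the class-1 column
class_1 = [pair[1] for pair in probabilities]
labels = [predict(row) for row in rows]

print([round(p, 3) for p in class_1])
print(labels)
```

Each row of probabilities has two entries that add up to 1, one per category, so keeping the second entry is enough to describe the prediction.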

In [18]:
# Predict using all our models. 

# Logistic Regression predictions + probabilities
unknown['pred_logreg'] = logreg.predict(unknown_words_df)
unknown['pred_logreg_proba'] = logreg.predict_proba(unknown_words_df)[:,1]

# Random forest predictions + probabilities
unknown['pred_forest'] = forest.predict(unknown_words_df)
unknown['pred_forest_proba'] = forest.predict_proba(unknown_words_df)[:,1]

# SVC predictions
unknown['pred_svc'] = svc.predict(unknown_words_df)

# Bayes predictions + probabilities
unknown['pred_bayes'] = bayes.predict(unknown_words_df)
unknown['pred_bayes_proba'] = bayes.predict_proba(unknown_words_df)[:,1]

In [19]:
unknown

Unnamed: 0,content,pred_logreg,pred_logreg_proba,pred_forest,pred_forest_proba,pred_svc,pred_bayes,pred_bayes_proba
0,I love love love love this kitten,1,0.950516,1,0.857023,1,1,0.747222
1,I hate hate hate hate this keyboard,0,0.009595,0,0.0,0,0,0.122383
2,I'm not sure how I feel about toast,0,0.180953,0,0.22,0,0,0.416819
3,Did you see the baseball game yesterday?,1,0.614999,1,0.62,1,1,0.509662
4,The package was delivered late and the contents were broken,0,0.058225,1,0.58,0,0,0.219788
5,Trashy television shows are some of my favorites,0,0.330459,0,0.273333,0,1,0.534234
6,"I'm seeing a Kubrick film tomorrow, I hear not so great things about it.",1,0.558401,0,0.16,1,1,0.533493
7,"I find chirping birds irritating, but I know I'm not the only one",0,0.060197,0,0.48,0,0,0.295739


### Questions

* What do the numbers mean? What's the difference between a 0 and a 1? A 0.5? Negative numbers?
* Were there any sentences the classifiers seemed to disagree about? How do you feel about the amount of disagreement?
* What's the difference between using a 0/1 to talk about sentiment compared to a 0-1 range? When might you use one compared to the other?
* What's the difference between the linear regression model and the other models we're using? Why might it or might it not be a good fit?
* Between 0 and 1, which ranges do you think count as "negative", "positive" and "neutral"?
* Does the variation in scores reflect the variation you'd see among people? Or is it better or worse?
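
For the disagreement question, one way to check is to look for rows where the four 0/1 predictions are not all the same. The sketch below copies the predictions from the output table above into plain Python lists; the variable names are just for illustration.

```python
# 0/1 predictions copied from the table above, one entry per sentence (rows 0-7)
predictions = {
    "logreg": [1, 0, 0, 1, 0, 0, 1, 0],
    "forest": [1, 0, 0, 1, 1, 0, 0, 0],
    "svc":    [1, 0, 0, 1, 0, 0, 1, 0],
    "bayes":  [1, 0, 0, 1, 0, 1, 1, 0],
}

# A row shows disagreement when the models' answers are not all identical
disagreements = [
    row
    for row, votes in enumerate(zip(*predictions.values()))
    if len(set(votes)) > 1
]

print(disagreements)
```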

## Testing our models

We can actually see **which model performs the best!** Remember how we trained our models on tweets? We can ask each model about each tweet and see whether it gets the right answer.

In [20]:
df.head()

Unnamed: 0,polarity,text
0,0,@kconsidder You never tweet
1,0,Sick today coding from the couch.
2,1,"@ChargerJenn Thx for answering so quick,I was afraid I was gonna crash twitter with all the spamming I did 2 RR..sorry bout that"
3,1,Wii fit says I've lost 10 pounds since last time
4,0,@MrKinetik Not a thing!!! I don't really have a life.....


Our original dataframe is a list of many, many tweets. We turned it into `X` - the vectorized words - and `y` - whether the tweet is negative or positive.

Before, we used `.fit(X, y)` to train on all of our data. Instead, **we can test our models** by doing a train/test split and seeing whether the predictions match the actual labels.

In [21]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
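
With no extra arguments, `train_test_split` shuffles the rows and holds out a quarter of them for testing, so 30,000 tweets become 22,500 training rows and 7,500 test rows. The sketch below mimics that with the standard library; `simple_split` and its `seed` parameter are hypothetical helpers, not part of scikit-learn.

```python
import math
import random

def simple_split(rows, test_fraction=0.25, seed=0):
    """Shuffle row indices and split them into train and test groups."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(30000))  # stand-in for 30,000 tweets
train_rows, test_rows = simple_split(rows)

print(len(train_rows), len(test_rows))
```

Passing `random_state=...` to the real `train_test_split` makes the shuffle repeatable, in the same way `seed` does here.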

In [22]:
%%time

print("Training logistic regression")
logreg.fit(X_train, y_train)

print("Training random forest")
forest.fit(X_train, y_train)

print("Training SVC")
svc.fit(X_train, y_train)

print("Training Naive Bayes")
bayes.fit(X_train, y_train)

Training logistic regression
Training random forest
Training SVC
Training Naive Bayes
CPU times: user 27.9 s, sys: 901 ms, total: 28.8 s
Wall time: 23.6 s


### Confusion matrices

To see how well they did, we'll use a ["confusion matrix"](https://en.wikipedia.org/wiki/Confusion_matrix) for each one. I think confusion matrices are called that because they are confusing.

In [23]:
from sklearn.metrics import confusion_matrix

#### Logistic Regression

In [24]:
y_true = y_test
y_pred = logreg.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,2743,982
Is positive,915,2860
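
A confusion matrix is just a tally of (actual, predicted) pairs. As a minimal sketch of what `confusion_matrix` computes for two classes, the snippet below counts those pairs by hand; the labels are invented for the example.

```python
# Hypothetical true labels and predictions (0 = negative, 1 = positive)
y_true_toy = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_toy = [0, 1, 0, 1, 0, 1, 1, 0]

# Rows are the actual class, columns are the predicted class
toy_matrix = [[0, 0], [0, 0]]
for actual, predicted in zip(y_true_toy, y_pred_toy):
    toy_matrix[actual][predicted] += 1

print(toy_matrix)
```

Reading the logistic regression matrix above the same way: 2,743 negative tweets were correctly called negative, while 982 negative tweets were wrongly called positive.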


#### Random forest

In [25]:
y_true = y_test
y_pred = forest.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,2774,951
Is positive,1126,2649


#### SVC

In [26]:
y_true = y_test
y_pred = svc.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,2741,984
Is positive,896,2879


#### Multinomial Naive Bayes

In [27]:
y_true = y_test
y_pred = bayes.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,2803,922
Is positive,1039,2736


### Percentage-based confusion matrices

Those are irritating because they're just raw counts. Let's try percentages.

#### Logistic regression

In [29]:
y_true = y_test
y_pred = logreg.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,0.736376,0.263624
Is positive,0.242384,0.757616
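
The `.div(matrix.sum(axis=1), axis=0)` step divides every count by its row total, so each row adds up to 1. The same arithmetic in plain Python, using the logistic regression counts from the raw confusion matrix earlier:

```python
# Logistic regression counts from the raw confusion matrix
counts = [
    [2743, 982],   # is negative: predicted negative, predicted positive
    [915, 2860],   # is positive: predicted negative, predicted positive
]

# Divide each count by the total of its own row
shares = [[count / sum(row) for count in row] for row in counts]

for row in shares:
    print([round(share, 6) for share in row])
```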


#### Random forest

In [30]:
y_true = y_test
y_pred = forest.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,0.744698,0.255302
Is positive,0.298278,0.701722


#### SVC

In [31]:
y_true = y_test
y_pred = svc.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,0.735839,0.264161
Is positive,0.237351,0.762649


#### Multinomial Naive Bayes

In [32]:
y_true = y_test
y_pred = bayes.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,0.752483,0.247517
Is positive,0.275232,0.724768


## Review

If you aren't satisfied with a tool, you can try to build your own! That's exactly what we tried to do, using the **Sentiment140 dataset** and several machine learning algorithms.

Sentiment140 is a database of tweets that come pre-labeled with positive or negative sentiment, assigned automatically by the presence of a `:)` or `:(`. Our first step was using a **vectorizer** to convert the tweets into numbers a computer could understand.
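
As a rough sketch of that kind of automatic labeling, the function below assigns a polarity from emoticons alone. It illustrates the idea and is not the Sentiment140 authors' actual pipeline; the `label_by_emoticon` name and the choice to skip tweets with both or neither emoticon are assumptions.

```python
def label_by_emoticon(tweet):
    """Return 1 for ':)', 0 for ':(', and None when the signal is missing or mixed."""
    has_happy = ":)" in tweet
    has_sad = ":(" in tweet
    if has_happy == has_sad:
        return None  # neither or both: no clear label
    return 1 if has_happy else 0

examples = [
    "Finally finished my project :)",
    "Missed the bus again :(",
    "Not sure how I feel about toast",
]
labels = [label_by_emoticon(tweet) for tweet in examples]
print(labels)
```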

After that, we built four different **models** using different machine learning algorithms. Each one was fed a list of each tweet's **features** - the words - and each tweet's **label** - the sentiment - in the hope that it could later predict the labels of new tweets. This process of teaching the algorithm is called **training**.

To test our algorithms, we split our data into two sections - **train** and **test** data. You teach the algorithm with the first group, and then ask it for predictions on the second. You can then compare its predictions to the right answers using a **confusion matrix**.

Although **different algorithms took different amounts of time to train**, they all ended up with about 70-75% accuracy.
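
That accuracy figure can be read straight off the raw confusion matrices: add the two correct cells (the diagonal) and divide by the total number of test tweets. A minimal sketch using the counts reported in the confusion matrices section:

```python
# Raw confusion matrices from earlier: [[TN, FP], [FN, TP]]
matrices = {
    "logistic regression": [[2743, 982], [915, 2860]],
    "random forest": [[2774, 951], [1126, 2649]],
    "linear SVC": [[2741, 984], [896, 2879]],
    "naive Bayes": [[2803, 922], [1039, 2736]],
}

accuracy = {}
for name, ((tn, fp), (fn, tp)) in matrices.items():
    # Correct predictions divided by all predictions
    accuracy[name] = (tn + tp) / (tn + fp + fn + tp)

for name, value in accuracy.items():
    print(f"{name}: {value:.1%}")
```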