# Lecture 1: Twitter Sentiment Analysis I

WARNING: The Politecnico di Torino, especially Viviana Patti here, is world-renowned in many of the topics covered here. I cannot even reach up to tie her shoelaces! So again, I will concentrate on working with whatever is available on the web, rather than on the state of the art.

## Polarity

**Def'n**: the polarity $P_s(T)=p \in \mathbb{R}$ is a function from a tweet's text $T$ to a real number $-1 \leq p \leq 1$, where $-1$ is negative polarity, $0$ is neutral and $1$ is positive polarity. We want to discuss $P_s$, of course. To do that, we will try to reproduce

 - Alec Go, Richa Bhayani, and Lei Huang. 2009. Twitter sentiment classification using distant supervision. Technical report, Stanford.<br>[ [paper](http://cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf) ]
 
the basis of [Sentiment140](http://www.sentiment140.com/), a well-known sentiment analysis (SA) website and API, where Naive Bayes, Maximum Entropy and Support Vector Machines are compared against a baseline that uses a list of positive and negative words. There's a nice API to work with Sentiment140:

In [6]:
!curl -d "{'data': [{'text': 'I love Titanic.'}]}" \
http://www.sentiment140.com/api/bulkClassifyJson

{"data":[{"text":"I love Titanic.","polarity":4,"meta":{"language":"en"}}]}


where the polarity values are: 0 (negative), 2 (neutral), 4 (positive).
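
To relate these labels to the definition of $P_s$ above, one possible convention (an assumption of this sketch, not something Sentiment140 documents) is to rescale 0, 2, 4 linearly onto $-1$, $0$, $1$. The helper name `label_to_polarity` is made up for illustration; the response string is copied from the cell above.

```python
import json


def label_to_polarity(label):
    """Rescale a Sentiment140 label (0, 2 or 4) onto [-1, 1].

    The linear rescaling is an assumption made for this sketch.
    """
    if label not in (0, 2, 4):
        raise ValueError("expected a Sentiment140 label: 0, 2 or 4")
    return (label - 2) / 2


# Response copied verbatim from the API call above.
raw = '{"data":[{"text":"I love Titanic.","polarity":4,"meta":{"language":"en"}}]}'
response = json.loads(raw)

# Map each classified text to its rescaled polarity.
scores = {item["text"]: label_to_polarity(item["polarity"])
          for item in response["data"]}
print(scores)  # {'I love Titanic.': 1.0}
```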

In [18]:
!curl -d "{'data': [{'text': 'I hate Titanic.'}]}" \
http://www.sentiment140.com/api/bulkClassifyJson

{"data":[{"text":"I hate Titanic.","polarity":0,"meta":{"language":"en"}}]}


In [19]:
!curl -d "{'data': [{'text': 'Titanic was white.'}]}" \
http://www.sentiment140.com/api/bulkClassifyJson

{"data":[{"text":"Titanic was white.","polarity":2,"meta":{"language":"en"}}]}


Two bad things about Sentiment140, though:

 1.  Sentiment140 only supports English and Spanish at this stage, so no Italian: see the "Specifying a Language" section of http://help.sentiment140.com/api.
 1. Unfortunately, there is no open source version of that algorithm (shame).

Bulk processing of Spanish tweets doesn't work that well, though: below, "Odio el mundo!" ("I hate the world!") comes back as neutral (polarity 2).

In [23]:
!curl -d "{'data': [{'text': 'Odio el mundo!','language': 'es'}]}" \
http://www.sentiment140.com/api/bulkClassifyJson

{"data":[{"text":"Odio el mundo!","polarity":2,"language":"es","meta":{"language":"es"}}]}
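
For more than a handful of tweets, building the request body by hand gets tedious. Here is a minimal sketch of a payload builder, assuming the `data`, `text` and `language` fields behave as in the calls above. The helper name `build_payload` and the second tweet are made up for illustration, and nothing is sent over the network.

```python
import json

# Endpoint used in the curl calls above.
API_URL = "http://www.sentiment140.com/api/bulkClassifyJson"


def build_payload(texts, language=None):
    """Build a JSON body for bulk classification (hypothetical helper)."""
    items = []
    for text in texts:
        item = {"text": text}
        if language is not None:
            # Per-item language code, as in the Spanish example above.
            item["language"] = language
        items.append(item)
    return json.dumps({"data": items}, ensure_ascii=False)


# The second tweet is an invented example.
payload = build_payload(["Odio el mundo!", "Amo el mundo!"], language="es")
print(payload)
```

The resulting string could then be passed to `curl -d` or to any HTTP client, against `API_URL`.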


Still, there are lots of GitHub repos doing polarity and sentiment analysis in general. The closest one is probably [http://www.sananalytics.com/lab/twitter-sentiment/](http://www.sananalytics.com/lab/twitter-sentiment/). It doesn't work for Italian either (anyone up for fixing this?), but some parts of the code are worth looking at, and at least we can use it for bootstrapping.

There is even a Quora question about it, though it seems it's from before Viviana started working on this... or was it you, Vivi? [Are there any twitter dataset in the italian language for sentiment analysis?](https://www.quora.com/Are-there-any-twitter-dataset-in-the-italian-language-for-sentiment-analysis)

There's quite a bit of awesome work on sentiment analysis of Italian tweets:

 - http://www.aclweb.org/anthology/W13-1614
 - http://clic.humnet.unipi.it/proceedings/vol1/CLICIT2014173.pdf
 - http://clic.humnet.unipi.it/proceedings/vol2/clicit2014217.pdf
 
and from the University of Torino itself! Look at:

 - http://ijcai.org/Proceedings/15/Papers/587.pdf
 - http://www.di.unito.it/~tutreeb/sentiTUT.html#corpus
 
But actually finding a system that does this has proven to be *very* hard.

TextBlob and NLTK do provide sentiment analysis engines, but they are not geared for Twitter: the trained classifiers they ship are built on the movie review dataset. We don't want that; we want to see what happens to tweets, right? Let's look at an example of what `textblob` does, which I do think is cool: it returns a measure for polarity and a measure for "subjectivity". We will talk about subjectivity tomorrow.

In [10]:
from textblob import TextBlob

In [15]:
testimonial = TextBlob("I love cars")
testimonial.sentiment

Sentiment(polarity=0.5, subjectivity=0.6)

The polarity score is a float within the range [-1.0, 1.0]. The subjectivity is a float within the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective. Once again, it only works for English, and neither analyzer in the file linked below is tuned for Twitter: the default `PatternAnalyzer`, which produced the output above, is lexicon-based, and the alternative `NaiveBayesAnalyzer` is trained on the movie reviews corpus.

https://github.com/sloria/TextBlob/blob/dev/textblob/en/sentiments.py
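
To compare a continuous polarity like this with the three classes used by Sentiment140, the score has to be binned somehow. A minimal sketch follows; the function name `polarity_to_class` and the neutral band of 0.1 are assumptions chosen for illustration, not values taken from TextBlob or from the paper.

```python
def polarity_to_class(polarity, neutral_band=0.1):
    """Bin a polarity in [-1, 1] into negative / neutral / positive.

    neutral_band is a hypothetical threshold: scores whose absolute
    value does not exceed it are treated as neutral.
    """
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must lie in [-1, 1]")
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"


print(polarity_to_class(0.5))   # positive (the "I love cars" score above)
print(polarity_to_class(0.0))   # neutral
print(polarity_to_class(-0.7))  # negative
```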

## The goal for today

We need to train an SVM, NB, and NN on the polarity column of http://www.di.unito.it/~tutreeb/sentipolc-evalita14/data.html. We want to replicate Alec Go's paper for the Italian dataset provided to us by Viviana Patti (Thanks, Vivi).
 - Baseline
   - a list of positive and negative words, as in the paper
 - Naive Bayes
   - http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/
   - http://scikit-learn.org/stable/modules/naive_bayes.html
 - SVM
   - https://marcobonzanini.com/2015/01/19/sentiment-analysis-with-python-and-scikit-learn/
   - http://scikit-learn.org/stable/modules/svm.html
 - Maximum entropy
   - http://ravikiranj.net/posts/2012/code/how-build-twitter-sentiment-analyzer/
   - scikit-learn calls MaxEnt logistic regression, http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
 - Deep learning (anyone up for doing this?)
   - http://nlp.stanford.edu/sentiment/code.html
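
To make the plan concrete, here is a rough sketch of the comparison using scikit-learn, with a keyword baseline next to Naive Bayes, a linear SVM and MaxEnt (logistic regression). Everything about the data is an assumption: the six training sentences and the word lists are invented toy examples, not the SENTIPOLC data, and bag-of-words counts are only one possible feature choice.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy, invented Italian examples (NOT taken from the SENTIPOLC data).
train_texts = [
    "amo questo film",
    "che bella giornata",
    "ottimo lavoro davvero",
    "odio questo film",
    "che brutta giornata",
    "pessimo lavoro davvero",
]
train_labels = ["pos", "pos", "pos", "neg", "neg", "neg"]

# Hypothetical word lists for the keyword baseline.
POSITIVE_WORDS = {"amo", "bella", "ottimo"}
NEGATIVE_WORDS = {"odio", "brutta", "pessimo"}


def keyword_baseline(text):
    """Count positive minus negative keywords and return a label."""
    tokens = text.lower().split()
    score = (sum(t in POSITIVE_WORDS for t in tokens)
             - sum(t in NEGATIVE_WORDS for t in tokens))
    if score > 0:
        return "pos"
    if score < 0:
        return "neg"
    return "neutral"


# One bag-of-words pipeline per classifier.
models = {
    "naive_bayes": make_pipeline(CountVectorizer(), MultinomialNB()),
    "svm": make_pipeline(CountVectorizer(), LinearSVC()),
    "maxent": make_pipeline(CountVectorizer(), LogisticRegression()),
}

test_texts = ["amo questa giornata", "odio questo lavoro"]
predictions = {"baseline": [keyword_baseline(t) for t in test_texts]}
for name, model in models.items():
    model.fit(train_texts, train_labels)
    predictions[name] = list(model.predict(test_texts))

for name, labels in predictions.items():
    print(name, labels)
```

With the real dataset, the toy lists would be replaced by the tweet texts and the polarity column, and the predictions would be scored on a held-out split rather than printed.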
   
### Death

This is what I mean: the following paper whets your appetite and then kills you with despair:

 - http://arxiv.org/pdf/1507.00133.pdf

#### Orphan paper(s)

Anyone up for doing this in Italian as a final project? :)

- Aliaksei Severyn and Alessandro Moschitti. 2015. Twitter Sentiment Analysis with Deep Convolutional Neural Networks. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '15). ACM, New York, NY, USA, 959-962. DOI=http://dx.doi.org/10.1145/2766462.2767830.<br>[Paper: [SM2015](http://dl.acm.org/citation.cfm?id=2767830&CFID=771940515&CFTOKEN=64862584)] [Software: [PDNN](https://www.cs.cmu.edu/~ymiao/pdnntk.html)]

This is a very good answer on Stack Overflow about datasets and corpora: http://stackoverflow.com/a/33215771/50305

In particular, there's code for Spanish here: http://www.sngularmeaning.team/TASS2013/corpus.php. Except that, as usual, most of the links are broken.