In [4]:
from IPython.core.display import HTML
HTML("""
<style>
.definition{padding         : 1em;
            background-color: Aquamarine;
            border: 1px solid blue;}
.important{ padding         : 1em;
            background-color: red;
            border: 1px solid blue;}
</style>
""")

# Lecture 1: Basic NLP & SA I

This lecture has three parts:

1. Preliminaries and Introduction to Sentiment Analysis
2. Polarity
3. Opinion

This document contains notes for Part 1. [Polarity](Polarity.ipynb) and [Opinion](Opinion.ipynb) contain notes for Parts 2 and 3.

## Preliminaries 

### A tweet<br>


<div class="definition">
**Def'n**: A tweet $t$ is a sequence (string) of Unicode (UTF-8) encoded characters $(c_1,c_2,...,c_n)$, where $|t|=n$ is the length of $t$, and $0 < | t | \leq 140$,
</div>

where UTF-8 means that each character is encoded in one to four bytes:

In [5]:
!unicode e

U+0065 LATIN SMALL LETTER E
UTF-8: 65  UTF-16BE: 0065  Decimal: &#101;
e (E)
Uppercase: U+0045
Category: Ll (Letter, Lowercase)
Bidi: L (Left-to-Right)



In [6]:
!unicode é # combined

U+00E9 LATIN SMALL LETTER E WITH ACUTE
UTF-8: c3 a9  UTF-16BE: 00e9  Decimal: &#233;
é (É)
Uppercase: U+00C9
Category: Ll (Letter, Lowercase)
Bidi: L (Left-to-Right)
Decomposition: 0065 0301



In [7]:
!unicode -x 301 # diacritical

U+0301 COMBINING ACUTE ACCENT
UTF-8: cc 81  UTF-16BE: 0301  Decimal: &#769;
 ́
Category: Mn (Mark, Non-Spacing)
Bidi: NSM (Non-Spacing Mark)
Combining: 230 (Above)



`unicode` is in the Ubuntu repositories. To install: `sudo apt-get install unicode`

`e` is one byte (`0x65`); `é` is either two bytes (`0xc3 0xa9`, the precomposed character) or three (`0x65 0xcc 0x81`, an `e` followed by the combining accent). See [this](https://twitter.com/leoferres/status/729705274408239105).

Twitter's JavaScript web page imposes composition; the API does not. What does this mean for the information load of a tweet? And what does length mean for CJK languages?
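
The two encodings of `é` can be checked directly. The following is a minimal sketch, written for Python 3 (unlike the Python 2 cells in this notebook), that uses the standard `unicodedata` module to compose the three-byte form into the two-byte one. The helper name `utf8_hex` is made up for this illustration.

```python
import unicodedata

def utf8_hex(s):
    """Return the UTF-8 bytes of s as space-separated hex (hypothetical helper)."""
    return " ".join("{:02x}".format(b) for b in s.encode("utf-8"))

composed = "\u00e9"      # é as a single code point (U+00E9)
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT (U+0301)

for label, s in (("composed", composed), ("decomposed", decomposed)):
    # len() counts code points, not bytes
    print(label, len(s), utf8_hex(s))

# NFC normalization composes the pair into the single code point
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == composed, len(nfc))
```

If tweet length is measured in code points, normalizing to NFC before counting is one way to make the two spellings of `é` count the same.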

**Note**: We do not deal with a tweet's metadata, which would mean defining a tweet as a k-tuple with the tweet's text being only one member of the tuple.

Let's work out some examples:

In [8]:
t1 = u"felicidadees!! k t lo pases muy bien!! =)"
t2 = u"Feeeliiciidaaadeeess !! (:Felicidadesss!!pasatelo genialll :D"
t3 = u"FeliicCiidaDesS! :D Q tte Lo0 paseS bN! ;) ♥"
t4 = "FeliicCiidaDesS! :D Q tte Lo0 paseS bN! ;) ♥"

In [9]:
seq =[]
for c in t3:
    print c,
    seq.append(c)

F e l i i c C i i d a D e s S !   : D   Q   t t e   L o 0   p a s e S   b N !   ; )   ♥


In [10]:
print "Number of characters: " + str(len(t3))
print len(seq)

Number of characters: 44
44


In [11]:
for c in t4:
    print c,
    seq.append(c)

F e l i i c C i i d a D e s S !   : D   Q   t t e   L o 0   p a s e S   b N !   ; )   � � �


In [12]:
print "Number of characters: " + str(len(t4))

Number of characters: 46
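
The difference between 44 and 46 comes from counting code points (`t3`, a unicode string) versus counting bytes (`t4`, a byte string in Python 2). A small sketch of the same comparison in Python 3, where the distinction is explicit between `str` and `bytes`:

```python
# Same text as t3 above; the heart is written as an escape to avoid ambiguity.
t3 = "FeliicCiidaDesS! :D Q tte Lo0 paseS bN! ;) \u2665"

n_chars = len(t3)                  # number of Unicode code points
n_bytes = len(t3.encode("utf-8"))  # number of bytes once encoded as UTF-8
heart_bytes = "\u2665".encode("utf-8")

print(n_chars, n_bytes)
print(len(heart_bytes))  # the heart alone accounts for the extra bytes
```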


In [13]:
!unicode ♥

U+2665 BLACK HEART SUIT
UTF-8: e2 99 a5  UTF-16BE: 2665  Decimal: &#9829;
♥
Category: So (Symbol, Other)
Bidi: ON (Other Neutrals)



Our main target languages are Spanish, Italian and English. We still need to (at the very least) discard those messages that are not in one of these languages. The following module, available at https://github.com/saffsd/langid.py, trained on 97 languages, does the identification for us. See: M. Lui and T. Baldwin. 2012. langid.py: An off-the-shelf language identification tool. In Proc. of ACL. [Architecture paper](http://www.aclweb.org/anthology/P12-3005) and [theory paper](http://www.aclweb.org/anthology/I11-1062) (Cited by 47).

In [14]:
import sys
sys.path.append('../modules/langid.py/langid')
from langid import LanguageIdentifier, model
identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)

In [15]:
print identifier.classify(t1)
print identifier.classify(t2)
print identifier.classify(t3) # t4 is t3 with no unicode

('es', 0.9999693636198159)
('pt', 0.8822975698500081)
('no', 0.332943138461849)


In [16]:
t5 = "Ho aggiunto un video a una playlist di @YouTube: https://t.co/jrSt4uW17P Joybiza"
identifier.classify(t5)

('it', 0.9999999999999993)
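
Given classification results like those above, discarding non-target tweets could be sketched as follows (Python 3). The function `keep_target_tweets`, the `min_prob` threshold of 0.5, and the stand-in `canned_classify` are all assumptions made for this illustration; the canned values are rounded from the outputs above, and in practice `identifier.classify` would be passed instead.

```python
TARGET_LANGUAGES = {"es", "it", "en"}

def keep_target_tweets(tweets, classify, targets=TARGET_LANGUAGES, min_prob=0.5):
    """Keep tweets whose predicted language is a target and confident enough."""
    kept = []
    for tweet in tweets:
        lang, prob = classify(tweet)  # same (language, probability) shape as langid
        if lang in targets and prob >= min_prob:
            kept.append(tweet)
    return kept

# Hypothetical stand-in for identifier.classify, using rounded results from above.
CANNED = {
    "t1": ("es", 0.99997),
    "t2": ("pt", 0.88230),
    "t3": ("no", 0.33294),
    "t5": ("it", 1.0),
}

def canned_classify(tweet):
    return CANNED[tweet]

kept = keep_target_tweets(["t1", "t2", "t3", "t5"], canned_classify)
print(kept)
```

With these results the filter would also drop `t2` and `t3`, which are Spanish tweets with noisy spelling, so a filter of this kind is only as good as the text it is given.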

`langid.py` uses byte n-grams, where $1\leq n\leq 5$, as features, so it makes no assumptions about the language. The N most frequent terms for each language are retained in the global feature set.
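
To make the notion of a byte n-gram concrete, here is a minimal sketch (Python 3) that counts all byte n-grams of a string for $1\leq n\leq 5$. It only illustrates the kind of feature involved; it is not `langid.py`'s feature selection or classifier, and the function name `byte_ngrams` is made up for this example.

```python
from collections import Counter

def byte_ngrams(text, n_min=1, n_max=5):
    """Count every byte n-gram of the UTF-8 encoding of text, n_min <= n <= n_max."""
    data = text.encode("utf-8")
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # slide a window of n bytes over the encoded text
        for i in range(len(data) - n + 1):
            counts[data[i:i + n]] += 1
    return counts

counts = byte_ngrams("bien")
print(counts.most_common(5))

# Non-ASCII characters contribute n-grams that cut across their bytes.
print(byte_ngrams("\u00e9", 1, 2))
```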

Notice how difficult it becomes for `LangID` to identify some of the tweets, given that some of the byte n-grams do not really belong to Spanish (but maybe they do in Portuguese/Norwegian?). This brings us to another pre-processing step: tweet **normalization**.

### Words

The def'n above is sufficiently broad to account for any tweet in any (human or machine) language. But we want to know what a tweet is "communicating", not what bytes it's composed of. 

For that, we "split" a tweet (in the sense above) into its component elements. Some elements are inherent to tweets (at-mentions, hashtags, or RTs); others are more general: words in the language the tweet is written in.
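
The tweet-specific elements can be pulled out with a few regular expressions. This is a rough sketch in Python 3; the patterns and the function name `tweet_elements` are assumptions for illustration and are much simpler than the rules Twitter itself applies.

```python
import re

MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
URL = re.compile(r"https?://\S+")

def tweet_elements(text):
    """Extract mentions, hashtags and URLs, and flag old-style retweets."""
    urls = URL.findall(text)
    rest = URL.sub(" ", text)  # remove URLs so '#' or '@' inside them is ignored
    return {
        "mentions": MENTION.findall(rest),
        "hashtags": HASHTAG.findall(rest),
        "urls": urls,
        "is_retweet": rest.lstrip().startswith("RT "),
    }

example = "Ho aggiunto un video a una playlist di @YouTube: https://t.co/jrSt4uW17P Joybiza"
print(tweet_elements(example))
```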

Words are notoriously difficult to define. However, for practical reasons, we will use the following definition: <br><br>

<div class="definition">
**Def'n**: A word $w$ is a sequence of Unicode (UTF-8) encoded characters separated by either a space or a symbol in some predefined set of [punctuation marks](https://en.wikipedia.org/wiki/Punctuation_of_English).
</div>
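
Taken literally, this definition amounts to splitting on whitespace and punctuation. A minimal sketch in Python 3, assuming ASCII `string.punctuation` as the predefined set of punctuation marks (the definition itself leaves the set open):

```python
import re
import string

# One or more whitespace or ASCII punctuation characters act as a separator.
SEPARATORS = re.compile("[" + re.escape(string.punctuation) + r"\s]+")

def words(text):
    """Split text into words according to the definition above."""
    return [w for w in SEPARATORS.split(text) if w]

print(words("felicidadees!! k t lo pases muy bien!! =)"))
```

Note that emoticons such as `=)` disappear entirely under this definition, as would the `@` and `#` of mentions and hashtags, which is one reason to prefer a tweet-aware tokenizer.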

We may extend this if we find someone willing to work on CJK languages, which would be awesome.

## Tokenization

In lexical analysis, tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens ([from Wikipedia](https://en.wikipedia.org/wiki/Tokenization_(lexical_analysis))). For most of our purposes, the simple, regular-expression-based `TweetTokenizer` in NLTK is enough for tokenizing tweets.

In [17]:
from nltk.tokenize import TweetTokenizer
tknzr = TweetTokenizer()

In [18]:
print tknzr.tokenize(t1)
print tknzr.tokenize(t2)
print tknzr.tokenize(t3)

[u'felicidadees', u'!', u'!', u'k', u't', u'lo', u'pases', u'muy', u'bien', u'!', u'!', u'=)']
[u'Feeeliiciidaaadeeess', u'!', u'!', u'(:', u'Felicidadesss', u'!', u'!', u'pasatelo', u'genialll', u':D']
[u'FeliicCiidaDesS', u'!', u':D', u'Q', u'tte', u'Lo0', u'paseS', u'bN', u'!', u';)', u'\u2665']


In [28]:
import sys
sys.path.append('../modules/ark-twokenize-py')
import twokenize
t = twokenize.tokenize(t1)
print t

[u'felicidadees', u'!!', u'k', u't', u'lo', u'pases', u'muy', u'bien', u'!!', u'=)']


### Normalization

Normalization is tough. Twitter messages are very noisy input, like the ones shown again below:

In [20]:
print "1." + t1
print "2." + t2
print "3." + t3
print "4." + t5

1.felicidadees!! k t lo pases muy bien!! =)
2.Feeeliiciidaaadeeess !! (:Felicidadesss!!pasatelo genialll :D
3.FeliicCiidaDesS! :D Q tte Lo0 paseS bN! ;) ♥
4.Ho aggiunto un video a una playlist di @YouTube: https://t.co/jrSt4uW17P Joybiza


We define the task of text normalisation to be a mapping from “ill-formed” OOV lexical items to their standard lexical forms [[P11-1038.pdf](./papers/P11-1038.pdf)]. Twitter text normalization is not a simple task: it involves restoring capitalization and normalizing unconventional spellings, among many other processes. However, we do not want to lose information. Thus, although we normalize to make the later processing steps easier, we keep the original spelling, which also conveys information. For example, the tweet

We will be somewhat naïve in normalizing at this point. But I want to introduce the concept of *word* a bit later, so we'll use a simple normalizer based on a list of words. It is called [deflog](https://github.com/sbruno/deflog); the name comes from "de-fotologging"... don't search for flogger :)

In [23]:
sys.path.append('../modules/')
import libdeflog as df

In [26]:
q = [df.desms(df.desmultiplicar(w)) for w in tknzr.tokenize(t1)]
print q

[u'felicidades', u'!', u'!', u'que', u'te', u'lo', u'pases', u'muy', u'bien', u'!', u'!', u'=)']


However, we must be careful: "doubling up" letters expresses a sentiment emphasis of sorts. That is the kind of information we want to maintain! So instead of actually clobbering stuff, we may want to simply **add** information to some structure, like a `json` file. We will use this later.
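
One way to add information rather than clobber it is to keep the original token next to its normalized form. The sketch below (Python 3) is only an illustration: the rule of collapsing runs of three or more identical characters is an assumption and much cruder than `deflog`, and the record fields `original`, `normalized` and `elongated` are made-up names.

```python
import json
import re

# A character repeated three or more times in a row (hypothetical elongation rule).
ELONGATION = re.compile(r"(\w)\1{2,}")

def squeeze(token):
    """Collapse runs of 3+ identical word characters to a single character."""
    return ELONGATION.sub(r"\1", token)

def annotate(tokens):
    """Keep the original token and record whether it was elongated."""
    records = []
    for tok in tokens:
        norm = squeeze(tok)
        records.append({"original": tok, "normalized": norm, "elongated": norm != tok})
    return records

tokens = ["Felicidadesss", "!", "pasatelo", "genialll", ":D"]
print(json.dumps(annotate(tokens), ensure_ascii=False, indent=2))
```

The `elongated` flag preserves the emphasis signal for later sentiment features, while the `normalized` field is what would be passed to the next processing step.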

## POS-Tagging

Finally, there's POS-Tagging, probably the most difficult *linguistic* task in this introductory document.

Part of speech tagging is the process of identifying nouns, verbs, adjectives, and other parts of speech in context.

There are several POS taggers.

### Stanford

1. Download http://nlp.stanford.edu/software/stanford-postagger-full-2015-04-20.zip (or latest), see documentation at http://nlp.stanford.edu/software/tagger.shtml
1. Make sure you have Java 8 or later; if not, follow the instructions from http://www.webupd8.org/2012/09/install-oracle-java-8-in-ubuntu-via-ppa.html

In [None]:
from nltk.tag.stanford import StanfordPOSTagger as POSTagger
spanish_postagger = POSTagger('../modules/stanford-postagger-full-2015-04-20/models/spanish.tagger','../modules/stanford-postagger-full-2015-04-20/stanford-postagger.jar')

So, enough of introductory stuff. We have covered many (although very naïve) pre-processing techniques for the manipulation of Twitter's text field. Now we will work with the tasks that are important to us, and the first is Sentiment Analysis.

In [27]:
spanish_postagger.tag(q)

[(u'felicidades', u'nc0p000'),
 (u'!', u'fat'),
 (u'!', u'fat'),
 (u'que', u'cs'),
 (u'te', u'pp000000'),
 (u'lo', u'pp000000'),
 (u'pases', u'vmsp000'),
 (u'muy', u'rg'),
 (u'bien', u'rg'),
 (u'!', u'fat'),
 (u'!', u'fat'),
 (u'=)', u'aq0000')]

In [32]:
print t1
spanish_postagger.tag(t1)  # t1 is a raw string, not a token list, so it is tagged character by character

felicidadees!! k t lo pases muy bien!! =)


[(u'f', u'nc0n000'),
 (u'e', u'cc'),
 (u'l', u'np00000'),
 (u'i', u'nc0s000'),
 (u'c', u'np00000'),
 (u'i', u'nc0s000'),
 (u'd', u'np00000'),
 (u'a', u'sp000'),
 (u'd', u'nc0s000'),
 (u'e', u'cc'),
 (u'e', u'cc'),
 (u's', u'pi000000'),
 (u'!', u'fat'),
 (u'!', u'fat'),
 (u'k', u'np00000'),
 (u't', u'np00000'),
 (u'l', u'np00000'),
 (u'o', u'cc'),
 (u'p', u'nc0n000'),
 (u'a', u'sp000'),
 (u's', u'nc0n000'),
 (u'e', u'cc'),
 (u's', u'np00000'),
 (u'm', u'np00000'),
 (u'u', u'cc'),
 (u'y', u'cc'),
 (u'b', u'fz'),
 (u'i', u'nc0s000'),
 (u'e', u'cc'),
 (u'n', u'nc0n000'),
 (u'!', u'fat'),
 (u'!', u'fat'),
 (u'=', u'f0'),
 (u')', u'f0')]
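
The last output tags single characters because the tagger was handed the raw string `t1` rather than a list of tokens, and iterating over a Python string yields its characters. A tiny sketch (Python 3) with a hypothetical stand-in, `toy_tag`, that simply pairs whatever it iterates over with a dummy tag; it is not the Stanford tagger and does not reproduce its handling of whitespace.

```python
def toy_tag(tokens):
    """Hypothetical stand-in for a tagger: pair each item with a dummy tag 'X'."""
    return [(tok, "X") for tok in tokens]

text = "k t lo"

as_string = toy_tag(text)          # iterates over characters, spaces included
as_tokens = toy_tag(text.split())  # iterates over whitespace-separated tokens

print(as_string)
print(as_tokens)
```

This is why the tokenized (and normalized) list `q` gives a sensible tagging while the raw string does not.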