In [85]:
%load_ext watermark
%watermark

11/12/2015 18:17:39

CPython 3.5.0
IPython 4.0.0

compiler   : GCC 4.4.7 20120313 (Red Hat 4.4.7-1)
system     : Linux
release    : 3.13.0-68-generic
machine    : x86_64
processor  : x86_64
CPU cores  : 8
interpreter: 64bit


**Once we have the tweets parsed, we can try and perform sentiment analysis**

We will use an existing labeled dataset to predict the sentiment of the new tweets

Since the tweets dataset consists of tweets from my home town in Spain, that means that we have to perform sentiment analysis focused on Spanish language. Which is more difficult than doing it in English. There are multiple already trained models to perform sentiment analysis in English (for example [textblob](https://textblob.readthedocs.org/)).

I had three options to do sentiment analysis in Spanish:
1. Translate the tweets to english (using google translate or other alternative) and then perform regular sentiment analysis in English.
2. Create my own corpus (a common corpus for sentiment analysis is the imdb dataset)
3. Use an existing corpus in Spanish. 

I found a very, very good dataset provided by the Spanish Society for Natural Language Processing [SEPLN].

Every year they organize the TASS, a workshop focused on sentiment analysis in Spanish.

They provide their [corpus](http://www.sngularmeaning.team/TASS2015/tass2015.php#corpus) for free for academic purposes.

The TASS Dataset includes many files that sum about 50000 tweets from a variety of topics (general, tv, politics, sports), and include the polarity as an ordinal variable. Polarity values include: N+ (very negative), N (negative), NEU(Neutral), P (Positive), P+ (very positive)

In [46]:
cd sentiment/TASS\ Corpus

/media/manuel/DATA/Backup/Proyectos/tweepy murcia/sentiment/TASS Corpus


In [2]:
ls

corpus_data_features.mtx          politics2013.qrel
corpus_vectorizer.mtx             politics2013-tweets-test-tagged.xml
[0m[01;31meval-task1.php.gz[0m                 socialtv-sentiment.qrel
[01;31meval-task2.php.gz[0m                 socialtv-tweets-test-tagged.csv
general-sentiment-3l-1k.qrel      socialtv-tweets-test-tagged.xml
general-sentiment-3l.qrel         socialtv-tweets-test.xml
general-sentiment-5l-1k.qrel      socialtv-tweets-train-tagged.csv
general-sentiment-5l.qrel         socialtv-tweets-train-tagged.xml
general-topics_2013.qrel          stompol-sentiment.qrel
general-tweets-test1k-tagged.xml  stompol-tweets-test-tagged.csv
general-tweets-test1k.xml         stompol-tweets-test-tagged.xml
general-tweets-test-tagged.csv    stompol-tweets-test.xml
general-tweets-test-tagged.xml    stompol-tweets-train-tagged.csv
general-tweets-test.xml           stompol-tweets-train-tagged.xml
general-tweets-train-tagged.csv   test.pkl
general-tweets-train-tagged.xml 

In [2]:
!head -n 30 general-tweets-test-tagged.xml

<?xml version="1.0" encoding="UTF-8"?>
<tweets>
 <tweet>
  <tweetid>142378325086715906</tweetid>
  <user>jesusmarana</user>
  <content><![CDATA[Portada 'Público', viernes. Fabra al banquillo por 'orden' del Supremo; Wikileaks 'retrata' a 160 empresas espías. http://t.co/YtpRU0fd]]></content>
  <date>2011-12-02T00:03:32</date>
  <lang>es</lang>
  <sentiments>
   <polarity><value>N</value></polarity>
  </sentiments>
  <topics>
   <topic>política</topic>
  </topics>
 </tweet>
 <tweet>
  <tweetid>142379080808013825</tweetid>
  <user>EvaORegan</user>
  <content><![CDATA[Grande! RT @veronicacalderon "El periodista es alguien que quiere contar la realidad, pero no vive en ella" via @galtares]]></content>
  <date>2011-12-02T00:06:32</date>
  <lang>es</lang>
  <sentiments>
   <polarity><value>NONE</value></polarity>
  </sentiments>
  <topics>
   <topic>política</topic>
  </topics>
 </tweet>
 <tweet>
  <tweetid>142379173120442368</tweetid>


So we see that tweets come as xml files. TASS files come in different xml schemas (different TASS seminars seem to have used slightly different files), so they need to be parsed each one separately.

In [47]:
import pandas as pd
pd.set_option('max_colwidth',1000)

In [48]:
try:
    general_tweets_corpus_train = pd.read_csv('general-tweets-train-tagged.csv', encoding='utf-8')
except:

    from lxml import objectify
    xml = objectify.parse(open('general-tweets-train-tagged.xml'))
    #sample tweet object
    root = xml.getroot()
    general_tweets_corpus_train = pd.DataFrame(columns=('content', 'polarity', 'agreement'))
    tweets = root.getchildren()
    for i in range(0,len(tweets)):
        tweet = tweets[i]
        row = dict(zip(['content', 'polarity', 'agreement'], [tweet.content.text, tweet.sentiments.polarity.value.text, tweet.sentiments.polarity.type.text]))
        row_s = pd.Series(row)
        row_s.name = i
        general_tweets_corpus_train = general_tweets_corpus_train.append(row_s)
    general_tweets_corpus_train.to_csv('general-tweets-train-tagged.csv', index=False, encoding='utf-8')

In [49]:
try:
    general_tweets_corpus_test = pd.read_csv('general-tweets-test-tagged.csv', encoding='utf-8')
except:

    from lxml import objectify
    xml = objectify.parse(open('general-tweets-test-tagged.xml'))
    #sample tweet object
    root = xml.getroot()
    general_tweets_corpus_test = pd.DataFrame(columns=('content', 'polarity'))
    tweets = root.getchildren()
    for i in range(0,len(tweets)):
        tweet = tweets[i]
        row = dict(zip(['content', 'polarity', 'agreement'], [tweet.content.text, tweet.sentiments.polarity.value.text]))
        row_s = pd.Series(row)
        row_s.name = i
        general_tweets_corpus_test = general_tweets_corpus_test.append(row_s)
    general_tweets_corpus_test.to_csv('general-tweets-test-tagged.csv', index=False, encoding='utf-8')

In [50]:
try:
    stompol_tweets_corpus_train = pd.read_csv('stompol-tweets-train-tagged.csv', encoding='utf-8')
except:

    from lxml import objectify
    xml = objectify.parse(open('stompol-tweets-train-tagged.xml'))
    #sample tweet object
    root = xml.getroot()
    stompol_tweets_corpus_train = pd.DataFrame(columns=('content', 'polarity'))
    tweets = root.getchildren()
    for i in range(0,len(tweets)):
        tweet = tweets[i]
        row = dict(zip(['content', 'polarity', 'agreement'], [' '.join(list(tweet.itertext())), tweet.sentiment.get('polarity')]))
        row_s = pd.Series(row)
        row_s.name = i
        stompol_tweets_corpus_train = stompol_tweets_corpus_train.append(row_s)
    stompol_tweets_corpus_train.to_csv('stompol-tweets-train-tagged.csv', index=False, encoding='utf-8')

In [51]:
try:
    stompol_tweets_corpus_test = pd.read_csv('stompol-tweets-test-tagged.csv', encoding='utf-8')
except:

    from lxml import objectify
    xml = objectify.parse(open('stompol-tweets-test-tagged.xml'))
    #sample tweet object
    root = xml.getroot()
    stompol_tweets_corpus_test = pd.DataFrame(columns=('content', 'polarity'))
    tweets = root.getchildren()
    for i in range(0,len(tweets)):
        tweet = tweets[i]
        row = dict(zip(['content', 'polarity', 'agreement'], [' '.join(list(tweet.itertext())), tweet.sentiment.get('polarity')]))
        row_s = pd.Series(row)
        row_s.name = i
        stompol_tweets_corpus_test = stompol_tweets_corpus_test.append(row_s)
    stompol_tweets_corpus_test.to_csv('stompol-tweets-test-tagged.csv', index=False, encoding='utf-8')

In [52]:
try:
    social_tweets_corpus_test = pd.read_csv('socialtv-tweets-test-tagged.csv', encoding='utf-8')
except:

    from lxml import objectify
    xml = objectify.parse(open('socialtv-tweets-test-tagged.xml'))
    #sample tweet object
    root = xml.getroot()
    social_tweets_corpus_test = pd.DataFrame(columns=('content', 'polarity'))
    tweets = root.getchildren()
    for i in range(0,len(tweets)):
        tweet = tweets[i]
        row = dict(zip(['content', 'polarity', 'agreement'], [' '.join(list(tweet.itertext())), tweet.sentiment.get('polarity')]))
        row_s = pd.Series(row)
        row_s.name = i
        social_tweets_corpus_test = social_tweets_corpus_test.append(row_s)
    social_tweets_corpus_test.to_csv('socialtv-tweets-test-tagged.csv', index=False, encoding='utf-8')

In [53]:
try:
    social_tweets_corpus_train = pd.read_csv('socialtv-tweets-train-tagged.csv', encoding='utf-8')
except:

    from lxml import objectify
    xml = objectify.parse(open('socialtv-tweets-train-tagged.xml'))
    #sample tweet object
    root = xml.getroot()
    social_tweets_corpus_train = pd.DataFrame(columns=('content', 'polarity'))
    tweets = root.getchildren()
    for i in range(0,len(tweets)):
        tweet = tweets[i]
        row = dict(zip(['content', 'polarity', 'agreement'], [' '.join(list(tweet.itertext())), tweet.sentiment.get('polarity')]))
        row_s = pd.Series(row)
        row_s.name = i
        social_tweets_corpus_train = social_tweets_corpus_train.append(row_s)
    social_tweets_corpus_train.to_csv('socialtv-tweets-train-tagged.csv', index=False, encoding='utf-8')

Once we have all the xml files parsed we concatenate them

In [54]:
tweets_corpus = pd.concat([
        social_tweets_corpus_train,
        social_tweets_corpus_test,
        stompol_tweets_corpus_test,
        stompol_tweets_corpus_train,
        general_tweets_corpus_test,
        general_tweets_corpus_train
        
    ])
tweets_corpus.sample(20)

Unnamed: 0,agreement,content,polarity
23802,,Comité Ejecutivo Nacional. http://t.co/fJ71x9dY,NONE
22433,,Se investiga accidente de helicóptero de ayer en Afganistán. Han muerto 6 soldados estadounidenses de la misión de la OTAN,N+
15539,,Así es Paco! RT @PacoFMarugan Lo que ha ocurrido con los ajustes en la UE es muy preocupante en vez de reducir los problemas los ha agravado,N+
43392,,"Acabo de terminar una carrera de 12,3 km con un tiempo de 1:03:15 con Nike+ GPS. #nikeplus",NONE
21707,,La unión de un esfuerzo inversor en la extensión público-privada en la extensión de todas las redes de (cont) http://t.co/3VzPoFe0,P+
58641,,Disculpad pero tengo q marcharme. Buenas noches. Seguiremos...,NONE
9064,,Y ahora que Antena 3 ha absorbido a la Sexta se fusionará El Hormiguero y El Intermedio? #dudasexistenciales,NONE
34883,,@jesusmarana El Gobierno lo está desmintiendo,N
46983,,Ourense ;-))),P+
22042,,"RT @diegogpellicer: Es su filosofía, convertir Cataluña en un negocio, por eso en 23 años no quitaron los peajes a pesar de sus promesas.",NONE


The newest corpus has an additional field per tweet named `agreement`, which is an indication of the level of agreement or disagreement of the expressed sentiment within the content, with two possible values: `AGREEMENT` and `DISAGREEMENT`. This is used in those tweets with Neutral polarity because of conflicting (positive *and* negative) words.

We will filter only those tweets where there is Agreement on the polarity. (And there is some polarity).

In [55]:
tweets_corpus = tweets_corpus.query('agreement != "DISAGREEMENT" and polarity != "NONE"')

In [56]:
#remove links
tweets_corpus = tweets_corpus[-tweets_corpus.content.str.contains('^http.*$')]

tweets_corpus.shape

(48328, 3)

## Tokenizing & Stemming

In [57]:
#download spanish stopwords
import nltk
nltk.download("stopwords")

from nltk.corpus import stopwords
spanish_stopwords = stopwords.words('spanish')

[nltk_data] Downloading package stopwords to /home/manuel/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [58]:
from string import punctuation
non_words = list(punctuation)

#we add spanish punctuation
non_words.extend(['¿', '¡'])
non_words.extend(map(str,range(10)))
non_words

['!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '.',
 '/',
 ':',
 ';',
 '<',
 '=',
 '>',
 '?',
 '@',
 '[',
 '\\',
 ']',
 '^',
 '_',
 '`',
 '{',
 '|',
 '}',
 '~',
 '¿',
 '¡',
 '0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9']

In [59]:
from sklearn.feature_extraction.text import CountVectorizer       
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

# based on http://www.cs.duke.edu/courses/spring14/compsci290/assignments/lab02.html
stemmer = SnowballStemmer('spanish')
def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    # remove non letters
    text = ''.join([c for c in text if c not in non_words])
    # tokenize
    tokens =  word_tokenize(text)

    # stem
    try:
        stems = stem_tokens(tokens, stemmer)
    except Exception as e:
        print(e)
        print(text)
        stems = ['']
    return stems

# Model evaluation

In [60]:
from sklearn.cross_validation import cross_val_score
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

We will binarize the sentiment. So instead of having a 5-class classification problem we turn it into a binary one. Ideally we would perform Ordinal Classification, but scikit learn seems to be more focused on categorical/binary classification.

In [61]:
tweets_corpus = tweets_corpus[tweets_corpus.polarity != 'NEU']

tweets_corpus['polarity_bin'] = 0
tweets_corpus.polarity_bin[tweets_corpus.polarity.isin(['P', 'P+'])] = 1
tweets_corpus.polarity_bin.value_counts(normalize=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


1    0.576129
0    0.423871
Name: polarity_bin, dtype: float64

Since the problem is now a binary classification one, the score will be the Area Under the Curve (roc_auc).

At this point we would try multiple models and evaluate their performance. I have done this using Scikit Learn Laboratory (Skull). After trying multiple models (a very lenghty task), Linear Support Vector Classifier turned to be the top performant model.

Once we have the model we want to use, we do a GridSearch to find the optimal hyperparameters.

In [32]:
vectorizer = CountVectorizer(
                analyzer = 'word',
                tokenizer = tokenize,
                lowercase = True,
                stop_words = spanish_stopwords)

pipeline = Pipeline([
    ('vect', vectorizer),
    ('cls', LinearSVC()),
])



parameters = {
    'vect__max_df': (0.5, 1.9),
    'vect__min_df': (10, 20,50),
    'vect__max_features': (500, 1000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    'cls__C': (0.2, 0.5, 0.7),
    'cls__loss': ('hinge', 'squared_hinge'),
    'cls__max_iter': (500, 1000)
}


grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1 , scoring='roc_auc')
grid_search.fit(tweets_corpus.content, tweets_corpus.polarity_bin)

  args, varargs, kw, default = inspect.getargspec(init)
  args, varargs, kw, default = inspect.getargspec(init)
  args, varargs, kw, default = inspect.getargspec(init)
  args, varargs, kw, default = inspect.getargspec(init)
  args, varargs, kw, default = inspect.getargspec(init)
  args, varargs, kw, default = inspect.getargspec(init)
  args, varargs, kw, default = inspect.getargspec(init)
  args, varargs, kw, default = inspect.getargspec(init)
  args, varargs, kw, default = inspect.getargspec(init)
  args, varargs, kw, default = inspect.getargspec(init)
  args, varargs, kw, default = inspect.getargspec(init)
  args, varargs, kw, default = inspect.getargspec(init)
  args, varargs, kw, default = inspect.getargspec(init)
  args, varargs, kw, default = inspect.getargspec(init)
  args, varargs, kw, default = inspect.getargspec(init)
  args, varargs, kw, default = inspect.getargspec(init)
  args, varargs, kw, default = inspect.getargspec(init)
  args, varargs, kw, default = inspect.getargspe

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None,
        stop_words=['de', 'la'...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))]),
       fit_params={}, iid=True, loss_func=None, n_jobs=-1,
       param_grid={'vect__max_features': (500, 1000), 'cls__loss': ('hinge', 'squared_hinge'), 'cls__max_iter': (500, 1000), 'vect__min_df': (10, 20, 50), 'cls__C': (0.2, 0.5, 0.7), 'vect__max_df': (0.5, 1.9), 'vect__ngram_range': ((1, 1), (1, 2))},
       pre_dispatch='2*n_jobs', refit=True, score_func=None,
       scoring='roc_auc', verbose=0)

  args, varargs, kw, default = inspect.getargspec(init)
  args, varargs, kw, default = inspect.getargspec(init)
  args, varargs, kw, default = inspect.getargspec(init)
  args, varargs, kw, default = inspect.getargspec(init)
  args, varargs, kw, default = inspect.getargspec(init)
  args, varargs, kw, default = inspect.getargspec(init)
  args, varargs, kw, default = inspect.getargspec(init)
  args, varargs, kw, default = inspect.getargspec(init)
  args, varargs, kw, default = inspect.getargspec(init)
  args, varargs, kw, default = inspect.getargspec(init)
  args, varargs, kw, default = inspect.getargspec(init)
  args, varargs, kw, default = inspect.getargspec(init)
  args, varargs, kw, default = inspect.getargspec(init)
  args, varargs, kw, default = inspect.getargspec(init)
  args, varargs, kw, default = inspect.getargspec(init)
  args, varargs, kw, default = inspect.getargspec(init)
  args, varargs, kw, default = inspect.getargspec(init)
  args, varargs, kw, default = inspect.getargspe

In [42]:
grid_search.best_params_

{'cls__C': 0.2,
 'cls__loss': 'squared_hinge',
 'cls__max_iter': 1000,
 'vect__max_df': 1.9,
 'vect__max_features': 1000,
 'vect__min_df': 50,
 'vect__ngram_range': (1, 1)}

Save the model

In [37]:
from sklearn.externals import joblib
joblib.dump(grid_search, 'grid_search.pkl')

['grid_search.pkl',
 'grid_search.pkl_01.npy',
 'grid_search.pkl_02.npy',
 'grid_search.pkl_03.npy',
 'grid_search.pkl_04.npy',
 'grid_search.pkl_05.npy',
 'grid_search.pkl_06.npy',
 'grid_search.pkl_07.npy',
 'grid_search.pkl_08.npy',
 'grid_search.pkl_09.npy',
 'grid_search.pkl_10.npy',
 'grid_search.pkl_11.npy',
 'grid_search.pkl_12.npy',
 'grid_search.pkl_13.npy',
 'grid_search.pkl_14.npy',
 'grid_search.pkl_15.npy',
 'grid_search.pkl_16.npy',
 'grid_search.pkl_17.npy',
 'grid_search.pkl_18.npy',
 'grid_search.pkl_19.npy',
 'grid_search.pkl_20.npy',
 'grid_search.pkl_21.npy',
 'grid_search.pkl_22.npy',
 'grid_search.pkl_23.npy',
 'grid_search.pkl_24.npy',
 'grid_search.pkl_25.npy',
 'grid_search.pkl_26.npy',
 'grid_search.pkl_27.npy',
 'grid_search.pkl_28.npy',
 'grid_search.pkl_29.npy',
 'grid_search.pkl_30.npy',
 'grid_search.pkl_31.npy',
 'grid_search.pkl_32.npy',
 'grid_search.pkl_33.npy',
 'grid_search.pkl_34.npy',
 'grid_search.pkl_35.npy',
 'grid_search.pkl_36.npy',
 'grid_s

**We do crossvalidation here to show the performance of the model**

In [52]:
model = LinearSVC(C=.2, loss='squared_hinge',max_iter=1000,multi_class='ovr',
              random_state=None,
              penalty='l2',
              tol=0.0001
)

vectorizer = CountVectorizer(
    analyzer = 'word',
    tokenizer = tokenize,
    lowercase = True,
    stop_words = spanish_stopwords,
    min_df = 50,
    max_df = 1.9,
    ngram_range=(1, 1),
    max_features=1000
)

corpus_data_features = vectorizer.fit_transform(tweets_corpus.content)
corpus_data_features_nd = corpus_data_features.toarray()

In [56]:
scores = cross_val_score(
    model,
    corpus_data_features_nd[0:len(tweets_corpus)],
    y=tweets_corpus.polarity_bin,
    scoring='roc_auc',
    cv=5
    )

scores.mean()

  args, varargs, kw, default = inspect.getargspec(init)
  args, varargs, kw, default = inspect.getargspec(init)
  args, varargs, kw, default = inspect.getargspec(init)
  args, varargs, kw, default = inspect.getargspec(init)
  args, varargs, kw, default = inspect.getargspec(init)
  args, varargs, kw, default = inspect.getargspec(init)
  args, varargs, kw, default = inspect.getargspec(init)
  args, varargs, kw, default = inspect.getargspec(init)
  args, varargs, kw, default = inspect.getargspec(init)
  args, varargs, kw, default = inspect.getargspec(init)


0.92038489623127862

Our model has an AUC of 0.92 That will work!.

### Polarity prediction

** Now we use the trained model to predict sentiment on the downloaded tweets **

In [None]:
tweets = pd.read_csv('tweets_parsed.csv', encoding='utf-8')

In [24]:
tweets = tweets[tweets.tweet.str.len() < 150]
tweets.lat = pd.to_numeric(tweets.lat, errors='coerce')
tweets = tweets[tweets.lat.notnull()]


#We make sure only those tweets in the Murcia bounding box are kept
min_lon = -1.157420
max_lon = -1.081202
min_lat = 37.951741
max_lat = 38.029126

tweets = tweets[(tweets.lat.notnull()) & (tweets.lon.notnull())]

tweets = tweets[(tweets.lon > min_lon) & (tweets.lon < max_lon) & (tweets.lat > min_lat) & (tweets.lat < max_lat)]
tweets.shape

  interactivity=interactivity, compiler=compiler, result=result)


(96889, 9)

### Language detection

We see a problem here, we have people tweeting in multiple languages, and we only care about those in spanish for sentiment purposes

I use [langid.py](https://github.com/saffsd/langid.py), TextBlob, and [langdetect](https://pypi.python.org/pypi/langdetect/1.0.1) for language detection.

I will keep those tweets in which at least two of those packages agree on the language being spanish.

In [None]:
import langid
from langdetect import detect
import textblob

def langid_safe(tweet):
    try:
        return langid.classify(tweet)[0]
    except Exception as e:
        pass
        
def langdetect_safe(tweet):
    try:
        return detect(tweet)
    except Exception as e:
        pass

def textblob_safe(tweet):
    try:
        return textblob.TextBlob(tweet).detect_language()
    except Exception as e:
        pass   

In [None]:
#this will take a loong time.
tweets['lang_langid'] = tweets.tweet.apply(langid_safe)
tweets['lang_langdetect'] = tweets.tweet.apply(langdetect_safe)
tweets['lang_textblob'] = tweets.tweet.apply(textblob_safe)

In [None]:
tweets['lang_textblob'] = tweets.tweet.apply(textblob_safe)

In [40]:
tweets.to_csv('tweets_parsed2.csv', encoding='utf-8')

we only keep the tweets in which langdetect and langid have agreed that the language is Spanish

In [62]:
tweets = tweets.query(''' lang_langdetect == 'es' or lang_langid == 'es' or lang_textblob == 'es'  ''')
tweets.shape

(77550, 10)

In [63]:
pipeline = Pipeline([
    ('vect', CountVectorizer(
            analyzer = 'word',
            tokenizer = tokenize,
            lowercase = True,
            stop_words = spanish_stopwords,
            min_df = 50,
            max_df = 1.9,
            ngram_range=(1, 1),
            max_features=1000
            )),
    ('cls', LinearSVC(C=.2, loss='squared_hinge',max_iter=1000,multi_class='ovr',
             random_state=None,
             penalty='l2',
             tol=0.0001
             )),
])

In [64]:
pipeline.fit(tweets_corpus.content, tweets_corpus.polarity_bin)
tweets['polarity'] = pipeline.predict(tweets.tweet)

In [65]:
tweets[['tweet', 'polarity']].sample(20)

Unnamed: 0,tweet,polarity
602353,@AlgoMortal Muchas felicidades que lo pases muy bien :),1
589926,"@eslatarde @PPopular En una palabra, INSULTANTE!!!",0
519183,@Niita349 jaja eso espero :),1
525427,Hay cosas quee no entiendo pero bueno :),1
168354,#meteoAlarm en #Murcia para el 25-04-2014 08:51:00 sobre Viento de nivel #BLANCO,0
516783,Amos de tascas,1
22394,#meteoAlarm en #Murcia para el 30-08-2015 18:18:00 sobre Tormentas de nivel #BLANCO,0
179007,@Pauleeta_alu pues uno de Twitter jajaja ;) sigueme y md,1
12963,En la mierda,1
157553,Estrella de levante se identifica nada mas que cuando la hueles,0


We save the file as both a csv and a text file for heatmap.py

In [66]:
tweets[['tweet', 'lat', 'lon', 'polarity']].to_csv('tweets_polarity_bin.csv', encoding='utf-8')

In [67]:
with open('../../heatmap/tweets_heatmap_polarity_binary','w') as file:
    file.write(tweets[['lat','lon', 'polarity']].to_string(header=False, index=False))