## Classifier and preprocessing

This notebook works with the noironicos dataset rather than ironicos: every tweet in ironicos is ironic, whereas the classifier needs a mixture of ironic and non-ironic tweets.

In [112]:
# General import and load data
import numpy as np
import nltk
import pandas as pd
from pandas import Series, DataFrame
import matplotlib.pyplot as plt
from nltk.tokenize import sent_tokenize, word_tokenize
import re

# Needed for running
nltk.download('punkt')
nltk.download('stopwords')

# Import database
df_noironicos = pd.read_csv('DATA/noironicos_bodies.csv')
df_train = pd.read_csv('DATA/train_data.csv')
df_development = pd.read_csv('DATA/development.csv')

# Encode categorical variable (ironic): True -> 1, False -> 0
df_noironicos.loc[df_noironicos['ironic']==True, "ironic"] = 1
df_noironicos.loc[df_noironicos['ironic']==False, "ironic"] = 0
# Binarize polarity in the training set: NEU and NONE -> 0, N and P -> 1
df_train.loc[df_train['sentiment/polarity/value']== 'NEU', 'sentiment/polarity/value'] = 0
df_train.loc[df_train['sentiment/polarity/value']== 'NONE', 'sentiment/polarity/value'] = 0
df_train.loc[df_train['sentiment/polarity/value']== 'N', 'sentiment/polarity/value'] = 1
df_train.loc[df_train['sentiment/polarity/value']== 'P', 'sentiment/polarity/value'] = 1
# Same binarization for the development set
df_development.loc[df_development['sentiment'] == 'NONE', 'sentiment'] = 0
df_development.loc[df_development['sentiment'] == 'NEU', 'sentiment'] = 0
df_development.loc[df_development['sentiment'] == 'N', 'sentiment'] = 1
df_development.loc[df_development['sentiment'] == 'P', 'sentiment'] = 1

# Drop non-used columns
#df_noironicos.drop(['id_tweet', 'depends_image', 'depends_link', 'depends_retweet'], axis=1, inplace=True)
df_train.drop(['tweetid', 'user', 'date', 'lang'], axis=1, inplace=True)
df_development.drop(['tweetid', 'user', 'date', 'lang'], axis=1, inplace=True)

# Drop rows with a missing tweet text
df_noironicos = df_noironicos.dropna(subset=['text'])

# Final dataset
df_noironicos

[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


| Unnamed: 0 | ironic | text |
|---|---|---|
| 0 | 1 | Algunas personas sufren en las discos mientras... |
| 3 | 1 | @jacevedoaraya es para sostener el marcador...... |
| 4 | 1 | Alguna de estas imágenes te sacara una sonrisa... |
| 5 | 1 | @_Eurovision2014 en 2013 falta esdm jajajajaja... |
| 6 | 1 | Hooo que buen padre...#Sarcasmo #GH2015 |
| 7 | 1 | @JhoynerV ja ja ja ja ja así o más claro, cas... |
| 8 | 0 | @patronbermudez con todo respeto lo principios... |
| 11 | 1 | Gran rapidez todo en la UPO y no iban a ser me... |
| 12 | 1 | ¿Que humilde es Simeone no? #ironía #llorón #f... |
| 13 | 1 | ¿Alguien se viene a la playa conmigo? #resfria... |


In [70]:
# Define X and Y
X = df_noironicos['text'].values
y = df_noironicos['ironic'].values.astype(int)

# Note: .size counts cells (rows x columns), not rows
df_noironicos[df_noironicos.ironic==1].size


9058
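
Since `.size` multiplies rows by columns, a per-class row count can be easier to interpret when checking the balance between ironic and non-ironic tweets. The sketch below uses a small invented DataFrame (`toy`) with the same column names as the notebook; it is only an illustration, not the notebook's data.

```python
import pandas as pd

# Invented toy data with the same column names as df_noironicos
toy = pd.DataFrame({'ironic': [1, 1, 0, 1],
                    'text': ['a', 'b', 'c', 'd']})

ironic_rows = toy[toy.ironic == 1]
n_cells = ironic_rows.size        # rows * columns
n_rows = ironic_rows.shape[0]     # number of ironic tweets

# Row count per class in a single call
class_counts = toy['ironic'].value_counts()
print(n_cells, n_rows)
print(class_counts)
```

With `value_counts()` both classes are visible at once, which makes an imbalance easy to spot.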

## Lexical features
The lexical analysis relies on the tokenizers provided by the NLTK library (`sent_tokenize` and `word_tokenize`).


In [50]:
# Lexical statistics computed with NLTK, wrapped in a scikit-learn transformer

from sklearn.base import BaseEstimator, TransformerMixin
from nltk.tokenize import sent_tokenize, word_tokenize

class LexicalStats(BaseEstimator, TransformerMixin):
    """Extract lexical features from each document"""

    def number_sentences(self, doc):
        # Count sentences with NLTK's Spanish sentence tokenizer
        sentences = sent_tokenize(doc, language='spanish')
        return len(sentences)

    def fit(self, x, y=None):
        # Stateless transformer: nothing to learn
        return self

    def transform(self, docs):
        # One dict per document: length in characters and number of sentences
        return [{'length': len(doc),
                 'num_sentences': self.number_sentences(doc)}
                for doc in docs]
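
A transformer that returns one dict per document is usually followed by `DictVectorizer`, which turns those dicts into a numeric matrix. The sketch below shows that step on two hand-written dicts with the same keys as `LexicalStats` produces; the values are invented for illustration.

```python
from sklearn.feature_extraction import DictVectorizer

# Hand-written stand-ins for the output of a dict-producing transformer
stats = [{'length': 39, 'num_sentences': 1},
         {'length': 80, 'num_sentences': 2}]

vectorizer = DictVectorizer(sparse=False)
matrix = vectorizer.fit_transform(stats)

# Column order follows the sorted feature names
print(vectorizer.feature_names_)
print(matrix)
```

Each key becomes one column, so the transformer should return the same keys for every document.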

In [62]:
# Tokenizer: lowercase, drop stop words and punctuation, then stem
from sklearn.base import BaseEstimator, TransformerMixin
from nltk.stem import SnowballStemmer
from nltk import word_tokenize
from nltk.corpus import stopwords
import string

def custom_tokenizer(words):
    tokens = word_tokenize(words.lower())
    stoplist = set(stopwords.words('spanish'))
    punctuation = set(string.punctuation)
    # Filter before stemming: the stop list holds unstemmed words
    tokens_clean = [t for t in tokens
                    if t not in stoplist and t not in punctuation]
    stemmer = SnowballStemmer('spanish')
    stems = [stemmer.stem(t) for t in tokens_clean]
    print(tokens)  # debug output; remove before using inside a pipeline
    return stems



['@', '_eurovision2014', 'en', '2013', 'falta', 'esdm', 'jajajajajajajajajajja', 'top', '1', 'hombre', '#', 'ironia']


['_eurovision2014',
 '2013',
 'falt',
 'esdm',
 'jajajajajajajajajajj',
 'top',
 '1',
 'hombr',
 'ironi']
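
In the output above, `word_tokenize` separates `@` and `#` from the handle and the hashtag. NLTK also ships a tweet-aware tokenizer, `TweetTokenizer`, which keeps them attached. The sketch below is an alternative for comparison, not what the notebook uses; the sample tweet is a shortened version of the one above.

```python
from nltk.tokenize import TweetTokenizer

# preserve_case=False lowercases tokens; reduce_len shortens long character runs
tweet_tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True)

tweet = "@_Eurovision2014 en 2013 falta esdm jajaja top 1 hombre #ironia"
tokens = tweet_tokenizer.tokenize(tweet)
print(tokens)
```

Keeping hashtags such as `#ironia` as single tokens may matter here, since they can act as explicit irony markers.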

## Syntactic features

NOTE: this section may need to be removed.

In [71]:
# We will use NLTK's universal tag set
from nltk import pos_tag, word_tokenize
import collections

# Particular chunks can be extracted from the sentence
# with a RegexpParser. See Syntactic Processing
class PosStats(BaseEstimator, TransformerMixin):

    def stats(self, doc):
        tokens = custom_tokenizer(doc)

        # Note: pos_tag uses NLTK's English tagger by default, and these
        # tokens are Spanish stems, so the tags are only approximate
        tagged = pos_tag(tokens, tagset='universal')
        counts = collections.Counter(tag for word, tag in tagged)
        total = sum(counts.values())
        # Fixed set of tags so that we always return the same number of features
        pos_features = {'NOUN': 0, 'ADJ': 0, 'VERB': 0, 'ADV': 0, 'CONJ': 0,
                        'ADP': 0, 'PRON': 0, 'NUM': 0}

        # Relative frequency of each tag in the document
        pos_dic = dict((tag, float(count)/total) for tag, count in counts.items())
        for k in pos_dic:
            if k in pos_features:
                pos_features[k] = pos_dic[k]
        return pos_features

    def transform(self, docs, y=None):
        return [self.stats(doc) for doc in docs]

    def fit(self, docs, y=None):
        """Returns `self` unless something different happens in train and test"""
        return self

## Feature extraction Pipeline
Feature extraction is carried out with scikit-learn pipelines, each one chaining the steps needed to produce one group of features.

In [72]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer


ngrams_featurizer = Pipeline([
  ('count_vectorizer',  CountVectorizer(ngram_range = (1, 4), encoding = 'ISO-8859-1', 
                                        tokenizer=custom_tokenizer)),
  ('tfidf_transformer', TfidfTransformer())
])

## Feature Union Pipeline
Now we define which features we want to extract and how to combine them, and then apply machine learning to the resulting feature set.
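
As a rough illustration of how features are combined, `FeatureUnion` fits each transformer on the same input and concatenates the resulting columns side by side. The sketch below uses two simple vectorizers and three invented documents; the character bigram vectorizer is only a stand-in for a second feature group and is not part of the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion

# Invented toy documents
docs = ["que buen padre", "que humilde es", "gran rapidez todo"]

# Each feature group fitted on its own, to see its width
words = TfidfVectorizer().fit_transform(docs)
chars = CountVectorizer(analyzer='char', ngram_range=(2, 2)).fit_transform(docs)

# The union concatenates both groups column-wise
union = FeatureUnion([
    ('words', TfidfVectorizer()),
    ('chars', CountVectorizer(analyzer='char', ngram_range=(2, 2))),
])
combined = union.fit_transform(docs)

print(words.shape, chars.shape, combined.shape)
```

The combined matrix has one row per document and as many columns as both groups together, which is what the classifier at the end of the pipeline receives.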

In [75]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score, KFold
from sklearn.metrics import classification_report
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import FunctionTransformer
from sklearn.decomposition import NMF, LatentDirichletAllocation

def build_pipeline(clf):
    """Build the feature-extraction pipeline followed by the classifier `clf`"""
    return Pipeline([
           ('features', FeatureUnion([
                        ('lexical_stats', Pipeline([
                                    ('stats', LexicalStats()),
                                    ('vectors', DictVectorizer())
                                ])),
                        ('words', TfidfVectorizer(tokenizer=custom_tokenizer)),
                        ('ngrams', ngrams_featurizer),
                        ('pos_stats', Pipeline([
                                    ('pos_stats', PosStats()),
                                    ('vectors', DictVectorizer())
                                ])),
                        ('lda', Pipeline([ 
                                    ('count', CountVectorizer(tokenizer=custom_tokenizer)),
                                    ('lda',  LatentDirichletAllocation(n_components=4, max_iter=5,
                                                           learning_method='online', 
                                                           learning_offset=50.,
                                                           random_state=0))
                                ])),
                    ])),

            ('clf', clf)  # classifier
        ])

# Assumption: MultinomialNB (imported above) as the classifier;
# any scikit-learn classifier can be passed instead
pipeline = build_pipeline(MultinomialNB())

# Using KFold validation
cv = KFold(n_splits=2, shuffle=True, random_state=33)
scores = cross_val_score(pipeline, X, y, cv=cv)
print("Scores in every iteration", scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

I want to compute the confusion matrix and the F-score, which requires a training set and a test set. I can build them from the vector X, but X holds strings (an array of tweets), and the model (that is, my pipeline) transforms that array of strings into numbers, which are then processed and fed through the pipeline to give the model. However, calling pipeline.predict(X_test) seems to need X_test to be numeric (otherwise it raises an error, which has already happened to me). So how do I do the train/test split with the pipeline?
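
On the question above: a fitted scikit-learn `Pipeline` runs its transformers inside `predict`, so the split can be made on the raw strings and `X_test` can stay as text. The sketch below shows this on invented toy tweets, with `TfidfVectorizer` plus `MultinomialNB` standing in for the full feature union; the names `X_toy`, `y_toy` and `toy_pipeline` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented toy data: raw strings and 0/1 labels
X_toy = ["que buen padre sarcasmo", "gran rapidez todo ironia",
         "que humilde es ironia", "me encanta esperar sarcasmo",
         "hoy hace sol", "vamos a la playa",
         "el partido empieza pronto", "manana hay clase"]
y_toy = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw strings, not a vectorized matrix
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=33)

toy_pipeline = Pipeline([('tfidf', TfidfVectorizer()),
                         ('clf', MultinomialNB())])
toy_pipeline.fit(X_train, y_train)

# predict() vectorizes X_test with the vocabulary learned during fit
y_pred = toy_pipeline.predict(X_test)

cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
f1 = f1_score(y_test, y_pred)
print(cm)
print(f1)
```

The same pattern should carry over to the notebook's `X` and `y`. If `predict` fails on strings, a likely cause is that the classifier was fitted directly on an already vectorized matrix instead of through the pipeline.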

## Train, Optimize and Evaluate

### MultinomialNB

### K-Fold evaluation

## SVC