# Pipelines Exploration

A pipeline chains the preprocessing and modelling steps of a machine learning workflow, so that several recurring problems can be handled in one place:

* Data might not be in $\mathbb{R}^d$
* Features need to be engineered
* Transformations need to be applied
* Hyperparameters need to be tuned
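
As a rough illustration of the last bullet, the sketch below tunes hyperparameters through a pipeline with `GridSearchCV`, addressing each step's parameters as `<step name>__<parameter>`. The toy documents, labels, step names, and the `LogisticRegression` classifier are assumptions made for this example; none of them come from this notebook's data or model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up documents and labels, only to make the example runnable.
docs = [
    "cold beer tonight", "beer and wine bar", "craft beer fans", "wine tasting night",
    "morning run done", "reading a good book", "traffic was awful", "new phone arrived",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])

# Keys follow the "<step name>__<parameter>" convention.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 10.0],
}

search = GridSearchCV(toy_pipeline, param_grid, cv=2)
search.fit(docs, labels)
print(search.best_params_)
```

Because the vectorizer sits inside the pipeline, it is refit on the training part of every cross-validation split, so the held-out fold never influences the learned vocabulary.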

In [1]:
from dao import DataAccess

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin

In [2]:
import numpy as np

First we load the data as a DataFrame through the `DataAccess` class.

In [3]:
X = DataAccess.as_dataframe()

In [4]:
X.head(2)

| _id | created_at | labels | predict | text | user |
|---|---|---|---|---|---|
| 556e0ee3d6dfbb462880f0a5 | Tue Jun 02 20:16:08 +0000 2015 | {'alcohol': 0} | 0.52605 | Impatiently waiting to get our hands on the ne... | {'friends_count': 1997, 'created_at': 'Thu Jun... |
| 556e128ad6dfbb46288111e4 | Tue Jun 02 20:31:44 +0000 2015 | {'alcohol': 1} | 0.516649 | Beer fans need their @ColumbusBrewing Bodhi. I... | {'friends_count': 960, 'created_at': 'Mon Oct ... |


In [26]:
class ItemGetter(BaseEstimator, TransformerMixin):
    """
    This canonical ItemGetter is a Transformer for Pipeline objects.
    Initialize the ItemGetter with a `key`; its `transform` call returns
    `X[key]`, for example a single column of a DataFrame.
    """

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        # Nothing to learn; return self so that fit(...).transform(...) chaining works.
        return self

    def transform(self, X, y=None):
        return X[self.key]

    # fit_transform is inherited from TransformerMixin (fit followed by transform).
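
For comparison, the same column selection can be sketched with scikit-learn's built-in `FunctionTransformer` instead of a custom class. The toy DataFrame and the name `get_text` are made up for illustration; this is an alternative, not what the notebook uses.

```python
from operator import itemgetter

import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Toy frame standing in for the notebook's data (values are made up).
toy = pd.DataFrame({"text": ["first tweet", "second tweet"], "predict": [0.5, 0.6]})

# itemgetter("text") behaves like `lambda df: df["text"]`.
get_text = FunctionTransformer(itemgetter("text"))

selected = get_text.fit_transform(toy)
print(list(selected))
```

A custom class like `ItemGetter` is still handy when the key should show up as a named, tunable parameter of the step.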

## Text Pipeline

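First on our list is a simple text pipeline that uses `TfidfVectorizer` followed by `TruncatedSVD` (LSI).
Twokenize from [brendano/tweetmotif](https://github.com/brendano/tweetmotif) is imported as the tokenizer, although the character n-gram analyzer configured below does not call it.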

    Brendan O'Connor, Michel Krieger, and David Ahn. TweetMotif: Exploratory Search and Topic Summarization for Twitter. ICWSM-2010.

In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

from twokenize import tokenize

text_pipe = []

text_pipe.append(
    ("text", 
     ItemGetter("text")
    )
)

text_pipe.append(
    ("tfidf", 
     TfidfVectorizer(
            analyzer="char",
            ngram_range=(2,5),
            min_df = 10,
            max_df = .98
        )
    )
)

text_pipe.append(
    ("lsi",
    TruncatedSVD(
            n_components=3000
        )
    )
)

# TruncatedSVD is annoyingly expensive, so for now only the first two steps
# (column selection and TF-IDF) are used.
text_pipeline = Pipeline(text_pipe[:2])
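
To show what the full TF-IDF plus LSI chain produces, here is a minimal sketch on a handful of made-up documents. The documents, the tiny `n_components=2`, the default `min_df`/`max_df`, and the `random_state` are assumptions chosen so the example runs instantly; they are not the settings used above.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Made-up documents; the real pipeline reads the `text` column instead.
toy_docs = [
    "Impatiently waiting for the weekend",
    "Beer fans need their favourite brew",
    "Waiting for a cold beer",
    "Weekend plans with friends",
]

toy_text_pipeline = Pipeline([
    # Character n-grams of length 2 to 5, as in the notebook's vectorizer.
    ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(2, 5))),
    # Project the sparse TF-IDF matrix onto 2 latent dimensions.
    ("lsi", TruncatedSVD(n_components=2, random_state=0)),
])

reduced = toy_text_pipeline.fit_transform(toy_docs)
print(reduced.shape)  # one row per document, one column per component
```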

### Future Work

Incorporate Gensim Phrases and LDA as custom transformers that can be loaded from files.

## Time Pipeline

### Vectorizers

I'll describe the vectorization process a bit later; I've designed it so that it can be easily modified
for future implementations.

### Transformers

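`DateTimeTransformer` takes the `created_at` selection and converts it into a `pandas.DatetimeIndex`, which exposes calendar fields such as `dayofweek` and `hour` as attributes.

Currently I am using the `dayofweek`, `hour`, and `hourofweek` features, where `hourofweek` is computed as `dayofweek * 24 + hour`.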
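
As a small worked example, the sketch below derives the three features from two `created_at` strings that appear in this notebook's tables. The explicit `format` string is an assumption that matches those samples; the transformer itself leaves the parsing to `pd.DatetimeIndex`.

```python
import pandas as pd

created_at = [
    "Tue Jun 02 20:16:08 +0000 2015",
    "Thu Jun 12 22:14:05 +0000 2014",
]

# Assumed format for Twitter-style timestamps like the two samples above.
index = pd.to_datetime(created_at, format="%a %b %d %H:%M:%S %z %Y")

dayofweek = index.dayofweek          # Monday is 0, Sunday is 6
hour = index.hour                    # 0 to 23
hourofweek = dayofweek * 24 + hour   # 0 to 167

print(list(dayofweek), list(hour), list(hourofweek))
```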

In [11]:
import pandas as pd

class DateTimeTransformer(BaseEstimator, TransformerMixin):
    
    def fit(self, X, y=None):
        # Stateless transformer: return self to follow the scikit-learn API.
        return self
    
    def transform(self, X, y=None):
        return pd.DatetimeIndex(X)
    
    def fit_transform(self, X, y=None):
        return self.transform(X)

In [28]:
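import pandas as pd

class DatetimeIndex(BaseEstimator, TransformerMixin):
    """Extract one calendar feature from a `pandas.DatetimeIndex`."""

    allowed_kinds = {"dayofweek", "hour", "hourofweek"}

    def __init__(self, kind):
        self.kind = kind

    def fit(self, X, y=None):
        # Stateless transformer: return self to follow the scikit-learn API.
        return self

    def transform(self, X, y=None):
        if self.kind not in self.allowed_kinds:
            raise ValueError("kind must be one of %s" % sorted(self.allowed_kinds))
        if self.kind == "hourofweek":
            # Hours since the start of the week: 0 (Monday 00:00) to 167 (Sunday 23:00).
            col = X.dayofweek * 24 + X.hour
        else:
            col = getattr(X, self.kind)
        # Return a single-column DataFrame, i.e. 2-D input for OneHotEncoder.
        return pd.DataFrame(col)

    def fit_transform(self, X, y=None):
        return self.transform(X)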

In [29]:
from sklearn.preprocessing import OneHotEncoder

time_pipe = list()

time_pipe.append(
    ("get_created_at", 
     ItemGetter("created_at")
    )
)

time_pipe.append(
    ("to_datetimeindex",
    DateTimeTransformer()
    )
)

time_pipe.append(
    ("features",
    FeatureUnion([
        ("dayofweek", 
         Pipeline(
                    [("index", DatetimeIndex("dayofweek")),
                     ("onehot", OneHotEncoder())])),
        ("hour", 
         Pipeline(
                    [("index", DatetimeIndex("hour")),
                     ("onehot", OneHotEncoder())])),
        ("hourofweek", 
         Pipeline(
                    [("index", DatetimeIndex("hourofweek")),
                     ("onehot", OneHotEncoder())]))
        ])
    )
)

tp = Pipeline(time_pipe)
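
To make the one-hot step concrete, here is a minimal sketch of `OneHotEncoder` applied to a single column of hour values. The five hours are hypothetical; in the pipeline above the column comes from the `DatetimeIndex` transformer.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical hour-of-day values for five tweets, as a single column.
hours = np.array([[20], [22], [20], [3], [22]])

encoder = OneHotEncoder()
encoded = encoder.fit_transform(hours)  # sparse matrix

print(encoded.shape)            # one column per distinct hour seen during fit
print(encoder.categories_[0])   # the hour values behind those columns
print(encoded.toarray())
```

The encoder only creates columns for values it saw during `fit`, so the width of each block in the `FeatureUnion` depends on the training data. By default an unseen value at transform time raises an error; `OneHotEncoder(handle_unknown="ignore")` encodes it as an all-zero row instead.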

### Future Work

Notice that right now all of these features are one-hot encoded. This will change later. A lot of this information is periodic, so we can probably include features like the difference between phases rather than the time itself.

This will probably function better than collapsing it into larger semantic intervals like `Afternoon` or `Sunday Afternoon`.
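
One common way to expose periodicity is to map each time value onto the unit circle and use the sine and cosine of its phase as features. The sketch below is only an assumption about what such a feature could look like (the helper `cyclical_encode` is hypothetical), not the design this notebook settles on.

```python
import numpy as np

def cyclical_encode(values, period):
    """Map periodic values onto the unit circle as (sin, cos) pairs."""
    angle = 2.0 * np.pi * np.asarray(values, dtype=float) / period
    return np.column_stack([np.sin(angle), np.cos(angle)])

hours = np.array([0, 12, 23])
encoded = cyclical_encode(hours, period=24)

# Compare the distance 23:00 -> 00:00 with the distance 12:00 -> 00:00.
near = np.linalg.norm(encoded[2] - encoded[0])
far = np.linalg.norm(encoded[1] - encoded[0])
print(near < far)
```

Unlike the raw integer, this encoding keeps 23:00 and 00:00 close together, and the same helper applies to `hourofweek` with `period=168`.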

## User Pipeline

From prior exploration it seems that log scaling the user count fields helps a lot. Moreover, a feature called normality also helps greatly.

On read, the `user` column holds a dict per row, so we need a transformer to convert it into a DataFrame.

In [45]:
class Dict2DF(BaseEstimator, TransformerMixin):
    
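    def fit(self, X, y=None):
        # Stateless transformer: return self to follow the scikit-learn API.
        return self
    
    def transform(self, X, y=None):
        # Pass a plain list so the dicts are read positionally, whatever the index labels are.
        return pd.DataFrame.from_records(X.tolist(), index=X.index)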
    
    def fit_transform(self, X, y=None):
        return self.transform(X)

In [47]:
u = Dict2DF().fit_transform(X.user).head()

In [48]:
u

| _id | created_at | favourites_count | followers_count | friends_count | statuses_count |
|---|---|---|---|---|---|
| 556e0ee3d6dfbb462880f0a5 | Thu Jun 12 22:14:05 +0000 2014 | 394 | 407 | 1997 | 823 |
| 556e128ad6dfbb46288111e4 | Mon Oct 06 21:00:38 +0000 2008 | 806 | 1006 | 960 | 10442 |
| 556e1464d6dfbb4628812330 | Sun Mar 11 08:22:56 +0000 2012 | 860 | 703 | 684 | 89573 |
| 556e15f1d6dfbb4628813236 | Thu Jan 14 03:03:33 +0000 2010 | 3473 | 9414 | 1486 | 16435 |
| 556e1adcd6dfbb50e34a1ed6 | Sun Oct 24 23:02:03 +0000 2010 | 3964 | 519 | 434 | 32154 |


In [None]:
class UserEgoVectorizer(BaseEstimator, TransformerMixin):
    
    def fit(self, X, y=None):
        # Stateless transformer: return self to follow the scikit-learn API.
        return self
    
    def transform(self, X, y=None):
        # Work on a copy of the input (not the notebook-level `u`),
        # so the caller's DataFrame is left untouched.
        X = X.copy()
        X["normality"] = X.friends_count / (X.followers_count + X.friends_count + 1)
        # Log-scale each count column once (np.log gives -inf for a count of 0).
        for col in ["favourites_count", "followers_count", "statuses_count"]:
            X[col] = np.log(X[col])
        return X
    
    def fit_transform(self, X, y=None):
        return self.transform(X)
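
To make the `normality` feature concrete, the sketch below evaluates it, along with a log-scaled follower count, for the first two users in the table above. It uses plain NumPy arrays rather than the transformer, and the interpretation in the comments is my reading of the formula, not a definition from the prior exploration.

```python
import numpy as np

# friends_count and followers_count of the first two users in the table above.
friends = np.array([1997, 960])
followers = np.array([407, 1006])

# Share of friends among all connections; the +1 avoids division by zero.
# A value near 1 means the account follows far more users than follow it back.
normality = friends / (followers + friends + 1)

# Log scaling compresses the wide range of raw counts.
log_followers = np.log(followers)

print(normality)
print(log_followers)
```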