# Pipelines Exploration

A pipeline chains preprocessing and modeling steps so that several common problems in the machine learning process are handled in one place:

* Data might not be in $\mathbb{R}^d$
* Features need to be engineered
* Transformations need to be applied
* Hyperparameters need to be tuned
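
As a rough illustration of how these pieces fit together, the sketch below chains a transformation and a model, then tunes one hyperparameter over the whole chain. The synthetic data, the `StandardScaler` and `LogisticRegression` steps, and the names `X_demo`, `y_demo`, and `demo_pipeline` are assumptions made for this example; they are not the model built in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 rows, 5 numeric features.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Each step is a (name, estimator) pair; the last step is the model.
demo_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

# Hyperparameters are addressed as <step name>__<parameter name>.
search = GridSearchCV(demo_pipeline, {"clf__C": [0.1, 1.0, 10.0]}, cv=3)
search.fit(X_demo, y_demo)

best_C = search.best_params_["clf__C"]
predictions = search.best_estimator_.predict(X_demo)
```

Because the scaler sits inside the pipeline, it is refit on each cross-validation split, so the tuning does not leak statistics from the held-out fold.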

In [3]:
from dao import DataAccess

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin

In [4]:
import numpy as np

First we use `DataAccess` to load the data into a DataFrame.

In [5]:
X = DataAccess.as_dataframe()

In [6]:
X.head(2)

| _id | created_at | labels | predict | text | user |
|---|---|---|---|---|---|
| 556e0ee3d6dfbb462880f0a5 | Tue Jun 02 20:16:08 +0000 2015 | {'alcohol': 0} | 0.52605 | Impatiently waiting to get our hands on the ne... | {'verified': False, 'statuses_count': 823, 'fa... |
| 556e128ad6dfbb46288111e4 | Tue Jun 02 20:31:44 +0000 2015 | {'alcohol': 1} | 0.516649 | Beer fans need their @ColumbusBrewing Bodhi. I... | {'verified': False, 'statuses_count': 10442, '... |


In [63]:
import pandas as pd

class ExplodingRecordJoiner(BaseEstimator, TransformerMixin):
    """
    Expands columns of dicts into regular columns and joins them onto X.
    Each keyword argument maps a column name to the list of keys to keep.
    """
    def __init__(self, **kwargs):
        self.cols = kwargs
        
    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn, but fit must return self.
        return self
    
    def transform(self, X, y=None):
        # Extract column of dicts then apply from_records,
        # match indices then select the `subcols` we want,
        # join with existing DataFrame.
        for col, subcols in self.cols.items():
            X = X.join(
                pd.DataFrame.from_records(X[col], index=X.index)[subcols],
                # Suffix clashing names (e.g. `created_at`) with the source column.
                rsuffix="_" + col
            )
        return X
    
    def fit_transform(self, X, y=None):
        return self.transform(X)
    
    def __repr__(self):
        st = [k+"="+ str(v) for k,v in self.cols.items()]
        return "ExplodingRecordJoiner({})".format(", ".join(st))
    
    def get_params(self, deep=True):
        # Needs `self`; `deep` matches the scikit-learn signature.
        return self.cols
            

In [64]:
ExplodingRecordJoiner(
    labels=["alcohol"],
    user=['created_at', 'favourites_count', 'followers_count', 
          'friends_count', 'statuses_count', 'verified']).get_params()

In [7]:
class ItemGetter(BaseEstimator, TransformerMixin):
    """
    ItemGetter is a transformer for Pipeline objects.
    Initialize it with a `key`; its transform call selects that column
    (or item) out of the input.
    """
    
    def __init__(self, key):
        self.key = key
        
    def fit(self, X, y=None):
        # Nothing to learn; return self so fit(...).transform(...) chains.
        return self
    
    def transform(self, X, y=None):
        return X[self.key]
    
    def fit_transform(self, X, y=None):
        return self.transform(X)

## Text Pipeline

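First on our list is a simple text pipeline that uses `TfidfVectorizer` followed by `TruncatedSVD` (LSI).
We also import the Twokenize tokenizer from [brendano/tweetmotif](https://github.com/brendano/tweetmotif), although the TF-IDF step below works on character n-grams and does not call it yet.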

    Brendan O'Connor, Michel Krieger, and David Ahn. TweetMotif: Exploratory Search and Topic Summarization for Twitter. ICWSM-2010.

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

from twokenize import tokenize

text_pipe = []

text_pipe.append(
    ("text", 
     ItemGetter("text")
    )
)

text_pipe.append(
    ("tfidf", 
     TfidfVectorizer(
            analyzer="char",
            ngram_range=(2,5),
            min_df = 10,
            max_df = .98
        )
    )
)

text_pipe.append(
    ("lsi",
    TruncatedSVD(
            n_components=3000
        )
    )
)

# TruncatedSVD is annoyingly expensive, so for now the pipeline
# only uses the first two steps (column selection and TF-IDF).
text_pipeline = Pipeline(text_pipe[:2])
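
To make the shape of the full chain concrete, here is a small sketch of character n-gram TF-IDF followed by LSI on a toy corpus. The four toy strings, `n_components=2`, and the names `toy_docs` and `toy_pipeline` are assumptions chosen so the example runs instantly; the `min_df`/`max_df` filters are left out because they would remove almost everything from such a tiny corpus.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Toy corpus (assumption), standing in for the tweet text column.
toy_docs = [
    "beer tonight",
    "craft beer fans",
    "no drinking for now",
    "yard sale today",
]

toy_pipeline = Pipeline([
    # Character n-grams of length 2 to 5, as in the notebook's settings.
    ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(2, 5))),
    # LSI: project the sparse TF-IDF matrix onto a few dense components.
    ("lsi", TruncatedSVD(n_components=2, random_state=0)),
])

toy_vectors = toy_pipeline.fit_transform(toy_docs)
print(toy_vectors.shape)  # (4, 2): one dense row per document
```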

### Future Work

Incorporate Gensim Phrases and LDA as custom transformers, with the ability to load them from files.

## Time Pipeline

### Vectorizers

I'll describe the vectorization process a bit later; I've designed it so that it can be easily modified
for future implementations.

### Transformers

`Timestamp2DatetimeIndex` takes the `created_at` selection and converts it into a `pandas.DatetimeIndex`, which exposes calendar attributes such as `dayofweek` and `hour` directly.

Currently I am using the `dayofweek`, `hour`, and `hourofweek` features, where `hourofweek` is a derived feature computed as `dayofweek * 24 + hour`.
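
As a quick worked example of these three features, the sketch below computes them for one `created_at` value using only the standard library. The helper name `time_features` and the constant `TWITTER_FORMAT` are hypothetical names introduced here; the format string is my reading of the timestamps shown in the data.

```python
from datetime import datetime

# Format of the `created_at` strings in the data (assumed from the samples).
TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def time_features(created_at):
    """Return (dayofweek, hour, hourofweek) for one timestamp string."""
    stamp = datetime.strptime(created_at, TWITTER_FORMAT)
    dayofweek = stamp.weekday()  # Monday is 0, the same convention pandas uses
    hour = stamp.hour
    return dayofweek, hour, dayofweek * 24 + hour

features = time_features("Tue Jun 02 20:16:08 +0000 2015")
print(features)  # (1, 20, 44)
```

`hourofweek` therefore ranges from 0 (Monday, midnight) to 167 (Sunday, 23:00).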

In [17]:
import pandas as pd

class Timestamp2DatetimeIndex(BaseEstimator, TransformerMixin):
    """
    This consumes a timestamp series and applies `pandas.DatetimeIndex`
    to return a DatetimeIndex object
    """
    def fit(self, X, y=None):
        # Nothing to learn; return self so the estimator can be chained.
        return self
    
    def transform(self, X, y=None):
        return pd.DatetimeIndex(X)
    
    def fit_transform(self, X, y=None):
        return self.transform(X)

In [19]:
import pandas as pd

class DatetimeIndexAttr(BaseEstimator, TransformerMixin):
    """
    Extracts one `pandas.DatetimeIndex` attribute, named by `kind`
    (for example "dayofweek" or "hour"), as a single-column DataFrame.
    Also provides a derived attribute called "hourofweek".
    """
    
    def __init__(self, kind):
        self.kind = kind
        
    def fit(self, X, y=None):
        # Nothing to learn; return self so the estimator can be chained.
        return self
    
    def transform(self, X, y=None):
        if self.kind == "hourofweek":
            # 0..167: hours elapsed since Monday 00:00.
            col = X.dayofweek * 24 + X.hour
        else:
            col = getattr(X, self.kind)
        return pd.DataFrame(col)
    
    def fit_transform(self, X, y=None):
        return self.transform(X)

Everything is one-hot encoded. Am I ashamed? A little...

In [20]:
from sklearn.preprocessing import OneHotEncoder

time_pipe = list()

time_pipe.append(
    ("get_created_at", 
     ItemGetter("created_at")
    )
)

time_pipe.append(
    ("to_datetimeindex",
     Timestamp2DatetimeIndex()
    )
)

# Each branch extracts one attribute and one-hot encodes it;
# FeatureUnion concatenates the encoded blocks column-wise.
time_pipe.append(
    ("features",
     FeatureUnion([
        ("dayofweek", 
         Pipeline(
                    [("index", DatetimeIndexAttr("dayofweek")),
                     ("onehot", OneHotEncoder())])),
        ("hour", 
         Pipeline(
                    [("index", DatetimeIndexAttr("hour")),
                     ("onehot", OneHotEncoder())])),
        ("hourofweek", 
         Pipeline(
                    [("index", DatetimeIndexAttr("hourofweek")),
                     ("onehot", OneHotEncoder())]))
        ])
    )
)
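
To show what one of these one-hot branches produces, here is a small sketch that encodes `hourofweek` for three timestamps taken from the data. The variable names and the explicit `format` string are assumptions made for the example, and the values are passed to the encoder as a plain NumPy column rather than through the notebook's transformers.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Three `created_at` values as they appear in the data.
stamps = pd.to_datetime(
    ["Tue Jun 02 20:16:08 +0000 2015",
     "Tue Jun 02 23:37:05 +0000 2015",
     "Thu Jun 12 22:14:05 +0000 2014"],
    format="%a %b %d %H:%M:%S %z %Y",
)

# One column of integers in 0..167, shaped (n_samples, 1) for the encoder.
hourofweek = np.asarray(stamps.dayofweek * 24 + stamps.hour).reshape(-1, 1)

# The encoder creates one column per distinct value seen during fit.
encoded = OneHotEncoder().fit_transform(hourofweek)
print(hourofweek.ravel().tolist())  # [44, 47, 94]
print(encoded.shape)                # (3, 3)
```

On the full data set this branch can grow to as many as 168 columns, which is part of why the one-hot approach is only a starting point.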

### Future Work

Notice that right now everything is one-hot encoded. This will change later. A lot of this information is periodic, so we can probably include features like the difference between phases rather than the time itself.

This will probably work better than collapsing it into larger semantic intervals like `Afternoon` or `Sunday Afternoon`.

Moreover, instead of using prior densities based on our other data, we could make estimating them part of the fit process...
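
One common way to expose the periodicity mentioned above is to map each time value onto a circle with sine and cosine. This is a swapped-in technique used for illustration, not something the notebook implements, and the helper names `cyclic_encode` and `distance` are hypothetical.

```python
import math

def cyclic_encode(value, period):
    """Map a periodic value (e.g. hour of day) to a point on the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def distance(a, b):
    """Euclidean distance between two encoded points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

midnight = cyclic_encode(0, 24)
late_evening = cyclic_encode(23, 24)
noon = cyclic_encode(12, 24)

# 23:00 ends up next to 00:00, while 12:00 is on the opposite side.
print(distance(late_evening, midnight) < distance(noon, midnight))  # True
```

Unlike a one-hot column per hour, this encoding keeps 23:00 and 00:00 close together, and it uses two columns regardless of the period.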

## User Pipeline

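From prior exploration it seems that log scaling the user count features (favourites, followers, friends, statuses) helps a lot. Moreover, a feature called normality, computed below as `friends_count / (followers_count + friends_count + 1)`, also helps greatly.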


On read, the `user` column holds a dict per row, so we need a transformer to convert it into a DataFrame.

In [12]:
class Dict2DF(BaseEstimator, TransformerMixin):
    
    def fit(self, X, y=None):
        # Nothing to learn; return self so the estimator can be chained.
        return self
    
    def transform(self, X, y=None):
        return pd.DataFrame.from_records(X, index=X.index)
    
    def fit_transform(self, X, y=None):
        return self.transform(X)

In [13]:
u = Dict2DF().fit_transform(X.user).head()

In [14]:
u

| _id | created_at | favourites_count | followers_count | friends_count | statuses_count | verified |
|---|---|---|---|---|---|---|
| 556e0ee3d6dfbb462880f0a5 | Thu Jun 12 22:14:05 +0000 2014 | 394 | 407 | 1997 | 823 | False |
| 556e128ad6dfbb46288111e4 | Mon Oct 06 21:00:38 +0000 2008 | 806 | 1006 | 960 | 10442 | False |
| 556e1464d6dfbb4628812330 | Sun Mar 11 08:22:56 +0000 2012 | 860 | 703 | 684 | 89573 | False |
| 556e15f1d6dfbb4628813236 | Thu Jan 14 03:03:33 +0000 2010 | 3473 | 9414 | 1486 | 16435 | True |
| 556e1adcd6dfbb50e34a1ed6 | Sun Oct 24 23:02:03 +0000 2010 | 3964 | 519 | 434 | 32154 | False |


In [15]:
class UserEgoVectorizer(BaseEstimator, TransformerMixin):
    
    def __init__(self, log=True, mean=True):
        self.log = log
        self.mean = mean
    
    def fit(self, X, y=None):
        # Only the means are learned; the log features need no fitted state.
        if self.mean:
            self.mean_fav = np.mean(X.favourites_count)
            self.mean_fol = np.mean(X.followers_count)
            self.mean_fri = np.mean(X.friends_count)
        return self
    
    def transform(self, X, y=None):
        # Work on a copy so the caller's DataFrame is not modified.
        u = X.copy()
        
        u["normality"] = u.friends_count / (u.followers_count + u.friends_count + 1)
        
        if self.log:
            # log1p(x) = log(1 + x) stays finite for zero counts,
            # where np.log would give -inf.
            u["log_favourites_count"] = np.log1p(u.favourites_count)
            u["log_followers_count"] = np.log1p(u.followers_count)
            u["log_statuses_count"] = np.log1p(u.statuses_count)
            u["log_friends_count"] = np.log1p(u.friends_count)
        
        if self.mean:
            # Center each count on the mean learned in fit.
            u["fav_kre"] = (u.favourites_count - self.mean_fav) 
            u["fol_kre"] = (u.followers_count - self.mean_fol)
            u["fri_kre"] = (u.friends_count - self.mean_fri)
        del u["created_at"]
        return u
    
    def fit_transform(self, X, y=None):
        return self.fit(X).transform(X)

In [16]:
class UserAgeMonths(BaseEstimator, TransformerMixin):
    
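    def fit(self, X, y=None):
        # Nothing to learn; return self so the estimator can be chained.
        return self
    
    def transform(self, X, y=None):
        from operator import itemgetter
        tweet_time = pd.to_datetime(X.created_at)
        user_time = pd.to_datetime(X.user.apply(itemgetter("created_at")))
        # Account age at tweet time, in whole months:
        # 2.62974e6 seconds is one average month (about 30.44 days).
        return (tweet_time - user_time).dt.total_seconds() // 2.62974e6
    
    def fit_transform(self, X, y=None):
        return self.transform(X)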
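
To check the month arithmetic by hand, the sketch below computes the account age for the first row of the data using only the standard library. It assumes that the constant `2.62974e15` is the number of nanoseconds in an average month, which is `2.62974e6` seconds; the names `account_age_months`, `TWITTER_FORMAT`, and `SECONDS_PER_MONTH` are hypothetical.

```python
from datetime import datetime

# Format of the `created_at` strings in the data (assumed from the samples).
TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"
# One average month in seconds (about 30.44 days).
SECONDS_PER_MONTH = 2.62974e6

def account_age_months(tweet_created_at, user_created_at):
    """Whole months between account creation and the tweet."""
    tweet_time = datetime.strptime(tweet_created_at, TWITTER_FORMAT)
    user_time = datetime.strptime(user_created_at, TWITTER_FORMAT)
    return int((tweet_time - user_time).total_seconds() // SECONDS_PER_MONTH)

# First row: tweet from June 2015, account created in June 2014.
age = account_age_months(
    "Tue Jun 02 20:16:08 +0000 2015",
    "Thu Jun 12 22:14:05 +0000 2014",
)
print(age)  # 11
```

The account is about ten days short of a full year old, so floor division gives 11 rather than 12.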

In [35]:
pd.DataFrame.from_records(X.user, index=X.index).columns

Index(['created_at', 'favourites_count', 'followers_count', 'friends_count',
       'statuses_count', 'verified'],
      dtype='object')

In [32]:
X.join(pd.DataFrame.from_records(X.user, index=X.index), rsuffix="_user")

| _id | created_at | labels | predict | text | user | created_at_user | favourites_count | followers_count | friends_count | statuses_count | verified |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 556e0ee3d6dfbb462880f0a5 | Tue Jun 02 20:16:08 +0000 2015 | {'alcohol': 0} | 0.526050 | Impatiently waiting to get our hands on the ne... | {'verified': False, 'statuses_count': 823, 'fa... | Thu Jun 12 22:14:05 +0000 2014 | 394 | 407 | 1997 | 823 | False |
| 556e128ad6dfbb46288111e4 | Tue Jun 02 20:31:44 +0000 2015 | {'alcohol': 1} | 0.516649 | Beer fans need their @ColumbusBrewing Bodhi. I... | {'verified': False, 'statuses_count': 10442, '... | Mon Oct 06 21:00:38 +0000 2008 | 806 | 1006 | 960 | 10442 | False |
| 556e1464d6dfbb4628812330 | Tue Jun 02 20:39:37 +0000 2015 | {'alcohol': 1} | 0.502633 | Stone Cold use to be the baddest MF in my book... | {'verified': False, 'statuses_count': 89573, '... | Sun Mar 11 08:22:56 +0000 2012 | 860 | 703 | 684 | 89573 | False |
| 556e15f1d6dfbb4628813236 | Tue Jun 02 20:46:14 +0000 2015 | {'alcohol': 1} | 0.535758 | Now @iamjohnoliver has to drink a Bud Light Li... | {'verified': True, 'statuses_count': 16435, 'f... | Thu Jan 14 03:03:33 +0000 2010 | 3473 | 9414 | 1486 | 16435 | True |
| 556e1adcd6dfbb50e34a1ed6 | Tue Jun 02 21:07:13 +0000 2015 | {'alcohol': 0} | 0.533892 | I'm ready for a yard sale and to sell all the... | {'verified': False, 'statuses_count': 32154, '... | Sun Oct 24 23:02:03 +0000 2010 | 3964 | 519 | 434 | 32154 | False |
| 556e1e58d6dfbb50e34a3eb7 | Tue Jun 02 21:22:06 +0000 2015 | {'alcohol': 1} | 0.572562 | "I need a drink because I'm upset about some p... | {'verified': False, 'statuses_count': 12889, '... | Tue Jul 29 13:51:57 +0000 2008 | 2489 | 727 | 624 | 12889 | False |
| 556e2133d6dfbb50e34a59c8 | Tue Jun 02 21:34:17 +0000 2015 | {'alcohol': 1} | 0.603472 | So for now, no drinking for the foreseeable fu... | {'verified': False, 'statuses_count': 778, 'fa... | Thu Mar 26 13:25:53 +0000 2015 | 0 | 13 | 42 | 778 | False |
| 556e2484d6dfbb50e34a7958 | Tue Jun 02 21:48:26 +0000 2015 | {'alcohol': 1} | 0.554492 | .@MagicTreePub &amp; Eatery has T-Shirt Tuesda... | {'verified': False, 'statuses_count': 23532, '... | Tue Jun 11 18:43:03 +0000 2013 | 1133 | 1912 | 699 | 23532 | False |
| 556e3d1ed6dfbb6d1e39fe92 | Tue Jun 02 23:33:23 +0000 2015 | {'alcohol': 1} | 0.540555 | Wtf was I drinking Sunday?😂🔫🙊 http://t.co/3c2m... | {'verified': False, 'statuses_count': 11567, '... | Tue Nov 02 17:30:02 +0000 2010 | 3466 | 309 | 173 | 11567 | False |
| 556e3dfcd6dfbb6d1e3a075e | Tue Jun 02 23:37:05 +0000 2015 | {'alcohol': 0} | 0.504472 | U aint makin money fuck yo lame excuse// kim k... | {'verified': False, 'statuses_count': 794, 'fa... | Thu Sep 12 23:00:31 +0000 2013 | 608 | 1626 | 428 | 794 | False |
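
The join above can be reproduced on a toy frame to see how `rsuffix` resolves the `created_at` name clash. The two rows, the short index labels, and the names `tweets` and `user_frame` are assumptions for illustration; the dict column is converted with `.tolist()` here so that `from_records` receives a plain list of dicts.

```python
import pandas as pd

# Toy stand-in for the tweet DataFrame: a timestamp plus a column of dicts.
tweets = pd.DataFrame(
    {
        "created_at": ["Tue Jun 02 20:16:08 +0000 2015",
                       "Tue Jun 02 20:31:44 +0000 2015"],
        "user": [
            {"created_at": "Thu Jun 12 22:14:05 +0000 2014", "followers_count": 407},
            {"created_at": "Mon Oct 06 21:00:38 +0000 2008", "followers_count": 1006},
        ],
    },
    index=["a", "b"],
)

# Expand the dicts into columns, keeping the same index so rows line up.
user_frame = pd.DataFrame.from_records(tweets["user"].tolist(), index=tweets.index)

# Both frames have `created_at`; rsuffix renames the right-hand copy.
joined = tweets.join(user_frame, rsuffix="_user")
print(list(joined.columns))
# ['created_at', 'user', 'created_at_user', 'followers_count']
```

Without a suffix, `join` raises an error because the column names overlap, so any transformer that joins exploded `user` fields back onto the tweets needs one.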
