# Baseline Approach 

This notebook attempts to implement the baseline approach described in the RecSys Challenge 2020 paper (https://arxiv.org/abs/2004.13715).

First, we load the training and validation data and preprocess it so that a model can be fitted.

### Load Training Data

In [1]:
import pandas as pd
import numpy as np
import scipy.sparse as sp


from helper.data_loading import load_subsample

In [2]:
train_df = load_subsample("data/train.csv")

In [3]:
pd.set_option('display.max_columns', None)
train_df.head()

Unnamed: 0,text_tokens,hashtags,tweet_id,present_media,present_links,present_domains,tweet_type,language,tweet_timestamp,engaged_with_user_id,engaged_with_user_follower_count,engaged_with_user_following_count,engaged_with_user_is_verified,engaged_with_user_account_creation,engaging_user_id,engaging_user_follower_count,engaging_user_following_count,engaging_user_is_verified,engaging_user_account_creation,engaged_follows_engaging,reply_timestamp,retweet_timestamp,retweet_with_comment_timestamp,like_timestamp
0,101 105549 10133 117 10105 100 11704 71136 104...,ABCED825B354CDD12A92D0C05686C7B5,4B3C351F949F3322D596E95B09B80008,[Video],,,TopLevel,D3164C7FBCF2565DDF915B1B3AEFB1DC,2020-02-06,F09E66BE3C210A32379427220DE06E91,17198316,1122,True,2007-03-15 06:22:13,04F8F1705C61E1A18B80DC9748CE02FF,37469,2019,False,2014-06-29 18:09:00,False,NaT,NaT,NaT,NaT
1,101 14120 131 120 120 188 119 11170 120 155 11...,,2E76E8E1F009D7954B07AF6C69650D07,[Video],,,TopLevel,B9175601E87101A984A50F8A62A1C374,2020-02-06,CAC0F5C16EF014303E9864AF6599E038,95993,3,False,2019-06-28 14:39:38,00BF7B74D57FD5D9DDF0919F3E612048,175,144,False,2019-05-27 05:35:30,False,NaT,NaT,NaT,NaT
2,101 58573 24951 11369 38351 11090 4476 4348 10...,91BDC623D8F241C76449E29368ACC270 857BAD78736C4...,D7F4F31D796404E8F5E2BAC79954EC4F,[GIF],[A0ECAE935A744B2AEFB7D185E14DF9CF],[7EA44583A7695522550E85C618413F3E],TopLevel,22C448FF81263D4BAF2A176145EE9EAD,2020-02-06,6798E612759FE86EBE05CA137BEE78EB,73720,5,True,2013-03-10 05:12:12,00494200F720D728953E799EA753188D,62,451,False,2017-11-11 09:46:43,False,NaT,NaT,NaT,NaT
3,101 10747 10124 32650 97038 19718 10111 11951 ...,11DD75033652B845468C84856328E657,9E5CECEAC7D51D0A99FA841150DAD0DC,[],[D334E773309486B6BF6899502C54D14E],[B3206482C1A292DC87C9E4F7CF05A5E4],TopLevel,D3164C7FBCF2565DDF915B1B3AEFB1DC,2020-02-06,1E94943C521EDC5FD0C1AAF190563418,438860,159,True,2016-11-27 18:12:08,01DF9BB8C5C6A703493309A5F6C156A9,14,199,False,2018-01-16 21:41:17,False,NaT,NaT,NaT,NaT
4,101 29005 10107 53499 29837 29284 13845 10225 ...,,FD1A4958DA5DE8DC3930346A7D1A585C,[],[B06C08BF4E54F3CA3AAEEF3E3CBA77B7],[7C36CB8CD2F180359FFE793D870E365A],TopLevel,ECED8A16BE2A5E8871FD55F4842F16B1,2020-02-06,A5D608BDB5F093C3FD3EDEDA7C517D84,720143,328,True,2009-01-27 19:51:14,002E9B36C19A48A1825F092352A4DD4F,206,1221,False,2014-12-19 02:05:42,False,NaT,NaT,NaT,NaT


## Preprocessing
This part turns the raw columns of the dataset into a numeric feature matrix and a set of binary targets.
The steps performed are:

1) Encode the Response Variables from NaN or Timestamps to 0 or 1 respectively

2) One-Hot Encode the Tweet Features
    - Language
    - Present Media
    - Tweet Type
    
3) TF-IDF Representation of the Text Tokens

4) TF-IDF Representation of the Hashtags

5) Convert the IDs to Integers
    - Tweet ID
    - Engaged with User ID
    - Engaging User ID

6) Convert Boolean Values to 0 / 1
    - Engaged with User is verified
    - Engaging User is verified
    - Engaged follows engaging
    
7) Convert to a Binary Indicator: is_present (1) or is_not_present (0)
    - present Links
    - present Domains
   
8) Concatenate all Features

#### 1) Encode the Response Variables from NaN or Timestamps to 0 or 1 respectively

In [4]:
reply = train_df["reply_timestamp"].notnull().astype(int).to_numpy()
retweet = train_df["retweet_timestamp"].notnull().astype(int).to_numpy()
retweet_with_comment = train_df["retweet_with_comment_timestamp"].notnull().astype(int).to_numpy()
like = train_df["like_timestamp"].notnull().astype(int).to_numpy()

In [5]:
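# np.hstack on 1-D arrays would produce one flat vector of length 4 * n_samples.
# Stack the targets as columns instead, so each row holds the four labels of one sample.
response = np.column_stack((reply, retweet, retweet_with_comment, like))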
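
As a small illustration of this step, the sketch below encodes the four timestamp columns of a toy DataFrame into a 0/1 label matrix with one row per sample and one column per engagement type. The toy values and the `target_columns` list are assumptions made for the example; only the column names come from the dataset shown above.

```python
import pandas as pd

# Toy frame with the four engagement timestamp columns; NaT means "no engagement".
toy = pd.DataFrame({
    "reply_timestamp": pd.to_datetime([None, "2020-02-06 10:00:00", None]),
    "retweet_timestamp": pd.to_datetime([None, None, None]),
    "retweet_with_comment_timestamp": pd.to_datetime([None, None, "2020-02-06 11:00:00"]),
    "like_timestamp": pd.to_datetime(["2020-02-06 09:00:00", None, "2020-02-06 12:00:00"]),
})

target_columns = [
    "reply_timestamp",
    "retweet_timestamp",
    "retweet_with_comment_timestamp",
    "like_timestamp",
]

# A present timestamp becomes 1, a missing one becomes 0.
labels = toy[target_columns].notnull().astype(int).to_numpy()
print(labels.shape)  # (3, 4)
```

Keeping the targets in an `(n_samples, 4)` layout means the label rows line up with the rows of the feature matrix.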

#### 2) One-Hot Encode the Tweet Features

In [6]:
from sklearn.preprocessing import OneHotEncoder

language_encoder = OneHotEncoder()
language = language_encoder.fit_transform(train_df["language"].to_numpy().reshape(-1,1))

In [7]:
tweet_type_encoder = OneHotEncoder()
tweet_type = tweet_type_encoder.fit_transform(train_df["tweet_type"].to_numpy().reshape(-1,1))

In [8]:
from sklearn.preprocessing import MultiLabelBinarizer

present_media_encoder = MultiLabelBinarizer(sparse_output=False)
present_media = present_media_encoder.fit_transform(train_df["present_media"])

In [9]:
tweet_features = sp.hstack([language, tweet_type, present_media])
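
Because the fitted encoders are reused on validation data later, it may be worth deciding how categories that never occurred in training should be handled. The sketch below is not part of the original pipeline: it uses toy language codes (the real column holds hashed IDs) and shows the `handle_unknown="ignore"` option of `OneHotEncoder`, which maps unseen categories to an all-zero row instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy language codes, for illustration only.
train_lang = np.array(["en", "de", "en", "fr"]).reshape(-1, 1)
val_lang = np.array(["en", "es"]).reshape(-1, 1)  # "es" was never seen during fit

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_lang)

# Categories are sorted: de, en, fr. The unseen "es" becomes an all-zero row.
encoded = encoder.transform(val_lang).toarray()
print(encoded)
```

With the default `handle_unknown="error"`, the `transform` call on `val_lang` would fail.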

#### 3) TF-IDF Representation of the Text Tokens

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer
text_tfidf = TfidfVectorizer()
text_tokens = text_tfidf.fit_transform(train_df['text_tokens'])
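
One detail to be aware of: the default `token_pattern` of `TfidfVectorizer` only keeps tokens with at least two word characters, so single-character token IDs are silently dropped. Whether this matters depends on the tokenizer vocabulary. The sketch below compares the default with a whitespace-based pattern on two made-up token strings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up token ID strings in the same space-separated format as text_tokens.
docs = ["101 7 2003 102", "101 7 7 9 102"]

default_vec = TfidfVectorizer()
default_vec.fit(docs)

# Treat every whitespace-separated chunk as a token, including single digits.
whitespace_vec = TfidfVectorizer(token_pattern=r"\S+")
whitespace_vec.fit(docs)

print(sorted(default_vec.vocabulary_))     # ['101', '102', '2003']
print(sorted(whitespace_vec.vocabulary_))  # ['101', '102', '2003', '7', '9']
```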

#### 4) TF-IDF Representation of the Hashtags

In [11]:
hashtags_tfidf = TfidfVectorizer()
hashtags = hashtags_tfidf.fit_transform(train_df['hashtags'])
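
Many tweets carry no hashtags. If `load_subsample` leaves such entries as NaN (an assumption, since the helper is not shown here), the vectorizer cannot process them directly. A minimal sketch of one way to handle this is to replace missing values with an empty string, which yields an all-zero row. Setting `lowercase=False` keeps the hashed IDs unchanged.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy hashtag column: space-separated IDs, with one missing entry.
hashtags_col = pd.Series(["AAA BBB", np.nan, "BBB"])

vec = TfidfVectorizer(token_pattern=r"\S+", lowercase=False)
matrix = vec.fit_transform(hashtags_col.fillna(""))

print(sorted(vec.vocabulary_))  # ['AAA', 'BBB']
print(matrix.shape)             # (3, 2)
print(matrix.getrow(1).nnz)     # 0, the tweet without hashtags has no non-zero entries
```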

In [12]:
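# Caution: this assignment replaces the one-hot tweet_features built in step 2,
# so those columns are not part of the final matrix. To keep them, stack them as well:
# tweet_features = sp.hstack((tweet_features, text_tokens, hashtags))
tweet_features = sp.hstack((text_tokens, hashtags))  # sparse inputs: use sp.hstack, not the NumPy stacking functions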

#### 5) Convert the IDs to Integers (Buckets)

In [13]:
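# Note: the built-in hash() of a str is salted per Python process unless PYTHONHASHSEED is fixed,
# so these integers (and the bins fitted on them) are only comparable within one session.
train_df['tweet_id'] = train_df["tweet_id"].map(hash)
train_df['engaged_with_user_id'] = train_df["engaged_with_user_id"].map(hash)
train_df['engaging_user_id'] = train_df["engaging_user_id"].map(hash)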
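
If the integer IDs need to be reproducible across sessions, for example when the fitted discretizers are pickled and applied to validation data in a new process, a deterministic hash can be used in place of the built-in one. The helper `stable_id_to_int` below is a hypothetical name introduced for this sketch and is not part of the original notebook.

```python
import hashlib


def stable_id_to_int(identifier: str, modulus: int = 2**31 - 1) -> int:
    """Map a string ID to a reproducible integer in [0, modulus)."""
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], byteorder="big") % modulus


example_id = "4B3C351F949F3322D596E95B09B80008"  # tweet_id of the first row shown above
first = stable_id_to_int(example_id)
second = stable_id_to_int(example_id)
print(first == second)  # True
```

It could be applied in the same way as the built-in function, e.g. `train_df["tweet_id"].map(stable_id_to_int)`.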

In [14]:
from sklearn.preprocessing import KBinsDiscretizer
tweet_discretizer = KBinsDiscretizer(n_bins=50)
tweet_id = tweet_discretizer.fit_transform(train_df['tweet_id'].to_numpy().reshape(-1, 1))

In [15]:
engaged_with_user_discretizer = KBinsDiscretizer(n_bins=50)
engaged_with_user_id = engaged_with_user_discretizer.fit_transform(train_df['engaged_with_user_id'].to_numpy().reshape(-1, 1))

In [16]:
engaging_user_discretizer = KBinsDiscretizer(n_bins=50)
engaging_user_id = engaging_user_discretizer.fit_transform(train_df['engaging_user_id'].to_numpy().reshape(-1, 1))

In [17]:
id_features = sp.hstack([tweet_id, engaged_with_user_id, engaging_user_id])
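
To make the resulting shape easier to follow: with its default `encode="onehot"`, `KBinsDiscretizer` returns a sparse matrix with one column per bin, so three ID columns with 50 bins each contribute 150 columns. The sketch below reproduces this for a single column of random integers standing in for hashed IDs.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Random integers as a stand-in for hashed IDs.
rng = np.random.default_rng(0)
values = rng.integers(0, 10**9, size=(1000, 1)).astype(float)

# These are the defaults, spelled out for clarity.
discretizer = KBinsDiscretizer(n_bins=50, encode="onehot", strategy="quantile")
buckets = discretizer.fit_transform(values)

print(buckets.shape)  # (1000, 50)
print(buckets.nnz)    # 1000, exactly one active bucket per row
```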

#### 6) Convert Boolean Values to 0 / 1

In [18]:
engaged_with_user_is_verified = train_df["engaged_with_user_is_verified"].astype(int).to_numpy()
engaging_user_is_verified = train_df["engaging_user_is_verified"].astype(int).to_numpy()
engaged_follows_engaging = train_df["engaged_follows_engaging"].astype(int).to_numpy()

In [19]:
boolean_features = np.column_stack([engaged_with_user_is_verified, engaging_user_is_verified, engaged_follows_engaging])

#### 7) Convert to a Binary Indicator: is_present (1) or is_not_present (0)

In [20]:
present_links = train_df["present_links"].notnull().astype(int).to_numpy()
present_domains = train_df["present_domains"].notnull().astype(int).to_numpy()

In [21]:
present_features = np.column_stack([present_links, present_domains])

#### 8) Concatenate all Features

In [22]:
tweet_features.shape

(56297, 68331)

In [23]:
id_features.shape

(56297, 150)

In [24]:
boolean_features.shape

(56297, 3)

In [25]:
present_features.shape

(56297, 2)

In [26]:
X_train = sp.hstack([tweet_features, id_features, boolean_features, present_features])

In [27]:
X_train.shape

(56297, 68486)
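
A note on the concatenation: `sp.hstack` returns a COO matrix by default, which does not support row slicing. If the matrix is later split into batches or folds, requesting CSR output may be more convenient. The following is a minimal sketch with small made-up blocks, not the notebook's actual features.

```python
import numpy as np
import scipy.sparse as sp

# A small random sparse block and a dense 0/1 block with the same number of rows.
sparse_block = sp.random(4, 6, density=0.3, format="csr", random_state=0)
dense_block = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])

# Wrap the dense block explicitly and ask for CSR output.
combined = sp.hstack([sparse_block, sp.csr_matrix(dense_block)], format="csr")

print(combined.shape)   # (4, 8)
print(combined.format)  # csr
first_two_rows = combined[:2]  # row slicing works on CSR
```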

In [28]:
Y_train = response

### Save Pipeline Components as Dict

In [29]:
components = {
    "language_encoder": language_encoder,
    "tweet_type_encoder": tweet_type_encoder,
    "present_media_encoder": present_media_encoder,
    "text_tfidf": text_tfidf,
    "hashtags_tfidf": hashtags_tfidf,
    "tweet_discretizer": tweet_discretizer,
    "engaged_with_user_discretizer": engaged_with_user_discretizer,
    "engaging_user_discretizer": engaging_user_discretizer
}

In [30]:
import pickle
with open("pipeline/pipeline_components.pkl", "wb") as file:
    pickle.dump(components, file)
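
The point of saving the fitted components is that new data must go through `transform` only, never `fit_transform`, so that the columns match the training matrix. The sketch below round-trips a single toy encoder through a temporary file. The file name and the toy categories are assumptions for the example.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit a toy encoder, standing in for one entry of the components dict.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(np.array(["TopLevel", "Retweet", "Quote"]).reshape(-1, 1))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "components.pkl"
    with open(path, "wb") as file:
        pickle.dump({"tweet_type_encoder": encoder}, file)
    with open(path, "rb") as file:
        restored = pickle.load(file)

# Apply the restored encoder to new rows without refitting.
new_rows = np.array(["Retweet", "TopLevel"]).reshape(-1, 1)
encoded = restored["tweet_type_encoder"].transform(new_rows).toarray()
print(encoded)
```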

#### How to use the helper file to load the validation/test data with the fitted pipeline components

In [33]:
from helper.preprocessing import preprocess_dataset

In [36]:
dev_df = load_subsample("data/validation.csv")

X_val, y_val = preprocess_dataset(dev_df, "pipeline/pipeline_components.pkl", load=True)

In [37]:
X_val.shape

(12064, 68486)