# Baseline Approach 

This notebook attempts to implement the baseline approach described in the RecSys Challenge 2020 paper (https://arxiv.org/abs/2004.13715).

First, we read in the dataset (train and validation) and preprocess it so that a model can be fitted.

### Load Training Data

In [1]:
import pandas as pd

from helper.data_loading import load_subsample

In [2]:
train_df = load_subsample("data/train.csv")

In [3]:
pd.set_option('display.max_columns', None)
train_df.head()

Unnamed: 0,text_tokens,hashtags,tweet_id,present_media,present_links,present_domains,tweet_type,language,tweet_timestamp,engaged_with_user_id,engaged_with_user_follower_count,engaged_with_user_following_count,engaged_with_user_is_verified,engaged_with_user_account_creation,engaging_user_id,engaging_user_follower_count,engaging_user_following_count,engaging_user_is_verified,engaging_user_account_creation,engaged_follows_engaging,reply_timestamp,retweet_timestamp,retweet_with_comment_timestamp,like_timestamp
0,101 105549 10133 117 10105 100 11704 71136 104...,ABCED825B354CDD12A92D0C05686C7B5,4B3C351F949F3322D596E95B09B80008,[Video],,,TopLevel,D3164C7FBCF2565DDF915B1B3AEFB1DC,2020-02-06,F09E66BE3C210A32379427220DE06E91,17198316,1122,True,2007-03-15 06:22:13,04F8F1705C61E1A18B80DC9748CE02FF,37469,2019,False,2014-06-29 18:09:00,False,NaT,NaT,NaT,NaT
1,101 14120 131 120 120 188 119 11170 120 155 11...,,2E76E8E1F009D7954B07AF6C69650D07,[Video],,,TopLevel,B9175601E87101A984A50F8A62A1C374,2020-02-06,CAC0F5C16EF014303E9864AF6599E038,95993,3,False,2019-06-28 14:39:38,00BF7B74D57FD5D9DDF0919F3E612048,175,144,False,2019-05-27 05:35:30,False,NaT,NaT,NaT,NaT
2,101 58573 24951 11369 38351 11090 4476 4348 10...,91BDC623D8F241C76449E29368ACC270 857BAD78736C4...,D7F4F31D796404E8F5E2BAC79954EC4F,[GIF],[A0ECAE935A744B2AEFB7D185E14DF9CF],[7EA44583A7695522550E85C618413F3E],TopLevel,22C448FF81263D4BAF2A176145EE9EAD,2020-02-06,6798E612759FE86EBE05CA137BEE78EB,73720,5,True,2013-03-10 05:12:12,00494200F720D728953E799EA753188D,62,451,False,2017-11-11 09:46:43,False,NaT,NaT,NaT,NaT
3,101 10747 10124 32650 97038 19718 10111 11951 ...,11DD75033652B845468C84856328E657,9E5CECEAC7D51D0A99FA841150DAD0DC,,[D334E773309486B6BF6899502C54D14E],[B3206482C1A292DC87C9E4F7CF05A5E4],TopLevel,D3164C7FBCF2565DDF915B1B3AEFB1DC,2020-02-06,1E94943C521EDC5FD0C1AAF190563418,438860,159,True,2016-11-27 18:12:08,01DF9BB8C5C6A703493309A5F6C156A9,14,199,False,2018-01-16 21:41:17,False,NaT,NaT,NaT,NaT
4,101 29005 10107 53499 29837 29284 13845 10225 ...,,FD1A4958DA5DE8DC3930346A7D1A585C,,[B06C08BF4E54F3CA3AAEEF3E3CBA77B7],[7C36CB8CD2F180359FFE793D870E365A],TopLevel,ECED8A16BE2A5E8871FD55F4842F16B1,2020-02-06,A5D608BDB5F093C3FD3EDEDA7C517D84,720143,328,True,2009-01-27 19:51:14,002E9B36C19A48A1825F092352A4DD4F,206,1221,False,2014-12-19 02:05:42,False,NaT,NaT,NaT,NaT


## Preprocessing
This part preprocesses the data so that every feature is numeric.
The steps performed are:

1) Encode the Response Variables from NaN or Timestamps to 0 or 1 respectively

2) One-Hot Encode the Tweet Features
    - Language
    - Present Media
    - Tweet Type
    
3) TF-IDF Representation of the Text Tokens

4) TF-IDF Representation of the Hashtags

5) Convert the IDs to Integers
    - Tweet ID
    - Engaged with User ID
    - Engaging User ID

6) Convert Boolean Values to 0 / 1
    - Engaged with User is verified
    - Engaging User is verified
    - Engaged follows engaging
    
7) Convert to Variable is_present(1) or is_not_present(0)
    - present Links
    - present Domains
   
8) Drop Creation Date 
    - Tweet
    - Engaging User
    - Engaged with User

#### 1) Encode the Response Variables from NaN or Timestamps to 0 or 1 respectively

In [4]:
train_df["reply_timestamp"] = train_df["reply_timestamp"].notnull().astype(int)
train_df["retweet_timestamp"] = train_df["retweet_timestamp"].notnull().astype(int)
train_df["retweet_with_comment_timestamp"] = train_df["retweet_with_comment_timestamp"].notnull().astype(int)
train_df["like_timestamp"] = train_df["like_timestamp"].notnull().astype(int)
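
To make the encoding above concrete, here is a minimal sketch on a toy frame. The frame `toy` and its two rows are made up for illustration; only the `notnull().astype(int)` idiom is taken from the cell above.

```python
import pandas as pd

# Toy frame (hypothetical data): NaT means "no engagement",
# a timestamp means the engagement happened.
toy = pd.DataFrame({
    "reply_timestamp": pd.to_datetime([None, "2020-02-06 10:00:00"]),
    "like_timestamp": pd.to_datetime(["2020-02-06 11:30:00", None]),
})

target_columns = ["reply_timestamp", "like_timestamp"]
for column in target_columns:
    # notnull() yields booleans; astype(int) maps False/True to 0/1.
    toy[column] = toy[column].notnull().astype(int)

labels = toy[target_columns].values.tolist()
```

Each row now holds one binary label per engagement type, which is the form a classifier expects as its target.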

#### 2) One-Hot Encode the Tweet Features

In [5]:
columns_to_encode = ["language", "tweet_type"]
for column in columns_to_encode:
    # Append one indicator column per category, prefixed with the source column name
    train_df = pd.concat([train_df, pd.get_dummies(train_df[column], prefix=column)], axis=1)

    # Drop the original categorical column, which is no longer needed
    train_df.drop([column], axis=1, inplace=True)
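
One thing to keep in mind when the validation set is encoded the same way: `pd.get_dummies` only creates columns for categories that occur in the frame it is given, so the two sets can end up with different columns. The sketch below shows one possible way to align them with `reindex`. The frames `train` and `valid` are toy data, and aligning to the training layout is our assumption, not something the paper prescribes.

```python
import pandas as pd

# Hypothetical toy data: the validation sample lacks two tweet types.
train = pd.DataFrame({"tweet_type": ["TopLevel", "Quote", "Retweet"]})
valid = pd.DataFrame({"tweet_type": ["TopLevel", "TopLevel"]})

# dtype=int keeps the indicators as 0/1 integers.
train_encoded = pd.get_dummies(train["tweet_type"], prefix="tweet_type", dtype=int)
valid_encoded = pd.get_dummies(valid["tweet_type"], prefix="tweet_type", dtype=int)

# Bring the validation frame to the training column layout;
# categories that never occurred in validation are filled with 0.
valid_aligned = valid_encoded.reindex(columns=train_encoded.columns, fill_value=0)
```

Categories that appear only in the validation set are dropped by this `reindex`, which matches what a model fitted on the training columns can use.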

In [6]:
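from sklearn.preprocessing import MultiLabelBinarizer

columns_to_encode = ["present_media"]

for column in columns_to_encode:
    mlb = MultiLabelBinarizer()
    # Tweets without media are NaN (a float), which makes fit_transform raise
    # "TypeError: 'float' object is not iterable"; map them to empty lists first.
    # Assumption: non-missing values are already lists such as ["Video"].
    labels = train_df.pop(column).apply(lambda v: v if isinstance(v, list) else [])
    encoded = pd.DataFrame(mlb.fit_transform(labels),
                           columns=[f"{column}_{c}" for c in mlb.classes_],
                           index=train_df.index)
    train_df = train_df.join(encoded)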

#### 3) TF-IDF Representation of the Text Tokens

In [7]:
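from sklearn.feature_extraction.text import TfidfVectorizer

v = TfidfVectorizer()
x = v.fit_transform(train_df['text_tokens'])

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() exists since 1.0.
# Reuse train_df's index so that pd.concat aligns rows even if the index is not 0..n-1.
df1 = pd.DataFrame(x.toarray(), columns=v.get_feature_names_out(), index=train_df.index)
train_df.drop('text_tokens', axis=1, inplace=True)
train_df = pd.concat([train_df, df1], axis=1)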
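
For readers who want to see what the TF-IDF weights mean, here is a small pure-Python sketch of the weighting that `TfidfVectorizer` uses with its default settings (smoothed idf, L2 normalisation per document). The function `tfidf` and the two toy documents are hypothetical. Splitting on whitespace is a simplification: the real vectorizer's default token pattern also skips single-character tokens.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {token: weight} dict per document (smoothed idf, L2-normalised)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each token occurs in.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        weights = {t: count * idf[t] for t, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows


# Toy token-ID strings in the style of the text_tokens column.
rows = tfidf(["101 10747 10124", "101 29005 10107 101"])
```

The token `101` occurs in both toy documents, so it receives a lower weight in the first row than the tokens that occur in only one document.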

#### 4) TF-IDF Representation of the Hashtags

In [8]:
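v = TfidfVectorizer()
# Tweets without hashtags are NaN, which TfidfVectorizer rejects; treat them as empty documents
x = v.fit_transform(train_df['hashtags'].fillna(''))

# See step 3 for why get_feature_names_out() and an explicit index are used
df1 = pd.DataFrame(x.toarray(), columns=v.get_feature_names_out(), index=train_df.index)
train_df.drop('hashtags', axis=1, inplace=True)
train_df = pd.concat([train_df, df1], axis=1)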

#### 5) Convert the IDs to Integers

In [9]:
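# Caveat: hash() of a str is salted per interpreter process (unless PYTHONHASHSEED is fixed),
# so these integers are only consistent within a single session.
train_df['tweet_id'] = train_df["tweet_id"].map(hash)
train_df['engaged_with_user_id'] = train_df["engaged_with_user_id"].map(hash)
train_df['engaging_user_id'] = train_df["engaging_user_id"].map(hash)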
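
Python's built-in `hash()` is salted per process for strings, so the integers above change between notebook sessions. If the IDs must map to the same integers every time (for example when train and validation are processed in separate runs), one possible alternative is a digest-based mapping. The helper `stable_id` below is a hypothetical sketch and not part of the original notebook.

```python
import hashlib


def stable_id(hex_id: str) -> int:
    """Map an ID string to a signed 64-bit integer that is identical in every run."""
    # An 8-byte BLAKE2b digest fits into int64, which pandas and NumPy handle natively.
    digest = hashlib.blake2b(hex_id.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, byteorder="big", signed=True)


# Example with a tweet ID taken from the output of train_df.head() above.
tweet_int = stable_id("4B3C351F949F3322D596E95B09B80008")
```

It would be applied in the same way as `hash`, e.g. `train_df["tweet_id"].map(stable_id)`.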

#### 6) Convert Boolean Values to 0 / 1

In [10]:
train_df["engaged_with_user_is_verified"] = train_df["engaged_with_user_is_verified"].astype(int)
train_df["engaging_user_is_verified"] = train_df["engaging_user_is_verified"].astype(int)
train_df["engaged_follows_engaging"] = train_df["engaged_follows_engaging"].astype(int)

#### 7) Convert to Variable is_present(1) or is_not_present(0)

In [11]:
train_df["present_links"] = train_df["present_links"].notnull().astype(int)
train_df["present_domains"] = train_df["present_domains"].notnull().astype(int)

#### 8) Drop Creation Date 

In [12]:
train_df.drop('tweet_timestamp', axis=1, inplace=True)
train_df.drop('engaging_user_account_creation', axis=1, inplace=True)
train_df.drop('engaged_with_user_account_creation', axis=1, inplace=True)

In [13]:
train_df.head()

Unnamed: 0,tweet_id,present_links,present_domains,engaged_with_user_id,engaged_with_user_follower_count,engaged_with_user_following_count,engaged_with_user_is_verified,engaging_user_id,engaging_user_follower_count,engaging_user_following_count,engaging_user_is_verified,engaged_follows_engaging,reply_timestamp,retweet_timestamp,retweet_with_comment_timestamp,like_timestamp,language_022EC308651FACB02794A8147AEE1B78,language_0331BF70E606D62D92C96CE9AD71A7CF,language_06BEAB41D66CCFF329D1ED8BA120A6C2,language_06D61DCBBE938971E1EA0C38BD9B5446,language_125C57F4FA6D4E110983FB11B52EFD4E,language_12D8CEB94F89D11D7EB95EAE9689B009,language_167115458A0DBDFF7E9C0C53A83BAC9B,language_190BA7DA361BC06BC1D7E824C378064D,language_1BC639981AE88E09129594B11F894A21,language_1FFD2FE4297F5E70EBC6C3230D95CB9C,language_2216D01F7B48554E4211021A46816FCF,language_22C448FF81263D4BAF2A176145EE9EAD,language_259A6F6DFD672CB1F883CBEC01B99F2D,language_2996EB2FE8162C076D070A4C8D6532CD,language_2E18F6F53E3CF073911AF0A93BBE5373,language_3121F7240D488F74EEED9312E174B217,language_3820C29CBCA409A33BADF68852057C4A,language_3A85BCEC571C3F5AB1069E4924189177,language_3E16B11B7ADE3A22DDFC4423FBCEAD5D,language_4249CE88433AEA3F8DCEECF008B3CB95,language_48236EC80FDDDFADE99420ABC9210DDF,language_4DC22C3F31C5C43721E6B5815A595ED6,language_54208B51D44E7D91DC2F3DD02ADEDEC2,language_544FA32458C903F1125FE6598300A047,language_57ADD4576E2AD6648E9B2DE32F3462A5,language_60A3DB168094D41241E45E0DE3539BC0,language_60FBA0E834CC59D647C3599AD763FFDF,language_6431A618DCF7F4CB7F62A95A39BAB77A,language_691890251F2B9FF922BE6D3699ABEFD2,language_69C4A33B9AD29AF883D60BA61CC08702,language_717293301FE296B0B61950D041485825,language_76B8A9C3013AE6414A3E6012413CDC3B,language_89616CFF8EC8637092F885C7EFF43D74,language_920502FAA080485768AA89BC96A55C47,language_975B38F44D65EE42A547283787FF5A21,language_9BF3403E0EB7EA8A256DA9019C0B0716,language_9ECD42BC079C20F156F53CB3B99E600E,language_A0C7021AD8299ADF0C9EBE326C115F6F,language_AC1F0671A4B0D5B8112F87DE7B490E6D,language_AEF22666801F0A5846D853B9CEB2E327,language_B6D90127A09AB1229731898AEF9D4D7C,language_B9175601E87101A984A50F8A62A1C374,language_BF477808A37E3E4E9C5D9F1839E8519E,language_C7A400D9AD489ACF673CF12FBB80AAE5,language_C942E369C88CE7C56E69A84D04319FF0,language_CB11E9CF42BD0A1BAD5E27BF3422D99D,language_D3164C7FBCF2565DDF915B1B3AEFB1DC,language_D413F5FE5236E5650A46FD983AB39212,language_DBEEFB80F8A314311E2B4BD593E11DFE,language_E59EF8BB86A6D815331DDF4C467CE0C7,language_E7BB61D2A87C1E72DF1C7BC292B86A1C,language_ECED8A16BE2A5E8871FD55F4842F16B1,language_F3E1016563360F9434FA986CA86C249C,language_F4FD40A716F1572C9A28E9CAA58BE3A5,language_F73266A79468BB89C4325FDEDB0B533C,language_FA3F382BC409C271E3D6EAF8BE4648DD,language_FF60A88F53E63000266F8B9149E35AD9,language_FF7EABB5A382356D54D9C41BA0125E09,tweet_type_Quote,tweet_type_Retweet,tweet_type_TopLevel
0,4590069584091491409,0,0,-136733291582698449,17198316,1122,1,-3928338212496585445,37469,2019,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1
1,8829879499837066784,0,0,-2589379172838009695,95993,3,0,392474534274647486,175,144,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
2,5482716909529045989,1,1,156673691155750861,73720,5,1,7102884400595132217,62,451,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
3,-7863212792125820657,1,1,-1637782398906196582,438860,159,1,-5507891599144401415,14,199,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1
4,8731020775755400127,1,1,-6079224041861503608,720143,328,1,-3884849683006866670,206,1221,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1


In [14]:
train_df.columns

Index(['tweet_id', 'present_links', 'present_domains', 'engaged_with_user_id',
       'engaged_with_user_follower_count', 'engaged_with_user_following_count',
       'engaged_with_user_is_verified', 'engaging_user_id',
       'engaging_user_follower_count', 'engaging_user_following_count',
       'engaging_user_is_verified', 'engaged_follows_engaging',
       'reply_timestamp', 'retweet_timestamp',
       'retweet_with_comment_timestamp', 'like_timestamp',
       'language_022EC308651FACB02794A8147AEE1B78',
       'language_0331BF70E606D62D92C96CE9AD71A7CF',
       'language_06BEAB41D66CCFF329D1ED8BA120A6C2',
       'language_06D61DCBBE938971E1EA0C38BD9B5446',
       'language_125C57F4FA6D4E110983FB11B52EFD4E',
       'language_12D8CEB94F89D11D7EB95EAE9689B009',
       'language_167115458A0DBDFF7E9C0C53A83BAC9B',
       'language_190BA7DA361BC06BC1D7E824C378064D',
       'language_1BC639981AE88E09129594B11F894A21',
       'language_1FFD2FE4297F5E70EBC6C3230D95CB9C',
       'language_22