In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import precision_recall_curve, auc, log_loss

# Data understanding

## Collect initial data 

**Task: Collect initial data**\
Acquire within the project the data (or access to the data) listed in the
project resources. This initial collection includes data loading if necessary
for data understanding. For example, if you apply a specific tool for data
understanding, it makes perfect sense to load your data into this tool.
This effort possibly leads to initial data preparation steps.
Note: if you acquire multiple data sources, integration is an additional
issue, either here or in the later data preparation phase.

**Output: Initial data collection report**\
List the dataset (or datasets) acquired, together with their locations
within the project, the methods used to acquire them and any problems
encountered. Record problems encountered and any solutions achieved
to aid with future replication of this project or with the execution of
similar future projects.

In [2]:
file = 'training_sample.tsv'

In [3]:
column_names = ["text_tokens", "hashtags", "tweet_id", "present_media", "present_links", "present_domains",\
                "tweet_type", "language", "tweet_timestamp", "engaged_with_user_id", "engaged_with_user_follower_count",\
               "engaged_with_user_following_count", "engaged_with_user_is_verified", "engaged_with_user_account_creation",\
               "engaging_user_id", "engaging_user_follower_count", "engaging_user_following_count", "engaging_user_is_verified",\
               "engaging_user_account_creation", "engaged_follows_engaging", "reply_timestamp", "retweet_timestamp", "retweet_with_comment_timestamp", "like_timestamp"]

In [4]:
df = pd.read_csv(file, header=None, names=column_names, delimiter='\x01')

In [5]:
pd.set_option('display.max_columns', None)
print(df.shape)
display(df.head())

(80425, 24)


Unnamed: 0,text_tokens,hashtags,tweet_id,present_media,present_links,present_domains,tweet_type,language,tweet_timestamp,engaged_with_user_id,engaged_with_user_follower_count,engaged_with_user_following_count,engaged_with_user_is_verified,engaged_with_user_account_creation,engaging_user_id,engaging_user_follower_count,engaging_user_following_count,engaging_user_is_verified,engaging_user_account_creation,engaged_follows_engaging,reply_timestamp,retweet_timestamp,retweet_with_comment_timestamp,like_timestamp
0,101\t56898\t137\t174\t63247\t10526\t131\t3197\...,,3C21DCFB8E3FEC1CB3D2BFB413A78220,Video,,,Retweet,76B8A9C3013AE6414A3E6012413CDC3B,1581467323,D1AA2C85FA644D64346EDD88470525F2,737,706,False,1403069820,000046C8606F1C3F5A7296222C88084B,131,2105,False,1573978269,False,,,,
1,101\t102463\t10230\t10105\t21040\t10169\t12811...,,3D87CC3655C276F1771752081423B405,,BB422AA00380E45F312FD2CAA75F4960,92D397F8E0F1E77B36B8C612C2C51E23,TopLevel,D3164C7FBCF2565DDF915B1B3AEFB1DC,1580975391,4DC65AC7BD963DE1F7617C047C33DE99,52366425,2383,True,1230139136,00006047187D0D18598EF12A650E1DAC,22,50,False,1340673962,False,,,,
2,101\t56898\t137\t11255\t22037\t10263\t168\t111...,DB32BD91C2F1B37BE700F374A07FBC61,3701848B96AA740528A2B0E247777D7D,,2423BA02A75DB2189335DDC3FB6B74A1,6D323BE93766E79BE423FAC5C28BE39B,Retweet,22C448FF81263D4BAF2A176145EE9EAD,1581257232,5C671539CB41B9807E209349B101E9FF,988,167,False,1530094483,0000648BAA193AE4C625DDF789B57172,251,719,False,1456473671,False,,,,
3,101\t13073\t28757\t106\t100\t14120\t131\t120\t...,,18176C6AD2871729384062F073CCE94D,Video,,,TopLevel,D3164C7FBCF2565DDF915B1B3AEFB1DC,1581164292,70B900BE17416923D1E236A38798F202,1228134,5413,False,1378699943,000071667F50BAFEA722A8E8284581E5,18,58,False,1378427564,False,,,,1581305000.0
4,101\t3460\t1923\t6632\t2824\t30368\t2179\t1881...,,AF11AF01F842E7F120667B7B0B38676D,,,,Quote,22C448FF81263D4BAF2A176145EE9EAD,1581233650,E94C0E9E8494F3D603F9D1A5C5242E3D,73,299,False,1549054499,00007745A6EE969F1A0F44B10DC17671,268,526,False,1252294800,False,,,,


## Describe data 

**Task: Describe data**\
Examine the “gross” or “surface” properties of the acquired data and
report on the results.

**Output: Data description report**\
Describe the data which has been acquired, including: the format of
the data, the quantity of data, for example number of records and fields
in each table, the identities of the fields and any other surface features
of the data which have been discovered. Does the data acquired satisfy
the relevant requirements?

| Feature category    | Feature name                 | Feature dtype | Feature description                                                                           |
|---------------------|------------------------------|---------------|-----------------------------------------------------------------------------------------------|
| User features       | userId                       | string        | User identifier                                                                               |
| User features       | follower count               | int           | Number of followers of the user                                                               |
| User features       | following count              | int           | Number of accounts this user is following                                                     |
| User features       | is verified                  | bool          | Is the account verified?                                                                      |
| User features       | account creation             | int           | Unix timestamp (in seconds) of the creation time of the account                               |
| Tweet features      | tweetId                      | string        | Tweet identifier                                                                              |
| Tweet features      | presentMedia                 | list[string]  | Tab-separated list of media types;  media type can be in (Photo, Video, Gif)                  |
| Tweet features      | presentLinks                 | list[string]  | Tab-separated list of links included in the tweet                                             |
| Tweet features      | presentDomains               | list[string]  | Tab-separated list of domains (e.g. twitter.com) included in the tweet                        |
| Tweet features      | tweetType                    | string        | Tweet type, can be either Retweet, Quote, Reply, or TopLevel                                  |
| Tweet features      | language                     | string        | Identifier corresponding to inferred language of the tweet                                    |
| Tweet features      | tweet timestamp              | int           | Unix timestamp, in seconds of the creation time of the Tweet                                  |
| Tweet features      | tweet tokens                 | list[int]     | Ordered list of BERT ids corresponding to BERT tokenization of Tweet text                     |
| Tweet features      | tweet hashtags               | list[string]  | Tab-separated list of hashtags present in the tweet                                           |
| Engagement features | reply engagement timestamp   | int           | Unix timestamp, in seconds, of the Reply engagement if one exists.                            |
| Engagement features | retweet engagement timestamp | int           | Unix timestamp, in seconds, of the Retweet engagement if one exists.                          |
| Engagement features | quote engagement timestamp   | int           | Unix timestamp, in seconds, of the Quote engagement if one exists.                            |
| Engagement features | like engagement timestamp    | int           | Unix timestamp, in seconds, of the Like engagement if one exists.                             |
| Engagement features | engageeFollowsEngager        | bool          | Does the account of the engaged tweet author follow the account that has made the engagement? |

## Explore data 


**Task: Explore data**\
This task tackles the data mining questions, which can be addressed
using querying, visualization and reporting. These include: distribution
of key attributes, for example the target attribute of a prediction task;
relations between pairs or small numbers of attributes; results of
simple aggregations; properties of significant sub-populations; simple
statistical analyses. These analyses may address directly the data mining goals; they may also contribute to or refine the data description
and quality reports and feed into the transformation and other data
preparation needed for further analysis.


**Output: Data exploration report**\
Describe results of this task, including first findings or initial hypotheses and their impact on the remainder of the project. If appropriate,
include graphs and plots, which indicate data characteristics or lead
to interesting data subsets for further examination.

## Verify data quality

**Task: Verify data quality**\
Examine the quality of the data, addressing questions such as: is the
data complete (does it cover all the cases required)? Is it correct or
does it contain errors and if there are errors how common are they?
Are there missing values in the data? If so how are they represented,
where do they occur and how common are they?


**Output: Data quality report**\
List the results of the data quality verification; if quality problems
exist, list possible solutions. Solutions to data quality problems
generally depend heavily on both data and business knowledge.
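
As one possible starting point for the questions above, the share of missing values per column can be computed directly. This is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration); in the notebook the same call would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics two columns of the dataset.
toy = pd.DataFrame({
    "hashtags": ["A1", np.nan, np.nan, "B2\tC3"],
    "like_timestamp": [np.nan, 1581305000.0, np.nan, np.nan],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Missing values appear as NaN after `pd.read_csv`, which is why `isna()` is enough here; columns with a high share, such as the engagement timestamps, are expected to be mostly empty.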

In [6]:
display(df)

Unnamed: 0,text_tokens,hashtags,tweet_id,present_media,present_links,present_domains,tweet_type,language,tweet_timestamp,engaged_with_user_id,engaged_with_user_follower_count,engaged_with_user_following_count,engaged_with_user_is_verified,engaged_with_user_account_creation,engaging_user_id,engaging_user_follower_count,engaging_user_following_count,engaging_user_is_verified,engaging_user_account_creation,engaged_follows_engaging,reply_timestamp,retweet_timestamp,retweet_with_comment_timestamp,like_timestamp
0,101\t56898\t137\t174\t63247\t10526\t131\t3197\...,,3C21DCFB8E3FEC1CB3D2BFB413A78220,Video,,,Retweet,76B8A9C3013AE6414A3E6012413CDC3B,1581467323,D1AA2C85FA644D64346EDD88470525F2,737,706,False,1403069820,000046C8606F1C3F5A7296222C88084B,131,2105,False,1573978269,False,,,,
1,101\t102463\t10230\t10105\t21040\t10169\t12811...,,3D87CC3655C276F1771752081423B405,,BB422AA00380E45F312FD2CAA75F4960,92D397F8E0F1E77B36B8C612C2C51E23,TopLevel,D3164C7FBCF2565DDF915B1B3AEFB1DC,1580975391,4DC65AC7BD963DE1F7617C047C33DE99,52366425,2383,True,1230139136,00006047187D0D18598EF12A650E1DAC,22,50,False,1340673962,False,,,,
2,101\t56898\t137\t11255\t22037\t10263\t168\t111...,DB32BD91C2F1B37BE700F374A07FBC61,3701848B96AA740528A2B0E247777D7D,,2423BA02A75DB2189335DDC3FB6B74A1,6D323BE93766E79BE423FAC5C28BE39B,Retweet,22C448FF81263D4BAF2A176145EE9EAD,1581257232,5C671539CB41B9807E209349B101E9FF,988,167,False,1530094483,0000648BAA193AE4C625DDF789B57172,251,719,False,1456473671,False,,,,
3,101\t13073\t28757\t106\t100\t14120\t131\t120\t...,,18176C6AD2871729384062F073CCE94D,Video,,,TopLevel,D3164C7FBCF2565DDF915B1B3AEFB1DC,1581164292,70B900BE17416923D1E236A38798F202,1228134,5413,False,1378699943,000071667F50BAFEA722A8E8284581E5,18,58,False,1378427564,False,,,,1.581305e+09
4,101\t3460\t1923\t6632\t2824\t30368\t2179\t1881...,,AF11AF01F842E7F120667B7B0B38676D,,,,Quote,22C448FF81263D4BAF2A176145EE9EAD,1581233650,E94C0E9E8494F3D603F9D1A5C5242E3D,73,299,False,1549054499,00007745A6EE969F1A0F44B10DC17671,268,526,False,1252294800,False,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80420,101\t56898\t137\t14796\t13711\t17617\t10161\t1...,FC7321735734C2FC8A3CAE30D266CD71,533F80610C8C2F4345517986B5BB58E5,,,,Retweet,D3164C7FBCF2565DDF915B1B3AEFB1DC,1581212842,0BCA6643D664442CA7901690F5843C1A,1432450,1869,True,1220061361,06209A39A94A7AF33B253C1EFA2D52E5,32,443,False,1393279525,False,,,,
80421,101\t56898\t137\t11885\t11273\t40154\t10206\t1...,0FE0A5F06FA20E3C2CDE7F65ACA0046C,DAC3216BB2DC4747BB2CCBA6D253A308,,,,Retweet,06D61DCBBE938971E1EA0C38BD9B5446,1581533061,F44A5E2FD8B6A2ACF0A1B97D57ED3C92,29193,1434,False,1251330211,0620C4B9A7E8153DFD1ECEE5FE257F9C,54,755,False,1275826974,False,,,,
80422,101\t22800\t10531\t10124\t28780\t104939\t10230...,,5F9EFD38F96180EAB6BAA74481C0E6FE,,,,Quote,D3164C7FBCF2565DDF915B1B3AEFB1DC,1581119848,95CD94FE6760E0A5C8A183D821B8460A,96,242,False,1559761254,062154E2ED505B1DA7A9883921E42838,31,65,False,1543720066,True,,,,1.581122e+09
80423,101\t18249\t112\t187\t169\t16745\t26133\t117\t...,,8CB10325EAAD5E121E686EF222B8598C,Photo,,,TopLevel,D3164C7FBCF2565DDF915B1B3AEFB1DC,1581052325,9DF1155503CCA735A24A0B61E0445EF7,3134,5000,False,1452619872,0621E5EBF6FB229F57303B1FB6CF7B3A,7,100,False,1395188402,False,,,,


Q: Why do some users have exactly the same timestamp for two types of engagement (e.g. row_id 80424)?\
A: 

# Data Preparation

## Select data

**Task: Select data**\
Decide on the data to be used for analysis. Criteria include relevance
to the data mining goals, quality and technical constraints such as
limits on data volume or data types. Note that data selection covers
selection of attributes (columns) as well as selection of records (rows)
in a table.

**Output: Rationale for inclusion/exclusion**\
List the data to be included/excluded and the reasons for these decisions.

## Clean data

**Task: Clean data**\
Raise the data quality to the level required by the selected analysis
techniques. This may involve selection of clean subsets of the data, the
insertion of suitable defaults or more ambitious techniques such as the
estimation of missing data by modeling.


**Output: Data cleaning report**\
Describe what decisions and actions were taken to address the data
quality problems reported during the verify data quality task of the
data understanding phase. Transformations of the data for cleaning
purposes and the possible impact on the analysis results should be
considered.

## Construct data

**Task: Construct data**\
This task includes constructive data preparation operations such as the
production of derived attributes, entire new records or transformed
values for existing attributes.


**Outputs: Derived attributes**\
Derived attributes are new attributes that are constructed from one or more
existing attributes in the same record. Examples: area = length * width.

**Generated records**\
Describe the creation of completely new records. Example: create
records for customers who made no purchase during the past year.
There was no reason to have such records in the raw data, but for
modeling purposes it might make sense to explicitly represent the fact
that certain customers made zero purchases.

### Enhance dataset with derived columns

## Independent variables
The following new variables were defined in the dataframe:

| Name | Type | Description |
| --- | --- | --- |
| has_media | bool | Whether the tweet has any media attached (helper column for the extraction below; should be dropped later) |
| no_Photo | Integer | Number of photos in the tweet |
| no_Video | Integer | Number of videos in the tweet |
| no_GIF | Integer | Number of GIFs in the tweet |
| no_Other_Media | Integer | Number of media items of an unknown type (possibly never occurs) |
| no_Hashtags | Integer | Number of hashtags in the tweet |
| no_Domains | Integer | Number of domains in the tweet |
| no_Links | Integer | Number of links in the tweet |
| no_Words_In_Tweet | Integer | Number of words in the tweet |

## Labels
| Name | Type | Description |
| --- | --- | --- |
| is_like | bool | Whether the engaging user liked the tweet (True if a like timestamp exists) |
| is_reply | bool | Whether the engaging user replied to the tweet (True if a reply timestamp exists) |
| is_retweet | bool | Whether the engaging user retweeted the tweet (True if a retweet timestamp exists) |
| is_retweet_with_comment | bool | Whether the engaging user retweeted the tweet with a comment (True if a retweet-with-comment timestamp exists) |

Does the tweet have any media?

In [7]:
df['has_media'] = df['present_media'].notna()

In [8]:
df[['has_media','present_media']]

Unnamed: 0,has_media,present_media
0,True,Video
1,False,
2,False,
3,True,Video
4,False,
...,...,...
80420,False,
80421,False,
80422,False,
80423,True,Photo


## Count the different media types for every tweet

In [9]:
photoCnt = []
videoCnt = []
gifCnt = []
other = []
for d,b in zip(df['present_media'],  df['has_media']):
    if b is True:
        pht = d.count("Photo")
        video = d.count("Video")
        gif  = d.count("GIF")
        photoCnt.append(pht)
        videoCnt.append(video)
        gifCnt.append(gif)        
        if len(d) != int(pht + video + gif):
             other.append(len(d) - int(pht + video + gif))
        else:
             other.append(0)
    else:
        photoCnt.append(0)
        videoCnt.append(0)
        gifCnt.append(0)
        other.append(0)                       

In [10]:
df['no_Photo'] = photoCnt
df['no_Video'] = videoCnt
df['no_GIF'] = gifCnt
df['no_Other_Media'] = other

In [11]:
df[['present_media','no_Photo', 'no_Video', 'no_GIF', 'no_Other_Media']].head(5)

Unnamed: 0,present_media,no_Photo,no_Video,no_GIF,no_Other_Media
0,Video,0,1,0,4
1,,0,0,0,0
2,,0,0,0,0
3,Video,0,1,0,4
4,,0,0,0,0
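
The value 4 in `no_Other_Media` for a tweet whose only medium is `Video` comes from `len(d)`, which measures the length of the string (5 characters) rather than the number of media entries. A minimal sketch that counts tab-separated entries instead, assuming media types are separated by tabs as stated in the data description (`count_media`, `KNOWN_TYPES` and the toy frame are hypothetical names introduced here):

```python
import pandas as pd

KNOWN_TYPES = ("Photo", "Video", "GIF")

def count_media(value):
    """Return per-type counts for one `present_media` cell."""
    # Missing cells (NaN/None) carry no media at all.
    items = value.split("\t") if isinstance(value, str) else []
    counts = {f"no_{t}": items.count(t) for t in KNOWN_TYPES}
    # Anything that is not a known type is counted as "other".
    counts["no_Other_Media"] = len(items) - sum(counts.values())
    return counts

# Hypothetical cells: one video, no media, three items, one unknown type.
toy = pd.DataFrame({"present_media": ["Video", None, "Photo\tPhoto\tGIF", "Audio"]})
media_counts = pd.DataFrame([count_media(v) for v in toy["present_media"]])
print(media_counts)
```

With this variant a single `Video` yields `no_Other_Media == 0`, and only entries outside the known types are reported as other media.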


## Count the number of hashtags in each tweet

In [12]:
df['no_Hashtags'] = df.apply(lambda x : len(str(x['hashtags']).split()) if str(x['hashtags']) != "nan" else 0,axis=1)

In [13]:
df[['hashtags','no_Hashtags']].head(5)

Unnamed: 0,hashtags,no_Hashtags
0,,0
1,,0
2,DB32BD91C2F1B37BE700F374A07FBC61,1
3,,0
4,,0


In [14]:
df[df['no_Hashtags'] != 0][['hashtags', 'no_Hashtags']].head(20)

Unnamed: 0,hashtags,no_Hashtags
2,DB32BD91C2F1B37BE700F374A07FBC61,1
10,9887C2F9C8FFECE3524054D91E871F84,1
12,94838D747A18E1270C051B6DDFCAAE0D,1
16,0EE6C006799F0AFD5655BDC4418B9A07,1
17,CA7AF9A82452F923905E4F8A485DED43\t9E53645466A2...,6
19,DC3CA8E5056079E402ED35A7B1FDEC23,1
20,E2774F85FC328CCF3515AF7B12EF18F0,1
21,9A65AD335F99CB560602444D4F9111C8,1
22,1BBE947EB038868E02C649D57688A100\t7C2090819424...,2
31,CD33508ADA28A80D3A0DD6728CD7A537\t6F5D51E47F9A...,2


## Count the number of links in each tweet

In [15]:
df['no_Links'] = df.apply(lambda x : len(str(x['present_links']).split()) if str(x['present_links']) != "nan" else 0,axis=1)

In [16]:
df[df['no_Links'] != 0][['present_links', 'no_Links']].head(20)

Unnamed: 0,present_links,no_Links
1,BB422AA00380E45F312FD2CAA75F4960,1
2,2423BA02A75DB2189335DDC3FB6B74A1,1
5,BB79CD318A68247B64F0E0BE7AFD5A92,1
8,EF5D0A312E7A9BAEFBEA72A062E5F6CE,1
12,F28E60037FB10D95E6CF3ADD83D67EA4,1
19,AABF8A38DBBEE6D8B6FA3650262AC2D4,1
31,F30118EB13346C80D754EAC2067092D5,1
36,60DE00805B565526B32FF78F6EF0B9F7,1
37,6EA7E07DDEDDBB1AAF7C7D9BFC232CAA,1
43,9B1A416C5A7E6FED4A461F9F2108EB83,1


In [17]:
df[['present_links', 'no_Links']].head(5)

Unnamed: 0,present_links,no_Links
0,,0
1,BB422AA00380E45F312FD2CAA75F4960,1
2,2423BA02A75DB2189335DDC3FB6B74A1,1
3,,0
4,,0


## Count the number of domains in each tweet

In [18]:
df['no_Domains'] = df.apply(lambda x : len(str(x['present_domains']).split()) if str(x['present_domains']) != "nan" else 0,axis=1)

In [19]:
df[df['no_Domains'] != 0][['present_domains', 'no_Domains']].head(20)

Unnamed: 0,present_domains,no_Domains
1,92D397F8E0F1E77B36B8C612C2C51E23,1
2,6D323BE93766E79BE423FAC5C28BE39B,1
5,3896E26D12C903F0A00B6B1BE9A9BEA3,1
8,3183ACF54B4022B25B4157B81C174DD5,1
12,0F510BF067278AAD1ECAECC3380B0162,1
19,3896E26D12C903F0A00B6B1BE9A9BEA3,1
31,178B23A1119D4345DA847C10943997F9,1
36,8F8CD97E2BFE3675BD1333C33D8F662B,1
37,AA1AD16575EB76EB0CD1D2D951DC1F88,1
43,7A52A6DD8470FC7770FDF9610909B635,1


In [20]:
df[['present_domains', 'no_Domains']].head(5)

Unnamed: 0,present_domains,no_Domains
0,,0
1,92D397F8E0F1E77B36B8C612C2C51E23,1
2,6D323BE93766E79BE423FAC5C28BE39B,1
3,,0
4,,0


## Count the number of words in the tweet

It is also worth mentioning that the token lists contain separator tokens with IDs 101, 131 and 102, and that every tweet contains at least two separators.
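
If the separators should not count as words, they can be filtered out before counting. A minimal sketch, assuming that IDs 101 and 102 are the `[CLS]` and `[SEP]` markers of the standard BERT vocabulary (an assumption that has not been checked against the challenge tokenizer); ID 131 is left in because its role is not established here, and `count_content_tokens` is a hypothetical helper:

```python
SPECIAL_IDS = {"101", "102"}  # assumed [CLS] and [SEP] ids

def count_content_tokens(text_tokens):
    """Count tab-separated token ids, skipping the assumed special tokens."""
    return sum(1 for tok in text_tokens.split("\t") if tok not in SPECIAL_IDS)

# Hypothetical token string: five ids, two of them assumed special.
n_tokens = count_content_tokens("101\t56898\t137\t131\t102")
print(n_tokens)
```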

In [21]:
df['no_Words_In_Tweet'] = df.apply(lambda x : len(str(x['text_tokens']).split()) if str(x['text_tokens']) != "nan" else 0,axis=1)

In [22]:
df[["text_tokens",'no_Words_In_Tweet']].head(20)

Unnamed: 0,text_tokens,no_Words_In_Tweet
0,101\t56898\t137\t174\t63247\t10526\t131\t3197\...,31
1,101\t102463\t10230\t10105\t21040\t10169\t12811...,34
2,101\t56898\t137\t11255\t22037\t10263\t168\t111...,89
3,101\t13073\t28757\t106\t100\t14120\t131\t120\t...,22
4,101\t3460\t1923\t6632\t2824\t30368\t2179\t1881...,31
5,101\t46242\t40751\t161\t100062\t10107\t10114\t...,45
6,101\t56898\t137\t10192\t11373\t10500\t131\t219...,13
7,101\t65724\t15619\t22859\t14120\t131\t120\t120...,20
8,101\t100\t45031\t43804\t10121\t10146\t89387\t1...,59
9,101\t56898\t137\t26037\t91678\t168\t15734\t737...,47


Average number of words

In [23]:
np.mean(df['no_Words_In_Tweet'])

48.516990985390116

Minimum number of words

In [24]:
np.min(df['no_Words_In_Tweet'])

2

Max number of words

In [25]:
np.max(df['no_Words_In_Tweet'])

255

## Encode which interactions the engaging user had with the tweet
Decided by the timestamp: if a timestamp is present, the interaction happened (True); otherwise it did not (False).
- is_reply
- is_retweet
- is_retweet_with_comment
- is_like

In [26]:
df['is_reply'] = df['reply_timestamp'].notna()
df['is_like'] = df['like_timestamp'].notna()
df['is_retweet_with_comment'] = df['retweet_with_comment_timestamp'].notna()
df['is_retweet'] = df['retweet_timestamp'].notna()

In [27]:
df[['reply_timestamp','is_reply', 'like_timestamp', 'is_like', 'retweet_with_comment_timestamp', 'is_retweet_with_comment','retweet_timestamp' ,'is_retweet']].head(20)

Unnamed: 0,reply_timestamp,is_reply,like_timestamp,is_like,retweet_with_comment_timestamp,is_retweet_with_comment,retweet_timestamp,is_retweet
0,,False,,False,,False,,False
1,,False,,False,,False,,False
2,,False,,False,,False,,False
3,,False,1581305000.0,True,,False,,False
4,,False,,False,,False,,False
5,,False,,False,,False,,False
6,,False,,False,,False,,False
7,,False,1581548000.0,True,,False,,False
8,,False,,False,,False,,False
9,,False,1581548000.0,True,,False,,False


In [28]:
display(df.head())

Unnamed: 0,text_tokens,hashtags,tweet_id,present_media,present_links,present_domains,tweet_type,language,tweet_timestamp,engaged_with_user_id,engaged_with_user_follower_count,engaged_with_user_following_count,engaged_with_user_is_verified,engaged_with_user_account_creation,engaging_user_id,engaging_user_follower_count,engaging_user_following_count,engaging_user_is_verified,engaging_user_account_creation,engaged_follows_engaging,reply_timestamp,retweet_timestamp,retweet_with_comment_timestamp,like_timestamp,has_media,no_Photo,no_Video,no_GIF,no_Other_Media,no_Hashtags,no_Links,no_Domains,no_Words_In_Tweet,is_reply,is_like,is_retweet_with_comment,is_retweet
0,101\t56898\t137\t174\t63247\t10526\t131\t3197\...,,3C21DCFB8E3FEC1CB3D2BFB413A78220,Video,,,Retweet,76B8A9C3013AE6414A3E6012413CDC3B,1581467323,D1AA2C85FA644D64346EDD88470525F2,737,706,False,1403069820,000046C8606F1C3F5A7296222C88084B,131,2105,False,1573978269,False,,,,,True,0,1,0,4,0,0,0,31,False,False,False,False
1,101\t102463\t10230\t10105\t21040\t10169\t12811...,,3D87CC3655C276F1771752081423B405,,BB422AA00380E45F312FD2CAA75F4960,92D397F8E0F1E77B36B8C612C2C51E23,TopLevel,D3164C7FBCF2565DDF915B1B3AEFB1DC,1580975391,4DC65AC7BD963DE1F7617C047C33DE99,52366425,2383,True,1230139136,00006047187D0D18598EF12A650E1DAC,22,50,False,1340673962,False,,,,,False,0,0,0,0,0,1,1,34,False,False,False,False
2,101\t56898\t137\t11255\t22037\t10263\t168\t111...,DB32BD91C2F1B37BE700F374A07FBC61,3701848B96AA740528A2B0E247777D7D,,2423BA02A75DB2189335DDC3FB6B74A1,6D323BE93766E79BE423FAC5C28BE39B,Retweet,22C448FF81263D4BAF2A176145EE9EAD,1581257232,5C671539CB41B9807E209349B101E9FF,988,167,False,1530094483,0000648BAA193AE4C625DDF789B57172,251,719,False,1456473671,False,,,,,False,0,0,0,0,1,1,1,89,False,False,False,False
3,101\t13073\t28757\t106\t100\t14120\t131\t120\t...,,18176C6AD2871729384062F073CCE94D,Video,,,TopLevel,D3164C7FBCF2565DDF915B1B3AEFB1DC,1581164292,70B900BE17416923D1E236A38798F202,1228134,5413,False,1378699943,000071667F50BAFEA722A8E8284581E5,18,58,False,1378427564,False,,,,1581305000.0,True,0,1,0,4,0,0,0,22,False,True,False,False
4,101\t3460\t1923\t6632\t2824\t30368\t2179\t1881...,,AF11AF01F842E7F120667B7B0B38676D,,,,Quote,22C448FF81263D4BAF2A176145EE9EAD,1581233650,E94C0E9E8494F3D603F9D1A5C5242E3D,73,299,False,1549054499,00007745A6EE969F1A0F44B10DC17671,268,526,False,1252294800,False,,,,,False,0,0,0,0,0,0,0,31,False,False,False,False


### T4. Extract the Social Network
* Twitter Social Network is directional (follower – following) 
* Parse the `engaged_follows_engaging` field: each example gives you an edge
* Create the adjacency matrix representation of the social graph
* An entry is 1 if an edge exists between two users, 0 otherwise
* How can you use this information?
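
The steps above can be sketched as follows. This is a minimal illustration on a hypothetical toy frame, assuming that `engaged_follows_engaging == True` means the engaged-with user (the tweet author) follows the engaging user, as stated in the data description. A dense NumPy array is used only because the toy graph is tiny; for the full dataset a sparse representation would probably be needed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows with the three relevant columns.
toy = pd.DataFrame({
    "engaged_with_user_id": ["A", "B", "A"],
    "engaging_user_id": ["B", "C", "C"],
    "engaged_follows_engaging": [True, False, True],
})

# Map every user id that appears in either role to a matrix index.
users = pd.unique(toy[["engaged_with_user_id", "engaging_user_id"]].values.ravel())
index = {user: i for i, user in enumerate(users)}

# adjacency[i, j] == 1 means user i follows user j (directed edge).
adjacency = np.zeros((len(users), len(users)), dtype=np.int8)
follows = toy[toy["engaged_follows_engaging"]]
for src, dst in zip(follows["engaged_with_user_id"], follows["engaging_user_id"]):
    adjacency[index[src], index[dst]] = 1

print(list(users))
print(adjacency)
```

Because the graph is directed, the matrix is generally not symmetric: here `A` follows `B` and `C`, while no edge in the opposite direction is recorded.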

In [29]:
# Adjacency matrix

## Format data

**Task: Format data**\
Formatting transformations refer to primarily syntactic modifications
made to the data that do not change its meaning, but might be required
by the modeling tool.


**Output: Reformatted data**\
Some tools have requirements on the order of the attributes, such as
the first field being a unique identifier for each record or the last field
being the outcome field the model is to predict.
It might be important to change the order of the records in the dataset.
Perhaps the modeling tool requires that the records be sorted according
to the value of the outcome attribute. A common situation is that the
records of the dataset are initially ordered in some way but the modeling
algorithm needs them to be in a fairly random order. For example, when
using neural networks it is generally best for the records to be presented
in a random order, although some tools handle this automatically without explicit user intervention.

In [30]:
# Parse attributes containing tab-separated lists into lists.
df['text_tokens'] = df['text_tokens'].str.split('\t')

def to_hex_list(x):
    output = str(x).split('\t')
#     output = [int(val, 16) for val in str(x).split('\t')] 
    return output

cols_to_process = ['hashtags', 'present_media', 'present_links', 'present_domains']

for col in cols_to_process:  
    df[col] = df[col].apply(lambda x: to_hex_list(x) if isinstance(x, str)  else x)


# Transform raw timestamps into human-readable timestamps.
cols_to_process = ['tweet_timestamp', 'engaging_user_account_creation', 'reply_timestamp', 'retweet_timestamp', 'retweet_with_comment_timestamp', 'like_timestamp']

for col in cols_to_process:  
    df[col] = df[col].apply(lambda x: pd.Timestamp(x, unit='s'))

# Modeling

## T1. Split into train, dev, test
* Sub-sample to create test, non-test datasets
* Optionally split non-test into train and dev
* e.g., to implement k-fold validation
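
The k-fold option mentioned above could look roughly like this. It is a minimal sketch using scikit-learn's `KFold` on a hypothetical array of row positions (`rows`); in the notebook the same splitter would be applied to the non-test portion of the data, and the choice of five folds is only an example.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical stand-in for the row positions of the non-test data.
rows = np.arange(10)

# Shuffle once with a fixed seed so the folds are reproducible.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
folds = [(train_idx, dev_idx) for train_idx, dev_idx in kfold.split(rows)]

for train_idx, dev_idx in folds:
    print(len(train_idx), len(dev_idx))
```

Each row lands in the dev part of exactly one fold, so every example is used for validation once and for training in the remaining folds.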

In [31]:
# Train/Test split

from sklearn.model_selection import train_test_split
X_nontest, X_test = train_test_split(df, test_size = 0.1, random_state = 42)
X_train, X_dev = train_test_split(X_nontest, test_size = 0.1, random_state = 42)
print(X_train.shape, X_dev.shape, X_test.shape)

(65143, 37) (7239, 37) (8043, 37)


## Naive Bayes 

## Multivariate Regression

## Neural network approach

### T5. Implement a Baseline
* Implement the neural network approach described in the challenge
paper: https://arxiv.org/abs/2004.13715

# Evaluation

## Evaluate as a binary classification task in two ways
* Area Under Precision-Recall Curve
  * generate precision-recall pairs for various probability thresholds
    * assumes anything above threshold is predicted as relevant
* Cross-Entropy Loss = Log-Loss (for binary classification)
  * measure how good the predicted probabilities are
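
To make the two metrics concrete, here is a small sketch on hypothetical labels and scores (the four values are made up for illustration). It uses the same scikit-learn functions that are imported at the top of the notebook.

```python
from sklearn.metrics import auc, log_loss, precision_recall_curve

# Hypothetical labels and predicted probabilities for four examples.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# One (precision, recall) pair per threshold; the area is taken over recall.
precision, recall, thresholds = precision_recall_curve(labels, scores)
pr_auc = auc(recall, precision)

# Log-loss penalizes confident but wrong probabilities most heavily.
loss = log_loss(labels, scores)
print(pr_auc, loss)
```

A higher area under the precision-recall curve is better, whereas a lower log-loss is better; a model that ranks every positive above every negative reaches an area of 1.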

### T2.1 Parse test to create the ground truth output file
engaging user id; tweet id; label
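
One way to produce such a file is sketched below. It assumes the column order given in the line above, a headerless comma-separated layout, and one file per engagement type (`is_like` is used as the example label); `toy_test` is a hypothetical stand-in for the test split.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the test split with one label column.
toy_test = pd.DataFrame({
    "engaging_user_id": ["U1", "U2"],
    "tweet_id": ["T1", "T2"],
    "is_like": [True, False],
})

# Keep the identifier columns and store the label as 1/0.
gt = toy_test[["engaging_user_id", "tweet_id"]].copy()
gt["label"] = toy_test["is_like"].astype(int)

# Written to a temporary directory here; the notebook would use "gt.csv".
gt_path = os.path.join(tempfile.mkdtemp(), "gt.csv")
gt.to_csv(gt_path, header=False, index=False)

with open(gt_path) as handle:
    gt_lines = handle.read().split()
print(gt_lines)
```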


In [32]:
# gt.csv

### T2.2: Implement the `read_predictions` function 
from https://recsys-twitter.com/code/snippets

In [33]:
from sklearn.metrics import precision_recall_curve, auc, log_loss

def compute_prauc(pred, gt):
    # Area under the precision-recall curve.
    prec, recall, thresh = precision_recall_curve(gt, pred)
    prauc = auc(recall, prec)
    return prauc

def calculate_ctr(gt):
    # Fraction of positive labels in the ground truth.
    positive = len([x for x in gt if x == 1])
    ctr = positive/float(len(gt))
    return ctr

def compute_rce(pred, gt):
    # Relative cross-entropy: improvement (in percent) over always predicting the CTR.
    cross_entropy = log_loss(gt, pred)
    data_ctr = calculate_ctr(gt)
    strawman_cross_entropy = log_loss(gt, [data_ctr for _ in range(len(gt))])
    return (1.0 - cross_entropy/strawman_cross_entropy)*100.0


ground_truth = read_predictions("gt.csv") # will return data in the form (tweet_id, user_id, label (1 or 0))
predictions = read_predictions("predictions.csv") # will return data in the form (tweet_id, user_id, prediction)

NameError: name 'read_predictions' is not defined
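
The `NameError` above occurs because `read_predictions` is never defined in the notebook. The following is a minimal sketch of such a reader, not the official snippet from the challenge site: it assumes a headerless, comma-separated file with three fields per row and the numeric value in the last field, and it returns the rows as tuples in file order. The demo file and its contents are hypothetical.

```python
import csv
import os
import tempfile

def read_predictions(path):
    """Read rows of (id, id, value) from a headerless CSV file."""
    rows = []
    with open(path, newline="") as handle:
        for first_id, second_id, value in csv.reader(handle):
            rows.append((first_id, second_id, float(value)))
    return rows

# Round trip on a hypothetical two-row file.
demo_path = os.path.join(tempfile.mkdtemp(), "predictions.csv")
with open(demo_path, "w", newline="") as handle:
    csv.writer(handle).writerows([("T1", "U1", 0.9), ("T2", "U2", 0.2)])

predictions = read_predictions(demo_path)
scores = [value for _, _, value in predictions]
print(scores)
```

The extracted `scores` list is the kind of input that `compute_prauc` and `compute_rce` expect, provided the ground-truth rows are read in the same order.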