This notebook computes summary statistics of the dataset and splits it as follows:
- It creates a **folder per engagement type:** 
    + `/data/recsys/like`
    + `/data/recsys/retweet`
    + `/data/recsys/reply` 
    + `/data/recsys/quote`

## Relevant info (and its computation)

+ [x] Total data: **786669259 ~ 786.7M** rows 
+ [x] Rows with any interaction: **392589591 ~ 392.6M** rows = 49.91%
+ [x] Not-like interactions (RT, RP, Q): **95239085 ~ 95.2M** rows = 12.11%
+ [x] Like rows: **313167893 ~ 313.2M** rows = 39.81%
+ [x] Retweet rows: **68910112 ~ 68.9M** rows = 8.76%
+ [x] Reply rows: **23326328 ~ 23.3M** rows = 2.97%
+ [x] Quote rows: **5584801 ~ 5.6M** rows = 0.71%
+ [x] Timestamps from **1612396800 (4 February 2021 0:00:00)** to **1614211199 (24 February 2021 23:59:59)**
+ [x] Unique languages: **66** languages 
    + **36.13% English**: 284223607 ~ 284.2M
    + **17.16% Japanese**: 134991416 ~ 135.0M
    + **8.38% Spanish**: 65897167 ~ 65.9M
    + **8.01% Portuguese**: 63011919 ~ 63.0M
    + **6.43% unknown**: 50546976 ~ 50.5M
+ [x] Unique tweet ids: **361305728 ~ 361.3M** unique tweets
+ [x] Unique authors: **28470193 ~ 28.5M** unique authors
+ [x] Unique readers: **38180129 ~ 38.2M** unique readers
+ [x] Unique users: **48154911 ~ 48.2M** unique users
+ [ ] Authors but not readers:
+ [ ] Readers but not authors:
+ [x] Authors and readers: **18495411 ~ 18.5M users** that appear as authors and readers
+ [ ] Unique (auth, read)
+ [ ] Number of (auth, read - read, auth)
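
The two unchecked "Authors but not readers" and "Readers but not authors" items follow from the checked counts by inclusion-exclusion, so they need no extra pass over the data. A minimal sketch, using only the figures listed above (the variable names are illustrative):

```python
# Counts copied from the summary list above
unique_authors = 28_470_193
unique_readers = 38_180_129
authors_and_readers = 18_495_411

# Inclusion-exclusion on the author and reader sets
authors_only = unique_authors - authors_and_readers
readers_only = unique_readers - authors_and_readers
unique_users = unique_authors + unique_readers - authors_and_readers

print(authors_only)   # 9974782
print(readers_only)   # 19684718
print(unique_users)   # 48154911, matches the "Unique users" item
```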

In [2]:
# !conda install dask distributed -y

In [3]:
import dask

In [4]:
dask.__version__

'2021.04.0'

In [5]:
from dask.distributed import Client, LocalCluster
cluster = LocalCluster()
client = Client(cluster)
client

# OPEN DASK DASHBOARD:
# http://161.116.4.126/dask/status

Client: Scheduler tcp://127.0.0.1:36093, Dashboard http://127.0.0.1:8787/status
Cluster: Workers 6, Cores 24, Memory 125.60 GiB


In [6]:
import gc
import dask.dataframe as dd
import pandas as pd
import numpy as np

In [30]:
column_type = {
    'bert': str, 'hashtags': str, 'tweet_id': str, 'media': str, 'links': str, 'domains': str,
    'type': str, 'language': str, 'timestamp': np.uint32,
    'AUTH_user_id': str, 'AUTH_follower_count': np.uint32, 'AUTH_following_count': np.uint32,
    'AUTH_verified': bool, 'AUTH_account_creation': np.uint32,
    'READ_user_id': str, 'READ_follower_count': np.uint32, 'READ_following_count': np.uint32,
    'READ_verified': bool, 'READ_account_creation': np.uint32,
    'auth_follows_read': bool,
    # Nullable Int32 so that missing engagement timestamps can be read
    'reply_timestamp': 'Int32', 'retweet_timestamp': 'Int32',
    'quote_timestamp': 'Int32', 'like_timestamp': 'Int32',
}

In [13]:
# blocksize=None would make each CSV file a single partition, but it is extremely slow
dfd = dd.read_csv('/data/recsys/part-*', names=list(column_type.keys()), 
                  header=None, sep='\x01', dtype=column_type)
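
The raw part files have no header row and separate fields with the `\x01` control character, which is why `names`, `header=None` and `sep='\x01'` are passed above. As a rough illustration of that layout, the sketch below parses one made-up line with the standard `csv` module. The three-column subset and the sample values are hypothetical, not taken from the dataset.

```python
import csv
import io

# Hypothetical three-field record; real rows carry every column in column_type
columns = ['tweet_id', 'language', 'timestamp']
raw = '\x01'.join(['ABC123', '488B32D24BD4BB44172EB981C1BCA6FA', '1612396800']) + '\n'

# Same separator as the read_csv call, applied to an in-memory line
reader = csv.reader(io.StringIO(raw), delimiter='\x01')
record = dict(zip(columns, next(reader)))
record['timestamp'] = int(record['timestamp'])  # cast, as the dtype mapping does

print(record['language'], record['timestamp'])
```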

### Total data

In [8]:
dfd.shape[0].compute()

786669259

### Any interaction

In [5]:
dfd[(dfd.reply_timestamp>0) | (dfd.retweet_timestamp>0) | (dfd.quote_timestamp>0) | (dfd.like_timestamp>0)].shape[0].compute()

392589591

### Not-like interactions

In [21]:
dfd[(dfd.reply_timestamp>0) | (dfd.retweet_timestamp>0) | (dfd.quote_timestamp>0)].shape[0].compute()

95239085
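
The percentages in the summary at the top are these counts divided by the total row count. The sketch below reproduces that arithmetic from the cell outputs above; the dictionary layout is only for illustration.

```python
total_rows = 786_669_259
counts = {
    'any interaction': 392_589_591,
    'not-like (RT, RP, Q)': 95_239_085,
}

# Share of the whole dataset, in percent
shares = {name: 100 * n / total_rows for name, n in counts.items()}
for name, pct in shares.items():
    print(f'{name}: {pct:.2f}%')
```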

### Like, reply, retweet and quote rows

Create the output folders:

```console
mkdir /data/recsys/like
mkdir /data/recsys/reply
mkdir /data/recsys/retweet
mkdir /data/recsys/quote
```

In [None]:
# Split the data according to its engagement type
gc.collect()
dfd[dfd.like_timestamp>0].to_csv('/data/recsys/like/part-*.csv', name_function=lambda i: f"{i:05d}", index=False)
print('Like done')
gc.collect()
dfd[dfd.reply_timestamp>0].to_csv('/data/recsys/reply/part-*.csv', name_function=lambda i: f"{i:05d}", index=False)
print('Reply done')
gc.collect()
dfd[dfd.retweet_timestamp>0].to_csv('/data/recsys/retweet/part-*.csv', name_function=lambda i: f"{i:05d}", index=False)
print('Retweet done')
gc.collect()
dfd[dfd.quote_timestamp>0].to_csv('/data/recsys/quote/part-*.csv', name_function=lambda i: f"{i:05d}", index=False)
print('Quote done')
gc.collect()

In a terminal, these commands print the number of rows per engagement type:
```console
find /data/recsys/like -type f -exec wc -l {} \; | tqdm --total 6226 | awk '{total += $1} END{print total}'
find /data/recsys/reply -type f -exec wc -l {} \; | tqdm --total 6226 | awk '{total += $1} END{print total}'
find /data/recsys/retweet -type f -exec wc -l {} \; | tqdm --total 6226 | awk '{total += $1} END{print total}'
find /data/recsys/quote -type f -exec wc -l {} \; | tqdm --total 6226 | awk '{total += $1} END{print total}'
```
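
One caveat with `wc -l`: Dask's `to_csv` writes a header line into every part file by default, so the raw line total may overstate the row count by one per file. A hedged Python alternative is sketched below. `count_rows` is a hypothetical helper, and it assumes each part file starts with exactly one header line and that no field contains an embedded newline.

```python
from pathlib import Path

def count_rows(folder, pattern='part-*.csv'):
    """Count data rows under `folder`, skipping one header line per file."""
    total = 0
    for path in sorted(Path(folder).glob(pattern)):
        with path.open('rb') as fh:
            n_lines = sum(1 for _ in fh)
        total += max(n_lines - 1, 0)  # drop the header line
    return total
```

For example, `count_rows('/data/recsys/like')` would give the number of like rows under these assumptions.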

### Tweets per language

In [8]:
# Read only the language column to save RAM
dfd = dd.read_csv('/data/recsys/part-*', names=list(column_type.keys()), 
                  header=None, sep='\x01', usecols=['language'],
                  dtype=column_type)
dfd.head(5)

| | language |
|---|---|
| 0 | 488B32D24BD4BB44172EB981C1BCA6FA |
| 1 | B0FA488F2911701DD8EC5B1EA5E322D8 |
| 2 | B0FA488F2911701DD8EC5B1EA5E322D8 |
| 3 | B0FA488F2911701DD8EC5B1EA5E322D8 |
| 4 | E7F038DE3EAD397AEC9193686C911677 |


In [None]:
# Count languages and store to disk
gc.collect()
df_counts = dfd.groupby('language').size()
df_counts.name = 'language_counts'
df_counts.to_csv('/data/recsys/stats/language_counts.csv', single_file=True)

In [17]:
df_langs = pd.read_csv('/data/recsys/stats/language_counts.csv')
df_langs.head()

| | language | language_counts |
|---|---|---|
| 0 | 488B32D24BD4BB44172EB981C1BCA6FA | 284223607 |
| 1 | E7F038DE3EAD397AEC9193686C911677 | 134991416 |
| 2 | B0FA488F2911701DD8EC5B1EA5E322D8 | 65897167 |
| 3 | B8B04128918BBF54E2E178BFF1ABA833 | 63011919 |
| 4 | 313ECD3A1E5BB07406E4249475C2D6D6 | 50546976 |


In [28]:
# !pip install transformers
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

N = 5
# English, Japanese, Spanish, Portuguese, [Unknown?]
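for _, row in df_langs[:N].iterrows():
    # Take one tweet of this language and decode its tab-separated BERT token ids
    bert = dfd[dfd.language == row.language].head(1).bert.values[0]
    token_ids = [int(t) for t in bert.split('\t')]
    print(row.language, tokenizer.decode(token_ids))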

488B32D24BD4BB44172EB981C1BCA6FA [CLS] RT @ ErikSolheim : The greatest enemy of knowledge is not ignorance, it is the illusion of knowledge. ¶ ¶ [UNK] Stephen Hawking https : / / t. co / RyXH1xaS [UNK] [SEP]
E7F038DE3EAD397AEC9193686C911677 [CLS] 見 て ( [UNK] ｡ [UNK] ｡ [UNK] ) [UNK] ¶ 見 てみて 見 て 見 てみて 皆 見 て ¶ 全 員 見 て ¶ これ 私 [UNK] ( ๑ [UNK] ิ ) [UNK] ¶ ( [UNK] [UNK] ) [UNK]!! https : / / t. co / rmmlUJscX8 [SEP]
B0FA488F2911701DD8EC5B1EA5E322D8 [CLS] La cirugía se realizó el pasado 12 de agosto y se prolongó durante unas 23 horas, con la participación de un equipo de más de 140 especialistas, incluidos cirujanos, enfermeros y otro personal, según explicó el hospital en un comunicado https : / / t. co / zBHCZCo7MO [SEP]
B8B04128918BBF54E2E178BFF1ABA833 [CLS] O pintor chegou agr e falou [UNK] ah já é 11 : 15h né, nem adianta ficar agr, de tarde eu volto [UNK] shiahsishsu aí aí gente, dai - me paciência hoje viu, tô rindo pra não chorar hsishsishsu [SEP]
313ECD3A1E5BB07406E4249475C2D6D6 [CLS

### Min max timestamp

In [29]:
dd.compute(dfd.timestamp.min(), dfd.timestamp.max())

(1612396800, 1614211199)
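
These are Unix timestamps in seconds. A quick way to confirm the calendar dates quoted in the summary, assuming the values are interpreted as UTC:

```python
from datetime import datetime, timezone

t_min, t_max = 1612396800, 1614211199

start = datetime.fromtimestamp(t_min, tz=timezone.utc)
end = datetime.fromtimestamp(t_max, tz=timezone.utc)

print(start.isoformat())  # 2021-02-04T00:00:00+00:00
print(end.isoformat())    # 2021-02-24T23:59:59+00:00
print((t_max - t_min + 1) / 86400)  # 21.0 days covered
```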

### Unique AUTHORS

In [None]:
dfd = dd.read_csv('/data/recsys/sorted/part-*.csv', dtype=column_type)

# Author count
df_counts = dfd.groupby('AUTH_user_id').size()
df_counts.name = 'AUTH_user_id_counts'
df_counts.to_csv('/data/recsys/stats/AUTH_user_id_counts.csv', single_file=True)

### Unique READERS

In [None]:
# Reader count

dfd = dd.read_csv('/data/recsys/sorted/part-*.csv', dtype=column_type)

df_counts = dfd.groupby('READ_user_id').size()
df_counts.name = 'READ_user_id_counts'
df_counts.to_csv('/data/recsys/stats/READ_user_id_counts.csv', single_file=True)

### Unique tweets

In [None]:
# tweet id count

dfd = dd.read_csv('/data/recsys/sorted/part-*.csv', dtype=column_type)

df_counts = dfd.groupby('tweet_id').size()
df_counts.name = 'tweet_id_counts'
df_counts.to_csv('/data/recsys/stats/tweet_id_counts.csv', single_file=True)

# Sort by count (set_index sorts on the new index) and overwrite the counts file
(dd.read_csv('/data/recsys/stats/tweet_id_counts.csv')
   .set_index('tweet_id_counts')
   .reset_index()[['tweet_id', 'tweet_id_counts']]
   .to_csv('/data/recsys/stats/tweet_id_counts.csv', index=False, single_file=True))

### Authors who are also readers

In [None]:
# Inner join: users that appear both as authors and as readers
auth_counts = dd.read_csv('/data/recsys/stats/AUTH_user_id_counts.csv')
read_counts = dd.read_csv('/data/recsys/stats/READ_user_id_counts.csv')
both = dd.merge(auth_counts, read_counts, left_on='AUTH_user_id', right_on='READ_user_id')
both = both.drop(columns=['READ_user_id']).rename(columns={'AUTH_user_id': 'user_id'})
both.to_csv('/data/recsys/stats/BOTH_user_id_counts.csv', single_file=True, index=False)

### Total users

In [None]:
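# Outer join keeps users that appear only as authors or only as readers
total = dd.merge(dd.read_csv('AUTH_user_id_counts.csv'),
                 dd.read_csv('READ_user_id_counts.csv'),
                 how='outer', left_on='AUTH_user_id', right_on='READ_user_id')
# Reader-only rows have no AUTH_user_id, so take the id from READ_user_id
# (otherwise fillna(0) would replace their id with 0)
total['user_id'] = total['AUTH_user_id'].fillna(total['READ_user_id'])
total = total.drop(columns=['AUTH_user_id', 'READ_user_id']).fillna(0)
total.to_csv('/data/recsys/stats/TOTAL_user_id_counts.csv', single_file=True, index=False)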

### Bin stats

In [15]:
auth = dd.read_csv('/data/recsys/stats/AUTH_follower_count.csv').drop(columns='AUTH_following_count')
auth = dd.merge(auth, dd.read_csv('/data/recsys/stats/AUTH_user_id_counts.csv'), on='AUTH_user_id')
auth['bin'] = auth['AUTH_follower_count'].map_partitions(pd.cut, bins=[-np.inf, 240, 588, 1331, 3996, np.inf], labels=[0,1,2,3,4], meta=(None, 'i4'))

In [19]:
auth.groupby('bin').count().compute()

| bin | AUTH_user_id | AUTH_follower_count | AUTH_user_id_counts |
|---|---|---|---|
| 0 | 12434818 | 12434818 | 12434818 |
| 1 | 6967162 | 6967162 | 6967162 |
| 2 | 4574481 | 4574481 | 4574481 |
| 3 | 2867191 | 2867191 | 2867191 |
| 4 | 1626540 | 1626540 | 1626540 |


In [20]:
st = auth.groupby('bin').sum().compute()
st

| bin | AUTH_follower_count | AUTH_user_id_counts |
|---|---|---|
| 0 | 1323608956 | 78185341 |
| 1 | 2685498396 | 82969759 |
| 2 | 4033405241 | 90470509 |
| 3 | 6336041076 | 111788590 |
| 4 | 66740469706 | 423255060 |


In [28]:
# Fraction of all rows whose author falls in each follower bin
st.AUTH_user_id_counts / st.AUTH_user_id_counts.sum()

bin
0    0.099388
1    0.105470
2    0.115005
3    0.142104
4    0.538034
Name: AUTH_user_id_counts, dtype: float64
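
For reference, `pd.cut` with its default `right=True` builds right-closed intervals, so the edges used above give the bins (-inf, 240], (240, 588], (588, 1331], (1331, 3996] and (3996, inf). The sketch below mimics that labelling with the standard library's `bisect`. It is a stand-in for illustration, not what the notebook ran, and `follower_bin` is a hypothetical helper.

```python
from bisect import bisect_left

EDGES = [240, 588, 1331, 3996]  # inner bin edges used in the notebook

def follower_bin(follower_count):
    """Return the 0-4 label of the right-closed bin containing the count."""
    return bisect_left(EDGES, follower_count)

for n in (0, 240, 241, 1331, 3997):
    print(n, follower_bin(n))
```

A count equal to an edge, such as 240, stays in the lower bin, while 241 moves to the next one.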

### Unique counts for the last week

In [33]:
dfd = dd.read_csv('/data/recsys/sorted/part-*.csv', dtype=column_type, blocksize=None).partitions[504:]

# Author count
df_counts = dfd.groupby('AUTH_user_id').size()
df_counts.name = 'AUTH_user_id_counts'
df_counts.to_csv('/data/recsys/stats/AUTH_user_id_counts_last_week.csv', single_file=True)

['/data/recsys/stats/AUTH_user_id_counts_last_week.csv']

In [34]:
dfd = dd.read_csv('/data/recsys/sorted/part-*.csv', dtype=column_type, blocksize=None).partitions[504:]

# Last observed follower count per author
df_counts = dfd.groupby('AUTH_user_id')['AUTH_follower_count'].last()
df_counts.name = 'AUTH_follower_count'
df_counts.to_csv('/data/recsys/stats/AUTH_followers_last_week.csv', single_file=True)

['/data/recsys/stats/AUTH_followers_last_week.csv']

In [50]:
dfd = dd.read_csv('/data/recsys/sorted/part-*.csv', dtype=column_type, blocksize=None).partitions[504:]

# Reader count
df_counts = dfd.groupby('READ_user_id').size()
df_counts.name = 'READ_user_id_counts'
df_counts.to_csv('/data/recsys/stats/READ_user_id_counts_last_week.csv', single_file=True)

['/data/recsys/stats/READ_user_id_counts_last_week.csv']

In [51]:
dfd = dd.read_csv('/data/recsys/sorted/part-*.csv', dtype=column_type, blocksize=None).partitions[504:]

# Last observed follower count per reader
df_counts = dfd.groupby('READ_user_id')['READ_follower_count'].last()
df_counts.name = 'READ_follower_count'
df_counts.to_csv('/data/recsys/stats/READ_followers_last_week.csv', single_file=True)

['/data/recsys/stats/READ_followers_last_week.csv']

In [41]:
auth = dd.read_csv('/data/recsys/stats/AUTH_followers_last_week.csv')
auth = dd.merge(auth, dd.read_csv('/data/recsys/stats/AUTH_user_id_counts_last_week.csv'), on='AUTH_user_id')
auth['bin'] = auth['AUTH_follower_count'].map_partitions(pd.cut, bins=[-np.inf, 240, 588, 1331, 3996, np.inf], labels=[0,1,2,3,4], meta=(None, 'i4'))

In [42]:
auth.groupby('bin').count().compute()

| bin | AUTH_user_id | AUTH_follower_count | AUTH_user_id_counts |
|---|---|---|---|
| 0 | 716154 | 716154 | 716154 |
| 1 | 715580 | 715580 | 715580 |
| 2 | 714150 | 714150 | 714150 |
| 3 | 714975 | 714975 | 714975 |
| 4 | 715106 | 715106 | 715106 |


In [43]:
st = auth.groupby('bin').sum().compute()
st

| bin | AUTH_follower_count | AUTH_user_id_counts |
|---|---|---|
| 0 | 89974301 | 901944 |
| 1 | 283661378 | 1085691 |
| 2 | 649361720 | 1328294 |
| 3 | 1650975053 | 1901083 |
| 4 | 48980211925 | 9244748 |


In [44]:
# Fraction of last-week rows whose author falls in each follower bin
st.AUTH_user_id_counts / st.AUTH_user_id_counts.sum()

bin
0    0.062368
1    0.075073
2    0.091849
3    0.131456
4    0.639255
Name: AUTH_user_id_counts, dtype: float64

### New authors, new readers

In [45]:
len(dd.read_csv('/data/recsys/sorted/part-*.csv', dtype=column_type, blocksize=None).partitions[504:])

14461760

In [47]:
n = len(dd.merge(dd.read_csv('/data/recsys/stats/AUTH_follower_count.csv'), 
        dd.read_csv('/data/recsys/stats/AUTH_followers_last_week.csv'), 
        on='AUTH_user_id', how='inner'))
n

In [48]:
# Percentage of last-week authors that do not appear in AUTH_follower_count.csv
100 - 100*n/len(dd.read_csv('/data/recsys/stats/AUTH_followers_last_week.csv'))

5.738451019515011

In [52]:
n = len(dd.merge(dd.read_csv('/data/recsys/stats/READ_follower_count.csv'), 
        dd.read_csv('/data/recsys/stats/READ_followers_last_week.csv'), 
        on='READ_user_id', how='inner'))
n

6142340

In [53]:
# Percentage of last-week readers that do not appear in READ_follower_count.csv
100 - 100*n/len(dd.read_csv('/data/recsys/stats/READ_followers_last_week.csv'))

0.30416745116116317
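
The two percentages above are the share of last-week users whose id does not appear in the reference file, obtained through the size of an inner merge. The same quantity on a toy example with plain sets is sketched below. The ids are made up, and the equivalence with the merge assumes each id appears only once per file.

```python
# Hypothetical user ids; the notebook derives these from the CSV files
reference_ids = {'u1', 'u2', 'u3', 'u4'}
last_week_ids = {'u3', 'u4', 'u5'}

n = len(reference_ids & last_week_ids)  # size of the inner join on user id
pct_new = 100 - 100 * n / len(last_week_ids)

print(n, round(pct_new, 2))  # 2 33.33
```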