# Randomly selecting tweets for Hydration
July 20, 2022

This notebook first imports the no_retweets_set_filtered.tsv file, which contains all of the non-retweet data from the Panacea Lab, version 120, uploaded on June 26, 2022 (https://zenodo.org/record/6758164).

This notebook will randomize the tweets and create "batches" for hydration.

We will remove the following already-hydrated tweets:

1. 120,000 tweets that Anish M. already randomly selected ("data/test.txt")
2. 1,000,000 tweets that Anish M. randomly selected ("data/1_million_tweets.txt")
3. 10,000,000 tweets Meghan H. randomly selected ("data/10_million_tweets.txt")

This list of tweets will then be hydrated to support the PASC tweet topic modeling project.

In [1]:
import os

import pandas as pd
import numpy as np
import random
import matplotlib

**Import full Twitter dataset from Panacea**

In [2]:
tweets = pd.read_csv("../data/no_retweets_set_filtered.tsv", delimiter = '\t')
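
With roughly 190 million rows, parsing every column of the file can use a lot of memory. A minimal sketch, assuming only the `tweet_id` column is needed downstream, reads just that column with an explicit integer dtype. The sample data and its `date` and `lang` columns are made up for illustration; only `tweet_id` is known from this notebook.

```python
import io

import pandas as pd

# Hypothetical stand-in for no_retweets_set_filtered.tsv. The "date" and "lang"
# columns are assumptions for illustration; only "tweet_id" appears in this notebook.
sample_tsv = io.StringIO(
    "tweet_id\tdate\tlang\n"
    "1220749184489476096\t2020-01-24\ten\n"
    "1221488900348055552\t2020-01-26\ten\n"
)

# usecols limits parsing to the ID column; int64 keeps the 19-digit IDs exact.
tweet_ids_only = pd.read_csv(
    sample_tsv,
    sep="\t",
    usecols=["tweet_id"],
    dtype={"tweet_id": "int64"},
)
print(tweet_ids_only.shape)
```

Whether this is worthwhile depends on whether the other columns are used later; this notebook only ever touches `tweet_id`.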

In [3]:
len(tweets)

190561814

**Remove duplicate tweets**

In [4]:
tweets = tweets.drop_duplicates()
print(len(tweets))

190538594
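
Called without arguments, `drop_duplicates()` removes only rows that match in every column. If the same tweet ID could appear with different values in other columns (an assumption, not something this notebook confirms), deduplicating on `tweet_id` alone would be stricter. A small sketch on toy data shows the difference; the `lang` column is hypothetical.

```python
import pandas as pd

# Toy data: ID 1 is an exact duplicate row, ID 2 repeats with a different "lang".
toy = pd.DataFrame({
    "tweet_id": [1, 1, 2, 2, 3],
    "lang": ["en", "en", "en", "es", "en"],
})

full_row_dedup = toy.drop_duplicates()                  # compares all columns
id_only_dedup = toy.drop_duplicates(subset="tweet_id")  # compares the ID only

print(len(toy), len(full_row_dedup), len(id_only_dedup))
```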


**Import previously hydrated tweets**

In [5]:
first_set = pd.read_csv("../data/test.txt")
second_set = pd.read_csv("../data/1_million_tweets.txt")
third_set = pd.read_csv("../data/10_million_tweets.txt")

**Extract the tweet IDs from each set and combine them**

In [22]:
first_set_list = first_set['tweet_id'].reset_index(drop = True)
second_set_list = second_set['tweet_id'].reset_index(drop = True)
third_set_list = third_set['tweet_id'].reset_index(drop = True)

In [23]:
hydrated_tweet_sets = [first_set_list, second_set_list, third_set_list]
hydrated_tweet_lists = pd.concat(hydrated_tweet_sets)

In [24]:
print(len(hydrated_tweet_lists))

11120000
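
The combined length equals 120,000 + 1,000,000 + 10,000,000, which confirms that nothing was lost in the concatenation, but it does not show whether an ID appears in more than one set. A minimal sketch of such an overlap check, using toy Series in place of the three real sets:

```python
import pandas as pd

# Toy stand-ins for the three previously hydrated sets; ID 12 is in two of them.
first = pd.Series([10, 11, 12], name="tweet_id")
second = pd.Series([12, 13], name="tweet_id")
third = pd.Series([14, 15, 16], name="tweet_id")

# ignore_index=True gives the combined Series a clean 0..n-1 index.
combined = pd.concat([first, second, third], ignore_index=True)

n_total = len(combined)
n_unique = combined.nunique()
repeated_ids = combined[combined.duplicated()].tolist()

print(n_total, n_unique, repeated_ids)
```

If `n_total` and `n_unique` match on the real data, the three sets are disjoint.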


In [25]:
hydrated_tweet_lists

0          1220749184489476096
1          1221488900348055552
2          1221828612443267074
3          1221829107194814465
4          1221835146652876800
                  ...         
9999995    1496389288879177728
9999996    1291413793910775808
9999997    1234766456144793600
9999998    1476795986261393409
9999999    1268644385861894145
Name: tweet_id, Length: 11120000, dtype: int64

**Remove hydrated tweets from the larger dataset**

In [41]:
print(len(tweets))
tweets = tweets[~tweets.tweet_id.isin(hydrated_tweet_lists)]
print(len(tweets))

190538594
179418594
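
The row count drops by exactly 11,120,000, the length of the hydrated list, which is consistent with every hydrated ID being unique and present in the deduplicated dataset. The sketch below shows the same `isin` filter on toy data with that check made explicit; the variable names are illustrative only.

```python
import pandas as pd

all_tweets = pd.DataFrame({"tweet_id": [1, 2, 3, 4, 5, 6]})
# ID 99 is deliberately absent from the dataset to show what the check catches.
already_hydrated = pd.Series([2, 5, 99], name="tweet_id")

is_hydrated = all_tweets["tweet_id"].isin(already_hydrated)
remaining = all_tweets[~is_hydrated]

n_removed = int(is_hydrated.sum())
missing_from_dataset = sorted(set(already_hydrated) - set(all_tweets["tweet_id"]))

print(n_removed, len(remaining), missing_from_dataset)
```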


**Randomly shuffle tweets**

In [42]:
random.seed(720)
random_sample_tweets = random.sample(range(len(tweets['tweet_id'])), len(tweets['tweet_id']))
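
Sampling all `n` positions from `range(n)` without replacement produces a random permutation, but it builds a Python list of about 179 million integers. One alternative, sketched here and not the method used in this notebook, is NumPy's seeded `Generator.permutation`, which returns a compact integer array. It would not reproduce the same order as `random.seed(720)`.

```python
import numpy as np

n_rows = 10  # stand-in for len(tweets)

# A seeded Generator makes the permutation reproducible across runs.
rng = np.random.default_rng(720)
shuffled_positions = rng.permutation(n_rows)

print(shuffled_positions)
```

The resulting array can be passed to `.iloc[...]` in the same way as the list.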

In [43]:
random_sample_tweets

[41723394,
 3406980,
 51801891,
 118475004,
 123831248,
 ...]


In [44]:
len(random_sample_tweets)

179418594

**Retrieve the tweets from the index**

In [45]:
shuffled_tweets = tweets.iloc[random_sample_tweets]
print(len(shuffled_tweets))

179418594


In [46]:
shuffled_tweets[['tweet_id']].drop_duplicates()

Unnamed: 0,tweet_id
44308260,1263073429974126592
3618155,1233405731690422272
55010391,1271172600765583361
125814948,1344390385360113669
131503092,1353241437375229952
...,...
175839463,1463761788932141061
49446277,1266466856447152135
34737125,1257270480052109313
111630456,1327069251421745152
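
The index labels in the output above (for example 44308260) are the original row labels, which differ from the shuffled positions because rows were removed earlier. To confirm that a shuffle neither lost nor repeated any ID, a check along the following lines could be used. It is a sketch on toy data and uses `DataFrame.sample(frac=1)` as a compact shuffle, which is not the method used in this notebook.

```python
import pandas as pd

# Toy frame with gaps in the index, mimicking rows removed by earlier filtering.
toy = pd.DataFrame({"tweet_id": [101, 102, 103, 104, 105]}, index=[0, 2, 5, 7, 9])

# frac=1 returns every row in random order; random_state makes it repeatable.
shuffled = toy.sample(frac=1, random_state=720)

same_ids = sorted(shuffled["tweet_id"]) == sorted(toy["tweet_id"])
no_repeats = shuffled["tweet_id"].is_unique

print(same_ids, no_repeats)
```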


**Save dataset**

In [60]:
shuffled_tweet_ids = shuffled_tweets['tweet_id']

In [61]:
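# Split the shuffled IDs into 4 roughly equal batches, one file per batch.
# Note: to_csv also writes the original row index as the first column.
for batch_idx, batch in enumerate(np.array_split(shuffled_tweet_ids, 4)):
    batch.to_csv(f'../data/tweets_to_hydrate_{batch_idx}.txt')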
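
The loop above writes the row index next to each ID. If the hydration step expects a file containing only a `tweet_id` column (an assumption about the downstream tool, not something stated here), a variant like the following sketch could be used. It slices with `iloc` rather than `np.array_split` and writes to a temporary directory so that it is safe to run as-is.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

ids = pd.Series([11, 12, 13, 14, 15, 16, 17, 18, 19, 20], name="tweet_id")
n_batches = 4

# Boundaries of n_batches near-equal contiguous slices.
edges = np.linspace(0, len(ids), n_batches + 1).astype(int)

out_dir = Path(tempfile.mkdtemp())  # hypothetical output location
batch_sizes = []
for batch_idx in range(n_batches):
    batch = ids.iloc[edges[batch_idx]:edges[batch_idx + 1]]
    # index=False drops the row index; the "tweet_id" header line is kept.
    batch.to_csv(out_dir / f"tweets_to_hydrate_{batch_idx}.txt", index=False)
    batch_sizes.append(len(batch))

print(batch_sizes)
```

Keeping the header means the files can be read back with `pd.read_csv` and accessed by `['tweet_id']`, as the previously hydrated sets are in this notebook.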