# Twitter Data Sharing

[Download relevant files here](https://melaniewalsh.org/Twitter-Dating-Sharing.zip) or run `git pull` from the command line in the "Intro-Cultural-Analytics-Notebooks" directory.

<img src="https://cfcdnpull-creativefreedoml.netdna-ssl.com/wp-content/uploads/2017/06/Twitter-featured.png" width=100%>

In this lesson, we're going to learn how to share Twitter data and access Twitter data that has been shared by others with the Python/command line tool [twarc](https://github.com/DocNow/twarc). This tool was developed by a project called [Documenting the Now](https://www.docnow.io/). The DocNow team develops tools and ethical frameworks for social media research.

This lesson presumes that you've already installed and configured twarc (which was covered in [previous lessons](https://melaniewalsh.github.io/Intro-Cultural-Analytics/Collecting-Cultural-Data/Twitter-Data-Collection.html#Install-and-Configure-Twarc)).

# Tweet IDs

Twitter discourages developers and researchers from sharing full Twitter data openly on the web. They instead encourage developers and researchers to share *tweet IDs*:

> [If you provide Twitter Content to third parties, including downloadable datasets or via an API, you may only distribute **Tweet IDs**, Direct Message IDs, and/or User IDs.](https://developer.twitter.com/en/developer-terms/policy#4-e)

Tweet IDs are unique identifiers assigned to every tweet. They look like a random string of numbers — 1189206626135355397 — and they can be used to download the full data for that tweet (if the tweet still exists). This is a process called "hydration."

<img src="https://cdn.pixabay.com/photo/2013/07/12/19/24/sapling-154734_960_720.png" width=100%>

Hydration: a young tweet ID sprouts into a full tweet (to be read in David Attenborough's voice)

There are actually two reasons that you might want to dehydrate tweets and/or hydrate tweet IDs:

1) to responsibly share Twitter data with others and/or access Twitter data shared by others

2) to get more information about the Twitter data that *you yourself collected*

If you collected tweets in real time, for example, you collected those tweets immediately after they were published, which means that they will not contain any retweet or favorite count information. Nobody's had time to retweet them yet! So if you'd like to retroactively get retweet and favorite count information about your tweets, then you would want to dehydrate and rehydrate them.

## Dehydrate Tweets

`twarc dehydrate tweets.jsonl > tweet_ids.txt`

To transform your Twitter data into a list of tweet IDs (so that you can share your data openly on the web), you can run the twarc command `twarc dehydrate` with the name of your JSONL file followed by the output operator `>` and the desired name of your tweet ID text file.
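
To make the idea concrete, here is a minimal Python sketch of what dehydration amounts to: reading each tweet's JSON and keeping only its ID. This is not twarc's own code. The `dehydrate` function and the two made-up sample tweets are assumptions for illustration, and the sketch presumes each line of the JSONL file is one tweet with an `id_str` field, as in the tweet data shown in this lesson.

```python
import json

def dehydrate(jsonl_lines):
    """Return the tweet ID (as a string) for each line of tweet JSON."""
    tweet_ids = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        tweet = json.loads(line)
        # "id_str" is the string form of the ID; fall back to the numeric "id"
        tweet_ids.append(tweet.get("id_str") or str(tweet["id"]))
    return tweet_ids

# Two made-up tweets standing in for a real JSONL file
sample_lines = [
    '{"id": 1238092716870979584, "id_str": "1238092716870979584", "full_text": "first example"}',
    '{"id": 1238475327078313985, "id_str": "1238475327078313985", "full_text": "second example"}',
]

print("\n".join(dehydrate(sample_lines)))
```

With a real file, you could pass an open file instead of the sample list, for example `dehydrate(open("tweets.jsonl", encoding="utf-8"))`.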

> tweet ID —> tweet = hydration <br>
> tweet ID <— tweet = dehydration

Let's dehydrate the Twitter data that we collected a few weeks ago: a JSONL file of 685 tweets that mentioned the general phrase "touch my face" (most responding to public health recommendations that people should avoid touching their faces).

In [1]:
!twarc dehydrate touch_my_face_tweets.jsonl > touch_my_face_tweet_ids.txt

If we `open()` and `.read()` the tweet IDs file that we just created, it looks something like this:

In [10]:
tweet_ids = open("touch_my_face_tweet_ids.txt", encoding="utf-8").read()

In [11]:
print(tweet_ids)

1238475327078313985
1238467443732946945
1238463874539683840
1238455575706513411
1238453760222998530
1238448347955879937
1238445424140263424
1238443747257585664
1238441508312932353
1238436189557755906
1238426216962478080
1238423933516480513
1238416876990070784
1238414444259995649
1238407736141979649
1238399734840205313
1238387752405807106
1238382955283664896
1238366806491922433
1238361593638871040
1238359186787971075
1238353105722339328
1238349153903628290
1238325718620217345
1238322112865226752
1238312880744923137
1238309554733174787
1238305380616331266
1238299973189537795
1238295904601268226
1238295029099180032
1238293633369092096
1238288474555289601
1238275926699507712
1238270017881300993
1238268458158219269
1238266583589535744
1238258980591468544
1238248679519203332
1238245950570803202
1238244206080069633
1238244036709941249
1238243301054017536
1238242232609751040
1238238861236383745
1238237887323398144
1238234177922723840
1238232952657633280
1238226963799703560
1238223396220989440
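
If you want to work with the IDs in Python rather than just print them, one option is to split the text into a list. The sketch below uses a hypothetical helper, `parse_tweet_ids`, and three of the IDs shown above as sample text. It keeps the IDs as strings, which avoids any risk of large numbers being altered by other tools.

```python
def parse_tweet_ids(text):
    """Split the contents of a tweet ID file into a list of ID strings."""
    return [line.strip() for line in text.splitlines() if line.strip()]

# Three IDs from the file above, standing in for the full text
sample_text = "1238475327078313985\n1238467443732946945\n1238463874539683840\n"

ids = parse_tweet_ids(sample_text)
print(len(ids), "tweet IDs")
# Check that every ID is made up only of digits
print(all(tweet_id.isdigit() for tweet_id in ids))
```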


## Hydrate Tweets

`twarc hydrate tweet_ids.txt > tweets.jsonl`

To transform a list of tweet IDs into full Twitter data, you can run the twarc command `twarc hydrate` with the name of your tweet IDs text file followed by the output operator `>` and the desired name of your JSONL file.

> tweet ID —> tweet = hydration <br>
> tweet ID <— tweet = dehydration

Now let's re-hydrate the Twitter data that we collected a few weeks ago based on the tweet IDs that we just dehydrated.

In [20]:
!twarc hydrate touch_my_face_tweet_ids.txt > touch_my_face_REHYDRATED.jsonl

In [45]:
tweet_json = open("touch_my_face_REHYDRATED.jsonl", encoding="utf-8").read()

In [46]:
print(tweet_json)

{"created_at": "Thu Mar 12 13:21:21 +0000 2020", "id": 1238092716870979584, "id_str": "1238092716870979584", "full_text": "I\u2019ve never realized how many times a day I touch my face....#corona", "truncated": false, "display_text_range": [0, 67], "entities": {"hashtags": [{"text": "corona", "indices": [60, 67]}], "symbols": [], "user_mentions": [], "urls": []}, "source": "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>", "in_reply_to_status_id": null, "in_reply_to_status_id_str": null, "in_reply_to_user_id": null, "in_reply_to_user_id_str": null, "in_reply_to_screen_name": null, "user": {"id": 431153696, "id_str": "431153696", "name": "Joshua Burrage", "screen_name": "JoshuaBurrage3", "location": "New York City", "description": "\ud83d\udccdNYC | CESD Talent Agency | Bacon Enthusiast @BXTaleMusical @newsies @CatsBroadway Instagram: @joshuaburrage", "url": null, "entities": {"description": {"urls": []}}, "protected": false, "followers_count": 913
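
Raw JSON like this is hard to read, so it can help to pull out just a few fields from each tweet. The following is a rough sketch rather than a definitive approach: the `summarize_tweets` function and the made-up sample tweet are assumptions, and it presumes each tweet has the `id_str`, `user`, and `full_text` fields seen above, plus the `retweet_count` and `favorite_count` fields that hold the retweet and favorite counts.

```python
import json

def summarize_tweets(jsonl_lines):
    """Pull a handful of fields out of each line of tweet JSON."""
    rows = []
    for line in jsonl_lines:
        if not line.strip():  # skip blank lines
            continue
        tweet = json.loads(line)
        rows.append({
            "id": tweet["id_str"],
            "user": tweet["user"]["screen_name"],
            # Some tweet JSON uses "text" instead of "full_text"
            "text": tweet.get("full_text", tweet.get("text", "")),
            "retweets": tweet.get("retweet_count"),
            "favorites": tweet.get("favorite_count"),
        })
    return rows

# A made-up tweet standing in for one line of a rehydrated JSONL file
sample_tweet = {
    "id_str": "1",
    "full_text": "an example tweet",
    "user": {"screen_name": "example_user"},
    "retweet_count": 3,
    "favorite_count": 7,
}
sample_lines = [json.dumps(sample_tweet)]

for row in summarize_tweets(sample_lines):
    print(row)
```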

## Deleted Tweets & The Right To Be Forgotten

What happens if someone decides to delete their tweet between the time when the tweet is first collected and the time when the tweet is "hydrated"? The deleted tweet will **not** be hydrated. The deleted tweet is no longer accessible.

To see how many tweets might be gone from our dataset, let's look at how many tweets are included in our rehydrated tweet file vs our original tweet file.

 Mac/Chrome OS

In [21]:
!wc -l touch_my_face_REHYDRATED.jsonl

     675 touch_my_face_REHYDRATED.jsonl


In [22]:
!wc -l touch_my_face_tweets.jsonl

     685 touch_my_face_tweets.jsonl


<img src=https://upload.wikimedia.org/wikipedia/commons/thumb/3/34/Windows_logo_-_2012_derivative.svg/1024px-Windows_logo_-_2012_derivative.svg.png width=20 align='left'> Windows 

In [None]:
!find /v /c "" touch_my_face_REHYDRATED.jsonl

In [None]:
!find /v /c "" touch_my_face_tweets.jsonl

As you can see, our rehydrated tweet file is missing 10 tweets (685 - 675). Those tweets have either been deleted, or they belong to accounts that have since been made private or suspended.
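
Counting lines tells us how many tweets are missing but not which ones. A possible way to find them, on any operating system, is to compare the sets of tweet IDs in the two files. This is only a sketch: the functions `tweet_ids_in` and `missing_ids` and the tiny sample data are made up for illustration, and it presumes every line is tweet JSON with an `id_str` field.

```python
import json

def tweet_ids_in(jsonl_lines):
    """Collect the set of tweet IDs found in lines of tweet JSON."""
    return {json.loads(line)["id_str"] for line in jsonl_lines if line.strip()}

def missing_ids(original_lines, rehydrated_lines):
    """Return the IDs that are in the original data but not the rehydrated data."""
    return sorted(tweet_ids_in(original_lines) - tweet_ids_in(rehydrated_lines))

# Made-up data: three tweets collected originally, two still available
original = ['{"id_str": "101"}', '{"id_str": "102"}', '{"id_str": "103"}']
rehydrated = ['{"id_str": "101"}', '{"id_str": "103"}']

print(missing_ids(original, rehydrated))
```

With the real files, you could pass open files in place of the sample lists, for example `open("touch_my_face_tweets.jsonl", encoding="utf-8")` and `open("touch_my_face_REHYDRATED.jsonl", encoding="utf-8")`.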

## Separate Out Deleted Tweets (From Tweet IDs)

`python twarc/utils/tweet_compliance.py tweet_ids.txt > hydrated_tweets.json 2> deleted_tweet_ids.txt`

If you're working from someone else's tweet IDs, you can hydrate these tweet IDs and filter out the tweet IDs that have been deleted/made private/suspended by using the twarc utility `twarc/utils/tweet_compliance.py`, followed by the name of your tweet IDs file, the output operator `>`, a file name for your hydrated tweets, the operator `2>` (which redirects the script's error output rather than its regular output), and a file name for the deleted tweet IDs.

In [25]:
!python twarc/utils/tweet_compliance.py touch_my_face_tweet_ids.txt > hydrated_tweets.json 2> touch_my_face_deleted_tweets.txt

In [47]:
!wc -l touch_my_face_deleted_tweets.txt

      10 touch_my_face_deleted_tweets.txt
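
The `>` and `2>` operators can split the results into two files because a program has two separate output streams: standard output and standard error. The sketch below illustrates the idea with a tiny stand-in script (hypothetical, and not twarc itself) that prints one ID to each stream, which Python then captures separately.

```python
import subprocess
import sys

# A stand-in child script (not twarc): one "available" ID goes to
# standard output and one "deleted" ID goes to standard error
child_code = (
    "import sys\n"
    "print('1238475327078313985')\n"
    "print('1238426216962478080', file=sys.stderr)\n"
)

result = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True,  # capture the two streams separately
    text=True,
)

found = result.stdout.split()    # what `>` would send to a file
missing = result.stderr.split()  # what `2>` would send to a file
print("found:", found)
print("missing:", missing)
```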


## Find Current Status of Tweets (From Tweet JSONL File)

`python twarc/utils/deletes.py tweets.jsonl > current_status_of_tweets.txt`

If you want to find out the current status of tweets that you've already collected, you can use the twarc utility `twarc/utils/deletes.py` followed by the name of your JSONL file, the output operator `>`, and the desired name of your text file.

In [41]:
!python twarc/utils/deletes.py touch_my_face_tweets.jsonl > current_status_of_tweets.txt



In [49]:
tweet_current_status = open("current_status_of_tweets.txt", encoding="utf-8").read()

In [50]:
print(tweet_current_status)

https://twitter.com/amanda_poops/status/1238475327078313985 TWEET_OK
https://twitter.com/heatherjones333/status/1238467443732946945 TWEET_OK
https://twitter.com/Tits_McDick/status/1238463874539683840 TWEET_OK
https://twitter.com/Van_Firth/status/1238455575706513411 TWEET_OK
https://twitter.com/ChurchGoddess/status/1238453760222998530 TWEET_OK
https://twitter.com/haziqqqaaahzik/status/1238448347955879937 TWEET_OK
https://twitter.com/gaialogia/status/1238445424140263424 TWEET_OK
https://twitter.com/Bluemagicboxes/status/1238443747257585664 TWEET_OK
https://twitter.com/ricflairdahvid/status/1238441508312932353 TWEET_OK
https://twitter.com/InTheNoosphere/status/1238436189557755906 TWEET_OK
https://twitter.com/MissMuggleborn/status/1238426216962478080 TWEET_DELETED
https://twitter.com/CatstreyDave/status/1238423933516480513 TWEET_OK
https://twitter.com/sweetbeesXx/status/1238416876990070784 TWEET_OK
https://twitter.com/commentiquette/status/1238414444259995649 TWEET_OK
https://twitter.com/j
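
To get an overview of a status file like this, you could tally how many tweets have each status. The following sketch assumes that every line consists of a tweet URL, a space, and a status label, as in the output above. The `count_statuses` function is a made-up helper, and the sample uses three lines copied from that output.

```python
from collections import Counter

def count_statuses(status_lines):
    """Tally the status label at the end of each 'URL STATUS' line."""
    counts = Counter()
    for line in status_lines:
        parts = line.split()
        if len(parts) != 2:  # skip blank or incomplete lines
            continue
        url, status = parts
        counts[status] += 1
    return counts

# Three lines copied from the output above
sample_lines = [
    "https://twitter.com/InTheNoosphere/status/1238436189557755906 TWEET_OK",
    "https://twitter.com/MissMuggleborn/status/1238426216962478080 TWEET_DELETED",
    "https://twitter.com/CatstreyDave/status/1238423933516480513 TWEET_OK",
]

for status, count in count_statuses(sample_lines).items():
    print(status, count)
```

With the real file, you could pass `tweet_current_status.splitlines()` in place of the sample list.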

## Update/Enhance Twitter Data with Current Status of Tweets

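`python twarc/utils/deletes.py --enhance tweets.jsonl > tweets_with_current_status.jsonl`

To add the current status of each tweet to your Twitter data, you can run the same utility with the `--enhance` flag, followed by the name of your JSONL file, the output operator `>`, and the desired name of your new JSONL file.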

In [32]:
!python twarc/utils/deletes.py --enhance touch_my_face_tweets.jsonl > touch_my_face_tweets_CURRENT_STATUS.jsonl



# Where to Find Tweet IDs

DocNow Catalog: https://www.docnow.io/catalog/