# Data Collection and Storage

In this lab, we will collect data from Twitter, store them in a file, and import them back. There are multiple ways to collect data from Twitter. We can send requests through the official API (if you do not know what an API is, watch [this video](https://youtu.be/s7wmiS2mSXY?t=65)) and parse the response. Alternatively, we could scrape the content directly from the website, but web scraping is out of scope for this course, so we will use the official API.

## Contents

* [Before we start](#before-we-start)
* [Collection](#collection)
    * [Tweet](#collection-tweet)
        * [Retrieving a specific tweet](#retrieve-tweet)
        * [Retrieving tweets in bulk](#retrieve-tweet-bulk)
        * [Searching for tweets](#search-tweet)
    * [User](#collection-user)
        * [Retrieving a specific user](#retrieve-user)
        * [Retrieving users in bulk](#retrieve-user-bulk)
        * [Searching for users](#search-user)
* [Storage](#storage)
    * [JSON](#storage-json)
    * [CSV](#storage-csv)
* [Some notes](#some-notes)

## Before we start <a class="anchor" id="before-we-start"></a>

We are going to use Tweepy, an external wrapper package that abstracts certain technical details of Twitter's API and is therefore more beginner-friendly than calling the API directly. You can watch [this video](https://www.youtube.com/watch?v=TASX3evcgG4) to learn how to install Tweepy with Anaconda. The other packages come with Python 3, so you do not need to worry about them.

While Tweepy makes certain things easier, you still need to obtain API credentials from Twitter. Using your credentials, Tweepy handles the communication with Twitter on your behalf and returns the data in a form that is more convenient to use directly in Python.

## Collection <a class="anchor" id="collection"></a>

Let us start with importing the packages we will use:

In [1]:
import tweepy # for pulling tweets
import csv # for exporting and importing CSV files
import json # for formatting, exporting, and importing JSON files
import datetime # for date formatting
import html # for unescaping certain characters in the text

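There are two types of authentication. OAuth 2 (authentication without the user context) requires only the consumer key and the consumer secret, and it is sufficient for pulling public data. OAuth 1a (authentication with the user context) also requires an access token and an access token secret, and it can be used to retrieve your account's home timeline (`api.home_timeline()`) or even tweet something. Put your API credentials below, so that you can pull data from Twitter: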

In [2]:
CONSUMER_KEY = "" # fill this
CONSUMER_SECRET = "" # fill this
ACCESS_TOKEN = "" # fill this (required only for OAuth 1a)
ACCESS_TOKEN_SECRET = "" # fill this (required only for OAuth 1a)
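
If you would rather not type your credentials into the notebook itself, one possible alternative is to read them from environment variables. The sketch below is only an illustration: the `TWITTER_` prefix and the `load_credentials` helper are hypothetical names chosen for this example, not something Tweepy or Twitter prescribes.

```python
import os

# Hypothetical naming scheme: each credential is stored in an environment
# variable such as TWITTER_CONSUMER_KEY. Adjust it to your own setup.
CREDENTIAL_NAMES = ["CONSUMER_KEY", "CONSUMER_SECRET",
                    "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET"]

def load_credentials(environ=os.environ, prefix="TWITTER_"):
    # Returns a dictionary of credentials; a missing variable becomes "".
    return {name: environ.get(prefix + name, "") for name in CREDENTIAL_NAMES}

credentials = load_credentials()
CONSUMER_KEY = credentials["CONSUMER_KEY"]
CONSUMER_SECRET = credentials["CONSUMER_SECRET"]
ACCESS_TOKEN = credentials["ACCESS_TOKEN"]
ACCESS_TOKEN_SECRET = credentials["ACCESS_TOKEN_SECRET"]
```

This keeps the secrets out of the file, which helps if you share the notebook later.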

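Using these credentials, you authorize Tweepy to retrieve data from Twitter. Since we are mostly interested in pulling public data, OAuth 2 would be enough for most of this lab. However, the cell below uses OAuth 1a because searching for users requires a user context; the OAuth 2 version is included as a commented-out alternative.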

In [3]:
# OAuth 1a: Authentication with the user context

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)

# OAuth 2: Authentication without the user context

# auth = tweepy.AppAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
# api = tweepy.API(auth)

### Tweet <a class="anchor" id="collection-tweet"></a>

We are now ready to retrieve tweets. We will focus on retrieving tweets that have already been written instead of streaming them live. If you are interested in streaming, you can watch [this tutorial](https://www.youtube.com/watch?v=wlnx-7cm4Gg) and check [Tweepy's StreamListener](http://docs.tweepy.org/en/latest/streaming_how_to.html).

#### Retrieving a specific tweet <a class="anchor" id="retrieve-tweet"></a>

We can retrieve a tweet by its ID using `api.get_status()`. While this may not seem very useful, many tweet datasets list only the tweet IDs in order to comply with policy restrictions (sharing IDs is also much easier), so we can use this function to "hydrate" a tweet, that is, to retrieve its full data such as the user and the content. Here is an example of retrieving a tweet using its ID:

In [4]:
a_tweet_id = 1286989864999587840

# Provide the tweet ID and ensure that the whole text is retrieved using
# tweet_mode="extended".
tweet = api.get_status(a_tweet_id, tweet_mode="extended")

# Keep in mind that if the tweet's text is truncated or it includes embedded
# media, its text will include an appended link that is not visible on the Web.

However, we cannot use this tweet directly, since it is a big pile of data specific to this tweet. Take a look at it: we can retrieve many different types of data, including the user's profile image URL and whether the tweet has possibly sensitive content:

In [5]:
print(tweet)

Status(_api=<tweepy.api.API object at 0x0000021AF2C7FC08>, _json={'created_at': 'Sat Jul 25 11:41:09 +0000 2020', 'id': 1286989864999587840, 'id_str': '1286989864999587840', 'full_text': 'It is important to wear the mask in the right way covering the nose &amp; mouth completely to control the spread of #COVID19. \nParallelly, 2 meters of physical distance &amp; frequent sanitization of hands should be strictly followed. #APFightsCorona #COVID19Pandemic https://t.co/2A4gOsO6ie', 'truncated': False, 'display_text_range': [0, 267], 'entities': {'hashtags': [{'text': 'COVID19', 'indices': [116, 124]}, {'text': 'APFightsCorona', 'indices': [235, 250]}, {'text': 'COVID19Pandemic', 'indices': [251, 267]}], 'symbols': [], 'user_mentions': [], 'urls': [], 'media': [{'id': 1286989068060905473, 'id_str': '1286989068060905473', 'indices': [268, 291], 'media_url': 'http://pbs.twimg.com/ext_tw_video_thumb/1286989068060905473/pu/img/11yLalFmTyDhrI7y.jpg', 'media_url_https': 'https://pbs.twimg.com/ext

---

Instead, we can pick the attributes we need. To inspect its contents more easily, we can access its [JSON](https://www.w3schools.com/whatis/whatis_json.asp) object and print it with indentation:

In [6]:
tweet_json = tweet._json
print(json.dumps(tweet_json, indent=4, ensure_ascii=False))

{
    "created_at": "Sat Jul 25 11:41:09 +0000 2020",
    "id": 1286989864999587840,
    "id_str": "1286989864999587840",
    "full_text": "It is important to wear the mask in the right way covering the nose &amp; mouth completely to control the spread of #COVID19. \nParallelly, 2 meters of physical distance &amp; frequent sanitization of hands should be strictly followed. #APFightsCorona #COVID19Pandemic https://t.co/2A4gOsO6ie",
    "truncated": false,
    "display_text_range": [
        0,
        267
    ],
    "entities": {
        "hashtags": [
            {
                "text": "COVID19",
                "indices": [
                    116,
                    124
                ]
            },
            {
                "text": "APFightsCorona",
                "indices": [
                    235,
                    250
                ]
            },
            {
                "text": "COVID19Pandemic",
                "indices": [
                    251,
     

---

Now we can pick the attributes we want and retrieve them one at a time. For example, we can access the tweet owner's handle (screen name) through `tweet.user.screen_name` (using the tweet object) or `tweet_json["user"]["screen_name"]` (using the tweet's JSON data we obtained above). Note that even if we do not use the JSON data directly, attributes that are lists or dictionaries still require brackets. For example, to get the hashtags through the tweet object itself, we need to use `tweet.entities["hashtags"]`. Although it may seem slightly more cumbersome, I prefer to work directly with the tweet's JSON object, as shown below:

In [7]:
# These are optional basic data manipulations that might come in handy for your
# projects.

def get_tweet_timestamp(created_at):
    # A utility function that returns a UNIX timestamp from the created_at attribute
    return int(datetime.datetime.strptime(created_at, "%a %b %d %H:%M:%S %z %Y").timestamp())

def get_unique_hashtags(hashtags_dict, serialize=False):
    # A utility function that returns the unique hashtag texts from a list of hashtag entities
    hashtags = {hashtag["text"] for hashtag in hashtags_dict}
    if serialize:
        # Returns it as a simple string, hashtags are separated by ","
        return ','.join(hashtags)
    else:
        return hashtags

# Printing the attributes

print("User handle:",tweet_json["user"]["screen_name"])
print("User name:",tweet_json["user"]["name"])
# In this case, name and screen_name correspond to the same value, but most of the 
# time, it is not the case.

print("Date and time:",tweet_json["created_at"])
# Obtaining a UNIX timestamp from the date and time (might come in handy for
# further analyses):
print("UNIX timestamp from date:",get_tweet_timestamp(tweet_json["created_at"]))

print("Tweet:",tweet_json["full_text"])
print("Hashtags:",tweet_json["entities"]["hashtags"])
# If you do not need the indices, this directly gives you the unique hashtags:
print("Unique hashtags:",get_unique_hashtags(tweet_json["entities"]["hashtags"]))
print("Favorite count:",tweet_json["favorite_count"])
print("Retweet count:",tweet_json["retweet_count"])

# If you are going to pull your own data and repeat this process for multiple 
# tweets, you should write a function to simplify the whole process.

User handle: ArogyaAndhra
User name: ArogyaAndhra
Date and time: Sat Jul 25 11:41:09 +0000 2020
UNIX timestamp from date: 1595677269
Tweet: It is important to wear the mask in the right way covering the nose &amp; mouth completely to control the spread of #COVID19. 
Parallelly, 2 meters of physical distance &amp; frequent sanitization of hands should be strictly followed. #APFightsCorona #COVID19Pandemic https://t.co/2A4gOsO6ie
Hashtags: [{'text': 'COVID19', 'indices': [116, 124]}, {'text': 'APFightsCorona', 'indices': [235, 250]}, {'text': 'COVID19Pandemic', 'indices': [251, 267]}]
Unique hashtags: {'COVID19', 'APFightsCorona', 'COVID19Pandemic'}
Favorite count: 385
Retweet count: 106


---

You can find this tweet [here](https://twitter.com/ArogyaAndhra/status/1286989864999587840). You may have noticed that some characters are escaped, so `&` is shown as `&amp;` while line breaks are preserved. For convenience, we can write a function that handles these issues:

In [8]:
def simplify_text(text):
    # Practically replaces line breaks or other whitespace characters with a single 
    # space and then unescapes characters.
    return html.unescape(" ".join(text.split()))

print("Simplified tweet:",simplify_text(tweet_json["full_text"]))

Simplified tweet: It is important to wear the mask in the right way covering the nose & mouth completely to control the spread of #COVID19. Parallelly, 2 meters of physical distance & frequent sanitization of hands should be strictly followed. #APFightsCorona #COVID19Pandemic https://t.co/2A4gOsO6ie


----
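
The earlier cell suggests writing a function if you repeat this process for multiple tweets. A minimal sketch of such a function is given below. The name `flatten_tweet` and the choice of fields are assumptions made for illustration, and the sample dictionary is a hand-made stand-in that only mimics the relevant parts of a tweet's JSON data.

```python
import datetime
import html

def flatten_tweet(tweet_json):
    # Picks a handful of attributes from a tweet's JSON data and returns
    # them as a flat dictionary.
    created = datetime.datetime.strptime(
        tweet_json["created_at"], "%a %b %d %H:%M:%S %z %Y")
    return {
        "screen_name": tweet_json["user"]["screen_name"],
        "timestamp": int(created.timestamp()),
        # Collapse whitespace and unescape characters such as &amp;
        "text": html.unescape(" ".join(tweet_json["full_text"].split())),
        # Sorted so that the order of the unique hashtags is predictable
        "hashtags": sorted({h["text"] for h in tweet_json["entities"]["hashtags"]}),
        "favorite_count": tweet_json["favorite_count"],
        "retweet_count": tweet_json["retweet_count"],
    }

# A made-up example that has the same shape as the relevant parts of a tweet
sample = {
    "created_at": "Sat Jul 25 11:41:09 +0000 2020",
    "full_text": "Wear a mask &amp; keep\nyour distance. #COVID19 #COVID19",
    "user": {"screen_name": "example_user"},
    "entities": {"hashtags": [{"text": "COVID19"}, {"text": "COVID19"}]},
    "favorite_count": 3,
    "retweet_count": 1,
}
print(flatten_tweet(sample))
```

Returning a plain dictionary makes it easy to write the same fields to a JSON or CSV file later.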

#### Retrieving tweets in bulk <a class="anchor" id="retrieve-tweet-bulk"></a>

If we have a list of tweet IDs, we can retrieve them in bulk using `api.statuses_lookup()`. We can provide at most 100 IDs per call, so a long list of IDs can be hydrated quickly in groups of 100.

In [9]:
some_tweet_ids = [1313323997858332672, 1313588420149612544, 1316423894581030913, 1316739871269179392]

# These tweets are retrieved at once.
tweets = api.statuses_lookup(some_tweet_ids, tweet_mode="extended")

for tweet in tweets:
    # For each tweet, we can print its full text.
    print(simplify_text(tweet.full_text))

Who wants a slice of pumpkin, prosciutto, and smoked mozzarella pizza? #food #pizza #baking https://t.co/W9ayf5ARLm
French Bread Hawaiian Pizza, complete with ham, pineapple and gooey mozzarella, gets a fast weeknight makeover with a simple loaf of French bread as the base. No need for takeout! https://t.co/f8dp0euRPs #shockinglydelicious #pizza https://t.co/t64Da8EFC5
A mushroom #pizza is a happy pizza. https://t.co/JP6TzB8sSX
We love a good homemade #pizza but damnit if I'm not terrible at rolling out the #dough 🤦‍♀️😐 https://t.co/IjZ1sP2kvs
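
Since at most 100 IDs can be provided at a time, a longer list has to be split into groups first. The sketch below shows one way to do this. The helper name `chunked` is an assumption for this example, and the call to `api.statuses_lookup()` is left as a comment because it needs valid credentials.

```python
def chunked(ids, size=100):
    # Yields consecutive slices of the list with at most `size` items each.
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

# Made-up IDs, used only to show how the list is split
many_ids = list(range(250))
batches = list(chunked(many_ids))
print([len(batch) for batch in batches])

# With a real list of tweet IDs, each batch could then be hydrated:
# for batch in chunked(real_tweet_ids):
#     tweets = api.statuses_lookup(batch, tweet_mode="extended")
```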


#### Searching for tweets <a class="anchor" id="search-tweet"></a>

We can also search for public tweets with certain parameters using `api.search()`. Note that these results are not exhaustive and are limited to the last week's tweets. Since Twitter provides the results in smaller chunks (pages), we would normally have to handle pagination using the page tokens provided with each search result. Luckily, Tweepy's `Cursor` object abstracts this away, so we do not need to deal with pages ourselves.
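
To get a feeling for what such an abstraction does, here is a simplified sketch of token-based pagination. It is not Tweepy's actual implementation: `iterate_items`, `fetch_page`, and the `PAGES` dictionary are hypothetical stand-ins that imitate a service returning a list of items together with the token of the next page.

```python
# A fake "service": each token maps to (items on that page, next token).
PAGES = {
    None: (["a", "b", "c"], "t1"),
    "t1": (["d", "e", "f"], "t2"),
    "t2": (["g"], None),  # None means there are no more pages
}

def fetch_page(token):
    # Stand-in for a real request that would be sent to an API
    return PAGES[token]

def iterate_items(fetch_page, limit):
    # Yields items one by one, requesting the next page only when needed,
    # and stops after `limit` items or when the pages run out.
    token = None
    yielded = 0
    while yielded < limit:
        items, token = fetch_page(token)
        for item in items:
            if yielded >= limit:
                return
            yield item
            yielded += 1
        if token is None:
            return

print(list(iterate_items(fetch_page, 5)))
```

The caller only sees a stream of items, which is the same convenience `Cursor` offers.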

The code below retrieves 10 tweets that:
* include the word "mask" and the hashtag "covid19,"
* do not include the word "public,"
* are written from Central London, the UK,
* are written in English.

While pagination is abstracted from us, the tweet count per page can be set up to 100.

In [10]:
tweets_searched = tweepy.Cursor(api.search, q='mask #covid19 -public', # query string 
                                geocode="51.5026784,-0.1167149,10km", # latitude, longitude, radius (these values in this case correspond to London, the UK)
                                lang="en", # English
                                count=10, # tweet per page
                                tweet_mode="extended"
                                ).items(10) # retrieves only the first 10 tweets

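tweet_count = 0
for result_tweet in tweets_searched:
    # result_tweet is a Status object
    print("*",simplify_text(result_tweet.full_text))
    tweet_count += 1

# The cursor is an iterator and has no length, so we count the items ourselves.
print("\nTweets retrieved:",tweet_count)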

* With some schools across #Croydon on half-term and some starting next week, we all need to continue to play our part like Loyle Carner to help #KeepCroydonSafe from #Covid19. Protect yourself and others by washing hands often, wearing a face mask and watching your space 👍 https://t.co/k1xqPd7jq9
* I'd say as many as 50% of the people I see, don't know how to wear a mask. #covid19
* RT @asjadnazir: Please take #COVID19 seriously and stay safe. The cases are spiking all over the world and beginning to rise again. Wearing…
* RT @nisusmedical: #COVID19 extensive review of preexisting emerging mask designs, discovered elastomeric #N95 #facemasks (#eN95s) best alte…
* RT @nisusmedical: #COVID19 #PPE Essential Workers Need Better Masks #N95 #K95 #facemasks not being used in #US #UK elsewhere - if you’re a…
* In this time of the pandemic, we can still fight by wearing a mask and washing our hands to stop the spread of the virus. 🌐 https://t.co/LWDMOhU99C #coronavirusuk #COVID19 #Coronavirus 

---

A query string can also include operators. For example, you can search for tweets whose owners are verified, or tweets that mention a specific user. You can check your options [here](https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-rule).
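
If you build many searches, it may help to assemble the query and geocode strings programmatically. The helpers below are a small sketch that reproduces the strings used in the search example. `build_query` and `build_geocode` are hypothetical names, and only the exclusion operator (`-`) from that example is covered.

```python
def build_query(include, exclude=()):
    # Terms to include are kept as they are; terms to exclude get a "-" prefix.
    terms = list(include) + ["-" + term for term in exclude]
    return " ".join(terms)

def build_geocode(latitude, longitude, radius_km):
    # The format used in the search example: "latitude,longitude,radius"
    return "{},{},{}km".format(latitude, longitude, radius_km)

print(build_query(["mask", "#covid19"], exclude=["public"]))
print(build_geocode(51.5026784, -0.1167149, 10))
```

The resulting strings can be passed as the `q` and `geocode` parameters in the same way as the hard-coded values.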

### User <a class="anchor" id="collection-user"></a>

#### Retrieving a specific user <a class="anchor" id="retrieve-user"></a>

We can retrieve a user using `api.get_user()`. It accepts both user IDs and user handles.

In [11]:
# You can either provide a user ID or a user handle (screen_name)
user = api.get_user(25073877)
# user = api.get_user("realDonaldTrump")

print(user.name)

Donald J. Trump


#### Retrieving users in bulk <a class="anchor" id="retrieve-user-bulk"></a>

Using `api.lookup_users()`, just like tweets, we can retrieve users in bulk as well (100 at a time). We can use their handles or IDs, but cannot mix them.

In [13]:
some_user_handles = ["realDonaldTrump", "JoeBiden"]
some_user_ids = [25073877, 939091]

# If you have their handles:
users = api.lookup_users(screen_names=some_user_handles, tweet_mode="extended")
# If you have their IDs:
# users = api.lookup_users(user_ids=some_user_ids, tweet_mode="extended")

for user in users:
    print(user.name)

Donald J. Trump
Joe Biden
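
Because handles and IDs cannot be mixed in a single call, a mixed list needs to be separated first. Here is a minimal sketch, assuming IDs are stored as integers and handles as strings; `split_identifiers` is a hypothetical helper name.

```python
def split_identifiers(identifiers):
    # Integers are treated as user IDs, strings as user handles.
    user_ids = [item for item in identifiers if isinstance(item, int)]
    handles = [item for item in identifiers if isinstance(item, str)]
    return user_ids, handles

mixed = ["realDonaldTrump", 939091, "JoeBiden", 25073877]
ids_only, handles_only = split_identifiers(mixed)
print(ids_only)
print(handles_only)
```

Each list can then be passed to its own `api.lookup_users()` call.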


#### Searching for users <a class="anchor" id="search-user"></a>

Using `api.search_users()`, we can retrieve at most 1000 users in total (at most 20 users per request). Again, we can use `Cursor` to avoid dealing with pagination. Note that this function requires a user context, so you need OAuth 1a for it. The results also seem to be user-specific.

In [14]:
users_searched =  tweepy.Cursor(api.search_users, q='social computing', # query string 
                                full_text=True
                                ).items(10) # retrieves only the first 10 users

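user_count = 0
for result_user in users_searched:
    # result_user is a User object
    print("*",result_user.screen_name,"-",result_user.description)
    user_count += 1

# The cursor is an iterator and has no length, so we count the items ourselves.
print("\nUsers retrieved:",user_count)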

* ACM_CSCW - ACM Conference on Computer Supported Cooperative Work. Tweets by Krithika (@ResearcherKrit), Konstantinos (@ademon) and Lindsay (@linguangst) #CSCW2020
* IBMcloud - Built for your business, #IBMCloud has the tools, data & APIs to make AI real now. Follows IBM Social Computing Guidelines.
* QatarComputing - Qatar Computing Research Institute. Artificial Intelligence- Data Analytics-Social Computing-Cyber Security-Arabic Language Technologies.@QatarFoundation Member
* soccompjournal - Top Web periodical covering Social Computing, Enterprise 2.0, social media, and other Web 2.0 topics.
* sig_chi - Official account for ACM SIGCHI Conference: CHI (said 'kai'). #CHI2021 is in Yokohama, Japan, May 8-13. Content managed by Social Media team: SM, HY, RS, and SK
* socomputing - Social Computing is a company of experts in Information Access (Enterprise Search), Dynamic Mapping, Social Networks, Collective Intelligence
* socialMPI - Social computing research at the Max Planck Institut

## Storage <a class="anchor" id="storage"></a>

Once we retrieve data, we can store them in a file to share or use later. Two very common file formats for this purpose are JSON and CSV. Let us see how we can export those four tweets we had hydrated before.

### JSON <a class="anchor" id="storage-json"></a>

If the data we want to store is already in JSON format, we might prefer to store it as JSON directly. For example, we can save the whole tweet data or even multiple tweets in a file:

In [15]:
# Obtaining JSON objects in a list
tweets_json = [tweet._json for tweet in tweets]
# Saving it in a JSON file
with open("some_tweets.json", 'w', encoding="UTF-8") as outfile:
    # You can remove the indent parameter to make it more compact.
    json.dump(tweets_json, outfile, ensure_ascii=False, indent=4)

Now, we can import these tweets whenever we want:

In [16]:
with open("some_tweets.json", 'r', encoding="UTF-8") as file:
    tweets_json = json.load(file)

print(tweets_json)

[{'created_at': 'Tue Oct 06 03:43:36 +0000 2020', 'id': 1313323997858332672, 'id_str': '1313323997858332672', 'full_text': 'Who wants a slice of pumpkin, prosciutto, and smoked mozzarella pizza?\n#food #pizza #baking https://t.co/W9ayf5ARLm', 'truncated': False, 'display_text_range': [0, 91], 'entities': {'hashtags': [{'text': 'food', 'indices': [71, 76]}, {'text': 'pizza', 'indices': [77, 83]}, {'text': 'baking', 'indices': [84, 91]}], 'symbols': [], 'user_mentions': [], 'urls': [], 'media': [{'id': 1313323994574196737, 'id_str': '1313323994574196737', 'indices': [92, 115], 'media_url': 'http://pbs.twimg.com/media/EjndNvuXkAEYisl.jpg', 'media_url_https': 'https://pbs.twimg.com/media/EjndNvuXkAEYisl.jpg', 'url': 'https://t.co/W9ayf5ARLm', 'display_url': 'pic.twitter.com/W9ayf5ARLm', 'expanded_url': 'https://twitter.com/McBrideWriter/status/1313323997858332672/photo/1', 'type': 'photo', 'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'}, 'large': {'w': 2048, 'h': 2048, 'resize': 

---
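
To convince yourself that nothing is lost on the way, you can compare the data before and after the round trip. The sketch below does this with a made-up record and a temporary file, so it runs without any Twitter data. The file name and the record are assumptions for illustration only.

```python
import json
import os
import tempfile

# A made-up record that includes an emoji and an escaped line break
records = [{"id_str": "1", "full_text": "rolling out the #dough 😐\nsecond line"}]

path = os.path.join(tempfile.mkdtemp(), "sample_tweets.json")

# ensure_ascii=False writes the emoji as is instead of as an escape sequence
with open(path, "w", encoding="UTF-8") as outfile:
    json.dump(records, outfile, ensure_ascii=False, indent=4)

with open(path, "r", encoding="UTF-8") as infile:
    restored = json.load(infile)

print(restored == records)
```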

### CSV <a class="anchor" id="storage-csv"></a>

We may also want to save these tweets in a more readable and tabular format, using fewer attributes:

In [17]:
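# The csv module recommends opening files with newline='' so that line endings
# are handled by the writer itself.
with open("some_tweets.csv", 'w', encoding="UTF-8", newline='') as outfile: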
    writer = csv.writer(outfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL, lineterminator='\n')
    # Writing the headers
    writer.writerow(["screen_name", 
                     "user_location", 
                     "follower_count", 
                     "timestamp", 
                     "text", 
                     "hashtags", 
                     "favorite_count", 
                     "retweet_count"])
    
    # Writing each tweet on a line
    for tweet_json in tweets_json:
        writer.writerow([tweet_json["user"]["screen_name"], 
                         tweet_json["user"]["location"], 
                         tweet_json["user"]["followers_count"], 
                         get_tweet_timestamp(tweet_json["created_at"]), 
                         simplify_text(tweet_json["full_text"]), 
                         get_unique_hashtags(tweet_json["entities"]["hashtags"], serialize=True), 
                         tweet_json["favorite_count"], 
                         tweet_json["retweet_count"]])

Importing this CSV file back is simple as well:

In [18]:
with open("some_tweets.csv", 'r', encoding="UTF-8", newline='') as file:
    reader = csv.reader(file)
    tweets_csv = list(reader)
    
for tweet in tweets_csv:
    print(tweet)

['screen_name', 'user_location', 'follower_count', 'timestamp', 'text', 'hashtags', 'favorite_count', 'retweet_count']
['McBrideWriter', 'Louisville, Kentucky', '11304', '1601955816', 'Who wants a slice of pumpkin, prosciutto, and smoked mozzarella pizza? #food #pizza #baking https://t.co/W9ayf5ARLm', 'baking,food,pizza', '209', '16']
['Shockinglydlish', 'Malibu | Southern California', '5078', '1602018859', 'French Bread Hawaiian Pizza, complete with ham, pineapple and gooey mozzarella, gets a fast weeknight makeover with a simple loaf of French bread as the base. No need for takeout! https://t.co/f8dp0euRPs #shockinglydelicious #pizza https://t.co/t64Da8EFC5', 'shockinglydelicious,pizza', '29', '20']
['CookedBest', '', '1830', '1602694889', 'A mushroom #pizza is a happy pizza. https://t.co/JP6TzB8sSX', 'pizza', '83', '8']
['LonelyPinesFarm', 'Quilcene, WA', '251', '1602770223', "We love a good homemade #pizza but damnit if I'm not terrible at rolling out the #dough 🤦\u200d♀️😐 https://

----

Some emojis may not be displayed correctly here, but they are preserved without a problem when you write them to another file. Also note that, for analysis purposes, using the Pandas package would be more practical. Next time, we will use Pandas to import a CSV file as a DataFrame object.
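
Note that every value in the imported rows above is a string, including the counts. Until we switch to Pandas, one way to restore the types is `csv.DictReader` with a few manual conversions. The sketch below uses a small in-memory CSV with made-up rows and only some of the columns; `read_rows` is a hypothetical helper name.

```python
import csv
import io

# A made-up CSV with the same kind of columns as some_tweets.csv
csv_text = (
    "screen_name,follower_count,hashtags\n"
    'example_user,11304,"baking,food,pizza"\n'
    "another_user,1830,\n"
)

def read_rows(file):
    # Reads each line as a dictionary keyed by the header names and
    # converts the numeric and list-like columns back to Python types.
    rows = []
    for row in csv.DictReader(file):
        row["follower_count"] = int(row["follower_count"])
        row["hashtags"] = row["hashtags"].split(",") if row["hashtags"] else []
        rows.append(row)
    return rows

rows = read_rows(io.StringIO(csv_text))
for row in rows:
    print(row)
```

With a real file, the object returned by `open()` can be passed in place of the `io.StringIO` object.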

## Some notes <a class="anchor" id="some-notes"></a>

* There are many things you can do with the API. Check the [Tweepy](http://docs.tweepy.org/) and [Twitter API](https://developer.twitter.com/en/docs/twitter-api/v1) documentation for more information.
* Like many other platforms, Twitter's policies about what you can retrieve or especially what you can do with it have become even stricter. It is best to read [the developer policies and agreements](https://developer.twitter.com/en/developer-terms) to not get in trouble.
* While Tweepy has a long-standing history, you may not find a reliable wrapper (or a wrapper at all) for every API. If you are interested, once you get the hang of it, it is a good idea to use the [Requests package](https://requests.readthedocs.io/en/master/) and learn how to use APIs without a wrapper.
* APIs usually evolve. While they mostly have transition periods to ease the process, if you do not follow these changes, one day you may realize that your code does not work anymore.