# Twitter HOWTO

## Overview

This document is an overview of how to use NLTK to collect and process Twitter data. It was written as an IPython notebook, and if you have IPython installed, you can download [the source of the notebook](https://raw.githubusercontent.com/nltk/nltk/develop/nltk/test/twitter.ipynb) from the NLTK GitHub repository and run the notebook in interactive mode.

Most of the tasks that you might want to carry out with 'live' Twitter data require you to authenticate your request by registering for API keys. This is usually a once-only step. When you have registered your API keys, you can store them in a file on your computer, and then use them whenever you want. We explain what's involved in the section [First Steps](#first_steps).

If you have already obtained Twitter API keys as part of some earlier project, [storing your keys](#store_keys) explains how to save them to a file that NLTK will be able to find. Alternatively, if you just want to play around with the Twitter data that is distributed as part of NLTK, head over to the section on using the [`twitter-samples` corpus reader](#corpus_reader).

Once you have got authentication sorted out, we'll show you [how to use NLTK's `Twitter` class](#simple). This is made as simple as possible, but deliberately limits what you can do.  

## <a name="first_steps">First Steps</a>

As mentioned above, in order to collect data from Twitter, you first need to register a new *application* &mdash; this is Twitter's way of referring to any computer program that interacts with the Twitter API. As long as you save your registration information correctly, you should only need to do this once, since the information should work for any NLTK code that you write. You will need to have a Twitter account before you can register. Twitter also insists that [you add a mobile phone number to your Twitter profile](https://support.twitter.com/articles/110250-adding-your-mobile-number-to-your-account-via-web) before you will be allowed to register an application.

These are the steps you need to carry out.

### <a name="api_keys">Getting your API keys from Twitter</a>

1. Sign in to your Twitter account at https://apps.twitter.com. You should then get sent to a screen that looks something like this:
<img src="images/twitter_app1.tiff" width="600px">
Clicking on the **Create New App** button should take you to the following screen:
<img src="images/twitter_app2.tiff" width="600px">
The information that you provide for **Name**, **Description** and **Website** can be anything you like.

2. Make sure that you select **Read and Write** access for your application (as specified on the *Permissions* tab of Twitter's Application Management screen):
<img src="images/twitter_app3.tiff" width="600px">

3. Go to the tab labeled **Keys and Access Tokens**. It should look something like this, but with actual keys rather than a string of Xs:
<img src="images/twitter_app4.png" width="650px">
As you can see, this will give you four distinct keys: consumer key, consumer key secret, access token and access token secret.

### <a name="store_keys">Storing your keys</a>

1. Create a folder named `twitter-files` in your home directory. Within this folder, use a text editor to create a new file called `credentials.txt`. Make sure that this file is just a plain text file. In it, you should store your four keys using the following structure:
```
app_key=YOUR CONSUMER KEY  
app_secret=YOUR CONSUMER SECRET  
oauth_token=YOUR ACCESS TOKEN  
oauth_token_secret=YOUR ACCESS TOKEN SECRET
```
Type the part up to and including the '=' symbol exactly as shown. The values on the right-hand side of the '=' &mdash; that is, everything in caps &mdash; should be cut-and-pasted from the relevant API key information shown on the Twitter **Keys and Access Tokens** tab. Save the file and that's it.

2. It's going to be important for NLTK programs to know where you have stored your
   credentials. We'll assume that this folder is called `twitter-files`, but you can call it
   anything you like. We will also assume that this folder is where you save any files
   containing tweets that you collect. Once you have decided on the name and location of this
   folder, you will need to set the `TWITTER` environment variable to this value.

   On a Unix-like system (including MacOS), you will set the variable something like this:
   ```bash
   export TWITTER="/path/to/your/twitter-files"
   ```
   Rather than having to give this command each time you start a new session, it's advisable to add it to your shell's configuration file, e.g. to `.bashrc`.

   On a Windows machine, right click on “My Computer” then select `Properties > Advanced > Environment Variables > User Variables > New...`

   One important thing to remember is that you need to keep your `credentials.txt` file private. So do **not** share your `twitter-files` folder with anyone else, and do **not** upload it to a public repository such as GitHub.

3. Finally, read through Twitter's [Developer Rules of the Road](https://dev.twitter.com/overview/terms/policy). As far as these rules are concerned, you count as both the application developer and the user.
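
Before moving on, it can be worth checking that the credentials file has the expected shape. The following is only a minimal sketch using the standard library: `read_credentials()` is a hypothetical helper written for this illustration (it is not part of NLTK), and the values `AAA` to `DDD` are placeholders written to a temporary folder so that the snippet can run without touching your real keys.

```python
import os
import tempfile


def read_credentials(path):
    """Parse lines of the form key=value into a dictionary (hypothetical helper)."""
    creds = {}
    with open(path, encoding="utf8") as infile:
        for line in infile:
            line = line.strip()
            if not line or "=" not in line:
                continue  # skip blank or malformed lines
            key, _, value = line.partition("=")
            creds[key.strip()] = value.strip()
    return creds


# Write a throwaway credentials file with placeholder values.
demo_dir = tempfile.mkdtemp()
demo_file = os.path.join(demo_dir, "credentials.txt")
with open(demo_file, "w", encoding="utf8") as outfile:
    outfile.write("app_key=AAA  \napp_secret=BBB  \n"
                  "oauth_token=CCC  \noauth_token_secret=DDD\n")

# The real folder would normally be found via the TWITTER environment variable.
print("TWITTER is set to:", os.environ.get("TWITTER"))

creds = read_credentials(demo_file)
expected = {"app_key", "app_secret", "oauth_token", "oauth_token_secret"}
missing = expected - set(creds)
print("missing keys:", sorted(missing))
```

If `missing` is non-empty for your own file, one of the four lines is absent or the text before the '=' has been mistyped.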

### <a name="twython">Install Twython</a>

The NLTK Twitter package relies on a third party library called [Twython](https://twython.readthedocs.org/). Install Twython via [pip](https://pip.pypa.io):
```bash
$ pip install twython
```

or with [easy_install](https://pythonhosted.org/setuptools/easy_install.html):

```bash
$ easy_install twython
```
We're now ready to get started. The next section will describe how to use the `Twitter` class to talk to the Twitter API.

*More detail*:
Twitter offers two main authentication options. OAuth 1 is for user-authenticated API calls, and allows sending status updates, direct messages, etc, whereas OAuth 2 is for application-authenticated calls, where read-only access is sufficient. Although OAuth 2 sounds more appropriate for the kind of tasks envisaged within NLTK, it turns out that access to Twitter's Streaming API requires OAuth 1, which is why it's necessary to obtain *Read and Write* access for your application.

## <a name="simple">Using the simple `Twitter` class</a>

### Dipping into the Public Stream

The `Twitter` class is intended as a simple means of interacting with the Twitter data stream. Later on, we'll look at other methods which give more fine-grained control. 

The Twitter live public stream is a sample (approximately 1%) of all Tweets that are currently being published by users. They can be on any topic and in any language. In your request, you can give keywords which will narrow down the Tweets that get delivered to you. Our first example looks for Tweets which include either the word *love* or *hate*. We limit the call to finding 10 tweets. When you run this code, it will definitely produce different results from those shown below!

In [1]:
from nltk.twitter import Twitter
tw = Twitter()
tw.tweets(keywords='love, hate', limit=10) #sample from the public stream

RT @jerkfuI: everything i love will kill me
Very bored in school this only the first week and I'm already bored woooow I really hate school that much ;)
Now go listen to the weeknds album then come back and tell me you love me  https://t.co/qQMl52aZxC
RT @Anmol_arts: Laugh as much as you breath and love as long as you live

#CNWKIsTheBest
RT @inkedrosesx_: thank u for accompanying me the whole day bc i needed a shoping therapy 😝💞 love ü 😘 @AldricLCM
Written 10 Tweets


The next example filters the live public stream by looking for specific user accounts. In this case, we 'follow' two news organisations, namely `@CNN` and `@BBCNews`. [As advised by Twitter](https://dev.twitter.com/streaming/reference/post/statuses/filter), we use *numeric userIDs* for these accounts. If you run this code yourself, you'll see that Tweets are arriving much more slowly than in the previous example. This is because even big news organisations don't publish Tweets that often.

A bit later we will show you how to use Python to convert usernames such as `@CNN` to userIDs such as `759251`, but for now you might find it simpler to use a web service like [TweeterID](http://tweeterid.com) if you want to experiment with following different accounts than the ones shown below.

In [2]:
tw = Twitter()
tw.tweets(follow=['759251', '612473'], limit=10) # see what CNN and BBC are talking about

@CNN it's a start
@BBCNews ...if the M6 is the backbone of Britain what does that make the M25?....The Arsehole?
@BBCNews ok, just lost my vote within an afternoon. We have homeless, services are being cut, yet he wants open door for refugees!!!
@CNN starting to cover Bernie fairly, unlike @MSNBC which tries to denigrate him.
@BBCNews @itvnews @Channel4News Look at how he got with a few journalists and they will push him all the more.I, not for one minute believe
Written 10 Tweets


### Saving Tweets to a File

By default, the `Twitter` class will just print out Tweets to your computer terminal. Although it's fun to view the Twitter stream zipping by on your screen, you'll probably want to save some tweets in a file. We can tell the `tweets()` method to save to a file by setting the flag `to_screen` to `False`. 

The `Twitter` class will look at the value of your environment variable `TWITTER` to determine which folder to use to save the tweets, and it will put them in a date-stamped file with the prefix `tweets`.

In [3]:
tw = Twitter()
tw.tweets(to_screen=False, limit=25)

Writing to /Users/ewan/twitter-files/tweets.20150912-181422.json
Written 25 Tweets
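
The file name in the output above appears to combine the prefix, a timestamp and a `.json` extension. As a rough illustration of that naming pattern (this is a sketch inferred from the example output, not NLTK's own code, and `timestamped_name()` is a hypothetical helper), such a path could be built like this:

```python
import os
from datetime import datetime


def timestamped_name(folder, prefix="tweets", when=None):
    """Build a path such as <folder>/tweets.20150912-181422.json."""
    when = when or datetime.now()
    stamp = when.strftime("%Y%m%d-%H%M%S")  # date, hyphen, time of day
    return os.path.join(folder, "{}.{}.json".format(prefix, stamp))


# Reproduce the stamp from the example output with a fixed date and time.
example = timestamped_name("twitter-files", when=datetime(2015, 9, 12, 18, 14, 22))
print(example)
```

Because the stamp goes down to the second, successive runs write to different files rather than overwriting earlier ones.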


So far, we've been taking data from the live public stream. However, it's also possible to retrieve past tweets, for example by searching for specific keywords, and setting `stream=False`:

In [6]:
tw.tweets(keywords='hilary clinton', stream=False, limit=10)

RT @pattonoswalt: Dear Hilary Clinton's aides: Announcing she plans to show "more heart and humor" is what you say about a Terminator you'v…
RT @mzagorski: ARREST THE FOLLOWING APES: Hilary Rodent Clinton, Nancy Pelosi, Eric Holder, Bill Clinton, Harry Reid, Susan Rice, Barack Hu…
They got these niggas hitting it like Hilary clinton, fam. How could they fuck up something like that
Hilary WINS if #StopESEA PASSES #StopHillary2016 #StopHR5 #WakeUpAmerica #Corrupt ShannonJoyRadio EnragedNY … … … … http://t.co/5IV46WvPMc
Hilary Clinton is such a fake. I'd almost rather have Bush, or even Michelle Bachmann as the POTUS http://t.co/QJoUktCBbb
Written 10 Tweets


## Onwards and Upwards

In this section, we'll look at how to get more fine-grained control over processing Tweets. To start off, we will import a bunch of stuff from the `twitter` package.

In [None]:
from nltk.twitter import Query, Streamer, Twitter, TweetViewer, TweetWriter, credsfromfile

In the following example, you'll see the line
``` python
oauth = credsfromfile()
```
This gets hold of your stored API key information. The function `credsfromfile()` by default looks for a file called `credentials.txt` in the directory set by the environment variable `TWITTER`, reads the contents and returns the result as a dictionary. We then pass this dictionary as an argument when initializing our client code. We'll be using two classes to wrap the clients: `Streamer` and `Query`; the first of these calls [the Streaming API](https://dev.twitter.com/streaming/overview) and the second calls Twitter's [Search API](https://dev.twitter.com/rest/public) (also called the REST API). 

*More detail*: See this blog post on [The difference between the Twitter Firehose API, the Twitter Search API, and the Twitter Streaming API](http://www.brightplanet.com/2013/06/twitter-firehose-vs-twitter-api-whats-the-difference-and-why-should-you-care/).


After initializing a client, we call the `register()` method to specify whether we want to view the data on a terminal or write it to a file. Finally, we call a method which determines the API endpoint to address; in this case, we use `sample()` to get a random sample from the Streaming API.

In [8]:
oauth = credsfromfile()
client = Streamer(**oauth)
client.register(TweetViewer(limit=10))
client.sample()

I filtri fotografici diventano plug-in dell'app Foto di #Windows10Mobile! http://t.co/geJhxGeU5M http://t.co/BwxvgVcfuB
RT @RepublicanSwine: Rick Perry Throws In The Towel https://t.co/uhmBDcxMST via @sharethis
RT @RealDubs: Commodo Gantz Kahn - Volume 1 - 28.08.15 by DEEP MEDi MUSIK http://t.co/8DY75SOXIY #dubstep #reddit
命を狙ってるからですかね？
RT @SoDamnTrue: when september hits

me: *goes out in jeans, boots, and a sweater when it’s still 90 degrees* it’s fall
Written 10 Tweets


The next example is similar, except that we call the `filter()` method with the `track` parameter followed by a string literal. The string is interpreted as a list of search terms where [comma indicates a logical OR](https://dev.twitter.com/streaming/overview/request-parameters#track). The terms are treated as case-insensitive.

In [15]:
client = Streamer(**oauth)
client.register(TweetViewer(limit=10))
client.filter(track='refugee, germany')

@SamirTalwar sounds like rhetoric used by far right minority nazi party in Germany- which everyone with a brain thinks of as despicable.
Germany's Merkel sees need to cooperate with Russia on Syria http://t.co/bFeldHWimw #News
RT @MsIntervention: Attacks on #Kurds in #Germany and Switzerland leave many injured. We can't even keep them safe in our own backyard! htt…
RT @oslerpryintitre: Germany says 'significant progress' made at Ukraine meeting
RT @TR_Foundation: PICTURE: A Syrian refugee from Aleppo and his one month old daughter, after arriving in Lesbos on a dinghy yesterday htt…
Written 10 Tweets
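
To get a feel for how a `track` string is read, here is a small local approximation. It assumes, following Twitter's description of the parameter, that commas separate alternatives (logical OR), that the terms inside one phrase must all be present, and that case is ignored. The function `matches_track()` is a hypothetical helper for illustration only; the real matching happens on Twitter's servers and also takes URLs, hashtags and punctuation into account in ways this sketch does not.

```python
import re


def matches_track(text, track):
    """Return True if any comma-separated phrase has all of its terms in text."""
    words = set(re.findall(r"\w+", text.lower()))
    for phrase in track.split(","):
        terms = phrase.lower().split()
        if terms and all(term in words for term in terms):
            return True
    return False


samples = [
    "Germany's Merkel sees need to cooperate with Russia on Syria",
    "PICTURE: A Syrian refugee from Aleppo",
    "Nothing relevant here",
]
hits = [matches_track(sample, "refugee, germany") for sample in samples]
print(hits)
```

The first two sample texts match because each contains one of the two terms; the third contains neither.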


Whereas the Streaming API lets us access near real-time Twitter data, the Search API lets us query for past Tweets. In the following example, `search_tweets()` returns a generator over individual Tweets. Although Twitter delivers Tweets as [JSON](http://www.json.org) objects, the Python client encodes them as dictionaries.

In [40]:
client = Query(**oauth)
tweets = client.search_tweets(keywords='nltk', limit=10)
tweet = next(tweets)
from pprint import pprint
pprint(tweet, depth=1)

{'contributors': None,
 'coordinates': None,
 'created_at': 'Sun Sep 13 10:47:01 +0000 2015',
 'entities': {...},
 'favorite_count': 0,
 'favorited': False,
 'geo': None,
 'id': 643013003386953732,
 'id_str': '643013003386953732',
 'in_reply_to_screen_name': None,
 'in_reply_to_status_id': None,
 'in_reply_to_status_id_str': None,
 'in_reply_to_user_id': None,
 'in_reply_to_user_id_str': None,
 'is_quote_status': False,
 'lang': 'en',
 'metadata': {...},
 'place': None,
 'possibly_sensitive': False,
 'retweet_count': 0,
 'retweeted': False,
 'source': '<a href="http://ifttt.com" rel="nofollow">IFTTT</a>',
 'text': 'NLTK Convert Tree to Array? http://t.co/0aVOrrrwEi',
 'truncated': False,
 'user': {...}}


Twitter's own documentation [provides a useful overview of all the fields in the JSON object](https://dev.twitter.com/overview/api/tweets) and it may be helpful to look at this [visual map of a Tweet object](http://www.scribd.com/doc/30146338/map-of-a-tweet).

Since each Tweet is converted into a Python dictionary, it's straightforward to just show a selected field, such as the value of the `'text'` key.

In [41]:
for tweet in tweets:
    print(tweet['text'])

NLTK Convert Tree to Array? http://t.co/6tUKhyHSCE http://t.co/HySUTalb9w #Python via @dv_geek
NLTK Bowman? Call us
My @Quora answer to Is NLTK suitable for big data? http://t.co/l0woNs7K78
Learning about Natural Language Processing with #Python and #NLTK.
Is Their Dependable Coffee Machine Toward the Net? NLtK
RT @clauersen: Hey #DST4L - the book on natural language processing in #python that @libcce has been pointing out. Enjoy :) http://t.co/y0m…
We compared #nltk vs #opennlp - see results: http://t.co/mQV3Fhgwyo
english.pickle resource in NLTK not found (Tried every possible solution online that I can find) http://t.co/rcKYaYFREY
Twitter sentiment analysis using Python and NLTK
http://t.co/PRHIy7vRB3
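
Nested fields such as `'user'` and `'entities'` are themselves dictionaries (or lists of dictionaries), so they can be reached by chaining keys. The snippet below is a sketch that uses a hand-made stand-in for a Tweet; the screen name, follower count and hashtag are placeholders, not data returned by Twitter.

```python
# Hand-made stand-in for a Tweet dictionary; nested values are placeholders.
sample_tweet = {
    "id_str": "643013003386953732",
    "text": "NLTK Convert Tree to Array? http://t.co/0aVOrrrwEi",
    "lang": "en",
    "user": {"screen_name": "example_user", "followers_count": 42},
    "entities": {
        "hashtags": [{"text": "nltk"}],
        "urls": [{"url": "http://t.co/0aVOrrrwEi"}],
    },
}

# Chain keys to reach into nested dictionaries.
screen_name = sample_tweet["user"]["screen_name"]

# Entities are lists of small dictionaries, one per hashtag or URL.
hashtags = [tag["text"] for tag in sample_tweet["entities"]["hashtags"]]
urls = [item["url"] for item in sample_tweet["entities"].get("urls", [])]

# .get() returns None instead of raising KeyError when a field is absent.
place = sample_tweet.get("place")

print(screen_name, hashtags, urls, place)
```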


In [42]:
client = Query(**oauth)
client.register(TweetWriter())
client.user_tweets('timoreilly', 10)

Writing to /Users/ewan/twitter-files/tweets.20150913-120951.json


Given a list of user IDs, the following example shows how to retrieve the screen name and other information about the users.

In [48]:
userids = ['759251', '612473', '15108702', '6017542', '2673523800']
client = Query(**oauth)
user_info = client.user_info_from_id(userids)
for info in user_info:
    name = info['screen_name']
    followers = info['followers_count']
    following = info['friends_count']
    print("{}, followers: {}, following: {}".format(name, followers, following))

CNN, followers: 19533857, following: 1103
BBCNews, followers: 4855021, following: 106
ReutersLive, followers: 301498, following: 51
BreakingNews, followers: 7872261, following: 541
AJELive, followers: 1100, following: 19


A list of user IDs can also be used as input to the Streaming API client.

In [49]:
client = Streamer(**oauth)
client.register(TweetViewer(limit=10))
client.statuses.filter(follow=userids)

RT @BBCSport: Mo Farah wins the #GreatNorthRun for the second successive year with a time of 59:23: http://t.co/uUg0ed4hug http://t.co/DedJ…
@BBCNews good luck lol
@BBCNews Try giving him a kiss..
RT @BBCSport: WICKET - Big wicket  - Marsh gets his 4th, Stokes trapped LBW for 42.

Eng 85-7, 20 overs... http://t.co/ATgI8bho6o http://t.…
RT @5liveSport: Mo Farah WINS #GreatNorthRun for the second successive year!

Reaction http://t.co/tpgSHCDWrr http://t.co/6GEopIaqWN
Written 10 Tweets


To store data that Twitter sends via the Streaming API, we register a `TweetWriter` instance.

In [50]:
client = Streamer(**oauth)
client.register(TweetWriter(limit=10))
client.statuses.sample()

Writing to /Users/ewan/twitter-files/tweets.20150913-122650.json
Written 10 Tweets


Here's the full signature of the `TweetWriter`'s `__init__()` method:
```python
def __init__(self, limit=2000, date_limit=None, stream=True,
                 fprefix='tweets', subdir='twitter-files', repeat=False,
                 gzip_compress=False)
```
If the `repeat` parameter is set to `True`, then the writer will write up to the value of `limit` in file `file1`, then open a new file `file2` and write to it until the limit is reached, and so on indefinitely. The parameter `gzip_compress` can be used to compress the files once they have been written.
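
The behaviour described for `repeat` amounts to rotating output files once a limit is reached. The following standalone sketch illustrates that idea only; `RotatingWriter` is a hypothetical class written for this example, it numbers its files instead of date-stamping them, and it makes no claim to mirror how `TweetWriter` is implemented.

```python
import json
import os
import tempfile


class RotatingWriter:
    """Write items as line-delimited JSON, starting a new file every `limit` items."""

    def __init__(self, folder, limit=2, prefix="tweets"):
        self.folder = folder
        self.limit = limit
        self.prefix = prefix
        self.count = 0      # items written to the current file
        self.file_no = 0    # number of files opened so far
        self.handle = None

    def write(self, item):
        # Open the first file, or move on once the current one is full.
        if self.handle is None or self.count >= self.limit:
            self._open_next()
        self.handle.write(json.dumps(item) + "\n")
        self.count += 1

    def _open_next(self):
        self.close()
        self.file_no += 1
        name = "{}.{}.json".format(self.prefix, self.file_no)
        self.handle = open(os.path.join(self.folder, name), "w", encoding="utf8")
        self.count = 0

    def close(self):
        if self.handle is not None:
            self.handle.close()
            self.handle = None


folder = tempfile.mkdtemp()
writer = RotatingWriter(folder, limit=2)
for n in range(5):
    writer.write({"id": n, "text": "tweet {}".format(n)})
writer.close()

files = sorted(os.listdir(folder))
print(files)
```

With five items and a limit of two, the sketch produces three files, the last of which holds a single item.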

## <a name="corpus_reader">Using a Tweet Corpus</a>

NLTK's Twitter corpus currently contains a sample of 20k Tweets (`twitter_samples`)
retrieved from the Twitter Streaming API, together with another 10k which are divided according to sentiment into negative and positive.

In [52]:
from nltk.corpus import twitter_samples
twitter_samples.fileids()

['negative_tweets.json', 'positive_tweets.json', 'tweets.20150430-223406.json']

We follow standard practice in storing full Tweets as line-separated
JSON. These data structures can be accessed via `twitter_samples.docs()`. However, in general it
is more practical to focus just on the text field of the Tweets, which
is accessed via the `strings()` method.

In [71]:
strings = twitter_samples.strings('tweets.20150430-223406.json')
for string in strings[:15]:
    print(string)

RT @KirkKus: Indirect cost of the UK being in the EU is estimated to be costing Britain £170 billion per year! #BetterOffOut #UKIP
VIDEO: Sturgeon on post-election deals http://t.co/BTJwrpbmOY
RT @LabourEoin: The economy was growing 3 times faster on the day David Cameron became Prime Minister than it is today.. #BBCqt http://t.co…
RT @GregLauder: the UKIP east lothian candidate looks about 16 and still has an msn addy http://t.co/7eIU0c5Fm1
RT @thesundaypeople: UKIP's housing spokesman rakes in £800k in housing benefit from migrants.  http://t.co/GVwb9Rcb4w http://t.co/c1AZxcLh…
RT @Nigel_Farage: Make sure you tune in to #AskNigelFarage tonight on BBC 1 at 22:50! #UKIP http://t.co/ogHSc2Rsr2
RT @joannetallis: Ed Milliband is an embarrassment. Would you want him representing the UK?!  #bbcqt vote @Conservatives
RT @abstex: The FT is backing the Tories. On an unrelated note, here's a photo of FT leader writer Jonathan Ford (next to Boris) http://t.c…
RT @NivenJ1: “@George_Osborne: Ed Mi
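
The line-separated JSON layout mentioned above simply means one JSON object per line. As a minimal sketch of what reading such data involves (using two made-up records in an in-memory file rather than the corpus itself), each line can be decoded independently with the standard `json` module:

```python
import io
import json

# Two made-up records standing in for a file of line-separated Tweets.
raw = io.StringIO(
    '{"id": 1, "text": "first tweet"}\n'
    '{"id": 2, "text": "second tweet"}\n'
)

# Decode one JSON object per non-empty line.
docs = [json.loads(line) for line in raw if line.strip()]

# Keep only the text field of each record.
texts = [doc["text"] for doc in docs]
print(texts)
```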

The default tokenizer for Tweets (`casual.py`) is specialised for 'casual' text, and
the `tokenized()` method returns a list of lists of tokens.

In [70]:
tokenized = twitter_samples.tokenized('tweets.20150430-223406.json')
for toks in tokenized[:5]:
    print(toks)

['RT', '@KirkKus', ':', 'Indirect', 'cost', 'of', 'the', 'UK', 'being', 'in', 'the', 'EU', 'is', 'estimated', 'to', 'be', 'costing', 'Britain', '£', '170', 'billion', 'per', 'year', '!', '#BetterOffOut', '#UKIP']
['VIDEO', ':', 'Sturgeon', 'on', 'post-election', 'deals', 'http://t.co/BTJwrpbmOY']
['RT', '@LabourEoin', ':', 'The', 'economy', 'was', 'growing', '3', 'times', 'faster', 'on', 'the', 'day', 'David', 'Cameron', 'became', 'Prime', 'Minister', 'than', 'it', 'is', 'today', '..', '#BBCqt', 'http://t.co…']
['RT', '@GregLauder', ':', 'the', 'UKIP', 'east', 'lothian', 'candidate', 'looks', 'about', '16', 'and', 'still', 'has', 'an', 'msn', 'addy', 'http://t.co/7eIU0c5Fm1']
['RT', '@thesundaypeople', ':', "UKIP's", 'housing', 'spokesman', 'rakes', 'in', '£', '800k', 'in', 'housing', 'benefit', 'from', 'migrants', '.', 'http://t.co/GVwb9Rcb4w', 'http://t.co/c1AZxcLh…']


### Extracting Parts of a Tweet

If we want to carry out other kinds of analysis on Tweets, we have to work directly with the file rather than via the corpus reader. For demonstration purposes, we will use the same file as the one in the preceding section, namely  `tweets.20150430-223406.json`. The `abspath()` method of the corpus gives us the full pathname of the relevant file. If your NLTK data is installed in the default location on a Unix-like system, this pathname will be `'/usr/share/nltk_data/corpora/twitter_samples/tweets.20150430-223406.json'`.

In [59]:
from nltk.corpus import twitter_samples
input_file = twitter_samples.abspath("tweets.20150430-223406.json")

The function `json2csv()` takes as input a file-like object consisting of Tweets as line-delimited JSON objects and writes a file in CSV format. The third parameter of the function lists the fields that we want to extract from the JSON. One of the simplest examples is to extract just the text of the Tweets (though of course it would have been even simpler to use the `strings()` method of the corpus reader).

In [60]:
from nltk.twitter.util import json2csv
with open(input_file) as fp:
    json2csv(fp, 'tweets_text.csv', ['text'])

We've passed the filename `'tweets_text.csv'` as the second argument of `json2csv()`. Unless you provide a complete pathname, the file will be created in the directory where you are currently executing Python.

If you open the file `'tweets_text.csv'`, the first 5 lines should look as follows:

```
RT @KirkKus: Indirect cost of the UK being in the EU is estimated to be costing Britain £170 billion per year! #BetterOffOut #UKIP
VIDEO: Sturgeon on post-election deals http://t.co/BTJwrpbmOY
RT @LabourEoin: The economy was growing 3 times faster on the day David Cameron became Prime Minister than it is today.. #BBCqt http://t.co…
RT @GregLauder: the UKIP east lothian candidate looks about 16 and still has an msn addy http://t.co/7eIU0c5Fm1
RT @thesundaypeople: UKIP's housing spokesman rakes in £800k in housing benefit from migrants.  http://t.co/GVwb9Rcb4w http://t.co/c1AZxcLh…
```

However, in some applications you may want to work with Tweet metadata, e.g., the creation date and the user. As mentioned earlier, all the fields of a Tweet object are described in [the official Twitter API](https://dev.twitter.com/overview/api/tweets). 

The third argument of `json2csv()` can be specified so that the function selects relevant parts of the metadata. For example, the following will generate a CSV file including most of the metadata together with the id of the user who has published it.

In [63]:
with open(input_file) as fp:
    json2csv(fp, 'tweets.20150430-223406.tweet.csv',
            ['created_at', 'favorite_count', 'id', 'in_reply_to_status_id', 
            'in_reply_to_user_id', 'retweet_count', 'retweeted', 
            'text', 'truncated', 'user.id'])

In [64]:
for line in open('tweets.20150430-223406.tweet.csv').readlines()[:5]:
    print(line)

created_at,favorite_count,id,in_reply_to_status_id,in_reply_to_user_id,retweet_count,retweeted,text,truncated,user.id

Thu Apr 30 21:34:06 +0000 2015,0,593891099434983425,,,0,False,RT @KirkKus: Indirect cost of the UK being in the EU is estimated to be costing Britain £170 billion per year! #BetterOffOut #UKIP,False,107794703

Thu Apr 30 21:34:06 +0000 2015,0,593891099548094465,,,0,False,VIDEO: Sturgeon on post-election deals http://t.co/BTJwrpbmOY,False,557422508

Thu Apr 30 21:34:06 +0000 2015,0,593891099388846080,,,0,False,RT @LabourEoin: The economy was growing 3 times faster on the day David Cameron became Prime Minister than it is today.. #BBCqt http://t.co…,False,3006692193

Thu Apr 30 21:34:06 +0000 2015,0,593891100429045760,,,0,False,RT @GregLauder: the UKIP east lothian candidate looks about 16 and still has an msn addy http://t.co/7eIU0c5Fm1,False,455154030



The first nine elements of the list are attributes of the Tweet, while the last one, `user.id`, takes the user object associated with the Tweet, and retrieves the attributes in the list (in this case only the id). The object for the Twitter user is described in the  [Twitter API for users](https://dev.twitter.com/overview/api/users).
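
The dotted notation in `user.id` can be pictured as a path through nested dictionaries. The sketch below shows one way such a path could be resolved and the results written as CSV with the standard library. It is an illustration of the idea rather than the code behind `json2csv()`; `get_field()` is a hypothetical helper and the two records are made up.

```python
import csv
import io


def get_field(tweet, dotted):
    """Follow a dotted path such as 'user.id' through nested dictionaries."""
    value = tweet
    for key in dotted.split("."):
        if not isinstance(value, dict):
            return None  # the path runs past a non-dictionary value
        value = value.get(key)
    return value


# Made-up records with a nested user object.
sample_tweets = [
    {"id": 1, "text": "hello", "user": {"id": 100}},
    {"id": 2, "text": "world", "user": {"id": 200}},
]
fields = ["id", "text", "user.id"]

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(fields)  # header row
for tweet in sample_tweets:
    writer.writerow([get_field(tweet, field) for field in fields])

csv_text = out.getvalue()
print(csv_text)
```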

The rest of the metadata of the Tweet are the so-called [entities](https://dev.twitter.com/overview/api/entities) and [places](https://dev.twitter.com/overview/api/places). The following examples show how to get each of those entities. They all include the id of the Tweet as the first field, and some of them also include the text for clarity.

In [None]:
from nltk.twitter.util import json2csv_entities
with open(input_file) as fp:
    json2csv_entities(fp, 'tweets.20150430-223406.hashtags.csv',
                        ['id', 'text'], 'hashtags', ['text'])
    
with open(input_file) as fp:
    json2csv_entities(fp, 'tweets.20150430-223406.user_mentions.csv',
                        ['id', 'text'], 'user_mentions', ['id', 'screen_name'])
    
with open(input_file) as fp:
    json2csv_entities(fp, 'tweets.20150430-223406.media.csv',
                        ['id'], 'media', ['media_url', 'url'])
    
with open(input_file) as fp:
    json2csv_entities(fp, 'tweets.20150430-223406.urls.csv',
                        ['id'], 'urls', ['url', 'expanded_url'])
    
with open(input_file) as fp:
    json2csv_entities(fp, 'tweets.20150430-223406.place.csv',
                        ['id', 'text'], 'place', ['name', 'country'])

with open(input_file) as fp:
    json2csv_entities(fp, 'tweets.20150430-223406.place_bounding_box.csv',
                        ['id', 'name'], 'place.bounding_box', ['coordinates'])

Additionally, when a Tweet is actually a retweet, the original tweet can also be fetched from the same file, as follows:

In [None]:
with open(input_file) as fp:
    json2csv_entities(fp, 'tweets.20150430-223406.original_tweets.csv',
                        ['id'], 'retweeted_status', ['created_at', 'favorite_count', 
                        'id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'retweet_count',
                        'text', 'truncated', 'user.id'])

Here the first id corresponds to the retweeted Tweet, and the second id to the original Tweet.
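
In terms of the underlying dictionaries, a retweet carries the original Tweet under the `'retweeted_status'` key. A minimal sketch of pairing the two ids, with made-up records and a hypothetical helper `retweet_pair()`, might look like this:

```python
def retweet_pair(tweet):
    """Return (retweet id, original id), or None if this is not a retweet."""
    original = tweet.get("retweeted_status")
    if not original:
        return None
    return tweet["id"], original["id"]


# Made-up records: one ordinary Tweet and one retweet.
plain = {"id": 10, "text": "an ordinary tweet"}
retweet = {
    "id": 11,
    "text": "RT @someone: hello",
    "retweeted_status": {"id": 7, "text": "hello"},
}

pairs = [retweet_pair(tweet) for tweet in (plain, retweet)]
print(pairs)
```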

### Using Dataframes

Sometimes it's convenient to manipulate CSV files as tabular data, and this is made easy with the [Pandas](http://pandas.pydata.org/) data analysis library. `pandas` is not currently one of the dependencies of NLTK, and you will probably have to install it separately.

Here is an example of how to read a CSV file into a `pandas` dataframe. We use the `head()` method of a dataframe to just show the first 5 rows.

In [68]:
import pandas as pd
tweets = pd.read_csv('tweets.20150430-223406.tweet.csv', index_col=2, header=0, encoding="utf8")
tweets.head(5)

| id | created_at | favorite_count | in_reply_to_status_id | in_reply_to_user_id | retweet_count | retweeted | text | truncated | user.id |
|---|---|---|---|---|---|---|---|---|---|
| 593891099434983425 | Thu Apr 30 21:34:06 +0000 2015 | 0 | | | 0 | False | RT @KirkKus: Indirect cost of the UK being in ... | False | 107794703 |
| 593891099548094465 | Thu Apr 30 21:34:06 +0000 2015 | 0 | | | 0 | False | VIDEO: Sturgeon on post-election deals http://... | False | 557422508 |
| 593891099388846080 | Thu Apr 30 21:34:06 +0000 2015 | 0 | | | 0 | False | RT @LabourEoin: The economy was growing 3 time... | False | 3006692193 |
| 593891100429045760 | Thu Apr 30 21:34:06 +0000 2015 | 0 | | | 0 | False | RT @GregLauder: the UKIP east lothian candidat... | False | 455154030 |
| 593891100768784384 | Thu Apr 30 21:34:07 +0000 2015 | 0 | | | 0 | False | RT @thesundaypeople: UKIP's housing spokesman ... | False | 187547338 |


Using the dataframe it is easy, for example, to first select Tweets with a specific user ID and then retrieve their `'text'` value.

In [69]:
tweets.loc[tweets['user.id'] == 557422508]['text']

id
593891099548094465    VIDEO: Sturgeon on post-election deals http://...
593891101766918144    SNP leader faces audience questions http://t.c...
Name: text, dtype: object

## Expanding a list of Tweet IDs

Because the Twitter Terms of Service place severe restrictions on the distribution of Tweets by third parties, a workaround is to instead distribute just the Tweet IDs, which are not subject to the same restrictions. The method `expand_tweetids()` sends a request to the Twitter API to return the full Tweet (in Twitter's terminology, a *hydrated* Tweet) that corresponds to a given Tweet ID. 

Since Tweets can be deleted by users, it's possible that certain IDs will only retrieve a null value. For this reason, it's safest to use a `try`/`except` block when retrieving values from the fetched Tweet. 

In [72]:
from io import StringIO  # standard-library StringIO, as shown in the output below
ids_f =\
    StringIO("""\
    588665495492124672
    588665495487909888
    588665495508766721
    588665495513006080
    588665495517200384
    588665495487811584
    588665495525588992
    588665495487844352
    88665495492014081
    588665495512948737""")
    
oauth = credsfromfile()
client = Query(**oauth)
hydrated = client.expand_tweetids(ids_f)

for tweet in hydrated:
    try:
        id_str = tweet['id_str']
        print('id: {}\ntext: {}\n'.format(id_str, tweet['text']))
    except (KeyError, TypeError):  # missing field, or a null value for a deleted Tweet
        pass 

Counted 10 Tweet IDs in <_io.StringIO object at 0x1120aadc8>.
id: 588665495508766721
text: RT @30SecFlghts: Yep it was bad from the jump https://t.co/6vsFIulyRB

id: 588665495487811584
text: @8_s2_5 おかえりなさいまし

id: 588665495492124672
text: O link http://t.co/u8yh4xdIAF por @YouTube é o tweet mais popular hoje na minha feed.

id: 588665495487844352
text: RT @dam_anison: 【アニサマ2014 LIVEカラオケ⑤】
μ'sのライブ映像がDAMに初登場！それは「それは僕たちの奇跡」！
μ's結成から5年間の"キセキ"を噛み締めながら歌いたい！
→http://t.co/ZCAB7jgE4L #anisama http:…

id: 588665495513006080
text: @null 188049573

id: 588665495525588992
text: 坂道の時に限って裏の車がめっちゃ車間距離近づけて停めてくるから死ぬかと思った

id: 588665495512948737
text: Christina Grimmie #RisingStar
17

id: 588665495487909888
text: Dolgun Dudaklı Kadınların Çok İyi Bildiği 14 Şey http://t.co/vvEzTlqWOv http://t.co/dsWke4uXQ3



Although we provided the list of IDs as a string in the above example, the standard use case is to pass a file-like object as the argument to `expand_tweetids()`. 
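
As a rough illustration of the file-like use case, the sketch below reads IDs line by line and then skips any result that comes back empty. It does not call the Twitter API: `read_ids()` is a hypothetical helper, and `fetched` is a made-up stand-in for hydrated results in which `None` represents a Tweet that is no longer available.

```python
import io


def read_ids(ids_f):
    """Collect non-empty, stripped lines from a file-like object."""
    return [line.strip() for line in ids_f if line.strip()]


# Any object that yields lines works here, e.g. an open text file.
ids_f = io.StringIO("588665495492124672\n\n   588665495487909888\n")
ids = read_ids(ids_f)

# Made-up lookup results; None stands for a deleted Tweet.
fetched = [{"id_str": ids[0], "text": "placeholder text"}, None]

shown = []
for tweet in fetched:
    try:
        shown.append((tweet["id_str"], tweet["text"]))
    except (KeyError, TypeError):
        continue  # skip null results and records without the expected fields

print(ids, shown)
```

Catching both `KeyError` and `TypeError` covers a record that lacks a field as well as a result that is `None`.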