# CPS600 - Python Programming for Finance 
<img src="https://www.syracuse.edu/wp-content/themes/g6-carbon/img/syracuse-university-seal.svg?ver=6.3.9" style="width: 200px;"/>

## Scraping & Crawling, Cont'd

###  October 23, 2018

We'll look at some more tools for pulling data down off the web.

**REGEX**

Let's have a brief look at *regular expressions*. This is another tool that will serve you well in wrangling real-world data, particularly text data. This discussion of scraping is as good a place as any for a review of regexes.

In [55]:
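import requests
import lxml.html as lh

tedURL = "https://www.washingtonpost.com/wp-srv/national/longterm/unabomber/manifesto.text.htm?noredirect=on"
tedMan = requests.get(tedURL)
# One string per <p> element, stripped of surrounding whitespace
tedParags =  [x.text_content().strip() for x in lh.fromstring(tedMan.text).xpath('//p')]
# Join the paragraphs into one long string, marking each boundary with "PARAG"
tedText = 'PARAG'.join(tedParags)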

We didn't strictly need that last step, but the inserted "PARAG" markers will give us something to search for. Now we have a really long string. Let's look for patterns in it.

>A regular expression (or RE) specifies a set of strings that matches it

See the rest of the documentation [here](https://docs.python.org/3/library/re.html). Let's look at some examples. We inserted those "PARAG" strings. What follows them?

The example below uses a *lookbehind* pattern, `(?<=PARAG)`: we don't extract the "PARAG" itself, only the text that follows it.

In [None]:
import re
parags = re.findall(r'(?<=PARAG).*', tedText)  # raw string, so backslashes reach the regex engine untouched
parags

That's pretty close to what I wanted. Note that most of these are paragraphs with the author's original numbering.

In [None]:
parags[15]

Why does it just stop right there? From the documentation:

> `.` matches any character except a newline.

That explains it. So what if we wanted just the first word following a "PARAG"? We could do, for instance

In [None]:
import re
parags = re.findall(r'(?<=PARAG)\S*', tedText)  # \S* = a run of non-whitespace characters
parags

Not bad. What about the next couple of words? (**exercise**). There is (almost) no limit to what you can express with REs.
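
To make the lookbehind and the newline behaviour concrete, here is a small sketch on a made-up string (`toy_text` is a hypothetical stand-in for `tedText`, not the real document). It contrasts a lookbehind with a lookahead, and shows how the `re.DOTALL` flag lets `.` cross newlines.

```python
import re

# Made-up stand-in for tedText: three "paragraphs" joined by the marker,
# with a newline inside the second one.
toy_text = "PARAG".join([
    "1. First paragraph.",
    "2. Second one,\ncontinued here.",
    "3. Third.",
])

# Lookbehind: match what FOLLOWS "PARAG". '.' stops at the newline.
after_marker = re.findall(r'(?<=PARAG).*', toy_text)
print(after_marker)   # ['2. Second one,', '3. Third.']

# Lookahead: match the non-whitespace run that is FOLLOWED BY "PARAG".
before_marker = re.findall(r'\S+(?=PARAG)', toy_text)
print(before_marker)  # ['paragraph.', 'here.']

# With re.DOTALL, '.' also matches newlines. The lazy '.*?' plus a
# lookahead keeps each match from running past the next marker.
whole_parags = re.findall(r'(?<=PARAG).*?(?=PARAG|$)', toy_text, flags=re.DOTALL)
print(whole_parags)   # ['2. Second one,\ncontinued here.', '3. Third.']
```

Neither kind of assertion consumes the marker itself, which is why "PARAG" never shows up in the results.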

**Scrapy**

Let's look at a tool for automating web scraping.
>Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.

Let's step through the tutorial.

1. Create a new terminal in Jupyter (or otherwise open a new terminal)
2. Enter `scrapy startproject tutorial` (or another name for your tutorial project)
3. Copy or type the below code into a text file that we'll call `quotes_spider.py`. Create that file in `tutorial/tutorial/spiders`

In [51]:
# Class definition for your first scrapy spider
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename) # old-style % string formatting

Remarks on these definitions:
* `name` identifies the Spider. It must be unique within a project, that is, you can’t set the same name for different Spiders.

* `start_requests()` must return an iterable of Requests (you can return a list of requests or write a generator function) which the Spider will begin to crawl from. Subsequent requests will be generated successively from these initial requests.

* `parse()` is a method that will be called to handle the response downloaded for each of the requests made. The response parameter is an instance of `TextResponse` that holds the page content and has further helpful methods to handle it.
The `parse()` method usually parses the response, extracting the scraped data as dicts and also finding new URLs to follow and creating new requests (`Request`) from them.

4. Go to the top-level directory (`tutorial`) and run `scrapy crawl quotes`.
5. Look around (`ls`) the directory. Examine the new files.

What you'll notice about this example is that we really didn't do any parsing. Let's fix that by updating our spider. Replace the parse method definition in `quotes_spider.py` with the new one below:

(*Remark* you can use the scrapy shell to play with the methods used below: `scrapy shell "some_url.com"`)

In [None]:
def parse(self, response):
    for quote in response.css('div.quote'):
        yield {
            'text': quote.css('span.text::text').extract_first(),
            'author': quote.css('small.author::text').extract_first(),
            'tags': quote.css('div.tags a.tag::text').extract(),
        }


Finally, run the command `scrapy crawl quotes -o quotes.json` to extract data from these pages and store the parsed results in the file `quotes.json`.

Then take a look (e.g. `cat quotes.json`).

We are sort of scraping, but there is no *crawling* to speak of. What does that mean, crawling? Let's see how to follow links.

Replace the contents of `quotes_spider.py` with the code below. Note that this one also uses the `start_urls` shortcut (which you can read about on the tutorial page).

In [53]:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

Notice what has changed: we have extracted links and included logic to move to the next one in the list. More precisely...
>Now, after extracting the data, the `parse()` method looks for the link to the next page, builds a full absolute URL using the `urljoin()` method (since the links can be relative) and yields a new request to the next page, registering itself as callback to handle the data extraction for the next page and to keep the crawling going through all the pages.
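
The URL-joining step can be tried without running a spider. As far as the Scrapy documentation describes it, `response.urljoin()` is a thin wrapper around the standard library's `urllib.parse.urljoin`, using the response's own URL as the base. The sketch below uses the standard library function directly; `page_url` and `next_href` are example values chosen to resemble what the spider sees.

```python
from urllib.parse import urljoin

# The page we are currently parsing (plays the role of response.url)
page_url = 'http://quotes.toscrape.com/page/1/'

# A relative link, like the href extracted from 'li.next a'
next_href = '/page/2/'

# Combine the two into an absolute URL that can be requested
absolute_url = urljoin(page_url, next_href)
print(absolute_url)  # http://quotes.toscrape.com/page/2/

# A link that is already absolute is returned unchanged
unchanged = urljoin(page_url, 'http://example.com/other/')
print(unchanged)     # http://example.com/other/
```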

What else can you do? Here is one example of a slightly more advanced spider that scrapes author information:

In [None]:
import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'

    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author + a::attr(href)'):
            yield response.follow(href, self.parse_author)

        # follow pagination links
        for href in response.css('li.next a::attr(href)'):
            yield response.follow(href, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).extract_first().strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }

From the tutorial:
>This spider will start from the main page, it will follow all the links to the authors pages calling the `parse_author` callback for each of them, and also the pagination links with the `parse` callback as we saw before.

**Twitter!**

Before anything, you will have to [sign up for developer access](https://developer.twitter.com/en/apply-for-access).

We'll use *Python Twitter Tools* from *sixohsix* to get some data from Twitter. In order to use this notebook, you must install the python package `twitter`

In order to make this work, you will need to store your keys and tokens as a JSON dictionary in a file called `auth_dict`, with the keys `consumer_key`, `consumer_secret`, `token` and `token_secret`.

We are going to need the `twitter` package. You can download the package [here](https://github.com/sixohsix/twitter) and use `easy_install`. But you can also find it on [*PyPI*](https://pypi.org/), i.e. you can install it using `pip`.

To go the latter route, just make sure your conda environment is activated and then run:

` pip install twitter `

In [3]:
from twitter import *

In [4]:
import json
with open('auth_dict','r') as f:
    twtr_auth = json.load(f)
    
# To make it more readable, let's store
# the OAuth credentials in strings first.
CONSUMER_KEY = twtr_auth['consumer_key']
CONSUMER_SECRET = twtr_auth['consumer_secret']
OAUTH_TOKEN = twtr_auth['token']
OAUTH_TOKEN_SECRET = twtr_auth['token_secret']
    
# Then, we store the OAuth object in "auth"
auth = OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
                           CONSUMER_KEY, CONSUMER_SECRET)
# Notice that there are four tokens - you need to create these in the
# Twitter Apps dashboard after you have created your own "app".

Here we create a `twitter` API wrapper object.

In [6]:
# We now create the twitter search object.
t = Twitter(auth=auth)

But what is it, exactly?

In [None]:
help(t)

Let's get some *statuses* from my timeline.

In [8]:
my_tmln = t.statuses.home_timeline()

[Here](https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object.html) is the Twitter.com documentation on statuses.

Here I get the timeline of another user, Zed Shaw. This might not work for you if you are not a follower of his.

In [18]:
zeds_tmln = t.statuses.user_timeline(screen_name="zedshaw")

In case that fails, here is another account.

In [19]:
ppoly_tmln = t.statuses.user_timeline(screen_name="primalpoly")

In [None]:
ppoly_tmln[0]

OK, now we'll break it down, starting from a more basic request.

In [10]:
docs_example = """GET https://api.twitter.com/1.1/search/tweets
                .json?q=%23freebandnames&since_id=2401261998405
                1000&max_id=250126199840518145&result_type=mixed&count=4"""

Note that we used triple quotes (i.e. 3 double quotes) for this multiline string. It is still just a string. What is going on in this string?

The first piece of information in the string *docs_example* that stands out is "freebandnames". That's a search term, but what is its effect? It is preceded by "%23", which is the URL-encoded (percent-encoded) form of the "#" character, so the query is for the *hashtag* #freebandnames. See [this](https://brajeshwar.github.io/entities/) for more on character entities.

Rmk: [There is no such thing as plaintext.](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/)

Next, we see "since_id" - we are searching for tweets tweeted since a given tweet ID number.

The "max_id" gives the other bound - we want no tweets with IDs greater than this value.

We are asking for "mixed" results, that is, both popular and recent tweets.

Count refers to results per page.
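
To see where the "%23" comes from, the query string can be assembled with the standard library. This is only a sketch of the encoding step: it builds the URL text and does not contact Twitter. The parameter values are the ones from *docs_example*.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

# '#' is not allowed as-is inside a query string, so it is percent-encoded
print(quote('#'))  # %23

# The same parameters as in docs_example, as an ordinary dictionary
params = {
    'q': '#freebandnames',
    'since_id': 24012619984051000,
    'max_id': 250126199840518145,
    'result_type': 'mixed',
    'count': 4,
}

# urlencode percent-encodes each value and joins the pairs with '&'
query_url = 'https://api.twitter.com/1.1/search/tweets.json?' + urlencode(params)
print(query_url)

# Going the other way recovers the original search term
decoded_q = parse_qs(urlsplit(query_url).query)['q']
print(decoded_q)   # ['#freebandnames']
```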

**But we don't want to build such a string every time we do a query, and that is one reason for the API wrapper provided by `twitter`**

In [21]:
frebandnames = t.search.tweets(q='#freebandnames', since_id=24012619984051000,
                result_type='mixed', count=4)

In [None]:
frebandnames.keys()

In [23]:
freeb = frebandnames['statuses']

In [None]:
freeb[0]['user']['screen_name']

**More Examples**

In [None]:
# Get your "home" timeline
t.statuses.home_timeline()

# Get a particular friend's timeline
t.statuses.user_timeline(screen_name="zedshaw")

# to pass in GET/POST parameters, such as `count`
t.statuses.home_timeline(count=5)

# to pass in the GET/POST parameter `id` you need to use `_id`
t.statuses.oembed(_id=1234567890)

# Update your status
t.statuses.update(
    status="Here is another tweet.")

# Send a direct message
#t.direct_messages.new(
#    user="primalpoly",
#    text="Geoffy-baby, big fan of your work.") # Try not to spam him, guys

# Get the members of a list: here, the list "the-buffer-team" owned by @buffer
t.lists.members(owner_screen_name="buffer", slug="the-buffer-team")

# An *optional* `_timeout` parameter can also be used for API
# calls which take much more time than normal or twitter stops
# responding for some reason:
#t.users.lookup(
#    screen_name=','.join(A_LIST_OF_100_SCREEN_NAMES), _timeout=1)

# Rmk: A_LIST_OF_100_SCREEN_NAMES is not defined. Why not fix that?

# Overriding Method: GET/POST
# You should not need to use this, as the library properly
# detects whether GET or POST should be used. Nevertheless,
# to force a particular method, use `_method`
t.statuses.oembed(_id=1234567890, _method='GET')

In [None]:
# Search for the latest tweets about #pycon
t.search.tweets(q="#pycon")

# Search for the latest tweets about #pycon, using extended mode
t.search.tweets(q="#pycon", tweet_mode='extended')

We can search trends by geographic region. The Yahoo! Where On Earth ID for the entire world is 1, for example. Look [here](https://developer.twitter.com/en/docs/trends/trends-for-location/api-reference/get-trends-place) for more.

**Streaming**

In [None]:
# Rmk: this cell will not run as written! The "..." stands for your credentials.
twitter_stream = TwitterStream(auth=OAuth(...))
iterator = twitter_stream.statuses.sample()

for tweet in iterator:
    #...do something with this tweet...
    pass

In [None]:
# Create a *streaming* connection (not RESTful, different from Search).
t_stream = TwitterStream(auth=auth)


# Get an *iterator* object from the twitter wrapper

tweeterator = t_stream.statuses.sample()


# The loop below simply prints randomly selected new tweets
# until we reach the threshold of "tweet_count"

tweet_count = 100
for tweet in tweeterator:
    tweet_count -= 1
    print(json.dumps(tweet))  
    if tweet_count <= 0:
        break 

In [26]:
WORLD_WOE_ID = 1
US_WOE_ID = 23424977

Let's get the trends (top 50) for the whole world, and then for the US.

In [None]:
# Prefix ID with the underscore for query string parameterization.
# Without the underscore, the twitter package appends the ID value
# to the URL itself as a special case keyword argument.

world_trends = t.trends.place(_id=WORLD_WOE_ID)
us_trends = t.trends.place(_id=US_WOE_ID)

print(world_trends)
print('\n')
print(us_trends)

That is hard to read, so now we "pretty print" it using the `json` module. Calling `json.dumps` with the `indent` argument returns a string in which every nested level sits on its own indented line, and the builtin *print* then displays that string as is.

In [None]:
list(us_trends)[0]['trends'][4]['name']

In [None]:
#print(json.dumps(world_trends, indent=1))
print('\n')
print(json.dumps(us_trends, indent=1))
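
Here is a small offline sketch of what `json.dumps` does. The `toy_trends` structure is made up for illustration; it only imitates the nesting of a trends response and contains no real data.

```python
import json

# Made-up data, shaped roughly like a trends response
toy_trends = [
    {
        'trends': [{'name': '#Example', 'tweet_volume': None}],
        'locations': [{'name': 'Worldwide', 'woeid': 1}],
    }
]

# Printing the object directly puts everything on one line
print(toy_trends)

# json.dumps returns a *string*; indent=1 puts each nested
# level on its own line, indented by one space per level
pretty = json.dumps(toy_trends, indent=1)
print(pretty)

# The string is valid JSON, so it can be parsed back into the same object.
# Note that Python's None is written as JSON null.
round_trip = json.loads(pretty)
```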

Now that we have both results in memory, we can combine them. For example, we find the intersection of the top world trends and the top USA trends.

In [None]:
# Why bother with "set" here?
world_trends_set = set([trend['name'] for trend in world_trends[0]['trends']])

us_trends_set = set([trend['name'] for trend in us_trends[0]['trends']]) 

common_trends = world_trends_set.intersection(us_trends_set) # intersection is a set method

print(common_trends)
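
One answer to the question in the comment, sketched with made-up trend names (these are placeholders, not real trends): a set removes duplicates and supports operations such as intersection and difference directly, while a list comprehension keeps repeats and needs a scan of the second list for every item.

```python
# Made-up trend names; note the repeated '#Beta' in the first list
world_names = ['#Alpha', '#Beta', '#Gamma', '#Beta']
us_names = ['#Beta', '#Delta', '#Gamma']

# Set intersection: each common name appears once
common = set(world_names) & set(us_names)   # same as .intersection()
print(sorted(common))    # ['#Beta', '#Gamma']

# Set difference: names trending in the US list only
only_us = set(us_names) - set(world_names)
print(sorted(only_us))   # ['#Delta']

# The list version keeps the duplicate
common_list = [name for name in world_names if name in us_names]
print(common_list)       # ['#Beta', '#Gamma', '#Beta']
```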

Next, let's collect search results in a loop. Note that we are already able to do this with the Search API and that the Streaming API will be needed for larger volume.

The *count* parameter represents the number of tweets we want *per page*.

This is important because the Twitter REST API uses *cursoring* to organize large search results in *pages*. The next example shows how to use the cursor. Read more about it [here](https://developer.twitter.com/en/docs/basics/cursoring)

In [None]:
from urllib.parse import parse_qsl

# Set the next variable to any trending topic
# or anything else for that matter.
q = '"Super Bowl"' 

# Same as before - the number of tweets we want *per page*.
# The standard search API returns at most 100 per request.
count = 100

# We do the search call.
search_results = t.search.tweets(q=q, count=count)

# Remember 'status' refers to the actual content of a tweet.
# (as opposed to the metadata)
statuses = search_results['statuses']


# Iterate through 5 more batches of results by following the cursor.
# The use of the underscore "_" below is a convention in python 
# indicating to the reader that the variable will not be used for
# anything within the loop.
for _ in range(5):
    print("Length of statuses ", len(statuses))
    # Remember "try...except"? Here it is in action:
    try:
        next_results = search_results['search_metadata']['next_results']
    # No more results when next_results doesn't exist
    except KeyError as e: 
        break
        
    # Create a dictionary from next_results, which has the following form:
    # ?max_id=313519052523986943&q=NCAA&include_entities=1
    # parse_qsl also decodes percent-escapes (e.g. %22 becomes "), so the
    # values are not URL-encoded a second time on the next call.
    kwargs = dict(parse_qsl(next_results[1:]))
    
    search_results = t.search.tweets(**kwargs) #Another API call
    statuses += search_results['statuses'] #Appending to statuses results

# Show one sample search result by slicing the list...
print(json.dumps(statuses[0], indent=1))

Note the "except" line in the cell above. Including "as e" after the "KeyError" lets you access the attributes of the error - we didn't do that, but you might find a need for it. Inside the `except` block, you could inspect the object `e` and read its attributes, such as `e.args`.

Instead, we simply break out of the loop.
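
A minimal offline sketch of what `e` holds. The `search_metadata` dictionary here is made up and deliberately has no `'next_results'` key.

```python
# Made-up metadata with no 'next_results' entry
search_metadata = {'count': 100, 'query': 'example'}

try:
    next_results = search_metadata['next_results']
except KeyError as e:
    # For a KeyError, e.args is a tuple holding the key that was missing
    missing_key = e.args[0]
    next_results = None
    print('No key named', missing_key)

# dict.get is an alternative when a missing key is expected:
# it returns None (or a default of your choice) instead of raising
next_results_alt = search_metadata.get('next_results')
```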

In the next example, we extract text, screen names and hashtags from tweets.

In [None]:
len(statuses)

In [None]:
status_texts = [ status['text'] for status in statuses ]

screen_names = [ user_mention['screen_name'] for status in statuses
                     for user_mention in status['entities']['user_mentions'] ]

hashtags = [ hashtag['text'] for status in statuses
                 for hashtag in status['entities']['hashtags'] ]

# Compute a collection of all words from all tweets
words = [ w for t in status_texts for w in t.split() ]

# Explore the first 5 items for each...

print(json.dumps(status_texts[0:5], indent=1))
print(json.dumps(screen_names[0:5], indent=1))
print(json.dumps(hashtags[0:5], indent=1))
print(json.dumps(words[0:5], indent=1))

We can create a basic frequency distribution from words in tweets - first glimpse at NLP

In [None]:
from collections import Counter

# Recall that you can loop through any
# iterable in python - such as a list
# of lists!
for item in [words, screen_names, hashtags]:
    c = Counter(item)
    print(c.most_common(10)) # top 10

In [None]:
# A function for computing lexical diversity
def lexical_diversity(tokens):
    return 1.0*len(set(tokens))/len(tokens) 

# A function for computing the average number of words per tweet
def average_words(statuses):
    total_words = sum([ len(s.split()) for s in statuses ]) 
    return total_words/len(statuses)

print(lexical_diversity(words))
print(lexical_diversity(screen_names))
print(lexical_diversity(hashtags))
print(average_words(status_texts))
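
To check the arithmetic by hand, here is the same computation on two made-up "tweets" (`toy_tweets` is a placeholder, not collected data). Lexical diversity is the number of distinct tokens divided by the total number of tokens.

```python
from collections import Counter

# Two made-up tweets
toy_tweets = ["the cat sat on the mat", "the end"]

# All words from all tweets, in order
toy_words = [w for tweet in toy_tweets for w in tweet.split()]

# 8 tokens in total, 6 of them distinct ("the" appears 3 times)
diversity = len(set(toy_words)) / len(toy_words)
print(diversity)   # 0.75

# Average number of words per tweet: 8 words over 2 tweets
avg_words = len(toy_words) / len(toy_tweets)
print(avg_words)   # 4.0

# The most frequent word and its count
top_word = Counter(toy_words).most_common(1)
print(top_word)    # [('the', 3)]
```

A value near 1 means almost every token is unique; a value near 0 means heavy repetition.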

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

# First, we get the frequencies in order.
word_counts = sorted(Counter(words).values(), reverse=True)

# We can plot it along log-scaled axes
plt.loglog(word_counts)
plt.ylabel("Freq")
plt.xlabel("Word Rank")
plt.show()

Here that is in `bokeh`:

In [None]:
from bokeh.plotting import figure, show, output_notebook
output_notebook()

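p = figure(title="log axis example", y_axis_type="log", x_axis_type='log',
           y_range=(.9, 400))

# Ranks start at 1, because 0 cannot be placed on a log axis.
ranks = list(range(1, len(word_counts) + 1))

# Recent Bokeh releases use `legend_label` in place of the older `legend` keyword.
p.line(ranks, word_counts, legend_label="word counts",
       line_color="blue")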
show(p)

We can also do histograms.

In [None]:
for label, data in (('Words', words), 
                    ('Screen Names', screen_names), 
                    ('Hashtags', hashtags)):

    # Build a frequency map for each set of data
    # and plot the values
    c = Counter(data)
    plt.hist(c.values())
    
    # Add a title and y-label ...
    plt.title(label)
    plt.ylabel("Number of items in bin")
    plt.xlabel("Bins (number of times an item appeared)")
    
    # ... and display as a new figure
    plt.show()

**Exercise** Do the above in `bokeh`

And now for something completely different...finding the most popular retweets

In [None]:
retweets = [
            # We are building a list of tuples
            (status['retweet_count'], 
             status['retweeted_status']['user']['screen_name'],
             status['text'],
             status['retweeted_status']) 
            
            # ... for each status ...
            for status in statuses 
            
            # ... so long as the status meets this condition.
                if 'retweeted_status' in status
           ]

In [None]:
retweets
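
The list built above is not yet ordered by popularity. The sketch below shows the filter and the sort together on three made-up statuses (`toy_statuses` is hypothetical and carries only the fields used here).

```python
# Three made-up statuses: two retweets and one original tweet
toy_statuses = [
    {'text': 'RT @a: hello', 'retweet_count': 12,
     'retweeted_status': {'id': 1, 'user': {'screen_name': 'a'}}},
    {'text': 'an original tweet', 'retweet_count': 0},
    {'text': 'RT @b: world', 'retweet_count': 40,
     'retweeted_status': {'id': 2, 'user': {'screen_name': 'b'}}},
]

# Keep only statuses that carry a 'retweeted_status' node
toy_retweets = [
    (s['retweet_count'], s['retweeted_status']['user']['screen_name'], s['text'])
    for s in toy_statuses
    if 'retweeted_status' in s
]

# Sort on the count alone (the first element), largest first.
# Using a key avoids comparing the other tuple elements on ties.
most_popular = sorted(toy_retweets, key=lambda r: r[0], reverse=True)
print(most_popular[0])  # (40, 'b', 'RT @b: world')
```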

Looking up users who have retweeted a 'status'.

First, let's pick one of the retweets. We will find the id of the original tweet from the 'retweeted_status' node (the last element of our tuple).

In [None]:
retweets[0][-1]['id']

This is the id of the retweeted tweet. We will use this to search Twitter again.

In [None]:
# Note that we are making another API call here (statuses/retweets, not a search).
rtwtd_id = 1047848885073403904 # Fill this in
_retweets = t.statuses.retweets(id=rtwtd_id)
print([r['user']['screen_name'] for r in _retweets])

In [None]:
counts = [count for count, _, _, _ in retweets]

plt.hist(counts)
plt.title("Retweets")
plt.xlabel('Bins (number of times retweeted)')
plt.ylabel('Number of tweets in bin')

print(counts)
plt.show()

**Twitter Functions**

We want to reuse the code from the cells above. Also, we want to write clean and readable code when we implement longer and more complex procedures. Here is a function we have already used:

In [24]:
import json
from twitter import OAuth, Twitter

def oauth_login(auth_file='auth_dict'):
    # Load the credentials from the same JSON file we read at the
    # top of this notebook, then build the API wrapper object.
    with open(auth_file, 'r') as f:
        creds = json.load(f)
    auth = OAuth(creds['token'], creds['token_secret'],
                 creds['consumer_key'], creds['consumer_secret'])
    twitter_api = Twitter(auth=auth)
    return twitter_api

Next, we can wrap the lines that give us trends in a function.

In [25]:
def twitter_trends(twitter_api, woe_id):
    # Prefix ID with the underscore for query string parameterization.
    # Without the underscore, the twitter package appends the ID value
    # to the URL itself as a special-case keyword argument.
    return twitter_api.trends.place(_id=woe_id)

Likewise, we define a function for the looped twitter search.

In [26]:
from urllib.parse import parse_qsl

def twitter_search(twitter_api, q, max_results=200, **kw):
    search_results = twitter_api.search.tweets(q=q, count=100, **kw)
    statuses = search_results['statuses']
    # Enforce a reasonable limit
    max_results = min(1000, max_results)
    for _ in range(10): # 10*100 = 1000
        if len(statuses) >= max_results:
            break
        try:
            next_results = search_results['search_metadata']['next_results']
        except KeyError as e: # No more results when next_results doesn't exist
            break
        # Create a dictionary from next_results, which has the following form:
        # ?max_id=313519052523986943&q=NCAA&include_entities=1
        # parse_qsl decodes percent-escapes so values are not encoded twice.
        kwargs = dict(parse_qsl(next_results[1:]))
        search_results = twitter_api.search.tweets(**kwargs)
        statuses += search_results['statuses']
    
    # Trim any overshoot from the last page
    return statuses[:max_results]
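
The `next_results` value handled in `twitter_search` is an ordinary URL query string, so it can be explored offline. The sketch below compares two ways of turning it into keyword arguments. The example string is made up in the format shown in the comment above, and the sketch assumes that the API wrapper URL-encodes keyword arguments before sending them, as HTTP clients normally do.

```python
from urllib.parse import parse_qsl, urlencode

# Made-up example in the documented format, with an encoded quoted phrase
next_results = '?max_id=313519052523986943&q=%22Super%20Bowl%22&include_entities=1'

# Splitting on '&' and '=' leaves the percent-escapes in place
naive = dict(kv.split('=') for kv in next_results[1:].split('&'))
print(naive['q'])    # %22Super%20Bowl%22

# parse_qsl splits the pairs AND decodes the values
decoded = dict(parse_qsl(next_results[1:]))
print(decoded['q'])  # "Super Bowl"

# If the values are encoded again when the next request is built,
# the naive version ends up double-encoded ('%' becomes '%25')
naive_again = urlencode({'q': naive['q']})
decoded_again = urlencode({'q': decoded['q']})
print(naive_again)    # q=%2522Super%2520Bowl%2522
print(decoded_again)  # q=%22Super+Bowl%22
```

For a simple query such as `NCAA` the two approaches give the same result; the difference only shows up once the query contains spaces, quotes or a `#`.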

You will likely encounter errors in mining Twitter data. Here is a function to automate the handling of certain errors. See *Mining the Social Web* for more details.

In [61]:
import sys
import time
from twitter.api import TwitterHTTPError
from urllib.error import URLError
from http.client import BadStatusLine

def make_twitter_request(twitter_api_func, max_errors=10, *args, **kw):
    # A nested helper function that handles common HTTPErrors. Return an updated
    # value for wait_period if the problem is a 500 level error. Block until the
    # rate limit is reset if it's a rate limiting issue (429 error). Returns None
    # for 401 and 404 errors, which requires special handling by the caller.
    def handle_twitter_http_error(e, wait_period=2, sleep_when_rate_limited=True):
        if wait_period > 3600: # Seconds
            print('Too many retries. Quitting.', file=sys.stderr)
            raise e
        if e.e.code == 401:
            return None
        elif e.e.code == 404:
            print('Encountered 404 Error (Not Found)', file=sys.stderr)
            return None
        elif e.e.code == 429:
            print('Encountered 429 Error (Rate Limit Exceeded)', file=sys.stderr)
            if sleep_when_rate_limited:
                print("Retrying in 15 minutes...ZzZ...", file=sys.stderr)
                sys.stderr.flush()
                time.sleep(60*15 + 5)
                print('...ZzZ...Awake now and trying again.', file=sys.stderr)
                return 2
            else:
                raise e # Caller must handle the rate limiting issue
        elif e.e.code in (500, 502, 503, 504):
            print('Encountered %i Error. Retrying in %i seconds' % (e.e.code, wait_period), file=sys.stderr)
            time.sleep(wait_period)
            wait_period *= 1.5
            return wait_period
        else:
            raise e

    # End of nested helper function

    wait_period = 2
    error_count = 0
    while True:
        try:
            return twitter_api_func(*args, **kw)
        except TwitterHTTPError as e:
            error_count = 0
            wait_period = handle_twitter_http_error(e, wait_period)
            if wait_period is None:
                return
        except URLError as e:
            error_count += 1
            print("URLError encountered. Continuing.", file=sys.stderr)
            if error_count > max_errors:
                print("Too many consecutive errors...bailing out.", file=sys.stderr)
                raise
        except BadStatusLine as e:
            error_count += 1
            print("BadStatusLine encountered. Continuing.", file=sys.stderr)
            if error_count > max_errors:
                print("Too many consecutive errors...bailing out.", file=sys.stderr)
                raise

In [62]:
response = make_twitter_request(t.users.lookup, screen_name="SocialWebMining")
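
The core idea in `make_twitter_request`, retrying after a pause that grows with each failure, can be isolated in a few lines. This is a simplified sketch, not the function above: `retry_with_backoff` and `flaky` are hypothetical names, the built-in `ConnectionError` stands in for the Twitter-specific exceptions, and the `sleep` argument exists only so that the demo does not actually wait.

```python
import time

def retry_with_backoff(func, max_tries=5, wait_period=2, factor=1.5, sleep=time.sleep):
    """Call func() until it succeeds, sleeping longer after each failure."""
    for attempt in range(1, max_tries + 1):
        try:
            return func()
        except ConnectionError:
            if attempt == max_tries:
                raise               # out of attempts: let the caller see the error
            sleep(wait_period)      # pause before the next attempt
            wait_period *= factor   # and wait longer next time

# Demo: a function that fails twice and then succeeds
calls = {'n': 0}

def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('temporary failure')
    return 'ok'

# Record the requested waits in a list instead of sleeping
waits = []
result = retry_with_backoff(flaky, sleep=waits.append)
print(result, waits)  # ok [2, 3.0]
```

Growing the wait gives a struggling server more room on each retry, and the cap on attempts keeps the loop from running forever.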

We will want to write responses to disk, on the fly, so that we can collect many observations for later analysis. In *Mining the Social Web*, the MongoDB database is recommended for this. Here is another way (which can also be adapted to write to a database).

First, let's wrap the extraction of tweet "[entities](https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/entities-object)" in a function which takes a list of statuses as input.

In [64]:
def extract_tweet_entities(statuses):
    if len(statuses) == 0:
        return [], [], [], [], []
    screen_names = [ user_mention['screen_name'] 
                    for status in statuses
                        for user_mention in status['entities']['user_mentions'] ]
    hashtags = [ hashtag['text']
                    for status in statuses
                        for hashtag in status['entities']['hashtags'] ]
    urls = [ url['expanded_url']
                    for status in statuses
                        for url in status['entities']['urls'] ]
    symbols = [ symbol['text']
                    for status in statuses
                        for symbol in status['entities']['symbols'] ]
    # Not every status has a 'media' entry, so fall back to an empty list
    media = [ media['url']
                    for status in statuses
                        for media in status['entities'].get('media', []) ]
    return screen_names, hashtags, urls, media, symbols

The above function is a good template of sorts. You can use it to develop your own function for extracting other information from a *list of statuses*. 

Let's define two more functions, then we'll sample from the Twitter stream, writing results as we go. The functions below will take statuses, extract some information, and then write the results into a CSV file using Pandas.


In [65]:
def extract_tweet_basics(status):
    screen_name = None
    tweet_ID = None
    text = None
    if 'user' in status:
        screen_name = status['user']['screen_name'] 
        tweet_ID = status['id']
        text = status['text']
    return screen_name, tweet_ID, text

In [66]:
import pandas as pd

def tweet_to_csv(file, status):
    screen_name, tweet_ID, text = extract_tweet_basics(status)
    df = pd.DataFrame([[screen_name,tweet_ID,text]], columns=['screen_name','tweet_ID','text'])
    # Append one row; newline='' leaves line endings to the CSV writer
    with open(file, 'a', newline='') as f:
        df.to_csv(f,header=False, index=False)

Finally, we stick this in the streaming loop from last time.

In [71]:
# Create a *streaming* connection (not RESTful, different from Search).
t_stream = TwitterStream(auth=auth)


# Get an *iterator* object from the twitter wrapper

tweeterator = t_stream.statuses.sample()

# Create a CSV file with column names
# but no data (yet).
import pandas as pd
df = pd.DataFrame(columns=['screen_name','tweet_ID','text'])
df.to_csv('my_csv.csv', index=False)


# The loop below will grab a new tweet,
# extract some basic info, put that info
# in a dataframe object, then use that
# dataframe object to append one row to
# the existing CSV file, 'my_csv.csv'.

tweet_count = 100
for tweet in tweeterator:
    tweet_count -= 1
    tweet_to_csv('my_csv.csv', tweet)  
    if tweet_count <= 0:
        break 
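
The same append-as-you-go pattern can be tried without Twitter or Pandas. The sketch below uses the standard library's `csv` module and a temporary file; `append_row` is a hypothetical helper, and the two rows are made up.

```python
import csv
import os
import tempfile

def append_row(path, row, header=('screen_name', 'tweet_ID', 'text')):
    """Append one row to a CSV file, writing the header if the file is new or empty."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    # newline='' lets the csv module control line endings itself
    with open(path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(header)
        writer.writerow(row)

# Write two made-up rows to a file in a fresh temporary directory
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_tweets.csv')
append_row(demo_path, ('alice', 1, 'hello, world'))
append_row(demo_path, ('bob', 2, 'line one\nline two'))

# Read everything back; commas and newlines inside the text survive
# because the writer quotes those fields
with open(demo_path, newline='', encoding='utf-8') as f:
    rows = list(csv.reader(f))
print(rows)
```

Opening the file in append mode for each row means that whatever has been written so far is already on disk if the stream is interrupted.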