# Data Acquisition

Modern data sources typically deliver their data through an API. In this exercise, we will acquire some tweets and work with them in various ways.

We'll be working with [`tweepy`](https://github.com/tweepy/tweepy), a Python library for accessing the Twitter API.

The best place to start is the [documentation](http://docs.tweepy.org/en/latest/). Take a look at the Getting Started page.

To begin, install `tweepy` using pip. You only need to do this once (though doing it again won't hurt anything). Run the next cell to install it.

In [None]:
!pip install tweepy

|  <span style="font-size:16px;font-weight:normal;">The next step is to obtain the Twitter credentials you need to access the API.<br /><br />Visit the [Twitter Developer Site](https://developer.twitter.com/) to create a Twitter "application." <br /><br /> Go to the keys and tokens tab (shown alongside) and copy-paste the consumer API keys as well as the access token and access token secret into the cell below. <br /><br />Careful! `consumer_secret` and `access_token_secret` should be protected like passwords because anyone who has them can use the API to <em>send out tweets or direct messages</em> on your behalf (this is what Twitter bots do!). Invalidate the tokens after submitting your notebook by regenerating them on the keys and tokens page.</span> | ![Keys and tokens tab on the Twitter Developer Site](TwitterDev.png) |
|:--- | --- |


In [None]:
consumer_key = '▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣'
consumer_secret = '▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣'
access_token = '▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣'
access_token_secret = '▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣'
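
Because these values act like passwords, one option is to keep them out of the notebook altogether and read them from environment variables. The following is a minimal sketch of that idea. The variable names (`TWITTER_CONSUMER_KEY` and so on) and the `load_credentials` helper are hypothetical choices made for illustration; neither Twitter nor `tweepy` prescribes them.

```python
import os

# Hypothetical environment variable names; any names would work.
CREDENTIAL_VARS = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_token_secret": "TWITTER_ACCESS_TOKEN_SECRET",
}


def load_credentials(environ=os.environ):
    """Return a dict with the four credentials, raising if any is missing."""
    missing = [var for var in CREDENTIAL_VARS.values() if not environ.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[var] for name, var in CREDENTIAL_VARS.items()}
```

If the variables are set before Jupyter starts, `creds = load_credentials()` returns the values, and you can assign `creds["consumer_key"]` and the others to the names used in the cell above.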

In [None]:
import tweepy

In [None]:
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

API = tweepy.API(auth)

In [None]:
public_tweets = API.home_timeline()
for tweet in public_tweets:
    print(tweet.text)

## Did you get some tweets?

Good! The API documentation is [here](http://docs.tweepy.org/en/latest/api.html).
We will use the API to find information about your favorite Twitter user, such as their latest tweets and the accounts connected to them.

In [None]:
# First enter the name of a prominent user, say @nytimes.
user = '@nytimes'

In [None]:
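# Tweets from your prominent user
# Pass the screen name by keyword (without the leading '@');
# recent tweepy releases no longer accept it as a positional argument.
user_tweets = API.user_timeline(screen_name=user.lstrip('@'))
for tweet in user_tweets:
    print(tweet.text)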

In [None]:
# Each tweet in user_tweets is actually an object! Use dir(...) to find its attributes
# dir(tweet)

In [None]:
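# Some accounts that your favorite user follows (Twitter calls these "friends")
# tweepy 4.x names this method get_friends; older releases call it friends.
get_friends = getattr(API, 'get_friends', None) or API.friends
user_friends = get_friends(screen_name=user.lstrip('@'))
for friend in user_friends:
    print(friend.screen_name)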

# TO-DO

1. Have the students exercise some functions of the API and the data it returns.
1. Create a DataFrame with attributes of a user: (id, name, recent_tweets). Note that `id` and `name` will each be a string, whereas `recent_tweets` will be a list of strings.
1. Iterate through the top 20 `friends` and add _their_ data into the above DataFrame. Now you will have 21 rows in the DataFrame.
1. Iterate through the top 5 `friends` of each `friend` but don't allow duplicates in the DataFrame.
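
One possible way to approach the row-building and de-duplication steps above is to collect plain dictionaries first and only turn them into a DataFrame at the end. This is a sketch, not the expected solution. The helper names `user_to_row` and `add_row` are hypothetical, and the code assumes each user object exposes `id_str` and `name` attributes and each tweet object exposes `text`.

```python
def user_to_row(user, recent_tweets):
    """Build one row (a dict) from a user object and a list of tweet objects."""
    return {
        "id": user.id_str,  # assumed attribute: the user id as a string
        "name": user.name,
        "recent_tweets": [tweet.text for tweet in recent_tweets],
    }


def add_row(rows, seen_ids, row):
    """Append row to rows unless a row with the same id was already added."""
    if row["id"] in seen_ids:
        return False
    seen_ids.add(row["id"])
    rows.append(row)
    return True
```

Once every user has been processed, `pandas.DataFrame(rows)` turns the collected list of dictionaries into a DataFrame with one row per user.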

# More TO-DO

1. Add a new column `sentiment` to the above DataFrame.
1. Download the [AFINN data](https://github.com/fnielsen/afinn/blob/master/afinn/data/AFINN-en-165.txt) and convert it to a Python dictionary `sentiment_dict`.
1. Accumulate the sentiment score for each word in the tweet text; if a word is not in `sentiment_dict` its weight should be considered zero.
1. Who (in your DataFrame) has the highest sentiment score, and who has the lowest?
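
The dictionary-building and scoring steps above could be sketched as follows. This assumes each line of the AFINN file holds a term and an integer score separated by a tab. The function names `parse_afinn` and `sentiment_score` are hypothetical, and the tokenization (lowercase runs of letters and apostrophes) is just one simple choice among many.

```python
import re


def parse_afinn(lines):
    """Turn lines of 'term<TAB>score' into a {term: int_score} dictionary."""
    sentiment_dict = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        term, score = line.rsplit("\t", 1)
        sentiment_dict[term] = int(score)
    return sentiment_dict


def sentiment_score(text, sentiment_dict):
    """Sum the scores of the words in text; unknown words count as zero."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(sentiment_dict.get(word, 0) for word in words)
```

With the file saved locally, passing the open file object to `parse_afinn` yields `sentiment_dict`, since iterating over a file yields its lines. Because scoring works word by word, any multi-word terms in the file will never match; that is a limitation of this sketch.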

# Even more TO-DO

1. Add new columns `recent_hashtags` and `recent_mentions` to the DataFrame. Some of these attributes are not directly available in the API. You will have to examine the text for words that start with '#' and '@' respectively. The `recent_hashtags` and `recent_mentions` attributes for each user should be a dictionary of words and counts. For example, with `user = '@nytimes'` last weekend `#syria` might have been a popular recent hashtag.
1. What was the most popular hashtag?
1. Who was the most frequent mention in your DataFrame?
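
Counting hashtags and mentions by examining the tweet text, as described above, might look like the sketch below. It uses `collections.Counter` (a dictionary subclass) for the word counts. The helper name `count_tags` is hypothetical, and two choices here are assumptions rather than requirements: the leading `#` or `@` is dropped from the stored word, and everything is lowercased so that `#Syria` and `#syria` count together.

```python
import re
from collections import Counter

HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")


def count_tags(tweets):
    """Return (hashtag_counts, mention_counts) for a list of tweet texts."""
    hashtags = Counter()
    mentions = Counter()
    for text in tweets:
        hashtags.update(tag.lower() for tag in HASHTAG_RE.findall(text))
        mentions.update(name.lower() for name in MENTION_RE.findall(text))
    return hashtags, mentions
```

Calling `hashtags.most_common(1)` on the result gives the most frequent hashtag with its count. Note that this simple pattern also matches the `@` inside an email address, so real tweet text may need a stricter rule.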