# What Can You Learn from Cross-Site Aggregation? -- Teacher Version

#### This exercise is part of the *Teaching Privacy* curriculum, which you can find at https://teachingprivacy.org.

Full background for the exercise can be found in <a href="https://teachingprivacy.org/module-3-information-is-valuable/" target="_blank">Teaching Privacy Module 3: Information Is Valuable</a>.

In this exercise, you will aggregate data about a character named Kai Peroc from Tumblr and Twitter:<br>
https://twitter.com/kaiperoc<br>
https://kaiperoc.tumblr.com


## Part 1: Install the API Libraries

### Tumblr API

To access the Tumblr API, you will use the PyTumblr library. To install it, run the following in your terminal:

**sudo pip install oauth oauth2 pytumblr**

If this doesn't work, check the <a href="https://github.com/tumblr/pytumblr" target="_blank">PyTumblr</a> readme file for the most up-to-date installation instructions. 

*For more on getting and using **pip**, check out <a href="https://pip.pypa.io/en/stable/" target="_blank">the documentation</a> or <a href="https://www.w3schools.com/python/python_pip.asp" target="_blank">the w3schools tutorial</a>.*



### Twitter API

To access the Twitter API, you will use the Tweepy library. To install it, try running the following in your terminal:

**pip install tweepy**

If this doesn't work, check the readme file at https://github.com/tweepy/tweepy for the most up-to-date installation instructions. 



### Import

Now run the cell below to import the appropriate packages.

In [2]:
import pytumblr
import tweepy
from tweepy import TweepError
import json
import re

## Part 2: Create Tumblr App and Get Access Tokens

### Create a Tumblr App

1. Go to https://www.tumblr.com/oauth/apps. You will need to create a Tumblr account if you don't have one already or don't want to use your existing account for this purpose.
2. Click on 'Register application'.
3. Fill out the required fields. For the 'Application website' and 'Callback URL', you can use a placeholder (such as https://teachingprivacy.org); note that the URLs must match.



### Obtain Tumblr API Keys

Open the app you created. You should see your OAuth consumer key and your secret key.

Go to https://api.tumblr.com/console and paste these keys into the appropriate fields to get your OAuth token and OAuth secret.

When using APIs that require tokens and keys for authentication, it is common practice to keep your keys in a separate JSON file, to protect yourself and the app's users. Your key file should not be posted in public repositories, and you should *never* share your keys. 

Create a new file called **tumblr_keys.json** with the format below. Paste your keys between the empty quotation marks. (Make sure you don't overwrite the quotation marks!) <br> <br>
{ <br>
   "api_key": "", <br>
   "api_secret":  "", <br>
   "oauth_token": "", <br>
   "oauth_secret": "" <br>
}
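
A missing comma or quotation mark makes a key file unreadable by `json.load`. One way to avoid that kind of slip is to generate the empty template with Python's `json` module instead of typing it by hand. The sketch below is only an illustration: the helper name `key_template` is hypothetical, and the field names are the ones from the format shown above.

```python
import json

# Field names follow the tumblr_keys.json format shown above.
TUMBLR_KEY_FIELDS = ["api_key", "api_secret", "oauth_token", "oauth_secret"]

def key_template(fields):
    """Return JSON text that maps each field name to an empty string."""
    return json.dumps({field: "" for field in fields}, indent=3)

template_text = key_template(TUMBLR_KEY_FIELDS)
print(template_text)

# To save it, write the text to a file and then paste your keys in:
# with open("tumblr_keys.json", "w") as f:
#     f.write(template_text)
```

Because `json.dumps` produces the punctuation, the resulting file is always valid JSON; you only need to fill in the values between the quotation marks.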


## Part 3: Create Twitter App and Get Access Tokens

### Create a Twitter App

1. Go to https://apps.twitter.com and click 'Sign In'. If you don't have a Twitter account or don't want to use your current Twitter account, you will need to create one.
2. Click on 'Create New App'.
3. Give your app a Name, Description, and Website. For the website, you can use a placeholder (such as https://teachingprivacy.org).

### Obtain Twitter API Keys 

Create a new text file named **twitter_keys.json** with the following format:

{ <br>
   "consumer_key":"", <br>
   "consumer_secret":  "", <br>
   "access_token": "", <br>
   "access_token_secret": "" <br>
}
<br>
1. Go into the app you created in the previous step and go to the 'Keys and Access Tokens' tab. 
2. Copy and paste the tokens and keys for the corresponding variables into your JSON file. <br>
    a. You will need to click 'Create my access token' when you first create your app. <br>
    b. Make sure you copy and paste the tokens *inside* the quotation marks.



## Part 4: Assign and Verify Your API Keys

### Assign the Keys

Run the cell below to load your keys into the **twitter_keys** and **tumblr_keys** variables.


In [3]:
twitter_keys_file = 'twitter_keys.json'
tumblr_keys_file = 'tumblr_keys.json'
with open(twitter_keys_file) as file:
    twitter_keys = json.load(file)
with open(tumblr_keys_file) as file:
    tumblr_keys = json.load(file)
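
Before calling either API, it can help to confirm that no key was left blank or omitted. The following is a minimal sketch, not part of the original exercise: the helper `missing_keys` and the example dictionary are hypothetical, and the required field names are taken from the **tumblr_keys.json** format in Part 2.

```python
def missing_keys(keys, required):
    """Return the required field names that are absent or blank in `keys`."""
    missing = []
    for name in required:
        value = keys.get(name)
        # Treat anything that is not a non-empty string as missing.
        if not isinstance(value, str) or not value.strip():
            missing.append(name)
    return missing

# Hypothetical example: one field left blank, one omitted entirely.
example_keys = {"api_key": "abc123", "api_secret": "", "oauth_token": "tok"}
required = ["api_key", "api_secret", "oauth_token", "oauth_secret"]
print(missing_keys(example_keys, required))  # ['api_secret', 'oauth_secret']
```

With your real data, you could pass `tumblr_keys` or `twitter_keys` as the first argument along with the field names your key file uses.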

### Establish Your Tumblr API

Now that you have set up your keys, run the cell below to establish your Tumblr API.

In [4]:
client = pytumblr.TumblrRestClient(
    tumblr_keys['api_key'],
    tumblr_keys['api_secret'],
    tumblr_keys['oauth_token'],
    tumblr_keys['oauth_secret'],
)
client.info()

{'user': {'blogs': [{'admin': True,
    'ask': True,
    'ask_anon': True,
    'ask_page_title': 'Ask me anything',
    'can_send_fan_mail': True,
    'can_submit': True,
    'can_subscribe': False,
    'description': 'The Office, music, and friends. #CarpeDiem',
    'drafts': 0,
    'facebook': 'N',
    'facebook_opengraph_enabled': 'N',
    'followed': False,
    'followers': 394,
    'is_blocked_from_primary': False,
    'likes': 8807,
    'messages': 4,
    'name': 'ratchetmessiah',
    'posts': 16636,
    'primary': True,
    'queue': 0,
    'share_likes': True,
    'submission_page_title': 'Submit a post',
    'submission_terms': {'accepted_types': ['text',
      'photo',
      'quote',
      'link',
      'video'],
     'guidelines': '',
     'tags': [],
     'title': 'Submit a post'},
    'subscribed': False,
    'title': 'Swagalicious Fergalicious Blog',
    'total_posts': 16636,
    'tweet': 'N',
    'twitter_enabled': False,
    'twitter_send': False,
    'type': 'public',
 

### Check Twitter Keys

Run the cell below to check whether you have set up the Twitter keys correctly.

In [5]:
try:
    # The field names must match the ones in twitter_keys.json (see Part 3).
    auth = tweepy.OAuthHandler(twitter_keys["consumer_key"], twitter_keys["consumer_secret"])
    auth.set_access_token(twitter_keys["access_token"], twitter_keys["access_token_secret"])
    api = tweepy.API(auth)
    print("You have correctly set up your API keys. Your username is:", api.auth.get_username())
except TweepError as e:
    print("Tweepy found an error. Revisit your twitter_keys.json file and make sure you have the correct keys.")
    print("Details:", e)

You have correctly set up your API keys. Your username is: ImKarloss


## Part 5: Gather Some Data

Now that you've set up the API, let's see what data you can get from someone's social media accounts.


### Gather Tumblr Posts

With the help of the <a href="https://github.com/tumblr/pytumblr" target="_blank">PyTumblr</a> documentation, grab all of Kai Peroc's Tumblr posts and store them in the array provided. 

The example in the cell below uses a regular expression (regex) to remove HTML tags from a string. You can apply the same pattern to the text of each post.


In [8]:
regex = re.compile('<.*?>')
old = "This string has an <html tag>."
new = re.sub(regex, '', old)
print(new)

This string has an .
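
Stripping tags is often only part of the cleanup, because post text can also contain HTML entities (such as `&amp;`) and non-breaking spaces (`\xa0`). The sketch below extends the same regex idea using the standard-library `html` module. The function name `clean_html` and the sample string are hypothetical, not part of the exercise.

```python
import html
import re

TAG_PATTERN = re.compile(r'<.*?>')

def clean_html(text):
    """Remove HTML tags, decode entities, and normalize non-breaking spaces."""
    no_tags = TAG_PATTERN.sub('', text)
    decoded = html.unescape(no_tags)          # e.g. '&amp;' becomes '&'
    return decoded.replace('\xa0', ' ').strip()

sample = "<p>Tom &amp; Jerry&nbsp;<b>rock</b>!</p>"
print(clean_html(sample))  # Tom & Jerry rock!
```

Note that the non-greedy pattern `<.*?>` is a simple heuristic: it also removes any ordinary text that happens to sit between angle brackets.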


*Hints:*<br>
*1. Look for the .posts method in the documentation.*<br>
*2. When looking for post text, note that not all post types have the same fields. Look in the 'body' or 'caption' field.*<br>
*3. To get all the posts, store them in a variable like **tumblr_call** when you call the .posts method and iterate through them with a for loop.*<br>



In [9]:
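tumblr_posts = []
tumblr_call = client.posts("kaiperoc")
for post in tumblr_call['posts']:
    # Text posts store their content in 'body'; photo and video posts use 'caption'.
    if 'body' in post:
        clean_post = re.sub(regex, '', post['body'])
        tumblr_posts.append(clean_post)
    # Skip posts that have neither field instead of raising a KeyError.
    elif 'caption' in post:
        clean_post = re.sub(regex, '', post['caption'])
        tumblr_posts.append(clean_post)
tumblr_posts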

['Just got an iPhone 6S; text me!\xa0605-475-6961',
 'Woo! Accepted to UC Berkeley Class of 2019! Proud to be a bear! \u202a#\u200eBerkeleyBound\u202c']
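
A single call to `.posts` returns only one page of results, so a blog with many posts needs several calls. The sketch below shows one possible paging loop. It is written against a stand-in fetch function so that it runs offline; the names `fetch_all_posts` and `fake_fetch` are hypothetical, and the commented-out live version assumes that `client.posts` accepts `limit` and `offset` keyword arguments, which you should confirm in the PyTumblr documentation.

```python
def fetch_all_posts(fetch_page, page_size=20):
    """Collect posts page by page until an empty page comes back."""
    all_posts = []
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            break
        all_posts.extend(page)
        offset += len(page)
    return all_posts

# Stand-in for the real API so the sketch runs offline: 45 fake posts.
fake_blog = [{"id": i} for i in range(45)]

def fake_fetch(offset, limit):
    return fake_blog[offset:offset + limit]

print(len(fetch_all_posts(fake_fetch)))  # 45

# With a live client, the fetch function might look like this
# (assumes client.posts accepts `limit` and `offset`):
# def tumblr_fetch(offset, limit):
#     return client.posts("kaiperoc", limit=limit, offset=offset)["posts"]
```

For Kai's blog, which has only a couple of posts, one call is enough; the loop matters for larger blogs.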

### Gather Twitter Posts

Now, with help from <a href="http://tweepy.readthedocs.io/en/v3.5.0/" target="_blank">the Tweepy documentation</a>, grab Kai Peroc's most recent tweets and store them in the array provided.

*Hint: Look for a method to return the user timeline under 'API Reference': http://docs.tweepy.org/en/v3.5.0/api.html#timeline-methods *

In [12]:
twitter_posts = []
twitter_call = api.user_timeline(screen_name="kaiperoc", count=100)
for tweet in twitter_call:
    # Use the public .text attribute rather than the private ._json dictionary.
    twitter_posts.append(tweet.text)
twitter_posts

['Fellow incoming #berkeley #classof19 there is a great sandwich spot on Shattuck called The Sandwich Spot! http://t.co/OJXDIbU8Mp',
 "Can't wait to see it live in person!  https://t.co/5e1kePRzXK",
 "Cut out of work early last night to go to the A's game. Totally worth it! #athletics #stomper http://t.co/zOLEhPmvHD",
 'I guess the question Tilt asked all those years ago is finally answered. It was condemned. https://t.co/U8Coat3L1w #berkeleypier',
 "So proud to be a part of the class of '19! #berkeleybound https://t.co/3NwpZS7B0o",
 '@TheBerkStaff @Student_Store I love walking around this beautiful campus on a gorgeous summer day! So many school colors on show! #gobears',
 'Ugh, parking around campus. #amiright #berkeley',
 'Good stuff! https://t.co/1zTyjHoKot',
 'Have you heard about CRISPR?! It could be a cure of EVERYTHING! #science #is #rad  https://t.co/J6BKr5GH6V',
 'omgosh so scary! What if you were in the presence of this guy?! #LionsTigersAndBears #OhMy! https://t.co/5bMClgYw

## Part 6: Aggregate the Data

Now that you have Kai's data from two different social media accounts, it's time to aggregate.

In [11]:
# Concatenate the two lists: tweets first, then Tumblr posts.
agg_posts = twitter_posts + tumblr_posts

agg_posts

['Fellow incoming #berkeley #classof19 there is a great sandwich spot on Shattuck called The Sandwich Spot! http://t.co/OJXDIbU8Mp',
 "Can't wait to see it live in person!  https://t.co/5e1kePRzXK",
 "Cut out of work early last night to go to the A's game. Totally worth it! #athletics #stomper http://t.co/zOLEhPmvHD",
 'I guess the question Tilt asked all those years ago is finally answered. It was condemned. https://t.co/U8Coat3L1w #berkeleypier',
 "So proud to be a part of the class of '19! #berkeleybound https://t.co/3NwpZS7B0o",
 '@TheBerkStaff @Student_Store I love walking around this beautiful campus on a gorgeous summer day! So many school colors on show! #gobears',
 'Ugh, parking around campus. #amiright #berkeley',
 'Good stuff! https://t.co/1zTyjHoKot',
 'Have you heard about CRISPR?! It could be a cure of EVERYTHING! #science #is #rad  https://t.co/J6BKr5GH6V',
 'omgosh so scary! What if you were in the presence of this guy?! #LionsTigersAndBears #OhMy! https://t.co/5bMClgYw
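
Once the posts are in one list, simple counting can surface themes that recur across both sites. The sketch below tallies hashtags with `collections.Counter`. The helper `hashtag_counts` is hypothetical, and the sample strings are shortened versions of posts shown in the outputs above. It is one possible starting point for analysis, not the only way to do it.

```python
from collections import Counter
import re

HASHTAG = re.compile(r'#(\w+)')

def hashtag_counts(posts):
    """Count hashtags across a list of post strings, ignoring case."""
    counts = Counter()
    for post in posts:
        counts.update(tag.lower() for tag in HASHTAG.findall(post))
    return counts

# Shortened samples based on the outputs above.
sample_tweets = [
    "Ugh, parking around campus. #amiright #berkeley",
    "Fellow incoming #berkeley #classof19 there is a great sandwich spot!",
]
sample_tumblr = ["Proud to be a bear! #BerkeleyBound"]

counts = hashtag_counts(sample_tweets + sample_tumblr)
print(counts.most_common(1))  # [('berkeley', 2)]
```

With the real data, you could call `hashtag_counts(agg_posts)`, or run it on `twitter_posts` and `tumblr_posts` separately to compare the two streams.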

## Part 7: Brainstorm and Reflect

Now that you have data about Kai from two different social media sites, what inferences can you make about xem, about xir interests, likes and dislikes, affiliations, and personality?

Compare the data from the two streams.
- What similarities do you see? What do those similarities tell you about Kai?
- How does the information on each site supplement or contextualize the other?
- Are there any notable differences or contradictions?

How would the *aggregated* data be more useful than data from a single site for...
- An advertiser deciding what products to market to Kai?
- A criminal looking for easy targets?
- A potential employer Kai had applied to?

In which of those situations would it be most useful to automate the process via API?


**Teacher Note: Student answers might touch on:**
- Persistent likes/interests across time and across both sites (e.g. motivational memes, genetics) are more promising for advertisers.
- More precise data about Kai's college, year, and probable age from Twitter contextualizes the data from Tumblr, so advertisers know his demographic.
- Advertisers generally do not employ humans to scan individual profiles; they need a lot of data about many people at once to choose targets by algorithm and to compare similar characteristics among those their ad is/isn't successful with.
- Home location data from Twitter and a shot of valuables on Tumblr provide a target for a thief (for a computer-savvy criminal, an API is potentially useful for automating the search for targets).
- Contradiction between what he says about his absence from school on the two sites shows he hasn't grown up when it comes to prevaricating -- useful for an employer to know (also reinforces how embarrassed he is about the blobberbitis, and anything embarrassing is a gold mine for advertisers).
