# Data Sources
<font size="4">
We retrieved our data from two sources: the We the People API and Twitter profile pages.
</font>

# Contents
<font size="4">
<ol>
<li><a href="#We-The-People-API">We The People API</a></li>
<li><a href="#Webscraping-Twitter">Twitter</a></li>
</ol>

# We The People API
<font size="4">
Using the <a href="https://petitions.whitehouse.gov/developers">We the People API</a>, we were able to extract petition data for several hundred petitions. The information we retrieved includes:
(1) the petition ID, (2) date created, (3) deadline date, (4) petition title, (5) issues addressed, (6) petition type, (7) signature count, (8) the petition itself, and (9) the URL to the petition.

In [8]:
# LIBRARIES
from fastcache import clru_cache
import datetime
import requests
import numpy as np
import re
import pandas as pd

sandkey="TgltV3qLpU4EPb6LJF7Lx7wRJ6CMMVe5l7YBctBC"
prodkey="ETjU0uiiXFfA9AqvBUFooOEx2OmBdeq0nquzM1k4"

<font size="4">
We used the following functions to request the petition data and extract the fields listed above.

In [9]:
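# FUNCTION
# Cached so that repeated calls with the same limit do not query the API again
@clru_cache(maxsize=128,typed=False)
def sandbox_test(limit="15"):
    # limit is passed as a string because it is concatenated into the URL
    base="https://api.whitehouse.gov/v1/petitions.json?limit="+limit
    # Despite the function name, this request is sent with the production key
    request_get=requests.get(base+"&api_key="+prodkey)
    request_json=request_get.json()
    return(request_json)

tmp = sandbox_test("")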
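
As a minimal sketch (not the code we ran above), the same request URL could be assembled with the standard library's `urlencode` instead of string concatenation, which lets `limit` be an integer and escapes the values. The names `BASE_URL` and `build_petitions_url` and the placeholder key are hypothetical and introduced only for illustration.

```python
from urllib.parse import urlencode

# Endpoint taken from the function above
BASE_URL = "https://api.whitehouse.gov/v1/petitions.json"

def build_petitions_url(limit, api_key):
    """Return the petitions URL with limit and api_key as query parameters."""
    # urlencode converts the values to strings and escapes them
    query = urlencode({"limit": limit, "api_key": api_key})
    return BASE_URL + "?" + query

# "YOUR_API_KEY" is a placeholder, not a real key
url = build_petitions_url(100, "YOUR_API_KEY")
print(url)
```

The resulting string can be passed to `requests.get` in the same way as the concatenated URL.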

In [10]:
# UNIX TIME CONVERSION CODE BORROWED FROM STACK
def extract_field(query):
    # Number of petitions returned, used instead of a hard-coded 100
    n=len(query)
    pet_id=[item["id"] for item in query]
    # Unix timestamps (seconds) converted to local-time strings
    pet_created=[datetime.datetime.fromtimestamp(item["created"]).strftime('%Y-%m-%d %H:%M:%S') for item in query]
    pet_deadline=[datetime.datetime.fromtimestamp(item["deadline"]).strftime('%Y-%m-%d %H:%M:%S') for item in query]
    pet_title=[item["title"] for item in query]
    # Issue names joined into one string, with the leftover "amp;" of "&amp;" removed
    pet_issues=[", ".join([re.sub("amp;","",issue["name"]) for issue in item["issues"]]) for item in query]
    pet_type=[[category["name"] for category in item["petition_type"]] for item in query]
    pet_count=[item["signatureCount"] for item in query]
    # Escaped apostrophes (&#039;, &amp;#039;) restored and newlines dropped
    pet_body=[re.compile("&(amp;)*#039;").sub("'",item["body"]).replace("\n","") for item in query]
    pet_url=[item["url"] for item in query]
    pet_list=[pet_id,pet_created,pet_deadline,pet_title,pet_issues,pet_type,pet_count,pet_body,pet_url]
    # One row per petition, one column per field
    pet_array=np.array([np.array(column).reshape(n,1) for column in pet_list]).T.reshape(n,9)
    pet_df=pd.DataFrame(pet_array,columns=["id","created","deadline","title","issues","type","count","body","url"])
    return(pet_df)
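
To make the cleaning steps in `extract_field` easier to follow, here is a small sketch that applies them to a single made-up record. The helper names (`format_unix_time`, `clean_issue`, `clean_body`) and the sample record are hypothetical. We also assume UTC here so the output does not depend on the machine's time zone, whereas the function above uses local time.

```python
import datetime
import re

def format_unix_time(seconds):
    """Convert a Unix timestamp in seconds to a 'YYYY-MM-DD HH:MM:SS' string (UTC)."""
    stamp = datetime.datetime.fromtimestamp(seconds, tz=datetime.timezone.utc)
    return stamp.strftime("%Y-%m-%d %H:%M:%S")

def clean_issue(name):
    """Turn an escaped '&amp;' into a plain '&' by dropping 'amp;'."""
    return re.sub("amp;", "", name)

def clean_body(text):
    """Restore escaped apostrophes and remove newlines."""
    return re.sub(r"&(amp;)*#039;", "'", text).replace("\n", "")

# Made-up record that mimics the shape of one API result
record = {
    "created": 1488326400,
    "issues": [{"name": "Budget &amp; Taxes"}],
    "body": "We don&amp;#039;t agree.\nPlease act.",
}

print(format_unix_time(record["created"]))       # 2017-03-01 00:00:00
print(clean_issue(record["issues"][0]["name"]))  # Budget & Taxes
print(clean_body(record["body"]))                # We don't agree.Please act.
```

Note that removing newlines without inserting a space joins the words on either side, as the last line shows.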

<font size="4">
This is the dataframe we constructed.

In [None]:
extract_field(tmp["results"]).head()

<img src="http://i65.tinypic.com/2nqqlqd.png">

In [None]:
petitions = extract_field(tmp["results"])
# petitions.to_csv("petitions.csv")

# Webscraping Twitter
<font size="4">
We also scraped the 20 most recent tweets from each of 23 political sources on Twitter, a site where many people get their political news. We manually compiled the profile name and URL of each source in the <a href="data/twitter_links.csv">twitter_links.csv</a> file.
<br><br>
<b>We scraped the following profiles</b>: Donald Trump, Betsy Devos, Kellyanne Conway, Mike Pence, Mitt Romney, Jeb Bush, Milo Yiannopoulos, Sarah Palin, Ted Cruz, Jerry Brown, Jill Stein, Barack Obama, Joe Biden, Bernie Sanders, Hillary Clinton, Robert Reich, Justin Trudeau, Nate Silver, NYT Politics, CNN Politics, FOX Politics, Post Politics, We the People

In [1]:
# LIBRARIES
import pandas as pd
from bs4 import BeautifulSoup
import requests

In [3]:
# IMPORT DATA
twitter = pd.read_csv("data/twitter_links.csv")
twitter.head(10)

Unnamed: 0,profile,url
0,Donald Trump,https://twitter.com/POTUS
1,Betsy Devos,https://twitter.com/BetsyDeVosED
2,Kellyanne Conway,https://twitter.com/KellyannePolls
3,Mike Pence,https://twitter.com/mike_pence
4,Mitt Romney,https://twitter.com/MittRomney
5,Jeb Bush,https://twitter.com/JebBush
6,Milo Yiannopoulos,https://twitter.com/DontGoAwayM4d
7,Sarah Palin,https://twitter.com/SarahPalinUSA
8,Ted Cruz,https://twitter.com/tedcruz
9,Jerry Brown,https://twitter.com/JerryBrownGov


In [4]:
# VIEW ALL OF THE SOURCES WE ARE SCRAPING FROM
print(twitter["profile"].values)

['Donald Trump' 'Betsy Devos' 'Kellyanne Conway' 'Mike Pence' 'Mitt Romney'
 'Jeb Bush' 'Milo Yiannopoulos' 'Sarah Palin' 'Ted Cruz' 'Jerry Brown'
 'Jill Stein' 'Barack Obama' 'Joe Biden' 'Bernie Sanders' 'Hillary Clinton'
 'Robert Reich' 'Justin Trudeau' 'Nate Silver' 'NYT Politics'
 'CNN Politics' 'FOX Politics' 'Post Politics' 'We the People']


<font size="4">
We used the following function to download a profile page and combine the text of its tweets into a single string.

In [5]:
# FUNCTION
def get_tweet_bag(twitter_url):
    # PARSE PROFILE
    this_request = requests.get(twitter_url).text
    abc_soup = BeautifulSoup(this_request, "html.parser")

    # GRAB DATA FOR 20 TWEETS
    twenty_tweet_data = abc_soup.find_all("div", {"class": "js-tweet-text-container"})
    
    # GET THE 20 TWEETS FOR ONE PERSON
    twenty_tweets = [x.find_all("p")[0].text for x in twenty_tweet_data]
    # REPLACE NON-ASCII CHARACTERS WITH "?" AND KEEP THE RESULT AS TEXT
    twenty_tweets = [x.encode("ascii", "replace").decode("ascii") for x in twenty_tweets]

    # CREATE BAG OF WORDS
    tweet_bag = " ".join(twenty_tweets)
    
    return(tweet_bag)
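
The parsing logic can be tried offline on a small HTML snippet. The sketch below is not the function above: it swaps BeautifulSoup for the standard library's `html.parser` so that it runs without third-party packages, and the class `TweetTextParser`, the function `tweet_bag_from_html`, and the sample markup are all hypothetical. It assumes each tweet sits in a `<div>` with the class `js-tweet-text-container` and that the tweet text is in the first `<p>` of that container.

```python
from html.parser import HTMLParser

class TweetTextParser(HTMLParser):
    """Collect the text of the first <p> inside each tweet container <div>."""

    def __init__(self, container_class="js-tweet-text-container"):
        super().__init__()
        self.container_class = container_class
        self.tweets = []
        self._div_depth = 0       # > 0 while inside a tweet container
        self._seen_p = False      # True once the container's first <p> was found
        self._in_first_p = False  # True while reading that first <p>
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div":
            if self._div_depth > 0:
                self._div_depth += 1          # nested div inside a container
            elif self.container_class in classes:
                self._div_depth = 1           # entering a new container
                self._seen_p = False
        elif tag == "p" and self._div_depth > 0 and not self._seen_p:
            self._seen_p = True
            self._in_first_p = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_first_p:
            self._in_first_p = False
            self.tweets.append("".join(self._chunks))
        elif tag == "div" and self._div_depth > 0:
            self._div_depth -= 1

    def handle_data(self, data):
        if self._in_first_p:
            self._chunks.append(data)

def tweet_bag_from_html(html):
    """Return one space-joined string of ASCII-only tweet texts."""
    parser = TweetTextParser()
    parser.feed(html)
    # Non-ASCII characters become "?", as in the function above
    ascii_tweets = [t.encode("ascii", "replace").decode("ascii") for t in parser.tweets]
    return " ".join(ascii_tweets)

# Made-up markup for illustration only
SAMPLE_HTML = """
<div class="js-tweet-text-container"><p class="tweet-text">First tweet caf\u00e9</p></div>
<div class="other"><p>Not a tweet</p></div>
<div class="js-tweet-text-container"><p>Second <a href="#">#tag</a> tweet</p></div>
"""

print(tweet_bag_from_html(SAMPLE_HTML))  # First tweet caf? Second #tag tweet
```

Only the two containers with the expected class contribute text, and link text inside a tweet is kept as part of it.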

In [6]:
# LIST COMP TO RETRIEVE 20 TWEETS FROM EACH SOURCE
tweet_bags = [get_tweet_bag(x) for x in twitter["url"]]

<font size="4">
This is the dataframe we constructed.

In [7]:
# ASSEMBLING DATAFRAME
twitter["tweet_bags"] = tweet_bags
# twitter.to_csv("data/twitter_data.csv")
twitter.head(10)

Unnamed: 0,profile,url,tweet_bags
0,Donald Trump,https://twitter.com/POTUS,"FBI Director Comey: fmr. DNI Clapper ""right"" t..."
1,Betsy Devos,https://twitter.com/BetsyDeVosED,"""At the end of the day we should measure every..."
2,Kellyanne Conway,https://twitter.com/KellyannePolls,Congratulations @erictrump & @LaraLeaTrump on ...
3,Mike Pence,https://twitter.com/mike_pence,.@POTUS showed true leadership in his #JointAd...
4,Mitt Romney,https://twitter.com/MittRomney,I'm a fan of proposed Deputy Treasury Secretar...
5,Jeb Bush,https://twitter.com/JebBush,Such an unnecessary distraction given all the ...
6,Milo Yiannopoulos,https://twitter.com/DontGoAwayM4d,http://bit.ly/2mRyeJq? via /r/KiA #gamergate K...
7,Sarah Palin,https://twitter.com/SarahPalinUSA,The best things happen while fishing. Love thi...
8,Ted Cruz,https://twitter.com/tedcruz,Add your name if you agree -- no US funding fo...
9,Jerry Brown,https://twitter.com/JerryBrownGov,"California is Not Turning Back, Not Now, Not E..."
