# Scraping Tweets Using Jupyter Notebooks #

This notebook builds and runs commands for a library called "Twitterscraper," a package that collects tweets from the command line (CL). Once a scrape finishes, we load the resulting data back into this notebook. 

https://github.com/taspinar/twitterscraper
    
Once the data from twitterscraper is loaded, the last step is to merge each city's dataframe into one large dataframe for analysis using the function combine_data(). 

In [11]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import json      # library for working with JSON-formatted text strings
import pprint as pp    # library for cleanly printing Python data structures
import seaborn as sns
import twitterscraper as ts
from twitterscraper import query_tweets  # third-party package, installed separately
import os

import subprocess  # lets us run CL commands directly from Jupyter Notebooks
from subprocess import Popen

## Creating a Twitterscraper Command ## 

The code below builds a search query from each city's Twitter accounts, scrapes *all* of their tweets, and writes them to one big JSON file per city. Rather than pasting the command into the CL, the scrape() function uses "subprocess" (a module in Python's standard library) to run the command directly from Jupyter Notebooks. 


In [12]:
def json_to_df(json_file):
    with open(json_file) as f:
        data = json.load(f)
        
    d = {'username': [x['username'] for x in data],
        'time': [x['timestamp'] for x in data],
        'tweet': [x['text'] for x in data],
        'likes': [x['likes'] for x in data],
        'replies': [x['replies'] for x in data],
        'user_ID' : [x['screen_name'] for x in data]}
    
    
    return pd.DataFrame.from_dict(d)

def combine_data(*data_frames):  # "*" lets us pass any number of dataframes to merge
    return pd.concat(data_frames)

def setup_files(city):
    path_to_output_file = city + ".txt"  # holds the scraper's console output; the tweets go in the JSON file
    path_to_data_file = city + ".JSON"
    
    if os.path.exists(path_to_output_file):
        os.remove(path_to_output_file)
        
    if os.path.exists(path_to_data_file):
        os.remove(path_to_data_file)
    
    # 'w+' opens the file for writing and reading, creating it if it does not exist
    return open(path_to_output_file,'w+'), path_to_data_file

def buildQuery(accounts):
    # Build the search query, e.g. "from:a OR from:b".
    # join() puts " OR " only between accounts, so there is no extra "OR" at the end.
    return " OR ".join("from:" + each_account for each_account in accounts)
            
def scrape(accounts, city):
    outputFile, dataFile = setup_files(city)
    
    twitterscraper_query = buildQuery(accounts)
    arguments = ["twitterscraper", twitterscraper_query, "--lang", "en", "-o", dataFile]
    print (arguments)
    
    p = Popen(arguments, stdout=outputFile, stderr=outputFile, universal_newlines=True)
    
    # communicate() waits for the subprocess to finish before we make the frame.
    # stdout and stderr are redirected to outputFile, so both returned values are None.
    output, errors = p.communicate()
    print (errors)
    outputFile.close()
    
    return json_to_df(dataFile)
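
json_to_df() builds each column with its own list comprehension. As a shorter alternative, pandas can build the frame from the list of records directly and then rename the columns. The sketch below assumes the same JSON keys that json_to_df() reads; the names COLUMN_NAMES and records_to_df are hypothetical, and the sample record is made up for illustration rather than taken from a real scrape. Unlike json_to_df(), a missing key yields an empty value instead of a KeyError.

```python
import pandas as pd

# Maps the JSON keys read by json_to_df() to the column names it produces
COLUMN_NAMES = {
    "username": "username",
    "timestamp": "time",
    "text": "tweet",
    "likes": "likes",
    "replies": "replies",
    "screen_name": "user_ID",
}

def records_to_df(records):
    """Build the same columns as json_to_df() from a list of tweet dicts."""
    # columns=... keeps only the keys we care about, in a fixed order
    frame = pd.DataFrame(records, columns=list(COLUMN_NAMES))
    return frame.rename(columns=COLUMN_NAMES)

# Hypothetical record for illustration only (not real scraped data)
sample = [{
    "username": "Example City",
    "timestamp": "2020-01-01T00:00:00",
    "text": "Hello",
    "likes": 1,
    "replies": 0,
    "screen_name": "examplecity",
    "retweets": 2,  # keys not listed in COLUMN_NAMES are dropped
}]

df = records_to_df(sample)
print(list(df.columns))
```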

Example use below. 

In [9]:
seattle_accounts = ["SeattleOPCD", "CityofSeattle", "seattledot", "SeattleOSE", "kcmetrobus"] # add accounts to scrape here

seattle_output = scrape(seattle_accounts, "Seattle") # creates a pandas dataframe from the JSON output; "Seattle" names the JSON file

['twitterscraper', 'from:SeattleOPCD OR from:CityofSeattle OR from:seattledot OR from:SeattleOSE OR from:kcmetrobus', '--lang', 'en', '-o', 'Seattle.JSON']
None
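
The None printed above is the errors value from scrape(): stdout and stderr are redirected to the city's .txt file, so communicate() has nothing to return. If you would rather see the scraper's messages in the notebook, one option is to capture them with subprocess.run. This is a minimal sketch, not the notebook's method; run_and_capture is a hypothetical helper, and the command is a stand-in so the example runs without twitterscraper installed.

```python
import subprocess
import sys

def run_and_capture(arguments):
    """Run a command and return (return code, stdout text, stderr text)."""
    completed = subprocess.run(
        arguments,
        stdout=subprocess.PIPE,   # capture instead of writing to a file
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode bytes to str
    )
    return completed.returncode, completed.stdout, completed.stderr

# Stand-in command; in the notebook this would be the twitterscraper arguments list
code, out, err = run_and_capture([sys.executable, "-c", "print('scrape finished')"])
print(code, out.strip())
```

A non-zero return code means the command failed, and the captured stderr text usually explains why.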


In [343]:
seattle_output['city'] = 'Seattle, WA' # manually add the city column
seattle_output

Unnamed: 0,username,time,tweet,likes,replies,user_ID,city
0,King County Metro 🚏 🚌🚎⛴🚐,2015-05-06T23:45:54,Event Reroute - Transit service will be rerout...,0,0,kcmetrobus,"Seattle, WA"
1,seattledot,2015-05-06T23:03:19,Seeing additional delays heading south into th...,1,0,seattledot,"Seattle, WA"
2,King County Metro 🚏 🚌🚎⛴🚐,2015-05-06T23:00:32,Metro’s Wednesday PM Commute: Expect possible ...,0,0,kcmetrobus,"Seattle, WA"
3,seattledot,2015-05-06T22:58:46,No visual but hearing reports of a collision o...,0,0,seattledot,"Seattle, WA"
4,seattledot,2015-05-06T22:14:15,No visual but hearing reports of a collision o...,0,0,seattledot,"Seattle, WA"
...,...,...,...,...,...,...,...
305,King County Metro 🚏 🚌🚎⛴🚐,2018-02-24T03:01:23,Metro’s Friday PM Commute: Earlier transit se...,0,0,kcmetrobus,"Seattle, WA"
306,King County Metro 🚏 🚌🚎⛴🚐,2018-02-24T02:01:15,Metro’s Friday PM Commute: Expect possible tr...,0,0,kcmetrobus,"Seattle, WA"
307,King County Metro 🚏 🚌🚎⛴🚐,2018-02-24T01:48:41,The last two trips of route 158 are late due t...,0,0,kcmetrobus,"Seattle, WA"
308,King County Metro 🚏 🚌🚎⛴🚐,2018-02-24T01:28:37,Transit Alert - Route 342 to the Shoreline P&R...,0,1,kcmetrobus,"Seattle, WA"
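
The cell above adds the city column by changing seattle_output in place. A small helper built on DataFrame.assign returns a new frame instead, which keeps the scraped frame untouched if the cell is re-run. This is a sketch under assumptions: tag_city is a hypothetical name, and the frame below is a made-up stand-in, not scraped data.

```python
import pandas as pd

def tag_city(frame, city_label):
    """Return a copy of frame with a constant city column added."""
    return frame.assign(city=city_label)

# Made-up stand-in for a scraped frame
original = pd.DataFrame({"user_ID": ["examplecity"], "tweet": ["Hello"]})
tagged = tag_city(original, "Seattle, WA")

print(list(tagged.columns))
print("city" in original.columns)  # the original frame is not modified
```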


In [8]:
denver_accounts = ["DenverCityGov", "SustainableDen", "DenverDOTI", "denverparksrec", "DenverCPD", "DDPHE"]

denver_output = scrape(denver_accounts, "Denver")

denver_output['city'] = 'Denver, CO'
denver_output

['twitterscraper', 'from:DenverCityGov OR from:SustainableDen OR from:DenverDOTI OR from:denverparksrec OR from:DenverCPD OR from:DDPHE', '--lang', 'en', '-o', 'Denver.JSON']
None


Unnamed: 0,username,time,tweet,likes,replies,user_ID,city
0,Denver Parks & Recreation,2008-12-23T11:03:33,isn't sleeping,0,0,denverparksrec,"Denver, CO"
1,Denver Dept of Transportation & Infrastructure,2011-02-16T21:05:07,"Denver residents recycled more than 30,000 lbs...",0,0,DenverDOTI,"Denver, CO"
2,Denver Parks & Recreation,2011-02-14T19:58:19,Check out this great article in Parks and Recr...,0,0,denverparksrec,"Denver, CO"
3,Denver Parks & Recreation,2011-02-10T17:38:42,Denver Parks and Recreation would like to than...,0,0,denverparksrec,"Denver, CO"
4,Denver Parks & Recreation,2011-02-08T00:34:06,New from DPR: Museum Celebrates Buffalo Bill’s...,0,0,denverparksrec,"Denver, CO"
...,...,...,...,...,...,...,...
294,Denver Dept of Transportation & Infrastructure,2018-11-09T17:01:33,Need to rake leaves this weekend? LeafDrop sit...,4,0,DenverDOTI,"Denver, CO"
295,Denver Public Health & Environment,2018-11-09T01:44:46,We are so happy to partner with @MaxfundShelte...,4,0,DDPHE,"Denver, CO"
296,Denver Dept of Transportation & Infrastructure,2018-11-08T22:00:07,Thanks for reaching out! The Denver Moves Bike...,1,0,DenverDOTI,"Denver, CO"
297,Denver Dept of Transportation & Infrastructure,2018-11-08T19:30:44,Transitioning out your fall decor soon? Most o...,4,0,DenverDOTI,"Denver, CO"


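Now, we'll combine these two datasets. Set the city column on every dataframe before combining: pd.concat leaves city empty for rows from any dataframe that lacks the column, as the Seattle rows in the output below show. 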

In [10]:
combine_data(seattle_output, denver_output)


Unnamed: 0,username,time,tweet,likes,replies,user_ID,city
0,seattledot,2014-08-23T23:58:01,Incident cleared on sb Boren Av near Madison St.,0,0,seattledot,
1,seattledot,2014-08-23T23:27:12,2nd Ave between James and Washington St is now...,0,0,seattledot,
2,seattledot,2014-08-23T23:25:00,Right lane blocked on southbound Boren Av near...,0,0,seattledot,
3,seattledot,2014-08-23T23:09:29,Montlake Bridge re-opened to vehicle traffic,0,0,seattledot,
4,seattledot,2014-08-23T23:04:55,Montlake Bridge closed to vehicle traffic,0,0,seattledot,
...,...,...,...,...,...,...,...
294,Denver Dept of Transportation & Infrastructure,2018-11-09T17:01:33,Need to rake leaves this weekend? LeafDrop sit...,4,0,DenverDOTI,"Denver, CO"
295,Denver Public Health & Environment,2018-11-09T01:44:46,We are so happy to partner with @MaxfundShelte...,4,0,DDPHE,"Denver, CO"
296,Denver Dept of Transportation & Infrastructure,2018-11-08T22:00:07,Thanks for reaching out! The Denver Moves Bike...,1,0,DenverDOTI,"Denver, CO"
297,Denver Dept of Transportation & Infrastructure,2018-11-08T19:30:44,Transitioning out your fall decor soon? Most o...,4,0,DenverDOTI,"Denver, CO"
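
In the combined output, the row labels restart for each city because pd.concat keeps every frame's own index by default. If unique row labels are needed for later analysis, passing ignore_index=True renumbers the rows. The sketch below uses a hypothetical variant of combine_data() and tiny made-up frames; it also shows the empty city value that appears when one frame lacks that column.

```python
import pandas as pd

def combine_data_reindexed(*data_frames):
    """Like combine_data(), but renumber the rows 0..n-1 in the result."""
    return pd.concat(data_frames, ignore_index=True)

# Made-up stand-ins for the per-city frames
seattle = pd.DataFrame({"user_ID": ["a", "b"], "city": ["Seattle, WA", "Seattle, WA"]})
denver = pd.DataFrame({"user_ID": ["c"]})  # city column deliberately missing

combined = combine_data_reindexed(seattle, denver)
print(combined)
```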
