# Getting Twitter Data for Other Users #

This notebook focuses on getting Twitter data (all tweets) from three influential accounts:

- Leonardo DiCaprio
- Cato Institute
- Greenpeace

The first portion of the notebook builds and prints the command that we'll pass to a library called "Twitterscraper." This package collects data from the command line (CL), and we'll load the resulting data back into this notebook.

https://github.com/taspinar/twitterscraper
    
In the last portion, once the data from twitterscraper is loaded, we'll merge all of it into one large dataset for analysis.

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import json      # library for working with JSON-formatted text strings
import pprint as pp    # library for cleanly printing Python data structures
import seaborn as sns
import twitterscraper as ts
from twitterscraper import query_tweets  # third-party package, installed separately
import os

import subprocess  # this lets us run CL commands directly from the Jupyter notebook
from subprocess import Popen

## Creating a Twitterscraper Command ## 

The code below scrapes *all* tweets from each account in a list and writes one JSON file per account. Rather than pasting the command into the CL, the `launch()` function uses "subprocess" (a module in Python's standard library) to run the command directly from the Jupyter notebook.


In [7]:
def json_to_df(json_files):
    data_frames = []
    
    for file in json_files:
        print (file)
        with open(file) as f:
            data = json.load(f)
        
        d = {'username': [x['username'] for x in data],
        'time': [x['timestamp'] for x in data],
        'tweet': [x['text'] for x in data],
        'likes': [x['likes'] for x in data],
        'replies': [x['replies'] for x in data],
        'user_ID' : [x['screen_name'] for x in data]}
    
        data_frames.append(pd.DataFrame.from_dict(d))
    return data_frames

def combine_data(data_frames):  # takes a list of dataframes and concatenates them into one
    return pd.concat(data_frames)


def buildQuery(accounts):
    # this builds our search query, e.g. "from:a OR from:b";
    # join() puts " OR " only between accounts, so there is no extra "OR" at the end
    return " OR ".join("from:" + each_account for each_account in accounts)

def launch(command, output):
    print(command)

    # stdout and stderr both go to the log file, so communicate() returns (None, None);
    # check the exit code to detect failures instead
    with open(output, 'w+') as outputFile:
        p = Popen(command, stdout=outputFile, stderr=outputFile, universal_newlines=True)
        p.communicate()  # wait for the subprocess to finish before moving on

    if p.returncode != 0:
        print("Command exited with code", p.returncode, "- see", output)
            
def scrape(accounts):
    data_files = []
    
    for user in accounts:
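        path_to_output_file = user + ".txt"  # console log from the scraper; can be ignored
        path_to_data_file = user + ".JSON"   # the scraped tweets
        data_files.append(path_to_data_file)
        
        # this query also returns mentions and replies; those rows are filtered out after loading
        query = 'from: ' + user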
        command = ["twitterscraper", query, 
                   "--lang", "en", "--all", "-ow", "-p", "40", "-o", path_to_data_file]
        launch(command, path_to_output_file)
 
    return data_files 
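
The `launch()` helper above sends the scraper's console output to a text file. As a rough sketch of the same pattern, the snippet below uses `subprocess.run`, which waits for the process to finish and exposes its exit code. The name `run_and_log`, the log file name, and the stand-in command are hypothetical and only there so the sketch runs without twitterscraper installed.

```python
import subprocess
import sys


def run_and_log(command, log_path):
    """Run a command, write stdout and stderr to log_path, return the exit code."""
    with open(log_path, "w") as log_file:
        completed = subprocess.run(
            command,
            stdout=log_file,
            stderr=subprocess.STDOUT,  # merge stderr into the same log file
        )
    return completed.returncode


# Stand-in command (hypothetical): runs the current Python interpreter.
exit_code = run_and_log([sys.executable, "-c", "print('hello')"], "demo_log.txt")

with open("demo_log.txt") as f:
    log_text = f.read()

print(exit_code)
print(log_text.strip())
```

A non-zero exit code is a more reliable failure signal than inspecting captured output, since a redirected stream is not returned to the caller.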

Below, I created a list of all the accounts I wish to scrape (I broke it up into 3 "searches" because this process is extremely time-consuming). However, you can pass all the accounts to "scrape()" at once; it'll just take an hour or so to get all the data.

In [8]:
comp_accounts = ["LeoDiCaprio", "Greenpeace", "CatoInstitute"]
                             
comp_accounts_output = scrape(comp_accounts) 

['twitterscraper', 'from: LeoDiCaprio', '--lang', 'en', '--all', '-ow', '-p', '40', '-o', 'LeoDiCaprio.JSON']
['twitterscraper', 'from: Greenpeace', '--lang', 'en', '--all', '-ow', '-p', '40', '-o', 'Greenpeace.JSON']
['twitterscraper', 'from: CatoInstitute', '--lang', 'en', '--all', '-ow', '-p', '40', '-o', 'CatoInstitute.JSON']


## Converting JSONs to DataFrames ##

json_to_df() takes the list of JSON file names returned above and converts each file into a dataframe, returning a list of dataframes.

In [10]:
dataframe_1 = json_to_df(comp_accounts_output)


LeoDiCaprio.JSON
Greenpeace.JSON
CatoInstitute.JSON
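
To illustrate what this conversion does for a single file, here is a minimal sketch on two invented records. It assumes each record carries the same keys that `json_to_df()` reads; the file name `example_user.json` and the record contents are made up for illustration.

```python
import json

import pandas as pd

# Two invented records with the keys json_to_df() reads (not real tweets).
records = [
    {"username": "Example User", "timestamp": "2020-01-01T00:00:00",
     "text": "first tweet", "likes": 3, "replies": 1, "screen_name": "example_user"},
    {"username": "Example User", "timestamp": "2020-01-02T00:00:00",
     "text": "second tweet", "likes": 5, "replies": 0, "screen_name": "example_user"},
]

# Write the records to a hypothetical file, then load them back.
with open("example_user.json", "w") as f:
    json.dump(records, f)

with open("example_user.json") as f:
    data = json.load(f)

# Map the JSON keys to the column names used in this notebook.
columns = {"username": "username", "timestamp": "time", "text": "tweet",
           "likes": "likes", "replies": "replies", "screen_name": "user_ID"}

df = pd.DataFrame(data)[list(columns)].rename(columns=columns)
print(df)
```

Building the frame from the list of records and renaming columns gives the same layout as the dictionary of list comprehensions, with one line per column mapping.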


In [11]:
merge1 = combine_data(dataframe_1)
#merge2 = combine_data(dataframe_2)
#merge3 = combine_data(dataframe_3)

frames = [merge1]

result = pd.concat(frames)

to_keep = ["LeoDiCaprio", "Greenpeace", "CatoInstitute"]

# the scrape also returned mentions & replies, so keep only rows posted by the target accounts
final_results = result[result['user_ID'].isin(to_keep)]
final_results.head(5)

|     | username | time | tweet | likes | replies | user_ID |
|-----|----------|------|-------|-------|---------|---------|
| 629 | Leonardo DiCaprio | 2011-10-21T19:52:58 | #SaveTigersNow RT @World_Wildlife: Tragic imag... | 325 | 164 | LeoDiCaprio |
| 23 | Greenpeace | 2008-12-17T15:35:14 | @creativemuffin I like talking to people who w... | 0 | 0 | Greenpeace |
| 140 | Greenpeace | 2015-05-01T16:04:05 | Investing in fossil fuels means you profit fro... | 43 | 2 | Greenpeace |
| 163 | Greenpeace | 2018-07-03T22:26:58 | For the past 8 hours, 12 climbers have stopped... | 510 | 10 | Greenpeace |
| 288 | Greenpeace | 2009-09-08T13:37:24 | RT @AslihanTumer: some more photos from #antia... | 0 | 0 | Greenpeace |
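
The filtering step keeps only rows whose `user_ID` is one of the three target accounts. A small sketch on invented rows shows how `isin` behaves; the extra account names and tweet text are made up.

```python
import pandas as pd

# Invented rows: two from target accounts, two from other (hypothetical) accounts.
toy = pd.DataFrame({
    "user_ID": ["LeoDiCaprio", "some_fan", "Greenpeace", "another_account"],
    "tweet": ["a", "b", "c", "d"],
})

to_keep = ["LeoDiCaprio", "Greenpeace", "CatoInstitute"]

# isin() returns a boolean mask that is True where user_ID is in to_keep.
mask = toy["user_ID"].isin(to_keep)
kept = toy[mask]

print(mask.tolist())
print(kept["user_ID"].tolist())
```

Note that the filtered frame keeps the original index labels, which is why the index values in a filtered result are not consecutive.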


In [12]:
final_results.to_csv("Other Users Results.csv")
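
By default, `to_csv` writes the dataframe index as an unnamed first column, so reading the file back without extra arguments produces an `Unnamed: 0` column. The sketch below, on a toy frame with a hypothetical file name, shows one way to restore the index with `index_col=0`.

```python
import pandas as pd

# Toy frame with non-consecutive index labels, as left behind by filtering.
toy = pd.DataFrame(
    {"user_ID": ["Greenpeace", "LeoDiCaprio"], "likes": [43, 325]},
    index=[140, 629],
)
toy.to_csv("toy_results.csv")  # hypothetical file name

# Default read: the saved index comes back as a regular column.
default_read = pd.read_csv("toy_results.csv")

# Treat the first column as the index instead.
indexed_read = pd.read_csv("toy_results.csv", index_col=0)

print(default_read.columns.tolist())
print(indexed_read.columns.tolist())
print(indexed_read.index.tolist())
```

If the index carries no meaning for the analysis, passing `index=False` to `to_csv` avoids writing it in the first place.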