## Downloading data from Twitter


### Tweepy

Tweepy is a Python library that helps you connect to the Twitter API.

https://medium.com/@wilamelima/mining-twitter-for-sentiment-analysis-using-python-a74679b85546

http://docs.tweepy.org/en/3.7.0/api.html


In [2]:
import pandas as pd
import tweepy
import jsonpickle

# Consumer credentials:
CONSUMER_KEY    = 'INSERT YOUR'
CONSUMER_SECRET = 'INSERT YOUR'

# Access credentials:
ACCESS_TOKEN  = 'INSERT YOUR'
ACCESS_SECRET = 'INSERT YOUR'

# Setup access API
def connect_to_twitter_OAuth():
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
    
    api = tweepy.API(auth)
    return api
 
# Create API object
api = connect_to_twitter_OAuth()  
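
Hardcoding keys in a notebook makes them easy to leak when the notebook is shared. One possible alternative, shown here only as a sketch, is to read them from environment variables. The variable names and the `load_credentials` helper are assumptions made for this example, not part of Tweepy.

```python
import os

# Hypothetical environment variable names; use whatever fits your setup.
ENV_NAMES = {
    'consumer_key': 'TWITTER_CONSUMER_KEY',
    'consumer_secret': 'TWITTER_CONSUMER_SECRET',
    'access_token': 'TWITTER_ACCESS_TOKEN',
    'access_secret': 'TWITTER_ACCESS_SECRET',
}

def load_credentials(env=os.environ):
    """Return the four credentials as a dict, or raise if any is missing."""
    missing = [name for name in ENV_NAMES.values() if not env.get(name)]
    if missing:
        raise KeyError('Missing environment variables: ' + ', '.join(missing))
    return {key: env[name] for key, name in ENV_NAMES.items()}
```

The returned values can then be passed to `tweepy.OAuthHandler` and `auth.set_access_token` in place of the constants above.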

### Scrape by hashtag

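So, let’s start by building a function that will:

1. Create a JSON file that will hold all the tweets
2. Access the Twitter API, query it and return the tweets
3. Save the tweets into the file we just created

The function accepts the following parameters:

filepath: where the file should be saved, including its name

api: the API object we created earlier

query: the query that Twitter will use to retrieve the tweets

id: a Twitter account id; if given, the tweets are read from that account's timeline instead of from the query

max_tweets: the maximum number of tweets to download (1000 by default). Keep in mind that your developer account limits how many requests you can make every 15 minutes.

lang: the language of the tweets ('en' by default)

mode: the mode used to open the file: 'w' to overwrite it (the default), 'a' to append to it

The function returns the number of tweets downloaded.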
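
The storage steps listed above can be tried without network access. The sketch below writes one JSON object per line and reads the lines back. It uses the standard `json` module as a stand-in for `jsonpickle`, and the helper names and sample records are made up for illustration.

```python
import json
import os
import tempfile

def write_json_lines(filepath, records, mode='w'):
    """Write each record as one JSON line and return how many were written."""
    count = 0
    with open(filepath, mode) as f:
        for record in records:
            # json.dumps escapes newlines, so each record stays on a single line
            f.write(json.dumps(record) + '\n')
            count += 1
    return count

def read_json_lines(filepath):
    """Read the file back into a list of dicts, skipping blank lines."""
    with open(filepath, 'rt') as f:
        return [json.loads(line) for line in f if line.strip()]

# Made-up records standing in for tweets
sample = [{'id': 1, 'text': 'first'}, {'id': 2, 'text': 'second\nline'}]
path = os.path.join(tempfile.mkdtemp(), 'sample.json')

write_json_lines(path, sample)                                   # 'w' creates or overwrites
write_json_lines(path, [{'id': 3, 'text': 'third'}], mode='a')   # 'a' appends
records = read_json_lines(path)
```

Because every record sits on its own line, appending with mode 'a' never corrupts the records already in the file.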


In [23]:

def get_save_tweets(filepath, api, query='', id='', max_tweets=1000, lang='en', mode='w'):
    """Download tweets and write them to filepath, one JSON object per line."""
    tweet_count = 0

    # If an id is passed, read that account's timeline; otherwise run the query
    if id:
        cursor = tweepy.Cursor(api.user_timeline, id=id, lang=lang)
    else:
        cursor = tweepy.Cursor(api.search, q=query, lang=lang)

    # Open the file and save the tweets
    with open(filepath, mode) as f:
        try:
            for tweet in cursor.items(max_tweets):
                # Convert to JSON format
                f.write(jsonpickle.encode(tweet._json, unpicklable=False) + '\n')
                tweet_count += 1
        except Exception:
            # Any API error (rate limit, unknown account, ...) is reported as 0 tweets
            return 0

    # The with block closes the file, so no explicit close() is needed
    return tweet_count  # How many tweets we have collected


We download up to 1000 tweets for the hashtag #innovation.

They are taken from among the most recent ones.


In [13]:
query = '#innovation'  
filename = 'tweets.json'

# Get those tweets
nrec=get_save_tweets(filename, api, query)

print("Downloaded {0} tweets".format(nrec))

Downloaded 563 tweets


Another case: we want to get 10 tweets from each company in a list we prepared earlier.

The file twitter_co.csv contains data for 85 companies with 1001-10000 employees, including each company's Twitter URL.

The Twitter id can be extracted by stripping the leading https://www.twitter.com/ from the URL.
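
Stripping a fixed prefix only works when every URL has exactly that form. As a rough sketch of a more tolerant approach, the account name can be taken from the URL path instead. The `handle_from_url` helper is hypothetical, and it assumes each URL includes its scheme (http:// or https://).

```python
from urllib.parse import urlparse

def handle_from_url(twitter_url):
    """Return the account name from a Twitter profile URL, or '' if there is none."""
    path = urlparse(twitter_url.strip()).path
    # Keep the first non-empty path segment and drop a leading '@' if present
    segments = [segment for segment in path.split('/') if segment]
    return segments[0].lstrip('@') if segments else ''

handles = [handle_from_url(url) for url in (
    'https://www.twitter.com/intel',
    'https://twitter.com/toyota/',
    'http://www.twitter.com/cisco?lang=en',
)]
```

This handles a missing www, a trailing slash and a query string, which a plain `replace` on the prefix would leave in the id.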



In [None]:
  
filename = 'comptweets.json'

comptwitter = pd.read_csv('twitter_co.csv')

twitter_urls = comptwitter['twitter_url']

for twitter_url in twitter_urls:
    # The account id is the URL without its https://www.twitter.com/ prefix
    twitter_id = twitter_url.replace("https://www.twitter.com/", "")
    # Get those tweets, appending them to the same file
    nrec = get_save_tweets(filename, api, id=twitter_id, max_tweets=10, mode='a')
    print("Downloaded {0} tweets for account {1}".format(nrec, twitter_id))




Downloaded 10 tweets for account intel
Downloaded 10 tweets for account toyota
Downloaded 10 tweets for account cisco
Downloaded 10 tweets for account nhk_pr
Downloaded 10 tweets for account cleanharbors_hr
Downloaded 10 tweets for account stericycle_inc
Downloaded 10 tweets for account officedepot
Downloaded 10 tweets for account mednax
Downloaded 10 tweets for account dicks
Downloaded 10 tweets for account ciena
Downloaded 4 tweets for account nttdataamericas
Downloaded 10 tweets for account gehealthcare
Downloaded 10 tweets for account ebay
Downloaded 10 tweets for account medtronic
Downloaded 10 tweets for account yahoo
Downloaded 10 tweets for account finisar
Downloaded 10 tweets for account pitneybowes
Downloaded 10 tweets for account bnbuzz
Downloaded 10 tweets for account bw
Downloaded 10 tweets for account staples
Downloaded 10 tweets for account oracle
Downloaded 10 tweets for account fiserv
Downloaded 10 tweets for account cainc
Downloaded 10 tweets for account att
Downloade

## Reading and cleaning tweets

As a last step, we create a function, tweets_to_df(), that does some basic cleaning and loads the tweets we saved previously into a dataframe.

    

In [15]:
def tweets_to_df(path):
    
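    # Read all lines (one JSON-encoded tweet per line) and close the file afterwards
    with open(path, 'rt') as f:
        tweets = f.readlines()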
    
    text = []
    weekday = []
    month = []
    day = []
    hour = []
    hashtag = []
    url = []
    favorite = []
    reply = []
    retweet = []
    follower = []
    following = []
    user = []
    screen_name = []

    for t in tweets:
        t = jsonpickle.decode(t)
        
        # Text
        text.append(t['text'])
        
        # Decompose date
        date = t['created_at']
        weekday.append(date.split(' ')[0])
        month.append(date.split(' ')[1])
        day.append(date.split(' ')[2])
        
        time = date.split(' ')[3].split(':')
        hour.append(time[0]) 
        
        # Has hashtag
        if len(t['entities']['hashtags']) == 0:
            hashtag.append(0)
        else:
            hashtag.append(1)
            
        # Has url
        if len(t['entities']['urls']) == 0:
            url.append(0)
        else:
            url.append(1)
            
        # Number of favs
        favorite.append(t['favorite_count'])
        
        # Is reply?
        if t['in_reply_to_status_id'] is None:
            reply.append(0)
        else:
            reply.append(1)       
        
        # Retweets count
        retweet.append(t['retweet_count'])
        
        # Followers number
        follower.append(t['user']['followers_count'])
        
        # Following number
        following.append(t['user']['friends_count'])
        
        # Add user
        user.append(t['user']['name'])

        # Add screen name
        screen_name.append(t['user']['screen_name'])
        
    d = {'text': text,
         'weekday': weekday,
         'month' : month,
         'day': day,
         'hour' : hour,
         'has_hashtag': hashtag,
         'has_url': url,
         'fav_count': favorite,
         'is_reply': reply,
         'retweet_count': retweet,
         'followers': follower,
         'following' : following,
         'user': user,
         'screen_name' : screen_name
        }
    
    return pd.DataFrame(data = d)
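
The function above decomposes `created_at` by splitting the string on spaces. An alternative sketch parses it with `datetime.strptime`, which also validates the value. The format string assumes Twitter's usual `created_at` layout, the timestamp is made up, and `decompose_date` is a hypothetical helper. The weekday and month abbreviations depend on the default (English) locale.

```python
from datetime import datetime

# Assumed layout of created_at, e.g. 'Fri Apr 05 13:45:10 +0000 2019'
CREATED_AT_FORMAT = '%a %b %d %H:%M:%S %z %Y'

def decompose_date(created_at):
    """Split a created_at string into the parts used as dataframe columns."""
    dt = datetime.strptime(created_at, CREATED_AT_FORMAT)
    return {
        'weekday': dt.strftime('%a'),  # abbreviated day name
        'month': dt.strftime('%b'),    # abbreviated month name
        'day': dt.day,                 # integer, not a zero-padded string
        'hour': dt.hour,               # integer hour, 0-23
    }

parts = decompose_date('Fri Apr 05 13:45:10 +0000 2019')
```

Unlike the string split, this returns the day and hour as integers, which is convenient for sorting and grouping.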

In [16]:
tweets_df = tweets_to_df('tweets.json')

In [17]:
tweets_df.head()

| | day | fav_count | followers | following | has_hashtag | has_url | hour | is_reply | month | retweet_count | screen_name | text | user | weekday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 0 | 476 | 270 | 1 | 0 | 13 | 0 | Apr | 0 | DavidTimis | Young Leaders @thinkBDPST #Hungary #tech #inno... | David Timis | Fri |
| 1 | 5 | 0 | 96 | 161 | 1 | 1 | 13 | 0 | Apr | 0 | pcminetti | New York’s Fashion Startup Scene Is Having a R... | pam minetti | Fri |
| 2 | 5 | 0 | 545 | 675 | 1 | 1 | 13 | 0 | Apr | 0 | Arioneo_off | [RACING] Training on the track with #EQUIMETRE... | Arioneo | Fri |
| 3 | 5 | 0 | 422 | 412 | 1 | 1 | 13 | 0 | Apr | 0 | SivaPrasadh_G | The 30 Technologies of the Next Decate\n\n#AI ... | Siva Prasadh .G | Fri |
| 4 | 5 | 0 | 3275 | 156 | 1 | 1 | 13 | 0 | Apr | 0 | abunchofdata | Microsoft and BMW launch open platform to supp... | A bunch of data | Fri |
