## Pulling Data with Tweepy

**By:** _Hongshen Lee_

In [None]:
import os
import time
import pandas as pd
from datetime import datetime
import tweepy
# I've put my API keys in a .py file called API_keys.py
from API_keys import api_key, api_key_secret, access_token, access_token_secret

In [None]:
# Authenticate the Tweepy API
auth = tweepy.OAuthHandler(api_key, api_key_secret)
auth.set_access_token(access_token, access_token_secret)

# wait_on_rate_limit makes Tweepy wait for the 15-minute (900 s) window to reset once the
# rate limit is hit; wait_on_rate_limit_notify prints a notice when that happens
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True, compression=True)

## Brief Introduction to the Project
Apple has released an epoch-making product built on **Apple Silicon M1**. I'm curious about how this new product is being received and what its reputation is. Using the interface Tweepy provides to the Twitter API, I collect the relevant tweets for further analysis.

### Fields
Each record contains the following fields:

- id_str: unique ID of the tweet
- username: the author's screen name on Twitter
- location: the location given in the author's Twitter profile
- following: the number of accounts the author follows
- followers: the number of accounts following the author
- total_tweets: the total number of tweets the author has published
- favorite_count: the number of times this tweet has been favorited
- retweet_count: the number of times this tweet has been retweeted
- text: the content of the tweet
- source: the client the author used to publish the tweet
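
As a rough illustration of how these fields map onto a tweet object, the sketch below builds one record as a dictionary keyed by field name. The `SimpleNamespace` object is a made-up stand-in for a Tweepy status, and `tweet_to_record` is a hypothetical helper, not part of Tweepy.

```python
from types import SimpleNamespace

FIELDS = ['id_str', 'username', 'location', 'following', 'followers',
          'total_tweets', 'favorite_count', 'retweet_count', 'text', 'source']


def tweet_to_record(tweet):
    """Map a tweet-like object to a dict keyed by the field names above."""
    user = tweet.user
    return {
        'id_str': tweet.id_str,
        'username': user.screen_name,
        'location': user.location,
        'following': user.friends_count,
        'followers': user.followers_count,
        'total_tweets': user.statuses_count,
        'favorite_count': tweet.favorite_count,
        'retweet_count': tweet.retweet_count,
        'text': tweet.full_text,
        'source': tweet.source,
    }


# Made-up stand-in for a Tweepy status object, for illustration only.
fake_tweet = SimpleNamespace(
    id_str='1', favorite_count=3, retweet_count=1,
    full_text='Trying out the new M1 MacBook Air', source='Twitter for iPhone',
    user=SimpleNamespace(screen_name='example_user', location='Taipei',
                         friends_count=10, followers_count=20, statuses_count=30),
)
record = tweet_to_record(fake_tweet)
print(record['username'], record['favorite_count'])
```

Keying each value by name means the order in which values are listed cannot drift away from the column order.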

### Discussion

For this dataset, several interesting questions could be:

- How much do people like or dislike Apple's new product with its new chip?
- Do active and influential users (those with more followers and more total tweets) like the new product?
- Are people's preferences related to their geographical location?
- Are people's preferences related to their mobile phones (which could be inferred from the source field)?
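
For the last question, one possible way to infer a device from the source field is simple keyword matching. This is only a sketch: `device_family`, its labels, and its keyword rules are assumptions, and real source strings would need to be inspected before relying on them.

```python
def device_family(source):
    """Return a coarse device label for a tweet's source string."""
    s = (source or '').lower()
    if 'iphone' in s or 'ipad' in s:
        return 'iOS'
    if 'android' in s:
        return 'Android'
    if 'web' in s:
        return 'Web'
    return 'Other'


# Example client names; the mapping rules above are illustrative assumptions.
for name in ['Twitter for iPhone', 'Twitter for Android', 'Twitter Web App', 'TweetDeck']:
    print(name, '->', device_family(name))
```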


## Search Relevant Tweets

A basic, free developer account can make only a limited number of API calls (about 900 every 15 minutes before access is denied), so tweets have to be collected in batches. The functions below fetch search results in pages of up to 100 tweets and append each page to a CSV file. The approach follows the reference below, whose version extracts 2,500 tweets per run, once every 15 minutes.

Ref: https://medium.com/python-in-plain-english/scraping-tweets-with-tweepy-python-59413046e788
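
Since `wait_on_rate_limit=True` is set above, Tweepy already pauses when the limit is hit. For comparison, a manual alternative is to pause for a fixed time after a set number of calls. The sketch below shows that idea in isolation; `paced` and its parameters are hypothetical names, and the sleep function is injectable so the example runs instantly.

```python
import time


def paced(items, calls_per_window, window_seconds, sleep=time.sleep):
    """Yield items, pausing for one window after every `calls_per_window` items."""
    for n, item in enumerate(items, start=1):
        yield item
        if n % calls_per_window == 0:
            sleep(window_seconds)


# Record the requested pauses instead of actually sleeping.
naps = []
pages = list(paced(range(5), calls_per_window=2, window_seconds=900, sleep=naps.append))
print(pages, naps)
```

In real use the default `time.sleep` would be kept, and `items` would be the pages produced by the cursor.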

In [None]:
file_name = "apple_m1_data_new.csv"


def init_file():
    """Create the CSV file containing only the header row."""
    db_tweets = pd.DataFrame(columns=['id_str', 'username', 'location', 'following',
                                      'followers', 'total_tweets', 'favorite_count',
                                      'retweet_count', 'text', 'source'])
    db_tweets.to_csv(file_name, index=False)
    

In [None]:
def write_down_records(tweet_list):
    db_tweets = pd.DataFrame(columns=['id_str','username', 'location', 'following',
                                      'followers', 'total_tweets', 'favorite_count',
                                      'retweet_count', 'text', 'source'])
    for tweet in tweet_list:
        id_str = tweet.id_str
        username = tweet.user.screen_name
        location = tweet.user.location
        following = tweet.user.friends_count
        followers = tweet.user.followers_count
        total_tweets = tweet.user.statuses_count
        retweet_count = tweet.retweet_count
        favorite_count = tweet.favorite_count
        try:
            text = tweet.full_text
            source = tweet.source
        except AttributeError:  # A Retweet
            text = tweet.retweeted_status.full_text
            source = tweet.retweeted_status.source
        # Collect the 10 fields in the same order as the DataFrame columns
        ith_tweet = [id_str, username, location, following, followers, total_tweets,
                     favorite_count, retweet_count, text, source]
        db_tweets.loc[len(db_tweets)] = ith_tweet
    db_tweets.to_csv(file_name, mode='a', header=False, index=False)

In [None]:
def scrape_tweets(search_words, date_since):
    # Page through the search results and write each page to the CSV file

    program_start = time.time()
    # tweepy.Cursor handles pagination for us;
    # .pages() yields one page (a list of tweets) at a time.
    count = 1
    for tweets in tweepy.Cursor(api.search, q=search_words, lang="en", rpp=100, count=100,
                                since=date_since, tweet_mode='extended').pages():
        # Time how long it takes to process this page
        start_run = time.time()
        # Store this page's tweets in a Python list
        tweet_list = list(tweets)
        noTweets = len(tweet_list)
        # Write the records to the file
        write_down_records(tweet_list)
        # Run ended:
        duration_run = round((time.time() - start_run) / 60, 2)
        print('Time take for {} to complete is {} mins,scraped {} tweets'.format(count,duration_run, noTweets))
        count=count+1
        # time.sleep(920)  # 15 minute sleep time

    # End
    program_end = time.time()
    print('Scraping has completed!')
    print('Total time taken to scrap is {} minutes.'.format(round((program_end - program_start) / 60, 2)))

### Keywords to search
Some keywords for finding relevant tweets are apple m1, mba, macbook air, and apple chip.
However, tweets matching these keywords are not necessarily about our topic, "Apple's new chip", so setting a start date can effectively improve their relevance.
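
The query strings used below combine keywords with `OR` and exclude retweets with `-filter:retweets`. As a sketch of how such strings might be assembled, the hypothetical helper `build_query` below also wraps multi-word phrases in double quotes so they are matched as a phrase. How Twitter groups `OR` with the other terms is worth checking against the search documentation.

```python
def build_query(terms, exclude_retweets=True):
    """Join search terms with OR, quoting multi-word phrases."""
    parts = ['"{}"'.format(t) if ' ' in t else t for t in terms]
    query = ' OR '.join(parts)
    if exclude_retweets:
        query += ' -filter:retweets'
    return query


print(build_query(['MBA', 'macbook air']))
```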

In [None]:
init_file()

In [None]:
search_words = "#M1 -filter:retweets"
date_since = "2020-11-10"
# Call the function scrape_tweets
scrape_tweets(search_words, date_since)

In [None]:
search_words = "MBA OR MACBOOK AIR -filter:retweets"
date_since = "2020-11-10"
# Call the function scrape_tweets
scrape_tweets(search_words, date_since)

In [133]:
search_words = "#Apple chip -filter:retweets"
date_since = "2020-11-10"
# Call the function scrape_tweets
scrape_tweets(search_words, date_since)

Time take for 1 to complete is 0.01 mins,scraped 100 tweets
Time take for 2 to complete is 0.01 mins,scraped 100 tweets
Time take for 3 to complete is 0.01 mins,scraped 100 tweets
Time take for 4 to complete is 0.01 mins,scraped 100 tweets
Time take for 5 to complete is 0.0 mins,scraped 59 tweets
Scraping has completed!
Total time taken to scrap is 0.55 minutes.
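
All three searches append to the same CSV file, so a tweet that matches more than one query could be stored more than once. Whether that actually happens in this dataset has not been checked. The sketch below shows one way to drop repeated `id_str` values, using the standard `csv` module and a few made-up rows in place of the real file; `unique_by_id` is a hypothetical helper.

```python
import csv
import io


def unique_by_id(rows, key='id_str'):
    """Keep the first row seen for each value of `key`."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


# Made-up rows standing in for the scraped CSV file.
sample = io.StringIO("id_str,username,text\n1,a,first\n2,b,second\n1,a,first\n")
rows = list(csv.DictReader(sample))
deduped = unique_by_id(rows)
print(len(rows), len(deduped))
```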
