## Notebook's Table of Contents<a name="TOC"></a>
<br>
<b>This companion notebook builds on the scraping article and its notebook, covering additional scenarios and providing more examples.</b>

0. [Credentials and Authorization](#Section0)
<br>Setting up credentials and authorization in order to utilize Tweepy
1. [Getting More Information From Tweets](#Section1)
<br>How to scrape more information from tweets, such as favorite count, retweet count, whether the tweet is a reply to someone else, and, if the user has turned them on, the coordinates the tweet came from.
2. [Getting User Information From Tweets](#Section2)
<br>How to scrape user information from tweets, such as follower count, total number of tweets, whether the user is verified, the location listed on the account, etc.
3. [Scraping Tweets With Advanced Queries](#Section3)
<br>How to scrape tweets using more specific queries, such as searching by tweet language, tweets within a certain location, tweets within specific date ranges, top tweets, etc.
4. [Putting It All Together](#Section4)
<br>Showcasing how you can mix and match the methods shown above to create queries that'll fulfill your data needs.

## Imports for Notebook

In [12]:
# Pip install Tweepy if you don't already have the package
# !pip install tweepy

# Imports
import tweepy
import pandas as pd
import time

## 0. Credentials and Authorization<a name="Section0"></a>
[Return to Table of Contents](#TOC)
<br>Tweepy requires credentials before you can use its API. The code below sets up the notebook for authorization. I already have an article covering how to set up Tweepy and get credentials [here](https://towardsdatascience.com/how-to-scrape-tweets-from-twitter-59287e20f0f1) if further instructions are needed.

You don't necessarily have to create a credentials file; however, if you find yourself sharing Tweepy code with other parties, I recommend one so you don't accidentally share your credentials. Otherwise, enter your credentials directly in the cell below.

In [2]:
# Credentials
# Replace the placeholder strings with your own keys; never publish real credentials

consumer_key = "YOUR_CONSUMER_KEY"
consumer_secret = "YOUR_CONSUMER_SECRET"
access_token = "YOUR_ACCESS_TOKEN"
access_token_secret = "YOUR_ACCESS_TOKEN_SECRET"

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth,wait_on_rate_limit=True)
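
If you prefer to keep keys out of the notebook, one option is a small JSON file that stays out of version control. The sketch below is only an illustration under assumptions: the file name `twitter_credentials.json`, its key names, and the helper `load_credentials` are hypothetical and are not part of Tweepy or the original article.

```python
import json

# Hypothetical key names; they mirror the variable names used in this notebook
REQUIRED_KEYS = ("consumer_key", "consumer_secret",
                 "access_token", "access_token_secret")

def load_credentials(path):
    """Read API credentials from a JSON file and check every key is present."""
    with open(path) as f:
        creds = json.load(f)
    # Collect any keys that are absent or empty so the error message is useful
    missing = [key for key in REQUIRED_KEYS if not creds.get(key)]
    if missing:
        raise KeyError("Missing credentials: {}".format(", ".join(missing)))
    return creds

# Usage (assumes the hypothetical file exists next to the notebook):
# creds = load_credentials("twitter_credentials.json")
# auth = tweepy.OAuthHandler(creds["consumer_key"], creds["consumer_secret"])
# auth.set_access_token(creds["access_token"], creds["access_token_secret"])
# api = tweepy.API(auth, wait_on_rate_limit=True)
```

Failing early on a missing key tends to be easier to debug than an authorization error returned later by the API.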

## 1. Getting More Information From Tweets<a name="Section1"></a>
[Return to Table of Contents](#TOC)
<br>Below is a list of information available in the tweet object with Tweepy. It is not exhaustive but covers the majority of the available information. For a complete list of everything contained in the tweet object, there's documentation [here](https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/overview/tweet-object) describing all the attributes.

String versions of IDs (e.g., id_str, in_reply_to_status_id_str) are used instead of the integer versions to preserve data integrity, since IDs stored as integers can be cut off by tools that cannot represent large integers exactly.
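
To make the risk concrete, here is a minimal sketch of how a large ID can change once it passes through a 64-bit float. The ID below is a made-up 19-digit number used only for illustration, not a real tweet.

```python
# Hypothetical 19-digit tweet ID, used only for illustration
tweet_id = 1050118621198921729

# Tools that store numbers as 64-bit floats (for example spreadsheets or
# JavaScript) cannot represent an integer this large exactly
as_float = float(tweet_id)
round_tripped = int(as_float)

print(round_tripped == tweet_id)      # False: the trailing digits changed

# The string version survives the same round trip unchanged
tweet_id_str = str(tweet_id)
print(int(tweet_id_str) == tweet_id)  # True
```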

* tweet.user: <b>User information is covered in greater detail in part 2</b><br><br>

* tweet.full_text: <b>Full text content of the tweet, available when the API is told to return the complete contents of tweets longer than 140 characters (in Tweepy, by requesting extended tweet mode)</b><br><br>

* tweet.text: Text content of tweet
* tweet.created_at: Date tweet was created
* tweet.id_str: Id of tweet
* tweet.user.screen_name: Username of tweet's author
* tweet.coordinates: Geographic location as reported by the user or client. May be null, which is why the extract_coordinates function below was created
* tweet.place: Place associated with the tweet, such as Las Vegas, NV. May be null, which is why the extract_place function below was created
* tweet.retweet_count: Count of retweets
* tweet.favorite_count: Count of favorites
* tweet.lang: Indicates a BCP 47 language identifier corresponding to machine detected language of tweet text.
* tweet.source: Source where tweet was posted through. Ex: Twitter Web Client
* tweet.in_reply_to_status_id_str: If the tweet is a reply, the ID of the original tweet. Null if the tweet is not a reply
* tweet.in_reply_to_user_id_str: If the tweet is a reply, the string representation of the original tweet's author's user ID. Null if the tweet is not a reply
* tweet.is_quote_status: Whether the tweet is a quote tweet
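
Because `tweet.full_text` only exists when extended mode is requested (in Tweepy, typically by passing `tweet_mode='extended'` along with the other query parameters) and `tweet.text` is what you get otherwise, a small fallback helper can keep the rest of your code the same in both modes. The helper name `get_tweet_text` is hypothetical, and the objects below are stand-ins so the sketch runs without credentials.

```python
from types import SimpleNamespace

def get_tweet_text(tweet):
    """Return full_text when the tweet object has it, otherwise fall back to text."""
    full_text = getattr(tweet, "full_text", None)
    return full_text if full_text is not None else tweet.text

# Stand-in objects for illustration only; real tweets come from Tweepy
extended_tweet = SimpleNamespace(full_text="A long tweet that is not truncated")
standard_tweet = SimpleNamespace(text="A short tweet")

print(get_tweet_text(extended_tweet))  # A long tweet that is not truncated
print(get_tweet_text(standard_tweet))  # A short tweet
```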

In [3]:
# Extracts the coordinates from a tweet row if it has coordinate info
# Most tweets have null coordinates, so it is important to run this check
# Make sure to run this cell, as it is used by many of the functions below
def extract_coordinates(row):
    if row['Tweet Coordinates']:
        return row['Tweet Coordinates']['coordinates']
    else:
        return None

# Extracts the place (such as city, state, or country) from a tweet row if it has place info
# Most tweets have null place info, so it is important to run this check
# Make sure to run this cell, as it is used by many of the functions below
def extract_place(row):
    if row['Place Info']:
        return row['Place Info'].full_name
    else:
        return None
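
To see what the two helpers return without calling the API, the sketch below copies them from the cell above and runs them on stand-in rows. The sample values are invented for illustration; plain dictionaries play the role of DataFrame rows, and a `SimpleNamespace` plays the role of Tweepy's place object.

```python
from types import SimpleNamespace

def extract_coordinates(row):
    if row['Tweet Coordinates']:
        return row['Tweet Coordinates']['coordinates']
    else:
        return None

def extract_place(row):
    if row['Place Info']:
        return row['Place Info'].full_name
    else:
        return None

# Invented sample row with location data; coordinates are [longitude, latitude]
geo_row = {
    'Tweet Coordinates': {'type': 'Point', 'coordinates': [-115.1398, 36.1699]},
    'Place Info': SimpleNamespace(full_name='Las Vegas, NV'),
}
# Invented sample row without location data, which is the common case
empty_row = {'Tweet Coordinates': None, 'Place Info': None}

print(extract_coordinates(geo_row))    # [-115.1398, 36.1699]
print(extract_place(geo_row))          # Las Vegas, NV
print(extract_coordinates(empty_row))  # None
print(extract_place(empty_row))        # None
```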

In [4]:
def scrape_user_tweets(username, max_tweets):
    # Creation of query method using parameters
    # screen_name is used because newer Tweepy versions no longer accept the id keyword
    tweets = tweepy.Cursor(api.user_timeline, screen_name=username).items(max_tweets)

    # List comprehension pulling chosen tweet information from tweets iterable object
    # Add or remove tweet information you want in the below list comprehension
    tweets_list = [[tweet.text, tweet.created_at, tweet.id_str, tweet.user.screen_name, tweet.coordinates,
                   tweet.place, tweet.retweet_count, tweet.favorite_count, tweet.lang,
                   tweet.source, tweet.in_reply_to_status_id_str, 
                    tweet.in_reply_to_user_id_str, tweet.is_quote_status,
                    ] for tweet in tweets]

    # Creation of dataframe from tweets_list
    # Add or remove columns to match the tweet information you add or remove above
    tweets_df = pd.DataFrame(tweets_list,columns=['Tweet Text', 'Tweet Datetime', 'Tweet Id', 'Twitter @ Name', 'Tweet Coordinates', 'Place Info',
                                                 'Retweets', 'Favorites', 'Language', 'Source', 'Replied Tweet Id',
                                                  'Replied Tweet User Id Str', 'Quote Status Bool'])
    
    # Checks if there are coordinates attached to tweets, if so extracts them
    tweets_df['Tweet Coordinates'] = tweets_df.apply(extract_coordinates,axis=1)
    
    # Checks if there is place information available, if so extracts them
    tweets_df['Place Info'] = tweets_df.apply(extract_place,axis=1)
    
    # Uncomment/comment below lines to decide between creating csv or excel file 
    tweets_df.to_csv('{}-tweets.csv'.format(username), sep=',', index = False)
    # tweets_df.to_excel('{}-tweets.xlsx'.format(username), index = False)

In [5]:
# Creating example username to scrape from
username = 'elonmusk'

# max_tweets sets how many of the user's most recent tweets to pull
max_tweets = 1000

# Function scrapes the user's timeline, attempts to pull max_tweets tweets, and creates a csv/excel file from the data
scrape_user_tweets(username,max_tweets)