### Twitter extraction and first round of cleaning
This notebook aims to retrieve tweets from the *Twitter API* using the `tweepy` library, run a first round of cleaning on them (e.g. *drop duplicates*, *sort* by date, apply some *regex*), and store them in a csv.

**Working on it...**

In [1]:
import pandas as pd
import numpy as np
import tweepy

import os
from tqdm import tqdm
from datetime import datetime
import time

# My module
import my_email

In [4]:
# Hiding secret API keys in Environment Variables
consumer_key = os.environ.get('CONSUMER_KEY')
consumer_secret = os.environ.get('CONSUMER_SECRET')

access_token = os.environ.get('ACCESS_TOKEN')
access_token_secret = os.environ.get('ACCESS_TOKEN_SECRET')

bearer_token = os.environ.get('BEARER_TOKEN')
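
`os.environ.get` returns `None` when a variable is not set, so a missing key only surfaces later, when the connection is made. Below is a minimal sketch of an early check, assuming the five variable names used above. The helper name `missing_credentials` is hypothetical and not part of the notebook.

```python
import os

# Names of the environment variables read in the cell above
REQUIRED_VARIABLES = [
    'CONSUMER_KEY',
    'CONSUMER_SECRET',
    'ACCESS_TOKEN',
    'ACCESS_TOKEN_SECRET',
    'BEARER_TOKEN',
]


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    # Hypothetical helper: `env` defaults to the real environment
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARIABLES if not env.get(name)]


missing = missing_credentials()
if missing:
    print(f'Missing environment variables: {missing}')
```

Passing a plain dictionary as `env` makes the check easy to try without touching the real environment.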

In [5]:
query = 'Bitcoin OR BTC OR #Bitcoin OR #BTC OR $Bitcoin OR $BTC'
# Path where the set of tweets will be stored to play with them
file_path = 'C:/Users/Javi/Desktop/cryptocurrency_predictor/data/twitter/tweets.csv'

Defining some functions:

In [6]:
# Functions

def connect_to_twitter_OAuth2(consumer_key=consumer_key, consumer_secret=consumer_secret):
    """Sets a connection to the twitter API.
    
    Parameters
    ----------
    consumer_key : set by default
    consumer_secret : set by default
    """
    auth = tweepy.AppAuthHandler(consumer_key, consumer_secret)
    api = tweepy.API(auth)
    return api


def retrieve_tweets(api, since_id=None, max_id=None):
    """
    It returns a twitter object with 100 tweets of a specific api response.
    
    Parameters
    ----------
    api : api connection (required)
    since_id : if given, it returns tweets with an ID greater than that (newer)
    max_id : if given, it returns tweets with an ID less than or equal to that (older) (max. 7 days prior)
    """
    return api.search(q=query,
                      lang='en',
                      result_type='recent',
                      count=100,
                      since_id=since_id,
                      max_id=max_id,
                      tweet_mode='extended')


def extract_tweet_atributes(tweet_object):
    """It returns a Pandas DataFrame with a tweet per row and its attributes per column."""
    
    tweets_list = []
    
    for tweet in tweet_object:
        # Iterates over each tweet and gets its attributes
        tweet_id = tweet.id   # Unique tweet identifier
        text = tweet.full_text   # String, text of the tweet
        screen_name = tweet.user.screen_name   # String, username
        followers = tweet.user.followers_count   # Number of followers
        retweet_count = tweet.retweet_count   # Number of retweets
        favorite_count = tweet.favorite_count   # Number of favorites
        created_at = tweet.created_at   # UTC time tweet created
        source = tweet.source   # Utility used to post the tweet
        reply_to_status = tweet.in_reply_to_status_id   # If reply: original tweet's ID
        reply_to_user = tweet.in_reply_to_screen_name   # If reply: original tweet's screenname
        # Append attributes to list
        tweets_list.append({'tweet_id':tweet_id,
                            'text':text, 
                            'screen_name':screen_name,
                            'followers':followers,
                            'retweet_count':retweet_count, 
                            'favorite_count':favorite_count, 
                            'created_at':created_at, 
                            'source':source,
                            'reply_to_status':reply_to_status,
                            'reply_to_user':reply_to_user})
    # Creates a DataFrame
    df = pd.DataFrame(tweets_list, columns=['tweet_id',
                                            'text',
                                            'screen_name',
                                            'followers',
                                            'retweet_count',
                                            'favorite_count', 
                                            'created_at',
                                            'source',
                                            'reply_to_status',
                                            'reply_to_user'])
    return df


def first_cleaning(df):
    """It returns a DataFrame after dropping duplicates (subset=['tweet_id']) and sorting it (by='tweet_id')
    
    Parameters
    ----------
    df : Pandas DataFrame to clean
    """
    df_no_dup = df.drop_duplicates(subset=['tweet_id'], ignore_index=True)
    cleaned_df = df_no_dup.sort_values(by='tweet_id', ignore_index=True)
    return cleaned_df

**API rate limits:** Maximum of 450 requests per 15 minutes. Endpoint: Recent Search
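
The arithmetic behind that limit can be sketched as follows. The figures (450 requests, 15 minutes, 100 tweets per request) come from this notebook. The helper `seconds_to_wait` is a hypothetical illustration and is not used by the functions here.

```python
WINDOW_SECONDS = 15 * 60      # length of one rate-limit window
REQUESTS_PER_WINDOW = 450     # limit quoted above
TWEETS_PER_REQUEST = 100      # count=100 in retrieve_tweets

# Upper bound on tweets per window, and the spacing that never hits the limit
max_tweets_per_window = REQUESTS_PER_WINDOW * TWEETS_PER_REQUEST
min_seconds_between_requests = WINDOW_SECONDS / REQUESTS_PER_WINDOW


def seconds_to_wait(requests_made, window_started_at, now):
    """Seconds to pause before the next request (hypothetical helper).

    Times are plain numbers of seconds, e.g. values from time.time().
    """
    if requests_made < REQUESTS_PER_WINDOW:
        return 0.0
    # Budget spent: wait until the current window is over
    return max(0.0, window_started_at + WINDOW_SECONDS - now)


print(max_tweets_per_window)          # 45000
print(min_seconds_between_requests)   # 2.0
```

`tweepy.API` also accepts `wait_on_rate_limit=True`, which makes the library pause by itself once the limit is reached.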

In [2]:
# Main functions

def main_retrieval(file_path, last_id=None):
    """
    Main retrieval function.
    It makes 450 requests (the rate limit of one 15-minute window) and appends
    the tweets of each response to a csv in a given path.
    
    And it returns the last tweet id.
    
    Parameters
    ----------
    file_path : file where the DataFrame will be stored (append mode)
    last_id : if given, it retrieves only tweets with a greater ID (newer)
    """
    # Set a connection to the api
    api = connect_to_twitter_OAuth2()
    # Set some required variables
    number_of_requests = 450
    # Main loop
    for i in tqdm(range(number_of_requests)):
        # In the first iteration last_id may be None, so the search is not restricted by ID
        crypto_tweets = retrieve_tweets(api, since_id=last_id)
        df = extract_tweet_atributes(crypto_tweets)
        # No new tweets in this response: keep the current last_id and try again
        if df.empty:
            continue
        # Set a new last_id. Next iteration starts taking tweets from it on
        last_id = int(df['tweet_id'].max())
        df = first_cleaning(df)
        # Appends only this batch to the csv in the file_path (no index, no column names),
        # so tweets saved in earlier iterations are not written again
        df.to_csv(file_path, sep=',', index=False, mode='a', header=False)

    return last_id



def long_term_retrieval(file_path, iterations=15, last_id=None):
    """
    It aims to keep retrieving tweets for a long period, 10-12 hours.
    
    Parameters
    ----------
    file_path : file where the DataFrame will be stored (append mode).
    iterations : number of main_retrieval function calls. 15 iterations -> 11 hours period.
    last_id : if given, it retrieves only tweets with a greater ID (newer).
    """
    lap = 0
    
    while lap <= iterations:
        # Try to retrieve tweets, or send an email if it cannot. It does not break the loop
        try:
            last_id = main_retrieval(file_path=file_path, last_id=last_id)
        # Catch Exception (not a bare except) so the cell can still be interrupted by hand
        except Exception as error:
            print(f'Error! {error}')
            my_email.error_email()
        # Increase the counter and break the loop if necessary
        lap += 1
        if lap > iterations:
            break
        print(f'{iterations - lap} laps to go.')
        # Check if it's the last lap
        if lap == iterations:
            my_email.last_lap_reminder()
        # Checks the battery and sends an email if it's low
        if my_email.check_battery() < 20:
            my_email.warning()            
        # Time info
        now = datetime.now()
        current_time = now.strftime("%H:%M:%S")
        print(f'Getting some sleep @ {current_time}...')
        
        # Getting some sleep til next main retrieval
        time.sleep(40 * 60)
        print('*' * 50)

In [26]:
# This cell was run before the latest commit
# The error should be solved by now
long_term_retrieval(file_path, last_id=1358907754711289869)

100%|████████████████████████████████████████████████████████████████████████████████| 450/450 [04:12<00:00,  1.78it/s]


I'm sleeping @ 00:34:30...


100%|████████████████████████████████████████████████████████████████████████████████| 450/450 [04:11<00:00,  1.79it/s]


I'm sleeping @ 00:53:42...


100%|████████████████████████████████████████████████████████████████████████████████| 450/450 [04:09<00:00,  1.80it/s]


6 laps to go @ 01:12:53...


100%|████████████████████████████████████████████████████████████████████████████████| 450/450 [04:03<00:00,  1.85it/s]


I'm sleeping @ 02:06:57...


100%|████████████████████████████████████████████████████████████████████████████████| 450/450 [03:59<00:00,  1.88it/s]


I'm sleeping @ 02:25:57...


100%|████████████████████████████████████████████████████████████████████████████████| 450/450 [03:59<00:00,  1.88it/s]


5 laps to go @ 02:44:58...


 43%|██████████████████████████████████▍                                             | 194/450 [01:45<02:19,  1.84it/s]


TweepError: Failed to send request: HTTPSConnectionPool(host='api.twitter.com', port=443): Max retries exceeded with url: /1.1/search/tweets.json?q=Bitcoin+OR+BTC+OR+%23Bitcoin+OR+%23BTC+OR+%24Bitcoin+OR+%24BTC&lang=en&result_type=recent&count=100&since_id=1358968031821459456&tweet_mode=extended (Caused by SSLError(SSLError("bad handshake: SysCallError(10054, 'WSAECONNRESET')")))
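
The traceback above reports a connection that was reset during the SSL handshake, which is often a transient failure. Below is a minimal sketch of retrying a call with exponential backoff, under that assumption. The helper `call_with_retries` is hypothetical and not part of the notebook. It catches `Exception` only to stay self-contained; in the notebook the exception to retry on would presumably be the `TweepError` shown in the traceback.

```python
import time


def call_with_retries(func, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay, 2*base_delay, 4*base_delay... and retry.

    Hypothetical helper. After `retries` failed retries the last error is re-raised.
    """
    for attempt in range(retries + 1):
        try:
            return func()
        except Exception:
            if attempt == retries:
                raise
            # Exponential backoff: the wait doubles after every failed attempt
            sleep(base_delay * 2 ** attempt)


# Usage with a stand-in function that fails twice and then succeeds
attempts = {'count': 0}


def flaky_request():
    attempts['count'] += 1
    if attempts['count'] < 3:
        raise ConnectionError('connection reset')
    return 'ok'


print(call_with_retries(flaky_request, base_delay=0.01))  # ok
```

In `main_retrieval` the wrapped call could be something like `lambda: retrieve_tweets(api, since_id=last_id)`, so that one dropped connection does not end the whole round.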

In [62]:
# Weaknesses:
# 1. It gave an error: number of requests exceeded :(

I'm afraid the `since_id` parameter of `api.search()` doesn't work quite as expected :(. It seems to return only tweets that are at most about **one hour old**, even when `since_id` points to an older tweet.

Therefore, there will always be a period of time where data is missing (between each run of the *main* cell) unless the script is continuously running (for 10/14 days or so) :(((.
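
One possible explanation, which this notebook does not confirm: with `result_type='recent'` and `count=100`, each request returns at most the 100 newest tweets above `since_id`, so after a long pause everything older than those 100 is skipped. A common way to fill such a gap is to walk backwards with `max_id` until `since_id` is reached. The sketch below shows the idea with plain integers as tweet IDs. `page_backwards` and `fake_search` are hypothetical stand-ins, and `fake_search` only imitates the behaviour assumed here.

```python
def page_backwards(search, since_id, max_pages=10):
    """Collect every ID newer than since_id by paging from newest to oldest."""
    collected = []
    max_id = None
    for _ in range(max_pages):
        batch = search(since_id=since_id, max_id=max_id)
        if not batch:
            break
        collected.extend(batch)
        # Next page: only IDs strictly older than the oldest one seen so far
        max_id = min(batch) - 1
    return sorted(collected)


def fake_search(since_id=None, max_id=None, count=3):
    """Stand-in for a recent search: newest `count` IDs within the bounds."""
    all_ids = range(1, 11)  # pretend the timeline holds IDs 1..10
    matching = [
        tweet_id for tweet_id in all_ids
        if (since_id is None or tweet_id > since_id)
        and (max_id is None or tweet_id <= max_id)
    ]
    return sorted(matching, reverse=True)[:count]


# A single request only sees the newest page...
print(fake_search(since_id=2))                   # [10, 9, 8]
# ...while paging backwards recovers the whole gap
print(page_backwards(fake_search, since_id=2))   # [3, 4, 5, 6, 7, 8, 9, 10]
```

The search still only reaches back about 7 days, as noted in the `retrieve_tweets` docstring, so this would narrow the gaps between runs rather than remove the limit.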

### Truncated tweets
Texts over 140 characters are truncated. A possible solution is adding the `tweet_mode='extended'` parameter to the `api.search()` call in my `retrieve_tweets` function. <br>
Let's see it in action!

AND IT WORKS!!! We got the full text of the tweet! Take that Twitter!
It doesn't work for retweets though.
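
For retweets, the untruncated text of the original tweet is available in extended mode under `retweeted_status.full_text`. Below is a minimal sketch of a helper that prefers it when present. `get_full_text` is a hypothetical name, and the `SimpleNamespace` objects only imitate the attributes of a tweepy status so the example runs without the API.

```python
from types import SimpleNamespace


def get_full_text(tweet):
    """Return the untruncated text, looking inside the retweeted tweet if any."""
    retweeted = getattr(tweet, 'retweeted_status', None)
    if retweeted is not None:
        # Retweet: the text of the original tweet is the complete one
        return retweeted.full_text
    return tweet.full_text


# Stand-ins for tweepy status objects (illustration only)
plain_tweet = SimpleNamespace(full_text='A long tweet, not truncated')
retweet = SimpleNamespace(
    full_text='RT @someone: A long tweet, trunc…',
    retweeted_status=SimpleNamespace(full_text='A long tweet, not truncated'),
)

print(get_full_text(plain_tweet))
print(get_full_text(retweet))
```

Note that this drops the `RT @user:` prefix, so if retweets need to stay recognisable in the `text` column, a separate flag column would be needed.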

In [11]:
# First look
print(df.shape)
df.head(3)

(41972, 10)


| | tweet_id | text | screen_name | followers | retweet_count | favorite_count | created_at | source | reply_to_status | reply_to_user |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1358854569841819649 | RT @nwoodfine: Okay, fess up. Who are the hodl... | razvanprt1 | 76 | 34 | 0 | 2021-02-08 19:05:50 | Twitter for iPhone | | |
| 1 | 1358854569611042816 | RT @Alisaalora1: Sponsored Post\nGo with somet... | AriaAnalia | 138 | 3 | 0 | 2021-02-08 19:05:50 | Twitter Web App | | |
| 2 | 1358854569074323458 | 5 lucky winners to get $50 worth of #BTC in @l... | leleco_lisboa | 852 | 0 | 0 | 2021-02-08 19:05:50 | Twitter for Android | | |


In [13]:
# First cleaning
df = first_cleaning(df)
df.shape

(9933, 10)

In [25]:
df['tweet_id'].max()

1358887736845959169

In [80]:
new_df['created_at'].max()

Timestamp('2021-02-02 13:45:48')

In [81]:
# It should have started grabbing tweets from 13:45, not 14:27.
# At least it started taking tweets 1 hour before I ran it (15:28)
third_df['created_at'].min()

Timestamp('2021-02-02 14:27:37')
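
The size of that gap can be worked out directly from the two timestamps above. The second half of the sketch rests on an assumption the notebook does not state: `created_at` is in UTC (as the comment in `extract_tweet_atributes` says), while 15:28 may be local wall-clock time in a UTC+1 zone. If so, the first retrieved tweet was posted seconds before the run started, not an hour before.

```python
from datetime import datetime, timedelta, timezone

last_seen = datetime(2021, 2, 2, 13, 45, 48)        # new_df['created_at'].max()
first_retrieved = datetime(2021, 2, 2, 14, 27, 37)  # third_df['created_at'].min()

# Period with no tweets between the two DataFrames
gap = first_retrieved - last_seen
print(gap)  # 0:41:49

# ASSUMPTION: the run time 15:28 is local time in a UTC+1 zone
local_zone = timezone(timedelta(hours=1))
run_started_local = datetime(2021, 2, 2, 15, 28, tzinfo=local_zone)
run_started_utc = run_started_local.astimezone(timezone.utc).replace(tzinfo=None)

# Under that assumption, how old was the oldest retrieved tweet at run time?
print(run_started_utc - first_retrieved)  # 0:00:23
```

If the assumption holds, the request returned only tweets from the moment of the run onwards, which would match a search that returns the newest tweets first.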