### Twitter extraction and first round of cleaning
This notebook retrieves tweets from the *Twitter API* using the `tweepy` library, applies a first round of cleaning (e.g. *dropping duplicates*, *sorting* them chronologically, applying some *regex*) and stores them in a CSV file.

**Working on it...**

In [1]:
import pandas as pd
import numpy as np
import tweepy

import os
from tqdm import tqdm
from datetime import datetime
import time

# My module
import my_email
import config

In [2]:
# Secret API keys are loaded from the local config module rather than hard-coded here
consumer_key = config.CONSUMER_KEY
consumer_secret = config.CONSUMER_SECRET

access_token = config.ACCESS_TOKEN
access_token_secret = config.ACCESS_TOKEN_SECRET

bearer_token = config.BEARER_TOKEN
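
The contents of the `config` module are not shown in this notebook. One possible layout, assuming the keys really do live in environment variables, is a small helper like the one below. The name `read_secret` is hypothetical and only serves the illustration.

```python
import os


def read_secret(name: str) -> str:
    """Return the value of an environment variable or fail loudly.

    Hypothetical helper: the real config module may look different.
    """
    value = os.environ.get(name)
    if not value:
        # Failing early avoids sending empty credentials to the API
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Possible usage inside config.py (assumed variable names):
# CONSUMER_KEY = read_secret("CONSUMER_KEY")
# CONSUMER_SECRET = read_secret("CONSUMER_SECRET")
```

Raising an error for a missing variable makes a misconfigured machine fail at import time instead of at the first API call.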

In [3]:
query = 'Bitcoin OR BTC OR #Bitcoin OR #BTC OR $Bitcoin OR $BTC'

In [4]:
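# Check access to the API
try:
    # AppAuthHandler requests a bearer token, so bad keys already fail here
    auth = tweepy.AppAuthHandler(consumer_key, consumer_secret)
    api = tweepy.API(auth)
    # verify_credentials() needs user context, so with app-only auth
    # a lightweight authenticated call is made instead
    api.rate_limit_status()
    print("Access granted :)")
except tweepy.TweepError:   # tweepy 3.x exception class
    print("Access denied :(")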

Access granted :)


Defining some functions:

In [5]:
# Functions

def connect_to_twitter_OAuth2(consumer_key=consumer_key, consumer_secret=consumer_secret):
    """Sets a connection to the twitter API.
    
    Parameters
    ----------
    consumer_key : set by default
    consumer_secret : set by default
    """
    auth = tweepy.AppAuthHandler(consumer_key, consumer_secret)
    api = tweepy.API(auth)
    return api


def retrieve_tweets(api, since_id=None, max_id=None):
    """
    It returns a tweepy search result with up to 100 tweets matching the global `query`.
    
    Parameters
    ----------
    api : api connection (required)
    since_id : if given, it returns tweets with an ID greater than that (newer)
    max_id : if given, it returns tweets with an ID less than or equal to that (older) (max. 7 days prior)
    """
    return api.search(q=query,
                      lang='en',
                      result_type='mixed',
                      count=100,
                      since_id=since_id,
                      max_id=max_id,
                      tweet_mode='extended')


def extract_tweet_atributes(tweet_object):
    """It returns a Pandas DataFrame with a tweet per row and its attributes per column."""
    
    tweets_list = []
    
    for tweet in tweet_object:
        # Iterates over each tweet and gets its attributes
        tweet_id = tweet.id   # Unique tweet identifier
        text = tweet.full_text   # String, text of the tweet
        screen_name = tweet.user.screen_name   # String, username
        followers = tweet.user.followers_count   # Number of followers
        retweet_count = tweet.retweet_count   # Number of retweets
        favorite_count = tweet.favorite_count   # Number of favorites
        created_at = tweet.created_at   # UTC time tweet created
        source = tweet.source   # Utility used to post the tweet
        # Append attributes to list
        tweets_list.append({'tweet_id':tweet_id,
                            'text':text, 
                            'screen_name':screen_name,
                            'followers':followers,
                            'retweet_count':retweet_count, 
                            'favorite_count':favorite_count, 
                            'created_at':created_at, 
                            'source':source})
    # Creates a DataFrame
    df = pd.DataFrame(tweets_list, columns=['tweet_id',
                                            'text',
                                            'screen_name',
                                            'followers',
                                            'retweet_count',
                                            'favorite_count', 
                                            'created_at',
                                            'source'])
    return df


def first_cleaning(df):
    """It returns a DataFrame after dropping duplicates (subset=['tweet_id']) and sorting it (by='tweet_id')
    
    Parameters
    ----------
    df : Pandas DataFrame to clean.
    """
    df_no_dup = df.drop_duplicates(subset=['tweet_id'], ignore_index=True)
    cleaned_df = df_no_dup.sort_values(by='tweet_id', ignore_index=True)
    return cleaned_df


**API rate limits:** Maximum of 450 requests per 15 minutes (app-only authentication). Endpoint: Recent Search
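
A quick back-of-the-envelope check of what this limit means in practice. The helper names are made up for illustration, and the cooldown assumes the 15-minute window starts with the first request of a burst, which is a simplification.

```python
def seconds_per_request(max_requests: int, window_minutes: float) -> float:
    """Average spacing needed to spread the requests evenly over the window."""
    return window_minutes * 60 / max_requests


def cooldown_after_burst(burst_seconds: float, window_minutes: float = 15) -> float:
    """Seconds left in the window after sending all requests in one burst.

    Assumption: the window opens with the first request of the burst.
    """
    return max(0.0, window_minutes * 60 - burst_seconds)


print(seconds_per_request(450, 15))   # 2.0 seconds between requests
print(cooldown_after_burst(289))      # 611.0 seconds after a 4:49 burst
```

`tweepy` can also handle the waiting itself when the connection is created with `tweepy.API(auth, wait_on_rate_limit=True)`.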

In [6]:
# Main functions

def main_retrieval(file_path, last_id=None):
    """
    Main retrieval function.
    It makes 450 requests.
    It saves a DataFrame to a csv in a given path.
    
    Returns 
    -------
    + Last tweet id.
    + DataFrame length
    
    Parameters
    ----------
    file_path : file where the DataFrame will be stored (append mode)
    last_id : if given, it retrieves only tweets with a greater ID (newer)
    """
    # Set a connection to the api
    api = connect_to_twitter_OAuth2()
    # Set some required variables
    number_of_requests = 450
    dfs = []
    # Main loop
    for i in tqdm(range(number_of_requests)):
        
        crypto_tweets = retrieve_tweets(api, since_id=last_id)
        df = extract_tweet_atributes(crypto_tweets)
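        # Set a new last_id (only if tweets came back). Next iteration starts taking tweets from it on
        if not df.empty:
            last_id = df['tweet_id'].max()
        dfs.append(df)

    df = pd.concat(dfs, ignore_index=True)
    df = first_cleaning(df)
    # Keep the previous last_id if nothing was retrieved (max() of an empty column is NaN)
    if not df.empty:
        last_id = df['tweet_id'].max()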
    # Saves df to a csv in the file_path, ignoring index, appending it, and not writing column names each time
    df.to_csv(file_path, sep=',', index=False, mode='a', header=False)

    return last_id, len(df)



def long_term_retrieval(file_path, iterations=25, last_id=None):
    """
    It keeps retrieving tweets over a long period (about 10.5 hours with the default settings).
    
    Parameters
    ----------
    file_path : file where the DataFrame will be stored (append mode).
    iterations : number of extra laps after the first main_retrieval call (iterations + 1 calls in total).
                 Each lap takes about 25 minutes (around 5 of retrieval plus a 20-minute sleep).
    last_id : if given, it retrieves only tweets with a greater ID (newer).
    """
    lap = 0
    while lap <= iterations:
        # Try to retrieve tweets; if it fails, send an email without breaking the loop
        try:
            # Update last_id and get the length of the DataFrame just appended to the csv
            last_id, length = main_retrieval(file_path=file_path, last_id=last_id)
            print(f'{length} new rows added to the csv.')
        except Exception as error:
            print(f'Error! {error}')
            my_email.error_email()
        # Increase the counter and break the loop if necessary
        lap += 1
        if lap > iterations:
            break
        print(f'{(iterations + 1) - lap} laps to go.')
        # Check if it's the last lap
        if lap == iterations:
            my_email.last_lap_reminder()
        # Checks the battery and sends an email if it's low
        if my_email.check_battery() < 20:
            my_email.warning()            
        # Time info
        now = datetime.now()
        current_time = now.strftime("%H:%M:%S")
        print(f'Getting some sleep @ {current_time}...')
        # Getting some sleep til next main retrieval
        time.sleep(20 * 60)
        print('*' * 50)
    print('Done :D\nEnjoy it!')
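
To get a feel for how long a run takes, here is a rough estimate based on the loop above. The 5-minute retrieval time is an assumption read off the progress bars in the output, and `estimate_runtime_hours` is a made-up helper name.

```python
def estimate_runtime_hours(iterations: int,
                           retrieval_minutes: float = 5.0,
                           sleep_minutes: float = 20.0) -> float:
    """Rough duration of long_term_retrieval for a given number of iterations."""
    laps = iterations + 1   # the loop runs while lap <= iterations
    sleeps = iterations     # the loop breaks before sleeping on the final lap
    return (laps * retrieval_minutes + sleeps * sleep_minutes) / 60


print(estimate_runtime_hours(25))   # 10.5 hours with the default settings
```

With `iterations=50` the same estimate gives a bit under 21 hours.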

In [7]:
# Path where the set of tweets will be stored to play with them
file_path = 'C:/Users/Javi/Desktop/cryptocurrency_predictor/data/twitter/raw_tweets.csv'

In [8]:
time.sleep(12 * 60)

In [None]:
long_term_retrieval(file_path, iterations=50, last_id=1368248674711576580)

100%|████████████████████████████████████████████████████████████████████████████████| 450/450 [04:49<00:00,  1.55it/s]


991 new rows added to the csv.
50 laps to go.
Getting some sleep @ 18:35:58...
**************************************************


100%|████████████████████████████████████████████████████████████████████████████████| 450/450 [04:43<00:00,  1.59it/s]


1108 new rows added to the csv.
49 laps to go.
Getting some sleep @ 19:00:43...
**************************************************


100%|████████████████████████████████████████████████████████████████████████████████| 450/450 [04:42<00:00,  1.60it/s]


1068 new rows added to the csv.
48 laps to go.
Getting some sleep @ 19:25:26...
**************************************************


100%|████████████████████████████████████████████████████████████████████████████████| 450/450 [05:19<00:00,  1.41it/s]


1253 new rows added to the csv.
47 laps to go.
Getting some sleep @ 19:50:47...


In [None]:
# Weaknesses: 
# 1. It gave an error: number of requests exceeded :(

I'm afraid the `since_id` parameter of the `api.search()` function doesn't work quite as expected :(. It seems to retrieve only tweets that are up to about **one hour old**.

Therefore, there will always be a period of time where data is missing (between each run of the *main* cell) unless the script is running continuously (for 10/14 days or so) :(((.
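
A possible explanation, not verified in this notebook: with only `since_id` set, every request returns the newest 100 matching tweets, so consecutive requests mostly overlap (the runs above keep around 1,000 rows out of 450 × 100 retrieved). A common way to fill the gap is to keep `since_id` fixed and page backwards with `max_id`. The sketch below shows the idea with a fake fetch function; `page_backwards` and `fake_fetch_page` are hypothetical names, and tweets are plain dicts here instead of tweepy objects.

```python
from typing import Callable, Dict, List, Optional

Tweet = Dict[str, int]  # minimal stand-in for a tweet: {"id": ...}


def page_backwards(fetch_page: Callable[[Optional[int], Optional[int]], List[Tweet]],
                   since_id: Optional[int] = None,
                   max_pages: int = 450) -> List[Tweet]:
    """Collect tweets newer than since_id by walking back one page at a time."""
    collected: List[Tweet] = []
    max_id = None
    for _ in range(max_pages):
        page = fetch_page(since_id, max_id)
        if not page:
            break  # nothing left between since_id and max_id
        collected.extend(page)
        # Next page must end just below the oldest tweet seen so far
        max_id = min(tweet["id"] for tweet in page) - 1
    return collected


def fake_fetch_page(since_id, max_id, page_size=100):
    """Pretend API holding tweets with IDs 1..1000, newest first."""
    low = since_id or 0
    high = max_id if max_id is not None else 1000
    ids = [i for i in range(1000, 0, -1) if low < i <= high]
    return [{"id": i} for i in ids[:page_size]]


tweets = page_backwards(fake_fetch_page, since_id=250)
print(len(tweets), len({t["id"] for t in tweets}))   # 750 750
```

With the real API, `fetch_page` would wrap `retrieve_tweets(api, since_id=..., max_id=...)` and the ID would be read from `tweet.id` instead of `tweet["id"]`.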

### Truncated tweets
Texts over 140 characters are truncated. A possible solution is adding the `tweet_mode='extended'` parameter to the `api.search()` call inside my `retrieve_tweets` function. <br>
Let's see it in action!

AND IT WORKS!!! We got the full text of the tweet! Take that Twitter!
It doesn't work for retweets though.
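
For retweets, the untruncated text normally lives in the original tweet, which the API attaches as `retweeted_status`. A minimal sketch of a helper that prefers that text when it exists; `full_text_of` is a made-up name and the fake tweets below only mimic the attributes used here.

```python
from types import SimpleNamespace


def full_text_of(tweet):
    """Return the full text, looking inside retweeted_status for retweets."""
    original = getattr(tweet, "retweeted_status", None)
    if original is not None:
        # The retweet's own full_text may still be cut off
        return original.full_text
    return tweet.full_text


# Fake objects standing in for tweepy tweets (illustration only)
plain = SimpleNamespace(full_text="A regular tweet")
retweet = SimpleNamespace(
    full_text="RT @someone: A very long twe...",
    retweeted_status=SimpleNamespace(full_text="A very long tweet, not cut off"),
)

print(full_text_of(plain))     # A regular tweet
print(full_text_of(retweet))   # A very long tweet, not cut off
```

Note that this drops the `RT @user:` prefix, so a separate flag column would be needed to keep track of which rows are retweets.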

### First look at the data!

In [2]:
file_path = 'C:/Users/Javi/Desktop/cryptocurrency_predictor/data/twitter/raw_tweets.csv'
columns = ['tweet_id',
           'text',
           'screen_name',
           'followers',
           'retweet_count',
           'favorite_count', 
           'created_at',
           'source']

data = pd.read_csv(file_path, names=columns)

In [3]:
print(data.info())
data.tail()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 798963 entries, 0 to 798962
Data columns (total 8 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   tweet_id        798963 non-null  int64 
 1   text            798963 non-null  object
 2   screen_name     798963 non-null  object
 3   followers       798963 non-null  int64 
 4   retweet_count   798963 non-null  int64 
 5   favorite_count  798963 non-null  int64 
 6   created_at      798963 non-null  object
 7   source          788591 non-null  object
dtypes: int64(4), object(4)
memory usage: 48.8+ MB
None


|  | tweet_id | text | screen_name | followers | retweet_count | favorite_count | created_at | source |
|---|---|---|---|---|---|---|---|---|
| 798958 | 1368248669804269568 | RT @iamZatoshi: 🔥🐉 BITCOIN SV GIVEAWAY 🐉🔥 \n\n... | AirdropHunter86 | 122 | 3432 | 0 | 2021-03-06 17:14:38 | Twitter for Android |
| 798959 | 1368248672295731200 | RT @VentureCoinist: Bitcoin pumping because...... | CryptoEuclid_ | 1 | 38 | 0 | 2021-03-06 17:14:38 | Twitter for iPhone |
| 798960 | 1368248672467570689 | RT @NitroExOfficial: Our first event is #Airdr... | Siddhes26270348 | 31 | 12444 | 0 | 2021-03-06 17:14:39 | Twitter for Android |
| 798961 | 1368248673365192707 | @ThuanCapital millionaires they don't care abo... | akevinhao | 7 | 0 | 0 | 2021-03-06 17:14:39 | Twitter for iPhone |
| 798962 | 1368248674711576580 | "Supporting NitroEx Exchange for my own future... | Ayush81223104 | 29 | 0 | 0 | 2021-03-06 17:14:39 | Twitter Web App |


In [4]:
data['tweet_id'].max()

1368248674711576580