### Twitter extraction and first round of cleaning
This notebook retrieves tweets from the *Twitter API* using the `tweepy` library, applies a first round of cleaning (e.g. *drop duplicates*, *sort by date*, apply some *regex*) and stores the result in a CSV file.

**Working on it...**

In [1]:
import pandas as pd
import numpy as np
import tweepy

import os
from tqdm import tqdm
from datetime import datetime
import time

In [2]:
# Hiding secret API keys in Environment Variables
consumer_key = os.environ.get('CONSUMER_KEY')
consumer_secret = os.environ.get('CONSUMER_SECRET')

access_token = os.environ.get('ACCESS_TOKEN')
access_token_secret = os.environ.get('ACCESS_TOKEN_SECRET')

bearer_token = os.environ.get('BEARER_TOKEN')
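
`os.environ.get` returns `None` when a variable is unset, so a misspelled variable name only shows up later as an authentication error. The following is a minimal sketch of an early check; the helper name `missing_credentials` and the `REQUIRED_VARS` tuple are hypothetical and not part of the original notebook.

```python
import os

# Names of the two variables the connection function relies on.
REQUIRED_VARS = ('CONSUMER_KEY', 'CONSUMER_SECRET')


def missing_credentials(names=REQUIRED_VARS, env=None):
    """Return the names in `names` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Demo with an explicit mapping instead of the real environment
print(missing_credentials(env={'CONSUMER_KEY': 'abc'}))   # ['CONSUMER_SECRET']
```

Calling `missing_credentials()` with no arguments checks the real environment; an empty list means both keys are present.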

In [3]:
query = 'Bitcoin OR BTC OR #Bitcoin OR #BTC OR $Bitcoin OR $BTC'

In [4]:
# Functions

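def connect_to_twitter_OAuth2(consumer_key=consumer_key, consumer_secret=consumer_secret):
    """Open an app-only (OAuth 2) connection to the Twitter API.
    
    Parameters
    ----------
    consumer_key : str, defaults to the CONSUMER_KEY environment variable
    consumer_secret : str, defaults to the CONSUMER_SECRET environment variable
    
    Returns
    -------
    tweepy.API
    """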
    auth = tweepy.AppAuthHandler(consumer_key, consumer_secret)
    api = tweepy.API(auth)
    return api


def retrieve_tweets(api, since_id=None, max_id=None):
    """
    Return up to 100 recent English tweets matching the global `query`.
    
    Parameters
    ----------
    api : API connection (required)
    since_id : if given, only tweets with an ID greater than this one (newer) are returned
    max_id : if given, only tweets with an ID less than or equal to this one (older) are returned;
        the search only reaches about 7 days back
    """
    # Note: tweepy 4.x renamed API.search to API.search_tweets
    return api.search(q=query, lang='en', result_type='recent', count=100, since_id=since_id, max_id=max_id)


def extract_tweet_atributes(tweet_object):
    """It returns a Pandas DataFrame with a tweet per row and its attributes per column."""
    
    tweets_list = []
    
    for tweet in tweet_object:
        # Iterate over each tweet and collect its attributes
        tweet_id = tweet.id   # Unique tweet identifier
        text = tweet.text   # String, text of the tweet
        screen_name = tweet.user.screen_name   # String, username
        followers = tweet.user.followers_count   # Number of followers
        retweet_count = tweet.retweet_count   # Number of retweets
        favorite_count = tweet.favorite_count   # Number of favorites
        created_at = tweet.created_at   # UTC time the tweet was created
        source = tweet.source   # Utility used to post the tweet
        reply_to_status = tweet.in_reply_to_status_id   # If reply: original tweet's ID
        reply_to_user = tweet.in_reply_to_screen_name   # If reply: original tweet's screen name
        # Append attributes to list
        tweets_list.append({'tweet_id':tweet_id,
                            'text':text, 
                            'screen_name':screen_name,
                            'followers':followers,
                            'retweet_count':retweet_count, 
                            'favorite_count':favorite_count, 
                            'created_at':created_at, 
                            'source':source,
                            'reply_to_status':reply_to_status,
                            'reply_to_user':reply_to_user})
    # Creates a DataFrame
    df = pd.DataFrame(tweets_list, columns=['tweet_id',
                                            'text',
                                            'screen_name',
                                            'followers',
                                            'retweet_count',
                                            'favorite_count', 
                                            'created_at',
                                            'source',
                                            'reply_to_status',
                                            'reply_to_user'])
    return df

**API rate limits:** Maximum of 450 requests per 15 minutes. Endpoint: Recent Search
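
The limit translates into a ceiling on tweets per window and a minimum waiting time between laps. The arithmetic can be sketched as follows, assuming the window starts with the first request of a lap; the helper `remaining_sleep` is hypothetical, and 409 seconds is only an example lap duration.

```python
REQUESTS_PER_WINDOW = 450    # limit quoted above
WINDOW_SECONDS = 15 * 60     # 15-minute window
TWEETS_PER_REQUEST = 100     # the `count` passed to the search call

# Upper bound on tweets collected in one window
max_tweets_per_window = REQUESTS_PER_WINDOW * TWEETS_PER_REQUEST
# Pace that would spread the requests evenly over the window
seconds_per_request = WINDOW_SECONDS / REQUESTS_PER_WINDOW


def remaining_sleep(elapsed_seconds, window_seconds=WINDOW_SECONDS):
    """Seconds left in the window after `elapsed_seconds` spent on requests."""
    return max(0.0, float(window_seconds - elapsed_seconds))


print(max_tweets_per_window)   # 45000
print(seconds_per_request)     # 2.0
print(remaining_sleep(409))    # 491.0
```

As an alternative to manual sleeping, `tweepy.API` accepts `wait_on_rate_limit=True`, which makes the client wait on its own when the limit is reached.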

In [5]:
# Main

# Set a connection to the api
api = connect_to_twitter_OAuth2()
# Set some required variables
number_of_requests = 450
count = 0
laps = 2
last_id = None
dfs = []
# Outer loop: one lap per 15-minute rate-limit window
while count < laps:
    # Inner loop: one search request per iteration
    for i in tqdm(range(number_of_requests)):
        
        # No max_id is passed, so every request asks for the newest tweets
        crypto_tweets = retrieve_tweets(api)
        df = extract_tweet_atributes(crypto_tweets)
        dfs.append(df)
    
    print(f'I\'ve got {len(dfs)} dataframes in my list so far.')
    # Increment the lap counter and stop after the last lap
    count += 1
    if count == laps:
        break
    # Time info
    now = datetime.now()
    current_time = now.strftime("%H:%M:%S")
    print(f'I\'m sleeping @ {current_time}...')
    # Sleep until the next 450-request window opens
    time.sleep(15 * 60)
    
print('Done! :D')
df = pd.concat(dfs, ignore_index=True)

100%|████████████████████████████████████████████████████████████████████████████████| 450/450 [06:49<00:00,  1.10it/s]


I've got 450 dataframes in my list so far.
I'm sleeping now...


100%|████████████████████████████████████████████████████████████████████████████████| 450/450 [07:17<00:00,  1.03it/s]


I've got 900 dataframes in my list so far.
I'm sleeping now...


In [6]:
# Weaknesses to improve:
# 1. Add a condition so the tweet retrieval function can take a since_id parameter:
#     This will allow fetching only tweets posted since the last time the function ran.
# 2. Data should be stored in a CSV file (or several) instead of only in a pandas DataFrame:
#     We can achieve this with "df.to_csv()" or by storing tweets directly with "with open(.csv, a+)"
# 3. First round of cleaning:
#     The function gathers up to 45,000 tweets per 15-minute lap.
#     We noticed that most of them are duplicates,
#     which means that in some periods there are nowhere near 45,000 new Bitcoin tweets per 15 minutes,
#     so we end up with a ton of useless duplicate tweets.
#     Create a function that removes them, sorts the rest by date ("created_at")
#     and applies some "re" patterns to remove links, #, etc. (even before storing them in a CSV file)
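
The third point can be sketched as a single function. This is only one possible reading of it: the function names and the regular expressions are assumptions, and the patterns strip URLs and the `#` character only.

```python
import re

import pandas as pd

URL_RE = re.compile(r'https?://\S+')   # links
WHITESPACE_RE = re.compile(r'\s+')     # runs of spaces and newlines


def clean_text(text):
    """Remove links and '#' characters, then collapse whitespace."""
    text = URL_RE.sub('', text)
    text = text.replace('#', '')
    return WHITESPACE_RE.sub(' ', text).strip()


def first_cleaning(df):
    """Drop duplicate tweets, sort by creation date and clean the text."""
    return (df.drop_duplicates(subset='tweet_id')
              .sort_values('created_at', ignore_index=True)
              .assign(text=lambda d: d['text'].map(clean_text)))


# Small made-up sample to show the effect
sample = pd.DataFrame({
    'tweet_id': [2, 1, 2],
    'text': ['#Bitcoin up https://t.co/x', 'BTC  down', '#Bitcoin up https://t.co/x'],
    'created_at': pd.to_datetime(['2021-02-01 22:56:52',
                                  '2021-02-01 22:56:51',
                                  '2021-02-01 22:56:52']),
})
cleaned = first_cleaning(sample)
print(cleaned['text'].tolist())   # ['BTC down', 'Bitcoin up']
```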

In [17]:
df = df.sort_values(by='created_at', ignore_index=True).drop_duplicates(subset=['tweet_id'], ignore_index=True)

In [19]:
df.head()

| | tweet_id | text | screen_name | followers | retweet_count | favorite_count | created_at | source | reply_to_status | reply_to_user |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1356375993620049925 | RT @LesangT: Elon Musk just got asked about Bi... | chocboipeter | 503 | 19057 | 0 | 2021-02-01 22:56:51 | Twitter for Android | | |
| 1 | 1356375994932867072 | RT @luizMilfont: "My husband used to worry abo... | MoreKoolaidPlz | 116 | 6 | 0 | 2021-02-01 22:56:52 | Twitter for Android | | |
| 2 | 1356376000129425409 | RT @genesimmons: I’m not recommending any of t... | ShitCinc | 48 | 2731 | 0 | 2021-02-01 22:56:53 | Twitter for iPhone | | |
| 3 | 1356376001069133825 | @Alts_Anonymous @f2pool_official Because they ... | Sil_Brazile | 39 | 0 | 0 | 2021-02-01 22:56:53 | Twitter for Android | 1.356066e+18 | Alts_Anonymous |
| 4 | 1356375999450124288 | RT @buddhistblaire: Let’s make DOGE the next B... | holdthedoge1 | 165 | 1 | 0 | 2021-02-01 22:56:53 | Twitter for Android | | |


In [46]:
# Getting the last id
df[df['created_at'] == df['created_at'].max()][::-1][:1]['tweet_id'].values

array([1356383500916690944], dtype=int64)
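
Since `since_id` filters on the ID itself, a simpler route may be to take the largest `tweet_id` directly. `created_at` has one-second resolution, so several tweets can share the latest timestamp. A small sketch using three IDs from the table above, all created at 22:56:53:

```python
import pandas as pd

sample = pd.DataFrame({
    'tweet_id': [1356376000129425409, 1356376001069133825, 1356375999450124288],
    'created_at': pd.to_datetime(['2021-02-01 22:56:53'] * 3),
})

# Largest ID as a plain Python int, ready to pass as since_id
last_id = int(sample['tweet_id'].max())
print(last_id)   # 1356376001069133825
```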

In [48]:
# Path where first set of tweets will be stored to play with them
file_path = 'C:/Users/Javi/00_raw_data/data_tfm/tweet_set.csv'

df.to_csv(file_path, sep=',', index=False)
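
`df.to_csv` overwrites the file on every run. If later runs should add to the same file, one option is to open it in append mode and write the header only once. The helper `append_to_csv` below is a hypothetical sketch, demonstrated on a temporary directory rather than the real path.

```python
import os
import tempfile

import pandas as pd


def append_to_csv(df, path):
    """Append `df` to the CSV at `path`, writing the header only for a new file."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    df.to_csv(path, mode='a', header=write_header, index=False)


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'tweet_set.csv')
    append_to_csv(pd.DataFrame({'tweet_id': [1], 'text': ['a']}), demo_path)
    append_to_csv(pd.DataFrame({'tweet_id': [2], 'text': ['b']}), demo_path)
    stored = pd.read_csv(demo_path)

print(stored['tweet_id'].tolist())   # [1, 2]
```

Appending does not remove duplicates across runs, so dropping duplicates on `tweet_id` after reading the file back is still needed.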

In [2]:
you = None
bool(you)

False