# 01. Data ingestion
In this notebook, we search for Twitter messages: we call the Twitter developer API to download the tweets that we are interested in.

# Prerequisite: create a Twitter developer account and generate API credentials
Step 1: Apply for a Twitter developer account

Go to the Twitter developer site to apply for a developer account. There, you have to select the Twitter user responsible for this account, most likely you or your organization.



In [1]:
import tweepy as tw
import json
import pandas as pd
import pyarrow.parquet as pq
import s3fs
from pyarrow import fs
import pyarrow as pa
import os
from datetime import date

# Download tweets
## Step 1. Set up the Twitter API credentials


In [2]:
consumer_key = "changeMe"
consumer_secret = "changeMe"
access_token = "changeMe"
access_token_secret = "changeMe"
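Hard-coded credentials are easy to leak when a notebook is shared. A minimal sketch, assuming the four values are exported as environment variables; the variable names and the helper `load_twitter_credentials` are hypothetical and not part of tweepy:

```python
import os

# Hypothetical environment variable names; adapt them to your deployment.
REQUIRED_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_twitter_credentials(env=None):
    """Return the four credentials as a dict, or raise if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_VARS}
```

With this helper, `creds = load_twitter_credentials()` followed by `consumer_key = creds["TWITTER_CONSUMER_KEY"]` (and so on) could replace the `"changeMe"` placeholders.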

## Step 2. Create an instance of the Twitter client

In [3]:
client_auth = tw.OAuthHandler(consumer_key, consumer_secret)
client_auth.set_access_token(access_token, access_token_secret)
api = tw.API(client_auth, wait_on_rate_limit=True, retry_count=5, retry_delay=1)

In [4]:
try:
    api.verify_credentials()
    print("Authentication OK")
except Exception as e:
    # print the reason, otherwise a bad key and a network failure look the same
    print(f"Error during authentication: {e}")

Authentication OK


## Step 3. Get the tweets that you are interested in

We have created the Twitter client, so we are ready to search for tweets. Here we use `search_tweets` to find all the tweets that contain given keywords in a given language. We don't filter the tweets by date because the search index has a 7-day limit: no tweets are found for a date older than one week.
For more details about `search_tweets`, see
https://docs.tweepy.org/en/stable/api.html?highlight=search%20tweet#tweepy.API.search_tweets

If you want to search older tweets, you can use the `search_full_archive` method:
https://docs.tweepy.org/en/stable/api.html?highlight=search%20tweet#tweepy.API.search_full_archive
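The choice between the two methods depends only on how far back the search has to go. A rough sketch of that decision, using the 7-day limit stated above; the helper `pick_search_method` is hypothetical and the exact boundary of the search index may differ slightly:

```python
import datetime as dt

SEARCH_WINDOW_DAYS = 7  # limit of the standard search index, as stated above


def pick_search_method(oldest_day, today=None):
    """Return the name of the tweepy API method suited to the oldest requested day."""
    today = today or dt.date.today()
    age = (today - oldest_day).days
    return "search_tweets" if age <= SEARCH_WINDOW_DAYS else "search_full_archive"


print(pick_search_method(dt.date(2021, 11, 20), today=dt.date(2021, 11, 24)))
print(pick_search_method(dt.date(2020, 12, 1), today=dt.date(2021, 11, 24)))
```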



### Configure the search filter

In [None]:
# filter the search results with the keywords below
search_words = "#insee"
# we can also restrict the search to tweets created before a given date
# until_date = "2021-11-24"
# language of the tweets (ISO 639-1 code)
language = "fr"
# maximum number of tweets kept in the result
max_tweet_count = 1000000

In [None]:
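# Get the tweets. One request returns at most 100 tweets, so tw.Cursor follows
# the result pages until max_tweet_count tweets are collected.
cursor = tw.Cursor(api.search_tweets, q=search_words, lang=language, result_type="mixed", count=100)
tweets = list(cursor.items(max_tweet_count))

# tweets is a list of Status objects; the _json attribute holds the raw tweet as a dict
print(tweets[0]._json)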


In [None]:
# For example, to get the sender's name, the creation date and the text of a tweet

tweet_dict = tweets[0]._json
text = tweet_dict.get("text")
user_name = tweet_dict.get("user").get("name")
# named created_at so that it does not shadow the imported datetime.date
created_at = tweet_dict.get("created_at")
print(f"name: {user_name} | message: {text} | date: {created_at}")
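Chained `.get` calls raise an `AttributeError` when an intermediate key such as `user` is absent. A small sketch of a more defensive extraction; the helper `extract_fields` is hypothetical and the sample dict is hand-written with the same keys as the payload above, not real data:

```python
def extract_fields(tweet_dict):
    """Pick the sender name, creation date and text from one tweet dict."""
    user = tweet_dict.get("user") or {}
    return {
        "name": user.get("name"),
        "date": tweet_dict.get("created_at"),
        "text": tweet_dict.get("text"),
    }


# Hand-written sample (not real data) with the keys used in this notebook.
sample = {
    "created_at": "Wed Nov 24 10:00:00 +0000 2021",
    "text": "hello #insee",
    "user": {"name": "Alice"},
}
print(extract_fields(sample))
print(extract_fields({"text": "no user"}))
```

Missing fields come back as `None` instead of stopping the loop over a long list of tweets.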

### Get tweets by using the Twitter API v2

Note that to access v2, you need to upgrade your account to a developer or academic account.

In [None]:
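# for v2 we need tw.Client instead of tw.API
tw_client = tw.Client(consumer_key=consumer_key, consumer_secret=consumer_secret, access_token=access_token, access_token_secret=access_token_secret)
# no bearer token is configured, so the request is authenticated with the user context (OAuth 1.0a)
response = tw_client.search_recent_tweets(query=search_words, user_auth=True)
# the matching tweets are in response.data, which is None when nothing matches
print(len(response.data or []))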

## Step 4. Generate a data frame

Now that we have the tweets, we want to build a data frame from them.

In [None]:

def generate_tweet_df(tweets):
    # collect one dict per tweet, then build the dataframe in a single call
    # (much faster than adding rows one by one with df.loc)
    rows = []
    for tweet in tweets:
        tweet_dict = tweet._json
        rows.append({
            'name': tweet_dict.get("user").get("name"),
            'date': tweet_dict.get("created_at"),
            'text': tweet_dict.get("text"),
        })
    return pd.DataFrame(rows, columns=['name', 'date', 'text'])
       
        

In [None]:
df=generate_tweet_df(tweets)

In [None]:
df.head(10)

In [None]:
# we can also keep all the fields of a tweet
def tweets_json(tweets):
    # convert a list of Status objects to a list of raw tweet dicts
    return [tweet._json for tweet in tweets]


# json_normalize flattens the nested dicts into a pandas dataframe (one column per field)
df_all_atts = pd.json_normalize(tweets_json(tweets))
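To see what flattening means for a nested tweet, here is a rough pure-Python illustration of the idea. It is not how pandas implements `json_normalize`; the helper `flatten` is hypothetical and only mimics the dot-separated column names for nested dicts:

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hand-written sample, not real data.
sample = {"text": "hello", "user": {"name": "Alice", "id": 1}}
print(flatten(sample))
```

Each flattened key corresponds to one column of `df_all_atts`, for example `user.name`.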


## Step 5. Write the data frame to S3

We now want to save the data frame on S3. We use the Parquet format because it stores the schema together with the data.

### 5.1 Configure the S3 connection

Here we set the S3 endpoint and the output path of the Parquet file. As we generate one Parquet file per day, we put the generation date in the file name.


In [None]:
endpoint = os.environ['AWS_S3_ENDPOINT']
bucket = "pengfei"
current_date=date.today().strftime("%d-%m-%Y")
output_path = f"diffusion/demo_prod/tweet_{current_date}"
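One thing to keep in mind with daily files: names built with the `%d-%m-%Y` pattern do not sort chronologically when the bucket is listed alphabetically. An optional variant, assuming an ISO date suffix is acceptable; the helper `build_output_path` is hypothetical:

```python
import datetime as dt


def build_output_path(prefix, day):
    """Build a dataset path whose name ends with an ISO date (YYYY-MM-DD)."""
    return f"{prefix}/tweet_{day.isoformat()}"


paths = [
    build_output_path("diffusion/demo_prod", dt.date(2021, 12, 1)),
    build_output_path("diffusion/demo_prod", dt.date(2021, 11, 24)),
]
# ISO dates sort as text in chronological order
print(sorted(paths))
```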


### 5.2 Write the data frame to S3 as a Parquet file

In [62]:
# This function writes a pandas dataframe to S3 in Parquet format
def write_df_to_s3(df, endpoint, bucket_name, path):
    # Convert the pandas dataframe to an Arrow table
    table = pa.Table.from_pandas(df)
    url = f"https://{endpoint}"
    # named s3 so that it does not shadow the pyarrow fs module imported above
    s3 = s3fs.S3FileSystem(client_kwargs={'endpoint_url': url})
    file_uri = f"{bucket_name}/{path}"
    # write_to_dataset creates a directory at file_uri that contains the Parquet file(s)
    pq.write_to_dataset(table, root_path=file_uri, filesystem=s3)

In [None]:
write_df_to_s3(df,endpoint,bucket,output_path)

## Step 6. Test the output Parquet file

In [None]:
# This function reads a Parquet dataset from S3 and returns an Arrow table
def read_parquet_from_s3(endpoint: str, bucket_name: str, path: str):
    url = f"https://{endpoint}"
    s3 = s3fs.S3FileSystem(client_kwargs={'endpoint_url': url})
    file_uri = f"{bucket_name}/{path}"
    # print the metadata of the stored object as a quick existence check
    str_info = s3.info(file_uri)
    print(f"input file metadata: {str_info}")
    dataset = pq.ParquetDataset(file_uri, filesystem=s3)
    return dataset.read()

In [None]:
arrow_table=read_parquet_from_s3(endpoint,bucket,output_path)

# Convert back to pandas
df_new = arrow_table.to_pandas()
df_new.head()

# 7. Get old tweets from archive

In this section, we use `API.search_full_archive(label, query, *, tag, fromDate, toDate, maxResults, next)` to collect old tweets.
To call this endpoint, your Twitter developer account needs extended access rights.

Examples

```python
# fromDate and toDate must be in the format 'yyyyMMddHHmm'
from_date="202012010000"
end_date="202012050000"

# label is the name of the dev environment associated with your developer account
tweets = api.search_full_archive(label="dev",query=search_words,fromDate=from_date,toDate=end_date, maxResults=100)
print(len(tweets))
```
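Writing the `yyyyMMddHHmm` strings by hand is error-prone. A small sketch that converts between `datetime` objects and that format with the standard library; the helper names are hypothetical:

```python
import datetime as dt

ARCHIVE_FORMAT = "%Y%m%d%H%M"  # strftime equivalent of yyyyMMddHHmm


def to_archive_timestamp(moment):
    """Format a datetime as expected by fromDate and toDate."""
    return moment.strftime(ARCHIVE_FORMAT)


def from_archive_timestamp(text):
    """Parse a yyyyMMddHHmm string; raises ValueError on any other format."""
    return dt.datetime.strptime(text, ARCHIVE_FORMAT)


print(to_archive_timestamp(dt.datetime(2020, 12, 1)))
```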

Below are some helper functions.

In [18]:
import time

# Generate the bounds of 5-day intervals (day 1, 6, 11, 16, 21 and 26 of each month)
# for every year from start_year (included) to end_year (excluded)
def generate_dates(start_year,end_year):
    dates=[]
    months=["01","02","03","04","05","06","07","08","09","10","11","12"]
    days=["01","06","11","16","21","26"]
    for year in range(start_year,end_year):
        for month in months:
            for day in days:
                # format expected by search_full_archive: yyyyMMddHHmm
                dates.append(f"{year}{month}{day}0000")
    return dates
  
# Convert a list of tweet Status objects to a list of raw tweet dicts
# (same helper as in step 4)
def tweets_json(tweets):
    return [tweet._json for tweet in tweets]

# This function writes a pandas dataframe to S3 in Parquet format
# (same helper as in step 5.2)
def write_df_to_s3(df, endpoint, bucket_name, path):
    # Convert the pandas dataframe to an Arrow table
    table = pa.Table.from_pandas(df)
    url = f"https://{endpoint}"
    s3 = s3fs.S3FileSystem(client_kwargs={'endpoint_url': url})
    file_uri = f"{bucket_name}/{path}"
    pq.write_to_dataset(table, root_path=file_uri, filesystem=s3)
    

# This function gets the tweets of every date interval between start_year and end_year, then saves them to S3
def save_tweets(search_words,start_year,end_year,endpoint,bucket_name,path):
    dates=generate_dates(start_year,end_year)
    # pair each date with the next one so that no interval is skipped
    for from_date,end_date in zip(dates,dates[1:]):
        # only the first page (at most maxResults tweets) of each interval is fetched
        tweets = api.search_full_archive(label="dev",query=search_words,fromDate=from_date,toDate=end_date, maxResults=100)
        pdf_tweets = pd.json_normalize(tweets_json(tweets))
        if not pdf_tweets.empty:
            print(f"save {len(pdf_tweets)} tweets")
            write_df_to_s3(pdf_tweets,endpoint,bucket_name,path)
        # after each iteration, sleep 60 secs to stay under the Twitter rate limit (300 requests per 15 minutes)
        time.sleep(60)
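The date intervals can also be derived with `datetime` arithmetic instead of fixed day lists, which makes it easier to check that the windows are contiguous. A minimal sketch under that assumption; the generator `date_windows` is hypothetical and, unlike `generate_dates`, uses strict 5-day steps rather than fixed days of the month:

```python
import datetime as dt


def date_windows(start, end, step_days=5):
    """Yield (fromDate, toDate) strings covering [start, end) without gaps."""
    fmt = "%Y%m%d%H%M"
    step = dt.timedelta(days=step_days)
    current = start
    while current < end:
        upper = min(current + step, end)
        yield current.strftime(fmt), upper.strftime(fmt)
        current = upper


for window in date_windows(dt.datetime(2020, 12, 1), dt.datetime(2020, 12, 12)):
    print(window)
```

Each yielded pair could be passed as `fromDate` and `toDate`; the upper bound of one window is the lower bound of the next.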
        

## 7.1 Configure the parameters for saving old tweets

In [19]:
start_year=2011
end_year=2021
endpoint = os.environ['AWS_S3_ENDPOINT']
bucket = "pengfei"
output_path=f"diffusion/demo_prod/old/{start_year}_{end_year}"
search_words = "insee"

In [20]:
save_tweets(search_words,start_year,end_year,endpoint,bucket,output_path)

TooManyRequests: 429 Too Many Requests
Request exceeds account’s current package request limits. Please upgrade your package and retry or contact Twitter about enterprise access.
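The error above says that the account's request package is used up, so waiting between calls does not help. Before launching a long run, it can be worth estimating how many requests the year range implies. A rough sketch, assuming one request per consecutive pair of dates produced by `generate_dates`; `monthly_quota` is a placeholder to fill in with the actual limit of your account:

```python
def count_windows(start_year, end_year, days_per_month=6):
    """Number of consecutive date windows, i.e. API requests at one page each."""
    n_dates = (end_year - start_year) * 12 * days_per_month
    return max(n_dates - 1, 0)


def fits_quota(start_year, end_year, monthly_quota):
    """True when the estimated number of requests stays within the quota."""
    return count_windows(start_year, end_year) <= monthly_quota


print(count_windows(2011, 2021))
```

If the estimate exceeds the quota, a narrower year range or wider date intervals would reduce the number of requests.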