# In this notebook, we will search for Twitter messages

# Prerequisite: Create a Twitter developer account and generate API credentials
Step 1: Apply for a Twitter Developer Account

Go to the Twitter developer site to apply for a developer account. Here, you have to select the Twitter user responsible for this account, which should probably be you or your organization.



In [67]:
import tweepy as tw
import json
import pandas as pd
import pyarrow.parquet as pq
import s3fs
from pyarrow import fs
import pyarrow as pa
import os
from datetime import date

# Download tweets
## Step 1. Set up the Twitter API credentials


In [85]:
consumer_key = "changeMe"
consumer_secret = "changeMe"
access_token = "changeMe"
access_token_secret = "changeMe"
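
Hard-coding secrets in a notebook makes them easy to leak when the notebook is shared. As a minimal sketch, the four values could instead be read from environment variables. The variable names below (TWITTER_CONSUMER_KEY and so on) and the helper load_twitter_credentials are hypothetical choices for illustration, not something Tweepy defines.

```python
import os

# Hypothetical environment variable names; choose whatever fits your setup.
CREDENTIAL_VARS = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_token_secret": "TWITTER_ACCESS_TOKEN_SECRET",
}

def load_twitter_credentials(env=None):
    """Return the four credentials as a dict, or raise if any is missing."""
    # Accept any mapping so the helper can be tried without touching os.environ.
    env = os.environ if env is None else env
    missing = [var for var in CREDENTIAL_VARS.values() if not env.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[var] for name, var in CREDENTIAL_VARS.items()}
```

With the variables exported in the notebook environment, the cell above could then assign, for example, consumer_key = load_twitter_credentials()["consumer_key"].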

## Step 2. Create an instance of the Twitter client

In [3]:
client_auth = tw.OAuthHandler(consumer_key, consumer_secret)
client_auth.set_access_token(access_token, access_token_secret)
api = tw.API(client_auth, wait_on_rate_limit=True, retry_count=5, retry_delay=1)

In [4]:
try:
    api.verify_credentials()
    print("Authentication OK")
except Exception as e:
    # show the underlying error so that a failed authentication can be diagnosed
    print(f"Error during authentication: {e}")

Authentication OK


## Step 3. Get the tweets that you are interested in

We have created the Twitter client, so we are now ready to search for tweets. Here we use search_tweets to find the tweets that contain certain keywords and are written in a given language. We do not filter by date because the search index has a 7-day limit: no tweets will be found for a date older than one week.
For more details about search_tweets, see:
https://docs.tweepy.org/en/stable/api.html?highlight=search%20tweet#tweepy.API.search_tweets

If you want to search older tweets, you can use the search_full_archive method.
https://docs.tweepy.org/en/stable/api.html?highlight=search%20tweet#tweepy.API.search_full_archive
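
To make the 7-day limit concrete, the sketch below computes the oldest date the standard search can still be expected to reach and checks whether a candidate date falls inside that window. The helper names and the exact 7-day constant are assumptions for illustration; the precise cutoff applied by Twitter may differ by a few hours.

```python
from datetime import date, timedelta

# Assumption: the standard search index covers roughly the last 7 days.
SEARCH_WINDOW_DAYS = 7

def earliest_searchable_date(today):
    """Oldest calendar date the standard search is expected to cover."""
    return today - timedelta(days=SEARCH_WINDOW_DAYS)

def is_within_search_window(until_date, today):
    """True if an ISO date string such as '2021-11-24' is still searchable."""
    candidate = date.fromisoformat(until_date)
    return earliest_searchable_date(today) <= candidate <= today

today = date(2021, 11, 24)
print(earliest_searchable_date(today))               # 2021-11-17
print(is_within_search_window("2021-11-20", today))  # True
print(is_within_search_window("2021-11-01", today))  # False
```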



### Configure the search filter

In [40]:
# Filter the search results with the keywords below
search_words = "#insee"
# We can also restrict the search to tweets created before a given date
# until_date = "2021-11-24"
# Specify the language of the tweets
language = "fr"
# Number of tweets requested per page; the standard search API returns at most 100 per request
max_tweet_count = 100

In [None]:
# Get the tweets
tweets = api.search_tweets(q=search_words, lang=language, result_type="mixed", count=max_tweet_count)

# tweets is a list of Status objects; the _json attribute of each one holds the raw tweet as a dict.
print(tweets[0]._json)


In [49]:
# For example, to get the sender's name, the date and the text of a tweet

tweet_dict = tweets[0]._json
text = tweet_dict.get("text")
user_name = tweet_dict.get("user").get("name")
# named created_at so that it does not shadow the imported datetime.date, which is needed later
created_at = tweet_dict.get("created_at")
print(f"name:{user_name} | message: {text} | date:{created_at}")

name:V. RICHES-FLORES | message: https://t.co/2nt1wV9P6U
RFR - L’inflation arrive en France #france #croissance #inflation #insee #conjoncture… https://t.co/s47ETAqWBO | date:Wed Nov 24 13:38:12 +0000 2021
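
The created_at value is a plain string. If you need to sort or filter tweets by time, it can be parsed with the standard library. This is a sketch that assumes every timestamp follows the format shown in the output above and that Python runs with an English (or C) locale, since day and month names are locale-dependent; parse_created_at is a hypothetical helper name.

```python
from datetime import datetime

# Format of the created_at field as it appears in the output above,
# e.g. "Wed Nov 24 13:38:12 +0000 2021".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def parse_created_at(created_at):
    """Convert a created_at string into a timezone-aware datetime."""
    return datetime.strptime(created_at, CREATED_AT_FORMAT)

parsed = parse_created_at("Wed Nov 24 13:38:12 +0000 2021")
print(parsed.isoformat())  # 2021-11-24T13:38:12+00:00
```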


## Step 4. Generate a data frame

Now that we have the tweets, we want to build a data frame from them.

In [61]:

def generate_tweet_df(tweets):
    # collect one row (a dict) per tweet
    rows = []
    for tweet in tweets:
        # get the column values for each tweet
        tweet_dict = tweet._json
        rows.append({
            'name': tweet_dict.get("user").get("name"),
            'date': tweet_dict.get("created_at"),
            'text': tweet_dict.get("text"),
        })
    # build the dataframe once; adding rows one by one with df.loc is slow
    return pd.DataFrame(rows, columns=['name', 'date', 'text'])
       
        

In [62]:
df=generate_tweet_df(tweets)

In [71]:
df.head(10)

| | name | date | text |
|---|---|---|---|
| 0 | V. RICHES-FLORES | Wed Nov 24 13:38:12 +0000 2021 | https://t.co/2nt1wV9P6U\nRFR - L’inflation arr... |
| 1 | Drecfire 🥕🥕🥕 | Wed Nov 24 12:57:15 +0000 2021 | RT @blaisegrenier: Nous ferons un point d'ici ... |
| 2 | Takeo | Wed Nov 24 11:30:40 +0000 2021 | Nous ferons un point d'ici quelques temps en c... |
| 3 | Thierry Bès | Wed Nov 24 10:16:40 +0000 2021 | RT @RichardEudes: En novembre 2021, le climat ... |
| 4 | Ellisphere | Wed Nov 24 08:12:08 +0000 2021 | RT @RichardEudes: En novembre 2021, le climat ... |
| 5 | Richard Eudes, PhD | Wed Nov 24 07:57:44 +0000 2021 | En novembre 2021, le climat des affaires dans ... |
| 6 | Richard Eudes, PhD | Wed Nov 24 07:57:42 +0000 2021 | En novembre 2021, le climat des affaires dans ... |
| 7 | Richard Eudes, PhD | Wed Nov 24 07:57:41 +0000 2021 | En novembre 2021, le climat des affaires du co... |
| 8 | Richard Eudes, PhD | Wed Nov 24 07:57:39 +0000 2021 | En novembre 2021, le climat des affaires dans ... |
| 9 | Richard Eudes, PhD | Wed Nov 24 07:57:38 +0000 2021 | En novembre 2021, le climat des affaires en Fr... |


## Step 5. Write the data frame to S3

We now have the data frame and want to save it to S3. We save it in Parquet format because Parquet stores the schema together with the data.

### 5.1 Configure the S3 connection

Here we set the S3 endpoint, the bucket and the output path of the Parquet file. Because we generate a Parquet file each day, we include the generation date in the file name.


In [78]:
endpoint = os.environ['AWS_S3_ENDPOINT']
bucket = "pengfei"
current_date=date.today().strftime("%d-%m-%Y")
output_path = f"diffusion/demo_prod/tweet_{current_date}"
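
One thing to keep in mind with daily files is that day-month-year names such as tweet_24-11-2021 do not sort chronologically when listed alphabetically. As an optional alternative (not what this notebook uses), an ISO 8601 date keeps alphabetical and chronological order aligned. In this sketch build_output_path is a hypothetical helper, and the default prefix is copied from the cell above.

```python
from datetime import date

def build_output_path(day, prefix="diffusion/demo_prod"):
    """Return a dated output path such as diffusion/demo_prod/tweet_2021-11-24."""
    return f"{prefix}/tweet_{day.isoformat()}"

paths = [build_output_path(date(2021, 12, 1)), build_output_path(date(2021, 11, 24))]
print(sorted(paths))
# ['diffusion/demo_prod/tweet_2021-11-24', 'diffusion/demo_prod/tweet_2021-12-01']
```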


### 5.2 Write the data frame to S3 as a Parquet file

In [79]:
# This function writes a pandas dataframe to S3 in Parquet format
def write_df_to_s3(df, endpoint, bucket_name, path):
    # Convert pandas df to Arrow table
    table = pa.Table.from_pandas(df)
    url = f"https://{endpoint}"
    fs = s3fs.S3FileSystem(client_kwargs={'endpoint_url': url})
    file_uri = f"{bucket_name}/{path}"
    pq.write_to_dataset(table, root_path=file_uri, filesystem=fs)

In [80]:
write_df_to_s3(df,endpoint,bucket,output_path)

## Step 6. Test the output Parquet file

In [81]:
# This function reads a Parquet dataset from S3 and returns an Arrow table
def read_parquet_from_s3(endpoint: str, bucket_name, path):
    url = f"https://{endpoint}"
    fs = s3fs.S3FileSystem(client_kwargs={'endpoint_url': url})
    file_uri = f"{bucket_name}/{path}"
    str_info = fs.info(file_uri)
    print(f"input file metadata: {str_info}")
    dataset = pq.ParquetDataset(file_uri, filesystem=fs)
    table = dataset.read()
    return table

In [84]:
arrow_table=read_parquet_from_s3(endpoint,bucket,output_path)

# Convert back to pandas
df_new = arrow_table.to_pandas()
df_new.head()

input file metadata: {'name': 'pengfei/diffusion/demo_prod/tweet_24-11-2021', 'size': 0, 'type': 'directory'}


| | name | date | text |
|---|---|---|---|
| 0 | V. RICHES-FLORES | Wed Nov 24 13:38:12 +0000 2021 | https://t.co/2nt1wV9P6U\nRFR - L’inflation arr... |
| 1 | Drecfire 🥕🥕🥕 | Wed Nov 24 12:57:15 +0000 2021 | RT @blaisegrenier: Nous ferons un point d'ici ... |
| 2 | Takeo | Wed Nov 24 11:30:40 +0000 2021 | Nous ferons un point d'ici quelques temps en c... |
| 3 | Thierry Bès | Wed Nov 24 10:16:40 +0000 2021 | RT @RichardEudes: En novembre 2021, le climat ... |
| 4 | Ellisphere | Wed Nov 24 08:12:08 +0000 2021 | RT @RichardEudes: En novembre 2021, le climat ... |
