Monitoring Twitter Project - The Division of Humanities and Social Sciences, California Institute of Technology
A technical report for the Twitter monitors: Reliable and Efficient Long-Term Social Media Monitoring
Use Twitter developer credentials to request the real-time Twitter stream, and filter the stream using keywords of interest.
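The keyword filtering described above can be illustrated with a small local matcher. This is only a sketch: the function names build_keyword_pattern and is_relevant are hypothetical, the tweet is assumed to be a dict with a "text" field, and the project may instead rely on filtering done by the Twitter streaming endpoint itself.

```python
import re


def build_keyword_pattern(keywords):
    """Compile one case-insensitive regex that matches any keyword.

    Keywords match as whole words or phrases, optionally preceded by '#'.
    """
    cleaned = [k.strip() for k in keywords if k.strip()]
    if not cleaned:
        raise ValueError("at least one keyword is required")
    # Longer keywords first so that phrases win over their prefixes.
    escaped = sorted((re.escape(k) for k in cleaned), key=len, reverse=True)
    return re.compile(r"(?<!\w)#?(?:" + "|".join(escaped) + r")(?!\w)", re.IGNORECASE)


def is_relevant(tweet, pattern):
    """Return True if the tweet text (assumed field: 'text') mentions a keyword."""
    text = tweet.get("text") or ""
    return pattern.search(text) is not None


if __name__ == "__main__":
    pattern = build_keyword_pattern(["election", "vote by mail"])
    print(is_relevant({"text": "Go #Election day!"}, pattern))
```

Matching on word boundaries avoids counting a keyword that only appears inside a longer word.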
Put_Tweets_in_Kinesis_Stream.py
Put tweets that are relevant to the target keywords in a Kinesis stream. Specifically, the tweets are labeled with partition keys and directed to available shards. Each shard is a data timeline that gives us access to data put in the stream up to 24 hours earlier. The write limit of a shard is 1,000 records/s and 1 MB/s.
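A minimal sketch of how tweets might be labeled with partition keys and written in batches. It assumes a boto3 Kinesis client is passed in as client, that each tweet is a dict with an "id_str" field, and that the tweet ID is used as the partition key; the helper names are hypothetical and the project script may choose keys differently.

```python
import json

MAX_RECORDS_PER_CALL = 500  # PutRecords accepts at most 500 records per request


def to_kinesis_record(tweet):
    """Serialize one tweet and label it with a partition key.

    Assumption: the tweet ID is the partition key, which spreads tweets
    roughly evenly across shards.
    """
    data = json.dumps(tweet, separators=(",", ":")).encode("utf-8")
    return {"Data": data, "PartitionKey": str(tweet["id_str"])}


def batches(records, size=MAX_RECORDS_PER_CALL):
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def put_tweets(client, stream_name, tweets):
    """Send tweets with client.put_records and return the failed-record count."""
    failed = 0
    for batch in batches([to_kinesis_record(t) for t in tweets]):
        response = client.put_records(StreamName=stream_name, Records=batch)
        failed += response.get("FailedRecordCount", 0)
    return failed
```

A caller would create the client with boto3.client("kinesis") and retry any records reported as failed, since a failure usually means a shard limit was hit.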
Put_Tweets_in_Kinesis_Stream.py
Monitor the usage of each shard, and automatically close a redundant shard or open a new shard when incoming data exceed the limit.
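The open-or-close decision can be sketched as a small calculation over the per-shard write limits quoted in this report (1,000 records/s and 1 MB/s). The function names and the 80% headroom factor are assumptions made for illustration, not the project's actual thresholds, and the sketch only decides what to do; it does not call the Kinesis resharding API.

```python
import math

RECORDS_PER_SHARD = 1000      # per-shard write limit, records per second
BYTES_PER_SHARD = 1_000_000   # per-shard write limit, taken here as 10**6 bytes/s


def required_shards(records_per_s, bytes_per_s, headroom=0.8):
    """Return the number of shards needed to stay under `headroom` of each limit."""
    if not 0 < headroom <= 1:
        raise ValueError("headroom must be in (0, 1]")
    by_records = records_per_s / (RECORDS_PER_SHARD * headroom)
    by_bytes = bytes_per_s / (BYTES_PER_SHARD * headroom)
    # A stream always keeps at least one shard.
    return max(1, math.ceil(max(by_records, by_bytes)))


def scaling_action(current_shards, records_per_s, bytes_per_s):
    """Compare the current shard count with the target and name the action."""
    target = required_shards(records_per_s, bytes_per_s)
    if target > current_shards:
        return ("open", target - current_shards)
    if target < current_shards:
        return ("close", current_shards - target)
    return ("keep", 0)
```

Taking the larger of the two ratios matters because a stream of many small tweets can hit the record limit well before the byte limit.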
Put all tweets received in the past 10 minutes in a JSON file and store it in an S3 bucket. The read limit of a shard is 2 MB/s.
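One way to assign tweets to 10-minute files is to round each arrival time down to the start of its window and derive the object key from it. The key layout and the names window_start and s3_key_for are hypothetical; the report does not state how the files are actually named.

```python
from datetime import datetime, timezone

WINDOW_SECONDS = 600  # 10 minutes


def window_start(ts):
    """Round a timezone-aware datetime down to the start of its 10-minute window (UTC)."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % WINDOW_SECONDS, tz=timezone.utc)


def s3_key_for(ts, prefix="tweets"):
    """Build an object key such as 'tweets/2020/11/03/1420.json' (assumed layout)."""
    return f"{prefix}/{window_start(ts):%Y/%m/%d/%H%M}.json"


if __name__ == "__main__":
    arrival = datetime(2020, 11, 3, 14, 27, 45, tzinfo=timezone.utc)
    print(s3_key_for(arrival))
```

Pass timezone-aware datetimes, because a naive datetime would be interpreted in the machine's local time zone.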
Every hour, the new JSON files in the S3 bucket are pushed to Box FTP, so that the target Box folder is updated once an hour and available to all team members.
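The hourly push can be sketched as a function that uploads only the files not sent in an earlier run. It assumes an already-connected ftplib-style object (for example ftplib.FTP_TLS) is passed in as ftp, that the file contents have already been downloaded from S3 into a dict, and that the set of sent names is kept by the caller; push_new_files is a hypothetical name.

```python
import io


def push_new_files(ftp, files, already_sent):
    """Upload each file not yet sent and return the names uploaded.

    ftp          -- connected object exposing storbinary(), e.g. ftplib.FTP_TLS
    files        -- dict mapping file name to its bytes content
    already_sent -- set of names uploaded in earlier runs (updated in place)
    """
    uploaded = []
    for name, payload in sorted(files.items()):
        if name in already_sent:
            continue  # pushed in a previous hour
        ftp.storbinary(f"STOR {name}", io.BytesIO(payload))
        already_sent.add(name)
        uploaded.append(name)
    return uploaded
```

Tracking the sent names keeps a rerun from uploading the same file twice if a scheduled job fires more than once in an hour.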
Search_Tweets_from_REST_API.py
Parse_Kinesis_Stream_to_MariaDB.py
Use Twitter developer credentials to request the real-time Twitter stream, filter the stream using keywords of interest, then publish the collected tweets to a Pub/Sub topic.
Use Twitter developer credentials to request the real-time Twitter stream, filter the stream using keywords of interest, then save the tweets temporarily on a compute instance.
Move the tweets from the compute instance's hard drive to a Cloud Storage folder.
Move the tweets from the compute instance's hard drive to a Google Drive folder.
Move all files in a Box folder to a Google Drive folder.