A script that interacts with the Twitter API on a regular basis, fetching, filtering & storing data. Any tweet that mentions the JediSwap Twitter account or quotes a tweet posted by the account is fetched and passed through a filtering stage. Tweets that are not dropped during filtering are stored in monthly CSV files with columns for the attributes listed below:
### Data obtained per tweet
- tweet contents, timestamp, referenced tweets, conversation id
- tweet views, replies, quotes, retweets, likes
- author id, username, followers, following, tweet count, listed count
- if reply: tagged accounts in media of parent tweet (scraped)
- if reply: mentions of parent tweet
Each time the script is run, it searches mentions and quotes backwards through time until it reaches the most recent tweet already present in the stored data. This keeps the total number of API requests as low as possible: no tweet is ever queried for twice. How often the script has to run to avoid gaps in the data depends solely on your API tier and the expected activity of your Twitter account & followers. For example, if your tier lets you query the last 800 mentions via the Twitter mentions timeline, you need to run the script often enough that fewer than 800 new mentions accumulate between runs. Running the script much more often than necessary does no harm and insures against gaps in the data.
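A minimal sketch of the idea, using the Twitter API v2 mentions timeline and its `since_id` parameter. The actual logic lives in query_and_filter.py and also covers quote tweets and the requested response fields; the function name and structure here are illustrative only:

```python
import os
import requests

BEARER_TOKEN = os.environ["API_BEARER_TOKEN"]
USER_ID = os.environ["TWITTER_USER_ID"]

def fetch_new_mentions(newest_known_id: str) -> list:
    """Page backwards through the mentions timeline, but only over tweets newer
    than the newest tweet already stored (since_id), so nothing is fetched twice."""
    url = f"https://api.twitter.com/2/users/{USER_ID}/mentions"
    headers = {"Authorization": f"Bearer {BEARER_TOKEN}"}
    params = {"max_results": 100, "since_id": newest_known_id}
    tweets = []
    while True:
        response = requests.get(url, headers=headers, params=params)
        response.raise_for_status()
        payload = response.json()
        tweets.extend(payload.get("data", []))
        next_token = payload.get("meta", {}).get("next_token")
        if not next_token:  # reached the boundary of the already-known data
            return tweets
        params["pagination_token"] = next_token
```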
### Web scraping
Some essential information cannot be queried via the Twitter API v2, for example the list of users tagged in a photo of a tweet. In these cases, the script scrapes the information from the Twitter frontend using Selenium. For this to work, you have to install the version of Chromedriver that most closely matches your installed Google Chrome browser. Since the information is only visible to signed-in Twitter users, you also have to create a user data folder as described here and run the script Selenium_Twitter_Login.py once, in order to sign in to Twitter manually and create a session cookie that the script can then reuse for automated scraping. Should the cookie expire, just repeat this step before running the main script.
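For orientation, reusing the signed-in session with Selenium looks roughly like the sketch below; the user data directory and tweet URL are placeholders, not the paths the script actually uses:

```python
from selenium import webdriver

# Point Chrome at the user data folder created during the manual login step,
# so the scraper starts out already signed in to Twitter.
options = webdriver.ChromeOptions()
options.add_argument("--user-data-dir=/path/to/chrome_user_data")  # placeholder path

driver = webdriver.Chrome(options=options)
driver.get("https://twitter.com/JediSwap/status/1234567890")  # placeholder tweet URL
# ...locate the accounts tagged in the tweet's media here...
driver.quit()
```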
To run the script, a Twitter developer account is needed. Once an account is registered, paste your API bearer token next to the key `API_BEARER_TOKEN` in the `.env` file, as shown in `sample.env`, omitting any quotes. Paste the Twitter user id you want to use the script for next to the key `TWITTER_USER_ID`, also without any quotes.
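The resulting `.env` might look like this (placeholder values shown, no quotes):

```
API_BEARER_TOKEN=AAAAAAAAAAAAAAAAAAAAAexample_token
TWITTER_USER_ID=1234567890123456789
```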
In `main.py`, set `out_path` to where you want the CSV files to be generated.
Run `main.py` to start the script:

```
python main.py
```
- Query parameters can be customized via `get_query_params()` in query_and_filter.py. This will affect each API response and alter the returned response fields uniformly.
- If called directly, the lower-level querying functions in query_and_filter.py accept additional query parameters as a dictionary `add_params`, which will be appended to the parameters defined in `get_query_params()`. This way, an API search can be refined or restricted to a specific time interval.
- `filter_patterns` in query_and_filter.py can be expanded to drop tweets programmatically. It uses regex to exclude any tweet whose contents match a search pattern.
- For more advanced filtering, and for filtering based on tweet attributes other than `tweet["text"]`, functions can be appended to pandas_pipes.py and added to the pipeline in main.py (see the sketch below).
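As a sketch of that last point, a custom pipeline step is just a function that takes and returns a DataFrame; the function and column names below are hypothetical and only illustrate the pattern:

```python
import pandas as pd

def drop_low_follower_authors(df: pd.DataFrame, min_followers: int = 10) -> pd.DataFrame:
    """Hypothetical filter: drop tweets whose author has fewer than min_followers
    followers, assuming a 'followers' column like the attribute listed above."""
    return df[df["followers"] >= min_followers]

# In main.py, chain it into the existing pipeline, for example:
# df = df.pipe(drop_low_follower_authors, min_followers=25)
```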
This project is licensed under the MIT license. See the LICENSE file for details. Collaboration welcome!