twwordmap - Re-write after 2 years of code

Preocts#8196 Discord | Github

See v1/ for the original attempt during #100DaysOfCode 2020

Requirements:

`.env` file setup

To run collect.py you will need authentication credentials from Twitter's v2 API for an application. Place these in a `.env` file in the project root. If the bearer token is included, the consumer key and secret are not needed.

TW_CONSUMER_KEY=[client key]
TW_CONSUMER_SECRET=[client secret]
# Optional alternative: provide existing bearer token
TW_BEARER_TOKEN=[bearer token]
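
As a quick sanity check before running collect.py, the rule above (a bearer token alone is enough, otherwise both consumer values are required) can be sketched with the standard library only. The helper names `parse_env` and `has_usable_credentials` are hypothetical and not part of this project, which loads the file through other means.

```python
from typing import Dict


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=value lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def has_usable_credentials(env: Dict[str, str]) -> bool:
    """True when a bearer token, or both consumer values, are present."""
    if env.get("TW_BEARER_TOKEN"):
        return True
    return bool(env.get("TW_CONSUMER_KEY") and env.get("TW_CONSUMER_SECRET"))


sample = "TW_CONSUMER_KEY=abc\n# Optional alternative\nTW_CONSUMER_SECRET=def\n"
print(has_usable_credentials(parse_env(sample)))
```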

collect.py (re-write)

usage: collect.py [-h] [--name tweets2021.11.14.18.49.44.db] [--log {DEBUG,INFO,WARNING,ERROR,CRITICAL}] search_term start_date

#100DaysofCode Project - 2021 rewrite

positional arguments:
  search_term           Define the search query, up to 512 characters. Be specific! Highly recommended to use `-is:retweet` to drastically reduce the number of results. Applications have a 500,000 **monthly** limit (per tweet, not request!).
  start_date            YYYY-MM-DD Date of when to start search, 7 days max. Tweets are pulled from current time backward to this date.

optional arguments:
  -h, --help            show this help message and exit
  --name tweets2021.11.14.18.49.44.db
                        sqlite3 file to store results in
  --log {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Logging level. Default: INFO

Data population: This script does the heavy lifting of polling Twitter's Recent Search API and collecting the results into the sqlite3 database.

This script handles 429 throttles from Twitter by monitoring the response headers and pausing when needed. These limits reset on a 15-minute window. At the INFO log level, the script logs a message at regular intervals while it is waiting.

NOTE: All tweets are stored in the database as they are pulled. If you interrupt the script during a throttle window, the tweets already pulled are not lost.
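
The pause-until-reset behavior could look roughly like the sketch here. It is not the code in collect.py: the function name `wait_for_reset`, its parameters, and the 60-second interval are assumptions chosen for illustration. The clock and the pause function are passed in so the loop can be exercised without actually sleeping.

```python
from datetime import datetime
from time import sleep
from typing import Callable


def wait_for_reset(
    reset_at: datetime,
    now: Callable[[], datetime],
    pause: Callable[[float], None] = sleep,
    interval: float = 60.0,
) -> int:
    """Block until now() passes reset_at; return the number of pauses taken."""
    pauses = 0
    while now() <= reset_at:
        print(f"Waiting for limit reset, currently: {now()} UTC...")
        pause(interval)
        pauses += 1
    return pauses
```

Because nothing is held in memory between pauses, interrupting a loop like this loses no data as long as each page was written to the database before the wait began.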

Example output with DEBUG log level:

INFO:collect:Retrieving Tweets...
DEBUG:urllib3.connectionpool:https://api.twitter.com:443 "GET /2/tweets/search/recent?start_time=2021-11-13T00%3A00%3A00Z&max_results=100&tweet.fields=created_at&query=%23100DaysOfCode&next_token=b26v89c19zqg8o3fpdv9kz2d7cwclj6yfgts2hkddzed9 HTTP/1.1" 200 23279
INFO:collect:Pulled 100 tweets, 84953 total
DEBUG:collect:Requests remaining: 0
DEBUG:collect:ID start: 1459535890380902405 - end: 1459535502990888960
INFO:collect:Rate limit reached, resets at: 2021-11-14 17:11:53 UTC
INFO:collect:Waiting for limit reset, currently: 2021-11-14 17:00:52.159285 UTC...
INFO:collect:Waiting for limit reset, currently: 2021-11-14 17:01:52.219622 UTC...
INFO:collect:Waiting for limit reset, currently: 2021-11-14 17:02:52.279899 UTC...
INFO:collect:Waiting for limit reset, currently: 2021-11-14 17:03:52.340215 UTC...
INFO:collect:Waiting for limit reset, currently: 2021-11-14 17:04:52.400542 UTC...
INFO:collect:Waiting for limit reset, currently: 2021-11-14 17:05:52.460912 UTC...
INFO:collect:Waiting for limit reset, currently: 2021-11-14 17:06:52.521377 UTC...
INFO:collect:Waiting for limit reset, currently: 2021-11-14 17:07:52.581821 UTC...
INFO:collect:Waiting for limit reset, currently: 2021-11-14 17:08:52.642266 UTC...
INFO:collect:Waiting for limit reset, currently: 2021-11-14 17:09:52.702683 UTC...
INFO:collect:Waiting for limit reset, currently: 2021-11-14 17:10:52.762999 UTC...
INFO:collect:Waiting for limit reset, currently: 2021-11-14 17:11:52.823330 UTC...
INFO:collect:Retrieving Tweets...
DEBUG:urllib3.connectionpool:Resetting dropped connection: api.twitter.com
DEBUG:urllib3.connectionpool:https://api.twitter.com:443 "GET /2/tweets/search/recent?start_time=2021-11-13T00%3A00%3A00Z&max_results=100&tweet.fields=created_at&query=%23100DaysOfCode&next_token=b26v89c19zqg8o3fpdv9kz2d7c1leyi5jv2zz6tv9hcvx HTTP/1.1" 200 24445
INFO:collect:Pulled 100 tweets, 85053 total
DEBUG:collect:Requests remaining: 449
DEBUG:collect:ID start: 1459535501870960643 - end: 1459534944624128001

datastore.py (new)

A small abstraction layer for storing and retrieving tweet data from a SQLite3 database. Initializing the object sets the file name. Database calls are made within a context manager to ensure the connection is closed properly on exit. `:memory:` is a valid filename and creates a database that exists only in memory; all of its data is lost on exit.

Example usage:

from datastore import DataStore

mystore = DataStore("mydata.db")

with mystore.connection() as dbclient:
    # Reference CRUD methods via `dbclient`
    ...
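
To illustrate the context-manager idea and the `:memory:` caveat, here is a minimal stand-in built directly on the standard `sqlite3` module. `SketchStore` and the `tweets` table layout are hypothetical; the real DataStore class and its schema may differ.

```python
import sqlite3
from contextlib import contextmanager
from typing import Iterator


class SketchStore:
    """Hypothetical stand-in for DataStore; not the project's actual class."""

    def __init__(self, filename: str = ":memory:") -> None:
        self.filename = filename

    @contextmanager
    def connection(self) -> Iterator[sqlite3.Connection]:
        conn = sqlite3.connect(self.filename)
        try:
            yield conn
            conn.commit()  # only reached when the with-body did not raise
        finally:
            conn.close()  # always close, even on error


store = SketchStore(":memory:")
with store.connection() as db:
    db.execute("CREATE TABLE tweets (id TEXT PRIMARY KEY, body TEXT)")
    db.execute("INSERT INTO tweets VALUES (?, ?)", ("1", "hello #100DaysOfCode"))
    rows = db.execute("SELECT id, body FROM tweets").fetchall()
print(rows)
```

With `:memory:`, each new sqlite3 connection starts from an empty database, so in this sketch nothing written inside one `with` block is visible in the next.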

process.py (re-write)

Processes the tweets stored in the sqlite3 database that collect.py creates. Filters out words based on the rules below and outputs an HTML file with the same name as the input database.

Filters:

No retweets
Removes unicode
Words must be longer than 1 character and shorter than 42
No html links
Starts with an ascii character (or #)
Ends with an ascii character
Not found in skipwords.py
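
The filters above can be sketched as a pair of small functions. This is one possible reading of the rules, not the logic in process.py: detecting retweets by an `RT @` prefix, treating "ascii character" as an ASCII letter or digit, and the tiny `SKIP_WORDS` set are all assumptions made for the example.

```python
import string
from typing import List

ASCII_START = set(string.ascii_letters + string.digits + "#")
ASCII_END = set(string.ascii_letters + string.digits)
SKIP_WORDS = {"the", "and", "for"}  # placeholder; the real list lives in the project


def keep_word(word: str) -> bool:
    """Apply the length, link, first/last character, and skip-word rules."""
    if not 1 < len(word) < 42:
        return False
    if word.lower().startswith(("http://", "https://")):
        return False
    if word[0] not in ASCII_START or word[-1] not in ASCII_END:
        return False
    return word.lower() not in SKIP_WORDS


def filter_tweet(text: str) -> List[str]:
    """Return the words of one tweet that survive every filter."""
    if text.startswith("RT @"):  # assumed retweet marker
        return []
    ascii_only = text.encode("ascii", errors="ignore").decode("ascii")
    return [word for word in ascii_only.split() if keep_word(word)]


print(filter_tweet("Learning #python today \U0001F389 https://t.co/abc and a lot!"))
```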

usage: process.py [-h] [--cutoff CUTOFF] database_name

#100DaysofCode Project - 2021 re-write

positional arguments:
  database_name    Sqlite3 database file to load.

optional arguments:
  -h, --help       show this help message and exit
  --cutoff CUTOFF  Lower percent (0-100) to remove from output. Default: 60
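
The help text does not spell out how `--cutoff` is applied. One plausible interpretation, sketched here purely as an assumption, is to rank words by count and drop the lowest `cutoff` percent of the ranking. The function name `apply_cutoff` is hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def apply_cutoff(counts: Counter, cutoff: int = 60) -> List[Tuple[str, int]]:
    """Keep the top (100 - cutoff) percent of words, ranked by count."""
    ranked = counts.most_common()  # highest count first
    dropped = (len(ranked) * cutoff) // 100
    return ranked[: len(ranked) - dropped]


counts = Counter({"python": 50, "code": 40, "day": 30, "learn": 20, "bug": 10})
print(apply_cutoff(counts, 60))
```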

twitterapiv2 - Custom API wrapper

In this rewrite I ended up creating my own custom wrapper. I'll break this out into its own proper repo "soon".

Authenticating with Twitter API v2 as an application

The authentication client included in the twitterapiv2/ library requires your application's consumer credentials to be loaded into environment variables before an authentication attempt is made. The consumer credentials are your client key and client secret as found in the application dashboard of the Twitter Dev panel.

Create two environment variables as follows:

TW_CONSUMER_KEY=[client key]
TW_CONSUMER_SECRET=[client secret]

A 'TW_BEARER_TOKEN' will be created in the environment on successful authentication. This token should be stored securely and loaded into the environment on subsequent calls. When this token already exists, the request for a bearer token can be skipped.

Additional calls to the authentication process will not result in a new bearer token if the same consumer credentials are provided. The former bearer token must be invalidated to obtain a new one.
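
For context, application-only authentication against Twitter follows the OAuth 2.0 client credentials flow: the key and secret are sent as HTTP Basic credentials to the `oauth2/token` endpoint, and the JSON response carries the bearer token in `access_token`. The sketch below only builds the request with the standard library and does not send it. `build_token_request` is a hypothetical helper and not how the bundled auth client is necessarily written.

```python
import base64
import urllib.parse
import urllib.request

TOKEN_URL = "https://api.twitter.com/oauth2/token"


def build_token_request(consumer_key: str, consumer_secret: str) -> urllib.request.Request:
    """Build (but do not send) the bearer token request."""
    key = urllib.parse.quote(consumer_key, safe="")
    secret = urllib.parse.quote(consumer_secret, safe="")
    basic = base64.b64encode(f"{key}:{secret}".encode("ascii")).decode("ascii")
    return urllib.request.Request(
        TOKEN_URL,
        data=b"grant_type=client_credentials",
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded;charset=UTF-8",
        },
        method="POST",
    )


request = build_token_request("key", "secret")
print(request.get_method(), request.full_url)
```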

Search client

The search client performs a "Recent Search" from the Twitter V2 API. This search is limited to seven days of history and has a large number of inner objects to select from. By default, the search only returns the text of the tweet and the id of the tweet.

After declaring a base SearchClient() the fields of the search query can be set using the builder methods. These can be chained as they return a new SearchClient with the fields carried forward. When executing a .search() the page_token allows for pagination of results.
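
The "builder methods return a new client" pattern can be illustrated with a small stand-in class. `SketchSearch` is hypothetical and only mirrors the chaining behavior described above; the real SearchClient carries more state and more builder methods.

```python
from typing import Dict, Optional


class SketchSearch:
    """Hypothetical builder; each method returns a new instance."""

    def __init__(self, fields: Optional[Dict[str, str]] = None) -> None:
        self.fields: Dict[str, str] = dict(fields or {})

    def _with(self, key: str, value: str) -> "SketchSearch":
        # Copy the existing fields forward instead of mutating self
        return SketchSearch({**self.fields, key: value})

    def start_time(self, value: str) -> "SketchSearch":
        return self._with("start_time", value)

    def max_results(self, value: int) -> "SketchSearch":
        return self._with("max_results", str(value))


base = SketchSearch()
query = base.start_time("2021-11-10T00:00:00Z").max_results(100)
print(base.fields, query.fields)
```

Because every call returns a fresh object, a partially configured client can be reused as a template for several different searches without one affecting another.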

Rate limiting must be handled outside of the library. SearchClient.limit_remaining is an int representing the number of API calls remaining before requests are refused. SearchClient.limit_reset is a naive (timezone-unaware) datetime object, in UTC, of the next reset time (typically 15 minutes away). If a search has not been invoked, limit_remaining defaults to -1 and limit_reset to .utcnow().
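
Twitter reports rate limit state in the `x-rate-limit-remaining` and `x-rate-limit-reset` response headers, the latter as epoch seconds. A sketch of turning those headers into the two values described above follows. `read_rate_limit` is a hypothetical helper, and the fallback values simply mirror the defaults mentioned in the text.

```python
from datetime import datetime, timezone
from typing import Mapping, Tuple


def read_rate_limit(headers: Mapping[str, str]) -> Tuple[int, datetime]:
    """Return (calls remaining, naive UTC reset time) from response headers."""
    remaining = int(headers.get("x-rate-limit-remaining", -1))
    reset_raw = headers.get("x-rate-limit-reset")
    if reset_raw is None:
        # No search made yet: fall back to the current UTC time
        reset = datetime.now(timezone.utc).replace(tzinfo=None)
    else:
        reset = datetime.fromtimestamp(int(reset_raw), tz=timezone.utc).replace(tzinfo=None)
    return remaining, reset


remaining, reset = read_rate_limit(
    {"x-rate-limit-remaining": "0", "x-rate-limit-reset": "1636909913"}
)
print(remaining, reset)
```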

Full API details:

https://developer.twitter.com/en/docs/twitter-api/tweets/search/api-reference/get-tweets-search-recent#Default

Use example:

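from datetime import datetime
from time import sleep

from secretbox import SecretBox
from twitterapiv2.auth_client import AuthClient

# NOTE: the module paths of the next two imports are assumptions;
# adjust them to match the layout of twitterapiv2/
from twitterapiv2.exceptions import InvalidResponseError, ThrottledError
from twitterapiv2.search_client import SearchClient

SLEEP_TIME = 60  # seconds between reset checks (example value)

# Load the .env file into the environment
SecretBox(auto_load=True)

# Authenticate; on success TW_BEARER_TOKEN is set in the environment
auth = AuthClient()
auth.set_bearer_token()
search_string = "#100DaysOfCode -is:retweet"

# Each builder method returns a new SearchClient with the fields carried forward
client = (
    SearchClient()
    .start_time("2021-11-10T00:00:00Z")
    .expansions("author_id,attachments.poll_ids")
    .max_results(100)
)
while True:
    print("Retrieving Tweets...")
    try:
        response = client.search(search_string, page_token=client.next_token)
    except InvalidResponseError as err:
        print(f"Invalid response from HTTP: '{err}'")
        break
    except ThrottledError:
        # Wait out the rate limit window, then retry the same page
        print(f"Rate limit reached, resets at: {client.limit_reset} UTC")
        while datetime.utcnow() <= client.limit_reset:
            print(f"Waiting for limit reset, currently: {datetime.utcnow()} UTC...")
            sleep(SLEEP_TIME)
        continue
    # Do something with pulled data in response
    if not client.next_token:
        print("No additional pages to poll.")
        break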
