# Collect tweets for corpus

1. Collect tweets from [@horkrimsisms](https://twitter.com/horkrimsisms)
2. Cache them to avoid downloading all tweets every time
3. Produce `tweet-corpus.txt` (and `tweet-corpus.jsonl`) of tweets to train models
4. Use the corpus in other notebooks (e.g. [markov.ipynb](markov.ipynb))


In [1]:
# credentials not in repo
# defines bearer_token="..."
%run creds
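
The `creds` script itself is not shown here. As a minimal sketch of one way it might define `bearer_token`, the token could be read from an environment variable instead of being written into a file. The helper name `load_bearer_token` and the variable name `TWITTER_BEARER_TOKEN` are assumptions made for illustration, not part of this repo.

```python
import os


def load_bearer_token(env_var: str = "TWITTER_BEARER_TOKEN") -> str:
    """Read the API bearer token from an environment variable.

    Both this helper and the default variable name are hypothetical.
    """
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f"Set {env_var} before running this notebook")
    return token
```

With something like this in `creds.py`, the script would end with `bearer_token = load_bearer_token()`.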

Connect to the Twitter API using tweepy

In [2]:
import tweepy
tw = tweepy.Client(bearer_token=bearer_token)

Get the user ID for @horkrimsisms

In [3]:
username = "horkrimsisms"

r = tw.get_user(username=username)
user_id = r.data.id

Download all horkrimsisms tweets (with caching)

In [6]:
from pathlib import Path
import json


def _cache_file(user_id: int) -> Path:
    """Return the cache file path for a user, creating the cache directory if needed"""
    cache_dir = Path("cache")
    cache_dir.mkdir(exist_ok=True)
    return cache_dir.joinpath(f"{user_id}.json")


def load_cache(user_id: int) -> list[dict]:
    """Load cached tweets, or an empty list if the cache is missing or unreadable"""
    cache_file = _cache_file(user_id)
    if not cache_file.exists():
        return []
    with cache_file.open() as f:
        try:
            return json.load(f)
        except json.JSONDecodeError as e:
            print(f"Error decoding cache: {e}")
            return []


def save_cache(user_id: int, tweets: list):
    """Save full list of tweets to cache file"""
    cache_file = _cache_file(user_id)
    with cache_file.open("w") as f:
        json.dump(tweets, f)


def get_tweets(user_id: int) -> list:
    """Collect tweets
    
    1. load cache
    2. retrieve any new tweets
    3. save any new tweets to the cache
    4. return list of {id: int, text: "tweet"}
    """
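    cache = load_cache(user_id)
    # the cache is stored newest-first, so cache[0] is the latest tweet we already have
    if cache:
        since_id = cache[0]["id"]
    else:
        since_id = None

    # upper bound for paging backwards; updated after each page
    until_id = None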

    def fetch() -> list:
        """Download one page of tweets"""
        return tw.get_users_tweets(
            user_id,
            max_results=100,
            exclude=["retweets", "replies"],
            until_id=until_id,
            since_id=since_id,
        ).data

    to_add = []
    new_tweets = fetch()

    if not new_tweets:
        return cache

    while new_tweets:
        to_add.extend(new_tweets)
        until_id = to_add[-1].id
        new_tweets = fetch()

    # new tweets arrive newest-first, so they go in front of the cached ones
    all_tweets = [dict(t) for t in to_add] + cache
    save_cache(user_id, all_tweets)
    return all_tweets
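
To see how the `since_id`/`until_id` paging combines with the cache without calling the API, here is a minimal offline sketch. `fake_fetch` and `collect_new` are hypothetical stand-ins written for this illustration, not tweepy functions. They assume pages come back newest-first and that both ID bounds are exclusive.

```python
def fake_fetch(timeline, since_id=None, until_id=None, page_size=2):
    """Hypothetical stand-in for one API page: newest first, exclusive bounds."""
    page = [
        t
        for t in timeline
        if (since_id is None or t["id"] > since_id)
        and (until_id is None or t["id"] < until_id)
    ]
    return page[:page_size]


def collect_new(timeline, cache, page_size=2):
    """Page backwards from the newest tweet until reaching the cached ones."""
    since_id = cache[0]["id"] if cache else None
    until_id = None
    to_add = []
    while True:
        page = fake_fetch(timeline, since_id, until_id, page_size)
        if not page:
            break
        to_add.extend(page)
        # the next page must be strictly older than the oldest tweet seen so far
        until_id = to_add[-1]["id"]
    return to_add + cache


# ids 7..1, newest first; pretend ids 3, 2, 1 are already cached
timeline = [{"id": i, "text": f"tweet {i}"} for i in range(7, 0, -1)]
cache = timeline[4:]
merged = collect_new(timeline, cache)
print([t["id"] for t in merged])
```

Only ids 7 through 4 are fetched (in two pages of two), and the merged list stays newest-first.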

In [7]:
tweets = get_tweets(user_id)
len(tweets)

533

The three most recent tweets

In [9]:
tweets[:3]

[{'id': 1542921191383617539,
  'text': 'imagine a dark carousel of various elks'},
 {'id': 1541821432535154690,
  'text': 'there is a unique hunger here for these nuts'},
 {'id': 1541459891495309314,
  'text': "it's purely about the density of the taint"}]

Generate corpus file from list of tweets

In [11]:
START = "HORKRIMS "
END = " ENDHORKRIMS\n\n"
with open("tweet-corpus.txt", "w", encoding="utf-8") as f:
    for tweet in tweets:
        f.write(START + tweet["text"].replace("\n", " ") + END)
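
A notebook that consumes this file may need to split text back into individual tweets. A minimal sketch, assuming the same `START`/`END` markers as above; `split_corpus` is a hypothetical helper, not something defined elsewhere in this repo.

```python
START = "HORKRIMS "
END = " ENDHORKRIMS\n\n"


def split_corpus(corpus: str) -> list[str]:
    """Split corpus text into tweets by stripping the START/END markers."""
    tweets = []
    for chunk in corpus.split(END):
        # the final chunk after the last END marker is empty and is skipped
        if chunk.startswith(START):
            tweets.append(chunk[len(START):])
    return tweets


sample = START + "first tweet" + END + START + "second tweet" + END
print(split_corpus(sample))
```

The explicit markers make tweet boundaries unambiguous even though newlines inside tweets were replaced with spaces.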

In [13]:
with open("tweet-corpus.jsonl", "w") as f:
    for tweet in tweets:
        f.write(json.dumps({
            "prompt": "",
            "completion": " " + tweet["text"]
        }))
        f.write("\n")
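
The cell above writes one JSON object per line, each with an empty `prompt` and a `completion` that starts with a space. As a quick sanity check, such a file can be read back line by line. This is a minimal sketch; `read_jsonl` is a hypothetical helper, and the sample uses an in-memory buffer instead of the real file.

```python
import io
import json


def read_jsonl(f) -> list[dict]:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in f if line.strip()]


sample = io.StringIO(
    json.dumps({"prompt": "", "completion": " first tweet"}) + "\n"
    + json.dumps({"prompt": "", "completion": " second tweet"}) + "\n"
)
records = read_jsonl(sample)
# drop the leading space that was added to each completion
texts = [r["completion"].strip() for r in records]
print(texts)
```

For the real file, the same helper would be called on `open("tweet-corpus.jsonl")`.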

Now we have our corpus!

Follow up in another notebook to generate tweets,
e.g. [markov.ipynb](markov.ipynb) for a Markov chain
or [gpt-gen.ipynb](gpt-gen.ipynb) for GPT.