# Daily Tweet Collection
Iterate over all existing user accounts and collect their tweets since `since_id` (the last collected tweet, ideally from the day before) or since `misc.CONFIG["collection"]["oldest_tweet"]`.

Additionally:
* identify suspensions on a daily basis and mark the suspension day.
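
The cutoff rule described above can be sketched as follows. This is a minimal illustration and not the project's `get_tweets` helper, whose implementation is not shown here. The `fetch_page` callable, the tweet layout (`id`, `created_at`) and the choice to prefer `since_id` over the date cutoff are all assumptions.

```python
from datetime import datetime, timezone


def collect_new_tweets(fetch_page, since_id=None, oldest=None):
    """Page backwards through a timeline until a cutoff is reached.

    fetch_page(max_id) is a hypothetical callable returning tweets, newest
    first, with id <= max_id (or the newest tweets when max_id is None).
    Each tweet is assumed to be a dict with "id" (int) and "created_at"
    (timezone-aware datetime).
    """
    collected = []
    max_id = None
    while True:
        page = fetch_page(max_id)
        if not page:
            break
        for tweet in page:
            # Assumption: prefer the since_id cutoff, else use the date cutoff.
            if since_id is not None:
                if tweet["id"] <= since_id:
                    return collected
            elif oldest is not None and tweet["created_at"] < oldest:
                return collected
            collected.append(tweet)
        # Continue strictly below the oldest tweet seen so far.
        max_id = page[-1]["id"] - 1
    return collected


# Fake timeline for illustration: ids 10..1, one tweet per day in September 2020.
timeline = [
    {"id": i, "created_at": datetime(2020, 9, i, tzinfo=timezone.utc)}
    for i in range(10, 0, -1)
]


def fake_fetch(max_id, page_size=3):
    eligible = [t for t in timeline if max_id is None or t["id"] <= max_id]
    return eligible[:page_size]


by_id = collect_new_tweets(fake_fetch, since_id=7)
by_date = collect_new_tweets(fake_fetch, oldest=datetime(2020, 9, 5, tzinfo=timezone.utc))
print([t["id"] for t in by_id])    # [10, 9, 8]
print([t["id"] for t in by_date])  # [10, 9, 8, 7, 6, 5]
```

With a stored `since_id`, only tweets newer than it are returned. Without one, the date cutoff bounds how far back the collection goes.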

In [1]:
# required imports to access api_db, misc, misc.CONFIG, ...
import sys
sys.path = ['.', '..', '../..'] + sys.path
from collection import *

using 105 seed accounts (20 from news sources) and 0 hashtags
Database stats: 
{'avgObjSize': 102.53848223444147,
 'collections': 3,
 'dataSize': 6097861.0,
 'db': 'electionswatch',
 'fsTotalSize': 510770802688.0,
 'fsUsedSize': 445008846848.0,
 'indexSize': 1077248.0,
 'indexes': 3,
 'numExtents': 0,
 'objects': 59469,
 'ok': 1.0,
 'scaleFactor': 1.0,
 'storageSize': 3387392.0,
 'views': 0}
DB size (B): 6097861.0
DB size (MB): 5.82
DB size (GB): 0.01
Using API keys for app albertina_01

Done initializing at 07:12PM on September 26, 2020.
----------------------------------------


### Conditional Execution
Each file must check, based on the configuration, whether it should be executed. For some files execution is not optional, but every file should still have this section, even if the check is tautological. Example:
```python
if not misc.CONFIG["collection"]["execute_this_script"]: exit()
```

In [2]:
# Conditional execution
pass

<hr>
<h1 align="center">driver code</h1>

Tweets are collected for users that either do not have a `most_common_language` yet or whose `most_common_language` is in `config.collection.search_languages`, and whose `depth` is at most 1 (the `"depth": {"$lte": 1}` filter in the code below).
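
As a rough sketch, the selection rule in the paragraph above could be written as a MongoDB-style filter like the one below. The function name `build_user_filter` and the example languages are hypothetical. The cell that follows passes only the `depth` condition to `find_exclude_invalid`, so the language condition may be applied elsewhere in the project.

```python
def build_user_filter(search_languages, max_depth):
    """Return a MongoDB-style filter for the users to collect tweets from."""
    return {
        "depth": {"$lte": max_depth},
        "$or": [
            # No language has been determined for this user yet.
            {"most_common_language": {"$exists": False}},
            # The user's language is one of the configured search languages.
            {"most_common_language": {"$in": list(search_languages)}},
        ],
    }


# Hypothetical values, for illustration only.
example_filter = build_user_filter(["pt", "en"], max_depth=1)
print(example_filter)
```

Both conditions must hold: the top-level `depth` bound is combined with the `$or` clause by MongoDB's implicit AND.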

In [3]:
def task(skip, limit):
    from collections import Counter

    oldest_t = misc.CONFIG["collection"]["oldest_tweet"]
    since_id_key = "since_id"
    print("Collection with oldest tweet at %s and key for since_id '%s'" % (oldest_t, since_id_key))

    def update_most_common_language(user, tweets):
        # assumes tweets are all from the same user
        # updates the user's {lang: count} tally in place and saves it; returns None
        if not len(tweets): return
        lang = "tweeted_languages"
        user[lang] = dict_key_or_default(user, lang, {})
        new_langs = Counter(map(lambda x: dict_key_or_default(x, "lang", "und"), tweets))
        for k, v in new_langs.items():
            user[lang][k] = dict_key_or_default(user[lang], k, 0) + v
        # always update the most_common_language
        user["most_common_language"] = dict_key_for_max_val(user[lang])
        # update the number of tweets analyzed for this user
        user["count_parsed_tweets"] = sum(user[lang].values())
        # finally update in db
        upsert_user(user)

    find_params = find_exclude_invalid({
        "depth": {"$lte": 1}
    })
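    # also project "tweeted_languages" so the per-language counts accumulate across runs
    users = api_db.col_users.find(find_params, {since_id_key: True, "suspended": True, "tweeted_languages": True}, no_cursor_timeout=True).skip(skip).limit(limit)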

    for u in users:
        print("getting tweets for: %s..." % u["_id"], end="", flush=True)
        if "suspended" in u: u = update_account_details_on_suspended(u)
        tweets = get_tweets(u, api_db.api.GetUserTimeline, since_id_key, oldest_t, {"trim_user":True})
        insert_tweets(tweets)
        update_most_common_language(u, tweets)
        print("got %d new tweets, done." % len(tweets))

In [4]:
task(0, 100)

Collection with oldest tweet at 2020-09-01 00:00:00+00:00 and key for since_id 'since_id'
getting tweets for: 94081671...got 3206 new tweets, done.
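
The language bookkeeping done by `update_most_common_language` can be illustrated in isolation. The sketch below uses plain dictionaries and skips the database write. It assumes that `dict_key_or_default` behaves like `dict.get` and that `dict_key_for_max_val` returns the key with the largest value, since neither helper is shown in this notebook.

```python
from collections import Counter


def tally_languages(user, tweets):
    """Merge the languages of `tweets` into `user` and return the updated user."""
    if not tweets:
        return user
    # Start from the counts already stored on the user, if any.
    counts = Counter(user.get("tweeted_languages", {}))
    # Tweets without a "lang" field are counted as undetermined ("und").
    counts.update(tweet.get("lang", "und") for tweet in tweets)
    user["tweeted_languages"] = dict(counts)
    user["most_common_language"] = max(counts, key=counts.get)
    user["count_parsed_tweets"] = sum(counts.values())
    return user


user = {"_id": 1, "tweeted_languages": {"pt": 2}}
tweets = [{"lang": "en"}, {"lang": "pt"}, {}]
tally_languages(user, tweets)
print(user["tweeted_languages"])     # {'pt': 3, 'en': 1, 'und': 1}
print(user["most_common_language"])  # pt
print(user["count_parsed_tweets"])   # 5
```

The counts only accumulate across runs if the stored `tweeted_languages` value is loaded together with the user document.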


In [3]:
find_params = find_exclude_invalid({
    "depth": {"$lte": 1}
})
total = api_db.col_users.count_documents(find_params)
print("Total to process: %d" % total)

Total to process: 52421


In [None]:
max_threads = misc.CONFIG["collection"]["max_threads"]
batch_size = int(total/max_threads)+1000
dp = DynamicParallelism(total, task, "tweet_collection", batch_size=batch_size, max_threads=max_threads)

In [None]:
dp.run()
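
`DynamicParallelism` is a project helper whose internals are not shown here. The sketch below is not its implementation. It only illustrates, with a fixed-size thread pool, how a total can be split into `(skip, limit)` batches that are handed to a function with the same signature as `task`. The batch size and thread count are illustrative values.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(total, batch_size):
    """Return (skip, limit) pairs that cover the range [0, total)."""
    return [
        (skip, min(batch_size, total - skip))
        for skip in range(0, total, batch_size)
    ]


def run_in_batches(total, task, batch_size, max_threads):
    """Call task(skip, limit) for every batch using a thread pool."""
    batches = make_batches(total, batch_size)
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # list() waits for all batches and re-raises any worker exception.
        return list(pool.map(lambda batch: task(*batch), batches))


# Stand-in task that only reports how many users its batch would cover.
sizes = run_in_batches(52421, lambda skip, limit: limit, batch_size=20000, max_threads=4)
print(sizes)  # [20000, 20000, 12421]
```

The last batch is shortened so that the batches cover the total exactly once.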

In [None]:
print("DONE")