In [1]:
# For future reference, this is how we can set up the test database in a notebook, for examples

import sys, os
DJANGO_LOCATION = "/Users/pvankessel/.pyenv/versions/3.6.5/envs/python3/lib/python3.6/site-packages"
sys.path.append(DJANGO_LOCATION)
import django
os.environ["DJANGO_ALLOW_ASYNC_UNSAFE"] = "true"
os.environ['DJANGO_SETTINGS_MODULE'] = 'testapp.settings'
django.setup()

from django import test
from django.db import connection
test.utils.setup_test_environment() # Setup the environment
db = connection.creation.create_test_db() # Create the test db

from testapp.tests import DjangoTwitterTests
DjangoTwitterTests().setUp()

Creating test database for alias 'default'...
Got an error creating the test database: database "test_postgres" already exists



Type 'yes' if you would like to try deleting the test database 'test_postgres', or 'no' to cancel: yes


Destroying old test database for alias 'default'...


# Django Twitter

Django Twitter is designed to make it easy to collect and store data from Twitter in a database, using Django.
It is layered on top of Pewhooks - our collection of Python utilities for interfacing with various APIs, including
Twitter - and provides a set of standardized abstract Django models and Django commands for querying the Twitter
API and storing the data you get back.

## Configuration

In the following examples, we have a test app set up that has Django Twitter installed like so:

### settings.py
```python
INSTALLED_APPS = [
    "django.contrib.auth",
    "django.contrib.contenttypes",
    "django.contrib.sites",
    "django_twitter",
    "testapp",
]
TWITTER_APP = "testapp"
```

And in our app's `models.py` file, we import all of the abstract models that Django Twitter provides, and create them as concrete models in our own app. The abstract models define a "template" for how to store Twitter data, but the tables will be created for and belong to our own app. You'll also notice that we have two other tables in our app - `Person` and `Organization`. What's nice about Django Twitter providing _abstract_ models is that you can build on the model templates and extend them with additional data however you like. In our app, we're going to add some additional metadata on who owns each Twitter account, and store that info in other custom tables in our app.

### models.py

```python
from django.db import models
from django_twitter.models import *

class Person(models.Model):
    name = models.CharField(max_length=250)
    
class Organization(models.Model):
    name = models.CharField(max_length=250)

class TwitterProfile(AbstractTwitterProfile):

    person = models.ForeignKey(
        "testapp.Person",
        related_name="twitter_profiles",
        null=True,
        on_delete=models.SET_NULL,
    )
    organization = models.ForeignKey(
        "testapp.Organization",
        related_name="twitter_profiles",
        null=True,
        on_delete=models.SET_NULL,
    )

class TwitterProfileSnapshot(AbstractTwitterProfileSnapshot):
    pass

class Tweet(AbstractTweet):
    pass

class TwitterFollowerList(AbstractTwitterFollowerList):
    pass

class TwitterFollowingList(AbstractTwitterFollowingList):
    pass

class TwitterHashtag(AbstractTwitterHashtag):
    pass

class TwitterPlace(AbstractTwitterPlace):
    pass

class TweetSet(AbstractTweetSet):
    pass

class TwitterProfileSet(AbstractTwitterProfileSet):
    pass
```

Finally, we're going to grab our Twitter credentials and save them as environment variables. Django Twitter allows you to manually pass your Twitter credentials to all of its data collection commands, but it's much easier to take advantage of Pewhooks' support for environment variables and let it fetch them automatically. The variables it needs are `TWITTER_API_KEY`, `TWITTER_API_SECRET`, `TWITTER_API_ACCESS_TOKEN`, `TWITTER_API_ACCESS_SECRET`.
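Before running any commands, it can help to confirm that all four variables are actually set. The helper below is just a minimal sketch: `missing_credentials` and `REQUIRED_VARIABLES` are names we made up for illustration, and they are not part of Django Twitter or Pewhooks.

```python
import os

# The four variable names listed above.
REQUIRED_VARIABLES = (
    "TWITTER_API_KEY",
    "TWITTER_API_SECRET",
    "TWITTER_API_ACCESS_TOKEN",
    "TWITTER_API_ACCESS_SECRET",
)


def missing_credentials(environ=None):
    """Return the names of required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARIABLES if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        print("Missing Twitter credentials:", ", ".join(missing))
    else:
        print("All Twitter credentials are set.")
```

Passing a plain dictionary instead of `os.environ` makes the check easy to try out without touching your real environment.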

With our credentials set and the above two files figured out, we've now got Django Twitter all configured. We just need to run `python manage.py makemigrations testapp` and `python manage.py migrate` to actually create the tables, and then we're ready to load in some data! Below is the list of accounts we're going to track.

## Adding accounts and downloading profile data

In [2]:
MY_ACCOUNTS = [
    "pewresearch",
    "pewglobal",
    "pewmethods",
    "pewjournalism",
    "facttank",
    "pewscience",
    "pewreligion",
    "pewhispanic",
    "pewinternet",
    "pvankessel",
    "justinbieber",
]

Django Twitter provides a bunch of built-in management commands that make it easy to collect Twitter data. If we wanted to start pulling in data for these profiles, we could run the following command from the CLI and Django Twitter would hit the API and store the results in the database:

`python manage.py django_twitter_get_profile pewresearch`

But Django also provides a `call_command` function that lets you call management commands programmatically, so that's what we're going to use here.

In [3]:
from django.core.management import call_command

In [4]:
for handle in MY_ACCOUNTS:
    call_command(
        "django_twitter_get_profile", handle
    )

Collecting profile data for pewresearch
Successfully saved profile data for pewresearch: http://twitter.com/pewresearch
Collecting profile data for pewglobal
Successfully saved profile data for pewglobal: http://twitter.com/pewglobal
Collecting profile data for pewmethods
Successfully saved profile data for pewmethods: http://twitter.com/pewmethods
Collecting profile data for pewjournalism
Successfully saved profile data for pewjournalism: http://twitter.com/pewjournalism
Collecting profile data for facttank
Successfully saved profile data for facttank: http://twitter.com/facttank
Collecting profile data for pewscience
Successfully saved profile data for pewscience: http://twitter.com/pewscience
Collecting profile data for pewreligion
Successfully saved profile data for pewreligion: http://twitter.com/pewreligion
Collecting profile data for pewhispanic
Successfully saved profile data for pewhispanic: http://twitter.com/pewhispanic
Collecting profile data for pewinternet
Successfully sa

In [5]:
from testapp.models import TwitterProfile
TwitterProfile.objects.count()

11

In [6]:
TwitterProfile.objects.filter(screen_name='pewresearch').values()

<QuerySet [{'id': 1, 'twitter_id': '22642788', 'last_update_time': datetime.datetime(2021, 6, 16, 12, 16, 59, 550270), 'historical': False, 'tweet_backfilled': False, 'screen_name': 'pewresearch', 'created_at': datetime.datetime(2009, 3, 3, 10, 39, 39), 'twitter_error_code': None, 'person_id': None, 'organization_id': None, 'most_recent_snapshot_id': 1}]>

Hmm, that looks like we've got less data than we expected. Where'd all the data go? Well, because Twitter profiles change over time - people gain and lose followers, they change their descriptions (and sometimes even their screen names) - and because we might be interested in tracking that data, Django Twitter actually stores profile data in a separate table, every time it queries the API. We call these "profile snapshots" and you can access them like so:

In [7]:
profile = TwitterProfile.objects.get(screen_name='pewresearch')
profile.snapshots.all()

<QuerySet [<TwitterProfileSnapshot: pewresearch: http://twitter.com/pewresearch AS OF 2021-06-16 12:16:59.525119>]>

In [8]:
profile.snapshots.values()

<QuerySet [{'id': 1, 'timestamp': datetime.datetime(2021, 6, 16, 12, 16, 59, 525119), 'screen_name': 'pewresearch', 'name': 'Pew Research Center', 'contributors_enabled': False, 'description': 'Nonpartisan, non-advocacy data and analysis on the issues, attitudes and trends shaping the world. Subscribe: https://t.co/Kpq1V0w9bM ✉️', 'favorites_count': 892, 'followers_count': 430682, 'followings_count': 96, 'is_verified': True, 'is_protected': False, 'listed_count': 13197, 'profile_image_url': 'http://pbs.twimg.com/profile_images/879728447026868228/U4Uzpdp6_normal.jpg', 'status': 'RT @_StephKramer: 1. New analysis: Americans lost roughly 5.5 million years of life to COVID-19 in 2020 alone, more than all accidents comb…', 'statuses_count': 90365, 'urls': ['https://www.pewresearch.org/'], 'location': 'Washington, DC', 'json': {'id': 22642788, 'url': 'https://t.co/OBLpll8VR0', 'lang': None, 'name': 'Pew Research Center', 'id_str': '22642788', 'status': {'id': 1405210666189852674, 'geo': None

There's the data! For convenience, you can always access the most recent snapshot of a profile directly using the `most_recent_snapshot` field:

In [9]:
profile.most_recent_snapshot

<TwitterProfileSnapshot: pewresearch: http://twitter.com/pewresearch AS OF 2021-06-16 12:16:59.525119>
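To make the profile/snapshot split a little more concrete, here is a plain-Python sketch of the idea. It is a simplified illustration only, not Django Twitter's actual models: the `Profile` and `Snapshot` classes are stand-ins we made up, and the second follower count is a hypothetical value.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Snapshot:
    """Profile attributes that can change, captured at one point in time."""
    timestamp: datetime
    followers_count: int


@dataclass
class Profile:
    """The stable part of an account, keyed by its Twitter ID."""
    twitter_id: str
    snapshots: List[Snapshot] = field(default_factory=list)

    def add_snapshot(self, snapshot: Snapshot) -> None:
        # Every API query appends a new snapshot instead of overwriting.
        self.snapshots.append(snapshot)

    @property
    def most_recent_snapshot(self) -> Optional[Snapshot]:
        return max(self.snapshots, key=lambda s: s.timestamp, default=None)


profile = Profile(twitter_id="22642788")
profile.add_snapshot(Snapshot(datetime(2021, 6, 16), 430682))
profile.add_snapshot(Snapshot(datetime(2021, 7, 16), 431000))  # hypothetical later value
```

Because nothing is overwritten, the full history stays available for tracking changes, while `most_recent_snapshot` gives quick access to the latest values.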

## Collecting tweets

Okay, now let's get some tweets using the `django_twitter_get_profile_tweets` command. The Twitter v1 API allows you to go back as far as the ~3200 most recent tweets produced by an account. With query limits, that would take a while, so we're going to set a limit of 25 tweets. But normally, we'd probably want to grab everything we could, and then periodically run this command again to get new tweets on a regular basis. Our Twitter account is pretty active - but it's definitely not producing 3200 tweets every day, so when we run this command a second time, we probably don't need to iterate through everything all over again. Instead, Django Twitter sets a `tweet_backfilled=True` flag on the profile the first time it works its way through the full available feed for a profile. Then, in subsequent runs of `django_twitter_get_profile_tweets`, it'll default to breaking off the data collection when it encounters a tweet it's already seen. 

(Sidenote: since you're probably collecting tweets for multiple profiles and it's possible that some accounts mention or retweet each other, Django Twitter is smart enough to check for this, and it only breaks off when it encounters a tweet that could only have been captured by collecting the profile's own feed.)

You can ignore the backfill flag by simply passing `--ignore_backfill` to the command, and it'll keep iterating. And, if you just want to ignore the backfill flag for a limited timeframe (only refreshing existing tweets up to a certain point) then you can pass `--max_backfill_days` or `--max_backfill_date` to the command to tell it how far back you want to go. Finally, Django Twitter avoids overwriting existing tweet data, unless you pass `--overwrite`. Combining some of these flags - like `--ignore_backfill --max_backfill_days 7 --overwrite` - can be useful if you want to refresh the stats (e.g. likes, retweets) for recent tweets, but don't care after a certain point (stats tend to level off after a few days, so we often stop refreshing tweets after a week).
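The stopping rules described above can be modeled in a few lines of plain Python. This is only a rough sketch of the logic as we've described it, not the library's code: `collect_feed` and its parameters are hypothetical, it treats the cutoff as applying whenever `max_backfill_days` is set, and it ignores the subtlety about tweets first seen through other profiles' feeds.

```python
from datetime import datetime, timedelta


def collect_feed(feed, seen_ids, backfilled, ignore_backfill=False,
                 max_backfill_days=None, now=None):
    """Return the tweet IDs that would be processed from a feed.

    `feed` is a list of (tweet_id, created_at) pairs, newest first.
    """
    now = now or datetime.now()
    cutoff = None
    if max_backfill_days is not None:
        cutoff = now - timedelta(days=max_backfill_days)

    processed = []
    for tweet_id, created_at in feed:
        # Stop once we are past the requested backfill window.
        if cutoff is not None and created_at < cutoff:
            break
        # On a backfilled profile, stop at the first tweet we already have,
        # unless we were told to ignore the backfill flag.
        if backfilled and not ignore_backfill and tweet_id in seen_ids:
            break
        processed.append(tweet_id)
    return processed
```

With a backfilled profile, the default run stops at the first known tweet, `ignore_backfill=True` walks the whole feed again, and adding `max_backfill_days=7` limits that second pass to the last week.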

Anyway, below, we're going to collect the most recent 25 tweets for the @pewresearch account:

In [10]:
call_command(
    "django_twitter_get_profile_tweets",
    "pewresearch",
    limit=25
)

Retrieving tweets for user pewresearch: 0it [00:00, ?it/s]

Retrieving tweets for user pewresearch


Retrieving tweets for user pewresearch: 24it [00:03,  6.14it/s]

pewresearch: http://twitter.com/pewresearch: 25 tweets scanned, 25 updated





In [11]:
from testapp.models import Tweet
Tweet.objects.count()

37

In [12]:
TwitterProfile.objects.count()

19

Awesome, we've got tweets now!  But wait, we have more tweets than we expected, and we also have more Twitter profiles. What gives?

Well, Django Twitter automatically creates new records for any and all tweets and accounts it encounters. So, if @pewresearch quote tweets an account we hadn't seen before, both the quote tweet and the original tweet get created in the database, along with the account that created the quoted tweet. This is nice, because our lovely database grows and tracks all of the data it can - but now we've got extra profiles. Which ones are our original ones?

We could just keep using our initial list of screen names to keep track of our "primary" accounts, but that poses another problem: screen names can change and get recycled by new accounts. If we were to retire our @pewresearch account and it were to get snatched up by a spam bot (not an uncommon scenario for popular handles), our queries would start pulling in spammy tweets, and we wouldn't even know something was wrong unless we took a close look. The better way of tracking accounts is to use their actually-unique Twitter IDs, which you get back from the API the first time you query a screen name.

In [13]:
profile.twitter_id

'22642788'
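Here is a tiny sketch of why IDs are the safer key. The records below are hypothetical stand-ins for API responses: we pretend the original account renamed itself and that an unrelated account (with a made-up ID) picked up the old handle.

```python
TRACKED_ID = "22642788"  # @pewresearch's ID, from the output above

# Hypothetical records: the rename and the second account are invented.
records = [
    {"id_str": "22642788", "screen_name": "pewresearch_new"},
    {"id_str": "999999999", "screen_name": "pewresearch"},
]


def match_by_screen_name(records, screen_name):
    """Select records by handle, which can change or be recycled."""
    return [r for r in records if r["screen_name"] == screen_name]


def match_by_id(records, twitter_id):
    """Select records by Twitter ID, which stays with the account."""
    return [r for r in records if r["id_str"] == twitter_id]
```

In this made-up scenario, matching on the handle returns the impostor account, while matching on the ID still returns the account we originally meant to track.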

So now we have to go and look up all of our accounts' _actual_ IDs and replace our list of screen names? That's a huge pain! Wouldn't it be nice if we could just define lists of accounts that we care about directly in the database?

## Profile and tweet sets

That's where profile and tweet sets come in. Every Django Twitter command (where it makes sense) accepts `--add_to_profile_set` and/or `--add_to_tweet_set` options, which take arbitrary labels that get associated with the profiles and/or tweets that the command encounters. This makes it really easy to give a set of profiles or tweets a name, and then you can access that set directly in the database - and better yet, you can also run commands directly on a _set_ of profiles all at once. Let's see how that works.

Let's repeat the process of looping over and loading in our accounts, but this time we're going to add them to a profile set.

In [14]:
for handle in MY_ACCOUNTS:
    call_command(
        "django_twitter_get_profile", handle, add_to_profile_set="my_profile_set"
    )

Collecting profile data for pewresearch
Successfully saved profile data for pewresearch: http://twitter.com/pewresearch
Collecting profile data for pewglobal
Successfully saved profile data for pewglobal: http://twitter.com/pewglobal
Collecting profile data for pewmethods
Successfully saved profile data for pewmethods: http://twitter.com/pewmethods
Collecting profile data for pewjournalism
Successfully saved profile data for pewjournalism: http://twitter.com/pewjournalism
Collecting profile data for facttank
Successfully saved profile data for facttank: http://twitter.com/facttank
Collecting profile data for pewscience
Successfully saved profile data for pewscience: http://twitter.com/pewscience
Collecting profile data for pewreligion
Successfully saved profile data for pewreligion: http://twitter.com/pewreligion
Collecting profile data for pewhispanic
Successfully saved profile data for pewhispanic: http://twitter.com/pewhispanic
Collecting profile data for pewinternet
Successfully sa

Now we can access these accounts through the profile set that we just created:

In [15]:
from testapp.models import TwitterProfileSet

TwitterProfileSet.objects.get(name="my_profile_set").profiles.count()

11

Now, the next time we want to refresh the profile data for these accounts, we can do it all at once by using the `django_twitter_get_profile_set` command, no for-loop necessary - and no need to specify those problematic screen names; Django Twitter will use the correct unique IDs automatically:

In [16]:
call_command(
    "django_twitter_get_profile_set", "my_profile_set", num_cores=2
)

100%|██████████| 11/11 [00:00<00:00, 933.22it/s]

Collecting profile data for 3015897974
Collecting profile data for 111339670
Successfully saved profile data for pewmethods: http://twitter.com/pewmethods
Successfully saved profile data for pewjournalism: http://twitter.com/pewjournalism
Collecting profile data for 1262729180
Collecting profile data for 1265726480

Successfully saved profile data for pewscience: http://twitter.com/pewscience
Collecting profile data for 22642788
Successfully saved profile data for facttank: http://twitter.com/facttank
Collecting profile data for 831470472
Successfully saved profile data for pewglobal: http://twitter.com/pewglobal
Collecting profile data for 36462231
Successfully saved profile data for pewresearch: http://twitter.com/pewresearch
Collecting profile data for 426041590
Successfully saved profile data for pewreligion: http://twitter.com/pewreligion
Collecting profile data for 17071048
Successfully saved profile data for pewhispanic: http://twitter.com/pewhispanic
Collecting profile data for 530977797
Successfully saved profile data for pewinternet: http://twitter.com


And to download the latest tweets for _all_ of these accounts, we can now run the `django_twitter_get_profile_set_tweets` command:

In [18]:
call_command(
    "django_twitter_get_profile_set_tweets",
    "my_profile_set",
    limit=25,
    overwrite=True,
    ignore_backfill=True
)

100%|██████████| 11/11 [00:00<00:00, 10762.15it/s]
Retrieving tweets for user pewmethods: 0it [00:00, ?it/s]

Retrieving tweets for user pewmethods

Retrieving tweets for user pewjournalism: 0it [00:00, ?it/s]

Retrieving tweets for user pewjournalism



Retrieving tweets for user pewjournalism: 24it [00:02,  9.74it/s]


pewjournalism: http://twitter.com/pewjournalism: 25 tweets scanned, 25 updated


Retrieving tweets for user pewscience: 0it [00:00, ?it/s]

Retrieving tweets for user pewscience


Retrieving tweets for user pewmethods: 24it [00:03,  7.48it/s]


pewmethods: http://twitter.com/pewmethods: 25 tweets scanned, 25 updated


Retrieving tweets for user facttank: 0it [00:00, ?it/s]

Retrieving tweets for user facttank


Retrieving tweets for user facttank: 24it [00:02,  9.67it/s]


facttank: http://twitter.com/facttank: 25 tweets scanned, 25 updated


Retrieving tweets for user pewscience: 24it [00:03,  7.03it/s]


pewscience: http://twitter.com/pewscience: 25 tweets scanned, 25 updated


Retrieving tweets for user pewglobal: 0it [00:00, ?it/s]

Retrieving tweets for user pewglobal


Retrieving tweets for user pewresearch: 0it [00:00, ?it/s]

Retrieving tweets for user pewresearch


Retrieving tweets for user pewglobal: 24it [00:02, 11.33it/s]


pewglobal: http://twitter.com/pewglobal: 25 tweets scanned, 25 updated


Retrieving tweets for user pewreligion: 0it [00:00, ?it/s]

Retrieving tweets for user pewreligion


Retrieving tweets for user pewresearch: 24it [00:03,  7.84it/s]


pewresearch: http://twitter.com/pewresearch: 25 tweets scanned, 25 updated


Retrieving tweets for user pewhispanic: 0it [00:00, ?it/s]

Retrieving tweets for user pewhispanic


Retrieving tweets for user pewreligion: 24it [00:01, 12.85it/s]


pewreligion: http://twitter.com/pewreligion: 25 tweets scanned, 25 updated


Retrieving tweets for user pewinternet: 0it [00:00, ?it/s]

Retrieving tweets for user pewinternet


Retrieving tweets for user pewhispanic: 24it [00:02,  8.08it/s]


pewhispanic: http://twitter.com/pewhispanic: 25 tweets scanned, 25 updated


Retrieving tweets for user pvankessel: 0it [00:00, ?it/s]

Retrieving tweets for user pvankessel


Retrieving tweets for user pewinternet: 24it [00:03,  7.79it/s]


pewinternet: http://twitter.com/pewinternet: 25 tweets scanned, 25 updated


Retrieving tweets for user justinbieber: 0it [00:00, ?it/s]

Retrieving tweets for user justinbieber


Retrieving tweets for user pvankessel: 24it [00:03,  7.99it/s]


pvankessel: http://twitter.com/pvankessel: 25 tweets scanned, 25 updated


Retrieving tweets for user justinbieber: 24it [00:03,  7.64it/s]


justinbieber: http://twitter.com/justinbieber: 25 tweets scanned, 25 updated


We can also keep track of all the tweets we collect when we run this command, by passing it a label for a tweet set:

In [19]:
call_command(
    "django_twitter_get_profile_set_tweets",
    "my_profile_set",
    limit=25,
    overwrite=True,
    ignore_backfill=True,
    add_to_tweet_set="my_tweet_set"
)

100%|██████████| 11/11 [00:00<00:00, 6263.55it/s]
Retrieving tweets for user pewmethods: 0it [00:00, ?it/s]

Retrieving tweets for user pewmethods

Retrieving tweets for user pewscience: 0it [00:00, ?it/s]


Retrieving tweets for user pewscience


Retrieving tweets for user pewmethods: 24it [00:02,  9.88it/s]


pewmethods: http://twitter.com/pewmethods: 25 tweets scanned, 25 updated


Retrieving tweets for user pewscience: 24it [00:02,  9.61it/s]


pewscience: http://twitter.com/pewscience: 25 tweets scanned, 25 updated


Retrieving tweets for user pewjournalism: 0it [00:00, ?it/s]

Retrieving tweets for user pewjournalism


Retrieving tweets for user pewglobal: 0it [00:00, ?it/s]

Retrieving tweets for user pewglobal


Retrieving tweets for user pewjournalism: 24it [00:01, 12.63it/s]
Retrieving tweets for user pewglobal: 19it [00:01, 16.42it/s]

pewjournalism: http://twitter.com/pewjournalism: 25 tweets scanned, 25 updated


Retrieving tweets for user pewreligion: 0it [00:00, ?it/s]

Retrieving tweets for user pewreligion


Retrieving tweets for user pewglobal: 24it [00:02, 10.90it/s]


pewglobal: http://twitter.com/pewglobal: 25 tweets scanned, 25 updated


Retrieving tweets for user facttank: 0it [00:00, ?it/s]

Retrieving tweets for user facttank


Retrieving tweets for user pewreligion: 24it [00:01, 13.02it/s]


pewreligion: http://twitter.com/pewreligion: 25 tweets scanned, 25 updated


Retrieving tweets for user pewhispanic: 0it [00:00, ?it/s]

Retrieving tweets for user pewhispanic


Retrieving tweets for user facttank: 24it [00:01, 12.98it/s]


facttank: http://twitter.com/facttank: 25 tweets scanned, 25 updated


Retrieving tweets for user pewresearch: 0it [00:00, ?it/s]

Retrieving tweets for user pewresearch


Retrieving tweets for user pewresearch: 24it [00:02, 10.40it/s]


pewresearch: http://twitter.com/pewresearch: 25 tweets scanned, 25 updated


Retrieving tweets for user pewhispanic: 24it [00:02,  8.82it/s]


pewhispanic: http://twitter.com/pewhispanic: 25 tweets scanned, 25 updated


Retrieving tweets for user pewinternet: 0it [00:00, ?it/s]

Retrieving tweets for user pewinternet


Retrieving tweets for user pvankessel: 0it [00:00, ?it/s]

Retrieving tweets for user pvankessel


Retrieving tweets for user pewinternet: 24it [00:01, 13.20it/s]


pewinternet: http://twitter.com/pewinternet: 25 tweets scanned, 25 updated


Retrieving tweets for user pvankessel: 24it [00:02, 11.98it/s]


pvankessel: http://twitter.com/pvankessel: 25 tweets scanned, 25 updated


Retrieving tweets for user justinbieber: 0it [00:00, ?it/s]

Retrieving tweets for user justinbieber


Retrieving tweets for user justinbieber: 24it [00:02, 11.07it/s]


justinbieber: http://twitter.com/justinbieber: 25 tweets scanned, 25 updated


In [20]:
from testapp.models import TweetSet
TweetSet.objects.get(name="my_tweet_set").tweets.count()

275

And we could even add those profiles to an entirely new profile set, to keep track of data collection, for example.

In [21]:
call_command(
    "django_twitter_get_profile_set_tweets",
    "my_profile_set",
    limit=25,
    overwrite=True,
    ignore_backfill=True,
    add_to_tweet_set="my_tweet_set",
    add_to_profile_set="my_second_profile_set",
)

100%|██████████| 11/11 [00:00<00:00, 13921.95it/s]
Retrieving tweets for user pewmethods: 0it [00:00, ?it/s]

Retrieving tweets for user pewmethods

Retrieving tweets for user pewscience: 0it [00:00, ?it/s]


Retrieving tweets for user pewscience


Retrieving tweets for user pewmethods: 24it [00:02, 10.14it/s]
Retrieving tweets for user pewscience: 23it [00:02, 14.50it/s]

pewmethods: http://twitter.com/pewmethods: 25 tweets scanned, 25 updated


Retrieving tweets for user pewscience: 24it [00:02,  9.60it/s]


pewscience: http://twitter.com/pewscience: 25 tweets scanned, 25 updated


Retrieving tweets for user pewreligion: 0it [00:00, ?it/s]

Retrieving tweets for user pewreligion


Retrieving tweets for user pewjournalism: 0it [00:00, ?it/s]

Retrieving tweets for user pewjournalism


Retrieving tweets for user pewreligion: 24it [00:02, 11.47it/s]


pewreligion: http://twitter.com/pewreligion: 25 tweets scanned, 25 updated


Retrieving tweets for user pewjournalism: 24it [00:02, 11.33it/s]


pewjournalism: http://twitter.com/pewjournalism: 25 tweets scanned, 25 updated


Retrieving tweets for user pewglobal: 0it [00:00, ?it/s]

Retrieving tweets for user pewglobal


Retrieving tweets for user facttank: 0it [00:00, ?it/s]

Retrieving tweets for user facttank


Retrieving tweets for user facttank: 24it [00:01, 13.36it/s]


facttank: http://twitter.com/facttank: 25 tweets scanned, 25 updated


Retrieving tweets for user pewglobal: 24it [00:02, 11.85it/s]


pewglobal: http://twitter.com/pewglobal: 25 tweets scanned, 25 updated


Retrieving tweets for user pewhispanic: 0it [00:00, ?it/s]

Retrieving tweets for user pewhispanic


Retrieving tweets for user pewresearch: 0it [00:00, ?it/s]

Retrieving tweets for user pewresearch


Retrieving tweets for user pewresearch: 24it [00:03,  7.05it/s]
Retrieving tweets for user pewhispanic: 21it [00:03,  5.43it/s]

pewresearch: http://twitter.com/pewresearch: 25 tweets scanned, 25 updated


Retrieving tweets for user pewinternet: 0it [00:00, ?it/s]

Retrieving tweets for user pewinternet


Retrieving tweets for user pewhispanic: 24it [00:04,  5.44it/s]


pewhispanic: http://twitter.com/pewhispanic: 25 tweets scanned, 25 updated


Retrieving tweets for user pvankessel: 0it [00:00, ?it/s]

Retrieving tweets for user pvankessel


Retrieving tweets for user pewinternet: 24it [00:03,  6.56it/s]


pewinternet: http://twitter.com/pewinternet: 25 tweets scanned, 25 updated


Retrieving tweets for user justinbieber: 0it [00:00, ?it/s]

Retrieving tweets for user justinbieber


Retrieving tweets for user pvankessel: 24it [00:03,  6.99it/s]


pvankessel: http://twitter.com/pvankessel: 25 tweets scanned, 25 updated


Retrieving tweets for user justinbieber: 24it [00:02, 11.25it/s]


justinbieber: http://twitter.com/justinbieber: 25 tweets scanned, 25 updated


In [22]:
TwitterProfileSet.objects.get(name="my_second_profile_set").profiles.count()

11

## Followers and followings lists

So that's how to collect profile and tweet data, but you also might be interested in tracking the followers or friends (we call them "followings") for particular accounts. For really popular accounts, not only can it take a super long time to collect all of their followers from the API, their followers can also change substantially over time. To that end, Django Twitter stores lists of followers/followings in a dedicated table, tracking the start and finish time of the data collection, and storing each list separately every time you collect it. Let's see how that works.

In [23]:
call_command("django_twitter_get_profile_followers", "pewresearch", limit=25)

Retrieving followers for user pewresearch: 25it [00:00, 37.05it/s]


We now have a `TwitterFollowerList` object attached to our profile, and if we take a look at its values in the table, we can see that it logged its start and finish time (although we forced a limit of 25, so that's a little misleading!):

In [24]:
pew = TwitterProfile.objects.get(screen_name="pewresearch")
pew.follower_lists.all()

<QuerySet [<TwitterFollowerList: TwitterFollowerList object (1)>]>

In [25]:
pew.follower_lists.values()

<QuerySet [{'id': 1, 'start_time': datetime.datetime(2021, 6, 16, 12, 18, 5, 145019), 'finish_time': datetime.datetime(2021, 6, 16, 12, 18, 5, 824802), 'profile_id': 1}]>

We can also use a shortcut function on `TwitterProfile` objects to grab the most recent list:

In [26]:
pew.current_follower_list()

<TwitterFollowerList: TwitterFollowerList object (1)>

And we can jump straight to the profile objects in that list using another shortcut function:

In [27]:
pew.current_followers()

<QuerySet [<TwitterProfile: 1201619184083439618>, <TwitterProfile: 1346956009768583168>, <TwitterProfile: 52478741>, <TwitterProfile: 30280532>, <TwitterProfile: 1259515289437261824>, <TwitterProfile: 1595924352>, <TwitterProfile: 1267354047217987586>, <TwitterProfile: 1392486817048432642>, <TwitterProfile: 1605273162>, <TwitterProfile: 205131484>, <TwitterProfile: 1143628075214786574>, <TwitterProfile: 1176601698481201152>, <TwitterProfile: 2374126231>, <TwitterProfile: 795357584>, <TwitterProfile: 73904329>, <TwitterProfile: 1371564590962724865>, <TwitterProfile: 1377083428407889924>, <TwitterProfile: 2334602137>, <TwitterProfile: 18156430>, <TwitterProfile: 11037862>, '...(remaining elements truncated)...']>

In [28]:
pew.current_followers().count()

25

But as you can see, we pretty much just have a list of Twitter IDs:

In [29]:
pew.current_followers().values()[0]

{'id': 71,
 'twitter_id': '1201619184083439618',
 'last_update_time': datetime.datetime(2021, 6, 16, 12, 18, 5, 428482),
 'historical': False,
 'tweet_backfilled': False,
 'screen_name': None,
 'created_at': None,
 'twitter_error_code': None,
 'person_id': None,
 'organization_id': None,
 'most_recent_snapshot_id': None}

If we want to ask Twitter to actually provide us with profile info for each follower, we have to specifically request it - because it eats up a LOT more API quota. To request this data, you just need to pass `--hydrate`:

In [30]:
call_command("django_twitter_get_profile_followers", "pewresearch", limit=25, hydrate=True)

Retrieving followers for user pewresearch: 25it [00:01, 12.82it/s]


And now we have actual data, including screen names and profile snapshots:

In [31]:
pew.current_followers().values()[0]

{'id': 71,
 'twitter_id': '1201619184083439618',
 'last_update_time': datetime.datetime(2021, 6, 16, 12, 18, 7, 59659),
 'historical': False,
 'tweet_backfilled': False,
 'screen_name': 'thanhds1',
 'created_at': datetime.datetime(2019, 12, 2, 15, 48, 50),
 'twitter_error_code': None,
 'person_id': None,
 'organization_id': None,
 'most_recent_snapshot_id': 1160}

So we can do fancy things, like seeing how many of @pewresearch's followers have at least 10 followers themselves:

In [32]:
pew.current_followers().filter(most_recent_snapshot__followers_count__gte=10).count()

18

Followings work the exact same way - just replace the word "follower" with "following".
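Because every collection run is stored as its own list, two runs can be compared to see who arrived and who left. The sketch below works on plain lists of Twitter IDs rather than on the Django models; `compare_follower_lists` is a hypothetical helper, and the split of these IDs (taken from the output above) into an "earlier" and a "later" list is invented for illustration.

```python
def compare_follower_lists(earlier_ids, later_ids):
    """Compare two collections of follower IDs gathered at different times."""
    earlier, later = set(earlier_ids), set(later_ids)
    return {
        "gained": sorted(later - earlier),  # only in the later list
        "lost": sorted(earlier - later),    # only in the earlier list
        "kept": sorted(earlier & later),    # present in both lists
    }


# Hypothetical example: pretend these came from two separate runs.
earlier_run = ["52478741", "30280532", "18156430"]
later_run = ["30280532", "18156430", "11037862"]
changes = compare_follower_lists(earlier_run, later_run)
```

Keep in mind that a comparison like this is only meaningful when both lists were collected in full; lists cut off with `limit` (as in our examples) will show spurious gains and losses.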

## Data auditing utilities (looking for account and coverage errors)

When you're working with social media data, there can be a lot of moving parts, and occasionally bad data can slip into your database. Maybe a handle that you got from a third-party list was outdated, or someone gave you a fake username that turned out to be a spam bot, or someone that you've been tracking deleted their profile and it immediately got picked up by a spam bot. There are ways to minimize the risk of all of this happening, but there's no substitute for doing manual spot-checks! Fortunately, Django Twitter offers some utility functions to help you check for weird accounts.

In `django_twitter.utils` there are two functions that take a set of profiles and compute their average text similarity to each other by looking at a sample of their recent tweets (`identify_unusual_profiles_by_tweet_text`) or their profile descriptions (`identify_unusual_profiles_by_descriptions`). Usually we're interested in tracking accounts that have something in common - politicians, news organizations, celebrities and other public figures. In some cases, it's reasonable to assume that the accounts in our collection will tweet similar content - or at least, their tweets will be more similar to each other than the tweets produced by a spam bot.

In our example, it turns out that Justin Bieber's tweets are so reliably different from the content produced by the Pew Research accounts that we actually use him in our unit tests. (This - and _not_ the fact that he's the greatest musician of all time - is the reason that he's in our example!)

In [33]:
profiles = TwitterProfileSet.objects.get(name="my_profile_set").profiles.all()

In [34]:
from django_twitter.utils import identify_unusual_profiles_by_tweet_text
most_similar, most_unique = identify_unusual_profiles_by_tweet_text(profiles)
most_unique

Gathering tweet text: 100%|██████████| 11/11 [00:00<00:00, 78.98it/s]


| | twitter_id | tweet_text | avg_cosine |
|---|---|---|---|
| 4 | 27260086 | RT @MIAFestival: LINEUP ALERT!\nJustin Bieber,... | 0.50205 |


In [35]:
from django_twitter.utils import identify_unusual_profiles_by_descriptions
most_similar, most_unique = identify_unusual_profiles_by_descriptions(profiles)
most_unique

Unnamed: 0,twitter_id,snapshots__description,avg_cosine
5,27260086,JUSTICE the album out now,0.163522


Even if you have a perfect account roster with no accidental Biebers, different Twitter accounts post at different rates, and Twitter only provides each account's ~3200 most recent tweets. If you're interested in doing any sort of historical analysis on any period prior to when you began regular data collection, you're going to need to assess how far back the backfill process got you. You may have years' worth of tweets for some accounts, but only weeks for others.

Django Twitter provides two functions to assess tweet coverage over time for a set of profiles you're interested in. The `get_monthly_twitter_activity` function produces a spreadsheet where every row is an account and every column is a month, across whatever timeframe you request. The cells contain how many tweets exist in the database for each profile/month, and if you load this spreadsheet into Excel and do some conditional formatting to highlight empty cells, it makes it super easy to tell how far back you can reasonably analyze data without losing a ton of coverage.
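
If you would rather check coverage in code than in Excel, the sketch below shows the underlying idea in plain Python: count tweets per profile per month, then walk backwards from the most recent month to find how far back every profile has at least one tweet in every month. The input format (a list of `(profile, date)` pairs) and the helper names are hypothetical, and the sketch does not operate on the dataframe returned by `get_monthly_twitter_activity`.

```python
import datetime
from collections import Counter


def month_range(start, end):
    """Yield (year, month) tuples from start to end, inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield (year, month)
        month += 1
        if month > 12:
            year, month = year + 1, 1


def monthly_counts(tweets):
    """Count tweets per (profile, (year, month)) from (profile, date) pairs."""
    return Counter((profile, (day.year, day.month)) for profile, day in tweets)


def first_fully_covered_month(tweets, profiles, start, end):
    """Return the earliest month from which every profile has tweets in
    every month through `end`, or None if the final month already has a gap."""
    counts = monthly_counts(tweets)
    covered_from = None
    for month in reversed(list(month_range(start, end))):
        if all(counts[(profile, month)] > 0 for profile in profiles):
            covered_from = month
        else:
            break
    return covered_from


# Hypothetical toy data: account_b only has tweets from May onwards
toy_tweets = [
    ("account_a", datetime.date(2021, 3, 5)),
    ("account_a", datetime.date(2021, 4, 10)),
    ("account_a", datetime.date(2021, 5, 2)),
    ("account_a", datetime.date(2021, 6, 1)),
    ("account_b", datetime.date(2021, 5, 20)),
    ("account_b", datetime.date(2021, 6, 3)),
]
covered_from = first_fully_covered_month(
    toy_tweets,
    ["account_a", "account_b"],
    datetime.date(2021, 1, 1),
    datetime.date(2021, 6, 30),
)
print(covered_from)
```

Requiring at least one tweet per month is a crude threshold; for accounts that post rarely, a genuinely quiet month will look the same as a collection gap, so manual review is still worthwhile.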

In [36]:
import datetime
from django_twitter.utils import get_monthly_twitter_activity
results = get_monthly_twitter_activity(
    profiles,
    datetime.date(2018, 1, 1),
    max_date=datetime.datetime.now().date() + datetime.timedelta(days=1),
)

In [37]:
results

Unnamed: 0,2020_12,2021_1,2021_2,2021_3,2021_4,2021_5,2021_6,pk,screen_name,created_at,name
5.0,0.0,1.0,0.0,0.0,1.0,3.0,29.0,1.0,pewresearch,2009-03-03 10:39:39,Pew Research Center
8.0,0.0,0.0,0.0,0.0,0.0,0.0,26.0,2.0,pewglobal,2012-09-18 12:08:41,Pew Research Global
1.0,0.0,0.0,0.0,0.0,0.0,0.0,25.0,3.0,pewmethods,2015-02-09 16:00:41,Pew Research Methods
7.0,0.0,0.0,0.0,0.0,0.0,0.0,25.0,4.0,pewjournalism,2010-02-04 09:42:57,Pew Research Journalism
9.0,0.0,0.0,0.0,0.0,2.0,27.0,2.0,5.0,facttank,2013-03-13 18:41:33,Pew Research Fact Tank
2.0,0.0,0.0,0.0,0.0,0.0,0.0,25.0,6.0,pewscience,2013-03-12 14:42:00,Pew Research Science
6.0,0.0,0.0,0.0,0.0,0.0,3.0,22.0,7.0,pewreligion,2009-04-29 15:03:06,Pew Research Religion
10.0,0.0,0.0,0.0,0.0,5.0,14.0,6.0,8.0,pewhispanic,2011-12-01 13:26:52,PewResearch Hispanic
0.0,0.0,0.0,0.0,0.0,0.0,0.0,25.0,9.0,pewinternet,2008-10-30 13:40:17,Pew Research Internet
3.0,1.0,15.0,0.0,8.0,1.0,0.0,0.0,10.0,pvankessel,2012-03-19 22:58:08,Patrick van Kessel


The second function, `find_missing_date_ranges`, scans a time period and returns a dataframe of all periods of at least N consecutive days where a profile doesn't have any tweets in the database. This is useful for tracking down anomalies that may have been caused by data collection errors, temporarily suspended accounts, and so on.
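
The gap-scanning idea can be sketched for a single profile in a few lines of plain Python. This is only an illustration of the technique, not the library's code: the function name `missing_date_ranges` and the toy dates are hypothetical, and the sketch counts both endpoints of each gap, which may not match how the library computes its `range` column.

```python
import datetime


def missing_date_ranges(tweet_dates, start, end, min_consecutive_missing_dates=5):
    """Return (gap_start, gap_end, n_days) tuples for runs of days without tweets."""
    days_with_tweets = set(tweet_dates)
    one_day = datetime.timedelta(days=1)
    gaps = []
    gap_start = None
    day = start
    while day <= end:
        if day not in days_with_tweets:
            # Open a new gap if we are not already inside one
            if gap_start is None:
                gap_start = day
        elif gap_start is not None:
            # A day with tweets closes the current gap
            gaps.append((gap_start, day - one_day))
            gap_start = None
        day += one_day
    if gap_start is not None:
        # The scan ended while still inside a gap
        gaps.append((gap_start, end))
    return [
        (first, last, (last - first).days + 1)
        for first, last in gaps
        if (last - first).days + 1 >= min_consecutive_missing_dates
    ]


# Hypothetical toy data: tweets on four days in early January
toy_dates = [
    datetime.date(2021, 1, 1),
    datetime.date(2021, 1, 2),
    datetime.date(2021, 1, 10),
    datetime.date(2021, 1, 12),
]
gaps = missing_date_ranges(
    toy_dates, datetime.date(2021, 1, 1), datetime.date(2021, 1, 15)
)
print(gaps)
```

With the default threshold of five days, only the January 3 to January 9 gap is reported; the one-day and three-day gaps later in the month are filtered out.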

In [38]:
from django_twitter.utils import find_missing_date_ranges
results = find_missing_date_ranges(
    profiles,
    datetime.date(2021, 1, 1),
    max_date=datetime.datetime.now().date() + datetime.timedelta(days=1),
    min_consecutive_missing_dates=5,
)

Scanning profiles for missing dates: 100%|██████████| 11/11 [00:00<00:00, 67.66it/s]


In [39]:
results

Unnamed: 0,twitter_id,start_date,end_date,range
19,111339670,2021-01-01,2021-06-13,163
0,17071048,2021-01-01,2021-06-03,153
2,1262729180,2021-01-01,2021-06-03,153
1,3015897974,2021-01-01,2021-06-02,152
20,831470472,2021-01-01,2021-06-01,151
18,36462231,2021-01-01,2021-05-27,146
22,1265726480,2021-01-01,2021-04-27,116
26,426041590,2021-01-01,2021-04-22,111
14,22642788,2021-01-16,2021-04-30,104
8,27260086,2021-01-01,2021-04-13,102


## Extracting/exporting data

Finally, let's take a look at exporting our data. Often we want a giant spreadsheet of tweets, or a spreadsheet of a profile's data (like follower counts) over time. The `get_tweet_dataframe` function gives you the former, and the `get_twitter_profile_dataframe` function gives you the latter. Presumably we've inspected our tweet coverage using the functions above and have determined that we don't have any tweet coverage issues, but when it comes to profile data, it's possible that we haven't been collecting that as regularly, or we may have some gaps in our timeseries. To help with this, the `get_twitter_profile_dataframe` function uses linear interpolation (for numerical values) and front-filling (for fixed attributes like descriptions) to fill in gaps where it can and provide you with a complete day-by-day profile dataframe that can be merged in with tweets on the days they were created.
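
To illustrate what linear interpolation and front-filling mean here, the sketch below builds a day-by-day series from two sparse snapshots using only the standard library. It is a simplified stand-in for the idea, not the code that `get_twitter_profile_dataframe` runs; the `fill_daily` helper, the snapshot format, and the toy values are all hypothetical.

```python
import datetime


def fill_daily(snapshots, start, end):
    """Build one (day, followers, description) row per day from start to end.

    `snapshots` maps a date to a (followers_count, description) tuple.
    Counts are linearly interpolated between the surrounding snapshots;
    descriptions are carried forward from the most recent snapshot.
    Days before the first or after the last snapshot get None for the count.
    """
    known = sorted(snapshots)
    rows = []
    day = start
    while day <= end:
        before = [d for d in known if d <= day]
        after = [d for d in known if d >= day]
        # Front-fill: reuse the description from the latest earlier snapshot
        description = snapshots[before[-1]][1] if before else None
        if before and after:
            left, right = before[-1], after[0]
            if left == right:
                followers = float(snapshots[left][0])
            else:
                # Linear interpolation between the two nearest snapshots
                fraction = (day - left).days / (right - left).days
                left_count, right_count = snapshots[left][0], snapshots[right][0]
                followers = left_count + fraction * (right_count - left_count)
        else:
            followers = None
        rows.append((day, followers, description))
        day += datetime.timedelta(days=1)
    return rows


# Hypothetical toy data: two snapshots four days apart
toy_snapshots = {
    datetime.date(2021, 1, 1): (100, "old bio"),
    datetime.date(2021, 1, 5): (180, "new bio"),
}
rows = fill_daily(toy_snapshots, datetime.date(2021, 1, 1), datetime.date(2021, 1, 6))
for row in rows:
    print(row)
```

In this sketch the follower count climbs by 20 per day between the two snapshots, the description stays at its last known value until a newer snapshot replaces it, and the count is left empty after the final snapshot because there is nothing to interpolate towards.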

Since we just started collecting data, we only have profile snapshots for today. So to illustrate how the interpolation works, we're going to create a fake snapshot on that historic and fateful day when Justin Bieber first joined Twitter in 2009, and we're just going to approximate his followers by assuming that they've increased at a steady linear rate. (This is obviously a poor assumption, but it works really well for shorter periods, which is all you should have to fill in if you've been collecting data at least somewhat regularly and aren't making fake decade-old datapoints like I am.)

In [40]:
from testapp.models import TwitterProfile, TwitterProfileSnapshot
justin = TwitterProfile.objects.get(twitter_id="27260086")
fake_snapshot = TwitterProfileSnapshot.objects.create(
    profile=justin,
    screen_name=justin.most_recent_snapshot.screen_name,
    followers_count=0,
    favorites_count=0,
    followings_count=0,
    statuses_count=0
)
fake_snapshot.timestamp = justin.created_at
fake_snapshot.save()

Now let's get our dataframe:

In [41]:
from django_twitter.utils import get_twitter_profile_dataframe
df = get_twitter_profile_dataframe(
    profiles, datetime.datetime(2021, 1, 1), datetime.datetime.now(), skip_interpolation=False
)
df[df['twitter_id']=="27260086"].dropna(subset=['followers_count'])

Extracting Twitter profile snapshots: 100%|██████████| 11/11 [00:01<00:00,  8.11it/s]


Unnamed: 0,date,description,followers_count,favorites_count,followings_count,listed_count,statuses_count,name,screen_name,status,is_verified,is_protected,location,created_at,twitter_error_code,twitter_id,pk
4297,2021-01-01,,1.096274e+08,4415.42505,278229.546493,,30208.978714,,justinbieber,,,,,2009-03-28 11:41:22,,27260086,11
4298,2021-01-02,,1.096529e+08,4416.45261,278294.296213,,30216.008963,,justinbieber,,,,,2009-03-28 11:41:22,,27260086,11
4299,2021-01-03,,1.096785e+08,4417.48017,278359.045933,,30223.039211,,justinbieber,,,,,2009-03-28 11:41:22,,27260086,11
4300,2021-01-04,,1.097040e+08,4418.50773,278423.795653,,30230.069460,,justinbieber,,,,,2009-03-28 11:41:22,,27260086,11
4301,2021-01-05,,1.097295e+08,4419.53529,278488.545373,,30237.099709,,justinbieber,,,,,2009-03-28 11:41:22,,27260086,11
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4459,2021-06-12,,1.137605e+08,4581.88976,288719.001120,,31347.879005,,justinbieber,,,,,2009-03-28 11:41:22,,27260086,11
4460,2021-06-13,,1.137860e+08,4582.91732,288783.750840,,31354.909254,,justinbieber,,,,,2009-03-28 11:41:22,,27260086,11
4461,2021-06-14,,1.138115e+08,4583.94488,288848.500560,,31361.939503,,justinbieber,,,,,2009-03-28 11:41:22,,27260086,11
4462,2021-06-15,,1.138370e+08,4584.97244,288913.250280,,31368.969751,,justinbieber,,,,,2009-03-28 11:41:22,,27260086,11


If we want tweets, it's a very similar process:

In [42]:
from django_twitter.utils import get_tweet_dataframe
get_tweet_dataframe(
    profiles, datetime.datetime(2021, 1, 1), datetime.datetime.now()
)

Unnamed: 0,pk,twitter_id,last_update_time,historical,created_at,text,retweet_count,favorite_count,profile,retweeted_status,in_reply_to_status,quoted_status,date
0,28,1404865073684828161,2021-06-16 12:17:57.082522-04:00,False,2021-06-15 13:15:09-04:00,@Katrina_HRM You might also enjoy our short em...,1,2,22642788,,1404850695715659777,,2021-06-15
1,39,1405210870247051270,2021-06-16 12:17:50.759735-04:00,False,2021-06-16 12:09:13-04:00,"On average, posts with neither a positive nor ...",1,0,111339670,,1405210867109699591,,2021-06-16
2,40,1405210867109699591,2021-06-16 12:17:50.826009-04:00,False,2021-06-16 12:09:12-04:00,"In early March 2021, Facebook posts with a pos...",1,0,111339670,,1405210863901057034,,2021-06-16
3,56,1402689641619333126,2021-06-16 12:17:48.123364-04:00,False,2021-06-09 13:10:45-04:00,5/5 Learn more about our American Trends Panel...,0,0,3015897974,,1402689567317270538,,2021-06-09
4,57,1402689567317270538,2021-06-16 12:17:48.167531-04:00,False,2021-06-09 13:10:28-04:00,"4/ A new @pewresearch methodological report, b...",0,1,3015897974,,1402689503475769344,,2021-06-09
...,...,...,...,...,...,...,...,...,...,...,...,...,...
285,21,1404940033216466950,2021-06-16 12:17:56.252245-04:00,False,2021-06-15 18:13:01-04:00,"RT @kat_devlin: Views of NATO, 2009-21 https:/...",5,0,22642788,1404807694696030215,,,2021-06-15
286,153,1404822466858471428,2021-06-16 12:17:53.218184-04:00,False,2021-06-15 10:25:51-04:00,"RT @kat_devlin: Views of NATO, 2009-21 https:/...",5,0,831470472,1404807694696030215,,,2021-06-15
287,1,1405210666189852674,2021-06-16 12:17:55.405359-04:00,False,2021-06-16 12:08:25-04:00,RT @_StephKramer: 1. New analysis: Americans l...,4,0,22642788,1405208831538438145,,,2021-06-16
288,174,1405176896237809667,2021-06-16 12:17:50.454931-04:00,False,2021-06-16 09:54:13-04:00,RT @pewresearch: Americans with any religious ...,4,0,36462231,1405156974631829504,,,2021-06-16


