First, we load all Python modules required for this notebook.

In [1]:
from collections import Counter
import configparser
from datetime import datetime
import gzip
import json

import numpy as np
import pandas as pd

# Building a random sample of users using OSoMe

After obtaining the circadian rhythms of the \`\`Depressed'' cohort, we want to compare these rhythms to those of a general selection of Twitter users. This notebook describes how we build that random sample of Twitter users, which we refer to as the \`\`Random'' cohort.

## Sample design

Social media platforms typically evolve over time, and so do the people who use them. Therefore, we build our \`\`Random'' cohort of Twitter users to match characteristics of the \`\`Depressed'' cohort, as follows.

1. We determine the per-month distribution of the profile creation dates of the \`\`Depressed'' cohort.
2. We sample users at random from OSoMe to create a user pool that could be used in our \`\`Random'' cohort.
3. We randomly choose Twitter users from the OSoMe user pool in such a way that our \`\`Random'' cohort has the same distribution of profile creation dates as the \`\`Depressed'' cohort.

### 1. Depression users distribution

First, we collect the user IDs of all users in the \`\`Depressed'' cohort.

In [2]:
label_fn = "data/diagnosis_labelling_depression_final.tsv"
labels = pd.read_csv(label_fn, sep="\t", index_col="tweet_id")

dep_users = labels[labels.diag_final == "1"].user_id.unique()

with open("data/depression_user_to_tz.json") as jf:
    tz_dict = json.load(jf)

tz_dep_users = np.intersect1d(dep_users, list(tz_dict.keys()))

        
print("In total we have", dep_users.size, "Twitter users in the ``Depressed'' cohort, out of which",
      tz_dep_users.size, "have time zone information.")

In total we have 1211 Twitter users in the ``Depressed'' cohort, out of which 691 have time zone information.


Finally, we retrieve the profile creation dates of these users through the Twitter API. For some users the creation date could not be retrieved; the table below lists the error messages returned for them.

| Error Code | Error Message            | Number of occurrences |
|------------|--------------------------|-----------------------|
| 50         | User not found.          | 43                    |
| 63         | User has been suspended. | 13                    |

In [3]:
# .squeeze("columns") replaces the squeeze=True keyword, which pandas 2.0 removed.
to_sample_distribution = pd.read_csv(
    "data/to_sample_distribution.tsv", sep="\t", header=None, index_col=0
).squeeze("columns")
to_sample_distribution.index.rename("per_month", inplace=True)
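
The distribution above is read from a precomputed file. As a rough sketch of how such a per-month count could be derived from raw profile creation dates, assuming the dates use the Twitter `created_at` string format that is parsed later in this notebook (the helper `per_month_counts` and the example dates are hypothetical):

```python
from collections import Counter
from datetime import datetime

# Format of the `created_at` field, as parsed elsewhere in this notebook.
TWITTER_DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def per_month_counts(created_at_strings):
    """Count profiles per creation month, keyed as 'YYYY_MM'."""
    months = (
        datetime.strptime(s, TWITTER_DATE_FORMAT).strftime("%Y_%m")
        for s in created_at_strings
    )
    return Counter(months)


# Hypothetical creation dates, for illustration only.
example_dates = [
    "Mon Mar 02 10:15:00 +0000 2009",
    "Tue Mar 17 08:00:00 +0000 2009",
    "Sat Jul 04 21:30:00 +0000 2015",
]
counts = per_month_counts(example_dates)
```

The `YYYY_MM` keys match the `per_month` labels used for the sampled users further down, so the two distributions can be aligned on the same index.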

### 2. Creating a pool of users with OSoMe

Using the random sample option of OSoMe, we obtained three one-week windows of tweet data:
   1. September 1st 2017 00:00 UTC to September 8th 2017 00:00 UTC
   2. March 1st 2018 00:00 UTC to March 8th 2018 00:00 UTC
   3. September 1st 2018 00:00 UTC to September 8th 2018 00:00 UTC
   
From these tweets, we build a dataframe that lists the profile creation dates of the Twitter users in our sample. We exclude all users that are already in the \`\`Depressed'' cohort.

In [7]:
filenames = ["2017_09", "2018_03", "2018_09"]
users = {}
locations = {}
all_users = set()
users_with_loc_not_D = set()

for fn in filenames:
    with gzip.open("data/do_not_share/user_sample_"+fn+".gz") as doc:
        for l in doc.readlines():
            data = json.loads(l.decode("utf-8").strip("\n"))
            all_users.add(data["user"]["id_str"])
            if int(data["user"]["id_str"]) not in dep_users:
                users[data["user"]["id_str"]] = datetime.strptime(data["user"]["created_at"], "%a %b %d %H:%M:%S %z %Y")
                if data["user"].get("location"):
                    users_with_loc_not_D.add(data["user"]["id_str"])
                    locations[data["user"]["id_str"]] = data["user"]["location"]
                    
print("In total, we obtain", len(all_users), "users, out of which", len(users_with_loc_not_D), "provided a location and are not in our ``Depressed'' cohort.")

In total, we obtain 588356 users, out of which 387509 provided a location and are not in our ``Depressed'' cohort.
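
The cell above reads each file fully into memory with `readlines()`. A minimal sketch of a streaming alternative follows, assuming the same one-JSON-object-per-line layout; the generator `iter_users` and the two example tweets are hypothetical, and the data is compressed in memory only to keep the example self-contained:

```python
import gzip
import io
import json


def iter_users(fileobj):
    """Yield the `user` object of each JSON line in a gzip-compressed stream."""
    # Text mode ("rt") decodes each line, so no manual .decode() is needed.
    with gzip.open(fileobj, mode="rt", encoding="utf-8") as doc:
        for line in doc:
            line = line.strip()
            if line:
                yield json.loads(line)["user"]


# Hypothetical two-tweet sample, for illustration only.
raw = "\n".join(
    json.dumps(tweet)
    for tweet in [
        {"user": {"id_str": "1", "created_at": "Mon Mar 02 10:15:00 +0000 2009", "location": "Utrecht"}},
        {"user": {"id_str": "2", "created_at": "Sat Jul 04 21:30:00 +0000 2015", "location": ""}},
    ]
)
buffer = io.BytesIO(gzip.compress(raw.encode("utf-8")))

# Keep only users that filled in a location, as the notebook does.
ids_with_location = [u["id_str"] for u in iter_users(buffer) if u.get("location")]
```

Iterating over the file object line by line keeps memory use roughly constant regardless of file size.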


In [8]:
sample_dist = pd.DataFrame(index=users.keys())
sample_dist["created_at"] = pd.Series(data=users)
sample_dist["locations"] = pd.Series(data=locations)
sample_dist["per_month"] = sample_dist["created_at"].apply(lambda x: x.strftime("%Y_%m"))

We only want to sample users for which we can obtain time zone information from the dictionary we built for the depression timelines. Therefore, we load that dictionary and look up the time zone for the location of each user in the sample. The IDs of all users that have time zone information are stored in the array `with_tz_info`.

In [9]:
with open("data/tz_user_loc.json", encoding="ISO 8859-1") as doc:
    loc_to_tz = json.loads(doc.read())

def get_tz(x):
    if loc_to_tz.get(x):
        return loc_to_tz[x]
    
sample_dist["tz_info"] = sample_dist["locations"].apply(get_tz)
with_tz_info = sample_dist["tz_info"].dropna().index.values

print("In total, we obtain timezone information for", with_tz_info.size, "users.")

In total, we obtain timezone information for 71277 users.
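
As a small illustration of the lookup step, the sketch below uses `Series.map` with a dictionary, which leaves unmatched locations as missing values. The location strings and time zone names are hypothetical; the actual contents of `tz_user_loc.json` are not shown in this notebook. Unlike `get_tz`, this variant would keep an empty time zone value rather than dropping it.

```python
import pandas as pd

# Hypothetical location-to-time-zone dictionary, for illustration only.
loc_to_tz_example = {
    "Amsterdam": "Europe/Amsterdam",
    "Bloomington, IN": "America/Indiana/Indianapolis",
}

# Hypothetical user locations, indexed by user ID.
locations_example = pd.Series(
    {"11": "Amsterdam", "12": "somewhere", "13": None, "14": "Bloomington, IN"}
)

# Locations that are absent from the dictionary map to NaN.
tz_info_example = locations_example.map(loc_to_tz_example)

# Keep the IDs of users for which a time zone was found.
with_tz_example = tz_info_example.dropna().index.values
```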


We then calculate the profile creation date distribution for all users for which we can obtain a timezone.

In [10]:
sample_counts_per_month = sample_dist.loc[with_tz_info, :].groupby(["per_month"]).count()
amounts_in_sample = sample_counts_per_month.loc[to_sample_distribution.index, "created_at"]

Based on `to_sample_distribution`, we find the largest integer multiple `N` of that distribution that the pool can supply for every month. Sampling `N` times the \`\`Depressed'' count from each month gives the largest possible sample whose creation-month distribution matches that of the \`\`Depressed'' cohort. The per-month amounts to draw from `sample_dist` are stored in `to_sample_amounts`.

In [11]:
max_multiple = amounts_in_sample.sum() / to_sample_distribution.sum()
N = np.where([np.all(amounts_in_sample >= x * to_sample_distribution) for x in np.arange(max_multiple)])[0].max()

to_sample_amounts = N * to_sample_distribution
print("We sample", to_sample_amounts.sum(), "Twitter users as our ``Random'' cohort, out of a total of", amounts_in_sample.sum(), "Twitter users.")

We sample 9525 Twitter users as our ``Random'' cohort, out of a total of 70155 Twitter users.
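
To make the choice of `N` concrete, the sketch below repeats the search from the cell above on hypothetical toy counts and compares it with a closed form: the smallest per-month ratio of available to required users, rounded down. The closed form assumes every target count is positive. The two can differ in the boundary case where `max_multiple` is a whole number, because `np.arange(max_multiple)` then stops one short of it.

```python
import numpy as np
import pandas as pd

# Hypothetical per-month counts, for illustration only.
target = pd.Series({"2010_01": 2, "2010_02": 1, "2010_03": 3})      # reference cohort
available = pd.Series({"2010_01": 9, "2010_02": 7, "2010_03": 10})  # user pool

# Search over candidate multiples, as in the notebook cell.
max_multiple = available.sum() / target.sum()
n_search = np.where(
    [np.all(available >= x * target) for x in np.arange(max_multiple)]
)[0].max()

# Closed form: the month with the smallest available/target ratio is the bottleneck.
n_closed = int((available // target).min())

# Per-month amounts to draw from the pool.
amounts = n_closed * target
```

Here the month `2010_03` is the bottleneck: the pool holds 10 users, enough for three multiples of 3 but not four.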


Based on these numbers, we sample from `sample_dist` month by month, drawing for each month the number of user IDs given by `to_sample_amounts`. The result is a list of user IDs, `random_user_sample`.

In [12]:
grouped_sample_dist = sample_dist.loc[with_tz_info, :].groupby(["per_month"])

random_user_sample = []

for (month, amount) in to_sample_amounts.items():
    uids = np.random.choice(grouped_sample_dist.groups[month], size=amount, replace=False)
    random_user_sample = random_user_sample + uids.tolist()
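
No random seed is set in the cells shown, so each run of the loop above may select a different cohort. A minimal sketch of a reproducible variant follows, assuming NumPy's `Generator` API; the seed value, the toy pool, and the per-month amounts are hypothetical:

```python
import numpy as np
import pandas as pd

# A fixed seed makes the draw repeatable; the value 42 is an arbitrary assumption.
rng = np.random.default_rng(seed=42)

# Hypothetical pool of users with their creation month, indexed by user ID.
pool = pd.DataFrame(
    {"per_month": ["2010_01"] * 4 + ["2010_02"] * 3},
    index=["a", "b", "c", "d", "e", "f", "g"],
)
# Hypothetical number of users to draw from each month.
amounts_to_draw = pd.Series({"2010_01": 2, "2010_02": 1})

groups = pool.groupby("per_month").groups
sample = []
for month, amount in amounts_to_draw.items():
    # Draw without replacement so that no user is selected twice.
    chosen = rng.choice(groups[month].to_numpy(), size=int(amount), replace=False)
    sample.extend(chosen.tolist())
```

Recording the seed alongside the output file would allow the same cohort to be regenerated later.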

We then add the timezone information for these users to a dictionary, `sample_tz_dict`.

In [13]:
sample_tz_dict = {}
for u in random_user_sample:
    sample_tz_dict[u] = sample_dist.loc[u, "tz_info"]

Finally, we write this sample of users with their timezones to a file called `random_sample_user_to_tz.json`.

```python
with open("data/random_sample_user_to_tz.json", "w") as out:
    json.dump(sample_tz_dict, out)
```