# Retrieving Mastodon datasets.

Retrieving Mastodon datasets for specific sentiment analysis.

Written by Luc Bijl.

Retrieving Mastodon key from credentials file.

In [1]:
with open("../.credentials", "r") as file:
    for line in file:
        if 'mastodon-key' in line:
            mastodon_key = line.split('mastodon-key=')[1].strip()
            break

Initializing the Mastodon API.

In [2]:
import pandas as pd
from mastodon import Mastodon

api_base_url = "https://mastodon.social"

mastodon = Mastodon(access_token=mastodon_key,api_base_url=api_base_url)

Converting the retrieval start and end time period to timestamps.

In [3]:
from datetime import datetime

start_date_string = '2022-10-11 00:00:00'
end_date_string = '2022-12-11 00:00:00'

start_timestamp = int(datetime.strptime(start_date_string, "%Y-%m-%d %H:%M:%S").timestamp())
end_timestamp = int(datetime.strptime(end_date_string, "%Y-%m-%d %H:%M:%S").timestamp())

Queriying all messages containing Eli Lilly in the period 2022-10-11 to 2022-12-11.

In [8]:
query = 'language:en after:2022-10-11 before:2022-12-11'

Specifiying the chosen search topics.

In [24]:
topics = [
    'Lilly', 'Eli Lilly', 'Eli Lilly and company', 'LLY',
    'Insulin', 
    'Diabetes', 'Diabetic',
    'Medicine', 'Medical',
    'pharmaceutical'
]

Obtaining all toots in for every topic, and adding the dataframes of every topic to a dictionary.

In [25]:
limit = 1000

dataframes = {}

for topic in topics:

    max_id = None
    dates = []
    ids = []
    contents = []

    while len(dates) < limit:

        toots = mastodon.search_v2(query + ' ' + topic,result_type='statuses',max_id=max_id)['statuses']

        if not toots:
            break

        if toots[-1].id == max_id:
            break
        

        for n in range(1,len(toots)):
            date = int(toots[n].created_at.timestamp())

            if start_timestamp <= date <= end_timestamp:
                dates.append(datetime.utcfromtimestamp(date))
                ids.append(toots[n].id)
                contents.append(toots[n].content)

        max_id = toots[-1].id

    data = {'Date': dates, 'ID': ids, 'Content': contents}
    df_toots = pd.DataFrame(data)
    dataframes[topic] = df_toots

dataframes

{'Lilly':                   Date                  ID  \
 0  2022-12-08 14:42:58  109478581359363509   
 1  2022-12-05 17:58:35  109462365059484871   
 2  2022-12-01 20:33:41  109440324430339788   
 3  2022-11-30 18:53:33  109434268151114981   
 4  2022-11-28 03:01:05  109419198231109803   
 ..                 ...                 ...   
 93 2022-11-07 06:42:17  109301159536501565   
 94 2022-11-06 19:52:47  109298605607324195   
 95 2022-11-03 23:05:59  109282378357896423   
 96 2022-10-30 09:00:14  109256403763306021   
 97 2022-10-27 21:06:07  109242270847700247   
 
                                               Content  
 0   <p>ICYMI yesterday, our new <a href="https://n...  
 1   <p><span class="h-card"><a href="https://roman...  
 2   <p>Sophia has nursery toys Charlie and Lilly f...  
 3   <p>Eli Lilly CEO says insulin tweet flap “prob...  
 4   <p>Tickets acquired to see The Mountain Goats ...  
 ..                                                ...  
 93  <p>@nixnick@octodon.s

Saving the dataframes dictionary in the datasets directory.

In [40]:
import pickle 

with open('../datasets/mastodon-lly.pkl','wb') as file:
    pickle.dump(dataframes, file)