In [1]:
%%html
<!-- CSS settings for this notebook -->
<style>
    h1 {color:#BB0000}
    h2 {color:purple}
    h3 {color:#0099ff}
    hr {    
        border: 0;
        height: 3px;
        background: #333;
        background-image: linear-gradient(to right, #ccc, black, #ccc);
    }
</style>

# Data Mining Mastodon

------

# Objectives
* What is **Mastodon**?
* Why we're presenting **Mastodon** rather than **Twitter**
* **Data-mine Mastodon** with **Mastodon.py** library
* Use various **Mastodon** API methods
* **Get information** about a specific Mastodon account
* **Look up trending hashtags** 
* **Search for toots** containing a specific hashtag
* **Process streams of toots** as they’re happening
* **Clean and preprocess toots** to prepare them for analysis
* **Translate foreign language toots** into English 
* Tap into the **live streams of toots**
<!--* Perform **sentiment analysis** on toots from the live stream-->
* Create an **interactive map of the Mastodon server locations** from which toots are received

------

# 12.1 Introduction 
* **Data mining** &mdash; searching large collections of data for **insights**
* **Sentiment** in toots can help **make predictions**  
    * **Stock prices**
    * **Election results**
    * Likely **revenues** for a **new movie** or, more generally, **product**
    * **Success** of a company’s **marketing campaign**
* Spot **comments on your company's products** 
* Spot **faults in competitors’ products** 
* Spot **trending topics**
* **Connect to Mastodon** with easy-to-use **Web services**

## What Is Mastodon?
* Free social network
* Similar to X, but decentralized and more privacy focused
* No ads
* Thousands of servers run by individuals and companies worldwide
* **Federated** (known as the **Fediverse**)
    * Independent servers distributed across the Internet
    * Communication among the server nodes  
    > https://en.wikipedia.org/wiki/Distributed_social_network
    * Can communicate with accounts throughout the **Fediverse** 
* **Toots**
    * Messages up to **500 characters**
    * Some servers allow more
* Generally, anyone can follow anyone else, subject to
    * individual users' account settings 
    * specific server rules

## Accessing Mastodon Data Programmatically 
* Anyone with an account can use the APIs
* Access and manipulate accounts, servers, toots (statuses), timelines, trends, ...
* Can **tap into the live stream** of toots for a given server or the **Fediverse** 

------

# 12.2 Overview of the Mastodon APIs 
* **Web services** are methods that you call in the **cloud**
* Each method has a **web service endpoint** represented by a **URL**
    * **Caution**: Internet connections can be lost, services can change and some services are not available in all countries, so **apps can be brittle**
* Some **API categories** 
    * **Accounts API** — Access information about and manipulate **Mastodon user accounts**
    * **Statuses API** — Access info about and post **status updates**, known as **toots**
    * **Timelines API** — Toots and other "events" (follows, likes, ...) over time — since the inception of each Mastodon server
    > Enables access to toots and other "events" from the **public fediverse**, **toots with specific hashtags**, **logged-in user's timeline** (including accounts the user follows) and **lists** for filtering a user's home timeline. Can use timelines to search for **past toots** containing specific hashtags and access **live toot streams**
* **Mastodon API categories** under the **API METHODS** heading in the left column at
>https://docs.joinmastodon.org/
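
To make "endpoint represented by a URL" concrete, here is a minimal sketch that only builds an endpoint URL and makes no network request. It assumes the trending-hashtags path `/api/v1/trends/tags` and its `limit` query parameter as listed in the Mastodon docs; Mastodon.py normally assembles such URLs for you.

```python
from urllib.parse import urlencode

base_url = 'https://mastodon.social'   # server used in this notebook
path = '/api/v1/trends/tags'           # assumed trending-hashtags endpoint path
params = {'limit': 20}                 # query parameters for the call

# An endpoint is the server address, plus the method's path, plus its parameters
endpoint = f'{base_url}{path}?{urlencode(params)}'
print(endpoint)
```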

------

# 12.3 Creating a Mastodon Account 

## Developer Accounts
* **Mastodon does not have separate developer accounts**
    * Anyone with a Mastodon account can be a developer
    * Every server has its own rules — some servers allow anyone to join, some require approval

## Servers
* Sign up on the main server: https://mastodon.social/auth/sign_up
* Or, explore servers worldwide at
    * https://joinmastodon.org/servers
    * Many more servers than listed here

## Mastodon.social — Original and Largest Overall Mastodon Server
* **Deitel joined `mastodon.social`** 
* https://joinmastodon.org/servers enables you to filter servers based on
    * region
    * language
    * topical focus of that server

------

# 12.4 What’s in a Mastodon API Response? 
* Mastodon API methods return **JSON (JavaScript Object Notation)** objects
    * Like Twitter and most popular web services today
* Text-based **data-interchange format** 
* Represents objects as **collections of name–value pairs** (like dictionaries)
* Commonly used in web services
* Human and computer readable

## JSON
* **JSON object format**:

> ```
> {propertyName1: value1, propertyName2: value2}
> ```
* **JSON array format (like Python list)**:

> ```
> [value1, value2, value3]
> ```
* **Mastodon.py handles the JSON for you** behind the scenes
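
As a small illustration of the two formats, the following sketch parses JSON text with Python's standard `json` module. The JSON text is made up for illustration and is not an actual Mastodon response. Note that strict JSON requires property names in double quotes.

```python
import json

# Hypothetical JSON text, loosely shaped like a hashtag entry
json_text = '{"name": "caturday", "uses": [828, 62, 32], "following": false}'

tag = json.loads(json_text)  # JSON object becomes a Python dict

print(tag['name'])           # value of the "name" property
print(sum(tag['uses']))      # JSON array becomes a Python list
print(tag['following'])      # JSON false becomes Python False
```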

## Class `mastodon.AttribAccessDict` 
* Mastodon returns JSON as **`mastodon.AttribAccessDict` objects**
* Python `dict` (dictionary) subclass
* Access via
    * traditional Python dictionary keys  
    * attributes named to match the dictionary keys
* **API ENTITIES** section of the Mastodon docs (https://docs.joinmastodon.org/) describes the 52 JSON objects you'll find in various Mastodon API responses
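
The idea of a dictionary that also allows attribute access can be sketched in a few lines. The `AttrDict` class below is a hypothetical stand-in written for this illustration; it assumes nothing about how Mastodon.py implements its own class.

```python
class AttrDict(dict):
    """Hypothetical stand-in: a dict whose keys are also readable as attributes."""

    def __getattr__(self, name):
        # Python calls this only when normal attribute lookup fails
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


account = AttrDict(username='Mastodon', followers_count=843800)

print(account['username'])       # dictionary-style access
print(account.followers_count)   # attribute-style access
```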

## Sample JSON for Trending Hashtags
* A portion of the response to a request for recent trending hashtags, displayed as the Python objects Mastodon.py creates from the JSON

```python
[{'name': 'caturday',
  'url': 'https://mastodon.social/tags/caturday',
  'history': [{'day': datetime.datetime(2023, 4, 22, 0, 0, tzinfo=datetime.timezone.utc),
    'accounts': '719',
    'uses': '828'},
   {'day': datetime.datetime(2023, 4, 21, 0, 0, tzinfo=datetime.timezone.utc),
    'accounts': '58',
    'uses': '62'},
   {'day': datetime.datetime(2023, 4, 20, 0, 0, tzinfo=datetime.timezone.utc),
    'accounts': '25',
    'uses': '32'},
   {'day': datetime.datetime(2023, 4, 19, 0, 0, tzinfo=datetime.timezone.utc),
    'accounts': '28',
    'uses': '34'},
   {'day': datetime.datetime(2023, 4, 18, 0, 0, tzinfo=datetime.timezone.utc),
    'accounts': '19',
    'uses': '22'},
   {'day': datetime.datetime(2023, 4, 17, 0, 0, tzinfo=datetime.timezone.utc),
    'accounts': '25',
    'uses': '26'},
   {'day': datetime.datetime(2023, 4, 16, 0, 0, tzinfo=datetime.timezone.utc),
    'accounts': '225',
    'uses': '254'}],
  'following': False},
 {'name': 'ScreenshotSaturday',
  'url': 'https://mastodon.social/tags/ScreenshotSaturday',
  'history': [{'day': datetime.datetime(2023, 4, 22, 0, 0, tzinfo=datetime.timezone.utc),
    'accounts': '56',
    'uses': '59'},
   {'day': datetime.datetime(2023, 4, 21, 0, 0, tzinfo=datetime.timezone.utc),
    'accounts': '0',
    'uses': '0'},
   {'day': datetime.datetime(2023, 4, 20, 0, 0, tzinfo=datetime.timezone.utc),
    'accounts': '0',
    'uses': '0'},
   {'day': datetime.datetime(2023, 4, 19, 0, 0, tzinfo=datetime.timezone.utc),
    'accounts': '0',
    'uses': '0'},
   {'day': datetime.datetime(2023, 4, 18, 0, 0, tzinfo=datetime.timezone.utc),
    'accounts': '3',
    'uses': '3'},
   {'day': datetime.datetime(2023, 4, 17, 0, 0, tzinfo=datetime.timezone.utc),
    'accounts': '3',
    'uses': '3'},
   {'day': datetime.datetime(2023, 4, 16, 0, 0, tzinfo=datetime.timezone.utc),
    'accounts': '24',
    'uses': '26'}],
  'following': False},
  ...
 ]
```

------

# 12.5 Installing the Libraries Used in This Notebook

## Installing Mastodon.py 
* https://github.com/halcy/Mastodon.py 
* Easy access to Mastodon APIs
* Mastodon.py docs: https://mastodonpy.readthedocs.io/
> `pip3 install Mastodon.py`



## DeepL AI Translator 
* Mastodon's API supports translation, but the Mastodon.py library does not yet support it
* https://github.com/DeepLcom/deepl-python
> `pip install --upgrade deepl`
* DeepL requires an API key
* Free one allows 500,000 characters/month
* To get a key:
> * Go to https://www.deepl.com/pro#developer
> * Click **API**
> * Click **Sign up for free**
> * Under **DeepL API Free** click **Sign up for free**
> * Specify an email/password and click **Continue**
> * Fill in the form and provide a credit card — required to prevent “fraudulent multiple registrations”, then click **Continue**
> * Read the terms and, if you agree, click **Sign up for free**
> * Click the **Account Management** link on the thank you page
> * Click the **Account** tab and scroll to **Authentication Key for DeepL API**
> * Copy your key then open **`keys_mastodon.py`** and replace `'your key here'` with your DeepL key
>> `deepL_key = 'your key here'`

## Installing geopy 
* https://github.com/geopy/geopy
* Convert locations, such as **Boston, MA**, into latitudes and longitudes, such as **42.3602534** and **-71.0582912**, for plotting on maps
* We'll use the free **ArcGIS** service
>`conda install -c conda-forge geopy`
> * Windows users: **Run the Anaconda Prompt as an Administrator**

## Folium Library and Leaflet.js JavaScript Mapping Library
* https://github.com/python-visualization/folium
* Creates interactive maps
> `pip install folium`

**Maps from OpenStreetMap.org**
* Leaflet.js uses open-source maps from `OpenStreetMap.org`. 
* Copyrighted by the OpenStreetMap.org contributors
* www.openstreetmap.org/copyright 
* www.opendatacommons.org/licenses/odbl

------

# 12.6 Preparing to Interact with Mastodon Programmatically

## Import Username and Password
* Before executing this cell, ensure that your copy of `keys_mastodon.py` contains your Mastodon credentials
* **Many Mastodon APIs do not require authentication**
    * Some APIs optionally require authentication — determined by each server's administrator
    * Some require authentication, such as those that enable administration of a Mastodon server
* See each method's documentation for **authentication requirements**
    * Mastodon.py: https://mastodonpy.readthedocs.io/
    * Main Mastodon docs: https://docs.joinmastodon.org/
* **We will log in, so we are authenticated for calls that require authentication**, such as searching for accounts

In [2]:
import keys_mastodon

## Register a Mastodon App
* Must be done **once per server** that you'll directly interact with via the API 
    * As you'll see, through one server, you can get access to the Fediverse data
* For apps you are distributing (e.g., a mobile-phone app for interacting with Mastodon)
    * must register **once for each device/server pair**
    * for example, a mobile app might allow the user to manage accounts on multiple Mastodon servers
* Arguments
    * app name
    * `api_base_url` — your specific Mastodon server
    * `to_file` — file in which `create_app` saves the app credentials

In [3]:
from mastodon import Mastodon

In [4]:
# create deiteltest app and save its credentials
credentials = Mastodon.create_app(
    'DeitelPythonDataScienceMastodonApp',
    api_base_url='https://mastodon.social',
    to_file='deiteltest_client_credentials.secret'
)

<hr style="height:2px; border:none; color:#AAA; background-color:#AAA;">

# 12.7 Creating a `Mastodon` object to Access Mastodon APIs
* **`Mastodon` object** is your gateway to using the Mastodon APIs
* Uses the info stored via the `to_file` parameter in preceding `create_app` call 

<!-- mastodon = Mastodon(client_id='deiteltest_client_credentials.secret') -->

## Create Mastodon Object for Authentication Purposes

In [5]:
mastodon = Mastodon(
    client_id='deiteltest_client_credentials.secret',
    api_base_url='https://mastodon.social'
)

## Generate authorization URL
* Mastodon recently changed the authentication process
* This creates a URL the user can use to log into Mastodon

In [6]:
auth_url = mastodon.auth_request_url(scopes=['read', 'write'])

## Have user log in
* Open the URL in a browser and obtain the authorization code 

In [9]:
print(f"Please visit this URL and authorize the app: {auth_url}")
code = input("Enter the authorization code: ")

Please visit this URL and authorize the app: https://mastodon.social/oauth/authorize?client_id=aCVlpR-SpoOItHw3_CACf6In3H_JrPuZRwN1eVYzLNU&response_type=code&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=read+write&force_login=False&state=None&lang=None


Enter the authorization code:  S-vuay8GEnNidn_c6aO8W9XgvYMK2i3aEgKE1cATevE


## Log in with the authorization code
* Required for some API calls

In [10]:
access_token_str = mastodon.log_in(
    code=code,
    scopes=['read'], # can include read, write, follow, push
    to_file='usercred.secret'
)

<!-- ## Log into Mastodon
* Log into the account
* May not be required depending on the API methods you'll use

access_token = mastodon.log_in(keys_mastodon.usr, keys_mastodon.pwd, 
    scopes=['read', 'write', 'follow'],
    to_file='deiteltest_client_credentials.secret')

mastodon = Mastodon(
    client_id='deiteltest_client_credentials.secret',
    access_token='usercred.secret',
    api_base_url='https://mastodon.social'
) -->

## **Example:** Rate Limits
* Typically, **300 calls per user** or **7500 calls per IP address** in **5 minutes**
    * Can vary by server
* Options **throw**, **wait** and **pace**
    * **throw**: raises `MastodonRatelimitError` when a request hits the rate limit (for apps that manage their own rate limiting)
    * **wait** (Mastodon.py's default): When the rate limit is hit, waits until the rate limit resets (at the end of the five-minute interval), then tries again
    * **pace**: Delays each request after the first, attempting to avoid hitting the rate limit; acts like **wait** mode if the limit is hit
* Following statement sets the rate limit method for all calls made through our `mastodon` object

In [11]:
mastodon.ratelimit_method = 'wait'

### Number of Calls Per 5 Minutes Allowed on This Server

In [12]:
mastodon.ratelimit_limit 

300

### Number of Calls Remaining in Current Rate Interval Period

In [13]:
mastodon.ratelimit_remaining

287
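
The arithmetic behind pacing can be sketched as follows. This is a simplified illustration of the idea, not Mastodon.py's actual implementation; the function name `pace_delay` and its parameters are hypothetical.

```python
def pace_delay(calls_remaining, seconds_until_reset):
    """Seconds to pause before the next call so that the remaining calls
    are spread evenly over the rest of the rate-limit interval."""
    if calls_remaining <= 0:
        return float(seconds_until_reset)  # limit hit: wait for the reset
    return seconds_until_reset / calls_remaining


# 300 calls in a fresh 5-minute (300-second) interval: one call per second
print(pace_delay(300, 300))

# No calls left with 120 seconds to go: wait out the interval
print(pace_delay(0, 120))
```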

<hr style="height:2px; border:none; color:#AAA; background-color:#AAA;">

# 12.8 **Example:** Getting a Mastodon Instance's (Server's) Info
* **Instance** JSON description: https://docs.joinmastodon.org/entities/Instance/

### Get the Instance Info
* Depending on server, might need to be logged in

In [14]:
instance = mastodon.instance() 

### Print Some Instance Info

In [15]:
#print(f'{"server title":>19}: {instance["title"]}')
#print(f'{"uri":>19}: {instance["uri"]}')
#print(f'{"short_description":>19}: {instance["short_description"]}')
#print(f'{"stats.user_count":>19}: {instance["stats"]["user_count"]:,}')
#print(f'{"stats.status_count":>19}: {instance["stats"]["status_count"]:,}')
#print(f'{"stats.domain_count":>19}: {instance["stats"]["domain_count"]:,}')

print(f'{"server title":>19}: {instance.title}')
print(f'{"server domain":>19}: {instance.domain}')
print(f'{"server description":>19}: {instance.description}')
print(f'{"active users on this server (4 weeks)":>19}: {instance.usage.users.active_month:,}')

       server title: Mastodon
      server domain: mastodon.social
 server description: The original server operated by the Mastodon gGmbH non-profit
active users on this server (4 weeks): 280,329


<hr style="height:2px; border:none; color:#AAA; background-color:#AAA;">

# 12.9 Searching for Mastodon Accounts By User Name
* A mobile app might allow users to locate other accounts to follow
* Can programmatically search for accounts containing a specified string

## **Example:** Find Account Names Containing `'Mastodon'`
* Returns list of **Account**s
* **Account** JSON description: https://docs.joinmastodon.org/entities/Account/
<!--@mastodon.social-->

In [16]:
accounts = mastodon.account_search(q='Mastodon') 

In [17]:
len(accounts)

40

## **Example:** Basic Account Information for Top 3 Accounts with `mastodon` in the name
* You can discover info about an account
    * Might want to follow an account based on popularity (number of followers)
    * Might want to follow some of the same accounts that a specific account follows
* Each has many properties, including:
    * `username` — user’s Mastodon handle 
    * `id` — account’s unique ID number
    * `url` — URL used to access the account in a web browser
    * `note` — account description (may contain HTML tags)
    * `statuses_count` — number of toots posted by the account
    * `followers_count` — account's number of followers
    * `following_count` — number of other accounts this account follows
* Sort by `followers_count` in descending order, then display the account with the most followers (index `0`); indices `1` and `2` hold the rest of the top 3

In [18]:
sorted_accounts = sorted(accounts, key=lambda acct: acct.followers_count, reverse=True)

In [19]:
print('username: ', sorted_accounts[0].username)
print('id: ', sorted_accounts[0].id)
print('url: ', sorted_accounts[0].url)
print(f'statuses_count: {sorted_accounts[0].statuses_count:,}')
print(f'followers_count: {sorted_accounts[0].followers_count:,}')
print(f'following_count: {sorted_accounts[0].following_count:,}')
print()

username:  Mastodon
id:  13179
url:  https://mastodon.social/@Mastodon
statuses_count: 344
followers_count: 843,800
following_count: 34
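
To display several of the top accounts rather than one, a slice and a loop suffice. The sketch below uses made-up stand-in objects instead of real `account_search` results so it runs without a server connection; the usernames and counts are invented.

```python
from types import SimpleNamespace

# Made-up stand-ins for the Account objects that account_search returns
accounts = [
    SimpleNamespace(username='alpha', followers_count=120),
    SimpleNamespace(username='beta', followers_count=9500),
    SimpleNamespace(username='gamma', followers_count=40),
    SimpleNamespace(username='delta', followers_count=780),
]

sorted_accounts = sorted(
    accounts, key=lambda acct: acct.followers_count, reverse=True)

top_three = sorted_accounts[:3]  # slicing is safe even with fewer than 3 results

for acct in top_three:
    print(f'{acct.username}: {acct.followers_count:,} followers')
```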



### Getting Your Own Account’s Information
* Get via `Mastodon` object’s `me` method
> `my_account = mastodon.me()`
* Returns an **Account object** for the account you used to authenticate with Mastodon

<hr style="height:2px; border:none; color:#AAA; background-color:#AAA;">

# 12.10 Spotting Trending Hashtags: Mastodon Trends API
* If a topic **“goes viral,”** thousands or even millions of people could be talking about it
* Mastodon allows you to look up **trending hashtags**, **trending toots** and  **trending links** across the fediverse
    * NOTE: **trending toots** and **trending links** are supposed to return lists of items, but each returns only one item at the moment 
    * **I filed an issue in the Mastodon.py GitHub repository** — the developers acknowledged the bug and are looking into it 

## **Example:** Getting a List of Trending Hashtags 
* `trending_tags` returns a list of trending hashtags in the fediverse
* Returned as JSON **Tag** objects
> https://docs.joinmastodon.org/entities/Tag/
* Each contains
    * `name`
    * `url`
    * `history` list of last 7 days' stats for the hashtag
    * `following` ― whether the logged-in account is following the trending tag

In [20]:
trends = mastodon.trending_tags(limit=20)  # 20 max, 10 default

In [21]:
trends[0] # sample dictionary for one hashtag

Tag([('name', 'anycookieasongorpoem'),
     ('url', 'https://mastodon.social/tags/anycookieasongorpoem'),
     ('history',
      [TagHistory([('day',
                    datetime.datetime(2025, 7, 11, 0, 0, tzinfo=datetime.timezone.utc)),
                   ('uses', '0'),
                   ('accounts', '0')]),
       TagHistory([('day',
                    datetime.datetime(2025, 7, 10, 0, 0, tzinfo=datetime.timezone.utc)),
                   ('uses', '356'),
                   ('accounts', '106')]),
       TagHistory([('day',
                    datetime.datetime(2025, 7, 9, 0, 0, tzinfo=datetime.timezone.utc)),
                   ('uses', '0'),
                   ('accounts', '0')]),
       TagHistory([('day',
                    datetime.datetime(2025, 7, 8, 0, 0, tzinfo=datetime.timezone.utc)),
                   ('uses', '0'),
                   ('accounts', '0')]),
       TagHistory([('day',
                    datetime.datetime(2025, 7, 7, 0, 0, tzinfo=datetime.timezone.utc)),


## **Example:** Display Trending Hashtags in Descending Order By Toot Volume over the Last Seven Days

In [22]:
def tag_count(tag):
    """Counts number of times a hashtag was used in last 7 days"""
    total_uses = 0
    
    for day in tag.history:
        total_uses += int(day['uses'])  
    
    # add attribute to tag object specifying 7-day hashtag count
    tag['seven_day_count'] = total_uses 
    return total_uses
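
If you prefer not to add a key to each tag, the same total can be computed without side effects. This is an alternative sketch; the helper name `seven_day_uses` is hypothetical, and the sample data reuses the `'uses'` values from the `caturday` entry in Section 12.4.

```python
def seven_day_uses(tag):
    """Total the 'uses' values in a tag's history without modifying the tag."""
    return sum(int(day['uses']) for day in tag['history'])


# 'uses' values copied from the caturday sample in Section 12.4;
# note that the counts arrive as strings
caturday = {
    'name': 'caturday',
    'history': [{'uses': uses} for uses in
                ['828', '62', '32', '34', '22', '26', '254']],
}

print(seven_day_uses(caturday))
```

A function like this can also serve as a `key` function when sorting.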

* Sort the trends in **descending** order by toot volume:

In [23]:
trends.sort(key=tag_count, reverse=True)

* Display names, counts and URLs of the **top 20 trending topics**

In [24]:
for tag in trends:
    print(f'{tag.name}: {tag["seven_day_count"]}')
    print(f'   {tag.url}')

anycookieasongorpoem: 356
   https://mastodon.social/tags/anycookieasongorpoem
rebrandwebsitesortech: 292
   https://mastodon.social/tags/rebrandwebsitesortech
exciteafictionalcharacter: 213
   https://mastodon.social/tags/exciteafictionalcharacter
MeerMittwoch: 115
   https://mastodon.social/tags/meermittwoch
ThrowbackThursday: 109
   https://mastodon.social/tags/throwbackthursday
FINSUI: 94
   https://mastodon.social/tags/finsui
musiquinta: 91
   https://mastodon.social/tags/musiquinta
murderbot: 91
   https://mastodon.social/tags/murderbot
makershour: 88
   https://mastodon.social/tags/makershour
doorsday: 69
   https://mastodon.social/tags/doorsday
ThursdayFiveList: 64
   https://mastodon.social/tags/thursdayfivelist
lula: 62
   https://mastodon.social/tags/lula
streamsofdreams: 57
   https://mastodon.social/tags/streamsofdreams
tbt: 52
   https://mastodon.social/tags/tbt
GoodTroubleLivesOn: 47
   https://mastodon.social/tags/goodtroubleliveson
public: 40
   https://mastodon.social

<hr style="height:2px; border:none; color:#AAA; background-color:#AAA;">

# 12.11 Searching for Toots Containing Specific Hashtags; Getting More than One Page of Results 

* Mastodon API methods often **return collections of objects** 
    * Docs describe these as "Arrays" 
    * Mastodon.py returns **lists of Python dictionaries**
* For example, `timeline_xxx` functions can return:
    * `timeline_hashtag` — toots **containing a specific hashtag**
    * `timeline_public` — toots from the **fediverse's public timeline** 
    * `timeline_local` — toots posted on the **local server**
    * `timeline_home` — toots in a user’s **home timeline** (includes accounts followed by the user)
* Each Mastodon API call returns a maximum number of items per call
    * known as a **page of results**
    * default is often 10 or 20
    * max is typically 40 for toots and 80 for accounts
* **Mastodon.py can handle paging details** with utility functions (as you'll see momentarily)
> https://mastodonpy.readthedocs.io/en/stable/12_utilities.html

## Functions `print_toot` and `print_toots` from `tootutilities.py`

## **Example:** Paging Through Results
* We purposely grab only 2 toots per page of results to show the mechanics of paging

In [25]:
total_toots = 10 # how many toots to get
pages = 5 # number of pages of toots to process
toots_per_page = total_toots // pages # 2 per page for demo purposes; can do 40/page
hashtag = 'football'

### Get First Page of Toots Containing a Given Hashtag 
* `timeline_hashtag` searches public timeline of past toots containing hashtag you specify

In [26]:
result = mastodon.timeline_hashtag(hashtag, limit=toots_per_page)
saved_toots = list(result) # copy of first page; will eventually contain all the toots

### Get Remaining Pages of Toots Containing a Hashtag 
* Mastodon.py utility function **`fetch_next`** gets the next page of results
* **Argument is the previous page of results**, which includes the info needed by Mastodon.py to get the next page of results

In [27]:
for _ in range(pages - 1): # for each remaining page
    # save previous page of results
    previous_result = result 
    
    # use Mastodon.py utility function fetch_next to get next page of results
    result = mastodon.fetch_next(previous_result) 

    # if there are results add them to saved_toots; otherwise, terminate loop
    if result:
        saved_toots += result
    else:
        break # no more results

In [28]:
len(saved_toots) # total toots acquired

10
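
The paging loop can be wrapped in a reusable helper. The sketch below is one possible shape, with a fake data source standing in for a Mastodon server so it runs offline; `collect_pages`, `fake_pages` and `fake_fetch_next` are hypothetical names introduced for this illustration.

```python
def collect_pages(first_page, fetch_next, max_pages):
    """Gather items from up to max_pages pages.

    fetch_next is any callable that takes a page and returns the next
    page, or a falsy value (None or an empty list) when no pages remain.
    """
    collected = list(first_page)  # copy, so the first page is not modified
    page = first_page

    for _ in range(max_pages - 1):
        page = fetch_next(page)
        if not page:
            break  # no more results
        collected.extend(page)

    return collected


# Fake three-page data source standing in for a Mastodon server
fake_pages = [['toot1', 'toot2'], ['toot3', 'toot4'], ['toot5']]

def fake_fetch_next(page):
    """Return the page after the given one, or None at the end."""
    index = fake_pages.index(page) + 1
    return fake_pages[index] if index < len(fake_pages) else None


all_toots = collect_pages(fake_pages[0], fake_fetch_next, max_pages=5)
print(all_toots)
```

With a logged-in `Mastodon` object, the assumed real-world use would be to pass `mastodon.fetch_next` as the `fetch_next` argument.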

In [29]:
import tootutilities 
tootutilities.print_toots(saved_toots)

AstonVilla:


Bournemouth:


BrentfordFC:


CrystalPalace:


EvertonFC:


IpswichTown:


LeedsUnited:


lasvegasraiders:


kansascitychiefs:


ManchesterCity:


<hr style="height:2px; border:none; color:#AAA; background-color:#AAA;">

# 12.12 Cleaning/Preprocessing Toots for Analysis
* **Data cleaning** is one of data scientists' most common and important tasks 
* Depending on the text analyses you wish to perform, you may need to normalize text so NLP tools can "understand" it
    * abbreviations, slang, incorrectly spelled words, inconsistent formatting, etc. can limit NLP tools' ability to understand and analyze text 
* Some NLP tasks (Lesson 11) for normalizing social media posts
    * Converting all text to the same case
    * Removing `#` from hashtags; removing `@`-mentions, duplicates and whole hashtags
    * Removing excess whitespace, punctuation, **stop words**, URLs
    * **Stemming** and **lemmatization**
    * **Tokenization**
    * Removing formatting, like HTML, which NLP tools might not understand

### **BeautifulSoup4** Library
* https://www.crummy.com/software/BeautifulSoup/
> `pip install beautifulsoup4`
* Most popular library for **parsing HTML and extracting content** from it
* Commonly used to **data-mine content in web pages**

### **tweet-preprocessor** Library 
* https://github.com/s/preprocessor
* Library designed to clean tweets, but useful for posts in general
* `pip install tweet-preprocessor`
* Can automatically remove any combination of:

| Option constant | What it removes |
| :--- | :--- |
| **`OPT.MENTION`** | @-Mentions (e.g., `@nasa`) |
| **`OPT.EMOJI`** | Emoji |
| **`OPT.HASHTAG`** | Hashtag (e.g., `#mars`) |
| **`OPT.NUMBER`** | Number |
| **`OPT.RESERVED`** | Twitter reserved Words (`RT` and `FAV`) |
| **`OPT.SMILEY`** | Smiley |
| **`OPT.URL`** | URL |

## **Example:** Cleaning a Toot Containing HTML and a URL

In [30]:
toot_text = '<p style="padding-left: 3em">A sample fake toot with a URL https://nasa.gov</p>'

* **BeautifulSoup** library can be used to **parse HTML and extract content** from it

In [31]:
from bs4 import BeautifulSoup

In [32]:
soup = BeautifulSoup(toot_text, 'html.parser') 

In [33]:
plain_text = soup.get_text() # remove all HTML/CSS tags and commands

In [34]:
plain_text

'A sample fake toot with a URL https://nasa.gov'

* The **tweet-preprocessor** library’s module name is **`preprocessor`**

In [35]:
import preprocessor as p

In [36]:
p.set_options(p.OPT.URL)

In [37]:
p.clean(plain_text)

'A sample fake toot with a URL'
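
The two cleaning steps can be combined into a single function. The sketch below deliberately swaps in Python's standard library (`html.parser` and `re`) for BeautifulSoup and tweet-preprocessor so it runs with no extra installs; the URL pattern is a simple assumption and far less thorough than what those libraries do.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text between HTML tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


URL_PATTERN = re.compile(r'https?://\S+')  # simplistic URL matcher (assumption)


def clean_toot(html_text):
    """Strip HTML tags, remove URLs and collapse extra whitespace."""
    extractor = TextExtractor()
    extractor.feed(html_text)
    extractor.close()
    text = ''.join(extractor.parts)
    text = URL_PATTERN.sub('', text)
    return ' '.join(text.split())


toot_text = '<p style="padding-left: 3em">A sample fake toot with a URL https://nasa.gov</p>'
print(clean_toot(toot_text))
```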

------

# 12.13 Mastodon Streaming API
* Your app can receive various Mastodon streams as they occur in real-time
    * `stream_hashtag` — public toots containing the specified hashtag
    * `stream_user` — events related to the logged-in user account (home timeline and notifications)
    * `stream_public` — public fediverse event stream
    * `stream_local` — local server event stream
    * `stream_list` — events for the specified user, but restricted to accounts from a list 

## Creating a Subclass of `StreamListener` 
* Mastodon **pushes** data to your listener
* Streaming rate varies 
* Create a **subclass of Mastodon.py’s `StreamListener` class** to process the stream
* Mastodon.py calls `StreamListener` methods as it receives events
    * `on_update(self, status)` is called when a toot arrives from the stream
    * `StreamListener` defines other **`on_`** methods for other "events"
    > https://mastodonpy.readthedocs.io/en/stable/10_streaming.html#streamlistener
    * Override only the methods your app needs

## Class `TootListener` (Located in tootlistener.py)

## **Example:** Streaming the Mastodon Federated Timeline

### Creating a `TootListener` 
* `StreamListener` subclass `TootListener` manages the connection to the Mastodon stream and receives and processes the toots

In [38]:
import tootlistener 

In [39]:
toot_listener = tootlistener.TootListener(limit=5)

### Streaming All Public Events
* Events that are not toots are ignored by our `StreamListener` subclass
    * notifications (e.g., someone followed you or reblogged your post)
    * a toot deleted
    * someone direct messaged you
    * a toot was edited
    * the streaming connection terminated
* `stream_public` starts the live stream
    * `toot_listener` receives each event—for toots, displays toot text
    * `run_async=True` ensures that `stream_public` **returns a stream handle** we can use to **close the stream**
* **Asynchronous vs. Synchronous Streams**
    * `run_async=True` (asynchronous) runs the stream in a separate thread and returns a stream handle for managing the stream
    * `run_async=False` (synchronous) runs the stream forever unless an unanticipated failure occurs, such as an unhandled exception

In [40]:
stream = mastodon.stream_public(toot_listener, run_async=True)
toot_listener.stream = stream  # so toot_listener can terminate the stream later

prtimes:
ORIGINAL: 


TRANSLATED: 



Farbs:



nijisanji_topics:
ORIGINAL: 


TRANSLATED: 



PemudapancasilaFM:



nijisanji_topics:
ORIGINAL: 


TRANSLATED: 





------

# 12.14 **Example:** Sentiment Analysis 
* Political researchers might use sentiment analysis during elections to understand how people feel about specific politicians and issues, and **how they're likely to vote**
* Companies might use it to see **what people are saying about their products and competitors’ products**
* Class `SentimentListener` (in `sentimentlistener.py`) checks sentiment on toots 

## Class `SentimentListener` (in sentimentlistener.py)
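
The tallying logic can be sketched independently of any stream. The example below assumes each toot has already been reduced to a polarity score between -1.0 and 1.0 by an NLP library; the `classify` function and its 0.1 threshold are illustrative assumptions, not necessarily what `sentimentlistener.py` does.

```python
def classify(polarity, threshold=0.1):
    """Map a polarity score in [-1.0, 1.0] to a sentiment label.

    The 0.1 threshold is an arbitrary illustrative choice.
    """
    if polarity > threshold:
        return 'positive'
    if polarity < -threshold:
        return 'negative'
    return 'neutral'


tally = {'positive': 0, 'neutral': 0, 'negative': 0}

# Made-up polarity scores standing in for analyzed toots
for score in [0.8, 0.0, -0.45, 0.05, -0.02]:
    tally[classify(score)] += 1

print(tally)
```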

## Main Application

### Specify number of toots to tally

In [41]:
limit = 10

### Set up Dictionary to Track Toot Sentiment

In [50]:
sentiment_dict = {'positive': 0, 'neutral': 0, 'negative': 0}

### Create `StreamListener` Subclass Object

In [51]:
import sentimentlistener 
sentiment_listener = sentimentlistener.SentimentListener(sentiment_dict, limit)

### Start Stream and Store Its Handle

In [52]:
stream = mastodon.stream_public(sentiment_listener, run_async=True)
sentiment_listener.stream = stream  # so sentiment_listener can terminate the stream later

- emtk: Something is wrong: ......

- DavidBlue: I went back to the scary place.

  Tagesspiegel: Worried about backpacker: Lost in the outback? - German woman still missing in Australiahttps://www.tagesspiegel.de/gesellschaft/panorama/sorge-um-backpackerin-im-outback-verirrt-deutsche-weiter-in-australien-vermisst-14006189.html?utm_source=flipboard&utm_medium=activitypub Posted in Tagesspiegel Welt @tagesspiegel-welt-Tagesspiegel

  stacescases2.bsky.social: It's guilt. Not the kind you and I feel, he KNOWS he fucked.

+ lucas_a_meyer: Daughter yells from the other side of the house: DAAAAD, quick question? Whats 16,510 divided by two? Quick!Me: 8,255. Why?D: Im paying a debt!Me: WHAT???D: In Animal Crossing, daaad

  penguin: 

  adhd_memetherapy: Once again, the bots, racists, Zionists, & MAGA crowd are crawling all over the comments. It never ceases to amaze how loud your lack of empathy and basic humanity is.Why are u even here? The majority of ADHD & Autistic people r justice sens

#### Display summary of results

In [53]:
print('Toot sentiment:')
print('Positive:', sentiment_dict['positive'])
print(' Neutral:', sentiment_dict['neutral'])
print('Negative:', sentiment_dict['negative'])

Toot sentiment:
Positive: 1
 Neutral: 7
Negative: 2


------

# 12.15 **(IF TIME) Example:** Geocoding and Mapping
* Collect streaming toots
* Look up server locations and plot toots at those locations on an interactive map
* **Mastodon is privacy focused**
    * Only server admins have access to any location data
* Even in Twitter, geo location is off by default, though many accounts specify home location 
    * Sometimes invalid or fictitious 
* Map markers will show the sender's `location` and toot text

### **geopy** library
* https://github.com/geopy/geopy
* Installed in Section 12.5
* **Geocoding** — translate locations into **latitude** and **longitude**
* **geopy** supports dozens of **geocoding web services**, many with **free or lite tiers**
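
Because many toots come from the same server, it can help to look up each location name only once. The sketch below wraps any geocoding callable in a cache; a fake geocoder stands in for a real service so the example runs offline. `make_cached_lookup` and `fake_geocode` are hypothetical names, and the coordinates are the Boston values quoted in Section 12.5.

```python
from types import SimpleNamespace


def make_cached_lookup(geocode):
    """Wrap a geocoding callable so each location name is looked up once."""
    cache = {}

    def lookup(name):
        if name not in cache:
            result = geocode(name)  # may be None if nothing was found
            cache[name] = (
                (result.latitude, result.longitude) if result else None)
        return cache[name]

    return lookup


calls = []  # records every call made to the fake geocoder

def fake_geocode(name):
    """Stand-in geocoder so the sketch runs offline."""
    calls.append(name)
    if name == 'Boston, MA':
        return SimpleNamespace(latitude=42.3602534, longitude=-71.0582912)
    return None


lookup = make_cached_lookup(fake_geocode)
print(lookup('Boston, MA'))
print(lookup('Boston, MA'))  # second request is served from the cache
```

With geopy installed, the assumed real-world use would be to pass the `geocode` method of a `geopy.geocoders.ArcGIS` object in place of `fake_geocode`.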


### **folium library** and Leaflet.js JavaScript Mapping Library
* https://github.com/python-visualization/folium
* Set up in Section 12.5
* For maps — uses **Leaflet.js JavaScript mapping library** to display maps in a web page 
* Folium can save HTML files that you can view in your web browser or add to a website

## Getting and Mapping the Toots
* We’ll use utility functions from our **`tootutilities.py`** file and class **`LocationListener`** in **`locationlistener.py`**
* Each is included after the example

### Collections Required By LocationListener
* a list (`toots`) to store the data from the toots we collect 
* a dictionary (`counts`) to track the total number of toots we collect and the number that have location data

In [54]:
toots = [] 
counts = {'total_toots': 0, 'locations': 0}

### Creating the LocationListener 
* Collect 50 toots 
* `LocationListener` uses utility function `get_toot_content` (in `tootutilities.py`; discussed after this example) to store each toot's `username`, toot `text` and Mastodon server `location` in a dictionary

In [55]:
import locationlistener 

location_listener = locationlistener.LocationListener(
    counts_dict=counts, toots_list=toots, limit=50)

### Start Stream and Store Its Handle
* We display the toot count so far and usernames to show progress

In [56]:
stream = mastodon.stream_public(location_listener, run_async=True)
location_listener.stream = stream  # so location_listener can terminate the stream later

 1: buffalobills
 2: marvin_h2g2
 3: tkhunt
 4: gulfchannels
 5: SudOuest
 6: Tokyo
 7: EInvestidor
 8: tychotithonus
 9: africa_social
10: jrd_vs
11: roadfmsong
12: prtimes
13: AztecaDeportes
14: buffalobills
15: jaseowo
16: trainaccident
17: ten_nami_ten
18: monumental_movement_records
19: easy_fx
20: emtk
21: inanna
22: speedweek
23: tv
24: rsssunstar
25: onlineusers
26: akahata
27: thatwriterguy.com
28: schaumburgernachrichten
29: life
30: AgentofSocialMediaChaos
31: mmr3cords.bsky.social
32: flyover
33: ben
34: teteatete
35: kyodo_news
36: DAZNFootball
37: inaba_benibana
38: 709508
39: news_s
40: iembot_dvn
41: LaCeys_ES
42: youtube_ANNnewsCH
43: news.medical
44: HindustanTimes
45: guppy0228
46: autonerdery
47: todayupdate
48: mlevel
49: FOXSportsMX
50: jctrip.bsky.social


<!--

### Displaying the Location Statistics
* of the toots we processed, check the percentage of servers for which we were able to find locations (should be 100%)

counts['total_toots']

counts['locations']

print(f'{counts["locations"] / counts["total_toots"]:.1%}')
-->

### Geocoding the Locations
* Use `get_geocodes` utility function (from `tootutilities.py`; discussed after this example) to geocode the location of each toot stored in the list of toots

In [57]:
from tootutilities import get_geocodes
bad_locations = get_geocodes(toots)

Getting coordinates for mastodon server locations...
geo_location=Location(Ashburn, Virginia, (39.0427652, -77.4858142, 0.0))
geo_location=Location(San Francisco, California, (37.7800771, -122.4201615, 0.0))
geo_location=Location(Tokyo, (35.68945633, 139.69171609, 0.0))
geo_location=Location(Fremont, California, (37.5502017, -121.98083, 0.0))
geo_location=Location(Seattle, Washington, (47.603229, -122.33028, 0.0))
geo_location=Location(Ashburn, Virginia, (39.0427652, -77.4858142, 0.0))
geo_location=Location(Seattle, Washington, (47.603229, -122.33028, 0.0))
geo_location=Location(San Francisco, California, (37.7800771, -122.4201615, 0.0))
geo_location=Location(San Francisco, California, (37.7800771, -122.4201615, 0.0))
geo_location=Location(San Francisco, California, (37.7800771, -122.4201615, 0.0))
geo_location=Location(San Francisco, California, (37.7800771, -122.4201615, 0.0))
geo_location=Location(Tokyo, (35.68945633, 139.69171609, 0.0))
geo_location=Location(San Francisco, Californ

In [58]:
bad_locations

0

<!--

* For each toot with a valid location, the `get_geocodes` function adds the new keys `'latitude'` and `'longitude'` to that toot’s dictionary in the `toots` list — these will be used to plot map markers on our interactive map

### Displaying the Bad Location Statistics
* If geopy is unable to geo-encode a specific location `bad_locations` will be greater than 0

bad_locations 

print(f'{bad_locations / counts["locations"]:.1%}')
-->

### Cleaning the Data
* Before we plot the toot locations on a map, let’s use a pandas `DataFrame` to clean the data
* When you create a `DataFrame` from the `toots` list, it may contain `NaN` for `'latitude'` and `'longitude'` if geopy was unable to geocode a specific location 
* `NaN` cannot be plotted on a map, so remove any rows containing `NaN` by calling the `DataFrame`’s `dropna` method

In [59]:
toots[0]

{'username': 'buffalobills',
 'text': '<p>Bills’ Damar Hamlin predicted to lose job in key position battle in Buffalo</p><p>Damar Hamlin pretty much earned a Buffalo Bills starting safety job by default in 2024. When second-round pick…<br><a href="https://channels.im/tags/NFL" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>NFL</span></a> <a href="https://channels.im/tags/BuffaloBills" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>BuffaloBills</span></a> <a href="https://channels.im/tags/Buffalo" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Buffalo</span></a> <a href="https://channels.im/tags/Bills" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Bills</span></a> <a href="https://channels.im/tags/brandonbeane" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>brandonbeane</span></a> <a href="https://channels.im/tags/ColeBishop" class="mention hashtag" rel="nofollow noop

In [60]:
import pandas as pd

In [61]:
df = pd.DataFrame(toots)

In [62]:
df

Unnamed: 0,username,text,location,latitude,longitude
0,buffalobills,<p>Bills’ Damar Hamlin predicted to lose job i...,"Ashburn, Virginia, US",39.042765,-77.485814
1,marvin_h2g2,The Kinki region will continue to be subject t...,"San Francisco, California, US",37.780077,-122.420162
2,tkhunt,"<p><a href=""https://www.tkhunt.com/1981499/"" r...","Tokyo, Tokyo, JP",35.689456,139.691716
3,gulfchannels,https://www.gulfchannels.com/221139/ [BLACKPIN...,"Fremont, California, US",37.550202,-121.98083
4,SudOuest,"Road safety: driving in flip-flops, barefoot o...","Seattle, Washington, US",47.603229,-122.33028
5,Tokyo,<p>AIM B2B Launches in Tokyo to Transform B2B ...,"Ashburn, Virginia, US",39.042765,-77.485814
6,EInvestidor,"Economic Calendar: Friday, July 11https://einv...","Seattle, Washington, US",47.603229,-122.33028
7,tychotithonus,"<p>""Engineering should eat security.""<br>-- Ca...","San Francisco, California, US",37.780077,-122.420162
8,africa_social,📢 Volleyball: All clubs have asked for youth a...,"San Francisco, California, US",37.780077,-122.420162
9,jrd_vs,<p>​:ohiru1:​だ​:blobwani_omusubi:​</p>,"San Francisco, California, US",37.780077,-122.420162


In [63]:
df = df.dropna() # if there are rows with missing data drop them

In [64]:
df

Unnamed: 0,username,text,location,latitude,longitude
0,buffalobills,<p>Bills’ Damar Hamlin predicted to lose job i...,"Ashburn, Virginia, US",39.042765,-77.485814
1,marvin_h2g2,The Kinki region will continue to be subject t...,"San Francisco, California, US",37.780077,-122.420162
2,tkhunt,"<p><a href=""https://www.tkhunt.com/1981499/"" r...","Tokyo, Tokyo, JP",35.689456,139.691716
3,gulfchannels,https://www.gulfchannels.com/221139/ [BLACKPIN...,"Fremont, California, US",37.550202,-121.98083
4,SudOuest,"Road safety: driving in flip-flops, barefoot o...","Seattle, Washington, US",47.603229,-122.33028
5,Tokyo,<p>AIM B2B Launches in Tokyo to Transform B2B ...,"Ashburn, Virginia, US",39.042765,-77.485814
6,EInvestidor,"Economic Calendar: Friday, July 11https://einv...","Seattle, Washington, US",47.603229,-122.33028
7,tychotithonus,"<p>""Engineering should eat security.""<br>-- Ca...","San Francisco, California, US",37.780077,-122.420162
8,africa_social,📢 Volleyball: All clubs have asked for youth a...,"San Francisco, California, US",37.780077,-122.420162
9,jrd_vs,<p>​:ohiru1:​だ​:blobwani_omusubi:​</p>,"San Francisco, California, US",37.780077,-122.420162
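In the run above every location was geocoded, so `dropna` removed nothing. The following minimal sketch uses made-up rows (not streamed toots) to show what happens when coordinates are missing. The `subset` variant is an alternative, not what this notebook's cell does: it drops a row only when its coordinates are missing.

```python
import pandas as pd

# Hypothetical rows: the second has no coordinates (geocoding failed),
# and the third has coordinates but no text.
rows = [
    {'username': 'a', 'text': 'hello', 'latitude': 39.04, 'longitude': -77.49},
    {'username': 'b', 'text': 'no coordinates'},
    {'username': 'c', 'text': None, 'latitude': 35.69, 'longitude': 139.69},
]
df_demo = pd.DataFrame(rows)

# Plain dropna() removes every row that has any missing value.
strict = df_demo.dropna()

# Restricting the check to the coordinate columns keeps row 'c'.
plottable = df_demo.dropna(subset=['latitude', 'longitude'])

print(len(df_demo), len(strict), len(plottable))
```

Whether keeping rows with missing text is desirable depends on how later cells handle that column.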


### Creating a Map with Folium
Create a folium Map on which we’ll plot the toot locations

Note: We used **Stamen map tiles** in our book: 
* Stamen recently lost their funding, which they used to maintain their own servers. 
* The `folium` maintainers have decided not to support Stamen map tiles directly moving forward, but you can specify a custom map tile link with `folium`.
* Stamen has begun a new partnership with stadiamaps.com, which requires an API key. 
* See https://stadiamaps.com/stamen/ for details on setting up a free account and getting an API key. 
* In `keys_mastodon.py`, provide your API key in the `stadia_key` variable. 
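A minimal sketch of building the custom tile link described above. The URL pattern is the one in this notebook's commented-out cell. The helper name `stadia_tile_url` and its `style` parameter are hypothetical, and the `api_key` query parameter should be confirmed against the Stadia Maps documentation.

```python
def stadia_tile_url(api_key, style='stamen_terrain'):
    """Return a Leaflet-style tile URL template with the API key appended.

    The doubled braces keep {z}, {x} and {y} as literal placeholders,
    which the mapping library fills in for each tile it requests.
    """
    base = f'https://tiles.stadiamaps.com/tiles/{style}/{{z}}/{{x}}/{{y}}@2x.png'
    return f'{base}?api_key={api_key}'


# 'YOUR-KEY' is a placeholder, not a real key.
print(stadia_tile_url('YOUR-KEY'))
```

The result could be passed as the `tiles` argument of `folium.Map`, together with an `attr` attribution string, which folium requires for custom tile URLs.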

In [65]:
import keys_mastodon
import folium

In [None]:
#usmap = folium.Map(location=[39.8283, -98.5795], 
#    tiles='Stamen Terrain', zoom_start=4, detect_retina=True) 

In [None]:
#base_tile_url = 'https://tiles.stadiamaps.com/tiles/stamen_terrain/{z}/{x}/{y}@2x.png'
#tile_url = f'{base_tile_url}?api_key={keys_mastodon.stadia_key}'

In [66]:
usmap = folium.Map(location=[39.8283, -98.5795], 
    zoom_start=4, detect_retina=True)  

In [None]:
usmap = folium.Map(location=[39.8283, -98.5795], 
    #tiles=tile_url,
    #attr='Map tiles by Stamen Design, under CC BY 4.0. Data by OpenStreetMap, under ODbL.',
    zoom_start=4, detect_retina=True)  

* `location` keyword argument specifies a sequence containing latitude and longitude coordinates for the **map’s center point** 
    * The values in this snippet are the **geographic center of the continental United States**
    * In many places worldwide, the term `'football'` describes the sport we call soccer in the U.S., so some of the toots we plot may be outside the U.S.
    * You can zoom using the **+** and **–** buttons at the map’s top-left, or you can drag the map with the mouse (that is, pan) to see anywhere in the world
* `zoom_start` keyword argument specifies the map’s initial zoom level; lower values show more of the world
* `detect_retina` keyword argument enables folium to detect high-resolution screens to use higher-resolution maps from `OpenStreetMap.org`

### Creating Popup Markers for the Toot Locations
* Create `folium` `Popup` objects containing each toot’s text and add them to the `Map`
* `DataFrame` method `itertuples` creates a named tuple from each row containing properties corresponding to each `DataFrame` column

In [67]:
for t in df.itertuples():
    text = ''.join(['<p>' + t.username + '</p>', t.text if t.text else ''])
    popup = folium.Popup(text)
    marker = folium.Marker((t.latitude, t.longitude), popup=popup)
    marker.add_to(usmap)

* Creates a string (`text`) containing the user’s `username` and toot `text` 
* Creates a `folium` `Popup` to display the `text`
* Creates a `folium` `Marker`
    * tuple to specify the `Marker`’s latitude and longitude
    * `popup` keyword argument associates the toot’s `Popup` object with the new `Marker`
* Calls the `Marker`’s `add_to` method to specify the `Map` that will display the `Marker`
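To make the named-tuple access concrete, here is a small sketch that iterates a made-up `DataFrame` with `itertuples` and collects the coordinates and popup HTML for each row, without creating any folium objects. The data and the `popups` list are illustrative only.

```python
import pandas as pd

# Hypothetical data with the same columns the loop above relies on.
df_demo = pd.DataFrame({
    'username': ['a', 'b'],
    'text': ['<p>hi</p>', ''],
    'latitude': [1.0, 2.0],
    'longitude': [3.0, 4.0],
})

popups = []
for row in df_demo.itertuples():
    # Each column is available as an attribute of the named tuple;
    # row.Index holds the DataFrame index value.
    html = f'<p>{row.username}</p>{row.text or ""}'
    popups.append(((row.latitude, row.longitude), html))

for coordinates, html in popups:
    print(coordinates, html)
```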

### Saving the Map
* Call the `Map`’s `save` method to store the map in an HTML file, which you can then double-click to open in your web browser

In [None]:
usmap.save('toot_map.html')

In [68]:
usmap # displays the map in the notebook

## Class `LocationListener` (in locationlistener.py)
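`locationlistener.py` is not reproduced in this notebook, so the following is only a sketch of the bookkeeping such a listener might perform for each streamed toot, written without the streaming machinery so it can run on its own. The function name `record_toot` and the `locate` callable are hypothetical. `account.username`, `content` and `url` are standard fields of a Mastodon status.

```python
def record_toot(status, counts, toots, limit, locate):
    """Store one streamed status; return True while more toots are wanted.

    status -- dict-like Mastodon status
    counts -- dict with 'total_toots' and 'locations' keys
    toots  -- list that receives one dict per located toot
    limit  -- number of located toots to collect
    locate -- hypothetical callable mapping a toot URL to a location
              string, or None if no location can be found
    """
    counts['total_toots'] += 1
    location = locate(status['url'])

    if location:
        counts['locations'] += 1
        username = status['account']['username']
        toots.append({'username': username,
                      'text': status['content'],
                      'location': location})
        print(f"{counts['locations']:2}: {username}")  # progress display

    return counts['locations'] < limit


# Demonstration with a made-up status and a stand-in locate function.
demo_counts = {'total_toots': 0, 'locations': 0}
demo_toots = []
demo_status = {'account': {'username': 'ann'},
               'content': '<p>hello</p>',
               'url': 'https://example.social/@ann/1'}
keep_going = record_toot(demo_status, demo_counts, demo_toots, 1,
                         lambda url: 'Tokyo, Tokyo, JP')
```

In a Mastodon.py `StreamListener` subclass, `on_update(self, status)` could call a function like this and close the stored stream handle once it returns `False`. That wiring is an assumption about how the class works, not a description of the actual file.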

## Utility Functions in `tootutilities.py` 

### Utility Function `get_domain_location_from_url`  
* Receives the URL of a toot, extracts the domain name, looks up the domain's IP address, then uses the free tier of the ipgeolocation.io API to look up the geographic location of that IP address
* Get a key for the ipgeolocation.io web service at: https://ipgeolocation.io/signup.html
    * Store it in `ipgeolocation_key` in `keys_mastodon.py`
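The first two steps described above (extracting the domain and resolving its IP address) can be sketched with the standard library alone. The helper names `domain_from_url` and `ip_for_domain` are hypothetical, and the final ipgeolocation.io request is deliberately left out because its endpoint and response fields should be taken from that service's documentation.

```python
import socket
from urllib.parse import urlparse


def domain_from_url(url):
    """Return the host name in a toot URL, or None if there is none."""
    return urlparse(url).hostname


def ip_for_domain(domain):
    """Resolve a domain to an IPv4 address string, or None on failure."""
    try:
        return socket.gethostbyname(domain)
    except socket.gaierror:
        return None


# The URL below comes from the sample toot shown earlier in this section.
print(domain_from_url('https://channels.im/tags/NFL'))
```

Calling `ip_for_domain` performs a DNS lookup, so it needs network access. The resulting address is what would be sent to the geolocation service.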

### `get_geocodes` Utility Function 
* Receives a list of dictionaries containing toots and **geocodes their server locations**
* If geocoding is successful for a toot, adds the **latitude** and **longitude** to the toot’s **dictionary in `toot_list`**
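A minimal sketch of a function with the behavior described above, assuming the geocoder is passed in as a callable that returns an object with `latitude` and `longitude` attributes, or `None` when a place cannot be found. This is not the code in `tootutilities.py`. The `geocode` parameter, the cache and the stand-in `fake_geocode` are illustrative choices.

```python
from types import SimpleNamespace


def get_geocodes(toot_list, geocode):
    """Add 'latitude' and 'longitude' to each toot dict where possible.

    Returns the number of toots whose location could not be geocoded.
    """
    bad_locations = 0
    cache = {}  # location string -> geocoder result, to avoid repeat lookups

    for toot in toot_list:
        place = toot.get('location')
        if place not in cache:
            cache[place] = geocode(place) if place else None

        result = cache[place]
        if result is None:
            bad_locations += 1
        else:
            toot['latitude'] = result.latitude
            toot['longitude'] = result.longitude

    return bad_locations


def fake_geocode(place):
    """Stand-in geocoder for demonstration; it knows one place only."""
    known = {'Tokyo, Tokyo, JP':
             SimpleNamespace(latitude=35.6895, longitude=139.6917)}
    return known.get(place)


demo_toots = [{'username': 'a', 'location': 'Tokyo, Tokyo, JP'},
              {'username': 'b', 'location': 'Nowhere'},
              {'username': 'c', 'location': 'Tokyo, Tokyo, JP'}]
bad = get_geocodes(demo_toots, fake_geocode)
print(bad, demo_toots[0])
```

With geopy installed, the `geocode` method of a geocoder such as `Nominatim(user_agent=...)` could be passed in place of `fake_geocode`. Because many toots share a server location, as the output earlier in this section shows, caching results keeps the number of web service requests low.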

<!--
# More Info 
* See Lesson 12 in [**Python Fundamentals LiveLessons** here on O'Reilly Online Learning](https://learning.oreilly.com/videos/python-fundamentals/9780135917411)
* See Chapter 12 in [**Python for Programmers** on O'Reilly Online Learning](https://learning.oreilly.com/library/view/python-for-programmers/9780135231364/)
* See Chapter 13 in [**Intro Python for Computer Science and Data Science** on O'Reilly Online Learning](https://learning.oreilly.com/library/view/intro-to-python/9780135404799/)
* Interested in a print book? Check out:

| Python for Programmers<br>(640-page professional book) | Intro to Python for Computer<br>Science and Data Science<br>(880-page college textbook)
| :------ | :------
| <a href="https://amzn.to/2VvdnxE"><img alt="Python for Programmers cover" src="../images/PyFPCover.png" width="150" border="1"/></a> | <a href="https://amzn.to/2LiDCmt"><img alt="Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and the Cloud" src="../images/IntroToPythonCover.png" width="159" border="1"></a>

>Please **do not** purchase both books&mdash;_Python for Programmers_ is a subset of _Intro to Python for Computer Science and Data Science_
-->

------
&copy;1992&ndash;2024 by Pearson Education, Inc. All Rights Reserved. This content is based on Chapter 12 of the book [**Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and the Cloud**](https://amzn.to/2VvdnxE).

DISCLAIMER: The authors and publisher of this book have used their 
best efforts in preparing the book. These efforts include the 
development, research, and testing of the theories and programs 
to determine their effectiveness. The authors and publisher make 
no warranty of any kind, expressed or implied, with regard to these 
programs or to the documentation contained in these books. The authors 
and publisher shall not be liable in any event for incidental or 
consequential damages in connection with, or arising out of, the 
furnishing, performance, or use of these programs.                  