The files mentioned below are stored in the notebook's runtime and are deleted when the runtime terminates. Copies are attached on Canvas for reference.

This is the code I used to scrape the tags on Tumblr for the data to be analysed. It uses the `pytumblr` client library for the Tumblr API, following the usage notes published by the official Tumblr account: [https://github.com/tumblr/pytumblr](https://github.com/tumblr/pytumblr).

Further documentation is available at [https://pypi.org/project/PyTumblr2/](https://pypi.org/project/PyTumblr2/).

As per the requirements of `pytumblr`, this project was registered as a Tumblr application to acquire OAuth consumer keys.

Because of this API's rate limits ([https://www.tumblr.com/docs/en/api/v2#authentication](https://www.tumblr.com/docs/en/api/v2#authentication)), the code was run in sections, scraping several thousand posts for a given tag in every iteration. It was run over the course of two days (4/27/23-4/29/23) on the first 100 words among the tags trending on Tumblr (found on Canvas under the file name 'final.txt'). This list excludes proper nouns such as names of celebrities, TV shows or movies, and was compiled manually on 4/27/23 from Tumblr's trending page: [https://www.tumblr.com/explore/trending](https://www.tumblr.com/explore/trending).

Upon completion, the data to be analysed contained over 70,000 instances of the usage of these words. It can be found on Canvas under 'tumblr.txt'.
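
One way such a usage count could be checked is sketched below. It is only an illustration, not the method used for the figure above: the helper name `count_word_usage` and the sample lines are made up, and it assumes each line of the data file holds the space-joined tags of one post.

```python
from collections import Counter

def count_word_usage(lines, vocabulary):
    """Count how often each vocabulary word appears as a whitespace-separated token."""
    vocab = {word.lower() for word in vocabulary}
    counts = Counter()
    for line in lines:
        for token in line.lower().split():
            if token in vocab:
                counts[token] += 1
    return counts

# Made-up sample lines in the same shape as 'tumblr.txt' (one post per line).
sample_lines = ["art my art", "cats photography", "art cats"]
sample_counts = count_word_usage(sample_lines, ["art", "cats"])
print(sample_counts["art"], sample_counts["cats"])
```

To apply this to the real data, an open file handle for 'tumblr.txt' could be passed as `lines` and the word list read from 'final.txt' as `vocabulary`.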

In [1]:
# Install pytumblr
!pip install pytumblr

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pytumblr
  Downloading PyTumblr-0.1.2-py2.py3-none-any.whl (19 kB)
Installing collected packages: pytumblr
Successfully installed pytumblr-0.1.2


In [2]:
# Import pytumblr, and other useful libraries
import pytumblr
import calendar
import time
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [3]:
# Authenticate via OAuth.
# The credentials are read from environment variables rather than being
# hard-coded in the notebook. The variable names below are arbitrary choices.
import os

client = pytumblr.TumblrRestClient(
  os.environ['TUMBLR_CONSUMER_KEY'],
  os.environ['TUMBLR_CONSUMER_SECRET'],
  os.environ['TUMBLR_OAUTH_TOKEN'],
  os.environ['TUMBLR_OAUTH_SECRET']
)

In [4]:
# Reads the contents of the file 'final.txt'
# This file contains the 100 words under consideration.
with open('final.txt') as f:
  words = f.read()

word_list = nltk.word_tokenize(words)

In [5]:
# Scrapes as many posts as possible for each tag and appends their tags to
# 'tumblr.txt'
def scrape(tag):

  # Requests post content in its raw format.
  post_format = 'raw'

  # Only posts from before the time at which the code was run are considered.
  before = calendar.timegm(time.gmtime())

  # Opens 'tumblr.txt' in append mode; the file is closed when the block exits.
  with open('tumblr.txt', 'a') as f:

    # The following code is limited to being run 25 times due to the rate
    # limits of the Tumblr API, as discussed above.
    for _ in range(25):

      # Retrieves search results with the given tag.
      search_results = client.tagged(tag, filter=post_format, before=before)

      # Stops if the response is not a non-empty list of posts (for example,
      # an error response, or no more results for this tag).
      if not isinstance(search_results, list) or not search_results:
        break

      # Iterates through the search results, appending their tags to the file.
      for post in search_results:
        f.write(' '.join(post['tags']))
        f.write('\n')

      # Records the time of the last post in this iteration, so that new posts
      # are considered in the following one.
      before = search_results[-1]['timestamp']
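
The timestamp-based paging used in `scrape` can be tried without network access. The following is a minimal sketch under stated assumptions: `iter_tagged_posts` and `fake_fetch` are hypothetical names, and `fake_fetch` is a stand-in that returns at most three made-up posts older than the given timestamp, not the real Tumblr API.

```python
def iter_tagged_posts(fetch_page, before, max_pages=25):
    """Yield posts page by page, moving the 'before' cutoff back each time."""
    for _ in range(max_pages):
        page = fetch_page(before)
        # Stop on anything that is not a non-empty list of posts.
        if not isinstance(page, list) or not page:
            break
        for post in page:
            yield post
        # The last post of the page becomes the cutoff for the next request.
        before = page[-1]['timestamp']

# Made-up posts with timestamps 10 down to 1, newest first.
FAKE_POSTS = [{'timestamp': t, 'tags': ['tag%d' % t]} for t in range(10, 0, -1)]

def fake_fetch(before):
    """Stand-in for the API: up to three posts strictly older than 'before'."""
    return [p for p in FAKE_POSTS if p['timestamp'] < before][:3]

two_pages = [p['timestamp'] for p in iter_tagged_posts(fake_fetch, 11, max_pages=2)]
all_pages = [p['timestamp'] for p in iter_tagged_posts(fake_fetch, 11)]
print(two_pages)
print(all_pages)
```

With the real client, `fetch_page` could be something like `lambda b: client.tagged(tag, filter='raw', before=b)`, which mirrors the call made in `scrape`.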

In [8]:
# Scrapes Tumblr for each of the 100 words on our list.
for word in word_list:
  scrape(word)
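
Since the scraping was run in sections to stay within the rate limits, the loop above could also be split into batches with a pause between them. This is only a sketch of that idea, not the procedure actually followed: `run_in_batches`, the batch size and the pause length are all assumptions chosen for illustration.

```python
import time

def run_in_batches(items, process, size=10, pause_seconds=3600, sleep=time.sleep):
    """Call process(item) for every item, pausing between batches of 'size' items."""
    for start in range(0, len(items), size):
        # Pause before every batch except the first.
        if start > 0:
            sleep(pause_seconds)
        for item in items[start:start + size]:
            process(item)

# Demonstration with a stand-in sleep function, so nothing actually waits.
processed = []
pauses = []
run_in_batches(list('abcde'), processed.append, size=2, pause_seconds=5,
               sleep=pauses.append)
print(processed, pauses)
```

In the notebook this could be called as `run_in_batches(word_list, scrape)`, with `size` and `pause_seconds` adjusted to whatever the current rate limits allow.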