# <center>An Analysis of Transformative Works Created by Fans of the Harry Potter Series in Reaction to the Author's Public Political Comments</center>

## <center>Project completed by Kymberlee McMaster on May 16th, 2022</center>

### <center>Introduction</center>

Fans are often known for using their talents and dedication to create new media based on the things they enjoy. One of the best examples of this is fanfiction, the practice in which amateur authors take aspects of original content that they enjoyed and transform them into works of their own creation. Authors share these works with other fans of the original media in various ways, but one of the most common is to post them to a site dedicated to the posting and reading of fanfiction. While quite a few options are available, we'll be focusing on Archive of Our Own, known colloquially as AO3, because its built-in tagging and data storage system allows us to search through works using the authors' own tags rather than attempting to create tags ourselves.

However, since AO3 currently hosts over nine million works, we'll narrow our analysis of the site's data and of fanfiction authors' trends to writings by fans of a specific piece of media: the Harry Potter series written by J.K. Rowling.[[1]](https://archiveofourown.org/works/search?work_search%5Bquery%5D=) Additionally, because there are over 300,000 works for that series alone, we'll focus specifically on the fanfiction written around a particular date.

On June 6th of 2020, author J.K. Rowling took to Twitter to express her displeasure over the use of the phrase "people who menstruate" rather than the word women.[[2]](https://www.glamour.com/story/a-complete-breakdown-of-the-jk-rowling-transgender-comments-controversy) This tweet and the tweets that followed it drew considerable backlash from trans activists and fans of the Harry Potter series. This was not the first time that Rowling had expressed such views and received backlash, but it is one of the most notable. We will therefore analyze works of fanfiction posted to AO3 in the two weeks before the tweet and the two weeks following it to view the potential impact that Rowling's posts may have had on the writings of LGBTQIA+ community members and their allies.

<b>An Important Note:</b> Content posted on AO3, while subject to AO3's terms of service, is not policed for the actual content itself. AO3's dedication to protecting the authors who post on the site means that works can contain a wide array of mature content, as most of its author-protection efforts are centered on shielding authors from backlash by the original owners of the potentially trademarked intellectual property. Per their own site: "The Archive does not prescreen for content. Complaints are investigated only when they are submitted through the appropriate channels and with the appropriate information."[[3]](https://archiveofourown.org/tos#content) Users are expected to police their own media consumption through the use of the built-in tagging system. This means that some of the information we collect about fics for this project may mention or allude to mature themes.

### <center>Data Collection</center>

As AO3 does not have a built-in API, we will need to build our own method of scraping the data found on the site. To collect the data and avoid unnecessary scraping, we'll use AO3's built-in search function to pre-search for works that were created between our dates of interest: May 23rd, 2020 and June 19th, 2020. We do this by accessing the Works Search page located [here](https://archiveofourown.org/works/search), entering our parameters into the Any Field search box: created_at:["2020-05-23" TO "2020-06-19"], and selecting the Harry Potter - J. K. Rowling fandom and the English language option. This generates the link that our data scraper will use to gather the information about the works, located [here](https://archiveofourown.org/works/search?commit=Search&page=1&work_search[bookmarks_count]=&work_search[character_names]=&work_search[comments_count]=&work_search[complete]=&work_search[creators]=&work_search[crossover]=&work_search[fandom_names]=Harry+Potter+-+J.+K.+Rowling&work_search[freeform_names]=&work_search[hits]=&work_search[kudos_count]=&work_search[language_id]=en&work_search[query]=created_at%3A[%222020-05-23%22+TO+%222020-06-19%22]&work_search[rating_ids]=&work_search[relationship_names]=&work_search[revised_at]=&work_search[single_chapter]=0&work_search[sort_column]=created_at&work_search[sort_direction]=asc&work_search[title]=&work_search[word_count]=).
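
The search link above can also be assembled programmatically rather than copied from the browser. The following is a minimal sketch, not the method used in this project: the helper name `search_url` is hypothetical, and it assumes that AO3 treats omitted search fields the same way as the blank ones in the generated link, which has not been verified here.

```python
from urllib.parse import urlencode

BASE = "https://archiveofourown.org/works/search"

def search_url(page, start="2020-05-23", end="2020-06-19"):
    """Build the works-search URL for one results page (hypothetical helper)."""
    params = {
        "commit": "Search",
        "page": page,
        "work_search[fandom_names]": "Harry Potter - J. K. Rowling",
        "work_search[language_id]": "en",
        # Same date-range query that was typed into the search form
        "work_search[query]": f'created_at:["{start}" TO "{end}"]',
        "work_search[sort_column]": "created_at",
        "work_search[sort_direction]": "asc",
    }
    # urlencode percent-encodes brackets, quotes, and colons, and turns spaces into '+'
    return BASE + "?" + urlencode(params)

print(search_url(1))
```

Building the URL from a dictionary keeps the date range and page number in one place, which makes it easier to change the window of interest later.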

Looking at our search results, we see that we will be scraping information about 3,523 works that were created in the four-week period we've identified and are publicly available without an account. Below, we first import the libraries necessary for this project.

In [1]:
# Import the libraries necessary to complete this project 
import requests
import math
from bs4 import BeautifulSoup 
import csv
import re 
import random
import time
import pandas as pd 
import numpy as np 
from datetime import datetime 


import json 
import os.path 
import matplotlib.pyplot as plt 

With our libraries imported, we can start by initializing some of the necessary variables and setting up the CSV file we will use to temporarily store all of this data. We know that there are 3,523 works to be consumed and that 20 works are displayed on each search page, so we'll need to request information from 177 pages (3,523 / 20, rounded up). We split the URL into the parts before and after the page number so that we can complete our requests through an iterative process that automatically updates the page number. Then we initialize a new file to store the scraped content, since we'll be collecting it page by page, which would make it difficult to store as a pandas dataframe right off the bat.

In [2]:
#Split the URL into parts and store the current page number with those parts 
urlpt1 = "https://archiveofourown.org/works/search?commit=Search&page="
currpagenum = 1
urlpt2 = "&work_search[bookmarks_count]=&work_search[character_names]=&work_search[comments_count]=&work_search[complete]=&work_search[creators]=&work_search[crossover]=&work_search[fandom_names]=Harry+Potter+-+J.+K.+Rowling&work_search[freeform_names]=&work_search[hits]=&work_search[kudos_count]=&work_search[language_id]=en&work_search[query]=created_at%3A[%222020-05-23%22+TO+%222020-06-19%22]&work_search[rating_ids]=&work_search[relationship_names]=&work_search[revised_at]=&work_search[single_chapter]=0&work_search[sort_column]=created_at&work_search[sort_direction]=asc&work_search[title]=&work_search[word_count]="

#Identify the number of works and pages that the scraper will need to iterate through
works = 3523
pages = math.ceil(works/20)

#Initiate a new file to store the basic content from the scraping
header = ['Title', 'Author', 'ID', 'Date_updated', 'Rating', 'Pairing', 'Warning', 'Complete', 'Language', 'Word_count', 'Num_chapters', 'Num_comments', 'Num_kudos', 'Num_bookmarks', 'Num_hits', 'Tags', 'Summary']
#newline='' stops the csv module from writing blank rows between records on Windows
with open('storedbasic.csv','w', encoding='utf8', newline='') as storedbasic:
    writer = csv.writer(storedbasic)
    writer.writerow(header)

Now that we've completed some of our basic setup, we can begin to design the functions we'll need to scrape the data out of each page. First, we have to inspect the page to see how its data is stored. Using the browser's developer tools, we can see that the search results are displayed in an element with the class "works index group", and each work is a list item below that with the role "article".

<center><b>Page Inspection Using Web Developer Tools</b></center>
<center><img src="PageInspect.png"></center>


We'll use that information to define our helper function that will take the BeautifulSoup from a page, add the relevant information about the work to various lists and then store those lists into the CSV file that we previously initialized. 

There are a couple of interesting things to note about the information we are choosing to gather. AO3 has built-in site protections that display a simple Retry Later message if one person makes too many requests to the site at a time. As such, we are only scraping the information that is available from the works search page, since that means we only request the pages of search results rather than the works' own pages. One example of information that could have been gathered from the works' pages is the comments associated with each work. AO3 allows author-specific comments sections, in which the author can choose whether to allow comments and, if so, whether those comments need to come from someone who holds an account with the site. While these comments could have held some interesting information, we are already dealing with an extremely large amount of data, and scraping the comments sections would have required requesting an individual webpage for each of the 3,523 works that we have gathered data on.

In [3]:

#Function to gather all data 
def basicdata(mysoup): 
    #Initialize a set of variables to store all titles and info for page to add to the CSV all at once 
    titles = []
    authors = []
    ids = []
    date_updated = []
    ratings = []
    pairings = []
    warnings = []
    complete = []
    languages = []
    word_count = []
    chapters = []
    comments = []
    kudos = []
    bookmarks = []
    hits = []
    tags = []
    summary = []
    
    for article in mysoup.find_all('li', {'role':'article'}):
        titles.append(article.find('h4', {'class':'heading'}).find('a').text)
        #find() returns None when an element is missing, so .text raises AttributeError
        try:
            authors.append(article.find('a', {'rel':'author'}).text)
        except AttributeError:
            authors.append('Anonymous')
        #The href has the form /works/<id>, so drop the leading '/works/'
        ids.append(article.find('h4', {'class':'heading'}).find('a').get('href')[7:])
        date_updated.append(article.find('p', {'class':'datetime'}).text)
        ratings.append(article.find('span', {'class':re.compile(r'rating\-.*rating')}).text)
        pairings.append(article.find('span', {'class':re.compile(r'category\-.*category')}).text)
        warnings.append(article.find('span', {'class':re.compile(r'warning\-.*warnings')}).text)
        complete.append(article.find('span', {'class':re.compile(r'complete\-.*iswip')}).text)
        languages.append(article.find('dd', {'class':'language'}).text)
        tags.append(article.find('ul', {'class':'tags commas'}).text)
        count = article.find('dd', {'class':'words'}).text
        if len(count) > 0:
            word_count.append(count)
        else:
            word_count.append('0')
        chapters.append(article.find('dd', {'class':'chapters'}).text.split('/')[0])
        #Statistics with no value yet are omitted from the page, so default them to '0'
        try:
            comments.append(article.find('dd', {'class':'comments'}).text)
        except AttributeError:
            comments.append('0')
        try:
            kudos.append(article.find('dd', {'class':'kudos'}).text)
        except AttributeError:
            kudos.append('0')
        try:
            bookmarks.append(article.find('dd', {'class':'bookmarks'}).text)
        except AttributeError:
            bookmarks.append('0')
        try:
            hits.append(article.find('dd', {'class':'hits'}).text)
        except AttributeError:
            hits.append('0')
        #try: 
            #tags.append(article.find('span', {'class':re.compile(r'freeforms\-.*freeforms')}).text)
        #except: 
            #tags.append(' ')
        try:
            summary.append(article.find('blockquote', {'class':'userstuff summary'}).text)
        except AttributeError:
            summary.append(' ')
            
            
    df = pd.DataFrame(list(zip(titles, authors, ids, date_updated, ratings, pairings,\
                              warnings, complete, languages, word_count, chapters,\
                               comments, kudos, bookmarks, hits, tags, summary)))
    
    #newline='' avoids blank rows between records when appending on Windows
    with open('storedbasic.csv','a', encoding='utf8', newline='') as storedbasic:
        df.to_csv(storedbasic, header=False, index=False)
    

With our helper function for the basic data defined, we can now iterate through the pages of search results and gather the basic data into the CSV file previously created. As previously stated, due to AO3's built-in site protections, we will scrape in increments of pages and pause between each set to ensure that we can gather all of the data we are requesting. For our purposes, we use an increment of 100 pages, a round number that the site regularly accepted, followed by a pause of 10 minutes so that enough time has elapsed for the site to accept the next round of 100 pages.

In [4]:
#Reset page number in case anything has gotten messed up with the block
currpagenum = 1

#This for loop will iterate through the pages and add the basic data to the basic data table.
#Each page is requested and parsed inside the loop, so no separate request is made beforehand
#(an extra request here would only count against AO3's request limit).
for i in range(1, pages + 1): 
    
    url = urlpt1 + str(currpagenum) + urlpt2 
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    
    basicdata(soup) 
    
    currpagenum += 1
    
    if (i % 100) == 0 : 
        print("Taking a break from parsing. Current page count is: " + str(currpagenum - 1) )
        print("Will resume in 10 min")
        time.sleep(600)
        print("Wait time is over, will resume parsing now.")
    
print("Parsing has finished, the remainder of basic data has been consumed")

Taking a break from parsing. Current page count is: 100
Will resume in 10 min
Wait time is over, will resume parsing now.
Parsing has finished, the remainder of basic data has been consumed


The page parsing prints several status messages that let us see how far the parsing has progressed. Because the run includes a 10-minute break, we want to be able to confirm that everything is still processing and moving along smoothly.
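
Fixed pauses work when the site's limits are predictable. As a complement, a request could be retried whenever the site signals that it is being rate limited. The sketch below is one hedged way to do that and was not part of the original scraping run. It assumes that AO3's Retry Later response arrives with HTTP status 429, which should be confirmed before relying on it. The function name `fetch_with_retry` and its parameters are hypothetical, and the `fetch` argument is expected to behave like `requests.get`.

```python
import time

def fetch_with_retry(url, fetch, max_attempts=5, base_delay=60, sleep=time.sleep):
    """Call fetch(url) until the response is not rate limited.

    fetch is any callable that returns an object with a status_code
    attribute, for example requests.get. Status 429 is ASSUMED to be
    how the site reports "Retry Later".
    """
    for attempt in range(max_attempts):
        response = fetch(url)
        if response.status_code != 429:
            return response
        if attempt < max_attempts - 1:
            # Wait longer after each rate-limited attempt: 60s, 120s, 240s, ...
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"Still rate limited after {max_attempts} attempts: {url}")
```

Inside the scraping loop this would replace the direct request, for example `page = fetch_with_retry(url, requests.get)`. Passing `fetch` and `sleep` as arguments keeps the function easy to try out without contacting the site.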

With all of the data parsed into a CSV file, we can now use the built in pandas functionality to create a pandas dataframe that we can use to manipulate our data. 

In [6]:
#Use read_csv to read the data stored in the CSV files into pandas dataframes
AO3 = pd.read_csv("storedbasic.csv")

#Display the final dataframe
display(AO3)

Unnamed: 0,Title,Author,ID,Date_updated,Rating,Pairing,Warning,Complete,Language,Word_count,Num_chapters,Num_comments,Num_kudos,Num_bookmarks,Num_hits,Tags,Summary
0,This Love I Have Inside,stargazing_dreamer_girl,24329029,23 May 2020,General Audiences,F/M,No Archive Warnings Apply,Complete Work,English,9403,1,0,37,1,612,\nNo Archive Warnings ApplyFred Weasley/Origin...,"\nClara Comder, student at Hogwarts School of ..."
1,Spelling It Out,Nocturnal_Daydreams,24329044,23 May 2020,Not Rated,"F/M, Other",Choose Not To Use Archive Warnings,Complete Work,English,29407,1,14,213,32,5338,\nCreator Chose Not To Use Archive WarningsHer...,\nTattoos or Birthmarks of your soulmates firs...
2,Untamed Journey,Jetainia,24329050,23 May 2020,General Audiences,F/M,No Archive Warnings Apply,Complete Work,English,1612,1,0,17,1,240,\nNo Archive Warnings ApplyHelga Hufflepuff/Sa...,\nThe roaming Hogwarts saloon is a place of ha...
3,The Quill,xslytherclawx,24329686,23 May 2020,Teen And Up Audiences,M/M,No Archive Warnings Apply,Complete Work,English,586,1,4,68,2,664,\nNo Archive Warnings ApplyDraco Malfoy/Harry ...,\nHarry's always wondered about Draco's golden...
4,Harry Potter Bending Force,Gman85,24329902,19 Jan 2022,Explicit,F/M,Underage,Work in Progress,English,128631,17,104,286,133,21619,\nUnderageOther Relationship Tags to Be Added ...,\nHarry was woefully unprepared for the Tri-Wi...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3518,Ante Astra,Evandar,24812620,19 Jun 2020,Mature,M/M,Underage,Complete Work,English,3126,1,38,825,67,7871,\nUnderageRegulus Black/Sirius Black Alphard B...,\nThe morning after the night before. Sirius i...
3519,The Slytherin Scarf,FateRestarting,24812632,19 Jun 2020,Teen And Up Audiences,F/M,No Archive Warnings Apply,Complete Work,English,2677,1,38,404,43,3622,\nNo Archive Warnings ApplyHermione Granger/Dr...,\nA prefect's round that ends in the usual arg...
3520,Muggle Robbery,Axelle_Sof,24812779,19 Jun 2020,General Audiences,M/M,Choose Not To Use Archive Warnings,Complete Work,English,1522,1,0,43,3,780,\nCreator Chose Not To Use Archive WarningsDra...,\nHarry and Draco go on a date. But not togeth...
3521,Forged in Fire,torino10154,24812839,20 Jun 2020,Explicit,M/M,"Rape/Non-Con, Underage",Complete Work,English,857,1,23,753,50,43553,\nRape/Non-Con UnderageHarry Potter/Quirinus Q...,
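
In the Tags column displayed above, the individual tags run together without a separator (for example, "No Archive Warnings ApplyFred Weasley/Origin..."), because the text of the whole tag list was read at once. A possible alternative, sketched below, reads each list item separately. The sample markup is a hypothetical, simplified stand-in for one search result rather than a copy of AO3's real HTML, and `extract_tags` is an illustrative name.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modeled on one search result's tag list
sample_html = """
<ul class="tags commas">
  <li class="warnings"><a class="tag">No Archive Warnings Apply</a></li>
  <li class="relationships"><a class="tag">Hermione Granger/Luna Lovegood</a></li>
  <li class="freeforms"><a class="tag">Trans Hermione Granger</a></li>
</ul>
"""

def extract_tags(article):
    """Return the text of each tag in a work's tag list, one entry per tag."""
    tag_list = article.find('ul', class_='tags')
    if tag_list is None:
        return []
    return [li.get_text(strip=True) for li in tag_list.find_all('li')]

soup = BeautifulSoup(sample_html, 'html.parser')
tags = extract_tags(soup)

# Join with an explicit delimiter (an arbitrary choice) so tags stay separable in the CSV
joined = ' | '.join(tags)
print(joined)
```

Keeping the tags separable makes later keyword searches less likely to match text that spans two adjacent tags.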


### <center>Data Processing</center>

We now have a single dataframe that contains all relevant information about fanfiction works that were created between May 23rd, 2020, and June 19th, 2020. However, this is the raw data, so we first want to check that our dataframe is storing the data as the correct types.

In [7]:
#Display the data frame columns with their associated types
AO3.dtypes

Title            object
Author           object
ID                int64
Date_updated     object
Rating           object
Pairing          object
Complete         object
Language         object
Word_count       object
Num_chapters      int64
Num_comments      int64
Num_kudos         int64
Num_bookmarks     int64
Num_hits          int64
Tags             object
Summary          object
dtype: object

As we see above, most of the data is stored properly: the count columns are int64 types and the string columns are objects. Two columns are not. Date_updated is stored as a string object rather than a date type, and Word_count is stored as an object even though it holds numbers, so it would need converting (for example with pd.to_numeric) before any numeric analysis. Below we update the dataframe so that Date_updated is stored correctly and can be used properly in our data analysis.

In [9]:
#Update the dataframe so that Date_updated is properly stored as a datetime object
AO3['Date_updated']= pd.to_datetime(AO3['Date_updated'])

#Display the updated types of the dataframe
AO3.dtypes

Title                    object
Author                   object
ID                        int64
Date_updated     datetime64[ns]
Rating                   object
Pairing                  object
Complete                 object
Language                 object
Word_count               object
Num_chapters              int64
Num_comments              int64
Num_kudos                 int64
Num_bookmarks             int64
Num_hits                  int64
Tags                     object
Summary                  object
dtype: object
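
With dates stored as a proper type, each work can be assigned to the two weeks before or the two weeks after the June 6th tweet. The sketch below shows one way to label a date. The function name `label_period` is hypothetical, and counting the day of the tweet itself as "after" is an assumption. Note also that the scraped column is the date a work was last updated, and the displayed data includes a row updated on 19 Jan 2022, so some works created inside the window carry dates outside it.

```python
from datetime import date

TWEET_DATE = date(2020, 6, 6)
WINDOW_START = date(2020, 5, 23)   # two weeks before the tweet
WINDOW_END = date(2020, 6, 19)     # last day of the two weeks that begin on the tweet date

def label_period(day):
    """Return 'before', 'after', or 'outside' for a calendar date."""
    if day < WINDOW_START or day > WINDOW_END:
        return 'outside'
    # ASSUMPTION: the day of the tweet is counted as 'after'
    return 'before' if day < TWEET_DATE else 'after'

for d in (date(2020, 5, 23), date(2020, 6, 6), date(2022, 1, 19)):
    print(d, label_period(d))
```

Applied to the dataframe, this could look like `AO3['Date_updated'].dt.date.map(label_period)`, with works labeled 'outside' reviewed separately.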

Returning to our original purpose, we want to see how Rowling's tweets may have affected the posting habits of fans of her works who belong to the LGBTQIA+ community. As AO3 contains transformative works, and the original source material does not contain any LGBTQIA+ representation, the best way to view any potential impact is to identify whether a change occurred in the number of fanfiction works posted with LGBTQIA+ content, especially since authors may have posted retaliatory works in which they transformed one or more of the canon characters to be queer. Below is one such example, in which AO3 author ughdotcom posted a work and specified in their notes that it was a response to something J.K. Rowling had previously expressed.[[4]](https://archiveofourown.org/works/22817488)

<center><b>ughdotcom's Anti Terf Posting</b></center>
<center><img src="ughdotcom.png"></center>


Therefore, to best analyze potential responses by fanfiction authors, we will search the tags for various queer identities and flag their presence so that we can use those flags to identify rates of posting. As seen above, the tags may contain markers before canon character names that indicate a change to that character. In this particular work, the character Hermione Granger is trans and Luna Lovegood is genderqueer.

As gender and sexual identity expression is constantly changing and evolving, we will attempt to flag for most major sexual and gender identities, but there is the possibility that we may miss some. Our flags will center around the following groups who are members of the LGBTQIA+ community: lesbian, gay, bisexual, trans, nonbinary, queer, intersex, asexual, agender, and aromantic. 

In [10]:
#Flag for presence of Lesbian


#Flag for presence of Gay 


#Flag for presence of Bisexual 


#Flag for presence of Trans 


#Flag for presence of Nonbinary


#Flag for presence of Queer 


#Flag for presence of Intersex 


#Flag for presence of Asexual 


#Flag for presence of Agender 


#Flag for presence of Aromantic 
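
The flags outlined in the cell above could be filled in with simple keyword matching. The following is only a sketch under stated assumptions, not the project's final flagging logic: the patterns in `FLAG_PATTERNS` and the function name `flag_tags` are hypothetical, and the spellings covered for each identity would need checking against the tags that actually occur in the data. The Trans pattern is written to skip words such as "Transfiguration", which is common in this fandom.

```python
import re

# Hypothetical patterns: the spellings covered for each identity are assumptions
FLAG_PATTERNS = {
    'Lesbian': r'lesbian',
    'Gay': r'gay',
    'Bisexual': r'bisexual',
    # 'trans' as a whole word or a known compound, but not e.g. 'Transfiguration'
    'Trans': r'trans(?![a-z])|transgender|trans(?:masc|fem)',
    'Nonbinary': r'non[- ]?binary|enby',
    'Queer': r'queer',
    'Intersex': r'intersex',
    'Asexual': r'asexual',
    'Agender': r'agender',
    'Aromantic': r'aromantic',
}
COMPILED = {name: re.compile(pattern, re.IGNORECASE)
            for name, pattern in FLAG_PATTERNS.items()}

def flag_tags(tag_text):
    """Return a dict mapping each identity to True if its pattern appears in the tags."""
    return {name: bool(rx.search(tag_text)) for name, rx in COMPILED.items()}

example = flag_tags('Trans Hermione Granger, Genderqueer Luna Lovegood')
print(example['Trans'], example['Queer'], example['Gay'])
```

A flag column could then be added with something like `AO3['Is_Trans'] = AO3['Tags'].apply(lambda t: flag_tags(str(t))['Trans'])`, where `Is_Trans` is an illustrative column name. Because the patterns match substrings, the Queer flag is also set by "genderqueer", and because the scraped tags run together without separators, a tag ending in "trans" directly followed by the next tag's text would be missed.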

### <center>Exploratory Data Analysis</center>



### <center>Hypothesis Testing</center>



### <center>Conclusions</center>
