# <center>An Analysis of Transformative Works Created by Fans of the Harry Potter Series in Reaction to the Author's Public Political Comments</center>

## <center>Project completed by Kymberlee McMaster on May 16th, 2022</center>

### <center>Introduction</center>

Fans are often known for using their talents and dedication to create new media based on the things they enjoy. One of the best examples of this is fanfiction, in which amateur authors take aspects of original content they enjoyed and transform them into works of their own creation. Authors share these works with other fans of the original media in various ways, but one of the most common is to post them to a site dedicated to the posting and reading of fanfiction. While there are quite a few options available, we'll be focusing on Archive of Our Own, known colloquially as AO3, because its built-in tagging and data storage system allows us to search through works using the authors' own tags rather than attempting to create tags ourselves. 

However, since AO3 currently hosts over nine million works, we'll narrow our analysis of the site's data and of trends among fanfiction authors to writings by fans of a specific piece of media: the Harry Potter series written by J.K. Rowling.[[1]](https://archiveofourown.org/works/search?work_search%5Bquery%5D=) Additionally, because there are over 300,000 works for that series alone, we'll focus specifically on the fanfiction posted around a particular date. 

On June 6th of 2020, author J.K. Rowling took to Twitter to express her displeasure over the use of the phrase "people who menstruate" rather than the word "women".[[2]](https://www.glamour.com/story/a-complete-breakdown-of-the-jk-rowling-transgender-comments-controversy) This tweet and those that followed it drew considerable backlash from trans activists and fans of the Harry Potter series. This was not the first time that Rowling had expressed such views and received backlash, but it is one of the most notable, so we will be analyzing works of fanfiction posted to AO3 in the two weeks before the tweet and the two weeks following it to view the potential impact that Rowling's postings may have had on the writings of LGBTQIA+ community members and their allies. 

### <center>Data Collection</center>

As AO3 does not have a built-in API, we will need to build our own method of scraping the data found on the site. To collect the data and avoid unnecessary scraping, we'll use AO3's built-in search function to pre-filter for works that were created between our dates of interest: May 23rd, 2020 and June 19th, 2020. We do this by accessing the Works Search page located [here](https://archiveofourown.org/works/search), entering our parameters into the Any Field search box: created_at:["2020-05-23" TO "2020-06-19"], restricting the fandom to Harry Potter - J. K. Rowling, and selecting the English language option. This generates the link that our data scraper will use to gather the information about the works for us, located [here](https://archiveofourown.org/works/search?commit=Search&page=1&work_search[bookmarks_count]=&work_search[character_names]=&work_search[comments_count]=&work_search[complete]=&work_search[creators]=&work_search[crossover]=&work_search[fandom_names]=Harry+Potter+-+J.+K.+Rowling&work_search[freeform_names]=&work_search[hits]=&work_search[kudos_count]=&work_search[language_id]=en&work_search[query]=created_at%3A[%222020-05-23%22+TO+%222020-06-19%22]&work_search[rating_ids]=&work_search[relationship_names]=&work_search[revised_at]=&work_search[single_chapter]=0&work_search[sort_column]=created_at&work_search[sort_direction]=asc&work_search[title]=&work_search[word_count]=). 
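
The date window and search link described above can also be derived in code rather than copied by hand. The following is a minimal sketch using only the standard library; the helper names `search_window` and `build_search_url` are hypothetical, and it assumes AO3 treats omitted blank parameters and percent-encoded brackets the same way as the full link generated by the search form.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

# Date of the tweet discussed in the introduction
TWEET_DATE = date(2020, 6, 6)

def search_window(event, days_before=14, days_after=13):
    """Return (start, end): 14 days before the event, and 14 days starting on it."""
    return event - timedelta(days=days_before), event + timedelta(days=days_after)

def build_search_url(start, end, page=1):
    """Build a works-search URL for the Harry Potter fandom (hypothetical helper)."""
    query = f'created_at:["{start.isoformat()}" TO "{end.isoformat()}"]'
    params = {
        "commit": "Search",
        "page": page,
        "work_search[fandom_names]": "Harry Potter - J. K. Rowling",
        "work_search[language_id]": "en",
        "work_search[query]": query,
        "work_search[sort_column]": "created_at",
        "work_search[sort_direction]": "asc",
    }
    # Assumption: parameters left blank in the original link can be omitted
    return "https://archiveofourown.org/works/search?" + urlencode(params)

start, end = search_window(TWEET_DATE)
url = build_search_url(start, end, page=1)
print(start, end)
print(url)
```

Deriving the dates from the tweet date makes it easy to confirm that the window covers May 23rd through June 19th, with 14 days on either side of the tweet.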

Looking at our search results, we see that we will be scraping information about 3,523 works that were created in the four-week period we've identified and are publicly available without an account. Below, we first import the libraries necessary for this project. 

In [1]:
# Import the libraries necessary to complete this project 
import requests
import math
from bs4 import BeautifulSoup 
import csv
import re 
import random
import time
import pandas as pd 
import numpy as np 
from datetime import datetime 


import json 
import os.path 
import matplotlib.pyplot as plt 

Next, we need to inspect the page to see how the data on each page is stored. Using the browser's developer tools, we can see that the search results are displayed in a list with the class "works index group" and that each work is a list item below it with the role "article". We know that there are 3,523 works to collect and that 20 works are displayed on each search page, so we'll need to request information from 177 pages (3,523 / 20, rounded up). We split the URL into the parts before and after the page number so that we can complete our requests through an iterative process that automatically updates the page number. Then we initialize a new CSV file to store the scraped content, since we'll be collecting it page by page, which would make it difficult to store as a pandas dataframe right off the bat. 

In [2]:
#Split the URL into parts and store the current page number with those parts 
urlpt1 = "https://archiveofourown.org/works/search?commit=Search&page="
currpagenum = 1
urlpt2 = "&work_search[bookmarks_count]=&work_search[character_names]=&work_search[comments_count]=&work_search[complete]=&work_search[creators]=&work_search[crossover]=&work_search[fandom_names]=Harry+Potter+-+J.+K.+Rowling&work_search[freeform_names]=&work_search[hits]=&work_search[kudos_count]=&work_search[language_id]=en&work_search[query]=created_at%3A[%222020-05-23%22+TO+%222020-06-19%22]&work_search[rating_ids]=&work_search[relationship_names]=&work_search[revised_at]=&work_search[single_chapter]=0&work_search[sort_column]=created_at&work_search[sort_direction]=asc&work_search[title]=&work_search[word_count]="

#Identified the number of works and pages that the scraper will need to iterate through
works = 3523
pages = math.ceil(works/20)

#Initiate a new file to store the basic content from the scraping 
header = ['Title', 'Author', 'ID', 'Date_updated', 'Rating', 'Pairing', 'Warning', 'Complete', 'Language', 'Word_count', 'Num_chapters', 'Num_comments', 'Num_kudos', 'Num_bookmarks', 'Num_hits', 'Tags', 'Summary']
#newline='' keeps the csv module from writing blank rows on some platforms
with open('storedbasic.csv','w', encoding='utf8', newline='') as storedbasic:
    writer = csv.writer(storedbasic)
    writer.writerow(header)

Now that we've completed some of our basic setup, we can begin to design the functions we'll need to scrape the data out of the page. Specifically, we'll need one function to read in the content of each page. One piece of information that we could scrape but are choosing not to for this project is the actual comments on the published works. While these could contain some interesting information, we are already dealing with an extremely large amount of data, and the comments on each work have no bearing on the true aim of this project. 

In [3]:

#Function to gather all data 
def basicdata(mysoup): 
    #Initialize a set of variables to store all titles and info for page to add to the CSV all at once 
    titles = []
    authors = []
    ids = []
    date_updated = []
    ratings = []
    pairings = []
    warnings = []
    complete = []
    languages = []
    word_count = []
    chapters = []
    comments = []
    kudos = []
    bookmarks = []
    hits = []
    tags = []
    summary = []
    
    for article in mysoup.find_all('li', {'role':'article'}):
        titles.append(article.find('h4', {'class':'heading'}).find('a').text)
        try:
            authors.append(article.find('a', {'rel':'author'}).text)
        except:
            authors.append('Anonymous')
        ids.append(article.find('h4', {'class':'heading'}).find('a').get('href')[7:])
        date_updated.append(article.find('p', {'class':'datetime'}).text)
        ratings.append(article.find('span', {'class':re.compile(r'rating\-.*rating')}).text)
        pairings.append(article.find('span', {'class':re.compile(r'category\-.*category')}).text)
        warnings.append(article.find('span', {'class':re.compile(r'warning\-.*warnings')}).text)
        complete.append(article.find('span', {'class':re.compile(r'complete\-.*iswip')}).text)
        languages.append(article.find('dd', {'class':'language'}).text)
        tags.append(article.find('ul', {'class':'tags commas'}).text)
        count = article.find('dd', {'class':'words'}).text
        if len(count) > 0:
            word_count.append(count)
        else:
            word_count.append('0')
        chapters.append(article.find('dd', {'class':'chapters'}).text.split('/')[0])
        try:
            comments.append(article.find('dd', {'class':'comments'}).text)
        except:
            comments.append('0')
        try:
            kudos.append(article.find('dd', {'class':'kudos'}).text)
        except:
            kudos.append('0')
        try:
            bookmarks.append(article.find('dd', {'class':'bookmarks'}).text)
        except:
            bookmarks.append('0')
        try:
            hits.append(article.find('dd', {'class':'hits'}).text)
        except:
            hits.append('0')
        #try: 
            #tags.append(article.find('span', {'class':re.compile(r'freeforms\-.*freeforms')}).text)
        #except: 
            #tags.append(' ')
        try:
            summary.append(article.find('blockquote', {'class':'userstuff summary'}).text)
        except: 
            summary.append(' ')
            
            
    df = pd.DataFrame(list(zip(titles, authors, ids, date_updated, ratings, pairings,\
                              warnings, complete, languages, word_count, chapters,\
                               comments, kudos, bookmarks, hits, tags, summary)))
    
    with open('storedbasic.csv','a', encoding='utf8') as storedbasic:
        df.to_csv(storedbasic, header=False, index=False)
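
The function above relies on bare `except` clauses to handle fields that a work may not have. BeautifulSoup's `find` returns `None` when nothing matches, so the `.text` access raises `AttributeError`, and the bare `except` catches it along with any other error. A minimal sketch of a more explicit alternative follows; `text_or_default` is a hypothetical helper, not part of the original scraper, and the whitespace stripping is an addition. It is demonstrated on a stand-in object so that it runs without a live page.

```python
from types import SimpleNamespace

def text_or_default(node, default="0"):
    """Return the node's stripped text, or `default` if the node is missing or empty."""
    if node is None:          # find() found no matching element
        return default
    text = node.text.strip()
    return text if text else default

# Stand-ins for what article.find(...) might return (illustrative only)
found = SimpleNamespace(text=" 1,234 ")
missing = None

print(text_or_default(found))                  # text of an element that exists
print(text_or_default(missing))                # falls back to '0'
print(text_or_default(missing, "Anonymous"))   # custom fallback, as for authors
```

With a helper like this, a line such as `kudos.append(text_or_default(article.find('dd', {'class':'kudos'})))` could replace each four-line `try`/`except` block, and unexpected errors would no longer be silently swallowed.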
    

With our helper function for the basic data in place, we can now iterate through the pages of search results and gather the basic data into the CSV file previously created. Due to AO3's built-in site protections, we will scrape in increments of 100 pages and pause between increments to ensure that we can gather all of the data we are requesting from the site.  
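
The batching described above can be sketched as a small generator that splits the page range into increments and stops at the last real page. This is an illustration only; `page_batches` is a hypothetical helper, and the batch size of 100 simply mirrors the increment chosen here.

```python
import math

def page_batches(total_works, per_page=20, batch_size=100):
    """Yield ranges of 1-based page numbers, at most `batch_size` pages each."""
    total_pages = math.ceil(total_works / per_page)
    for first in range(1, total_pages + 1, batch_size):
        last = min(first + batch_size - 1, total_pages)
        yield range(first, last + 1)

batches = list(page_batches(3523))
for batch in batches:
    print(f"pages {batch[0]} to {batch[-1]} ({len(batch)} pages)")
```

Each range could drive one scraping loop, with a pause such as `time.sleep(600)` between batches.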

In [4]:
#Reset the page number so that this block can be re-run from the first page
currpagenum = 1

#This for loop will iterate through the pages and add the basic data to the basic data table 
for i in range(100): 
    
    url = urlpt1 + str(currpagenum) + urlpt2 
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    
    basicdata(soup) 
    
    currpagenum += 1
    
print("Parsing has finished, first 100 pages of basic data has been consumed, waiting 10 min before the remaining data consumption")

time.sleep(600) 

#This for loop will iterate through the remaining pages (101 to 177) and add the basic data to the basic data table 
for i in range(100, pages): 
    
    url = urlpt1 + str(currpagenum) + urlpt2 
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    
    basicdata(soup) 
    
    currpagenum += 1
    
print("Parsing has finished, the remainder of basic data has been consumed")

Parsing has finished, first 100 pages of basic data has been consumed, waiting 5 min before the remaining data consumption
Parsing has finished, the remainder of basic data has been consumed


The print statements above simply confirm that the parser has finished running. In the next code snippet, we store the information we've collected in a pandas dataframe and verify that our parsing was successful. 

In [None]:
#Use read_csv to read the data stored in the CSV files into pandas dataframes
AO3 = pd.read_csv("storedbasic.csv")

#Display the final dataframe
display(AO3)
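
Beyond displaying the dataframe, a quick check of the row count and of duplicate work IDs can help confirm that the scrape was complete. The sketch below uses only the standard `csv` module and a small in-memory sample rather than the real file; `check_scrape` is a hypothetical helper, and the sample rows are made up purely for illustration.

```python
import csv
import io

def check_scrape(csv_file, expected_rows):
    """Summarize how many rows were read, how many are missing, and duplicate IDs."""
    reader = csv.DictReader(csv_file)
    ids = [row["ID"] for row in reader]
    return {
        "rows": len(ids),
        "missing": expected_rows - len(ids),
        "duplicate_ids": len(ids) - len(set(ids)),
    }

# Illustrative sample only: three rows, one repeated ID, four rows expected
sample = io.StringIO("Title,ID\nA,101\nB,102\nB again,102\n")
report = check_scrape(sample, expected_rows=4)
print(report)
```

For the real data, the same function could be called with `open('storedbasic.csv', encoding='utf8', newline='')` and `expected_rows=3523`. Duplicates may appear if new works shift the search results between page requests.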

### <center>Data Processing</center>

We now have a singular data frame 

### <center>Exploratory Data Analysis</center>



### <center>Hypothesis Testing</center>



### <center>Conclusions</center>
