# Power of Female and Male Characters in Japanese Light Novels

### Group Members
**Kristina Gong, Christina Ding, Evelyn Lin**

This notebook is on Colab. Here is the [link](https://colab.research.google.com/drive/1OuOzChhuvcZV8A5n5odEaiV7rbxcoFGg?usp=sharing).

## Background

People are increasingly aware of gender inequality and stereotypes in movies and other literary works. We want to take a closer look at light novels and see whether the same gender issues appear there. Japanese light novels are a popular literary genre aimed at young people (roughly ages 15 to 30) and closely related to manga and anime. One of the most popular themes is heterosexual romance, and there are novels written specifically for female audiences and for male audiences. Books that target adult male audiences often include elements such as open relationships, pornography, and violence, and their female characters tend to follow some characteristic patterns. We therefore study how female characters are portrayed compared with male characters in light novels targeting male and female audiences respectively.

To investigate the power of female and male characters in light novels, and how that power evolves over different phases of the story, we created the dataset by screening Japanese light novels from the online catalog under the genre of “Harem”.


## Research Questions
We would like to investigate the power of female and male characters in light
novels, and how their power evolves in different phases of the story. 

## Dataset creation and cleaning

### Data collection 
The data come from the Baka-Tsuki Translation Community. Baka-Tsuki (BT) is a fan translation community that hosts light novel translations in Wiki format. Baka-Tsuki is not itself a translation group: independent translators voluntarily upload their translations to the Wiki for public sharing. We scrape a subset of the novels under the category Light Novel (English). A novel is included in our dataset if it has at least one genre label. Different volumes of the same novel are treated as separate books and stored in separate files.

The instances that make up the dataset are novels under the category *English*, meaning the novel has been translated into English. The genres include Sci-Fi, Harem, Fantasy, Comedy, Supernatural, and so on. The dataset doesn’t contain all possible instances, but it contains all novels on Baka-Tsuki that have genre labels and a completed translation in a relatively standard format. We chose these novels because Baka-Tsuki is the most comprehensive catalog we could find online.


Here is the [link](https://colab.research.google.com/drive/1XfXJllcynGx0S7hedEzgj4KgKFfC6ZDv#scrollTo=ndgYCp48yNfD) to the notebook with the scraping process.


In [None]:
import requests
from bs4 import BeautifulSoup
import os
import re
import pickle
from os import path
import pandas as pd
import numpy as np

### Data cleaning 

Before running the code below, we cleaned our data. Since we scraped our data from a translation community, the format of each novel differs slightly depending on the preferences of its translator. For example, there might be a "translator's comment" at the start or end of a chapter, or a note recording the date or the translator's feelings at the time. In addition, some pages contain an incomplete novel or nothing at all. We therefore reviewed the scraped data and manually deleted pages that read "page does not exist" or contained invalid content.



We also cleaned the `agency_power_lemma.csv` file used for calculating power scores. We converted the verbs in the CSV to their lemma form so that they can be matched against the lemmatized verbs from the novels. Here is the [link](https://colab.research.google.com/drive/1q1aHKAHnNg4DimefFiLhrDlhJ305ZSUY#scrollTo=T91TG7PQJPdK) to the notebook that did the cleaning.

In [None]:
agency_power = pd.read_csv("../content/agency_power_lemma.csv")


In [None]:
# Load the scraped metadata; the with-block closes the file afterwards.
with open('/content/all_dict.pickle', 'rb') as f:
    my_dict = pickle.load(f)
str(my_dict["A_Simple_Survey"]["genre"])

"['Fantasy', 'Sci-Fi', 'Supernatural']"

## Preprocessing

Unzip our corpus:

In [None]:
!unzip /content/light_novel.zip

Streaming output truncated to the last 5000 lines.
  inflating: 现在的light_novel/City_Series/Volume9/output_splitted_textad/City_Series_Volume9_splitted_textad.entities  
  inflating: __MACOSX/现在的light_novel/City_Series/Volume9/output_splitted_textad/._City_Series_Volume9_splitted_textad.entities  
  inflating: 现在的light_novel/City_Series/Volume9/output_splitted_textad/City_Series_Volume9_splitted_textad.quotes  
  inflating: __MACOSX/现在的light_novel/City_Series/Volume9/output_splitted_textad/._City_Series_Volume9_splitted_textad.quotes  
  inflating: 现在的light_novel/City_Series/Volume9/output_splitted_textaa/City_Series_Volume9_splitted_textaa.quotes  
  inflating: __MACOSX/现在的light_novel/City_Series/Volume9/output_splitted_textaa/._City_Series_Volume9_splitted_textaa.quotes  
  inflating: 现在的light_novel/City_Series/Volume9/output_splitted_textaa/City_Series_Volume9_splitted_textaa.entities  
  inflating: __MACOSX/现在的light_novel/City_Series/Volume9/output_splitted_textaa/._City_Series_Volume9_splitt

In [None]:
!pip install booknlp

Collecting booknlp
  Downloading booknlp-1.0.7.tar.gz (2.4 MB)
Collecting spacy>=3
  Downloading spacy-3.2.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.0 MB)
Collecting transformers>=4.11.3
  Downloading transformers-4.12.5-py3-none-any.whl (3.1 MB)
Collecting pydantic!=1.8,!=1.8.1,<1.9.0,>=1.7.4
  Downloading pydantic-1.8.2-cp37-cp37m-manylinux2014_x86_64.whl (10.1 MB)
Collecting catalogue<2.1.0,>=2.0.6
  Downloading catalogue-2.0.6-py3-none-any.whl (17 kB)
Collecting langcodes<4.0.0,>=3.2.0
  Downloading langcodes-3.3.0-py3-none-any.whl (181 kB)
Collecting pathy>=0.3.5
  Downloading pathy-0.6.1-py3-none-any.whl (42 kB)

In [None]:
!python -m spacy download en_core_web_sm

Collecting en_core_web_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz (12.0 MB)
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


In [None]:
from booknlp.booknlp import BookNLP

using device cpu


In [None]:
model_params = {
    "pipeline": "entity,quote,event,coref",
    "model": "small",
}

booknlp = BookNLP("en", model_params)

{'pipeline': 'entity,quote,event,coref', 'model': 'small'}
downloading entities_google_bert_uncased_L-4_H-256_A-4-v1.0.model
downloading coref_google_bert_uncased_L-2_H-256_A-4-v1.0.model
downloading speaker_google_bert_uncased_L-8_H-256_A-4-v1.0.1.model


--- startup: 19.192 seconds ---


## Define some basic methods

In [None]:
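def find_last_digit(string):
  # Return the last number in a file name (the chapter number).
  # \d+ keeps multi-digit numbers such as 100 intact instead of splitting them.
  s = re.findall(r"\d+", string)
  return int(s[-1])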

In [None]:
def sort_chapters(chapter_list):
  return sorted(chapter_list,key = find_last_digit)
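
As a small illustration of why chapters are sorted by their trailing number and not alphabetically, here is a self-contained sketch. The file names are hypothetical examples in the style of the corpus, and `last_number` is an illustrative stand-in for `find_last_digit`, not the notebook's own function.

```python
import re

def last_number(name):
    """Return the last run of digits in a file name as an integer."""
    numbers = re.findall(r"\d+", name)
    if not numbers:
        raise ValueError(f"no chapter number in {name!r}")
    return int(numbers[-1])

# Hypothetical chapter file names.
chapters = [
    "Example_Novel:Volume 1 Chapter 10",
    "Example_Novel:Volume 1 Chapter 2",
    "Example_Novel:Volume 1 Chapter 1",
]

# Numeric order: Chapter 1, Chapter 2, Chapter 10.
sorted_chapters = sorted(chapters, key=last_number)

# Plain string order would place "Chapter 10" before "Chapter 2".
alphabetical = sorted(chapters)

print(sorted_chapters)
```

Raising an error for names without digits makes stray files (for example leftover output files) visible early instead of failing later with an `IndexError`.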

In [None]:
import json
from collections import Counter

In [None]:
def proc(filename):
    with open(filename) as file:
        data=json.load(file)
    return data

In [None]:
!rmdir /content/light_novel/.ipynb_checkpoints
!rmdir /content/light_novel/.DS_Store
!find . -name ".DS_Store" -delete
series_list = sorted(os.listdir("light_novel"))

#series_list = ["Shinmai_Maou_no_Keiyakusha"]

series_list
# series = "Anohana:_The_Flower_We_Saw_That_Day"
# volume_list = os.listdir(f"light_novel/{series}")
# volume_list = [i for i in volume_list if i.startswith('Volume')]

rmdir: failed to remove '/content/light_novel/.ipynb_checkpoints': No such file or directory
rmdir: failed to remove '/content/light_novel/.DS_Store': Not a directory


['A_Simple_Survey',
 'Absolute_Duo',
 'Anohana:_The_Flower_We_Saw_That_Day',
 'Apocalypse_Witch',
 'Baka_to_Test_to_Shoukanjuu',
 'BlazBlue',
 'Chrome_Shelled_Regios',
 'City_Series',
 'Cute_Kunoichis',
 'Dai_Densetsu_no_Y%C5%ABsha_no_Densetsu',
 'Dantalian_no_Shoka',
 'Denpa_Onna_to_Seishun_Otoko',
 'Ghost_Hunt',
 'Godhorn_Tech',
 'Golden_Time',
 'Gundam_Unicorn',
 'HEAVY_OBJECT',
 'Hagure_Yuusha_no_Aesthetica',
 'Hanbun_no_Tsuki_ga_Noboru_Sora',
 'Hikaru_ga_Chikyuu_ni_Itakoro......',
 'Hyouka',
 'Kami-sama_no_Inai_Nichiyoubi',
 'Kamisama_no_Memochou',
 'Kanon',
 'Kara_no_Kyoukai',
 'Kaze_no_Stigma',
 'Kino_no_Tabi',
 'Madan_no_Ou_to_Vanadis',
 'Magika_No_Kenshi_To_Shoukan_Maou',
 'Maou_na_Ore_to_Fushihime_no_Yubiwa',
 'Maria-sama_ga_Miteru',
 'Maru-MA',
 'Masou_Gakuen_HxH',
 'Mimizuku_to_Yoru_no_Ou',
 'Mondaiji-tachi_ga_isekai_kara_kuru_soudesu_yo',
 'Monster_Hunter',
 'Nogizaka_Haruka_no_Himitsu',
 'Omae_o_Otaku_ni_Shiteyaru_kara,_Ore_o_Riajuu_ni_Shitekure!',
 'Onii-chan_Dakedo_Ai_S

In [None]:
five_parts = ["splitted_textaa","splitted_textab","splitted_textac","splitted_textad","splitted_textae"]



We first create a file with the full text of each volume by combining all of its chapters.

We then split the full text into five equal sections to see how the power of the characters evolves throughout the book.

We find the five major characters in the book by selecting the named characters that are mentioned most often.

Finally, we generate a `character_list` containing the name of each major character (the most commonly used mention) and the inferred gender.
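
The two selection steps can be sketched in plain Python. This is only an approximation under stated assumptions: the notebook itself splits files with the shell `split` command, and the helper names `split_into_sections` and `top_characters`, as well as the mention counts, are hypothetical.

```python
def split_into_sections(text, n_sections=5):
    """Split text into n_sections chunks of roughly equal length, breaking only at line ends."""
    lines = text.splitlines(keepends=True)
    total = len(text)
    sections = [[] for _ in range(n_sections)]
    position = 0
    for line in lines:
        # Assign each line to the section in which it starts.
        index = min(position * n_sections // total, n_sections - 1)
        sections[index].append(line)
        position += len(line)
    return ["".join(section) for section in sections]

def top_characters(character_count, n=5):
    """Return the ids of the n most frequently mentioned characters."""
    return sorted(character_count, key=character_count.get, reverse=True)[:n]

# Hypothetical text: 100 short lines.
text = "".join(f"line {i}\n" for i in range(100))
parts = split_into_sections(text)

# Hypothetical mention counts keyed by character id.
character_count = {0: 120, 1: 45, 2: 300, 3: 7, 4: 60, 5: 88, 6: 15}
major = top_characters(character_count)
print(major)
```

Sorting the ids by count and slicing returns exactly `n` characters even when several characters share the same count.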

## Define methods for using BookNLP and calculating power

Create a method to check whether a character is one of the major characters by matching his/her mentions. We cannot use the character id because ids change between the separately processed parts of the story.

In [None]:
def check_in_mention(current_mention, target_mention):
  current_mention_list = [i["n"] for i in current_mention]
  for i in current_mention_list:
    for key,a in target_mention.items():
      for b in a:
        if i == b:
          return key
  return -1
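
To make the matching concrete, here is a self-contained sketch with toy data. `match_character` mirrors the logic of `check_in_mention`; the character names and mention lists are hypothetical, and only the `"n"` (name) field of each mention is assumed.

```python
def match_character(current_mentions, target_mentions):
    """Return the major character whose known names include a current mention, else -1."""
    for mention in current_mentions:
        for key, names in target_mentions.items():
            if mention["n"] in names:
                return key
    return -1

# Names collected from the full-text run, keyed by the most common mention (hypothetical).
target_mentions = {
    "Alice": ["Alice", "Alice Smith"],
    "Bob": ["Bob", "Bobby"],
}

# Proper mentions of one character found in a single section (hypothetical).
section_mentions = [{"n": "Bobby"}, {"n": "Bob"}]
unknown_mentions = [{"n": "Carol"}]

matched = match_character(section_mentions, target_mentions)
unmatched = match_character(unknown_mentions, target_mentions)
print(matched, unmatched)
```

Because the match is on surface names, two different characters who share a name would be merged; the sketch does not try to resolve that case.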

In [None]:
def get_counter_from_dependency_list(dep_list):
    # Count how often each verb ("w") appears in a BookNLP agent or patient list.
    return Counter(token["w"] for token in dep_list)

In [None]:
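def make_lemma(word):
  # Lemmatize every token with spaCy and rejoin the lemmas with single spaces.
  # `nlp` is the spaCy pipeline loaded in a later cell, before this function is first called.
  return " ".join(token.lemma_ for token in nlp(word))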

Create a method to return the verbs for which the character is either an agent or a patient, and the number of times that verb is used for the character. 


In [None]:
def get_agent_patient(data,target_mentions):
  character_agent_patient = []
  for character in data["characters"]:
    
    agentList=character["agent"]
    patientList=character["patient"]


    mentions=character["mentions"]
    proper_mentions=mentions["proper"]

    character_information = {}

    # only keep named characters that match one of the major characters
    name = check_in_mention(proper_mentions,target_mentions)
    if name != -1:

        character_information["name"] = name

        printTop=None

        agent_dict = {}
        patient_dict = {}

        for k, v in get_counter_from_dependency_list(agentList).most_common(printTop):
            k = make_lemma(k)
            agent_dict[k] = v
       

        for k, v in get_counter_from_dependency_list(patientList).most_common(printTop):
            k = make_lemma(k)
            patient_dict[k] = v
       

        character_information["agent"] = agent_dict
        character_information["patient"] = patient_dict
    if character_information != {}:
      character_agent_patient.append(character_information)
  return character_agent_patient

Create a method to calculate the power score of a given character based on `agency_power_lemma.csv`

In [None]:
def calculate_power(character):
  # Map each lemmatized verb to its power label, keeping the first entry for duplicates.
  verb_power = {}
  for verb, label in zip(agency_power["verb"], agency_power["power"]):
    verb_power.setdefault(verb, label)

  length = 0
  power = 0

  # As an agent, the character gains power from "power_agent" verbs
  # and loses power from "power_theme" verbs.
  for verb, count in character["agent"].items():
    length += count
    label = verb_power.get(verb)
    if label == "power_agent":
      power += count
    elif label == "power_theme":
      power -= count

  # As a patient, the signs are reversed.
  for verb, count in character["patient"].items():
    length += count
    label = verb_power.get(verb)
    if label == "power_agent":
      power -= count
    elif label == "power_theme":
      power += count

  # "/" marks a character with no verbs in this section.
  if length == 0:
    return "/"
  return power/length
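
A worked example with toy numbers may help in reading the score. This sketch assumes a tiny hypothetical lexicon in place of `agency_power_lemma.csv`, and the helper `power_score` returns `None` where the notebook uses `"/"`.

```python
def power_score(agent_counts, patient_counts, verb_power):
    """Net power divided by the total number of verb uses, or None if there are none."""
    total = 0
    power = 0
    # Agent of a "power_agent" verb: +1 per use; agent of a "power_theme" verb: -1 per use.
    for verb, count in agent_counts.items():
        total += count
        label = verb_power.get(verb)
        if label == "power_agent":
            power += count
        elif label == "power_theme":
            power -= count
    # Patient of a "power_agent" verb: -1 per use; patient of a "power_theme" verb: +1 per use.
    for verb, count in patient_counts.items():
        total += count
        label = verb_power.get(verb)
        if label == "power_agent":
            power -= count
        elif label == "power_theme":
            power += count
    return power / total if total else None

# Hypothetical lexicon entries and verb counts for one character.
verb_power = {"order": "power_agent", "beg": "power_theme", "rescue": "power_agent"}
agent_counts = {"order": 3, "beg": 1, "smile": 1}
patient_counts = {"rescue": 1}

# Net power: +3 (order) - 1 (beg) - 1 (rescued by someone) = 1, over 6 verb uses in total.
score = power_score(agent_counts, patient_counts, verb_power)
print(score)
```

Verbs that are not in the lexicon, such as `smile` here, add to the denominator but not to the numerator, so a character described mostly with neutral verbs ends up with a score close to zero.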

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

Go over the volumes in each series and generate a dataframe with the name, inferred gender and power score for the five major characters.

In [None]:
for series in series_list:

  x = f"/content/light_novel/{series}/.ipynb_checkpoints"
  z = f"/content/light_novel/{series}/.DS_Store"
  !rmdir {x}
  !rmdir {z}

  volume_list = os.listdir(f"light_novel/{series}")
  volume_list = [i for i in volume_list if i.startswith('Volume')]
  # Debugging override: uncomment to restrict every series to these volumes.
  # volume_list = ["Volume1","Volume4","Volume7","Volume9"]


  print(f"processing {series}")

  for volume in volume_list:
    print(f"processing {series} {volume}")

    y = f"/content/light_novel/{series}/{volume}/.ipynb_checkpoints"
    g = f"/content/light_novel/{series}/{volume}/.DS_Store"
    
    !rmdir {y}
    !rmdir {g}

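    # Only read chapter files: skip directories, hidden files and files written by earlier runs.
    volume_dir = f"light_novel/{series}/{volume}"
    chapter_list = [
        i for i in os.listdir(volume_dir)
        if path.isfile(f"{volume_dir}/{i}")
        and not i.startswith((".", "splitted_text"))
        and i != "full_text"
        and not i.endswith(".csv")
    ]
    fulltext = ""
    for i in sort_chapters(chapter_list):
      with open(f"{volume_dir}/{i}") as f:
        fulltext += f.read()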

    with open(f"light_novel/{series}/{volume}/full_text", 'w') as f:
      f.write(fulltext)
    
    a = f"/content/light_novel/{series}/{volume}/full_text"
    b = f"/content/light_novel/{series}/{volume}/splitted_text"

    !split -n l/5 {a} {b}

    inputFile=f"/content/light_novel/{series}/{volume}/full_text"
    outputDir=f"/content/light_novel/{series}/{volume}/{series}_{volume}_output"
    idd=f"{series}_{volume}"

    booknlp.process(inputFile, outputDir, idd)

    data=proc(f"{outputDir}/{idd}.book")
    character_count={}
    for i in data["characters"]:
      character_id=i["id"]
      mention = i["mentions"]["proper"]
      count=i["count"]
      if len(mention) > 0:
        character_count[character_id] = count


    # Select the five most frequently mentioned named characters once, after counting them all.
    max_keys = sorted(character_count, key=character_count.get, reverse=True)[:5]


      
    character_list = []
    mentions = {}

    for i in data["characters"]:

      character_information = {}
      mention = i["mentions"]["proper"]
      # Reset for every character so a missing prediction never reuses the previous character's gender.
      referential_gender = "unknown"

      if i["g"] is not None and i["g"] != "unknown":
          referential_gender=i["g"]["argmax"]

      if len(mention) >0 and i["id"] in max_keys:
        max_proper_mention=mention[0]["n"]
        character_information["name"] = max_proper_mention
        character_information["gender"] = referential_gender
        if character_information != {}:
          character_list.append(character_information)
        mentions[mention[0]["n"]] = [i["n"] for i in mention]
    character_list = pd.DataFrame(character_list)

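    # Skip volumes with no named characters or whose major characters all share one inferred gender.
    if character_list.empty:
      continue
    arr = np.array(character_list["gender"])
    if np.all(arr == arr[0]):
      continue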


    power_df = character_list.copy()
    power_df[["power1","power2","power3","power4","power5"]] = 0
    power_df["series_name"] = f"{series}"
    power_df["volume_name"] = f"{volume}"
    power_df["genre"] = str(my_dict[f"{series}"]["genre"])
    m = 0

    for i in five_parts:
      inputFile = f"/content/light_novel/{series}/{volume}/{i}"
      outputDir = f"/content/light_novel/{series}/{volume}/output_{i}"
      idd = f"{series}_{volume}_{i}"

      booknlp.process(inputFile, outputDir, idd)

      data=proc(f"{outputDir}/{idd}.book")
      power_list = {}
      for character in get_agent_patient(data,mentions):
        power_list[character["name"]] = calculate_power(character)
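      # Fill in the score for every major character, not only the first four rows.
      for a in range(len(power_df)):
        current_name = power_df.iloc[a,0]
        # "/" marks a character that does not appear in this section.
        power_df.iloc[a,2+m] = power_list.get(current_name, "/")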
      m+=1

      print(power_df)
    power_df.to_csv(f'/content/light_novel/{series}/{volume}/{series}_{volume}_power_df.csv', index = False)


rmdir: failed to remove '/content/light_novel/A_Simple_Survey/.ipynb_checkpoints': No such file or directory
rmdir: failed to remove '/content/light_novel/A_Simple_Survey/.DS_Store': No such file or directory
processing A_Simple_Survey
processing A_Simple_Survey Volume1
rmdir: failed to remove '/content/light_novel/A_Simple_Survey/Volume1/.ipynb_checkpoints': No such file or directory
rmdir: failed to remove '/content/light_novel/A_Simple_Survey/Volume1/.DS_Store': No such file or directory


IndexError: ignored

In [None]:
!zip -r /content/light_novel_new.zip /content/light_novel

Streaming output truncated to the last 5000 lines.
  adding: content/light_novel/Magika_No_Kenshi_To_Shoukan_Maou/Volume7/splitted_textac (deflated 64%)
  adding: content/light_novel/Magika_No_Kenshi_To_Shoukan_Maou/Volume7/splitted_textab (deflated 64%)
  adding: content/light_novel/Magika_No_Kenshi_To_Shoukan_Maou/Volume7/splitted_textaa (deflated 64%)
  adding: content/light_novel/Magika_No_Kenshi_To_Shoukan_Maou/Volume7/Magika No Kenshi To Shoukan Maou:Volume 7 Chapter 3 (deflated 65%)
  adding: content/light_novel/Magika_No_Kenshi_To_Shoukan_Maou/Volume4/ (stored 0%)
  adding: content/light_novel/Magika_No_Kenshi_To_Shoukan_Maou/Volume4/Magika_No_Kenshi_To_Shoukan_Maou_Volume4_power_df1.csv (deflated 66%)
  adding: content/light_novel/Magika_No_Kenshi_To_Shoukan_Maou/Volume4/Magika No Kenshi To Shoukan Maou:Volume 4 Chapter 4 (deflated 64%)
  adding: content/light_novel/Magika_No_Kenshi_To_Shoukan_Maou/Volume4/full_text (deflated 65%)
  adding: content/light_novel/Magika_No_Kenshi_To_Shouka