
![](https://github.com/jennyjyu/socialgraphs-projectB/blob/main/header-hogwarts.png?raw=true)

## 1. Motivation
Our dataset consists of characters from the Harry Potter books and movies. As the group members have seen the movies
and read the books several times, the Harry Potter universe should give results that are easy to interpret
(at least compared to the Zelda BotW universe, which was completely foreign to all the group members).

Additionally, as Christmas time is upon us, this is great timing for the class members to pick up some details
about Harry Potter. The movies will be all over the screens this winter, and it is always nice to have
some fun facts up one's sleeve!

The goal is for the end user to gain some extra information about and understanding of the wizarding world of Harry Potter, even if it is just a little. It should be mentioned that this information and the statistics are derived from the Harry Potter fandom wiki, and do not necessarily reflect the author J. K. Rowling’s ideas.


## 2. Basic stats. Let's understand the dataset better

The characters chosen are those mentioned in the [list of characters on Wikipedia](https://en.wikipedia.org/wiki/List_of_Harry_Potter_characters). This list was a good starting point, and also one of the only explicit lists found.

To get information about each character, their wiki page on the [Harry Potter fandom wiki](https://harrypotter.fandom.com/wiki/Main_Page) was retrieved through the wiki's API, similarly to the methods taught in the course. The attributes species, gender and house, as well as the links to other characters, were extracted with regular expressions.
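As a rough illustration of the retrieval step, the sketch below builds a MediaWiki API request for a single character page. The endpoint is the fandom wiki's `api.php`; the helper names `build_query_url` and `fetch_wikitext`, the `User-Agent` string and the exact choice of query parameters are assumptions made for this example, not necessarily what the project used.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://harrypotter.fandom.com/api.php"


def build_query_url(title):
    """Build a MediaWiki query URL asking for the raw wikitext of one page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": title,
        "format": "json",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def fetch_wikitext(title):
    """Download one page and return its wikitext, or None if it has no content."""
    request = urllib.request.Request(
        build_query_url(title),
        headers={"User-Agent": "socialgraphs-example/0.1"},  # assumed, any descriptive value
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    # The response keys pages by page id; only one title was requested.
    page = next(iter(data["query"]["pages"].values()))
    if "revisions" not in page:
        return None  # e.g. the character has no fandom page
    return page["revisions"][0]["slots"]["main"]["*"]
```

Calling `fetch_wikitext("Harry Potter")` needs network access; a `None` result corresponds to the case where a character has no page of their own.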

The biggest challenge when retrieving the data was that the names used on the Wikipedia page did not always correspond to the ones used on the fandom wiki, e.g., *Ronald Weasley* and *Ron Weasley*, or *Alastor (Mad-Eye) Moody* and *Alastor Moody*. Therefore, through some trial and error, a handful of the characters had to be given aliases manually. Additionally, some of the characters mentioned in the Wikipedia list did not have their own fandom page, so these characters were excluded.
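A manual alias table can be as simple as a dictionary lookup with a fallback. The sketch below is only an illustration: the name `fandom_title` is hypothetical, and which spelling belongs to which site is an assumption made for the example.

```python
# Hypothetical alias table: name in the character list -> fandom page title.
# The two entries reuse the name pairs mentioned above; the direction of the
# mapping is an assumption made for this sketch.
ALIASES = {
    "Ron Weasley": "Ronald Weasley",
    "Alastor (Mad-Eye) Moody": "Alastor Moody",
}


def fandom_title(list_name):
    """Return the fandom page title for a name taken from the character list."""
    # Names without an alias are assumed to be identical on both sites.
    return ALIASES.get(list_name, list_name)
```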

The resulting network is a directed graph with 191 nodes and 2787 edges. Many of the characters do not “fit into” the framework of having a gender and an associated house, but since the main characters of interest (and most of the characters) do, these attributes were still extracted and saved. As a result, many of the distribution plots show a large share of “unknown” values. More details are presented below and on the website.
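To make the structure of the network concrete, here is a minimal sketch of how such a directed graph could be assembled with `networkx`. The `characters` records are hand-written stand-ins for the scraped data, and the choice to store missing attributes as `"unknown"` mirrors the description above but is otherwise an assumption.

```python
import networkx as nx

# Hypothetical, hand-written records standing in for the scraped data.
characters = {
    "Harry Potter": {"species": "Human", "gender": "Male", "house": "Gryffindor",
                     "links": ["Hermione Granger", "Dobby"]},
    "Hermione Granger": {"species": "Human", "gender": "Female", "house": "Gryffindor",
                         "links": ["Harry Potter"]},
    "Dobby": {"species": "House-elf", "gender": "Male",
              "links": ["Harry Potter"]},
}

graph = nx.DiGraph()

# One node per character; attributes that were not found become "unknown".
for name, info in characters.items():
    graph.add_node(
        name,
        species=info.get("species", "unknown"),
        gender=info.get("gender", "unknown"),
        house=info.get("house", "unknown"),
    )

# One directed edge per link from a character page to another known character.
for name, info in characters.items():
    for target in info["links"]:
        if target in characters:
            graph.add_edge(name, target)

print(graph.number_of_nodes(), "nodes,", graph.number_of_edges(), "edges")
```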
 
## 3. Tools, theory and analysis. Describe the process of theory to insight

To extract the wanted parts of the wiki pages of each character, regular expressions were frequently used. The strings are shown below, associated with their use. It is assumed that all readers are familiar with RegEx, and an explanation will therefore not be provided.

**Names:** 

* `(?:\[\[)?([\w \(\)\-.\/]+)(?:\]\])?(?: )?(?:\([\w ]+\))?(?: )?[\–\-] `

**Attributes:**

* `species = \[\[([\w ]+)\]\]`
* `house = \[\[([\w ]+)\]\](?:\<)?`
* `gender = ([\w ]+)(?:\<)?`
	
**Links:**

* `\[\[([\w ]+)(?:[\#.*?])?\]\]`

First, the names from the Wikipedia page needed to be extracted for further use. Second, while iterating over the collected names, their fandom wiki pages needed to be parsed for attributes and links. All this data forms the basis for the constructed graph, which can be seen [on the website](https://jennyjyu.github.io/socialgraphs-projectB/posts/post-1/).
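The attribute and link patterns listed above could be applied to a page roughly as follows. The `sample` wikitext is written by hand for this example, and the helper `first_match`, the `"unknown"` fallback and the filtering of links against a set of known characters are assumptions, not a reproduction of the project's code.

```python
import re

# Patterns taken from the lists above.
SPECIES_RE = re.compile(r"species = \[\[([\w ]+)\]\]")
HOUSE_RE = re.compile(r"house = \[\[([\w ]+)\]\](?:\<)?")
GENDER_RE = re.compile(r"gender = ([\w ]+)(?:\<)?")
LINK_RE = re.compile(r"\[\[([\w ]+)(?:[\#.*?])?\]\]")

# Hypothetical wikitext, written by hand for this example.
sample = """{{Individual infobox
|species = [[Human]]
|gender = Female
|house = [[Gryffindor]]
}}
Hermione became friends with [[Harry Potter]] and [[Ron Weasley]]."""


def first_match(pattern, text):
    """Return the first captured group, or "unknown" if the pattern is absent."""
    match = pattern.search(text)
    return match.group(1).strip() if match else "unknown"


attributes = {
    "species": first_match(SPECIES_RE, sample),
    "gender": first_match(GENDER_RE, sample),
    "house": first_match(HOUSE_RE, sample),
}

# The link pattern also matches infobox values such as [[Human]], so only
# targets that are known characters are kept (an assumption for this sketch).
known_characters = {"Harry Potter", "Ron Weasley", "Hermione Granger"}
links = [name for name in LINK_RE.findall(sample) if name in known_characters]

print(attributes)
print(links)
```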

### Distributions

After the network was constructed, several distributions were analyzed further, based on the different attributes and on the degrees (in, out and total). The `networkx` package was a great help here.


#### Attributes 

The results show that most of the important characters are categorized as **humans**. The fandom wiki does not distinguish between whether a wizard is muggle-born, half-blood or pure-blood when categorizing them as human. As most of the plot of the Harry Potter story is about the wizards going to Hogwarts, this makes sense. They do encounter some different species now and then, but all the main characters are (or were originally) humans.

It is important to note that even though many of the species only have one occurrence in the data, it does not mean that it is the only one of its kind in the Harry Potter universe, but rather the only one of their species that has a big enough role to be named, and counted as a character. 

Harry Potter is a male-dominated universe, just like the universe of computer science. It is reasonable to assume that the characters whose gender is **unknown** are not of the human species, but of a species where the gender is not explicitly defined by J.K. Rowling, or that they define themselves as gender-fluid. The exact reason can, sadly, not be determined from the distribution, so we just have to keep speculating.

The distribution of the houses to which the characters belong is interesting. Surprisingly many characters do not have a house related to them, which, on second thought, makes sense. As almost the only characters that can be related to a house are **humans**, all the characters who are not will have **unknown** as their value. Additionally, many human characters are mentioned without the context of the house they belong to, such as members of the Ministry of Magic. Therefore, to get a better result, only the characters who had a house association were considered. As expected, most of the mentioned characters belong to **Gryffindor**, which is the house of Harry and his friends. Their rival house, **Slytherin**, is also well represented.
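Counting attribute values, and dropping the **unknown** ones for the house plot, can be sketched as below. The toy graph and the attribute name `house` are assumptions made for the example; the real graph has far more nodes.

```python
from collections import Counter

import networkx as nx

# Toy graph with hand-written house attributes (illustrative only).
graph = nx.DiGraph()
graph.add_node("Harry Potter", house="Gryffindor")
graph.add_node("Hermione Granger", house="Gryffindor")
graph.add_node("Draco Malfoy", house="Slytherin")
graph.add_node("Dobby", house="unknown")

# Distribution over all characters, including those without a house.
house_counts = Counter(nx.get_node_attributes(graph, "house").values())

# Distribution restricted to characters with a house association.
known_house_counts = Counter(
    {house: count for house, count in house_counts.items() if house != "unknown"}
)

print(house_counts.most_common())
print(known_house_counts.most_common())
```

The same pattern works for `species` and `gender` by changing the attribute name.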

#### Degrees
The `networkx` package made this part painless. It provides `graph.in_degree()`, `graph.out_degree()`, `graph.degree()`, etc., which can be called directly on the graph. To gather the in-degree of all nodes, the following line was run:

`in_counts = [graph.in_degree(node) for node in graph.nodes]`

The same approach was used for the total degree (`graph.degree(node)`) and the out-degree (`graph.out_degree(node)`). Afterwards, the degrees were put in bins and plotted as a histogram.
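The binning step can be illustrated on a toy graph as follows. The edges and the `bin_width` value are assumptions chosen for the example, and the bins are only printed here rather than plotted.

```python
from collections import Counter

import networkx as nx

# Toy directed graph (illustrative only).
graph = nx.DiGraph([
    ("Ron Weasley", "Harry Potter"),
    ("Hermione Granger", "Harry Potter"),
    ("Neville Longbottom", "Harry Potter"),
    ("Harry Potter", "Ron Weasley"),
])

in_counts = [graph.in_degree(node) for node in graph.nodes]

# Group the degrees into fixed-width bins, keyed by the lower bin edge.
bin_width = 2  # assumed value, chosen for the toy data
binned = Counter((degree // bin_width) * bin_width for degree in in_counts)

for start in sorted(binned):
    print(f"{start}-{start + bin_width - 1}: {binned[start]} node(s)")
```

For an actual plot, the list `in_counts` could be passed to `matplotlib.pyplot.hist` instead of being binned by hand.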

The results show that most of the characters have few (or even no) links to others. Only three characters have a degree higher than 200. One can assume that one of these is Harry Potter, most likely the one with the highest degree. Another of the high-degree nodes may be Voldemort (Tom Riddle), as he is a central character and Harry's enemy. The others could be one of Harry’s friends, Ron or Hermione, or the master wizard Dumbledore. More details can be found on the website.


#### Harry and his friends

Two of the main questions presented in Project A were *Is the “boy with the scar” in fact the most mentioned character?* and *How important are his friends?* To answer them and get a deeper understanding, different centrality measures were examined for Harry Potter and his two friends, Hermione Granger and Ron Weasley.

As with the degree analysis, `networkx` was a great help for the centrality measures. Centrality measures are, in theory, somewhat advanced and would be time-consuming to program from scratch. `networkx` provides `networkx.closeness_centrality`, `networkx.betweenness_centrality` and `networkx.eigenvector_centrality`, which makes it extremely easy to examine these features of the network, or in this case, of the three friends. It is also easy to look up a single node, e.g. the closeness centrality of Hermione Granger:

`networkx.closeness_centrality(graph)["Hermione Granger"]`

The same lookup works for any character (although it was only applied to the three friends in this project). A specific node’s degree (in, out or total) can be retrieved in a similar way:

`graph.in_degree("Harry Potter")`

For all the measures examined, Harry came in with the highest values each time. Additionally, Hermione came in second, and Ron last. It was surprising that Ron and Hermione were not more similar!
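The comparison of the three friends can be sketched on a small toy graph. The edges below are invented for illustration, so the numbers say nothing about the real network; `max_iter=1000` is an assumed setting to give the eigenvector computation room to converge.

```python
import networkx as nx

# Toy directed graph (illustrative only): the three friends link to each
# other, and Neville is only connected to Harry.
edges = [
    ("Harry Potter", "Hermione Granger"), ("Hermione Granger", "Harry Potter"),
    ("Harry Potter", "Ron Weasley"), ("Ron Weasley", "Harry Potter"),
    ("Hermione Granger", "Ron Weasley"), ("Ron Weasley", "Hermione Granger"),
    ("Harry Potter", "Neville Longbottom"), ("Neville Longbottom", "Harry Potter"),
]
graph = nx.DiGraph(edges)
friends = ["Harry Potter", "Hermione Granger", "Ron Weasley"]

# Each call returns a dict mapping every node to its centrality score.
measures = {
    "closeness": nx.closeness_centrality(graph),
    "betweenness": nx.betweenness_centrality(graph),
    "eigenvector": nx.eigenvector_centrality(graph, max_iter=1000),
}

for name, scores in measures.items():
    print(name, {friend: round(scores[friend], 3) for friend in friends})
```

Computing each measure once for the whole graph and then indexing the result avoids recomputing the centrality for every character that is looked up.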


#### The most important characters
Additionally, the attributes of the most important characters were examined, revealing the focus of the movies. The attributes were extracted from the `dataframe` containing all the information about each character in the Harry Potter universe, which was generated for the initial analysis of the network.

By looking at the attributes of the top 5 most important characters according to in- and out-degree, we were able to get insight into what, to a large extent, characterizes the main plot of the movies.
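Ranking characters by degree and reading off their attributes might look like the sketch below. It reads the attributes from the graph nodes rather than from the `dataframe` mentioned above, and both the toy data and the helper name `top_by_degree` are assumptions made for the example.

```python
import networkx as nx

# Toy graph with hand-written attributes (illustrative only).
graph = nx.DiGraph()
graph.add_node("Harry Potter", species="Human", house="Gryffindor")
graph.add_node("Hermione Granger", species="Human", house="Gryffindor")
graph.add_node("Draco Malfoy", species="Human", house="Slytherin")
graph.add_edges_from([
    ("Hermione Granger", "Harry Potter"),
    ("Draco Malfoy", "Harry Potter"),
    ("Harry Potter", "Hermione Granger"),
])


def top_by_degree(degree_view, k=5):
    """Return the k (node, degree) pairs with the highest degree."""
    return sorted(degree_view, key=lambda pair: pair[1], reverse=True)[:k]


for label, view in [("in", graph.in_degree()), ("out", graph.out_degree())]:
    for name, degree in top_by_degree(view, k=2):
        print(label, name, degree, dict(graph.nodes[name]))
```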

The main takeaway from the attributes is that humans, in particular wizards, play the most central role, even though there is a huge variety of magical creatures in the Harry Potter universe. A large part of the plot in the series is about the rivalry between the four Houses. Slytherins and Gryffindors are frequently pitted against each other, so it is no surprise that the most important characters belong to these Houses.



### Wordclouds
As the students at Hogwarts School of Witchcraft and Wizardry belong to one of the four Houses (Gryffindor, Slytherin, Hufflepuff and Ravenclaw), the group found it interesting to look into what each Hogwarts House actually represents. All four Houses have distinctive characteristics which are important according to the traditions and history of the House.

The group generated a wordcloud for each House based on the data acquired from the fandom wiki, using the character pages of every individual who belongs to the House.

By utilizing the Harry Potter Fandom API (https://harrypotter.fandom.com/api.php?) we were able to retrieve the wiki page of each individual character. The next step involved preprocessing the text to get it into the correct format for the final generation of wordclouds. For this purpose we used the Natural Language Toolkit (NLTK) (https://www.nltk.org/), which is a toolkit built for natural language processing in Python. NLTK was used for tokenization, removal of stopwords and lemmatization of the text, by using the built-in `WordPunctTokenizer()`, `WordNetLemmatizer()` and `corpus.stopwords.words('english')`. After preprocessing the text, the `wordcloud` Python package was used to generate the clouds for each of the Houses.
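The preprocessing described above could be sketched as follows. The function name `preprocess` and its optional `stop_words` and `lemmatize` parameters are hypothetical; they are added here so the function can be tried without downloading the NLTK corpora. Lowercasing and dropping non-alphabetic tokens are likewise assumptions about the cleaning steps.

```python
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import WordPunctTokenizer


def preprocess(text, stop_words=None, lemmatize=None):
    """Tokenize, lowercase, drop stopwords and punctuation, then lemmatize."""
    if stop_words is None:
        # Requires nltk.download("stopwords") to have been run once.
        stop_words = set(stopwords.words("english"))
    if lemmatize is None:
        # Requires nltk.download("wordnet") to have been run once.
        lemmatize = WordNetLemmatizer().lemmatize
    tokens = WordPunctTokenizer().tokenize(text.lower())
    return [
        lemmatize(token)
        for token in tokens
        if token.isalpha() and token not in stop_words
    ]


# Usage with a tiny stopword set and no lemmatization, so no corpora are needed.
tokens = preprocess(
    "Harry fought the Dark Lord.",
    stop_words={"the"},
    lemmatize=lambda token: token,
)
print(tokens)
```

The tokens of all pages belonging to one House could then be joined into a single string and passed to `WordCloud().generate(...)` from the `wordcloud` package to draw that House's cloud.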

The wordclouds give insight into the most important traits related to each House. For example, from the Gryffindor wordcloud one can deduce that Voldemort, death and battle are the three most mentioned words on the Gryffindors' wiki pages. The Gryffindors are in fact known for their chivalry, daring and bravery, which may explain why these words are so prominent.




## 4. Discussion
This is the first time any of the group members have used an automated deployment service for hosting a website. The group is therefore happy with getting the Hugo page up and running quickly. The Hugo website framework sped up the web development process, and the group members would definitely consider using this tool to present and visualize data in future projects.

Although it has little bearing on the results of the analyses, the group was happy to be able to make some changes to the imported Hugo theme, so that the blog is more closely related to the Harry Potter universe. As the code was in a somewhat unfamiliar language, the group members considered this an achievement.

As the group consists of two students (not three as recommended...), and both students are on exchange while taking 15 ECTS at home for a preliminary study for their master’s thesis, time was short. As the deadlines for the current project and the preliminary study coincide, the group had to prioritize the study.

Even though the group found the results interesting and in accordance with the goals, a deeper examination would be very interesting to add. Moreover, if more time had been available, a more advanced sentiment analysis would also have been possible. The group members tried to find transcripts of the movies, containing dialogue and lines similar to the dataset provided for Zelda BotW, but did not find anything in the short time available. The most relevant ones were the transcripts from [the fandom](https://warnerbros.fandom.com/wiki/Harry_Potter_and_the_Philosopher%27s_Stone/Transcript), but as the names there are in a different format than the names in our dataset, it would take too much time to make aliases for all nodes and do the other preprocessing. Another way to compensate for the missing data is to perform a sentiment analysis on the book(s), such as [Greg Rafferty has done](https://towardsdatascience.com/basic-nlp-on-the-texts-of-harry-potter-sentiment-analysis-1b474b13651d). Additionally, the group could have developed their own website based on, e.g., React JS, making it more interactive than just showing images with text.
 
## 5. Contributions
Both group members collaboratively set up Hugo and the website. The rest of the responsibilities were distributed as:

**Hedda**: 
* All video and music editing for Project A
* Blog posts 5 and 6

**Jenny**:
* Preparatory work for the video in Project A, and blog posts 1-4
* Some frontend details: header and favicon 

