An approach to classifying the races of characters from The Lord of the Rings using their names as features and a naive Bayes classifier
As a huge fan of The Lord of the Rings and Tolkien's work, I wanted to find a way to apply machine learning to data from the legendarium. While pondering which problem might be interesting, I had the idea of exploring the relationship between a character's name and that character's race.
In this report, I discuss an approach to predicting the races of characters from The Lord of the Rings using a naive Bayes classifier and several natural language processing techniques. The dataset consists of 789 observations (characters), each with two fields:
- name: the name of the character
- race: the race of the character. There are five possible races: Man, Ainur, Elf, Dwarf and Hobbit.
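As a rough illustration of the two-field layout described above, the sketch below reads (name, race) pairs with Python's standard `csv` module. It assumes the CSV has a header row with columns called `name` and `race`; the helper `load_characters` and the four sample rows are hypothetical stand-ins, not the repo's code or data.

```python
import csv
import io
from collections import Counter

# The five races listed above.
RACES = {"Man", "Ainur", "Elf", "Dwarf", "Hobbit"}


def load_characters(fileobj):
    """Read (name, race) pairs, skipping rows whose race is not one of the five."""
    reader = csv.DictReader(fileobj)  # assumes a header row: name,race
    return [
        (row["name"].strip(), row["race"].strip())
        for row in reader
        if row["race"].strip() in RACES
    ]


# Tiny in-memory stand-in for a real CSV file (illustrative rows only).
sample = io.StringIO(
    "name,race\nFrodo,Hobbit\nGimli,Dwarf\nLegolas,Elf\nSamwise,Hobbit\n"
)
characters = load_characters(sample)
race_counts = Counter(race for _, race in characters)
print(race_counts)
```

Counting rows per race in this way is a quick check for class imbalance before training a classifier.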

The tools used are:

- Spark (PySpark): for the analysis.
- R: for scraping, transforming and preparing the data.
This repo holds the Python script used for the analysis, the R script used for scraping and transforming the data, the original scraped data, and several CSV files with the final data (the analysis uses characters_no_surnames.csv).
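To make the core idea concrete, here is a minimal pure-Python sketch of a multinomial naive Bayes classifier that uses character bigrams of a name as features. It is not the PySpark script from this repo: the choice of bigrams, the Laplace smoothing value, the class `NameNaiveBayes`, and the six-name training sample are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(name, n=2):
    """Split a name into overlapping character n-grams, padded with '^' and '$'."""
    padded = "^" + name.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NameNaiveBayes:
    """Multinomial naive Bayes over character n-grams with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                          # smoothing strength (assumed value)
        self.class_counts = Counter()               # number of names seen per race
        self.feature_counts = defaultdict(Counter)  # n-gram counts per race
        self.vocab = set()                          # all n-grams seen in training

    def fit(self, names, races):
        for name, race in zip(names, races):
            self.class_counts[race] += 1
            for gram in char_ngrams(name):
                self.feature_counts[race][gram] += 1
                self.vocab.add(gram)
        return self

    def predict(self, name):
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_race, best_score = None, -math.inf
        for race, count in self.class_counts.items():
            # log prior + sum of smoothed log likelihoods of each n-gram
            score = math.log(count / total)
            denom = sum(self.feature_counts[race].values()) + self.alpha * vocab_size
            for gram in char_ngrams(name):
                score += math.log((self.feature_counts[race][gram] + self.alpha) / denom)
            if score > best_score:
                best_race, best_score = race, score
        return best_race


# Toy training sample of well-known characters, not the 789-row dataset.
names = ["Frodo", "Bilbo", "Gimli", "Gloin", "Legolas", "Elrond"]
races = ["Hobbit", "Hobbit", "Dwarf", "Dwarf", "Elf", "Elf"]
model = NameNaiveBayes().fit(names, races)
print(model.predict("Frodo"))
```

With so few training names the model only memorizes them; on a real dataset you would hold out a test split and compare predicted races against the true labels.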