This project uses HDFS and MapReduce to calculate average letter frequencies across several languages, using books available from Project Gutenberg.
All files were downloaded from the Project Gutenberg website and are in Plain Text UTF-8 format.
Each file must be named with its language as a prefix, and we assume that all books in the same language share that prefix followed by "-" (a small sketch of how the prefix can be read back from a file name follows the examples below).
Example:
- en-book1.txt
- en-book2.txt
- it-book1.txt
- pt-book1.txt …
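As a quick illustration of this naming convention, here is a stand-alone sketch (the class and method names are illustrative, not taken from the project's code) that recovers the language code from a file name; the MapReduce sketch further down applies the same idea inside the Mapper:

```java
// Minimal sketch (illustrative names): recover the language code from a
// file name that follows the "<language>-<title>.txt" convention above.
public class LanguagePrefix {

    static String languageOf(String fileName) {
        // "en-book1.txt" -> "en"
        return fileName.substring(0, fileName.indexOf('-'));
    }

    public static void main(String[] args) {
        System.out.println(languageOf("en-book1.txt")); // prints "en"
        System.out.println(languageOf("pt-book1.txt")); // prints "pt"
    }
}
```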
Software versions
- Java SE 1.7
- Hadoop Virtual Machine (VM) - Ubuntu 64-bit
- Hadoop MapReduce 2.2.0
- Oracle VirtualBox 6.1
- Eclipse IDE - Version: 2020-12 (4.18.0)
For this project, six books were used, in English, Portuguese, and Italian:
- Journal of Small Things by Helen Mackay
- Il perduto amore by Umberto Fracchia
- Memorias Posthumas de Braz Cubas by Machado de Assis
- Five Little Friends by Sherred Willcox Adams
- Orlando innamorato by Matteo Maria Boiardo
- Dom Casmurro by Machado de Assis
See the full code explanation here.
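The linked write-up has the full details; as a rough outline only, a job along these lines could be written against the Hadoop 2.2.0 mapreduce API. This is a hedged sketch, not the project's actual LetterFrequency source: the class names, key format, and letter filtering are assumptions. The Mapper reads the language prefix from the file name, emits a count of 1 for every letter tagged with that prefix, and the Reducer (also used as a Combiner) sums the counts per (language, letter) key; turning the counts into average frequencies is left to the downstream analysis.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical sketch of a letter-frequency job; not the project's actual source.
public class LetterFrequencySketch {

    public static class LetterMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

        private static final LongWritable ONE = new LongWritable(1);
        private final Text outKey = new Text();
        private String language;

        @Override
        protected void setup(Context context) {
            // The input split is a FileSplit because FileInputFormat is used below,
            // so the file name ("en-book1.txt", ...) is available here.
            FileSplit split = (FileSplit) context.getInputSplit();
            String fileName = split.getPath().getName();
            language = fileName.substring(0, fileName.indexOf('-')); // e.g. "en"
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit ("<language>-<letter>", 1) for every basic Latin letter.
            // Handling of accented characters is project-specific and omitted here.
            String line = value.toString().toLowerCase();
            for (int i = 0; i < line.length(); i++) {
                char c = line.charAt(i);
                if (c >= 'a' && c <= 'z') {
                    outKey.set(language + "-" + c);
                    context.write(outKey, ONE);
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum all counts for one (language, letter) key.
            long sum = 0;
            for (LongWritable value : values) {
                sum += value.get();
            }
            context.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "letter frequency");
        job.setJarByClass(LetterFrequencySketch.class);
        job.setMapperClass(LetterMapper.class);
        job.setCombinerClass(SumReducer.class); // safe: the sums are associative
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. "books"
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. "output"
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The commands below create the books directory in HDFS, upload the books from the shared folder, and launch the JAR with the input and output paths as arguments: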
hadoop fs -mkdir /books
hadoop fs -put ./sf_VM-Shared-Folder/books/ /books
hadoop jar sf_VM-Shared-Folder/LetterFrequency.jar books output
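Once the job completes, the result can also be checked directly in HDFS before copying it anywhere (assuming the default reducer output file name, part-r-00000):

hadoop fs -cat output/part-r-00000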
The JAR file can be found here.
A screenshot from the terminal showing the Counters for how many books and records were processed by the Mapper and the Reducer:
The output file generated by the MapReduce program:
To work on the result, copy the file from HDFS to your local machine using this command:
hadoop fs -copyToLocal /user/soc/output/part-r-00000 ./sf_VM-Shared-Folder/frequency-letter.txt
Here is the final analysis comparing the three languages, done in Python (code here):