HDFS and MapReduce project to calculate average letter frequencies across a number of languages using the books that are available in Project Gutenberg.

pessini/LetterFrequency

Analysis of Letter Frequency

Apache Hadoop

Using HDFS and MapReduce to calculate average letter frequencies across a number of languages using the books that are available in Project Gutenberg.

Assumptions

All files are downloaded from the Project Gutenberg website and are in Plain Text UTF-8 format.

Each file name must start with the book's language as a prefix, followed by "-". All books in the same language must share the same prefix.

Example:

  • en-book1.txt
  • en-book2.txt
  • it-book1.txt
  • pt-book1.txt …
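The naming convention above can be checked with a few lines of Python. This is only an illustrative sketch: the helper name `language_prefix` is hypothetical and is not part of the project's Java code.

```python
import os


def language_prefix(filename):
    """Return the language prefix of a book file name, e.g. 'en' for 'en-book1.txt'."""
    base = os.path.basename(filename)
    # Everything before the first "-" is treated as the language prefix.
    prefix, sep, _rest = base.partition("-")
    if not sep or not prefix:
        raise ValueError("expected '<language>-<name>' but got %r" % base)
    return prefix
```

For example, `language_prefix("it-book1.txt")` gives `"it"`, while a name without a "-" raises `ValueError`, which makes misnamed files easy to spot before they are uploaded.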

Software version

  • JavaSE 1.7
  • Hadoop Virtual Machine (VM) - Ubuntu 64-bit
  • Hadoop MapReduce 2.2.0
  • Oracle VirtualBox 6.1
  • Eclipse IDE - Version: 2020-12 (4.18.0)

Dataset

This project used 6 books in English, Portuguese and Italian.

See the full code explanation here.

Loading data into HDFS

hadoop fs -mkdir /books
hadoop fs -put ./sf_VM-Shared-Folder/books/ /books

Running the program

hadoop jar sf_VM-Shared-Folder/LetterFrequency.jar books output

The JAR file can be found here.
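As a rough picture of what an "average letter frequency per language" can mean, here is a plain-Python sketch that does not use Hadoop. It rests on assumptions that the README does not confirm: only alphabetic characters are counted, letters are lowercased, each book's frequencies are relative to its own letter total, and the per-language value is the mean of the per-book frequencies. The function names are hypothetical, and the JAR may compute its result differently.

```python
from collections import Counter, defaultdict


def book_frequencies(text):
    """Relative frequency of each letter in one book (letters only, lowercased)."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    total = len(letters)
    if total == 0:
        return {}
    return {letter: n / total for letter, n in Counter(letters).items()}


def average_by_language(books):
    """books: iterable of (language, text) pairs.

    Returns {language: {letter: mean of the per-book frequencies}}.
    """
    sums = defaultdict(Counter)   # language -> letter -> summed frequency
    n_books = Counter()           # language -> number of books seen
    for language, text in books:
        n_books[language] += 1
        for letter, freq in book_frequencies(text).items():
            sums[language][letter] += freq
    return {
        lang: {letter: total / n_books[lang] for letter, total in letters.items()}
        for lang, letters in sums.items()
    }
```

In MapReduce terms, `book_frequencies` plays the part of the per-book map step and `average_by_language` the part of the reduce step that groups by language. A letter missing from one book still counts that book in the denominator, so it contributes a frequency of zero to the mean.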

A terminal screenshot showing the job counters for how many books and records were processed by the Mapper and the Reducer:

[Screenshot: job counters in the terminal]

The output file generated by the MapReduce program:

[Screenshot: contents of the output file]

To work with the result, copy the file from HDFS to your local machine using this command:

hadoop fs -copyToLocal /user/soc/output/part-r-00000 ./sf_VM-Shared-Folder/frequency-letter.txt
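Once the file is on the local machine it can be read back in Python. The sketch below assumes each line has the form `key<TAB>value`, which is the default layout written by Hadoop's `TextOutputFormat`, and that the value is numeric. Whether this job uses that format is an assumption, and the key is kept as an opaque string because its layout is not shown here. The function name `parse_output` is hypothetical.

```python
def parse_output(lines):
    """Parse 'key<TAB>value' lines into a {key: float} dict, skipping blank lines."""
    result = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore empty lines
        key, sep, value = line.partition("\t")
        if not sep:
            raise ValueError("no tab separator in line: %r" % line)
        result[key] = float(value)
    return result
```

An open file object can be passed directly, since iterating over it yields lines, for instance the object returned by `open("./sf_VM-Shared-Folder/frequency-letter.txt", encoding="utf-8")`.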

Plotting the results

Here is the final analysis comparing the 3 languages using Python (code here):

[Figure: letter frequency comparison across the 3 languages]
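A comparison chart of this kind could be drawn along the following lines. This is a minimal sketch and not the author's plotting script: it assumes matplotlib is installed and that the frequencies have already been arranged as `{language: {letter: frequency}}`. The names `to_series` and `plot_frequencies` and the output file name are hypothetical.

```python
def to_series(freqs):
    """Return (letters, {language: [frequency per letter]}) with a shared letter order."""
    letters = sorted({letter for per_lang in freqs.values() for letter in per_lang})
    series = {
        lang: [per_lang.get(letter, 0.0) for letter in letters]
        for lang, per_lang in freqs.items()
    }
    return letters, series


def plot_frequencies(freqs, path="letter-frequency.png"):
    """Save a grouped bar chart (one bar per language and letter) to `path`."""
    import matplotlib
    matplotlib.use("Agg")  # render off-screen, no display needed
    import matplotlib.pyplot as plt

    letters, series = to_series(freqs)
    width = 0.8 / max(len(series), 1)  # bars of one letter share 0.8 units
    fig, ax = plt.subplots(figsize=(12, 4))
    for i, (lang, values) in enumerate(sorted(series.items())):
        positions = [x + i * width for x in range(len(letters))]
        ax.bar(positions, values, width=width, label=lang)
    # Put each tick at the centre of its group of bars.
    ax.set_xticks([x + 0.4 - width / 2 for x in range(len(letters))])
    ax.set_xticklabels(letters)
    ax.set_ylabel("average frequency")
    ax.legend()
    fig.savefig(path)
    plt.close(fig)
```

Letters that occur in only one language (accented characters, for example) get a zero-height bar in the others, so all languages share the same x-axis.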
