Hadoop Tweet Wordcounter Job
What is it?
A tweet retriever & Hadoop wordcounter job, built to demonstrate Hadoop HDFS usage for a 'CSC338: Parallel & Distributed Processing' group project at Missouri State University.
Why use it?
For sentiment analysis of the most recent tweets containing a particular hashtag. For example, if you run the job on tweets containing the hashtag
#food, you may be able to draw conclusions about the most discussed meals within the window of time the tweets were retrieved.
How it works
There are two major steps in running the project:
- Retrieve tweets by hashtag via the Twitter API and write them to a text file
- Run a wordcounter script with Hadoop to count the number of word occurrences in the retrieved tweets file
Steps to setup & run:
Make sure you have Hadoop 2.7.3 and Python 3 installed on the server or local computer where you will be hosting the project.
Run the tweet retriever script with
./get-tweets.sh #hashtag-word, where
#hashtag-word is your desired hashtag. To quit the script once the desired number of tweets has been retrieved, press Ctrl-C.
To copy the text file to HDFS and run the Hadoop job on it, run
./job.sh [textfilename.txt]. NOTE: the default text file created by the tweet retriever in the previous step is 'fetched_tweets.txt', so use that name unless you plan to run the Hadoop wordcounter on a different file.
Viewing the Output:
- Assuming the job completes successfully, the output is placed in the local folder where the repository files are located.
- Open and view the text file inside the '/completed-wordcount' directory. Words are sorted from most common to least common occurrences.
- There is a serial wordcount script that can be used in lieu of the Hadoop job script. Run
python serial-wordcount.py [filename.txt] instead of the job.sh script.
- Commands that manipulate HDFS begin with
hdfs dfs, followed by the operation to execute and any arguments. For example,
hdfs dfs -ls [directory-name] can be used to verify that files were copied to HDFS.
- If you need to create a directory on HDFS, run
hdfs dfs -mkdir /directory-name
- You can type 'hadoop-streaming-*.jar' instead of remembering the exact version number when referencing the Hadoop streaming jar file
$PWD gives the current working directory
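The serial wordcount script mentioned above might look roughly like the sketch below. This is an illustrative approximation, not the actual serial-wordcount.py; the tab-separated output format is an assumption chosen to mirror typical wordcount output, sorted from most to least frequent:

```python
import sys
from collections import Counter

def count_words(lines):
    """Count word occurrences across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python serial-wordcount.py [filename.txt]
    with open(sys.argv[1]) as f:
        counts = count_words(f)
    # Print from most common to least common, one word per line.
    for word, n in counts.most_common():
        print(f"{word}\t{n}")
```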
External Resources & Documentation:
- The Mapper & Reducer are based on this Hadoop Application Walkthrough
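For background, a typical Hadoop Streaming wordcount mapper/reducer pair follows the shape sketched below. This is an illustration in the style of such walkthroughs, not the project's exact mapper and reducer: the mapper emits one word/count pair per word, and the reducer sums counts for consecutive identical words, relying on Hadoop sorting the mapper output by key before the reduce phase:

```python
import sys

def map_line(line):
    """Mapper step: emit (word, 1) for every word in a line of tweets."""
    for word in line.strip().lower().split():
        yield word, 1

def reduce_sorted(pairs):
    """Reducer step: sum counts for consecutive identical words.
    Hadoop sorts mapper output by key before the reducer runs, so
    identical words arrive adjacent to one another."""
    current, total = None, 0
    for word, count in pairs:
        if word == current:
            total += count
        else:
            if current is not None:
                yield current, total
            current, total = word, count
    if current is not None:
        yield current, total

if __name__ == "__main__" and not sys.stdin.isatty():
    # As standalone streaming scripts, mapper and reducer would each
    # read sys.stdin; here the two steps are chained in-process.
    pairs = sorted(p for line in sys.stdin for p in map_line(line))
    for word, total in reduce_sorted(pairs):
        print(f"{word}\t{total}")
```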