This repository contains the source code for all of my cloud computing projects.

  • Hadoop MapReduce
  • Spark

This Hadoop MapReduce project outputs the word count for each distinct word in each file. Output is in the form 'word#####filename count', where '#####' is the delimiter.
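The map and reduce steps behind this output can be sketched in plain Python. This is only an illustration of the idea, not the project's Hadoop code: the function names `map_words` and `reduce_counts`, the in-memory `corpus`, and the choice to lowercase and split on whitespace are all assumptions.

```python
from collections import defaultdict

DELIMITER = "#####"


def map_words(filename, text):
    """Mapper: emit ('word#####filename', 1) for every word in the file."""
    # Assumption: words are lowercased and separated by whitespace.
    for word in text.lower().split():
        yield f"{word}{DELIMITER}{filename}", 1


def reduce_counts(pairs):
    """Reducer: sum the counts for each 'word#####filename' key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


# Hypothetical in-memory corpus standing in for the input directory.
corpus = {
    "file1.txt": "Hadoop is yellow Hadoop",
    "file2.txt": "yellow Hadoop is an elephant",
}

pairs = [pair for name, text in corpus.items() for pair in map_words(name, text)]
counts = reduce_counts(pairs)

# Print one 'word#####filename count' line per key.
for key in sorted(counts):
    print(key, counts[key])
```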

Execution :

argument 1 : input directory where files are stored.
argument 2 : output directory.

Outputs the term frequency (TF) for each word in the corpus in the format 'word#####filename TF', where '#####' is the delimiter.

TF(t,d) = No. of times term t appears in document d

The TF value written to the output is the log-weighted form:

TF = 1 + log10(TF(t,d))
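As a quick worked example of the formula, the sketch below applies the log weighting to a few raw counts. The helper name `log_weighted_tf` is hypothetical, and returning 0.0 for a count of zero is an assumption (the formula itself is only defined for positive counts).

```python
import math


def log_weighted_tf(raw_count):
    """Return 1 + log10(raw_count) for a positive raw count."""
    # Assumption: a term that does not appear gets weight 0.0,
    # since log10(0) is undefined.
    if raw_count <= 0:
        return 0.0
    return 1 + math.log10(raw_count)


# A word seen once keeps weight 1; each tenfold increase adds 1.
for count in (1, 10, 100):
    print(count, log_weighted_tf(count))
```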

Execution :

argument 1 : input directory where files are stored.
argument 2 : output directory.

Calculates the term frequency (TF) of each word in the corpus and the inverse document frequency (IDF) of each word, then outputs TF-IDF in the format 'word#####filename TFIDF', where '#####' is the delimiter.

TF = 1 + log10(TF(t,d))

IDF = log10(Total no. of documents / No. of documents containing term t)
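The two formulas can be combined as in the sketch below, which takes TF-IDF to be the product TF * IDF (the usual definition). It is a plain-Python illustration rather than the project's MapReduce job; the function name `tf_idf` and the sample counts are hypothetical.

```python
import math

DELIMITER = "#####"


def tf_idf(counts, total_docs):
    """Map 'word#####filename' -> raw count to 'word#####filename' -> TF-IDF."""
    # Count how many documents contain each word (one key per word/file pair).
    docs_with_term = {}
    for key in counts:
        word, _filename = key.split(DELIMITER)
        docs_with_term[word] = docs_with_term.get(word, 0) + 1

    scores = {}
    for key, raw_count in counts.items():
        word, _filename = key.split(DELIMITER)
        tf = 1 + math.log10(raw_count)
        idf = math.log10(total_docs / docs_with_term[word])
        scores[key] = tf * idf
    return scores


# Hypothetical word counts for a two-document corpus.
counts = {
    "hadoop#####file1.txt": 2,
    "hadoop#####file2.txt": 1,
    "elephant#####file2.txt": 1,
}
scores = tf_idf(counts, total_docs=2)
for key in sorted(scores):
    print(key, scores[key])
```

Note that a word appearing in every document ("hadoop" here) gets an IDF of log10(1) = 0, so its TF-IDF is 0 regardless of how often it occurs.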

Execution :

argument 1 : input directory.
argument 2 : output directory.

A basic query search engine that takes a user query and outputs the list of documents that match the query, in the format 'filename TFIDFWeightSum'. The input to the mapper is the output of

Execution :

argument 1 : input directory.
      Note: Give the output files' directory of as input directory.
argument 2 : output directory.
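The search step can be sketched as follows: for every query term, add up the weights of the matching words per file. This sketch assumes the mapper input consists of lines in the 'word#####filename TFIDF' format separated by whitespace; the function name `search`, the sample lines, and the lowercasing of the query are hypothetical.

```python
from collections import defaultdict

DELIMITER = "#####"


def search(query, tfidf_lines):
    """Return {filename: sum of TF-IDF weights of the query terms it contains}."""
    # Assumption: the query is lowercased and split on whitespace.
    query_terms = set(query.lower().split())
    totals = defaultdict(float)
    for line in tfidf_lines:
        key, weight = line.split()
        word, filename = key.split(DELIMITER)
        if word in query_terms:
            totals[filename] += float(weight)
    return dict(totals)


# Hypothetical input lines in the 'word#####filename TFIDF' format.
lines = [
    "hadoop#####file1.txt 0.5",
    "yellow#####file1.txt 0.25",
    "hadoop#####file2.txt 0.125",
    "elephant#####file2.txt 0.75",
]

results = search("yellow hadoop", lines)
for filename in sorted(results):
    print(filename, results[filename])
```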

Given a graph of hyperlinks, with out-links from each web page to other pages, this calculates the PageRank of every page and outputs the pages in descending order of rank.

Execution :

hadoop jar PageRank.jar argument_1 argument_2

argument_1 : input directory.
argument_2 : output directory.
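A minimal sketch of an iterative PageRank computation is shown below. It does not reproduce PageRank.jar: the damping factor of 0.85, the fixed iteration count, the uniform initial rank, and the function name `page_rank` are all assumptions, and pages without out-links simply do not pass on their rank in this simplified version.

```python
def page_rank(out_links, damping=0.85, iterations=10):
    """Return (page, rank) pairs sorted by descending rank."""
    # Collect every page that appears as a source or a target.
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    n = len(pages)

    # Assumption: every page starts with the same rank.
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_ranks = {page: (1 - damping) / n for page in pages}
        for page, targets in out_links.items():
            if not targets:
                continue  # pages without out-links contribute nothing here
            share = damping * ranks[page] / len(targets)
            for target in targets:
                new_ranks[target] += share
        ranks = new_ranks
    return sorted(ranks.items(), key=lambda item: item[1], reverse=True)


# Hypothetical three-page graph: A links to B and C, B to C, C to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranked = page_rank(graph, iterations=50)
for page, rank in ranked:
    print(page, rank)
```

In this small graph C collects links from both A and B, so it ends up ranked first once the iteration has settled.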

Python-Spark program for finding the beta coefficients of linear regression by computing the summation form of the closed-form expression: $\hat{\beta} = (X^T X)^{-1} X^T Y$

Execution :

spark-submit argument_1

argument_1 : input file name
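The summation form rewrites $X^T X$ as the sum of outer products $\sum_i x_i x_i^T$ and $X^T Y$ as $\sum_i x_i y_i$, so each row contributes independently and the partial results can be added together. The sketch below shows this in plain Python, with `functools.reduce` over a list standing in for a distributed reduce. It is not the repository's Spark program: the helper names, the sample data, and the leading 1.0 added to each row for an intercept are assumptions.

```python
from functools import reduce


def outer(x):
    """Outer product x * x^T as a list of rows."""
    return [[a * b for b in x] for a in x]


def add_matrices(m1, m2):
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(m1, m2)]


def add_vectors(v1, v2):
    return [a + b for a, b in zip(v1, v2)]


def solve(a, b):
    """Solve a * beta = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(n):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [x - factor * y for x, y in zip(aug[row], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit(rows):
    """rows: (y, [features]) pairs. Returns [intercept, coefficient, ...]."""
    # Assumption: prepend 1.0 to every feature vector to model an intercept.
    points = [(y, [1.0] + list(features)) for y, features in rows]
    xtx = reduce(add_matrices, (outer(x) for _, x in points))
    xty = reduce(add_vectors, ([xi * y for xi in x] for y, x in points))
    return solve(xtx, xty)


# Hypothetical data lying exactly on the line y = 1 + 2x.
rows = [(3.0, [1.0]), (5.0, [2.0]), (7.0, [3.0]), (9.0, [4.0])]
beta = fit(rows)
print(beta)
```

Solving the linear system directly, rather than forming the inverse explicitly, gives the same coefficients and is a design choice of this sketch.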