Small files merger

This a quick and dirty MR job to merge many small files using a Hadoop Map-Reduce (well - map-only) job. It should run on any Hadoop cluster, but it has specific optimizations for running against Azure Storage on Azure HDInsight.

Usage for HDInsight

From a PowerShell window, with mvn and git in the path. Assuming $clust is an HDInsight cluster, and $cont is the default container for it.

git clone https://github.com/mooso/smallfilesmerge.git
cd smallfilesmerge
mvn package
$inPath = "wasb://myinputcontainer@myaccount.blob.core.windows.net/path/to/files"
$outPath = "wasb://myoutputcontainer@myaccount.blob.core.windows.net/output/path"
Set-AzureStorageBlobContent -Blob "jars/microsoft-hadoop-smallfilemerge-0.0.1.jar" -Container $cont.Name -File .\target\microsoft-hadoop-smallfilemerge-0.0.1.jar –Force
$jobDef = New-AzureHDInsightMapReduceJobDefinition -JarFile "/jars/microsoft-hadoop-smallfilemerge-0.0.1.jar" -ClassName "com.microsoft.hadoop.smallfilesmerge.FileMergerByDirectory" -Arguments $inPath, $outPath
$job = Start-AzureHDInsightJob -Cluster $clust.Name -JobDefinition $jobDef
Wait-AzureHDInsightJob -Job $job -WaitTimeoutInSeconds 72000
Get-AzureHDInsightJobOutput -Cluster $clust.Name -JobId $job.JobId

Additional features and optimizations

You can specify multiple input directories/otuput directories as a comma-separated list
For testing, you can specify three additional arguments: -popInput <numDirs> <numFiles>, which will just populate numDirectories each with numFiles files in them.

For even faster merge jobs if you have lots of files, you can specify the account key for the input path in the define combine.directory.account.key. This will allow the merger to use storage API directly to more quickly list the input files. PowerShell example:

  $defines = @{ "mapred.task.timeout"="6000000"; "combine.directory.account.key"=$(Get-AzureStorageKey $myStorageAccount).Primary }
  $jobDef = New-AzureHDInsightMapReduceJobDefinition -JarFile "/jars/microsoft-hadoop-smallfilemerge-0.0.1.jar" -ClassName "com.microsoft.hadoop.smallfilesmerge.FileMergerByDirectory" -Arguments $inPath, $outPath -Defines $defines

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
src/main/java/com/microsoft/hadoop/smallfilesmerge		src/main/java/com/microsoft/hadoop/smallfilesmerge
.gitignore		.gitignore
README.md		README.md
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Small files merger

Usage for HDInsight

Additional features and optimizations

About

Releases

Packages

Languages

mooso/smallfilesmerge

Folders and files

Latest commit

History

Repository files navigation

Small files merger

Usage for HDInsight

Additional features and optimizations

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages