Implement a Hadoop job to merge many small files

Small files merger

This is a quick-and-dirty job that merges many small files using a Hadoop MapReduce (in fact, map-only) job. It should run on any Hadoop cluster, but it includes specific optimizations for running against Azure Storage on Azure HDInsight.
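To illustrate the idea behind the job, here is a minimal local-filesystem sketch of merging many small files into fewer large ones. This is only an analogy of what each map task does, not the repository's Java code; the function name and the 4 MB target size are illustrative assumptions.

```python
import os

def merge_small_files(input_dir, output_dir, target_size=4 * 1024 * 1024):
    """Concatenate many small files into fewer large 'part' files.

    Local analogy of the map-only merge job; names and the 4 MB
    target size are illustrative, not taken from the repo.
    """
    os.makedirs(output_dir, exist_ok=True)
    part, current, written = 0, None, 0
    for name in sorted(os.listdir(input_dir)):
        path = os.path.join(input_dir, name)
        if not os.path.isfile(path):
            continue
        # Roll over to a new output file once the size target is reached.
        if current is None or written >= target_size:
            if current is not None:
                current.close()
            current = open(os.path.join(output_dir, f"part-{part:05d}"), "wb")
            part, written = part + 1, 0
        with open(path, "rb") as f:
            data = f.read()
        current.write(data)
        written += len(data)
    if current is not None:
        current.close()
```

In the real job, the rollover boundary is determined by how the input files are grouped into splits rather than by a byte threshold.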

Usage for HDInsight

From a PowerShell window, with mvn and git on the path, and assuming $clust is an HDInsight cluster and $cont is its default storage container:

git clone
cd smallfilesmerge
mvn package
$inPath = "wasb://"
$outPath = "wasb://"
Set-AzureStorageBlobContent -Blob "jars/microsoft-hadoop-smallfilemerge-0.0.1.jar" -Container $cont.Name -File .\target\microsoft-hadoop-smallfilemerge-0.0.1.jar -Force
$jobDef = New-AzureHDInsightMapReduceJobDefinition -JarFile "/jars/microsoft-hadoop-smallfilemerge-0.0.1.jar" -ClassName "" -Arguments $inPath, $outPath
$job = Start-AzureHDInsightJob -Cluster $clust.Name -JobDefinition $jobDef
Wait-AzureHDInsightJob -Job $job -WaitTimeoutInSeconds 72000
Get-AzureHDInsightJobOutput -Cluster $clust.Name -JobId $job.JobId

Additional features and optimizations

  • You can specify multiple input/output directories as a comma-separated list

  • For testing, you can specify three additional arguments: -popInput <numDirs> <numFiles>, which populates the input with numDirs directories, each containing numFiles files.

  • For even faster merge jobs when you have many files, you can specify the account key for the input path in a define. This lets the merger use the storage API directly to list the input files more quickly. PowerShell example:

      $defines = @{ "mapred.task.timeout"="6000000"; ""=$(Get-AzureStorageKey $myStorageAccount).Primary }
      $jobDef = New-AzureHDInsightMapReduceJobDefinition -JarFile "/jars/microsoft-hadoop-smallfilemerge-0.0.1.jar" -ClassName "" -Arguments $inPath, $outPath -Defines $defines
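For local experimentation with the -popInput behavior described above, a sketch like the following creates the same shape of test data: numDirs directories, each holding numFiles small files. The function and file naming are illustrative assumptions, not the job's actual output layout.

```python
import os

def populate_input(base_dir, num_dirs, num_files):
    """Create num_dirs directories, each with num_files small files.

    Local sketch of the -popInput <numDirs> <numFiles> test option;
    directory and file names here are illustrative.
    """
    for d in range(num_dirs):
        dir_path = os.path.join(base_dir, f"dir-{d:04d}")
        os.makedirs(dir_path, exist_ok=True)
        for f in range(num_files):
            with open(os.path.join(dir_path, f"file-{f:04d}.txt"), "w") as fh:
                fh.write(f"sample record {d}/{f}\n")
```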