Common Crawl Scala Example

This project shows how to generate the input files for graph generation. It is a Scala version of hostlinks_to_graph.py from the Common Crawl project.

Getting Started

  • Clone the project.

  • Build the shadow JAR with the command ./gradlew shadowJar

  • Copy the shadow JAR from build/libs/cc-scala-all.jar to your Spark gateway.

  • Run the job in the following format: spark-submit --class com.ntent.commoncrawl.HostLinksToGraph cc-scala-all.jar args[] where the arguments are, in order:

    • args[0] InputParquet
    • args[1] EdgesOutput
    • args[2] VerticesOutput
    • args[3] ValidateHosts
    • args[4] SaveAsText
    • args[5] NumPartitions
    • args[6] VertexIDs

    Please refer to the original code (hostlinks_to_graph.py) for details of each parameter.
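
The invocation described above can be sketched as a small shell script. This is only an illustration: every path and value below is a placeholder, and the value formats (boolean-like strings for ValidateHosts and SaveAsText, an integer for NumPartitions, a path for VertexIDs) are assumptions based on the parameter names, not confirmed from the source. The script prints the command by default instead of running it, so it can be checked before submitting.

```shell
#!/bin/sh
# Placeholder values: replace each one with settings for your own cluster.
JAR="cc-scala-all.jar"
MAIN_CLASS="com.ntent.commoncrawl.HostLinksToGraph"
INPUT_PARQUET="hdfs:///path/to/hostlinks-parquet"  # args[0] InputParquet
EDGES_OUTPUT="hdfs:///path/to/edges"               # args[1] EdgesOutput
VERTICES_OUTPUT="hdfs:///path/to/vertices"         # args[2] VerticesOutput
VALIDATE_HOSTS="true"                              # args[3] ValidateHosts (assumed boolean-like)
SAVE_AS_TEXT="true"                                # args[4] SaveAsText (assumed boolean-like)
NUM_PARTITIONS="32"                                # args[5] NumPartitions (assumed integer)
VERTEX_IDS="hdfs:///path/to/vertex-ids"            # args[6] VertexIDs (assumed to be a path)

# Assemble the command with the arguments in the order args[0]..args[6].
set -- spark-submit --class "$MAIN_CLASS" "$JAR" \
  "$INPUT_PARQUET" "$EDGES_OUTPUT" "$VERTICES_OUTPUT" \
  "$VALIDATE_HOSTS" "$SAVE_AS_TEXT" "$NUM_PARTITIONS" "$VERTEX_IDS"
CMD_LINE="$*"

# Dry run by default: print the command. Set DRY_RUN=0 to execute it.
if [ "${DRY_RUN:-1}" = "1" ]; then
  echo "$CMD_LINE"
else
  "$@"
fi
```

Run the script once as-is to inspect the printed command, then run it again with DRY_RUN=0 from the directory that contains cc-scala-all.jar. Cluster-specific spark-submit options (master, memory, and so on) are left out here and would go before the JAR name.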

Prerequisites

If you use the shadow JAR, all dependencies are included in the final jar file; otherwise you need to add the required dependencies to your project yourself. Please refer to the build.gradle file for the full list of dependencies.

Running the tests

This project does not have any unit tests yet.

Built With

License

This project is licensed under the Apache 2.0 License. See the LICENSE file for details.
