Common Crawl implementation(s) in Scala


Common Crawl Scala Example

This project provides an example of how to generate an input file for graph generation. It is a Scala port of hostlinks_to_graph.py from the Common Crawl project.

Getting Started

  • Clone the project.

  • Build the shadow JAR with the command ./gradlew shadowJar

  • Copy the shadow JAR from build/libs/cc-scala-all.jar to your Spark gateway.

  • Run the code in the following format: spark-submit --class com.ntent.commoncrawl.HostLinksToGraph cc-scala-all.jar args[], where the arguments are:

    • args[0] InputParquet
    • args[1] EdgesOutput
    • args[2] VerticesOutput
    • args[3] ValidateHosts
    • args[4] SaveAsText
    • args[5] NumPartitions
    • args[6] VertexIDs

    Please refer to the original code for details of these parameters.
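As a rough illustration of the run step above, the sketch below assembles a full spark-submit command with all seven positional arguments. Every value in it is a hypothetical placeholder, and the assumed type of each argument (path, boolean, integer) is a guess from the parameter names rather than something this README states, so check HostLinksToGraph in the source before relying on it. The helper name hostlinks_cmd is also made up for this example. The script only prints the command so it can be reviewed before running.

```shell
#!/bin/sh
# Hypothetical placeholder values; none of these come from the project.
# The type of each argument is an ASSUMPTION based on its name.
INPUT_PARQUET="hdfs:///tmp/hostlinks"          # args[0] InputParquet (assumed path)
EDGES_OUTPUT="hdfs:///tmp/graph/edges"         # args[1] EdgesOutput (assumed path)
VERTICES_OUTPUT="hdfs:///tmp/graph/vertices"   # args[2] VerticesOutput (assumed path)
VALIDATE_HOSTS="true"                          # args[3] ValidateHosts (assumed boolean)
SAVE_AS_TEXT="true"                            # args[4] SaveAsText (assumed boolean)
NUM_PARTITIONS="16"                            # args[5] NumPartitions (assumed integer)
VERTEX_IDS="hdfs:///tmp/graph/vertex_ids"      # args[6] VertexIDs (assumed path)

# hostlinks_cmd (hypothetical helper) prints the command line.
# The arguments are positional, so their order must match args[0]..args[6].
hostlinks_cmd() {
  echo "spark-submit --class com.ntent.commoncrawl.HostLinksToGraph cc-scala-all.jar" \
    "$INPUT_PARQUET" "$EDGES_OUTPUT" "$VERTICES_OUTPUT" \
    "$VALIDATE_HOSTS" "$SAVE_AS_TEXT" "$NUM_PARTITIONS" "$VERTEX_IDS"
}

# Print the command for review; paste it into a shell on the gateway to run it.
hostlinks_cmd
```

Keeping the values in named variables makes it harder to swap two positional arguments by accident, which matters here because nothing in the command line itself labels them.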

Prerequisites

If you use the shadow JAR file, all dependencies are included in the final JAR. Otherwise, you need to add the required dependencies to your project yourself. Refer to the build.gradle file for the full list of dependencies.

Running the tests

This project doesn't have any unit tests yet.

Built With

License

This project is licensed under the Apache 2.0 License. See the LICENSE file for details.