Common Crawl Scala Example
This project provides an example of how to generate an input file for graph generation. It is a Scala version of hostlinks_to_graph.py from the Common Crawl project.
Create the shadow JAR with the Gradle shadowJar task.
Move the shadow JAR from build/libs/cc-scala-all.jar to your Spark gateway.
Run the code in the following format:
spark-submit --class com.ntent.commoncrawl.HostLinksToGraph cc-scala-all.jar args
where args are:
- args InputParquet
- args EdgesOutput
- args VerticesOutput
- args ValidateHosts
- args SaveAsText
- args NumPartitions
- args VertexIDs
Please refer to the original code (hostlinks_to_graph.py) for details of the parameters.
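As an illustration of the format above, here is a minimal sketch of a wrapper script. It assumes the seven arguments are positional and passed in the order listed, which this README does not state explicitly. All paths and values are hypothetical placeholders, including the assumption that VertexIDs takes a path. The script prints the assembled command for review and only executes it when RUN=1 is set.

```shell
#!/usr/bin/env bash
# Hypothetical placeholder values; none of these come from the project itself.
INPUT_PARQUET="hdfs:///path/to/hostlinks"      # InputParquet
EDGES_OUTPUT="hdfs:///path/to/edges"           # EdgesOutput
VERTICES_OUTPUT="hdfs:///path/to/vertices"     # VerticesOutput
VALIDATE_HOSTS="true"                          # ValidateHosts (value format assumed)
SAVE_AS_TEXT="true"                            # SaveAsText (value format assumed)
NUM_PARTITIONS="16"                            # NumPartitions
VERTEX_IDS="hdfs:///path/to/vertex_ids"        # VertexIDs (assumed to be a path)

# Build the command as an array so each argument stays a single word.
# Assumption: arguments are positional, in the order listed in this README.
CMD=(
  spark-submit
  --class com.ntent.commoncrawl.HostLinksToGraph
  cc-scala-all.jar
  "$INPUT_PARQUET"
  "$EDGES_OUTPUT"
  "$VERTICES_OUTPUT"
  "$VALIDATE_HOSTS"
  "$SAVE_AS_TEXT"
  "$NUM_PARTITIONS"
  "$VERTEX_IDS"
)

# Dry run by default: print the command so it can be checked first.
echo "${CMD[*]}"

# Execute only when explicitly requested with RUN=1.
if [ "${RUN:-0}" = "1" ]; then
  "${CMD[@]}"
fi
```

Printing the command before running it makes it easy to confirm the argument order against the original code before submitting a job to the cluster.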
If you use the shadow JAR, all dependencies are included in the final JAR file; otherwise, you need to add the required dependencies to your project yourself. Please refer to the build.gradle file for the full list of dependencies.
Running the tests
This project doesn't have any unit tests yet.
This project is licensed under the Apache 2.0 License. See the LICENSE file for details.