In my short experience learning frameworks and programming languages, I've noticed that my biggest difficulty is environment configuration, especially version mismatches. So, as a good practice, I'll write down the versions and configuration that I usually use.
IDE: IntelliJ IDEA
JDK: 11.0.17
Scala: 2.12.10
Spark Version: 3.1.2
Hadoop: 2.7.2
Spark does not run on every JDK: Spark 3.1.x supports JDK 8 and 11, and JDK 17 is supported only from Spark 3.3 onward. Other frameworks, such as Play Framework, have similar constraints. Likewise, not every Spark version is compatible with every Scala version; Spark 3.1.2 is built for Scala 2.12, so Scala 2.12.10 works here. There are several ways to control which JDK is used; in IntelliJ, you can choose the JDK version when you create the project.
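To confirm which JDK and Scala versions a project actually runs on, a small check like the one below can help. This is only a sketch: the object name `VersionCheck` and the helper `javaMajor` are made up for illustration, and it relies only on the standard `java.version` system property and `scala.util.Properties`.

```scala
// Sketch: report the JDK and Scala versions the current JVM is using.
object VersionCheck {
  // Extract the major Java version from a "java.version" string.
  // The legacy format "1.8.0_292" means Java 8; the modern format "11.0.17" means Java 11.
  def javaMajor(version: String): Int = {
    val parts = version.split("[._-]")
    if (parts(0) == "1" && parts.length > 1) parts(1).toInt else parts(0).toInt
  }

  def main(args: Array[String]): Unit = {
    val javaVersion  = System.getProperty("java.version")
    val scalaVersion = scala.util.Properties.versionNumberString
    println(s"Java $javaVersion (major ${javaMajor(javaVersion)}), Scala $scalaVersion")
  }
}
```

If the printed major version is not the one you expected, the project is probably picking up a different JDK than the one you selected in the IDE.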
For Hadoop to work with Spark on Windows, you need to create an environment variable. Create a new variable named "HADOOP_HOME" whose value is the path of your Hadoop directory WITHOUT \bin. After that, edit the Path variable and add a new entry with the value "%HADOOP_HOME%\bin".
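A quick way to catch the most common mistake (pointing HADOOP_HOME at the bin folder itself) is to validate the variable from Scala before starting Spark. The sketch below is an assumption about what a useful check looks like, not something Spark requires; `HadoopHomeCheck` and `validate` are hypothetical names.

```scala
import java.nio.file.{Files, Paths}

// Sketch: sanity-check the HADOOP_HOME value described above.
object HadoopHomeCheck {
  // Returns Right(path) when the value looks usable, or Left(reason) otherwise.
  def validate(hadoopHome: Option[String]): Either[String, String] =
    hadoopHome.map(_.trim).filter(_.nonEmpty) match {
      case None => Left("HADOOP_HOME is not set")
      case Some(home) =>
        // Normalize separators so Windows and Unix style paths are compared the same way.
        val normalized = home.replace('\\', '/').stripSuffix("/")
        if (normalized.toLowerCase.endsWith("/bin"))
          Left("HADOOP_HOME should point to the Hadoop directory, not its bin folder")
        else if (!Files.isDirectory(Paths.get(home, "bin")))
          Left(s"No bin directory found under $home")
        else Right(home)
    }

  def main(args: Array[String]): Unit =
    println(validate(sys.env.get("HADOOP_HOME")))
}
```

Running `HadoopHomeCheck` after editing the environment variables (and restarting the IDE or terminal so they are reloaded) prints either the accepted path or the reason it was rejected.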
ThisBuild / version := "0.0.0-YOUR_VERSION"
ThisBuild / scalaVersion := "2.12.10"
val sparkVersion = "3.1.2"
lazy val root = (project in file("."))
  .settings(
    name := "YOUR_PROJECT_NAME"
    // You can also declare the dependencies here
  )
libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion
libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion
libraryDependencies += "org.apache.spark" %% "spark-hive" % sparkVersion
You can add more dependencies. If a dependency is another Spark module, use the sparkVersion variable to specify its version, so that all Spark artifacts stay on the same release.
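As an illustration, extra Spark modules could be added to build.sbt like this. The two modules are just examples chosen for the sketch (pick whichever ones your project needs); the important part is that they reuse `sparkVersion` instead of a hard-coded number.

```scala
// build.sbt fragment (sketch): extra Spark modules reuse the same sparkVersion value,
// so every Spark artifact resolves to the same release.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-mllib"     % sparkVersion,
  "org.apache.spark" %% "spark-streaming" % sparkVersion
)
```

Because `%%` appends the Scala binary version (here 2.12) to the artifact name, the scalaVersion set in the build has to match a Scala version that the chosen Spark release was published for.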
I used a public dataset that contains popularity and other information about music on Spotify. It is available on Kaggle (an excellent platform for learning data science and data engineering). Link below.
The dataset has 26,173,515 rows (about 26 million).
Archive size:
- Unzipped: 3,401,833 KB (3.4 GB).
- Zipped: 967,994 KB (967 MB).