Environment configuration

In my short experience learning frameworks and programming languages, I have noticed that my biggest difficulty is environment configuration, especially version mismatches. As a good practice, I write down here the versions and configuration that I usually use.

Versions

IDE: IntelliJ IDEA

JDK: 11.0.17

Scala: 2.12.10

Spark Version: 3.1.2

Hadoop: 2.7.2

Explanations:

Each Spark release supports only certain JDK versions (other frameworks, such as Play Framework, have similar constraints). Spark 3.1.2 runs on Java 8 and 11; Java 17 is supported only from Spark 3.3 onward. Likewise, not every Spark version is compatible with every Scala version: Spark 3.1.2 is built for Scala 2.12, so Scala 2.12.10 works. There are several ways to control which JDK is used; in IntelliJ, for example, you can choose the JDK version when creating the project.
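A minimal sketch for checking at runtime which Java and Scala versions a project is actually using, assuming the Java 8/11 and Scala 2.12 requirements listed in the Spark 3.1.2 documentation. The object name `VersionCheck` and the helper `javaMajor` are names I made up for this example; they are not part of Spark.

```scala
object VersionCheck {
  // Java major versions assumed to be supported by Spark 3.1.2.
  val supportedJava: Set[Int] = Set(8, 11)

  // Extracts the major version from strings such as "11.0.17", "17",
  // or the legacy Java 8 format "1.8.0_292".
  def javaMajor(version: String): Int = {
    val parts = version.split("[._+-]")
    if (parts(0) == "1" && parts.length > 1) parts(1).toInt else parts(0).toInt
  }

  def main(args: Array[String]): Unit = {
    val javaVersion = System.getProperty("java.version")
    val scalaVersion = scala.util.Properties.versionNumberString

    println(s"Java:  $javaVersion (major ${javaMajor(javaVersion)})")
    println(s"Scala: $scalaVersion")

    if (!supportedJava.contains(javaMajor(javaVersion)))
      println("Warning: this Java version is outside the range assumed for Spark 3.1.2")
    if (!scalaVersion.startsWith("2.12."))
      println("Warning: Spark 3.1.2 artifacts are published for Scala 2.12")
  }
}
```

Running it from the project (for example with `sbt run`) shows the versions sbt really picked up, which may differ from what the IDE displays.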

For Hadoop to work with Spark, you need to set an environment variable. Create a new variable named "HADOOP_HOME" whose value is the path of the Hadoop directory WITHOUT \bin. Then edit Path and add a new entry with the value "%HADOOP_HOME%\bin".
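The setup above can be sanity-checked from Scala with a small sketch like the one below. It only verifies that HADOOP_HOME is set, that it does not end in the bin folder, and that its bin folder appears on the path. `HadoopEnvCheck` and `problems` are hypothetical names for this example, not part of Hadoop or Spark.

```scala
import java.io.File

object HadoopEnvCheck {
  // Returns a list of problems found; an empty list means the setup looks fine.
  def problems(hadoopHome: Option[String], path: String): List[String] =
    hadoopHome match {
      case None => List("HADOOP_HOME is not set")
      case Some(rawHome) =>
        val sep = File.separator
        val home = rawHome.stripSuffix(sep)
        val binDir = home + sep + "bin"

        // HADOOP_HOME must point at the Hadoop directory itself, not at its bin folder.
        val homeProblem =
          if (home.toLowerCase.endsWith(sep + "bin"))
            List("HADOOP_HOME should not include the bin folder")
          else Nil

        // The bin folder must be one of the entries of the path variable.
        val onPath = path
          .split(File.pathSeparator)
          .map(_.stripSuffix(sep))
          .exists(_.equalsIgnoreCase(binDir))
        val pathProblem = if (onPath) Nil else List(s"$binDir is not on the path")

        homeProblem ++ pathProblem
    }

  def main(args: Array[String]): Unit = {
    // System.getenv is used directly because variable names are
    // case-insensitive on Windows ("Path" vs "PATH").
    val found = problems(
      Option(System.getenv("HADOOP_HOME")),
      Option(System.getenv("PATH")).getOrElse("")
    )
    if (found.isEmpty) println("Hadoop environment looks fine")
    else found.foreach(p => println(s"Problem: $p"))
  }
}
```

Remember to restart the IDE or terminal after changing environment variables, otherwise the running process still sees the old values.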

build.sbt

ThisBuild / version := "0.0.0-YOUR_VERSION"

ThisBuild / scalaVersion := "2.12.10"

val sparkVersion = "3.1.2"

lazy val root = (project in file("."))
  .settings(
    name := "YOUR_PROJECT_NAME"
    //You can write the dependencies here too
  )

libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion
libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion
libraryDependencies += "org.apache.spark" %% "spark-hive" % sparkVersion

You can add more dependencies. If a dependency is another Spark module, use the sparkVersion variable to specify its version.
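As an illustration, this is how extra Spark modules could be added to the build.sbt above. The two modules shown are only examples I picked, not dependencies this project requires, and the fragment assumes `sparkVersion` is already defined as in the file above.

```scala
// build.sbt (fragment): example modules only; add the ones your project needs
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-mllib"     % sparkVersion,
  "org.apache.spark" %% "spark-streaming" % sparkVersion
)
```

Reusing `sparkVersion` keeps every Spark module on the same release, which avoids mixing incompatible Spark artifacts.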

Dataset

I used a public dataset that contains popularity and other information about music on Spotify. It is available on Kaggle (an excellent platform for learning data science and data engineering). Link below.

Number of rows is 26,173,515 (26 million).

Archive size:

Unzipped: 3,401,833 KB (3.4 GB).
Zipped: 967,994 KB (967 MB).

charts.csv / Kaggle
