# Loading Data with Spark

* GitHub repo link: https://github.com/JoeGanser/teaching/blob/main/Lectures/Spark/loading_data.ipynb

* Reading from CSV, JSON, and Parquet files
* Reading from a SQL connection
* Specifying a schema

In [3]:
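# Install PySpark into the notebook environment (Jupyter shell escape).
!pip install pyspark
from pyspark.sql import SparkSession

# Create a session, or reuse one that is already running.
# "spark.some.config.option" is a placeholder setting, not a real Spark option.
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()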



### Not specifying a schema or header

Reading a CSV or other static data file in Spark is straightforward. For CSV, use `spark.read.csv(...)`.

But what happens if we don't specify the schema or the header?

In [16]:
apple = spark.read.csv('data/fakefriends.csv')
apple.show(10)

+---+--------+---+---+
|_c0|     _c1|_c2|_c3|
+---+--------+---+---+
|  0|    Will| 33|385|
|  1|Jean-Luc| 26|  2|
|  2|    Hugh| 55|221|
|  3|  Deanna| 40|465|
|  4|   Quark| 68| 21|
|  5|  Weyoun| 59|318|
|  6|  Gowron| 37|220|
|  7|    Will| 54|307|
|  8|  Jadzia| 38|380|
|  9|    Hugh| 27|181|
+---+--------+---+---+
only showing top 10 rows



### Inferring schemas & loading headers

* To get Spark to infer the schema (the data type of each column), we specify `inferSchema=True`.
* To get Spark to treat the first row of the file as column names, we specify `header=True`. The option takes a boolean, not a list of names.

In [18]:
apple = spark.read.csv('data/fakefriends.csv',inferSchema=True,header=['index','name','age','friends'])
apple.show(10)

Py4JJavaError: An error occurred while calling o87.csv.
: java.lang.Exception: header flag can be true or false
	at org.apache.spark.sql.errors.QueryExecutionErrors$.paramIsNotBooleanValueError(QueryExecutionErrors.scala:1214)
	at org.apache.spark.sql.catalyst.csv.CSVOptions.getBool(CSVOptions.scala:96)
	at org.apache.spark.sql.catalyst.csv.CSVOptions.<init>(CSVOptions.scala:118)
	at org.apache.spark.sql.catalyst.csv.CSVOptions.<init>(CSVOptions.scala:47)
	at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:60)
	at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$11(DataSource.scala:210)
	at scala.Option.orElse(Option.scala:447)
	at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:207)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:411)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)
	at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:537)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
	at java.base/java.lang.reflect.Method.invoke(Method.java:578)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:1589)


## Reading from a CSV file

# Sources
https://github.com/databricks-academy/apache-spark-programming-with-databricks