### Task Description
You will create a dataset reader for the Visual Document Question Answering task.
+ Get the dataset from this [link](https://rrc.cvc.uab.es/?ch=17&com=downloads).
+ You must support reading both the train and test datasets.
+ You must support the simple interface `readDataset()`.
+ The schema must match the one provided in the cell below (check the details).
+ Provide a GitHub repo and setup instructions.
+ I will test this on a cluster, so everything must serialize properly (from one executor to another, and from one node to another).
+ Write the solution in Scala, with Python wrappers, so that it can be called this way:

In [None]:
from pyspark.sql.session import SparkSession
spark = SparkSession \
    .builder \
    .appName("scala_pyspark") \
    .config("spark.jars", "./docvqareader_2.12-0.1.jar") \
    .getOrCreate()

In [None]:
from jsl.task.docvqa_reader import DocVQA

# Path to the directory in which the .json file is located
path = "filesystem:/path/to/{val, train}"
df = DocVQA().readDataset(spark, path)
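
The call above implies a thin Python class that forwards to the Scala reader through the py4j gateway. Below is a minimal sketch of what `jsl/task/docvqa_reader.py` might contain. The JVM class name `jsl.task.DocVQAReader`, its no-argument constructor, and its `readDataset(SparkSession, String)` signature are all assumptions made for illustration, not part of the task statement. Passing the session to `DataFrame` also assumes PySpark 3.3 or later.

```python
from functools import reduce


class DocVQA:
    """Thin Python wrapper around a Scala reader shipped in the jar."""

    # Hypothetical fully qualified name of the Scala class; adjust it to
    # whatever the jar actually exposes.
    JVM_CLASS = "jsl.task.DocVQAReader"

    def readDataset(self, spark, path):
        # Imported lazily so this module can be imported without PySpark.
        from pyspark.sql import DataFrame

        # Walk the py4j gateway one attribute at a time:
        # spark._jvm.jsl.task.DocVQAReader
        jvm_class = reduce(getattr, self.JVM_CLASS.split("."), spark._jvm)

        # Assumption: the Scala class has a no-argument constructor and a
        # readDataset(SparkSession, String) method that returns a DataFrame.
        jdf = jvm_class().readDataset(spark._jsparkSession, path)

        # Wrap the Java DataFrame in a Python one (assumes PySpark >= 3.3,
        # where the second argument may be a SparkSession).
        return DataFrame(jdf, spark)
```

Keeping the wrapper this thin means all reading logic stays in Scala, so nothing Python-specific has to be shipped to the executors.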

### Schema
This is what the schema of the returned DataFrame should look like. Each row contains a number of questions, and each of these questions has multiple answers, all on the same row.
+ path: path to the file; it can be on a DFS.
+ modificationTime: this value comes from the OS.
+ questions: an array of questions.
+ answers: a 2D array in which each inner array provides the set of candidate answers for one question in that same row.
+ content: a binary buffer containing the image.

Check the JSON schema of the dataset for more details.

In [None]:
df.printSchema()

root
 |-- path: string (nullable = true)
 |-- modificationTime: timestamp (nullable = true)
 |-- length: long (nullable = true)
 |-- content: binary (nullable = true)
 |-- questions: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- answers: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)
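
To make the row layout concrete, here is a small pure-Python sketch of how per-question annotations could be grouped into one row per image, with `questions` and `answers` aligned by position. The JSON field names (`data`, `image`, `question`, `answers`) and the sample values are assumptions for illustration; check them against the actual annotation file. The real reader would do the equivalent grouping in Spark.

```python
import json

# Tiny made-up sample; the field names are assumed, not taken from the dataset.
SAMPLE = """
{"data": [
  {"image": "documents/a.png", "question": "What is the date?",
   "answers": ["1/8/93", "January 8, 1993"]},
  {"image": "documents/b.png", "question": "Who is the sender?",
   "answers": ["J. Smith"]},
  {"image": "documents/a.png", "question": "What is the title?"}
]}
"""


def group_by_image(annotation_json):
    """Return one dict per image with aligned questions and answers."""
    rows = {}  # dicts keep insertion order, so images stay in first-seen order
    for item in json.loads(annotation_json)["data"]:
        row = rows.setdefault(
            item["image"],
            {"path": item["image"], "questions": [], "answers": []},
        )
        row["questions"].append(item["question"])
        # Fall back to an empty list when an entry carries no answers.
        row["answers"].append(list(item.get("answers", [])))
    return list(rows.values())


rows = group_by_image(SAMPLE)
```

In this sketch, `rows[0]["answers"][i]` holds the candidate answers for `rows[0]["questions"][i]`, which mirrors the nested `array<array<string>>` column in the schema.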



### Let's count the number of questions
You should support all of these types of operations without problems.

In [None]:
from pyspark.sql.functions import explode
questions = df.select(explode("questions"))

In [None]:
questions.count()

Out[27]: 5349

In [None]:
df.rdd.getNumPartitions()

Out[15]: 120