### Task Description
You will create a dataset reader for the Visual Document Question Answering (DocVQA) task.
+ Get the dataset from this [link](https://rrc.cvc.uab.es/?ch=17&com=downloads)
+ You must support reading the train and test datasets.
+ You must support the simple interface readDataset().
+ The schema should be as provided in the cell below (check the details).
+ Provide a GitHub repo and setup instructions.
+ I will test this in a cluster, so it must serialize things properly (from one executor to another, from one node to another).
+ Write the solution in Scala, with Python wrappers, so that it can be called this way:

In [1]:
from pyspark.sql.session import SparkSession
spark = SparkSession \
.builder \
.appName("scala_pyspark") \
.config("spark.jars", "./docvqareader_2.12-0.1.jar") \
.getOrCreate()

23/12/10 17:43:37 WARN Utils: Your hostname, Sheoldred.local resolves to a loopback address: 127.0.0.1; using 192.168.100.214 instead (on interface en0)
23/12/10 17:43:37 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
23/12/10 17:43:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/12/10 17:43:37 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


In [2]:
from jsl.task.docvqa_reader import DocVQA

# path to the .json file
# path = "data/DocVQA/train_v1.0_withQT.json"
path = "data/DocVQA/test_v1.0.json"
df = DocVQA().readDataset(spark, path)

### Schema
This is what the schema of the returned DataFrame should look like. Each row holds a number of questions, and each of these questions has multiple answers, all on the same row.
+ path: path to the file, it can be on a DFS.
+ modificationTime: this value comes from the OS.
+ questions: an array of questions.
+ answers: a 2D array with each inner level array providing a set of candidate answers for each question in that same row.
+ content: a binary buffer containing the image (see the 'content' field below).

Check the JSON schema of the dataset for more details.
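
To make the row layout concrete, here is a minimal plain-Python sketch of the grouping the reader would have to perform. It is not the required Scala implementation. It assumes each JSON record carries one question under the keys `image`, `question` and (in the train split only) `answers`; these key names are assumptions to verify against the dataset's JSON, and the sample records are made up.

```python
from collections import OrderedDict


def group_by_image(records):
    """Collapse per-question records into one row per image."""
    rows = OrderedDict()
    for rec in records:
        row = rows.setdefault(
            rec["image"],
            {"path": rec["image"], "questions": [], "answers": []},
        )
        row["questions"].append(rec["question"])
        # Records without answers (assumed for the test split) get an empty
        # inner list, so answers[i] always lines up with questions[i].
        row["answers"].append(list(rec.get("answers") or []))
    return list(rows.values())


# Hypothetical sample records, not taken from the dataset.
sample = [
    {"image": "documents/a.png", "question": "What is the date?",
     "answers": ["1 Jan 1990", "1/1/90"]},
    {"image": "documents/b.png", "question": "Who signed?",
     "answers": ["J. Doe"]},
    {"image": "documents/a.png", "question": "What is the total?"},
]
rows = group_by_image(sample)
```

Keeping an empty inner list instead of a null matches the `containsNull = false` constraint on the outer `answers` array.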

In [3]:
df.printSchema()

root
 |-- path: string (nullable = true)
 |-- modificationTime: timestamp (nullable = false)
 |-- length: integer (nullable = false)
 |-- content: binary (nullable = true)
 |-- questions: array (nullable = false)
 |    |-- element: string (containsNull = false)
 |-- answers: array (nullable = false)
 |    |-- element: array (containsNull = false)
 |    |    |-- element: string (containsNull = true)



### Let's count the number of questions
You should support all these types of operations without problems.

In [4]:
from pyspark.sql.functions import explode
questions = df.select(explode("questions"))

In [5]:
questions.count()

                                                                                

39463
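
As a sanity check on the count above, the same number can be computed without Spark. `explode` emits one output row per array element and nothing for an empty array, so the result is the sum of the array lengths. The sketch below uses made-up rows in the shape of the schema; the helper name `count_questions` is hypothetical.

```python
def count_questions(rows):
    """Plain-Python equivalent of select(explode("questions")).count()."""
    return sum(len(row["questions"]) for row in rows)


# Hypothetical rows shaped like the schema above, not real data.
rows = [
    {"path": "a.png", "questions": ["q1", "q2"]},
    {"path": "b.png", "questions": ["q3"]},
    {"path": "c.png", "questions": []},  # contributes nothing, as with explode
]
total = count_questions(rows)
```

If each record in the JSON file holds exactly one question (an assumption to check against the file), this total should also equal the number of records in the file.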

In [6]:
df.rdd.getNumPartitions()

3