A simple Apache Drill alternative using PySpark, inspired by PyDAL
Run the terminal command: pip install microdrill
PySpark support was tested with Spark 1.6
ParquetTable(table_name, schema_index_file=file_name)
- table_name: Name used to reference the table.
- file_name: Name of the file to search for the table schema.
ParquetDAL(file_uri, sc)
- file_uri: Path to the files; it can be a local path, an hdfs:// URI, or any other supported location.
- sc: Spark Context (https://spark.apache.org/docs/1.6.0/api/python/pyspark.html#pyspark.SparkContext)
parquet_conn = ParquetDAL(file_uri, sc)
parquet_table = ParquetTable(table_name, schema_index_file=file_name)
parquet_conn.set_table(parquet_table)
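The snippet above assumes an existing Spark Context and already-imported classes. A fuller setup sketch follows; the import paths, the local-master configuration, and the "_metadata" schema file name are assumptions, so check them against your installed package:

```python
# Sketch only: import paths and names below are assumptions about the
# package layout and may differ in your microdrill version.
from pyspark import SparkContext
from microdrill.dal import ParquetDAL
from microdrill.table import ParquetTable

# Assumed local Spark setup; in production you would point at a cluster.
sc = SparkContext("local[*]", "microdrill-example")

parquet_conn = ParquetDAL("hdfs://namenode/data/parquet", sc)
parquet_table = ParquetTable("my_table", schema_index_file="_metadata")
parquet_conn.set_table(parquet_table)
```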
parquet_conn(table_name) returns the table object.
parquet_conn(table_name)(field_name) returns the field object.
parquet_conn.select().where(field_object==value) to select all fields.
parquet_conn.select(field_object, [field_object2, ...]).where(field_object==value) to select specific fields.
parquet_conn.select(field_object1, field_object2).where((field_object1==value1) & ~(field_object2==value2))
parquet_conn.select(field_object1, field_object2).where((field_object1!=value1) | field_object1.regexp(reg_exp))
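A note on the parentheses in the queries above: Python binds & and | more tightly than == and !=, so each comparison passed to where() must be parenthesized or the operators combine the values instead of the comparisons. A quick standard-library check of how an unparenthesized filter parses:

```python
import ast

# Without parentheses, `a == b & c` parses as `a == (b & c)`:
# the & is applied to the operands, not to the comparisons.
tree = ast.parse("a == b & c", mode="eval")
comparator = tree.body.comparators[0]
print(type(comparator).__name__)  # the right-hand side is a BinOp (b & c)

# With explicit parentheses, `(a == b) & (c == d)` is a BinOp
# whose operands are the two comparisons, which is what a query wants.
tree2 = ast.parse("(a == b) & (c == d)", mode="eval")
print(type(tree2.body).__name__)       # BinOp
print(type(tree2.body.left).__name__)  # Compare
```

This is why the where() examples wrap each comparison in its own parentheses.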
parquet_conn.groupby(field_object1, [field_object2, ...])
parquet_conn.orderby(field_object1, [field_object2, ...])
parquet_conn.orderby(~field_object) for descending order.
parquet_conn.limit(number)
df = parquet_conn.execute()
execute() returns a PySpark DataFrame.
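Putting the pieces together, a hedged end-to-end sketch; 'my_table', 'name', and 'age' are placeholder table and field names, and a connection set up as shown earlier is assumed:

```python
# Sketch only: `parquet_conn` is a ParquetDAL with a table already set,
# and 'my_table'/'name'/'age' are placeholder names for illustration.
name = parquet_conn('my_table')('name')
age = parquet_conn('my_table')('age')

parquet_conn.select(name, age).where((age != 0) & name.regexp('^A'))
parquet_conn.orderby(~age)   # descending order
parquet_conn.limit(10)

df = parquet_conn.execute()  # a regular PySpark DataFrame from here on
df.show()
```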
parquet_conn(table_name).schema() returns the table schema.
Install the latest JDK and run make setup in a terminal.