In [None]:
import findspark
findspark.init()

# create spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("my app").master("local").getOrCreate()

# get context from the session
sc = spark.sparkContext

### createDataFrame(data, schema=None, samplingRatio=None, verifySchema=True)
***Creates a DataFrame from an RDD, a list*** or a pandas.DataFrame.

When schema is a list of column names, the type of each column will be inferred from data.

When schema is None, *it will try to infer the schema (column names and types) from data*, which should be an RDD of either Row, namedtuple, or dict.

When schema is pyspark.sql.types.DataType or a datatype string, it must match the real data, or an exception will be thrown at runtime. If the given schema is not pyspark.sql.types.StructType, it will be wrapped into a pyspark.sql.types.StructType as its only field, and the field name will be “value”. Each record will also be wrapped into a tuple, which can be converted to row later.

If schema inference is needed, samplingRatio is used to determine the ratio of rows used for schema inference. The first row will be used if samplingRatio is None.

Parameters
* data – an RDD of any kind of SQL data representation (e.g. row, tuple, int, boolean, etc.), list, or pandas.DataFrame.
* schema – a pyspark.sql.types.DataType or a datatype string or a list of column names, default is None.
* samplingRatio – the sample ratio of rows used for inferring the schema.
* verifySchema – verify data types of every row against schema.

### json(path, schema=None, ...)

Loads JSON files and returns the results as a DataFrame.

JSON Lines (newline-delimited JSON) is supported by default. For JSON (one record per file), set the multiLine parameter to true.

If the schema parameter is not specified, this function goes through the input once to determine the input schema.

Parameters
* path – a string giving the path to the JSON dataset, a list of paths, or an RDD of strings storing JSON objects.
* schema – an optional pyspark.sql.types.StructType for the input schema, or a DDL-formatted string (for example, "col0 INT, col1 DOUBLE").

### createOrReplaceTempView(name)
Creates or replaces a local temporary view with this DataFrame.

***It creates (or replaces if that view name already exists) a lazily evaluated "view" that you can then use like a Hive table in Spark SQL.***

It does not persist to memory unless you cache the dataset that underpins the view.
The lifetime of this temporary view is tied to the SparkSession that was used to create this DataFrame.

### select(*cols)
Projects a set of expressions and returns a new DataFrame.

Parameters
* cols – list of column names (string) or expressions (Column). If one of the column names is ‘*’, that column is expanded to include all columns in the current DataFrame.

### selectExpr(*expr)
Projects a set of SQL expressions and returns a new DataFrame.

This is a variant of select() that accepts SQL expressions. 