## Create DataFrame

In [0]:
dbutils.library.restartPython()  # Resets the Python state; some libraries might not work without calling this command.

#### Load libraries

In [0]:
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import IntegerType, DateType, StringType, StructType, StructField

#### Create Spark session

In [0]:
spark = SparkSession.builder.appName('Create dataframe').getOrCreate()

#### Prepare data

In [0]:
columns = ['Programming Language', 'Ratings']
data = [('C', 12.54),
('Python', 11.84),
('Java', 11.54),
('C++', 7.36),
('C#', 4.33),
('Visual Basic', 4.01),
('JavaScript', 2.33),
('PHP', 2.21),
('Assembly language', 2.05),
('SQL', 1.88)]

#### Create DataFrame from RDD

In [0]:
rdd = spark.sparkContext.parallelize(data)

#### Using toDF() function

In [0]:
# An RDD has no column names, so the DataFrame is created with default column names (_1, _2)
dfFromRDD = rdd.toDF()
dfFromRDD.printSchema()

In [0]:
# With columns
dfFromRDDc = rdd.toDF(columns)
dfFromRDDc.printSchema()

#### Using createDataFrame() from SparkSession

In [0]:
dfFromRDDs = spark.createDataFrame(rdd).toDF(*columns)
dfFromRDDs.printSchema()

#### Create DataFrame from List Collection

In [0]:
# Using createDataFrame() from SparkSession
dfFromDataList = spark.createDataFrame(data).toDF(*columns)
dfFromDataList.printSchema()

In [0]:
# Using createDataFrame() with the Row type
rowData = [Row(*x) for x in data]
dfFromDataRow = spark.createDataFrame(rowData, columns)
dfFromDataRow.printSchema()

#### Create DataFrame with schema

In [0]:
# Populate some data
data2 = [
  ('John', '', 'Smith', '36636', 'M', 2500),
  ('Jane', '', 'Doe', '42114', 'F', 500),
  ('Richard', 'Laurence', 'Marquette', '97086', 'M', 1500),
  ('Israel', '', 'Israeli', '', 'M', 3000),
  ('Edward', 'III', '', 'SL4', 'M', 5000)
]
 
schema = StructType([
  StructField("firstname", StringType(), True),
  StructField("middlename", StringType(), True),
  StructField("lastname", StringType(), True),
  StructField("id", StringType(), True),
  StructField("gender", StringType(), True),
  StructField("salary", IntegerType(), True)
])
 
df = spark.createDataFrame(data=data2, schema=schema)
df.printSchema()
df.show(truncate=False)

#### Create DataFrame from Data sources

In real-world projects, you mostly create DataFrames from data source files such as CSV, text, JSON, and XML.

PySpark supports many data formats out of the box, without importing additional libraries. To create a DataFrame, use the appropriate method of the **DataFrameReader** class, which is available as `spark.read`.

##### Creating DataFrame from CSV

Use the `csv()` method of the `DataFrameReader` object to create a DataFrame from a CSV file. You can also provide options such as the delimiter, the quote character, date formats, schema inference, and many more.

    df2 = spark.read.csv("/src/resources/file.csv")

##### Creating from text (TXT) file

Similarly, you can create a DataFrame by reading a text file with the `text()` method of the `DataFrameReader`.

    df2 = spark.read.text("/src/resources/file.txt")

##### Creating from JSON file

PySpark can also process semi-structured data such as JSON. Use the `json()` method of the `DataFrameReader` to read a JSON file into a DataFrame.

    df2 = spark.read.json("/src/resources/file.json")

In [0]:
v_source_file = "/FileStore/tables/ufo.csv"

df_csv = (
  spark
  .read
  .format('csv')
  .option('header', 'true')
  .option('inferSchema', 'false')
  .option('delimiter', ',')
  .option('quote', '"')
  .load(v_source_file)
)

df_csv.show(5)

#### The end of the notebook