# Create a DataFrame/Dataset

This notebook is a refresher on the Spark Session topics related to creating DataFrames.

Candidates are expected to know how to:

* Create a DataFrame/Dataset from a collection (e.g. list or set)
* Create a DataFrame for a range of numbers

## Creating Dataframes

Spark Session has a method called [createDataFrame](https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.SparkSession.createDataFrame), which receives data and converts it to a DataFrame. Its parameters are listed below:

<br>
`SparkSession.createDataFrame(data, schema=None, samplingRatio=None, verifySchema=True)` 

<br>

* `data`: the data that will populate the DataFrame. You can pass a Python list, a pandas DataFrame, an RDD or, in recent Spark versions, a NumPy array
* `schema` (optional): the column names or data types for the DataFrame. If you don't pass a schema, Spark will try to infer it from `data`
* `samplingRatio` (optional): the fraction of rows used for schema inference, which can help with big datasets
* `verifySchema` (optional): when `True`, checks the data types of every row against the schema

## Creating a dataframe from a python list

Okay, let's create a Spark DataFrame from a single list.

When creating a DataFrame from a list, each element of the list becomes one row, so you need a nested list structure to get rows with several columns. A flat list of plain values (such as integers) gives Spark no row structure to infer a schema from, unless you pass a schema for the list; then Spark treats the list as a one-column DataFrame.

In [4]:
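# Creating a Spark DataFrame from a flat Python list (fails without a schema)

my_list = [1,2,3,4,5,6,7,8]
try:
  my_df = spark.createDataFrame(my_list)
except Exception as err:
  # Spark cannot infer a schema from bare integers
  print("can't infer schema from plain values:", err)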

Oops, we didn't pass a schema for the list data, and **Spark can't infer a schema from a list of plain values**, so let's pass a schema to the `createDataFrame` method.

In [6]:
import pyspark.sql.types as tp
single_col_df = spark.createDataFrame(my_list,schema=tp.IntegerType())

display(single_col_df.limit(2))

value
1
2


Yaaaay, we just created our **first Dataframe!**


You can create a multi-column DataFrame from lists as well: just create a nested list structure and go for it.

In [8]:
nested_list = [[1,'nothing'],[2,'happens'],[3,'feijoada']]

nested_list_df = spark.createDataFrame(nested_list)

display(nested_list_df)

_1,_2
1,nothing
2,happens
3,feijoada


Notice that no schema was passed to the `createDataFrame` method, but Spark still picked the right data types for the DataFrame. That's because Spark can **infer the schema from the input data**.

But even with schema inference, the column names got pretty ugly, huh? Let's fix this.

In [10]:
# You can provide column names through the schema parameter by passing them as a list

nested_list_df_with_names = spark.createDataFrame(nested_list,schema=['order','reality'])

display(nested_list_df_with_names)

order,reality
1,nothing
2,happens
3,feijoada


If Spark isn't inferring your schema correctly, you can specify the entire schema, just like in the example below.

In [12]:
input_schema = tp.StructType([tp.StructField('order',tp.IntegerType()),
                              tp.StructField('reality',tp.StringType())
                             ])

nested_list_df_with_strict_schema = spark.createDataFrame(nested_list,schema=input_schema)

display(nested_list_df_with_strict_schema)

order,reality
1,nothing
2,happens
3,feijoada


If you compare `nested_list_df_with_names` and `nested_list_df_with_strict_schema` you'll see that they have different schemas: Spark inferred `LongType` for the first column of the first one, while in the second one I specified `IntegerType` in the schema, so no inference occurred.

## Creating a dataframe from pandas
Now that we have lists covered, the same techniques apply to **pandas Series and pandas DataFrames**. Let's give it a try:

In [15]:
import numpy as np
import pandas as pd

pandas_dataframe = pd.DataFrame(np.column_stack((np.arange(0,300),np.random.rand(300))),columns=['this_is_spartaaa','power_level'])

spark_df_from_pandas_dataframe = spark.createDataFrame(pandas_dataframe)

Spark grabs the column names from pandas, which is very nice when working with both.

**Extra:** Spark DataFrames can be converted back to pandas DataFrames using the `toPandas()` method.

In [17]:
pandas_dataframe_from_spark = spark_df_from_pandas_dataframe.toPandas()

pandas_dataframe_from_spark.head()

Unnamed: 0,this_is_spartaaa,power_level
0,0.0,0.632697
1,1.0,0.132692
2,2.0,0.253096
3,3.0,0.655194
4,4.0,0.558909


## Create a DataFrame for a range of numbers

Spark has its own ways to generate DataFrames, so you can do it all inside Spark.

[spark.range(start, end, step, numPartitions)](https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.range) creates a single-column DataFrame from a range of integers. The column is named `id`, and `end` is exclusive.

In [19]:
generated_dataframe = spark.range(0,10,1,2)

display(generated_dataframe.limit(3))

id
0
1
2


In [20]:
# You can add a column of random numbers as well (42 is the seed)
import pyspark.sql.functions as F
random_generated_dataframe = generated_dataframe.withColumn('meaning_of_life',F.rand(42))

display(random_generated_dataframe.limit(3))

id,meaning_of_life
0,0.6661236774413726
1,0.8583151351252906
2,0.913996368249518


## Artisan made Spark Dataframe

If none of the above methods solved your problem, you can build your DataFrame completely by hand, row by row. Check this out:

In [22]:
#generate a manual Dataframe from rows

from pyspark.sql import Row

row = [Row(name='Lone', surname='Wolf')]
single_row_dataframe = spark.createDataFrame(row)

display(single_row_dataframe)

name,surname
Lone,Wolf


## Conclusion

Now you know how to generate DataFrames from **typed data**. This is useful when you need to complement data with a discrete domain, or to bring small data into other DataFrames.

If you know any other way to generate a Spark DataFrame that is not covered above, feel free to create a PR or contact me.