# Loading CSV into a DataFrame

- In the previous exercise, you saw one way of creating a DataFrame, but loading data from a CSV file is the most common method. In this exercise, you'll create a PySpark DataFrame from the `people.csv` file, whose path is already provided to you as `file_path`, and confirm that the created object is a PySpark DataFrame.

- Remember, you already have a `SparkSession` named `spark` and the `file_path` variable (the path to the `people.csv` file) available in your workspace.

## Instructions

- Create a DataFrame from the `file_path` variable, which holds the path to the `people.csv` file.
- Confirm that the result is a PySpark DataFrame.

In [None]:
# Initialization
import os
import sys

os.environ["SPARK_HOME"] = "/home/talentum/spark"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
# In the two lines below, use /usr/bin/python2.7 if you want to use Python 2
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3.6" 
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3"
sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.10.7-src.zip")
sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")

# NOTE: List whichever packages you want to load here.
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0 pyspark-shell' 
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-avro_2.11:2.4.0 pyspark-shell'
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.3 pyspark-shell'
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.0 pyspark-shell'

In [None]:
# Entry point (Spark 2.x)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Spark SQL basic example").enableHiveSupport().getOrCreate()

# On yarn:
# spark = SparkSession.builder.appName("Spark SQL basic example").enableHiveSupport().master("yarn").getOrCreate()
# specify .master("yarn")

sc = spark.sparkContext

In [6]:
file_path = "file:///home/talentum/spark-jupyter/Data_Frame/Dataset/people.csv"

# Create a DataFrame from file_path
people_df = spark.read.csv(file_path, header=True, inferSchema=True)

# Check the type of people_df
print("The type of people_df is", type(people_df))

The type of people_df is <class 'pyspark.sql.dataframe.DataFrame'>
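
To make the effect of `header=True` and `inferSchema=True` more concrete, here is a plain-Python sketch of the idea using only the standard library. It is not how Spark implements CSV reading; the helper names `infer_type` and `read_with_header` are hypothetical, the sample rows are copied from this notebook's output, and the blank first header cell is an assumption made to explain the `_c0` column name.

```python
import csv
import io

# Sample rows copied from this notebook's output. The blank first header
# cell is an assumption that would explain the `_c0` column name.
SAMPLE = """,person_id,name,sex,date of birth
0,100,Penelope Lewis,female,1990-08-31
1,101,David Anthony,male,1971-10-14
2,102,Ida Shipp,female,1962-05-24
"""


def infer_type(values):
    """Return 'integer' if every non-empty value parses as int, else 'string'."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"
    try:
        for v in non_empty:
            int(v)
    except ValueError:
        return "string"
    return "integer"


def read_with_header(text):
    """Treat the first row as column names and infer a type per column."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    # Blank header cells get a positional name, mimicking `_c0`.
    names = [h if h else f"_c{i}" for i, h in enumerate(header)]
    columns = list(zip(*data))
    schema = {name: infer_type(col) for name, col in zip(names, columns)}
    return names, schema


names, schema = read_with_header(SAMPLE)
print(names)
print(schema)
```

Without the header step, the first line would be treated as data, and without the inference step every column would stay a string. Inference requires an extra look at the data, which is why supplying an explicit schema is often preferred for large files.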


In [7]:
people_df.head(5)

[Row(_c0=0, person_id=100, name='Penelope Lewis', sex='female', date of birth='1990-08-31'),
 Row(_c0=1, person_id=101, name='David Anthony', sex='male', date of birth='1971-10-14'),
 Row(_c0=2, person_id=102, name='Ida Shipp', sex='female', date of birth='1962-05-24'),
 Row(_c0=3, person_id=103, name='Joanna Moore', sex='female', date of birth='2017-03-10'),
 Row(_c0=4, person_id=104, name='Lisandra Ortiz', sex='female', date of birth='2020-08-05')]

In [8]:
people_df.toPandas()

Unnamed: 0,_c0,person_id,name,sex,date of birth
0,0,100,Penelope Lewis,female,1990-08-31
1,1,101,David Anthony,male,1971-10-14
2,2,102,Ida Shipp,female,1962-05-24
3,3,103,Joanna Moore,female,2017-03-10
4,4,104,Lisandra Ortiz,female,2020-08-05
...,...,...,...,...,...
99995,99995,100095,Annette Jones,female,2001-10-31
99996,99996,100096,Angela Meyer,female,1980-04-11
99997,99997,100097,Janet Brann,female,1991-02-02
99998,99998,100098,Melanie Kendrick,female,1978-07-16


In [15]:
people_df.show(truncate=False)

+---+---------+-----------------+------+-------------+
|_c0|person_id|name             |sex   |date of birth|
+---+---------+-----------------+------+-------------+
|0  |100      |Penelope Lewis   |female|1990-08-31   |
|1  |101      |David Anthony    |male  |1971-10-14   |
|2  |102      |Ida Shipp        |female|1962-05-24   |
|3  |103      |Joanna Moore     |female|2017-03-10   |
|4  |104      |Lisandra Ortiz   |female|2020-08-05   |
|5  |105      |David Simmons    |male  |1999-12-30   |
|6  |106      |Edward Hudson    |male  |1983-05-09   |
|7  |107      |Albert Jones     |male  |1990-09-13   |
|8  |108      |Leonard Cavender |male  |1958-08-08   |
|9  |109      |Everett Vadala   |male  |2005-05-24   |
|10 |110      |Freddie Claridge |male  |2002-05-07   |
|11 |111      |Annabelle Rosseau|female|1989-07-13   |
|12 |112      |Eulah Emanuel    |female|1976-01-19   |
|13 |113      |Shaun Love       |male  |1970-05-26   |
|14 |114      |Alejandro Brennan|male  |1980-12-22   |
|15 |115  

In [18]:
people_df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- person_id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- date of birth: string (nullable = true)



In [33]:
# Group by sex after keeping only the two columns of interest
c = people_df.select(['person_id', 'sex']).groupBy('sex')

In [36]:
# Grouping the full DataFrame yields the same per-group row counts
c = people_df.groupBy('sex')

In [37]:
c.count().collect()

[Row(sex=None, count=1920),
 Row(sex='female', count=49014),
 Row(sex='male', count=49066)]
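
The grouped count above puts rows with a missing `sex` value into their own `None` group, whereas the `describe()` output in this notebook reports a count of 98080 for `sex`, which only covers non-null values. The following standard-library sketch illustrates that difference on a hypothetical miniature column; `sex_values` is made-up data, not rows from `people.csv`.

```python
from collections import Counter

# Hypothetical miniature of the `sex` column; None stands in for a null.
sex_values = ["female", "male", None, "female", "male", "male", None]

# Counter keeps None as an ordinary key, much like the null group above.
group_counts = Counter(sex_values)

# A describe()-style count only considers non-null values.
non_null_count = sum(1 for v in sex_values if v is not None)

print(group_counts)
print(non_null_count)

# The figures reported in this notebook are consistent with each other:
# the three groups add up to all rows, and the non-null count excludes
# exactly the rows in the null group.
assert 1920 + 49014 + 49066 == 100000
assert 100000 - 1920 == 98080
```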

In [43]:
people_df.describe().toPandas().transpose()

Unnamed: 0,0,1,2,3,4
summary,count,mean,stddev,min,max
_c0,100000,49999.5,28867.65779668774,0,99999
person_id,100000,50099.5,28867.65779668774,100,100099
name,100000,,,Aaron Addesso,Zulma Biggs
sex,98080,,,female,male
date of birth,100000,,,1899-08-28,2084-11-17
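
Because `date of birth` was read as a string, the `min` and `max` above are string comparisons. For ISO-formatted dates (`YYYY-MM-DD`) that ordering happens to match chronological order, and the range shown (1899 to 2084) suggests some values may deserve a sanity check. Below is a minimal standard-library sketch of such a check. The `CUTOFF` date and the `parse_iso` helper are hypothetical, and the three sample strings are taken from this notebook's output. In PySpark itself, one would probably convert the column with `pyspark.sql.functions.to_date` first.

```python
from datetime import date, datetime

# Boundary values from the describe() output, plus one ordinary row.
dob_strings = ["1899-08-28", "1990-08-31", "2084-11-17"]

# Hypothetical cutoff: treat any birth date after this day as suspicious.
CUTOFF = date(2020, 12, 31)


def parse_iso(text):
    """Parse a YYYY-MM-DD string into a datetime.date."""
    return datetime.strptime(text, "%Y-%m-%d").date()


parsed = [parse_iso(s) for s in dob_strings]

# ISO-formatted strings sort in the same order as the parsed dates.
assert sorted(dob_strings) == [d.isoformat() for d in sorted(parsed)]

# Collect the dates that fall after the cutoff.
future = [d.isoformat() for d in parsed if d > CUTOFF]
print(future)
```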
