## PySpark DataFrame

- Reading the Dataset
- Checking the Datatypes of the Columns (Schema)
- Selecting the Columns and Indexing
- Checking the Describe Option (Similar to Pandas)
- Adding the Columns
- Dropping the Columns
- Renaming the Columns

In [1]:
from pyspark.sql import SparkSession

In [2]:
# Start the SparkSession
spark = SparkSession.builder.appName('DataFrame').getOrCreate()

In [3]:
spark
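
The read calls that follow expect a `test1.csv` in the working directory, which the notebook does not include. A minimal sketch for recreating it with the standard library, assuming the file holds exactly the three rows shown in the tables in this notebook (the real file may differ):

```python
import csv

# Assumed contents, reconstructed from the tables printed in this notebook.
rows = [
    ['NAME', 'AGE', 'EXPERIENCE'],
    ['AGARO', 15, 5],
    ['BAGGAUTI', 22, 10],
    ['LUCCI', 25, 7],
]

# newline='' lets the csv module control line endings itself.
with open('test1.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)
```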

In [None]:
## Read the Dataset
# Method 1
spark.read.option('header','true').csv('test1.csv')

DataFrame[NAME: string, AGE: string, EXPERIENCE: string]

In [5]:
df_pyspark = spark.read.option('header','true').csv('test1.csv')
df_pyspark.show()  # show() prints the table and returns None, so keep it out of the assignment

+--------+---+----------+
|    NAME|AGE|EXPERIENCE|
+--------+---+----------+
|   AGARO| 15|         5|
|BAGGAUTI| 22|        10|
|   LUCCI| 25|         7|
+--------+---+----------+



In [9]:
## Check the Schema
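# Without inferSchema every column is read as string; inferSchema=True makes Spark scan the data to infer types
df_pyspark = spark.read.option('header','true').csv('test1.csv', inferSchema=True)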
df_pyspark.printSchema()

root
 |-- NAME: string (nullable = true)
 |-- AGE: integer (nullable = true)
 |-- EXPERIENCE: integer (nullable = true)



In [10]:
## Method 2
df_pyspark = spark.read.csv('test1.csv', header=True, inferSchema=True)
df_pyspark.show()

+--------+---+----------+
|    NAME|AGE|EXPERIENCE|
+--------+---+----------+
|   AGARO| 15|         5|
|BAGGAUTI| 22|        10|
|   LUCCI| 25|         7|
+--------+---+----------+



In [11]:
df_pyspark.printSchema()

root
 |-- NAME: string (nullable = true)
 |-- AGE: integer (nullable = true)
 |-- EXPERIENCE: integer (nullable = true)



In [12]:
type(df_pyspark)

pyspark.sql.dataframe.DataFrame

In [13]:
df_pyspark.columns

['NAME', 'AGE', 'EXPERIENCE']

In [15]:
df_pyspark.head(3)

[Row(NAME='AGARO', AGE=15, EXPERIENCE=5),
 Row(NAME='BAGGAUTI', AGE=22, EXPERIENCE=10),
 Row(NAME='LUCCI', AGE=25, EXPERIENCE=7)]

- Selecting the Columns

In [16]:
df_pyspark.select('NAME')

DataFrame[NAME: string]

In [17]:
df_pyspark.select('NAME').show()

+--------+
|    NAME|
+--------+
|   AGARO|
|BAGGAUTI|
|   LUCCI|
+--------+



In [18]:
type(df_pyspark.select('NAME'))

pyspark.sql.dataframe.DataFrame

In [19]:
df_pyspark.select(['NAME', 'EXPERIENCE'])

DataFrame[NAME: string, EXPERIENCE: int]

In [20]:
df_pyspark.select(['NAME', 'EXPERIENCE']).show()

+--------+----------+
|    NAME|EXPERIENCE|
+--------+----------+
|   AGARO|         5|
|BAGGAUTI|        10|
|   LUCCI|         7|
+--------+----------+



- Check the Datatypes

In [21]:
df_pyspark.dtypes

[('NAME', 'string'), ('AGE', 'int'), ('EXPERIENCE', 'int')]

- Describe option

In [22]:
df_pyspark.describe()

DataFrame[summary: string, NAME: string, AGE: string, EXPERIENCE: string]

In [23]:
df_pyspark.describe().show()

+-------+-----+------------------+-----------------+
|summary| NAME|               AGE|       EXPERIENCE|
+-------+-----+------------------+-----------------+
|  count|    3|                 3|                3|
|   mean| NULL|20.666666666666668|7.333333333333333|
| stddev| NULL| 5.131601439446884|2.516611478423583|
|    min|AGARO|                15|                5|
|    max|LUCCI|                25|               10|
+-------+-----+------------------+-----------------+
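
The figures in this table can be cross-checked with the standard library. The `stddev` row matches the sample standard deviation (dividing by n - 1), and `min`/`max` on the string column `NAME` compare values lexicographically. A quick check, with the values copied from the table above:

```python
import statistics

ages = [15, 22, 25]
experience = [5, 10, 7]
names = ['AGARO', 'BAGGAUTI', 'LUCCI']

# statistics.stdev is the sample standard deviation (n - 1 in the denominator).
print(statistics.mean(ages), statistics.stdev(ages))
print(statistics.mean(experience), statistics.stdev(experience))

# String min/max follow lexicographic order.
print(min(names), max(names))
```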



- Adding the Columns

In [24]:
df_pyspark.withColumn('Experiences After 2 years', df_pyspark['EXPERIENCE']+2)

DataFrame[NAME: string, AGE: int, EXPERIENCE: int, Experiences After 2 years: int]

In [25]:
df_pyspark.withColumn('Experiences After 2 years', df_pyspark['EXPERIENCE']+2).show()

+--------+---+----------+-------------------------+
|    NAME|AGE|EXPERIENCE|Experiences After 2 years|
+--------+---+----------+-------------------------+
|   AGARO| 15|         5|                        7|
|BAGGAUTI| 22|        10|                       12|
|   LUCCI| 25|         7|                        9|
+--------+---+----------+-------------------------+



- Dropping the Columns

In [26]:
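# Dropping an absent column is a no-op: the withColumn result above was never assigned back to df_pyspark
df_pyspark.drop('Experiences After 2 years')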

DataFrame[NAME: string, AGE: int, EXPERIENCE: int]

In [27]:
df_pyspark.show()

+--------+---+----------+
|    NAME|AGE|EXPERIENCE|
+--------+---+----------+
|   AGARO| 15|         5|
|BAGGAUTI| 22|        10|
|   LUCCI| 25|         7|
+--------+---+----------+



- Renaming the Columns

In [28]:
df_pyspark.withColumnRenamed('NAME', 'NEW NAME').show()

+--------+---+----------+
|NEW NAME|AGE|EXPERIENCE|
+--------+---+----------+
|   AGARO| 15|         5|
|BAGGAUTI| 22|        10|
|   LUCCI| 25|         7|
+--------+---+----------+

