### Topics
- PySpark DataFrame
- Reading The Dataset
- Checking the Data Types of the Columns (Schema)
- Selecting Columns And Indexing
- Summarizing Data with describe (similar to Pandas)
- Adding Columns
- Dropping Columns
- Renaming Columns

In [1]:
from pyspark.sql import SparkSession

In [5]:
spark = SparkSession.builder.appName('Dataframes').getOrCreate()

In [59]:
# Read the dataset.
# Note: with inferSchema=True, Spark scans the data to infer each column's type;
# without it, every column is read as a string.
df = spark.read.option('header', 'true').csv('table2.csv', inferSchema=True)

In [60]:
# Check the Schema
df.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Experience: integer (nullable = true)



In [61]:
df.show()

+-----+---+----------+
| Name|Age|Experience|
+-----+---+----------+
| Test| 31|         9|
| User| 21|         3|
|Mongo| 11|         6|
+-----+---+----------+



In [20]:
# Alternative: pass header and inferSchema as keyword arguments in a single call.
df = spark.read.csv('table2.csv', header=True, inferSchema=True)
df.show()

+-----+---+----------+
| Name|Age|Experience|
+-----+---+----------+
| Test| 31|         9|
| User| 21|         3|
|Mongo| 11|         6|
+-----+---+----------+



In [29]:
df.columns

['Name', 'Age', 'Experience']

In [63]:
df.head(2)

[Row(Name='Test', Age=31, Experience=9),
 Row(Name='User', Age=21, Experience=3)]

In [39]:
# Show only the selected columns.
df.select(['Name', 'Age']).show()

+-----+---+
| Name|Age|
+-----+---+
| Test| 31|
| User| 21|
|Mongo| 11|
+-----+---+



In [43]:
# Check the data type of each column.
df.dtypes

[('Name', 'string'), ('Age', 'int'), ('Experience', 'int')]

In [50]:
# describe() returns a new DataFrame of summary statistics; nothing is displayed until show() is called.
df.describe()

DataFrame[summary: string, Name: string, Age: string, Experience: string]

In [51]:
# Display the summary statistics.
df.describe().show()

+-------+-----+----+----------+
|summary| Name| Age|Experience|
+-------+-----+----+----------+
|  count|    3|   3|         3|
|   mean| NULL|21.0|       6.0|
| stddev| NULL|10.0|       3.0|
|    min|Mongo|  11|         3|
|    max| User|  31|         9|
+-------+-----+----+----------+
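
The numbers in this summary can be checked by hand. The sketch below uses only the Python standard library, with the three rows copied from the table above, to show that `stddev` is the sample standard deviation (dividing by n - 1) and that `min` and `max` on the string column follow lexicographic order.

```python
import statistics

# Values copied from the DataFrame shown above.
names = ['Test', 'User', 'Mongo']
ages = [31, 21, 11]
experience = [9, 3, 6]

# Mean of each numeric column.
age_mean = statistics.mean(ages)
exp_mean = statistics.mean(experience)

# Sample standard deviation (divides by n - 1), which matches the stddev row.
age_stddev = statistics.stdev(ages)
exp_stddev = statistics.stdev(experience)

# Population standard deviation (divides by n) gives a different, smaller value.
age_pstdev = statistics.pstdev(ages)

# For strings, min and max compare lexicographically.
name_min = min(names)
name_max = max(names)

print(age_mean, age_stddev, exp_mean, exp_stddev, name_min, name_max)
```

The `mean` and `stddev` rows are `NULL` for `Name` because those statistics are not defined for strings.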



In [83]:
# Add a column to a PySpark DataFrame; withColumn returns a new DataFrame.
add_df = df.withColumn('Experience after 2 years', df['Experience']+2)
add_df.show()

+-----+---+----------+------------------------+
| Name|Age|Experience|Experience after 2 years|
+-----+---+----------+------------------------+
| Test| 31|         9|                      11|
| User| 21|         3|                       5|
|Mongo| 11|         6|                       8|
+-----+---+----------+------------------------+



In [84]:
# Drop a column; drop also returns a new DataFrame.
drop_df = add_df.drop('Experience after 2 years')
drop_df.show()

+-----+---+----------+
| Name|Age|Experience|
+-----+---+----------+
| Test| 31|         9|
| User| 21|         3|
|Mongo| 11|         6|
+-----+---+----------+



In [None]:
# Rename a column; withColumnRenamed returns a new DataFrame and leaves df unchanged.
rename_df = df.withColumnRenamed('Name', 'Full Name')