### This tutorial will contain:

1. PySpark DataFrame
2. Reading the dataset
3. Checking the datatypes of the columns
4. Selecting columns and indexing
5. Using describe(), similar to pandas
6. Adding Columns
7. Dropping Columns
8. Renaming Columns

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName("df_practice").getOrCreate()

In [3]:
spark

In [4]:
# Read the dataset
df_spark = spark.read.option('header','true').csv('test1.csv')

In [5]:
df_spark.show()

+-------+---+----------+
|   Name|Age|Experience|
+-------+---+----------+
|    Ali| 35|        15|
| Prince| 31|        11|
|   Ploy| 35|        15|
| Dipesh| 30|         6|
| Pouyeh| 35|        15|
|Vincent| 65|        30|
|Randall| 30|        10|
+-------+---+----------+



In [6]:
df_spark

DataFrame[Name: string, Age: string, Experience: string]

In [8]:
df_spark.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- Experience: string (nullable = true)



As seen above, every column was read as a string. By default, Spark reads all CSV columns as strings. To have Spark infer the column types from the data instead, set the inferSchema parameter to True.

In [9]:
df_spark = spark.read.option('header','true').csv('test1.csv', inferSchema = True)
df_spark

DataFrame[Name: string, Age: int, Experience: int]

Now the column types are as expected: Age and Experience are read as integers.

In [10]:
df_spark.show()

+-------+---+----------+
|   Name|Age|Experience|
+-------+---+----------+
|    Ali| 35|        15|
| Prince| 31|        11|
|   Ploy| 35|        15|
| Dipesh| 30|         6|
| Pouyeh| 35|        15|
|Vincent| 65|        30|
|Randall| 30|        10|
+-------+---+----------+



Another way to read a CSV file in Spark is to pass the options as keyword arguments:

In [11]:
df_spark = spark.read.csv('test1.csv', header=True, inferSchema=True)
df_spark

DataFrame[Name: string, Age: int, Experience: int]

In [12]:
df_spark.show()

+-------+---+----------+
|   Name|Age|Experience|
+-------+---+----------+
|    Ali| 35|        15|
| Prince| 31|        11|
|   Ploy| 35|        15|
| Dipesh| 30|         6|
| Pouyeh| 35|        15|
|Vincent| 65|        30|
|Randall| 30|        10|
+-------+---+----------+



In [13]:
type(df_spark)

pyspark.sql.dataframe.DataFrame

Get all column names

In [14]:
df_spark.columns

['Name', 'Age', 'Experience']

In [15]:
df_spark.head(3)

[Row(Name='Ali', Age=35, Experience=15),
 Row(Name='Prince', Age=31, Experience=11),
 Row(Name='Ploy', Age=35, Experience=15)]

Select a particular column

In [16]:
df_spark.select('Name')

DataFrame[Name: string]

In [17]:
df_spark.select('Name').show()

+-------+
|   Name|
+-------+
|    Ali|
| Prince|
|   Ploy|
| Dipesh|
| Pouyeh|
|Vincent|
|Randall|
+-------+



In [18]:
NameCol = df_spark.select('Name')
NameCol.show()

+-------+
|   Name|
+-------+
|    Ali|
| Prince|
|   Ploy|
| Dipesh|
| Pouyeh|
|Vincent|
|Randall|
+-------+



In [19]:
MultipleCol = df_spark.select(['Name','Experience'])
MultipleCol.show()

+-------+----------+
|   Name|Experience|
+-------+----------+
|    Ali|        15|
| Prince|        11|
|   Ploy|        15|
| Dipesh|         6|
| Pouyeh|        15|
|Vincent|        30|
|Randall|        10|
+-------+----------+



In [20]:
MultipleCol.dtypes

[('Name', 'string'), ('Experience', 'int')]

In [22]:
df_spark.describe()

DataFrame[summary: string, Name: string, Age: string, Experience: string]

In [23]:
df_spark.describe().show()

+-------+-------+------------------+------------------+
|summary|   Name|               Age|        Experience|
+-------+-------+------------------+------------------+
|  count|      7|                 7|                 7|
|   mean|   null|37.285714285714285|14.571428571428571|
| stddev|   null|12.446074156325835| 7.590721152765897|
|    min|    Ali|                30|                 6|
|    max|Vincent|                65|                30|
+-------+-------+------------------+------------------+



Adding columns to a DataFrame

In [24]:
df_spark = df_spark.withColumn('Age * Exp', df_spark['Age'] * df_spark['Experience']) 
df_spark.show()

+-------+---+----------+---------+
|   Name|Age|Experience|Age * Exp|
+-------+---+----------+---------+
|    Ali| 35|        15|      525|
| Prince| 31|        11|      341|
|   Ploy| 35|        15|      525|
| Dipesh| 30|         6|      180|
| Pouyeh| 35|        15|      525|
|Vincent| 65|        30|     1950|
|Randall| 30|        10|      300|
+-------+---+----------+---------+



Drop columns

In [25]:
df_spark = df_spark.drop('Age * Exp')
df_spark.show()

+-------+---+----------+
|   Name|Age|Experience|
+-------+---+----------+
|    Ali| 35|        15|
| Prince| 31|        11|
|   Ploy| 35|        15|
| Dipesh| 30|         6|
| Pouyeh| 35|        15|
|Vincent| 65|        30|
|Randall| 30|        10|
+-------+---+----------+



Rename the columns

In [26]:
df_spark = df_spark.withColumnRenamed('Name', 'New Name')
df_spark.show()

+--------+---+----------+
|New Name|Age|Experience|
+--------+---+----------+
|     Ali| 35|        15|
|  Prince| 31|        11|
|    Ploy| 35|        15|
|  Dipesh| 30|         6|
|  Pouyeh| 35|        15|
| Vincent| 65|        30|
| Randall| 30|        10|
+--------+---+----------+



In [29]:
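lst = [1, 1]
# list.append() mutates the list in place and returns None,
# so assigning its result back would replace lst with None
lst.append(lst)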