**Agenda**
- Read the Dataset
- Check the data types of the columns (schema)
- Selecting columns and indexing
- Use describe(), similar to pandas
- Adding columns
- Dropping columns
- Rename columns

In [36]:
import pandas as pd
data = pd.DataFrame({'Name':['Krish','Marry','Raghu'],'Age':[31,29,30],'Experience':[10,8,4]})
data.to_csv('test1_pyspark.csv',index=False)

In [1]:
from pyspark.sql import SparkSession

In [3]:
spark = SparkSession.builder.appName('Dataframe').getOrCreate()
spark

**Read Dataset**

In [10]:
# show() returns None, so keep the DataFrame and display it in a separate call
df_pyspark = spark.read.option('header','true').csv('test1_pyspark.csv')
df_pyspark.show()

+-----+---+----------+
| Name|Age|Experience|
+-----+---+----------+
|Krish| 31|        10|
|Marry| 29|         8|
|Raghu| 30|         4|
+-----+---+----------+



In [14]:
type(df_pyspark)

pyspark.sql.dataframe.DataFrame

- Note: a PySpark DataFrame is a distributed, tabular data structure with named columns

In [16]:
df_pyspark.head()

Row(Name='Krish', Age=31, Experience=10)

**Check the Schema**

In [11]:
df_pyspark = spark.read.option('header','true').csv('test1_pyspark.csv')
df_pyspark.printSchema()
# By default, every column is read as a string

root
 |-- Name: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- Experience: string (nullable = true)



In [12]:
df_pyspark = spark.read.option('header','true').csv('test1_pyspark.csv',inferSchema=True)
df_pyspark.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Experience: integer (nullable = true)



In [13]:
df_pyspark = spark.read.csv('test1_pyspark.csv',header=True,inferSchema=True)
df_pyspark.show()

+-----+---+----------+
| Name|Age|Experience|
+-----+---+----------+
|Krish| 31|        10|
|Marry| 29|         8|
|Raghu| 30|         4|
+-----+---+----------+



**Selecting columns and Indexing**

In [15]:
df_pyspark.columns

['Name', 'Age', 'Experience']

In [18]:
df_pyspark.select('Name')

DataFrame[Name: string]

In [19]:
df_pyspark.select('Name').show()

+-----+
| Name|
+-----+
|Krish|
|Marry|
|Raghu|
+-----+



In [21]:
type(df_pyspark.select('Name'))

pyspark.sql.dataframe.DataFrame

In [22]:
df_pyspark.select(['Name','Experience']).show()

+-----+----------+
| Name|Experience|
+-----+----------+
|Krish|        10|
|Marry|         8|
|Raghu|         4|
+-----+----------+



In [25]:
df_pyspark['Name']
# A Column is an expression, not a DataFrame, so it has no show():
# df_pyspark['Name'].show()  # TypeError: 'Column' object is not callable

Column<'Name'>

In [26]:
df_pyspark.dtypes

[('Name', 'string'), ('Age', 'int'), ('Experience', 'int')]

**Check Describe option similar to pandas**

In [27]:
df_pyspark.describe()

DataFrame[summary: string, Name: string, Age: string, Experience: string]

In [29]:
df_pyspark.describe().show()
# For the Name column, min and max follow lexicographic (string) ordering, not row position

+-------+-----+----+-----------------+
|summary| Name| Age|       Experience|
+-------+-----+----+-----------------+
|  count|    3|   3|                3|
|   mean| null|30.0|7.333333333333333|
| stddev| null| 1.0|3.055050463303893|
|    min|Krish|  29|                4|
|    max|Raghu|  31|               10|
+-------+-----+----+-----------------+
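
The numbers in this table can be checked by hand with plain Python. The sketch below assumes the same three rows: `stddev` matches the sample standard deviation (dividing by n - 1), and `min`/`max` on `Name` match ordinary string comparison.

```python
import statistics

names = ["Krish", "Marry", "Raghu"]
ages = [31, 29, 30]
experience = [10, 8, 4]

# Strings compare lexicographically, so min/max are the alphabetically
# first and last names, independent of where the rows sit in the table.
name_min, name_max = min(names), max(names)
print(name_min, name_max)

# statistics.stdev is the sample standard deviation (divides by n - 1).
age_mean, age_std = statistics.mean(ages), statistics.stdev(ages)
exp_mean, exp_std = statistics.mean(experience), statistics.stdev(experience)
print(age_mean, age_std)
print(exp_mean, exp_std)
```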



**Adding and dropping columns**

In [31]:
df_pyspark.withColumn('Experience after 2 year',df_pyspark['Experience']+2).show()

+-----+---+----------+-----------------------+
| Name|Age|Experience|Experience after 2 year|
+-----+---+----------+-----------------------+
|Krish| 31|        10|                     12|
|Marry| 29|         8|                     10|
|Raghu| 30|         4|                      6|
+-----+---+----------+-----------------------+



In [33]:
df_pyspark = df_pyspark.withColumn('Experience after 2 year',df_pyspark['Experience']+2)
df_pyspark.show()

+-----+---+----------+-----------------------+
| Name|Age|Experience|Experience after 2 year|
+-----+---+----------+-----------------------+
|Krish| 31|        10|                     12|
|Marry| 29|         8|                     10|
|Raghu| 30|         4|                      6|
+-----+---+----------+-----------------------+



In [34]:
df_pyspark = df_pyspark.drop('Experience after 2 year')
df_pyspark.show()

+-----+---+----------+
| Name|Age|Experience|
+-----+---+----------+
|Krish| 31|        10|
|Marry| 29|         8|
|Raghu| 30|         4|
+-----+---+----------+



**Rename Columns**

In [35]:
df_pyspark.withColumnRenamed('Name','New Name').show()

+--------+---+----------+
|New Name|Age|Experience|
+--------+---+----------+
|   Krish| 31|        10|
|   Marry| 29|         8|
|   Raghu| 30|         4|
+--------+---+----------+

