# Topics Covered
- PySpark Dataframe
- Reading the Dataset
- Checking the Datatypes of the Columns (Schema)
- Selecting Columns and Indexing
- Summary Statistics with describe(), similar to Pandas
- Adding Columns
- Dropping Columns

In [18]:
import pandas as pd
df = pd.DataFrame([["Tom", 31, 10], ["Daniel", 30, 8], ["Ron", 29, 4]], columns=["Name", "Age", "Experience"])
df.to_csv("test2.csv", index=False)


In [19]:
df

     Name  Age  Experience
0     Tom   31          10
1  Daniel   30           8
2     Ron   29           4


### Start Session

In [20]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Dataframe').getOrCreate()

In [21]:
spark

### Read Dataframe

#### 1st way to read file

In [22]:
df_2 = spark.read.option('header','true').csv('test2.csv')

In [23]:
df_2.show()

+------+---+----------+
|  Name|Age|Experience|
+------+---+----------+
|   Tom| 31|        10|
|Daniel| 30|         8|
|   Ron| 29|         4|
+------+---+----------+



### Check the Schema / DataTypes 

In [24]:
df_2.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- Experience: string (nullable = true)



All the datatypes above are strings. Unless inferSchema=True is passed, as in .csv("filename", inferSchema=True), Spark reads every CSV column as a string.
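To see why everything starts out as a string, it can help to read the same data with Python's standard `csv` module. The sketch below is only an illustration of the idea behind schema inference, not Spark's actual algorithm; the helper `infer_column_type` is a hypothetical name introduced here, and it only distinguishes integers from strings.

```python
import csv
import io

# Same three rows as test2.csv, held in memory so the sketch is self-contained.
raw = "Name,Age,Experience\nTom,31,10\nDaniel,30,8\nRon,29,4\n"

rows = list(csv.DictReader(io.StringIO(raw)))

# A CSV file carries no type information, so every value arrives as text.
all_strings = all(isinstance(v, str) for row in rows for v in row.values())
print(all_strings)  # True


def infer_column_type(values):
    """Hypothetical helper: report 'integer' only if every value parses as an int."""
    try:
        for v in values:
            int(v)
        return "integer"
    except ValueError:
        return "string"


# Scan each column once and guess its type from the values it contains.
inferred = {name: infer_column_type([row[name] for row in rows]) for name in rows[0]}
print(inferred)  # {'Name': 'string', 'Age': 'integer', 'Experience': 'integer'}
```

The guessed types line up with what `printSchema()` reports once `inferSchema=True` is set.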

In [25]:
df_2 = spark.read.option('header', 'true').csv('test2.csv', inferSchema=True)

In [26]:
df_2.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Experience: integer (nullable = true)



Now the schema has one string column (Name) and two integer columns (Age and Experience).

#### 2nd way to read file

In [27]:
df_3 = spark.read.csv("test2.csv", header = True, inferSchema=True)
df_3.show()

+------+---+----------+
|  Name|Age|Experience|
+------+---+----------+
|   Tom| 31|        10|
|Daniel| 30|         8|
|   Ron| 29|         4|
+------+---+----------+



In [28]:
df_3.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Experience: integer (nullable = true)



In [29]:
type(df_3)

pyspark.sql.dataframe.DataFrame

### Selecting Columns and Indexing

In [30]:
# print columns
df_3.columns

['Name', 'Age', 'Experience']

In [31]:
# head(n) returns the first n rows as a list of Row objects, not as a DataFrame
df_3.head(3)

[Row(Name='Tom', Age=31, Experience=10),
 Row(Name='Daniel', Age=30, Experience=8),
 Row(Name='Ron', Age=29, Experience=4)]

In [33]:
# selecting a column
df_3.select('Name').show()

+------+
|  Name|
+------+
|   Tom|
|Daniel|
|   Ron|
+------+



In [34]:
# selecting multiple columns
df_3.select(['Name','Experience']).show()

+------+----------+
|  Name|Experience|
+------+----------+
|   Tom|        10|
|Daniel|         8|
|   Ron|         4|
+------+----------+



In [36]:
# showing datatypes of columns
df_3.dtypes

[('Name', 'string'), ('Age', 'int'), ('Experience', 'int')]

In [40]:
# describe: summary statistics for the numeric columns
df_3.select(['Age','Experience']).describe().show()

+-------+----+-----------------+
|summary| Age|       Experience|
+-------+----+-----------------+
|  count|   3|                3|
|   mean|30.0|7.333333333333333|
| stddev| 1.0|3.055050463303893|
|    min|  29|                4|
|    max|  31|               10|
+-------+----+-----------------+
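The `stddev` row is the sample standard deviation (dividing by n - 1), which can be checked by hand. This is a small cross-check using only Python's standard library, with the column values copied from the table above; the variable names are illustrative.

```python
import statistics

# Column values copied from the DataFrame shown earlier.
age = [31, 30, 29]
experience = [10, 8, 4]

# statistics.stdev computes the sample standard deviation (n - 1 in the denominator).
age_mean = statistics.mean(age)
age_std = statistics.stdev(age)
exp_mean = statistics.mean(experience)
exp_std = statistics.stdev(experience)

print(age_mean, age_std)
print(exp_mean, exp_std)
```

The printed values should agree with the `mean` and `stddev` rows of `describe()` up to floating-point precision.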



### Adding, Dropping, and Renaming Columns

In [44]:
# adding column in pyspark dataframe

df_3 = df_3.withColumn('Experience after 2 year', df_3["Experience"]+2)

In [45]:
df_3.show()

+------+---+----------+-----------------------+
|  Name|Age|Experience|Experience after 2 year|
+------+---+----------+-----------------------+
|   Tom| 31|        10|                     12|
|Daniel| 30|         8|                     10|
|   Ron| 29|         4|                      6|
+------+---+----------+-----------------------+



In [47]:
# dropping column
df_3 = df_3.drop('Experience after 2 year')
df_3.show()

+------+---+----------+
|  Name|Age|Experience|
+------+---+----------+
|   Tom| 31|        10|
|Daniel| 30|         8|
|   Ron| 29|         4|
+------+---+----------+



In [49]:
# rename a column (returns a new DataFrame; df_3 itself is not modified here)
df_3.withColumnRenamed('Name','New Name').show()

+--------+---+----------+
|New Name|Age|Experience|
+--------+---+----------+
|     Tom| 31|        10|
|  Daniel| 30|         8|
|     Ron| 29|         4|
+--------+---+----------+

