# TUTORIAL 2

## In this notebook we will cover:

 - How to read a dataset using PySpark
 - Check the datatype of each column
 - How to select specific columns
 - Indexing operations
 - Summarising the data with describe, similar to the pandas describe and info functionality
 - How to add a new column
 - How to drop/delete an existing column

In [1]:
from pyspark.sql import SparkSession

In [5]:
spark=SparkSession.builder.appName('Training').getOrCreate()

In [52]:
# Reading the dataset
# option('header', 'true') tells Spark to treat the first row as column names.
# Without it, spark.read.csv('bhp.csv') would name the columns _c0, _c1, ...
# and read the header row as data.

df_spk=spark.read.option('header','true').csv('bhp.csv')

In [53]:
df_spk.show()

+--------------------+---------+----------+----+-----+---+--------------+
|            location|     size|total_sqft|bath|price|bhk|price_per_sqft|
+--------------------+---------+----------+----+-----+---+--------------+
|Electronic City P...|    2 BHK|    1056.0| 2.0|39.07|  2|          3699|
|    Chikka Tirupathi|4 Bedroom|    2600.0| 5.0|120.0|  4|          4615|
|         Uttarahalli|    3 BHK|    1440.0| 2.0| 62.0|  3|          4305|
|  Lingadheeranahalli|    3 BHK|    1521.0| 3.0| 95.0|  3|          6245|
|            Kothanur|    2 BHK|    1200.0| 2.0| 51.0|  2|          4250|
|          Whitefield|    2 BHK|    1170.0| 2.0| 38.0|  2|          3247|
|    Old Airport Road|    4 BHK|    2732.0| 4.0|204.0|  4|          7467|
|        Rajaji Nagar|    4 BHK|    3300.0| 4.0|600.0|  4|         18181|
|        Marathahalli|    3 BHK|    1310.0| 3.0|63.25|  3|          4828|
|               other|6 Bedroom|    1020.0| 6.0|370.0|  6|         36274|
|          Whitefield|    3 BHK|    18

In [54]:
#Checking the schema

df_spk.printSchema()

# Every column is read as a string because we have not asked Spark to
# infer the schema; without that option the CSV reader treats all columns as strings

root
 |-- location: string (nullable = true)
 |-- size: string (nullable = true)
 |-- total_sqft: string (nullable = true)
 |-- bath: string (nullable = true)
 |-- price: string (nullable = true)
 |-- bhk: string (nullable = true)
 |-- price_per_sqft: string (nullable = true)



In [55]:
# Let's use the inferSchema option so that Spark detects each column's datatype from the data

df_spk=spark.read.option('header','true').csv('bhp.csv',inferSchema=True)
df_spk.show()


+--------------------+---------+----------+----+-----+---+--------------+
|            location|     size|total_sqft|bath|price|bhk|price_per_sqft|
+--------------------+---------+----------+----+-----+---+--------------+
|Electronic City P...|    2 BHK|    1056.0| 2.0|39.07|  2|          3699|
|    Chikka Tirupathi|4 Bedroom|    2600.0| 5.0|120.0|  4|          4615|
|         Uttarahalli|    3 BHK|    1440.0| 2.0| 62.0|  3|          4305|
|  Lingadheeranahalli|    3 BHK|    1521.0| 3.0| 95.0|  3|          6245|
|            Kothanur|    2 BHK|    1200.0| 2.0| 51.0|  2|          4250|
|          Whitefield|    2 BHK|    1170.0| 2.0| 38.0|  2|          3247|
|    Old Airport Road|    4 BHK|    2732.0| 4.0|204.0|  4|          7467|
|        Rajaji Nagar|    4 BHK|    3300.0| 4.0|600.0|  4|         18181|
|        Marathahalli|    3 BHK|    1310.0| 3.0|63.25|  3|          4828|
|               other|6 Bedroom|    1020.0| 6.0|370.0|  6|         36274|
|          Whitefield|    3 BHK|    18

In [56]:
# Each column's datatype now reflects the values present in that column
df_spk.printSchema()

root
 |-- location: string (nullable = true)
 |-- size: string (nullable = true)
 |-- total_sqft: double (nullable = true)
 |-- bath: double (nullable = true)
 |-- price: double (nullable = true)
 |-- bhk: integer (nullable = true)
 |-- price_per_sqft: integer (nullable = true)



## Easier way to read the dataset

In [57]:
df_spk=spark.read.csv('bhp.csv', header=True, inferSchema=True)
df_spk.show()

+--------------------+---------+----------+----+-----+---+--------------+
|            location|     size|total_sqft|bath|price|bhk|price_per_sqft|
+--------------------+---------+----------+----+-----+---+--------------+
|Electronic City P...|    2 BHK|    1056.0| 2.0|39.07|  2|          3699|
|    Chikka Tirupathi|4 Bedroom|    2600.0| 5.0|120.0|  4|          4615|
|         Uttarahalli|    3 BHK|    1440.0| 2.0| 62.0|  3|          4305|
|  Lingadheeranahalli|    3 BHK|    1521.0| 3.0| 95.0|  3|          6245|
|            Kothanur|    2 BHK|    1200.0| 2.0| 51.0|  2|          4250|
|          Whitefield|    2 BHK|    1170.0| 2.0| 38.0|  2|          3247|
|    Old Airport Road|    4 BHK|    2732.0| 4.0|204.0|  4|          7467|
|        Rajaji Nagar|    4 BHK|    3300.0| 4.0|600.0|  4|         18181|
|        Marathahalli|    3 BHK|    1310.0| 3.0|63.25|  3|          4828|
|               other|6 Bedroom|    1020.0| 6.0|370.0|  6|         36274|
|          Whitefield|    3 BHK|    18

In [58]:
# List all the column names
df_spk.columns

['location', 'size', 'total_sqft', 'bath', 'price', 'bhk', 'price_per_sqft']

In [59]:
#selecting a single column and displaying it

df_spk.select('location').show()

+--------------------+
|            location|
+--------------------+
|Electronic City P...|
|    Chikka Tirupathi|
|         Uttarahalli|
|  Lingadheeranahalli|
|            Kothanur|
|          Whitefield|
|    Old Airport Road|
|        Rajaji Nagar|
|        Marathahalli|
|               other|
|          Whitefield|
|          Whitefield|
|  7th Phase JP Nagar|
|           Gottigere|
|            Sarjapur|
|         Mysore Road|
|       Bisuvanahalli|
|Raja Rajeshwari N...|
|               other|
|               other|
+--------------------+
only showing top 20 rows



In [60]:
#selecting multiple columns and displaying them

df_spk.select(['location','price_per_sqft']).show()

+--------------------+--------------+
|            location|price_per_sqft|
+--------------------+--------------+
|Electronic City P...|          3699|
|    Chikka Tirupathi|          4615|
|         Uttarahalli|          4305|
|  Lingadheeranahalli|          6245|
|            Kothanur|          4250|
|          Whitefield|          3247|
|    Old Airport Road|          7467|
|        Rajaji Nagar|         18181|
|        Marathahalli|          4828|
|               other|         36274|
|          Whitefield|          3888|
|          Whitefield|         10592|
|  7th Phase JP Nagar|          3800|
|           Gottigere|          3636|
|            Sarjapur|          6577|
|         Mysore Road|          6255|
|       Bisuvanahalli|          4067|
|Raja Rajeshwari N...|          3896|
|               other|         10469|
|               other|          4363|
+--------------------+--------------+
only showing top 20 rows



In [61]:
#Describe functionality of PySpark Dataframes:

df_spk.describe('size', 'total_sqft', 'bath', 'price', 'bhk', 'price_per_sqft').show()

+-------+---------+------------------+------------------+------------------+------------------+-----------------+
|summary|     size|        total_sqft|              bath|             price|               bhk|   price_per_sqft|
+-------+---------+------------------+------------------+------------------+------------------+-----------------+
|  count|    13200|             13200|             13200|             13200|             13200|            13200|
|   mean|     null|1555.3027829545451|2.6911363636363634|  112.276177651515|2.8008333333333333|7920.336742424242|
| stddev|     null|1237.3234454015146|1.3389150868179531|149.17599517809657| 1.292843421272534|106727.1603281085|
|    min|    1 BHK|               1.0|               1.0|               8.0|                 1|              267|
|    max|9 Bedroom|           52272.0|              40.0|            3600.0|                43|         12000000|
+-------+---------+------------------+------------------+------------------+------------

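 - The mean and stddev for 'size' are null because it is a string column (values such as '2 BHK'), and its min and max are just the first and last values in lexicographic order. To get meaningful statistics, the number of rooms has to be extracted into a numeric column first.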

In [65]:
#Adding a new column to our dataset (based on condition)
from pyspark.sql import functions as f

df_spk=df_spk.withColumn('Parking', f.when(f.col('bhk')>3,'Yes').otherwise('No'))

In [67]:
df_spk.show()

+--------------------+---------+----------+----+-----+---+--------------+-------+
|            location|     size|total_sqft|bath|price|bhk|price_per_sqft|Parking|
+--------------------+---------+----------+----+-----+---+--------------+-------+
|Electronic City P...|    2 BHK|    1056.0| 2.0|39.07|  2|          3699|     No|
|    Chikka Tirupathi|4 Bedroom|    2600.0| 5.0|120.0|  4|          4615|    Yes|
|         Uttarahalli|    3 BHK|    1440.0| 2.0| 62.0|  3|          4305|     No|
|  Lingadheeranahalli|    3 BHK|    1521.0| 3.0| 95.0|  3|          6245|     No|
|            Kothanur|    2 BHK|    1200.0| 2.0| 51.0|  2|          4250|     No|
|          Whitefield|    2 BHK|    1170.0| 2.0| 38.0|  2|          3247|     No|
|    Old Airport Road|    4 BHK|    2732.0| 4.0|204.0|  4|          7467|    Yes|
|        Rajaji Nagar|    4 BHK|    3300.0| 4.0|600.0|  4|         18181|    Yes|
|        Marathahalli|    3 BHK|    1310.0| 3.0|63.25|  3|          4828|     No|
|               

In [71]:
#Dropping a column from the dataframe

df_spk=df_spk.drop('Parking')

In [72]:
df_spk.show()

+--------------------+---------+----------+----+-----+---+--------------+
|            location|     size|total_sqft|bath|price|bhk|price_per_sqft|
+--------------------+---------+----------+----+-----+---+--------------+
|Electronic City P...|    2 BHK|    1056.0| 2.0|39.07|  2|          3699|
|    Chikka Tirupathi|4 Bedroom|    2600.0| 5.0|120.0|  4|          4615|
|         Uttarahalli|    3 BHK|    1440.0| 2.0| 62.0|  3|          4305|
|  Lingadheeranahalli|    3 BHK|    1521.0| 3.0| 95.0|  3|          6245|
|            Kothanur|    2 BHK|    1200.0| 2.0| 51.0|  2|          4250|
|          Whitefield|    2 BHK|    1170.0| 2.0| 38.0|  2|          3247|
|    Old Airport Road|    4 BHK|    2732.0| 4.0|204.0|  4|          7467|
|        Rajaji Nagar|    4 BHK|    3300.0| 4.0|600.0|  4|         18181|
|        Marathahalli|    3 BHK|    1310.0| 3.0|63.25|  3|          4828|
|               other|6 Bedroom|    1020.0| 6.0|370.0|  6|         36274|
|          Whitefield|    3 BHK|    18

In [77]:
#Renaming the columns:

df_spk=df_spk.withColumnRenamed('size','Number of Rooms')

In [79]:
df_spk.show()

+--------------------+---------------+----------+----+-----+---+--------------+
|            location|Number of Rooms|total_sqft|bath|price|bhk|price_per_sqft|
+--------------------+---------------+----------+----+-----+---+--------------+
|Electronic City P...|          2 BHK|    1056.0| 2.0|39.07|  2|          3699|
|    Chikka Tirupathi|      4 Bedroom|    2600.0| 5.0|120.0|  4|          4615|
|         Uttarahalli|          3 BHK|    1440.0| 2.0| 62.0|  3|          4305|
|  Lingadheeranahalli|          3 BHK|    1521.0| 3.0| 95.0|  3|          6245|
|            Kothanur|          2 BHK|    1200.0| 2.0| 51.0|  2|          4250|
|          Whitefield|          2 BHK|    1170.0| 2.0| 38.0|  2|          3247|
|    Old Airport Road|          4 BHK|    2732.0| 4.0|204.0|  4|          7467|
|        Rajaji Nagar|          4 BHK|    3300.0| 4.0|600.0|  4|         18181|
|        Marathahalli|          3 BHK|    1310.0| 3.0|63.25|  3|          4828|
|               other|      6 Bedroom|  