---

<center><h1> DataFrame Operations </h1></center>

---



* 1. **Print Schema**
* 2. **Column Names**
* 3. **Check the Dimensions of the Data**
* 4. **Select Columns**
* 5. **Add new columns**
* 6. **Sorting**
* 7. **GroupBy & Aggregation Functions**

---

We will use the Healthcare Analytics dataset, which has 18 columns:

 - case_id
 - hospital_code
 - hospital_type_code
 - city_code_hospital
 - hospital_region_code
 - extra_room_available
 - department
 - ward_type
 - ward_facility_code
 - bed_grade
 - patient_id
 - city_code_patient
 - admission_type
 - severity_of_illness
 - visitors_with_patient
 - age
 - admission_deposit
 - stay

---



---

#### `Importing the Required Libraries`

---

In [1]:
from pyspark.sql import SparkSession
import pyspark.sql.types as tp

In [2]:
spark = SparkSession.builder.getOrCreate()
spark

---
---
#### `Define the Schema of the Data`

We will define the schema explicitly before loading the data. To do that, we use a [StructType](https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.types.StructType) object that contains a list of [StructField](https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.types.StructField) objects. Each `StructField` holds the name and the data type of one column.

Let's define the schema of the given data.

---

In [3]:
# Define the schema of the data
my_schema = tp.StructType([
    tp.StructField(name= "case_id",               dataType= tp.IntegerType()),
    tp.StructField(name= "hospital_code",         dataType= tp.IntegerType()),
    tp.StructField(name= "hospital_type_code",    dataType= tp.StringType()),
    tp.StructField(name= "city_code_hospital",    dataType= tp.IntegerType()),
    tp.StructField(name= "hospital_region_code",  dataType= tp.StringType()),
    tp.StructField(name= "extra_room_available",  dataType= tp.IntegerType()),
    tp.StructField(name= "department",            dataType= tp.StringType()),
    tp.StructField(name= "ward_type",             dataType= tp.StringType()),
    tp.StructField(name= "ward_facility_code",    dataType= tp.StringType()),
    tp.StructField(name= "bed_grade",             dataType= tp.IntegerType()),
    tp.StructField(name= "patient_id",            dataType= tp.IntegerType()),
    tp.StructField(name= "city_code_patient",     dataType= tp.IntegerType()),
    tp.StructField(name= "admission_type",        dataType= tp.StringType()),
    tp.StructField(name= "severity_of_illness",   dataType= tp.StringType()),
    tp.StructField(name= "visitors_with_patient", dataType= tp.IntegerType()),
    tp.StructField(name= "age",                   dataType= tp.StringType()),
    tp.StructField(name= "admission_deposit",     dataType= tp.FloatType()),
    tp.StructField(name= "stay",                  dataType= tp.StringType()),
])

---
---

#### `Load the Data`

We will load the `module_8_train.csv` file and pass the schema that we have defined in the last step.


---

In [4]:
healthcare_data = spark.read.csv('data/module_8_train.csv', schema=my_schema, header=True)

---
---

#### `Get the schema of the dataframe`

To print the schema, we can use the [printSchema](https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.printSchema) method of the dataframe.


---

In [5]:
healthcare_data.printSchema()

root
 |-- case_id: integer (nullable = true)
 |-- hospital_code: integer (nullable = true)
 |-- hospital_type_code: string (nullable = true)
 |-- city_code_hospital: integer (nullable = true)
 |-- hospital_region_code: string (nullable = true)
 |-- extra_room_available: integer (nullable = true)
 |-- department: string (nullable = true)
 |-- ward_type: string (nullable = true)
 |-- ward_facility_code: string (nullable = true)
 |-- bed_grade: integer (nullable = true)
 |-- patient_id: integer (nullable = true)
 |-- city_code_patient: integer (nullable = true)
 |-- admission_type: string (nullable = true)
 |-- severity_of_illness: string (nullable = true)
 |-- visitors_with_patient: integer (nullable = true)
 |-- age: string (nullable = true)
 |-- admission_deposit: float (nullable = true)
 |-- stay: string (nullable = true)



---
---

#### `Get the column names`


To print the column names, we can use the `columns` attribute of the dataframe.

---

In [6]:
healthcare_data.columns

['case_id',
 'hospital_code',
 'hospital_type_code',
 'city_code_hospital',
 'hospital_region_code',
 'extra_room_available',
 'department',
 'ward_type',
 'ward_facility_code',
 'bed_grade',
 'patient_id',
 'city_code_patient',
 'admission_type',
 'severity_of_illness',
 'visitors_with_patient',
 'age',
 'admission_deposit',
 'stay']

---
---
#### `Check the number of rows & columns`

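A Spark dataframe has no `shape` attribute, so we get the dimensions in two steps: `count()` returns the number of rows, and the length of `columns` gives the number of columns.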

----

In [7]:
(healthcare_data.count(), len(healthcare_data.columns))

(318438, 18)

---
---

#### `Select columns from the dataframe`


If you want to view only a few columns at a time, you can use the `select` function and pass the column names separated by commas.

Let's see how to do that in the following cell.

---

In [8]:
# View only selected columns
sample_data = healthcare_data.select("hospital_code",
                                     "department",
                                     "ward_type",
                                     "patient_id",
                                     "age",
                                     "visitors_with_patient")

# Display data
sample_data.show()

+-------------+------------+---------+----------+-----+---------------------+
|hospital_code|  department|ward_type|patient_id|  age|visitors_with_patient|
+-------------+------------+---------+----------+-----+---------------------+
|            8|radiotherapy|        R|     31397|51-60|                    2|
|            2|radiotherapy|        S|     31397|51-60|                    2|
|           10|  anesthesia|        S|     31397|51-60|                    2|
|           26|radiotherapy|        R|     31397|51-60|                    2|
|           26|radiotherapy|        S|     31397|51-60|                    2|
|           23|  anesthesia|        S|     31397|51-60|                    2|
|           32|radiotherapy|        S|     31397|51-60|                    2|
|           23|radiotherapy|        Q|     31397|51-60|                    2|
|            1|  gynecology|        R|     31397|51-60|                    2|
|           10|  gynecology|        S|     31397|51-60|         

---
---
#### `Drop Column`

If you want to drop a single column from the dataframe, you can simply pass the column name to the [drop](https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.drop) function. It returns a new dataframe and leaves the original unchanged.

---

In [9]:
# Drop column
sample_data_without_age = sample_data.drop("age")

sample_data_without_age.show()

+-------------+------------+---------+----------+---------------------+
|hospital_code|  department|ward_type|patient_id|visitors_with_patient|
+-------------+------------+---------+----------+---------------------+
|            8|radiotherapy|        R|     31397|                    2|
|            2|radiotherapy|        S|     31397|                    2|
|           10|  anesthesia|        S|     31397|                    2|
|           26|radiotherapy|        R|     31397|                    2|
|           26|radiotherapy|        S|     31397|                    2|
|           23|  anesthesia|        S|     31397|                    2|
|           32|radiotherapy|        S|     31397|                    2|
|           23|radiotherapy|        Q|     31397|                    2|
|            1|  gynecology|        R|     31397|                    2|
|           10|  gynecology|        S|     31397|                    2|
|           22|radiotherapy|        S|     31397|               

---
---
#### `Drop Multiple Columns`

If you want to drop multiple columns, put the column names in a list and unpack it with an asterisk ( * ) placed before the list when calling the drop function.

---

In [10]:
# Drop multiple columns
sample_data_drop = sample_data.drop(*["age", "department", "visitors_with_patient"])

sample_data_drop.show()

+-------------+---------+----------+
|hospital_code|ward_type|patient_id|
+-------------+---------+----------+
|            8|        R|     31397|
|            2|        S|     31397|
|           10|        S|     31397|
|           26|        R|     31397|
|           26|        S|     31397|
|           23|        S|     31397|
|           32|        S|     31397|
|           23|        Q|     31397|
|            1|        R|     31397|
|           10|        S|     31397|
|           22|        S|     31397|
|           26|        R|     31397|
|           16|        R|     31397|
|            9|        S|     31397|
|            6|        Q|     63418|
|            6|        Q|     63418|
|           23|        Q|     63418|
|           29|        S|     63418|
|           32|        S|     63418|
|           12|        Q|     63418|
+-------------+---------+----------+
only showing top 20 rows
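
The asterisk is plain Python argument unpacking: it spreads the items of the list out as separate positional arguments, which is the form `drop` expects. The sketch below illustrates this without Spark. `drop_columns` is a hypothetical stand-in that works on a list of column names; it is not the PySpark implementation.

```python
def drop_columns(columns, *names):
    """Hypothetical stand-in for DataFrame.drop: return columns minus the given names."""
    # names arrives as a tuple of every extra positional argument
    return [column for column in columns if column not in names]


columns = ["hospital_code", "department", "ward_type",
           "patient_id", "age", "visitors_with_patient"]
to_drop = ["age", "department", "visitors_with_patient"]

# The asterisk unpacks the list, so both calls pass the same three arguments
unpacked = drop_columns(columns, *to_drop)
explicit = drop_columns(columns, "age", "department", "visitors_with_patient")

print(unpacked)
print(unpacked == explicit)
```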



---
---
#### `Retrieve specific records - single criterion`

If you want to retrieve records that meet a specific condition, then use the *where* function.

---

In [11]:
# Retrieve records where hospital_code is 8
sample_data.where(sample_data.hospital_code==8).show()

+-------------+------------+---------+----------+-----+---------------------+
|hospital_code|  department|ward_type|patient_id|  age|visitors_with_patient|
+-------------+------------+---------+----------+-----+---------------------+
|            8|radiotherapy|        R|     31397|51-60|                    2|
|            8|radiotherapy|        R|     33340|31-40|                    2|
|            8|  gynecology|        Q|    117334|31-40|                    4|
|            8|  gynecology|        R|     52406|71-80|                    2|
|            8|  gynecology|        Q|     52406|71-80|                    2|
|            8|  gynecology|        R|     90761|41-50|                    2|
|            8|radiotherapy|        R|     92488|51-60|                    6|
|            8|  gynecology|        R|    100741|31-40|                    2|
|            8|  gynecology|        Q|     29799|51-60|                    3|
|            8|  gynecology|        R|     28680|31-40|         

---
---
#### `Retrieve specific records - multiple criteria`

To retrieve records based on multiple criteria, combine the conditions with `&` and wrap each condition in its own parentheses.

---

In [12]:
# Retrieve records where hospital_code is 8 and department is "radiotherapy"
sample_data.where((sample_data.hospital_code==8) & (sample_data.department=="radiotherapy")).show()

+-------------+------------+---------+----------+-----+---------------------+
|hospital_code|  department|ward_type|patient_id|  age|visitors_with_patient|
+-------------+------------+---------+----------+-----+---------------------+
|            8|radiotherapy|        R|     31397|51-60|                    2|
|            8|radiotherapy|        R|     33340|31-40|                    2|
|            8|radiotherapy|        R|     92488|51-60|                    6|
|            8|radiotherapy|        R|     25066|61-70|                    2|
|            8|radiotherapy|        R|      9753|31-40|                    2|
|            8|radiotherapy|        R|     12480|21-30|                    2|
|            8|radiotherapy|        Q|     47488|81-90|                    2|
|            8|radiotherapy|        R|     35348|61-70|                    6|
|            8|radiotherapy|        Q|     55774|71-80|                    4|
|            8|radiotherapy|        Q|     27140|51-60|         