---
---

<center><h1>Assignment: DataFrames</h1></center>

---

In this notebook, we will work with cricket commentary data.


---

#### `Importing the required libraries`

---

In [1]:
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
import pyspark.sql.types as tp
from pyspark.sql import functions as F

In [2]:
spark = SparkSession.builder.getOrCreate()
spark

---

#### `Read the CSV File`

---

In [3]:
df = spark.read.csv("data/ind-ban-comment.csv", inferSchema=True, header=True)

In [4]:
## check the schema of the dataframe
df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- Batsman: integer (nullable = true)
 |-- Batsman_Name: string (nullable = true)
 |-- Bowler: integer (nullable = true)
 |-- Bowler_Name: string (nullable = true)
 |-- Commentary: string (nullable = true)
 |-- Detail: string (nullable = true)
 |-- Dismissed: double (nullable = true)
 |-- Id: integer (nullable = true)
 |-- Isball: boolean (nullable = true)
 |-- Isboundary: double (nullable = true)
 |-- Iswicket: double (nullable = true)
 |-- Over: double (nullable = true)
 |-- Runs: integer (nullable = true)
 |-- Summary: string (nullable = true)
 |-- Timestamp: string (nullable = true)
 |-- ZAD: string (nullable = true)



In [5]:
# VIEW THE TOP 4 ROWS OF THE DATA USING THE SHOW FUNCTION
df.show(4)

+---+-------+-----------------+------+-----------------+--------------------+------+---------+---+------+----------+--------+----+----+-------+-------------------+-------+
|_c0|Batsman|     Batsman_Name|Bowler|      Bowler_Name|          Commentary|Detail|Dismissed| Id|Isball|Isboundary|Iswicket|Over|Runs|Summary|          Timestamp|    ZAD|
+---+-------+-----------------+------+-----------------+--------------------+------+---------+---+------+----------+--------+----+----+-------+-------------------+-------+
|  0|  28994|   Mohammed Shami| 63881|Mustafizur Rahman|OUT! Bowled! 5-fe...|     W|  28994.0|346|  true|      null|     1.0|49.6|   0|   null|2019-07-02 13:18:47|   null|
|  1|   5132|Bhuvneshwar Kumar| 63881|Mustafizur Rahman|WIDE AND RUN OUT!...|  W+wd|   5132.0|344|  true|      null|     1.0|49.6|   1|   null|2019-07-02 13:17:28|   null|
|  2|  28994|   Mohammed Shami| 63881|Mustafizur Rahman|Back of a length ...|  null|     null|343|  true|      null|    null|49.5|   1|   nu

In [6]:
df.columns

['_c0',
 'Batsman',
 'Batsman_Name',
 'Bowler',
 'Bowler_Name',
 'Commentary',
 'Detail',
 'Dismissed',
 'Id',
 'Isball',
 'Isboundary',
 'Iswicket',
 'Over',
 'Runs',
 'Summary',
 'Timestamp',
 'ZAD']

In [7]:
# Num rows and cols
(df.count(), len(df.columns))

(605, 17)

---

#### `View only the following columns of the dataframe`

    - Batsman_Name
    - Bowler_Name
    - Dismissed
    - Isboundary
    - Runs

---

In [8]:
# WRITE YOUR CODE HERE
# View only selected columns
df1 = df.select('Batsman_Name','Bowler_Name','Dismissed','Isboundary','Runs')
# Display data
df1.show()

+-----------------+------------------+---------+----------+----+
|     Batsman_Name|       Bowler_Name|Dismissed|Isboundary|Runs|
+-----------------+------------------+---------+----------+----+
|   Mohammed Shami| Mustafizur Rahman|  28994.0|      null|   0|
|Bhuvneshwar Kumar| Mustafizur Rahman|   5132.0|      null|   1|
|   Mohammed Shami| Mustafizur Rahman|     null|      null|   1|
|Bhuvneshwar Kumar| Mustafizur Rahman|     null|      null|   1|
|         MS Dhoni| Mustafizur Rahman|   3676.0|      null|   0|
|         MS Dhoni| Mustafizur Rahman|     null|      null|   0|
|         MS Dhoni| Mustafizur Rahman|     null|      null|   0|
|         MS Dhoni|Mohammad Saifuddin|     null|      null|   1|
|         MS Dhoni|Mohammad Saifuddin|     null|       1.0|   4|
|         MS Dhoni|Mohammad Saifuddin|     null|      null|   0|
|         MS Dhoni|Mohammad Saifuddin|     null|      null|   0|
|         MS Dhoni|Mohammad Saifuddin|     null|       1.0|   4|
|         MS Dhoni|Mohamm

---

#### Find out the number of runs scored by each batsman

---

In [9]:
# Note: this import shadows Python's built-in sum for the rest of the notebook
from pyspark.sql.functions import sum

In [10]:
#### WRITE YOUR CODE HERE
df_grouped = df1.groupBy('Batsman_Name')
df_grouped

<pyspark.sql.group.GroupedData at 0x7ff7a9b44240>

In [11]:
# Total Runs scored
df_runs = df_grouped.agg(sum('Runs').alias('Runs_Scored'))
df_runs

Batsman_Name,Runs_Scored
Soumya Sarkar,34
Mashrafe Mortaza,8
Shakib Al Hasan,68
Mushfiqur Rahim,24
Mohammad Saifuddin,55
Liton Das,22
Rishabh Pant,55
Mohammed Shami,1
Tamim Iqbal,23
Hardik Pandya,0


In [12]:
# SHOW THE RUNS SCORED BY EACH BATSMAN IN DESCENDING ORDER
df_runs_sort = df_runs.sort(df_runs.Runs_Scored.desc())
df_runs_sort

Batsman_Name,Runs_Scored
Rohit Sharma,107
KL Rahul,79
Shakib Al Hasan,68
Rishabh Pant,55
Mohammad Saifuddin,55
Sabbir Rahman,40
MS Dhoni,35
Soumya Sarkar,34
Virat Kohli,26
Mushfiqur Rahim,24


---

#### Which batsman scored the highest number of boundaries?

---

In [13]:
## WRITE YOUR CODE HERE 
# Total boundaries scored
df_boundary = df_grouped.agg(sum('Isboundary').alias('Num_Boundary'))
df_boundary

Batsman_Name,Num_Boundary
Soumya Sarkar,4.0
Mashrafe Mortaza,1.0
Shakib Al Hasan,6.0
Mushfiqur Rahim,3.0
Mohammad Saifuddin,9.0
Liton Das,1.0
Rishabh Pant,7.0
Mohammed Shami,
Tamim Iqbal,3.0
Hardik Pandya,


---

**Define a `udf` function that will create a new column on the basis of the following condition**

If the value of `Runs` is 2 or less, assign `A`; if it is between 3 and 5 (inclusive), assign `B`; otherwise assign `C`.


---

In [14]:
## WRITE YOUR CODE HERE
# Define the function that maps a run count to a category
def func_new_col(run):
    # A: 2 runs or fewer, B: 3 to 5 runs, C: anything higher
    if run <= 2:
        return 'A'
    elif 3 <= run <= 5:
        return 'B'
    else:
        return 'C'
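
A note on range checks in plain Python: `&` is the bitwise AND operator and binds more tightly than comparisons, so an expression such as `run >= 3 & run <= 5` is parsed as `run >= (3 & run) <= 5`, which is not a range test. The short sketch below compares that form with a chained comparison. The helper names `in_range_bitwise` and `in_range_chained` are hypothetical and exist only for this illustration.

```python
def in_range_bitwise(run):
    # Parsed as: run >= (3 & run) <= 5, i.e. a chained comparison
    # against the bitwise AND of 3 and run
    return run >= 3 & run <= 5


def in_range_chained(run):
    # Idiomatic inclusive range test
    return 3 <= run <= 5


for run in (4, 6):
    print(run, in_range_bitwise(run), in_range_chained(run))
# 4 True True
# 6 True False
```

For any `run` of 3 or more, `3 & run` is at most 3, so the bitwise form is always true and the `else` branch of a function built on it would never be reached.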

In [15]:
from pyspark.sql.functions import udf

In [16]:
# Convert to udf function
function_with_udf = udf(f=func_new_col, returnType=tp.StringType())

In [17]:
# Create new column
df1_2 = df1.withColumn("new_column_using_udf", function_with_udf(df1['Runs']))
df1_2

Batsman_Name,Bowler_Name,Dismissed,Isboundary,Runs,new_column_using_udf
Mohammed Shami,Mustafizur Rahman,28994.0,,0,A
Bhuvneshwar Kumar,Mustafizur Rahman,5132.0,,1,A
Mohammed Shami,Mustafizur Rahman,,,1,A
Bhuvneshwar Kumar,Mustafizur Rahman,,,1,A
MS Dhoni,Mustafizur Rahman,3676.0,,0,A
MS Dhoni,Mustafizur Rahman,,,0,A
MS Dhoni,Mustafizur Rahman,,,0,A
MS Dhoni,Mohammad Saifuddin,,,1,A
MS Dhoni,Mohammad Saifuddin,,1.0,4,B
MS Dhoni,Mohammad Saifuddin,,,0,A
