In [1]:
#!pip install pyspark

In [2]:
import pyspark

In [3]:
import pandas as pd
pd.read_csv('test1.csv')

Unnamed: 0,Name,Age,Experience
0,Krish,31,10
1,Sudhansh,30,8
2,Sunny,29,4


In [4]:
type(pd.read_csv('test1.csv'))

pandas.core.frame.DataFrame

In [5]:
from pyspark.sql import SparkSession

The next cell starts a Spark session. Run like this on one machine, Spark works in local mode: the driver and the executors share a single node, which acts as the 'master' (the host).

In [6]:
spark = SparkSession.builder.appName('Practise').getOrCreate()

In [7]:
spark

In [8]:
df_pyspark = spark.read.csv('test1.csv')

In [9]:
df_pyspark.show()

+--------+---+----------+
|     _c0|_c1|       _c2|
+--------+---+----------+
|    Name|Age|Experience|
|   Krish| 31|        10|
|Sudhansh| 30|         8|
|   Sunny| 29|         4|
+--------+---+----------+



In [10]:
spark.read.option('header','true').csv('test1.csv')

DataFrame[Name: string, Age: string, Experience: string]

In [11]:
spark.read.option('header','true').csv('test1.csv').show()

+--------+---+----------+
|    Name|Age|Experience|
+--------+---+----------+
|   Krish| 31|        10|
|Sudhansh| 30|         8|
|   Sunny| 29|         4|
+--------+---+----------+



In [12]:
df_pyspark = spark.read.option('header','true').csv('test1.csv')

In [13]:
type(df_pyspark)

pyspark.sql.dataframe.DataFrame

In [14]:
df_pyspark.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- Experience: string (nullable = true)



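This is a Spark SQL DataFrame, similar in spirit to a Pandas DataFrame. Note that every column was read as a string, because no schema was supplied or inferred. Let's check whether we can read the first few rows.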

In [15]:
df_pyspark.head(3)

[Row(Name='Krish', Age='31', Experience='10'),
 Row(Name='Sudhansh', Age='30', Experience='8'),
 Row(Name='Sunny', Age='29', Experience='4')]

In [16]:
df_pyspark.show()

+--------+---+----------+
|    Name|Age|Experience|
+--------+---+----------+
|   Krish| 31|        10|
|Sudhansh| 30|         8|
|   Sunny| 29|         4|
+--------+---+----------+



Use the 'select()' method to pick out a column by name. It returns a new DataFrame, so call 'show()' on the result to display the rows.

In [17]:
df_pyspark.select('Name')

DataFrame[Name: string]

In [18]:
df_pyspark.select('Name').show()

+--------+
|    Name|
+--------+
|   Krish|
|Sudhansh|
|   Sunny|
+--------+



In [19]:
type(df_pyspark.select('Name'))

pyspark.sql.dataframe.DataFrame

To select more than one column, pass a list of column names:

In [20]:
df_pyspark.select(['Name','Experience']).show()

+--------+----------+
|    Name|Experience|
+--------+----------+
|   Krish|        10|
|Sudhansh|         8|
|   Sunny|         4|
+--------+----------+



In [21]:
df_pyspark['Name']

Column<'Name'>

In [22]:
df_pyspark.dtypes

[('Name', 'string'), ('Age', 'string'), ('Experience', 'string')]

In [23]:
df_pyspark.describe()

DataFrame[summary: string, Name: string, Age: string, Experience: string]

In [24]:
df_pyspark.describe().show()

+-------+-----+----+-----------------+
|summary| Name| Age|       Experience|
+-------+-----+----+-----------------+
|  count|    3|   3|                3|
|   mean| null|30.0|7.333333333333333|
| stddev| null| 1.0|3.055050463303893|
|    min|Krish|  29|               10|
|    max|Sunny|  31|                8|
+-------+-----+----+-----------------+



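No mean or standard deviation can be computed for the string 'Name' column, so those entries are null. Its min and max are determined by string (lexicographic) ordering, not by row index: 'Krish' sorts first and 'Sunny' last. The same ordering explains why 'Experience' shows a min of 10 and a max of 8: the column was read as a string, and '1' sorts before '4' and '8'.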
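The odd-looking min and max for 'Experience' can be reproduced without Spark. The sketch below is plain Python, not a Spark call; it relies on the assumption that Python's string comparison agrees with Spark's for these simple ASCII values.

```python
# Values copied from the table above, kept as strings on purpose.
names = ['Krish', 'Sudhansh', 'Sunny']
experience = ['10', '8', '4']

# Strings compare character by character: '1' < '4' < '8',
# so '10' sorts first even though 10 is the largest number.
min_exp, max_exp = min(experience), max(experience)
min_name, max_name = min(names), max(names)

# Converting to int restores numeric ordering.
numeric = [int(x) for x in experience]
min_num, max_num = min(numeric), max(numeric)

print(min_exp, max_exp)    # 10 8
print(min_name, max_name)  # Krish Sunny
print(min_num, max_num)    # 4 10
```

Reading the file with `inferSchema=True`, as the notebook does for test2.csv in a later cell, gives integer columns, so `describe()` would then report numeric min and max values.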

## Adding a Column

In [25]:
df_pyspark.withColumn('Experience After 2 Years', df_pyspark['Experience']+2)

DataFrame[Name: string, Age: string, Experience: string, Experience After 2 Years: double]

DataFrames are immutable, so 'withColumn' returns a new DataFrame rather than modifying the existing one. To keep the new column, assign the result to a variable:

In [26]:
df_pyspark = df_pyspark.withColumn('Experience After 2 Years', df_pyspark['Experience']+2)

In [27]:
df_pyspark

DataFrame[Name: string, Age: string, Experience: string, Experience After 2 Years: double]

In [28]:
df_pyspark.show()

+--------+---+----------+------------------------+
|    Name|Age|Experience|Experience After 2 Years|
+--------+---+----------+------------------------+
|   Krish| 31|        10|                    12.0|
|Sudhansh| 30|         8|                    10.0|
|   Sunny| 29|         4|                     6.0|
+--------+---+----------+------------------------+



## Dropping Columns

In [29]:
df_pyspark.drop('Experience After 2 Years').show()

+--------+---+----------+
|    Name|Age|Experience|
+--------+---+----------+
|   Krish| 31|        10|
|Sudhansh| 30|         8|
|   Sunny| 29|         4|
+--------+---+----------+



Once again, 'drop' returns a new DataFrame. To make the change stick, assign the result back to the df_pyspark variable:

In [30]:
df_pyspark = df_pyspark.drop('Experience After 2 Years')
df_pyspark.show()

+--------+---+----------+
|    Name|Age|Experience|
+--------+---+----------+
|   Krish| 31|        10|
|Sudhansh| 30|         8|
|   Sunny| 29|         4|
+--------+---+----------+



## Re-naming a Column

In [31]:
df_pyspark.withColumnRenamed('Name', 'New Name').show()

+--------+---+----------+
|New Name|Age|Experience|
+--------+---+----------+
|   Krish| 31|        10|
|Sudhansh| 30|         8|
|   Sunny| 29|         4|
+--------+---+----------+



## PySpark Handling Missing Values
1. Dropping Columns
2. Dropping Rows
3. Parameters of the Drop Functions
4. Handling Missing Values by Mean, Median and Mode

In [32]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Practise').getOrCreate()

In [33]:
spark.read.csv('test2.csv', header=True, inferSchema=True)

DataFrame[Name: string, Age: int, Experience: int, Salary: int]

Call 'show()' to see the entire dataset.

In [34]:
spark.read.csv('test2.csv', header=True, inferSchema=True).show()

+--------+----+----------+------+
|    Name| Age|Experience|Salary|
+--------+----+----------+------+
|   Krish|  31|        10| 30000|
|Sudhansh|  30|         8| 25000|
|   Sunny|  29|         4| 20000|
|    Paul|  24|         3| 20000|
|  Harsha|  21|         1| 15000|
| Shubham|  23|         2| 18000|
|  Mahesh|null|      null| 40000|
|    null|  34|        10| 38000|
|    null|  36|      null|  null|
+--------+----+----------+------+



Save the dataset as a dataframe variable.

In [35]:
df_pyspark = spark.read.csv('test2.csv', header=True, inferSchema=True)

In [36]:
df_pyspark.show()

+--------+----+----------+------+
|    Name| Age|Experience|Salary|
+--------+----+----------+------+
|   Krish|  31|        10| 30000|
|Sudhansh|  30|         8| 25000|
|   Sunny|  29|         4| 20000|
|    Paul|  24|         3| 20000|
|  Harsha|  21|         1| 15000|
| Shubham|  23|         2| 18000|
|  Mahesh|null|      null| 40000|
|    null|  34|        10| 38000|
|    null|  36|      null|  null|
+--------+----+----------+------+



## Dropping the Columns (Again)

In [37]:
df_pyspark.drop('Name').show()

+----+----------+------+
| Age|Experience|Salary|
+----+----------+------+
|  31|        10| 30000|
|  30|         8| 25000|
|  29|         4| 20000|
|  24|         3| 20000|
|  21|         1| 15000|
|  23|         2| 18000|
|null|      null| 40000|
|  34|        10| 38000|
|  36|      null|  null|
+----+----------+------+



In [38]:
df_pyspark.show()

+--------+----+----------+------+
|    Name| Age|Experience|Salary|
+--------+----+----------+------+
|   Krish|  31|        10| 30000|
|Sudhansh|  30|         8| 25000|
|   Sunny|  29|         4| 20000|
|    Paul|  24|         3| 20000|
|  Harsha|  21|         1| 15000|
| Shubham|  23|         2| 18000|
|  Mahesh|null|      null| 40000|
|    null|  34|        10| 38000|
|    null|  36|      null|  null|
+--------+----+----------+------+



## Dropping Specific Rows

This will drop any rows with Null values.

In [39]:
df_pyspark.na.drop().show()

+--------+---+----------+------+
|    Name|Age|Experience|Salary|
+--------+---+----------+------+
|   Krish| 31|        10| 30000|
|Sudhansh| 30|         8| 25000|
|   Sunny| 29|         4| 20000|
|    Paul| 24|         3| 20000|
|  Harsha| 21|         1| 15000|
| Shubham| 23|         2| 18000|
+--------+---+----------+------+



### Drop Function

The drop() function takes three arguments: 'how', 'thresh' and 'subset'. I can view these by placing the cursor inside the function parentheses and pressing Shift-Tab, which opens a small tooltip describing each argument. (It's also worth noting that keyword arguments must always follow positional arguments in a call.)

Hitting the '+' icon in the top right of the dropdown bubble expands the explanation options.

In [40]:
# how argument
df_pyspark.na.drop(how='any').show()

+--------+---+----------+------+
|    Name|Age|Experience|Salary|
+--------+---+----------+------+
|   Krish| 31|        10| 30000|
|Sudhansh| 30|         8| 25000|
|   Sunny| 29|         4| 20000|
|    Paul| 24|         3| 20000|
|  Harsha| 21|         1| 15000|
| Shubham| 23|         2| 18000|
+--------+---+----------+------+



With how='any' (the default), a row is dropped if it contains a null value in any field. With how='all', a row is dropped only if every field in it is null.

In [41]:
# thresh
df_pyspark.na.drop(thresh=2).show()

+--------+----+----------+------+
|    Name| Age|Experience|Salary|
+--------+----+----------+------+
|   Krish|  31|        10| 30000|
|Sudhansh|  30|         8| 25000|
|   Sunny|  29|         4| 20000|
|    Paul|  24|         3| 20000|
|  Harsha|  21|         1| 15000|
| Shubham|  23|         2| 18000|
|  Mahesh|null|      null| 40000|
|    null|  34|        10| 38000|
+--------+----+----------+------+



The 'thresh' argument sets the minimum number of non-null values a row needs in order to be kept: a row is dropped only if it has fewer non-null values than the threshold. With thresh=2, the last row (one non-null value, Age) is dropped, while the 'Mahesh' row (two non-null values) is kept. When 'thresh' is given, it overrides 'how'.

In [42]:
# subset
df_pyspark.na.drop(how='any', subset=['Experience']).show()

+--------+---+----------+------+
|    Name|Age|Experience|Salary|
+--------+---+----------+------+
|   Krish| 31|        10| 30000|
|Sudhansh| 30|         8| 25000|
|   Sunny| 29|         4| 20000|
|    Paul| 24|         3| 20000|
|  Harsha| 21|         1| 15000|
| Shubham| 23|         2| 18000|
|    null| 34|        10| 38000|
+--------+---+----------+------+



With 'subset', only the listed columns are checked for nulls. Here a row is dropped only if its Experience value is null; nulls in other columns, such as the missing Name in the last row shown, are ignored.
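The three arguments can be summarised as one small rule. The function below is a pure-Python model written for illustration (`keep_row` is a hypothetical helper, not part of PySpark); it mirrors the documented behaviour of `na.drop` and is checked against the three null-containing rows of test2.csv.

```python
def keep_row(row, how='any', thresh=None, subset=None):
    """Return True if a row (a dict) survives the drop under this model."""
    cols = subset if subset is not None else list(row)
    non_null = sum(row[c] is not None for c in cols)
    if thresh is not None:
        # thresh takes precedence over how: keep rows with >= thresh non-nulls.
        return non_null >= thresh
    if how == 'any':
        return non_null == len(cols)  # dropped if any checked value is null
    if how == 'all':
        return non_null > 0           # dropped only if all checked values are null
    raise ValueError("how must be 'any' or 'all'")

# The three rows of test2.csv that contain nulls.
mahesh = {'Name': 'Mahesh', 'Age': None, 'Experience': None, 'Salary': 40000}
row_34 = {'Name': None, 'Age': 34, 'Experience': 10, 'Salary': 38000}
row_36 = {'Name': None, 'Age': 36, 'Experience': None, 'Salary': None}
rows = [mahesh, row_34, row_36]

any_kept = [keep_row(r, how='any') for r in rows]
all_kept = [keep_row(r, how='all') for r in rows]
thresh2_kept = [keep_row(r, thresh=2) for r in rows]
subset_kept = [keep_row(r, subset=['Experience']) for r in rows]

print(any_kept, all_kept, thresh2_kept, subset_kept)
```

The results line up with the outputs of cells 40 to 42: how='any' drops all three rows, thresh=2 drops only the last one, and subset=['Experience'] keeps only the row with Age 34.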

In [43]:
df_pyspark.show()

+--------+----+----------+------+
|    Name| Age|Experience|Salary|
+--------+----+----------+------+
|   Krish|  31|        10| 30000|
|Sudhansh|  30|         8| 25000|
|   Sunny|  29|         4| 20000|
|    Paul|  24|         3| 20000|
|  Harsha|  21|         1| 15000|
| Shubham|  23|         2| 18000|
|  Mahesh|null|      null| 40000|
|    null|  34|        10| 38000|
|    null|  36|      null|  null|
+--------+----+----------+------+



## Filling Missing Values

In [55]:
df_pyspark.na.fill('Missing').show()

+--------+----+----------+------+
|    Name| Age|Experience|Salary|
+--------+----+----------+------+
|   Krish|  31|        10| 30000|
|Sudhansh|  30|         8| 25000|
|   Sunny|  29|         4| 20000|
|    Paul|  24|         3| 20000|
|  Harsha|  21|         1| 15000|
| Shubham|  23|         2| 18000|
|  Mahesh|null|      null| 40000|
| Missing|  34|        10| 38000|
| Missing|  36|      null|  null|
+--------+----+----------+------+



This doesn't seem to be working! Why not?

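The fill method only replaces nulls in columns whose type matches the fill value. 'Missing' is a string, so only the string column Name is filled; Age, Experience and Salary are integer columns (the file was read with inferSchema=True) and are left untouched.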

In [None]:
spark.stop()