
Install PySpark

In [95]:
!pip install pyspark



In [96]:
import pyspark

In [97]:
import pandas as pd
pd.read_csv('pyspark_dummy_data.csv')

Unnamed: 0,Name,Age
0,Fair Glowach,34
1,Alberik McGuiness,35
2,Marys Coweuppe,30
3,Ursula Finlaison,35
4,Marchall Danslow,33
...,...,...
95,Gideon Stoll,30
96,Adelheid Wicks,35
97,Alastair Blasio,31
98,Glad MacClay,30


In [98]:
from pyspark.sql import SparkSession

In [99]:
spark = SparkSession.builder.appName('Practice').getOrCreate()

In [100]:
spark

This tutorial covers:

1. PySpark DataFrame
2. Reading the dataset
3. Checking the data types of the columns (schema)
4. Selecting columns and indexing
5. The describe option, similar to pandas
6. Adding columns
7. Dropping columns
8. Renaming columns

In [101]:
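# Read the dataset: header=True takes column names from the first row,
# inferSchema=True infers column types instead of reading everything as string
df_pyspark = spark.read.csv('pyspark_dummy_data2.csv', header=True, inferSchema=True)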

In [102]:
df_pyspark

DataFrame[Name: string, Age: int, Experience: int]

In [103]:
# Check the Schema
df_pyspark.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Experience: integer (nullable = true)



In [104]:
# List the column names
df_pyspark.columns

['Name', 'Age', 'Experience']

In [105]:
df_pyspark.head(3)

[Row(Name='Meridith Marklow', Age=31, Experience=6),
 Row(Name='Rudie Kirkbride', Age=38, Experience=12),
 Row(Name='Gerik Kilrow', Age=41, Experience=15)]

In [106]:
df_pyspark.select('Name').head(3)

[Row(Name='Meridith Marklow'),
 Row(Name='Rudie Kirkbride'),
 Row(Name='Gerik Kilrow')]

In [107]:
# Column names are resolved case-insensitively by default, so 'age' matches Age
df_pyspark.select(['Name','age']).head(3)

[Row(Name='Meridith Marklow', age=31),
 Row(Name='Rudie Kirkbride', age=38),
 Row(Name='Gerik Kilrow', age=41)]

In [108]:
# Summary statistics, similar to describe() in pandas
df_pyspark.describe().show()

+-------+---------------+-----------------+----------------+
|summary|           Name|              Age|      Experience|
+-------+---------------+-----------------+----------------+
|  count|             50|               50|              50|
|   mean|           null|            36.54|            9.68|
| stddev|           null|4.612361385561674|3.12618956950627|
|    min|Aimee Retallick|               30|               5|
|    max|   Zabrina Titt|               45|              15|
+-------+---------------+-----------------+----------------+
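
As far as I know, the `stddev` row of `describe()` is the sample standard deviation (divisor n - 1), and `min`/`max` on a string column follow sort order rather than any numeric meaning. A minimal plain-Python sketch of both points, assuming only the three rows from the `head(3)` output above as illustrative input (so the numbers differ from the 50-row table):

```python
import statistics

# First three Age values from the head(3) output above (illustrative sample only).
ages = [31, 38, 41]

mean_age = statistics.mean(ages)
# statistics.stdev divides by n - 1 (sample standard deviation).
sample_std = statistics.stdev(ages)
# statistics.pstdev divides by n (population standard deviation), shown for contrast.
population_std = statistics.pstdev(ages)

print(mean_age, sample_std, population_std)

# For a string column, min and max are the first and last values in sort order.
names = ["Meridith Marklow", "Rudie Kirkbride", "Gerik Kilrow"]
print(min(names), max(names))
```

The sample figure is always at least as large as the population figure, which is worth remembering when comparing against tools that default to the other definition.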



In [109]:
# Add a column computed from an existing one
df_pyspark = df_pyspark.withColumn('Experience after 2 years', df_pyspark['Experience'] + 2)

In [110]:
df_pyspark.show()

+------------------+---+----------+------------------------+
|              Name|Age|Experience|Experience after 2 years|
+------------------+---+----------+------------------------+
|  Meridith Marklow| 31|         6|                       8|
|   Rudie Kirkbride| 38|        12|                      14|
|      Gerik Kilrow| 41|        15|                      17|
|     Ileane Ablott| 31|        11|                      13|
|   Anthony Selland| 35|         8|                      10|
|  Jock Duckinfield| 43|         5|                       7|
|     Temple Latour| 37|        14|                      16|
|   Delphinia Arnet| 41|        10|                      12|
|  Langston Izakson| 43|         7|                       9|
|      Doria Figura| 30|        12|                      14|
|      Mellie Eyles| 32|        14|                      16|
|      Almire Bertl| 37|         8|                      10|
| Rollins Rignoldes| 30|         7|                       9|
|Jacklin Champerlen| 32|

In [111]:
# Drop a column (returns a new DataFrame; df_pyspark itself is unchanged)
df_pyspark_dropped = df_pyspark.drop('Experience after 2 years')

In [112]:
df_pyspark_dropped.show()

+------------------+---+----------+
|              Name|Age|Experience|
+------------------+---+----------+
|  Meridith Marklow| 31|         6|
|   Rudie Kirkbride| 38|        12|
|      Gerik Kilrow| 41|        15|
|     Ileane Ablott| 31|        11|
|   Anthony Selland| 35|         8|
|  Jock Duckinfield| 43|         5|
|     Temple Latour| 37|        14|
|   Delphinia Arnet| 41|        10|
|  Langston Izakson| 43|         7|
|      Doria Figura| 30|        12|
|      Mellie Eyles| 32|        14|
|      Almire Bertl| 37|         8|
| Rollins Rignoldes| 30|         7|
|Jacklin Champerlen| 32|         9|
| Gretta Sprackling| 33|        11|
|     Gill Edgerton| 38|         5|
|  Jaquith Austwick| 39|        13|
|     Bartie Edwins| 40|        15|
|     Maurine Frude| 33|         8|
|    Tibold Norwell| 37|         9|
+------------------+---+----------+
only showing top 20 rows



In [113]:
# Rename a column
df_pyspark = df_pyspark.withColumnRenamed('Experience', 'Current Experience')

In [114]:
df_pyspark.show()

+------------------+---+------------------+------------------------+
|              Name|Age|Current Experience|Experience after 2 years|
+------------------+---+------------------+------------------------+
|  Meridith Marklow| 31|                 6|                       8|
|   Rudie Kirkbride| 38|                12|                      14|
|      Gerik Kilrow| 41|                15|                      17|
|     Ileane Ablott| 31|                11|                      13|
|   Anthony Selland| 35|                 8|                      10|
|  Jock Duckinfield| 43|                 5|                       7|
|     Temple Latour| 37|                14|                      16|
|   Delphinia Arnet| 41|                10|                      12|
|  Langston Izakson| 43|                 7|                       9|
|      Doria Figura| 30|                12|                      14|
|      Mellie Eyles| 32|                14|                      16|
|      Almire Bertl| 37|          

PySpark: Handling Missing Values

*   Dropping columns
*   Dropping rows
*   Parameters of the drop functions
*   Handling missing values with mean, median, and mode



In [115]:
df_pyspark = spark.read.csv('pyspark_dummy_data3.csv', header=True, inferSchema=True)

In [116]:
df_pyspark.show()

+-------------------+----+----------+-------+
|               Name| Age|Experience| Salary|
+-------------------+----+----------+-------+
|  Heriberto Crebott|  34|         5|1584303|
|      Loren Gossage|  42|        15|1363913|
|   Ruperta Le febre|  43|         9| 929016|
|Jaquenette Ratledge|  45|        13|1622024|
|        Becky Tiner|  38|        12|1908402|
|               null|  31|      null| 145214|
|               null|  32|      null| 187452|
|              Nitin|null|        12| 215214|
+-------------------+----+----------+-------+



In [117]:
df_pyspark.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Experience: integer (nullable = true)
 |-- Salary: integer (nullable = true)



In [119]:
# Drop a column
df_pyspark.drop('Name').show()

+----+----------+-------+
| Age|Experience| Salary|
+----+----------+-------+
|  34|         5|1584303|
|  42|        15|1363913|
|  43|         9| 929016|
|  45|        13|1622024|
|  38|        12|1908402|
|  31|      null| 145214|
|  32|      null| 187452|
|null|        12| 215214|
+----+----------+-------+



In [120]:
# Drop every row that contains at least one null
df_pyspark.na.drop().show()

+-------------------+---+----------+-------+
|               Name|Age|Experience| Salary|
+-------------------+---+----------+-------+
|  Heriberto Crebott| 34|         5|1584303|
|      Loren Gossage| 42|        15|1363913|
|   Ruperta Le febre| 43|         9| 929016|
|Jaquenette Ratledge| 45|        13|1622024|
|        Becky Tiner| 38|        12|1908402|
+-------------------+---+----------+-------+



In [121]:
# how="any" (the default): drop a row if any of its values is null
df_pyspark.na.drop(how="any").show()

+-------------------+---+----------+-------+
|               Name|Age|Experience| Salary|
+-------------------+---+----------+-------+
|  Heriberto Crebott| 34|         5|1584303|
|      Loren Gossage| 42|        15|1363913|
|   Ruperta Le febre| 43|         9| 929016|
|Jaquenette Ratledge| 45|        13|1622024|
|        Becky Tiner| 38|        12|1908402|
+-------------------+---+----------+-------+



In [124]:
# how="all": drop a row only if all of its values are null
df_pyspark.na.drop(how="all").show()

+-------------------+----+----------+-------+
|               Name| Age|Experience| Salary|
+-------------------+----+----------+-------+
|  Heriberto Crebott|  34|         5|1584303|
|      Loren Gossage|  42|        15|1363913|
|   Ruperta Le febre|  43|         9| 929016|
|Jaquenette Ratledge|  45|        13|1622024|
|        Becky Tiner|  38|        12|1908402|
|               null|  31|      null| 145214|
|               null|  32|      null| 187452|
|              Nitin|null|        12| 215214|
+-------------------+----+----------+-------+



In [126]:
# thresh=3: keep only rows with at least 3 non-null values (when thresh is set, it overrides how)
df_pyspark.na.drop(how="any", thresh=3).show()

+-------------------+----+----------+-------+
|               Name| Age|Experience| Salary|
+-------------------+----+----------+-------+
|  Heriberto Crebott|  34|         5|1584303|
|      Loren Gossage|  42|        15|1363913|
|   Ruperta Le febre|  43|         9| 929016|
|Jaquenette Ratledge|  45|        13|1622024|
|        Becky Tiner|  38|        12|1908402|
|              Nitin|null|        12| 215214|
+-------------------+----+----------+-------+



In [123]:
# subset: check only the listed columns for nulls; here, drop rows where Age is null
df_pyspark.na.drop(how="any", subset=['Age']).show()

+-------------------+---+----------+-------+
|               Name|Age|Experience| Salary|
+-------------------+---+----------+-------+
|  Heriberto Crebott| 34|         5|1584303|
|      Loren Gossage| 42|        15|1363913|
|   Ruperta Le febre| 43|         9| 929016|
|Jaquenette Ratledge| 45|        13|1622024|
|        Becky Tiner| 38|        12|1908402|
|               null| 31|      null| 145214|
|               null| 32|      null| 187452|
+-------------------+---+----------+-------+
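
The `how`, `thresh`, and `subset` options above all reduce to one rule: count the non-null values among the checked columns and keep the row if the count is high enough. The plain-Python sketch below mimics that rule on the eight rows shown earlier so it can be checked by hand without a Spark session. `drop_nulls` is a hypothetical helper written for illustration, not part of PySpark, and `None` stands in for null.

```python
COLUMNS = ("Name", "Age", "Experience", "Salary")

# The eight rows printed by df_pyspark.show() above; None stands in for null.
ROWS = [
    ("Heriberto Crebott", 34, 5, 1584303),
    ("Loren Gossage", 42, 15, 1363913),
    ("Ruperta Le febre", 43, 9, 929016),
    ("Jaquenette Ratledge", 45, 13, 1622024),
    ("Becky Tiner", 38, 12, 1908402),
    (None, 31, None, 145214),
    (None, 32, None, 187452),
    ("Nitin", None, 12, 215214),
]


def drop_nulls(rows, how="any", thresh=None, subset=None):
    """Hypothetical helper that mimics the row filter of na.drop."""
    checked = subset if subset is not None else COLUMNS
    positions = [COLUMNS.index(name) for name in checked]
    if thresh is None:
        # "any" requires every checked value to be present; "all" requires one.
        thresh = len(positions) if how == "any" else 1
    # Keep a row when it has at least `thresh` non-null checked values.
    return [
        row for row in rows
        if sum(row[i] is not None for i in positions) >= thresh
    ]


print(len(drop_nulls(ROWS)))                  # how="any"
print(len(drop_nulls(ROWS, how="all")))       # how="all"
print(len(drop_nulls(ROWS, thresh=3)))        # at least 3 non-null values
print(len(drop_nulls(ROWS, subset=["Age"])))  # only Age is checked
```

Under this rule the `Nitin` row has three non-null values, so it passes `thresh=3` even though it fails `how="any"`.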



In [138]:
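# Fill missing values. A string fill value is applied only to string columns,
# so the nulls in Age and Experience (integer columns) are left as they are
df_pyspark.na.fill('Missing Values', ['Name', 'Age', 'Experience']).show()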

+-------------------+----+----------+-------+
|               Name| Age|Experience| Salary|
+-------------------+----+----------+-------+
|  Heriberto Crebott|  34|         5|1584303|
|      Loren Gossage|  42|        15|1363913|
|   Ruperta Le febre|  43|         9| 929016|
|Jaquenette Ratledge|  45|        13|1622024|
|        Becky Tiner|  38|        12|1908402|
|     Missing Values|  31|      null| 145214|
|     Missing Values|  32|      null| 187452|
|              Nitin|null|        12| 215214|
+-------------------+----+----------+-------+
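
The output above leaves `Age` and `Experience` untouched because the fill value is a string and those columns are integers. A rough plain-Python sketch of that type-matching rule, using the three rows that contain nulls; `fill_nulls` and `SCHEMA` are hypothetical names for illustration, and the real method handles more types than this simplified check does:

```python
# Column types as reported by printSchema() above.
SCHEMA = {"Name": str, "Age": int, "Experience": int, "Salary": int}

# The three rows of the dataset that contain nulls (None stands in for null).
rows = [
    {"Name": None, "Age": 31, "Experience": None, "Salary": 145214},
    {"Name": None, "Age": 32, "Experience": None, "Salary": 187452},
    {"Name": "Nitin", "Age": None, "Experience": 12, "Salary": 215214},
]


def fill_nulls(rows, value, subset):
    """Hypothetical helper: fill None only where the value's type matches the column's."""
    filled = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows are not modified
        for col in subset:
            if new_row[col] is None and isinstance(value, SCHEMA[col]):
                new_row[col] = value
        filled.append(new_row)
    return filled


as_text = fill_nulls(rows, "Missing Values", ["Name", "Age", "Experience"])
as_number = fill_nulls(rows, 0, ["Name", "Age", "Experience"])

for row in as_text:
    print(row)
for row in as_number:
    print(row)
```

To fill both kinds of column in one pass, `na.fill` also accepts a dictionary that maps column names to replacement values.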



In [131]:
# Fill the null cells with the column median (the strategy set below)
from pyspark.ml.feature import Imputer

imputer = Imputer(
    inputCols=['Age', 'Experience', 'Salary'], 
    outputCols=["{}_imputed".format(c) for c in ['Age', 'Experience', 'Salary']]
    ).setStrategy("median")
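
To see what the median strategy should produce, the fill values can be worked out by hand from the eight rows shown earlier. The sketch below does that in plain Python. It assumes `statistics.median_low` is a reasonable stand-in for Spark's median, which to my understanding is computed approximately and returns a value that occurs in the column; on larger data the two may differ slightly. `impute_median` is a hypothetical helper, not part of PySpark.

```python
import statistics

# Age and Experience as printed by df_pyspark.show() above; None stands in for null.
ages = [34, 42, 43, 45, 38, 31, 32, None]
experience = [5, 15, 9, 13, 12, None, None, 12]


def impute_median(values):
    """Hypothetical helper: replace None with the median of the non-null values."""
    present = [v for v in values if v is not None]
    # median_low returns an actual data point, never an average of two.
    fill = statistics.median_low(present)
    return [fill if v is None else v for v in values]


print(impute_median(ages))
print(impute_median(experience))
```

`Salary` has no nulls in this dataset, so its imputed column is a straight copy.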

In [132]:
# Fit the imputer and add the *_imputed columns to the DataFrame
imputer.fit(df_pyspark).transform(df_pyspark).show()

+-------------------+----+----------+-------+-----------+------------------+--------------+
|               Name| Age|Experience| Salary|Age_imputed|Experience_imputed|Salary_imputed|
+-------------------+----+----------+-------+-----------+------------------+--------------+
|  Heriberto Crebott|  34|         5|1584303|         34|                 5|       1584303|
|      Loren Gossage|  42|        15|1363913|         42|                15|       1363913|
|   Ruperta Le febre|  43|         9| 929016|         43|                 9|        929016|
|Jaquenette Ratledge|  45|        13|1622024|         45|                13|       1622024|
|        Becky Tiner|  38|        12|1908402|         38|                12|       1908402|
|               null|  31|      null| 145214|         31|                12|        145214|
|               null|  32|      null| 187452|         32|                12|        187452|
|              Nitin|null|        12| 215214|         38|                12|    