
# PySpark Introduction

## Installing PySpark

In [None]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.2.0.tar.gz (281.3 MB)
Collecting py4j==0.10.9.2
  Downloading py4j-0.10.9.2-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.2.0-py2.py3-none-any.whl size=281805912 sha256=51ad3aae307f83157b489db355143a49c1cc090e6ece844303cd26285667c2b6
  Stored in directory: /root/.cache/pip/wheels/0b/de/d2/9be5d59d7331c6c2a7c1b6d1a4f463ce107332b1ecd4e80718
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.2 pyspark-3.2.0


In [None]:
from pyspark.sql import SparkSession

In [None]:
spark = SparkSession.builder.appName('Exercise1').getOrCreate()

In [None]:
spark

In [None]:
df = spark.read.csv("Social_Network_Ads.csv", header=True, inferSchema=True)

In [None]:
df

DataFrame[Age: int, EstimatedSalary: int, Purchased: int]

In [None]:
df.show()

+---+---------------+---------+
|Age|EstimatedSalary|Purchased|
+---+---------------+---------+
| 19|          19000|        0|
| 35|          20000|        0|
| 26|          43000|        0|
| 27|          57000|        0|
| 19|          76000|        0|
| 27|          58000|        0|
| 27|          84000|        0|
| 32|         150000|        1|
| 25|          33000|        0|
| 35|          65000|        0|
| 26|          80000|        0|
| 26|          52000|        0|
| 20|          86000|        0|
| 32|          18000|        0|
| 18|          82000|        0|
| 29|          80000|        0|
| 47|          25000|        1|
| 45|          26000|        1|
| 46|          28000|        1|
| 48|          29000|        1|
+---+---------------+---------+
only showing top 20 rows



In [None]:
type(df)

pyspark.sql.dataframe.DataFrame

In [None]:
df.head(3)

[Row(Age=19, EstimatedSalary=19000, Purchased=0),
 Row(Age=35, EstimatedSalary=20000, Purchased=0),
 Row(Age=26, EstimatedSalary=43000, Purchased=0)]

## Part 1

In this section we will do the following:


*   Read a dataset
*   Check the datatypes of the columns (schema)
*   Select columns and index into the DataFrame
*   Use the describe option (similar to pandas)
*   Add columns
*   Drop columns



## Checking the datatypes (schema)

In [None]:
df.printSchema()

root
 |-- Age: integer (nullable = true)
 |-- EstimatedSalary: integer (nullable = true)
 |-- Purchased: integer (nullable = true)



In [None]:
df.columns

['Age', 'EstimatedSalary', 'Purchased']

In [None]:
df.head(3)

[Row(Age=19, EstimatedSalary=19000, Purchased=0),
 Row(Age=35, EstimatedSalary=20000, Purchased=0),
 Row(Age=26, EstimatedSalary=43000, Purchased=0)]

In [None]:
## Selecting columns from dataframe

In [None]:
df.select('Age', 'EstimatedSalary').show()

+---+---------------+
|Age|EstimatedSalary|
+---+---------------+
| 19|          19000|
| 35|          20000|
| 26|          43000|
| 27|          57000|
| 19|          76000|
| 27|          58000|
| 27|          84000|
| 32|         150000|
| 25|          33000|
| 35|          65000|
| 26|          80000|
| 26|          52000|
| 20|          86000|
| 32|          18000|
| 18|          82000|
| 29|          80000|
| 47|          25000|
| 45|          26000|
| 46|          28000|
| 48|          29000|
+---+---------------+
only showing top 20 rows



In [None]:
df.dtypes

[('Age', 'int'), ('EstimatedSalary', 'int'), ('Purchased', 'int')]

In [None]:
df.describe().show()

+-------+------------------+----------------+------------------+
|summary|               Age| EstimatedSalary|         Purchased|
+-------+------------------+----------------+------------------+
|  count|               400|             400|               400|
|   mean|            37.655|         69742.5|            0.3575|
| stddev|10.482876597307927|34096.9602824248|0.4798639635968691|
|    min|                18|           15000|                 0|
|    max|                60|          150000|                 1|
+-------+------------------+----------------+------------------+



In [None]:
## Adding columns
new_df = df.withColumn('Age after 2 years', df['Age']+2)

In [None]:
new_df.show()

+---+---------------+---------+-----------------+
|Age|EstimatedSalary|Purchased|Age after 2 years|
+---+---------------+---------+-----------------+
| 19|          19000|        0|               21|
| 35|          20000|        0|               37|
| 26|          43000|        0|               28|
| 27|          57000|        0|               29|
| 19|          76000|        0|               21|
| 27|          58000|        0|               29|
| 27|          84000|        0|               29|
| 32|         150000|        1|               34|
| 25|          33000|        0|               27|
| 35|          65000|        0|               37|
| 26|          80000|        0|               28|
| 26|          52000|        0|               28|
| 20|          86000|        0|               22|
| 32|          18000|        0|               34|
| 18|          82000|        0|               20|
| 29|          80000|        0|               31|
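| 47|          25000|        1|               49|
| 45|          26000|        1|               47|
| 46|          28000|        1|               48|
| 48|          29000|        1|               50|
+---+---------------+---------+-----------------+
only showing top 20 rows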


In [None]:
## Renaming columns
new_df = new_df.withColumnRenamed('Age after 2 years', 'Age+2yrs')

In [None]:
new_df.show()

+---+---------------+---------+--------+
|Age|EstimatedSalary|Purchased|Age+2yrs|
+---+---------------+---------+--------+
| 19|          19000|        0|      21|
| 35|          20000|        0|      37|
| 26|          43000|        0|      28|
| 27|          57000|        0|      29|
| 19|          76000|        0|      21|
| 27|          58000|        0|      29|
| 27|          84000|        0|      29|
| 32|         150000|        1|      34|
| 25|          33000|        0|      27|
| 35|          65000|        0|      37|
| 26|          80000|        0|      28|
| 26|          52000|        0|      28|
| 20|          86000|        0|      22|
| 32|          18000|        0|      34|
| 18|          82000|        0|      20|
| 29|          80000|        0|      31|
| 47|          25000|        1|      49|
| 45|          26000|        1|      47|
| 46|          28000|        1|      48|
| 48|          29000|        1|      50|
+---+---------------+---------+--------+
only showing top 20 rows

In [None]:
## Dropping columns
new_df = new_df.drop("Age+2yrs")

In [None]:
new_df.show()

+---+---------------+---------+
|Age|EstimatedSalary|Purchased|
+---+---------------+---------+
| 19|          19000|        0|
| 35|          20000|        0|
| 26|          43000|        0|
| 27|          57000|        0|
| 19|          76000|        0|
| 27|          58000|        0|
| 27|          84000|        0|
| 32|         150000|        1|
| 25|          33000|        0|
| 35|          65000|        0|
| 26|          80000|        0|
| 26|          52000|        0|
| 20|          86000|        0|
| 32|          18000|        0|
| 18|          82000|        0|
| 29|          80000|        0|
| 47|          25000|        1|
| 45|          26000|        1|
| 46|          28000|        1|
| 48|          29000|        1|
+---+---------------+---------+
only showing top 20 rows



In [None]:
## Dropping rows that contain null values (default: how="any")
df.na.drop().show()

+---+---------------+---------+
|Age|EstimatedSalary|Purchased|
+---+---------------+---------+
| 19|          19000|        0|
| 35|          20000|        0|
| 26|          43000|        0|
| 27|          57000|        0|
| 19|          76000|        0|
| 27|          58000|        0|
| 27|          84000|        0|
| 32|         150000|        1|
| 25|          33000|        0|
| 35|          65000|        0|
| 26|          80000|        0|
| 26|          52000|        0|
| 20|          86000|        0|
| 32|          18000|        0|
| 18|          82000|        0|
| 29|          80000|        0|
| 47|          25000|        1|
| 45|          26000|        1|
| 46|          28000|        1|
| 48|          29000|        1|
+---+---------------+---------+
only showing top 20 rows



In [None]:
## thresh overrides how: keep only rows with at least 2 non-null values
df.na.drop(how="any", thresh=2).show()

+---+---------------+---------+
|Age|EstimatedSalary|Purchased|
+---+---------------+---------+
| 19|          19000|        0|
| 35|          20000|        0|
| 26|          43000|        0|
| 27|          57000|        0|
| 19|          76000|        0|
| 27|          58000|        0|
| 27|          84000|        0|
| 32|         150000|        1|
| 25|          33000|        0|
| 35|          65000|        0|
| 26|          80000|        0|
| 26|          52000|        0|
| 20|          86000|        0|
| 32|          18000|        0|
| 18|          82000|        0|
| 29|          80000|        0|
| 47|          25000|        1|
| 45|          26000|        1|
| 46|          28000|        1|
| 48|          29000|        1|
+---+---------------+---------+
only showing top 20 rows

