# What is PySpark

* PySpark is the Python API for Apache Spark. It was released so that Spark applications can be written in Python.

* PySpark also lets you work with Spark's Resilient Distributed Datasets (RDDs) from Python.

# Installation

In [51]:
!pip install pyspark



In [52]:
import pyspark

**When working with PySpark, you always need to start a Spark session first**

In [53]:
from pyspark.sql import SparkSession

In [54]:
spark = SparkSession.builder.appName('Pyspark Tutorial').getOrCreate()
spark

# Reading the Dataset in PySpark

In [55]:
df_pyspark = spark.read.csv('/content/sample_data/california_housing_test.csv')

In [56]:
df_pyspark.show()

+-----------+---------+------------------+-----------+--------------+-----------+----------+-------------+------------------+
|        _c0|      _c1|               _c2|        _c3|           _c4|        _c5|       _c6|          _c7|               _c8|
+-----------+---------+------------------+-----------+--------------+-----------+----------+-------------+------------------+
|  longitude| latitude|housing_median_age|total_rooms|total_bedrooms| population|households|median_income|median_house_value|
|-122.050000|37.370000|         27.000000|3885.000000|    661.000000|1537.000000|606.000000|     6.608500|     344700.000000|
|-118.300000|34.260000|         43.000000|1510.000000|    310.000000| 809.000000|277.000000|     3.599000|     176500.000000|
|-117.810000|33.780000|         27.000000|3589.000000|    507.000000|1484.000000|495.000000|     5.793400|     270500.000000|
|-118.360000|33.820000|         28.000000|  67.000000|     15.000000|  49.000000| 11.000000|     6.135900|     330000.

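**Without the `header` option, Spark treats the first line of the file as data and names the columns `_c0`, `_c1`, and so on. Setting `header` to `true` uses the first line as the column names.**

In [57]: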
df_spark = spark.read.option('header','true').csv('/content/sample_data/california_housing_test.csv')

In [58]:
df_spark.show()

+-----------+---------+------------------+-----------+--------------+-----------+-----------+-------------+------------------+
|  longitude| latitude|housing_median_age|total_rooms|total_bedrooms| population| households|median_income|median_house_value|
+-----------+---------+------------------+-----------+--------------+-----------+-----------+-------------+------------------+
|-122.050000|37.370000|         27.000000|3885.000000|    661.000000|1537.000000| 606.000000|     6.608500|     344700.000000|
|-118.300000|34.260000|         43.000000|1510.000000|    310.000000| 809.000000| 277.000000|     3.599000|     176500.000000|
|-117.810000|33.780000|         27.000000|3589.000000|    507.000000|1484.000000| 495.000000|     5.793400|     270500.000000|
|-118.360000|33.820000|         28.000000|  67.000000|     15.000000|  49.000000|  11.000000|     6.135900|     330000.000000|
|-119.670000|36.330000|         19.000000|1241.000000|    244.000000| 850.000000| 237.000000|     2.937500|    

In [59]:
type(df_spark)

pyspark.sql.dataframe.DataFrame

In [60]:
df_spark.head(5)

[Row(longitude='-122.050000', latitude='37.370000', housing_median_age='27.000000', total_rooms='3885.000000', total_bedrooms='661.000000', population='1537.000000', households='606.000000', median_income='6.608500', median_house_value='344700.000000'),
 Row(longitude='-118.300000', latitude='34.260000', housing_median_age='43.000000', total_rooms='1510.000000', total_bedrooms='310.000000', population='809.000000', households='277.000000', median_income='3.599000', median_house_value='176500.000000'),
 Row(longitude='-117.810000', latitude='33.780000', housing_median_age='27.000000', total_rooms='3589.000000', total_bedrooms='507.000000', population='1484.000000', households='495.000000', median_income='5.793400', median_house_value='270500.000000'),
 Row(longitude='-118.360000', latitude='33.820000', housing_median_age='28.000000', total_rooms='67.000000', total_bedrooms='15.000000', population='49.000000', households='11.000000', median_income='6.135900', median_house_value='330000.0

**printSchema prints each column's name, data type, and nullability**

In [61]:
df_spark.printSchema()

root
 |-- longitude: string (nullable = true)
 |-- latitude: string (nullable = true)
 |-- housing_median_age: string (nullable = true)
 |-- total_rooms: string (nullable = true)
 |-- total_bedrooms: string (nullable = true)
 |-- population: string (nullable = true)
 |-- households: string (nullable = true)
 |-- median_income: string (nullable = true)
 |-- median_house_value: string (nullable = true)



In [62]:
df_spark.columns

['longitude',
 'latitude',
 'housing_median_age',
 'total_rooms',
 'total_bedrooms',
 'population',
 'households',
 'median_income',
 'median_house_value']

**If we don't pass inferSchema when reading the dataset, Spark reads every column as a string by default**

In [63]:
df = spark.read.csv('/content/sample_data/california_housing_test.csv',header=True,inferSchema=True)

In [64]:
df.printSchema()

root
 |-- longitude: double (nullable = true)
 |-- latitude: double (nullable = true)
 |-- housing_median_age: double (nullable = true)
 |-- total_rooms: double (nullable = true)
 |-- total_bedrooms: double (nullable = true)
 |-- population: double (nullable = true)
 |-- households: double (nullable = true)
 |-- median_income: double (nullable = true)
 |-- median_house_value: double (nullable = true)



In [65]:
df.dtypes

[('longitude', 'double'),
 ('latitude', 'double'),
 ('housing_median_age', 'double'),
 ('total_rooms', 'double'),
 ('total_bedrooms', 'double'),
 ('population', 'double'),
 ('households', 'double'),
 ('median_income', 'double'),
 ('median_house_value', 'double')]

In [66]:
df.describe().show()

+-------+-------------------+------------------+------------------+-----------------+------------------+------------------+------------------+------------------+------------------+
|summary|          longitude|          latitude|housing_median_age|      total_rooms|    total_bedrooms|        population|        households|     median_income|median_house_value|
+-------+-------------------+------------------+------------------+-----------------+------------------+------------------+------------------+------------------+------------------+
|  count|               3000|              3000|              3000|             3000|              3000|              3000|              3000|              3000|              3000|
|   mean|-119.58920000000029| 35.63538999999999|28.845333333333333|2599.578666666667| 529.9506666666666|1402.7986666666666|           489.912| 3.807271799999998|        205846.275|
| stddev| 1.9949362939550166|2.1296695233438334|12.555395554955757|2155.593331625582|415.654368

**Selecting one or more columns**

In [67]:
latitude = df.select('latitude')

In [68]:
latitude.show()

+--------+
|latitude|
+--------+
|   37.37|
|   34.26|
|   33.78|
|   33.82|
|   36.33|
|   36.51|
|   38.63|
|   35.48|
|    38.4|
|   34.08|
|   33.98|
|   35.85|
|   37.25|
|   32.97|
|   33.73|
|   33.81|
|   37.53|
|   38.69|
|   34.21|
|   38.01|
+--------+
only showing top 20 rows



In [69]:
population_medianincome = df.select(['population','median_income'])

In [70]:
population_medianincome.show()

+----------+-------------+
|population|median_income|
+----------+-------------+
|    1537.0|       6.6085|
|     809.0|        3.599|
|    1484.0|       5.7934|
|      49.0|       6.1359|
|     850.0|       2.9375|
|     663.0|       1.6635|
|     604.0|       1.6641|
|    1341.0|        3.225|
|    1446.0|       3.6696|
|    2830.0|       2.3333|
|    1288.0|       2.2054|
|     564.0|       2.4167|
|     535.0|         4.69|
|    1935.0|       4.5625|
|    1217.0|       5.7121|
|     157.0|          2.2|
|     189.0|        1.875|
|    1603.0|       2.7174|
|     654.0|       6.5851|
|    3450.0|       6.1724|
+----------+-------------+
only showing top 20 rows



In [71]:
type(population_medianincome)

pyspark.sql.dataframe.DataFrame

**Adding a column to the dataset**

In [72]:
add_column = df.withColumn('double median income',df['median_income']*2)

In [73]:
add_column.show()

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+--------------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|double median income|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+--------------------+
|  -122.05|   37.37|              27.0|     3885.0|         661.0|    1537.0|     606.0|       6.6085|          344700.0|              13.217|
|   -118.3|   34.26|              43.0|     1510.0|         310.0|     809.0|     277.0|        3.599|          176500.0|               7.198|
|  -117.81|   33.78|              27.0|     3589.0|         507.0|    1484.0|     495.0|       5.7934|          270500.0|             11.5868|
|  -118.36|   33.82|              28.0|       67.0|          15.0|      49.0|      11.0|       6.1359|          330000.0|             12.2718|

**Dropping the Column**

In [74]:
dropping_column = add_column.drop('double median income')

In [75]:
dropping_column.show()

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+
|  -122.05|   37.37|              27.0|     3885.0|         661.0|    1537.0|     606.0|       6.6085|          344700.0|
|   -118.3|   34.26|              43.0|     1510.0|         310.0|     809.0|     277.0|        3.599|          176500.0|
|  -117.81|   33.78|              27.0|     3589.0|         507.0|    1484.0|     495.0|       5.7934|          270500.0|
|  -118.36|   33.82|              28.0|       67.0|          15.0|      49.0|      11.0|       6.1359|          330000.0|
|  -119.67|   36.33|              19.0|     1241.0|         244.0|     850.0|     237.0|       2.9375|           81700.0|
|  -119.56|   36.51|    

**Renaming the Column**

In [76]:
rename_column_name = df.withColumnRenamed('population','Population')

In [77]:
rename_column_name.show()

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|Population|households|median_income|median_house_value|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+
|  -122.05|   37.37|              27.0|     3885.0|         661.0|    1537.0|     606.0|       6.6085|          344700.0|
|   -118.3|   34.26|              43.0|     1510.0|         310.0|     809.0|     277.0|        3.599|          176500.0|
|  -117.81|   33.78|              27.0|     3589.0|         507.0|    1484.0|     495.0|       5.7934|          270500.0|
|  -118.36|   33.82|              28.0|       67.0|          15.0|      49.0|      11.0|       6.1359|          330000.0|
|  -119.67|   36.33|              19.0|     1241.0|         244.0|     850.0|     237.0|       2.9375|           81700.0|
|  -119.56|   36.51|    