# PySpark on Google Colab

## [Optional] Connecting to Google Drive directories

In [7]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Running Spark Code

### Install pyspark library

In [4]:
!pip install pyspark



### Start Spark Session

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder\
        .master("local")\
        .appName("PySparkOnColab")\
        .config('spark.ui.port', '4050')\
        .getOrCreate()

In [3]:
spark

### Exploring the DataFrame

I downloaded the [Friends Series Dataset](https://www.kaggle.com/datasets/rezaghari/friends-series-dataset?resource=download) from Kaggle, uploaded it to the `public_datasets` directory in my Drive, and read it into a Spark DataFrame.

In [9]:
df = spark.read.csv('/content/drive/MyDrive/public_datasets/friends_episodes_v3.csv', header=True)

The DataFrame is now loaded and ready to explore.

In [14]:
df.printSchema()

root
 |-- Year_of_prod: string (nullable = true)
 |-- Season: string (nullable = true)
 |-- Episode Number: string (nullable = true)
 |-- Episode_Title: string (nullable = true)
 |-- Duration: string (nullable = true)
 |-- Summary: string (nullable = true)
 |-- Director: string (nullable = true)
 |-- Stars: string (nullable = true)
 |-- Votes: string (nullable = true)



In [11]:
df.show(5)

+------------+------+--------------+--------------------+--------+--------------------+-------------+-----+-----+
|Year_of_prod|Season|Episode Number|       Episode_Title|Duration|             Summary|     Director|Stars|Votes|
+------------+------+--------------+--------------------+--------+--------------------+-------------+-----+-----+
|        1994|     1|             1|The One Where Mon...|      22|"Monica and the g...|James Burrows|  8.3| 7440|
|        1994|     1|             2|The One with the ...|      22|Ross finds out hi...|James Burrows|  8.1| 4888|
|        1994|     1|             3|The One with the ...|      22|Monica becomes ir...|James Burrows|  8.2| 4605|
|        1994|     1|             4|The One with Geor...|      22|Joey and Chandler...|James Burrows|  8.1| 4468|
|        1994|     1|             5|The One with the ...|      22|Eager to spend ti...|Pamela Fryman|  8.5| 4438|
+------------+------+--------------+--------------------+--------+--------------------+-------------+-----+-----+

In [12]:
df.count()

236

In [15]:
df.select("Duration", "Stars", "Votes").summary().show()

+-------+------------------+-------------------+--------------------+
|summary|          Duration|              Stars|               Votes|
+-------+------------------+-------------------+--------------------+
|  count|               236|                236|                 236|
|   mean|22.338983050847457|   8.45991379310345|  3348.1850427350423|
| stddev| 1.514302530177835|0.39824940670237874|   920.1202921862531|
|    min|                22|             Rachel| Phoebe and Chand...|
|    25%|              22.0|                8.2|              2871.0|
|    50%|              22.0|                8.4|              3150.0|
|    75%|              22.0|                8.7|              3591.0|
|    max|                30|       Kevin Bright|      Gary Halvorson|
+-------+------------------+-------------------+--------------------+



Do you want to know more about PySpark?
[Start here](https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_df.html)