In [43]:
!pip install pyspark
!pip install findspark





In [None]:
import os

# Drop any previously set SPARK_HOME (no error if it is unset)
os.environ.pop("SPARK_HOME", None)

In [None]:
import pyspark as py
import pandas as pd
import numpy as np
import findspark
findspark.init()
pd.read_csv('/content/sample_data/mnist_test.csv')

In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
from pyspark.sql import SparkSession

In [None]:
spark = SparkSession.builder.appName('colab-lab').getOrCreate()

## Verify installation

### Subtask:
Verify the installation by running a simple Spark command.

**Reasoning**:
Check the environment first by printing `SPARK_HOME` and `PATH`, then create a SparkSession and run a simple command.

In [None]:
import os

print("SPARK_HOME:", os.environ.get('SPARK_HOME'))
print("PATH:", os.environ.get('PATH'))

In [None]:
spark = SparkSession.builder.appName('colab-lab').getOrCreate()

In [None]:
df_spark = spark.read.option('header','true').csv('/content/sample_data/test1.csv')
df_spark.show()


In [None]:
edited_spark = df_spark.limit(5)

In [None]:
df_spark.printSchema()
edited_spark.printSchema()
type(edited_spark)

# Task
Explain how to download and install Spark binaries in Colab.

## Download Spark binaries

### Subtask:
Download the desired Spark binary version from the Apache Spark website.


**Reasoning**:
Download the Spark binary using wget.



In [None]:
!wget https://dlcdn.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz

**Reasoning**:
The previous download link resulted in a 404 error. I need to find a valid download link for a recent stable version of Spark. I will try a different version.



In [None]:
!wget https://dlcdn.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz

**Reasoning**:
Both `dlcdn.apache.org` links returned 404 errors, most likely because that server only hosts current releases. I will download Spark 3.3.0 from `archive.apache.org`, which keeps older releases.
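
The fallback from the primary server to the archive server can be written down as a small helper. This is a minimal sketch: `candidate_urls` is a hypothetical name, and the URL patterns are simply the ones used by the `wget` commands in this notebook.

```python
# Hypothetical helper: list the URLs to try for a given Spark release,
# primary download server first, archive server second.
def candidate_urls(version, package="bin-hadoop3"):
    filename = f"spark-{version}-{package}.tgz"
    return [
        f"https://dlcdn.apache.org/spark/spark-{version}/{filename}",
        f"https://archive.apache.org/dist/spark/spark-{version}/{filename}",
    ]

for url in candidate_urls("3.3.0"):
    print(url)
```

Each URL can then be passed to `wget` in turn until one succeeds.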



In [None]:
!wget https://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz

## Extract the Spark binaries

### Subtask:
Extract the downloaded archive to a directory in your Colab environment.


**Reasoning**:
Extract the downloaded Spark archive into `/usr/local/` using the `tar` command.
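
For reference, the same extraction can be done from Python with the standard `tarfile` module, followed by a quick check that the result looks like a Spark installation. This is a sketch, not the notebook's method: `extract_spark` and `looks_like_spark_home` are hypothetical helpers, and the code assumes the archive has a single top-level directory containing `bin/spark-submit`.

```python
import os
import tarfile

def extract_spark(archive_path, dest_dir):
    """Extract a Spark .tgz into dest_dir and return the extracted top-level directory."""
    with tarfile.open(archive_path, "r:gz") as tar:
        # Assumption: every member lives under one top-level directory.
        top_level = tar.getnames()[0].split("/")[0]
        tar.extractall(dest_dir)
    return os.path.join(dest_dir, top_level)

def looks_like_spark_home(path):
    """Return True if path contains the bin/spark-submit launcher script."""
    return os.path.isfile(os.path.join(path, "bin", "spark-submit"))

# Example (not executed here):
# home = extract_spark("spark-3.3.0-bin-hadoop3.tgz", "/usr/local/")
# print(home, looks_like_spark_home(home))
```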



In [None]:
!tar -xvzf spark-3.3.0-bin-hadoop3.tgz -C /usr/local/

## Set environment variables

### Subtask:
Set the `SPARK_HOME` and `PATH` environment variables to point to the extracted Spark directory.


**Reasoning**:
Set the SPARK_HOME and update the PATH environment variables.
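
Re-running a cell that appends to `PATH` adds the same entry again each time. A minimal sketch of an idempotent update is shown here; `append_to_path` is a hypothetical helper, not part of any library.

```python
import os

def append_to_path(path_value, new_entry):
    """Return path_value with new_entry appended, unless it is already present."""
    entries = [e for e in path_value.split(os.pathsep) if e]
    if new_entry not in entries:
        entries.append(new_entry)
    return os.pathsep.join(entries)

# Running this twice leaves PATH unchanged the second time.
spark_bin = os.path.join("/usr/local/spark-3.3.0-bin-hadoop3", "bin")
updated = append_to_path(os.environ.get("PATH", ""), spark_bin)
print(updated == append_to_path(updated, spark_bin))
```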



In [None]:
import os
os.environ['SPARK_HOME'] = '/usr/local/spark-3.3.0-bin-hadoop3'
os.environ['PATH'] = os.environ['PATH'] + ':' + os.environ['SPARK_HOME'] + '/bin'

## Verify installation

### Subtask:
Verify the installation by running a simple Spark command.


**Reasoning**:
The previous attempt to create a SparkSession failed. This is likely due to a conflict with an existing Spark context or an issue with the Java gateway. The current code will attempt to explicitly stop any existing Spark context before creating a new SparkSession, which might resolve the issue.



In [None]:
if 'spark' in locals() and spark is not None:
    spark.stop()

spark = SparkSession.builder.appName('colab-test').getOrCreate()
spark.range(5).show()

**Reasoning**:
The error indicates that `spark` refers to a function rather than a SparkSession object, so it has no `.stop()` method. This is likely a variable name conflict. To fix it, I will delete the `spark` variable before creating a new SparkSession, so that `spark` is bound to the object returned by `SparkSession.builder.appName('test').getOrCreate()`. Then I will run `spark.range(5).show()` to verify the installation.
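
An alternative to deleting the variable is to guard the call, so that `.stop()` only runs on objects that actually provide it. The sketch below uses duck typing and does not import PySpark; `stop_if_possible` is a hypothetical helper name.

```python
def stop_if_possible(obj):
    """Call obj.stop() only if obj has a callable stop attribute; return True if it was called."""
    stop = getattr(obj, "stop", None)
    if callable(stop):
        stop()
        return True
    return False

# Safe whether `spark` is a session, a function, or not defined at all.
print(stop_if_possible(globals().get("spark")))
```

Checking `isinstance(spark, SparkSession)` is the stricter variant of the same guard.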



In [None]:
if 'spark' in locals():
    del spark

spark = SparkSession.builder.appName('test').getOrCreate()
spark.range(5).show()

## Verify installation retry 1

### Subtask:
Verify the installation by running a simple Spark command.


**Reasoning**:
The previous attempt to create a SparkSession failed with a Java gateway error. The `JAVA_HOME` environment variable may not be set correctly, or a previous Spark instance may be conflicting. I will set `JAVA_HOME` explicitly, stop any existing Spark session, and then recreate the SparkSession.
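
Before pointing `JAVA_HOME` at a path, it can help to confirm that a JDK really exists there. The following is a minimal sketch with hypothetical helper names; it assumes the Debian/Ubuntu convention of installing JDKs under `/usr/lib/jvm`.

```python
import os

def java_home_is_valid(java_home):
    """Return True if java_home contains a bin/java executable file."""
    return bool(java_home) and os.path.isfile(os.path.join(java_home, "bin", "java"))

def list_installed_jdks(jvm_root="/usr/lib/jvm"):
    """Return the directories under jvm_root that contain bin/java."""
    if not os.path.isdir(jvm_root):
        return []
    return sorted(
        os.path.join(jvm_root, name)
        for name in os.listdir(jvm_root)
        if java_home_is_valid(os.path.join(jvm_root, name))
    )

print("JAVA_HOME valid:", java_home_is_valid(os.environ.get("JAVA_HOME")))
print("JDKs found:", list_installed_jdks())
```

If the list does not include the path assigned to `JAVA_HOME`, the Java gateway has no JVM to launch from that location.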



In [None]:
from pyspark.sql import SparkSession
import os

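# Set JAVA_HOME explicitly before creating the SparkSession.
# This path must point to a JDK that is actually installed; the earlier cell
# installed openjdk-8-jdk-headless, not Java 17.
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-17-openjdk-amd64"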

# Stop any existing Spark session
if 'spark' in locals() and isinstance(spark, SparkSession):
    spark.stop()
    print("Stopped existing Spark session.")

# Create a new SparkSession
spark = SparkSession.builder.appName('test').getOrCreate()

# Verify the installation by running a simple command
spark.range(5).show()

## Verify installation retry 2

### Subtask:
Verify the installation by running a simple Spark command.


**Reasoning**:
Based on the previous attempts, the Spark session creation is failing due to the Java gateway exiting. The previous code already set `JAVA_HOME` and tried to stop any existing Spark sessions. The next step is to retry creating the Spark session and running a simple command to see if the issue is resolved.



In [None]:
# Set JAVA_HOME explicitly again before creating the SparkSession
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-17-openjdk-amd64"

# Stop any existing Spark session
if 'spark' in locals() and isinstance(spark, SparkSession):
    spark.stop()
    print("Stopped existing Spark session.")

# Create a new SparkSession
spark = SparkSession.builder.appName('test').getOrCreate()

# Verify the installation by running a simple command
spark.range(5).show()

## Summary:

### Data Analysis Key Findings

*   Attempts to download Spark versions 3.5.1 and 3.5.0 from `dlcdn.apache.org` failed with "404 Not Found" errors.
*   Spark version 3.3.0 was successfully downloaded from the Apache archive server.
*   The downloaded `spark-3.3.0-bin-hadoop3.tgz` file was successfully extracted to `/usr/local/`.
*   The `SPARK_HOME` environment variable was set to `/usr/local/spark-3.3.0-bin-hadoop3`.
*   The Spark bin directory was successfully added to the `PATH` environment variable.
*   Verification of the Spark installation by creating a `SparkSession` failed repeatedly with a `PySparkRuntimeError: [JAVA_GATEWAY_EXITED]`, indicating an issue with the Java gateway process.
*   Explicitly setting the `JAVA_HOME` environment variable and stopping existing sessions did not resolve the `JAVA_GATEWAY_EXITED` error.

### Insights or Next Steps

*   The persistent `JAVA_GATEWAY_EXITED` error suggests an incompatibility or configuration problem between the Java environment in the Colab instance and the installed Spark version.
*   Check which Java version is installed in the Colab environment and whether it is compatible with Spark 3.3.0. Note that the notebook installs `openjdk-8-jdk-headless` but sets `JAVA_HOME` to a Java 17 path, so confirm that the path exists. Try a different Java version or configuration if necessary.
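
As a starting point for the Java investigation, the major version can be parsed from the text that `java -version` prints. This is a sketch under stated assumptions: `java_major_version` is a hypothetical helper, and the supported set reflects the Java versions listed in the Spark 3.3 documentation (8, 11, and 17) as I understand it.

```python
import re

# Assumption: Java versions listed as supported in the Spark 3.3 documentation.
SUPPORTED_JAVA_MAJORS = {8, 11, 17}

def java_major_version(version_output):
    """Extract the Java major version from `java -version` output, or None if not found."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first, second = int(match.group(1)), match.group(2)
    # Legacy scheme: "1.8.0_392" means Java 8.
    if first == 1 and second is not None:
        return int(second)
    return first

sample = 'openjdk version "1.8.0_392"'
major = java_major_version(sample)
print(major, major in SUPPORTED_JAVA_MAJORS)
```

In a notebook, the input string can be taken from the standard error stream of `subprocess.run(["java", "-version"], capture_output=True, text=True)`, since `java -version` writes its output to stderr.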
