
# SPARK 
- Reasons to Use Spark
- This notebook compares Spark and pandas on loading a CSV file and counting records per label

# NOTE 
- Spark is designed for distributed computing and is most effective on large datasets processed across a cluster of machines.
- Kaggle's environment provides a single machine rather than a cluster, so Spark runs without any distribution here. The only benefit it can offer is using all of that machine's CPU cores in parallel.
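
As a rough sketch of what single-machine use looks like: Spark's local mode is selected through a master URL such as `local[*]` (all cores) or `local[4]` (four cores). The helper names `local_master` and `make_local_session` are illustrative and not part of this notebook. The builder calls are standard PySpark, but whether Kaggle's environment defaults to `local[*]` is an assumption.

```python
def local_master(cores=None):
    """Return a Spark master URL for local (single-machine) mode.

    cores=None asks Spark to use every available core ("local[*]").
    """
    if cores is None:
        return "local[*]"
    if cores < 1:
        raise ValueError("cores must be a positive integer")
    return f"local[{cores}]"


def make_local_session(cores=None, app_name="Spark"):
    """Hypothetical helper: start a local-mode SparkSession (needs pyspark and Java)."""
    # Imported lazily so that local_master() works even without pyspark installed
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(local_master(cores))
        .appName(app_name)
        .getOrCreate()
    )


print(local_master())
print(local_master(4))
```

Calling `make_local_session()` would start a session that schedules tasks on all cores of the one machine. No cluster is involved.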

# SPARK ADVANTAGES

1. **Speed:** Spark can process data in memory, which reduces the need to write intermediate results to disk. This makes it well suited to iterative algorithms and interactive data analysis.

2. **Ease of Use:** Spark provides high-level APIs in languages such as Scala, Java, Python, and R, making it accessible to a wide range of users. It also offers built-in libraries for various tasks like SQL, machine learning (MLlib), graph processing (GraphX), and stream processing (Spark Streaming).

3. **Scalability:** Spark is designed for distributed computing, allowing it to scale horizontally across a cluster of machines. This makes it suitable for handling large datasets and processing tasks that would be challenging for single-node systems.

4. **Versatility:** Spark supports a variety of data processing scenarios, including batch processing, interactive queries, streaming analytics, and machine learning. This versatility makes it a preferred choice for organizations with diverse data analysis needs.

5. **Fault Tolerance:** Spark provides fault tolerance through lineage information and resilient distributed datasets (RDDs). If a node fails, Spark can recompute the lost data using the lineage information, ensuring the reliability of data processing.

6. **Integration with Big Data Ecosystem:** Spark seamlessly integrates with other big data tools and technologies, such as Hadoop Distributed File System (HDFS), Apache Hive, Apache HBase, and more. This allows users to leverage existing data storage and processing systems.

7. **Community Support:** Spark has a large and active open-source community. This means continuous development, improvements, and a wealth of resources, including documentation, forums, and tutorials.

8. **In-Memory Processing:** Spark can cache intermediate data in memory and reuse it across operations instead of rereading it from disk. Compared with traditional disk-based processing, this can significantly improve performance.
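
The lineage idea in point 5 can be illustrated with a deliberately simplified toy model in plain Python. `ToyDataset` is a hypothetical class, not Spark's RDD implementation. Each derived dataset remembers only its parent and the function that produced it, so a lost partition can be rebuilt by replaying that function on the parent's partition.

```python
class ToyDataset:
    """Toy stand-in for a partitioned dataset that tracks its lineage."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source      # list of partitions; only the root has one
        self.parent = parent      # dataset this one was derived from
        self.fn = fn              # function applied to each parent row
        self.materialized = {}    # partition index -> computed rows

    def map(self, fn):
        # Record the transformation; nothing is computed yet
        return ToyDataset(parent=self, fn=fn)

    def num_partitions(self):
        if self.parent is None:
            return len(self.source)
        return self.parent.num_partitions()

    def compute(self, index):
        # Rebuild the partition from lineage if it is not in memory
        if index not in self.materialized:
            if self.parent is None:
                rows = list(self.source[index])
            else:
                rows = [self.fn(row) for row in self.parent.compute(index)]
            self.materialized[index] = rows
        return self.materialized[index]

    def collect(self):
        return [row for i in range(self.num_partitions()) for row in self.compute(i)]


root = ToyDataset(source=[[1, 2], [3, 4]])
doubled = root.map(lambda x: x * 2)
print(doubled.collect())      # [2, 4, 6, 8]

del doubled.materialized[1]   # simulate losing partition 1
print(doubled.compute(1))     # [6, 8], recomputed from the parent
```

Only the lost partition is recomputed. The surviving partition stays in memory untouched.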


In [1]:
# Install PySpark
try:
    import pyspark
except ImportError:
    print("pyspark not found. Installing...")
    !pip install pyspark > pyspark.log.txt
    print("pyspark installed successfully!")

pyspark not found. Installing...
pyspark installed successfully!


In [2]:
# Import necessary libraries
from pyspark.sql import SparkSession
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from matplotlib.lines import Line2D
from matplotlib import cm
import numpy as np 
import pandas as pd
import seaborn as sns
import warnings
import timeit

# Suppress all warnings
warnings.filterwarnings("ignore")

In [3]:
# Create a Spark session
spark = SparkSession.builder.appName("Spark").getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/02/19 00:03:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [4]:
file_path  = "/kaggle/input/tabular-dataset-ready-for-malicious-url-detection/train_dataset.csv"

In [5]:
%%time

def load_csv_using_spark():
    df = spark.read.csv(file_path, header=True, inferSchema=True)

    # Perform the summary: count number of records grouped by a column
    summary_df = df.groupBy("label").count()
    
    # Perform an action: Spark evaluates lazily, so the aggregation only runs here
    summary_df.show()
    
# Measure the total execution time of 10 runs (timeit returns the sum, not the average)
execution_time = timeit.timeit(load_csv_using_spark, number=10)

# Print the result
print(f"Execution time using SPARK: {execution_time} seconds")

                                                                                

+-----+-------+
|label|  count|
+-----+-------+
|    1|1445673|
|    0|5283175|
+-----+-------+



                                                                                

Execution time using SPARK: 384.093118476 seconds
CPU times: user 300 ms, sys: 88.1 ms, total: 388 ms
Wall time: 6min 24s
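
The comment about lazy evaluation can be made concrete with a plain-Python analogy. This is not Spark code: a generator pipeline stands in for a chain of transformations, and `list(...)` plays the role of an action such as `show()`. The names `read_rows` and `events` are made up for the illustration.

```python
events = []  # records when work actually happens


def read_rows():
    """Stand-in for a data source; logs each row as it is read."""
    for value in range(3):
        events.append(f"read {value}")
        yield value


# Building the pipeline does no work yet, much like chaining groupBy(...).count()
pipeline = (value * 10 for value in read_rows())
print(events)    # []

# Consuming the pipeline is the "action" that triggers the reads
result = list(pipeline)
print(result)    # [0, 10, 20]
print(events)    # ['read 0', 'read 1', 'read 2']
```

One caveat for the timing above: with `inferSchema=True`, `spark.read.csv` scans the file once to determine column types before any action is called, so not all of the work in that cell is deferred.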


                                                                                

In [6]:
%%time

def load_csv_using_pandas():
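    data_df = pd.read_csv(file_path, delimiter=',')
    # Count non-null 'url_has_login' values per label (the pandas counterpart of the Spark groupBy)
    summary_df = data_df[['url_has_login','label']].groupby(['label']).count()
    # A bare expression inside a function displays nothing, so return the result instead
    return summary_df

# Measure the total execution time of 10 runs (timeit returns the sum, not the average)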
execution_time = timeit.timeit(load_csv_using_pandas, number=10)

# Print the result
print(f"Execution time using PANDAS: {execution_time} seconds")

Execution time using PANDAS: 267.5176722770002 seconds
CPU times: user 4min 6s, sys: 20.6 s, total: 4min 27s
Wall time: 4min 27s
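
Because `timeit.timeit(..., number=10)` returns the total for all 10 calls, a per-run figure makes the two results easier to compare. The sketch below only redoes the arithmetic on the totals printed above; the variable names are illustrative.

```python
RUNS = 10  # the `number` argument passed to timeit.timeit in both cells

# Totals copied from the printed output (seconds for all 10 runs)
totals = {
    "spark": 384.093118476,
    "pandas": 267.5176722770002,
}

per_run = {name: total / RUNS for name, total in totals.items()}
ratio = totals["spark"] / totals["pandas"]

for name, seconds in per_run.items():
    print(f"{name}: {seconds:.1f} s per run")
print(f"Spark took {ratio:.2f}x as long as pandas in this setup")
```

That works out to about 38.4 s per run for Spark and about 26.8 s for pandas. On this single machine, Spark was slower for this task, which is consistent with the note at the top about the missing distributed environment.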
