<a href="https://www.kaggle.com/code/dsptlp/spark?scriptVersionId=163345039" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# SPARK 
- Reasons to Use Spark
- This notebook compares Spark with pandas

# NOTE 
- Spark is designed to work in a distributed computing environment and is most effective when dealing with large datasets and clusters of machines. 
- In Kaggle's limited environment we do not get a distributed cluster, so the only benefit is that Spark can use all of the CPU cores and memory of the single machine.
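
As a minimal sketch of what using the whole machine can look like in local mode: Spark's `local[*]` master URL runs Spark on one machine with as many worker threads as logical cores, and `local[K]` pins it to `K` threads. The helper names `local_master_url` and `build_local_session` are hypothetical and not part of this notebook, which relies on the default master.

```python
import os


def local_master_url(cores=None):
    """Return a Spark master URL for single-machine (local) mode.

    cores=None -> "local[*]" (one worker thread per logical core)
    cores=K    -> "local[K]" (exactly K worker threads)
    """
    return "local[*]" if cores is None else f"local[{cores}]"


def build_local_session(app_name="Spark", cores=None):
    """Hypothetical helper: create a SparkSession bound to local mode."""
    # Imported lazily so the URL helper above works without pyspark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(local_master_url(cores))
        .appName(app_name)
        .getOrCreate()
    )


print("logical cores visible to Python:", os.cpu_count())
print("master URL for all cores:", local_master_url())
print("master URL for 4 threads:", local_master_url(4))
```

Calling `build_local_session()` would then return a session that schedules tasks across every core of the Kaggle machine, which is the only parallelism available here.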

# SPARK ADVANTAGES

1. **Speed:** Spark is known for its speed, as it can perform in-memory processing, reducing the need to write intermediate results to disk. This makes Spark well-suited for iterative algorithms and interactive data analysis.

2. **Ease of Use:** Spark provides high-level APIs in languages such as Scala, Java, Python, and R, making it accessible to a wide range of users. It also offers built-in libraries for various tasks like SQL, machine learning (MLlib), graph processing (GraphX), and stream processing (Spark Streaming).

3. **Scalability:** Spark is designed for distributed computing, allowing it to scale horizontally across a cluster of machines. This makes it suitable for handling large datasets and processing tasks that would be challenging for single-node systems.

4. **Versatility:** Spark supports a variety of data processing scenarios, including batch processing, interactive queries, streaming analytics, and machine learning. This versatility makes it a preferred choice for organizations with diverse data analysis needs.

5. **Fault Tolerance:** Spark provides fault tolerance through lineage information and resilient distributed datasets (RDDs). If a node fails, Spark can recompute the lost data using the lineage information, ensuring the reliability of data processing.

6. **Integration with Big Data Ecosystem:** Spark seamlessly integrates with other big data tools and technologies, such as Hadoop Distributed File System (HDFS), Apache Hive, Apache HBase, and more. This allows users to leverage existing data storage and processing systems.

7. **Community Support:** Spark has a large and active open-source community. This means continuous development, improvements, and a wealth of resources, including documentation, forums, and tutorials.

8. **In-Memory Processing:** Spark's ability to store intermediate data in memory rather than writing to disk can significantly improve performance, especially for iterative algorithms and interactive data analysis, compared to traditional disk-based processing.
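
The lineage idea in point 5 can be illustrated with a toy model in plain Python. This is only an analogy under simplifying assumptions, not Spark's implementation: `ToyDataset` is a hypothetical class that remembers how it was derived (its parent and the function applied), so a lost result can be rebuilt from the source instead of being restored from a backup copy.

```python
class ToyDataset:
    """Toy stand-in for an RDD: stores its derivation, not only its data."""

    def __init__(self, source=None, parent=None, fn=None):
        self._source = source    # raw data, set only on the root dataset
        self._parent = parent    # dataset this one was derived from
        self._fn = fn            # function applied to each parent element
        self._cache = None       # materialized result, may be lost

    def map(self, fn):
        # Record the transformation; nothing is computed yet.
        return ToyDataset(parent=self, fn=fn)

    def compute(self):
        # Rebuild from lineage whenever no materialized result is available.
        if self._cache is None:
            if self._parent is None:
                self._cache = list(self._source)
            else:
                self._cache = [self._fn(x) for x in self._parent.compute()]
        return self._cache

    def lose_data(self):
        # Simulate a node failure that drops the materialized result.
        self._cache = None


base = ToyDataset(source=[1, 2, 3])
squared = base.map(lambda x: x * x)

first = squared.compute()
squared.lose_data()             # "node failure"
recomputed = squared.compute()  # rebuilt by replaying the lineage

print(first, recomputed)
```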


In [1]:
# Install PySpark
try:
    import pyspark
except ImportError:
    print("pyspark not found. Installing...")
    !pip install pyspark > pyspark.log.txt
    print("pyspark installed successfully!")

pyspark not found. Installing...
pyspark installed successfully!


In [2]:
# Import necessary libraries
from pyspark.sql import SparkSession
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.decomposition import PCA
from yellowbrick.cluster import KElbowVisualizer
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler
from matplotlib.lines import Line2D
from matplotlib import cm
import numpy as np 
import pandas as pd
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
import warnings
import timeit

# Suppress all warnings
warnings.filterwarnings("ignore")

In [3]:
# Create a Spark session
spark = SparkSession.builder.appName("Spark").getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/02/18 20:00:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [4]:
file_path  = "/kaggle/input/tabular-dataset-ready-for-malicious-url-detection/train_dataset.csv"

In [5]:
%%time

df = spark.read.csv(file_path, header=True, inferSchema=True)

# Perform an action to trigger execution (Spark uses lazy evaluation)
df.show(10)

24/02/18 20:01:21 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


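
The lazy evaluation mentioned in the cell above can be mimicked with Python generators. This is an analogy only, not Spark code: building the pipeline does no work, and asking for a few results pulls just enough input to produce them. The names `load_rows`, `calls`, and `pipeline` are made up for the illustration.

```python
from itertools import islice

calls = []  # records every row that was actually produced


def load_rows():
    """Pretend data source; yields rows one at a time."""
    for i in range(1_000_000):
        calls.append(i)
        yield {"id": i, "label": i % 2}


# "Transformation": describes the filter, but nothing runs yet.
pipeline = (row for row in load_rows() if row["label"] == 1)
rows_read_before_action = len(calls)

# "Action": requesting three results forces just enough work to get them.
first_three = list(islice(pipeline, 3))

print("rows read before the action:", rows_read_before_action)
print("rows read after the action:", len(calls))
print("ids returned:", [row["id"] for row in first_three])
```

In the same spirit, `df.show(10)` does not need to process the whole CSV, which matters when its timing is compared with a full pandas load.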

In [6]:
%%time

data_df = pd.read_csv(file_path, delimiter=',') 
data_df.head(10)

CPU times: user 47.3 s, sys: 7.92 s, total: 55.3 s
Wall time: 55.3 s


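| Unnamed: 0 | url | label | source | url_has_login | url_has_client | url_has_server | url_has_admin | url_has_ip | url_isshorted | url_len | ... | pdomain_count_hyphen | pdomain_count_atrate | pdomain_count_non_alphanum | pdomain_count_digit | tld_len | tld | tld_is_sus | pdomain_min_distance | subdomain_len | subdomain_count_dot |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | irs-profilepaymentservice.com/home | 1 | phishtank | 0 | 0 | 0 | 0 | 0 | 0 | 34 | ... | 0 | 0 | 0 | 0 | 3 | com | 0 | 17 | 0 | 0 |
| 1 | cpuggsukabumi.id | 0 | majestic_million | 0 | 0 | 0 | 0 | 0 | 0 | 16 | ... | 0 | 0 | 0 | 0 | 2 | id | 1 | 10 | 0 | 0 |
| 2 | members.tripod.com/~don_rc/ring.htm | 0 | data_clean_test_mendel | 0 | 0 | 0 | 0 | 0 | 0 | 35 | ... | 0 | 0 | 0 | 0 | 3 | com | 0 | 2 | 7 | 0 |
| 3 | optuswebmailadminprovider.weebly.com/ | 1 | phishtank | 0 | 0 | 0 | 1 | 0 | 0 | 37 | ... | 0 | 0 | 0 | 0 | 3 | com | 0 | 3 | 25 | 0 |
| 4 | topoz.com.pl | 0 | dmoz_harvard | 0 | 0 | 0 | 0 | 0 | 0 | 12 | ... | 0 | 0 | 0 | 0 | 6 | com.pl | 0 | 3 | 0 | 0 |
| 5 | akopos.lt | 0 | dmoz_harvard | 0 | 0 | 0 | 0 | 0 | 0 | 9 | ... | 0 | 0 | 0 | 0 | 2 | lt | 0 | 3 | 0 | 0 |
| 6 | paha.org.uk | 0 | majestic_million | 0 | 0 | 0 | 0 | 0 | 0 | 11 | ... | 0 | 0 | 0 | 0 | 6 | org.uk | 0 | 3 | 0 | 0 |
| 7 | edwardsandlien.com/ | 0 | data_clean_train_mendel | 0 | 0 | 0 | 0 | 0 | 0 | 19 | ... | 0 | 0 | 0 | 0 | 3 | com | 0 | 9 | 0 | 0 |
| 8 | vercontracheque.com.br | 0 | alexatop1m | 0 | 0 | 0 | 0 | 0 | 0 | 22 | ... | 0 | 0 | 0 | 0 | 6 | com.br | 0 | 10 | 0 | 0 |
| 9 | centreforcomposers.org | 0 | domcop | 0 | 0 | 0 | 0 | 0 | 0 | 22 | ... | 0 | 0 | 0 | 0 | 3 | org | 0 | 12 | 0 | 0 |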
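
The two `%%time` cells above do not measure the same work: `pd.read_csv` loads the whole file into memory, while `df.show(10)` only needs enough rows to print ten, and `inferSchema=True` adds an extra pass over the data to infer column types. A like-for-like comparison would time an action that touches every row, such as `count()`. The sketch below is one way to set that up. `timed` is a hypothetical helper, and the Spark and pandas calls are left as comments because they assume this notebook's `spark` session and `file_path`.

```python
import time


def timed(label, fn):
    """Run fn(), print its wall-clock time, and return (result, seconds)."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f} s")
    return result, elapsed


# Stand-in workload so the helper can be tried without Spark or the dataset.
total, seconds = timed("sum of 0..999", lambda: sum(range(1000)))

# With the notebook's objects, both sides can be forced to scan every row:
# n_spark, t_spark = timed(
#     "spark full scan",
#     lambda: spark.read.csv(file_path, header=True, inferSchema=True).count(),
# )
# n_pandas, t_pandas = timed(
#     "pandas full load",
#     lambda: len(pd.read_csv(file_path)),
# )
```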


In [7]:
# Create a simple DataFrame and display it
data = [("John", 25), ("Alice", 30), ("Bob", 22)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)

df.show()


+-----+---+
| Name|Age|
+-----+---+
| John| 25|
|Alice| 30|
|  Bob| 22|
+-----+---+


