
# SPARK 
- Reasons to use Spark
- This notebook compares Spark with pandas on loading a CSV file and training a random forest

# NOTE 
- Spark is designed for distributed computing and is most effective on large datasets spread across a cluster of machines.
- Kaggle's environment is a single machine, so nothing in this notebook runs on a cluster. The only benefit Spark offers here is that it can use all of the machine's resources, for example every CPU core.
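
On a single machine, Spark's parallelism is controlled through the master URL. The sketch below assumes the session is created with an explicit master, which this notebook does not do (it relies on the default). `local_master_url` is a hypothetical helper written for illustration, not part of PySpark.

```python
import os


def local_master_url(cores=None):
    """Build a Spark master URL for single-machine (local) mode.

    cores=None -> "local[*]", which asks Spark to use every logical core.
    cores=N    -> "local[N]", which caps Spark at N worker threads.
    """
    if cores is None:
        return "local[*]"
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return f"local[{cores}]"


print("Logical CPUs visible to Python:", os.cpu_count())
print("Master URL:", local_master_url())

# Assumed usage with PySpark (not executed in this sketch):
# spark = SparkSession.builder.master(local_master_url()).appName("Spark").getOrCreate()
```

Passing a number instead of `None` can be useful when the machine is shared and Spark should not take every core.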

# SPARK ADVANTAGES

1. **Speed:** Spark is known for its speed, as it can perform in-memory processing, reducing the need to write intermediate results to disk. This makes Spark well-suited for iterative algorithms and interactive data analysis.

2. **Ease of Use:** Spark provides high-level APIs in languages such as Scala, Java, Python, and R, making it accessible to a wide range of users. It also offers built-in libraries for various tasks like SQL, machine learning (MLlib), graph processing (GraphX), and stream processing (Spark Streaming).

3. **Scalability:** Spark is designed for distributed computing, allowing it to scale horizontally across a cluster of machines. This makes it suitable for handling large datasets and processing tasks that would be challenging for single-node systems.

4. **Versatility:** Spark supports a variety of data processing scenarios, including batch processing, interactive queries, streaming analytics, and machine learning. This versatility makes it a preferred choice for organizations with diverse data analysis needs.

5. **Fault Tolerance:** Spark provides fault tolerance through lineage information and resilient distributed datasets (RDDs). If a node fails, Spark can recompute the lost data using the lineage information, ensuring the reliability of data processing.

6. **Integration with Big Data Ecosystem:** Spark seamlessly integrates with other big data tools and technologies, such as Hadoop Distributed File System (HDFS), Apache Hive, Apache HBase, and more. This allows users to leverage existing data storage and processing systems.

7. **Community Support:** Spark has a large and active open-source community. This means continuous development, improvements, and a wealth of resources, including documentation, forums, and tutorials.

8. **In-Memory Processing:** Spark's ability to store intermediate data in memory rather than writing to disk can significantly improve performance, especially for iterative algorithms and interactive data analysis, compared to traditional disk-based processing.
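
The fault-tolerance idea in item 5 can be illustrated without Spark. The toy class below is not Spark's implementation and `ToyDataset` is a made-up name. It only shows the principle: each dataset remembers its parent and the transformation that produced it, so a lost in-memory result can be rebuilt on demand.

```python
class ToyDataset:
    """Toy stand-in for an RDD: it remembers how to rebuild itself from its parent."""

    def __init__(self, source=None, parent=None, transform=None):
        self._source = source        # root data, set only for the first dataset
        self._parent = parent        # lineage: the dataset this one derives from
        self._transform = transform  # lineage: function applied to the parent's rows
        self._cache = None           # materialized result, which may be lost

    def map(self, fn):
        # Lazy: record the step in the lineage, compute nothing yet.
        return ToyDataset(parent=self, transform=lambda rows: [fn(r) for r in rows])

    def collect(self):
        # Compute (or recompute) the result only when it is not in memory.
        if self._cache is None:
            if self._parent is None:
                self._cache = list(self._source)
            else:
                self._cache = self._transform(self._parent.collect())
        return self._cache

    def lose_cache(self):
        # Simulate a node failure that wipes the in-memory result.
        self._cache = None


base = ToyDataset(source=[1, 2, 3])
doubled = base.map(lambda x: x * 2)

first = doubled.collect()      # computed from the lineage
doubled.lose_cache()           # simulated failure
recovered = doubled.collect()  # recomputed from the same lineage

print(first, recovered)
```

The same structure also hints at why in-memory processing helps: once a result is cached, repeated `collect()` calls skip the recomputation.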


In [8]:
# Install PySpark
try:
    import pyspark
except ImportError:
    print("pyspark not found. Installing...")
    !pip install pyspark > pyspark.log.txt
    print("pyspark installed successfully!")

In [9]:
# Import necessary libraries
from pyspark.sql import SparkSession
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from matplotlib.lines import Line2D
from matplotlib import cm
import numpy as np 
import pandas as pd
import seaborn as sns
import warnings
import timeit

# Spark ML imports. The classifier is aliased because scikit-learn exports a class with the same name.
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier as SparkRandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# scikit-learn imports
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Suppress all warnings
warnings.filterwarnings("ignore")

In [10]:
# Create a Spark session
spark = SparkSession.builder.appName("Spark").getOrCreate()

# Set log level to OFF 
spark.sparkContext.setLogLevel("OFF")

In [11]:
file_path  = "/kaggle/input/tabular-dataset-ready-for-malicious-url-detection/train_dataset.csv"

In [12]:
measures = []

# LOADING DATA

## SPARK

In [14]:
%%time

def load_csv_using_spark():
    df = spark.read.csv(file_path, header=True, inferSchema=True)

    # Summary: count the number of records per label
    summary_df = df.groupBy("label").count()
    
    # collect() is an action, so it triggers execution (Spark evaluates transformations lazily)
    summary_df.collect()  # show() would also trigger execution
    
# Measure the total execution time of 10 runs
execution_time = timeit.timeit(load_csv_using_spark, number=10)

# Print the result
print(f"Execution time using SPARK: {execution_time} seconds")
measures.append(('SPARK','load_csv',execution_time))



Execution time using SPARK: 603.830106264 seconds
CPU times: user 386 ms, sys: 92.4 ms, total: 478 ms
Wall time: 10min 3s
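
`timeit.timeit(..., number=10)` returns the total time for all 10 calls, not the time of one call. A possible refinement, sketched below, repeats the batch several times and keeps the fastest one, which reduces noise from other processes. `best_seconds_per_run` and `cheap_stand_in` are hypothetical names, and the stand-in workload replaces the CSV load so the sketch runs anywhere.

```python
import timeit


def best_seconds_per_run(fn, number=10, repeat=3):
    """Return the per-call average of the fastest batch of `number` calls."""
    # timeit.repeat returns one total (in seconds) per batch.
    totals = timeit.repeat(fn, number=number, repeat=repeat)
    return min(totals) / number


def cheap_stand_in():
    # Placeholder workload; in the notebook this would be load_csv_using_spark.
    sum(range(1000))


per_run = best_seconds_per_run(cheap_stand_in, number=100, repeat=3)
print(f"Best average per run: {per_run:.6f} s")
```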



## PANDAS

In [None]:
%%time

def load_csv_using_pandas():
    data_df = pd.read_csv(file_path, delimiter=',')
    # Summary: count the records per label (pandas executes eagerly, so no explicit action is needed)
    summary_df = data_df[['url_has_login','label']].groupby(['label']).count()
    
# Measure the total execution time of 10 runs
execution_time = timeit.timeit(load_csv_using_pandas, number=10)

# Print the result
print(f"Execution time using PANDAS: {execution_time} seconds")
measures.append(('PANDAS','load_csv',execution_time))

# FORMAT DATA

## SPARK

## PANDAS

In [None]:
data_df = pd.read_csv(file_path)  # data_df was local to load_csv_using_pandas(), so reload it here
data_subset_df = data_df[['label','url_has_login','url_has_client','url_has_server','url_len']]

# RANDOMFOREST

## SPARK

In [None]:
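# Example: binary classification with a random forest
# Aliased import: the scikit-learn RandomForestClassifier imported earlier shadows the Spark class of the same name
from pyspark.ml.classification import RandomForestClassifier as SparkRandomForestClassifier

# The DataFrame built inside load_csv_using_spark() is local to that function, so load it again here
df = spark.read.csv(file_path, header=True, inferSchema=True)

# "feature1", "feature2", ... are placeholders: replace them with the actual feature columns
assembler = VectorAssembler(inputCols=["feature1", "feature2", ...], outputCol="features")

# Split the data into training and testing sets.
# The assembler runs as a pipeline stage, so it must not be applied to df beforehand
# (a second pass would fail because the "features" column would already exist).
train_data, test_data = df.randomSplit([0.8, 0.2], seed=42)

rf = SparkRandomForestClassifier(labelCol="label", featuresCol="features")

pipeline = Pipeline(stages=[assembler, rf])

# Hyperparameter grid: 3 values of numTrees x 3 values of maxDepth
paramGrid = (ParamGridBuilder()
             .addGrid(rf.numTrees, [10, 20, 30])
             .addGrid(rf.maxDepth, [5, 10, 15])
             .build())

evaluator = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC")

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=5)

# Fit every grid combination with 5-fold cross-validation and keep the best model
cv_model = crossval.fit(train_data)

# Score the held-out test set with the best model
predictions = cv_model.transform(test_data)

area_under_roc = evaluator.evaluate(predictions)
print(f"Area under ROC: {area_under_roc}")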
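
The grid search above is considerably more work than a single model fit, which matters when its runtime is compared with another library. The arithmetic can be sketched in plain Python; the grid values and fold count are copied from the cell above, and the variable names are chosen here for illustration.

```python
from itertools import product

num_trees_grid = [10, 20, 30]
max_depth_grid = [5, 10, 15]
num_folds = 5

# Every (numTrees, maxDepth) pair is one candidate model.
combos = list(product(num_trees_grid, max_depth_grid))

# Each candidate is trained once per fold during cross-validation.
cv_fits = len(combos) * num_folds

# CrossValidator then refits the best candidate on the full training set.
total_fits = cv_fits + 1

print(f"{len(combos)} candidates x {num_folds} folds = {cv_fits} fits, plus 1 refit = {total_fits}")
```

A fair timing comparison would give the scikit-learn side the same grid, or reduce the Spark side to a single parameter setting.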

## PANDAS

In [None]:
# cross_val_score is not among the imports at the top of the notebook
from sklearn.model_selection import cross_val_score

# Load the data with pandas, then prepare it (handle missing values, convert categorical features, etc.)
data_df = pd.read_csv(file_path)

# Split the data into features (X) and target variable (y); "label" is the target column used throughout this notebook
X = data_df.drop("label", axis=1)
y = data_df["label"]

# Feature Engineering (if needed)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build the RandomForest model (this name refers to the scikit-learn class, which was imported last)
rf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)

# Perform cross-validation
cv_scores = cross_val_score(rf, X_train, y_train, cv=5, scoring='roc_auc')

# Fit the model on the full training set
rf.fit(X_train, y_train)

# Score the test set with class-1 probabilities: ROC AUC ranks scores, and hard 0/1 predictions discard that ranking
probabilities = rf.predict_proba(X_test)[:, 1]

# Evaluate the model
area_under_roc = roc_auc_score(y_test, probabilities)
print(f"Area under ROC on test set: {area_under_roc}")

# Display cross-validation scores
print("Cross-Validation Scores:", cv_scores)
print(f"Mean Cross-Validation Score: {cv_scores.mean()}")
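
ROC AUC is a ranking metric, so it should be computed from continuous scores rather than thresholded 0/1 labels. Spark's `BinaryClassificationEvaluator` reads the `rawPrediction` column by default, so the scikit-learn side needs probabilities for the two numbers to be comparable. The pure-Python sketch below shows how much thresholding can change the result. The labels and scores are made up for illustration, and `auc_from_scores` is a hypothetical helper, not a scikit-learn function.

```python
def auc_from_scores(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities, for illustration only.
y_true = [0, 0, 0, 1, 1, 1]
probs = [0.1, 0.2, 0.7, 0.4, 0.8, 0.9]

# Thresholding at 0.5 collapses the scores to hard class labels.
hard = [1 if p >= 0.5 else 0 for p in probs]

auc_probs = auc_from_scores(y_true, probs)
auc_hard = auc_from_scores(y_true, hard)

print(f"AUC from probabilities: {auc_probs:.3f}")
print(f"AUC from hard labels:   {auc_hard:.3f}")
```

In this toy case the probabilities give 8/9 and the hard labels give 6/9, because thresholding turns many ordered pairs into ties.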

# FINAL RESULTS

In [None]:
print(measures)
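
Each entry in `measures` holds a total over 10 runs, so a per-run view may be easier to read. A minimal sketch, assuming the `(engine, step, total_seconds)` tuple layout used above; `per_run_table` is a hypothetical helper. Only the Spark total is taken from this notebook's recorded output, because no pandas figure was recorded.

```python
def per_run_table(measures, runs=10):
    """Convert (engine, step, total_seconds) tuples into per-run averages."""
    return {(engine, step): total / runs for engine, step, total in measures}


# The Spark total comes from the recorded cell output; no pandas value is invented here.
recorded = [("SPARK", "load_csv", 603.830106264)]

table = per_run_table(recorded)
for (engine, step), seconds in table.items():
    print(f"{engine:<8}{step:<12}{seconds:8.1f} s/run")
```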