<a href="https://colab.research.google.com/github/honyango/test/blob/main/SPARK_MAPREDDUCE_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Step 1: Install Java, Spark, and findspark

In [5]:
# Install Java
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

# Download and install Spark
!wget -q https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
!tar xf spark-3.5.0-bin-hadoop3.tgz

# Install required Python package
!pip install -q findspark

Step 2: Set Environment Variables

In [3]:
import os

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.5.0-bin-hadoop3"
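A quick check that both variables point to real directories can catch a failed download or a changed JDK path before `findspark.init()` runs. The helper below is a minimal sketch; `missing_env_paths` is a hypothetical name, not part of the notebook or of findspark.

```python
import os

def missing_env_paths(names=("JAVA_HOME", "SPARK_HOME"), env=None):
    """Return the variable names that are unset or do not point to a directory."""
    env = os.environ if env is None else env
    return [name for name in names if not os.path.isdir(env.get(name, ""))]

missing = missing_env_paths()
if missing:
    print("Check these variables before calling findspark.init():", missing)
else:
    print("JAVA_HOME and SPARK_HOME both point to existing directories.")
```

An empty list means both paths exist; it does not prove that the directories contain a working Java or Spark installation.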


Step 3: Initialize Spark

In [7]:
import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg, count, desc

# Start Spark session
spark = SparkSession.builder.appName("job_skills").getOrCreate()


Step 4: Upload Sample Dataset

In [9]:
from google.colab import files
uploaded = files.upload()


Saving linkedin_job_postings.csv to linkedin_job_postings (1).csv
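The message above shows that Colab renamed the upload to `linkedin_job_postings (1).csv` because a file with the original name already existed. Rather than hard-coding a path, the name can be taken from the value returned by `files.upload()`. The sketch below assumes that this value is a dict keyed by the saved file name; `latest_uploaded_name` and `uploaded_example` are hypothetical names used only for illustration.

```python
def latest_uploaded_name(uploaded):
    """Return the last file name in the mapping returned by files.upload()."""
    if not uploaded:
        raise ValueError("No file was uploaded")
    return list(uploaded)[-1]

# Stand-in for the dict returned by files.upload(); the bytes are made up.
uploaded_example = {"linkedin_job_postings (1).csv": b"job_link,job_title\n"}

csv_path = latest_uploaded_name(uploaded_example)
print(csv_path)
```

In the notebook, `latest_uploaded_name(uploaded)` would take the place of the stand-in dict, and `csv_path` could be passed to `spark.read.csv`.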


Step 5: Load and Process the Data with Spark

In [11]:
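# Read the uploaded job postings CSV into a Spark DataFrame.
# The file name matches the upload in Step 4; adjust it if Colab saved a renamed copy.
df = spark.read.csv("linkedin_job_postings.csv", header=True, inferSchema=True)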

# Show schema
df.printSchema()

# Display first few rows
df.show()


root
 |-- job_link: string (nullable = true)
 |-- last_processed_time: string (nullable = true)
 |-- got_summary: string (nullable = true)
 |-- got_ner: string (nullable = true)
 |-- is_being_worked: string (nullable = true)
 |-- job_title: string (nullable = true)
 |-- company: string (nullable = true)
 |-- job_location: string (nullable = true)
 |-- first_seen: string (nullable = true)
 |-- search_city: string (nullable = true)
 |-- search_country: string (nullable = true)
 |-- search_position: string (nullable = true)
 |-- job_level: string (nullable = true)
 |-- job_type: string (nullable = true)

+--------------------+--------------------+-----------+-------+---------------+--------------------+--------------------+--------------------+----------+---------------+--------------+--------------------+----------+--------+
|            job_link| last_processed_time|got_summary|got_ner|is_being_worked|           job_title|             company|        job_location|first_seen|    search_c
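The schema shows every column as a string, including the `t`/`f` flags and the `first_seen` dates. The plain-Python sketch below illustrates the conversions that would give these values proper types; `parse_flag` and `parse_day` are hypothetical helpers, and the sample values are the ones visible in the notebook output.

```python
from datetime import date

def parse_flag(value):
    """Map the 't'/'f' strings seen in the output to booleans; anything else becomes None."""
    return {"t": True, "f": False}.get(value)

def parse_day(value):
    """Parse a 'YYYY-MM-DD' string such as first_seen into a date."""
    return date.fromisoformat(value)

print(parse_flag("t"), parse_flag("f"), parse_day("2024-01-15"))
```

On the Spark side, a cast or `pyspark.sql.functions.to_date` would play the same role; which columns need it depends on the analysis.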

Step 6: Perform Data Transformations (EDA)

a. Filter Jobs by Level (Mid senior)

In [21]:
# Keep only postings whose job_level is exactly "Mid senior"
mid_senior_jobs = df.filter(col("job_level") == "Mid senior")
print("Jobs with Mid senior job level:")
mid_senior_jobs.show()

Jobs with Mid senior job level:
+--------------------+--------------------+-----------+-------+---------------+--------------------+--------------------+--------------------+----------+---------------+--------------+--------------------+----------+--------+
|            job_link| last_processed_time|got_summary|got_ner|is_being_worked|           job_title|             company|        job_location|first_seen|    search_city|search_country|     search_position| job_level|job_type|
+--------------------+--------------------+-----------+-------+---------------+--------------------+--------------------+--------------------+----------+---------------+--------------+--------------------+----------+--------+
|https://www.linke...|2024-01-21 07:12:...|          t|      t|              f|Account Executive...|                  BD|       San Diego, CA|2024-01-15|       Coronado| United States|         Color Maker|Mid senior|  Onsite|
|https://www.linke...|2024-01-21 07:39:...|          t|      t| 

b. Job Posting Count by Job Type

In [17]:
avg_by_type = df.groupBy("job_type").agg(count("*").alias("count"))
print("Count by Job Type")
avg_by_type.show()

Count by Job Type
+--------------------+-------+
|            job_type|  count|
+--------------------+-------+
|              Remote|   4259|
|              Onsite|1337569|
|              Driver|      1|
|                NULL|     69|
|              Hybrid|   6562|
|          Mid senior|     16|
|           Associate|     11|
|Radiation-Therapy...|      1|
+--------------------+-------+
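A few `job_type` values in this table, such as `Mid senior`, also appear as `job_level` values in the earlier filter output, which suggests that some rows were split into the wrong columns. One plausible cause, stated here as an assumption and not confirmed by the notebook, is a field containing a quoted comma or line break. The sketch below uses Python's standard `csv` module and a made-up row to show how ignoring quotes shifts every later field by one column.

```python
import csv
import io

# Made-up row: the job_title field contains a comma inside quotes.
header = "job_title,job_level,job_type"
row = '"Account Executive, Enterprise",Mid senior,Onsite'

# A naive split ignores the quotes, so every later field shifts one column right.
naive = dict(zip(header.split(","), row.split(",")))

# A quote-aware parser keeps the title in a single field.
parsed = next(csv.DictReader(io.StringIO(header + "\n" + row + "\n")))

print("naive job_type: ", naive["job_type"])
print("parsed job_type:", parsed["job_type"])
```

In Spark, the CSV reader's `escape` and `multiLine` options govern how quoted fields are handled. Whether changing them removes the stray values in this dataset has not been tested here.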



c. Top Locations by Job Posting Count

In [19]:
top_locations = df.groupBy("job_location").agg(count("*").alias("transaction_count")) \
                  .orderBy(desc("transaction_count"))
print("Top Locations by Transaction Count")
top_locations.show()

Top Locations by Transaction Count
+--------------------+-----------------+
|        job_location|transaction_count|
+--------------------+-----------------+
|        New York, NY|            13435|
|London, England, ...|            11756|
|         Houston, TX|            10999|
|         Chicago, IL|            10790|
|     Los Angeles, CA|            10262|
|         Atlanta, GA|             9208|
|          Boston, MA|             8403|
|          Dallas, TX|             8005|
|          Austin, TX|             7678|
|      Washington, DC|             7322|
|         Phoenix, AZ|             7058|
|       San Diego, CA|             6883|
|       Charlotte, NC|             6774|
|    Philadelphia, PA|             6685|
|         Seattle, WA|             6584|
|     San Antonio, TX|             6419|
|   San Francisco, CA|             6014|
|           Tampa, FL|             5896|
|        Columbus, OH|             5798|
|       Las Vegas, NV|             5691|
+--------------------+-----------------+
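To put these counts in context, the arithmetic below relates the top location to the total row count. It is a small worked example that uses only numbers printed in the two tables above, and it assumes both tables were computed from the same `df`.

```python
# Counts copied from the "Count by Job Type" output above.
job_type_counts = [4259, 1337569, 1, 69, 6562, 16, 11, 1]
total_rows = sum(job_type_counts)

# Top location count copied from the "Top Locations" output above.
new_york_count = 13435
new_york_pct = 100 * new_york_count / total_rows

print(f"Total rows: {total_rows}")
print(f"New York, NY share of all rows: {new_york_pct:.2f}%")
```

Even the most frequent location accounts for about one percent of the rows, so the postings are spread across many locations.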

Step 7: Save and Download Results

In [None]:
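# Convert the job type counts (a small aggregated DataFrame) to pandas and download them.
# avg_by_type holds counts per job_type, so the file name reflects that.
pandas_df = avg_by_type.toPandas()
pandas_df.to_csv("count_by_job_type.csv", index=False)

files.download("count_by_job_type.csv")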
