# Duplicate Job Listings

## LinkedIn SQL Interview Question

### Question

Assume you're given a table containing job postings from various companies on the LinkedIn platform. Write a query to retrieve the count of companies that have posted duplicate job listings.

---

### Definition:

Duplicate job listings are defined as two job listings within the same company that share identical titles and descriptions.
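
To make the definition concrete, here is a minimal pure-Python sketch. The rows in `listings` are hypothetical and are not taken from the example table. It treats `(company_id, title, description)` as the grouping key and flags any key that occurs more than once.

```python
from collections import Counter

# Hypothetical rows: (job_id, company_id, title, description)
listings = [
    (1, 10, "Data Analyst", "Reviews data"),
    (2, 10, "Data Analyst", "Reviews data"),       # duplicate of job 1
    (3, 10, "Data Analyst", "Builds dashboards"),  # same title, different description
    (4, 20, "Data Analyst", "Reviews data"),       # same text, different company
]

# Count listings per (company_id, title, description) key.
counts = Counter(
    (company_id, title, description)
    for _, company_id, title, description in listings
)

# A key seen more than once marks a set of duplicate listings.
duplicate_keys = [key for key, n in counts.items() if n > 1]
print(duplicate_keys)
```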

---

### Table: `job_listings`

| Column Name   | Type       |
|---------------|------------|
| job_id        | integer    |
| company_id    | integer    |
| title         | string     |
| description   | string     |

---

### Example Input for `job_listings` Table:

| job_id | company_id | title          | description                                                                 |
|--------|------------|----------------|-----------------------------------------------------------------------------|
| 248    | 827        | Business Analyst | Business analyst evaluates past and current business data with the primary goal of improving decision-making processes within organizations. |
| 149    | 845        | Business Analyst | Business analyst evaluates past and current business data with the primary goal of improving decision-making processes within organizations. |
| 945    | 345        | Data Analyst   | Data analyst reviews data to identify key insights into a business's customers and ways the data can be used to solve problems. |
| 164    | 345        | Data Analyst   | Data analyst reviews data to identify key insights into a business's customers and ways the data can be used to solve problems. |
| 172    | 244        | Data Engineer  | Data engineer works in a variety of settings to build systems that collect, manage, and convert raw data into usable information for data scientists and business analysts to interpret. |

---

### Example Output:

| duplicate_companies |
|---------------------|
| 1                   |

---

### Explanation

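Only one company, ID **345**, posted duplicate job listings. Its two listings, job IDs **945** and **164**, have identical titles and descriptions. Companies **827** and **845** share the same "Business Analyst" title and description, but those listings belong to different companies, so they do not count as duplicates.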
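
One possible SQL formulation can be checked locally with Python's built-in `sqlite3` module. This is a sketch under assumptions, not a reference answer: the in-memory database and the shortened description strings are placeholders, and the SQL dialect used by the interview platform may differ from SQLite.

```python
import sqlite3

# Assumption: an in-memory SQLite table stands in for job_listings.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE job_listings ("
    "job_id INTEGER, company_id INTEGER, title TEXT, description TEXT)"
)

# Same ids and titles as the example input; descriptions are shortened placeholders.
rows = [
    (248, 827, "Business Analyst", "business analyst description"),
    (149, 845, "Business Analyst", "business analyst description"),
    (945, 345, "Data Analyst", "data analyst description"),
    (164, 345, "Data Analyst", "data analyst description"),
    (172, 244, "Data Engineer", "data engineer description"),
]
conn.executemany("INSERT INTO job_listings VALUES (?, ?, ?, ?)", rows)

# Inner query: one row per (company, title, description) group with 2+ listings.
# Outer query: count each company once, even if it has several duplicate groups.
query = """
SELECT COUNT(DISTINCT company_id) AS duplicate_companies
FROM (
    SELECT company_id
    FROM job_listings
    GROUP BY company_id, title, description
    HAVING COUNT(*) > 1
) AS dup
"""
duplicate_companies = conn.execute(query).fetchone()[0]
print(duplicate_companies)  # prints 1
conn.close()
```

`COUNT(DISTINCT company_id)` matters when a company has more than one set of duplicated listings: without `DISTINCT`, that company would be counted once per duplicated group.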


In [1]:
from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder.appName("JobListings").getOrCreate()
sc = spark.sparkContext

# Build an RDD of (job_id, company_id, title, description) tuples for job_listings
df = sc.parallelize([
    (248, 827, "Business Analyst", "Business analyst evaluates past and current business data with the primary goal of improving decision-making processes within organizations."),
    (149, 845, "Business Analyst", "Business analyst evaluates past and current business data with the primary goal of improving decision-making processes within organizations."),
    (945, 345, "Data Analyst", "Data analyst reviews data to identify key insights into a business's customers and ways the data can be used to solve problems."),
    (164, 345, "Data Analyst", "Data analyst reviews data to identify key insights into a business's customers and ways the data can be used to solve problems."),
    (172, 244, "Data Engineer", "Data engineer works in a variety of settings to build systems that collect, manage, and convert raw data into usable information for data scientists and business analysts to interpret.")
])


# Convert the RDD to a DataFrame (default column names _1.._4) and display it
df.toDF().show(truncate=False)


+---+---+----------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|_1 |_2 |_3              |_4                                                                                                                                                                                      |
+---+---+----------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|248|827|Business Analyst|Business analyst evaluates past and current business data with the primary goal of improving decision-making processes within organizations.                                            |
|149|845|Business Analyst|Business analyst evaluates past and current business data with the primary goal of improving decision-making processes within 

In [None]:
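# Key by (company_id, title, description), count listings per key, keep counts > 1
df.map(lambda x: ((x[1], x[2], x[3]), 1))\
  .reduceByKey(lambda a, b: a + b)\
  .filter(lambda kv: kv[1] > 1).map(lambda kv: kv[0][0]).distinct().count()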