# Data Science Skills

## LinkedIn SQL Interview Question

### Question

Given a table of candidates and their skills, you're tasked with finding the candidates best suited for an open Data Science job.  
You want to find candidates who are proficient in **Python**, **Tableau**, and **PostgreSQL**.

Write a query to list the candidates who possess **all of the required skills** for the job.  
Sort the output by `candidate_id` in ascending order.

**Assumption:**  
There are no duplicates in the `candidates` table.

---

### Table: `candidates`

| Column Name   | Type     |
|---------------|----------|
| candidate_id  | integer  |
| skill         | varchar  |

---

### Example Input:

| candidate_id | skill       |
|--------------|-------------|
| 123          | Python      |
| 123          | Tableau     |
| 123          | PostgreSQL  |
| 234          | R           |
| 234          | PowerBI     |
| 234          | SQL Server  |
| 345          | Python      |
| 345          | Tableau     |

---

### Example Output:

| candidate_id |
|--------------|
| 123          |

---

### Explanation

Candidate **123** is displayed because they have all the required skills:  
**Python**, **Tableau**, and **PostgreSQL**.

Candidate **345** is not included because they are missing **PostgreSQL**.
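
Since the question asks for a query, here is a minimal sketch of one possible SQL answer, run from Python against an in-memory SQLite database. SQLite is an assumption made only so the example is self-contained (the original problem does not name it), and `ROWS`, `QUERY`, and `find_qualified` are hypothetical names introduced for illustration. The query keeps rows for the three required skills, groups by candidate, and keeps groups with exactly three matching rows, which relies on the stated no-duplicates assumption.

```python
import sqlite3

# Rows copied from the Example Input table.
ROWS = [
    (123, "Python"), (123, "Tableau"), (123, "PostgreSQL"),
    (234, "R"), (234, "PowerBI"), (234, "SQL Server"),
    (345, "Python"), (345, "Tableau"),
]

# Filter to the required skills, then keep candidates with all three of them.
QUERY = """
SELECT candidate_id
FROM candidates
WHERE skill IN ('Python', 'Tableau', 'PostgreSQL')
GROUP BY candidate_id
HAVING COUNT(skill) = 3
ORDER BY candidate_id ASC;
"""


def find_qualified(rows):
    """Load rows into an in-memory SQLite table and return matching candidate_ids."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE candidates (candidate_id INTEGER, skill VARCHAR)")
        conn.executemany("INSERT INTO candidates VALUES (?, ?)", rows)
        return [row[0] for row in conn.execute(QUERY)]
    finally:
        conn.close()


print(find_qualified(ROWS))  # prints [123]
```

If duplicate rows were possible, `COUNT(DISTINCT skill)` would be the safer choice in the `HAVING` clause.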


In [1]:
from pyspark.sql import SparkSession

# Create a local Spark session and get its SparkContext for the RDD API
spark = SparkSession.builder.master("local").appName("CandidateSkills").getOrCreate()
sc = spark.sparkContext

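# Define the data as an RDD of (candidate_id, skill) tuples.
# Despite its name, df is an RDD until toDF() is called on it.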
df = sc.parallelize([
    (123, "Python"),
    (123, "Tableau"),
    (123, "PostgreSQL"),
    (234, "R"),
    (234, "PowerBI"),
    (234, "SQL Server"),
    (345, "Python"),
    (345, "Tableau")
])



# Convert the RDD to a DataFrame and show it (columns default to _1 and _2)
df.toDF().show()


+---+----------+
| _1|        _2|
+---+----------+
|123|    Python|
|123|   Tableau|
|123|PostgreSQL|
|234|         R|
|234|   PowerBI|
|234|SQL Server|
|345|    Python|
|345|   Tableau|
+---+----------+



In [None]:
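# Keep rows for the three required skills, count them per candidate, and keep
# candidates whose count is exactly 3 (valid because the table has no duplicates).
df2 = df.filter(lambda x: x[1] in ['Python', 'Tableau', 'PostgreSQL'])\
      .map(lambda x: (x[0], 1))\
      .reduceByKey(lambda x, y: x + y)\
      .filter(lambda x: x[1] == 3)\
      .sortByKey()  # order by candidate_id ascending, as the question requires

df2.toDF().show()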

+---+---+
| _1| _2|
+---+---+
|123|  3|
+---+---+
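
The count-based filter above is only correct because the table has no duplicate rows: a candidate listed with Python twice plus Tableau would also reach a count of three. As a rough illustration, here is a plain-Python sketch (no Spark required) of a set-based variant that does not depend on that assumption. The names `REQUIRED`, `rows`, and `candidates_with_all_skills` are hypothetical and not part of the original notebook.

```python
from collections import defaultdict

REQUIRED = {"Python", "Tableau", "PostgreSQL"}


def candidates_with_all_skills(rows, required=REQUIRED):
    """Return sorted candidate_ids whose skill set contains every required skill."""
    skills_by_candidate = defaultdict(set)
    for candidate_id, skill in rows:
        # A set ignores repeated (candidate_id, skill) rows.
        skills_by_candidate[candidate_id].add(skill)
    return sorted(
        cid for cid, skills in skills_by_candidate.items() if required <= skills
    )


# Same rows as the notebook's example data.
rows = [
    (123, "Python"), (123, "Tableau"), (123, "PostgreSQL"),
    (234, "R"), (234, "PowerBI"), (234, "SQL Server"),
    (345, "Python"), (345, "Tableau"),
]

print(candidates_with_all_skills(rows))  # prints [123]
```

The subset test `required <= skills` checks membership of each required skill directly, so the result stays the same even if a row is repeated.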


