# Data Science Skills

## LinkedIn SQL Interview Question

### Question

Given a table of candidates and their skills, you're tasked with finding the candidates best suited for an open Data Science job.  
You want to find candidates who are proficient in **Python**, **Tableau**, and **PostgreSQL**.

Write a query to list the candidates who possess **all of the required skills** for the job.  
Sort the output by `candidate_id` in ascending order.

**Assumption:**  
There are no duplicates in the `candidates` table.

---

### Table: `candidates`

| Column Name   | Type     |
|---------------|----------|
| candidate_id  | integer  |
| skill         | varchar  |

---

### Example Input:

| candidate_id | skill       |
|--------------|-------------|
| 123          | Python      |
| 123          | Tableau     |
| 123          | PostgreSQL  |
| 234          | R           |
| 234          | PowerBI     |
| 234          | SQL Server  |
| 345          | Python      |
| 345          | Tableau     |

---

### Example Output:

| candidate_id |
|--------------|
| 123          |

---

### Explanation

Candidate **123** is displayed because they have all the required skills:  
**Python**, **Tableau**, and **PostgreSQL**.

Candidate **345** is not included because they are missing **PostgreSQL**.

Candidate **234** is not included because they have none of the required skills.
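The rule in the explanation amounts to a subset test: a candidate qualifies when the set of required skills is contained in the set of skills they hold. A minimal plain-Python sketch of that check on the example input follows; the names `REQUIRED`, `rows`, and `qualified_candidates` are illustrative assumptions, not part of the original question.

```python
from collections import defaultdict

# The three skills named in the question.
REQUIRED = {"Python", "Tableau", "PostgreSQL"}

# Rows copied from the example input above.
rows = [
    (123, "Python"), (123, "Tableau"), (123, "PostgreSQL"),
    (234, "R"), (234, "PowerBI"), (234, "SQL Server"),
    (345, "Python"), (345, "Tableau"),
]


def qualified_candidates(rows, required=REQUIRED):
    """Return sorted ids of candidates whose skills include every required skill."""
    skills_by_candidate = defaultdict(set)
    for candidate_id, skill in rows:
        skills_by_candidate[candidate_id].add(skill)
    # `required <= skills` is the subset test: every required skill is present.
    return sorted(
        cid for cid, skills in skills_by_candidate.items() if required <= skills
    )


print(qualified_candidates(rows))  # [123]
```

Because skills are collected into sets, this version does not depend on the no-duplicates assumption.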


In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.functions import col  # explicit import; `import *` shadows Python built-ins such as sum and max

# Create Spark session
spark = SparkSession.builder.master('local').appName("CandidateSkills").getOrCreate()

# Define the schema
schema = StructType([
    StructField("candidate_id", IntegerType(), True),
    StructField("skill", StringType(), True)
])

# Define the data
data = [
    (123, "Python"),
    (123, "Tableau"),
    (123, "PostgreSQL"),
    (234, "R"),
    (234, "PowerBI"),
    (234, "SQL Server"),
    (345, "Python"),
    (345, "Tableau")
]

# Create the DataFrame
candidates_df = spark.createDataFrame(data, schema)

# Show the DataFrame
candidates_df.show()


+------------+----------+
|candidate_id|     skill|
+------------+----------+
|         123|    Python|
|         123|   Tableau|
|         123|PostgreSQL|
|         234|         R|
|         234|   PowerBI|
|         234|SQL Server|
|         345|    Python|
|         345|   Tableau|
+------------+----------+



In [9]:
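# Keep only the three required skills, then count the matching rows per candidate.
# With no duplicate rows, a count of exactly 3 means all three skills are present.
candidates_df.filter(col('skill').isin('Python', 'Tableau', 'PostgreSQL'))\
             .groupBy('candidate_id').count()\
             .where(col('count') == 3)\
             .select('candidate_id')\
             .orderBy('candidate_id')\
             .show()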

+------------+
|candidate_id|
+------------+
|         123|
+------------+



In [11]:
candidates_df.createOrReplaceTempView('candidates')

In [12]:
%%sql
SELECT candidate_id
FROM candidates
WHERE skill IN ('Python', 'Tableau', 'PostgreSQL')
GROUP BY candidate_id
HAVING COUNT(candidate_id) = 3
ORDER BY candidate_id;

+------------+
|candidate_id|
+------------+
|         123|
+------------+
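
Counting rows per candidate is only correct because the question guarantees there are no duplicate rows. If that guarantee did not hold, counting distinct skills would be the safer choice. The sketch below illustrates this with Python's built-in `sqlite3` module as a stand-in for Spark SQL (an assumption made so the example runs without a Spark session); `QUERY`, `rows`, and `matching_candidates` are hypothetical names.

```python
import sqlite3

# Rows copied from the example input.
rows = [
    (123, "Python"), (123, "Tableau"), (123, "PostgreSQL"),
    (234, "R"), (234, "PowerBI"), (234, "SQL Server"),
    (345, "Python"), (345, "Tableau"),
]

# Same shape as the query above, but COUNT(DISTINCT skill) tolerates duplicate rows.
QUERY = """
SELECT candidate_id
FROM candidates
WHERE skill IN ('Python', 'Tableau', 'PostgreSQL')
GROUP BY candidate_id
HAVING COUNT(DISTINCT skill) = 3
ORDER BY candidate_id
"""


def matching_candidates(rows):
    """Load rows into an in-memory SQLite table and run QUERY against it."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE candidates (candidate_id INTEGER, skill TEXT)")
        conn.executemany("INSERT INTO candidates VALUES (?, ?)", rows)
        return [row[0] for row in conn.execute(QUERY)]
    finally:
        conn.close()


print(matching_candidates(rows))  # [123]

# A duplicated row gives 345 three matching rows but still only two distinct skills.
print(matching_candidates(rows + [(345, "Python")]))  # [123]
```

With a plain row count in the `HAVING` clause, the duplicated row in the second call would let candidate 345 through even though they still lack PostgreSQL.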

