# Candidate Data Extraction and Loading
This notebook extracts data from the `candidates.csv` file and loads it into the `raw_candidates` table of a PostgreSQL database. It is the first step of the ETL process, preparing the data for subsequent analysis or transformations.

## Database Setup
Run the `setup.py` script to create the database `etl_workshop_db` and the tables `raw_candidates` and `applicant` if they don't already exist.

In [26]:
%run ../scripts/setup.py

Setting up the database and tables...
Database 'etl_workshop_db' already exists.
Tables 'raw_candidates' and 'applicant' created successfully (if they didn't exist).
Indices created successfully (if they didn't exist).
Setup completed successfully!


## Initial Configuration
This cell imports the required libraries and creates a SQLAlchemy engine for PostgreSQL. The credentials are read from a `.env` file, which keeps them out of the notebook.

In [27]:
import pandas as pd
from sqlalchemy import create_engine
from dotenv import load_dotenv
import os

# Load environment variables
load_dotenv()
connection_string = f"postgresql://{os.getenv('DB_USER')}:{os.getenv('DB_PASSWORD')}@{os.getenv('DB_HOST')}:{os.getenv('DB_PORT')}/{os.getenv('DB_NAME')}"
engine = create_engine(connection_string)
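
The connection string above interpolates the credentials directly, so a password containing characters such as `@` or `/` would produce a malformed URL, and an unset variable would silently become the text `None`. A minimal sketch of one way to guard against both is shown here. The helper name `build_connection_string` and the sample values are hypothetical and not part of this project.

```python
import os
from urllib.parse import quote_plus

# Variable names match those read in the cell above.
REQUIRED_VARS = ("DB_USER", "DB_PASSWORD", "DB_HOST", "DB_PORT", "DB_NAME")


def build_connection_string(env=os.environ):
    """Build a PostgreSQL URL from environment-style settings (hypothetical helper)."""
    # Fail early if any setting is absent or empty.
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ValueError(f"Missing environment variables: {', '.join(missing)}")
    # Percent-encode user and password so special characters do not break the URL.
    user = quote_plus(env["DB_USER"])
    password = quote_plus(env["DB_PASSWORD"])
    return f"postgresql://{user}:{password}@{env['DB_HOST']}:{env['DB_PORT']}/{env['DB_NAME']}"


# Illustrative values only; real values would come from the .env file.
example_env = {
    "DB_USER": "etl",
    "DB_PASSWORD": "p@ss/word",
    "DB_HOST": "localhost",
    "DB_PORT": "5432",
    "DB_NAME": "etl_workshop_db",
}
print(build_connection_string(example_env))
```

After `load_dotenv()` has run, the result of `build_connection_string()` could be passed to `create_engine` in place of the f-string.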

## CSV File Reading
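This section reads the `candidates.csv` file (semicolon-delimited) using pandas. If the file is missing, the cell prints a message and re-raises the error so the notebook stops. It then checks that the column names and their order match the expected schema.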

In [28]:
try:
    df = pd.read_csv('../data/candidates.csv', sep=';')
except FileNotFoundError:
    print("Error: candidates.csv not found")
    raise

# Validate expected columns
expected_columns = ['First Name', 'Last Name', 'Email', 'Application Date', 'Country', 'YOE', 'Seniority', 'Technology', 'Code Challenge Score', 'Technical Interview Score']
if list(df.columns) != expected_columns:
    print("Error: CSV does not have the expected columns.")
    raise ValueError("Column mismatch in candidates.csv")
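
The check above only reports that the columns differ, not which ones. As a sketch of how the mismatch could be made easier to diagnose, the hypothetical helper below (`diff_columns` is not part of this notebook) separates missing columns, unexpected columns, and a pure ordering difference. The short column lists are illustrative only.

```python
def diff_columns(actual, expected):
    """Return (missing, unexpected, order_differs) for two lists of column names."""
    missing = [col for col in expected if col not in actual]
    unexpected = [col for col in actual if col not in expected]
    # Same names on both sides, but the lists are not identical (for example, reordered).
    order_differs = not missing and not unexpected and list(actual) != list(expected)
    return missing, unexpected, order_differs


# Illustrative input: one column renamed and two columns swapped.
expected_demo = ["First Name", "Last Name", "Email"]
actual_demo = ["First Name", "Email", "Surname"]

missing, unexpected, order_differs = diff_columns(actual_demo, expected_demo)
print(f"Missing: {missing}, unexpected: {unexpected}, order differs: {order_differs}")
```

In the notebook, the call would be `diff_columns(list(df.columns), expected_columns)`, and the result could be included in the `ValueError` message.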

In [29]:
if not df.empty:
    print(f"Data loaded successfully. Rows: {len(df)}, Columns: {len(df.columns)}")
else:
    print("The DataFrame is empty.")

Data loaded successfully. Rows: 50000, Columns: 10


## Data Loading to the Database
The data is appended to the `raw_candidates` table in PostgreSQL (`if_exists='append'`). The table created by `setup.py` is kept, so re-running this cell inserts the same rows again.

In [30]:
try:
    df.to_sql('raw_candidates', engine, if_exists='append', index=False)
    print("Data loaded into raw_candidates.")
except Exception as e:
    print(f"Error loading data into PostgreSQL: {e}")
    # Re-raise so a failed load stops the notebook instead of passing silently
    raise

Data loaded into raw_candidates.
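
Because the load cell calls `to_sql` with `if_exists='append'`, executing it a second time adds every row again. The sketch below reproduces that effect on a small scale and shows a row-count check that would catch it. It uses an in-memory SQLite database as a stand-in for PostgreSQL, and the table name `raw_candidates_demo` and the sample rows are assumptions made for illustration.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for PostgreSQL here (assumption, illustration only).
conn = sqlite3.connect(":memory:")
sample = pd.DataFrame({"Email": ["a@example.com", "b@example.com"], "YOE": [2, 10]})

# Run the same append twice, as happens when the load cell is re-executed.
sample.to_sql("raw_candidates_demo", conn, if_exists="append", index=False)
sample.to_sql("raw_candidates_demo", conn, if_exists="append", index=False)

# Compare the table's row count with the DataFrame's length.
row_count = pd.read_sql("SELECT COUNT(*) AS n FROM raw_candidates_demo", conn)["n"].iloc[0]
print(f"Rows in table: {row_count}, rows in DataFrame: {len(sample)}")
conn.close()
```

A count larger than `len(df)` after the load suggests the cell ran more than once. One possible remedy is to empty the table before appending. Switching to `if_exists='replace'` is likely less suitable here, since pandas drops and recreates the table in that mode, which would discard the schema defined by `setup.py`.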


## Verification of Data in raw_candidates
The query below returns the first 5 rows of the `raw_candidates` table to confirm that the data was loaded correctly:

In [31]:
# Verify that everything is correct
query = "SELECT * FROM raw_candidates LIMIT 5;"
pd.read_sql(query, engine)

| | id | First Name | Last Name | Email | Application Date | Country | YOE | Seniority | Technology | Code Challenge Score | Technical Interview Score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bernadette | Langworth | leonard91@yahoo.com | 2021-02-26 | Norway | 2 | Intern | Data Engineer | 3 | 3 |
| 1 | 2 | Camryn | Reynolds | zelda56@hotmail.com | 2021-09-09 | Panama | 10 | Intern | Data Engineer | 2 | 10 |
| 2 | 3 | Larue | Spinka | okey_schultz41@gmail.com | 2020-04-14 | Belarus | 4 | Mid-Level | Client Success | 10 | 9 |
| 3 | 4 | Arch | Spinka | elvera_kulas@yahoo.com | 2020-10-01 | Eritrea | 25 | Trainee | QA Manual | 7 | 1 |
| 4 | 5 | Larue | Altenwerth | minnie.gislason@gmail.com | 2020-05-20 | Myanmar | 13 | Mid-Level | Social Media Community Management | 9 | 7 |


## Summary
- **Data Loaded**: 50,000 rows, 10 columns.
- **Target Table**: `raw_candidates`.
- **Next Steps**: Proceed to `02_explore_data.ipynb` for data exploration and analysis.