# Notebook 1: Extraction 

This notebook reads the *candidates* dataset, loads it into a pandas DataFrame for easy manipulation, and then writes the processed data to a MySQL database.

### Importing libraries and modules

The os and dotenv libraries manage environment variables, so the database credentials can be loaded from a .env file instead of being hard-coded. From sqlalchemy, the create_engine function builds the database engine, the text function wraps raw SQL statements, and the types module supplies the column types used during data profiling. The pandas library loads the candidates dataset into a DataFrame for manipulation and analysis, and eventually writes it to the MySQL database. The sys module is used to add the project's src directory to the module search path.

In [1]:
import sys
import os
from dotenv import load_dotenv
from sqlalchemy import create_engine, text, types as sqltypes
import pandas as pd

### Establishing the database connection

In [2]:
# Add the 'src' directory to the module search path (sys.path)
sys.path.append(os.path.abspath('../src'))

# Import the utility functions from the db_utils module within the connection package
from connection.db_utils import get_db_connection, read_candidates_table

# Get the database connection
connection = get_db_connection()

Connected to the database successfully
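
The `connection.db_utils` module is not shown in this notebook. As a rough sketch only, a helper like `get_db_connection` might build its engine along these lines. The environment variable names (`DB_USER`, `DB_PASSWORD`, `DB_HOST`, `DB_PORT`, `DB_NAME`), the function names, and the `pymysql` driver are assumptions for illustration, not the project's actual configuration. In the notebook, `load_dotenv()` would populate `os.environ` from the .env file before such a helper is called.

```python
import os
from sqlalchemy import create_engine
from sqlalchemy.engine import URL


def build_mysql_url(env=os.environ):
    """Assemble a SQLAlchemy URL from environment variables (hypothetical names)."""
    return URL.create(
        drivername="mysql+pymysql",  # assumed driver; needs the pymysql package at connect time
        username=env.get("DB_USER"),
        password=env.get("DB_PASSWORD"),
        host=env.get("DB_HOST", "localhost"),
        port=int(env.get("DB_PORT", 3306)),
        database=env.get("DB_NAME"),
    )


def get_engine(env=os.environ):
    """Create an engine; no connection is opened until the engine is first used."""
    return create_engine(build_mysql_url(env))


# Placeholder values instead of a real .env file
example_env = {"DB_USER": "etl_user", "DB_PASSWORD": "secret", "DB_NAME": "candidates_db"}
url = build_mysql_url(example_env)

# Mask the password when printing the URL
print(url.render_as_string(hide_password=True))
```

Building the URL with `URL.create` rather than string formatting means special characters in the password do not have to be escaped by hand.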


### Reading the dataset and transforming it into a dataframe

In [3]:
csv_path = "../data/candidates.csv"

df = pd.read_csv(csv_path, sep=";")

df

Unnamed: 0,First Name,Last Name,Email,Application Date,Country,YOE,Seniority,Technology,Code Challenge Score,Technical Interview Score
0,Bernadette,Langworth,leonard91@yahoo.com,26/02/2021,Norway,2,Intern,Data Engineer,3,3
1,Camryn,Reynolds,zelda56@hotmail.com,09/09/2021,Panama,10,Intern,Data Engineer,2,10
2,Larue,Spinka,okey_schultz41@gmail.com,14/04/2020,Belarus,4,Mid-Level,Client Success,10,9
3,Arch,Spinka,elvera_kulas@yahoo.com,01/10/2020,Eritrea,25,Trainee,QA Manual,7,1
4,Larue,Altenwerth,minnie.gislason@gmail.com,20/05/2020,Myanmar,13,Mid-Level,Social Media Community Management,9,7
...,...,...,...,...,...,...,...,...,...,...
49995,Bethany,Shields,rocky_mitchell@hotmail.com,09/01/2022,Dominican Republic,27,Trainee,Security,2,1
49996,Era,Swaniawski,dolores.roob@hotmail.com,02/06/2020,Morocco,21,Lead,Game Development,1,2
49997,Martin,Lakin,savanah.stracke@gmail.com,15/12/2018,Uganda,20,Trainee,System Administration,6,1
49998,Aliya,Abernathy,vivienne.fritsch@yahoo.com,30/05/2020,Czech Republic,20,Senior,Database Administration,0,0


### Data profiling

#### Profiling data types

In [4]:
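def pandas_to_mysql_type(pandas_dtype):
    """Suggest a SQLAlchemy column type for a pandas dtype."""
    # Check booleans first: pandas also treats bool as a numeric dtype
    if pd.api.types.is_bool_dtype(pandas_dtype):
        return sqltypes.BOOLEAN
    elif pd.api.types.is_integer_dtype(pandas_dtype):
        return sqltypes.BIGINT
    elif pd.api.types.is_numeric_dtype(pandas_dtype):
        # Integers were handled above, so this branch covers floats
        return sqltypes.FLOAT
    elif pd.api.types.is_datetime64_any_dtype(pandas_dtype):
        return sqltypes.DATETIME
    elif isinstance(pandas_dtype, pd.CategoricalDtype):
        # Generic SQLAlchemy Enum type; the MySQL-specific ENUM lives in sqlalchemy.dialects.mysql
        return sqltypes.Enum
    else:
        # Strings and any other dtype fall back to a generic text column
        return sqltypes.VARCHAR(255)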

for col_name, col_dtype in df.dtypes.items():
    mysql_type = pandas_to_mysql_type(col_dtype)
    print(f"Column '{col_name}': Pandas dtype = {col_dtype}, Suggested MySQL type = {mysql_type}")

Column 'First Name': Pandas dtype = object, Suggested MySQL type = VARCHAR(255)
Column 'Last Name': Pandas dtype = object, Suggested MySQL type = VARCHAR(255)
Column 'Email': Pandas dtype = object, Suggested MySQL type = VARCHAR(255)
Column 'Application Date': Pandas dtype = object, Suggested MySQL type = VARCHAR(255)
Column 'Country': Pandas dtype = object, Suggested MySQL type = VARCHAR(255)
Column 'YOE': Pandas dtype = int64, Suggested MySQL type = <class 'sqlalchemy.sql.sqltypes.BIGINT'>
Column 'Seniority': Pandas dtype = object, Suggested MySQL type = VARCHAR(255)
Column 'Technology': Pandas dtype = object, Suggested MySQL type = VARCHAR(255)
Column 'Code Challenge Score': Pandas dtype = int64, Suggested MySQL type = <class 'sqlalchemy.sql.sqltypes.BIGINT'>
Column 'Technical Interview Score': Pandas dtype = int64, Suggested MySQL type = <class 'sqlalchemy.sql.sqltypes.BIGINT'>
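
`Application Date` is reported as `object` because `read_csv` leaves day-first strings such as `26/02/2021` unparsed, so the mapping falls through to `VARCHAR(255)`. The sketch below uses a hand-made two-row frame rather than the real file to show how parsing the column first changes the detected dtype.

```python
import pandas as pd

# Two sample values in the same DD/MM/YYYY layout as the dataset
sample = pd.DataFrame({"Application Date": ["26/02/2021", "09/09/2021"]})

# An explicit format avoids ambiguity between day-first and month-first dates
sample["Application Date"] = pd.to_datetime(sample["Application Date"], format="%d/%m/%Y")

is_datetime = pd.api.types.is_datetime64_any_dtype(sample["Application Date"])
print(sample.dtypes)
print("Detected as datetime:", is_datetime)
```

With the column parsed this way, `pandas_to_mysql_type` would suggest `DATETIME` for it instead of a text type.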


#### Profiling the maximum values of numeric columns

In [5]:
# Display the maximum values of numeric columns
print("Maximum values in numeric columns:")
print(df.max(numeric_only=True))

Maximum values in numeric columns:
YOE                          30
Code Challenge Score         10
Technical Interview Score    10
dtype: int64


#### Profiling the length of the string values

In [6]:
text_columns = df.select_dtypes(include=['object'])

# Calculate the length of each string in the text columns
lengths = text_columns.apply(lambda col: col.map(lambda x: len(str(x))))

# Find the maximum string lengths for each text column
max_lengths = lengths.max()

# Display the results
print("Maximum string lengths in text columns:")
print(max_lengths)

Maximum string lengths in text columns:
First Name          11
Last Name           13
Email               36
Application Date    10
Country             51
Seniority            9
Technology          39
dtype: int64
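
The profiled maxima can feed a `dtype` mapping for `DataFrame.to_sql`, so that columns are not all created as `BIGINT` or `VARCHAR(255)`. The sketch below is one possible approach. The helper name `suggest_sql_types`, the headroom factor, and the choice of `SMALLINT` for small integers are illustrative assumptions, not part of this notebook.

```python
import pandas as pd
from sqlalchemy import types as sqltypes


def suggest_sql_types(frame, headroom=2):
    """Suggest a column -> SQLAlchemy type mapping from the observed values."""
    suggestions = {}
    for name in frame.columns:
        col = frame[name]
        if pd.api.types.is_integer_dtype(col):
            # A signed SMALLINT holds values up to 32767; otherwise use BIGINT
            fits_small = int(col.abs().max()) <= 32767
            suggestions[name] = sqltypes.SMALLINT() if fits_small else sqltypes.BIGINT()
        else:
            # Size text columns from the longest observed string, with some headroom
            longest = int(col.astype(str).str.len().max())
            suggestions[name] = sqltypes.VARCHAR(longest * headroom)
    return suggestions


# Small hand-made sample in place of the full dataset
sample = pd.DataFrame({
    "Country": ["Norway", "Dominican Republic"],
    "YOE": [2, 27],
})

dtype_map = suggest_sql_types(sample)
for column, sql_type in dtype_map.items():
    print(column, "->", sql_type)
```

The resulting dictionary could be passed to `to_sql` through its `dtype` parameter when the table is created by pandas.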


### Writing the data to the MySQL database

In [7]:
csv_path = "../data/candidates.csv"
table_name = "candidates"

try:
    # Read CSV using semicolon (;) as separator
    df = pd.read_csv(csv_path, sep=";")

    # Rename columns to match MySQL table schema
    df.rename(columns={
        "First Name": "first_name",
        "Last Name": "last_name",
        "Email": "email",
        "Application Date": "application_date",
        "Country": "country",
        "YOE": "yoe",
        "Seniority": "seniority",
        "Technology": "technology",
        "Code Challenge Score": "code_challenge_score",
        "Technical Interview Score": "technical_interview"
    }, inplace=True)

    # Parse the DD/MM/YYYY strings and keep only the date part
    df["application_date"] = pd.to_datetime(df["application_date"], dayfirst=True).dt.date

    # Insert data into the MySQL table. `connection` comes from get_db_connection()
    # in cell 2 and is assumed to be an SQLAlchemy engine or connection, as to_sql requires.
    df.to_sql(name=table_name, con=connection, if_exists='append', index=False)

    print(f"✅ Data successfully inserted into '{table_name}'!")
except Exception as e:
    print(f"❌ Error: {e}")
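
A quick way to confirm a load is to count the rows in the target table afterwards. The sketch below uses an in-memory SQLite engine as a stand-in for MySQL so that it runs without a server. The two sample rows reuse values from the dataset preview, and against the real database only the engine would differ.

```python
import pandas as pd
from sqlalchemy import create_engine, text

# SQLite in memory stands in for the MySQL engine (assumption for illustration)
engine = create_engine("sqlite:///:memory:")

sample = pd.DataFrame({
    "first_name": ["Bernadette", "Camryn"],
    "application_date": ["26/02/2021", "09/09/2021"],
    "yoe": [2, 10],
})

# Parse the day-first strings and keep only the date part
sample["application_date"] = pd.to_datetime(
    sample["application_date"], format="%d/%m/%Y"
).dt.date

# Append the rows, then count what landed in the table
sample.to_sql(name="candidates", con=engine, if_exists="append", index=False)

with engine.connect() as conn:
    row_count = conn.execute(text("SELECT COUNT(*) FROM candidates")).scalar()

print(f"Rows in 'candidates': {row_count}")
```

Because the notebook writes with `if_exists='append'`, rerunning the load cell adds the same rows again, so a count check like this also helps spot accidental duplicates.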
