# Database Connection and Data Retrieval in Python

This notebook demonstrates how to establish a connection to a database using Python and retrieve data into a pandas DataFrame. We will go through the steps of connecting to a database, executing a query, and processing the data for further analysis.

## Prerequisites

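Before running this notebook, make sure the following are available:

- Python 3.8 or above (the `f'{df.shape=}'` syntax used below was added in 3.8)
- pandas (for data manipulation)
- SQLAlchemy (for the `pd.read_sql` approach)
- `mysql-connector-python` (provides `mysql.connector`) and PyMySQL (the driver named in the SQLAlchemy connection string)
- A running, reachable MySQL server that hosts the `employees` database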
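
As a quick sanity check, the sketch below reports which of the modules imported later in this notebook cannot be found in the current environment. The helper name `missing_modules` and the module list are illustrative assumptions, not part of the original notebook.

```python
import importlib.util

# Modules imported later in this notebook (list assumed for illustration)
REQUIRED_MODULES = ["pandas", "sqlalchemy", "mysql.connector", "pymysql"]


def missing_modules(names):
    """Return the subset of module names that cannot be located."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except ModuleNotFoundError:
            # Raised for dotted names whose parent package is absent
            found = False
        if not found:
            missing.append(name)
    return missing


not_installed = missing_modules(REQUIRED_MODULES)
if not_installed:
    print("Missing modules:", ", ".join(not_installed))
else:
    print("All required modules are importable.")
```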

## Environment Setup

Install the dependencies declared in `pyproject.toml` and pinned in `poetry.lock` with Poetry:

```bash
poetry install
```

## Databases and Python

We start by importing pandas and reading a CSV file into a DataFrame. The CSV file is assumed to contain data previously exported from a database; it serves as a reference for the results retrieved directly from the database in the sections that follow.


In [1]:
import pandas as pd

file_path = '/Users/User/Desktop/_SELECT_FROM_employees_e_JOIN_salaries_s_ON_e_emp_no_s_emp_no_WH_202311080607.csv'

df = pd.read_csv(file_path)
print(f'{df.shape=}')
df

df.shape=(20798, 10)


Unnamed: 0,emp_no,birth_date,first_name,last_name,gender,hire_date,emp_no.1,salary,from_date,to_date
0,10019,1953-01-23,Lillian,Haddadi,M,1999-04-30,10019,44276,1999-04-30,2000-04-29
1,10019,1953-01-23,Lillian,Haddadi,M,1999-04-30,10019,46946,2000-04-29,2001-04-29
2,10019,1953-01-23,Lillian,Haddadi,M,1999-04-30,10019,46775,2001-04-29,2002-04-29
3,10019,1953-01-23,Lillian,Haddadi,M,1999-04-30,10019,50032,2002-04-29,9999-01-01
4,10105,1962-02-05,Hironoby,Piveteau,M,1999-03-23,10105,59258,1999-05-17,2000-05-16
...,...,...,...,...,...,...,...,...,...,...
20793,499924,1963-06-08,Angus,Swan,M,1998-08-04,499924,43845,2000-08-03,2001-08-03
20794,499924,1963-06-08,Angus,Swan,M,1998-08-04,499924,47398,2001-08-03,9999-01-01
20795,499987,1961-09-05,Rimli,Dusink,F,1998-09-20,499987,52282,1999-12-21,2000-12-19
20796,499987,1961-09-05,Rimli,Dusink,F,1998-09-20,499987,54221,2000-12-19,2001-12-19


## Raw Connection to Database

In the following code, we open a raw connection to a MySQL database with the `mysql.connector` package, execute a query through a cursor, and collect the returned rows. Replace the connection parameters with your own database credentials.

In [2]:
import mysql.connector

connection = mysql.connector.connect(
    user='root',
    password='college',
    host='localhost',
    database='employees',
    ssl_disabled=True
)

cursor = connection.cursor()

query = """
    SELECT *
    FROM employees e 
    JOIN salaries s 
    ON e.emp_no = s.emp_no 
    WHERE e.hire_date > '1998-01-01';
"""
cursor.execute(query)

# Fetch all rows of the result set as a list of tuples
results = cursor.fetchall()

cursor.close()
connection.close()
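
The cell above hardcodes the credentials. One common alternative is to read them from environment variables; the sketch below assumes hypothetical variable names (`MYSQL_USER`, `MYSQL_PASSWORD`, `MYSQL_HOST`, `MYSQL_DATABASE`) and falls back to the values used in this notebook for everything except the password.

```python
import os


def mysql_params_from_env(env=os.environ):
    """Build connection keyword arguments from environment variables.

    The variable names are assumptions made for this example.
    """
    return {
        "user": env.get("MYSQL_USER", "root"),
        # No default for the password: a missing variable raises KeyError
        "password": env["MYSQL_PASSWORD"],
        "host": env.get("MYSQL_HOST", "localhost"),
        "database": env.get("MYSQL_DATABASE", "employees"),
    }


# Demonstration with an explicit mapping instead of the real environment
params = mysql_params_from_env({"MYSQL_PASSWORD": "example-password"})
print(sorted(params))
```

The resulting dictionary can then be unpacked into the call shown above, for example `mysql.connector.connect(**params, ssl_disabled=True)`.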

## DataFrame Creation from Query Results

The query results, a list of tuples, are converted into a pandas DataFrame. At this stage the columns carry only integer labels; descriptive column names are assigned in the next step.

In [3]:
df_db = pd.DataFrame(results)
print(f'{df_db.shape=}')
df_db

df_db.shape=(20798, 10)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,10019,1953-01-23,Lillian,Haddadi,M,1999-04-30,10019,44276,1999-04-30,2000-04-29
1,10019,1953-01-23,Lillian,Haddadi,M,1999-04-30,10019,46946,2000-04-29,2001-04-29
2,10019,1953-01-23,Lillian,Haddadi,M,1999-04-30,10019,46775,2001-04-29,2002-04-29
3,10019,1953-01-23,Lillian,Haddadi,M,1999-04-30,10019,50032,2002-04-29,9999-01-01
4,10105,1962-02-05,Hironoby,Piveteau,M,1999-03-23,10105,59258,1999-05-17,2000-05-16
...,...,...,...,...,...,...,...,...,...,...
20793,499924,1963-06-08,Angus,Swan,M,1998-08-04,499924,43845,2000-08-03,2001-08-03
20794,499924,1963-06-08,Angus,Swan,M,1998-08-04,499924,47398,2001-08-03,9999-01-01
20795,499987,1961-09-05,Rimli,Dusink,F,1998-09-20,499987,52282,1999-12-21,2000-12-19
20796,499987,1961-09-05,Rimli,Dusink,F,1998-09-20,499987,54221,2000-12-19,2001-12-19


After creating the DataFrame, we assign column names to match the data:

In [4]:
df_db.columns=['emp_no', 'birth_date', 'first_name', 'last_name', 'gender', 'hire_date', 'emp_no.1', 'salary', 'from_date', 'to_date']
df_db

Unnamed: 0,emp_no,birth_date,first_name,last_name,gender,hire_date,emp_no.1,salary,from_date,to_date
0,10019,1953-01-23,Lillian,Haddadi,M,1999-04-30,10019,44276,1999-04-30,2000-04-29
1,10019,1953-01-23,Lillian,Haddadi,M,1999-04-30,10019,46946,2000-04-29,2001-04-29
2,10019,1953-01-23,Lillian,Haddadi,M,1999-04-30,10019,46775,2001-04-29,2002-04-29
3,10019,1953-01-23,Lillian,Haddadi,M,1999-04-30,10019,50032,2002-04-29,9999-01-01
4,10105,1962-02-05,Hironoby,Piveteau,M,1999-03-23,10105,59258,1999-05-17,2000-05-16
...,...,...,...,...,...,...,...,...,...,...
20793,499924,1963-06-08,Angus,Swan,M,1998-08-04,499924,43845,2000-08-03,2001-08-03
20794,499924,1963-06-08,Angus,Swan,M,1998-08-04,499924,47398,2001-08-03,9999-01-01
20795,499987,1961-09-05,Rimli,Dusink,F,1998-09-20,499987,52282,1999-12-21,2000-12-19
20796,499987,1961-09-05,Rimli,Dusink,F,1998-09-20,499987,54221,2000-12-19,2001-12-19
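
Typing the column names by hand can drift out of sync with the query. DB-API cursors expose a `description` attribute whose entries start with the column name, so the names can be derived instead. The sketch below uses an in-memory SQLite database with toy tables as a stand-in for MySQL, and a hypothetical helper `unique_column_names` that suffixes repeated names in the same `emp_no.1` style used above. With `mysql.connector`, `cursor.description` would have to be read before the cursor is closed.

```python
import sqlite3


def unique_column_names(description):
    """Extract column names from cursor.description, suffixing repeats (.1, .2, ...)."""
    seen = {}
    names = []
    for column in description:
        name = column[0]  # first item of each entry is the column name
        count = seen.get(name, 0)
        names.append(name if count == 0 else f"{name}.{count}")
        seen[name] = count + 1
    return names


# Toy stand-in for the employees/salaries tables
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employees (emp_no INTEGER, first_name TEXT)")
cur.execute("CREATE TABLE salaries (emp_no INTEGER, salary INTEGER)")
cur.execute("INSERT INTO employees VALUES (10019, 'Lillian')")
cur.execute("INSERT INTO salaries VALUES (10019, 44276)")

cur.execute("SELECT * FROM employees e JOIN salaries s ON e.emp_no = s.emp_no")
columns = unique_column_names(cur.description)
rows = cur.fetchall()

cur.close()
conn.close()
print(columns)
```

The resulting list can be passed straight to the constructor, as in `pd.DataFrame(rows, columns=columns)`.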


## Pandas Read SQL (Alternative Approach)

In addition to the raw SQL connection demonstrated earlier, we can also leverage pandas' built-in SQL functionality for a more streamlined approach.

We first import the required libraries and print the SQLAlchemy version to confirm that it is installed.

In [5]:
import pandas as pd
import sqlalchemy
from sqlalchemy import create_engine
print('sqlalchemy version = ', sqlalchemy.__version__)

sqlalchemy version =  2.0.23


Next, we create a SQLAlchemy engine that connects to the MySQL database through the `pymysql` driver. The connection string embeds the user name and password in plain text, which is acceptable for a local demo but should not be committed or used as-is in production.

The `pd.read_sql` function then executes the SQL query and reads the results directly into a pandas DataFrame, column names included.

In [8]:
connection_string = 'mysql+pymysql://root:college@localhost/employees'
engine = create_engine(connection_string)

query = """
    SELECT *
    FROM employees e 
    JOIN salaries s 
    ON e.emp_no = s.emp_no 
    WHERE e.hire_date > '1998-01-01';
"""

df_db_read_sql = pd.read_sql(sql=query, con=engine)
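
The cutoff date in the query is written directly into the SQL text. If it ever comes from user input or a variable, passing it as a bound parameter is the safer pattern. The sketch below illustrates the idea with an in-memory SQLite database and made-up rows; SQLite uses `?` placeholders, whereas `mysql.connector` uses `%s`, and `pd.read_sql` accepts such values through its `params` argument. The function name `hired_after` is an assumption for this example.

```python
import sqlite3

# Toy table with made-up rows, standing in for the employees table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (emp_no INTEGER, hire_date TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [(1, "1997-06-01"), (2, "1999-04-30"), (3, "1998-09-20")],
)


def hired_after(connection, cutoff):
    """Return emp_no values hired after the cutoff, using a bound parameter."""
    cursor = connection.execute(
        "SELECT emp_no FROM employees WHERE hire_date > ? ORDER BY emp_no",
        (cutoff,),  # the driver binds the value; it is never spliced into the SQL text
    )
    return [row[0] for row in cursor.fetchall()]


recent = hired_after(conn, "1998-01-01")
conn.close()
print(recent)
```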

In [9]:
df_db_read_sql

Unnamed: 0,emp_no,birth_date,first_name,last_name,gender,hire_date,emp_no.1,salary,from_date,to_date
0,10019,1953-01-23,Lillian,Haddadi,M,1999-04-30,10019,44276,1999-04-30,2000-04-29
1,10019,1953-01-23,Lillian,Haddadi,M,1999-04-30,10019,46946,2000-04-29,2001-04-29
2,10019,1953-01-23,Lillian,Haddadi,M,1999-04-30,10019,46775,2001-04-29,2002-04-29
3,10019,1953-01-23,Lillian,Haddadi,M,1999-04-30,10019,50032,2002-04-29,9999-01-01
4,10105,1962-02-05,Hironoby,Piveteau,M,1999-03-23,10105,59258,1999-05-17,2000-05-16
...,...,...,...,...,...,...,...,...,...,...
20793,499924,1963-06-08,Angus,Swan,M,1998-08-04,499924,43845,2000-08-03,2001-08-03
20794,499924,1963-06-08,Angus,Swan,M,1998-08-04,499924,47398,2001-08-03,9999-01-01
20795,499987,1961-09-05,Rimli,Dusink,F,1998-09-20,499987,52282,1999-12-21,2000-12-19
20796,499987,1961-09-05,Rimli,Dusink,F,1998-09-20,499987,54221,2000-12-19,2001-12-19
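
One practical caveat with connection strings: a password containing characters such as `@` or `/` must be percent-encoded, otherwise the URL is parsed incorrectly. A minimal sketch using only the standard library is shown below; the helper name `build_mysql_url` and the sample password are assumptions for illustration.

```python
from urllib.parse import quote_plus


def build_mysql_url(user, password, host, database, driver="pymysql"):
    """Assemble a SQLAlchemy-style MySQL URL with percent-encoded credentials."""
    return (
        f"mysql+{driver}://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}/{database}"
    )


# A password with reserved characters is escaped before it enters the URL
connection_string = build_mysql_url("root", "p@ss/word", "localhost", "employees")
print(connection_string)
```

The returned string can be passed to `create_engine` as before. SQLAlchemy also offers `sqlalchemy.engine.URL.create`, which builds the URL from separate components and avoids manual escaping.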
