# NVCU Database Generator

By Kenneth Burchfiel

Released under the MIT License

This script creates a SQLite database for the (fictional) Northern Virginia Catholic University that can be referenced within other sections of Python for Nonprofits. The data within this database is also fictional.

In [1]:
from sqlalchemy import create_engine
import pandas as pd
import numpy as np
# Setting up random number generation capabilities:
rng = np.random.default_rng(2325) 
# Based on https://numpy.org/doc/stable/reference/random/generator.html 
# The faker library is a great tool for creating fictional
# database records. I set the locale to 'en_US' because NVCU
# is based in the United States.
from faker import Faker; fake = Faker('en_US')  



## Connecting to our NVCU database:

(This code will work even if no database exists at the path shown below just yet.)


In [2]:
e = create_engine('sqlite:///nvcu_db.db')
    # Based on: https://docs.sqlalchemy.org/en/20/dialects/sqlite.html#pysqlite
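The note above (that the engine can be created before any database file exists) can be checked with a short sketch. The temporary folder, the `demo.db` filename, and the `demo_table` name are hypothetical and chosen only so that nvcu_db.db is left untouched; the sketch assumes that SQLAlchemy does not open the SQLite file until a connection is first requested.

```python
import os
import tempfile

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical throwaway location for the demo database:
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'demo.db')
demo_engine = create_engine(f'sqlite:///{demo_path}')

# No connection has been opened yet, so the file should not exist:
exists_before_write = os.path.exists(demo_path)

# Writing a table opens a connection, at which point SQLite
# creates the file:
pd.DataFrame({'x': [1, 2, 3]}).to_sql(
    'demo_table', con=demo_engine, index=False)
exists_after_write = os.path.exists(demo_path)

# Releasing the engine's pooled connections:
demo_engine.dispose()
print(exists_before_write, exists_after_write)
```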


In [3]:
student_count = 2**14
student_count

16384

# Creating fictional database data

The [Faker documentation for the 'en_US' (US English) locale](https://faker.readthedocs.io/en/master/locales/en_US.html) was a useful resource in writing this code.

## Creating lists of names:

We'll create equal numbers of female and male first names, then concatenate these lists to create a single list of first names.

In [4]:
# It's convenient to also create a variable for the value equal
# to half of the student count. (Dividing the student count by
# 2 with / produces a float, so we'll use floor division (//) to get
# an int that can be passed to range() within list comprehensions.)
half_student_count = student_count // 2
half_student_count

8192

In [5]:
female_first_names = [
    fake.first_name_female() 
    for i in range(half_student_count)]
male_first_names = [
    fake.first_name_male() for i in range(half_student_count)]
first_names = female_first_names + male_first_names
last_names = [
    fake.last_name() for i in range(student_count)]

In [6]:
# In order to make our genders match our names, we'll 
# make the first half of our gender list female and the second 
# half male (as the first and second halves of our first names
# list contain female and male names, respectively).
genders = (['F' for i in range(half_student_count)] 
+ ['M' for i in range(half_student_count)])
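As a possible alternative to the two list comprehensions above, `np.repeat()` can build the same block-ordered list in one call. This is only a sketch; `half_demo_count` is a hypothetical small stand-in for `half_student_count`.

```python
import numpy as np

half_demo_count = 4  # hypothetical small value for illustration

# The approach used in this notebook:
genders_from_lists = (['F' for i in range(half_demo_count)]
                      + ['M' for i in range(half_demo_count)])

# np.repeat() repeats each element the given number of times, keeping
# the 'F' entries ahead of the 'M' entries:
genders_from_repeat = np.repeat(['F', 'M'], half_demo_count).tolist()

print(genders_from_repeat)
print(genders_from_lists == genders_from_repeat)
```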

## Initializing our current enrollment table:

In [7]:
df_curr_enrollment = pd.DataFrame(
    index = np.arange(0,student_count), 
    data = {'first_name':first_names,
            'last_name':last_names,
            'gender':genders})
df_curr_enrollment

Unnamed: 0,first_name,last_name,gender
0,Veronica,Wilkerson,F
1,Michele,Jones,F
2,Sheri,Jennings,F
3,Erika,Nelson,F
4,Natalie,Horton,F
...,...,...,...
16379,Michael,Lozano,M
16380,Charles,Clark,M
16381,Cody,Allen,M
16382,Edwin,Sharp,M


## Creating matriculation years:

In order to simulate increasing enrollment over time, weights were added to the rng.choice() call so that recent years would appear more frequently.

In [8]:
rng.choice(
    [2020, 2021, 2022, 2023], 
    p=[0.20, 0.22, 0.25, 0.33], size = student_count)

array([2023, 2023, 2020, ..., 2023, 2022, 2023])
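One way to confirm that the weights behave as intended is to compare each year's observed share of the draws with the probability passed to `rng.choice()`. The sketch below uses a separate generator with a hypothetical seed (0), so its draws will not match the ones shown in this notebook.

```python
import numpy as np
import pandas as pd

demo_rng = np.random.default_rng(0)  # hypothetical seed
years = [2020, 2021, 2022, 2023]
weights = [0.20, 0.22, 0.25, 0.33]
draws = demo_rng.choice(years, p=weights, size=2**14)

# Calculating each year's share of the simulated students:
observed_shares = pd.Series(draws).value_counts(
    normalize=True).sort_index()

# Placing the requested weights next to the observed shares:
comparison = pd.DataFrame(
    {'weight': weights,
     'observed_share': observed_shares.to_numpy()},
    index=years)
print(comparison.round(3))
```

With this many draws, the observed shares should land close to (though not exactly on) the requested weights.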

In [9]:
df_curr_enrollment['matriculation_year'] = rng.choice(
    [2020, 2021, 2022, 2023], p=[0.20, 0.22, 0.25, 0.33], 
    size=student_count)
# https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.choice.html
df_curr_enrollment

Unnamed: 0,first_name,last_name,gender,matriculation_year
0,Veronica,Wilkerson,F,2020
1,Michele,Jones,F,2020
2,Sheri,Jennings,F,2020
3,Erika,Nelson,F,2022
4,Natalie,Horton,F,2023
...,...,...,...,...
16379,Michael,Lozano,M,2020
16380,Charles,Clark,M,2020
16381,Cody,Allen,M,2022
16382,Edwin,Sharp,M,2020


## Creating student IDs:
Student IDs will use the format matriculation_year-matriculation_number. matriculation_number represents the order in which students enrolled for a given year; these numbers are unique within each year, but not between years. This number can then be combined with students' matriculation years to form a unique ID.

In [10]:
# Calculating matriculation numbers by grouping the DataFrame by
# matriculation year, then assigning each student within each year a unique
# number:
# (This can be achieved via df.groupby() and df.rank(). See
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
# and https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rank.html )
df_curr_enrollment['matriculation_number'] = df_curr_enrollment.groupby(
    'matriculation_year')['matriculation_year'].rank(
    method = 'first').astype('int')
df_curr_enrollment

Unnamed: 0,first_name,last_name,gender,matriculation_year,matriculation_number
0,Veronica,Wilkerson,F,2020,1
1,Michele,Jones,F,2020,2
2,Sheri,Jennings,F,2020,3
3,Erika,Nelson,F,2022,1
4,Natalie,Horton,F,2023,1
...,...,...,...,...,...
16379,Michael,Lozano,M,2020,3287
16380,Charles,Clark,M,2020,3288
16381,Cody,Allen,M,2022,3999
16382,Edwin,Sharp,M,2020,3289
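To make the grouping logic easier to follow, here is a minimal sketch on a six-row DataFrame (the values are made up for illustration). It also shows `groupby().cumcount()`, which could serve as an alternative to `rank(method='first')` for this purpose, since both number rows by their order of appearance within each group.

```python
import pandas as pd

# A tiny, made-up set of matriculation years:
df_demo = pd.DataFrame(
    {'matriculation_year': [2020, 2021, 2020, 2022, 2021, 2020]})

# The approach used above: ranking within each year, with ties
# broken by order of appearance.
df_demo['number_via_rank'] = df_demo.groupby(
    'matriculation_year')['matriculation_year'].rank(
    method='first').astype('int')

# An alternative: cumcount() numbers the rows in each group
# starting from 0, so we add 1.
df_demo['number_via_cumcount'] = df_demo.groupby(
    'matriculation_year').cumcount() + 1

print(df_demo)
```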


In [11]:
# Adding matriculation years, matriculation numbers, and hyphens together
# to create student IDs:
df_curr_enrollment['student_id'] = (
    df_curr_enrollment['matriculation_year'].astype('str') 
    + '-' 
    + df_curr_enrollment['matriculation_number'].astype('str'))
# Sorting the DataFrame by matriculation year and matriculation number:
# (The DataFrame could also be sorted by student_id, but because these numbers
# would be interpreted as strings, ids like 2020-10 would appear 
# in front of ones like 2020-2.)
df_curr_enrollment.sort_values(
    ['matriculation_year', 'matriculation_number'], inplace = True)
df_curr_enrollment.reset_index(drop=True,inplace=True)
df_curr_enrollment

Unnamed: 0,first_name,last_name,gender,matriculation_year,matriculation_number,student_id
0,Veronica,Wilkerson,F,2020,1,2020-1
1,Michele,Jones,F,2020,2,2020-2
2,Sheri,Jennings,F,2020,3,2020-3
3,Frances,Bright,F,2020,4,2020-4
4,Sarah,Calhoun,F,2020,5,2020-5
...,...,...,...,...,...,...
16379,Tyler,Huynh,M,2023,5439,2023-5439
16380,Eric,Molina,M,2023,5440,2023-5440
16381,Adam,Braun,M,2023,5441,2023-5441
16382,Brent,Fry,M,2023,5442,2023-5442
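The sorting caveat mentioned in the code comments above can be seen with a handful of made-up IDs. The `id_sort_key()` helper is hypothetical and is included only to show what a numeric sort on the ID strings themselves would require.

```python
student_ids = ['2020-2', '2020-10', '2020-1', '2021-3']

# Sorting the IDs as strings compares them character by character,
# so '2020-10' lands ahead of '2020-2':
string_sorted = sorted(student_ids)
print(string_sorted)
# ['2020-1', '2020-10', '2020-2', '2021-3']


def id_sort_key(student_id):
    """Split an ID into integer (year, number) parts for sorting."""
    year, number = student_id.split('-')
    return int(year), int(number)


numeric_sorted = sorted(student_ids, key=id_sort_key)
print(numeric_sorted)
# ['2020-1', '2020-2', '2020-10', '2021-3']
```

Sorting by the separate integer columns, as the notebook does, avoids the need for a helper like this.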


## Assigning students to different colleges:

(I used Wikipedia's ['List of patron saints by occupation and activity'](https://en.wikipedia.org/wiki/List_of_patron_saints_by_occupation_and_activity) page to determine after which saint each college would be named.)

NVCU has four different colleges:

1. St. Luke's, a humanities college. (St. Luke is one of the patron saints of artists.) Abbreviation: STL
2. St. Benedict's, a STEM college. (St. Benedict is one of the patron saints of engineers.) Abbreviation: STB
3. St. Matthew's, a business college. (St. Matthew is the patron saint of accountants.) Abbreviation: STM
4. St. Catherine's, a health college. (St. Catherine of Alexandria is one of the patron saints of nurses.) Abbreviation: STC

We can use rng.choice() to assign students to different colleges. However, to make the data more interesting, we'll have one college (STL) increase in popularity over time; another (STC) decrease in popularity; and the two remaining colleges remain roughly constant in popularity. We can simulate these changes by (1) creating filtered versions of the DataFrame for each year; (2) calling rng.choice() with different probability sets for each year in order to create the 'college' column; and (3) recreating df_curr_enrollment by adding these year-specific DataFrames back together.

In [12]:
df_list = []
for year in df_curr_enrollment['matriculation_year'].unique():
    print(f"Now adding in college enrollments for {year}.")
    df = df_curr_enrollment.query("matriculation_year == @year").copy()
    if year == 2020:
        probabilities = [0.15, 0.25, 0.3, 0.3]
    elif year == 2021:
        probabilities = [0.19, 0.26, 0.29, 0.26]
    elif year == 2022:
        probabilities = [0.25, 0.27, 0.27, 0.21]
    elif year == 2023:
        probabilities = [0.27, 0.25, 0.31, 0.17]
    else:
        raise ValueError(
            f"A probability list needs to be added in for {year}.")
    df['college'] = rng.choice(
        ['STL', 'STB', 'STM', 'STC'], 
        p = probabilities, size = len(df))
    df_list.append(df)
df_curr_enrollment = pd.concat(df_list)

Now adding in college enrollments for 2020.
Now adding in college enrollments for 2021.
Now adding in college enrollments for 2022.
Now adding in college enrollments for 2023.
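The sketch below shows one possible variation on the loop above: the per-year probabilities are stored in a dictionary rather than an if/elif chain, and `pd.crosstab()` is then used to check each college's share by year. The seed, the 8,000-row demo table, and the variable names are hypothetical; the probabilities are the ones used above.

```python
import numpy as np
import pandas as pd

demo_rng = np.random.default_rng(0)  # hypothetical seed
colleges = ['STL', 'STB', 'STM', 'STC']
college_probabilities = {
    2020: [0.15, 0.25, 0.3, 0.3],
    2021: [0.19, 0.26, 0.29, 0.26],
    2022: [0.25, 0.27, 0.27, 0.21],
    2023: [0.27, 0.25, 0.31, 0.17]}

# A made-up enrollment table with evenly weighted years:
df_demo = pd.DataFrame({'matriculation_year': demo_rng.choice(
    list(college_probabilities), size=8000)})
df_demo['college'] = ''

# Filling in colleges one year at a time:
for year, probabilities in college_probabilities.items():
    in_year = df_demo['matriculation_year'] == year
    df_demo.loc[in_year, 'college'] = demo_rng.choice(
        colleges, p=probabilities, size=int(in_year.sum()))

# Each row of this table shows one year's college shares:
college_shares = pd.crosstab(
    df_demo['matriculation_year'], df_demo['college'],
    normalize='index')
print(college_shares.round(2))
```

In the resulting table, STL's share should rise from 2020 to 2023 while STC's share falls, mirroring the probabilities that were supplied.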


The commented-out cell below shows an alternative approach to assigning colleges to each student. Because it iterates through each row in the DataFrame, it took 2.25 seconds to run on my laptop versus 0.016 seconds for the method shown above; in other words, the method above was more than 100 times faster.

In [13]:
# df_curr_enrollment['college'] = ''
# college_col = df_curr_enrollment.columns.get_loc('college')
# for i in range(len(df_curr_enrollment)):
#     year = df_curr_enrollment.iloc[i]['matriculation_year']
#     if year == 2020:
#         probabilities = [0.15, 0.25, 0.3, 0.3]
#     elif year == 2021:
#         probabilities = [0.19, 0.26, 0.29, 0.26]
#     elif year == 2022:
#         probabilities = [0.25, 0.27, 0.27, 0.21]
#     elif year == 2023:
#         probabilities = [0.27, 0.25, 0.31, 0.17]
#     df_curr_enrollment.iloc[i, college_col] = rng.choice(
#         ['STL', 'STB', 'STM', 'STC'], p = probabilities)

In [14]:
df_curr_enrollment

Unnamed: 0,first_name,last_name,gender,matriculation_year,matriculation_number,student_id,college
0,Veronica,Wilkerson,F,2020,1,2020-1,STM
1,Michele,Jones,F,2020,2,2020-2,STC
2,Sheri,Jennings,F,2020,3,2020-3,STB
3,Frances,Bright,F,2020,4,2020-4,STB
4,Sarah,Calhoun,F,2020,5,2020-5,STB
...,...,...,...,...,...,...,...
16379,Tyler,Huynh,M,2023,5439,2023-5439,STB
16380,Eric,Molina,M,2023,5440,2023-5440,STB
16381,Adam,Braun,M,2023,5441,2023-5441,STC
16382,Brent,Fry,M,2023,5442,2023-5442,STB


## Assigning additional year-related values:

In [16]:
df_curr_enrollment['class_of'] = df_curr_enrollment['matriculation_year'] + 4
# Students who matriculated in 2020 are now in their fourth year, so
# they are seniors; those who matriculated in 2023 are freshmen.
df_curr_enrollment['level'] = df_curr_enrollment['matriculation_year'].map(
    {2020:'Senior',2021:'Junior',
     2022:'Sophomore',2023:'Freshman'})
# Creating an integer-based equivalent to 'level' (0 for freshmen
# through 3 for seniors):
df_curr_enrollment['level_for_sorting'] = (
    2023 - df_curr_enrollment['matriculation_year'])
df_curr_enrollment

Unnamed: 0,first_name,last_name,gender,matriculation_year,matriculation_number,student_id,college,class_of,level,level_for_sorting
0,Veronica,Wilkerson,F,2020,1,2020-1,STM,2024,Senior,3
1,Michele,Jones,F,2020,2,2020-2,STC,2024,Senior,3
2,Sheri,Jennings,F,2020,3,2020-3,STB,2024,Senior,3
3,Frances,Bright,F,2020,4,2020-4,STB,2024,Senior,3
4,Sarah,Calhoun,F,2020,5,2020-5,STB,2024,Senior,3
...,...,...,...,...,...,...,...,...,...,...
16379,Tyler,Huynh,M,2023,5439,2023-5439,STB,2027,Freshman,0
16380,Eric,Molina,M,2023,5440,2023-5440,STB,2027,Freshman,0
16381,Adam,Braun,M,2023,5441,2023-5441,STC,2027,Freshman,0
16382,Brent,Fry,M,2023,5442,2023-5442,STB,2027,Freshman,0


## Saving this table to our NVCU database:

(This operation will also create our database file, nvcu_db.db (the path specified when we first created our database engine), if it does not already exist.)

In [17]:
df_curr_enrollment.to_sql(
    'curr_enrollment', 
    con=e, if_exists='replace', index=False)

16384

To demonstrate that the above operation was successful, we can read in a copy of this table via pd.read_sql():

In [18]:
pd.read_sql('curr_enrollment', con = e)

Unnamed: 0,first_name,last_name,gender,matriculation_year,matriculation_number,student_id,college,class_of,level,level_for_sorting
0,Veronica,Wilkerson,F,2020,1,2020-1,STM,2024,Senior,3
1,Michele,Jones,F,2020,2,2020-2,STC,2024,Senior,3
2,Sheri,Jennings,F,2020,3,2020-3,STB,2024,Senior,3
3,Frances,Bright,F,2020,4,2020-4,STB,2024,Senior,3
4,Sarah,Calhoun,F,2020,5,2020-5,STB,2024,Senior,3
...,...,...,...,...,...,...,...,...,...,...
16379,Tyler,Huynh,M,2023,5439,2023-5439,STB,2027,Freshman,0
16380,Eric,Molina,M,2023,5440,2023-5440,STB,2027,Freshman,0
16381,Adam,Braun,M,2023,5441,2023-5441,STC,2027,Freshman,0
16382,Brent,Fry,M,2023,5442,2023-5442,STB,2027,Freshman,0


This same table can also be saved as a standalone .csv file (thus making it easier to examine via a spreadsheet editor):

In [20]:
df_curr_enrollment.to_csv('curr_enrollment.csv', index = False)

Note that a SQLAlchemy engine can be used as the 'con' argument within both read_sql() and to_sql(). The [read_sql() documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html) states that 'con' needs to be a 'SQLAlchemy connectable,' and [the source code within sql.py](https://github.com/pandas-dev/pandas/blob/v2.2.2/pandas/io/sql.py#L570-L743) specifies that a 'SQLAlchemy connectable' can be either an engine or a connection.
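As a brief illustration of the engine-versus-connection point, the sketch below reads the same table twice: once by passing an engine and once by passing a connection opened from that engine. The temporary database file, the `demo_enrollment` table name, and the two-row DataFrame are hypothetical.

```python
import os
import tempfile

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical throwaway database, kept separate from nvcu_db.db:
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.db')
demo_engine = create_engine(f'sqlite:///{demo_path}')

df_demo = pd.DataFrame(
    {'student_id': ['2020-1', '2020-2'],
     'college': ['STM', 'STC']})
df_demo.to_sql(
    'demo_enrollment', con=demo_engine,
    if_exists='replace', index=False)

# Passing the engine itself as 'con':
df_via_engine = pd.read_sql('demo_enrollment', con=demo_engine)

# Passing a connection created from that engine as 'con':
with demo_engine.connect() as connection:
    df_via_connection = pd.read_sql('demo_enrollment', con=connection)

demo_engine.dispose()
print(df_via_engine.equals(df_via_connection))
```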

# More tables will be added in the future!