# Python for Nonprofits Part x: Spreadsheet Operations

By Kenneth Burchfiel

Released under the MIT license

This notebook provides an introduction to executing spreadsheet operations in Python. The benefit of performing these tasks in Python (rather than Excel, Google Sheets, or another spreadsheet program) is that, once you have these tasks scripted, you can quickly rerun them whenever the original data gets updated*. You can even have your computer run the script on a daily or hourly basis, thus saving you from busywork and freeing up your time for more interesting tasks.

For example, suppose leaders at your school network would like to see an overview of the network's enrollment each day. One way to accomplish this task would be to retrieve data from your database each day, paste it into Excel or Google Sheets, pivot the data, and then share the output with them. However, you could accomplish these same steps much more quickly in Python. This notebook will show you how!

\* There are certainly ways to automate Excel tasks as well (e.g. using Visual Basic). I don't have any experience with Visual Basic, so I'm not the best person to compare these two tools; however, I have no doubt that learning it would take some time, and given Python's versatility and power, I would recommend applying that time to learning Python instead. (You can get an estimate of the world's interest in Python versus Visual Basic by checking out the [TIOBE index](https://www.tiobe.com/tiobe-index/).)

Importing the libraries we'll need for this project:

In [1]:
import time
start_time = time.time() # Allows the program's runtime to be measured
import pandas as pd
import sqlalchemy

# Part 1: Importing data

## Connecting to our SQLite database:

This local SQLite database was created using the database_generator.ipynb code found in supplemental/db_generator. The steps for connecting to an online database are quite similar; for guidance on this process, visit the [app_functions_and_variables.py](https://github.com/kburchfiel/dash_school_dashboard/blob/main/dsd/app_functions_and_variables.py) file within my [Dash School Dashboard](https://github.com/kburchfiel/dash_school_dashboard) project.

In [2]:
engine = sqlalchemy.create_engine(
'sqlite:///'+'../data/network_database.db')
# Based on:
#  https://docs.sqlalchemy.org/en/13/dialects/sqlite.html#connect-strings

engine

Engine(sqlite:///../data/network_database.db)

## Reviewing a list of all tables within our database:

In [3]:
pd.read_sql("Select name from sqlite_schema", con = engine)

Unnamed: 0,name
0,curr_enrollment
1,test_results
2,grad_outcomes


## Retrieving all data from the curr_enrollment table and reading it into a Pandas DataFrame

DataFrames are essentially spreadsheets that can be manipulated and summarized within Python. It's easy to convert them to .csv or .xlsx files (or vice versa). Lots of data analysis tasks within Python involve DataFrames, so they will show up very often within Python for Nonprofits.
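As a minimal sketch of the .csv conversion mentioned above, the following example builds a small DataFrame, converts it to CSV text, and reads it back. The rows are hypothetical (only the column names mirror curr_enrollment), and `io.StringIO` stands in for a file on disk so that nothing gets written. `to_excel()` and `read_excel()` work in a similar way but need an extra package such as openpyxl.

```python
import io
import pandas as pd

# Hypothetical sample rows; the column names mirror those in curr_enrollment
df_sample = pd.DataFrame({
    'Student_ID': [1, 2, 3],
    'School': ['CA', 'SA', 'CA'],
    'Grade': ['K', '1', '12']})

# Calling to_csv() without a file path returns the CSV text as a string
csv_text = df_sample.to_csv(index=False)

# read_csv() accepts a file path or any file-like object. Passing
# dtype={'Grade': str} keeps grades as text ('1' rather than the integer 1)
df_round_trip = pd.read_csv(io.StringIO(csv_text), dtype={'Grade': str})
```

To save to an actual file instead, you could pass a path (e.g. a hypothetical 'sample.csv') as the first argument of `to_csv()`.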

In [4]:
df_curr_enrollment = pd.read_sql(
    "Select * from curr_enrollment", con = engine)

# 'Select * from curr_enrollment' is a SQL command that imports all
# data from the curr_enrollment table within our SQLite database.
# If we wanted to import only a few columns, we could replace the *
# with those specific column names (e.g. Student_ID, School, Grade);
# or, if we wanted to import data for just one school, we could enter:
# "Select * from curr_enrollment where School = 'CA'".

# SQL is a language of its own and not the main focus of Python for Nonprofits,
# but Pandas' read_sql and to_sql functions will allow you to perform
# many SQL-related tasks with only an elementary understanding of how it works.

# Some of the Python code in this notebook could be replaced with SQL
# code, which could actually speed up the program's runtime (since you wouldn't
# need to import as much data within your initial SQL query), but this
# notebook's purpose is to demonstrate how to use Pandas, not SQL.

df_curr_enrollment

Unnamed: 0,Student_ID,First_Name,Last_Name,Full_School_Name,School,Grade,Gender,Race,Ethnicity,Street,City,State,Zip,Lat,Lon,Address,Students,Grade_for_Sorting
0,42646,Jeanne,Bell,Chestnut Academy,CA,1,Female,African American,Non-Hispanic,200 N PINECREST LN,BRISTOL,VA,24201,36.615974,-82.190919,"200 N PINECREST LN, BRISTOL, VA 24201",1,1
1,41632,Theodore,Brown,Chestnut Academy,CA,1,Male,African American,Non-Hispanic,1301 Whitehead Rd,Richmond,VA,23225,37.489169,-77.508759,"1301 Whitehead Rd, Richmond, VA 23225",1,1
2,42586,Lynn,Callahan,Chestnut Academy,CA,1,Female,African American,Hispanic,11230 WAPLES MILL RD STE 100,FAIRFAX,VA,22030,38.858060,-77.334451,"11230 WAPLES MILL RD STE 100, FAIRFAX, VA 22030",1,1
3,40108,Edward,Carrillo,Chestnut Academy,CA,1,Male,White,Hispanic,901 Rose Hill Drive,Charlottesville,VA,22903,38.039900,-78.486600,"901 Rose Hill Drive, Charlottesville, VA 22903",1,1
4,43600,Sara,Carson,Chestnut Academy,CA,1,Female,Asian,Hispanic,1825 Wenonah Avenue,Pearisburg,VA,24134,37.327800,-80.704500,"1825 Wenonah Avenue, Pearisburg, VA 24134",1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3995,42085,Jessica,Williams,Sycamore Academy,SA,K,Female,American Indian,Non-Hispanic,100 Cedarmeade Ave,Winchester,VA,22601,39.151300,-78.180800,"100 Cedarmeade Ave, Winchester, VA 22601",1,0
3996,42179,Kim,Williams,Sycamore Academy,SA,K,Female,American Indian,Non-Hispanic,7719D FULLERTON RD,SPRINGFIELD,VA,22153,38.741592,-77.211129,"7719D FULLERTON RD, SPRINGFIELD, VA 22153",1,0
3997,41677,Micheal,Williams,Sycamore Academy,SA,K,Male,White,Hispanic,361 Walnut St,Warsaw,VA,22572,37.945528,-76.744655,"361 Walnut St, Warsaw, VA 22572",1,0
3998,42238,Susan,Williams,Sycamore Academy,SA,K,Female,African American,Non-Hispanic,5000 W NORFOLK RD,PORTSMOUTH,VA,23703,36.869169,-76.378579,"5000 W NORFOLK RD, PORTSMOUTH, VA 23703",1,0
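The two query variations described in the comments of the cell above (selecting specific columns and filtering by school) can be sketched as follows. To keep the example self-contained, it uses Python's built-in sqlite3 module with an in-memory database rather than the SQLAlchemy engine and network_database.db, and the rows are hypothetical.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database (an assumption made so that this sketch
# doesn't depend on network_database.db)
demo_con = sqlite3.connect(':memory:')

# Hypothetical rows with a subset of curr_enrollment's columns
pd.DataFrame({
    'Student_ID': [1, 2, 3, 4],
    'School': ['CA', 'CA', 'SA', 'SA'],
    'Grade': ['K', '1', 'K', '12']}).to_sql(
    'curr_enrollment', con=demo_con, index=False)

# Importing only two columns
df_two_columns = pd.read_sql(
    "Select Student_ID, Grade from curr_enrollment", con=demo_con)

# Importing all columns, but only for one school
df_ca_only = pd.read_sql(
    "Select * from curr_enrollment where School = 'CA'", con=demo_con)

demo_con.close()
```

The same SQL strings should also work with `con = engine` from earlier in this notebook.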


# Part 2: Analyzing this data

Let's say that you want to determine the number of students in each grade at each school. You can do so easily using the pivot_table function within Pandas. In the following code, 'index' represents the variables by which you want to group a given metric; 'values' shows the items that you wish to analyze; and 'aggfunc' shows how you wish to analyze them. In this case, we want to count the number of students belonging to each school-grade pair, so we'll pass ['School', 'Grade_for_Sorting', 'Grade'] to index; 'Students' (a column that contains the value 1 for each student) to values; and 'sum' to aggfunc.

('Grade_for_Sorting' is added before 'Grade' so that the pivot output will sort grades in the correct ascending order (e.g. 'K', '1', '2' . . . '11', '12'). Because the 'Grade' column uses an object data type, its default sort order would be alphabetical (e.g. '1', '11', '12' . . . '8', '9', 'K'), which certainly isn't what we want. Therefore, we'll sort the data by a column that stores all grades as integers *and* sets K equal to 0, thus eliminating the need to attempt an alphabetical sort.)

In [5]:
df_school_grade_pivot = df_curr_enrollment.pivot_table(
    index = ['School', 'Grade_for_Sorting', 'Grade'], 
    values = 'Students', aggfunc = 'sum').reset_index()

# Here's what the first 15 rows of the DataFrame look like:

df_school_grade_pivot.head(15) # .head(15) allows us to view the first
# 15 rows of data; similarly, .tail(5) would let us see the final 5 rows.

Unnamed: 0,School,Grade_for_Sorting,Grade,Students
0,CA,0,K,90
1,CA,1,1,71
2,CA,2,2,76
3,CA,3,3,61
4,CA,4,4,85
5,CA,5,5,66
6,CA,6,6,74
7,CA,7,7,65
8,CA,8,8,75
9,CA,9,9,77
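To see why 'Grade_for_Sorting' matters, here is a small sketch that pivots a hypothetical five-student table twice: once on 'Grade' alone and once with 'Grade_for_Sorting' placed ahead of it. The data is made up for illustration; only the column names come from curr_enrollment.

```python
import pandas as pd

# Hypothetical mini enrollment table: one row per student
df_demo = pd.DataFrame({
    'School': ['CA', 'CA', 'CA', 'CA', 'CA'],
    'Grade': ['K', '2', '10', '10', '2'],
    'Grade_for_Sorting': [0, 2, 10, 10, 2],
    'Students': [1, 1, 1, 1, 1]})

# Pivoting on 'Grade' alone sorts the grades as text
df_text_sort = df_demo.pivot_table(
    index=['School', 'Grade'],
    values='Students', aggfunc='sum').reset_index()

# Adding 'Grade_for_Sorting' ahead of 'Grade' sorts them numerically
df_numeric_sort = df_demo.pivot_table(
    index=['School', 'Grade_for_Sorting', 'Grade'],
    values='Students', aggfunc='sum').reset_index()

print(df_text_sort['Grade'].tolist())     # ['10', '2', 'K']
print(df_numeric_sort['Grade'].tolist())  # ['K', '2', '10']
```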


To determine schoolwide student counts, we can pass 'School' as our 'index' argument:

In [11]:
# The following pivot table isn't saved to a variable, so its output
# won't be accessible in later parts of the code. This approach works fine
# if you just need to check a set of values or test out a potential change
# to a DataFrame.

df_curr_enrollment.pivot_table(
    index = 'School', 
    values = 'Students', aggfunc = 'sum').reset_index()

Unnamed: 0,School,Students
0,CA,964
1,DA,977
2,HA,1038
3,SA,1021


We don't need a pivot table in order to determine our network-wide enrollment; instead, we can just use Series.sum():

(Series is the name Pandas uses for a column within a DataFrame. Series can also be standalone objects, but you'll often find them within larger tables.)

In [9]:
df_curr_enrollment['Students'].sum()

4000

# Here with editing:

Also provide examples of:

1. query() (e.g. filtering the table to include school-grade pairs with particularly high or low enrollments)
2. np.select()
3. np.where()
4. Merging (you could bring in data from another table in order to accomplish this)
5. Saving to a .csv file (also note that you could save data to Google Sheets; reference that part of PFN)
6. Adding strings together (e.g. for making school/grade pairs)
7. Renaming columns
8. Replacing values (e.g. replacing 'CA' with its full name)

And so on!
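As a rough starting point, several of the operations listed above could look something like the following. This is only a sketch: the four enrollment counts are copied from the pivot outputs in this notebook, but the table and column names introduced here (df_counts, df_names, 'Size', 'Is_K', 'School_Grade', 'Enrollment') and the 80- and 60-student cutoffs are hypothetical. The comment numbers match the list items.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# School-grade counts for two schools and two grades
df_counts = pd.DataFrame({
    'School': ['CA', 'CA', 'SA', 'SA'],
    'Grade': ['K', '1', 'K', '1'],
    'Students': [90, 71, 72, 85]})

# 1. query(): keep school-grade pairs with at least 80 students
df_large = df_counts.query("Students >= 80")

# 2. np.select(): choose among several labels. Conditions are checked
# in order, and the first match wins
df_counts['Size'] = np.select(
    [df_counts['Students'] >= 80, df_counts['Students'] >= 60],
    ['Large', 'Medium'], default='Small')

# 3. np.where(): choose between two labels
df_counts['Is_K'] = np.where(df_counts['Grade'] == 'K', 'Yes', 'No')

# 4. Merging: bring in full school names from a second table
df_names = pd.DataFrame({
    'School': ['CA', 'SA'],
    'Full_School_Name': ['Chestnut Academy', 'Sycamore Academy']})
df_merged = df_counts.merge(df_names, on='School', how='left')

# 6. Adding strings together to make school/grade pairs
df_merged['School_Grade'] = df_merged['School'] + '_' + df_merged['Grade']

# 7. Renaming columns
df_merged = df_merged.rename(columns={'Students': 'Enrollment'})

# 8. Replacing values (here, replacing 'CA' with its full name)
df_merged['School'] = df_merged['School'].replace(
    {'CA': 'Chestnut Academy'})

# 5. Saving to a .csv file. A temporary folder is used so that this
# sketch doesn't leave a file behind
with tempfile.TemporaryDirectory() as temp_dir:
    csv_path = os.path.join(temp_dir, 'school_grade_counts.csv')
    df_merged.to_csv(csv_path, index=False)
    df_reloaded = pd.read_csv(csv_path)
```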

## Appendix: alternative grade sorting method

In [6]:
# We could also have ordered our pivot table results by grade in ascending chronological order (i.e. 'K', '1' . . . '11', '12') by creating a dictionary that stores 'K' grades as 0 and all other grades as integers, then sorting the DataFrame by these dictionary values. However, this approach works best if we *only* need to sort the pivot table by grade.

# The following cell creates the dictionary that will be used to sort grades in their proper ascending order (K to 12). It does so via a dictionary comprehension; see https://docs.python.org/3/tutorial/datastructures.html#dictionaries for more information on this approach.

grade_sorting_map = {
    grade:0 if grade == 'K' else int(grade) 
    for grade in df_school_grade_pivot['Grade']}

df_school_grade_pivot_alt_sort = df_curr_enrollment.pivot_table(
    index = ['School', 'Grade'], 
    values = 'Students', aggfunc = 'sum').reset_index()

df_school_grade_pivot_alt_sort.sort_values(['School', 'Grade'], 
    key = lambda x: x.map(grade_sorting_map), inplace = True)

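# In the above code, 'inplace = True' applies the sort to the DataFrame
# itself. Without that argument, sort_values() would return a sorted copy
# and leave df_school_grade_pivot_alt_sort in its original order.

# Note that sort_values() applies the 'key' function to each sort column
# separately. No school code appears in grade_sorting_map, so every 'School'
# value maps to NaN, and the rows end up ordered by grade rather than by
# school and then grade.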

# See 
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html and 
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html for more information
# on map() and the 'key' argument within sort_values(), respectively.


df_school_grade_pivot_alt_sort

Unnamed: 0,School,Grade,Students
12,CA,K,90
25,DA,K,93
38,HA,K,74
51,SA,K,72
0,CA,1,71
13,DA,1,56
26,HA,1,84
39,SA,1,85
4,CA,2,76
17,DA,2,75
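The output above is ordered by grade only, since the key function is applied to 'School' as well as 'Grade'. If you wanted a school-then-grade order with this approach, one option (a sketch with hypothetical data, assuming a recent Pandas version in which each column passed to the key function keeps its name) is to remap only the 'Grade' column and pass 'School' through unchanged:

```python
import pandas as pd

# Hypothetical pivot output with grades stored as text
df_demo = pd.DataFrame({
    'School': ['SA', 'CA', 'CA', 'SA', 'CA'],
    'Grade': ['10', 'K', '10', 'K', '2'],
    'Students': [70, 90, 60, 72, 76]})

grade_sorting_map = {
    grade: 0 if grade == 'K' else int(grade)
    for grade in df_demo['Grade']}

# sort_values() calls the key function once per sort column, so the
# lambda checks the column's name before applying the grade map
df_sorted = df_demo.sort_values(
    ['School', 'Grade'],
    key=lambda col: col.map(grade_sorting_map)
    if col.name == 'Grade' else col)

print(df_sorted[['School', 'Grade']].values.tolist())
# [['CA', 'K'], ['CA', '2'], ['CA', '10'], ['SA', 'K'], ['SA', '10']]
```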
