# Data Retrieval

By Kenneth Burchfiel

Released under the MIT License

One of Python's key strengths is that it can work with a wide variety of data. This script will demonstrate how to use Python's Pandas library to import data from .csv files, .xlsx files, SQL tables, and HTML pages; however, *many* other data types are supported, either by Python itself or via additional libraries. Later sections of Python for Nonprofits will introduce additional data sources.

In [1]:
import pandas as pd
from sqlalchemy import create_engine
# The above code can be found at
# https://docs.sqlalchemy.org/en/20/core/engines.html .

# Importing .csv data

Pandas' [read_csv()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function simplifies the process of reading .csv data into your Python script. Here's a simple example:

In [2]:
df_curr_enrollment_csv = pd.read_csv(
    '../Appendix/curr_enrollment.csv')
df_curr_enrollment_csv.head()

Unnamed: 0,first_name,last_name,gender,matriculation_year,matriculation_number,student_id,date_of_birth,college,class_of,level,level_for_sorting
0,Rachel,Silva,F,2020,1,2020-1,2002-12-16,STC,2024,Fr,0
1,Brooke,Bradford,F,2020,2,2020-2,2002-09-26,STM,2024,Fr,0
2,Angela,Cameron,F,2020,3,2020-3,2002-05-18,STC,2024,Fr,0
3,Tonya,Hampton,F,2020,4,2020-4,2002-11-15,STC,2024,Fr,0
4,Tammy,Gardner,F,2020,5,2020-5,2002-06-21,STM,2024,Fr,0


However, you'll sometimes need to pass additional arguments to read_csv() in order to import your data correctly. For example, if a separator other than a comma was used, you'll want to specify that separator via the 'sep' argument.

The following example shows what you'll see if you try to use read_csv to import *tab*-separated .csv data:

In [3]:
df_curr_enrollment_tab_csv = pd.read_csv(
    'curr_enrollment_tab_separated.csv')
df_curr_enrollment_tab_csv.head()

Unnamed: 0,first_name\tlast_name\tgender\tmatriculation_year\tmatriculation_number\tstudent_id\tdate_of_birth\tcollege\tclass_of\tlevel\tlevel_for_sorting
0,Rachel\tSilva\tF\t2020\t1\t2020-1\t2002-12-16\...
1,Brooke\tBradford\tF\t2020\t2\t2020-2\t2002-09-...
2,Angela\tCameron\tF\t2020\t3\t2020-3\t2002-05-1...
3,Tonya\tHampton\tF\t2020\t4\t2020-4\t2002-11-15...
4,Tammy\tGardner\tF\t2020\t5\t2020-5\t2002-06-21...


The '\t' strings within the column and field values *and* the lack of column separators are dead giveaways that this file was tab separated. To correctly import this information, you'll need to add `sep = '\t'` within your read_csv() call, as shown below. (`'\t'` represents tabs, just as `'\n'` represents newlines.)

In [4]:
df_curr_enrollment_tab_csv = pd.read_csv(
    'curr_enrollment_tab_separated.csv', sep = '\t')
df_curr_enrollment_tab_csv.head()

Unnamed: 0,first_name,last_name,gender,matriculation_year,matriculation_number,student_id,date_of_birth,college,class_of,level,level_for_sorting
0,Rachel,Silva,F,2020,1,2020-1,2002-12-16,STC,2024,Fr,0
1,Brooke,Bradford,F,2020,2,2020-2,2002-09-26,STM,2024,Fr,0
2,Angela,Cameron,F,2020,3,2020-3,2002-05-18,STC,2024,Fr,0
3,Tonya,Hampton,F,2020,4,2020-4,2002-11-15,STC,2024,Fr,0
4,Tammy,Gardner,F,2020,5,2020-5,2002-06-21,STM,2024,Fr,0


In addition, if your .csv file uses an encoding other than UTF-8 (the default), you may need to specify an [alternative codec](https://docs.python.org/3/library/codecs.html#standard-encodings) using the 'encoding' argument. 
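As a minimal sketch of the 'encoding' argument, the following cell builds a small Latin-1 encoded dataset in memory and then reads it back in. The names and values shown are hypothetical stand-ins (not part of the enrollment dataset), and `io.BytesIO` is used only so that the example doesn't depend on a file on disk.

```python
import io
import pandas as pd

# Hypothetical sample data, encoded as Latin-1 rather than UTF-8.
raw_bytes = 'first_name,last_name\nJosé,Peña\n'.encode('latin-1')

# Reading these bytes with the default UTF-8 codec fails, because the
# Latin-1 bytes for 'é' and 'ñ' are not valid UTF-8 sequences.
try:
    pd.read_csv(io.BytesIO(raw_bytes))
    default_read_failed = False
except UnicodeDecodeError:
    default_read_failed = True

# Specifying the codec that matches the data resolves the issue.
df_latin1 = pd.read_csv(io.BytesIO(raw_bytes), encoding = 'latin-1')
print(df_latin1)
```

With a real file, you would pass the file path in place of the `io.BytesIO` object and keep the same 'encoding' argument.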

# Importing .xlsx data

Importing .xlsx files is also easy to do within Python, although I've found that this process can take longer to execute than importing .csv files.

The following code imports an .xlsx version of the same current enrollment dataset that we read above. You may need to install the openpyxl library in order for it to run on your computer.

In [5]:
df_curr_enrollment_xlsx = pd.read_excel('curr_enrollment.xlsx')
df_curr_enrollment_xlsx.head()

Unnamed: 0,first_name,last_name,gender,matriculation_year,matriculation_number,student_id,date_of_birth,college,class_of,level,level_for_sorting
0,Rachel,Silva,F,2020,1,2020-1,2002-12-16,STC,2024,Fr,0
1,Brooke,Bradford,F,2020,2,2020-2,2002-09-26,STM,2024,Fr,0
2,Angela,Cameron,F,2020,3,2020-3,2002-05-18,STC,2024,Fr,0
3,Tonya,Hampton,F,2020,4,2020-4,2002-11-15,STC,2024,Fr,0
4,Tammy,Gardner,F,2020,5,2020-5,2002-06-21,STM,2024,Fr,0


# Importing SQL data

Python's SQLAlchemy and Pandas libraries make it easy to import SQL tables into your script. Many different SQL database types (such as PostgreSQL) are supported, but this example will focus on a SQLite table created within the appendix of Python for Nonprofits (PFN).

In order to import data from a database, we'll first need to connect to it via SQLAlchemy's `create_engine` function:

In [6]:
e = create_engine('sqlite:///../Appendix/nvcu_db.db')
# Based on: https://docs.sqlalchemy.org/en/20/dialects/sqlite.html#pysqlite
# Note that the first 3 forward slashes indicate that a relative path 
# will be used. The actual relative path ('../Appendix/nvcu_db.db') follows
# those forward slashes.
# For guidance on creating engines for other database types (such as PostgreSQL,
# MySQL, and others), visit 
# https://docs.sqlalchemy.org/en/20/core/engines.html#database-urls .

Once this SQLAlchemy engine has been created, we can use it to read in data from our database:

In [7]:
df_curr_enrollment_sql = pd.read_sql(
    'select * from curr_enrollment', con = e)
df_curr_enrollment_sql.head()

Unnamed: 0,first_name,last_name,gender,matriculation_year,matriculation_number,student_id,date_of_birth,college,class_of,level,level_for_sorting
0,Rachel,Silva,F,2020,1,2020-1,2002-12-16,STC,2024,Fr,0
1,Brooke,Bradford,F,2020,2,2020-2,2002-09-26,STM,2024,Fr,0
2,Angela,Cameron,F,2020,3,2020-3,2002-05-18,STC,2024,Fr,0
3,Tonya,Hampton,F,2020,4,2020-4,2002-11-15,STC,2024,Fr,0
4,Tammy,Gardner,F,2020,5,2020-5,2002-06-21,STM,2024,Fr,0


SQL is a whole language in itself, but you need not be a SQL expert to use Python to connect to database tables. In the above cell, `'select * from curr_enrollment'` is a line of SQL code that requests all fields (and, thus, all data) to be retrieved from the `curr_enrollment` SQLite table. 

SQL makes it possible to select specific columns, choose only particular groups of rows, and perform other more advanced operations. However, many of these operations can also be performed within Python. If you're already a SQL whiz, feel free to pass more advanced code to read_sql; if you're a SQL novice, 'select * from [table]' will still get you pretty far!
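To illustrate this tradeoff, here is a rough sketch that retrieves the same subset of data in two ways: once with a more specific SQL query, and once by selecting everything and then filtering within Pandas. To keep the example self-contained, it uses a hypothetical in-memory table named `curr_enrollment_demo` and a plain `sqlite3` connection (which Pandas also accepts for SQLite databases) rather than the SQLAlchemy engine created earlier.

```python
import sqlite3
import pandas as pd

# Hypothetical stand-in for the curr_enrollment table.
df_demo = pd.DataFrame({
    'first_name': ['Rachel', 'Brooke', 'Angela'],
    'college': ['STC', 'STM', 'STC'],
    'class_of': [2024, 2024, 2024]})

# Creating an in-memory SQLite database and adding the demo table to it:
con = sqlite3.connect(':memory:')
df_demo.to_sql('curr_enrollment_demo', con = con, index = False)

# Option 1: let SQL select the columns and rows of interest.
df_stc_sql = pd.read_sql(
    "select first_name, college from curr_enrollment_demo "
    "where college = 'STC'", con = con)

# Option 2: select everything, then filter within Pandas.
df_all = pd.read_sql('select * from curr_enrollment_demo', con = con)
df_stc_pandas = df_all.query("college == 'STC'")[
    ['first_name', 'college']].reset_index(drop = True)

con.close()
print(df_stc_sql)
print(df_stc_pandas)
```

For small tables, either approach should work fine; for very large tables, filtering within the SQL query reduces the amount of data that needs to be transferred into Python.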

*Also note that a SQLAlchemy engine can be used as the 'con' argument within both read_sql() and to_sql(). This is mentioned explicitly within Pandas' [to_sql documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html), but can also be inferred from the [read_sql() page](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html). This page states that 'con' needs to be a 'SQLAlchemy connectable,' and [the source code within sql.py](https://github.com/pandas-dev/pandas/blob/v2.2.2/pandas/io/sql.py#L570-L743) specifies that a 'SQLAlchemy connectable' can be either an engine or a connection. I mention this in part because using an engine as your argument for the 'con' parameter in read_sql() and to_sql() can save you a bit of code.*

# Importing HTML data

There are a number of ways to access data directly from the internet via Python. One means of doing so is `pd.read_html()`, which lets you read HTML tables from websites directly into Pandas DataFrames.

The following code imports a list of example Census API connection strings into a DataFrame. read_html() returns a *list* of DataFrames (one for each table found on the page), so the [0] at the end of the call selects the *first* table that it retrieved. (There's only one HTML table on this page, but it's still necessary to add [0].)

Note that the import isn't perfect: for some reason, the data in the 'Number' field on the right didn't get downloaded successfully. As noted in the [read_html() documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html), 'Expect to do some cleanup after you call this function.'

In [8]:
df_census_examples = pd.read_html(
    'https://api.census.gov/data/2022/acs/acs5/examples.html')[0]
df_census_examples.head()

Unnamed: 0,Geography Hierarchy,Geography Level,Example URL,Number
0,us,10,https://api.census.gov/data/2022/acs/acs5?get=...,
1,us,10,https://api.census.gov/data/2022/acs/acs5?get=...,
2,region,20,https://api.census.gov/data/2022/acs/acs5?get=...,
3,region,20,https://api.census.gov/data/2022/acs/acs5?get=...,
4,division,30,https://api.census.gov/data/2022/acs/acs5?get=...,


# What about exports?

The above examples focused on *importing* data; however, Pandas also makes it easy to *export* data to a variety of formats.

## Exporting data to a .csv file:

In [9]:
df_curr_enrollment_csv.to_csv('curr_enrollment_export.csv', index = False)

## Exporting data to an .xlsx file:

In [10]:
df_curr_enrollment_xlsx.to_excel('curr_enrollment_export.xlsx', index = False)

For examples of exporting data to SQL tables, visit the nvcu_db_gen.ipynb file within the Appendix.
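In the meantime, here is a minimal sketch of what a SQL export can look like. It is not the code used in nvcu_db_gen.ipynb; the table name (`export_demo`), the sample values, and the in-memory `sqlite3` connection are all assumptions made to keep the example self-contained.

```python
import sqlite3
import pandas as pd

# Hypothetical data to export:
df_export_demo = pd.DataFrame({
    'student_id': ['2020-1', '2020-2'],
    'level': ['Fr', 'Fr']})

con = sqlite3.connect(':memory:')

# if_exists = 'replace' overwrites any existing table with this name;
# the default setting ('fail') raises an error instead.
df_export_demo.to_sql(
    'export_demo', con = con, index = False, if_exists = 'replace')

# Reading the table back in to confirm that the export worked:
df_round_trip = pd.read_sql('select * from export_demo', con = con)
con.close()
print(df_round_trip)
```

As with to_csv() and to_excel(), `index = False` prevents the DataFrame's index from being stored as an extra column.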

## A caveat about data types

You'll often find that using read_csv(), read_excel(), and read_sql() will create identical DataFrames (as long as the source data is the same). However, in some cases, the data types returned by these functions will differ--requiring you to add in some additional code to resolve this discrepancy.

For instance, if you retrieve a dataset from a SQL table, dates of birth might be formatted as DateTime values. Meanwhile, those same values may be formatted as strings if they were imported from a .csv file. Therefore, if you're switching your import source from a SQL table to a .csv file or vice versa, it's not a bad idea to check the data types of your imported fields using [df.dtypes](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html). 
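The following sketch shows one way to check for and resolve this kind of discrepancy. It assumes a small, hypothetical .csv snippet (read from an `io.StringIO` object so that no file is needed) whose `date_of_birth` column should contain dates.

```python
import io
import pandas as pd

# Hypothetical .csv contents with a date column:
csv_text = (
    'student_id,date_of_birth\n'
    '2020-1,2002-12-16\n'
    '2020-2,2002-09-26\n')

# By default, read_csv() leaves the date column as text.
df_dob = pd.read_csv(io.StringIO(csv_text))
print(df_dob.dtypes)
dob_was_datetime = pd.api.types.is_datetime64_any_dtype(
    df_dob['date_of_birth'])

# Option 1: convert the column after the import.
df_dob['date_of_birth'] = pd.to_datetime(df_dob['date_of_birth'])
print(df_dob.dtypes)

# Option 2: ask read_csv() to parse the column during the import.
df_dob_parsed = pd.read_csv(
    io.StringIO(csv_text), parse_dates = ['date_of_birth'])
print(df_dob_parsed.dtypes)
```

Either option should leave you with a datetime column that matches what a date-aware SQL import would provide.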