<img align="left" src="../All-sample-files/CC_BY.png"><br />

Created by [Nathan Kelber](http://nkelber.com) for Constellate under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email nathan.kelber@ithaka.org.<br />

# Teaching Data Literacy 1

**Description:** This notebook introduces basic data literacy concepts related to data acquisition. These include:

* Selecting data for teaching
* Data basics
* Open data file formats
* Data sources
* Retrieving data

**Use Case:** For Learners (Detailed explanation, not ideal for researchers)

**Difficulty:** Beginners

**Completion Time:** 90 minutes

**Knowledge Required:** 
* Python Basics ([Start Python Basics I](../Python-basics/python-basics-1.ipynb))

**Knowledge Recommended:** None

**Data Format:** None

**Libraries Used:** 
* urllib.request to download files
* pathlib to work with files and directories

**Research Pipeline:** None
___

## Audience and Disciplinarity

While it is possible to teach a general course in data literacy, in practice, academic disciplines tend to work with particular kinds of data. Therefore, it is often a good practice to tailor your data literacy workshops and lessons to a particular discipline or group of disciplines. If you are a faculty member, then the disciplinary perspective is probably self-evident, but librarians and other university staff should consider a few different variables, including their own expertise, the area of greatest academic need, and institutional support.

## Selecting Data For Teaching
In order to teach data literacy, you need to have data for your students to use. Here are some good practices for choosing teaching data.

### Choose data relevant to the discipline
Learners will respond more positively when the data is relevant to their disciplinary domain.

### Smaller is often better
While research often relies on using "big data," teaching with big data is often impractical. Moving large files for workshops often requires significant internet bandwidth. Processing large files takes significant class time. It is usually better to shorten the time for discovery and learning.

### Use open data formats
Proprietary file formats make it hard to extract useful data and to preserve it long term. Use open data formats and encourage your students to use them as well.

### Toy data vs. real data
Toy data is data that is artificially constructed for the sake of a teaching exercise. Toy data can be more effective for teaching if it is carefully constructed for the lesson. For example, a class on data cleaning could include common errors within a compact data set. Real data, however, can be more compelling as learners discover aspects of data that are interesting to them. 

### Start with data basics
Establish some data basics before using more advanced methods. This helps establish a base level of literacy before moving on to more complex topics. Keep in mind that data means different things in each discipline. Humanists, for example, often have very little experience with research data.

## Data Basics

Disciplines can have very different perspectives on data, including what kind of data is valuable and/or valid. Here are some examples of how data is often categorized.

### Quantitative vs. Qualitative

The distinction between quantitative and qualitative data is often framed as numbers vs. text. Quantitative data is based on numbers, numerical comparison, and measurement. Qualitative data is based on language, description, and interpretation.

#### Qualitative Data Types

* Nominal Data- Data that is labeled, such as eye color or country.
* Ordinal Data- Data that uses a numerical scale but requires some level of qualitative choice, such as "Rate your satisfaction from 1-5." Ordinal data is a mix of qualitative and quantitative data.
* Unstructured Text- Descriptive text, which cannot be ordered into tables. Much of the data that researchers encounter is unstructured text.

#### Quantitative Data Types

* Discrete- Data that is counted in whole numbers or integers, such as the number of people.
* Continuous- Data that can take any value within a range, including fractions, such as the length of an object.
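
For learners who already know Python basics, these categories can be loosely connected to Python's built-in types. The sketch below is illustrative only: the survey record and all of its field names are hypothetical.

```python
# A hypothetical survey record; every field name and value is made up for illustration.
record = {
    "country": "Canada",        # nominal: a label with no inherent order
    "satisfaction": 4,          # ordinal: a ranked choice on a 1-5 scale
    "comment": "Helpful staff and a quiet reading room.",  # unstructured text
    "household_size": 3,        # discrete: counted in whole numbers
    "height_cm": 172.4,         # continuous: measured, can take fractional values
}

# Map each field to the name of the Python type that stores it.
field_types = {field: type(value).__name__ for field, value in record.items()}

for field, type_name in field_types.items():
    print(f"{field}: {type_name}")
```

Note that the ordinal and discrete fields are both stored as `int`. The Python type alone does not tell you which category a value belongs to; that depends on how the data was collected.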

### Primary vs Secondary Data

The concepts of primary and secondary data are similar to primary and secondary sources. In short, primary data is data researchers have collected directly for the purposes of their research. Secondary data is data that has been gathered from another source and is being re-purposed for a new research project. Gathering and organizing data require a lot of time and labor, so it is usually easier for researchers to find and use existing data.

### Open Data File Formats

#### Tabular Data
* CSV (.csv)
* Excel (.xlsx)

#### Documents
* Text (.txt)
* Portable Document Format (.pdf)

#### Images
* JPG (.jpg)
* PNG (.png)

#### Markup
* HyperText Markup Language (.html)
* eXtensible HyperText Markup Language (.xhtml)
* eXtensible Markup Language (.xml)

#### Data Interchange
* JavaScript Object Notation (.json)
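
To make the difference between a tabular format and a data interchange format concrete, here is a minimal sketch that writes the same records as CSV and as JSON using only Python's standard library. The records are made up for illustration, and the text is kept in memory rather than saved to files.

```python
import csv
import io
import json

# Hypothetical records, made up for illustration.
records = [
    {"title": "Moby-Dick", "year": 1851},
    {"title": "Middlemarch", "year": 1871},
]

# Write the records as CSV text (tabular: one header row, one row per record).
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["title", "year"])
writer.writeheader()
writer.writerows(records)
csv_text = csv_buffer.getvalue()

# Write the same records as JSON text (data interchange: keys and typed values).
json_text = json.dumps(records, indent=2)

# Read both back. CSV values come back as strings; JSON keeps the number type.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))
json_rows = json.loads(json_text)

print(csv_rows[0]["year"], type(csv_rows[0]["year"]).__name__)
print(json_rows[0]["year"], type(json_rows[0]["year"]).__name__)
```

One point worth raising with learners: CSV stores everything as text, so numbers must be converted after reading, while JSON distinguishes numbers from strings.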

## Creating Primary Data for Teaching
Generally, creating primary data for teaching is more trouble than it is worth. Creating a high-quality, compelling dataset takes significant time. However, it may be worthwhile to teach disciplinary skills for collecting data. For example, at TAP Institute, we have taught courses in optical character recognition (OCR), web scraping, and gathering social media data. If you know the audience discipline, then you might focus on established best practices for collecting data and creating datasets.

Additionally, you might describe good data practices for institutional support:

* Grant proposals
* Submitting data to the institutional repository
* The Institutional Review Board (IRB) process

## Secondary Data Sources for Teaching

### Multiple Disciplines
* Constellate
* [data.gov](https://data.gov/)
* [Dataverse](https://dataverse.org/)
* [Data is Plural](https://www.data-is-plural.com/)
* [Dryad](https://datadryad.org/)
* [Figshare](https://figshare.com/)
* [Harvard Dataverse](https://dataverse.harvard.edu/)
* Local Government (e.g. [City of Detroit Open Data Portal](https://data.detroitmi.gov/), [Seattle Public Library Data](https://data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6/data))
* [Papers with Code](https://paperswithcode.com/datasets)
* Institutional Repositories

### Computer Science (ML/AI focused)
* [🤗 Hugging Face](https://huggingface.co/datasets)
* [Kaggle](https://www.kaggle.com/datasets)

### Humanities
* [Collections as Data](https://osf.io/r9n3s/wiki/home/)
* [Journal of Open Humanities Data](https://openhumanitiesdata.metajnl.com/)
* [Project Gutenberg](https://www.gutenberg.org/)
* [Text Encoding Initiative Texts](https://tei-c.org/activities/projects/)


### Law
* [Caselaw Access Project](https://case.law/)
* [On the Books](https://onthebooks.lib.unc.edu/)

### Sciences
* [Astrophysics Data System](https://ui.adsabs.harvard.edu/)
* [data.nasa.gov](https://data.nasa.gov/)
* [National Center for Science and Engineering Statistics](https://ncsesdata.nsf.gov/explorer)
* [National Centers for Environmental Information](https://www.ncei.noaa.gov/)
* [National Institute of Standards and Technology](https://data.nist.gov/)
* [National Library of Medicine](https://www.ncbi.nlm.nih.gov/)
* [Zenodo](https://zenodo.org/)
  
### Social Sciences
* [Consortium of European Social Science Data Archive](https://datacatalogue.cessda.eu/)
* [Inter-university Consortium for Political and Social Research](https://www.icpsr.umich.edu/web/pages/)

## Avoid Buying Data
It is not a good academic practice to buy proprietary data for teaching or research.

### Teaching with proprietary data
If you buy proprietary data, then you will need to purchase a license for every person you teach. If someone outside your institution wants to use your lesson, then they *also* need to buy a license for everyone else *they* teach. In short, you're designing a lesson that funnels money to a vendor and is not re-usable as an Open Educational Resource (OER). If you can't make the data freely available, then it is a bad choice for teaching.

### Research with proprietary data
Purchasing research data is a bad academic practice for similar reasons. When an institution buys proprietary data, it forces other researchers to purchase the same proprietary data in order to reproduce the research. The problem is exacerbated if the research is locked within a proprietary environment, because often access to the research environment is *another expensive purchase*. Additionally, the researcher must rely on the provider to freeze the data, ideally with hashing so that anyone can verify the data has not changed. For all these reasons, many grant agencies refuse to fund research that relies on proprietary data and insist that any research data used in a funded project be freely available and publicly hosted, such as within an institutional repository.
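
As an illustration of what hashing offers here, the sketch below computes a SHA-256 checksum for a data file using Python's `hashlib`. The file name and contents are hypothetical and are created on the spot so the example runs on its own; SHA-256 is one common choice of hash, not the only one.

```python
import hashlib
from pathlib import Path

def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical example file, written here so the sketch is self-contained.
sample = Path("sample_dataset.csv")
sample.write_bytes(b"id,value\n1,10\n2,20\n")

# The checksum can be published alongside the dataset.
published_checksum = sha256_of_file(sample)
print(published_checksum)

# Later, anyone with a copy of the file can confirm it is unchanged.
assert sha256_of_file(sample) == published_checksum
```

If even one byte of the file changes, the checksum changes, which is why a published hash lets other researchers confirm they are working with the same data.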

## Teach Data Cleaning
For most data research projects, data cleaning is the vast majority of the work. This is true whether working with primary or secondary data. Given that all researchers must clean their data, it is important to offer classes in data cleaning. There are a number of ways of addressing data cleaning.

### Graphical User Interface (GUI) tools
Excel and OpenRefine are excellent introductory tools for cleaning data. They are more approachable for new researchers and they have very powerful feature sets. These tools have the following advantages:

* Point and click
* Easy to learn
* Great for browsing and exploring data
* Good for small datasets (<10,000 rows)

### Command line and Python tools
Unix and Python tools for data cleaning are potentially much more powerful. Python has two major libraries for working with tabular data:

* [Pandas](https://pandas.pydata.org/)- The more established library with lots of support
* [Polars](https://www.pola.rs/)- A newer library that is quickly growing in popularity

Working with the command line, Pandas, and Polars has several significant advantages over solutions like Excel and OpenRefine.

* More powerful data manipulation and transformations
* Can work with much larger data (millions of items)
* Much faster for working with data
* Can automate workflows in a large data pipeline
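
The automation advantage can be seen even in a few lines of code. The sketch below deliberately uses only Python's standard library rather than Pandas or Polars, and the messy rows are made up for illustration. It trims whitespace, normalizes capitalization, drops duplicate rows, and skips rows with a missing value.

```python
import csv
import io

# Hypothetical messy data with common problems: stray whitespace,
# inconsistent capitalization, a duplicate row, and a missing value.
raw_csv = """name,city
 Ada Lovelace ,london
Grace Hopper,New York
ada lovelace,London
Alan Turing,
"""

def clean_rows(csv_text):
    """Return cleaned, de-duplicated rows from CSV text."""
    seen = set()
    cleaned = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Trim whitespace and normalize capitalization in every field.
        row = {key: value.strip().title() for key, value in row.items()}
        # Skip rows with any missing value.
        if not all(row.values()):
            continue
        # Skip rows already seen after normalization.
        fingerprint = tuple(row.values())
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(row)
    return cleaned

cleaned = clean_rows(raw_csv)
for row in cleaned:
    print(row)
```

Because the cleaning steps live in a function, the same steps can be re-run unchanged on next year's data, which is hard to guarantee with point-and-click edits.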

### When to use clean data

It is a good practice to use clean data when teaching a particular research method. Otherwise, the teacher risks going off on a tangent about data cleaning. Such a tangent may be a useful exercise if the data issues are common for the method being taught, but often it distracts from the course's purpose.

In short, use clean data for teaching a particular topic, but do not neglect data cleaning. Ideally, it should come first. While data cleaning is not a sexy topic, it is more fundamental and important than the latest artificial intelligence methods. 

## Working with Data in Python
Data files can be divided into text files and binary files. Text files can usually be opened with a plain text editor, such as Notepad or TextEdit. A text file is processed as a sequence of characters. Common text file formats include:

* Text (.txt)
* CSV (.csv)
* JSON (.json)

On the other hand, binary files are processed as a sequence of bytes. These are common for image, audio, and video data. Excel workbooks (.xlsx) are also binary files: each one is a compressed archive that a plain text editor cannot display. Examples include:

* JPG (.jpg)
* Ogg Vorbis (.ogg)
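
The difference between characters and bytes can be shown by reading the same file both ways with `pathlib`. This is a minimal sketch; the file name is hypothetical and the file is created by the example itself.

```python
from pathlib import Path

# Hypothetical file name used only for this demonstration.
path = Path("greeting.txt")
path.write_text("caf\u00e9", encoding="utf-8")  # the word "café"

# Reading as text decodes the bytes into a sequence of characters.
as_text = path.read_text(encoding="utf-8")

# Reading as binary returns the raw bytes stored on disk.
as_bytes = path.read_bytes()

print(as_text, len(as_text))    # 4 characters
print(as_bytes, len(as_bytes))  # 5 bytes, because "é" takes two bytes in UTF-8
```

The same file has four characters but five bytes, which is why the encoding matters whenever a text file is opened.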

## Host and back up your class data

It is a good idea to host and back up your data. (Another reason why proprietary data is a bad idea.) A problem with data hosting can ruin a well-planned class. If there are a hundred students in your class, the host needs to be able to serve all of them the same data at once. 

Common choices for storing research data for teaching:

* Your institutional repository
* Amazon Web Services (AWS)
* Google Drive
* Dropbox
* Box
* Microsoft OneDrive

GitHub is an excellent choice for saving and version-tracking lessons. It is not a good choice for serving data. GitHub warns about files larger than 50 MiB and blocks files larger than 100 MiB.

Storing your research data (and your lesson's images!) in the cloud will make your repository pull in quickly for your students.
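
Once the data is hosted, a notebook can retrieve it with `urllib.request` and `pathlib`, the two libraries listed at the top of this lesson. The following is a minimal sketch rather than a definitive recipe: the helper name `download_if_missing` is hypothetical, and the URL in the example call is a placeholder, not a real dataset.

```python
import urllib.request
from pathlib import Path

def download_if_missing(url, destination):
    """Download url to destination unless the file already exists."""
    destination = Path(destination)
    if destination.exists():
        return destination
    # Create the parent folder (for example, data/) if needed.
    destination.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, destination)
    return destination

# Example call (the URL is a placeholder, not a real dataset):
# download_if_missing("https://example.org/data/sample.csv", "data/sample.csv")
```

Skipping the download when the file already exists means that learners who re-run the notebook do not request the same file from your host again.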

## Opening Research Data with Python

The process for opening data files in Python is similar across file types. [Python Intermediate 2](../Python-intermediate/python-intermediate-2.ipynb) describes how to open, read, and write popular data file types:

* Text (.txt)
* Comma Separated Values (.csv)
* JavaScript Object Notation (.json)

___
## Lesson Complete

Congratulations! You have completed *Teaching Data Literacy 1*.

### Start Next Lesson (coming soon): [Teaching Data Literacy 2](./teaching-data-literacy-2.ipynb)