<img align="left" width="200" src="Picture1.png">

# 2. Collecting data

There are several ways to create a dataset. Today we are going to talk about the following:
<ol>
  <li>Downloading existing datasets</li>
  <li>Application Programming Interfaces (APIs)</li>
  <li>Web scraping</li>
</ol>

## Responsible text collection and analysis

Before collecting data from the web, it is important to follow documentation, rights, and use guidelines to be sure the extraction and use of the data are legal. You will also want to think about the ethics of your proposed textual analysis project.

A useful list of questions can be found on page 21 of <a href="https://psyarxiv.com/xvrhm/">Reflexivity in Quantitative Research: A Rationale and Beginner’s Guide</a> (Jamieson, M. K., Govaart, G. H., & Pownall, M.), including:
<ul><li>Why do I want to research this group?</li><li>If I am using existing datasets, are there any silent assumptions in this dataset?</li></ul>

What can you do to ensure responsible textual analysis practices?
<ul>
  <li>Ask a librarian: The library has a lot of expertise on how to ethically use data. The library has a Copyright Librarian, Digital Humanities Librarian, and a Data Management Librarian. You can also reach out to your subject specialists. Librarians can help guide you and even reach out to our vendors if you want to do text analysis on library subscriptions.</li>
  <li>Look for the <a href="https://creativecommons.org/">Creative Commons</a> logo or other licenses. These will explain the permissions for using data on the webpage. The absence of a license does not mean there are no restrictions or ethical considerations.</li>

</ul>

## Download existing datasets

There are plenty of existing datasets available to download. This is the easiest way to get data because no coding is needed, which makes it a good place to start with text analysis. However, it limits the data you can work with: someone else has to prepare the dataset and make it available to download. The data is often already structured, which can be good or bad depending on your goal.

These are a few examples of free large datasets available online to download:

| Type of Data | Name | Link | Description |
| --- | --- | --- | --- |
| Books | HathiTrust | https://www.hathitrust.org/hathifiles | Metadata for all HathiTrust books |
| Emails | Enron Email dataset | https://www.cs.cmu.edu/~./enron/ | About 0.5 million emails from senior managers at Enron (released by FERC) |
| Reviews | Yelp reviews | https://www.yelp.com/dataset | Access to >6 million Yelp reviews |
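
Once a dataset is downloaded, a few lines of Python are usually enough to take a first look at it. The sketch below assumes a tab-separated file with a header row; the sample rows and column names are made up for illustration and do not reflect the real layout of any dataset in the table.

```python
import csv
import io

# Hypothetical three-column sample standing in for a downloaded,
# tab-separated metadata file; the column names are made up.
SAMPLE_TSV = (
    "id\ttitle\tyear\n"
    "vol1\tMoby Dick\t1851\n"
    "vol2\tEmma\t1815\n"
)


def read_tsv(handle):
    """Return each row of a tab-separated file as a dictionary."""
    return list(csv.DictReader(handle, delimiter="\t"))


# io.StringIO lets the sample text behave like an open file.
rows = read_tsv(io.StringIO(SAMPLE_TSV))
titles = [row["title"] for row in rows]
print(titles)
```

For a real download, you would pass an opened file instead of the sample, for example `open(path, newline="", encoding="utf-8")`, where `path` is wherever you saved the file.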

## APIs

An API (Application Programming Interface) is an excellent way to access data on the web. There are limitations: the owner of the data has to make it available and grant permission, and the person retrieving the data needs coding skills to make data requests. API data is typically structured as JSON or XML.
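
To give a sense of what "structured as JSON" means in practice, here is a minimal sketch that parses a JSON response body with Python's standard library. The response text and its `results` and `text` keys are invented for illustration; every real API documents its own layout.

```python
import json

# Made-up JSON text standing in for an API response body.
response_body = (
    '{"results": [{"id": 1, "text": "first"}, {"id": 2, "text": "second"}]}'
)

# json.loads turns the text into ordinary Python dictionaries and lists.
payload = json.loads(response_body)
texts = [item["text"] for item in payload["results"]]
print(texts)
```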

### Twitter API

Twitter has a powerful API that is available to almost anyone with a Twitter account. Python tools, like twarc, make accessing this data relatively simple with a little Python knowledge. Let's try getting data with the Twitter API!

twarc is a command line tool that allows us to easily query for tweets. Luckily, Jupyter notebooks can run command line commands if you add "!" in front of the command.

In [None]:
#You should only need to run this command once on your computer. If twarc is already installed, you'll see a message that says "Requirement already satisfied"
#! pip install --upgrade twarc

APIs usually require keys (unique codes) that are tied to users. This is because APIs always have limitations--restricting users, the number of queries, limits to what data can be accessed, etc. A lot of API data is behind a paywall. My Twitter API key is in another file called Constants for security reasons. 

In [None]:
#This tells the file to look for an API key in another file called Constants:
import Constants

In [None]:
#This gets that API key and renames it "t"
t = Constants.BEARER_TOKEN
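
Keeping the key in a separate file is one way to keep it out of a shared notebook. Another common option is an environment variable. The sketch below is an assumption about how that could look, not part of the original workshop setup; the helper `load_token` and the variable name `BEARER_TOKEN` are illustrative.

```python
import os


def load_token(env=os.environ, name="BEARER_TOKEN"):
    """Return the API token stored under `name`, or raise a clear error."""
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {name} environment variable first")
    return token


# Demonstration with a stand-in mapping instead of the real environment.
demo_token = load_token({"BEARER_TOKEN": "not-a-real-token"})
print(len(demo_token))
```

Calling `load_token()` with no arguments reads from the real environment and fails with a readable message if the variable is missing.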

In [None]:
#This configures twarc so the Twitter API knows you're ok to access the data.
#The {t} syntax passes the value of the Python variable t to the shell command.
#!printf "{t}" | twarc2 configure

Finally it is time to have some fun with the API!

In [None]:
#Search the last week of tweets for a term/phrase, stop after 50 results, and save them to results3.jsonl
!twarc2 search --limit 50 "pizza" results3.jsonl

The results file doesn't look pretty right now because it is structured as JSON data. For now, we're going to leave it as it is, but there is much more that can be done with the twarc Twitter API tool.
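
As a first step toward making the results readable, the sketch below pulls just the tweet text out of a JSONL file. It assumes each line of the file is one JSON object with the tweets listed under a `data` key and the tweet body under `text`; check your own results file before relying on that layout. The sample tweets are made up so the code runs without an API key.

```python
import json
import os
import tempfile


def tweet_texts(path):
    """Collect the text of every tweet in a JSONL file of API responses."""
    texts = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            page = json.loads(line)
            # Assumed layout: tweets sit in a list under the "data" key.
            for tweet in page.get("data", []):
                texts.append(tweet.get("text", ""))
    return texts


# Made-up sample written to a temporary file so the sketch runs offline.
sample_pages = [
    {"data": [{"id": "1", "text": "pizza night"}, {"id": "2", "text": "more pizza"}]},
    {"data": [{"id": "3", "text": "pizza again"}]},
]
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    for page in sample_pages:
        tmp.write(json.dumps(page) + "\n")

sample_texts = tweet_texts(tmp.name)
os.remove(tmp.name)  # clean up the temporary file
print(sample_texts)
```

With a real results file, the call would be `tweet_texts("results3.jsonl")`.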

As a Northwestern user, you can apply for <a href="https://developer.twitter.com/en/products/twitter-api/academic-research">academic access to the Twitter API</a>. This provides access to all public tweets by just adding `--archive` to the request.

## Web Scraping

Web scraping, or web harvesting, uses code to pull data from webpages. This is the data collection strategy that likely requires the most programming knowledge, and it also carries the most risk. If you are going to scrape a lot of data from a website, the best approach is to seek permission first.
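
Many websites publish a `robots.txt` file stating which paths automated tools may visit. Checking it does not replace asking for permission or reading the terms of use, but it is a quick first signal. The sketch below uses Python's standard `urllib.robotparser` on made-up rules and the placeholder domain `example.org`.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt rules; a real file sits at the root of a website.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
# parse() accepts the lines of a robots.txt file, so no network is needed.
parser.parse(ROBOTS_TXT.splitlines())

allowed_public = parser.can_fetch("*", "https://example.org/articles/1")
allowed_private = parser.can_fetch("*", "https://example.org/private/1")
print(allowed_public, allowed_private)
```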

There are web scraper tools available, like this <a href="https://chrome.google.com/webstore/detail/web-scraper-free-web-scra/jnhgnonknehpejjnehehllkliplmbmhn?hl=en">browser extension</a>, that will scrape a single page. You can even scrape a single webpage using Google Sheets! But as a researcher you will likely want to scrape a lot of data at a time, which requires more programming knowledge.

Tables and lists on webpages can be extracted using <a href="https://support.google.com/docs/answer/3093339?hl=en">formulas</a> in Google Sheets. We can test scraping a Wikipedia page by using this formula: `=ImportHTML("https://en.wikipedia.org/wiki/Academy_Award_for_Best_Supporting_Actress", "table", 6)` in a Google Sheet: <a href="https://docs.google.com/spreadsheets/d/1vITQtcemjB_AYO3134Vi_TOiIR_AaHItSwXBE3_vOOg/edit">https://docs.google.com/spreadsheets/d/1vITQtcemjB_AYO3134Vi_TOiIR_AaHItSwXBE3_vOOg/edit</a>
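
The same idea, picking one table out of a page, can be sketched in Python with only the standard library. This is an illustrative stand-in for what the spreadsheet formula does, not how Google Sheets implements it. The class `TableExtractor` and the sample HTML are made up, and the sketch does not handle tables nested inside other tables.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every table on a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None   # cells of the row being read, if any
        self._cell = None  # text pieces of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


# Made-up page with two tables so the sketch runs offline.
SAMPLE_HTML = """
<table><tr><th>Year</th><th>Winner</th></tr>
<tr><td>2001</td><td>Example A</td></tr></table>
<table><tr><td>other table</td></tr></table>
"""

extractor = TableExtractor()
extractor.feed(SAMPLE_HTML)
first_table = extractor.tables[0]
print(first_table)
```

Note that Python lists count from 0 while the index in `ImportHTML` counts from 1, so the sixth table on a page would be `extractor.tables[5]`.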

A librarian can help with collecting data through web scraping. If you want to scrape data provided through library subscriptions, we can help you get permissions from our vendors. This way you can scrape web data without risk.

## More information

Responsible text analysis:

Jamieson, Michelle K., Gisela H. Govaart, and Madeleine Pownall. 2022. “Reflexivity in Quantitative Research: A Rationale and Beginner’s Guide.” PsyArXiv. February 23. <a href="https://doi.org/10.31234/osf.io/xvrhm">doi:10.31234/osf.io/xvrhm</a>

Datasets:

APIs:

Web scraping: