# Data Source Websites

## What we will accomplish

In this notebook we will:
- Point you in the direction of some online sources of data,
- Introduce the concept of a data repository,
- Demonstrate a few popular data repositories,
- Introduce data competition sites, and
- Give examples of popular data competition sites.

In order to complete a data science project you need data. There are a number of excellent online data sources that you can tap into for portfolio projects, including your Erdős Institute Data Science Boot Camp project. While there is a wide array of online data sources, two of the most popular kinds are:
1. Data repositories, and
2. Data competition sites.

## Data repositories

We will call a <i>data repository</i> any website where data sets are deposited. These can exist for many reasons, for example:
- Housing data associated with published academic research,
- Holding data that was used by a news organization or
- Holding benchmark data sets that are used to compare algorithmic performance.

Let's review a couple of different kinds and give some examples.

### Academic repositories

These repositories house data affiliated with academic research papers. They exist both to support replication and to spur additional research. Here are some examples:
- The UC Irvine Machine Learning Repository, <a href="https://archive.ics.uci.edu/ml/index.php">https://archive.ics.uci.edu/ml/index.php</a>, (<i>a very popular repository</i>),
- A repository of COVID-19 Tweets, <a href="https://publichealth.jmir.org/2020/2/e19273/">https://publichealth.jmir.org/2020/2/e19273/</a>,
- The Mendeley Data repository site, <a href="https://data.mendeley.com/">https://data.mendeley.com/</a> and
- The Harvard Dataverse, <a href="https://dataverse.harvard.edu/">https://dataverse.harvard.edu/</a>.

### GitHub repositories

There are many GitHub repositories whose sole purpose is data storage. News organizations and data-based blogs/websites often have repositories that store the data sets accompanying their stories/posts. For example:
- <a href="https://fivethirtyeight.com/">FiveThirtyEight</a>, <a href="https://github.com/fivethirtyeight/">https://github.com/fivethirtyeight/</a>,
- <a href="https://www.nytimes.com/">The New York Times</a>, <a href="https://github.com/nytimes">https://github.com/nytimes</a> and
- <a href="https://pudding.cool/">Pudding.cool</a>, <a href="https://github.com/the-pudding/data">https://github.com/the-pudding/data</a>.

There are also repositories maintained by individual users not affiliated with any larger organization. These may be harder to find, but if you have a data set in mind it can be a good idea to do a web search for an existing GitHub repository. This could save you a lot of time and work.

### An example

Let's demonstrate how you can use a repository to access data.

We will use a data set from the FiveThirtyEight repository. Let's download the `candy-data.csv` file from the folder associated with this post, <a href="https://fivethirtyeight.com/videos/the-ultimate-halloween-candy-power-ranking/">https://fivethirtyeight.com/videos/the-ultimate-halloween-candy-power-ranking/</a>.

#### Instructions

1. First go to the link associated with the data file, <a href="https://github.com/fivethirtyeight/data/blob/master/candy-power-ranking/candy-data.csv">https://github.com/fivethirtyeight/data/blob/master/candy-power-ranking/candy-data.csv</a>.
2. Then click on the `Raw` button above the data table displayed on the page.
3. Using your web browser, save the file as `candy-data.csv` within the `Data Collection` folder of this repository.
4. Run the code chunks below.

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv("candy-data.csv")

In [3]:
data.head()

Unnamed: 0,competitorname,chocolate,fruity,caramel,peanutyalmondy,nougat,crispedricewafer,hard,bar,pluribus,sugarpercent,pricepercent,winpercent
0,100 Grand,1,0,1,0,0,1,0,1,0,0.732,0.86,66.971725
1,3 Musketeers,1,0,0,0,1,0,0,1,0,0.604,0.511,67.602936
2,One dime,0,0,0,0,0,0,0,0,0,0.011,0.116,32.261086
3,One quarter,0,0,0,0,0,0,0,0,0,0.011,0.511,46.116505
4,Air Heads,0,1,0,0,0,0,0,0,0,0.906,0.511,52.341465


We can also load the data directly from the website by passing the address of the raw csv file to `pd.read_csv`.

In [4]:
data = pd.read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv")

In [5]:
data.head()

Unnamed: 0,competitorname,chocolate,fruity,caramel,peanutyalmondy,nougat,crispedricewafer,hard,bar,pluribus,sugarpercent,pricepercent,winpercent
0,100 Grand,1,0,1,0,0,1,0,1,0,0.732,0.86,66.971725
1,3 Musketeers,1,0,0,0,1,0,0,1,0,0.604,0.511,67.602936
2,One dime,0,0,0,0,0,0,0,0,0,0.011,0.116,32.261086
3,One quarter,0,0,0,0,0,0,0,0,0,0.011,0.511,46.116505
4,Air Heads,0,1,0,0,0,0,0,0,0,0.906,0.511,52.341465
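
The raw address passed to `pd.read_csv` above can also be built from the address of the GitHub page that displays the file. Here is a minimal sketch, assuming the page address has the form `https://github.com/<owner>/<repo>/blob/<branch>/<path>`. The helper name `github_blob_to_raw` is our own invention for illustration and is not part of `pandas` or any GitHub library.

```python
from urllib.parse import urlsplit, urlunsplit


def github_blob_to_raw(blob_url):
    """Turn a GitHub file page address into the matching raw file address.

    Assumes the address looks like
    https://github.com/<owner>/<repo>/blob/<branch>/<path>.
    """
    parts = urlsplit(blob_url)
    segments = parts.path.strip("/").split("/")

    # We expect: owner, repo, "blob", branch, then at least one path piece.
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"Not a GitHub file page address: {blob_url}")

    # Drop the "blob" piece and keep everything else in order.
    owner, repo, _, *rest = segments
    raw_path = "/" + "/".join([owner, repo, *rest])

    return urlunsplit(("https", "raw.githubusercontent.com", raw_path, "", ""))


raw_url = github_blob_to_raw(
    "https://github.com/fivethirtyeight/data/blob/master/candy-power-ranking/candy-data.csv"
)
print(raw_url)
```

The resulting string can then be handed to `pd.read_csv` in the same way as the hard-coded address used earlier.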


Congratulations! You have now downloaded and used data stored on a repository.

Note that this will not be the exact process you follow every time you want to use data stored in a repository.
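
For instance, you may want to download a file once and then reuse the saved copy, which combines the two approaches shown above. The following is a rough sketch under that assumption. The function name `load_csv_with_cache` and the idea of a local cache file are ours and are not part of `pandas`.

```python
from pathlib import Path

import pandas as pd


def load_csv_with_cache(url, local_path):
    """Read a csv from local_path if it exists, otherwise from url.

    When the data is read from url, a copy is saved to local_path
    so that later calls do not need to download it again.
    """
    local_path = Path(local_path)

    # Reuse the saved copy when we already have one.
    if local_path.exists():
        return pd.read_csv(local_path)

    # Otherwise read from the source and save a copy for next time.
    data = pd.read_csv(url)
    local_path.parent.mkdir(parents=True, exist_ok=True)
    data.to_csv(local_path, index=False)
    return data


# Example call (commented out because it needs an internet connection):
# data = load_csv_with_cache(
#     "https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv",
#     "candy-data.csv",
# )
```

Keeping a local copy also means your project still runs if the repository later moves or removes the file.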

### Repository use guidelines

When using data that you did not collect or create yourself, it is important to follow whatever data use guidelines are associated with the data set you use. In particular, you should check to make sure that you are not violating any restrictions or legal guidelines outlined by the data provider. Many repositories will have guidelines on how you are allowed to use their data sets.

It is also important that you credit the original source of the data in your final project. Some repositories may also have guidelines for citation. For example, an academic repository likely has an associated publication that you should cite.

Please be responsible and courteous data citizens. :)

## Data competition sites

A <i>data competition website</i> is a site that hosts competitions centered around particular data sets. 

For example, some entity may have a collection of images from MRI scans. This entity could then provide those images as a data set for a competition whose goal is to provide the "best" predictive algorithm for some disease of interest. The data competition site would:
- Host the competition,
- Publicly store the data,
- Specify the rules as outlined by the entity,
- Accept the competition entries and
- Help determine the winner or winners.

While the competitions may be the main purpose of the website, these sites can often serve as a source of data for personal projects, contain tutorials and be community hubs.

### Popular data competition websites

Here is a list of some of the most popular data competition sites:
- <a href="https://www.kaggle.com/">Kaggle.com</a> (you will need a Kaggle account to access Kaggle data),
- <a href="https://idao.world/">The International Data Olympiad</a>,
- <a href="https://www.drivendata.org/">DrivenData</a>,
- <a href="https://competitions.codalab.org/">CodaLab</a> and
- <a href="https://datahack.analyticsvidhya.com/">DataHack</a>.

Some of these sites will require you to create a profile and others may only have data available for active competitions.

### Example: Extracting data from Kaggle.com

Let's now demonstrate how to extract data from Kaggle.com. Note that in order to work through this example you will need a Kaggle profile.

Kaggle has an entire section dedicated to public data sets, <a href="https://www.kaggle.com/datasets">https://www.kaggle.com/datasets</a>. We will download the famous iris data set found here, <a href="https://www.kaggle.com/uciml/iris">https://www.kaggle.com/uciml/iris</a>.

#### Instructions

1. Go to this link, <a href="https://www.kaggle.com/uciml/iris">https://www.kaggle.com/uciml/iris</a>,
2. Click the download button,
3. Unzip the zip file and move the `Iris.csv` file to this folder,
4. Run the following code to load in the data using `pandas`.

In [6]:
pd.read_csv("Iris.csv")

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...,...
145,146,6.7,3.0,5.2,2.3,Iris-virginica
146,147,6.3,2.5,5.0,1.9,Iris-virginica
147,148,6.5,3.0,5.2,2.0,Iris-virginica
148,149,6.2,3.4,5.4,2.3,Iris-virginica
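
If you would rather skip the manual unzip step, Python's built-in `zipfile` module can open the csv inside the archive directly. This is only a sketch. The helper name `read_csv_from_zip` is hypothetical, and the archive file name `archive.zip` in the commented example is an assumption, so check what your browser actually saved.

```python
import zipfile

import pandas as pd


def read_csv_from_zip(zip_path, member_name):
    """Load one csv file from inside a zip archive without extracting it to disk."""
    with zipfile.ZipFile(zip_path) as archive:
        # archive.open returns a file-like object that pandas can read from.
        with archive.open(member_name) as csv_file:
            return pd.read_csv(csv_file)


# Example call (assumes the download was saved as archive.zip in this folder):
# iris = read_csv_from_zip("archive.zip", "Iris.csv")
```

If you are unsure of the file names inside an archive, `zipfile.ZipFile(zip_path).namelist()` will list them.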


### Data competition site data use guidelines

Again, be sure to follow any data use guidelines put forth by the data competition website and/or the data set contributors. This includes using the data in accordance with their specified rules and citing the data source in any end products that result from your project.

--------------------------

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2023.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md).