# Open datasets

# Acquiring Data from open repositories

A computational biologist does not only analyse data: a crucial part of the work is acquiring datasets to analyse, as well as toy datasets for testing computational methods and algorithms. The internet is full of such open datasets. Some sources require you to create an account and authenticate before you can download anything, especially for medical data. This can be time-consuming, so here we focus on easy-access resources, mostly of modest size. Several Python libraries provide a `datasets` module that makes fetching online data nearly seamless, with little need for preprocessing.


### Goal of the notebook

Here you will get familiar with some ways to fetch datasets from online sources. We do some exploration of the data for illustration, but the methods themselves will be covered later.


# Useful resources and links

When playing around with algorithms, it can be practical to use relatively small datasets. A good example is the `datasets` submodule of `scikit-learn`. `Nilearn` (a library for neuroimaging) also provides a collection of neuroimaging datasets. Many datasets can also be acquired through the competition website [Kaggle](https://www.kaggle.com), where each dataset page describes how to access the data.
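
As a minimal sketch of what such a `datasets` module offers, the snippet below loads the iris toy dataset that ships with scikit-learn (no download needed) as a pandas DataFrame. The choice of iris and the variable names `iris` and `iris_df` are ours, for illustration only.

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects instead of NumPy arrays
iris = load_iris(as_frame=True)
iris_df = iris.frame          # feature columns plus a 'target' column

print(iris_df.shape)          # (n_samples, n_features + 1)
print(iris.target_names)      # class labels as strings
print(iris_df.head())
```

The other `load_*` functions in `sklearn.datasets` follow the same pattern, while the `fetch_*` functions download the data on first use and cache it locally.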


### Links
- [OpenML](https://www.openml.org/search?type=data)
- [Nilearn datasets](https://nilearn.github.io/modules/reference.html#module-nilearn.datasets)
- [Sklearn datasets](https://scikit-learn.org/stable/modules/classes.html?highlight=datasets#module-sklearn.datasets)
- [Kaggle](https://www.kaggle.com/datasets)
- [MEDNIST]

-  [**Awesomedata**](https://github.com/awesomedata/awesome-public-datasets)

 - We strongly recommend checking out the Awesomedata lists of public datasets, covering topics such as [biology/medicine](https://github.com/awesomedata/awesome-public-datasets#biology) and [neuroscience](https://github.com/awesomedata/awesome-public-datasets#neuroscience)

- [Papers with code](https://paperswithcode.com)

- [SNAP](https://snap.stanford.edu/data/)
  - Stanford Large Network Dataset Collection  
- [Open Graph Benchmark (OGB)](https://github.com/snap-stanford/ogb)
  - Network datasets
- [Open Neuro](https://openneuro.org/)
- [Open fMRI](https://openfmri.org/dataset/)

In [None]:
# import basic libraries
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

We start with scikit-learn's datasets for testing out Machine Learning (ML) algorithms. Visit [here](https://scikit-learn.org/stable/modules/classes.html?highlight=datasets#module-sklearn.datasets) for an overview of the datasets.

In [None]:
#import multiple datasets from sklearn package
from sklearn.datasets import fetch_olivetti_faces, fetch_20newsgroups, load_breast_cancer, load_diabetes, load_digits, load_iris

# Hand written digits

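Load the digits dataset, which consists of 8x8 images of handwritten digits. It is a small MNIST-like dataset bundled with scikit-learn, not MNIST itself (MNIST images are 28x28).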
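
As a quick check of what the loader returns, here is a small sketch that calls `load_digits()` without `return_X_y` and inspects the resulting object. The variable names `digits` and `same` are ours; the point is only to show that the flattened rows in `data` and the 8x8 arrays in `images` hold the same pixels.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()            # Bunch object bundled with scikit-learn

print(digits.data.shape)          # flattened images: (n_samples, 64)
print(digits.images.shape)        # the same images as 8x8 arrays
print(np.unique(digits.target))   # the digit classes

# the flattened rows are just the 8x8 images reshaped
same = np.array_equal(digits.data[0].reshape(8, 8), digits.images[0])
print(same)
```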

In [None]:
# load data (pixels) into X and target (the digit each image represents) into y
X,y = load_digits(return_X_y=True)

In [None]:
y.shape

In [None]:
X.shape #1797 images, 64 pixels per image

In [None]:
# the values in the array are the grey levels of the pixels
X[0]

In [None]:
y[0]

This is not very informative so we should plot the image itself.

In [None]:
# plot first image. We need to reshape the 64 pixels into [8,8] array
plt.imshow(X[0].reshape(8,8),cmap='gray')
plt.show()

<div class='alert alert-warning'>
<h4> Exercise 1.</h4>  Write a function `plot` that takes an argument `k` and visualizes the k-th sample. 
The sample is stored flattened, so you will need to reshape it. Use `plt.imshow` for plotting and set the title to the digit the image represents. 
</div>

In [None]:
# Ex1


In [None]:
# %load solutions/ex2_1.py

In [None]:
# test your solution
plot(15); plot(450)

# Olivetti face data

A dataset of face images of 40 subjects, with varying facial expressions, facial details and lighting.

In [None]:
faces = fetch_olivetti_faces()

<div class='alert alert-warning'>
<h4>Exercise 2. </h4> Inspect the dataset. How many classes are there? How many samples per class? Also, plot some examples. What do the classes represent? 
</div>

*Hint: Write `faces.` and press tab to see which attributes the dataset has.*    

In [None]:
# Ex2


In [None]:
# %load solutions/ex2_2.py

Once you have made yourself familiar with the dataset, we can do some data exploration with unsupervised methods, as below. The next few lines of code are simply for illustration, so don't worry about the details (we will cover unsupervised methods in submodule F).

In [None]:
from sklearn.decomposition import randomized_svd

In [None]:
X = faces.data
y = faces.target

In [None]:
n_dim = 3
u, s, v = randomized_svd(X, n_dim)

Now we have factorized the images into their constituent parts. The code below displays the various components isolated one by one.
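
To make the idea of isolating components concrete, here is a small sketch on a random matrix rather than the face images. It uses the exact `np.linalg.svd` instead of `randomized_svd`, and the helper `component` is a hypothetical name of ours. Keeping one singular value at a time gives a rank-1 matrix, and summing all of them recovers the original matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))                      # stand-in for the image matrix
u, s, vt = np.linalg.svd(A, full_matrices=False)

def component(i):
    """Reconstruct A using only the i-th singular value."""
    my_s = np.zeros_like(s)
    my_s[i] = s[i]
    return u @ np.diag(my_s) @ vt

components = [component(i) for i in range(len(s))]
recon = sum(components)                          # add the rank-1 pieces back up

print(np.linalg.matrix_rank(components[0]))      # each piece has rank 1
print(np.allclose(recon, A))                     # together they rebuild A
```

With only the first few components, as in the face example, the sum is an approximation of the data rather than an exact reconstruction.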

In [None]:
# don't worry about this code
def show_ims(ims):
    fig = plt.figure(figsize=(16,10))
    idxs = [0,1,2, 11,12,13, 40,41,42, 101,102,103]
    for i,k in enumerate(idxs):
        ax=fig.add_subplot(3,4,i+1)
        ax.imshow(ims[k])
        ax.set_title(f"target={y[k]}")

In [None]:
for i in range(n_dim):
    my_s = np.zeros(s.shape[0])
    my_s[i] = s[i]
    recon = u@np.diag(my_s)@v
    recon = recon.reshape(400,64,64)
    show_ims(recon)

Are you able to see what the components represent? It looks as if the second component captures the lighting (the direction of the light), while the third highlights the eyebrows and the shape of the chin.


## TSNE

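Let's import t-SNE (`TSNE` in scikit-learn), a nonlinear dimensionality reduction method that embeds the data in two dimensions so that similar samples end up close together. It is not a clustering algorithm itself, but the embedding often makes groups in the data visible. We will learn more about this in Part 3 of this course when we dive deeper into machine learning and clustering algorithms.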
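
If you want to try t-SNE without downloading the face data, the following sketch runs it on a subset of the bundled digits dataset. The subset size of 300 and the variable names are arbitrary choices of ours to keep the run short.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X_digits, y_digits = load_digits(return_X_y=True)
X_small, y_small = X_digits[:300], y_digits[:300]   # subset keeps it fast

# embed the 64-dimensional images in 2 dimensions
tsne_small = TSNE(n_components=2, init='pca', random_state=0)
embedding = tsne_small.fit_transform(X_small)

print(embedding.shape)                 # one 2D point per image
print(np.isfinite(embedding).all())    # sanity check on the output
```

The two columns of `embedding` can then be passed to `plt.scatter` and coloured by `y_small`, in the same way as the face embedding is plotted in this section.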

In [None]:
from sklearn.manifold import TSNE

In [None]:
tsne = TSNE(init='pca', random_state=0)
trans = tsne.fit_transform(X)

In [None]:
m = 8*10 # choose 8 people (10 images each)

plt.figure(figsize=(16,10))
xs, ys = trans[:m,0], trans[:m,1]
plt.scatter(xs, ys, c=y[:m], cmap='rainbow')

# unpack directly so we do not overwrite s and v from the SVD above
for i, (xx, yy, label) in enumerate(zip(xs, ys, y[:m])):
    #plt.text(xx,yy,label) #class
    plt.text(xx,yy,i) #index

Many people seem to have multiple subclusters. What is the difference between those clusters? (e.g. 68, 62, 65 versus the other indices in the 60s)

In [None]:
def show(im):
    return plt.imshow(im, cmap='gray')

In [None]:
ims = faces.images

idxs = [68,62,65,66,60,64,63]
#idxs = [9,4,1, 5,3]
for k in idxs:
    show(ims[k])
    plt.show()

# Covid impact on airport traffic

**Kaggle** has lots of datasets to play with. Here we load data on traffic volume at various airports after the onset of COVID-19. https://www.kaggle.com/datasets/terenceshin/covid19s-impact-on-airport-traffic

In [None]:
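# import csv from kaggle
# Kaggle downloads require a logged-in account, so reading the download URL
# directly usually fails. Download the CSV from the dataset page first and read
# the local copy (the file name below is an assumption; adjust it to your download).
import pandas as pd
df = pd.read_csv('covid_impact_on_airport_traffic.csv')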

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
df.head()

In [None]:
# drop columns we don't need
df = df.drop(columns = ['AggregationMethod','Version','Centroid','ISO_3166_2','Geography'])

In [None]:
# let's analyze only Australia
data_au = df[df['Country']=='Australia']

In [None]:
# sort by date
data_au = data_au.sort_values(by="Date")

In [None]:
# set date as index
data_au.set_index('Date',inplace=True)

In [None]:
data_au.head()

In [None]:
# let's plot percent of baseline over time
plt.figure(figsize=(20,10))
plt.plot(data_au['PercentOfBaseline'])
plt.xticks(range(0,300,25))
plt.title("Plot for PercentOfBaseline Vs Time for Australia")
plt.show()
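
The filter, sort and index steps above can also be sketched on a tiny made-up table. Only the column names `Date`, `Country` and `PercentOfBaseline` mirror the airport dataset; the rows are invented for illustration. The sketch additionally converts the dates with `pd.to_datetime`, which is an optional extra step of ours that lets matplotlib choose sensible date ticks.

```python
import pandas as pd

# made-up rows; only the column names mirror the airport dataset
toy = pd.DataFrame({
    'Date': ['2020-03-03', '2020-03-01', '2020-03-02', '2020-03-01'],
    'Country': ['Australia', 'Australia', 'Australia', 'Canada'],
    'PercentOfBaseline': [64, 80, 71, 90],
})

toy['Date'] = pd.to_datetime(toy['Date'])    # strings -> timestamps
toy_au = (
    toy[toy['Country'] == 'Australia']       # keep one country
    .sort_values(by='Date')                  # chronological order
    .set_index('Date')                       # date becomes the index
)
print(toy_au)
```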

# Fetching an OpenML dataset

Here we will look at [OpenML](https://www.openml.org/), a repository of open datasets that anyone can use to explore data and test methods.

We need to pass in an ID to access, as follows:

In [None]:
from sklearn.datasets import fetch_openml

OpenML contains all sorts of data types. By browsing the website we found an electroencephalography (EEG) dataset to explore: 

In [None]:
data_id = 1471 #this was found by browsing OpenML
dataset = fetch_openml(data_id=data_id, as_frame=True)

In [None]:
dir(dataset)

In [None]:
dataset.url

In [None]:
type(dataset)

In [None]:
print(dataset.DESCR)

In [None]:
original_names=['AF3', 'F7', 'F3', 'FC5', 'T7', 'P7', 'O1', 'O2', 'P8', 'T8', 'FC6', 'F4', 'F8', 'AF4']

In [None]:
dataset.feature_names

In [None]:
df = dataset.frame

In [None]:
df.head()

In [None]:
df.shape[0] / 117
# the recording lasts 117 seconds, so this gives about 128 frames per second

In [None]:
df.dtypes

In [None]:
#def summary(s):
#    print(s.max(), s.min(), s.mean(), s.std())
#    print()
#    
#for col in df.columns[:-1]:
#    column = df.loc[:,col]
#    summary(column)

In [None]:
df.plot()

From the plot we can quickly identify a handful of huge outliers, which make the plot look completely useless. We assume these are artifacts and remove them.

In [None]:
df2 = df.iloc[:,:-1].clip(upper=6000) #Elements above the threshold will be changed to match the threshold value.
df2.plot()

Now we see better what is going on. Let's remove the frames corresponding to those outliers.
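
One way to drop such rows is with a boolean mask, sketched here on a made-up table. The column names `ch1` and `ch2` and the values are hypothetical; the threshold of 5000 matches the one used for the EEG data in this section.

```python
import pandas as pd

# made-up signal with two artificial spikes; column names are hypothetical
toy_eeg = pd.DataFrame({
    'ch1': [4300.0, 4310.0, 9000.0, 4295.0, 4305.0],
    'ch2': [4100.0, 4090.0, 4105.0, 7000.0, 4098.0],
})

threshold = 5000
# True for every row where at least one channel exceeds the threshold
is_outlier = (toy_eeg > threshold).any(axis=1)
clean = toy_eeg[~is_outlier]

print(is_outlier.sum(), "rows removed")
print(clean)
```

A mask works on index labels rather than row positions, so it stays correct even when the DataFrame index is not a plain 0..n-1 range.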

In [None]:
# find the frames where any channel exceeds 5000
frames = np.nonzero(np.any(df.iloc[:,:-1].values>5000, axis=1))[0] 
frames

In [None]:
# remove those frames
df.drop(index=frames, inplace=True)

In [None]:
# plotting without outliers
df.plot(figsize=(16,8))
plt.legend(labels=original_names)

## Logistic regression example

Logistic regression estimates the probability of an event occurring, based on a given dataset of independent variables. Since the output is a probability, the predicted value is bounded between 0 and 1. We will take a closer look at regression later in this course, but here is a short example using the previous data. 
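
To illustrate why the output stays between 0 and 1, here is a minimal sketch of the logistic (sigmoid) function applied to a linear score. The weights `w`, bias `b` and sample `x` are made-up numbers, not values fitted to the EEG data.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical weights, bias and one sample, chosen only for illustration
w = np.array([0.8, -1.2])
b = 0.1
x = np.array([2.0, 0.5])

z = w @ x + b          # linear score, can be any real number
p = sigmoid(z)         # predicted probability of the positive class
print(z, p)
```

Fitting a logistic regression model amounts to choosing `w` and `b` so that these probabilities match the observed labels as well as possible.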

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
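# make logistic regression model
# note: despite the variable name, penalty='l2' is ridge-style regularisation (lasso would be 'l1')
lasso = LogisticRegression(penalty='l2')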

In [None]:
# divide into X and y (data ie. EEG measurements and target ie. eyes open/closed)
X = df.values[:,:-1]
y = df.Class

y = y.astype(int) - 1 # map to 0,1 (np.int was removed from NumPy, use the builtin int)

In [None]:
print(X.shape)
print(y.shape)

In [None]:
# fit the model with our data (=train model to classify samples using EEG values)
lasso.fit(X,y)

In [None]:
# predict y values with the model
comp = (lasso.predict(X) == y).values
np.sum(comp.astype(int))/y.shape[0] # accuracy of predictions

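The accuracy is not very good, and note that it is measured on the same data the model was trained on, so it is an optimistic estimate.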

In [None]:
# print coefficients
coef = lasso.coef_[0]
plt.barh(range(coef.shape[0]), coef)
plt.yticks(ticks=range(14),labels=original_names)

plt.show()

Interpreting the coefficients: it is tempting to read the magnitude of each coefficient as feature importance, in which case the P7 and T7 values would have the most impact on y. However, we did not scale the features to a comparable range before fitting the model, so we cannot draw that conclusion yet.
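
A sketch of how scaling could be added before the fit is shown below. To keep it runnable without a download it uses the bundled breast cancer dataset as a stand-in for the EEG data, and the pipeline with `StandardScaler` is our suggestion rather than part of the original analysis.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

bc = load_breast_cancer()          # stand-in dataset, not the EEG data
X_bc, y_bc = bc.data, bc.target

# standardize each feature (zero mean, unit variance) before the fit
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_bc, y_bc)

coef_scaled = model.named_steps['logisticregression'].coef_[0]
order = np.argsort(np.abs(coef_scaled))[::-1]   # largest magnitude first
for idx in order[:5]:
    print(f"{bc.feature_names[idx]:25s} {coef_scaled[idx]:+.2f}")

print("training accuracy:", model.score(X_bc, y_bc))
```

Once all features are on the same scale, comparing coefficient magnitudes is more meaningful, although correlated features can still make the ranking unstable.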

# Extra exercise

<div class='alert alert-warning'>
Go to [OpenML](https://openml.org) and use the search function (or just look around) to find any dataset that interests you. Load it using the above methodology, and try to do anything you can to understand the data type, visualize it, etc.
</div>

In [None]:
### YOUR CODE HERE