<img src="./images/logo-iug@2x.png" alt="IUG" style="width:300px;"/>

# Data Day@IUG 
**Learning Lab #3**: Data Exploration with Python by Dr. N. Tsourakis

[ntsourakis@iun.ch](mailto:ntsourakis@iun.ch)

## Introduction to Pandas

`Pandas` is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool, built on top of the Python programming language.

In this exercise we will use pandas to load data that can be used for exploration. The data will be read from:
* *csv* files.
* Online web sites.
* Built-in datasets.

### CSV files

A ``comma-separated values`` (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. 

<img src="./images/csv_basic.png" alt="csv file" style="width:400px;"/>
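
To make the record/field structure concrete, here is a minimal sketch that parses a small CSV text held in memory instead of a file on disk. The column names and values are hypothetical and chosen only for illustration; they are not the contents of the lab's data files.

```python
import io

import pandas as pd

# Hypothetical CSV content: one header line followed by three records.
csv_text = """state,year,population
A,2012,100
B,2012,200
C,2012,300
"""

# read_csv accepts a file-like object, so StringIO stands in for a file.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)                # (number of records, number of fields)
print(list(sample.columns))        # field names taken from the header line
print(sample["population"].mean()) # mean of a numeric column
```

Each line after the header becomes one row of the DataFrame, and each comma-separated field becomes one column.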

In [None]:
import pandas as pd

# Read the data.
pop = pd.read_csv('./data/state-population.csv')

print(type(pop))

In [None]:
# Print the first 10 records.
print(pop.head(10))

In [None]:
# Print the 'population' column.
pop['population']

In [None]:
# Print the mean value of the column.
pop['population'].mean()

<u>Quick exercise</u>: Get the mean value of the *age* column.

In [None]:
### Enter your code below this line.

### Online web sites

The Web is a very useful source of data. In this section we use data from the official Swiss website about COVID-19 ([COVID-19 Switzerland](https://www.covid19.admin.ch/en/overview)).

In [None]:
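from io import StringIO

import requests

# Request the page from the URL.
response = requests.get("https://www.covid19.admin.ch/en/overview")
print(response.status_code)

# Parse every HTML table on the page into a list of DataFrames.
# Recent pandas versions expect literal HTML to be wrapped in StringIO.
dfs = pd.read_html(StringIO(response.text))

print(type(dfs[0]))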

In [None]:
# Print the data that we have read.
print(dfs[0])
print(dfs[1])
print(dfs[2])

### Built-in datasets

The ``sklearn.datasets`` module includes utilities to load datasets, including methods to load and fetch popular reference datasets.

In [None]:
from sklearn import datasets

# Print the available datasets.
print(dir(datasets))

We will load the [**Digits dataset**](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html) bundled with scikit-learn:
* It consists of 1797 images of handwritten digits (0 through 9), each 8 by 8 pixels.

<img src="./images/24digits.png" alt="First 24 digit images in the digits dataset" width=400/>

In [None]:
from sklearn.datasets import load_digits

# Load the digits dataset.
digits = load_digits()

# Print the description of the dataset.
print(digits.DESCR)

In [None]:
# Print the size of the dataset.
print(digits.data.shape)
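
The `data` attribute stores each 8-by-8 image flattened into a single row of 64 values, while `images` keeps the 8-by-8 layout. The sketch below illustrates that relationship with NumPy only; the `images` array here is a randomly generated stand-in (an assumption for illustration), not the real dataset.

```python
import numpy as np

# Stand-in for digits.images: 5 hypothetical 8x8 images with values 0..16.
rng = np.random.default_rng(0)
images = rng.integers(0, 17, size=(5, 8, 8))

# Flatten each image into one row of 64 values, in the style of digits.data.
data = images.reshape(len(images), -1)

print(images.shape)
print(data.shape)

# Reshaping a flat row back to 8x8 recovers the original image.
print(np.array_equal(data[3].reshape(8, 8), images[3]))
```

The flat layout is convenient for tabular tools, whereas the 8-by-8 layout is the one to use for display.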

In [None]:
# Show the pixel array for the sample image at index 13.
digits.images[13] 

Visualization of `digits.images[13]`

<img src="./images/digit3.png" alt="Image of a handwritten digit 3" width="200px"/>

Each array element is a gray-scale intensity for one pixel. In this dataset the values are integers in the range 0 to 16.

<img src="./images/grays.png" alt="Grayscale" width="600px"/>
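
As a small illustration of the intensity scale, the sketch below rescales pixel values from the 0 to 16 range to the 0.0 to 1.0 range that many plotting and learning tools work with. The `patch` array is a hypothetical 2-by-3 fragment, not taken from the dataset.

```python
import numpy as np

# Hypothetical 2x3 patch of pixel intensities on the 0..16 scale.
patch = np.array([[0, 4, 8],
                  [12, 16, 2]])

# Divide by the maximum possible intensity to map values into 0.0..1.0.
scaled = patch / 16.0

print(scaled)
```

With Matplotlib's `gray` colormap, lower values are drawn darker and higher values lighter.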

In [None]:
# Set the upper-left pixel to the maximum intensity (16).
digits.images[13, 0, 0] = 16
digits.images[13]

In [None]:
# Display the image again.
import matplotlib.pyplot as plt

plt.gray()  # Use the gray-scale colormap by default.
plt.matshow(digits.images[13])
plt.show()