<img src="./images/logo-iug@2x.png" alt="IUG" style="width:300px;"/>

# Data Day@IUG 
**Learning Lab #3**: Data Exploration with Python by Dr. N. Tsourakis

[ntsourakis@iun.ch](mailto:ntsourakis@iun.ch)

## Introduction to Pandas

`Pandas` is a fast, powerful, flexible and easy-to-use open-source data analysis and manipulation tool, built on top of the Python programming language.

In this exercise we will use pandas to load data for exploration. The data will be read from:
* *csv* files.
* Websites.
* Built-in datasets.

### CSV files

A ``comma-separated values`` (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. 

<img src="./images/csv.png" alt="csv file" style="width:400px;"/>
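To make the record/field structure concrete, here is a minimal sketch that parses CSV text held in a string instead of a file. The two data rows are hypothetical and only modelled on the columns of the population file used below; `io.StringIO` simply lets `pd.read_csv` treat the string like a file.

```python
import io

import pandas as pd

# Hypothetical CSV text: the first line holds the field names,
# every following line is one record with comma-separated fields.
csv_text = """state/region,ages,year,population
AL,under18,2012,1117489
AL,total,2012,4817528
"""

# read_csv accepts any file-like object, so wrap the string in StringIO.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample)
print(sample.dtypes)  # pandas infers a type for each column
```

Each line of the text becomes one row of the DataFrame, and each comma-separated field becomes one column.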

In [14]:
import pandas as pd

# Read the data
pop = pd.read_csv('./data/state-population.csv')

print(type(pop))

#df = pd.read_json("https://api.exchangerate-api.com/v4/latest/USD")
#print(df)

<class 'pandas.core.frame.DataFrame'>


In [13]:
# Print the first 10 records.
print(pop.head(10))

  state/region     ages  year  population
0           AL  under18  2012   1117489.0
1           AL    total  2012   4817528.0
2           AL  under18  2010   1130966.0
3           AL    total  2010   4785570.0
4           AL  under18  2011   1125763.0
5           AL    total  2011   4801627.0
6           AL    total  2009   4757938.0
7           AL  under18  2009   1134192.0
8           AL  under18  2013   1111481.0
9           AL    total  2013   4833722.0


In [21]:
# Print the 'population' column.
pop['population']

0         1117489.0
1         4817528.0
2         1130966.0
3         4785570.0
4         1125763.0
           ...     
2539    309326295.0
2540     73902222.0
2541    311582564.0
2542     73708179.0
2543    313873685.0
Name: population, Length: 2544, dtype: float64

In [20]:
# Print the mean value of the column.
pop['population'].mean()

6805558.401347068
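Note that the overall mean mixes `under18` and `total` rows. As a small sketch of how the two groups could be averaged separately, the example below rebuilds the first four records printed above by hand (so it runs without the CSV file) and uses `groupby`; the variable names are illustrative only.

```python
import pandas as pd

# The first four records from the output above, typed in by hand.
sample = pd.DataFrame({
    "state/region": ["AL", "AL", "AL", "AL"],
    "ages": ["under18", "total", "under18", "total"],
    "year": [2012, 2012, 2010, 2010],
    "population": [1117489.0, 4817528.0, 1130966.0, 4785570.0],
})

# Mean population per value of the 'ages' column.
mean_by_ages = sample.groupby("ages")["population"].mean()
print(mean_by_ages)

# Alternatively, keep only the 'total' rows before taking the mean.
total_mean = sample.loc[sample["ages"] == "total", "population"].mean()
print(total_mean)
```

The same two calls should work on the full `pop` DataFrame, assuming it has been loaded as in the cells above.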

<u>Quick exercise</u>: Get the mean value of the *year* column (the *ages* column holds text labels, so it has no mean).

In [None]:
### Enter your code below this line.

### Online web sites

The Web is a very useful source of data. In this section we use data from the official Swiss website on COVID-19 ([COVID-19 Switzerland](https://www.covid19.admin.ch/en/overview)).

In [25]:
import requests

# Request the data from a url
x = requests.get("https://www.covid19.admin.ch/en/overview")
print(x.status_code)

# Read the data from the tables
dfs = pd.read_html(x.text)

print(type(dfs[0]))
print(len(dfs))

print(dfs[0])
print(dfs[1])
print(dfs[2])

200
<class 'pandas.core.frame.DataFrame'>
10
                            0       1
0  Difference to previous day    1846
1      Total since 13.10.2021  15 962
2     Per 100 000 inhabitants   18327
                            0    1
0  Difference to previous day   45
1      Total since 13.10.2021  245
2     Per 100 000 inhabitants  281
                            0   1
0  Difference to previous day   4
1      Total since 13.10.2021  52
2     Per 100 000 inhabitants   6


### Built-in datasets

In [5]:
from sklearn import datasets
print(dir(datasets))

['__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '_base', '_california_housing', '_covtype', '_kddcup99', '_lfw', '_olivetti_faces', '_openml', '_rcv1', '_samples_generator', '_species_distributions', '_svmlight_format_fast', '_svmlight_format_io', '_twenty_newsgroups', 'clear_data_home', 'dump_svmlight_file', 'fetch_20newsgroups', 'fetch_20newsgroups_vectorized', 'fetch_california_housing', 'fetch_covtype', 'fetch_kddcup99', 'fetch_lfw_pairs', 'fetch_lfw_people', 'fetch_olivetti_faces', 'fetch_openml', 'fetch_rcv1', 'fetch_species_distributions', 'get_data_home', 'load_boston', 'load_breast_cancer', 'load_diabetes', 'load_digits', 'load_files', 'load_iris', 'load_linnerud', 'load_sample_image', 'load_sample_images', 'load_svmlight_file', 'load_svmlight_files', 'load_wine', 'make_biclusters', 'make_blobs', 'make_checkerboard', 'make_circles', 'make_classification', 'make_friedman1', 'make_friedman2', 'make

We will load the [**Digits dataset**](http://scikit-learn.org/stable/datasets/index.html#optical-recognition-of-handwritten-digits-dataset) bundled with scikit-learn:
* 8-by-8 pixel images representing 1797 handwritten digits (0 through 9).

<img src="./images/24digits.png" alt="First 24 digit images in the digits dataset" width=400/>

In [9]:
from sklearn.datasets import load_digits

# Load the digits dataset.
digits = load_digits()
# Print the description of the dataset.
print(digits.DESCR)

.. _digits_dataset:

Optical recognition of handwritten digits dataset
--------------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 1797
    :Number of Attributes: 64
    :Attribute Information: 8x8 image of integer pixels in the range 0..16.
    :Missing Attribute Values: None
    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
    :Date: July; 1998

This is a copy of the test set of the UCI ML hand-written digits datasets
https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

The data set contains images of hand-written digits: 10 classes where
each class refers to a digit.

Preprocessing programs made available by NIST were used to extract
normalized bitmaps of handwritten digits from a preprinted form. From a
total of 43 people, 30 contributed to the training set and different 13
to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
4x4 and the number of on pixels are counted in each blo

In [10]:
# Print the size of the dataset.
print(digits.data.shape)

(1797, 64)
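Since this lab is about pandas, it may help to see the same data as a DataFrame. The sketch below wraps `digits.data` in a DataFrame and appends the labels from `digits.target`; the name `digits_df` and the `target` column label are choices made for this example.

```python
import pandas as pd
from sklearn.datasets import load_digits

digits = load_digits()

# One row per image, one column per pixel (64 columns).
digits_df = pd.DataFrame(digits.data)

# Add the digit each image represents as an extra column.
digits_df["target"] = digits.target

print(digits_df.shape)
# Number of images available for each digit.
print(digits_df["target"].value_counts().sort_index())
```

With the data in this form, the pandas methods used earlier (`head`, `mean`, column selection) apply unchanged.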


In [7]:
# Show array for sample image at index.
digits.images[13] 

array([[ 0.,  2.,  9., 15., 14.,  9.,  3.,  0.],
       [ 0.,  4., 13.,  8.,  9., 16.,  8.,  0.],
       [ 0.,  0.,  0.,  6., 14., 15.,  3.,  0.],
       [ 0.,  0.,  0., 11., 14.,  2.,  0.,  0.],
       [ 0.,  0.,  0.,  2., 15., 11.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  2., 15.,  4.,  0.],
       [ 0.,  1.,  5.,  6., 13., 16.,  6.,  0.],
       [ 0.,  2., 12., 12., 13., 11.,  0.,  0.]])

Visualization of `digits.images[13]`

<img src="./images/digit3.png" alt="Image of a handwritten digit 3" width="200px"/>

Each of the array elements corresponds to a specific gray-scale value.

<img src="./images/grays.png" alt="Grayscale" width="600px"/>
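As a rough illustration of the gray-scale range, the sketch below takes the array shown above (typed in by hand so it runs on its own) and rescales its values from 0..16 to 0..1. Dividing by 16 relies on the range stated in the dataset description; the name `scaled` is illustrative.

```python
import numpy as np

# The 8x8 array printed above for digits.images[13].
image = np.array([
    [0, 2, 9, 15, 14, 9, 3, 0],
    [0, 4, 13, 8, 9, 16, 8, 0],
    [0, 0, 0, 6, 14, 15, 3, 0],
    [0, 0, 0, 11, 14, 2, 0, 0],
    [0, 0, 0, 2, 15, 11, 0, 0],
    [0, 0, 0, 0, 2, 15, 4, 0],
    [0, 1, 5, 6, 13, 16, 6, 0],
    [0, 2, 12, 12, 13, 11, 0, 0],
], dtype=float)

# Pixel intensities lie in 0..16, so dividing by 16 maps them to 0..1.
scaled = image / 16.0

print(image.min(), image.max())
print(scaled.min(), scaled.max())
```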

In [29]:
# Set the top-left pixel of the sample image to the maximum intensity (16).
digits.images[13][0][0] = 16
digits.images[13]

array([[16.,  2.,  9., 15., 14.,  9.,  3.,  0.],
       [ 0.,  4., 13.,  8.,  9., 16.,  8.,  0.],
       [ 0.,  0.,  0.,  6., 14., 15.,  3.,  0.],
       [ 0.,  0.,  0., 11., 14.,  2.,  0.,  0.],
       [ 0.,  0.,  0.,  2., 15., 11.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  2., 15.,  4.,  0.],
       [ 0.,  1.,  5.,  6., 13., 16.,  6.,  0.],
       [ 0.,  2., 12., 12., 13., 11.,  0.,  0.]])
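The cell above writes directly into the array held by `digits`. If the loaded dataset should stay untouched, one option is to edit a copy instead, as in this sketch (the name `edited` is illustrative):

```python
from sklearn.datasets import load_digits

digits = load_digits()

# Work on a copy so the array stored in the dataset is not modified.
edited = digits.images[13].copy()
edited[0, 0] = 16

print(edited[0, 0])             # value in the edited copy
print(digits.images[13][0, 0])  # value still held by the dataset
```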

In [None]:
import matplotlib.pyplot as plt
# Display the first image of the dataset as a matrix plot.
plt.matshow(digits.images[0])
plt.show()