# 2019 ADM Laboratory

1. Introduction to Python, Pandas, Datasets
2. Visualization
3. Document Databases, JSON, Mongo
4. Web Scraping, BeautifulSoup
5. Data Exchange, REST API, Social Networks, Twitter
6. Amazon Web Services
7. Amazon Web Services: MapReduce Paradigm
8. PageRank in MapReduce
9. Clustering, SKLearn
10. Python Pipeline

## Virtual Machine

    VirtualBox
    XUbuntu 18.04
    Python 3
    Jupyter/IPython notebook
    JetBrains PyCharm
    Pandas, BeautifulSoup, scikit-learn, networkx, pymongo, pyspark and matplotlib
    MongoDB
    Apache Spark with Hadoop

## Online Services
    mLab
    Amazon Web Services

# Part 1 - Python Basic & Compound Data Types

Python supports whole numbers (called integers) of arbitrary size, and real numbers (called floats), which are stored as 64-bit double-precision values and preserve about 15 significant decimal digits.

In [1]:
number = 42
type(number)

int

In Jupyter we can mix Python code, like the cell above, with text. Notice the label "In [number]" and the corresponding "Out [number]". The number records the order in which the cells were executed.

Within a code block we can query for the value of a variable, like the cell below.

We can also add comments and text in cells called "Markdown", like this one.

In [2]:
number

42

In [None]:
decimal = 3.14159265358979323846264338327950288419716939937510
print(type(decimal)) # Only the last expression of a cell is displayed, so print the type explicitly
print(decimal)
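
The difference between the two numeric types can be checked directly. The following is a small sketch; the variable names `big`, `pi_value` and `pi_text` are only illustrative and do not appear elsewhere in the notebook.

```python
import sys

# Integers have arbitrary precision: the result is exact, however large.
big = 2 ** 100
print(big)

# Floats are 64-bit doubles; digits beyond the available precision are dropped.
pi_value = 3.14159265358979323846264338327950288419716939937510
pi_text = repr(pi_value)
print(pi_text)

# Number of decimal digits a float is guaranteed to preserve.
print(sys.float_info.dig)
```

Most of the digits typed for `pi_value` are silently discarded, while the integer keeps every digit.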

Text is stored in what is called a 'string' variable.

In [None]:
text = 'This is my text'
another = "We can also use double-quotes"
mix = "Mix is 'simple'"

In [None]:
print(text)
print(another)
print(mix)

In [None]:
mix

In [None]:
text

The slicing operator [ ] can be used with strings.

In [None]:
text[2:4]

In [None]:
text[:-5]

In [None]:
text[5:]

In [None]:
text[5]
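
As a compact summary of the slicing cells above, the sketch below stores each result in a variable. The start index is included, the stop index is excluded, and negative indices count from the end. The variable names are illustrative only.

```python
text = 'This is my text'

middle = text[2:4]         # characters at index 2 and 3
without_tail = text[:-5]   # everything except the last 5 characters
from_fifth = text[5:]      # from index 5 to the end
single_char = text[5]      # a single index returns one character

print(middle, without_tail, from_fifth, single_char, sep=' | ')
```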

Notice the different formatting of the contents of the variable when we use the __print__ command and when we use the Jupyter environment to query for the value.
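
The two formats correspond to the built-in functions `str` and `repr`: `print` shows the former, while querying a variable in a cell shows the latter. A minimal sketch, with illustrative variable names:

```python
mix = "Mix is 'simple'"

shown_by_print = str(mix)      # what print(mix) writes
shown_by_jupyter = repr(mix)   # what a cell containing only `mix` displays

print(shown_by_print)
print(shown_by_jupyter)
```

The `repr` form includes the surrounding quotes, which makes it clear that the value is a string.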

Python supports compound data types such as tuples, lists, sets and dictionaries.

### Tuples 

A tuple is a collection which is ordered and unchangeable.

In [None]:
thistuple = (42, "kiwi", 14.911)
type(thistuple)

In [None]:
thistuple

We can access element **i** of the tuple using [**i**]. The first element's index is 0, that is, i=0.

In [None]:
thistuple[0]

In [None]:
print(thistuple[2], thistuple[1])

In [None]:
thistuple[3] # Raises IndexError: the valid indices are 0, 1 and 2

In [None]:
thistuple[0] = 9.871 # Raises TypeError: a tuple cannot be modified

In [None]:
thistuple = (11, 'carrot', 9.871)

In [None]:
thistuple
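
The cells above show that a tuple cannot be changed in place, but the variable can be bound to a new tuple. The sketch below captures the error instead of letting it stop the cell, and also shows unpacking, a common way to read tuple elements. The names `number`, `fruit` and `price` are illustrative.

```python
thistuple = (42, "kiwi", 14.911)

# Item assignment is rejected because tuples are immutable.
try:
    thistuple[0] = 9.871
    assignment_worked = True
except TypeError:
    assignment_worked = False

# Unpacking assigns each element to its own variable.
number, fruit, price = thistuple

# Rebinding the name to a new tuple is allowed.
thistuple = (11, 'carrot', 9.871)
print(assignment_worked, number, fruit, price, thistuple)
```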

### Lists

Python provides lists to store sequences of values. Lists in Python are dynamic: **They grow/shrink on demand**.
Lists are mutable: **Values can change on demand** and also **Data type of individual items can change**.

In [None]:
lst = [1,5,15,7]
print(lst)

In [None]:
lst[2] = 22
lst

In [None]:
lst[1] = 'Hello'
lst

In [None]:
lst[4] = 33 # Raises IndexError: assignment cannot extend a list, use append instead

In [None]:
lst.append(33)

In [None]:
lst

List operations support repetition, concatenation, slicing, iteration, checking for membership, ...

In [None]:
lst * 2

In [None]:
lst + [55, "kiwi"]

In [None]:
lst

New lists can be created by manipulating existing ones.

In [None]:
newlst = lst * 2

In [None]:
newlst

In [None]:
len(newlst)

In [None]:
newlst[1:4]

In [None]:
newlst[:4]

In [None]:
newlst[:-2]

In [None]:
newlst[2:]

In [None]:
33 in newlst

In [None]:
44 in newlst

In [None]:
newlst.count(33)

In [None]:
for value in lst: 
    print(value)

Some other operations change the list itself.

In [None]:
values = [5, 1, 88, 3]

In [None]:
values

In [None]:
values.reverse()

In [None]:
values

In [None]:
values.sort()

In [None]:
values

In [None]:
values.insert(2, 101) # Insert 101 into list at index 2, that is at the 3rd position.

In [None]:
values

In [None]:
values.pop(2) # Removes the element at index 2 from the list and returns its value.

In [None]:
values

In [None]:
newlst.index(33) # Returns index of first occurrence of 33.

In [None]:
newlst.index(88) # Raises ValueError: 88 is not in the list

In [None]:
newlst.remove(33) # Deletes the first occurrence of 33 in list.

In [None]:
newlst

In [None]:
newlst.index(33)
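
Methods that change the list itself, such as `sort` and `reverse`, return `None`, whereas the built-in `sorted` leaves the list untouched and returns a new one. The following sketch contrasts the two; the names `ordered_copy`, `result` and `popped` are illustrative.

```python
values = [5, 1, 88, 3]

# sorted() builds a new list; the original keeps its order.
ordered_copy = sorted(values)
print(ordered_copy, values)

# sort() reorders the list in place and returns None.
result = values.sort()
print(result, values)

# pop() also works in place, but returns the removed element.
popped = values.pop(2)
print(popped, values)
```

Assigning the result of `values.sort()` to a variable is a frequent mistake, since that variable ends up holding `None`.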

Lists can contain tuples.

In [None]:
data = [("julius", 3),
("maria", 2), 
("alice", 4),
("maria", 1)]

In [None]:
data

In [None]:
for (n, a) in data:
    print("I met %s %s times" % (n, a) )

In [None]:
for x in data:
    print("I met %s %s times" % (x[0], x[1]) )

In [None]:
data.sort()

In [None]:
data
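
Sorting a list of tuples compares the first elements and uses the following ones only to break ties, so `data.sort()` orders by name. To order by the second element instead, a `key` function can be passed. This is a small sketch using `sorted` so that `data` itself stays unchanged; `by_name` and `by_count` are illustrative names.

```python
data = [("julius", 3),
        ("maria", 2),
        ("alice", 4),
        ("maria", 1)]

# Default ordering: by name, then by count for equal names.
by_name = sorted(data)

# Order by the count (second element), largest first.
by_count = sorted(data, key=lambda pair: pair[1], reverse=True)

print(by_name)
print(by_count)
```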

### Sets

A set is a collection of unique values that is unordered and unindexed.

In [None]:
myset = {'alice', 'julius', 'maria', 'maria'}

In [None]:
myset

In [None]:
myset[1] # Raises TypeError: sets do not support indexing

In [None]:
myset.sort() # Raises AttributeError: sets have no order to sort

In [None]:
myset.append('cornelia') # Raises AttributeError: sets use add instead of append

In [None]:
myset.add('cornelia')

In [None]:
myset

In [None]:
myset.add(3)

In [None]:
print(myset)

In [None]:
myset

In [None]:
'alice' in myset

In [None]:
for value in myset: # Use a new name so the list called values is not overwritten
    print(value)

In [None]:
len(myset)
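
Because sets hold unique values, they are often used to remove duplicates from a list and to compare collections. The sketch below is illustrative: the `guests` list and the `staff` set are made-up examples, not data used elsewhere in the notebook.

```python
guests = ['alice', 'julius', 'maria', 'maria']
staff = {'maria', 'cornelia'}

# Converting a list to a set drops the duplicates.
unique_guests = set(guests)

both = unique_guests & staff          # intersection
everyone = unique_guests | staff      # union
only_guests = unique_guests - staff   # difference

# A set has no order; sorted() returns an ordered list when one is needed.
ordered = sorted(unique_guests)
print(both, only_guests, ordered, len(everyone))
```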

### Dictionaries

Dictionaries are lookup tables that map a **key** to a **value**. The keys of a dictionary form a set, so duplicate keys are not allowed.

In [None]:
cities= {'A': 'Ancona',
'B': 'Bari',
'C': 'Como'}

In [None]:
cities

In [None]:
cities['A']

In [None]:
cities.get('A')

In [None]:
cities['X'] # Raises KeyError: there is no key 'X'

In [None]:
cities.get('X','unknown')

In [None]:
cities['D'] = 'Domodossola'

In [None]:
cities['D']

Values can be of any type.

In [None]:
cities['E'] = 42

In [None]:
cities

Keys can be of any hashable data type, such as numbers, strings or tuples. Mutable types such as lists cannot be used as keys.

In [None]:
cities[42] = 'E'

In [None]:
cities

In [None]:
cities.pop(42)

In [None]:
cities

In [None]:
del cities['E']

In [None]:
cities

In [None]:
len(cities)

In [None]:
nobel = {
(1979, "physics"): ["Glashow", "Salam", "Weinberg"],
(1962, "chemistry"): ["Hodgkin"],
(1984, "biology"): ["McClintock"],
}

In [None]:
nobel

In [None]:
for key in nobel:
    print(key)

In [None]:
for key in nobel:
    print(nobel[key])

In [None]:
for (key, value) in nobel.items():
    print(key, value)

In [None]:
nobel.values()

In [None]:
nobel.keys()
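
The `nobel` dictionary uses tuples as keys. This works because tuples are hashable; a list in the same position is rejected. The sketch below repeats part of the dictionary so that it runs on its own. The key `(2000, "peace")` is a made-up example of a missing entry.

```python
nobel = {
    (1979, "physics"): ["Glashow", "Salam", "Weinberg"],
    (1962, "chemistry"): ["Hodgkin"],
}

# A tuple is hashable, so it can be looked up as a key.
laureates = nobel[(1979, "physics")]

# A list is mutable and therefore unhashable: using it as a key fails.
try:
    nobel[[1979, "physics"]] = []
    list_key_accepted = True
except TypeError:
    list_key_accepted = False

# get() returns a default instead of raising KeyError for a missing key.
missing = nobel.get((2000, "peace"), "unknown")
print(laureates, list_key_accepted, missing)
```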

# Part 2 - Pandas

A library with useful tools for data engineering.

In [None]:
import numpy as np
import pandas as pd

## Series
A Series represents a one-dimensional, labeled list of values.

In [None]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])

In [None]:
s
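
The `np.nan` entry marks a missing value. A short sketch of how a Series treats it: the values are stored as floats, and aggregations such as `sum` and `count` skip the missing entry by default. The variable names `total`, `present` and `missing` are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])

# NaN is a float, so the whole Series is stored as float64.
print(s.dtype)

total = s.sum()            # missing values are skipped
present = s.count()        # number of non-missing values
missing = s.isna().sum()   # number of missing values
print(total, present, missing)
```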

**Jupyter** does not enforce the flow of execution. Code snippets can be executed in any order, possibly changing the state of the variables used.

In [None]:
myset

In [None]:
myindex = list(myset)

In [None]:
myindex

In [None]:
myseries = pd.Series(np.random.randn(len(myindex)), index=myindex) # One random value per index label

In [None]:
myseries

Note that **randn** returns a sample (or samples) from the "standard normal" distribution.

https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.random.randn.html

In [None]:
myseries['pitt'] = 1.5

In [None]:
myseries

In [None]:
myseries['julius']

In [None]:
myseries.max()

In [None]:
myseries.sum()

In [None]:
nobel

In [None]:
nobelseries = pd.Series(nobel)

In [None]:
nobelseries

In [None]:
nobelseries[(1984,'biology')]

## DataFrames

A 2-dimensional labeled data structure with columns of potentially different types.

In [None]:
seriesA = pd.Series([1., 2., 3.], index=['a', 'b', 'c'])
seriesB = pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])

In [None]:
d = {'Series A': seriesA, 'Series B': seriesB}

In [None]:
frame = pd.DataFrame(d)

In [None]:
frame
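
When a DataFrame is built from Series with different indexes, pandas aligns the values on the index labels and fills the gaps with NaN. The sketch below rebuilds the same frame so that it runs on its own and checks the cell that has no value in `seriesA`.

```python
import pandas as pd

seriesA = pd.Series([1., 2., 3.], index=['a', 'b', 'c'])
seriesB = pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])

frame = pd.DataFrame({'Series A': seriesA, 'Series B': seriesB})

# The frame has one row per label found in either Series.
print(frame.shape)

# Label 'd' exists only in seriesB, so 'Series A' is missing there.
gap = frame.loc['d', 'Series A']
print(pd.isna(gap))
```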

In [None]:
frame['Series A']

In [None]:
frame.loc['b', 'Series A'] # Select a single cell by row label and column name

In [None]:
frame.iloc[2]

In [None]:
frame = pd.DataFrame(d, index=['d', 'b', 'a'])

In [None]:
frame

In [None]:
frame = pd.DataFrame(d, index=['d', 'b', 'a'], columns=['Series B', 'Series C'])

In [None]:
frame

In [None]:
frame['Series B'] > 2.0

In [None]:
frame.loc['b', 'Series B'] = 2.5 # Assign through loc; chained indexing may modify a temporary copy

In [None]:
frame['Series B'] > 2.0

In [None]:
frame.insert(1, 'copy B', frame['Series B'])

In [None]:
frame

In [None]:
frame.loc[frame['Series B'] > 2.0]
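
A comparison such as `frame['Series B'] > 2.0` produces a boolean Series, which `loc` then uses as a mask to keep only the rows marked `True`. A minimal, self-contained sketch of both steps; the small frame below mimics the one above, and the names `mask` and `selected` are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({'Series B': [4., 2., 1.]}, index=['d', 'b', 'a'])

# Set a single cell by row label and column name.
frame.loc['b', 'Series B'] = 2.5

# The comparison yields one boolean per row.
mask = frame['Series B'] > 2.0
print(mask.tolist())

# Rows where the mask is True are kept.
selected = frame.loc[mask]
print(selected)
```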

# Part 3 - Open Data
https://dati.comune.roma.it/
Open Data portal of the City of Rome
### Strutture ricettive gennaio 2019 (Accommodation facilities, January 2019)
List of the accommodation facilities in Roma Capitale, updated as of January.

https://dati.comune.roma.it/catalog/dataset/d823/resource/9964559d-0a9b-4dd6-a417-eb1ed019ab59

In [None]:
dataset = pd.read_csv('opendata_suar_gennaio.csv', sep=',', delimiter=None, header='infer',
names=None, index_col=None, usecols=None, encoding = "ISO-8859-1", nrows=20)

In [None]:
dataset

In [None]:
dataset[:3] # Look at the first 3 rows

In [None]:
dataset['Municipio'] # Select a column

In [None]:
dataset.index

In [None]:
dataset.columns

In [None]:
dataset.sort_values('Tipologia')

In [None]:
dataset['Fax'].isnull()

In [None]:
dataset.drop(columns=['Fax']) # Returns a new DataFrame; dataset itself is not modified

In [None]:
dataset

In [None]:
dataset['Quadruple'].isnull()

In [None]:
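dataset['Quadruple'].fillna(0) # Returns a new Series with missing values replaced by 0

In [None]:
dataset['Quadruple'].dropna() # Returns a new Series without the missing values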

In [None]:
dataset['Quadruple'].sum()

In [None]:
dataset['Unitaâ Abitative'].max()

In [None]:
dataset['Unitaâ Abitative'].idxmax()

In [None]:
dataset.loc[19]
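
Methods such as `drop`, `fillna` and `dropna` return new objects and leave `dataset` unchanged unless the result is assigned back. The sketch below illustrates this on a tiny made-up table; the `rooms` frame and its values are hypothetical and only borrow a few column names from the dataset above.

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only.
rooms = pd.DataFrame({
    'Tipologia': ['Hotel', 'B&B', 'Hostel'],
    'Fax': [np.nan, np.nan, '06-000'],
    'Quadruple': [2.0, np.nan, 5.0],
})

# drop() returns a new frame; rooms still has the Fax column.
without_fax = rooms.drop(columns=['Fax'])
print('Fax' in rooms.columns, 'Fax' in without_fax.columns)

# To keep a change, assign the result back.
rooms['Quadruple'] = rooms['Quadruple'].fillna(0)

total = rooms['Quadruple'].sum()
best_row = rooms['Quadruple'].idxmax()   # index label of the largest value
print(total, rooms.loc[best_row, 'Tipologia'])
```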