# Pandas

The `numpy` module is excellent for numerical computations, but handling missing data or arrays with mixed types takes more work. The `pandas` module is currently the most widely used tool for data manipulation, providing high-performance, easy-to-use data structures and advanced data analysis tools.

In particular `pandas` features:

* A fast and efficient "DataFrame" object for data manipulation with integrated indexing (distributed data frames are the basis of big data analysis);
* Tools for reading and writing data between in-memory data structures and different formats (CSV, i.e. comma-separated values, Excel, SQL, HDF5);
* Intelligent data alignment and integrated handling of missing data (it is common for a sensor to return no data at some point because of some problem);
* Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
* Aggregating or transforming data with a powerful "group-by" engine; 
* High performance merging and joining of data sets;
* Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
* Time-series functionality (e.g. time steps as indexes);
* Highly optimized for performance, with critical code paths written in Cython or C.

In pandas, operations are often performed column by column (column-wise): each column is homogeneous in type, while different columns may hold different types. In machine learning the columns are called "features"; in maths and physics they are "variables" (e.g. pressure, volume).

In [1]:
import pandas as pd
import numpy as np

## Series

A Series is a one-dimensional array with axis labels, which can also store heterogeneous elements. Of paramount importance are time series, used to describe the time evolution of a phenomenon.

A DataFrame is similar to an Excel table in which both columns and rows can be labeled; a single column is a Series. A Series generalizes a NumPy array: the indexes need not be (0, 1, 2, ...) but can be arbitrary labels (e.g. times).

In [2]:
from string import ascii_lowercase as letters # the lowercase letters of the alphabet

# Creating a series, accessing indexes, values and values by their index 
xs = pd.Series(np.arange(10)*0.5, index=tuple(letters[:10]))
print ("xs:", xs,'\n')
print ("xs indexes:",xs.index,'\n')
# Values of the Series are actually a numpy array
print ("xs values:", xs.values, type(xs.values),'\n')
print (xs['f'], xs.f, xs.h, '\n') # series.label is another way to access an element, besides
# [], but it is more dangerous
print (xs[['d', 'f', 'h']], '\n') # this creates a sub-series
print (type(xs[['d', 'f', 'h']]), '\n')

xs: a    0.0
b    0.5
c    1.0
d    1.5
e    2.0
f    2.5
g    3.0
h    3.5
i    4.0
j    4.5
dtype: float64 

xs indexes: Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype='object') 

xs values: [0.  0.5 1.  1.5 2.  2.5 3.  3.5 4.  4.5] <class 'numpy.ndarray'> 

2.5 2.5 3.5 

d    1.5
f    2.5
h    3.5
dtype: float64 

<class 'pandas.core.series.Series'> 



In [5]:
# Extracting elements and operations: same as numpy array
print (xs[:3],'\n') # the first three elements (position 3 is excluded)
print (xs[7:], '\n') # from the eighth element onward
print (xs[::3], '\n') # the whole series, one element every three
print (xs[xs>3], '\n') # filtering: selects all the elements greater than three
# (here h, i, j) and returns them as a new Series;
# the original xs is left unchanged
print ('exp:', '\n', np.exp(xs), '\n') # works because all elements are numeric; an element that
# cannot be exponentiated (e.g. a string) would raise an error
print (np.mean(xs), np.std(xs), '\n')

a    0.0
b    0.5
c    1.0
dtype: float64 

h    3.5
i    4.0
j    4.5
dtype: float64 

a    0.0
d    1.5
g    3.0
j    4.5
dtype: float64 

h    3.5
i    4.0
j    4.5
dtype: float64 

exp:  
 a     1.000000
b     1.648721
c     2.718282
d     4.481689
e     7.389056
f    12.182494
g    20.085537
h    33.115452
i    54.598150
j    90.017131
dtype: float64 

2.25 1.4361406616345072 



In [None]:
# Series can be created from a Python dictionary too (key : index, value : value).
# Note that the elements can be anything!
d = {'b' : 1, 'a' : 'cat', 'c' : [2,3]}
pd.Series(d) # pass the dictionary to the constructor



A key difference between a Series and a NumPy array is that operations between Series automatically align the data based on label. Thus, you can write computations without considering whether the Series involved have the same labels.

In [6]:
s = pd.Series(np.random.randn(5), index=tuple(letters[:5]))
print(s)
s = s[1:] + s[:-1]
print(s) # NaN appears for the labels that are not present in both operands

a   -1.075249
b    0.695521
c    0.090315
d   -0.118532
e   -0.333618
dtype: float64
a         NaN
b    1.391042
c    0.180629
d   -0.237064
e         NaN
dtype: float64
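
When the NaN produced by non-matching labels is not wanted, one option is the `add` method with a `fill_value`. A minimal sketch with made-up values (the names `s1` and `s2` are illustrative only and do not come from the cells above):

```python
import numpy as np
import pandas as pd

# Two illustrative series that share only the labels 'b' and 'c'
s1 = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
s2 = pd.Series([10.0, 20.0, 30.0], index=['b', 'c', 'd'])

# Plain addition aligns on labels; a label present on one side only gives NaN
plain = s1 + s2

# add() with fill_value treats a label missing on one side as 0
filled = s1.add(s2, fill_value=0)

print(plain)
print(filled)
```

Whether filling with 0 is appropriate depends on the data: for measurements, a missing value is often better left as NaN.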


### Time series

Series matter above all because they are a good implementation of time series, where the index is the time step, e.g. for measurements repeated over time.
Time series are very often used to profile the behaviour of a quantity as a function of time. Pandas has a special index for that, `DatetimeIndex`, which can be created e.g. with the function `pd.date_range()`.

`DatetimeIndex`: Immutable ndarray-like of datetime64 data.
Represented internally as int64, and which can be boxed to Timestamp objects that are subclasses of datetime and carry metadata.

In [9]:
# to define a date, the datetime module is very useful
import datetime as dt # one of the handy Python libraries for dealing with dates and times
date = dt.date.today()
print(date)

date = dt.datetime(2020,11,9,14,45,10,15) # (..., seconds, microseconds)
print (date)

# otherwise, several notations are interpreted too
date = 'Nov 9 2020'
# or alternatively
date = '9/11/2020 14:45:00' # note: parsed month-first by pandas, i.e. as September 11, 2020
print (date)

days = pd.date_range(date, periods=7, freq='D') # 'D' gives a daily frequency
print (days)

seconds = pd.date_range(date, periods=3600, freq='s') # 's' gives a frequency of one second
print (seconds)
# attaching such an index to a set of measurements is useful when the sampling interval is known

2020-11-22
2020-11-09 14:45:10.000015
9/11/2020 14:45:00
DatetimeIndex(['2020-09-11 14:45:00', '2020-09-12 14:45:00',
               '2020-09-13 14:45:00', '2020-09-14 14:45:00',
               '2020-09-15 14:45:00', '2020-09-16 14:45:00',
               '2020-09-17 14:45:00'],
              dtype='datetime64[ns]', freq='D')
DatetimeIndex(['2020-09-11 14:45:00', '2020-09-11 14:45:01',
               '2020-09-11 14:45:02', '2020-09-11 14:45:03',
               '2020-09-11 14:45:04', '2020-09-11 14:45:05',
               '2020-09-11 14:45:06', '2020-09-11 14:45:07',
               '2020-09-11 14:45:08', '2020-09-11 14:45:09',
               ...
               '2020-09-11 15:44:50', '2020-09-11 15:44:51',
               '2020-09-11 15:44:52', '2020-09-11 15:44:53',
               '2020-09-11 15:44:54', '2020-09-11 15:44:55',
               '2020-09-11 15:44:56', '2020-09-11 15:44:57',
               '2020-09-11 15:44:58', '2020-09-11 15:44:59'],
              dtype='datetime64[ns]', lengt

To learn more about the frequency strings, please see this [link](http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases)
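
As a small illustration of a few other frequency strings, here is a sketch; the start date is arbitrary and the variable names are illustrative only:

```python
import pandas as pd

start = '2020-11-09 14:45:00'

# One timestamp every 15 minutes
quarter_hours = pd.date_range(start, periods=4, freq='15min')
# Weekly; a plain 'W' anchors on Sundays
weeks = pd.date_range(start, periods=3, freq='W')
# First day of each month ('MS' = month start)
month_starts = pd.date_range(start, periods=3, freq='MS')

print(quarter_hours)
print(weeks)
print(month_starts)
```

With an anchored frequency such as `'W'` or `'MS'`, the range starts at the first matching date on or after the given start.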


Timestamped data is the most basic type of time series data that associates values with points in time. For pandas objects it means using the points in time.

Functions like `pd.to_datetime` can be used, for instance, when reading information as strings from a dataset.

In [12]:
tstamp = pd.Timestamp(dt.datetime(2020, 11, 9)) # stored as an integer counter that starts in 1970

# internally it counts the nanoseconds since January 1st, 1970;
# the epoch itself therefore has value 0
tstamp = pd.Timestamp(dt.datetime(1970, 1, 1, 0, 0, 0, 0))
print(tstamp.value)

# when creating a timestamp the format can be explicitly passed
ts = pd.to_datetime('2010/11/12', format='%Y/%m/%d')
print (type(ts))
print (ts)
ts = pd.to_datetime('12-11-2010 00:00', format='%d-%m-%Y %H:%M')
print (ts)
print (ts.value)



0
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
2010-11-12 00:00:00
2010-11-12 00:00:00
1289520000000000000
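
Date strings such as `'9/11/2020'` are ambiguous: without a format, pandas reads them month-first. The sketch below shows how an explicit format removes the ambiguity, assuming a hypothetical set of day-first strings (the `raw` values are made up):

```python
import pandas as pd

# Hypothetical timestamps as they might appear in a text file (day first)
raw = pd.Series(['09/11/2020 14:45:00', '10/11/2020 08:30:00'])

# Without a format, a string like this is read month-first
month_first = pd.to_datetime('9/11/2020 14:45:00')

# An explicit format states which field is the day and which the month
parsed = pd.to_datetime(raw, format='%d/%m/%Y %H:%M:%S')

print(month_first)  # a date in September
print(parsed)       # dates in November
```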


A standard Series can be created with a `DatetimeIndex` as index, and single elements or ranges of elements can then be selected by date.

In [None]:
tseries = pd.Series(np.random.normal(10, 1, len(days)), index=days)
# Extracting elements
print (tseries[0:4], '\n')
print (tseries['2020-09-11':'2020-09-13'], '\n') # Note - includes end time; the dates must lie within the index (days starts on 2020-09-11)


`pd.to_datetime` can also be used to create a `DatetimeIndex`:

In [None]:
pd.to_datetime([1, 2, 3, 4], unit='D', origin=pd.Timestamp('1980-02-03'))

## DataFrame

A pandas DataFrame is like a simple tabular spreadsheet. For future reference (or for people already familiar with R), a pandas DataFrame is very similar to the R DataFrame.

Each column in a DataFrame is a Series object.

The elements can be of any type, and missing data are handled too (as NaN).
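
A minimal sketch of both statements, using made-up sensor readings (the names `temperature`, `pressure` and `readings` are illustrative only):

```python
import pandas as pd

# Two illustrative sensors that reported at partly different times
temperature = pd.Series([20.1, 20.4, 20.3], index=['t0', 't1', 't2'])
pressure = pd.Series([1.01, 1.02], index=['t1', 't2'])

# Each Series becomes a column; the rows are the union of the labels,
# and a reading that does not exist is stored as NaN
readings = pd.DataFrame({'temperature': temperature, 'pressure': pressure})

print(readings)
print(type(readings['pressure']))  # a single column is a Series
```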

### DataFrame creation

A DataFrame can be created by passing a NumPy array, with, e.g., a `DatetimeIndex` object as index:

In [25]:
entries=10
dates=pd.date_range('11/9/2020 14:45:00',freq='h', periods=entries) # returns a special array, a DatetimeIndex
df = pd.DataFrame(np.random.randn(entries,4), index=dates, columns=['B','A','C','D'])
# index = the dates, elements = a random 10x4 matrix
df # renders more nicely than print(df)


Unnamed: 0,B,A,C,D
2020-11-09 14:45:00,-0.771276,-0.008932,-1.425441,-2.904318
2020-11-09 15:45:00,0.045266,-2.69084,0.873521,-1.379909
2020-11-09 16:45:00,-0.672052,0.925539,0.813724,1.250551
2020-11-09 17:45:00,-1.70713,0.106865,0.946792,-0.047505
2020-11-09 18:45:00,1.851424,-0.904404,-0.972436,1.852458
2020-11-09 19:45:00,-0.705163,-1.388916,-0.034287,-0.510105
2020-11-09 20:45:00,-1.357664,-0.278878,-1.19092,-1.0382
2020-11-09 21:45:00,-0.937293,-0.212307,-0.307887,-0.827927
2020-11-09 22:45:00,0.607844,0.180913,0.475864,1.158746
2020-11-09 23:45:00,-0.687394,0.848406,-0.562924,-1.149191


or by means of a dictionary (the keys become the column labels):

In [13]:
df2 = pd.DataFrame(
    { 'A' : 1.,
      'B' : pd.Timestamp('20130102'),
      'C' : pd.Series(1,index=range(4),dtype='float32'),
      'D' : np.arange(7,11),
      'E' : pd.Categorical(["test","train","test","train"]),
    }
    )
df2

Unnamed: 0,A,B,C,D,E
0,1.0,2013-01-02,1.0,7,test
1,1.0,2013-01-02,1.0,8,train
2,1.0,2013-01-02,1.0,9,test
3,1.0,2013-01-02,1.0,10,train


### Viewing Data (a bird's-eye view; there may be millions of rows)

In [None]:
df.head() # the first five rows (default)

In [None]:
df.tail(4) # the last four rows

In [16]:
df.index # the index (row labels)

DatetimeIndex(['2020-11-09 14:45:00', '2020-11-09 15:45:00',
               '2020-11-09 16:45:00', '2020-11-09 17:45:00',
               '2020-11-09 18:45:00', '2020-11-09 19:45:00',
               '2020-11-09 20:45:00', '2020-11-09 21:45:00',
               '2020-11-09 22:45:00', '2020-11-09 23:45:00'],
              dtype='datetime64[ns]', freq='H')

In [None]:
df.columns # the column labels

In [17]:
df.values # the contents, as a 2D NumPy array (one row per DataFrame row)

array([[ 0.88641128, -0.96350209, -0.06240263, -0.52395349],
       [-0.03074952, -0.4214697 , -0.49909023, -0.62817259],
       [ 0.48872606,  0.59810126, -0.1576192 ,  0.39946065],
       [ 0.36964647,  0.48636288, -1.76025062, -0.45759079],
       [-0.63613712,  1.56022681,  0.22359176, -1.5189408 ],
       [-1.55269706,  0.04516   , -1.3652768 , -0.84890151],
       [ 0.35046574, -0.01344609, -0.69297552,  0.33928997],
       [ 0.55378789, -1.17320169,  1.10680572, -0.41755749],
       [-0.62100121, -1.00947035,  0.43276287, -2.26629301],
       [ 0.03500649,  1.75087374,  0.79011669,  0.20074187]])

In [None]:
df.describe() # for each column: count, mean, std, min, quartiles, max

In [26]:
df

Unnamed: 0,B,A,C,D
2020-11-09 14:45:00,-0.771276,-0.008932,-1.425441,-2.904318
2020-11-09 15:45:00,0.045266,-2.69084,0.873521,-1.379909
2020-11-09 16:45:00,-0.672052,0.925539,0.813724,1.250551
2020-11-09 17:45:00,-1.70713,0.106865,0.946792,-0.047505
2020-11-09 18:45:00,1.851424,-0.904404,-0.972436,1.852458
2020-11-09 19:45:00,-0.705163,-1.388916,-0.034287,-0.510105
2020-11-09 20:45:00,-1.357664,-0.278878,-1.19092,-1.0382
2020-11-09 21:45:00,-0.937293,-0.212307,-0.307887,-0.827927
2020-11-09 22:45:00,0.607844,0.180913,0.475864,1.158746
2020-11-09 23:45:00,-0.687394,0.848406,-0.562924,-1.149191


In [27]:
df.sort_index(axis=1,ascending=True) # axis=1: columns
# this sorts the column labels (letters) in ascending alphabetical order
# axis=0 (rows) would sort the row labels (times) in ascending order
# ascending=False reverses the order: right to left for columns, bottom to top for rows

Unnamed: 0,A,B,C,D
2020-11-09 14:45:00,-0.008932,-0.771276,-1.425441,-2.904318
2020-11-09 15:45:00,-2.69084,0.045266,0.873521,-1.379909
2020-11-09 16:45:00,0.925539,-0.672052,0.813724,1.250551
2020-11-09 17:45:00,0.106865,-1.70713,0.946792,-0.047505
2020-11-09 18:45:00,-0.904404,1.851424,-0.972436,1.852458
2020-11-09 19:45:00,-1.388916,-0.705163,-0.034287,-0.510105
2020-11-09 20:45:00,-0.278878,-1.357664,-1.19092,-1.0382
2020-11-09 21:45:00,-0.212307,-0.937293,-0.307887,-0.827927
2020-11-09 22:45:00,0.180913,0.607844,0.475864,1.158746
2020-11-09 23:45:00,0.848406,-0.687394,-0.562924,-1.149191


In [24]:
df.sort_values(by="C") # the rows are reordered so that column C is in ascending order

Unnamed: 0,A,B,C,D
2020-11-09 17:45:00,0.369646,0.486363,-1.760251,-0.457591
2020-11-09 19:45:00,-1.552697,0.04516,-1.365277,-0.848902
2020-11-09 20:45:00,0.350466,-0.013446,-0.692976,0.33929
2020-11-09 15:45:00,-0.03075,-0.42147,-0.49909,-0.628173
2020-11-09 16:45:00,0.488726,0.598101,-0.157619,0.399461
2020-11-09 14:45:00,0.886411,-0.963502,-0.062403,-0.523953
2020-11-09 18:45:00,-0.636137,1.560227,0.223592,-1.518941
2020-11-09 22:45:00,-0.621001,-1.00947,0.432763,-2.266293
2020-11-09 23:45:00,0.035006,1.750874,0.790117,0.200742
2020-11-09 21:45:00,0.553788,-1.173202,1.106806,-0.417557


In [28]:
df # after all the sorts the original DataFrame is unchanged, because inplace=False by default

Unnamed: 0,B,A,C,D
2020-11-09 14:45:00,-0.771276,-0.008932,-1.425441,-2.904318
2020-11-09 15:45:00,0.045266,-2.69084,0.873521,-1.379909
2020-11-09 16:45:00,-0.672052,0.925539,0.813724,1.250551
2020-11-09 17:45:00,-1.70713,0.106865,0.946792,-0.047505
2020-11-09 18:45:00,1.851424,-0.904404,-0.972436,1.852458
2020-11-09 19:45:00,-0.705163,-1.388916,-0.034287,-0.510105
2020-11-09 20:45:00,-1.357664,-0.278878,-1.19092,-1.0382
2020-11-09 21:45:00,-0.937293,-0.212307,-0.307887,-0.827927
2020-11-09 22:45:00,0.607844,0.180913,0.475864,1.158746
2020-11-09 23:45:00,-0.687394,0.848406,-0.562924,-1.149191
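
The same behaviour on a tiny, made-up DataFrame (the name `small` is illustrative only): sorting returns a new object, and the result has to be bound to a name to be kept.

```python
import pandas as pd

small = pd.DataFrame({'B': [3, 1, 2], 'A': [0.3, 0.1, 0.2]},
                     index=['x', 'y', 'z'])

ordered = small.sort_values(by='B')  # a new, sorted DataFrame
original_order = list(small.index)   # the original is untouched

print(original_order)
print(list(ordered.index))

# To keep the result, rebind the name (or pass inplace=True)
small = small.sort_values(by='B')
```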


## Selection

### Getting slices

The following shows how to get a part of the DataFrame (i.e. not just single elements).

In [42]:
## standard and safe
print (df['A'],'\n')
#print (df[:1]) # a slice inside [] selects rows, not columns
#print (df[0:1]) # these two are equivalent: they print rows from 0 up to 1 excluded, i.e. row 0
#print (df[:,1]) # this syntax does NOT work for getting column 1; see the following sections instead

## equivalent but dangerous (imagine blank spaces in the name of the column, e.g. "A and something else" instead of "A")
#print (df.A)

2020-11-09 14:45:00   -0.008932
2020-11-09 15:45:00   -2.690840
2020-11-09 16:45:00    0.925539
2020-11-09 17:45:00    0.106865
2020-11-09 18:45:00   -0.904404
2020-11-09 19:45:00   -1.388916
2020-11-09 20:45:00   -0.278878
2020-11-09 21:45:00   -0.212307
2020-11-09 22:45:00    0.180913
2020-11-09 23:45:00    0.848406
Freq: H, Name: A, dtype: float64 



In [31]:
# selecting rows by counting
print (df[0:3])

# or by index
print (df["2020-11-09 14:45:00":"2020-11-09 16:45:00"])

                            B         A         C         D
2020-11-09 14:45:00 -0.771276 -0.008932 -1.425441 -2.904318
2020-11-09 15:45:00  0.045266 -2.690840  0.873521 -1.379909
2020-11-09 16:45:00 -0.672052  0.925539  0.813724  1.250551
                            B         A         C         D
2020-11-09 14:45:00 -0.771276 -0.008932 -1.425441 -2.904318
2020-11-09 15:45:00  0.045266 -2.690840  0.873521 -1.379909
2020-11-09 16:45:00 -0.672052  0.925539  0.813724  1.250551


### Selection by label

In [39]:
# getting a cross section (part of the DataFrame) using a label
print(df.loc[dates[3]]) # .loc is indexed with [] like an array, not called like a method
# dates is the DatetimeIndex holding the time labels
df.loc["2020-11-09 17:45:00"]
# as expected, the two are equivalent

B   -1.707130
A    0.106865
C    0.946792
D   -0.047505
Name: 2020-11-09 17:45:00, dtype: float64


B   -1.707130
A    0.106865
C    0.946792
D   -0.047505
Name: 2020-11-09 17:45:00, dtype: float64

In [None]:
# selecting on a multi-axis by label:
df.loc[:,['A','B']] # ':' means from start to end, i.e. ALL the rows

In [None]:
# showing label slicing, both endpoints are included:
df.loc['2020-11-09 18:45:00':'2020-11-09 20:45:00',['A','B']]

In [None]:
# getting an individual element
print (df.loc[dates[1],'A'])

# equivalently
print (df.at[dates[1],'A']) # the preferred way to access a single element

### Selecting by position

In [40]:
# select via the position of the passed integers:
print (df.iloc[3],'\n') # iloc stands for "integer location": no need for loc[index_array[position]],
# iloc[position] (counting from 0) is enough: [3] selects the fourth row
# df.loc[dates[3]] and df.iloc[3] are equivalent, but iloc does not require the name of the index array
# in short: loc["label"], iloc[integer_position]
# with this syntax both loc and iloc return a Series
print (df.iloc[[3]],'\n') # a list returns a DataFrame, printed with its row and column labels
# notation similar to numpy/python
print (df.iloc[3:5,0:2])

B   -1.707130
A    0.106865
C    0.946792
D   -0.047505
Name: 2020-11-09 17:45:00, dtype: float64 

                           B         A         C         D
2020-11-09 17:45:00 -1.70713  0.106865  0.946792 -0.047505 

                            B         A
2020-11-09 17:45:00 -1.707130  0.106865
2020-11-09 18:45:00  1.851424 -0.904404


In [None]:
# selecting rows 1,2 and 4 for columns 0 and 2
df.iloc[[1,2,4],[0,2]]

In [50]:
# slicing rows explicitly
print (df.iloc[1:3,:],'\n')

# slicing columns explicitly
print (df.iloc[:,1:3]) # iloc can do what df[:,1] could not
print ("buh:", '\n', df.iloc[:,1])


                            B         A         C         D
2020-11-09 15:45:00  0.045266 -2.690840  0.873521 -1.379909
2020-11-09 16:45:00 -0.672052  0.925539  0.813724  1.250551 

                            A         C
2020-11-09 14:45:00 -0.008932 -1.425441
2020-11-09 15:45:00 -2.690840  0.873521
2020-11-09 16:45:00  0.925539  0.813724
2020-11-09 17:45:00  0.106865  0.946792
2020-11-09 18:45:00 -0.904404 -0.972436
2020-11-09 19:45:00 -1.388916 -0.034287
2020-11-09 20:45:00 -0.278878 -1.190920
2020-11-09 21:45:00 -0.212307 -0.307887
2020-11-09 22:45:00  0.180913  0.475864
2020-11-09 23:45:00  0.848406 -0.562924
buh: 
 2020-11-09 14:45:00   -0.008932
2020-11-09 15:45:00   -2.690840
2020-11-09 16:45:00    0.925539
2020-11-09 17:45:00    0.106865
2020-11-09 18:45:00   -0.904404
2020-11-09 19:45:00   -1.388916
2020-11-09 20:45:00   -0.278878
2020-11-09 21:45:00   -0.212307
2020-11-09 22:45:00    0.180913
2020-11-09 23:45:00    0.848406
Freq: H, Name: A, dtype: float64


In [52]:
# selecting an individual element by position
print(df.iloc[1,1])
df.iat[1,1]


-2.6908401308292427


-2.6908401308292427
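
One difference between the two selection styles is easy to miss: a label slice with `loc` includes both endpoints, while a position slice with `iloc` excludes the stop position. A small sketch on a made-up DataFrame (the name `frame` is illustrative only):

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2020-11-09 14:45:00', periods=5, freq='h')
frame = pd.DataFrame(np.arange(10).reshape(5, 2), index=idx, columns=['A', 'B'])

# Label slice: both endpoints are included -> 3 rows
by_label = frame.loc['2020-11-09 15:45:00':'2020-11-09 17:45:00', 'A']
# Position slice: the stop position is excluded -> 2 rows
by_position = frame.iloc[1:3, 0]

print(by_label)
print(by_position)
```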

### Boolean index

Very powerful way of filtering out data with certain features. Notation is very similar to numpy arrays.

In [53]:
# Filter by a boolean condition on the values of a single column
df[df['B'] > 0] # returns all the rows of the DataFrame where B is positive

Unnamed: 0,B,A,C,D
2020-11-09 15:45:00,0.045266,-2.69084,0.873521,-1.379909
2020-11-09 18:45:00,1.851424,-0.904404,-0.972436,1.852458
2020-11-09 22:45:00,0.607844,0.180913,0.475864,1.158746


In [54]:
# Selecting on the basis of boolean conditions applied to the whole DataFrame
df[df>0]

# a DataFrame with the same shape is returned, with NaN's where condition is not met

Unnamed: 0,B,A,C,D
2020-11-09 14:45:00,,,,
2020-11-09 15:45:00,0.045266,,0.873521,
2020-11-09 16:45:00,,0.925539,0.813724,1.250551
2020-11-09 17:45:00,,0.106865,0.946792,
2020-11-09 18:45:00,1.851424,,,1.852458
2020-11-09 19:45:00,,,,
2020-11-09 20:45:00,,,,
2020-11-09 21:45:00,,,,
2020-11-09 22:45:00,0.607844,0.180913,0.475864,1.158746
2020-11-09 23:45:00,,0.848406,,
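
Conditions can also be combined. The sketch below uses a small made-up DataFrame (the names `frame`, `both_positive` and `train_only` are illustrative only):

```python
import pandas as pd

frame = pd.DataFrame({'A': [-1.0, 0.5, 2.0, -0.3],
                      'B': [0.2, -0.7, 1.5, 0.9],
                      'E': ['test', 'train', 'test', 'train']})

# Combine conditions with & (and), | (or), ~ (not); the parentheses are required
both_positive = frame[(frame['A'] > 0) & (frame['B'] > 0)]
# isin() tests membership in a list of allowed values
train_only = frame[frame['E'].isin(['train'])]

print(both_positive)
print(train_only)
```

The Python keywords `and` / `or` do not work here, because they try to reduce a whole Series to a single truth value.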


### Setting

Combination of selection and setting of values

In [55]:
# setting values by label (same as by position)
df.at[dates[0],'A'] = 0 # loc and at select by label, iloc and iat by position
# setting and assigning a numpy array
df.loc[:,'D'] = np.array([5] * len(df))

# defining a brand new column: the values are assigned by position, because an ndarray has no index of its own
df['E'] = np.arange(len(df))*0.5

# defining a brand new column by means of a pd.Series: indexes must be the same! This has to be set explicitly,
# because a Series carries its own index
df['E prime'] = pd.Series(np.arange(len(df))*2, index=df.index)


In [56]:
df
def dcos(theta):
    theta = theta*(np.pi/180)
    return np.cos(theta)
 
df['cosine'] = pd.Series(df["E"].apply(dcos), index=df.index) # cosine of column E (taken as degrees),
# stored in the new column cosine
df

Unnamed: 0,B,A,C,D,E,E prime,cosine
2020-11-09 14:45:00,-0.771276,0.0,-1.425441,5,0.0,0,1.0
2020-11-09 15:45:00,0.045266,-2.69084,0.873521,5,0.5,2,0.999962
2020-11-09 16:45:00,-0.672052,0.925539,0.813724,5,1.0,4,0.999848
2020-11-09 17:45:00,-1.70713,0.106865,0.946792,5,1.5,6,0.999657
2020-11-09 18:45:00,1.851424,-0.904404,-0.972436,5,2.0,8,0.999391
2020-11-09 19:45:00,-0.705163,-1.388916,-0.034287,5,2.5,10,0.999048
2020-11-09 20:45:00,-1.357664,-0.278878,-1.19092,5,3.0,12,0.99863
2020-11-09 21:45:00,-0.937293,-0.212307,-0.307887,5,3.5,14,0.998135
2020-11-09 22:45:00,0.607844,0.180913,0.475864,5,4.0,16,0.997564
2020-11-09 23:45:00,-0.687394,0.848406,-0.562924,5,4.5,18,0.996917
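
What happens when the index of the Series does not match can be seen on a small made-up DataFrame (the name `frame` and the column names are illustrative only):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'A': [1.0, 2.0, 3.0]}, index=['x', 'y', 'z'])

# A NumPy array has no labels: values are assigned by position
frame['from_array'] = np.array([10, 20, 30])

# A Series carries its own labels: values are matched by label,
# and rows without a matching label get NaN
frame['from_series'] = pd.Series([10, 20, 30])  # labels 0, 1, 2: no match
frame['aligned'] = pd.Series([10, 20, 30], index=frame.index)

print(frame)
```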


In [57]:
# another example of global setting
df2=df.copy()
df2[df2>0] = -df2
df2 # every positive element of df has been negated, so no element is positive any more

Unnamed: 0,B,A,C,D,E,E prime,cosine
2020-11-09 14:45:00,-0.771276,0.0,-1.425441,-5,0.0,0,-1.0
2020-11-09 15:45:00,-0.045266,-2.69084,-0.873521,-5,-0.5,-2,-0.999962
2020-11-09 16:45:00,-0.672052,-0.925539,-0.813724,-5,-1.0,-4,-0.999848
2020-11-09 17:45:00,-1.70713,-0.106865,-0.946792,-5,-1.5,-6,-0.999657
2020-11-09 18:45:00,-1.851424,-0.904404,-0.972436,-5,-2.0,-8,-0.999391
2020-11-09 19:45:00,-0.705163,-1.388916,-0.034287,-5,-2.5,-10,-0.999048
2020-11-09 20:45:00,-1.357664,-0.278878,-1.19092,-5,-3.0,-12,-0.99863
2020-11-09 21:45:00,-0.937293,-0.212307,-0.307887,-5,-3.5,-14,-0.998135
2020-11-09 22:45:00,-0.607844,-0.180913,-0.475864,-5,-4.0,-16,-0.997564
2020-11-09 23:45:00,-0.687394,-0.848406,-0.562924,-5,-4.5,-18,-0.996917


### Dropping

N.B.: dropping does not act permanently on the DataFrame: like the sorts above, it returns a modified copy. To make the change permanent, reassign the result:
```python
df = df.drop(....)
```

In [58]:
# Dropping by column: remove a column
df.drop(['E prime'], axis=1)

#which is equivalent to
df.drop(columns=['E prime'])

Unnamed: 0,B,A,C,D,E,cosine
2020-11-09 14:45:00,-0.771276,0.0,-1.425441,5,0.0,1.0
2020-11-09 15:45:00,0.045266,-2.69084,0.873521,5,0.5,0.999962
2020-11-09 16:45:00,-0.672052,0.925539,0.813724,5,1.0,0.999848
2020-11-09 17:45:00,-1.70713,0.106865,0.946792,5,1.5,0.999657
2020-11-09 18:45:00,1.851424,-0.904404,-0.972436,5,2.0,0.999391
2020-11-09 19:45:00,-0.705163,-1.388916,-0.034287,5,2.5,0.999048
2020-11-09 20:45:00,-1.357664,-0.278878,-1.19092,5,3.0,0.99863
2020-11-09 21:45:00,-0.937293,-0.212307,-0.307887,5,3.5,0.998135
2020-11-09 22:45:00,0.607844,0.180913,0.475864,5,4.0,0.997564
2020-11-09 23:45:00,-0.687394,0.848406,-0.562924,5,4.5,0.996917


In [59]:
# Dropping by rows
# safe and always working
df.drop(df.index[[1,2,3,4]]) # removes the second to the fifth row, inclusive

Unnamed: 0,B,A,C,D,E,E prime,cosine
2020-11-09 14:45:00,-0.771276,0.0,-1.425441,5,0.0,0,1.0
2020-11-09 19:45:00,-0.705163,-1.388916,-0.034287,5,2.5,10,0.999048
2020-11-09 20:45:00,-1.357664,-0.278878,-1.19092,5,3.0,12,0.99863
2020-11-09 21:45:00,-0.937293,-0.212307,-0.307887,5,3.5,14,0.998135
2020-11-09 22:45:00,0.607844,0.180913,0.475864,5,4.0,16,0.997564
2020-11-09 23:45:00,-0.687394,0.848406,-0.562924,5,4.5,18,0.996917


In [None]:
# something like df.drop('index_name') would work for dropping a row by its label,
# but the type of the index must be specified,
# in particular with DatetimeIndex
df.drop(pd.to_datetime("2020-11-09 22:45:00"))

## Missing data

pandas primarily uses the value np.nan to represent missing data. It is by default not included in computations.

In [60]:
df_wNan = df[df>0]
df_wNan # has NaN wherever df holds a value that is not greater than zero

Unnamed: 0,B,A,C,D,E,E prime,cosine
2020-11-09 14:45:00,,,,5,,,1.0
2020-11-09 15:45:00,0.045266,,0.873521,5,0.5,2.0,0.999962
2020-11-09 16:45:00,,0.925539,0.813724,5,1.0,4.0,0.999848
2020-11-09 17:45:00,,0.106865,0.946792,5,1.5,6.0,0.999657
2020-11-09 18:45:00,1.851424,,,5,2.0,8.0,0.999391
2020-11-09 19:45:00,,,,5,2.5,10.0,0.999048
2020-11-09 20:45:00,,,,5,3.0,12.0,0.99863
2020-11-09 21:45:00,,,,5,3.5,14.0,0.998135
2020-11-09 22:45:00,0.607844,0.180913,0.475864,5,4.0,16.0,0.997564
2020-11-09 23:45:00,,0.848406,,5,4.5,18.0,0.996917


In [61]:
# dropping rows with at least one NaN
df_wNan.dropna(how='any') # 'any' is already the default: a single NaN is enough to drop the row (axis=0 by default)
# with how='all', a row is dropped only if all of its values are NaN

Unnamed: 0,B,A,C,D,E,E prime,cosine
2020-11-09 22:45:00,0.607844,0.180913,0.475864,5,4.0,16.0,0.997564


In [62]:
# getting a mask
df_wNan.isna() # locates the NaN: returns a mask, i.e. a DataFrame of booleans
#df_wNan.notna()

Unnamed: 0,B,A,C,D,E,E prime,cosine
2020-11-09 14:45:00,True,True,True,False,True,True,False
2020-11-09 15:45:00,False,True,False,False,False,False,False
2020-11-09 16:45:00,True,False,False,False,False,False,False
2020-11-09 17:45:00,True,False,False,False,False,False,False
2020-11-09 18:45:00,False,True,True,False,False,False,False
2020-11-09 19:45:00,True,True,True,False,False,False,False
2020-11-09 20:45:00,True,True,True,False,False,False,False
2020-11-09 21:45:00,True,True,True,False,False,False,False
2020-11-09 22:45:00,False,False,False,False,False,False,False
2020-11-09 23:45:00,True,False,True,False,False,False,False


In [63]:
# filling missing data: Fill NA/NaN values
df_wNan.fillna(value=0)

Unnamed: 0,B,A,C,D,E,E prime,cosine
2020-11-09 14:45:00,0.0,0.0,0.0,5,0.0,0.0,1.0
2020-11-09 15:45:00,0.045266,0.0,0.873521,5,0.5,2.0,0.999962
2020-11-09 16:45:00,0.0,0.925539,0.813724,5,1.0,4.0,0.999848
2020-11-09 17:45:00,0.0,0.106865,0.946792,5,1.5,6.0,0.999657
2020-11-09 18:45:00,1.851424,0.0,0.0,5,2.0,8.0,0.999391
2020-11-09 19:45:00,0.0,0.0,0.0,5,2.5,10.0,0.999048
2020-11-09 20:45:00,0.0,0.0,0.0,5,3.0,12.0,0.99863
2020-11-09 21:45:00,0.0,0.0,0.0,5,3.5,14.0,0.998135
2020-11-09 22:45:00,0.607844,0.180913,0.475864,5,4.0,16.0,0.997564
2020-11-09 23:45:00,0.0,0.848406,0.0,5,4.5,18.0,0.996917


Fill gaps forward or backward by propagating non-NA values forward or backward:

In [None]:
# Propagate the last valid observation forward to the next valid one.
df_wNan.ffill() # same as fillna(method='pad') / fillna(method='ffill'), which recent pandas versions deprecate
# bfill() (formerly method='backfill' / 'bfill') uses the next valid observation to fill the gap.
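
The two directions are easier to compare on a short made-up Series (the name `gaps` is illustrative only). The sketch uses the `ffill()` and `bfill()` methods:

```python
import numpy as np
import pandas as pd

gaps = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

forward = gaps.ffill()   # carry the last valid value forward
backward = gaps.bfill()  # pull the next valid value backward

print(forward.tolist())   # [1.0, 1.0, 1.0, 4.0, 4.0]
print(backward.tolist())  # [1.0, 4.0, 4.0, 4.0, nan]
```

A gap at the very end cannot be filled backward (and one at the very start cannot be filled forward), so it stays NaN.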

## Operations

Here comes the most relevant advantage of DataFrame: operations on columns are extremely fast, almost as fast as the same operation between the elements of a single row.

In [None]:
# Some statistics (mean() just as an example)
# axis=0: mean along the rows, i.e. one value per column
print (df.mean(axis=0),'\n')
# axis=1: mean along the columns, i.e. one value per row
print (df.mean(axis=1),'\n')
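
The meaning of `axis` is easy to mix up, so here is a sketch on a made-up DataFrame small enough to check by hand (the name `frame` is illustrative only):

```python
import pandas as pd

frame = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [10.0, 20.0, 30.0]})

per_column = frame.mean(axis=0)  # one value per column: A -> 2.0, B -> 20.0
per_row = frame.mean(axis=1)     # one value per row: 5.5, 11.0, 16.5

print(per_column)
print(per_row)
```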

In [None]:
# global operations on columns
df.apply(np.cumsum) # cumulative sum

In [64]:
df

Unnamed: 0,B,A,C,D,E,E prime,cosine
2020-11-09 14:45:00,-0.771276,0.0,-1.425441,5,0.0,0,1.0
2020-11-09 15:45:00,0.045266,-2.69084,0.873521,5,0.5,2,0.999962
2020-11-09 16:45:00,-0.672052,0.925539,0.813724,5,1.0,4,0.999848
2020-11-09 17:45:00,-1.70713,0.106865,0.946792,5,1.5,6,0.999657
2020-11-09 18:45:00,1.851424,-0.904404,-0.972436,5,2.0,8,0.999391
2020-11-09 19:45:00,-0.705163,-1.388916,-0.034287,5,2.5,10,0.999048
2020-11-09 20:45:00,-1.357664,-0.278878,-1.19092,5,3.0,12,0.99863
2020-11-09 21:45:00,-0.937293,-0.212307,-0.307887,5,3.5,14,0.998135
2020-11-09 22:45:00,0.607844,0.180913,0.475864,5,4.0,16,0.997564
2020-11-09 23:45:00,-0.687394,0.848406,-0.562924,5,4.5,18,0.996917


In [65]:
df.apply(lambda x: x.max() - x.min()) # per column, e.g. for B: 1.85 - (-1.71)

B           3.558554
A           3.616379
C           2.372233
D           0.000000
E           4.500000
E prime    18.000000
cosine      0.003083
dtype: float64

In [None]:
# syntax is as usual similar to that of numpy arrays
df['A']+df['B']

Let's play it hard and load (in memory) a (relatively) large dataset

In [None]:
# WARNING! link in past notebook was wrong!, (if needed) get the right file from:
#!wget https://www.dropbox.com/s/xvjzaxzz3ysphme/data_000637.txt -P ~/data/

file_name="~/data/data_000637.txt"
data=pd.read_csv(file_name)
data

Let's now do some operations among (elements of) columns

In [None]:
# the one-liner killing it all
data['timens']=data['TDC_MEAS']*25/30+data['BX_COUNTER']*25

In [None]:
# the old slooow way
def conversion(data):
    result=[]
    for i in range(len(data)): 
        result.append(data.loc[data.index[i],'TDC_MEAS']*25/30.+data.loc[data.index[i],'BX_COUNTER']*25)
    return result

data['timens']=conversion(data)
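
To compare the two approaches without downloading the file, the sketch below times them on a synthetic stand-in. Only the two column names come from the cells above; the values, the number of rows and the name `fake` are made up, and the timings depend on the machine.

```python
import time
import numpy as np
import pandas as pd

# Synthetic stand-in for the real file: only the two columns used above,
# filled with arbitrary integers
rng = np.random.default_rng(0)
n = 20_000
fake = pd.DataFrame({'TDC_MEAS': rng.integers(0, 30, n),
                     'BX_COUNTER': rng.integers(0, 1000, n)})

t0 = time.perf_counter()
vectorized = fake['TDC_MEAS'] * 25 / 30 + fake['BX_COUNTER'] * 25
t1 = time.perf_counter()
looped = [fake.loc[fake.index[i], 'TDC_MEAS'] * 25 / 30.
          + fake.loc[fake.index[i], 'BX_COUNTER'] * 25
          for i in range(len(fake))]
t2 = time.perf_counter()

print(f"vectorized: {t1 - t0:.4f} s, loop: {t2 - t1:.4f} s")
print("same result:", np.allclose(vectorized, looped))
```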

## Merge

pandas provides various facilities for easily combining Series and DataFrame objects, with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.

### Concat

Concatenation (adding rows) is straightforward:


In [67]:
rdf = pd.DataFrame(np.random.randn(10, 4))
rdf

Unnamed: 0,0,1,2,3
0,-1.060695,-0.714518,1.092179,-0.204018
1,-0.711785,0.430397,-0.193971,-0.876465
2,0.073343,0.152318,-0.1671,0.116445
3,-0.748755,-0.148072,0.842267,-0.03444
4,0.489096,0.647985,-1.212847,0.138134
5,0.918031,-0.518538,-0.799419,0.855883
6,1.874292,1.628738,1.637651,-0.767216
7,0.285124,-1.480095,-0.599827,-0.10088
8,-1.250541,-0.592458,1.200935,0.853618
9,0.46026,-0.771179,-1.344308,0.254719


In [68]:
# divide it into pieces row-wise
pieces = [rdf[:3], rdf[3:7], rdf[7:]]
pieces

[          0         1         2         3
 0 -1.060695 -0.714518  1.092179 -0.204018
 1 -0.711785  0.430397 -0.193971 -0.876465
 2  0.073343  0.152318 -0.167100  0.116445,
           0         1         2         3
 3 -0.748755 -0.148072  0.842267 -0.034440
 4  0.489096  0.647985 -1.212847  0.138134
 5  0.918031 -0.518538 -0.799419  0.855883
 6  1.874292  1.628738  1.637651 -0.767216,
           0         1         2         3
 7  0.285124 -1.480095 -0.599827 -0.100880
 8 -1.250541 -0.592458  1.200935  0.853618
 9  0.460260 -0.771179 -1.344308  0.254719]

In [70]:
# put it back together
#pd.concat(pieces) 

# indexes can be ignored, although here it makes no practical difference
pd.concat(pieces, ignore_index=True) # if True, the index values along the concatenation axis are not used

# in case of dimension mismatch, NaN are added where needed

Unnamed: 0,0,1,2,3
0,-1.060695,-0.714518,1.092179,-0.204018
1,-0.711785,0.430397,-0.193971,-0.876465
2,0.073343,0.152318,-0.1671,0.116445
3,-0.748755,-0.148072,0.842267,-0.03444
4,0.489096,0.647985,-1.212847,0.138134
5,0.918031,-0.518538,-0.799419,0.855883
6,1.874292,1.628738,1.637651,-0.767216
7,0.285124,-1.480095,-0.599827,-0.10088
8,-1.250541,-0.592458,1.200935,0.853618
9,0.46026,-0.771179,-1.344308,0.254719
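
The dimension-mismatch case can be sketched with two made-up DataFrames that share only one column (the names `top`, `bottom` and `stacked` are illustrative only):

```python
import pandas as pd

top = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})
bottom = pd.DataFrame({'y': [5, 6], 'z': [7, 8]})

# The columns are the union of both inputs; cells with no source value become NaN
stacked = pd.concat([top, bottom], ignore_index=True)

print(stacked)
```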


In [None]:
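# appending a single row (as a Series)
s = rdf.iloc[3]
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
# Like append, it returns a new object and leaves rdf unchanged.
pd.concat([rdf, s.to_frame().T], ignore_index=True)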

### Merge/Join

SQL-like operations on tables can be performed on DataFrames. This is all rather sophisticated; refer to the [doc](https://pandas.pydata.org/pandas-docs/stable/merging.html#merging) for more info and examples.

In [None]:
left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})

pd.merge(left,right,on="key")
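
When the keys of the two tables do not fully overlap, the `how` argument decides which rows survive. A sketch on made-up tables that extend the example above with one non-matching key on each side:

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'bar', 'baz'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['foo', 'bar', 'qux'], 'rval': [4, 5, 6]})

inner = pd.merge(left, right, on='key')                  # default: keys present in both
outer = pd.merge(left, right, on='key', how='outer')     # keys present in either
left_join = pd.merge(left, right, on='key', how='left')  # all keys of the left table

print(inner)
print(outer)
print(left_join)
```

Rows without a partner in the other table get NaN in the columns coming from that table.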

## Grouping

By “group by” we are referring to a process involving one or more of the following steps:

* Splitting the data into groups based on some criteria
* Applying a function to each group independently
* Combining the results into a data structure


In [None]:
gdf = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                    'B' : ['one', 'one', 'two', 'three',
                           'two', 'two', 'one', 'three'],
                    'C' : np.random.randn(8),
                    'D' : np.random.randn(8)})
gdf

In [None]:
# Grouping and then applying the sum()
# function to the resulting groups; numeric_only=True restricts it to the numeric columns
gdf.groupby('A').sum(numeric_only=True)
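
The "apply" step is not limited to a single function. A sketch with `agg` on a small made-up DataFrame (the name `frame` and its values are illustrative only):

```python
import pandas as pd

frame = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                      'C': [1.0, 2.0, 3.0, 4.0],
                      'D': [10.0, 20.0, 30.0, 40.0]})

# Several statistics for one column
stats = frame.groupby('A')['C'].agg(['mean', 'max'])
# A different function for each column
mixed = frame.groupby('A').agg({'C': 'sum', 'D': 'min'})

print(stats)
print(mixed)
```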

## Multi-indexing


Hierarchical / Multi-level indexing allows sophisticated data analysis on higher dimensional data. In essence, it enables you to store and manipulate data with an arbitrary number of dimensions in lower dimensional data structures like Series (1d) and DataFrame (2d).

In [None]:
tuples = list(zip(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']))
multi_index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
print (multi_index,'\n')

s = pd.Series(np.random.randn(8), index=multi_index)
print (s)


In [None]:
# it enables further features of the groupby method,
# e.g. when group-by by multiple columns
gdf.groupby(['A','B']).sum()

In [None]:
# stack() method “compresses” a level in the DataFrame’s columns
gdf.groupby(['A','B']).sum().stack()
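
Selecting from a multi-indexed object works level by level. A sketch on a small Series with fixed, made-up values (a shortened version of the index built above):

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('bar', 'one'), ('bar', 'two'), ('baz', 'one'), ('baz', 'two')],
    names=['first', 'second'])
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=index)

outer = s.loc['bar']                 # all entries whose first level is 'bar'
single = s.loc[('baz', 'two')]       # one element, addressed by both levels
inner = s.xs('one', level='second')  # cross-section on the inner level
table = s.unstack()                  # the inner level becomes the columns

print(outer)
print(single)
print(inner)
print(table)
```

`unstack()` is the inverse of the `stack()` call above: it moves an index level back into the columns.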

## Plotting

Just a preview, more on the next lab class!

In [None]:
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts.cumsum().plot()

In [None]:
import matplotlib.pyplot as plt

pdf=pd.DataFrame(np.random.randn(1000, 4), index=ts.index,columns=['A', 'B', 'C', 'D'])
pdf = pdf.cumsum()
plt.figure(); pdf.plot(); plt.legend(loc='best')