# Pandas and data visualization
### Lecture 3

Slides adapted from: https://www.tutorialspoint.com/

# What is Pandas
* Pandas is an open-source Python library
* It provides high-performance data manipulation and analysis
* History 
    * The name *Pandas* is derived from the term *panel data*
    * In 2008, Wes McKinney started developing pandas to get performance and flexibility in data analysis
    * Prior to Pandas, Python was mainly used for data munging and preparation
    * Pandas made it practical to carry out the analysis step itself in Python as well
    * Python with Pandas is used in a wide range of academic and commercial fields  
    

##  Key Features of Pandas
* Fast and efficient **DataFrame** object with default and customized indexing.
* Tools for **loading data** into in-memory data objects from different file formats.
* Data alignment and integrated handling of **missing data**
* **Label-based slicing**, indexing and subsetting of large data sets
* Columns from a data structure can be deleted or inserted
* **Group by data** for aggregation and transformations
* Time Series functionality
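
As a quick taste of two of these features (group by and missing-data handling), here is a minimal sketch. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: team names and scores are invented for this example.
scores = pd.DataFrame({
    'team': ['red', 'blue', 'red', 'blue'],
    'score': [10.0, np.nan, 30.0, 40.0],
})

# Group rows by team and average the scores.
# Missing values (NaN) are skipped by default in the aggregation.
mean_by_team = scores.groupby('team')['score'].mean()
print(mean_by_team)
```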

## Installation

``` conda install pandas ```

## Data Structures

* Three data structures
    * Series
    * DataFrame
    * Panel (deprecated in pandas 0.20 and removed in 0.25; listed here for completeness)
    
 

   
### Relationship among structures
* The higher dimensional data structure is a container of its lower dimensional data structure: 
    * DataFrame is a container of Series
    * Panel is a container of DataFrame
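
A minimal sketch of the first relationship: selecting one column of a DataFrame gives back a Series that shares the row index of the DataFrame. The small table reuses names that appear in later examples.

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Tom', 'Jack'], 'Age': [28, 34]})

# Selecting a single column returns a Series.
age = people['Age']
is_series = isinstance(age, pd.Series)

# The Series carries the same row labels as the DataFrame it came from.
shares_index = age.index.equals(people.index)

print(type(age))
print(is_series, shares_index)
```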

### Summary data structures 
    
| Data Structure |Dimensions | Description |
| --- | --- | --- |
| Series | 1D | 1D labeled homogeneous array, fixed size |
| DataFrame | 2D | 2D labeled tabular structure with heterogeneous columns, size not fixed |
| Panel | 3D | General 3D labeled array, size not fixed |

### Series
* 1D array structure
* Homogeneous data (same type)
* Size Immutable
* Values of Data Mutable

Example: 1,10,14,15,6,90,76
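
The example values can be turned into a Series to illustrate two of the properties above: values can be changed in place, and all values share one data type. This is only a sketch; `mixed` is a hypothetical second Series added to show type unification.

```python
import pandas as pd

s = pd.Series([1, 10, 14, 15, 6, 90, 76])

# Values are mutable: overwrite the first element by position.
s.iloc[0] = 100
print(s)

# Data is homogeneous: mixing an int and a float gives one float dtype.
mixed = pd.Series([1, 2.5])
print(mixed.dtype)
```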

### DataFrame

* A labelled two-dimensional array with heterogeneous data

| Name | Room | Age |
| --- | --- | --- |
| Andrea | MEJ 9202B | 42  |
| Caterina | MEJ 9202A | NaN |
| Claire | MEJ 7302 | 33 |
| Nicolas | Kenan 725| 34|

* Data type of columns

| Column | Type |
| --- | --- |
| Name | String |
| Room | String |
| Age | Float (integer values, but the NaN forces a float column) |

#### DataFrame Summary

* Heterogeneous data
* Size Mutable
* Data Mutable
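
A minimal sketch that builds the table above and checks these properties. Note that `Age` comes out as `float64` because of the missing value; the `AgeKnown` column is a hypothetical addition used only to show that the size can change.

```python
import numpy as np
import pandas as pd

staff = pd.DataFrame({
    'Name': ['Andrea', 'Caterina', 'Claire', 'Nicolas'],
    'Room': ['MEJ 9202B', 'MEJ 9202A', 'MEJ 7302', 'Kenan 725'],
    'Age': [42, np.nan, 33, 34],
})

# Heterogeneous data: each column has its own dtype.
print(staff.dtypes)

# Size mutable: a new column can be added after construction.
staff['AgeKnown'] = staff['Age'].notna()
print(staff.shape)
```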

### Panel

* A 3D data structure with heterogeneous data
* A container of DataFrames
* Size Mutable
* Data Mutable

### pandas.Series

* The constructor is ```pandas.Series(data, index, dtype, copy)```
    * data: ndarray, list, dict, or scalar constant
    * index: values must be hashable and the same length as data; duplicates are allowed. Default np.arange(n)
    * dtype: data type; inferred from data if None
    * copy: whether to copy the input data. Default False
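
A short sketch of how the `index` and `dtype` arguments behave, under the assumption of a plain list as input. The last lines illustrate that duplicate index labels are accepted; selecting such a label returns every matching entry.

```python
import pandas as pd

# Explicit index and dtype.
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'], dtype='float64')
print(s)

# No index passed: labels default to 0, 1, ..., n-1.
default = pd.Series([1, 2, 3])
print(list(default.index))

# Duplicate labels are allowed; lookup returns all matches as a Series.
dup = pd.Series([1, 2], index=['a', 'a'])
print(dup['a'])
```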

In [None]:
import pandas as pd # import the pandas library
help(pd.Series) # show the documentation for Series

####  Example series

In [167]:
ds = pd.Series(dtype='float64') # create empty series; dtype given explicitly because the default differs across pandas versions
print(ds)

Series([], dtype: float64)


In [168]:
import numpy as np # import the numpy library
data = np.array(['x','y','z','j']) # create series from array
ds = pd.Series(data) # index is not passed: using range(n)
print(ds) 

0    x
1    y
2    z
3    j
dtype: object


#### Example series from dict and scalar

In [169]:
data = {'x' : 0., 'y' : 1., 'z' : 2.} # using dictionary as an input
ds = pd.Series(data) # index is not passed, using dict keys as index
print(ds)

x    0.0
y    1.0
z    2.0
dtype: float64


In [170]:
data = {'x' : 0., 'y' : 1., 'z' : 2.}
ds = pd.Series(data,index=['y','z','k','x']) # data with index
print(ds)

y    1.0
z    2.0
k    NaN
x    0.0
dtype: float64


In [171]:
ds = pd.Series(5, index=[0, 1, 2, 3]) # series from scalar
print(ds)

0    5
1    5
2    5
3    5
dtype: int64


#### Access data from position

In [172]:
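ds = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
print(ds.iloc[0]) # first element; .iloc makes positional access explicit
print('****')
print(ds.iloc[[0,1]]) # first two elements
print('****')
print(ds.iloc[:3]) # first three elements
print('****')
print(ds.iloc[-3:]) # last three elements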

1
****
a    1
b    2
dtype: int64
****
a    1
b    2
c    3
dtype: int64
****
c    3
d    4
e    5
dtype: int64


#### Access data using labels

In [173]:
ds = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
print(ds['b'])
print('****')
print(ds[['b']])
print('****')
print(ds[['b','c']])

2
****
b    2
dtype: int64
****
b    2
c    3
dtype: int64


In [174]:
#print(ds['z']) # raises KeyError: 'z' is not in ['a','b','c','d','e']
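
A minimal sketch of ways to deal with a label that may be missing: catch the `KeyError`, test membership first, or use `Series.get` with a fallback. The fallback value `-1` is an arbitrary choice for illustration.

```python
import pandas as pd

ds = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# Option 1: catch the KeyError raised for an unknown label.
try:
    ds['z']
    found = True
except KeyError:
    found = False

# Option 2: the `in` operator tests the index labels, not the values.
has_z = 'z' in ds

# Option 3: get() returns a default instead of raising.
fallback = ds.get('z', -1)

print(found, has_z, fallback)
```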

### pandas.DataFrame (Excel-like)

* Columns can be of different types
* Size is not fixed
* Labeled axes (rows and columns)
* Arithmetic operations can be applied on rows and columns

#### Rows and columns

| Name | Room | Age |
| --- | --- | --- |
| Andrea | MEJ 9202B | 42  |
| Caterina | MEJ 9202A | NaN |
| Claire | MEJ 7302 | 33 |
| Nicolas | Kenan 725| 34|



#### Constructor

* ```pandas.DataFrame(data, index, columns, dtype, copy)```
    * data: ndarray, Series, map, list, dict, constant, or another DataFrame (mostly iterables)
    * index: row labels, as for Series. Default np.arange(n) if no index is passed
    * columns: column labels. Default np.arange(n) if no column labels are passed
    * dtype: a single data type to force on all columns; inferred if None
    * copy: whether to copy the input data



In [None]:
help(pd.DataFrame)

#### Creating DataFrames

In [176]:
import pandas as pd
df = pd.DataFrame() # empty
print(df)

Empty DataFrame
Columns: []
Index: []


In [177]:
df = pd.DataFrame([1,2,3,4,5]) # from list
print(df)

   0
0  1
1  2
2  3
3  4
4  5


In [178]:
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age']) # from list of lists
df['Age'] = df['Age'].astype(float) # np.float was removed from NumPy; cast the numeric column with the builtin float
print(df)

     Name   Age
0    Alex  10.0
1     Bob  12.0
2  Clarke  13.0


In [179]:
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]} # from dict of lists
df = pd.DataFrame(data)
print(df)

    Name  Age
0    Tom   28
1   Jack   34
2  Steve   29
3  Ricky   42


In [180]:
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['A','B','C','D']) # now using index
print(df)

    Name  Age
A    Tom   28
B   Jack   34
C  Steve   29
D  Ricky   42


In [181]:
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data) # from list of dicts
print(df)

   a   b     c
0  1   2   NaN
1  5  10  20.0


In [182]:
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, index=['X', 'Y']) # from list of dicts, with row index
print(df)

   a   b     c
X  1   2   NaN
Y  5  10  20.0


In [183]:
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]

# keep only the dictionary keys a and b
df1 = pd.DataFrame(data, index=['X', 'Y'], columns=['a', 'b'])
print(df1)
print('******')

# a column name with no matching key is filled with NaN
df2 = pd.DataFrame(data, index=['X', 'Y'], columns=['a', 'b1'])

print(df2)

   a   b
X  1   2
Y  5  10
******
   a  b1
X  1 NaN
Y  5 NaN


In [184]:
d = {'first' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'second' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])} # from dictionary of series

df = pd.DataFrame(d)
print(df) # notice the missing element in first column

   first  second
a    1.0       1
b    2.0       2
c    3.0       3
d    NaN       4


#### Column Selection, Addition, and Deletion

In [185]:
print(df['first'])
print('****')
print(df[['first','second']])

a    1.0
b    2.0
c    3.0
d    NaN
Name: first, dtype: float64
****
   first  second
a    1.0       1
b    2.0       2
c    3.0       3
d    NaN       4


In [186]:
df['third']=pd.Series([10,20,30],index=['a','b','c']) # add column using a Series
print(df)

   first  second  third
a    1.0       1   10.0
b    2.0       2   20.0
c    3.0       3   30.0
d    NaN       4    NaN


In [187]:
df['fourth'] = df['first'] + df['third'] # add column by combining existing columns, like a spreadsheet formula
print(df)

   first  second  third  fourth
a    1.0       1   10.0    11.0
b    2.0       2   20.0    22.0
c    3.0       3   30.0    33.0
d    NaN       4    NaN     NaN


In [188]:
df4 = df.pop('fourth') #delete using pop
print(df4)
print('****')
print(df)

a    11.0
b    22.0
c    33.0
d     NaN
Name: fourth, dtype: float64
****
   first  second  third
a    1.0       1   10.0
b    2.0       2   20.0
c    3.0       3   30.0
d    NaN       4    NaN


In [189]:
del df['first'] #delete using del
print(df)

   second  third
a       1   10.0
b       2   20.0
c       3   30.0
d       4    NaN
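
Both `pop` and `del` change the DataFrame in place. As a sketch of a non-destructive alternative, `DataFrame.drop(columns=...)` returns a new frame and leaves the original untouched. The frame below is rebuilt by hand so the example runs on its own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'second': [1, 2, 3, 4], 'third': [10.0, 20.0, 30.0, np.nan]},
    index=['a', 'b', 'c', 'd'],
)

# drop() returns a copy without the listed columns.
trimmed = df.drop(columns=['third'])

print(trimmed)
print(list(df.columns))  # the original still has both columns
```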


#### Row Selection, Addition, and Deletion

In [190]:
print(df.loc['b']) # select by index label

second     2.0
third     20.0
Name: b, dtype: float64


In [191]:
print(df.iloc[0]) # select by integer position

second     1.0
third     10.0
Name: a, dtype: float64


In [192]:
print(df[2:4])  # slice rows

   second  third
c       3   30.0
d       4    NaN
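
One detail worth a sketch: label slices with `.loc` include the end label, while positional slices with `.iloc` exclude the end position. The frame is rebuilt by hand so the example is self-contained.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'second': [1, 2, 3, 4], 'third': [10.0, 20.0, 30.0, np.nan]},
    index=['a', 'b', 'c', 'd'],
)

# Label slice: both 'b' and 'c' are included.
by_label = df.loc['b':'c']

# Positional slice: positions 1 and 2, position 3 is excluded.
by_pos = df.iloc[1:3]

# A single cell can be selected with a row label and a column label.
cell = df.loc['b', 'third']

print(by_label)
print(by_pos)
print(cell)
```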


In [193]:
df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])
df = pd.concat([df, df2]) # add rows; DataFrame.append was removed in pandas 2.0
print(df)

   a  b
0  1  2
1  3  4
0  5  6
1  7  8


In [194]:
df = df.drop(0) # remove by label: drops every row labeled 0
print(df)


   a  b
1  3  4
1  7  8
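
Two rows disappeared above because both input frames had a row labeled 0. A minimal sketch of one way to avoid this, assuming fresh labels are acceptable: pass `ignore_index=True` to `pd.concat` so the result is relabeled 0 to n-1 before dropping.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])

# ignore_index=True discards the old labels and renumbers the rows.
combined = pd.concat([df, df2], ignore_index=True)

# Now label 0 refers to exactly one row.
remaining = combined.drop(0)

print(combined)
print(remaining)
```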


In [199]:
df = pd.DataFrame([[1, 2], [3, 4],[5, 6], [7, 8]], columns = ['a','b'])
df = df[2:] # remove the first two rows by slicing
print(df)

   a  b
2  5  6
3  7  8
