# Lab 3: Data Structures in Pandas

In [1]:
import pandas as pd

In [2]:
import numpy as np

# 1. Let's create our own data

## 1.1. Creating a Series

### The Pandas Series is a one-dimensional container, similar to the built-in Python list, in which every element also has an index label.
#### You can think of a Pandas Series as a single column of a DataFrame.


In [3]:
fruitSeries=pd.Series(['banana', 'apple', 'coconut','orange'])

In [4]:
fruitSeries

0     banana
1      apple
2    coconut
3     orange
dtype: object

In [5]:
# note: the calorie values are quoted, so they are stored as strings (dtype: object)
caleriesSeries=pd.Series(['100','50','120','25'])

In [6]:
caleriesSeries

0    100
1     50
2    120
3     25
dtype: object
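The output above reports `dtype: object` because the calorie values were entered as quoted strings. A minimal sketch of converting such a Series to numbers with `pd.to_numeric` follows; the variable names `calories_as_text` and `calories_as_numbers` are made up for this illustration and are not part of the lab.

```python
import pandas as pd

# Same values as the calorie Series above, entered as strings.
calories_as_text = pd.Series(['100', '50', '120', '25'])

# pd.to_numeric parses each string into a number.
calories_as_numbers = pd.to_numeric(calories_as_text)

print(calories_as_numbers.dtype)   # an integer dtype
print(calories_as_numbers.sum())   # arithmetic now works on the values
```

With the string version, operations such as `sum()` would concatenate or fail rather than add, so converting early is usually the safer choice.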

### We can create a DataFrame from multiple data series

In [7]:
df=pd.DataFrame()
df['fruit']=fruitSeries
df['calorie']=caleriesSeries
df

     fruit  calorie
0   banana      100
1    apple       50
2  coconut      120
3   orange       25
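The same table can probably be built in one step by passing the Series to `pd.DataFrame` inside a dictionary. This is a sketch of an alternative, not the lab's own method; the names `fruit`, `calorie`, and `df_one_step` are chosen here for illustration.

```python
import pandas as pd

fruit = pd.Series(['banana', 'apple', 'coconut', 'orange'])
calorie = pd.Series(['100', '50', '120', '25'])

# Each dictionary key becomes a column name.
# The two Series are aligned on their shared index (0 to 3).
df_one_step = pd.DataFrame({'fruit': fruit, 'calorie': calorie})
print(df_one_step)
```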


## 1.2. Creating a DataFrame from a dictionary

### The DataFrame is the most common Pandas object. It can be thought of as Python's way of storing spreadsheet-like data.

### A DataFrame can be thought of as a dictionary of Series objects. This is why dictionaries are the most common way of creating a DataFrame by hand.
### The key represents the column name, and the values are the contents of the column.
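To make the "dictionary of Series" idea concrete, here is a small sketch on a two-row table. The names `data` and `small` are hypothetical; the sketch only checks that keys become column names, that each column is a Series, and that the table can be turned back into a dictionary of lists.

```python
import pandas as pd

data = {'Name': ['William Gosset', 'Carl F. Gauss'], 'Age': [61, 77]}
small = pd.DataFrame(data)

# The dictionary keys are now the column names.
print(list(small.columns))

# Selecting one column gives back a Series.
print(isinstance(small['Age'], pd.Series))

# to_dict('list') converts the table back into a dictionary of lists.
print(small.to_dict('list'))
```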



<img src="https://raw.githubusercontent.com/mosleh-exeter/BEM1025/master/sessions/images/session03-fig1.png"  width="1000" height="1200">

In [8]:
# a Python dictionary is a key-value structure
dictionaryOfScientists={'Name': ['Rosaline Franklin', 'William Gosset','Alexander Flemming','Carl F. Gauss'],
                        'Occupation': ['Chemist', 'Statistician','Physician','Mathematician'],
                        'Born': ['1920-07-25', '1876-06-13','1881-08-06','1777-04-10'],
                        'Died': ['1958-04-16', '1937-10-16','1954-03-11','1855-02-23'],
                        'YearsActive': [10, 10,20,25],
                        'Age': [37, 61,73,77]}



In [9]:
dictionaryOfScientists

{'Name': ['Rosaline Franklin',
  'William Gosset',
  'Alexander Flemming',
  'Carl F. Gauss'],
 'Occupation': ['Chemist', 'Statistician', 'Physician', 'Mathematician'],
 'Born': ['1920-07-25', '1876-06-13', '1881-08-06', '1777-04-10'],
 'Died': ['1958-04-16', '1937-10-16', '1954-03-11', '1855-02-23'],
 'YearsActive': [10, 10, 20, 25],
 'Age': [37, 61, 73, 77]}

In [10]:
# we now transform the dictionary to a dataframe
DataFrameOfScientists = pd.DataFrame(dictionaryOfScientists)

In [11]:
DataFrameOfScientists

                 Name     Occupation        Born        Died  YearsActive  Age
0   Rosaline Franklin        Chemist  1920-07-25  1958-04-16           10   37
1      William Gosset   Statistician  1876-06-13  1937-10-16           10   61
2  Alexander Flemming      Physician  1881-08-06  1954-03-11           20   73
3       Carl F. Gauss  Mathematician  1777-04-10  1855-02-23           25   77


## 2. Let's select some data

### 2.1. Selection based on column name

In [12]:
DataFrameOfScientists['Name']

0     Rosaline Franklin
1        William Gosset
2    Alexander Flemming
3         Carl F. Gauss
Name: Name, dtype: object

In [13]:
DataFrameOfScientists[['Name','Occupation']]

                 Name     Occupation
0   Rosaline Franklin        Chemist
1      William Gosset   Statistician
2  Alexander Flemming      Physician
3       Carl F. Gauss  Mathematician
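The two selections above return different kinds of objects: a single column name gives a Series, while a list of names gives a DataFrame, even when the list holds only one name. A short sketch on a two-row version of the table (the names `scientists`, `one_column`, and `as_frame` are illustrative):

```python
import pandas as pd

scientists = pd.DataFrame({
    'Name': ['Rosaline Franklin', 'William Gosset'],
    'Occupation': ['Chemist', 'Statistician'],
    'Age': [37, 61],
})

one_column = scientists['Name']     # a single name returns a Series
as_frame = scientists[['Name']]     # a list of names returns a DataFrame

print(type(one_column).__name__)
print(type(as_frame).__name__)
print(as_frame.shape)
```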


### 2.2. Selection based on filter

In [14]:
# filter to select chemists
chemistFilter=DataFrameOfScientists['Occupation']=='Chemist'
chemistFilter

0     True
1    False
2    False
3    False
Name: Occupation, dtype: bool

In [15]:
DataFrameOfScientists[chemistFilter]

                Name  Occupation        Born        Died  YearsActive  Age
0  Rosaline Franklin     Chemist  1920-07-25  1958-04-16           10   37
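A filter like `chemistFilter` is a Series of True/False values with one entry per row. The sketch below, using illustrative names, shows two things that follow from this: summing the filter counts the matching rows, and `.loc` can apply the filter and pick a column in one step.

```python
import pandas as pd

scientists = pd.DataFrame({
    'Name': ['Rosaline Franklin', 'William Gosset',
             'Alexander Flemming', 'Carl F. Gauss'],
    'Occupation': ['Chemist', 'Statistician', 'Physician', 'Mathematician'],
})

is_chemist = scientists['Occupation'] == 'Chemist'

# True counts as 1 and False as 0, so the sum is the number of matches.
print(is_chemist.sum())

# .loc[rows, column] filters the rows and selects one column together.
chemist_names = scientists.loc[is_chemist, 'Name']
print(chemist_names.tolist())
```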


In [16]:
# filter to select those named Alexander Flemming
alexanderFlemmingFilter=DataFrameOfScientists['Name']=='Alexander Flemming'

In [17]:
DataFrameOfScientists[alexanderFlemmingFilter]

                 Name  Occupation        Born        Died  YearsActive  Age
2  Alexander Flemming   Physician  1881-08-06  1954-03-11           20   73


In [18]:
# filter to select those named William 
# we use the function str.contains()
williamFilter=DataFrameOfScientists['Name'].str.contains('William')

In [19]:
DataFrameOfScientists[williamFilter]

             Name    Occupation        Born        Died  YearsActive  Age
1  William Gosset  Statistician  1876-06-13  1937-10-16           10   61
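Note that `str.contains()` is case-sensitive by default, so searching for 'william' in lower case would match nothing in this table. A minimal sketch of the `case=False` option, with variable names invented for the illustration:

```python
import pandas as pd

names = pd.Series(['Rosaline Franklin', 'William Gosset',
                   'Alexander Flemming', 'Carl F. Gauss'])

# Default behaviour: upper and lower case must match exactly.
exact_case = names.str.contains('william')

# case=False ignores the difference between upper and lower case.
any_case = names.str.contains('william', case=False)

print(exact_case.tolist())
print(any_case.tolist())
```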


In [20]:
# filter to select by age
ageFilter=DataFrameOfScientists['Age']>=60

In [21]:
DataFrameOfScientists[ageFilter]

                 Name     Occupation        Born        Died  YearsActive  Age
1      William Gosset   Statistician  1876-06-13  1937-10-16           10   61
2  Alexander Flemming      Physician  1881-08-06  1954-03-11           20   73
3       Carl F. Gauss  Mathematician  1777-04-10  1855-02-23           25   77


In [22]:
# filter to select by age range [40, 70): at least 40 and strictly younger than 70
ageFilter2=(DataFrameOfScientists['Age']>=40)&(DataFrameOfScientists['Age']<70)

In [23]:
DataFrameOfScientists[ageFilter2]

             Name    Occupation        Born        Died  YearsActive  Age
1  William Gosset  Statistician  1876-06-13  1937-10-16           10   61
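The range filter combines two conditions with `&` ("and"). Each condition needs its own parentheses, because Python would otherwise evaluate `&` before the comparisons, which usually raises an error. The sketch below also shows `|` ("or"); the names `in_range` and `outside_range` are illustrative only.

```python
import pandas as pd

scientists = pd.DataFrame({
    'Name': ['Rosaline Franklin', 'William Gosset',
             'Alexander Flemming', 'Carl F. Gauss'],
    'Age': [37, 61, 73, 77],
})
age = scientists['Age']

# Both conditions must hold: 40 <= Age < 70.
in_range = (age >= 40) & (age < 70)

# At least one condition must hold: Age < 40 or Age >= 70.
outside_range = (age < 40) | (age >= 70)

print(scientists[in_range]['Name'].tolist())
print(scientists[outside_range]['Name'].tolist())
```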


### Practice: find the scientists who are younger than 70 and have a 'W' in their name.

### 2.3. Selection based on inverse filter 

In [24]:
DataFrameOfScientists[~williamFilter]

                 Name     Occupation        Born        Died  YearsActive  Age
0   Rosaline Franklin        Chemist  1920-07-25  1958-04-16           10   37
2  Alexander Flemming      Physician  1881-08-06  1954-03-11           20   73
3       Carl F. Gauss  Mathematician  1777-04-10  1855-02-23           25   77
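The `~` operator flips every True to False and every False to True, so an inverted filter selects exactly the rows the original filter leaves out. An inverted filter can also be combined with other conditions, as in this sketch (variable names are illustrative):

```python
import pandas as pd

scientists = pd.DataFrame({
    'Name': ['Rosaline Franklin', 'William Gosset',
             'Alexander Flemming', 'Carl F. Gauss'],
    'Age': [37, 61, 73, 77],
})

has_william = scientists['Name'].str.contains('William')
not_william = ~has_william
print(not_william.tolist())

# Scientists not named William who are also at least 60.
older_not_william = scientists[not_william & (scientists['Age'] >= 60)]
print(older_not_william['Name'].tolist())
```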


## 3. Let's perform some basic operations 

In [25]:
# we multiply every value in the YearsActive column by 2 and save the result as a new column
DataFrameOfScientists['DoubleActive']=DataFrameOfScientists['YearsActive']*2

In [26]:
DataFrameOfScientists

                 Name     Occupation        Born        Died  YearsActive  Age  DoubleActive
0   Rosaline Franklin        Chemist  1920-07-25  1958-04-16           10   37            20
1      William Gosset   Statistician  1876-06-13  1937-10-16           10   61            20
2  Alexander Flemming      Physician  1881-08-06  1954-03-11           20   73            40
3       Carl F. Gauss  Mathematician  1777-04-10  1855-02-23           25   77            50


In [27]:
# we compute the natural log (base e) of the YearsActive column
DataFrameOfScientists['YearsActive_log']=np.log(DataFrameOfScientists['YearsActive'])

In [28]:
DataFrameOfScientists

                 Name     Occupation        Born        Died  YearsActive  Age  DoubleActive  YearsActive_log
0   Rosaline Franklin        Chemist  1920-07-25  1958-04-16           10   37            20         2.302585
1      William Gosset   Statistician  1876-06-13  1937-10-16           10   61            20         2.302585
2  Alexander Flemming      Physician  1881-08-06  1954-03-11           20   73            40         2.995732
3       Carl F. Gauss  Mathematician  1777-04-10  1855-02-23           25   77            50         3.218876
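Both operations in this section are applied to every row at once, without writing a loop. As a rough check of what `np.log` computes, the sketch below confirms that it is the natural logarithm (applying `np.exp` recovers the original values) and shows `np.log10` for comparison. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

years_active = pd.Series([10, 10, 20, 25], name='YearsActive')

natural_log = np.log(years_active)     # base e, as used above
log_base_10 = np.log10(years_active)   # base 10, for comparison

# The result is still a Series with one value per row.
print(natural_log)

# exp() undoes the natural log, up to floating-point rounding.
print(np.allclose(np.exp(natural_log), years_active))
```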


In [29]:
# we can find the maximum value in a column (use .min() for the minimum)
DataFrameOfScientists['Age'].max()

77

In [30]:
# we can calculate the mean of a column (use .sum() for the sum)
DataFrameOfScientists['Age'].mean()

62.0
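The cells above show `max()` and `mean()`; `min()` and `sum()` work the same way. The sketch below also uses `agg` with a list of function names to compute several summaries in one call. It rebuilds the Age column under the illustrative name `ages`.

```python
import pandas as pd

ages = pd.Series([37, 61, 73, 77], name='Age')

print(ages.min())
print(ages.sum())

# agg with a list of names returns one value per summary.
summary = ages.agg(['min', 'max', 'sum', 'mean'])
print(summary)
```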

### We can create a new column based on existing columns
### For example, create a new column holding the fraction of each scientist's life spent active (YearsActive divided by Age)
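One way to do this, sketched under the assumption that "fraction of age being active" means YearsActive divided by Age. The column name `FractionActive` is made up for this example and does not come from the lab.

```python
import pandas as pd

scientists = pd.DataFrame({
    'Name': ['Rosaline Franklin', 'William Gosset',
             'Alexander Flemming', 'Carl F. Gauss'],
    'YearsActive': [10, 10, 20, 25],
    'Age': [37, 61, 73, 77],
})

# Dividing one column by another works row by row.
scientists['FractionActive'] = scientists['YearsActive'] / scientists['Age']
print(scientists)
```

The division pairs up rows by their index, so each scientist's YearsActive is divided by that same scientist's Age.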