[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mosleh-exeter/BEM1025/blob/main/Lecture/03-Lecture03-Data-Structures-in-Pandas.ipynb)

# Session 03 - Data Structures and Filtering in Pandas

Content:
- Create your own dataset
    - Create a series
    - Create a dataframe from a dictionary
- Select/filter data
    - Subsetting on column name
    - Selection based on filter
    - Selection based on inverse filter
- Basic operations on columns

<img src="https://github.com/mosleh-exeter/BEM1025/raw/main/images/session03-fig0.png"  width="300" height="1200">

As usual, we first import the libraries we need.

In [5]:
import pandas as pd

In [6]:
import numpy as np

# Creating our own data

## Creating a Series

### A pandas Series is a one-dimensional data structure, similar to the built-in Python list, but with an index label for each element.
#### You can think of a pandas Series as a single column of a DataFrame
#### Learn more about Series here: https://pandas.pydata.org/docs/reference/api/pandas.Series.html


We can transform a list to a pandas series using the following:

In [7]:
fruitSeries=pd.Series(['banana', 'apple', 'coconut','orange'])

In [8]:
fruitSeries

0     banana
1      apple
2    coconut
3     orange
dtype: object

In [9]:
type(fruitSeries)

pandas.core.series.Series

In [10]:
caleriesSeries=pd.Series([100,50,120,25])

In [11]:
caleriesSeries

0    100
1     50
2    120
3     25
dtype: int64

### We can create a DataFrame from multiple data series, where each series is a column of our dataframe

In [12]:
df=pd.DataFrame()
df['fruit']=fruitSeries
df['calorie']=caleriesSeries
df

Unnamed: 0,fruit,calorie
0,banana,100
1,apple,50
2,coconut,120
3,orange,25
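
The same table can also be built in a single call. The sketch below assumes the two lists used above and passes a dictionary of Series to `pd.DataFrame`; the variable names `fruit_series`, `calorie_series`, and `fruit_df` are chosen here for illustration only.

```python
import pandas as pd

fruit_series = pd.Series(['banana', 'apple', 'coconut', 'orange'])
calorie_series = pd.Series([100, 50, 120, 25])

# Passing a dictionary of Series builds the table in one call;
# the dictionary keys become the column names.
fruit_df = pd.DataFrame({'fruit': fruit_series, 'calorie': calorie_series})

print(fruit_df.shape)           # (number of rows, number of columns)
print(list(fruit_df.columns))   # column names, in insertion order
```

Both approaches align the Series on their index, so rows with the same index label end up on the same line of the table.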


## Creating a DataFrame from a dictionary

### Dictionaries are written with curly brackets, and have keys and values:
    my_dict={'key1':value1, 'key2': value2, ... }
### Keys are unique. Values can be any data type, e.g. numbers, strings, lists, or even another dictionary
### Learn more about Python dictionary here: https://www.w3schools.com/python/python_dictionaries.asp
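
A minimal sketch of how keys and values behave in a plain Python dictionary. The name `fruit_calories` and the updated value 55 are made up for illustration.

```python
# A dictionary maps each key to one value.
fruit_calories = {'banana': 100, 'apple': 50}

# Look up a value by its key.
print(fruit_calories['apple'])

# Keys are unique: assigning to an existing key overwrites its value
# (55 is an illustrative number, not real data) ...
fruit_calories['apple'] = 55

# ... while assigning to a new key adds a new entry.
fruit_calories['orange'] = 25

print(len(fruit_calories))      # number of key/value pairs
print(list(fruit_calories))     # the keys, in insertion order
```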


******

### A DataFrame can be thought of as a dictionary of Series objects. This is why dictionaries are the most common way of creating a DataFrame by hand.



******

### We can create a dictionary where each key is a column name and each value is a list containing the values for that column.
    my_dict={'col1': [item11, item21, ...],
             'col2': [item12, item22, ...],
             ...}

In [13]:
# a Python dictionary is a key-value structure
dictionaryOfScientists={'Name': ['Rosaline Franklin', 'William Gosset','Alexander Flemming','Carl F. Gauss'],
                        'Occupation': ['Chemist', 'Statistician','Physician','Mathematician'],
                        'Born': ['1920-07-25', '1876-06-13','1881-08-06','1777-04-10'],
                        'Died': ['1958-04-16', '1937-10-16','1954-03-11','1855-02-23'],
                        'YearsActive': [10, 10,20,25],
                        'Age': [37, 61,73,77]}

In [14]:
dictionaryOfScientists

{'Name': ['Rosaline Franklin',
  'William Gosset',
  'Alexander Flemming',
  'Carl F. Gauss'],
 'Occupation': ['Chemist', 'Statistician', 'Physician', 'Mathematician'],
 'Born': ['1920-07-25', '1876-06-13', '1881-08-06', '1777-04-10'],
 'Died': ['1958-04-16', '1937-10-16', '1954-03-11', '1855-02-23'],
 'YearsActive': [10, 10, 20, 25],
 'Age': [37, 61, 73, 77]}

In [15]:
# we now transform the dictionary to a dataframe
DataFrameOfScientists = pd.DataFrame(dictionaryOfScientists)

In [16]:
DataFrameOfScientists

Unnamed: 0,Name,Occupation,Born,Died,YearsActive,Age
0,Rosaline Franklin,Chemist,1920-07-25,1958-04-16,10,37
1,William Gosset,Statistician,1876-06-13,1937-10-16,10,61
2,Alexander Flemming,Physician,1881-08-06,1954-03-11,20,73
3,Carl F. Gauss,Mathematician,1777-04-10,1855-02-23,25,77


## Selecting / filtering

### Subsetting on column name

In [20]:
DataFrameOfScientists['Name']

0     Rosaline Franklin
1        William Gosset
2    Alexander Flemming
3         Carl F. Gauss
Name: Name, dtype: object

In [21]:
DataFrameOfScientists[['Name','Occupation']]

Unnamed: 0,Name,Occupation
0,Rosaline Franklin,Chemist
1,William Gosset,Statistician
2,Alexander Flemming,Physician
3,Carl F. Gauss,Mathematician
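
The two selections above return different types. The sketch below uses a cut-down, two-row version of the table (called `scientists` here for brevity) to show the difference.

```python
import pandas as pd

scientists = pd.DataFrame({'Name': ['Rosaline Franklin', 'William Gosset'],
                           'Occupation': ['Chemist', 'Statistician'],
                           'Age': [37, 61]})

# A single column name in the brackets returns a Series.
one_column = scientists['Name']

# A list of column names returns a DataFrame, even if the list has one item.
as_frame = scientists[['Name']]

print(type(one_column).__name__)
print(type(as_frame).__name__)
print(as_frame.shape)
```

The distinction matters later: string methods such as `.str.contains` are available on a Series, not on a DataFrame.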


### Selection based on filter

#### To select rows based on a conditional expression, use a condition inside the selection brackets [].

#### The condition inside the selection brackets, DataFrameOfScientists['Age'] >= 60, checks for which rows the Age column has a value of 60 or more:

Learn more about subsetting here: https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html#how-do-i-filter-specific-rows-from-a-dataframe

In [22]:
DataFrameOfScientists[DataFrameOfScientists['Age']>=60]

Unnamed: 0,Name,Occupation,Born,Died,YearsActive,Age
1,William Gosset,Statistician,1876-06-13,1937-10-16,10,61
2,Alexander Flemming,Physician,1881-08-06,1954-03-11,20,73
3,Carl F. Gauss,Mathematician,1777-04-10,1855-02-23,25,77


#### We can do this in two steps:

First, create a Boolean Series that is True for rows where the condition is met and False for the others. Next, use that Series inside the selection brackets to subset the DataFrame.

In [27]:
# filter to select by age
ageFilter=DataFrameOfScientists['Age']>=60
ageFilter

0    False
1     True
2     True
3     True
Name: Age, dtype: bool

In [28]:
DataFrameOfScientists[ageFilter]

Unnamed: 0,Name,Occupation,Born,Died,YearsActive,Age
1,William Gosset,Statistician,1876-06-13,1937-10-16,10,61
2,Alexander Flemming,Physician,1881-08-06,1954-03-11,20,73
3,Carl F. Gauss,Mathematician,1777-04-10,1855-02-23,25,77
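
Keeping the filter in its own variable makes it reusable. As a sketch (the name `scientists` and the reduced set of columns are assumptions for brevity), the same Boolean Series can be counted, or combined with a column name through `.loc`:

```python
import pandas as pd

scientists = pd.DataFrame({
    'Name': ['Rosaline Franklin', 'William Gosset',
             'Alexander Flemming', 'Carl F. Gauss'],
    'Age': [37, 61, 73, 77]})

age_filter = scientists['Age'] >= 60

# True counts as 1 and False as 0, so the sum is the number of matching rows.
n_matching = age_filter.sum()
print(n_matching)

# .loc accepts the filter for the rows and a column name for the columns.
names_60_plus = scientists.loc[age_filter, 'Name']
print(names_60_plus.tolist())
```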


#### We can use other comparison operators for subsetting

In [29]:
# filter to select chemists

DataFrameOfScientists[DataFrameOfScientists['Occupation']=='Chemist']

Unnamed: 0,Name,Occupation,Born,Died,YearsActive,Age
0,Rosaline Franklin,Chemist,1920-07-25,1958-04-16,10,37


In [30]:
# filter to select those named Alexander Flemming
DataFrameOfScientists[DataFrameOfScientists['Name']=='Alexander Flemming']

Unnamed: 0,Name,Occupation,Born,Died,YearsActive,Age
2,Alexander Flemming,Physician,1881-08-06,1954-03-11,20,73


In [31]:
# filter to select those not named Alexander Flemming
DataFrameOfScientists[DataFrameOfScientists['Name']!='Alexander Flemming']

Unnamed: 0,Name,Occupation,Born,Died,YearsActive,Age
0,Rosaline Franklin,Chemist,1920-07-25,1958-04-16,10,37
1,William Gosset,Statistician,1876-06-13,1937-10-16,10,61
3,Carl F. Gauss,Mathematician,1777-04-10,1855-02-23,25,77


#### We can use str.contains to see if part of a string matches 
#### Learn more about str.contains : https://pandas.pydata.org/docs/reference/api/pandas.Series.str.contains.html

In [32]:
# filter to select those named William 
# we use the function str.contains()


In [33]:
DataFrameOfScientists[DataFrameOfScientists['Name'].str.contains('William')]

Unnamed: 0,Name,Occupation,Born,Died,YearsActive,Age
1,William Gosset,Statistician,1876-06-13,1937-10-16,10,61


#### We can use str.lower().str.contains to convert everything to lower case first and so avoid case sensitivity

In [38]:
DataFrameOfScientists[DataFrameOfScientists['Name'].str.lower().str.contains('william')]

Unnamed: 0,Name,Occupation,Born,Died,YearsActive,Age
1,William Gosset,Statistician,1876-06-13,1937-10-16,10,61
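
An alternative sketch: `str.contains` also takes a `case` parameter, so the match can be made case-insensitive without lower-casing the column first. The variable names below are illustrative.

```python
import pandas as pd

scientists = pd.DataFrame({
    'Name': ['Rosaline Franklin', 'William Gosset',
             'Alexander Flemming', 'Carl F. Gauss'],
    'Age': [37, 61, 73, 77]})

# case=False ignores upper/lower case when matching the pattern.
william_filter = scientists['Name'].str.contains('william', case=False)

print(william_filter.tolist())
print(scientists[william_filter]['Name'].tolist())
```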


### Combining multiple conditions
#### When combining multiple conditional statements, each condition must be surrounded by parentheses (). Moreover, you cannot use the Python keywords "or" / "and"; use the element-wise operators "|" (or) and "&" (and) instead.

In [39]:
# filter to select by age range [40,70): 40 or older, but younger than 70
ageFilter2=(DataFrameOfScientists['Age']>=40)&(DataFrameOfScientists['Age']<70)

In [40]:
DataFrameOfScientists[ageFilter2]

Unnamed: 0,Name,Occupation,Born,Died,YearsActive,Age
1,William Gosset,Statistician,1876-06-13,1937-10-16,10,61
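
A short sketch of the "or" case, and of what happens when the keyword `and` is used by mistake. The names `scientists`, `young_or_old`, and `raised` are illustrative.

```python
import pandas as pd

scientists = pd.DataFrame({
    'Name': ['Rosaline Franklin', 'William Gosset',
             'Alexander Flemming', 'Carl F. Gauss'],
    'Age': [37, 61, 73, 77]})

# "|" keeps rows where at least one of the conditions is True.
young_or_old = (scientists['Age'] < 40) | (scientists['Age'] > 75)
selected_names = scientists[young_or_old]['Name'].tolist()
print(selected_names)

# The keyword "and" tries to reduce each whole Series to a single
# True/False value, which pandas refuses to do.
raised = False
try:
    scientists[(scientists['Age'] >= 40) and (scientists['Age'] < 70)]
except ValueError as error:
    raised = True
    print('ValueError:', error)
```

The parentheses are needed because `&` and `|` bind more tightly than comparison operators such as `>=`.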


### Selection based on inverse filter 

#### You can use "~" to invert a filter (Note: to negate an equality comparison we use "!=")

In [41]:
DataFrameOfScientists[~DataFrameOfScientists['Name'].str.contains('William')]

Unnamed: 0,Name,Occupation,Born,Died,YearsActive,Age
0,Rosaline Franklin,Chemist,1920-07-25,1958-04-16,10,37
2,Alexander Flemming,Physician,1881-08-06,1954-03-11,20,73
3,Carl F. Gauss,Mathematician,1777-04-10,1855-02-23,25,77
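
The "~" operator works on any Boolean Series, including a combined filter. A minimal sketch with illustrative variable names:

```python
import pandas as pd

scientists = pd.DataFrame({
    'Name': ['Rosaline Franklin', 'William Gosset',
             'Alexander Flemming', 'Carl F. Gauss'],
    'Age': [37, 61, 73, 77]})

# Rows with an age in [40, 70) ...
in_range = (scientists['Age'] >= 40) & (scientists['Age'] < 70)

# ... and every row that is NOT in that range.
outside_range = scientists[~in_range]
print(outside_range['Name'].tolist())

# Inverting an "==" filter gives the same result as using "!=".
inverted_equal = ~(scientists['Name'] == 'William Gosset')
not_equal = scientists['Name'] != 'William Gosset'
print(inverted_equal.equals(not_equal))
```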


## Basic operations on columns

### We can apply basic arithmetic operations on pandas columns and rows. Similarly, we can use operations from the NumPy library

In [42]:
# we multiply the YearsActive column by 2 and save the result as a new column
DataFrameOfScientists['DoubleActive']=DataFrameOfScientists['YearsActive']*2

In [43]:
DataFrameOfScientists

Unnamed: 0,Name,Occupation,Born,Died,YearsActive,Age,DoubleActive
0,Rosaline Franklin,Chemist,1920-07-25,1958-04-16,10,37,20
1,William Gosset,Statistician,1876-06-13,1937-10-16,10,61,20
2,Alexander Flemming,Physician,1881-08-06,1954-03-11,20,73,40
3,Carl F. Gauss,Mathematician,1777-04-10,1855-02-23,25,77,50


In [44]:
# we compute the natural log of the YearsActive column
DataFrameOfScientists['YearsActive_log']=np.log(DataFrameOfScientists['YearsActive'])

In [45]:
DataFrameOfScientists

Unnamed: 0,Name,Occupation,Born,Died,YearsActive,Age,DoubleActive,YearsActive_log
0,Rosaline Franklin,Chemist,1920-07-25,1958-04-16,10,37,20,2.302585
1,William Gosset,Statistician,1876-06-13,1937-10-16,10,61,20,2.302585
2,Alexander Flemming,Physician,1881-08-06,1954-03-11,20,73,40,2.995732
3,Carl F. Gauss,Mathematician,1777-04-10,1855-02-23,25,77,50,3.218876
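
Arithmetic also works between two columns, row by row. The sketch below reuses the fruit table from earlier; the `pieces` column and its values are hypothetical, added only to have a second numeric column.

```python
import numpy as np
import pandas as pd

fruit_df = pd.DataFrame({
    'fruit': ['banana', 'apple', 'coconut', 'orange'],
    'calorie': [100, 50, 120, 25],
    'pieces': [2, 3, 1, 4]})   # hypothetical quantities, not real data

# Multiplying two columns pairs up the values that share a row.
fruit_df['total_calorie'] = fruit_df['calorie'] * fruit_df['pieces']

# NumPy functions are applied to every element of the column.
fruit_df['calorie_sqrt'] = np.sqrt(fruit_df['calorie'])

print(fruit_df)
```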


### There are built-in pandas methods we can use for summary statistics

In [48]:
# we can find maximum and minimum of values in a column:
DataFrameOfScientists['Age'].max(),DataFrameOfScientists['Age'].min()

(77, 37)

In [49]:
# we can calculate the mean (or, with .sum(), the sum) of a column
DataFrameOfScientists['Age'].mean()

62.0
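
A few more summary methods, sketched on the same Age values. The variable names are illustrative; `describe()` prints several statistics at once.

```python
import pandas as pd

scientists = pd.DataFrame({
    'Name': ['Rosaline Franklin', 'William Gosset',
             'Alexander Flemming', 'Carl F. Gauss'],
    'Age': [37, 61, 73, 77]})

ages = scientists['Age']

age_sum = ages.sum()         # total of all values in the column
age_median = ages.median()   # middle value of the sorted column
print(age_sum, age_median)

# idxmax() returns the index label of the largest value,
# which .loc can use to look up another column in that row.
oldest_name = scientists.loc[ages.idxmax(), 'Name']
print(oldest_name)

# describe() reports count, mean, std, min, quartiles, and max together.
print(ages.describe())
```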

### Practice: find the scientist who is younger than 70 and has a 'W' in their name.

### Practice: create a new column for the fraction of each scientist's age during which they were active!

<img src="https://github.com/mosleh-exeter/BEM1025/raw/main/images/session03-fig3.png"  width="300" height="1200">

