# Day 3: Data analysis with Pandas



1.   Introduction: Data frames & exploratory data analysis
2.   Data manipulation: Cleaning, combining, grouping, sorting
3.   Views vs copies
4.   Useful tools & summary





Kieran Didi

![](https://github.com/kdidi99/Python_for_Biochemists/blob/main/notebooks/workflow-wickham.png?raw=1)

## 1. Introduction: The Data Frame

![](https://github.com/kdidi99/Python_for_Biochemists/blob/main/notebooks/01_table_dataframe.svg?raw=1)


### Advantages of data frames:

*   Easy to use
*   The Pandas library is fast and supports complex operations
*   De facto standard for data analysis



1st step: download some data

In [1]:
import requests

download_url = "https://raw.githubusercontent.com/rashida048/Datasets/master/titanic_data.csv"
target_csv_path = "titanic_data.csv"

response = requests.get(download_url)
response.raise_for_status()    # Check that the request was successful
with open(target_csv_path, "wb") as f:
    f.write(response.content)
print("Download ready.")


Download ready.


2nd step: use pandas to load data

In [2]:
import pandas as pd
df_titanic = pd.read_csv("titanic_data.csv")
df_titanic

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [6]:
df_titanic.head(5)  # returns the first five rows of the data frame

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


3rd step: Do some exploratory analysis of your data

In [11]:
type(df_titanic)       # type of the object
len(df_titanic)        # number of rows
df_titanic.shape       # shape attribute of the data frame: (rows, columns)
df_titanic.info()      # column names, non-null counts and data types
df_titanic.describe()  # summary statistics for the numeric columns

# statistics for the object (string) columns; the built-in `object` type
# is enough here, so the numpy import is optional
import numpy as np
df_titanic.describe(include=object)



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
count,891,891,891,204,889
unique,891,2,681,147,3
top,"Braund, Mr. Owen Harris",male,347082,B96 B98,S
freq,1,577,7,4,644


Why are there only 204 cabin values but 891 names? Let's look closer...

In [12]:
df_titanic["Cabin"].value_counts()

B96 B98        4
G6             4
C23 C25 C27    4
C22 C26        3
F33            3
              ..
E34            1
C7             1
C54            1
E36            1
C148           1
Name: Cabin, Length: 147, dtype: int64

Are there rows without a cabin value?

In [13]:
df_titanic["Cabin"].isnull().sum()

687

In [14]:
891-687

204
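
The same bookkeeping can be checked on a small made-up table. The sketch below assumes a hypothetical three-row data frame (`df_demo`, not part of the Titanic data) and shows that `count()` only counts non-missing entries, so the missing and non-missing counts add up to the number of rows.

```python
import pandas as pd

# Hypothetical mini data frame; None marks a missing cabin entry
df_demo = pd.DataFrame({
    "Name": ["Anna", "Ben", "Cara"],
    "Cabin": ["C85", None, None],
})

n_rows = len(df_demo)                        # total number of rows
n_missing = df_demo["Cabin"].isnull().sum()  # rows without a cabin value
n_present = df_demo["Cabin"].count()         # count() skips missing values

print(n_rows, n_missing, n_present)
print(n_missing + n_present == n_rows)
```

This is why `describe()` and `info()` report 204 for `Cabin` on the Titanic data: both show the non-null count, not the row count.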

## Excursion: Pandas series and data frames

Think:

Python list = Pandas series <br>
Python dictionary = Pandas series with custom index <br>
Python list of lists = Pandas data frame

In [22]:
# list: activity = [1593, 1333, 2283]
activity = pd.Series([1593, 1333, 2283])
activity

# dictionary: a series with a custom index
activity = pd.Series([10, 1333, 2283], index=["Glycine", "Serine", "Cysteine"])
activity = pd.Series({"Glycine": 10, "Serine": 1333, "Cysteine": 2283})  # same series, created with dictionary syntax
activity

# list of lists
l = [["Glycine", 10, "Mutant1"],
    ["Serine", 1333, "Mutant2"],
    ["Cysteine", 2283, "Wildtype"]]
activity = pd.DataFrame(l, columns = ["Amino acid", "activity", "mutation"])
activity

Unnamed: 0,Amino acid,activity,mutation
0,Glycine,10,Mutant1
1,Serine,1333,Mutant2
2,Cysteine,2283,Wildtype
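
As a rough sketch of how these analogies carry over to element access: a series with a custom index can be queried by label like a dictionary, and a data frame can also be built from a dictionary of columns instead of a list of rows. The variable names `activity_series` and `activity_df` are illustrative, and the numbers are the made-up values from the cell above.

```python
import pandas as pd

# Series with a custom index: label-based access works like a dict lookup
activity_series = pd.Series({"Glycine": 10, "Serine": 1333, "Cysteine": 2283})
print(activity_series["Serine"])

# Data frame built column by column from a dictionary of lists
activity_df = pd.DataFrame({
    "Amino acid": ["Glycine", "Serine", "Cysteine"],
    "activity": [10, 1333, 2283],
    "mutation": ["Mutant1", "Mutant2", "Wildtype"],
})

# Selecting a single column of a data frame gives back a series
print(type(activity_df["activity"]))
print(activity_df["activity"].max())
```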


## 2. Cleaning, combining, grouping, sorting
We can easily perform simple mathematical operations like addition, multiplication, division and exponentiation in base Python.

In [None]:
# addition
print(5+5)

# multiplication
print(3*2)

# division
print(6/2)

# exponentiation
print(2**3)

# parentheses change the order of evaluation
print(6*2+3)
print(6*(2+3))

10
6
3.0
8
15
30
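
Note that `6/2` printed `3.0` rather than `3`: in Python 3 the `/` operator always returns a float. A short sketch of the related operators, floor division `//` and modulo `%`, for comparison:

```python
print(7 / 2)    # true division always gives a float: 3.5
print(7 // 2)   # floor division of two integers gives an integer: 3
print(7 % 2)    # modulo gives the remainder of the division: 1

# the result type differs even when the value is a whole number
print(type(6 / 2), type(6 // 2))
```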


You can also add and multiply strings, but these operators behave very differently from their numeric counterparts.

In [None]:
# "math" with strings
x = "hello "
y = "world"
print(x+y)
print(5*x)
print(2*x + y)

z = 2
print(2*z)
print(2*str(z))

print("hello"+"10")
print("hello" + 10) # <-- this does not work

hello world
hello hello hello hello hello 
hello hello world
4
22
hello10


TypeError: can only concatenate str (not "int") to str
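
One way to make the failing line work is to convert the number to a string explicitly before concatenating. A minimal sketch (the variable names are illustrative):

```python
number = 10

# Convert the integer to a string first, then concatenate
greeting = "hello" + str(number)
print(greeting)

# Going the other way: convert a numeric string to an integer to do arithmetic
total = int("10") + number
print(total)
```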

## Useful utility: format-strings
Python has a very useful string syntax that lets you insert expressions or values directly into a string and converts them to text for you. This is called a *format-string* or *f-string*, and you might see it in our code later on.

As an example, we will write a script that automatically writes birthday wishes to a friend.

In [None]:
friend = "Peter"
birthyear = 1998

message = "Hi " + friend + ", congratulations on turning " + str(2021 - birthyear) + " years old!!"
print(message)

# with a format-string
message = f"Hi {friend}, congratulations on turning {2021 - birthyear} years old!!"
print(message)

Hi Peter, congratulations on turning 23 years old!!
Hi Peter, congratulations on turning 23 years old!!


Note that we were able to just directly insert the expression `2021 - birthyear` and the format string converted it for us automatically.
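
Format strings can also control how a value is displayed by adding a format specifier after a colon. The sketch below uses made-up values (`enzyme`, `activity`) purely for illustration:

```python
enzyme = "Lysozyme"
activity = 1333.456789

# :.2f -> fixed-point notation with two decimal places
line = f"{enzyme} activity: {activity:.2f} units"
print(line)

# :>10 -> right-align the text in a field that is 10 characters wide
padded = f"{enzyme:>10}"
print(padded)
```

Specifiers like these are handy when printing tables of measurements with a consistent number of decimals.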