# **Heart Failure**

<img src = "/img/Heart-Failure-image-small.jpg">

Heart failure is a syndrome characterized by the heart's inability to maintain adequate blood flow to the body's organs and tissues.

Collecting data on patients who have experienced heart failure can help identify and predict the risk of its occurrence.

To that end, we will analyze a dataset of 299 heart failure patients, collected in 2015:

In [2]:
import pandas as pd
df = pd.read_csv('./data/heart_failure_clinical_records_dataset.csv')
df.head()

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,75.0,0,582,0,20,1,265000.0,1.9,130,1,0,4,1
1,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6,1
2,65.0,0,146,0,20,0,162000.0,1.3,129,1,1,7,1
3,50.0,1,111,0,20,0,210000.0,1.9,137,1,0,7,1
4,65.0,1,160,1,20,0,327000.0,2.7,116,0,0,8,1


## **1. Dataset overview (DataFrame)**

The dataset has the following shape:

In [3]:
df.shape

print(f'There are {df.shape[0]} rows and {df.shape[1]} columns.')
print(f'That is, {df.shape[0]} monitored patients and {df.shape[1]} attributes recorded for each.')

There are 299 rows and 13 columns.
That is, 299 monitored patients and 13 attributes recorded for each.


To quickly check how the columns are filled in, we can use the _info_ method:

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   age                       299 non-null    float64
 1   anaemia                   299 non-null    int64  
 2   creatinine_phosphokinase  299 non-null    int64  
 3   diabetes                  299 non-null    int64  
 4   ejection_fraction         299 non-null    int64  
 5   high_blood_pressure       299 non-null    int64  
 6   platelets                 299 non-null    float64
 7   serum_creatinine          299 non-null    float64
 8   serum_sodium              299 non-null    int64  
 9   sex                       299 non-null    int64  
 10  smoking                   299 non-null    int64  
 11  time                      299 non-null    int64  
 12  DEATH_EVENT               299 non-null    int64  
dtypes: float64(3), int64(10)
memory usage: 30.5 KB


To make these values easier to work with, we will turn the output of the _info_ method above into a DataFrame of its own. This is the data we will use:

    Column                    Non-Null Count  Dtype
    age                       299 non-null    float64
    anaemia                   299 non-null    int64  
    creatinine_phosphokinase  299 non-null    int64  
    diabetes                  299 non-null    int64  
    ejection_fraction         299 non-null    int64  
    high_blood_pressure       299 non-null    int64  
    platelets                 299 non-null    float64
    serum_creatinine          299 non-null    float64
    serum_sodium              299 non-null    int64  
    sex                       299 non-null    int64  
    smoking                   299 non-null    int64  
    time                      299 non-null    int64  
    DEATH_EVENT               299 non-null    int64 

In [6]:
info = pd.read_clipboard()
visao_geral = info.copy()
visao_geral

Unnamed: 0,Column,Non-Null,Count,Dtype
0,age,299,non-null,float64
1,anaemia,299,non-null,int64
2,creatinine_phosphokinase,299,non-null,int64
3,diabetes,299,non-null,int64
4,ejection_fraction,299,non-null,int64
5,high_blood_pressure,299,non-null,int64
6,platelets,299,non-null,float64
7,serum_creatinine,299,non-null,float64
8,serum_sodium,299,non-null,int64
9,sex,299,non-null,int64


Building the new _visao_geral_ DataFrame from a dictionary is safer: if the kernel restarts and this text is no longer on the clipboard, the cell above will not read the right data and the resulting DataFrame will be wrong.

A good way to build the dictionary from this information (columns, non-null counts, and data types) is the following:

In [None]:
# Get the column names from the columns attribute
columns = df.columns
# Count the non-null values per column by summing the result of notnull()
non_null = df.notnull().sum()
# Get the data type of each column from the dtypes attribute
types = df.dtypes

# Convert each of them to a list with tolist()
columns = columns.tolist()
non_null = non_null.tolist()
types = types.tolist()

# Build a dictionary from these lists and turn it into a DataFrame
dicionario = {'Columns': columns, 'Non-Null Count': non_null, 'Dtype': types}
visao_geral = pd.DataFrame(dicionario)


display(visao_geral)

Unnamed: 0,Columns,Non-Null Count,Dtype
0,age,299,float64
1,anaemia,299,int64
2,creatinine_phosphokinase,299,int64
3,diabetes,299,int64
4,ejection_fraction,299,int64
5,high_blood_pressure,299,int64
6,platelets,299,float64
7,serum_creatinine,299,float64
8,serum_sodium,299,int64
9,sex,299,int64
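
The same overview table can also be assembled in one step, without the intermediate lists. The sketch below is only an illustration: `sample` is a small stand-in built from three columns of the first rows shown by `df.head()`, not the full dataset, and `overview` is a name chosen for this example.

```python
import pandas as pd

# Stand-in for df: three columns from the first rows shown by df.head()
sample = pd.DataFrame({
    "age": [75.0, 55.0, 65.0, 50.0, 65.0],
    "anaemia": [0, 0, 0, 1, 1],
    "serum_creatinine": [1.9, 1.1, 1.3, 1.9, 2.7],
})

# One row per column: name, number of non-null values, and dtype as text
overview = pd.DataFrame({
    "Columns": sample.columns,
    "Non-Null Count": sample.notnull().sum().to_numpy(),
    "Dtype": sample.dtypes.astype(str).to_numpy(),
})
print(overview)
```

Storing the dtype as text keeps the table easy to filter and compare later on.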


### 1.1 How the dataset is filled in

From the _visao_geral_ DataFrame, which mirrors the output of the _info_ method, we can see that:

    1. Every column has 299 non-null entries
    2. Every column holds numbers, either integers or _floats_

#### 1.1.1. _Null values_

Since the dataset covers 299 patients and every column has 299 filled cells, **there are no missing (null) values**.

This can be confirmed by applying the isnull() method to the original DataFrame:

In [None]:
df.isnull()

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
294,False,False,False,False,False,False,False,False,False,False,False,False,False
295,False,False,False,False,False,False,False,False,False,False,False,False,False
296,False,False,False,False,False,False,False,False,False,False,False,False,False
297,False,False,False,False,False,False,False,False,False,False,False,False,False


Summing the result should show that no column contains a null value:

In [None]:
df.isnull().sum()

age                         0
anaemia                     0
creatinine_phosphokinase    0
diabetes                    0
ejection_fraction           0
high_blood_pressure         0
platelets                   0
serum_creatinine            0
serum_sodium                0
sex                         0
smoking                     0
time                        0
DEATH_EVENT                 0
dtype: int64

**This confirms that the dataset has no missing or null values.**
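
When only a yes/no answer is needed, the per-column counts can be reduced to a single boolean. The sketch below uses stand-in data: `complete` holds a few values from the `df.head()` output, and `with_gap` is a hypothetical copy with one value blanked out so that both outcomes are visible.

```python
import numpy as np
import pandas as pd

# Stand-in data: values taken from the df.head() output
complete = pd.DataFrame({
    "age": [75.0, 55.0, 65.0],
    "serum_sodium": [130, 136, 129],
})

# Hypothetical copy with one value removed, for contrast
with_gap = complete.copy()
with_gap.loc[1, "age"] = np.nan

# True if at least one cell in the whole DataFrame is null
has_nulls_complete = bool(complete.isnull().values.any())
has_nulls_with_gap = bool(with_gap.isnull().values.any())

# Total number of null cells across all columns
total_nulls = int(with_gap.isnull().sum().sum())

print(has_nulls_complete, has_nulls_with_gap, total_nulls)
```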

#### 1.1.2. _Variable types and their encoding_

In statistics, variables can be classified as **quantitative** or **qualitative (categorical)**.

**Quantitative variables** are measured on a numeric scale, that is, by numbers whose magnitude is meaningful. They are further divided into discrete and continuous, depending on whether fractional values make sense.

**Qualitative or categorical variables**, in turn, describe characteristics that fall into categories rather than being defined by numeric values. They can be nominal, or ordinal when the categories have a natural order.
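
As a rough illustration of how these four kinds of variable can be represented in pandas, consider the sketch below. All column names and values are hypothetical and chosen only for the example; none of them come from the heart failure dataset.

```python
import pandas as pd

# Hypothetical values, chosen only to illustrate the four kinds of variable
examples = pd.DataFrame({
    # quantitative, continuous: fractional values make sense
    "weight_kg": [70.5, 82.3, 64.0],
    # quantitative, discrete: only whole numbers make sense
    "hospital_visits": [2, 0, 5],
    # qualitative, nominal: categories with no inherent order
    "blood_type": pd.Categorical(["A", "O", "AB"]),
    # qualitative, ordinal: categories with a natural order
    "pain_level": pd.Categorical(
        ["low", "high", "medium"],
        categories=["low", "medium", "high"],
        ordered=True,
    ),
})

print(examples.dtypes)

# Ordered categories can be compared against one of their levels
above_low = (examples["pain_level"] > "low").tolist()
print(above_low)
```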

Going back to the dataset, some variables are categorical, such as sex (_sex_) and whether the patient smokes (_smoking_). Even so, they are stored as integers (_int64_):

In [12]:
filtro = (visao_geral['Columns'] == 'sex') | (visao_geral['Columns'] == 'smoking')
visao_geral[filtro]

Unnamed: 0,Columns,Non-Null Count,Dtype
9,sex,299,int64
10,smoking,299,int64


The next step is therefore to look at how these columns were filled in. Here are the values found in each of them:

In [35]:
sex_unique_values = df['sex'].unique().tolist()
smoking_unique_values = df['smoking'].unique().tolist()

print(f"The 'sex' column contains {len(sex_unique_values)} distinct values: {sex_unique_values}")
print(f"The 'smoking' column contains {len(smoking_unique_values)} distinct values: {smoking_unique_values}")

The 'sex' column contains 2 distinct values: [1, 0]
The 'smoking' column contains 2 distinct values: [0, 1]


Here, by way of example, are a few rows showing this encoding:

In [37]:
df[['sex', 'smoking']].head()

Unnamed: 0,sex,smoking
0,1,0
1,1,0
2,1,1
3,1,0
4,0,0
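
The same inspection can be applied to every column at once by counting distinct values. The sketch below uses a stand-in built from the rows shown by `df.head()`. The cut-off of two distinct values is an assumption made for this example, a heuristic rather than a rule, and a column flagged this way still needs to be checked against the dataset's documentation.

```python
import pandas as pd

# Stand-in data: selected columns from the df.head() output
sample = pd.DataFrame({
    "age": [75.0, 55.0, 65.0, 50.0, 65.0],
    "platelets": [265000.0, 263358.03, 162000.0, 210000.0, 327000.0],
    "sex": [1, 1, 1, 1, 0],
    "smoking": [0, 0, 1, 0, 0],
})

# Number of distinct values in each column
distinct = sample.nunique()

# Heuristic (an assumption): two or fewer distinct values suggests
# a binary categorical variable stored as numbers
likely_categorical = distinct[distinct <= 2].index.tolist()
likely_quantitative = distinct[distinct > 2].index.tolist()

print(likely_categorical)
print(likely_quantitative)
```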


### .describe() - Method

Computes summary statistics for the numeric columns, giving a quick overview of the numeric variables. The "count" row is the number of non-missing values in each column.

In [None]:
titanic.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292
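
In the output above, `Age` has a count of 714 while the other columns have 891, which reflects missing ages. The sketch below reproduces that effect on a small sample. The values are taken from passenger rows shown later in this notebook, including one passenger whose age is missing; `sample` and `summary` are names chosen for this example.

```python
import numpy as np
import pandas as pd

# Small sample of Age and Fare values, one of them with a missing age
sample = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, np.nan],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 23.45],
})

summary = sample.describe()

# "count" ignores missing values, so Age is counted 4 times and Fare 5
print(summary.loc["count"])
```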


### .values - Attribute

Contains the data values in a 2-dimensional NumPy array. 

In [None]:
titanic.values

array([[1, 0, 3, ..., 7.25, nan, 'S'],
       [2, 1, 1, ..., 71.2833, 'C85', 'C'],
       [3, 1, 3, ..., 7.925, nan, 'S'],
       ...,
       [889, 0, 3, ..., 23.45, nan, 'S'],
       [890, 1, 1, ..., 30.0, 'C148', 'C'],
       [891, 0, 3, ..., 7.75, nan, 'Q']], dtype=object)

### .columns - Attribute

An index of columns: the column names.

In [None]:
titanic.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

### .index - Attribute

An index for the rows: either row numbers or row names.

In [None]:
titanic.index


RangeIndex(start=0, stop=891, step=1)

## Sorting and subsetting

In [None]:
titanic

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


### .sort_values() - Method

In [None]:
titanic.sort_values("Name")

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
845,846,0,3,"Abbing, Mr. Anthony",male,42.0,0,0,C.A. 5547,7.5500,,S
746,747,0,3,"Abbott, Mr. Rossmore Edward",male,16.0,1,1,C.A. 2673,20.2500,,S
279,280,1,3,"Abbott, Mrs. Stanton (Rosa Hunt)",female,35.0,1,1,C.A. 2673,20.2500,,S
308,309,0,2,"Abelson, Mr. Samuel",male,30.0,1,0,P/PP 3381,24.0000,,C
874,875,1,2,"Abelson, Mrs. Samuel (Hannah Wizosky)",female,28.0,1,0,P/PP 3381,24.0000,,C
...,...,...,...,...,...,...,...,...,...,...,...,...
286,287,1,3,"de Mulder, Mr. Theodore",male,30.0,0,0,345774,9.5000,,S
282,283,0,3,"de Pelsmaeker, Mr. Alfons",male,16.0,0,0,345778,9.5000,,S
361,362,0,2,"del Carlo, Mr. Sebastiano",male,29.0,1,0,SC/PARIS 2167,27.7208,,C
153,154,0,3,"van Billiard, Mr. Austin Blyler",male,40.5,0,2,A/5. 851,14.5000,,S


#### _Ascending or Descending_

Setting the **ascending argument** to False sorts the data in descending order.

In [None]:
titanic.sort_values("Name", ascending=False)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
868,869,0,3,"van Melkebeke, Mr. Philemon",male,,0,0,345777,9.5000,,S
153,154,0,3,"van Billiard, Mr. Austin Blyler",male,40.5,0,2,A/5. 851,14.5000,,S
361,362,0,2,"del Carlo, Mr. Sebastiano",male,29.0,1,0,SC/PARIS 2167,27.7208,,C
282,283,0,3,"de Pelsmaeker, Mr. Alfons",male,16.0,0,0,345778,9.5000,,S
286,287,1,3,"de Mulder, Mr. Theodore",male,30.0,0,0,345774,9.5000,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
874,875,1,2,"Abelson, Mrs. Samuel (Hannah Wizosky)",female,28.0,1,0,P/PP 3381,24.0000,,C
308,309,0,2,"Abelson, Mr. Samuel",male,30.0,1,0,P/PP 3381,24.0000,,C
279,280,1,3,"Abbott, Mrs. Stanton (Rosa Hunt)",female,35.0,1,1,C.A. 2673,20.2500,,S
746,747,0,3,"Abbott, Mr. Rossmore Edward",male,16.0,1,1,C.A. 2673,20.2500,,S


#### _Sort by Multiple Variables_

We can **sort by multiple variables** by passing a list of column names to sort_values.

In [None]:
titanic.sort_values(["Age", "Name"])

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
803,804,1,3,"Thomas, Master. Assad Alexander",male,0.42,0,1,2625,8.5167,,C
755,756,1,2,"Hamalainen, Master. Viljo",male,0.67,1,1,250649,14.5000,,S
644,645,1,3,"Baclini, Miss. Eugenie",female,0.75,2,1,2666,19.2583,,C
469,470,1,3,"Baclini, Miss. Helene Barbara",female,0.75,2,1,2666,19.2583,,C
78,79,1,2,"Caldwell, Master. Alden Gates",male,0.83,0,2,248738,29.0000,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
55,56,1,1,"Woolner, Mr. Hugh",male,,0,0,19947,35.5000,C52,S
354,355,0,3,"Yousif, Mr. Wazli",male,,0,0,2647,7.2250,,C
495,496,0,3,"Yousseff, Mr. Gerious",male,,0,0,2627,14.4583,,C
240,241,0,3,"Zabour, Miss. Thamine",female,,1,0,2665,14.4542,,C
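
The rows with a missing `Age` appear at the end of the result above because `sort_values` places missing values last by default. The `na_position` parameter changes that. The sketch below uses a small sample assembled from passenger rows shown in this notebook; `sample`, `default_order` and `nan_first` are names chosen for this example.

```python
import numpy as np
import pandas as pd

# Small sample: names and ages from rows shown in this notebook
sample = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        'Johnston, Miss. Catherine Helen "Carrie"',
        "Allen, Mr. William Henry",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    ],
    "Age": [22.0, 26.0, np.nan, 35.0, 35.0],
})

# Default: rows with a missing Age go last; ties in Age are ordered by Name
default_order = sample.sort_values(["Age", "Name"])

# na_position="first" moves rows with a missing Age to the top
nan_first = sample.sort_values(["Age", "Name"], na_position="first")

print(default_order)
print(nan_first)
```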


You can set the sort direction for each column separately by passing a list to the ascending argument:

In [None]:
titanic.sort_values(["Age", "Name"], ascending=[False, True])

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
630,631,1,1,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0,0,27042,30.0000,A23,S
851,852,0,3,"Svensson, Mr. Johan",male,74.0,0,0,347060,7.7750,,S
493,494,0,1,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C
96,97,0,1,"Goldschmidt, Mr. George B",male,71.0,0,0,PC 17754,34.6542,A5,C
116,117,0,3,"Connors, Mr. Patrick",male,70.5,0,0,370369,7.7500,,Q
...,...,...,...,...,...,...,...,...,...,...,...,...
55,56,1,1,"Woolner, Mr. Hugh",male,,0,0,19947,35.5000,C52,S
354,355,0,3,"Yousif, Mr. Wazli",male,,0,0,2647,7.2250,,C
495,496,0,3,"Yousseff, Mr. Gerious",male,,0,0,2627,14.4583,,C
240,241,0,3,"Zabour, Miss. Thamine",female,,1,0,2665,14.4542,,C


### df["column"]

Indexing with a column name in square brackets selects that single column, returned as a Series.

In [None]:
titanic["Name"]

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

#### _Selecting multiple columns_

To select several columns, pass a list of column names inside the indexing brackets, which is why the brackets appear doubled.

In [None]:
titanic[["Name", "Age"]]

Unnamed: 0,Name,Age
0,"Braund, Mr. Owen Harris",22.0
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0
2,"Heikkinen, Miss. Laina",26.0
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0
4,"Allen, Mr. William Henry",35.0
...,...,...
886,"Montvila, Rev. Juozas",27.0
887,"Graham, Miss. Margaret Edith",19.0
888,"Johnston, Miss. Catherine Helen ""Carrie""",
889,"Behr, Mr. Karl Howell",26.0


In [None]:
cols_to_subset = ["Name", "Age"]
titanic[cols_to_subset]

Unnamed: 0,Name,Age
0,"Braund, Mr. Owen Harris",22.0
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0
2,"Heikkinen, Miss. Laina",26.0
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0
4,"Allen, Mr. William Henry",35.0
...,...,...
886,"Montvila, Rev. Juozas",27.0
887,"Graham, Miss. Margaret Edith",19.0
888,"Johnston, Miss. Catherine Helen ""Carrie""",
889,"Behr, Mr. Karl Howell",26.0


#### _Filter rows_

You can use a condition to keep only the rows that meet it.

In [None]:
filter = titanic["Age"] > 50
print(filter)

0      False
1      False
2      False
3      False
4      False
       ...  
886    False
887    False
888    False
889    False
890    False
Name: Age, Length: 891, dtype: bool


In [None]:
titanic[filter]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S
15,16,1,2,"Hewlett, Mrs. (Mary D Kingcome)",female,55.0,0,0,248706,16.0000,,S
33,34,0,2,"Wheadon, Mr. Edward H",male,66.0,0,0,C.A. 24579,10.5000,,S
54,55,0,1,"Ostby, Mr. Engelhart Cornelius",male,65.0,0,1,113509,61.9792,B30,C
...,...,...,...,...,...,...,...,...,...,...,...,...
820,821,1,1,"Hays, Mrs. Charles Melville (Clara Jennings Gr...",female,52.0,1,1,12749,93.5000,B69,S
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0000,B28,
851,852,0,3,"Svensson, Mr. Johan",male,74.0,0,0,347060,7.7750,,S
857,858,1,1,"Daly, Mr. Peter Denis",male,51.0,0,0,113055,26.5500,E17,S


Same as:

In [None]:
titanic[titanic["Age"] > 50]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S
15,16,1,2,"Hewlett, Mrs. (Mary D Kingcome)",female,55.0,0,0,248706,16.0000,,S
33,34,0,2,"Wheadon, Mr. Edward H",male,66.0,0,0,C.A. 24579,10.5000,,S
54,55,0,1,"Ostby, Mr. Engelhart Cornelius",male,65.0,0,1,113509,61.9792,B30,C
...,...,...,...,...,...,...,...,...,...,...,...,...
820,821,1,1,"Hays, Mrs. Charles Melville (Clara Jennings Gr...",female,52.0,1,1,12749,93.5000,B69,S
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0000,B28,
851,852,0,3,"Svensson, Mr. Johan",male,74.0,0,0,347060,7.7750,,S
857,858,1,1,"Daly, Mr. Peter Denis",male,51.0,0,0,113055,26.5500,E17,S


The same works for string comparisons:

In [None]:
filter2 = titanic["Sex"] == "female"

In [None]:
titanic[filter2]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
...,...,...,...,...,...,...,...,...,...,...,...,...
880,881,1,2,"Shelley, Mrs. William (Imanita Parrish Hall)",female,25.0,0,1,230433,26.0000,,S
882,883,0,3,"Dahlberg, Miss. Gerda Ulrika",female,22.0,0,0,7552,10.5167,,S
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S


Same as:

In [None]:
titanic[titanic["Sex"] == "female"]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
...,...,...,...,...,...,...,...,...,...,...,...,...
880,881,1,2,"Shelley, Mrs. William (Imanita Parrish Hall)",female,25.0,0,1,230433,26.0000,,S
882,883,0,3,"Dahlberg, Miss. Gerda Ulrika",female,22.0,0,0,7552,10.5167,,S
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S


### _Logical Operators & |_

You can subset on multiple conditions by combining them with the operators & (AND) and | (OR).

In [None]:
titanic[filter & filter2]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.55,C103,S
15,16,1,2,"Hewlett, Mrs. (Mary D Kingcome)",female,55.0,0,0,248706,16.0,,S
195,196,1,1,"Lurette, Miss. Elise",female,58.0,0,0,PC 17569,146.5208,B80,C
268,269,1,1,"Graham, Mrs. William Thompson (Edith Junkins)",female,58.0,0,1,PC 17582,153.4625,C125,S
275,276,1,1,"Andrews, Miss. Kornelia Theodosia",female,63.0,1,0,13502,77.9583,D7,S
366,367,1,1,"Warren, Mrs. Frank Manley (Anna Sophia Atkinson)",female,60.0,1,0,110813,75.25,D37,C
483,484,1,3,"Turkula, Mrs. (Hedwig)",female,63.0,0,0,4134,9.5875,,S
496,497,1,1,"Eustis, Miss. Elizabeth Mussey",female,54.0,1,0,36947,78.2667,D20,C
513,514,1,1,"Rothschild, Mrs. Martin (Elizabeth L. Barrett)",female,54.0,1,0,PC 17603,59.4,,C
571,572,1,1,"Appleton, Mrs. Edward Dale (Charlotte Lamson)",female,53.0,2,0,11769,51.4792,C101,S


In [None]:
titanic[filter | filter2]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
880,881,1,2,"Shelley, Mrs. William (Imanita Parrish Hall)",female,25.0,0,1,230433,26.0000,,S
882,883,0,3,"Dahlberg, Miss. Gerda Ulrika",female,22.0,0,0,7552,10.5167,,S
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
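
When the conditions are written inline instead of being stored in variables first, each comparison needs its own parentheses, because `&` and `|` bind more tightly than `>` and `==` in Python. The sketch below uses a small sample assembled from passenger rows shown in this notebook; `sample`, `older_women` and `older_or_women` are names chosen for this example.

```python
import pandas as pd

# Small sample: rows taken from outputs shown in this notebook
sample = pd.DataFrame({
    "Name": [
        "McCarthy, Mr. Timothy J",
        "Bonnell, Miss. Elizabeth",
        "Heikkinen, Miss. Laina",
        "Braund, Mr. Owen Harris",
    ],
    "Sex": ["male", "female", "female", "male"],
    "Age": [54.0, 58.0, 26.0, 22.0],
})

# AND: both conditions must hold
older_women = sample[(sample["Age"] > 50) & (sample["Sex"] == "female")]

# OR: at least one condition must hold
older_or_women = sample[(sample["Age"] > 50) | (sample["Sex"] == "female")]

print(older_women)
print(older_or_women)
```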


#### .isin() - Method

This method takes a list of values and keeps the rows whose column value matches any of them.

In [None]:
selected_cabins = ['C85', 'B42']
cabin_filter = titanic["Cabin"].isin(selected_cabins)
titanic[cabin_filter]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
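
The boolean Series returned by `isin` can be inverted with `~` to keep every row that is not in the list. Note that a missing cabin never matches, so those rows end up in the inverted selection. The sketch below uses a small sample assembled from passenger rows shown in this notebook; `sample`, `in_selected` and `others` are names chosen for this example.

```python
import pandas as pd

# Small sample: rows from this notebook, one of them without a cabin
sample = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Th...",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
        "Graham, Miss. Margaret Edith",
        "Behr, Mr. Karl Howell",
    ],
    "Cabin": [None, "C85", "C123", "B42", "C148"],
})

selected_cabins = ["C85", "B42"]
in_selected = sample["Cabin"].isin(selected_cabins)

# ~ flips True and False, keeping the rows whose cabin is NOT in the list
others = sample[~in_selected]

print(others)
```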


### Adding a new column

Assign to a column name in square brackets. If a column with that name already exists, it is overwritten; otherwise a new one is created in the DataFrame.

In [None]:
titanic["Age (Months)"] = titanic["Age"] * 12
titanic

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age (Months)
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,264.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,456.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,312.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,420.0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,420.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,324.0
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,228.0
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,312.0
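
Bracket assignment modifies the DataFrame in place. If the original should stay untouched, the `assign` method returns a new DataFrame with the extra column instead. The sketch below uses a small sample assembled from passenger rows shown in this notebook. Because the column name contains spaces and parentheses, it is passed through dictionary unpacking; `sample` and `with_months` are names chosen for this example.

```python
import numpy as np
import pandas as pd

# Small sample: rows from this notebook, one of them with a missing age
sample = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Th...",
        'Johnston, Miss. Catherine Helen "Carrie"',
    ],
    "Age": [22.0, 38.0, np.nan],
})

# assign returns a new DataFrame; sample itself is left unchanged
with_months = sample.assign(**{"Age (Months)": sample["Age"] * 12})

print(with_months)
print("Age (Months)" in sample.columns)
```

A missing age stays missing in the derived column, as in the output above.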


## The four most common types of data manipulation

We have seen the four most common types:
- Sorting Rows;
- Subsetting Columns;
- Subsetting Rows;
- Adding New Columns.
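
The four operations listed above can be combined in a single short pipeline. The sketch below is only an illustration on a small sample assembled from passenger rows shown in this notebook; `sample`, `women` and `result` are names chosen for this example.

```python
import pandas as pd

# Small sample: rows taken from outputs shown in this notebook
sample = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Bonnell, Miss. Elizabeth",
        "Heikkinen, Miss. Laina",
        "McCarthy, Mr. Timothy J",
    ],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 58.0, 26.0, 54.0],
})

# Adding a new column
sample["Age (Months)"] = sample["Age"] * 12

# Subsetting rows
women = sample[sample["Sex"] == "female"]

# Subsetting columns
women = women[["Name", "Age (Months)"]]

# Sorting rows, oldest first
result = women.sort_values("Age (Months)", ascending=False)

print(result)
```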