## Pandas
### Ana Kely | Universidade Federal do Ceará

Pandas is a software library written for the Python language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the BSD license.

https://pandas.pydata.org/

**Importing pandas**

In [106]:
import pandas as pd

**Reading a DataFrame from a CSV file**

In [107]:
data_frame = pd.read_csv('titanic/train.csv')
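
As a minimal sketch of what `read_csv` does, the example below parses a small in-memory CSV instead of a file on disk. The three sample rows and the `index_col` choice are assumptions made for illustration only; they are not part of the notebook.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for 'titanic/train.csv'.
csv_text = """PassengerId,Survived,Name,Age
1,0,"Braund, Mr. Owen Harris",22
2,1,"Cumings, Mrs. John Bradley",38
3,1,"Heikkinen, Miss. Laina",
"""

# read_csv accepts a path or any file-like object.
# Quoted fields may contain commas, and empty fields become NaN.
sample = pd.read_csv(io.StringIO(csv_text), index_col='PassengerId')
print(sample.shape)
```

Passing `index_col` is optional; without it pandas builds a default `RangeIndex`, which is what the notebook relies on.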

**First rows**

In [108]:
data_frame.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


**Last rows**

In [109]:
data_frame.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


**DataFrame information**

In [110]:
data_frame.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


**Statistical summary of numeric fields**

In [111]:
data_frame.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


**Statistical summary of categorical fields**

In [112]:
data_frame.describe(include='O')

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
count,891,891,891,204,889
unique,891,2,681,147,3
top,"Klaber, Mr. Herman",male,CA. 2343,B96 B98,S
freq,1,577,7,4,644


**Renaming columns**

In [113]:
data_frame.rename({'Name': 'Nome', 'Sex': 'Sexo'}, axis = 1, inplace=True)

In [114]:
data_frame.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Nome', 'Sexo', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [115]:
data_frame.columns = ['IdPassageiro', 'Sobreviveu', 'Classe', 'Nome',
                      'Sexo', 'Idade', 'IrmaosConjugue','PaisFilhos',
                      'Bilhete', 'Tarifa', 'Cabine', 'Embarque']
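
The two renaming styles above can also be written without `inplace`. The sketch below uses a toy frame (an assumption, not the Titanic data) to show that `rename` returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# Toy frame; the column names mirror the ones used above.
df = pd.DataFrame({'Name': ['A', 'B'], 'Sex': ['male', 'female'], 'Age': [22.0, 38.0]})

# columns= makes the target axis explicit; without inplace a new DataFrame is returned.
renamed = df.rename(columns={'Name': 'Nome', 'Sex': 'Sexo'})

# Keys that match no column are ignored by default (errors='raise' would reject them).
same = df.rename(columns={'Missing': 'X'})

print(list(renamed.columns))
```

Assigning a full list to `data_frame.columns` is shorter, but it depends on the column order, while a mapping only touches the names it mentions.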

**Selecting specific columns**

In [116]:
data_frame[['Nome', 'Sexo', 'Idade']].head()

Unnamed: 0,Nome,Sexo,Idade
0,"Braund, Mr. Owen Harris",male,22.0
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0
2,"Heikkinen, Miss. Laina",female,26.0
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0
4,"Allen, Mr. William Henry",male,35.0


**Mathematical functions**

In [117]:
data_frame['Idade'].mean()

29.69911764705882

In [118]:
data_frame['Idade'].min()

0.42

In [119]:
data_frame['Idade'].max()

80.0

In [120]:
data_frame['Idade'].mode()

0    24.0
dtype: float64

In [121]:
data_frame['Idade'].median()

28.0

In [122]:
data_frame['Idade'].idxmin()

803

In [123]:
data_frame['Idade'].idxmax()

630

In [124]:
data_frame.mode()

Unnamed: 0,IdPassageiro,Sobreviveu,Classe,Nome,Sexo,Idade,IrmaosConjugue,PaisFilhos,Bilhete,Tarifa,Cabine,Embarque
0,1,0.0,3.0,"Abbing, Mr. Anthony",male,24.0,0.0,0.0,1601,8.05,B96 B98,S
1,2,,,"Abbott, Mr. Rossmore Edward",,,,,347082,,C23 C25 C27,
2,3,,,"Abbott, Mrs. Stanton (Rosa Hunt)",,,,,CA. 2343,,G6,
3,4,,,"Abelson, Mr. Samuel",,,,,,,,
4,5,,,"Abelson, Mrs. Samuel (Hannah Wizosky)",,,,,,,,
5,6,,,"Adahl, Mr. Mauritz Nils Martin",,,,,,,,
6,7,,,"Adams, Mr. John",,,,,,,,
7,8,,,"Ahlin, Mrs. Johan (Johanna Persdotter Larsson)",,,,,,,,
8,9,,,"Aks, Mrs. Sam (Leah Rosen)",,,,,,,,
9,10,,,"Albimona, Mr. Nassef Cassem",,,,,,,,


In [125]:
data_frame.std(numeric_only=True)

IdPassageiro      257.353842
Sobreviveu          0.486592
Classe              0.836071
Idade              14.526497
IrmaosConjugue      1.102743
PaisFilhos          0.806057
Tarifa             49.693429
dtype: float64
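
Several of these statistics can be requested in one call with `agg`. The sketch below uses a small hypothetical series of ages rather than the Titanic column; note that missing values are skipped by default.

```python
import pandas as pd

# Hypothetical ages, including one missing value.
ages = pd.Series([22.0, 38.0, 26.0, 35.0, 35.0, None], name='Idade')

# One call, several statistics; NaN entries are ignored.
summary = ages.agg(['mean', 'median', 'min', 'max'])

# mode() returns a Series because ties are possible; take the first entry.
most_common = ages.mode().iloc[0]

# idxmax() returns the index label of the largest value, not the value itself.
oldest_label = ages.idxmax()

print(summary)
```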

**Selection by integer position - iloc**

In [126]:
data_frame.iloc[[630, 803, 10]]

Unnamed: 0,IdPassageiro,Sobreviveu,Classe,Nome,Sexo,Idade,IrmaosConjugue,PaisFilhos,Bilhete,Tarifa,Cabine,Embarque
630,631,1,1,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0,0,27042,30.0,A23,S
803,804,1,3,"Thomas, Master. Assad Alexander",male,0.42,0,1,2625,8.5167,,C
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,G6,S


In [127]:
data_frame.iloc[1:10]

Unnamed: 0,IdPassageiro,Sobreviveu,Classe,Nome,Sexo,Idade,IrmaosConjugue,PaisFilhos,Bilhete,Tarifa,Cabine,Embarque
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
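
With the default `RangeIndex`, positions and labels coincide, which hides the difference between `iloc` and `loc`. The sketch below uses a toy frame with a non-default index (an assumption for illustration) to make it visible.

```python
import pandas as pd

# Toy frame whose index labels differ from the row positions.
df = pd.DataFrame({'Nome': ['Ana', 'Bruno', 'Carla', 'Davi']}, index=[10, 20, 30, 40])

# iloc slices by position and excludes the end, like a Python list.
by_position = df.iloc[1:3]

# loc slices by label and includes the end label.
by_label = df.loc[10:30]

# A list of positions returns the rows in the order requested.
picked = df.iloc[[3, 0]]

print(by_position)
print(by_label)
```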


**Selection - loc**

In [128]:
data_frame[['Nome']].loc[data_frame['Sexo'] == 'female'].head()

Unnamed: 0,Nome
1,"Cumings, Mrs. John Bradley (Florence Briggs Th..."
2,"Heikkinen, Miss. Laina"
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)"
8,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)"
9,"Nasser, Mrs. Nicholas (Adele Achem)"


In [129]:
data_frame['Nome'].loc[data_frame['Idade'] >= 3]

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
6                                McCarthy, Mr. Timothy J
8      Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
9                    Nasser, Mrs. Nicholas (Adele Achem)
10                       Sandstrom, Miss. Marguerite Rut
11                              Bonnell, Miss. Elizabeth
12                        Saundercock, Mr. William Henry
13                           Andersson, Mr. Anders Johan
14                  Vestrom, Miss. Hulda Amanda Adolfina
15                      Hewlett, Mrs. (Mary D Kingcome) 
18     Vander Planke, Mrs. Julius (Emelia Maria Vande...
20                                  Fynney, Mr. Joseph J
21                                 Beesley, Mr. Lawrence
22                           Mc

In [130]:
data_frame['Sexo'].loc[(data_frame['Idade'] > 14) 
                       & (data_frame['Sobreviveu'] == 0)].head()

0     male
4     male
6     male
12    male
13    male
Name: Sexo, dtype: object
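
The filters above select a column first and then apply `loc`. An equivalent form passes the row mask and the column label to `loc` together, as in this sketch on a toy frame (the data is hypothetical).

```python
import pandas as pd

df = pd.DataFrame({
    'Nome': ['Ana', 'Bruno', 'Carla', 'Davi'],
    'Sexo': ['female', 'male', 'female', 'male'],
    'Idade': [30.0, 12.0, None, 45.0],
    'Sobreviveu': [1, 0, 1, 0],
})

# Each condition needs its own parentheses because & binds tighter than > and ==.
adult_victims = (df['Idade'] > 14) & (df['Sobreviveu'] == 0)

# Row mask and column label in a single loc call.
names = df.loc[adult_victims, 'Nome']

# ~ negates a mask. Comparisons with NaN are False, so Carla lands here too.
not_adult = df.loc[~(df['Idade'] > 14), 'Nome']

print(names.tolist(), not_adult.tolist())
```

Use `&`, `|` and `~` for element-wise logic; the Python keywords `and`, `or` and `not` raise an error on a Series.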

**Value substitution - map and replace**

In [131]:
data_frame['Sexo'].map({'male': 'homem', 'female': 'mulher'}).head()

0     homem
1    mulher
2    mulher
3    mulher
4     homem
Name: Sexo, dtype: object

In [132]:
data_frame['Sexo'] = data_frame['Sexo'].replace({'male': 'homem', 'female': 'mulher'})

In [133]:
data_frame['Sexo'].head()

0     homem
1    mulher
2    mulher
3    mulher
4     homem
Name: Sexo, dtype: object
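
`map` and `replace` give the same result on this column because every value appears in the dictionary. The sketch below adds a hypothetical `'unknown'` value to show where they differ.

```python
import pandas as pd

# 'unknown' is a made-up value that is absent from the mapping.
sexo = pd.Series(['male', 'female', 'unknown'])
mapping = {'male': 'homem', 'female': 'mulher'}

# map: values missing from the dictionary become NaN.
mapped = sexo.map(mapping)

# replace: values missing from the dictionary are left as they are.
replaced = sexo.replace(mapping)

print(mapped.tolist())
print(replaced.tolist())
```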

**Grouping - groupby**

In [134]:
data_frame.groupby('Sexo').mean(numeric_only=True)

Sexo,IdPassageiro,Sobreviveu,Classe,Idade,IrmaosConjugue,PaisFilhos,Tarifa
homem,454.147314,0.188908,2.389948,30.726645,0.429809,0.235702,25.523893
mulher,431.028662,0.742038,2.159236,27.915709,0.694268,0.649682,44.479818


In [135]:
data_frame.groupby('Sobreviveu').mean(numeric_only=True)

Sobreviveu,IdPassageiro,Classe,Idade,IrmaosConjugue,PaisFilhos,Tarifa
0,447.016393,2.531876,30.626179,0.553734,0.32969,22.117887
1,444.368421,1.950292,28.34369,0.473684,0.464912,48.395408


In [136]:
data_frame.groupby('Classe').mean(numeric_only=True)

Classe,IdPassageiro,Sobreviveu,Idade,IrmaosConjugue,PaisFilhos,Tarifa
1,461.597222,0.62963,38.233441,0.416667,0.356481,84.154687
2,445.956522,0.472826,29.87763,0.402174,0.380435,20.662183
3,439.154786,0.242363,25.14062,0.615071,0.393075,13.67555


In [137]:
data_frame.groupby(['Sexo', 'Classe']).mean(numeric_only=True)

Sexo,Classe,IdPassageiro,Sobreviveu,Idade,IrmaosConjugue,PaisFilhos,Tarifa
homem,1,455.729508,0.368852,41.281386,0.311475,0.278689,67.226127
homem,2,447.962963,0.157407,30.740707,0.342593,0.222222,19.741782
homem,3,455.51585,0.135447,26.507589,0.498559,0.224784,12.661633
mulher,1,469.212766,0.968085,34.611765,0.553191,0.457447,106.125798
mulher,2,443.105263,0.921053,28.722973,0.486842,0.605263,21.970121
mulher,3,399.729167,0.5,21.75,0.895833,0.798611,16.11881
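
Averaging every column at once also averages identifiers such as `IdPassageiro`, which carries no meaning. As a sketch of a more targeted alternative, named aggregation lets each output column choose its source column and function. The toy data and the output names (`idade_media`, `taxa_sobrevivencia`, `passageiros`) are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Sexo': ['homem', 'mulher', 'homem', 'mulher'],
    'Idade': [22.0, 38.0, 35.0, 26.0],
    'Sobreviveu': [0, 1, 0, 1],
})

# Each keyword is an output column: (source column, aggregation function).
summary = df.groupby('Sexo').agg(
    idade_media=('Idade', 'mean'),
    taxa_sobrevivencia=('Sobreviveu', 'mean'),
    passageiros=('Sobreviveu', 'count'),
)

print(summary)
```

Because `Sobreviveu` holds 0 and 1, its mean is the survival rate of the group.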


**Crosstab**

In [138]:
pd.crosstab(data_frame['Sobreviveu'], data_frame['Classe'], margins=True)

Sobreviveu \ Classe,1,2,3,All
0,80,97,372,549
1,136,87,119,342
All,216,184,491,891


In [139]:
pd.crosstab(data_frame['Sobreviveu'], data_frame['Sexo'], margins=True).style.background_gradient(cmap='pink')

Sobreviveu \ Sexo,homem,mulher,All
0,468,81,549
1,109,233,342
All,577,314,891
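
Raw counts are hard to compare when the groups have different sizes. A possible refinement, sketched here on toy data (an assumption, not the Titanic table), is to ask `crosstab` for proportions with `normalize`.

```python
import pandas as pd

df = pd.DataFrame({
    'Sobreviveu': [0, 0, 1, 1, 1, 0],
    'Sexo': ['homem', 'homem', 'mulher', 'mulher', 'homem', 'mulher'],
})

# normalize='index' divides each row by its total, so every row sums to 1.
rates = pd.crosstab(df['Sexo'], df['Sobreviveu'], normalize='index')

print(rates)
```

`normalize='columns'` and `normalize='all'` divide by column totals and by the grand total, respectively.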


**Sorting data**

In [140]:
data_frame.sort_values('Idade', ascending=True).head()

Unnamed: 0,IdPassageiro,Sobreviveu,Classe,Nome,Sexo,Idade,IrmaosConjugue,PaisFilhos,Bilhete,Tarifa,Cabine,Embarque
803,804,1,3,"Thomas, Master. Assad Alexander",homem,0.42,0,1,2625,8.5167,,C
755,756,1,2,"Hamalainen, Master. Viljo",homem,0.67,1,1,250649,14.5,,S
644,645,1,3,"Baclini, Miss. Eugenie",mulher,0.75,2,1,2666,19.2583,,C
469,470,1,3,"Baclini, Miss. Helene Barbara",mulher,0.75,2,1,2666,19.2583,,C
78,79,1,2,"Caldwell, Master. Alden Gates",homem,0.83,0,2,248738,29.0,,S
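
`sort_values` also accepts several keys and a direction for each one. The sketch below uses a toy frame (hypothetical data) and shows where missing values end up.

```python
import pandas as pd

df = pd.DataFrame({
    'Nome': ['Ana', 'Bruno', 'Carla', 'Davi'],
    'Classe': [3, 1, 1, 3],
    'Idade': [30.0, None, 45.0, 12.0],
})

# Single key, descending; NaN goes last by default (na_position='last').
oldest_first = df.sort_values('Idade', ascending=False)

# Two keys: class ascending, then age descending within each class.
by_class_then_age = df.sort_values(['Classe', 'Idade'], ascending=[True, False])

print(oldest_first['Nome'].tolist())
print(by_class_then_age['Nome'].tolist())
```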


**Missing values - NaN**

In [141]:
data_frame.isnull().sum()

IdPassageiro        0
Sobreviveu          0
Classe              0
Nome                0
Sexo                0
Idade             177
IrmaosConjugue      0
PaisFilhos          0
Bilhete             0
Tarifa              0
Cabine            687
Embarque            2
dtype: int64

In [142]:
data_frame.loc[data_frame['Embarque'].isnull()]

Unnamed: 0,IdPassageiro,Sobreviveu,Classe,Nome,Sexo,Idade,IrmaosConjugue,PaisFilhos,Bilhete,Tarifa,Cabine,Embarque
61,62,1,1,"Icard, Miss. Amelie",mulher,38.0,0,0,113572,80.0,B28,
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",mulher,62.0,0,0,113572,80.0,B28,


In [143]:
data_frame['Embarque'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [144]:
data_frame['Embarque'].mode()

0    S
dtype: object

In [145]:
data_frame.describe(include = 'O')

Unnamed: 0,Nome,Sexo,Bilhete,Cabine,Embarque
count,891,891,891,204,889
unique,891,2,681,147,3
top,"Klaber, Mr. Herman",homem,CA. 2343,B96 B98,S
freq,1,577,7,4,644


In [146]:
data_frame['Embarque'] = data_frame['Embarque'].fillna('S')

In [147]:
data_frame.isnull().sum()

IdPassageiro        0
Sobreviveu          0
Classe              0
Nome                0
Sexo                0
Idade             177
IrmaosConjugue      0
PaisFilhos          0
Bilhete             0
Tarifa              0
Cabine            687
Embarque            0
dtype: int64
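
The steps above fill `Embarque` with the hard-coded mode `'S'`. A small sketch of the same idea computes the fill value instead of typing it. The toy data is hypothetical, and filling `Idade` with the median is an additional assumption that the notebook does not make.

```python
import pandas as pd

df = pd.DataFrame({
    'Embarque': ['S', 'C', None, 'S', 'Q'],
    'Idade': [22.0, None, 26.0, 35.0, 35.0],
})

# Categorical column: fill with the most frequent value.
most_frequent_port = df['Embarque'].mode().iloc[0]
df['Embarque'] = df['Embarque'].fillna(most_frequent_port)

# Numeric column (assumption): fill with the median, which is robust to outliers.
df['Idade'] = df['Idade'].fillna(df['Idade'].median())

print(df.isnull().sum())
```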

**Deleting columns and rows**

In [148]:
data_frame.isnull().sum()

IdPassageiro        0
Sobreviveu          0
Classe              0
Nome                0
Sexo                0
Idade             177
IrmaosConjugue      0
PaisFilhos          0
Bilhete             0
Tarifa              0
Cabine            687
Embarque            0
dtype: int64

In [149]:
data_frame.drop('Cabine', axis=1, inplace=True)

In [150]:
data_frame.isnull().sum()

IdPassageiro        0
Sobreviveu          0
Classe              0
Nome                0
Sexo                0
Idade             177
IrmaosConjugue      0
PaisFilhos          0
Bilhete             0
Tarifa              0
Embarque            0
dtype: int64

In [151]:
data_frame.drop(0, inplace=True)

In [152]:
data_frame.head()

Unnamed: 0,IdPassageiro,Sobreviveu,Classe,Nome,Sexo,Idade,IrmaosConjugue,PaisFilhos,Bilhete,Tarifa,Embarque
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",mulher,38.0,1,0,PC 17599,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",mulher,26.0,0,0,STON/O2. 3101282,7.925,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",mulher,35.0,1,0,113803,53.1,S
4,5,0,3,"Allen, Mr. William Henry",homem,35.0,0,0,373450,8.05,S
5,6,0,3,"Moran, Mr. James",homem,,0,0,330877,8.4583,Q


In [153]:
data_frame.drop(['PaisFilhos'], axis = 1)

Unnamed: 0,IdPassageiro,Sobreviveu,Classe,Nome,Sexo,Idade,IrmaosConjugue,Bilhete,Tarifa,Embarque
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",mulher,38.0,1,PC 17599,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",mulher,26.0,0,STON/O2. 3101282,7.9250,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",mulher,35.0,1,113803,53.1000,S
4,5,0,3,"Allen, Mr. William Henry",homem,35.0,0,373450,8.0500,S
5,6,0,3,"Moran, Mr. James",homem,,0,330877,8.4583,Q
6,7,0,1,"McCarthy, Mr. Timothy J",homem,54.0,0,17463,51.8625,S
7,8,0,3,"Palsson, Master. Gosta Leonard",homem,2.0,3,349909,21.0750,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",mulher,27.0,0,347742,11.1333,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",mulher,14.0,1,237736,30.0708,C
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",mulher,4.0,1,PP 9549,16.7000,S
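
The last call above omits `inplace=True`, so it only displays a copy and `PaisFilhos` stays in `data_frame`. The sketch below, on a toy frame (hypothetical data), shows that behavior together with the `columns=` and `index=` keywords and with `dropna`.

```python
import pandas as pd

df = pd.DataFrame({
    'Nome': ['Ana', 'Bruno', 'Carla'],
    'Idade': [30.0, None, 45.0],
    'Cabine': [None, None, 'C85'],
})

# Without inplace=True, drop returns a new DataFrame and df is unchanged.
without_cabin = df.drop(columns='Cabine')   # same as df.drop('Cabine', axis=1)
without_first = df.drop(index=0)            # same as df.drop(0)

# dropna removes rows by content: here, rows with no age.
complete_ages = df.dropna(subset=['Idade'])

print(list(df.columns), list(without_cabin.columns))
```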


**Apply and lambda**

In [154]:
data_frame['Idade'].apply(lambda x:x**2)

1      1444.0
2       676.0
3      1225.0
4      1225.0
5         NaN
6      2916.0
7         4.0
8       729.0
9       196.0
10       16.0
11     3364.0
12      400.0
13     1521.0
14      196.0
15     3025.0
16        4.0
17        NaN
18      961.0
19        NaN
20     1225.0
21     1156.0
22      225.0
23      784.0
24       64.0
25     1444.0
26        NaN
27      361.0
28        NaN
29        NaN
30     1600.0
        ...  
861     441.0
862    2304.0
863       NaN
864     576.0
865    1764.0
866     729.0
867     961.0
868       NaN
869      16.0
870     676.0
871    2209.0
872    1089.0
873    2209.0
874     784.0
875     225.0
876     400.0
877     361.0
878       NaN
879    3136.0
880     625.0
881    1089.0
882     484.0
883     784.0
884     625.0
885    1521.0
886     729.0
887     361.0
888       NaN
889     676.0
890    1024.0
Name: Idade, Length: 890, dtype: float64

In [155]:
data_frame['Bilhete'] = data_frame['Bilhete'].apply(lambda x:x[:-1])

In [156]:
data_frame['Bilhete']

1              PC 1759
2      STON/O2. 310128
3                11380
4                37345
5                33087
6                 1746
7                34990
8                34774
9                23773
10              PP 954
11               11378
12            A/5. 215
13               34708
14               35040
15               24870
16               38265
17               24437
18               34576
19                 264
20               23986
21               24869
22               33092
23               11378
24               34990
25               34707
26                 263
27                1995
28               33095
29               34921
30             PC 1760
            ...       
861               2813
862               1746
863            CA. 234
864              23386
865              23685
866       SC/PARIS 214
867            PC 1759
868              34577
869              34774
870              34924
871               1175
872                 69
873        
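
Both `apply` calls above have vectorized equivalents that avoid calling a Python function per element. This sketch uses a toy frame (hypothetical rows) and is meant as an alternative, not as a replacement for `apply` in general.

```python
import pandas as pd

df = pd.DataFrame({
    'Idade': [38.0, None, 2.0],
    'Bilhete': ['PC 17599', '113803', 'PP 9549'],
})

# Arithmetic operators work element-wise on a Series; NaN stays NaN.
squared = df['Idade'] ** 2

# The .str accessor supports slicing: drop the last character of each ticket.
trimmed = df['Bilhete'].str[:-1]

print(squared.tolist())
print(trimmed.tolist())
```

`apply` remains useful when the per-element logic has no vectorized counterpart.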

**Dummy variables and data concatenation**

Generate a numeric representation for categorical variables.

In [157]:
data_frame = pd.get_dummies(data_frame, columns=['Sexo'],drop_first=True)

In [158]:
data_frame.head()

Unnamed: 0,IdPassageiro,Sobreviveu,Classe,Nome,Idade,IrmaosConjugue,PaisFilhos,Bilhete,Tarifa,Embarque,Sexo_mulher
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 1759,71.2833,C,1
2,3,1,3,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 310128,7.925,S,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,11380,53.1,S,1
4,5,0,3,"Allen, Mr. William Henry",35.0,0,0,37345,8.05,S,0
5,6,0,3,"Moran, Mr. James",,0,0,33087,8.4583,Q,0


In [159]:
embarque = pd.get_dummies(data_frame['Embarque'], drop_first=True)

In [160]:
embarque

Unnamed: 0,Q,S
1,0,0
2,0,1
3,0,1
4,0,1
5,1,0
6,0,1
7,0,1
8,0,1
9,0,0
10,0,1


In [161]:
data_frame = pd.concat([data_frame, embarque], axis=1)

In [162]:
data_frame.head()

Unnamed: 0,IdPassageiro,Sobreviveu,Classe,Nome,Idade,IrmaosConjugue,PaisFilhos,Bilhete,Tarifa,Embarque,Sexo_mulher,Q,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 1759,71.2833,C,1,0,0
2,3,1,3,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 310128,7.925,S,1,0,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,11380,53.1,S,1,0,1
4,5,0,3,"Allen, Mr. William Henry",35.0,0,0,37345,8.05,S,0,0,1
5,6,0,3,"Moran, Mr. James",,0,0,33087,8.4583,Q,0,1,0


In [163]:
data_frame.drop('Embarque', axis=1, inplace=True)

In [164]:
data_frame.columns

Index(['IdPassageiro', 'Sobreviveu', 'Classe', 'Nome', 'Idade',
       'IrmaosConjugue', 'PaisFilhos', 'Bilhete', 'Tarifa', 'Sexo_mulher', 'Q',
       'S'],
      dtype='object')
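
The encode, concatenate and drop sequence above can be collapsed into one `get_dummies` call by listing both columns. The sketch below uses toy data (an assumption). It passes `dtype=int` because recent pandas versions produce boolean dummy columns by default rather than the 0/1 values shown above.

```python
import pandas as pd

df = pd.DataFrame({
    'Sexo': ['homem', 'mulher', 'mulher'],
    'Embarque': ['S', 'C', 'Q'],
    'Idade': [22.0, 38.0, 26.0],
})

# Encoded columns are replaced by prefixed dummies; drop_first removes the
# first category of each column (here 'homem' and 'C') to avoid redundancy.
encoded = pd.get_dummies(df, columns=['Sexo', 'Embarque'], drop_first=True, dtype=int)

print(list(encoded.columns))
```

The automatic prefix (`Embarque_Q`, `Embarque_S`) also avoids the bare column names `Q` and `S` produced by the manual approach.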

**Regular expressions**

.  - any character except a newline
<br>\d - digit (0-9)
<br>\D - any character except a digit (0-9)
<br>\w - word character (a-z, A-Z, 0-9, \_)
<br>\s - whitespace (tab, space, newline)
<br>\S - anything except whitespace (tab, space, newline)
<br>\b - word boundary (start or end of a word)
<br>\B - anywhere except a word boundary
<br>^ - start of a string
<br>$ - end of a string
<br>[] - any one of the characters inside the brackets
<br>[^] - any one character not inside the brackets
<br>| - or
<br>( ) - group

<br>* - zero or more
<br>+ - one or more
<br>? - zero or one
<br>{3} - exact number of repetitions
<br>{3,4} - range of repetitions (minimum, maximum)

Reference (in English):
http://bit.ly/ExpressoesRegulares
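
A few of the tokens listed above can be tried directly with Python's built-in `re` module. This is a small sketch; the input strings are taken from the passenger names and tickets shown earlier.

```python
import re

name = 'Futrelle, Mrs. Jacques Heath (Lily May Peel)'

# ( ) captures a group, [a-zA-Z]+ is one or more letters, \. is a literal dot.
title = re.search(r'([a-zA-Z]+)\.', name).group(1)

# \d+ matches each run of one or more digits.
digits = re.findall(r'\d+', 'STON/O2. 3101282')

# \b marks a word boundary, \w+ is one or more word characters.
words = re.findall(r'\b\w+\b', 'PC 17599')

# {3,4} requires between three and four repetitions; fullmatch anchors both ends.
is_short_number = bool(re.fullmatch(r'\d{3,4}', '2625'))

print(title, digits, words, is_short_number)
```

Raw strings (`r'...'`) keep the backslashes intact, so `\d` and `\.` reach the regex engine unchanged.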

In [165]:
data_frame['Titulo'] = data_frame['Nome'].str.extract(r'([a-zA-Z]+)\.')

In [166]:
data_frame.head()

Unnamed: 0,IdPassageiro,Sobreviveu,Classe,Nome,Idade,IrmaosConjugue,PaisFilhos,Bilhete,Tarifa,Sexo_mulher,Q,S,Titulo
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 1759,71.2833,1,0,0,Mrs
2,3,1,3,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 310128,7.925,1,0,1,Miss
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,11380,53.1,1,0,1,Mrs
4,5,0,3,"Allen, Mr. William Henry",35.0,0,0,37345,8.05,0,0,1,Mr
5,6,0,3,"Moran, Mr. James",,0,0,33087,8.4583,0,1,0,Mr


**Feature engineering**

In [167]:
data_frame['Sexo'] = data_frame['Sexo_mulher'].map({0: 'homem', 1: 'mulher'})

In [168]:
pd.crosstab(data_frame['Titulo'], data_frame['Sexo'])

Titulo \ Sexo,homem,mulher
Capt,1,0
Col,2,0
Countess,0,1
Don,1,0
Dr,6,1
Jonkheer,1,0
Lady,0,1
Major,2,0
Master,40,0
Miss,0,182


In [169]:
data_frame['Titulo'] = data_frame['Titulo'].apply(lambda x: 'Outros' if x not in ['Master', 'Miss', 'Mrs', 'Mr'] else x)

In [170]:
data_frame['Titulo']

1         Mrs
2        Miss
3         Mrs
4          Mr
5          Mr
6          Mr
7      Master
8         Mrs
9         Mrs
10       Miss
11       Miss
12         Mr
13         Mr
14       Miss
15        Mrs
16     Master
17         Mr
18        Mrs
19        Mrs
20         Mr
21         Mr
22       Miss
23         Mr
24       Miss
25        Mrs
26         Mr
27         Mr
28       Miss
29         Mr
30     Outros
        ...  
861        Mr
862       Mrs
863      Miss
864        Mr
865       Mrs
866      Miss
867        Mr
868        Mr
869    Master
870        Mr
871       Mrs
872        Mr
873        Mr
874       Mrs
875      Miss
876        Mr
877        Mr
878        Mr
879       Mrs
880       Mrs
881        Mr
882      Miss
883        Mr
884        Mr
885       Mrs
886    Outros
887      Miss
888      Miss
889        Mr
890        Mr
Name: Titulo, Length: 890, dtype: object

In [171]:
pd.crosstab(data_frame['Titulo'], data_frame['Sexo'])

Titulo \ Sexo,homem,mulher
Master,40,0
Miss,0,182
Mr,516,0
Mrs,0,125
Outros,20,7
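
The grouping of rare titles into `'Outros'` can also be expressed without `apply`, using `isin` and `where`. The sketch below runs on a hypothetical list of titles rather than the extracted column.

```python
import pandas as pd

# Hypothetical titles; 'Dr' and 'Rev' stand for the rare ones.
titles = pd.Series(['Mr', 'Mrs', 'Dr', 'Miss', 'Master', 'Rev'])
common = ['Master', 'Miss', 'Mrs', 'Mr']

# where keeps values whose condition is True and replaces the rest.
grouped = titles.where(titles.isin(common), 'Outros')

print(grouped.value_counts())
```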


**Changing values row by row - iterrows()**

In [172]:
cabine = pd.read_csv('titanic/train.csv')

In [173]:
cabine['Cabin']

0              NaN
1              C85
2              NaN
3             C123
4              NaN
5              NaN
6              E46
7              NaN
8              NaN
9              NaN
10              G6
11            C103
12             NaN
13             NaN
14             NaN
15             NaN
16             NaN
17             NaN
18             NaN
19             NaN
20             NaN
21             D56
22             NaN
23              A6
24             NaN
25             NaN
26             NaN
27     C23 C25 C27
28             NaN
29             NaN
          ...     
861            NaN
862            D17
863            NaN
864            NaN
865            NaN
866            NaN
867            A24
868            NaN
869            NaN
870            NaN
871            D35
872    B51 B53 B55
873            NaN
874            NaN
875            NaN
876            NaN
877            NaN
878            NaN
879            C50
880            NaN
881            NaN
882         

In [174]:
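# Visit only the rows that have a cabin and keep its first character (the deck letter).
for idx, _ in cabine[['Cabin']].dropna().iterrows():
    # Index the DataFrame directly: writing through cabine['Cabin'].at[idx]
    # is chained assignment and may not update the original frame.
    cabine.at[idx, 'Cabin'] = cabine.at[idx, 'Cabin'][0]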

In [175]:
cabine['Cabin']

0      NaN
1        C
2      NaN
3        C
4      NaN
5      NaN
6        E
7      NaN
8      NaN
9      NaN
10       G
11       C
12     NaN
13     NaN
14     NaN
15     NaN
16     NaN
17     NaN
18     NaN
19     NaN
20     NaN
21       D
22     NaN
23       A
24     NaN
25     NaN
26     NaN
27       C
28     NaN
29     NaN
      ... 
861    NaN
862      D
863    NaN
864    NaN
865    NaN
866    NaN
867      A
868    NaN
869    NaN
870    NaN
871      D
872      B
873    NaN
874    NaN
875    NaN
876    NaN
877    NaN
878    NaN
879      C
880    NaN
881    NaN
882    NaN
883    NaN
884    NaN
885    NaN
886    NaN
887      B
888    NaN
889      C
890    NaN
Name: Cabin, Length: 891, dtype: object
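
Row-by-row loops are easy to follow but slow on large frames. For this particular transformation, a vectorized sketch with the `.str` accessor gives the same result, shown here on a toy series (hypothetical values in the same format as `Cabin`).

```python
import pandas as pd

cabins = pd.Series(['C85', None, 'C23 C25 C27', 'E46'], name='Cabin')

# .str[0] takes the first character of each string and leaves missing values missing.
decks = cabins.str[0]

print(decks.tolist())
```

`iterrows()` is still a reasonable choice when each row needs logic that cannot be expressed column-wise.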