# São Paulo, August 14, 2019
## Competition Description

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

### Files: {train.csv, test.csv}

In [95]:
#Upgrade pip
!pip install --upgrade pip

Requirement already up-to-date: pip in /home/nbuser/anaconda3_420/lib/python3.5/site-packages (19.2.2)


In [96]:
# Imports
import pandas as pd
import numpy as np
from pandas import Series, DataFrame

In [97]:
# Load the data
train_df = pd.read_csv('data/train.csv', header=0)
test_df = pd.read_csv('data/test.csv', header=0)


In [98]:
train_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
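
Before choosing features, it can help to count the missing values in each column. The sketch below rebuilds the five rows printed above as a small in-memory frame (the name `sample` and the reduced column set are illustrative, not part of the notebook) and counts the nulls. On the full data the equivalent call would be `train_df.isnull().sum()`.

```python
import numpy as np
import pandas as pd

# The five rows printed by train_df.head(), reduced to a few columns.
sample = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4, 5],
    'Survived': [0, 1, 1, 1, 0],
    'Pclass': [3, 1, 3, 1, 3],
    'Sex': ['male', 'female', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, 'C123', np.nan],
})

# Number of missing entries in each column.
missing_per_column = sample.isnull().sum()
print(missing_per_column)
```

In these five rows only `Cabin` has gaps, which is one reason a column can be dropped or imputed before modelling.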


In [101]:
# Many columns here are not useful for this analysis, such as Name, PassengerId and Embarked.
train = train_df[['Survived', 'Pclass', 'Sex', 'Age']]

In [111]:
train.head()

Unnamed: 0,Survived,Pclass,Sex,Age
0,0,3,male,22.0
1,1,1,female,38.0
2,1,3,female,26.0
3,1,1,female,35.0
4,0,3,male,35.0


In [112]:
# Rename the Sex column to Male
train.columns = ['Survived','Pclass', 'Male', 'Age']

In [114]:
train['Age'].isnull().sum()

177

In [132]:
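# Non-null count per column. This cell ran as In [132], after the age imputation
# in In [125], which is why Age already shows 891.
train.count()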

Survived    891
Pclass      891
Male        891
Age         891
dtype: int64

In [117]:
# The ages need fixing: 177 of them are NaN.
# Mean of each numeric column over the rows with a known age (NaN > 0 is False).
train.loc[train['Age'] > 0].mean(numeric_only=True)

Survived     0.406162
Pclass       2.236695
Age         29.699118
dtype: float64
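
The notebook goes on to fill the gaps with a hand-typed 30. A minimal sketch of deriving the fill value from the data instead, on a toy Series (the values and the names `ages`, `mean_age` and `ages_filled` are illustrative assumptions):

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan])

# mean() skips NaN by default, so no filter such as Age > 0 is needed.
mean_age = ages.mean()

# fillna returns a new Series; the original is left untouched.
ages_filled = ages.fillna(round(mean_age))
```

Computing the value rather than typing it keeps the imputation correct if the input file changes.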

In [125]:
# Fill the missing ages with the mean age found above (29.7, rounded to 30)
train.loc[train['Age'].isnull(), 'Age'] = 30

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
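
The warning above appears because `train` was created by selecting columns from `train_df`, so pandas cannot tell whether the assignment is meant to reach the original frame. One common way to avoid it, sketched here on a toy frame (the names `full` and `subset` are illustrative), is to take an explicit copy before modifying:

```python
import numpy as np
import pandas as pd

full = pd.DataFrame({
    'Survived': [0, 1, 1],
    'Age': [22.0, np.nan, 26.0],
    'Name': ['a', 'b', 'c'],
})

# .copy() makes the subset an independent frame, so later writes are unambiguous.
subset = full[['Survived', 'Age']].copy()
subset.loc[subset['Age'].isnull(), 'Age'] = 30

# The source frame still holds its NaN; only the copy was changed.
print(subset)
```

Applied to this notebook, that would mean writing `train = train_df[['Survived', 'Pclass', 'Sex', 'Age']].copy()`.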


In [131]:
train.head(7)

Unnamed: 0,Survived,Pclass,Male,Age
0,0,3,1,22.0
1,1,1,0,38.0
2,1,3,0,26.0
3,1,1,0,35.0
4,0,3,1,35.0
5,0,3,1,30.0
6,0,1,1,54.0


In [133]:
# Encode the Male column as integers: 1 for male, 0 for female
train.loc[train.Male == 'male', 'Male'] = 1
train.loc[train.Male == 'female', 'Male'] = 0

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
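
The two `.loc` assignments can also be written as a single mapping. This is a sketch on a toy Series, not the notebook's own code; the names `sex` and `male` are illustrative:

```python
import pandas as pd

sex = pd.Series(['male', 'female', 'female', 'male'])

# Map each label to an integer; a label missing from the dict would become NaN.
male = sex.map({'male': 1, 'female': 0})
print(male.tolist())
```

A side benefit is that `map` returns an integer column, whereas assigning integers into a string column through `.loc` can leave the dtype as `object`.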


In [137]:
# Select the third-class passengers
train.loc[train['Pclass'] > 2]

Unnamed: 0,Survived,Pclass,Male,Age
0,0,3,1,22.0
2,1,3,0,26.0
4,0,3,1,35.0
5,0,3,1,30.0
7,0,3,1,2.0
8,1,3,0,27.0
10,1,3,0,4.0
12,0,3,1,20.0
13,0,3,1,39.0
14,0,3,0,14.0
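
The boolean filter above selects third-class passengers. A hedged sketch of the same filter, plus a per-class survival rate, on the seven rows printed by `train.head(7)` (the frame name `rows` is illustrative, and seven rows are far too few to draw conclusions from):

```python
import pandas as pd

# The seven rows shown by train.head(7).
rows = pd.DataFrame({
    'Survived': [0, 1, 1, 1, 0, 0, 0],
    'Pclass': [3, 1, 3, 1, 3, 3, 1],
    'Male': [1, 0, 0, 0, 1, 1, 1],
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0, 30.0, 54.0],
})

# Boolean mask: keep rows whose class is greater than 2, i.e. third class.
third_class = rows.loc[rows['Pclass'] > 2]

# Mean of the 0/1 Survived column per class is the survival rate in that class.
rate_by_class = rows.groupby('Pclass')['Survived'].mean()
print(rate_by_class)
```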


In [None]:
# We'll impute missing values using the median for numeric columns and the most
# common value for string columns.
# This is based on some nice code by 'sveitser' at http://stackoverflow.com/a/25562948
import xgboost as xgb
from sklearn.base import TransformerMixin
from sklearn.preprocessing import LabelEncoder

In [None]:
class DataFrameImputer(TransformerMixin):
    def fit(self, X, y=None):
        self.fill = pd.Series([X[c].value_counts().index[0]
            if X[c].dtype == np.dtype('O') else X[c].median() for c in X],
            index=X.columns)
        return self
    def transform(self, X, y=None):
        return X.fillna(self.fill)

feature_columns_to_use = ['Pclass','Sex','Age','Fare','Parch']
nonnumeric_columns = ['Sex']
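
To see what `DataFrameImputer` computes, here is the same fill logic written out on a toy frame without the scikit-learn mixin. The frame `toy` and its values are illustrative assumptions; the `Sex` column is forced to `object` dtype so the dtype test behaves the same across pandas versions.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Sex': pd.Series(['male', 'female', None, 'male'], dtype=object),
    'Age': [22.0, np.nan, 26.0, 35.0],
})

# Per-column fill value: most frequent value for object columns, median otherwise.
fill = pd.Series(
    [toy[c].value_counts().index[0] if toy[c].dtype == np.dtype('O')
     else toy[c].median() for c in toy],
    index=toy.columns,
)

# fillna with a Series matches fill values to columns by name.
toy_filled = toy.fillna(fill)
print(toy_filled)
```

Here the missing `Sex` becomes `'male'` (the most common label) and the missing `Age` becomes 26.0 (the median of 22, 26 and 35).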


In [None]:
# Join the features from train and test together before imputing missing values,
# in case their distribution is slightly different
# DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions
big_X = pd.concat([train_df[feature_columns_to_use], test_df[feature_columns_to_use]])
big_X_imputed = DataFrameImputer().fit_transform(big_X)


In [None]:
# XGBoost doesn't (yet) handle categorical features automatically, so we need to change
# them to columns of integer values.
# See http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing for more
# details and options
le = LabelEncoder()
for feature in nonnumeric_columns:
    big_X_imputed[feature] = le.fit_transform(big_X_imputed[feature])
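
`LabelEncoder` numbers the classes in sorted order, so `'female'` becomes 0 and `'male'` becomes 1. A pandas-only sketch that produces the same coding without scikit-learn (a swapped-in technique for illustration, not what the cell above runs; the names `sex` and `codes` are illustrative):

```python
import pandas as pd

sex = pd.Series(['male', 'female', 'female', 'male'])

# Categories are sorted, so 'female' -> 0 and 'male' -> 1.
codes = sex.astype('category').cat.codes
print(codes.tolist())
```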


In [None]:
# Prepare the inputs for the model
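# The first train_df.shape[0] rows came from train.csv, the rest from test.csv.
# .as_matrix() was removed in pandas 1.0; .values returns the same NumPy array.
train_X = big_X_imputed[0:train_df.shape[0]].values
test_X = big_X_imputed[train_df.shape[0]:].values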
train_y = train_df['Survived']
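
The split above relies on row position: the combined frame keeps the training rows first and the test rows after them. A minimal sketch of that stack-then-split pattern on two toy frames (all names and values here are illustrative assumptions):

```python
import pandas as pd

train_part = pd.DataFrame({'Pclass': [3, 1, 3], 'Fare': [7.25, 71.2833, 7.925]})
test_part = pd.DataFrame({'Pclass': [2, 3], 'Fare': [13.0, 8.05]})

# Stack the two frames; row order is preserved, but index labels repeat.
stacked = pd.concat([train_part, test_part])

# Split by position, not by label, because labels 0 and 1 occur twice.
n_train = train_part.shape[0]
first = stacked.iloc[:n_train].values
rest = stacked.iloc[n_train:].values
```

Using `iloc` makes the positional intent explicit and avoids surprises from the duplicated index labels.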



In [None]:
# You can experiment with many other options here, using the same .fit() and .predict()
# methods; see http://scikit-learn.org
# This example uses the current build of XGBoost, from https://github.com/dmlc/xgboost
gbm = xgb.XGBClassifier(max_depth=3, n_estimators=333, learning_rate=0.0001).fit(train_X, train_y)
predictions = gbm.predict(test_X)


In [None]:
# Kaggle needs the submission to have a certain format;
# see https://www.kaggle.com/c/titanic-gettingStarted/download/gendermodel.csv
# for an example of what it's supposed to look like.
submission = pd.DataFrame({ 'PassengerId': test_df['PassengerId'],
                            'Survived': predictions })
submission.to_csv("data/submission.csv", index=False)
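
A quick way to check the submission layout before uploading is to write it to an in-memory buffer and inspect the text. This is a sketch with stand-in values; `passenger_ids`, the three predictions and the buffer are illustrative, not output from the model above.

```python
import io

import pandas as pd

# Stand-ins for test_df['PassengerId'] and the model's predictions.
passenger_ids = pd.Series([892, 893, 894])
predictions = [0, 1, 0]

submission = pd.DataFrame({'PassengerId': passenger_ids, 'Survived': predictions})

# Write to an in-memory buffer instead of data/submission.csv.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

With `index=False` the file contains only the two named columns, one row per passenger, under a `PassengerId,Survived` header.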