#### A Random Forest analysis in Python
A detailed study of Random Forests is beyond the scope of this tutorial. However, since it is a widely used machine learning technique, a general understanding of how to apply it in Python is worthwhile.

In layman's terms, the Random Forest technique addresses the overfitting problem you faced with decision trees. It grows multiple (very deep) classification trees, each fitted to a random sample of the training set so that the trees differ from one another. At prediction time, each tree makes its own prediction and every outcome is counted as a vote. For example, if you have trained 3 trees and 2 say a passenger in the test set will survive while 1 says he will not, the passenger is classified as a survivor. Each individual tree is overtrained, but letting the majority vote decide the final classification reduces overfitting.
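
The voting step described above can be sketched in a few lines of plain Python. This is only an illustration of the idea: `majority_vote` and `tree_votes` are hypothetical names, and scikit-learn's `RandomForestClassifier` actually averages the trees' predicted class probabilities rather than counting hard votes.

```python
from collections import Counter

def majority_vote(votes):
    """Return the most common class label among the trees' votes."""
    return Counter(votes).most_common(1)[0][0]

# Three hypothetical trees vote on one passenger: 1 = survives, 0 = does not
tree_votes = [1, 1, 0]
prediction = majority_vote(tree_votes)
print(prediction)  # 1
```

Two of the three trees vote for survival, so the passenger is classified as a survivor.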

Building a random forest in Python looks almost the same as building a decision tree, so we can jump right in. There are two key differences, however: a different class is used, and a new argument comes into play. The class also has to be imported from a different scikit-learn module, `sklearn.ensemble`.

- Use the RandomForestClassifier() class instead of the DecisionTreeClassifier() class.
- Set n_estimators when creating the RandomForestClassifier(). This argument sets the number of trees you wish to plant and combine into the final prediction. Recent scikit-learn versions default to 100 trees if you leave it out.
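
As a rough sketch of these two differences, the snippet below fits both model types side by side. The data comes from `make_classification`, which is an assumption made purely for illustration and is not the Titanic set; the variable names are hypothetical as well.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the Titanic set)
X, y = make_classification(n_samples=200, n_features=7, random_state=1)

# A single decision tree, as used earlier in the tutorial
tree = DecisionTreeClassifier(max_depth=10, random_state=1).fit(X, y)

# The forest: a different class plus the n_estimators argument
forest = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=1).fit(X, y)

# The fitted forest keeps one tree per estimator
print(len(forest.estimators_))  # 100
```

Apart from the class name and `n_estimators`, the `fit` and `predict` calls are the same for both models.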


In [35]:
# Import the Pandas library
import pandas as pd
# Load the train and test datasets to create two DataFrames
# (pd.read_csv also accepts a URL in place of a local file path)

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

target = train["Survived"].values

import warnings
pd.options.mode.chained_assignment = None  
warnings.simplefilter(action='ignore', category=FutureWarning)

In [36]:
train_one = train.copy()

# Convert the Sex classes to integer form
# (map avoids chained assignment, which newer pandas versions silently ignore)
train_one["Sex"] = train_one["Sex"].map({"male": 0, "female": 1})

# Impute Age variable
train_one["Age"] = train_one["Age"].fillna(train_one["Age"].median())

# Impute the Embarked variable
train_one["Embarked"] = train_one["Embarked"].fillna("S")

# Convert the Embarked classes to integer form
train_one["Embarked"] = train_one["Embarked"].map({"S": 0, "C": 1, "Q": 2})

In [33]:
test_one = test.copy()

# Impute the missing Fare value with the median
test_one["Fare"] = test_one["Fare"].fillna(test_one["Fare"].median())

# Convert the Sex classes to integer form
test_one["Sex"] = test_one["Sex"].map({"male": 0, "female": 1})

# Impute Age variable
test_one["Age"] = test_one["Age"].fillna(test_one["Age"].median())

# Impute the Embarked variable
test_one["Embarked"] = test_one["Embarked"].fillna("S")

# Convert the Embarked classes to integer form
test_one["Embarked"] = test_one["Embarked"].map({"S": 0, "C": 1, "Q": 2})
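
Both DataFrames go through the same cleaning steps, so one option is to wrap them in a single helper. The sketch below is a possible refactoring rather than the tutorial's own code: `preprocess` is a hypothetical name and the `demo` frame is made-up data that exists only to show the helper in action.

```python
import pandas as pd

def preprocess(df):
    """Return a copy of df with the encodings and imputations used above."""
    out = df.copy()
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    out["Age"] = out["Age"].fillna(out["Age"].median())
    out["Fare"] = out["Fare"].fillna(out["Fare"].median())
    out["Embarked"] = out["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2})
    return out

# Tiny made-up frame with one missing Age, Fare and Embarked value
demo = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Age": [22.0, None, 30.0],
    "Fare": [7.25, 71.28, None],
    "Embarked": ["S", None, "Q"],
})
clean = preprocess(demo)
print(clean)
```

With such a helper, `train_one = preprocess(train)` and `test_one = preprocess(test)` would replace the two cells. Like the cells above, it fills each frame with its own medians.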



In [38]:

# Import the `RandomForestClassifier`
from sklearn.ensemble import RandomForestClassifier

# We want the Pclass, Age, Sex, Fare, SibSp, Parch, and Embarked variables
features_forest = train_one[["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch", "Embarked"]].values

# Building and fitting my_forest
forest = RandomForestClassifier(max_depth=10, min_samples_split=2, n_estimators=100, random_state=1)
my_forest = forest.fit(features_forest, target)

# Print the accuracy of the fitted random forest on the training data
print(my_forest.score(features_forest, target))

# Compute predictions on our test set features then print the length of the prediction vector
test_features = test_one[["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch", "Embarked"]].values
pred_forest = my_forest.predict(test_features)
print(len(pred_forest))


0.9393939393939394
418
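
The score printed above is measured on the same rows the forest was trained on, so it tends to be optimistic. One common way to get a more realistic estimate is cross-validation. The sketch below uses synthetic, deliberately noisy data from `make_classification` (an assumption for illustration, not the Titanic set) to compare the two numbers.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data with 20% of the labels flipped at random (an assumption)
X, y = make_classification(n_samples=300, n_features=7, flip_y=0.2, random_state=1)

forest = RandomForestClassifier(max_depth=10, n_estimators=100, random_state=1)

# Accuracy on the rows the forest has already seen
train_accuracy = forest.fit(X, y).score(X, y)

# Mean accuracy over 5 folds, each scored on rows held out from fitting
cv_accuracy = cross_val_score(forest, X, y, cv=5).mean()

print(train_accuracy, cv_accuracy)
```

On the Titanic data, the same call would take `features_forest` and `target` in place of `X` and `y`.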


In [41]:
# Create a data frame indexed by PassengerId with one column, Survived, which contains your predictions
import numpy as np
PassengerId = np.array(test["PassengerId"]).astype(int)
my_solution = pd.DataFrame(pred_forest, index=PassengerId, columns=["Survived"])
print(my_solution)

# Check that your data frame has 418 entries
print(my_solution.shape)

# Write your solution to a csv file with the name my_solution_four.csv
my_solution.to_csv("my_solution_four.csv", index_label = ["PassengerId"])


      Survived
892          0
893          0
894          0
895          0
896          0
897          0
898          0
899          0
900          1
901          0
902          0
903          0
904          1
905          0
906          1
907          1
908          0
909          0
910          0
911          0
912          1
913          0
914          1
915          1
916          1
917          0
918          1
919          0
920          0
921          0
...        ...
1280         0
1281         0
1282         0
1283         1
1284         0
1285         0
1286         0
1287         1
1288         0
1289         1
1290         0
1291         0
1292         1
1293         0
1294         1
1295         0
1296         0
1297         0
1298         0
1299         0
1300         1
1301         1
1302         1
1303         1
1304         0
1305         0
1306         1
1307         0
1308         0
1309         1

[418 rows x 1 columns]
(418, 1)
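
To see what the written file looks like, the sketch below builds a three-row frame the same way and writes it to an in-memory buffer instead of disk. The IDs and predictions are made-up values and the variable names are hypothetical.

```python
import io

import numpy as np
import pandas as pd

# Made-up IDs and predictions, standing in for the real 418 rows
passenger_ids = np.array([892, 893, 894])
predictions = np.array([0, 1, 0])
solution = pd.DataFrame({"Survived": predictions}, index=passenger_ids)

# Write to an in-memory buffer; index_label names the index column in the header
buffer = io.StringIO()
solution.to_csv(buffer, index_label="PassengerId")
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back gives two columns: PassengerId and Survived
check = pd.read_csv(io.StringIO(csv_text))
print(check.shape)  # (3, 2)
```

Although the DataFrame holds a single column, the CSV has two, because the index is written out under the `PassengerId` header.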
