[View in Colaboratory](https://colab.research.google.com/github/laurajacob22/LessonPlans/blob/master/Titanic_Python_AI_Exercise.ipynb)

**Titanic Machine Learning Exercise**

This is a practice exercise for creating artificial intelligence systems. The goal of this activity is to build a model that predicts, as accurately as possible, which passengers survived the sinking of the Titanic and which did not.


![Titanic Sinking](http://www.titanicuniverse.com/wp-content/uploads/2009/10/titanic-sinking-underwater.jpg)

The code below allows us to import several libraries that we will use to analyze our data. 

Library: a collection of prewritten functions that we can import into our own code.

In [0]:
import pandas as pd
import numpy as np
from sklearn import tree, preprocessing

Next, let's import two datasets. One is for training our model and the other is for testing it.

In [0]:
train_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
train = pd.read_csv(train_url)

test_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"
test = pd.read_csv(test_url)

We have now created two variables: train and test. We will use the data in "train" to build a model, and then we will apply that model to the data in "test" to predict outcomes for passengers it has not seen.

Let's see what is in the head (first few lines) of the training data:

In [0]:
print(train.head())

The data has 12 columns. The data dictionary explains what each one means:

* PassengerId = Unique ID number assigned to each passenger
* Survived = Survival (0 = No; 1 = Yes)
* Pclass = Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
* Name = Name
* Sex = Sex
* Age = Age (in years)
* SibSp = Number of siblings or spouses aboard
* Parch = Number of parents/children aboard
* Ticket = Ticket Number
* Fare = Passenger Fare (pre-1970 British Pound)
* Cabin = Cabin number
* Embarked = Port from where they embarked (C = Cherbourg; Q = Queenstown; S = Southampton)


NaN vs. Zero

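* NaN = "Not a Number"; pandas uses it to mark a missing value for a variable
* Zero = the actual number 0, which is a real, recorded value
* Null = a general term for a missing or empty value; pandas displays nulls as NaN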
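
The difference between a missing value and a real zero can be sketched with a few made-up ages. The `ages` series below is invented for illustration and is not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Made-up ages for illustration; the second passenger's age is unknown.
ages = pd.Series([22.0, np.nan, 0.0, 38.0])

# isna() flags missing entries; a real 0.0 is not flagged.
missing_mask = ages.isna()
print(missing_mask.tolist())

# pandas skips NaN when averaging, but a real 0 is counted: (22 + 0 + 38) / 3
mean_age = ages.mean()
print(mean_age)

# NaN is not equal to anything, including itself, so test with isna(), not ==.
print(np.nan == np.nan)
```

Treating a missing age as 0 would pull the average down, which is why the two are kept separate.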

Let's see what the first few lines of the test dataset look like:

In [0]:
print(test.head())

Next, we are going to do what's called "interviewing the data." We will run some basic summary statistics on the training dataset to understand it a little better.

In [0]:
train.describe()



* How many records are there?
* What was the average age of the passengers?
* What was the maximum number of siblings or spouses?
* How old was the oldest person on the Titanic?
* What was the average fare?
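
One possible way to pull individual answers out of a summary table is to look them up by row and column. This is a sketch on a tiny invented table named `toy`, standing in for `train`; the numbers are not real Titanic values.

```python
import pandas as pd

# Hypothetical mini-dataset standing in for `train`.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, None, 35.0],
    "SibSp": [1, 1, 0, 3],
    "Fare": [7.25, 71.25, 8.0, 53.5],
})

summary = toy.describe()

# describe() returns a DataFrame, so each statistic can be looked up
# by its row label ("count", "mean", "max", ...) and column name.
n_ages = summary.loc["count", "Age"]     # counts non-missing ages only
oldest = summary.loc["max", "Age"]
max_sibsp = summary.loc["max", "SibSp"]
avg_fare = summary.loc["mean", "Fare"]

print(len(toy), n_ages, oldest, max_sibsp, avg_fare)
```

Note that "count" can be smaller than the number of rows when a column has missing values, so `len()` is the safer way to count records.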

Let's dive a little deeper into the data:


In [0]:
train["Pclass"].value_counts()

In [0]:
train["Survived"].value_counts()

In [0]:
print(train["Survived"].value_counts(normalize = True))

**What factors will help us improve our prediction of survival?**

In [0]:
#Males that survived vs males that passed away
print(train["Survived"][train["Sex"] == 'male'].value_counts())

In [0]:
#Females that survived vs females that passed away
print(train["Survived"][train["Sex"] == 'female'].value_counts())

In [0]:
#normalized male survival
print(train["Survived"][train["Sex"] == 'male'].value_counts(normalize=True))

In [0]:
#normalized female survival
print(train["Survived"][train["Sex"] == 'female'].value_counts(normalize=True))
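
The four cells above can also be condensed into a grouped summary. The sketch below uses a small invented table (`toy`) rather than the real training data, so the rates it prints are illustrative only.

```python
import pandas as pd

# Invented rows for illustration; not real passengers.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# The mean of a 0/1 column is the share of 1s, i.e. the survival rate per group.
survival_rate = toy.groupby("Sex")["Survived"].mean()
print(survival_rate)

# crosstab shows the raw counts of each outcome per group in one table.
counts = pd.crosstab(toy["Sex"], toy["Survived"])
print(counts)
```

On the real data, the same two calls could be run with `train` in place of `toy`.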

To construct our model, we are going to develop a "Decision Tree", a type of algorithm. 

The first thing we need to do is fill in the missing values, because the algorithm may refuse to run on data that contains NaN.

In [0]:
train["Age"] = train["Age"].fillna(train["Age"].median())

In [0]:
print(train)
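
To see what median filling does, here is a minimal sketch on a made-up series of ages (the values are assumptions chosen for illustration).

```python
import numpy as np
import pandas as pd

# Made-up ages with two missing entries and one unusually high value.
ages = pd.Series([22.0, np.nan, 26.0, 80.0, np.nan])

# median() ignores NaN: it is the middle value of 22, 26, and 80.
median_age = ages.median()

# fillna() returns a new series with every NaN replaced by the given value.
filled = ages.fillna(median_age)

print(median_age)
print(filled.tolist())

# For comparison, the mean is pulled upward by the single value of 80.
print(ages.mean())
```

The median is often preferred over the mean for filling gaps because a few extreme values affect it less.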

Let's create arrays (structures that the computer can manipulate):

In [0]:
#Create the target numpy array from the "Survived" column
target = train["Survived"].values

#Set up an encoder that converts text labels into numbers
encoded_sex = preprocessing.LabelEncoder()

#Convert "female"/"male" into 0/1, then build the features array: features_one
train["Sex"] = encoded_sex.fit_transform(train["Sex"])
features_one = train[["Pclass", "Sex", "Age", "Fare"]].values

#Fit the first decision tree: my_tree_one
my_tree_one = tree.DecisionTreeClassifier()
my_tree_one = my_tree_one.fit(features_one, target)

We are running a function called "fit" on the decision tree classifier named "my_tree_one." The features we give it are Pclass, Sex, Age, and Fare. We are asking the algorithm to work out how these four features relate to the value in the target field, "Survived."
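
The encode-then-fit steps can be sketched on four invented passengers. The names `encoder`, `toy_tree`, and the feature rows are assumptions for illustration; `random_state=0` is added only to make the sketch repeatable.

```python
import numpy as np
from sklearn import preprocessing, tree

# LabelEncoder sorts the labels alphabetically and numbers them from 0.
sex = ["male", "female", "female", "male"]
encoder = preprocessing.LabelEncoder()
sex_codes = encoder.fit_transform(sex)

print(encoder.classes_.tolist())   # position in this list is the numeric code
print(sex_codes.tolist())

# Hypothetical feature rows: [Pclass, Sex code]; target: 1 = survived.
features = np.column_stack([[3, 1, 2, 3], sex_codes])
target = np.array([0, 1, 1, 0])

toy_tree = tree.DecisionTreeClassifier(random_state=0)
toy_tree.fit(features, target)

# predict() expects a 2D input: one row per passenger.
prediction = toy_tree.predict([[1, 0]])
print(prediction.tolist())
```

With only four rows the tree simply memorizes the pattern, but the sequence of calls is the same as for the full dataset.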

In [0]:
print(my_tree_one.feature_importances_)

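The feature_importances_ values show how much each feature contributed to the tree's decisions. They are relative weights that add up to 1, not measures of statistical significance. One run produced the values below (yours may differ slightly):

* Pclass = 0.13303968
* Sex = 0.31274009
* Age = 0.2390173
* Fare = 0.31520292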



**What is the highest number? What does that say?**
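
To make the importances easier to read, one option is to print each value next to its feature name. The sketch below uses synthetic data in which only the second column decides the outcome, so it is an assumption-laden illustration rather than a result about the Titanic.

```python
import numpy as np
from sklearn import tree

feature_names = ["Pclass", "Sex", "Age", "Fare"]

# Synthetic data: the target depends only on column index 1 ("Sex").
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(1, 4, size=200),    # stand-in for Pclass
    rng.integers(0, 2, size=200),    # stand-in for encoded Sex
    rng.uniform(1, 80, size=200),    # stand-in for Age
    rng.uniform(5, 100, size=200),   # stand-in for Fare
])
y = 1 - X[:, 1].astype(int)

clf = tree.DecisionTreeClassifier(random_state=0).fit(X, y)

# Pair each importance with its feature name.
for name, importance in zip(feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.3f}")

# The importances are normalized, so they sum to 1.
total = clf.feature_importances_.sum()
print(total)
```

Because the synthetic target was built from one column, that column receives essentially all of the weight; real data usually spreads the weight across several features.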

Let's check how accurate our model is on the data it was trained on:

In [0]:
print(my_tree_one.score(features_one, target))

**How accurate is your model?**
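
A score measured on the same rows the tree was trained on tends to be optimistic. One common way to get a fairer estimate is to hold back part of the data and score on that. The following is a minimal sketch on synthetic, deliberately noisy data; the variable names and the 25% split are assumptions, not part of the original exercise.

```python
import numpy as np
from sklearn import tree
from sklearn.model_selection import train_test_split

# Synthetic data: the target follows the first column, plus random noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = (X[:, 0] + rng.normal(scale=1.0, size=400) > 0).astype(int)

# Hold back 25% of the rows; the tree never sees them during fitting.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = tree.DecisionTreeClassifier(random_state=0).fit(X_fit, y_fit)

train_acc = clf.score(X_fit, y_fit)   # accuracy on rows the tree has seen
val_acc = clf.score(X_val, y_val)     # accuracy on held-back rows
print(train_acc, val_acc)
```

A large gap between the two numbers suggests the tree has memorized its training rows rather than learned a general pattern.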

Now, we are going to take this model and apply it to the test data. The test data does not have a "Survived" column, so our job is to predict whether each passenger in the test data survived or perished.

In [0]:
#Fill any missing fare values with the median fare
test["Fare"] = test["Fare"].fillna(test["Fare"].median())

#Fill any missing age values with the median age
test["Age"] = test["Age"].fillna(test["Age"].median())

#Preprocess: reuse the encoder fitted on the training data so Sex gets the same codes
test.Sex = encoded_sex.transform(test.Sex)

#Extract important features from the test set: Pclass, Sex, Age, and Fare
test_features = test[["Pclass", "Sex", "Age", "Fare"]].values
print('These are the features:\n')
print(test_features)

#Make a prediction using the test set and print
my_prediction = my_tree_one.predict(test_features)
print('This is the prediction:\n')
print(my_prediction)

#Create a data frame indexed by PassengerId with one column: Survived
#Survived contains the model's prediction
PassengerId = np.array(test["PassengerId"]).astype(int)
my_solution = pd.DataFrame(my_prediction, PassengerId, columns = ["Survived"])
print('This is the solution in toto:\n')
print(my_solution)

#Check that the data frame has 418 entries
print('This is the solution shape:\n')
print(my_solution.shape)

#Write the solution to a CSV file with the name my_solution_one.csv
my_solution.to_csv("my_solution_one.csv", index_label = ["PassengerId"])
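
To check what the saved file looks like, the write step can be sketched with an in-memory buffer and read back. The IDs and predictions below are invented stand-ins for the real test set.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical IDs and predictions standing in for the real test set.
passenger_ids = np.array([892, 893, 894])
predictions = np.array([0, 1, 0])

solution = pd.DataFrame(predictions, index=passenger_ids, columns=["Survived"])

# Write to an in-memory buffer instead of a file so nothing is left on disk.
buffer = io.StringIO()
solution.to_csv(buffer, index_label="PassengerId")
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back shows the index became a regular PassengerId column.
buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(round_trip.columns.tolist())
print(round_trip.shape)
```

The `index_label` argument names the index column in the header row, which is why the file ends up with two columns even though the data frame has only one.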