<a href="https://colab.research.google.com/github/mdeihim/CPE322/blob/master/TitanicTensor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Titanic Linear Estimator Using TensorFlow

This tutorial uses numeric and categorical data about Titanic passengers, both those who survived and those who died, to train a linear model. The trained model then predicts whether passengers it has not seen before survived.

Our objectives are:
1. Load the datasets as pandas dataframes
2. Preprocess the data
3. Use a TensorFlow model to predict passengers' survival

# Setup
Setup is particularly easy in Google Colaboratory because the necessary libraries are already installed and do not have to be downloaded onto your local machine. Run the code block below to load them into your current notebook:


In [0]:
!pip install -q scikit-learn
# The %tensorflow_version magic is only needed in a Colab notebook. Keep comments
# off that line, because the magic treats everything after it as its argument.
%tensorflow_version 2.x
from __future__ import absolute_import, division, print_function, unicode_literals

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import clear_output
from six.moves import urllib
import tensorflow.compat.v2.feature_column as fc

import tensorflow as tf  # now import the tensorflow module


TensorFlow 2.x selected.


# Loading the Data
In this section, we use the pandas library to read data from CSV files into dataframes. The data is already split into two datasets: one for training the model and one for evaluating it. For more information on pandas.read_csv, see the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

In [0]:
# Load each dataset into a dataframe using pandas' read_csv
dftrain = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/train.csv')  # training data
dfeval = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/eval.csv')  # evaluation data

# Show the first five entries of the training data
dftrain.head()
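
Before preprocessing, it can help to check the shape and column types of a freshly loaded dataframe. The sketch below uses a small in-memory CSV with made-up rows (not real passenger records) so that it runs without network access; the name `sample_df` and the row values are assumptions for illustration, and only a few of the Titanic column names are mirrored.

```python
import io

import pandas as pd

# Hypothetical sample rows, for illustration only (not taken from the real dataset).
csv_text = """survived,sex,age,fare,class
0,male,22.0,7.25,Third
1,female,38.0,71.28,First
1,female,26.0,7.92,Third
"""

# read_csv accepts a URL, a file path, or any file-like object.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)   # (number of rows, number of columns)
print(sample_df.dtypes)  # one dtype per column
```

The same two calls can be applied to `dftrain` and `dfeval` to confirm that both files loaded with the same set of columns.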

# Pre-Processing the Data
From this dataframe, we can see that the feature columns are sex, age, n_siblings_spouses, parch, fare, class, deck, embark_town, and alone. The label, the column we are trying to predict, is the "survived" column. To train against this column, we must separate it from the features in both the training and evaluation sets using pop().

In [0]:
# Remove the survived column from each dataframe and save it to a new variable
y_train = dftrain.pop('survived')  # labels, aligned by index with the rows of dftrain
y_eval = dfeval.pop('survived')
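
To make the effect of pop() concrete, here is a minimal sketch on a toy dataframe. The names `toy_df` and `toy_labels` and the row values are hypothetical; only the pop() behavior itself matches the cell above.

```python
import pandas as pd

# Hypothetical toy frame, for illustration only.
toy_df = pd.DataFrame({
    "age": [22.0, 38.0, 26.0],
    "sex": ["male", "female", "female"],
    "survived": [0, 1, 1],
})

# pop() removes the column from the frame in place and returns it as a Series.
toy_labels = toy_df.pop("survived")

print(list(toy_df.columns))  # the label column is no longer among the features
print(toy_labels.tolist())   # the labels, in the same row order as toy_df
```

Because pop() modifies the dataframe in place, running the cell a second time raises a KeyError: the column is already gone.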

Not all of the data is numeric, so the columns must be split into two groups: numeric columns and categorical columns.

In [1]:
CATEGORICAL_COLUMNS = ['sex', 'n_siblings_spouses', 'parch', 'class', 'deck',
                       'embark_town', 'alone']
NUMERIC_COLUMNS = ['age', 'fare']

feature_columns = []
for feature_name in CATEGORICAL_COLUMNS:
  vocabulary = dftrain[feature_name].unique()  # array of all unique values in the given feature column
  feature_columns.append(tf.feature_column.categorical_column_with_vocabulary_list(feature_name, vocabulary))
for feature_name in NUMERIC_COLUMNS:
  feature_columns.append(tf.feature_column.numeric_column(feature_name, dtype=tf.float32))

print(feature_columns)
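
A linear model can only multiply numbers by weights, so each categorical value has to become numeric first. The following is a conceptual sketch in plain Python, not TensorFlow's actual implementation, of how a vocabulary list lets each category be represented as a one-hot vector. The helper `one_hot` and the sample values are assumptions made for illustration.

```python
# Hypothetical column values, for illustration only.
embark_town = ["Southampton", "Cherbourg", "Southampton", "Queenstown"]

# Collect the unique values in order of first appearance
# (similar in spirit to pandas' Series.unique()).
vocabulary = list(dict.fromkeys(embark_town))


def one_hot(value, vocab):
    """Return a list with 1.0 at the position of `value` in `vocab` and 0.0 elsewhere."""
    return [1.0 if value == item else 0.0 for item in vocab]


encoded = [one_hot(value, vocabulary) for value in embark_town]

print(vocabulary)   # ['Southampton', 'Cherbourg', 'Queenstown']
print(encoded[1])   # [0.0, 1.0, 0.0]
```

With this representation, the model can learn one weight per vocabulary entry instead of treating the category as a single ordered number.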

# Input Function


In [0]:
# Input function factory: sets the number of epochs, whether to shuffle, and the batch size.
# This pattern can usually be reused as-is.
def make_input_fn(data_df, label_df, num_epochs=10, shuffle=False, batch_size=32):
  def input_function():  # inner function, this will be returned
    ds = tf.data.Dataset.from_tensor_slices((dict(data_df), label_df))  # create a tf.data.Dataset object from the data and its labels
    if shuffle:
      ds = ds.shuffle(1000)  # randomizes the order of the data
    ds = ds.batch(batch_size).repeat(num_epochs)  # split the dataset into batches of batch_size and repeat for num_epochs epochs
    return ds  # return the batched dataset
  return input_function  # return a function object for use
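
To see what `batch(batch_size).repeat(num_epochs)` implies for the number of batches the model receives, here is a plain-Python sketch that mimics batching followed by repeating, without TensorFlow. The helper name `iter_batches` and the row count of 100 are made up for illustration.

```python
import math


def iter_batches(rows, batch_size, num_epochs):
    """Yield consecutive batches of `rows`, repeating the full pass `num_epochs` times."""
    for _ in range(num_epochs):
        for start in range(0, len(rows), batch_size):
            yield rows[start:start + batch_size]


rows = list(range(100))  # hypothetical dataset of 100 rows
batches = list(iter_batches(rows, batch_size=32, num_epochs=10))

# Each epoch has ceil(100 / 32) = 4 batches; the last one holds the 4 leftover rows.
batches_per_epoch = math.ceil(len(rows) / 32)

print(batches_per_epoch)  # 4
print(len(batches))       # 40
print(len(batches[3]))    # 4
```

The final batch of each pass is smaller than `batch_size` whenever the row count is not an exact multiple of it.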

# Creating and Training the Model

In [0]:


train_input_fn = make_input_fn(dftrain, y_train)  # returns a function that builds the dataset the model will consume
eval_input_fn = make_input_fn(dfeval, y_eval, num_epochs=1, shuffle=False)

# create the model

linear_est = tf.estimator.LinearClassifier(feature_columns=feature_columns)  # TensorFlow's linear classifier estimator
linear_est.train(train_input_fn)  # train the model
result = linear_est.evaluate(eval_input_fn)  # compute evaluation metrics on the evaluation set

clear_output()  # clear the training logs from the cell output
print(result['accuracy'])  # result is a dict of metrics
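
For two classes, a linear classifier of this kind amounts to logistic regression: a weighted sum of the features is passed through a sigmoid to give a probability. The sketch below shows that arithmetic in plain Python. The function name, weights, and inputs are hypothetical and are not values learned by `linear_est`.

```python
import math


def predict_survival_probability(features, weights, bias):
    """Logistic regression: sigmoid of the weighted sum of the features."""
    logit = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical weights and inputs, chosen only to illustrate the arithmetic.
weights = [0.5, -0.25]
bias = 0.0
features = [2.0, 4.0]  # logit = 0.0 + 0.5 * 2.0 - 0.25 * 4.0 = 0.0

print(predict_survival_probability(features, weights, bias))  # 0.5
```

Training adjusts the weights and bias so that these probabilities move closer to the observed labels.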


# Results

In [0]:
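result = list(linear_est.predict(eval_input_fn))  # one prediction dict per evaluation row
print(dfeval.loc[0])  # features of the first passenger in the evaluation set
print(y_eval.loc[0])  # actual outcome: 1 = survived, 0 = did not survive
print(result[0]['probabilities'][1])  # predicted probability of survival (class 1) for the first passenger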
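
A predicted probability becomes a yes/no prediction once a threshold is applied, and comparing those predictions with the true labels gives an accuracy. The sketch below assumes a threshold of 0.5 and uses made-up probabilities and labels rather than actual model output.

```python
# Hypothetical probabilities and labels, for illustration only.
survival_probs = [0.9, 0.2, 0.6, 0.4]
true_labels = [1, 0, 0, 0]

# Predict survival (1) when the probability is above 0.5, otherwise 0.
predicted = [1 if p > 0.5 else 0 for p in survival_probs]

# Accuracy is the fraction of predictions that match the true labels.
correct = sum(int(p == t) for p, t in zip(predicted, true_labels))
accuracy = correct / len(true_labels)

print(predicted)  # [1, 0, 1, 0]
print(accuracy)   # 0.75
```

In the notebook, the same comparison could be made between `result[i]['probabilities'][1]` and `y_eval.loc[i]` for each evaluation row.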