<div style="background-color:rgba(128, 0, 128, 0.6);border-radius:5px;display:fill"><h1 style="text-align: center;padding: 12px 0px 12px 0px;">Titanic: Learn Classification</h1>
</div>

This is a small tutorial targeted at the complete beginner.  It's no substitute for a good book on Machine Learning.  In fact, I highly recommend this book: 

[Hands on Machine Learning](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/) (HOML)

My main goal here is to get the beginner started on Kaggle, where there's no limit to learning ML. 

<div style="background-color:rgba(128, 0, 128, 0.6);border-radius:5px;display:fill"><h1 style="text-align: center;padding: 12px 0px 12px 0px;">Import Libraries</h1>
</div>

Best practice is to put all imports here.  However, I will place a few imports farther down, where they are first used, so beginners can learn with an "as needed" approach.

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from pathlib import Path

<div style="background-color:rgba(128, 0, 128, 0.6);border-radius:5px;display:fill"><h1 style="text-align: center;padding: 12px 0px 12px 0px;">Load Train/Test Data</h1>
</div>

- train.csv - Data used to build our machine learning model
- test.csv - Data we make predictions on for the submission. Does not contain the target variable
- gender_submission.csv - A file in the proper format to submit test predictions

In [2]:
data_dir = Path("../input/titanic")

train = pd.read_csv(data_dir / "train.csv")
test = pd.read_csv(data_dir / "test.csv")
sample_submission = pd.read_csv(data_dir / "gender_submission.csv")

print(f"train data: Rows={train.shape[0]}, Columns={train.shape[1]}")
print(f"test data : Rows={test.shape[0]}, Columns={test.shape[1]}")

train data: Rows=891, Columns=12
test data : Rows=418, Columns=11


In [3]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In supervised learning problems, we have a label (also called a target): the value we want the model to predict.

In [4]:
TARGET = "Survived"
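Before modelling, it can help to look at how the label is distributed. The following is a minimal sketch on a small hand-made stand-in for `train` (the name `toy_train` and its eight rows are assumptions for illustration, not the real data):

```python
import pandas as pd

# Toy stand-in for the real training frame; values are made up for illustration.
toy_train = pd.DataFrame({"Survived": [0, 1, 1, 0, 0, 1, 0, 0]})

TARGET = "Survived"

# Share of each class; normalize=True returns fractions instead of counts.
class_share = toy_train[TARGET].value_counts(normalize=True)
print(class_share)
```

On the real data, the same call would be `train[TARGET].value_counts(normalize=True)`.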

There are many features, but to keep it simple we will start with just one.

In [5]:
FEATURES = ["Sex"] # A not so random feature to start with

In [6]:
y = train[TARGET]
X = train[FEATURES].copy()

X_test = test[FEATURES].copy()

In [7]:
X.head()

Unnamed: 0,Sex
0,male
1,female
2,female
3,female
4,male


In [8]:
replacement_dict = {"female": 0, "male": 1}

X["Sex"] = X["Sex"].map(replacement_dict)
X_test["Sex"] = X_test["Sex"].map(replacement_dict)
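One thing to watch with `.map`: any value that is not a key in the dictionary becomes `NaN`, and the model cannot be fit on missing values. A small sketch of a sanity check, using a toy column (`toy_sex` is a hypothetical name, and the misspelled `"Male"` is included only to show the failure mode):

```python
import pandas as pd

replacement_dict = {"female": 0, "male": 1}

# Toy column with one unexpected label ("Male") to show the failure mode.
toy_sex = pd.Series(["male", "female", "Male"])

encoded = toy_sex.map(replacement_dict)

# Keys missing from the dict become NaN, so count them before modelling.
n_unmapped = int(encoded.isna().sum())
print(f"Unmapped values: {n_unmapped}")
```

Running the same check on `X["Sex"]` and `X_test["Sex"]` should report zero if every row is either `female` or `male`.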

<div style="background-color:rgba(128, 0, 128, 0.6);border-radius:5px;display:fill"><h1 style="text-align: center;padding: 12px 0px 12px 0px;">Train Model with Train/Test Split</h1>
</div>

We split the training data so we can evaluate how well each model performs.  We are saving 20% of the training data to validate the model(s).

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(
    X,
    y,
    test_size=0.2,    # Save 20% for validation
    random_state=42,  # Make the split deterministic
)
X_train.shape, y_train.shape, X_valid.shape, y_valid.shape

((712, 1), (712,), (179, 1), (179,))
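The split above is purely random. As an optional variation (not what this notebook uses), `train_test_split` also accepts a `stratify` argument that keeps the class ratio the same in both parts. A minimal sketch on toy data, where `toy_X` and `toy_y` are made-up stand-ins:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 20 rows, 30% positive class (illustrative values only).
toy_X = pd.DataFrame({"Sex": [0, 1] * 10})
toy_y = pd.Series([1] * 6 + [0] * 14, name="Survived")

X_tr, X_va, y_tr, y_va = train_test_split(
    toy_X,
    toy_y,
    test_size=0.5,
    random_state=42,
    stratify=toy_y,  # keep the class ratio the same in both parts
)

# Both halves should contain the same share of positives.
print(y_tr.mean(), y_va.mean())
```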

## Create a Model

In [10]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

## Fit/Train the model

In [11]:
model.fit(X_train,y_train)

LogisticRegression()

## Use the Trained Model to Predict the Validation Data

In [12]:
yhat = model.predict(X_valid)
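`predict` returns hard class labels (0 or 1). `LogisticRegression` also provides `predict_proba`, which returns the estimated probability of each class. A minimal sketch on toy data (`toy_X`, `toy_y`, and `toy_model` are hypothetical names with made-up values, not the Titanic data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: one binary feature (0 = female, 1 = male), made-up labels.
toy_X = np.array([[0], [0], [0], [0], [1], [1], [1], [1]])
toy_y = np.array([1, 1, 1, 0, 0, 0, 0, 1])

toy_model = LogisticRegression().fit(toy_X, toy_y)

# Columns follow toy_model.classes_, so column 1 is P(Survived = 1).
proba = toy_model.predict_proba(np.array([[0], [1]]))
print(proba[:, 1])
```

On the notebook's own model, the equivalent call would be `model.predict_proba(X_valid)`.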

<div style="background-color:rgba(128, 0, 128, 0.6);border-radius:5px;display:fill"><h1 style="text-align: center;padding: 12px 0px 12px 0px;">Score the Model</h1>
</div>

We score the model by comparing its predictions with the true labels in the validation data.


In [13]:
from sklearn.metrics import accuracy_score

score = accuracy_score(y_valid, yhat)
print(f"Score: {score:.4f}")

Score: 0.7821
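An accuracy number is easier to judge next to a naive baseline, such as always predicting the most frequent class. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels (`toy_X` and `toy_y` are illustrative assumptions, so the printed value says nothing about the Titanic data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy labels (made up): 6 of 10 passengers did not survive.
toy_X = np.zeros((10, 1))  # features are ignored by this baseline
toy_y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(toy_X, toy_y)
baseline_score = accuracy_score(toy_y, baseline.predict(toy_X))
print(f"Baseline: {baseline_score:.4f}")
```

To compare against the model above, one would fit the baseline on `X_train, y_train` and score it on `X_valid, y_valid`.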


## Predict the Test Data

In [14]:
preds = model.predict(X_test)

In [15]:
preds[:5]

array([0, 1, 0, 0, 1])

<div style="background-color:rgba(128, 0, 128, 0.6);border-radius:5px;display:fill"><h1 style="text-align: center;padding: 12px 0px 12px 0px;">Submission File</h1>
</div>

The sample file and our test data are in the same row order.  This allows us to simply assign our predictions to the target column (`Survived`) in the sample submission.
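If the row order could not be relied on, an alternative would be to build the submission from the test set's own `PassengerId` column. A minimal sketch with toy stand-ins (`toy_test` and `toy_preds` are hypothetical, with illustrative values):

```python
import numpy as np
import pandas as pd

# Toy stand-ins for `test` and `preds`; the IDs and labels are illustrative.
toy_test = pd.DataFrame({"PassengerId": [892, 893, 894], "Sex": [1, 0, 1]})
toy_preds = np.array([0, 1, 0])

# Pair each prediction with the ID of the row it was computed from.
toy_submission = pd.DataFrame(
    {"PassengerId": toy_test["PassengerId"], "Survived": toy_preds}
)
print(toy_submission)
```

This notebook takes the simpler route of reusing the sample file: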

In [16]:
sample_submission[TARGET] = preds
sample_submission.to_csv("submission.csv", index=False)
sample_submission

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,1
2,894,0
3,895,0
4,896,1
...,...,...
413,1305,0
414,1306,1
415,1307,0
416,1308,0
