# Titanic Tutorial for Beginners, Part 1: The very basics

Hi, this is my first Titanic tutorial. My impression is that many of the available starter notebooks are lengthy and code-heavy, which is not great for a newbie. So, I'll try my best to keep the tutorials short and intuitive for beginners. My aim is to make a series of kernels (4 parts), starting from this one, which quickly walks through loading and understanding our data and submitting the output of a very simple model. I will gradually add more tutorials on more advanced techniques used by competitive Kagglers, such as detailed EDA, local validation, feature engineering, hyperparameter tuning, model stacking, and more. I hope it is useful, thanks!

**Notebooks in the series**:

* https://www.kaggle.com/kernelgenerator/titanic-tutorial-for-beginners-part-1
* https://www.kaggle.com/kernelgenerator/titanic-tutorial-for-beginners-part-2
* https://www.kaggle.com/kernelgenerator/titanic-tutorial-for-beginners-part-3


Below we can see the default first cell in any Kaggle Kernel. It is handy since it loads **NumPy** and **Pandas**, the core libraries of the standard Python Data Science stack, and it also lists the files available to our notebook.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

# Data loading!

Our first step is to load and quickly explore our train dataset. To read it, we will use the Pandas **read_csv**() function. The file listing printed by the previous cell comes in handy here, since it tells us the path to pass in:

In [None]:
# Read train data
train = pd.read_csv('/kaggle/input/titanic/train.csv')

# Have a first look at train data
print('Train shape:', train.shape)

OK, this means that we have 891 rows and 12 columns. Now, let's have a first look at our training data!

In [None]:
# Have a look at first 5 data observations
train.head()

In [None]:
# Have a look at last 5 data observations
train.tail()

Aha, very interesting. We have 891 **observations**, 11 **features**, and our **target** variable, **Survived**.

But what are all those variable names? From https://www.kaggle.com/c/titanic/data, we have:

* **Survived**  = Survival (0 = No, 1 = Yes)
* **Pclass**    = Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
* **Sex**       = Sex
* **Age**       = Age in years
* **SibSp**     = # of siblings / spouses aboard the Titanic
* **Parch**     = # of parents / children aboard the Titanic
* **Ticket**    = Ticket number
* **Fare**      = Passenger fare
* **Cabin**     = Cabin number
* **Embarked**  = Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

Hmmm, a lot of info...

Our second step is to check for missing values!

In [None]:
# Have a look for possible missing values
train.isnull().sum()

This is good enough, but there is a neater way!

In [None]:
# Let's sort them in descending order!
train.isnull().sum().sort_values(ascending = False)

Cool! We see that '**Cabin**', '**Age**', and '**Embarked**' have missing values, so we can't use these variables for prediction without dealing with the missing values first. For a simple but effective first approach, see [Part 2](https://www.kaggle.com/kernelgenerator/titanic-tutorial-for-beginners-part-2).

For something a bit more sophisticated, see [Part 3](https://www.kaggle.com/kernelgenerator/titanic-tutorial-for-beginners-part-3).
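
To put the missing counts in perspective, it can help to look at the share of missing values per column instead of the raw count. Here is a minimal sketch on a tiny made-up DataFrame (the `toy` frame and its values are assumptions for illustration, not the real Titanic data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the train DataFrame (hypothetical values, not the real data)
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.92, 53.10],
})

# isnull() gives True/False per cell; the mean of booleans is the fraction of True values
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

On the real data, the same idea would be `train.isnull().mean()`. A column that is mostly empty usually needs a different treatment than one with only a couple of gaps.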


OK! Now we should load our test data. Also, let's have a first look at train and test column names.

In [None]:
# Read test data
test = pd.read_csv('/kaggle/input/titanic/test.csv')

# Have a first look at test data
print('Test shape:', test.shape)

In [None]:
# Have a look at train and test columns
print('Train columns:', train.columns.tolist())
print('Test columns:', test.columns.tolist())

It looks OK, the only additional column in train is '**Survived**', which is our target variable, i.e. the one we want to actually predict in the test dataset.
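
Instead of comparing the two printed lists by eye, the difference can also be computed with Python sets. This is just a sketch on two tiny made-up frames (`toy_train` and `toy_test` are hypothetical stand-ins with only a few of the real column names):

```python
import pandas as pd

# Toy frames with a few of the real column names (values are made up)
toy_train = pd.DataFrame({'PassengerId': [1, 2], 'Survived': [0, 1], 'Sex': ['male', 'female']})
toy_test = pd.DataFrame({'PassengerId': [892, 893], 'Sex': ['male', 'female']})

# Set difference keeps the names that appear in one frame but not in the other
only_in_train = set(toy_train.columns) - set(toy_test.columns)
only_in_test = set(toy_test.columns) - set(toy_train.columns)

print('Only in train:', only_in_train)
print('Only in test:', only_in_test)
```

With the real `train` and `test` frames in place of the toy ones, we would expect the first set to hold only the target and the second to be empty.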

# Check sample submission!

To be able to make a valid submission, a good habit is to check the sample submission file provided by Kaggle, to become familiar with the needed format:

In [None]:
# Read sample submission file
sample_submission = pd.read_csv('/kaggle/input/titanic/gender_submission.csv')

# Look at the first 5 observations of sample submission
print(sample_submission.head())

In [None]:
# Look at last 5 observations
print(sample_submission.tail())

Great, now it is time to decide what kind of model to use. We can guess by looking at the above output: the variable **'Survived'** seems to be binary... Let's confirm it! Now, there are a lot of Python plotting libraries, but we can't go wrong with **Matplotlib** and **Seaborn**!

In [None]:
# Load our plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

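# Countplot for 'Survived' variable
# (x is passed by keyword, since newer seaborn versions treat the first positional argument as data)
sns.countplot(x = 'Survived', data = train)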
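
A plot is great for a first impression, but the same check can be done with plain numbers. Below is a small sketch on a made-up column (the `toy` frame is an assumption for illustration), counting how often each value appears and how many distinct values there are:

```python
import pandas as pd

# Toy target column (hypothetical values, not the real data)
toy = pd.DataFrame({'Survived': [0, 1, 1, 0, 0]})

# How many times does each value occur?
counts = toy['Survived'].value_counts()
# How many distinct values are there? 2 would mean a binary target
n_classes = toy['Survived'].nunique()

print(counts)
print('Distinct values:', n_classes)
```

On the real data this would be `train['Survived'].value_counts()`.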

Hmmm, we have a classification problem here: **'Survived'** is binary, with 2 possible values, 0 and 1! So, the first thought would probably be to try a **Logistic Regression** model, one of the most fundamental classification algorithms! The simplest version to start with is a model with only one feature, but which one to choose? Here is a hint:

In [None]:
sns.countplot(x = 'Survived', hue = 'Sex', data = train)

Wow! **'Sex'** looks like a very strong explanatory variable, and it will be our choice for our single-feature Logistic Regression model!
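
The pattern in the plot can also be expressed as a survival rate per group. Since **'Survived'** is 0 or 1, its mean within each group is exactly the share of survivors. A minimal sketch on made-up data (the `toy` frame is hypothetical, so the numbers it prints say nothing about the real passengers):

```python
import pandas as pd

# Toy data (hypothetical values, not the real Titanic passengers)
toy = pd.DataFrame({
    'Sex': ['female', 'female', 'female', 'male', 'male', 'male', 'male'],
    'Survived': [1, 1, 0, 0, 0, 1, 0],
})

# Mean of a 0/1 column per group = survival rate per group
rate_by_sex = toy.groupby('Sex')['Survived'].mean()
print(rate_by_sex)
```

On the real data, the equivalent would be `train.groupby('Sex')['Survived'].mean()`.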

# Our first simple model!

Let's fit the model to our train set. We will use **Scikit-learn**, which is our favourite library when it comes to **Machine Learning**, since it includes most of the standard **regression**, **classification**, and **clustering** algorithms (**Deep Learning** is a different animal. If you are interested in this area, stay tuned)!

In [None]:
# Import LogisticRegression
from sklearn.linear_model import LogisticRegression

# Fit a logistic regression model to our train data, by converting 'Sex' to a dummy variable, to feed it into the model.
logisticRegression = LogisticRegression()
logisticRegression.fit(X = pd.get_dummies(train['Sex']), y = train['Survived'])
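
To see what the dummy conversion in the cell above actually produces, here is a small sketch on a made-up Series (the values are assumptions for illustration). The `dtype=int` argument is only there to print 0/1 instead of True/False:

```python
import pandas as pd

# Toy 'Sex' column (hypothetical values)
sex = pd.Series(['male', 'female', 'female'], name='Sex')

# One column per category; each row has a 1 in the column matching its value
dummies = pd.get_dummies(sex, dtype=int)
print(dummies)
```

Each category becomes its own column, so a model only ever sees numbers. With two categories, the two columns carry the same information (one is always the opposite of the other), so `pd.get_dummies(sex, drop_first=True)` would be an alternative that keeps a single column.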

Now, we can make predictions for our test set! Notice that we quickly transform the '**Sex**' values to numeric ones (dummies) by using the Pandas **get_dummies**() function, in order to bring them to a suitable format for our Logistic Regression model!

In [None]:
# Predict!
test['Survived'] = logisticRegression.predict(pd.get_dummies(test['Sex']))
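
To get a feeling for what a single-feature model like this ends up predicting, here is a sketch of the same fit-and-predict steps on tiny made-up data (`toy_train`, `toy_test`, and `model` are hypothetical names, and the values are not the real passengers). In the toy train set most females survive and most males do not:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy train set: 4 of 5 females survive, 1 of 5 males survives (made-up values)
toy_train = pd.DataFrame({
    'Sex': ['female'] * 5 + ['male'] * 5,
    'Survived': [1, 1, 1, 1, 0, 0, 0, 0, 0, 1],
})
# Toy test set: no 'Survived' column, just like the real test data
toy_test = pd.DataFrame({'Sex': ['male', 'female', 'female']})

# Same steps as in the tutorial: dummies in, fitted model out
model = LogisticRegression()
model.fit(pd.get_dummies(toy_train['Sex']), toy_train['Survived'])

# Both frames contain both sexes, so the dummy columns line up
toy_test['Survived'] = model.predict(pd.get_dummies(toy_test['Sex']))
print(toy_test)
```

With only this one feature, the model can do no more than assign one prediction to each sex. On this toy data that means predicting survival for every female and no survival for every male.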

# Our first submission!

At last! Final step: let's make our submission!

In [None]:
# Write test predictions for final submission
test[['PassengerId', 'Survived']].to_csv('kaggle_submission.csv', index = False)
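
Before uploading, it can be worth a quick sanity check that our file has the same shape as the sample submission. The sketch below uses two made-up frames (`toy_sample` and `toy_submission` are hypothetical stand-ins for the sample submission and our own predictions):

```python
import pandas as pd

# Hypothetical stand-ins for the sample submission and our own submission
toy_sample = pd.DataFrame({'PassengerId': [892, 893, 894], 'Survived': [0, 1, 0]})
toy_submission = pd.DataFrame({'PassengerId': [892, 893, 894], 'Survived': [0, 1, 1]})

# Same column names, in the same order?
same_columns = toy_submission.columns.tolist() == toy_sample.columns.tolist()
# Same passengers, in the same order?
same_ids = toy_submission['PassengerId'].equals(toy_sample['PassengerId'])
# Only 0 and 1 as predictions?
binary_only = toy_submission['Survived'].isin([0, 1]).all()

print(same_columns, same_ids, binary_only)
```

On the real data, `sample_submission` and `test[['PassengerId', 'Survived']]` would take the place of the two toy frames.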

Surprise: Our simplest model is far from being a weak one! Once submitted, it scores an **accuracy** of 0.76555 on the Kaggle leaderboard. This is actually a decent score, especially since we've used just a single feature. For more info about accuracy, see [Part 2](https://www.kaggle.com/kernelgenerator/titanic-tutorial-for-beginners-part-2).
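
As a quick reminder, accuracy is simply the share of predictions that match the true labels. A tiny worked example with made-up labels (both arrays are assumptions for illustration, since we do not know the true test labels):

```python
import numpy as np

y_true = np.array([1, 0, 0, 1, 0])  # made-up true labels
y_pred = np.array([1, 0, 1, 1, 0])  # made-up predictions

# Comparing element by element gives True/False; the mean is the share of matches
accuracy = (y_true == y_pred).mean()
print(accuracy)  # 4 correct out of 5 -> 0.8
```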

Excellent, here we conclude our intro tutorial, but this is just the start!

### If you like my notebook, I would appreciate an upvote, which will keep me motivated for additional content, thanks!