# Overview
**Hi, welcome. <br>
In this notebook I give a gentle description of how a *linear regression model* works mathematically when applied to the Titanic classification problem.**

## Preparing Dataset

1. Read dataset
2. Clean dataset
3. Label encoding 

## Model Creation
1. Linear regression 
2. Single prediction 
3. Mathematics behind code explained

## Imports

[pandas](https://pandas.pydata.org/): used for data manipulation and analysis. <br>
[sklearn](https://scikit-learn.org/stable/): used to build machine learning models. 

In [1]:
import pandas as pd
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression, LinearRegression

## Preparing Dataset

### Reading Dataset

In [2]:
train_set = pd.read_csv("train.csv")
test_set  = pd.read_csv("test.csv")
test_ids  = test_set["PassengerId"]

In [3]:
# train_set overview
# there are 891 rows × 12 columns
train_set

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [4]:
# There are 418 rows × 11 columns.
# There are not 12 columns as in train_set because the Survived column is exactly what we
# intend to predict, so the test set does not contain it.
test_set

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0000,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
...,...,...,...,...,...,...,...,...,...,...,...
413,1305,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S
414,1306,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9000,C105,C
415,1307,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S
416,1308,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S


## Useful Functions to Clean Dataset
**As the output below (train_set.columns) shows, our dataset has some features that do not mean much for this problem, so let's remove them.**

In [5]:
print(*train_set.columns, sep=", " )

PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked


In [6]:
def find_nan_values(dataset) -> list:
    """
    Returns a list of all columns that contain NaN values.
    
    Parameters: 
        dataset (pandas.core.frame.DataFrame): dataframe to be searched for NaN values
    
    Returns: 
        a list (str) of column names
    """ 
    return dataset.columns[dataset.isnull().any()].tolist()

In [7]:
def drop_columns(dataset, cols):
    """
    Returns dataset without cols.
    
    Parameters:
        dataset (pandas.core.frame.DataFrame): dataframe from which the column(s) are removed
        cols (list): names of the columns to remove
   
    Returns:
        the dataframe without cols
    """
    return dataset.drop(cols, axis=1)

In [8]:
def clean(dataset, cols):
    """
    Returns dataset without the given columns, with NaN values replaced by the median of their column.

    Parameters:
        dataset (pandas.core.frame.DataFrame): dataframe to be cleaned (columns removed, NaN values filled).
        cols (list): list of dataset's columns to be removed.
    
    Returns:
        dataset_cleaned: the cleaned dataframe

    """
    dataset_cleaned = drop_columns(dataset, cols)
    nan_values = find_nan_values(dataset_cleaned)
    for col in nan_values:
        # assign the filled column back instead of calling fillna(inplace=True) on a column selection
        dataset_cleaned[col] = dataset_cleaned[col].fillna(dataset_cleaned[col].median())
    return dataset_cleaned

In [9]:
cols_to_remove = ["PassengerId",
                  "Name",
                  "SibSp", 
                  "Parch", 
                  "Ticket", 
                  "Fare", 
                  "Cabin", 
                  "Embarked"]

In [10]:
train_set = clean(train_set, cols_to_remove)
test_set  = clean(test_set, cols_to_remove)
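
To make the imputation step inside `clean` concrete, here is a minimal sketch on a made-up four-row frame (the values are illustrative and not taken from the Titanic files). It assumes only that a missing age is filled with the median of the ages that are present, which is what the `fillna` call with `median()` does.

```python
import pandas as pd

# Toy data (hypothetical values): one passenger has no recorded age.
toy = pd.DataFrame({"Age": [22.0, None, 38.0, 26.0]})

# median() skips NaN by default, so it is computed from 22, 26 and 38.
age_median = toy["Age"].median()

# Assign the filled column back to the frame.
toy["Age"] = toy["Age"].fillna(age_median)

print(age_median)           # 26.0
print(toy["Age"].tolist())  # [22.0, 26.0, 38.0, 26.0]
```

The median is less sensitive to a few very old or very young passengers than the mean, which is one common reason to prefer it for a column like Age.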

**After removal, the attributes of interest (column names) are:**
* Survived
* Pclass
* Sex
* Age

In [11]:
# Note that only the Survived, Pclass, Sex and Age columns remain.
train_set

Unnamed: 0,Survived,Pclass,Sex,Age
0,0,3,male,22.0
1,1,1,female,38.0
2,1,3,female,26.0
3,1,1,female,35.0
4,0,3,male,35.0
...,...,...,...,...
886,0,2,male,27.0
887,1,1,female,19.0
888,0,3,female,28.0
889,1,1,male,26.0


**Linear regression requires all variables to be quantitative (continuous or discrete). However, of our 4 selected variables, "Sex" is a nominal categorical variable (male or female). The code below shows how to convert it to a numeric representation with a label encoder. It is still a category, but now male is mapped to 1 and female to 0.** <br>

Consider checking out this documentation to understand why I chose the following methods: [sklearn.preprocessing.LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)
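
As a small illustration of the mapping, the sketch below runs `LabelEncoder` on a few made-up values standing in for the "Sex" column (the lists are hypothetical). It shows that the encoder sorts the classes alphabetically, which is why female becomes 0 and male becomes 1, and that a fitted encoder can be reused on new data with `transform`.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample values, standing in for the "Sex" column.
train_sex = ["male", "female", "female", "male"]
test_sex = ["female", "male"]

encoder = LabelEncoder()

# fit_transform learns the classes (sorted alphabetically) and encodes them.
train_encoded = encoder.fit_transform(train_sex)

# transform reuses the mapping learned from the training values.
test_encoded = encoder.transform(test_sex)

print(encoder.classes_.tolist())  # ['female', 'male']
print(train_encoded.tolist())     # [1, 0, 0, 1]
print(test_encoded.tolist())      # [0, 1]
```

Reusing one fitted encoder keeps the numeric codes identical across the training and test data.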

In [12]:
le = preprocessing.LabelEncoder()

var_to_transform = "Sex"
# fit the encoder on the training data only, then reuse the same mapping for the test data
train_set[var_to_transform] = le.fit_transform(train_set[var_to_transform])
test_set[var_to_transform] = le.transform(test_set[var_to_transform])

In [13]:
# observe that the Sex column of train_set now contains 1 and 0
# the "male" and "female" strings no longer exist
train_set.head()

Unnamed: 0,Survived,Pclass,Sex,Age
0,0,3,1,22.0
1,1,1,0,38.0
2,1,3,0,26.0
3,1,1,0,35.0
4,0,3,1,35.0


## Model Creation

**Linear regression**

In [14]:
linear_reg = LinearRegression()
linear_reg.fit(train_set[["Pclass", "Age", "Sex"]], train_set.Survived)

LinearRegression()

Once the model is trained, let's make a **single prediction:**

In [15]:
Pclass = 1; Age = 27; Sex = 1

print("Probability to survive: {:.2f}%".format(linear_reg.predict([[Pclass, Age, Sex]])[0]*100))

Probability to survive: 46.28%


**Mathematics behind code explained**


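$X$: feature matrix built from train\_set, with one column per passenger <br>
$X \in \mathbb{R}^{n_x \times m}$ <br>
$Y \in \mathbb{R}^{1 \times m}, \ \ \ \ Y=[y^{(1)},y^{(2)},\dots, y^{(m)}]$ <br>
$W \in \mathbb{R}^{n_x \times 1}$ <br>
train\_set.shape = (891, 4): three feature columns plus Survived, so $n_x = 3$, $m = 891$ and $X$ has shape (3, 891) <br>
Y.shape = (1, m) <br>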

**Definitions:** 
<br>\#things = the number of things <br>
$n_x$ = #features <br>
$m$ = #examples <br>

Recall that this is a binary classification problem (a person $p$ aboard the Titanic either survived or did not). <br>
Based on three variables of interest, $x_1, x_2$ and $x_3$ (Pclass, Age and Sex), <br>
the model outputs a score that we read as the probability that this person survived. <br>
Therefore, a linear regression function that predicts this behaviour is given by:<br>


\begin{equation}
 \hat{y}_1 = w_1x_1 + w_2x_2 + w_3x_3 + \beta \ \ \ (1)
\end{equation}



and to predict the entire train_set: <br>
\begin{equation}
\hat{Y}=W^TX + \beta  \ \ \ (2)
\end{equation}




**After training the model, we can inspect its weights with:**

In [16]:
print("weights  : {} \nintercept: {}".format(linear_reg.coef_,linear_reg.intercept_))

weights  : [-0.1857794  -0.00499336 -0.49922946] 
intercept: 1.2826439065893616


**Let's use equation (1) and the model's weights to predict the following cases:** <br>

Two people, $p_1$ and $p_2$, where <br>
$p_1$: Pclass = 3, Age = 27 and Sex = male <br>
$p_2$: Pclass = 1, Age = 27 and Sex = male <br>

<br>
$p_1 \rightarrow y_1 \approx -0.1857 \cdot 3 -0.0049 \cdot 27 -0.4992 \cdot 1 + 1.2826 = 0.094 \approx$ 9.4% <br>
$p_2 \rightarrow y_2 \approx -0.1857 \cdot 1 -0.0049 \cdot 27 -0.4992 \cdot 1 + 1.2826 = 0.4654 \approx$ 46% <br>
The weights are truncated to four decimals here, so these values differ slightly from results computed with the full-precision weights. <br>

In [17]:
p1 = {"Pclass": 3, "Age":27, "Sex":1}
p2 = {"Pclass": 1, "Age":27, "Sex":1}

**Getting model's parameters (coefficients and intercept)**

In [18]:
w1, w2, w3 =  linear_reg.coef_  # same order as the training columns: Pclass, Age, Sex
B          =  linear_reg.intercept_

**Computing $p_1$ and $p_2$ probability to survive**

In [19]:
y1 = (w1 * p1["Pclass"]) + (w2 * p1["Age"]) + (w3 * p1["Sex"]) + B
y2 = (w1 * p2["Pclass"]) + (w2 * p2["Age"]) + (w3 * p2["Sex"]) + B

print("p1 => y1 ≃ {:.2f}% of chance to survive".format(y1 * 100))
print("p2 => y2 ≃ {:.2f}% of chance to survive".format(y2 * 100))

p1 => y1 ≃ 9.13% of chance to survive
p2 => y2 ≃ 46.28% of chance to survive
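
Equation (2) does the same computation for several passengers at once. The sketch below is a minimal NumPy version, assuming the weights and intercept printed by the model earlier and the column-per-passenger layout of $X$ used in the definitions; the variable names are only illustrative.

```python
import numpy as np

# Weights and intercept copied from the printed model parameters
# (order: Pclass, Age, Sex).
W = np.array([[-0.1857794], [-0.00499336], [-0.49922946]])  # shape (n_x, 1)
beta = 1.2826439065893616

# One column per passenger, one row per feature: shape (n_x, m) with m = 2.
X = np.array([
    [3.0, 1.0],    # Pclass of p1, p2
    [27.0, 27.0],  # Age of p1, p2
    [1.0, 1.0],    # Sex of p1, p2 (1 = male)
])

# Equation (2): the intercept is broadcast over all passengers.
Y_hat = W.T @ X + beta

print(Y_hat.shape)         # (1, 2)
print(np.round(Y_hat, 4))  # approximately 0.0913 and 0.4628
```

Note that scikit-learn itself expects the transposed layout, one row per passenger, so this layout only mirrors the notation of equation (2).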


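**We solved this problem with linear regression, but logistic regression is a better fit because this is a binary classification (0 or 1). Linear regression has no predefined output range, so its predictions can fall below 0 or above 1, while logistic regression always outputs a value between 0 and 1.**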
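
To see the range problem concretely, the sketch below plugs two hypothetical passengers into equation (1) using the weights printed earlier. It also shows the sigmoid (logistic) function, which logistic regression applies to its own linear score. Applying it here to the linear-regression score is only meant to illustrate the squashing into (0, 1), since a real logistic model learns different weights.

```python
import math

# Weights and intercept copied from the linear model printed earlier
# (order: Pclass, Age, Sex).
W1, W2, W3 = -0.1857794, -0.00499336, -0.49922946
BETA = 1.2826439065893616


def linear_score(pclass, age, sex):
    """Equation (1): a weighted sum with no bound on its value."""
    return W1 * pclass + W2 * age + W3 * sex + BETA


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical passengers chosen to push the score out of range.
high = linear_score(pclass=1, age=2, sex=0)   # first-class girl, age 2
low = linear_score(pclass=3, age=70, sex=1)   # third-class man, age 70

print(high > 1.0, low < 0.0)                                # True True
print(0.0 < sigmoid(high) < 1.0, 0.0 < sigmoid(low) < 1.0)  # True True
```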

[LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) <br>
[train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

In [20]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

y = train_set["Survived"]
X = train_set.drop("Survived", axis=1)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

In [21]:
classifier = LogisticRegression(random_state=0, max_iter=1000).fit(X_train, y_train)

In [22]:
from sklearn.metrics import accuracy_score

predictions = classifier.predict(X_val)
accuracy_score(y_val, predictions)

0.8100558659217877

In [23]:
submission_preds = classifier.predict(test_set)

In [24]:
df = pd.DataFrame({"PassengerId":test_ids.values,
                   "Survived": submission_preds,
                  })

In [26]:
df.to_csv("submission.csv", index=False)