## Logistic Regression (Classification Algorithm) Exercise with Titanic data

<b>Goal</b>: Predict survival based on passenger characteristics. In the **Survived** column, 1 means the passenger survived and 0 means they did not. As this is a logistic regression exercise, use a logistic regression model to accomplish this goal.

### Load Data

`titanic.csv` is in the data folder. The data is from Kaggle's Titanic competition. Information on the data is available [here](https://www.kaggle.com/c/titanic/data).

In [1]:
# You might have to figure out what other import statements you need
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn import metrics
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
# StandardScaler is imported because the features need to be scaled before fitting
from sklearn.preprocessing import StandardScaler

# Figure out how to import the csv file 
df = pd.read_csv("titanic/train.csv")

In [2]:
df.head()

| Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


### Arrange Data into Features Matrix and Target Vector
Make at least 4 features (use at least the Age and Sex columns) for your X. Use the **Survived** series as the target. Keep in mind that one of the features (Age) contains NaN values, so you need to either remove the rows with NaNs or impute them. Sex also needs to be transformed into 1s and 0s, because strings are not an acceptable input for a model.

#### Transform Sex Column Values 

In [3]:
# One-hot encoding categorical data 
df_encoded = pd.get_dummies(df, columns=["Sex"], drop_first=True)



#### Remove or Impute missing values for the Age Column

In [None]:
# Impute missing ages with the mean age.
# Assigning the result back avoids the chained inplace fillna, which pandas warns
# about and which will stop working in pandas 3.0.
df_encoded['Age'] = df_encoded['Age'].fillna(df_encoded['Age'].mean())


**Create X and y**

In [None]:
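# Features (independent variables). One possible choice of four numeric columns;
# string columns such as Name, Ticket, and Cabin cannot be passed to the scaler or model.
X = df_encoded[["Pclass", "Age", "Sex_male", "Fare"]].astype(float)
y = df_encoded["Survived"]               # Target (dependent variable)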

### Split the data into training and testing sets
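
A minimal sketch of the split, assuming scikit-learn's `train_test_split`. The small `X` and `y` below are made-up stand-ins so the cell runs on its own; in the notebook you would pass the `X` and `y` created above. The `test_size=0.2`, `random_state=0`, and `stratify=y` settings are assumptions, not requirements of the exercise.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the Titanic features and target (10 passengers).
X = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 2, 1, 3, 2, 3],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, 27.0, 54.0, 2.0, 14.0, 4.0],
    "Sex_male": [1, 0, 0, 0, 1, 1, 1, 1, 0, 0],
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.05, 13.00, 51.86, 21.08, 30.07, 16.70],
})
y = pd.Series([0, 1, 1, 1, 0, 0, 0, 0, 1, 1], name="Survived")

# Hold out 20% of the rows for testing; stratify keeps the class ratio similar
# in the training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the split reproducible, which helps when comparing accuracy between runs.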

### Standardize Data
Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not look more or less like standard normally distributed data. You can standardize features by removing the mean and scaling to unit variance.

The standard score of a sample x is calculated as:

z = (x - mean) / std
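
As a quick check of the formula, the sketch below computes the standard score by hand with NumPy and compares it with `StandardScaler`. The `ages` values are hypothetical and only serve as an illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical ages, used only to illustrate the formula.
ages = np.array([[22.0], [38.0], [26.0], [35.0]])

# Manual z-score: subtract the mean, divide by the (population) standard deviation.
z_manual = (ages - ages.mean(axis=0)) / ages.std(axis=0)

# StandardScaler applies the same formula column by column.
z_scaler = StandardScaler().fit_transform(ages)

print(np.allclose(z_manual, z_scaler))
```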

The code below uses StandardScaler to accomplish this. 

In [6]:
# This is code you could use to standardize data.
# It assumes X_train and X_test already exist from the train/test split step;
# running it before the split raises NameError: name 'X_train' is not defined.

scaler = StandardScaler()

# Fit on training set only.
scaler.fit(X_train)

# Apply transform to both the training set and the test set.
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

### Fit a Logistic Regression (This is a classification algorithm)

Keep in mind that, despite its name, logistic regression is a classification algorithm, not a regression algorithm.

<b>Step 1:</b> Import the model you want to use

In sklearn, all machine learning models are implemented as Python classes.

<b>Step 2:</b> Make an instance of the model

<b>Step 3:</b> Train the model on the data and store the information learned from it. The model learns the relationship between features and labels.

<b>Step 4:</b> Predict the labels of new data (new passengers)

This step uses the information the model learned during training.
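
The four steps above can be sketched as follows. This is only an illustration under stated assumptions: the arrays are hypothetical, already-standardized stand-ins for the Titanic features, and `LogisticRegression` is used with its default settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # Step 1: import the model class

# Hypothetical, already-standardized features (two columns) and survival labels.
X_train = np.array([[-1.2, 0.9], [-0.8, 1.1], [-0.5, 0.7],
                    [0.4, -0.6], [0.9, -1.0], [1.2, -1.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_new = np.array([[-1.0, 1.0], [1.0, -1.0]])  # two hypothetical new passengers

clf = LogisticRegression()    # Step 2: make an instance of the model
clf.fit(X_train, y_train)     # Step 3: learn the relationship between features and labels
y_pred = clf.predict(X_new)   # Step 4: predict labels for new passengers

print(y_pred)
```

In the notebook, `X_train` and `y_train` would come from the split and standardization steps, and `X_new` would be `X_test`.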

### Make predictions on the testing set and calculate the accuracy
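
A minimal sketch of the accuracy calculation, assuming scikit-learn's `accuracy_score`. The `y_test` and `y_pred` arrays are hypothetical; in the notebook, `y_pred` would be the output of the fitted model's `predict` on the standardized `X_test`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions for eight test passengers.
y_test = np.array([0, 1, 1, 0, 0, 1, 0, 0])
y_pred = np.array([0, 1, 0, 0, 0, 1, 1, 0])

# Accuracy is the fraction of predictions that match the true labels.
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.2%}")
```

For a fitted scikit-learn classifier, calling its `score` method on the test features and labels returns the same quantity.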

### Compare your testing accuracy to the null accuracy
Null accuracy is usually considered the accuracy obtained by always predicting the most frequent class.

When interpreting the predictive power of a model, it's best to compare it to a baseline model, sometimes called a dummy model. A dummy model simply predicts the mean, median, or most common value. This forms a benchmark to compare your model against, and it becomes especially important in classification, where your null accuracy might be 95 percent.

For example, suppose your dataset is **imbalanced**: it contains 99% one class and 1% the other class. Then your baseline accuracy (always guessing the first class) would be 99%. So, if your model is less than 99% accurate, you know it is worse than the baseline. Because of this, models for imbalanced datasets generally must be trained and evaluated differently, with less of a focus on accuracy.

This particular model has an accuracy of roughly x%. By comparison, the null accuracy was 57.54%, so the model provides some value.
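
One way to compute a null accuracy is sketched below, both directly from the label frequencies and with scikit-learn's `DummyClassifier`. The labels are hypothetical, and for brevity the dummy model is fit and scored on the same labels; in the notebook you would fit it on the training set and score it on the test set.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Hypothetical test labels: 5 passengers did not survive, 3 did.
y_test = pd.Series([0, 1, 1, 0, 0, 1, 0, 0])

# Null accuracy: the share of the most frequent class.
null_accuracy = y_test.value_counts(normalize=True).max()

# The same baseline via DummyClassifier. This strategy ignores the features,
# so a placeholder column is enough.
X_placeholder = [[0]] * len(y_test)
dummy = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_test)
dummy_accuracy = dummy.score(X_placeholder, y_test)

print(null_accuracy, dummy_accuracy)
```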

### Confusion matrix of Titanic predictions

A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known. Hint: consider searching online for how to build one if you don't already know.
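
A minimal sketch, assuming scikit-learn's `confusion_matrix`. The labels and predictions are hypothetical; in the notebook you would pass the real `y_test` and the model's predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions for eight test passengers.
y_test = np.array([0, 1, 1, 0, 0, 1, 0, 0])
y_pred = np.array([0, 1, 0, 0, 0, 1, 1, 0])

# Rows are true classes, columns are predicted classes (0 first, then 1).
cm = confusion_matrix(y_test, y_pred)

# For a binary problem, ravel() unpacks the four cells in this order.
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

The diagonal holds the correct predictions, so accuracy equals `(tn + tp)` divided by the total number of test samples.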