<a href="https://colab.research.google.com/github/Ashutosh-Singh2209/Logistic-Regression-on-Titanic-Dataset/blob/main/Logistic_Regression_on_Titanic_Dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Logistic Regression on Titanic Dataset

I have been given a training CSV file that contains both the features (X train) and the labels (Y train), and a test CSV file that contains only features (X test). My task is to train a Logistic Regression model and produce predictions for the test data.

To accomplish this, I will follow these steps:

1. Import the necessary libraries and load the dataset with pandas.
2. Perform data preprocessing, if required, for the training dataset.
3. Fit the Logistic Regression model to the training data.
4. Use the trained model to make predictions on the X test data.
5. Generate a CSV file with the predicted results for the X test data, with only one column and no headers.

### Variable Descriptions

1. **Pclass:** Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
2. **Survived:** Survival (0 = No; 1 = Yes)
3. **Name:** Name
4. **Sex:** Sex
5. **Age:** Age
6. **SibSp:** Number of Siblings/Spouses Aboard
7. **Parch:** Number of Parents/Children Aboard
8. **Ticket:** Ticket Number
9. **Fare:** Passenger Fare (British pound)
10. **Cabin:** Cabin
11. **Embarked:** Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

### Import Libraries and Read CSV Files

In the next cell, I will import the `pandas` and `numpy` libraries and read the provided CSV files.


In [14]:
import numpy as np

In [15]:
import pandas as pd
data = pd.read_csv('train.csv', delimiter=',')
data.head()

Unnamed: 0,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Survived
0,2,"Weisz, Mrs. Leopold (Mathilde Francoise Pede)",female,29.0,1,0,228414,26.0,,S,1
1,3,"Williams, Mr. Howard Hugh ""Harry""",male,,0,0,A/5 2466,8.05,,S,0
2,2,"Morley, Mr. Henry Samuel (""Mr Henry Marshall"")",male,39.0,0,0,250655,26.0,,S,0
3,3,"Palsson, Mrs. Nils (Alma Cornelia Berglund)",female,29.0,0,4,349909,21.075,,S,0
4,3,"Sutehall, Mr. Henry Jr",male,25.0,0,0,SOTON/OQ 392076,7.05,,S,0


In [16]:
X = pd.read_csv('test.csv', delimiter=',')
X.head()

Unnamed: 0,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,2,"Davies, Master. John Morgan Jr",male,8.0,1,1,C.A. 33112,36.75,,S
1,1,"Leader, Dr. Alice (Farnham)",female,49.0,0,0,17465,25.9292,D17,S
2,3,"Kilgannon, Mr. Thomas J",male,,0,0,36865,7.7375,,Q
3,2,"Jacobsohn, Mrs. Sidney Samuel (Amy Frances Chr...",female,24.0,2,1,243847,27.0,,S
4,1,"McGough, Mr. James Robert",male,36.0,0,0,PC 17473,26.2875,E25,S


Next, I will display the shape of the datasets and identify the number of missing values in each column.


In [17]:
data.shape, X.shape

((668, 11), (223, 10))

In [18]:
data.isna().sum()

Pclass        0
Name          0
Sex           0
Age         132
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       514
Embarked      1
Survived      0
dtype: int64

In [19]:
X.isna().sum()

Pclass        0
Name          0
Sex           0
Age          45
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       173
Embarked      1
dtype: int64
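
Because the two files have different sizes, the raw counts above are easier to compare as fractions (for the training data, Cabin is missing in 514 of 668 rows, about 77%, and Age in 132 of 668, about 20%). The following is a minimal sketch of that calculation on a toy DataFrame; the `demo` frame and its values are illustrative assumptions, not the actual Titanic data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are illustrative only)
demo = pd.DataFrame({
    "Age": [29.0, np.nan, 39.0, np.nan],
    "Cabin": [np.nan, np.nan, np.nan, "D17"],
    "Fare": [26.0, 8.05, 26.0, 21.075],
})

# isna() marks missing cells; mean() over booleans gives the missing fraction per column
missing_fraction = demo.isna().mean().sort_values(ascending=False)
print(missing_fraction)
```

The same expression could be applied to `data` and `X` to see the share of missing values per column.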

### Analyzing the "Ticket" Column



In [20]:
set(data['Ticket'])

{'110152',
 '110413',
 '110465',
 '110564',
 '111240',
 '111361',
 '111369',
 '111426',
 '111427',
 '111428',
 '112050',
 '112052',
 '112058',
 '112277',
 '113028',
 '113043',
 '113051',
 '113055',
 '113056',
 '113059',
 '113501',
 '113503',
 '113505',
 '113509',
 '113572',
 '113760',
 '113767',
 '113773',
 '113776',
 '113781',
 '113784',
 '113786',
 '113788',
 '113789',
 '113794',
 '113798',
 '113800',
 '113803',
 '113804',
 '113806',
 '11668',
 '11751',
 '11752',
 '11753',
 '11755',
 '11765',
 '11769',
 '11771',
 '11774',
 '11813',
 '11967',
 '12233',
 '12460',
 '12749',
 '13049',
 '13502',
 '13507',
 '13509',
 '13568',
 '14313',
 '14973',
 '1601',
 '16966',
 '17421',
 '17453',
 '17463',
 '17464',
 '17466',
 '17474',
 '19877',
 '19928',
 '19943',
 '19950',
 '19952',
 '19972',
 '19988',
 '19996',
 '211536',
 '218629',
 '220367',
 '220845',
 '2223',
 '223596',
 '226875',
 '228414',
 '229236',
 '230080',
 '230136',
 '230433',
 '230434',
 '231919',
 '231945',
 '233639',
 '233866',
 '2343

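I plan to use the ticket number as a numeric feature. To do that, I first keep only the last whitespace-separated token of each ticket (dropping prefixes such as `A/5` or `SOTON/OQ`), then check which values are still non-numeric, and finally convert everything to integers.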


In [21]:
data['Ticket'] = data['Ticket'].apply(lambda x : x.split()[-1])
X['Ticket'] = X['Ticket'].apply(lambda x : x.split()[-1])

In [22]:
data['Ticket'][ data['Ticket'].apply( lambda x : not (x.isdigit()) ) ]

117    LINE
301    LINE
526    LINE
Name: Ticket, dtype: object

In [23]:
def ticket_to_int(s):
    # Purely numeric tickets become integers; anything else (e.g. 'LINE') maps to 0
    return int(s) if s.isdigit() else 0

data['Ticket'] = data['Ticket'].apply(ticket_to_int)
X['Ticket'] = X['Ticket'].apply(ticket_to_int)
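
The two ticket-cleaning cells can also be written with vectorized pandas string methods. This is only a sketch of an alternative, run here on a hypothetical `tickets` Series built from a few values seen in the data; it relies on `pd.to_numeric(..., errors="coerce")`, which may treat some unusual strings differently from `str.isdigit()`.

```python
import pandas as pd

# A few ticket strings of the kinds that appear in the dataset
tickets = pd.Series(["228414", "A/5 2466", "SOTON/OQ 392076", "LINE"])

# Keep the last whitespace-separated token, e.g. "A/5 2466" -> "2466"
last_token = tickets.str.split().str[-1]

# Non-numeric tokens such as "LINE" become NaN, then 0
ticket_num = pd.to_numeric(last_token, errors="coerce").fillna(0).astype(int)
print(ticket_num.tolist())
```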

I will now separate the Y values from the dataset.


In [24]:
# Separate the target column (Survived) from the training DataFrame
Y_data = data['Survived'].values
Y_data

array([1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1,
       0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1,
       0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0,
       1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
       1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
       1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1,

In [25]:
data.drop('Survived', axis=1, inplace=True)
data.head()

Unnamed: 0,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,2,"Weisz, Mrs. Leopold (Mathilde Francoise Pede)",female,29.0,1,0,228414,26.0,,S
1,3,"Williams, Mr. Howard Hugh ""Harry""",male,,0,0,2466,8.05,,S
2,2,"Morley, Mr. Henry Samuel (""Mr Henry Marshall"")",male,39.0,0,0,250655,26.0,,S
3,3,"Palsson, Mrs. Nils (Alma Cornelia Berglund)",female,29.0,0,4,349909,21.075,,S
4,3,"Sutehall, Mr. Henry Jr",male,25.0,0,0,392076,7.05,,S


Next, I will proceed to handle the missing values as part of the data cleaning process.


In [26]:
# Impute Age with the median of the training data
data['Age'] = data['Age'].fillna(data['Age'].median())

# Impute Embarked with the mode of the training data
data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])

# Impute Age and Embarked for X using X's own median and mode
X['Age'] = X['Age'].fillna(X['Age'].median())
X['Embarked'] = X['Embarked'].fillna(X['Embarked'].mode()[0])
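
The cell above fills the gaps in `X` with statistics computed on `X` itself. A common alternative is to learn the fill value on the training data only and reuse it for the test data, so both sets are treated identically. The sketch below shows that idea with scikit-learn's `SimpleImputer` on toy frames; the `train` and `test` frames are assumptions for illustration, and this is not what the notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-ins for the training and test Age columns
train = pd.DataFrame({"Age": [22.0, np.nan, 30.0, 40.0]})
test = pd.DataFrame({"Age": [np.nan, 10.0]})

# Learn the median from the training data only
imputer = SimpleImputer(strategy="median")
train_age = imputer.fit_transform(train[["Age"]])

# Reuse the training median (30.0 here) for the test data
test_age = imputer.transform(test[["Age"]])
print(imputer.statistics_, test_age.ravel())
```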

In [27]:
# now check the missing values again
print(data.isna().sum())
print()
print(X.isna().sum())

Pclass        0
Name          0
Sex           0
Age           0
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       514
Embarked      0
dtype: int64

Pclass        0
Name          0
Sex           0
Age           0
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       173
Embarked      0
dtype: int64


The Cabin column has too many missing values in both the training data (514 of 668 rows) and the test data (173 of 223 rows), so I drop it from both.

In [28]:
data.drop('Cabin', axis=1, inplace=True)
X.drop('Cabin', axis=1, inplace=True)

In [29]:
# now check the missing values again
print(data.isna().sum())
print()
print(X.isna().sum())

Pclass      0
Name        0
Sex         0
Age         0
SibSp       0
Parch       0
Ticket      0
Fare        0
Embarked    0
dtype: int64

Pclass      0
Name        0
Sex         0
Age         0
SibSp       0
Parch       0
Ticket      0
Fare        0
Embarked    0
dtype: int64


Next, the non-numeric columns need to be label-encoded.

In [30]:
data.head()

Unnamed: 0,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,2,"Weisz, Mrs. Leopold (Mathilde Francoise Pede)",female,29.0,1,0,228414,26.0,S
1,3,"Williams, Mr. Howard Hugh ""Harry""",male,29.0,0,0,2466,8.05,S
2,2,"Morley, Mr. Henry Samuel (""Mr Henry Marshall"")",male,39.0,0,0,250655,26.0,S
3,3,"Palsson, Mrs. Nils (Alma Cornelia Berglund)",female,29.0,0,4,349909,21.075,S
4,3,"Sutehall, Mr. Henry Jr",male,25.0,0,0,392076,7.05,S


To encode both datasets consistently, I combine them before fitting the encoders. First, though, I drop the Name column, since I do not use it as a feature for the Logistic Regression model.

In [31]:
data.drop('Name', axis=1, inplace=True)
X.drop('Name', axis=1, inplace=True)

In [32]:
X.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,2,male,8.0,1,1,33112,36.75,S
1,1,female,49.0,0,0,17465,25.9292,S
2,3,male,27.0,0,0,36865,7.7375,Q
3,2,female,24.0,2,1,243847,27.0,S
4,1,male,36.0,0,0,17473,26.2875,S


Combining the two datasets.

In [33]:
from sklearn.preprocessing import LabelEncoder

# Combine training and testing data for encoding
combined_df = pd.concat([data, X], ignore_index=True)

# Initialize LabelEncoders for the 'Sex' and 'Embarked' columns ('Ticket' is already numeric)
label_encoder_sex = LabelEncoder()
label_encoder_embarked = LabelEncoder()
# label_encoder_ticket = LabelEncoder()

# Fit and transform categorical columns
combined_df['Sex_Encoded'] = label_encoder_sex.fit_transform(combined_df['Sex'])
combined_df['Embarked_Encoded'] = label_encoder_embarked.fit_transform(combined_df['Embarked'])
# combined_df['Ticket_Encoded'] = label_encoder_ticket.fit_transform(combined_df['Ticket'])

# Split back into training and testing data (copies, so later in-place edits are safe)
data_encoded = combined_df.iloc[:len(data)].copy()
X_encoded = combined_df.iloc[len(data):].copy()

# Use data_encoded and X_encoded for further analysis
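
Label encoding turns Embarked into the integers 0, 1 and 2, which a linear model treats as an ordered quantity even though the ports have no natural order. One-hot encoding is a common alternative for such nominal columns. The sketch below uses `pd.get_dummies` on a toy frame; it is an illustration of that alternative under assumed data, not the approach used in this notebook.

```python
import pandas as pd

# Toy column with the three embarkation ports
demo = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One indicator column per category; dtype=int gives 0/1 instead of booleans
one_hot = pd.get_dummies(demo, columns=["Embarked"], dtype=int)
print(one_hot)
```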


In [34]:
data_encoded.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,Sex_Encoded,Embarked_Encoded
0,2,female,29.0,1,0,228414,26.0,S,0,2
1,3,male,29.0,0,0,2466,8.05,S,1,2
2,2,male,39.0,0,0,250655,26.0,S,1,2
3,3,female,29.0,0,4,349909,21.075,S,0,2
4,3,male,25.0,0,0,392076,7.05,S,1,2
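
The codes in `Sex_Encoded` and `Embarked_Encoded` follow from `LabelEncoder` numbering the sorted class labels. The sketch below rebuilds that mapping on toy values; `label_mapping` is a hypothetical helper introduced only for this illustration.

```python
from sklearn.preprocessing import LabelEncoder

def label_mapping(values):
    """Return the {label: integer code} mapping a LabelEncoder learns from values."""
    encoder = LabelEncoder().fit(values)
    # classes_ holds the sorted unique labels; a label's position is its code
    return {str(label): int(code) for code, label in enumerate(encoder.classes_)}

sex_mapping = label_mapping(["female", "male", "male"])
embarked_mapping = label_mapping(["S", "C", "Q", "S"])
print(sex_mapping, embarked_mapping)
```

On the fitted encoders from the cell above, the same information is available through `label_encoder_sex.classes_` and `label_encoder_embarked.classes_`.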


In [35]:
X_encoded.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,Sex_Encoded,Embarked_Encoded
668,2,male,8.0,1,1,33112,36.75,S,1,2
669,1,female,49.0,0,0,17465,25.9292,S,0,2
670,3,male,27.0,0,0,36865,7.7375,Q,1,1
671,2,female,24.0,2,1,243847,27.0,S,0,2
672,1,male,36.0,0,0,17473,26.2875,S,1,2


In [36]:
X_encoded.reset_index(drop=True, inplace=True)
X_encoded.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,Sex_Encoded,Embarked_Encoded
0,2,male,8.0,1,1,33112,36.75,S,1,2
1,1,female,49.0,0,0,17465,25.9292,S,0,2
2,3,male,27.0,0,0,36865,7.7375,Q,1,1
3,2,female,24.0,2,1,243847,27.0,S,0,2
4,1,male,36.0,0,0,17473,26.2875,S,1,2


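Next, I build the feature arrays: `X_data` from the training data and `X_` from the provided test data. The `Ticket` column is not among the selected columns.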

In [37]:
# Define the columns you want to include
selected_columns = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Sex_Encoded', 'Embarked_Encoded']

# Extract values of selected columns for training data
X_data = data_encoded[selected_columns].values

# Extract values of selected columns for testing data
X_ = X_encoded[selected_columns].values

In [38]:
X_

array([[  2.    ,   8.    ,   1.    , ...,  36.75  ,   1.    ,   2.    ],
       [  1.    ,  49.    ,   0.    , ...,  25.9292,   0.    ,   2.    ],
       [  3.    ,  27.    ,   0.    , ...,   7.7375,   1.    ,   1.    ],
       ...,
       [  1.    ,  17.    ,   1.    , ..., 108.9   ,   0.    ,   0.    ],
       [  3.    ,  43.    ,   0.    , ...,   6.45  ,   1.    ,   2.    ],
       [  2.    ,  36.5   ,   0.    , ...,  26.    ,   1.    ,   2.    ]])

I will now split the `X_data` and `Y_data` into `X_train`, `X_test`, `Y_train`, and `Y_test`.


In [39]:
from sklearn.model_selection import train_test_split

# Splitting the data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X_data, Y_data, test_size=0.3, random_state=42)

# Print the shapes of the resulting arrays
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("Y_train shape:", Y_train.shape)
print("Y_test shape:", Y_test.shape)


X_train shape: (467, 7)
X_test shape: (201, 7)
Y_train shape: (467,)
Y_test shape: (201,)
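
A plain random split does not guarantee that survivors and non-survivors appear in the same proportion in both parts. If that matters, `train_test_split` accepts a `stratify` argument. The sketch below shows its effect on synthetic labels; `X_demo` and `y_demo` are made-up arrays, and the notebook itself uses the unstratified split shown above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 30% positive class
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 70 + [1] * 30)

# stratify=y_demo keeps the 70/30 class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)
print(y_tr.mean(), y_te.mean())
```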


Now I train a `LogisticRegression` model on the training split and evaluate it on both splits.

In [40]:
from sklearn.linear_model import LogisticRegression

# Initialize the Logistic Regression model
logreg_model = LogisticRegression(C=1, max_iter=1000, tol=0.0001, penalty='l2', random_state=42)

# Fit the model on the training data
logreg_model.fit(X_train, Y_train)

# Evaluate the model
# Calculate accuracy using the score method
train_accuracy = logreg_model.score(X_train, Y_train)
test_accuracy = logreg_model.score(X_test, Y_test)
print("Training Accuracy:", train_accuracy)
print("Testing Accuracy:", test_accuracy)

Training Accuracy: 0.7944325481798715
Testing Accuracy: 0.7910447761194029
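
To make the link between a fitted model and its predictions concrete, the sketch below recomputes the predicted probabilities by hand: a linear score from `coef_` and `intercept_`, passed through the sigmoid function. It runs on synthetic data from `make_classification`, not on the Titanic features, and the variable names are chosen only for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification problem
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Linear score z = w . x + b for every sample
z = X_demo @ model.coef_.ravel() + model.intercept_[0]

# Sigmoid turns the score into P(y = 1 | x)
p_manual = 1.0 / (1.0 + np.exp(-z))
p_sklearn = model.predict_proba(X_demo)[:, 1]

# Thresholding the probability at 0.5 gives the predicted class
labels_manual = (p_manual >= 0.5).astype(int)
print(np.allclose(p_manual, p_sklearn))
```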


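The model reaches about 79% accuracy on both the training split (0.794) and the held-out split (0.791). The two scores are close, which suggests the model is not overfitting much.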
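
A single 70/30 split gives only one accuracy estimate. As a hedged sketch of a more robust check, the block below runs 5-fold cross-validation on a pipeline that standardizes the features before Logistic Regression. It uses synthetic data from `make_classification`; the pipeline and its settings are assumptions for illustration and were not part of the original run.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 7 features, like the selected Titanic columns
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=42)

# Scaling happens inside each fold, so no information leaks from the validation part
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# One accuracy value per fold
scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```

With the notebook's arrays, `X_data` and `Y_data` could be passed in place of the synthetic data.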


I will now predict the values for the given `test.csv` and save the results in a `.csv` file.

In [41]:
# Predict on the testing data
Y = logreg_model.predict(X_)

In [42]:
# Save the predictions in Y as integers (one column, no header)
np.savetxt("Y_pred.csv", Y, delimiter=",", fmt="%d")
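
The task asks for a file with a single column and no header. The sketch below shows one way to produce and check that layout with pandas, writing to an in-memory buffer instead of a real file; `preds` is a made-up array standing in for the model output.

```python
import io

import numpy as np
import pandas as pd

# Stand-in for the predicted labels
preds = np.array([1, 0, 1])

# index=False and header=False leave exactly one value per line
buffer = io.StringIO()
pd.Series(preds).to_csv(buffer, index=False, header=False)

lines = buffer.getvalue().splitlines()
print(lines)
```

Replacing `buffer` with a file name such as `"Y_pred.csv"` would write the same content to disk.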