# 🚢 Titanic Survival Prediction - Machine Learning Project

**Author:** Nidhi Sanni  
**Dataset:** [Kaggle Titanic: Machine Learning from Disaster](https://www.kaggle.com/competitions/titanic)

## 📌 Objective:
To build a predictive model that determines whether a passenger survived the Titanic shipwreck, using machine learning algorithms and data preprocessing techniques.

## 🔍 Key Steps:
- Data Cleaning (handling missing values)
- Feature Engineering (`Title`, `FamilySize`)
- Encoding categorical variables
- Model Training using Random Forest Classifier
- Validation & accuracy check
- Creating the Kaggle submission

## 🧠 Tools & Libraries:
- Python 🐍
- Pandas, NumPy
- Scikit-learn
- Jupyter Notebook

## 🎯 Accuracy:
Achieved a validation accuracy of **0.8324** on a 20% hold-out split.

## 📂 File:
`tita_poject_nidhi.ipynb`

---

## ✅ Next Steps:
- Try other models like Logistic Regression, SVM, XGBoost
- Tune hyperparameters
- Plot feature importance for better insights
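
Two of these next steps, hyperparameter tuning and feature importance, can be sketched as follows. This is a minimal illustration, not part of the notebook: it runs on synthetic data from `make_classification` (an assumption, standing in for the encoded Titanic features), and the grid values are placeholders rather than recommendations.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption): swap in the encoded Titanic X and y.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
X_demo = pd.DataFrame(X_demo, columns=[f"feature_{i}" for i in range(6)])

# Small illustrative grid; the values are placeholders, not recommendations.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.4f}")

# Rank features by the impurity-based importance of the refit best model.
importances = pd.Series(
    search.best_estimator_.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```

The sorted `importances` Series can be passed to `importances.plot.barh()` if matplotlib is available.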


Step 1: Load the Data

In [6]:
import pandas as pd
test= pd.read_csv('test.csv')
train= pd.read_csv('train (1).csv')
print('training data: ')
print(train.head())
print('\n training data info')
print(train.info())

training data: 
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   N

Step 2: Data Cleaning (Handle Missing Values)


In [20]:
# Clean missing values safely
train['Age'] = train['Age'].fillna(train['Age'].median())
train['Embarked'] = train['Embarked'].fillna(train['Embarked'].mode()[0])

# Drop Cabin column if it exists
train.drop('Cabin', axis=1, inplace=True, errors='ignore')

# Check again for missing values
print(train.isnull().sum())


PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64
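
To make the imputation logic concrete, here is a small sketch on a toy frame (the five rows are made up for illustration, not taken from the dataset). It fills numeric gaps with the median and categorical gaps with the mode, and assigns the result back to the column, which avoids the chained-assignment warnings that recent pandas versions can raise for `inplace=True` on a selected column.

```python
import numpy as np
import pandas as pd

# Toy frame (assumption) mimicking the Age and Embarked columns.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Embarked": ["S", "C", None, "S", "S"],
})

age_median = toy["Age"].median()            # median of the non-missing ages
embarked_mode = toy["Embarked"].mode()[0]   # most frequent port

# Assign the filled column back instead of mutating a selection in place.
toy["Age"] = toy["Age"].fillna(age_median)
toy["Embarked"] = toy["Embarked"].fillna(embarked_mode)

print(toy.isnull().sum())
```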


Step 3: Feature Engineering (Make Data More Useful)


In [22]:
# Extract Title from Name
train['Title'] = train['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

# Check the unique titles
print(train['Title'].value_counts())


Title
Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Mlle          2
Major         2
Col           2
Countess      1
Capt          1
Ms            1
Sir           1
Lady          1
Mme           1
Don           1
Jonkheer      1
Name: count, dtype: int64
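
Many of these titles occur only once or twice, so each would become a nearly empty one-hot column. One optional refinement, which the notebook itself does not apply, is to merge equivalent titles and group the infrequent ones under a single label. The sketch below assumes a `min_count` threshold and a `"Rare"` label (both hypothetical choices), and the two "Doe" names are invented for illustration.

```python
import pandas as pd

# A few names in the dataset's format; the "Doe" entries are made up.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Allen, Mr. William Henry",
    "Heikkinen, Miss. Laina",
    "Doe, Mlle. Anne",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    "Cumings, Mrs. John Bradley",
    "Doe, Dr. John",
])

titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)

# Merge variant forms into their common equivalents (an assumed mapping).
titles = titles.replace({"Mlle": "Miss", "Mme": "Mrs", "Ms": "Miss"})

# Titles seen fewer than min_count times are grouped into "Rare".
min_count = 2
counts = titles.value_counts()
rare_titles = counts[counts < min_count].index
titles = titles.where(~titles.isin(rare_titles), "Rare")

print(titles.value_counts())
```

If used, the same mapping and the same list of rare titles derived from the training data would need to be applied to the test set.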


In [26]:
train['FamilySize'] = train['SibSp'] + train['Parch'] + 1
train.drop(['Name', 'Ticket', 'PassengerId'], axis=1, inplace=True)



Step 4: Convert Categorical Data → Numbers (Encoding)


In [28]:
# One-hot encode Sex, Embarked, Title
train = pd.get_dummies(train, columns=['Sex', 'Embarked', 'Title'], drop_first=True)

# View new columns after encoding
print(train.head())


   Survived  Pclass   Age  SibSp  Parch     Fare  FamilySize  Sex_male  \
0         0       3  22.0      1      0   7.2500           2      True   
1         1       1  38.0      1      0  71.2833           2     False   
2         1       3  26.0      0      0   7.9250           1     False   
3         1       1  35.0      1      0  53.1000           2     False   
4         0       3  35.0      0      0   8.0500           1      True   

   Embarked_Q  Embarked_S  ...  Title_Major  Title_Master  Title_Miss  \
0       False        True  ...        False         False       False   
1       False       False  ...        False         False       False   
2       False        True  ...        False         False        True   
3       False        True  ...        False         False       False   
4       False        True  ...        False         False       False   

   Title_Mlle  Title_Mme  Title_Mr  Title_Mrs  Title_Ms  Title_Rev  Title_Sir  
0       False      False      True  


Step 5: Train a Machine Learning Model


In [30]:
# Separate target (y) and features (X)
X = train.drop('Survived', axis=1)
y = train['Survived']


In [34]:
from sklearn.model_selection import train_test_split

# 80% training, 20% validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)


In [36]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Initialize model
model = RandomForestClassifier()

# Train the model
model.fit(X_train, y_train)

# Predict on validation set
y_pred = model.predict(X_val)

# Check accuracy
accuracy = accuracy_score(y_val, y_pred)
print(f"Validation Accuracy: {accuracy:.4f}")


Validation Accuracy: 0.8324
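
A single hold-out split with an unseeded forest gives a score that can shift from run to run. A hedged sketch of a more stable estimate is k-fold cross-validation with fixed seeds. It uses synthetic data from `make_classification` as a stand-in (an assumption); `model_cv` and the fold count are illustrative choices, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption): replace with the encoded X and y.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=42)

# Fixing random_state makes both the forest and the folds reproducible.
model_cv = RandomForestClassifier(n_estimators=100, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(model_cv, X_demo, y_demo, cv=cv, scoring="accuracy")

print("Fold accuracies:", scores.round(4))
print(f"Mean accuracy: {scores.mean():.4f} (std {scores.std():.4f})")
```

Reporting the mean together with the spread across folds shows how much of a difference between two models is likely to be noise.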



Step 6: Predict on the Test Data and Create Submission


In [40]:
# Reload test data
test = pd.read_csv('test.csv')

# Fill missing Age and Fare
test['Age'] = test['Age'].fillna(train['Age'].median())
test['Fare'] = test['Fare'].fillna(train['Fare'].median())

# Extract Title
test['Title'] = test['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

# Create FamilySize
test['FamilySize'] = test['SibSp'] + test['Parch'] + 1

# Drop unwanted columns
test.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True, errors='ignore')

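# One-hot encoding (keep every level here; drop_first=True could drop a
# different category than it did on train, and the reorder step below
# keeps only the training columns anyway)
test = pd.get_dummies(test, columns=['Sex', 'Embarked', 'Title'])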


In [50]:
# Add missing columns to test set (with 0 values)
for col in X.columns:
    if col not in test.columns:
        test[col] = 0

# Reorder columns to match training set
test = test[X.columns]
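
The same alignment can be written with `DataFrame.reindex`, which adds missing columns, discards extra ones, and fixes the order in one call. One caveat worth checking: `drop_first=True` drops whichever category sorts first in the frame being encoded, so train and test may drop different ones. Encoding the test set with every level and then reindexing sidesteps that. The toy frames below are made up for illustration, and `"Dona"` stands in for a title that appears only in the test data.

```python
import pandas as pd

# Toy frames (assumption): the test set has a title the training set lacks.
train_raw = pd.DataFrame({"Title": ["Mr", "Miss", "Capt"], "Age": [22.0, 26.0, 70.0]})
test_raw = pd.DataFrame({"Title": ["Mr", "Dona"], "Age": [30.0, 39.0]})

# Training columns define the feature layout the model was fitted on.
train_enc = pd.get_dummies(train_raw, columns=["Title"], drop_first=True)

# Encode the test set with every level kept, then conform it to train.
test_enc = pd.get_dummies(test_raw, columns=["Title"])
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(test_aligned)
```

Here the unseen `Title_Dona` column is discarded and the absent `Title_Miss` column is created with zeros, so the row falls back to the baseline category.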


In [44]:
# Predict
test_preds = model.predict(test)


In [48]:
# Load PassengerId again (since we dropped it earlier)
test_data = pd.read_csv('test.csv')

# Create submission DataFrame
submission = pd.DataFrame({
    'PassengerId': test_data['PassengerId'],
    'Survived': test_preds
})

# Save to CSV
submission.to_csv('titanic_submission.csv', index=False)
print("Submission file created ")


Submission file created 
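
Before uploading, it can help to reload the file and confirm its format. The sketch below is one possible check, with a hypothetical helper `check_submission` and a three-row toy submission with invented IDs. It round-trips through an in-memory buffer instead of `titanic_submission.csv` so that it runs on its own.

```python
import io

import pandas as pd


def check_submission(df: pd.DataFrame, expected_ids: pd.Series) -> None:
    """Raise AssertionError if the frame is not a valid two-column submission."""
    assert list(df.columns) == ["PassengerId", "Survived"]
    assert df["PassengerId"].tolist() == expected_ids.tolist()
    assert df["Survived"].isin([0, 1]).all()
    assert not df.isnull().any().any()


# Toy submission (assumption): IDs and predictions are invented.
toy_ids = pd.Series([101, 102, 103], name="PassengerId")
toy_submission = pd.DataFrame({"PassengerId": toy_ids, "Survived": [0, 1, 0]})

# Write to and read back from an in-memory CSV, as a file round trip would.
buffer = io.StringIO()
toy_submission.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

check_submission(reloaded, toy_ids)
print("Submission format looks valid")
```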
