In [22]:
from google.colab import files
import zipfile
import os
import pandas as pd

# Check if archive.zip exists, if not, prompt for upload
if not os.path.exists("archive.zip"):
    print("Please upload 'archive.zip' when prompted.")
    uploaded = files.upload()
    if "archive.zip" not in uploaded:
        print("Error: 'archive.zip' was not uploaded. Cannot proceed with data loading.")
        # Raise an error to stop execution if the file isn't uploaded.
        raise FileNotFoundError("'archive.zip' not found after upload attempt.")
else:
    print("'archive.zip' found. Skipping upload prompt.")

# Ensure titanic_data directory exists and extract if not already done
if not os.path.exists("titanic_data"):
    try:
        with zipfile.ZipFile("archive.zip", 'r') as zip_ref:
            zip_ref.extractall("titanic_data")
        print("Files extracted to 'titanic_data/'")
    except FileNotFoundError:
        print("Error: 'archive.zip' not found for extraction. Please ensure it's uploaded.")
        raise # Re-raise the error if extraction fails due to missing zip.
else:
    print("'titanic_data/' directory already exists. Skipping extraction.")

# Verify the CSV file exists within the extracted directory
csv_path = "titanic_data/Titanic-Dataset.csv"
if not os.path.exists(csv_path):
    print(f"Error: {csv_path} not found. Check the contents of 'archive.zip'.")
    raise FileNotFoundError(f"{csv_path} not found.")

# Load the DataFrame and display its head
df = pd.read_csv(csv_path)
df.head()

'archive.zip' found. Skipping upload prompt.
'titanic_data/' directory already exists. Skipping extraction.


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


**Reasoning**:
The next step is to get a concise summary of the DataFrame, including data types and non-null values, which is crucial for identifying potential missing values and understanding the dataset's structure. This can be achieved using the `df.info()` method.



In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


**Reasoning**:
The next step is to generate descriptive statistics for the numerical columns to understand their distribution, which can be done using the `df.describe()` method.



In [24]:
df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


**Reasoning**:
The next step is to calculate and display the total count of missing values for each column in the DataFrame, which is crucial for pinpointing columns that require missing value imputation or removal. This can be achieved using the `df.isnull().sum()` method.
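
As a small illustration of this check, the sketch below counts missing values per column and also reports them as a share of the rows. The frame `toy` and its values are hypothetical, made up for this example rather than taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are illustrative only.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

missing_count = toy.isnull().sum()   # number of missing cells per column
missing_share = toy.isnull().mean()  # fraction of rows missing per column

summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
print(summary)
```

Reporting the share next to the raw count makes it easier to judge whether a column is a candidate for imputation or for removal.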



In [25]:
df.isnull().sum()

Unnamed: 0,0
PassengerId,0
Survived,0
Pclass,0
Name,0
Sex,0
Age,177
SibSp,0
Parch,0
Ticket,0
Fare,0
Cabin,687
Embarked,2


**Reasoning**:
The next step is to handle missing values in the 'Age' column by filling them with the median age. The median is robust to outliers and is less affected by skew than the mean. Note that filling all 177 gaps with a single value still concentrates the distribution at that value.
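
To illustrate why the median is less sensitive to extreme values, the sketch below compares the two statistics on a made-up series with one outlier. The ages are hypothetical and are not drawn from the dataset.

```python
import pandas as pd

# Hypothetical ages with one extreme value; not taken from the dataset.
ages = pd.Series([20.0, 22.0, 24.0, 26.0, 80.0])

print("mean:", ages.mean())      # pulled upward by the single large value
print("median:", ages.median())  # depends only on the middle of the ordering
```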



In [26]:
median_age = df['Age'].median()
df['Age'].fillna(median_age, inplace=True)
print(f"Missing values in 'Age' column after imputation: {df['Age'].isnull().sum()}")

Missing values in 'Age' column after imputation: 0


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['Age'].fillna(median_age, inplace=True)


**Reasoning**:
The previous code for filling missing 'Age' values produced a `FutureWarning` due to the use of `inplace=True` with chained assignment. To align with best practices and remove the warning, I will explicitly assign the result of `fillna()` back to the 'Age' column.
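
The two alternatives named in the warning text can be sketched on a small hypothetical frame (`toy` is made up for this example). Both avoid calling an `inplace` method on the intermediate object returned by `df[col]`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap per column.
toy = pd.DataFrame({"Age": [22.0, np.nan, 26.0], "Fare": [7.25, np.nan, 8.05]})

# Option 1: assign the filled column back to the frame.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Option 2: call fillna on the frame with a {column: value} mapping.
toy = toy.fillna({"Fare": toy["Fare"].median()})

print(toy)
```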



In [27]:
median_age = df['Age'].median()
df['Age'] = df['Age'].fillna(median_age)
print(f"Missing values in 'Age' column after imputation: {df['Age'].isnull().sum()}")

Missing values in 'Age' column after imputation: 0


**Reasoning**:
The next step is to handle missing values in the 'Embarked' column by filling them with the most frequent embarking port (mode), as specified in the instructions. This is a suitable strategy for categorical data with a small number of missing values.
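
A minimal sketch of mode imputation on a hypothetical series of port codes (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical port codes with two gaps.
ports = pd.Series(["S", "C", "S", np.nan, "Q", "S", np.nan])

# mode() ignores NaN and returns a Series because ties are possible;
# [0] takes the first of the most frequent values.
most_common = ports.mode()[0]
filled = ports.fillna(most_common)

print(most_common, filled.isnull().sum())
```

If two values tie for most frequent, `mode()` returns both in sorted order, so `[0]` picks the first alphabetically. That is worth checking before relying on it.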



In [28]:
most_frequent_embarked = df['Embarked'].mode()[0]
df['Embarked'].fillna(most_frequent_embarked, inplace=True)
print(f"Missing values in 'Embarked' column after imputation: {df['Embarked'].isnull().sum()}")

Missing values in 'Embarked' column after imputation: 0


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['Embarked'].fillna(most_frequent_embarked, inplace=True)


**Reasoning**:
The previous code for filling missing 'Embarked' values produced a `FutureWarning` due to the use of `inplace=True` with chained assignment. To align with best practices and remove the warning, I will explicitly assign the result of `fillna()` back to the 'Embarked' column.



In [29]:
most_frequent_embarked = df['Embarked'].mode()[0]
df['Embarked'] = df['Embarked'].fillna(most_frequent_embarked)
print(f"Missing values in 'Embarked' column after imputation: {df['Embarked'].isnull().sum()}")

Missing values in 'Embarked' column after imputation: 0


**Reasoning**:
The next step is to drop the 'Cabin' column, as specified in the instructions, due to its high percentage of missing values. This can be achieved using the `df.drop()` method.
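
One way to make the "too many missing values" decision explicit is to drop columns whose missing share exceeds a cutoff. The sketch below assumes a hypothetical cutoff of 0.5 and a made-up frame; neither comes from the original instructions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; 'Cabin' is mostly missing.
toy = pd.DataFrame({
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Age": [22.0, 38.0, np.nan, 35.0, 35.0],
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.05],
})

THRESHOLD = 0.5  # hypothetical cutoff, an assumption for this sketch

missing_share = toy.isnull().mean()
to_drop = missing_share[missing_share > THRESHOLD].index.tolist()
reduced = toy.drop(columns=to_drop)

print(to_drop, reduced.columns.tolist())
```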



In [30]:
df.drop('Cabin', axis=1, inplace=True)
print(f"'Cabin' column dropped. Remaining columns: {df.columns.tolist()}")

'Cabin' column dropped. Remaining columns: ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Embarked']


## Feature Engineering and Encoding

### Subtask:
Create new features ('FamilySize' and 'IsAlone') and convert categorical features ('Sex', 'Embarked', 'Pclass') into numerical representations using one-hot encoding.


**Reasoning**:
The first part of the subtask is to create the new features 'FamilySize' and 'IsAlone'. 'FamilySize' is calculated by summing 'SibSp' and 'Parch' and adding 1, and 'IsAlone' is derived from 'FamilySize'.



In [31]:
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
# 1 when the passenger has no relatives aboard, otherwise 0
df['IsAlone'] = (df['FamilySize'] == 1).astype(int)

print("New columns 'FamilySize' and 'IsAlone' created.")
print(df[['SibSp', 'Parch', 'FamilySize', 'IsAlone']].head())

New columns 'FamilySize' and 'IsAlone' created.
   SibSp  Parch  FamilySize  IsAlone
0      1      0           2        0
1      1      0           2        0
2      0      0           1        1
3      1      0           2        0
4      0      0           1        1


**Reasoning**:
Now that 'FamilySize' and 'IsAlone' have been created, the next step is to convert the categorical columns 'Sex', 'Embarked', and 'Pclass' into numerical representations using one-hot encoding, as specified in the instructions. This is crucial for preparing the data for machine learning models.
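
The effect of `drop_first=True` can be seen on a small hypothetical frame (`toy` is made up for this example). Note that `pd.get_dummies` replaces the listed source columns with the dummy columns rather than keeping both.

```python
import pandas as pd

# Hypothetical toy data with two categorical columns.
toy = pd.DataFrame({"Sex": ["male", "female", "female"], "Pclass": [1, 2, 3]})

full = pd.get_dummies(toy, columns=["Sex", "Pclass"])
reduced = pd.get_dummies(toy, columns=["Sex", "Pclass"], drop_first=True)

print(full.columns.tolist())     # one column per category
print(reduced.columns.tolist())  # first category of each column omitted
```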



In [32]:
df = pd.get_dummies(df, columns=['Sex', 'Embarked', 'Pclass'], drop_first=True)
print("Categorical columns 'Sex', 'Embarked', and 'Pclass' have been one-hot encoded.")
print(df.head())

Categorical columns 'Sex', 'Embarked', and 'Pclass' have been one-hot encoded.
   PassengerId  Survived                                               Name  \
0            1         0                            Braund, Mr. Owen Harris   
1            2         1  Cumings, Mrs. John Bradley (Florence Briggs Th...   
2            3         1                             Heikkinen, Miss. Laina   
3            4         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)   
4            5         0                           Allen, Mr. William Henry   

    Age  SibSp  Parch            Ticket     Fare  FamilySize  IsAlone  \
0  22.0      1      0         A/5 21171   7.2500           2        0   
1  38.0      1      0          PC 17599  71.2833           2        0   
2  26.0      0      0  STON/O2. 3101282   7.9250           1        1   
3  35.0      1      0            113803  53.1000           2        0   
4  35.0      0      0            373450   8.0500           1        1   

   Sex_

**Reasoning**:
The final step in this subtask is to drop the original 'SibSp', 'Parch', 'Sex', 'Embarked', and 'Pclass' columns, as their information has been incorporated into the new 'FamilySize', 'IsAlone', and one-hot encoded features. This reduces redundancy and prepares the DataFrame for modeling.



In [33]:
df.drop(['SibSp', 'Parch', 'Sex', 'Embarked', 'Pclass'], axis=1, inplace=True)
print("Original 'SibSp', 'Parch', 'Sex', 'Embarked', and 'Pclass' columns dropped.")
print(df.head())

KeyError: "['Sex', 'Embarked', 'Pclass'] not found in axis"

In [39]:
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


# Load data
df = pd.read_csv("/content/train.csv")

# Cleaning
# Assign back instead of using inplace=True on a column selection
df["Age"] = df["Age"].fillna(df["Age"].mean())
df["Fare"] = df["Fare"].fillna(df["Fare"].mean())
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

# Keep only useful columns
df = df[["Survived", "Pclass", "Sex", "Age", "Fare"]]

# Stats
print(df.describe())

# Visualization
df.groupby("Sex")["Survived"].mean().plot(kind="bar")
plt.title("Survival by Gender")
plt.show()

# ML preparation
X = df[["Pclass", "Sex", "Age", "Fare"]]
y = df["Survived"]

# Train model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression()
model.fit(X_train, y_train)

pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, pred))


FileNotFoundError: [Errno 2] No such file or directory: '/content/train.csv'

**Reasoning**:
The `KeyError` raised by the earlier `df.drop` call occurred because `pd.get_dummies` had already removed the 'Sex', 'Embarked', and 'Pclass' columns when it created the one-hot encoded columns. The corrected code drops only 'SibSp' and 'Parch', the remaining original columns whose information is now carried by 'FamilySize' and 'IsAlone'.
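
A defensive way to avoid this kind of `KeyError` is to drop only the labels that are still present. The sketch below uses a hypothetical frame and shows two options; it is an illustration, not the approach taken in the cell that follows.

```python
import pandas as pd

# Hypothetical toy data; 'Sex' disappears once it is one-hot encoded.
toy = pd.DataFrame({"Sex": ["male", "female"], "SibSp": [1, 0], "Parch": [0, 0]})
encoded = pd.get_dummies(toy, columns=["Sex"], drop_first=True)

wanted = ["SibSp", "Parch", "Sex"]

# Option 1: filter the list against the columns that actually exist.
present = [c for c in wanted if c in encoded.columns]
cleaned = encoded.drop(columns=present)

# Option 2: let pandas skip labels that are not found.
cleaned_alt = encoded.drop(columns=wanted, errors="ignore")

print(cleaned.columns.tolist())
```

Option 2 is shorter but also hides genuine typos in column names, so the explicit filter may be preferable when the list is written by hand.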



In [34]:
df.drop(['SibSp', 'Parch'], axis=1, inplace=True)
print("Original 'SibSp' and 'Parch' columns dropped.")
print(df.head())

Original 'SibSp' and 'Parch' columns dropped.
   PassengerId  Survived                                               Name  \
0            1         0                            Braund, Mr. Owen Harris   
1            2         1  Cumings, Mrs. John Bradley (Florence Briggs Th...   
2            3         1                             Heikkinen, Miss. Laina   
3            4         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)   
4            5         0                           Allen, Mr. William Henry   

    Age            Ticket     Fare  FamilySize  IsAlone  Sex_male  Embarked_Q  \
0  22.0         A/5 21171   7.2500           2        0      True       False   
1  38.0          PC 17599  71.2833           2        0     False       False   
2  26.0  STON/O2. 3101282   7.9250           1        1     False       False   
3  35.0            113803  53.1000           2        0     False       False   
4  35.0            373450   8.0500           1        1      True       Fa

## Prepare Data for Modeling

### Subtask:
Separate the features (X) from the target variable (y, 'Survived'). Split the preprocessed data into training and testing sets to evaluate the model's performance on unseen data.


**Reasoning**:
The first step is to separate the target variable 'Survived' into `y` and the features into `X` by dropping unnecessary columns like 'PassengerId', 'Name', and 'Ticket', and then split these into training and testing sets using `train_test_split` to prepare for model training.
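
As a sketch of how the split behaves, the example below uses synthetic data (`X_demo` and `y_demo` are hypothetical). It also passes `stratify`, an optional argument the notebook's own split does not use, to keep the class ratio the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))      # synthetic features
y_demo = np.array([0] * 60 + [1] * 40)  # synthetic labels, 40% positive

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape, y_te.mean())
```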



In [35]:
from sklearn.model_selection import train_test_split

# Define the target variable (y) and features (X)
y = df['Survived']
X = df.drop(['PassengerId', 'Name', 'Ticket', 'Survived'], axis=1)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")

Shape of X_train: (712, 9)
Shape of X_test: (179, 9)
Shape of y_train: (712,)
Shape of y_test: (179,)


## Train and Evaluate a Predictive Model

### Subtask:
Train a Logistic Regression model on the training data and evaluate its performance on the test set using accuracy, precision, recall, F1-score, and a confusion matrix.


**Reasoning**:
The first step in training and evaluating the model is to import the necessary libraries, instantiate the Logistic Regression model, train it on the `X_train` and `y_train` datasets, make predictions on `X_test`, and then calculate and print the required evaluation metrics (accuracy, precision, recall, F1-score) along with the confusion matrix.



In [36]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Instantiate the Logistic Regression model
model = LogisticRegression(max_iter=200, random_state=42)

# Train the model
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print the metrics
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")

# Generate and print the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(conf_matrix)

Accuracy: 0.7989
Precision: 0.7879
Recall: 0.7027
F1-Score: 0.7429

Confusion Matrix:
[[91 14]
 [22 52]]


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


**Reasoning**:
The previous Logistic Regression model issued a `ConvergenceWarning`, indicating that the optimization algorithm did not converge within the specified `max_iter` limit. Feature scaling is a common and effective technique to address this issue by standardizing the range of independent variables, which helps gradient-based optimizers converge more efficiently. Therefore, the next step is to apply `StandardScaler` to the training and testing feature sets.
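
One common way to combine scaling and the classifier is a scikit-learn pipeline, which fits the scaler on the training data only and reuses it at prediction time. The sketch below runs on synthetic data (`X_demo`, `y_demo` are hypothetical) and is an alternative arrangement, not the code used in the next cell.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic features on different scales.
X_demo = np.column_stack([rng.normal(30, 14, 200), rng.normal(32, 50, 200)])
y_demo = (X_demo[:, 0] + 0.1 * X_demo[:, 1] > 33).astype(int)

# The scaler is fitted inside the pipeline, then the classifier sees scaled data.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
pipe.fit(X_demo, y_demo)

print(pipe.score(X_demo, y_demo))
```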



In [37]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Instantiate the Logistic Regression model
model = LogisticRegression(max_iter=200, random_state=42)  # same max_iter as the unscaled run; scaling should help convergence

# Train the model on scaled data
model.fit(X_train_scaled, y_train)

# Make predictions on the scaled test set
y_pred = model.predict(X_test_scaled)

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print the metrics
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")

# Generate and print the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(conf_matrix)

Accuracy: 0.7989
Precision: 0.7879
Recall: 0.7027
F1-Score: 0.7429

Confusion Matrix:
[[91 14]
 [22 52]]


In [38]:
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score




df = pd.read_csv("/content/train.csv")

print(df.head())
print(df.info())




# Assign back instead of using inplace=True on a column selection
df["Age"] = df["Age"].fillna(df["Age"].mean())
df["Fare"] = df["Fare"].fillna(df["Fare"].mean())

df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

df = df.dropna()   # drop remaining missing rows


# -----------------------------
# Simple statistics
# -----------------------------

print("\nAverage Age:", df["Age"].mean())
print("Max Fare:", df["Fare"].max())
print("Min Fare:", df["Fare"].min())
pri


FileNotFoundError: [Errno 2] No such file or directory: '/content/train.csv'

## Final Task

### Subtask:
Summarize the model's performance, key features that contributed to the prediction, and any insights gained from the analysis.


## Summary:

### Q&A
The final task asks to summarize the model's performance, key features that contributed to the prediction, and any insights gained from the analysis.

*   **Model's Performance:** The Logistic Regression model achieved an accuracy of 79.89%, precision of 78.79%, recall of 70.27%, and an F1-score of 74.29% on the test set. The confusion matrix showed 91 true negatives, 52 true positives, 14 false positives, and 22 false negatives.
*   **Key Features that Contributed to the Prediction:** Feature importance scores were not extracted in this analysis, so individual contributions cannot be ranked. The model was trained on nine features: the numerical 'Age' and 'Fare', the engineered 'FamilySize' and 'IsAlone', and the one-hot encoded 'Sex\_male', 'Embarked\_Q', 'Embarked\_S', 'Pclass\_2', and 'Pclass\_3'.
*   **Insights Gained:**
    *   Missing data handling is crucial; 'Cabin' had too many missing values and was dropped, while 'Age' and 'Embarked' were imputed.
    *   Feature engineering, such as creating 'FamilySize' and 'IsAlone', can provide valuable information from existing columns.
    *   Categorical variables, when properly encoded, can be effectively used by machine learning models.
    *   Feature scaling helps the optimizer in Logistic Regression converge: the `ConvergenceWarning` from the unscaled run did not appear after scaling. In this case the test-set metrics were identical before and after scaling.

### Data Analysis Key Findings

*   The initial dataset contained missing values primarily in the 'Age' (177 missing), 'Cabin' (687 missing), and 'Embarked' (2 missing) columns.
*   Missing 'Age' values were imputed with the median age of 28.
*   Missing 'Embarked' values were imputed with the most frequent port, 'S'.
*   The 'Cabin' column was dropped entirely due to a very high percentage of missing data (approximately 77%).
*   New features, 'FamilySize' (combining 'SibSp' and 'Parch' plus 1) and 'IsAlone' (binary indicator for 'FamilySize' == 1), were successfully engineered.
*   Categorical variables 'Sex', 'Embarked', and 'Pclass' were transformed into numerical representations using one-hot encoding with `drop_first=True` to avoid multicollinearity.
*   The original 'SibSp' and 'Parch' columns were dropped explicitly after feature engineering; 'Sex', 'Embarked', and 'Pclass' had already been replaced by `pd.get_dummies` during encoding.
*   The data was split into training and testing sets (80% training, 20% testing), resulting in 712 samples for training and 179 for testing.
*   Feature scaling using `StandardScaler` was applied to all nine feature columns (fitted on the training set only), which resolved the `ConvergenceWarning` raised during the first Logistic Regression fit.
*   The Logistic Regression model achieved the following performance metrics on the test set:
    *   Accuracy: 0.7989
    *   Precision: 0.7879
    *   Recall: 0.7027
    *   F1-Score: 0.7429
*   The confusion matrix indicated the model correctly identified 91 instances of non-survival (True Negatives) and 52 instances of survival (True Positives), while incorrectly predicting 14 non-survivors as survivors (False Positives) and 22 survivors as non-survivors (False Negatives).
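
The reported metrics follow directly from the confusion matrix counts. The short sketch below recomputes them by hand, assuming scikit-learn's layout of true labels in rows and predictions in columns.

```python
import numpy as np

# Confusion matrix as reported: rows are true labels, columns are predictions.
conf = np.array([[91, 14],
                 [22, 52]])
tn, fp, fn, tp = conf.ravel()

accuracy = (tp + tn) / conf.sum()   # 143 / 179
precision = tp / (tp + fp)          # 52 / 66
recall = tp / (tp + fn)             # 52 / 74
f1 = 2 * precision * recall / (precision + recall)

print(f"{accuracy:.4f} {precision:.4f} {recall:.4f} {f1:.4f}")
# prints: 0.7989 0.7879 0.7027 0.7429
```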

### Insights or Next Steps

*   **Model Refinement:** Explore more advanced classification models (e.g., Random Forest, Gradient Boosting) or tune the hyperparameters of the current Logistic Regression model to potentially improve predictive performance, especially recall and precision.
*   **Feature Importance Analysis:** Quantify the contribution of each feature to the model's predictions (e.g., by examining Logistic Regression coefficients or using feature importance techniques for tree-based models) to gain deeper insights into survival factors and guide further feature engineering.
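
A minimal sketch of the coefficient-based approach is shown below on synthetic data. The feature names `strong`, `weak`, and `noise` are hypothetical, and the ranking it prints says nothing about the Titanic features. Coefficients are only comparable across features when the inputs are on a common scale, which is why the data is standardized first.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic features with a known, deliberately uneven influence on the target.
features = pd.DataFrame({
    "strong": rng.normal(size=300),
    "weak": rng.normal(size=300),
    "noise": rng.normal(size=300),
})
logits = 3.0 * features["strong"] + 0.5 * features["weak"]
target = (logits + rng.normal(scale=0.5, size=300) > 0).astype(int)

scaled = StandardScaler().fit_transform(features)
clf = LogisticRegression(max_iter=200).fit(scaled, target)

# Pair each coefficient with its column name and sort by absolute size.
coefs = pd.Series(clf.coef_[0], index=features.columns)
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```

The sign of each coefficient indicates the direction of the association with the positive class, and the magnitude gives a rough ordering of influence under the linear model.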
