
# Advanced Data Mining for Data-Driven Insights and Predictive Modeling  
## Deliverable 1: Data Collection, Cleaning, and Exploration

**Course:** MSCS 634  
**Deliverable:** 1  
**Focus:** Data Preparation and Exploratory Data Analysis (EDA)



## 1. Dataset Selection and Description

For this project, we use the **Titanic Passenger Dataset**, which contains demographic and travel information for passengers aboard the RMS Titanic.

**Why this dataset is appropriate:**
- Contains **891 records** (one row per passenger)
- Includes **12 attributes**
- Real-world dataset widely used for classification and regression tasks
- Includes missing values and mixed data types suitable for data cleaning and EDA

**Source:** Kaggle Titanic Dataset (loaded in this notebook from the Data Science Dojo GitHub mirror)


In [None]:

# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')

# Display first few rows
df.head()



## 2. Dataset Structure Inspection


In [None]:

# Dataset information
df.info()


In [None]:

# Statistical summary
df.describe()



## 3. Data Cleaning
### 3.1 Handling Missing Values


In [None]:

# Check missing values
df.isnull().sum()


In [None]:

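# Fill missing Age values with the median.
# Assigning the result back avoids fillna(inplace=True) on a column selection,
# which newer pandas versions warn about and which may leave df unchanged.
df['Age'] = df['Age'].fillna(df['Age'].median())

# Fill missing Embarked values with the mode (the most frequent port)
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])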

# Drop Cabin column due to excessive missing values
df.drop(columns=['Cabin'], inplace=True)

df.isnull().sum()
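
The choice between imputing and dropping a column is easier to justify when the share of missing values is explicit. The sketch below illustrates this on a small made-up frame: `toy`, `missing_share`, and `cleaned` are hypothetical names, and the values are invented rather than taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mimicking three Titanic columns; values are made up.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
    "Cabin": [None, "C85", None, None],
})

# Fraction of missing values per column, from most to least missing.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Drop the mostly-empty column, then impute the remaining gaps.
cleaned = toy.drop(columns=["Cabin"]).copy()
cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())
cleaned["Embarked"] = cleaned["Embarked"].fillna(cleaned["Embarked"].mode()[0])

print(cleaned.isnull().sum())
```

A column that is missing in most rows, like `Cabin` in this toy frame, carries little usable signal, whereas a column with only a few gaps can reasonably be filled with a median or mode.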



### 3.2 Removing Duplicates


In [None]:

# Check for duplicates
df.duplicated().sum()


In [None]:

# Remove duplicates if any
df.drop_duplicates(inplace=True)
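
Exact row duplicates are only one kind of duplication; rows can also repeat an identifier while differing slightly in other fields. As a minimal sketch on invented data (the frame `toy` and its values are hypothetical), the two checks can be compared like this:

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 2, 3, 3],
    "Name": ["A", "B", "B", "C", "C"],
    "Fare": [7.25, 71.28, 71.30, 8.05, 8.05],
})

# Rows that repeat an earlier row in every column.
n_exact = int(toy.duplicated().sum())

# Rows that repeat an earlier PassengerId, even if other fields differ.
n_id = int(toy.duplicated(subset=["PassengerId"]).sum())

# drop_duplicates() with no arguments removes only the exact repeats.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_exact, n_id, len(deduped))
```

If the identifier-based count is higher than the exact count, the remaining conflicts need a manual decision about which row to keep.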



### 3.3 Handling Noisy and Inconsistent Data


In [None]:

# Check for unrealistic age values
df[df['Age'] < 0]
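
Negative ages are one form of noise; out-of-range upper values, negative fares, and inconsistently spelled categories are others. The following sketch shows how such checks might look on a made-up frame. The 0 to 120 age range is an assumed plausibility bound, not a rule from the dataset, and `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with deliberately noisy values.
toy = pd.DataFrame({
    "Age": [22.0, -1.0, 130.0, 35.0],
    "Fare": [7.25, 71.28, -5.0, 8.05],
    "Sex": ["male", "Female ", "female", "male"],
})

# Flag ages outside an assumed plausible range of 0 to 120 years.
# Note: between() is False for NaN, so missing ages would be flagged too.
bad_age = ~toy["Age"].between(0, 120)

# Flag negative fares.
bad_fare = toy["Fare"] < 0

# Rows that fail at least one check.
suspect_rows = toy[bad_age | bad_fare]
print(suspect_rows)

# Normalize category labels: trim whitespace and lowercase.
toy["Sex"] = toy["Sex"].str.strip().str.lower()
print(toy["Sex"].unique())
```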



## 4. Exploratory Data Analysis (EDA)


In [None]:

# Distribution of Age
plt.figure()
sns.histplot(df['Age'], bins=30, kde=True)
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()


In [None]:

# Survival count
plt.figure()
sns.countplot(x='Survived', data=df)
plt.title('Survival Count')
plt.xlabel('Survived')
plt.ylabel('Count')
plt.show()


In [None]:

# Survival by Gender
plt.figure()
sns.countplot(x='Sex', hue='Survived', data=df)
plt.title('Survival by Gender')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.show()


In [None]:

# Correlation heatmap
plt.figure(figsize=(10,6))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()



## 5. Insights from EDA

- Females had a significantly higher survival rate than males.
- Age shows only a weak linear correlation with survival in the heatmap.
- Passenger class is negatively correlated with survival: first-class passengers survived at a higher rate than third-class passengers.
- These insights suggest that **Sex, Age, and Pclass** will be important features for future predictive modeling.

The findings from this EDA phase will guide feature selection and model building in the next deliverables.
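
Count plots show the direction of these differences, but group-wise survival rates put a number on them. A minimal sketch, assuming a frame with `Sex`, `Pclass`, and `Survived` columns as in the Titanic data; the toy values below are invented, so the resulting rates are illustrative only.

```python
import pandas as pd

# Hypothetical toy frame; values are made up and do not reflect real rates.
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Pclass": [1, 3, 3, 1, 3, 2],
    "Survived": [1, 1, 0, 1, 0, 1],
})

# Survived is coded 0/1, so its mean within a group is the survival rate.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
rate_by_class = toy.groupby("Pclass")["Survived"].mean()

print(rate_by_sex)
print(rate_by_class)
```

Applying the same two `groupby` calls to the cleaned `df` would give the actual rates behind the gender and class observations.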



## 6. Conclusion

This deliverable focused on preparing the dataset for modeling by performing data cleaning and exploratory analysis. The insights gained provide a strong foundation for regression and classification modeling in the next phase.
