***Title*** :- *Mastering Data Preprocessing: The Gateway to Effective Machine Learning*

**Q 1. What is Data Preprocessing? (Simple Definition + Analogy)**


* **Definition**:-

“*Data preprocessing is the process of cleaning and transforming raw data into a format suitable for building machine learning models*.”


* **Real-World Analogy of Data Preprocessing:-**

*Data preprocessing is like preparing ingredients before cooking a meal*.




**Analogy:- Cooking a Dish:-**

*We can’t just throw raw vegetables, unwashed rice, and dirty utensils into a pan and expect a delicious meal.*
*So, first we need to* :-

* Wash the vegetables (*clean the data*)

* Cut them properly (*organize the data*)

* Remove spoiled parts (*remove outliers or missing values*)

* Measure the right quantities (*scale the features*)

* Select only the ingredients we need (*feature selection*)

Only then can we start cooking (i.e., *training your machine learning model*).



**Q 2. Why is Data Preprocessing Important? Explain how clean data leads to better models.**


* ***Why Data Preprocessing Is Important*** :-



1. 🧠 Machine learning models need high-quality, well-structured input to learn effectively.

2. 🎯 The quality of the input data strongly affects the accuracy of the model’s predictions.

3. 🧹 Models work best when data is clean, consistent, and properly formatted.

4. ⚠️ Messy or incomplete data leads to models learning incorrect patterns and making poor predictions.

5. 🚀 Preprocessing enhances the accuracy, speed, and reliability of machine learning workflows.

6. 🔄 It is a foundational step: even advanced algorithms tend to perform poorly when trained on dirty data.

7. 📊 Proper preprocessing enables models to generalize well on new, unseen data.

8. 🧰 Mastering preprocessing is key to building efficient and trustworthy ML systems.

**Clean Data Leads to Better Models** :-

* *Garbage In, Garbage Out*:-

  If we feed a model poor-quality data, it will learn incorrect patterns and produce inaccurate results, no matter how powerful the algorithm is.

* *Real-World Data is Messy*:-

  * Missing values (e.g., empty cells)
  * Inconsistent formats (e.g., "Yes", "yes", "Y")
  * Outliers (e.g., age = 999)

  These issues can mislead the model unless fixed during preprocessing.

* *Models Need Numeric Input*:-

  Most machine learning algorithms can’t directly process:

  * Text values (e.g., "Male", "Female"), which must be encoded into numbers like 0 and 1
  * Dates or categorical variables, which must be transformed into numerical formats

  We must encode or convert such data before training the model.
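The three kinds of mess listed above can be sketched on a tiny example. This is a minimal illustration, not part of the Titanic walkthrough: the `raw` frame, its column names, and the 0 to 120 valid-age range are all assumptions made for the demo.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data showing a missing value, an outlier,
# and inconsistent spellings of the same category.
raw = pd.DataFrame({
    "age": [25, np.nan, 999, 40],
    "subscribed": ["Yes", "yes", "Y", "no"],
})

clean = raw.copy()

# 1. Inconsistent formats: normalise the text, then map every spelling to one code.
yes_no = {"yes": 1, "y": 1, "no": 0, "n": 0}
clean["subscribed"] = clean["subscribed"].str.strip().str.lower().map(yes_no)

# 2. Outliers: treat impossible ages as missing (the 0-120 range is an assumption).
clean["age"] = clean["age"].where(clean["age"].between(0, 120))

# 3. Missing values: fill with the median of the remaining valid ages.
clean["age"] = clean["age"].fillna(clean["age"].median())

print(clean)
```

Handling the outlier before imputing matters here: otherwise the value 999 would pull the fill value away from the typical age.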

**Q 3. Key Concepts to Cover (With 1–2 Line Explanations)** :-

* **Handling missing values** :-

  *Fill in or remove empty values to avoid errors in training*.

* **Removing duplicates** :-

  *Delete repeated rows to prevent bias or misleading results*.

* **Handling outliers with graphs** :-

  *Use boxplots or histograms to detect and treat extreme values that may distort the model.*

* **Encoding categorical variables** :-

  *Convert text (like "Male", "Female") into numeric form so models can understand them.*

* **Feature scaling** :-

  *Standardization and normalization bring all features to a comparable scale, so features with large ranges do not dominate training.*

* **Feature selection** :-

  *Choose the most important columns to improve model performance and reduce overfitting*.

* **Data splitting (train/test)** :-

  *Divide data into training and testing sets to evaluate model performance fairly*.
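Outlier handling is the one concept in this list that the walkthrough below does not apply, so here is a small sketch of it. It uses the 1.5 × IQR rule, which is the same rule a boxplot typically uses to draw its whiskers. The `fares` values are hypothetical, and capping is only one possible treatment (dropping or transforming the values are alternatives).

```python
import pandas as pd

# Hypothetical fares; the last value plays the role of an extreme outlier.
fares = pd.Series([7.25, 8.05, 13.0, 26.0, 31.0, 512.33])

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = fares.quantile(0.25), fares.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers.
outliers = fares[(fares < lower) | (fares > upper)]

# One possible treatment: cap (winsorize) values at the fences.
capped = fares.clip(lower=lower, upper=upper)

print(outliers.tolist())
print(capped.tolist())
```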

✅ 4. **Tools Used** :-

🐍 **Python**: Language used for data science

* *The foundation language for machine learning, data analysis, and preprocessing tasks.*

🐼 **Pandas**: *For reading and cleaning data*

* *Used to load datasets (e.g., CSV files), handle missing values, remove duplicates, and explore/transform data.*

🧠 **scikit-learn** (*sklearn*): *Core preprocessing and modeling library*

*Provides tools for*:-
* Encoding categorical variables

* Feature scaling (standardization, normalization)

* Splitting data (train/test)

* Imputation (handling missing values)

* Feature selection

* Pipelines (to automate preprocessing + modeling)

* Model evaluation tools (like cross-validation, scoring)
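To make the "Pipelines" item concrete, here is a minimal sketch that chains imputation, scaling, and encoding with scikit-learn's `Pipeline` and `ColumnTransformer`. The `toy` frame and the step names (`"impute"`, `"scale"`, `"encode"`, `"num"`, `"cat"`) are illustrative assumptions; the notebook below performs these steps by hand instead.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame; column names are chosen only for illustration.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 58.0],
    "Fare": [7.25, 71.28, 8.05, 26.55],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Numeric columns: fill gaps with the median, then standardize.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill gaps with the most frequent value, then one-hot encode.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# Route each group of columns to its own sub-pipeline.
preprocess = ColumnTransformer([
    ("num", numeric, ["Age", "Fare"]),
    ("cat", categorical, ["Embarked"]),
])

prepared = preprocess.fit_transform(toy)
print(prepared.shape)  # 2 scaled numeric columns + 2 one-hot columns
```

A benefit of this design is that the same fitted `preprocess` object can later be applied to new data with `transform`, so the training-time statistics are reused consistently.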



# *Dataset Name*:- Titanic - Machine Learning from Disaster

# Apply:
1. Missing value handling

2. Encoding

3. Scaling

# Step 1: ***Import libraries***

In [2]:
import pandas as pd

In [3]:
from sklearn.preprocessing import LabelEncoder

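# Used to convert categorical text data into integer codes.
# Label Encoding
# → Good for binary categories. Codes are assigned in sorted (alphabetical) order,
#   so for ordinal categories check that this order matches the intended ranking.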


In [4]:
from sklearn.preprocessing import OneHotEncoder

#One-Hot Encoding
# → Good for nominal (non-ordered) categories

In [5]:
from sklearn.preprocessing import StandardScaler

# Used to rescale numerical features so they’re on a similar scale.
# Standardization
# → Centers each feature at mean 0 and scales it to standard deviation 1

In [6]:
from sklearn.preprocessing import MinMaxScaler

# Normalization (Min-Max Scaling)
# → Scales values to a range [0, 1]

In [7]:
from sklearn.model_selection import train_test_split


# 1. Training set (usually 70–80% of the data)
#    Used to train the machine learning model.

# 2. Testing set (usually 20–30% of the data)
#    Used to evaluate how well the model performs on unseen data.
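One detail about the encoders imported above is worth a quick sketch: `LabelEncoder` assigns integer codes in sorted order, which may not match a meaningful ranking. The example below contrasts it with `OrdinalEncoder`, which accepts an explicit category order. The `sizes` data is hypothetical, and `OrdinalEncoder` is not used elsewhere in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordinal feature: small < medium < large.
sizes = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# LabelEncoder sorts the labels: large=0, medium=1, small=2.
alphabetical = LabelEncoder().fit_transform(sizes["size"])

# OrdinalEncoder with an explicit order: small=0, medium=1, large=2.
ordered = OrdinalEncoder(categories=[["small", "medium", "large"]])
ranked = ordered.fit_transform(sizes[["size"]]).ravel()

print(alphabetical)  # [2 0 1 2]
print(ranked)        # [0. 2. 1. 0.]
```

For a two-valued column such as `Sex`, the order of the codes does not carry a ranking, so either encoder works.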

# Step 2: ***Load the dataset***

In [9]:
df=pd.read_csv('/content/ Machine Learning from Disaster')

# Step 3: ***Explore data***

In [10]:
df.shape

(891, 12)

In [11]:
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [12]:
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [13]:
df.tail()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.45 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.75 | NaN | Q |


In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [16]:
df.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


In [40]:
df.duplicated().sum()

np.int64(0)

# 1. Missing value handling-

In [17]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64


In [18]:
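(df.isnull().mean() * 100).round(2)  # percentage of missing values in each column

PassengerId     0.00
Survived        0.00
Pclass          0.00
Name            0.00
Sex             0.00
Age            19.87
SibSp           0.00
Parch           0.00
Ticket          0.00
Fare            0.00
Cabin          77.10
Embarked        0.22
dtype: float64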

In [19]:
df['Age'] = df['Age'].fillna(df['Age'].median())  # Fill Age with median


In [20]:
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])  # Fill with most common value


In [21]:
df.drop('Cabin', axis=1, inplace=True)  # Drop Cabin: too many missing values (only 204 of 891 present)
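The decisions made above (drop a column that is mostly empty, fill the rest with the median or the mode) can be expressed as a small general rule. The sketch below is one way to do that on a hypothetical `toy` frame; the 50% `threshold` is an assumption chosen for the demo, not a fixed convention.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mirrors the three columns handled above.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 58.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
})

# Share of missing values per column (0.0 to 1.0).
missing_share = toy.isnull().mean()

# Drop columns that are mostly empty (the 50% cut-off is an assumption).
threshold = 0.5
to_drop = missing_share[missing_share > threshold].index.tolist()
cleaned = toy.drop(columns=to_drop)

# Fill what remains: median for numeric columns, most common value otherwise.
for col in cleaned.columns:
    if pd.api.types.is_numeric_dtype(cleaned[col]):
        cleaned[col] = cleaned[col].fillna(cleaned[col].median())
    else:
        cleaned[col] = cleaned[col].fillna(cleaned[col].mode()[0])

print(to_drop)
print(cleaned)
```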

In [37]:
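# Note: the next two cells were executed after the encoding step (In [22]) below,
# so Sex is already numeric and Embarked is already one-hot encoded.
df.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked_Q     0
Embarked_S     0
dtype: int64


In [38]:
df.info()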

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    int64  
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Embarked_Q   891 non-null    bool   
 11  Embarked_S   891 non-null    bool   
dtypes: bool(2), float64(2), int64(6), object(2)
memory usage: 71.5+ KB


# 2. ***Encoding***-

In [22]:
#  Handle categorical data (Label Encoding & One-Hot)

label = LabelEncoder()
df['Sex'] = label.fit_transform(df['Sex'])  # female=0, male=1 (codes follow alphabetical order)
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)  # One-hot for Embarked; Embarked_C is dropped as the baseline
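To see what `drop_first=True` does, the following sketch compares the two variants on a hypothetical `ports` frame. With three categories, two indicator columns are enough: a row where both are false must belong to the dropped category.

```python
import pandas as pd

# Hypothetical column with the three Titanic port codes.
ports = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# Full one-hot encoding: one indicator column per category.
full = pd.get_dummies(ports, columns=["Embarked"])

# drop_first=True removes the first category (C) and keeps the other two.
reduced = pd.get_dummies(ports, columns=["Embarked"], drop_first=True)

print(list(full.columns))     # ['Embarked_C', 'Embarked_Q', 'Embarked_S']
print(list(reduced.columns))  # ['Embarked_Q', 'Embarked_S']
print(reduced)
```

Dropping one column avoids storing redundant information, which matters mainly for linear models where perfectly correlated columns can cause problems.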


# 3. ***Feature Scaling***

In [23]:
#  Feature Scaling

scaler = StandardScaler()
df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
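As a quick check of what the two imported scalers compute, the sketch below reproduces them by hand on hypothetical `ages` values. Standardization is (x - mean) / std and min-max normalization is (x - min) / (max - min).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature, shaped as a column (n_samples, 1).
ages = np.array([[22.0], [38.0], [26.0], [35.0]])

# Standardization: subtract the mean, divide by the population standard deviation.
standardized = StandardScaler().fit_transform(ages)
manual_z = (ages - ages.mean()) / ages.std()  # np.std uses ddof=0 by default

# Min-max normalization: rescale to the range [0, 1].
normalized = MinMaxScaler().fit_transform(ages)
manual_minmax = (ages - ages.min()) / (ages.max() - ages.min())

print(np.allclose(standardized, manual_z))       # True
print(np.allclose(normalized, manual_minmax))    # True
```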

In [24]:
#  Feature Selection (Keep only important columns)

features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked_Q', 'Embarked_S']
X = df[features]
y = df['Survived']
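The columns above were chosen by hand. As a possible alternative, scikit-learn's `SelectKBest` can rank features by a statistical score. The sketch below uses synthetic data (the `informative` and `noise` columns are made up for the demo) and the ANOVA F-score `f_classif`; it is an illustration, not the method used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: one column tracks the target, the other is pure noise.
rng = np.random.default_rng(0)
y_demo = np.array([0, 1] * 50)
X_demo = pd.DataFrame({
    "informative": y_demo + rng.normal(0, 0.1, 100),
    "noise": rng.normal(0, 1, 100),
})

# Keep the single highest-scoring feature according to the ANOVA F-test.
selector = SelectKBest(score_func=f_classif, k=1).fit(X_demo, y_demo)
kept = X_demo.columns[selector.get_support()].tolist()

print(kept)
```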


In [36]:
# Data Splitting (80%–20% split)
# 712 samples for training
# 179 samples for testing

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
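In this notebook the scaler was fitted on all rows before the split, which lets test-set statistics influence the training data. A common way to avoid that is to split first and fit the scaler on the training rows only. The sketch below shows that order on synthetic data; the `X_demo` and `y_demo` names and values are assumptions, and `stratify` is an optional extra that keeps the class ratio similar in both sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "Age": rng.normal(30, 14, 100),
    "Fare": rng.exponential(30, 100),
})
y_demo = pd.Series(rng.integers(0, 2, 100), name="Survived")

# 1. Split first, keeping the class balance similar in both sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# 2. Learn the mean and standard deviation from the training rows only.
scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)

# 3. Reuse those training statistics on the test rows.
X_te_scaled = scaler_demo.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)  # (80, 2) (20, 2)
```

The same reasoning applies to imputation: a median computed on the full dataset also includes test rows.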

In [39]:
print("Preprocessing Complete.")

Preprocessing Complete.


In [None]:
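import os
df.to_csv("titanic_cleaned.csv", index=False)  # write the cleaned DataFrame to disk before checking for it
print("File saved:", os.path.exists("titanic_cleaned.csv"))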

File saved: True


In [None]:
df=pd.read_csv("titanic_cleaned.csv")

In [None]:
from google.colab import files
files.download("titanic_cleaned.csv")


In [None]:
print(df.shape)
print(df.columns)


(891, 12)
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Embarked_Q', 'Embarked_S'],
      dtype='object')
