Data Wrangling I
Perform the following operations using Python on any open-source
dataset (e.g., data.csv)
1. Import all the required Python Libraries.
2. Locate an open-source dataset on the web (e.g.,
https://www.kaggle.com). Provide a clear description of the
data and its source (i.e., URL of the website).
3. Load the dataset into a pandas DataFrame.
4. Data Preprocessing: Check for missing values in the data using
pandas isnull(), and use the describe() function to get some initial
statistics. Provide variable descriptions, types of variables, etc.
Check the dimensions of the data frame.
5. Data Formatting and Data Normalization: Summarize the types
of variables by checking the data types (i.e., character, numeric,
integer, factor, and logical) of the variables in the data set. If
variables are not in the correct data type, apply proper type
conversions.
6. Turn categorical variables into quantitative variables in Python.
In addition to the code and outputs, explain every operation that you
do in the above steps and explain everything that you do to
import/read/scrape the data set.

In [1]:
# Step 1: Import Required Libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

print("✅ Required Libraries Imported")


✅ Required Libraries Imported


In [2]:
# Step 2: Load Dataset
# Dataset Source: https://www.kaggle.com/competitions/titanic/data
df = pd.read_csv("Titanic-Dataset.csv")
print("✅ Dataset Loaded Successfully")

✅ Dataset Loaded Successfully


In [3]:
# Step 3: Initial Overview
print("📄 First 5 rows:\n", df.head())
print("\n🔢 Dataset Shape (Rows, Columns):", df.shape)
print("\n🔍 Data Types:\n", df.dtypes)

📄 First 5 rows:
    PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   

In [4]:
# Step 4: Data Preprocessing
print("\n🔎 Missing Values:\n", df.isnull().sum())
# Handling missing values:

# 1. Age: fill missing with median (numerical, skewed distribution)
# Assign the result back; chained inplace fillna is deprecated in pandas
df['Age'] = df['Age'].fillna(df['Age'].median())

# 2. Cabin: 687 of 891 values are missing (about 77%); drop the column
df.drop(columns=['Cabin'], inplace=True)

# 3. Embarked: fill with mode (most common port)
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Check again after filling
print("\n✅ Missing Values (after handling):\n", df.isnull().sum())

print("\n📊 Descriptive Statistics:\n", df.describe(include='all'))



🔎 Missing Values:
 PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

✅ Missing Values (after handling):
 PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64

📊 Descriptive Statistics:
         PassengerId    Survived      Pclass                 Name   Sex  \
count    891.000000  891.000000  891.000000                  891   891   
unique          NaN         NaN         NaN                  891     2   
top             NaN         NaN         NaN  Dooley, Mr. Patrick  male   
freq            NaN         NaN         NaN                    1   577   
mean     446.000000    0.383838    2.308642                  NaN   NaN   
std      257.353842    0.48659



## 🔢 Converting Data Types in Pandas

### 1. Converting Object to Numeric
- **Function:** `pandas.to_numeric()`
- **Purpose:** Converts values in a column from object (string) type to numeric (integer or float).
- **Example Use Case:** When numeric values are stored as strings (e.g., `'45'`), this function converts them to proper numbers (`45`).
- **Note:** If the conversion fails (e.g., due to non-numeric strings), use `errors='coerce'` to turn them into NaN.

```python
df['column_name'] = pd.to_numeric(df['column_name'], errors='coerce')
```
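As a small illustration of the note above, the following sketch shows how `errors='coerce'` treats values that cannot be parsed. The Series contents are made up for the example and are not taken from the Titanic data.

```python
import pandas as pd

# Hypothetical column: numbers stored as strings, plus one unparseable
# entry and one missing entry
raw = pd.Series(["45", "3.5", "unknown", None], name="column_name")

# Unparseable strings become NaN instead of raising a ValueError
converted = pd.to_numeric(raw, errors="coerce")

print(converted.dtype)              # float64
print(int(converted.isna().sum()))  # 2
```

Counting the NaN values before and after the conversion is a quick way to see how many entries were silently coerced.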

### 2. Converting Object to DateTime
- **Function:** `pandas.to_datetime()`
- **Purpose:** Converts a string column that contains date/time information into pandas `datetime64` values (`Timestamp` objects).
- **Why:** Makes it easier to perform date operations like sorting, filtering by time, etc.

```python
df['date_column'] = pd.to_datetime(df['date_column'])
```
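The date operations mentioned above can be sketched as follows. The dates and the `%Y-%m-%d` format are assumptions made for the example; the Titanic data has no date column.

```python
import pandas as pd

# Hypothetical date strings, including one invalid entry
dates = pd.Series(["2024-01-15", "2023-12-31", "not a date"], name="date_column")

# An explicit format avoids ambiguous parsing; invalid entries become NaT
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")

# Sorting and filtering now work on real datetimes rather than strings
earliest = parsed.min()
from_2024 = parsed[parsed >= pd.Timestamp("2024-01-01")]

print(earliest)        # 2023-12-31 00:00:00
print(len(from_2024))  # 1
```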

## 🧩 Converting Categorical to Numeric

### 1. Label Encoding
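- **What it does:** Assigns a unique integer to each category. `LabelEncoder` numbers the categories in sorted order (e.g., `C` → 0, `Q` → 1, `S` → 2).
- **Best for:** Binary columns, target labels, or ordinal categories whose sorted order matches their real order.
- **Caution:** Many machine learning models interpret the integer codes as ordered/ranked, which is misleading for nominal categories (no order); prefer dummy or one-hot encoding for those.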

```python
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['encoded_column'] = le.fit_transform(df['category_column'])
```
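To make the mapping visible, this sketch fits an encoder on a few port codes and prints the learned classes. The sample values are chosen for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of embarkation ports
ports = pd.Series(["S", "C", "S", "Q"], name="category_column")

le = LabelEncoder()
codes = le.fit_transform(ports)

# classes_ holds the categories in sorted order; the index is the code
print(list(le.classes_))  # ['C', 'Q', 'S']
print(codes.tolist())     # [2, 0, 2, 1]

# inverse_transform maps codes back to the original labels
print(le.inverse_transform([0, 1, 2]).tolist())  # ['C', 'Q', 'S']
```

Keeping one fitted encoder per column preserves `classes_`, so the codes can be decoded later.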

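### 2. Dummy Coding (boolean dtype)
- **What it does:** Creates one indicator column per category. With `drop_first=True`, the first category is dropped and serves as the baseline, leaving k-1 columns for k categories. pandas 2.0 and later return these columns as `bool` (`True`/`False`) unless `dtype=int` is passed.
- **Use Case:** Useful for regression models (dropping one level avoids perfect multicollinearity) or when you want to avoid order bias.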
- **Implementation:** Done using `pd.get_dummies()` in pandas.

```python
df = pd.get_dummies(df, columns=['category_column'], drop_first=True)
```
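The effect of `drop_first` is easiest to see side by side. This sketch uses a made-up four-row frame and passes `dtype=int` to get 0/1 columns instead of booleans.

```python
import pandas as pd

# Hypothetical frame with a single categorical column
df_demo = pd.DataFrame({"category_column": ["S", "C", "S", "Q"]})

# All categories kept: one indicator column per category
full = pd.get_dummies(df_demo, columns=["category_column"], dtype=int)

# First category ('C') dropped and used as the implicit baseline
reduced = pd.get_dummies(
    df_demo, columns=["category_column"], drop_first=True, dtype=int
)

print(full.columns.tolist())
# ['category_column_C', 'category_column_Q', 'category_column_S']
print(reduced.columns.tolist())
# ['category_column_Q', 'category_column_S']
```

In the reduced frame, a row with zeros in every indicator column belongs to the baseline category `C`.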

### 3. One-Hot Encoding
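- **Similar to Dummy Coding**, but:
  - **No column is dropped by default**, so every category gets its own column (the full set is perfectly collinear, which can cause multicollinearity in linear models with an intercept).
  - Suitable for ML models that do not depend on linearly independent features, such as tree-based models.
- **Why Use:** Avoids giving models a false idea of order among categories (unlike label encoding).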

```python
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse_output=False)  # 'sparse' was renamed to 'sparse_output' in scikit-learn 1.2
encoded = ohe.fit_transform(df[['category_column']])
```


In [5]:
# Step 5: Data Formatting & Type Conversions

# Summary of variable types
print("\n📌 Variable Types Summary:\n", df.dtypes)

# Converting necessary columns to correct data types
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
df['Fare'] = pd.to_numeric(df['Fare'], errors='coerce')
df['Survived'] = df['Survived'].astype(int)
df['SibSp'] = df['SibSp'].astype(int)
df['Parch'] = df['Parch'].astype(int)

print("\n✅ Data Types After Conversion:\n", df.dtypes)
df.head()


📌 Variable Types Summary:
 PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Embarked        object
dtype: object

✅ Data Types After Conversion:
 PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Embarked        object
dtype: object

In [6]:
# Step 6: Converting Categorical to Quantitative

# Label Encoding for 'Sex' and 'Embarked'
# (one encoder per column, so each keeps its own classes_ for decoding)
sex_le = LabelEncoder()
embarked_le = LabelEncoder()
df['Sex_encoded'] = sex_le.fit_transform(df['Sex'].astype(str))
df['Embarked_encoded'] = embarked_le.fit_transform(df['Embarked'].astype(str))

# Indicator columns for Pclass and Embarked
# (no drop_first, so every category keeps its own column)
df = pd.get_dummies(df, columns=['Pclass'], prefix='Class')
df = pd.get_dummies(df, columns=['Embarked'], prefix='Port')

# One-Hot Encoding of 'Sex' with scikit-learn
# (output columns follow ohe.categories_: female, male)
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse_output=False)
one_hot = ohe.fit_transform(df[['Sex']])
print(one_hot)
print("\n✅ Categorical Columns Converted to Numerical:")
print(df[['Sex', 'Sex_encoded', 'Embarked_encoded']].head())


[[0. 1.]
 [1. 0.]
 [1. 0.]
 ...
 [1. 0.]
 [0. 1.]
 [0. 1.]]

✅ Categorical Columns Converted to Numerical:
      Sex  Sex_encoded  Embarked_encoded
0    male            1                 2
1  female            0                 0
2  female            0                 2
3  female            0                 2
4    male            1                 2


In [7]:
df.head()

,PassengerId,Survived,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Sex_encoded,Embarked_encoded,Class_1,Class_2,Class_3,Port_C,Port_Q,Port_S
0,1,0,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,1,2,False,False,True,False,False,True
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,0,0,True,False,False,True,False,False
2,3,1,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,0,2,False,False,True,False,False,True
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,0,2,True,False,False,False,False,True
4,5,0,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,1,2,False,False,True,False,False,True


In [8]:
# Final Check
print("\n📈 Final Dataset Shape:", df.shape)
print("📌 Final Columns:\n", df.columns.tolist())


📈 Final Dataset Shape: (891, 17)
📌 Final Columns:
 ['PassengerId', 'Survived', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Sex_encoded', 'Embarked_encoded', 'Class_1', 'Class_2', 'Class_3', 'Port_C', 'Port_Q', 'Port_S']


