🌟 Exercise 1: Duplicate Detection and Removal
Instructions
Objective: Identify and remove duplicate entries in the Titanic dataset.

Load the Titanic dataset.
Identify if there are any duplicate rows based on all columns.
Remove any duplicate rows found in the dataset.
Verify the removal of duplicates by checking the number of rows before and after the duplicate removal.
Hint: Use the duplicated() and drop_duplicates() functions in Pandas.



In [2]:
import pandas as pd
df = pd.read_csv('/content/titanic_ds.csv')
duplicates = df.duplicated()
duplicates_qty = duplicates.sum()
rows_before = df.shape[0]
df = df.drop_duplicates()
rows_after = df.shape[0]
print(f"Number of rows before: {rows_before}")
print(f"Number of rows after: {rows_after}")
print(f"Number of duplicate rows: {duplicates_qty}")

Number of rows before: 418
Number of rows after: 418
Number of duplicate rows: 0
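
This file contains no duplicate rows, so the before and after counts are identical. To see the same two calls act on data that does contain a duplicate, here is a minimal sketch on a small made-up frame (the `toy` frame and its values are illustrative assumptions, not rows checked against the dataset):

```python
import pandas as pd

# Hypothetical toy frame: the last row repeats the first one exactly.
toy = pd.DataFrame({
    "Name": ["Kelly, Mr. James", "Wirz, Mr. Albert", "Kelly, Mr. James"],
    "Age": [34.5, 27.0, 34.5],
})

# duplicated() flags each repeat of an earlier row (keep="first" is the default).
dup_mask = toy.duplicated()
n_duplicates = int(dup_mask.sum())

# Count the rows before and after dropping the flagged repeats.
rows_before = len(toy)
deduped = toy.drop_duplicates()
rows_after = len(deduped)

print(f"Duplicates found: {n_duplicates}")
print(f"Rows before: {rows_before}, rows after: {rows_after}")
```

The duplicate count is taken before `drop_duplicates()` runs; the same count taken afterwards would always be zero.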


🌟 Exercise 2: Handling Missing Values
Instructions
Identify columns in the Titanic dataset with missing values.
Explore different strategies for handling missing data, such as removal, imputation, and filling with a constant value.
Apply each strategy to different columns based on the nature of the data.
Hint: Review methods like dropna(), fillna(), and SimpleImputer from scikit-learn.
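
As a rough illustration of matching the strategy to the column, the sketch below uses a small made-up frame that borrows the column names `Age`, `Fare`, and `Cabin`. The values, the choice of the median for `Age`, and the `'Unknown'` placeholder are assumptions for demonstration only:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in three columns.
toy = pd.DataFrame({
    "Age": [34.5, np.nan, 62.0, 27.0],
    "Fare": [7.83, 7.0, np.nan, 8.66],
    "Cabin": [np.nan, "B45", np.nan, np.nan],
})

# Removal: drop only the rows where Fare is missing (few gaps, cheap to lose).
removed = toy.dropna(subset=["Fare"])

# Imputation: replace missing Age values with the column median.
age_median = toy["Age"].median()
imputed = toy.assign(Age=toy["Age"].fillna(age_median))

# Constant fill: mark missing Cabin values with an explicit placeholder.
filled = imputed.assign(Cabin=imputed["Cabin"].fillna("Unknown"))

print(filled)
```

Each step returns a new frame, so the effect of one strategy can be inspected without the others having been applied.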

In [None]:
from sklearn.impute import SimpleImputer
df = pd.read_csv('/content/titanic_ds.csv')
missing_values = df.isnull().sum()
print("Columns with missing values:\n", missing_values[missing_values > 0])
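# Strategy 1: Removal (drops every row with a missing value in any column)
df_dropped = df.dropna()
# Strategy 2: Imputation (mean of each numeric column)
# Note: this also covers PassengerId and Survived, which have no missing values,
# and fit_transform returns floats, so the integer columns become float.
num_cols = df.select_dtypes(include=['float64', 'int64']).columns
imputer_num = SimpleImputer(strategy='mean')
df[num_cols] = imputer_num.fit_transform(df[num_cols])
# Strategy 3: Filling with a constant value
# Note: Age was already mean-imputed above and Embarked has no missing values
# in this file, so these two calls change nothing; Cabin keeps its NaN values.
df['Age'] = df['Age'].fillna(0)
df['Embarked'] = df['Embarked'].fillna('Unknown')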
print("Data after handling missing values:")
print(df.head())

Columns with missing values:
 Age       86
Fare       1
Cabin    327
dtype: int64
Data after handling missing values:
   PassengerId  Survived  Pclass  \
0        892.0       0.0     3.0   
1        893.0       1.0     3.0   
2        894.0       0.0     2.0   
3        895.0       0.0     3.0   
4        896.0       1.0     3.0   

                                           Name     Sex   Age  SibSp  Parch  \
0                              Kelly, Mr. James    male  34.5    0.0    0.0   
1              Wilkes, Mrs. James (Ellen Needs)  female  47.0    1.0    0.0   
2                     Myles, Mr. Thomas Francis    male  62.0    0.0    0.0   
3                              Wirz, Mr. Albert    male  27.0    0.0    0.0   
4  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0    1.0    1.0   

    Ticket     Fare Cabin Embarked  
0   330911   7.8292   NaN        Q  
1   363272   7.0000   NaN        S  
2   240276   9.6875   NaN        Q  
3   315154   8.6625   NaN        S  
4  31

🌟 Exercise 3: Feature Engineering
Instructions
Create new features, such as Family Size from SibSp and Parch, and Title extracted from the Name column.
Convert categorical variables into numerical form using techniques like one-hot encoding or label encoding.
Normalize or standardize numerical features if required.
Hint: Utilize Pandas for data manipulation and scikit-learn’s preprocessing module for encoding.
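
A minimal sketch of the two derived features and a one-hot step, using three names and their `SibSp`/`Parch` values from the dataset's first rows. The regular expression and the `dtype=int` choice are assumptions of this sketch, one of several ways to do it:

```python
import pandas as pd

# Three passengers taken from the head of the dataset.
toy = pd.DataFrame({
    "Name": [
        "Kelly, Mr. James",
        "Wilkes, Mrs. James (Ellen Needs)",
        "Hirvonen, Mrs. Alexander (Helga E Lindqvist)",
    ],
    "SibSp": [0, 1, 1],
    "Parch": [0, 0, 1],
})

# Family size: siblings/spouses + parents/children + the passenger.
toy["Family_Size"] = toy["SibSp"] + toy["Parch"] + 1

# Title: the text between the comma and the first period in the name.
toy["Title"] = toy["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# One-hot encode the title as 0/1 integer columns.
encoded = pd.get_dummies(toy, columns=["Title"], dtype=int)

print(encoded[["Family_Size", "Title_Mr", "Title_Mrs"]])
```

The vectorized `str.extract` yields NaN for a name that does not match the pattern, whereas chained `split` calls would raise an error on such a name.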



In [None]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
df = pd.read_csv('/content/titanic_ds.csv')
df['Family_Size'] = df['SibSp'] + df['Parch'] + 1
df['Title'] = df['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())
print(df[['Name', 'Family_Size', 'Title']].head())
df = pd.get_dummies(df, columns=['Sex', 'Embarked', 'Title'])
print(df.head())

                                           Name  Family_Size Title
0                              Kelly, Mr. James            1    Mr
1              Wilkes, Mrs. James (Ellen Needs)            2   Mrs
2                     Myles, Mr. Thomas Francis            1    Mr
3                              Wirz, Mr. Albert            1    Mr
4  Hirvonen, Mrs. Alexander (Helga E Lindqvist)            3   Mrs
   PassengerId  Survived  Pclass  \
0          892         0       3   
1          893         1       3   
2          894         0       2   
3          895         0       3   
4          896         1       3   

                                           Name   Age  SibSp  Parch   Ticket  \
0                              Kelly, Mr. James  34.5      0      0   330911   
1              Wilkes, Mrs. James (Ellen Needs)  47.0      1      0   363272   
2                     Myles, Mr. Thomas Francis  62.0      0      0   240276   
3                              Wirz, Mr. Albert  27.0      0 

🌟 Exercise 4: Outlier Detection and Handling
Instructions
Use statistical methods to detect outliers in columns like Fare and Age.
Decide on a strategy to handle the identified outliers, such as capping, transformation, or removal.
Implement the chosen strategy and assess its impact on the dataset.
Hint: Explore methods like IQR (Interquartile Range) and Z-score for outlier detection.
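
The Z-score method mentioned in the hint can be sketched as follows. The fare values are made up for illustration, and the helper name `zscore_outliers` and the threshold of 3 are assumptions of this sketch (3 is a common convention, not a fixed rule):

```python
import pandas as pd

# Hypothetical fares: nineteen ordinary values and one extreme one.
fares = pd.Series(
    [7.25, 7.9, 8.05, 8.66, 9.69, 10.5, 12.29, 13.0, 14.45, 15.5,
     21.0, 26.0, 26.55, 27.72, 29.7, 31.5, 7.75, 7.83, 13.0, 512.33],
    name="Fare",
)


def zscore_outliers(values: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Return the entries whose absolute Z-score exceeds the threshold."""
    # Z-score: distance from the mean in units of the standard deviation.
    z = (values - values.mean()) / values.std(ddof=0)
    # Missing values produce NaN scores, which compare as False and are skipped.
    return values[z.abs() > threshold]


outliers = zscore_outliers(fares)
print(outliers)
```

On small samples a single extreme value inflates the standard deviation itself, so Z-scores can stay below the threshold; the IQR rule is less sensitive to that effect.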



In [2]:
import pandas as pd
import numpy as np
from scipy import stats
df = pd.read_csv('/content/titanic_ds.csv')
def detect_outliers_iqr(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    return outliers
fare_outliers = detect_outliers_iqr(df, 'Fare')
age_outliers = detect_outliers_iqr(df, 'Age')

print("Fare outliers using IQR:\n", fare_outliers)
print("Age outliers using IQR:\n", age_outliers)

def cap_outliers(df, column):
    lower_cap = df[column].quantile(0.01)
    upper_cap = df[column].quantile(0.99)
    df[column] = np.where(df[column] < lower_cap, lower_cap, df[column])
    df[column] = np.where(df[column] > upper_cap, upper_cap, df[column])

cap_outliers(df, 'Fare')
cap_outliers(df, 'Age')

print("\nAfter capping outliers:")
print(df[['Fare', 'Age']].describe())



Fare outliers using IQR:
      PassengerId  Survived  Pclass  \
12           904         1       1   
24           916         1       1   
48           940         1       1   
53           945         1       1   
59           951         1       1   
64           956         0       1   
69           961         1       1   
74           966         1       1   
75           967         0       1   
81           973         0       1   
96           988         1       1   
114         1006         1       1   
118         1010         0       1   
141         1033         1       1   
142         1034         0       1   
150         1042         1       1   
156         1048         1       1   
179         1071         1       1   
181         1073         0       1   
184         1076         1       1   
188         1080         1       3   
196         1088         0       1   
202         1094         0       1   
212         1104         0       2   
217         1109        

🌟 Exercise 5: Data Standardization and Normalization
Instructions
Assess the scale and distribution of numerical columns in the dataset.
Apply standardization to features with a wide range of values.
Normalize data that requires a bounded range, like [0, 1].
Hint: Consider using StandardScaler and MinMaxScaler from scikit-learn’s preprocessing module.
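
To make the two transformations concrete, the sketch below computes them by hand with pandas on a made-up frame (the values are illustrative assumptions). Standardization subtracts the mean and divides by the standard deviation; min-max normalization rescales to [0, 1]:

```python
import pandas as pd

# Hypothetical values: Fare has a wide range, Pclass a narrow one.
toy = pd.DataFrame({
    "Fare": [7.0, 14.0, 31.5, 512.33],
    "Pclass": [3, 3, 2, 1],
})

# Standardization: (x - mean) / std, with the population standard deviation
# (ddof=0), the convention scikit-learn's StandardScaler follows.
fare_std = (toy["Fare"] - toy["Fare"].mean()) / toy["Fare"].std(ddof=0)

# Min-max normalization: (x - min) / (max - min), bounded to [0, 1].
pclass_range = toy["Pclass"].max() - toy["Pclass"].min()
pclass_norm = (toy["Pclass"] - toy["Pclass"].min()) / pclass_range

print(fare_std.round(3).tolist())
print(pclass_norm.tolist())
```

After standardization the column has mean 0 and unit variance but no fixed bounds, while the min-max result always spans exactly 0 to 1.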



In [4]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
df = pd.read_csv('/content/titanic_ds.csv')
print("Scale and distribution of numerical columns:")
print(df.describe())
numerical_cols = df.select_dtypes(include=['float64', 'int64']).columns
scaler = StandardScaler()
df[['Fare', 'Age']] = scaler.fit_transform(df[['Fare', 'Age']])
minmax_scaler = MinMaxScaler()
df[['Pclass', 'SibSp', 'Parch']] = minmax_scaler.fit_transform(df[['Pclass', 'SibSp', 'Parch']])
print("\nAfter Standardization and Normalization:")
print(df.describe())



Scale and distribution of numerical columns:
       PassengerId    Survived      Pclass         Age       SibSp  \
count   418.000000  418.000000  418.000000  332.000000  418.000000   
mean   1100.500000    0.363636    2.265550   30.272590    0.447368   
std     120.810458    0.481622    0.841838   14.181209    0.896760   
min     892.000000    0.000000    1.000000    0.170000    0.000000   
25%     996.250000    0.000000    1.000000   21.000000    0.000000   
50%    1100.500000    0.000000    3.000000   27.000000    0.000000   
75%    1204.750000    1.000000    3.000000   39.000000    1.000000   
max    1309.000000    1.000000    3.000000   76.000000    8.000000   

            Parch        Fare  
count  418.000000  417.000000  
mean     0.392344   35.627188  
std      0.981429   55.907576  
min      0.000000    0.000000  
25%      0.000000    7.895800  
50%      0.000000   14.454200  
75%      0.000000   31.500000  
max      9.000000  512.329200  

After Standardizati

🌟 Exercise 6: Feature Encoding
Instructions
Identify categorical columns in the Titanic dataset, such as Sex and Embarked.
Use one-hot encoding for nominal variables and label encoding for ordinal variables.
Integrate the encoded features back into the main dataset.
Hint: Utilize pandas.get_dummies() for one-hot encoding and LabelEncoder from scikit-learn for label encoding.
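
The split between nominal and ordinal columns can be sketched with pandas alone. The toy values are made up, and using an ordered `pd.Categorical` for `Pclass` is an assumption of this sketch, chosen because it states the category order explicitly:

```python
import pandas as pd

# Hypothetical rows with two nominal columns and one ordinal column.
toy = pd.DataFrame({
    "Sex": ["male", "female", "male"],
    "Embarked": ["Q", "S", "Q"],
    "Pclass": [3, 3, 2],
})

# Nominal variables: one 0/1 column per category, no order implied.
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], dtype=int)

# Ordinal variable: integer codes that follow an explicitly declared order.
pclass_order = pd.Categorical(toy["Pclass"], categories=[1, 2, 3], ordered=True)
encoded["Pclass"] = pclass_order.codes

print(encoded)
```

`get_dummies` returns the encoded columns alongside the untouched ones, so the result is already integrated into a single frame.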



In [15]:
from sklearn.preprocessing import LabelEncoder
df = pd.read_csv('/content/titanic_ds.csv')
categorical_cols = df.select_dtypes(include=['object']).columns
print("Categorical columns:\n", categorical_cols)
df = pd.get_dummies(df, columns=['Sex', 'Embarked'])
label_encoder = LabelEncoder()
df['Pclass'] = label_encoder.fit_transform(df['Pclass'])
print("Dataset after feature encoding:\n", df.head())


Categorical columns:
 Index(['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], dtype='object')
Dataset after feature encoding:
    PassengerId  Survived  Pclass  \
0          892         0       2   
1          893         1       2   
2          894         0       1   
3          895         0       2   
4          896         1       2   

                                           Name   Age  SibSp  Parch   Ticket  \
0                              Kelly, Mr. James  34.5      0      0   330911   
1              Wilkes, Mrs. James (Ellen Needs)  47.0      1      0   363272   
2                     Myles, Mr. Thomas Francis  62.0      0      0   240276   
3                              Wirz, Mr. Albert  27.0      0      0   315154   
4  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  22.0      1      1  3101298   

      Fare Cabin  Sex_female  Sex_male  Embarked_C  Embarked_Q  Embarked_S  
0   7.8292   NaN       False      True       False        True       False  
1   7.0000   NaN       

🌟 Exercise 7: Data Transformation for Age Feature
Instructions
Create age groups (bins) from the Age column to categorize passengers into different age categories.
Apply one-hot encoding to the age groups to convert them into binary features.
Hint: Use pd.cut() for binning the Age column and pd.get_dummies() for one-hot encoding.
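
How `pd.cut()` treats values that sit exactly on a bin edge depends on its `right` parameter. The sketch below compares both settings on a few boundary ages; the bin edges and labels are example choices for illustration:

```python
import pandas as pd

# Ages chosen to fall exactly on bin edges.
ages = pd.Series([0, 12, 18, 60])
age_bins = [0, 12, 18, 25, 35, 60, 100]
age_labels = ["Kid", "Teen", "Young Adult", "Adult", "Middle-Aged", "Senior"]

# right=False: bins are [low, high), so an edge value joins the upper bin.
left_closed = pd.cut(ages, bins=age_bins, labels=age_labels, right=False)

# right=True (the default): bins are (low, high], so an edge value joins the
# lower bin, and the lowest edge (0) falls outside every bin.
right_closed = pd.cut(ages, bins=age_bins, labels=age_labels)

comparison = pd.DataFrame({
    "Age": ages,
    "right=False": left_closed,
    "right=True": right_closed,
})
print(comparison)
```

With the default setting, passing `include_lowest=True` keeps the lowest edge inside the first bin.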

In [21]:
df = pd.read_csv('/content/titanic_ds.csv')
age_bins = [0, 12, 18, 25, 35, 60, 100]
age_labels = ['Kid', 'Teen', 'Young Adult', 'Adult', 'Middle-Aged', 'Senior']
df['AgeGroup'] = pd.cut(df['Age'], bins=age_bins, labels=age_labels, right=False)
print("Dataset with AgeGroup column:\n", df[['Age', 'AgeGroup']].head())
df = pd.get_dummies(df, columns=['AgeGroup'])
print("Dataset after one-hot encoding AgeGroup:\n", df.head())


Dataset with AgeGroup column:
     Age     AgeGroup
0  34.5        Adult
1  47.0  Middle-Aged
2  62.0       Senior
3  27.0        Adult
4  22.0  Young Adult
Dataset after one-hot encoding AgeGroup:
    PassengerId  Survived  Pclass  \
0          892         0       3   
1          893         1       3   
2          894         0       2   
3          895         0       3   
4          896         1       3   

                                           Name     Sex   Age  SibSp  Parch  \
0                              Kelly, Mr. James    male  34.5      0      0   
1              Wilkes, Mrs. James (Ellen Needs)  female  47.0      1      0   
2                     Myles, Mr. Thomas Francis    male  62.0      0      0   
3                              Wirz, Mr. Albert    male  27.0      0      0   
4  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0      1      1   

    Ticket     Fare Cabin Embarked  AgeGroup_Kid  AgeGroup_Teen  \
0   330911   7.8292   NaN        Q        