🌟 Exercise 1: Duplicate Detection And Removal

Instructions

Objective: Identify and remove duplicate entries in the Titanic dataset.

Load the Titanic dataset.
Identify if there are any duplicate rows based on all columns.
Remove any duplicate rows found in the dataset.
Verify the removal of duplicates by checking the number of rows before and after the duplicate removal.
Hint: Use the duplicated() and drop_duplicates() functions in Pandas.

In [30]:
import pandas as pd

titanic_df = pd.read_csv('titanic_dataset.csv')
print(titanic_df.head())
rows_before = titanic_df.shape[0]
print(f'Number of rows before removing duplicates: {rows_before}')
duplicate_rows = titanic_df[titanic_df.duplicated()]
print(f'Number of duplicate rows: {duplicate_rows.shape[0]}')
df_cleaned = titanic_df.drop_duplicates()
rows_after = df_cleaned.shape[0]
print(f'Number of rows after removing duplicates: {rows_after}')
if rows_before == rows_after:
    print('No duplicates were found, so no rows were removed.')
else:
    print(f'{rows_before - rows_after} duplicates were removed.')

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  
Nu
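
The check above treats two rows as duplicates only when every column matches. The following is a minimal sketch on a made-up frame (the `toy` data and the choice of `Name` as the key column are illustrative assumptions, not taken from the Titanic file) showing how the `subset` and `keep` parameters change what counts as a duplicate:

```python
import pandas as pd

# Hypothetical toy data, not the Titanic file
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Ann", "Ann"],
    "Ticket": ["A1", "B2", "A1", "A9"],
})

# Full-row duplicates: only the third row repeats an earlier row exactly
full_dupes = toy.duplicated().sum()

# Duplicates judged on 'Name' alone; keep=False marks every member of a repeated group
name_dupes = toy.duplicated(subset=["Name"], keep=False).sum()

# Drop rows that repeat an earlier 'Name', keeping the first occurrence
deduped = toy.drop_duplicates(subset=["Name"], keep="first")

print(full_dupes, name_dupes, len(deduped))
```

Whether a subset of columns is a sensible key depends on the data; for passenger records, a full-row comparison is the more conservative choice.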

🌟 Exercise 2: Handling Missing Values

Instructions

Identify columns in the Titanic dataset with missing values.
Explore different strategies for handling missing data, such as removal, imputation, and filling with a constant value.
Apply each strategy to different columns based on the nature of the data.
Hint: Review methods like dropna(), fillna(), and SimpleImputer from scikit-learn.

In [26]:
%pip install scikit-learn



In [38]:
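from sklearn.impute import SimpleImputer

# Count missing values per column
missing_values = titanic_df.isnull().sum()
print("Missing Values:\n", missing_values)

# Constant fill: mark unrecorded cabins explicitly
titanic_df['Cabin'] = titanic_df['Cabin'].fillna('Unknown')

# Mean imputation; fit_transform returns a 2-D array, so flatten it back to one column
imputer = SimpleImputer(strategy='mean')
titanic_df['Age'] = imputer.fit_transform(titanic_df[['Age']]).ravel()

titanic_df.head()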

Missing Values:
 PassengerId     0
Survived        0
Pclass          0
Name            0
Sex             0
Age             0
SibSp           0
Parch           0
Ticket          0
Fare            0
Cabin           0
Embarked        2
FamilySize      0
Title          27
dtype: int64


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,FamilySize,Title
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,Unknown,S,1,Mr.
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1,Mrs.
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,Unknown,S,0,Miss.
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,1,Mrs.
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,Unknown,S,0,Mr.
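
The cell above applies a constant fill to `Cabin` and mean imputation to `Age`, while the printed counts still show two missing `Embarked` values. The sketch below uses a small made-up frame (the values are illustrative assumptions) to walk through the strategies the instructions name: statistic-based imputation, constant fill, most-frequent fill, and row removal. It uses pandas only; `SimpleImputer` with `strategy='median'` or `strategy='most_frequent'` would be the scikit-learn counterpart.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in every column
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": ["C85", None, None, "C123"],
    "Embarked": ["S", "C", None, "S"],
})

# Count missing values per column before any handling
missing = toy.isnull().sum()

# Imputation with a statistic: the median is less sensitive to extreme values than the mean
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Constant fill for a column where "missing" is itself informative
toy["Cabin"] = toy["Cabin"].fillna("Unknown")

# Most-frequent fill for a categorical column with only a few gaps
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

# Removal: drop any row that still contains a missing value (none remain here)
toy = toy.dropna()

print(missing.to_dict())
print(toy)
```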


In [37]:
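# Drop the redundant 'FirstName' column (it duplicates 'Title');
# errors='ignore' avoids a KeyError if the column is already gone
titanic_df = titanic_df.drop(columns=['FirstName'], errors='ignore')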

Exercise 3: Feature Engineering

Instructions

Create new features, such as ‘Family Size’ from ‘SibSp’ and ‘Parch’, and ‘Title’ extracted from the ‘Name’ column.
Convert categorical variables into numerical form using techniques like one-hot encoding or label encoding.
Normalize or standardize numerical features if required.
Hint: Utilize Pandas for data manipulation and scikit-learn’s preprocessing module for encoding.

In [35]:
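# Family size = siblings/spouses + parents/children aboard (the passenger is not counted)
titanic_df['FamilySize'] = titanic_df['SibSp'] + titanic_df['Parch']
# Extract one of four common titles; names with any other title (e.g. 'Rev.') get NaN
titanic_df['Title'] = titanic_df['Name'].str.extract(r'(\bMr\.|\bMrs\.|\bMiss\.|\bMaster\.)', expand=False)
titanic_df.head(10)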

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,FamilySize,FirstName,Title
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,Unknown,S,1,Mr.,Mr.
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1,Mrs.,Mrs.
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,Unknown,S,0,Miss.,Miss.
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,1,Mrs.,Mrs.
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,Unknown,S,0,Mr.,Mr.
5,6,0,3,"Moran, Mr. James",male,29.699118,0,0,330877,8.4583,Unknown,Q,0,Mr.,Mr.
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,0,Mr.,Mr.
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,Unknown,S,4,Master.,Master.
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,Unknown,S,2,Mrs.,Mrs.
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,Unknown,C,1,Mrs.,Mrs.
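
The title pattern above matches only four titles, so any other title is left as missing. As a hedged alternative, the sketch below captures whatever sits between the comma and the first period. It assumes every name follows the "Surname, Title. Given names" layout seen in the rows above, and it returns the title without the trailing period.

```python
import pandas as pd

# Sample names copied from rows shown in this notebook
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Montvila, Rev. Juozas",
    "Heikkinen, Miss. Laina",
])

# Capture the text after ", " up to (not including) the first period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False)

print(titles.tolist())
```

Rare titles extracted this way would probably need to be grouped into a single category before one-hot encoding, otherwise each one adds a nearly empty column.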


In [39]:
titanic_df = pd.get_dummies(titanic_df, columns=['Sex', 'Embarked', 'Title'], drop_first=True)
titanic_df.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,FamilySize,Sex_male,Embarked_Q,Embarked_S,Title_Miss.,Title_Mr.,Title_Mrs.
0,1,0,3,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.25,Unknown,1,True,False,True,False,True,False
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,C85,1,False,False,False,False,False,True
2,3,1,3,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.925,Unknown,0,False,False,True,True,False,False
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1,C123,1,False,False,True,False,False,True
4,5,0,3,"Allen, Mr. William Henry",35.0,0,0,373450,8.05,Unknown,0,True,False,True,False,True,False
5,6,0,3,"Moran, Mr. James",29.699118,0,0,330877,8.4583,Unknown,0,True,True,False,False,True,False
6,7,0,1,"McCarthy, Mr. Timothy J",54.0,0,0,17463,51.8625,E46,0,True,False,True,False,True,False
7,8,0,3,"Palsson, Master. Gosta Leonard",2.0,3,1,349909,21.075,Unknown,4,True,False,True,False,False,False
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",27.0,0,2,347742,11.1333,Unknown,2,False,False,True,False,False,True
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",14.0,1,0,237736,30.0708,Unknown,1,False,False,False,False,False,True
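
With `drop_first=True`, `pd.get_dummies()` omits the first category of each encoded column, which is why the output above has no `Sex_female`, `Embarked_C`, or `Title_Master.` column. A small sketch on made-up data (the `toy` frame is an assumption for illustration) makes the dropped level visible; `dtype=int` is used here only to print 0/1 instead of booleans.

```python
import pandas as pd

# Hypothetical toy column with the three embarkation codes
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# The alphabetically first category ('C') is dropped and becomes the all-zero baseline
encoded = pd.get_dummies(toy, columns=["Embarked"], drop_first=True, dtype=int)

print(encoded)
```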


In [44]:
Q1 = titanic_df['Age'].quantile(0.25)
Q3 = titanic_df['Age'].quantile(0.75)
IQR = Q3 - Q1
# Calculate the lower and upper IQR bounds for the 'Age' column
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print(lower_bound)
print(upper_bound)

2.5
54.5


Exercise 4: Outlier Detection And Handling

Use statistical methods to detect outliers in columns like ‘Fare’ and ‘Age’.
Decide on a strategy to handle the identified outliers, such as capping, transformation, or removal.
Implement the chosen strategy and assess its impact on the dataset.
Hint: Explore methods like IQR (Interquartile Range) and Z-score for outlier detection.

In [53]:
import numpy as np
# Detecting Outliers using IQR method
def detect_outliers_iqr(data):
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    outliers = data[((data < (Q1 - 1.5 * IQR)) | (data > (Q3 + 1.5 * IQR)))]
    return outliers

# Applying the IQR method to 'Fare' and 'Age' columns
outliers_fare_iqr = detect_outliers_iqr(titanic_df['Fare'])
outliers_age_iqr = detect_outliers_iqr(titanic_df['Age'])

print("Outliers in 'Fare' using IQR:\n", outliers_fare_iqr)
print("Outliers in 'Age' using IQR:\n", outliers_age_iqr)

# Z-score
def detect_outliers_zscore(data):
    mean = np.mean(data)
    std = np.std(data)
    z_scores = (data - mean) / std
    outliers = data[(z_scores > 3) | (z_scores < -3)]
    return outliers

outliers_fare_zscore = detect_outliers_zscore(titanic_df['Fare'])
outliers_age_zscore = detect_outliers_zscore(titanic_df['Age'])

print("Outliers in 'Fare' using Z-score:\n", outliers_fare_zscore)
print("Outliers in 'Age' using Z-score:\n", outliers_age_zscore)


Outliers in 'Fare' using IQR:
 1       71.2833
27     263.0000
31     146.5208
34      82.1708
52      76.7292
         ...   
846     69.5500
849     89.1042
856    164.8667
863     69.5500
879     83.1583
Name: Fare, Length: 116, dtype: float64
Outliers in 'Age' using IQR:
 7       2.00
11     58.00
15     55.00
16      2.00
33     66.00
       ...  
827     1.00
829    62.00
831     0.83
851    74.00
879    56.00
Name: Age, Length: 66, dtype: float64
Outliers in 'Fare' using Z-score:
 27     263.0000
88     263.0000
118    247.5208
258    512.3292
299    247.5208
311    262.3750
341    263.0000
377    211.5000
380    227.5250
438    263.0000
527    221.7792
557    227.5250
679    512.3292
689    211.3375
700    227.5250
716    227.5250
730    211.3375
737    512.3292
742    262.3750
779    211.3375
Name: Fare, dtype: float64
Outliers in 'Age' using Z-score:
 96     71.0
116    70.5
493    71.0
630    80.0
672    70.0
745    70.0
851    74.0
Name: Age, dtype: float64
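
One detail worth keeping in mind when comparing Z-score results: NumPy's `np.std` defaults to the population standard deviation (`ddof=0`), whereas pandas' `Series.std()` defaults to the sample standard deviation (`ddof=1`). The sketch below, on made-up numbers, shows that the choice shifts the scores slightly, which can matter for values close to the ±3 threshold.

```python
import pandas as pd

# Hypothetical values with one large observation
values = pd.Series([1.0, 2.0, 3.0, 4.0, 10.0])

pop_std = values.std(ddof=0)     # population standard deviation
sample_std = values.std(ddof=1)  # sample standard deviation (pandas default)

z_pop = (values - values.mean()) / pop_std
z_sample = (values - values.mean()) / sample_std

print(round(z_pop.max(), 3), round(z_sample.max(), 3))
```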


In [54]:
# Capping Outliers using IQR method
def cap_outliers_iqr(data):
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # clip() caps values outside the bounds and keeps the pandas index
    return data.clip(lower=lower_bound, upper=upper_bound)

# Applying capping
titanic_df['Fare'] = cap_outliers_iqr(titanic_df['Fare'])
titanic_df['Age'] = cap_outliers_iqr(titanic_df['Age'])
print(titanic_df[['Fare', 'Age']].describe())

             Fare         Age
count  891.000000  891.000000
mean    24.046813   29.376817
std     20.481625   12.062035
min      0.000000    2.500000
25%      7.910400   22.000000
50%     14.454200   29.699118
75%     31.000000   35.000000
max     65.634400   54.500000
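
Capping is only one of the strategies listed in the instructions. The sketch below illustrates the other two, removal and transformation, on a short made-up fare series (the values are illustrative assumptions). Removal filters out rows beyond the IQR bounds, while `np.log1p` compresses the long right tail without discarding any rows.

```python
import numpy as np
import pandas as pd

# Hypothetical fares with one extreme value
fare = pd.Series([7.25, 8.05, 13.0, 26.0, 30.0, 512.3292], name="Fare")

# IQR bounds, computed the same way as in the cells above
q1, q3 = fare.quantile(0.25), fare.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Removal: keep only the values inside the bounds
fare_removed = fare[fare.between(lower, upper)]

# Transformation: log(1 + x) shrinks large values and is defined at 0
fare_log = np.log1p(fare)

print(len(fare), len(fare_removed))
print(fare_log.round(3).tolist())
```

Removal changes the row count, so it would have to be applied to the whole DataFrame rather than to a single column.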


Exercise 5: Data Standardization And Normalization

Assess the scale and distribution of numerical columns in the dataset.
Apply standardization to features with a wide range of values.
Normalize data that requires a bounded range, like [0, 1].
Hint: Consider using StandardScaler and MinMaxScaler from scikit-learn’s preprocessing module.

In [56]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
titanic_df['Age_normalized'] = scaler.fit_transform(titanic_df[['Age']])
titanic_df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,FamilySize,Sex_male,Embarked_Q,Embarked_S,Title_Miss.,Title_Mr.,Title_Mrs.,Age_normalized
0,1,0,3,"Braund, Mr. Owen Harris",22.000000,1,0,A/5 21171,7.2500,Unknown,1,True,False,True,False,True,False,0.375000
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.000000,1,0,PC 17599,65.6344,C85,1,False,False,False,False,False,True,0.682692
2,3,1,3,"Heikkinen, Miss. Laina",26.000000,0,0,STON/O2. 3101282,7.9250,Unknown,0,False,False,True,True,False,False,0.451923
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.000000,1,0,113803,53.1000,C123,1,False,False,True,False,False,True,0.625000
4,5,0,3,"Allen, Mr. William Henry",35.000000,0,0,373450,8.0500,Unknown,0,True,False,True,False,True,False,0.625000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",27.000000,0,0,211536,13.0000,Unknown,0,True,False,True,False,False,False,0.471154
887,888,1,1,"Graham, Miss. Margaret Edith",19.000000,0,0,112053,30.0000,B42,0,False,False,True,True,False,False,0.317308
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",29.699118,1,2,W./C. 6607,23.4500,Unknown,3,False,False,True,True,False,False,0.523060
889,890,1,1,"Behr, Mr. Karl Howell",26.000000,0,0,111369,30.0000,C148,0,True,False,False,False,True,False,0.451923


In [57]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
titanic_df['Fare_standardized'] = scaler.fit_transform(titanic_df[['Fare']])
titanic_df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,FamilySize,Sex_male,Embarked_Q,Embarked_S,Title_Miss.,Title_Mr.,Title_Mrs.,Age_normalized,Fare_standardized
0,1,0,3,"Braund, Mr. Owen Harris",22.000000,1,0,A/5 21171,7.2500,Unknown,1,True,False,True,False,True,False,0.375000,-0.820552
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.000000,1,0,PC 17599,65.6344,C85,1,False,False,False,False,False,True,0.682692,2.031623
2,3,1,3,"Heikkinen, Miss. Laina",26.000000,0,0,STON/O2. 3101282,7.9250,Unknown,0,False,False,True,True,False,False,0.451923,-0.787578
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.000000,1,0,113803,53.1000,C123,1,False,False,True,False,False,True,0.625000,1.419297
4,5,0,3,"Allen, Mr. William Henry",35.000000,0,0,373450,8.0500,Unknown,0,True,False,True,False,True,False,0.625000,-0.781471
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",27.000000,0,0,211536,13.0000,Unknown,0,True,False,True,False,False,False,0.471154,-0.539655
887,888,1,1,"Graham, Miss. Margaret Edith",19.000000,0,0,112053,30.0000,B42,0,False,False,True,True,False,False,0.317308,0.290823
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",29.699118,1,2,W./C. 6607,23.4500,Unknown,3,False,False,True,True,False,False,0.523060,-0.029155
889,890,1,1,"Behr, Mr. Karl Howell",26.000000,0,0,111369,30.0000,C148,0,True,False,False,False,True,False,0.451923,0.290823
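
To make the two scalers less of a black box, the sketch below reproduces their formulas by hand with NumPy. The sample ages are an assumption, chosen to include the capped minimum (2.5) and maximum (54.5) of `Age` from the earlier output, so the min-max results for 22 and 38 should match the `Age_normalized` values 0.375 and 0.682692 shown above. `StandardScaler` divides by the population standard deviation, which is also NumPy's default.

```python
import numpy as np

# A few ages, including the capped bounds 2.5 and 54.5
ages = np.array([2.5, 22.0, 38.0, 54.5])

# Min-max normalization: (x - min) / (max - min), bounded to [0, 1]
age_minmax = (ages - ages.min()) / (ages.max() - ages.min())

# Standardization: (x - mean) / std, giving mean 0 and unit variance
age_standard = (ages - ages.mean()) / ages.std()

print(age_minmax.round(6))
print(age_standard.round(6))
```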


Exercise 6: Feature Encoding

Identify categorical columns in the Titanic dataset, such as ‘Sex’ and ‘Embarked’.
Use one-hot encoding for nominal variables and label encoding for ordinal variables.
Integrate the encoded features back into the main dataset.
Hint: Utilize pandas.get_dummies() for one-hot encoding and LabelEncoder from scikit-learn for label encoding.

In [61]:
from sklearn.preprocessing import LabelEncoder

# One-hot encoding of the nominal columns ('Sex', 'Embarked') was already applied
# with pd.get_dummies() in an earlier cell, so only the ordinal 'Pclass' is encoded here.
# LabelEncoder maps the sorted classes 1, 2, 3 to 0, 1, 2, which preserves their order.
label_encoder = LabelEncoder()
titanic_df['Pclass'] = label_encoder.fit_transform(titanic_df['Pclass'])
titanic_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,FamilySize,Sex_male,Embarked_Q,Embarked_S,Title_Miss.,Title_Mr.,Title_Mrs.,Age_normalized,Fare_standardized
0,1,0,2,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.25,Unknown,1,True,False,True,False,True,False,0.375,-0.820552
1,2,1,0,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,65.6344,C85,1,False,False,False,False,False,True,0.682692,2.031623
2,3,1,2,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.925,Unknown,0,False,False,True,True,False,False,0.451923,-0.787578
3,4,1,0,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1,C123,1,False,False,True,False,False,True,0.625,1.419297
4,5,0,2,"Allen, Mr. William Henry",35.0,0,0,373450,8.05,Unknown,0,True,False,True,False,True,False,0.625,-0.781471
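
`LabelEncoder` assigns codes in the sorted order of the labels. That works for `Pclass` because its values are already the integers 1, 2, 3, but for ordinal columns stored as strings the alphabetical order may not match the intended ranking. The sketch below uses a made-up column (`deck_level` is hypothetical, not a Titanic column) and pandas' ordered categoricals to set the order explicitly.

```python
import pandas as pd

# Hypothetical ordinal column with string labels (not a Titanic column)
deck_level = pd.Series(["low", "high", "medium", "low"])

# Alphabetical codes, the same ordering a sorted-label encoder would produce
alphabetical = deck_level.astype("category").cat.codes.tolist()

# Explicit category order so the codes follow low < medium < high
ordered = pd.Categorical(deck_level, categories=["low", "medium", "high"], ordered=True)
ranked = ordered.codes.tolist()

print(alphabetical, ranked)
```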


Exercise 7: Data Transformation For Age Feature

Instructions

Create age groups (bins) from the ‘Age’ column to categorize passengers into different age categories.
Apply one-hot encoding to the age groups to convert them into binary features.
Hint: Use pd.cut() for binning the ‘Age’ column and pd.get_dummies() for one-hot encoding.

In [63]:
# right=False makes the bins left-closed: [0, 18), [18, 35), [35, 50), [50, 100)
titanic_df['AgeGroup'] = pd.cut(titanic_df['Age'], bins=[0, 18, 35, 50, 100], labels=['0-18', '18-35', '35-50', '50+'], right=False)
titanic_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,FamilySize,Sex_male,Embarked_Q,Embarked_S,Title_Miss.,Title_Mr.,Title_Mrs.,Age_normalized,Fare_standardized,AgeGroup
0,1,0,2,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.25,Unknown,1,True,False,True,False,True,False,0.375,-0.820552,18-35
1,2,1,0,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,65.6344,C85,1,False,False,False,False,False,True,0.682692,2.031623,35-50
2,3,1,2,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.925,Unknown,0,False,False,True,True,False,False,0.451923,-0.787578,18-35
3,4,1,0,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1,C123,1,False,False,True,False,False,True,0.625,1.419297,35-50
4,5,0,2,"Allen, Mr. William Henry",35.0,0,0,373450,8.05,Unknown,0,True,False,True,False,True,False,0.625,-0.781471,35-50


In [64]:
# Apply one-hot encoding to 'AgeGroup'
titanic_df = pd.get_dummies(titanic_df, columns=['AgeGroup'], drop_first=True)
titanic_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,...,Embarked_Q,Embarked_S,Title_Miss.,Title_Mr.,Title_Mrs.,Age_normalized,Fare_standardized,AgeGroup_18-35,AgeGroup_35-50,AgeGroup_50+
0,1,0,2,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.25,Unknown,...,False,True,False,True,False,0.375,-0.820552,True,False,False
1,2,1,0,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,65.6344,C85,...,False,False,False,False,True,0.682692,2.031623,False,True,False
2,3,1,2,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.925,Unknown,...,False,True,True,False,False,0.451923,-0.787578,True,False,False
3,4,1,0,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1,C123,...,False,True,False,False,True,0.625,1.419297,False,True,False
4,5,0,2,"Allen, Mr. William Henry",35.0,0,0,373450,8.05,Unknown,...,False,True,False,True,False,0.625,-0.781471,False,True,False
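
The `right` argument of `pd.cut()` decides which bin a value on a boundary falls into, which is why passengers aged exactly 35 land in `35-50` above. A short sketch with a few boundary ages (chosen here for illustration) compares the two settings using the same bins and labels as the cell above:

```python
import pandas as pd

# Boundary ages chosen to show the difference between the two settings
ages = pd.Series([17.0, 18.0, 35.0, 50.0])
bins = [0, 18, 35, 50, 100]
labels = ["0-18", "18-35", "35-50", "50+"]

# right=False: intervals are [a, b), so a boundary value opens the next bin
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

# right=True (the default): intervals are (a, b], so a boundary value closes the current bin
right_closed = pd.cut(ages, bins=bins, labels=labels, right=True)

print(left_closed.astype(str).tolist())
print(right_closed.astype(str).tolist())
```

Note that with the default `right=True` an age of exactly 0 would fall outside the first interval unless `include_lowest=True` is also passed.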
