# Data Analytics Internship Level 1
## Project Title: Data Transformation and Standardization 


Dataset : Titanic Dataset  
Source : Kaggle

---

**Task 2: Data Transformation & Standardization**  

*• Load a dataset and perform feature scaling (normalization/standardization).*  

*• Convert categorical variables into numerical representations (e.g., one-hot encoding, label encoding).*  

*• Ensure the transformed data is ready for analysis or modelling.*

---

In [3]:
#Import the libraries needed for feature scaling (normalization, standardization) and encoding

import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler, MinMaxScaler


In [4]:
#Loading dataset
data = pd.read_csv(r'E:\analytics files\Certify Internship\Level 1\Titanic Dataset.csv')

In [5]:
#Overview of dataset
data

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.00,0,0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.00,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.00,1,2,113781,151.5500,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.00,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,3,0,"Zabour, Miss. Hileni",female,14.50,1,0,2665,14.4542,,C,,328.0,
1305,3,0,"Zabour, Miss. Thamine",female,,1,0,2665,14.4542,,C,,,
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.50,0,0,2656,7.2250,,C,,304.0,
1307,3,0,"Zakarian, Mr. Ortin",male,27.00,0,0,2670,7.2250,,C,,,


In [6]:
#Basic information about the dataset: column names, non-null counts and dtypes
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   pclass     1309 non-null   int64  
 1   survived   1309 non-null   int64  
 2   name       1309 non-null   object 
 3   sex        1309 non-null   object 
 4   age        1046 non-null   float64
 5   sibsp      1309 non-null   int64  
 6   parch      1309 non-null   int64  
 7   ticket     1309 non-null   object 
 8   fare       1308 non-null   float64
 9   cabin      295 non-null    object 
 10  embarked   1307 non-null   object 
 11  boat       486 non-null    object 
 12  body       121 non-null    float64
 13  home.dest  745 non-null    object 
dtypes: float64(3), int64(4), object(7)
memory usage: 143.3+ KB


## Data Cleaning

Basic data cleaning was performed to handle missing values and check for duplicates
before applying data transformation techniques.


In [7]:
#check for the missing values
data.isnull().sum()

pclass          0
survived        0
name            0
sex             0
age           263
sibsp           0
parch           0
ticket          0
fare            1
cabin        1014
embarked        2
boat          823
body         1188
home.dest     564
dtype: int64

In [8]:
#Handling missing values

#Fill missing values in the 'age' column with the median age
#Assigning the result back avoids the chained in-place call that pandas flags with a FutureWarning
data['age'] = data['age'].fillna(data['age'].median())


In [9]:
#Fill missing values in the 'embarked' column with the mode
data['embarked'] = data['embarked'].fillna(data['embarked'].mode()[0])


In [10]:
#Fill the missing value in the 'fare' column with the mean fare
data['fare'] = data['fare'].fillna(data['fare'].mean())


In [11]:
#Removing 'cabin' column as it has too many missing values
data.drop('cabin',axis=1,inplace=True)

In [12]:
#Fill missing values in the 'boat' column with 'None', as a missing value indicates no boat assigned
data['boat'] = data['boat'].fillna('None')


In [13]:
#Fill missing values in the 'body' column with 0, as a missing value indicates no body found
data['body'] = data['body'].fillna(0)


In [14]:
#Fill missing values in the 'home.dest' column with 'Unknown', as a missing value indicates an unknown destination
data['home.dest'] = data['home.dest'].fillna('Unknown')


In [15]:
#again check for missing values
data.isnull().sum()

pclass       0
survived     0
name         0
sex          0
age          0
sibsp        0
parch        0
ticket       0
fare         0
embarked     0
boat         0
body         0
home.dest    0
dtype: int64
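
The per-column fills above could also be written as a single `DataFrame.fillna` call that takes a dict of fill values. The sketch below is only an illustration on a small made-up frame (`toy` and its values are assumptions, not rows from the Titanic file); it returns a new frame instead of modifying a column in place.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for a few Titanic columns (values are illustrative only).
toy = pd.DataFrame({
    "age": [29.0, np.nan, 2.0, 30.0],
    "fare": [211.3375, 151.55, np.nan, 7.225],
    "embarked": ["S", None, "C", "S"],
    "boat": ["2", None, None, "11"],
})

# One fill value per column, computed from the observed (non-missing) data.
fill_values = {
    "age": toy["age"].median(),
    "fare": toy["fare"].mean(),
    "embarked": toy["embarked"].mode()[0],
    "boat": "None",
}

# DataFrame.fillna accepts a dict keyed by column name and returns a new frame,
# so no chained in-place call is involved.
toy_filled = toy.fillna(value=fill_values)
print(toy_filled.isnull().sum().sum())  # total number of remaining missing values
```

Keeping the fill values in one dict also makes the imputation choices easy to review in one place.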

In [16]:
#Check for duplicate rows

data.duplicated().sum() 
#No duplicates found
#The data is now clean: no missing values and no duplicates


np.int64(0)

---


## Data Transformation

In [17]:
#check for the categorical columns

categorical_cols = data.select_dtypes(include=['object']).columns
categorical_cols

Index(['name', 'sex', 'ticket', 'embarked', 'boat', 'home.dest'], dtype='object')

In [18]:
#Drop columns that are not useful for analysis
#  name : unique per passenger - no analytical value
#  ticket : arbitrary identifiers - no pattern
#  home.dest : too many unique values
#  boat : lifeboat number - not useful for analysis

data.drop(['name','ticket','home.dest','boat'],axis=1,inplace=True)
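
One way to back up the "too many unique values" reasoning is to measure the share of distinct values per column before dropping anything. This is a minimal sketch on a made-up four-row frame (`toy` is an assumption; the names and tickets are copied from the preview above only for flavour), not a result computed on the full dataset.

```python
import pandas as pd

# Small illustrative frame with text columns of different cardinality.
toy = pd.DataFrame({
    "name": ["Allen, Miss. Elisabeth Walton", "Allison, Master. Hudson Trevor",
             "Allison, Miss. Helen Loraine", "Zabour, Miss. Hileni"],
    "sex": ["female", "male", "female", "female"],
    "ticket": ["24160", "113781", "113781", "2665"],
    "embarked": ["S", "S", "S", "C"],
})

# Fraction of distinct values per column: a value close to 1.0 means the column
# behaves like an identifier and would explode into many columns if one-hot encoded.
cardinality = toy.nunique() / len(toy)
print(cardinality.sort_values(ascending=False))
```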

In [19]:
#Check the dataset after dropping the unused columns
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   pclass    1309 non-null   int64  
 1   survived  1309 non-null   int64  
 2   sex       1309 non-null   object 
 3   age       1309 non-null   float64
 4   sibsp     1309 non-null   int64  
 5   parch     1309 non-null   int64  
 6   fare      1309 non-null   float64
 7   embarked  1309 non-null   object 
 8   body      1309 non-null   float64
dtypes: float64(3), int64(4), object(2)
memory usage: 92.2+ KB


In [20]:
#Label encoding: convert the binary categorical column 'sex' into numerical form
le = LabelEncoder()

#LabelEncoder sorts the classes alphabetically, so female : 0 , male : 1
data['sex'] = le.fit_transform(data['sex'])



In [21]:
data.sex.unique()

array([0, 1])
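
The `female : 0 , male : 1` mapping does not have to be taken on trust: a fitted `LabelEncoder` exposes it through `classes_`. The snippet below is a small self-contained sketch on a hand-written list rather than the notebook's `data` frame.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["female", "male", "female", "male"])

# classes_ holds the sorted class labels; the position of a label is its code.
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform converts the integer codes back to the original labels.
restored = list(le.inverse_transform(codes))
print(restored)
```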

In [22]:
data.embarked.unique()

array(['S', 'C', 'Q'], dtype=object)

In [23]:
#One-hot encoding: convert the multi-category column 'embarked' into numerical form

#drop='first' avoids the dummy variable trap; sparse_output=False returns a dense array
ohe = OneHotEncoder(drop='first', sparse_output=False)
embarked_encoded = ohe.fit_transform(data[['embarked']])

embarked_encoded

array([[0., 1.],
       [0., 1.],
       [0., 1.],
       ...,
       [0., 0.],
       [0., 0.],
       [0., 1.]])

In [24]:
#fit_transform returns a NumPy array, so convert it into a DataFrame with column names
embarked_df = pd.DataFrame(embarked_encoded, columns=ohe.get_feature_names_out(['embarked']))

In [25]:
embarked_df

Unnamed: 0,embarked_Q,embarked_S
0,0.0,1.0
1,0.0,1.0
2,0.0,1.0
3,0.0,1.0
4,0.0,1.0
...,...,...
1304,0.0,0.0
1305,0.0,0.0
1306,0.0,0.0
1307,0.0,0.0


In [26]:
#now merge this dataframe with original data and drop 'embarked' column
data = pd.concat([data, embarked_df], axis=1)
data.drop('embarked', axis=1, inplace=True)
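
`pd.concat(..., axis=1)` aligns rows by index label. In this notebook the frame still has its default `RangeIndex` (0 to 1308), so the merge lines up correctly. If rows had been dropped earlier, the labels would no longer match; the sketch below uses a made-up three-row frame (`data_toy`, an assumption) to show that case and one way to guard against it.

```python
import pandas as pd

# Toy frame whose index is no longer 0..n-1, as happens after dropping rows.
data_toy = pd.DataFrame({"fare": [7.25, 71.28, 8.05]}, index=[0, 2, 5])
encoded = [[0.0, 1.0], [0.0, 0.0], [1.0, 0.0]]
cols = ["embarked_Q", "embarked_S"]

# A fresh DataFrame gets index 0, 1, 2, which does not match 0, 2, 5:
# concat takes the union of labels and pads the gaps with NaN.
misaligned = pd.concat([data_toy, pd.DataFrame(encoded, columns=cols)], axis=1)
print(misaligned.shape)

# Reusing the original index keeps each encoded row next to its source row.
aligned = pd.concat(
    [data_toy, pd.DataFrame(encoded, columns=cols, index=data_toy.index)], axis=1
)
print(aligned.shape)
```

Passing `index=data.index` when building the encoded frame is a cheap safeguard even when the index is already a clean range.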



In [27]:
#look at data after encoding
data

Unnamed: 0,pclass,survived,sex,age,sibsp,parch,fare,body,embarked_Q,embarked_S
0,1,1,0,29.00,0,0,211.3375,0.0,0.0,1.0
1,1,1,1,0.92,1,2,151.5500,0.0,0.0,1.0
2,1,0,0,2.00,1,2,151.5500,0.0,0.0,1.0
3,1,0,1,30.00,1,2,151.5500,135.0,0.0,1.0
4,1,0,0,25.00,1,2,151.5500,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...
1304,3,0,0,14.50,1,0,14.4542,328.0,0.0,0.0
1305,3,0,0,28.00,1,0,14.4542,0.0,0.0,0.0
1306,3,0,1,26.50,0,0,7.2250,304.0,0.0,0.0
1307,3,0,1,27.00,0,0,7.2250,0.0,0.0,0.0


In [28]:
#Next step: feature scaling of the numerical columns (normalization/standardization)

data

Unnamed: 0,pclass,survived,sex,age,sibsp,parch,fare,body,embarked_Q,embarked_S
0,1,1,0,29.00,0,0,211.3375,0.0,0.0,1.0
1,1,1,1,0.92,1,2,151.5500,0.0,0.0,1.0
2,1,0,0,2.00,1,2,151.5500,0.0,0.0,1.0
3,1,0,1,30.00,1,2,151.5500,135.0,0.0,1.0
4,1,0,0,25.00,1,2,151.5500,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...
1304,3,0,0,14.50,1,0,14.4542,328.0,0.0,0.0
1305,3,0,0,28.00,1,0,14.4542,0.0,0.0,0.0
1306,3,0,1,26.50,0,0,7.2250,304.0,0.0,0.0
1307,3,0,1,27.00,0,0,7.2250,0.0,0.0,0.0


In [29]:
#Normalization (MinMaxScaler): use when the values should be bounded within a fixed range (0 to 1)

#scaler = MinMaxScaler()
#data[['age','fare']] = scaler.fit_transform(data[['age','fare']])

#Standardization (StandardScaler): use when the values should be centered at mean 0 with standard deviation 1

scaler = StandardScaler()
data[['age','fare']] = scaler.fit_transform(data[['age','fare']])
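
Both scalers apply a simple per-column formula: min-max normalization computes `(x - min) / (max - min)`, and standardization computes `(x - mean) / std` with the population standard deviation. The sketch below checks the hand-written formulas against scikit-learn on four made-up ages (the array `x` is an assumption, not taken from the dataset).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[2.0], [25.0], [29.0], [30.0]])  # illustrative ages, one column

# Min-max normalization: rescales the column to the range [0, 1].
manual_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: subtract the mean, divide by the population std (ddof=0).
manual_z = (x - x.mean()) / x.std()

minmax_matches = np.allclose(manual_minmax, MinMaxScaler().fit_transform(x))
z_matches = np.allclose(manual_z, StandardScaler().fit_transform(x))
print(minmax_matches, z_matches)
```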


In [33]:
print("\n\t\tPreview of the Data after Transformation")
data


		Preview of the Data after Transformation


Unnamed: 0,pclass,survived,sex,age,sibsp,parch,fare,body,embarked_Q,embarked_S
0,1,1,0,-0.039006,0,0,3.442480,0.0,0.0,1.0
1,1,1,1,-2.215698,1,2,2.286476,0.0,0.0,1.0
2,1,0,0,-2.131979,1,2,2.286476,0.0,0.0,1.0
3,1,0,1,0.038512,1,2,2.286476,135.0,0.0,1.0
4,1,0,0,-0.349076,1,2,2.286476,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...
1304,3,0,0,-1.163010,1,0,-0.364300,328.0,0.0,0.0
1305,3,0,0,-0.116523,1,0,-0.364300,0.0,0.0,0.0
1306,3,0,1,-0.232799,0,0,-0.504078,304.0,0.0,0.0
1307,3,0,1,-0.194041,0,0,-0.504078,0.0,0.0,0.0


In [34]:
print("\n\t\t\tSummary Statistics of the Transformed dataset")
data.describe()


			Summary Statistics of the Transformed dataset


Unnamed: 0,pclass,survived,sex,age,sibsp,parch,fare,body,embarked_Q,embarked_S
count,1309.0,1309.0,1309.0,1309.0,1309.0,1309.0,1309.0,1309.0,1309.0,1309.0
mean,2.294882,0.381971,0.644003,6.513761e-17,0.498854,0.385027,4.342507e-17,14.864782,0.093965,0.699771
std,0.837836,0.486055,0.478997,1.000382,1.041658,0.86556,1.000382,55.197471,0.291891,0.458533
min,1.0,0.0,0.0,-2.273836,0.0,0.0,-0.6437751,0.0,0.0,0.0
25%,2.0,0.0,0.0,-0.5816283,0.0,0.0,-0.4911082,0.0,0.0,0.0
50%,3.0,0.0,1.0,-0.1165232,0.0,0.0,-0.3643001,0.0,0.0,1.0
75%,3.0,1.0,1.0,0.4260994,1.0,0.0,-0.0390664,0.0,0.0,1.0
max,3.0,1.0,1.0,3.914388,8.0,9.0,9.262219,328.0,1.0,1.0
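
The `std` of 1.000382 for `age` and `fare` is not a scaling error. `StandardScaler` divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`), so the two differ by a factor of `sqrt(n / (n - 1))`. A short check of that factor with `n = 1309`, followed by a demonstration on random numbers (the random column is an assumption used only for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

n = 1309  # number of rows reported by data.info()
factor = np.sqrt(n / (n - 1))
print(round(factor, 6))

# Standardize an arbitrary column of the same length and compare both definitions.
rng = np.random.default_rng(0)
values = pd.DataFrame({"x": rng.normal(size=n)})
scaled = pd.Series(StandardScaler().fit_transform(values).ravel())

population_std = scaled.std(ddof=0)  # what StandardScaler normalizes to 1
sample_std = scaled.std(ddof=1)      # what describe() reports
print(population_std, sample_std)
```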


In [32]:
#Confirm the final dataset is ready for modeling:
#no missing values, categorical columns encoded, and 'age' and 'fare' scaled
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   pclass      1309 non-null   int64  
 1   survived    1309 non-null   int64  
 2   sex         1309 non-null   int64  
 3   age         1309 non-null   float64
 4   sibsp       1309 non-null   int64  
 5   parch       1309 non-null   int64  
 6   fare        1309 non-null   float64
 7   body        1309 non-null   float64
 8   embarked_Q  1309 non-null   float64
 9   embarked_S  1309 non-null   float64
dtypes: float64(5), int64(5)
memory usage: 102.4 KB
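
The same readiness criteria can be checked programmatically instead of by reading the `info()` output. The helper below is a hypothetical sketch (`is_model_ready` is a name introduced here, not part of the notebook) that tests for missing values and non-numeric columns on two small made-up frames.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def is_model_ready(df: pd.DataFrame) -> bool:
    """Return True if df has no missing values and only numeric columns."""
    no_missing = not df.isnull().any().any()
    all_numeric = all(is_numeric_dtype(df[col]) for col in df.columns)
    return bool(no_missing and all_numeric)


# Encoded and scaled toy frame: passes both checks.
ready = pd.DataFrame({"sex": [0, 1], "age": [-0.04, 0.04], "embarked_S": [1.0, 0.0]})
# Raw toy frame: has a text column and a missing value, so it fails.
not_ready = pd.DataFrame({"sex": ["female", "male"], "age": [29.0, np.nan]})

print(is_model_ready(ready), is_model_ready(not_ready))
```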


## Insights

• Label encoding (sex) and one-hot encoding (embarked) converted the categorical data into numerical format.  
• Standardization rescaled Age and Fare to a mean of approximately 0 and a standard deviation of approximately 1, as the summary statistics show.  
• After cleaning, encoding, and scaling, the dataset has no missing values and only numeric columns, which makes it ready for modeling and further analysis.  
