# Task 5: Data Cleaning Using Pandas

## Internship: Elevate Labs

### Objective
The objective of this task is to perform basic data cleaning operations using Python
and the Pandas library. This includes loading the dataset, understanding its structure,
checking for missing values, finding and removing duplicate rows, and preparing the
data for further analysis.

### Dataset Used
Medical Insurance Cost Dataset (`insurance.csv`)


## Importing Required Libraries

In this step, we import the Python libraries needed for data analysis and cleaning.
Pandas is used for data manipulation and analysis, while NumPy is used for
numerical operations.


In [1]:
import pandas as pd
import numpy as np

## Loading the Dataset

In this step, the Medical Insurance dataset is loaded into a Pandas DataFrame.
This allows us to access, manipulate, and analyze the data efficiently using
Pandas functions.


In [2]:
df = pd.read_csv("insurance.csv")

In [3]:
df.head()

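|   | age | sex | bmi | children | smoker | region | charges |
|---|-----|-----|-----|----------|--------|--------|---------|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |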


## Understanding the Dataset

Before performing data cleaning, it is important to understand the structure of the
dataset. This includes checking the number of rows and columns, data types of each
column, and identifying any missing values present in the dataset.


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


In [5]:
df.describe()

|       | age | bmi | children | charges |
|-------|-----|-----|----------|---------|
| count | 1338.0 | 1338.0 | 1338.0 | 1338.0 |
| mean | 39.207025 | 30.663397 | 1.094918 | 13270.422265 |
| std | 14.04996 | 6.098187 | 1.205493 | 12110.011237 |
| min | 18.0 | 15.96 | 0.0 | 1121.8739 |
| 25% | 27.0 | 26.29625 | 0.0 | 4740.28715 |
| 50% | 39.0 | 30.4 | 1.0 | 9382.033 |
| 75% | 51.0 | 34.69375 | 2.0 | 16639.912515 |
| max | 64.0 | 53.13 | 5.0 | 63770.42801 |
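
The row and column counts can also be read directly from `shape`, and `describe()` can be pointed at the non-numeric columns, which the summary above leaves out. A minimal sketch, assuming a hypothetical three-row `sample` frame that mirrors the first rows of `insurance.csv` rather than the full file:

```python
import pandas as pd

# Hypothetical sample mirroring the first three rows of insurance.csv.
sample = pd.DataFrame({
    "age": [19, 18, 28],
    "sex": ["female", "male", "male"],
    "bmi": [27.9, 33.77, 33.0],
    "children": [0, 1, 3],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "southeast"],
    "charges": [16884.924, 1725.5523, 4449.462],
})

# shape is a (rows, columns) tuple.
n_rows, n_cols = sample.shape

# Excluding numeric dtypes leaves the text columns, which are summarised
# by count, number of unique values, most frequent value, and its frequency.
text_summary = sample.describe(exclude=["number"])

print(n_rows, n_cols)
print(text_summary)
```

On the full dataset the same two calls would be applied to `df` instead of `sample`.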


## Checking for Missing Values

Missing values can affect the accuracy of data analysis. In this step, we check
for any null or missing values present in each column of the dataset using
Pandas functions.


In [6]:
df.isnull().sum()

| Column | Missing values |
|--------|----------------|
| age | 0 |
| sex | 0 |
| bmi | 0 |
| children | 0 |
| smoker | 0 |
| region | 0 |
| charges | 0 |
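
When a dataset does contain gaps, the raw count is often easier to judge next to a percentage. A minimal sketch, assuming a hypothetical `toy` frame with deliberately missing entries (the real `insurance.csv` has none):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with deliberately missing entries.
toy = pd.DataFrame({
    "age": [19, np.nan, 28, 33],
    "bmi": [27.9, 33.77, np.nan, np.nan],
    "smoker": ["yes", "no", None, "no"],
})

# isnull() returns a boolean frame; summing counts the True values per column,
# and the mean of the booleans is the fraction of missing entries.
missing_count = toy.isnull().sum()
missing_pct = toy.isnull().mean() * 100

report = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(report)
```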


## Handling Missing Values

The check shows that there are no missing values in any of the columns, so no
imputation or removal of data is required at this stage. The cell below repeats
the check to confirm that the dataset is already clean with respect to missing values.


In [7]:
df.isnull().sum()

| Column | Missing values |
|--------|----------------|
| age | 0 |
| sex | 0 |
| bmi | 0 |
| children | 0 |
| smoker | 0 |
| region | 0 |
| charges | 0 |
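
No imputation was needed here, but for reference, one common approach is to fill numeric gaps with the median and categorical gaps with the most frequent value. This is only a sketch under that assumption, using a hypothetical `toy` frame; it is not a step that was applied to `insurance.csv`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one numeric and one categorical gap.
toy = pd.DataFrame({
    "bmi": [27.9, np.nan, 33.0, 22.7],
    "region": ["southwest", "southeast", None, "southeast"],
})

filled = toy.copy()

# Numeric column: replace missing values with the column median.
filled["bmi"] = filled["bmi"].fillna(filled["bmi"].median())

# Categorical column: replace missing values with the most frequent value.
# mode() returns a Series because ties are possible; [0] takes the first.
filled["region"] = filled["region"].fillna(filled["region"].mode()[0])

print(filled)
```

Whether the median or mode is appropriate depends on the data; dropping the affected rows with `dropna()` is the usual alternative when only a few are missing.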


## Checking for Duplicate Records

Duplicate records can lead to biased or incorrect analysis. In this step, we
count the fully duplicated rows, remove them, and then re-run the count to
confirm that none remain. One duplicate row is found, so the dataset shrinks
from 1338 to 1337 rows.


In [8]:
df.duplicated().sum()

np.int64(1)

In [9]:
df.drop_duplicates(inplace=True)

In [10]:
df.duplicated().sum()

np.int64(0)
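
Before dropping duplicates it can help to look at the rows involved, and after dropping them the index keeps a gap where the removed row was. A minimal sketch of both points, assuming a hypothetical `toy` frame in place of the real data:

```python
import pandas as pd

# Hypothetical frame in which row 2 repeats row 0.
toy = pd.DataFrame({
    "age": [19, 18, 19, 28],
    "region": ["southwest", "southeast", "southwest", "southeast"],
    "charges": [16884.924, 1725.5523, 16884.924, 4449.462],
})

# keep=False flags every member of a duplicate group, not only the repeats,
# so the original and its copy can be inspected side by side.
dupes = toy[toy.duplicated(keep=False)]
print(dupes)

# drop_duplicates keeps the first occurrence by default; reset_index(drop=True)
# renumbers the rows so the index has no gaps.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(deduped)
```

Resetting the index is optional. It mainly matters if later code relies on positional labels such as `df.loc[i]`.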

## Data Type Verification

Verifying and correcting data types is important to ensure accurate analysis
and efficient memory usage. In this step, we check the data types of each
column in the dataset.


In [11]:
df.dtypes

| Column | Dtype |
|--------|-------|
| age | int64 |
| sex | object |
| bmi | float64 |
| children | int64 |
| smoker | object |
| region | object |
| charges | float64 |


## Data Type Conversion

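The `sex`, `smoker`, and `region` columns each hold a small set of repeated text
values, so they are converted from `object` to the `category` data type. This
improves memory efficiency and makes the dataset more suitable for analysis and
modeling.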


In [12]:
categorical_cols = ['sex', 'smoker', 'region']
df[categorical_cols] = df[categorical_cols].astype('category')

In [13]:
df.dtypes

| Column | Dtype |
|--------|-------|
| age | int64 |
| sex | category |
| bmi | float64 |
| children | int64 |
| smoker | category |
| region | category |
| charges | float64 |
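
The memory benefit of `category` can be checked with `memory_usage(deep=True)`. A minimal sketch, assuming a hypothetical `regions` Series of 1000 repeated labels rather than the actual column:

```python
import pandas as pd

# Hypothetical column: four region labels repeated 250 times each.
regions = pd.Series(
    ["southwest", "southeast", "northwest", "northeast"] * 250
)

as_category = regions.astype("category")

# deep=True counts the memory held by the string values themselves,
# not just the references to them.
text_bytes = regions.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(text_bytes, category_bytes)
print(list(as_category.cat.categories))
```

A categorical column stores each distinct label once plus a small integer code per row, which is why the saving grows with the number of repeated values.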


## Feature Understanding

Understanding each feature helps in interpreting the dataset correctly. The
Medical Insurance dataset contains demographic and lifestyle-related attributes
that influence insurance charges.


In [14]:
df.columns

Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'], dtype='object')
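
Beyond listing the column names, counting the distinct values of the categorical features gives a quick sense of what each one contains. A minimal sketch, assuming a `sample` frame built only from the five rows shown by `df.head()` earlier, so the counts below describe those rows and not the full dataset:

```python
import pandas as pd

# The five rows shown by df.head(), restricted to two categorical features.
sample = pd.DataFrame({
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
})

# value_counts() tallies how often each label occurs.
smoker_counts = sample["smoker"].value_counts()

# nunique() returns the number of distinct labels.
region_levels = sample["region"].nunique()

print(smoker_counts)
print(region_levels)
```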

## Feature Engineering

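In this step, a new feature is created to improve understanding of the dataset.
The BMI category is derived by classifying each individual's BMI value with the
cut-offs 18.5, 25, and 30: below 18.5 is Underweight, below 25 is Normal, below 30
is Overweight, and 30 or above is Obese.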


In [15]:
def bmi_category(bmi):
    if bmi < 18.5:
        return 'Underweight'
    elif bmi < 25:
        return 'Normal'
    elif bmi < 30:
        return 'Overweight'
    else:
        return 'Obese'

df['bmi_category'] = df['bmi'].apply(bmi_category)

In [16]:
df[['bmi', 'bmi_category']].head()

|   | bmi | bmi_category |
|---|-----|--------------|
| 0 | 27.9 | Overweight |
| 1 | 33.77 | Obese |
| 2 | 33.0 | Obese |
| 3 | 22.705 | Normal |
| 4 | 28.88 | Overweight |
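
The same binning can also be expressed with `pd.cut`, which works on the whole column at once and returns a categorical result. This is an alternative sketch, not the method used in the notebook; the `bmi` Series below holds hypothetical values chosen to sit on and around the cut-offs:

```python
import pandas as pd

# Hypothetical BMI values, including the exact boundary values.
bmi = pd.Series([17.0, 18.5, 24.9, 25.0, 27.9, 30.0, 33.77])

bins = [-float("inf"), 18.5, 25, 30, float("inf")]
labels = ["Underweight", "Normal", "Overweight", "Obese"]

# right=False makes each bin include its lower edge and exclude its upper
# edge, which matches the `<` comparisons in bmi_category().
bmi_category = pd.cut(bmi, bins=bins, labels=labels, right=False)

print(bmi_category)
```

With `right=False`, a BMI of exactly 25.0 falls in Overweight and exactly 30.0 in Obese, the same as in the `if`/`elif` version.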


## Saving the Cleaned Dataset

After completing all data cleaning and preprocessing steps, the cleaned dataset
is saved as a new CSV file. Passing `index=False` keeps the row index out of the
file. This file can be used for further analysis or machine learning tasks.


In [17]:
df.to_csv("cleaned_insurance.csv", index=False)
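
One point to keep in mind when the saved file is loaded again: CSV stores only text, so the `category` dtype is not preserved and has to be requested on read. A minimal sketch, assuming a hypothetical two-row `cleaned` frame and an in-memory buffer in place of `cleaned_insurance.csv`:

```python
import io

import pandas as pd

# Hypothetical stand-in for the cleaned frame.
cleaned = pd.DataFrame({
    "age": [19, 18],
    "smoker": pd.Categorical(["yes", "no"]),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)

# A plain read infers dtypes from the text, so `smoker` is no longer categorical.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Passing dtype= restores the categorical dtype for the named column.
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={"smoker": "category"})

print(plain.dtypes)
print(restored.dtypes)
```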

## Conclusion

In this task, the Medical Insurance dataset was cleaned using Python and Pandas.
The process covered inspecting the dataset structure, confirming that no values
were missing, removing one duplicate row, verifying data types and converting the
three text columns to `category`, deriving a `bmi_category` feature, and exporting
the cleaned data. This task helped strengthen practical skills in data
preprocessing and data handling.
