# Data Cleaning using Python with Pandas Library

Data cleaning and organizing account for 57% of the total work in data science.

The entire data cleaning process is divided into sub-tasks as shown below.

1. Importing the required libraries.
2. Getting the dataset from an external source (Kaggle) and displaying it.
3. Removing the unused or irrelevant columns.
4. Renaming the columns as per our convenience.
5. Replacing values in the rows to make them more meaningful.

## Step 1: Importing the required libraries.

This step imports the libraries used in this notebook: [pandas](https://pandas.pydata.org/) for working with tabular data, [numpy](https://www.numpy.org/) for numerical operations, and Python's built-in [csv](https://docs.python.org/3/library/csv.html) module. pandas does most of the work in the steps that follow.

In [9]:
import pandas as pd
import numpy as np
import csv

## Step 2: Getting the data-set from a different source and displaying the data-set.

This step loads the dataset, which comes from Kaggle, into a DataFrame and displays the first five rows. The link to the dataset is provided below.

[data-set](https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset)

In [10]:
df = pd.read_csv('heart.csv')
df.head(5)

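|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|--------|
| 0 | 52 | 1 | 0 | 125 | 212 | 0 | 1 | 168 | 0 | 1.0 | 2 | 2 | 3 | 0 |
| 1 | 53 | 1 | 0 | 140 | 203 | 1 | 0 | 155 | 1 | 3.1 | 0 | 0 | 3 | 0 |
| 2 | 70 | 1 | 0 | 145 | 174 | 0 | 1 | 125 | 1 | 2.6 | 0 | 0 | 3 | 0 |
| 3 | 61 | 1 | 0 | 148 | 203 | 0 | 1 | 161 | 0 | 0.0 | 2 | 1 | 3 | 0 |
| 4 | 62 | 0 | 0 | 138 | 294 | 1 | 1 | 106 | 0 | 1.9 | 1 | 3 | 2 | 0 |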
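
Before removing anything, it can help to check the shape, column types, and missing-value counts of the data. The following is a minimal sketch and not part of the original notebook: it rebuilds the five rows shown above from an inline string so that it runs without `heart.csv`, and the name `sample_csv` is a hypothetical helper introduced only for this example.

```python
import io

import pandas as pd

# Inline copy of the five rows displayed above (assumption: used here only
# so the sketch is runnable without downloading heart.csv).
sample_csv = """age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
52,1,0,125,212,0,1,168,0,1.0,2,2,3,0
53,1,0,140,203,1,0,155,1,3.1,0,0,3,0
70,1,0,145,174,0,1,125,1,2.6,0,0,3,0
61,1,0,148,203,0,1,161,0,0.0,2,1,3,0
62,0,0,138,294,1,1,106,0,1.9,1,3,2,0
"""

df = pd.read_csv(io.StringIO(sample_csv))

n_rows, n_cols = df.shape                  # number of rows and columns
missing_per_column = df.isna().sum()       # NaN count for each column
n_duplicates = int(df.duplicated().sum())  # rows that repeat an earlier row

print(n_rows, n_cols)
print(df.dtypes)
print(missing_per_column)
print(n_duplicates)
```

On the full dataset the same three checks apply unchanged to the `df` loaded from `heart.csv`.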


## Step 3: Removing the unused or irrelevant columns
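This step removes the columns that are not needed here (cp, fbs, restecg, thalach, exang, oldpeak, slope, ca, thal, and target), leaving only age, sex, trestbps, and chol. The column names are collected in a list and passed to `df.drop`.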

In [11]:
to_drop = ['cp', 'fbs', 'restecg', 
           'thalach', 'exang', 'oldpeak',
           'slope', 'thal', 'target', 'ca']

# Drop the listed columns from df in place.
df.drop(columns=to_drop, inplace=True)
df.head(5)

|   | age | sex | trestbps | chol |
|---|-----|-----|----------|------|
| 0 | 52 | 1 | 125 | 212 |
| 1 | 53 | 1 | 140 | 203 |
| 2 | 70 | 1 | 145 | 174 |
| 3 | 61 | 1 | 148 | 203 |
| 4 | 62 | 0 | 138 | 294 |
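
When most columns are being removed, listing the columns to keep can be shorter than listing the ones to drop. The sketch below is one possible alternative, not the notebook's method. It uses a small hypothetical frame named `sample` with a subset of the dataset's columns, and `not_a_column` is a made-up name included to show how `errors="ignore"` behaves.

```python
import pandas as pd

# Hypothetical two-row frame with a subset of the dataset's columns.
sample = pd.DataFrame({
    "age": [52, 53],
    "sex": [1, 1],
    "cp": [0, 0],
    "trestbps": [125, 140],
    "chol": [212, 203],
    "target": [0, 0],
})

# Option A: drop by name. errors="ignore" skips names that are not present
# instead of raising a KeyError. A new DataFrame is returned.
dropped = sample.drop(columns=["cp", "target", "not_a_column"], errors="ignore")

# Option B: select the columns to keep, in the order wanted.
keep = ["age", "sex", "trestbps", "chol"]
kept = sample[keep]

print(dropped.columns.tolist())
print(kept.equals(dropped))
```

Neither option modifies `sample`, which makes it easier to re-run a cell without hitting errors about columns that were already dropped.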


## Step 4: Renaming the column names as per our convenience.

This step renames the columns, because several of the original names (trestbps and chol, for example) are abbreviated and hard to understand. A dictionary maps each old name to its new name.

In [13]:
new_name = {'age': 'Age', 'sex': 'Sex', 'trestbps': 'Bps', 'chol': 'Cholesterol'}
df.rename(columns=new_name, inplace=True)
df.head()

|   | Age | Sex | Bps | Cholesterol |
|---|-----|-----|-----|-------------|
| 0 | 52 | 1 | 125 | 212 |
| 1 | 53 | 1 | 140 | 203 |
| 2 | 70 | 1 | 145 | 174 |
| 3 | 61 | 1 | 148 | 203 |
| 4 | 62 | 0 | 138 | 294 |
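
By default, `rename` silently ignores dictionary keys that do not match any column, so a typo in an old name goes unnoticed. The sketch below shows one way to guard against that with the `errors="raise"` argument. It uses a hypothetical one-row frame named `sample`, and the misspelled key `cholesterol` is made up for the demonstration.

```python
import pandas as pd

# Hypothetical one-row frame with the four columns kept in the previous step.
sample = pd.DataFrame({"age": [52], "sex": [1], "trestbps": [125], "chol": [212]})
new_name = {"age": "Age", "sex": "Sex", "trestbps": "Bps", "chol": "Cholesterol"}

# errors="raise" makes rename fail if a key is not an existing column.
renamed = sample.rename(columns=new_name, errors="raise")

# A key that is not a column name now raises KeyError instead of being skipped.
try:
    sample.rename(columns={"cholesterol": "Cholesterol"}, errors="raise")
    typo_detected = False
except KeyError:
    typo_detected = True

print(renamed.columns.tolist())
print(typo_detected)
```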


## Step 5: Replacing the value of the rows if necessary.
This step replaces incomplete values or makes existing values more readable. Here the Sex field holds the values 1 and 0, where 1 means Male and 0 means Female. These codes are ambiguous to someone who has not seen the dataset before, so replacing them with understandable labels is a good idea.

In [14]:
replace_values = {0: 'F', 1: 'M'}
df = df.replace({"Sex": replace_values})
df.head()

|   | Age | Sex | Bps | Cholesterol |
|---|-----|-----|-----|-------------|
| 0 | 52 | M | 125 | 212 |
| 1 | 53 | M | 140 | 203 |
| 2 | 70 | M | 145 | 174 |
| 3 | 61 | M | 148 | 203 |
| 4 | 62 | F | 138 | 294 |
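
`replace` leaves any value that is not in the dictionary unchanged, so an unexpected code would remain in the column as a number. As a hedged alternative, `Series.map` turns unmapped values into NaN, which makes them easy to find. The sketch below uses a hypothetical frame named `sample`, and the code `2` is an invented unexpected value that does not come from the dataset.

```python
import pandas as pd

# Hypothetical column; 2 is a made-up code that the mapping does not cover.
sample = pd.DataFrame({"Sex": [1, 1, 0, 2]})
replace_values = {0: "F", 1: "M"}

# map: values missing from the dictionary become NaN.
mapped = sample["Sex"].map(replace_values)

# List the original codes that had no label.
unmapped_codes = sample.loc[mapped.isna(), "Sex"].unique().tolist()

# replace: values missing from the dictionary are kept as they are.
replaced = sample["Sex"].replace(replace_values)

print(mapped.tolist())
print(unmapped_codes)
print(replaced.tolist())
```

Which behavior is preferable depends on the data: `replace` is safer when only some values should change, while `map` is useful when every value is expected to have a label.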
