# Pandas: Cleaning Data

Cleaning data is a crucial step before analyzing a dataset. Pandas provides many tools to handle missing, incorrect, or duplicate data.

## 1. Cleaning Empty Cells

Empty cells, which pandas reads from a CSV file as NaN, can distort calculations and analysis. You can either remove the affected rows or fill the cells with a value.
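
Before removing or filling anything, it can help to see how many cells are actually empty. The sketch below counts missing values per column; the small `sample` DataFrame is hypothetical data used only so the example runs without `data.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for data.csv
sample = pd.DataFrame({
    "Duration": [60, 45, np.nan, 30],
    "Calories": [409.1, np.nan, 340.0, np.nan],
})

# isna() marks each empty cell as True; sum() counts them per column
missing_per_column = sample.isna().sum()
print(missing_per_column)
```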

### 1.1 Remove Rows with Empty Cells

In [None]:
import pandas as pd

df = pd.read_csv('data.csv')

# Remove all rows with any empty cells
df_cleaned = df.dropna()

print(df_cleaned)
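
Dropping every row that has any empty cell can discard more data than intended. As a rough sketch, `dropna()` also accepts a `subset` argument so that only the listed columns are checked; the `sample` DataFrame here is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for data.csv
sample = pd.DataFrame({
    "Duration": [60, 45, np.nan, 30],
    "Calories": [409.1, np.nan, 340.0, 300.5],
})

# Drop a row only when 'Calories' is empty; a missing 'Duration' is tolerated
calories_present = sample.dropna(subset=["Calories"])
print(calories_present)
```

Like the call above, this returns a new DataFrame and leaves `sample` unchanged.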


### 1.2 Fill Empty Cells

You can replace empty cells with a default value or a calculated value:

In [None]:
# Fill empty cells with 0
df_filled = df.fillna(0)

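# Fill empty cells in 'Calories' with that column's mean
# (passing a single number to fillna() would fill every column with it)
df_filled_mean = df.fillna({'Calories': df['Calories'].mean()})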

print(df_filled_mean)
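
When several numeric columns have gaps, each one can be filled with its own mean in a single call. This is a minimal sketch on hypothetical data: `mean(numeric_only=True)` returns one mean per column, and `fillna()` matches those values to columns by name.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for data.csv
sample = pd.DataFrame({
    "Duration": [60, 45, np.nan, 30],
    "Calories": [400.0, np.nan, 300.0, 200.0],
})

# One mean per numeric column: Duration -> 45.0, Calories -> 300.0
column_means = sample.mean(numeric_only=True)

# Each column's empty cells are filled with that column's own mean
filled = sample.fillna(column_means)
print(filled)
```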


## 2. Cleaning Data of Wrong Format

Sometimes data is stored as the wrong type, such as numbers stored as strings. You can convert a column to the right type with astype():

In [None]:
# Convert 'Duration' column to integer
# (raises an error if the column still has empty or non-numeric cells)
df['Duration'] = df['Duration'].astype(int)

# Convert 'Calories' column to float
df['Calories'] = df['Calories'].astype(float)
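
If a column mixes valid numbers with stray text, `astype()` stops at the first bad value. One possible alternative, sketched here on hypothetical data, is `pd.to_numeric()` with `errors="coerce"`, which converts what it can and turns the rest into NaN so it can be handled like any other empty cell.

```python
import pandas as pd

# Hypothetical column where numbers were stored as strings, with one bad entry
sample = pd.DataFrame({"Duration": ["60", "45", "n/a", "30"]})

# errors="coerce" replaces values that cannot be parsed with NaN
sample["Duration"] = pd.to_numeric(sample["Duration"], errors="coerce")
print(sample["Duration"])
```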


## 3. Fixing Wrong Data

Data may contain incorrect values, such as negative numbers for a column that should only have positives.

In [None]:
# Replace negative values in 'Calories' with 0
df['Calories'] = df['Calories'].clip(lower=0)

# Cap 'Pulse' values above 100 at 100
df['Pulse'] = df['Pulse'].clip(upper=100)
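
Replacing a wrong value is not the only option; sometimes the whole row is better removed. The sketch below, on hypothetical data, builds a boolean mask for the bad rows and then shows both choices side by side.

```python
import pandas as pd

# Hypothetical sample data with one impossible 'Calories' value
sample = pd.DataFrame({
    "Pulse": [110, 95, 130, 100],
    "Calories": [409.1, -20.0, 340.0, 300.5],
})

# Boolean mask: True for rows whose 'Calories' value is negative
negative = sample["Calories"] < 0

# Option 1: overwrite only the flagged cells
fixed = sample.copy()
fixed.loc[negative, "Calories"] = 0

# Option 2: keep only the rows that are not flagged
dropped = sample[~negative]

print(fixed)
print(dropped)
```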


## 4. Removing Duplicates

Duplicate rows can skew analysis. Use drop_duplicates() to remove them:

In [None]:
# Remove duplicate rows
df_no_duplicates = df.drop_duplicates()

# Remove rows that repeat a 'Calories' value (the first occurrence is kept)
df_unique_calories = df.drop_duplicates(subset=['Calories'])

print(df_no_duplicates)
print(df_unique_calories)
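
It is often worth inspecting duplicates before deleting them. As a small sketch on hypothetical data, `duplicated()` flags each row that repeats an earlier one, and the `keep` argument of `drop_duplicates()` controls which copy survives.

```python
import pandas as pd

# Hypothetical sample data: rows 0, 1 and 3 are identical
sample = pd.DataFrame({
    "Duration": [60, 60, 45, 60],
    "Calories": [409.1, 409.1, 282.4, 409.1],
})

# True for every row that repeats an earlier row; the first copy is False
duplicate_flags = sample.duplicated()
print(duplicate_flags)

# keep="last" retains the final occurrence instead of the first
deduplicated = sample.drop_duplicates(keep="last")
print(deduplicated)
```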
