
# **Cleaning techniques**

In [None]:
import pandas as pd
import numpy as np

# Creating a sample DataFrame with missing values
data = {
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, 4, 5],
    'C': [1, 2, 3, 4, 5]
}

df = pd.DataFrame(data)
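
Before choosing a cleaning technique, it helps to see where the gaps are. The following is a minimal sketch that rebuilds the sample DataFrame and counts missing values; the variable names `missing_per_column` and `rows_with_missing` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Same sample data as above
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, 4, 5],
    'C': [1, 2, 3, 4, 5],
})

# Number of missing values in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_missing = df[df.isna().any(axis=1)]
print(rows_with_missing)
```

Here columns `A` and `B` each have one missing value, in different rows, while `C` is complete.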


**1. Dropping Missing Values:**

`dropna()` drops every row that contains at least one missing value. It is the most straightforward approach, but it can discard a large share of the data when missing values are spread across many rows.

In [None]:
# Drop rows with any missing values
df_dropped = df.dropna()
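
When dropping every incomplete row is too aggressive, `dropna()` accepts parameters that narrow what gets removed. The sketch below uses the same sample data; the names `df_subset`, `df_thresh`, and `df_cols` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, 4, 5],
    'C': [1, 2, 3, 4, 5],
})

# Default: drop rows with a missing value in any column
df_dropped = df.dropna()

# Only consider column 'A' when deciding which rows to drop
df_subset = df.dropna(subset=['A'])

# Keep rows that have at least 2 non-missing values
df_thresh = df.dropna(thresh=2)

# Drop columns (instead of rows) that contain missing values
df_cols = df.dropna(axis=1)

print(len(df), len(df_dropped), len(df_subset), len(df_thresh))
print(list(df_cols.columns))
```

On this sample, the default call keeps 3 of 5 rows, `subset=['A']` keeps 4, and `thresh=2` keeps all 5 because every row has at least two valid entries.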


**2. Filling Missing Values:**

**Forward Fill (ffill):** This method fills each missing value with the preceding non-missing value in the column. It is useful when a missing value is expected to equal the previous entry. A missing value at the very start of a column has no predecessor and stays missing.

**Backward Fill (bfill):** This method fills each missing value with the next non-missing value in the column. It is useful when a missing value is expected to equal the following entry. A missing value at the very end of a column stays missing.

**Mean Fill:** Filling missing values with the mean of the column. It is suitable for continuous numerical data, but the mean is sensitive to outliers.

**Median Fill:** Filling missing values with the median of the column. It is robust to outliers and is also suitable for numerical data.

**Mode Fill:** Filling missing values with the mode (most frequent value) of the column. It is suitable for categorical or discrete data.

In [None]:
# Forward fill missing values
df_forward_filled = df.ffill()

# Backward fill missing values
df_backward_filled = df.bfill()

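# Fill missing values with a per-column statistic (mean, median, or mode)
df_mean_filled = df.fillna(df.mean())
df_median_filled = df.fillna(df.median())
# mode() can return several rows when values tie; iloc[0] takes the first (smallest) mode per column
df_mode_filled = df.fillna(df.mode().iloc[0])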
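
The fill methods behave differently at the edges of a column, which is easy to miss. The sketch below, on the same sample data, checks what each method does to the leading missing value in column `B` and the interior one in column `A`; chaining `ffill()` and `bfill()` is one possible way to cover both edges, not the only one.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, 4, 5],
    'C': [1, 2, 3, 4, 5],
})

# Forward fill: the gap in 'A' takes the previous value (2.0),
# but the first row of 'B' has nothing before it and stays NaN
ffilled = df.ffill()
print(ffilled)

# Forward fill, then backward fill whatever is still missing
both = df.ffill().bfill()
print(both.isna().sum().sum())  # total remaining missing values

# Mean fill: each column uses its own mean, computed over non-missing values
mean_filled = df.fillna(df.mean())
print(mean_filled)
```

For column `A` the mean is computed from the four known values (1, 2, 4, 5), so the gap is filled with 3.0; for `B` it is filled with 3.5.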


**3. Interpolation:**

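This method estimates each missing value from the neighbouring data points in the same column. By default, `interpolate()` uses linear interpolation and treats the rows as equally spaced, so a gap is filled with a value on the straight line between the known values on either side. For time-series data, this lets the fill follow the trend between existing data points. With the default settings, missing values at the start of a column are not filled.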

In [None]:
# Linear interpolation for missing values
df_interpolated = df.interpolate()
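
Two details of `interpolate()` are worth checking on real data: how leading gaps are handled, and whether row spacing matters. The sketch below is illustrative only; the time series `s` and its dates are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, 4, 5],
    'C': [1, 2, 3, 4, 5],
})

# Default linear interpolation fills the interior gap in 'A',
# but leaves the leading gap in 'B' untouched
interp_default = df.interpolate()
print(interp_default)

# limit_direction='both' also fills leading gaps; with the linear method
# the leading gap takes the nearest valid value
interp_both = df.interpolate(limit_direction='both')
print(interp_both)

# Hypothetical irregularly spaced time series: the gap is 1 day after the
# first point and 2 days before the last one
s = pd.Series(
    [1.0, np.nan, 4.0],
    index=pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-04']),
)
by_position = s.interpolate()            # ignores the dates
by_time = s.interpolate(method='time')   # weights by elapsed time
print(by_position.iloc[1], by_time.iloc[1])
```

With irregular timestamps, `method='time'` places the estimate according to elapsed time rather than row position, so the two results differ.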


**4. Imputation with Scikit-Learn:**



**SimpleImputer:** This is a class from Scikit-Learn that provides various strategies for imputing missing values.

**Mean Imputation:** Filling missing values with the mean of the column.

**Median Imputation:** Filling missing values with the median of the column.

**Constant Imputation:** Filling missing values with a constant value (in this case, 0).

In [None]:
from sklearn.impute import SimpleImputer

# Impute missing values with mean, median, or a constant.
# fit_transform returns a NumPy array, so rebuild the DataFrame with the original labels.
imputer_mean = SimpleImputer(strategy='mean')
df_imputed_mean = pd.DataFrame(imputer_mean.fit_transform(df), columns=df.columns, index=df.index)

imputer_median = SimpleImputer(strategy='median')
df_imputed_median = pd.DataFrame(imputer_median.fit_transform(df), columns=df.columns, index=df.index)

imputer_constant = SimpleImputer(strategy='constant', fill_value=0)
df_imputed_constant = pd.DataFrame(imputer_constant.fit_transform(df), columns=df.columns, index=df.index)
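
One practical difference from `fillna` is that a fitted `SimpleImputer` remembers the statistics it learned, so the same fill values can be reused on data it has not seen. The following is a minimal sketch of that pattern; the `train` and `new_data` frames are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data the imputer learns from
train = pd.DataFrame({
    'A': [1.0, 2.0, np.nan, 4.0, 5.0],
    'B': [np.nan, 2.0, 3.0, 4.0, 5.0],
})

# Hypothetical later data with its own gaps
new_data = pd.DataFrame({
    'A': [np.nan, 10.0],
    'B': [7.0, np.nan],
})

imputer = SimpleImputer(strategy='mean')
imputer.fit(train)

# Per-column fill values learned from `train`
print(imputer.statistics_)

# Apply the learned means to the new data without refitting
new_imputed = pd.DataFrame(
    imputer.transform(new_data),
    columns=new_data.columns,
    index=new_data.index,
)
print(new_imputed)
```

The gaps in `new_data` are filled with the means of `train` (3.0 for `A`, 3.5 for `B`), not with statistics of `new_data` itself.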


**5. Removing Duplicates:**

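This method removes duplicate rows from the DataFrame, keeping only the first occurrence. By default, a row counts as a duplicate only if it matches an earlier row in every column. It is useful when dealing with datasets that may contain repeated entries. Be cautious when using it: rows that look identical can still be legitimate separate records, and dropping them loses potentially valuable data.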

In [None]:
# Remove duplicate rows
df_no_duplicates = df.drop_duplicates()
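
The sample DataFrame above has no repeated rows, so the effect is easier to see on data that does. The sketch below uses a small made-up frame (`df_dup`, with hypothetical `id` and `value` columns) to show counting duplicates first, and then the `subset` and `keep` parameters.

```python
import pandas as pd

# Hypothetical data: row 1 repeats row 0 exactly; rows 2 and 3 share an id only
df_dup = pd.DataFrame({
    'id': [1, 1, 2, 2, 3],
    'value': [10, 10, 20, 25, 30],
})

# Count fully duplicated rows before removing anything
n_dupes = df_dup.duplicated().sum()
print(n_dupes)

# Default: drop rows identical in every column, keeping the first occurrence
full = df_dup.drop_duplicates()
print(full)

# Treat rows as duplicates when 'id' matches, keeping the last occurrence
by_id_last = df_dup.drop_duplicates(subset=['id'], keep='last')
print(by_id_last)
```

Checking `duplicated()` first makes it clear how many rows will be removed, which helps avoid dropping records that only appear to be duplicates.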
