
# Understanding Missing Completely at Random (MCAR) in Pandas
Data is considered Missing Completely At Random (MCAR) when the probability of a data point being missing is unrelated to any observed or unobserved data. In other words, the missingness is entirely random and does not follow any systematic pattern.

## When Does MCAR Happen?

- Accidental skips in survey responses.
- Data entry errors where random values are left blank.
- Technical issues where random entries fail to be recorded.

## Example of MCAR
Imagine a survey where respondents are asked to fill in their height and weight. If some people accidentally skip the height question purely due to oversight, this would result in missing data that's MCAR. The reason for the missing values is random and doesn’t depend on any other information in the dataset (like weight or age).
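
To make the idea concrete, here is a minimal sketch that simulates MCAR on synthetic data. The `survey` DataFrame, its distributions, and the 10% skip rate are assumptions chosen for illustration, not values from a real survey.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is reproducible

# Hypothetical survey with complete answers (synthetic values)
n = 1000
survey = pd.DataFrame({
    "height": rng.normal(5.7, 0.3, n),
    "weight": rng.normal(78, 10, n),
    "age": rng.integers(18, 65, n),
})

# MCAR: every respondent has the same 10% chance of skipping the height
# question, regardless of their height, weight, or age
skipped = rng.random(n) < 0.10
survey_mcar = survey.copy()
survey_mcar.loc[skipped, "height"] = np.nan

# Share of height values that ended up missing (close to 0.10)
print(survey_mcar["height"].isna().mean())
```

Because the `skipped` mask is drawn independently of every column, the respondents with a missing height are a random subset of all respondents, which is exactly what MCAR requires.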




## How to Detect MCAR in Pandas

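A useful first step is to count the missing values in each column. The counts show how much is missing, but not why, so they cannot confirm MCAR on their own. Below is an example of a DataFrame with missing values scattered across its columns.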


In [None]:
import pandas as pd
import numpy as np

# Sample data with random missing values
data = {
    'height': [5.9, np.nan, 5.5, 6.1, np.nan],
    'weight': [72, 80, np.nan, 90, 85],
    'age': [25, 30, 22, np.nan, 28]
}
df = pd.DataFrame(data)
print(df)

# Check the pattern of missing data
missing_data = df.isnull().sum()
missing_data


   height  weight   age
0     5.9    72.0  25.0
1     NaN    80.0  30.0
2     5.5     NaN  22.0
3     6.1    90.0   NaN
4     NaN    85.0  28.0


height    2
weight    1
age       1
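
Counts alone do not say whether the missingness is random. One informal check, sketched below, is to split the rows by whether `height` is missing and compare the other columns across the two groups. Under MCAR the group means should be similar. The names `df_check` and `height_status` are introduced here for illustration, and with only five rows the comparison demonstrates the mechanics rather than proving anything.

```python
import numpy as np
import pandas as pd

# Same sample values as the DataFrame above, rebuilt so the sketch is self-contained
df_check = pd.DataFrame({
    "height": [5.9, np.nan, 5.5, 6.1, np.nan],
    "weight": [72, 80, np.nan, 90, 85],
    "age": [25, 30, 22, np.nan, 28],
})

# Label each row by whether its height value is missing or observed
height_status = df_check["height"].isna().map({True: "missing", False: "observed"})

# Mean weight and age within each group; large gaps would hint that
# missingness depends on these columns, which would contradict MCAR
group_means = df_check[["weight", "age"]].groupby(height_status).mean()
print(group_means)
```

On a real dataset, a formal procedure such as Little's MCAR test is the more rigorous route; it is not part of pandas itself.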


## How to Handle MCAR Data in Pandas

### Option 1: Drop Missing Values

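If the missing data is MCAR and doesn’t constitute a large portion of your dataset, you can drop rows or columns with missing values. Under MCAR the remaining rows are still a random sample of the original data, so dropping does not bias your estimates, but it does shrink the sample.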
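
Before dropping anything, it can help to measure how much data would be lost. The sketch below rebuilds the sample DataFrame under the assumed name `df_drop`, then compares dropping every incomplete row with dropping only the rows that lack a `height`.

```python
import numpy as np
import pandas as pd

# Same sample values as the DataFrame above, rebuilt so the sketch is self-contained
df_drop = pd.DataFrame({
    "height": [5.9, np.nan, 5.5, 6.1, np.nan],
    "weight": [72, 80, np.nan, 90, 85],
    "age": [25, 30, 22, np.nan, 28],
})

# Fraction of missing values in each column
missing_share = df_drop.isna().mean()
print(missing_share)

# Fraction of rows that dropping every incomplete row would remove
rows_lost = 1 - len(df_drop.dropna()) / len(df_drop)
print(rows_lost)

# Gentler alternative: drop only the rows where height is missing
df_height_known = df_drop.dropna(subset=["height"])
print(df_height_known)
```

In this small sample each column is at most 40% missing, yet dropping every incomplete row removes four of the five rows, because the gaps fall in different rows.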

### Option 2: Fill Missing Values

For continuous data, fill missing values with a summary statistic of the column, such as the mean or, when the column is skewed or has outliers, the median.

### Option 3: Interpolation

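Interpolation fits a function to the known data points and uses it to estimate values that lie within their range. Linear interpolation, the pandas default, draws a straight line between the nearest known neighbors of each gap, so it is meaningful only when the row order carries information, as in a time series.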
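
One detail worth knowing is that `DataFrame.interpolate()` with its default settings also fills trailing gaps by repeating the last known value, which is not true interpolation. The sketch below, using the assumed name `df_interp`, contrasts the default with `limit_area="inside"`, which restricts filling to gaps surrounded by known values.

```python
import numpy as np
import pandas as pd

# Same sample values as the DataFrame above, rebuilt so the sketch is self-contained
df_interp = pd.DataFrame({
    "height": [5.9, np.nan, 5.5, 6.1, np.nan],
    "weight": [72, 80, np.nan, 90, 85],
    "age": [25, 30, 22, np.nan, 28],
})

# Default linear interpolation: the trailing missing height in row 4
# is padded with the last known value (6.1)
default_fill = df_interp.interpolate()

# limit_area="inside" fills only gaps that have known values on both sides,
# so the trailing missing height in row 4 stays NaN
inside_only = df_interp.interpolate(limit_area="inside")

print(default_fill)
print(inside_only)
```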


In [None]:
# Option 1: Drop rows with any missing values
df_dropped = df.dropna()

# Option 2: Fill missing values with the mean of each column
df_filled = df.fillna(df.mean())

# Option 3: Use linear interpolation to fill missing values
df_interpolated = df.interpolate()

# Display results
print("Dropping data:\n", df_dropped, "\n")
print("Filling data:\n", df_filled, "\n")
print("Interpolating:\n", df_interpolated)


Dropping data:
    height  weight   age
0     5.9    72.0  25.0 

Filling data:
      height  weight    age
0  5.900000   72.00  25.00
1  5.833333   80.00  30.00
2  5.500000   81.75  22.00
3  6.100000   90.00  26.25
4  5.833333   85.00  28.00 

Interpolating:
    height  weight   age
0     5.9    72.0  25.0
1     5.7    80.0  30.0
2     5.5    85.0  22.0
3     6.1    90.0  25.0
4     6.1    85.0  28.0
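
The three options are not equivalent in their side effects. As a rough sketch, the snippet below compares each column's standard deviation before and after mean filling. The names `df_stats`, `filled`, and `comparison` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Same sample values as the DataFrame above, rebuilt so the sketch is self-contained
df_stats = pd.DataFrame({
    "height": [5.9, np.nan, 5.5, 6.1, np.nan],
    "weight": [72, 80, np.nan, 90, 85],
    "age": [25, 30, 22, np.nan, 28],
})

# Fill each missing value with its column mean
filled = df_stats.fillna(df_stats.mean())

# Standard deviation of the observed values vs. the mean-filled column
comparison = pd.DataFrame({
    "observed_std": df_stats.std(),      # NaN values are skipped by default
    "mean_filled_std": filled.std(),
})
print(comparison)
```

Mean filling leaves each column's mean unchanged but reduces its spread, because every filled value sits exactly at the center of the distribution. That matters if later analysis relies on variances or correlations.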
