# 1. Remove Values:
## Complete Case Analysis (CCA):
- Also called "**list-wise deletion**"

- Discards every row that has a missing value in any column.

- **CCA means:** Analyzing only those rows which have information in all columns.

## Assumptions for Using CCA:
1. **Missing Completely At Random (MCAR):**
   - The likelihood of a value being missing is completely unrelated to any **observed** or **unobserved** data.

   - Example: A survey question gets skipped randomly due to a printing error.

2. **Sufficient Sample Size After Deletion:**
   - Enough data should remain after **dropping incomplete rows** to maintain statistical power.
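
MCAR cannot be proven from the data alone, but a rough diagnostic is to check whether rows with and without a missing value look alike on the other observed columns. A minimal sketch on a small made-up frame (the columns `age` and `income` and the helper `compare_by_missingness` are hypothetical, introduced only for illustration):

```python
import numpy as np
import pandas as pd


def compare_by_missingness(df, target_col, other_col):
    """Mean of `other_col`, split by whether `target_col` is missing."""
    labels = df[target_col].isnull().map({True: "missing", False: "observed"})
    return df.groupby(labels)[other_col].mean()


# Toy data: `income` is missing mostly for older respondents,
# so missingness appears related to `age` and MCAR looks doubtful.
toy = pd.DataFrame({
    "age": [22, 25, 31, 38, 45, 52, 58, 63],
    "income": [30.0, 35.0, 41.0, 50.0, np.nan, np.nan, 72.0, np.nan],
})

means = compare_by_missingness(toy, "income", "age")
print(means)
```

A large gap between the two group means suggests that missingness depends on an observed column, in which case the MCAR assumption is questionable. Similar group means do not prove MCAR, since missingness could still depend on unobserved values.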


## When to Use CCA:

1. **MCAR is reasonable:**
   - If you have evidence or strong belief that missingness is purely random.

2. **Low Proportion of Missing Data:**
    - **If less than ~5% of your data is missing**, CCA usually has little effect on the results (a common rule of thumb, not a guarantee).

    - But still **make sure the lost rows don't compromise the model's accuracy or statistical power**.
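
The share of missing values per column can understate how many rows list-wise deletion removes, because a row is dropped if any one of its columns is missing. A minimal sketch on a made-up frame (the helper name `cca_row_loss` and the toy columns are assumptions for illustration):

```python
import numpy as np
import pandas as pd


def cca_row_loss(df):
    """Fraction of rows that list-wise deletion would discard."""
    return df.isnull().any(axis=1).mean()


toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0, 5.0],
    "b": ["x", "y", None, "w", "v"],
    "c": [10, 20, 30, 40, 50],
})

per_column = toy.isnull().mean()   # share of missing values per column
row_loss = cca_row_loss(toy)       # share of rows with at least one gap

print(per_column)
print(f"Rows lost under CCA: {row_loss:.0%}")
```

Here no single column is missing more than one value in five, yet the gaps fall in different rows, so the share of rows lost is larger than any per-column share.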

## Advantages/Disadvantages:

**Advantages of CCA:**
- **Simple to implement** – No complex modeling or imputation needed.

- **Maintains data integrity** (only observed values are used; nothing is imputed)

- **Preserves the variable distribution** (if the data is MCAR, the distribution of each variable in the reduced dataset should match its distribution in the original dataset)

**Disadvantages of CCA:**
- **Data loss:** Can significantly reduce the sample size.

- **Biased results:** If the data is not MCAR, the results can be biased.

- **Reduced statistical power:**
  - Less data = lower ability to detect true effects.

  - With fewer samples left, it is harder for **statistical tests** to **confidently** find real **patterns**.

- **Inefficient:** Wastes the partial information in incomplete rows.

- **Not robust to MAR/MNAR:** Breaks down when missingness depends on observed values (MAR) or on the unobserved values themselves (MNAR, also written NMAR).
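
The contrast between MCAR and non-MCAR data can be seen in a small simulation. The sketch below uses synthetic data only: the variables `x` and `y`, the 30% missingness rate, and the seed are all arbitrary choices for illustration, not part of the dataset used later.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical variables: x is always observed, y may go missing.
x = rng.normal(size=n)
y = x + rng.normal(size=n)  # the true mean of y is 0

# MCAR: every y value has the same 30% chance of being missing.
mcar_missing = rng.random(n) < 0.3

# MAR: y is missing whenever the observed x is above its 70th percentile.
mar_missing = x > np.quantile(x, 0.7)

full_mean = y.mean()
mcar_mean = y[~mcar_missing].mean()  # complete cases under MCAR
mar_mean = y[~mar_missing].mean()    # complete cases under MAR

print(f"full data:            {full_mean:.3f}")
print(f"complete cases, MCAR: {mcar_mean:.3f}")
print(f"complete cases, MAR:  {mar_mean:.3f}")
```

Under MCAR the complete-case mean stays close to the full-data mean. Under the MAR mechanism the rows with large `x` (and therefore large `y`) are dropped, so the complete-case mean is pulled downward.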



![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

In [2]:
# Let's work on an example:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt



In [3]:
df = pd.read_csv('../datasets/data_science_job.csv')
df.sample(5)


| | enrollee_id | city | city_development_index | gender | relevent_experience | enrolled_university | education_level | major_discipline | experience | company_size | company_type | training_hours | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8100 | 7509 | city_103 | 0.92 | NaN | Has relevent experience | no_enrollment | Graduate | STEM | 20.0 | 10000+ | Pvt Ltd | 36.0 | 0.0 |
| 13388 | 5972 | city_97 | 0.925 | Male | Has relevent experience | no_enrollment | Masters | STEM | 16.0 | 1000-4999 | Pvt Ltd | 8.0 | 0.0 |
| 3068 | 31140 | city_16 | 0.91 | Male | No relevent experience | no_enrollment | Primary School | NaN | 0.0 | NaN | NaN | 31.0 | 0.0 |
| 1867 | 14572 | city_21 | 0.624 | Male | Has relevent experience | no_enrollment | Masters | STEM | 8.0 | 50-99 | Public Sector | 33.0 | 1.0 |
| 12971 | 22076 | city_114 | 0.926 | Male | Has relevent experience | no_enrollment | High School | NaN | 5.0 | 500-999 | Pvt Ltd | 10.0 | 0.0 |


In [4]:
df.shape

(19158, 13)

In [5]:
df.isnull().sum()

enrollee_id                  0
city                         0
city_development_index     479
gender                    4508
relevent_experience          0
enrolled_university        386
education_level            460
major_discipline          2813
experience                  65
company_size              5938
company_type              6140
training_hours             766
target                       0
dtype: int64

In [6]:
df.isnull().mean() * 100   # Percentage missing values in each column

enrollee_id                0.000000
city                       0.000000
city_development_index     2.500261
gender                    23.530640
relevent_experience        0.000000
enrolled_university        2.014824
education_level            2.401086
major_discipline          14.683161
experience                 0.339284
company_size              30.994885
company_type              32.049274
training_hours             3.998330
target                     0.000000
dtype: float64

In [7]:
df.dropna().shape  # shape after dropping every row that has at least one missing value

(8434, 13)

- Dropping every row that has a missing value shrinks the dataframe drastically, from 19,158 rows to 8,434 (about 44% of the rows remain).

- Hence, applying **CCA (complete case analysis)** across all columns isn't feasible.

- So we'll apply CCA only to the *columns with less than 5% missing values*.

In [8]:
# Columns that have some missing values, but fewer than 5%
cols = [col for col in df.columns if 0 < df[col].isnull().mean() < 0.05]
cols

['city_development_index',
 'enrolled_university',
 'education_level',
 'experience',
 'training_hours']

In [9]:
df[cols].dropna().shape  # Drop rows with a missing value in any of the above columns (each < 5% missing); the row count is not significantly reduced.

(17182, 5)
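
Note that `df[cols].dropna()` returns only the five selected columns. If the goal is to keep every column while dropping only the rows with gaps in the low-missingness columns, `DataFrame.dropna` accepts a `subset` argument. A minimal sketch on a made-up frame (the toy columns `full`, `low`, and `high` are hypothetical and stand in for the real dataset):

```python
import numpy as np
import pandas as pd

# Made-up frame: `low` has 2% missing, `high` has 40% missing, `full` has none.
toy = pd.DataFrame({
    "full": np.arange(100),
    "low": np.arange(100, dtype=float),
    "high": np.arange(100, dtype=float),
})
toy.loc[[3, 57], "low"] = np.nan
toy.loc[:39, "high"] = np.nan  # label-based slice, inclusive: rows 0..39

# Same selection rule as above: some missing values, but fewer than 5%.
missing_share = toy.isnull().mean()
cols = [col for col in toy.columns if 0 < missing_share[col] < 0.05]

# Drop rows only where the selected columns have gaps; keep all columns.
reduced = toy.dropna(subset=cols)

print(cols)
print(reduced.shape)
```

The high-missingness column is left untouched here, so it still needs a separate strategy (for example, imputation) before modeling.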