### Team Member Details
- **Group Name**: Data Explorer
- **Members**:
  - Name: Mohammad Tohin Bapari, Email: tohin@gmx.de, Country: Germany, University: Bergische Universität Wuppertal, Specialization: Data Science
  

### Problem Description
The dataset contains patient attributes together with a persistency flag, which indicates whether each patient was persistent or non-persistent in their treatment. The goal is to clean and transform this dataset by handling missing values and outliers.


### GitHub Repo Link
https://github.com/iamtohin/Data-Analytst-Internship-at-Data-Glacier/tree/main/Week%209


### Data Cleansing and Transformation

#### Step 1: Handling Missing Values
We will use two techniques to handle missing values: 
1. Imputation using the mean/median/mode.
2. Model-based imputation.
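
Model-based imputation predicts each missing value from the other columns instead of filling in a single summary statistic. The block below is a minimal sketch of that idea, assuming scikit-learn's experimental `IterativeImputer` as the model; the toy frame and its column names `x` and `y` are hypothetical and not taken from the healthcare dataset.

```python
import numpy as np
import pandas as pd
# IterativeImputer is experimental in scikit-learn and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical toy data: y is exactly 2 * x, with one value missing
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [2.0, 4.0, np.nan, 8.0, 10.0, 12.0],
})

# Each column with missing values is regressed on the other columns
imputer = IterativeImputer(random_state=0)
imputed = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

# The missing y should land close to 6, following the x-y relationship
print(imputed)
```

This approach only applies to numerical columns; categorical columns would need to be encoded first or imputed separately.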

#### Step 2: Handling Outliers
We will use the following techniques to handle outliers:
1. Interquartile Range (IQR) method.
2. Z-score method.
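
To make the two rules concrete, the sketch below applies them to a made-up series (the values and variable names are hypothetical). The IQR rule flags points outside `[Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]`, and the Z-score rule flags points at least 3 standard deviations away from the mean.

```python
import pandas as pd

# Hypothetical sample: twenty ordinary values and one extreme value
values = pd.Series([10, 11, 12, 13] * 5 + [100])

# IQR rule: flag values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

# Z-score rule: flag values with |z| >= 3 (population standard deviation)
z = (values - values.mean()) / values.std(ddof=0)
z_outliers = values[z.abs() >= 3]

print(iqr_outliers.tolist())  # [100]
print(z_outliers.tolist())    # [100]
```

In very small samples the two rules can disagree: with a population standard deviation, a single point's Z-score cannot exceed `sqrt(n - 1)`, so with 10 or fewer rows no point can exceed the threshold of 3.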

### Code Implementation

#### Handling Missing Values

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Load the dataset
file_path = 'Healthcare_dataset.xlsx'
data = pd.read_excel(file_path, sheet_name='Dataset')

# Identify missing values per column before imputation
missing_values = data.isnull().sum()
print(missing_values)

# Imputers: mean for numerical columns, mode (most frequent value) for categorical columns
imputer_mean = SimpleImputer(strategy='mean')
imputer_mode = SimpleImputer(strategy='most_frequent')

# Numerical columns (skipped if there are none, because SimpleImputer rejects empty input)
# Note: mean imputation converts integer columns to float
num_cols = data.select_dtypes(include=['float64', 'int64']).columns
if len(num_cols) > 0:
    data[num_cols] = imputer_mean.fit_transform(data[num_cols])

# Categorical columns
cat_cols = data.select_dtypes(include=['object']).columns
if len(cat_cols) > 0:
    data[cat_cols] = imputer_mode.fit_transform(data[cat_cols])

# Verify that all missing values are handled
missing_values_after_imputation = data.isnull().sum()
print(missing_values_after_imputation)
```
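
The imputation techniques listed earlier also name the median, which is usually preferred over the mean for skewed columns because a few extreme values do not pull it. A minimal sketch of the difference, using a hypothetical `visits` column that is not part of the healthcare dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical skewed column with one extreme value and one missing value
toy = pd.DataFrame({"visits": [1.0, 2.0, 3.0, 1000.0, np.nan]})

# Fill the missing entry with the mean and with the median for comparison
mean_filled = toy["visits"].fillna(toy["visits"].mean())
median_filled = toy["visits"].fillna(toy["visits"].median())

print(mean_filled.iloc[-1])    # 251.5, dominated by the extreme value
print(median_filled.iloc[-1])  # 2.5, close to the typical values
```

With `SimpleImputer`, the equivalent choice would be `strategy='median'`.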

#### Handling Outliers

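```python
import numpy as np
from scipy import stats

# num_cols comes from the missing-value step above

# Handle outliers using the IQR method
Q1 = data[num_cols].quantile(0.25)
Q3 = data[num_cols].quantile(0.75)
IQR = Q3 - Q1

# Remove rows where any numerical value lies outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outlier_mask = ((data[num_cols] < lower_bound) | (data[num_cols] > upper_bound)).any(axis=1)
data = data[~outlier_mask]

# Handle outliers using the Z-score method (applied to the rows that passed the IQR filter)
z_scores = np.abs(stats.zscore(data[num_cols].to_numpy(dtype=float)))
# Constant columns have zero standard deviation and produce NaN z-scores;
# treat them as 0 so they do not cause every row to be dropped
z_scores = np.nan_to_num(z_scores, nan=0.0)
data = data[(z_scores < 3).all(axis=1)]

# Verify the data after handling outliers
print(data.shape)
```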
