## **Import Libraries and Data**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## **Task: Preprocess data for further analysis.**
1. `Data Cleaning`
   1. Trimming and Standardizing Column Names
   2. Correcting Inconsistent Data Entries
   3. Converting data types
2. `Anomaly Detection`
3. `Imputation`
   1. Repair the rows that contain anomalies instead of deleting them.
   2. Replace each anomalous value with the numeric value it was meant to be.

In [2]:
# Read data

data = pd.read_excel("Hospital Data with Mixed Numbers and Characters.xlsx")
display(data)

Unnamed: 0,Description,Basic Unit,Beginning Balance,Quantity Received,Quantity Dispensed,Losses and Adjustments,Ending Balance (Physical Count)
0,Tab. Cefuroxime 250mg,10,1000,0,0,0,0
1,Tab. Cefuroxime 250mg,10,1000,0,0,0,0
2,Tab. Cefuroxime 250mg,10,1000,0,0,0,0
3,Cap Ampicillin,100,350,0,0,0,0
4,Cap Ampicillin,100,350,0,0,0,0
...,...,...,...,...,...,...,...
5841,Inj.Vitamin B Complex,Vial,145,30,4350,0,30
5842,Insulin (Actrapid),Vial,0,0,0,0,0
5843,Insulin (Insulutard),Vial,0,2,0,0,2
5844,Inj Amoxicillin 500mg,vials,54,100,79,0,75


**Create versions of the data, original and working copy.**

In [3]:
# Keep an untouched copy of the original data in case it is needed later.

data_original = data.copy()
data_original.head()

Unnamed: 0,Description,Basic Unit,Beginning Balance,Quantity Received,Quantity Dispensed,Losses and Adjustments,Ending Balance (Physical Count)
0,Tab. Cefuroxime 250mg,10,1000,0,0,0,0
1,Tab. Cefuroxime 250mg,10,1000,0,0,0,0
2,Tab. Cefuroxime 250mg,10,1000,0,0,0,0
3,Cap Ampicillin,100,350,0,0,0,0
4,Cap Ampicillin,100,350,0,0,0,0


In [4]:
# Understand Metadata

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5846 entries, 0 to 5845
Data columns (total 7 columns):
 #   Column                           Non-Null Count  Dtype 
---  ------                           --------------  ----- 
 0   Description                      5846 non-null   object
 1   Basic Unit                       5846 non-null   object
 2   Beginning Balance                5846 non-null   object
 3   Quantity Received                5846 non-null   object
 4   Quantity Dispensed               5846 non-null   object
 5   Losses and Adjustments           5846 non-null   object
 6   Ending Balance (Physical Count)  5846 non-null   object
dtypes: object(7)
memory usage: 319.8+ KB


**Note:** `Beginning Balance`, `Quantity Received`, `Quantity Dispensed`, `Losses and Adjustments`, and `Ending Balance (Physical Count)` should be numeric columns, but all of them were read as `object`.

In [5]:
# Check columns

data.columns

Index(['Description', 'Basic Unit', 'Beginning Balance ', 'Quantity Received ',
       'Quantity Dispensed ', 'Losses and Adjustments',
       'Ending Balance (Physical Count)'],
      dtype='object')

**Note:** Some column headers have trailing spaces.
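
Trailing spaces are easy to miss in the default `Index` display. As a minimal sketch, assuming a toy frame whose headers mimic the ones above (the name `toy` is hypothetical), the affected headers can be listed by comparing each name with its stripped form:

```python
import pandas as pd

# Toy frame with no rows; only the headers matter here.
toy = pd.DataFrame(columns=["Description", "Beginning Balance ", "Quantity Received "])

# A header needs cleaning when stripping whitespace changes it.
padded = [col for col in toy.columns if col != col.strip()]
print(padded)
```

Running the same comprehension against `data.columns` before the strip step would show which headers are affected.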

In [6]:
# Strip leading and trailing spaces in column headers

data.columns = data.columns.str.strip()
data.columns

Index(['Description', 'Basic Unit', 'Beginning Balance', 'Quantity Received',
       'Quantity Dispensed', 'Losses and Adjustments',
       'Ending Balance (Physical Count)'],
      dtype='object')

**Check for nulls and duplicates**

In [7]:
# Nulls check

data.isna().sum()

Description                        0
Basic Unit                         0
Beginning Balance                  0
Quantity Received                  0
Quantity Dispensed                 0
Losses and Adjustments             0
Ending Balance (Physical Count)    0
dtype: int64

In [8]:
# Duplicated check

data.duplicated().sum()

3704

**Note:** 
- The data does contain duplicated rows (3,704), but this appears to be inherent to the data.
- No nulls.
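
Before accepting duplicates as legitimate, it can help to look at them directly. The sketch below uses a made-up three-row frame (the values are illustrative, not taken from the hospital file) to show two common ways of inspecting repeated rows:

```python
import pandas as pd

# Illustrative data only.
toy = pd.DataFrame({
    "Description": ["Cap Ampicillin", "Cap Ampicillin", "Tab Quinine"],
    "Beginning Balance": [350, 350, 100],
})

# keep=False flags every member of a duplicated group, not just the repeats.
dupes = toy[toy.duplicated(keep=False)]

# Count how often each distinct row occurs, most frequent first.
counts = toy.value_counts()

print(dupes)
print(counts)
```

If the most frequent rows are plausible repeated stock records, keeping them is reasonable; that judgment depends on how the data was collected.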

**Change dtype of columns to numeric.**

In [9]:
# Change datatypes:

data['Beginning Balance'] = pd.to_numeric(data['Beginning Balance'], errors='coerce')
data['Ending Balance (Physical Count)'] = pd.to_numeric(data['Ending Balance (Physical Count)'], errors='coerce')
data['Losses and Adjustments'] = pd.to_numeric(data['Losses and Adjustments'], errors='coerce')
data['Quantity Dispensed'] = pd.to_numeric(data['Quantity Dispensed'], errors='coerce')
data['Quantity Received'] = pd.to_numeric(data['Quantity Received'], errors='coerce')

In [10]:
# Validation

data.isna().sum()

Description                        0
Basic Unit                         0
Beginning Balance                  2
Quantity Received                  7
Quantity Dispensed                 4
Losses and Adjustments             1
Ending Balance (Physical Count)    4
dtype: int64

**Note:**
- On conversion to numeric, a few values became NaN.
- **Reason:** Those values contained non-numeric characters, and `errors='coerce'` turns anything unparseable into NaN.
- Because the data is valuable, we will correct these values rather than drop the rows.
- Each one will be replaced with the numeric value it was meant to be.
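
To make the effect of `errors='coerce'` concrete, here is a small sketch on a made-up Series (the names `raw`, `converted`, and `failed` are hypothetical). It shows that unparseable entries become NaN and how the original strings can still be recovered from the unconverted data:

```python
import pandas as pd

# Mixed strings and numbers, similar in spirit to the object columns above.
raw = pd.Series(["1000", "41oo", 0, "0.S", 2.5])

converted = pd.to_numeric(raw, errors="coerce")

# New NaNs: entries that were present before conversion but missing after it.
failed = raw.notna() & converted.isna()

print(converted.tolist())
print(raw[failed].tolist())
```

This is why an untouched copy of the data is useful: once a column has been coerced, the offending text is gone from it.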

In [11]:
# Separate the rows with anomalies

null_rows= data[data[['Beginning Balance', 'Quantity Received', 
                      'Quantity Dispensed', 'Losses and Adjustments', 
                      'Ending Balance (Physical Count)']].isna().any(axis= 1)]
null_rows

Unnamed: 0,Description,Basic Unit,Beginning Balance,Quantity Received,Quantity Dispensed,Losses and Adjustments,Ending Balance (Physical Count)
9,Surgical gloves,100,2.0,0.0,,0.0,
12,Tab Quinine,100,,0.0,0.0,0.0,0.0
490,Tabs Ibuprofen 200mg,1000's,0.0,,0.0,0.0,1.0
491,Tabs Ibuprofen 200mg,1000's,0.0,,0.0,0.0,
605,Tabs Paracetamol 500mg,1000's,0.0,,0.0,0.0,5.0
835,Caps Amoxicillin 500mg,100's,0.0,,0.0,0.0,2000.0
1084,inj Chlorpheniramine 10mg/amp,100's,0.0,,0.0,0.0,0.5
1781,Tab. Loratidine,100's,2000.0,0.0,,0.0,0.0
2508,Tab. Metformin + Glimepride,30's,1300.0,0.0,0.0,0.0,
2950,Inj Aminophyline,50's,0.0,50.0,0.0,,50.0


**Preliminary Plan:**
- Filter the rows that contain null values into `null_rows`
- Grab their index
- Look up the same rows in the original data using that index
- Inspect the defective values and work out how to convert them
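
The plan above can be sketched end to end on a toy frame. All names and values here are illustrative assumptions; the point is the shape of the workflow: coerce a copy, find the rows that gained NaNs, then read the raw text back from the untouched frame.

```python
import pandas as pd

# Stand-in for the untouched original data (illustrative values).
original = pd.DataFrame({
    "Description": ["Surgical gloves", "Tab Quinine", "Cap Ampicillin"],
    "Beginning Balance": ["2", "41oo", "350"],
    "Quantity Received": ["i", "0", "0"],
})
numeric_cols = ["Beginning Balance", "Quantity Received"]

# Steps 1 and 2: coerce, then collect the index labels of rows with any NaN.
converted = original[numeric_cols].apply(pd.to_numeric, errors="coerce")
mask = converted.isna()
bad_index = converted.index[mask.any(axis=1)]

# Step 3: fetch the raw rows by label.
bad_rows = original.loc[bad_index]

# Step 4: list each defective cell as (row label, column, raw value).
offenders = [
    (idx, col, original.at[idx, col])
    for col in numeric_cols
    for idx in mask.index[mask[col]]
]
print(offenders)
```

Selecting with `.loc` keeps the lookup correct even when the index labels are not consecutive integers starting at zero.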

In [12]:
# Grab the index

null_rows_index = null_rows.index
null_rows_index

Index([   9,   12,  490,  491,  605,  835, 1084, 1781, 2508, 2950, 3352, 3354,
       3812, 3846, 4286, 5062],
      dtype='int64')

In [13]:
# Get a copy of original dataset

data_copy = data_original.copy()
data_copy

Unnamed: 0,Description,Basic Unit,Beginning Balance,Quantity Received,Quantity Dispensed,Losses and Adjustments,Ending Balance (Physical Count)
0,Tab. Cefuroxime 250mg,10,1000,0,0,0,0
1,Tab. Cefuroxime 250mg,10,1000,0,0,0,0
2,Tab. Cefuroxime 250mg,10,1000,0,0,0,0
3,Cap Ampicillin,100,350,0,0,0,0
4,Cap Ampicillin,100,350,0,0,0,0
...,...,...,...,...,...,...,...
5841,Inj.Vitamin B Complex,Vial,145,30,4350,0,30
5842,Insulin (Actrapid),Vial,0,0,0,0,0
5843,Insulin (Insulutard),Vial,0,2,0,0,2
5844,Inj Amoxicillin 500mg,vials,54,100,79,0,75


In [14]:
# Strip leading and trailing spaces in column headers

data_copy.columns = data_copy.columns.str.strip()
data_copy.columns

Index(['Description', 'Basic Unit', 'Beginning Balance', 'Quantity Received',
       'Quantity Dispensed', 'Losses and Adjustments',
       'Ending Balance (Physical Count)'],
      dtype='object')

In [15]:
# Separate the original rows that produced nulls
# (.loc, because null_rows_index holds index labels, not positions)

data_copy_nullrows = data_copy.loc[null_rows_index]
data_copy_nullrows

Unnamed: 0,Description,Basic Unit,Beginning Balance,Quantity Received,Quantity Dispensed,Losses and Adjustments,Ending Balance (Physical Count)
9,Surgical gloves,100,2,0,i,0,i
12,Tab Quinine,100,41oo,0,0,0,0
490,Tabs Ibuprofen 200mg,1000's,0,i,0,0,1
491,Tabs Ibuprofen 200mg,1000's,0,I000,0,0,i000
605,Tabs Paracetamol 500mg,1000's,0,s,0,0,5
835,Caps Amoxicillin 500mg,100's,0,20oo,0,0,2000
1084,inj Chlorpheniramine 10mg/amp,100's,0,0.S,0,0,0.5
1781,Tab. Loratidine,100's,2000,0,o,0,0
2508,Tab. Metformin + Glimepride,30's,1300,0,0,0,O
2950,Inj Aminophyline,50's,0,50,0,O,50


In [16]:
# For each relevant column, find the unique values

unique_values = data_copy_nullrows[['Beginning Balance', 'Quantity Received',
                                        'Quantity Dispensed', 'Losses and Adjustments',
                                        'Ending Balance (Physical Count)']].apply(lambda x: x.unique())

print(unique_values)


Beginning Balance                              [2, 41oo, 0, 2000, 1300, 12, 30, o]
Quantity Received                             [0, i, I000, s, 20oo, 0.S, 50, O, o]
Quantity Dispensed                                                    [i, 0, o, O]
Losses and Adjustments                                                      [0, O]
Ending Balance (Physical Count)    [i, 0, 1, i000, 5, 2000, 0.5, O, 50, 12, 30, o]
dtype: object


**Replacement Strategy:**
To clean the data, we'll replace these inconsistent values using a function that applies the following transformations:

- Replace 'i' and 'I' with '1'
- Replace 'o' and 'O' with '0'
- Replace 's' and 'S' with '5'
- As a result, 'i000' becomes '1000', '20oo' becomes '2000', '0.S' becomes '0.5', and so on.
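
Because every rule above swaps one character for another, the whole strategy can be expressed as a single translation table. The following is a sketch of that alternative using only the standard library; `LOOKALIKES` and `fix_lookalikes` are hypothetical names, and the sample values are the anomalies seen in the output above.

```python
# Map each look-alike letter to the digit it most likely stands for.
LOOKALIKES = str.maketrans({
    "i": "1", "I": "1",
    "o": "0", "O": "0",
    "s": "5", "S": "5",
})

def fix_lookalikes(value):
    """Return value as a string with look-alike letters replaced by digits."""
    return str(value).translate(LOOKALIKES)

samples = ["i", "I000", "20oo", "41oo", "0.S", "s", "O", 350]
print([fix_lookalikes(v) for v in samples])
# ['1', '1000', '2000', '4100', '0.5', '5', '0', '350']
```

A mapping like this should only be applied to columns that are meant to be numeric; on a text column such as `Description` it would corrupt ordinary words.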

In [17]:
# Create a User Defined Function to clean the data

import re

def clean_values(value):
    value = str(value)

    # Replace look-alike letters with the digits they were meant to be.
    # Character classes cover both cases, so compound values such as
    # 'I000', '41oo', '20oo' and '0.S' need no separate rules.
    value = re.sub(r"[iI]", "1", value)
    value = re.sub(r"[oO]", "0", value)
    value = re.sub(r"[sS]", "5", value)

    return value

In [18]:
# Columns where the function will be applied

columns_to_clean= ['Beginning Balance', 'Quantity Received',
                  'Quantity Dispensed', 'Losses and Adjustments',
                  'Ending Balance (Physical Count)']

In [19]:
# Apply the UDF to specific columns mentioned above

for col in columns_to_clean:
    data_copy[col] = data_copy[col].apply(clean_values)

In [20]:
# Convert columns to numeric after cleaning
data_copy[columns_to_clean] = data_copy[columns_to_clean].apply(pd.to_numeric, errors='coerce')

# Any value that still fails to convert would show up here as a null
data_copy[columns_to_clean].isnull().sum()

Beginning Balance                  0
Quantity Received                  0
Quantity Dispensed                 0
Losses and Adjustments             0
Ending Balance (Physical Count)    0
dtype: int64

**Note:** The anomalies in the numeric columns have been cleaned.
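
A zero null count shows that every value now parses, but not that the untouched values stayed the same. As a hedged sketch on made-up data (the names `before` and `after` are hypothetical stand-ins for one raw column and its cleaned counterpart), the check below confirms that only the previously unparseable entries changed:

```python
import pandas as pd

# Illustrative raw column and its cleaned, numeric version.
before = pd.Series(["1000", "41oo", "0", "0.S"])
after = pd.Series([1000.0, 4100.0, 0.0, 0.5])

before_num = pd.to_numeric(before, errors="coerce")
parsed = before_num.notna()

# Values that parsed cleanly before must be identical after cleaning.
unchanged = (before_num[parsed] == after[parsed]).all()

# Values that failed to parse are the only ones cleaning should have altered.
repaired = pd.DataFrame({"raw": before[~parsed], "cleaned": after[~parsed]})

print(unchanged)
print(repaired)
```

Reviewing the `repaired` table by eye is a cheap way to confirm that each substitution produced a plausible quantity.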

In [22]:
# Append the cleaned data as a new sheet in the existing workbook

with pd.ExcelWriter("Hospital Data with Mixed Numbers and Characters.xlsx", engine= "openpyxl", mode='a') as writer:
    data_copy.to_excel(writer, sheet_name= "Cleaned_data", index= False)