## Description

* This notebook mainly performs data cleaning.

## Variable Description

* **Medicine Name**: The name of the medicine.

* **Composition**: The chemical composition or ingredients of the medicine.

* **Uses**: The medical uses or indications for which the medicine is prescribed.

* **Side\_effects**: Potential side effects associated with the medicine.

* **Image URL**: URL link to an image of the medicine.

* **Manufacturer**: The company that manufactures the medicine.

* **Excellent Review %**: Percentage of excellent reviews for the medicine.

* **Average Review %**: Percentage of average reviews for the medicine.

* **Poor Review %**: Percentage of poor reviews for the medicine.

Data Source: https://www.kaggle.com/datasets/singhnavjot2062001/11000-medicine-details

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Connect to Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Read and copy the data
data = pd.read_csv('YOUR_PATH')
df = data.copy()

In [None]:
# Check data shape
df.shape

(11825, 9)

* The dataset has 11825 rows and 9 columns

In [None]:
# Preview the first 5 rows of the data
df.head()

Unnamed: 0,Medicine Name,Composition,Uses,Side_effects,Image URL,Manufacturer,Excellent Review %,Average Review %,Poor Review %
0,Avastin 400mg Injection,Bevacizumab (400mg),Cancer of colon and rectum Non-small cell lun...,Rectal bleeding Taste change Headache Noseblee...,"https://onemg.gumlet.io/l_watermark_346,w_480,...",Roche Products India Pvt Ltd,22,56,22
1,Augmentin 625 Duo Tablet,Amoxycillin (500mg) + Clavulanic Acid (125mg),Treatment of Bacterial infections,Vomiting Nausea Diarrhea Mucocutaneous candidi...,"https://onemg.gumlet.io/l_watermark_346,w_480,...",Glaxo SmithKline Pharmaceuticals Ltd,47,35,18
2,Azithral 500 Tablet,Azithromycin (500mg),Treatment of Bacterial infections,Nausea Abdominal pain Diarrhea,"https://onemg.gumlet.io/l_watermark_346,w_480,...",Alembic Pharmaceuticals Ltd,39,40,21
3,Ascoril LS Syrup,Ambroxol (30mg/5ml) + Levosalbutamol (1mg/5ml)...,Treatment of Cough with mucus,Nausea Vomiting Diarrhea Upset stomach Stomach...,"https://onemg.gumlet.io/l_watermark_346,w_480,...",Glenmark Pharmaceuticals Ltd,24,41,35
4,Aciloc 150 Tablet,Ranitidine (150mg),Treatment of Gastroesophageal reflux disease (...,Headache Diarrhea Gastrointestinal disturbance,"https://onemg.gumlet.io/l_watermark_346,w_480,...",Cadila Pharmaceuticals Ltd,34,37,29


In [None]:
# Check basic data information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11825 entries, 0 to 11824
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Medicine Name       11825 non-null  object
 1   Composition         11825 non-null  object
 2   Uses                11825 non-null  object
 3   Side_effects        11825 non-null  object
 4   Image URL           11825 non-null  object
 5   Manufacturer        11825 non-null  object
 6   Excellent Review %  11825 non-null  int64 
 7   Average Review %    11825 non-null  int64 
 8   Poor Review %       11825 non-null  int64 
dtypes: int64(3), object(6)
memory usage: 831.6+ KB


* The data does not appear to contain null values.
* It consists of 6 columns with the object dtype and 3 columns with the integer (int64) dtype.
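Since the three integer columns are described as percentages, a range check can complement the dtype check. The sketch below runs on a small made-up frame, not on the real dataset; the third row and the `tolerance` value are assumptions added for illustration. It also assumes the three percentages should add up to roughly 100, which holds for the five previewed rows but has not been verified for the full data.

```python
import pandas as pd

review_cols = ["Excellent Review %", "Average Review %", "Poor Review %"]

# Toy data: the first two rows mirror the preview, the third is deliberately invalid
toy = pd.DataFrame(
    [[22, 56, 22], [47, 35, 18], [60, 50, 10]],
    columns=review_cols,
)

tolerance = 1  # hypothetical allowance for rounding in the source percentages

# Every percentage should lie between 0 and 100 (inclusive)
in_range = toy[review_cols].apply(lambda s: s.between(0, 100)).all(axis=1)

# Assumption: the three percentages of a row should sum to about 100
sums_to_100 = toy[review_cols].sum(axis=1).sub(100).abs().le(tolerance)

# Rows that fail either check deserve a manual look
suspect = toy[~(in_range & sums_to_100)]
print(suspect)
```

On the real data, the same expressions can be applied to `df` instead of `toy`.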

In [None]:
# Check duplication
df.duplicated().sum()

np.int64(84)

* 84 rows are exact duplicates of an earlier row.

In [None]:
# Drop the duplicated rows
df.drop_duplicates(inplace = True)
df.duplicated().any()

np.False_

* The duplicated rows were dropped successfully.
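Before dropping duplicates, it can help to look at the rows involved. The sketch below uses a made-up frame (the names and values are hypothetical) to show how `duplicated(keep=False)` exposes every member of a duplicate group, while the default `duplicated()` counts only the repeats. Resetting the index afterwards is optional and is not done in this notebook.

```python
import pandas as pd

# Toy data: rows 0 and 1 are identical; row 3 shares a name but differs elsewhere
toy = pd.DataFrame({
    "Medicine Name": ["A Tablet", "A Tablet", "B Syrup", "A Tablet"],
    "Manufacturer": ["X Ltd", "X Ltd", "Y Ltd", "Z Ltd"],
})

# Default keep='first': only the later copies are flagged
n_repeats = int(toy.duplicated().sum())

# keep=False flags every row that belongs to a duplicate group
dupes = toy[toy.duplicated(keep=False)]

# Drop the repeats and rebuild a contiguous index
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_repeats, len(dupes), len(deduped))
```

Note that `duplicated()` compares whole rows by default, so rows that share a medicine name but differ in another column are kept.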

In [None]:
# Check for null values to be sure
df.isnull().sum()

Unnamed: 0,0
Medicine Name,0
Composition,0
Uses,0
Side_effects,0
Image URL,0
Manufacturer,0
Excellent Review %,0
Average Review %,0
Poor Review %,0


* There are no null values in the data, as the `df.info()` output suggested.
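`isnull()` only flags missing values such as `NaN` or `None`; empty or whitespace-only strings are not counted. A minimal sketch of an extra check for the text columns follows. It runs on made-up data, and `text_cols` is a hypothetical list that would need to name the real object columns.

```python
import pandas as pd

# Toy data: one empty string and one whitespace-only string in "Uses"
toy = pd.DataFrame({
    "Uses": ["Treatment of Cough", "", "   "],
    "Manufacturer": ["X Ltd", "Y Ltd", "Z Ltd"],
})

text_cols = ["Uses", "Manufacturer"]  # hypothetical list of text columns

# isnull() reports nothing here, because blank strings are not null
null_counts = toy[text_cols].isnull().sum()

# Strip whitespace and count the entries that end up empty
blank_counts = toy[text_cols].apply(lambda s: s.str.strip().eq("").sum())

print(null_counts)
print(blank_counts)
```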

In [None]:
# Check the relative frequency of each value in the categorical features
cat_cols = df.select_dtypes(include='object').columns.tolist()
for col in cat_cols:
    # normalize=True returns proportions instead of raw counts
    print(df[col].value_counts(normalize=True))
    print('-' * 50)

Medicine Name
Lulifin Cream                        0.000341
Hexidine Mouth Wash                  0.000256
Amrolstar Cream                      0.000256
Lulizol Cream                        0.000256
Numbex Cream                         0.000256
                                       ...   
Jilazo Solution for Injection        0.000085
Justin 12.5mg Suppository            0.000085
Just Tears Liquigel                  0.000085
Jenvac Vaccine                       0.000085
Jardiance Met 12.5mg/500mg Tablet    0.000085
Name: proportion, Length: 11498, dtype: float64
--------------------------------------------------
Composition
Luliconazole (1% w/w)                           0.008347
Levocetirizine (5mg) + Montelukast (10mg)       0.006473
Ketoconazole (2% w/w)                           0.005195
Domperidone (30mg) + Rabeprazole (20mg)         0.005025
Itraconazole (100mg)                            0.004514
                                                  ...   
Cefixime (25mg/ml)         

* The categorical features do not seem to contain anomalous values that need to be removed.
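Scanning `value_counts` output can miss categories that differ only in letter case or spacing. One way to probe for this, sketched here on made-up manufacturer names, is to compare the number of unique values before and after a simple normalization. The names are hypothetical, and this check was not run on the real dataset.

```python
import pandas as pd

# Toy data: three spellings of the same hypothetical manufacturer
toy = pd.Series(
    ["Acme Pharma Ltd", "acme pharma ltd ", "Other Labs", "Acme  Pharma Ltd"],
    name="Manufacturer",
)

# Trim the ends, lowercase, and collapse internal runs of whitespace
normalized = (
    toy.str.strip()
       .str.lower()
       .str.replace(r"\s+", " ", regex=True)
)

raw_unique = toy.nunique()
norm_unique = normalized.nunique()

# A drop in the unique count hints at inconsistently written categories
print(raw_unique, norm_unique)
```

If the two counts match for a column, case and whitespace variants are unlikely to be an issue there.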

In [None]:
# Save the cleaned data to Google Drive
df.to_csv('YOUR_PATH', index=False)