# Step One: Data Preparation
Report Description

In this step, the dataset was prepared for machine learning. Preparation focused on data quality: rows with missing values and duplicate records were removed, and only the relevant numerical features were kept for modeling to limit noise and unnecessary complexity.

Because Support Vector Machines (SVMs) are sensitive to the scale of their input features, the numerical attributes were standardized so that no feature dominates the model simply because of its units. The target variable was encoded as integers so the model can process it. Finally, the dataset was split into training and testing sets so the model can be evaluated on data it has not seen.
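
As a rough illustration of the standardization, encoding, and splitting described above, the following sketch uses scikit-learn on a small made-up frame. The toy rows, the choice of `keyword_count`, `sentiment`, and `risk_score` as features, the 75/25 split, and the `random_state` value are all assumptions for illustration, not the settings used in this report.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy rows invented for illustration; only the column names follow the dataset.
toy = pd.DataFrame({
    "keyword_count": [2, 4, 3, 0, 2, 4, 3, 0],
    "sentiment": [-0.5, -1.0, -0.5, 0.0, -0.5, -1.0, -0.5, 0.0],
    "risk_score": [4.5, 9.6, 7.65, 0.0, 5.1, 9.0, 7.0, 0.5],
    "category": ["storage_issue", "import_delay"] * 4,
})

# Assumed feature set: the three numerical columns.
X = toy[["keyword_count", "sentiment", "risk_score"]]

# Encode the string labels as integers (classes are sorted alphabetically).
encoder = LabelEncoder()
y = encoder.fit_transform(toy["category"])

# Split first, so the scaler never sees the test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Fit the scaler on the training rows only, then reuse it on the test rows.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(encoder.classes_)
print(X_train_scaled.mean(axis=0).round(6))
```

Fitting the scaler after the split is a deliberate choice in this sketch: it keeps test-set statistics out of the training pipeline.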

In [1]:
import pandas as pd

In [2]:
# Load the dataset
df = pd.read_excel('medicine_text_data_400_unclean.xlsx')

In [3]:
# Summarize column dtypes and non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             400 non-null    int64  
 1   category       400 non-null    object 
 2   text           400 non-null    object 
 3   keyword_count  400 non-null    int64  
 4   sentiment      400 non-null    float64
 5   risk_score     400 non-null    float64
dtypes: float64(2), int64(2), object(2)
memory usage: 18.9+ KB


In [4]:
# Check for missing values
df.isna().sum()

id               0
category         0
text             0
keyword_count    0
sentiment        0
risk_score       0
dtype: int64

In [5]:
# List the column names
df.columns

Index(['id', 'category', 'text', 'keyword_count', 'sentiment', 'risk_score'], dtype='object')

In [6]:
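# Extract the word after "in " as a rough location.
# Note: the pattern is case-sensitive and has no word boundary, so it misses "In Giza" and can match inside words such as "insulin".
df['location'] = df['text'].str.extract(r'in\s([A-Za-z]+)')
df.head()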

Unnamed: 0,id,category,text,keyword_count,sentiment,risk_score,location
0,1,storage_issue,pharmacies in giza reported issues with antide...,2,-0.5,4.5,giza
1,2,storage_issue,a weekly report from delta region connected an...,2,-0.5,4.5,
2,3,import_delay,"In Giza, anticoagulants availability has dropp...",4,-1.0,9.6,
3,4,manufacturing_issue,"In coastal governorates, anticoagulants availa...",3,-0.5,7.65,
4,5,manufacturing_issue,Pharmacies in Giza reported issues with asthma...,2,-0.5,5.1,Giza
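
The output above leaves `location` empty for a text that starts with "In Giza". A possible tightening of the pattern is sketched below with the standard `re` module. The sample strings and the helper name `extract_location` are hypothetical, and lowercasing the result is an assumption made so that "giza" and "Giza" collapse to one value.

```python
import re

# \b keeps the match from starting inside a word such as "insulin";
# IGNORECASE also accepts a sentence-initial "In".
LOCATION_PATTERN = re.compile(r"\bin\s+([A-Za-z]+)", flags=re.IGNORECASE)

def extract_location(text):
    """Return the lowercased word after a standalone 'in', or None."""
    match = LOCATION_PATTERN.search(text)
    return match.group(1).lower() if match else None

# Hypothetical sample sentences, written for this sketch.
samples = [
    "pharmacies in giza reported issues",
    "In Giza, availability has dropped",
    "Supply levels for insulin in coastal governorates",
    "a weekly report from delta region",
]
locations = [extract_location(s) for s in samples]
print(locations)
```

With pandas, the same pattern can be passed to `df['text'].str.extract(...)` together with `flags=re.IGNORECASE`. Note that this still captures only a single word, so multi-word places would need a different pattern.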


In [7]:
# Standardize category names
df['category'] = df['category'].str.lower().str.replace(" ", "_")
df.head()

Unnamed: 0,id,category,text,keyword_count,sentiment,risk_score,location
0,1,storage_issue,pharmacies in giza reported issues with antide...,2,-0.5,4.5,giza
1,2,storage_issue,a weekly report from delta region connected an...,2,-0.5,4.5,
2,3,import_delay,"In Giza, anticoagulants availability has dropp...",4,-1.0,9.6,
3,4,manufacturing_issue,"In coastal governorates, anticoagulants availa...",3,-0.5,7.65,
4,5,manufacturing_issue,Pharmacies in Giza reported issues with asthma...,2,-0.5,5.1,Giza


In [8]:
# Add a clean_text column: lowercase the text and keep only letters and whitespace
import re

def clean_text(text):
    text = text.lower()                   # lowercase
    text = re.sub(r"[^a-z\s]", "", text)  # remove punctuation, digits, and symbols
    return text

df["clean_text"] = df["text"].apply(clean_text)
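
Some rows in this dataset contain character noise such as "Pharmaci3s" or "co@st@l". Deleting those characters outright produces broken tokens ("pharmacis", "costl"). The sketch below maps the noise back to letters before stripping. The mapping `3 -> e` and `@ -> a` is an assumption inferred from those two examples, and the helper name `clean_text_denoised` is hypothetical. Such a mapping would also rewrite genuine digits, so it only makes sense if numbers carry no meaning in the text.

```python
import re

# Assumed substitutions, inferred from tokens like "Pharmaci3s" and "co@st@l".
NOISE_MAP = str.maketrans({"3": "e", "@": "a"})

def clean_text_denoised(text):
    """Lowercase, undo assumed character noise, then keep letters and spaces."""
    text = text.lower().translate(NOISE_MAP)
    text = re.sub(r"[^a-z\s]", "", text)      # drop remaining non-letters
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace

print(clean_text_denoised("Pharmaci3s in Upp3r Egypt r3port3d"))
print(clean_text_denoised("co@st@l governorates"))
```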

In [9]:
# Count how many distinct supply-issue keywords appear in each cleaned text (substring match)
keywords = ["shortage", "delay", "issue", "stock", "problem"]
df['keyword_count_clean'] = df['clean_text'].apply(lambda x: sum(word in x for word in keywords))
df.head()

Unnamed: 0,id,category,text,keyword_count,sentiment,risk_score,location,clean_text,keyword_count_clean
0,1,storage_issue,pharmacies in giza reported issues with antide...,2,-0.5,4.5,giza,pharmacies in giza reported issues with antide...,1
1,2,storage_issue,a weekly report from delta region connected an...,2,-0.5,4.5,,a weekly report from delta region connected an...,0
2,3,import_delay,"In Giza, anticoagulants availability has dropp...",4,-1.0,9.6,,in giza anticoagulants availability has droppe...,0
3,4,manufacturing_issue,"In coastal governorates, anticoagulants availa...",3,-0.5,7.65,,in coastal governorates anticoagulants availab...,0
4,5,manufacturing_issue,Pharmacies in Giza reported issues with asthma...,2,-0.5,5.1,Giza,pharmacies in giza reported issues with asthma...,1
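
The expression above counts how many distinct keywords occur as substrings, so a repeated keyword still counts once. If total occurrences of whole words are wanted instead, one option is sketched here. The helper name `count_keyword_occurrences`, the optional plural "s", and the sample sentences are assumptions for illustration.

```python
import re

KEYWORDS = ["shortage", "delay", "issue", "stock", "problem"]

# Match each keyword as a whole word, with an optional plural "s".
# Other inflections (for example "delayed") are not matched by this pattern.
KEYWORD_PATTERN = re.compile(r"\b(?:" + "|".join(KEYWORDS) + r")s?\b")

def count_keyword_occurrences(clean_text):
    """Count every whole-word keyword occurrence in already-cleaned text."""
    return len(KEYWORD_PATTERN.findall(clean_text))

# Hypothetical sample sentence, written for this sketch.
sample = "issues with stock and more issues"
presence_count = sum(word in sample for word in KEYWORDS)  # notebook's approach
occurrence_count = count_keyword_occurrences(sample)
print(presence_count, occurrence_count)
```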


In [10]:
# Remove duplicate rows
df = df.drop_duplicates()
df.head()

Unnamed: 0,id,category,text,keyword_count,sentiment,risk_score,location,clean_text,keyword_count_clean
0,1,storage_issue,pharmacies in giza reported issues with antide...,2,-0.5,4.5,giza,pharmacies in giza reported issues with antide...,1
1,2,storage_issue,a weekly report from delta region connected an...,2,-0.5,4.5,,a weekly report from delta region connected an...,0
2,3,import_delay,"In Giza, anticoagulants availability has dropp...",4,-1.0,9.6,,in giza anticoagulants availability has droppe...,0
3,4,manufacturing_issue,"In coastal governorates, anticoagulants availa...",3,-0.5,7.65,,in coastal governorates anticoagulants availab...,0
4,5,manufacturing_issue,Pharmacies in Giza reported issues with asthma...,2,-0.5,5.1,Giza,pharmacies in giza reported issues with asthma...,1


In [11]:
# Remove rows with missing values; this also drops every row whose location could not be extracted
df = df.dropna()
df.head()

Unnamed: 0,id,category,text,keyword_count,sentiment,risk_score,location,clean_text,keyword_count_clean
0,1,storage_issue,pharmacies in giza reported issues with antide...,2,-0.5,4.5,giza,pharmacies in giza reported issues with antide...,1
4,5,manufacturing_issue,Pharmacies in Giza reported issues with asthma...,2,-0.5,5.1,Giza,pharmacies in giza reported issues with asthma...,1
6,7,forecasting_error,Pharmacies in coastal governorates reported is...,3,-0.5,4.05,coastal,pharmacies in coastal governorates reported is...,1
7,8,neutral_report,Pharmaci3s in Upp3r Egypt r3port3d stabl3 avai...,0,0.0,0.0,Upp,pharmacis in uppr egypt rportd stabl availabil...,0
8,9,neutral_report,Supply levels for insulin in co@st@l governor@...,0,0.0,0.0,in,supply levels for insulin in costl governortes...,0
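
Because `dropna()` with no arguments removes a row when any column is missing, rows are lost here only because `location` was not extracted. If that is not intended, `dropna(subset=...)` can limit the check to the columns that really must be present. The sketch below uses a made-up frame, and treating `text` and `risk_score` as the required columns is an assumption.

```python
import pandas as pd

# Toy rows invented for illustration.
toy = pd.DataFrame({
    "text": ["pharmacies in giza reported issues",
             "In Giza, availability has dropped",
             None],
    "risk_score": [4.5, 9.6, 1.0],
    "location": ["giza", None, None],
})

# Drops any row with at least one missing value, including a missing location.
dropped_any = toy.dropna()

# Drops a row only if one of the assumed required columns is missing.
dropped_required = toy.dropna(subset=["text", "risk_score"])

print(len(toy), len(dropped_any), len(dropped_required))
```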


In [12]:
# Save the cleaned data (the row index is written as an unnamed first column)
df.to_csv('Data_CleanedFor_Text.csv')
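
By default `to_csv` writes the row index as the first column, and it comes back as `Unnamed: 0` when the file is read again. The sketch below shows the difference with `index=False` on a made-up frame, using in-memory buffers instead of files. Whether the index should be kept depends on how the next step reads the file.

```python
import io
import pandas as pd

# Toy rows invented for illustration; the gap in the index mimics dropped rows.
toy = pd.DataFrame(
    {"id": [1, 5], "category": ["storage_issue", "manufacturing_issue"]},
    index=[0, 4],
)

# Default behaviour: the index is written and reappears as "Unnamed: 0".
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```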