Company Introduction
Your client for this project is a pharmaceutical company.

They have a long history of making effective drugs and are the leading producer of antibiotics for bacterial infections.
Their research and development team has recently developed five types of drugs to fight chronic throat infections.
They want to release the drugs quickly so that they can cure people and increase revenue for the company.
Their R&D team made a brief analysis of the chemical composition of the drugs and reported that each drug has a different effect depending on the person’s health.
A drug with a higher concentration of chemicals should be given only to those people whose health reports meet the criteria suggested by the R&D team.


Current Scenario
The R&D group has invited some groups of people to test the drug, but going through each person’s health report might take a lot of time and cause a delay in launching the drug in the market.

The current process suffers from the following problems:

The testing phase takes a lot of time and is done manually, because each person must be carefully examined for side effects.
Most of the crucial time is spent checking each person’s health report and dispensing specific drugs according to the health metrics suggested by the R&D team.
The process is time-consuming and wastes resources.

The company has hired you as data science consultants. They want to automate the process of assigning a drug to each person according to that person’s health report.

Your Role
You are given a dataset containing the health report of the people from the test group.
Your task is to build a multi-class classification model using the dataset.
Because there was no machine learning model for this problem in the company, you don’t have a quantifiable win condition. You need to build the best possible model.


Project Deliverables
Deliverable: Drug classification.
Machine Learning Task: Multi-class classification
Target Variable: Drug
Win Condition: N/A (best possible model)


Evaluation Metric
The model evaluation will be based on the Accuracy Score.
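
As a quick reminder of what the metric measures, accuracy is the fraction of predictions that match the true labels. The sketch below uses made-up labels (not taken from the project data) purely to illustrate the calculation with scikit-learn's accuracy_score.

```python
from sklearn.metrics import accuracy_score

# Made-up labels for illustration only; not taken from the project data.
y_true = ["drugX", "DrugY", "drugA", "DrugY"]
y_pred = ["drugX", "DrugY", "drugB", "DrugY"]

# Three of the four predictions match, so accuracy is 3 / 4.
acc = accuracy_score(y_true, y_pred)
print(acc)
```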

The dataset contains all the necessary information about each person’s health, such as sex, BP, age, and cholesterol.

These health metrics are the essential factor in prescribing a drug to a person without side effects.

The drug assigned to each person is the value to predict for future samples.


The dataset is divided into two parts: Train, and Test sets.

Train Set:
The train set contains 160 rows and 7 columns.
The last column Drug is the target variable.

Test Set:
The test set contains 40 rows and 6 columns.
The test set doesn’t contain the Drug column.
It needs to be predicted for the test set.

Dataset Feature Description
The Dataset contains the following columns:

Column Name    Description
Id             Unique Id of the sample
Age            Age of the person
Sex            The sex of the person (M and F)
BP             Blood pressure of the person
Cholesterol    The level of cholesterol in a person's body
Na_to_K        Sodium and potassium ratio
Drug           Drug: Contains 5 classes of drugs encoded as (drug A: 3, drug B: 4, drug C: 2, drug X: 0, drug Y: 1)

In [1]:
import pandas as pd
drug_train=pd.read_csv('drug_train.csv')
drug_train.head()


Unnamed: 0,Id,Age,Sex,BP,Cholesterol,Na_to_K,Drug
0,79,32,F,LOW,NORMAL,10.84,drugX
1,197,52,M,NORMAL,HIGH,9.894,drugX
2,38,39,F,NORMAL,NORMAL,9.709,drugX
3,24,33,F,LOW,HIGH,33.486,DrugY
4,122,34,M,NORMAL,HIGH,22.456,DrugY


In [2]:
drug_train.shape

(160, 7)

In [3]:
drug_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 160 entries, 0 to 159
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Id           160 non-null    int64  
 1   Age          160 non-null    int64  
 2   Sex          160 non-null    object 
 3   BP           160 non-null    object 
 4   Cholesterol  160 non-null    object 
 5   Na_to_K      160 non-null    float64
 6   Drug         160 non-null    object 
dtypes: float64(1), int64(2), object(4)
memory usage: 8.9+ KB


In [4]:
drug_train.describe()

Unnamed: 0,Id,Age,Na_to_K
count,160.0,160.0,160.0
mean,99.075,45.3875,16.194988
std,59.374894,16.101481,7.254689
min,0.0,15.0,6.269
25%,45.5,32.0,10.44525
50%,100.5,46.0,14.0765
75%,149.5,58.25,19.48075
max,199.0,74.0,38.247


In [5]:
drug_train[drug_train.duplicated()]

Unnamed: 0,Id,Age,Sex,BP,Cholesterol,Na_to_K,Drug


In [6]:
drug_train['Sex'].value_counts()

M    83
F    77
Name: Sex, dtype: int64

In [7]:
drug_train['Drug'].value_counts()
# Check whether the dataset is imbalanced: the classes do not all have the same count

DrugY    76
drugX    43
drugA    17
drugB    13
drugC    11
Name: Drug, dtype: int64
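
To put the imbalance in numbers, the sketch below turns the counts printed above into class proportions. The largest proportion is also the accuracy a model would reach by always predicting the most frequent class, which is a useful floor when judging later scores. The variable names are illustrative.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({"DrugY": 76, "drugX": 43, "drugA": 17, "drugB": 13, "drugC": 11})

# Share of each class in the training set.
proportions = counts / counts.sum()

# Accuracy of a trivial model that always predicts the most frequent class.
majority_baseline = proportions.max()

print(proportions.round(3))
print(majority_baseline)
```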

In [8]:
drug_train.groupby(['Drug'])['Na_to_K'].min()

Drug
DrugY    15.015
drugA     6.269
drugB     8.621
drugC     6.769
drugX     6.683
Name: Na_to_K, dtype: float64

In [9]:
drug_train['Id'].value_counts()

0      1
1      1
127    1
129    1
130    1
      ..
63     1
64     1
70     1
71     1
199    1
Name: Id, Length: 160, dtype: int64

In [10]:
# Drop the Id column since it is not a feature affecting the choice of drug

drug_train.drop('Id',axis=1,inplace=True)

In [11]:
drug_train['Cholesterol'].value_counts()

HIGH      88
NORMAL    72
Name: Cholesterol, dtype: int64

In [12]:
drug_train['BP'].value_counts()

HIGH      62
NORMAL    51
LOW       47
Name: BP, dtype: int64

In [13]:
cat_features=['Sex','BP','Cholesterol']

In [14]:
# One-hot encode the categorical features, dropping the first level of each
final_df = pd.get_dummies(drug_train, columns=cat_features, drop_first=True)

In [15]:
final_df.head()
final_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 160 entries, 0 to 159
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Age                 160 non-null    int64  
 1   Na_to_K             160 non-null    float64
 2   Drug                160 non-null    object 
 3   Sex_M               160 non-null    uint8  
 4   BP_LOW              160 non-null    uint8  
 5   BP_NORMAL           160 non-null    uint8  
 6   Cholesterol_NORMAL  160 non-null    uint8  
dtypes: float64(1), int64(1), object(1), uint8(4)
memory usage: 4.5+ KB


In [16]:
# Separate the features from the target
X = final_df.drop('Drug', axis=1)
y = final_df['Drug']
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 160 entries, 0 to 159
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Age                 160 non-null    int64  
 1   Na_to_K             160 non-null    float64
 2   Sex_M               160 non-null    uint8  
 3   BP_LOW              160 non-null    uint8  
 4   BP_NORMAL           160 non-null    uint8  
 5   Cholesterol_NORMAL  160 non-null    uint8  
dtypes: float64(1), int64(1), uint8(4)
memory usage: 3.2 KB


In [34]:
# Hold out 20% of the training data and fit a decision tree on the rest
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=10)
dtree = DecisionTreeClassifier()

dtree.fit(X_train,y_train)

print("Testing Accuracy for dtree")
print(dtree.score(X_test,y_test))

print("Training Accuracy for dtree")
print(dtree.score(X_train,y_train))

Testing Accuracy for dtree
1.0
Training Accuracy for dtree
1.0


In [35]:
dtree.score(X_test,y_test)

1.0
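
A perfect score on a single 32-row hold-out split can be optimistic, so one way to get a steadier estimate is stratified k-fold cross-validation. The sketch below is only an illustration: X_demo and y_demo are synthetic stand-ins generated on the spot, not the project data, and in the notebook they would be replaced by X and y.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (illustration only, not the project data):
# the label depends on whether the first feature exceeds 0.5.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(100, 3))
y_demo = np.where(X_demo[:, 0] > 0.5, "DrugY", "drugX")

# Stratified folds keep the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=10)
scores = cross_val_score(
    DecisionTreeClassifier(random_state=10), X_demo, y_demo, cv=cv, scoring="accuracy"
)

# Report the mean and spread of accuracy across the five folds.
print(scores.mean(), scores.std())
```

With the real data, the smallest class has 11 rows, so five folds still leave at least two samples of every class in each fold.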

In [36]:
drug_test=pd.read_csv('drug_test.csv')
drug_test.head()
drug_id=drug_test['Id']

In [37]:
drug_id.head()

0     95
1     15
2     30
3    158
4    128
Name: Id, dtype: int64

In [38]:
drug_test.drop('Id',axis=1,inplace=True)

In [39]:
drug_test=pd.get_dummies(drug_test,columns=cat_features,drop_first=True)
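
Encoding the train and test frames separately works here because both contain the same category levels. If a level were missing from the test set, the dummy columns would no longer match what the model was trained on. The sketch below shows one possible guard, using small toy frames (illustration only) in which the test frame deliberately lacks BP == "NORMAL".

```python
import pandas as pd

cat_features = ["Sex", "BP", "Cholesterol"]

# Toy frames for illustration only; the test frame has no BP == "NORMAL" row.
train = pd.DataFrame({
    "Age": [32, 52, 39],
    "Sex": ["F", "M", "F"],
    "BP": ["LOW", "NORMAL", "HIGH"],
    "Cholesterol": ["NORMAL", "HIGH", "NORMAL"],
})
test = pd.DataFrame({
    "Age": [36, 16],
    "Sex": ["M", "F"],
    "BP": ["LOW", "HIGH"],
    "Cholesterol": ["NORMAL", "HIGH"],
})

train_enc = pd.get_dummies(train, columns=cat_features, drop_first=True)
test_enc = pd.get_dummies(test, columns=cat_features, drop_first=True)

# Add any dummy column missing from the test frame (filled with 0) and
# put the columns in the same order as the training frame.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_aligned.columns))
```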

In [40]:
drug_test.head()

Unnamed: 0,Age,Na_to_K,Sex_M,BP_LOW,BP_NORMAL,Cholesterol_NORMAL
0,36,11.424,1,1,0,1
1,16,15.516,0,0,0,1
2,18,8.75,0,0,1,1
3,59,10.444,0,1,0,0
4,47,33.542,1,1,0,1


In [41]:
X_test_insaid = drug_test
X_test_insaid.head()
X_test_insaid.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40 entries, 0 to 39
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Age                 40 non-null     int64  
 1   Na_to_K             40 non-null     float64
 2   Sex_M               40 non-null     uint8  
 3   BP_LOW              40 non-null     uint8  
 4   BP_NORMAL           40 non-null     uint8  
 5   Cholesterol_NORMAL  40 non-null     uint8  
dtypes: float64(1), int64(1), uint8(4)
memory usage: 928.0 bytes


In [42]:

# Predict the drug for each test row with the decision tree fitted above
X_test_insaid['ypredict'] = dtree.predict(X_test_insaid)


In [43]:
X_test_insaid.head()



Unnamed: 0,Age,Na_to_K,Sex_M,BP_LOW,BP_NORMAL,Cholesterol_NORMAL,ypredict
0,36,11.424,1,1,0,1,drugX
1,16,15.516,0,0,0,1,DrugY
2,18,8.75,0,0,1,1,drugX
3,59,10.444,0,1,0,0,drugC
4,47,33.542,1,1,0,1,DrugY


In [44]:

# Encode the predicted drug names with the codes from the feature description
scale_mapper = {"drugA": 3, "drugB": 4, "drugC": 2, "drugX": 0, "DrugY": 1}
X_test_insaid['ypredict'] = X_test_insaid['ypredict'].map(scale_mapper)
print(X_test_insaid['ypredict'])

0     0
1     1
2     0
3     2
4     1
5     1
6     1
7     0
8     3
9     0
10    1
11    0
12    1
13    1
14    4
15    1
16    4
17    0
18    2
19    1
20    4
21    0
22    0
23    1
24    1
25    1
26    2
27    0
28    1
29    0
30    1
31    2
32    1
33    1
34    3
35    1
36    0
37    1
38    1
39    3
Name: ypredict, dtype: int64
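
The label casing in the data is inconsistent ("DrugY" versus "drugX"), so it can be worth checking that every predicted label was actually encoded. The sketch below applies the same mapping to the first five predictions shown earlier and uses Series.map, which yields NaN for any label missing from the dictionary, whereas Series.replace would leave such a label unchanged. The preds and encoded names are illustrative.

```python
import pandas as pd

scale_mapper = {"drugA": 3, "drugB": 4, "drugC": 2, "drugX": 0, "DrugY": 1}

# First five predicted labels, copied from the X_test_insaid.head() output above.
preds = pd.Series(["drugX", "DrugY", "drugX", "drugC", "DrugY"])

# map() returns NaN for labels that are not keys of the dictionary,
# which makes an unexpected spelling easy to detect.
encoded = preds.map(scale_mapper)
unmapped = preds[encoded.isna()]

print(encoded.tolist())
print(len(unmapped))
```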


In [45]:
X_test_insaid['Id']=drug_id

In [46]:
X_test_insaid.drop(['Age','Na_to_K','Sex_M','BP_LOW','BP_NORMAL','Cholesterol_NORMAL'],axis=1,inplace=True)
X_test_insaid.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40 entries, 0 to 39
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   ypredict  40 non-null     int64
 1   Id        40 non-null     int64
dtypes: int64(2)
memory usage: 768.0 bytes


In [47]:
X_test_insaid= X_test_insaid.reindex(columns=['Id','ypredict'])

In [48]:
X_test_insaid.to_csv('submission_madhavi.csv', index=False, header=False)
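
Because the file is written without a header or index, a quick read-back can confirm that it has the expected two columns and only valid class codes. The sketch below is a minimal check under stated assumptions: it uses an in-memory buffer instead of the real file, and the five rows are the first Ids and encoded predictions shown earlier.

```python
import io
import pandas as pd

# First five Ids and encoded predictions, copied from the outputs above.
submission = pd.DataFrame({"Id": [95, 15, 30, 158, 128], "ypredict": [0, 1, 0, 2, 1]})

# Write in the same format as the submission file, but to an in-memory buffer.
buffer = io.StringIO()
submission.to_csv(buffer, index=False, header=False)
buffer.seek(0)

# Read it back without a header row and check shape, uniqueness, and code range.
check = pd.read_csv(buffer, header=None, names=["Id", "ypredict"])
ids_unique = check["Id"].is_unique
codes_valid = check["ypredict"].isin([0, 1, 2, 3, 4]).all()

print(check.shape, ids_unique, codes_valid)
```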