INTRODUCTION

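Polycystic ovary syndrome (PCOS) prediction with machine learning means estimating, from patient data such as hormone levels, age, and medical history, how likely a patient is to have PCOS. This notebook trains an AdaBoost classifier on five features (age, BMI, menstrual irregularity, testosterone level, and antral follicle count) to predict the binary PCOS_Diagnosis label.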

IMPORT LIBRARIES AND DATA LOADING

In [80]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
from sklearn.preprocessing import LabelEncoder,StandardScaler
from sklearn.ensemble import AdaBoostClassifier
from sklearn import metrics
from sklearn.model_selection import train_test_split
data=pd.read_csv(r'C:\Users\Lenovo\Documents\Data analytics\data_analytics\project\pcos_dataset.csv')
data.head(10)

,Age,BMI,Menstrual_Irregularity,Testosterone_Level(ng/dL),Antral_Follicle_Count,PCOS_Diagnosis
0,24,34.7,1,25.2,20,0
1,37,26.4,0,57.1,25,0
2,32,23.6,0,92.7,28,0
3,28,28.8,0,63.1,26,0
4,25,22.1,1,59.8,8,0
5,38,19.3,0,28.4,6,0
6,24,20.2,1,72.5,29,0
7,43,20.2,1,85.8,17,0
8,36,20.6,0,50.4,5,0
9,40,20.4,0,82.0,21,0


DATA CLEANING

In [81]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 6 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Age                        1000 non-null   int64  
 1   BMI                        1000 non-null   float64
 2   Menstrual_Irregularity     1000 non-null   int64  
 3   Testosterone_Level(ng/dL)  1000 non-null   float64
 4   Antral_Follicle_Count      1000 non-null   int64  
 5   PCOS_Diagnosis             1000 non-null   int64  
dtypes: float64(2), int64(4)
memory usage: 47.0 KB


In [82]:
data.isna().sum()

Age                          0
BMI                          0
Menstrual_Irregularity       0
Testosterone_Level(ng/dL)    0
Antral_Follicle_Count        0
PCOS_Diagnosis               0
dtype: int64

In [83]:
data.duplicated().sum()

np.int64(0)

EXTRACTING INDEPENDENT AND DEPENDENT VARIABLES

In [84]:
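# Independent variables: the first five columns (Age through Antral_Follicle_Count)
x=data.iloc[:,0:5].values
x=pd.DataFrame(x)
# Dependent variable: the binary PCOS_Diagnosis label
y=data['PCOS_Diagnosis'].values
y=pd.DataFrame(y)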
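
Selecting columns by position works only as long as the column order never changes. A minimal sketch of the same extraction done by column name, using the ten rows printed by data.head(10) as a small sample (the names sample, target, X_named and y_named are illustrative, not part of the notebook):

```python
import io
import pandas as pd

# The ten rows shown by data.head(10), reused here as a small sample.
sample_csv = """Age,BMI,Menstrual_Irregularity,Testosterone_Level(ng/dL),Antral_Follicle_Count,PCOS_Diagnosis
24,34.7,1,25.2,20,0
37,26.4,0,57.1,25,0
32,23.6,0,92.7,28,0
28,28.8,0,63.1,26,0
25,22.1,1,59.8,8,0
38,19.3,0,28.4,6,0
24,20.2,1,72.5,29,0
43,20.2,1,85.8,17,0
36,20.6,0,50.4,5,0
40,20.4,0,82.0,21,0
"""
sample = pd.read_csv(io.StringIO(sample_csv))

target = "PCOS_Diagnosis"
X_named = sample.drop(columns=[target])  # every column except the label
y_named = sample[target]                 # 1-D Series, the shape scikit-learn expects

print(X_named.columns.tolist())
print(X_named.shape, y_named.shape)
```

Keeping the label as a one-dimensional Series also avoids the column-vector warning that scikit-learn raises when a single-column DataFrame is passed as the target.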

SPLITTING DATASET INTO TRAIN AND TEST DATA

In [85]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=.25,random_state=67)
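
When one class is much rarer than the other, a purely random split can leave the test set with a different class ratio than the training set. A minimal sketch of a stratified split, assuming synthetic stand-in data (X_demo, y_demo and the 20% positive rate are hypothetical, not taken from the PCOS dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1000 rows, 5 features, roughly 20% positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = (rng.random(1000) < 0.2).astype(int)

# stratify=y_demo keeps the positive rate nearly identical in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=67, stratify=y_demo
)

print("positive rate (all):  ", y_demo.mean())
print("positive rate (train):", y_tr.mean())
print("positive rate (test): ", y_te.mean())
```

The same stratify argument could be passed to the split above with the real labels, at the cost of producing a different train/test partition than the one recorded in this notebook.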

FEATURE SCALING

In [86]:
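st_x=StandardScaler()
# Fit the scaler on the training data only, then reuse its mean and variance
# for the test data so that both sets are scaled identically.
x_train=st_x.fit_transform(x_train)
x_test=st_x.transform(x_test)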
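
One way to guarantee that the scaler only ever learns its statistics from training data is to wrap it together with the classifier in a scikit-learn Pipeline. A minimal sketch on synthetic stand-in data (X_demo, y_demo and the labeling rule are hypothetical, not the PCOS dataset):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data with a simple, learnable labeling rule.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(400, 5))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 100.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=67
)

# fit() fits the scaler on X_tr only; score() and predict() reuse those
# training statistics to transform whatever data they are given.
pipe = make_pipeline(
    StandardScaler(),
    AdaBoostClassifier(n_estimators=50, random_state=0),
)
pipe.fit(X_tr, y_tr)

scaler = pipe.named_steps["standardscaler"]
print("scaler means learned from training data:", scaler.mean_)
print("test accuracy:", pipe.score(X_te, y_te))
```

Because the pipeline is a single estimator, it can also be passed directly to cross-validation utilities, which then refit the scaler inside every fold.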

MODEL BUILDING

In [87]:
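# AdaBoost with 50 boosting rounds; the default weak learner is a depth-1 decision tree
model=AdaBoostClassifier(n_estimators=50,learning_rate=1)
# ravel() flattens the single-column label frame into the 1-D array that fit() expects
model.fit(x_train,y_train.values.ravel())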

In [88]:
y_predict=model.predict(x_test)
print(y_test)
print(y_predict)

     0
604  0
391  1
691  0
925  0
610  1
..  ..
996  0
487  0
882  0
796  0
721  0

[250 rows x 1 columns]
[0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0
 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0
 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
 1 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 1 0 0 0 0 0 0 0
 0 0 0 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0]


In [89]:
# For 0/1 labels, MSE equals the misclassification rate (1 - accuracy)
print('mse:',metrics.mean_squared_error(y_test,y_predict))
print("accuracy:",metrics.accuracy_score(y_test,y_predict))


mse: 0.004
accuracy: 0.996
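
For 0/1 labels, each prediction contributes a squared error of 1 when it is wrong and 0 when it is right, so the MSE is just the misclassification rate, 1 - accuracy. With 250 test rows, the values above correspond to a single misclassified row. Accuracy alone does not say which class that error fell in, which matters when positives are rare. A minimal sketch with hypothetical labels (the count of 50 positives and the direction of the single error are assumptions, not taken from the notebook's output):

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels: 250 test rows, 50 of them positive.
y_true = np.zeros(250, dtype=int)
y_true[:50] = 1

# One misclassification: a true positive predicted as negative.
# The direction of the error is chosen arbitrarily for illustration.
y_pred = y_true.copy()
y_pred[0] = 0

mse = metrics.mean_squared_error(y_true, y_pred)
acc = metrics.accuracy_score(y_true, y_pred)
cm = metrics.confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class

print("mse:", mse, "accuracy:", acc)
print(cm)
print("recall:", metrics.recall_score(y_true, y_pred))
print("precision:", metrics.precision_score(y_true, y_pred))
```

Running metrics.confusion_matrix and metrics.classification_report on the real y_test and y_predict would show whether the one error is a missed PCOS case or a false alarm.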


SUMMARY

After experimenting with different algorithms, AdaBoost emerged as the best model, reaching 99.6% accuracy on the 250-row test set. Its advantages are that it combines many weak classifiers into a stronger one, which improves accuracy, and that it handles imbalanced datasets effectively.
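
The comparison between algorithms is not shown in this notebook. A minimal sketch of how such a comparison could be run with stratified cross-validation follows. The candidate models, the F1 scoring choice, and the synthetic stand-in data (X_demo, y_demo) are all assumptions made for illustration, not the author's actual experiment:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical stand-in data: 5 features, about 20% positives.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.8, 0.2], random_state=0,
)

# Hypothetical candidate list; any scikit-learn classifier could be added.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "adaboost": AdaBoostClassifier(n_estimators=50, random_state=0),
}

# Stratified folds keep the class ratio stable across folds; F1 is less
# forgiving than accuracy when the positive class is rare.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = {
    name: cross_val_score(clf, X_demo, y_demo, cv=cv, scoring="f1").mean()
    for name, clf in candidates.items()
}

for name, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: mean F1 = {score:.3f}")
```

Averaging over several folds gives a more stable ranking than a single train/test split, where a difference of one or two rows can change which model looks best.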