In [8]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [9]:
#loading datasets
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

In [10]:
train

Unnamed: 0,ID,Age,Weight_kg,PCOS,Hormonal_Imbalance,Hyperandrogenism,Hirsutism,Conception_Difficulty,Insulin_Resistance,Exercise_Frequency,Exercise_Type,Exercise_Duration,Sleep_Hours,Exercise_Benefit
0,0,20-25,64.0,No,No,No,No,No,No,Rarely,"Cardio (e.g., running, cycling, swimming)",30 minutes,Less than 6 hours,Somewhat
1,1,15-20,55.0,No,No,No,No,No,No,6-8 Times a Week,No Exercise,Less than 30 minutes,6-8 hours,Somewhat
2,2,15-20,91.0,No,No,No,Yes,No,No,Rarely,"Cardio (e.g., running, cycling, swimming)",Less than 30 minutes,6-8 hours,Somewhat
3,3,15-20,56.0,No,No,No,No,No,No,6-8 Times a Week,"Cardio (e.g., running, cycling, swimming)",45 minutes,6-8 hours,Not at All
4,4,15-20,47.0,No,Yes,No,No,No,No,Rarely,No Exercise,Not Applicable,6-8 hours,Not Much
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
205,205,20-25,57.0,No,No,No,No,No,No,Rarely,No Exercise,Not Applicable,Less than 6 hours,Somewhat
206,206,Less than 20,53.6,No,Yes,Yes,Yes,No,No,Rarely,No Exercise,Not Applicable,6-8 hours,Somewhat
207,207,30-35,30.0,No,No,No,No,No,No,1-2 Times a Week,Cardio (e.g.,30 minutes,6-8 hours,Somewhat
208,208,20-25,65.0,No,No,No,Yes,No,No,1-2 Times a Week,No Exercise,Less than 30 minutes,Less than 6 hours,Somewhat


In [11]:
test

Unnamed: 0,ID,Age,Weight_kg,Hormonal_Imbalance,Hyperandrogenism,Hirsutism,Conception_Difficulty,Insulin_Resistance,Exercise_Frequency,Exercise_Type,Exercise_Duration,Sleep_Hours,Exercise_Benefit
0,0,20-25,54.0,No,No,No,No,No,Rarely,No Exercise,Less than 30 minutes,6-8 hours,Somewhat
1,1,20-25,65.0,Yes,No,No,No,No,3-4 Times a Week,No Exercise,Not Applicable,6-8 hours,Somewhat
2,2,20-25,64.0,Yes,No,No,No,No,6-8 Times a Week,Cardio (e.g.,Not Applicable,6-8 hours,Somewhat
3,3,Less than 20,57.0,Yes,No,Yes,No,Yes,Rarely,No Exercise,6-8 hours,6-8 hours,Somewhat
4,4,Less than 20,6.0,Yes,No,Yes,No,No,Rarely,Cardio (e.g.,30 minutes,6-8 hours,Somewhat
...,...,...,...,...,...,...,...,...,...,...,...,...,...
140,140,20-25,52.0,Yes,No,No,No,No,Rarely,Cardio (e.g.,Less than 30 minutes,6-8 hours,Somewhat
141,141,20-25,67.0,No,No,No,No,No,Rarely,Strength training (e.g.,Less than 30 minutes,6-8 hours,Not at All
142,142,20-25,55.0,Yes,Yes,Yes,Yes,No,Rarely,Cardio (e.g.,Less than 20 minutes,6-8 hours,Yes Significantly
143,143,Less than 20,49.0,No,No,Yes,No,No,1/2 Times a Week,Cardio (e.g.,Less than 30 minutes,6-8 hours,Not Much


# Business Case

The primary purpose of this dataset is to facilitate research and analysis on the impact of lifestyle choices on PCOS. It is designed for:            

- Exploratory Data Analysis (EDA): Understanding trends, correlations, and patterns in the data.         
- Predictive Modeling: Building machine learning models to predict PCOS based on health metrics.          
- Health Research: Supporting studies on how diet, exercise, and stress levels contribute to PCOS prevalence.       
- Awareness:Educating individuals and healthcare providers about the importance of lifestyle in managing reproductive health.    
    
    
**The aim is to use data-driven methods to deeply analyze the impact of lifestyle choices on PCOS, and provide scientific basis for clinical intervention and health management.**

# Domain Analysis

**1.ID** :   no effect on pcos, should be dropped         
**2.Age** :  PCOS is more common in reprpductive age women, typically between 15-45 years. younger women(below 20) might have
 undiagnosed PCOS, while older women might have better hormonal regulation.                       
**3.Weight_kg** : Obesity is a major risk factor for PCOS. Higher weight can lead to insulin resisitance and hormonal imbalances, wosening PCOS symptoms.
**4.PCOS** :     This is what we are predicting      
**5.Hormonal_Imbalance**: PCOS is fundamentally a hormal disorder, so this feature is highly correlated with PCOS.   
**6.Hyperandrogenism**:       High levels of androgens (male hormones) cause symptoms like acne, excessive hair growth, and irregular periods-key markers of pCOS.
**7.Hirsutism**:  Hirsutism(excess facial/body hair) is a direct symptom of PCOS caused by high androgen levels                
**8.Conception_Difficuly**: PCOS is a leading cause of infertility due to irregular ovulation. High correlation with PCOS.     
**9.Insulin_Resistance**: PCOS is strongly linked to insulin, leading to weight gain and metabolic issues.           
**10.Exercise_Frequency**:     Regular exercise helps improve insulin sensitivity and hormone balance, potentially reducing PCOS symtoms.       
**11.Exercise_Type**:  Strength training and mixed workouts may have a greater positive effect on insulin sensitivity then cardio alone.                             
**12.Exercise_duration**:  Longer exercise sessions can help regulate weight and insulin levels, reducing pCOS .   
**13.Sleep_Hours**:        poor sleep(<6 hours) is linked to hormonal imbalance and metabolic issues , increasing PCOS.       
**14.Exercise_Benefit**:   helps measure the impact of exercise on pCOS. if many PCOS patients see improvement, exercise could be a key intervention.    


In [12]:
train.shape

(210, 14)

In [13]:
test.shape

(145, 13)

# Data Cleaning

# 1. AGe

In [14]:
train['Age'] = train['Age'].replace({
    'Less than 20': '15-20',
    '30-25': '25-30',
    '30-40' : '35-45',
    '35-44' : '35-45',
    'Less than 20-25': '15-20'
})

In [15]:
test['Age'] = test['Age'].replace({
    'Less than 20': '15-20',
    'Less than 20)': '15-20',
    '30-30': '25-30',
    '25-25' : '25-30',
    '30-40' : '35-45',
    '50-60' : '45 and above',
    '22-25' : '25-30',
    '20' : '15-20',
    '45-49' : '45 and above',
    '35-44' : '35-45',
    'Less than 20-25': '15-20'
})

In [16]:
train.Age.value_counts()

Age
20-25           125
15-20            69
35-45             5
25-30             5
45 and above      3
30-35             2
Name: count, dtype: int64

In [17]:
test.Age.value_counts()

Age
20-25           95
15-20           40
25-30            3
35-45            2
30-35            2
45 and above     2
Name: count, dtype: int64

# HOrmonal

In [18]:
train['Hormonal_Imbalance'] = train['Hormonal_Imbalance'].replace({
   'No, Yes, not diagnosed by a doctor' : 'No',
    'Yes Significantly' : 'Yes'
})

In [19]:
train.Hormonal_Imbalance.value_counts()

Hormonal_Imbalance
Yes    112
No      96
Name: count, dtype: int64

In [20]:
test.Hormonal_Imbalance.value_counts()

Hormonal_Imbalance
Yes    79
No     63
Name: count, dtype: int64

# Hirsutism

In [21]:
train['Hirsutism'] = train['Hirsutism'].replace({
   'No, Yes, not diagnosed by a doctor' : 'No'
})

In [22]:
train.Hirsutism.value_counts()

Hirsutism
No     150
Yes     55
Name: count, dtype: int64

In [23]:
test.Hirsutism.value_counts()

Hirsutism
No     113
Yes     30
Name: count, dtype: int64

# Conception_Difficulty

In [24]:
test['Conception_Difficulty'] = test['Conception_Difficulty'].replace({
   'Somewhat': 'Yes'
})

In [25]:
train['Conception_Difficulty'] = train['Conception_Difficulty'].replace({
   'No, Yes, not diagnosed by a doctor' : 'No',
    'Yes, diagnosed by a doctor' : 'Yes'
})

In [26]:
train.Conception_Difficulty.value_counts()

Conception_Difficulty
No     202
Yes      7
Name: count, dtype: int64

In [27]:
test.Conception_Difficulty.value_counts()

Conception_Difficulty
No     137
Yes      6
Name: count, dtype: int64

# Exercise_Frequency

In [28]:
test['Exercise_Frequency'] = test['Exercise_Frequency'].replace({
   'Daily' : '6-8 Times a Week',
    'Less than 6-8 Times a Week' : '3-4 Times a Week',
    'Less than 6 hours' : 'Unknown',
    '30-35' : 'Unknown',
    'Somewhat' : 'Unknown',
    '6-8 hours' : 'Unknown',
    '1/2 Times a Week' : '1-2 Times a Week'
})

In [29]:
train['Exercise_Frequency'] = train['Exercise_Frequency'].replace({
   '6-8 hours' : 'Unknown',
    'Less than usual' : 'Rarely',
    'Less than 6 hours' : 'Unknown'
})

In [30]:
test.Exercise_Frequency.value_counts()

Exercise_Frequency
Rarely              82
1-2 Times a Week    33
Never               10
3-4 Times a Week     8
6-8 Times a Week     7
Unknown              4
Name: count, dtype: int64

In [31]:
train.Exercise_Frequency.value_counts()

Exercise_Frequency
Rarely              103
1-2 Times a Week     35
Never                28
3-4 Times a Week     23
6-8 Times a Week     17
Unknown               2
Name: count, dtype: int64

# Exercise_Duration

In [32]:
test['Exercise_Duration'] = test['Exercise_Duration'].replace({
   '6-8 hours' : 'unknown',
    'Less than 20 minutes' : 'Less than 30 minutes',
    '3-4 Times a Week' : 'unknown',
    '20 minutes' : 'Less than 30 minutes',
    'Less than 6 hours' : 'unknown',
    'Not Much' : 'unknown',
    '1-2 Times a Week' : 'unknown',
    '40 minutes' : '30 minutes to 1 hour' })

In [33]:
train['Exercise_Duration'] = train['Exercise_Duration'].replace({
   'More than 30 minutes' : '30 minutes to 1 hour',
    '20 minutes' : 'Less than 30 minutes',
    'Less than 6 hours' : 'unknown',
    '45 minutes' : '30 minutes to 1 hour'
})

In [34]:
test.Exercise_Duration.value_counts()

Exercise_Duration
Not Applicable          58
Less than 30 minutes    53
30 minutes              20
unknown                  7
30 minutes to 1 hour     7
Name: count, dtype: int64

In [35]:
train.Exercise_Duration.value_counts()

Exercise_Duration
Not Applicable          86
Less than 30 minutes    63
30 minutes              33
30 minutes to 1 hour    25
unknown                  1
Name: count, dtype: int64

# Exercise_Type

In [36]:
test['Exercise_Type'] = test['Exercise_Type'].replace({
    'Cardio (e.g.' : 'Cardio',
    'Flexibility and balance (e.g.' : 'Flexibility',
    'Strength training (e.g.' : 'Strength training',
    'Strength (e.g.' : 'Strength training',
    'No' : 'No Exercise',
    'Not Applicable' : 'No Exercise',
    'Yes Significantly' : 'No Exercise',
    'Sleep_Benefit' : 'No Exercise',
    'Somewhat' : 'No Exercise'
})
    

In [37]:
train['Exercise_Type'] = train['Exercise_Type'].replace({
    'Cardio (e.g., running, cycling, swimming)' : 'Cardio',
    'Cardio (e.g.' : 'Cardio',
    'Flexibility and balance (e.g., yoga, pilates)' : 'Flexibility',
    'Strength training (e.g., weightlifting, resistance exercises)' : 'Strength training',
    'Cardio (e.g., running, cycling, swimming), Strength training (e.g., weightlifting, resistance exercises)' : 'Cardio',
    'Cardio (e.g., running, cycling, swimming), Flexibility and balance (e.g., yoga, pilates)' : 'Cardio',
    'Cardio (e.g., running, cycling, swimming), Strength training (e.g., weightlifting, resistance exercises), Flexibility and balance (e.g., yoga, pilates)' :'Cardio',
    'Strength training (e.g., weightlifting, resistance exercises), Flexibility and balance (e.g., yoga, pilates)' : 'Strength training',
    'Flexibility and balance (e.g., yoga, pilates), None' : 'Flexibility',
    'Cardio (e.g., running, cycling, swimming), None' : 'Cardio',
    'Strength training (e.g.' : 'Strength training',
    'Somewhat' : 'No Exercise',
    'Flexibility and balance (e.g.' : 'Flexibility',
     'High-intensity interval training (HIIT)' : 'Cardio' })

In [38]:
test.Exercise_Type.value_counts()

Exercise_Type
No Exercise          68
Cardio               65
Flexibility           7
Strength training     4
Name: count, dtype: int64

In [39]:
train.Exercise_Type.value_counts()

Exercise_Type
No Exercise          91
Cardio               90
Flexibility          18
Strength training     9
Name: count, dtype: int64

# Sleep_Hours

In [40]:
train['Sleep_Hours'] = train['Sleep_Hours'].replace({
  '3-4 hours' : 'Less than 6 hours'
})

In [41]:
test['Sleep_Hours'] = test['Sleep_Hours'].replace({
  '3-4 hours' : 'Less than 6 hours',
 '6-8 Times a Week' : 'Less than 6 hours',
    '6-12 hours' : 'More than 12 hours',
    '20 minutes' : 'Less than 6 hours'
})

In [42]:
train.Sleep_Hours.value_counts()

Sleep_Hours
6-8 hours             135
Less than 6 hours      59
9-12 hours             13
More than 12 hours      1
Name: count, dtype: int64

In [43]:
test.Sleep_Hours.value_counts()

Sleep_Hours
6-8 hours             100
Less than 6 hours      38
9-12 hours              5
More than 12 hours      1
Name: count, dtype: int64

# Exercise_Benefit

In [44]:
test['Exercise_Benefit'] = test['Exercise_Benefit'].replace({
  'Not Much' : 'Somewhat'
})

In [45]:
train['Exercise_Benefit'] = train['Exercise_Benefit'].replace({
  'Not Much' : 'Somewhat'
})

In [46]:
train.Exercise_Benefit.value_counts()

Exercise_Benefit
Somewhat             158
Not at All            26
Yes Significantly     25
Name: count, dtype: int64

In [47]:
test.Exercise_Benefit.value_counts()

Exercise_Benefit
Somewhat             119
Yes Significantly     15
Not at All            10
Name: count, dtype: int64

## Hyperandrogenism

In [48]:
train.Hyperandrogenism.value_counts()

Hyperandrogenism
No     175
Yes     32
Name: count, dtype: int64

In [49]:
test.Hyperandrogenism.value_counts()

Hyperandrogenism
No     128
Yes     16
Name: count, dtype: int64

## Insulin_Resistance

In [50]:
train.Insulin_Resistance.value_counts()

Insulin_Resistance
No                                    185
Yes                                    23
No, Yes, not diagnosed by a doctor      1
Name: count, dtype: int64

In [51]:
test.Insulin_Resistance.value_counts()

Insulin_Resistance
No                   126
Yes                   16
Yes Significantly      2
Name: count, dtype: int64

In [52]:
train['Insulin_Resistance'] = train['Insulin_Resistance'].replace({
 "No, Yes, not diagnosed by a doctor"   : 'No'

})

In [53]:
test['Insulin_Resistance'] = test['Insulin_Resistance'].replace({
     'Yes Significantly' : 'Yes'

})

# Handling outliers

In [54]:
train.loc[train['ID'] == 33]

Unnamed: 0,ID,Age,Weight_kg,PCOS,Hormonal_Imbalance,Hyperandrogenism,Hirsutism,Conception_Difficulty,Insulin_Resistance,Exercise_Frequency,Exercise_Type,Exercise_Duration,Sleep_Hours,Exercise_Benefit
33,33,20-25,116.0,No,No,No,No,No,No,1-2 Times a Week,Cardio,30 minutes to 1 hour,9-12 hours,Somewhat


In [55]:
import numpy as np

# Calculate Q1, Q3, and IQR
Q1 = train['Weight_kg'].quantile(0.25)
Q3 = train['Weight_kg'].quantile(0.75)
IQR = Q3 - Q1

# Define lower and upper bounds for outliers
lower_bound = Q1 - (1.5 * IQR)
upper_bound = Q3 + (1.5 * IQR)

print(f"Lower Bound: {lower_bound}, Upper Bound: {upper_bound}")

# Check if 116 is an outlier
is_outlier = 116 > upper_bound or 116 < lower_bound
print(f"Is Weight 116 an Outlier? {is_outlier}")

Lower Bound: 24.0, Upper Bound: 88.0
Is Weight 116 an Outlier? True


In [56]:
train.loc[train['Weight_kg'] > upper_bound, 'Weight_kg'] = train['Weight_kg'].median()

In [57]:
test.loc[test['ID'] == 4]

Unnamed: 0,ID,Age,Weight_kg,Hormonal_Imbalance,Hyperandrogenism,Hirsutism,Conception_Difficulty,Insulin_Resistance,Exercise_Frequency,Exercise_Type,Exercise_Duration,Sleep_Hours,Exercise_Benefit
4,4,15-20,6.0,Yes,No,Yes,No,No,Rarely,Cardio,30 minutes,6-8 hours,Somewhat


In [58]:
# Calculate Q1, Q3, and IQR
Q1 = test['Weight_kg'].quantile(0.25)
Q3 = test['Weight_kg'].quantile(0.75)
IQR = Q3 - Q1

# Define lower and upper bounds for outliers
lower_bound = Q1 - (1.5 * IQR)
upper_bound = Q3 + (1.5 * IQR)

print(f"Lower Bound: {lower_bound}, Upper Bound: {upper_bound}")

# Check if 116 is an outlier
is_outlier = 6 > lower_bound or 6 < upper_bound
print(f"Is Weight 6 an Outlier? {is_outlier}")

Lower Bound: 28.25, Upper Bound: 86.25
Is Weight 6 an Outlier? True


In [59]:
test.loc[test['Weight_kg'] < lower_bound, 'Weight_kg'] = test['Weight_kg'].median()

### handling null values

In [60]:
train.isnull().sum()

ID                       0
Age                      1
Weight_kg                2
PCOS                     0
Hormonal_Imbalance       2
Hyperandrogenism         3
Hirsutism                5
Conception_Difficulty    1
Insulin_Resistance       1
Exercise_Frequency       2
Exercise_Type            2
Exercise_Duration        2
Sleep_Hours              2
Exercise_Benefit         1
dtype: int64

#### age

In [61]:
mostcommonage = train[train['PCOS'] =='No']['Age'].mode()[0]
mostcommonage

'20-25'

In [62]:
train.loc[train['Age'].isnull(), 'Age'] =  mostcommonage

#### weight

In [63]:
median_weight  = train[(train['PCOS'] =='No') & (train['Age'] == '20-25')]['Weight_kg'].median()
median_weight

54.0

In [64]:
train.loc[train['Weight_kg'].isnull(), 'Weight_kg'] =  median_weight

#### hormonal imbalance

In [65]:
train.loc[train['ID']== 75,'Hormonal_Imbalance'] ='Yes'

In [66]:
train.loc[train['ID']== 145,'Hormonal_Imbalance'] ='Yes'

In [67]:
train[(train['Hyperandrogenism'] =='Yes') & (train['Hirsutism'] =='Yes')]['Hormonal_Imbalance'].value_counts()

Hormonal_Imbalance
Yes    22
Name: count, dtype: int64

#### Hyperandrogenism

In [68]:
train.loc[train['ID'] == 44, 'Hyperandrogenism'] = 'No'
train.loc[train['ID'] == 92, 'Hyperandrogenism'] = 'No'
train.loc[train['ID'] == 164, 'Hyperandrogenism'] = 'No'

#### Hirsutism

In [69]:
train.loc[train['ID'] == 89, 'Hirsutism'] = 'No'
train.loc[train['ID'] == 96, 'Hirsutism'] = 'No'
train.loc[train['ID'] == 165, 'Hirsutism'] = 'Yes'
train.loc[train['ID'] == 172, 'Hirsutism'] = 'Yes'
train.loc[train['ID'] == 192, 'Hirsutism'] = 'No'

#### Conception_Difficulty

In [70]:
train.loc[train['ID'] ==41, 'Conception_Difficulty'] ='No'

#### Insulin_Resistance

In [71]:
train[train['PCOS'] == 'No']['Insulin_Resistance'].value_counts()

Insulin_Resistance
No     157
Yes      6
Name: count, dtype: int64

In [72]:
train[(train['PCOS'] == 'No') & 
   (train['Hormonal_Imbalance'] == 'No') & 
   (train['Weight_kg'] >= 50) & (train['Weight_kg'] <= 60)]['Insulin_Resistance'].value_counts()

Insulin_Resistance
No     36
Yes     1
Name: count, dtype: int64

In [73]:
train[['Weight_kg', 'Insulin_Resistance']].groupby('Insulin_Resistance').mean()

Unnamed: 0_level_0,Weight_kg
Insulin_Resistance,Unnamed: 1_level_1
No,54.86129
Yes,60.73913


In [74]:
train.loc[train['ID'] ==41, 'Insulin_Resistance'] ='No'

#### Exercise_Frequency

In [75]:
train.loc[train['ID'] ==58, 'Exercise_Frequency'] ='Never'

In [76]:
train[(train['Exercise_Type'] == 'Strength training') & 
   (train['Exercise_Duration'] == 'Not Applicable')]['Exercise_Frequency'].value_counts()

Exercise_Frequency
Rarely    1
Name: count, dtype: int64

In [77]:
train.loc[train['ID'] ==158, 'Exercise_Frequency'] ='Rarely'

#### Exercise_Type

In [78]:
train[train['Exercise_Frequency'] == 'Rarely']['Exercise_Type'].value_counts()

Exercise_Type
No Exercise          51
Cardio               37
Flexibility          10
Strength training     4
Name: count, dtype: int64

In [79]:
train.loc[train['ID'] == 65, 'Exercise_Type'] = 'No Exercise'
train.loc[train['ID'] == 128, 'Exercise_Type'] = 'No Exercise'

In [80]:
train.Exercise_Type.value_counts()

Exercise_Type
No Exercise          93
Cardio               90
Flexibility          18
Strength training     9
Name: count, dtype: int64

#### Exercise_Duration

In [81]:
train.loc[train['ID'] == 42, 'Exercise_Duration'] = '30 minutes to 1 hour'


In [82]:
train[(train['Exercise_Type'].str.contains("Flexibility")) & 
   (train['Exercise_Frequency'] == "Rarely")]['Exercise_Duration'].value_counts()

Exercise_Duration
Not Applicable          4
Less than 30 minutes    4
30 minutes to 1 hour    1
Name: count, dtype: int64

In [83]:
train.loc[train['ID'] == 183, 'Exercise_Duration'] = 'Less than 30 minutes'


#### Sleep_Hours

In [84]:
train[(train['Age'] == '20-25') & 
   (train['Exercise_Frequency'] == '3-4 Times a Week')]['Sleep_Hours'].value_counts()

Sleep_Hours
6-8 hours            8
Less than 6 hours    5
9-12 hours           2
Name: count, dtype: int64

In [85]:
train.loc[train['ID'] == 12, 'Sleep_Hours'] = '6-8 hours'


In [86]:
train[(train['Age'] == '15-20') & 
   (train['Exercise_Frequency'] == 'Rarely')]['Sleep_Hours'].value_counts()

Sleep_Hours
6-8 hours            21
Less than 6 hours     9
9-12 hours            3
Name: count, dtype: int64

In [87]:
train.loc[train['ID'] == 105, 'Sleep_Hours'] = '6-8 hours'


#### Exercise_Benefit

In [88]:

train[(train['Exercise_Frequency'] == '1-2 Times a Week') & 
   (train['Exercise_Type'] == 'No Exercise')]['Exercise_Benefit'].value_counts()

Exercise_Benefit
Somewhat      4
Not at All    1
Name: count, dtype: int64

In [89]:
train.loc[train['ID'] == 140, 'Exercise_Benefit'] = 'Not at All'



# testing

In [90]:
test.isnull().sum()

ID                       0
Age                      1
Weight_kg                2
Hormonal_Imbalance       3
Hyperandrogenism         1
Hirsutism                2
Conception_Difficulty    2
Insulin_Resistance       1
Exercise_Frequency       1
Exercise_Type            1
Exercise_Duration        0
Sleep_Hours              1
Exercise_Benefit         1
dtype: int64

##### AGE

In [91]:
common_age = train[(train['Weight_kg'] >= 50) & (train['Weight_kg'] <= 55)]['Age'].mode()[0]
common_age

'20-25'

In [92]:
test.loc[test['Age'].isnull(), 'Age'] =  common_age

#### weight

In [93]:
median_weight = train[(train['Age'] == '20-25') & (train['Hormonal_Imbalance'] == 'Yes')]['Weight_kg'].median()

In [94]:
test.loc[test['ID'].isin([10, 33]), 'Weight_kg'] = median_weight

#### Hormonal_Imbalance

In [95]:
test.loc[test['ID'] == 50, 'Hormonal_Imbalance'] = 'No'
test.loc[test['ID'] == 62, 'Hormonal_Imbalance'] = 'Yes'  # If trend confirms
test.loc[test['ID'] == 119, 'Hormonal_Imbalance'] = 'No'

#### Hyperandrogenism

In [96]:
test.loc[test['ID'] == 46, 'Hyperandrogenism'] = 'Yes'

#### Hirsutism

In [97]:
test.loc[test['ID'] == 63, 'Hirsutism'] = 'No'
test.loc[test['ID'] == 96, 'Hirsutism'] = 'No'

#### Conception

In [98]:
test.loc[test['ID'] == 23, 'Conception_Difficulty'] = 'No'
test.loc[test['ID'] == 49, 'Conception_Difficulty'] = 'No'

#### Insulin_Resistance

In [99]:
test.loc[test['ID'] == 60, 'Insulin_Resistance'] = 'No'

#### Exercise_Frequency

In [100]:
train.Exercise_Frequency.value_counts()

Exercise_Frequency
Rarely              104
1-2 Times a Week     35
Never                29
3-4 Times a Week     23
6-8 Times a Week     17
Unknown               2
Name: count, dtype: int64

In [101]:
train[(train['Exercise_Duration'] == '30 minutes')]['Exercise_Frequency'].value_counts()

Exercise_Frequency
Rarely              12
1-2 Times a Week    11
3-4 Times a Week     6
6-8 Times a Week     4
Name: count, dtype: int64

In [102]:
test.loc[test['ID'] == 46, 'Exercise_Frequency'] = 'Rarely'

#### Exercise type

In [103]:
train[train['Exercise_Duration'] == '30 minutes']['Exercise_Type'].value_counts()

Exercise_Type
Cardio               25
No Exercise           4
Flexibility           3
Strength training     1
Name: count, dtype: int64

In [104]:
test.loc[test['ID'] == 66, 'Exercise_Type'] = 'Cardio'

#### sleep

In [105]:
train[(train['Age'] == '20-25') & 
   (train['Exercise_Frequency'] == '1-2 Times a Week')]['Sleep_Hours'].value_counts()

Sleep_Hours
6-8 hours            9
Less than 6 hours    7
9-12 hours           3
Name: count, dtype: int64

In [106]:
test.loc[test['ID'] == 101, 'Sleep_Hours'] = '6-8 hours'

#### benefit

In [107]:
train.Exercise_Benefit.value_counts()

Exercise_Benefit
Somewhat             158
Not at All            27
Yes Significantly     25
Name: count, dtype: int64

In [108]:
test.loc[test['ID'] == 96, 'Exercise_Benefit'] = 'Yes Significantly'

# handling duplicates

In [109]:
train.duplicated().sum()

0

In [110]:
test.duplicated().sum()

0

# Feature Engineering

In [111]:
train['pcos_severity'] = train[['Hormonal_Imbalance', 'Hyperandrogenism', 'Hirsutism']].apply(lambda x: sum(x == 'Yes'), axis=1)
test['pcos_severity'] = test[['Hormonal_Imbalance', 'Hyperandrogenism', 'Hirsutism']].apply(lambda x: sum(x == 'Yes'), axis=1)

In [112]:
def lifestyle_score(row):
    score = 0
    # Assign scores for exercise frequency
    if row['Exercise_Frequency'] == 'Never':
        score += 0
    elif row['Exercise_Frequency'] == 'Rarely':
        score += 1
    elif row['Exercise_Frequency'] == '1-2 times a week':
        score += 2
    elif row['Exercise_Frequency'] == '3-4 times a week':
        score += 3
    elif row['Exercise_Frequency'] == '6-8 times a week':
        score += 4
    

    # Assign scores for sleep hours
    if row['Sleep_Hours'] == 'Less than 6 hours':
        score -= 2
    elif row['Sleep_Hours'] == '6-8 hours':
        score += 2
    elif row['Sleep_Hours'] == '9-12 hours':
        score += 1
    elif row['Sleep_Hours'] == 'More than 12 hours':
        score -= 1

    return score

train['lifestyle_score'] = train.apply(lifestyle_score, axis=1)
test['lifestyle_score'] = test.apply(lifestyle_score, axis=1)

In [113]:
train['reproductive_risk_score'] = train[['Insulin_Resistance', 'Conception_Difficulty']].apply(lambda x: sum(x == 'Yes'), axis=1)
test['reproductive_risk_score'] = test[['Insulin_Resistance', 'Conception_Difficulty']].apply(lambda x: sum(x == 'Yes'), axis=1)

In [114]:
train['Sleep_Deficiency'] = train['Sleep_Hours'].apply(lambda x: 1 if x in ['Less than 6 hours', 'More than 12 hours'] else 0)
test['Sleep_Deficiency'] = test['Sleep_Hours'].apply(lambda x: 1 if x in ['Less than 6 hours', 'More than 12 hours'] else 0)

# Encoding

In [115]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 210 entries, 0 to 209
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   ID                       210 non-null    int64  
 1   Age                      210 non-null    object 
 2   Weight_kg                210 non-null    float64
 3   PCOS                     210 non-null    object 
 4   Hormonal_Imbalance       210 non-null    object 
 5   Hyperandrogenism         210 non-null    object 
 6   Hirsutism                210 non-null    object 
 7   Conception_Difficulty    210 non-null    object 
 8   Insulin_Resistance       210 non-null    object 
 9   Exercise_Frequency       210 non-null    object 
 10  Exercise_Type            210 non-null    object 
 11  Exercise_Duration        210 non-null    object 
 12  Sleep_Hours              210 non-null    object 
 13  Exercise_Benefit         210 non-null    object 
 14  pcos_severity            2

In [116]:
train['Age'] = train['Age'].replace({'15-20':0,'20-25':1,'25-30':2,'30-35':3,'35-45' : 4, '45 and above' : 5})
test['Age'] = test['Age'].replace({'15-20':0,'20-25':1,'25-30':2,'30-35':3,'35-45' : 4, '45 and above' : 5})

  train['Age'] = train['Age'].replace({'15-20':0,'20-25':1,'25-30':2,'30-35':3,'35-45' : 4, '45 and above' : 5})
  test['Age'] = test['Age'].replace({'15-20':0,'20-25':1,'25-30':2,'30-35':3,'35-45' : 4, '45 and above' : 5})


In [117]:
train['PCOS'] = train['PCOS'].replace({'No': 0,'Yes':1})

  train['PCOS'] = train['PCOS'].replace({'No': 0,'Yes':1})


In [118]:
train['Hormonal_Imbalance'] = train['Hormonal_Imbalance'].replace({'No': 0,'Yes':1})
test['Hormonal_Imbalance'] = test['Hormonal_Imbalance'].replace({'No': 0,'Yes':1})

  train['Hormonal_Imbalance'] = train['Hormonal_Imbalance'].replace({'No': 0,'Yes':1})
  test['Hormonal_Imbalance'] = test['Hormonal_Imbalance'].replace({'No': 0,'Yes':1})


In [119]:
train['Hyperandrogenism'] = train['Hyperandrogenism'].replace({'No': 0,'Yes':1})
test['Hyperandrogenism'] = test['Hyperandrogenism'].replace({'No': 0,'Yes':1})

  train['Hyperandrogenism'] = train['Hyperandrogenism'].replace({'No': 0,'Yes':1})
  test['Hyperandrogenism'] = test['Hyperandrogenism'].replace({'No': 0,'Yes':1})


In [120]:
train['Hirsutism'] = train['Hirsutism'].replace({'No': 0,'Yes':1})
test['Hirsutism'] = test['Hirsutism'].replace({'No': 0,'Yes':1})

  train['Hirsutism'] = train['Hirsutism'].replace({'No': 0,'Yes':1})
  test['Hirsutism'] = test['Hirsutism'].replace({'No': 0,'Yes':1})


In [121]:
train['Conception_Difficulty'] = train['Conception_Difficulty'].replace({'No': 0,'Yes':1})
test['Conception_Difficulty'] = test['Conception_Difficulty'].replace({'No': 0,'Yes':1})

  train['Conception_Difficulty'] = train['Conception_Difficulty'].replace({'No': 0,'Yes':1})
  test['Conception_Difficulty'] = test['Conception_Difficulty'].replace({'No': 0,'Yes':1})


In [122]:
train['Insulin_Resistance'] = train['Insulin_Resistance'].replace({'No': 0,'Yes':1})
test['Insulin_Resistance'] = test['Insulin_Resistance'].replace({'No': 0,'Yes':1})

  train['Insulin_Resistance'] = train['Insulin_Resistance'].replace({'No': 0,'Yes':1})
  test['Insulin_Resistance'] = test['Insulin_Resistance'].replace({'No': 0,'Yes':1})


In [123]:
train['Exercise_Frequency'] = train['Exercise_Frequency'].replace({'Rarely':0,'1-2 Times a Week':1,'Never':2,
                                '3-4 Times a Week' : 3,'6-8 Times a Week' : 4,'Unknown' : 5})
test['Exercise_Frequency'] = test['Exercise_Frequency'].replace({'Rarely':0,'1-2 Times a Week':1,'Never':2,
                                '3-4 Times a Week' : 3,'6-8 Times a Week' : 4,'Unknown' : 5})


  train['Exercise_Frequency'] = train['Exercise_Frequency'].replace({'Rarely':0,'1-2 Times a Week':1,'Never':2,
  test['Exercise_Frequency'] = test['Exercise_Frequency'].replace({'Rarely':0,'1-2 Times a Week':1,'Never':2,


In [124]:
train['Exercise_Type'] = train['Exercise_Type'].replace({'No Exercise' : 0,'Cardio' : 1,'Flexibility' : 2,
                                                         'Strength training': 3})
test['Exercise_Type'] = test['Exercise_Type'].replace({'No Exercise' : 0,'Cardio' : 1,'Flexibility' : 2,
                                                         'Strength training': 3})

  train['Exercise_Type'] = train['Exercise_Type'].replace({'No Exercise' : 0,'Cardio' : 1,'Flexibility' : 2,
  test['Exercise_Type'] = test['Exercise_Type'].replace({'No Exercise' : 0,'Cardio' : 1,'Flexibility' : 2,


In [125]:
train['Exercise_Duration'] = train['Exercise_Duration'].replace({'Not Applicable' : 0,'Less than 30 minutes' : 1,
        '30 minutes' : 2,'30 minutes to 1 hour' :3,'unknown': 4})
test['Exercise_Duration'] = test['Exercise_Duration'].replace({'Not Applicable' : 0,'Less than 30 minutes' : 1,
        '30 minutes' : 2,'30 minutes to 1 hour' :3,'unknown': 4})

  train['Exercise_Duration'] = train['Exercise_Duration'].replace({'Not Applicable' : 0,'Less than 30 minutes' : 1,
  test['Exercise_Duration'] = test['Exercise_Duration'].replace({'Not Applicable' : 0,'Less than 30 minutes' : 1,


In [126]:
train['Sleep_Hours'] = train['Sleep_Hours'].replace({'6-8 hours' : 0,'Less than 6 hours' : 1,
        '9-12 hours' : 2,'More than 12 hours' :3})
test['Sleep_Hours'] = test['Sleep_Hours'].replace({'6-8 hours' : 0,'Less than 6 hours' : 1,
        '9-12 hours' : 2,'More than 12 hours' :3})

  train['Sleep_Hours'] = train['Sleep_Hours'].replace({'6-8 hours' : 0,'Less than 6 hours' : 1,
  test['Sleep_Hours'] = test['Sleep_Hours'].replace({'6-8 hours' : 0,'Less than 6 hours' : 1,


In [127]:
train['Exercise_Benefit'] = train['Exercise_Benefit'].replace({'Somewhat' : 0,'Not at All' : 1,
        'Yes Significantly' : 2})
test['Exercise_Benefit'] = test['Exercise_Benefit'].replace({'Somewhat' : 0,'Not at All' : 1,
        'Yes Significantly' : 2})

  train['Exercise_Benefit'] = train['Exercise_Benefit'].replace({'Somewhat' : 0,'Not at All' : 1,
  test['Exercise_Benefit'] = test['Exercise_Benefit'].replace({'Somewhat' : 0,'Not at All' : 1,


In [130]:
train['Weight_kg'] = train['Weight_kg'].round().astype(int)
test['Weight_kg'] = test['Weight_kg'].round().astype(int)

# Balancing

In [131]:
from imblearn.over_sampling import SMOTE
sm = SMOTE()

###### apply after train test split

In [132]:
from sklearn.model_selection import train_test_split

In [133]:
x = train.drop(['PCOS','ID'], axis =1)

In [134]:
y = train['PCOS']

In [135]:
x_test = test.drop('ID',axis =1)

In [136]:
x_sm,y_sm = sm.fit_resample(x,y)

# random forest

In [191]:
from sklearn.ensemble import RandomForestClassifier

In [192]:
rf  = RandomForestClassifier()

In [193]:
rf.fit(x_sm,y_sm)

In [194]:
rf_pred = rf.predict_proba(x_test)[:,1] 

In [None]:
from sklearn.model_selection import cross_val_score

In [195]:
cv_scores = cross_val_score(rf, x_sm,y_sm,cv = 5)

In [196]:
cv_scores.mean()

0.8533799533799534

In [136]:
pcos_submission_fe = pd.DataFrame({'ID': test['ID'], 'PCOS' : rf_pred})

In [137]:
pcos_submission_fe.to_csv('pcos_submission_fe.csv', index = False)

# XGB

In [197]:
import xgboost as xgb
from xgboost import XGBClassifier

In [198]:
xgb = XGBClassifier()

In [199]:
xgb.fit(x_sm,y_sm)

In [200]:
cv_scores = cross_val_score(xgb, x_sm,y_sm,cv = 5)

In [201]:
cv_scores.mean()

0.8351981351981351

In [141]:
xgb_pred = xgb.predict_proba(x_test)[:,1] 

In [143]:
pcos_submission_xgb_1102 = pd.DataFrame({'ID': test['ID'], 'PCOS' : xgb_pred})

In [144]:
pcos_submission_xgb_1102.to_csv('pcos_submission_xgb_1102.csv', index = False)

# lightgbm

In [215]:
import lightgbm as lgb

In [216]:
from lightgbm import LGBMClassifier

In [217]:
lgb = LGBMClassifier()

In [164]:
lgb.fit(x_sm,y_sm)

[LightGBM] [Info] Number of positive: 164, number of negative: 164
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000168 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 95
[LightGBM] [Info] Number of data points in the train set: 328, number of used features: 15
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000


In [218]:
cv_scores = cross_val_score(lgb, x_sm,y_sm,cv = 5)

[LightGBM] [Info] Number of positive: 131, number of negative: 131
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000740 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 88
[LightGBM] [Info] Number of data points in the train set: 262, number of used features: 15
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Number of positive: 131, number of negative: 131
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000178 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 92
[LightGBM] [Info] Number of data points in the train set: 262, number of used features: 15
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM]

In [219]:
cv_scores.mean()

0.8441958041958042

In [165]:
lgb_pred = lgb.predict_proba(x_test)[:,1] 

In [167]:
pcos_submission_lgb_1102 = pd.DataFrame({'ID': test['ID'], 'PCOS' : lgb_pred})

In [168]:
pcos_submission_lgb_1102.to_csv('pcos_submission_lgb_1102.csv', index = False)

# Neural

In [220]:
from sklearn.neural_network import MLPClassifier

In [221]:
# Initialize the MLPClassifier
mlp = MLPClassifier(hidden_layer_sizes=(128, 64, 32),  # 3 layers with 128, 64, 32 neurons
                      activation='relu',  # ReLU activation for non-linearity
                      solver='adam',  # Adam optimizer
                      max_iter=500,  # Number of epochs
                      random_state=42)



In [222]:
mlp.fit(x_sm,y_sm)

In [223]:
cv_scores = cross_val_score(mlp, x_sm,y_sm,cv = 5)

In [224]:
cv_scores.mean()

0.7923076923076924

In [172]:
mlp_pred = mlp.predict_proba(x_test)[:,1] 

In [174]:
pcos_submission_mlp_1102 = pd.DataFrame({'ID': test['ID'], 'PCOS' : mlp_pred})

In [175]:
pcos_submission_mlp_1102.to_csv('pcos_submission_mlp_1102.csv', index = False)

# Ada Boost

In [184]:
from sklearn.ensemble import AdaBoostClassifier

In [185]:
adaboost = AdaBoostClassifier(n_estimators=100,learning_rate=0.05,random_state=42)

In [186]:
adaboost.fit(x_sm,y_sm)

In [187]:
ada_pred = adaboost.predict_proba(x_test)[:,1] 

## applying cv to check overfitting

In [188]:
from sklearn.model_selection import cross_val_score

In [189]:
cv_scores = cross_val_score(adaboost, x_sm,y_sm,cv = 5)

In [190]:
cv_scores.mean()

0.7346853146853147

In [189]:
pcos_submission_ada_1102 = pd.DataFrame({'ID': test['ID'], 'PCOS' : ada_pred})

In [190]:
pcos_submission_ada_1102.to_csv('pcos_submission_ada_1102.csv', index = False)

# H20 GB

In [139]:
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator

In [140]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
; Java HotSpot(TM) 64-Bit Server VM (build 25.441-b07, mixed mode)
  Starting server from C:\Users\mdsmb\anaconda3\Lib\site-packages\h2o\backend\bin\h2o.jar
  Ice root: C:\Users\mdsmb\AppData\Local\Temp\tmpqrfajvvn
  JVM stdout: C:\Users\mdsmb\AppData\Local\Temp\tmpqrfajvvn\h2o_mdsmb_started_from_python.out
  JVM stderr: C:\Users\mdsmb\AppData\Local\Temp\tmpqrfajvvn\h2o_mdsmb_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.
Please download and install the latest version from: https://h2o-release.s3.amazonaws.com/h2o/latest_stable.html


0,1
H2O_cluster_uptime:,05 secs
H2O_cluster_timezone:,Asia/Kolkata
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.46.0.6
H2O_cluster_version_age:,3 months and 10 days
H2O_cluster_name:,H2O_from_python_mdsmb_j113gk
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,3.071 Gb
H2O_cluster_total_cores:,12
H2O_cluster_allowed_cores:,12


In [141]:
df_train = h2o.H2OFrame(pd.concat([x_sm,y_sm],axis =1))

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


In [142]:
df_test = h2o.H2OFrame(x_test)

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


In [143]:
features = x_sm.columns.tolist()

In [144]:
target = y_sm.name

In [145]:
gbm_model = H2OGradientBoostingEstimator(ntrees = 500,max_depth = 10,learn_rate=0.05,balance_classes=True, seed = 42)

In [147]:
gbm_model.train(x = features,y = target, training_frame=df_train)

gbm Model Build progress: |



██████████████████████████████████████████████████████| (done) 100%


Unnamed: 0,number_of_trees,number_of_internal_trees,model_size_in_bytes,min_depth,max_depth,mean_depth,min_leaves,max_leaves,mean_leaves
,500.0,500.0,171838.0,7.0,10.0,9.68,15.0,28.0,22.482

Unnamed: 0,timestamp,duration,number_of_trees,training_rmse,training_mae,training_deviance
,2025-02-12 06:31:17,0.076 sec,0.0,0.5,0.5,0.25
,2025-02-12 06:31:17,0.336 sec,1.0,0.4855035,0.4853474,0.2357137
,2025-02-12 06:31:17,0.378 sec,2.0,0.4720318,0.4714240,0.2228140
,2025-02-12 06:31:17,0.408 sec,3.0,0.4595320,0.4581969,0.2111697
,2025-02-12 06:31:17,0.438 sec,4.0,0.4486707,0.4465172,0.2013054
,2025-02-12 06:31:17,0.460 sec,5.0,0.4378199,0.4344968,0.1916863
,2025-02-12 06:31:17,0.484 sec,6.0,0.4284662,0.4239176,0.1835833
,2025-02-12 06:31:17,0.502 sec,7.0,0.4190812,0.4129909,0.1756291
,2025-02-12 06:31:17,0.520 sec,8.0,0.4105583,0.4030605,0.1685581
,2025-02-12 06:31:17,0.539 sec,9.0,0.4027044,0.3938692,0.1621708

variable,relative_importance,scaled_importance,percentage
pcos_severity,300.4179688,1.0,0.3616173
Weight_kg,187.3761597,0.6237182,0.2255473
lifestyle_score,65.5707245,0.218265,0.0789284
Exercise_Type,51.2709732,0.1706655,0.0617156
Exercise_Duration,43.9815025,0.146401,0.0529411
Exercise_Frequency,42.5853043,0.1417535,0.0512605
Sleep_Hours,30.7805462,0.1024591,0.037051
Hyperandrogenism,27.4012222,0.0912103,0.0329832
Hirsutism,22.9259186,0.0763134,0.0275962
Age,19.5298576,0.065009,0.0235084


In [148]:
preds = gbm_model.predict(df_test).as_data_frame()['predict']

gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%





In [149]:
h20gbm_submission_1202 = pd.DataFrame({'ID' : x_test.index, 'PCOS' : preds})

In [151]:
h20gbm_submission_1202.to_csv('h20gbm_submission_1202.csv', index = False)

# Extra Trees Classifier

In [230]:
from sklearn.ensemble import ExtraTreesClassifier

In [231]:
extra_trees = ExtraTreesClassifier(n_estimators=200,random_state=42)

In [232]:
extra_trees.fit(x_sm,y_sm)

In [233]:
cv_scores = cross_val_score(extra_trees, x_sm,y_sm,cv = 5)

In [234]:
cv_scores.mean()

0.8625641025641027

In [135]:
extra_trees_pred = extra_trees.predict_proba(x_test)[:,1] 

In [137]:
pcos_submission_extratrees_1202 = pd.DataFrame({'ID': test['ID'], 'PCOS' : extra_trees_pred})

In [138]:
pcos_submission_extratrees_1202.to_csv('pcos_submission_extratrees_1202.csv', index = False)

# optimized adaboost

In [160]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

In [161]:
base_estimator = DecisionTreeClassifier()

adaboost = AdaBoostClassifier(base_estimator=base_estimator)

In [162]:
# Hyperparameter Grid
param_grid = {
    'n_estimators': [50, 100, 200, 500],
    'learning_rate': [0.01, 0.05, 0.1, 0.5, 1],
    'base_estimator__max_depth': [1, 3, 5, 7]
}

In [163]:
# Grid Search CV
grid_search = GridSearchCV(adaboost, param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=2)
grid_search.fit(x_sm, y_sm)

Fitting 5 folds for each of 80 candidates, totalling 400 fits




In [164]:
# Best Model
best_model = grid_search.best_estimator_

In [170]:
# Predicions
optimized_ada_preds = best_model.predict_proba(x_test)[:,1]

In [172]:
pcos_submission_optimized_ada_1202 = pd.DataFrame({'ID': test['ID'], 'PCOS' : optimized_ada_preds})

In [173]:
pcos_submission_optimized_ada_1202.to_csv('pcos_submission_optimized_ada_1202.csv', index = False)

# CatBooost

In [137]:
from catboost import CatBoostClassifier

In [138]:
cat_boost = CatBoostClassifier(iterations=500,depth =6,learning_rate =0.1,loss_function='Logloss')

In [139]:
cat_boost.fit(x_sm,y_sm)

0:	learn: 0.6243905	total: 151ms	remaining: 1m 15s
1:	learn: 0.5861922	total: 153ms	remaining: 38.2s
2:	learn: 0.5475115	total: 156ms	remaining: 25.8s
3:	learn: 0.5110544	total: 159ms	remaining: 19.7s
4:	learn: 0.4927944	total: 161ms	remaining: 15.9s
5:	learn: 0.4697391	total: 163ms	remaining: 13.4s
6:	learn: 0.4452537	total: 166ms	remaining: 11.7s
7:	learn: 0.4299519	total: 168ms	remaining: 10.3s
8:	learn: 0.4127522	total: 173ms	remaining: 9.42s
9:	learn: 0.3952358	total: 175ms	remaining: 8.56s
10:	learn: 0.3884188	total: 176ms	remaining: 7.84s
11:	learn: 0.3816087	total: 179ms	remaining: 7.26s
12:	learn: 0.3691636	total: 180ms	remaining: 6.75s
13:	learn: 0.3614362	total: 182ms	remaining: 6.32s
14:	learn: 0.3568489	total: 184ms	remaining: 5.93s
15:	learn: 0.3433214	total: 185ms	remaining: 5.6s
16:	learn: 0.3400972	total: 187ms	remaining: 5.31s
17:	learn: 0.3293256	total: 188ms	remaining: 5.04s
18:	learn: 0.3206314	total: 190ms	remaining: 4.81s
19:	learn: 0.3133377	total: 192ms	remaini

180:	learn: 0.0308927	total: 526ms	remaining: 928ms
181:	learn: 0.0307535	total: 529ms	remaining: 924ms
182:	learn: 0.0304608	total: 532ms	remaining: 921ms
183:	learn: 0.0302963	total: 534ms	remaining: 917ms
184:	learn: 0.0301357	total: 536ms	remaining: 913ms
185:	learn: 0.0298968	total: 539ms	remaining: 909ms
186:	learn: 0.0296209	total: 541ms	remaining: 905ms
187:	learn: 0.0294163	total: 542ms	remaining: 900ms
188:	learn: 0.0292200	total: 544ms	remaining: 896ms
189:	learn: 0.0290403	total: 547ms	remaining: 892ms
190:	learn: 0.0285540	total: 549ms	remaining: 888ms
191:	learn: 0.0283816	total: 551ms	remaining: 884ms
192:	learn: 0.0280803	total: 553ms	remaining: 879ms
193:	learn: 0.0277012	total: 554ms	remaining: 874ms
194:	learn: 0.0275345	total: 556ms	remaining: 869ms
195:	learn: 0.0272545	total: 557ms	remaining: 864ms
196:	learn: 0.0271114	total: 559ms	remaining: 860ms
197:	learn: 0.0268878	total: 561ms	remaining: 855ms
198:	learn: 0.0265080	total: 562ms	remaining: 850ms
199:	learn: 

342:	learn: 0.0116414	total: 914ms	remaining: 418ms
343:	learn: 0.0115741	total: 917ms	remaining: 416ms
344:	learn: 0.0115436	total: 920ms	remaining: 413ms
345:	learn: 0.0114948	total: 922ms	remaining: 410ms
346:	learn: 0.0114299	total: 924ms	remaining: 408ms
347:	learn: 0.0113341	total: 927ms	remaining: 405ms
348:	learn: 0.0112833	total: 929ms	remaining: 402ms
349:	learn: 0.0112652	total: 931ms	remaining: 399ms
350:	learn: 0.0112459	total: 934ms	remaining: 396ms
351:	learn: 0.0112112	total: 936ms	remaining: 393ms
352:	learn: 0.0111522	total: 938ms	remaining: 391ms
353:	learn: 0.0110915	total: 940ms	remaining: 388ms
354:	learn: 0.0110352	total: 942ms	remaining: 385ms
355:	learn: 0.0109838	total: 944ms	remaining: 382ms
356:	learn: 0.0109557	total: 947ms	remaining: 379ms
357:	learn: 0.0109178	total: 949ms	remaining: 376ms
358:	learn: 0.0109012	total: 951ms	remaining: 374ms
359:	learn: 0.0108767	total: 953ms	remaining: 371ms
360:	learn: 0.0108449	total: 956ms	remaining: 368ms
361:	learn: 

<catboost.core.CatBoostClassifier at 0x2026afc96d0>

In [144]:
from sklearn.model_selection import cross_val_score

In [145]:
cv_scores = cross_val_score(cat_boost, x_sm,y_sm,cv = 5)

0:	learn: 0.6317068	total: 2.69ms	remaining: 1.34s
1:	learn: 0.5882322	total: 5.4ms	remaining: 1.34s
2:	learn: 0.5466108	total: 8.03ms	remaining: 1.33s
3:	learn: 0.5130666	total: 10.5ms	remaining: 1.31s
4:	learn: 0.4889041	total: 13.4ms	remaining: 1.32s
5:	learn: 0.4660104	total: 15.9ms	remaining: 1.31s
6:	learn: 0.4456699	total: 18.4ms	remaining: 1.3s
7:	learn: 0.4353497	total: 20.8ms	remaining: 1.28s
8:	learn: 0.4186289	total: 23.3ms	remaining: 1.27s
9:	learn: 0.4086288	total: 25.5ms	remaining: 1.25s
10:	learn: 0.3963338	total: 28.1ms	remaining: 1.25s
11:	learn: 0.3854305	total: 30.7ms	remaining: 1.25s
12:	learn: 0.3749487	total: 33.4ms	remaining: 1.25s
13:	learn: 0.3636132	total: 36.2ms	remaining: 1.26s
14:	learn: 0.3547072	total: 38.6ms	remaining: 1.25s
15:	learn: 0.3387093	total: 41.5ms	remaining: 1.25s
16:	learn: 0.3321609	total: 43.9ms	remaining: 1.25s
17:	learn: 0.3241130	total: 46.4ms	remaining: 1.24s
18:	learn: 0.3187978	total: 48.9ms	remaining: 1.24s
19:	learn: 0.3076073	tot

160:	learn: 0.0338072	total: 391ms	remaining: 824ms
161:	learn: 0.0335774	total: 394ms	remaining: 822ms
162:	learn: 0.0329795	total: 396ms	remaining: 820ms
163:	learn: 0.0327960	total: 399ms	remaining: 817ms
164:	learn: 0.0322374	total: 402ms	remaining: 815ms
165:	learn: 0.0319742	total: 404ms	remaining: 813ms
166:	learn: 0.0317978	total: 406ms	remaining: 810ms
167:	learn: 0.0315930	total: 409ms	remaining: 808ms
168:	learn: 0.0310805	total: 411ms	remaining: 805ms
169:	learn: 0.0304885	total: 413ms	remaining: 803ms
170:	learn: 0.0303018	total: 416ms	remaining: 800ms
171:	learn: 0.0300978	total: 418ms	remaining: 797ms
172:	learn: 0.0296375	total: 420ms	remaining: 794ms
173:	learn: 0.0291657	total: 423ms	remaining: 792ms
174:	learn: 0.0289417	total: 425ms	remaining: 789ms
175:	learn: 0.0284880	total: 427ms	remaining: 786ms
176:	learn: 0.0281016	total: 430ms	remaining: 784ms
177:	learn: 0.0277612	total: 432ms	remaining: 782ms
178:	learn: 0.0275723	total: 435ms	remaining: 780ms
179:	learn: 

385:	learn: 0.0090955	total: 935ms	remaining: 276ms
386:	learn: 0.0090497	total: 938ms	remaining: 274ms
387:	learn: 0.0090123	total: 941ms	remaining: 272ms
388:	learn: 0.0089930	total: 943ms	remaining: 269ms
389:	learn: 0.0089439	total: 946ms	remaining: 267ms
390:	learn: 0.0088940	total: 949ms	remaining: 265ms
391:	learn: 0.0088650	total: 953ms	remaining: 263ms
392:	learn: 0.0088410	total: 956ms	remaining: 260ms
393:	learn: 0.0088237	total: 958ms	remaining: 258ms
394:	learn: 0.0087626	total: 961ms	remaining: 255ms
395:	learn: 0.0087226	total: 963ms	remaining: 253ms
396:	learn: 0.0086677	total: 966ms	remaining: 251ms
397:	learn: 0.0086433	total: 968ms	remaining: 248ms
398:	learn: 0.0085790	total: 972ms	remaining: 246ms
399:	learn: 0.0085669	total: 974ms	remaining: 244ms
400:	learn: 0.0085509	total: 977ms	remaining: 241ms
401:	learn: 0.0085346	total: 980ms	remaining: 239ms
402:	learn: 0.0085287	total: 983ms	remaining: 237ms
403:	learn: 0.0084707	total: 986ms	remaining: 234ms
404:	learn: 

90:	learn: 0.0883384	total: 207ms	remaining: 931ms
91:	learn: 0.0855082	total: 210ms	remaining: 933ms
92:	learn: 0.0846074	total: 213ms	remaining: 933ms
93:	learn: 0.0841990	total: 215ms	remaining: 931ms
94:	learn: 0.0839216	total: 218ms	remaining: 928ms
95:	learn: 0.0823547	total: 220ms	remaining: 927ms
96:	learn: 0.0821284	total: 223ms	remaining: 925ms
97:	learn: 0.0803966	total: 225ms	remaining: 922ms
98:	learn: 0.0796988	total: 227ms	remaining: 919ms
99:	learn: 0.0787663	total: 229ms	remaining: 917ms
100:	learn: 0.0776030	total: 232ms	remaining: 916ms
101:	learn: 0.0768604	total: 235ms	remaining: 916ms
102:	learn: 0.0768303	total: 237ms	remaining: 912ms
103:	learn: 0.0758017	total: 240ms	remaining: 913ms
104:	learn: 0.0754905	total: 242ms	remaining: 910ms
105:	learn: 0.0751822	total: 244ms	remaining: 906ms
106:	learn: 0.0748586	total: 246ms	remaining: 903ms
107:	learn: 0.0726857	total: 248ms	remaining: 899ms
108:	learn: 0.0706035	total: 250ms	remaining: 895ms
109:	learn: 0.0697790	

253:	learn: 0.0201997	total: 572ms	remaining: 554ms
254:	learn: 0.0200074	total: 575ms	remaining: 553ms
255:	learn: 0.0199529	total: 578ms	remaining: 551ms
256:	learn: 0.0198452	total: 580ms	remaining: 548ms
257:	learn: 0.0196962	total: 582ms	remaining: 546ms
258:	learn: 0.0194311	total: 584ms	remaining: 543ms
259:	learn: 0.0194128	total: 586ms	remaining: 541ms
260:	learn: 0.0192593	total: 588ms	remaining: 539ms
261:	learn: 0.0191348	total: 590ms	remaining: 536ms
262:	learn: 0.0190336	total: 593ms	remaining: 534ms
263:	learn: 0.0189757	total: 595ms	remaining: 532ms
264:	learn: 0.0189187	total: 597ms	remaining: 529ms
265:	learn: 0.0187046	total: 599ms	remaining: 527ms
266:	learn: 0.0186538	total: 601ms	remaining: 524ms
267:	learn: 0.0185647	total: 603ms	remaining: 522ms
268:	learn: 0.0183530	total: 605ms	remaining: 519ms
269:	learn: 0.0182412	total: 606ms	remaining: 517ms
270:	learn: 0.0181379	total: 608ms	remaining: 514ms
271:	learn: 0.0180452	total: 610ms	remaining: 511ms
272:	learn: 

427:	learn: 0.0098761	total: 931ms	remaining: 157ms
428:	learn: 0.0098343	total: 934ms	remaining: 155ms
429:	learn: 0.0097761	total: 936ms	remaining: 152ms
430:	learn: 0.0097760	total: 937ms	remaining: 150ms
431:	learn: 0.0097756	total: 939ms	remaining: 148ms
432:	learn: 0.0097442	total: 941ms	remaining: 146ms
433:	learn: 0.0097228	total: 943ms	remaining: 143ms
434:	learn: 0.0097228	total: 944ms	remaining: 141ms
435:	learn: 0.0096994	total: 946ms	remaining: 139ms
436:	learn: 0.0096731	total: 948ms	remaining: 137ms
437:	learn: 0.0096197	total: 950ms	remaining: 134ms
438:	learn: 0.0095997	total: 952ms	remaining: 132ms
439:	learn: 0.0095713	total: 954ms	remaining: 130ms
440:	learn: 0.0095493	total: 956ms	remaining: 128ms
441:	learn: 0.0095132	total: 958ms	remaining: 126ms
442:	learn: 0.0094970	total: 960ms	remaining: 124ms
443:	learn: 0.0094861	total: 962ms	remaining: 121ms
444:	learn: 0.0094434	total: 964ms	remaining: 119ms
445:	learn: 0.0093994	total: 967ms	remaining: 117ms
446:	learn: 

97:	learn: 0.0858426	total: 191ms	remaining: 784ms
98:	learn: 0.0851311	total: 194ms	remaining: 784ms
99:	learn: 0.0839170	total: 196ms	remaining: 782ms
100:	learn: 0.0819072	total: 197ms	remaining: 780ms
101:	learn: 0.0797991	total: 199ms	remaining: 777ms
102:	learn: 0.0778323	total: 202ms	remaining: 777ms
103:	learn: 0.0761519	total: 204ms	remaining: 775ms
104:	learn: 0.0760218	total: 206ms	remaining: 773ms
105:	learn: 0.0746645	total: 207ms	remaining: 771ms
106:	learn: 0.0729505	total: 210ms	remaining: 770ms
107:	learn: 0.0716781	total: 212ms	remaining: 768ms
108:	learn: 0.0704089	total: 214ms	remaining: 767ms
109:	learn: 0.0698014	total: 216ms	remaining: 766ms
110:	learn: 0.0687052	total: 218ms	remaining: 763ms
111:	learn: 0.0674601	total: 220ms	remaining: 761ms
112:	learn: 0.0663357	total: 222ms	remaining: 759ms
113:	learn: 0.0652735	total: 223ms	remaining: 757ms
114:	learn: 0.0652237	total: 225ms	remaining: 754ms
115:	learn: 0.0651361	total: 227ms	remaining: 752ms
116:	learn: 0.0

291:	learn: 0.0144617	total: 561ms	remaining: 400ms
292:	learn: 0.0143357	total: 564ms	remaining: 398ms
293:	learn: 0.0142211	total: 566ms	remaining: 397ms
294:	learn: 0.0141061	total: 568ms	remaining: 395ms
295:	learn: 0.0140292	total: 570ms	remaining: 393ms
296:	learn: 0.0139601	total: 572ms	remaining: 391ms
297:	learn: 0.0138660	total: 574ms	remaining: 389ms
298:	learn: 0.0137460	total: 576ms	remaining: 388ms
299:	learn: 0.0136894	total: 579ms	remaining: 386ms
300:	learn: 0.0136040	total: 581ms	remaining: 384ms
301:	learn: 0.0135110	total: 583ms	remaining: 382ms
302:	learn: 0.0134177	total: 585ms	remaining: 381ms
303:	learn: 0.0133128	total: 587ms	remaining: 379ms
304:	learn: 0.0131874	total: 589ms	remaining: 377ms
305:	learn: 0.0130856	total: 591ms	remaining: 375ms
306:	learn: 0.0129788	total: 593ms	remaining: 373ms
307:	learn: 0.0129304	total: 595ms	remaining: 371ms
308:	learn: 0.0128733	total: 597ms	remaining: 369ms
309:	learn: 0.0128159	total: 599ms	remaining: 367ms
310:	learn: 

471:	learn: 0.0068942	total: 923ms	remaining: 54.8ms
472:	learn: 0.0068745	total: 925ms	remaining: 52.8ms
473:	learn: 0.0068562	total: 927ms	remaining: 50.9ms
474:	learn: 0.0068371	total: 929ms	remaining: 48.9ms
475:	learn: 0.0068230	total: 931ms	remaining: 46.9ms
476:	learn: 0.0068021	total: 933ms	remaining: 45ms
477:	learn: 0.0067922	total: 935ms	remaining: 43ms
478:	learn: 0.0067913	total: 937ms	remaining: 41.1ms
479:	learn: 0.0067807	total: 939ms	remaining: 39.1ms
480:	learn: 0.0067612	total: 940ms	remaining: 37.1ms
481:	learn: 0.0067416	total: 942ms	remaining: 35.2ms
482:	learn: 0.0067139	total: 944ms	remaining: 33.2ms
483:	learn: 0.0066891	total: 946ms	remaining: 31.3ms
484:	learn: 0.0066696	total: 948ms	remaining: 29.3ms
485:	learn: 0.0066431	total: 950ms	remaining: 27.4ms
486:	learn: 0.0066262	total: 953ms	remaining: 25.4ms
487:	learn: 0.0065933	total: 955ms	remaining: 23.5ms
488:	learn: 0.0065686	total: 957ms	remaining: 21.5ms
489:	learn: 0.0065354	total: 958ms	remaining: 19.6

226:	learn: 0.0247811	total: 445ms	remaining: 536ms
227:	learn: 0.0246865	total: 448ms	remaining: 534ms
228:	learn: 0.0245016	total: 450ms	remaining: 532ms
229:	learn: 0.0241825	total: 452ms	remaining: 530ms
230:	learn: 0.0240585	total: 453ms	remaining: 528ms
231:	learn: 0.0237497	total: 455ms	remaining: 526ms
232:	learn: 0.0233156	total: 457ms	remaining: 524ms
233:	learn: 0.0231621	total: 459ms	remaining: 522ms
234:	learn: 0.0230785	total: 461ms	remaining: 520ms
235:	learn: 0.0228908	total: 463ms	remaining: 518ms
236:	learn: 0.0227609	total: 465ms	remaining: 516ms
237:	learn: 0.0225984	total: 467ms	remaining: 515ms
238:	learn: 0.0224941	total: 469ms	remaining: 513ms
239:	learn: 0.0222758	total: 471ms	remaining: 511ms
240:	learn: 0.0221274	total: 473ms	remaining: 509ms
241:	learn: 0.0220673	total: 475ms	remaining: 506ms
242:	learn: 0.0220000	total: 477ms	remaining: 504ms
243:	learn: 0.0216950	total: 479ms	remaining: 502ms
244:	learn: 0.0214402	total: 481ms	remaining: 500ms
245:	learn: 

423:	learn: 0.0093571	total: 816ms	remaining: 146ms
424:	learn: 0.0093080	total: 819ms	remaining: 144ms
425:	learn: 0.0092925	total: 821ms	remaining: 143ms
426:	learn: 0.0092580	total: 823ms	remaining: 141ms
427:	learn: 0.0092119	total: 824ms	remaining: 139ms
428:	learn: 0.0091546	total: 827ms	remaining: 137ms
429:	learn: 0.0090839	total: 829ms	remaining: 135ms
430:	learn: 0.0090540	total: 831ms	remaining: 133ms
431:	learn: 0.0090391	total: 833ms	remaining: 131ms
432:	learn: 0.0090140	total: 835ms	remaining: 129ms
433:	learn: 0.0090001	total: 837ms	remaining: 127ms
434:	learn: 0.0089879	total: 839ms	remaining: 125ms
435:	learn: 0.0089325	total: 841ms	remaining: 123ms
436:	learn: 0.0089185	total: 843ms	remaining: 121ms
437:	learn: 0.0089070	total: 844ms	remaining: 120ms
438:	learn: 0.0088803	total: 846ms	remaining: 118ms
439:	learn: 0.0088671	total: 848ms	remaining: 116ms
440:	learn: 0.0087821	total: 850ms	remaining: 114ms
441:	learn: 0.0087519	total: 852ms	remaining: 112ms
442:	learn: 

93:	learn: 0.0827551	total: 196ms	remaining: 845ms
94:	learn: 0.0826286	total: 198ms	remaining: 844ms
95:	learn: 0.0814944	total: 200ms	remaining: 842ms
96:	learn: 0.0807206	total: 202ms	remaining: 840ms
97:	learn: 0.0806175	total: 204ms	remaining: 837ms
98:	learn: 0.0803064	total: 206ms	remaining: 834ms
99:	learn: 0.0792574	total: 208ms	remaining: 832ms
100:	learn: 0.0780186	total: 210ms	remaining: 830ms
101:	learn: 0.0765467	total: 212ms	remaining: 827ms
102:	learn: 0.0763294	total: 214ms	remaining: 824ms
103:	learn: 0.0759918	total: 216ms	remaining: 821ms
104:	learn: 0.0740975	total: 218ms	remaining: 820ms
105:	learn: 0.0727597	total: 220ms	remaining: 817ms
106:	learn: 0.0715498	total: 223ms	remaining: 818ms
107:	learn: 0.0702259	total: 229ms	remaining: 831ms
108:	learn: 0.0692493	total: 232ms	remaining: 831ms
109:	learn: 0.0674987	total: 234ms	remaining: 829ms
110:	learn: 0.0665704	total: 236ms	remaining: 826ms
111:	learn: 0.0658498	total: 238ms	remaining: 824ms
112:	learn: 0.06498

270:	learn: 0.0165435	total: 564ms	remaining: 477ms
271:	learn: 0.0164988	total: 567ms	remaining: 475ms
272:	learn: 0.0164208	total: 569ms	remaining: 473ms
273:	learn: 0.0163433	total: 572ms	remaining: 471ms
274:	learn: 0.0162667	total: 574ms	remaining: 469ms
275:	learn: 0.0161768	total: 576ms	remaining: 467ms
276:	learn: 0.0160966	total: 578ms	remaining: 465ms
277:	learn: 0.0159914	total: 580ms	remaining: 463ms
278:	learn: 0.0159057	total: 582ms	remaining: 461ms
279:	learn: 0.0158373	total: 584ms	remaining: 459ms
280:	learn: 0.0157695	total: 586ms	remaining: 456ms
281:	learn: 0.0156809	total: 588ms	remaining: 454ms
282:	learn: 0.0156208	total: 590ms	remaining: 452ms
283:	learn: 0.0155599	total: 592ms	remaining: 450ms
284:	learn: 0.0155227	total: 594ms	remaining: 448ms
285:	learn: 0.0154145	total: 596ms	remaining: 446ms
286:	learn: 0.0152651	total: 597ms	remaining: 443ms
287:	learn: 0.0151823	total: 600ms	remaining: 442ms
288:	learn: 0.0151201	total: 602ms	remaining: 440ms
289:	learn: 

456:	learn: 0.0078473	total: 935ms	remaining: 87.9ms
457:	learn: 0.0078211	total: 937ms	remaining: 85.9ms
458:	learn: 0.0078054	total: 939ms	remaining: 83.9ms
459:	learn: 0.0077902	total: 941ms	remaining: 81.8ms
460:	learn: 0.0077748	total: 943ms	remaining: 79.8ms
461:	learn: 0.0077442	total: 945ms	remaining: 77.7ms
462:	learn: 0.0077188	total: 947ms	remaining: 75.7ms
463:	learn: 0.0077188	total: 949ms	remaining: 73.6ms
464:	learn: 0.0076988	total: 951ms	remaining: 71.6ms
465:	learn: 0.0076934	total: 953ms	remaining: 69.5ms
466:	learn: 0.0076831	total: 955ms	remaining: 67.5ms
467:	learn: 0.0076678	total: 957ms	remaining: 65.4ms
468:	learn: 0.0076497	total: 959ms	remaining: 63.4ms
469:	learn: 0.0076230	total: 961ms	remaining: 61.3ms
470:	learn: 0.0075807	total: 963ms	remaining: 59.3ms
471:	learn: 0.0075586	total: 964ms	remaining: 57.2ms
472:	learn: 0.0075300	total: 966ms	remaining: 55.1ms
473:	learn: 0.0075161	total: 968ms	remaining: 53.1ms
474:	learn: 0.0074849	total: 970ms	remaining: 

In [146]:
cv_scores.mean()

0.868904428904429

In [140]:
y_pred_proba = cat_boost.predict_proba(x_test)[:,1]

In [141]:
pcos_submission_catboost_1202 = pd.DataFrame({'ID': test['ID'], 'PCOS' : y_pred_proba})

In [142]:
pcos_submission_catboost_1202.to_csv('pcos_submission_catboost_1202.csv', index = False)

# LR

In [147]:
from sklearn.linear_model import LogisticRegression

In [148]:
lr = LogisticRegression()

In [149]:
lr.fit(x_sm,y_sm)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [151]:
cv_scores = cross_val_score(lr, x_sm,y_sm,cv = 5)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

In [152]:
cv_scores.mean()

0.7802797202797203

In [153]:
y_pred_proba = lr.predict_proba(x_test)[:,1]

In [154]:
pcos_submission_lr_1202 = pd.DataFrame({'ID': test['ID'], 'PCOS' : y_pred_proba})

In [155]:
pcos_submission_lr_1202.to_csv('pcos_submission_lr_1202.csv', index = False)