**Infertility problem:**
Infertility is commonly defined as the inability to conceive after at least a year of frequent, unprotected sex. The cause may lie with the male partner, the female partner, or a combination of factors.

**Dataset for Analysis:**
The data comes from the UCI Machine Learning Repository and was collected from 100 volunteers. The attributes are season, age, childhood diseases, accident or serious trauma, surgical intervention, high fevers in the last year, frequency of alcohol consumption, smoking habit, number of hours spent sitting per day, and the class label (diagnosis).

**Machine Learning Classification Algorithms:**
1) Logistic Regression
2) K-Nearest Neighbor
3) Support Vector Machine
4) Decision Tree

**Performance of model:**
F1-Score, Accuracy

# Importing Libraries

In [13]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import sklearn.metrics as metrics
from sklearn import preprocessing
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

# Importing Dataset

In [14]:
file = '../input/fertility-data-set/fertility.csv'
df = pd.read_csv(file)
df.head()

Unnamed: 0,Season,Age,Childish diseases,Accident or serious trauma,Surgical intervention,High fevers in the last year,Frequency of alcohol consumption,Smoking habit,Number of hours spent sitting per day,Diagnosis
0,spring,30,no,yes,yes,more than 3 months ago,once a week,occasional,16,Normal
1,spring,35,yes,no,yes,more than 3 months ago,once a week,daily,6,Altered
2,spring,27,yes,no,no,more than 3 months ago,hardly ever or never,never,9,Normal
3,spring,32,no,yes,yes,more than 3 months ago,hardly ever or never,never,7,Normal
4,spring,30,yes,yes,no,more than 3 months ago,once a week,never,9,Altered


# Exploratory Data Analysis

In [15]:
df['Diagnosis'].value_counts()

Normal     88
Altered    12
Name: Diagnosis, dtype: int64
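
With 88 of 100 samples labelled Normal, a classifier that always predicts Normal already reaches 88% accuracy, so accuracy alone says little on this data. A minimal sketch of that baseline, using only the counts printed above (the variable names are illustrative):

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({'Normal': 88, 'Altered': 12})

# Accuracy of a classifier that always predicts the majority class
majority_class = counts.idxmax()
baseline_accuracy = counts.max() / counts.sum()
print(majority_class, baseline_accuracy)
```

Any model accuracy reported later is best read against this baseline.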

In [16]:
#Checking the probability of fertility status for smoking habits
df.groupby(['Smoking habit'])['Diagnosis'].value_counts(normalize = True)

Smoking habit  Diagnosis
daily          Normal       0.857143
               Altered      0.142857
never          Normal       0.892857
               Altered      0.107143
occasional     Normal       0.869565
               Altered      0.130435
Name: Diagnosis, dtype: float64

# Feature Engineering

In [17]:
# Replacing the values of the binary features with 0 and 1 (note: 'yes' -> 0, 'no' -> 1)
binary_cols = ['Childish diseases','Accident or serious trauma','Surgical intervention']
for col in binary_cols:
    # Assign back instead of using inplace=True on a column selection, which newer pandas warns about
    df[col] = df[col].replace(to_replace=['yes','no'], value=[0,1])
df.head()

Unnamed: 0,Season,Age,Childish diseases,Accident or serious trauma,Surgical intervention,High fevers in the last year,Frequency of alcohol consumption,Smoking habit,Number of hours spent sitting per day,Diagnosis
0,spring,30,1,0,0,more than 3 months ago,once a week,occasional,16,Normal
1,spring,35,0,1,0,more than 3 months ago,once a week,daily,6,Altered
2,spring,27,0,1,1,more than 3 months ago,hardly ever or never,never,9,Normal
3,spring,32,1,0,0,more than 3 months ago,hardly ever or never,never,7,Normal
4,spring,30,0,0,1,more than 3 months ago,once a week,never,9,Altered
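
The same encoding can also be written with `Series.map` and an explicit dictionary. The sketch below runs on a small made-up frame, not the fertility data, and keeps the notebook's convention of 'yes' as 0 and 'no' as 1. One property of `map` is that any value missing from the dictionary becomes NaN, which makes unexpected labels easy to spot.

```python
import pandas as pd

# Toy frame with the same yes/no values as the binary columns (illustrative data)
toy = pd.DataFrame({'Surgical intervention': ['yes', 'no', 'no', 'yes']})

# Same convention as the notebook: 'yes' -> 0, 'no' -> 1
yes_no_map = {'yes': 0, 'no': 1}
toy['Surgical intervention'] = toy['Surgical intervention'].map(yes_no_map)
print(toy['Surgical intervention'].tolist())
```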


In [18]:
binary_data = df[['Age','Childish diseases','Accident or serious trauma','Surgical intervention','Number of hours spent sitting per day']]
binary_data.head()

Unnamed: 0,Age,Childish diseases,Accident or serious trauma,Surgical intervention,Number of hours spent sitting per day
0,30,1,0,0,16
1,35,0,1,0,6
2,27,0,1,1,9
3,32,1,0,0,7
4,30,0,0,1,9


# One Hot Encoding of multi label features

In [20]:
# Getting the dummy variables (one-hot encoding of the multi-category columns)
multilable_data = df[['Season','High fevers in the last year','Frequency of alcohol consumption','Smoking habit']]
multilable_data = pd.get_dummies(multilable_data, dtype=int)
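
To make the effect of one-hot encoding concrete, here is a small sketch on a made-up column that uses the three smoking categories seen in the data. The frame `toy` is illustrative only. Each category becomes its own 0/1 column, and every row has exactly one column set to 1.

```python
import pandas as pd

# Toy column with the three smoking categories seen in the data
toy = pd.DataFrame({'Smoking habit': ['never', 'daily', 'occasional', 'never']})

# One indicator column per category; dtype=int gives 0/1 instead of booleans
dummies = pd.get_dummies(toy, dtype=int)
print(dummies.columns.tolist())
print(dummies.values.tolist())
```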

In [21]:
# Concatenating the predictor features
Features = pd.concat([binary_data,multilable_data], axis = 1)
Features.shape

(100, 20)

In [22]:
# Target feature
y = df['Diagnosis'].values
y[0:5]

array(['Normal', 'Altered', 'Normal', 'Normal', 'Altered'], dtype=object)

# Data Normalization

In [23]:
X = Features
X= preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]

array([[-0.04920383,  2.5869495 , -1.12815215, -0.98019606,  0.15546303,
        -0.67028006,  1.30487651, -0.20412415, -0.62360956, -0.31448545,
         0.76635604, -0.62360956, -0.10050378, -0.81649658,  1.25064086,
        -0.10050378, -0.4843221 , -0.51558005, -1.12815215,  1.82970656],
       [ 2.18733387, -0.38655567,  0.88640526, -0.98019606, -0.14350433,
        -0.67028006,  1.30487651, -0.20412415, -0.62360956, -0.31448545,
         0.76635604, -0.62360956, -0.10050378, -0.81649658,  1.25064086,
        -0.10050378, -0.4843221 ,  1.93956303, -1.12815215, -0.54653573],
       [-1.39112645, -0.38655567,  0.88640526,  1.02020406, -0.05381412,
        -0.67028006,  1.30487651, -0.20412415, -0.62360956, -0.31448545,
         0.76635604, -0.62360956, -0.10050378,  1.22474487, -0.79959006,
        -0.10050378, -0.4843221 , -0.51558005,  0.88640526, -0.54653573],
       [ 0.84541125,  2.5869495 , -1.12815215, -0.98019606, -0.1136076 ,
        -0.67028006,  1.30487651, -0.20412415, -
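
The scaling cell above fits `StandardScaler` on all 100 rows before the train/test split, so the test rows influence the scaling statistics. One possible alternative, sketched here on synthetic data rather than the fertility features, is to put the scaler and the model in a scikit-learn pipeline so the scaler is fitted on the training rows only. The names `X_demo`, `y_demo` and `model` are assumptions made for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the 100 x 20 feature matrix (not the fertility data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 20))
y_demo = np.array(['Altered'] * 12 + ['Normal'] * 88)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, stratify=y_demo, random_state=0
)

# The pipeline fits the scaler on the training rows only, then reuses it at predict time
model = make_pipeline(StandardScaler(), SVC())
model.fit(X_tr, y_tr)
pred = model.predict(X_te)
print(pred.shape)
```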

# Splitting training and testing data set

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.30)
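
Because only 12 of the 100 samples are Altered, a purely random split can leave very few of them in the test set, and the results change on every run. A hedged sketch of a stratified, repeatable split follows. It uses placeholder features and labels with the same 88/12 imbalance, and `random_state=0` is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the same 88/12 imbalance as the Diagnosis column (features are placeholders)
y_demo = np.array(['Normal'] * 88 + ['Altered'] * 12)
X_demo = np.arange(100).reshape(-1, 1)

# stratify keeps the class ratio similar in both parts; random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, stratify=y_demo, random_state=0
)
print((y_tr == 'Altered').sum(), (y_te == 'Altered').sum())
```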

# Support Vector Machine

In [25]:
from sklearn.svm import SVC

In [26]:
svm=SVC().fit(X_train,y_train)
pred_svm=svm.predict(X_test)

In [27]:
print('F1-SCORE : ',f1_score(y_test,pred_svm,average=None))
print('Train Accuracy: ',metrics.accuracy_score(y_train, svm.predict(X_train))*100,'%')

F1-SCORE :  [0.         0.96551724]
Train Accuracy:  85.71428571428571 %
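
An F1-score of 0 for the first class (Altered) means the model made no correct Altered predictions on the test set. The sketch below uses hypothetical labels, 2 Altered and 28 Normal with every sample predicted Normal, which is one pattern consistent with the scores printed above. The real test labels are not shown in this notebook, so treat this as an illustration only.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical test labels: 2 Altered and 28 Normal, with every sample predicted Normal
y_true = np.array(['Altered'] * 2 + ['Normal'] * 28)
y_pred = np.array(['Normal'] * 30)

# Rows are true classes, columns are predicted classes, in the order given by labels
labels = ['Altered', 'Normal']
cm = confusion_matrix(y_true, y_pred, labels=labels)
# zero_division=0 silences the warning when a class is never predicted
f1 = f1_score(y_true, y_pred, average=None, labels=labels, zero_division=0)
print(cm)
print(f1)
```

A confusion matrix like this shows directly whether the minority class is being predicted at all, which the per-class F1 array only hints at.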


# Logistic Regression

In [28]:
from sklearn.linear_model import LogisticRegression

In [29]:
lgm=LogisticRegression()
lgm.fit(X_train,y_train)
pred_lgm=lgm.predict(X_test)

In [30]:
print('F1-SCORE : ',f1_score(y_test,pred_lgm,average=None))
print('Train Accuracy: ',metrics.accuracy_score(y_train, lgm.predict(X_train))*100,'%')

F1-SCORE :  [0.33333333 0.92592593]
Train Accuracy:  92.85714285714286 %


# K-Nearest Neighbor

In [31]:
from sklearn.neighbors import KNeighborsClassifier

In [32]:
for k in range(1,71):
    knn=KNeighborsClassifier(n_neighbors=k,weights='uniform')
    knn.fit(X_train,y_train)
    predKNN=knn.predict(X_test)
    accuracy=metrics.accuracy_score(y_test,predKNN)
    print(k,': ',accuracy)

1 :  0.8333333333333334
2 :  0.7
3 :  0.9333333333333333
4 :  0.9
5 :  0.9333333333333333
6 :  0.9333333333333333
7 :  0.9333333333333333
8 :  0.9333333333333333
9 :  0.9333333333333333
10 :  0.9333333333333333
11 :  0.9333333333333333
12 :  0.9333333333333333
13 :  0.9333333333333333
14 :  0.9333333333333333
15 :  0.9333333333333333
16 :  0.9333333333333333
17 :  0.9333333333333333
18 :  0.9333333333333333
19 :  0.9333333333333333
20 :  0.9333333333333333
21 :  0.9333333333333333
22 :  0.9333333333333333
23 :  0.9333333333333333
24 :  0.9333333333333333
25 :  0.9333333333333333
26 :  0.9333333333333333
27 :  0.9333333333333333
28 :  0.9333333333333333
29 :  0.9333333333333333
30 :  0.9333333333333333
31 :  0.9333333333333333
32 :  0.9333333333333333
33 :  0.9333333333333333
34 :  0.9333333333333333
35 :  0.9333333333333333
36 :  0.9333333333333333
37 :  0.9333333333333333
38 :  0.9333333333333333
39 :  0.9333333333333333
40 :  0.9333333333333333
41 :  0.9333333333333333
42 :  0.933333
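
The loop above scores every k on the test set, so the chosen k is tuned to the same data later used for evaluation. A common alternative is to pick k by cross-validation on the training data only. The sketch below shows the idea on synthetic data. The array shapes, the 5-fold setting and the range of k are assumptions for this example, not values from the notebook.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training split (70 rows, 20 features, imbalanced labels)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(70, 20))
y_demo = np.array(['Normal'] * 60 + ['Altered'] * 10)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = {}
for k in range(1, 16):
    knn_cv = KNeighborsClassifier(n_neighbors=k, weights='uniform')
    # Mean cross-validated accuracy on the training data only; the test set stays untouched
    scores[k] = cross_val_score(knn_cv, X_demo, y_demo, cv=cv, scoring='accuracy').mean()

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```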

In [33]:
knn=KNeighborsClassifier(n_neighbors=15,weights='uniform')
knn.fit(X_train,y_train)
predKNN=knn.predict(X_test)
accuracy=metrics.accuracy_score(y_test,predKNN)
print("accuracy : ",round(accuracy*100,1),'%')

accuracy :  93.3 %


In [34]:
print('F1-SCORE : ',f1_score(y_test,predKNN,average=None))
print('Train Accuracy: ',metrics.accuracy_score(y_train, knn.predict(X_train))*100,'%')

F1-SCORE :  [0.         0.96551724]
Train Accuracy:  85.71428571428571 %


# Decision Tree

In [35]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
dtree=DecisionTreeClassifier()

In [None]:
# Duplicate max_depth value (5) removed from the grid
parameter_grid = {'max_depth': [1, 2, 3, 4, 5, 6, 9, 15, 20], 'max_features': [1, 2, 3, 4, 5, 6, 7, 8],
                  'random_state': [0, 15, 20, 35, 50, 80, 100, 150, 180, 200], 'criterion': ['gini', 'entropy']}
grid_search = GridSearchCV(dtree, param_grid = parameter_grid,cv =10)
grid_search.fit(X_train, y_train)
print ("Best Score: {}".format(grid_search.best_score_))
print ("Best params: {}".format(grid_search.best_params_))
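
By default `GridSearchCV` ranks classifier settings by accuracy, which favours the majority class on imbalanced data. A possible variation, sketched on synthetic data, scores with macro-averaged F1 and fixes `random_state` instead of searching over it. The grid values, the fold count and the data are all assumptions for this example.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training split (not the fertility data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(70, 20))
y_demo = np.array(['Normal'] * 60 + ['Altered'] * 10)

# Small illustrative grid; random_state is fixed on the estimator, not tuned
param_grid = {'max_depth': [1, 2, 3, 5], 'criterion': ['gini', 'entropy']}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# f1_macro averages the per-class F1 scores, so the minority class counts as much as the majority
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid=param_grid,
    cv=cv,
    scoring='f1_macro',
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```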

In [None]:
dtree=DecisionTreeClassifier(max_depth=5,criterion='entropy',max_features=2,random_state=0)
dtree.fit(X_train,y_train)
pred_Dtree=dtree.predict(X_test)

In [None]:
print('F1-SCORE : ',f1_score(y_test,pred_Dtree,average=None))
print('Train Accuracy: ',metrics.accuracy_score(y_train, dtree.predict(X_train))*100,'%')

In [None]:
Algorithm = ['KNN','DTree','SVM','LR']
f1_knn=f1_score(y_test,predKNN,average=None)
f1_dtree=f1_score(y_test,pred_Dtree,average=None)
f1_svm=f1_score(y_test,pred_svm,average=None)
f1_lgm=f1_score(y_test,pred_lgm,average=None)
F1_score=[f1_knn,f1_dtree,f1_svm,f1_lgm]

In [None]:
Train_Accuracy_knn= metrics.accuracy_score(y_train, knn.predict(X_train))*100
Train_Accuracy_dtree= metrics.accuracy_score(y_train, dtree.predict(X_train))*100
Train_Accuracy_svm= metrics.accuracy_score(y_train, svm.predict(X_train))*100
Train_Accuracy_lgm= metrics.accuracy_score(y_train, lgm.predict(X_train))*100
Train_Accuracy_score=[Train_Accuracy_knn,Train_Accuracy_dtree,Train_Accuracy_svm,Train_Accuracy_lgm]

# Report

In [None]:
table = pd.DataFrame({"Algorithm":Algorithm,"F1-Score": F1_score,"Train_Accuracy(%)":Train_Accuracy_score})
table
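
Since `f1_score(..., average=None)` returns one value per class, the F1-Score column of the table holds arrays. A sketch of a flatter layout, with one column per class plus a macro average, is shown below. It uses only the per-class scores printed earlier in this notebook. The decision tree is left out because its output is not shown, and the column names are illustrative.

```python
import pandas as pd

# Per-class F1 values copied from the printed outputs (decision tree output is not shown)
f1_by_model = {
    'KNN': [0.0, 0.96551724],
    'SVM': [0.0, 0.96551724],
    'LR': [0.33333333, 0.92592593],
}

# With average=None the classes come in sorted label order: Altered first, then Normal
report = pd.DataFrame.from_dict(
    f1_by_model, orient='index', columns=['F1 Altered', 'F1 Normal']
)
report['F1 macro'] = report.mean(axis=1)
print(report.round(3))
```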