##**Classification Project using Diabetes Data**

##**Main Goals of this Program:**
**I) Check and decide the ML Learning Type and sub-type as applicable**

**II) Check and remove the duplicate records, if any**

**III) Check the class balance**

**IV) Check for Missing Values and handle them as required**

**V) Check for the necessity of creating new column(s) and create the columns as required**

**VI) Check the unique Values of each column and observe the following and take actions as required:**
* **1) Wrong Data in the columns, if any** 
* **2) Wrong format of the data in the columns, if any**
* **3) Identify the columns which need to be categorically converted to numeric values by using Nominal method/ Ordinal Method**

**VII) Check the Test accuracy using appropriate algorithm and Holdout Method.**

**VIII) Implement the Scaling as required**

**IX) Write out the transformed Input file for further usage**

**1) Install/ Import the required Python Packages/ Libraries**

In [None]:
#Import required python packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn import preprocessing
%matplotlib inline

In [None]:
pip install category_encoders

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


**2) Mounting the Google Drive**

In [None]:
# Mount the Google Drive
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


**3) Read the Data file and check**

In [None]:
# Read the Diabetes Data from .csv file and check the data shape (number of Rows and Columns)
df = pd.read_csv('gdrive/My Drive/NCJ-MLP-Training-2022/NCJ-MLP-Projects-Latest/03-Diabetes-Project/Data-Files/diabetes-train.csv')
print(df.shape)
df.head()

(700, 9)


Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


##**I) Check and decide the ML Learning Type and sub-type as applicable**

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               700 non-null    int64  
 1   Glucose                   700 non-null    int64  
 2   BloodPressure             700 non-null    int64  
 3   SkinThickness             700 non-null    int64  
 4   Insulin                   700 non-null    int64  
 5   BMI                       700 non-null    float64
 6   DiabetesPedigreeFunction  700 non-null    float64
 7   Age                       700 non-null    int64  
 8   Outcome                   700 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 49.3 KB


In [None]:
df.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

**Observations on the given Dataset:**
* a) Number of Independent Variables: 8 (Identified)
* b) Number of Dependent Variables: 1 (Outcome) (Identified)
* c) There is no Missing Value in the Dependent Variable column "Outcome"


**Conclusions:**
###**a) Since a labelled Dependent variable is available, the given dataset belongs to the "Supervised Learning" main-type**
###**b) Since the Dependent variable values are categorical in nature, the given dataset is of "Classification" sub-type.**

##**II) Check and remove the duplicate records, if any**

In [None]:
df.shape

(700, 9)

In [None]:
# Remove all duplicates:
df.drop_duplicates(inplace = True)

In [None]:
df.shape

(700, 9)

###**Conclusion: No Duplicate Records**
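
The unchanged shape before and after `drop_duplicates` supports this conclusion. A more direct check is to count duplicates before dropping them. The sketch below uses a small made-up frame (`demo` is hypothetical, not the project data) to show the idea.

```python
import pandas as pd

# Small made-up frame; the last row repeats the first one.
demo = pd.DataFrame({
    "Glucose": [148, 85, 148],
    "BMI": [33.6, 26.6, 33.6],
    "Outcome": [1, 0, 1],
})

# duplicated() flags every row that is identical to an earlier row.
n_duplicates = int(demo.duplicated().sum())

# drop_duplicates() without inplace=True returns a new frame.
deduplicated = demo.drop_duplicates()

print("duplicate rows:", n_duplicates)
print("shape before:", demo.shape, "after:", deduplicated.shape)
```

On the project data the same call would be `df.duplicated().sum()`, which should report 0 given the shapes shown above.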

##**III) Check the Class balance**

In [None]:
df["Outcome"].value_counts()

0    459
1    241
Name: Outcome, dtype: int64

###**Conclusion: It is a Binary Classification with imbalanced Classes**
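
To put a number on the imbalance, the class shares can be computed with `value_counts(normalize=True)`. The sketch below rebuilds a label column from the counts reported above (459 and 241) rather than reading the file; the name `outcome` is only illustrative.

```python
import pandas as pd

# Rebuild a label column with the counts reported above.
outcome = pd.Series([0] * 459 + [1] * 241, name="Outcome")

# Share of each class instead of raw counts.
class_share = outcome.value_counts(normalize=True)

# Accuracy of a model that always predicts the majority class.
majority_baseline = float(class_share.max())

print(class_share.round(3))
print("majority-class baseline accuracy:", round(majority_baseline, 3))
```

The majority-class share is a useful reference point: a classifier has to beat it before its accuracy says much.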

##**V) Check for necessity of creating new column(s) and create the columns as required**

###**Decision: As of now, there is no necessity to create new column(s).**

##**VI) Check the unique Values of each column and observe the following and take actions as required:**
* **a) Wrong Data in the columns, if any** 
* **b) Wrong format of the data in the columns, if any**
* **c) Identify the columns which need to be categorically converted to numeric values by using Nominal method/ Ordinal Method**


###**Column-1: Pregnancies**

In [None]:
df['Pregnancies'].value_counts()

1     120
0     106
2      91
3      68
4      63
5      53
6      46
7      43
8      35
9      24
10     20
11     10
13      9
12      8
14      2
15      1
17      1
Name: Pregnancies, dtype: int64

**Action: We will keep the data in this column as it is**

###**Column-2: Glucose**

In [None]:
df['Glucose'].value_counts()

100    16
99     16
111    13
125    13
105    12
       ..
153     1
44      1
170     1
62      1
169     1
Name: Glucose, Length: 133, dtype: int64

**Action: We will keep the data in this column as it is**

###**Column-3: BloodPressure**

In [None]:
df['BloodPressure'].value_counts()

70     53
74     48
68     44
64     41
72     40
80     38
78     36
76     34
0      33
60     33
62     30
82     30
66     29
90     21
84     21
88     19
58     18
86     17
50     12
54     11
56     11
52     10
65      7
92      7
75      7
85      6
48      5
96      4
106     3
100     3
98      3
110     3
44      3
94      3
55      2
108     2
104     2
30      2
122     1
95      1
46      1
102     1
61      1
24      1
38      1
40      1
114     1
Name: BloodPressure, dtype: int64

**Action: We will keep the data in this column as it is**

###**Column-4: SkinThickness**

In [None]:
df['SkinThickness'].value_counts()

0     209
32     28
30     26
28     20
23     19
33     18
18     18
19     17
27     17
31     16
25     16
35     15
40     15
39     15
22     15
29     14
15     14
26     14
37     13
17     13
41     12
42     11
36     11
24     11
13     10
20     10
21      9
34      8
38      7
46      7
12      7
45      6
11      6
43      6
16      6
14      6
10      5
44      4
47      4
50      3
49      3
48      3
8       2
7       2
52      2
54      2
63      1
60      1
56      1
51      1
99      1
Name: SkinThickness, dtype: int64

###**Column-5: Insulin**

In [None]:
df['Insulin'].value_counts()

0      338
130      9
105      9
140      8
120      6
      ... 
375      1
258      1
680      1
370      1
200      1
Name: Insulin, Length: 176, dtype: int64

###**Column-6: BMI**

In [None]:
df['BMI'].value_counts()

31.6    12
32.0    11
0.0     10
33.3    10
31.2    10
        ..
48.8     1
52.9     1
19.1     1
30.7     1
44.5     1
Name: BMI, Length: 245, dtype: int64
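
The counts above show zeros in BloodPressure (33), SkinThickness (209), Insulin (338) and BMI (10). A zero blood pressure or BMI is not physiologically plausible, so these may be placeholders for missing measurements. The notebook keeps them as they are; the sketch below only shows, on made-up rows, how such zeros could be counted and marked as missing. The choice of `suspect_columns` is an assumption.

```python
import pandas as pd

# Made-up rows using the same column names as the dataset.
demo = pd.DataFrame({
    "Pregnancies": [6, 0, 8, 1],          # zero is a valid value here
    "BloodPressure": [72, 0, 64, 66],
    "SkinThickness": [35, 0, 0, 23],
    "Insulin": [0, 0, 0, 94],
    "BMI": [33.6, 0.0, 23.3, 28.1],
})

# Assumption: zero cannot be a real measurement in these columns.
suspect_columns = ["BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Count the zeros per suspect column.
zero_counts = (demo[suspect_columns] == 0).sum()

# Replace those zeros with NaN in a copy so they show up in isnull().
marked = demo.copy()
marked[suspect_columns] = demo[suspect_columns].mask(demo[suspect_columns] == 0)

print(zero_counts)
print(marked.isnull().sum())
```

Whether to impute, drop, or keep such values is a modelling decision; the sketch only makes them visible.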

###**Column-7: DiabetesPedigreeFunction**

In [None]:
df['DiabetesPedigreeFunction'].value_counts()

0.254    6
0.268    5
0.207    5
0.258    5
0.238    5
        ..
0.293    1
0.394    1
0.645    1
0.089    1
0.904    1
Name: DiabetesPedigreeFunction, Length: 487, dtype: int64

###**Column-8: Age**

In [None]:
df['Age'].value_counts()

22    63
21    59
25    46
24    43
23    36
28    32
27    29
26    29
29    29
31    23
41    21
30    19
37    18
33    16
36    15
42    14
38    14
32    14
43    12
46    12
40    12
45    11
34    11
35    10
39    10
51     8
44     8
58     7
50     7
54     6
47     5
60     5
52     5
57     5
62     4
49     4
55     4
48     4
53     4
65     3
59     3
66     3
63     3
67     3
69     2
61     2
56     2
72     1
81     1
64     1
70     1
68     1
Name: Age, dtype: int64

###**Column-9: Outcome**

In [None]:
df['Outcome'].value_counts()

0    459
1    241
Name: Outcome, dtype: int64

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 700 entries, 0 to 699
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               700 non-null    int64  
 1   Glucose                   700 non-null    int64  
 2   BloodPressure             700 non-null    int64  
 3   SkinThickness             700 non-null    int64  
 4   Insulin                   700 non-null    int64  
 5   BMI                       700 non-null    float64
 6   DiabetesPedigreeFunction  700 non-null    float64
 7   Age                       700 non-null    int64  
 8   Outcome                   700 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.7 KB


In [None]:
df.corr()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
Pregnancies,1.0,0.138734,0.150595,-0.092291,-0.06368,0.033234,-0.043731,0.55171,0.227744
Glucose,0.138734,1.0,0.148237,0.05977,0.333537,0.225048,0.138362,0.273091,0.45928
BloodPressure,0.150595,0.148237,1.0,0.203496,0.103445,0.272566,0.033937,0.240968,0.060193
SkinThickness,-0.092291,0.05977,0.203496,1.0,0.442612,0.385116,0.178332,-0.115348,0.087405
Insulin,-0.06368,0.333537,0.103445,0.442612,1.0,0.199906,0.192932,-0.0177,0.145922
BMI,0.033234,0.225048,0.272566,0.385116,0.199906,1.0,0.13789,0.036958,0.306597
DiabetesPedigreeFunction,-0.043731,0.138362,0.033937,0.178332,0.192932,0.13789,1.0,0.033667,0.170532
Age,0.55171,0.273091,0.240968,-0.115348,-0.0177,0.036958,0.033667,1.0,0.22699
Outcome,0.227744,0.45928,0.060193,0.087405,0.145922,0.306597,0.170532,0.22699,1.0
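
The full matrix is easier to read when only the correlations with the target are kept and sorted. The sketch below copies the `Outcome` row from the table above into a Series (values rounded to three decimals) and ranks it; on the real frame the equivalent would be `df.corr()["Outcome"].drop("Outcome")`.

```python
import pandas as pd

# Correlation of each feature with Outcome, copied from the table above.
outcome_corr = pd.Series({
    "Pregnancies": 0.228,
    "Glucose": 0.459,
    "BloodPressure": 0.060,
    "SkinThickness": 0.087,
    "Insulin": 0.146,
    "BMI": 0.307,
    "DiabetesPedigreeFunction": 0.171,
    "Age": 0.227,
})

# Strongest linear association first.
ranked = outcome_corr.sort_values(ascending=False)
print(ranked)
```

Glucose and BMI come out on top. Pearson correlation only captures linear association, so a low value does not by itself make a feature useless.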


##**VII) Check the Test accuracy using appropriate algorithm and Holdout Method.**

##**Step-5: Slice X and y Values**

In [None]:
X = df.drop(['Outcome'], axis = 1)
Y = df['Outcome']
X.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,6,148,72,35,0,33.6,0.627,50
1,1,85,66,29,0,26.6,0.351,31
2,8,183,64,0,0,23.3,0.672,32
3,1,89,66,23,94,28.1,0.167,21
4,0,137,40,35,168,43.1,2.288,33


In [None]:
Y.head()

0    1
1    0
2    1
3    0
4    1
Name: Outcome, dtype: int64

##**Step-6: Execute Train-Test-Split Command and Verify**

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 66)

In [None]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(560, 8)
(560,)
(140, 8)
(140,)
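
Because the classes are imbalanced, a plain random split can shift the class ratio between the train and test sets (the classification report in Step-9 lists 43 positives among the 140 test rows, about 31%, against about 34% overall). Passing `stratify` to `train_test_split` keeps the ratio close in both parts. The sketch below uses synthetic arrays with the same class counts; `X_demo` and `y_demo` are placeholders, not the project data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 700 rows with the class counts reported earlier.
X_demo = np.arange(700 * 2).reshape(700, 2)
y_demo = np.array([0] * 459 + [1] * 241)

# stratify=y_demo keeps the class ratio nearly equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=66, stratify=y_demo
)

print("positive share, full :", round(float(y_demo.mean()), 3))
print("positive share, train:", round(float(y_tr.mean()), 3))
print("positive share, test :", round(float(y_te.mean()), 3))
```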


##**Step-7: Learn the Data and Predict the dependent Variable values for the "X_test" data using the "LogisticRegression()" algorithm**

In [None]:
from sklearn.linear_model import LogisticRegression
#create an instance and fit the model 
logmodel = LogisticRegression()
logmodel.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression()
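
The message above is a convergence warning: the solver stopped at its iteration limit before converging. As the message itself suggests, the two usual remedies are raising `max_iter` or scaling the features. The sketch below combines both in a pipeline on synthetic data from `make_classification`; the data, the stretched feature, and `max_iter=1000` are all illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with 8 features, one of them on a much larger scale.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_demo[:, 0] *= 1000

# Scale first, then fit; max_iter is raised as an extra safety margin.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_demo, y_demo)

# Number of iterations the solver actually needed.
n_iter = int(model.named_steps["logisticregression"].n_iter_[0])
print("iterations used:", n_iter)
```

A pipeline also ensures that the scaler is fitted only on the data passed to `fit`, which matters for the scaling comparison later in this notebook.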

In [None]:
y_pred = logmodel.predict(X_test)
y_pred

array([0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 1, 1, 0, 1, 0])

##**Step-8: Calculate the Accuracy of the Model**

In [None]:
accuracy_lr = logmodel.score(X_test, y_test)
print("Accuracy of Logistic Regression on test set:",accuracy_lr)

Accuracy of Logistic Regression on test set: 0.75


##**Step-9: Display the Confusion Matrix and Classification Report of the Model**

In [None]:
from sklearn.metrics import classification_report, confusion_matrix  
print(confusion_matrix(y_test,y_pred))  
print(classification_report(y_test,y_pred))  

[[81 16]
 [19 24]]
              precision    recall  f1-score   support

           0       0.81      0.84      0.82        97
           1       0.60      0.56      0.58        43

    accuracy                           0.75       140
   macro avg       0.71      0.70      0.70       140
weighted avg       0.75      0.75      0.75       140
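
The figures in the report follow directly from the confusion matrix. As a worked check, the sketch below recomputes accuracy, precision, recall and F1 for class 1 from the four counts printed above (rows are true classes, columns are predicted classes).

```python
import numpy as np

# Confusion matrix printed above: [[TN, FP], [FN, TP]].
cm = np.array([[81, 16],
               [19, 24]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()        # (24 + 81) / 140
precision_1 = tp / (tp + fp)           # 24 / 40
recall_1 = tp / (tp + fn)              # 24 / 43
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print("accuracy :", round(float(accuracy), 2))
print("precision:", round(float(precision_1), 2))
print("recall   :", round(float(recall_1), 2))
print("f1-score :", round(float(f1_1), 2))
```

The low recall for class 1 means that 19 of the 43 positive cases in the test set were missed, which the overall accuracy of 0.75 does not reveal on its own.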



##**VIII) Implement the Scaling as required**

###**Use Normalization**

In [None]:
X_train.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age'],
      dtype='object')

In [None]:
columnNames = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age']

In [None]:
min_max_scaler_object = preprocessing.MinMaxScaler()
X_train1 = min_max_scaler_object.fit_transform(X_train)
X_train1 = pd.DataFrame(X_train1 , columns = columnNames)
X_train1.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,0.294118,0.718593,0.639344,0.0,0.0,0.670641,0.047822,0.433333
1,0.352941,0.0,0.557377,0.414141,0.0,0.581222,0.277114,0.333333
2,0.294118,0.723618,0.672131,0.262626,0.336879,0.4769,0.159693,0.616667
3,0.058824,0.603015,0.655738,0.484848,0.236407,0.579732,0.462852,0.333333
4,0.0,0.527638,0.737705,0.0,0.0,0.441133,0.050811,0.416667


In [None]:
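# Caution: this fits a second scaler on X_test, so train and test are scaled
# with different min/max values. Calling transform() with the scaler fitted on
# X_train would keep both sets on the same scale.
min_max_scaler_object = preprocessing.MinMaxScaler()
X_test1 = min_max_scaler_object.fit_transform(X_test)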
X_test1 = pd.DataFrame(X_test1 , columns = columnNames)
X_test1.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,0.071429,0.493421,0.88,0.82,0.25,0.856333,0.233407,0.098039
1,0.0,0.414474,0.76,0.0,0.0,0.856333,0.332412,0.058824
2,0.5,0.993421,0.7,0.66,0.213235,0.47448,0.043142,0.666667
3,0.571429,0.532895,0.96,0.0,0.0,0.0,0.081305,0.647059
4,0.071429,0.177632,0.78,1.0,0.066176,0.627599,0.186394,0.0
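
The cell above fits a new scaler on `X_test`. The more common practice is to learn the scaling parameters from the training data only and reuse them on the test data, so that both sets share one scale and no test-set information influences preprocessing. A minimal sketch on made-up arrays (`X_train_demo` and `X_test_demo` are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up data: two features, three training rows, one test row.
X_train_demo = np.array([[0.0, 10.0],
                         [5.0, 20.0],
                         [10.0, 30.0]])
X_test_demo = np.array([[2.5, 40.0]])

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)  # learns min/max from train
X_test_scaled = scaler.transform(X_test_demo)        # reuses the train min/max

print(X_train_scaled)
print(X_test_scaled)
```

Test values can fall outside [0, 1] when they lie outside the training range, as the second feature does here. That is expected and is not an error. Applying this to the notebook would change the test-set results reported in this section.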


In [None]:
from sklearn.linear_model import LogisticRegression
#create an instance and fit the model 
logmodel1 = LogisticRegression()
logmodel1.fit(X_train1, y_train)

LogisticRegression()

In [None]:
#predictions
predictions1 = logmodel1.predict(X_test1)

In [None]:
print(confusion_matrix(y_test, predictions1))
print(classification_report(y_test,predictions1))

[[81 16]
 [16 27]]
              precision    recall  f1-score   support

           0       0.84      0.84      0.84        97
           1       0.63      0.63      0.63        43

    accuracy                           0.77       140
   macro avg       0.73      0.73      0.73       140
weighted avg       0.77      0.77      0.77       140



###**Use Standardization**

In [None]:
from sklearn.preprocessing import StandardScaler
std_scaler_object = preprocessing.StandardScaler()
X_train2 = std_scaler_object.fit_transform(X_train)
X_train2 = pd.DataFrame(X_train2 , columns = columnNames)
X_train2.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,0.356828,0.701621,0.481266,-1.282331,-0.699296,1.620803,-0.856343,1.237655
1,0.653743,-3.715153,-0.03385,1.264803,-0.699296,0.876625,0.685487,0.716733
2,0.356828,0.732507,0.687312,0.332925,1.792114,0.008416,-0.104091,2.192679
3,-0.830832,-0.00877,0.584289,1.69968,1.049061,0.864222,1.934455,0.716733
4,-1.127746,-0.472067,1.099405,-1.282331,-0.699296,-0.289255,-0.836245,1.150835


In [None]:
from sklearn.preprocessing import StandardScaler
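# Caution: as with MinMaxScaler above, this fits a second scaler on X_test
# instead of reusing the one fitted on X_train.
std_scaler_object = preprocessing.StandardScaler()
X_test2 = std_scaler_object.fit_transform(X_test)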
X_test2 = pd.DataFrame(X_test2 , columns = columnNames)
X_test2.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,-0.882294,-0.081918,0.91371,1.38148,0.755699,1.821723,0.282521,-0.683611
1,-1.182102,-0.464203,0.312208,-1.234718,-0.662614,1.821723,0.915282,-0.841803
2,0.916558,2.339219,0.011457,0.871003,0.547123,-0.889625,-0.933512,1.610159
3,1.216366,0.109224,1.314711,-1.234718,-0.662614,-4.258676,-0.689599,1.531064
4,-0.882294,-1.611058,0.412458,1.955768,-0.287179,0.197599,-0.017953,-1.079089


In [None]:
from sklearn.linear_model import LogisticRegression
#create an instance and fit the model 
logmodel2 = LogisticRegression()
logmodel2.fit(X_train2, y_train)

LogisticRegression()

In [None]:
#predictions
predictions2 = logmodel2.predict(X_test2)

In [None]:
print(confusion_matrix(y_test, predictions2))
print(classification_report(y_test,predictions2))

[[80 17]
 [19 24]]
              precision    recall  f1-score   support

           0       0.81      0.82      0.82        97
           1       0.59      0.56      0.57        43

    accuracy                           0.74       140
   macro avg       0.70      0.69      0.69       140
weighted avg       0.74      0.74      0.74       140



**Observation: Normalization gives a test accuracy of 77%, Standardization gives 74%, and the unscaled baseline gives 75%**

**Decision: We will use the "Normalization" method for our model.**


##**IX) Write out the transformed Input file for further usage**

In [None]:
X1 = df.drop(['Outcome'], axis = 1)
Y1 = df['Outcome']
X1.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,6,148,72,35,0,33.6,0.627,50
1,1,85,66,29,0,26.6,0.351,31
2,8,183,64,0,0,23.3,0.672,32
3,1,89,66,23,94,28.1,0.167,21
4,0,137,40,35,168,43.1,2.288,33


In [None]:
min_max_scaler_object = preprocessing.MinMaxScaler()
X2 = min_max_scaler_object.fit_transform(X1)
X2 = pd.DataFrame(X2 , columns = columnNames)
print(X2.shape)
X2.head()

(700, 8)


Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,0.352941,0.743719,0.590164,0.353535,0.0,0.500745,0.234415,0.483333
1,0.058824,0.427136,0.540984,0.292929,0.0,0.396423,0.116567,0.166667
2,0.470588,0.919598,0.52459,0.0,0.0,0.347243,0.253629,0.183333
3,0.058824,0.447236,0.540984,0.232323,0.111111,0.418778,0.038002,0.0
4,0.0,0.688442,0.327869,0.353535,0.198582,0.642325,0.943638,0.2


In [None]:
df1 = pd.DataFrame(data=X2)
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 8 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               700 non-null    float64
 1   Glucose                   700 non-null    float64
 2   BloodPressure             700 non-null    float64
 3   SkinThickness             700 non-null    float64
 4   Insulin                   700 non-null    float64
 5   BMI                       700 non-null    float64
 6   DiabetesPedigreeFunction  700 non-null    float64
 7   Age                       700 non-null    float64
dtypes: float64(8)
memory usage: 43.9 KB


In [None]:
df1.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,0.352941,0.743719,0.590164,0.353535,0.0,0.500745,0.234415,0.483333
1,0.058824,0.427136,0.540984,0.292929,0.0,0.396423,0.116567,0.166667
2,0.470588,0.919598,0.52459,0.0,0.0,0.347243,0.253629,0.183333
3,0.058824,0.447236,0.540984,0.232323,0.111111,0.418778,0.038002,0.0
4,0.0,0.688442,0.327869,0.353535,0.198582,0.642325,0.943638,0.2


In [None]:
df1 = pd.concat([df1,Y1], axis=1)
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               700 non-null    float64
 1   Glucose                   700 non-null    float64
 2   BloodPressure             700 non-null    float64
 3   SkinThickness             700 non-null    float64
 4   Insulin                   700 non-null    float64
 5   BMI                       700 non-null    float64
 6   DiabetesPedigreeFunction  700 non-null    float64
 7   Age                       700 non-null    float64
 8   Outcome                   700 non-null    int64  
dtypes: float64(8), int64(1)
memory usage: 49.3 KB


In [None]:
df1.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,0.352941,0.743719,0.590164,0.353535,0.0,0.500745,0.234415,0.483333,1
1,0.058824,0.427136,0.540984,0.292929,0.0,0.396423,0.116567,0.166667,0
2,0.470588,0.919598,0.52459,0.0,0.0,0.347243,0.253629,0.183333,1
3,0.058824,0.447236,0.540984,0.232323,0.111111,0.418778,0.038002,0.0,0
4,0.0,0.688442,0.327869,0.353535,0.198582,0.642325,0.943638,0.2,1


In [None]:
from google.colab import files
df1.to_csv("gdrive/My Drive/NCJ-MLP-Training-2022/NCJ-MLP-Projects-Latest/03-Diabetes-Project/Data-Files/Diabetes_train_Preprocessed1.csv", index = False)
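
A quick way to confirm the export is to read the file back and compare it with the frame that was written. The sketch below does this with a small made-up frame in a temporary directory; the file name `preprocessed_demo.csv` is hypothetical and the Drive path above is not touched.

```python
import os
import tempfile

import pandas as pd

# Small made-up frame standing in for the preprocessed data.
demo = pd.DataFrame({
    "Glucose": [0.0, 0.5, 1.0],
    "BMI": [0.25, 0.75, 0.5],
    "Outcome": [1, 0, 1],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "preprocessed_demo.csv")
    demo.to_csv(path, index=False)   # index=False avoids an extra index column
    reloaded = pd.read_csv(path)

# True when values, column names and dtypes all match.
round_trip_ok = reloaded.equals(demo)
print("round trip ok:", round_trip_ok)
print(list(reloaded.columns))
```

Writing with `index=False` is what keeps an `Unnamed: 0` column from appearing when the file is read again.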