In [1]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from imblearn.combine import SMOTEENN 
from collections import Counter
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.metrics import confusion_matrix, plot_confusion_matrix,classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC,LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.neighbors import LocalOutlierFactor

### 1. Perform combined over and undersampling on the diabetes dataset (use SMOTEENN). Explain how combined sampling works.

In [2]:
diabetes_df = pd.read_csv("diabetes.csv")
diabetes_df

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


In [3]:
X = diabetes_df.drop('Outcome', axis=1)
y = diabetes_df['Outcome']
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,stratify=y,random_state=42)

#Standard Scaler:
sc= StandardScaler()
X_train=sc.fit_transform(X_train)
X_test=sc.fit_transform(X_test)

In [4]:
# Applying SMOTEENN:
sme = SMOTEENN(random_state=42)

X_res, y_res = sme.fit_resample(X_train, y_train)
print('Resampled dataset shape %s' % Counter(y_res))

Resampled dataset shape Counter({1: 215, 0: 201})


In [5]:
# Applying Logistic Regression on balanced class set:
diabetes_log= LogisticRegression(solver='liblinear')
diabetes_log.fit(X_train,y_train)

y_pred= diabetes_log.predict(X_test)

In [6]:
diabetes_log.coef_

array([[ 0.42957868,  1.10919168, -0.20374881, -0.00767468, -0.07913502,
         0.72513685,  0.23777214,  0.15312895]])

In [7]:
diabetes_log.intercept_

array([-0.86016522])

In [8]:
con_log=confusion_matrix(y_test,y_pred)
con_log

array([[105,  20],
       [ 31,  36]])

In [9]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.77      0.84      0.80       125
           1       0.64      0.54      0.59        67

    accuracy                           0.73       192
   macro avg       0.71      0.69      0.69       192
weighted avg       0.73      0.73      0.73       192



### Combined Sampling:
#### Machine learning algorithms when applied over classiciation problems often give less performance due to imbalanced datasets. Therefore, we need to balance class distribution using sampling methods.  Commonly used, sampling methods are: Oversampling, which duplicates the data from the minority class and Undersampling, which deletes the data from the majority class. While these sampling methods are applied individually on the dataset it gives good results. However, when used both methods together datasets giving better results than before. SMOTE and random sampling, SMOTEENN are most commonly used combined sampling methods. Initally SMOTE is applied over the data to create duplicates of the minority set and undersampling methods are applied to the resultant of the SMOTE to delete  the majority class.

#### SMOTEENN :Initially SMOTE randomly choose data from the minority class. Then calculate distance between the random data and its knearest neighbors. Then multiply the differnce value with 0 or 1 and add the result as a new data point. This process is repeated untill desired accuracy is reached.  Next we have apply ENN to apply undersampling. Determine the value for k then find k nearest neighbors and observe the majority class in both data and k nearest. If both are different then the observation along with the nearest neighbor are deleted. Repeat this untill the desired propotion of class is reached.

### 2. Comment on the performance of combined sampling vs the other approaches we have used for the diabetes dataset.

In [10]:
X = diabetes_df.drop('Outcome', axis=1)
y = diabetes_df['Outcome']
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,stratify=y,random_state=42)

#Standard Scaler:
sc= StandardScaler()
X_train=sc.fit_transform(X_train)
X_test=sc.fit_transform(X_test)

#Standardize
sc= StandardScaler()
X_train=sc.fit_transform(X_train)
X_test=sc.fit_transform(X_test)

In [11]:
classifiers = [LogisticRegression(solver='liblinear'),SVC(), KNeighborsClassifier(),DecisionTreeClassifier(max_depth=5)]
acc=[]
pre=[]
rec=[]

for model in classifiers:
    model=model.fit(X_train,y_train)
    y_pred= model.predict(X_test)
    #con_log=confusion_matrix(y_test,y_pred)
    con=classification_report(y_test,y_pred,output_dict=True)
    pre.append(con['weighted avg']['precision'])
    rec.append(con['weighted avg']['recall'])
    acc.append(con['accuracy'])

In [12]:
df=pd.DataFrame({'Accuracy':acc,'Precission': pre, 'Recall':rec })
df.index=('LogisticRegiossion','SVC','KNN','DecissionTree')
df

Unnamed: 0,Accuracy,Precission,Recall
LogisticRegiossion,0.734375,0.726973,0.734375
SVC,0.744792,0.737991,0.744792
KNN,0.71875,0.710993,0.71875
DecissionTree,0.770833,0.765716,0.770833


### Removing Outliers:

In [13]:
X = diabetes_df.drop('Outcome', axis=1)
y = diabetes_df['Outcome']
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,stratify=y,random_state=42)

#Standard Scaler:
sc= StandardScaler()
X_train=sc.fit_transform(X_train)
X_test=sc.fit_transform(X_test)
# Remove Outliers:

print(X_train.shape, y_train.shape)
lof = LocalOutlierFactor()
yhat = lof.fit_predict(X_train)
mask = yhat != -1
X_train, y_train = X_train[mask, :], y_train[mask]
print(X_train.shape, y_train.shape)

(576, 8) (576,)
(550, 8) (550,)


In [14]:
classifiers = [LogisticRegression(solver='liblinear'),SVC(), KNeighborsClassifier(),DecisionTreeClassifier(max_depth=5)]
acc1=[]
pre1=[]
rec1=[]


for model in classifiers:
    model=model.fit(X_train,y_train)
    y_pred= model.predict(X_test)
    con_log=confusion_matrix(y_test,y_pred)
    con=classification_report(y_test,y_pred,output_dict=True)
    pre1.append(con['weighted avg']['precision'])
    rec1.append(con['weighted avg']['recall'])
    acc1.append(con['accuracy'])

In [15]:
df_o=pd.DataFrame({'Accuracy':acc1,'Precission': pre1, 'Recall': rec1})
df_o.index=('LogisticRegiossion_o','SVC_o','KNN_o','DecissionTree_o')
df_o

Unnamed: 0,Accuracy,Precission,Recall
LogisticRegiossion_o,0.729167,0.721038,0.729167
SVC_o,0.734375,0.727897,0.734375
KNN_o,0.71875,0.710993,0.71875
DecissionTree_o,0.75,0.743827,0.75


### Applying SMOTEENN to balance classes:

In [16]:
X = diabetes_df.drop('Outcome', axis=1)
y = diabetes_df['Outcome']
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,stratify=y,random_state=42)

#Standard Scaler:
sc= StandardScaler()
X_train=sc.fit_transform(X_train)
X_test=sc.fit_transform(X_test)

# Remove Outliers:

print(X_train.shape, y_train.shape)

lof = LocalOutlierFactor()
yhat = lof.fit_predict(X_train)
mask = yhat != -1

X_train, y_train = X_train[mask, :], y_train[mask]
print(X_train.shape, y_train.shape)

sme = SMOTEENN(random_state=42)
X_res, y_res = sme.fit_resample(X_train, y_train)
print('Resampled dataset shape %s' % Counter(y_res))

(576, 8) (576,)
(550, 8) (550,)
Resampled dataset shape Counter({1: 239, 0: 188})


In [17]:
classifiers_s = [LogisticRegression(solver='liblinear'),SVC(), KNeighborsClassifier(),DecisionTreeClassifier(max_depth=5)]
acc_s=[]
pre_s=[]
rec_s=[]

for model in classifiers:
    model=model.fit(X_train,y_train)
    y_pred= model.predict(X_test)
    con_log=confusion_matrix(y_test,y_pred)
    con=classification_report(y_test,y_pred,output_dict=True)
    pre_s.append(con['weighted avg']['precision'])
    rec_s.append(con['weighted avg']['recall'])
    acc_s.append(con['accuracy'])

In [18]:
df_smote=pd.DataFrame({'Accuracy':acc_s,'Precission': pre_s, 'Recall': rec_s})
df_smote.index=('LogisticRegiossion_SMOTE','SVC_SMOTE','KNN_SMOTE','DecissionTree_SMOTE')
df_smote

Unnamed: 0,Accuracy,Precission,Recall
LogisticRegiossion_SMOTE,0.729167,0.721038,0.729167
SVC_SMOTE,0.734375,0.727897,0.734375
KNN_SMOTE,0.71875,0.710993,0.71875
DecissionTree_SMOTE,0.75,0.743827,0.75


In [19]:
#Sorting dataframe for better comparision:

diab_model=pd.concat([df,df_smote,df_o],axis=0)
diab_model.sort_index()

Unnamed: 0,Accuracy,Precission,Recall
DecissionTree,0.770833,0.765716,0.770833
DecissionTree_SMOTE,0.75,0.743827,0.75
DecissionTree_o,0.75,0.743827,0.75
KNN,0.71875,0.710993,0.71875
KNN_SMOTE,0.71875,0.710993,0.71875
KNN_o,0.71875,0.710993,0.71875
LogisticRegiossion,0.734375,0.726973,0.734375
LogisticRegiossion_SMOTE,0.729167,0.721038,0.729167
LogisticRegiossion_o,0.729167,0.721038,0.729167
SVC,0.744792,0.737991,0.744792


#### Observations:
#### All the performace measures precission, accuracy,recall are very important and should be high to achieve better performance results. Therefore, in diabetes data set, a person can be falsely labeled as diabetic as he can go for further evaluaitons but a true positive canot be labeled as negative. Therefore, models with high recall need to chosen. On observations Decission Tree is considered. It also observed that removing outliers,apping smote for this set did not make any difference. 

### 3. What is outlier detection? Why is it useful? What methods can you use for outlier detection?

#### “An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism” by Hawkins.  There are various reasons outliers will happen to be in data this can be due to a machine error, measuring error, entry error or it can be intentional. Outliers can impact the model drastically and therefore needs to be addresssed. Outlier detection methods try to fit the regions that are concentrated with the data leaving the deviatant observations.  
#### Various methods are used for outlier detection such as :
#### 1. Standard deviation method: If the s.d of a data is greater than 3 then the point is considered as outlier.
#### 2. Interquartile method: Points that fall below or above the interquartile range are too considered as outliers.
#### 3. Automatic outlier detection: This can be done by importing local outlier factor from sklearn.metrices.

### 4. Perform a linear SVM to predict credit approval (last column) using this dataset: https://archive.ics.uci.edu/ml/datasets/Statlog+%28Australian+Credit+Approval%29 . Make sure you look at the accompanying document that describes the data in the dat file. You will need to either convert this data to another file type or import the dat file to python. 
You can use this code, but otherwise you follow standard practices we have already used many times: 
from sklearn.svm import SVC
classifier = SVC(kernel='linear')


In [20]:
data = np.genfromtxt('australian.dat',
                     skip_header=1,
                     skip_footer=1,
                     names=True,
                     dtype=None,
                     delimiter=' ')
df_c=pd.DataFrame(data)
df_c.columns=['A1','A2','A3','A4','A5','A6','A7','A8','A9','A10','A11','A12','A13','A14','A15']
df_c

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15
0,0,29.58,1.750,1,4,4,1.250,0,0,0,1,2,280,1,0
1,0,21.67,11.500,1,5,3,0.000,1,1,11,1,2,0,1,1
2,1,20.17,8.170,2,6,4,1.960,1,1,14,0,2,60,159,1
3,0,15.83,0.585,2,8,8,1.500,1,1,2,0,2,100,1,1
4,1,17.42,6.500,2,3,4,0.125,0,0,0,0,2,60,101,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
682,1,43.00,0.290,1,13,8,1.750,1,1,8,0,2,100,376,1
683,1,31.57,10.500,2,14,4,6.500,1,0,0,0,2,0,1,1
684,1,20.67,0.415,2,8,4,0.125,0,0,0,0,2,0,45,0
685,0,18.83,9.540,2,6,4,0.085,1,0,0,0,2,100,1,1


In [21]:
df_c.describe()

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15
count,687.0,687.0,687.0,687.0,687.0,687.0,687.0,687.0,687.0,687.0,687.0,687.0,687.0,687.0,687.0
mean,0.678311,31.581237,4.752576,1.765648,7.372635,4.695779,2.230509,0.525473,0.427948,2.409025,0.458515,1.930131,183.624454,1021.064047,0.445415
std,0.467465,11.863305,4.978474,0.430725,3.687621,1.99614,3.351769,0.499715,0.495142,4.871537,0.498639,0.297331,171.904268,5221.187569,0.497374
min,0.0,13.75,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
25%,0.0,22.67,1.0,2.0,4.0,4.0,0.165,0.0,0.0,0.0,0.0,2.0,80.0,1.0,0.0
50%,1.0,28.67,2.75,2.0,8.0,4.0,1.0,1.0,0.0,0.0,0.0,2.0,160.0,6.0,0.0
75%,1.0,37.665,7.165,2.0,10.0,5.0,2.6675,1.0,1.0,3.0,1.0,2.0,272.0,396.0,1.0
max,1.0,80.25,28.0,3.0,14.0,9.0,28.5,1.0,1.0,67.0,1.0,3.0,2000.0,100001.0,1.0


In [22]:
X = df_c.drop('A15', axis=1)
y = df_c['A15']
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,stratify=y,random_state=24)

#Standard Scaler:
sc= StandardScaler()
X_train=sc.fit_transform(X_train)
X_test=sc.fit_transform(X_test)

In [23]:
from sklearn import svm
from sklearn.metrics import confusion_matrix, classification_report, plot_confusion_matrix

model = SVC(kernel='linear')

model = model.fit(X_train, y_train)
y_pred = model.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
print(cm)

[[72 23]
 [ 8 69]]


In [24]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.90      0.76      0.82        95
           1       0.75      0.90      0.82        77

    accuracy                           0.82       172
   macro avg       0.82      0.83      0.82       172
weighted avg       0.83      0.82      0.82       172



In [25]:
print("Testing Score:",model.score(X_test,y_test))
print("Training Score:",model.score(X_train,y_train))

Testing Score: 0.8197674418604651
Training Score: 0.8699029126213592


### 5. How did the SVM model perform? Use a classification report. 


In [26]:
TN = cm[0,0]
TP = cm[1,1]
FN = cm[1,0]
FP = cm[0,1]

Accuracy_logl= (TN + TP) / (TN+ TP + FN + FP)
print("Accuracy:",Accuracy_logl)

precision = TP / (TP + FP)
print("Precision :", precision)

sensitivity = TP / (FN + TP)
print("Sensitivity/Recall:", sensitivity)

specificity = TN / (TN + FP)
print("Specificity:", specificity)

Accuracy: 0.8197674418604651
Precision : 0.75
Sensitivity/Recall: 0.8961038961038961
Specificity: 0.7578947368421053


#### Using classification report:

In [27]:
target_names=['Not_Approved','Appoved']

In [28]:
print(classification_report(y_test, y_pred,target_names=target_names))

              precision    recall  f1-score   support

Not_Approved       0.90      0.76      0.82        95
     Appoved       0.75      0.90      0.82        77

    accuracy                           0.82       172
   macro avg       0.82      0.83      0.82       172
weighted avg       0.83      0.82      0.82       172



#### Model has good performance results as to sensitivity /recall value is almost 95% which means approving a credit to the correct individual is happening 95% of times. Which inturn tells that correct factors are considered while checking for loan approval.  Also, accuracy, precission and specificity values are pretty good. F1 score which is harmonic mean of precission and recall is also good indicator.

### 6. What kinds of jobs in data are you most interested in? Do some research on what is out there. Write about your thoughts in under 400 words. 

#### Data science career involves in collecting,shaping,storing,managing and analyzing information from the data. Data science can be applied in various fields such as recommendations for the movies, purchases, credit fraud, medicine for analysis of disease, manufacturing to predict machine break downs and many more. With my thetorical educational background of engineering i always loved and worked with data to draw some conclusions, making process better. Now, after exposure to data science my urge to work with data has been increased. After researching on the career scope with data science fields most attracted role include data scientist working with large complex data cleaning to drive strategic business decisisions, machine learning positions that involves in applying different machine learning algorithms and would also like to work on interactive visualizations softwares to present my predicitons to the business.