<a name="1"></a>
# SVC


The dataset used in this section is Adult Census Income: https://www.kaggle.com/uciml/adult-census-income/

The prediction task is to determine whether a person makes over $50K a year.

### Loading and exploring data

In [None]:
# Load data
import numpy as np
import pandas as pd

df = pd.read_csv('/Who can earn more than 50K per year.csv')

print("Shape of the train dataframe =", df.shape)

df.head()

Shape of the train dataframe = (32561, 15)


Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
0,90,?,77053,HS-grad,9,Widowed,?,Not-in-family,White,Female,0,4356,40,United-States,<=50K
1,82,Private,132870,HS-grad,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States,<=50K
2,66,?,186061,Some-college,10,Widowed,?,Unmarried,Black,Female,0,4356,40,United-States,<=50K
3,54,Private,140359,7th-8th,4,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States,<=50K
4,41,Private,264663,Some-college,10,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States,<=50K


In [None]:
df = df.replace('?', np.nan)  # replace '?' placeholders with NaN

# Check missing values in data
print(f"Missing values in the training set:\n{df.isnull().sum()}\n")

df = df.dropna()
df.head()

Missing values in the training set:
age                  0
workclass         1836
fnlwgt               0
education            0
education.num        0
marital.status       0
occupation        1843
relationship         0
race                 0
sex                  0
capital.gain         0
capital.loss         0
hours.per.week       0
native.country     583
income               0
dtype: int64



Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
1,82,Private,132870,HS-grad,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States,<=50K
3,54,Private,140359,7th-8th,4,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States,<=50K
4,41,Private,264663,Some-college,10,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States,<=50K
5,34,Private,216864,HS-grad,9,Divorced,Other-service,Unmarried,White,Female,0,3770,45,United-States,<=50K
6,38,Private,150601,10th,6,Separated,Adm-clerical,Unmarried,White,Male,0,3770,40,United-States,<=50K
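
As an alternative to replacing `?` after loading, the placeholder can be parsed as a missing value at load time through the `na_values` parameter of `pd.read_csv`. The sketch below assumes a made-up three-row CSV (the rows and the names `toy_csv`, `toy`, and `toy_clean` are illustrative, not taken from the dataset file) and then drops incomplete rows the same way as the cell above.

```python
import io

import pandas as pd

# Hypothetical sample in the style of the dataset; these are not real rows.
toy_csv = io.StringIO(
    "age,workclass,occupation,income\n"
    "90,?,?,<=50K\n"
    "82,Private,Exec-managerial,<=50K\n"
    "41,Private,Prof-specialty,>50K\n"
)

# na_values makes read_csv treat '?' as NaN while parsing.
toy = pd.read_csv(toy_csv, na_values="?")

# Count missing values per column, then drop rows that contain any.
missing_per_column = toy.isnull().sum()
toy_clean = toy.dropna()

print(missing_per_column)
print("Shape after dropna:", toy_clean.shape)
```

Either route ends in the same place; handling the placeholder at load time simply avoids a separate `replace` pass.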


In [None]:
df['income'] = df['income'].map({'<=50K':0, '>50K':1})

X = df.drop(['income'], axis=1)
y = df['income']

In [None]:
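# Encode categorical features as integers, then standardize all features
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler

categorical = ['workclass', 'education', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'native.country']
for feature in categorical:
    # LabelEncoder assigns each category an integer code in sorted label order
    le = preprocessing.LabelEncoder()
    X[feature] = le.fit_transform(X[feature])

scaler = StandardScaler()

# Standardize every column to zero mean and unit variance; the new DataFrame gets a fresh 0..n-1 index
X = pd.DataFrame(scaler.fit_transform(X), columns = X.columns)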
X.head()

Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country
0,3.31663,-0.208955,-0.53879,0.174763,-0.439738,2.282969,-0.734545,-0.261249,0.385048,-1.443405,-0.147445,10.555814,-1.914161,0.264924
1,1.184831,-0.208955,-0.467906,-1.39912,-2.400559,-1.722396,0.009964,1.612215,0.385048,-1.443405,-0.147445,9.427915,-0.077734,0.264924
2,0.195067,-0.208955,0.708645,1.224018,-0.047574,1.615408,0.754473,0.987727,0.385048,-1.443405,-0.147445,9.427915,-0.077734,0.264924
3,-0.337883,-0.208955,0.256222,0.174763,-0.439738,-1.722396,0.258134,1.612215,0.385048,-1.443405,-0.147445,9.106365,0.339636,0.264924
4,-0.03334,-0.208955,-0.370964,-2.710688,-1.616231,1.615408,-1.479055,1.612215,0.385048,0.692806,-0.147445,9.106365,-0.077734,0.264924
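
The SVC sections that follow call a helper named `SVC_train`, whose definition does not appear in this section. Judging only from what it prints (five fold accuracies, their mean and standard deviation, then train and dev accuracy), a helper of that kind might look roughly like the sketch below. Everything here is an assumption rather than the author's code: the name `svc_train_sketch`, the 70/30 split with `random_state=0` (borrowed from the logistic regression cell), running cross-validation on the training part only, and the synthetic data from `make_classification`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC


def svc_train_sketch(model, X, y, n_folds=5):
    """Hypothetical stand-in for SVC_train; not the author's implementation."""
    # Hold out 30% of the rows as a dev set (assumed split).
    X_train, X_dev, y_train, y_dev = train_test_split(
        X, y, test_size=0.3, random_state=0
    )

    # k-fold cross-validation on the training part; the estimator is cloned per fold.
    scores = cross_val_score(model, X_train, y_train, cv=n_folds, scoring="accuracy")
    print("Cross-validation accuracies:")
    for fold, score in enumerate(scores, start=1):
        print(f"{fold} fold: {score:.3f}")
    print(f"CV mean = {scores.mean():.3f}, CV standard deviation = {scores.std():.3f}")

    # Fit on the full training part and report train and dev accuracy.
    model.fit(X_train, y_train)
    print(f"Train accuracy: {model.score(X_train, y_train):.3f}")
    print(f"Dev accuracy : {model.score(X_dev, y_dev):.3f}")
    return model


# Synthetic stand-in data, used only so that the sketch runs on its own.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
demo_model = svc_train_sketch(SVC(kernel="linear", C=0.1), X_demo, y_demo)
```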


### SVC with a linear kernel

In [None]:
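# Create an SVM model with linear kernel
# (SVC_train is a helper that is not defined in this section; gamma is ignored by the linear kernel)
from sklearn.svm import SVC

model_linear = SVC(kernel='linear', C=0.1, gamma='auto')
model_linear = SVC_train(model_linear, X, y)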

print("Bias:", model_linear.intercept_)
print("Weights:", model_linear.coef_[0])

Cross-validation accuracies:
1 fold: 0.814
2 fold: 0.814
3 fold: 0.803
4 fold: 0.810
5 fold: 0.811
CV mean μ = 0.810 with CV standard deviation σ = 0.39

Train accuracy: 0.811
Dev accuracy : 0.812

Bias: [-0.95951309]
Weights: [ 1.74751693e-01 -4.91135060e-02  2.35391107e-02  6.65519527e-03
  4.45404724e-01 -1.43241982e-01 -7.29207853e-03 -1.29584303e-01
  2.74692464e-02  1.81484682e-01  1.71658238e+00  2.32996170e-01
  1.43793489e-01 -1.96432215e-04]
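
With a linear kernel, the printed bias `b` and weights `w` fully describe the classifier: the decision value is `w · x + b`, and class 1 is predicted when that value is positive. Because the features were standardized, the weight magnitudes are roughly comparable; in the output above the largest one (about 1.72, in the eleventh position) lines up with `capital.gain`. The sketch below checks the formula against `decision_function` on synthetic data; the data and the names `X_demo`, `clf`, and `manual_scores` are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in data so the sketch is self-contained.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
clf = SVC(kernel="linear", C=0.1).fit(X_demo, y_demo)

# Recompute the decision values by hand from the learned parameters.
w = clf.coef_[0]
b = clf.intercept_[0]
manual_scores = X_demo @ w + b

# A positive decision value corresponds to class 1 in this binary setting.
manual_pred = (manual_scores > 0).astype(int)

scores_match = np.allclose(manual_scores, clf.decision_function(X_demo))
agreement = np.mean(manual_pred == clf.predict(X_demo))
print("Decision values match:", scores_match)
print("Prediction agreement:", agreement)
```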


### SVC with an RBF kernel

In [None]:
# Create an SVM model with RBF kernel
model_rbf = SVC(kernel='rbf', C=1, gamma='auto')
model_rbf = SVC_train(model_rbf, X, y)

Cross-validation accuracies:
1 fold: 0.843
2 fold: 0.841
3 fold: 0.845
4 fold: 0.840
5 fold: 0.844
CV mean μ = 0.843 with CV standard deviation σ = 0.20

Train accuracy: 0.853
Dev accuracy : 0.845
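
The RBF kernel measures similarity as `K(x, x') = exp(-gamma * ||x - x'||^2)`, and in scikit-learn `gamma='auto'` means `1 / n_features`, which is `1/14` for the 14 features used here. The sketch below computes that kernel by hand on random data and compares it with `sklearn.metrics.pairwise.rbf_kernel`; the random matrix `X_demo` is an assumption used only to make the example runnable.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Random stand-in data with 14 columns, matching the feature count above.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(4, 14))

# gamma='auto' in SVC corresponds to 1 / n_features.
gamma = 1.0 / X_demo.shape[1]

# Pairwise squared Euclidean distances, then the RBF formula.
sq_dists = ((X_demo[:, None, :] - X_demo[None, :, :]) ** 2).sum(axis=-1)
K_manual = np.exp(-gamma * sq_dists)

# Reference implementation from scikit-learn.
K_sklearn = rbf_kernel(X_demo, gamma=gamma)
kernels_match = np.allclose(K_manual, K_sklearn)
print("Kernels match:", kernels_match)
```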



### SVC with a polynomial kernel

In [None]:
# Create an SVM model with polynomial kernel
model_poly = SVC(kernel='poly', C=0.1, degree=3, gamma='auto')
model_poly = SVC_train(model_poly, X, y)

Cross-validation accuracies:
1 fold: 0.815
2 fold: 0.806
3 fold: 0.807
4 fold: 0.805
5 fold: 0.810
CV mean μ = 0.808 with CV standard deviation σ = 0.38

Train accuracy: 0.815
Dev accuracy : 0.812
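
For `kernel='poly'`, scikit-learn's `SVC` uses `K(x, x') = (gamma * <x, x'> + coef0)^degree`, with `coef0` defaulting to `0.0` in `SVC`. With `degree=3` and `gamma='auto'` (that is, `1/14` here), this can be reproduced by hand as in the sketch below. The random matrix `X_demo` is an assumption for illustration; note that `polynomial_kernel` has its own default of `coef0=1`, so the value is passed explicitly.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

# Random stand-in data with 14 columns, matching the feature count above.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(4, 14))

degree = 3
gamma = 1.0 / X_demo.shape[1]  # what gamma='auto' resolves to
coef0 = 0.0                    # SVC default

# Polynomial kernel computed directly from the formula.
K_manual = (gamma * (X_demo @ X_demo.T) + coef0) ** degree

# Reference implementation; coef0 is passed explicitly because its default differs.
K_sklearn = polynomial_kernel(X_demo, degree=degree, gamma=gamma, coef0=coef0)
kernels_match = np.allclose(K_manual, K_sklearn)
print("Kernels match:", kernels_match)
```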



### Logistic Regression model 

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Split the data
X_train, X_dev, y_train, y_dev = train_test_split(X, y, test_size=0.3, random_state=0)

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_dev)

print('Logistic Regression accuracy score with all the features: {}'.format(accuracy_score(y_dev, y_pred)))

Logistic Regression accuracy score with all the features: 0.8216377500276274
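
For a binary problem, logistic regression turns the same kind of linear score `w · x + b` into a probability through the sigmoid function, and `predict` returns class 1 when that probability exceeds 0.5. The sketch below illustrates this on synthetic data; the data and the names `X_demo`, `clf`, and `manual_proba` are assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data so the sketch is self-contained.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(X_demo, y_demo)

# Linear score followed by the sigmoid gives P(class 1 | x).
z = X_demo @ clf.coef_[0] + clf.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-z))

proba_match = np.allclose(manual_proba, clf.predict_proba(X_demo)[:, 1])
print("Probabilities match:", proba_match)
```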


### Confusion matrix and derived metrics

In [None]:
from sklearn.metrics import confusion_matrix

def display_cm(X, y, model, model_name):
    # Obtain predictions
    y_pred = model.predict(X)
    
    # Get a confusion matrix
    C = confusion_matrix(y, y_pred)
    
    # Compute metrics
    precision = C[1,1] / (C[1,1] + C[0,1])
    recall = C[1,1] / (C[1,1] + C[1,0])
    acc = (C[1,1] + C[0,0]) / (C[1,1] + C[0,0] + C[0,1] + C[1,0])
    F1_score = 2 * precision * recall / (precision + recall)
    
    print(f"{model_name}:")
    print("Confusion matrix:\n", C)
    print()
    print("TP = ", C[1,1])
    print("FP = ", C[0,1])
    print("TN = ", C[0,0])
    print("FN = ", C[1,0])
    print()
    print("Precision: {:.3f}".format(precision))
    print("Recall: {:.3f}".format(recall))
    print("Accuracy: {:.3f}".format(acc))
    print("F1 score: {:.3f}".format(F1_score))
    print()

models = [model_linear, model_rbf, model_poly, logreg]
names = ['Linear kernel', 'RBF kernel', 'Polynomial kernel', 'Logistic Regression']

for model, name in zip(models, names):
    print("----------------------------------------")
    display_cm(X_train, y_train, model, name)

----------------------------------------
Linear kernel:
Confusion matrix:
 [[15448   445]
 [ 3548  1672]]

TP =  1672
FP =  445
TN =  15448
FN =  3548

Precision: 0.790
Recall: 0.320
Accuracy: 0.811
F1 score: 0.456

----------------------------------------
RBF kernel:
Confusion matrix:
 [[15018   875]
 [ 2231  2989]]

TP =  2989
FP =  875
TN =  15018
FN =  2231

Precision: 0.774
Recall: 0.573
Accuracy: 0.853
F1 score: 0.658

----------------------------------------
Polynomial kernel:
Confusion matrix:
 [[15564   329]
 [ 3572  1648]]

TP =  1648
FP =  329
TN =  15564
FN =  3572

Precision: 0.834
Recall: 0.316
Accuracy: 0.815
F1 score: 0.458

----------------------------------------
Logistic Regression:
Confusion matrix:
 [[14922   971]
 [ 2830  2390]]

TP =  2390
FP =  971
TN =  14922
FN =  2830

Precision: 0.711
Recall: 0.458
Accuracy: 0.820
F1 score: 0.557
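
As a worked check of the formulas in `display_cm`, the sketch below recomputes the metrics from the linear-kernel confusion matrix printed above. It also adds a majority-class baseline, which is not part of the original output: since about 75% of the training rows are negative, an accuracy near 0.81 can coexist with a recall of only 0.32.

```python
import numpy as np

# Linear-kernel confusion matrix from the output above
# (rows: true class, columns: predicted class).
C = np.array([[15448, 445],
              [3548, 1672]])
tn, fp = C[0]
fn, tp = C[1]

precision = tp / (tp + fp)   # 1672 / 2117
recall = tp / (tp + fn)      # 1672 / 5220
accuracy = (tp + tn) / C.sum()
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of always predicting the majority (negative) class.
baseline = (tn + fp) / C.sum()

print(f"Precision: {precision:.3f}")       # 0.790
print(f"Recall: {recall:.3f}")             # 0.320
print(f"Accuracy: {accuracy:.3f}")         # 0.811
print(f"F1 score: {f1:.3f}")               # 0.456
print(f"Majority baseline: {baseline:.3f}")  # 0.753
```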

