### Program 6: Write a Python program to predict the income levels of adult individuals using a Support Vector Machine model.
The process includes training, testing, and evaluating the model on the Adult dataset. In this experiment, you train a classifier on the Adult dataset to predict whether an individual's income is greater than $50,000 or at most $50,000.



Dataset: We use a smaller version of the Adult income dataset. This version has 3574 rows and 7 columns (the full Adult dataset has 15 columns in total).
The target column is "income", which is divided into two classes: <=50K and >50K. The remaining 6 columns are attributes: demographics and other features that describe a person.

The 6 attributes are:

- Age.
- Workclass.
- Education (number of years).
- Occupation.
- Gender.
- Hours-per-week.

The dataset contains missing values that are marked with a question mark character (?). The target column has two class values, '>50K' and '<=50K', so this is a binary classification task.
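As a side note, pandas can also treat the `?` marker as missing while the file is being read, through the `na_values` argument of `read_csv`. The sketch below illustrates this on a tiny in-memory CSV; the three rows are made up for the example and are not taken from the actual file. It is an alternative to replacing `?` after loading, not the approach this program follows.

```python
import io
import pandas as pd

# Hypothetical three-row sample mimicking the layout of smaller_adult.csv
sample_csv = io.StringIO(
    "age,workclass,occupation,income\n"
    "25,Private,Sales,<=50K\n"
    "18,?,?,<=50K\n"
    "44,Private,Craft-repair,>50K\n"
)

# na_values="?" makes pandas parse every "?" cell as NaN while reading
sample_df = pd.read_csv(sample_csv, na_values="?")
missing_per_column = sample_df.isnull().sum()
print(missing_per_column)
```

This only works when the cell contains exactly `?`; a marker with surrounding spaces would need extra handling.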

In [None]:
#Required imports
import numpy as np
import pandas as pd


In [None]:
#Read dataset
df = pd.read_csv("smaller_adult.csv")
df.head()

,age,workclass,educational-num,occupation,gender,hours-per-week,income
0,25,Private,7,Machine-op-inspct,Male,40,<=50K
1,38,Private,9,Farming-fishing,Male,50,<=50K
2,28,Local-gov,12,Protective-serv,Male,40,>50K
3,44,Private,10,Machine-op-inspct,Male,40,>50K
4,18,?,10,?,Female,30,<=50K


In [None]:
df.columns

Index(['age', 'workclass', 'educational-num', 'occupation', 'gender',
       'hours-per-week', 'income'],
      dtype='object')

In [None]:
df.shape

(3574, 7)

In [None]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3574 entries, 0 to 3573
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              3574 non-null   int64 
 1   workclass        3574 non-null   object
 2   educational-num  3574 non-null   int64 
 3   occupation       3574 non-null   object
 4   gender           3574 non-null   object
 5   hours-per-week   3574 non-null   int64 
 6   income           3574 non-null   object
dtypes: int64(3), object(4)
memory usage: 195.6+ KB


In [None]:
df.describe()

,age,educational-num,hours-per-week
count,3574.0,3574.0,3574.0
mean,38.544208,10.051763,40.412423
std,13.739739,2.606555,12.582046
min,17.0,1.0,1.0
25%,27.0,9.0,39.0
50%,37.0,10.0,40.0
75%,47.75,13.0,45.0
max,90.0,16.0,99.0


In [None]:
# See the columns that contain a "?" and how many "?" are there in those columns
df.isin(['?']).sum()

age                  0
workclass          214
educational-num      0
occupation         214
gender               0
hours-per-week       0
income               0
dtype: int64

In [None]:
df.columns

Index(['age', 'workclass', 'educational-num', 'occupation', 'gender',
       'hours-per-week', 'income'],
      dtype='object')

In [None]:
#Replace ? with NaN
df['workclass'] = df['workclass'].replace('?', np.nan)
df['occupation'] = df['occupation'].replace('?', np.nan)


In [None]:
#Now the ? has been replaced by NaN, so count of ? is 0
df.isin(['?']).sum()

age                0
workclass          0
educational-num    0
occupation         0
gender             0
hours-per-week     0
income             0
dtype: int64

In [None]:
#Check missing values - NaN values
df.isnull().sum()

age                  0
workclass          214
educational-num      0
occupation         214
gender               0
hours-per-week       0
income               0
dtype: int64

In [None]:
#Drop all rows that contain a missing value
df.dropna(how='any', inplace=True)

In [None]:
#Check for duplicate rows in the dataframe
print(f"There are {df.duplicated().sum()} duplicate rows")

There are 238 duplicate rows


In [None]:
df = df.drop_duplicates()
df.shape

(3122, 7)

In [None]:
df.columns

Index(['age', 'workclass', 'educational-num', 'occupation', 'gender',
       'hours-per-week', 'income'],
      dtype='object')

In [None]:
#Extract X and y from the dataframe: "income" is the target column, the remaining columns are features
X = df.loc[:,['age', 'workclass', 'educational-num', 'occupation', 'gender', 'hours-per-week']]
y = df.loc[:,'income']

In [None]:
# Since y is a binary categorical column, use LabelEncoder to convert it into a numeric column with values 0 and 1
from sklearn.preprocessing import LabelEncoder
y = LabelEncoder().fit_transform(y)
y = pd.DataFrame(y)
y.head()

,0
0,0
1,0
2,1
3,1
4,0
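To see which income class ends up as 0 and which as 1, the fitted encoder's `classes_` attribute can be inspected. The following is a minimal sketch on a hand-written list of labels (an assumption for illustration, not the dataframe column itself). `LabelEncoder` sorts the class strings, so the position of a class in `classes_` is its integer code.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels using the same two class strings as the income column
labels = ["<=50K", "<=50K", ">50K", ">50K", "<=50K"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, and a class's position in it is its integer code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns integer codes back into the original strings
print(encoder.inverse_transform([0, 1]))
```

Keeping a reference to the encoder, instead of calling `LabelEncoder().fit_transform(y)` in one expression, makes it possible to map predictions back to the original class names later.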


In [None]:
#First identify categorical features and numeric features
numeric_features = X.select_dtypes('number')
categorical_features = X.select_dtypes('object')
categorical_features

,workclass,occupation,gender
0,Private,Machine-op-inspct,Male
1,Private,Farming-fishing,Male
2,Local-gov,Protective-serv,Male
3,Private,Machine-op-inspct,Male
5,Private,Other-service,Male
...,...,...,...
3568,Private,Machine-op-inspct,Male
3570,Private,Prof-specialty,Male
3571,Private,Craft-repair,Male
3572,Private,Exec-managerial,Female


In [None]:
numeric_features

,age,educational-num,hours-per-week
0,25,7,40
1,38,9,50
2,28,12,40
3,44,10,40
5,34,6,30
...,...,...,...
3568,56,3,40
3570,29,11,40
3571,25,9,50
3572,47,9,50


In [None]:
#Convert categorical features into numeric
converted_categorical_features = pd.get_dummies(categorical_features)
converted_categorical_features.shape

(3122, 23)

In [None]:
#combine the converted categorical features and the numeric features together into a new dataframe called "newX"
all_features = [converted_categorical_features, numeric_features]
newX = pd.concat(all_features,axis=1, join='inner')
newX.shape

(3122, 26)

In [None]:
newX.columns

Index(['workclass_Federal-gov', 'workclass_Local-gov', 'workclass_Private',
       'workclass_Self-emp-inc', 'workclass_Self-emp-not-inc',
       'workclass_State-gov', 'workclass_Without-pay',
       'occupation_Adm-clerical', 'occupation_Armed-Forces',
       'occupation_Craft-repair', 'occupation_Exec-managerial',
       'occupation_Farming-fishing', 'occupation_Handlers-cleaners',
       'occupation_Machine-op-inspct', 'occupation_Other-service',
       'occupation_Priv-house-serv', 'occupation_Prof-specialty',
       'occupation_Protective-serv', 'occupation_Sales',
       'occupation_Tech-support', 'occupation_Transport-moving',
       'gender_Female', 'gender_Male', 'age', 'educational-num',
       'hours-per-week'],
      dtype='object')
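`pd.get_dummies` creates one column per category that happens to be present in the dataframe it is given. A possible alternative, sketched below on hypothetical toy rows, is scikit-learn's `OneHotEncoder` inside a `ColumnTransformer`: the categories are learned once with `fit` and reused for new rows, and `handle_unknown="ignore"` encodes a category that was not seen during fitting as all zeros. The toy frames and the `StandardScaler` step for the numeric columns are assumptions made for this illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy rows with some of the same column names as X
toy_train = pd.DataFrame({
    "age": [25, 38, 28, 44],
    "workclass": ["Private", "Private", "Local-gov", "Private"],
    "gender": ["Male", "Male", "Female", "Male"],
    "hours-per-week": [40, 50, 40, 40],
})
toy_new = pd.DataFrame({
    "age": [30],
    "workclass": ["State-gov"],  # category not seen during fit
    "gender": ["Female"],
    "hours-per-week": [45],
})

preprocess = ColumnTransformer(
    transformers=[
        # handle_unknown="ignore" encodes unseen categories as all zeros
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["workclass", "gender"]),
        ("num", StandardScaler(), ["age", "hours-per-week"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

train_encoded = preprocess.fit_transform(toy_train)
new_encoded = preprocess.transform(toy_new)
print(train_encoded.shape, new_encoded.shape)
```

Both outputs have the same number of columns, because the column layout is fixed when the transformer is fitted.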

In [None]:
#Do a train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(newX, y, test_size=0.33, random_state=42)
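When the two classes are not equally frequent, a purely random split can leave the train and test sets with slightly different class ratios. `train_test_split` accepts a `stratify` argument that keeps the ratio approximately the same in both parts. The sketch below shows the idea on hypothetical toy labels; it is an optional variation, not the split used in this program.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 75 samples of class 0 and 25 of class 1
toy_y = np.array([0] * 75 + [1] * 25)
toy_X = np.arange(100).reshape(-1, 1)

# stratify=toy_y keeps the class ratio (approximately) equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# Fraction of class 1 in each split
print(y_tr.mean(), y_te.mean())
```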

In [None]:
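# Load the Support Vector Machine classifier
from sklearn.svm import SVC
# Note: gamma only affects the 'rbf', 'poly' and 'sigmoid' kernels; the linear kernel ignores it
clf = SVC(kernel="linear", gamma = 'auto')
clf.fit(X_train, y_train.values.ravel())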


SVC(gamma='auto', kernel='linear')
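Support vector machines are sensitive to the scale of the input features, and here columns such as age and hours-per-week take much larger values than the 0/1 dummy columns. One common option is to standardize the features inside a pipeline, so that the scaler is fitted on the training split only. The sketch below shows this on synthetic stand-in data from `make_classification`; the data and the variable names are assumptions for illustration, and the resulting score says nothing about the Adult dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (not the Adult dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

# The scaler is fitted on the training split only, then reused for the test split
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel="linear"))
scaled_svm.fit(X_tr, y_tr)

# score() returns the mean accuracy on the given data
demo_accuracy = scaled_svm.score(X_te, y_te)
print(demo_accuracy)
```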

In [None]:
# Make predictions
y_pred = clf.predict(X_test)

In [None]:
predictions_df = pd.DataFrame()
predictions_df['predicted_salary_class'] = y_pred
predictions_df['actual_salary_class'] = y_test[0].values
predictions_df

,predicted_salary_class,actual_salary_class
0,0,0
1,0,1
2,0,0
3,0,0
4,0,0
...,...,...
1026,1,1
1027,0,0
1028,0,0
1029,0,0


In [None]:
#Evaluate the model on the test set (accuracy_score expects y_true first, then y_pred)
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

0.7672162948593598
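Accuracy on its own can be hard to interpret when one class is more frequent than the other, because always predicting the majority class already gives a nonzero score. A minimal sketch of two complementary checks follows: a majority-class baseline and a confusion matrix. The label arrays are hypothetical and are not the predictions produced by this program.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels and predictions, not the results of this program
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 0, 0, 1, 0, 0, 1, 1])

# Baseline: always predict the most frequent class in y_true_demo
majority_class = np.bincount(y_true_demo).argmax()
baseline_pred = np.full_like(y_true_demo, majority_class)

model_acc = accuracy_score(y_true_demo, y_pred_demo)
baseline_acc = accuracy_score(y_true_demo, baseline_pred)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(model_acc, baseline_acc)
print(cm)
```

Applied to this program, the same calls would take `y_test` and `y_pred` in place of the toy arrays; comparing the model's accuracy with the baseline shows how much the classifier adds over a trivial guess.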
