# _Income_ Prediction Experiment - Major Assignment 2A IF3170 AI

**Group 23 - Saturnus**
- 13515001 (K-01) - Jonathan Christopher
- 13515008 (K-02) - Kanisius Kenneth Halim
- 13515052 (K-01) - Kevin Jonathan
- 13515064 (K-01) - Tasya
- 13515065 (K-02) - Felix Limanta

## Import data

In [1]:
import pandas
import numpy as np

training_data = np.array(pandas.read_csv('./data/CensusIncome.data.txt', header=None))
test_data = np.array(pandas.read_csv('./data/CensusIncome.test.txt', header=None))

feature_names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country']
discrete_feature_indices = [1, 3, 5, 6, 7, 8, 9, 13]
discrete_feature_domains = {
    'workclass': ['Private',  'Self-emp-not-inc',  'Self-emp-inc',  'Federal-gov',  'Local-gov',  'State-gov',  'Without-pay',  'Never-worked'],
    'education': ['Bachelors',  'Some-college',  '11th',  'HS-grad',  'Prof-school',  'Assoc-acdm',  'Assoc-voc',  '9th',  '7th-8th',  '12th',  'Masters',  '1st-4th',  '10th',  'Doctorate',  '5th-6th',  'Preschool'],
    'marital-status': ['Married-civ-spouse',  'Divorced',  'Never-married',  'Separated',  'Widowed',  'Married-spouse-absent',  'Married-AF-spouse'],
    'occupation': ['Tech-support',  'Craft-repair',  'Other-service',  'Sales',  'Exec-managerial',  'Prof-specialty',  'Handlers-cleaners',  'Machine-op-inspct',  'Adm-clerical',  'Farming-fishing',  'Transport-moving',  'Priv-house-serv',  'Protective-serv',  'Armed-Forces'],
    'relationship': ['Wife',  'Own-child',  'Husband',  'Not-in-family',  'Other-relative',  'Unmarried'],
    'race': ['White',  'Asian-Pac-Islander',  'Amer-Indian-Eskimo',  'Other',  'Black'],
    'sex': ['Female',  'Male'],
    'native-country': ['United-States',  'Cambodia',  'England',  'Puerto-Rico',  'Canada',  'Germany',  'Outlying-US(Guam-USVI-etc)',  'India',  'Japan',  'Greece',  'South',  'China',  'Cuba',  'Iran',  'Honduras',  'Philippines',  'Italy',  'Poland',  'Jamaica',  'Vietnam',  'Mexico',  'Portugal',  'Ireland',  'France',  'Dominican-Republic',  'Laos',  'Ecuador',  'Taiwan',  'Haiti',  'Columbia',  'Hungary',  'Guatemala',  'Nicaragua',  'Scotland',  'Thailand',  'Yugoslavia',  'El-Salvador',  'Trinadad&Tobago',  'Peru',  'Hong',  'Holand-Netherlands']
}
discrete_value_counts = [
    len(discrete_feature_domains['workclass']),
    len(discrete_feature_domains['education']),
    len(discrete_feature_domains['marital-status']),
    len(discrete_feature_domains['occupation']),
    len(discrete_feature_domains['relationship']),
    len(discrete_feature_domains['race']),
    len(discrete_feature_domains['sex']),
    len(discrete_feature_domains['native-country'])
]

## Data preprocessing

The data must first be preprocessed by stripping _whitespace_ and separating the target data from the data used for prediction. Unknown values are replaced with the mode of the known values.

After that, the discrete data is _encoded_ with _one-hot encoding_, so that every value of every categorical feature becomes its own feature, which can be 0 or 1 (_boolean_). Continuous data is normalized to mean 0 and standard deviation 1.
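
As a minimal sketch of the two transformations described above, the following uses only the Python standard library and toy values. The helper names `one_hot` and `standardize` are illustrative assumptions and do not appear in the notebook, which relies on scikit-learn for both steps.

```python
from statistics import mean, pstdev

def one_hot(value, domain):
    """Return a 0/1 vector with a single 1 at the position of `value` in `domain`."""
    return [1 if value == candidate else 0 for candidate in domain]

def standardize(values):
    """Shift and scale values to mean 0 and population standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # assumes the values are not all identical
    return [(v - mu) / sigma for v in values]

# Toy categorical feature: 'Male' becomes its own 0/1 column.
sex_domain = ['Female', 'Male']
encoded = one_hot('Male', sex_domain)

# Toy continuous feature (made-up 'hours-per-week' values).
hours = [20, 40, 60]
scaled = standardize(hours)
```

After standardization the toy column is centered on 0, so features with very different ranges (for example `fnlwgt` and `age`) end up on a comparable scale.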

### Strip _whitespace_ from the data

In [2]:
training_data = [[item.strip() if isinstance(item, str) else item for item in row] for row in training_data]
test_data = [[item.strip() if isinstance(item, str) else item for item in row] for row in test_data]

### Separate _feature_ and _target_ labels

In [3]:
training_features = np.array([row[:-1] for row in training_data])
training_targets = np.array([row[-1] for row in training_data])

test_features = np.array([row[:-1] for row in test_data])
test_targets = np.array([row[-1] for row in test_data])

### Fill unknown values with the mode

In [4]:
from collections import Counter

training_features_modes = [Counter(filter(lambda x : x != '?', column)).most_common(1)[0][0] for column in training_features.transpose()]
for r in range(0, len(training_features)):
    for c in range(0, len(training_features[r])):
        if training_features[r][c] == '?':
            training_features[r][c] = training_features_modes[c]
            
test_features_modes = [Counter(filter(lambda x : x != '?', column)).most_common(1)[0][0] for column in test_features.transpose()]
for r in range(0, len(test_features)):
    for c in range(0, len(test_features[r])):
        if test_features[r][c] == '?':
            test_features[r][c] = test_features_modes[c]
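
The idea behind the cell above can be seen on a single toy column. This is a hedged sketch, not the notebook's own code: the function name `fill_unknown_with_mode` is hypothetical, and it assumes the column contains at least one known value.

```python
from collections import Counter

def fill_unknown_with_mode(column, unknown='?'):
    """Replace every `unknown` marker with the most common known value."""
    known = [v for v in column if v != unknown]
    mode = Counter(known).most_common(1)[0][0]  # assumes `known` is not empty
    return [mode if v == unknown else v for v in column]

# Toy 'workclass' column with two unknown entries.
workclass = ['Private', '?', 'State-gov', 'Private', '?']
filled = fill_unknown_with_mode(workclass)
```

Here `'Private'` is the mode of the known values, so both `'?'` entries are replaced with it.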

### Convert categorical _feature_ names to integers

In [5]:
for r in range(0, len(training_features)):
    for c in range(0, len(training_features[r])):
        if c in discrete_feature_indices:
            domain = discrete_feature_domains[feature_names[c]]
            training_features[r][c] = domain.index(training_features[r][c])
            
for r in range(0, len(test_features)):
    for c in range(0, len(test_features[r])):
        if c in discrete_feature_indices:
            domain = discrete_feature_domains[feature_names[c]]
            test_features[r][c] = domain.index(test_features[r][c])

### _Encode_ categorical _features_ with _one-hot encoding_

In [6]:
from sklearn.preprocessing import OneHotEncoder

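# Note: `categorical_features` and `n_values` belong to the legacy OneHotEncoder API,
# which was removed in scikit-learn 0.22, so this cell needs an older release.
oneHotEncoder = OneHotEncoder(categorical_features=discrete_feature_indices, n_values=discrete_value_counts)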
oneHotEncoder.fit(training_features)

training_features = oneHotEncoder.transform(training_features).toarray().astype(int)
test_features = oneHotEncoder.transform(test_features).toarray().astype(int)

### Normalize continuous _feature_ values

In [None]:
from sklearn import preprocessing

training_features = training_features.astype(np.float64)
training_features = np.hsplit(training_features, [-6])
training_features[1] = preprocessing.scale(training_features[1])
training_features = np.concatenate(training_features, axis=1)

test_features = test_features.astype(np.float64)
test_features = np.hsplit(test_features, [-6])
test_features[1] = preprocessing.scale(test_features[1])
test_features = np.concatenate(test_features, axis=1)


## Experiment to find the best learning algorithm

The experiment runs _10-fold cross validation_ on the _training_ data. The best scheme is the one with the highest mean _cross validation_ score and the narrowest 95% confidence range.
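
The cells in this section report `scores.mean()` together with `scores.std() * 2`. A minimal sketch of that summary is shown here with made-up fold scores; the numbers and the helper name `summarize` are assumptions for illustration only. Two standard deviations around the mean cover roughly 95% of values if the fold scores are approximately normally distributed, which is an approximation rather than a formal confidence interval.

```python
from statistics import mean, pstdev

# Made-up accuracies for 10 folds (not results from the notebook).
fold_scores = [0.84, 0.86, 0.85, 0.87, 0.83, 0.85, 0.86, 0.84, 0.85, 0.85]

def summarize(scores):
    """Return the mean score and the half-width of the +/- 2 std range."""
    center = mean(scores)
    half_width = 2 * pstdev(scores)  # population std, like numpy's default
    return center, half_width

center, half_width = summarize(fold_scores)
print("Mean performance: %f (+/- %f)" % (center, half_width))
```

A scheme with a higher `center` and a smaller `half_width` is both more accurate on average and more stable across folds.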

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

### 5-Nearest Neighbor

In [None]:
from sklearn import neighbors

knn = neighbors.KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, training_features, training_targets, cv=10)
print("Mean performance: %f (+/- %f)" % (scores.mean(), scores.std() * 2))

### Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
scores = cross_val_score(gnb, training_features, training_targets, cv=10)
print("Mean performance: %f (+/- %f)" % (scores.mean(), scores.std() * 2))

### Decision Tree

In [None]:
from sklearn import tree

dtr = tree.DecisionTreeClassifier()
scores = cross_val_score(dtr, training_features, training_targets, cv=10)
print("Mean performance: %f (+/- %f)" % (scores.mean(), scores.std() * 2))

### Random Forest

In [8]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=123)
scores = cross_val_score(clf, training_features, training_targets, cv=10)
print("Mean performance: %f (+/- %f)" % (scores.mean(), scores.std() * 2))

In [None]:
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(solver='sgd', hidden_layer_sizes=(50, 25), max_iter=5000, random_state=123)
scores = cross_val_score(mlp, training_features, training_targets, cv=10)
print("Mean performance: %f (+/- %f)" % (scores.mean(), scores.std() * 2))

## _Full-training_ of the best algorithm

_Full-training_ is performed with the best algorithm from the previous experiment, and its accuracy and _confusion matrix_ are then reported. The model produced by _full-training_ is then saved to an external _file_.

### Perform _full-training_ with the best algorithm

In [10]:
mlp_classifier = mlp.fit(training_features, training_targets)
pred = mlp_classifier.predict(training_features)
print("Accuracy: %f" % accuracy_score(training_targets, pred))
print("Confusion matrix:")
print(confusion_matrix(training_targets, pred))

Mean performance: 0.833328 (+/- 0.011881)


### Save the _trained_ model to an external _file_

In [None]:
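# Note: sklearn.externals.joblib was removed in scikit-learn 0.23;
# newer releases use `import joblib` instead.
from sklearn.externals import joblib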

joblib.dump(mlp_classifier, "mlp_model.pkl")

## Evaluation of the _trained_ model

The saved model is read back, then used and evaluated against the test data.

### Read the _trained_ model from an external _file_
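
The save-then-reload round trip can be illustrated without scikit-learn. The sketch below swaps in the standard-library `pickle` module in place of `joblib`, and uses a plain dictionary as a stand-in for the trained classifier; both choices, along with the file name `model.pkl`, are assumptions made to keep the example self-contained.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model (hypothetical parameters, not real weights).
model = {'weights': [0.5, -1.25], 'bias': 0.1}

# Write the object to a file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'model.pkl')
with open(path, 'wb') as f:
    pickle.dump(model, f)

# Read it back as a new, equal object.
with open(path, 'rb') as f:
    restored = pickle.load(f)
```

The restored object is a separate copy with the same contents, which is what allows evaluation to run in a later session without retraining. As with any pickle-based format, files should only be loaded from trusted sources.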

### Multi Layer Perceptron ANN

In [14]:
from sklearn.neural_network import MLPClassifier
cv_mlp = MLPClassifier(solver='sgd', hidden_layer_sizes=(8, 8), max_iter=2500, random_state=1)
scores = cross_val_score(cv_mlp, training_features, training_targets, cv=10)
print("Mean performance: %f (+/- %f)" % (scores.mean(), scores.std() * 2))

Mean performance: 0.854642 (+/- 0.011274)


### Gaussian Naive Bayes

In [12]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
scores = cross_val_score(gnb, training_features, training_targets, cv=10)
print("Mean performance: %f (+/- %f)" % (scores.mean(), scores.std() * 2))

Mean performance: 0.547865 (+/- 0.029052)


In [15]:
from sklearn import tree
dtr = tree.DecisionTreeClassifier()
scores = cross_val_score(dtr, training_features, training_targets, cv=10)
print("Mean performance: %f (+/- %f)" % (scores.mean(), scores.std() * 2))

Mean performance: 0.817666 (+/- 0.016075)


In [None]:
mlp_classifier = joblib.load("mlp_model.pkl")

### Classify the test data with the loaded model

Because there are tens of thousands of test records, the output shown is the performance metrics, namely accuracy and the _confusion matrix_.
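
To make the two metrics concrete, here is a small pure-Python sketch on five made-up predictions. The helper names `accuracy` and `confusion_counts` are hypothetical, and the label strings `'<=50K'` and `'>50K'` are assumed for illustration. Rows correspond to the actual class and columns to the predicted class, the same layout that scikit-learn's `confusion_matrix` uses.

```python
def accuracy(actual, predicted):
    """Fraction of records whose predicted label equals the actual label."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

def confusion_counts(actual, predicted, labels):
    """Count records per (actual, predicted) pair; rows are actual classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

labels = ['<=50K', '>50K']  # assumed label strings
actual = ['<=50K', '<=50K', '>50K', '>50K', '<=50K']
predicted = ['<=50K', '>50K', '>50K', '<=50K', '<=50K']

acc = accuracy(actual, predicted)
matrix = confusion_counts(actual, predicted, labels)
```

The diagonal of `matrix` holds the correctly classified records, so accuracy equals the sum of the diagonal divided by the total number of records.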

In [None]:
pred = mlp_classifier.predict(test_features)
print("Accuracy: %f" % accuracy_score(test_targets, pred))
print("Confusion matrix:")
print(confusion_matrix(test_targets, pred))