# Dissimilarity (Assignment 1)

* Find a dataset from a URL
* Determine the dissimilarity between binary variables
* Measure the distance d(1,2)
* Measure the distance d(1,3)
* Measure the distance d(1,4)

# Binary Attributes
A binary attribute is a nominal attribute with only two categories or states: 0 or 1, where 0 typically means the attribute is absent and 1 means it is present. Binary attributes are referred to as Boolean if the two states correspond to true and false.

**Example of binary attributes:**
Given an attribute smoker describing a patient object, 1 indicates that the patient smokes, while 0 indicates that the patient does not. Similarly, suppose the patient undergoes a medical test that has two possible outcomes. The attribute medical test is binary, where a value of 1 means the test result for the patient is positive, while 0 means the result is negative.

A binary attribute is ***symmetric*** if both of its states are equally valuable and carry the same weight; that is, there is no preference as to which outcome should be coded as 0 or 1. One example is the attribute gender, with the states male and female.

A binary attribute is ***asymmetric*** if the outcomes of the states are not equally important, such as the positive and negative outcomes of a medical test for HIV. By convention, we code the most important outcome, which is usually the rarest one, as 1 (e.g., HIV positive) and the other as 0 (e.g., HIV negative).
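
The distinction between symmetric and asymmetric attributes changes how a distance is computed. The sketch below is not part of the original notebook; it assumes the usual convention that a symmetric measure divides the number of mismatches by all attributes, while an asymmetric measure ignores attributes where both objects are 0. The function name and the patient vectors are hypothetical.

```python
def binary_dissimilarity(first, second, asymmetric=False):
    """Dissimilarity between two equal-length sequences of 0/1 values."""
    pairs = list(zip(first, second))
    mismatches = sum(1 for a, b in pairs if a != b)
    positive_matches = sum(1 for a, b in pairs if a == 1 and b == 1)
    # Asymmetric: negative matches (0, 0) are treated as uninformative and dropped
    denominator = mismatches + positive_matches if asymmetric else len(pairs)
    if denominator == 0:
        return 0.0  # assumption: objects with nothing to compare count as identical
    return mismatches / denominator


# Hypothetical test results for two patients (1 = positive, 0 = negative)
patient_a = [1, 0, 1, 0, 0, 0]
patient_b = [1, 0, 0, 0, 0, 0]

symmetric_d = binary_dissimilarity(patient_a, patient_b)
asymmetric_d = binary_dissimilarity(patient_a, patient_b, asymmetric=True)
print(symmetric_d, asymmetric_d)
```

With many shared negatives, the symmetric value stays small while the asymmetric value is noticeably larger, which is why the choice matters for rare positive outcomes.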

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
%cd /content/drive/MyDrive/datamining/tugas/

/content/drive/MyDrive/datamining/tugas


In [None]:
import pandas as pd

In [None]:
import numpy as np

In [None]:
df = pd.read_csv("MedicalCostPersonal.csv", sep=",", encoding="ISO-8859-1", header=0)
df.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


In [None]:
# Store the number of columns in the dataset
number_of_columns = df.shape[1]

In [None]:
# Raise the display limits so that every column is shown
pd.set_option('display.max_columns', number_of_columns)
pd.set_option('display.max_rows', number_of_columns)

In [None]:
# Show all columns from dataframe
df.columns

Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'], dtype='object')

# Categorical/Nominal Features
*   sex
*   smoker

In [None]:
df[['age', 'sex','smoker']].head(5)

Unnamed: 0,age,sex,smoker
0,19,female,yes
1,18,male,no
2,28,male,no
3,33,male,no
4,32,male,no


# Change Values to 1/0

Take all values of the 'sex' series:

* If the value is male, change it to 1

* If the value is female, change it to 0




In [None]:
# sex code
code_sex_for_male = "male"
code_sex_for_female = "female"

# binary value
value_of_one = 1
value_of_zero = 0

def change_code_sex_to_biner(sex):
    return value_of_one if sex == code_sex_for_male else value_of_zero

In [None]:
# Update all values of 'sex' series
df["sex"] = df["sex"].apply(change_code_sex_to_biner)

Take all values of the 'smoker' series:

* If the value is yes, change it to 1

* If the value is no, change it to 0

In [None]:
# smoker code
code_smoker_for_no = "no"
code_smoker_for_yes = "yes"

# binary value
value_of_one = 1
value_of_zero = 0

def change_code_smoker_to_biner(smoker):
    return value_of_one if smoker == code_smoker_for_yes else value_of_zero

In [None]:
# Update all values of 'smoker' series
df["smoker"] = df["smoker"].apply(change_code_smoker_to_biner)
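
As an alternative to the two `apply` helpers, the same encoding can be sketched with `Series.map` and a lookup dictionary. This is only an illustration: the `sample` frame, `SEX_CODES`, and `SMOKER_CODES` are hypothetical names, and the codes mirror the code cells above (male = 1, yes = 1).

```python
import pandas as pd

# Hypothetical sample mirroring the first rows of the dataset
sample = pd.DataFrame({
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
})

# Lookup tables: label -> binary code
SEX_CODES = {"male": 1, "female": 0}
SMOKER_CODES = {"yes": 1, "no": 0}

encoded = sample.assign(
    sex=sample["sex"].map(SEX_CODES),
    smoker=sample["smoker"].map(SMOKER_CODES),
)

# map() yields NaN for labels missing from the dictionary, so check for them
unexpected = encoded[["sex", "smoker"]].isna().any().any()
print(encoded)
```

One design difference: the `apply` helpers silently turn any unknown label into 0, whereas `map` leaves it as NaN, which makes dirty data easier to spot.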

In [None]:
df[['age', 'sex','smoker']].head(5)

Unnamed: 0,age,sex,smoker
0,19,0,1
1,18,1,0
2,28,1,0
3,33,1,0
4,32,1,0


In [None]:
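# Constants
DECREMENT_BY_ONE = 1
INCREMENT_BY_ONE = 1

# Contingency-table cells as (value for first object, value for second object)
CONTINGENCY_TABLE_VALUE = {
    "q" : (1,1),  # attribute is 1 for both objects
    "r" : (1,0),  # 1 for the first object, 0 for the second
    "s" : (0,1),  # 0 for the first object, 1 for the second
    "t" : (0,0),  # attribute is 0 for both objects
}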

In [None]:
def get_series(df, idx, series):
    return df.loc[(idx), series]

In [None]:
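def get_dissimilarity_dataset(df, series_index, series):
    """Return the two rows listed in series_index, restricted to the columns in series."""
    first_series = get_series(df, series_index[0], series)
    second_series = get_series(df, series_index[1], series)
    # concat puts each row in its own column, so transpose back to one row per object
    dataset = pd.concat([first_series,second_series],axis=1)
    return dataset.T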

In [None]:
get_dissimilarity_dataset(df, [1,2], ['sex','smoker']).T

Unnamed: 0,1,2
sex,1,1
smoker,0,0


In [None]:
df.loc[0:4, ['sex','smoker']]

Unnamed: 0,sex,smoker
0,0,1
1,1,0
2,1,0
3,1,0
4,1,0


In [None]:
def count_contingency_value(df, start_index = 0, last_index = 1):
    """Count q, r, s, t over the attribute columns of a two-row DataFrame."""

    CONTINGENCY_VALUE = {
        "q" : 0,
        "r" : 0,
        "s" : 0,
        "t" : 0,
    }

    for column in df.columns:
        # Pair of values (first object, second object) for this attribute
        pair = tuple(df.loc[start_index:last_index, column])
        for label, pattern in CONTINGENCY_TABLE_VALUE.items():
            if pair == pattern:
                CONTINGENCY_VALUE[label] += 1

    return CONTINGENCY_VALUE
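
The counting step can also be sketched without pandas indexing, which may make the logic easier to test in isolation. The helper below is a hypothetical alternative, not the notebook's function; it assumes each object is passed as a plain sequence of 0/1 values and reuses the same q/r/s/t convention.

```python
from collections import Counter

# Same pair-to-label convention as CONTINGENCY_TABLE_VALUE in the notebook
PAIR_TO_LABEL = {(1, 1): "q", (1, 0): "r", (0, 1): "s", (0, 0): "t"}


def count_contingency_pairs(first, second):
    """Count q, r, s, t for two equal-length sequences of 0/1 values."""
    if len(first) != len(second):
        raise ValueError("both objects need the same number of attributes")
    counts = Counter(
        PAIR_TO_LABEL[(int(a), int(b))] for a, b in zip(first, second)
    )
    # Always report all four cells, including those that never occurred
    return {label: counts.get(label, 0) for label in "qrst"}


# Rows 1 and 2 of the encoded data: (sex, smoker) = (1, 0) for both
example = count_contingency_pairs([1, 0], [1, 0])
print(example)
```

A value other than 0 or 1 raises a `KeyError` here, which surfaces encoding mistakes instead of silently skipping the attribute.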

In [None]:
# d(1,2)
df_1_2 = get_dissimilarity_dataset(df, [1,2], ['sex','smoker'])

In [None]:
c_d_1_2 = count_contingency_value(df_1_2, 1, 2)

In [None]:
# d(1,3)
df_1_3 = get_dissimilarity_dataset(df, [1,3], ['sex','smoker'])

In [None]:
c_d_1_3 = count_contingency_value(df_1_3, 1, 3)

In [None]:
# d(1,4)
df_1_4 = get_dissimilarity_dataset(df, [1,4], ['sex','smoker'])

In [None]:
c_d_1_4 = count_contingency_value(df_1_4, 1, 4)

# Asymmetric Binary Dissimilarity Formula

$$
d(i,j) = \frac{r + s}{q + r + s}
$$
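
As a worked check against the encoded rows shown earlier: objects 1 and 2 both have sex = 1 and smoker = 0, so q = 1, r = 0, s = 0, t = 1. For contrast (this pair is not part of the assignment), object 0 has sex = 0 and smoker = 1, so comparing it with object 1 gives q = 0, r = 1, s = 1, t = 0.

$$
d(1,2) = \frac{0 + 0}{1 + 0 + 0} = 0, \qquad d(0,1) = \frac{1 + 1}{0 + 1 + 1} = 1
$$

The count t never enters the formula, which is what makes the measure asymmetric.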

In [None]:
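def measure_dissimilarity_binary_value_assymetric_distance(contingency_value):
    """Asymmetric binary dissimilarity (r + s) / (q + r + s); the count t is ignored."""
    # Raises ZeroDivisionError when q, r and s are all zero
    return (contingency_value["r"] + contingency_value["s"]) / (contingency_value["q"] + contingency_value["r"] + contingency_value["s"])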

In [None]:
d_1_2 = measure_dissimilarity_binary_value_assymetric_distance(c_d_1_2)
d_1_3 = measure_dissimilarity_binary_value_assymetric_distance(c_d_1_3)
d_1_4 = measure_dissimilarity_binary_value_assymetric_distance(c_d_1_4)

In [None]:
d_1_2

0.0

In [None]:
d_1_3

0.0

In [None]:
d_1_4

0.0