# Data Preprocessing and Model Building for Autism Dataset

This notebook covers the following steps:

1. Load the data and split it into training, validation, and test sets.
2. Handle missing values in the data.
3. Handle outliers in the age column.
4. Scale the numerical data.
5. Encode the categorical data.
6. Select the features most relevant to the prediction.
7. Train a predictive model on the selected features.

## Step 1: Data Loading and Splitting

In this step, the data is loaded from a CSV file and split into three parts. The data is first split 80/20 into a training portion and a test set; the training portion is then split 80/20 again, giving a training set (64% of all rows), a validation set (16%), and a test set (20%). The training set is used to fit the model, the validation set to evaluate and tune it, and the test set to measure final performance.

In [12]:

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset
data_file_path = 'Autism_combined_data.csv'
data = pd.read_csv(data_file_path)

# Display the first few rows of the dataset
data.head()


Unnamed: 0.1,Unnamed: 0,id,A1_Score,A2_Score,A3_Score,A4_Score,A5_Score,A6_Score,A7_Score,A8_Score,...,gender,ethnicity,jundice,austim,contry_of_res,used_app_before,result,age_desc,relation,Class/ASD
0,0,1,0,0,0,1,1,1,1,1,...,m,Hispanic,yes,yes,Austria,no,6,12-16 years,Parent,NO
1,1,2,0,0,0,0,0,0,0,0,...,m,Black,no,no,Austria,no,2,12-16 years,Relative,NO
2,2,3,0,0,0,0,0,0,0,0,...,f,?,no,no,AmericanSamoa,no,2,12-16 years,?,NO
3,3,4,0,1,1,1,1,1,0,1,...,f,White-European,no,no,United Kingdom,no,7,12-16 years,Self,YES
4,4,5,1,1,1,1,1,1,1,0,...,f,?,no,no,Albania,no,7,12-16 years,?,YES


In [13]:
# Splitting the data into features and target
X = data.drop(columns=['Class/ASD', 'Unnamed: 0', 'id'])
y = data['Class/ASD']

# Show the shape of the dataset
X.shape, y.shape

((1100, 20), (1100,))

In [14]:
# Splitting into training (80%) and test set (20%)
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Show the shape of the training and testing sets
X_train_full.shape, X_test.shape

((880, 20), (220, 20))

In [15]:
# Further split the 80% training portion into 80% training and 20% validation
X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full, test_size=0.2, random_state=42, stratify=y_train_full)

# Show the shape of the training, validation, and testing sets
X_train.shape, X_val.shape, X_test.shape

((704, 20), (176, 20), (220, 20))
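
The two-stage split can be wrapped in a small helper to make the resulting proportions explicit. This is a minimal sketch: `three_way_split` is a hypothetical helper name, and the data is a synthetic stand-in with the same row count as the notebook's dataset, not the autism data itself.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def three_way_split(X, y, test_size=0.2, val_size=0.2, seed=42):
    """Split into train/validation/test; val_size is a fraction of the non-test part."""
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=val_size, random_state=seed, stratify=y_rest
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


# Synthetic stand-in (assumption): 1100 rows, one feature, a binary label
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"f": rng.normal(size=1100)})
y_demo = pd.Series(rng.choice(["NO", "YES"], size=1100, p=[0.64, 0.36]))

X_tr, X_va, X_te, y_tr, y_va, y_te = three_way_split(X_demo, y_demo)
sizes = (len(X_tr), len(X_va), len(X_te))
fractions = tuple(round(n / len(X_demo), 2) for n in sizes)
print(sizes, fractions)
```

With 1100 rows this gives 704, 176, and 220 rows, i.e. 64%, 16%, and 20% of the data, matching the shapes shown above.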

## Step 2: Handling Missing Values

In this step, missing values marked with '?' are replaced with a missing-value marker. Numerical columns are then filled with the median and categorical columns with the mode, both computed on the training set and reused for the validation and test sets. This ensures the data has no missing values before the model is trained.

In [19]:

# Replace the '?' placeholder with a missing-value marker (pd.NA); imputation happens in the following cells
X_train_cleaned = X_train.replace('?', pd.NA)
X_val_cleaned = X_val.replace('?', pd.NA)
X_test_cleaned = X_test.replace('?', pd.NA)

X_train_cleaned, X_val_cleaned, X_test_cleaned

(     A1_Score  A2_Score  A3_Score  A4_Score  A5_Score  A6_Score  A7_Score  \
 273         1         1         1         1         1         0         0   
 781         1         0         0         0         0         0         1   
 84          1         0         0         1         1         1         0   
 137         1         1         1         1         1         1         1   
 551         1         0         0         0         0         0         0   
 ..        ...       ...       ...       ...       ...       ...       ...   
 937         1         0         1         1         0         1         1   
 712         1         0         0         0         0         0         0   
 25          1         0         1         1         1         1         0   
 503         1         0         0         0         0         0         0   
 839         0         0         1         0         0         1         0   
 
      A8_Score  A9_Score  A10_Score age gender       ethnicity

In [21]:
# Show the number of missing values in each column after the replacement
missing_values_train = X_train_cleaned.isnull().sum()
missing_values_val = X_val_cleaned.isnull().sum()
missing_values_test = X_test_cleaned.isnull().sum()

missing_values_train, missing_values_val, missing_values_test

(A1_Score            0
 A2_Score            0
 A3_Score            0
 A4_Score            0
 A5_Score            0
 A6_Score            0
 A7_Score            0
 A8_Score            0
 A9_Score            0
 A10_Score           0
 age                 1
 gender              0
 ethnicity          90
 jundice             0
 austim              0
 contry_of_res       0
 used_app_before     0
 result              0
 age_desc            0
 relation           90
 dtype: int64,
 A1_Score            0
 A2_Score            0
 A3_Score            0
 A4_Score            0
 A5_Score            0
 A6_Score            0
 A7_Score            0
 A8_Score            0
 A9_Score            0
 A10_Score           0
 age                 2
 gender              0
 ethnicity          27
 jundice             0
 austim              0
 contry_of_res       0
 used_app_before     0
 result              0
 age_desc            0
 relation           27
 dtype: int64,
 A1_Score            0
 A2_Score            0
 A3_

In [18]:
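# Impute numerical columns with the training-set median.
# Note: 'age' is only picked up here if it is already numeric; a column that still
# held '?' strings is read as object dtype until pd.to_numeric is applied (see Step 3).
numerical_columns = X_train_cleaned.select_dtypes(include='number').columns
for col in numerical_columns:
    median_value = X_train_cleaned[col].median()
    # Assign back instead of chained fillna(inplace=True), which may not update the frame in recent pandas
    X_train_cleaned[col] = X_train_cleaned[col].fillna(median_value)
    X_val_cleaned[col] = X_val_cleaned[col].fillna(median_value)
    X_test_cleaned[col] = X_test_cleaned[col].fillna(median_value)

numerical_columns
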

Index(['A1_Score', 'A2_Score', 'A3_Score', 'A4_Score', 'A5_Score', 'A6_Score',
       'A7_Score', 'A8_Score', 'A9_Score', 'A10_Score', 'age', 'result'],
      dtype='object')

In [17]:
# Impute categorical columns with the training-set mode
categorical_columns = X_train_cleaned.select_dtypes(include='object').columns
for col in categorical_columns:
    mode_value = X_train_cleaned[col].mode()[0]
    # Assign back instead of chained fillna(inplace=True), which may not update the frame in recent pandas
    X_train_cleaned[col] = X_train_cleaned[col].fillna(mode_value)
    X_val_cleaned[col] = X_val_cleaned[col].fillna(mode_value)
    X_test_cleaned[col] = X_test_cleaned[col].fillna(mode_value)

missing_values_train = X_train_cleaned.isnull().sum().sum()
missing_values_val = X_val_cleaned.isnull().sum().sum()
missing_values_test = X_test_cleaned.isnull().sum().sum()

missing_values_train, missing_values_val, missing_values_test

(0, 0, 0)
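
The same fit-on-train, apply-everywhere pattern can also be expressed with scikit-learn's `SimpleImputer`. The sketch below is an alternative to the loops above, not the notebook's method; the tiny frames are made up for illustration and use `np.nan` as the missing-value marker.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (assumption): one numerical and one categorical column with gaps
train = pd.DataFrame({
    "age": [20.0, np.nan, 30.0, 40.0],
    "ethnicity": ["Asian", "Asian", np.nan, "Black"],
})
val = pd.DataFrame({"age": [np.nan, 25.0], "ethnicity": [np.nan, "Black"]})

num_cols, cat_cols = ["age"], ["ethnicity"]
num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")

train_filled, val_filled = train.copy(), val.copy()

# Statistics are learned from the training frame only ...
train_filled[num_cols] = num_imputer.fit_transform(train[num_cols])
train_filled[cat_cols] = cat_imputer.fit_transform(train[cat_cols])

# ... and reused unchanged on the validation frame
val_filled[num_cols] = num_imputer.transform(val[num_cols])
val_filled[cat_cols] = cat_imputer.transform(val[cat_cols])

print(val_filled)
```

Here the missing validation age is filled with the training median (30.0) and the missing ethnicity with the training mode ("Asian"), so no information from the validation set leaks into the imputation.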

## Step 3: Handling Outliers for Age

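This step handles outliers in the 'age' column with the interquartile range (IQR) method. Values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR, with the quartiles computed on the training set, are replaced by the nearest bound so that extreme ages do not distort the model. Note that the `describe()` counts in the last output of this step (703, 174, 217) are lower than the split sizes (704, 176, 220), so a few 'age' values are still missing at this point; this suggests the imputation cells were executed before the '?' replacement cell.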

In [24]:
age_column_name = 'age'  

# Convert 'age' to numeric
X_train_cleaned[age_column_name] = pd.to_numeric(X_train_cleaned[age_column_name], errors='coerce')
X_val_cleaned[age_column_name] = pd.to_numeric(X_val_cleaned[age_column_name], errors='coerce')
X_test_cleaned[age_column_name] = pd.to_numeric(X_test_cleaned[age_column_name], errors='coerce')

X_train_cleaned[age_column_name], X_val_cleaned[age_column_name], X_test_cleaned[age_column_name]

(273    36.0
 781    18.0
 84     14.0
 137    18.0
 551    29.0
        ... 
 937     4.0
 712    33.0
 25     12.0
 503    19.0
 839     6.0
 Name: age, Length: 704, dtype: float64,
 653    44.0
 358    36.0
 844     4.0
 876     4.0
 782    43.0
        ... 
 960     7.0
 579    27.0
 410    28.0
 726    22.0
 73     12.0
 Name: age, Length: 176, dtype: float64,
 821      4.0
 750     38.0
 988      4.0
 1030     4.0
 446     21.0
         ... 
 624     26.0
 270     26.0
 348     30.0
 649     35.0
 237     24.0
 Name: age, Length: 220, dtype: float64)

In [26]:
# Define bounds for outliers
Q1 = X_train_cleaned[age_column_name].quantile(0.25)
Q3 = X_train_cleaned[age_column_name].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

lower_bound, upper_bound

(-17.5, 58.5)

In [27]:
# Cap outliers at the IQR bounds (clip leaves missing values untouched)
X_train_cleaned[age_column_name] = X_train_cleaned[age_column_name].clip(lower=lower_bound, upper=upper_bound)
X_val_cleaned[age_column_name] = X_val_cleaned[age_column_name].clip(lower=lower_bound, upper=upper_bound)
X_test_cleaned[age_column_name] = X_test_cleaned[age_column_name].clip(lower=lower_bound, upper=upper_bound)

X_train_cleaned[age_column_name].describe(), X_val_cleaned[age_column_name].describe(), X_test_cleaned[age_column_name].describe()

(count    703.000000
 mean      22.079659
 std       12.866122
 min        4.000000
 25%       11.000000
 50%       21.000000
 75%       30.000000
 max       58.500000
 Name: age, dtype: float64,
 count    174.000000
 mean      20.681034
 std       12.695534
 min        4.000000
 25%       11.000000
 50%       19.000000
 75%       28.000000
 max       58.500000
 Name: age, dtype: float64,
 count    217.000000
 mean      21.615207
 std       13.095027
 min        4.000000
 25%       11.000000
 50%       21.000000
 75%       29.000000
 max       58.500000
 Name: age, dtype: float64)
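
The bounds reported above can be reproduced by hand from the training quartiles in the `describe()` output (25% = 11, 75% = 30). The sketch below is illustrative: `iqr_bounds` is a hypothetical helper, and the `ages` series is made up to show how capping behaves.

```python
import pandas as pd


def iqr_bounds(q1, q3, k=1.5):
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Training quartiles taken from the describe() output above
lower, upper = iqr_bounds(11.0, 30.0)
print(lower, upper)  # -17.5 58.5

# Made-up ages (assumption), including two extreme values and one missing value
ages = pd.Series([4.0, 21.0, 58.0, 64.0, 383.0, float("nan")])
capped = ages.clip(lower=lower, upper=upper)
print(capped.tolist())
```

Because the lower fence is negative, it can never trigger for an age; in practice only the upper fence of 58.5 caps anything, which is why every `max` in the output above equals 58.5.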

## Step 4: Scaling Data

In this step, the 'age' column is rescaled with min-max scaling so that the training values lie in the range 0 to 1. The scaler is fit on the training set only and then applied to the validation and test sets. Scaling puts 'age' on a scale comparable to the other features, which matters for some machine learning algorithms.

In [31]:

from sklearn.preprocessing import MinMaxScaler

# Scaling 'age' column using Min-Max Scaling
scaler = MinMaxScaler()
X_train_scaled = X_train_cleaned.copy()
X_train_scaled['age'] = scaler.fit_transform(X_train_cleaned[['age']])

X_val_scaled = X_val_cleaned.copy()
X_val_scaled['age'] = scaler.transform(X_val_cleaned[['age']])
X_test_scaled = X_test_cleaned.copy()
X_test_scaled['age'] = scaler.transform(X_test_cleaned[['age']])

# Display the scaled training column
X_train_scaled['age']


273    0.587156
781    0.256881
84     0.183486
137    0.256881
551    0.458716
         ...   
937    0.000000
712    0.532110
25     0.146789
503    0.275229
839    0.036697
Name: age, Length: 704, dtype: float64
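
Min-max scaling maps each value to (x - min) / (max - min). As a quick check, the sketch below recomputes a few of the scaled values above by hand, assuming the training minimum of 4 and the capped maximum of 58.5 from the Step 3 output, and compares them with `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Training-set extremes of 'age' after capping (from the describe() output in Step 3)
age_min, age_max = 4.0, 58.5


def min_max(x):
    """Scale x into [0, 1] relative to the training minimum and maximum."""
    return (x - age_min) / (age_max - age_min)


ages = np.array([36.0, 18.0, 14.0, 4.0])  # raw ages of rows 273, 781, 84, 937
manual = min_max(ages)
print(np.round(manual, 6))

# A scaler fit on the two extremes learns the same min and max
scaler_demo = MinMaxScaler().fit(np.array([[age_min], [age_max]]))
from_sklearn = scaler_demo.transform(ages.reshape(-1, 1)).ravel()
print(np.allclose(manual, from_sklearn))
```

For example, age 36 becomes (36 - 4) / 54.5 ≈ 0.587156, which matches the first value in the output above.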



## Step 5: Encoding Categorical Variables

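Categorical features such as 'gender', 'ethnicity', and 'relation' must be converted to numeric form. The binary columns ('gender', 'jundice', 'austim', 'used_app_before') are label-encoded, while the multi-category columns ('ethnicity', 'contry_of_res', 'relation') are one-hot encoded. The label encoders are fit on the training set, and the one-hot encoded validation and test frames are reindexed to the training columns so that all three share the same layout.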

In [32]:
from sklearn.preprocessing import LabelEncoder

# Encoding categorical features
label_encoder_cols = ['gender', 'jundice', 'austim', 'used_app_before']
label_encoders = {}
for col in label_encoder_cols:
    le = LabelEncoder()
    X_train_scaled[col] = le.fit_transform(X_train_scaled[col])
    X_val_scaled[col] = le.transform(X_val_scaled[col])
    X_test_scaled[col] = le.transform(X_test_scaled[col])
    label_encoders[col] = le


In [35]:
# One-Hot Encoding for other categorical columns
X_train_encoded = pd.get_dummies(X_train_scaled, columns=['ethnicity', 'contry_of_res', 'relation'], drop_first=True)
X_val_encoded = pd.get_dummies(X_val_scaled, columns=['ethnicity', 'contry_of_res', 'relation'], drop_first=True)
X_test_encoded = pd.get_dummies(X_test_scaled, columns=['ethnicity', 'contry_of_res', 'relation'], drop_first=True)

X_val_encoded = X_val_encoded.reindex(columns=X_train_encoded.columns, fill_value=0)
X_test_encoded = X_test_encoded.reindex(columns=X_train_encoded.columns, fill_value=0)

X_train_encoded

Unnamed: 0,A1_Score,A2_Score,A3_Score,A4_Score,A5_Score,A6_Score,A7_Score,A8_Score,A9_Score,A10_Score,...,contry_of_res_U.S. Outlying Islands,contry_of_res_Ukraine,contry_of_res_United Arab Emirates,contry_of_res_United Kingdom,contry_of_res_United States,contry_of_res_Viet Nam,relation_Others,relation_Parent,relation_Relative,relation_Self
273,1,1,1,1,1,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,1
781,1,0,0,0,0,0,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
84,1,0,0,1,1,1,0,0,1,1,...,0,0,0,0,0,0,0,1,0,0
137,1,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,1
551,1,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
937,1,0,1,1,0,1,1,1,0,1,...,0,0,0,0,1,0,0,1,0,0
712,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
25,1,0,1,1,1,1,0,1,1,1,...,0,0,0,1,0,0,0,0,0,0
503,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Step 6: Feature Selection

In this step, the features most strongly associated with the target are kept for modeling. This reduces model complexity and limits the input to relevant features. The list in the cell below is hard-coded; the correlation and chi-square scores behind it are not shown in this notebook.

In [37]:

# Selecting features based on correlation and chi-square
selected_features = ['A9_Score', 'A6_Score', 'A4_Score', 'A5_Score', 'A3_Score', 'result', 'austim']
X_train_selected = X_train_encoded[selected_features]
X_val_selected = X_val_encoded[selected_features]
X_test_selected = X_test_encoded[selected_features]

X_train_selected, X_val_selected, X_test_selected


(     A9_Score  A6_Score  A4_Score  A5_Score  A3_Score  result  austim
 273         1         0         1         1         1       8       1
 781         0         0         0         0         0       3       0
 84          1         1         1         1         0       6       0
 137         1         1         1         1         1      10       0
 551         0         0         0         0         0       2       0
 ..        ...       ...       ...       ...       ...     ...     ...
 937         0         1         1         0         1       7       0
 712         0         0         0         0         0       1       0
 25          1         1         1         1         1       8       0
 503         0         0         0         0         0       1       0
 839         1         1         0         0         1       4       0
 
 [704 rows x 7 columns],
      A9_Score  A6_Score  A4_Score  A5_Score  A3_Score  result  austim
 653         1         1         1         1      
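
Since the selected feature list is written out by hand, the following sketch shows one way a chi-square ranking could be computed with `SelectKBest`. It runs on synthetic binary features (an assumption, with made-up column names), so it illustrates the technique rather than reproducing the selection above.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(42)
n = 400
y_demo = rng.integers(0, 2, size=n)

# Synthetic non-negative features (assumption): one strongly related to the label,
# one weakly related, and one pure noise
X_demo = pd.DataFrame({
    "informative": np.where(rng.random(n) < 0.9, y_demo, 1 - y_demo),
    "weak": np.where(rng.random(n) < 0.6, y_demo, 1 - y_demo),
    "noise": rng.integers(0, 2, size=n),
})

# Fit on training data only; chi2 requires non-negative feature values
selector = SelectKBest(score_func=chi2, k=2).fit(X_demo, y_demo)
scores = pd.Series(selector.scores_, index=X_demo.columns).sort_values(ascending=False)
chosen = list(X_demo.columns[selector.get_support()])
print(scores)
print(chosen)
```

On the notebook's data, the selector would be fit on `X_train_encoded` and `y_train` and then applied to the validation and test frames with `transform`, so that the selection never sees held-out labels.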

## Step 7: Model Training (Random Forest)

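A first trial: training a random forest model on the selected features. The perfect scores below should be read with caution. In the rows shown above, `result` equals the sum of the ten `A*_Score` columns, and if `Class/ASD` is derived from that total (as the `data.head()` rows suggest), the model is likely recovering a labeling rule rather than learning a pattern that generalizes.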

In [38]:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Train Random Forest model
model = RandomForestClassifier(random_state=42, n_estimators=100)
model.fit(X_train_selected, y_train)

# Predictions
y_val_pred = model.predict(X_val_selected)
y_test_pred = model.predict(X_test_selected)

# Evaluation
val_accuracy = accuracy_score(y_val, y_val_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

print("Validation accuracy:", val_accuracy)
print(classification_report(y_val, y_val_pred))
print("Test accuracy:", test_accuracy)
print(classification_report(y_test, y_test_pred))


Validation accuracy: 1.0
              precision    recall  f1-score   support

          NO       1.00      1.00      1.00       113
         YES       1.00      1.00      1.00        63

    accuracy                           1.00       176
   macro avg       1.00      1.00      1.00       176
weighted avg       1.00      1.00      1.00       176

Test accuracy: 1.0
              precision    recall  f1-score   support

          NO       1.00      1.00      1.00       141
         YES       1.00      1.00      1.00        79

    accuracy                           1.00       220
   macro avg       1.00      1.00      1.00       220
weighted avg       1.00      1.00      1.00       220
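
An accuracy of 1.0 on both held-out sets is worth a sanity check for target leakage. The sketch below uses only rows visible in the outputs above: the A1 to A10 scores and `result` of five training rows, and the `result` and `Class/ASD` values from `data.head()`. The helper `best_threshold_accuracy` is hypothetical, and five rows are far too few to prove anything about the full dataset.

```python
import numpy as np
import pandas as pd

# A1..A10 scores and 'result' for training rows 273, 781, 84, 137, 551 (copied from the outputs above)
idx = [273, 781, 84, 137, 551]
a_scores = pd.DataFrame(
    [[1, 1, 1, 1, 1, 0, 0, 1, 1, 1],
     [1, 0, 0, 0, 0, 0, 1, 0, 0, 1],
     [1, 0, 0, 1, 1, 1, 0, 0, 1, 1],
     [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
     [1, 0, 0, 0, 0, 0, 0, 1, 0, 0]],
    index=idx,
    columns=[f"A{i}_Score" for i in range(1, 11)],
)
result = pd.Series([8, 3, 6, 10, 2], index=idx)

# Check 1: is 'result' just the row sum of the screening answers?
print((a_scores.sum(axis=1) == result).all())

# (result, Class/ASD) pairs from the data.head() output
head_rows = pd.DataFrame({"result": [6, 2, 2, 7, 7],
                          "Class/ASD": ["NO", "NO", "NO", "YES", "YES"]})


def best_threshold_accuracy(scores, labels, positive="YES", negative="NO"):
    """Return (threshold, accuracy) of the best rule `score >= threshold -> positive`."""
    best = (None, 0.0)
    for t in sorted(scores.unique()):
        pred = np.where(scores >= t, positive, negative)
        acc = float((pred == labels.to_numpy()).mean())
        if acc > best[1]:
            best = (t, acc)
    return best


# Check 2: does a single cut-off on 'result' reproduce the labels?
threshold, accuracy = best_threshold_accuracy(head_rows["result"], head_rows["Class/ASD"])
print(threshold, accuracy)
```

If the same two checks hold on the full training set, one option is to retrain without `result` (and possibly without the individual score columns) to see how much signal the remaining features carry.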