# Autism Detection using Machine Learning

### Exploring the Data

In [1]:
import pandas as pd

In [2]:
autism_data = pd.read_csv("data/train.csv")

In [3]:
autism_data.isnull().sum()

ID                 0
A1_Score           0
A2_Score           0
A3_Score           0
A4_Score           0
A5_Score           0
A6_Score           0
A7_Score           0
A8_Score           0
A9_Score           0
A10_Score          0
age                0
gender             0
ethnicity          0
jaundice           0
austim             0
contry_of_res      0
used_app_before    0
result             0
age_desc           0
relation           0
Class/ASD          0
dtype: int64

- There are no null values in the data. 

In [4]:
autism_data.columns

Index(['ID', 'A1_Score', 'A2_Score', 'A3_Score', 'A4_Score', 'A5_Score',
       'A6_Score', 'A7_Score', 'A8_Score', 'A9_Score', 'A10_Score', 'age',
       'gender', 'ethnicity', 'jaundice', 'austim', 'contry_of_res',
       'used_app_before', 'result', 'age_desc', 'relation', 'Class/ASD'],
      dtype='object')

In [5]:
autism_data["austim"].value_counts()

no     669
yes    131
Name: austim, dtype: int64

- The data is quite imbalanced. This may make it difficult for the model to predict the positive class and some balancing techniques may be necessary. 

### Cleaning the Data

In [6]:
cleaned_data = autism_data.drop(columns=["ID"]).rename(columns={'austim': 'family_autism'})
cleaned_data['age'] = cleaned_data['age'].round(0)
cleaned_data['result'] = cleaned_data['result'].round(0)

- The ID column does not give any predictive power to detecting autism so is removed.

In [7]:
aq_cols = [col for col in cleaned_data.columns if '_Score' in col or col == 'result']

In [8]:
autism_data[aq_cols].head(20)

Unnamed: 0,A1_Score,A2_Score,A3_Score,A4_Score,A5_Score,A6_Score,A7_Score,A8_Score,A9_Score,A10_Score,result
0,1,0,1,0,1,0,1,0,1,1,6.351166
1,0,0,0,0,0,0,0,0,0,0,2.255185
2,1,1,1,1,1,1,1,1,1,1,14.851484
3,0,0,0,0,0,0,0,0,0,0,2.276617
4,0,0,0,0,0,0,0,0,0,0,-4.777286
5,1,0,0,0,0,1,0,0,1,1,9.562117
6,1,0,0,0,0,0,1,1,1,0,7.984569
7,1,1,1,1,1,1,1,0,1,1,13.237898
8,1,1,1,1,0,0,0,1,1,1,-1.755774
9,0,0,0,0,0,0,0,1,0,1,14.92257


- The values in the 'result' variable appear to be extremely arbitrary, rather than based on the scores of the AQ1-10 screening test. For example, patient 1 scored all 0s in all of the tests and had a final result of 2.25, whereas patients 4 and 14 also scored all 0s and had scores of -4.78 and 9.80 respectively. Patients 2 and 19 also scored all 1s in all of the tests, but had different scores in the 'result' variable.

In [9]:
cleaned_data['family_autism'] = cleaned_data['family_autism'].apply(lambda x: 1 if x == "yes" else 0 )
cleaned_data['jaundice'] = cleaned_data['jaundice'].apply(lambda x: 1 if x == "yes" else 0 )

In [10]:
cleaned_data.drop(columns=['relation', 'age_desc','contry_of_res', 'result'], inplace=True)

- These columns are also unlikely to help predict autism so can be removed. 

In [11]:
cleaned_data.columns

Index(['A1_Score', 'A2_Score', 'A3_Score', 'A4_Score', 'A5_Score', 'A6_Score',
       'A7_Score', 'A8_Score', 'A9_Score', 'A10_Score', 'age', 'gender',
       'ethnicity', 'jaundice', 'family_autism', 'used_app_before',
       'Class/ASD'],
      dtype='object')

In [12]:
cleaned_data['gender'].value_counts()

m    530
f    270
Name: gender, dtype: int64

In [13]:
cleaned_data['used_app_before'] = cleaned_data['used_app_before'].apply(lambda x: 1 if x == "yes" else 0 )
cleaned_data['gender'] = cleaned_data['gender'].apply(lambda x: 1 if x == "f" else 0 )

In [14]:
cleaned_data['ethnicity'].value_counts()

White-European     257
?                  203
Middle Eastern      97
Asian               67
Black               47
South Asian         34
Pasifika            32
Others              29
Latino              17
Hispanic             9
Turkish              5
others               3
Name: ethnicity, dtype: int64

In [15]:
cleaned_data['ethnicity'] = cleaned_data['ethnicity'].apply(lambda x: 'Others' if x == '?' else 'Others' if x == 'others'  else x)

In [16]:
cleaned_data['ethnicity'].value_counts()

White-European     257
Others             235
Middle Eastern      97
Asian               67
Black               47
South Asian         34
Pasifika            32
Latino              17
Hispanic             9
Turkish              5
Name: ethnicity, dtype: int64

In [17]:
cat_cols = [feature for feature in cleaned_data.columns if cleaned_data[feature].dtypes == 'O']
num_cols = [feature for feature in cleaned_data.columns if feature not in cat_cols]

cat_cols, num_cols

(['ethnicity'],
 ['A1_Score',
  'A2_Score',
  'A3_Score',
  'A4_Score',
  'A5_Score',
  'A6_Score',
  'A7_Score',
  'A8_Score',
  'A9_Score',
  'A10_Score',
  'age',
  'gender',
  'jaundice',
  'family_autism',
  'used_app_before',
  'Class/ASD'])

In [18]:
cleaned_data = pd.get_dummies(cleaned_data, columns=['ethnicity'])

In [19]:
x = cleaned_data.drop(columns=['Class/ASD'])
y = cleaned_data['Class/ASD']

### Creating the Models

In [20]:
from sklearn.model_selection import train_test_split

In [21]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

In [22]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, recall_score

In [23]:
rf_classifier = RandomForestClassifier()
svm_classifier = SVC()
log_reg_classifier = LogisticRegression()
knn_classifier = KNeighborsClassifier()
xgb_classifier = XGBClassifier()

In [24]:
rf_classifier.fit(X_train, y_train)
svm_classifier.fit(X_train, y_train)
log_reg_classifier.fit(X_train, y_train)
knn_classifier.fit(X_train, y_train)
xgb_classifier.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [25]:
rf_pred = rf_classifier.predict(X_test)
svm_pred = svm_classifier.predict(X_test)
log_reg_pred = log_reg_classifier.predict(X_test)
knn_pred = knn_classifier.predict(X_test)
xgb_pred = xgb_classifier.predict(X_test)

In [26]:
rf_accuracy = accuracy_score(y_test, rf_pred)
svm_accuracy = accuracy_score(y_test, svm_pred)
log_reg_accuracy = accuracy_score(y_test, log_reg_pred)
knn_accuracy = accuracy_score(y_test, knn_pred)
xgb_accuracy = accuracy_score(y_test, xgb_pred)


rf_recall = recall_score(y_test, rf_pred)
svm_recall = recall_score(y_test, svm_pred)
log_reg_recall = recall_score(y_test, log_reg_pred)
knn_recall = recall_score(y_test, knn_pred)
xgb_recall = recall_score(y_test, xgb_pred)

In [27]:
print(f"Random Forest Metrics:")
print(f"Accuracy: {rf_accuracy:.4f}")
print(f"Recall: {rf_recall:.4f}\n")

print(f"SVM Metrics:")
print(f"Accuracy: {svm_accuracy:.4f}")
print(f"Recall: {svm_recall:.4f}\n")

print(f"XGBoost Metrics:")
print(f"Accuracy: {xgb_accuracy:.4f}")
print(f"Recall: {xgb_recall:.4f}\n")

print(f"Logistic Regression Metrics:")
print(f"Accuracy: {log_reg_accuracy:.4f}")
print(f"Recall: {log_reg_recall:.4f}\n")

print(f"KNN Metrics:")
print(f"Accuracy: {knn_accuracy:.4f}")
print(f"Recall: {knn_recall:.4f}\n")

Random Forest Metrics:
Accuracy: 0.8438
Recall: 0.5000

SVM Metrics:
Accuracy: 0.7750
Recall: 0.0000

XGBoost Metrics:
Accuracy: 0.8187
Recall: 0.5833

Logistic Regression Metrics:
Accuracy: 0.8500
Recall: 0.6389

KNN Metrics:
Accuracy: 0.7812
Recall: 0.5556



- The recall score for many of the models are low, which is likely due to the imbalance of the dataset. To handle this, the SMOTE technique can be used to oversample the minority dataset. 

### Using SMOTE

In [28]:
from imblearn.over_sampling import SMOTE

In [29]:
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

In [30]:
rf_classifier_balanced = RandomForestClassifier()
svm_classifier_balanced  = SVC()
log_reg_classifier_balanced  = LogisticRegression()
knn_classifier_balanced  = KNeighborsClassifier()
xgb_classifier_balanced  = XGBClassifier()

In [31]:
rf_classifier_balanced.fit(X_train_resampled, y_train_resampled)
svm_classifier_balanced.fit(X_train_resampled, y_train_resampled)
log_reg_classifier_balanced.fit(X_train_resampled, y_train_resampled)
knn_classifier_balanced.fit(X_train_resampled, y_train_resampled)
xgb_classifier_balanced.fit(X_train_resampled, y_train_resampled)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [32]:
rf_pred_balanced = rf_classifier_balanced.predict(X_test)
svm_pred_balanced = svm_classifier_balanced.predict(X_test)
log_reg_pred_balanced = log_reg_classifier_balanced.predict(X_test)
knn_pred_balanced = knn_classifier_balanced.predict(X_test)
xgb_pred_balanced = xgb_classifier.predict(X_test)

In [33]:
rf_accuracy_balanced = accuracy_score(y_test, rf_pred_balanced)
svm_accuracy_balanced = accuracy_score(y_test, svm_pred_balanced)
log_reg_accuracy_balanced = accuracy_score(y_test, log_reg_pred_balanced)
knn_accuracy_balanced = accuracy_score(y_test, knn_pred_balanced)
xgb_accuracy_balanced = accuracy_score(y_test, xgb_pred_balanced)


rf_recall_balanced = recall_score(y_test, rf_pred_balanced)
svm_recall_balanced = recall_score(y_test, svm_pred_balanced)
log_reg_recall_balanced = recall_score(y_test, log_reg_pred_balanced)
knn_recall_balanced = recall_score(y_test, knn_pred_balanced)
xgb_recall_balanced = recall_score(y_test, xgb_pred_balanced)

In [34]:
print(f"Random Forest Metrics:")
print(f"Accuracy: {rf_accuracy_balanced:.4f}")
print(f"Recall: {rf_recall_balanced:.4f}\n")

print(f"SVM Metrics:")
print(f"Accuracy: {svm_accuracy_balanced:.4f}")
print(f"Recall: {svm_recall_balanced:.4f}\n")

print(f"XGBoost Metrics:")
print(f"Accuracy: {xgb_accuracy_balanced:.4f}")
print(f"Recall: {xgb_recall_balanced:.4f}\n")

print(f"Logistic Regression Metrics:")
print(f"Accuracy: {log_reg_accuracy_balanced:.4f}")
print(f"Recall: {log_reg_recall_balanced:.4f}\n")

print(f"KNN Metrics:")
print(f"Accuracy: {knn_accuracy_balanced:.4f}")
print(f"Recall: {knn_recall_balanced:.4f}\n")

Random Forest Metrics:
Accuracy: 0.8313
Recall: 0.6667

SVM Metrics:
Accuracy: 0.7500
Recall: 0.9167

XGBoost Metrics:
Accuracy: 0.8187
Recall: 0.5833

Logistic Regression Metrics:
Accuracy: 0.8375
Recall: 0.8333

KNN Metrics:
Accuracy: 0.7688
Recall: 0.8611



- Using SMOTE significantly improved the recall of the models, indicating they are better at classifiying patients who were diagnosed at autistic.
- The SVM, Logistic regression and KNN models demonstrated the most promise with regards to their performance.

### Fine Tuning Hyperparemeters

In [35]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

In [36]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_resampled)
X_test_scaled = scaler.fit_transform(X_test)

In [37]:
svm_grid_search = SVC()

In [38]:
svm_param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf', 'poly'],
    'gamma': ['scale', 'auto']
}

In [39]:
svm_grid_search = GridSearchCV(svm_grid_search, svm_param_grid, cv=5, scoring='accuracy')
svm_grid_search.fit(X_train_scaled, y_train_resampled)

In [40]:
svm_best_params = svm_grid_search.best_params_
svm_best_model = svm_grid_search.best_estimator_

In [41]:
svm_grid_pred = svm_best_model.predict(X_test_scaled)

In [42]:
svm_best_model_accuracy = accuracy_score(y_test, svm_grid_pred)
svm_best_model_recall = recall_score(y_test, svm_grid_pred)

In [43]:
print("Best Parameters:", svm_best_params)
print("Accuracy after tuning:", svm_best_model_accuracy)
print("Recall after tuning:", svm_best_model_recall)

Best Parameters: {'C': 10, 'gamma': 'scale', 'kernel': 'rbf'}
Accuracy after tuning: 0.79375
Recall after tuning: 0.8333333333333334


- Fine tuning the SVM model with different hyperparameters improved the accuracy, but recall decreased. 

In [44]:
lr_param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear', 'newton-cg', 'lbfgs', 'sag', 'saga'],
    'max_iter': [100, 200, 300]
}

In [45]:
lr_grid_search = LogisticRegression()

In [46]:
lr_grid_search = GridSearchCV(lr_grid_search, lr_param_grid, cv=5, scoring='accuracy')
lr_grid_search.fit(X_train_scaled, y_train_resampled)

270 fits failed out of a total of 900.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
90 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\moham\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\moham\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\sklearn\linear_model\_logistic.py", line 1162, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "C:\Users\moham\AppData\Local\Packages\Python

In [47]:
lr_best_params = lr_grid_search.best_params_
lr_best_model = lr_grid_search.best_estimator_

In [48]:
lr_grid_pred = lr_best_model.predict(X_test_scaled)
lr_best_model_accuracy = accuracy_score(y_test, lr_grid_pred)
lr_best_model_recall = recall_score(y_test, lr_grid_pred)

print("Best Parameters:", lr_best_params)
print("Accuracy after tuning:", lr_best_model_accuracy)
print("Recall after tuning:", lr_best_model_recall)

Best Parameters: {'C': 1, 'max_iter': 100, 'penalty': 'l1', 'solver': 'liblinear'}
Accuracy after tuning: 0.74375
Recall after tuning: 0.9722222222222222


- Fine tuning improved recall even further for Logistic Regression, however total accuracy decreased

In [49]:
knn_param_grid = {
    'n_neighbors': [3, 5, 7, 9, 11],
    'weights': ['uniform', 'distance'],
    'p': [1, 2], 
    'metric': ['euclidean', 'manhattan', 'chebyshev']
}

In [50]:
knn_grid_search = KNeighborsClassifier()
knn_grid_search = GridSearchCV(knn_grid_search, knn_param_grid, cv=5, scoring='accuracy')
knn_grid_search.fit(X_train_scaled, y_train_resampled)

In [51]:
knn_best_params = knn_grid_search.best_params_
knn_best_model = knn_grid_search.best_estimator_

knn_grid_pred = knn_best_model.predict(X_test_scaled)
knn_best_model_accuracy = accuracy_score(y_test, knn_grid_pred)
knn_best_model_recall = recall_score(y_test, knn_grid_pred)

print("Best Parameters:", knn_best_params)
print("Accuracy after tuning:", knn_best_model_accuracy)
print("Recall after tuning:", knn_best_model_recall)

Best Parameters: {'metric': 'manhattan', 'n_neighbors': 5, 'p': 1, 'weights': 'distance'}
Accuracy after tuning: 0.75
Recall after tuning: 0.8611111111111112


- The KNN model decreased its' accuracy after fine tunig and recall remained the same.
- Overall, the Logistic Regression model with the default parameters provides the best trade off between accuracy and recall.