# <center><b>MIS MODELLING NOTEBOOK</b></center>

* In this project, the focus is to test the implementation of multiple models and use the one with the best performance.

In [1]:
import collections.abc
# hyper expects these four names on `collections`; they live in collections.abc
# and were removed from `collections` in Python 3.10, so alias them manually.
collections.Iterable = collections.abc.Iterable
collections.Mapping = collections.abc.Mapping
collections.MutableSet = collections.abc.MutableSet
collections.MutableMapping = collections.abc.MutableMapping

In [12]:
#Import libraries
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
import seaborn as sns
from collections import Counter

#sklearn libraries
# pre-processing and model libraries
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MaxAbsScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# handle warnings
import warnings
warnings.filterwarnings("ignore")

#metrics
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.utils import shuffle

#Save and load the model
import joblib

In [3]:
# load the second data
df = pd.read_csv('new_adhd.csv', on_bad_lines='skip')
df.head()

Unnamed: 0,Fails to give attention to details,Has difficulty sustaining attention to tasks,Does not seem to listen when spoken to directly,Does not follow through on instructions,Has difficulty organizing tasks and activities,"Avoids, dislikes, or is reluctant to engage in tasks requiring sustained mental effort",Loses things necessary for tasks or activities,Is easily distracted by extraneous stimuli,Is forgetful in daily activities,Fidgets with or taps hands or feet or squirms in seat,Leaves seat in situations in which it is inappropriate,Unable to play or engage in leisure activities quietly,Has difficulty playing or engaging in leisure activities quietly,Is “on the go” or acts as if “driven by a motor”,Talks excessively,Blurts out an answer before the question has been completed,Has difficulty waiting his or her turn,Interrupts or intrudes on others,Result
0,Just a little,Very Often,Very Often,Often,Often,Often,Just a little,Very Often,Not at all,Not at all,Just a little,Very Often,Just a little,Often,Just a little,Often,Not at all,Very Often,ADHD
1,Not at all,Very Often,Just a little,Not at all,Just a little,Just a little,Just a little,Not at all,Just a little,Not at all,Very Often,Just a little,Not at all,Just a little,Just a little,Just a little,Just a little,Often,No ADHD
2,Not at all,Just a little,Just a little,Not at all,Just a little,Often,Not at all,Not at all,Not at all,Often,Just a little,Just a little,Not at all,Not at all,Often,Not at all,Often,Just a little,No ADHD
3,Just a little,Not at all,Just a little,Not at all,Just a little,Just a little,Just a little,Just a little,Not at all,Not at all,Often,Not at all,Often,Often,Just a little,Often,Not at all,Just a little,No ADHD
4,Not at all,Very Often,Often,Often,Not at all,Not at all,Not at all,Often,Not at all,Just a little,Not at all,Not at all,Not at all,Not at all,Not at all,Often,Not at all,Just a little,No ADHD


## PREPROCESSING

In [4]:
#Shuffle the data
df = shuffle(df, random_state=41).reset_index(drop=True)

In [5]:
# preview the distribution of the data
df['Result'].value_counts()

Result
No ADHD    686761
ADHD       313239
Name: count, dtype: int64
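Because the classes are imbalanced, it helps to know the accuracy of always predicting the majority class before judging any model. A small sketch using the counts printed above (the variable names are illustrative):

```python
# Class counts copied from the value_counts() output above
counts = {"No ADHD": 686761, "ADHD": 313239}

total = sum(counts.values())
# Share of each class in the full dataset
shares = {label: n / total for label, n in counts.items()}
# Accuracy of a trivial classifier that always predicts the majority class
majority_label = max(counts, key=counts.get)
baseline_accuracy = shares[majority_label]

print(total)                        # 1000000
print(majority_label)               # No ADHD
print(round(baseline_accuracy, 4))  # 0.6868
```

Any model worth keeping should beat this baseline by a clear margin.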

In [6]:
# change the data to lower cases
for col in df.columns:
    df[col] = df[col].str.lower()
df.head()

Unnamed: 0,Fails to give attention to details,Has difficulty sustaining attention to tasks,Does not seem to listen when spoken to directly,Does not follow through on instructions,Has difficulty organizing tasks and activities,"Avoids, dislikes, or is reluctant to engage in tasks requiring sustained mental effort",Loses things necessary for tasks or activities,Is easily distracted by extraneous stimuli,Is forgetful in daily activities,Fidgets with or taps hands or feet or squirms in seat,Leaves seat in situations in which it is inappropriate,Unable to play or engage in leisure activities quietly,Has difficulty playing or engaging in leisure activities quietly,Is “on the go” or acts as if “driven by a motor”,Talks excessively,Blurts out an answer before the question has been completed,Has difficulty waiting his or her turn,Interrupts or intrudes on others,Result
0,not at all,not at all,often,just a little,just a little,not at all,just a little,not at all,not at all,not at all,not at all,just a little,just a little,not at all,not at all,very often,not at all,not at all,no adhd
1,not at all,just a little,not at all,not at all,not at all,just a little,not at all,just a little,not at all,just a little,just a little,not at all,often,not at all,not at all,not at all,not at all,just a little,no adhd
2,often,not at all,often,very often,not at all,not at all,not at all,not at all,just a little,not at all,not at all,often,not at all,not at all,very often,just a little,often,not at all,adhd
3,not at all,just a little,not at all,just a little,often,not at all,just a little,not at all,just a little,just a little,very often,just a little,very often,not at all,just a little,not at all,very often,just a little,no adhd
4,not at all,often,often,not at all,not at all,not at all,very often,just a little,just a little,just a little,not at all,not at all,just a little,not at all,just a little,not at all,not at all,not at all,no adhd


In [7]:
# encode the target variable which is the result column
le = LabelEncoder()
Y = le.fit_transform(df['Result'])

# one-hot encode the questionnaire answers (the feature columns)
enc = OneHotEncoder()
X = enc.fit_transform(df.drop('Result', axis=1))
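To make the shape of the encoded matrix concrete, here is a plain-Python sketch of what one-hot encoding does to a single respondent. It assumes the four answer levels seen in the data and the alphabetical category order that `OneHotEncoder` uses by default; `one_hot_row` is a hypothetical helper, not part of the notebook.

```python
# The four answer levels used in the questionnaire (lower-cased), sorted
# alphabetically, which is the order OneHotEncoder uses by default.
LEVELS = sorted(["not at all", "just a little", "often", "very often"])

def one_hot_row(answers):
    """Return a flat 0/1 list with one block of len(LEVELS) entries per answer."""
    row = []
    for answer in answers:
        row.extend(1 if answer == level else 0 for level in LEVELS)
    return row

# Hypothetical respondent with 18 answers
answers = ["often"] * 18
encoded = one_hot_row(answers)
print(len(encoded))  # 72
print(sum(encoded))  # 18
```

Eighteen questions with four levels each give 72 columns, with exactly one non-zero entry per question.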

In [None]:
# Split the data
# X, Y = df.drop('Result', axis=1), df['Result']

In [8]:
# Scale the encoded features (one-hot values are already 0/1, so MaxAbsScaler leaves them unchanged)
scaler = MaxAbsScaler()
X = scaler.fit_transform(X)

## MODEL PREPARATION


In [9]:
#Using the first dataframe, split the data into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

In [15]:
#Model list for testing different model performance and use the best performing model
modelList = []
modelList.append(("LogisticReg", LogisticRegression()))
modelList.append(("MultinomialNB", MultinomialNB()))
modelList.append(("GradBoostClf", GradientBoostingClassifier()))
modelList.append(("DecisionTree", DecisionTreeClassifier()))
modelList.append(("RandomForest", RandomForestClassifier()))
# modelList.append(("SVC",  SVC()))
modelList.append(("XGB", XGBClassifier()))
modelList.append(("LightGBM", LGBMClassifier()))

In [16]:
#Train and predict function using the second data
def train_predict(x_train, x_test, y_train, y_test):
    for name, classifier in modelList:
        classifier.fit(x_train,y_train)
        y_pred = classifier.predict(x_test)
        print("{} Accuracy: {}".format(name,accuracy_score(y_test,y_pred)))
        print("confusion matrix:\n", confusion_matrix(y_test, y_pred))
        print()

In [17]:
#test the function
train_predict(X_train, X_test, y_train, y_test)

LogisticReg Accuracy: 1.0
confusion matrix:
 [[ 93742      0]
 [     0 206258]]

MultinomialNB Accuracy: 0.9317433333333334
confusion matrix:
 [[ 74490  19252]
 [  1225 205033]]

GradBoostClf Accuracy: 0.92476
confusion matrix:
 [[ 71206  22536]
 [    36 206222]]

DecisionTree Accuracy: 0.8648733333333334
confusion matrix:
 [[ 71490  22252]
 [ 18286 187972]]

RandomForest Accuracy: 0.9405966666666666
confusion matrix:
 [[ 76832  16910]
 [   911 205347]]

XGB Accuracy: 0.9978233333333333
confusion matrix:
 [[ 93109    633]
 [    20 206238]]

[LightGBM] [Info] Number of positive: 480503, number of negative: 219497
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.208564 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 144
[LightGBM] [Info] Number of data points in the train set: 700000, number of used features: 72
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.686433 -> initscore=0.783495
[LightGBM] [
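Accuracy alone hides how the errors split between the two classes. The sketch below recomputes accuracy and per-class recall from the confusion matrices printed above, assuming rows are true classes in label-encoder order (0 = adhd, 1 = no adhd); `summarize` is an illustrative helper.

```python
# Confusion matrices copied from the output above: rows are true classes,
# columns are predicted classes, in label-encoder order (0 = adhd, 1 = no adhd).
matrices = {
    "MultinomialNB": [[74490, 19252], [1225, 205033]],
    "GradBoostClf": [[71206, 22536], [36, 206222]],
    "RandomForest": [[76832, 16910], [911, 205347]],
    "XGB": [[93109, 633], [20, 206238]],
}

def summarize(cm):
    """Return (accuracy, recall for class 0, recall for class 1)."""
    (hit0, miss0), (miss1, hit1) = cm
    total = hit0 + miss0 + miss1 + hit1
    return (hit0 + hit1) / total, hit0 / (hit0 + miss0), hit1 / (hit1 + miss1)

for name, cm in matrices.items():
    acc, recall_adhd, recall_no_adhd = summarize(cm)
    print(f"{name}: accuracy={acc:.4f}, "
          f"adhd recall={recall_adhd:.4f}, no-adhd recall={recall_no_adhd:.4f}")
```

For several models most of the errors are ADHD cases predicted as "no adhd", which matters more for a screening tool than the headline accuracy suggests.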

* The three top models are:
1. LogisticRegression with an accuracy score of 1.0
2. LightGBM with an accuracy score of 0.9997 (its output above is truncated)
3. XGBoost with an accuracy score of 0.9978

* So the implementation will use these three models, with a voting system to pick the final result.
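The voting idea can be sketched in a few lines: each model casts one vote and the most frequent label wins. The votes below are hypothetical and `majority_vote` is an illustrative name.

```python
from collections import Counter

def majority_vote(labels):
    """Return the label predicted by most models."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical predictions from three models for one respondent
votes = ["adhd", "no adhd", "adhd"]
print(majority_vote(votes))  # adhd
```

With three voters and two classes there is always a strict majority, so no tie-breaking rule is needed.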

In [13]:
# define the three models
model1 = XGBClassifier()  # `probability` is an SVC option; XGBClassifier supports predict_proba by default
model2 = LGBMClassifier()
model3 = LogisticRegression()

In [14]:
#Using the whole dataset to train the deployment model

# XGBoost model
model1.fit(X, Y)

In [15]:
# LightGBM model
model2.fit(X, Y)

[LightGBM] [Info] Number of positive: 686761, number of negative: 313239
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.164793 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 144
[LightGBM] [Info] Number of data points in the train set: 1000000, number of used features: 72
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.686761 -> initscore=0.785020
[LightGBM] [Info] Start training from score 0.785020


In [16]:
# Logistic regression model
model3.fit(X, Y)

In [17]:
df2 = pd.read_csv('new_adhd2.csv', on_bad_lines='skip')
df2.head()

Unnamed: 0,Fails to give attention to details,Has difficulty sustaining attention to tasks,Does not seem to listen when spoken to directly,Does not follow through on instructions,Has difficulty organizing tasks and activities,"Avoids, dislikes, or is reluctant to engage in tasks requiring sustained mental effort",Loses things necessary for tasks or activities,Is easily distracted by extraneous stimuli,Is forgetful in daily activities,Fidgets with or taps hands or feet or squirms in seat,Leaves seat in situations in which it is inappropriate,Unable to play or engage in leisure activities quietly,Has difficulty playing or engaging in leisure activities quietly,Is “on the go” or acts as if “driven by a motor”,Talks excessively,Blurts out an answer before the question has been completed,Has difficulty waiting his or her turn,Interrupts or intrudes on others,Result
0,Just a little,Just a little,Often,Not at all,Just a little,Not at all,Not at all,Often,Not at all,Just a little,Not at all,Not at all,Not at all,Just a little,Not at all,Just a little,Not at all,Not at all,No ADHD
1,Just a little,Often,Not at all,Just a little,Just a little,Just a little,Not at all,Very Often,Often,Often,Very Often,Not at all,Often,Just a little,Just a little,Very Often,Just a little,Just a little,ADHD
2,Very Often,Not at all,Not at all,Often,Very Often,Not at all,Not at all,Not at all,Not at all,Very Often,Very Often,Not at all,Just a little,Not at all,Not at all,Not at all,Not at all,Very Often,ADHD
3,Often,Just a little,Just a little,Not at all,Very Often,Just a little,Just a little,Not at all,Not at all,Often,Often,Very Often,Very Often,Just a little,Not at all,Often,Just a little,Very Often,ADHD
4,Often,Not at all,Not at all,Not at all,Just a little,Just a little,Often,Very Often,Just a little,Not at all,Not at all,Just a little,Not at all,Just a little,Just a little,Just a little,Very Often,Not at all,No ADHD


In [18]:
# change the data to lower cases
for col in df2.columns:
    df2[col] = df2[col].str.lower()
df2.head()

Unnamed: 0,Fails to give attention to details,Has difficulty sustaining attention to tasks,Does not seem to listen when spoken to directly,Does not follow through on instructions,Has difficulty organizing tasks and activities,"Avoids, dislikes, or is reluctant to engage in tasks requiring sustained mental effort",Loses things necessary for tasks or activities,Is easily distracted by extraneous stimuli,Is forgetful in daily activities,Fidgets with or taps hands or feet or squirms in seat,Leaves seat in situations in which it is inappropriate,Unable to play or engage in leisure activities quietly,Has difficulty playing or engaging in leisure activities quietly,Is “on the go” or acts as if “driven by a motor”,Talks excessively,Blurts out an answer before the question has been completed,Has difficulty waiting his or her turn,Interrupts or intrudes on others,Result
0,just a little,just a little,often,not at all,just a little,not at all,not at all,often,not at all,just a little,not at all,not at all,not at all,just a little,not at all,just a little,not at all,not at all,no adhd
1,just a little,often,not at all,just a little,just a little,just a little,not at all,very often,often,often,very often,not at all,often,just a little,just a little,very often,just a little,just a little,adhd
2,very often,not at all,not at all,often,very often,not at all,not at all,not at all,not at all,very often,very often,not at all,just a little,not at all,not at all,not at all,not at all,very often,adhd
3,often,just a little,just a little,not at all,very often,just a little,just a little,not at all,not at all,often,often,very often,very often,just a little,not at all,often,just a little,very often,adhd
4,often,not at all,not at all,not at all,just a little,just a little,often,very often,just a little,not at all,not at all,just a little,not at all,just a little,just a little,just a little,very often,not at all,no adhd


In [19]:
data = list(df2[df2.columns[:-1]].values[1:2])
data

[array(['just a little', 'often', 'not at all', 'just a little',
        'just a little', 'just a little', 'not at all', 'very often',
        'often', 'often', 'very often', 'not at all', 'often',
        'just a little', 'just a little', 'very often', 'just a little',
        'just a little'], dtype=object)]

In [20]:
data = enc.transform(data)
data

<1x72 sparse matrix of type '<class 'numpy.float64'>'
	with 18 stored elements in Compressed Sparse Row format>

In [22]:
pred = model2.predict_proba(data)
pred

array([[0.84446243, 0.15553757]])

In [23]:
pred[0][1]

0.1555375683416272

In [70]:
model1.predict_proba?

In [47]:
mapping = dict(zip(range(len(le.classes_)), le.classes_))
print(mapping)

{0: 'adhd', 1: 'no adhd'}


In [51]:
print(mapping.get(pred[0]))

no adhd
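Note that `mapping` is keyed by integer class index, so a row from `predict_proba` has to be reduced to an index before the lookup. A minimal sketch using the probability row and mapping printed in the cells above:

```python
# Values copied from the cells above
probabilities = [0.84446243, 0.15553757]  # one row of predict_proba output
mapping = {0: "adhd", 1: "no adhd"}        # label-encoder index -> class name

# The index of the largest probability is the predicted class index
predicted_index = max(range(len(probabilities)), key=probabilities.__getitem__)
label = mapping[predicted_index]
confidence = probabilities[predicted_index]

print(label)                 # adhd
print(round(confidence, 4))  # 0.8445
```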


In [24]:
# save the models for testing
joblib.dump(model1, 'models/model_xgboost.pkl')
joblib.dump(model2, 'models/model_lgbm.pkl')
joblib.dump(model3, 'models/model_logistic.pkl')

['models/model_logistic.pkl']

In [53]:
# save the one-hot encoder (stored under the file name scaler.pkl) and the label encoder
joblib.dump(enc, 'models/scaler.pkl')
joblib.dump(le, 'models/label_encoder.pkl')

['models/label_encoder.pkl']

In [101]:
def prediction(data):
    """Predict a label for raw, lower-cased questionnaire answers by majority vote."""
    model1 = joblib.load('models/model_xgboost.pkl')
    model2 = joblib.load('models/model_lgbm.pkl')
    model3 = joblib.load('models/model_logistic.pkl')
    # 'scaler.pkl' holds the fitted OneHotEncoder that was saved above
    encoder = joblib.load('models/scaler.pkl')
    data = encoder.transform(data)

    # class predictions from the three models (first row only)
    pred = [model.predict(data)[0] for model in (model1, model2, model3)]
    # highest class probability from the two boosted models
    pred_prob = [model.predict_proba(data)[0].max() for model in (model1, model2)]

    if pred and pred_prob:
        return voting_aggregator(pred, pred_prob)
    return {}

def voting_aggregator(scores, pred_prob):
    """Map the majority class index to its label and average the model confidences."""
    label_encoder = joblib.load('models/label_encoder.pkl')
    # most frequent class index among the model predictions
    winner = Counter(scores).most_common(1)[0][0]
    mapping = dict(zip(range(len(label_encoder.classes_)), label_encoder.classes_))
    # mean of each model's highest class probability, as a percentage
    prob = round((sum(pred_prob)/len(pred_prob))*100, 2)
    return {"prediction": mapping.get(winner), "confidence level": prob}

In [107]:
result = prediction(data)
print(f"Result: {result.get('prediction', '').upper()}, confidence level : {result.get('confidence level')}%")

Result: ADHD, confidence level : 91.74%
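The confidence level above averages each model's highest probability, regardless of which class that probability belongs to. One possible alternative, sketched here with hypothetical probabilities, is to average the probabilities per class across models and report the mean probability of the winning class. `soft_vote` is an illustrative name, not something defined in the notebook.

```python
def soft_vote(prob_rows, class_names):
    """Average per-class probabilities across models and pick the largest.

    prob_rows: one [p_class0, p_class1, ...] list per model.
    class_names: class names in the same column order.
    """
    n_models = len(prob_rows)
    mean_probs = [sum(col) / n_models for col in zip(*prob_rows)]
    best = max(range(len(mean_probs)), key=mean_probs.__getitem__)
    return {"prediction": class_names[best],
            "confidence level": round(mean_probs[best] * 100, 2)}

# Hypothetical predict_proba rows from two models that disagree
rows = [[0.90, 0.10], [0.40, 0.60]]
print(soft_vote(rows, ["adhd", "no adhd"]))
# {'prediction': 'adhd', 'confidence level': 65.0}
```

For the same two rows, averaging the per-model maxima would report 75%, even though one of the models voted for the other class.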


In [56]:
x = [1,2,3,4,3,2,1,2,3,4,3,1,2,3,2,1,2,3,4,5,4,6]

In [57]:
c = Counter(x)
c

Counter({1: 4, 2: 6, 3: 6, 4: 4, 5: 1, 6: 1})

In [58]:
c.most_common()

[(2, 6), (3, 6), (1, 4), (4, 4), (5, 1), (6, 1)]

In [59]:
c.items()

dict_items([(1, 4), (2, 6), (3, 6), (4, 4), (5, 1), (6, 1)])

In [65]:
c.most_common(1)[0][0]

2
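In the list above, 2 and 3 both occur six times, and `most_common(1)` returns 2 because it was encountered first. A short sketch of that tie-breaking behaviour (`first_winner` is an illustrative helper):

```python
from collections import Counter

def first_winner(items):
    """Return the most common item; ties go to the item seen first."""
    return Counter(items).most_common(1)[0][0]

print(first_winner([3, 2, 3, 2]))  # 3, because 3 was seen first
print(first_winner([2, 3, 3, 2]))  # 2, because 2 was seen first
```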

In [74]:
sum(x)/len(x)*100

277.2727272727273

In [25]:
from openai import OpenAI

In [55]:
import os
# Read the API key from the environment instead of hard-coding it in the notebook
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

In [70]:
pred = 'no adhd'
behaviour = '''Sometimes, he can't hold back his thoughts and end up speaking out of turn. \
                Also patience isn't his strong suit, making waiting for his turn a bit tricky. \
                Lastly, he might jump into actions without fully thinking about the outcomes.'''
behaviours = """
            The child is well behaved in school and does not involve himself with any activities that are harmful to himself,
            or others.
            """
content = f"""The standard ADHD questionnaire had been administered, and the result is predicted to be <{pred.upper()}> \
                and the behavioural pattern of the child is ({behaviour}). Analyze the behaviour in relation to the prediction
                 and ascertain if the prediction is correct. You are to respond with a YES if the prediction \
                is correct or NO if not. You are to strictly respond only with a Yes or a No"""
resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
            {"role": "system", "content": "Mental health assistant"},
            {"role": "user", 
             "content": content}
        ]
    )

In [71]:
print(resp.choices[0].message.content)

NO


In [63]:
print(resp.choices[0].message.content)

Based on the provided information, the prediction of <NO ADHD> seems to be incorrect. The mentioned behavioral patterns, such as difficulties with impulse control, lack of patience, and engaging in impulsive actions without considering the consequences, are all characteristic symptoms of ADHD. Therefore, the correct response would be NO.


In [50]:
print(resp.choices[0].message.content)

Based on the behaviour described, the prediction of <NO ADHD> does not seem to be correct. The child's difficulty holding back thoughts and speaking out of turn, lack of patience, and impulsive actions without considering outcomes are all indicative of symptoms associated with ADHD. Therefore, the correct response is NO.


In [48]:
print(resp.choices[0].message.content)

Based on the given behavioral pattern described, it does align with some common symptoms of ADHD. The difficulty in holding back thoughts and speaking out of turn, impatience, and impulsivity are all potential signs of ADHD.

However, as an AI language model, I don't have access to real-time data or personal information about the child, and I cannot diagnose or provide a definitive answer. A diagnosis should only be made by a qualified healthcare professional based on a comprehensive evaluation of the individual's symptoms and history.

In this case, if the result of the ADHD questionnaire is predicted to be "NO ADHD," but the described behaviors are consistent with ADHD symptoms, it may be worth seeking a second opinion from a mental health specialist or a healthcare professional who can conduct a more thorough assessment.


In [46]:
print(resp.choices[0].message.content)

Based on the suspected behavior you described, which includes an inability to hold back thoughts, speaking out of turn, lack of patience, difficulty waiting for turn, and impulsivity, there is a possibility that the child may have ADHD. However, it is important to note that this is just an analysis based on limited information and does not substitute for a proper diagnosis by a qualified mental health specialist. 

Therefore, I cannot accurately predict whether the child has ADHD or not without further assessment and evaluation. I recommend that you consult with a mental health specialist who can administer the standard ADHD questionnaire and conduct a comprehensive evaluation to determine whether ADHD is present or not.
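The replies above range from a bare "NO" to several paragraphs with no verdict at all, so downstream code cannot rely on the reply being exactly "Yes" or "No". A heuristic sketch for normalising a reply follows; `parse_yes_no` is a hypothetical helper, and the rule of taking the last standalone YES/NO token is an assumption that may not hold for every reply.

```python
import re

def parse_yes_no(text):
    """Return True for a YES verdict, False for NO, None if no verdict is found."""
    # Drop the quoted prediction label so "NO ADHD" is not mistaken for a verdict
    cleaned = text.lower().replace("no adhd", "")
    # Verbose replies tend to state the verdict last, so use the final token
    tokens = re.findall(r"\b(yes|no)\b", cleaned)
    if not tokens:
        return None
    return tokens[-1] == "yes"

print(parse_yes_no("NO"))                                                   # False
print(parse_yes_no("The prediction of <NO ADHD> is wrong. The answer is NO."))  # False
print(parse_yes_no("I cannot diagnose or provide a definitive answer."))    # None
```

A `None` result can then be treated as "no usable verdict" and either retried or flagged for manual review.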
