In [141]:
from utils import download_kaggle
import pandas as pd
import numpy as np
import plotly.graph_objects as go
from scipy.stats import chi2_contingency
from plotly.subplots import make_subplots


#folder = download_kaggle("kaggle competitions download -c playground-series-s3e13")

def apk(actual, predicted, k=10):
    """
    Computes the average precision at k.
    This function computes the average precision at k between two lists of
    items.
    Parameters
    ----------
    actual : list
             A list of elements that are to be predicted (order doesn't matter)
    predicted : list
                A list of predicted elements (order does matter)
    k : int, optional
        The maximum number of predicted elements
    Returns
    -------
    score : double
            The average precision at k over the input lists
    """
    if not actual:
        return 0.0

    if len(predicted)>k:
        predicted = predicted[:k]

    score = 0.0
    num_hits = 0.0

    for i,p in enumerate(predicted):
        # first condition checks whether it is valid prediction
        # second condition checks if prediction is not repeated
        if p in actual and p not in predicted[:i]:
            
            num_hits += 1.0
            score += num_hits / (i+1.0)

    return score / min(len(actual), k)

def mapk(actual, predicted, k=10):
    """
    Computes the mean average precision at k.
    This function computes the mean average precision at k between two lists
    of lists of items.
    Parameters
    ----------
    actual : list
             A list of lists of elements that are to be predicted 
             (order doesn't matter in the lists)
    predicted : list
                A list of lists of predicted elements
                (order matters in the lists)
    k : int, optional
        The maximum number of predicted elements
    Returns
    -------
    score : double
            The mean average precision at k over the input lists
    """
    return np.mean([apk(a,p,k) for a,p in zip(actual, predicted)])

# Vector-Borne Disease Prognosis

## Dataset Description
This dataset was generated from a deep learning model trained on the *Vector Borne Disease Prediction* dataset. The feature distributions are close to, but not exactly the same as, those of the original dataset. It is permissible to use the original dataset for feature engineering.

The dataset includes two files:
- `train.csv`: The training dataset, with the target variable `prognosis`.
- `test.csv`: The testing dataset, with `prognosis` excluded.

Additional information:
Every prognosis in the dataset, together with its symptoms, corresponds to one of 11 vector-borne diseases.

## Background Information

According to the World Health Organization (WHO), vector-borne diseases account for more than 17% of all infectious diseases and cause over 700,000 deaths annually. These diseases are transmitted through biological vectors, such as mosquitoes, flies, and ticks.

## Biological Vectors

Biological vectors are living organisms that can transmit infectious pathogens between humans or from animals to humans (zoonoses). A vector typically picks up the pathogen while taking a blood meal from an infected host and later transmits it to a new host.

Source: WHO (https://www.who.int/news-room/fact-sheets/detail/vector-borne-diseases)

## Common Symptoms

The common symptoms of vector-borne diseases include fever, headache, muscle pain, joint pain, rash, vomiting, and diarrhea.

Source: KidsHealth (https://kidshealth.org/en/parents/mosquito-diseases.html)

## Risk Factors to Consider

There are several risk factors associated with vector-borne diseases, including climate change and human-animal interaction. Higher temperatures increase the probability of pathogen development in the environment. There are also several sociodemographic factors that increase the risk of contracting vector-borne diseases, such as:

- Poverty and inequality: People living in poor conditions are more vulnerable to vector-borne diseases, as they may lack access to health and sanitation services, adequate housing, and vector control services.

- Globalization and urbanization: People who own pets and exotic animals are at higher risk of contracting vector-borne diseases. Urbanization can also affect the transmission of vector-borne diseases in crowded cities or streets.

- Human and animal interaction: Activities such as hunting, farming, and pet ownership can increase the risk of contracting vector-borne diseases.

Note: It is recommended to adopt a cautious approach while dealing with these diseases and take necessary preventive measures to avoid transmission.

Sources:

- WHO (https://www.who.int/news-room/fact-sheets/detail/vector-borne-diseases)
- EFSA (https://www.efsa.europa.eu/en/topics/topic/vector-borne-diseases)
- Nature (https://www.nature.com/articles/s41590-020-0648-y/)


## EDA

With the information at our disposal, we can now proceed with performing an Exploratory Data Analysis (EDA) to gain further insights into the dataset.



In [80]:
df = pd.read_csv("playground-series-s3e13/train.csv", index_col='id')
df

id,sudden_fever,headache,mouth_bleed,nose_bleed,muscle_pain,joint_pain,vomiting,rash,diarrhea,hypotension,...,breathing_restriction,toe_inflammation,finger_inflammation,lips_irritation,itchiness,ulcers,toenail_loss,speech_problem,bullseye_rash,prognosis
0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Lyme_disease
1,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Tungiasis
2,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,Lyme_disease
3,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Zika
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,Rift_Valley_fever
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
702,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Plague
703,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Malaria
704,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Zika
705,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,...,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,Plague


Based on an initial inspection, the dataset consists of 707 records and 66 columns in the CSV file: the `id` index, 64 symptom indicators, and the target. The symptom variables are binary-encoded (0/1), and the target variable, `prognosis`, is a string holding the name of the disease.

To gain further insight, I will proceed to analyze the distribution of disease counts and the unique disease classes present in the dataset.

In [142]:
value_counts = df['prognosis'].value_counts()

# Blue by default; green for the most common class, orange for the least common
colors = ['#1E88E5'] * len(value_counts)
colors[0] = "#4CAF50"
colors[-1] = '#FB8C00'

fig = go.Figure(data=[go.Bar(x=value_counts.index,
                             y=value_counts.values,
                             marker_color=colors)])

fig.update_layout(
    title="Disease counts <br> <span style='font-size:.8em'><b style='color:green'>West Nile fever</b> seems to be the most common disease in the dataset and <b style='color:orange'>Malaria</b> the least common</span>",
    width=800,
    height=500,
    xaxis=dict(
        linewidth=1,  # set the linewidth for the X axis
        linecolor='gray'
    ),
    yaxis=dict(
        linewidth=1,
        linecolor='gray'
    ),
    plot_bgcolor="rgba(0,0,0,0)"
)
fig.show()

Upon analyzing the value counts of the dataset, it appears that the distribution is relatively uniform, with no significant differences in the number of records across different strata.
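
One way to put a number on the "relatively uniform" reading is a chi-square goodness-of-fit test against equal class frequencies. The sketch below is only illustrative: `toy_counts` is a made-up series, not the real class counts, and with the actual data one would pass `df['prognosis'].value_counts()` instead.

```python
import pandas as pd
from scipy.stats import chisquare

# Hypothetical class counts, for illustration only (not the real dataset).
toy_counts = pd.Series(
    {"disease_a": 70, "disease_b": 65, "disease_c": 60, "disease_d": 55}
)

# With no expected frequencies given, chisquare tests against equal frequencies.
statistic, p_value = chisquare(toy_counts.values)

print(f"chi2 = {statistic:.3f}, p = {p_value:.3f}")
# A large p-value means the counts are compatible with a uniform distribution.
```

A small p-value would instead suggest that the class imbalance is worth handling explicitly, for example with stratified splits.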

Moving forward, I will proceed to verify if all the variables in the dataset are indeed binary-class variables.

In [143]:
for i in df.columns:
    print(f"Min max on column {i}: {pd.unique(df[i])}")

Min max on column sudden_fever: [1. 0.]
Min max on column headache: [1. 0.]
Min max on column mouth_bleed: [0. 1.]
Min max on column nose_bleed: [1. 0.]
Min max on column muscle_pain: [1. 0.]
Min max on column joint_pain: [1. 0.]
Min max on column vomiting: [1. 0.]
Min max on column rash: [0. 1.]
Min max on column diarrhea: [1. 0.]
Min max on column hypotension: [1. 0.]
Min max on column pleural_effusion: [1. 0.]
Min max on column ascites: [1. 0.]
Min max on column gastro_bleeding: [0. 1.]
Min max on column swelling: [0. 1.]
Min max on column nausea: [1. 0.]
Min max on column chills: [1. 0.]
Min max on column myalgia: [0. 1.]
Min max on column digestion_trouble: [0. 1.]
Min max on column fatigue: [1. 0.]
Min max on column skin_lesions: [0. 1.]
Min max on column stomach_pain: [1. 0.]
Min max on column orbital_pain: [0. 1.]
Min max on column neck_pain: [1. 0.]
Min max on column weakness: [1. 0.]
Min max on column back_pain: [1. 0.]
Min max on column weight_loss: [1. 0.]
Min max on column

After verifying that all variables in the dataset are binary-encoded, I will now proceed to analyze how often each symptom occurs under each prognosis of vector-borne disease.
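
Because the classes do not all have exactly the same number of records, raw symptom counts are not strictly comparable across diseases. A minimal sketch, on a made-up frame (`toy` and its column names are hypothetical), of turning counts into the share of records per disease that show each symptom; for 0/1 columns this is simply the group mean:

```python
import pandas as pd

# Hypothetical toy data: two binary symptoms and a prognosis label.
toy = pd.DataFrame({
    "fever": [1, 1, 0, 1, 0],
    "rash": [0, 1, 0, 0, 1],
    "prognosis": ["A", "A", "A", "B", "B"],
})

# Mean of a 0/1 column within a group = fraction of that group with the symptom.
rates = toy.groupby("prognosis").mean()
print(rates)
```

On the real data the same call would be `df.groupby('prognosis').mean()`, which can be read as an estimate of P(symptom | disease).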

In [144]:
prognosis_grouped = df.groupby('prognosis')

sums = prognosis_grouped.sum()

In [145]:
def show_sums_plot(index, top_n=5):
    # Symptom counts for one disease, sorted ascending so that the most
    # frequent symptoms end up at the top of the horizontal bar chart.
    row = sums.iloc[index].sort_values(kind="stable")

    # Highlight the top_n most frequent symptoms in green, the rest in blue.
    colors = ["#1E88E5"] * (len(row) - top_n) + ["#4CAF50"] * top_n
    disease = " ".join(sums.index[index].split("_")).capitalize()

    figure = go.Figure(
        data=[go.Bar(y=row.index, x=row.values, orientation='h', marker_color=colors)])

    figure.update_layout(
        width=1000,
        height=1200,
        title=f"<b>{disease}</b> disease <b style='color:green'>most common</b> symptoms"
    )

    return figure

figure = show_sums_plot(0)
figure.show()

For Chikungunya, the five symptoms that occur most often among records with this prognosis are:

- Nose Bleed
- Rash
- Sudden Fever
- Muscle Pain
- Joint Pain

These symptoms may be considered the most important features to take into account when predicting the prognosis for this disease.

In [146]:


figure = show_sums_plot(1)
figure.show()

For Dengue, the most frequent symptoms are less specific than for other diseases. The five symptoms that occur most often with the Dengue prognosis are:

- Nose Bleed
- Rash
- Sudden Fever
- Joint Pain
- Mouth Bleed

These features may be considered important factors to take into account when predicting the prognosis of Dengue disease.

In [147]:
sums

prognosis,sudden_fever,headache,mouth_bleed,nose_bleed,muscle_pain,joint_pain,vomiting,rash,diarrhea,hypotension,...,lymph_swells,breathing_restriction,toe_inflammation,finger_inflammation,lips_irritation,itchiness,ulcers,toenail_loss,speech_problem,bullseye_rash
Chikungunya,40.0,37.0,34.0,45.0,39.0,38.0,35.0,41.0,24.0,6.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Dengue,26.0,23.0,34.0,35.0,30.0,34.0,27.0,34.0,26.0,26.0,...,13.0,1.0,6.0,3.0,6.0,19.0,19.0,14.0,2.0,4.0
Japanese_encephalitis,41.0,34.0,33.0,34.0,37.0,30.0,26.0,34.0,26.0,28.0,...,15.0,3.0,4.0,4.0,6.0,8.0,8.0,6.0,0.0,0.0
Lyme_disease,39.0,33.0,33.0,35.0,30.0,34.0,29.0,30.0,36.0,38.0,...,24.0,19.0,20.0,20.0,20.0,8.0,2.0,6.0,9.0,9.0
Malaria,31.0,28.0,29.0,29.0,27.0,34.0,31.0,34.0,30.0,26.0,...,7.0,4.0,4.0,4.0,4.0,0.0,0.0,0.0,0.0,0.0
Plague,27.0,26.0,22.0,27.0,34.0,26.0,21.0,27.0,20.0,25.0,...,5.0,5.0,6.0,5.0,6.0,1.0,1.0,1.0,2.0,1.0
Rift_Valley_fever,35.0,28.0,37.0,28.0,29.0,30.0,31.0,32.0,25.0,29.0,...,8.0,2.0,8.0,1.0,3.0,18.0,18.0,12.0,4.0,3.0
Tungiasis,16.0,21.0,16.0,14.0,22.0,19.0,25.0,19.0,16.0,14.0,...,7.0,5.0,4.0,5.0,3.0,43.0,43.0,46.0,2.0,1.0
West_Nile_fever,43.0,38.0,45.0,54.0,54.0,29.0,33.0,35.0,36.0,44.0,...,16.0,9.0,12.0,10.0,8.0,9.0,9.0,10.0,2.0,2.0
Yellow_Fever,28.0,31.0,23.0,22.0,31.0,23.0,32.0,31.0,18.0,21.0,...,7.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,0.0,0.0


In [148]:
figure = show_sums_plot(2)
figure.show()

Japanese Encephalitis exhibits symptoms that are relatively uncommon compared to other diseases. The top five symptoms associated with Japanese Encephalitis are:

- Yellow Skin
- Light Sensitivity
- Yellow Eyes
- Coma
- Sudden Fever

These symptoms may be important features to consider for future disease prediction.

In [149]:
figure = show_sums_plot(3)
figure.show()

Lyme Disease also exhibits some less common symptoms. The top five symptoms associated with this disease are Jaundice, Weight Loss, Weakness, Back Pain, and Yellow Skin. These findings could be valuable for later prediction or feature engineering.

In [150]:
figure = show_sums_plot(4)
figure.show()

During the analysis, it was observed that Malaria shares important symptoms with other diseases. The top five most common symptoms in Malaria are:

- Abdominal Pain
- Rash
- Joint Pain
- Light sensitivity
- Yellow Skin

These variables can be considered during feature engineering and model training for accurate disease prediction.

In [151]:
figure = show_sums_plot(5)
figure.show()

In my analysis, the five most frequent symptoms for Plague are light sensitivity, yellow skin, abdominal pain, loss of appetite, and urination loss. As with Malaria, Lyme disease, Japanese Encephalitis, Chikungunya, and Dengue above, these symptoms were identified by how often they occur among records with that prognosis, and they can be considered during feature engineering and model training.

The same procedure applies to the remaining diseases: look at the most common symptoms for each one and draw your own conclusions.

In [152]:
figure = show_sums_plot(6)
figure.show()

In [153]:
figure = show_sums_plot(7)
figure.show()

In [154]:
figure = show_sums_plot(8)
figure.show()

In [155]:
figure = show_sums_plot(9)
figure.show()

In [156]:
figure = show_sums_plot(10)
figure.show()

One tendency observed across all the diseases is that each disease is most closely associated with its own group of symptoms. Although I have only listed the five most common symptoms per disease, the overall symptom profile appears to be well-defined for each condition. This suggests that careful handling of the symptom variables can play a crucial role in feature engineering and model development.

Based on my analysis, there appear to be specific combinations of symptoms that significantly increase the probability of a given prognosis. This information can be crucial for modeling the problem accurately and predicting these diseases.
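
Frequency plots show which symptoms are common, but not whether a symptom is statistically associated with a disease. One possible check, sketched here on made-up data, is a chi-square test of independence on a 2x2 contingency table. The frame `toy` and the `is_tungiasis` flag are hypothetical, chosen only for illustration; with so few rows the p-value is not meaningful in itself.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data: one binary symptom and a one-vs-rest disease flag.
toy = pd.DataFrame({
    "toenail_loss": [1, 1, 1, 1, 0, 0, 0, 0, 0, 1],
    "is_tungiasis": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
})

# Rows: symptom absent/present. Columns: other disease / this disease.
table = pd.crosstab(toy["toenail_loss"], toy["is_tungiasis"])

# Returns the statistic, the p-value, the degrees of freedom, and the
# expected counts under independence.
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}, dof = {dof}")
```

Repeating this for every symptom and disease pair would give a ranking by strength of association rather than by raw frequency.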




## Feature Analysis

I will start by training a dummy classifier to obtain a baseline score. The dummy classifier ignores the features entirely (with `strategy="most_frequent"`, as used below, it always predicts the most frequent class), so any useful model should beat it. Once I have the baseline, I will compare it with the scores of the other models I train later to judge whether they are actually learning from the data, which will help me select the best model for this classification problem.
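
The baseline depends on which dummy strategy is used. The sketch below compares four strategies that scikit-learn's `DummyClassifier` supports, on synthetic data; `X_toy`, `y_toy`, and the class proportions are assumptions made up for the example.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic data for illustration: 3 classes with 100, 60, and 40 samples.
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(200, 5))
y_toy = np.array(["a"] * 100 + ["b"] * 60 + ["c"] * 40)

scores = {}
for strategy in ("most_frequent", "prior", "stratified", "uniform"):
    clf = DummyClassifier(strategy=strategy, random_state=0)
    clf.fit(X_toy, y_toy)
    # Accuracy on the same data; the features are ignored by every strategy.
    scores[strategy] = clf.score(X_toy, y_toy)

for strategy, score in scores.items():
    print(f"{strategy:>14}: {score:.3f}")
```

`most_frequent` and `prior` always predict the largest class, so their accuracy equals that class's share of the data, while `stratified` and `uniform` draw predictions at random.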

In [157]:
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

def model_three_best_proba(model, X, y, top_k=3):
    """Fit `model` on (X, y) and return the `top_k` most probable classes per row."""
    print(f"Dependent variables shape: {y.shape}")
    print(f"Independent variables shape: {X.shape}")

    # Note: the model is (re)fit on the data passed in.
    model.fit(X, y)

    # probas has shape (n_samples, n_classes)
    probas = np.array(model.predict_proba(X))
    predictions_best = []
    for proba in probas:
        # Class indices sorted from most to least probable
        sorted_pred_ids = np.argsort(-proba)
        predictions_best.append([model.classes_[i] for i in sorted_pred_ids[:top_k]])
    return predictions_best


We will use the MAP@3 evaluation metric, which takes the 3 most probable classes predicted by the model, returned as strings. The earlier the correct class appears among those 3, the higher the score.

To establish a baseline, we will apply this metric to a dummy classifier that ignores the features. The function above returns 3 class names per prediction, ordered by the model's predicted probability, which we can then score.
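
To make the metric concrete: when every row has exactly one correct label, the average precision at 3 reduces to 1/rank of the correct class (1, 1/2, or 1/3), or 0 if it is not among the three guesses. The sketch below is a simplified stand-in written for this single-label case, not the `apk`/`mapk` functions defined earlier; the helper name and the sample rows are made up. Note that `apk` itself expects each ground truth as a list, for example `['Zika']`.

```python
def average_precision_at_3(true_label, ranked_predictions):
    """AP@3 for a single correct label: 1/rank of the hit, or 0.0 on a miss."""
    for rank, predicted in enumerate(ranked_predictions[:3], start=1):
        if predicted == true_label:
            return 1.0 / rank
    return 0.0


# Hypothetical ground truth and ranked guesses, for illustration only.
truths = ["Zika", "Dengue", "Malaria"]
ranked = [
    ["Zika", "Dengue", "Plague"],   # correct class at rank 1 -> 1.0
    ["Zika", "Plague", "Dengue"],   # correct class at rank 3 -> 1/3
    ["Zika", "Dengue", "Plague"],   # correct class missing  -> 0.0
]

scores = [average_precision_at_3(t, r) for t, r in zip(truths, ranked)]
map_at_3 = sum(scores) / len(scores)

print(scores)
print(f"MAP@3 = {map_at_3:.4f}")
```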

In [158]:

# All columns except the target are predictors
columns_predictors = [c for c in df.columns if c != "prognosis"]

y = df['prognosis'].values
X = df[columns_predictors].values
fake_clf = DummyClassifier(strategy="most_frequent")

preds = model_three_best_proba(fake_clf, X, y, top_k=3)
print("MAP@3", mapk(df['prognosis'], preds, k=3))
fake_clf.fit(X, y)
print("Score", fake_clf.score(X, y))

Dependent variables shape: (707,)
Independent variables shape: (707, 64)
MAP@3 0.06553512494106553
Score 0.12022630834512023


The dummy classifier sets the baseline: a MAP@3 of about 0.066 and an accuracy of about 12.0%. The next step is to train a real classifier and evaluate it with the MAP@3 function before submitting the final predictions.

I have decided to use a simple decision tree for classification in this project, mainly because I want to study the feature importance of every feature in the dataset with respect to the target variable.

The reason for choosing a decision tree as the baseline model is that the prognosis process is based on heuristics, and a decision tree can approximate such rule-based decisions well. The main idea is to fit a decision tree, tune its hyperparameters, and then read off the feature importances.

Furthermore, decision trees are easier to explain compared to more complex models such as XGBoost or LightGBM.
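
A minimal sketch of tuning a decision tree with `GridSearchCV` and then reading feature importances from the best tree. The data is synthetic (`X_toy`, `y_toy`), the label is built from a hand-written rule on the first two columns, and the parameter grid is an arbitrary assumption, so this only illustrates the mechanics.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary features; the label depends only on columns 0 and 1.
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(300, 6))
y_toy = X_toy[:, 0] * 2 + X_toy[:, 1]

# Hypothetical grid, chosen only for illustration.
param_grid = {"max_depth": [2, 3, 5], "min_samples_leaf": [1, 5]}

search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=3)
search.fit(X_toy, y_toy)

# Importances come from the tree refit on all data with the best parameters.
importances = search.best_estimator_.feature_importances_

print("Best parameters:", search.best_params_)
print("Importances:", importances.round(3))
```

Since the label is a deterministic function of the first two columns, the importances should concentrate on them, which is the kind of signal the feature-importance plots further down look for.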

Now, without fear **all in**...


In [159]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
clf = DecisionTreeClassifier(
    max_depth=5,
    random_state=0,
    max_leaf_nodes=5
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0
)

clf.fit(X_train, y_train)

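# Note: model_three_best_proba refits clf on the data it receives (here, the test split).
preds = model_three_best_proba(clf, X_test, y_test, top_k=3)
print(preds)
print("Baseline Score: ", clf.score(X_test, y_test))
# apk expects each ground truth as a list, aligned row by row with preds
print("MAP@3", mapk([[label] for label in y_test], preds, k=3))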


Dependent variables shape: (177,)
Independent variables shape: (177, 64)
[['Chikungunya', 'Dengue', 'Japanese_encephalitis'], ['Chikungunya', 'Dengue', 'Japanese_encephalitis'], ['Japanese_encephalitis', 'Yellow_Fever', 'Malaria'], ['Chikungunya', 'Dengue', 'Japanese_encephalitis'], ['Chikungunya', 'Dengue', 'Japanese_encephalitis'], ['Japanese_encephalitis', 'Yellow_Fever', 'Malaria'], ['Japanese_encephalitis', 'West_Nile_fever', 'Lyme_disease'], ['Japanese_encephalitis', 'Yellow_Fever', 'Malaria'], ['Japanese_encephalitis', 'Yellow_Fever', 'Malaria'], ['Japanese_encephalitis', 'Yellow_Fever', 'Malaria'], ['Japanese_encephalitis', 'Yellow_Fever', 'Malaria'], ['Japanese_encephalitis', 'Yellow_Fever', 'Malaria'], ['Japanese_encephalitis', 'Yellow_Fever', 'Malaria'], ['Chikungunya', 'Dengue', 'Japanese_encephalitis'], ['Chikungunya', 'Dengue', 'Japanese_encephalitis'], ['Japanese_encephalitis', 'Yellow_Fever', 'Malaria'], ['Chikungunya', 'Dengue', 'Japanese_encephalitis'], ['Japanese_enc

I have observed that the classifier scores relatively low on the complete dataset. While it is slightly better than the fake classifier, I plan to investigate the feature importance of the entire dataset using a decision tree model.

Next, I will treat the problem in a binary manner, one class at a time, as it may lead to improved scores.

In [160]:
features_scores = sorted(zip(df.columns.drop('prognosis'), clf.feature_importances_), key=lambda x: x[1], reverse=True)
features_scores = list(zip(*features_scores))

# One color per bar: the top feature in green, the other four in gray
figure = go.Figure(
    data=[go.Bar(x=features_scores[0][:5], y=features_scores[1][:5], marker_color=["green"] + ["lightslategray"] * 4)])
figure.update_layout(
    width=800,
    height=500,
    title="Most important features of the decision tree model <br> <span style='font-size:.8rem'> <b style='color:green'>Toenail loss</b> seems to be the most important feature</span>"
)
figure.show()

It appears that the model is biased towards the most infrequent features. To address this, I plan to repeat the analysis using each single prognosis as a binary target (that disease versus all the others). By doing so, I hope to mitigate the bias in the model and improve its overall performance.
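
As a compact illustration of what "one prognosis as a binary target" means, the sketch below builds one boolean column per disease with `pd.get_dummies`. The frame `toy` is made up; only the idea of one-vs-rest targets comes from the text.

```python
import pandas as pd

# Hypothetical toy data: one symptom column and a multi-class prognosis.
toy = pd.DataFrame({
    "itchiness": [1, 0, 1, 0],
    "prognosis": ["Tungiasis", "Zika", "Tungiasis", "Malaria"],
})

# One boolean column per disease: True where the row has that prognosis.
binary_targets = pd.get_dummies(toy["prognosis"]).astype(bool)
print(binary_targets)

# Each column can then serve as the target of its own binary classifier.
y_tungiasis = binary_targets["Tungiasis"]
```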

In [161]:
df = pd.read_csv("playground-series-s3e13/train.csv", index_col='id')
prognosis_unique = df['prognosis'].unique()
datasets = {}
for prognosis in prognosis_unique:
    df_prognosis = df.copy()
    # Replace the prognosis with a boolean flag: True for this disease, False otherwise
    df_prognosis['prognosis'] = df['prognosis'] == prognosis

    datasets[prognosis] = df_prognosis

datasets['Lyme_disease']

# Sanity check: rows flagged True must match the original Lyme disease rows
lyme_disease_generated = datasets['Lyme_disease'][datasets['Lyme_disease']['prognosis']].drop('prognosis', axis=1)

lyme_real = df[df['prognosis'] == "Lyme_disease"].drop('prognosis', axis=1)

print("Are the same values:", (lyme_real.values == lyme_disease_generated.values).all())

Are the same values: True


Next, I will write a helper that fits a fresh copy of a given model on each of the binary datasets and reports its test accuracy, so that different model types can be compared disease by disease.

In [162]:
from sklearn.neighbors import KNeighborsClassifier

def evaluate_single_model(model, dataset, key):
    X_train, X_test, y_train, y_test = train_test_split(
        dataset.drop('prognosis', axis=1),
        dataset['prognosis'], random_state=50
    )

    model.fit(X_train, y_train)
    print(f"Model Score for {key}:", model.score(X_test, y_test))

    return model

def get_all_models(dataset_dict, model):
    models = {}
    for key in dataset_dict.keys():
        # `model` is a class, so calling it creates a fresh instance per dataset
        models[key] = evaluate_single_model(model(), dataset_dict[key], key)
    return models



print("Dummy classifiers scores: ")
models = get_all_models(datasets, DummyClassifier)

Dummy classifiers scores: 
Model Score for Lyme_disease: 0.903954802259887
Model Score for Tungiasis: 0.9096045197740112
Model Score for Zika: 0.9096045197740112
Model Score for Rift_Valley_fever: 0.9152542372881356
Model Score for West_Nile_fever: 0.8926553672316384
Model Score for Malaria: 0.9378531073446328
Model Score for Chikungunya: 0.8983050847457628
Model Score for Plague: 0.9265536723163842
Model Score for Dengue: 0.9096045197740112
Model Score for Yellow_Fever: 0.9209039548022598
Model Score for Japanese_encephalitis: 0.8757062146892656


In [163]:
print("KNN Classifier Scores: ")
models = get_all_models(datasets, KNeighborsClassifier)

KNN Classifier Scores: 
Model Score for Lyme_disease: 0.903954802259887
Model Score for Tungiasis: 0.8926553672316384
Model Score for Zika: 0.9096045197740112
Model Score for Rift_Valley_fever: 0.9152542372881356
Model Score for West_Nile_fever: 0.8813559322033898
Model Score for Malaria: 0.9152542372881356
Model Score for Chikungunya: 0.8983050847457628
Model Score for Plague: 0.9096045197740112
Model Score for Dengue: 0.8983050847457628
Model Score for Yellow_Fever: 0.8926553672316384
Model Score for Japanese_encephalitis: 0.8757062146892656


In [164]:
print("DecisionTreeClassifier scores: ")
models = get_all_models(datasets, DecisionTreeClassifier)

DecisionTreeClassifier scores: 
Model Score for Lyme_disease: 0.903954802259887
Model Score for Tungiasis: 0.8192090395480226
Model Score for Zika: 0.8418079096045198
Model Score for Rift_Valley_fever: 0.8418079096045198
Model Score for West_Nile_fever: 0.7288135593220338
Model Score for Malaria: 0.9096045197740112
Model Score for Chikungunya: 0.9265536723163842
Model Score for Plague: 0.8531073446327684
Model Score for Dengue: 0.847457627118644
Model Score for Yellow_Fever: 0.847457627118644
Model Score for Japanese_encephalitis: 0.807909604519774


In [165]:
def plot_features(clf, disease):
    features_scores = sorted(zip(df.columns.drop('prognosis'), clf.feature_importances_), key=lambda x: x[1], reverse=True)
    features_scores = list(zip(*features_scores))
    feature_name = " ".join(features_scores[0][0].split("_")).capitalize()

    # Top feature in green, the other four in gray
    figure = go.Figure(
        data=[go.Bar(x=features_scores[0][:5], y=features_scores[1][:5], marker_color=["green", "lightslategray", "lightslategray", "lightslategray", "lightslategray"])])
    figure.update_layout(
        width=800,
        height=500,
        title=f"Most important features of the decision tree model <br> <span style='font-size:.8rem'> <b style='color:green'>{feature_name}</b> seems to be the most important feature for <b>{disease}</b></span>"
    )
    figure.show()

plot_features(models['Lyme_disease'], 'Lyme Disease')

In [166]:
plot_features(models['Tungiasis'], 'Tungiasis')

In [167]:
plot_features(models['Zika'], 'Zika')

In [168]:
plot_features(models['Rift_Valley_fever'], 'Rift Valley Fever')

In [170]:
plot_features(models['West_Nile_fever'], 'West Nile Fever')

In [171]:
plot_features(models['Malaria'], 'Malaria')

In [172]:
plot_features(models['Chikungunya'], 'Chikungunya')

In [173]:
plot_features(models['Plague'], 'Plague')

In [174]:
plot_features(models['Dengue'], 'Dengue')

In [175]:
plot_features(models['Yellow_Fever'], 'Yellow Fever')

In [176]:
plot_features(models['Japanese_encephalitis'], 'Japanese Encephalitis')

# Conclusion
In conclusion, the analysis indicates that the models are not performing well and are likely guessing rather than making informed decisions based on the data. To improve their accuracy, I recommend exploring more relevant feature combinations that can enhance the model's ability to capture the underlying patterns in the data. Additionally, we can experiment with different modeling techniques and algorithms to identify the most suitable approach for the data at hand.

Further steps could involve collecting more data in the same format (like the original dataset) to improve the quality and diversity of the dataset, conducting a more thorough feature engineering process, and implementing cross-validation techniques to assess the models' robustness. It is important to note that the effectiveness of the proposed steps may vary depending on the nature of the data and the problem we are trying to solve.
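
As a rough sketch of the cross-validation step mentioned above, the example below scores a decision tree with stratified 5-fold cross-validation. The data is synthetic with random labels (`X_toy`, `y_toy` are assumptions for illustration), so the scores should hover around chance level; on the real data one would pass the symptom matrix and the prognosis labels instead.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary features with random 4-class labels, for illustration only.
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(200, 8))
y_toy = rng.integers(0, 4, size=200)

# Stratified folds keep the class proportions similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = DecisionTreeClassifier(max_depth=5, random_state=0)

scores = cross_val_score(model, X_toy, y_toy, cv=cv)

print("Fold accuracies:", scores.round(3))
print(f"Mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Reporting the mean and spread over folds gives a steadier estimate than the single train/test split used in this notebook.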

Thank you for entrusting me with the initial EDA and feature analysis. I appreciate the opportunity to contribute to this project and help drive better insights and outcomes. If you have any further questions or require additional support, please do not hesitate to reach out to me. I look forward to continuing our collaboration in the future.