    SHALLOW MACHINE LEARNING APPROACHES

TF-IDF + Classification

Count Vectorizer + Classification

FastText Embeddings + Classification

ClinicalBERT Embeddings + Classification

### DATA PROCESSING

In [12]:
import re
import unicodedata
import pandas as pd


In [13]:
df = pd.read_csv("../../dataset/MTS-Dialog-TrainingSet.csv")

In [14]:
contraction_map = {
    "i'm": "i am",
    "you're": "you are",
    "he's": "he is",
    "she's": "she is",
    "it's": "it is",
    "that's": "that is",
    "there's": "there is",
    "don't": "do not",
    "doesn't": "does not",
    "didn't": "did not",
    "can't": "can not",
    "won't": "will not",
    "couldn't": "could not",
    "wouldn't": "would not",
    "i've": "i have",
    "we're": "we are",
    "they're": "they are",
    "i'll": "i will",
    "i'd": "i would",
    "let's": "let us",
    "what's": "what is",
    "haven't": "have not",
    "ma'am": "madam",
    "how's": "how is",
    "you've": "you have",
    "we'll": "we will",
    "hasn't": "has not",
    "you'll": "you will",
    "how're": "how are",
    "you'd": "you would",
    "we've": "we have",
    "isn't": "is not",
    "wasn't": "was not",
    "it'll": "it will",
    "here's": "here is"
}


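def expand_contractions(text):
    # Match case-insensitively instead of lowercasing here, so that
    # normalize_text(..., lowercase=False) really preserves the original case.
    # Note: the expansions themselves are inserted in lowercase ("I'm" -> "i am").
    for c, repl in contraction_map.items():
        text = re.sub(r"\b" + re.escape(c) + r"\b", repl, text, flags=re.I)
    return text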

def normalize_text(s, lowercase=True):
    if pd.isna(s):
        return ""

    # Normalize Unicode
    s = unicodedata.normalize("NFKC", str(s))

    # Speaker markers -> temporary placeholder tokens
    s = re.sub(r'\bDoctor[:\-]\s*', ' __doc__ ', s, flags=re.I)
    s = re.sub(r'\bDoctor_2[:\-]\s*', ' __doc2__ ', s, flags=re.I)
    s = re.sub(r'\bPatient[:\-]\s*', ' __pat__ ', s, flags=re.I)
    s = re.sub(r'\bGuest_family[:\-]\s*', ' __fam__ ', s, flags=re.I)
    # With two visitors, the first is labelled Guest_family_1 instead of Guest_family;
    # both map to the same token, so the first (or only) visitor is always <FAMILY>.
    s = re.sub(r'\bGuest_family_1[:\-]\s*', ' __fam__ ', s, flags=re.I)
    s = re.sub(r'\bGuest_family_2[:\-]\s*', ' __fam2__ ', s, flags=re.I)
    s = re.sub(r'\bGuest_clinician[:\-]\s*', ' __clin__ ', s, flags=re.I)
    
    # Expand contractions (expand_contractions is defined above)
    s = expand_contractions(s)

    # Separate punctuation from the surrounding words
    s = re.sub(r'([.,!?;:()"\[\]])', r' \1 ', s)

    # Collapse repeated whitespace
    s = re.sub(r'\s+', ' ', s).strip()

    # Lowercase everything; the placeholder tokens become uppercase tags below
    if lowercase:
        s = s.lower()

    # Restore the speaker tags in uppercase
    s = s.replace('__doc__', '<DOC>')
    s = s.replace('__doc2__', '<DOC2>')
    s = s.replace('__pat__', '<PAT>')
    s = s.replace('__fam__', '<FAMILY>')
    s = s.replace('__fam2__', '<FAMILY2>')
    s = s.replace('__clin__', '<CLIN>')

    return s


# Version for non-contextual embeddings and ELMo (lowercased)
df['dialog_clean'] = df['dialogue'].apply(lambda x: normalize_text(x, lowercase=True))

# Version for Bio/ClinicalBERT (case preserved)
df['dialog_clean_clinicBERT'] = df['dialogue'].apply(lambda x: normalize_text(x, lowercase=False))

# The summaries (section text)
df['section_text_clean'] = df['section_text'].apply(lambda x: normalize_text(x, lowercase=True))
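
A quick way to sanity-check the speaker-tagging step is to run it on a made-up dialogue. The sketch below is not the notebook's `normalize_text`; `tag_speakers` and `SPEAKER_TAGS` are hypothetical, stripped-down stand-ins that only reproduce the marker-to-tag replacement and the whitespace collapse, and the sample sentence is invented for illustration.

```python
import re

# Hypothetical, reduced version of the speaker-tagging step in normalize_text.
SPEAKER_TAGS = {
    "Doctor": "<DOC>",
    "Patient": "<PAT>",
    "Guest_family": "<FAMILY>",
}

def tag_speakers(text):
    """Replace 'Speaker:' / 'Speaker-' markers with tags and collapse whitespace."""
    for name, tag in SPEAKER_TAGS.items():
        pattern = r"\b" + re.escape(name) + r"[:\-]\s*"
        text = re.sub(pattern, f" {tag} ", text, flags=re.I)
    return re.sub(r"\s+", " ", text).strip()

# Invented sample, not taken from the dataset
sample = "Doctor: Any allergies? Patient: No. Guest_family: He reacts to penicillin."
tagged = tag_speakers(sample)
print(tagged)
```

Running the same kind of check against a few real rows of `df['dialogue']` would show whether every speaker marker in the data is covered by the patterns.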



### 1st APPROACH: TF-IDF + Shallow ML Classification Techniques

In [15]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

X = df['dialog_clean']          
y = df['section_header']        

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

tfidf = TfidfVectorizer(
    ngram_range=(1,2),
    min_df=3,
    max_df=0.9,
    sublinear_tf=True
)
X_train_vec = tfidf.fit_transform(X_train)
X_test_vec  = tfidf.transform(X_test)



In [16]:
clf = LogisticRegression(random_state=42)
clf.fit(X_train_vec, y_train)
y_pred = clf.predict(X_test_vec)
print(classification_report(y_test, y_pred, digits=3, zero_division=0))

               precision    recall  f1-score   support

      ALLERGY      1.000     0.833     0.909        12
   ASSESSMENT      0.000     0.000     0.000         7
           CC      0.667     0.133     0.222        15
    DIAGNOSIS      0.000     0.000     0.000         4
  DISPOSITION      0.000     0.000     0.000         3
     EDCOURSE      0.000     0.000     0.000         2
         EXAM      0.000     0.000     0.000         5
    FAM/SOCHX      0.711     0.986     0.826        70
        GENHX      0.541     0.946     0.688        56
        GYNHX      0.000     0.000     0.000         1
      IMAGING      0.000     0.000     0.000         1
IMMUNIZATIONS      0.000     0.000     0.000         2
  MEDICATIONS      1.000     0.364     0.533        11
PASTMEDICALHX      0.812     0.542     0.650        24
 PASTSURGICAL      0.857     0.462     0.600        13
         PLAN      0.000     0.000     0.000         2
   PROCEDURES      0.000     0.000     0.000         1
         

In [22]:
from sklearn.ensemble import RandomForestClassifier

In [23]:
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train_vec, y_train)
y_pred = rf.predict(X_test_vec)
print(classification_report(y_test, y_pred, digits=3, zero_division=0))

               precision    recall  f1-score   support

      ALLERGY      1.000     1.000     1.000        12
   ASSESSMENT      0.000     0.000     0.000         7
           CC      0.833     0.333     0.476        15
    DIAGNOSIS      0.000     0.000     0.000         4
  DISPOSITION      1.000     0.333     0.500         3
     EDCOURSE      0.000     0.000     0.000         2
         EXAM      1.000     0.400     0.571         5
    FAM/SOCHX      0.654     1.000     0.791        70
        GENHX      0.684     0.929     0.788        56
        GYNHX      0.000     0.000     0.000         1
      IMAGING      0.000     0.000     0.000         1
IMMUNIZATIONS      0.000     0.000     0.000         2
  MEDICATIONS      0.833     0.455     0.588        11
PASTMEDICALHX      0.667     0.333     0.444        24
 PASTSURGICAL      0.545     0.462     0.500        13
         PLAN      0.000     0.000     0.000         2
   PROCEDURES      0.000     0.000     0.000         1
         

In [24]:
from sklearn.tree import DecisionTreeClassifier

In [25]:
dtc = DecisionTreeClassifier(random_state=42)
dtc.fit(X_train_vec, y_train)
y_pred = dtc.predict(X_test_vec)
print(classification_report(y_test, y_pred, digits=3, zero_division=0))

               precision    recall  f1-score   support

      ALLERGY      0.500     0.667     0.571        12
   ASSESSMENT      0.167     0.143     0.154         7
           CC      0.357     0.333     0.345        15
    DIAGNOSIS      0.000     0.000     0.000         4
  DISPOSITION      0.333     0.333     0.333         3
     EDCOURSE      0.000     0.000     0.000         2
         EXAM      0.000     0.000     0.000         5
    FAM/SOCHX      0.776     0.843     0.808        70
        GENHX      0.649     0.661     0.655        56
        GYNHX      0.000     0.000     0.000         1
      IMAGING      0.000     0.000     0.000         1
IMMUNIZATIONS      0.000     0.000     0.000         2
  MEDICATIONS      0.750     0.545     0.632        11
PASTMEDICALHX      0.471     0.333     0.390        24
 PASTSURGICAL      0.333     0.154     0.211        13
         PLAN      0.000     0.000     0.000         2
   PROCEDURES      0.000     0.000     0.000         1
         

In [26]:
from sklearn.svm import LinearSVC, SVC

In [27]:
linear_svm = LinearSVC(random_state=42)
linear_svm.fit(X_train_vec, y_train)
y_pred = linear_svm.predict(X_test_vec)
print(classification_report(y_test, y_pred, digits=3, zero_division=0))

               precision    recall  f1-score   support

      ALLERGY      0.917     0.917     0.917        12
   ASSESSMENT      1.000     0.286     0.444         7
           CC      0.667     0.400     0.500        15
    DIAGNOSIS      1.000     0.250     0.400         4
  DISPOSITION      1.000     0.667     0.800         3
     EDCOURSE      0.000     0.000     0.000         2
         EXAM      1.000     0.800     0.889         5
    FAM/SOCHX      0.833     1.000     0.909        70
        GENHX      0.729     0.911     0.810        56
        GYNHX      0.000     0.000     0.000         1
      IMAGING      0.000     0.000     0.000         1
IMMUNIZATIONS      1.000     0.500     0.667         2
  MEDICATIONS      1.000     0.818     0.900        11
PASTMEDICALHX      0.739     0.708     0.723        24
 PASTSURGICAL      0.643     0.692     0.667        13
         PLAN      1.000     0.500     0.667         2
   PROCEDURES      0.000     0.000     0.000         1
         

In [28]:
nonlinear_svm = SVC(random_state=42)
nonlinear_svm.fit(X_train_vec, y_train)
y_pred = nonlinear_svm.predict(X_test_vec)
print(classification_report(y_test, y_pred, digits=3, zero_division=0))

               precision    recall  f1-score   support

      ALLERGY      1.000     0.750     0.857        12
   ASSESSMENT      0.000     0.000     0.000         7
           CC      1.000     0.067     0.125        15
    DIAGNOSIS      0.000     0.000     0.000         4
  DISPOSITION      0.000     0.000     0.000         3
     EDCOURSE      0.000     0.000     0.000         2
         EXAM      0.000     0.000     0.000         5
    FAM/SOCHX      0.714     1.000     0.833        70
        GENHX      0.525     0.946     0.675        56
        GYNHX      0.000     0.000     0.000         1
      IMAGING      0.000     0.000     0.000         1
IMMUNIZATIONS      0.000     0.000     0.000         2
  MEDICATIONS      1.000     0.364     0.533        11
PASTMEDICALHX      0.824     0.583     0.683        24
 PASTSURGICAL      0.833     0.385     0.526        13
         PLAN      0.000     0.000     0.000         2
   PROCEDURES      0.000     0.000     0.000         1
         

    1st Approach Conclusions

Linear Support Vector Machine: 0.797 accuracy

Random Forest: 0.697 accuracy

Logistic Regression: 0.676 accuracy
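
Accuracy alone can look healthy while rare sections such as PLAN or PROCEDURES are never predicted, which is what the zero rows in the reports above show. The sketch below computes accuracy and macro-averaged F1 by hand on made-up labels to illustrate the gap; the helper names and the toy labels are assumptions for illustration. In the notebook itself, `sklearn.metrics.f1_score(y_test, y_pred, average="macro")` would give the macro figure directly.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so every class counts equally."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

# Made-up labels: a classifier that always predicts the majority class
y_true_toy = ["GENHX"] * 8 + ["PLAN"] * 2
y_pred_toy = ["GENHX"] * 10

acc = accuracy(y_true_toy, y_pred_toy)
mf1 = macro_f1(y_true_toy, y_pred_toy)
print(f"accuracy = {acc:.3f}, macro F1 = {mf1:.3f}")
```

Reporting both numbers for each model would make the comparison between approaches less dependent on the two dominant classes.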

### 2nd APPROACH: Count Vectorizer + Shallow ML Classification Techniques

In [29]:
from sklearn.feature_extraction.text import CountVectorizer

In [30]:
count_vect = CountVectorizer(
    ngram_range=(1, 2),
    min_df=3,
    max_df=0.9
)

X_train_count = count_vect.fit_transform(X_train)
X_test_count = count_vect.transform(X_test)

In [31]:
clf = LogisticRegression(random_state=42)
clf.fit(X_train_count, y_train)
y_pred = clf.predict(X_test_count)
print(classification_report(y_test, y_pred, digits=3, zero_division=0))

               precision    recall  f1-score   support

      ALLERGY      0.909     0.833     0.870        12
   ASSESSMENT      1.000     0.143     0.250         7
           CC      0.474     0.600     0.529        15
    DIAGNOSIS      0.000     0.000     0.000         4
  DISPOSITION      0.500     0.333     0.400         3
     EDCOURSE      0.000     0.000     0.000         2
         EXAM      0.667     0.800     0.727         5
    FAM/SOCHX      0.810     0.971     0.883        70
        GENHX      0.818     0.804     0.811        56
        GYNHX      0.000     0.000     0.000         1
      IMAGING      0.000     0.000     0.000         1
IMMUNIZATIONS      0.000     0.000     0.000         2
  MEDICATIONS      0.778     0.636     0.700        11
PASTMEDICALHX      0.690     0.833     0.755        24
 PASTSURGICAL      0.727     0.615     0.667        13
         PLAN      1.000     0.500     0.667         2
   PROCEDURES      0.000     0.000     0.000         1
         

In [None]:
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train_count, y_train)
y_pred = rf.predict(X_test_count)
print(classification_report(y_test, y_pred, digits=3, zero_division=0))

               precision    recall  f1-score   support

      ALLERGY      0.923     1.000     0.960        12
   ASSESSMENT      0.000     0.000     0.000         7
           CC      0.455     0.333     0.385        15
    DIAGNOSIS      0.000     0.000     0.000         4
  DISPOSITION      1.000     0.333     0.500         3
     EDCOURSE      0.000     0.000     0.000         2
         EXAM      1.000     0.200     0.333         5
    FAM/SOCHX      0.739     0.971     0.840        70
        GENHX      0.680     0.911     0.779        56
        GYNHX      0.000     0.000     0.000         1
      IMAGING      0.000     0.000     0.000         1
IMMUNIZATIONS      0.000     0.000     0.000         2
  MEDICATIONS      0.600     0.818     0.692        11
PASTMEDICALHX      0.769     0.417     0.541        24
 PASTSURGICAL      0.545     0.462     0.500        13
         PLAN      0.000     0.000     0.000         2
   PROCEDURES      0.000     0.000     0.000         1
         

In [33]:
dtc = DecisionTreeClassifier(random_state=42)
dtc.fit(X_train_count, y_train)
y_pred = dtc.predict(X_test_count)
print(classification_report(y_test, y_pred, digits=3, zero_division=0))


               precision    recall  f1-score   support

      ALLERGY      0.700     0.583     0.636        12
   ASSESSMENT      1.000     0.143     0.250         7
           CC      0.294     0.333     0.312        15
    DIAGNOSIS      0.000     0.000     0.000         4
  DISPOSITION      0.333     0.333     0.333         3
     EDCOURSE      0.000     0.000     0.000         2
         EXAM      0.667     0.400     0.500         5
    FAM/SOCHX      0.800     0.857     0.828        70
        GENHX      0.729     0.625     0.673        56
        GYNHX      0.000     0.000     0.000         1
      IMAGING      0.000     0.000     0.000         1
IMMUNIZATIONS      0.500     0.500     0.500         2
  MEDICATIONS      0.286     0.364     0.320        11
PASTMEDICALHX      0.344     0.458     0.393        24
 PASTSURGICAL      0.600     0.462     0.522        13
         PLAN      0.000     0.000     0.000         2
   PROCEDURES      0.000     0.000     0.000         1
         

In [34]:
linear_svm = LinearSVC(random_state=42)
linear_svm.fit(X_train_count, y_train)
y_pred = linear_svm.predict(X_test_count)
print(classification_report(y_test, y_pred, digits=3, zero_division=0))

               precision    recall  f1-score   support

      ALLERGY      1.000     0.833     0.909        12
   ASSESSMENT      0.500     0.286     0.364         7
           CC      0.389     0.467     0.424        15
    DIAGNOSIS      1.000     0.250     0.400         4
  DISPOSITION      0.667     0.667     0.667         3
     EDCOURSE      0.000     0.000     0.000         2
         EXAM      0.600     0.600     0.600         5
    FAM/SOCHX      0.870     0.957     0.912        70
        GENHX      0.857     0.857     0.857        56
        GYNHX      0.000     0.000     0.000         1
      IMAGING      0.000     0.000     0.000         1
IMMUNIZATIONS      0.500     0.500     0.500         2
  MEDICATIONS      0.615     0.727     0.667        11
PASTMEDICALHX      0.731     0.792     0.760        24
 PASTSURGICAL      0.769     0.769     0.769        13
         PLAN      1.000     0.500     0.667         2
   PROCEDURES      0.000     0.000     0.000         1
         

In [35]:
nonlinear_svm = SVC(random_state=42)
nonlinear_svm.fit(X_train_count, y_train)
y_pred = nonlinear_svm.predict(X_test_count)
print(classification_report(y_test, y_pred, digits=3, zero_division=0))

               precision    recall  f1-score   support

      ALLERGY      1.000     0.250     0.400        12
   ASSESSMENT      0.000     0.000     0.000         7
           CC      0.667     0.267     0.381        15
    DIAGNOSIS      0.000     0.000     0.000         4
  DISPOSITION      0.000     0.000     0.000         3
     EDCOURSE      0.000     0.000     0.000         2
         EXAM      0.000     0.000     0.000         5
    FAM/SOCHX      0.471     0.943     0.629        70
        GENHX      0.738     0.804     0.769        56
        GYNHX      0.000     0.000     0.000         1
      IMAGING      0.000     0.000     0.000         1
IMMUNIZATIONS      0.000     0.000     0.000         2
  MEDICATIONS      1.000     0.182     0.308        11
PASTMEDICALHX      0.667     0.500     0.571        24
 PASTSURGICAL      1.000     0.385     0.556        13
         PLAN      0.000     0.000     0.000         2
   PROCEDURES      0.000     0.000     0.000         1
         

    2nd Approach Conclusions

Linear Support Vector Machine: 0.780 accuracy

Logistic Regression: 0.763 accuracy

### 3rd APPROACH: FastText Embeddings + Shallow ML Classification Techniques

In [36]:
import gensim.downloader as api

In [37]:
ft = api.load("fasttext-wiki-news-subwords-300")   # FastText

In [38]:
import numpy as np

In [39]:
def doc_vector(tokens, model):
    vecs = [model[w] for w in tokens if w in model.key_to_index]
    if not vecs:
        return np.zeros(model.vector_size)
    return np.mean(vecs, axis=0)

X_emb = df['dialog_clean'].apply(lambda t: doc_vector(t.split(), ft))
X_mat = np.vstack(X_emb.values)

X_train, X_test, y_train, y_test = train_test_split(
    X_mat, y, test_size=0.2, stratify=y, random_state=42)
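
Because `doc_vector` silently skips tokens that are missing from the embedding vocabulary, it can be worth measuring how much of the text actually contributes to the averaged vector. The sketch below is an assumption-laden illustration: `vocab_coverage` is a hypothetical helper, and `toy_vocab` stands in for `ft.key_to_index`. Speaker tags such as `<DOC>` are likely absent from a pretrained vocabulary, so they would be dropped in the same way.

```python
def vocab_coverage(token_lists, vocab):
    """Fraction of all tokens that are present in `vocab`."""
    total = sum(len(tokens) for tokens in token_lists)
    known = sum(1 for tokens in token_lists for t in tokens if t in vocab)
    return known / total if total else 0.0

# Toy stand-in for ft.key_to_index; the real vocabulary is far larger
toy_vocab = {"any", "allergies", "no", "?", "."}
toy_docs = ["<DOC> any allergies ?".split(), "<PAT> no .".split()]

coverage = vocab_coverage(toy_docs, toy_vocab)
print(f"coverage = {coverage:.3f}")
```

With the real model, the call would look like `vocab_coverage(df['dialog_clean'].str.split(), ft.key_to_index)`.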



In [40]:
clf = LogisticRegression(class_weight='balanced', max_iter=300)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred, digits=3, zero_division=0))

               precision    recall  f1-score   support

      ALLERGY      0.379     0.917     0.537        12
   ASSESSMENT      0.333     0.143     0.200         7
           CC      0.750     0.200     0.316        15
    DIAGNOSIS      0.333     0.250     0.286         4
  DISPOSITION      0.125     0.667     0.211         3
     EDCOURSE      0.000     0.000     0.000         2
         EXAM      0.500     0.400     0.444         5
    FAM/SOCHX      1.000     0.086     0.158        70
        GENHX      0.800     0.071     0.131        56
        GYNHX      0.333     1.000     0.500         1
      IMAGING      0.000     0.000     0.000         1
IMMUNIZATIONS      0.062     0.500     0.111         2
         LABS      0.000     0.000     0.000         0
  MEDICATIONS      0.222     0.364     0.276        11
OTHER_HISTORY      0.000     0.000     0.000         0
PASTMEDICALHX      1.000     0.125     0.222        24
 PASTSURGICAL      0.278     0.385     0.323        13
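
The logistic regression above is the first model in this notebook fitted with `class_weight='balanced'`, which in scikit-learn weights each class by `n_samples / (n_classes * count)`. That is one plausible factor behind the shift in this report, where rare classes gain recall and FAM/SOCHX and GENHX lose it. The sketch below reproduces the formula on a made-up label distribution; `balanced_class_weights` is a hypothetical helper and the counts are not the real training split.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weights as n_samples / (n_classes * count), the 'balanced' heuristic."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}

# Made-up label distribution, not the real training split
toy_labels = ["FAM/SOCHX"] * 70 + ["GENHX"] * 56 + ["PLAN"] * 2
weights = balanced_class_weights(toy_labels)

for label, w in sorted(weights.items()):
    print(f"{label:<10} {w:.2f}")
```

In this toy case a single PLAN example weighs roughly 35 times as much as a FAM/SOCHX example, which shows how strongly the heuristic can pull the decision boundary toward rare classes.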
         

In [41]:
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred, digits=3, zero_division=0))

               precision    recall  f1-score   support

      ALLERGY      0.909     0.833     0.870        12
   ASSESSMENT      0.000     0.000     0.000         7
           CC      0.833     0.333     0.476        15
    DIAGNOSIS      0.000     0.000     0.000         4
  DISPOSITION      1.000     0.333     0.500         3
     EDCOURSE      0.000     0.000     0.000         2
         EXAM      1.000     0.200     0.333         5
    FAM/SOCHX      0.600     0.943     0.733        70
        GENHX      0.623     0.857     0.722        56
        GYNHX      0.000     0.000     0.000         1
      IMAGING      0.000     0.000     0.000         1
IMMUNIZATIONS      0.000     0.000     0.000         2
  MEDICATIONS      1.000     0.364     0.533        11
PASTMEDICALHX      0.357     0.208     0.263        24
 PASTSURGICAL      0.667     0.462     0.545        13
         PLAN      0.000     0.000     0.000         2
   PROCEDURES      0.000     0.000     0.000         1
         

In [42]:
dtc = DecisionTreeClassifier(random_state=42)
dtc.fit(X_train, y_train)
y_pred = dtc.predict(X_test)
print(classification_report(y_test, y_pred, digits=3, zero_division=0))

               precision    recall  f1-score   support

      ALLERGY      0.538     0.583     0.560        12
   ASSESSMENT      0.000     0.000     0.000         7
           CC      0.133     0.133     0.133        15
    DIAGNOSIS      0.000     0.000     0.000         4
  DISPOSITION      0.000     0.000     0.000         3
     EDCOURSE      0.000     0.000     0.000         2
         EXAM      0.000     0.000     0.000         5
    FAM/SOCHX      0.706     0.686     0.696        70
        GENHX      0.585     0.554     0.569        56
        GYNHX      0.000     0.000     0.000         1
      IMAGING      0.000     0.000     0.000         1
IMMUNIZATIONS      0.000     0.000     0.000         2
         LABS      0.000     0.000     0.000         0
  MEDICATIONS      0.429     0.273     0.333        11
OTHER_HISTORY      0.000     0.000     0.000         0
PASTMEDICALHX      0.344     0.458     0.393        24
 PASTSURGICAL      0.357     0.385     0.370        13
         

In [43]:
linear_svm = LinearSVC(random_state=42)
linear_svm.fit(X_train, y_train)
y_pred = linear_svm.predict(X_test)
print(classification_report(y_test, y_pred, digits=3, zero_division=0))

               precision    recall  f1-score   support

      ALLERGY      0.917     0.917     0.917        12
   ASSESSMENT      0.000     0.000     0.000         7
           CC      1.000     0.267     0.421        15
    DIAGNOSIS      0.000     0.000     0.000         4
  DISPOSITION      0.000     0.000     0.000         3
     EDCOURSE      0.000     0.000     0.000         2
         EXAM      0.000     0.000     0.000         5
    FAM/SOCHX      0.731     0.971     0.834        70
        GENHX      0.420     0.839     0.560        56
        GYNHX      0.000     0.000     0.000         1
      IMAGING      0.000     0.000     0.000         1
IMMUNIZATIONS      0.000     0.000     0.000         2
  MEDICATIONS      1.000     0.364     0.533        11
PASTMEDICALHX      1.000     0.208     0.345        24
 PASTSURGICAL      0.667     0.154     0.250        13
         PLAN      0.000     0.000     0.000         2
   PROCEDURES      0.000     0.000     0.000         1
         

In [44]:
nonlinear_svm = SVC(random_state=42)
nonlinear_svm.fit(X_train, y_train)
y_pred = nonlinear_svm.predict(X_test)
print(classification_report(y_test, y_pred, digits=3, zero_division=0))

               precision    recall  f1-score   support

      ALLERGY      0.818     0.750     0.783        12
   ASSESSMENT      0.000     0.000     0.000         7
           CC      0.000     0.000     0.000        15
    DIAGNOSIS      0.000     0.000     0.000         4
  DISPOSITION      0.000     0.000     0.000         3
     EDCOURSE      0.000     0.000     0.000         2
         EXAM      0.000     0.000     0.000         5
    FAM/SOCHX      0.634     0.914     0.749        70
        GENHX      0.423     0.929     0.581        56
        GYNHX      0.000     0.000     0.000         1
      IMAGING      0.000     0.000     0.000         1
IMMUNIZATIONS      0.000     0.000     0.000         2
  MEDICATIONS      0.000     0.000     0.000        11
PASTMEDICALHX      0.000     0.000     0.000        24
 PASTSURGICAL      0.000     0.000     0.000        13
         PLAN      0.000     0.000     0.000         2
   PROCEDURES      0.000     0.000     0.000         1
         

    3rd Approach Conclusions

Random Forest: 0.635 accuracy

Linear Support Vector Machine: 0.614 accuracy

### 4th APPROACH: Prebuilt ClinicalBERT Embeddings + Shallow ML Classification Techniques

In [45]:
# Load pre-extracted embeddings
embeddings_df = pd.read_csv(
    '../../embedding_projector/clinical_bert_embeddings_tsv.tsv',
    sep='\t',
    header=None
)
metadata_df = pd.read_csv(
    '../../embedding_projector/clinical_bert_metadata.tsv',
    sep='\t'
)

X = embeddings_df.values  # shape: (n_samples, 768)
y = metadata_df['section_header']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

In [46]:
clf = LogisticRegression(
    class_weight='balanced',
    max_iter=500,
    random_state=42
)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred, digits=3, zero_division=0))

               precision    recall  f1-score   support

      allergy      0.786     0.917     0.846        12
   assessment      0.000     0.000     0.000         7
           cc      0.412     0.467     0.438        15
    diagnosis      0.000     0.000     0.000         4
  disposition      0.400     0.667     0.500         3
     edcourse      0.000     0.000     0.000         2
         exam      0.667     0.800     0.727         5
    fam/sochx      0.836     0.871     0.853        70
        genhx      0.896     0.768     0.827        56
        gynhx      0.000     0.000     0.000         1
      imaging      0.000     0.000     0.000         1
immunizations      0.000     0.000     0.000         2
  medications      0.625     0.909     0.741        11
pastmedicalhx      0.524     0.458     0.489        24
 pastsurgical      0.727     0.615     0.667        13
         plan      0.000     0.000     0.000         2
   procedures      0.000     0.000     0.000         1
         

In [47]:
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred, digits=3, zero_division=0))

               precision    recall  f1-score   support

      allergy      1.000     0.750     0.857        12
   assessment      0.000     0.000     0.000         7
           cc      0.333     0.267     0.296        15
    diagnosis      0.000     0.000     0.000         4
  disposition      0.000     0.000     0.000         3
     edcourse      0.000     0.000     0.000         2
         exam      1.000     0.200     0.333         5
    fam/sochx      0.570     0.929     0.707        70
        genhx      0.754     0.875     0.810        56
        gynhx      0.000     0.000     0.000         1
      imaging      0.000     0.000     0.000         1
immunizations      0.000     0.000     0.000         2
  medications      0.583     0.636     0.609        11
pastmedicalhx      0.357     0.208     0.263        24
 pastsurgical      0.500     0.231     0.316        13
         plan      0.000     0.000     0.000         2
   procedures      0.000     0.000     0.000         1
         

In [48]:
dtc = DecisionTreeClassifier(random_state=42)
dtc.fit(X_train, y_train)
y_pred = dtc.predict(X_test)
print(classification_report(y_test, y_pred, digits=3, zero_division=0))

               precision    recall  f1-score   support

      allergy      0.417     0.417     0.417        12
   assessment      0.111     0.143     0.125         7
           cc      0.400     0.267     0.320        15
    diagnosis      0.000     0.000     0.000         4
  disposition      0.000     0.000     0.000         3
     edcourse      0.000     0.000     0.000         2
         exam      0.400     0.400     0.400         5
    fam/sochx      0.677     0.600     0.636        70
        genhx      0.698     0.661     0.679        56
        gynhx      0.000     0.000     0.000         1
      imaging      0.000     0.000     0.000         1
immunizations      0.333     0.500     0.400         2
  medications      0.154     0.182     0.167        11
pastmedicalhx      0.129     0.167     0.145        24
 pastsurgical      0.182     0.308     0.229        13
         plan      0.000     0.000     0.000         2
   procedures      0.000     0.000     0.000         1
         

In [49]:
linear_svm = LinearSVC(random_state=42)
linear_svm.fit(X_train, y_train)
y_pred = linear_svm.predict(X_test)
print(classification_report(y_test, y_pred, digits=3, zero_division=0))

               precision    recall  f1-score   support

      allergy      0.706     1.000     0.828        12
   assessment      0.000     0.000     0.000         7
           cc      0.600     0.600     0.600        15
    diagnosis      0.000     0.000     0.000         4
  disposition      0.000     0.000     0.000         3
     edcourse      0.000     0.000     0.000         2
         exam      0.667     0.800     0.727         5
    fam/sochx      0.805     0.886     0.844        70
        genhx      0.797     0.839     0.817        56
        gynhx      0.000     0.000     0.000         1
      imaging      0.000     0.000     0.000         1
immunizations      0.000     0.000     0.000         2
  medications      0.714     0.909     0.800        11
pastmedicalhx      0.458     0.458     0.458        24
 pastsurgical      0.800     0.615     0.696        13
         plan      0.000     0.000     0.000         2
   procedures      0.000     0.000     0.000         1
         

In [50]:
nonlinear_svm = SVC(random_state=42)
nonlinear_svm.fit(X_train, y_train)
y_pred = nonlinear_svm.predict(X_test)
print(classification_report(y_test, y_pred, digits=3, zero_division=0))

               precision    recall  f1-score   support

      allergy      1.000     0.417     0.588        12
   assessment      0.000     0.000     0.000         7
           cc      0.500     0.333     0.400        15
    diagnosis      0.000     0.000     0.000         4
  disposition      0.000     0.000     0.000         3
     edcourse      0.000     0.000     0.000         2
         exam      0.000     0.000     0.000         5
    fam/sochx      0.522     1.000     0.686        70
        genhx      0.690     0.875     0.772        56
        gynhx      0.000     0.000     0.000         1
      imaging      0.000     0.000     0.000         1
immunizations      0.000     0.000     0.000         2
  medications      1.000     0.364     0.533        11
pastmedicalhx      0.333     0.083     0.133        24
 pastsurgical      1.000     0.231     0.375        13
         plan      0.000     0.000     0.000         2
   procedures      0.000     0.000     0.000         1
         

    4th Approach Conclusions

Linear Support Vector Machine: 0.710 accuracy

Logistic Regression: 0.689 accuracy

Random Forest: 0.622 accuracy

# OVERALL CONCLUSION

# Summary of Experimental Results

| Approach | Feature Type | Best Model | Best Accuracy | Observation |
|----------|--------------|------------|---------------|-------------|
| 1st Approach | TF-IDF | Linear SVM | 0.797 | **Best performing** in most experiments |
| 2nd Approach | Count Vectorizer | Linear SVM | 0.780 | Slightly weaker than TF-IDF, but workable |
| 3rd Approach | FastText Embeddings | Random Forest | 0.635 | Considerably worse; classical ML models do not work well with this type of embedding |
| 4th Approach | ClinicalBERT Embeddings | Linear SVM | 0.710 | Better than the FastText embeddings, but worse than TF-IDF + SVM |

## Conclusions

- **TF-IDF and Count Vectorizer** outperform the embeddings overall and classify more effectively.

- **Linear SVM** performs well regardless of the feature representation: it is the best model in three of the four approaches and a close second with FastText (0.614 vs. 0.635 for Random Forest). This goes against our initial expectation that logistic regression would do better.

- **TF-IDF + Linear SVM** is the strongest combination, with a surprisingly high accuracy (0.797).
  This suggests that this combination captures the discriminative features of the dialogues well.

- **FastText and ClinicalBERT embeddings**: using this kind of embedding with classical machine learning is not the best option, which we expected, since these embeddings are geared more toward deep learning.

- **ClinicalBERT** improves on the previous approach (0.710 vs. 0.635), which indicates that the choice of embedding matters, but it still falls short when paired with shallow machine learning techniques.
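
Since several classes have only one or two test examples, a single 80/20 split gives a noisy estimate. As a minimal sketch, the TF-IDF + Linear SVM combination can be wrapped in a scikit-learn pipeline and scored with stratified cross-validation, so the vectorizer is refitted inside each fold. The corpus below is made up, `min_df=3` is dropped because the toy corpus is tiny, and the choice of three folds is an assumption for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up miniature corpus; in the notebook this would be
# df['dialog_clean'] and df['section_header'].
texts = [
    "any allergies to medications", "allergic to penicillin",
    "no known drug allergies", "allergy to peanuts and latex",
    "she is allergic to sulfa", "no allergies at all",
    "had my appendix removed", "surgery on the left knee",
    "gallbladder surgery last year", "no prior surgeries",
    "tonsils removed as a child", "hernia repair surgery",
]
labels = ["ALLERGY"] * 6 + ["PASTSURGICAL"] * 6

# Vectorizer and classifier are fitted together on each training fold
pipe = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    LinearSVC(random_state=42),
)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(pipe, texts, labels, cv=cv, scoring="f1_macro")
print(f"macro F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```

On the real dataset, classes with fewer examples than the number of folds would need to be merged or handled separately before stratified splitting behaves as intended.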

## Final Key Takeaway

For this dataset and classification task, **the simplest text representations (TF-IDF) with a strong linear model (SVM) outperform more modern representations (embeddings) combined with traditional machine learning**.

Even so, we expect that fine-tuning a model based on convolutional neural networks (CNN) would work much better.