# Keyword Detection on Websites



## Assignment
Your task is to create an algorithm that takes an HTML page as input and infers whether the page contains information about a cancer tumor board. What is a tumor board? A tumor board is a consilium of doctors (usually from different disciplines) who discuss the cancer cases in their departments. If you want to know more, please read this article.

The expected result is a CSV file for the test data with the columns doc_id and prediction.

Bonus: if you would like to go the extra mile, try to identify the tumor board types: interdisciplinary, breast, and a third type of your choice. For these tumor boards, try to identify their schedule: day (e.g. Friday), frequency (e.g. weekly, bi-weekly, monthly), and start time.

## Data Description
You have train.csv and test.csv files and a folder with the corresponding .html files.

Files:

- train.csv contains the following columns: url, doc_id and label
- test.csv contains the following columns: url and doc_id
- htmls contains files named {doc_id}.html
- keyword2tumor_type.csv contains useful keywords for the types of tumor boards

Description of tumor board labels:

- 1 (no evidence): tumor boards are not mentioned on the page
- 2 (medium confidence): tumor boards are mentioned, but the page is not completely dedicated to tumor board description
- 3 (high confidence): the page is completely dedicated to the description of tumor board types and dates

You are asked to prepare a model using the HTML files referred to in train.csv and to make predictions for the HTML files from test.csv.

## Practicalities
You should prepare a Jupyter Notebook with the code that you used for making the predictions and the following documentation:

- How did you decide to handle this amount of data?
- How did you decide to do feature engineering?
- How did you decide which models to try (if you decide to train any models)?
- How did you perform validation of your model?
- What metrics did you measure?
- How do you expect your model to perform on test data (in terms of your metrics)?
- How fast will your algorithm perform, and how could you improve its performance if you had more time?
- How do you think you would be able to improve your algorithm if you had more data?
- What potential issues do you see with your algorithm?

## Tips
To extract clean text from a page, you can use the BeautifulSoup module like this:

from bs4 import BeautifulSoup

content = read_html()  # placeholder: returns the page's HTML as a string

soup = BeautifulSoup(content, 'html.parser')

clean_text = soup.get_text(' ')


If you decide that you don't need certain tags in your document, for example `<p>`, you can remove them like this:

from bs4 import BeautifulSoup

content = read_html()  # placeholder: returns the page's HTML as a string

soup = BeautifulSoup(content, 'html.parser')

for tag in soup.find_all('p'):
    tag.decompose()
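
In the same spirit as the `<p>` example above, pages usually carry `<script>`, `<style>` and `<noscript>` blocks whose contents are not page text. The following is a minimal sketch, assuming BeautifulSoup 4 with the built-in `html.parser`; the function name `visible_text` and the sample HTML are made up for illustration.

```python
from bs4 import BeautifulSoup


def visible_text(html: str) -> str:
    """Return page text with script, style and noscript content removed."""
    soup = BeautifulSoup(html, "html.parser")
    # These tags hold code or fallback notices rather than page content.
    for tag in soup.find_all(["script", "style", "noscript"]):
        tag.decompose()
    # Join text nodes with spaces, then collapse runs of whitespace.
    return " ".join(soup.get_text(" ").split())


# Made-up sample page for illustration only.
sample = (
    "<html><head><style>p {color: red;}</style></head>"
    "<body><p>Tumorkonferenz jeden Dienstag</p>"
    "<script>alert('x');</script></body></html>"
)
print(visible_text(sample))
```

Whether removing these tags helps the classifier is something to check on the data; it mainly keeps JavaScript and CSS tokens out of the vocabulary.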

#### To download the dataset <a href="https://drive.google.com/drive/folders/1Qs2fLj9HmAzx2YGKmqkePCa1Acs5JY3Z?usp=sharing"> Click here </a>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import re
import warnings as w
w.filterwarnings('ignore')


from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

In [2]:
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
keyword_to_tumor_df = pd.read_csv('keyword2tumor_type.csv')
html_folder = "htmls/"
# the HTML files in this folder are named {doc_id}.html

In [3]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   url     100 non-null    object
 1   doc_id  100 non-null    int64 
 2   label   100 non-null    int64 
dtypes: int64(2), object(1)
memory usage: 2.5+ KB


In [4]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48 entries, 0 to 47
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   url     48 non-null     object
 1   doc_id  48 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 900.0+ bytes


Neither dataframe contains null values.

In [5]:
train_df.head()

|   | url | doc_id | label |
|---|---|---|---|
| 0 | http://elbe-elster-klinikum.de/fachbereiche/ch... | 1 | 1 |
| 1 | http://klinikum-bayreuth.de/einrichtungen/zent... | 3 | 3 |
| 2 | http://klinikum-braunschweig.de/info.php/?id_o... | 4 | 1 |
| 3 | http://klinikum-braunschweig.de/info.php/?id_o... | 5 | 1 |
| 4 | http://klinikum-braunschweig.de/zuweiser/tumor... | 6 | 3 |


In [6]:
test_df.head()

|   | url | doc_id |
|---|---|---|
| 0 | http://chirurgie-goettingen.de/medizinische-ve... | 0 |
| 1 | http://evkb.de/kliniken-zentren/chirurgie/allg... | 2 |
| 2 | http://krebszentrum.kreiskliniken-reutlingen.d... | 7 |
| 3 | http://marienhospital-buer.de/mhb-av-chirurgie... | 15 |
| 4 | http://marienhospital-buer.de/mhb-av-chirurgie... | 16 |


Here we have only the URL and doc_id for each file. The actual files are in the htmls folder, where each file is named {doc_id}.html. We can extract all the text from there.

In [7]:
keyword_to_tumor_df.head()

|   | keyword | tumor_type |
|---|---|---|
| 0 | senologische | Brust |
| 1 | brustzentrum | Brust |
| 2 | breast | Brust |
| 3 | thorax | Brust |
| 4 | thorakale | Brust |


In [8]:
# Extract the text from the HTML file for a given doc_id
def html_to_text(doc_id):
    file_path = os.path.join(html_folder, f'{doc_id}.html')
    if not os.path.exists(file_path):
        print(f'File not found: {file_path}')
        return ""

    # Decoding as UTF-8 raised an error on this data, so latin-1 is used here.
    # The pages are German; latin-1 never fails, but it garbles UTF-8 umlauts
    # (e.g. 'Ã¤' instead of 'ä'), as the outputs below show.
    with open(file_path, 'r', encoding='latin-1') as f:
        content = f.read()

    soup = BeautifulSoup(content, 'html.parser')
    clean_text = soup.get_text(' ')  # join text nodes with spaces
    return clean_text.strip()
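
The mojibake in the outputs below (for example `SozialpÃ¤diatrisches`) is what UTF-8 bytes look like when decoded as latin-1. One possible alternative, sketched here under the assumption that most pages are UTF-8 and only a few are not, is to try UTF-8 first and fall back to latin-1. The helper name `decode_html_bytes` is hypothetical and not part of the notebook.

```python
def decode_html_bytes(raw: bytes) -> str:
    """Decode raw page bytes, preferring UTF-8 and falling back to Latin-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this never raises.
        return raw.decode("latin-1")


# The same German phrase stored in two different encodings.
utf8_page = "Sozialpädiatrisches Zentrum".encode("utf-8")
latin1_page = "Sozialpädiatrisches Zentrum".encode("latin-1")

print(decode_html_bytes(utf8_page))
print(decode_html_bytes(latin1_page))
# What happens when UTF-8 bytes are read as Latin-1 directly:
print(utf8_page.decode("latin-1"))
```

To use this inside `html_to_text`, the file would be opened in binary mode (`'rb'`) and the result of `f.read()` passed to the helper.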

In [9]:
# extracting text for train and test data 
train_df['text'] = train_df['doc_id'].apply(html_to_text)
test_df['text']=test_df['doc_id'].apply(html_to_text)

In [10]:
train_df['text'].head(10)

0    Elbe-Elster Klinikum - Chirurgie Finsterwalde ...
1    Onkologisches Zentrum - Klinikum Bayreuth \n \...
2    Zentrum - SozialpÃ¤diatrisches Zentrum - StÃ¤d...
3    Leistung - Spezielle UnterstÃ¼tzung bei der An...
4    Zuweiser - Tumorkonferenzen - Tumorkonferenz G...
5    Krebszentrum Reutlingen: Impressum - Kreisklin...
6    Ãsthetische Brustchirurgie - krebszentrum.kre...
7    Presse und Auszeichnungen - krebszentrum.kreis...
8    Hautkrebs - krebszentrum.kreiskliniken-reutlin...
9    Magenkrebs - krebszentrum.kreiskliniken-reutli...
Name: text, dtype: object

In [11]:
# Lowercase, collapse whitespace, and remove punctuation
def filter_text(text):
    text = text.lower()
    text = re.sub(r'\s+', ' ', text)     # collapse runs of whitespace
    text = re.sub(r'[^\w\s]', '', text)  # drop punctuation and symbols
    return text
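
Because `filter_text` collapses whitespace before it removes punctuation, a sequence such as ` - ` leaves a double space behind, and hyphenated names are fused (`Elbe-Elster` becomes `elbeelster`). A possible variant, shown only as a sketch, replaces punctuation with a space and collapses whitespace last. The name `normalize_text` is hypothetical.

```python
import re


def normalize_text(text: str) -> str:
    """Lowercase, turn punctuation into spaces, then collapse whitespace."""
    text = text.lower()
    # Replace anything that is neither a word character nor whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse whitespace last so no double spaces remain.
    return re.sub(r"\s+", " ", text).strip()


print(normalize_text("Elbe-Elster Klinikum - Chirurgie \n Finsterwalde"))
```

Splitting hyphenated words changes the TF-IDF vocabulary, so it is worth comparing both variants rather than assuming one is better.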

In [12]:
train_df['filter_text'] = train_df['text'].apply(filter_text)
test_df['filter_text'] = test_df['text'].apply(filter_text)

In [13]:
train_df.drop(columns=['url'], inplace=True)
test_df.drop(columns='url', inplace=True)

In [14]:
test_df['filter_text'].head(1).values[0]

'bauchspeicheldrã¼se  klinik fã¼r allgemein viszeral und kinderchirurgie gãttingen klinik fã¼r allgemein viszeral und kinderchirurgie zur hauptnavigation springen zum inhalt wechseln aktuelles und kontakt kontakt logo der universtãtsmedizin gãttingen navigation ãffnen oder schliessen hauptnavigation subnavigation ãffnen oder schliessen medizinische versorgung poliklinik sonographie schilddrã¼se speiserãhre und magen darm bauchspeicheldrã¼se ced leber und galle hernien koloproktologie adipositaschirurgie kinderchirurgie sarkomchirurgie hipec roboterchirurgie interdisziplinãre zentren subnavigation ãffnen oder schliessen forschung klinische studien tumorepigenetik ag conradi ag gaedcke ag grade ag krause ag sperling ag sprenger ag wegwitz promotion publikationen subnavigation ãffnen oder schliessen lehre module blockpraktikum famulaturenpj subnavigation ãffnen oder schliessen ãber uns mitarbeiter stationen stationãre aufnahme geschichte der klinik navigationspfad medizinische versorgung 

In [15]:
train_df.head()

|   | doc_id | label | text | filter_text |
|---|---|---|---|---|
| 0 | 1 | 1 | Elbe-Elster Klinikum - Chirurgie Finsterwalde ... | elbeelster klinikum chirurgie finsterwalde su... |
| 1 | 3 | 3 | Onkologisches Zentrum - Klinikum Bayreuth \n \... | onkologisches zentrum klinikum bayreuth aktue... |
| 2 | 4 | 1 | Zentrum - SozialpÃ¤diatrisches Zentrum - StÃ¤d... | zentrum sozialpãdiatrisches zentrum stãdtisc... |
| 3 | 5 | 1 | Leistung - Spezielle UnterstÃ¼tzung bei der An... | leistung spezielle unterstã¼tzung bei der anm... |
| 4 | 6 | 3 | Zuweiser - Tumorkonferenzen - Tumorkonferenz G... | zuweiser tumorkonferenzen tumorkonferenz gas... |


In [16]:
# Extracting keywords: the pages are in German, which makes them harder to interpret by eye. Since we have been given a set of
# keywords, we check whether any of them is present in the text and classify the page accordingly.

In [17]:
# Map each keyword to its tumor type
keyword_to_tumor_dict = dict(zip(keyword_to_tumor_df['keyword'], keyword_to_tumor_df['tumor_type']))

def extract_keywords(text):
    """Return the tumor types of all keywords found in text, space-separated."""
    matched_keywords = []
    for keyword, tumor_type in keyword_to_tumor_dict.items():
        if keyword in text:  # substring match
            matched_keywords.append(tumor_type)
    return " ".join(matched_keywords)
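
`extract_keywords` returns the matched tumor types as one space-separated string, with a type repeated once per matching keyword. If numeric features are wanted instead, the same substring match can feed a `Counter`. This is a sketch, not the notebook's method: `tumor_type_counts` and the `darm` entry in the demo map are assumptions; the three `Brust` keywords come from the table above.

```python
from collections import Counter


def tumor_type_counts(text: str, keyword_map: dict) -> Counter:
    """Count how many distinct keywords of each tumor type occur in text."""
    counts = Counter()
    for keyword, tumor_type in keyword_map.items():
        # Substring match, the same test extract_keywords uses.
        if keyword in text:
            counts[tumor_type] += 1
    return counts


# "darm" is a hypothetical keyword added for the demo.
demo_map = {
    "senologische": "Brust",
    "brustzentrum": "Brust",
    "breast": "Brust",
    "darm": "Darm",
}
page = "das brustzentrum bietet eine senologische sprechstunde und ein darmzentrum"
print(tumor_type_counts(page, demo_map))
```

Substring matching suits German compound words (`darm` inside `darmzentrum`), but it can also produce false positives for short keywords.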


In [18]:
train_df['keywords'] = train_df['filter_text'].apply(extract_keywords)
test_df['keywords'] = test_df['filter_text'].apply(extract_keywords)

In [19]:
train_df

|   | doc_id | label | text | filter_text | keywords |
|---|---|---|---|---|---|
| 0 | 1 | 1 | Elbe-Elster Klinikum - Chirurgie Finsterwalde ... | elbeelster klinikum chirurgie finsterwalde su... | Brust Brust Darm Mamma carcinoma Mamma carcino... |
| 1 | 3 | 3 | Onkologisches Zentrum - Klinikum Bayreuth \n \... | onkologisches zentrum klinikum bayreuth aktue... | Brust Brust Darm Darm Darm Haut Haut Kopf-hals... |
| 2 | 4 | 1 | Zentrum - SozialpÃ¤diatrisches Zentrum - StÃ¤d... | zentrum sozialpãdiatrisches zentrum stãdtisc... | Urologische Schwerpunkt |
| 3 | 5 | 1 | Leistung - Spezielle UnterstÃ¼tzung bei der An... | leistung spezielle unterstã¼tzung bei der anm... | |
| 4 | 6 | 3 | Zuweiser - Tumorkonferenzen - Tumorkonferenz G... | zuweiser tumorkonferenzen tumorkonferenz gas... | Darm Magen Magen |
| ... | ... | ... | ... | ... | ... |
| 95 | 140 | 1 | uniFM \| uniCROSS \n \n \n \n \n \n \n \n \n \n... | unifm unicross news and magazine theme home w... | Endokrine malignome Gallenblasen/gallengangkre... |
| 96 | 141 | 1 | InterdisziplinÃ¤re NeurovaskulÃ¤re Konferenz ǀ... | interdisziplinãre neurovaskulãre konferenz ǀ u... | Brust Endokrine malignome Gallenblasen/galleng... |
| 97 | 144 | 2 | FÃ¼r Ãrzte \| Vivantes \n \n \n \n \n \n \n ... | fã¼r ãrzte vivantes javascript scheint in ihr... | Haut Haut Kopf-hals Lunge Lunge Magen Magen Ur... |
| 98 | 145 | 2 | Innere Medizin â HÃ¤matologie, Onkologie und... | innere medizin â hãmatologie onkologie und pal... | Haut Haut Kopf-hals Lunge Lunge Magen Magen Ur... |
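
The `keywords` column addresses tumor board types; the bonus task also asks for the schedule (day, frequency, start time). A rough sketch of a regex-based extractor follows. It assumes correctly decoded German text with punctuation intact (so the raw `text` column rather than `filter_text`, and without the latin-1 mojibake), and the weekday list, frequency phrases and time pattern are illustrative guesses rather than patterns validated on this dataset.

```python
import re

WEEKDAYS = ["montag", "dienstag", "mittwoch", "donnerstag",
            "freitag", "samstag", "sonntag"]

# Hypothetical frequency phrases; more specific phrases come first so that
# "zweiwöchentlich" is not mistaken for "wöchentlich".
FREQUENCIES = {
    "zweiwöchentlich": "bi-weekly",
    "14-tägig": "bi-weekly",
    "wöchentlich": "weekly",
    "jede woche": "weekly",
    "monatlich": "monthly",
}

# Matches times such as "15:30 uhr" or "8.00 uhr" in lowercased text.
TIME_RE = re.compile(r"\b([01]?\d|2[0-3])[:.]([0-5]\d)\s*uhr\b")


def extract_schedule(text: str) -> dict:
    """Return the earliest weekday, a frequency label and the first time found."""
    lowered = text.lower()
    day_match = re.search("|".join(WEEKDAYS), lowered)
    frequency = next(
        (label for phrase, label in FREQUENCIES.items() if phrase in lowered),
        None,
    )
    time_match = TIME_RE.search(lowered)
    return {
        "day": day_match.group(0) if day_match else None,
        "frequency": frequency,
        "time": (f"{int(time_match.group(1)):02d}:{time_match.group(2)}"
                 if time_match else None),
    }


# Made-up sentence for illustration.
print(extract_schedule(
    "Die Tumorkonferenz findet wöchentlich jeden Dienstag um 15:30 Uhr statt."
))
```

A page that lists several tumor boards would need the text split per board first; this sketch only reports one match of each kind per input string.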


In [20]:
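# TF-IDF features from the cleaned text (fit on all training rows, before the split below)
tfidf = TfidfVectorizer(max_features=5000)
X_tfidf = tfidf.fit_transform(train_df['filter_text'].fillna(''))  # treat missing text as empty

# Target variable
y = train_df['label']

# Hold out 20% of the rows (20 of 100 pages) for evaluation
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)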


In [21]:
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

acc = accuracy_score(y_test, y_pred)
acc

0.45
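
An accuracy measured on 20 held-out pages is a noisy estimate, and the TF-IDF vocabulary above is fit before the split. One common alternative, sketched below with toy data, is stratified k-fold cross-validation over a pipeline so the vectorizer is refit inside each fold. The helper name, the toy texts and the choice of macro F1 as the metric are assumptions, not something the notebook does.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline


def cross_validate_text_model(texts, labels, n_splits=5):
    """Return one macro-F1 score per stratified fold."""
    # The pipeline refits TF-IDF on each training fold only.
    pipeline = make_pipeline(
        TfidfVectorizer(max_features=5000),
        LogisticRegression(max_iter=1000),
    )
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    return cross_val_score(pipeline, texts, labels, cv=cv, scoring="f1_macro")


# Toy stand-in for train_df['filter_text'] and train_df['label'].
toy_texts = [
    "impressum kontakt datenschutz",
    "anfahrt parken kontakt",
    "speiseplan cafeteria besuchszeiten",
    "onkologie sprechstunde tumorkonferenz erwähnt",
    "klinik chirurgie tumorboard hinweis",
    "zentrum leistungen tumorkonferenz kurz",
    "tumorkonferenz termine dienstag wöchentlich",
    "tumorboard termine mittwoch anmeldung",
    "tumorkonferenz termine freitag anmeldung",
]
toy_labels = [1, 1, 1, 2, 2, 2, 3, 3, 3]

scores = cross_validate_text_model(toy_texts, toy_labels, n_splits=3)
print(scores.mean(), scores.std())
```

On the real data the call would be `cross_validate_text_model(train_df['filter_text'], train_df['label'])`, provided each label has enough pages for the chosen number of folds.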

In [22]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Naive Bayes": MultinomialNB(),
    "SVM": SVC(kernel='linear', random_state=42)
}

# train and evaluate each model
for name, model in models.items():
    model.fit(X_train, y_train)  # Train the model
    y_pred = model.predict(X_test)  # Make predictions
    acc = accuracy_score(y_test, y_pred)  # Calculate accuracy
    print(f"{name} Accuracy: {acc:.4f}") 

Logistic Regression Accuracy: 0.4500
Random Forest Accuracy: 0.4500
Naive Bayes Accuracy: 0.4500
SVM Accuracy: 0.4000
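
The assignment expects a CSV with the columns doc_id and prediction. One way to write that file, sketched with the standard `csv` module, is shown below; `write_predictions`, the output file name in the comments and the demo values are assumptions.

```python
import csv
import io


def write_predictions(doc_ids, predictions, out_file):
    """Write doc_id,prediction rows to an open text file."""
    writer = csv.writer(out_file)
    writer.writerow(["doc_id", "prediction"])
    for doc_id, prediction in zip(doc_ids, predictions):
        writer.writerow([doc_id, prediction])


# Demo with an in-memory buffer and made-up values.
buffer = io.StringIO()
write_predictions([0, 2, 7], [1, 3, 2], buffer)
print(buffer.getvalue())

# With the notebook's objects this could look like the following, where
# `model` stands for whichever fitted classifier is chosen:
#   preds = model.predict(tfidf.transform(test_df['filter_text']))
#   with open('predictions.csv', 'w', newline='') as f:
#       write_predictions(test_df['doc_id'], preds, f)
```

Before predicting on the test pages, it is usual to refit the chosen model on all training rows rather than only on the 80% split.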
