# Keyword Detection on Websites



## Assignment
Your task is to create an algorithm, that takes html page as input and infers if the page contains the information about cancer tumorboard or not. What is a tumor board? Tumor Board is a consilium of doctors (usually from different disciplines) discussing cancer cases in their departments. If you want to know more please read this article.

The expected result is a CSV file for test data with columns [doc_id and prediction].

Bonus: if you would like to go the extra mile in this task try to identify tumor board types interdisciplinary, breast, and any third type of tumor board up to you. For these tumor boards please try to identify their schedule: Day (e.g. Friday), frequency (e.g. weekly, bi-weekly, monthly), and time when they start.

## Data Description
You have train.csv and test.csv files and folder with corresponding .html files.

Files:

train.csv contains next columns: url, doc_id and label
test.csv contains next columns: url and doc_id
htmls contains files with names {doc_id}.html
keyword2tumor_type.csv contains useful keywords for types of tumorboards
Description of tumor board labels:

1 (no evidence): tumor boards are not mentioned on the page
2 (medium confidence): tumor boards are mentioned, but the page is not completely dedicated to tumor board description
3 (high confidence): page is completely dedicated to the description of tumor board types and dates
You are asked to prepare a model using htmls, referred to in train.csv, and make predictions for htmls from test.csv

## Practicalities
You should prepare a Jupyter Notebook with the code that you used for making the predictions and the following documentation:

How did you decide to handle this amount of data?
How did you decide to do feature engineering?
How did you decide which models to try (if you decide to train any models)?
How did you perform validation of your model?
What metrics did you measure?
How do you expect your model to perform on test data (in terms of your metrics)?
How fast will your algorithm performs and how could you improve its performance if you would have more time?
How do you think you would be able to improve your algorithm if you would have more data?
What potential issues do you see with your algorithm?

## Tips
to extract clean text from the page you can use BeautifulSoup module like this

from bs import BeautifulSoup

content = read_html()

soup = BeautifulSoup(content)

clean_text = soup.get_text(' ')


## If you decide that you don't need, for example, tags <p> in your document you can do this:##


from bs import BeautifulSoup

content = read_html()

soup = BeautifulSoup(content)

for tag in soup.find_all('p'):
    tag.decompose()

#### To download the dataset <a href="https://drive.google.com/drive/folders/1Qs2fLj9HmAzx2YGKmqkePCa1Acs5JY3Z?usp=sharing"> Click here </a>

In [388]:
# Required Libraries
import pandas as pd
import os

# Load CSV Files
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
keywords_df = pd.read_csv('keyword2tumor_type.csv')

# Display First Few Rows
print("Train Data")
print(train_df.head())

print("\nTest Data")
print(test_df.head())

print("\nKeywords Data")
print(keywords_df.head())

# Check the HTML Folder
html_folder_path = '/content/htmls'
html_files = os.listdir(html_folder_path)

# Display some HTML file names
print(f"\nTotal HTML Files: {len(html_files)}")



Train Data
                                                 url  doc_id  label
0  http://elbe-elster-klinikum.de/fachbereiche/ch...       1      1
1  http://klinikum-bayreuth.de/einrichtungen/zent...       3      3
2  http://klinikum-braunschweig.de/info.php/?id_o...       4      1
3  http://klinikum-braunschweig.de/info.php/?id_o...       5      1
4  http://klinikum-braunschweig.de/zuweiser/tumor...       6      3

Test Data
                                                 url  doc_id
0  http://chirurgie-goettingen.de/medizinische-ve...       0
1  http://evkb.de/kliniken-zentren/chirurgie/allg...       2
2  http://krebszentrum.kreiskliniken-reutlingen.d...       7
3  http://marienhospital-buer.de/mhb-av-chirurgie...      15
4  http://marienhospital-buer.de/mhb-av-chirurgie...      16

Keywords Data
        keyword tumor_type
0  senologische      Brust
1  brustzentrum      Brust
2        breast      Brust
3        thorax      Brust
4     thorakale      Brust

Total HTML Files: 148


This code section loads the necessary datasets: train.csv, test.csv, and keyword2tumor_type.csv using pandas. It also checks the available HTML files in the specified folder. The first few rows of each dataset are printed to verify the data. This step ensures that the datasets are correctly loaded and HTML files are present for further processing. The total count of HTML files is also displayed.

In [389]:
!pip install chardet



In [390]:
import os
import chardet
import pandas as pd
from bs4 import BeautifulSoup

# Function to Detect Encoding and Extract Text from HTML
def parse_html_file(file_path):
    try:

        with open(file_path, 'rb') as file:
            raw_data = file.read()
            result = chardet.detect(raw_data)
            encoding = result['encoding']
            print(f"Detected encoding for {file_path}: {encoding}")


        with open(file_path, 'r', encoding=encoding) as file:
            soup = BeautifulSoup(file, 'html.parser')

            text = soup.get_text(separator=' ', strip=True)
            return text
    except Exception as e:
        print(f"Error reading {file_path}: {e}")
        return None


This code defines a function parse_html_file() that reads and extracts text from HTML files. It uses the chardet library to detect the encoding of the HTML file, ensuring proper text decoding. Then, the BeautifulSoup library is used to parse the HTML and extract text content, which is returned as a string. If an error occurs (e.g., due to encoding issues), it handles the exception and returns None. The function ensures robust reading and parsing of HTML files for further text analysis.

In [391]:
# folder path
html_folder_path = '/content/htmls'

# Parsing html files in train data
train_df['text'] = train_df['doc_id'].apply(
    lambda doc_id: parse_html_file(os.path.join(html_folder_path, f"{doc_id}.html"))
)
test_df['text'] = test_df['doc_id'].apply(
    lambda doc_id: parse_html_file(os.path.join(html_folder_path, f"{doc_id}.html"))
)



Detected encoding for /content/htmls/1.html: utf-8
Detected encoding for /content/htmls/3.html: utf-8
Detected encoding for /content/htmls/4.html: utf-8
Detected encoding for /content/htmls/5.html: utf-8
Detected encoding for /content/htmls/6.html: utf-8
Detected encoding for /content/htmls/8.html: utf-8
Detected encoding for /content/htmls/9.html: utf-8
Detected encoding for /content/htmls/10.html: utf-8
Detected encoding for /content/htmls/11.html: utf-8
Detected encoding for /content/htmls/12.html: utf-8
Detected encoding for /content/htmls/13.html: utf-8
Detected encoding for /content/htmls/14.html: utf-8
Detected encoding for /content/htmls/17.html: utf-8
Detected encoding for /content/htmls/18.html: utf-8
Detected encoding for /content/htmls/19.html: utf-8
Detected encoding for /content/htmls/20.html: utf-8
Detected encoding for /content/htmls/21.html: utf-8
Detected encoding for /content/htmls/22.html: utf-8
Detected encoding for /content/htmls/23.html: utf-8
Detected encoding f

  k = self.parse_starttag(i)


Detected encoding for /content/htmls/74.html: utf-8
Detected encoding for /content/htmls/78.html: utf-8
Detected encoding for /content/htmls/82.html: utf-8
Detected encoding for /content/htmls/84.html: utf-8
Detected encoding for /content/htmls/87.html: utf-8
Detected encoding for /content/htmls/91.html: utf-8
Detected encoding for /content/htmls/99.html: utf-8
Detected encoding for /content/htmls/103.html: utf-8
Detected encoding for /content/htmls/104.html: utf-8
Detected encoding for /content/htmls/109.html: utf-8
Detected encoding for /content/htmls/113.html: utf-8
Detected encoding for /content/htmls/116.html: utf-8
Detected encoding for /content/htmls/123.html: utf-8
Detected encoding for /content/htmls/124.html: utf-8
Detected encoding for /content/htmls/127.html: utf-8
Detected encoding for /content/htmls/134.html: utf-8
Detected encoding for /content/htmls/135.html: utf-8
Detected encoding for /content/htmls/142.html: utf-8
Detected encoding for /content/htmls/143.html: utf-8


In this code, we are parsing the HTML files corresponding to each document ID in both the training and test datasets. The apply() function is used to apply the parse_html_file() function to each doc_id in both train_df and test_df. The HTML files are located in the html_folder_path and are named using the doc_id followed by the .html extension.

The extracted text from each HTML file is stored in the new column text for both train_df and test_df. This step is crucial for transforming the raw HTML content into a readable text format, which will be used in the subsequent stages of the analysis and model training.

In [392]:
train_df.head()

Unnamed: 0,url,doc_id,label,text
0,http://elbe-elster-klinikum.de/fachbereiche/ch...,1,1,Elbe-Elster Klinikum - Chirurgie Finsterwalde ...
1,http://klinikum-bayreuth.de/einrichtungen/zent...,3,3,Onkologisches Zentrum - Klinikum Bayreuth Aktu...
2,http://klinikum-braunschweig.de/info.php/?id_o...,4,1,Zentrum - Sozialpädiatrisches Zentrum - Städti...
3,http://klinikum-braunschweig.de/info.php/?id_o...,5,1,Leistung - Spezielle Unterstützung bei der Anm...
4,http://klinikum-braunschweig.de/zuweiser/tumor...,6,3,Zuweiser - Tumorkonferenzen - Tumorkonferenz G...


In [393]:
test_df.head()

Unnamed: 0,url,doc_id,text
0,http://chirurgie-goettingen.de/medizinische-ve...,0,"Bauchspeicheldrüse | Klinik für Allgemein-, Vi..."
1,http://evkb.de/kliniken-zentren/chirurgie/allg...,2,Chirurgie der Bauchspeicheldrüse (Pankreaschir...
2,http://krebszentrum.kreiskliniken-reutlingen.d...,7,Brustzentrum Reutlingen: Behandlungsverfahren ...
3,http://marienhospital-buer.de/mhb-av-chirurgie...,15,Leistungsspektrum: Sankt Marien-Hospital Buer ...
4,http://marienhospital-buer.de/mhb-av-chirurgie...,16,Leistungsspektrum: Sankt Marien-Hospital Buer ...


In [394]:
#checking null values
train_df.isnull().sum()

Unnamed: 0,0
url,0
doc_id,0
label,0
text,0


In [395]:
#checking null values on test df

test_df.isnull().sum()


Unnamed: 0,0
url,0
doc_id,0
text,0


In [396]:
#textpreprocessing for both train and test data

import nltk
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re

# Download necessary NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')

# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# Function to clean text
def clean_text(text):
    # Convert text to lowercase
    text = text.lower()

    # Remove special characters, numbers, and extra spaces
    text = re.sub(r'[^a-z\s]', ' ', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Tokenize the text
    tokens = nltk.word_tokenize(text)

    # Remove stopwords and apply lemmatization
    stop_words = set(stopwords.words('english'))
    cleaned_tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]

    # Join the cleaned tokens back into a string
    cleaned_text = ' '.join(cleaned_tokens)

    return cleaned_text


    train_df['cleaned_text'] = train_df['text'].apply(clean_text)
    test_df['cleaned_text'] = test_df['text'].apply(clean_text)


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!



In the above code, text preprocessing is applied to both the train_df and test_df datasets. The clean_text() function performs several key steps to clean and prepare the text for further analysis or model training:

Lowercasing: All text is converted to lowercase to ensure consistency and prevent the model from treating the same word in different cases (e.g., "Tumor" vs "tumor") as different.

Remove Special Characters and Numbers: Using a regular expression (re.sub()), special characters, numbers, and extra spaces are removed to retain only the relevant words for analysis.

Remove Punctuation: Although not enabled in this section (it's commented out), punctuation removal can be done if needed.

**Tokenization**: The text is split into individual tokens (words) using nltk.word_tokenize().

**Remove Stopwords**: Common words like "the", "is", "and", etc., which do not contribute significant meaning to the text, are removed using the stopwords corpus from NLTK.

**Lemmatization**: Words are reduced to their base form using a WordNetLemmatizer. For example, "running" is changed to "run", and "better" is changed to "good".

***The cleaned text for both the train and test datasets is stored in new columns cleaned_text for further use in the next steps, like vectorization and model training. This preprocessing ensures that the text is in a consistent, meaningful format for analysis and model input.***

In [397]:
# TF-IDF Vectorizer to convert text into numerical features
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english', max_features=5000)

# Fit the TF-IDF model
X_train_tfidf = tfidf.fit_transform(train_df['text'])
X_test_tfidf = tfidf.transform(test_df['text'])

y_train = train_df['label']


In this code, **TF-IDF Vectorization** is applied as part of Feature Engineering. It transforms raw text into numerical features that machine learning models can understand.

Key points:
**TF-IDF** (Term Frequency-Inverse Document Frequency) measures the importance of each word in the text, giving higher weight to rare words and lower weight to common words across all documents.
**max_features=5000:** Limits the features to the top 5000 most important words.
**stop_words='english':** Removes common words (like "the", "is") that are not meaningful for analysis.
Purpose:
This step converts text data into a format suitable for training machine learning models by representing it as a matrix of numerical values. It's an essential step in transforming unstructured text into structured data that the model can process.

In [398]:
# Training on Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_tfidf, y_train)


In this code, we are training a **Random Forest Classifier** on the features obtained from TF-IDF vectorization.

### Key Points:
- **Random Forest Classifier**: It's an ensemble learning method that uses multiple decision trees to make predictions. Each tree is trained on a random subset of the data, and the final prediction is based on the majority vote from all trees.
- **n_estimators=100**: This specifies that the forest should consist of 100 decision trees.
- **random_state=42**: Ensures reproducibility of results, meaning the random processes (like selecting subsets of data for training) will yield the same results each time the code is run.

### Purpose:
This step trains the model using the transformed features (TF-IDF) from the training data (`X_train_tfidf`) and the corresponding target labels (`y_train`). The model will now be able to predict the target label for unseen data based on the patterns learned during training.

In [399]:

# predictions on test data
y_pred = model.predict(X_test_tfidf)

# Prepare the predictions DataFrame for the test data
test_predictions = pd.DataFrame({
    'doc_id': test_df['doc_id'],
    'prediction': y_pred
})

# Saving predictions to a CSV file
test_predictions.to_csv('final_predictions.csv', index=False)

### Code Explanation:

1. **Making Predictions on Test Data**:
   - `y_pred = model.predict(X_test_tfidf)`: Here, the trained Random Forest model (`model`) is used to predict the target labels for the test dataset (`X_test_tfidf`). The predicted labels are stored in `y_pred`.

2. **Creating a DataFrame for Predictions**:
   - `test_predictions = pd.DataFrame({'doc_id': test_df['doc_id'], 'prediction': y_pred})`: A new DataFrame is created, which contains two columns:
     - `'doc_id'`: The unique document identifier from the test dataset.
     - `'prediction'`: The predicted label (0, 1, or 2) corresponding to each document.

3. **Saving the Predictions**:
   - `test_predictions.to_csv('final_predictions.csv', index=False)`: The `test_predictions` DataFrame is saved to a CSV file named `final_predictions.csv`, without including the index.

### Purpose:
This part of the code is for generating and saving the final predictions for the test data. After predicting the labels for each document, we create a CSV file (`final_predictions.csv`) containing the `doc_id` and its corresponding predicted label. This CSV file can be submitted or used for further evaluation.

# **BONUS TASK**

In [400]:
import re
import pandas as pd

# Function to extract schedule details
def extract_schedule_info(text):
    days = ['monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday']
    frequency = ['weekly', 'bi-weekly', 'monthly']
    time_keywords = ['am', 'pm', 'morning', 'afternoon', 'evening']


    day = None
    freq = None
    time = None


    for day_candidate in days:
        if day_candidate in text.lower():
            day = day_candidate.capitalize()
            break  # stop after finding the first match

    for freq_candidate in frequency:
        if freq_candidate in text.lower():
            freq = freq_candidate
            break  # stop after finding the first match

    for time_candidate in time_keywords:
        if time_candidate in text.lower():
            time = time_candidate
            break  # stop after finding the first match

    return day, freq, time



**Above code** defines a function `extract_schedule_info`, which is designed to extract scheduling information from text. The function searches for keywords related to **days of the week**, **meeting frequency**, and **time of day** in the provided text.

1. **Day Extraction**: The function looks for day names (e.g., Monday, Tuesday) in the text and returns the first match it finds.
2. **Frequency Extraction**: It checks for keywords that indicate how often the tumor board meetings occur, such as "weekly", "bi-weekly", or "monthly".
3. **Time Extraction**: It identifies meeting times using keywords like "morning", "afternoon", "am", or "pm".

The function then returns three values: the **day**, **frequency**, and **time** of the meeting, which can be used for organizing tumor board meetings or further analysis.


In [401]:
import pandas as pd

def identify_tumor_board_type(text, keyword_df):
    tumor_types = keyword_df['tumor_type'].unique()
    for tumor_type in tumor_types:
        keywords = keyword_df[keyword_df['tumor_type'] == tumor_type]['keyword'].tolist()
        if any(keyword in text for keyword in keywords):
            return tumor_type
    return 'Other'

train_df['tumor_type'] = train_df['text'].apply(lambda x: identify_tumor_board_type(x, pd.read_csv('keyword2tumor_type.csv')))
test_df['tumor_type'] = test_df['text'].apply(lambda x: identify_tumor_board_type(x, pd.read_csv('keyword2tumor_type.csv')))

print(train_df[['doc_id', 'tumor_type']])


    doc_id        tumor_type
0        1             Brust
1        3             Leber
2        4       Urologische
3        5             Other
4        6  Interdisziplinär
..     ...               ...
95     140              Haut
96     141       Urologische
97     144       Urologische
98     145    Hämatooncology
99     146             Other

[100 rows x 2 columns]


**Above code** defines a function `identify_tumor_board_type`, which categorizes the tumor board type based on keywords extracted from the `keyword2tumor_type.csv` file. Here's how it works:

1. **Tumor Types Extraction**: The function takes in the `text` (content of the HTML page) and the `keyword_df` (a DataFrame containing tumor types and associated keywords). It first retrieves all unique tumor types from the dataset.

2. **Keyword Matching**: For each tumor type, it gets the associated keywords and checks if any of the keywords are present in the input `text`. If a match is found, the function returns the corresponding tumor type.

3. **Categorization**: If no match is found for any tumor type, the function returns 'Other', indicating that the page doesn't fit into any of the defined categories.

4. **Application to Data**: The function is applied to both the training and test data using the `apply` method, creating a new column `tumor_type` that contains the predicted tumor board type for each document.

In summary, this function classifies tumor boards into specific types based on keyword matching, and it is applied to both train and test datasets to predict tumor board types.

In [402]:

def identify_tumor_board_type_and_schedule(text, keyword_df):

    tumor_type = identify_tumor_board_type(text, keyword_df)
    day, freq, time = extract_schedule_info(text)

    return tumor_type, day, freq, time

keyword_df = pd.read_csv('keyword2tumor_type.csv')

train_df[['tumor_type', 'day', 'frequency', 'start_time']] = train_df['text'].apply(
    lambda x: pd.Series(identify_tumor_board_type_and_schedule(x, keyword_df))
)

test_df[['tumor_type', 'day', 'frequency', 'start_time']] = test_df['text'].apply(
    lambda x: pd.Series(identify_tumor_board_type_and_schedule(x, keyword_df))
)

# Display the results
print(train_df[['doc_id', 'tumor_type', 'day', 'frequency', 'start_time']])

    doc_id        tumor_type     day frequency start_time
0        1             Brust    None      None         am
1        3             Leber    None      None         am
2        4       Urologische    None      None         am
3        5             Other    None      None         am
4        6  Interdisziplinär    None      None         am
..     ...               ...     ...       ...        ...
95     140              Haut  Friday      None         am
96     141       Urologische    None      None         am
97     144       Urologische    None      None         am
98     145    Hämatooncology    None      None         am
99     146             Other    None      None         am

[100 rows x 5 columns]


**Above code** integrates the tumor board type identification and schedule extraction into a single function, `identify_tumor_board_type_and_schedule`, that processes both aspects simultaneously. Here's a breakdown:

1. **Function Definition**:
   - The function `identify_tumor_board_type_and_schedule` combines the results of two earlier functions:
     - `identify_tumor_board_type` for detecting the tumor board type.
     - `extract_schedule_info` for extracting day, frequency, and start time of the tumor board.

2. **Applying the Function**:
   - The function is applied to both the train and test datasets using the `apply` method. The `lambda` function is used to pass each document's `text` through `identify_tumor_board_type_and_schedule`.
   - The result is unpacked into four separate columns: `tumor_type`, `day`, `frequency`, and `start_time`.

3. **Storing the Results**:
   - The new columns are added to the `train_df` and `test_df` DataFrames.
   - The final dataset contains predictions for tumor board type, day of the week, frequency (e.g., weekly, bi-weekly, etc.), and start time (e.g., morning, afternoon, evening).

In summary, this code combines tumor board classification with schedule extraction, and applies the function to both the train and test data, producing relevant information for each document.

In [403]:
train_df['day'].unique()
train_df['frequency'].unique()
train_df['start_time'].unique()
train_df['tumor_type'].unique()

array(['Brust', 'Leber', 'Urologische', 'Other', 'Interdisziplinär',
       'Haut', 'Endokrine malignome', 'Darm', 'Magen', 'Gynäkologie',
       'Prostata', 'Hämatooncology', 'Schwerpunkt', 'Sarkome', 'Lunge',
       'Neuroonkologie'], dtype=object)

In [404]:
train_df.tail()

Unnamed: 0,url,doc_id,label,text,tumor_type,day,frequency,start_time
95,http://www.unicross.uni-freiburg.de/thema/unifm/,140,1,uniFM | uniCROSS News and Magazine Theme HOME ...,Haut,Friday,,am
96,http://www.uniklinik-duesseldorf.de/patienten-...,141,1,Interdisziplinäre Neurovaskuläre Konferenz ǀ U...,Urologische,,,am
97,http://www.vivantes.de/fuer-sie-vor-ort/klinik...,144,2,Für Ärzte | Vivantes JavaScript scheint in Ihr...,Urologische,,,am
98,http://www.vivantes.de/fuer-sie-vor-ort/klinik...,145,2,"Innere Medizin – Hämatologie, Onkologie und Pa...",Hämatooncology,,,am
99,http://www.walburga-krankenhaus.de/wk/artikel/...,146,1,Versorgung von Krebspatienten stärken | St. Wa...,Other,,,am
