
### Text clustering

Read, execute and analyse the code in the notebook tutorial_clustering_words. Then *choose one* of the assignments a), b) or c).

a) Read the article Clinical Documents Clustering Based on Medication/Symptom Names Using Multi-View Nonnegative Matrix Factorization. 
You can find the article <a href = 'https://pubmed.ncbi.nlm.nih.gov/26011887/'> here</a>. Explain the similarities between this notebook and the article. Explain in your own words what needs to be added to this notebook to reproduce the article. There is no need to code the solution; you can describe the steps in your own words.

b) Improve the outcome by improving the data preprocessing and the hyperparameter configuration. 
Explain your choices. Your solution should be a coded solution with comments. 
Are there any other weighting schemes besides TF-IDF?

c) Provide a text clustering solution with your own data of interest. You can follow an approach similar to the one in the tutorial_clustering_words notebook.

Note that you are not allowed to copy code solutions without referencing them.

# <span style="color:green">Solution W2: Part a </span>

### Similarities:

1. Both methods use a collection of clinical case reports or clinical notes for their analysis. Below, "method one" refers to this notebook and "method two" refers to the article.

2. They both involve preprocessing of the text, which includes removing punctuation, stop words, and other unwanted text, as well as lowercasing the text.
It is worth mentioning that the article (method two) does not describe exactly what was done in its preprocessing step. Because removing punctuation, stop words, and other unwanted text is standard procedure, I assume the authors did the same.

3. Both methods use Nonnegative Matrix Factorization (NMF) for clustering. This involves creating a Document-Term Matrix (DTM) or similar matrices (like sample-feature matrices in method two) to perform the calculations.


## Differences and what needs to be added to method one to reproduce method two:

1. Entity Extraction: In method two, the researchers specifically extract entities like symptom names and medication names from the clinical notes using specialized tools such as Stanford CoreNLP, MetaMap, and MedEx. Method one does not have this step. To reproduce method two, method one would need to add a step to extract specific entities like symptom names and medication names.

2. Multi-view NMF: Method two uses an extension of NMF, called multi-view NMF, that considers different 'views' or perspectives of the data simultaneously. To reproduce method two, method one would need to incorporate the use of multi-view NMF instead of just basic NMF.

3. Data Sources: Method two uses two different datasets from the i2b2 workshops on NLP challenges, from 2009 and 2014 respectively, whereas method one does not specify the data source. To reproduce method two, method one would need to use the same or comparable datasets.

4. Evaluation Metrics: Method two uses specific metrics such as accuracy and normalized mutual information (NMI) to evaluate the clustering results. Method one does not specify any evaluation metrics. To reproduce method two, method one would need to include these evaluation metrics.

5. Negation Processing: Method two involves identifying negations in the clinical notes, to exclude certain symptom and medication names. Method one does not mention handling negations. This could be a crucial addition to method one to replicate the results of method two.

6. Different Feature Spaces: Method two generates features from three views: symptom names, medication names, and words, whereas method one only uses the lemmatized words for feature extraction. To reproduce method two, method one would need to generate features from these additional perspectives.
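
The evaluation step from point 4 could be sketched as follows. This is only an illustration under the assumption that gold class labels are available for the documents; the labels below are made up, and `clustering_accuracy` is a hypothetical helper, not a function from the article or from scikit-learn. Accuracy for clustering requires matching cluster ids to classes first, which is done here with the Hungarian algorithm from SciPy.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score


def clustering_accuracy(true_labels, cluster_ids):
    """Accuracy after matching clusters to classes one-to-one (Hungarian algorithm)."""
    true_labels = np.asarray(true_labels)
    cluster_ids = np.asarray(cluster_ids)
    classes = np.unique(true_labels)
    clusters = np.unique(cluster_ids)
    # contingency[i, j] = number of documents in cluster i that carry class j
    contingency = np.zeros((len(clusters), len(classes)), dtype=int)
    for i, c in enumerate(clusters):
        for j, k in enumerate(classes):
            contingency[i, j] = np.sum((cluster_ids == c) & (true_labels == k))
    # pick the cluster-to-class matching that maximises the matched documents
    rows, cols = linear_sum_assignment(contingency, maximize=True)
    return contingency[rows, cols].sum() / len(true_labels)


# Hypothetical toy labels: gold classes and cluster ids for six documents
gold = [0, 0, 0, 1, 1, 1]
pred = [1, 1, 0, 0, 0, 0]

acc = clustering_accuracy(gold, pred)
nmi = normalized_mutual_info_score(gold, pred)
print(f"accuracy={acc:.3f}, NMI={nmi:.3f}")
```

With an NMF model, the cluster id of each document could be taken as the index of its largest entry in the document-topic matrix.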

## Conclusion
In conclusion, method one can be considered a simpler approach to text clustering, while method two is more nuanced: it takes into account multiple views of the data, extracts specific entities, and uses specific evaluation metrics. To reproduce method two, method one would need to incorporate these additional steps and considerations.

# <span style="color:green">Implementing method one with "MACCROBAT2020" Dataset </span>

# Clustering text

See also https://towardsdatascience.com/nmf-a-visual-explainer-and-python-implementation-7ecdd73491f8


This notebook uses a collection of clinical case reports to cluster words by topic using the NMF method. To cluster text, we first need to preprocess it with the regular natural language processing cleaning steps, such as removing punctuation, stopwords, and other unwanted text. We lowercase the text and lemmatize the words to reduce word variation. This is all done in part A.

Next, we need to arrange the text in a document-term matrix so that NMF can perform the calculations. The Document-Term Matrix (DTM) represents the frequency of words (or terms) in a collection of documents. Each row in the matrix represents a document, and each column represents a word in the vocabulary. The value in each cell represents the frequency of the corresponding word in the corresponding document. In this notebook the raw frequencies are weighted with TF-IDF. This is done in part B.

Lastly, we run the clustering algorithm and visualize the outcome. This is done in part C.


## The data 
A collection of 200 clinical case report documents in plain text format is used. The documents are named using PubMed document IDs, and have been edited to include only clinical case report details. The dataset is called "MACCROBAT2020" and is the second release of this dataset, with improvements made to the consistency and format of annotations.
https://figshare.com/articles/dataset/MACCROBAT2018/9764942




In [44]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.feature_extraction import text
from nltk.corpus import stopwords
from nltk import word_tokenize, pos_tag
from nltk.stem import WordNetLemmatizer
import re
import string
import glob
import pandas as pd
from pathlib import Path

In [19]:
import pandas as pd
import glob
from pathlib import Path
import re
import string
from nltk import pos_tag, word_tokenize
from nltk.stem import WordNetLemmatizer

#create empty dataframe
df = pd.DataFrame(columns=['docid','text'])

# Get all files ending with .txt
docs = [x for x in glob.glob("/Users/Hesam_1/Library/CloudStorage/OneDrive-HanzehogeschoolGroningen/Semester2/MachineLearning/Data/MACCROBAT2020/*.txt")]

# Fill dataframe
for doc in docs:
    txt = Path(doc).read_text()
    df.loc[len(df.index)] = [doc[:-4], txt]

df.head()


['/Users/Hesam_1/Library/CloudStorage/OneDrive-HanzehogeschoolGroningen/Semester2/MachineLearning/Data/MACCROBAT2020/19860925.txt', '/Users/Hesam_1/Library/CloudStorage/OneDrive-HanzehogeschoolGroningen/Semester2/MachineLearning/Data/MACCROBAT2020/26228535.txt', '/Users/Hesam_1/Library/CloudStorage/OneDrive-HanzehogeschoolGroningen/Semester2/MachineLearning/Data/MACCROBAT2020/27773410.txt', '/Users/Hesam_1/Library/CloudStorage/OneDrive-HanzehogeschoolGroningen/Semester2/MachineLearning/Data/MACCROBAT2020/28103924.txt', '/Users/Hesam_1/Library/CloudStorage/OneDrive-HanzehogeschoolGroningen/Semester2/MachineLearning/Data/MACCROBAT2020/27064109.txt', '/Users/Hesam_1/Library/CloudStorage/OneDrive-HanzehogeschoolGroningen/Semester2/MachineLearning/Data/MACCROBAT2020/20146086.txt', '/Users/Hesam_1/Library/CloudStorage/OneDrive-HanzehogeschoolGroningen/Semester2/MachineLearning/Data/MACCROBAT2020/26656340.txt', '/Users/Hesam_1/Library/CloudStorage/OneDrive-HanzehogeschoolGroningen/Semester2/M

Unnamed: 0,docid,text
0,/Users/Hesam_1/Library/CloudStorage/OneDrive-H...,Our 24-year-old non-smoking male patient prese...
1,/Users/Hesam_1/Library/CloudStorage/OneDrive-H...,A 25-year-old female patient had noticed left-...
2,/Users/Hesam_1/Library/CloudStorage/OneDrive-H...,A 69-year-old male diabetic patient was admitt...
3,/Users/Hesam_1/Library/CloudStorage/OneDrive-H...,Our patient was a 7-year-old Italian boy born ...
4,/Users/Hesam_1/Library/CloudStorage/OneDrive-H...,A 53-year-old man came to our hospital with si...


In [41]:
def clean_text(text):
    '''Make text lowercase, remove text in square brackets, 
    remove punctuation, remove read errors,
    and remove words containing numbers.'''
    text = text.lower()
    # raw strings avoid invalid-escape warnings for the regex patterns
    text = re.sub(r'\[.*?\]', ' ', text)
    text = re.sub(r'[%s]' % re.escape(string.punctuation), ' ', text)
    text = re.sub(r'\w*\d\w*', ' ', text)
    text = re.sub('�', ' ', text)
    return text

cleaned = lambda x: clean_text(x)
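
To see what these cleaning steps do, the sketch below applies the same substitutions to a made-up sentence. The function is repeated under the hypothetical name `clean_text_demo` only so that the block runs on its own; the read-error step is left out, and the sample sentence is an assumption for illustration.

```python
import re
import string


def clean_text_demo(text):
    """Lowercase, drop bracketed text, punctuation, and words containing digits."""
    text = text.lower()
    text = re.sub(r'\[.*?\]', ' ', text)                               # "[fig. 1]" is removed
    text = re.sub(r'[%s]' % re.escape(string.punctuation), ' ', text)  # "," and "!" are removed
    text = re.sub(r'\w*\d\w*', ' ', text)                              # "20mg" is removed
    return text


sample = "The patient [Fig. 1] received 20mg of Aspirin, twice daily!"
tokens = clean_text_demo(sample).split()
print(tokens)
# ['the', 'patient', 'received', 'of', 'aspirin', 'twice', 'daily']
```

Note that dosage tokens such as "20mg" disappear entirely, which may or may not be desirable for clinical text.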

In [42]:
# Noun extract and lemmatize function
def nouns(text):
    '''Given a string of text, tokenize the text 
    and pull out only the nouns.'''
    # predicate: Penn Treebank noun tags all start with 'NN'
    is_noun = lambda pos: pos[:2] == 'NN'
    # split the string into a list of words (tokens)
    tokenized = word_tokenize(text)
    # create the lemmatizer
    wordnet_lemmatizer = WordNetLemmatizer()
    # POS-tag the tokens, keep only the nouns, and lemmatize them
    all_nouns = [wordnet_lemmatizer.lemmatize(word)
                 for (word, pos) in pos_tag(tokenized) if is_noun(pos)]
    
    #return string of joined list of nouns
    return ' '.join(all_nouns)

In [54]:
# Clean Text
df["text"] = df["text"].apply(cleaned)
data_nouns = pd.DataFrame(df["text"].apply(nouns))
# Visually Inspect
data_nouns.head()

Unnamed: 0,text
0,year non patient hemoptysis day thoracic pain ...
1,year female patient loss year hemiparesis mont...
2,year patient failure therapy edema basal crepi...
3,patient year boy gestation duration age month ...
4,year man hospital sign symptom heart failure w...


## Part B: The Document-Term Matrix (DTM)



In [56]:
# Create a document-term matrix with only nouns
# Store TF-IDF Vectorizer
tv_noun = TfidfVectorizer(stop_words='english', ngram_range = (1,1), max_df = .8, min_df = .01)
# Fit and transform the noun text to a TF-IDF Doc-Term Matrix
data_tv_noun = tv_noun.fit_transform(data_nouns.text)
# Create data-frame of Doc-Term Matrix with nouns as column names
data_dtm_noun = pd.DataFrame(data_tv_noun.toarray(), columns=tv_noun.get_feature_names_out())
data_dtm_noun.index = df.index
# Visually inspect Document Term Matrix
data_dtm_noun.head()

Unnamed: 0,abdomen,ablation,abnormality,abscess,absence,absent,abuse,access,accompanying,accordance,...,york,yr,zhejiang,zinc,zone,µg,µmol,μg,μl,μmol
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.056135,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.055374,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.080138,0.0,0.0,0.0,0.0,0.0,0.0,...,0.083947,0.0,0.0,0.0,0.0,0.0,0.0,0.062559,0.0,0.0


## Part C: Run the NMF



In [57]:
def display_topics(model, feature_names, num_top_words, topic_names=None):
    '''Given an NMF model, feature_names, and number of top words, print 
       topic number and its top feature names, up to specified number of top words.'''
    # iterate through topics in topic-term matrix, 'H' aka
    # model.components_
    for ix, topic in enumerate(model.components_):
        #print topic, topic number, and top words
        if not topic_names or not topic_names[ix]:
            print("\nTopic ", ix)
        else:
            print("\nTopic: '",topic_names[ix],"'")
        print(", ".join([feature_names[i] \
             for i in topic.argsort()[:-num_top_words - 1:-1]]))

In [60]:
nmf_model = NMF(n_components=5)  # factorize into 5 topics
# Learn an NMF model for given Document Term Matrix 'V' 
# Extract the document-topic matrix 'W'
doc_topic = nmf_model.fit_transform(data_dtm_noun)
# Extract top words from the topic-term matrix 'H' 
display_topics(nmf_model, tv_noun.get_feature_names_out(), 10)


Topic  0
dl, mg, level, blood, ml, day, count, platelet, serum, range

Topic  1
tumor, mass, cell, cm, lesion, figure, fig, lymph, resection, metastasis

Topic  2
valve, figure, echocardiography, heart, artery, mm, pressure, failure, atrium, vein

Topic  3
lung, day, chest, fig, treatment, hospital, tuberculosis, therapy, effusion, dyspnea

Topic  4
age, month, eye, week, seizure, rash, parent, mri, muscle, brain

