## Homework 2 - Feature Engineering and Supervised Learners

In the last few lessons, we learned how Python programming and natural language processing (NLP) can be used to process, standardize, and encode textual information so that it can inform prediction models for a classification task.

In this homework, you will demonstrate how to apply these approaches to transform text into informative features to classify sentences by disease status. For our questions, we'll return to our dataset of stroke diagnostic impression sections and associated acute ischemic stroke (AIS vs. non-AIS) classes. 

---

<img src="img/paper.png" width=1000>

- Kim C, Zhu V, Obeid J, Lenert L. Natural language processing and machine learning algorithm to identify brain MRI reports with acute ischemic stroke. PLoS One. 2019 Feb 28;14(2):e0212778. doi: 10.1371/journal.pone.0212778. PMID: 30818342; PMCID: PMC6394972.

---

The total point value is __20 points__. Please submit your completed Jupyter notebook through Canvas with the following naming convention: _HW2_LastName_FirstName.ipynb_. This assignment will be due no later than __March 27 at 11:59pm__. If you have any questions, please join Wednesday's and Thursday's TA hours before reaching out directly to Danielle or Patryk.

Good luck!
Danielle & Patryk

--- 
__Question 1:__ When starting a classification problem, it's important to create a development set for training a prediction model and a held-out validation set for assessing the generalizability of the model.

Read in the dataset and randomly sample and create a development (80% of rows) and validation (20% of rows) set. (__**1 point**__)

Demonstrate that the development and validation sets are similar by computing and comparing the proportion of AIS and non-AIS cases. (__**2 points**__)

---

In [47]:
### code here
import pandas as pd
from sklearn.model_selection import train_test_split

#Load the dataset
df = pd.read_csv('pone.0212778.s002.csv')

df.head()

Unnamed: 0,ID,Label,Text
0,1,0,1. No diffusion restriction in brain parenchym...
1,2,0,1. Multiple rim- Like nodular enhancement of b...
2,3,1,1. Diffusion restriction in right parietal occ...
3,4,0,metallic artifact image distortion loss cere...
4,5,0,1. Right temporo-occipital cortical & subcorti...


In [62]:
#Split the dataset
dev_df, val_df = train_test_split(df, test_size = 0.20, random_state=42, stratify=df['Label'])

print(dev_df.shape, val_df.shape)

(2419, 3) (605, 3)


In [63]:
#Computing the proportion of AIS and Non AIS cases

def proportion_AIS_non_AIS(dataset):
    total_entries = len(dataset)
    AIS_entries = len(dataset[dataset['Label'] == 1])
    Non_AIS_entries = len(dataset[dataset['Label']== 0])
    
    AIS_proportion = round((AIS_entries/total_entries)*100,2)
    Non_AIS_proportion = round((Non_AIS_entries/total_entries)*100,2)
    
    return AIS_proportion, Non_AIS_proportion

In [64]:
# Calculating proportions for the development set
dev_ais_prop, dev_non_ais_prop = proportion_AIS_non_AIS(dev_df)

# Calculating proportions for the validation set
val_ais_prop, val_non_ais_prop = proportion_AIS_non_AIS(val_df)

result = pd.DataFrame({
    '': ['Development','Validation'],
    'AIS Proportion(%)': [dev_ais_prop,val_ais_prop],
    'Non AIS Proportion(%)': [dev_non_ais_prop, val_non_ais_prop]
})

print(result)

                AIS Proportion(%)  Non AIS Proportion(%)
0  Development              14.30                  85.70
1   Validation              14.21                  85.79
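
As a more compact way to run the same check, the class proportions can be sketched with `value_counts(normalize=True)`. The toy frame and the `class_proportions` helper below are illustrative assumptions, not the homework dataset:

```python
import pandas as pd

def class_proportions(frame, label_col="Label"):
    """Return the percentage of rows per class, rounded to 2 decimals."""
    return (frame[label_col].value_counts(normalize=True) * 100).round(2)

# Hypothetical toy data: 2 AIS rows (label 1) and 8 non-AIS rows (label 0)
toy = pd.DataFrame({"Label": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]})
props = class_proportions(toy)
print(props)
```

Calling the helper on `dev_df` and `val_df` should give the same percentages as the table above, since both approaches count rows per label.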


--- 
__Question 2:__ Before training a model, it can be informative to characterize the dataset and identify textual differences between the classes in the development set. In the paper, the authors describe differences between the non-AIS and AIS radiology reports in the distribution of impression-section length, measured in text characters.


<img src="img/Text_character_dist.png" width=500>

_Could there be other differences between the texts for each class?_ Write a program that reports the average number of individual sentences within the impression sections by class (AIS vs non-AIS) for the development set. 

For this question, process the development set to complete the following steps:

1. Segment the individual sentences within each impression section (__**3 points**__)
2. Determine the number of sentences for each impression section (__**1 point**__)
3. Compute and report the average number of sentences within the impressions for both AIS vs non-AIS cases. (__**1 point**__)

---

In [65]:
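# code here
import nltk
from nltk.tokenize import sent_tokenize

# sent_tokenize needs the Punkt sentence model; download it if it is missing
nltk.download('punkt', quiet=True)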

# Function to segment text into sentences and count them
def count_sentences(text):
    sentences = sent_tokenize(text)
    return len(sentences)

# Applying the function to the 'Text' column of the development set and creating a new column 'Num_Sentences'
dev_df['Num_Sentences'] = dev_df['Text'].apply(count_sentences)

# Displaying the first few rows of the modified development set
dev_df.head()

Unnamed: 0,ID,Label,Text,Num_Sentences
2193,2194,0,Unremarkable finding of brain parenchyma and c...,2
832,833,0,Target appeared lesion in right corona radiata...,6
3014,3015,0,Diffuse brain atrophy\r\nNo diffusion restrict...,4
2663,2664,0,Unremarkable finding of brain parenchyma and c...,2
2346,2347,0,Multiple unidentified bright objects or small ...,2


In [66]:
# import re

# # Function to segment text into sentences using basic delimiters and count them
# def basic_sentence_count(text):
#     sentences = re.split(r'[.!?]', text)
#     # Removing any empty strings resulted from the split
#     sentences = [s for s in sentences if s]
#     return len(sentences)

# # Applying the basic sentence count function to the 'Text' column of the development set
# train_X['Num_Sentences_Basic'] = train_X['Text'].apply(basic_sentence_count)

# # Displaying the first few rows of the modified development set
# train_X.head()

In [67]:
# Calculating the average number of sentences for each class (AIS vs non-AIS)
avg_sentences_AIS = dev_df[dev_df['Label'] == 1]['Num_Sentences'].mean()
avg_sentences_Non_AIS = dev_df[dev_df['Label'] == 0]['Num_Sentences'].mean()

print(avg_sentences_AIS, avg_sentences_Non_AIS)

9.606936416184972 5.187650747708635
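
The per-class average can also be sketched with a single `groupby`. To keep the example self-contained, it uses a deliberately naive regex splitter (`naive_sentence_count`, a hypothetical helper) and made-up texts instead of `sent_tokenize` and the real reports:

```python
import re
import pandas as pd

def naive_sentence_count(text):
    """Count non-empty pieces after splitting on ., !, ? or line breaks."""
    pieces = re.split(r"[.!?\r\n]+", text)
    return sum(1 for piece in pieces if piece.strip())

# Hypothetical impression sections, not taken from the dataset
toy = pd.DataFrame({
    "Label": [1, 1, 0],
    "Text": [
        "Acute infarct in left MCA territory. No hemorrhage. Mild atrophy.",
        "Diffusion restriction in right parietal lobe.\r\nNo mass effect.",
        "Unremarkable finding of brain parenchyma.",
    ],
})
toy["Num_Sentences"] = toy["Text"].apply(naive_sentence_count)

# One mean per class, indexed by label
avg_by_class = toy.groupby("Label")["Num_Sentences"].mean()
print(avg_by_class)
```

A splitter this simple would also count list markers such as `1.` as sentence ends, so it is only a stand-in for a proper sentence tokenizer.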


--- 
__Question 3:__ Name 1 advantage and 1 disadvantage of each of the following feature engineering steps:

1. stemming words (__**1 point**__)
2. generating bigrams (__**1 point**__)



In [38]:
#write response here


## 1. Stemming words
### Advantage: Reduces Dimensionality
Stemming helps in reducing the number of unique words in the dataset by converting different forms of a word to a common base form (stem). This reduction in vocabulary size helps in decreasing the dimensionality of the feature space, making the model simpler and less prone to overfitting.

### Disadvantage: Loss of Semantics
Stemming might lead to words being converted to stems that are not actual words. It can also cause different words with distinct meanings to be stemmed to the same root, leading to a loss of information and potential misinterpretation of the text's semantics.

## 2. Generating Bigrams
### Advantage: Captures Contextual Information
Bigrams help in capturing the contextual information and relationships between adjacent words in the text. Including bigrams as features allows the model to consider word pairs, potentially capturing more meaningful information compared to using only unigrams (single words).

### Disadvantage: Increases Dimensionality
Generating bigrams significantly increases the number of features as it considers pairs of words. This increase in dimensionality can make the model more complex, require more data to train effectively, and increase the computational cost. It might also make the model more susceptible to overfitting if the dataset is not sufficiently large and diverse.
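
A small sketch may make both trade-offs concrete. It assumes NLTK's `PorterStemmer` (the stemmer used later in this notebook); the word list and token list are illustrative only:

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Stemming merges word forms, including words with different meanings
words = ["diffuse", "diffusion", "restriction", "restricted"]
stems = [ps.stem(w) for w in words]
print(dict(zip(words, stems)))

# Bigrams add word-pair features on top of the unigrams
tokens = ["no", "diffus", "restrict", "right", "pariet", "lobe"]
bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
print(bigrams)
print(len(tokens), "unigrams ->", len(tokens) + len(bigrams), "features with bigrams")
```

Here "diffuse" (as in diffuse atrophy) and "diffusion" (as in diffusion restriction) collapse to the same stem, which shrinks the vocabulary but blurs a clinically relevant distinction. The bigram "diffus restrict" keeps that phrase as its own feature, at the cost of a larger feature space.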

--- 
__Question 4:__ Following the diagram below, recreate the pre-processing and feature engineering steps to create a feature matrix for the development and validation sets: (__**2 points**__)

1. Tokenize into single word units 
2. Reduce the case of each word
3. Remove all stopwords
4. Stem each word
5. Apply TF-IDF weighting
6. Generate contiguous bi-grams from the stemmed words

---

<img src="img/Pre-processing.png" width=500>


In [39]:
#code here 

#Import Libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Downloading the stopwords from NLTK (fall back to an empty stopword set if this fails)
try:
    nltk.download('stopwords')
    nltk.download('punkt')
    stopwords_nltk = set(stopwords.words('english'))
except Exception:
    stopwords_nltk = set()

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/priyam/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/priyam/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [40]:
# Function to preprocess the text
ps = PorterStemmer()  # stemmer instance used inside preprocess_text

def preprocess_text(text):
    # Tokenizing and converting to lowercase
    tokens = word_tokenize(text.lower())

    # Removing stopwords and punctuation, and stemming the words
    tokens = [ps.stem(word) for word in tokens if word.isalnum() and word not in stopwords_nltk]

    return ' '.join(tokens)

In [68]:
# Applying the preprocessing function to the development and validation sets
dev_df['Processed_Text'] = dev_df['Text'].apply(preprocess_text)
val_df['Processed_Text'] = val_df['Text'].apply(preprocess_text)

# Displaying the first few rows of the development dataframe with the processed text
dev_df[['Text', 'Processed_Text']].head()

Unnamed: 0,Text,Processed_Text
2193,Unremarkable finding of brain parenchyma and c...,unremark find brain parenchyma cerebrospin flu...
832,Target appeared lesion in right corona radiata...,target appear lesion right corona radiata diff...
3014,Diffuse brain atrophy\r\nNo diffusion restrict...,diffus brain atrophi diffus restrict multipl o...
2663,Unremarkable finding of brain parenchyma and c...,unremark find brain parenchyma cerebrospin flu...
2346,Multiple unidentified bright objects or small ...,multipl unidentifi bright object small vessel ...


In [69]:
# Initializing the TfidfVectorizer to consider both unigrams and bigrams
vectorizer = TfidfVectorizer(ngram_range=(1, 2))

# Fitting the vectorizer and transforming the development set
dev_tfidf = vectorizer.fit_transform(dev_df['Processed_Text'])

# Transforming the validation set using the fitted vectorizer
val_tfidf = vectorizer.transform(val_df['Processed_Text'])

In [76]:
# Displaying the shapes of the transformed development and validation sets
dev_tfidf.shape, val_tfidf.shape

((2419, 10242), (605, 10242))

In [79]:
print(dev_tfidf)

  (0, 177)	0.17885086644245832
  (0, 3702)	0.17885086644245832
  (0, 367)	0.17948046893082822
  (0, 7697)	0.10999679381584516
  (0, 5459)	0.10915549693937357
  (0, 8673)	0.25168718383363
  (0, 3231)	0.23312283918811527
  (0, 1404)	0.23098857377479334
  (0, 6853)	0.2339122510508171
  (0, 1053)	0.23234060389619726
  (0, 3124)	0.2339122510508171
  (0, 9701)	0.23331951422118302
  (0, 173)	0.16963884363685464
  (0, 3701)	0.17885086644245832
  (0, 345)	0.10971544656952369
  (0, 7696)	0.10915549693937357
  (0, 5458)	0.10915549693937357
  (0, 8655)	0.22834662489270907
  (0, 3214)	0.21415845175651618
  (0, 1403)	0.23098857377479334
  (0, 6850)	0.23214615118178974
  (0, 1043)	0.19654853151254503
  (0, 3122)	0.3411667356493756
  (0, 9695)	0.18435532564296195
  (1, 412)	0.06733864425102538
  :	:
  (2418, 9126)	0.16051011123761677
  (2418, 9746)	0.24337207671151065
  (2418, 4178)	0.22420992073727802
  (2418, 1795)	0.24337207671151065
  (2418, 9744)	0.22136427916640466
  (2418, 9172)	0.2139823798611

--- 
__Question 5:__ Name 2 supervised learners used in the manuscript. (__**1 point**__)

Define recall and precision. Explain the advantage of these measures over the use of accuracy and when it's most important to report them. (**2 points**)


In [None]:
#write response here


--- 
__Question 6:__ Train a Naive Bayes classifier using the development set.  Apply the trained model to the validation set. Report precision, recall, and F1-score. 

_Note that you can use features from Question 4's approach, a simple CountVectorizer, or create your own features._
(**5 points**)


In [75]:
#code here 
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, precision_recall_fscore_support

# Initializing the Multinomial Naive Bayes classifier
nb_classifier = MultinomialNB()

# Training the classifier using the development set
nb_classifier.fit(dev_tfidf, dev_df['Label'])

# Making predictions on the validation set
val_predictions = nb_classifier.predict(val_tfidf)

# Calculating precision, recall, and F1-score for the validation set
precision, recall, f1_score, _ = precision_recall_fscore_support(val_df['Label'], val_predictions, average='binary')

# Getting a detailed classification report
classification_rep = classification_report(val_df['Label'], val_predictions)

print("Precision: ",precision)
print("Recall: ",recall)
print("F1 Score: ",f1_score) 
print(classification_rep)

Precision:  1.0
Recall:  0.046511627906976744
F1 Score:  0.08888888888888888
              precision    recall  f1-score   support

           0       0.86      1.00      0.93       519
           1       1.00      0.05      0.09        86

    accuracy                           0.86       605
   macro avg       0.93      0.52      0.51       605
weighted avg       0.88      0.86      0.81       605
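
The reported scores can be reproduced by hand from a confusion matrix. The counts below are not printed anywhere above; they are inferred from the report (86 AIS and 519 non-AIS validation cases, AIS recall of 4/86, AIS precision of 1.0), so treat them as a reconstruction:

```python
# Counts for the AIS class (label 1), inferred from the classification report
tp, fp, fn, tn = 4, 0, 82, 519

precision = tp / (tp + fp)   # share of predicted AIS cases that are truly AIS
recall = tp / (tp + fn)      # share of true AIS cases that were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```

Under these counts the model misses 82 of the 86 AIS reports, yet accuracy still rounds to 0.86 because the non-AIS class dominates the validation set.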

