Task 1

Categorise each of the following use cases by the NLP application it relies on.

A model that decides which folder an email should be sent to (work, friends, promotions, important), like Gmail's inbox tabs.

A model that helps decide what grade to award to an essay question. This can be used by a university professor who grades a lot of classes or essay competitions.

A model that provides assistive technology for doctors to support their diagnoses. Remember, doctors ask questions, so the model will use the patient's answers to provide probable diagnoses for the doctor to weigh and make decisions on.



Natural language processing applications:

Language Translation
Text Classification
Automatic Summarisation
Sentiment Analysis
Question Answering

Email Folder Allocation Model:

NLP Task: Text Classification
Explanation: This model involves categorizing emails into different folders (work, friends, promotions, important) based on the content of the email. NLP can be used for text classification to analyze the text of the email and predict the appropriate category or folder. Techniques like machine learning, particularly classification algorithms, can be applied to achieve this task.



In [1]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

# Sample data
emails = ["Work: Meeting at 2 PM", "Friends: Let's catch up this weekend!", "Promotions: Special offers inside!", "Important: Urgent update"]

labels = ["work", "friends", "promotions", "important"]

# Encode labels
label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(labels)

# Vectorize the text
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, encoded_labels, test_size=0.2, random_state=42)

# Train a Naive Bayes classifier
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# Make predictions
predictions = classifier.predict(X_test)

# Decode predictions
decoded_predictions = label_encoder.inverse_transform(predictions)

# Evaluate accuracy
accuracy = accuracy_score(y_test, predictions)

print("Predicted Labels:", decoded_predictions)
print("Accuracy:", accuracy)


Predicted Labels: ['important']
Accuracy: 0.0


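Explanation: This code demonstrates a basic email folder allocation model using a Naive Bayes classifier for text classification.
It uses the CountVectorizer to convert the text data into numerical features and the MultinomialNB classifier for training and prediction.
The labels (work, friends, promotions, important) are encoded using LabelEncoder for model training.
The accuracy of the model is evaluated using the accuracy_score from scikit-learn.
With only four emails and four distinct labels, the split leaves three emails for training and one for testing, so the held-out email always carries a label the classifier never saw. The accuracy of 0.0 is therefore expected and says nothing about how the approach would perform on a realistic dataset.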
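
A more meaningful check needs several emails per folder and a split that keeps every label in the training data. The following is a minimal sketch under those assumptions: the twelve email texts are a hypothetical toy corpus and `build_folder_classifier` is an illustrative helper name, neither taken from the exercise. It chains the vectorizer and the classifier in a scikit-learn pipeline and uses a stratified split so each folder appears in both the training and the test portion.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: three emails per folder (illustrative only).
emails = [
    "Project meeting moved to 2 PM in the main office",
    "Please review the quarterly report before the meeting",
    "Deadline for the project report is Friday",
    "Let's catch up this weekend for dinner",
    "Are you coming to the birthday party on Saturday",
    "Weekend hiking trip with the old gang",
    "Special offers inside, 50 percent discount today",
    "Exclusive discount on shoes, sale ends tonight",
    "Big sale, special offers on electronics",
    "Urgent: your password expires tomorrow",
    "Urgent security update required for your account",
    "Action required: verify your account password",
]
labels = ["work"] * 3 + ["friends"] * 3 + ["promotions"] * 3 + ["important"] * 3


def build_folder_classifier():
    """Bag-of-words features followed by a multinomial Naive Bayes classifier."""
    return make_pipeline(CountVectorizer(), MultinomialNB())


# Stratified split: one email per folder is held out for testing,
# so every label is still present in the training data.
X_train, X_test, y_train, y_test = train_test_split(
    emails, labels, test_size=4, stratify=labels, random_state=42
)

model = build_folder_classifier()
model.fit(X_train, y_train)  # the pipeline accepts raw strings and string labels

predictions = model.predict(X_test)
print("Predicted Labels:", list(predictions))
print("Accuracy:", accuracy_score(y_test, predictions))
```

Because the pipeline takes string labels directly, no separate LabelEncoder step is needed in this sketch. The corpus is still far too small for the accuracy figure to mean much.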

Essay Grading Model:
NLP Task: Text Evaluation or Scoring
Explanation: This model aims to determine the grade for an essay question. NLP can be employed for text evaluation by analyzing the content of the essay, assessing its coherence, relevance, and other linguistic features. Machine learning models, such as regression or classification models, can be trained on annotated essay data to predict the appropriate grade.
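
The "linguistic features" mentioned above can be made concrete with a few surface statistics. The sketch below is only an illustration: `essay_features` is a hypothetical helper, and the chosen features (word count, sentence count, average sentence length, type-token ratio) are assumptions about what a grader might start with, not a validated scoring scheme.

```python
import re


def essay_features(text):
    """Return a few simple surface features of an essay as a dict."""
    # Split into sentences on runs of ., ! or ? and drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Lower-cased word tokens (letters and apostrophes only).
    words = re.findall(r"[A-Za-z']+", text.lower())
    n_words = len(words)
    return {
        "n_words": n_words,
        "n_sentences": len(sentences),
        # Longer sentences are a rough proxy for syntactic complexity.
        "avg_sentence_length": n_words / len(sentences) if sentences else 0.0,
        # Share of distinct words: a rough proxy for vocabulary richness.
        "type_token_ratio": len(set(words)) / n_words if n_words else 0.0,
    }


sample = "The argument is clear. The evidence is strong and the argument holds."
features = essay_features(sample)
print(features)
```

Features like these could be fed to a regression or classification model alongside, or instead of, word-based vectors.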

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Sample data: three short essay texts and their grades
essays = ["Well-structured essay with clear arguments.", "Lacks coherence and supporting evidence.", "Exceptional analysis and critical thinking."]

grades = [90, 65, 95]

# Vectorize the text
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(essays)

# Split the data (with three essays this leaves two for training and one for testing)
X_train, X_test, y_train, y_test = train_test_split(X, grades, test_size=0.2, random_state=42)

# Train a Linear Regression model
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Make predictions
predictions = regressor.predict(X_test)

# Evaluate mean squared error
mse = mean_squared_error(y_test, predictions)

print("Predicted Grades:", predictions)
print("Mean Squared Error:", mse)


Predicted Grades: [80.]
Mean Squared Error: 100.0


Explanation: This code implements a basic essay grading model using Linear Regression for text evaluation.
It uses the TfidfVectorizer to convert the text data into numerical features and the LinearRegression model for training and prediction.
The grades are treated as continuous values, and the model predicts a numeric score for each essay.
The mean squared error is calculated as a measure of model performance.
With only three essays, the split leaves two for training and one for testing. The output shows a prediction of 80 for the essay graded 90, which gives the squared error of 100. The value 80 is the mean of the two training grades (65 and 95), as expected when the held-out essay shares no words with the training essays, so the result illustrates the workflow rather than real predictive ability.

Assistive Technology for Medical Diagnosis:

NLP Task: Natural Language Understanding, Text-based Inference
Explanation: In this case, the model assists doctors in providing diagnoses based on the patient's answers to questions. NLP plays a crucial role in understanding and interpreting the natural language responses of patients. It involves extracting relevant information from the patient's responses, understanding medical context, and possibly generating or suggesting probable diagnoses. Techniques like information extraction, semantic role labeling, and inference can be applied.
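
As one illustration of the information extraction step described above, the sketch below pulls known symptom phrases out of a patient's answer and skips those that are negated in the same clause. Everything in it is an assumption made for illustration: `SYMPTOM_LEXICON` is a hypothetical five-entry list rather than a clinical vocabulary, and the negation rule is deliberately crude.

```python
import re

# Hypothetical symptom lexicon (illustrative only, not a clinical vocabulary).
SYMPTOM_LEXICON = ["cough", "shortness of breath", "chest pain", "fatigue", "fever"]

# Very small set of negation cues, checked within the current clause only.
NEGATION = re.compile(r"\b(no|not|without)\b")


def extract_symptoms(response):
    """Return lexicon symptoms mentioned in the response and not negated."""
    text = response.lower()
    found = []
    for symptom in SYMPTOM_LEXICON:
        # Only the first mention of each symptom is considered.
        match = re.search(r"\b" + re.escape(symptom) + r"\b", text)
        if match is None:
            continue
        # Text between the last clause boundary (, . ;) and the symptom.
        clause = re.split(r"[,.;]", text[:match.start()])[-1]
        if NEGATION.search(clause):
            continue  # e.g. "no fever" is skipped
        found.append(symptom)
    return found


print(extract_symptoms("I have a persistent cough and shortness of breath."))
print(extract_symptoms("No fever, but some chest pain and fatigue."))
```

The extracted symptoms could then serve as structured features for a classifier or as a summary shown to the doctor next to the model's suggestions.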

In [5]:
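import nltk

# Download the NLTK resources used below: the tokenizer models for
# word_tokenize and the tagger model for pos_tag. Newer NLTK releases may
# instead ask for 'punkt_tab' and 'averaged_perceptron_tagger_eng'.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder  # needed for the label encoding below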

# Sample data
patient_responses = ["I have a persistent cough and shortness of breath.", "Experiencing chest pain and fatigue.", "No specific symptoms, feeling normal."]

diagnoses = ["Respiratory infection", "Possible heart condition", "Healthy"]

# Tokenize and apply POS tagging
tokenized_responses = [nltk.word_tokenize(response) for response in patient_responses]
pos_tagged_responses = [nltk.pos_tag(tokens) for tokens in tokenized_responses]

# Flatten POS tagged responses
flat_pos_tagged_responses = [" ".join([tag[1] for tag in pos_tags]) for pos_tags in pos_tagged_responses]

# Vectorize the text
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(flat_pos_tagged_responses)

# Encode labels
label_encoder = LabelEncoder()
encoded_diagnoses = label_encoder.fit_transform(diagnoses)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, encoded_diagnoses, test_size=0.2, random_state=42)

# Train a Random Forest classifier
classifier = RandomForestClassifier()
classifier.fit(X_train, y_train)

# Make predictions
predictions = classifier.predict(X_test)

# Decode predictions
decoded_predictions = label_encoder.inverse_transform(predictions)

# Evaluate accuracy
accuracy = accuracy_score(y_test, predictions)

print("Predicted Diagnoses:", decoded_predictions)
print("Accuracy:", accuracy)



Predicted Diagnoses: ['Healthy']
Accuracy: 0.0


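Explanation:

This code illustrates a basic model for medical diagnosis using a RandomForest classifier and natural language understanding.
It uses NLTK for tokenization and part-of-speech (POS) tagging of patient responses.
Only the POS tag sequences are vectorized using the TfidfVectorizer and fed into the RandomForest classifier, so the model sees the grammatical pattern of each response but not the symptom words themselves.
Diagnoses are encoded using LabelEncoder, and the accuracy of the model is evaluated.
With three responses and three distinct diagnoses, the held-out response always carries a diagnosis that is absent from the training data, so the accuracy of 0.0 is expected.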
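
The use case asks for probable diagnoses that the doctor can weigh, which suggests returning a ranked list with probabilities rather than a single label. The following is a minimal sketch of that idea, assuming word-based TF-IDF features and a logistic regression classifier in place of the POS-tag features and random forest used above. The six training responses are hypothetical and `rank_diagnoses` is an illustrative helper name; none of this is clinical guidance.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: two responses per diagnosis (illustrative only).
responses = [
    "I have a persistent cough and shortness of breath.",
    "Coughing a lot at night and wheezing when I breathe.",
    "Experiencing chest pain and fatigue.",
    "Tightness and pain in my chest when climbing stairs.",
    "No specific symptoms, feeling normal.",
    "I feel fine and have no complaints.",
]
diagnoses = [
    "Respiratory infection",
    "Respiratory infection",
    "Possible heart condition",
    "Possible heart condition",
    "Healthy",
    "Healthy",
]

# TF-IDF over the words themselves, followed by a probabilistic classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(responses, diagnoses)


def rank_diagnoses(response, top_k=3):
    """Return up to top_k (diagnosis, probability) pairs, most probable first."""
    probabilities = model.predict_proba([response])[0]
    ranked = sorted(zip(model.classes_, probabilities), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


for diagnosis, probability in rank_diagnoses("Persistent cough and trouble breathing"):
    print(f"{diagnosis}: {probability:.2f}")
```

A ranked list keeps the doctor in control of the final decision, which matches the assistive role described in the task.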