## NLP Project notebook.
---

This is my demonstration of understanding of Units 4 and 5 in CL-NLP.

In [1]:
print("Submitted by Karan Taneja |\nSAP ID: 500084399 | Batch: 5, AIML (H)")

Submitted by Karan Taneja |
SAP ID: 500084399 | Batch: 5, AIML (H)


# Unit - 4: Applications of NLP

## Information retrieval in NLP.


Code to demonstrate Information Retrieval:

In this example, we define a collection of documents, a query, and use the TF-IDF algorithm to compute the similarity between the query and the documents. 

And simultaneously, I'll cover the steps involved.

In [2]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

Data Acquisition: The first step in any NLP task is to acquire the relevant data. In IR, this may involve crawling websites, scraping data from social media platforms, or accessing structured datasets such as news articles or scientific publications.

Here, the dataset is directly provided, so no acquisition is required.

In [3]:
# Define a collection of documents
docs = [
    'The quick brown fox jumps over the lazy dog',
    'The dog chased the cat',
    'The cat climbed a tree',
    'A bird in the hand is worth two in the bush',
    'Time flies like an arrow; fruit flies like a banana'
]

Indexing: After preprocessing, the text data is typically indexed for efficient retrieval. This involves creating an inverted index that maps terms to the documents that contain them, as well as other metadata such as document frequency and term frequency.

In [4]:
# Create a TF-IDF vectorizer and transform the documents
tfidf = TfidfVectorizer()
doc_vectors = tfidf.fit_transform(docs)
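The inverted index described in the indexing step is not something `TfidfVectorizer` exposes directly, so here is a minimal sketch of the idea in plain Python. The helper names `tokenize` and `build_inverted_index` are illustrative and not part of scikit-learn, and the tokenizer (lowercase, alphanumeric runs) is a simplifying assumption.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Simplifying assumption: lowercase and keep alphanumeric runs only
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(documents):
    # Map each term to {document id: term frequency in that document}
    index = defaultdict(dict)
    for doc_id, text in enumerate(documents):
        for term in tokenize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

sample_docs = [
    'The quick brown fox jumps over the lazy dog',
    'The dog chased the cat',
    'The cat climbed a tree',
]
index = build_inverted_index(sample_docs)

# Posting list for a term: which documents contain it, and how often
print(index['dog'])        # {0: 1, 1: 1}
# Document frequency is the length of the posting list
print(len(index['the']))   # 3
```

Looking up a query term then only touches the documents in its posting list, instead of scanning the whole collection.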


Query Processing: When a user submits a query, it needs to be processed to identify the relevant terms and to generate a ranked list of documents that match the query. Here, we're using TF-IDF.

In [5]:
# Define a query
query = 'dog'

# Transform the query into a vector
query_vector = tfidf.transform([query])

# Compute cosine similarity between the query vector and the document vectors
similarity = cosine_similarity(query_vector, doc_vectors)[0]


Retrieval: Using the index and query processing techniques, the system retrieves a ranked list of documents that are most relevant to the user's query. Here, we're using cosine similarity.

In [6]:
# Sort the documents by descending order of similarity
indices = np.argsort(similarity)[::-1]
scores = similarity[indices]
documents = [docs[i] for i in indices]


In [7]:
# Print the top 3 most similar documents
for i in range(3):
    print(f'Similarity score: {scores[i]:.2f}\nDocument: {documents[i]}\n')

Similarity score: 0.43
Document: The dog chased the cat

Similarity score: 0.29
Document: The quick brown fox jumps over the lazy dog

Similarity score: 0.00
Document: Time flies like an arrow; fruit flies like a banana



This is a barebones IR system based on TF-IDF and cosine similarity.
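To make the scores above less of a black box, the sketch below recomputes them by hand. It assumes the default `TfidfVectorizer` settings as I understand them: lowercasing, tokens of two or more word characters, raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization. The helper names are illustrative.

```python
import math
import re
from collections import Counter

docs = [
    'The quick brown fox jumps over the lazy dog',
    'The dog chased the cat',
    'The cat climbed a tree',
    'A bird in the hand is worth two in the bush',
    'Time flies like an arrow; fruit flies like a banana',
]

def tokenize(text):
    # Assumed to mirror the vectorizer default: lowercase, tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def tfidf_vector(tokens, idf):
    # Raw term counts weighted by IDF, then scaled to unit (L2) length
    counts = Counter(t for t in tokens if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

tokenized = [tokenize(d) for d in docs]
n_docs = len(docs)
doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

doc_vecs = [tfidf_vector(tokens, idf) for tokens in tokenized]
query_vec = tfidf_vector(tokenize('dog'), idf)

# Vectors are unit length, so the dot product is the cosine similarity
scores = [sum(w * vec.get(t, 0.0) for t, w in query_vec.items()) for vec in doc_vecs]
for doc, score in sorted(zip(docs, scores), key=lambda p: p[1], reverse=True)[:2]:
    print(f'{score:.2f}  {doc}')
```

Run as written, this prints 0.43 for "The dog chased the cat" and 0.29 for the fox sentence, which agrees with the notebook output above. The shorter document wins because its vector has fewer other terms competing with "dog" after normalization.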

### Evaluation of IR systems:

Evaluation is a critical step in information retrieval (IR) to measure the performance of the IR system and to identify areas for improvement. There are several commonly used evaluation metrics in IR, including:

1. Precision: Precision measures the proportion of retrieved documents that are relevant to the query. In other words, it answers the question: "Of the documents that were retrieved, how many were actually relevant?" A high precision score means that the system retrieves mostly relevant documents.

2. Recall: Recall measures the proportion of relevant documents that are retrieved by the system. In other words, it answers the question: "Of all the relevant documents, how many were retrieved?" A high recall score means that the system retrieves most of the relevant documents.

3. F1 score: The F1 score is the harmonic mean of precision and recall, and provides a balanced measure of both precision and recall. It ranges from 0 to 1, where a score of 1 indicates perfect precision and recall.

4. Mean Average Precision (MAP): MAP measures the average precision of the system across multiple queries. It takes into account the order of the retrieved documents, and penalizes the system for returning irrelevant documents early in the ranking.

To evaluate an IR system, a set of queries and corresponding relevant documents (known as a "test collection") is typically used. The IR system is then evaluated on its ability to retrieve the relevant documents for each query, and the evaluation metrics are computed and reported.
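The four metrics above can be sketched in a few lines of plain Python. The relevance judgments in `runs` are hypothetical and stand in for a real test collection; the function names are my own.

```python
def precision_recall_f1(retrieved, relevant):
    # Set-based metrics: the order of the retrieved list is ignored
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

def average_precision(ranked, relevant):
    # Mean of precision@k taken at each rank k where a relevant document appears
    relevant_set = set(relevant)
    hits, total = 0, 0.0
    for k, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant_set:
            hits += 1
            total += hits / k
    return total / len(relevant_set) if relevant_set else 0.0

def mean_average_precision(runs):
    # runs: one (ranked result list, relevant document ids) pair per query
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Hypothetical relevance judgments for two queries
runs = [
    ([1, 0, 4], [0, 1]),   # both relevant documents are ranked first
    ([3, 2, 0], [2]),      # the only relevant document is ranked second
]
print(precision_recall_f1(*runs[0]))
print(mean_average_precision(runs))   # 0.75
```

The second query shows why MAP is order-sensitive: the relevant document is retrieved, but an irrelevant one ranked above it halves that query's average precision.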

### Design features relevant for IR systems.

1. **Indexing**: This is the process of creating an index of the documents to be searched. The index typically contains the words or terms that appear in the document, along with their frequency and location. The indexing process can involve techniques such as tokenization, stemming, and stopword removal.

2. **Query processing**: This is the process of parsing and processing user queries to extract the relevant terms or keywords. Techniques such as query expansion and relevance feedback can be used to improve the quality of the query.

3. **Ranking**: This is the process of ranking the documents in the index based on their relevance to the query. Techniques such as tf-idf weighting and BM25 can be used to assign a relevance score to each document.

4. **User interface**: This is the interface that allows users to interact with the information retrieval system (IRS). The user interface can include features such as search bars, filters, and sorting options.

5. **Scalability**: The IRS should be able to handle large volumes of data and queries efficiently. Techniques such as distributed indexing and query processing can be used to improve scalability.

6. **Evaluation**: The IRS should be evaluated using appropriate metrics to measure its effectiveness and identify areas for improvement. Common evaluation metrics include precision, recall, F1 score, MAP, and NDCG.

7. **Adaptivity**: The IRS should be able to adapt to changes in the data or user behavior. Techniques such as machine learning and natural language processing can be used to improve adaptivity.

8. **Security**: The IRS should have appropriate security measures in place to protect sensitive data and user privacy. Techniques such as access control and encryption can be used to improve security.
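The ranking item above mentions BM25 alongside tf-idf. A minimal sketch of BM25 scoring follows, assuming the commonly used parameter values `k1=1.5` and `b=0.75` and a smoothed, non-negative IDF variant; the tokenizer and function names are illustrative choices, not a reference implementation.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Simplifying assumption: lowercase and keep alphanumeric runs only
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query, documents, k1=1.5, b=0.75):
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in counts:
                continue
            # Smoothed IDF variant that stays non-negative
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = counts[term]
            # Term-frequency saturation with document-length normalization
            denom = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / denom
        scores.append(score)
    return scores

sample_docs = [
    'The quick brown fox jumps over the lazy dog',
    'The dog chased the cat',
    'The cat climbed a tree',
]
for doc, score in zip(sample_docs, bm25_scores('dog', sample_docs)):
    print(f'{score:.3f}  {doc}')
```

As with the TF-IDF example, the shorter document containing "dog" scores higher, here because of the explicit length normalization controlled by `b`.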

## Information Extraction:

1. **Input data**: We start with a piece of text containing some information that we want to extract. For this example, let's say we have a string that contains some phone numbers.

In [8]:
import re

text = "Please contact us at 123-456-7890 or 555-555-5555."

2. **Define regular expression pattern**: We need to define a regular expression pattern that will match the information we want to extract. In this case, we want to extract phone numbers in the format "XXX-XXX-XXXX", so we can use the following pattern:



In [9]:
pattern = r"\d{3}-\d{3}-\d{4}"


3. **Compile regular expression**: We need to compile the regular expression pattern using the `re.compile()` function.

In [10]:
regex = re.compile(pattern)

4. **Search for matches**: We can use the compiled pattern's `findall()` method to search for all occurrences of the regular expression pattern in the input text.

In [11]:
matches = regex.findall(text)

5. **Output extracted information**: We can output the extracted information to the console, or to a file, or store it in a variable for further processing.

In [12]:
print(matches)

['123-456-7890', '555-555-5555']


### Applications of IE:

Information extraction (IE) has many practical applications across a wide range of industries and domains. Here are some examples:

1. **Business intelligence and analytics**: IE can be used to extract and analyze large amounts of unstructured data from sources such as emails, social media, and customer feedback, in order to identify trends, patterns, and insights that can inform business decision-making.

2. **Healthcare**: IE can be used to extract structured data from medical records, such as patient demographics, diagnoses, and treatments, in order to improve clinical decision-making, patient outcomes, and healthcare management.

3. **Legal and regulatory compliance**: IE can be used to extract and categorize legal and regulatory documents, such as contracts, policies, and financial reports, in order to identify compliance risks, monitor regulatory changes, and ensure regulatory compliance.

4. **Information retrieval**: IE can be used to extract information from text documents in order to improve search results and relevance. For example, extracting key phrases and entities from a document can help improve the accuracy of search queries.

5. **Content generation**: IE can be used to generate structured data that can be used to automatically create content such as summaries, abstracts, and headlines, which can be used to improve information discovery and consumption.

These are just a few examples of the applications of information extraction. As the amount of unstructured data on the internet continues to grow, and as Big Data, deep learning, and very large models continue to advance, IE has become an extremely lucrative field to contribute to.

### Report generation:

Report generation uses NLP techniques to find the important and opinionated words within your documents. Here is a simple report generator for customer reviews of a restaurant:

In [14]:
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import PorterStemmer
from nltk.chunk import ne_chunk

# Load the customer reviews
reviews = [
    "The food was excellent but the service was slow.",
    "The atmosphere was cozy and the staff were friendly.",
    "The prices were reasonable but the portions were small.",
    "The cocktails were delicious but the noise level was too high.",
    "The desserts were disappointing and the wait for a table was long."
]

# Build the stop-word set and the stemmer once, rather than on every call
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

# Define a function to preprocess the reviews
def preprocess_review(review):
    # Tokenize the review
    tokens = word_tokenize(review.lower())
    # Remove stop words and punctuation
    tokens = [token for token in tokens if token not in stop_words and token.isalnum()]
    # Stem the tokens
    tokens = [stemmer.stem(token) for token in tokens]
    # Perform named entity recognition
    named_entities = ne_chunk(nltk.pos_tag(tokens))
    # Extract the named entities
    named_entities = [' '.join(leaf[0] for leaf in tree.leaves()) for tree in named_entities if isinstance(tree, nltk.tree.Tree)]
    # Extract the key phrases
    grammar = "NP: {<DT>?<JJ>*<NN>}"
    cp = nltk.RegexpParser(grammar)
    key_phrases = []
    tree = cp.parse(nltk.pos_tag(tokens))
    for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
        key_phrase = ' '.join(word for word, tag in subtree.leaves())
        if len(key_phrase.split()) > 1:
            key_phrases.append(key_phrase)
    # Combine the named entities and key phrases
    extracted_information = named_entities + key_phrases
    # Return the preprocessed review
    return ' '.join(extracted_information)

# Preprocess the reviews
preprocessed_reviews = [preprocess_review(review) for review in reviews]

# Create a frequency distribution of the extracted information
flat_information = [item for sublist in preprocessed_reviews for item in sublist.split()]
freq_dist = nltk.FreqDist(flat_information)

# Print the top 3 key themes and issues
print("Top 3 key themes and issues:")
for item in freq_dist.most_common(3):
    print("- {} ({})".format(item[0], item[1]))


Top 3 key themes and issues:
- cozi (1)
- staff (1)
- dessert (1)


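The three terms reported are coziness, the staff, and the dessert. Each appears only once in this small sample, so the ranking is not yet meaningful. On a larger set of reviews, the most frequent terms would point to key areas for the establishment owners to focus on.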

### Ontology

Ontology refers to a formal representation of knowledge or concepts in a particular domain. It defines the relationships between different concepts in a domain, and the properties or attributes of those concepts. An ontology can be thought of as a shared vocabulary or taxonomy that provides a structured way of representing knowledge in a domain, which can be used to support tasks such as information retrieval, question answering, and knowledge management.

Ontologies are typically created by domain experts and knowledge engineers, who use formal languages such as the Web Ontology Language (OWL) to specify the relationships between concepts and properties. Ontologies can be represented in various formats, such as XML, RDF, and OWL, and can be stored and queried using ontology languages and tools.

Ontologies have many applications in NLP, such as in semantic search, where an ontology can be used to represent the meanings of words and their relationships, and in natural language understanding, where an ontology can be used to map natural language expressions to formal representations of meaning. Ontologies are also used in various knowledge-intensive applications, such as expert systems, decision support systems, and knowledge management systems.

An ontology can be thought of as a schema or a structured vocabulary for a particular domain. It specifies the concepts and relationships within a domain, and the properties and attributes of those concepts. In this way, an ontology provides a formal and structured way of representing knowledge that can be used to support various NLP applications.
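To make the idea of concepts and relationships concrete, here is a toy ontology stored as subject-predicate-object triples in plain Python. The concepts, the `is_a` and `chases` relations, and the helper names are all hypothetical; a real project would more likely use OWL or RDF tooling as described above.

```python
# Each fact is a (subject, predicate, object) triple
triples = {
    ('Dog', 'is_a', 'Mammal'),
    ('Cat', 'is_a', 'Mammal'),
    ('Mammal', 'is_a', 'Animal'),
    ('Bird', 'is_a', 'Animal'),
    ('Dog', 'chases', 'Cat'),
}

def superclasses(concept, facts):
    # Follow is_a links transitively to collect every ancestor concept
    found = set()
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for subj, pred, obj in facts:
            if subj == current and pred == 'is_a' and obj not in found:
                found.add(obj)
                frontier.append(obj)
    return found

def is_a(concept, ancestor, facts):
    # True if ancestor is reachable from concept through is_a links
    return ancestor in superclasses(concept, facts)

print(sorted(superclasses('Dog', triples)))   # ['Animal', 'Mammal']
print(is_a('Bird', 'Mammal', triples))        # False
```

This kind of simple inference is what lets a semantic search for "animal" also surface documents that only mention "dog".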

### Advantages of Ontology:

Before ontologies, taxonomies, thesauri, and frames were popular, but none of them was expressive enough for the more demanding NLP applications. While these approaches are still used in some contexts, ontologies have become more popular in recent years due to their flexibility and expressiveness. Ontologies provide a more formal and structured way of representing knowledge, which makes it easier to integrate and reason about knowledge from different sources. Additionally, ontologies can be used to support a wide range of NLP applications, such as information retrieval, question answering, and semantic search.

# Unit - 5: Emerging Technologies in NLP
---

Here's an outline of a few technologies highlighted in the presentations:

### Multimedia presentation generation

Multimedia presentation generation is the process of automatically creating a multimedia presentation from a set of input data, such as images, videos, and text. The goal is to generate a presentation that effectively communicates a message or tells a story to an audience.

Information extraction and natural language processing techniques can be used to generate the content of the presentation, while computer vision and machine learning techniques can be used to select and arrange the multimedia components.

Here is an example of how the process might work:

1. Input data: The input data might include a set of images, videos, and text related to a particular topic or theme. For example, the input data might consist of images and text related to a news story or a scientific discovery.

2. Content generation: Information extraction and natural language processing techniques can be used to analyze the input data and generate a script or outline for the presentation. This might involve identifying key concepts and themes, extracting important details and facts, and organizing the content into a coherent narrative structure.

3. Multimedia selection: Computer vision and machine learning techniques can be used to select the most relevant and effective multimedia components to include in the presentation. For example, images and videos might be selected based on their relevance to the content and their visual impact.

4. Multimedia arrangement: The selected multimedia components can be arranged and combined into a multimedia presentation using a variety of techniques, such as video editing software or presentation software.

5. Output: The final output of the process is a multimedia presentation that effectively communicates the message or story to the intended audience.

Multimedia presentation generation has many potential applications, including automated news reporting, educational content creation, and marketing and advertising.

### Language Interfaces for intelligent tutoring system


Language Interfaces for Intelligent Tutoring Systems (ITS) are designed to facilitate natural language communication between students and tutors. They are an essential part of modern ITS, as they allow students to ask questions and receive feedback in a more natural and intuitive way. Here are some short notes on language interfaces for intelligent tutoring systems:

- Language interfaces for ITS are typically designed using natural language processing (NLP) techniques, such as text classification, named entity recognition, and semantic parsing.

- One of the primary goals of language interfaces for ITS is to provide students with personalized feedback and support. This requires the system to be able to interpret and respond to the unique needs and characteristics of each individual student.

- Another key feature of language interfaces for ITS is their ability to adapt to changes in the student's knowledge and skills over time. This requires the system to be able to track the student's progress and adjust its feedback and support accordingly.

- Language interfaces for ITS can take many different forms, from simple chatbots to more sophisticated voice-activated assistants. The choice of interface depends on the specific needs and preferences of the students and the tutors.

- Language interfaces for ITS can be used in a wide range of educational contexts, from traditional classroom settings to online and remote learning environments. They are particularly useful in situations where students need to learn complex concepts or skills that require personalized feedback and support.

In conclusion, language interfaces for ITS are an important tool for enhancing student learning and improving educational outcomes. They enable students to receive personalized feedback and support in a natural and intuitive way, which can help to improve their understanding and retention of key concepts and skills.

They also allow for a more inclusive educational experience for people with disabilities by making knowledge more accessible. This is another example of how technology can drastically change many lives and help level the playing field for everyone.

### CIRCSIM-Tutor
---

CIRCSIM-Tutor is an intelligent tutoring system designed to teach cardiovascular physiology, specifically how the baroreceptor reflex regulates blood pressure. It was developed at the Illinois Institute of Technology in collaboration with Rush Medical College, and has been used with first-year medical students.

CIRCSIM-Tutor is built around qualitative prediction problems. The student is given a perturbation to the cardiovascular system and predicts whether variables such as heart rate and cardiac output will increase, decrease, or stay the same. The system then holds a natural language dialogue with the student to correct the errors in those predictions.

One of the key features of CIRCSIM-Tutor is its ability to provide personalized feedback to students. The system tracks the student's predictions and adapts its feedback to the errors they made. This helps to ensure that students receive the right level of guidance and support as they work through the material.

Students interact with CIRCSIM-Tutor through typed natural language, which lets them focus on the content rather than the technology.

Overall, CIRCSIM-Tutor is an early example of a dialogue-based tutor. Its personalized feedback and natural language interface make it a useful reference point for the systems that follow.

### AutoTutor

AutoTutor is a separate, slightly later dialogue-based tutoring system, developed at the University of Memphis starting in the late 1990s.

AutoTutor is an intelligent tutoring system that uses natural language processing and machine learning techniques to simulate a conversation with a human tutor.

AutoTutor works by engaging the student in a conversation and asking them questions about the material. The system uses natural language processing techniques to understand the student's responses and provide appropriate feedback and guidance. It also uses machine learning algorithms to adapt to the student's individual learning style and provide personalized support.

One of the key features of AutoTutor is its ability to provide human-like interactions with the student. The system uses a conversational interface and can understand and respond to natural language inputs, making the learning experience more engaging and interactive.

AutoTutor has been used to teach a range of subjects, including physics, computer science, and history. It has been shown to be an effective tool for improving student learning outcomes and engagement, and has the potential to be used in a wide range of educational settings.

### ATLAS Andes

ATLAS Andes was an intelligent tutoring system that aimed to improve students' problem-solving skills in the field of physics. Andes was developed at the University of Pittsburgh together with the United States Naval Academy, and ATLAS added natural language dialogue to it. It was designed to help students learn through interactive problem-solving exercises.

ATLAS Andes uses a database of physics problems and associated solutions to create a personalized learning experience for each student. The system presents the student with a problem and asks them to solve it using the principles of physics. The student's responses are then analyzed, and the system provides feedback and guidance to help them improve their problem-solving skills.

One of the key features of ATLAS Andes is its ability to adapt to the student's individual learning style and pace. The system uses machine learning algorithms to analyze the student's responses and determine their strengths and weaknesses. This allows the system to provide personalized feedback and guidance that is tailored to the student's needs.

ATLAS Andes also includes a range of instructional materials, including tutorials and interactive simulations, to help students learn the principles of physics. The system is designed to be user-friendly and accessible, with a simple interface that allows students to focus on the content rather than the technology.

### Why2-ATLAS tutoring system

Why2-ATLAS is a combination of two intelligent tutoring systems, Why2 and ATLAS Andes.

Why2 is a natural language based tutoring approach that engages students in a conversation about science concepts to help them learn more effectively. The Why2-ATLAS system was built at the University of Pittsburgh.

ATLAS Andes is an intelligent tutoring system that focuses on improving students' problem-solving skills in physics.

By combining these two systems, Why2-ATLAS provided students with a more comprehensive and personalized learning experience. The system used natural language processing to engage students in a conversation about physics concepts, helping them to develop a deeper understanding of the material. It also used machine learning algorithms to adapt to each student's individual learning style and pace, providing personalized feedback and guidance that is tailored to their needs.

## Healthcare

There is a wide range of applications for NLP in the healthcare industry. Here are a few examples:

Clinical documentation: NLP can be used to analyze and extract relevant information from clinical documents, such as medical records, discharge summaries, and progress notes. This helps to improve the accuracy and completeness of clinical documentation and can assist with clinical decision making.

Clinical decision support: NLP can be used to analyze clinical data and provide decision support to healthcare providers. For example, NLP algorithms can be used to identify patients who may be at risk for certain conditions or to recommend appropriate treatments based on patient data.

Patient monitoring: NLP can be used to monitor patient health and identify potential issues. For example, NLP can analyze electronic health records (EHRs) to identify patients who are at risk for readmission or to track patients' progress over time.

### Clinical Decision Support (CDS) System.

Clinical decision support (CDS) systems are computer-based tools designed to assist healthcare providers with clinical decision making. CDS systems use patient-specific information to provide tailored recommendations, alerts, and reminders to help clinicians make more informed decisions about patient care. These systems can help to reduce medical errors, improve patient outcomes, and increase efficiency in healthcare delivery.

CDS systems use a variety of technologies, including natural language processing (NLP), machine learning, and expert systems. They can be integrated into electronic health record (EHR) systems or other clinical information systems to provide real-time decision support to healthcare providers.

Examples of CDS system applications include medication dosing recommendations, alerts for potential drug interactions, and reminders for preventive care interventions. These systems can also be used to support clinical pathways and best practice guidelines to ensure that patients receive appropriate and consistent care.

### Implementation of a simple CDS System:

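This toy example classifies a patient's test result as normal or abnormal. The dataset is synthetic and every sentence contains its own label word, so the perfect accuracy reported below is expected and says nothing about performance on real clinical text.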

In [15]:
import random

# Define the labels and the number of samples per label
labels = ["normal", "abnormal"]
num_samples = 100

# Generate the dataset
dataset = []
for label in labels:
    for i in range(num_samples):
        # Build a templated sentence that contains the label
        sentence = "The patient's test results are " + label + "."
        
        # Append the sentence and label to the dataset
        dataset.append((sentence, label))

# Define a function to extract features from the text
def extract_features(text):
    features = {}
    # Count the occurrences of each word in the text
    for word in text.split():
        features[word] = features.get(word, 0) + 1
    return features

# Split the dataset into training and testing sets
random.shuffle(dataset)
train_data = dataset[:int(len(dataset)*0.8)]
test_data = dataset[int(len(dataset)*0.8):]

# Train a Naive Bayes classifier on the training data
import nltk
train_features = [(extract_features(text), label) for (text, label) in train_data]
classifier = nltk.NaiveBayesClassifier.train(train_features)

# Test the classifier on the testing data
test_features = [(extract_features(text), label) for (text, label) in test_data]
accuracy = nltk.classify.accuracy(classifier, test_features)
print("Accuracy:", accuracy)


Accuracy: 1.0


### Role of NLP in CDS, NLP-CDS

Natural Language Processing (NLP) can play a crucial role in Clinical Decision Support (CDS) systems by enabling computers to analyze unstructured clinical text data such as medical records, discharge summaries, and progress notes. NLP algorithms can extract relevant clinical information from these unstructured sources and convert it into a structured format that can be used to support clinical decision making.

NLP-CDS systems use a combination of NLP algorithms, machine learning, and other techniques to provide tailored recommendations, alerts, and reminders to healthcare providers based on patient-specific data. These systems can help to reduce errors, improve patient outcomes, and increase the efficiency of healthcare delivery.

## Sentiment Analysis

Businesses across many industries now serve very large numbers of customers, so the need to assess customer opinion at scale has grown. As the technology has become more accessible, applications of sentiment analysis have become more widespread.

### Difficulties in Sentiment Analysis:

Here's a list of a few common problems faced in sentiment analysis:

1. Ambiguity: Words and phrases can have multiple meanings, and their sentiment can vary depending on the context in which they are used. For example, the word "sick" can be used to describe both something that is unpleasant (e.g., "That woman was sick in the head") and something that is impressive (e.g., "That trick was sick").

2. Sarcasm and irony: Texts can contain sarcasm, irony, or other forms of figurative language that can be difficult for sentiment analysis tools to interpret. For example, a tweet that says "Thanks a lot" can be sarcastic and actually mean the opposite.

3. Negation: Negation can be difficult to handle in sentiment analysis because it can reverse the polarity of the sentiment. For example, the sentence "The food was not bad" actually means that the food was good.

4. Domain-specific language: Sentiment analysis models trained on general language may not perform well on text that contains domain-specific language. For example, a model trained on movie reviews may not perform well on text related to healthcare.

5. Data imbalance: In many datasets, one sentiment class may be more prevalent than the others, which can bias the model towards that class and reduce its accuracy on the other classes.

6. Cultural and linguistic differences: Sentiment analysis tools may perform differently in different languages and cultures. Some languages have complex grammar and syntax, which can make sentiment analysis more challenging.

Overcoming these difficulties requires careful consideration of the dataset and the specific context in which the sentiment analysis is being performed. It may also involve using specialized techniques, such as incorporating domain-specific language or training the model on data from multiple languages and cultures.


### Document level sentiment analysis:

Document-level classification in sentiment analysis involves analyzing the sentiment expressed in a document as a whole. The goal is to classify the document into one of several pre-defined categories, such as positive, negative, or neutral.

Document-level classification can be performed using a variety of machine learning algorithms, including Support Vector Machines (SVMs), Naive Bayes, and neural networks. The process typically involves the following steps:

1. Data preprocessing: The text data in the document is preprocessed to remove noise, such as stop words, punctuation, and HTML tags. The text data is then converted into a numerical representation, such as a bag-of-words or TF-IDF matrix.

2. Feature selection: The features used to represent the document are selected. This can include unigrams, bigrams, or n-grams. Other features can include part-of-speech tags or sentiment lexicons.

3. Model training: A machine learning model is trained on a labeled dataset of documents. The model learns to map the features of the document to its corresponding sentiment label.

4. Model evaluation: The performance of the trained model is evaluated on a separate test dataset. Common evaluation metrics include accuracy, precision, recall, and F1-score.

5. Prediction: The trained model can be used to predict the sentiment of new, unlabeled documents.

Document-level classification can be used in a variety of applications, such as analyzing product reviews, social media posts, and news articles. It is important to keep in mind that it can be challenging to accurately capture the nuances of sentiment expressed in longer documents, such as articles or essays. This is because the sentiment expressed in different parts of the document may be complex and varied. Nonetheless, document-level classification remains an important tool for sentiment analysis in many domains.
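The five steps listed above can be sketched end to end with a small multinomial Naive Bayes classifier written in plain Python. The training sentences, the tiny stop-word list, and the function names are all assumptions made for illustration; a real project would use far more labeled data.

```python
import math
import re
from collections import Counter, defaultdict

STOP_WORDS = {'the', 'a', 'an', 'and', 'was', 'were'}  # tiny illustrative list

def features(text):
    # Steps 1-2: lowercase, drop stop words, use unigrams plus bigrams as features
    tokens = [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOP_WORDS]
    bigrams = [' '.join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

def train(labeled_docs):
    # Step 3: collect per-class feature counts for multinomial Naive Bayes
    class_counts = Counter()
    feature_counts = defaultdict(Counter)
    for text, label in labeled_docs:
        class_counts[label] += 1
        feature_counts[label].update(features(text))
    vocab = {f for counts in feature_counts.values() for f in counts}
    return class_counts, feature_counts, vocab

def predict(text, model):
    # Step 5: choose the class with the highest log-probability (add-one smoothing)
    class_counts, feature_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        total = sum(feature_counts[label].values())
        score = math.log(n / total_docs)
        for f in features(text):
            if f in vocab:
                score += math.log((feature_counts[label][f] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled documents
train_docs = [
    ('The food was excellent and the staff were friendly', 'positive'),
    ('Delicious cocktails and a cozy atmosphere', 'positive'),
    ('Great service and reasonable prices', 'positive'),
    ('The service was slow and the food was cold', 'negative'),
    ('Rude staff and disappointing desserts', 'negative'),
    ('Long wait and the portions were small', 'negative'),
]
model = train(train_docs)

# Step 4: evaluate on held-out documents
test_docs = [('The pasta was delicious', 'positive'), ('The waiter was rude', 'negative')]
correct = sum(predict(text, model) == label for text, label in test_docs)
print('Accuracy:', correct / len(test_docs))   # Accuracy: 1.0
```

In practice a library such as scikit-learn, already imported earlier in this notebook, would replace the hand-written classifier; the sketch only shows what each step contributes.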

### Sentence level sentiment analysis:

As the name suggests, this method assigns a sentiment to each sentence of a document. Working at this finer-grained level lets us identify the sentiments expressed in different parts of the document and how they vary throughout.

### Lexicons in sentiment analysis:

In sentiment analysis, a lexicon is a set of words that each have a sentiment polarity assigned to them, i.e., words that sway the sentiment of the document or sentence. There are different types of lexicons; here are a few commonly used ones:

1. Polarity lexicons: These lexicons consist of words that are labeled with their corresponding sentiment polarity, such as positive or negative. Examples include SentiWordNet, AFINN, and MPQA.

2. Emotion lexicons: These lexicons consist of words that are labeled with their corresponding emotions, such as joy, anger, fear, or sadness. Examples include the NRC Emotion Lexicon (EmoLex) and WordNet-Affect.

3. Domain-specific lexicons: These lexicons are tailored to specific domains, such as social media or product reviews, and contain words or phrases that are relevant to that domain. Examples include SentiStrength and SenticNet.

Lexicons can be used in different ways for sentiment analysis, such as by counting the frequency of positive and negative words in a text, or by calculating the sentiment polarity of individual words and combining them to generate a sentiment score for the text as a whole. However, lexicons have limitations in terms of coverage, accuracy and domain specificity.

The same word can carry a different polarity or intensity in different contexts, so a lexicon may need to be adjusted for the use case at hand.
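
The two scoring strategies and the need for domain tweaks can be illustrated with a short sketch. The lexicon entries and weights below are invented for illustration and are not taken from any published lexicon.

```python
import re

# Hypothetical general-purpose lexicon with weights in [-1, 1] (values invented)
GENERAL_LEXICON = {"good": 0.6, "love": 0.9, "bad": -0.6, "hate": -0.9, "sick": -0.7}

# Domain tweak (assumption): in informal slang, "sick" is often used as praise
SLANG_LEXICON = {**GENERAL_LEXICON, "sick": 0.7}


def tokenize(text):
    """Lowercase the text and extract alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def count_based_score(text, lexicon):
    """Number of positive lexicon words minus number of negative ones."""
    weights = [lexicon[word] for word in tokenize(text) if word in lexicon]
    return sum(1 for w in weights if w > 0) - sum(1 for w in weights if w < 0)


def weighted_score(text, lexicon):
    """Sum of the polarity weights of all lexicon words found in the text."""
    return sum(lexicon.get(word, 0.0) for word in tokenize(text))


text = "That trick was sick, I love it"
for name, lexicon in [("general", GENERAL_LEXICON), ("slang", SLANG_LEXICON)]:
    print(name, count_based_score(text, lexicon), round(weighted_score(text, lexicon), 2))
```

With the general lexicon, "sick" cancels out "love" in the count-based score, while the slang-adjusted lexicon treats both words as positive.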

### Feature-based sentiment analysis:

Feature-based sentiment analysis is a type of sentiment analysis that identifies the sentiment expressed towards different aspects or features of a product, service, or entity. It involves analyzing individual words or phrases within a text and assigning a sentiment score to each one based on its polarity, i.e., positive, negative, or neutral.

In feature-based sentiment analysis, the goal is to identify the features or attributes that are being talked about in a text, and then analyze the sentiment associated with each feature. This is done by first extracting the relevant features or attributes from the text, and then using a sentiment lexicon or machine learning model to determine the sentiment polarity for each feature.

For example, if we are analyzing customer reviews for a restaurant, we might extract features such as the food quality, service, ambiance, and price. We would then analyze the sentiment associated with each feature separately, to get a more detailed understanding of the overall sentiment expressed in the reviews.
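
To make the restaurant example concrete, here is a minimal sketch. The aspect keywords, the polarity words, and the clause-splitting rule are all hypothetical choices made for this illustration; a real system would typically use a parser or a trained model instead.

```python
import re

# Hypothetical keyword sets that signal which feature a clause is about
ASPECT_KEYWORDS = {
    "food": {"food", "pizza", "dessert"},
    "service": {"service", "waiter", "staff"},
    "ambiance": {"ambiance", "music", "decor"},
    "price": {"price", "prices", "bill"},
}

# Hypothetical polarity words: +1 positive, -1 negative
POLARITY = {"delicious": 1, "friendly": 1, "cozy": 1, "reasonable": 1,
            "slow": -1, "rude": -1, "loud": -1, "bland": -1}


def aspect_sentiments(review):
    """Return a dict mapping each mentioned aspect to its summed polarity."""
    scores = {}
    # Treat punctuation and the conjunctions "but"/"and" as clause boundaries
    clauses = re.split(r"[.,;!?]|\bbut\b|\band\b", review.lower())
    for clause in clauses:
        words = re.findall(r"[a-z']+", clause)
        clause_score = sum(POLARITY.get(word, 0) for word in words)
        for aspect, keywords in ASPECT_KEYWORDS.items():
            # Credit the clause's polarity to every aspect it mentions
            if keywords.intersection(words):
                scores[aspect] = scores.get(aspect, 0) + clause_score
    return scores


review = ("The food was delicious, but the service was slow and the music "
          "was too loud. Prices were reasonable.")
result = aspect_sentiments(review)
print(result)
```

Each feature gets its own score, so a review that is mixed overall still shows which aspects were liked and which were not.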

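Now, I will show a simple sentiment analysis run on randomly generated paragraphs using Faker. Note that the code below scores each review as a whole with TextBlob rather than per feature, and since the Faker text is random filler, the ratings and polarities are unrelated to each other: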

In [16]:
from faker import Faker
from pprint import pprint
import random

fake = Faker()

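# Build 100 synthetic (review, rating) pairs; text and rating are both random
reviews = []
for _ in range(100):
    rating = random.randint(1, 5)  # random star rating from 1 to 5
    review = fake.paragraph()      # random filler paragraph, not a real review
    reviews.append((review, rating))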


pprint(reviews[:5])

[('Year open hour well thing. Determine always large give safe story '
  'situation. Truth institution any defense interesting else her.',
  5),
 ('Plant such successful contain. Way game thought animal tree add say song. '
  'Various end foreign TV.',
  5),
 ('Purpose sometimes move speak sure five animal. Debate goal six party '
  'economy.',
  5),
 ('With knowledge should art husband group start accept. Though beautiful '
  'beyond who. Same he citizen top care set.',
  4),
 ('Project case figure author why professor. Manage decision type capital deal '
  'economic. Respond TV matter.',
  5)]


In [17]:
from textblob import TextBlob

polarities = []

for review, rating in reviews:
    # TextBlob polarity ranges from -1.0 (most negative) to 1.0 (most positive)
    polarity = TextBlob(review).sentiment.polarity
    if polarity > 0:
        label = "Positive"
    elif polarity < 0:
        label = "Negative"
    else:
        label = "Neutral"
    print(f"{label} review (rating {rating}): {review}")
    polarities.append(polarity)

# Per-review polarities, followed by their mean (np is imported in an earlier cell)
print(polarities)
print(np.mean(polarities))


Positive review (rating 5): Year open hour well thing. Determine always large give safe story situation. Truth institution any defense interesting else her.
Positive review (rating 5): Plant such successful contain. Way game thought animal tree add say song. Various end foreign TV.
Positive review (rating 5): Purpose sometimes move speak sure five animal. Debate goal six party economy.
Positive review (rating 4): With knowledge should art husband group start accept. Though beautiful beyond who. Same he citizen top care set.
Positive review (rating 5): Project case figure author why professor. Manage decision type capital deal economic. Respond TV matter.
Positive review (rating 4): Stock film section five. Meeting friend smile town involve.
Positive review (rating 1): Model black experience finally big yes focus song. Indicate pass professional significant.
Positive review (rating 2): Site fill government keep business. Camera billion cultural always entire.
Neutral review (rating 3): 

### Opinion Summarization:

Opinion summarization is a text summarization task that focuses on extracting the main opinions and sentiments expressed in a large set of reviews on a given topic. The goal is to generate a concise summary of those key opinions and sentiments.

Opinion summarization is typically done by first identifying the main aspects or features being discussed in the reviews, and then extracting the sentiment expressed for each aspect. The sentiments can be classified as positive, negative, or neutral. Once all the sentiments have been extracted, they are typically aggregated to generate an overall sentiment score for the topic.
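
The aggregation step can be sketched as follows. The `(aspect, sentiment)` pairs are invented and stand in for the output of an earlier aspect-extraction step, and the net score (positive minus negative, divided by total mentions) is just one simple way to aggregate.

```python
from collections import Counter, defaultdict

# Hypothetical (aspect, sentiment) pairs, as if extracted from a set of reviews
opinions = [
    ("food", "positive"), ("food", "positive"), ("food", "negative"),
    ("service", "negative"), ("service", "negative"),
    ("price", "neutral"), ("price", "positive"),
]


def summarize(opinions):
    """Aggregate sentiment labels per aspect into mention counts and a net score."""
    per_aspect = defaultdict(Counter)
    for aspect, sentiment in opinions:
        per_aspect[aspect][sentiment] += 1

    summary = {}
    for aspect, counts in per_aspect.items():
        total = sum(counts.values())
        # Net score in [-1, 1]: share of positive mentions minus share of negative ones
        net = (counts["positive"] - counts["negative"]) / total
        summary[aspect] = {"mentions": total, "net_score": round(net, 2)}
    return summary


def overall_score(opinions):
    """Net score across all aspects combined."""
    counts = Counter(sentiment for _, sentiment in opinions)
    return (counts["positive"] - counts["negative"]) / len(opinions)


summary = summarize(opinions)
for aspect, stats in summary.items():
    print(aspect, stats)
print("overall:", overall_score(opinions))
```

The per-aspect view is what makes the summary useful: an overall score near zero can hide a strongly liked aspect next to a strongly disliked one.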

Opinion summarization has applications in a variety of areas, including market research, product development, and customer service. It can help companies quickly identify the strengths and weaknesses of their products and services, and make data-driven decisions based on customer feedback.