# <b><u>RUN THIS ON YOUR OWN MACHINE, NOT ON OUR SERVERS!</u></b>

# <b><u>Not mandatory material!</u></b>
Just included for those who want to move on to Generative AI or get some inspiration.

### <u>If you want to apply Generative AI, go to the bottom of this notebook and see the section on Ollama and on sending a request to an LLM on your local machine.</u>


# Advanced Natural Language Processing (NLP)

This notebook explores more advanced NLP concepts that go beyond the basics of dictionary-based sentiment analysis. These techniques will allow for deeper linguistic analysis and more complex tasks such as Named Entity Recognition (NER), Part-of-Speech (POS) tagging, and working with pre-trained transformer models.

---

## 1. Tokenization

### What is Tokenization?
Tokenization is the process of splitting text into individual units called **tokens**. These tokens are often words, but they can also be punctuation marks or other meaningful units in a text. Tokenization is a crucial step in many NLP pipelines, as it forms the basis for subsequent processing tasks.

### Example: Tokenizing Text with SpaCy
We will use `spaCy` to demonstrate how to tokenize text.

In [1]:
# If you encounter an issue related to spaCy, particularly one involving missing data or models, 
# you might need to download the English model explicitly. To do this, uncomment and run the following command:
# !python -m spacy download en_core_web_md

In [1]:
import spacy
nlp = spacy.load("en_core_web_md")

# Sample text
text = "The company's revenue grew by 15%, despite the challenging economic conditions."

# Tokenizing the text
doc = nlp(text)
for token in doc:
    print(token.text)




The
company
's
revenue
grew
by
15
%
,
despite
the
challenging
economic
conditions
.
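To build intuition for what the tokenizer is doing, the split shown above can be roughly approximated with a regular expression. This is only a sketch using Python's standard `re` module, not how spaCy tokenizes internally; the pattern and the `simple_tokenize` name are assumptions made for illustration.

```python
import re

# Illustrative pattern (an assumption, not spaCy's actual rules):
#   's      -> keep the possessive ending as its own token
#   \w+     -> runs of letters or digits
#   [^\w\s] -> any single punctuation or symbol character
TOKEN_PATTERN = re.compile(r"'s|\w+|[^\w\s]")


def simple_tokenize(text):
    """Split text into word, number, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


text = "The company's revenue grew by 15%, despite the challenging economic conditions."
tokens = simple_tokenize(text)
print(tokens)
```

For this sentence the regex yields the same 15 tokens as spaCy, but it will break on harder cases (abbreviations, URLs, hyphenated words), which is why a dedicated tokenizer is normally preferred.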


## 2. Part-of-Speech (POS) Tagging
### What is POS Tagging?
Part-of-Speech tagging is the process of labeling words in a text with their grammatical role, such as noun, verb, adjective, etc. POS tagging helps to understand the structure of sentences, which can be particularly useful for tasks like extracting specific financial information from texts.

### Example: POS Tagging with SpaCy
We will now use spaCy to tag each word in a text with its part of speech.

In [3]:
# Perform POS tagging
for token in doc:
    print(f"{token.text} - {token.pos_}")


The - DET
company - NOUN
's - PART
revenue - NOUN
grew - VERB
by - ADP
15 - NUM
% - NOUN
, - PUNCT
despite - SCONJ
the - DET
challenging - ADJ
economic - ADJ
conditions - NOUN
. - PUNCT


This example will output the grammatical category of each word in the sentence.
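Once tokens are tagged, the tags can be used to filter or count words. The sketch below works on the (token, tag) pairs copied from the output above, so it runs without spaCy; the helper name `tokens_with_tag` is hypothetical.

```python
from collections import Counter

# (token, tag) pairs copied from the POS tagging output above
tagged = [
    ("The", "DET"), ("company", "NOUN"), ("'s", "PART"), ("revenue", "NOUN"),
    ("grew", "VERB"), ("by", "ADP"), ("15", "NUM"), ("%", "NOUN"),
    (",", "PUNCT"), ("despite", "SCONJ"), ("the", "DET"),
    ("challenging", "ADJ"), ("economic", "ADJ"), ("conditions", "NOUN"),
    (".", "PUNCT"),
]


def tokens_with_tag(pairs, tag):
    """Return all tokens that carry the given POS tag, in order."""
    return [token for token, token_tag in pairs if token_tag == tag]


# How often each tag occurs in the sentence
tag_counts = Counter(token_tag for _, token_tag in tagged)

nouns = tokens_with_tag(tagged, "NOUN")
numbers = tokens_with_tag(tagged, "NUM")

print(tag_counts)
print("Nouns:", nouns)
print("Numbers:", numbers)
```

With a live spaCy `doc`, the same filter could be written as `[token.text for token in doc if token.pos_ == "NOUN"]`.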



## 3. Named Entity Recognition (NER)
### What is Named Entity Recognition (NER)?
NER is a process that identifies and classifies named entities in text, such as people, organizations, locations, monetary values, and dates. In finance, NER is useful for extracting important entities like company names, product names, and financial terms.

### Example: NER with SpaCy
Here’s how to identify named entities in a text using spaCy.

In [4]:
# Perform Named Entity Recognition (NER)
for ent in doc.ents:
    print(f"{ent.text} - {ent.label_}")


15% - PERCENT


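In this example, spaCy recognizes only "15%" and labels it PERCENT. Generic words such as "company" and "revenue" are common nouns rather than named entities, so they are not tagged. A text that mentions specific organizations, amounts, or dates would also produce labels such as ORG (organization), MONEY, or DATE.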
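To make the idea of "find a span and give it a label" concrete, here is a minimal rule-based stand-in for two entity types. It is not how spaCy's statistical NER works: the regular expressions, the `find_entities` helper, and the sample sentence are all assumptions made up for illustration.

```python
import re

# Hand-written patterns for two entity types (illustrative assumptions).
# Real NER models learn such patterns from annotated data instead.
ENTITY_PATTERNS = {
    "PERCENT": re.compile(r"\d+(?:\.\d+)?\s?%"),
    "MONEY": re.compile(r"\$\d+(?:\.\d+)?(?:\s(?:million|billion))?"),
}


def find_entities(text):
    """Return (entity_text, label) pairs in order of appearance."""
    found = []
    for label, pattern in ENTITY_PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(), label))
    # Sort by start position so entities come out in reading order
    return [(entity, label) for _, entity, label in sorted(found)]


# Made-up sentence, used only to exercise both patterns
sample = "Acme Corp reported revenue of $2.5 billion, up 15% from last year."
for entity, label in find_entities(sample):
    print(f"{entity} - {label}")
```

Note that rules like these cannot recognize "Acme Corp" as an organization; that kind of entity is where a trained model such as spaCy's becomes necessary.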

## 4. Sentiment Analysis Using Pre-trained Models
### What Are Pre-trained Sentiment Models?
Pre-trained models are machine learning models that have been trained on large datasets and can be fine-tuned for specific tasks, such as sentiment analysis. In this section, we will explore how to use a pre-trained sentiment model for more complex text analysis.

### Using transformers for Sentiment Analysis
We will use a pre-trained transformer model from Hugging Face’s transformers library to perform sentiment analysis.

In [5]:
# pip install transformers --upgrade

In [6]:
from transformers import pipeline

# Load a sentiment analysis pipeline
sentiment_pipeline = pipeline("sentiment-analysis")

# Analyze the sentiment of a financial text
result = sentiment_pipeline("The company's profits have surged, but the rising debt is concerning.")
print(result)






This returns a list with one dictionary per input text, containing a `label` such as "POSITIVE" or "NEGATIVE" and a confidence `score` between 0 and 1.
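The sentiment pipeline returns a list of dictionaries, each holding a `label` and a `score`. The sketch below shows one possible way to turn such predictions into a single signed number per text. The `signed_score` helper is hypothetical and the example predictions are made up to match the output format; they are not real model output.

```python
def signed_score(prediction):
    """Map a {'label', 'score'} prediction to a value between -1 and 1."""
    sign = 1.0 if prediction["label"] == "POSITIVE" else -1.0
    return sign * prediction["score"]


# Made-up predictions in the pipeline's output format (not real model output)
example_results = [
    {"label": "POSITIVE", "score": 0.98},
    {"label": "NEGATIVE", "score": 0.75},
]

scores = [signed_score(result) for result in example_results]
average = sum(scores) / len(scores)

print(scores)
print(round(average, 3))
```

A signed score like this makes it easier to average sentiment across many sentences or to compare documents, although it discards the distinction between a confident negative and a weak positive.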


## 5. Word Embeddings and Similarity
### What Are Word Embeddings?
Word embeddings are dense vector representations of words, where words with similar meanings are located close to each other in vector space. They are more powerful than simple dictionary-based methods because they encode semantic relationships learned from how words are used across large text collections, so related words can be matched even when they do not appear on any predefined list.

### Example: Finding Word Similarity with Word Vectors
SpaCy’s en_core_web_md and en_core_web_lg models include pre-trained word vectors, which allow us to measure how similar two words are.

In [7]:
# Example of word similarity using word vectors
# nlp(...) returns a Doc object; similarity() compares the vectors of the two Docs
doc1 = nlp("profit")
doc2 = nlp("gain")

similarity = doc1.similarity(doc2)
print(f"Similarity between 'profit' and 'gain': {similarity}")

Similarity between 'profit' and 'gain': 0.35158402153765206


This shows how semantically related the words "profit" and "gain" are, based on their positions in the word embedding space.
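The similarity value is a cosine similarity between the two vectors. The sketch below computes it by hand on toy 3-dimensional vectors; the numbers are invented for illustration, whereas the vectors in `en_core_web_md` have 300 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (made up for illustration, not taken from any model)
toy_vectors = {
    "profit": [0.9, 0.1, 0.3],
    "gain": [0.8, 0.2, 0.4],
    "weather": [0.1, 0.9, 0.2],
}

related = cosine_similarity(toy_vectors["profit"], toy_vectors["gain"])
unrelated = cosine_similarity(toy_vectors["profit"], toy_vectors["weather"])

print(f"profit vs gain:    {related:.3f}")
print(f"profit vs weather: {unrelated:.3f}")
```

Vectors pointing in nearly the same direction score close to 1, and unrelated directions score close to 0, regardless of vector length.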



## 6. Advanced Text Preprocessing: Custom Stop Words and Domain-Specific Vocabulary
### Using Custom Stop Words
In some cases, we may need to remove additional words that are not part of default stop word lists but are specific to the domain we are analyzing. For example, terms like "report" or "figure" may be common in financial reports but are not necessarily informative for sentiment analysis.

### Example: Adding Custom Stop Words to SpaCy
We will add some domain-specific stop words to the existing list.

In [8]:
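# Add custom stop words
custom_stop_words = ["company", "report", "figure"]
for word in custom_stop_words:
    nlp.Defaults.stop_words.add(word)
    # Words already in the vocabulary keep their old flag, so set it explicitly
    nlp.vocab[word].is_stop = True

# Remove stop words (including the custom ones) from the text
cleaned_text = ' '.join([token.text for token in doc if not token.is_stop])
print(cleaned_text)


revenue grew 15 % , despite challenging economic conditions .


This allows us to tailor the preprocessing step to our specific domain. Note that adding a word to `nlp.Defaults.stop_words` alone does not change `is_stop` for words the pipeline has already seen, which is why the flag is also set on `nlp.vocab`.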
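The same filtering idea can be sketched without spaCy, which also makes it easy to match stop words regardless of capitalization. The short base list below is an assumption for illustration only; spaCy's default list is much longer.

```python
# Small illustrative base list (an assumption; not spaCy's full list)
base_stop_words = {"the", "by", "'s", "a", "an", "of"}

# Domain-specific additions from the example above
custom_stop_words = {"company", "report", "figure"}

stop_words = base_stop_words | custom_stop_words


def remove_stop_words(tokens, stop_words):
    """Drop tokens whose lowercase form is in the stop word set."""
    return [token for token in tokens if token.lower() not in stop_words]


tokens = ["The", "Company", "'s", "revenue", "grew", "by", "15", "%", ",",
          "despite", "the", "challenging", "economic", "conditions", "."]

filtered = remove_stop_words(tokens, stop_words)
print(" ".join(filtered))
```

Lowercasing before the lookup means "Company" and "company" are treated the same, which is usually what you want for domain-specific terms.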


## 7. Transformers and Transfer Learning in NLP (Optional)
### What Are Transformers?
Transformers are advanced neural network architectures that power state-of-the-art NLP models like BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer). These models can be fine-tuned for a wide variety of tasks, including sentiment analysis, summarization, and question answering.

### Example: Loading a Transformer Model for Text Classification
Here’s an optional introduction to using transformers for NLP tasks, leveraging the Hugging Face library.

In [9]:
from transformers import pipeline

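# Load a text classification model that has been fine-tuned for sentiment (SST-2).
# The plain 'distilbert-base-uncased' checkpoint has no trained classification head,
# so its predictions would be essentially random.
classifier = pipeline('text-classification', model='distilbert-base-uncased-finetuned-sst-2-english')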

# Apply the classifier to text
result = classifier("The stock market crashed, causing widespread panic among investors.")
print(result)




This model is more advanced than simple dictionary methods and can provide more context-aware results.

# Conclusion: Next Steps in NLP
In this advanced notebook, we have explored more complex NLP techniques, including tokenization, POS tagging, Named Entity Recognition, and the use of pre-trained models for sentiment analysis. These tools can provide a deeper understanding of textual data, which is especially useful for advanced financial analysis.

### The next steps could include:
- Applying transformers to more complex financial texts for sentiment and trend analysis.
- Exploring deeper linguistic features like syntax and dependency parsing.


---

### Explanation of Topics:
- **Tokenization**: Introduces the fundamental task of breaking down text into individual tokens, which is the first step in many NLP pipelines.
- **POS Tagging**: Helps identify the grammatical role of words in a sentence, useful for extracting structured information from text.
- **Named Entity Recognition (NER)**: Extracts entities such as organizations, dates, and monetary values from text, which is particularly useful in finance.
- **Pre-trained Sentiment Models**: Demonstrates how to use advanced models for sentiment analysis, beyond simple dictionary-based methods.
- **Word Embeddings and Similarity**: Shows how words can be represented in vector space, allowing for the comparison of word meanings.
- **Advanced Text Preprocessing**: Adds the ability to customize stop words, particularly for domain-specific vocabulary in finance.
- **Transformers and Transfer Learning**: Introduces the most advanced NLP techniques, showing how state-of-the-art models can be applied to financial text analysis.

<br><br><br><br><br><br><br><br><br>

# Understanding Ollama and Open Source Models

## What is Ollama?

Ollama is a tool that simplifies the use of large language models (LLMs) by providing easy access to various models, including open-source ones. It lets users download and run language models locally or in the cloud, which makes it easier for developers, researchers, and businesses to integrate powerful natural language processing models into their applications.

Ollama's main goal is to make working with language models seamless, allowing users to generate text, analyze documents, or answer questions using sophisticated AI models with just a few simple commands.

---

## What is Open Source?

**Open source** refers to software whose source code is freely available for anyone to use, modify, and distribute. This philosophy promotes collaboration, transparency, and community-driven development. Many open-source projects are built and maintained by communities of developers from around the world.

---

## What Are Open Source Models?

**Open source models** are machine learning or language models that are freely available for public use, modification, and distribution. These models are often shared on platforms like GitHub or Hugging Face, enabling anyone to integrate them into their own projects.

Open source models can be trained on large datasets and can be used for various natural language processing (NLP) tasks, such as translation, summarization, and sentiment analysis. The beauty of open-source models is that they are not controlled by any single entity, and anyone can contribute to improving their performance or tailoring them for specific tasks.

---

## The Meta LLaMA 3.1 Model

**LLaMA 3.1** is a family of **large language models** released by Meta (formerly Facebook) in July 2024. It comes in several sizes (8B, 70B, and 405B parameters). The smaller variants are designed to need far fewer computational resources than massive language models like GPT-4, while still delivering high-quality language understanding and generation capabilities.

Meta's LLaMA models are part of a broader push to make cutting-edge NLP technology more accessible, providing strong performance across tasks like language generation, summarization, and text classification, with fewer computational resources.

Key features of LLaMA 3.1:
- **Scalability**: Available in several sizes, so you can trade output quality against hardware requirements; the smaller variants are more efficient than other large models.
- **Performance**: Strong language understanding and generation abilities, even for tasks that require deep contextual knowledge.
- **Open-Source**: The model weights are shared openly under Meta's community license, encouraging researchers and developers to use and improve them.

---

## How to Install Ollama and Run It Locally

To use Ollama with models like **LLaMA 3.1**, you need to install it on your system. Here's a quick guide on how to install and run Ollama.

### Step 1: Install Ollama
Go to https://ollama.com/ and follow the installation instructions for your operating system.

### Step 2: Download the model

In [None]:
# ollama pull llama3.1

This command will download and install the LLaMA 3.1 model, ready for use.

### Step 3: Running a Model with Ollama
Once you have a model installed, you can run it locally by sending text prompts. Here's an example of how to analyze the sentiment of a dummy earnings call text using LLaMA 3.1.
<br><br><br>
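Besides LangChain, a prompt can be sent directly to the local Ollama server over HTTP. The following is a minimal sketch using only the standard library. It assumes the server is running at its default address (`http://localhost:11434`) and that `llama3.1` has been pulled; the helper names `build_generate_request` and `generate` are hypothetical.

```python
import json
import urllib.request

# Default address of a locally running Ollama server (assumed unchanged)
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt, model="llama3.1"):
    """Build a POST request for Ollama's generate endpoint."""
    # stream=False asks for one complete JSON reply instead of chunks
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="llama3.1"):
    """Send the prompt to the local server and return the generated text."""
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["response"]


# Inspect the request without contacting the server
request = build_generate_request("Summarize: revenue grew by 15%.")
print(request.full_url)
print(request.data.decode("utf-8"))

# With the Ollama server running and the model pulled, uncomment:
# print(generate("Summarize: revenue grew by 15%."))
```

Building the request separately from sending it makes it possible to check the payload before any model call is made.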

# Explanation of the Process
## PromptTemplate:
A prompt template is constructed to analyze the sentiment of the earnings call transcript. The template guides the model to read the transcript carefully and provide a sentiment analysis, listing the positive and negative aspects of the text. The prompt asks for the response in JSON format, which makes the results easier to parse.
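With its default settings, a `PromptTemplate` fills `{...}` placeholders in the same way as Python's `str.format`. The sketch below uses plain `str.format` to show the mechanics, including how literal braces in a JSON example must be doubled; the shortened template text and the sample transcript are assumptions for illustration.

```python
# Placeholders use single braces; literal JSON braces are doubled ({{ and }})
template = (
    "The earnings call transcript is:\n"
    '"{earnings_call_text}"\n\n'
    "Return your analysis in JSON format, like this:\n"
    '{{"positive_aspects": [], "negative_aspects": [], "summary": ""}}'
)

# Made-up one-line transcript, used only to fill the placeholder
prompt = template.format(earnings_call_text="Revenue grew, but costs rose.")
print(prompt)
```

Doubling the braces lets the prompt show the model a complete JSON object without the template engine mistaking it for a variable.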

## Ollama Model:
The Ollama class is used to initialize and run the language model, specifically loading Meta's LLaMA 3.1 model. This model is designed to handle various language processing tasks, including sentiment analysis, by interpreting the provided text based on the prompt instructions.

## LLMChain:
An LLMChain is set up to integrate the Ollama model with the custom PromptTemplate. The chain takes the prompt and runs it through the LLaMA model, ensuring a streamlined interaction between the input (the earnings call text) and the model’s response. No output parser is attached in this example, so the response comes back as a plain string; the JSON structure is requested through the prompt only and is not validated.

## Running the Chain:
Once the chain is created, it is executed using the earnings call text as input. The result is a string that contains the positive and negative sentiment aspects along with the overall sentiment summary, formatted as a JSON object. Because the model may add extra text around the JSON, the string should be parsed and checked before it is used in further analysis.

In [6]:
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.llms import Ollama
# from langchain.llms import OpenAI  # only needed for the optional OpenAI alternative below

# Dummy earnings call text
earnings_call_text = """
The company's quarterly performance has been exceptional, with strong revenue growth and increased profit margins.
However, rising operational costs and external market pressures have posed challenges.
Our leadership remains confident about future opportunities despite these short-term hurdles.
"""

# Define the prompt for Ollama
ollama_sentiment_prompt = PromptTemplate(
    template="""
        <|start_header_id|>system<|end_header_id|>

        You are an expert sentiment analysis model. Your task is to analyze the sentiment of the following earnings call transcript.

        ### Instructions:

        1. **Understand the Earnings Call:**
            - Read the transcript carefully.
            - Identify the positive and negative sentiment conveyed in the text.

        2. **Provide the Sentiment Analysis:**
            - List the positive aspects of the text.
            - List the negative aspects of the text.
            - Provide an overall sentiment summary based on the analysis.

        The earnings call transcript is:
        "{earnings_call_text}"

        Please provide a sentiment analysis, specifying the positive and negative aspects of the text. Return your analysis in JSON format, like this:
        "positive_aspects": ["positive aspect 1", "positive aspect 2"],
        "negative_aspects": ["negative aspect 1", "negative aspect 2"],
        "summary": "Overall sentiment summary"

        <|eot_id|>
        <|start_header_id|>assistant<|end_header_id|>
    """,
    input_variables=["earnings_call_text"],
)

# Create the Ollama model
llm = Ollama(model="llama3.1", temperature=0)
# Or if you would like to use ChatGPT
# llm = OpenAI(model="gpt-4", temperature=0)

# Create the chain (no output parser needed)
ollama_sentiment_chain = LLMChain(
    prompt=ollama_sentiment_prompt,
    llm=llm,
)

# Run the chain with the earnings call text, result is a string
sentiment_analysis_result = ollama_sentiment_chain.run(earnings_call_text=earnings_call_text)

# Print the result as a string
print(sentiment_analysis_result)


Here is the sentiment analysis in JSON format:

```
{
  "positive_aspects": [
    "exceptional quarterly performance",
    "strong revenue growth",
    "increased profit margins",
    "future opportunities"
  ],
  "negative_aspects": [
    "rising operational costs",
    "external market pressures",
    "short-term hurdles"
  ],
  "summary": "The overall sentiment is cautiously optimistic, with a focus on the company's strong performance and future potential, while also acknowledging some challenges."
}
```
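Because the chain returns a plain string, and the model may wrap the JSON in an introductory sentence or a code fence as it did above, the JSON has to be extracted before it can be used as data. The sketch below shows one simple way to do that; the `extract_json` helper is hypothetical, and the sample string is a shortened copy of the response shown above.

```python
import json


def extract_json(response_text):
    """Parse the first {...} object found in a model response."""
    start = response_text.find("{")
    end = response_text.rfind("}")
    if start == -1 or end == -1 or end < start:
        raise ValueError("no JSON object found in the response")
    # Everything outside the outermost braces (intro text, fences) is ignored
    return json.loads(response_text[start:end + 1])


# Shortened copy of the response shown above
response_text = """Here is the sentiment analysis in JSON format:

{
  "positive_aspects": ["strong revenue growth", "increased profit margins"],
  "negative_aspects": ["rising operational costs"],
  "summary": "The overall sentiment is cautiously optimistic."
}
"""

result = extract_json(response_text)
print(result["positive_aspects"])
print(result["summary"])
```

This approach assumes the response contains exactly one JSON object; if the model returns malformed JSON, `json.loads` raises an error that the caller should handle.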
