# Introduction to Natural Language Processing (NLP)

## What is NLP?

Natural Language Processing (NLP) is a field of computer science and artificial intelligence focused on the interaction between computers and human language. The goal of NLP is to enable machines to understand, interpret, and generate human language in a way that is both meaningful and useful.

In simpler terms, NLP is the technology behind tools that allow computers to process and analyze large amounts of natural language data, such as written texts or spoken words.

---

## Why is NLP important in Finance and Economics?

In finance and economics, large volumes of textual data are generated daily, such as:
- Financial reports
- Earnings call transcripts
- News articles
- Social media discussions (tweets, blogs, etc.)
- Regulatory filings

These texts often contain important information about companies, markets, and the economy, but manually analyzing them can be time-consuming and difficult. NLP can help by automatically extracting insights from this text data, including sentiment (positive or negative tone), key terms, and trends.

**Examples of NLP in Finance and Economics:**
- **Sentiment Analysis**: Analyzing the tone of financial reports or news articles to understand market sentiment, for instance whether the market is optimistic or pessimistic about a company’s future performance.
- **Named Entity Recognition (NER)**: Identifying important entities such as company names, dates, locations, or products from text documents.
- **Text Summarization**: Automatically summarizing long financial reports or regulatory filings.

In this notebook, we will focus on one specific task of NLP: **sentiment analysis using a dictionary-based method**. This will allow us to analyze the sentiment (positive or negative) in financial texts based on predefined word lists (dictionaries).

---

# Getting Text Data

Before we can perform any analysis on text, we need to gather or generate text data. In real-world scenarios, this data often comes from a variety of sources such as news articles, financial reports, or transcripts of company earnings calls. However, for this tutorial, we'll focus on working with simple text data that we generate ourselves.

The goal is to get the text into a structured format so that we can process it using NLP techniques.

---

## Types of Text Data in Finance and Economics

Text data in finance and economics can come from various sources, each with unique characteristics:
- **News Articles**: Articles from financial news outlets (e.g., Bloomberg, Reuters) often discuss trends in the market or specific company performance.
- **Financial Reports**: Documents like quarterly earnings reports (10-Q) or annual reports (10-K) provide detailed insights into a company's financial performance.
- **Earnings Call Transcripts**: Transcripts of calls where company executives discuss financial performance with analysts and investors.
- **Regulatory Filings**: Documents filed with regulatory authorities, such as the Securities and Exchange Commission (SEC), often include valuable information about a company’s operations.
- **Social Media and Blogs**: Discussions and opinions about market trends and specific companies can be found on platforms like Twitter, Reddit, and financial blogs.

In practice, text data may need to be extracted from these sources. For the purpose of this notebook, we will not focus on the extraction part (such as converting PDF reports into text), and instead assume that we already have the text available for analysis.

---

## Example: Simple Finance Text

Here is a basic example of a piece of text data related to finance:


In [1]:
finance_text = """
The company reported a significant increase in quarterly profits, 
but its debt levels have also risen. Analysts are concerned about 
the growing debt but remain optimistic about the company's overall growth potential.
"""


---


# 1. Basic Text Preprocessing

Before we can analyze text using a dictionary-based sentiment analysis method, we need to preprocess the raw text. Preprocessing helps standardize the text and remove irrelevant information, making it easier to match words in the text with entries in our sentiment dictionary.

---

## Why Preprocess Text?

Raw text data can be messy and inconsistent. For example, the same word might appear in different forms (e.g., "Profit" vs. "profit"), or the text may contain irrelevant characters (e.g., punctuation) that we don’t want to include in our analysis. 

By preprocessing the text, we can clean it up and convert it into a form that is easier to work with. The preprocessing steps we will cover are:
- Lowercasing
- Removing punctuation and special characters
- Removing stop words
- Lemmatization

Let's walk through each of these steps in detail.

---

## Lowercasing

One of the simplest but most important preprocessing steps is converting all text to lowercase. This ensures that words like "Profit" and "profit" are treated the same, preventing case sensitivity from affecting the analysis.

### Example: Converting text to lowercase


In [2]:
# Convert text to lowercase
lowercased_text = finance_text.lower()
print(lowercased_text)


the company reported a significant increase in quarterly profits, 
but its debt levels have also risen. analysts are concerned about 
the growing debt but remain optimistic about the company's overall growth potential.
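
To see why this matters for dictionary matching, here is a small sketch. The word list `toy_positive` and the sample tokens are made up for illustration and are not part of the notebook's data.

```python
# Hypothetical mini word list and tokens, used only for illustration
toy_positive = {"profit", "growth"}
tokens = ["Profit", "rose", "while", "GROWTH", "slowed"]

# Case-sensitive matching misses the capitalized forms
matches_raw = [t for t in tokens if t in toy_positive]

# Lowercasing first lets every variant match the dictionary entry
matches_lower = [t.lower() for t in tokens if t.lower() in toy_positive]

print(matches_raw)    # []
print(matches_lower)  # ['profit', 'growth']
```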



# 2. Removing Punctuation and Special Characters
Punctuation marks (such as periods, commas, and exclamation points) don’t provide useful information for sentiment analysis, so we remove them. The same goes for special characters, such as dollar signs ($) or percentages (%), which aren’t relevant for dictionary-based analysis.

## Example: Removing punctuation

In [3]:
import string

# Removing punctuation
no_punctuation_text = lowercased_text.translate(str.maketrans('', '', string.punctuation))
print(no_punctuation_text)


the company reported a significant increase in quarterly profits 
but its debt levels have also risen analysts are concerned about 
the growing debt but remain optimistic about the companys overall growth potential



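This removes all ASCII punctuation marks, leaving only words and whitespace. Note that apostrophes are removed as well, so "company's" becomes "companys" in the output above.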
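
One caveat: deleting punctuation outright can glue words together when no space surrounds the punctuation mark. A possible alternative, sketched below with Python's `re` module, replaces each punctuation character with a space and then collapses repeated whitespace. The `sample` sentence is invented for illustration.

```python
import re
import string

# Invented sample with punctuation that is not surrounded by spaces
sample = "Year-over-year profits rose 12%; debt,however,also rose."

# Approach used above: delete punctuation characters outright
deleted = sample.lower().translate(str.maketrans('', '', string.punctuation))

# Alternative: replace each punctuation character with a space,
# then collapse runs of whitespace into a single space
pattern = f"[{re.escape(string.punctuation)}]"
spaced = re.sub(r"\s+", " ", re.sub(pattern, " ", sample.lower())).strip()

print(deleted)  # yearoveryear profits rose 12 debthoweveralso rose
print(spaced)   # year over year profits rose 12 debt however also rose
```

Which variant is preferable depends on the text: the second keeps hyphenated terms as separate words, which may or may not be what the dictionary expects.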



# 3. Stop Words
Stop words are common words like "and," "the," "is," and "in" that are often considered irrelevant in text analysis because they don’t convey much meaning. In financial texts, we also often filter out domain-specific stop words (such as “company” or “business”) that do not contribute to sentiment.

Using a pre-defined list of stop words, we can remove these words from our text to focus on the more meaningful terms.

## Example: Removing stop words using spaCy
We will use the spaCy library’s built-in list of stop words to remove them from our text.

In [4]:
# pip install spacy pandas numpy --upgrade 

In [5]:
# If you encounter an issue related to spaCy, particularly one involving missing data or models, 
# you might need to download the English model explicitly. To do this, uncomment and run the following command:
# !python -m spacy download en_core_web_md

In [6]:
import spacy

# Load spaCy's English language model
nlp = spacy.load("en_core_web_md")

# Convert text to spaCy Doc object
doc = nlp(no_punctuation_text)

# Removing stop words
no_stop_words_text = ' '.join([token.text for token in doc if not token.is_stop])
print(no_stop_words_text)





 company reported significant increase quarterly profits 
 debt levels risen analysts concerned 
 growing debt remain optimistic companys overall growth potential 



In this step, all common stop words are removed, leaving us with only the more significant words for analysis.
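
Domain-specific stop words, as mentioned at the start of this section, can be handled with a second filter. The sketch below works on the tokens from the output above, copied in as a literal so it runs without spaCy. The set `domain_stop_words` is a hypothetical example, not a recommended list.

```python
# Tokens taken from the stop-word-filtered output above
tokens = ("company reported significant increase quarterly profits "
          "debt levels risen analysts concerned "
          "growing debt remain optimistic companys overall growth potential").split()

# Hypothetical domain-specific stop words; choose these for your own corpus
domain_stop_words = {"company", "companys", "business", "quarterly"}

# Keep only the tokens that are not domain-specific stop words
filtered = [t for t in tokens if t not in domain_stop_words]
print(' '.join(filtered))
```

With spaCy, the same check could be added to the list comprehension alongside `token.is_stop`, for example by also testing `token.text not in domain_stop_words`.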



# 4. Lemmatization
Lemmatization is the process of converting words to their base or root form. For example, the words "running," "ran," and "runs" are all forms of the word "run." By converting words to their base form, we ensure that we capture the meaning of the word, regardless of its tense or form.
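
As a rough picture of the idea (not of how spaCy's lemmatizer works internally), lemmatization can be thought of as a lookup from inflected forms to a base form. The table `toy_lemmas` below is hand-written and hypothetical. A real lemmatizer covers far more words and uses grammatical information.

```python
# Hand-written lookup table, for illustration only
toy_lemmas = {
    "running": "run", "ran": "run", "runs": "run",
    "profits": "profit", "risen": "rise",
}

def toy_lemmatize(tokens):
    """Map each token to its base form, leaving unknown tokens unchanged."""
    return [toy_lemmas.get(t, t) for t in tokens]

print(toy_lemmatize(["profits", "ran", "higher", "runs"]))
# ['profit', 'run', 'higher', 'run']
```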

## Example: Lemmatizing text using spaCy
We will use spaCy’s built-in lemmatizer to convert words in the text to their root forms.

In [7]:
# Lemmatization of the remaining text
lemmatized_text = ' '.join([token.lemma_ for token in doc if not token.is_stop])
print(lemmatized_text)


 company report significant increase quarterly profit 
 debt level rise analyst concerned 
 grow debt remain optimistic company overall growth potential 



# Final Preprocessed Text
After applying all the preprocessing steps (lowercasing, removing punctuation, removing stop words, and lemmatization), we now have a clean version of the text that is ready for sentiment analysis using a dictionary.



In [8]:
# Final Preprocessed Text
print(lemmatized_text)



 company report significant increase quarterly profit 
 debt level rise analyst concerned 
 grow debt remain optimistic company overall growth potential 



At this stage, our text is in a standardized form, with all unnecessary elements removed. We can now proceed to the next step: applying a sentiment dictionary to this text.


---



# 5. Sentiment Scoring Using a Dictionary

Now that we have preprocessed the text, we can move on to **sentiment scoring**. This process involves using a **sentiment dictionary** to calculate the overall tone (positive or negative) of the text. In our case, we will use a dummy dictionary with predefined **positive** and **negative** words, and we'll count how many of these words appear in the text.

---

## How Does Dictionary-Based Sentiment Analysis Work?

A **sentiment dictionary** is a predefined list of words that are labeled as either positive or negative. Each time a word from the text appears in the dictionary, it contributes to the sentiment score. For example:
- **Positive words** (e.g., "profit," "growth," "success") contribute positively to the sentiment score.
- **Negative words** (e.g., "loss," "debt," "decline") contribute negatively to the sentiment score.

The final sentiment score is the difference between the counts of positive and negative words in the text.

In this section, we will:
1. Define a dummy dictionary of positive and negative words.
2. Use a dummy earnings call transcript to calculate a sentiment score.
3. Apply basic counting techniques using sets.

---

## Dummy Earnings Call Transcript

We will start by creating a simple dummy text that mimics the kind of language used in an earnings call or financial report. This will allow us to demonstrate how the sentiment analysis works in a financial context.


In [9]:
# Dummy earnings call transcript
earnings_call_text = """
The company has experienced strong growth this quarter with increasing profits. 
Our revenue has risen significantly, and we are optimistic about future success. 
However, debt levels have also increased, and there are concerns about rising costs.
"""


This text contains a mix of positive and negative words, which we will analyze using a predefined dictionary.

# Creating a Dummy Sentiment Dictionary
Here’s a simple example of a sentiment dictionary. We will create two sets of words:

- Positive words: Words that convey positive sentiment in financial contexts.
- Negative words: Words that convey negative sentiment.

In [10]:
# Define a dummy sentiment dictionary
positive_words = {"growth", "profit", "revenue", "optimistic", "success", "increase", "risen"}
negative_words = {"debt", "loss", "decline", "concerns", "decreased", "costs"}


These words are common in financial contexts and represent both positive and negative sentiments that might appear in an earnings call transcript.



# Preprocessing the Text
Before we can score the sentiment, we need to preprocess the text using the steps we outlined earlier (lowercasing, removing punctuation, removing stop words, and lemmatization). This ensures that the text is in a clean format.

## Preprocessing the Dummy Text

In [11]:
# Preprocess the text (following the same steps as before)
import string
import spacy

# Convert to lowercase
lowercased_text = earnings_call_text.lower()

# Remove punctuation
no_punctuation_text = lowercased_text.translate(str.maketrans('', '', string.punctuation))

# Load spaCy's English language model
nlp = spacy.load("en_core_web_md")

# Convert to a spaCy Doc, drop stop words, and lemmatize the remaining tokens
doc = nlp(no_punctuation_text)
preprocessed_text = ' '.join(token.lemma_ for token in doc if not token.is_stop)

print(preprocessed_text)



 company experience strong growth quarter increase profit 
 revenue rise significantly optimistic future success 
 debt level increase concern rise cost 



This gives us a preprocessed version of the earnings call transcript that is ready for sentiment analysis.
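
Because the text is now lemmatized, it is worth checking that the dictionary entries are stored in the same form. The sketch below copies the negative word list and the preprocessed output from above as literals so that it runs on its own. The set `negative_lemmas` is an illustrative variant introduced here, not the notebook's dictionary.

```python
# Negative word list as defined above
negative_words = {"debt", "loss", "decline", "concerns", "decreased", "costs"}

# Lemmatized tokens, copied from the preprocessed output above
lemmas = ("company experience strong growth quarter increase profit "
          "revenue rise significantly optimistic future success "
          "debt level increase concern rise cost").split()

# "concerns" and "costs" do not match the lemmas "concern" and "cost"
matched = sorted(set(lemmas) & negative_words)
print(matched)  # ['debt']

# Illustrative variant: entries stored in lemma form do match
negative_lemmas = {"debt", "loss", "decline", "concern", "decrease", "cost"}
matched_lemmas = sorted(set(lemmas) & negative_lemmas)
print(matched_lemmas)  # ['concern', 'cost', 'debt']
```

Switching to lemma-form entries would change the counts and ratios reported in the rest of this section, which are based on the dictionary as originally defined.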



# Counting Positive and Negative Words
Now that the text has been preprocessed, we can check which words in the text appear in our positive and negative word lists and count them. This will allow us to calculate the sentiment score.

## Counting Word Occurrences

In [12]:
# Split the preprocessed text into tokens and keep each distinct word once
words = set(preprocessed_text.split())

# Count the distinct positive words that appear in the text
positive_count = len(words.intersection(positive_words))

# Count the distinct negative words that appear in the text
negative_count = len(words.intersection(negative_words))

# Print results
print(f"Positive words count: {positive_count}")
print(f"Negative words count: {negative_count}")

Positive words count: 6
Negative words count: 1


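In this step, we use Python's `set.intersection()` method to find which words from the text match our positive and negative word lists. Because `words` is a set, each matching word is counted once, no matter how often it occurs: "increase" appears twice in the text but adds only one to the positive count. Dictionary entries must also match the lemmatized form exactly, so "concerns" and "costs" in the negative list do not match the lemmas "concern" and "cost".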
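
If repeated words should count each time they occur, `collections.Counter` is one way to do it. The sketch below contrasts the two approaches, using the positive word list and the preprocessed output from above copied as literals.

```python
from collections import Counter

# Positive word list as defined above
positive_words = {"growth", "profit", "revenue", "optimistic", "success", "increase", "risen"}

# Lemmatized tokens, copied from the preprocessed output above
tokens = ("company experience strong growth quarter increase profit "
          "revenue rise significantly optimistic future success "
          "debt level increase concern rise cost").split()

# Distinct matches: what the set intersection measures
distinct_positive = len(set(tokens) & positive_words)

# Total occurrences: repeated words count every time they appear
counts = Counter(tokens)
total_positive = sum(counts[w] for w in positive_words)

print(distinct_positive)  # 6
print(total_positive)     # 7 ("increase" occurs twice)
```

Which definition to use is a design choice. The results shown in this notebook are based on distinct matches.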

## Calculating the Sentiment Score Using Percentages

To get a more balanced view of the sentiment in the text, we can calculate the percentage of positive and negative words relative to the total number of words in the preprocessed text. This will give us a clearer indication of the sentiment distribution, especially when analyzing longer texts.

### Sentiment Ratio Explanation

Instead of simply counting positive and negative words, we calculate the ratio of these words as a percentage of the total words in the text. The formulas we will use are:

- **Positive Sentiment Ratio** = (Number of Positive Words / Total Words) * 100
- **Negative Sentiment Ratio** = (Number of Negative Words / Total Words) * 100

This allows us to standardize the score regardless of the length of the text. A higher percentage indicates a stronger presence of positive or negative sentiment.

We can also calculate a **Combined Sentiment Score**:
- **Combined Sentiment Score** = Positive Sentiment Ratio - Negative Sentiment Ratio

This will give us a single score that represents the overall tone of the text. A positive score suggests an overall positive tone, while a negative score suggests a negative tone.

---

### Calculating the Positive and Negative Sentiment Ratios

Let's now calculate the sentiment ratios for our dummy earnings call text.

In [13]:
# Get the total number of words in the preprocessed text
total_words = len(preprocessed_text.split())

# Calculate the positive sentiment ratio (as a percentage of total words)
positive_ratio = (positive_count / total_words) * 100

# Calculate the negative sentiment ratio (as a percentage of total words)
negative_ratio = (negative_count / total_words) * 100

# Print the ratios
print(f"Positive Sentiment Ratio: {positive_ratio:.2f}%")
print(f"Negative Sentiment Ratio: {negative_ratio:.2f}%")


Positive Sentiment Ratio: 31.58%
Negative Sentiment Ratio: 5.26%


This gives us the percentage of positive and negative words in the text relative to the total number of words.



## Combined Sentiment Score
The combined sentiment score is calculated by subtracting the negative sentiment ratio from the positive sentiment ratio. This single value indicates whether the text has an overall positive or negative sentiment.

In [14]:
# Calculate the combined sentiment score
combined_sentiment_score = positive_ratio - negative_ratio

# Print the combined sentiment score
print(f"Combined Sentiment Score: {combined_sentiment_score:.2f}")


Combined Sentiment Score: 26.32


This combined score helps us understand the overall sentiment of the text. A positive combined score indicates that the text has more positive sentiment than negative sentiment, while a negative score indicates the opposite.
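
For reuse across several texts, the three calculations can be wrapped in one function. This is a minimal sketch: the function name `sentiment_scores` is introduced here, and returning zeros for an empty text is an assumption, not something the notebook prescribes.

```python
def sentiment_scores(positive_count, negative_count, total_words):
    """Return (positive ratio, negative ratio, combined score) in percent."""
    if total_words == 0:
        # Assumption: treat an empty text as neutral to avoid dividing by zero
        return 0.0, 0.0, 0.0
    positive_ratio = positive_count / total_words * 100
    negative_ratio = negative_count / total_words * 100
    return positive_ratio, negative_ratio, positive_ratio - negative_ratio

# Counts from the earnings call example above: 6 positive, 1 negative, 19 words
pos, neg, combined = sentiment_scores(6, 1, 19)
print(f"{pos:.2f} {neg:.2f} {combined:.2f}")  # 31.58 5.26 26.32
```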



# Example Results
As a separate worked example, suppose a longer, hypothetical text has the following counts and total words:

- Positive Words Count: 4
- Negative Words Count: 2
- Total Words: 30

The sentiment scores would be:

- Positive Sentiment Ratio: (4 / 30) * 100 = 13.33%
- Negative Sentiment Ratio: (2 / 30) * 100 = 6.67%
- Combined Sentiment Score: 13.33 - 6.67 ≈ 6.67 (computed from the unrounded ratios; subtracting the rounded values gives 6.66)

In this case, the text has a generally positive sentiment, as indicated by the positive combined score.

By calculating sentiment ratios as a percentage of the total text, we get a clearer and more standardized view of the sentiment, making it easier to compare texts of different lengths. This method is especially useful in finance, where the tone of financial reports, news articles, and earnings calls can significantly influence decision-making.

---


# EXTRA:
## Understanding SpaCy Language Models: `en_core_web_sm` vs `en_core_web_md`

When using `spaCy` for NLP tasks, you’ll encounter different language models, such as `en_core_web_sm` and `en_core_web_md`. These models differ in size, the amount of data they are trained on, and their performance in various NLP tasks.

---

### What Are SpaCy Language Models?

A **spaCy language model** is a pre-trained model that has learned the linguistic patterns and features of a particular language (in our case, English). These models are used to perform a wide range of NLP tasks, such as:
- **Tokenization**: Splitting text into words or tokens.
- **Part-of-Speech Tagging**: Identifying the grammatical role of each word (e.g., noun, verb, adjective).
- **Lemmatization**: Reducing words to their base form (e.g., "running" becomes "run").
- **Named Entity Recognition (NER)**: Detecting entities like names, dates, and organizations in the text.

---

### `en_core_web_sm` (Small Model)

`en_core_web_sm` is the **small** version of spaCy’s English language model. Here are the key characteristics of this model:

- **Size**: Small (roughly 12 MB in recent spaCy 3.x releases; exact sizes vary by version).
- **Speed**: Fast because of its small size, making it suitable for tasks where quick processing is needed.
- **Accuracy**: Reasonably accurate for basic tasks like tokenization, lemmatization, and part-of-speech tagging.
- **Limitations**: Since it's smaller, the model has fewer parameters and may not perform as well on more complex tasks like Named Entity Recognition (NER). It also does not ship with word vectors, so it cannot compare word meanings as reliably as the larger models can.

This model is ideal for **quick testing** and situations where you don't need deep contextual understanding of the text.

In [15]:
# Loading the small language model
#import spacy
#nlp = spacy.load("en_core_web_sm")

### `en_core_web_md` (Medium Model)

`en_core_web_md` is the **medium** version of spaCy’s English language model, and it comes with a number of improvements over the small model.

- **Size**: Medium (roughly 40 MB in recent spaCy 3.x releases; exact sizes vary by version), larger than the small model.
- **Speed**: Slightly slower than `en_core_web_sm` due to the larger size, but still efficient.
- **Accuracy**: More accurate than the small model, especially for tasks like Named Entity Recognition (NER).
- **Word Vectors**: This model includes word vectors, which means it can better capture the meaning of words and the similarity between them. This can be useful for more advanced sentiment models that rely on word similarity.

Because `en_core_web_md` has word vectors, it can handle more nuanced tasks, and its performance is generally better for more complex NLP tasks.
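
To give a feel for what word vectors make possible, the sketch below computes cosine similarity between a few vectors. The three-dimensional vectors are invented for illustration and are not taken from any spaCy model. Real word vectors have hundreds of dimensions and are learned from data.

```python
import math

# Invented toy vectors, for illustration only (not from a spaCy model)
toy_vectors = {
    "profit": [0.9, 0.1, 0.0],
    "earnings": [0.8, 0.2, 0.1],
    "lawsuit": [0.0, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_close = cosine_similarity(toy_vectors["profit"], toy_vectors["earnings"])
sim_far = cosine_similarity(toy_vectors["profit"], toy_vectors["lawsuit"])
print(sim_close > sim_far)  # True
```

In spaCy, models that ship with vectors expose a comparable measure through the `similarity()` method on `Token`, `Span`, and `Doc` objects.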



In [None]:
# Loading the medium language model
import spacy
nlp = spacy.load("en_core_web_md")

### When to Use Each Model

#### Use `en_core_web_sm` if:
- You are working with smaller datasets or running quick prototypes.
- You don’t need very detailed linguistic analysis.
- Speed and efficiency are more important than accuracy.

#### Use `en_core_web_md` if:
- You need better accuracy, especially for tasks like Named Entity Recognition (NER).
- You are working with text where word meaning and context matter more (e.g., sentiment analysis).
- You need word vectors to understand relationships between words.

### Larger Models: `en_core_web_lg`
There is also a larger model, `en_core_web_lg`, which is even more powerful:

- **Size**: Large (several hundred MB; roughly 560 MB in recent spaCy 3.x releases).
- **Accuracy**: Generally the highest of the three, helped by its larger word-vector table.
- **Word Vectors**: A much larger table of word vectors than the medium model, so more words, including rarer ones, get their own vector.

This model is ideal for large-scale NLP tasks where accuracy is critical and you have the resources to handle its larger size.

However, in many cases, en_core_web_md is a good balance between speed and accuracy for common NLP tasks, especially in finance, where you need both good performance and reasonable processing time.

## Conclusion: Which Model to Use?
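For this course, where we are primarily focused on basic text preprocessing, sentiment analysis, and simple dictionary-based methods, `en_core_web_sm` will be sufficient. It is fast, efficient, and can handle our preprocessing tasks (lemmatization, stop word removal, etc.) with ease. The code cells above load `en_core_web_md`; switching them to `en_core_web_sm` should work for these steps, although individual lemmas may differ slightly between models.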

If you later move on to more complex NLP tasks that require better accuracy and deeper understanding of the text (like Named Entity Recognition or advanced sentiment models), consider upgrading to en_core_web_md or even en_core_web_lg.

By choosing the appropriate model, you can balance performance, accuracy, and speed based on your needs.
