<a href="https://colab.research.google.com/github/joshcova/NLP_Workshop/blob/main/03_NLP_Spacy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text representation in spaCy

`spacy` (officially styled spaCy) is a popular and highly efficient open-source library for Natural Language Processing (NLP) in Python. It supports a wide range of NLP tasks, including tokenization, part-of-speech (POS) tagging and named entity recognition (NER).

When `spacy` processes text, it converts the text into a Doc object. This Doc object is essentially a sequence of Token objects, which are the fundamental building blocks for all subsequent NLP operations.

In NLP, a corpus refers to a large and structured set of texts used for linguistic analysis. A document is an individual text within that corpus. In spaCy, the Doc object represents an individual document that has been processed by the nlp pipeline.

Tokens are the smallest units of text obtained after splitting a sentence or phrase. They are typically individual words, punctuation marks, or symbols, separated by whitespace or specific rules defined by the tokenizer.
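To make the difference between whitespace splitting and rule-based splitting concrete, here is a minimal sketch that uses only Python's `re` module. It is a simplified stand-in and not spaCy's tokenizer, which applies language-specific prefix, suffix and exception rules; the regular expression and the `simple_tokenize` name are assumptions made for illustration.

```python
import re

def simple_tokenize(text):
    """Split text into word-like tokens and single punctuation/symbol tokens."""
    # \w+ matches runs of letters, digits and underscores;
    # [^\w\s] matches any single character that is neither word-like nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "This is a sentence. $20 is the price."

# Naive approach: split on whitespace only, so punctuation stays glued to words
whitespace_tokens = sentence.split()

# Rule-based approach: punctuation and symbols become tokens of their own
rule_tokens = simple_tokenize(sentence)

print(whitespace_tokens)
print(rule_tokens)
```

Splitting on whitespace alone leaves `sentence.` and `$20` as single units, whereas the rule-based version separates the full stop and the currency symbol into their own tokens.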

In [None]:
import spacy

In [None]:
# Next we download one of spaCy's trained English pipelines. They come in different sizes: small (en_core_web_sm), medium (en_core_web_md) and large (en_core_web_lg)
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.8.0/en_core_web_md-3.8.0-py3-none-any.whl (33.5 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
# load the Spacy pipeline into your environment
nlp = spacy.load("en_core_web_md")

In [None]:
# Here we pass a string (two short sentences in this case) through spaCy's nlp pipeline and save the result as a Doc object

doc = nlp("This is a sentence. $20 is the price.")

In [None]:
# You can see that some tokens are words, some are not

for token in doc:
  print(token)

In [None]:
# Side note: what you have seen above is a for loop, something that is frequently used in Python as well as in other programming languages.
# You can see how it works by running the code below, which prints the numbers 1 to 9 (range excludes its end value):

for i in range(1,10):
  print(i)

In [None]:
# Back to Spacy and the NLP world
# Tokens are nice, but what tends to be more important is checking the Part-of-Speech (POS) and the Lemma of the individual token.

doc = nlp("This is a better sentence.")
for token in doc:
    print(token, " | ", token.pos_, " | ", token.lemma_)

In [None]:
# We can also do some Named Entity Recognition (NER), which allows us to systematically identify names, places and organizations

doc = nlp("Elon Musk bought Twitter for $44 billion.")

for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))

In [None]:
# Now we can combine spaCy with the powerful pandas library and export the output of our analysis into a pandas dataframe

import pandas as pd

data = []
for ent in doc.ents:
    data.append({
        "text": ent.text,
        "label": ent.label_,
        "explanation": spacy.explain(ent.label_)
    })

df = pd.DataFrame(data)

# If we want to export it to a file, which will appear under the files icon on your left, we can use this command:

df.to_csv("sample_df.csv")

# This is helpful if we want to use the file locally or send the output to a collaborator

## Stopwords

The spaCy library offers stop word lists for different languages.

In [None]:
from spacy.lang.en.stop_words import STOP_WORDS

## Below for German
#from spacy.lang.de.stop_words import STOP_WORDS

Be careful when using off-the-shelf stop word lists. Which stop words are in scope and which are not should always be dictated by your research question and the corpora that you are working with.
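As a minimal sketch of what adapting a stop word list can look like, the example below removes a few negations from spaCy's English `STOP_WORDS` set before filtering. The choice of negations is a hypothetical one (it could matter for a sentiment-focused project), and `spacy.blank("en")` is used only so that the sketch runs with the tokenizer alone, without a trained pipeline.

```python
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

# A blank English pipeline only tokenizes, so no model download is needed here
nlp_blank = spacy.blank("en")

# Hypothetical project decision: negations carry meaning, so they should be kept
negations = {"not", "no", "never"}

# Set difference builds a new set; spaCy's own list is left untouched
custom_stop_words = STOP_WORDS - negations

example_doc = nlp_blank("This is not good at all")

# Filter once with the default list and once with the adapted list
default_kept = [t.text for t in example_doc if t.lower_ not in STOP_WORDS]
custom_kept = [t.text for t in example_doc if t.lower_ not in custom_stop_words]

print(default_kept)
print(custom_kept)
```

With the default list the word "not" is filtered out, which changes the meaning of the sentence; with the adapted list it survives.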

In [None]:
doc = nlp("This is a sample sentence, showing off the stop words filtration.")

for token in doc:
    if token.is_stop:
        print(token.text)

Let's bring it all together using the dataframe that we used yesterday!

In [None]:
import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/joshcova/NLP_Workshop/refs/heads/main/data/brexit_data.csv")

In [None]:
df_select = df[["text","party"]]

In [None]:
# let's sample some rows from the dataframe

df_select = df_select.sample(100, random_state=1)

Because it is usually a good idea to pre-process our text (see the lecture slides), it is customary to wrap the pre-processing steps in a user-defined function, which makes it easy to reuse them in other use cases.

In [None]:
# This pre-processing function drops punctuation as well as the stop words on spaCy's built-in STOP_WORDS list. Your specific case may very well differ.
def preprocess(text):
    doc = nlp(text)
    no_stop_words = [token.text for token in doc if not token.is_stop and not token.is_punct]
    return no_stop_words

In [None]:
# Now that we have our function, we can apply it to our dataframe using the handy apply method.
# The code below applies our user-defined preprocess function to every row of the text column
df_select["text_no_stop_words"] = df_select["text"].apply(preprocess)
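The `apply` call above runs the pipeline once per row. For larger corpora, spaCy's `nlp.pipe` processes texts in batches instead. The sketch below shows the same filtering with `nlp.pipe`; the `preprocess_many` name, the batch size and the two sample sentences are assumptions for illustration, and a blank English pipeline is used so that it runs without a model download.

```python
import spacy

# Blank pipeline (tokenizer only) so the sketch runs without a trained model;
# in the notebook you would pass the `nlp` object loaded earlier instead.
nlp_blank = spacy.blank("en")

def preprocess_many(texts, nlp_obj, batch_size=50):
    """Tokenize many texts in batches and drop stop words and punctuation."""
    cleaned = []
    # nlp.pipe yields one Doc per input text, in the same order as the input
    for doc in nlp_obj.pipe(texts, batch_size=batch_size):
        cleaned.append([t.text for t in doc if not t.is_stop and not t.is_punct])
    return cleaned

# Made-up sample sentences, not rows from the workshop data
sample_texts = ["We will leave the European Union.", "Parliament voted again!"]
cleaned_texts = preprocess_many(sample_texts, nlp_blank)
print(cleaned_texts)
```

On the dataframe this could be used as `df_select["text_no_stop_words"] = preprocess_many(df_select["text"].tolist(), nlp)`, since the returned list has one entry per row in the original order.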

In [None]:
## Now we can create a new pre-processing function that no longer removes stop words and punctuation, but instead retrieves the POS for every token in our dataframe

def pos_tag_text(text):
    doc = nlp(text)
    return [(token.text, token.pos_, token.tag_) for token in doc]

In [None]:
df_select["pos"] = df_select["text"].apply(pos_tag_text)
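Each row of the `pos` column now holds a list of `(text, pos, tag)` tuples, where `token.pos_` is the coarse-grained part of speech and `token.tag_` the fine-grained tag. A short sketch of how such a list could be summarized with the standard library follows. The `tagged` list is hand-written in that shape, not actual model output, so a trained pipeline may tag some tokens differently; `count_pos` and `keep_pos` are hypothetical helper names.

```python
from collections import Counter

# Hand-written example in the (text, pos, tag) shape returned by pos_tag_text;
# this is an illustration, not output copied from a spaCy model.
tagged = [
    ("This", "PRON", "DT"),
    ("is", "AUX", "VBZ"),
    ("a", "DET", "DT"),
    ("better", "ADJ", "JJR"),
    ("sentence", "NOUN", "NN"),
    (".", "PUNCT", "."),
]

def count_pos(tagged_tokens):
    """Count how often each coarse-grained POS label occurs."""
    return Counter(pos for _text, pos, _tag in tagged_tokens)

def keep_pos(tagged_tokens, wanted):
    """Return the token texts whose coarse-grained POS label is in `wanted`."""
    return [text for text, pos, _tag in tagged_tokens if pos in wanted]

pos_counts = count_pos(tagged)
content_words = keep_pos(tagged, {"NOUN", "ADJ"})

print(pos_counts)
print(content_words)
```

Either helper can be passed to `df_select["pos"].apply(...)` to summarize every row, for example to keep only nouns and adjectives.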

## Sentiment analysis

Using NLP to conduct sentiment analysis is a frequent use case for applied text-as-data research in the social sciences. But what is sentiment analysis?

While there are different methods, in essence sentiment analysis uses computational methods to gauge whether a text expresses a positive, negative or neutral attitude.

The most basic approach to sentiment analysis relies on sentiment dictionaries. There are different sentiment dictionaries, which list "positive" and "negative" words (e.g. the Lexicoder Sentiment Dictionary).

This approach is called **word-list based sentiment analysis**.
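The mechanics of a word-list approach fit in a few lines of plain Python. The sketch below sums the valence of every dictionary word found in a text. The lexicon is a tiny illustrative one whose words and values are assumptions, not the contents of any published sentiment dictionary, and `word_list_score` is a hypothetical helper name.

```python
import re

# Tiny illustrative lexicon: words and valences are assumptions for this sketch,
# not the actual contents of any published sentiment dictionary.
TOY_LEXICON = {"good": 3, "great": 3, "awful": -3, "bad": -3}

def word_list_score(text, lexicon=TOY_LEXICON):
    """Sum the valence of every lexicon word that occurs in the text."""
    # Lowercase the text and keep only runs of letters and apostrophes
    words = re.findall(r"[a-z']+", text.lower())
    # Words that are not in the lexicon contribute 0
    return sum(lexicon.get(word, 0) for word in words)

examples = [
    "This is a good movie",
    "This is not good at all",
    "What an awful experience!",
]

toy_scores = [word_list_score(t) for t in examples]
print(toy_scores)
```

Note that the second sentence receives the same score as the first: a plain word list only sees "good" and has no notion of the preceding "not".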

Here we will showcase how such a sentiment dictionary analysis works by using the [Afinn sentiment analysis](https://github.com/fnielsen/afinn) and the [VADER](https://vadersentiment.readthedocs.io/en/latest/) sentiment analysis tool.

As these are less commonly used Python libraries that are not part of the default environment, we first need to `pip install` them.

In [None]:
pip install Afinn

Collecting Afinn
  Downloading afinn-0.1.tar.gz (52 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: Afinn
  Building wheel for Afinn (setup.py) ... done
  Created wheel for Afinn: filename=afinn-0.1-py3-none-any.whl size=53431 sha256=f79560413f68122539a3a2c5f9f9836d58d67337ec411a383cf465e42b964538
  Stored in directory: /root/.cache/pip/wheels/f9/72/27/74994e77200dae3d6aea2b546264500cee21f738c51241320b
Successfully built Afinn
Installing collected packages: Afinn
Successfully installed Afinn-0.1


In [None]:
from afinn import Afinn

In [None]:
# Some easy texts that help us get an intuition of how sentiment analysis works

texts = [
    "This is a good movie",
    "This is not good at all",
    "What an awful experience!"
]


In [None]:
afinn = Afinn()

In [None]:
texts_df = pd.DataFrame(texts, columns=["text"])

In [None]:
## Now we apply the score method of our Afinn object to every text

texts_df["polarity_score"] = texts_df["text"].apply(afinn.score)

In [None]:
## How would you interpret this output?

print(texts_df)

Another commonly used dictionary, especially for social media texts, is the [VADER](https://vadersentiment.readthedocs.io/en/latest/) (Valence Aware Dictionary and sEntiment Reasoner) sentiment analysis tool.

Unlike plain word lists, VADER is a **rule-based** tool: it combines a sentiment lexicon with rules that take context such as negation into account. Let's analyze the same texts and see what (if anything) changes.
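To give an intuition for what a rule adds on top of a word list, here is a deliberately simplified sketch with a single negation rule: a word's valence is flipped if a negator appears within the few words before it. This is not VADER's actual implementation, which ships its own lexicon and a richer rule set; the lexicon values, the negator list, the window size and the `negation_aware_score` name are all assumptions for illustration.

```python
import re

# Illustrative values only; VADER uses its own lexicon and additional rules
# (for example for intensifiers, punctuation and capitalization).
TOY_LEXICON = {"good": 3, "awful": -3}
NEGATORS = {"not", "never", "no"}

def negation_aware_score(text, window=3):
    """Sum word valences, flipping the sign when a negator appears shortly before."""
    words = re.findall(r"[a-z']+", text.lower())
    total = 0
    for i, word in enumerate(words):
        valence = TOY_LEXICON.get(word, 0)
        if valence == 0:
            continue  # word carries no sentiment in the toy lexicon
        # Look at up to `window` words directly before the sentiment word
        preceding = words[max(0, i - window):i]
        if any(w in NEGATORS for w in preceding):
            valence = -valence
        total += valence
    return total

plain_score = negation_aware_score("This is a good movie")
negated_score = negation_aware_score("This is not good at all")
print(plain_score, negated_score)
```

With the negation rule in place, "This is not good at all" is scored negatively, whereas summing word valences alone would treat it the same as "This is a good movie".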

In [None]:

pip install vaderSentiment

In [None]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [None]:
analyzer = SentimentIntensityAnalyzer()

In [None]:
for t in texts:
  score = analyzer.polarity_scores(t)
  print(score)

{'neg': 0.0, 'neu': 0.58, 'pos': 0.42, 'compound': 0.4404}
{'neg': 0.325, 'neu': 0.675, 'pos': 0.0, 'compound': -0.3412}
{'neg': 0.523, 'neu': 0.477, 'pos': 0.0, 'compound': -0.5093}


In [None]:
## As we did above for Afinn, we can use apply; each row now holds a dictionary with the neg, neu, pos and compound scores

df_select["sentiment_scores"] = df_select["text"].apply(analyzer.polarity_scores)
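Because each entry is a dictionary, it is often convenient to reduce it to a single label. The sketch below maps the `compound` value to a coarse category using cut-offs of ±0.05, which the VADER documentation mentions as typical threshold values; treat them as an assumption to adjust for your own data. The `label_from_compound` name is hypothetical, and the three dictionaries are copied from the VADER output printed above.

```python
def label_from_compound(scores, threshold=0.05):
    """Map a VADER score dictionary to a coarse label based on its 'compound' value."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Score dictionaries copied from the VADER output printed earlier
vader_outputs = [
    {'neg': 0.0, 'neu': 0.58, 'pos': 0.42, 'compound': 0.4404},
    {'neg': 0.325, 'neu': 0.675, 'pos': 0.0, 'compound': -0.3412},
    {'neg': 0.523, 'neu': 0.477, 'pos': 0.0, 'compound': -0.5093},
]

labels = [label_from_compound(s) for s in vader_outputs]
print(labels)
```

On the dataframe, the same function can be applied row by row with `df_select["sentiment_scores"].apply(label_from_compound)`, and the raw compound score can be extracted with `df_select["sentiment_scores"].apply(lambda s: s["compound"])`.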