# **spaCy and Selenium API Demonstration**

This notebook is designed to walk you through the core API functions used in the **Real-Time Bitcoin Sentiment Analysis with spaCy and Selenium** project. We'll focus on two powerful tools: **spaCy** for natural language processing (NLP) and **Selenium** for web scraping. This notebook serves as a companion to the main pipeline notebook, `spacy_selenium_example.ipynb`, and uses functions from `spacy_selenium_utils.py`.

If you're new to NLP or web scraping, don't worry! We'll break down each step, explain what’s happening, and share tips to help you get started. By the end, you'll have a solid understanding of how to process text data with spaCy and scrape tweets from X (Twitter) using Selenium.

## **Step 1: spaCy Demonstration**

spaCy is a popular Python library for NLP, widely used for tasks like tokenization, lemmatization, named entity recognition (NER), dependency parsing, and part-of-speech (POS) tagging. In our Bitcoin sentiment analysis project, we use spaCy to preprocess tweets before analyzing their sentiment. Let’s explore these capabilities with a simple example.

### **What You’ll Learn Here**
- How to clean and preprocess raw text (like tweets).
- How to break text into tokens (tokenization).
- How to simplify words to their base form (lemmatization).
- How to identify entities like people, organizations, or monetary values (NER).
- How to understand the grammatical structure of a sentence (dependency parsing and POS tagging).


### **Code Example: Processing a Tweet with spaCy**
Let’s start with a sample tweet about Bitcoin and walk through the preprocessing steps. We’ll use spaCy’s `en_core_web_sm` model, a small English model that’s lightweight and great for beginners.

In [6]:
import spacy
import re

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# Example tweet text
text = "I just bought some Bitcoin #BTC at $50,000!"

# Clean the text: strip URLs, then @mentions and #hashtags
cleaned_text = re.sub(r"http\S+|www\S+", "", text)
cleaned_text = re.sub(r"@\w+|#\w+", "", cleaned_text)
# Drop non-ASCII characters (removes emojis, but also accented letters)
cleaned_text = cleaned_text.encode("ascii", "ignore").decode()
cleaned_text = re.sub(r"\s+", " ", cleaned_text).strip()  # Collapse extra whitespace
print(f"Cleaned Text: {cleaned_text}\n")

# Process the text with spaCy
doc = nlp(cleaned_text)

# Tokenization
print("Tokens:")
for token in doc:
    print(f"{token.text} (Lemma: {token.lemma_}, POS: {token.pos_})")

# Named Entity Recognition (NER)
print("\nEntities:")
for ent in doc.ents:
    print(f"{ent.text} ({ent.label_})")

# Dependency Parsing
print("\nDependency Parsing:")
for token in doc:
    print(f"{token.text} --> {token.dep_} (Head: {token.head.text})")

# Part-of-Speech Tagging
print("\nPOS Tags:")
for token in doc:
    print(f"{token.text}: {token.pos_} ({spacy.explain(token.pos_)})")


Cleaned Text: I just bought some Bitcoin at $50,000!

Tokens:
I (Lemma: I, POS: PRON)
just (Lemma: just, POS: ADV)
bought (Lemma: buy, POS: VERB)
some (Lemma: some, POS: DET)
Bitcoin (Lemma: Bitcoin, POS: PROPN)
at (Lemma: at, POS: ADP)
$ (Lemma: $, POS: SYM)
50,000 (Lemma: 50,000, POS: NUM)
! (Lemma: !, POS: PUNCT)

Entities:
Bitcoin (PERSON)
50,000 (MONEY)

Dependency Parsing:
I --> nsubj (Head: bought)
just --> advmod (Head: bought)
bought --> ROOT (Head: bought)
some --> det (Head: Bitcoin)
Bitcoin --> dobj (Head: bought)
at --> prep (Head: bought)
$ --> nmod (Head: 50,000)
50,000 --> pobj (Head: at)
! --> punct (Head: bought)

POS Tags:
I: PRON (pronoun)
just: ADV (adverb)
bought: VERB (verb)
some: DET (determiner)
Bitcoin: PROPN (proper noun)
at: ADP (adposition)
$: SYM (symbol)
50,000: NUM (numeral)
!: PUNCT (punctuation)


### **Output Explanation**
Let’s break down the output to understand what spaCy is doing:

- **Cleaned Text**: We start with a raw tweet: "I just bought some Bitcoin #BTC at $50,000!". The cleaning step strips URLs, mentions (@username), hashtags (#BTC), emojis, and extra spaces. Only the hashtag applies here, leaving: "I just bought some Bitcoin at $50,000!".
- **Tokens**: spaCy splits the text into tokens (words or punctuation). For each token, we see its **lemma** (base form, e.g., "bought" becomes "buy") and its **POS** (part of speech, e.g., "VERB" for "bought"). This helps us understand the structure of the sentence.
- **Entities (NER)**: spaCy identifies named entities. Here, "Bitcoin" is tagged as a `PERSON` (which isn’t quite correct, more on that below), and "50,000" is correctly tagged as `MONEY` (the entity span printed here does not include the "$" token). NER is useful for extracting meaningful entities like prices or cryptocurrency names.
- **Dependency Parsing**: This shows the grammatical relationships between words. For example, "I" is the subject (`nsubj`) of the verb "bought", and "Bitcoin" is the direct object (`dobj`). This helps us understand how words connect in a sentence.
- **POS Tags**: Each token is labeled with its part of speech (e.g., "PRON" for pronoun, "VERB" for verb). The `spacy.explain()` function gives a beginner-friendly description of each tag.

### **Insights for Beginners**
- **Why Clean Text First?**: Tweets often contain noise like URLs, hashtags, and emojis that can confuse NLP models. Cleaning ensures spaCy focuses on the meaningful content.
- **NER Misclassification**: spaCy tagged "Bitcoin" as a `PERSON`, but it’s a cryptocurrency. This happens because spaCy’s models are trained on general text, not crypto-specific data. For better accuracy, you can fine-tune spaCy with custom data or add a post-processing step to correct such labels.
- **Choosing a Model**: We used `en_core_web_sm` (small model) for speed. If you need better accuracy, try `en_core_web_md` (medium) or `en_core_web_lg` (large), but they require more memory and are slower.
- **Practical Tip**: If you’re new to spaCy, start with small examples like this to get comfortable. Use `spacy.explain()` to learn what tags mean; it’s a great way to build your NLP vocabulary!
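
The cleaning steps from the cell above are repeated for every tweet, so it can help to wrap them in one function. The following is a minimal sketch that uses only the standard library. The name `clean_tweet` is an assumption made for illustration and may not match the helper in `spacy_selenium_utils.py`.

```python
import re

URL_PATTERN = re.compile(r"http\S+|www\S+")   # links
TAG_PATTERN = re.compile(r"@\w+|#\w+")        # @mentions and #hashtags


def clean_tweet(text: str) -> str:
    """Apply the same cleaning steps as the demo cell (hypothetical helper)."""
    text = URL_PATTERN.sub("", text)
    text = TAG_PATTERN.sub("", text)
    # Drop non-ASCII characters: removes emojis, but also accented letters
    text = text.encode("ascii", "ignore").decode()
    # Collapse runs of whitespace left behind by the removals
    return re.sub(r"\s+", " ", text).strip()


print(clean_tweet("I just bought some Bitcoin #BTC at $50,000!"))
# I just bought some Bitcoin at $50,000!
```

One design caveat: the ASCII step also strips characters such as "ä" or "ü", so non-English tweets lose letters. If that matters for your data, a narrower emoji filter would be a better fit than this blanket rule.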

### **Additional Example: Comparing Different Tweets**
Let’s try spaCy on another tweet to see how it handles variations in text. This helps us understand how spaCy behaves with different sentence structures.

In [7]:
# Another example tweet
text2 = "Elon Musk says Bitcoin will hit $100,000 by 2025! 🚀 #CryptoNews"

# Clean the text: strip URLs, then @mentions and #hashtags
cleaned_text2 = re.sub(r"http\S+|www\S+", "", text2)
cleaned_text2 = re.sub(r"@\w+|#\w+", "", cleaned_text2)
# Drop non-ASCII characters (this removes the rocket emoji)
cleaned_text2 = cleaned_text2.encode("ascii", "ignore").decode()
cleaned_text2 = re.sub(r"\s+", " ", cleaned_text2).strip()  # Collapse extra whitespace
print(f"Cleaned Text: {cleaned_text2}\n")

# Process with spaCy
doc2 = nlp(cleaned_text2)

# Tokenization
print("Tokens:")
for token in doc2:
    print(f"{token.text} (Lemma: {token.lemma_}, POS: {token.pos_})")

# Named Entity Recognition (NER)
print("\nEntities:")
for ent in doc2.ents:
    print(f"{ent.text} ({ent.label_})")

Cleaned Text: Elon Musk says Bitcoin will hit $100,000 by 2025!

Tokens:
Elon (Lemma: Elon, POS: PROPN)
Musk (Lemma: Musk, POS: PROPN)
says (Lemma: say, POS: VERB)
Bitcoin (Lemma: Bitcoin, POS: PROPN)
will (Lemma: will, POS: AUX)
hit (Lemma: hit, POS: VERB)
$ (Lemma: $, POS: SYM)
100,000 (Lemma: 100,000, POS: NUM)
by (Lemma: by, POS: ADP)
2025 (Lemma: 2025, POS: NUM)
! (Lemma: !, POS: PUNCT)

Entities:
Elon Musk (PERSON)
Bitcoin (PERSON)
100,000 (MONEY)
2025 (DATE)


### **Expected Output**
- **Cleaned Text**: "Elon Musk says Bitcoin will hit $100,000 by 2025!"
- **Tokens**: You’ll see tokens like "Elon" (Lemma: Elon, POS: PROPN), "says" (Lemma: say, POS: VERB), etc.
- **Entities**: "Elon Musk" should be tagged as `PERSON`, "Bitcoin" as `PERSON` (again, a misclassification), "100,000" as `MONEY`, and "2025" as `DATE`.

### **Beginner Insight: Handling NER Errors**
As you saw, spaCy sometimes mislabels "Bitcoin" as a `PERSON`. In a real project, you can fix this by:

1. Using a Custom List: Check if entities match a list of known cryptocurrencies (like Bitcoin, Ethereum) and relabel them as `PRODUCT` or a custom label.
2. Training a Model: Fine-tune spaCy with crypto-related data to improve its accuracy.
3. Post-Processing: Write rules to correct common errors, e.g., if an entity is "Bitcoin," change its label to `PRODUCT`.
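
The custom-list and post-processing options can be sketched in a few lines. This example works on plain `(text, label)` pairs so it runs without a spaCy model. With spaCy you would build the input from `doc.ents`. The function name `relabel_crypto`, the `KNOWN_CRYPTO` set, and the `CRYPTO` label are all assumptions chosen for illustration.

```python
# Illustrative list of names to correct; extend it for your own project
KNOWN_CRYPTO = {"bitcoin", "btc", "ethereum", "eth"}


def relabel_crypto(entities, new_label="CRYPTO"):
    """Return (text, label) pairs with known cryptocurrency names relabeled."""
    return [
        (text, new_label if text.strip().lower() in KNOWN_CRYPTO else label)
        for text, label in entities
    ]


# With spaCy, the input would be: [(ent.text, ent.label_) for ent in doc.ents]
entities = [
    ("Elon Musk", "PERSON"),
    ("Bitcoin", "PERSON"),
    ("100,000", "MONEY"),
    ("2025", "DATE"),
]
corrected = relabel_crypto(entities)
print(corrected)
# [('Elon Musk', 'PERSON'), ('Bitcoin', 'CRYPTO'), ('100,000', 'MONEY'), ('2025', 'DATE')]
```

Keeping the correction outside the model is the cheapest of the three options: it needs no training data, but it only fixes names you have listed in advance.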


### **Why This Matters for Sentiment Analysis**
In our project, we use spaCy to preprocess tweets before feeding them into the VADER sentiment analyzer. Tokenization and lemmatization help standardize the text (e.g., "bought" and "buying" become "buy"), which makes sentiment analysis more consistent. NER helps us identify key entities like prices or crypto names, which we can use to match with CoinGecko data.
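
To compare a price mentioned in a tweet with market data, the text of a `MONEY` entity first has to become a number. The sketch below shows one possible way to do that with a regular expression. The helper name `parse_money` and the supported suffixes (K, M, B) are assumptions for illustration, not part of the project code.

```python
import re

# Multipliers for common shorthand such as "74K" (assumed convention)
SUFFIXES = {"k": 1_000, "m": 1_000_000, "b": 1_000_000_000}
MONEY_PATTERN = re.compile(r"\$?\s*([\d,]*\.?\d+)\s*([kmb])?", re.IGNORECASE)


def parse_money(text):
    """Convert strings like "$50,000" or "74K" to a float; return None if unparseable."""
    match = MONEY_PATTERN.fullmatch(text.strip())
    if match is None:
        return None
    number = float(match.group(1).replace(",", ""))
    suffix = match.group(2)
    return number * SUFFIXES[suffix.lower()] if suffix else number


print(parse_money("50,000"))   # 50000.0
print(parse_money("$74K"))     # 74000.0
print(parse_money("a lot"))    # None
```

Returning `None` for unparseable text lets the caller skip vague amounts instead of failing on them.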

## **Step 2: Selenium Demonstration**

Selenium is a tool for automating web browsers, and we use it to scrape tweets from X (Twitter). In our project, we need to collect Bitcoin-related tweets to analyze public sentiment. X requires users to log in to access search results, so we’ll use Selenium to automate the login process and scrape tweets.

### **What You’ll Learn Here**
- How to set up Selenium to interact with a website.
- How to log in to X using Selenium.
- How to scrape tweets and handle dynamic content.
- Best practices for web scraping (e.g., avoiding rate limits, handling errors).

### **Code Example: Scraping Tweets with Selenium**
We’ll use the `BitcoinSentimentAnalyzer` class from `spacy_selenium_utils.py` to scrape tweets. This class handles the login and scraping process for us.

In [8]:
from spacy_selenium_utils import BitcoinSentimentAnalyzer

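import os

# Initialize the analyzer with credentials read from environment variables.
# X_USERNAME and X_PASSWORD are assumed variable names; set them before running
# and avoid hard-coding real credentials in a notebook.
analyzer = BitcoinSentimentAnalyzer(
    x_username=os.environ["X_USERNAME"],
    x_password=os.environ["X_PASSWORD"],
)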

# Scrape tweets
tweets = analyzer.scrape_tweets(keywords=["Bitcoin"], max_tweets=3)  # Limited to 3 tweets for demo

# Display the scraped tweets
print("Sample Tweets:")
for tweet in tweets:
    print(f"- {tweet['text']}\n")

# Clean up
del analyzer

Sample Tweets:
- Use your  

#BITCOIN $BTC #BTC

-  BREAKING: Nach einem Treffen mit El Salvadors Präsident 
@nayibbukele
 postet Panamas Bürgermeister 
@Mayer
 über eine „Bitcoin Reserve“. 

- Devs, degens, and even your grandma who still thinks Bitcoin is a slot machine.  Let’s dive into what they’re building, why it’s a big deal for Crypto Twitter (CT), the blockchain, and us commoners, plus why you should be BULLISH


### **Output Explanation**
The output shows three sample tweets about Bitcoin. Each tweet is a dictionary with `text` (the tweet content) and `timestamp` (when it was scraped). Because the search results are live, the tweets you get will differ from run to run. For example, a tweet like the following could appear:

- "Bitcoin just touched $74K. That same HR manager who flagged my crypto side hustle now runs 'on-chain payroll workshops.' Yeah Jessica, glad compliance caught up with capitalism."

This tweet mentions Bitcoin’s price ($74K) and has a mix of sentiment (positive about Bitcoin, sarcastic about the HR manager).

### **How Selenium Works in This Example**
1. **Login**: The `BitcoinSentimentAnalyzer` class navigates to X’s login page, enters the username and password, and clicks the "Log in" button.
2. **Search**: It searches for the keyword "Bitcoin" using X’s search bar.
3. **Scrolling**: X loads tweets dynamically as you scroll. Selenium scrolls down the page to load more tweets until it collects the desired number (`max_tweets=3`).
4. **Extraction**: It extracts the text of each tweet and stores it with a timestamp.
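
The scrolling and extraction steps can be sketched as a loop that is independent of the browser. This is only an illustration of the idea, not the code in `spacy_selenium_utils.py`: `collect_tweets`, its parameters, and the `FakeTimeline` stand-in are hypothetical names, and the fake timeline replaces a real Selenium driver so the example runs anywhere.

```python
from datetime import datetime, timezone


def collect_tweets(get_visible_texts, scroll_down, max_tweets, max_scrolls=10):
    """Collect up to max_tweets unique tweet texts, scrolling between passes."""
    collected, seen = [], set()
    if max_tweets <= 0:
        return collected
    for _ in range(max_scrolls + 1):
        for text in get_visible_texts():
            if text in seen:
                continue  # already stored on an earlier pass
            seen.add(text)
            collected.append(
                {"text": text, "timestamp": datetime.now(timezone.utc).isoformat()}
            )
            if len(collected) >= max_tweets:
                return collected
        # With Selenium this could be:
        # driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        scroll_down()
    return collected


class FakeTimeline:
    """Stand-in for a page that reveals more tweets each time it is scrolled."""

    def __init__(self, tweets, page_size=2):
        self.tweets, self.page_size, self.loaded = tweets, page_size, page_size

    def visible_texts(self):
        return self.tweets[: self.loaded]

    def scroll(self):
        self.loaded += self.page_size


timeline = FakeTimeline([f"tweet {i}" for i in range(1, 6)])
tweets = collect_tweets(timeline.visible_texts, timeline.scroll, max_tweets=3)
print([t["text"] for t in tweets])
# ['tweet 1', 'tweet 2', 'tweet 3']
```

The `max_scrolls` cap is a safeguard: if the page stops loading new tweets, the loop ends instead of scrolling forever.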

### **Insights for Beginners**
- **Why Selenium?** X requires a login to show search results, and its content is loaded dynamically via JavaScript, so a plain HTTP request is not enough. Selenium automates a real browser (like Chrome) to interact with the site as a human would.
- **Headless Mode**: The `BitcoinSentimentAnalyzer` uses Selenium in headless mode (no visible browser window) for efficiency. If you’re debugging, you can disable headless mode to watch Selenium in action. Just remove the `--headless` argument in `spacy_selenium_utils.py`.
- **ChromeDriver Setup**: Selenium needs a ChromeDriver executable that matches your Chrome browser version. Selenium 4.6 and later include Selenium Manager, which usually downloads a matching driver automatically. If you still get a version mismatch error, download the correct ChromeDriver (chromedriver.chromium.org for older Chrome releases, the Chrome for Testing site for newer ones) and place it in your project directory or on your PATH.
- **Rate Limits and Ethics**: X has strict rules about scraping. Be cautious not to scrape too many tweets at once (we limited to 3 here for safety). Always respect X’s terms of service, and consider using X’s official API for larger projects (though it may require a paid plan).
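
One simple way to avoid hammering a site is to pause for a slightly random interval between actions such as scrolls or searches. The sketch below is an assumption about how such a pause could look; `polite_pause` and its default values are hypothetical and are not taken from `spacy_selenium_utils.py`.

```python
import random
import time


def polite_pause(base_seconds=2.0, jitter_seconds=1.5, rng=random, sleep=time.sleep):
    """Sleep for base_seconds plus a random extra delay; return the delay used."""
    delay = base_seconds + rng.uniform(0.0, jitter_seconds)
    sleep(delay)
    return delay


# Demo: record the delay instead of actually sleeping, with a seeded generator
recorded = []
delay = polite_pause(rng=random.Random(42), sleep=recorded.append)
print(f"Would wait {delay:.2f} seconds")
```

In a real scraping loop you would call `polite_pause()` with its defaults between page interactions. The `rng` and `sleep` parameters exist only so the behavior can be demonstrated and tested without waiting.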

### **Why This Matters for Sentiment Analysis**
Scraping tweets gives us raw data to analyze. In our project, we collect tweets mentioning "Bitcoin" to gauge public sentiment. The more diverse and recent the tweets, the better our sentiment analysis reflects current market moods.

## **Step 3: Integration with Main Pipeline**

The spaCy and Selenium functionalities we explored are integrated into the main pipeline in `spacy_selenium_utils.py`. The `BitcoinSentimentAnalyzer` class combines both tools to:

1. Scrape tweets (Selenium).
2. Preprocess them (spaCy).
3. Analyze sentiment (using VADER, which we’ll cover in the main pipeline).
4. Correlate sentiment with Bitcoin prices (using CoinGecko API).
5. Visualize the results.
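
To make the correlation step more concrete, here is a minimal sketch of a Pearson correlation between sentiment scores and price changes, written with the standard library only. The main pipeline may compute this differently, and the numbers below are made up purely for illustration; they are not project results.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Made-up example values: average sentiment per interval and price change (%)
sentiment = [0.10, 0.35, -0.20, 0.50, 0.05]
price_change = [0.4, 1.2, -0.9, 2.1, 0.0]
r = pearson(sentiment, price_change)
print(f"Correlation: {r:.3f}")
```

A value near 1 means sentiment and price moved together in the sample, a value near -1 means they moved in opposite directions, and a value near 0 means no linear relationship. With only a handful of points, treat any result as a rough indication.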

For the full pipeline execution, see `spacy_selenium_example.ipynb`. That notebook ties everything together, showing how the preprocessing and scraping steps fit into the larger workflow.