# Text Scraping and Cleaning
Text scraping is one of the most common ways to collect data for Natural Language Processing (NLP). **Text scraping**, sometimes called web scraping, extracts text from websites or other online sources. It involves automatically retrieving web pages, parsing the HTML content, and extracting relevant text for further analysis.

Scraping is used to gather large amounts of textual data from the web, which can then feed NLP tasks such as sentiment analysis, topic modeling, and language modeling. By extracting text from websites, researchers and data scientists can obtain data for training and testing NLP models or for conducting text-based research.

Typically, text scraping involves the following steps:

1. Retrieving web pages: The scraping process starts with fetching the HTML content of web pages. This can be done with an HTTP client such as `urllib` or `requests`, a crawling framework like Scrapy, or a browser automation tool like Selenium. Note that Beautiful Soup does not fetch pages itself; it parses HTML that has already been retrieved.

2. Parsing HTML content: Once the web page is retrieved, the HTML content needs to be parsed to identify the relevant text elements. HTML parsing libraries like Beautiful Soup or lxml can be used to navigate the HTML structure and extract specific text elements or classes.

3. Extracting text: After identifying the relevant HTML elements, the desired text is extracted from them. This may involve extracting the text from HTML tags, cleaning the text by removing unwanted characters or HTML markup, and organizing the extracted text in a suitable format.

4. Storing the extracted data: Finally, the extracted text can be stored in a structured format such as CSV, JSON, or a database for further analysis or integration into NLP workflows.
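
The steps above can be sketched end to end without touching the network. This is a minimal illustration, not a production scraper: the HTML string is a hypothetical stand-in for a fetched page, and the standard library's `html.parser` is swapped in for Beautiful Soup so the example has no third-party dependencies. The names `SAMPLE_HTML`, `ParagraphExtractor`, and `clean` are made up for this sketch.

```python
import json
import re
from html.parser import HTMLParser

# Step 1 (retrieval) is skipped: this hypothetical page stands in for a fetched response.
SAMPLE_HTML = """
<html><body>
  <h1>Example</h1>
  <p>First   paragraph with a <a href="#">link</a>.</p>
  <p>Second paragraph.[1]</p>
</body></html>
"""


class ParagraphExtractor(HTMLParser):
    """Step 2: walk the HTML and collect the text found inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # > 0 while we are inside a <p> element
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.paragraphs[-1] += data


def clean(raw):
    """Step 3: drop numeric citation markers and collapse runs of whitespace."""
    without_citations = re.sub(r"\[\d+\]", "", raw)
    return re.sub(r"\s+", " ", without_citations).strip()


parser = ParagraphExtractor()
parser.feed(SAMPLE_HTML)
parser.close()

# Step 4: store the cleaned text in a structured format (JSON here).
records = [{"id": i, "text": clean(p)} for i, p in enumerate(parser.paragraphs)]
payload = json.dumps(records, indent=2)
print(payload)
```

In a real workflow, the `payload` string would typically be written to a file or database rather than printed.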

It's important to note that when performing text scraping, one should respect website owners' terms of service, adhere to legal and ethical guidelines, and be mindful of not overloading servers with excessive requests. Additionally, some websites may have restrictions on scraping or may require authentication for access.
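
One way to act on these guidelines is to consult a site's `robots.txt` before fetching anything. The sketch below uses Python's standard `urllib.robotparser` on a hypothetical `robots.txt`; the rules, the URLs, and the `example-research-bot` user agent are all invented for illustration. A real crawler would load the live file with `set_url()` and `read()` instead of `parse()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

USER_AGENT = "example-research-bot"  # hypothetical user agent name

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

urls = [
    "https://example.com/articles/nlp",
    "https://example.com/private/drafts",
]

# Keep only the URLs this user agent is allowed to fetch.
allowed = [url for url in urls if robots.can_fetch(USER_AGENT, url)]

# Honor the requested delay; fall back to 1 second if none is given.
delay = robots.crawl_delay(USER_AGENT) or 1

for url in allowed:
    # A real scraper would call time.sleep(delay) before each request.
    print(f"OK to fetch {url}; wait {delay}s between requests")
```

Note that `robots.txt` covers only crawler etiquette; a site's terms of service may impose additional restrictions.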

## Data Collection

There are several ways to scrape web content from the internet. Here are some common methods:

1. Libraries and frameworks: Various programming libraries and frameworks provide functionality for web scraping. Some popular options include:

  - **Beautiful Soup**: A Python library for parsing HTML and XML documents, making it easier to navigate and extract data from web pages.
  - **Scrapy**: A Python framework for building web spiders, which are specialized programs for automated web scraping.
  - **Selenium**: A tool for automating web browsers, useful for scraping web pages that require JavaScript execution or user interaction.
  
2. API access: Many websites and online services provide APIs (Application Programming Interfaces) that allow you to access and retrieve data in a structured manner. APIs often provide specific endpoints or methods for retrieving data, which can be more reliable and efficient compared to scraping HTML directly. However, not all websites offer APIs, and some may have restrictions or require authentication.

3. RSS feeds: Some websites publish RSS (Really Simple Syndication) feeds that provide structured updates or summaries of their content. RSS feeds can be parsed to extract relevant text or metadata, making it easier to gather specific information from multiple sources.

4. Data marketplaces: There are data marketplaces and platforms that provide pre-scraped data from various websites, saving you the effort of scraping yourself. These platforms often offer APIs or downloadable datasets for easy integration into your projects.

5. Custom scripts: For websites without APIs or other scraping-friendly features, you can write custom scripts using programming languages like Python, Ruby, or JavaScript. These scripts typically simulate user interactions, such as sending HTTP requests, parsing HTML responses, and extracting desired content.

6. Scraping tools and services: Several tools and services are available that simplify the process of web scraping. They offer point-and-click interfaces, require little to no coding, and provide features like scheduling, data extraction, and storage. Examples include Octoparse, ParseHub, and Import.io.
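
Of the methods above, RSS parsing is easy to illustrate offline because a feed is just structured XML. The following sketch parses a hypothetical RSS 2.0 document with the standard library's `xml.etree.ElementTree`; the feed content and the `entries` field names are assumptions made for this example, and a real feed would be downloaded first.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document standing in for a downloaded feed.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example NLP Blog</title>
    <item>
      <title>Tokenization basics</title>
      <link>https://example.com/tokenization</link>
      <description>How text is split into tokens.</description>
    </item>
    <item>
      <title>Intro to named entities</title>
      <link>https://example.com/entities</link>
      <description>Finding people, places, and organizations.</description>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(SAMPLE_RSS)

# Each <item> element is one entry in the feed.
entries = [
    {
        "title": item.findtext("title"),
        "link": item.findtext("link"),
        "summary": item.findtext("description"),
    }
    for item in root.iter("item")
]

for entry in entries:
    print(entry["title"], "->", entry["link"])
```

Because the feed is already structured, no HTML cleanup is needed before the text is handed to an NLP pipeline.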


We'll start by looking at two Python libraries that can scrape Wikipedia for data: `wikipedia` and `BeautifulSoup`. In addition to these two libraries, we'll leverage `spaCy` for some additional processing of the scraped text.

In [1]:
pip install wikipedia

Collecting wikipedia
  Using cached wikipedia-1.4.0-py3-none-any.whl
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0
Note: you may need to restart the kernel to use updated packages.


In [None]:
# Import package
import wikipedia

# Specify the title of the Wikipedia page
wiki = wikipedia.page("Tiger Woods")

# Extract the plain text content of the page
text = wiki.content
text

Now that we've used the `wikipedia` library to pull data into memory, let's use Python's regular expression library, `re`, to clean it with some string manipulation. The `re` library provides tools for searching, matching, and replacing text based on patterns built from metacharacters and special sequences. You can use it to extract specific information from text, validate input, or replace text that meets certain criteria, which makes it a valuable tool for text processing and data extraction. Read more about the library [here](https://docs.python.org/3/library/re.html).


In [None]:
# Import package
import re
# Clean text
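# Remove section headings such as "== Early life =="
text = re.sub(r'==.*?==+', '', text)
# Replace newlines with spaces so words on adjacent lines do not run together
text = text.replace('\n', ' ')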
text
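
To see what the heading pattern does without fetching a page, it can be run on a small hand-written sample. The snippet below is a sketch: the `sample` string is hypothetical and only assumes the `== Heading ==` layout that the heading regex in the cell above targets.

```python
import re

# Hypothetical text using "== Heading ==" style section markers.
sample = (
    "Intro sentence.\n\n\n"
    "== Early life ==\n"
    "Born in 1975.\n\n\n"
    "=== Amateur career ===\n"
    "Won several titles."
)

# The lazy `.*?` stops at the first closing run of "=" signs on the same line,
# so both two-sign and three-sign headings are removed.
without_headings = re.sub(r'==.*?==+', '', sample)

# str.split() with no arguments splits on any run of whitespace,
# so joining with a single space collapses newlines and repeated spaces.
cleaned = " ".join(without_headings.split())
print(cleaned)
```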

As an alternative to the `wikipedia` library, the `BeautifulSoup` library can be used to scrape the page's HTML directly. `BeautifulSoup` is a Python library for parsing HTML and XML documents. It provides an easy-to-use interface for navigating and manipulating the parsed data, which makes it convenient for web scraping tasks. With Beautiful Soup, developers can extract specific elements, search for tags or attributes, and navigate the document tree, simplifying the process of extracting the desired information from web pages.

In [None]:
# Import packages
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Specify url of the web page
source = urlopen('https://en.wikipedia.org/wiki/Tiger_Woods').read()

# Make a soup 
soup = BeautifulSoup(source,'lxml')
soup

In [None]:
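# List the names of all tags that directly contain text
# (`string=True` replaces the deprecated `text=True` argument)
print({node.parent.name for node in soup.find_all(string=True)})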

Next, we can extract the plain text content from the paragraphs:

In [None]:
# Extract the plain text content from paragraphs
text = ''
for paragraph in soup.find_all('p'):
    text += paragraph.text
    
text

In [None]:
# Clean text
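# Remove bracketed citation markers such as "[1]"
text = re.sub(r'\[.*?\]+', '', text)
# Replace newlines with spaces so paragraphs do not run together
text = text.replace('\n', ' ')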
text
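
The bracket pattern in the cell above removes every bracketed span, not only numeric citations. The sketch below contrasts it with a narrower pattern on a hypothetical sentence; `narrow` is an illustrative alternative, not part of the original workflow, and which behavior is preferable depends on the text being cleaned.

```python
import re

# Hypothetical sentence with numeric citations and an editorial "[sic]".
sample = "Scraped sentence one.[1][2] An editor's note [sic] stays useful.[3]"

# Broad pattern from the cell above: removes anything inside square brackets.
broad = re.sub(r'\[.*?\]+', '', sample)

# Narrower alternative: removes only purely numeric markers such as "[12]".
narrow = re.sub(r'\[\d+\]', '', sample)

print(broad)
print(narrow)
```

The broad version also leaves a double space where `[sic]` used to be, which is one reason to collapse whitespace after removing markup.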

## Using `spaCy` for Cleaning

`spaCy` is a powerful Python library for natural language processing (NLP) tasks. It provides efficient tools and models for tokenization, part-of-speech tagging, dependency parsing, named entity recognition, and more. spaCy's focus on speed and efficiency makes it suitable for processing large amounts of text data and building scalable NLP pipelines.

To use spaCy's pretrained models, you'll need to download them first. Run the following commands in your terminal to download the two models used here:
- `python -m spacy download en_core_web_sm`
- `python -m spacy download en_core_web_lg`

In [None]:
import numpy as np
import pandas as pd
import spacy

# Load the small English language model
nlp = spacy.load("en_core_web_sm")

For this example, we'll use some sample Wikipedia data about various people.

In [None]:
df = pd.read_csv("../data/supplementary_content/people_wiki.csv")
df.shape

In [None]:
df.head()

In [None]:
df["text"].head()

In [None]:
# Select a single record and preview its first 1,000 characters
txt = df["text"][1111]
txt[:1000]

Using this sample text, we'll do the following:
- Process the text with the `nlp()` pipeline to create a `doc` object, which represents the processed document.
- Initialize a list called `olist` and loop over each token in `doc`. For each token, collect its text (`token.text`), starting character index (`token.idx`), lemma (the base form, `token.lemma_`), punctuation status (`token.is_punct`), whitespace status (`token.is_space`), shape (`token.shape_`), coarse-grained part of speech (`token.pos_`), and fine-grained POS tag (`token.tag_`), and append them to `olist` as one row.
- Convert `olist` into a pandas DataFrame and set its column names.

In [None]:
doc = nlp(txt)    
olist = []
for token in doc:
    # One row of attributes per token
    row = [token.text,
           token.idx,
           token.lemma_,
           token.is_punct,
           token.is_space,
           token.shape_,
           token.pos_,
           token.tag_]
    olist.append(row)
    
odf = pd.DataFrame(olist)
odf.columns= ["Text", "StartIndex", "Lemma", "IsPunctuation", "IsSpace", "WordShape", "PartOfSpeech", "POSTag"]
odf.head()

The next cell iterates over each named entity (`ent`) in `doc.ents`, which holds the named entities recognized by `spaCy`.

For each named entity, the code extracts the text of the entity (`ent.text`) and its label (`ent.label_`), which represents the type or category of the entity. Each text and label pair is appended to a list, which is then converted to a DataFrame.

In [None]:
doc = nlp(txt)
olist = []
for ent in doc.ents:
    olist.append([ent.text, ent.label_])
    
odf = pd.DataFrame(olist)
odf.columns = ["Text", "EntityType"]
odf.head()

The library also includes the `displacy` visualizer, which highlights named entities directly in the text:

In [None]:
from spacy import displacy
displacy.render(doc, style='ent', jupyter=True)

In [None]:
txt = df["text"][3003]
doc = nlp(txt)
colors = {'GPE': 'lightblue', 'NORP':'lightgreen'}
options = {'ents': ['GPE', 'NORP'], 'colors': colors}
displacy.render(doc, style='ent', jupyter=True, options=options)

Noun chunks are "base noun phrases": flat phrases that have a noun as their head. You can think of a noun chunk as a noun plus the words describing it. We'll look at how to do noun phrase chunking using `spaCy`. In addition to the chunk itself, `spaCy` also gives us the root word of each noun phrase.


In [None]:
doc = nlp(txt)
olist = []
for chunk in doc.noun_chunks:
    olist.append([chunk.text, chunk.label_, chunk.root.text])
odf = pd.DataFrame(olist)
odf.columns = ["NounPhrase", "Label", "RootWord"]
odf

In addition to noun chunks, `spaCy` includes a **dependency parser**. A dependency parser analyzes the grammatical structure of a sentence, establishing relationships between "head" words and the words that modify those heads.

In [None]:
doc = nlp(df["text"][1009])
olist = []
for token in doc:
    olist.append([token.text, token.dep_, token.head.text, token.head.pos_,
                  list(token.children)])
odf = pd.DataFrame(olist)
odf.columns = ["Text", "Dep", "Head text", "Head POS", "Children"]
odf

A **syntactic dependency** tree visualization is a graphical representation of the syntactic structure and relationships among words in a sentence or text. It represents how words are connected to each other based on their grammatical roles and dependencies.

In a dependency tree, each word in the sentence is represented as a node, and the relationships between words are represented as labeled arcs or edges. The arcs indicate the grammatical dependencies, such as subject-verb, verb-object, or modifier relationships.
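
The node-and-arc structure can be made concrete with plain Python data structures. The sketch below uses a hand-annotated, hypothetical parse of a four-word sentence rather than real parser output; the `heads` and `labels` lists are assumptions for illustration, following the convention that the root token points to itself.

```python
# Hand-annotated (hypothetical) parse of "The cat chased mice".
# heads[i] is the index of token i's head; labels[i] names the arc from that head.
tokens = ["The", "cat", "chased", "mice"]
heads = [1, 2, 2, 2]  # the root ("chased") points to itself
labels = ["det", "nsubj", "ROOT", "dobj"]

# Invert the head pointers to get each node's children.
children = {i: [] for i in range(len(tokens))}
root = None
for i, head in enumerate(heads):
    if head == i:
        root = i
    else:
        children[head].append(i)


def show(index, indent=0):
    """Return the subtree rooted at `index` as indented lines."""
    lines = [f"{'  ' * indent}{tokens[index]} ({labels[index]})"]
    for child in children[index]:
        lines.extend(show(child, indent + 1))
    return lines


tree_lines = show(root)
print("\n".join(tree_lines))
```

Each level of indentation corresponds to one arc from a head down to a dependent.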

The visualization can provide valuable insights into the grammatical structure of a sentence, highlighting the dependencies between words and how they contribute to the overall meaning of the sentence. It can be useful for understanding and analyzing sentence structure, parsing, part-of-speech tagging, and other linguistic tasks.


`spaCy` can display a syntactic dependency tree visualization of the document using the `'dep'` style with specified options. Other style options can be configured; read more about them [here](https://spacy.io/usage/visualizers).

In [None]:
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})


## Word Similarity 
Word similarity refers to the measurement of the degree of semantic or contextual similarity between two or more words. It involves quantifying the likeness or relatedness between words based on their meaning, context, or usage.

In [None]:
# Load spacy model
nlp = spacy.load("en_core_web_lg")

We'll calculate the `cosine similarity` between word vectors, defined as one minus the cosine (spatial) distance between them:

In [None]:
from scipy import spatial


def cosine_similarity(x, y):
    # Cosine similarity is 1 minus the cosine distance
    return 1 - spatial.distance.cosine(x, y)


# Vector for the query word
software_vector = nlp.vocab["software"].vector
computed_similarities = []
for word in nlp.vocab:
    # Ignore words without vectors
    if not word.has_vector:
        continue
    similarity = cosine_similarity(software_vector, word.vector)
    computed_similarities.append((word, similarity))

# Sort from most to least similar and print the ten closest words
computed_similarities = sorted(computed_similarities, key=lambda item: -item[1])
print([w[0].text for w in computed_similarities[:10]])
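
To make the measure itself concrete, the same quantity can be computed by hand on toy vectors. This is a minimal pure-Python sketch; the function name `toy_cosine_similarity` and the example vectors are made up for illustration and have nothing to do with the model's real word vectors.

```python
import math


def toy_cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


# Vectors pointing the same way score close to 1, regardless of length.
same_direction = toy_cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Perpendicular vectors score 0.
orthogonal = toy_cosine_similarity([1.0, 0.0], [0.0, 1.0])
# Vectors pointing in opposite directions score close to -1.
opposite = toy_cosine_similarity([1.0, 1.0], [-1.0, -1.0])

print(same_direction, orthogonal, opposite)
```

Because the measure depends only on direction, two word vectors can be highly similar even when their magnitudes differ.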

Now that we've calculated cosine similarity, let's see how words from the sample text relate to one another:

In [None]:
software = nlp.vocab["software"]
computer = nlp.vocab["computer"]
# "Silicon Valley" is two tokens, so process it as a Doc;
# a Doc's vector is the average of its token vectors
silicon_valley = nlp("Silicon Valley")
unix = nlp.vocab["unix"]
microsystems = nlp.vocab["microsystems"]

print("Word similarity score between software and computer : ", software.similarity(computer))
print("Word similarity score between software and Silicon Valley : ", software.similarity(silicon_valley))
print("Word similarity score between software and unix : ", software.similarity(unix))
print("Word similarity score between software and microsystems : ", software.similarity(microsystems))