# Web Pages Dataset Exploration

We're going to take a look at a few examples of how we can explore the Web Pages dataset.

In [None]:
dataset = "ARCHDATASETURL"

## pandas

Next, we'll set up our environment so we can load our Web Pages Information dataset into [pandas](https://pandas.pydata.org) DataFrames. If you're unfamiliar with DataFrames, but you've worked with spreadsheets before, you should feel comfortable pretty quickly.

# Environment

Next, we'll set up our environment so we can load our derivatives into [pandas](https://pandas.pydata.org).

In [None]:
import pandas as pd

# Data Table Display

Colab includes an extension that renders pandas DataFrames into interactive displays that can be filtered, sorted, and explored dynamically. This can be very useful for taking a look at what each DataFrame provides!

Data table display for pandas DataFrames can be enabled by running:
```python
%load_ext google.colab.data_table
```
and disabled by running:
```python
%unload_ext google.colab.data_table
```

In [None]:
%load_ext google.colab.data_table

## Loading our ARCH Dataset as a DataFrame

---


Next, we'll create a pandas DataFrame from our dataset, and show a preview of it using the Data Table Display.

In [None]:
web_pages = pd.read_csv(dataset, compression="gzip", skipinitialspace=True)
web_pages
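
To see what the `compression` and `skipinitialspace` arguments do without touching the real dataset, here is a minimal sketch that builds a tiny gzipped CSV in memory and reads it the same way. The two-row CSV and its `url` and `content` values are made up for illustration.

```python
import gzip
import io

import pandas as pd

# Hypothetical two-row CSV with a space after each comma.
raw_csv = (
    "url, content\n"
    "https://example.com/a, Hello world\n"
    "https://example.com/b, Goodbye\n"
)

# Compress it in memory so no file or network access is needed.
buffer = io.BytesIO(gzip.compress(raw_csv.encode("utf-8")))

# Same arguments as the cell above, applied to the in-memory buffer.
sample_pages = pd.read_csv(buffer, compression="gzip", skipinitialspace=True)

print(sample_pages.columns.tolist())
print(sample_pages.loc[0, "content"])
```

Without `skipinitialspace=True`, the second column name and its values would keep their leading space.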

## Text Analysis

Next, we'll do some basic text analysis on our `web_pages` DataFrame with `nltk` and `spaCy`, and end with a word cloud.


In [None]:
import re

import nltk

In [None]:
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

In [None]:
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize

We'll drop the `NaN` values in our DataFrame to clean things up a bit.

In [None]:
web_pages = web_pages.dropna()
web_pages
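
Note that `dropna()` with no arguments removes a row if *any* of its columns is missing. If we only cared about missing text, we could restrict the check with `subset`. The sketch below uses a made-up three-row DataFrame; the `language` column is hypothetical and only there to show the difference.

```python
import pandas as pd

# Hypothetical data: one row is missing its content, another its language.
toy_pages = pd.DataFrame(
    {
        "url": ["https://example.com/a", "https://example.com/b", "https://example.com/c"],
        "content": ["Some text", None, "More text"],
        "language": ["en", "en", None],
    }
)

# Drops every row that has a missing value in any column.
all_clean = toy_pages.dropna()

# Drops only the rows where "content" is missing.
content_clean = toy_pages.dropna(subset=["content"])

print(len(all_clean), len(content_clean))
```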

We need to set [`mode.chained_assignment`](https://pandas.pydata.org/docs/user_guide/options.html?highlight=chained_assignment) to `None` now to silence the `SettingWithCopyWarning` messages that would otherwise come up when we add new columns.

In [None]:
pd.options.mode.chained_assignment = None
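
An alternative to changing a global option, sketched here with a made-up DataFrame, is to take an explicit `.copy()` of the cleaned data before adding columns. The `toy_pages` data and the `content_length` column are illustrative only.

```python
import pandas as pd

# Hypothetical data with one missing value.
toy_pages = pd.DataFrame(
    {"url": ["a", "b", "c"], "content": ["one two", None, "three"]}
)

# An explicit copy makes it unambiguous that the new DataFrame is independent.
clean_pages = toy_pages.dropna().copy()

# Adding a column now only affects the copy, never the original.
clean_pages["content_length"] = clean_pages["content"].str.len()

print(clean_pages)
```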

Next, we'll set up a tokenizer which will split on words, and create a new column which is the tokenized text.

In [None]:
tokenizer = nltk.RegexpTokenizer(r"\w+")

In [None]:
web_pages["content_tokenized"] = web_pages["content"].map(tokenizer.tokenize)
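
To get a feel for what the `\w+` pattern keeps and drops, here is a small sketch using Python's built-in `re.findall` as a stand-in for the `nltk` tokenizer. The sample sentence is made up, and treating `re.findall` as a close approximation of `RegexpTokenizer` is an assumption made for illustration.

```python
import re

# Hypothetical sample text with punctuation, an apostrophe, and a number.
text = "Don't panic, it's 2024!"

# \w+ matches runs of letters, digits, and underscores; everything else
# (spaces, punctuation, apostrophes) acts as a separator and is discarded.
tokens = re.findall(r"\w+", text)

print(tokens)
```

Notice that contractions are split at the apostrophe, so "Don't" becomes two tokens. That is worth keeping in mind when reading the word counts below.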

Now we'll create a column with the number of tokens in each page.

In [None]:
web_pages["content_tokens"] = web_pages["content_tokenized"].apply(len)

### Basic word count statistics with pandas!

Now we can use the power of pandas [statistical functions](https://pandas.pydata.org/docs/user_guide/computation.html) to show us some basic statistics about the tokens.

**Mean**

In [None]:
web_pages["content_tokens"].mean()

**Standard deviation**


In [None]:
web_pages["content_tokens"].std()

**Max**

In [None]:
web_pages["content_tokens"].max()

**Min**

In [None]:
web_pages["content_tokens"].min()
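
If we'd rather see these statistics in one place, pandas can compute them together with `describe()`. A minimal sketch on a made-up Series of token counts:

```python
import pandas as pd

# Hypothetical token counts for four pages.
token_counts = pd.Series([120, 45, 300, 15], name="content_tokens")

# describe() returns count, mean, std, min, quartiles, and max in one Series.
summary = token_counts.describe()

print(summary[["mean", "std", "min", "max"]])
```

On the real data, the equivalent call would be `web_pages["content_tokens"].describe()`.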

### Pages with most words

Let's create a bar chart that shows the pages with the most words. Here we can see the power of pandas at work, in terms of both analysis and visualization.

First, let's show the query to get the data for our chart.

In [None]:
word_count = (
    web_pages[["url", "content_tokens"]]
    .sort_values(by="content_tokens", ascending=False)
    .head(25)
)

In [None]:
word_count
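
As an aside, sorting and then taking the head can also be written with `nlargest`, which expresses the "top N" intent directly. The sketch below compares both on a made-up DataFrame with no tied values.

```python
import pandas as pd

# Hypothetical pages and token counts.
toy_pages = pd.DataFrame(
    {"url": ["a", "b", "c", "d"], "content_tokens": [10, 500, 42, 77]}
)

# The approach used above: sort descending, then keep the first rows.
top_sorted = (
    toy_pages[["url", "content_tokens"]]
    .sort_values(by="content_tokens", ascending=False)
    .head(2)
)

# The same result expressed with nlargest.
top_nlargest = toy_pages.nlargest(2, "content_tokens")[["url", "content_tokens"]]

print(top_nlargest)
```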

Next, let's create a bar chart of this.

In [None]:
import altair as alt

word_count_bar = (
    alt.Chart(word_count)
    .mark_bar()
    .encode(x=alt.X("url:O", sort="-y"), y=alt.Y("content_tokens:Q"))
)

word_count_rule = (
    alt.Chart(word_count).mark_rule(color="red").encode(y="mean(content_tokens):Q")
)

word_count_text = word_count_bar.mark_text(align="center", baseline="bottom").encode(
    text="content_tokens:Q"
)

(word_count_bar + word_count_rule + word_count_text).properties(
    width=1400, height=700, title="Pages with the most words"
)

### How about NER on the page with the most tokens?

[Named-Entity Recognition](https://en.wikipedia.org/wiki/Named-entity_recognition), or NER, is an exciting field of natural language processing that lets us extract "entities" out of text, such as the names of people, locations, or organizations.

To do this, we first need to find the page that has the most tokens.

In [None]:
word_count_max = (
    web_pages[["url", "content_tokens", "content"]]
    .sort_values(by="content_tokens", ascending=False)
    .head(1)
)
word_count_max["url"]

We'll remove the column width limit so we can check out our content for the page.

In [None]:
pd.set_option("display.max_colwidth", None)

Let's take a look at our page's content.

In [None]:
# Take the text of the single row directly, so the index label isn't included.
page = str(word_count_max["content"].iloc[0])
page


#### Setup spaCy

We now need to set up [spaCy](https://en.wikipedia.org/wiki/SpaCy), a natural-language processing toolkit.


In [None]:
import en_core_web_sm
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")

nlp.max_length = 1100000

Next we'll run the natural language processor from spaCy, and then display the NER output. Watch how it finds organizations, people, and beyond!

In [None]:
ner = nlp(page)
displacy.render(ner, style="ent", jupyter=True)

### Sentiment Analysis

We'll be using the [vaderSentiment](https://github.com/cjhutto/vaderSentiment) library and [adapting examples](https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/04-Sentiment-Analysis.html#) from Melanie Walsh's ["Introduction to Cultural Analytics & Python"](https://melaniewalsh.github.io/Intro-Cultural-Analytics).

In [None]:
%%capture

!pip install vaderSentiment

In [None]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Initialize VADER
sentimentAnalyser = SentimentIntensityAnalyzer()

We'll create a function that we'll then apply to a DataFrame to create sentiment analysis scores for the `content` column.

In [None]:
def calculate_sentiment(text):
    # Run VADER on the text
    scores = sentimentAnalyser.polarity_scores(text)
    # Extract the compound score
    compound_score = scores["compound"]
    # Return compound score
    return compound_score

Since it will take some time to run the sentiment analysis on the entire `web_pages` DataFrame, we'll create a sample from `web_pages`, and run the sentiment analysis on that for demonstration purposes.

In [None]:
web_pages_sample = web_pages.sample(500)
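
Because `sample()` picks rows at random, each run of the notebook will score a different set of pages. If repeatable results matter, one option is to pass a fixed `random_state`. A minimal sketch on made-up data; the seed value `42` is arbitrary.

```python
import pandas as pd

# Hypothetical DataFrame with 100 pages.
toy_pages = pd.DataFrame({"url": [f"https://example.com/{i}" for i in range(100)]})

# The same seed yields the same rows every time.
sample_a = toy_pages.sample(10, random_state=42)
sample_b = toy_pages.sample(10, random_state=42)

print(sample_a.equals(sample_b))
```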

In [None]:
web_pages_sample["sentiment_score"] = web_pages_sample["content"].apply(
    calculate_sentiment
)

Let's see what the scores look like.

In [None]:
web_pages_sample[["sentiment_score", "content"]]
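
The compound score ranges from -1 (most negative) to +1 (most positive). The vaderSentiment documentation suggests treating scores of 0.05 and above as positive, -0.05 and below as negative, and anything in between as neutral. Here is a small sketch of that rule; the `label_sentiment` helper is our own illustration, not part of the library.

```python
def label_sentiment(compound_score):
    """Map a VADER compound score to a coarse label (hypothetical helper)."""
    if compound_score >= 0.05:
        return "positive"
    if compound_score <= -0.05:
        return "negative"
    return "neutral"


# Made-up scores to show each branch.
labels = [label_sentiment(score) for score in [0.8, 0.0, -0.3]]
print(labels)
```

On the sample, this could be applied with `web_pages_sample["sentiment_score"].apply(label_sentiment)`.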

Finally, let's plot the sentiment score.

In [None]:
# Count how often each score occurs, keeping the ten most frequent scores.
sentiment_scores = (
    web_pages_sample["sentiment_score"]
    .value_counts()
    .head(10)
    .rename_axis("Sentiment Score")
    .reset_index(name="Count")
)

sentiment_chart = (
    alt.Chart(sentiment_scores)
    .mark_circle()
    .encode(
        x=alt.X("Sentiment Score:Q", bin=True),
        y=alt.Y("Count:Q", bin=True),
        size="Count",
    )
)

sentiment_chart.properties(width=1400, height=700, title="Sentiment Score Distribution")

### Wordcloud

What better way to wrap up this notebook than by creating a word cloud!

Word clouds are always fun, right?! They're an interesting way to visualize word frequency, as the more times that a word occurs, the larger it will appear in the word cloud.

Let's set up some dependencies here. We will install the [word_cloud](https://github.com/amueller/word_cloud) library and set up some stop words via `nltk`.

In [None]:
%%capture

!pip install wordcloud
from wordcloud import ImageColorGenerator, WordCloud

Let's remove the stopwords from our data.

In [None]:
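# Use a new name so we don't overwrite the imported nltk `stopwords` corpus.
stop_words = set(stopwords.words("english"))

In [None]:
# Lowercase each token before checking it, since the stop word list is lowercase.
web_pages["stopwords"] = web_pages["content_tokenized"].apply(
    lambda x: [item.lower() for item in x if item.lower() not in stop_words]
)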
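
Because NLTK's English stop word list is all lowercase, whether a token is lowercased before or after the membership check changes the result. A small sketch with a made-up token list and a hypothetical three-word stop list:

```python
# Hypothetical stop list and tokens, for illustration only.
stop_words = {"the", "and", "of"}
tokens = ["The", "Archive", "of", "THE", "Web"]

# Checking the original casing lets "The" and "THE" slip through.
case_sensitive = [t.lower() for t in tokens if t not in stop_words]

# Lowercasing before the check removes them as intended.
case_insensitive = [t.lower() for t in tokens if t.lower() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```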

Next we'll pull 500 rows of values from our new column.

In [None]:
words = web_pages["stopwords"].head(500)

Now we can create a word cloud!

In [None]:
from matplotlib import pyplot as plt

wordcloud = WordCloud(
    width=2000,
    height=1500,
    scale=10,
    max_font_size=250,
    max_words=100,
    background_color="white",
).generate(" ".join(token for tokens in words for token in tokens))
plt.figure(figsize=[35, 10])
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
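
If we want the numbers behind the picture, the same token lists can be counted directly with `collections.Counter`. This is a minimal sketch on made-up token lists, not the notebook's data.

```python
from collections import Counter

# Hypothetical token lists, shaped like the rows of the "stopwords" column.
token_lists = [["web", "archive", "web"], ["archive", "crawl", "web"]]

# Flatten the lists and count each word.
frequencies = Counter(token for tokens in token_lists for token in tokens)

# The most frequent words are the ones that would appear largest in the cloud.
top_words = frequencies.most_common(2)
print(top_words)
```

A dictionary of counts like this can also be passed to `WordCloud.generate_from_frequencies` if we'd prefer to control the frequencies ourselves.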