## References

This script uses a regular expression to find the hyperlinks in an article and assigns the article a score based on how many it finds.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_extract

# Set up Spark
spark = SparkSession.builder.appName("Blog Metrics").getOrCreate()

# Java regex that captures the href value of an anchor tag
HYPERLINK_PATTERN = r'<a\s+.*?href="(.*?)".*?>'

# Define function to count hyperlinks
def count_hyperlinks(article_path):
    # spark.read.text yields one row per line, in a column named "value"
    text_file = spark.read.text(article_path)
    hyperlinks = text_file.select(regexp_extract(col("value"), HYPERLINK_PATTERN, 1).alias("hyperlink"))
    # regexp_extract returns "" (not null) when nothing matches, and only the
    # first match per line, so this counts the lines that contain a hyperlink
    return hyperlinks.filter(col("hyperlink") != "").count()

# Define function to calculate score
def calculate_score(hyperlink_count):
    return hyperlink_count * 1.5

# Example usage
article_paths = ["path/to/article1.html", "path/to/article2.html", "path/to/article3.html"]
hyperlink_counts = [count_hyperlinks(path) for path in article_paths]
scores = [calculate_score(count) for count in hyperlink_counts]

for path, count, score in zip(article_paths, hyperlink_counts, scores):
    print(f"{path} - Hyperlink count: {count}, Score: {score}")

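The count_hyperlinks function takes an article path, reads the file one line per row, and applies a regular expression that captures the href value of the first anchor tag on each line. It then discards the lines that contain no hyperlink and returns the number of lines that remain. Two details matter here: regexp_extract returns an empty string rather than null when nothing matches, so the filter has to compare against the empty string, and a line with several links is counted only once.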
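If every link should be counted, including several on one line or an anchor tag that wraps across lines, one option is to match against the whole document text instead of line by line. The following is a minimal sketch in plain Python using the standard re module rather than Spark. The names count_hyperlinks_in_text, ANCHOR_HREF, and sample are hypothetical and not part of the original script, and the pattern assumes double-quoted href values.

```python
import re

# Hypothetical helper, not part of the original script: counts every anchor
# tag that carries a double-quoted href in a string of HTML.
ANCHOR_HREF = re.compile(r'<a\s+[^>]*?href="([^"]*)"', re.IGNORECASE)

def count_hyperlinks_in_text(html):
    """Return the number of <a ... href="..."> tags found in html."""
    return len(ANCHOR_HREF.findall(html))

# Made-up sample: two links on the first line, none on the second,
# and one anchor tag split across the last two lines.
sample = (
    '<p><a href="https://example.com/a">A</a> and '
    '<a class="ext" href="https://example.com/b">B</a></p>\n'
    '<p>No links on this line.</p>\n'
    '<a\n  href="https://example.com/c">C</a>'
)

print(count_hyperlinks_in_text(sample))  # prints 3
```

A count that allows only one match per line would report fewer links for this sample, because it sees the two links on the first line as one and cannot match the tag that is split across lines.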

The calculate_score function takes a hyperlink count and returns a score. Here the score is simply the count multiplied by 1.5.

Finally, we count the hyperlinks in each article, compute its score with the functions above, and print the path, hyperlink count, and score for each one. Replace "path/to/article1.html", "path/to/article2.html", and "path/to/article3.html" with the actual paths to your articles.