## Experience Markers

From Williams and Buchan

This script computes the following metrics:

1) frequency of experience markers (I, me, we, my, experience)
2) temporal events
3) presence of verbs in base form
4) past tense verbs and gerund verbs
5) bigrams where first word is I
6) presence of pronouns

It then awards points to an article based on the presence of these markers.

In [None]:
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import (col, lag, lower, monotonically_increasing_id,
                                   posexplode, split)
import nltk

# Set up Spark (getOrCreate reuses a session if the notebook already has one)
spark = SparkSession.builder.appName("Blog Metrics").getOrCreate()

# Read in the blog article and split every line into words.
# line_id and pos record the reading order so adjacent words can be paired later
# (this assumes a single input file that Spark reads from top to bottom).
text_file = spark.read.text("path/to/blog_article.txt")
words = (
    text_file.withColumn("line_id", monotonically_increasing_id())
    .select("line_id", posexplode(split(col("value"), "\\W+")).alias("pos", "token"))
    .filter(col("token") != "")
    .withColumn("word", lower(col("token")))  # lowercase copy used for matching
)
words.cache()

# Calculate frequency of experience markers (I, me, we, my, experience)
experience_markers = ["i", "me", "we", "my", "experience"]
experience_count = words.filter(col("word").isin(experience_markers)).count()

# Identify temporal events
temporal_markers = ["yesterday", "today", "tomorrow", "last", "this", "next"]
temporal_count = words.filter(col("word").isin(temporal_markers)).count()

# Tag parts of speech on the driver; pos_tag expects a list of strings, not Rows.
# Newer NLTK releases may name this resource "averaged_perceptron_tagger_eng".
nltk.download("averaged_perceptron_tagger")
tokens = [row.token for row in words.orderBy("line_id", "pos").collect()]
pos = nltk.pos_tag(tokens)

# Find presence of verbs in base form (Penn Treebank tag VB only)
base_verb_count = len([w for w, t in pos if t == "VB"])

# Count past tense (VBD) and gerund (VBG) verbs from the tags;
# matching "%ed" / "%ing" suffixes would also count words like "red" or "king"
past_tense_count = len([w for w, t in pos if t == "VBD"])
gerund_count = len([w for w, t in pos if t == "VBG"])

# Identify bigrams where the first word is "I", pairing words in reading order
reading_order = Window.orderBy("line_id", "pos")
bigrams = words.select(lag("word", 1).over(reading_order).alias("prev"), "word")
i_bigram_count = bigrams.filter(col("prev") == "i").count()

# Count presence of pronouns (personal PRP and possessive PRP$)
pronoun_count = len([w for w, t in pos if t in ("PRP", "PRP$")])

# Calculate total score
score = (experience_count * 2 + temporal_count + base_verb_count +
         past_tense_count + gerund_count * 1.5 + i_bigram_count * 2 + pronoun_count * 0.5)

# Print score
print(f"Blog score: {score}")

First, we calculate the frequency of the experience markers (I, me, we, my, experience) by filtering the words for those specific markers and counting their occurrences.
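As a rough illustration, the same count can be written in plain Python and checked on a short string. This is only a sketch: it assumes the text is lowercased and split on non-word characters, and `count_markers` and the sample sentence are made-up names for this example rather than part of the script.

```python
import re
from collections import Counter

EXPERIENCE_MARKERS = {"i", "me", "we", "my", "experience"}

def count_markers(text, markers=EXPERIENCE_MARKERS):
    """Return a Counter of how often each marker word occurs in text."""
    tokens = re.findall(r"\w+", text.lower())  # lowercase so "I" matches "i"
    return Counter(t for t in tokens if t in markers)

sample = "I loved it. In my experience, we all did, and it changed me."
counts = count_markers(sample)
print(counts)                 # per-marker counts
print(sum(counts.values()))   # total number of marker occurrences
```

Lowercasing before the comparison matters here, because the marker list is lowercase while "I" is normally capitalized in running text.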

To identify temporal events, we look for words that indicate a specific time period (e.g., "yesterday", "last year") and count their occurrences.
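Matching single words cannot tell "last year" apart from "the last chapter". One possible refinement, sketched below in plain Python, counts single-word markers and two-word phrases separately. The marker lists and the `count_temporal` helper are hypothetical and would need tuning for a real corpus.

```python
import re

# Hypothetical marker lists; extend them to suit the corpus.
TEMPORAL_WORDS = {"yesterday", "today", "tomorrow"}
TEMPORAL_PHRASES = {("last", "year"), ("last", "week"),
                    ("next", "week"), ("this", "morning")}

def count_temporal(text):
    """Count single-word temporal markers plus two-word temporal phrases."""
    tokens = re.findall(r"\w+", text.lower())
    single = sum(1 for t in tokens if t in TEMPORAL_WORDS)
    # zip(tokens, tokens[1:]) yields each pair of adjacent words
    phrases = sum(1 for pair in zip(tokens, tokens[1:]) if pair in TEMPORAL_PHRASES)
    return single + phrases

sample = "Yesterday I finished the last chapter. Last year I could not have."
print(count_temporal(sample))
```

In this sample "last chapter" is ignored while "yesterday" and "last year" are counted.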

To find verbs in base form, we use part-of-speech tagging from the Natural Language Toolkit (NLTK), a library that provides a variety of natural language processing tools, including a part-of-speech tagger. The library itself is installed like any other Python package; the tagger model must be downloaded once before first use.
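The tagger returns `(word, tag)` pairs using Penn Treebank tags, where `VB` marks a base-form verb and `VBD`, `VBG`, `VBN`, `VBP` and `VBZ` mark other verb forms. The sketch below counts tags on a hand-tagged sample, so it runs without downloading the model. The tags are written by hand for illustration and are not actual tagger output, and `count_tags` is a made-up helper.

```python
from collections import Counter

# Hand-written (word, tag) pairs in the shape nltk.pos_tag returns.
tagged = [("I", "PRP"), ("wanted", "VBD"), ("to", "TO"),
          ("try", "VB"), ("baking", "VBG"), ("bread", "NN")]

def count_tags(tagged_words, wanted):
    """Count tokens whose tag is exactly one of the wanted tags."""
    tag_counts = Counter(tag for _, tag in tagged_words)
    return sum(tag_counts[tag] for tag in wanted)

base_verbs = count_tags(tagged, {"VB"})
# A prefix test matches every verb form, not only the base form.
any_verbs = sum(1 for _, tag in tagged if tag.startswith("VB"))
print(base_verbs, any_verbs)
```

An exact comparison with `VB` isolates base-form verbs, whereas a prefix test on `VB` counts all verb forms together.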

For past tense and gerund verbs, we count the words recognized as past tense forms and the words recognized as gerund (-ing) forms.
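One simple way to approximate these counts is to match word endings, which is cheap but imprecise. The plain-Python sketch below uses an invented token list to show where suffix matching goes wrong; `suffix_matches` is a hypothetical helper.

```python
def suffix_matches(tokens, suffix):
    """Return the tokens that end with the given suffix."""
    return [t for t in tokens if t.endswith(suffix)]

tokens = ["we", "walked", "past", "the", "red", "building",
          "and", "went", "singing"]

print(suffix_matches(tokens, "ed"))   # picks up the adjective "red"
print(suffix_matches(tokens, "ing"))  # picks up the noun "building"
```

The "ed" match includes "red" and misses the irregular past tense "went", and the "ing" match includes the noun "building". Part-of-speech tags (`VBD` for past tense, `VBG` for gerunds and present participles) avoid both problems at the cost of running the tagger.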


To identify bigrams where the first word is "I", we use a sliding window to create pairs of adjacent words and count the pairs whose first word is "I".
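The sliding window can be sketched in plain Python by zipping the token list with a copy of itself shifted by one position. The `i_bigrams` helper and the sample sentence are illustrative only.

```python
import re

def i_bigrams(text):
    """Return the adjacent word pairs whose first word is "i"."""
    tokens = re.findall(r"\w+", text.lower())
    # Pair each token with its successor, keeping the original word order
    return [(a, b) for a, b in zip(tokens, tokens[1:]) if a == "i"]

sample = "I think I saw it, and then I left."
pairs = i_bigrams(sample)
print(pairs)
print(len(pairs))
```

The pairing is only meaningful if the words stay in reading order. Sorting them alphabetically before pairing would place unrelated words next to each other.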

Finally, to count pronouns, we count the words that the NLTK part-of-speech tagger tags as pronouns.

Now that we have all the necessary counts, we can calculate the total score for the article as a weighted sum of the counts.

The weights used in the score calculation are arbitrary and can be adjusted based on the desired importance of each metric.
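To make the weights easy to adjust, the score can be expressed as a weighted sum over a dictionary. The sketch below reuses the weights from the script; the `blog_score` function, the dictionary keys and the sample counts are invented for illustration.

```python
# Weights taken from the score formula in the script above.
DEFAULT_WEIGHTS = {
    "experience": 2.0, "temporal": 1.0, "base_verb": 1.0, "past_tense": 1.0,
    "gerund": 1.5, "i_bigram": 2.0, "pronoun": 0.5,
}

def blog_score(counts, weights=DEFAULT_WEIGHTS):
    """Weighted sum of the metric counts; missing metrics count as zero."""
    return sum(weights[name] * counts.get(name, 0) for name in weights)

# Made-up counts for a single article.
counts = {"experience": 5, "temporal": 2, "base_verb": 3, "past_tense": 4,
          "gerund": 2, "i_bigram": 3, "pronoun": 6}

print(blog_score(counts))
# Re-score with pronouns ignored by overriding one weight.
print(blog_score(counts, {**DEFAULT_WEIGHTS, "pronoun": 0.0}))
```

Keeping the weights in one place means a different weighting can be tried without touching the code that produces the counts.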