# Web Science CW Q2
### Isaac Tabb

#### STEP 0: Loading the json files

First, load in the background file.

In [None]:
# import json
import json

# create a list to hold all of the json objects
bg_json = []
# open the file
with open('bgQuality.json') as f:
    # iterate through each line in the file
    for line in f:
        # parse the line and append it to the list
        dct = json.loads(line)
        bg_json.append(dct)

Let's look at the first record in the background file and print the number of records in the background model.

In [None]:
print(bg_json[0])
print(len(bg_json))

{'_id': '1613006911946776576', 'text': ['confidence', 'dont', 'fix', 'lost', 'need', 'negative', 'playing', 'right', 'support', 'well'], 'score': 0.6367300158372926}
2239


Second, load in the high quality file.

In [None]:
# create a list to hold all of the json objects
hq_json = []
# open the file
with open('highQuality.json') as f:
    # iterate through each line in the file
    for line in f:
        # parse the line and append it to the list
        dct = json.loads(line)
        hq_json.append(dct)

Again, let's check the contents and length.

In [None]:
print(hq_json[0])
print(len(hq_json))

{'_id': {'$oid': '623cb1491fdfc2cfea3b85c4'}, 'username': 'miguel_zavalaa', 'description': '•Mind with no sense in it•   •Stay away from me ‘cause I’m dancing to quite a different drumbeat•', 'qualityS': 0.6081996434937612, 'text': 'Turisteando (@ Piccadilly Circus in London, Greater London) https://t.co/ear25pmZNc https://t.co/528VOdxIeF'}
2501


Finally, load in the low quality file.

In [None]:
# create a list to hold all of the json objects
lq_json = []
# open the file
with open('lowQuality.json') as f:
    # iterate through each line in the file
    for line in f:
        # parse the line and append it to the list
        dct = json.loads(line)
        lq_json.append(dct)

And the contents and length.

In [None]:
print(lq_json[0])
print(len(lq_json))

{'_id': {'$oid': '623dbef7462d56a90e36b221'}, 'username': 'tobythereactive', 'description': "IG: The_Champagne_Shepherd Model ☀ Influencer ☀ Don't like people", 'qualityS': 0.4662280701754386, 'text': 'Just posted a photo @ London, United Kingdom https://t.co/YZWvowLCj2'}
1527
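
The three loading cells above repeat the same pattern, so it could be wrapped in one helper. The following is a minimal sketch rather than part of the original notebook: the name `load_json_lines` is hypothetical, UTF-8 encoding is an assumption, and the demonstration runs on a temporary file instead of the coursework data.

```python
import json
import os
import tempfile

def load_json_lines(path):
    """Return a list with one parsed object per non-empty line of the file."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines instead of failing on them
                records.append(json.loads(line))
    return records

# small self-contained demonstration on a temporary file
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False, encoding="utf-8") as tmp:
    tmp.write('{"_id": "1", "score": 0.6}\n{"_id": "2", "score": 0.4}\n')
    demo_path = tmp.name

demo_records = load_json_lines(demo_path)
os.remove(demo_path)
print(len(demo_records))
```

With such a helper, each of the three files could be loaded in a single line, for example `bg_json = load_json_lines('bgQuality.json')`.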


## Split the Datasets into Train/Validation
We will use the validation set for data analysis.

In [None]:
import random

# set a random seed so split is always the same
random.seed('websci')

# split into 80 / 20
random.shuffle(hq_json)
hq_json_valid = hq_json[:500]
hq_json = hq_json[500:]

# split into 80 / 20
random.shuffle(lq_json)
lq_json_valid = lq_json[:305]
lq_json = lq_json[305:]
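
The hard-coded sizes 500 and 305 are roughly 20% of 2501 and 1527. As a sketch, the split could be derived from a fraction instead. The helper `split_train_valid` is hypothetical, and because it uses its own random generator it will not necessarily reproduce the exact split made above.

```python
import random

def split_train_valid(records, valid_fraction=0.2, seed="websci"):
    """Shuffle a copy of records and return (train, valid)."""
    rng = random.Random(seed)      # private generator, leaves the global one untouched
    shuffled = list(records)       # copy so the caller's list keeps its order
    rng.shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]

# stand-in data with the same size as the HQ file
demo_train, demo_valid = split_train_valid(range(2501))
print(len(demo_train), len(demo_valid))
```

For a list of 2501 items this gives the same sizes as the cell above (2001 for training, 500 for validation).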


## Counting TFs in the Background Model

We will weight all of the term frequencies in the BG model by their quality score.

In [None]:
# Counter is used below to count the terms in each tweet
from collections import Counter

# will hold dictionary of term frequencies in background collection
bg_json_tf = {}
# will hold total term frequency
bg_total_terms = 0
# iterate through background tweets
for tweet in bg_json:
  # count term frequencies in tweet
  counts = Counter(tweet['text'])
  # iterate thru term frequencies
  for key, item in counts.items():
    # if key is already in term freq dictionary, increment
    if key in bg_json_tf:
      # weight BG items by their quality score
      # NOTE: HQ/LQ terms are weighted by the qualityScore of the tweet they appear in (see HQ/LQ TF counting for expl.)
      bg_json_tf[key] += (item * tweet['score'])
      # increment total terms
      bg_total_terms += (item * tweet['score'])
    # else, make new key
    else:
      # +0.5 is added for smoothing purposes
      bg_json_tf[key] = (item * tweet['score']) + 0.5
      bg_total_terms += (item * tweet['score']) + 0.5

## Counting TFs in the High Quality Model

Since we have both descriptions and tweet texts, we will use both!

Let's import spaCy and load its small English model.

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

And create a spacy tokenization function.

In [None]:
def spacy_tokenization(tweet):
    tokens = []
    twt = nlp(tweet)
    # applies spacy to get rid of spaces, punctuation, stopwords
    for c in twt:
        if (not c.is_stop) and (not c.is_punct) and (not c.is_space):
            # lemmatize and convert to lowercase
            tokens.append(c.lemma_.lower())

    # spacy does not recognize the '•' character which appears in some tweets
    # strip it from every token (the original loop built the cleaned token but never stored it)
    # a token left empty is removed later by the isalpha() filter
    tokens = [token.replace('•', '') for token in tokens]

    # return the token list
    return tokens

For each tweet in the dataset, we will tokenize both the tweet text and the description, and combine the two token lists.

In [None]:
# this will give us a progress bar
from tqdm import tqdm 

for tweet in tqdm(hq_json):
    # tokenize the tweet text
    tweet['tokens'] = spacy_tokenization(tweet['text'])
    # tokenize the description as well
    tweet['tokens'].extend(spacy_tokenization(tweet['description'])) 

    # drop links and any other token that is not purely alphabetic
    tweet['tokens'] = [t for t in tweet['tokens'] if t.isalpha()]

100%|██████████| 2001/2001 [00:46<00:00, 43.37it/s]
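
To make the cleaning step concrete, here is a minimal sketch that works on plain strings and does not need spaCy. It assumes the intent is to strip the '•' character and then discard every token that is not purely alphabetic. The function name `clean_tokens` and the sample tokens are illustrative only.

```python
def clean_tokens(tokens):
    """Strip bullet characters, then keep only purely alphabetic tokens."""
    cleaned = []
    for token in tokens:
        token = token.replace("•", "")
        # str.isalpha() is False for the empty string, for links and for
        # anything containing digits or punctuation; it is True for letters
        # of any script, so non-English words are kept
        if token.isalpha():
            cleaned.append(token)
    return cleaned

demo_cleaned = clean_tokens(["•mind", "https://t.co/ear25pmznc", "london", "2022", "طب", "•"])
print(demo_cleaned)  # ['mind', 'london', 'طب']
```

Note that `isalpha()` accepts letters from any alphabet, which is why Arabic terms can still appear in the term frequency dictionaries.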


Now let's count the frequencies in the high quality dataset. Terms will be weighted by the quality score of the tweet they appear in. For instance, if term t appears once in a HQ tweet that has a quality score of 0.6, the term frequency for that term will gain 0.6. If term t appears once in a HQ tweet that is of higher quality, say 0.9, the term frequency for that term will gain 0.9. This makes it so that terms that are in the highest quality tweets are given more weight.

In [None]:
from collections import Counter
# import nltk stopwords
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stops = set(stopwords.words('english')) # holds the english stopwords set

# will hold dictionary of term frequencies in high quality collection
hq_json_tf = {}
# will hold total term frequency
hq_total_terms = 0
# iterate through high quality tweets
for tweet in hq_json:
  # count term frequencies in tweet
  counts = Counter(tweet['tokens'])
  # iterate thru term frequencies
  for key, item in counts.items():
    # due to punctuation that spacy doesn't understand
    # some stopwords remain, this causes them not to be added to the dict
    if key in stops:
      pass
    # if key is already in term freq dictionary, increment
    elif key in hq_json_tf:
      # terms are weighted by the tweet's quality score
      hq_json_tf[key] += (item * tweet['qualityS'])
      # increment total terms
      hq_total_terms += (item * tweet['qualityS'])
    # else, make new key
    else:
      # +0.5 is for smoothing
      hq_json_tf[key] = (item * tweet['qualityS']) + 0.5
      hq_total_terms += (item * tweet['qualityS']) + 0.5


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
print(hq_total_terms)
print(hq_json_tf)

27255.660099283734
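
The weighting described above can be checked on a toy example. This is a sketch under stated assumptions: `weighted_term_frequencies` is a hypothetical helper that mirrors the counting loop (weight each count by the tweet's quality score, add 0.5 the first time a term is seen) but leaves out the stopword check, and the two tweets are made up.

```python
from collections import Counter

def weighted_term_frequencies(tweets, weight_key="qualityS", smoothing=0.5):
    """Return (term -> weighted frequency, total weighted frequency)."""
    tf = {}
    total = 0.0
    for tweet in tweets:
        weight = tweet[weight_key]
        for term, count in Counter(tweet["tokens"]).items():
            gain = count * weight
            if term not in tf:
                gain += smoothing  # smoothing is added once per distinct term
            tf[term] = tf.get(term, 0.0) + gain
            total += gain
    return tf, total

demo_tweets = [
    {"tokens": ["london", "bridge"], "qualityS": 0.6},
    {"tokens": ["london"], "qualityS": 0.9},
]
demo_tf, demo_total = weighted_term_frequencies(demo_tweets)
print(demo_tf, demo_total)
```

Here "london" gains 0.6 from the first tweet and 0.9 from the second, plus 0.5 for smoothing, for a total of about 2.0, while "bridge" ends at about 1.1.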


Let's also look at the average quality score for the HQ tweets. This will be useful in our relative importance scoring scheme.

In [None]:
sum = 0
for tweet in hq_json:
  sum += tweet['qualityS']

avg_hq_qs = round(sum/len(hq_json), 3)
print("Avg. HQ Qual. Score: " + str(avg_hq_qs))

Avg. HQ Qual. Score: 0.625


## Counting TFs for the Low Quality Model

Now, let's do the same for the low quality tweets.

Let's tokenize.

In [None]:
for tweet in tqdm(lq_json):
    # tokenize the tweet text
    tweet['tokens'] = spacy_tokenization(tweet['text'])
    # tokenize the description as well
    tweet['tokens'].extend(spacy_tokenization(tweet['description'])) 

    # drop links and any other token that is not purely alphabetic
    tweet['tokens'] = [t for t in tweet['tokens'] if t.isalpha()]

100%|██████████| 1222/1222 [00:33<00:00, 36.90it/s]


Let's look at the average quality score for the LQ tweets. This will be useful in our weighting scheme.


In [None]:
sum = 0
for tweet in lq_json:
  sum += tweet['qualityS']

avg_lq_qs = round(sum/len(lq_json), 3)
print("Avg. LQ Qual. Score: " + str(avg_lq_qs))

Avg. LQ Qual. Score: 0.421


As you can see, the average low quality tweet deviates less from the midpoint score (qs=0.5) than the average high quality tweet does. The LQ average of 0.421 is only about 0.08 below 0.5, while the HQ average of 0.625 is about 0.125 above it.

We are going to re-score the low quality tweets so that the average LQ score sits as far below 0.5 as the average HQ score sits above it. This gives a fairer scoring method. The goal is for the average low quality tweet to have a score of 0.375 (0.5 - 0.125, or equivalently 0.421 / 1.123), so we divide every LQ quality score by 1.123.

In [None]:
for tweet in lq_json:
  # rescore low quality tweets with scalar
  tweet['qualityS'] = tweet['qualityS'] / 1.123

Now let's look at the new LQ average quality score.

In [None]:
sum = 0
for tweet in lq_json:
  sum += tweet['qualityS']

avg_lq_qs = round(sum/len(lq_json), 3)
print("Avg. LQ Qual. Score: " + str(avg_lq_qs))

Avg. LQ Qual. Score: 0.375


Since we scaled all of the quality scores, the average is now 0.375, which mirrors the HQ average of 0.625 around 0.5 and creates a fairer comparison between HQ and LQ term frequencies.
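
The scalar 1.123 can be recovered from the two reported averages. The sketch below assumes the target LQ average is the mirror image of the HQ average around 0.5. The variable names are illustrative.

```python
avg_hq = 0.625    # average HQ quality score reported above
avg_lq = 0.421    # average LQ quality score before rescaling
midpoint = 0.5

# place the LQ average as far below the midpoint as the HQ average is above it
target_lq = midpoint - (avg_hq - midpoint)
# dividing every LQ score by this value moves the LQ average to the target
scale = avg_lq / target_lq

print(round(target_lq, 3))  # 0.375
print(round(scale, 3))      # 1.123
```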

And now let's count the frequencies for the LQ set.

In [None]:
# will hold dictionary of term frequencies in low quality collection
lq_json_tf = {}
# will hold total term frequency
lq_total_terms = 0
# iterate through low quality tweets
for tweet in lq_json:
  # count term frequencies in tweet
  counts = Counter(tweet['tokens'])
  # iterate thru term frequencies
  for key, item in counts.items():
    # due to punctuation that spacy doesn't understand
    # some stopwords remain, this causes them not to be added to the dict
    if key in stops:
      pass
    # if key is already in term freq dictionary, increment
    elif key in lq_json_tf:
      # terms are weighted by one minus the tweet's quality score (see HQ TF counting for explanation)
      # NOTE: It is one minus the QS because LQ weights are compared in relation
      #       to the BG weights. If LQ has qualityS of 0.40 and BG has score of 0.55, LQ
      #       should be weighted 0.60 so that it is weighted more than BG. If it was weighted as 0.40,
      #       LQ terms would be at a loss against BG terms, and rarely have high relative importances.
      lq_json_tf[key] += (item * (1-tweet['qualityS']))
      # increment total terms
      lq_total_terms += (item * (1-tweet['qualityS']))
    # else, make new key
    else:
      # +0.5 is for smoothing
      lq_json_tf[key] = (item * (1-tweet['qualityS'])) + 0.5
      lq_total_terms += (item * (1-tweet['qualityS'])) + 0.5

In [None]:
print(lq_total_terms)
print(lq_json_tf)

12321.878079369713
{'fool': 9.974895611699202, 'thousand': 6.883387197975451, 'time': 27.398154538596373, 'atheist': 8.719732005712181, 'human': 11.773093215134207, 'physician': 8.719732005712181, 'imam': 371.72571088705365, 'nasser': 404.35816882549904, 'mohammed': 439.49327210104786, 'al': 413.7552564356, 'yamani': 386.77773549660094, 'almahdicaliphofallah': 359.82600600615143, 'طب': 69.93610808946622, 'عام': 24.954565477840255, 'good': 30.68778953471627, 'morning': 1.127126114214271, 'lovely': 1.7576278228548934, 'people': 51.07452526120597, 'god': 3.604015871190194, 'countdown': 1.127126114214271, 'continue': 1.127126114214271, 'week': 2.9797817657287125, 'ready': 1.7182909319939517, 'tcncentrallondon': 1.127126114214271, 'tcnlondon': 1.127126114214271, 'thecovenantnation': 1.127126114214271, 'churchinlondon': 1.127126114214271, 'church': 1.127126114214271, 'eventsinlondon': 1.127126114214271, 'pojuoyemade': 1.127126114214271, 'pastorpoju': 1.127126114214271, 'post': 244.8232320357

## Compute Relative Importances & Define Scoring Method

First, let's compute the relative importances for the High Quality model. 

In [None]:
# store the relative importances in a dictionary
hq_relimpo = {}
# iterate through the term frequency dictionary
for key, item in hq_json_tf.items():
  try:
    # this will try calculating the relative importance by looking for the key
    # in the background model
    rel_impo = (item / hq_total_terms) / (bg_json_tf[key] / bg_total_terms)
  except KeyError:
    # the term does not exist in the background model, so use the add-0.5 smoothing value as its background frequency
    rel_impo = (item / hq_total_terms) / (0.5 / bg_total_terms)
  
  # only save terms with importances over 2
  if rel_impo > 2:
    hq_relimpo[key] = rel_impo

Now, let's compute the relative importances for the Low Quality model.

In [None]:
lq_relimpo = {}
for key, item in lq_json_tf.items():
  try:
    # this will try calculating the relative importance by looking for the key
    # in the background model
    rel_impo = (item / lq_total_terms) / (bg_json_tf[key] / bg_total_terms)
  except KeyError:
    # the term does not exist in the background model, so use the add-0.5 smoothing value as its background frequency
    rel_impo = (item / lq_total_terms) / (0.5 / bg_total_terms)
  
  # only save terms with relative importance over 2
  if rel_impo > 2:
    lq_relimpo[key] = rel_impo
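
The two cells above apply the same computation to different dictionaries. As a sketch, it could be expressed once as a function. The name `relative_importances` and the toy frequencies are assumptions for illustration, and `dict.get` replaces the `try`/`except` lookup.

```python
def relative_importances(model_tf, model_total, bg_tf, bg_total,
                         threshold=2.0, smoothing=0.5):
    """Return the terms whose relative importance exceeds the threshold."""
    importances = {}
    for term, freq in model_tf.items():
        # terms missing from the background model fall back to the smoothing value
        bg_freq = bg_tf.get(term, smoothing)
        rel_impo = (freq / model_total) / (bg_freq / bg_total)
        if rel_impo > threshold:
            importances[term] = rel_impo
    return importances

demo_model_tf = {"stadium": 4.5, "today": 1.5}
demo_bg_tf = {"today": 30.5, "weather": 69.5}
demo_relimpo = relative_importances(demo_model_tf, 6.0, demo_bg_tf, 100.0)
print(demo_relimpo)
```

In this toy case "stadium" is absent from the background model, so its relative importance is (4.5 / 6) / (0.5 / 100), roughly 150, while "today" scores below 2 and is dropped.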

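Next, we define the function that calculates the newsworthiness score. It sums the HQ and LQ relative importances of a tweet's tokens, adds 1 to each sum, and takes the base-2 logarithm of their ratio.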

In [None]:
import math
# define newsworthy scoring method
def newsworthy_score(hq, lq, tokens):
  # sum of hq term relative importances
  shq = 0
  # sum of lq term relative importances
  slq = 0
  # iterate through given tokens
  for token in tokens:
    # add the token's relative importance in the HQ model (0 if the token is absent)
    shq += hq.get(token, 0)
    # add the token's relative importance in the LQ model (0 if the token is absent)
    slq += lq.get(token, 0)
  
  # compute newsworthiness score and return
  news_score = math.log2((1+shq)/(1+slq))
  return news_score
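
A small worked example shows how the score behaves. The function below is a compact sketch of the same formula, and the two relative importance dictionaries are invented for illustration.

```python
import math

def newsworthy_score_sketch(hq, lq, tokens):
    """log2 of (1 + summed HQ importances) over (1 + summed LQ importances)."""
    shq = math.fsum(hq.get(t, 0.0) for t in tokens)
    slq = math.fsum(lq.get(t, 0.0) for t in tokens)
    return math.log2((1 + shq) / (1 + slq))

demo_hq = {"election": 5.0, "minister": 2.0}
demo_lq = {"giveaway": 3.0}

score_hq_like = newsworthy_score_sketch(demo_hq, demo_lq, ["election", "minister", "today"])
score_lq_like = newsworthy_score_sketch(demo_hq, demo_lq, ["giveaway", "today"])
score_unknown = newsworthy_score_sketch(demo_hq, demo_lq, ["today"])

print(score_hq_like)  # log2(8 / 1) = 3.0
print(score_lq_like)  # log2(1 / 4) = -2.0
print(score_unknown)  # log2(1 / 1) = 0.0
```

A tweet with no terms in either model scores exactly 0. Since the evaluation counts HQ tweets with `score > 0` and LQ tweets with `score < 0`, such a tweet is counted as misclassified in both validation sets.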

## Data Analysis

We created two validation sets (for HQ and LQ) to test on. Let's first tokenize them.

In [None]:
for tweet in tqdm(hq_json_valid):
    # tokenize the tweet text
    tweet['tokens'] = spacy_tokenization(tweet['text'])
    # tokenize the description as well
    tweet['tokens'].extend(spacy_tokenization(tweet['description'])) 

    # drop links and any other token that is not purely alphabetic
    tweet['tokens'] = [t for t in tweet['tokens'] if t.isalpha()]

for tweet in tqdm(lq_json_valid):
    # tokenize the tweet text
    tweet['tokens'] = spacy_tokenization(tweet['text'])
    # tokenize the description as well
    tweet['tokens'].extend(spacy_tokenization(tweet['description'])) 

    # drop links and any other token that is not purely alphabetic
    tweet['tokens'] = [t for t in tweet['tokens'] if t.isalpha()]

100%|██████████| 500/500 [00:11<00:00, 44.05it/s]
100%|██████████| 305/305 [00:06<00:00, 44.61it/s]


Now let's test on them.

In [None]:
import statistics as stat

# will hold count of newsworthy tweets
count = 0
# will hold sum of newsworthy scores
sum = 0
# will hold list of newsworthy scores
scores_list = []
# iterate through the validation set, calculating newsworthiness scores
for i in range(len(hq_json_valid)):
  # calculate score
  score = newsworthy_score(hq_relimpo, lq_relimpo, hq_json_valid[i]['tokens'])
  # append to list
  scores_list.append(score) 
  # increment score sum
  sum += score
  # note scores that are newsworthy
  if score > 0:
    count += 1

# calculate metrics (see below)
prop = round((count / len(hq_json_valid)), 3)
avg = round((sum / len(hq_json_valid)), 3)
sd = round(stat.stdev(scores_list), 3)

print("Proportion of HQ texts classified as Newsworthy: " + str(prop))
print("Average score of HQ text: " + str(avg))
print("Standard Dev. of HQ text scores: " + str(sd))

# do the same for the low quality validation set
count = 0
sum = 0
scores_list = []
for i in range(len(lq_json_valid)):
  score = newsworthy_score(hq_relimpo, lq_relimpo, lq_json_valid[i]['tokens'])
  scores_list.append(score)
  sum += score
  if score < 0:
    count += 1

prop = round((count / len(lq_json_valid)), 3)
avg = round((sum / len(lq_json_valid)), 3)
sd = round(stat.stdev(scores_list), 3)

print()
print("Proportion of LQ texts classified as Non-Newsworthy: " + str(prop))
print("Average score of LQ text: " + str(avg))
print("Standard Dev. of LQ text scores: " + str(sd))

Proportion of HQ texts classified as Newsworthy: 0.78
Average score of HQ text: 1.613
Standard Dev. of HQ text scores: 1.906

Proportion of LQ texts classified as Non-Newsworthy: 0.941
Average score of LQ text: -4.62
Standard Dev. of LQ text scores: 4.067


As we can see here, the newsworthiness scoring method is very good at recognising tweets that are not newsworthy (94.1% of LQ tweets receive a negative score) but only moderately good at recognising HQ tweets (78% receive a positive score).
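
The three metrics are computed the same way for both validation sets, so they could be gathered in one helper. This is a sketch only: `summarize_scores` is a hypothetical name and the scores are made up.

```python
import statistics as stat

def summarize_scores(scores, newsworthy_expected):
    """Return the proportion classified as expected, the mean and the sample stdev."""
    if newsworthy_expected:
        correct = [s for s in scores if s > 0]   # HQ tweets should score above 0
    else:
        correct = [s for s in scores if s < 0]   # LQ tweets should score below 0
    return {
        "proportion": round(len(correct) / len(scores), 3),
        "average": round(stat.mean(scores), 3),
        "stdev": round(stat.stdev(scores), 3),
    }

demo_scores = [2.0, 1.0, -1.0, 0.0]
demo_as_hq = summarize_scores(demo_scores, newsworthy_expected=True)
demo_as_lq = summarize_scores(demo_scores, newsworthy_expected=False)
print(demo_as_hq)
print(demo_as_lq)
```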

### Trying without Quality Score

Let's recalculate the TFs without a quality score weight.

In [None]:
# will hold dictionary of term frequencies in background collection
bg_json_tf = {}
# will hold total term frequency
bg_total_terms = 0
# iterate through background tweets
for tweet in bg_json:
  # count term frequencies in tweet
  counts = Counter(tweet['text'])
  # iterate thru term frequencies
  for key, item in counts.items():
    # if key is already in term freq dictionary, increment
    if key in bg_json_tf:
      # increment terms for current key
      bg_json_tf[key] += item
      # increment total terms
      bg_total_terms += item 
    # else, make new key
    else:
      # +0.5 is added for smoothing purposes
      bg_json_tf[key] = item + 0.5
      bg_total_terms += item + 0.5

# will hold dictionary of term frequencies in high quality collection
hq_json_tf = {}
# will hold total term frequency
hq_total_terms = 0
# iterate through high quality tweets
for tweet in hq_json:
  # count term frequencies in tweet
  counts = Counter(tweet['tokens'])
  # iterate thru term frequencies
  for key, item in counts.items():
    # due to punctuation that spacy doesn't understand
    # some stopwords remain, this causes them not to be added to the dict
    if key in stops:
      pass
    # if key is already in term freq dictionary, increment
    elif key in hq_json_tf:
      # increment terms for current key
      hq_json_tf[key] += item
      # increment total terms
      hq_total_terms += item
    # else, make new key
    else:
      # +0.5 is for smoothing
      hq_json_tf[key] = item + 0.5
      hq_total_terms += item + 0.5

# will hold dictionary of term frequencies in low quality collection
lq_json_tf = {}
# will hold total term frequency
lq_total_terms = 0
# iterate through low quality tweets
for tweet in lq_json:
  # count term frequencies in tweet
  counts = Counter(tweet['tokens'])
  # iterate thru term frequencies
  for key, item in counts.items():
    # due to punctuation that spacy doesn't understand
    # some stopwords remain, this causes them not to be added to the dict
    if key in stops:
      pass
    # if key is already in term freq dictionary, increment
    elif key in lq_json_tf:
      # increment terms for current key
      lq_json_tf[key] += item
      # increment total terms
      lq_total_terms += item
    # else, make new key
    else:
      # +0.5 is for smoothing
      lq_json_tf[key] = item + 0.5
      lq_total_terms += item + 0.5

And now let's compute relative importances again.

In [None]:
# store the relative importances in a dictionary
hq_relimpo = {}
# iterate through the term frequency dictionary
for key, item in hq_json_tf.items():
  try:
    # this will try calculating the relative importance by looking for the key
    # in the background model
    rel_impo = (item / hq_total_terms) / (bg_json_tf[key] / bg_total_terms)
  except KeyError:
    # the term does not exist in the background model, so use the add-0.5 smoothing value as its background frequency
    rel_impo = (item / hq_total_terms) / (0.5 / bg_total_terms)
  
  # only save terms with importances over 2
  if rel_impo > 2:
    hq_relimpo[key] = rel_impo

lq_relimpo = {}
for key, item in lq_json_tf.items():
  try:
    # this will try calculating the relative importance by looking for the key
    # in the background model
    rel_impo = (item / lq_total_terms) / (bg_json_tf[key] / bg_total_terms)
  except KeyError:
    # the term does not exist in the background model, so use the add-0.5 smoothing value as its background frequency
    rel_impo = (item / lq_total_terms) / (0.5 / bg_total_terms)
  
  # only save terms with relative importance over 2
  if rel_impo > 2:
    lq_relimpo[key] = rel_impo

And finally, let's evaluate once again. These results are to be compared with the quality-weighted baseline above.

In [None]:
# the following code calculates statistics the same way they were
# calculated earlier with the quality score, see earlier comments for explanation
count = 0
sum = 0
scores_list = []
for i in range(len(hq_json_valid)):
  score = newsworthy_score(hq_relimpo, lq_relimpo, hq_json_valid[i]['tokens'])
  scores_list.append(score)
  sum += score
  if score > 0:
    count += 1

prop = round((count / len(hq_json_valid)), 3)
avg = round((sum / len(hq_json_valid)), 3)
sd = round(stat.stdev(scores_list), 3)

print("Proportion of HQ texts classified as Newsworthy: " + str(prop))
print("Average score of HQ text: " + str(avg))
print("Standard Dev. of HQ text scores: " + str(sd))

count = 0
sum = 0
scores_list = []
for i in range(len(lq_json_valid)):
  score = newsworthy_score(hq_relimpo, lq_relimpo, lq_json_valid[i]['tokens'])
  scores_list.append(score)
  sum += score
  if score < 0:
    count += 1

prop = round((count / len(lq_json_valid)), 3)
avg = round((sum / len(lq_json_valid)), 3)
sd = round(stat.stdev(scores_list), 3)

print()
print("Proportion of LQ texts classified as Non-Newsworthy: " + str(prop))
print("Average score of LQ text: " + str(avg))
print("Standard Dev. of LQ text scores: " + str(sd))

Proportion of HQ texts classified as Newsworthy: 0.832
Average score of HQ text: 1.856
Standard Dev. of HQ text scores: 1.997

Proportion of LQ texts classified as Non-Newsworthy: 0.925
Average score of LQ text: -4.646
Standard Dev. of LQ text scores: 4.113


As you can see, taking out the quality score improves the model on the HQ set (0.78 to 0.832 classified as newsworthy), while the LQ result drops slightly (0.941 to 0.925). One possible reason is that the qualityS metric may simply not measure high quality very well.

If we look ahead to the section on top tokens, we see the most frequent tokens for the HQ set and the LQ set. The HQ tokens are common terms in any discussion about London, while the LQ tokens are much more specific. LQ tweets likely do better with the quality score included because they are simply more distinguishable.

Since the method does better without quality score included, we will leave quality score out of our final model.

#### Multiple Thresholds with No Quality Score

Let's try some varying thresholds.

First, let's define a list of thresholds. We are going to try values 1.5, 2.5, 3.5, and 4.5.

In [None]:
import numpy as np
ths = np.arange(1.5, 4.6, 1.0).tolist()
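
`np.arange` excludes its stop value, which is why the cell above passes 4.6 to get 4.5 as the last threshold. As a sketch, the same list can be built without relying on a floating-point endpoint. The names `start`, `step`, `n` and `ths_sketch` are illustrative.

```python
start, step, n = 1.5, 1.0, 4
# build each threshold from an integer index so the list length is explicit
ths_sketch = [start + i * step for i in range(n)]
print(ths_sketch)  # [1.5, 2.5, 3.5, 4.5]
```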

Now let's see how effective the model is using these values for thresholds.

In [None]:
for val in ths:
  # store the relative importances in a dictionary
  hq_relimpo = {}
  # iterate through the term frequency dictionary
  for key, item in hq_json_tf.items():
    try:
      # this will try calculating the relative importance by looking for the key
      # in the background model
      rel_impo = (item / hq_total_terms) / (bg_json_tf[key] / bg_total_terms)
    except KeyError:
      # the term does not exist in the background model, so use the add-0.5 smoothing value as its background frequency
      rel_impo = (item / hq_total_terms) / (0.5 / bg_total_terms)
    
    # only save terms with importances over threshold
    if rel_impo > val:
      hq_relimpo[key] = rel_impo

  lq_relimpo = {}
  for key, item in lq_json_tf.items():
    try:
      # this will try calculating the relative importance by looking for the key
      # in the background model
      rel_impo = (item / lq_total_terms) / (bg_json_tf[key] / bg_total_terms)
    except KeyError:
      # the term does not exist in the background model, so use the add-0.5 smoothing value as its background frequency
      rel_impo = (item / lq_total_terms) / (0.5 / bg_total_terms)
    
    # only save terms with relative importance over threshold
    if rel_impo > val:
      lq_relimpo[key] = rel_impo


  # the following code calculates statistics the same way they were
  # calculated earlier with the quality score, see earlier comments for explanation
  # reset the HQ counter for every threshold
  count_hq = 0
  sum_hq = 0
  scores_hq = []
  for i in range(len(hq_json_valid)):
    score = newsworthy_score(hq_relimpo, lq_relimpo, hq_json_valid[i]['tokens'])
    scores_hq.append(score)
    sum_hq += score
    if score > 0:
      count_hq += 1

  prop_hq = round((count_hq / len(hq_json_valid)), 3)
  avg_hq = round((sum_hq / len(hq_json_valid)), 3)
  sd_hq = round(stat.stdev(scores_hq), 3)

  print("Threshold: " + str(val))
  print("Number of HQ terms above threshold: " + str(len(hq_relimpo)))
  print("Proportion of HQ texts classified as Newsworthy: " + str(prop_hq))
  print("Average score of HQ text: " + str(avg_hq))
  print("Standard Dev. of HQ text scores: " + str(sd_hq))

  count_lq = 0
  sum_lq = 0
  scores_lq = []
  for i in range(len(lq_json_valid)):
    score = newsworthy_score(hq_relimpo, lq_relimpo, lq_json_valid[i]['tokens'])
    scores_lq.append(score)
    sum_lq += score
    if score < 0:
      count_lq += 1

  prop_lq = round((count_lq / len(lq_json_valid)), 3)
  avg_lq = round((sum_lq / len(lq_json_valid)), 3)
  sd_lq = round(stat.stdev(scores_lq), 3)

  print("Number of LQ terms above threshold: " + str(len(lq_relimpo)))
  print("Proportion of LQ texts classified as Non-Newsworthy: " + str(prop_lq))
  print("Average score of LQ text: " + str(avg_lq))
  print("Standard Dev. of LQ text scores: " + str(sd_lq))
  print()

Threshold: 1.5
Number of HQ terms above threshold: 6403
Proportion of HQ texts classified as Newsworthy: 0.83
Average score of HQ text: 1.807
Standard Dev. of HQ text scores: 1.942
Number of LQ terms above threshold: 1932
Proportion of LQ texts classified as Non-Newsworthy: 0.925
Average score of LQ text: -4.623
Standard Dev. of LQ text scores: 4.096

Threshold: 2.5
Number of HQ terms above threshold: 2306
Proportion of HQ texts classified as Newsworthy: 0.818
Average score of HQ text: 1.836
Standard Dev. of HQ text scores: 2.057
Number of LQ terms above threshold: 1630
Proportion of LQ texts classified as Non-Newsworthy: 0.928
Average score of LQ text: -4.715
Standard Dev. of LQ text scores: 4.154

Threshold: 3.5
Number of HQ terms above threshold: 2207
Proportion of HQ texts classified as Newsworthy: 0.816
Average score of HQ text: 1.899
Standard Dev. of HQ text scores: 2.14
Number of LQ terms above threshold: 1516
Proportion of LQ texts classified as Non-Newsworthy: 0.915
Average sc

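In the results shown, a lower threshold seems to perform slightly better: the proportion of HQ tweets classified as newsworthy is highest at a threshold of 1.5 (0.83) and falls as the threshold rises. But why is that? Let's compare the top terms in the training and validation sets.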

#### What are the top tokens in the training vs. validation sets?

First we need to calculate TFs for the validation sets.

In [None]:
# will hold dictionary of term frequencies in high quality collection
hq_json_tf_valid = {}
# iterate through high quality tweets
for tweet in hq_json_valid:
  # count term frequencies in tweet
  counts = Counter(tweet['tokens'])
  # iterate thru term frequencies
  for key, item in counts.items():
    # due to punctuation that spacy doesn't understand
    # some stopwords remain, this causes them not to be added to the dict
    if key in stops:
      pass
    # if key is already in term freq dictionary, increment
    elif key in hq_json_tf_valid:
      # increment terms for current key
      hq_json_tf_valid[key] += item 
    # else, make new key
    else:
      # +0.5 is for smoothing
      hq_json_tf_valid[key] = item + 0.5

# will hold dictionary of term frequencies in low quality collection
lq_json_tf_valid = {}
# iterate through low quality tweets
for tweet in lq_json_valid:
  # count term frequencies in tweet
  counts = Counter(tweet['tokens'])
  # iterate thru term frequencies
  for key, item in counts.items():
    # due to punctuation that spacy doesn't understand
    # some stopwords remain, this causes them not to be added to the dict
    if key in stops:
      pass
    # if key is already in term freq dictionary, increment
    elif key in lq_json_tf_valid:
      # increment terms for current key
      lq_json_tf_valid[key] += item
    # else, make new key
    else:
      # +0.5 is for smoothing
      lq_json_tf_valid[key] = item + 0.5

Now let's find the top 10 tokens in each of the sets.

In [None]:
# define a function which returns the second element of a list
def second_elem(e):
    return e[1]

maxes_train = []
# iterate through the training HQ json term frequency dictionary
for key, item in hq_json_tf.items():
  # keep track of the top 10 max frequencies
  if len(maxes_train) < 10:
    maxes_train.append([key, item])
    maxes_train.sort(key=second_elem, reverse=True)
  elif item > maxes_train[len(maxes_train)-1][1]:
    maxes_train[len(maxes_train)-1] = [key, item]
    maxes_train.sort(key=second_elem, reverse=True)

maxes_valid = []
# do the same for HQ validation term frequencies
for key, item in hq_json_tf_valid.items():
  if len(maxes_valid) < 10:
    maxes_valid.append([key, item])
    maxes_valid.sort(key=second_elem, reverse=True)
  elif item > maxes_valid[len(maxes_valid)-1][1]:
    maxes_valid[len(maxes_valid)-1] = [key, item]
    maxes_valid.sort(key=second_elem, reverse=True)

print("High Quality Top Tokens")
print("In Training Set:")
print(maxes_train)
print("In Validation Set:")
print(maxes_valid)
print()

maxes_train = []
# do the same for the training LQ term frequencies
for key, item in lq_json_tf.items():
  if len(maxes_train) < 10:
    maxes_train.append([key, item])
    maxes_train.sort(key=second_elem, reverse=True)
  elif item > maxes_train[len(maxes_train)-1][1]:
    maxes_train[len(maxes_train)-1] = [key, item]
    maxes_train.sort(key=second_elem, reverse=True)

maxes_valid = []
# do the same for the validation LQ term frequencies
for key, item in lq_json_tf_valid.items():
  if len(maxes_valid) < 10:
    maxes_valid.append([key, item])
    maxes_valid.sort(key=second_elem, reverse=True)
  elif item > maxes_valid[len(maxes_valid)-1][1]:
    maxes_valid[len(maxes_valid)-1] = [key, item]
    maxes_valid.sort(key=second_elem, reverse=True)
  
print("Low Quality Top Tokens")
print("In Training Set:")
print(maxes_train)
print("In Validation Set:")
print(maxes_valid)
print()


High Quality Top Tokens
In Training Set:
[['london', 1573.5], ['kingdom', 587.5], ['united', 545.5], ['world', 315.5], ['official', 225.5], ['photo', 224.5], ['meet', 220.5], ['britain', 217.5], ['post', 216.5], ['handle', 210.5]]
In Validation Set:
[['london', 370.5], ['kingdom', 142.5], ['united', 129.5], ['world', 90.5], ['photo', 67.5], ['post', 61.5], ['britain', 58.5], ['meet', 55.5], ['official', 54.5], ['like', 52.5]]

Low Quality Top Tokens
In Training Set:
[['mohammed', 698.5], ['al', 657.5], ['nasser', 642.5], ['yamani', 614.5], ['imam', 590.5], ['almahdicaliphofallah', 571.5], ['london', 559.5], ['kingdom', 450.5], ['united', 414.5], ['post', 390.5]]
In Validation Set:
[['mohammed', 158.5], ['nasser', 149.5], ['al', 149.5], ['yamani', 142.5], ['almahdicaliphofallah', 135.5], ['imam', 134.5], ['london', 133.5], ['kingdom', 108.5], ['united', 102.5], ['post', 96.5]]
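
The manual top-10 bookkeeping above can also be written with the standard library. The sketch below uses `heapq.nlargest` on a few of the HQ training frequencies printed above. The helper name `top_terms` is hypothetical, and the order of tied frequencies may differ from the loop above.

```python
import heapq

def top_terms(tf, k=10):
    """Return the k (term, frequency) pairs with the highest frequency."""
    return heapq.nlargest(k, tf.items(), key=lambda pair: pair[1])

demo_tf = {"london": 1573.5, "kingdom": 587.5, "united": 545.5, "world": 315.5}
demo_top = top_terms(demo_tf, k=3)
print(demo_top)  # [('london', 1573.5), ('kingdom', 587.5), ('united', 545.5)]
```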



As we can see here, the top tokens in the High Quality and Low Quality training sets are almost identical to the top tokens in their respective validation sets. We conclude from this that the training and validation sets are very similar. This might explain why lowering the threshold improves the model when evaluating on the validation sets: a lower threshold keeps more terms as features, and because the validation tweets share so much vocabulary with the training tweets, the extra terms help scoring. If you were to compute newsworthiness scores on tweets that looked quite different from the training sets, a lower threshold would likely not be as effective.
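
The similarity can be quantified on the printed top-10 lists. As a rough sketch, the code below computes the Jaccard overlap (shared terms divided by all distinct terms) between the training and validation lists. The token lists are copied from the output above, and the choice of Jaccard overlap is an assumption made for illustration.

```python
hq_train_top = ["london", "kingdom", "united", "world", "official",
                "photo", "meet", "britain", "post", "handle"]
hq_valid_top = ["london", "kingdom", "united", "world", "photo",
                "post", "britain", "meet", "official", "like"]
lq_train_top = ["mohammed", "al", "nasser", "yamani", "imam",
                "almahdicaliphofallah", "london", "kingdom", "united", "post"]
lq_valid_top = ["mohammed", "nasser", "al", "yamani", "almahdicaliphofallah",
                "imam", "london", "kingdom", "united", "post"]

def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

hq_overlap = round(jaccard(hq_train_top, hq_valid_top), 3)
lq_overlap = round(jaccard(lq_train_top, lq_valid_top), 3)
print(hq_overlap)  # 0.818 (9 shared terms out of 11 distinct terms)
print(lq_overlap)  # 1.0 (the same 10 terms in both lists)
```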

### Trying Baseline with Stopwords

Let's redefine the spaCy pipeline, this time keeping stopwords.

In [None]:
def spacy_tokenization(tweet):
    tokens = []
    twt = nlp(tweet)
    # applies spacy to get rid of spaces, punctuation
    # no longer removing stopwords
    for c in twt:
        if (not c.is_punct) and (not c.is_space):
            tokens.append(c.lemma_.lower())

    # spacy does not recognize the '•' character which appears in some tweets
    # this strips that character from every token; tokens left empty are
    # dropped later by the isalpha() filter
    tokens = [t.replace('•', '') for t in tokens]
   
    return tokens

And let's tokenize again.

In [None]:
# this will give us a progress bar
from tqdm import tqdm 

for tweet in tqdm(hq_json):
    # tokenize the tweet text
    tweet['tokens'] = spacy_tokenization(tweet['text'])
    # tokenize the description as well
    tweet['tokens'].extend(spacy_tokenization(tweet['description'])) 

    # the following code gets rid of links and non-alphabet terms
    token = 0
    while token < len(tweet['tokens']):
      if not tweet['tokens'][token].isalpha():
        del tweet['tokens'][token]
      else:
        token += 1

for tweet in tqdm(lq_json):
    # tokenize the tweet text
    tweet['tokens'] = spacy_tokenization(tweet['text'])
    # tokenize the description as well
    tweet['tokens'].extend(spacy_tokenization(tweet['description'])) 

    # the following code gets rid of links and non-alphabet terms
    token = 0
    while token < len(tweet['tokens']):
      if not tweet['tokens'][token].isalpha():
        del tweet['tokens'][token]
      else:
        token += 1

100%|██████████| 2001/2001 [00:43<00:00, 46.00it/s]
100%|██████████| 1222/1222 [00:24<00:00, 49.30it/s]
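
The same tokenize-then-filter loop is repeated for each collection. A minimal sketch of one way to factor it out is shown below. The helper `tokenize_tweet` is hypothetical, and `whitespace_tokenizer` is only a stand-in for `spacy_tokenization` so that the example runs without a spaCy model.

```python
def tokenize_tweet(tweet, tokenizer):
    """Tokenize text and description, keeping only purely alphabetic tokens."""
    tokens = tokenizer(tweet['text']) + tokenizer(tweet['description'])
    # str.isalpha() is False for links, numbers, and mixed tokens
    return [t for t in tokens if t.isalpha()]

def whitespace_tokenizer(text):
    # stand-in for spacy_tokenization, for illustration only
    return text.lower().split()

# made-up tweet, for illustration only
example = {'text': 'Just posted a photo https://t.co/abc',
           'description': 'Model 2022 fan'}
example['tokens'] = tokenize_tweet(example, whitespace_tokenizer)
print(example['tokens'])
```

In the notebook, the call would presumably become `tweet['tokens'] = tokenize_tweet(tweet, spacy_tokenization)` inside each `tqdm` loop.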


Now, let's count the frequencies.

In [None]:
# will hold dictionary of term frequencies in background collection
bg_json_tf = {}
# will hold total term frequency
bg_total_terms = 0
# iterate through background tweets
for tweet in bg_json:
  # count term frequencies in tweet
  counts = Counter(tweet['text'])
  # iterate thru term frequencies
  for key, item in counts.items():
    # if key is already in term freq dictionary, increment
    if key in bg_json_tf:
      # weight BG items by their quality score
      # NOTE: HQ/LQ terms are weighted by the qualityScore of the tweet they appear in (see HQ/LQ TF counting)
      bg_json_tf[key] += (item * tweet['score'])
      # increment total terms
      bg_total_terms += (item * tweet['score'])
    # else, make new key
    else:
      # +0.5 is added for smoothing purposes
      bg_json_tf[key] = (item * tweet['score']) + 0.5
      bg_total_terms += (item * tweet['score']) + 0.5

# will hold dictionary of term frequencies in high quality collection
hq_json_tf = {}
# will hold total term frequency
hq_total_terms = 0
# iterate through high quality tweets
for tweet in hq_json:
  # count term frequencies in tweet
  counts = Counter(tweet['tokens'])
  # iterate thru term frequencies
  for key, item in counts.items():
    # if key is already in term freq dictionary, increment
    if key in hq_json_tf:
      # terms are weighted by the tweet's quality score
      hq_json_tf[key] += (item * tweet['qualityS'])
      # increment total terms
      hq_total_terms += (item * tweet['qualityS'])
    # else, make new key
    else:
      # +0.5 is for smoothing
      hq_json_tf[key] = (item * tweet['qualityS']) + 0.5
      hq_total_terms += (item * tweet['qualityS']) + 0.5

# will hold dictionary of term frequencies in low quality collection
lq_json_tf = {}
# will hold total term frequency
lq_total_terms = 0
# iterate through low quality tweets
for tweet in lq_json:
  # count term frequencies in tweet
  counts = Counter(tweet['tokens'])
  # iterate thru term frequencies
  for key, item in counts.items():
    # if key is already in term freq dictionary, increment
    if key in lq_json_tf:
      # terms are weighted by one minus the tweet's quality score (see HQ TF for explanation)
      lq_json_tf[key] += (item * (1-tweet['qualityS']))
      # increment total terms
      lq_total_terms += (item * (1-tweet['qualityS']))
    # else, make new key
    else:
      # +0.5 is for smoothing
      lq_json_tf[key] = (item * (1-tweet['qualityS'])) + 0.5
      lq_total_terms += (item * (1-tweet['qualityS'])) + 0.5
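
The three counting loops above differ only in the token field and the weight applied to each tweet. The sketch below shows one way they could share a single function; `weighted_tf` and its parameters are hypothetical names, and the toy tweets are made up for illustration.

```python
from collections import Counter

def weighted_tf(tweets, token_key, weight_fn, smoothing=0.5):
    """Return (term -> weighted frequency, total weight) with add-0.5 smoothing."""
    tf = {}
    total = 0.0
    for tweet in tweets:
        weight = weight_fn(tweet)
        for term, count in Counter(tweet[token_key]).items():
            if term not in tf:
                # first sighting of a term: add the smoothing constant once
                tf[term] = smoothing
                total += smoothing
            tf[term] += count * weight
            total += count * weight
    return tf, total

# toy data, not taken from the real collections
toy = [{'tokens': ['london', 'london', 'photo'], 'qualityS': 0.5},
       {'tokens': ['london'], 'qualityS': 1.0}]

# HQ-style weighting uses the quality score, LQ-style uses one minus it
hq_tf, hq_total = weighted_tf(toy, 'tokens', lambda t: t['qualityS'])
lq_tf, lq_total = weighted_tf(toy, 'tokens', lambda t: 1 - t['qualityS'])
print(hq_tf, hq_total)
print(lq_tf, lq_total)
```

Under this sketch the background model would presumably use `weighted_tf(bg_json, 'text', lambda t: t['score'])`.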

And finally, compute the relative importances.

In [None]:
# store the relative importances in a dictionary
hq_relimpo = {}
# iterate through the term frequency dictionary
for key, item in hq_json_tf.items():
  try:
    # this will try calculating the relative importance by looking for the key
    # in the background model
    rel_impo = (item / hq_total_terms) / (bg_json_tf[key] / bg_total_terms)
  except KeyError:
    # if a term does not exist in the background model, fall back to the add-0.5 smoothing value
    rel_impo = (item / hq_total_terms) / (0.5 / bg_total_terms)
  
  # only save terms with importances over 2
  if rel_impo > 2:
    hq_relimpo[key] = rel_impo

lq_relimpo = {}
for key, item in lq_json_tf.items():
  try:
    # this will try calculating the relative importance by looking for the key
    # in the background model
    rel_impo = (item / lq_total_terms) / (bg_json_tf[key] / bg_total_terms)
  except KeyError:
    # if a term does not exist in the background model, fall back to the add-0.5 smoothing value
    rel_impo = (item / lq_total_terms) / (0.5 / bg_total_terms)
  
  # only save terms with relative importance over 2
  if rel_impo > 2:
    lq_relimpo[key] = rel_impo
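
The relative importance of a term is its share of the collection divided by its share of the background model. As a hedged sketch, the same calculation can be written with `dict.get` in place of exception handling. The function name `relative_importance` and the toy numbers are assumptions made for illustration.

```python
def relative_importance(tf, total, bg_tf, bg_total, threshold=2.0, smoothing=0.5):
    """Keep terms whose share of the collection exceeds `threshold` times their background share."""
    kept = {}
    for term, weight in tf.items():
        # terms missing from the background fall back to the smoothing constant
        bg_weight = bg_tf.get(term, smoothing)
        score = (weight / total) / (bg_weight / bg_total)
        if score > threshold:
            kept[term] = score
    return kept

# toy numbers, chosen only for illustration
tf = {'london': 30.0, 'the': 10.0, 'rare': 1.0}
bg_tf = {'london': 10.0, 'rare': 5.0}
result = relative_importance(tf, total=100.0, bg_tf=bg_tf, bg_total=50.0)
print(result)
```

Here only `'the'` survives the threshold, because it is missing from the toy background and is therefore divided by the small smoothing value.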

We need to retokenize the validation sets.

In [None]:
for tweet in tqdm(hq_json_valid):
    # tokenize the tweet text
    tweet['tokens'] = spacy_tokenization(tweet['text'])
    # tokenize the description as well
    tweet['tokens'].extend(spacy_tokenization(tweet['description'])) 

    # the following code gets rid of links and non-alphabet terms
    token = 0
    while token < len(tweet['tokens']):
      if not tweet['tokens'][token].isalpha():
        del tweet['tokens'][token]
      else:
        token += 1

for tweet in tqdm(lq_json_valid):
    # tokenize the tweet text
    tweet['tokens'] = spacy_tokenization(tweet['text'])
    # tokenize the description as well
    tweet['tokens'].extend(spacy_tokenization(tweet['description'])) 

    # the following code gets rid of links and non-alphabet terms
    token = 0
    while token < len(tweet['tokens']):
      if not tweet['tokens'][token].isalpha():
        del tweet['tokens'][token]
      else:
        token += 1

100%|██████████| 500/500 [00:12<00:00, 40.81it/s]
100%|██████████| 305/305 [00:06<00:00, 47.38it/s]


And now let's look at the statistics.

In [None]:
# the following code calculates statistics the same way they were
# calculated earlier with the quality score, see earlier comments for explanation
count = 0
# named total rather than sum to avoid shadowing the built-in sum()
total = 0
scores_list = []
for i in range(len(hq_json_valid)):
  score = newsworthy_score(hq_relimpo, lq_relimpo, hq_json_valid[i]['tokens'])
  scores_list.append(score)
  total += score
  # an HQ tweet counts as correctly classified when its score is positive
  if score > 0:
    count += 1

prop = round((count / len(hq_json_valid)), 3)
avg = round((total / len(hq_json_valid)), 3)
sd = round(stat.stdev(scores_list), 3)

print("Proportion of HQ texts classified as Newsworthy: " + str(prop))
print("Average score of HQ text: " + str(avg))
print("Standard Dev. of HQ text scores: " + str(sd))

count = 0
# named total rather than sum to avoid shadowing the built-in sum()
total = 0
scores_list = []
for i in range(len(lq_json_valid)):
  score = newsworthy_score(hq_relimpo, lq_relimpo, lq_json_valid[i]['tokens'])
  scores_list.append(score)
  total += score
  # an LQ tweet counts as correctly classified when its score is negative
  if score < 0:
    count += 1

prop = round((count / len(lq_json_valid)), 3)
avg = round((total / len(lq_json_valid)), 3)
sd = round(stat.stdev(scores_list), 3)

print()
print("Proportion of LQ texts classified as Non-Newsworthy: " + str(prop))
print("Average score of LQ text: " + str(avg))
print("Standard Dev. of LQ text scores: " + str(sd))

Proportion of HQ texts classified as Newsworthy: 0.504
Average score of HQ text: 0.021
Standard Dev. of HQ text scores: 0.35

Proportion of LQ texts classified as Non-Newsworthy: 0.928
Average score of LQ text: -0.867
Standard Dev. of LQ text scores: 0.781
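
The HQ and LQ statistics differ only in which side of zero counts as a correct classification. A minimal sketch of a shared helper follows; the name `summarize` and the toy scores are hypothetical and are not outputs of `newsworthy_score`.

```python
import statistics

def summarize(scores, newsworthy=True):
    """Return (proportion on the expected side of 0, mean, sample std dev), rounded to 3 places."""
    if newsworthy:
        hits = len([s for s in scores if s > 0])
    else:
        hits = len([s for s in scores if s < 0])
    prop = round(hits / len(scores), 3)
    avg = round(statistics.mean(scores), 3)
    sd = round(statistics.stdev(scores), 3)
    return prop, avg, sd

# made-up scores, for illustration only
toy_scores = [0.5, -0.25, 0.0, 0.75]
print(summarize(toy_scores, newsworthy=True))
print(summarize(toy_scores, newsworthy=False))
```

Note that a score of exactly 0 counts as a miss in both directions, which matches the strict comparisons used in the cell above.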


As you may notice, the average scores are lower when stopwords are included. Let's look at why this might be.

#### Top Tokens with Stopwords Included

In [None]:
maxes_train = []
# iterate through the HQ json dictionary
for key, item in hq_json_tf.items():
  # keep track of the top 10 maximum frequencies
  if len(maxes_train) < 10:
    maxes_train.append([key, item])
    maxes_train.sort(key=second_elem, reverse=True)
  elif item > maxes_train[len(maxes_train)-1][1]:
    maxes_train[len(maxes_train)-1] = [key, item]
    maxes_train.sort(key=second_elem, reverse=True)

print("High Quality Training Set Top Tokens")
print(maxes_train)

maxes_train = []
# do the same as for the HQ set
for key, item in lq_json_tf.items():
  if len(maxes_train) < 10:
    maxes_train.append([key, item])
    maxes_train.sort(key=second_elem, reverse=True)
  elif item > maxes_train[len(maxes_train)-1][1]:
    maxes_train[len(maxes_train)-1] = [key, item]
    maxes_train.sort(key=second_elem, reverse=True)
  
print("Low Quality Training Set Top Tokens")
print(maxes_train)

High Quality Training Set Top Tokens
[['the', 1032.3494188927427], ['london', 969.7472453431797], ['and', 686.6962694125531], ['in', 631.1753321020847], ['of', 611.6567866285135], ['a', 584.2403906027608], ['be', 576.084262754156], ['to', 537.7969204980749], ['at', 398.1939589292483], ['i', 372.4294174223238]]
Low Quality Training Set Top Tokens
[['the', 486.9893442661689], ['a', 443.71530281577606], ['mohammed', 439.49327210104786], ['al', 413.7552564356], ['nasser', 404.35816882549904], ['yamani', 386.77773549660094], ['imam', 371.72571088705365], ['be', 362.8568581934021], ['almahdicaliphofallah', 359.82600600615143], ['and', 349.6394208668373]]
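
The top-10 lists above are built with a manual replace-and-sort loop. As an alternative sketch, the standard library's `heapq.nlargest` selects the same kind of list in one call. The helper name `top_tokens` and the toy dictionary are assumptions for illustration.

```python
import heapq

def top_tokens(tf, k=10):
    """Return the k highest-weighted [token, weight] pairs, largest first."""
    largest = heapq.nlargest(k, tf.items(), key=lambda kv: kv[1])
    return [[token, weight] for token, weight in largest]

# toy term frequencies, not taken from the real model
toy_tf = {'the': 4.0, 'london': 3.5, 'and': 2.0, 'photo': 1.0}
print(top_tokens(toy_tf, k=2))
```

With the notebook's dictionaries this would be called as `top_tokens(hq_json_tf)` or `top_tokens(lq_json_tf)`.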


By looking at the top terms in each model, we can see why the High Quality set performs much worse with stopwords included while the Low Quality set remains strong. The top terms in the HQ model are dominated by stopwords: apart from 'london', every token in the HQ top 10 is a stopword. Since stopwords occur in most tweets regardless of quality, they carry little information about whether a tweet is newsworthy. With stopwords included and weighted so heavily, HQ tweets are scored largely on their stopwords and become hard to tell apart. Hence, the HQ classification is close to a coin flip (0.504).

On the other hand, the LQ model still does well even with stopwords included. Six out of the original top 10 words for the LQ model remained even with the presence of stopwords. Thus there are still heavily weighted terms in the LQ model that can help to distinguish a tweet's lack of newsworthiness.

Note that stopwords receive such high weights because they are absent from the BG model, so their relative importance is computed against the 0.5 smoothing value rather than a real background frequency.
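
To make the effect concrete, the sketch below compares two terms with the same collection weight, one present in the background and one absent. All numbers are hypothetical and chosen only to show the size of the gap; they are not the notebook's real totals.

```python
# hypothetical totals and weights, for illustration only
hq_total_terms = 40000.0
bg_total_terms = 20000.0
bg_json_tf = {'london': 50.5}  # 'the' is deliberately missing

def rel_impo(term, weight):
    # absent terms are divided by the 0.5 smoothing value
    bg_weight = bg_json_tf.get(term, 0.5)
    return (weight / hq_total_terms) / (bg_weight / bg_total_terms)

the_score = rel_impo('the', 1000.0)
london_score = rel_impo('london', 1000.0)
print(the_score, london_score, the_score / london_score)
```

The ratio between the two scores equals the present term's background weight divided by 0.5, so any term missing from the background is inflated by that factor relative to a term the background does contain.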