# Usage of StorySniffer

This notebook shows how to use the StorySniffer library to inspect a URL and estimate whether it points to a news story.

In [92]:
!pip install scikit-learn==1.5.1 # Going beyond this version can cause errors
!pip install storysniffer
!pip install beautifulsoup4 lxml # lxml is the parser passed to BeautifulSoup later in this notebook
!pip install mediacloud



## Using StorySniffer to estimate if a URL is a news story

In [93]:
def display_result(url, result):
    """Print whether the given guess marks the URL as a news story."""
    if result:
        print(f"{url} is pointing to a news story")
    else:
        print(f"{url} is not pointing to a news story")

### Using only the URL to evaluate if it is a news story or not

In [94]:
from storysniffer import StorySniffer
sniffer = StorySniffer()
url = "https://www.nytimes.com/2024/12/11/crosswords/connections-companion-550.html"
display_result(url, sniffer.guess(url))

https://www.nytimes.com/2024/12/11/crosswords/connections-companion-550.html is pointing to a news story


In [95]:
url = "https://www.nytimes.com/"
display_result(url, sniffer.guess(url))

https://www.nytimes.com/ is not pointing to a news story


#### Testing multiple URLs that are not pointing to news stories

In [96]:
urls = [
    "https://www.nytimes.com/section/world",
    "https://www.nytimes.com/subscription/all-access?campaignId=8WULY&channel=odisplay&areas=banner&campaign=AllAccessAcquisition",
    "https://www.nytimes.com/section/world/europe",
    "https://help.nytimes.com/hc/en-us/articles/115015385887-Contact-The-New-York-Times",
    "https://help.nytimes.com/hc/en-us",
    ""
]
for url in urls:
    display_result(url, sniffer.guess(url))

https://www.nytimes.com/section/world is not pointing to a news story
https://www.nytimes.com/subscription/all-access?campaignId=8WULY&channel=odisplay&areas=banner&campaign=AllAccessAcquisition is not pointing to a news story
https://www.nytimes.com/section/world/europe is not pointing to a news story
https://help.nytimes.com/hc/en-us/articles/115015385887-Contact-The-New-York-Times is not pointing to a news story
https://help.nytimes.com/hc/en-us is not pointing to a news story
 is not pointing to a news story


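**Preliminary Results:** The model appears to classify most of these URLs correctly from their structure alone: the homepage, section, subscription, and help pages (and the empty string) are all rejected, while the dated, article-style URL is accepted.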
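
One way to make "mostly correct" measurable is to attach an expected label to each URL and count how often the guess agrees. The sketch below is only an illustration: `score_guesses` is a hypothetical helper, the labels are hand-assigned assumptions, and a stand-in classifier is used so the example runs without the library. In the notebook, `sniffer.guess` could be passed in its place.

```python
from typing import Callable, Iterable, List, Tuple


def score_guesses(guess: Callable[[str], bool],
                  labeled_urls: Iterable[Tuple[str, bool]]) -> Tuple[float, List[str]]:
    """Return (accuracy, misclassified URLs) for (url, expected) pairs."""
    misses = []
    total = 0
    for url, expected in labeled_urls:
        total += 1
        if bool(guess(url)) != expected:
            misses.append(url)
    # Avoid dividing by zero when no URLs are supplied.
    accuracy = (total - len(misses)) / total if total else 0.0
    return accuracy, misses


# Hand-assigned labels (an assumption made for illustration).
labeled = [
    ("https://www.nytimes.com/", False),
    ("https://www.nytimes.com/section/world", False),
    ("https://help.nytimes.com/hc/en-us", False),
]


def always_false(url: str) -> bool:
    """Stand-in classifier; replace with sniffer.guess in the notebook."""
    return False


accuracy, misses = score_guesses(always_false, labeled)
print(f"accuracy={accuracy:.2f}, misses={misses}")
```

Returning the misclassified URLs alongside the accuracy makes it easier to see which URL patterns the model struggles with.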

### Using the URL and text from a web page to evaluate if it is a news story or not

According to the documentation: "If you have a text string, like the page’s `<title>` tag or the contents of an `<a>` tag, you can pass that in as an additional clue"

We can test this with http://sktoday.com/. At the time of writing, several URLs on this site all lead to the homepage, so the URL pattern alone is not enough to tell whether a link points to a news story.

**Using the Title text to improve our guessing**

In [97]:
import requests
from bs4 import BeautifulSoup

In [98]:
url = "http://sktoday.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
title = soup.title.string  # text of the page's <title> tag
display_result(url, sniffer.guess(url, text=title))

http://sktoday.com/ is not pointing to a news story


In [99]:
url = "http://sktoday.com/content/2233_ministry-foreign-affairs-slovak-republic-appreciates-course-ukrainian-parliamentary-ele"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
title = soup.title.string  # text of the page's <title> tag
display_result(url, sniffer.guess(url, text=title))

http://sktoday.com/content/2233_ministry-foreign-affairs-slovak-republic-appreciates-course-ukrainian-parliamentary-ele is pointing to a news story
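
The cells above read `soup.title.string` directly, which raises an `AttributeError` on pages that have no `<title>` tag. A defensive variant might look like the sketch below. It uses the standard library's `HTMLParser` instead of BeautifulSoup only to stay dependency-free; `extract_title` and `_TitleParser` are hypothetical names, not part of any library used in this notebook.

```python
from html.parser import HTMLParser
from typing import Optional


class _TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True  # ignore any later <title> elements

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data)


def extract_title(html: str) -> Optional[str]:
    """Return the stripped <title> text, or None if there is no usable title."""
    parser = _TitleParser()
    parser.feed(html)
    parser.close()
    title = "".join(parser.parts).strip()
    return title or None


print(extract_title("<html><head><title> Example headline </title></head></html>"))
print(extract_title("<html><body><p>No title here</p></body></html>"))
```

With a helper like this, the `text=` argument could be passed only when the result is not `None`, falling back to the URL-only guess otherwise.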


**Passing the full HTML response as the text, in another attempt to improve the model's performance**

In [100]:
url = "http://sktoday.com/content/2233_ministry-foreign-affairs-slovak-republic-appreciates-course-ukrainian-parliamentary-ele"
response = requests.get(url)
# Pass the raw HTML of the page as the text clue instead of just the title
display_result(url, sniffer.guess(url, text=response.text))

http://sktoday.com/content/2233_ministry-foreign-affairs-slovak-republic-appreciates-course-ukrainian-parliamentary-ele is pointing to a news story


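**Preliminary Results:** Adding text does not seem to improve the model's ability to predict whether a URL points to a story: the article URL is classified as a story both with the title and with the full HTML, and the homepage is still rejected when its title is supplied. The structure of the URL therefore appears to have a greater impact on the prediction than the text. Is the model not content-aware?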
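
To probe that question more directly, each URL can be guessed twice, once without text and once with it, keeping only the cases where the answer changes. The sketch below assumes a hypothetical helper `find_text_sensitive` and placeholder `example.com` URLs. It uses a stand-in classifier that ignores text, so it runs without the library. In the notebook, `sniffer.guess` could be passed instead, since the cells above call it both as `guess(url)` and `guess(url, text=...)`.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def find_text_sensitive(guess: Callable[..., bool],
                        samples: Iterable[Tuple[str, str]]) -> List[str]:
    """Return the URLs whose prediction changes when text is supplied."""
    changed = []
    for url, text in samples:
        without_text = bool(guess(url))
        with_text = bool(guess(url, text=text))
        if without_text != with_text:
            changed.append(url)
    return changed


def url_only_stub(url: str, text: Optional[str] = None) -> bool:
    """Stand-in that ignores text entirely (hypothetical, not StorySniffer)."""
    return "/content/" in url


# Placeholder URLs and titles, made up for illustration.
samples = [
    ("https://example.com/", "Homepage title"),
    ("https://example.com/content/123_some-headline", "Article title"),
]

print(find_text_sensitive(url_only_stub, samples))
```

An empty list means the text never changed a prediction for these samples. A sharper probe would pair a homepage URL with an article headline, or an article URL with navigation text, to see whether the text can override the URL.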