# Section 2: Probability and Distributions

Not to be political, but I think most people would say that US Presidents Jimmy Carter and Ronald Reagan were pretty different people. One was a peanut farmer; the other was a movie actor. Let’s find out what their Wikipedia articles can tell us and quantify the comparison.
First, we need to do some scraping of Wikipedia and parsing of text to gather the data. (Note that we’re going to oversimplify the first part a little bit for the sake of time.)
1.	Snowflake values security. By default, you can’t make outbound internet requests from your Python code. We would have to set up and use an “external network access” integration… but Snowflake trial accounts block this feature. So, I’ve created files for you to upload into your notebook!
a.	https://github.com/paulboal/data-5740-2025/blob/main/Lab%2003/carter.txt
b.	https://github.com/paulboal/data-5740-2025/blob/main/Lab%2003/regan.txt 
2.	Visit each of these files. Click the “download raw” icon to download each file. Then upload each file into your notebook in Snowflake.
3.	Tokenize the words from each article and remove common “stop words”.
4.	Count how often each word occurs in each article.
5.	Compute the probability of each word’s occurrence in each article.
6.	For the top five words in each article, look up the probability of that word occurring in each article and produce a table of probabilities for those ten words.
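
For step 5, the probability of a word is read here as its relative frequency among the tokens left after stop-word removal. The symbols below are introduced only for illustration: $c_A(w)$ is the count of word $w$ in article $A$, and $N_A$ is the total number of remaining tokens in that article.

$$
P_A(w) = \frac{c_A(w)}{N_A}, \qquad N_A = \sum_{w'} c_A(w')
$$

Under this definition the probabilities within one article sum to 1.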


In [None]:
# Import python packages
import streamlit as st
import pandas as pd

# We can also use Snowpark for our analyses!
from snowflake.snowpark.context import get_active_session
session = get_active_session()


In [None]:
reagan = open('regan.txt').read()  # Forgive the misspelling
carter = open('carter.txt').read()

len(reagan), len(carter)

In [None]:
import re, math, collections
stop = set("""would but president carter reagan had carter's reagan's said a an the and or of to in for on with by from is are was were be been being as at it its this that which who whom their his her they them he she we you your our not""".split())

# Our tokenize function lowercases the text, splits it into runs of letters
# and apostrophes, then drops stop words and tokens of two characters or fewer.
def tokenize(text):
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in stop and len(w) > 2]

carter_tokens = tokenize(carter)
reagan_tokens = tokenize(reagan)

# We'll count words and compute probabilities
def probs(tokens):
    c = collections.Counter(tokens) # shortcut for counting things
    N = sum(c.values())
    return c, {w: c[w]/N for w in c}

carter_cnt, carter_p = probs(carter_tokens)
reagan_cnt, reagan_p = probs(reagan_tokens)
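
To make the counting and probability steps concrete, here is a minimal sketch on a made-up sentence. The names `toy_stop`, `toy_text`, and `toy_tokenize` are hypothetical and exist only for this illustration; the sentence is not taken from either Wikipedia file.

```python
import collections
import re

# Hypothetical toy inputs, not taken from the Wikipedia files.
toy_stop = {"the", "and", "was"}
toy_text = "The farmer and the actor: the farmer was elected, and the actor was elected."

def toy_tokenize(text, stop_words):
    # Lowercase, keep runs of letters/apostrophes, drop stop words and short tokens.
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in stop_words and len(w) > 2]

toy_tokens = toy_tokenize(toy_text, toy_stop)

# Count each remaining token, then divide by the total to get a probability.
toy_counts = collections.Counter(toy_tokens)
toy_total = sum(toy_counts.values())
toy_probs = {w: n / toy_total for w, n in toy_counts.items()}

print(toy_counts.most_common())
print(toy_probs)
```

Six tokens survive the filter, and each of the three distinct words appears twice, so each gets a probability of 1/3.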


In [None]:
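carter_top5 = [w for w, _ in carter_cnt.most_common(5)]  # keep the words, drop the counts
carter_top5

In [None]:
reagan_top5 = [w for w, _ in reagan_cnt.most_common(5)]  # keep the words, drop the counts
reagan_top5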

In [None]:
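# Words in both top-five lists would repeat in the index; dict.fromkeys keeps each once, in order.
tops = pd.DataFrame(index=list(dict.fromkeys(reagan_top5 + carter_top5)), columns=["Carter","Reagan"])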

In [None]:
# A top word from one article may be absent from the other, so fall back to 0.0.
for w in tops.index:
    tops.loc[w,"Carter"] = carter_p.get(w, 0.0)
    tops.loc[w,"Reagan"] = reagan_p.get(w, 0.0)
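
One thing to watch when filling this table: a word from one article's top five may never occur in the other article, so it has no entry in that article's probability dictionary. A minimal sketch of the issue, using made-up dictionaries `p_a` and `p_b` (hypothetical names and values, not from the lab files):

```python
# Hypothetical probability dictionaries for two tiny articles.
p_a = {"peanut": 0.5, "governor": 0.5}
p_b = {"actor": 0.5, "governor": 0.5}

words = ["peanut", "actor", "governor"]

# dict.get returns the fallback (0.0) when a word never occurs in an article,
# whereas p_b["peanut"] would raise KeyError.
table = {w: (p_a.get(w, 0.0), p_b.get(w, 0.0)) for w in words}
print(table)
```

Treating a missing word as probability 0.0 is consistent with the relative-frequency definition, since a word with zero occurrences has a count of zero.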

In [None]:
tops

In [None]:
# Leveraging what we know about cosine_similarity...
from sklearn.metrics.pairwise import cosine_similarity

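# The columns were created empty (dtype object), so cast to float first.
# reshape(1,-1) turns each column into a single row vector, as sklearn expects.
cosine_similarity(
    tops.loc[:,"Carter"].astype(float).values.reshape(1,-1),
    tops.loc[:,"Reagan"].astype(float).values.reshape(1,-1)
)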
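
The call above returns a 1×1 array whose single value is the cosine of the angle between the two probability vectors. As a rough sketch of what that number means, the same quantity can be computed by hand in plain Python. The vectors `a` and `b` below are made-up values for illustration, not results from the articles, and `cosine` is a hypothetical helper name.

```python
import math

def cosine(u, v):
    # Dot product divided by the product of the Euclidean norms.
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Hypothetical probabilities for the same five words, in the same order.
a = [0.04, 0.03, 0.02, 0.00, 0.01]
b = [0.01, 0.03, 0.00, 0.05, 0.02]

print(cosine(a, b))                   # somewhere between 0 and 1
print(cosine(a, a))                   # a vector compared with itself
print(cosine([1.0, 0.0], [0.0, 1.0])) # no words in common
```

Because probabilities are never negative, the result falls between 0 (no shared words among those compared) and 1 (identical proportions). Scaling a vector does not change the result, so the measure compares the mix of words, not the length of the articles.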