# Data Mining Week 11 with Professor Sloan

## Maggie Boles

### 11/17/2025

### From Blackboard: Using the tweets.csv dataset from Week 3, use pattern matching to find every term that can be categorized as “SOCIAL_CAUSE”. Your result should be a Pandas DataFrame that contains the following information (the DataFrame can be very simple or very nice – that’s up to you):
| Author | Matched Text | Identifier |
|---|---|---|
| katyperry | power to the people | SOCIAL_CAUSE |
| BarackOba | roll back poverty | SOCIAL_CAUSE |
| rihanna | people ov venezuala in your prayers | SOCIAL_CAUSE |

Note that the table above is just a small sample – you can match on as many terms as you want, but your code must leverage:

- Matcher
- PhraseMatcher (at least once)
- on_match callback (at least one)
- Matches on more than just specific text – use POS, IS_PUNCT or any other token attributes (for no less than one match pattern, but I encourage you to use other token attributes often since this is the real power of spaCy pattern matching)
- Regular expression (at least once)

Note that the beauty of this Python package is that you define what a SOCIAL_CAUSE is, explicitly, using the text as your guide. This feature is especially important when dealing with domain-specific corpora, since the language is not simply Wikipedia data.

    

In [1]:
#import libraries
import pandas as pd
import spacy
from spacy.matcher import Matcher, PhraseMatcher
from spacy.tokens import Doc

# 1. Load the data + model

df = pd.read_csv("twitter_sample.csv")   # your file
nlp = spacy.load("en_core_web_sm")

# Register custom extension for callback
Doc.set_extension("author", default="unknown", force=True)

results = []

# 2. on_match callback 
def add_match(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = doc[start:end]
    results.append({
        "Author": doc._.author,
        "Matched Text": span.text,
        "Identifier": "SOCIAL_CAUSE"
    })

# 3. Matchers — tailored to the actual data

matcher = Matcher(nlp.vocab)
phrase_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

# PhraseMatcher: real phrases in the dataset 
phrases = [
    "#morethanmedicine", "more than medicine", "morethanmedicine",
    "#animalhealthmatters", "animal health matters",
    "#petsarefamily", "pets are family",
    "#animalhealth", "animal medicines", "animal welfare"
]
phrase_patterns = [nlp.make_doc(p) for p in phrases]
phrase_matcher.add("ANIMAL_HEALTH_CAMPAIGN", phrase_patterns)

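# --- Matcher: token-attribute patterns ---
# Only POWERFUL_PHRASE has an on_match callback. The other rules still match,
# but their hits are only returned by matcher(doc) and are never stored below.

# Regex on token text: any token starting with "#". The default English
# tokenizer usually splits "#" from the tag, so this tends to hit the bare "#".
matcher.add("HASHTAG", [[{"TEXT": {"REGEX": "^#.*"}}]])

# Exact lowercase token sequence; the callback appends each match to results
matcher.add("POWERFUL_PHRASE", [[
    {"LOWER": "more"}, {"LOWER": "than"}, {"LOWER": "medicine"}
]], on_match=add_match)

# Lemma-based pair, e.g. "animals" + "health" or "pet" + "medicines"
matcher.add("ANIMAL_HEALTH", [[
    {"LEMMA": {"IN": ["animal", "pet", "veterinary"]}},
    {"LEMMA": {"IN": ["health", "medicine", "welfare"]}}
]])

# "thank (word)? you (punct)? @..." using IS_ALPHA, IS_PUNCT, and a regex
matcher.add("THANKS_MENTION", [[
    {"LOWER": "thank"}, {"IS_ALPHA": True, "OP": "?"},
    {"LOWER": "you"}, {"IS_PUNCT": True, "OP": "?"},
    {"TEXT": {"REGEX": "^@.*"}}
]])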

# 4. Process each tweet using the columns

for _, row in df.iterrows():
    tweet = str(row["Tweet Content"])
    author = str(row["Username"])
    
    # Clean up common Twitter junk
    tweet = tweet.replace("&amp;", "&")
    
    if not tweet.strip() or tweet.lower() == "nan":
        continue

    doc = nlp(tweet)
    doc._.author = author

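    # Run matchers. The Matcher's return value is discarded, so only rules
    # with an on_match callback (POWERFUL_PHRASE) contribute to results.
    matcher(doc)
    phrase_matches = phrase_matcher(doc)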

    # Collect PhraseMatcher results
    for match_id, start, end in phrase_matches:
        span = doc[start:end]
        results.append({
            "Author": author,
            "Matched Text": span.text,
            "Identifier": "SOCIAL_CAUSE"
        })


# 5. Results as a DataFrame
result_df = pd.DataFrame(results)

if not result_df.empty:
    result_df = (result_df
                 .drop_duplicates()
                 .sort_values(["Author", "Matched Text"])
                 .reset_index(drop=True))
    print(f"\nFOUND {len(result_df)} SOCIAL_CAUSE MATCHES! (All about #MorethanMedicine)\n")
else:
    result_df = pd.DataFrame(columns=["Author", "Matched Text", "Identifier"])
    print("No matches — something went wrong")

print(result_df.to_string(index=False))
result_df


FOUND 390 SOCIAL_CAUSE MATCHES! (All about #MorethanMedicine)

         Author         Matched Text   Identifier
        ACEPNow    #MoreThanMedicine SOCIAL_CAUSE
        ACEPNow     MoreThanMedicine SOCIAL_CAUSE
       ACNANSI1    #MorethanMedicine SOCIAL_CAUSE
       ACNANSI1     MorethanMedicine SOCIAL_CAUSE
       ACOEPRSO    #MoreThanMedicine SOCIAL_CAUSE
       ACOEPRSO     MoreThanMedicine SOCIAL_CAUSE
        AG_EM33    #MoreThanMedicine SOCIAL_CAUSE
        AG_EM33     MoreThanMedicine SOCIAL_CAUSE
      ALLOhioEM    #MoreThanMedicine SOCIAL_CAUSE
      ALLOhioEM     MoreThanMedicine SOCIAL_CAUSE
 AgedeGuillaume    #MorethanMedicine SOCIAL_CAUSE
 AgedeGuillaume     MorethanMedicine SOCIAL_CAUSE
   Agri_Updates    #MorethanMedicine SOCIAL_CAUSE
   Agri_Updates        #animalhealth SOCIAL_CAUSE
   Agri_Updates     MorethanMedicine SOCIAL_CAUSE
   Al3xCarvalho    #MorethanMedicine SOCIAL_CAUSE
   Al3xCarvalho     MorethanMedicine SOCIAL_CAUSE
   Al3xCarvalho       animal welfare

Unnamed: 0,Author,Matched Text,Identifier
0,ACEPNow,#MoreThanMedicine,SOCIAL_CAUSE
1,ACEPNow,MoreThanMedicine,SOCIAL_CAUSE
2,ACNANSI1,#MorethanMedicine,SOCIAL_CAUSE
3,ACNANSI1,MorethanMedicine,SOCIAL_CAUSE
4,ACOEPRSO,#MoreThanMedicine,SOCIAL_CAUSE
...,...,...,...
385,wegetittogether,morethanmedicine,SOCIAL_CAUSE
386,welshted,#morethanmedicine,SOCIAL_CAUSE
387,welshted,morethanmedicine,SOCIAL_CAUSE
388,wontarul277,#MorethanMedicine,SOCIAL_CAUSE


##### Initially I was following the Social Cause example from Blackboard, and it wasn't returning any results. I realize now that the CSV we are using isn't about any of those topics, so I refit the code for the #morethanmedicine tags. Because that hashtag is in every item of the CSV, we essentially return each tweet, oops! But it does mean the PhraseMatcher is positively identifying the phrases from the list we created to search for, which is what we are looking for. Since the PhraseMatcher is built with attr="LOWER", it captures variations of the hashtag regardless of capitalization, and it is also what picks up phrases like "animal welfare". The Matcher adds a REGEX pattern for hashtags and a LEMMA pattern for related concepts like "animal" + "health/welfare" even with varied phrasing, but only the "more than medicine" pattern has the on_match callback attached, so that is the only Matcher rule whose hits land in the results. The callback gives a structured extraction with author attribution, which makes a nice table from the data.
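As a cross-check on the case-insensitive hashtag matching described above, here is a minimal sketch that uses only Python's built-in `re` module instead of spaCy. The sample tweets and the helper name `find_hashtag_variants` are made up for illustration and are not rows from the CSV.

```python
import re

# Case-insensitive pattern: an optional "#", then the campaign tag as one word.
# The trailing \b keeps it from matching inside a longer word.
HASHTAG_RE = re.compile(r"#?morethanmedicine\b", re.IGNORECASE)


def find_hashtag_variants(text):
    """Return every spelling of the campaign tag found in text."""
    return HASHTAG_RE.findall(text)


# Hypothetical sample tweets (not taken from the dataset)
samples = [
    "Vets do so much #MoreThanMedicine",
    "proud to support #morethanmedicine and #AnimalHealth",
    "MoreThanMedicine is trending",
    "nothing relevant here",
]

for s in samples:
    print(find_hashtag_variants(s))
```

Because the "#" is optional inside a single pattern, each mention comes back once here, whereas the phrase list in the notebook contains both "#morethanmedicine" and "morethanmedicine" as separate entries.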

##### I thought this was pretty interesting. I like that we get to see how terms relate to each other and can encode specific phrasing, casting a wide net first and then narrowing the focus to what we are specifically looking for. This helps ensure the compiled data is relevant and useful for a specific purpose. It is similar to searching a company-wide e-mail server for key phrases about sales, or checking for mentions of high-level security in government settings. Very fun to look into and see the output. Now that I'm thinking about it, there are more rows in the result (390) than there are tweets in the actual document, so my earlier answer about it returning each tweet wasn't quite correct. I think a tweet is hit more than once when it matches more than one pattern: the results show both "#MoreThanMedicine" and "MoreThanMedicine" for the same author, and each match becomes its own row, which inflates the numbers. drop_duplicates() only removes rows that are identical, not overlapping matches. That's pretty interesting too!
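The double counting described above can be sketched without spaCy. Each match is a (start, end) token span, and when one span sits inside another (the bare tag inside the hashtag form), keeping only the longest span leaves one row per mention. The function name `keep_longest` and the sample spans are hypothetical; spaCy ships a similar helper, `spacy.util.filter_spans`, which works on Span objects.

```python
def keep_longest(spans):
    """Drop spans that overlap a longer span that was already kept.

    spans: list of (start, end) token offsets, with end exclusive.
    """
    # Longest first; ties are broken by the earliest start
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept = []
    taken = set()  # token positions already covered by a kept span
    for start, end in ordered:
        positions = set(range(start, end))
        if positions & taken:
            continue  # overlaps something longer, so skip it
        kept.append((start, end))
        taken |= positions
    return sorted(kept)


# Hypothetical tokens for one tweet: ["Vets", "are", "#", "MoreThanMedicine"]
# The phrase list matches "# MoreThanMedicine" as (2, 4) and "MoreThanMedicine" as (3, 4)
raw_matches = [(2, 4), (3, 4)]
print(keep_longest(raw_matches))  # [(2, 4)]
```

Applying a filter like this per tweet before appending to the results would give one row per mention instead of two, at the cost of hiding the shorter match.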

##### Overall I'm happy with how this week's exercise turned out, and I'm excited for a little break. I appreciate your feedback this term, Professor Sloan! Have a great Thanksgiving and break! I hope to see you around!