# Analyzing AAVE Usage in Popular Comedic Media Through Semantic Textual Analysis


## Formatting the Black tweet documents into one corpus

In [None]:
import pandas as pd

# add in my black tweets from AAVE Corpus
thanksgiving = pd.read_csv('/content/drive/MyDrive/Black_Twitter/thanksgivingclapback_16.csv')
black_moms = pd.read_csv('/content/drive/MyDrive/Black_Twitter/blackmoms_19.csv')
oscars = pd.read_csv('/content/drive/MyDrive/Black_Twitter/oscarssowhite_15.csv')

In [None]:
# now i want to join my dataframes together to get one large tweet dataframe
# first im going to subset each dataframe for easier manipulation
# .copy() makes each subset an independent dataframe, so adding a column
# in the next cell does not raise pandas' SettingWithCopyWarning
oscars_sub = oscars[['date', 'content', 'url', 'hashtags']].copy()
thanksgiving_sub = thanksgiving[['date', 'content', 'url', 'hashtags']].copy()
black_moms_sub = black_moms[['date', 'content', 'url', 'hashtags']].copy()

In [None]:
# it may be useful to give the tweets a unique identifier so i know what topic it comes from
# so lets do that
oscars_sub['topic'] = 'oscars'
thanksgiving_sub['topic'] = 'thanksgiving'
black_moms_sub['topic'] = 'black_moms'


In [None]:
# then concatenate to one df
twitter_df = pd.concat([oscars_sub, thanksgiving_sub, black_moms_sub], ignore_index=True)
twitter_df

Unnamed: 0,date,content,url,hashtags,topic
0,2015-01-29 23:26:09+00:00,It's a trip that #OscarsSoWhite named their n...,https://twitter.com/RALPHREMINGTON/status/5609...,['OscarsSoWhite'],oscars
1,2015-01-29 23:22:01+00:00,#Oscarssowhite calls out racial disparity in H...,https://twitter.com/thedbk/status/560940962357...,['Oscarssowhite'],oscars
2,2015-01-29 23:14:42+00:00,#OscarsSoWhite First all white group of actin...,https://twitter.com/MeYouAndLaughs/status/5609...,"['OscarsSoWhite', 'joanrivers']",oscars
3,2015-01-29 21:38:42+00:00,RT @JeffJSays: #ThatADHDShow 002:#OscarsSoWhit...,https://twitter.com/seveneighteen_/status/5609...,"['ThatADHDShow', 'OscarsSoWhite']",oscars
4,2015-01-29 20:27:55+00:00,"How will we know when Hollywood,TV and movies ...",https://twitter.com/CrowdPleeza/status/5608971...,['OscarsSowhite'],oscars
...,...,...,...,...,...
1518,2019-06-28 15:18:44+00:00,@JeffersonJCPH @JeffersonUniv Thank you for th...,https://twitter.com/NatakiDuncan/status/114462...,"['WomensHealth', 'BlackMoms']",black_moms
1519,2019-06-28 14:22:26+00:00,So proud of these two. They’re doing big thing...,https://twitter.com/NOSAcadCoaching/status/114...,"['a4squad', 'favor', 'familymatters', 'lovemyf...",black_moms
1520,2019-06-28 09:03:26+00:00,#blackmoms will tell you if you home don't tur...,https://twitter.com/sdotvenom/status/114453174...,"['blackmoms', 'television']",black_moms
1521,2019-06-28 08:08:49+00:00,Just woke up and I'm already being shouted at ...,https://twitter.com/Brian12978608/status/11445...,"['africanparents', 'growingupblack', 'blackmoms']",black_moms


In [None]:
twitter_df.shape

(1523, 5)

In [None]:
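# index=False keeps the row index out of the file, so reloading it does not add an "Unnamed: 0" column
twitter_df.to_csv('blacktweets_joined_cleaned.csv', index=False)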

In [None]:
import re

# cleaning the tweet list to remove special characters (e.g., #, @, emojis)
def clean_tweet(tweet):
    tweet = re.sub(r'@\w+', '', tweet)  # remove mentions (@username)
    tweet = re.sub(r'#', '', tweet) # remove the '#' symbol (the hashtag text itself is kept)
    tweet = re.sub(r'http\S+', '', tweet) # remove URLs that start with http
    tweet = re.sub(r'[^\x00-\x7F]+', '', tweet) # remove non-ASCII characters
    tweet = re.sub(r'[^a-zA-Z\s]', '', tweet) # remove special characters and numbers (keep letters and spaces)
    tweet = ' '.join(tweet.split()) # remove extra whitespace
    return tweet.lower() # convert to lowercase
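
To make the effect of each step concrete, here is a minimal sketch that runs the same function on two made-up tweets (`sample_tweet` and `schemeless_link` are illustrative strings, not rows from the corpus). It shows that apostrophes are dropped, that hashtag text survives without its `#`, and that a link written without `http` is not removed, which would explain fragments such as `tcobbrbvdemusic` in the cleaned output.

```python
import re

def clean_tweet(tweet):
    tweet = re.sub(r'@\w+', '', tweet)          # remove mentions (@username)
    tweet = re.sub(r'#', '', tweet)             # remove the '#' symbol only
    tweet = re.sub(r'http\S+', '', tweet)       # remove URLs that start with http
    tweet = re.sub(r'[^\x00-\x7F]+', '', tweet) # remove non-ASCII characters (emojis)
    tweet = re.sub(r'[^a-zA-Z\s]', '', tweet)   # keep only letters and whitespace
    tweet = ' '.join(tweet.split())             # collapse extra whitespace
    return tweet.lower()

# made-up examples for illustration only
sample_tweet = "@user1 Y'all see this? #OscarsSoWhite \U0001F602 https://t.co/abc123 2015!!"
schemeless_link = "check t.co/abc now"

print(clean_tweet(sample_tweet))     # yall see this oscarssowhite
print(clean_tweet(schemeless_link))  # check tcoabc now
```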

In [None]:
# apply to each tweet individually for strings
# this will be helpful for a cursory analysis later
tweets_list = twitter_df['content'].astype(str).tolist()
cleaned_tweets_list = [clean_tweet(t) for t in tweets_list]

# view first 10 tweets
cleaned_tweets_list[:10]

['its a trip that oscarssowhite named their nominees on mlk day and will have their broadcast in february during black history month smdh',
 'oscarssowhite calls out racial disparity in hollywood today via',
 'oscarssowhite first all white group of acting nominees since mmh are you bragging or complaining no fun without joanrivers anyway',
 'rt thatadhdshow oscarssowhite feat music by boeiii',
 'how will we know when hollywoodtv and movies are diverse enough oscarssowhite',
 'rt thatadhdshowoscarssowhitefeat tcobbrbvdemusic by boeiii',
 'gt rt thatadhdshowoscarssowhitefeat tcobbrbvdemusic by boeiii',
 'the problem instead of oscarssowhite i just think it should be moviessowhite or hollywoodsowhite',
 'watch ava duvernays perfect response to her selma oscar snub via oscarssowhite',
 'why do i have a feeling that next years oscars will have tons of poc nominees oscarssowhite']

In [None]:
# apply to whole corpus: join every raw tweet into one string, then clean it
tweets_as_text = ' '.join(tweets_list)
cleaned_tweets = clean_tweet(tweets_as_text)
cleaned_tweets

'its a trip that oscarssowhite named their nominees on mlk day and will have their broadcast in february during black history month smdh oscarssowhite calls out racial disparity in hollywood today via oscarssowhite first all white group of acting nominees since mmh are you bragging or complaining no fun without joanrivers anyway rt thatadhdshow oscarssowhite feat music by boeiii how will we know when hollywoodtv and movies are diverse enough oscarssowhite rt thatadhdshowoscarssowhitefeat tcobbrbvdemusic by boeiii gt rt thatadhdshowoscarssowhitefeat tcobbrbvdemusic by boeiii the problem instead of oscarssowhite i just think it should be moviessowhite or hollywoodsowhite watch ava duvernays perfect response to her selma oscar snub via oscarssowhite why do i have a feeling that next years oscars will have tons of poc nominees oscarssowhite icymi on validmag oscar nominations lack of diversity spurs social media hashtag oscarssowhite twibnation oscar oylamasna katlanlarn beyaz s erkek orta

In [None]:
type(cleaned_tweets_list)

list

In [None]:
type(cleaned_tweets)

str

## Cleaning and formatting

Here we prepare the scripts and the AAVE dictionary for two steps: a preliminary check of whether the scripts contain AAVE terms, and the primary semantic similarity analysis.

In [None]:
with open("/content/drive/MyDrive/Comedy_Scripts/nextfriday.txt", "r", encoding="utf-8") as f:
    nf_script = f.read().lower()

In [None]:
nf_script

'"next friday" -- by ice cube\n\n       ext. front lawn - overhead shot - night\n\n       debo is laid out on the grass.\n\n                           craig (v.o.)\n\n                 in the movies, when you beat up the\n\n                 neighborhood bully; you suppose to live\n\n                 happily ever after.  but around here;\n\n                 that\'s when all the drama begins...\n\n       blue and red police lights flash over debo\'s body.  two\n\n       sheriffs walk into our frame and stand over debo.  they flash\n\n       their lights on him.\n\n                           craig (cont\'d) (v.o.)\n\n                 last friday; i got fired for the first\n\n                 time.  i got high for the first time.  i\n\n                 got shot at for the first time and i\n\n                 kicked debo\'s ass for the first time...\n\n       they get him to his feet; but he stumbles and falls in the\n\n       bushes like a knocked out prize fighter.  the sheriffs laugh\n\n 

In [None]:
# here's the non-Black comedy script (superbad)

with open("/content/drive/MyDrive/Comedy_Scripts/superbad.txt", "r", encoding="utf-8") as f:
    superbad_script = f.read().lower()

In [None]:
# create function to remove \n, \r, apostrophes and screenplay markers,
# then collapse runs of spaces into a single space
def cleaning(data):
    to_remove = [
        '\n',
        '\r',
        '\'',
        '(V.O.)',
        '(v.o.)',
        '(cont\'d)',
        'ext.',
        'int.',
        '...',
        '(contd)'
    ]
    # plain substring removal, applied in list order
    for item in to_remove:
        data = data.replace(item, '')
    # keep replacing double spaces until none remain
    while '  ' in data:
        data = data.replace('  ', ' ')
    return data
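
One thing to watch with plain substring removal is that `'ext.'` and `'int.'` also match the ends of ordinary words such as "next." and "point.". The sketch below is an alternative, not the method used in this notebook: it assumes a word-boundary regex is acceptable, and `SLUGLINE_PREFIX` and `strip_sluglines` are hypothetical names introduced only for illustration.

```python
import re

# \b requires "ext"/"int" to start a word, so "next." and "point." are left alone
SLUGLINE_PREFIX = re.compile(r"\b(?:ext|int)\.\s*")

def strip_sluglines(text):
    """Remove screenplay location markers (ext. / int.) without touching other words."""
    return SLUGLINE_PREFIX.sub("", text)

# made-up sentence for illustration only
sample = "ext. front lawn - night. see you next. what is the point."

plain = sample.replace("ext.", "").replace("int.", "")
bounded = strip_sluglines(sample)

print(repr(plain))    # ' front lawn - night. see you n what is the po'
print(repr(bounded))  # 'front lawn - night. see you next. what is the point.'
```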

In [None]:
next_fri_cleaned = cleaning(nf_script)
next_fri_cleaned

'"next friday" -- by ice cube front lawn - overhead shot - night debo is laid out on the grass. craig in the movies, when you beat up the neighborhood bully; you suppose to live happily ever after. but around here; thats when all the drama begins blue and red police lights flash over debos body. two sheriffs walk into our frame and stand over debo. they flash their lights on him. craig last friday; i got fired for the first time. i got high for the first time. i got shot at for the first time and i kicked debos ass for the first time they get him to his feet; but he stumbles and falls in the bushes like a knocked out prize fighter. the sheriffs laugh at him. debo looks dazed and confused. the sheriffs help him out the bushes and start to cuff him. craig i was the man that night; and debo ended up going to jail for a couple of years. but he told ezal he was getting out next friday. he said, when he see me, he was gonna smoke me on the spot they walk him out of frame fade to black. over 

In [None]:
# cleaning superbad
superbad_cleaned = cleaning(superbad_script)
superbad_cleaned

' superbad written by seth rogen & evan goldberg july 20, 2006 opening credits over super-funky blaxploitation-style music, which builds to an exciting crescendo filling us with the expectation of a thrilling, action-packed opening sequence. instead we get: seths car - morning seth, seventeen, a bit heavyset, in the midst of a sad attempt at growing a goatee and clearly a terrible driver, cruises along while fiddling with the cd player. he pulls out his cell and dials. seth yo. intercut with: evans house - kitchen - continuous2 2 evan, seventeen, a little too tall and slim, a boy who clearly never figured out how to style his hair, is finishing off a bowl of cereal. he is on his cell phone. evan whats up? seth i was doing research last night, for next year, and i think im gonna go with bang bus. evan which ones bang bus? seth the one where they bang the chicks on the bus. thirteen bucks a month. total access, live web cam feed. the works. itll be like im on the bus, banging them myself

In [None]:
# grabbing my file with AAVE phrases and definitions
aave_terms = pd.read_csv("/content/drive/MyDrive/AAVE_Dictionary_Cleaned.csv")
aave_terms

Unnamed: 0.1,Unnamed: 0,term,definition
0,0.0,about that,being passionate about or associated with some...
1,1.0,aggy,aggressive
2,2.0,aight,alright
3,3.0,ak-matic,AK47 ASSUALT RIFFLE
4,4.0,asf,emphasis
...,...,...,...
235,225.0,yas,yes
236,226.0,you ain't even,you aren't even
237,227.0,you better than me,much worse
238,228.0,"you got the right idea, but the wrong bxtch",they're not the person


In [None]:
# creating list with only the terms
aave_terms_list = aave_terms['term'].tolist()
aave_terms_list

['about that',
 'aggy',
 'aight ',
 'ak-matic',
 'asf',
 'as u should',
 'ate',
 'ate and left no crumbs',
 'ax',
 'aye',
 'back on my bs',
 'bae',
 'bank',
 'the bar is on the floor',
 'basic',
 'beef',
 'been',
 'if',
 'beat',
 'bet',
 'bet',
 'beyotch',
 'bffr',
 'bip',
 'biscuit',
 'bizzack',
 'boutta',
 'boi',
 'bomb',
 'boo',
 'booked',
 'bop',
 'boss',
 'brizzle',
 'bruh',
 'buckets',
 'bumpin',
 'bussin',
 'but blm, right?',
 'cap',
 'carry',
 'cat',
 'catch these hands',
 'cheesin',
 'chile',
 'chill',
 'clap emoji in between words',
 'coin',
 'come for',
 'cop',
 'cray cray',
 'crib',
 'cuddy',
 'cuh',
 'da hell',
 'dap',
 'deadass',
 'dig it',
 'dime',
 'din',
 'dip',
 'done',
 'dope',
 'dough',
 'down bad',
 'drag',
 'drip',
 'drop',
 'dry',
 'errybody',
 'errybody and they mama',
 'extra',
 'fam',
 "that's facts",
 'feeling some type of way',
 'fierce',
 'filthy',
 'fire',
 'fleek',
 'flex',
 'fo sho',
 'fuck outta here',
 'fuck with',
 'g',
 'gag',
 'game too strong',
 'g

With help from Stack Overflow, I use n-grams to find AAVE terms in the scripts. Splitting a script on spaces alone would miss the AAVE phrases that span several words, so each script is scanned for word sequences of every length that occurs in the dictionary. The forum post used NLTK's `ngrams`, so that is what I use here.



In [None]:
from nltk import ngrams
from collections import Counter, defaultdict

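# group the AAVE terms by how many words they contain:
# d[1] holds single words, d[2] two-word phrases, and so on
d = defaultdict(list)
for i in aave_terms_list:
    k = i.split()  # split() also drops stray spaces, e.g. 'aight ' becomes ['aight']
    d[len(k)].append(tuple(k))

print(d)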

defaultdict(<class 'list'>, {2: [('about', 'that'), ('come', 'for'), ('cray', 'cray'), ('da', 'hell'), ('dig', 'it'), ('down', 'bad'), ("that's", 'facts'), ('fo', 'sho'), ('fuck', 'with'), ('git', 'box'), ('go', 'off'), ('good', 'looks'), ('have', 'time'), ('high', "cappin'"), ('hit', 'different'), ("it's", 'giving'), ("it's", 'over'), ('killin', 'it'), ('killing', 'it'), ('make', 'bank'), ('miss', 'girl'), ('miss', 'gurl'), ('n', 'word'), ('not', 'the'), ('not', 'you'), ('now', 'why'), ('on', 'god'), ('on', 'point'), ('on', 'that'), ('outta', 'pocket'), ('popping', 'off'), ('running', 'hands'), ('shoe', 'game'), ('the', 'smoke'), ('stank', 'face'), ('stay', 'woke'), ('straight', 'up'), ("that's", 'tuff'), ("that's", 'wraps'), ('the', 'struggle'), ('movin', 'weird'), ('moving', 'weird'), ('throwing', 'shade'), ('tried', 'it'), ('wayment', 'now'), ("what's", 'poppin'), ('wus', 'poppin')], 1: [('aggy',), ('aight',), ('ak-matic',), ('asf',), ('ate',), ('ax',), ('aye',), ('bae',), ('bank',

This is the explanation for this next part: "Then split my_text into a list, and for each key in d find the corresponding n-grams and build a Counter from the result. Then for each value in that specific key in d, update with the counts from the Counter"

In [None]:
def ngram_counter(my_text_split, d):
    match_counts = dict()
    for n,v in d.items():
        c = Counter(ngrams(my_text_split, n))
        for k in v:
            if k in c:
                match_counts[k] = c[k]
    return match_counts
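
A toy run can make the counting logic easier to follow. The sketch below is self-contained and uses `simple_ngrams`, a hypothetical stand-in for `nltk.ngrams` so that it runs without NLTK; the three terms and the sentence are made up for illustration.

```python
from collections import Counter, defaultdict

def simple_ngrams(tokens, n):
    """Yield consecutive n-token tuples (hypothetical stand-in for nltk.ngrams)."""
    return zip(*(tokens[i:] for i in range(n)))

def ngram_counter(tokens, terms_by_length):
    match_counts = {}
    for n, terms in terms_by_length.items():
        # count every n-gram of this length once, then look each term up
        counts = Counter(simple_ngrams(tokens, n))
        for term in terms:
            if term in counts:
                match_counts[term] = counts[term]
    return match_counts

# made-up mini dictionary and text
toy_terms = ["bet", "fo sho", "i feel you"]
terms_by_length = defaultdict(list)
for term in toy_terms:
    words = term.split()
    terms_by_length[len(words)].append(tuple(words))

toy_text = "i feel you fo sho bet you feel me bet"
toy_matches = ngram_counter(toy_text.split(), terms_by_length)
print(toy_matches)
# {('bet',): 2, ('fo', 'sho'): 1, ('i', 'feel', 'you'): 1}
```

Matching is on exact tokens, so a word with punctuation still attached (for example `bet,`) would not be counted.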

In [None]:
next_fri_split = next_fri_cleaned.replace('.', '').split()
friday_matches = ngram_counter(next_fri_split, d)
friday_matches

{('about', 'that'): 1,
 ('on', 'that'): 2,
 ('the', 'smoke'): 3,
 ('ate',): 1,
 ('bank',): 1,
 ('been',): 9,
 ('if',): 25,
 ('beat',): 6,
 ('boss',): 1,
 ('done',): 2,
 ('drop',): 1,
 ('extra',): 1,
 ('fire',): 1,
 ('ghetto',): 4,
 ('hit',): 14,
 ('ice',): 1,
 ('imma',): 24,
 ('laid',): 1,
 ('mad',): 2,
 ('mean',): 4,
 ('n',): 1,
 ('playin',): 1,
 ('real',): 7,
 ('roll',): 6,
 ('school',): 2,
 ('snatched',): 1,
 ('thick',): 1,
 ('tight',): 3,
 ('trash',): 1,
 ('trip',): 2,
 ('wig',): 1,
 ('i', 'feel', 'you'): 1}

In [None]:
superbad_split = superbad_cleaned.replace('.', '').split()
superbad_matches = ngram_counter(superbad_split, d)
superbad_matches

{('about', 'that'): 2,
 ('dig', 'it'): 1,
 ('not', 'the'): 1,
 ('on', 'that'): 3,
 ('ate',): 1,
 ('been',): 17,
 ('if',): 38,
 ('beat',): 5,
 ('bet',): 2,
 ('buckets',): 2,
 ('bumpin',): 1,
 ('cap',): 1,
 ('cop',): 24,
 ('done',): 2,
 ('drag',): 1,
 ('drop',): 2,
 ('dry',): 1,
 ('extra',): 1,
 ('fire',): 1,
 ('hit',): 5,
 ('ice',): 1,
 ('laid',): 1,
 ('mad',): 3,
 ('mean',): 4,
 ('real',): 3,
 ('roll',): 4,
 ('school',): 11,
 ('spill',): 1,
 ('steady',): 2,
 ('tight',): 6,
 ('trash',): 2,
 ('trip',): 1}
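
The two match tables above can be compared directly with set operations. The following is a minimal sketch, with the counts copied from the two outputs; `as_phrases` is a hypothetical helper added for readability.

```python
# match counts copied from the two outputs above
friday_matches = {
    ('about', 'that'): 1, ('on', 'that'): 2, ('the', 'smoke'): 3,
    ('ate',): 1, ('bank',): 1, ('been',): 9, ('if',): 25, ('beat',): 6,
    ('boss',): 1, ('done',): 2, ('drop',): 1, ('extra',): 1, ('fire',): 1,
    ('ghetto',): 4, ('hit',): 14, ('ice',): 1, ('imma',): 24, ('laid',): 1,
    ('mad',): 2, ('mean',): 4, ('n',): 1, ('playin',): 1, ('real',): 7,
    ('roll',): 6, ('school',): 2, ('snatched',): 1, ('thick',): 1,
    ('tight',): 3, ('trash',): 1, ('trip',): 2, ('wig',): 1,
    ('i', 'feel', 'you'): 1,
}
superbad_matches = {
    ('about', 'that'): 2, ('dig', 'it'): 1, ('not', 'the'): 1, ('on', 'that'): 3,
    ('ate',): 1, ('been',): 17, ('if',): 38, ('beat',): 5, ('bet',): 2,
    ('buckets',): 2, ('bumpin',): 1, ('cap',): 1, ('cop',): 24, ('done',): 2,
    ('drag',): 1, ('drop',): 2, ('dry',): 1, ('extra',): 1, ('fire',): 1,
    ('hit',): 5, ('ice',): 1, ('laid',): 1, ('mad',): 3, ('mean',): 4,
    ('real',): 3, ('roll',): 4, ('school',): 11, ('spill',): 1, ('steady',): 2,
    ('tight',): 6, ('trash',): 2, ('trip',): 1,
}

def as_phrases(keys):
    """Turn n-gram tuples back into readable, sorted phrases."""
    return sorted(' '.join(k) for k in keys)

shared = as_phrases(friday_matches.keys() & superbad_matches.keys())
only_friday = as_phrases(friday_matches.keys() - superbad_matches.keys())
only_superbad = as_phrases(superbad_matches.keys() - friday_matches.keys())

print("shared:", shared)
print("only in next friday:", only_friday)
print("only in superbad:", only_superbad)
```

In these outputs the terms shared by both scripts include everyday words such as `if` and `been`, while `imma` and `ghetto` appear only in Next Friday.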

## Starting the Semantic Textual Similarity analysis
I'm going to create sentence embeddings of both the scripts and the AAVE tweet corpus with sentence transformers (model = all-MiniLM-L12-v2) and use them to measure the semantic similarity between each script and the corpus. Each script is encoded as a single embedding and compared with the average of the tweet embeddings in the AAVE corpus. The comparison is the cosine similarity between the two embeddings, a value from -1 to 1 that represents the overall similarity between the AAVE Corpus and a specific script. A value closer to 1 means higher similarity, a value near 0 means little or no relatedness, and a negative value (which does occur for some individual tweets in the results) means the embeddings point in opposing directions.
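
As a minimal sketch of the metric itself, the function below computes cosine similarity for small hand-made vectors in plain Python. The `cosine` helper and the example vectors are illustrative only; the notebook uses scikit-learn's `cosine_similarity` on real embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine([1.0, 2.0], [2.0, 4.0])    # close to 1
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])        # 0
opposite = cosine([1.0, 2.0], [-1.0, -2.0])        # close to -1

print(round(same_direction, 4), round(orthogonal, 4), round(opposite, 4))
```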


[Information on the model used (all-MiniLM-L12-v2)](https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2)

_The variables for the AAVE Corpus are:_
- Tweets as a list: cleaned_tweets_list
- Full tweet corpus: cleaned_tweets

In [None]:
# necessary libraries
import re
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

In [None]:
def load_script(file_path):
    # read a script file as-is (unlike the earlier cells, this does not lowercase the text)
    with open(file_path, 'r', encoding='utf-8') as f:
        return f.read()

In [None]:
# adding and cleaning the scripts to prepare for analysis
# we already have superbad and next friday above

#bridesmaids (2011)
bridesmaids_script = load_script('/content/drive/MyDrive/Comedy_Scripts/bridesmaids.txt')
bridesmaids = cleaning(bridesmaids_script)

# the hangover (2009)
hangover_script = load_script('/content/drive/MyDrive/Comedy_Scripts/hangoverthe.txt')
hangover = cleaning(hangover_script)

# hot tub time machine (2010)
hot_tub_script = load_script('/content/drive/MyDrive/Comedy_Scripts/hottubtimemachine.txt')
hot_tub = cleaning(hot_tub_script)

# harold and kumar go to white castle (2004)
harold_script = load_script('/content/drive/MyDrive/Comedy_Scripts/haroldandkumargotowhitecastle.txt')
harold_kumar = cleaning(harold_script)

# scott pilgrim vs. the world (2010)
scott_script = load_script('/content/drive/MyDrive/Comedy_Scripts/scottpilgrimvstheworld.txt')
scott = cleaning(scott_script)

# pineapple express (2008)
pineapple_script = load_script('/content/drive/MyDrive/Comedy_Scripts/pineappleexpress.txt')
pineapple = cleaning(pineapple_script)

# ted (2012)
ted_script = load_script('/content/drive/MyDrive/Comedy_Scripts/ted.txt')
ted = cleaning(ted_script)

In [None]:
# tropic thunder (2008)
tropic_script = load_script('/content/drive/MyDrive/Comedy_Scripts/tropicthunder.txt')
tropic = cleaning(tropic_script)

In [None]:
# initializing the embedding model
# using 'all-MiniLM-L12-v2'
model = SentenceTransformer('all-MiniLM-L12-v2')

def analyze_sim(script_content, tweets_list):
    # create embeddings directly from the script content
    script_embedding = model.encode([script_content])
    tweet_embeddings = model.encode(tweets_list)

    # calculate average embedding for the tweet corpus
    avg_tweet_embedding = np.mean(tweet_embeddings, axis=0).reshape(1, -1)

    # cosine similarity b/w the script and average tweet embeddings
    similarity_score = cosine_similarity(script_embedding, avg_tweet_embedding)[0][0]

    # cosine similarity between scripts and each tweet
    individual_similarities = cosine_similarity(script_embedding, tweet_embeddings)[0]

    # collect the results in a dictionary
    results = {
        'script_to_avg_tweets': similarity_score,
        'individual_tweet_similarities': individual_similarities,
        'mean_individual_similarity': np.mean(individual_similarities),
        'max_individual_similarity': np.max(individual_similarities),
        'min_individual_similarity': np.min(individual_similarities),
        'std_individual_similarity': np.std(individual_similarities),
        'num_tweets': len(tweets_list)
    }

    return results
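
One caveat worth checking: sentence-transformers models truncate any input longer than `model.max_seq_length` word pieces, so a full script passed as one string may be represented only by its opening portion. A possible workaround, which is not part of the analysis in this notebook, is to split the script into chunks, embed each chunk, and average the chunk embeddings. The sketch below assumes that approach; `chunk_words`, `mean_vector`, `embed_long_text`, and `toy_encode` are hypothetical names, and the chunk size in words is only a rough proxy for word pieces.

```python
def chunk_words(text, chunk_size=100):
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def embed_long_text(text, encode, chunk_size=100):
    """Embed each chunk with `encode` (a function: list of str -> list of vectors), then average."""
    chunks = chunk_words(text, chunk_size)
    return mean_vector(encode(chunks))

# toy encoder for illustration only: [word count, number of "e" characters]
def toy_encode(chunks):
    return [[float(len(c.split())), float(c.count("e"))] for c in chunks]

toy_chunks = chunk_words("a b c d e f g", chunk_size=3)
toy_embedding = embed_long_text("a b c d e f g", toy_encode, chunk_size=3)
print(toy_chunks)  # ['a b c', 'd e f', 'g']
```

With the real model, `model.encode` could be passed as `encode`; whether chunk averaging changes the scores reported here has not been tested.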

Extra explanation for a few lines of code:

    # 1. calculate average embedding for the tweet corpus
    avg_tweet_embedding = np.mean(tweet_embeddings, axis=0).reshape(1, -1)

    # 2. cosine similarity b/w the script and average tweet embeddings
    similarity_score = cosine_similarity(script_embedding, avg_tweet_embedding)[0][0]

    # 3. cosine similarity between scripts and each tweet
    individual_similarities = cosine_similarity(script_embedding, tweet_embeddings)[0]

1. `.reshape(1, -1)` ensures that embedding vectors conform to the 2D input format required by cosine_similarity.

2. A script–to–average comparison produces a 1×1 similarity matrix, requiring `[0][0]` to extract the scalar value.

3. Script–to–all-tweets comparison produces a 1×N vector, so only `[0]` is needed.

These shape-handling steps ensure that similarity computations run correctly and return interpretable numerical outputs.

In [None]:
# function for formatting the results of a script analysis
def print_results(results, tweets_list=None, top_n=5):
    # header
    print("SEMANTIC SIMILARITY ANALYSIS RESULTS")

    # give script to avg corpus embedding score
    print(f"\nScript vs Average Tweet Embedding:")
    print(f"  Cosine Similarity: {results['script_to_avg_tweets']:.4f}") # had help getting the ":.4f" parameter
    # individual tweet info
    print(f"\nIndividual Tweet Statistics:")
    print(f"  Mean Similarity:   {results['mean_individual_similarity']:.4f}")
    print(f"  Max Similarity:    {results['max_individual_similarity']:.4f}")
    print(f"  Min Similarity:    {results['min_individual_similarity']:.4f}")
    print(f"  Std Deviation:     {results['std_individual_similarity']:.4f}")
    # show number of tweets
    print(f"\nNumber of tweets analyzed: {results['num_tweets']}")

    # ok i had to get help for this part
    # this prints the top_n most similar tweets (5 by default)
    if tweets_list is not None:
        sims = np.array(results['individual_tweet_similarities'])
        top_indices = np.argsort(sims)[-top_n:][::-1]

        print(f"\nTop {top_n} Most Similar Tweets:")
        for rank, idx in enumerate(top_indices, 1):
            sim_score = sims[idx]
            tweet_text = tweets_list[idx]
            print(f"\n  {rank}. Similarity: {sim_score:.4f}")
            print(f"     Tweet: {tweet_text[:100]}...")
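
The indexing in `np.argsort(sims)[-top_n:][::-1]` can be read in three steps: sort the indices by ascending similarity, keep the last `top_n`, then reverse them so the best match comes first. Here is a plain-Python sketch of the same idea, using `sorted` as a stand-in for `np.argsort` and made-up scores:

```python
# made-up similarity scores for illustration only
sims = [0.12, 0.35, 0.08, 0.31, 0.22]
top_n = 3

# indices ordered from lowest to highest score (what np.argsort returns)
ascending = sorted(range(len(sims)), key=lambda i: sims[i])

# keep the last top_n indices, then reverse so the highest score is first
top_indices = ascending[-top_n:][::-1]

print(ascending)    # [2, 0, 4, 3, 1]
print(top_indices)  # [1, 3, 4]
```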

## Result Output
Here I'm putting each script through the functions in order to get an output that has:
1. The cosine similarity score (Script vs. Tweet Average)
2. Evaluation metrics for script vs. individual tweet similarity scores (e.g., mean, max, min, standard deviation)
3. Number of tweets analyzed (1523 tweets)
4. Top 5 most similar tweets and their similarity scores


#### _Movies List:_
- Next Friday _(2000)_
- Superbad _(2007)_
- Bridesmaids _(2011)_
- The Hangover _(2009)_
- Hot Tub Time Machine _(2010)_
- Harold & Kumar Go to White Castle _(2004)_
- Scott Pilgrim vs. The World _(2010)_
- Pineapple Express _(2008)_
- Ted _(2012)_
- Tropic Thunder _(2008)_



In [None]:
# run the analysis
# first i'm going to try next friday since it is a Black film to test model performance
nxt_fri_results = analyze_sim(next_fri_cleaned, cleaned_tweets_list)
print_results(nxt_fri_results, cleaned_tweets_list, top_n=5)

SEMANTIC SIMILARITY ANALYSIS RESULTS

Script vs Average Tweet Embedding:
  Cosine Similarity: 0.2208

Individual Tweet Statistics:
  Mean Similarity:   0.1268
  Max Similarity:    0.3536
  Min Similarity:    -0.0911
  Std Deviation:     0.0597

Number of tweets analyzed: 1523

Top 5 Most Similar Tweets:

  1. Similarity: 0.3536
     Tweet: cops touching on kids again its in the dna shopblack blacktech supportblackbusiness pitchblack black...

  2. Similarity: 0.3226
     Tweet: heard someone getting their life slapped out of them in target on black friday thanksgivingclapback...

  3. Similarity: 0.3170
     Tweet: i aint got time to be black friday shopping you aint got the money either thanksgivingclapback...

  4. Similarity: 0.3170
     Tweet: i aint got time to be black friday shopping you aint got the money either thanksgivingclapback...

  5. Similarity: 0.3042
     Tweet: when youre talkin slick smokin one on the porch and that cool uncle tries to slide into the convo th...


In [None]:
# next trying superbad
superbad_results = analyze_sim(superbad_cleaned, cleaned_tweets_list)
print_results(superbad_results, cleaned_tweets_list, top_n=5)

SEMANTIC SIMILARITY ANALYSIS RESULTS

Script vs Average Tweet Embedding:
  Cosine Similarity: 0.2013

Individual Tweet Statistics:
  Mean Similarity:   0.1156
  Max Similarity:    0.3176
  Min Similarity:    -0.0867
  Std Deviation:     0.0604

Number of tweets analyzed: 1523

Top 5 Most Similar Tweets:

  1. Similarity: 0.3176
     Tweet: the throws in a diversity montage nice what yaw got cause yaw getting slammed right about now oscars...

  2. Similarity: 0.3056
     Tweet: sequels remakes amp oscarssowhite whats the deal hollywood via...

  3. Similarity: 0.3029
     Tweet: that adhd show presents oscarssowhite adhd adhdshow adhdpodcast adhdshow hollywod...

  4. Similarity: 0.3015
     Tweet: will watch now but oscarssowhite but i am a fan of amp they under damage control...

  5. Similarity: 0.3001
     Tweet: thank you for being true to urself amp let oscarssowhite know you can be blk amp talented...


In [None]:
# bridesmaids
bridesmaids_results = analyze_sim(bridesmaids, cleaned_tweets_list)
print_results(bridesmaids_results, cleaned_tweets_list, top_n=5)

SEMANTIC SIMILARITY ANALYSIS RESULTS

Script vs Average Tweet Embedding:
  Cosine Similarity: 0.2217

Individual Tweet Statistics:
  Mean Similarity:   0.1274
  Max Similarity:    0.3421
  Min Similarity:    -0.0500
  Std Deviation:     0.0692

Number of tweets analyzed: 1523

Top 5 Most Similar Tweets:

  1. Similarity: 0.3421
     Tweet: the odd couple oscarssowhite...

  2. Similarity: 0.3308
     Tweet: auntie you still single me you still in and out the abortion clinic thanksgivingclapback woah...

  3. Similarity: 0.3176
     Tweet: speaking of the oscars wrote a new piece on oscarssowhite for read it here...

  4. Similarity: 0.3149
     Tweet: aunt you still wearing them fake weaves me you still faking that pregnancy thanksgivingclapback...

  5. Similarity: 0.3121
     Tweet: aunt didnt you wear that last year me didnt you wear a wedding ring last year thanksgivingclapback...


In [None]:
# the hangover
hangover_results = analyze_sim(hangover, cleaned_tweets_list)
print_results(hangover_results, cleaned_tweets_list, top_n=5)

SEMANTIC SIMILARITY ANALYSIS RESULTS

Script vs Average Tweet Embedding:
  Cosine Similarity: 0.2339

Individual Tweet Statistics:
  Mean Similarity:   0.1344
  Max Similarity:    0.3603
  Min Similarity:    -0.0857
  Std Deviation:     0.0774

Number of tweets analyzed: 1523

Top 5 Most Similar Tweets:

  1. Similarity: 0.3603
     Tweet: u look like ure hungover n strugglin me like ur belly n the button to ur jeans thanksgivingclapback...

  2. Similarity: 0.3272
     Tweet: thanksgivingclapback latino edition...

  3. Similarity: 0.3151
     Tweet: its the most wonderful time of the year when i eat leftovers and read thanksgivingclapback quotes...

  4. Similarity: 0.3146
     Tweet: looking for some giggles look up thanksgivingclapback...

  5. Similarity: 0.3140
     Tweet: im laughing so hard reading through thanksgivingclapback...


In [None]:
# httm
hot_tub_results = analyze_sim(hot_tub, cleaned_tweets_list)
print_results(hot_tub_results, cleaned_tweets_list, top_n=5)

SEMANTIC SIMILARITY ANALYSIS RESULTS

Script vs Average Tweet Embedding:
  Cosine Similarity: 0.2040

Individual Tweet Statistics:
  Mean Similarity:   0.1172
  Max Similarity:    0.3534
  Min Similarity:    -0.0808
  Std Deviation:     0.0714

Number of tweets analyzed: 1523

Top 5 Most Similar Tweets:

  1. Similarity: 0.3534
     Tweet: and thats how you hershey hersheypark hersheyparkhappy visithersheyharrisburg blackmoms blackmomsblo...

  2. Similarity: 0.3334
     Tweet: crazy uncle you have on too much makeup me and you snort up too much cocaine thanksgivingclapback...

  3. Similarity: 0.3330
     Tweet: and thats how you hershey hersheypark hersheyparkhappy visithersheyharrisburg blackmoms blackmomsblo...

  4. Similarity: 0.3245
     Tweet: grandma pick up yo pants me pick up yo titties thanksgivingclapback...

  5. Similarity: 0.3093
     Tweet: mom honey this mac amp cheese kinda dry pops just like ya pussy me slowly dying thanksgivingclapback...


In [None]:
# harold and kumar
harold_results = analyze_sim(harold_kumar, cleaned_tweets_list)
print_results(harold_results, cleaned_tweets_list, top_n=5)

SEMANTIC SIMILARITY ANALYSIS RESULTS

Script vs Average Tweet Embedding:
  Cosine Similarity: 0.0892

Individual Tweet Statistics:
  Mean Similarity:   0.0512
  Max Similarity:    0.3035
  Min Similarity:    -0.1327
  Std Deviation:     0.0744

Number of tweets analyzed: 1523

Top 5 Most Similar Tweets:

  1. Similarity: 0.3035
     Tweet: white hollywood cartoon oscarssowhite oscars hollywood...

  2. Similarity: 0.2975
     Tweet: and the oscar goes to this white person ltfill in the blankgt oscarssowhite...

  3. Similarity: 0.2704
     Tweet: hollywood we have a problem great piece on whitewashing by oscarssowhite diversity...

  4. Similarity: 0.2686
     Tweet: oscars opt for a rather white poster oscarssowhite...

  5. Similarity: 0.2631
     Tweet: trying to write about oscarnoms but im too blinded but their whiteness ugh oscarssowhite oscars...


In [None]:
# scott pilgrim
scott_results = analyze_sim(scott, cleaned_tweets_list)
print_results(scott_results, cleaned_tweets_list, top_n=5)

SEMANTIC SIMILARITY ANALYSIS RESULTS

Script vs Average Tweet Embedding:
  Cosine Similarity: 0.1146

Individual Tweet Statistics:
  Mean Similarity:   0.0658
  Max Similarity:    0.2399
  Min Similarity:    -0.1304
  Std Deviation:     0.0514

Number of tweets analyzed: 1523

Top 5 Most Similar Tweets:

  1. Similarity: 0.2399
     Tweet: november thanksgiving thanksgivingclapback thanksgivingbreak picoftheday art...

  2. Similarity: 0.2283
     Tweet: print edition thanksgivingclapback...

  3. Similarity: 0.2218
     Tweet: with all the controversies over oscarssowhite and selmamovie im realizing what the world really look...

  4. Similarity: 0.2073
     Tweet: history vs movie via selmamovie blacktwitter oscarssowhite americansniper culture entertainment...

  5. Similarity: 0.2071
     Tweet: aunt all u do is stay on that spacebook me while ur husband stay on tinder thanksgivingclapback...


In [None]:
# pineapple express
pineapple_results = analyze_sim(pineapple, cleaned_tweets_list)
print_results(pineapple_results, cleaned_tweets_list, top_n=5)

SEMANTIC SIMILARITY ANALYSIS RESULTS

Script vs Average Tweet Embedding:
  Cosine Similarity: 0.1743

Individual Tweet Statistics:
  Mean Similarity:   0.1001
  Max Similarity:    0.2686
  Min Similarity:    -0.0517
  Std Deviation:     0.0461

Number of tweets analyzed: 1523

Top 5 Most Similar Tweets:

  1. Similarity: 0.2686
     Tweet: oscarssowhite they dont see color illustration by victoria courtney...

  2. Similarity: 0.2572
     Tweet: white on white good oscarssowhite graphic by the lat...

  3. Similarity: 0.2468
     Tweet: hands down best article on oscars american hypermasculine by oscarssowhite oscarssomale...

  4. Similarity: 0.2398
     Tweet: mom you want white or dark meat everyone looks at me me white my brother u sure you like everything ...

  5. Similarity: 0.2324
     Tweet: if you dont want another oscarssowhite go watch mcfarland usa also spare parts with and black and wh...


In [None]:
# ted
ted_results = analyze_sim(ted, cleaned_tweets_list)
print_results(ted_results, cleaned_tweets_list, top_n=5)

SEMANTIC SIMILARITY ANALYSIS RESULTS

Script vs Average Tweet Embedding:
  Cosine Similarity: 0.1471

Individual Tweet Statistics:
  Mean Similarity:   0.0845
  Max Similarity:    0.3340
  Min Similarity:    -0.0860
  Std Deviation:     0.0523

Number of tweets analyzed: 1523

Top 5 Most Similar Tweets:

  1. Similarity: 0.3340
     Tweet: podcast with and oscarssowhite arrow theflash doctorwho goldenglobeawards vancouver...

  2. Similarity: 0.2909
     Tweet: rob rogers cartoon oscars oscarssowhite...

  3. Similarity: 0.2417
     Tweet: uncle jimmy im leaving because i heard you were coming me im coming because i heard you were leaving...

  4. Similarity: 0.2340
     Tweet: conan tells it how it is truth oscarssowhite blizzardof...

  5. Similarity: 0.2196
     Tweet: that adhd show presents oscarssowhite adhd adhdshow adhdpodcast adhdshow hollywod...


In [None]:
# tropic thunder
tropic_results = analyze_sim(tropic, cleaned_tweets_list)
print_results(tropic_results, cleaned_tweets_list, top_n=5)

SEMANTIC SIMILARITY ANALYSIS RESULTS

Script vs Average Tweet Embedding:
  Cosine Similarity: 0.1872

Individual Tweet Statistics:
  Mean Similarity:   0.1076
  Max Similarity:    0.3101
  Min Similarity:    -0.0604
  Std Deviation:     0.0516

Number of tweets analyzed: 1523

Top 5 Most Similar Tweets:

  1. Similarity: 0.3101
     Tweet: selmas ava duvernay amp david oyelowo teaming up for hurricane katrina movie cant wait for oscar to ...

  2. Similarity: 0.2965
     Tweet: juno turned new england so white that its already a shooin for an oscar next year oscarssowhite bliz...

  3. Similarity: 0.2903
     Tweet: history vs movie via selmamovie blacktwitter oscarssowhite americansniper culture entertainment...

  4. Similarity: 0.2767
     Tweet: all this snow and its still not as white as the oscars this year oscarssowhite...

  5. Similarity: 0.2726
     Tweet: oscarssowhite they named the winter storm juno...


## Reflection:

### _Initialization_
To evaluate how AAVE appears across different comedic films, I first applied an n-gram dictionary matching procedure. Using a cleaned AAVE glossary, I scanned each script for single-word and multi-word AAVE expressions. _Next Friday_ served as the control because it is widely considered Black media. It contained more identifiable AAVE terms than movies such as _Bridesmaids_, _Superbad_, or _The Hangover_, but AAVE terms still appeared in the non-Black films. This step confirmed that AAVE usage is present in some of the scripts written from a non-Black perspective. I was able to ascertain this through my own familiarity with AAVE, however, and researchers less familiar with it may find reaching the same conclusion more challenging and labor-intensive.
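As a rough illustration of the n-gram dictionary matching described above, the sketch below counts glossary hits in a piece of text. It is a minimal sketch, not the exact procedure used in this notebook: the function names `tokenize` and `match_glossary`, the `max_n` cutoff, and the two glossary entries are all placeholders chosen for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def match_glossary(text, glossary, max_n=3):
    """Count how often each glossary term (1 to max_n words) occurs in text."""
    tokens = tokenize(text)
    # Normalize glossary terms the same way as the text so they compare equal.
    terms = {" ".join(tokenize(term)) for term in glossary}
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of n tokens across the text.
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            if gram in terms:
                counts[gram] += 1
    return counts


# Placeholder glossary and sample line (assumptions, not the real glossary or a script).
sample_glossary = ["finna", "no cap"]
sample_line = "I'm finna leave, no cap. She finna go too."
hits = match_glossary(sample_line, sample_glossary)
print(hits)
```

Exact string matching of this kind only finds terms that are already in the glossary, which is one reason the step depends so heavily on the researcher's familiarity with AAVE.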

#### _Semantic Similarity Results (Cosine Similarity)_

Across all films, the semantic similarity scores cluster tightly together, regardless of whether the film uses noticeable AAVE features. A cursory grouping of the script-vs-average-tweet cosine similarities shows:

Highest Similarity Cluster: ~0.20–0.23
- Next Friday — 0.2208
- Bridesmaids — 0.2217
- The Hangover — 0.2339
- Superbad — 0.2013
- Hot Tub Time Machine — 0.2040

Mid Similarity Cluster
- Pineapple Express — 0.1743
- Ted — 0.1471
- Tropic Thunder — 0.1872

Outlier Low Similarity
- Harold & Kumar Go to White Castle — 0.0892 (least similar overall)
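The grouping above can be reproduced from the reported scores with a few lines of Python. This is only a sketch: the cutoffs `high=0.20` and `mid=0.14` are assumptions picked to match the clusters as listed, not thresholds used anywhere in the analysis.

```python
import statistics

# Script-vs-average-tweet cosine similarities as reported above.
scores = {
    "Next Friday": 0.2208,
    "Bridesmaids": 0.2217,
    "The Hangover": 0.2339,
    "Superbad": 0.2013,
    "Hot Tub Time Machine": 0.2040,
    "Pineapple Express": 0.1743,
    "Ted": 0.1471,
    "Tropic Thunder": 0.1872,
    "Harold & Kumar Go to White Castle": 0.0892,
}


def bucket(score, high=0.20, mid=0.14):
    """Assign a score to a cluster using assumed cutoffs."""
    if score >= high:
        return "high"
    if score >= mid:
        return "mid"
    return "low"


clusters = {"high": [], "mid": [], "low": []}
for film, score in scores.items():
    clusters[bucket(score)].append(film)

# Spread of the scores across all listed films.
spread = max(scores.values()) - min(scores.values())
mean_score = statistics.mean(scores.values())
std_score = statistics.pstdev(scores.values())

print(clusters)
print(f"range={spread:.4f} mean={mean_score:.4f} std={std_score:.4f}")
```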

What is striking is that the film with the most AAVE lexical content _(Next Friday)_ has nearly the same semantic similarity score as films with comparatively little AAVE usage. This pattern extends to the mean similarities, max similarities, and similarity distributions, which also remain tightly clustered across all nine films despite dramatic differences in cultural grounding, dialogue style, register, and character demographics.
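For readers who want to see how summary metrics of this kind are derived, the sketch below computes them from raw embedding vectors with NumPy. It is not the `analyze_sim` function used in the cells above, whose body is not shown in this section. The function name `similarity_stats`, the choice to average the raw (unnormalized) tweet vectors, and the use of the population standard deviation are all assumptions.

```python
import numpy as np


def similarity_stats(script_vec, tweet_vecs):
    """Cosine similarity of one script vector against many tweet vectors."""
    script_vec = np.asarray(script_vec, dtype=float)
    tweet_vecs = np.asarray(tweet_vecs, dtype=float)

    # Scale everything to unit length so dot products are cosine similarities.
    script_unit = script_vec / np.linalg.norm(script_vec)
    tweet_units = tweet_vecs / np.linalg.norm(tweet_vecs, axis=1, keepdims=True)
    per_tweet = tweet_units @ script_unit

    # Script vs. the average tweet embedding (assumed: mean of the raw vectors).
    mean_vec = tweet_vecs.mean(axis=0)
    vs_average = float(mean_vec @ script_unit / np.linalg.norm(mean_vec))

    return {
        "vs_average": vs_average,
        "mean": float(per_tweet.mean()),
        "max": float(per_tweet.max()),
        "min": float(per_tweet.min()),
        "std": float(per_tweet.std()),  # population standard deviation
    }


# Toy 2-D vectors standing in for real embeddings (illustrative only).
stats = similarity_stats([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(stats)
```

In this toy case the script-vs-average score is higher than the mean of the per-tweet scores. When the tweet vectors have unit length and the mean per-tweet similarity is positive, that ordering always holds, because averaging the tweet vectors shrinks the norm in the denominator. This may be why every film's "Script vs Average Tweet Embedding" value sits above its "Mean Similarity" in the outputs above.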

SentenceTransformer MiniLM is an English-based model that primarily encodes:

- topic similarity (themes like conflict, relationships, humor, family, drama)
- emotional tone
- discourse structure

What it does not encode well is:
- vernacular features (AAVE syntax, habitual “be,” copula absence, negative concord)
- racialized linguistic identity
- code-switching
- cultural context
- sociolinguistic register

So even when AAVE-heavy tweets contain distinctly Black cultural and linguistic markers, the model maps them into a generic informal comedic dialogue vector space that many movies also occupy without respect to cultural contexts.

Consequently, _Next Friday_ and _Bridesmaids_ look almost the same to the model (≈0.22 cosine). Even movies with completely different racial and cultural grounding (_Hot Tub Time Machine, Superbad, The Hangover_) score within the same range. Movies with unusual genre structures (_Harold & Kumar, Scott Pilgrim_) only appear lower because they diverge in topic, not linguistic style.

#### _Implications: Structural Erasure of AAVE in Machine Learning Models_

These findings reveal an issue in NLP systems:

1. AAVE appears invisible in semantic embedding space:
Even when a script contains unmistakable AAVE terms and Black cultural rhetoric, the model collapses those linguistic features into generic English structures.

2. Topic overrides dialect:
The embeddings pick up on what is being talked about, not how it is being expressed.
So movies about interpersonal chaos, family conflict, and humor score similarly, even if the language used is culturally specific.

3. AAVE is underrepresented in training data:
MiniLM was trained on English corpora that are deemed the societal standard. As a result, AAVE patterns are treated as statistical noise, not meaningful signals.

4. Dialectal and cultural distinctions collapse into a single semantic space:
From the model's perspective, all these movies are just “comedy scripts about messy people,” regardless of race, culture, or vernacular.

5. This extends beyond AAVE:
Other minoritized Englishes (Chicano English, Hawaiian Pidgin, Cajun English, Caribbean English, Appalachian English) would similarly be flattened.

#### _Conclusion_
These results demonstrate that general-purpose NLP models can detect thematic similarity across texts but fail to represent or distinguish the linguistic, cultural, and racialized nuances encoded in AAVE. Semantic similarity models cannot be used to evaluate dialectal overlap without additional linguistic scaffolding. AAVE is effectively “standardized away” when mapped into embedding space. Machine learning systems reproduce existing linguistic hierarchies by privileging majority speech patterns. Without explicit dialect-aware training corpora, models will continue to erase the linguistic richness of marginalized communities.
