# Final project guidelines

**Note:** Use these guidelines if and only if you are pursuing a **final project of your own design**. For those taking the final exam instead of the project, see the (separate) [final exam notebook](https://github.com/wilkens-teaching/info3350-s22/blob/main/final_exam/exam.ipynb).

## Guidelines

These guidelines are intended for **undergraduates enrolled in INFO 3350**. If you are a graduate student enrolled in INFO 6350, you're welcome to consult the information below, but you have wider latitude to design and develop your project in line with your research goals.

### The task

Your task is to: identify an interesting problem connected to the humanities or humanistic social sciences that's addressable with the help of computational methods, formulate a hypothesis about it, devise an experiment or experiments to test your hypothesis, present the results of your investigations, and discuss your findings.

These tasks essentially replicate the process of writing an academic paper. You can think of your project as a paper in miniature.

You are free to present each of these tasks as you see fit. You should use narrative text (that is, your own writing in a markdown cell), citations of others' work, numerical results, tables of data, and static and/or interactive visualizations as appropriate. Total length is flexible and depends on the number of people involved in the work, as well as the specific balance you strike between the ambition of your question and the sophistication of your methods. But be aware that numbers never, ever speak for themselves. Quantitative results presented without substantial discussion will not earn high marks. 

Your project should reflect, at minimum, ten or more hours of work by each participant, though you will be graded on the quality of your work, not the amount of time it took you to produce it.

#### Pick an important and interesting problem!

No amount of technical sophistication will overcome a fundamentally uninteresting problem at the core of your work. You have seen many pieces of successful computational humanities research over the course of the semester. You might use these as a guide to the kinds of problems that interest scholars in a range of humanities disciplines. You may also want to spend some time in the library, reading recent books and articles in the professional literature. **Problem selection and motivation are integral parts of the project.** Do not neglect them.

### Format

You should submit your project as a Jupyter notebook, along with all data necessary to reproduce your analysis. If your dataset is too large to share easily, let us know in advance so that we can find a workaround. If you have a reason to prefer a presentation format other than a notebook, likewise let us know so that we can discuss the options.

Your report should have four basic sections (provided in cells below for ease of reference):

1. **Introduction and hypothesis.** What problem are you working on? Why is it interesting and important? What have other people said about it? What do you expect to find?
2. **Corpus, data, and methods.** What data have you used? Where did it come from? How did you collect it? What are its limitations or omissions? What major methods will you use to analyze it? Why are those methods the appropriate ones?
3. **Results.** What did you find? How did you find it? How should we read your figures?
4. **Discussion and conclusions.** What does it all mean? Do your results support your hypothesis? Why or why not? What are the limitations of your study and how might those limitations be addressed in future work?

Within each of those sections, you may use as many code and markdown cells as you like. You may, of course, address additional questions or issues not listed above.

All code used in the project should be present in the notebook (except for widely-available libraries that you import), but **be sure that we can read and understand your report in full without rerunning the code**. Be sure, too, to explain what you're doing along the way, both by describing your data and methods and by writing clean, well commented code.

### Grading

This project takes the place of the take-home final exam for the course. It is worth 20% of your overall grade. You will be graded on the quality and ambition of each aspect of the project. No single component is more important than the others.

### Practical details

* The project is due at **11:59pm EST on Thursday, May 19, 2022** via upload to CMS of a single zip file containing your fully executed Jupyter notebook and all associated data.
* You may work alone or in a group of up to three total members.
    * If you work in a group, be sure to list the names of the group members.
    * For groups, create your group on CMS and submit one notebook for the entire group. **Each group member should also submit an individual statement of responsibility** that describes in general terms who performed which parts of the project.
* You may post questions on Ed, but should do so privately (visible to course staff only).
* Interactive visualizations do not always work when embedded in shared notebooks. If you plan to use interactives, you may need to host them elsewhere and link to them.

---

## 1. Introduction and hypothesis

The Cornell subreddit is a notoriously busy subreddit. With more than 40k users it safe to say that a large part of the cornell popullation is currently or has at some point come across the r/Cornell sub. Many people go to the cornell subreddit looking for adivce, friends. iclickers, or even just companionship. However reddit as a whole has been known to be a website filled with users whose post are more of a creative writing exercise rather than an accruate representation of their beliefs. For this reason I am interesting in seeing if a machine learning model can detect satire of sh*t-posting as it is refered to on the reditt. Satirical texual documents are notoriously difficult to detect even for human readers. Thus it will be interesting to see how a model compares to human labels. I hypothesis that both human and machine labels will not be fully accurate however a human reader might more accurately detect satire than a model. 

## 2. Data and methods

Code provided in part by github user parth647 to scrape post from a subreddit!

In [69]:
!pip install praw



In [70]:
%%time

import praw
import pandas as pd


#Create an instance of reddit class
reddit = praw.Reddit(client_id="GI_ZqSXXcV12KPsNELGvAw", #my client id
                     client_secret="G5uosCAwkAEPRag1h8x01NfYosmy0g",  #your client secret
                     user_agent="Pedro_Velazquez", #user agent name
                     username = "pedrov718",     # your reddit username
                     password = "Chispit@73")     # your reddit password


# Create sub-reddit instance
subreddit_name = "Cornell"
subreddit = reddit.subreddit(subreddit_name)

cornell = pd.DataFrame() # creating dataframe for displaying scraped data

# creating lists for storing scraped data
titles=[]
scores=[]
ids=[]
body = []

# looping over posts and scraping it
for submission in subreddit.top(limit=None):
    titles.append(submission.title)
    scores.append(submission.score) #upvotes
    ids.append(submission.id)
    body.append(submission.selftext)
    
    
cornell['Title'] = titles
cornell['Id'] = ids
cornell['Upvotes'] = scores #upvotes
cornell["body"] = body

Wall time: 14.8 s


In [71]:
print(cornell.shape)
cornell.sample(20)

(999, 4)


Unnamed: 0,Title,Id,Upvotes,body
551,"if only I knew, nine months ago. if only",hgbtg4,295,
786,early morning beebe lake 🍂,qmm8ig,258,
695,Grateful,jzb5ol,272,"With everything going on in the world, this su..."
980,A car saw me and actually sped up at a ctown i...,qtttqp,239,At least I was fast enough. Some of these fuck...
252,"Cancel culture sucks. Seriously, people need t...",qkn5zn,377,He made a mistake everyone is going overboard....
311,Haven't seen the Taliban on campus...,ptz8l3,358,"Huge shout out to the ROTC, they're really doi..."
46,Some of y'all really need to revaluate how you...,n2jka5,539,I know many professors have taken advantage of...
218,How I feel at Cornell sometimes,ix0dvx,391,
444,7 am strolls at the cascadilla gorge trail,mqolzc,319,
513,There are two types of people in this world,nyihzr,303,


In [72]:
cornell.body

0      I just got a cornell alert saying to avoid the...
1      I emailed my chem prof about being very sick a...
2                                                       
3      I summarized their summary with most of the bi...
4                                                       
                             ...                        
994    If Cornell’s endowment is SOOOO big, why am I ...
995    I don't know what is wrong with you. You leave...
996                                                     
997                                                     
998                                                     
Name: body, Length: 999, dtype: object

In [73]:
import string 
import re 
import sklearn 
from sklearn.model_selection import train_test_split
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import pandas as pd

In [74]:
def remove_stop_words(dataframe, target_column_name, new_column_name):
    dataframe[new_column_name] = dataframe[target_column_name].apply(lambda x: " ".join([item for item in x.split() if item not in stopwords.words("english")]))
    return(dataframe)
def remove_punctuations(dataframe,target_column_name, new_column_name):
    dataframe[new_column_name] = dataframe[target_column_name].apply(lambda x: "".join([char for char in x if char not in string.punctuation]))  
    return(dataframe)
def stem_text(dataframe, target_column_name, new_column_name):
    dataframe[new_column_name] = dataframe[target_column_name].apply(lambda x: ps.stem(word) for word in x)
    return(dataframe)

In [75]:
cleaned_data = remove_stop_words(cornell, "body", "body_clean")

cleaned_data = remove_punctuations(cleaned_data, "body_clean", "no_stops")


In [76]:
# %%time
# clean_data = stem_text(cleaned_data, "no_stops", "stemmed")

In [77]:
cleaned_data.head()

Unnamed: 0,Title,Id,Upvotes,body,body_clean,no_stops
0,Cornell Alert: Anyone know whats going on?,qov789,1113,I just got a cornell alert saying to avoid the...,"I got cornell alert saying avoid arts quad, an...",I got cornell alert saying avoid arts quad any...
1,I threw up in my mask and had to continue taki...,qx9bkc,1057,I emailed my chem prof about being very sick a...,I emailed chem prof sick said I either take pr...,I emailed chem prof sick said I either take pr...
2,this professor gets it,k3ejjk,1053,,,
3,An actual summary of the 97 page report,hdvn9a,901,I summarized their summary with most of the bi...,I summarized summary bits affect students I wo...,I summarized summary bits affect students I wo...
4,I am a New Bus!,tsqsuw,816,,,


In [78]:
#Saving scraped data to my machine 
cleaned_data.to_csv("cornell_reddit_posts.csv", encoding='utf-8')

#### Now it is time to train a model to detect satire. At first we will try a simple apporach using NaiveBayes classifier trained on a labeled dataset of news article headlines. 

In [79]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

data = pd.read_json("Sarcasm.json", lines=True)
print(data.head())

                                        article_link  \
0  https://www.huffingtonpost.com/entry/versace-b...   
1  https://www.huffingtonpost.com/entry/roseanne-...   
2  https://local.theonion.com/mom-starting-to-fea...   
3  https://politics.theonion.com/boehner-just-wan...   
4  https://www.huffingtonpost.com/entry/jk-rowlin...   

                                            headline  is_sarcastic  
0  former versace store clerk sues over secret 'b...             0  
1  the 'roseanne' revival catches up to our thorn...             0  
2  mom starting to fear son's web series closest ...             1  
3  boehner just wants wife to listen, not come up...             1  
4  j.k. rowling wishes snape happy birthday in th...             0  


In [80]:
data["is_sarcastic"] = data["is_sarcastic"].map({0: "Not Sarcasm", 1: "Sarcasm"})
print(data.head())

                                        article_link  \
0  https://www.huffingtonpost.com/entry/versace-b...   
1  https://www.huffingtonpost.com/entry/roseanne-...   
2  https://local.theonion.com/mom-starting-to-fea...   
3  https://politics.theonion.com/boehner-just-wan...   
4  https://www.huffingtonpost.com/entry/jk-rowlin...   

                                            headline is_sarcastic  
0  former versace store clerk sues over secret 'b...  Not Sarcasm  
1  the 'roseanne' revival catches up to our thorn...  Not Sarcasm  
2  mom starting to fear son's web series closest ...      Sarcasm  
3  boehner just wants wife to listen, not come up...      Sarcasm  
4  j.k. rowling wishes snape happy birthday in th...  Not Sarcasm  


In [81]:

data = data[["headline", "is_sarcastic"]]
x = np.array(data["headline"])
y = np.array(data["is_sarcastic"])

cv = CountVectorizer()
X = cv.fit_transform(x) # Fit the Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [82]:
model = BernoulliNB()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))

0.8448146761512542


#### Time to test the accuracy of our Naive Bayes Model:

In [83]:
titles = {}
for title in cornell.Title:
    user = str(title)
    data = cv.transform([user]).toarray()
    output = model.predict(data)
    titles[title] = output

In [84]:
bodies = {}
for post in cornell["no_stops"]:
    data = cv.transform([post]).toarray()
    output = model.predict(data)
    bodies[post] = output

In [85]:
pd.DataFrame.from_dict(bodies, orient= "index")

Unnamed: 0,0
I got cornell alert saying avoid arts quad anyone idea whats happening Everyone please safe hopefully resolved soon Edit We going get ahead turn megathread situation Edit 2 We put subreddit restricted mode temporarily avoid mountain posts New posts show now still comment Edit 3 If see people giving advice leaving sheltered please report it dont want people get hurt following bad advice Everyone please stay safe locked unless directed otherwise authorities Edit 4 Cornell confirmed bomb threat We still dont know anything else please stay safe locked down Edit 5 We took subreddit restricted mode Still keep stuff related thread post again,Sarcasm
I emailed chem prof sick said I either take prelim as long it’s Covid would give failing prelim grade So I come take exam person low behold I throw halfway mask run bathroom I come back waiting hallway doesn’t say anything I ask another mask mine ruined says I take holding paper towel face rest time He asked I one emailed I say yes doesn’t say anything else I ask I take prelim hallway i’m spreading germs everyone says yes I finish exam throwing one time time another great look chem department🤠,Sarcasm
,Not Sarcasm
I summarized summary bits affect students I would still encourage read report Health Monitoring Daily check maybe live httpsdailycheckcornelledu COVID symptoms Possibly require flu vaccine Face masks policy live httpsehscornelleducampushealthsafetyoccupationalhealthcovid19facecoveringandmaskrequirements Contact tracing recommended via phone app Testing Low threshold testing including contact Covidpositive cases “Surveillance Testing” regularly testing students staff Test students travel Ithaca remotely arrive Testing students already Ithaca movein Movein may 4 8 days accommodate Reactivation key card NetID requires completing checklist QuarantineIsolationContact Tracing Quarantine permanent residence Ithaca hotels including possibly Statler “Students provided immunity university disciplinary violations activities disclose contact tracing” Modifications Academic Activities Later start school year breaks 2 modes instruction online inperson remote access Classroom capacity reduced 1324 Wear masks classroom sit assigned seats bruh Virtual OH encouraged Barton Hall large spaces repurposed quiet study spaces limit gatherings elsewhere Regular grading reinstated Attendance shouldn’t counted credit taken facilitate contact tracing Recommend stricter cap 18 credit hours mental health Orientation focus behavioral expectations Modifications student life Student Organizations encouraged virtual activities among considerations Greek Life “The university develop addendum Risk Management Social Events Policy requires compliance NYS local public health guidelines wearing masks visitors house registration public health monitors events coordination SCL ensure vendors healthy safe registering addresses chapter annexes” “Some advocated strict enforcement strong sanctions others caution risks pushing activities “underground” may difficult enforce Greek leaders expressed desire involved enforcement behavioral expectations maintain close partnership university leadership “ “The university suspend inperson concerts lectures involve outside guests promote innovative approaches entirely new ways socializing distancing” Housing Eliminate quads triples Rooms assigned bathrooms reduce people sharing Lounges kitchens remain open Dining halls provide togo service tables properly spaced cutlery disposable reservations required dinein Dedensification “Campus dedensified inviting none students back residential instruction If option pursued following student groups given priority new students including transfer students residential advisors graduating seniors especially spring athletes if competition take place sport students would otherwise able maintain academic progress without access campus students programs require handson access special facilities lack access internet quiet learning spaces home” Comments suggest ideal That say proposing option NOT saying done In fact committee recommends AGAINST this ”Campaign Public Health Behavioral Influence Strategies” “Recommend system progressive sanctions Initial response would involve student educational Subsequent violations would involve parentlegal guardian student signed FERPA waiver Students could lose access university facilities ultimately referred Office Judicial Administrator repeated violations necessitate formal discipline removal enrollment”,Sarcasm
1 Color red 2 Animal bear 3 Winter cold 4 People stressed,Not Sarcasm
...,...
162 Trick everybody reddit thinking going propose partner Duffield never show up,Sarcasm
Woooooo,Not Sarcasm
Just evict Covid testers gorgeous browsing library put literally anywhere else many rooms less pretty campus put free popcorn back alcove put tables chairs lobby make Willard Straight good again Make campus human place Also make Memorial Room lounge It long until 90s I think would simple add nice leather chairs coffee tables lamps easily moved aside SA meetings formals meetings gatherings WSH valuable place campus whole point building provide social space students Why embrace original mission already have,Sarcasm
If Cornell’s endowment SOOOO big I wiping single ply 😐😐😐,Not Sarcasm


In [86]:
#importing labelled data to my machine
#I will be using this for the rest of my porject

labeled = pd.read_csv("cornell_reddit_posts_labled.csv",  encoding='utf-8')

In [87]:
labeled.head()

Unnamed: 0,Title,Id,Upvotes,Label,body,tokens,new_body,body_clean,no_stops
0,Cornell Alert: Anyone know whats going on?,qov789,1111,0.0,I just got a cornell alert saying to avoid the...,I just got a cornell alert saying to avoid the...,I just got a cornell alert saying to avoid the...,"I got cornell alert saying avoid arts quad, an...",I got cornell alert saying avoid arts quad any...
1,I threw up in my mask and had to continue taki...,qx9bkc,1049,0.0,I emailed my chem prof about being very sick a...,I emailed my chem prof about being very sick a...,I emailed my chem prof about being very sick a...,I emailed chem prof sick said I either take pr...,I emailed chem prof sick said I either take pr...
2,this professor gets it,k3ejjk,1047,0.0,,,,,
3,An actual summary of the 97 page report,hdvn9a,908,0.0,I summarized their summary with most of the bi...,I summarized their summary with most of the bi...,I summarized their summary with most of the bi...,I summarized summary bits affect students I wo...,I summarized summary bits affect students I wo...
4,I am a New Bus!,tsqsuw,811,0.0,,,,,


In [88]:
labeled.dropna(inplace =True)

In [89]:
posts = labeled.new_body.values
X = vectorizer.fit_transform(posts)

labels = labeled.Label.values


In [90]:
train_input, val_input, train_label, val_label = train_test_split(X, labels)

In [91]:
train_input.shape, train_label.shape

((240, 6355), (240,))

## 3. Results

In [92]:
baseline = 100 - (len(labeled.loc[labeled.Label == 1])/ len(labeled.new_body))

print(baseline)

99.82866043613707


In [93]:
model = BernoulliNB()
model.fit(train_input, train_label)
print(model.score(val_input, val_label))

0.8518518518518519


## Failed attempt at trying to implement a Bert model to detect satire in my reddit dataset

In [94]:
!pip3 install transformers



In [95]:
from transformers import BertTokenizer, TFBertModel, BertConfig, TFBertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case = True)

In [96]:
!pip install tensorflow



In [97]:
import tensorflow as tf

In [98]:
print('Actual text', posts[0])

Actual text I just got a cornell alert saying to avoid the arts quad  anyone have any idea whats happening   Everyone please be safe and hopefully this is resolved soon   Edit  We are going to just get ahead of this and turn this into a megathread for this situation   Edit    We put the subreddit into restricted mode temporarily to avoid the mountain of posts  New posts will not show up for now  but you can still comment  Edit    If you see people giving advice such as leaving where they are sheltered  please report it  we dont want people to get hurt for following bad advice  Everyone please stay safe and locked down unless directed to do otherwise by authorities   Edit    Cornell confirmed that there is a bomb threat  We still dont know if there is anything else  so please stay safe and locked down   Edit    We took the subreddit out of restricted mode  Still keep all stuff related to this in this thread  but you can post again  


In [99]:
print("Tokens", tokenizer.tokenize(posts[0]) )

Tokens ['i', 'just', 'got', 'a', 'cornell', 'alert', 'saying', 'to', 'avoid', 'the', 'arts', 'quad', 'anyone', 'have', 'any', 'idea', 'what', '##s', 'happening', 'everyone', 'please', 'be', 'safe', 'and', 'hopefully', 'this', 'is', 'resolved', 'soon', 'edit', 'we', 'are', 'going', 'to', 'just', 'get', 'ahead', 'of', 'this', 'and', 'turn', 'this', 'into', 'a', 'mega', '##th', '##rea', '##d', 'for', 'this', 'situation', 'edit', 'we', 'put', 'the', 'sub', '##red', '##dit', 'into', 'restricted', 'mode', 'temporarily', 'to', 'avoid', 'the', 'mountain', 'of', 'posts', 'new', 'posts', 'will', 'not', 'show', 'up', 'for', 'now', 'but', 'you', 'can', 'still', 'comment', 'edit', 'if', 'you', 'see', 'people', 'giving', 'advice', 'such', 'as', 'leaving', 'where', 'they', 'are', 'sheltered', 'please', 'report', 'it', 'we', 'don', '##t', 'want', 'people', 'to', 'get', 'hurt', 'for', 'following', 'bad', 'advice', 'everyone', 'please', 'stay', 'safe', 'and', 'locked', 'down', 'unless', 'directed', 'to'

In [100]:
print('Token to ids', tokenizer.convert_tokens_to_ids(tokenizer.tokenize(posts[0])))

Token to ids [1045, 2074, 2288, 1037, 10921, 9499, 3038, 2000, 4468, 1996, 2840, 17718, 3087, 2031, 2151, 2801, 2054, 2015, 6230, 3071, 3531, 2022, 3647, 1998, 11504, 2023, 2003, 10395, 2574, 10086, 2057, 2024, 2183, 2000, 2074, 2131, 3805, 1997, 2023, 1998, 2735, 2023, 2046, 1037, 13164, 2705, 16416, 2094, 2005, 2023, 3663, 10086, 2057, 2404, 1996, 4942, 5596, 23194, 2046, 7775, 5549, 8184, 2000, 4468, 1996, 3137, 1997, 8466, 2047, 8466, 2097, 2025, 2265, 2039, 2005, 2085, 2021, 2017, 2064, 2145, 7615, 10086, 2065, 2017, 2156, 2111, 3228, 6040, 2107, 2004, 2975, 2073, 2027, 2024, 18304, 3531, 3189, 2009, 2057, 2123, 2102, 2215, 2111, 2000, 2131, 3480, 2005, 2206, 2919, 6040, 3071, 3531, 2994, 3647, 1998, 5299, 2091, 4983, 2856, 2000, 2079, 4728, 2011, 4614, 10086, 10921, 4484, 2008, 2045, 2003, 1037, 5968, 5081, 2057, 2145, 2123, 2102, 2113, 2065, 2045, 2003, 2505, 2842, 2061, 3531, 2994, 3647, 1998, 5299, 2091, 10086, 2057, 2165, 1996, 4942, 5596, 23194, 2041, 1997, 7775, 5549, 2145,

In [101]:
max_len = 0
for post in posts:
    max_len = max(max_len, len(str(post)))

In [102]:
def mask_input_for_bert(posts, max_len):
    #tokenize and map senteces to word ID
    input_ids = []
    attention_masks = []
    i = 0
    for post in posts:
        if(i<3):
            print("Post", posts)
        encoded_dict = tokenizer.encode_plus(
            post,
            add_special_tokens = True,
            max_length = max_len,
            pad_to_max_length = True,
            return_attention_mask = True
            )
        if (i<3):
            print("dict", encoded_dict['input_ids'])
        input_ids.append(encoded_dict["attention_mask"])
        
        i = (i+1)
    
    input_ids = tf.convert_to_tensor(input_ids)
    attention_masks = tf.convert_to_tensor(attention_masks)
    return(input_ids, attention_masks)

In [104]:
# train_inp, train_mask = mask_input_for_bert(train_input, max_len)
# val_inp, val_mask = mask_input_for_bert(val_input, max_len)
# train_label = tf.convert_to_tensor(train_label)
# val_label = tf.convert_to_tensor(val_label)

In [105]:
print("Train_input_shape", train_inp.shape)
print("Train_mask_shape", train_mask.shape)
print("Validation_input_shape", val_inp.shape)
print("Validation_mask_shape", val_mask.shape)
print("Train_labelshape", train_label.shape)
print("Validation label shape", val_label.shape)

Train_input_shape (240, 9336)
Train_mask_shape (0,)
Validation_input_shape (81, 9336)
Validation_mask_shape (0,)
Train_labelshape (240,)
Validation label shape (81,)


In [107]:
# bert_model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels = 2)

## 4. Discussion and conclusions

The purpose of this experiment is to see if satire can be detected on the cornell subreddit. My motivation for this porject was due to the recent surge in non-sincere content on the cornell reddit. Due to the nature of reddit. It being a place where a user can remain totally anonymous I believed that the amount of disingenuous post would be relatively high. 

In conducting this project I had to overcome many hurdle. The first of these hurdles was learning how to scrape data from reddit. The reddit API was a fantastic start for this kind of task. I also recieved a tremendous amount of help from this medium article: https://medium.com/swlh/scraping-reddit-using-python-57e61e322486. 
After the post data was scrapped from the reddit I realized that the posts needed to be cleaned. Since many of them contained non-ascii characters that would not be useful to a text analysis project. 

My second large hurdle for this porject was the absence of gold labels for my data. In preparring for this project I was able to find many useful examples online of people scrapping twitter data to conduct sentiment analysis. A very useful article is linked here: https://thecleverprogrammer.com/2021/08/24/sarcasm-detection-with-machine-learning/

I was able to find a data set of news article headlines and from this data set I was able to train a Naive Bayes model to detect satire. This model was trianed on the new article data and then implimented on the reddit data that I gathered. The naive bayed classifier was 84% accurate on the news article data. When a similiar model was trained and and evaluated on the reddit data its accuracy fell to 77%. Although an accuracy of 77% is less than the news article data I believe that it was a good first step in trying to detect sarcams using a model. Unfortunely the reddit data that I scraped also had to be hand labled meaning that I had to read through a dataset of 1000 reddit post and manually determine if the post were genuine or that. Although my initial assumption was that a large number of reddit posts were satirical, it turns out that only about 1% were in fact satire. This obviously had a huge effect on my project, since it entailed that I had very little data to train and test the model.

The model's ended up not beating the baseline, and becuase of this I thought I shoudl try training the model on a bert pre-trained model, however after a lot of debugging and reading the huggingface API I was unable to get the bert-model to work. 

Overall, I believe that this project was very ambitious. I learned how to scrape data from reddit. I learned the difficulties of trying to manually annotate data and I was able to deploy a machine learning model on a cleaned text dataset of my own creation. Much of my time was dedicated to learing thr Bert API and unfortunately because of this I was unable to detect more time to an indepth analysis of the data that was scraped. 