# AWeSoMe Lab Intro and Setup Assignment Notebook

This is the Jupyter notebook for the HMC AWeSoMe Lab Intro and Setup Assignment "Hello, Convokit" ([writeup here](https://docs.google.com/document/d/1sMnhaWcx5VgZDhnTW4xSj0FmjITdqFqoMXMWCEoHyoE/edit?usp=sharing)). Solutions to coding questions _and_ written questions should be put here (you can use markdown cells for the written questions).

## Part 1: Load a Corpus!

See the [writeup for Part 1](https://docs.google.com/document/d/1sMnhaWcx5VgZDhnTW4xSj0FmjITdqFqoMXMWCEoHyoE/edit?tab=t.0#heading=h.yhpnne8a6ns3) before continuing.

In [None]:
from convokit import Corpus, download, FightingWords

In [None]:
reddit_corpus = Corpus(filename=download("reddit-corpus-small"))

In [None]:
reddit_corpus.print_summary_stats()

## Part 2: Re-implement print_summary_stats

See the [writeup for Part 2](https://docs.google.com/document/d/1sMnhaWcx5VgZDhnTW4xSj0FmjITdqFqoMXMWCEoHyoE/edit?tab=t.0#heading=h.jz5jp4t11fwd) before continuing.

In [None]:
# Here's how you would iterate over Speakers:
n_speakers = 0
for speaker in reddit_corpus.iter_speakers():
    n_speakers += 1
print(n_speakers)

TASK: In the two code cells below, modify the provided code to count Utterances and Conversations.

In [None]:
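# iter_speakers() returns a generator, so this prints the generator object itself, not the Speakers
print(reddit_corpus.iter_speakers())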

In [None]:
def count_items(items):
    '''
    Iterates over an iterable (e.g. a Corpus iterator) and returns the number of items in it
    '''
    n_items = 0
    for _ in items:
        n_items += 1
    return n_items

print("Number of Speakers:", count_items(reddit_corpus.iter_speakers()))
print("Number of Conversations:", count_items(reddit_corpus.iter_conversations()))
print("Number of Utterances:", count_items(reddit_corpus.iter_utterances()))
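
The counting loop above works on any iterable, but Corpus iterators are generators, so they have no `len()` and can only be consumed once. The sketch below illustrates this with a hypothetical stand-in generator (`fake_iter_speakers` is not part of ConvoKit) so that it runs without downloading a corpus.

```python
def count_with_sum(items):
    """Count the items in any iterable without building a list."""
    return sum(1 for _ in items)


def fake_iter_speakers():
    # Hypothetical stand-in for a Corpus iterator such as iter_speakers()
    for name in ["alice", "bob", "carol"]:
        yield name


speakers = fake_iter_speakers()
first_pass = count_with_sum(speakers)   # consumes the generator
second_pass = count_with_sum(speakers)  # nothing left to iterate over
print(first_pass, second_pass)
```

Because the second pass finds nothing, each count needs a fresh iterator call (for example, a new `reddit_corpus.iter_speakers()`) rather than a reused variable.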


## Part 3: Working with metadata

See the [writeup for Part 3](https://docs.google.com/document/d/1sMnhaWcx5VgZDhnTW4xSj0FmjITdqFqoMXMWCEoHyoE/edit?tab=t.0#heading=h.ciwzim5uquvi) before continuing.

In [None]:
# Here's an example of accessing Conversation metadata
# (demonstrated on a randomly selected Conversation)
c = reddit_corpus.random_conversation()
print(c.meta["title"])

TASK: Using conversation-level metadata and the iterators you practiced in Part 2, compute (a) the total number of subreddits in the Corpus, and (b) the 5 subreddits with the most conversations (along with the exact number of conversations in each of those 5 subreddits).

In [None]:

subreddits = {}

for convo in reddit_corpus.iter_conversations():
    sub = convo.meta["subreddit"]

    # count the number of conversations per subreddit
    subreddits[sub] = subreddits.get(sub, 0) + 1

# (a) total number of subreddits
print("The number of subreddits is " + str(len(subreddits)))

# (b) the five subreddits with the most conversations, with their exact counts
top_five = sorted(subreddits.items(), key=lambda item: item[1], reverse=True)[:5]
print("The top five subreddits are:")
for name, count in top_five:
    print(name, count)

# full subreddit -> conversation count mapping, for reference
print(subreddits)
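
The same tally can be written with `collections.Counter` from the standard library, whose `most_common` method returns names together with their counts. This is a minimal sketch on made-up data: the plain dictionaries in `fake_conversation_meta` are hypothetical stand-ins for `Conversation.meta`, not values from the Reddit corpus.

```python
from collections import Counter

# Hypothetical stand-ins for the meta dictionaries of six Conversations
fake_conversation_meta = [
    {"subreddit": "nfl"},
    {"subreddit": "programming"},
    {"subreddit": "nfl"},
    {"subreddit": "askscience"},
    {"subreddit": "programming"},
    {"subreddit": "nfl"},
]

# Tally conversations per subreddit in one pass
subreddit_counts = Counter(meta["subreddit"] for meta in fake_conversation_meta)

n_subreddits = len(subreddit_counts)        # number of distinct subreddits
top_two = subreddit_counts.most_common(2)   # list of (name, count) pairs
print(n_subreddits, top_two)
```

With the real corpus, the generator expression would iterate over `reddit_corpus.iter_conversations()` and read `convo.meta["subreddit"]` instead.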

## Part 4: Transformers, roll out!

See the [writeup for Part 4](https://docs.google.com/document/d/1sMnhaWcx5VgZDhnTW4xSj0FmjITdqFqoMXMWCEoHyoE/edit?tab=t.0#heading=h.lhawg1ufgxev) before continuing.

In [None]:
# Here's an example of applying a simple Transformation
# (the TextCleaner, which does basic text standardization such as lowercasing everything)
from convokit import TextCleaner
cleaner = TextCleaner(replace_text=False)
reddit_corpus = cleaner.transform(reddit_corpus)
utt = reddit_corpus.random_utterance()
print("Original text:", utt.text)
print("Cleaned text:", utt.meta["cleaned"])

TASK: Use the [Fighting Words Transformer](https://convokit.cornell.edu/documentation/fightingwords.html) to generate a plot comparing the usage of words in the subreddits "nfl" and "programming".

In [None]:
# Apply the FightingWords transformer to the Reddit corpus to compare the language
# used in the subreddits "nfl" and "programming", and produce a plot showing the differences

fw = FightingWords(ngram_range=(1,1))

fw.fit(reddit_corpus, class1_func=lambda utt: utt.meta['subreddit'] == 'nfl',
                      class2_func=lambda utt: utt.meta['subreddit'] == 'programming')

df = fw.summarize(reddit_corpus, plot=True, class1_name='r/nfl', class2_name='r/programming')

df
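
To build intuition for the scores in the table above, here is a rough standard-library sketch of the kind of statistic Fighting Words reports: a z-scored log-odds ratio with a smoothing prior, in the style of Monroe et al. (2008). The function name `log_odds_z_scores`, the uniform prior of 0.1, the whitespace tokenization, and the toy sentences are all assumptions made for illustration; ConvoKit's own tokenization and prior settings may differ.

```python
import math
from collections import Counter


def log_odds_z_scores(docs_a, docs_b, prior=0.1):
    """Z-scored log-odds ratio per word, using a uniform prior (an assumption)."""
    counts_a = Counter(w for doc in docs_a for w in doc.lower().split())
    counts_b = Counter(w for doc in docs_b for w in doc.lower().split())
    vocab = set(counts_a) | set(counts_b)
    n_a = sum(counts_a.values())
    n_b = sum(counts_b.values())
    prior_total = prior * len(vocab)

    z_scores = {}
    for word in vocab:
        a = counts_a[word] + prior  # smoothed count in group A
        b = counts_b[word] + prior  # smoothed count in group B
        log_odds_a = math.log(a / (n_a + prior_total - a))
        log_odds_b = math.log(b / (n_b + prior_total - b))
        variance = 1.0 / a + 1.0 / b  # approximate variance of the difference
        z_scores[word] = (log_odds_a - log_odds_b) / math.sqrt(variance)
    return z_scores


# Toy documents, invented for this sketch
nfl_like = ["the quarterback threw the ball", "great game by the defense"]
prog_like = ["the compiler threw an error", "great library for the parser"]

scores = log_odds_z_scores(nfl_like, prog_like)
for word in sorted(scores, key=scores.get, reverse=True)[:3]:
    print(word, round(scores[word], 2))
```

A positive score marks a word as more characteristic of the first group and a negative score of the second. A word used at the same rate in both groups, like "threw" here, scores zero.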

## Part 5: Your turn!

See the [writeup for Part 5](https://docs.google.com/document/d/1sMnhaWcx5VgZDhnTW4xSj0FmjITdqFqoMXMWCEoHyoE/edit?tab=t.0#heading=h.ep64m0asidvd) before continuing.

Before you write any code, please write down in this text cell what groups you have chosen to compare, and what hypotheses you came up with beforehand.

In [None]:
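# Add your code here.
# (You may also add as many additional code cells as you want)

# Compare r/AskMen and r/AskWomen (assumes both subreddits appear in this corpus)
fw_ask = FightingWords(ngram_range=(1,1))
fw_ask.fit(reddit_corpus, class1_func=lambda utt: utt.meta['subreddit'] == 'AskMen',
                          class2_func=lambda utt: utt.meta['subreddit'] == 'AskWomen')
fw_ask.summarize(reddit_corpus, plot=True, class1_name='r/AskMen', class2_name='r/AskWomen')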



Now write down in this text cell some things that you found. How did the results compare to your expectations? Was there anything that surprised you? Is there anything you would do differently?