# Discord Data
### Overview
Today, we're looking at data collected from one of the Discord servers I use. The server was created in 2016 and is still active.

### Data Collection
The data was collected using a custom Discord bot written in Python. [See here](https://github.com/jackstephenson19/discord-collector-bot) for the bot used to collect this data.

# Exploratory Data Analysis (EDA)

![Bar chart made in Tableau showing an overview of the data showing the total message count per year](resources/TotalMessageCountPerYear.png)
![Line charts made in Tableau showing the total message count per month from 2019-2022](resources/MessagesPerMonth2019-2022.png)

As the Tableau visualizations above show, the server saw limited interaction during its first two years and only began seeing real use in 2018, though not for the entire year. Activity peaked during 2019-2022, and most sharply in 2020, when monthly usage spiked from 42 messages in February to 2,738 messages in May. This is the all-time high for server activity: April through June 2020 are the only months in the server's history with more than 1,000 messages sent, and May came close to tripling that threshold. I speculate that this drastic increase in server usage was due to the beginning of the COVID-19 pandemic.
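
To put those figures in proportion, here is a quick arithmetic check that uses only the monthly totals quoted above (42 messages in February 2020, 2,738 in May 2020, and the 1,000-message mark). The variable names are purely illustrative.

```python
# Monthly totals quoted in the text above
feb_2020 = 42
may_2020 = 2738
threshold = 1000

# Growth factor from February to May 2020
growth = may_2020 / feb_2020
# How far May exceeded the 1,000-message mark
over_threshold = may_2020 / threshold

print(f"May had about {growth:.0f}x as many messages as February")
print(f"May reached {over_threshold:.2f}x the 1,000-message mark")
```

In other words, May 2020 saw roughly 65 times the February volume and about 2.7 times the 1,000-message mark.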

![Bar chart made in R showing the number of messages per user, for users with over 100 messages](resources/NumberOfMessagesPerAuthor.jpeg)

The bar chart above was made in R and ranks users by total messages sent over the server's entire lifespan, limited to users with more than 100 messages.

In [1]:
import pandas as pd
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download("wordnet")

df = pd.read_csv("all_messages_clean3.csv")
df.head(10)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\steph\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,Channel,Author,Year,Date,Time,Message
0,general,young nastyman,2023,2023-04-01,0:04:51,dammit
1,general,young nastyman,2023,2023-04-01,0:00:46,must be a bug
2,general,young nastyman,2023,2023-04-01,0:00:40,thats weird why does it keep banning marco
3,general,ptat,2023,2023-04-01,0:00:24,you should make a bot that randomly bans someo...
4,general,young nastyman,2023,2023-03-31,23:59:38,i think its working now i have to let it run
5,general,young nastyman,2023,2023-03-31,23:35:42,&collect
6,general,young nastyman,2023,2023-03-31,23:29:33,&collect
7,general,young nastyman,2023,2023-03-31,23:26:55,&collect
8,general,young nastyman,2023,2023-03-31,23:25:36,&collect
9,general,CIamam,2023,2023-03-31,14:38:39,I just know it


In [2]:
# We need to create another column for month so that the messages can be aggregated by month
def extract_month(date_str: str) -> str:
    return date_str.split("-")[1]

df['Month'] = df["Date"].apply(extract_month)

# Now we will split up each message into a list of lemmatized words so that they can be compared
lemmatizer = WordNetLemmatizer()

def lemmatize_message(msg: str) -> list[str]:
    return [lemmatizer.lemmatize(word) for word in msg.split()]

df["Tokens"] = df["Message"].apply(lemmatize_message)
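
The string split above relies on every date being formatted as `YYYY-MM-DD`. As a hedged alternative sketch, the standard library's `date.fromisoformat` parses the same format and raises an error on malformed input. `parse_month` is a hypothetical helper name, not part of the notebook, and it returns an integer rather than the zero-padded string used above.

```python
from datetime import date

def parse_month(date_str: str) -> int:
    """Return the month (1-12) from an ISO 'YYYY-MM-DD' string."""
    # fromisoformat raises ValueError if the string is not a valid ISO date
    return date.fromisoformat(date_str).month

print(parse_month("2023-04-01"))  # 4
print(parse_month("2023-03-31"))  # 3
```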

In [3]:
df.head()

Unnamed: 0,Channel,Author,Year,Date,Time,Message,Month,Tokens
0,general,young nastyman,2023,2023-04-01,0:04:51,dammit,4,[dammit]
1,general,young nastyman,2023,2023-04-01,0:00:46,must be a bug,4,"[must, be, a, bug]"
2,general,young nastyman,2023,2023-04-01,0:00:40,thats weird why does it keep banning marco,4,"[thats, weird, why, doe, it, keep, banning, ma..."
3,general,ptat,2023,2023-04-01,0:00:24,you should make a bot that randomly bans someo...,4,"[you, should, make, a, bot, that, randomly, ba..."
4,general,young nastyman,2023,2023-03-31,23:59:38,i think its working now i have to let it run,3,"[i, think, it, working, now, i, have, to, let,..."


In [4]:
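# Filter for just the year 2020 so we can investigate the huge peak in messages
df_2020 = df[df["Year"] == 2020]

# Aggregate by month by concatenating the token lists (summing lists joins them).
# Note: this groups the full `df`, not `df_2020`, so every year is pooled here,
# which is why 2023 messages appear in the output. The year filter is applied
# later, inside most_common_words.
agg_df = df[["Month", "Tokens"]].groupby("Month").agg({'Tokens': 'sum'})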
agg_df.head()

Unnamed: 0_level_0,Tokens
Month,Unnamed: 1_level_1
1,"[it, broken, with, pyke/swain, only, like, one..."
2,"[https://www.youtube.com/watch?v=-2E7Wkz3quA, ..."
3,"[i, think, it, working, now, i, have, to, let,..."
4,"[dammit, must, be, a, bug, thats, weird, why, ..."
5,"[https://youtu.be/6pxaL3uHWgc, broken, nerf, p..."
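
Summing a column of lists works because `+` on Python lists is concatenation, so each month ends up with one long list of tokens. Here is a minimal stdlib-only sketch of the same grouping, using made-up rows rather than the real data:

```python
from collections import defaultdict

# Made-up (month, tokens) rows standing in for the Month and Tokens columns
rows = [
    ("04", ["must", "be", "a", "bug"]),
    ("03", ["i", "think", "it", "working"]),
    ("04", ["dammit"]),
]

# Concatenate every token list that shares a month
tokens_by_month: dict[str, list[str]] = defaultdict(list)
for month, tokens in rows:
    tokens_by_month[month].extend(tokens)

for month in sorted(tokens_by_month):
    print(month, tokens_by_month[month])
```

Repeated list concatenation can get slow on large groups, since each `+` builds a new list. Using `extend` in this sketch avoids that.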


In [5]:
from collections import Counter

# Now we will turn the word lists for each month into a counter
agg_df['Counter'] = agg_df['Tokens'].apply(Counter)
agg_df.head()

Unnamed: 0_level_0,Tokens,Counter
Month,Unnamed: 1_level_1,Unnamed: 2_level_1
1,"[it, broken, with, pyke/swain, only, like, one...","{'it': 25, 'broken': 2, 'with': 9, 'pyke/swain..."
2,"[https://www.youtube.com/watch?v=-2E7Wkz3quA, ...",{'https://www.youtube.com/watch?v=-2E7Wkz3quA'...
3,"[i, think, it, working, now, i, have, to, let,...","{'i': 52, 'think': 5, 'it': 35, 'working': 2, ..."
4,"[dammit, must, be, a, bug, thats, weird, why, ...","{'dammit': 1, 'must': 1, 'be': 15, 'a': 48, 'b..."
5,"[https://youtu.be/6pxaL3uHWgc, broken, nerf, p...","{'https://youtu.be/6pxaL3uHWgc': 1, 'broken': ..."
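
Each `Counter` maps a token to the number of times it appears in that month. As a small sketch on made-up tokens, `Counter.most_common` returns entries sorted by count, which is one built-in way to read off the top words:

```python
from collections import Counter

# Made-up tokens standing in for one month's token list
tokens = ["it", "broken", "it", "with", "it", "broken"]
counts = Counter(tokens)

# most_common(n) returns the n highest-count (token, count) pairs
top_two = counts.most_common(2)
print(top_two)  # [('it', 3), ('broken', 2)]
```

When counts tie, `most_common` keeps the token that was encountered first.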


In [6]:
# Now that we have the word counts for each month, we want to filter them
# First, let's find the most common word in each month
def extract_top_word(d: Counter) -> str:
    top_word = ""
    top_word_count = 0
    for k, v in d.items():
        if v >= top_word_count:
            top_word_count = v
            top_word = k
    return top_word

agg_df['top_word'] = agg_df['Counter'].apply(extract_top_word)
agg_df['month_raw'] = agg_df.index
print(agg_df[['month_raw', 'top_word']].to_string())

      month_raw top_word
Month                   
01           01    #play
02           02    #play
03           03    FILE>
04           04    -play
05           05    -play
06           06    -play
07           07    -play
08           08    -play
09           09    FILE>
10           10    >play
11           11    >play
12           12    >play


Based on the output above, the most common word in each month is simply a bot command. To get a real sense of the words used most in conversation, we will filter out the common bot commands.

All of the commands in the table start with a "#", "-", or ">" symbol, or contain "FILE>". We will filter out any token matching those patterns.
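
Here is a minimal sketch of that rule, assuming the patterns seen in the table cover every command. `str.startswith` accepts a tuple of prefixes, which keeps the check compact. `is_bot_command` and the sample tokens are illustrative only.

```python
# Prefixes taken from the table above
COMMAND_PREFIXES = ("#", "-", ">")

def is_bot_command(token: str) -> bool:
    """True if the token looks like a bot command or a file placeholder."""
    return token.startswith(COMMAND_PREFIXES) or "FILE>" in token

sample = ["-play", "#play", ">play", "FILE>", "game", "wanna"]
kept = [t for t in sample if not is_bot_command(t)]
print(kept)  # ['game', 'wanna']
```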

In [7]:
def filter_bot_cmds(word: str) -> bool:
    if (word[0] == "#") or (word[0] == ">") or (word[0] == "-") or ("FILE>" in word):
        return False
    return True

def filter_and_lemmatize_message(msg: str) -> list[str]:
    return [lemmatizer.lemmatize(word) for word in msg.split() if filter_bot_cmds(word)]

df["Tokens"] = df["Message"].apply(filter_and_lemmatize_message)
df.head()

Unnamed: 0,Channel,Author,Year,Date,Time,Message,Month,Tokens
0,general,young nastyman,2023,2023-04-01,0:04:51,dammit,4,[dammit]
1,general,young nastyman,2023,2023-04-01,0:00:46,must be a bug,4,"[must, be, a, bug]"
2,general,young nastyman,2023,2023-04-01,0:00:40,thats weird why does it keep banning marco,4,"[thats, weird, why, doe, it, keep, banning, ma..."
3,general,ptat,2023,2023-04-01,0:00:24,you should make a bot that randomly bans someo...,4,"[you, should, make, a, bot, that, randomly, ba..."
4,general,young nastyman,2023,2023-03-31,23:59:38,i think its working now i have to let it run,3,"[i, think, it, working, now, i, have, to, let,..."


In [8]:
# Now we need to recalculate the most common word for each month.  We will make a function to do this.
def most_common_words(df: pd.DataFrame, year: int = 2020) -> pd.DataFrame:
    new_df = df[df["Year"] == year]
    agg_df = new_df[["Month", "Tokens"]].groupby("Month").agg({'Tokens': 'sum'})
    agg_df['Counter'] = agg_df['Tokens'].apply(Counter)
    agg_df['top_word'] = agg_df['Counter'].apply(extract_top_word)
    agg_df['month_raw'] = agg_df.index
    return agg_df[['month_raw', 'top_word']]

df_2020_filtered = most_common_words(df, 2020)
print(df_2020_filtered.head())

      month_raw                      top_word
Month                                        
01           01                             I
02           02  https://www.twitch.tv/ciamam
03           03                        <VIDEO
04           04                        <VIDEO
05           05                        <VIDEO


In [9]:
# To make the filtering easier to tune, we will parameterize the filter and wrap the top-word calculation in one function
bot_words = ["file", "video", "image", "<@!", "https", "www"]
bot_identifiers = [">", "-", "#"]
common_words = ["i", "you", "the", "a", "am", "like"]

def filter_bot_cmds(
    word: str,
    bot_words: list[str] = bot_words,
    bot_identifiers: list[str] = bot_identifiers,
    common_words: list[str] = common_words
    ) -> bool:
    if (word[0] in bot_identifiers) or any(bot_w in word for bot_w in bot_words) or (word in common_words):
        return False
    return True

def filter_lem_msg(msg: str, filter_func) -> list[str]:
    return [lemmatizer.lemmatize(word).lower() for word in msg.split() if filter_func(word.lower())]

def bot_filter(
    df: pd.DataFrame,
    bot_words: list[str] = bot_words,
    bot_identifiers: list[str] = bot_identifiers,
    common_words: list[str] = common_words,
    year: int = 2020
    ) -> pd.DataFrame:
    filter_func = lambda m: filter_bot_cmds(m, bot_words, bot_identifiers, common_words)
    lem_msg_func = lambda m: filter_lem_msg(m, filter_func)
    df["Tokens"] = df["Message"].apply(lem_msg_func)
    return most_common_words(df, year)

print(bot_filter(df).to_string())

      month_raw    top_word
Month                      
01           01       wanna
02           02        game
03           03          to
04           04    herc@big
05           05  thundercat
06           06          to
07           07          21
08           08          wa
09           09          to
10           10          is
11           11         and
12           12          is


Nice, so the function to generate the most common words works. We just have to come up with a good list of stopwords to filter.

In [10]:
from nltk.corpus import stopwords
nltk.download('stopwords')

print(bot_filter(df, common_words=list(stopwords.words('english'))).to_string())

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\steph\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


      month_raw    top_word
Month                      
01           01       wanna
02           02        game
03           03         get
04           04    herc@big
05           05  thundercat
06           06         say
07           07          21
08           08  dickriding
09           09      heajtq
10           10        like
11           11           u
12           12        like
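
One thing to note in the last call: passing `common_words=...` replaces the earlier hand-written list rather than extending it, which would explain why "like" reappears as a top word. Below is a hedged sketch of combining both lists, where `base_stopwords` is a tiny stand-in for the NLTK list and `chat_words` is a hypothetical set of server-specific filler words.

```python
# Tiny stand-in for stopwords.words('english'); the real list is much longer
base_stopwords = {"i", "you", "the", "a", "am", "to", "is", "and"}
# Hypothetical chat-specific filler words, chosen from the output above
chat_words = {"like", "u", "wanna"}

# A set union keeps both lists and makes membership checks fast
all_stopwords = base_stopwords | chat_words

tokens = ["u", "wanna", "play", "the", "game", "like", "now"]
kept = [t for t in tokens if t not in all_stopwords]
print(kept)  # ['play', 'game', 'now']
```

The combined collection could then be passed to `bot_filter` as `common_words=list(all_stopwords)`.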
