[/r/soccer](https://www.reddit.com/r/soccer/) is my most frequented subreddit. I don't have stats to back that up, so you'll have to take my word for it. In general, football subreddits and online forums are a gold mine for analyzing the behaviour of fans. One could think of a subreddit as a "superorganism" with its own set of internally conflicting opinions and ideas. This is the motivation behind this blog post: analyzing fan behaviour by looking at comments on reddit. I consider 5 subreddits: /r/soccer (frequented by fans of all teams) and four club subreddits, /r/reddevils, /r/liverpoolfc, /r/chelseafc and /r/gunners (for fans of Manchester United, Liverpool, Chelsea and Arsenal respectively).

## Getting the Data

Thanks to [PRAW, the Python Reddit API Wrapper](https://praw.readthedocs.io/en/latest/), getting comment data from reddit is easier than ever. [Here](https://github.com/kvsingh/soccer-comments/blob/master/get_data.py) is the code for fetching the comments with PRAW and writing them to a pickle file.
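For readers who have not used PRAW before, here is a minimal sketch of what such a collection step could look like. It is not the linked script: the function names `collect_comments` and `save_comments`, the `[score, body]` row layout, and the output filename in the usage comment are all assumptions made for illustration. The `reddit` argument is expected to be a `praw.Reddit` instance created with your own API credentials.

```python
import pickle


def collect_comments(reddit, subreddit_name, post_limit=1000):
    """Return [score, body] rows for every comment on a subreddit's top posts.

    `reddit` is expected to be a praw.Reddit instance. The [score, body]
    layout is an assumption of this sketch, not the linked script's format.
    """
    rows = []
    for submission in reddit.subreddit(subreddit_name).top(limit=post_limit):
        # Drop the "load more comments" placeholders instead of expanding them
        submission.comments.replace_more(limit=0)
        # .list() flattens the comment tree into a plain list
        for comment in submission.comments.list():
            rows.append([comment.score, comment.body])
    return rows


def save_comments(rows, path):
    """Write the collected rows to a pickle file."""
    with open(path, "wb") as f:
        pickle.dump(rows, f)


# Usage (needs network access and your own credentials; filename is hypothetical):
#   import praw
#   reddit = praw.Reddit(client_id="...", client_secret="...", user_agent="...")
#   save_comments(collect_comments(reddit, "soccer"), "soccer-comments.p")
```

Passing the `reddit` object in as an argument keeps the credentials out of the function and makes it easy to try the logic against a stub before hitting the real API.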

Let's find out some basic information about the data.

In [7]:
import pickle

subreddits = ["soccer", "liverpoolfc", "Gunners", "reddevils", "chelseafc"]
all_comments = {}
for subreddit in subreddits:
    # One pickle file per subreddit, as written by get_data.py
    with open("reddit-top-1000-post-comments-" + subreddit + ".p", "rb") as f:
        all_comments[subreddit] = pickle.load(f)
    print(subreddit, ":", len(all_comments[subreddit]), "comments")

soccer : 39005 comments
liverpoolfc : 7363 comments
Gunners : 6462 comments
reddevils : 6606 comments
chelseafc : 4617 comments


## Data Cleaning

Comments on reddit are full of odd symbols and non-ASCII characters. I spent some time figuring out which common characters and words are irrelevant to our analysis. Any kind of text analysis involves some domain-specific study to decide which characters and words to remove.

First, we need to strip non-ASCII characters such as emoji (internet forums are full of these). In addition, reddit stores "[removed]" as the text of deleted comments, so we drop those. We also remove "\n" (the newline character) and "'s" (I got the idea to remove this particular suffix by going through the code of the [WordCloud](https://github.com/amueller/word_cloud) module).

In [12]:
import re
import unicodedata

comments_modified = {}
for subreddit in subreddits:
    this_comments_modified = []
    for comment in all_comments[subreddit]:
        # comment[1] is the comment text: decompose it, then drop everything non-ASCII
        comment_mod = unicodedata.normalize('NFKD', comment[1]).encode('ascii', 'ignore').decode('ascii')
        comment_mod = re.sub(r"\n", "", comment_mod)  # strip newlines
        comment_mod = re.sub(r"'s", "", comment_mod)  # strip the "'s" suffix
        # Skip the placeholder reddit stores for deleted comments
        if comment_mod != "[removed]":
            this_comments_modified.append([comment_mod, comment[0]])
    comments_modified[subreddit] = this_comments_modified
    print(subreddit, ":", len(comments_modified[subreddit]), "comments")

soccer : 38994 comments
liverpoolfc : 7333 comments
Gunners : 6400 comments
reddevils : 6604 comments
chelseafc : 4601 comments


As an example, the following comment:

> Palace fan here\n. Well done. \U0001f44f\U0001f3fc\U0001f44f\U0001f3fc

gets modified to:

> Palace fan here. Well done.
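To check the cleaning steps on individual strings, they can be wrapped in a small helper. This is a sketch using only the standard library; the name `clean_comment` is hypothetical, and it uses `str.replace` where the cell above uses `re.sub`, which is equivalent for these fixed patterns.

```python
import unicodedata


def clean_comment(text):
    """Apply the cleaning steps described above to a single comment string."""
    # NFKD splits accented letters into base letter + combining mark;
    # encoding to ASCII with errors="ignore" then drops the marks and any emoji.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    text = text.replace("\n", "")  # remove newlines
    text = text.replace("'s", "")  # remove the "'s" suffix
    return text


example = "Palace fan here\n. Well done. \U0001f44f\U0001f3fc\U0001f44f\U0001f3fc"
print(repr(clean_comment(example)))
# 'Palace fan here. Well done. '
```

Note the trailing space left behind where the emoji used to be; it is harmless here because tokenization discards whitespace later on.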

For any sort of analysis, we need to "tokenize" the comments, i.e., identify the important words in each comment and transform them into a form that can be used by our classifiers/analyzers. How this is done varies with the domain you are working in. For our use case, we perform the following steps:

* Tokenizing each comment into words
* Lemmatizing, i.e. reducing each word to its root form
* Eliminating stop words and punctuation

In [None]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

wordnet_lemmatizer = WordNetLemmatizer()
# Build the stop word set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))
comment_tokens = {}
for subreddit in subreddits:
    comments = comments_modified[subreddit]
    this_tokens = []
    for comment in comments:
        tokens = word_tokenize(comment[0])
        tokens = [t.lower() for t in tokens]
        tokens = [wordnet_lemmatizer.lemmatize(t) for t in tokens]
        tokens = [t for t in tokens if t not in stop_words]
        # Strip punctuation, then drop empty strings and one- or two-letter leftovers
        tokens = [re.sub(r'[^\w\s]', '', t) for t in tokens]
        tokens = [t for t in tokens if len(t) > 2]
        this_tokens.append(tokens)
    comment_tokens[subreddit] = this_tokens
with open("comment_tokens.p", "wb") as f:
    pickle.dump(comment_tokens, f)
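A quick way to sanity-check the tokens is to count the most frequent ones per subreddit. The sketch below is an illustration, not part of the original notebook: `top_tokens` is a hypothetical helper, and the sample data is made up to match the shape of `comment_tokens[subreddit]` (a list of token lists).

```python
from collections import Counter


def top_tokens(tokens_per_comment, n=10):
    """Return the n most frequent tokens across a list of token lists."""
    counts = Counter()
    for tokens in tokens_per_comment:
        counts.update(tokens)
    return counts.most_common(n)


# Made-up input in the same shape as comment_tokens[subreddit]
sample = [["great", "goal", "salah"], ["salah", "goal"], ["salah", "penalty"]]
print(top_tokens(sample, n=2))
# [('salah', 3), ('goal', 2)]

# With the real data (assuming comment_tokens.p was written as above):
#   import pickle
#   with open("comment_tokens.p", "rb") as f:
#       comment_tokens = pickle.load(f)
#   print(top_tokens(comment_tokens["soccer"]))
```

If stop words or stray punctuation show up near the top of the list, that usually points to a gap in the cleaning or filtering steps.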