# Level of Education Estimator

**Alex Shi, Mark Lee, Jun Ma**

## Introduction

This project is based on the idea of predicting education level by observing user behavior. More specifically, we plan to analyze the public comments of users on online forums and social media, including Facebook, CollegeConfidential, and Reddit, use natural language processing to estimate the level of sophistication of said comments, and correlate the estimations with the actual education level of the users.

## Overview

We built a model that takes a "comment" (a body of English text) as input and classifies it into one of three education levels:

- college or above (2)
- high school (1)
- below high school (0)

One could imagine extending this model by creating an interface that takes as input a user's Facebook page, scrapes the user's public comments, and uses the model to make predictions about the user's education level.

## Data Collection

Our initial idea was to directly scrape comments and education levels from users' Facebook pages using the [Graph API](https://developers.facebook.com/docs/graph-api). However, we found that it was not only difficult to find many users who reveal their education publicly on their profiles, but also that the API prevents us from directly retrieving timeline posts from a user unless he or she explicitly grants us [permission](https://developers.facebook.com/docs/graph-api/reference/v2.5/user/feed) to do so.

We therefore changed our strategy: search directly for people within a certain age range, then scrape comments from the resulting page. For instance, the following page lists users who are between 20 and 30 years old:
> https://www.facebook.com/search/20/30/users-age-2

However, this approach has its own shortcomings. For one, building a robust parser with BeautifulSoup was non-trivial, given the complicated and widely varying structure of Facebook pages. More importantly, it does not solve the problem of users not revealing their education levels or comments publicly, and it mostly returns users who are not far removed from our own social circles. As a result, we could not scrape many comments, and the comments we did scrape showed quite low variance in their features.

Rather than seeking out users directly, we decided to target specific demographics by pulling data from different online groups. For instance, rather than trying to find users who publicly reveal that they have no high school education, we found pages whose audience is known to consist primarily of younger, less educated users. Of course, we make the (potentially questionable) assumption that comments on those pages are representative of an average comment within that education level. To check the legitimacy of this approach, we used cross-validation within each online group, as well as across different, unrelated groups, to measure accuracy (the results are presented later).
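
One way to read the across-group check is as leave-one-page-out validation: hold out every comment from one page, train on the remaining pages, and score on the held-out page. The sketch below illustrates that idea with scikit-learn's `LeaveOneGroupOut` on synthetic data. It is not the code behind the reported results; the feature values, the six hypothetical pages, and the use of a default `LogisticRegression` are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.RandomState(0)

# Synthetic stand-in: 6 hypothetical pages, 2 per education level, 50 comments each.
page_labels = [0, 0, 1, 1, 2, 2]
X_parts, y_parts, group_parts = [], [], []
for page_id, label in enumerate(page_labels):
    # 4 made-up features per comment, shifted slightly by label
    X_parts.append(rng.normal(loc=label, scale=1.0, size=(50, 4)))
    y_parts.append(np.full(50, label))
    group_parts.append(np.full(50, page_id))
X = np.vstack(X_parts)
y = np.concatenate(y_parts)
groups = np.concatenate(group_parts)

# Hold out one page at a time, train on the rest, score on the held-out page.
scores = {}
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[train_idx], y[train_idx])
    held_out_page = int(groups[test_idx][0])
    scores[held_out_page] = clf.score(X[test_idx], y[test_idx])

for page_id, acc in sorted(scores.items()):
    print("held-out page {}: accuracy {:.2f}".format(page_id, acc))
```

Because every page carries a single label, this only works when each education level has at least two pages, so that the held-out page's label still appears in the training data.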

**The following summarizes our choice of groups for training:**

- College or above:
    - [The New Yorker](https://www.facebook.com/newyorker)
    - [The New York Times](https://www.facebook.com/nytimes)
    - [IEEE](https://www.facebook.com/IEEE.org/?fref=ts)
    - [Psychology](https://www.facebook.com/elsevierpsychology/)
    - [Facebook Engineering](https://www.facebook.com/engineering/)
    - [Nature](https://www.facebook.com/nature/)
- High school:
    - [Justin Bieber](https://www.facebook.com/justin.bieber.film)
    - [Twilight](https://www.facebook.com/TwilightMovie)
    - [College Confidential Discussion Board](https://talk.collegeconfidential.com)
    - [Reddit Debate Forum](https://www.reddit.com/r/Debate/)
    - [Worldstar Hip Hop](https://www.facebook.com/worldstarhiphop)
- Below high school:
    - [Club Penguin](https://www.facebook.com/clubpenguin)
    - [Minecraft](https://www.facebook.com/minecraft)
    - [Gucci Mane](https://www.facebook.com/guccimane)
    
**The following are for testing:**

- College or above:
    - [The Economist](https://www.facebook.com/TheEconomist)
- High school:
    - [HipHopDX](https://www.facebook.com/HipHopDX/?fref=ts)
- Below high school:
    - [Desiigner](https://www.facebook.com/LifeOfDesiigner/)
    
**Some observations:**

- Many of the pages targeted at younger audiences are mostly recreational. We found virtually no pages that were both non-recreational and had substantial posts by younger users.
- It was very difficult to find pages for the "below high school" group. We suspect that kids of that age typically don't post very much online (hence most pages were dominated by parents posting on behalf of their children).

## Scrapers

**For Facebook pages**

We use the Facebook Graph API, which amounts to making HTTP requests that include an access token and the page's unique ID.

In [None]:
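import json
import time

import requests

# NOTE: access_token (a Graph API access token) must be defined before running this cell

# get the unique page id of a certain Facebook page
def getid(pagename):
    url = "https://graph.facebook.com/" + pagename + "?access_token=" + access_token
    # parse the JSON response instead of slicing the raw text
    return json.loads(requests.get(url).text, strict=False)["id"]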

# get posts and their post ids
def getpost(pagename):
    msg = []
    idl = []
    page_id = getid(pagename)
    url = "https://graph.facebook.com/v2.8/" + page_id + "/posts/?fields=message&limit=100&access_token=" + access_token
    text = requests.get(url).text
    data = json.loads(text, strict=False)
    # note: msg and idl can differ in length, since a post may lack a message
    for post in data["data"]:
        if "message" in post:
            msg.append(post["message"])
        if "id" in post:
            idl.append(post["id"])
    return msg, idl

# get comments of those posts based on post id and try to parse the comments
def getcomments(pagename, max_comments=2000):
    idlist = getpost(pagename)[1]
    parsed = []
    raw = []
    num_comments = 0
    # make one pass over the posts and stop once enough comments are collected
    for post_id in idlist:
        if num_comments >= max_comments:
            break
        url = "https://graph.facebook.com/v2.8/" + post_id + "/comments?access_token=" + access_token
        response = requests.get(url).text
        try:
            raw_comments = json.loads(response, strict=False)["data"]
        except (ValueError, KeyError):
            # skip posts whose response is not valid JSON or has no "data" field
            continue
        parsed_comments = []
        for comment in raw_comments:
            try:
                # keep only pure-ASCII comments; others raise UnicodeError here
                text = comment["message"].encode("ascii").decode("ascii")
            except (KeyError, UnicodeError):
                continue
            # keep comments with more than 5 space-separated tokens
            if len(text.split(" ")) > 5:
                num_comments += 1
                parsed_comments.append(text)
        raw.append(raw_comments)
        parsed.append(parsed_comments)
        time.sleep(0.2)
    return raw, parsed

# list of page names for scraping comments
pages = [
'nytimes',
'newyorker',
'TheEconomist',
'justin.bieber.film',
'TwilightMovie',
'minecraft',
'clubpenguin',
'nature',
'engineering',
'elsevierpsychology',
'IEEE.org',
'guccimane',
'HipHopDX_70',
'LifeofDesiigner',
'worldstarhiphop'
]

# scrape data and write into json files
for page in pages:
    print("getting data for {} ...".format(page))
    raw, parsed = getcomments(page)
    print("writing raw data ...")
    with open('{}_raw.json'.format(page), 'w') as outfile:
        json.dump(raw, outfile)

    print("finish writing raw data from {}".format(page))
    print("writing comments ...")
    with open('{}.json'.format(page), 'w') as outfile:
        json.dump(parsed, outfile)
    print("finish writing comments from {}".format(page))
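
The files written above hold one list of comments per post. A minimal sketch of how they could be flattened and paired with education labels is shown below. The `PAGE_LABELS` mapping, the `load_labeled_comments` helper, and the demo comments are hypothetical and only illustrate the file format; they are not part of the project code.

```python
import json
import os
import tempfile

# Hypothetical mapping from page name to education label (0, 1, or 2),
# following the groups listed under Data Collection.
PAGE_LABELS = {"nytimes": 2, "TwilightMovie": 1, "minecraft": 0}

def load_labeled_comments(data_dir, page_labels):
    """Flatten each page's per-post comment lists and attach the page's label."""
    comments, labels = [], []
    for page, label in sorted(page_labels.items()):
        path = os.path.join(data_dir, "{}.json".format(page))
        with open(path) as infile:
            per_post = json.load(infile)  # list of lists, one inner list per post
        for post_comments in per_post:
            for comment in post_comments:
                comments.append(comment)
                labels.append(label)
    return comments, labels

# Tiny demo with made-up comments written in the scraper's output format.
demo_dir = tempfile.mkdtemp()
demo_data = {
    "nytimes": [["An insightful piece on monetary policy this week."], []],
    "TwilightMovie": [["omg i love this movie so so much"]],
    "minecraft": [["can u add more blocks to the game pls",
                   "i like the new update a lot"]],
}
for page, per_post in demo_data.items():
    with open(os.path.join(demo_dir, "{}.json".format(page)), "w") as outfile:
        json.dump(per_post, outfile)

comments, labels = load_labeled_comments(demo_dir, PAGE_LABELS)
print(len(comments), labels)
```

The resulting `comments` and `labels` lists line up index by index, which is the shape `create_features` expects for its `docs` and `labels` arguments.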

**College Confidential and Reddit**

We scrape the webpages using [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) since they are structurally simpler than Facebook pages.

In [None]:
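import requests
from bs4 import BeautifulSoup

# get Reddit comments
def getreddit(url):
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")
    count = 0
    result = []
    for paragraph in soup.find_all("p"):
        # keep attribute-free <p> tags that hold a single string longer than 25 characters
        if paragraph.attrs == {} and paragraph.string is not None and len(paragraph.string) > 25:
            count += 1
            # skip the first 10 matching paragraphs
            if count > 10:
                result.append(paragraph.string)
    return result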

# get College Confidential posts from pages 1..index of a thread
def getcomment(turl, index):
    comment = []
    for i in range(1, index + 1):
        print(i)
        # the first page has no suffix; later pages end in "-p<i>"
        suffix = "" if i == 1 else "-p" + str(i)
        url = turl + suffix + ".html"
        html = requests.get(url).text
        soup = BeautifulSoup(html, "html.parser")
        for element in soup.find_all("div", class_="Message"):
            # keep the leading text of each message, stripped of whitespace
            comment.append(element.contents[0].strip())
    return comment

## Training

For each training data set (i.e. not "The Economist", "HipHopDX", or "Desiigner", which are used for testing), we run several pre-processing steps:


1. Convert each comment into a vector of features (see below)
2. Associate each vector of features with a label (0, 1, or 2, as specified in the Overview)
3. Use 70% of the labeled examples for training, and reserve 30% for the holdout validation set
4. Fit the model with the 70% training set, and then evaluate performance on the 30% validation set
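
The 70/30 split in step 3 can be sketched with scikit-learn's `train_test_split`. The data below is synthetic, and the `stratify` argument and the `random_state` value are assumptions for the example; the text does not say whether the original split was stratified.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)

# Synthetic stand-in for the real feature matrix: 300 comments, 4 features each.
X = rng.rand(300, 4)
y = np.repeat([0, 1, 2], 100)

# 70% for training, 30% held out for validation; stratify keeps the
# class proportions the same in both parts.
X_train, X_validate, y_train, y_validate = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(X_train.shape, X_validate.shape)
```

The validation part is only used for scoring, so any model selection done on it still needs the separate testing pages for an unbiased estimate.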

**Features**

1. Ignore short comments (fewer than 3 words)
2. Ignore non-English comments (using the [langid](https://github.com/saffsd/langid.py) library)
3. Filter out punctuation (including things like emojis), proper nouns, and links
4. Compute a vector of four metrics:
    - `syllables_per_word`: count the total number of syllables and divide by total number of words
    - `words_per_sentence`: count the total number of words and divide by total number of sentences
    - `spelling_errors_per_sentence`: count the total number of spelling errors and divide by total number of sentences
    - `grammar_errors_per_sentence`: count the total number of grammar errors and divide by total number of sentences

To compute the syllables per word, we use the syllable counter in [nltk_contrib](https://github.com/nltk/nltk_contrib/blob/master/nltk_contrib/readability/syllables_en.py). To identify spelling errors, we check whether each word (excluding proper nouns, i.e. capitalized words) appears in a [dictionary of English words](https://github.com/dwyl/english-words). To identify grammar errors, we use the [grammar-check](https://pypi.python.org/pypi/grammar-check) library, which gives us the total number of grammar errors.
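
To make the ratios concrete, here is a toy, standard-library-only sketch that computes three of the four features for one comment. It is not the project's implementation: the vowel-group syllable heuristic stands in for the nltk_contrib counter, the regex sentence split stands in for a proper sentence tokenizer, `KNOWN_WORDS` is a tiny made-up dictionary, and the grammar feature is left out because it needs an external grammar checker.

```python
import re

# Tiny stand-in dictionary; the project uses a full English word list.
KNOWN_WORDS = {"the", "cat", "sat", "on", "mat", "it", "was", "happy"}

def count_syllables_heuristic(word):
    """Rough syllable count: number of vowel groups, at least 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def toy_features(comment):
    # naive sentence split on ., ! and ?
    sentences = [s for s in re.split(r"[.!?]+", comment) if s.strip()]
    words = re.findall(r"[A-Za-z']+", comment)
    if not sentences or not words:
        return None  # nothing to measure
    syllables = sum(count_syllables_heuristic(w) for w in words)
    misspelled = sum(1 for w in words if w.lower() not in KNOWN_WORDS)
    n_sentences = float(len(sentences))
    return {
        "syllables_per_word": syllables / float(len(words)),
        "words_per_sentence": len(words) / n_sentences,
        "spelling_errors_per_sentence": misspelled / n_sentences,
    }

features = toy_features("The cat sat on the mat. It was hapy.")
print(features)
```

For this comment there are 9 words in 2 sentences (4.5 words per sentence), and the one unknown word, "hapy", gives 0.5 spelling errors per sentence.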

The relevant feature generation code is listed here:

In [None]:
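import re
import string
import time

import grammar_check
import langid
import numpy as np
from nltk.tokenize import sent_tokenize

# NOTE: tokenizer, url_regex, punctuation_table, proper_noun_regex,
# space_or_num_regex, count_syllables, spelling_tool, grammar_tool and
# MIN_WORDS_PER_DOC are module-level names defined elsewhere in the project

# convert comment into sentences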
def get_sentences (doc):
    return sent_tokenize(doc)

# convert sentence into words
def get_words (sentence):
    sentence = sentence.strip()

    words = []
    for token in tokenizer.tokenize(sentence):
        token = token.decode("utf-8")

        # remove urls
        modified = re.sub(url_regex, '', token)
        # remove punctuation
        modified = modified.translate(punctuation_table)
        # remove proper nouns
        modified = re.sub(proper_noun_regex, '', modified)
        # remove whitespace and standalone numbers
        modified = re.sub(space_or_num_regex, '', modified)

        if len(modified):
            words.append(token) 

    return words

# compute for a specific comment:
# number of syllables, number of words, number of spelling errors,
# number of grammar errors, number of sentences
def get_metrics (doc):
    global grammar_tool

    # initialize dict
    metrics = [
        'syllables', 'words', 'spelling_errors', 'grammar_errors', 'sentences'
    ]
    res = { metric: 0 for metric in metrics }

    # initial parse
    sentences = get_sentences(doc)

    # get metrics
    num_sentences = len(sentences)
    res['sentences'] = num_sentences
    for sentence in sentences:
        try:
            try:
                res['grammar_errors'] += len(grammar_tool.check(sentence))
            except Exception as e:
                print "grammar tool failed: {}".format(e)
                print "reinitializing grammar tool.."
                grammar_tool = grammar_check.LanguageTool('en-US')
                time.sleep(0.1)

            words_for_sentence = get_words(sentence)
            res['words'] += len(words_for_sentence)

            for word in words_for_sentence:
                try:
                    # handle trailing punctuation for spellchecker
                    if word[-1] in string.punctuation:
                        word = word[:-1]
                    res['syllables'] += count_syllables(word)
                    if not spelling_tool.check(word):
                        res['spelling_errors'] += 1
                except Exception as e:
                    print "inner exception:", e
                    continue
        except Exception as e:
            print "outer exception:", e
            continue

    if res['words'] == 0:
        print "discarding...", doc

    return res

# given a set of metrics returned by get_metrics, compute the feature vector:
# `syllables_per_word`: count the total number of syllables and divide by
# total number of words
# `words_per_sentence`: count the total number of words and divide by total
# number of sentences
# `spelling_errors_per_sentence`: count the total number of spelling
# errors and divide by total number of sentences
# `grammar_errors_per_sentence`: count the total number of
# grammar errors and divide by total number of sentences
def get_features (metrics):
    num_sentences = metrics['sentences']

    # document is too short
    if (num_sentences == 0 or metrics['words'] < MIN_WORDS_PER_DOC):
        return None

    res = []
    num_sentences = float(num_sentences)

    # compute features
    res.append(metrics['syllables'] / float(metrics['words']))
    res.append(metrics['words'] / num_sentences)
    res.append(metrics['spelling_errors'] / num_sentences)
    res.append(metrics['grammar_errors'] / num_sentences)
    return np.array(res)

# given a list of docs (body of text), parse into tokens
# if doc is too short, skip
# otherwise, use the tokens to build a feature, and label appropriately
def create_features (docs, labels):
    X, y = [], []
    non_english = 0
    too_short = 0

    for i, doc in enumerate(docs):
        # ignore if not english
        if langid.classify(doc)[0] != 'en':
            non_english += 1
            continue

        metrics = get_metrics(doc)
        features = get_features(metrics)
        if features is not None:
            X.append(features)
            y.append(labels[i])
        else:
            too_short += 1

    X = np.array(X)
    y = np.array(y)
    print X.shape, y.shape, non_english, too_short
    return X, y

**Model**

Initially, we used `sklearn.svm.SVC` as our model, using grid search to optimize the regularization parameter and scoring the accuracy over various kernels to determine the optimal kernel. However, we found that even when using the most accurate kernel ("poly"), the best performance we were able to achieve was 45% accuracy on the validation set. Furthermore, the non-linear kernel took several minutes just to train on the data, which prevented us from doing more extensive grid search. We fell back on the "linear" kernel which performed relatively similarly to "poly" (and also ran much faster), but in the end optimizing the regularization only resulted in 47% accuracy on the validation set.

The next thing we tried was `LogisticRegression`, which performed comparably to the SVM but trained much faster (on the order of seconds rather than minutes) while still supporting multi-class classification. Using just the default parameters, we achieved close to 45% accuracy, and the speed allowed a much more thorough search over the hyperparameters. With the "lbfgs" solver, regularization set to 10, and max iterations set to 100, we achieved close to 49% accuracy with much less training time.

As an alternative to grid search, we also tried `LogisticRegressionCV`, which uses k-fold cross-validation to choose the regularization strength. Using 10 folds, regularization values from 10^-10 to 10^9 (one per power of ten, matching the code below), and 10000 max iterations, we reached over 56% accuracy, a clear improvement over the earlier models.

The relevant code is listed:

In [None]:
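import numpy as np
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import GridSearchCV

# X_train should be an n x 4 array of feature vectors (see create_features)
# y_train should be a length-n array of int labels (0, 1, or 2)
# the ith label should correspond to the ith feature vector
# NOTE: the kernel argument is not used by this function
def learn_classifier (X_train, y_train, kernel='best'):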
    print "learning classifier..."
    clf = LogisticRegressionCV(
        Cs=list(np.power(10.0, np.arange(-10, 10))),
        penalty='l2',
        cv=10, # kfolds with k=10
        random_state=42,
        max_iter=10000,
        fit_intercept=True,
        solver='lbfgs',
        tol=1e-4
    )
    clf.fit(X_train, y_train)
    print "done learning classifier"
    return clf

# chooses optimal kernel for svm
def optimal_svm_kernel (X_train, y_train, X_validate, y_validate):
    best_kernel = None
    best_accuracy = 0
    for kernel in ['linear', 'rbf', 'poly', 'sigmoid']:
        classifier = SVM(X_train, y_train, kernel)
        accuracy = classifier.score(X_validate, y_validate)
        if best_kernel is None or accuracy > best_accuracy:
            best_accuracy = accuracy
            best_kernel = kernel
        print kernel, ":", accuracy
    return best_kernel, best_accuracy

# choose optimal params for logistic regression
def optimal_hyperparams (X_train, y_train):
    print "running grid search..."
    clf = LogisticRegression()
    params = {
        "solver": ["lbfgs", "newton-cg", "sag"],
        "max_iter": [100, 1000, 10000],
        "multi_class": ["ovr", "multinomial"],
        "C": [0.001, 0.01, 0.1, 1, 10, 100, 1000]
    }
    classifier = GridSearchCV(clf, params)
    classifier.fit(X_train, y_train)
    best = classifier.best_params_
    print "best params:", best
    print "best score:", classifier.best_score_
    print "dumping results to file: {}...".format(GRID_SEARCH_RESULTS_FILE)
    with open(GRID_SEARCH_RESULTS_FILE, 'w') as f:
        f.write("{}".format(best))


## Prediction and analysis

When running on the testing data, which our model was never trained on, we get just under 50% accuracy. While this may not seem very high, we are predicting across three categories that are, in some sense, not very discrete. Not only was the assignment of sources to categories based on our own judgment (and what limited demographic data we have online), but also the comments within and across the pages obviously are not uniformly of the same genre, tone, or style of writing. 
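
With three classes, a useful reference point for that number is a trivial baseline, such as always predicting the most common class (one third accuracy when the classes are balanced), together with a confusion matrix showing which levels get mixed up. The sketch below shows both on made-up labels and predictions; the values are illustrative only and are not our results.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / float(len(y_true))

def confusion_counts(y_true, y_pred, labels=(0, 1, 2)):
    """Rows are true labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

# Made-up labels and predictions, for illustration only.
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 0, 1, 2, 1, 2, 2, 1, 2]

# Baseline: always predict the most common true label.
majority_label = Counter(y_true).most_common(1)[0][0]
baseline = accuracy(y_true, [majority_label] * len(y_true))

print("model accuracy:", accuracy(y_true, y_pred))
print("majority baseline:", baseline)
for row in confusion_counts(y_true, y_pred):
    print(row)
```

Off-diagonal entries next to the diagonal (for example, true 1 predicted as 2) would indicate confusion between adjacent education levels, which is the kind of error we would expect given how loosely the categories are separated.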

To improve our accuracy, we could try identifying different features that are perhaps more salient or representative of each education group, although our features are already relatively less biased than, say, any bag of words approach, which is even more heavily influenced by the genre of writing. 

For example, we visualized the feature data and observed that our expectations matched relatively well with our features. Namely, we can see that grammar and spelling errors increase as we go down the education levels, and syllables per word increases as we go up the education levels.

![feature_visualization](feature_visualization.png)
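
The trend in the figure can also be checked numerically by averaging each feature within each label. The sketch below assumes a feature matrix `X` and label array `y` shaped like the output of `create_features`; the rows shown are made up for illustration, and `mean_features_by_label` is a hypothetical helper rather than project code.

```python
import numpy as np

FEATURE_NAMES = [
    "syllables_per_word",
    "words_per_sentence",
    "spelling_errors_per_sentence",
    "grammar_errors_per_sentence",
]

def mean_features_by_label(X, y):
    """Return {label: mean feature vector} for each label present in y."""
    return {int(label): X[y == label].mean(axis=0) for label in np.unique(y)}

# Made-up feature rows for illustration; real rows come from create_features.
X = np.array([
    [1.2, 8.0, 1.0, 0.5],
    [1.4, 10.0, 2.0, 1.5],
    [1.5, 14.0, 0.5, 0.5],
    [1.7, 16.0, 0.5, 0.0],
])
y = np.array([0, 0, 2, 2])

for label, means in sorted(mean_features_by_label(X, y).items()):
    print(label, dict(zip(FEATURE_NAMES, means.round(2))))
```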

Perhaps more important is that our initial assumption, that comments from pages with audiences of certain demographics accurately represent those demographics, may not be entirely accurate, at least for the pages we chose. To further improve the accuracy of our model, a better approach would be to find more reliable data, such as reverting to our initial idea of scraping education levels from Facebook pages using BeautifulSoup. Even though this method would probably yield less data, knowing that our sample falls within a certain age group gives us relatively more confidence in the reliability of the data.

Further, we have only thus far tried relatively simple models, namely LogisticRegression and SVM with "linear" and "poly" kernels (with relatively un-optimized parameters). Further investigation into different types of classification models, as well as more extensive optimization of parameters would probably yield more favorable results.

## Appendix

- https://www.quora.com/What-are-the-demographics-of-Minecraft-players
- http://www.ibtimes.com/audience-profiles-who-actually-reads-new-york-times-watches-fox-news-other-news-publications-1451828
- https://pypi.python.org/pypi/pylinkgrammar
- http://stackoverflow.com/questions/10252448/how-to-check-whether-a-sentence-is-correct-simple-grammar-check-in-python
- https://github.com/dwyl/english-words