# Level of Education Estimator

**Alex Shi, Mark Lee, Jun Ma**

## Introduction

This project is based on the idea of predicting education level by observing user behavior. More specifically, we plan to analyze the public comments of users on online forums and social media, including Facebook, CollegeConfidential, and Reddit, use natural language processing to estimate the level of sophistication of said comments, and correlate the estimations with the actual education level of the users.

## Overview

Generally, we have created a model that accepts as input a "comment" (some body of English text), and classifies the comment within one of three education levels: 

- college or above (2)
- high school (1)
- below high school (0)

One could imagine extending this model by creating an interface that takes as input a user's Facebook page, scrapes the user's public comments, and uses the model to make predictions about the user's education level.

## Procedure of Project

### Data Collection

Our initial idea was to directly scrape comments and education levels from users' Facebook pages using the [Graph API](https://developers.facebook.com/docs/graph-api). However, we found that it was not only difficult to find many users who reveal their education publicly on their profiles, but also that the API prevents us from directly retrieving timeline posts from a user unless he or she explicitly grants us [permission](https://developers.facebook.com/docs/graph-api/reference/v2.5/user/feed) to do so.

We therefore modified our strategy: search directly for people within a certain age range, then scrape comments from the resulting pages. For instance, the following page lists users who are between 20 and 30 years old:

> https://www.facebook.com/search/20/30/users-age-2

However, this approach has its own shortcomings. For one, building a robust parser with BeautifulSoup was non-trivial, given the complicated and widely varying structure of Facebook pages. More importantly, it does not solve the problem of users not revealing their education levels or comments publicly, and it only returns users who aren't very far removed from our existing social circles. As a result, we weren't able to scrape many comments, and even for the comments we were able to scrape, we found quite low variance in terms of features.

Rather than seeking out users directly, we decided to target specific demographics by pulling data from different online groups. For instance, rather than trying to find users who publicly reveal that they have no high school education, we found pages whose audience is known to consist primarily of younger, less educated users. Of course, we make the (potentially questionable) assumption that comments on those pages are representative of an average comment within that education level. To check the validity of this approach, we used cross-validation within each online group, as well as across different unrelated groups (the results of which will be presented later).
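One way to set up the cross-group check is to hold out one source page at a time and train on the rest. The sketch below only builds the splits; the function name `leave_one_group_out` and the toy samples are illustrative assumptions, not the project's actual code.

```python
from collections import defaultdict


def leave_one_group_out(samples):
    """Yield (held_out_group, train, test) splits.

    `samples` is a list of (comment, label, group) tuples, where `group`
    is the source page. Each group is held out once as the test set.
    """
    by_group = defaultdict(list)
    for comment, label, group in samples:
        by_group[group].append((comment, label))

    for held_out in sorted(by_group):
        test = by_group[held_out]
        train = [
            pair
            for group, pairs in by_group.items()
            if group != held_out
            for pair in pairs
        ]
        yield held_out, train, test


# Hypothetical toy data: (comment, label, source page)
samples = [
    ("An insightful analysis of monetary policy.", 2, "nytimes"),
    ("The peer review process needs reform.", 2, "nature"),
    ("omg i love this movie so much", 1, "TwilightMovie"),
    ("i like penguins", 0, "clubpenguin"),
]

for held_out, train, test in leave_one_group_out(samples):
    print(held_out, len(train), len(test))
```

Since every comment from a held-out page shares one label, accuracy on that split is simply the fraction of its comments assigned the page's label.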

The following summarizes our choice of groups for training:

- College or above:
    - [The New Yorker](https://www.facebook.com/newyorker)
    - [The New York Times](https://www.facebook.com/nytimes)
    - [IEEE](https://www.facebook.com/IEEE.org/?fref=ts)
    - [Psychology](https://www.facebook.com/elsevierpsychology/)
    - [Facebook Engineering](https://www.facebook.com/engineering/)
    - [Nature](https://www.facebook.com/nature/)
- High school:
    - [Justin Bieber](https://www.facebook.com/justin.bieber.film)
    - [Twilight](https://www.facebook.com/TwilightMovie)
    - [College Confidential Discussion Board](https://talk.collegeconfidential.com)
    - [Reddit Debate Forum](https://www.reddit.com/r/Debate/)
    - [Worldstar Hip Hop](https://www.facebook.com/worldstarhiphop)
- Below high school:
    - [Club Penguin](https://www.facebook.com/clubpenguin)
    - [Minecraft](https://www.facebook.com/minecraft)
    - [Gucci Mane](https://www.facebook.com/guccimane)
    
The following are for testing:

- College or above:
    - [The Economist](https://www.facebook.com/TheEconomist)
- High school:
    - [HipHopDX](https://www.facebook.com/HipHopDX/?fref=ts)
- Below high school:
    - [Desiigner](https://www.facebook.com/LifeOfDesiigner/)

__*Reasons for choosing the above pages and forums:*__

__*Methods used for scraping data:*__

__For Facebook pages__, we use the Facebook Graph API, which amounts to sending HTTP requests that carry an access token and the unique page id returned by the Graph API. We then write all the collected comments into a JSON file. The core code looks like this:


In [None]:
import json
import time

import requests

# placeholder: replace with a valid Graph API access token
access_token = "YOUR_ACCESS_TOKEN"

# stop collecting once this many usable comments have been gathered
MAX_COMMENTS = 2000

# get the unique page id of a certain Facebook page
def getid(pagename):
    url = "https://graph.facebook.com/" + pagename + "?access_token=" + access_token
    data = json.loads(requests.get(url).text, strict=False)
    return data["id"]

# get posts and their post id
def getpost(pagename):
    msg = []
    idl = []
    page_id = getid(pagename)
    url = "https://graph.facebook.com/v2.8/" + page_id + "/posts/?fields=message&limit=100&access_token=" + access_token
    data = json.loads(requests.get(url).text, strict=False)
    for post in data["data"]:
        if "message" in post:
            msg.append(post["message"])
        if "id" in post:
            idl.append(post["id"])
    return msg, idl

# get comments of those posts based on post id and try to parse the comments
def getcomments(pagename):
    idlist = getpost(pagename)[1]
    parsed = []
    raw = []
    num_comments = 0
    # visit each post once, stopping early when enough comments are collected
    for post_id in idlist:
        if num_comments >= MAX_COMMENTS:
            break
        url = "https://graph.facebook.com/v2.8/" + post_id + "/comments?access_token=" + access_token
        response = requests.get(url).text
        try:
            raw_comments = json.loads(response, strict=False)["data"]
        except (ValueError, KeyError):
            # skip posts whose response is not valid JSON or has no "data" field
            continue
        parsed_comments = []
        for comment in raw_comments:
            try:
                # keep only pure-ASCII comments; anything else raises and is skipped
                text = comment["message"].encode("ascii").decode("ascii")
            except (KeyError, UnicodeError):
                continue
            # keep comments with more than 5 space-separated words
            if len(text.split(" ")) > 5:
                num_comments += 1
                parsed_comments.append(text)
        raw.append(raw_comments)
        parsed.append(parsed_comments)
        time.sleep(0.2)  # pause between requests
    return raw, parsed

# list of page names for scraping comments
pages = [
    'nytimes',
    'newyorker',
    'TheEconomist',
    'justin.bieber.film',
    'TwilightMovie',
    'minecraft',
    'clubpenguin',
    'nature',
    'engineering',
    'elsevierpsychology',
    'IEEE.org',
    'guccimane',
    'HipHopDX_70',
    'LifeofDesiigner',
    'worldstarhiphop',
]

# scrape data and write into json files
for page in pages:
    print("getting data for {} ...".format(page))
    raw, parsed = getcomments(page)
    print("writing raw data ...")
    with open('{}_raw.json'.format(page), 'w') as outfile:
        json.dump(raw, outfile)

    print("finish writing raw data from {}".format(page))
    print("writing comments ...")
    with open('{}.json'.format(page), 'w') as outfile:
        json.dump(parsed, outfile)
    print("finish writing comments from {}".format(page))

__For the other two websites, College Confidential and the Reddit Debate Forum__, we scrape data directly from the HTML using [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), since extracting text there is much easier than from a Facebook page. The parsing method differs slightly between the two sites because College Confidential splits a topic across multiple pages whereas Reddit does not.

In [None]:
import requests
from bs4 import BeautifulSoup

# get Reddit comments
def getreddit(url):
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = soup.find_all("p")
    count = 0
    result = []
    for p in paragraphs:
        # keep attribute-free <p> tags holding a single string longer than 25 characters
        if p.attrs == {} and p.string is not None and len(p.string) > 25:
            count += 1
            # skip the first 10 matching paragraphs
            if count > 10:
                result.append(p.string)
    return result

# get College Confidential posts from a thread that spans `index` pages
def getcomment(turl, index):
    comment = []
    for i in range(1, index + 1):
        print(i)
        # page 1 is "<thread>.html"; later pages are "<thread>-p<i>.html"
        suffix = "" if i == 1 else "-p" + str(i)
        url = turl + suffix + ".html"
        html = requests.get(url).text
        soup = BeautifulSoup(html, "html.parser")
        messages = soup.find_all("div", class_="Message")
        for element in messages:
            # keep the first child of each message, stripped of whitespace
            comment.append(element.contents[0].strip())
    return comment

Some observations:

- Many of the pages targeted at younger audiences are recreational. We could find virtually no pages that were both non-recreational and had substantial posts by younger users.
- It was very difficult to find pages for the "below high school" group. We suspect that kids of that age typically don't post very much online (hence most pages were dominated by parents posting on behalf of their children).

As such, we will make the assumption that the above few pages are relatively representative of each education level.

Having identified these pages, we then used the Facebook [Graph API](https://developers.facebook.com/docs/graph-api) to first gather all the posts, then parse all the comments within each post.

## Training

The end goal is to take a list of (comment, education level) tuples and train an SVM model that takes a given post or comment and outputs a predicted education level. For any given comment or post, we run the following pre-processing steps:

- ignore short posts (fewer than 5 words)
- ignore non-English posts (using the [langid](https://github.com/saffsd/langid.py) library)
- filter out punctuation (including things like emojis)
- tokenize the filtered comment or post, removing stopwords and rare words (including things like links, proper nouns)
- compute a vector of four metrics:
    - `syllables_per_word`: count the total number of syllables and divide by total number of words
    - `words_per_sentence`: count the total number of words and divide by total number of sentences
    - `spelling_errors_per_sentence`: count the total number of spelling errors and divide by total number of sentences
    - `grammer_errors_per_sentence`: count the total number of grammar errors and divide by total number of sentences
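
The filtering and tokenization steps above could be sketched as follows. This is a minimal illustration under stated assumptions: the stopword set is a tiny placeholder, the language check and rare-word removal are omitted, and the names `STOPWORDS`, `MIN_WORDS`, and `preprocess` are hypothetical rather than taken from the project's code.

```python
import re

# Placeholder stopword list (assumption); a real run would use a fuller list.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "it"}

MIN_WORDS = 5  # comments with fewer words are ignored


def preprocess(comment):
    """Return a list of lowercase tokens, or None if the comment is too short."""
    if len(comment.split()) < MIN_WORDS:
        return None
    # Drop links first, then anything that is not a letter, apostrophe, or whitespace
    text = re.sub(r"https?://\S+", " ", comment)
    text = re.sub(r"[^A-Za-z'\s]", " ", text)
    tokens = [tok.lower() for tok in text.split()]
    return [tok for tok in tokens if tok not in STOPWORDS]


print(preprocess("too short"))
print(preprocess("The results of this study are truly remarkable!!! http://example.com"))
```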
    
To compute the syllables per word, we use the syllable counter in [nltk_contrib](https://github.com/nltk/nltk_contrib/blob/master/nltk_contrib/readability/syllables_en.py). To identify spelling errors, we check whether words (excluding proper nouns, i.e., capitalized words) are within a [dictionary of English words](https://github.com/dwyl/english-words). To identify grammar errors, we use the [grammar-check](https://pypi.python.org/pypi/grammar-check) library, which gives us the total number of grammar errors.
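To make the first three metrics concrete, here is a rough, self-contained sketch. It swaps in stand-ins for the real components: a vowel-group heuristic instead of the nltk_contrib syllable counter, and a tiny hard-coded `DICTIONARY` instead of the full English word list. The grammar metric is left out because it depends on the external checker. All function names here are assumptions for illustration.

```python
import re

# Tiny stand-in for the English word list (assumption, for illustration only)
DICTIONARY = {"the", "cat", "sat", "on", "mat", "it", "was", "happy"}


def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels (at least 1)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def split_sentences(text):
    """Split on ., !, or ? and drop empty pieces."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def compute_metrics(text):
    sentences = split_sentences(text)
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return None
    # Capitalized words are skipped as likely proper nouns
    # (this also skips sentence-initial words)
    misspelled = [
        w for w in words if not w[0].isupper() and w.lower() not in DICTIONARY
    ]
    return {
        "syllables_per_word": sum(count_syllables(w) for w in words) / len(words),
        "words_per_sentence": len(words) / len(sentences),
        "spelling_errors_per_sentence": len(misspelled) / len(sentences),
    }


print(compute_metrics("The cat sat on the mat. It was hapy."))
```

For the sample text, the sketch finds two sentences, nine words, and one misspelling ("hapy").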

The processed comments are then labeled according to the source they came from (e.g., posts from The New Yorker get a label of 2) and then used to train the SVM using linear regression.
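The labeling step can be sketched as a simple lookup from source to education level. The mapping below covers only a few of the training pages, and the dictionary and function names (`SOURCE_LABELS`, `label_comments`) are assumptions made for this example.

```python
# Labels for each education level, as defined in the Overview
COLLEGE, HIGH_SCHOOL, BELOW_HIGH_SCHOOL = 2, 1, 0

# Source page -> label (partial, illustrative subset of the training groups)
SOURCE_LABELS = {
    "newyorker": COLLEGE,
    "nytimes": COLLEGE,
    "nature": COLLEGE,
    "justin.bieber.film": HIGH_SCHOOL,
    "TwilightMovie": HIGH_SCHOOL,
    "clubpenguin": BELOW_HIGH_SCHOOL,
    "minecraft": BELOW_HIGH_SCHOOL,
}


def label_comments(comments_by_source):
    """Turn {source: [comment, ...]} into a flat list of (comment, label) tuples."""
    labeled = []
    for source, comments in comments_by_source.items():
        label = SOURCE_LABELS[source]
        labeled.extend((comment, label) for comment in comments)
    return labeled


example = {
    "newyorker": ["A thoughtful essay on urban planning."],
    "clubpenguin": ["i love my puffle"],
}
print(label_comments(example))
```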

## Predicting

Given a new user's Facebook page, we use the same functions as before to scrape all posts and comments from the user, applying the same pre-processing steps as we did during training. Finally, we use the trained SVM to predict the education level.
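The text does not say how per-comment predictions are combined into a single estimate for a user. One simple option, shown here purely as an assumption, is a majority vote. `predict_comment` stands in for the trained model, and `dummy_predict` is a made-up placeholder so the sketch runs on its own.

```python
from collections import Counter


def predict_user_level(comments, predict_comment):
    """Majority vote over per-comment predictions; None if there are no comments."""
    votes = Counter(predict_comment(c) for c in comments)
    if not votes:
        return None
    # most_common(1) returns [(label, count)] for the most frequent label
    return votes.most_common(1)[0][0]


# Hypothetical stand-in for the trained classifier: longer comments -> higher level
def dummy_predict(comment):
    n = len(comment.split())
    if n > 12:
        return 2
    if n > 6:
        return 1
    return 0


comments = [
    "cool video",
    "this is the best thing i have seen all week",
    "i really think that we should all go and see it",
]
print(predict_user_level(comments, dummy_predict))
```

A vote is easy to explain, but it ignores how confident each prediction is; averaging decision scores would be another reasonable choice.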

## Current status

We have finished scraping all of the data and training the model. What remains is to use our model to make predictions. We are currently looking at different celebrities/prominent figures to make predictions on, as well as different ways to present our data graphically in order to capture the most interesting results.

## Appendix

- https://www.quora.com/What-are-the-demographics-of-Minecraft-players
- http://www.ibtimes.com/audience-profiles-who-actually-reads-new-york-times-watches-fox-news-other-news-publications-1451828
- https://pypi.python.org/pypi/pylinkgrammar
- http://stackoverflow.com/questions/10252448/how-to-check-whether-a-sentence-is-correct-simple-grammar-check-in-python
- https://github.com/dwyl/english-words