# Level of Education Estimator

**Alex Shi, Mark Lee, Jun Ma**

## Introduction

This project is based on the idea of predicting education level by observing user behavior. More specifically, we plan to analyze users' public posts on social media, including Facebook and bulletin boards (BBS) such as CollegeConfidential and Reddit, use natural language processing to estimate the level of sophistication of those posts, and correlate the estimates with the users' actual education levels.

## Overview

Generally, we have created an interface that accepts a piece of text from a person's social media or forum posts and outputs the predicted education level of the user:

- college or above (2)
- high school (1)
- below high school (0)

The final goal is to take several pieces of a user's writing and output a prediction of that user's education level.

## Procedure of Project
### Data Collection
At first, we aimed to fetch the posts and comments of specific users directly through the [Facebook Graph API](https://developers.facebook.com/docs/graph-api). That would have let us target exactly the groups we wanted, e.g. high school students and college students. However, we soon found that it is impossible to read a person's timeline posts unless he/she grants our API token [permission](https://developers.facebook.com/docs/graph-api/reference/v2.5/user/feed).

We then changed our strategy: search directly for people within a certain age range and scrape the data from the HTTP response of the web page. We found that we could search for users within a certain age range using:
> https://www.facebook.com/search/20/30/users-age-2

This will give us users who are between 20 and 30 years old. 

However, when we tried to parse the HTML with BeautifulSoup and an HTML parser, we could not extract the data we wanted after several attempts, since the content structure of a Facebook page is far more complicated than we had expected.

Finally, after sampling users from various Facebook Pages (instead of users' profile pages) and researching several forums and social media sites, we concluded that, rather than scraping education level directly from user profiles, it would still be reasonably reliable to scrape pages with known audience demographics and assume that posts and comments on those pages are representative of an average post at that education level. The following summarizes our findings:

- College or above:
    - [The New Yorker](https://www.facebook.com/newyorker)
    - [The Economist](https://www.facebook.com/TheEconomist)
    - [The New York Times](https://www.facebook.com/nytimes)
    - [IEEE](https://www.facebook.com/IEEE.org/?fref=ts)
    - [Psychology](https://www.facebook.com/elsevierpsychology/)
    - [Facebook Engineering](https://www.facebook.com/engineering/)
    - [Nature](https://www.facebook.com/nature/)
    
- High school:
    - [Justin Bieber](https://www.facebook.com/justin.bieber.film)
    - [Twilight](https://www.facebook.com/TwilightMovie)
    - [College Confidential Discussion Board](https://talk.collegeconfidential.com)
    - [Reddit Debate Forum](https://www.reddit.com/r/Debate/)
- Below high school:
    - [Club Penguin](https://www.facebook.com/clubpenguin)
    - [Minecraft](https://www.facebook.com/minecraft)
    
__*Reasons for choosing the above pages and forums:*__

__*Methods used for scraping data:*__

__For Facebook pages__, we use the Facebook Graph API, which amounts to sending HTTP requests that carry an access token and the unique page ID returned by the Graph API. All collected comments are then written to a JSON file. The core code looks like this:


In [None]:
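import json
import time

import requests

# Graph API access token; replace this placeholder with a valid token before running
access_token = "<ACCESS_TOKEN>"

# get the unique page id of a certain Facebook page
def getid(pagename):
    url = "https://graph.facebook.com/" + pagename + "?access_token=" + access_token
    # parse the JSON response instead of slicing the raw text, so the result
    # does not depend on where the "id" field sits in the response
    data = json.loads(requests.get(url).text, strict=False)
    return data["id"]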

# get posts and their post id
def getpost(pagename):
    msg = []
    idl = []
    page_id = getid(pagename)
    url = "https://graph.facebook.com/v2.8/" + page_id + "/posts/?fields=message&limit=100&access_token=" + access_token
    text = requests.get(url).text
    data = json.loads(text, strict=False)
    # "post" and "page_id" avoid shadowing the built-ins set() and id()
    for post in data["data"]:
        if "message" in post:
            msg.append(post["message"])
        if "id" in post:
            idl.append(post["id"])
    return msg, idl

# get comments of those posts based on post id and try to parse the comments
def getcomments(pagename):
    idlist = getpost(pagename)[1]
    parsed = []
    raw = []
    num_comments = 0
    # visit each post once and stop as soon as 2000 usable comments are collected
    for post_id in idlist:
        if num_comments >= 2000:
            break
        url = "https://graph.facebook.com/v2.8/" + post_id + "/comments?access_token=" + access_token
        response = requests.get(url).text
        try:
            raw_comments = json.loads(response, strict=False)["data"]
        except (ValueError, KeyError, TypeError):
            # skip posts whose response is not valid JSON or has no "data" field
            continue
        parsed_comments = []
        for comment in raw_comments:
            try:
                # keep ASCII-only comments; anything else raises UnicodeEncodeError
                text = comment["message"].encode("ascii").decode("ascii")
            except (KeyError, UnicodeEncodeError):
                continue
            # keep only comments with more than 5 space-separated words
            if len(text.split(" ")) > 5:
                num_comments += 1
                parsed_comments.append(text)
        raw.append(raw_comments)
        parsed.append(parsed_comments)
        # short pause between requests to avoid hammering the API
        time.sleep(0.2)
    return raw, parsed

# list of page names for scraping comments
pages = [
'nytimes',
'newyorker',
'TheEconomist',
'justin.bieber.film',
'TwilightMovie',
'minecraft',
'clubpenguin',
'nature',
'engineering',
'elsevierpsychology',
'IEEE.org'
]

# scrape data and write into json files
for page in pages:
    print("getting data for {} ...".format(page))
    raw, parsed = getcomments(page)
    print("writing raw data ...")
    with open('{}_raw.json'.format(page), 'w') as outfile:
        json.dump(raw, outfile)

    print("finish writing raw data from {}".format(page))
    print("writing comments ...")
    with open('{}.json'.format(page), 'w') as outfile:
        json.dump(parsed, outfile)
    print("finish writing comments from {}".format(page))

__For the other two websites, College Confidential and the Reddit Debate forum__, we scrape data directly from the HTML, since extracting the text there is much easier than on a Facebook page. The parsing method differs slightly between the two sites because Reddit does not split one topic across multiple pages, whereas College Confidential does.
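
As a rough illustration of this kind of HTML scraping, the sketch below pulls the text out of every post container in a page using only the standard library. The class name `post-body` and the helper names `PostTextExtractor` and `extract_posts` are hypothetical; the real markup of each forum has to be inspected first, and this is not necessarily how our scraper is written.

```python
from html.parser import HTMLParser


class PostTextExtractor(HTMLParser):
    """Collect the text inside every <div> carrying a given class.

    The default class name "post-body" is a hypothetical placeholder.
    """

    def __init__(self, target_class="post-body"):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # <div> nesting depth inside a matching div (0 = outside)
        self.current = []   # text fragments of the post being read
        self.posts = []     # finished posts

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            # nested div inside a post: track it so we know when the post ends
            self.depth += 1
        elif self.target_class in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.current = []

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # normalize whitespace before storing the post text
                self.posts.append(" ".join("".join(self.current).split()))

    def handle_data(self, data):
        if self.depth > 0:
            self.current.append(data)


def extract_posts(html, target_class="post-body"):
    """Return the text of every matching post container in an HTML string."""
    parser = PostTextExtractor(target_class)
    parser.feed(html)
    parser.close()
    return parser.posts
```

For a paginated site, `extract_posts` would be called once per page of a topic and the results concatenated; for a single-page thread one call is enough.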

Some observations:

- Most of the pages targeted at younger audiences are recreational. We could find virtually no pages that were both non-recreational and had a substantial number of posts by younger users.
- It was very difficult to find pages for the "below high school" group. We suspect that kids of that age typically don't post very much online (hence most pages were dominated by parents posting on behalf of their children).

As such, we will make the assumption that the above few pages are relatively representative of each education level.

Having identified these pages, we then used the Facebook [Graph API](https://developers.facebook.com/docs/graph-api) to first gather all the posts, then parse all the comments within each post.

## Training

The end goal is to take a list of (comment, education level) tuples and train an SVM model that takes a given post or comment and outputs a predicted education level. For any given comment or post, we run the following pre-processing steps:

- ignore short posts (fewer than 5 words)
- ignore non-English posts (using the [langid](https://github.com/saffsd/langid.py) library)
- filter out punctuation (including things like emojis)
- tokenize the filtered comment or post, removing stopwords and rare words (including things like links, proper nouns)
- compute a vector of four metrics:
    - `syllables_per_word`: count the total number of syllables and divide by total number of words
    - `words_per_sentence`: count the total number of words and divide by total number of sentences
    - `spelling_errors_per_sentence`: count the total number of spelling errors and divide by total number of sentences
    - `grammer_errors_per_sentence`: count the total number of grammar errors and divide by total number of sentences
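
The filtering and tokenizing steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not our exact code: the stopword list is a tiny stand-in, the function name `preprocess` is hypothetical, and the language check (done with langid) and rare-word removal are left out.

```python
import re

# Tiny illustrative stopword list (an assumption); a fuller list would be used in practice
STOPWORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "it", "this", "that"}

MIN_WORDS = 5  # posts with fewer words than this are ignored


def preprocess(post, stopwords=STOPWORDS):
    """Return a list of lowercase tokens, or None if the post is too short."""
    if len(post.split()) < MIN_WORDS:
        return None
    # Drop links first so URL fragments do not survive as tokens
    no_links = re.sub(r"https?://\S+", " ", post)
    # Remove apostrophes so contractions stay in one piece ("don't" -> "dont")
    no_links = no_links.replace("'", "")
    # Keep ASCII letters, digits, and whitespace; this also removes emojis
    cleaned = re.sub(r"[^A-Za-z0-9\s]", " ", no_links)
    tokens = cleaned.lower().split()
    return [t for t in tokens if t not in stopwords]
```

Posts for which `preprocess` returns `None` are simply skipped.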
    
To compute the syllables per word, we use the syllable counter in [nltk_contrib](https://github.com/nltk/nltk_contrib/blob/master/nltk_contrib/readability/syllables_en.py). To identify spelling errors, we check whether words (excluding proper nouns, i.e. capitalized words) appear in a [dictionary of English words](https://github.com/dwyl/english-words). To identify grammar errors, we use the [grammar-check](https://pypi.python.org/pypi/grammar-check) library, which gives us the total number of grammar errors.
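
To make three of the four metrics concrete, here is a minimal sketch under simplifying assumptions. The vowel-group syllable counter is a crude stand-in for the nltk_contrib counter, `KNOWN_WORDS` is a tiny stand-in for the full English word list, and the names `split_sentences`, `count_syllables`, and `text_metrics` are hypothetical. The grammar metric is omitted because it depends on the external grammar checker.

```python
import re

# Tiny stand-in word list (an assumption); the project uses a full English dictionary
KNOWN_WORDS = {"the", "cat", "sat", "on", "mat", "it", "was", "happy"}


def split_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace (a rough heuristic)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def count_syllables(word):
    """Approximate the syllable count as the number of vowel groups (at least 1)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def text_metrics(text, known_words=KNOWN_WORDS):
    """Return the per-word and per-sentence metrics, or None for empty text."""
    sentences = split_sentences(text)
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return None
    # Capitalized words are treated as proper nouns and skipped in the spelling check
    misspelled = [
        w for w in words
        if not w[0].isupper() and w.lower() not in known_words
    ]
    return {
        "syllables_per_word": sum(count_syllables(w) for w in words) / len(words),
        "words_per_sentence": len(words) / len(sentences),
        "spelling_errors_per_sentence": len(misspelled) / len(sentences),
    }
```

For the input `"The cat sat on the mat. It was hapy."` this yields 4.5 words per sentence and 0.5 spelling errors per sentence, since `hapy` is the only lowercase word missing from the word list.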

Each processed comment is then labeled according to the source it came from (e.g. posts from The New Yorker get the label 2), and the labeled feature vectors are used to train a linear SVM.
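
A training step of this kind might look like the sketch below. It assumes scikit-learn's `SVC` with a linear kernel, which is one common way to fit such a model and not necessarily the exact setup we used. The feature vectors are made-up toy values for illustration only, not real measurements, and `predict_level` is a hypothetical helper.

```python
from sklearn.svm import SVC

# Order of the four metrics in each feature vector
FEATURES = [
    "syllables_per_word",
    "words_per_sentence",
    "spelling_errors_per_sentence",
    "grammer_errors_per_sentence",
]

# Made-up toy feature vectors, for illustration only (not real data)
X_train = [
    [1.1, 6.0, 2.0, 1.5],   # label 0: below high school
    [1.2, 7.0, 1.8, 1.2],
    [1.4, 12.0, 0.8, 0.6],  # label 1: high school
    [1.5, 13.0, 0.7, 0.5],
    [1.8, 22.0, 0.1, 0.1],  # label 2: college or above
    [1.9, 24.0, 0.2, 0.0],
]
y_train = [0, 0, 1, 1, 2, 2]

# Support vector classifier with a linear kernel
model = SVC(kernel="linear")
model.fit(X_train, y_train)


def predict_level(feature_vector):
    """Predict the label (0, 1, or 2) for one four-element feature vector."""
    return int(model.predict([feature_vector])[0])
```

With real data, the features would usually be standardized before fitting, since the four metrics live on different scales.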

## Predicting

Given a new user's Facebook page, we use the same functions as before to scrape all posts and comments from the user, applying the same pre-processing steps as we did during training. Finally, we use the trained SVM to predict the education level.
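
Since a user usually has many posts, the per-post predictions have to be combined into one label. How this is done is not fixed above; the sketch below assumes a simple majority vote, with ties going to the lower label. The names `LABELS` and `aggregate_predictions` are hypothetical.

```python
from collections import Counter

# Label codes used throughout the project
LABELS = {0: "below high school", 1: "high school", 2: "college or above"}


def aggregate_predictions(per_post_labels):
    """Return the most common label among per-post predictions.

    Returns None for an empty list. Ties go to the lower label
    (an arbitrary choice made for this sketch).
    """
    if not per_post_labels:
        return None
    counts = Counter(per_post_labels)
    best = max(counts.values())
    return min(label for label, n in counts.items() if n == best)
```

For example, per-post predictions of `[2, 2, 1, 0, 2]` aggregate to label 2, which `LABELS` maps to "college or above".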

## Current status

We have finished scraping all of the data and training the model. What remains is to use our model to make predictions. We are currently looking at different celebrities/prominent figures to make predictions on, as well as different ways to present our data graphically in order to capture the most interesting results.

## Appendix

- https://www.quora.com/What-are-the-demographics-of-Minecraft-players
- http://www.ibtimes.com/audience-profiles-who-actually-reads-new-york-times-watches-fox-news-other-news-publications-1451828
- https://pypi.python.org/pypi/pylinkgrammar
- http://stackoverflow.com/questions/10252448/how-to-check-whether-a-sentence-is-correct-simple-grammar-check-in-python
- https://github.com/dwyl/english-words