# Explore the biggest social graph

In this tutorial, we are going to explore the biggest social graph on earth: Facebook.

First, we'll need to set up the environment so that we can call the Graph API. This includes creating an access token, downloading an SDK, and making some simple requests.

Then, we want to verify the famous six degrees of separation theory.

Finally, we will do some NLP and ML work.




## First query

In this section, we will get an access token from Facebook, download a third-party SDK, and then use that SDK to get a list of your friends.

You can find a detailed introduction to the Graph API here: https://developers.facebook.com/docs/graph-api/overview


### Create Access Token
Since our third-party SDK doesn't provide a login flow, we will have to use the Graph API Explorer to get our token.

1. Open https://developers.facebook.com/tools/explorer
2. Click the **Get Token** button in the top right of the Explorer.
3. Choose the option **Get User Access Token**.
4. In the following dialog, don't check any boxes; just click the blue **Get Access Token** button.
5. You'll see a Facebook Login dialog; click **OK** to proceed.
6. You should now see your **Access Token** filled in in the Explorer.


**NOTE: This is a short-term token. It will expire in about 2 hours.**



In [1]:
ACCESS_TOKEN = "EAACEdEose0cBAPR6ssxOepwIyOnj9NPpueuSrSJjEDpNLnlvQ3ZBIhy97EOe0puZC1ZA3BxoCElVYLNBmEZCiJv6KtTRlyJde1frVYWJ6vRx0ZBcua41Pbdq3XMCYiaFvxuHgfCj5OJfLIM9B8pfdTfX0FKybwhMugXPSeSf4APZCOnvqp4KQM"

### Make the first request

The second thing you need to do is download the SDK.

In your terminal, run the following command to install **facebook-sdk** with pip:
```
pip install facebook-sdk
```
Once this succeeds, you should be able to import the SDK in Python.

Try the following code; it should not produce any warnings.

**NOTE: facebook-sdk only supports Graph API versions up to 2.7.**


In [161]:
import facebook
import json, requests, time, io, string
import pandas as pd
import numpy as np
import nltk
from collections import Counter
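import sklearn
import sklearn.svm                      # `import sklearn` alone does not load submodules
import sklearn.feature_extraction.text  # needed later for TfidfVectorizer

# Uncomment once ACCESS_TOKEN holds a valid (non-expired) token:
# graph = facebook.GraphAPI(access_token=ACCESS_TOKEN, version='2.7')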


### Work with the Graph

Let's explain some basic concepts first.

The Graph API is named after the idea of a 'social graph' - a representation of the information on Facebook composed of:

* **nodes** - basically "things" such as a User, a Photo, a Page, a Comment
* **edges** - the connections between those "things", such as a Page's Photos, or a Photo's Comments
* **fields** - info about those "things", such as a person's birthday, or the name of a Page
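
To make the three concepts more concrete, here is a minimal sketch of how a node, an edge, and a field selection could map onto a request URL. The helper `graph_url` is hypothetical (it is not part of facebook-sdk), it omits the access token a real request needs, and it only builds strings, so it runs without a network connection.

```python
# Hypothetical helper, not part of facebook-sdk: shows how nodes, edges and
# fields map onto a Graph API request URL. No request is actually sent.
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2

BASE_URL = "https://graph.facebook.com/v2.7/"

def graph_url(node_id, edge=None, fields=None):
    """Build the URL for a node, optionally one of its edges, with selected fields."""
    path = str(node_id)
    if edge is not None:
        path += "/" + edge  # an edge is always addressed through a node
    url = BASE_URL + path
    if fields:
        url += "?" + urlencode({"fields": ",".join(fields)})
    return url

print(graph_url("me"))                               # a node
print(graph_url("me", edge="friends"))               # an edge of that node
print(graph_url("me", fields=["name", "birthday"]))  # selected fields of a node
```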

#### Nodes

Each node has a unique **ID**, which is used to access it via the Graph API.
In the provided SDK, you can get any object with the `get_object()` method.

Below is an example of how to get your own User object.

In [None]:
me = graph.get_object(id='me') # NOTE, in graph api, 'me' is an alias of current token owner's id

#### Edges

Edges don't have IDs of their own; an edge can only be queried through a node. For example, you can query for all your friends using the following code.

In [None]:
friends = graph.get_connections(id='me', connection_name='friends')
print friends['summary']['total_count']

### Get a friends list

As you can see in the example above, the output is paginated. So the final part of this section handles the pagination to get the full list of your friends' IDs.

In [None]:
def get_all_connections(graph, id, connection_name):
    results = []
    after = ''
    while True:
        response = graph.get_connections(id=id, connection_name=connection_name, after=after, limit=100)
        data = response['data']
        results.extend([node['id'] for node in data])
        # Stop when there is no further page to fetch.
        if 'paging' not in response or 'next' not in response['paging']:
            break
        after = response['paging']['cursors']['after']
    return results
            

def get_friends(graph, id):
    return get_all_connections(graph, id, 'friends')

In [None]:
print len(get_friends(graph, 'me'))
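
The cursor-following loop used in `get_all_connections` can be exercised offline. The sketch below is an illustration only: `fake_get_connections` is a made-up stand-in for `graph.get_connections()`, the IDs are invented, and the response shape (`data`, `paging.cursors.after`, `paging.next`) is assumed to mirror what the code above reads.

```python
def fake_get_connections(after='', limit=2):
    """Stand-in for graph.get_connections(): serves invented IDs page by page."""
    ids = ['101', '102', '103', '104', '105']  # invented data
    start = int(after) if after else 0
    page = ids[start:start + limit]
    end = start + len(page)
    paging = {'cursors': {'after': str(end)}}
    if end < len(ids):
        paging['next'] = 'more'  # real responses carry a full URL here
    return {'data': [{'id': i} for i in page], 'paging': paging}

def collect_ids(fetch):
    """Follow the 'after' cursor until the response has no 'next' page."""
    results = []
    after = ''
    while True:
        response = fetch(after=after)
        results.extend(node['id'] for node in response['data'])
        paging = response.get('paging', {})
        if 'next' not in paging:
            break
        after = paging['cursors']['after']
    return results

all_ids = collect_ids(fake_get_connections)
print(all_ids)
```

Passing the fetch function in as an argument keeps the loop independent of the SDK, which makes it easy to test without a token.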

It turns out that since Graph API v2.0, Facebook no longer provides the full list of a user's friends, which makes the original idea of this tutorial impossible.
There are two possible solutions:
1. Switch to Twitter, where I can get much more data than from Facebook.
2. Target public figures on Facebook, like Taylor Swift. Those pages have publicly accessible posts, so I can still do something like train an ML model to predict the type of a Page (Artist, Business, Places, etc.).

## Public Page predictor

In this section, we want to train a model to predict if a page represents an artist based on the publicly available posts.

First, we need to use the Graph API to collect all this data.

### Data Collection
#### Get the last 100 posts from the page


In [None]:
def get_posts(graph, id, limit=100):
    print 'get', id
    response = graph.get_connections(id=id, connection_name='posts', limit=limit)
    df = pd.DataFrame(response['data'])
#     print df.head()
    if df.shape[0]:
        return df.drop('story', axis=1, errors='ignore')
    else:
        return None

posts_taylor_swift = get_posts(graph, '260735133964390')
print (posts_taylor_swift.head())
print len(posts_taylor_swift)

#### Search For All Public Pages
The given SDK doesn't support the search endpoint, so we'll have to write it from scratch.

In [None]:
class GraphAPI:
    BASE_URL = "https://graph.facebook.com/v2.7/"

    def __init__(self, token):
        self.token = token

    def node_request(self, url, **payload):
        # Fetch a single node; extra keyword arguments become query parameters.
        payload['access_token'] = self.token
        r = requests.get(self.BASE_URL + url, params=payload)
        return r.json()

    def connection_request(self, url, limit=100, **payload):
        # Fetch up to `limit` items from an edge, following the 'next' links.
        payload['access_token'] = self.token
        payload['limit'] = min(limit, 100)  # request at most 100 items per page
        response = requests.get(self.BASE_URL + url, params=payload).json()

        result = list(response.get('data', []))
        while len(result) < limit:
            # A missing 'paging' or 'next' key means there are no more pages.
            next_url = response.get('paging', {}).get('next')
            if next_url is None:
                break
            response = requests.get(next_url).json()
            result.extend(response.get('data', []))
#             time.sleep(1)
        return pd.DataFrame(result[:limit])
        
api = GraphAPI(ACCESS_TOKEN)
taylor = api.node_request('105799462785959', fields=['category'])
artists = api.connection_request('search', type='page', q='Artist', limit=200)
print taylor
print len(artists)
print artists.head()


#### Build the training and evaluation table
With these APIs, we are now able to collect all the data we need.
First, we will make some search queries to get a preferably random list of public pages.
Then, we need to get all the posts from these pages. We can also get each page's category and use that as the source of truth to decide whether the Page represents an Artist.

Now we can use what we have to build a relatively small but sufficient dataset.
Since Facebook doesn't provide an API to list all public pages, we have to use the `search` query with a hand-picked list of keywords to get some of them.

In [44]:
# keyword_list = ['Artist', 'Musician', 'Trump', 'Hillary', 'PHP', 'Data Science', 'NBA', 'Alpine Ski', 'Scuba Diving', 'Computer']

# def get_all_posts(graph, api, keyword_list, limit=1000):
#     all_pages = []
#     all_posts = {}
#     for keyword in keyword_list:
#         pages = api.connection_request('search', type='page', q=keyword, limit=limit)
#         posts = {id: get_posts(graph, id) for id in pages['id']}
#         all_pages.append(pages)
#         all_posts.update(posts)
#     return all_pages, all_posts


# pd.read_csv with index_col=0 replaces the deprecated pd.DataFrame.from_csv
df_pages = pd.read_csv('all_pages.csv', index_col=0, encoding='UTF-8')

df_posts = pd.read_csv('all_posts.csv', index_col=0, encoding='UTF-8')
    
# # all_pages, all_posts = get_all_posts(graph, api, keyword_list)
# print len(df_pages)
# print len(df_posts)
print df_pages.head()
print df_posts.head()

                                                     name
id                                                       
105799462785959                                    Artist
354577571346409               Artist in Residence Program
105453336153572                                  Artistic
906740039387472                     Artistically Speaking
348604348593352  Associazione Artisti di Strada di Milano
                                                           message
id                                                                
283319135042170  اعزاءنا الطلبة \nللاطلاع على القاعات الخاصة با...
283319135042170  إعلان هام لجميع الطلبة\nننوه للطلبة الأعزاء بض...
283319135042170  الجامعة تستقبل الدكتور سايمون غالبين المدير ال...
283319135042170  الجامعة تستقبل الدكتور سايمون غالبين المدير ال...
283319135042170  الجامعة تستقبل الدكتور سايمون غالبين المدير ال...


We also need each page's category as the source of truth for whether the Page represents an artist.

In [46]:
# def get_page_category(graph, page_ids):
#     result = {}
#     for i in range(0, len(page_ids), 50):
#         pages = graph.get_objects(ids=page_ids[i:i+50], fields='category')
#         # Fall back to 'Unknown' when a page has no category field.
#         result.update({id: page.get('category', 'Unknown') for id, page in pages.items()})
#     return [result.get(id, 'Unknown') for id in page_ids]

# # df_pages = pd.concat(all_pages)
# df_pages = df_pages.assign(category=get_page_category(graph, [str(id) for id in df_pages.index]))
for k,v in df_pages.groupby('category').groups.items():
    print k, '\t', len(v)


Financial Planner 	1
Industrials 	5
Author 	1
Insurance Company 	13
Performance Art 	2
Arts & Entertainment 	123
Entertainer 	1
TV Show 	19
Engineering/Construction 	5
App Page 	8
Community Organization 	26
Outdoor Gear/Sporting Goods 	39
Public Figure 	63
Shopping/Retail 	115
Science Website 	6
TV Genre 	1
Brand 	1
Legal/Law 	5
Technology Company 	3
Medical & Health 	8
Chemicals 	1
Business/Economy Website 	1
Podcast 	1
Phone/Tablet 	4
Political Organization 	93
Record Label 	25
Bar 	50
Just For Fun 	5
Athlete 	46
Attractions/Things to Do 	28
Magazine 	13
Landmark 	7
Organization 	47
Sports Team 	38
Scuba Diving Center 	3
Profession 	15
Hospital/Clinic 	1
Ski & Snowboard Shop 	1
Public Services & Government 	1
Science & Engineering 	3
Personal Blog 	5
Sport 	6
Event 	6
Non-Governmental Organization (NGO) 	7
News Personality 	2
Spas/Beauty/Personal Care 	167
Pet 	1
TV Network 	3
Computer Company 	1
Interest 	20
Internet/Software 	64
Local Service 	1
Amateur Sports Team 	16
Movie Theate

In [47]:
# print len(all_pages)
# df = pd.concat(all_pages)
# print df.head()
# print len(df)

# posts = [[id, posts.iloc[i]['message']] for id, posts in all_posts.items() if posts is not None for i in range(len(posts)) if 'message' in posts.iloc[i]]
# pdf = pd.DataFrame(posts, columns=['id', 'message'])


with open('all_pages.csv', 'w') as output:
    df_pages.to_csv(output, encoding='UTF-8', index=True)
    
with open('all_posts.csv', 'w') as output:
    df_posts.to_csv(output, encoding='UTF-8', index=True)
        
        
        
print df_pages.head()
print df_posts.head()

                                                     name  \
id                                                          
105799462785959                                    Artist   
354577571346409               Artist in Residence Program   
105453336153572                                  Artistic   
906740039387472                     Artistically Speaking   
348604348593352  Associazione Artisti di Strada di Milano   

                               category  
id                                       
105799462785959              Profession  
354577571346409  Community Organization  
105453336153572                Interest  
906740039387472           Event Planner  
348604348593352               Community  
                                                           message
id                                                                
283319135042170  اعزاءنا الطلبة \nللاطلاع على القاعات الخاصة با...
283319135042170  إعلان هام لجميع الطلبة\nننوه للطلبة الأعزاء بض...
2833191350

                id                                      name  \
0  105799462785959                                    Artist   
1  354577571346409               Artist in Residence Program   
2  105453336153572                                  Artistic   
3  906740039387472                     Artistically Speaking   
4  348604348593352  Associazione Artisti di Strada di Milano   

                 category  
0              Profession  
1  Community Organization  
2                Interest  
3           Event Planner  
4               Community  


## Model Training

### Data pre processing
Now that we've got the data, we need to pre-process it first. This includes:

1. removing all posts with no message
2. removing all posts that contain non-ASCII characters (a rough proxy for non-English posts)
3. using a page's category to build the label column: 1 for IT-related and 0 for non-IT-related

In [166]:
def pre_process_data(df_posts):
    df = df_posts[pd.notnull(df_posts['message'])]
    
    def is_ascii(message):
        try:
            message.decode('ascii')
            return True
        except (UnicodeError, AttributeError):
            # non-ASCII content, or a value that is not a string
            return False
    
    df = df[df['message'].apply(is_ascii)]
    df = df.reset_index(drop=True)
    return df

print len(df_posts)
posts = pre_process_data(df_posts)
print posts.head()
print len(posts)

305967
                 id                                            message
0  1507446969543009            GAME TIME!! Houston Rockets - NY Knicks
1  1507446969543009  Dal caldo al freddo....da Manu Ginobili al Bar...
2  1507446969543009   Tutto pronto per San Antonio Spurs - Miami Heat!
3  1507446969543009  Vista la giornata nuvolosa, relax e shopping s...
4  1507446969543009                                       EXTRA GAME!!
150326
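
The filtering done by `pre_process_data` can be sketched without pandas. This is an illustration under assumptions: the records are invented, and it is written for Python 3, where a `str` is tested with `encode('ascii')` rather than `decode('ascii')`. It also shows why the ASCII test is only a rough language filter: an Italian post made of plain ASCII characters passes, as the output above suggests.

```python
def is_ascii(message):
    """Return True when every character of the message is plain ASCII."""
    try:
        message.encode('ascii')
        return True
    except UnicodeError:
        return False

rows = [  # invented sample records
    {'id': '1', 'message': 'GAME TIME!!'},
    {'id': '1', 'message': None},                 # post without a message
    {'id': '2', 'message': u'caf\u00e9 ouvert'},  # contains a non-ASCII character
    {'id': '2', 'message': 'Tutto pronto'},       # Italian, but pure ASCII
]

# Keep only rows that have a message made of ASCII characters.
kept = [r for r in rows if r['message'] is not None and is_ascii(r['message'])]
kept_messages = [r['message'] for r in kept]
print(kept_messages)
```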


### NLP to get feature 
Now that we have pre-processed our raw data, we can use Natural Language Processing (NLP) techniques to create the features for training the Machine Learning model.

In this section, we convert all the post content into a list of tokens for each page. For each token, we need to:
1. convert the token to lowercase
2. convert the token to its lemmatized form; tokens for which lemmatization fails are skipped
3. handle punctuation as follows: (a) an apostrophe of the form `'s` is removed; (b) all other apostrophes are removed; (c) the word is broken at all other punctuation marks


In [140]:
def process(text, lemmatizer=nltk.stem.wordnet.WordNetLemmatizer()):
    """ Normalizes case and handles punctuation
    Inputs:
        text: str: raw text
        lemmatizer: an instance of a class implementing the lemmatize() method
                    (the default argument is of type nltk.stem.wordnet.WordNetLemmatizer)
    Outputs:
        list(str): tokenized text
    """
    text = text.lower()
    text = text.replace("'s", '').replace("'", '')
    replace_punctuation = string.maketrans(string.punctuation, ' '*len(string.punctuation))
    text = text.translate(replace_punctuation)
    tokens = nltk.word_tokenize(text)

    result = []
    for token in tokens:
        try:
            res = lemmatizer.lemmatize(token)
            result.append(res)
        except Exception:  # skip tokens the lemmatizer cannot handle
            continue
    return result
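
The lowercase and punctuation rules above can be tried without NLTK. The sketch below is a simplification, not the `process` function itself: it assumes Python 3 (`str.maketrans`), splits on whitespace instead of calling `nltk.word_tokenize`, and leaves out lemmatization. The name `simple_tokens` is made up for this example.

```python
import string

def simple_tokens(text):
    """Lowercase, drop apostrophes, and split at every other punctuation mark."""
    text = text.lower()
    text = text.replace("'s", '').replace("'", '')  # rules (a) and (b)
    # rule (c): turn every remaining punctuation character into a space
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    return text.translate(table).split()

tokens = simple_tokens("Taylor's new-album isn't out!")
print(tokens)  # ['taylor', 'new', 'album', 'isnt', 'out']
```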


In [154]:
d = {1:[12,3,4,5], 2:[12,23,23]}  # named `d` to avoid shadowing the built-in dict
print d.keys()
print d.values()
df = pd.DataFrame().assign(id=d.keys(), value=d.values())
print df

[1, 2]
[[12, 3, 4, 5], [12, 23, 23]]
   id          value
0   1  [12, 3, 4, 5]
1   2   [12, 23, 23]


With the help of this function, we could now generate a list of tokens for each Page.

In [171]:
def generate_tokens(df_pages, df_posts):
    result = {}
    for id, indices in df_posts.groupby('id').groups.items():
        token = [process(str(df_posts.iloc[index]['message'])) for index in indices]
        token = np.concatenate(token)
        result[id] = token
    
    data = pd.merge(pd.DataFrame().assign(id=result.keys(), message=result.values()), df_pages, on='id', how='left')
    return data

data = generate_tokens(df_pages, posts)
print data.head()


                id                                            message  \
0  314534308659695  [skonczona, praca, kosatattoo, w90, wygojona, ...   
1      65409974260  [2, lake, huron, shipwreck, lost, since, the, ...   
2  258308394333462  [it, is, about, time, this, poop, is, held, ac...   
3      78367162355  [congratulation, to, tomeka, reid, http, www, ...   
4  712517218826239  [lol, literally, sleeping, with, the, fish, wh...   

                                                name  \
0             Studio Tatuażu  SPEAK IN COLOR by Kosa   
1                                       Total Diving   
2                            Hillary For Prison 2016   
3  AACM - Association for the Advancement of Crea...   
4                              Scuba Diving Globally   

                      category  
0               Local Business  
1  Outdoor Gear/Sporting Goods  
2       Political Organization  
3      Non-Profit Organization  
4               Travel/Leisure  
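
The grouping step in `generate_tokens` (collect every post of a page and concatenate the tokens) can be sketched with the standard library alone. The page IDs and messages below are invented, and `tokens_per_page` is a hypothetical helper that takes the tokenizer as an argument instead of calling `process` directly.

```python
from collections import defaultdict

def tokens_per_page(posts, tokenize):
    """Concatenate the tokens of all posts that belong to the same page."""
    grouped = defaultdict(list)
    for page_id, message in posts:
        grouped[page_id].extend(tokenize(message))
    return dict(grouped)

posts = [  # invented (page id, message) pairs
    ('page_a', 'first post'),
    ('page_b', 'hello world'),
    ('page_a', 'second post'),
]
result = tokens_per_page(posts, lambda m: m.lower().split())
print(result['page_a'])
```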


When making predictions based on natural language, you almost certainly don't need every possible word.
Some words are so common that they add no value to our model, like stopwords. Others are so rare that they are most likely typos.
The NLTK package provides a list of stopwords we can borrow. In the following section, we build a list of rare words, defined as words that occur only once.

In [172]:
def get_rare_words(data):
    """ use the word count information across all posts to find the words that occur only once
    Inputs:
        data: pd.DataFrame: the output of generate_tokens() function
    Outputs:
        list(str): list of rare words, sorted alphabetically.
    """
    token_list = [token for tokens in data['message'] for token in tokens]
    counter = Counter(token_list)
    return sorted(k for k,v in counter.iteritems() if v == 1)

rare_words = get_rare_words(data)
print len(rare_words) 

91644
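
As a small illustration of the "occurs only once" definition, here is the same counting idea on invented token lists. The function name `rare_words` and the sample data are made up for this sketch.

```python
from collections import Counter

def rare_words(token_lists):
    """Return, in alphabetical order, the words that occur exactly once overall."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    return sorted(word for word, n in counts.items() if n == 1)

pages = [  # invented token lists, one per page
    ['dive', 'reef', 'dive'],
    ['reef', 'shipwrek'],  # a typo that appears only once
    ['ski', 'ski'],
]
rare = rare_words(pages)
print(rare)  # ['shipwrek']
```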


Now we could create a feature matrix for each page using `sklearn.feature_extraction.text.TfidfVectorizer`.

In [173]:
def create_features(data, rare_words):
    """ creates the feature matrix using the Page posts
    Inputs:
        data: pd.DataFrame: Page posts collected above
        rare_words: list(str): the output of get_rare_words() function
    Outputs:
        sklearn.feature_extraction.text.TfidfVectorizer: the TfidfVectorizer object used
        scipy.sparse.csr.csr_matrix: sparse bag-of-words TF-IDF feature matrix
    """
    stopwords = nltk.corpus.stopwords.words('english')
    stopwords.extend(rare_words)
    vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(stop_words=stopwords)

    # join each page's tokens back into one string and compute the (sparse) TF-IDF matrix
    all_tokens = [' '.join(tokens) for tokens in data['message']]
    freq_matrix = vectorizer.fit_transform(all_tokens)
    return (vectorizer, freq_matrix)

# AUTOLAB_IGNORE_START
(tfidf, X) = create_features(data, rare_words)
print X.shape, len(data)
# AUTOLAB_IGNORE_STOP

(3619, 72898) 3619
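
To give a feel for what the TF-IDF weights mean, here is a pure-Python sketch on invented documents. It is not scikit-learn's implementation; it assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation of each row, which is my understanding of `TfidfVectorizer`'s default settings.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (idf per word, one normalised weight dict per document)."""
    n = len(docs)
    # document frequency: in how many documents each word appears
    df = Counter(word for doc in docs for word in set(doc))
    idf = {w: math.log((1.0 + n) / (1.0 + d)) + 1.0 for w, d in df.items()}
    vectors = []
    for doc in docs:
        weights = {w: count * idf[w] for w, count in Counter(doc).items()}
        norm = math.sqrt(sum(v * v for v in weights.values()))
        vectors.append({w: v / norm for w, v in weights.items()} if norm else {})
    return idf, vectors

docs = [  # invented token lists, one per page
    ['python', 'code', 'code'],
    ['python', 'music'],
    ['python', 'music', 'tour'],
]
idf, vectors = tfidf(docs)
print(idf['python'], idf['tour'])
```

A word that appears on every page (`python` here) gets the lowest weight, while a word concentrated on one page gets a higher one, which is what makes the features useful for telling pages apart.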


In [175]:
def create_labels(data):
    it_category = ['Computers/Technology', 'Computer Company', 'Internet/Software', 'Computers', 'Software',
                  'Internet Company', 'Computers/Internet Website']
    return np.array([1 if category in it_category else 0 for category in data['category']])

# AUTOLAB_IGNORE_START
y = create_labels(data)
# AUTOLAB_IGNORE_STOP

In [181]:
def learn_classifier(X, y, kernel='linear'):
    """ learns a classifier from the input features and labels using the kernel function supplied
    Inputs:
        X: scipy.sparse.csr.csr_matrix: sparse matrix of features, output of create_features()
        y: numpy.ndarray(int): dense binary vector of class labels, output of create_labels()
        kernel: str: kernel function to be used with classifier. [linear|poly|rbf|sigmoid]
    Outputs:
        sklearn.svm.classes.SVC: classifier learnt from data
    """
    clf = sklearn.svm.SVC(kernel=kernel)
    clf.fit(X, y)
    return clf

# AUTOLAB_IGNORE_START
# use the first half of the pages for training and the second half for evaluation
N = len(y)
X_train = X[:N//2]
y_train = y[:N//2]
X_eval = X[N//2:]
y_eval = y[N//2:]
classifier = learn_classifier(X_train, y_train, 'linear')
# AUTOLAB_IGNORE_STOP

In [180]:
def evaluate_classifier(classifier, X_validation, y_validation):
    """ evaluates a classifier based on a supplied validation data
    Inputs:
        classifier: sklearn.svm.classes.SVC: classifier to evaluate
        X_validation: scipy.sparse.csr.csr_matrix: sparse matrix of features
        y_validation: numpy.ndarray(int): dense binary vector of class labels
    Outputs:
        double: accuracy of classifier on the validation data
    """
    # predict all rows at once and compute the fraction of correct labels
    predictions = classifier.predict(X_validation)
    return float(np.mean(predictions == y_validation))

# AUTOLAB_IGNORE_START
accuracy = evaluate_classifier(classifier, X_train, y_train)
print accuracy # accuracy on the training half

accuracy = evaluate_classifier(classifier, X_eval, y_eval)
print accuracy # accuracy on the held-out evaluation half
# AUTOLAB_IGNORE_STOP

0.9773355445
0.882872928177
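
When reading accuracy figures like these, it can help to compare them with a trivial baseline, because `create_labels` marks only a handful of categories as 1 and the classes may therefore be imbalanced. The sketch below uses invented labels and predictions; `majority_baseline` is a hypothetical helper that reports the accuracy of always predicting the most frequent class.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return float(correct) / len(y_true)

def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return float(most_common_count) / len(y_true)

y_true = [0, 0, 0, 1, 0, 1, 0, 0]  # invented labels
y_pred = [0, 0, 0, 0, 0, 1, 0, 1]  # invented predictions
print(accuracy(y_true, y_pred))
print(majority_baseline(y_true))
```

In this made-up example both numbers are 0.75, so the predictions are no better than guessing the majority class; the same comparison could be run on `y_eval` to put the evaluation accuracy in context.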
