# Named Entity Recognition on OSNs

**TASK**: Find the most popular entities (people, organizations, locations) on Facebook and Twitter for any event.

1. Collect data from both networks on a common topic.
2. Run NER on the data.
3. Find intersecting entities and rank by total mentions.

## data collection

I chose "nba" as the topic: the season had just started, so I expected a significant amount of online activity.

I used tweepy (`easy_install tweepy`) to collect the 2500 most recent English tweets from Twitter with the query string "nba". Facebook no longer supports search across all public posts, so after consulting Yatharth, I took the 500 most recent posts from two pages, NBA and NBAtv.

Run *test_tweepy.py* and *test_graph_api.py* to collect the data; you will first need to create an additional file called *AccessTokens.py* and store all API keys and access tokens there. I used MongoDB to store the raw data locally.
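
As a rough illustration, *AccessTokens.py* could be a plain module of string constants. The variable names below are assumptions made for this sketch, not names confirmed by the scripts; check what *test_tweepy.py* and *test_graph_api.py* actually import and rename accordingly.

```python
# AccessTokens.py: hypothetical layout, every name here is an assumption.
# Keep this file out of version control.

# Twitter credentials (tweepy's OAuth flow needs these four values).
TWITTER_CONSUMER_KEY = "your-consumer-key"
TWITTER_CONSUMER_SECRET = "your-consumer-secret"
TWITTER_ACCESS_TOKEN = "your-access-token"
TWITTER_ACCESS_TOKEN_SECRET = "your-access-token-secret"

# Facebook Graph API credential.
FACEBOOK_ACCESS_TOKEN = "your-graph-api-token"
```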

## named entity recognition

I tried 2 methods initially:
1. NLTK NER chunker.
2. Stanford NER.

In [2]:
import nltk
# NERTagger is the class name in NLTK 3.0; later releases renamed it StanfordNERTagger.
from nltk.tag.stanford import NERTagger
tagger = NERTagger("stanford-ner-2014-06-16/classifiers/english.all.3class.distsim.crf.ser.gz", "stanford-ner-2014-06-16/stanford-ner.jar", encoding='utf-8')

def stanford_ner(text):
    print "--using stanford--"
    for sent in nltk.sent_tokenize(text):
        ne_tagged_sent = tagger.tag(nltk.word_tokenize(sent))[0]
        for tup in ne_tagged_sent:
            if tup[1]!="O":
                print tup


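def nltk_ner(text):
    print "--using nltk--"
    for sent in nltk.sent_tokenize(text):
        # ne_chunk returns a tree: untagged tokens are (token, pos) tuples (length 2),
        # named entities are subtrees with one leaf per token.
        for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
            # len(chunk)==1 therefore keeps only single-token entities;
            # multi-token entities are silently skipped.
            if len(chunk)==1:
                print chunk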

In [4]:
tweets = ["Dorothy Bland: I was caught 'walking while black' - Dallas Morning News https://t.co/EvcWF8pdss", 
         "NBA Trade Rumors: Chicago Bulls Moving Derrick Rose Within This Season?: Will point guard Derrick Rose https://t.co/aJf9rQ8bmn #Kpopstarz"
         ]
for tweet in tweets:
    print tweet
    nltk_ner(tweet)
    stanford_ner(tweet)
    print "\n----------\n"

Dorothy Bland: I was caught 'walking while black' - Dallas Morning News https://t.co/EvcWF8pdss
--using nltk--
(PERSON Dorothy/NNP)
(GPE Bland/NNP)
--using stanford--
(u'Dorothy', u'PERSON')
(u'Bland', u'PERSON')
(u'Dallas', u'LOCATION')

----------

NBA Trade Rumors: Chicago Bulls Moving Derrick Rose Within This Season?: Will point guard Derrick Rose https://t.co/aJf9rQ8bmn #Kpopstarz
--using nltk--
(ORGANIZATION NBA/NNP)
(PERSON Season/NNP)
(PERSON Will/NNP)
--using stanford--
(u'Chicago', u'ORGANIZATION')
(u'Bulls', u'ORGANIZATION')
(u'Derrick', u'PERSON')
(u'Rose', u'PERSON')
(u'Derrick', u'PERSON')
(u'Rose', u'PERSON')

----------



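This pattern was fairly consistent across posts: NLTK mislabeled ordinary words (Season, Will) and missed names that the Stanford tagger caught, so I continued with the Stanford NER tagger. It labels every token with an entity class or O (outside), without the B-/I- prefixes of a strict BIO scheme, so I collected consecutive non-O tokens as a single phrase.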

In [5]:
def get_continuous_chunks(tagged_sent):
    continuous_chunk = []
    current_chunk = []

    for token, tag in tagged_sent:
        if (tag != "O"):
            current_chunk.append((token, tag))
        else:
            if current_chunk:
                continuous_chunk.append(current_chunk)
                current_chunk = []
    if current_chunk:
        continuous_chunk.append(current_chunk)
    return continuous_chunk

In [13]:
ne_tagged_sent = tagger.tag(nltk.word_tokenize(tweets[1]))[0]
named_entities = get_continuous_chunks(ne_tagged_sent)
print [(" ".join([token for token, tag in ne]), ne[0][1]) for ne in named_entities]

[(u'Chicago Bulls', u'ORGANIZATION'), (u'Derrick Rose', u'PERSON')]
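
Note that `get_continuous_chunks` merges any run of non-O tokens, even when their tags differ, and the phrase is then labeled with the tag of its first token. A possible variant, shown here as a sketch on hand-written tagger output rather than real data, splits a run whenever the tag changes. The function name `group_entities` and the sample sentence are made up for illustration.

```python
from itertools import groupby

def group_entities(tagged_sent):
    """Merge runs of adjacent tokens that share the same non-"O" tag."""
    entities = []
    # groupby yields one group per run of consecutive pairs with an equal tag.
    for tag, run in groupby(tagged_sent, key=lambda pair: pair[1]):
        if tag != "O":
            entities.append((" ".join(token for token, _ in run), tag))
    return entities

# Hand-written (token, tag) pairs, for illustration only.
sample = [
    ("Derrick", "PERSON"), ("Rose", "PERSON"),
    ("Chicago", "ORGANIZATION"), ("Bulls", "ORGANIZATION"),
    ("wins", "O"),
    ("Dallas", "LOCATION"),
]
print(group_entities(sample))
# [('Derrick Rose', 'PERSON'), ('Chicago Bulls', 'ORGANIZATION'), ('Dallas', 'LOCATION')]
```

Two adjacent entities of the same type would still be merged, since the tags alone carry no boundary information.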


For every tweet and Facebook post, I constructed entities and grouped them by entity type (PERSON, ORGANIZATION, LOCATION). If you have stored the data in MongoDB, run *twitter_ner.py* and *facebook_ner.py* to perform NER. This will save two pickles with entities grouped by (PERSON, ORGANIZATION, LOCATION).
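
The grouping step might look roughly like the sketch below. The two scripts are not reproduced in this write-up, so the function name `group_by_type`, the lowercasing of phrases, and the sample data are all assumptions; the pickle round trip is done in memory here, and `pickle.dump` on a file opened in `"wb"` mode would work the same way.

```python
import pickle
from collections import defaultdict

def group_by_type(entities):
    """Collect (phrase, tag) pairs into {tag: [phrase, ...]}, keeping repeats."""
    grouped = defaultdict(list)
    for phrase, tag in entities:
        # Lowercasing is an assumption, so that "Derrick Rose" and
        # "derrick rose" count as the same mention later on.
        grouped[tag].append(phrase.lower())
    return dict(grouped)

# Hand-written entities from two hypothetical posts.
entities = [
    ("Chicago Bulls", "ORGANIZATION"), ("Derrick Rose", "PERSON"),
    ("Derrick Rose", "PERSON"), ("Dallas", "LOCATION"),
]
grouped = group_by_type(entities)

# Serialize and restore, as the pickled files would be.
restored = pickle.loads(pickle.dumps(grouped))
print(restored["PERSON"])
# ['derrick rose', 'derrick rose']
```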

## popular and common entities

Once I had the entities grouped by entity_type, I "grouped" together similar entities. Consider:
- Stephen Curry, Curry, Steph
- LeBron James, James
- D Rose, Derric Rose, Rose

I ran a loop across all entities for an entity_type. For each entity:
- The program searches for an exact match.
  - If found, it increases the count of that entity by 1.
  - Else, it searches for the closest match above a certain threshold.
    - If found, it increases the count of the **closest_match** by 1.
    - Else, it initializes the entity with a count of 1.

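To find similar strings, I used **difflib.get_close_matches(n=1, cutoff=0.5)**. difflib is part of the Python standard library; the idea behind its algorithm is to find the longest contiguous matching subsequence that contains no "junk" elements, and then to apply the same idea recursively to the pieces to the left and right of that match. Each candidate gets a similarity score between 0 and 1, and candidates scoring below the cutoff are ignored.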
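
A small sketch of how these settings behave on the kinds of names listed above. The list `known` and the helper `closest` are made up for illustration; the score is `SequenceMatcher.ratio()`, that is, 2*M/T, where M is the number of matched characters and T is the combined length of both strings.

```python
import difflib

# Hypothetical set of entities already counted.
known = ["stephen curry", "lebron james", "derrick rose"]

def closest(mention):
    """Return the best match in `known` scoring at least 0.5, or None."""
    matches = difflib.get_close_matches(mention, known, n=1, cutoff=0.5)
    return matches[0] if matches else None

for mention in ["curry", "james", "derric rose", "nba"]:
    print("%s -> %s" % (mention, closest(mention)))
# curry -> stephen curry
# james -> lebron james
# derric rose -> derrick rose
# nba -> None

# "curry" matches 5 characters out of 5 + 13 in total: 2*5/18, about 0.56.
print(round(difflib.SequenceMatcher(None, "curry", "stephen curry").ratio(), 2))
# 0.56
```

With a cutoff this low, a bare surname attaches to whichever stored name scores best, so two different people who share a surname may end up merged.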

In [15]:
import pickle, difflib

def find_popular(entity_dict):
    # Maps entity_type -> {canonical entity: mention count}.
    popular = {}
    for entity_type, list_of_entities in entity_dict.items():
        popular[entity_type] = {}
        for entity in list_of_entities:
            if entity in popular[entity_type]:
                # Exact match: bump the existing count.
                popular[entity_type][entity] += 1
            else:
                # Otherwise fold the mention into its closest known entity, if any.
                closest_match = difflib.get_close_matches(entity, set(popular[entity_type].keys()), n=1, cutoff=0.5)
                if len(closest_match)==1:
                    popular[entity_type][closest_match[0]] += 1
                else:
                    # First time this entity (or anything similar) is seen.
                    popular[entity_type][entity] = 1
    return popular

def find_common(fb, tw):
    fb_popular, tw_popular = find_popular(fb), find_popular(tw)
    print "--from facebook--"
    for entity_type, list_of_entities in fb_popular.items():
        print entity_type.upper(), sorted(list_of_entities.items(), key=lambda x:x[1], reverse=True)[:5], "\n"

    print "--from twitter--"
    for entity_type, list_of_entities in tw_popular.items():
        print entity_type.upper(), sorted(list_of_entities.items(), key=lambda x:x[1], reverse=True)[:5], "\n"

    print "--common--"
    common_entity_types = set(fb_popular.keys()).intersection(set(tw_popular.keys()))
    for entity_type in common_entity_types:
        common_entities = set(fb_popular[entity_type].keys()).intersection(set(tw_popular[entity_type].keys()))
        if len(common_entities)>0:
            print entity_type.upper(), sorted([(e, fb_popular[entity_type][e]+tw_popular[entity_type][e]) for e in common_entities], reverse=True, key=lambda tup:tup[1])[:5]

if __name__ == '__main__':
    # Pickles are binary data, so open the files in "rb" mode.
    with open("facebook_ners_pkl", "rb") as f:
        fb = pickle.load(f)
    with open("twitter_ners_pkl", "rb") as f:
        tw = pickle.load(f)

    print "\n********************************Using Stanford NER**************************************\n"
    find_common(fb, tw)


********************************Using Stanford NER**************************************



KeyError: u'derrick rose'
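
For reference, the counting and ranking steps can be sketched in a self-contained form that runs without the pickles. This is not a diagnosis of the error above: it is a simplified variant for a single entity type, with made-up mentions, hypothetical function names (`count_with_merging`, `rank_common`), and `dict.get` lookups with a default of 0 so that a name missing on one side cannot raise a `KeyError`.

```python
import difflib
from collections import Counter

def count_with_merging(mentions, cutoff=0.5):
    """Count mentions, folding each new one into its closest known name."""
    counts = Counter()
    for mention in mentions:
        if mention in counts:
            counts[mention] += 1
            continue
        close = difflib.get_close_matches(mention, list(counts), n=1, cutoff=cutoff)
        # Use the closest existing name if there is one, else start a new entry.
        counts[close[0] if close else mention] += 1
    return counts

def rank_common(fb_counts, tw_counts, top=5):
    """Rank names present in both networks by total mentions."""
    shared = set(fb_counts) & set(tw_counts)
    totals = [(name, fb_counts.get(name, 0) + tw_counts.get(name, 0)) for name in shared]
    # Highest total first; ties broken alphabetically for a stable order.
    return sorted(totals, key=lambda pair: (-pair[1], pair[0]))[:top]

# Made-up mentions for one entity type (PERSON).
fb = count_with_merging(["derrick rose", "d rose", "stephen curry"])
tw = count_with_merging(["derrick rose", "derric rose", "curry", "lebron james"])
print(rank_common(fb, tw))
# [('derrick rose', 4)]
```

In this sample, "stephen curry" (Facebook) and "curry" (Twitter) do not intersect: the name that represents a group is whichever variant was seen first, so the same person can end up under different keys on the two networks.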