<a href="https://colab.research.google.com/github/hhaeri/The-New-York-Social-Graph/blob/main/social_network_graph.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import seaborn as sns
sns.set()

In [None]:
from static_grader import grader

# The New York Social Graph


[New York Social Diary](https://web.archive.org/web/20150913224145/http://www.newyorksocialdiary.com/) provides a
fascinating lens onto New York's socially well-to-do.  The data forms a natural [social graph](https://en.wikipedia.org/wiki/Social_graph) for New York's social elite.  Take a look at this page of a recent [run-of-the-mill holiday party](https://web.archive.org/web/20150913224145/http://www.newyorksocialdiary.com/party-pictures/2014/holiday-dinners-and-doers). Please note that these links point to the internet archive, as the original website has recently removed most of its archives. Many of the images no longer load, but all the HTML is still there.

Besides the brand-name celebrities, you will notice the photos have carefully annotated captions labeling those that appear in the photos.  We can think of this as implicitly implying a social graph: there is a connection between two individuals if they appear in a picture together.

For this project, I will assemble the social graph from photo captions for parties dated December 1, 2014, and before.  Using this graph, one can make guesses at the most popular socialites, the most influential people, and the most tightly coupled pairs.

I will approach the project in the following three phases.:
1. Get a list of all the photo pages to be analyzed.
2. Parse all of the captions on a sample page.
3. Parse all of the captions on all pages, and assemble the graph.

## Phase One


The first step is to crawl the data:
* photos from parties on or before December 1st, 2014 that we can find on the [Party Pictures Archive](https://web.archive.org/web/20150913224145/http://www.newyorksocialdiary.com/party-pictures)


In [3]:
# Import necessary packages
import requests # Used the request package to download the HTML pages
import dill
from bs4 import BeautifulSoup # Used to process the HTML
from datetime import datetime
import datetime as dt
from dateutil.parser import parse

Let's start by getting the [first page](https://web.archive.org/web/20150913224145/http://www.newyorksocialdiary.com/party-pictures).

In [5]:
url = "https://web.archive.org/web/20150913224145/http://www.newyorksocialdiary.com/party-pictures"
page = requests.get(url)
print(page.url) # Use requests.get to download the page.

https://web.archive.org/web/20150913224145/http://www.newyorksocialdiary.com/party-pictures


Now, I process the text of the page with BeautifulSoup. This page has links to 50 party pages. By looking at the structure of the page and determining how to isolate those links, I try to find a pattern to use BeautifulSoup's select or find_all methods to get those elements. [Hint: I quite often use my browser's developer tools (usually Cmd-Option-I on Mac, Ctrl-Shift-I on others) to explore the structure of the HTML page. ]


In [10]:
# Use BeautifulSoup to process the text of the page
soup = BeautifulSoup(page.text, "lxml")
#Let's look at the structure of the page
print(soup.prettify())

<!DOCTYPE html>
<!--[if IEMobile 7]><html class="no-js ie iem7" lang="en" dir="ltr"><![endif]-->
<!--[if lte IE 6]><html class="no-js ie lt-ie9 lt-ie8 lt-ie7" lang="en" dir="ltr"><![endif]-->
<!--[if (IE 7)&(!IEMobile)]><html class="no-js ie lt-ie9 lt-ie8" lang="en" dir="ltr"><![endif]-->
<!--[if IE 8]><html class="no-js ie lt-ie9" lang="en" dir="ltr"><![endif]-->
<!--[if (gte IE 9)|(gt IEMobile 7)]><html class="no-js ie" lang="en" dir="ltr" prefix="fb: http://ogp.me/ns/fb# og: http://ogp.me/ns# article: http://ogp.me/ns/article# book: http://ogp.me/ns/book# profile: http://ogp.me/ns/profile# video: http://ogp.me/ns/video# product: http://ogp.me/ns/product#"><![endif]-->
<!--[if !IE]><!-->
<html class="no-js" dir="ltr" lang="en" prefix="fb: http://ogp.me/ns/fb# og: http://ogp.me/ns# article: http://ogp.me/ns/article# book: http://ogp.me/ns/book# profile: http://ogp.me/ns/profile# video: http://ogp.me/ns/video# product: http://ogp.me/ns/product#">
 <!--<![endif]-->
 <head>
  <script cha

 Looking at the structure of the page I learned I can use attrs 'class':'views-row'
 to grab information for each party page

In [13]:
parent = soup.find_all('div',attrs = {'class':'views-row'})

print('There are {} party pages linked in this page'.format( len(parent)))

There are 50 party pages linked in this page


Within each party page information block, we can find other informations such as the date associated with the party and the actual link that directs us to that specific party.  

In [15]:
parent[0].find_all('span',attrs = {'class':'field-content'})[1].text

'Friday, September 11, 2015'

In [14]:
post_date = parse(parent[0].find_all('span',attrs = {'class':'field-content'})[1].text)
post_date

datetime.datetime(2015, 9, 11, 0, 0)

Using list comprehension we can collect the link to the party pages in a list

In [31]:
links=['https://web.archive.org/'+parent[i].find('span',attrs = {'class':'views-field'}).find('a')['href'] for i in range(len(parent))]
print('There is {} link per each party page, there are total of {} links in the first page'.format(int(len(parent)/len(links)),len(links)))

There is 1 link per each party page, there are total of 50 links in the first page


Let's see if there is any party page dated prior 12/2/2014 in the first page:

In [34]:
links_test =[]
for i in range(len(parent)):
    post_date = parse(parent[i].find_all('span',attrs = {'class':'field-content'})[1].text)
    if post_date<dt.datetime(2014, 12, 2):
      links_test.append('https://web.archive.org/'+parent[i].find('span',attrs = {'class':'views-field'}).find('a')['href'])
print('There is {} party pages dated prior to 12/2/2014'.format(len(links_test)))

There is 0 party pages dated prior to 12/2/2014


Let's take a look at that first link.  Figure out how to extract the URL of the link, as well as the date.  You probably want to use `datetime.strptime`.  See the [format codes for dates](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior) for reference.

In [35]:
link = links[0]
# Check that the title and date match what you see visually.
link

'https://web.archive.org//web/20150913224145/http://www.newyorksocialdiary.com/party-pictures/2015/kicks-offs-sing-offs-and-pro-ams'

For purposes of code reuse, let's put that logic into a function.  It should take the link element and return the URL and date parsed from it.

In [36]:
def get_link_date(el):
    date = parse(el.find_all('span',attrs = {'class':'field-content'})[1].text)
    url = 'https://web.archive.org/'+el.find('span',attrs = {'class':'views-field'}).find('a')['href']
    return url, date

In [37]:
get_link_date(parent[3])

('https://web.archive.org//web/20150913224145/http://www.newyorksocialdiary.com/party-pictures/2015/artist-and-writers-and-designers',
 datetime.datetime(2015, 8, 20, 0, 0))

In [38]:
get_link_date(parent[3])[1]

datetime.datetime(2015, 8, 20, 0, 0)

You may want to check that it works as you expected.

Once that's working, let's write another function to parse all of the links on a page.  Thinking ahead, we can make it take a Requests [Response](https://requests.readthedocs.io/en/master/api/#requests.Response) object and do the BeautifulSoup parsing within it.

In [39]:
def get_links(response):
    links = []
    soup = BeautifulSoup(response.text, "lxml")
    parent = soup.find_all('div',attrs = {'class':'views-row'})
    for i in range(len(parent)):
        l1 = get_link_date(parent[i])
        links.append(l1)
    return links# A list of URL, date pairs

In [40]:
get_links(page)

[('https://web.archive.org//web/20150913224145/http://www.newyorksocialdiary.com/party-pictures/2015/kicks-offs-sing-offs-and-pro-ams',
  datetime.datetime(2015, 9, 11, 0, 0)),
 ('https://web.archive.org//web/20150913224145/http://www.newyorksocialdiary.com/party-pictures/2015/grand-finale-of-the-hampton-classic-horse-show',
  datetime.datetime(2015, 9, 1, 0, 0)),
 ('https://web.archive.org//web/20150913224145/http://www.newyorksocialdiary.com/party-pictures/2015/riders-spectators-horses-and-more',
  datetime.datetime(2015, 8, 26, 0, 0)),
 ('https://web.archive.org//web/20150913224145/http://www.newyorksocialdiary.com/party-pictures/2015/artist-and-writers-and-designers',
  datetime.datetime(2015, 8, 20, 0, 0)),
 ('https://web.archive.org//web/20150913224145/http://www.newyorksocialdiary.com/party-pictures/2015/garden-parties-kickoffs-and-summer-benefits',
  datetime.datetime(2015, 8, 17, 0, 0)),
 ('https://web.archive.org//web/20150913224145/http://www.newyorksocialdiary.com/party-pic

If we run this on the previous response, we should get 50 pairs.

In [44]:
# These should be the same links from earlier
len(get_links(page)) == 50

True

But we only want parties with dates on or before the first of December, 2014.  Let's write a function to filter our list of dates to those at or before a cutoff.  Using a keyword argument, we can put in a default cutoff, but allow us to test with others.

In [51]:
links_filtered=[]
def filter_by_date(links, cutoff=datetime(2014, 12, 1)):
    links_filtered=[]
    # Return only the elements with date <= cutoff
    for i in range(len(links)):
        if links[i][1]<=cutoff:
            links_filtered.append(links[i])
    return(links_filtered)
print('There is {} links to parties with dates before 12/1/2014 in this page'.format(len(filter_by_date(get_links(page)))))

There is 0 links to parties with dates before 12/1/2014 in this page


Now we should be ready to get all of the party URLs.  Click through a few of the index pages to determine how the URL changes.  Figure out a strategy to visit all of them.

HTTP requests are generally IO-bound.  This means that most of the time is spent waiting for the remote server to respond.  If you use `requests` directly, you can only wait on one response at a time.  [requests-futures](https://github.com/ross/requests-futures) lets you wait for multiple requests at a time.  You may wish to use this to speed up the downloading process.

In [55]:
#pip install requests_futures

In [59]:
from requests_futures.sessions import FuturesSession
LIMIT = 26 # total number of index pages in the archive
link_list = []

def get_page_args(i):
    return {"url": url,
            "params": {"page": i}}
# You can use link_list.extend(others) to add the elements of others
# to link_list.
session = FuturesSession(max_workers=5) # max_workers should not be too high so you don't make too many requests per second and get blocked
futures = [session.get(**get_page_args(i)) for i in range(LIMIT)]

# Use list comprehension to collect all the links from all pages
link_list = [filter_by_date(get_links(future.result())) for future in futures]
# flattening the list of lists
link_list_flatten = [item for sublist in link_list for item in sublist]

print('There are total of {} party pages with date before 12/01/2014'.format(len(link_list_flatten)))


There are total of 1193 party pages with date before 12/01/2014


In [60]:
link_list_flatten

[('https://web.archive.org//web/20150918040703/http://www.newyorksocialdiary.com/party-pictures/2014/the-thanksgiving-day-parade-from-the-ground-up',
  datetime.datetime(2014, 12, 1, 0, 0)),
 ('https://web.archive.org//web/20150918040703/http://www.newyorksocialdiary.com/party-pictures/2014/gala-guests',
  datetime.datetime(2014, 11, 24, 0, 0)),
 ('https://web.archive.org//web/20150918040703/http://www.newyorksocialdiary.com/party-pictures/2014/equal-justice',
  datetime.datetime(2014, 11, 20, 0, 0)),
 ('https://web.archive.org//web/20150918040703/http://www.newyorksocialdiary.com/party-pictures/2014/celebrating-the-treasures',
  datetime.datetime(2014, 11, 18, 0, 0)),
 ('https://web.archive.org//web/20150918040703/http://www.newyorksocialdiary.com/party-pictures/2014/associates-and-friends',
  datetime.datetime(2014, 11, 17, 0, 0)),
 ('https://web.archive.org//web/20150918040703/http://www.newyorksocialdiary.com/party-pictures/2014/michaels-25th',
  datetime.datetime(2014, 11, 13, 0, 

In case we need to restart the notebook, we should save this information to a file.  There are many ways you could do this; here's one using `dill`.

In [65]:
dill.dump(link_list_flatten, open('nysd-links.pkd', 'wb'))

To restore the list, we can just load it from the file.  When the notebook is restarted, you can skip the code above and just run this command.

In [66]:
link_list_flatten = dill.load(open('nysd-links.pkd', 'rb'))
print(len(link_list_flatten))
link_list_flatten[1][0]

1193


'https://web.archive.org//web/20150918040703/http://www.newyorksocialdiary.com/party-pictures/2014/gala-guests'

In [67]:
link_list_flatten[1192]

('https://web.archive.org//web/20150918040534/http://www.newyorksocialdiary.com/party-pictures/2007/orchids-growing-wild',
 datetime.datetime(2007, 2, 21, 0, 0))

## Question 1: histogram


Get the number of party pages for the 95 months (that is, month-year pair) in the data (12 months * 8 years - 1 month.  Represent this histogram as a list of 95 tuples, each of the form `("Dec-2014", 1)`.  Note that you can convert `datetime` objects into these sort of strings with `strftime` and the [format codes for dates](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior).

Examining the histogram depicting the occurrences of parties across various years uncovers a declining trend in the number of parties, or at the very least, a decrease in party links on the New York Social Diary archive. However, when we examine the histograms at a more detailed, monthly level, it becomes challenging to discern this downward trend, but it becomes evident when analyzing the annual histogram.

In [101]:
dates_Y = [item[1].strftime("%Y") for item in link_list_flatten]
date_dict_Y = dict()
for date in dates_Y:
    if date in date_dict_Y:
        date_dict_Y[date] = (date_dict_Y[date] + 1)
    else:
        date_dict_Y[date] = (1)
annual_party_pages = pd.DataFrame.from_dict(date_dict_Y,orient='index',columns=['party count'])
annual_party_pages[::-1]


Unnamed: 0,party count
2007,147
2008,176
2009,152
2010,201
2011,156
2012,133
2013,130
2014,98


In [100]:
dates = [item[1].strftime("%b-%Y") for item in link_list_flatten]
date_dict = dict()
for date in dates:
    if date in date_dict:
        date_dict[date] = (date_dict[date] + 1)
    else:
        date_dict[date] = (1)
monthly_party_pages = pd.DataFrame.from_dict(date_dict,orient='index',columns=['party count'])
monthly_party_pages[::-1]


Unnamed: 0,party count
Feb-2007,1
Mar-2007,18
Apr-2007,17
May-2007,20
Jun-2007,10
...,...
Aug-2014,6
Sep-2014,5
Oct-2014,9
Nov-2014,10


## Phase Two


In this phase, we concentrate on getting the names out of captions for a given page.  We'll start with [the benefit cocktails and dinner](https://web.archive.org/web/20150913224145/http://www.newyorksocialdiary.com/party-pictures/2015/celebrating-the-neighborhood) for [Lenox Hill Neighborhood House](http://www.lenoxhill.org/), a neighborhood organization for the East Side.

Take a look at that page.  Note that some of the text on the page is captions, but others are descriptions of the event.  Determine how to select only the captions.

In [102]:
import seaborn as sns
sns.set()

In [104]:
import requests
import dill
from bs4 import BeautifulSoup
from datetime import datetime
import datetime as dt
from dateutil.parser import parse

In [105]:
url2 = "https://web.archive.org/web/20151114014941/http://www.newyorksocialdiary.com/party-pictures/2015/celebrating-the-neighborhood"
page2 = requests.get(url2)
print(page2.url) # Use requests.get to download the page.

https://web.archive.org/web/20151114014941/http://www.newyorksocialdiary.com/party-pictures/2015/celebrating-the-neighborhood


Let's encapsulate this in a function.  As with the links pages, we want to avoid downloading a given page the next time we need to run the notebook.  While we could save the files by hand, as we did before, a checkpoint library like [ediblepickle](https://pypi.python.org/pypi/ediblepickle/1.1.3) can handle this for you.  (Note, though, that you may not want to enable this until you are sure that your function is working.)

You should also keep in mind that HTTP requests fail occasionally, for transient reasons.  You should plan how to detect and react to these failures.   The [retrying module](https://pypi.python.org/pypi/retrying) is one way to deal with this.

In [110]:
def get_captions(path):
    soup2 = BeautifulSoup(requests.get(path).text,"lxml")
    captions = [item.text for item in (soup2.find_all('div',attrs = {'class':'photocaption'}) +\
                                      soup2.find_all('td',attrs = {'class':'photocaption'}) + \
                                      soup2.find_all('font'))]
    return captions

In [114]:
print('There are {} captions in the examined page'.format(len(get_captions(url2))))

There are 110 captions in the examined page


In [117]:
captions = get_captions(url2)
captions

["Glenn Adamson, Simon Doonan, Victoire de Castellane, Craig Leavitt, Jerome Chazen, Andi Potamkin, Ralph Pucci, Kirsten Bailey, Edwin Hathaway, and Dennis Freedman at the Museum of Art and Design's annual MAD BALL. ",
 ' Randy Takian ',
 ' Kamie Lightburn and Christopher Spitzmiller ',
 ' Christopher Spitzmiller and Diana Quasha ',
 ' Mariam Azarm, Sana Sabbagh, and Lynette Dallas ',
 ' Christopher Spitzmiller, Sydney Shuman, and Matthew Bees',
 ' Christopher Spitzmiller and Tom Edelman ',
 ' Warren Scharf and Sydney Shuman ',
 ' Amory McAndrew and Sean McAndrew ',
 ' Sydney Shuman, Mario Buatta, and Helene Tilney ',
 ' Katherine DeConti and Elijah Duckworth-Schachter ',
 ' John Rosselli and Elizabeth Swartz ',
 ' Stephen Simcock, Lee Strock, and Thomas Hammer ',
 ' Searcy Dryden, Lesley Dryden, Richard Lightburn, and Michel Witmer ',
 ' Jennifer Cacioppo and Kevin Michael Barba ',
 ' Virginia Wilbanks and Lacary Sharpe ',
 ' Valentin Hernandez, Yaz Hernandez, Chele Farley, and James 

In [None]:
#with open ("captions","w") as f:
#    for c in captions:
#        f.write(c)
#        f.write("\n")

Now that we have some sample captions, let's start parsing names out of those captions.  There are many ways of going about this, and we leave the details up to you.  Some issues to consider:

  1. Some captions are not useful: they contain long narrative texts that explain the event.  Try to find some heuristic rules to separate captions that are a list of names from those that are not.  A few heuristics include:
    - Look for sentences (which have verbs) and as opposed to lists of nouns. For example, [`nltk` does part of speech tagging](http://www.nltk.org/book/ch05.html) but it is a little slow. There may also be heuristics that accomplish the same thing.
    - Similarly, spaCy's [entity recognition](https://spacy.io/docs/usage/entity-recognition) could be useful here, but like `nltk` using `spaCy` will add to processing time.
    - Look for commonly repeated threads (e.g. you might end up picking up the photo credits or people such as "a friend").
    - Long captions are often not lists of people.  The cutoff is subjective, but for grading purposes *we set that cutoff at 250 characters*.
  2. You will want to separate the captions based on various forms of punctuation.  Try using `re.split`, which is more sophisticated than `string.split`. **Note**: The reference solution uses regex exclusively for name parsing.
  3. You might find a person named "ra Lebenthal".  There is no one by this name.  Can anyone spot what's happening here?
  4. This site is pretty formal and likes to say things like "Mayor Michael Bloomberg" after his election but "Michael Bloomberg" before his election.  Can you find other ('optional') titles that are being used?  They should probably be filtered out because they ultimately refer to the same person: "Michael Bloomberg."
  5. There is a special case you might find where couples are written as e.g. "John and Mary Smith". You will need to write some extra logic to make sure this properly parses to two names: "John Smith" and "Mary Smith".
  6. When parsing names from captions, it can help to look at your output frequently and address the problems that you see coming up, iterating until you have a list that looks reasonable. This is the approach used in the reference solution. Because we can only asymptotically approach perfect identification and entity matching, we have to stop somewhere.
  
**Questions worth considering:**
  1. Who is Patrick McMullan and should he be included in the results? How would you address this?
  2. What else could you do to improve the quality of the graph's information?

In [116]:
import re
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [121]:
text = nltk.word_tokenize(captions[0])
caption_tag=nltk.pos_tag(text)
caption_tag

[('Glenn', 'NNP'),
 ('Adamson', 'NNP'),
 (',', ','),
 ('Simon', 'NNP'),
 ('Doonan', 'NNP'),
 (',', ','),
 ('Victoire', 'NNP'),
 ('de', 'NNP'),
 ('Castellane', 'NNP'),
 (',', ','),
 ('Craig', 'NNP'),
 ('Leavitt', 'NNP'),
 (',', ','),
 ('Jerome', 'NNP'),
 ('Chazen', 'NNP'),
 (',', ','),
 ('Andi', 'NNP'),
 ('Potamkin', 'NNP'),
 (',', ','),
 ('Ralph', 'NNP'),
 ('Pucci', 'NNP'),
 (',', ','),
 ('Kirsten', 'NNP'),
 ('Bailey', 'NNP'),
 (',', ','),
 ('Edwin', 'NNP'),
 ('Hathaway', 'NNP'),
 (',', ','),
 ('and', 'CC'),
 ('Dennis', 'NNP'),
 ('Freedman', 'NNP'),
 ('at', 'IN'),
 ('the', 'DT'),
 ('Museum', 'NNP'),
 ('of', 'IN'),
 ('Art', 'NNP'),
 ('and', 'CC'),
 ('Design', 'NNP'),
 ("'s", 'POS'),
 ('annual', 'JJ'),
 ('MAD', 'NNP'),
 ('BALL', 'NNP'),
 ('.', '.')]

In [123]:
names=[]
current_name=[]
for name in range(len(caption_tag)):
    if caption_tag[name][1]=='NNP':
        current_name.append(caption_tag[name][0])
        #print(len(current_name))
    else:
        if len(current_name)>1:
            names.append(' '.join(current_name))
        current_name=[]

In [125]:
def get_text(captions_input,i):
    text = []
    text=nltk.word_tokenize(captions_input[i].replace("\n",","))#.replace("\n"," ")
    text.append(',')
    return(text)

In [None]:
# def get_text(captions_input,i):
# #removing the new lines in the middle of captions in two cases:
# #   1) when the newline in in between one person's name
# #       (at the end and begining of the caption string)
# #   2) when the newline is in between two names
# # my code still can't handle the combination of the two at the same time in one caption
#     caption = captions_input[i]
#     if len(caption.split()) <2 or (len(caption.split()) <4 and 'and' in caption):
#         caption =','
#     elif nltk.word_tokenize(caption.replace('\n',","))[-2] in (',', 'and') or \
#         nltk.word_tokenize(caption.replace('\n',","))[1] in (',', 'and'):

#         caption_nonewl = caption.split('\n')
#         caption_nonewl.append(',')
#         caption = ' '.join([item for item in caption_nonewl])
#     else:
#         caption = caption.replace('\n',",") + ','


#     text = nltk.word_tokenize(caption)
#     return(text)

In [137]:
def get_name(text_input):
    # Here I try to Hardwire some filters to the names
    filter = ["Academy","School","''","Bloody","Dance","Company","Tribute","Street","Happening","A","NYSD","Halloween","Library","Scouts",\
              "BROADWAY","Benefit","Gala","League","CELEBRATION","Halloween","Library","Scouts","Chairman","Girls"\
              "Seasonal","Guests","Guest's","party","Cooking","Entertaining","Legal","Equal","NAACP","Director","Executive",\
              "Mayor","Award","National","Princess","Dr.","Dr","Mr.","Mrs.","Mr","Mrs","Dinner","MAD","Ballroom","Ball","zombie",\
              "Zombie","Memorial","Restaurant","Dr.","Director","Fund","Muesum",\
              "Board","member","president","chair","co_chair","Actor","Actress",'New','Knowledge','Management','Institutional',\
              'Advancement','Curator','consul', 'General','Ambassador','Photography','/','Images','youth','foster',\
              '''Children's ''','Director','honoree','CNN', 'News', 'Anchor','Associates','Committee','’ s',\
              'bottom','row','top','front','Surgery','Special','Los','Publisher','Professor','Madame', 'Foundation',\
              'Scholarship','Fashion','Baby','Producer','Anniversary','Founder','Program','Pajama','wife','Co-Chair', \
              'CEO','CEO', 'PBS', 'host','Agency', 'Mount','Photography','Properties','co-chair','co-chairs', 'event','HRH', 'Trustee',\
              'President–Elect','Avenue','Council','Botanical', 'Garden', 'Editor', 'plant', 'kingdom',\
              'Honorees', 'Emeritus', 'Vice', 'Chief', 'Deputy', 'Associate', 'Arts', 'Decorative', 'Initiative','City',\
              'Carnegie', 'Commission','Premiere','Union','Yugoslavia','night','pianist','choreographer',\
              'composer','Father','Govenor','USA','Miss','Auctioneer','Emerita', 'Lady','Museum', 'American', 'Natural',\
              'History','chairs','Co-chairman','signature','sculpture','Group','amongst','Nutcracker','Family','Ballet',\
              'bidding','Theater','Chef','Chairmen','Co-Chairmen','Chairmen','Hollywood','Décor', 'book','community','design',\
              'Ceremonies','daughter','Correspondent','Commissioner','NYC','Department','Authority', 'Police']

    filter = [each_string.lower() for each_string in filter]

    missing_names = ['Wyclef','Silver','de','Diandra','Kanavos','Gillian','GIllian','le','Somers','Hamish','Bowles','Alexander',\
                     'Johannes', 'Huebl','Zugazagoitia','Yaz','Agnes']
    names=[]
    current_name=[]
    caption_tag=nltk.pos_tag(text_input)
    for name in range(len(caption_tag)):
        caption_tag[name][0].strip('.')
        if caption_tag[name][0].strip().lower() not in (filter) and (caption_tag[name][1].strip() in ('NN','NNP') or caption_tag[name][0].strip() in missing_names):
            current_name.append(caption_tag[name][0].strip())#.replace('GIllian','Gillian'))

        else:
            #trying to parse the husband and wife 1st and last names
            if len(current_name)==1 and name<len(caption_tag)-1:
                if caption_tag[name][0].strip()=="and" and (caption_tag[name+1][1].strip() in ('NN','NNP') or caption_tag[name][0].strip() in missing_names) and caption_tag[name+2][1].strip() in ('NN','NNP'):
                    current_name.append(caption_tag[name+2][0].strip())

            #if len(current_name)>1:
            if len(current_name)>1:
                if ("Magazine" in current_name and "Susan" not in current_name):
                    current_name=[]
                names.append(' '.join(current_name))
            current_name=[]
            #[x.replace('Jeanne Shafiroff','Jean Shafiroff') for x in names]

    #some photographers names
    if 'Patrick McMullan' in names:
        names.remove('Patrick McMullan')

#    if 'Rob Rich' in names:
#        names.remove('Rob Rich')
#    if 'Annie Watt' in names:
#        names.remove('Annie Watt')
    names = [item.strip(',').strip('.').strip(',').strip('') for item in names]
    if len(names)>30: names = []
    return(names)

In [138]:
get_name(get_text(captions,0))

['Glenn Adamson',
 'Simon Doonan',
 'Victoire de Castellane',
 'Craig Leavitt',
 'Jerome Chazen',
 'Andi Potamkin',
 'Ralph Pucci',
 'Kirsten Bailey',
 'Edwin Hathaway',
 'Dennis Freedman']

Next step is my attempt ot Hardwire some filters to the names




Once you feel that your algorithm is working well on these captions, parse all of the captions and extract all the names mentioned.  Sort them alphabetically, by first name, and return the first hundred.

In [139]:
def names_from_page(captions_input):
    names=[]
    for i in range(len(captions_input)):
        names.extend(get_name(get_text(captions_input,i)))
    unique_names=list(dict.fromkeys(names))
    #filter = ["MAD","Ballroom","BALL","zombie","Zombie","silver","CNN News Anchor","Memorial","Restaurant","Dr.","Director"]
    #Final_Names = [s for s in unique_names if not any(filter in s for filter in filter)]
    Sorted_Final_Names = sorted(unique_names, key=lambda x: x.split(" ")[0])
    Final_Names = [s.strip("/") for s in Sorted_Final_Names]
    return(Final_Names)

In [141]:
names_from_page(get_captions(link_list_flatten[2][0]))

['Aaron Tighe',
 'Aaron Malinsky',
 'Adrianne Silver',
 'Alice Levine',
 'Amelia Ogunleis',
 'Amsale Aberra',
 'Andee Radu',
 'Andrea Klein',
 'Angela Vallot',
 'Ann Marie Mirabile',
 'Anne-Marie Olson Kahn',
 'Aziz Friedrich',
 'Barbara Cohen',
 'Barbara Murphy',
 'Bart Tiernan',
 'Ben Malinsky',
 'Benjamin Talton',
 'Bernard Tyson',
 'Beverly Bell',
 'Bob Hussey',
 'Bob Mackay',
 'Bonnie Williamson',
 'Brendan FitzGerald',
 'Bruce Gordon',
 'Byron Pitts',
 'Carol Large',
 'Carol Clarke',
 'Caroline Gerry',
 'Carolyn Malinsky',
 'Charles Atkins',
 'Charles Holcomb',
 'Cheryl Brown Henderson',
 'Chris Fox',
 'Chris Nottonson',
 'Christine Gachot',
 'Clint Finlayson',
 'Cynthia Jay',
 'Cyrus Izzo',
 'Dan Barry',
 'Daniel Parker',
 'Darren Walker',
 'David Reich',
 'David Hidalgo',
 'David Greenbaum',
 'Debbie Hussey',
 'Debby Greenberg',
 'Deborah Roberts',
 'Deborah Kern',
 'Debra Lee',
 'Debra Martin Chase',
 'Defense Fund',
 'Denise Tyson',
 'Dennis Brownlee',
 'Denyse Duval Pugsley'

Now, run this sort of test on a few other pages.  You will probably find that other pages have a slightly different HTML structure, as well as new captions that trip up your caption parser.  But don't worry if the parser isn't perfect -- just try to get the easy cases.

## Phase Three


Once you are satisfied that your caption scraper and parser are working, run this for all of the pages.  If you haven't implemented some caching of the captions, you probably want to do this first.

In [144]:
#pip install ediblepickle

In [145]:
from ediblepickle import checkpoint
import os
from urllib.parse import quote

cache_dir = 'cache'
if not os.path.exists(cache_dir):
    os.mkdir(cache_dir)

#@checkpoint(key=lambda args,kargs: quote(args[0].split('/')[-1])+'.p', work_dir=cache_dir, refresh=False)
@checkpoint(key=lambda args,kargs: quote(args[0].split('/')[-2]+"_"+args[0].split('/')[-1])+'.p', work_dir=cache_dir, refresh=False)
# @checkpoint(key=lambda args,kargs: quote(args[0]).replace("/","_") +'.p', work_dir=cache_dir, refresh=False)
# checkpoint is called decorator and writes the results to file
def caption_caching(path):
    #print('Caching the photo captures')
    result = get_captions(path)
    #print('Caching complete.')
    return result

In [146]:
def get_all_captions(link_list):
    caption_list = []
    caption_lenght =[]
    zero_list=[]
    for i in range(len(link_list)):
        #print('Caching the photo captures from site ',i)
        caption_list += caption_caching(link_list[i][0])
        caption_lenght += [len(caption_caching(link_list[i][0]))]
        if len(caption_caching(link_list[i][0]))==0:
            zero_list += [link_list[i][0]]
    return zero_list,caption_lenght,caption_list

In [148]:
%%time
all_caption_list=get_all_captions(link_list_flatten)

CPU times: user 1min 55s, sys: 1.39 s, total: 1min 57s
Wall time: 39min 26s


In [None]:
count = get_all_captions(link_list)[1]

In [None]:
sorted_count = sorted(zip(count,link_list))

In [None]:
sorted_count[0:100]

[(0,
  ('https://web.archive.org//web/20150910153045/http://www.newyorksocialdiary.com/party-pictures/2009/notable-occasions',
   datetime.datetime(2009, 6, 22, 0, 0))),
 (0,
  ('https://web.archive.org//web/20151024001249/http://www.newyorksocialdiary.com/party-pictures/2009/home-for-the-holidays',
   datetime.datetime(2009, 12, 23, 0, 0))),
 (0,
  ('https://web.archive.org//web/20151024001249/http://www.newyorksocialdiary.com/party-pictures/2010/all-about-me-and-the-rest-of-new-york',
   datetime.datetime(2010, 3, 23, 0, 0))),
 (0,
  ('https://web.archive.org//web/20151024001249/http://www.newyorksocialdiary.com/party-pictures/2010/fashion-week-marches-on',
   datetime.datetime(2010, 2, 18, 0, 0))),
 (0,
  ('https://web.archive.org//web/20151024001249/http://www.newyorksocialdiary.com/party-pictures/2010/fundraisers-and-opening-nights',
   datetime.datetime(2010, 2, 1, 0, 0))),
 (0,
  ('https://web.archive.org//web/20151024001249/http://www.newyorksocialdiary.com/party-pictures/2010/

In [None]:
 s = sorted((get_all_captions(link_list)[1][i], link_list[i][0]) for i in range(len(link_list)))

KeyboardInterrupt: 

In [None]:
s

[(301,
  'https://web.archive.org//web/20151024112547/http://www.newyorksocialdiary.com/party-pictures/2009/fall-fundraising'),
 (277,
  'https://web.archive.org//web/20150910104534/http://www.newyorksocialdiary.com/party-pictures/2011/opening-night-preview-of-the-winter-antiques-show'),
 (272,
  'https://web.archive.org//web/20151024004253/http://www.newyorksocialdiary.com/party-pictures/2010/summer-saturday-afternoons'),
 (262,
  'https://web.archive.org//web/20151024020634/http://www.newyorksocialdiary.com/party-pictures/2010/rolling-out-the-red-carpet'),
 (261,
  'https://web.archive.org//web/20150910153045/http://www.newyorksocialdiary.com/party-pictures/2009/next-generation-has-arrived'),
 (260,
  'https://web.archive.org//web/20151024020634/http://www.newyorksocialdiary.com/party-pictures/2010/the-museum-dance'),
 (257,
  'https://web.archive.org//web/20150910152250/http://www.newyorksocialdiary.com/party-pictures/2011/midsummer-party-galas'),
 (254,
  'https://web.archive.org//

In [None]:
all_caption_list[0]

['https://web.archive.org//web/20151024004253/http://www.newyorksocialdiary.com/party-pictures/2010/dog-days-afternoons-and-evenings',
 'https://web.archive.org//web/20151024004253/http://www.newyorksocialdiary.com/party-pictures/2010/independence-and-verve',
 'https://web.archive.org//web/20151024020634/http://www.newyorksocialdiary.com/party-pictures/2010/summer-season-in-full-swing',
 'https://web.archive.org//web/20151024020634/http://www.newyorksocialdiary.com/party-pictures/2010/new-york-restoration',
 'https://web.archive.org//web/20151024020634/http://www.newyorksocialdiary.com/party-pictures/2010/babes-bags-and-bubbles',
 'https://web.archive.org//web/20151024020634/http://www.newyorksocialdiary.com/party-pictures/2010/more-from-the-28th-annual-hat-luncheon',
 'https://web.archive.org//web/20151024020634/http://www.newyorksocialdiary.com/party-pictures/2010/the-art-of-giving',
 'https://web.archive.org//web/20151024001249/http://www.newyorksocialdiary.com/party-pictures/2010/a

In [None]:
len(all_caption_list[0])

30

In [None]:
all_caption_list[2][10]

'\n\n\n\n'

In [None]:
count = 0
for item in all_caption_list[2]:
    if item == []:
        count +=1
count

0

In [None]:
for i in range(len(link_list)): print(link_list[i][0],all_caption_list[1][i])

https://web.archive.org//web/20150918040703/http://www.newyorksocialdiary.com/party-pictures/2014/the-thanksgiving-day-parade-from-the-ground-up 182
https://web.archive.org//web/20150918040703/http://www.newyorksocialdiary.com/party-pictures/2014/gala-guests 74
https://web.archive.org//web/20150918040703/http://www.newyorksocialdiary.com/party-pictures/2014/equal-justice 83
https://web.archive.org//web/20150918040703/http://www.newyorksocialdiary.com/party-pictures/2014/celebrating-the-treasures 64
https://web.archive.org//web/20150918040703/http://www.newyorksocialdiary.com/party-pictures/2014/associates-and-friends 67
https://web.archive.org//web/20150918040703/http://www.newyorksocialdiary.com/party-pictures/2014/michaels-25th 108
https://web.archive.org//web/20150918040703/http://www.newyorksocialdiary.com/party-pictures/2014/new-york-lifelines 53
https://web.archive.org//web/20150918040703/http://www.newyorksocialdiary.com/party-pictures/2014/legends-and-leaders 107
https://web.ar

In [None]:
len(all_caption_list[2])

128509

In [None]:
all_caption_list[2]

['\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n',
 '',
 '\n\n\n\n

In [None]:
get_all_captions(link_list)[2]

' Chuck Grodin '

In [None]:
link_list

[('https://web.archive.org//web/20150918040703/http://www.newyorksocialdiary.com/party-pictures/2014/the-thanksgiving-day-parade-from-the-ground-up',
  datetime.datetime(2014, 12, 1, 0, 0)),
 ('https://web.archive.org//web/20150918040703/http://www.newyorksocialdiary.com/party-pictures/2014/gala-guests',
  datetime.datetime(2014, 11, 24, 0, 0)),
 ('https://web.archive.org//web/20150918040703/http://www.newyorksocialdiary.com/party-pictures/2014/equal-justice',
  datetime.datetime(2014, 11, 20, 0, 0)),
 ('https://web.archive.org//web/20150918040703/http://www.newyorksocialdiary.com/party-pictures/2014/celebrating-the-treasures',
  datetime.datetime(2014, 11, 18, 0, 0)),
 ('https://web.archive.org//web/20150918040703/http://www.newyorksocialdiary.com/party-pictures/2014/associates-and-friends',
  datetime.datetime(2014, 11, 17, 0, 0)),
 ('https://web.archive.org//web/20150918040703/http://www.newyorksocialdiary.com/party-pictures/2014/michaels-25th',
  datetime.datetime(2014, 11, 13, 0, 

In [None]:
names_from_all_pages=names_from_page(all_caption_list[2])
print(len(names_from_all_pages))
names_from_all_pages

105170


["'Jessica Rabbit",
 "'Roger Rabbit",
 "'The Chick Magnet",
 ',Alan Wilkinson',
 ',Alana McCarthy',
 ',Alana Zonan',
 ',Alejandra Saralegui',
 ',Alex Fanjul.',
 ',Alexis Zoullas',
 ',Alina Cho',
 ',Alina Cho.',
 ',Alison Mazzola',
 ',Amanda Ross',
 ',Amanda Hearst',
 ',Andrea Correale',
 ',Angie Harmon',
 ',Ann Ligouri',
 ',Annalise Peterson',
 ',Audra McDonald',
 ',BSO Overseer Kim Taylor',
 ',Barbara Feldon',
 ',Barbara Slifka',
 ',Barry Diller',
 ',Bebe Monnahan',
 ",Beth O'Donnell",
 ',Beth Pine',
 ',Beth Ostrosky Stern',
 ',Beth Stern',
 ',Bettina Anderson',
 ',Bonnie Comley',
 ',Byron Lewis.',
 ',Calvin Klein',
 ',Cameron Preston',
 ',Campion Platt',
 ',Cantata Stagionale',
 ',Carol Beckwith',
 ',Carol Gertz.',
 ',Caroline Loevner',
 ',Catherine Sabino',
 ',Celerie Kemble',
 ',Charity Wright',
 ',Chris Brown',
 ',Connie Krueger.',
 ',Curtis Young',
 ',Cynthia Frank',
 ',Darlene Love.',
 ',David Koch.',
 ',Dendy Engelman',
 ',Devon Windsor',
 ',Diane Habig',
 ',Doug McCorkmick',
 

In [None]:
textfile = open("all_names.txt", "w")
for element in names_from_all_pages:
    textfile.write(element + "\n")
textfile.close()

In [None]:
list1 = names_from_page(caption_caching(link_list[2][0]))

In [None]:
list1

['Aaron Tighe',
 'Aaron Malinsky',
 'Alice Levine',
 'Amelia Ogunleis',
 'Amsale Aberra',
 'Andee Radu',
 'Andrea Klein',
 'Angela Vallot',
 'Ann Marie Mirabile',
 'Anne-Marie Olson Kahn',
 'Aziz Friedrich',
 'Barbara Cohen',
 'Barbara Murphy',
 'Bart Tiernan',
 'Ben Malinsky',
 'Benjamin Talton',
 'Bernard Tyson',
 'Beverly Bell',
 'Bob Hussey',
 'Bob Mackay',
 'Bonnie Williamson',
 'Brendan FitzGerald',
 'Bruce Gordon',
 'Byron Pitts',
 'Carol Large',
 'Carol Clarke',
 'Caroline Gerry',
 'Carolyn Malinsky',
 'Charles Atkins',
 'Charles Holcomb',
 'Cheryl Brown Henderson',
 'Chris Fox',
 'Chris Nottonson',
 'Christine Gachot',
 'Clint Finlayson',
 'Cynthia Jay',
 'Cyrus Izzo',
 'Dan Barry',
 'Daniel Parker',
 'Darren Walker',
 'David Reich',
 'David Hidalgo',
 'David Greenbaum',
 'Debbie Hussey',
 'Debby Greenberg',
 'Deborah Roberts',
 'Deborah Kern',
 'Debra Lee',
 'Debra Martin Chase',
 'Defense Fund',
 'Denise Tyson',
 'Dennis Brownlee',
 'Denyse Duval Pugsley',
 'Devon Carroll',


In [None]:
# Scraping all of the pages could take 10 minutes to an hour

For the remaining analysis, we think of the problem in terms of a
[network](http://en.wikipedia.org/wiki/Computer_network) or a
[graph](https://en.wikipedia.org/wiki/Graph_%28discrete_mathematics%29).  Any time a pair of people appear in a photo together, that is considered a link.  What we have described is more appropriately called an (undirected)
[multigraph](http://en.wikipedia.org/wiki/Multigraph) with no self-loops, but this has an obvious analog in terms of an undirected [weighted graph](http://en.wikipedia.org/wiki/Graph_%28mathematics%29#Weighted_graph).  

In the remainder of this miniproject, we will analyze the social graph of the New York social elite.  We recommend using python's [`networkx`](https://networkx.github.io/) library to build this social graph.

In [None]:
import itertools  # itertools.combinations may be useful
import networkx as nx

All in all, you should end up with over 100,000 captions and more than 110,000 names, connected in about 200,000 pairs.

**Note:** If you have a significantly smaller number of names or name pairs, verify that you are correctly identifying the caption(s) from each party page.  Are there some pages where your methods return no names?  Examine those pages more closely to determine why that is the case and modify your methods appropriately.  

Here are some useful commands from networkx

In [None]:
#add nodes from any iterable container, such as a list
#G.add_nodes_from([2, 3])

In [None]:
#Nodes from one graph can be incorporated into another
#G.add_nodes_from(H)  #G now contains the nodes of H as nodes of G.
#G.add_node(H)        #In contrast, you could use the graph H as a node in G

In [None]:
def Reverse(tuples):
    new_tup = tuples[::-1]
    return new_tup

In [None]:
#get_name(get_text(caption_caching(link_list[9][0])))

In [None]:
def grow_graph(Existing_Graph,new_caption_names):

    for s in itertools.combinations(new_caption_names, 2):
        if (s or Reverse(s)) in Existing_Graph.edges:
            Existing_Graph.edges[s]['weight'] += 1
        else:
            Existing_Graph.add_edges_from([s], weight=1)

In [None]:
%%time
###NOw I get all the names from each caption and call the grow_graph to build the graph incrementally
#Create an empty graph with no nodes and no edges.
H = nx.Graph()
i = 0
for i in range(len(all_caption_list[2])):
    grow_graph(H,get_name(get_text(all_caption_list[2],i)))
    if i%10000==0:
        print("adding names from caption to graph",i)
    #i +=1


adding names from caption to graph 0
adding names from caption to graph 10000
adding names from caption to graph 20000
adding names from caption to graph 30000
adding names from caption to graph 40000
adding names from caption to graph 50000
adding names from caption to graph 60000
adding names from caption to graph 70000
adding names from caption to graph 80000
adding names from caption to graph 90000
adding names from caption to graph 100000
adding names from caption to graph 110000
adding names from caption to graph 120000
CPU times: user 1min, sys: 1.36 s, total: 1min 1s
Wall time: 1min 1s


In [None]:
H.number_of_nodes()

96633

In [None]:
degree_list=[n for n in H.degree(weight='weight')]

In [None]:
degree_list

[('Les Lieberman', 84),
 ('Barri Lieberman', 43),
 ('Isabel Kallman', 17),
 ('Trish Iervolino', 25),
 ('Ron Iervolino', 43),
 ('Diana Rosario', 5),
 ('Ali Sussman', 5),
 ('Sarah Boll', 5),
 ('Jen Zaleski', 9),
 ('Alysse Brennan', 13),
 ('Lindsay Macbeth', 5),
 ('Kelly Murro', 1),
 ('Tom Murro', 38),
 ('Russ Middleton', 3),
 ('Lisa Middleton', 6),
 ('Barbara Loughlin', 28),
 ('Gerald Loughlin', 133),
 ('Debbie Gelston', 2),
 ('Heather Robinson', 6),
 ('Kiwan Nichols', 4),
 ('Jimmy Nichols', 5),
 ('Melanie Carbone', 4),
 ('Nancy Brown', 17),
 ('Bill Mack', 24),
 ('David Lyden', 10),
 ('Patricia Sorenson', 1),
 ('Jimmy Cayne', 2),
 ('Vince Tese', 3),
 ('Pat Cayne', 3),
 ('Stuart Oran', 2),
 ('Hilary Oran', 2),
 ('Chuck Grodin', 5),
 ('Dwight Gooden', 1),
 ('Amy Cunningham-Bussel', 2),
 ('Ray Mirra', 4),
 ('Tyler Janovitz', 2),
 ('Dan Shedrick', 1),
 ('Samara Heafitz', 1),
 ('Cass Adelman', 54),
 ('Jason Adelman', 8),
 ('Bart Scott', 6),
 ('Mark Laplander', 1),
 ('Mitch Rubin', 9),
 ('Audr

In [None]:
popular_names=sorted(degree_list,key=lambda x:-x[1])[0:100]
popular_names

[('Jean Shafiroff', 712),
 ('Gillian Miniter', 581),
 ('Mark Gilbertson', 500),
 ('Alexandra Lebenthal', 441),
 ('Geoffrey Bradfield', 404),
 ('Somers Farkas', 349),
 ('Yaz Hernandez', 341),
 ('Debbie Bancroft', 319),
 ('Eleanora Kennedy', 315),
 ('Jamee Gregory', 315),
 ('Andrew Saffir', 301),
 ('Sharon Bush', 298),
 ('Alina Cho', 291),
 ('Muffie Potter Aston', 288),
 ('Kamie Lightburn', 287),
 ('Michael Bloomberg', 287),
 ('Liliana Cavendish', 269),
 ('Bonnie Comley', 259),
 ('Lydia Fenet', 257),
 ('Allison Aston', 252),
 ('Barbara Tober', 243),
 ('Mario Buatta', 235),
 ('Lucia Hwong Gordon', 234),
 ('Deborah Norville', 231),
 ('Bettina Zilkha', 229),
 ('Karen LeFrak', 228),
 ('Ellen V. Futter', 225),
 ('Liz Peek', 213),
 ('Dennis Basso', 206),
 ('Grace Meigher', 206),
 ('Stewart Lane', 203),
 ('Daniel Benedict', 197),
 ('Elizabeth Stribling', 194),
 ('Michele Herbert', 192),
 ('Nicole Miller', 191),
 ('Fernanda Kellogg', 188),
 ('Leonard Lauder', 187),
 ('Evelyn Lauder', 186),
 ('Sy

In [None]:
import statistics

In [None]:
print('mean = ', statistics.mean([i[1] for i in popular_names]))
print('std = ', statistics.stdev([i[1] for i in popular_names]))
print('mode = ', statistics.mode([i[1] for i in popular_names]))
print('maximum = ', popular_names[0][1])
print('minimum = ', popular_names[99][1])
print('quantile = ', statistics.quantiles([i[1] for i in popular_names],n=4))


mean =  207.22
std =  93.37862321070745
mode =  156
maximum =  712
minimum =  133
quantile =  [154.25, 174.0, 228.75]


In [None]:
degree = popular_names

In [None]:
n=0
for i in range(len(all_caption_list[2])):
    if 'Chuck Close' in all_caption_list[2][i]:
        n +=1
        print(all_caption_list[2][i])
        print(get_name(get_text(all_caption_list[2],i)))



 Jacques Marcon, Jean Francois Bruel, Maura Haynes, Bill Kronenberg, Chuck Close, Daniel Boulud, and Ernie Thrasher 
['Jean Francois Bruel', 'Maura Haynes', 'Bill Kronenberg', 'Chuck Close', 'Daniel Boulud', 'Ernie Thrasher']
 Régis Marcon, Chuck Close, and Daniel Boulud 
['Régis Marcon', 'Chuck Close', 'Daniel Boulud']
Chuck Close 
['Chuck Close']
 Genevieve Bahrenburg and Chuck Close 
['Genevieve Bahrenburg', 'Chuck Close']
Donna Karan and Chuck Close
['Donna Karan', 'Chuck Close']
 Katherine Gage, Daniel Boulud, Chuck Close, and Emma McCormick-Goodheart 
['Katherine Gage', 'Daniel Boulud', 'Chuck Close', 'Emma McCormick-Goodheart']
 Patrick McMullan and Chuck Close with the New York Observer group
['Chuck Close', 'York Observer']
 Agnes Gund and Chuck Close 
['Agnes Gund', 'Chuck Close']
 Chuck Close, Terrie Sultan, Patricia Birch, Barbara Goldsmith, Paul Taylor, Taylor Barton-Smith, Tony Ingrao, and Randy Kemper 
['Chuck Close', 'Terrie Sultan', 'Patricia Birch', 'Barbara Goldsmith

In [None]:
'Ceremonies','daughter','Correspondent','Commissioner','NYC','Department','Authority', 'Police'

('Ceremonies',
 'daughter',
 'Correspondent',
 'Commissioner',
 'NYC',
 'Department',
 'Authority',
 'Police')

In [None]:
name = 'Chuck Close'
for i in range(len(all_caption_list[2])):
    if (name in all_caption_list[2][i]) and (name not in get_name(get_text(all_caption_list[2],i))) and (len(all_caption_list[2][i])>1):
        print(all_caption_list[2][i])
        print(get_name(get_text(all_caption_list[2],i)))

In [None]:
#Notes about Gillian: 1) nltk doesnt count Gillian as a name, it thinks it is a JJ which is adjective, /
# however it captures it as name at some captions I cant understand why is that happening
#2) I see that in some captures it is written as GIllian Miniter
# I have to come up with a solution to force nltk to return NNP for Gillian or think of some workaround. Also need to change all
#GIllian to Gillian
print(nltk.pos_tag(['Gillian ', 'Miniter']))
print(nltk.pos_tag(['GIllian', 'Miniter']))
print(get_name(get_text(['Martin  and Audrey Gruss\nJean Shafiroff, Martin Shafiroff, and Elizabeth Shafiroff'],0)))

[('Gillian ', 'NNP'), ('Miniter', 'NNP')]
[('GIllian', 'JJ'), ('Miniter', 'NNP')]
['Martin Gruss', 'Audrey Gruss', 'Jean Shafiroff', 'Martin Shafiroff', 'Elizabeth Shafiroff']


In [None]:
#bad function because it is looping over a lot elements unnecessarily

#def grow_graph(Existing_Graph,new_caption_names):

#     for s in list(itertools.combinations(new_caption_names, 2)):
#         if s in list(Existing_Graph.edges):
#             Existing_Graph.add_edges_from([s], weight=1)
#             Existing_Graph.edges[s]['weight'] += 1
#         else:
#             Existing_Graph.add_edges_from([s], weight=1)

In [None]:
dict(popular_names)['Alec Baldwin']

153

## Question 3: degree


The simplest question to ask is "who is the most popular"?  The easiest way to answer this question is to look at how many connections everyone has.  Return the top 100 people and their degree.  Remember that if an edge of the graph has weight 2, it counts for 2 in the degree.

**Checkpoint:** Some aggregate stats on the solution:

    "count": 100.0
    "mean": 189.92
    "std": 87.8053034454
    "min": 124.0
    "25%": 138.0
    "50%": 157.0
    "75%": 195.0
    "max": 666.0

In [None]:
import heapq  # Heaps are efficient structures for tracking the largest
              # elements in a collection.  Use introspection to find the
              # function you need.
#degree = [('Alec Baldwin', 144)] * 100

grader.score('graph__degree', degree)

Your score: 0.9400


## Question 4: PageRank


A similar way to determine popularity is to look at their
[PageRank](http://en.wikipedia.org/wiki/PageRank).  PageRank is used for web ranking and was originally
[patented](http://patft.uspto.gov/netacgi/nph-Parser?patentnumber=6285999) by Google and is essentially the stationary distribution of a [Markov
chain](http://en.wikipedia.org/wiki/Markov_chain) implied by the social graph. You can implement this yourself or use the version in `networkx`.

Use 0.85 as the damping parameter so that there is a 15% chance of jumping to another vertex at random.

**Checkpoint:** Some aggregate stats on the solution:

    "count": 100.0
    "mean": 0.0001841088
    "std": 0.0000758068
    "min": 0.0001238355
    "25%": 0.0001415028
    "50%": 0.0001616183
    "75%": 0.0001972663
    "max": 0.0006085816

In [None]:
pr=nx.pagerank(H,alpha=0.85)
pr_list = [(k, v) for k, v in pr.items()]
pagerank = sorted(pr_list,key=lambda x:-x[1])[0:100]
pagerank

[('Jean Shafiroff', 0.0006293757717143743),
 ('Mark Gilbertson', 0.00047084678105814993),
 ('Gillian Miniter', 0.0004442774934743545),
 ('Geoffrey Bradfield', 0.0003713176629438266),
 ('Alexandra Lebenthal', 0.00036112236344949296),
 ('Andrew Saffir', 0.00030950174846571796),
 ('Yaz Hernandez', 0.0002998918891202153),
 ('Somers Farkas', 0.00028728587937411587),
 ('Sharon Bush', 0.00028134906684297714),
 ('Michael Bloomberg', 0.0002781471905100825),
 ('Debbie Bancroft', 0.0002750261989354889),
 ('Mario Buatta', 0.0002697791250562959),
 ('Kamie Lightburn', 0.0002661108456568866),
 ('Barbara Tober', 0.0002591814990305864),
 ('Alina Cho', 0.0002571426186901626),
 ('Eleanora Kennedy', 0.00025311677834375084),
 ('Liliana Cavendish', 0.00024004204663980087),
 ('Jamee Gregory', 0.00023578792748171046),
 ('Lydia Fenet', 0.00023179450392214447),
 ('Lucia Hwong Gordon', 0.00022747575176132752),
 ('Muffie Potter Aston', 0.00022265522724860986),
 ('Bonnie Comley', 0.00021544896095283578),
 ('Christ

In [None]:
print('mean = ', statistics.mean([i[1] for i in pagerank]))
print('std = ', statistics.stdev([i[1] for i in pagerank]))
print('mode = ', statistics.mode([i[1] for i in pagerank]))
print('maximum = ', pagerank[0][1])
print('minimum = ', pagerank[99][1])
print('quantile = ', statistics.quantiles([i[1] for i in pagerank],n=4))


mean =  0.00019228423049388546
std =  7.780638897317819e-05
mode =  0.0006293757717143743
maximum =  0.0006293757717143743
minimum =  0.00013026576800013144
quantile =  [0.00014452562756251173, 0.0001658691370551604, 0.00020750989893830095]


In [None]:
#pagerank = [('Martha Stewart', 0.00019312108706213307)] * 100

grader.score('graph__pagerank', pagerank)

Your score: 0.9700


## Question 5: best_friends


Another interesting question is who tend to co-occur with each other.  Give us the 100 edges with the highest weights.

Google these people and see what their connection is.  Can we use this to detect instances of infidelity?

**Checkpoint:** Some aggregate stats on the solution:

    "count": 100.0
    "mean": 25.84
    "std": 16.0395470855
    "min": 14.0
    "25%": 16.0
    "50%": 19.0
    "75%": 29.25
    "max": 109.0

In [None]:
edge_data = H.edges.data()

In [None]:
edge_data_2 = [((item[0],item[1]),item[2]['weight']) for item in edge_data]
best_friends = sorted(edge_data_2,key=lambda x:-x[1])[0:100]
best_friends

[(('Gillian Miniter', 'Sylvester Miniter'), 119),
 (('Jamee Gregory', 'Peter Gregory'), 78),
 (('Bonnie Comley', 'Stewart Lane'), 77),
 (('Geoffrey Bradfield', 'Roric Tobin'), 69),
 (('Daniel Benedict', 'Andrew Saffir'), 65),
 (('Barbara Tober', 'Donald Tober'), 59),
 (('Campion Platt', 'Tatiana Platt'), 56),
 (('Jean Shafiroff', 'Martin Shafiroff'), 55),
 (('Somers Farkas', 'Jonathan Farkas'), 55),
 (('Alexandra Lebenthal', 'Jay Diamond'), 48),
 (('Yaz Hernandez', 'Valentin Hernandez'), 45),
 (('Eleanora Kennedy', 'Michael Kennedy'), 45),
 (('Peter Regna', 'Barbara Regna'), 44),
 (('Grace Meigher', 'Chris Meigher'), 44),
 (('Jonathan Tisch', 'Lizzie Tisch'), 42),
 (('Margo Catsimatidis', 'John Catsimatidis'), 42),
 (('Guy Robinson', 'Elizabeth Stribling'), 40),
 (('Sessa von Richthofen', 'Richard Johnson'), 39),
 (('Deborah Norville', 'Karl Wellner'), 39),
 (('David Koch', 'Julia Koch'), 37),
 (('Hilary Geary Ross', 'Wilbur Ross'), 35),
 (('Fernanda Kellogg', 'Kirk Henckels'), 32),
 (

In [None]:
[item for item in edge_data if ('Chuck Close' in item[0] or 'Chuck Close' in item[1])]

[('Ann Hohenhaus', 'Chuck Close', {'weight': 1}),
 ('John Guare', 'Chuck Close', {'weight': 1}),
 ('Randy Kemper', 'Chuck Close', {'weight': 1}),
 ('Tony Ingrao', 'Chuck Close', {'weight': 1}),
 ('Michael Kempner', 'Chuck Close', {'weight': 1}),
 ('Ellsworth Kelly', 'Chuck Close', {'weight': 1}),
 ('Agnes Gund', 'Chuck Close', {'weight': 5}),
 ('Barbara Goldsmith', 'Chuck Close', {'weight': 1}),
 ('Paul Beirne', 'Chuck Close', {'weight': 1}),
 ('Rachel Moore', 'Chuck Close', {'weight': 1}),
 ('Andrea Glimcher', 'Chuck Close', {'weight': 2}),
 ('Jeff Koons', 'Chuck Close', {'weight': 3}),
 ('Justine Koons', 'Chuck Close', {'weight': 1}),
 ('Marnie Pillsbury', 'Chuck Close', {'weight': 1}),
 ('Charles Rockefeller', 'Chuck Close', {'weight': 1}),
 ('Leonard Lauder', 'Chuck Close', {'weight': 2}),
 ('Ross Bleckner', 'Chuck Close', {'weight': 1}),
 ('Clifford Ross', 'Chuck Close', {'weight': 1}),
 ('Donna Karan', 'Chuck Close', {'weight': 1}),
 ('Arne Glimcher', 'Chuck Close', {'weight': 1}

In [None]:
#best_friends = [(('Michael Kennedy', 'Eleanora Kennedy'), 41)] * 100

grader.score('graph__best_friends', best_friends)

Your score: 0.9300


*Copyright &copy; 2021 Pragmatic Institute. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.*