# IS620 Project 1
## Robert Sellers | October 3, 2016

### Guardian API Article Scraping and Gephi Network Analysis

**[LINK TO VIDEO WALKTHROUGH](https://www.youtube.com/watch?v=HXhjbF6_ZZI&)**

The following code connects to the [Guardian API](http://open-platform.theguardian.com/) in Python, converts the search results into network nodes and edges, and exports them to Gephi's GDF format (.gdf). The approach was heavily inspired by this [gist](https://gist.github.com/psychemedia/1283684).

The logic behind the queries is to export the "most relevant" news coverage for a few places and compare the keyword tags attached to each set of results.

Importing the libraries. 

In [1]:
import simplejson
import urllib
import csv
import sys
from itertools import combinations

In [2]:
#My guardian API key. This is very easy to acquire. Feel free to use this. 
APIKEY='d8d6f3e6-8a96-46b9-8681-d39b146def8c'

The following block of code takes a query (e.g. "Olympics"), builds the API request URL, extracts the JSON from the response, and writes the nodes and edges of the keyword network (tags that appear together on the same article) to a Gephi .gdf file.

In [3]:
def searchGuardian(searchterm):
    #NOTE: searchterm is placed in the URL unencoded, so it should be a single URL-safe word
    API = 'api-key='+APIKEY
    pagesize = '&page-size=25&'
    tags = '&show-tags=keyword&'
    orderby = '&order-by=relevance&'
    queryURL='http://content.guardianapis.com/search?q='+searchterm+tags+pagesize+orderby+API
    #prints the URL for reference
    print queryURL
    #load the data using simplejson
    data = simplejson.load(urllib.urlopen(queryURL))
    #size the results 
    print data['response']['total'], " records available for", searchterm
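    #create one .gdf file named after the searchterm
    #NOTE: searchterm is a string, so '_'.join puts an underscore between every character (e.g. 'B_a_j_a')
    filename='_'.join(searchterm)
    f2=open(filename+'.gdf','wb')
    writer2 = csv.writer(f2)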
    #get all of the results
    dr=data['response']['results']
    edges=[]
    edges2=[]
    nodes={}
    nodes2={}
    for result in dr:
        # Collect a list of tags associated with the current article
        tags=[]
        # Build up a list of unique node IDs, firstly using article IDs for the article-tag graph
        if result['id'] not in nodes:
            nodes[result['id']]=( result['id'],result["webTitle"].encode('utf-8') )
        # Now handle the article tags
        for tag in result['tags']:
            edges.append((result['id'],tag['id']))
            # Build up a list of tags associated with this article
            tags.append(tag['id'])
            # Add the tags to the unique list of node IDs
            if tag['id'] not in nodes:
                nodes[tag['id']]= ( tag['id'], tag['webTitle'].encode('utf-8') )
                nodes2[tag['id']]= ( tag['id'], tag['webTitle'].encode('utf-8') )
        # For the tag-tag graph, we need to list the various tag combinations for this article
        combos=map(list, combinations(tags, 2))
        for c in combos:
            edges2.append((c[0],c[1]))


    # Write out the tag-tag nodelist
    writer2.writerow(['nodedef>name VARCHAR','label VARCHAR'])
    for node in nodes2:
        n1,n2=nodes2[node]
        writer2.writerow([ n1, n2 ])

    # Write out the tag-tag edgelist
    writer2.writerow(['edgedef>from VARCHAR','to VARCHAR'])
    for e1,e2 in edges2:
        writer2.writerow([ e1, e2 ])
    # Close the file so the .gdf is fully written to disk
    f2.close()
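
The core of the function above, turning each article's tags into tag-tag edges and writing them in GDF layout, can be sketched in isolation. This is a minimal Python 3 sketch rather than the notebook's Python 2 code; the function name `tag_cooccurrence_gdf` and the sample records are hypothetical, and only the `tags`, `id`, and `webTitle` fields used above are assumed.

```python
import csv
import io
from itertools import combinations


def tag_cooccurrence_gdf(results, out):
    """Write a tag-tag GDF graph for a list of Guardian-style results."""
    nodes = {}  # tag id -> tag label, in first-seen order
    edges = []  # one (tag, tag) pair per article in which both tags occur
    for result in results:
        tag_ids = []
        for tag in result['tags']:
            nodes.setdefault(tag['id'], tag['webTitle'])
            tag_ids.append(tag['id'])
        # Every pair of tags on the same article becomes an edge
        edges.extend(combinations(tag_ids, 2))

    writer = csv.writer(out, lineterminator='\n')
    writer.writerow(['nodedef>name VARCHAR', 'label VARCHAR'])
    for tag_id, label in nodes.items():
        writer.writerow([tag_id, label])
    writer.writerow(['edgedef>from VARCHAR', 'to VARCHAR'])
    writer.writerows(edges)


# Hypothetical sample data shaped like data['response']['results']
sample = [
    {'id': 'a1', 'webTitle': 'Article one',
     'tags': [{'id': 'world/russia', 'webTitle': 'Russia'},
              {'id': 'world/volcanoes', 'webTitle': 'Volcanoes'},
              {'id': 'travel/travel', 'webTitle': 'Travel'}]},
    {'id': 'a2', 'webTitle': 'Article two',
     'tags': [{'id': 'world/russia', 'webTitle': 'Russia'},
              {'id': 'travel/travel', 'webTitle': 'Travel'}]},
]

buf = io.StringIO()
tag_cooccurrence_gdf(sample, buf)
gdf_text = buf.getvalue()
print(gdf_text)
```

As in the notebook code, a pair of tags that co-occurs on several articles is written once per article, so repeated edges appear in the file.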

I decided on three relatively obscure and distinct geographies: Kamchatka, Siberia; Baja, Mexico; and Rabat, Morocco.

In [4]:
searchGuardian('Kamchatka')

http://content.guardianapis.com/search?q=Kamchatka&show-tags=keyword&&page-size=25&&order-by=relevance&api-key=d8d6f3e6-8a96-46b9-8681-d39b146def8c
176  records available for Kamchatka
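
The URL printed above is assembled by string concatenation, which leaves doubled `&` separators and would break on a search term containing spaces. A minimal Python 3 sketch of building the same query with percent-encoding follows; the helper name `build_query_url` and the placeholder key are hypothetical, and the parameter names are the ones visible in the printed URL.

```python
from urllib.parse import urlencode

BASE_URL = 'http://content.guardianapis.com/search'


def build_query_url(searchterm, api_key, page_size=25):
    """Return a search URL with every parameter percent-encoded."""
    params = [
        ('q', searchterm),
        ('show-tags', 'keyword'),
        ('page-size', page_size),
        ('order-by', 'relevance'),
        ('api-key', api_key),
    ]
    return BASE_URL + '?' + urlencode(params)


# A multi-word term is encoded instead of being pasted in raw
url = build_query_url('Baja California', 'YOUR-API-KEY')
print(url)
```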


In [5]:
searchGuardian('Baja')

http://content.guardianapis.com/search?q=Baja&show-tags=keyword&&page-size=25&&order-by=relevance&api-key=d8d6f3e6-8a96-46b9-8681-d39b146def8c
280  records available for Baja


In [6]:
searchGuardian('Rabat')

http://content.guardianapis.com/search?q=Rabat&show-tags=keyword&&page-size=25&&order-by=relevance&api-key=d8d6f3e6-8a96-46b9-8681-d39b146def8c
378  records available for Rabat
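
Each query requests a single page of 25 results, so every graph is built from only the 25 most relevant articles out of the totals printed above. The arithmetic below is a small Python 3 sketch of how many pages a full export would need; fetching those pages would presumably rely on the API's `page` parameter, which is an assumption here and is not exercised in this notebook.

```python
import math

PAGE_SIZE = 25  # matches page-size=25 in the query URL

# Totals reported by the three queries above
totals = {'Kamchatka': 176, 'Baja': 280, 'Rabat': 378}


def pages_needed(total, page_size=PAGE_SIZE):
    """Number of pages required to retrieve `total` records."""
    return math.ceil(total / page_size)


for term, total in totals.items():
    share = min(PAGE_SIZE, total) / total
    print(term, pages_needed(total), 'pages;',
          '{:.0%}'.format(share), 'of records on the first page')
```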


---

## Gephi Analysis

The exported .gdf files are then loaded into Gephi. If you run this analysis yourself, you should see the following files:

K_a_m_c_h_a_t_k_a.gdf 

R_a_b_a_t.gdf

B_a_j_a.gdf
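
The underscores between every letter come from calling `'_'.join` on a string, which iterates over its characters. A short sketch of the behaviour, using a hypothetical helper `gdf_filename` that wraps the same expression as `searchGuardian`:

```python
def gdf_filename(searchterm):
    # Same expression as in searchGuardian, plus the .gdf extension
    return '_'.join(searchterm) + '.gdf'


print(gdf_filename('Kamchatka'))             # K_a_m_c_h_a_t_k_a.gdf
print(gdf_filename(['Kamchatka']))           # Kamchatka.gdf
print(gdf_filename(['Baja', 'California']))  # Baja_California.gdf
```

Passing the search term as a list of words would therefore give more readable filenames, at the cost of changing how the function is called.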

In Gephi:

Step 1: Apply the Yifan Hu layout.

Step 2: Style the nodes by centrality: node size and label size by eigenvector centrality, node color by degree.

Step 3: Add labels and run Label Adjust to reduce label overlap.
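
For readers without Gephi, the two measures used in Step 2 can be approximated directly from an edge list. The following is a minimal pure-Python 3 sketch under stated assumptions: the graph is treated as undirected and unweighted, eigenvector centrality is estimated by power iteration and scaled so the largest score is 1, and the function names and sample edges are hypothetical. Gephi's own implementation and normalization may differ.

```python
from collections import defaultdict


def build_neighbors(edges):
    """Undirected adjacency sets; repeated edges and self-loops are ignored."""
    neighbors = defaultdict(set)
    for a, b in edges:
        if a != b:
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def degree(neighbors):
    """Number of distinct neighbors per node."""
    return {node: len(adj) for node, adj in neighbors.items()}


def eigenvector_centrality(neighbors, max_iter=200, tol=1e-9):
    """Power iteration, scaled so the highest-scoring node equals 1."""
    scores = {node: 1.0 for node in neighbors}
    for _ in range(max_iter):
        # Each node keeps its own score and adds its neighbors' scores;
        # keeping the own score avoids oscillation on bipartite graphs
        new = {n: scores[n] + sum(scores[m] for m in adj)
               for n, adj in neighbors.items()}
        top = max(new.values())
        new = {n: v / top for n, v in new.items()}
        change = max(abs(new[n] - scores[n]) for n in new)
        scores = new
        if change < tol:
            break
    return scores


# Hypothetical tag-tag edges: a triangle (a, b, c) with d attached to c
edges = [('a', 'b'), ('a', 'c'), ('b', 'c'), ('c', 'd')]
neighbors = build_neighbors(edges)
deg = degree(neighbors)
eig = eigenvector_centrality(neighbors)
print(deg)
print(eig)
```

In this small example node `c` has both the highest degree and the highest eigenvector score, while `d`, connected only to `c`, ranks lowest on both.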

The .gephi project file is available [here](https://github.com/RobertSellers/620_WEB/blob/master/Project_1.gephi).

A copy of the centrality analysis is available [here](http://delineator.org/docs/Project1_gephi.html).

### Baja Network Graph

![Baja Graph](https://raw.githubusercontent.com/RobertSellers/620_WEB/master/img/baja_graph.png)

### Kamchatka Network Graph

![Kamchatka Graph](https://github.com/RobertSellers/620_WEB/raw/master/img/kamchatka_graph.png)

### Rabat Network Graph

![Rabat Graph](https://github.com/RobertSellers/620_WEB/raw/master/img/rabat_graph.png)