1. Data

In this homework, you will work on a dataset that contains information about a group of papers and their citation relationships.

Graphs setup
Based on the available data, you will create two graphs to model our relationships as follows:

Citation graph: This graph should represent the paper's citation relationships. We want this graph to be unweighted and directed. The citation should represent the citation given from one paper to another. For example, if paper A has cited paper B, we should expect an edge from node A to B.

Collaboration graph: This graph should represent the collaborations of the paper's authors. This graph should be weighted and undirected. Consider an appropriate weighting scheme for your edges to make your graph weighted.

Data pre-processing
The dataset is quite large and may not fit in your memory when you try constructing your graph. So, what is the solution? You should focus your investigation on a subgraph. You can work on the most connected component in the graph. However, you must first construct and analyze the connections to identify the connected components.

As a result, you will attempt to approximate that most connected component by performing the following steps:

Identify the top 10,000 papers with the highest number of citations.

Then the nodes of your graphs would be as follows:

Citation graph: you can consider each of the papers as your nodes

Collaboration graph: the authors of these papers would be your nodes

For the edges of the two graphs, you would have the following cases:

Citation graph: only consider the citation relationship between these 10,000 papers and ignore the rest.

Collaboration graph: only consider the collaborations between the authors of these 10,000 papers and ignore the rest.

The data we are working with is very large, so we needed to find a way to work with it without loading the entire file into memory at once. We made get_objects function that reads and processes large JSON files incrementally. 

The other function we worte was the one that takes id of an article as an input and returns its number of citation.

Using these two functions, we found 10000 articles with most citations.

In [1]:
import ijson
import heapq

def get_objects(filename):
    with open(filename, 'r') as f:
        objects = ijson.items(f, 'item')
        for obj in objects:
            yield obj

#a function that calculates how many citation an input article has
def number(id):
    num=0
    try:
        num=len(id['references'])
    except:
        num=0
    return num

filename = "/Users/petraudovicic/Desktop/adm/adm dz 5/data.json"
objects = get_objects(filename)

#top_items is a list of 10000 articles with most citations
top_items = heapq.nlargest(10000, objects, key=number)




Extracting information relevant for making a graph: first 10000 articles with most citations and their refrences among these 10000 articles

Citation graph is a list of dictionaries. Each dictionary has two keys: id and references. Id tells us which article it is and references gives us a list of ids that it cited.

In [2]:
#citation graph
rel_art = []
#the set of first 10000 articles
ids = set(i['id'] for i in top_items)

for i in top_items:
    obj = {}
    obj['id'] = i['id']
    #we are not interested if these articles have cited an article that is not in top_articles
    obj['references'] = [j for j in i['references'] if j in ids]  
    rel_art.append(obj)
print(rel_art)

[{'id': 2076024657, 'references': [2154332114, 2615873723]}, {'id': 2052326664, 'references': [2066308114]}, {'id': 2072748471, 'references': [1966730923, 2076024657]}, {'id': 47957325, 'references': [1518369453, 1985039455, 1987902506, 2053384008, 2063727779, 2096307462, 2097366413, 2142785340, 2159080219, 2166279570, 2337098149]}, {'id': 2615873723, 'references': [2041355974, 2052326664]}, {'id': 2071204548, 'references': []}, {'id': 2620342231, 'references': []}, {'id': 2154930971, 'references': [1994115836, 2034674049, 2038638586, 2046699259, 2076265641, 2096307462, 2128456629, 2142785340, 2340735175]}, {'id': 1978831484, 'references': [1967005434, 2071204548]}, {'id': 2614167197, 'references': [1981233261, 2050576295, 2087208017]}, {'id': 2895896816, 'references': [47957325, 1528676759, 1636210188, 1845972764, 1975244201, 1977655452, 1986014385, 1990334093, 2039048406, 2073384958, 2076063813, 2097381042, 2099618002, 2108646579, 2122646361, 2126316555, 2128569883, 2147492008, 21617

Making external .txt file needed for the command line question:

In [3]:
with open('citation_graph.txt', 'w') as f:
    f.write(f'{rel_art}')


To make a collaboration graph, we had to make one dictionary and a function.

Dictionary has author ids as keys and lists of their articles as values.

The function uses information from that dictionary to find all the authors who made an article along with an author from the input. It also gives an information about the number of articles that these authors wrote with the one from an input

In [4]:
from collections import defaultdict

# Creating a dictionary where the keys are author ids and the values are lists of their articles
articles_by_author = defaultdict(list)
#We iterate over top 10000 articles
for item in top_items:
    #we iterate over authors of these top articles
    for author in item['authors']:
        # Using the 'id' component of the author dictionary as the key
        author_id = author['id']
        articles_by_author[author_id].append(item['id'])

# Converting lists of articles to sets so we can find their intersection
for author_id in articles_by_author:
    articles_by_author[author_id] = set(articles_by_author[author_id])

#function that counts how many elements 2 sets have in common
def common(set1, set2):
    return len(set1.intersection(set2))

#function that takes an author from an input and returns dictionary with authors as keys and the number of articles they made with the author from an input as values
def collaborators(aut): 
    col = {}
    # Use the 'id' component of the input author dictionary as the key
    aut_id = aut
    for author_id, articles in articles_by_author.items():
        c = common(articles, articles_by_author[aut_id])
        if (c != 0 and articles!=articles_by_author[aut_id]):
            col[author_id] = c
    return col


Collaboration graph:

In [5]:
collaboration_graph=dict()
#we iterate over author ids
for i in articles_by_author.keys():
    #author ids are keys and the output of collaborators function for these ids are values
    collaboration_graph[i]=collaborators(i)

To check the accuracy of the graph, we decided to see if its outcome looks as expected and check if the graph is undirected.

First, we checked how values of the graph look like:

In [6]:
collaboration_graph.values()

dict_values([{692150028: 1, 2106222798: 1, 2984987840: 1, 2953857231: 1}, {2118765137: 1, 2156731892: 1}, {2252207586: 1, 2156731892: 1}, {2252207586: 1, 2118765137: 1, 2242438494: 1, 2079525123: 1, 1894395822: 1}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {2623494704: 1, 2765995116: 1, 2667035442: 1, 2141747120: 1, 310458393: 1, 2095738856: 1, 2231378115: 2, 2663712704: 2, 2120544271: 2, 2131767548: 2}, {2096665950: 1}, {2096665950: 1}, {}, {}, {}, {}, {}, {2201197304: 2, 2022224841: 3, 1987606741: 3, 2137727465: 3, 2199647293: 2, 2312159239: 1, 2072768018: 1, 2317819813: 1, 2780315194: 1}, {2153964020: 2, 2234633539: 1, 2059115463: 1, 2665898859: 1, 2582044994: 1, 2152817234: 1, 2213686429: 1, 2104818889: 1, 1968236645: 1}, {}, {2420827538: 1, 2099352078: 1, 2149056991: 1}, {1941780408: 1, 2883048977: 1, 2095961539: 1, 355346436: 1, 2159397485: 1, 2145857515: 1, 2109699881: 1, 2143841336: 1, 2099352078: 1, 2149056991: 1, 2810892695: 2, 1992867684: 2, 2991097952: 2, 

By looking at the output printed, we randomly picked the id that had number 2 as one value and checked what id has these 2 articles in common with it:

In [7]:
collaboration_graph[2131767548]

{2096665950: 2}

Here we saw that collaboration_graph[2131767548] gives us information that author with id 2131767548 has two articles with author with an id 2096665950. We wondered if collaboration_graph[2096665950] will provide the same information, aware that if not, our collaboration graph is wrong.

In [8]:
collaboration_graph[2096665950]

{2623494704: 1,
 2765995116: 1,
 2667035442: 1,
 2141747120: 1,
 310458393: 1,
 2095738856: 1,
 2231378115: 2,
 2663712704: 2,
 2120544271: 2,
 2131767548: 2}

It provided the same information, which made us confident that accurate information is saved in our collaboration_graph.

Functionality 3 - Shortest ordered walk

Input:

The graph data

A sequence of authors_a = [a_2, ..., a_{n-1}]

Initial node a_1 and an end node a_n

N: denoting the top authors whose data should be considered
 
Output:

The shortest walk of collaborations you need to read to get from author a_1 to author a_n and the papers you need to cross to realize this walk.
Considerations: For this functionality, you must implement an algorithm that returns the shortest walk that goes from node a_j to a_n, which visits in order the nodes in a. The choice of a_j and a_n can be made randomly (or if it improves the performance of the algorithm, you can also define it in any other way)

Important Notes:

This algorithm should be run only on the collaboration graph.

The algorithm needs to handle the case that the graph is not connected. Thus, only some nodes in a are reachable from a_1. In such a scenario, it is enough to let the program give in the output the string "There is no such path."

Since we are dealing with walks, you can pass on the same node a_i more than once, but you must preserve order. It means you can go back to any author node any time you want, assuming that the order in which you visit the required nodes is still the same.

Once you completed your implementation, ask chatGPT for a different one leveraging another approach in solving the shortest path and prove whether this implementation is correct.


We made a function 'walk()' that takes two nodes as input and returns the list of papers between them.

In [9]:
def walk (author1, author2):
    #list of authors that we can get to from author1
    authors=list()
    authors.append(author1)
    while(author2 not in authors):
        a=set(authors)
        for i in authors:
            #we include all the paths we can choose 
            authors.append(collaboration_graph[i].keys())
        if(set(authors)==a): #case in which there are no more authors collaborating with ones from 'authors', so author2 is not connected to author1
            return("There is no such path")
            break
    # temp is our position in the graph
    # we are walking in opposite direction
    temp=author2
    #the list of papers we pass in our walk (in the opposite order)
    papers=list()
    authors.remove(author2)
    while(temp!=author1):
        for i in authors:
            #we search for the author that collaborates with author on our position (with temp)
            if(i in collaboration_graph[temp].keys()):
                #we choose the first paper that appears in the intersection as the one we will consider
                #articles_by_author is a a dictionary made before, where the keys are author ids and the values are lists of their articles
                papers.append(set(articles_by_author[temp]).intersection(set(articles_by_author[i]))[0])
                temp=i
                # removing authors so we don't go in circles with our algorithm
                authors.remove(i)
    #reversing order of the papers
    return(papers.reverse(), len(papers))
