### MY470 Computer Programming

### Problem Set 1 (SUMMATIVE), AT 2023

#### \*\*\* Due 12:00 noon on Monday, October 16 \*\*\*

---
### Working with data files

For this problem set, we will use data from the file `ca-GrQc.txt`, which you can locate in the `data` repository. The file contains the co-authorship links for articles in the ArXiv category General Relativity. Each line in the file includes the ids of two authors who have worked together on at least one article. In network analysis parlance, this is known as an "edge list." The data come from the [Stanford Large Network Dataset Collection](https://snap.stanford.edu/data/index.html) and you can find more information about them at https://snap.stanford.edu/data/ca-GrQc.html.

#### Hints

The problems below need to be done in sequence because objects (lists, dictionaries, etc.) you create in early problems may be needed for a later problem. However, if you don't manage to obtain these objects at the beginning, just hard-code fictitious ones, e.g. `[13, 14, 22, 24, 25, 26, 27, 28, 29, 45]` or `{13: [13, 7596, 11196, 19170], 14: [14171]}`. 

### Problem 1: Get all coauthorships in a list of lists

Create a list that contains all edges included in the data file as lists of the two authors' ids, where the ids are saved as integers. Your list should look like `[[3466, 937], [3466, 5233], ...]`. To achieve this, use a `for` loop to iterate over each line in the file. One way to do this is as follows:

```
for line in open('../data/ca-GrQc.txt', 'r'):
    pass  # do something with line
```

⚡️ Notice that this is a more memory-efficient way to read data than `file.read()`, which we used in the formative problem set: instead of loading the whole file into memory at once, you stream it line by line.
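
To make the contrast concrete, here is a minimal sketch that reads the same small file both ways. The file content and the names `sample_text`, `sample_path`, `whole_file`, and `streamed_lines` are hypothetical and only serve the illustration; the temporary file stands in for `ca-GrQc.txt`.

```python
import os
import tempfile

# Hypothetical sample data, written to a temporary file so the sketch is self-contained.
sample_text = "# comment\n1\t2\n2\t1\n"
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(sample_text)
    sample_path = tmp.name

# file.read() returns the entire file content as one string.
with open(sample_path, 'r') as f:
    whole_file = f.read()

# Iterating over the file object yields one line at a time,
# so only the current line needs to be held while it is processed.
streamed_lines = []
with open(sample_path, 'r') as f:
    for line in f:
        streamed_lines.append(line)

# Clean up the temporary file.
os.remove(sample_path)
```

The `with` statement is used here so that the file is closed as soon as the loop finishes. In this toy case the lines are collected in a list, which removes the memory advantage; in practice you would process each line and keep only what you need.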

Print the first 10 entries in your list. 

#### Hints

It is good practice to write and test your initial code using a smaller version of the dataset. This will help you debug faster. It will also allow you to manually check for possible errors. 

You need to ignore the first four lines of the file, which contain explanatory text.

In the file, the two author ids are separated with tabs and the tab character is encoded as `'\t'`.
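
The two hints above can be sketched on a tiny, made-up stand-in for the file. The list `sample_lines` and the name `sample_edges` are assumptions for illustration only; the real data would be read from `ca-GrQc.txt` line by line.

```python
# Hypothetical miniature file content: four header lines, then tab-separated id pairs.
sample_lines = [
    "# header line 1\n",
    "# header line 2\n",
    "# header line 3\n",
    "# header line 4\n",
    "1\t2\n",
    "2\t1\n",
    "2\t3\n",
    "3\t2\n",
]

sample_edges = []
for line_number, line in enumerate(sample_lines):
    # Skip the first four lines, which hold explanatory text.
    if line_number < 4:
        continue
    # Remove the trailing newline, then split directly on the tab character.
    first_id, second_id = line.strip().split('\t')
    sample_edges.append([int(first_id), int(second_id)])

print(sample_edges)
```

Splitting on `'\t'` directly avoids any intermediate replacement of the separator.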


In [1]:
# Enter your answer to Problem 1 here.

edges_list = [] # create empty list to store all edges in data
line_counter = 0 # initialise counter to skip first 4 lines
for line in open('../data/ca-GrQc.txt', 'r'):
    if line_counter < 4:  
        line_counter += 1  
    else: 
        line = line.strip().replace('\t',',').replace('\n','').split(',') # clean and split by comma 
        line = [int(id) for id in line] # convert ids to integer  
        edges_list.append(line) 

# print first 10 entries
print(edges_list[:10])

[[3466, 937], [3466, 5233], [3466, 8579], [3466, 10310], [3466, 15931], [3466, 17038], [3466, 18720], [3466, 19607], [10310, 1854], [10310, 3466]]


### Problem 2: Who are the authors in the data?

Create a sorted list with the integer ids for all of the unique authors in the dataset. Print the first 10 authors in the list. Then print how many authors there are in total.

Then, using a dictionary comprehension, create a dictionary in which the keys are the author integer ids and the values are empty lists. The dictionary should look something like: `{13: [], 14: [], 22: [], ...}`. To confirm, print the dictionary values for the authors in the list `[13, 14, 22, 24, 25, 26, 27, 28, 29, 45]`.

#### Hints

Note that if the edge *i–j* is in the data, then the edge *j–i* is also there. This means that for this task you don't need to consider the second author in the line. You can get all authors by collecting just the first author in each line in the file.
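
A small sketch of this idea on a made-up symmetric edge list follows. All names (`toy_edges`, `toy_first_authors`, and so on) are hypothetical. Because a `set` is an unordered type, the sketch calls `sorted()` explicitly rather than relying on the order in which the set happens to be printed.

```python
# Hypothetical symmetric edge list: every i-j pair also appears as j-i.
toy_edges = [[3, 1], [1, 3], [2, 3], [3, 2], [1, 1]]

# Collect only the first author of each pair.
toy_first_authors = {pair[0] for pair in toy_edges}

# For comparison, collect both authors of each pair.
toy_all_authors = {author for pair in toy_edges for author in pair}

# Sets carry no ordering guarantee, so sort explicitly.
toy_sorted_authors = sorted(toy_first_authors)

# Dictionary comprehension: one empty list per author id.
toy_coauthor_dict = {author: [] for author in toy_sorted_authors}

print(toy_sorted_authors)
print(toy_coauthor_dict)
```

On symmetric data the two sets are equal, which is why the first column alone is enough.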

In [2]:
# Enter your answer to Problem 2 here. 
first_authors = [] # create empty list to store flattened data

# iterate over nested list 
for pair in edges_list:
    first_authors.append(pair[0]) # slices first author in each i-j pair and adds to first_authors

# get unique authors
unique_ids = list(set(first_authors))

# print first 10 authors
print(unique_ids[:10])

# total authors
print('There are {} authors.'.format(len(unique_ids)))

# create dictionary 
d = {x: [] for x in unique_ids}

# confirm 
authors = [13, 14, 22, 24, 25, 26, 27, 28, 29, 45]
[list(d[author]) for author in authors]

[13, 14, 22, 24, 25, 26, 27, 28, 29, 45]
There are 5242 authors.


[[], [], [], [], [], [], [], [], [], []]

---
### Problem 3: Get each author's coauthors

Enter each author's unique coauthors in the empty dictionary you created in Problem 2. The dictionary should now look something like: `{13: [7596, 11196, 19170], 14: [14171], ...}`.

Print the list of coauthors for the authors in the list `[13, 14, 22, 24, 25, 26, 27, 28, 29, 45]`.

#### Hints

Notice that the data contain errors. For example, I noticed that the data say that author 13 coauthored with themselves, which is meaningless. To get the maximum number of points, your code should exclude each author from their own list of coauthors.
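
As an illustration of the filtering described above, here is a minimal sketch on invented data. The names `toy_edges` and `toy_coauthors` are hypothetical; the sketch assumes the dictionary already has one empty list per author.

```python
# Hypothetical edge list with a self-loop ([1, 1]) and a repeated pair ([1, 2]).
toy_edges = [[1, 1], [1, 2], [2, 1], [1, 2], [1, 3], [3, 1]]

# Assumed starting point: one empty list per author id.
toy_coauthors = {1: [], 2: [], 3: []}

for author, coauthor in toy_edges:
    # An author cannot be their own coauthor, so skip self-loops.
    if coauthor == author:
        continue
    # Record each coauthor only once.
    if coauthor not in toy_coauthors[author]:
        toy_coauthors[author].append(coauthor)

print(toy_coauthors)
```

The membership test on a list is linear in the list length; for this toy example that is irrelevant, and using sets for the values would be an alternative design if the required output did not have to be lists.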

In [3]:
# Enter your answer to Problem 3 here. 

for pair in edges_list:
    if pair[1] != pair[0] and pair[1] not in d[pair[0]]: # excludes oneself from list of coauthors and duplicated coauthors
        d[pair[0]].append(pair[1])
print(d)

{13: [7596, 11196, 19170], 14: [14171], 22: [106, 11183, 15793, 19440, 22618, 25043], 24: [3858, 15774, 19517, 23161], 25: [22891], 26: [1407, 4550, 11801, 13096, 13142], 27: [11114, 19081, 24726, 25540], 28: [7916], 29: [20243], 45: [570, 773, 1186, 1653, 2212, 2741, 2952, 3372, 4164, 4180, 4511, 4513, 6179, 6610, 6830, 7956, 8879, 9785, 11241, 11472, 12365, 12496, 12679, 12781, 12851, 14540, 14807, 15003, 15659, 17655, 17692, 18719, 18866, 18894, 19423, 19961, 20108, 20562, 20635, 21012, 21281, 21508, 21847, 22691, 22887, 23293, 24955, 25346, 25758], 46: [570, 773, 1653, 2212, 2741, 2952, 3372, 4164, 4511, 6179, 6610, 6830, 7956, 8879, 9785, 11241, 11472, 12365, 12496, 12781, 12851, 14540, 14807, 15003, 15659, 17655, 17692, 18894, 19423, 19961, 20108, 20562, 20635, 21012, 21281, 21508, 21847, 22887, 23293, 24955, 25346, 25758], 62: [2710, 6575, 7579, 13190, 16148, 23751, 25469], 65: [357, 358, 11609, 22100, 23300, 24960, 25777], 70: [4727, 15559], 71: [3316], 74: [2298, 16129], 75: [

In [4]:
# print list of coauthors 
coauthors = [d[x] for x in authors]
print(coauthors)

[[7596, 11196, 19170], [14171], [106, 11183, 15793, 19440, 22618, 25043], [3858, 15774, 19517, 23161], [22891], [1407, 4550, 11801, 13096, 13142], [11114, 19081, 24726, 25540], [7916], [20243], [570, 773, 1186, 1653, 2212, 2741, 2952, 3372, 4164, 4180, 4511, 4513, 6179, 6610, 6830, 7956, 8879, 9785, 11241, 11472, 12365, 12496, 12679, 12781, 12851, 14540, 14807, 15003, 15659, 17655, 17692, 18719, 18866, 18894, 19423, 19961, 20108, 20562, 20635, 21012, 21281, 21508, 21847, 22691, 22887, 23293, 24955, 25346, 25758]]


---
### Problem 4: Who has the most coauthors?

Find the author who has the most coauthors. Print the id of that author and the number of coauthors they have. 

Solve this problem using iteration and conditionals; you are not allowed to use external modules. 
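
One possible shape for such a loop, shown on a made-up dictionary, is sketched below. The names `toy_coauthors`, `top_author`, and `top_count` are hypothetical, and no modules are imported.

```python
# Hypothetical coauthor dictionary.
toy_coauthors = {1: [2, 3], 2: [1], 3: [1]}

# Track the best author seen so far and their number of coauthors.
top_author = None
top_count = -1

for author, coauthors in toy_coauthors.items():
    # Update only when the current author has strictly more coauthors.
    if len(coauthors) > top_count:
        top_author, top_count = author, len(coauthors)

print("Author with most coauthors:", top_author)
print("Number of coauthors:", top_count)
```

With the strict comparison `>`, ties are resolved in favour of the first author encountered in the dictionary's iteration order.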


In [5]:
# Enter your answer to Problem 4 here. 

k_max, v_max = None, [] # initialises two variables to store author (key) with longest list of coauthors, and list of coauthors (value) associated with them
for author,coauthors in d.items():
    if len(coauthors) > len(v_max): # checks if coauthor list is bigger than current max
        k_max, v_max = author, coauthors # assign new maximum to current author and coauthor list 
print("ID of author with most coauthors:", k_max)
print("Number of coauthors:", len(v_max))

ID of author with most coauthors: 21012
Number of coauthors: 81


---

### Evaluation

| Problem | Mark | Comment |
|:-------:|:----:|:--------|
| 1 | 3/3 | |
| 2 | 3/4 | Do not assume that the set will be sorted; it is an unordered type. There is no need to convert the empty list to a list, since you defined it to be a list. |
| 3 | 3/3 | |
| 4 | 4/4 | |
| Code legibility | 0/2 | Avoid in-line comments. Do not print large outputs (e.g. the whole dictionary in P3). Use more informative variable names. |
| Code efficiency | 3/4 | In P1 you can split on the tab, instead of replacing it with a comma and then splitting on the comma. |
| **Total** | **16/20** | Good work! |
