### MY470 Computer Programming

### Problem Set 1

#### \*\*\* Example Answers \*\*\*

---
### Working with data files

For this problem set, we will use data from the file `ca-GrQc.txt`, which you can find in the `data` repository. The file contains the co-authorship links for articles in the arXiv category General Relativity and Quantum Cosmology. Each line in the file contains the ids of two authors who have worked together on at least one article. In network analysis parlance, this is known as an "edge list." The data come from the [Stanford Large Network Dataset Collection](https://snap.stanford.edu/data/index.html), and you can find more information about them at https://snap.stanford.edu/data/ca-GrQc.html.

#### Hints

The problems below need to be done in sequence because objects (lists, dictionaries, etc.) you create in early problems may be needed for a later problem. However, if you don't manage to obtain these objects at the beginning, just hard-code fictitious ones, e.g. `[13, 14, 22, 24, 25, 26, 27, 28, 29, 45]` or `{13: [13, 7596, 11196, 19170], 14: [14171]}`. 

### Problem 1: Get all coauthorships in a list of lists

Create a list that contains all edges included in the data file as lists of the two authors' ids, where the ids are saved as integers. Your list should look like `[[3466, 937], [3466, 5233], ...]`. To achieve this, use a `for` loop to iterate over each line in the file. One way to do this is as follows:

```
for line in open('../data/ca-GrQc.txt', 'r'):
    pass  # do something with line
```

⚡️ Note that this is a more memory-efficient way to read data than `file.read()`, which we used in the formative problem set: instead of loading the whole file into memory at once, you stream it line by line.

Print the first 10 entries in your list. 
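
To illustrate the difference between `read()` and line-by-line iteration, here is a minimal sketch. It uses `io.StringIO` with a made-up three-line sample as a stand-in for the real file, so the contents and variable names are illustrative assumptions, not part of the dataset.

```python
import io

# Hypothetical miniature edge list standing in for the real file.
sample = "# a comment line\n1\t2\n2\t1\n"

# Approach 1: read() returns the entire contents as one string,
# so the whole file sits in memory before any line is processed.
whole_text = io.StringIO(sample).read()
lines_from_read = whole_text.splitlines()

# Approach 2: iterating over the file object yields one line at a time,
# so only the current line needs to be held in memory.
lines_streamed = []
for line in io.StringIO(sample):
    lines_streamed.append(line.rstrip("\n"))

# Both approaches see the same lines; they differ in memory use.
print(lines_from_read == lines_streamed)
```

On a file this small the difference is negligible, but the streaming pattern scales to files that do not fit in memory.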

#### Hints

It is a good practice to write and test your initial code using a smaller version of the dataset. This will help you debug faster. It will also allow you to manually check for possible errors. 

You need to ignore the first four lines of the file that contain explanatory text.

In the file, the two author ids are separated with tabs and the tab character is encoded as `'\t'`.
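
The hints above can be tried out on a tiny in-memory sample before running on the full file. The sketch below is one possible way to do this; the sample text, the helper name `parse_edges`, and the use of `io.StringIO` are assumptions made for illustration.

```python
import io

# Hypothetical sample: two comment lines, then tab-separated author ids.
sample = (
    "# explanatory text\n"
    "# FromNodeId\tToNodeId\n"
    "3466\t937\n"
    "3466\t5233\n"
    "937\t3466\n"
)


def parse_edges(lines):
    """Return [[int, int], ...] from an iterable of lines, skipping '#' lines."""
    edges = []
    for line in lines:
        if line.startswith("#"):
            continue  # Ignore explanatory comment lines
        fields = line.strip().split("\t")  # Drop the newline, split on the tab
        edges.append([int(x) for x in fields])
    return edges


edges = parse_edges(io.StringIO(sample))
print(edges)  # [[3466, 937], [3466, 5233], [937, 3466]]
```

Because the sample is short, the result can be checked by eye against the input.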


In [1]:
coauthors = []
with open('../data/ca-GrQc.txt', 'r') as f:    # The file is closed automatically after the block
    for line in f:
        if not line.startswith('#'):    # Ignore the comment lines at the beginning of the file
            strlst = line.strip().split('\t')
            coauthors.append([int(i) for i in strlst])

print('The first ten edges in the data are', coauthors[:10])


The first ten edges in the data are [[3466, 937], [3466, 5233], [3466, 8579], [3466, 10310], [3466, 15931], [3466, 17038], [3466, 18720], [3466, 19607], [10310, 1854], [10310, 3466]]


### Problem 2: Who are the authors in the data?

Create a sorted list with the integer ids for all of the unique authors in the dataset. Print the first 10 authors in the list. Then print how many authors there are in total.

Then, using a dictionary comprehension, create a dictionary in which the keys are the author integer ids and the values are empty lists. The dictionary should look something like: `{13: [], 14: [], 22: [], ...}`. To confirm, print the dictionary values for the authors in the list `[13, 14, 22, 24, 25, 26, 27, 28, 29, 45]`.
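
One reason a dictionary comprehension suits this task is that it creates a separate empty list for every key. The sketch below contrasts it with `dict.fromkeys`, which reuses a single list object for all keys; the three author ids are taken from the example above, and the variable names are illustrative.

```python
authors = [13, 14, 22]  # A few ids for illustration

# Dictionary comprehension: the expression [] is evaluated once per key,
# so each author gets a separate list.
by_comprehension = {a: [] for a in authors}
by_comprehension[13].append(7596)

# dict.fromkeys evaluates [] once and reuses that same object for every key,
# so appending through one key changes the value seen under all keys.
by_fromkeys = dict.fromkeys(authors, [])
by_fromkeys[13].append(7596)

print(by_comprehension)  # {13: [7596], 14: [], 22: []}
print(by_fromkeys)       # {13: [7596], 14: [7596], 22: [7596]}
```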

#### Hints

Note that if the edge *i–j* is in the data, then the edge *j–i* is also there. This means that for this task you don't need to consider the second author in the line. You can get all authors by collecting just the first author in each line in the file.
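
The symmetry argument can be checked on a small example. The following sketch uses a hypothetical symmetric edge list (including one self-loop) and confirms that the first column alone yields the same authors as both columns together.

```python
# Hypothetical edge list in which every edge i-j also appears as j-i.
edges = [[1, 2], [2, 1], [2, 3], [3, 2], [3, 3]]

# Authors collected from the first column only.
first_column = {i for i, j in edges}

# Authors collected from both columns.
both_columns = {a for edge in edges for a in edge}

# Check the symmetry assumption itself: each reversed edge is present.
is_symmetric = all([j, i] in edges for i, j in edges)

print(is_symmetric)                  # True
print(first_column == both_columns)  # True
print(sorted(first_column))          # [1, 2, 3]
```

If the symmetry check failed on real data, collecting ids from both columns would be the safer choice.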

In [2]:
# Remove all repeats and sort
# Note that sorted() accepts any iterable, including a set, and returns a list

authors = sorted({i[0] for i in coauthors})
print('The first ten authors in the list:', authors[:10])
print('The total number of authors in the data:', len(authors))

author_dic = {i: [] for i in authors}
for i in [13, 14, 22, 24, 25, 26, 27, 28, 29, 45]:
    print(author_dic[i])


The first ten authors in the list: [13, 14, 22, 24, 25, 26, 27, 28, 29, 45]
The total number of authors in the data: 5242
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]


---
### Problem 3: Get each author's coauthors

Add each author's unique coauthors to the dictionary you created in Problem 2. The dictionary should now look something like: `{13: [7596, 11196, 19170], 14: [14171], ...}`.

Print the list of coauthors for the authors in the list `[13, 14, 22, 24, 25, 26, 27, 28, 29, 45]`.

#### Hints

Note that the data contain errors. For example, author 13 is listed as a coauthor of themselves, which is meaningless. To get the maximum number of points, your code should exclude each author from their own list of coauthors.

In [3]:
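# Iterate over the edge list and add each coauthor to the first author's entry.
# Assumption: each directed edge appears only once in the file, so no
# duplicate check is needed to keep the lists unique.

for i, j in coauthors:
    if j != i:    # Skip self-loops
        author_dic[i].append(j)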
        
for i in [13, 14, 22, 24, 25, 26, 27, 28, 29, 45]:
    print(i, 'has', author_dic[i], 'as coauthors')
    

13 has [7596, 11196, 19170] as coauthors
14 has [14171] as coauthors
22 has [106, 11183, 15793, 19440, 22618, 25043] as coauthors
24 has [3858, 15774, 19517, 23161] as coauthors
25 has [22891] as coauthors
26 has [1407, 4550, 11801, 13096, 13142] as coauthors
27 has [11114, 19081, 24726, 25540] as coauthors
28 has [7916] as coauthors
29 has [20243] as coauthors
45 has [570, 773, 1186, 1653, 2212, 2741, 2952, 3372, 4164, 4180, 4511, 4513, 6179, 6610, 6830, 7956, 8879, 9785, 11241, 11472, 12365, 12496, 12679, 12781, 12851, 14540, 14807, 15003, 15659, 17655, 17692, 18719, 18866, 18894, 19423, 19961, 20108, 20562, 20635, 21012, 21281, 21508, 21847, 22691, 22887, 23293, 24955, 25346, 25758] as coauthors
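
If the edge list could contain repeated lines, appending to lists would record the same coauthor more than once. A possible variant, sketched here on a hypothetical edge list with a self-loop and a duplicate, collects coauthors in sets and converts them to sorted lists at the end. The names `coauthor_sets` and `coauthor_lists` are illustrative.

```python
# Hypothetical edge list with a self-loop (13-13) and a repeated edge (13-7596).
edges = [[13, 13], [13, 7596], [13, 7596], [7596, 13]]

# One empty set per author that appears in the first column.
coauthor_sets = {i: set() for i, _ in edges}

for i, j in edges:
    if i != j:                   # Exclude self-loops
        coauthor_sets[i].add(j)  # Adding to a set ignores repeats

# Convert to sorted lists to match the format used in the problem.
coauthor_lists = {i: sorted(s) for i, s in coauthor_sets.items()}
print(coauthor_lists)  # {13: [7596], 7596: [13]}
```

A set also makes the cell safe to re-run, since adding an existing element has no effect.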


---
### Problem 4: Who has the most coauthors?

Find the author who has the most coauthors. Print the id of that author and the number of coauthors they have. 

Solve this problem using iteration and conditionals; you are not allowed to use external modules. 


In [4]:
# We keep track of the author with the largest number of coauthors seen so far
# and, as we iterate over the dictionary, update it every time we find
# someone with more coauthors.
# This assumes that there is a single such author, which happens
# to be true in this case. To make the code more general,
# keep a list of the author ids that share the maximum, not just a single id.

current_max = [0, 0]  # [author_id, num_coauthors]
for auth, coauth in author_dic.items():
    if len(coauth) > current_max[1]:
        current_max = [auth, len(coauth)]
print('The most collaborative author', current_max[0], 'has', current_max[1], 'coauthors.')
        

The most collaborative author 21012 has 81 coauthors.
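
The comments in the solution mention that a more general version would keep a list of all authors who share the maximum. A minimal sketch of that idea, using only iteration and conditionals, is given below; the small `author_dic` is a hypothetical stand-in with a two-way tie, and `top_authors` and `max_count` are illustrative names.

```python
# Hypothetical dictionary in which authors 1 and 3 tie with two coauthors each.
author_dic = {1: [2, 3], 2: [1], 3: [1, 4], 4: [3]}

max_count = 0      # Largest number of coauthors seen so far
top_authors = []   # All author ids that have max_count coauthors

for auth, coauth in author_dic.items():
    n = len(coauth)
    if n > max_count:
        # A new maximum: start the list again with this author only
        max_count = n
        top_authors = [auth]
    elif n == max_count:
        # A tie with the current maximum: add this author to the list
        top_authors.append(auth)

print('Authors', top_authors, 'each have', max_count, 'coauthors.')
# Authors [1, 3] each have 2 coauthors.
```

When there is a single most collaborative author, `top_authors` contains just that one id, so this version gives the same answer as the solution above.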


---

### Evaluation

| Problem         | Mark    | Comment |
|:---------------:|:-------:|:--------|
| 1               | /3      |         |
| 2               | /4      |         |
| 3               | /3      |         |
| 4               | /4      |         |
| Code legibility | /2      |         |
| Code efficiency | /4      |         |
| **Total**       | **/20** |         |
