
# Problem 1

### Data Gathering

First, let's download the dataset.

In [None]:
import requests
import zipfile
import io

# Make an HTTPS call to retrieve the remote zip file
response = requests.get("https://lfs.aminer.cn/lab-datasets/citation/acm.v9.zip")
# Fail early if the download did not succeed
response.raise_for_status()

# Creating the zip file object for extraction
aminer_zip_file = zipfile.ZipFile(io.BytesIO(response.content))

# Extracting all the files and folders into the current directory
aminer_zip_file.extractall()

### Observations

Now, let's take a look at the dataset.

The file took around a minute to open on my system, which suggests that the dataset is large.
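
Since opening the whole file is slow, a small sketch like the following can preview only the first few lines instead. The helper name `peek_lines` and the default of 20 lines are illustrative assumptions; `acm.txt` is the file name used by the parser later in this notebook.

```python
from itertools import islice

def peek_lines(path, n=20):
    """Return the first n lines of a text file without reading the rest."""
    with open(path) as handle:
        # islice stops after n lines, so the remainder of the file is never read
        return [line.rstrip("\n") for line in islice(handle, n)]

# Hypothetical usage once the archive has been extracted:
# for line in peek_lines("acm.txt", 20):
#     print(line)
```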

The initial observations are listed below:
1. The raw data is present in a text file.
2. The data for each publication forms one paragraph. Publications are separated by two newlines, i.e. one blank line.
3. Each publication has different attributes on separate lines.
4. For each publication, there is a pattern to identify each attribute. Below are the inferences:<br>
a. Line starts with #* -> Title<br>
b. Line starts with #@ -> Authors separated by a comma<br>
c. Line starts with #t -> Publication year<br>
d. Line starts with #c -> Venue / Conference name<br>
e. Line starts with #! -> Abstract of the paper<br>
f. Line starts with #index -> ?<br>
g. Line starts with #% -> ?

Attributes f and g were unknown at first. Their meaning is documented on the AMiner citation page ([Reference](https://www.aminer.org/citation)):

f. Line starts with #index -> index id of this paper<br>
g. Line starts with #% -> id of a paper referenced by this paper (there can be multiple such lines, each indicating one reference)

Let's transform this data into a tabular format for analysis.

### Data Parsing

Given that a publication can be associated with multiple authors and references, there are at least three approaches we can take to represent this data.
1. Using a list to represent authors and references.
2. Creating a separate row for each author and reference combination.
3. Using separate tables for authors, references and their association with the publication.

The first approach maintains the relationship in a more compact form. The second results in data redundancy but allows analysis at a more granular level. The third approach presents a normalized view of the data and is also suitable for analysis, so I chose the third approach for data representation.
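
To make the trade-off concrete, the sketch below lays out one made-up publication in each of the three shapes. The record, its values and the variable names are purely illustrative and do not come from the dataset.

```python
# A made-up publication with two authors and two references (illustrative only)
record = {"id": "42", "title": "A toy paper",
          "authors": ["A. Author", "B. Author"], "references": ["7", "9"]}

# Approach 1: one row per publication, multi-valued attributes kept as lists
as_lists = [record]

# Approach 2: one row per (author, reference) combination, so scalar fields repeat
as_combinations = [
    {"id": record["id"], "title": record["title"], "author": a, "reference": r}
    for a in record["authors"] for r in record["references"]
]

# Approach 3: three tables linked by the publication id
publications_table = [{"id": record["id"], "title": record["title"]}]
authors_table = [{"publication_id": record["id"], "name": a} for a in record["authors"]]
references_table = [{"publication_id": record["id"], "reference": r} for r in record["references"]]

print(len(as_lists), len(as_combinations))
print(len(publications_table), len(authors_table), len(references_table))
```

In this toy case approach 2 stores the title four times for a single paper, while approach 3 stores it once.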

Now, let's write a parser that converts the raw data into an ML-friendly format.

In [None]:
# Map the symbol that follows "#" to an attribute name
ATTRIBUTE_NAMES = {
  "*": "title",
  "@": "authors",
  "t": "year_of_publication",
  "c": "venue",
  "!": "abstract",
  "i": "id",  # first character of "#index"
  "%": "references",
}

def get_attribute_name(symbol):
  # Returns None for an unknown symbol
  return ATTRIBUTE_NAMES.get(symbol)

# Creating the required tables
publications = list()
references_by_publication = list()
authors_by_publication = list()

def store_publication(publication):
  # Skip empty records, e.g. when several blank lines follow each other
  if not publication:
    return
  publication_id = publication.get("id")
  # Move the multi-valued attributes into their own tables
  for author in publication.pop("authors", []):
    authors_by_publication.append({"publication_id": publication_id, "name": author})
  for reference in publication.pop("references", []):
    references_by_publication.append({"publication_id": publication_id, "reference": reference})
  publications.append(publication)

# Read the raw data file
with open("acm.txt") as aminer_raw_data:

  # Intermediary dictionary holding the data of the publication being read
  publication = dict()

  # Read each line
  for line in aminer_raw_data:

    # Identify the lines that contain the data
    if line.startswith("#"):

      # Get the symbol identifying the attribute
      attribute_denoting_symbol = line[1:2]

      # Get the corresponding attribute name and value
      attribute_name = get_attribute_name(attribute_denoting_symbol)
      attribute_value = line[2:].strip()

      # Populating the publication data in the intermediary data structure
      if attribute_name is None:
        # Ignore lines with an unknown symbol
        continue
      elif attribute_name == "authors":
        publication["authors"] = [name.strip() for name in attribute_value.split(",")]
      elif attribute_name == "references":
        publication.setdefault("references", list()).append(attribute_value)
      elif attribute_name == "id":
        # "#index" is longer than the other prefixes, so the value starts at position 6
        publication["id"] = line[6:].strip()
      else:
        publication[attribute_name] = attribute_value

    else:
      # A line without the "#" prefix (a blank line) ends the current publication
      store_publication(publication)
      # Reset the data structure representing the row
      publication = dict()

  # Store the last publication in case the file does not end with a blank line
  store_publication(publication)

import pandas as pd

# Transform the list of dictionaries into csv
pd.DataFrame(publications).to_csv("publications.csv")
pd.DataFrame(authors_by_publication).to_csv("authors_by_publication.csv")
pd.DataFrame(references_by_publication).to_csv("references_by_publication.csv")

In [None]:
publications = pd.read_csv("publications.csv")

In [None]:
publications.head()

Unnamed: 0.1,Unnamed: 0,title,year_of_publication,venue,id,abstract
0,0,MOSFET table look-up models for circuit simula...,1984.0,"Integration, the VLSI Journal",1,
1,1,The verification of the protection mechanisms ...,1984.0,International Journal of Parallel Programming,2,
2,2,Another view of functional and multivalued dep...,1984.0,International Journal of Parallel Programming,3,
3,3,Entity-relationship diagrams which are in BCNF,1984.0,International Journal of Parallel Programming,4,
4,4,The computer comes of age,1984.0,The computer comes of age,5,


In [None]:
authors_by_publication = pd.read_csv("authors_by_publication.csv")

In [None]:
authors_by_publication.head()

Unnamed: 0.1,Unnamed: 0,publication_id,name
0,0,2,Virgil D. Gligor
1,1,3,M. Gyssens
2,2,3,J. Paredaens
3,3,4,Sushil Jajodia
4,4,4,Peter A. Ng


In [None]:
references_by_publication = pd.read_csv("references_by_publication.csv")

In [None]:
references_by_publication.head()

Unnamed: 0.1,Unnamed: 0,publication_id,reference
0,0,9,289258
1,1,9,2135000
2,2,10,2135000
3,3,11,289023
4,4,11,408637


A. Compute the number of distinct authors, publication venues, publications, and citations/references

In [None]:
print(authors_by_publication["name"].nunique())  # distinct authors
print(publications["venue"].nunique())  # distinct publication venues
print(publications["id"].nunique())  # publications
print(references_by_publication["reference"].nunique())  # distinct referenced publications

1651588
273328
2385057
1007495


B. Are these numbers likely to be accurate? As an example look up all the publication venue names associated with the conference “Principles and Practice of Knowledge Discovery in Databases” – what do you notice?

In [None]:
publications[publications["venue"].str.contains("Principles and Practice of Knowledge Discovery in Databases", na=False)]

Unnamed: 0.1,Unnamed: 0,title,year_of_publication,venue,id,abstract
799595,799595,Summarization of dynamic content in web collec...,2004.0,PKDD '04 Proceedings of the 8th European Confe...,799596,This paper describes a new research proposal o...
799732,799732,Proceedings of the 8th European Conference on ...,2004.0,PKDD '04 Proceedings of the 8th European Confe...,799733,
799733,799733,Random matrices in data analysis,2004.0,PKDD '04 Proceedings of the 8th European Confe...,799734,We show how carefully crafted random matrices ...
799734,799734,Data privacy,2004.0,PKDD '04 Proceedings of the 8th European Confe...,799735,There is increasing need to build information ...
799735,799735,Breaking through the syntax barrier: searching...,2004.0,PKDD '04 Proceedings of the 8th European Confe...,799736,The next wave in search technology will be dri...
...,...,...,...,...,...,...
1673617,1673617,Speeding up logistic model tree induction,2005.0,PKDD'05 Proceedings of the 9th European confer...,1673618,Logistic Model Trees have been shown to be ver...
1673618,1673618,A random method for quantifying changing distr...,2005.0,PKDD'05 Proceedings of the 9th European confer...,1673619,In applications such as fraud and intrusion de...
1673619,1673619,Deriving class association rules based on leve...,2005.0,PKDD'05 Proceedings of the 9th European confer...,1673620,Most approaches of Class Association Rule (CAR...
1673620,1673620,An incremental algorithm for mining generators...,2005.0,PKDD'05 Proceedings of the 9th European confer...,1673621,This paper presents an efficient algorithm for...
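
The truncated output above already shows two spellings of the same conference series: `PKDD '04 Proceedings of the 8th European Confe...` and `PKDD'05 Proceedings of the 9th European confer...`. Because the year and edition are part of the venue string, each edition counts as a separate venue, so the distinct-venue count from part A is likely inflated. One possible way to group such names is sketched below; the helper `venue_series` and its regular expression are assumptions made for illustration and have not been validated against the full dataset.

```python
import re

def venue_series(venue):
    """Return the leading acronym of names like "PKDD '04 ..." or "PKDD'05 ...".

    Names that do not start with an acronym followed by a two-digit year
    are returned unchanged.
    """
    match = re.match(r"^([A-Za-z]+)\s*'\d{2}\b", venue)
    return match.group(1).upper() if match else venue

# Shortened, illustrative venue strings modeled on the prefixes shown above
examples = ["PKDD '04 Proceedings", "PKDD'05 Proceedings", "Integration, the VLSI Journal"]
print({venue_series(v) for v in examples})
```

Both PKDD editions collapse to a single key here, while the journal name is left untouched.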


C. For each author, construct the list of publications. Plot a histogram of the number of publications per author (use a logarithmic scale on the y axis)

In [None]:
publication_counts_by_author = authors_by_publication.groupby("name").count()

In [None]:
import seaborn as sns

In [None]:
# Plot the distribution of publication counts per author (not the author names), with a log-scaled y axis
ax = sns.histplot(publication_counts_by_author["publication_id"], bins=100)
ax.set_yscale("log")

D. Calculate the mean and standard deviation of the number of publications per author. Also calculate the Q1 (1st quartile), Q2 (2nd quartile, or median) and Q3 (3rd quartile) values. Compare the median to the mean and explain the difference between the two values based on the standard deviation and the 1st and 3rd quartiles.

In [None]:
publication_counts_by_author["publication_id"].describe()

count    1.651588e+06
mean     3.462492e+00
std      1.277139e+01
min      1.000000e+00
25%      1.000000e+00
50%      1.000000e+00
75%      3.000000e+00
max      8.878000e+03
Name: publication_id, dtype: float64
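
In the output above the mean (about 3.46) is well above the median (1), and the standard deviation (about 12.77) is several times the mean. With Q1 = 1 and Q3 = 3, at least half of the author names have a single publication and three quarters have three or fewer, while the maximum is 8878. A small number of very prolific author names therefore pulls the mean upward: the distribution is strongly right-skewed. The toy example below reproduces this effect with the standard library; the counts are made up and are not taken from the dataset.

```python
import statistics

# Made-up publication counts per author: mostly small values plus one prolific outlier
counts = [1, 1, 1, 1, 2, 3, 3, 50]

mean = statistics.mean(counts)
std = statistics.stdev(counts)  # sample standard deviation, as reported by pandas' describe()
# method="inclusive" interpolates linearly between the sorted data points
q1, q2, q3 = statistics.quantiles(counts, n=4, method="inclusive")

print(f"mean={mean:.2f} std={std:.2f} Q1={q1} median={q2} Q3={q3}")
```

Even though three quarters of the toy values are at most 3, the single outlier lifts the mean above Q3, which mirrors the pattern in the real output.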

E. Now plot a histogram of the number of publications per venue, as well as calculate the mean, standard deviation, median, Q1, and Q3 values. What is the venue with the largest number of publications in the dataset?
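
One possible starting point for this item is sketched below on a few made-up rows shaped like the `publications` table; it has not been run on the real dataset. With the DataFrame loaded above, `publications["venue"].value_counts()` would presumably give the same per-venue counts.

```python
from collections import Counter
import statistics

# Made-up rows shaped like the publications table (illustrative only)
toy_publications = [
    {"id": 1, "venue": "Venue A"}, {"id": 2, "venue": "Venue A"},
    {"id": 3, "venue": "Venue A"}, {"id": 4, "venue": "Venue B"},
    {"id": 5, "venue": None},  # missing venues are skipped
]

# Count publications per venue
publications_per_venue = Counter(
    row["venue"] for row in toy_publications if row["venue"]
)
largest_venue, largest_count = publications_per_venue.most_common(1)[0]
sizes = sorted(publications_per_venue.values())

print(largest_venue, largest_count)
print(statistics.mean(sizes), statistics.median(sizes))
```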

F. Plot a histogram of the number of references (number of publications a publication refers to) and citations (number of publications referring to a publication) per publication. What is the publication with the largest number of references? What is the publication with the largest number of citations? Do these make sense?
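
Both quantities can be derived from the `references_by_publication` table: counting rows by `publication_id` gives the number of references a paper makes, and counting rows by `reference` gives the number of citations a paper receives. A minimal sketch on made-up pairs, not run on the real dataset:

```python
from collections import Counter

# Made-up (publication_id, reference) pairs shaped like references_by_publication
toy_references = [
    {"publication_id": 10, "reference": 1},
    {"publication_id": 10, "reference": 2},
    {"publication_id": 11, "reference": 1},
    {"publication_id": 12, "reference": 1},
]

# Outgoing edges: how many papers each publication refers to
references_per_publication = Counter(row["publication_id"] for row in toy_references)
# Incoming edges: how many papers refer to each publication
citations_per_publication = Counter(row["reference"] for row in toy_references)

print(references_per_publication.most_common(1))
print(citations_per_publication.most_common(1))
```

Publications that never appear in a column are absent from the corresponding counter and would need to be added with a count of zero before plotting a histogram.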

G. Calculate the so called “impact” factor for each venue. To do so, calculate the total number of citations for the publications in the venue, and then divide this number by the number of publications for the venue. Plot a histogram of the results
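
In other words, impact factor of a venue = (total citations received by the venue's publications) / (number of publications in the venue). The sketch below applies this definition to made-up rows; the function name `impact_factors` and the toy data are illustrative assumptions, and the code has not been run on the real dataset.

```python
from collections import Counter, defaultdict

def impact_factors(publications, references):
    """Return {venue: total citations / number of publications}."""
    # Citations received by each publication id
    citations = Counter(row["reference"] for row in references)
    totals = defaultdict(lambda: [0, 0])  # venue -> [citation sum, publication count]
    for pub in publications:
        if pub["venue"] is None:
            continue
        totals[pub["venue"]][0] += citations[pub["id"]]  # 0 if never cited
        totals[pub["venue"]][1] += 1
    return {venue: cited / count for venue, (cited, count) in totals.items()}

toy_publications = [
    {"id": 1, "venue": "Venue A"}, {"id": 2, "venue": "Venue A"},
    {"id": 3, "venue": "Venue B"},
]
toy_references = [
    {"publication_id": 2, "reference": 1},
    {"publication_id": 3, "reference": 1},
    {"publication_id": 3, "reference": 2},
]
print(impact_factors(toy_publications, toy_references))
```

The ids in both tables must have the same type for the lookup to match.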

H. What is the venue with the highest apparent impact factor? Do you believe this number? (http://mdanderson.libanswers.com/faq/26159)

I. Now repeat the calculation from item C, but restrict the calculation to venues with at least 10 publications. How does your histogram change? List the citation counts for all publications from the venue with the highest impact factor. How does the impact factor (mean number of citations) compare to the median number of citations?

J. Finally, construct a list of publications for each publication year. Use this list to plot the average number of references and average number of citations per publication as a function of time. Explain the differences you see in the trends.
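
A possible way to build the per-year lists and averages is sketched below on made-up rows that reuse the column names of the tables above. The data is illustrative and the code has not been run on the real dataset.

```python
from collections import Counter, defaultdict

toy_publications = [
    {"id": 1, "year_of_publication": 1984},
    {"id": 2, "year_of_publication": 1984},
    {"id": 3, "year_of_publication": 1990},
]
toy_references = [
    {"publication_id": 2, "reference": 1},
    {"publication_id": 3, "reference": 1},
    {"publication_id": 3, "reference": 2},
]

references_made = Counter(row["publication_id"] for row in toy_references)
citations_received = Counter(row["reference"] for row in toy_references)

# List of publication ids for each publication year
by_year = defaultdict(list)
for pub in toy_publications:
    by_year[pub["year_of_publication"]].append(pub["id"])

# year -> (average references made, average citations received)
averages = {
    year: (
        sum(references_made[i] for i in ids) / len(ids),
        sum(citations_received[i] for i in ids) / len(ids),
    )
    for year, ids in sorted(by_year.items())
}
print(averages)
```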