Take the raw output from the MongoDB queries and turn it into valid JSON that can be parsed. This is done by:
- Removing "ObjectId(" from the start of the _id value.
- Removing the ")" from the end of the _id value.
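
As a minimal sketch of the two removals above on a single record: the sample line and its 24-character id are made up for illustration, and the helper name `strip_object_id` is hypothetical. It assumes the raw output wraps the id as `ObjectId("...")`, as the MongoDB shell prints it.

```python
import json
import re

# Hypothetical raw record; the id value is invented for this example.
raw_line = '{ "_id" : ObjectId("5f1a2b3c4d5e6f7a8b9c0d1e"), "ProjectID" : "AccountName_ProjectName" }'

def strip_object_id(line: str) -> str:
    """Replace ObjectId("...") with the bare quoted string "..."."""
    return re.sub(r'ObjectId\(("[^"]*")\)', r'\1', line)

cleaned = strip_object_id(raw_line)
record = json.loads(cleaned)  # parses only because the wrapper is gone
print(record["_id"])
```

Matching the whole `ObjectId("...")` wrapper in one pattern removes only the parenthesis that closes it, which matters if another field on the same line happens to contain a ")".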

We also want to correct the ProjectIDs stored in the WoC database so they can be used for GitHub API lookups.
The WoC database puts the author account name first, then the project name, separated by an '_'. The GitHub API expects the account and project name to be separated by a '/'.
Ex: 'AccountName_ProjectName' -> 'AccountName/ProjectName'
However, the project name can itself contain underscores, so we must replace only the first '_' in any given ProjectID.
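
The first-underscore rule can be sketched on bare ProjectID strings as follows. The helper name `to_github_slug` is hypothetical, and the sketch assumes the account name itself never contains an underscore, since otherwise the first '_' would not mark the account/project boundary.

```python
def to_github_slug(project_id: str) -> str:
    """Replace only the first underscore with a slash."""
    # The third argument to str.replace caps the number of replacements at 1.
    return project_id.replace("_", "/", 1)

print(to_github_slug("AccountName_ProjectName"))   # AccountName/ProjectName
print(to_github_slug("AccountName_Project_Name"))  # AccountName/Project_Name
```

A ProjectID with no underscore at all is returned unchanged, so such entries would still need separate handling before a lookup.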


In [7]:
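import re

languages = ["go", "java", "python", "ruby"]
project_details_folder = "project_details"

for language in languages:
  raw_path = f'{project_details_folder}/raw/{language}_projects.json'
  filtered_path = f'{project_details_folder}/filtered/{language}_projects_filtered.json'

  # Context managers close both files even if an error occurs mid-loop.
  with open(raw_path, "rt") as fin, open(filtered_path, "wt") as fout:
    for line in fin:
      # Test for "ObjectId(" rather than "_id": a ProjectID value such as
      # "user_idea" also contains "_id" and would otherwise be caught here.
      if "ObjectId(" in line:
        # Unwrap ObjectId("...") so only the quoted id string remains.
        updatedLine = line.replace('ObjectId(', '').replace(')', '')
      elif "ProjectID" in line:
        # Replace only the first underscore: 'Account_Proj_Name' -> 'Account/Proj_Name'.
        updatedLine = re.sub("_", "/", line, count=1)
      else:
        updatedLine = line
      fout.write(updatedLine)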

With the filtered results, we can now read the files as valid JSON and interact with them easily.

In [8]:
import json

languages = ["go", "java", "python", "ruby"]
project_details_folder = "project_details"


for language in languages:
  # The context manager closes the file once the JSON has been loaded.
  with open(f'{project_details_folder}/filtered/{language}_projects_filtered.json') as f:
    projects_data = json.load(f)

  print(f'Number of {language} projects: {len(projects_data)}')

  # Uncomment to list every corrected ProjectID:
  # for project in projects_data:
  #   print(project["ProjectID"])

Number of go projects: 927
Number of java projects: 2358
Number of python projects: 4663
Number of ruby projects: 1141
