# Data Science and Big Data Analytics - Group Projects

For the project, we give you access to a database we actively use for our research on open source software development. The database contains information collected from the version control system, the issue tracker, and the mailing lists of the projects. [Documentation of the available data can be found online](https://smartshark2.informatik.uni-goettingen.de/documentation/).

## Your task

The database contains data for many Apache Projects. For your group projects, only the following 39 projects are relevant:
- ant-ivy, archiva, calcite, cayenne, commons-bcel, commons-beanutils, commons-codec, commons-collections, commons-compress, commons-configuration, commons-dbcp, commons-digester, commons-io, commons-jcs, commons-jexl, commons-lang, commons-math, commons-net, commons-rdf, commons-scxml, commons-validator, commons-vfs, deltaspike, eagle, giraph, gora, jspwiki, kylin, lens, mahout, manifoldcf, nutch, opennlp, parquet-mr, santuario-java, systemml, tika, wss4j

Your task is to develop an automated model for the classification of issues as bugs. The model should be designed in such a way that it could be applied upon creation of an issue, i.e., as a recommendation system for users of the issue tracking system. 
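Because the model should run when an issue is created, it can only rely on information that already exists at that moment. The sketch below illustrates this on a plain dictionary. The field names (`title`, `desc`, `created_at`, `status`, `resolution`, `issue_type`) and the choice of which fields count as known at creation time are assumptions for illustration; check them against the data documentation.

```python
# Hypothetical whitelist of fields assumed to be known when an issue is created.
CREATION_TIME_FIELDS = ('title', 'desc', 'created_at')


def creation_time_view(issue):
    """Return a copy of the issue restricted to the whitelisted fields."""
    return {field: issue.get(field) for field in CREATION_TIME_FIELDS}


# Made-up example record, not taken from the database.
example_issue = {
    'title': 'NullPointerException when parsing empty file',
    'desc': 'Parsing an empty input file raises an exception.',
    'created_at': '2019-01-01T12:00:00',
    'status': 'Closed',      # assumed to be set only later in the issue life cycle
    'resolution': 'Fixed',   # assumed to be set only later in the issue life cycle
    'issue_type': 'Bug',     # the label to predict, never a feature
}

features = creation_time_view(example_issue)
print(sorted(features))
```

Keeping the whitelist in one place makes it easy to review which information the model is allowed to see.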

You have to frame this task as an analytic problem. Then, you have to create models that solve it. You have to choose appropriate features from the available data and decide which kind of analytic approach to use. Finally, you have to evaluate how well your approach performs.
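One possible framing, shown here only as a sketch, is binary classification: every issue gets the label 1 if it is a bug and 0 otherwise, and predictions are scored with precision, recall, and F1. The issue type strings and the predictions below are made up, and the assumption that bugs are marked with the type `Bug` should be verified against the data.

```python
def to_label(issue_type):
    """Map an issue type string to 1 (bug) or 0 (anything else, including missing)."""
    return 1 if (issue_type or '').strip().lower() == 'bug' else 0


def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up issue types and made-up model predictions for illustration.
issue_types = ['Bug', 'Improvement', 'bug', 'New Feature', 'Bug', None]
y_true = [to_label(t) for t in issue_types]
y_pred = [1, 1, 1, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print('precision=%.2f recall=%.2f f1=%.2f' % (precision, recall, f1))
```

Which metric matters most depends on how the recommendation would be used, so this choice is part of your analysis.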

## Group registrations

You have to work in groups of three people on this task. **One member of each group must register the whole group via an email to Johannes Erbel (johannes.erbel@cs.uni-goettingen.de) by Thursday, Dec 5th**. If you are fewer than three people, register the same way, and on Dec 7th we will assign you additional members from other groups with fewer than three people. **This registration, as well as the successful participation of your group, is a mandatory requirement for participation in the exam!**


## Presentation and voting

**All results must be presented in the final lecture on Feb 4th, which starts early, at 14:00.** Each group must give a four-minute presentation. In this presentation, you should briefly describe which data you analyzed, how you treated it, which models you used, and your key findings. The time available for the presentation depends on the number of groups and may be increased. This will be announced in January.

Afterwards, everybody in attendance will vote to determine the best project. Each group votes for the best project (3 points), second best (2 points), and third best (1 point). The project with the most points wins a prize.

## Submission of the presentation

In order to allow the presentation session to run smoothly, each group must submit their presentation beforehand. 
**Send the presentation to Steffen Herbold via email (herbold@cs.uni-goettingen.de) by Feb 4th, 10:00, at the latest**. The presentations must be in PDF, PPT, PPTX, or ODP format. Other formats are not allowed.

## Success criteria

A project must fulfill the following criteria so that the group members can participate in the exam:
- A model for automated issue type classification.
- An evaluation of the quality of the issue type classification model.
- A recommendation on how the model could be used, including key benefits and risks.
- The model must be applied to at least 30 projects from the database, which must all be considered for the evaluation.
- A presentation of the results is given in the final lecture on Feb 4th.
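Since the evaluation has to cover many projects, one option (not prescribed by the task) is a leave-one-project-out scheme: train on all projects but one and test on the held-out project, repeating this for every project. The sketch below uses a made-up toy dictionary in place of real issue data, and the function name `leave_one_project_out` is hypothetical.

```python
def leave_one_project_out(issues_by_project):
    """Yield (held_out_project, train_issues, test_issues) once per project."""
    for held_out in sorted(issues_by_project):
        # Training data: the issues of every project except the held-out one.
        train = [
            issue
            for name, issues in issues_by_project.items()
            if name != held_out
            for issue in issues
        ]
        # Test data: only the issues of the held-out project.
        test = list(issues_by_project[held_out])
        yield held_out, train, test


# Toy placeholder data: project name -> list of issue identifiers.
toy = {
    'commons-math': ['m1', 'm2'],
    'tika': ['t1'],
    'kylin': ['k1', 'k2', 'k3'],
}

splits = list(leave_one_project_out(toy))
for project, train, test in splits:
    print('%s: %d training issues, %d test issues' % (project, len(train), len(test)))
```

Reporting the metrics per held-out project, rather than only an average, shows whether the model works equally well everywhere.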


## Getting started

The cell below shows how to access the data with Python. Please note that the database is behind a firewall and can only be accessed from within the Goenet. If you cannot reach the database, just establish a connection to the [VPN of the GWDG](https://info.gwdg.de/docs/doku.php?id=en:services:network_services:vpn:anyconnect) and it should be reachable. 

**WARNING:
Because we also actively use the MongoDB in our research, there can sometimes be heavy load on the database. The database currently contains 6.5 Terabytes of data. Do not just randomly query the database, but fetch only the data you need. Otherwise, you could easily end up downloading several Gigabytes of data. For example, if you fetch all information that is stored in the `commit` collection, you will download 114 Gigabytes of data.**

### Accessing the database with Python

You can use the [pycoSHARK](https://github.com/smartshark/pycoshark) library for accessing the MongoDB. pycoSHARK provides an ORM layer based on the mongoengine library. Alternatively, you can also access the database with native MongoDB queries using the [pymongo](https://api.mongodb.com/python/current/) API.
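For the native route, a minimal pymongo sketch could look as follows. The helper names are hypothetical, and the collection name `project` and the field `name` are assumptions that should be checked against the data documentation. The query passes a projection so that only the requested field is transferred. Only the URI helper runs here; the fetch function needs pymongo and network access to the database.

```python
from urllib.parse import quote_plus


def build_mongodb_uri(user, password, host, port, authentication_db):
    """Build a MongoDB connection URI with percent-escaped credentials."""
    return 'mongodb://%s:%s@%s:%s/?authSource=%s' % (
        quote_plus(user), quote_plus(password), host, port, authentication_db)


def fetch_project_ids(uri, database, project_names):
    """Return a dict that maps project names to their _id values.

    The collection name 'project' and the field 'name' are assumptions.
    """
    from pymongo import MongoClient  # imported lazily: only needed for real queries

    client = MongoClient(uri)
    try:
        cursor = client[database]['project'].find(
            {'name': {'$in': list(project_names)}},  # filter: only the listed projects
            {'name': 1},                             # projection: only name (and _id)
        )
        return {doc['name']: doc['_id'] for doc in cursor}
    finally:
        client.close()


# Placeholder values, not the course credentials.
uri = build_mongodb_uri('user', 'p@ss/word', 'localhost', '27017', 'admin')
print(uri)
```

A call such as `fetch_project_ids(uri, 'smartshark', ['commons-math', 'tika'])` would then return the identifiers needed for further queries.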

The code below shows how to use the database with pycoSHARK.

In [1]:
# code for installing our own library for accessing the MongoDB through an ORM layer
import sys
!{sys.executable} -m pip install pycoshark


In [2]:
from mongoengine import connect
from pycoshark.mongomodels import People, Commit, Project, VCSSystem
from pycoshark.utils import create_mongodb_uri_string

# Database credentials
user = 'datascience2019'
password = 'zE3qHdeJtdVJYznf'
host = '134.76.81.151'
port = '27017'
authentication_db = 'smartshark'
database = "smartshark"
ssl_enabled = None

# Establish connection
uri = create_mongodb_uri_string(user, password, host, port, authentication_db, ssl_enabled)
connect(database, host=uri)

# Fetch project id and version control system id for the 'commons-math' project
# only() decides which fields are actually retrieved from the MongoDB. Always restrict this to the fields that you require!
project = Project.objects(name='commons-math').only('id').get()
vcs_system = VCSSystem.objects(project_id=project.id).only('id','url').get()
print('url of VCS system of the project: %s' % vcs_system.url)

# determine latest commit of the commons-math project
last_commit = None
max_date = None
# loop over all commits of commons-math and keep the one with the latest committer date
for commit in Commit.objects(vcs_system_id=vcs_system.id).only('committer_date', 'committer_id','revision_hash').timeout(False):
    if max_date is None or max_date < commit.committer_date:
        last_commit = commit
        max_date = commit.committer_date
        
print('revision hash of last commit in database: %s' % last_commit.revision_hash)
print('date of last commit in database: %s' % last_commit.committer_date)
print('link to commit on Github: https://github.com/apache/commons-math/commit/%s' % last_commit.revision_hash)

# fetch committer from People
last_committer = People.objects(id=last_commit.committer_id).only('name','email').get()
print('last commit by %s (%s)' % (last_committer.name,last_committer.email))

ServerSelectionTimeoutError: 134.76.81.151:27017: [Errno 111] Connection refused
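The example above only touches commits. Since the task is about issues, the following sketch shows how issue texts might be fetched in the same style. It assumes that `pycoshark.mongomodels` provides `IssueSystem` and `Issue` models with the fields `project_id`, `issue_system_id`, `title`, `desc`, and `issue_type`; verify this against the data documentation before relying on it. The helper names are hypothetical, and the fetch function requires an established `connect()` as shown above.

```python
def to_text(title, desc):
    """Join title and description into one string, tolerating missing values."""
    return ' '.join(part.strip() for part in (title, desc) if part)


def fetch_issue_texts(project_name, limit=100):
    """Return up to `limit` (text, issue_type) tuples for one project.

    Assumes that mongoengine's connect() has already been called.
    """
    # Imported lazily so that to_text() can be used without pycoSHARK installed.
    from pycoshark.mongomodels import Project, IssueSystem, Issue

    if limit <= 0:
        return []
    project = Project.objects(name=project_name).only('id').get()
    rows = []
    # A project may have more than one issue tracker, so iterate over all of them.
    for issue_system in IssueSystem.objects(project_id=project.id).only('id'):
        # Retrieve only the fields that are needed and cap the number of issues.
        query = Issue.objects(issue_system_id=issue_system.id).only('title', 'desc', 'issue_type')
        for issue in query.limit(limit - len(rows)):
            rows.append((to_text(issue.title, issue.desc), issue.issue_type))
        if len(rows) >= limit:
            break
    return rows


print(to_text(' Crash on startup ', 'The application exits immediately.'))
```

Starting with a small `limit` keeps the load on the shared database low while you explore the data.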