## The GitHub History of the Scala Language





## Project Description

Open source projects contain entire development histories, such as who made changes, the changes themselves, and code reviews. In this project, you'll be challenged to read in, clean up, and visualize the real-world project repository of Scala that spans data from a version control system (Git) as well as a project hosting site (GitHub). With almost 30,000 commits and a history spanning over ten years, Scala is a mature language. You will find out who has had the most influence on its development and who are the experts.

The dataset includes the project history of Scala retrieved from Git and GitHub as a set of CSV files.


## Project Tasks

1. Scala's real-world project repository data
2. Preparing and cleaning the data
3. Merging the DataFrames
4. Is the project still actively maintained?
5. Is there camaraderie in the project?
6. What files were changed in the last ten pull requests?
7. Who made the most pull requests to a given file?
8. Who made the last ten pull requests on a given file?
9. The pull requests of two special developers
10. Visualizing the contributions of each developer


## 1. Scala's real-world project repository data

With almost 30k commits and a history spanning over ten years, Scala is a mature programming language. It is a general-purpose programming language that has recently become another prominent language for data scientists.

Scala is also an open source project. Open source projects have the advantage that their entire development histories -- who made changes, what was changed, code reviews, etc. -- are publicly available.

We're going to read in, clean up, and visualize the real-world project repository of Scala, which combines data from a version control system (Git) with data from a project hosting site (GitHub). We will find out who has had the most influence on its development and who the experts are.

The dataset we will use, which has been previously mined and extracted from GitHub, consists of three files:

- `pulls_2011-2013.csv` contains the basic information about the pull requests, and spans from the end of 2011 up to (but not including) 2014.
- `pulls_2014-2018.csv` contains identical information, and spans from 2014 up to 2018.
- `pull_files.csv` contains the files that were modified by each pull request.



import pandas as pd

# Pull requests from the end of 2011 through 2013
pulls_11_13 = pd.read_csv('pulls_2011-2013.csv')
print(pulls_11_13.head())

# Pull requests from 2014 up to 2018
pulls_14_18 = pd.read_csv('pulls_2014-2018.csv')
print(pulls_14_18.head())

# Files modified by each pull request
pull_files = pd.read_csv('pull_files.csv')
print(pull_files.head())

        pid         user                  date
0  11166973  VladimirNik  2013-12-31T23:10:55Z
1  11161892      Ichoran  2013-12-31T16:55:47Z
2  11153894      Ichoran  2013-12-31T02:41:13Z
3  11151917      rklaehn  2013-12-30T23:45:47Z
4  11131244        qerub  2013-12-29T17:21:01Z
         pid       user                  date
0  163314316     hrhino  2018-01-16T23:29:16Z
1  163061502   joroKr21  2018-01-15T23:44:52Z
2  163057333  mkeskells  2018-01-15T23:05:06Z
3  162985594      lrytz  2018-01-15T15:52:39Z
4  162838837  zuvizudar  2018-01-14T19:16:16Z
         pid                                   file
0  163314316        test/files/pos/t5638/Among.java
1  163314316       test/files/pos/t5638/Usage.scala
2  163314316             test/files/pos/t9291.scala
3  163314316             test/files/run/t8348.check
4  163314316  test/files/run/t8348/TableColumn.java


## 2. Preparing and cleaning the data

First, we will need to combine the data from the two separate pull DataFrames.
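One way to combine the two DataFrames is `pd.concat`. The sketch below is only an illustration under stated assumptions: the frames `pulls_one` and `pulls_two` are small hypothetical stand-ins built from the sample rows printed above, not the full CSV contents.

```python
import pandas as pd

# Hypothetical stand-ins for the two pull request DataFrames,
# using a few rows copied from the sample output above.
pulls_one = pd.DataFrame({
    "pid": [11166973, 11161892],
    "user": ["VladimirNik", "Ichoran"],
    "date": ["2013-12-31T23:10:55Z", "2013-12-31T16:55:47Z"],
})
pulls_two = pd.DataFrame({
    "pid": [163314316, 163061502],
    "user": ["hrhino", "joroKr21"],
    "date": ["2018-01-16T23:29:16Z", "2018-01-15T23:44:52Z"],
})

# Stack the rows of both frames; ignore_index=True rebuilds a clean 0..n-1 index
# instead of keeping the duplicated row labels from each input.
pulls = pd.concat([pulls_one, pulls_two], ignore_index=True)
print(pulls)
```

Because both files share the same columns (`pid`, `user`, `date`), row-wise concatenation is sufficient here; no join key is needed.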

Next, the raw data extracted from GitHub contains dates in the ISO 8601 format. However, pandas imports them as regular strings. To make our analysis easier, we need to convert the strings into datetime objects. Datetime objects have the important property that they can be compared and sorted chronologically.

The pull request times are all in UTC (also known as Coordinated Universal Time). The commit times, however, are in the local time of the author with time zone information (number of hours difference from UTC). To make comparisons easy, we should convert all times to UTC.
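As a minimal sketch of this conversion, `pd.to_datetime` with `utc=True` parses ISO 8601 strings and normalizes them to UTC. The `pulls_sample` frame reuses two dates from the sample output above; the `commit_times` value is a made-up timestamp with a UTC offset, included only to show how local times would be normalized.

```python
import pandas as pd

# Two pull request dates copied from the sample output above (already in UTC, marked by "Z").
pulls_sample = pd.DataFrame({
    "pid": [11166973, 163314316],
    "date": ["2013-12-31T23:10:55Z", "2018-01-16T23:29:16Z"],
})

# Parse the strings into timezone-aware datetimes in UTC.
pulls_sample["date"] = pd.to_datetime(pulls_sample["date"], utc=True)

# Hypothetical commit time recorded in the author's local time (UTC-05:00).
commit_times = pd.Series(["2018-01-16T18:29:16-05:00"])

# utc=True converts the offset-aware value to the equivalent UTC instant.
commit_utc = pd.to_datetime(commit_times, utc=True)

print(pulls_sample["date"])
print(commit_utc)
```

Once every column is in UTC, the values can be compared and sorted directly, for example with `pulls_sample.sort_values("date")`.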
