
# Reading the dataset

In [115]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [116]:
with open('/content/gdrive/My Drive/foo.txt', 'w') as f:
  f.write('Hello Google Drive!')
!cat /content/gdrive/My\ Drive/foo.txt

Hello Google Drive!

In [117]:
import pandas as pd
import numpy as np

from __future__ import division 

# reading issues
df = pd.read_csv("/content/gdrive/My Drive/CustomData/jiradataset_issues.csv")
# Reading the changelog
changelog = pd.read_csv("/content/gdrive/My Drive/CustomData/jiradataset_changelog.csv")

print 'Dataset size: {0}'.format(len(df))

Dataset size: 15155


In [118]:
summary = df.pivot_table(index='project', columns=['fields.issuetype.name'], values='key', aggfunc='count', fill_value=0, margins=True)
summary

| project | Bug | Documentation | Epic | Improvement | New Feature | Patch submission | Story | Sub-task | Task | Technical Debt | Technical task | Wish | All |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| apstud | 456 | 0 | 7 | 159 | 1 | 0 | 202 | 0 | 0 | 0 | 61 | 0 | 886 |
| dnn | 1129 | 0 | 0 | 315 | 10 | 0 | 278 | 92 | 70 | 0 | 0 | 0 | 1894 |
| mesos | 517 | 82 | 9 | 377 | 0 | 0 | 21 | 0 | 462 | 0 | 0 | 4 | 1472 |
| mule | 595 | 0 | 2 | 252 | 121 | 2 | 33 | 2 | 274 | 0 | 0 | 0 | 1281 |
| nexus | 705 | 0 | 0 | 302 | 0 | 0 | 31 | 0 | 25 | 8 | 0 | 0 | 1071 |
| timob | 1143 | 0 | 17 | 268 | 218 | 0 | 289 | 34 | 0 | 0 | 21 | 0 | 1990 |
| tistud | 1450 | 0 | 0 | 536 | 37 | 0 | 618 | 20 | 0 | 0 | 209 | 0 | 2870 |
| xd | 598 | 0 | 111 | 307 | 0 | 0 | 2590 | 0 | 0 | 0 | 85 | 0 | 3691 |
| All | 6593 | 82 | 146 | 2516 | 387 | 2 | 4062 | 148 | 831 | 8 | 376 | 4 | 15155 |


# Cleaning steps

0 - Remove all the issues whose assignee is null

In [119]:
xd1 = df[(df['fields.assignee.name'].notnull())]

print len(xd1)

12586


1- Story points must have been assigned once and never updated afterward. In fact, if the story points estimate gets updated, it may mean that the initial version of the issue report had misleading information, which would confuse the classifier. This explains why we filter out such issue reports. (Porru et al.)

In [120]:
# Filtering all the user stories that have been updated in the story points field
print "Original size: {0}".format(len(xd1))

remove = changelog[((changelog['field'] == 'Story Points') | ( changelog['field'] == 'Actual Story Points' )
                       | ( changelog['field'] == 'Story Size' ) | ( changelog['field'] == 'QA Story Points' ) 
                       | ( changelog['field'] == 'Effort points' ) | ( changelog['field'] == 'Value/Effort' )
                       | ( changelog['field'] == 'Effort' ) | ( changelog['field'] == 'Points' ) )  
                       & (changelog['fromString'].notnull()) ]

xd2 = xd1[ ~xd1['key'].isin(remove['key']) ]

print "After removing: {0} ({1:4.2f}%)".format(len(xd2), len(xd2)/len(xd1)*100)

Original size: 12586
After removing: 12357 (98.18%)
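
The same filter can be written more compactly with `isin`. The sketch below runs on a hypothetical three-issue changelog (the `toy_*` frames and the `SP_FIELDS` name are my own, only the column names come from the dataset) and is meant to illustrate the rule, not to replace the cell above.

```python
import pandas as pd

# Hypothetical miniature changelog; only the column names follow the notebook.
toy_changelog = pd.DataFrame({
    "key": ["A-1", "A-1", "A-2", "A-3"],
    "field": ["Story Points", "Story Points", "Story Points", "summary"],
    "fromString": [None, "3", None, "old title"],
})
toy_issues = pd.DataFrame({"key": ["A-1", "A-2", "A-3"]})

# Field names that hold an estimate (same list as in the cell above).
SP_FIELDS = ["Story Points", "Actual Story Points", "Story Size", "QA Story Points",
             "Effort points", "Value/Effort", "Effort", "Points"]

# An estimate counts as *updated* when the changelog row has a previous value.
updated = toy_changelog[toy_changelog["field"].isin(SP_FIELDS)
                        & toy_changelog["fromString"].notnull()]
kept = toy_issues[~toy_issues["key"].isin(updated["key"])]
print(kept["key"].tolist())
```

Only A-1 is dropped: its estimate changed from a previous value, whereas A-2 was estimated once and A-3 only had its summary edited.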


2- The issue must be addressed. We consider an issue addressed when its Status is set to Closed (or similar, e.g. Fixed, Completed) and its resolution field is set to Fixed (or similar, e.g. Done, Completed). Note that fields such as Story Points and Description may be adjusted or updated at any given time. However, once the issue is addressed, updates rarely happen. For instance, in the industrial project this event happens for less than 4% (49/1368) of the issues. Here, as for the other projects, we filter out issue reports that are not addressed, because they are likely to be unstable and hence might confuse the classifier.

In [121]:
xd3 = xd2[((xd2['fields.status.name'] == 'Done') | (xd2['fields.status.name'] == 'Closed') 
           | (xd2['fields.status.name'] == 'Resolved') | (xd2['fields.status.name'] == 'Accepted') ) 
          & ((xd2['fields.resolution.name'] == 'Complete') | (xd2['fields.resolution.name'] == 'Fixed') 
          | (xd2['fields.resolution.name'] == 'Done') | (xd2['fields.resolution.name'] == 'Resolved') 
          | (xd2['fields.resolution.name'] == 'Completed'))]

print "After removing: {0} ({1:4.2f}%)".format(len(xd3), len(xd3)/len(xd1)*100)

After removing: 10257 (81.50%)


3- Once the story points are assigned, the informative fields of the issue (i) must already be set and (ii) must not have been changed afterward. We define the informative fields as Issue Type, Description, Summary, and Component/s. We filter out issues whose informative fields are updated after story points initialization because they, again, are likely to represent unstable issues.

In [122]:
# (i) the informative fields of the issue must be already set

# check if the fields are null or empty
xd4 = xd3[(xd3['fields.issuetype.name'] != '') &
    (xd3['fields.description'].notnull()) &
    (xd3['fields.summary'].notnull())]

# only US with components (an empty component list is stored as the string '[]')
xd5 = xd4[xd4['fields.components'] != '[]']

print "After removing: {0} ({1:4.2f}%)".format(len(xd5), len(xd5)/len(xd1)*100)

After removing: 7280 (57.84%)


In [123]:
# We filter out issues whose informative fields are updated after story points initialization
#  ['fields.issuetype.name', 'fields.description', 'fields.summary', 'fields.components']

# get story points initialization date

sp = changelog[((changelog['field'] == 'Actual Story Points') | (changelog['field'] == 'Story Points')) 
          & (changelog['fromString'].isnull())]

ifields = changelog[ (changelog['field'] == 'issuetype') |
                    (changelog['field'] == 'description') |
                    (changelog['field'] == 'summary') |
                    (changelog['field'] == 'Component') ]

to_remove = []
for i in range(len(ifields)):
    key = ifields.iloc[i]['key']
    #print key
    
    original_date = pd.to_datetime(xd5[xd5.key == key]['fields.created'])
    
    #print original_date
    
    #print sp[ sp.key == key ]
    # story points initialization date
    spinit = pd.to_datetime(sp[ sp.key == key ].created)
    
    # update date of the informative field
    updatedate = pd.to_datetime(ifields.iloc[i]['created'])
    
    if not spinit.empty:
        if updatedate > pd.to_datetime(spinit.iloc[0]):
            to_remove.append(key)
    elif not original_date.empty:
        if updatedate > pd.to_datetime(original_date.iloc[0]):
            to_remove.append(key)

xd6 = xd5[~xd5['key'].isin(to_remove)]

print "After removing: {0} ({1:4.2f}%)".format(len(xd6), len(xd6)/len(xd1)*100)

After removing: 7222 (57.38%)
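
The row-by-row loop above can also be expressed as a merge followed by one vectorized date comparison. The sketch below uses hypothetical toy frames (`sp_init`, `field_updates`) and covers only the case where a story-point initialization is logged. It leaves out the fallback to the issue creation date that the cell above applies when no initialization is found.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the notebook's changelog.
sp_init = pd.DataFrame({
    "key": ["A-1", "A-2"],
    "created": ["2016-01-10", "2016-02-01"],   # when story points were first set
})
field_updates = pd.DataFrame({
    "key": ["A-1", "A-2", "A-2"],
    "field": ["summary", "description", "summary"],
    "created": ["2016-01-05", "2016-02-03", "2016-01-20"],
})

# Earliest story-point initialization per issue.
first_sp = (sp_init.assign(sp_date=pd.to_datetime(sp_init["created"]))
                   .groupby("key")["sp_date"].min()
                   .reset_index())

# Attach that date to every informative-field update and compare.
merged = field_updates.merge(first_sp, on="key", how="inner")
merged["update_date"] = pd.to_datetime(merged["created"])
late = merged["update_date"] > merged["sp_date"]
unstable_keys = sorted(merged.loc[late, "key"].unique())
print(unstable_keys)
```

A-2 is flagged because its description changed two days after the estimate was set. A-1 was only edited before its estimate, so it is kept.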


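4- Keep only the user stories whose points belong to the modified Fibonacci scale commonly used for story point estimation (it includes 0.5, 20, 40 and 100, so it is not the strict Fibonacci series)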

In [124]:
fibonacci = [0.5, 1, 2, 3, 5, 8, 13, 20, 40, 100]

xd7 = xd6[ xd6['storypoints'].isin(fibonacci)]

print "After removing: {0} ({1:4.2f}%)".format(len(xd7), len(xd7)/len(xd1)*100)

After removing: 6757 (53.69%)


## Choosing the filter
xd1 ... xd7 are the datasets obtained after each of the cleaning steps above. Here I pick which one to use for the rest of the notebook.

In [125]:
xdf = xd7.copy()  # independent copy, so later column assignments do not write to a slice of df

print 'Filtered dataset size: {0}'.format(len(xdf))

Filtered dataset size: 6757


# Adding new features
I will create new sets of features according to three categories: 
* Developer's features
* Issues features
* Textual features

## Developer's features
Developers' features depend on the dataset because most of them are proportions computed over the total number of issues of the project. I therefore define them as functions here and call them later, once per project.

In [0]:
# Reporter reputation 

def get_reputation(developer, dataset):
    opened = len(dataset[dataset['fields.creator.name'] == developer])
    opened_and_fixed = len(dataset[(dataset['fields.creator.name'] == developer) 
                            & ((dataset['fields.status.name'] == 'Done') | (dataset['fields.status.name'] == 'Closed') 
           | (dataset['fields.status.name'] == 'Resolved') | (dataset['fields.status.name'] == 'Accepted') ) 
                            & (dataset['fields.assignee.name'] == developer)])
    return opened_and_fixed/(opened+1)   
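
To make the reputation formula concrete, here is a small worked example on a hypothetical four-issue dataset. The `reputation` helper and the `DONE` list are my own condensed restatement of `get_reputation`, written under the assumption of true division (Python 3, or `from __future__ import division`).

```python
import pandas as pd

# Hypothetical toy issues; column names follow the notebook.
toy = pd.DataFrame({
    "fields.creator.name":  ["ann", "ann", "ann", "bob"],
    "fields.assignee.name": ["ann", "bob", "ann", "bob"],
    "fields.status.name":   ["Done", "Done", "Open", "Closed"],
})
DONE = ["Done", "Closed", "Resolved", "Accepted"]

def reputation(dev, data):
    """Issues opened and fixed by `dev`, divided by (issues opened by `dev` + 1)."""
    opened = (data["fields.creator.name"] == dev).sum()
    fixed = ((data["fields.creator.name"] == dev)
             & (data["fields.assignee.name"] == dev)
             & data["fields.status.name"].isin(DONE)).sum()
    return fixed / (opened + 1)   # +1 in the denominator, as in get_reputation

print(reputation("ann", toy))  # 1 fixed / (3 opened + 1)
print(reputation("bob", toy))  # 1 fixed / (1 opened + 1)
```

Because of the `+1` in the denominator, a developer who opened and fixed a single issue gets 0.5 rather than 1, so the score stays below 1 and grows slowly for developers with few issues.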

In [0]:
def get_reputations(dataset):
    devs = dataset['fields.creator.name'].unique()
    devs = np.append(devs, dataset['fields.assignee.name'].unique())
    devs = np.append(devs, dataset['fields.reporter.name'].unique())

    # remove duplicates
    devs = np.unique(devs)

    print "Total number of devs: ", len(devs)

    reputations = []
    for d in devs:
        reputations.append(get_reputation(d, dataset))

    reputations_df = pd.DataFrame({"developer": devs, "reputation": reputations})
    return reputations_df

#reputations_df[reputations_df.reputation > 0].sort_values(['reputation'], ascending=False).head()

In [0]:
# Total developer workload
from __future__ import division 
    
#     
def get_dev_workload(developer, dataset, percentual=True):
    if percentual:
        df = dataset[(dataset['fields.assignee.name'] == developer)]
        return len(df)/len(dataset)
    else:
        df = dataset[(dataset['fields.assignee.name'] == developer)]
        return len(df)

def get_devs_workload(dataset, percentual=True):
    ws = []
    
    devs = dataset['fields.creator.name'].unique()
    devs = np.append(devs, dataset['fields.assignee.name'].unique())
    devs = np.append(devs, dataset['fields.reporter.name'].unique())

    # remove duplicates
    devs = np.unique(devs)
    
    for d in devs:
        ws.append(get_dev_workload(d, dataset, percentual))
    return pd.DataFrame({"developer":devs, "workload": ws})

def get_workload(dataset, developer="", percentual=True):
    if developer == "":
        return get_devs_workload(dataset, percentual)
    else:
        return get_dev_workload(developer, dataset, percentual)

In [0]:
# Current workload: each assignee's share of the issues whose status is Done/Closed/Resolved/Accepted
def get_current_workload(dataset):
    done_issues = dataset[((dataset['fields.status.name'] == 'Done') | (dataset['fields.status.name'] == 'Closed') 
           | (dataset['fields.status.name'] == 'Resolved') | (dataset['fields.status.name'] == 'Accepted') )]
    grouped = done_issues.groupby('fields.assignee.name').size().reset_index(name='workload')
    grouped['workload'] = grouped['workload']/sum(grouped['workload'])
    #developers = developers.merge(grouped, on='fields.assignee.name')
    #developers.head()
    grouped.columns = ['developer', 'current_workload']
    return grouped

In [0]:
def get_velocity(dataset):
    velocity = dataset[['fields.assignee.name', 'storypoints']].groupby(['fields.assignee.name']).sum()
    
    velocity = velocity.reset_index()
    velocity.columns = ['developer', 'velocity']
    velocity['velocity'] = velocity['velocity'] / sum(velocity['velocity'])
    return velocity
    
#velocity.sort_values(ascending=False).head()
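
As a quick illustration of what `get_velocity` returns, the sketch below computes the same per-developer share of story points on a hypothetical three-issue dataset. The variable names are mine.

```python
import pandas as pd

# Hypothetical toy issues; column names follow the notebook.
toy = pd.DataFrame({
    "fields.assignee.name": ["ann", "ann", "bob"],
    "storypoints": [3, 5, 2],
})

# Story points delivered per developer, then normalized to a share of the total.
per_dev = toy.groupby("fields.assignee.name")["storypoints"].sum()
share = per_dev / per_dev.sum()
print(share.to_dict())
```

The shares sum to 1 within a project, which is why the feature has to be recomputed per project rather than once for the whole dataset.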

In [0]:
# Number of developer's comments
def get_comment_number(dataset):
    
    ch = changelog[changelog['key'].isin(dataset['key'])]
    comments_times = ch[ (ch['field'] == 'Comment')].groupby(['author']).size()

    comments_times = comments_times.reset_index()
    comments_times['comments'] = comments_times[0]/sum(comments_times[0]) 

    comments_times.columns = ['developer', 'comment_absolute', 'comments_relative']
    #print "number of developers with comments: ", len(comments_times)
    return comments_times

In [0]:
# Text similarity between an issue and the text previously edited by its assignee
def get_summary_similarity(dataset):
    # Relies on the globals `changelog` (loaded above) and on `context` and `TfidfVectorizer`,
    # which are defined in later cells.
    # The score is the mean TF-IDF weight of the two-document corpus, not a cosine similarity.

    # Previous summary/description values, concatenated per changelog author
    ch = changelog[changelog['key'].isin(dataset['key'])]
    ch = ch[(ch['field'] == 'summary') | (ch['field'] == 'description')]
    ch = ch[['author', 'fromString', 'from']]
    ch = ch.groupby(['author'], as_index=False).sum()
    ch.columns = ['developer', 'issue', 'from']

    sm = []
    keys = []
    for index, row in dataset.iterrows():
        ds_context = context[context['key'] == row['key']][['context']]
        issue_summary = ch[ch['developer'] == row['fields.assignee.name']]
        if issue_summary.size > 0:
            issue_context = ds_context.iloc[0]['context']
            developer_summary = issue_summary.iloc[0]['issue']

            corpus = [issue_context, developer_summary]
            vectorizer = TfidfVectorizer()
            trsfm = vectorizer.fit_transform(corpus)
            similarity = pd.DataFrame(trsfm.toarray(), columns=vectorizer.get_feature_names(),
                                      index=['issue_context', 'developer_summary'])
            sm.append(similarity.mean(numeric_only=True).mean())
        else:
            # the assignee has no previous summary/description edits
            sm.append(0)
        keys.append(row['key'])

    return pd.DataFrame({"key": keys, "similarity": sm})
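
For comparison, a more conventional way to score how close two texts are is the cosine of the angle between their term vectors. The sketch below is not what `get_summary_similarity` computes: it is a swapped-in, standard-library-only illustration that uses raw term counts instead of TF-IDF weights, and the function name is hypothetical.

```python
import math
import re
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between the raw term-count vectors of two texts."""
    a = Counter(re.findall(r"\w+", text_a.lower()))
    b = Counter(re.findall(r"\w+", text_b.lower()))
    # Dot product over the terms both texts share.
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

print(cosine_similarity("Fix login bug", "fix login bug"))   # identical terms
print(cosine_similarity("Fix login bug", "Update docs"))     # no shared terms
```

Unlike a mean of TF-IDF weights, this score is bounded between 0 and 1 for count vectors and depends only on the overlap between the two texts, which makes values easier to compare across issues.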
       

## Features from the issues
The issue features do not depend on the total number of issues of the project, so they can be computed for the entire dataset.

In [0]:
# Discussion time
def get_discussiontime(dataset):
    discussiontime = pd.to_datetime(dataset['fields.resolutiondate']).subtract(pd.to_datetime(dataset['fields.created']))
    return discussiontime

In [0]:
# Number of times the issue was reopened
reopened_times = changelog[ (changelog['field'] == 'status') 
            & (changelog['fromString'] == 'Done') 
            & (changelog['toString'] != 'Done')].groupby(['key']).size()
reopened_times = reopened_times.reset_index()
reopened_times.columns = ['key', 'reopened_times']

In [0]:
# Number of times the priority was changed
priority_times = changelog[ changelog['field'] == 'priority' ].groupby(['key']).size()
priority_times = priority_times.reset_index()
priority_times.columns = ['key', 'priority_times']

In [0]:
# Number of times the fix version was changed
fixversion_times = changelog[ changelog['field'] == 'Fix Version' ].groupby(['key']).size()
fixversion_times = fixversion_times.reset_index() 
fixversion_times.columns = ['key', 'fixversion_times']

In [0]:
# Number of fix versions
d = []
for i in range(len(xdf['fields.fixVersions'])):
    d.append({ 'key' : xdf['key'].iloc[i] , 'fix_versions' : len(pd.Series(xdf['fields.fixVersions'].iloc[i]))})
    
fix_versions = pd.DataFrame(d)

In [0]:
# Number of affect versions
# at least one version is affected
d = []
for i in range(len(xdf['fields.versions'])):
    d.append({ 'key' : xdf['key'].iloc[i], 
              'affect_versions' : 1 if len(pd.Series(xdf['fields.versions'].iloc[i])) == 0 else len(pd.Series(xdf['fields.versions'].iloc[i]))})
    
affect_versions = pd.DataFrame(d)

### Features from component and issue type

#### Issue type dummies

In [0]:
# issue type 
issue_type = pd.get_dummies(xdf[['key', 'fields.issuetype.name']], columns=['fields.issuetype.name'])

#### Components dummies

In [140]:
xdf['fields.components'] = xdf['fields.components'].apply(lambda x: [v.replace('[', '').replace(']', '').strip() for v in x.split(',')])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [141]:
#new_df.head()
components = pd.get_dummies(xdf['fields.components'].apply(pd.Series).stack()).sum(level=0)
components['key'] = xdf['key']
components.head()

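|  | 360 | Acceptance Testing | Acegi | Admin - Event Viewer | Admin - Extensions | Admin - File Manager | Admin - Google Analytics | Admin - Languages | Admin - Newsletters | Admin - Pages | ... | security | slave | statistics | stout | technical debt | test | testing | tests | webui | key |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 37 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | XD-3716 |
| 44 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | XD-3709 |
| 62 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | XD-3691 |
| 63 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | XD-3690 |
| 68 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | XD-3685 |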
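
Since an issue can belong to several components, the dummies above form a multi-label encoding: one row per issue, one column per component. The sketch below reproduces the idea on two hypothetical issues. It assumes a pandas version that provides `Series.explode` (0.25 or later), which is not what the cell above uses.

```python
import pandas as pd

# Hypothetical toy issues whose components are already parsed into lists.
toy = pd.DataFrame({
    "key": ["A-1", "A-2"],
    "fields.components": [["ui", "core"], ["core"]],
})

# One row per (issue, component) pair, then one indicator column per component.
exploded = toy.set_index("key")["fields.components"].explode()
dummies = pd.get_dummies(exploded).groupby(level=0).sum()
print(dummies)
```

Indexing by `key` before exploding keeps every indicator row attached to its issue, so no separate step is needed to add the key back.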


In [142]:
# components + issue_type
print len(issue_type)
print len(components)

components_issuetype = pd.merge(issue_type, components, on='key',  how='left')

6757
6757


### All the features together

In [0]:
# Issue-related features
usfeatures = pd.DataFrame(xdf['key'])

usfeatures['discussion_time'] = pd.to_datetime(xdf['fields.resolutiondate']).subtract(pd.to_datetime(xdf['fields.created']))
usfeatures = pd.merge(usfeatures, reopened_times, on='key', how='left')
usfeatures = pd.merge(usfeatures, priority_times, on='key', how='left')
usfeatures = pd.merge(usfeatures, fixversion_times, on='key', how='left')
usfeatures = pd.merge(usfeatures, fix_versions, on='key', how='left')
usfeatures = pd.merge(usfeatures, affect_versions, on='key', how='left')
#usfeatures = pd.merge(usfeatures, context[['key', 'context_characters', 'context_code_characters']], on='key', how='left')
usfeatures = usfeatures.fillna(0)
usfeatures['discussion_time'] = usfeatures['discussion_time'].dt.total_seconds()

### Full issue features ( issue + components + issuetype )

In [144]:
print len(usfeatures)
print len(components_issuetype)
fullissuefeatures = pd.merge(usfeatures, components_issuetype, on='key', how='left')

6757
6757


## Text features
I create a new variable context to store only the textual info.

In [145]:
# Summary and description merged into one text (Porru 2014)

context = xdf[['key', 'fields.summary', 'fields.description','fields.assignee.name']].copy()
context["context"] = context["fields.summary"] + ". " + context["fields.description"]

print len(context)

context.head()

6757


|  | key | fields.summary | fields.description | fields.assignee.name | context |
|---|---|---|---|---|---|
| 37 | XD-3716 | Support Configuring the RabbitMessageBus Messa... | http://stackoverflow.com/questions/34053997/pa... | grussell | Support Configuring the RabbitMessageBus Messa... |
| 44 | XD-3709 | Duplicate MBean Names With router Sink | For some reason, the Integration {{MBeanExport... | grussell | Duplicate MBean Names With router Sink. For so... |
| 62 | XD-3691 | Ensure Job definitions are escaped in UI | If using the definition <aaa \|\| bbb> where the... | hillert | Ensure Job definitions are escaped in UI. If u... |
| 63 | XD-3690 | Improve "Server Configuration - Database Confi... | Make it more clear what drivers need to be cop... | thomas.risberg | Improve "Server Configuration - Database Confi... |
| 68 | XD-3685 | Job Definitions page fails to display definiti... | In this scenario we created 30 jobs that can b... | hillert | Job Definitions page fails to display definiti... |


In [0]:
# Separate natural language and the code in context
import re

for ix, line in context.iterrows():
    m = re.search('{code}(.*){code}', line.context, flags=re.DOTALL)
    if m:
        context.loc[ix, 'context_code'] = line.context[m.start(0):m.end(0)]
        context.loc[ix, 'context'] = line.context[:m.start(0)] + line.context[m.end(0):]
    else:
        context.loc[ix, 'context_code'] = ""
        
    context.loc[ix, 'context'] = re.sub(r"\s+", " ", context.loc[ix, 'context'])
    context.loc[ix, 'context_code'] = re.sub(r"\s+", " ", context.loc[ix, 'context_code'])
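
The effect of that regular expression is easiest to see on a single string. The `split_code` helper below is a hypothetical standalone restatement of the loop body, applied to a made-up issue text.

```python
import re

def split_code(text):
    """Return (natural_language, code) for a JIRA-style text with {code} blocks."""
    m = re.search(r"\{code\}(.*)\{code\}", text, flags=re.DOTALL)
    if m:
        code = text[m.start(0):m.end(0)]
        prose = text[:m.start(0)] + text[m.end(0):]
    else:
        code, prose = "", text
    # Collapse runs of whitespace, as the cell above does.
    return re.sub(r"\s+", " ", prose), re.sub(r"\s+", " ", code)

prose, code = split_code("Crash on start.\n{code}\nint x = 1;\n{code}\nSee log.")
print(prose)
print(code)
```

Because `(.*)` is greedy and `re.DOTALL` is set, the match runs from the first `{code}` marker to the last one. When an issue contains several code blocks, any prose between them therefore ends up in the code part.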

In [0]:
# Number of characters in the code
context['context_code_characters'] = context['context_code'].str.len()

# Number of characters in context
context['context_characters'] = context['context'].str.len()

## Metrics

In [0]:
# MMRE (mean magnitude of relative error): mean over all issues of |actual effort - estimated effort| / actual effort
import numpy as np
from numpy import inf

def mmre(labels, predictions):
    assert len(labels) == len(predictions)
    
    mre = np.abs(labels - predictions) / labels
    mre[mre == inf] = 0
    return np.sum(mre) / len(labels)
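
A short worked example shows how MMRE differs from MAE. The story point values below are hypothetical.

```python
import numpy as np

# Hypothetical actual and predicted story points for four issues.
actual = np.array([1.0, 2.0, 5.0, 8.0])
predicted = np.array([1.0, 3.0, 3.0, 8.0])

mre = np.abs(actual - predicted) / actual      # per-issue relative error
mmre_value = mre.mean()                        # (0 + 0.5 + 0.4 + 0) / 4
mae_value = np.abs(actual - predicted).mean()  # (0 + 1 + 2 + 0) / 4
print(mmre_value, mae_value)
```

Missing a 2-point story by 1 point weighs more in MMRE (0.5) than missing a 5-point story by 2 points (0.4), while MAE ranks them the other way around. That is worth keeping in mind when the two metrics disagree in the results.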

## Cross-validation SVM
This is just an auxiliary function to run the SVM for the different projects.

In [0]:
# Obtaining predictions by cross-validation

from sklearn import metrics
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def SVM(X, Y, comment, results):
    C = 1.0  # SVM regularization parameter
    clf = Pipeline([
      ('feature_selection', SelectFromModel(LinearSVC())),
      ('classification', LinearSVC())
    ])
    scores = cross_val_score(clf, X, Y, cv=10)

    predicted = cross_val_predict(clf, X, Y, cv=10)
    
    diff = np.abs(Y.astype(float) - predicted.astype(float))
    mre = diff / Y.astype(float)
    
    both = pd.DataFrame({'Actual': Y, 'Predicted': predicted, 'diff': diff, 'mre': mre })
    
    result = {
        'Classifier': comment, 
        'Rows': X.shape[0], 
        'Features': X.shape[1],
        'Accuracy': scores.mean(),
        'Accuracy SD': scores.std()*2,
        'MAE' : metrics.mean_absolute_error(Y.astype(float), predicted.astype(float)),
        'MMRE': mmre(Y.astype(float), predicted.astype(float)),
    }
    
    results = results.append(result, ignore_index=True)
    return results

In [150]:
from sklearn.feature_extraction.text import TfidfVectorizer
import scipy.sparse as sp 

fullresults = pd.DataFrame()

for p in xdf['project'].unique():
    results = pd.DataFrame()
    
    # Set the project
    df = xdf[xdf['project'] == p]
    
    
    if (p != 'mule'):
        continue
    # Cross validation doesn't work if there are few instances
    if (len(df) < 14):
        continue
    print "Project: ", p 
    print "Getting comments..."
    comments = get_summary_similarity(df) 
    #if (hasattr(comments, 'columns') & comments.columns.size > 0):
         #print comments
    print comments
    
    print "Done."
    print

Project:  mule
Getting comments...



           key  similarity
0    MULE-9527    0.023852
1    MULE-9443    0.025674
2    MULE-9393    0.080060
3    MULE-9371    0.086741
4    MULE-9164    0.025894
5    MULE-9115    0.081562
6    MULE-9113    0.081562
7    MULE-9112    0.081562
8    MULE-9060    0.029526
9    MULE-9038    0.021173
10   MULE-8939    0.028708
11   MULE-8693    0.083863
12   MULE-8582    0.048196
13   MULE-8533    0.047628
14   MULE-8518    0.060576
15   MULE-8497    0.083502
16   MULE-8482    0.072119
17   MULE-8481    0.068317
18   MULE-8435    0.026441
19   MULE-8301    0.023067
20   MULE-8291    0.083603
21   MULE-8290    0.081873
22   MULE-8273    0.083593
23   MULE-8248    0.071960
24   MULE-8220    0.025202
25   MULE-8199    0.029706
26   MULE-8177    0.072358
27   MULE-8173    0.044420
28   MULE-8156    0.087078
29   MULE-8142    0.071216
..         ...         ...
80   MULE-7291    0.080326
81   MULE-7289    0.084963
82   MULE-7288    0.085145
83   MULE-7280    0.070686
84   MULE-7261    0.086555
8

# Run SVM for all the projects

In [171]:
from sklearn.feature_extraction.text import TfidfVectorizer
import scipy.sparse as sp 

fullresults = pd.DataFrame()

for p in xdf['project'].unique():
    results = pd.DataFrame()
    
    # Set the project
    df = xdf[xdf['project'] == p]
    
    if (p == 'timob'):
      continue
    
    
    # Cross validation doesn't work if there are few instances
    if (len(df) < 14):
        continue
    print "Project: ", p
    
    # Compute the dev features
    print "Getting reputations..."
    reps = get_reputations(df)
    print "Getting workload..."
    workload_df = get_workload(df)
    print "Getting current workload..."
    current_workload = get_current_workload(df)
    print "Getting comments..."
    comments = get_comment_number(df)
    print "Getting velocity..."
    velocity = get_velocity(df)
    
    # Put together all the dev features
    devfeatures = pd.merge(reps, workload_df, on='developer', how='left')
    devfeatures = pd.merge(devfeatures, current_workload, on='developer', how='left')
    devfeatures = pd.merge(devfeatures, comments[['developer', 'comments_relative']], on='developer', how='left')
    devfeatures = pd.merge(devfeatures, velocity, on='developer', how='left')

    devfeatures = devfeatures.fillna(0)
    
    print "Getting text features..."
    
    ctx = context[context['key'].isin(df['key'])]
    
    print "ctx size", len(ctx)
    
    v = TfidfVectorizer(ngram_range=(1, 2), analyzer='word', min_df=.0025, max_df=.1, stop_words='english')
    x = v.fit_transform(ctx['context'])

    if (np.all(ctx['context_code'] != '')):
        v2 = TfidfVectorizer(ngram_range=(1, 2), analyzer='word', min_df=.0025, max_df=.1, stop_words='english')
        y = v2.fit_transform(ctx['context_code'])
        textfeatures = sp.hstack((x, y))
    else:
        textfeatures = x
    
    similarity = get_summary_similarity(df)
    # textfeatures = pd.merge(similarity, textfeatures, on='key', how='left')
    #textfeatures = sp.hstack((similarity.drop(['key'], axis=1) ,textfeatures))
        
    print "Text features: ", textfeatures.shape
    
    print "Getting issue features..."
    
    # link btw key and developer
    usdev = df[['key', 'fields.assignee.name']]
    usdev.columns = ['key', 'developer']
    
    us_dev = pd.merge(usdev, devfeatures, on='developer', how='left')
    
    issue_features = fullissuefeatures[fullissuefeatures['key'].isin(df['key'])]
    
    dev_issue_features = pd.merge(us_dev, fullissuefeatures, on='key', how='left')
    dev_issue_features = dev_issue_features.fillna(0)
    
    text_dev_issue_features = sp.hstack((dev_issue_features.drop(['key', 'developer'], axis=1), textfeatures))
    print "Dev+text+issues features: ", text_dev_issue_features.shape

    # Issue + text features
    text_issue_features = sp.hstack((issue_features.drop(['key'], axis=1), textfeatures))
    
    # dev + text features
    text_dev_features = sp.hstack((us_dev.drop(['key', 'developer'], axis=1), textfeatures))
    
    print "Training SVMs..."

    Y = df['storypoints'].astype(str)

    results = SVM(text_dev_issue_features, Y, "Issue+Dev+Text", results)
    results = SVM(text_dev_features, Y, "Text+Dev", results)
    results = SVM(issue_features.drop(['key'], axis=1), Y, "Issue", results)
    results = SVM(textfeatures, Y, "Text", results)
    results = SVM(us_dev.drop(['key', 'developer'], axis=1), Y, "Dev", results)
    results = SVM(dev_issue_features.drop(['key','developer'], axis=1), Y, "Dev+Issue", results)
    results = SVM(text_issue_features, Y, "Text+Issue", results)
    results = SVM(similarity.drop(['key'], axis=1), Y, "Similarity", results)
    
    results['project'] = p
    fullresults = fullresults.append(results, ignore_index=True)
    
    print "Done."
    print

Project:  xd
Getting reputations...
Total number of devs:  63
Getting workload...
Getting current workload...
Getting comments...
Getting velocity...
Getting text features...
ctx size 587
Text features:  (587, 4964)
Getting issue features...
Dev+text+issues features:  (587, 5320)
Training SVMs...
Done.

Project:  dnn
Getting reputations...
Total number of devs:  114
Getting workload...
Getting current workload...
Getting comments...
Getting velocity...
Getting text features...
ctx size 586
Text features:  (586, 5604)
Getting issue features...
Dev+text+issues features:  (586, 5960)
Training SVMs...
Done.

Project:  apstud
Getting reputations...
Total number of devs:  116
Getting workload...
Getting current workload...
Getting comments...
Getting velocity...
Getting text features...
ctx size 386


Text features:  (386, 18160)
Getting issue features...
Dev+text+issues features:  (386, 18516)
Training SVMs...
Done.

Project:  mesos
Getting reputations...
Total number of devs:  87
Getting workload...
Getting current workload...
Getting comments...
Getting velocity...
Getting text features...
ctx size 555
Text features:  (555, 5243)
Getting issue features...
Dev+text+issues features:  (555, 5599)
Training SVMs...
Done.

Project:  mule
Getting reputations...
Total number of devs:  92
Getting workload...
Getting current workload...
Getting comments...
Getting velocity...
Getting text features...
ctx size 772
Text features:  (772, 4341)
Getting issue features...
Dev+text+issues features:  (772, 4697)
Training SVMs...
Done.

Project:  nexus
Getting reputations...
Total number of devs:  62
Getting workload...
Getting current workload...
Getting comments...
Getting velocity...
Getting text features...
ctx size 539
Text features:  (539, 7853)
Getting issue features...
Dev+text+issues featu

# Results

In [172]:
fr = fullresults[(fullresults['Classifier'] == 'Dev') | (fullresults['Classifier'] == 'Text') | (fullresults['Classifier'] == 'Text+Dev')  |  (fullresults['Classifier'] == 'Similarity')  ]
rrr = fr.pivot( index='project', columns='Classifier')[['Accuracy', 'MMRE', 'MAE']]
rrr.loc['Average']= rrr.mean()

rrr.round(3)

| project | Accuracy: Dev | Accuracy: Similarity | Accuracy: Text | Accuracy: Text+Dev | MMRE: Dev | MMRE: Similarity | MMRE: Text | MMRE: Text+Dev | MAE: Dev | MAE: Similarity | MAE: Text | MAE: Text+Dev |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| apstud | 0.348 | 0.335 | 0.343 | 0.351 | 0.852 | 0.822 | 0.783 | 0.752 | 3.738 | 3.642 | 3.64 | 3.65 |
| dnn | 0.504 | 0.504 | 0.479 | 0.458 | 0.335 | 0.335 | 0.361 | 0.378 | 0.661 | 0.661 | 0.706 | 0.735 |
| mesos | 0.318 | 0.283 | 0.32 | 0.322 | 0.597 | 0.755 | 0.578 | 0.561 | 1.486 | 1.427 | 1.458 | 1.42 |
| mule | 0.333 | 0.255 | 0.245 | 0.286 | 0.697 | 1.151 | 1.096 | 0.957 | 2.518 | 2.57 | 3.025 | 2.89 |
| nexus | 0.572 | 0.537 | 0.511 | 0.541 | 0.298 | 0.356 | 0.375 | 0.35 | 0.473 | 0.484 | 0.509 | 0.494 |
| tistud | 0.406 | 0.406 | 0.384 | 0.376 | 0.441 | 0.441 | 0.518 | 0.527 | 1.931 | 1.931 | 2.096 | 2.155 |
| xd | 0.295 | 0.278 | 0.291 | 0.289 | 0.58 | 0.496 | 0.767 | 0.753 | 2.049 | 2.213 | 1.944 | 1.937 |
| Average | 0.397 | 0.371 | 0.368 | 0.375 | 0.543 | 0.622 | 0.64 | 0.611 | 1.837 | 1.847 | 1.911 | 1.897 |


# Random guessing

In [0]:
import random
from scipy import stats

rs = pd.DataFrame()

for p in xdf['project'].unique():
    
    df = xdf[xdf['project'] == p].copy()  # copy, so the assignment below does not write to a slice of xdf
    
    df.loc[df['storypoints'] == 0, 'storypoints'] = 0.5
    
    mae_mean = np.sum(np.abs(df['storypoints'] - df['storypoints'].mean()))/len(df)
    #mre_mean = np.sum(np.abs(df['storypoints'] - df['storypoints'].mean())/df['storypoints'])/len(df)
    
    d1 = {
        "project": p,
        "Classifier": "Mean",
      #  "MMRE": mre_mean,
        "MAE": mae_mean,
        "Accuracy": None
    }
    rs = rs.append(d1, ignore_index=True)
    
    mae_median = np.sum(np.abs(df['storypoints'] - df['storypoints'].median()))/len(df)
    #mre_median = np.sum(np.abs(df['storypoints'] - df['storypoints'].median())/df['storypoints'])/len(df)
    acc_median = len(df[df['storypoints'] == df['storypoints'].median()])/len(df)
    
    d2 = {
        "project": p,
        "Classifier": "Median",
       # "MMRE": mre_median,
        "MAE": mae_median,
        "Accuracy": acc_median
    }
    rs = rs.append(d2, ignore_index=True)
    # Random Guess baseline 
    
    rguess = []
    for i in range(len(df)):
        rguess.append( random.choice(fibonacci) )

    mae_rguess = np.sum(np.abs(df['storypoints'] - rguess))/len(df)
    #mre_rguess = np.sum(np.abs(df['storypoints'] - rguess)/df['storypoints'])/len(df)
    acc_rguess = len(df[df['storypoints'] == rguess])/len(df)
    
    d3 = {
        "project": p,
        "Classifier": "Random Guess",
       # "MMRE": mre_rguess,
        "MAE": mae_rguess,
        "Accuracy": acc_rguess
    }
    rs = rs.append(d3, ignore_index=True)
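
The three baselines can be checked by hand on a small example. The sketch below uses six hypothetical story point values and a fixed random seed (my addition, so that the random guess is repeatable). The scale is the same list as `fibonacci` above.

```python
import random
import numpy as np

random.seed(0)  # fixed seed so the random baseline is repeatable
scale = [0.5, 1, 2, 3, 5, 8, 13, 20, 40, 100]
actual = np.array([1, 2, 3, 3, 5, 8], dtype=float)   # hypothetical story points

# Always predict the mean, always predict the median, or pick a value at random.
mae_mean = np.abs(actual - actual.mean()).mean()
mae_median = np.abs(actual - np.median(actual)).mean()
guesses = np.array([random.choice(scale) for _ in actual])
mae_random = np.abs(actual - guesses).mean()
print(mae_mean, mae_median, mae_random)
```

The median baseline can never have a larger MAE than the mean baseline on the same data, because the median is the constant that minimizes the mean absolute error. A uniform draw from the scale is penalized heavily by the rare large values (40, 100).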
    

In [0]:
rsp = rs.pivot(index='project', columns='Classifier')
rsp = rsp.drop(columns=[('Accuracy', 'Mean')])
rsp.round(3)

In [0]:
# Baselines (rs) + SVM results (fr)
frr = pd.concat([rs,fr])[[ 'Classifier', 'project', 'Accuracy', 'MAE' ]]

frrr = frr.pivot(index='project', columns='Classifier')

#frrr.loc['Average'] = frr.pivot(index='project', columns='Classifier').mean()

frrr = frrr.astype(float)

frrr

In [0]:
d = {
    "Dev" : (1 - frrr['MAE']['Dev']/rsp['MAE']['Random Guess'])*100,
    "Text" : (1 - frrr['MAE']['Text']/rsp['MAE']['Random Guess'])*100,
    "Text+Dev" : (1 - frrr['MAE']['Text+Dev']/rsp['MAE']['Random Guess'])*100,
    "Median" : (1 - frrr['MAE']['Median']/rsp['MAE']['Random Guess'])*100,
    "Mean" : (1 - frrr['MAE']['Mean']/rsp['MAE']['Random Guess'])*100
    }
sa = pd.DataFrame(d)
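
The `SA` values computed above appear to follow the usual definition of standardized accuracy, which expresses how much a model improves on random guessing: SA = (1 - MAE_model / MAE_random) * 100. A minimal sketch with hypothetical MAE values (the function name is mine):

```python
def standardized_accuracy(mae_model, mae_random):
    """Percentage improvement of a model's MAE over the random-guess MAE."""
    return (1 - mae_model / mae_random) * 100

print(standardized_accuracy(1.5, 6.0))   # model error is a quarter of the random error
print(standardized_accuracy(6.0, 6.0))   # no better than random guessing
print(standardized_accuracy(9.0, 6.0))   # worse than random guessing
```

An SA of 0 means the model does no better than random guessing, and negative values mean it does worse. Since the random baseline here is a single draw per issue, SA will vary between runs unless the random seed is fixed.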

In [0]:
sa

In [0]:
w = pd.concat([frrr, sa], axis=1)

In [0]:
w[('SA','Dev')] = w['Dev']
w[('SA','Text')] = w['Text']
w[('SA','Text+Dev')] = w['Text+Dev']
w[('SA','Mean')] = w['Mean']
w[('SA','Median')] = w['Median']

In [0]:
evalresults = w[[('Accuracy', 'Dev'),('Accuracy', 'Text'),('Accuracy', 'Text+Dev'),('Accuracy', 'Median'),('Accuracy', 'Random Guess'),
  ('MAE', 'Dev'),('MAE', 'Text'),('MAE', 'Text+Dev'),('MAE', 'Mean'),('MAE', 'Median'),('MAE', 'Random Guess'),
  ('SA','Dev'),('SA', 'Text'),('SA', 'Text+Dev'),('SA', 'Mean'),('SA', 'Median')
  ]]

In [0]:
evalresults.loc['Average'] = evalresults.mean()
evalresults.round(3)