# Introduction

The goal of this project is to analyze non-normally distributed data on open source GitHub projects from a dataset provided by .

The project should show an ability to explore data, handle ambiguity, and weave insights into a narrative. It should have well-formed observations and understanding of the nuances of the data.

In [2]:
# Import required modules
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import numpy as np
import seaborn as sns
from datetime import timedelta

# Data Extraction

The dataset comes in two files, "usage.csv" and "projects.csv". They contain selected usage records of open source GitHub projects.

In [5]:
usage = pd.read_csv("usage.csv")

# To view initial structure of dataset
usage.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 566094 entries, 0 to 566093
Data columns (total 6 columns):
id            566094 non-null object
actor_id      444652 non-null object
project_id    566094 non-null object
account_id    566094 non-null object
started_at    566094 non-null object
ended_at      566094 non-null object
dtypes: object(6)
memory usage: 25.9+ MB


There are 566,094 observations (rows) and 6 variables (columns) in usage. Every column except actor_id is fully populated; actor_id has 444,652 non-null values, leaving 121,442 missing.

In [6]:
project = pd.read_csv("projects.csv")
project.columns=['project_id','project_size']
project.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12907 entries, 0 to 12906
Data columns (total 2 columns):
project_id      12907 non-null object
project_size    12906 non-null float64
dtypes: float64(1), object(1)
memory usage: 201.8+ KB


The initial view of the projects data file shows 12,907 rows with one null value, in project_size.

In [7]:
# Merging two datasets
merged = pd.merge(usage, project, on='project_id', how='outer')

In [8]:
merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 566094 entries, 0 to 566093
Data columns (total 7 columns):
id              566094 non-null object
actor_id        444652 non-null object
project_id      566094 non-null object
account_id      566094 non-null object
started_at      566094 non-null object
ended_at        566094 non-null object
project_size    563510 non-null float64
dtypes: float64(1), object(6)
memory usage: 34.6+ MB
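The merged frame reports 563,510 non-null project_size values against 566,094 rows, so 2,584 usage rows carry no project size. A minimal sketch of one way to tell apart rows whose project_id has no match from rows that matched a project with a null size, using the `indicator` option of `pd.merge`. The toy frames and their values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up stand-ins for the usage and project tables
usage_toy = pd.DataFrame({"id": [1, 2, 3], "project_id": ["p1", "p2", "p9"]})
project_toy = pd.DataFrame({"project_id": ["p1", "p2"], "project_size": [71.0, None]})

# indicator=True adds a "_merge" column recording where each row came from
checked = pd.merge(usage_toy, project_toy, on="project_id", how="left", indicator=True)

# Usage rows whose project_id does not appear in the project table
unmatched = checked[checked["_merge"] == "left_only"]

# Usage rows that matched a project whose size is itself missing
null_size = checked[(checked["_merge"] == "both") & checked["project_size"].isna()]
```

A left merge is used in the sketch so that every usage row is kept exactly once, whatever the project table contains.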


# Data Wrangling/Munging

In [9]:
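# The raw strings have no hour field, so prefix '00:' to parse them as HH:MM:SS.
# errors='coerce' turns anything unparseable into NaT instead of raising.
merged['started_at'] = pd.to_timedelta('00:' + merged['started_at'], errors='coerce')
merged['ended_at'] = pd.to_timedelta('00:' + merged['ended_at'], errors='coerce')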

Because my tablet-laptop has limited processing power, I subset the data to 384 observations. This is the sample size required for a population of 566,094 (the total number of observations) at a 95% confidence level with a 5% margin of error. Since I am limiting the dataset, I also removed the rows with null values.
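The notebook does not say which formula produced 384. One common choice that reproduces it is Cochran's formula with a finite population correction, sketched below. The z-score of 1.96 (95% confidence) and the worst-case proportion p = 0.5 are assumptions, as is the function name `sample_size`.

```python
import math

def sample_size(population, z=1.96, margin=0.05, p=0.5):
    """Cochran's sample size with finite population correction (assumed formula)."""
    # Sample size for an effectively infinite population
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Shrink it slightly to account for the finite population
    return math.ceil(n0 / (1 + (n0 - 1) / population))

n = sample_size(566094)
print(n)
```

With a population this large the correction barely matters: the uncorrected value is about 384.16 and the corrected one about 383.9, which rounds up to 384.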

In [10]:
# Keep only rows where both actor_id and project_size are present
no_na = merged.dropna(subset=['actor_id', 'project_size'])
small = no_na.sample(384)
small = small.reset_index(drop=True)

  


In [11]:
small.head()

Unnamed: 0,id,actor_id,project_id,account_id,started_at,ended_at,project_size
0,a728cc5a62c4b5d6130f0662ac03c15f6ddce72f31eb2b...,cb03bf60e166ed64d3ec4a6730c023384e645e7180c33e...,79543f158ad9c27582a6e5517ab5ef91deca2be002d5e9...,f332c66eae839044f0476a26b83367426dbc04beab90aa...,00:48:16.500000,00:50:51,44441.0
1,85406eda28c3da86d1d073b37e177daddf3fc56816acd1...,c3e2d9e519da2a615b9d83669116d822f0d30795cfe364...,2d0e24be7b4c7406c4efd0772f1351707eba06d201326f...,16876b96b6b8b2ebc48c543c18080cdc8a84a87eb1c515...,00:49:10.600000,00:51:26,10383.0
2,4ca69276b6d9ae85ce07235bbdc7a00685d4754df39fd6...,47acc019ec56ad25e6cd44ddbe2ea513702ceeb6a1c17a...,878e1e82ff0e64486463fbe10886931cd60b67d70f4dfa...,892bf9181e5034cbc80b08ee0dd5598bdefa7dc4880b06...,00:56:24,00:59:39.700000,212719.0
3,48a2f4b9710d464c4e9b1f39442bdb745e1b8be40fab30...,467a3f755fafd35efc28ff5ebacd193c61c5e9ba388a84...,f09396375c9a706d291c6264cdcfd441182d9d422f2b7d...,d2fefe1774b77114507fdce4a9c63cf13a6bede7afff9d...,00:37:03,00:38:02.900000,71.0
4,cebfe80528039f7b25c90d2c9de20583b69bb72c61e1b0...,6c83fc5ec41ead6ea3336a98e7d3614b98876a2972de11...,dba05c9d71b2b3fb1a3a94bf80cccdb7ecd4776cc63776...,eeccd1b657ba995781fb315f165e9be7949dbadf88a42c...,00:20:57.100000,00:22:28.300000,640.0


Since the imported timestamps carry only minutes and seconds, with no hour or date, some records have a start time after the end time. This is not possible. The next code segment therefore checks that the laws of time have not been broken; where they have, it adds an hour to the end time until it falls after the start time. This gives a more accurate and realistic duration of, for example, 2 minutes rather than 23 hours and 58 minutes.
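A small worked example of the correction on made-up values, assuming every session is shorter than one hour so that a single added hour is enough. In the second row the session starts at minute 59 and ends at minute 1 of the next hour: the raw difference would be negative, and after the correction it is 2 minutes.

```python
import pandas as pd

# Made-up "MM:SS" strings with no hour field
toy = pd.DataFrame({
    "started_at": ["48:16.5", "59:00"],
    "ended_at": ["50:51", "01:00"],
})
for col in ["started_at", "ended_at"]:
    toy[col] = pd.to_timedelta("00:" + toy[col], errors="coerce")

# An end before the start means the session crossed an hour boundary
wrapped = toy["ended_at"] < toy["started_at"]
toy.loc[wrapped, "ended_at"] = toy.loc[wrapped, "ended_at"] + pd.Timedelta(hours=1)

toy["time_of_usage"] = toy["ended_at"] - toy["started_at"]
durations = toy["time_of_usage"].dt.total_seconds().tolist()
print(durations)
```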

In [12]:
# Where ended_at precedes started_at the session crossed an hour boundary,
# so push ended_at forward by one hour. Both columns hold values below one
# hour, so a single hour is always enough. Using .loc avoids chained
# assignment on a copy; rows containing NaT compare as False and are skipped.
wrapped = small['ended_at'] < small['started_at']
small.loc[wrapped, 'ended_at'] = small.loc[wrapped, 'ended_at'] + timedelta(hours=1)


In [13]:
# Calculate time_of_usage as the duration of each record
small['time_of_usage'] = small['ended_at'] - small['started_at']

# Check / Test
small.head(10)

Unnamed: 0,id,actor_id,project_id,account_id,started_at,ended_at,project_size,time_of_usage
0,a728cc5a62c4b5d6130f0662ac03c15f6ddce72f31eb2b...,cb03bf60e166ed64d3ec4a6730c023384e645e7180c33e...,79543f158ad9c27582a6e5517ab5ef91deca2be002d5e9...,f332c66eae839044f0476a26b83367426dbc04beab90aa...,00:48:16.500000,00:50:51,44441.0,00:02:34.500000
1,85406eda28c3da86d1d073b37e177daddf3fc56816acd1...,c3e2d9e519da2a615b9d83669116d822f0d30795cfe364...,2d0e24be7b4c7406c4efd0772f1351707eba06d201326f...,16876b96b6b8b2ebc48c543c18080cdc8a84a87eb1c515...,00:49:10.600000,00:51:26,10383.0,00:02:15.400000
2,4ca69276b6d9ae85ce07235bbdc7a00685d4754df39fd6...,47acc019ec56ad25e6cd44ddbe2ea513702ceeb6a1c17a...,878e1e82ff0e64486463fbe10886931cd60b67d70f4dfa...,892bf9181e5034cbc80b08ee0dd5598bdefa7dc4880b06...,00:56:24,00:59:39.700000,212719.0,00:03:15.700000
3,48a2f4b9710d464c4e9b1f39442bdb745e1b8be40fab30...,467a3f755fafd35efc28ff5ebacd193c61c5e9ba388a84...,f09396375c9a706d291c6264cdcfd441182d9d422f2b7d...,d2fefe1774b77114507fdce4a9c63cf13a6bede7afff9d...,00:37:03,00:38:02.900000,71.0,00:00:59.900000
4,cebfe80528039f7b25c90d2c9de20583b69bb72c61e1b0...,6c83fc5ec41ead6ea3336a98e7d3614b98876a2972de11...,dba05c9d71b2b3fb1a3a94bf80cccdb7ecd4776cc63776...,eeccd1b657ba995781fb315f165e9be7949dbadf88a42c...,00:20:57.100000,00:22:28.300000,640.0,00:01:31.200000
5,7a565fa07b10d9f7a66984a4c9ed13c8c49267357a700f...,b6179cfed84f3a33c9263d3df38b85b30d12e8e49b0178...,d30c190d59afc2fa18279d0f7cc3fbe7871352b8f8333e...,5951f0863727294532c8bc014f41a5b2057c1b14c1ca79...,00:28:20.100000,00:37:52.200000,6132.0,00:09:32.100000
6,b89d8170355802a3e00d0b656539d733a8276cf9ca7901...,4e0ac40f438138ad8058fd83ccca4e78da87ff402c6af7...,a8996d6feb004d56f8d9a2499e251d047abd71a6c2499d...,7d5a3f469590f5b536609b79e59ebc34cf3fe4f4be378c...,00:07:55.800000,00:09:47.800000,4134.0,00:01:52
7,516750b3d59134b5122609431c8f2399120661aae8fc7f...,a0d0c81185e49c643160e694a3dae0d044b4304dc7651b...,77595c3dd3ccf75332281ea20622f0711892fb7ac65634...,e9092495b57f81055d7fba885515b1478285d68683438c...,00:15:00.400000,00:26:33.300000,51806.0,00:11:32.900000
8,5499eb08e9ab2fc1500c0c563185748676643726e1c989...,467a3f755fafd35efc28ff5ebacd193c61c5e9ba388a84...,f09396375c9a706d291c6264cdcfd441182d9d422f2b7d...,d2fefe1774b77114507fdce4a9c63cf13a6bede7afff9d...,00:34:02.500000,00:34:53.800000,71.0,00:00:51.300000
9,7167d3f0a5bb647f45d908e421051876bfa201fcb6e2a9...,385174700cfd1001ad1cc8720ef14473eb0268e34033da...,67360ea54ff8437c5194c7e4dd9e5db5547e2e60fbe797...,ad8bd0750e045c488841e9436efeb62479780be844ed82...,00:43:23.500000,00:43:39.900000,14172.0,00:00:16.400000


# Goals

#1 Identify the top 10 actors - by total usage, by accounts built against, and by projects built against including the scalar amounts

In [30]:
# Top 10 actors by number of usage records
df = small
df.groupby('actor_id').size().nlargest(10)

In [21]:
# By Accounts Built Against
# Count the distinct accounts each actor built against and keep the 10 largest
acc = small.groupby('actor_id')['account_id'].nunique().nlargest(10).to_frame('scalar')
acc

In [38]:
# Number of usage records per actor, as a DataFrame
df = small
df_count = df.groupby('actor_id').size().reset_index(name='count')
df_count

In [None]:
# By Total Usage

In [16]:
# By Projects Built Against
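
The three rankings asked for in goal #1 could be computed along the following lines. This is a sketch on made-up records, not the notebook's result: the column `usage_seconds` stands in for `time_of_usage` expressed in seconds, and the helper `top_actors` is a hypothetical name.

```python
import pandas as pd

# Made-up records standing in for `small`
toy = pd.DataFrame({
    "actor_id":      ["a", "a", "b", "b", "c"],
    "account_id":    ["x", "y", "x", "x", "z"],
    "project_id":    ["p1", "p2", "p1", "p1", "p3"],
    "usage_seconds": [120.0, 30.0, 60.0, 45.0, 10.0],
})

def top_actors(df, n=10):
    """Return the n largest actors under each of the three rankings."""
    grouped = df.groupby("actor_id")
    return {
        # Sum of usage time per actor
        "total_usage": grouped["usage_seconds"].sum().nlargest(n),
        # Number of distinct accounts each actor built against
        "accounts": grouped["account_id"].nunique().nlargest(n),
        # Number of distinct projects each actor built against
        "projects": grouped["project_id"].nunique().nlargest(n),
    }

rankings = top_actors(toy, n=2)
```

Each entry is a Series indexed by actor_id whose values are the scalar amounts, so the ranking and the amounts come out together.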

#2 Provide a five-number summary of usage for each week, and a five-number summary of total usage per account for each week

In [None]:
# The provided datasets do not specify weeks so it is not possible to know if these events
# happened on different weeks or the same week

merged.describe()
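
Without week information, only an overall five-number summary can be produced. A minimal sketch of that summary (minimum, lower quartile, median, upper quartile, maximum) is shown below, assuming durations are expressed in seconds. The five input values are durations visible in the sample rows above, and `five_number_summary` is a hypothetical helper name.

```python
import pandas as pd

def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile, maximum."""
    s = pd.Series(values, dtype="float64")
    q = s.quantile([0.0, 0.25, 0.5, 0.75, 1.0])
    q.index = ["min", "q1", "median", "q3", "max"]
    return q

# Durations in seconds taken from five of the displayed sample rows
summary = five_number_summary([16.4, 51.3, 59.9, 91.2, 112.0])
print(summary)
```

If a date column were available, the same helper could be applied per week after grouping on that column.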

#3 Provide some comments on the following: Where can a t-test be applied to these collections? Provide some examples of where it can and cannot, insights on why, and other implications

In [None]:
A t-test can be us

#4 Provide an analysis that shows the relationship between the following variables, along with insights drawn out and comments on your choices:
- Total usage per account
- Unique Actors per account
- Two other variables

In [None]:
# Spearman rank correlation needs numeric input, so express durations in
# seconds, keep only the numeric columns, and drop rows with missing values
numeric = small[['project_size']].assign(
    usage_seconds=small['time_of_usage'].dt.total_seconds()
)
rho, pval = stats.spearmanr(numeric.dropna())
pval
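
Goal #4 asks for relationships at the account level, which requires aggregating per account first. A sketch under stated assumptions: the data are made up, `usage_seconds` stands in for `time_of_usage` in seconds, and the "two other variables" (distinct projects and mean project size per account) are picked here only for illustration. Spearman's rank correlation is used because it does not assume normally distributed data.

```python
import pandas as pd
from scipy import stats

# Made-up records standing in for `small`
toy = pd.DataFrame({
    "account_id":    ["x", "x", "x", "y", "y", "z", "z", "w", "v", "v", "v"],
    "actor_id":      ["a", "b", "c", "a", "a", "d", "e", "f", "g", "g", "h"],
    "project_id":    ["p1", "p1", "p2", "p3", "p3", "p4", "p5", "p6", "p7", "p8", "p7"],
    "project_size":  [100, 100, 300, 50, 50, 1000, 2000, 10, 700, 900, 700],
    "usage_seconds": [120, 60, 30, 20, 15, 200, 100, 5, 90, 40, 10],
})

# One row per account with the four variables of interest
per_account = toy.groupby("account_id").agg(
    total_usage=("usage_seconds", "sum"),
    unique_actors=("actor_id", "nunique"),
    unique_projects=("project_id", "nunique"),
    mean_project_size=("project_size", "mean"),
)

# With more than two columns, spearmanr returns square matrices
rho, pval = stats.spearmanr(per_account.to_numpy())
cols = per_account.columns
rho = pd.DataFrame(rho, index=cols, columns=cols)
pval = pd.DataFrame(pval, index=cols, columns=cols)
```

Labelling the matrices with the column names makes each coefficient and p-value readable without tracking positional indices.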

#5 Using these collections, determine what are the main drivers (if any) of usage growth for accounts.