# Week 7: Instructor Led Lab
Author: Jen Estes \
Course: BGEN 632 - Introduction to Python \
Term: Spring 2025 \
Due Date: April 14th, 2025 

This notebook inspects and organizes data according to the requirements outlined by Dr. Newton in the BGEN632 Week 7 GitHub repo. The analysis uses the github_teams.csv file and relies primarily on pandas DataFrames.

----

### Importing Modules

In [348]:
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold

### Setting Working Directory

In [351]:
os.getcwd()  # get current working directory, note that data is in "data" folder

'/Users/jenestes/Desktop/week7labs/data'

In [353]:
# want working directory to be within this data folder
os.chdir("/Users/jenestes/Desktop/week7labs/data")   # change the directory
os.getcwd()                                          # confirm the change 

'/Users/jenestes/Desktop/week7labs/data'
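
As an alternative to changing the working directory, the file path can be built explicitly. The sketch below assumes a hypothetical layout in which the notebook sits next to a folder named `data`; the names `data_dir` and `csv_path` are illustrative and not part of the lab.

```python
from pathlib import Path

# Assumption: the notebook lives next to a "data" folder (hypothetical layout).
data_dir = Path("data")
csv_path = data_dir / "github_teams.csv"

print(csv_path)           # relative path to the CSV file
print(csv_path.exists())  # True only if the file is really there
```

A path built this way can be passed straight to `pd.read_csv`, so the notebook does not depend on which directory is current.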

## Accessing Data

### Load Data

In [357]:
gh_teams = pd.read_csv("github_teams.csv")

### Inspect Columns
The `info()` method reports more than just the column headers: it also lists each column's non-null count and dtype, which makes it useful for inspecting the DataFrame.

In [360]:
%%time

gh_teams.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 608 entries, 0 to 607
Data columns (total 19 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   name_h                    608 non-null    object 
 1   Team_type                 608 non-null    object 
 2   Team_size_class           608 non-null    object 
 3   human_members_count       608 non-null    int64  
 4   bot_members_count         608 non-null    int64  
 5   human_work                608 non-null    int64  
 6   work_per_human            608 non-null    float64
 7   human_gini                608 non-null    float64
 8   human_Push                608 non-null    int64  
 9   human_IssueComments       608 non-null    int64  
 10  human_PRReviewComment     608 non-null    int64  
 11  human_MergedPR            608 non-null    int64  
 12  bot_work                  608 non-null    int64  
 13  bot_Push                  608 non-null    int64  
 14  bot_IssueC

### Number of Columns and Rows
The output above suggests there are 608 rows and 19 columns. We can confirm this with the `shape` attribute, which returns a tuple of (number of rows, number of columns).

In [363]:
gh_teams.shape

(608, 19)
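
Because `shape` is a plain tuple, it can be unpacked into separate variables. The sketch below uses a small made-up DataFrame (`toy`) rather than the lab data.

```python
import pandas as pd

# Made-up data, used only to illustrate the shape attribute.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

n_rows, n_cols = toy.shape  # shape is an attribute, so no parentheses
print(f"{n_rows} rows, {n_cols} columns")
```

`len(toy)` gives the row count alone, which is what the counting cells later in this notebook rely on.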

### Converting Columns
The initial `info()` output shows that the categorical columns name_h, Team_type, and Team_size_class were read in as *object*. We can convert them to *category* and then call `info()` again to confirm the change.

In [366]:
gh_teams['name_h'] = gh_teams['name_h'].astype('category')                    # object -> category
gh_teams['Team_type'] = gh_teams['Team_type'].astype('category')              # object -> category
gh_teams['Team_size_class'] = gh_teams['Team_size_class'].astype('category')  # object -> category
gh_teams.info()                                                               # confirm the changes

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 608 entries, 0 to 607
Data columns (total 19 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   name_h                    608 non-null    category
 1   Team_type                 608 non-null    category
 2   Team_size_class           608 non-null    category
 3   human_members_count       608 non-null    int64   
 4   bot_members_count         608 non-null    int64   
 5   human_work                608 non-null    int64   
 6   work_per_human            608 non-null    float64 
 7   human_gini                608 non-null    float64 
 8   human_Push                608 non-null    int64   
 9   human_IssueComments       608 non-null    int64   
 10  human_PRReviewComment     608 non-null    int64   
 11  human_MergedPR            608 non-null    int64   
 12  bot_work                  608 non-null    int64   
 13  bot_Push                  608 non-null    int64   
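
Several columns can also be converted in a single call by passing a dictionary to `astype`. This is a minimal sketch on made-up data (`toy`); the column names mirror the lab's, but the values are invented.

```python
import pandas as pd

# Made-up rows that reuse three of the lab's column names.
toy = pd.DataFrame({
    "Team_type": ["human", "human-bot", "human"],
    "Team_size_class": ["Small", "Large", "Medium"],
    "human_members_count": [2, 7, 5],
})

# Map each column to its target dtype; unlisted columns keep their dtype.
toy = toy.astype({"Team_type": "category", "Team_size_class": "category"})
print(toy.dtypes)
```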

### Unique Values

In [370]:
# Unique values for Team Type - 2
pd.unique(gh_teams.Team_type)  

['human-bot', 'human']
Categories (2, object): ['human', 'human-bot']

In [372]:
# Unique values for Team Size Class - 3
pd.unique(gh_teams.Team_size_class)  

['Small', 'Large', 'Medium']
Categories (3, object): ['Large', 'Medium', 'Small']
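
If only the number of distinct values is needed, or the number of rows per value, `nunique()` and `value_counts()` report them directly. The sketch below uses a made-up Series, not the lab data.

```python
import pandas as pd

# Made-up categorical values for illustration.
toy = pd.Series(
    ["Small", "Large", "Small", "Medium", "Small"],
    dtype="category",
    name="Team_size_class",
)

n_unique = toy.nunique()     # number of distinct values
counts = toy.value_counts()  # rows per distinct value
print(n_unique)
print(counts)
```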

### Indexing

In [375]:
# Value of the 63rd row and 6th column
gh_teams.iloc[62,5]

35

In [377]:
# Values for the 300th row
gh_teams.iloc[299]        

name_h                      IyfocAGfAHLncCVJUujqTA/A_QZ6HlUb5sRQHhPa7SGzQ
Team_type                                                       human-bot
Team_size_class                                                    Medium
human_members_count                                                     4
bot_members_count                                                       1
human_work                                                           1049
work_per_human                                                     262.25
human_gini                                                       0.448761
human_Push                                                            739
human_IssueComments                                                   213
human_PRReviewComment                                                  91
human_MergedPR                                                          6
bot_work                                                               52
bot_Push                              

### Three Methods: row with index value 595 with 1st, 2nd, 3rd columns

In [380]:
# Method 1
gh_teams.iloc[595, 0:3]  

name_h             zAh1NECRCquqUJ_-1d6hAw/DET3jTK8hokYfY_neJ1IVQ
Team_type                                              human-bot
Team_size_class                                            Small
Name: 595, dtype: object

In [382]:
# Method 2
gh_teams.iloc[595, :3]  

name_h             zAh1NECRCquqUJ_-1d6hAw/DET3jTK8hokYfY_neJ1IVQ
Team_type                                              human-bot
Team_size_class                                            Small
Name: 595, dtype: object

In [384]:
# Method 3
gh_teams.loc[595, 'name_h':'Team_size_class']

name_h             zAh1NECRCquqUJ_-1d6hAw/DET3jTK8hokYfY_neJ1IVQ
Team_type                                              human-bot
Team_size_class                                            Small
Name: 595, dtype: object

### Two methods: row with index value 46 with the 3rd and 7th columns

In [387]:
# Method 1
gh_teams.iloc[[46], [2, 6]]

Unnamed: 0,Team_size_class,work_per_human
46,Medium,31.833333


In [389]:
# Method 2
gh_teams.loc[46, ["Team_size_class", "work_per_human"]]

Team_size_class       Medium
work_per_human     31.833333
Name: 46, dtype: object
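
The two methods return different types, which explains why their outputs look different: passing a list for the row (`[[46], ...]`) yields a one-row DataFrame, while passing a single label (`46`) yields a Series. The sketch below reproduces this on a made-up DataFrame with hypothetical column names.

```python
import pandas as pd

# Made-up data with hypothetical column names.
toy = pd.DataFrame({
    "size": ["Small", "Large", "Medium"],
    "count": [2, 7, 5],
    "work": [33.0, 30.1, 5.2],
})

as_frame = toy.iloc[[1], [0, 2]]            # list for the row -> DataFrame
as_series = toy.loc[1, ["size", "work"]]    # single row label -> Series

print(type(as_frame).__name__)
print(type(as_series).__name__)
```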

### Two Methods: New DataFrame for the column `bot_work` 

In [392]:
# Method 1 - The column of interest
gh_teams.bot_work

0        43
1         0
2         0
3      1972
4       302
       ... 
603      26
604       0
605       0
606       0
607       8
Name: bot_work, Length: 608, dtype: int64

In [394]:
# Method 1 - Creating a new DataFrame from the column
just_bot_work = pd.DataFrame(gh_teams.bot_work)
just_bot_work                                    # confirming that this worked

Unnamed: 0,bot_work
0,43
1,0
2,0
3,1972
4,302
...,...
603,26
604,0
605,0
606,0


In [396]:
# Method 2 - A much longer way to do this 
gh_teams.columns  

Index(['name_h', 'Team_type', 'Team_size_class', 'human_members_count',
       'bot_members_count', 'human_work', 'work_per_human', 'human_gini',
       'human_Push', 'human_IssueComments', 'human_PRReviewComment',
       'human_MergedPR', 'bot_work', 'bot_Push', 'bot_IssueComments',
       'bot_PRReviewComment', 'bot_MergedPR', 'eval_survival_day_median',
       'issues_count'],
      dtype='object')

In [398]:
just_bot_work2 = gh_teams.drop(['name_h', 'Team_type', 'Team_size_class', 'human_members_count',
       'bot_members_count', 'human_work', 'work_per_human', 'human_gini',
       'human_Push', 'human_IssueComments', 'human_PRReviewComment',
       'human_MergedPR', 'bot_Push', 'bot_IssueComments',
       'bot_PRReviewComment', 'bot_MergedPR', 'eval_survival_day_median',
       'issues_count'], axis = 1)
just_bot_work2                                      # confirming that this worked

Unnamed: 0,bot_work
0,43
1,0
2,0
3,1972
4,302
...,...
603,26
604,0
605,0
606,0
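
Two shorter routes to a one-column DataFrame are double-bracket selection and `Series.to_frame()`. This sketch uses a made-up DataFrame; the variable names `via_brackets` and `via_to_frame` are illustrative.

```python
import pandas as pd

# Made-up values under two of the lab's column names.
toy = pd.DataFrame({"bot_work": [43, 0, 1972], "human_work": [66, 62, 14579]})

via_brackets = toy[["bot_work"]]           # list of names -> DataFrame
via_to_frame = toy["bot_work"].to_frame()  # Series -> DataFrame

print(via_brackets.equals(via_to_frame))
```

Both avoid listing every other column, which is what makes the `drop` approach long.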


## Sorting and Ordering Data

### Looking at Subsets of Data

In [402]:
# `human-bot` teams that have a `bot_members_count` value greater than or equal to 2
gh_teams[(gh_teams.Team_type == 'human-bot') & (gh_teams.bot_members_count >= 2)]

Unnamed: 0,name_h,Team_type,Team_size_class,human_members_count,bot_members_count,human_work,work_per_human,human_gini,human_Push,human_IssueComments,human_PRReviewComment,human_MergedPR,bot_work,bot_Push,bot_IssueComments,bot_PRReviewComment,bot_MergedPR,eval_survival_day_median,issues_count
3,_l5u7I5p4thtW5SjR_9_4w/aZNCdVXta7fh7eCMzZP1CA,human-bot,Large,234,12,14579,62.303419,0.738342,1942,11430,1170,37,1972,0,1972,0,0,1.0,4757.0
4,_l5u7I5p4thtW5SjR_9_4w/m_FpD7PKQHqVXHn2bh7u2g,human-bot,Large,38,8,1625,42.763158,0.666607,203,1270,152,0,302,0,302,0,0,2.0,777.0
42,2-scMrZv13F95YPZmfieww/4Zc56iUYjIZrZU06omFrJw,human-bot,Large,23,2,4648,202.086957,0.560241,864,2574,1174,36,1325,0,1325,0,0,11.0,1635.0
84,4YoH8row044yJjPIqWJw9Q/NSXj3i61X71lV0StTN71Ww,human-bot,Small,2,2,114,57.0,0.491228,114,0,0,0,37,0,37,0,0,0.0,14.0
89,5Is-_ie16OEGmW1arZm8qg/8UeSk2P76pTG7pPLtxsHTQ,human-bot,Large,17,4,7412,436.0,0.439621,4182,1257,1917,56,358,5,202,151,0,2.0,495.0
110,7sA-8-nyqr0Ri2CT4-FSZw/GJPQoUhHfvUsxKcdkHWLEw,human-bot,Small,3,2,244,81.333333,0.502732,171,73,0,0,136,0,136,0,0,1.0,41.0
146,bi5TY2Z4OSQq3PMs6JnKYA/5wtZcUUo1XmLHIra8NDtFQ,human-bot,Medium,4,2,170,42.5,0.717647,144,7,19,0,104,0,104,0,0,,
147,bi5TY2Z4OSQq3PMs6JnKYA/9b9IqkDK14ketwn88f3hKA,human-bot,Small,3,2,189,63.0,0.624339,174,10,5,0,125,0,125,0,0,35.0,9.0
149,bi5TY2Z4OSQq3PMs6JnKYA/kIiAIJpk6lOa6Nxf234KkQ,human-bot,Small,3,2,88,29.333333,0.636364,74,7,7,0,74,0,74,0,0,,
224,FAhkB4rsocfDW0vrM8U8NA/3KHgTzOwWtAxTXlp_mbqoA,human-bot,Large,15,2,4821,321.4,0.689096,2564,1801,386,70,270,90,116,52,12,13.0,1522.0


In [404]:
# `human` teams that are `Large` and have a `human_gini` value greater than or equal to 0.75
gh_teams[(gh_teams.Team_type == 'human') & (gh_teams.Team_size_class == 'Large') & (gh_teams.human_gini >= 0.75) ]

Unnamed: 0,name_h,Team_type,Team_size_class,human_members_count,bot_members_count,human_work,work_per_human,human_gini,human_Push,human_IssueComments,human_PRReviewComment,human_MergedPR,bot_work,bot_Push,bot_IssueComments,bot_PRReviewComment,bot_MergedPR,eval_survival_day_median,issues_count
138,ASYGR96YA91p3z7MNKjZCA/IB2pZ8ygcvNnlxUdysjSFA,human,Large,12,0,1655,137.916667,0.799446,793,684,178,0,0,0,0,0,0,4.0,190.0
285,IiUao8vA_zm_uEIVVLI-Sw/91ya8vlSP8qgwCllH_6BSw,human,Large,25,0,3599,143.96,0.863507,1249,2350,0,0,0,0,0,0,0,0.0,1245.0
505,uLHPO58cQefwrJUbyhYOKQ/7YWOP8uDEeKDHQMWKqOoYA,human,Large,48,0,5748,119.75,0.78204,1715,3891,142,0,0,0,0,0,0,0.0,1200.0
582,y8Jw59EHVSrsluSuhR5okg/V5vb074jNkzg4YCKforX1Q,human,Large,8,0,277,34.625,0.781137,275,2,0,0,0,0,0,0,0,,


### Counting Rows with Certain Conditions

In [407]:
# Count of teams in the `Small` or `Large` category: 428
# One method - output just these teams and read the row count from the display
gh_teams[(gh_teams.Team_size_class == 'Small')| (gh_teams.Team_size_class == 'Large')]


Unnamed: 0,name_h,Team_type,Team_size_class,human_members_count,bot_members_count,human_work,work_per_human,human_gini,human_Push,human_IssueComments,human_PRReviewComment,human_MergedPR,bot_work,bot_Push,bot_IssueComments,bot_PRReviewComment,bot_MergedPR,eval_survival_day_median,issues_count
0,_1bqaxzCk0sfQaunsjeViQ/RCEZ3CASdLXbstu9y2JQ7Q,human-bot,Small,2,1,66,33.000000,0.287879,29,33,4,0,43,0,43,0,0,87.0,8.0
1,_9o07rGiC7DFyi-zm91Q0g/VOgMsrjYEwFAq0BY8kHqGQ,human,Small,2,0,62,31.000000,0.467742,62,0,0,0,0,0,0,0,0,,
2,_DzK53uaZXnAX3WcC0W28g/Epc4QWw5PNBQIIdvopEHDA,human,Large,7,0,211,30.142857,0.499661,194,16,1,0,0,0,0,0,0,37.0,46.0
3,_l5u7I5p4thtW5SjR_9_4w/aZNCdVXta7fh7eCMzZP1CA,human-bot,Large,234,12,14579,62.303419,0.738342,1942,11430,1170,37,1972,0,1972,0,0,1.0,4757.0
4,_l5u7I5p4thtW5SjR_9_4w/m_FpD7PKQHqVXHn2bh7u2g,human-bot,Large,38,8,1625,42.763158,0.666607,203,1270,152,0,302,0,302,0,0,2.0,777.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
601,zPjKmela4b_-cQYzslpJLQ/Y8b87jxGXuKYhNDrcC5AuA,human,Large,11,0,851,77.363636,0.267279,740,111,0,0,0,0,0,0,0,3.0,22.0
602,zro7Xud3Xy2f5CjF55l_jA/GChw8QQ_KUPepXGZGDWicQ,human-bot,Small,2,1,39,19.500000,0.397436,39,0,0,0,15,0,15,0,0,,
603,zTj5tlMWgotzJmQl7BP8wQ/iQ914_smScbUO8BI9JlE6A,human-bot,Small,3,1,855,285.000000,0.474854,423,59,373,0,26,0,26,0,0,,
604,zUBexdmYylGGpxiebXm6gg/sJXD2kulWzU35ijdY3SnBQ,human,Small,2,0,63,31.500000,0.436508,63,0,0,0,0,0,0,0,0,,


In [409]:
# Another method - create a subset of the data and count its rows
slteams = gh_teams[(gh_teams.Team_size_class == 'Small')| (gh_teams.Team_size_class == 'Large')]
len(slteams)

428

In [411]:
# Teams in the `Small` or `Large` category with a `human_gini` value less than or equal to 0.20: 66
# Using the second method from above, so that just the count of these teams appears
slteams2 = gh_teams[((gh_teams.Team_size_class == 'Small')| (gh_teams.Team_size_class == 'Large')) & (gh_teams.human_gini <= 0.20)]
len(slteams2)

66

In [413]:
# Count of `human-bot` teams in the `Medium` category: 84
humanbotmedium = gh_teams[(gh_teams.Team_size_class == 'Medium')& (gh_teams.Team_type == 'human-bot')]
len(humanbotmedium)

84
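
The same counts can be obtained without building a subset first: a boolean mask sums to the number of `True` values, and `isin` replaces a chain of `==` comparisons joined with `|`. The sketch below runs on made-up rows, so its counts are not the lab's.

```python
import pandas as pd

# Made-up rows that reuse three of the lab's column names.
toy = pd.DataFrame({
    "Team_size_class": ["Small", "Large", "Medium", "Small", "Medium"],
    "Team_type": ["human", "human-bot", "human-bot", "human-bot", "human"],
    "human_gini": [0.10, 0.50, 0.30, 0.15, 0.60],
})

# True for rows whose size class is Small or Large.
small_or_large = toy["Team_size_class"].isin(["Small", "Large"])

n_small_or_large = int(small_or_large.sum())
n_low_gini = int((small_or_large & (toy["human_gini"] <= 0.20)).sum())

print(n_small_or_large, n_low_gini)
```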

### Sampling Data

In [416]:
# Subsample of 50% of the data.
gh_teams_subsample = gh_teams.sample(frac = 0.5, replace = False) 
gh_teams_subsample

Unnamed: 0,name_h,Team_type,Team_size_class,human_members_count,bot_members_count,human_work,work_per_human,human_gini,human_Push,human_IssueComments,human_PRReviewComment,human_MergedPR,bot_work,bot_Push,bot_IssueComments,bot_PRReviewComment,bot_MergedPR,eval_survival_day_median,issues_count
604,zUBexdmYylGGpxiebXm6gg/sJXD2kulWzU35ijdY3SnBQ,human,Small,2,0,63,31.500000,0.436508,63,0,0,0,0,0,0,0,0,,
479,t9G9CZ6ZZN3RCh8Gw2LUQA/t9G9CZ6ZZN3RCh8Gw2LUQA,human-bot,Large,7,1,597,85.285714,0.498684,90,299,208,0,11,0,11,0,0,57.0,109.0
519,V58hL76xAvyUnitZGjXsZg/CnMANWJYuper3o2pVxj5Ew,human-bot,Large,7,1,332,47.428571,0.512048,213,91,28,0,24,0,24,0,0,20.0,26.0
453,S8Mlts6voOAuRvrLEzUUwQ/6lwvrEheHXLoLzw2ohH5FA,human,Small,2,0,26,13.000000,0.384615,26,0,0,0,0,0,0,0,0,,
345,MC6oqT7o22Y_rULWJZllfA/MXyVzmYYom7cgybNB0CjFQ,human-bot,Large,7,2,1421,203.000000,0.504072,444,277,644,56,77,0,77,0,0,19.0,124.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
103,7aZIQZSTKAYawn7PY_hphA/pfe5MUSnWVWLDA5CI53JSw,human-bot,Medium,6,1,470,78.333333,0.536170,72,245,153,0,55,0,55,0,0,8.0,122.0
409,PWktOdCWNXXUO5xh29twIg/-RS4PkkkRPxs1qj5Ky0b4A,human,Small,2,0,36,18.000000,0.305556,36,0,0,0,0,0,0,0,0,,
235,FHn2oD12fabjlrDK8eaU4A/HCHZsZKYZF-iU02YKO4HeA,human-bot,Medium,6,1,461,76.833333,0.442878,175,227,50,9,6,0,0,0,6,3.0,84.0
507,uLvKXX_6CJ4zG3V_9-fGRw/YGzgdzxRGvwVZcszx5HyCA,human-bot,Small,3,1,193,64.333333,0.542314,70,101,22,0,40,0,40,0,0,9.0,71.0
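
`sample` draws different rows each time the cell runs. If a repeatable subsample is wanted, the `random_state` parameter fixes the draw. A minimal sketch on made-up data, with an arbitrary seed of 42:

```python
import pandas as pd

# Made-up data; the seed value 42 is arbitrary.
toy = pd.DataFrame({"x": range(10)})

first = toy.sample(frac=0.5, replace=False, random_state=42)
second = toy.sample(frac=0.5, replace=False, random_state=42)

print(len(first))            # half of the 10 rows
print(first.equals(second))  # same seed, same rows
```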


In [418]:
# Samples for an 8-fold cross-validation test
kf = KFold(n_splits = 8)  # K-Fold cross-validator with 8 folds (the default is 5)

for train, test in kf.split(gh_teams):
    print("%s %s" % (train, test))

[ 76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93
  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111
 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129
 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147
 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165
 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183
 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201
 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219
 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237
 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255
 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273
 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291
 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309
 310 311 312 313 314 315 316 317 318 319 320 321 32

In [420]:
# Training rows for the last fold (after the loop, `train` holds only the final split)
gh_teams.iloc[train]

Unnamed: 0,name_h,Team_type,Team_size_class,human_members_count,bot_members_count,human_work,work_per_human,human_gini,human_Push,human_IssueComments,human_PRReviewComment,human_MergedPR,bot_work,bot_Push,bot_IssueComments,bot_PRReviewComment,bot_MergedPR,eval_survival_day_median,issues_count
0,_1bqaxzCk0sfQaunsjeViQ/RCEZ3CASdLXbstu9y2JQ7Q,human-bot,Small,2,1,66,33.000000,0.287879,29,33,4,0,43,0,43,0,0,87.0,8.0
1,_9o07rGiC7DFyi-zm91Q0g/VOgMsrjYEwFAq0BY8kHqGQ,human,Small,2,0,62,31.000000,0.467742,62,0,0,0,0,0,0,0,0,,
2,_DzK53uaZXnAX3WcC0W28g/Epc4QWw5PNBQIIdvopEHDA,human,Large,7,0,211,30.142857,0.499661,194,16,1,0,0,0,0,0,0,37.0,46.0
3,_l5u7I5p4thtW5SjR_9_4w/aZNCdVXta7fh7eCMzZP1CA,human-bot,Large,234,12,14579,62.303419,0.738342,1942,11430,1170,37,1972,0,1972,0,0,1.0,4757.0
4,_l5u7I5p4thtW5SjR_9_4w/m_FpD7PKQHqVXHn2bh7u2g,human-bot,Large,38,8,1625,42.763158,0.666607,203,1270,152,0,302,0,302,0,0,2.0,777.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
527,vD1mLoQ_CzlsnXcM1E_WOA/XM83BYKoM4QtXWStPYssrw,human-bot,Small,2,1,106,53.000000,0.433962,99,7,0,0,18,0,18,0,0,,
528,vD1mLoQ_CzlsnXcM1E_WOA/Yk03OqUO6JLk05uYj9RK6Q,human-bot,Small,2,1,50,25.000000,0.440000,48,2,0,0,10,0,10,0,0,,
529,VemkfETAeqVyhw0s77AlUw/ERsm7IP5zmTUg2sWF5b-cQ,human-bot,Large,14,1,1878,134.142857,0.753461,1409,465,4,0,597,0,597,0,0,4.0,186.0
530,VGYUfaNcvjujHdwS_xv9xA/dts2QrkRvgHdrxZCJYXG6w,human-bot,Small,2,1,90,45.000000,0.244444,36,32,22,0,12,0,12,0,0,1.0,9.0


In [422]:
# Test rows for the last fold (after the loop, `test` holds only the final split)
gh_teams.iloc[test]

Unnamed: 0,name_h,Team_type,Team_size_class,human_members_count,bot_members_count,human_work,work_per_human,human_gini,human_Push,human_IssueComments,human_PRReviewComment,human_MergedPR,bot_work,bot_Push,bot_IssueComments,bot_PRReviewComment,bot_MergedPR,eval_survival_day_median,issues_count
532,VjnX-IGsAsRz8EJKMp6LLA/HQ9CHS_vbj1CSd6OM0aCoA,human-bot,Large,12,1,3609,300.750000,0.703773,2964,559,76,10,34,0,24,0,10,30.0,518.0
533,VjnX-IGsAsRz8EJKMp6LLA/qZSqyEmsJJ_Rr7f9toRbwQ,human-bot,Medium,6,1,559,93.166667,0.710495,534,17,2,6,46,0,46,0,0,36.0,40.0
534,vjYCi8YxMpUj_LaqRdiCXw/wPCNTC9mtdWb8MKJVr439g,human,Large,28,0,10705,382.321429,0.743961,4230,2337,3997,141,0,0,0,0,0,4.0,21.0
535,vlLrA8LGOcUxkQuGbs4TqA/LbQfqlh-Ihko3_Yii02dhQ,human,Medium,5,0,657,131.400000,0.264231,483,134,40,0,0,0,0,0,0,3.0,74.0
536,vpAJthlySeoTSTCzS0iH9w/co9Uzr_rNRVxqxS0x1UpqA,human-bot,Medium,4,1,143,35.750000,0.456294,119,10,14,0,95,94,1,0,0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
603,zTj5tlMWgotzJmQl7BP8wQ/iQ914_smScbUO8BI9JlE6A,human-bot,Small,3,1,855,285.000000,0.474854,423,59,373,0,26,0,26,0,0,,
604,zUBexdmYylGGpxiebXm6gg/sJXD2kulWzU35ijdY3SnBQ,human,Small,2,0,63,31.500000,0.436508,63,0,0,0,0,0,0,0,0,,
605,zVSBi-iRKCzLiqFwVt6hbg/8SfUBIOeWUjDoQxeUCX7wQ,human,Medium,5,0,26,5.200000,0.446154,19,5,2,0,0,0,0,0,0,,
606,zVSBi-iRKCzLiqFwVt6hbg/fm_lDWwc8Uu-aZ24BjUNZg,human,Medium,5,0,13,2.600000,0.246154,8,4,1,0,0,0,0,0,0,8.0,10.0
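
Without shuffling, `KFold` uses consecutive blocks of row positions as the test folds. The sketch below is a NumPy-only approximation of that behavior, not scikit-learn's implementation. It keeps every split in a list, so all eight folds stay available instead of only the last one.

```python
import numpy as np

n_rows, n_splits = 608, 8  # sizes taken from this notebook
indices = np.arange(n_rows)

# Consecutive blocks of positions serve as the test folds.
test_folds = np.array_split(indices, n_splits)

splits = []
for test_idx in test_folds:
    train_idx = np.setdiff1d(indices, test_idx)  # everything not in the test fold
    splits.append((train_idx, test_idx))

last_train, last_test = splits[-1]
print(last_test[0], last_test[-1], len(last_train))
```

With 608 rows and 8 folds, each test fold holds 76 rows, and the last one starts at position 532, which matches the first row shown in the test output above.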


In [424]:
# New DataFrame for numeric columns
# The numeric columns are a mix of integers and floats, so exclude the category columns instead
gh_teams_numeric = gh_teams.select_dtypes(exclude = ['category'])
gh_teams_numeric

Unnamed: 0,human_members_count,bot_members_count,human_work,work_per_human,human_gini,human_Push,human_IssueComments,human_PRReviewComment,human_MergedPR,bot_work,bot_Push,bot_IssueComments,bot_PRReviewComment,bot_MergedPR,eval_survival_day_median,issues_count
0,2,1,66,33.000000,0.287879,29,33,4,0,43,0,43,0,0,87.0,8.0
1,2,0,62,31.000000,0.467742,62,0,0,0,0,0,0,0,0,,
2,7,0,211,30.142857,0.499661,194,16,1,0,0,0,0,0,0,37.0,46.0
3,234,12,14579,62.303419,0.738342,1942,11430,1170,37,1972,0,1972,0,0,1.0,4757.0
4,38,8,1625,42.763158,0.666607,203,1270,152,0,302,0,302,0,0,2.0,777.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
603,3,1,855,285.000000,0.474854,423,59,373,0,26,0,26,0,0,,
604,2,0,63,31.500000,0.436508,63,0,0,0,0,0,0,0,0,,
605,5,0,26,5.200000,0.446154,19,5,2,0,0,0,0,0,0,,
606,5,0,13,2.600000,0.246154,8,4,1,0,0,0,0,0,0,8.0,10.0
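
Excluding `category` works here because every non-numeric column was converted earlier. If any text column were still stored as a string or object dtype, it would survive the exclusion. Selecting with `include="number"` asks for integers and floats directly. The sketch below shows the difference on made-up data with hypothetical column names.

```python
import pandas as pd

# Made-up data: "name" is deliberately left unconverted.
toy = pd.DataFrame({
    "name": ["a", "b"],
    "size": pd.Categorical(["Small", "Large"]),
    "count": [2, 7],
    "gini": [0.29, 0.50],
})

numeric_only = toy.select_dtypes(include="number")
not_category = toy.select_dtypes(exclude=["category"])

print(list(numeric_only.columns))
print(list(not_category.columns))
```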


### Removing Columns, Saving as New DataFrame

In [427]:
# Removing the columns `bot_PRReviewComment` and `bot_MergedPR` from the DataFrame.
# Checking columns first
gh_teams_numeric.columns

Index(['human_members_count', 'bot_members_count', 'human_work',
       'work_per_human', 'human_gini', 'human_Push', 'human_IssueComments',
       'human_PRReviewComment', 'human_MergedPR', 'bot_work', 'bot_Push',
       'bot_IssueComments', 'bot_PRReviewComment', 'bot_MergedPR',
       'eval_survival_day_median', 'issues_count'],
      dtype='object')

In [429]:
# Previewing the removal; drop() returns a new DataFrame and leaves gh_teams_numeric unchanged
gh_teams_numeric.drop(['bot_PRReviewComment', 'bot_MergedPR'], axis = 1)

Unnamed: 0,human_members_count,bot_members_count,human_work,work_per_human,human_gini,human_Push,human_IssueComments,human_PRReviewComment,human_MergedPR,bot_work,bot_Push,bot_IssueComments,eval_survival_day_median,issues_count
0,2,1,66,33.000000,0.287879,29,33,4,0,43,0,43,87.0,8.0
1,2,0,62,31.000000,0.467742,62,0,0,0,0,0,0,,
2,7,0,211,30.142857,0.499661,194,16,1,0,0,0,0,37.0,46.0
3,234,12,14579,62.303419,0.738342,1942,11430,1170,37,1972,0,1972,1.0,4757.0
4,38,8,1625,42.763158,0.666607,203,1270,152,0,302,0,302,2.0,777.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
603,3,1,855,285.000000,0.474854,423,59,373,0,26,0,26,,
604,2,0,63,31.500000,0.436508,63,0,0,0,0,0,0,,
605,5,0,26,5.200000,0.446154,19,5,2,0,0,0,0,,
606,5,0,13,2.600000,0.246154,8,4,1,0,0,0,0,8.0,10.0


In [431]:
# Saving the result as a new DataFrame in case we need it later
gh_teams_numeric_mod = gh_teams_numeric.drop(['bot_PRReviewComment', 'bot_MergedPR'], axis = 1)
gh_teams_numeric_mod

Unnamed: 0,human_members_count,bot_members_count,human_work,work_per_human,human_gini,human_Push,human_IssueComments,human_PRReviewComment,human_MergedPR,bot_work,bot_Push,bot_IssueComments,eval_survival_day_median,issues_count
0,2,1,66,33.000000,0.287879,29,33,4,0,43,0,43,87.0,8.0
1,2,0,62,31.000000,0.467742,62,0,0,0,0,0,0,,
2,7,0,211,30.142857,0.499661,194,16,1,0,0,0,0,37.0,46.0
3,234,12,14579,62.303419,0.738342,1942,11430,1170,37,1972,0,1972,1.0,4757.0
4,38,8,1625,42.763158,0.666607,203,1270,152,0,302,0,302,2.0,777.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
603,3,1,855,285.000000,0.474854,423,59,373,0,26,0,26,,
604,2,0,63,31.500000,0.436508,63,0,0,0,0,0,0,,
605,5,0,26,5.200000,0.446154,19,5,2,0,0,0,0,,
606,5,0,13,2.600000,0.246154,8,4,1,0,0,0,0,8.0,10.0
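
`drop` does not modify the DataFrame it is called on unless `inplace=True` is passed, which is why the result has to be assigned to a new name. The `columns=` keyword is an equivalent spelling of `axis = 1`. A small sketch on made-up data:

```python
import pandas as pd

# Made-up data with hypothetical column names.
toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

trimmed = toy.drop(columns=["b", "c"])  # same effect as drop([...], axis=1)

print(list(toy.columns))      # original keeps all three columns
print(list(trimmed.columns))  # new DataFrame has only "a"
```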


In [433]:
# Saving the columns `Team_size_class` and `human_members_count` as a new DataFrame
# by dropping every other column.
gh_teams_final = gh_teams.drop(['name_h', 'Team_type',
       'bot_members_count', 'human_work', 'work_per_human', 'human_gini',
       'human_Push', 'human_IssueComments', 'human_PRReviewComment',
       'human_MergedPR', 'bot_work', 'bot_Push', 'bot_IssueComments',
       'bot_PRReviewComment', 'bot_MergedPR', 'eval_survival_day_median',
       'issues_count'], axis = 1)
gh_teams_final

Unnamed: 0,Team_size_class,human_members_count
0,Small,2
1,Small,2
2,Large,7
3,Large,234
4,Large,38
...,...,...
603,Small,3
604,Small,2
605,Medium,5
606,Medium,5


### Renaming Columns

In [436]:
# Assigning the full list of names is practical here because the DataFrame has only 2 columns
gh_teams_final.columns = ['Size_of_team', 'Num_humans']
gh_teams_final

Unnamed: 0,Size_of_team,Num_humans
0,Small,2
1,Small,2
2,Large,7
3,Large,234
4,Large,38
...,...,...
603,Small,3
604,Small,2
605,Medium,5
606,Medium,5
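
For wider DataFrames, `rename` with a dictionary changes only the named columns and leaves the rest alone, so the full list of names does not need to be retyped. The sketch below uses one made-up row; the new names match the ones chosen above.

```python
import pandas as pd

# One made-up row under three of the lab's column names.
toy = pd.DataFrame({
    "Team_size_class": ["Small"],
    "human_members_count": [2],
    "bot_work": [43],
})

# Map old name -> new name; columns not listed keep their names.
renamed = toy.rename(columns={
    "Team_size_class": "Size_of_team",
    "human_members_count": "Num_humans",
})

print(list(renamed.columns))
```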
