# Class 3 - Processing Files with Python
### DATSF 19
#### Written by Justin Breucop & Rob Hall

For many data files in this class, we'll rely on libraries to process data quickly. However, for custom files, raw text, and data laid out in a non-standard way, it is important to be able to extract data yourself. We'll work through this exercise using only libraries that ship with the default Python distribution. The first step is to open the file in Sublime Text.

Let's say that we are curious about the latest release of scikit-learn, since we are (or soon will be) frequent users. Our goal is to take the raw commits, sort the authors alphabetically, and count the number of contributions each one made. Let's first look at the file. You can do this from the command line, but for simplicity we can use Jupyter's `!` shell escape.

In [5]:
# For Mac/Linux users:
# ! more ../data/raw_commits.txt

# For Windows users:
# ! more ..\data\raw_commits.txt
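
If the shell commands are unavailable, the same preview can be done in plain Python. The following is a minimal sketch: `head` is a hypothetical helper name introduced here (not part of any library), and the path in the commented example is assumed to match the one used in this class.

```python
from itertools import islice

def head(path, n=10):
    """Return the first n lines of a text file without reading the whole file."""
    with open(path) as f:
        # islice stops after n lines, so large files are not loaded into memory
        return [line.rstrip('\n') for line in islice(f, n)]

# Example (assumes the file exists at this relative path):
# for line in head('../data/raw_commits.txt', 5):
#     print(line)
```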

We see that each commit has an author and a date. We need to read the file line by line and add each author to a list. Remember to use `with open('<filename>') as <variable>`, where `<filename>` is the path to the file (absolute or relative) and `<variable>` is any identifier (such as `f`).

In [6]:
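auths = []
with open('../data/raw_commits.txt') as f:
    for line in f:
        # Only lines that start with 'Author' carry a contributor name
        if line.startswith('Author'):
            # Skip the 'Author: ' prefix and drop everything from the email onward
            author = line[8:].split('<')[0]
            # Ignore malformed entries whose name begins with '='
            if not author.startswith('='):
                auths.append(author)
print(len(auths))
print(len(set(auths)))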

1259
177
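
To see what the slicing and `split` do, it can help to run them on a single line. The sketch below assumes author lines have the form `Author: Name <email>`; the sample line is made up for illustration, and `parse_author` is a helper name introduced here. Unlike the cell above, it also strips surrounding whitespace, which would remove the trailing space visible in the printed names.

```python
def parse_author(line):
    """Return the author name from an 'Author: Name <email>' line, or None."""
    prefix = 'Author: '
    if not line.startswith(prefix):
        return None
    # Keep the text between the prefix and the opening '<' of the email
    name = line[len(prefix):].split('<')[0].strip()
    # Treat empty names and names starting with '=' as malformed
    if not name or name.startswith('='):
        return None
    return name

sample = 'Author: Jane Doe <jane@example.com>\n'  # hypothetical line
print(parse_author(sample))
```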


Sort the authors to find the first and last authors alphabetically. Make sure your data is clean! For example, no username should begin with an `=` sign.

In [13]:
uniques = sorted(set(auths))
print(uniques[0])
print(uniques[-1])

Aaron Schumacher 
Óscar Nájera 


In this example, Óscar Nájera lands at the end because the accented Ó is stored as the UTF-8 bytes `\xc3\x93`, which sort after every ASCII letter. Handling Unicode can be tricky, but for this exercise, finding Óscar is a sufficient answer. `uniques[-2]` gives the last of the plain-ASCII names.

In [14]:
uniques[-2]

'zhai_pro '
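
One possible way to get an ordering closer to what a reader expects is to sort with a key that ignores accents and case. This is a sketch under assumptions: it targets Python 3, where strings are Unicode (the cells above show Python 2 byte strings), the three names are a small sample taken from the outputs above, and `sort_key` is a helper name introduced here. It is not a full locale-aware collation.

```python
import unicodedata

def sort_key(name):
    """Build a comparison key that ignores accents and letter case."""
    # Decompose accented characters (e.g. 'Ó' becomes 'O' plus a combining accent)
    decomposed = unicodedata.normalize('NFKD', name)
    # Drop the combining marks, then fold case
    stripped = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

names = ['zhai_pro', 'Óscar Nájera', 'Aaron Schumacher']
plain = sorted(names)                 # code point order: 'Ó' sorts after 'z'
folded = sorted(names, key=sort_key)  # accent- and case-insensitive order
print(plain[-1])
print(folded[-1])
```

With the plain sort the accented name comes last; with the key, it sorts among the names starting with "o".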

To count commits per author, we can loop over our list and build a dictionary where each key is a commit author and its value increases every time we see that author again.

In [16]:
counts = {}
for auth in auths:
    # Membership tests on a dict check its keys directly
    if auth in counts:
        counts[auth] += 1
    else:
        counts[auth] = 1
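
The standard library also offers `collections.Counter`, which wraps the same counting pattern. The sketch below uses a made-up list (`sample_auths`) rather than the real commit data, so the numbers are illustrative only.

```python
from collections import Counter

# Made-up data standing in for the list of authors
sample_auths = ['Ann', 'Bob', 'Ann', 'Cy', 'Ann', 'Bob']

# Counter builds the author -> count mapping in one step
sample_counts = Counter(sample_auths)

print(sample_counts['Ann'])          # number of entries for one author
print(sample_counts.most_common(1))  # list of (author, count) pairs, highest first
```

A `Counter` behaves like a dictionary, so the loops in the rest of this exercise would work on it unchanged.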

Find the contributor with the highest number of commits.

In [17]:
max_key = ''
for key in counts:
    # Default to 0 so the first comparison is never made against None
    if counts[key] > counts.get(max_key, 0):
        max_key = key
print(max_key)

Andreas Mueller 
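
The same lookup can be written with the built-in `max` and a `key` function. This is a sketch on made-up data (`sample_counts` is a stand-in dictionary, not the real counts).

```python
# Made-up data standing in for the author -> count dictionary
sample_counts = {'Ann': 3, 'Bob': 2, 'Cy': 1}

# Iterating a dict yields its keys; key=sample_counts.get ranks them by value
top = max(sample_counts, key=sample_counts.get)
print(top)
```

Note that `max` returns only one key. If several authors share the highest count, it returns the first one it encounters and the others go unreported.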


Bonus: how do you handle a tie? Can you pull all authors with the lowest number of commits, without hardcoding the minimum?

In [23]:
# Find the minimum from the data, then collect every author who matches it
min_count = min(counts.values())
key_list = [key for key in counts if counts[key] == min_count]
print(key_list)
print('\nAll contributors made {} contribution(s)'.format(min_count))

['Arnaud Rachez ', 'Eric Larson ', 'MaryanMorel ', 'JeanKossaifi ', 'Rob Zinkov ', 'Raghav ', 'Timothy Hopper ', 'Jiali Mei ', 'Dmitry Spikhalskiy ', 'John Kirkham ', 'Jungkook Park ', 'Tian Wang ', 'Eduardo Caro ', 'Tiago Freitas Pereira ', 'Anish Shah ', 'Jean Kossaifi ', 'Christopher Erick Moody ', 'Omer Katz ', 'sseg ', 'akitty ', 'Erich Schubert ', 'Jeffrey04 ', 'Sam Zhang ', 'Frank C. Eckert ', 'Christoph Gohlke ', 'Jaidev Deshpande ', 'Theodore Vasiloudis ', 'banilo ', 'Dougal Sutherland ', 'Yucheng Low ', 'Ali Baharev ', 'Masafumi Oyamada ', 'Kyler Brown ', 'Christof Angermueller ', 'Ishank Gulati ', 'santi ', 'Kashif Rasul ', 'Joseph ', 'Dan Blanchard ', 'Aaron Schumacher ', 'Nikolay Mayorov ', 'Eric Martin ', 'Robert Layton ', 'David ', 'Nicolas ', 'Rohan Ramanath ', 'Valentin Stolbunov ', 'KamalakerDadi ', 'saurabh.bansod ', 'Alexey Grigorev ', 'benjaminirving ', 'Tom Dupr\xc3\xa9 la Tour ', 'Pauli Virtanen ', 'Yury Zhauniarovich ', 'Tom DLT ', 'Konstantin Shmelkov ', 'Ando 