#Class 3 - Processing Files with Python
###DATSF 19
####Justin Breucop - 12/7/2015

For many of the data files in class, we'll use functionality from various libraries to process data very quickly. However, for custom files, raw text, and data laid out in a non-standard way, it is important to be able to extract data in a customized fashion. We'll go through this exercise using only libraries that come with the default Python distribution. The first step will be to open the file in Sublime.

Let's say that we are curious about the latest release of scikit-learn, since we are (or soon will be) frequent users. Our goal is to take the raw commits, sort the authors alphabetically, and count the number of contributions each one made. Let's first look at the file. You can do this from the command line, but for simplicity's sake we can use Jupyter's `!` shell escape.

In [73]:
# For Mac/Linux users:
! more ../data/raw_commits.txt

commit da4f480a6adf5fed30a42500fe0e5a21c404ac2a
Author: Andreas Mueller <amueller@nyu.edu>
Date:   Thu Nov 5 14:57:45 2015 -0500

    Fix import of reload for python 3.3

commit 45ef71f2175fe305152e20b1a6095c535b575b84
Author: Andreas Mueller <amueller@nyu.edu>
Date:   Thu Nov 5 14:31:45 2015 -0500

    MAINT version string for 0.17. D'OH

commit 37d18cef59a614661eb5afbadb9f8e1e124d685e
Author: Andreas Mueller <amueller@nyu.edu>
Date:   Wed Nov 4 14:28:25 2015 -0500

    split installation into simple and advanced part

commit 9334274305e8b9ef0273835a8a6b53ed0c1810c0
Author: Andreas Mueller <t3kcit@gmail.com>
Date:   Thu Nov 5 10:36:17 2015 -0500

    skip unstable tests and doctests  on 32bit platform

commit a4122f0f414d5f750259e8e0f12912984e505c20
Author: Andreas Mueller <amueller@nyu.edu>
Date:   Wed Nov 4 21:43:26 2015 -0500

    More doc fixes. Latex builds again.

commit 812c3a4e5d467be63262367a937edd625dea8dbd
Author: Andreas Mueller <amueller@nyu.edu>
Date:   Wed Nov 4 19:18:0

We see that each commit has an author and a date. We need to be able to read the file line by line and add to a list of authors. Remember to use `with open('<filename>') as <variable>` where `<filename>` is the path to the file (absolute, or relative to the notebook's working directory) and `<variable>` is any identifier (such as `f`).

##### Lines of file -> List of Strings

In [74]:
# Loop through each line of the txt file. If the line contains the text "Author:",
# add that line to a list as a string. Print the result at index 0.

authors = []

with open('../data/raw_commits.txt', 'r') as data:
    for line in data:
        if 'Author:' in line:
            authors.append(line)

print authors[0]

Author: Andreas Mueller <amueller@nyu.edu>



In [75]:
# Split each line at every space and add each string to a list

authors = []

with open('../data/raw_commits.txt', 'r') as data:
    for line in data:
        if 'Author:' in line:
            authors.append(line.split(' '))

print authors[0:3]

[['Author:', 'Andreas', 'Mueller', '<amueller@nyu.edu>\n'], ['Author:', 'Andreas', 'Mueller', '<amueller@nyu.edu>\n'], ['Author:', 'Andreas', 'Mueller', '<amueller@nyu.edu>\n']]


In [76]:
# Keep only the items between the first ('Author:') and the last (the email address)

authors = []

with open('../data/raw_commits.txt', 'r') as data:
    for line in data:
        if 'Author:' in line:
            authors.append(line.split(' ')[1:-1])

print authors[0:5]

[['Andreas', 'Mueller'], ['Andreas', 'Mueller'], ['Andreas', 'Mueller'], ['Andreas', 'Mueller'], ['Andreas', 'Mueller']]


In [77]:
# Join the name parts into a single string for each author

authors = []

with open('../data/raw_commits.txt', 'r') as data:
    for line in data:
        if 'Author:' in line:
            authors.append(' '.join(line.split(' ')[1:-1]))

print authors[0:5]
    
# Make sure to append the author name to the list. You'll need to use string manipulation techniques.

['Andreas Mueller', 'Andreas Mueller', 'Andreas Mueller', 'Andreas Mueller', 'Andreas Mueller']


In [85]:
# Skip names that begin with a "="
authors = []

with open('../data/raw_commits.txt', 'r') as data:
    for line in data:
        if 'Author:' in line:
            name = ' '.join(line.split(' ')[1:-1])
            # Test the extracted name, not the whole line, for a leading "="
            if not name.startswith('='):
                authors.append(name)

print authors[0:5]
    

['Andreas Mueller', 'Andreas Mueller', 'Andreas Mueller', 'Andreas Mueller', 'Andreas Mueller']
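
One caveat with the `'Author:' in line` test is that it matches that text anywhere in the line, including inside an indented commit message. Here is a minimal sketch of a stricter helper, assuming author lines always start with `Author:` as in the excerpt above. The function name `extract_author` and the malformed sample line are hypothetical, invented for illustration.

```python
# Sample lines copied from the log excerpt above, plus one made-up
# malformed entry to exercise the "=" filter.
sample_lines = [
    'commit da4f480a6adf5fed30a42500fe0e5a21c404ac2a\n',
    'Author: Andreas Mueller <amueller@nyu.edu>\n',
    'Date:   Thu Nov 5 14:57:45 2015 -0500\n',
    '    Fix import of reload for python 3.3\n',
    'Author: =hypothetical= <nobody@example.com>\n',  # made-up line
]


def extract_author(line):
    """Return the author name from an 'Author:' line, or None otherwise."""
    if not line.startswith('Author:'):
        return None
    # Drop the label (first item) and the email (last item).
    name = ' '.join(line.split(' ')[1:-1])
    if name.startswith('='):
        return None
    return name


# Keep only the lines that produced a usable name.
authors_sample = [extract_author(line) for line in sample_lines]
authors_sample = [name for name in authors_sample if name is not None]

print(authors_sample)
```

In the notebook, the same helper could be called on each `line` inside the `with open(...)` loop instead of on `sample_lines`.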


Sort the authors to find the first and last authors alphabetically. Make sure your data is clean! (No username should begin with an = sign, for example.)

#####List of Strings -> Sorted unique list

In [1]:
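# Run the previous cell first so that `authors` is defined.
sorted_authors = sorted(authors)
print sorted_authors[0]   # first author alphabetically
print sorted_authors[-1]  # last author alphabetically
# Hint: sorted(set(authors)) gives a sorted list with duplicates removed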

NameError: name 'authors' is not defined

In [79]:
# Think of what data types you can take advantage of
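
One data type that fits here is the `set`, which keeps a single copy of each value. The sketch below uses a short made-up list (the `Example` names are placeholders, not real contributors) in place of the `authors` list built from the file.

```python
# Made-up sample; in the notebook this would be the `authors` list.
authors_sample = ['Carol Example', 'Andreas Mueller', 'Bob Example',
                  'Andreas Mueller', 'Carol Example']

# A set drops the duplicates; sorted() turns it back into an ordered list.
unique_authors = sorted(set(authors_sample))

print(unique_authors[0])    # first author alphabetically
print(unique_authors[-1])   # last author alphabetically
```

Note that `sorted()` compares strings character by character, so a name starting with an uppercase letter sorts before one starting with a lowercase letter.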

To count our data, we can loop over our list and build a dictionary where each key is a commit author and the value is incremented every time that author appears.

#####List -> Dictionary
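
A minimal sketch of the counting loop, again on a made-up list with placeholder `Example` names rather than the real commit data. `dict.get(key, default)` returns the default when the key is missing, which avoids a separate "is this author already in the dictionary?" check.

```python
# Made-up sample; in the notebook this would be the `authors` list.
authors_sample = ['Carol Example', 'Andreas Mueller', 'Bob Example',
                  'Andreas Mueller', 'Carol Example', 'Andreas Mueller']

commit_counts = {}
for author in authors_sample:
    # Start from 0 the first time an author is seen, then add one.
    commit_counts[author] = commit_counts.get(author, 0) + 1

print(commit_counts)
```

The standard library's `collections.Counter` does the same job in one call, but writing the loop by hand shows what is happening.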

Find the contributor with the highest number of commits. Useful dictionary method: `dict.get()`
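
One possible approach, sketched on a made-up dictionary of counts: `max()` iterates over the dictionary's keys, and passing `dict.get` as the `key` function makes it compare the keys by their values.

```python
# Made-up counts; in the notebook this would be the dictionary built from `authors`.
commit_counts = {'Andreas Mueller': 3, 'Bob Example': 1, 'Carol Example': 2}

# Compare authors by their commit count rather than by name.
top_author = max(commit_counts, key=commit_counts.get)

print('%s: %d commits' % (top_author, commit_counts[top_author]))
```

If several authors share the highest count, `max()` returns only one of them.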

#####Dictionary -> Specific String

Bonus: how do you handle a tie? Can you pull all authors with the lowest number of commits (without hardcoding the minimum)?
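
A sketch of one way to approach the bonus, using made-up counts with a deliberate tie: compute the minimum from the data, then keep every author whose count equals it.

```python
# Made-up counts with a tie for the lowest value.
commit_counts = {'Andreas Mueller': 3, 'Bob Example': 1,
                 'Carol Example': 2, 'Dan Example': 1}

# Find the smallest count from the data instead of hardcoding it.
fewest = min(commit_counts.values())

# Collect every author tied at that count, sorted for a stable order.
least_active = sorted(author for author, count in commit_counts.items()
                      if count == fewest)

print(least_active)
```

Swapping `min` for `max` gives every author tied for the highest count.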