## Recent Grad Analysis
Analysis of `fivethirtyeight`'s recent graduate data, for an application to the Center for Academic Innovation's Data Science Fellowship.

### 1. Download the Data
We can do this from Jupyter, using the Subversion command-line tool (`svn`) to grab only the folder we need from the repo:

In [14]:
!svn export https://github.com/fivethirtyeight/data/trunk/college-majors

A    college-majors
A    college-majors/all-ages.csv
A    college-majors/college-majors-rscript.R
A    college-majors/grad-students.csv
A    college-majors/majors-list.csv
A    college-majors/readme.md
A    college-majors/recent-grads.csv
A    college-majors/women-stem.csv
Exported revision 1092.
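
If `svn` is not installed or the export fails, the same files could be fetched from Python instead. The following is a minimal sketch, not the method used in this notebook: the raw-file URL layout and the `master` branch name are assumptions, and `raw_url` and `download_all` are hypothetical helper names. The file names are taken from the export listing above.

```python
import os
import urllib.request

# Assumption: raw files are served under this base URL on the "master" branch.
RAW_BASE = "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors"
FILES = ["all-ages.csv", "grad-students.csv", "majors-list.csv",
         "recent-grads.csv", "women-stem.csv"]

def raw_url(filename, base=RAW_BASE):
    """Build the download URL for one file in the folder."""
    return "{}/{}".format(base.rstrip("/"), filename)

def download_all(filenames=FILES, dest="college-majors",
                 fetch=urllib.request.urlretrieve):
    """Download each file into dest, skipping files that already exist."""
    os.makedirs(dest, exist_ok=True)
    saved = []
    for name in filenames:
        target = os.path.join(dest, name)
        if not os.path.exists(target):
            fetch(raw_url(name), target)
        saved.append(target)
    return saved
```

Calling `download_all()` would populate `college-majors/` with the CSV files only. The `fetch` parameter exists so the download step can be swapped out or stubbed when no network is available.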


We can then see that our data is there:

In [16]:
ls -al college-majors/

total 116
drwxrwxrwx 1 luclepot luclepot  4096 Sep 14 15:55 ./
drwxrwxrwx 1 luclepot luclepot  4096 Sep 14 15:55 ../
-rwxrwxrwx 1 luclepot luclepot 17902 Feb  9  2018 all-ages.csv*
-rwxrwxrwx 1 luclepot luclepot  9386 Feb  9  2018 college-majors-rscript.R*
-rwxrwxrwx 1 luclepot luclepot 31937 Feb  9  2018 grad-students.csv*
-rwxrwxrwx 1 luclepot luclepot  8558 Feb  9  2018 majors-list.csv*
-rwxrwxrwx 1 luclepot luclepot  2634 Feb  9  2018 readme.md*
-rwxrwxrwx 1 luclepot luclepot 26872 Feb  9  2018 recent-grads.csv*
-rwxrwxrwx 1 luclepot luclepot  6445 Feb  9  2018 women-stem.csv*
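
The same check can be done from Python, which avoids depending on the shell. This is a small sketch using only the standard library; `list_csvs` is a hypothetical helper name, not part of the notebook.

```python
from pathlib import Path

def list_csvs(folder="college-majors"):
    """Return (name, size in bytes) for every .csv file in folder, sorted by name."""
    return sorted((p.name, p.stat().st_size) for p in Path(folder).glob("*.csv"))

# Example use in the notebook:
#     for name, size in list_csvs():
#         print(name, size)
```

Only the `.csv` files are reported, so the R script and the readme are left out of the listing.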


### 2. Load/Clean
We'll use pandas for these datasets; they are tabular CSV files, which load naturally into DataFrames.

We only really need the 'all-ages' dataset right now, but we can write a generic function to load CSV data properly.

In [43]:
import pandas as pd
import os
import glob

def load(path):
    """Notebook-specific data loading function. Loads a single .csv file selected by a glob-style
    path, with priority given to the data folder we're focusing on ('college-majors', in this case).

    Args:
        path (str): glob-style path to the file we want to load. The '.csv' extension is appended
            if it is missing. Must match exactly one .csv file.

    Returns:
        pandas.DataFrame: Loaded dataframe for the selected .csv file.

    Raises:
        FileNotFoundError: If no file matches the path.
        ValueError: If more than one file matches the path.
    """
    # we only want .csv files, so add the extension to our pathspec if it's not there
    if not path.endswith(".csv"):
        path += ".csv"
    # data will normally be in our college-majors directory, so check there first
    candidates = glob.glob(os.path.join("college-majors", path))
    # otherwise, fall back to the path exactly as given
    if not candidates:
        candidates = glob.glob(path)
    # complain if we don't find anything
    if not candidates:
        raise FileNotFoundError("No datasets found with path matching '{}'".format(path))
    # ... and if we find multiple datasets
    if len(candidates) > 1:
        raise ValueError("Multiple datasets found with path matching '{}'. Matches include: {}".format(path, candidates))
    selected = candidates[0]
    # helpful message, then return the loaded dataframe
    print("Loading CSV file at path '{}'".format(selected))
    return pd.read_csv(os.path.abspath(selected))
    
data = load("all-ages")

Loading CSV file at path 'college-majors/all-ages.csv'
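
To make the matching rules concrete, here is a sketch of the lookup step on its own, run against a throwaway directory rather than the real data. `resolve_csv` is a hypothetical helper that mirrors the path handling in `load` but is simplified: it skips pandas entirely and raises a single `ValueError` whenever the pattern does not match exactly one file.

```python
import glob
import os
import tempfile

def resolve_csv(pattern, data_dir="college-majors"):
    """Hypothetical helper: return the single .csv path matching pattern."""
    if not pattern.endswith(".csv"):
        pattern += ".csv"
    # Look in the data directory first, then fall back to the pattern as given.
    matches = glob.glob(os.path.join(data_dir, pattern)) or glob.glob(pattern)
    if len(matches) != 1:
        raise ValueError(
            "expected exactly one match for '{}', found {}".format(pattern, len(matches))
        )
    return matches[0]

# Demonstrate on a temporary directory holding two empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("all-ages.csv", "recent-grads.csv"):
        open(os.path.join(tmp, name), "w").close()
    print(os.path.basename(resolve_csv("all-ages", tmp)))  # bare name, unique match
    print(os.path.basename(resolve_csv("recent-*", tmp)))  # wildcard, still unique
    try:
        resolve_csv("*", tmp)  # wildcard matching both files
    except ValueError as err:
        print("rejected:", err)
```

The point of requiring exactly one match is that a loose wildcard fails loudly instead of silently loading whichever file happens to come first.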


In [44]:
data.shape

(173, 11)
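
The shape tells us the table has 173 rows and 11 columns. Before going further, a few more basic checks are worth running on any freshly loaded frame. The sketch below is one possible set of checks, not a step from the original notebook; `summarize` is a hypothetical helper name.

```python
import pandas as pd

def summarize(df):
    """Return basic health checks for a DataFrame as a plain dict."""
    return {
        "shape": df.shape,                                 # (rows, columns)
        "missing_cells": int(df.isna().sum().sum()),       # NaN/None count over all cells
        "duplicate_rows": int(df.duplicated().sum()),      # rows identical to an earlier row
        "dtypes": df.dtypes.astype(str).to_dict(),         # column name -> dtype name
    }

# Example use in the notebook:
#     summarize(data)
```

Returning a dict rather than printing keeps the result easy to inspect or assert against in a later cell.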