# 1: How Challenges Work

At Dataquest, we're huge believers in learning through doing, and we hope this shows in your learning experience. While missions focus on introducing concepts, challenges allow you to perform deliberate practice by completing structured problems. You can read more about deliberate practice here and here. Challenges will feel similar to missions, but with little instructional material and a greater focus on exercises.

If you have questions or run into issues, head over to the Dataquest forums or our Slack community.

# 2: Introduction to the Data

The American Community Survey is a U.S. Census Bureau survey that collects data on everything from housing affordability to industry employment rates. For this challenge, you'll be using the data that the team at FiveThirtyEight derived from the 2010-2012 American Community Surveys. FiveThirtyEight cleaned the data set and made it available in a GitHub repository.

Here's a quick overview of the files we'll be working with:

    all-ages.csv - Employment data by major for all ages
    recent-grads.csv - Employment data by major for recent college graduates only

Here are descriptions for a few of the columns (out of 21 total columns):

    Rank - The major's numerical rank, by post-graduation median earnings
    Major_code - The major's numerical code
    Major - The major's description
    Major_category - The major's category
    Total - The total number of people who studied the major
    Men - The number of men who studied the major
    Women - The number of women who studied the major
    ShareWomen - The share of women (from 0 to 1) who studied the major
    Employed - The number of people who studied the major and obtained a job after graduating

Here are the first few rows and columns in recent-grads.csv. The data set all-ages.csv has the same structure, but with different values for some of the columns:

By completing this challenge, you'll test your comfort level with using pandas to manipulate DataFrames and calculate summary statistics. First, we'll need to read the data set into pandas:

In [2]:
import pandas as pd
all_ages = pd.read_csv( "../data/all-ages.csv" )
print( all_ages.head( 5 ) )

   Major_code                                  Major  \
0        1100                    GENERAL AGRICULTURE   
1        1101  AGRICULTURE PRODUCTION AND MANAGEMENT   
2        1102                 AGRICULTURAL ECONOMICS   
3        1103                        ANIMAL SCIENCES   
4        1104                           FOOD SCIENCE   

                    Major_category   Total  Employed  \
0  Agriculture & Natural Resources  128148     90245   
1  Agriculture & Natural Resources   95326     76865   
2  Agriculture & Natural Resources   33955     26321   
3  Agriculture & Natural Resources  103549     81177   
4  Agriculture & Natural Resources   24280     17281   

   Employed_full_time_year_round  Unemployed  Unemployment_rate  Median  \
0                          74078        2423           0.026147   50000   
1                          64240        2266           0.028636   54000   
2                          22810         821           0.030248   63000   
3                         

# 3: Introduction to the Data

## Instructions

    Read all-ages.csv into a DataFrame object, and assign it to all_ages.
    Read recent-grads.csv into a DataFrame object, and assign it to recent_grads.
    Display the first five rows of all_ages and recent_grads.


In [1]:
all_ages = pd.read_csv( "../data/all-ages.csv" )
recent_grads = pd.read_csv( "../data/recent-grads.csv" )

print( all_ages.head( 5 ) )
print( recent_grads.head( 5 ) )

   Major_code                                  Major  \
0        1100                    GENERAL AGRICULTURE   
1        1101  AGRICULTURE PRODUCTION AND MANAGEMENT   
2        1102                 AGRICULTURAL ECONOMICS   
3        1103                        ANIMAL SCIENCES   
4        1104                           FOOD SCIENCE   

                    Major_category   Total  Employed  \
0  Agriculture & Natural Resources  128148     90245   
1  Agriculture & Natural Resources   95326     76865   
2  Agriculture & Natural Resources   33955     26321   
3  Agriculture & Natural Resources  103549     81177   
4  Agriculture & Natural Resources   24280     17281   

   Employed_full_time_year_round  Unemployed  Unemployment_rate  Median  \
0                          74078        2423           0.026147   50000   
1                          64240        2266           0.028636   54000   
2                          22810         821           0.030248   63000   
3                         

# 4: Summarizing Major Categories

Both of these data sets group the various majors into categories in the Major_category column. Let's start by understanding the number of people in each Major_category for both data sets.

To do so, you'll need to:

    Return the unique values in Major_category.
        Use the Series.unique() method to return the unique values in a column, like this: recent_grads['Major_category'].unique()
    For each unique value:
        Return all of the rows where Major_category equals that unique value.
        Calculate the total number of students those rows represent (using the Total column).
            Use the Series.sum() method to calculate the sum of the values in a column. For example, recent_grads['Total'].sum() returns the sum of the values in the Total column.
        Keep track of the totals by adding the Major_category value and the total number of students to a dictionary.
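
The steps above can be sketched on a small, made-up DataFrame before running them on the real files. This is only an illustration: the rows in `toy` are invented, and `category_totals` is a hypothetical helper name, not part of the challenge.

```python
import pandas as pd

# Toy data for illustration only; these values are made up,
# not taken from all-ages.csv or recent-grads.csv.
toy = pd.DataFrame({
    "Major_category": ["Arts", "Engineering", "Arts", "Engineering", "Health"],
    "Total": [100, 250, 50, 150, 300],
})

def category_totals(df):
    """Sum the Total column for each unique Major_category (hypothetical helper)."""
    totals = {}
    for category in df["Major_category"].unique():
        # Boolean filtering keeps only the rows for this category.
        rows = df[df["Major_category"] == category]
        # Series.sum() adds up the Total values in those rows.
        totals[category] = rows["Total"].sum()
    return totals

toy_counts = category_totals(toy)
print(toy_counts)
```

If you want to double-check a result built this way, `df.groupby("Major_category")["Total"].sum().to_dict()` should produce the same mapping in a single call.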

## Instructions

    Use the Total column to calculate the number of people who fall under each Major_category in each data set.
        Store the result as a separate dictionary for each data set.
        The key for the dictionary should be the Major_category, and the value should be the total count.
        For the counts from all_ages, store the results as a dictionary named aa_cat_counts.
        For the counts from recent_grads, store the results as a dictionary named rg_cat_counts.


In [3]:
aa_cat_counts = dict()
rg_cat_counts = dict()

rg_major_list = recent_grads['Major_category'].unique()
for rg_major in rg_major_list:
    rg_major_df = recent_grads[ recent_grads["Major_category"] == rg_major ]
    sum_rg = rg_major_df['Total'].sum()
    rg_cat_counts[ rg_major ] = sum_rg
    
aa_major_list = all_ages['Major_category'].unique()
for aa_major in aa_major_list:
    aa_major_df = all_ages[ all_ages["Major_category"] == aa_major ]
    sum_aa = aa_major_df['Total'].sum()
    aa_cat_counts[ aa_major ] = sum_aa
    
print( aa_cat_counts )

{'Arts': 1805865, 'Social Science': 2654125, 'Interdisciplinary': 45199, 'Industrial Arts & Consumer Services': 1033798, 'Computers & Mathematics': 1781378, 'Communications & Journalism': 1803822, 'Humanities & Liberal Arts': 3738335, 'Engineering': 3576013, 'Biology & Life Science': 1338186, 'Health': 2950859, 'Law & Public Policy': 902926, 'Physical Sciences': 1025318, 'Education': 4700118, 'Agriculture & Natural Resources': 632437, 'Business': 9858741, 'Psychology & Social Work': 1987278}


# 5: Low-Wage Job Rates

The press likes to talk about the number of college graduates working low-pay, unskilled jobs because they can't find better ones. As a data person, you should be skeptical of any broad claims, and analyze relevant data to obtain a more nuanced view.

Let's run some basic calculations to explore that idea further.

## Instructions

    Use the Low_wage_jobs and Total columns to calculate the proportion of recent college graduates that worked low wage jobs.
    Recall that you can use the Series.sum() method to return the sum of the values in a column.
    Store the resulting float as low_wage_percent, and display the value with the print() function.
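
As a rough sketch of the calculation, here is the same ratio computed on invented numbers. The `toy` DataFrame and the name `toy_low_wage_share` are assumptions made for illustration and do not come from the data set.

```python
import pandas as pd

# Made-up counts for three majors (illustration only).
toy = pd.DataFrame({
    "Total": [1000, 500, 1500],
    "Low_wage_jobs": [100, 25, 175],
})

# Sum each column first, then divide the two sums.
low_wage_total = toy["Low_wage_jobs"].sum()
all_total = toy["Total"].sum()
toy_low_wage_share = low_wage_total / all_total

print(toy_low_wage_share)
```

Dividing the two column sums weights every major by its size. Averaging the per-major ratios instead would treat small and large majors equally and generally gives a different number.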


In [7]:
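# Total number of recent graduates in low-wage jobs, across all majors
low_wage_job = recent_grads['Low_wage_jobs'].sum()
# Total number of recent graduates, across all majors
total_jobs = recent_grads['Total'].sum()
# Proportion (from 0 to 1) of recent graduates working low-wage jobs
low_wage_percent = low_wage_job / total_jobs
print( low_wage_percent )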

0.0985889119556


# 6: Comparing Data Sets

It looks like only about 9.86% of recent graduates took on a low-wage job after finishing college.

Both the all_ages and recent_grads data sets have 173 rows, corresponding to the 173 college major codes. This enables us to do some comparisons between the two data sets, and perform some initial calculations to see how the statistics for recent college graduates compare with those for the entire population.

Next, let's calculate the number of majors where recent graduates did better than the overall population.

## Instructions

    Use a for loop to iterate over majors.
    For each major, use Boolean filtering to find the corresponding row in both DataFrames.
    Compare the values for Unemployment_rate to see which DataFrame has a lower value.
    Increment rg_lower_count if the value for Unemployment_rate is lower for recent_grads than it is for all_ages.

    Display rg_lower_count with the print() function.
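
The instructions call for a for loop, and the solution should use one. As a way to cross-check the count afterwards, one possible alternative is to line the two DataFrames up with `DataFrame.merge` and compare the rate columns directly. The sketch below uses invented rates, and the `_rg` and `_aa` suffixes are naming choices made here for illustration.

```python
import pandas as pd

# Made-up unemployment rates for three majors (illustration only).
toy_recent = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Unemployment_rate": [0.05, 0.08, 0.03],
})
toy_all = pd.DataFrame({
    "Major": ["C", "A", "B"],  # row order differs on purpose
    "Unemployment_rate": [0.04, 0.04, 0.09],
})

# Align rows by Major; the suffixes tell the two rate columns apart.
merged = toy_recent.merge(toy_all, on="Major", suffixes=("_rg", "_aa"))

# The comparison yields a Boolean Series; summing it counts the True values.
is_lower = merged["Unemployment_rate_rg"] < merged["Unemployment_rate_aa"]
toy_lower_count = int(is_lower.sum())

print(toy_lower_count)
```

Merging on a shared column avoids relying on the two files listing the majors in the same order.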


In [8]:
# All majors, common to both DataFrames
majors = recent_grads['Major'].unique()
rg_lower_count = 0

for major in majors:
    rg_major_df = recent_grads[ recent_grads["Major"] == major ]
    aa_major_df = all_ages[ all_ages["Major"] == major ]
    # Each filter is expected to match a single row, so take its first value
    rg_rate = rg_major_df['Unemployment_rate'].iloc[0]
    aa_rate = aa_major_df['Unemployment_rate'].iloc[0]
    if rg_rate < aa_rate:
        rg_lower_count = rg_lower_count + 1
        
print ( rg_lower_count )

44
