# Pew Research Center's Religious Landscape Study

In this notebook you will clean and wrangle a dataset from the Pew Research Center's [Religious Landscape Study](http://www.pewforum.org/religious-landscape-study/). The copy of the dataset used here comes from Hadley Wickham's [Tidy Data Repository](https://github.com/hadley/tidy-data).

## Imports

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

import numpy as np
import pandas as pd

## Read in the data

Here, we read in the data using `read_csv`:

In [None]:
df = pd.read_csv('/data/tidy-data/data/pew.csv')

Extract the following columns, renaming `reltrad` to `religion`:

* `q16`
* `reltrad` -> `religion`
* `income`

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
df.head()

In [None]:
assert list(df.columns)==['q16', 'religion', 'income']
assert len(df)==35556

## Religion

Now you are going to tidy up the `religion` column. Perform the following transformations of the `religion` column:

* Replace `'Churches'` by an empty string
* Replace `'Protestant'` by `'Prot'`
* For rows that have `" Atheist (do not believe in God) "` in the `q16` column, put the value `'Atheist'` in the `religion` column
* For rows that have `" Agnostic (not sure if there is a God) "` in the `q16` column, put the value `'Agnostic'` in the `religion` column
* For rows whose `religion` value contains the phrase `"(no information on religious affiliation)"`, replace that value with `'Unknown'`
* Strip leading and trailing whitespace
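
The steps above could be sketched on a small toy frame as follows. The rows are invented and only loosely imitate the real data; the approach (vectorized string methods plus boolean-mask assignment with `.loc`) is one reasonable option, not the only one.

```python
import pandas as pd

# Invented rows that imitate the padded strings described above.
toy = pd.DataFrame({
    'q16': [
        ' Atheist (do not believe in God) ',
        ' Nothing in particular ',
        ' Agnostic (not sure if there is a God) ',
        ' Baptist ',
    ],
    'religion': [
        ' Unaffiliated ',
        ' Evangelical Protestant Churches ',
        ' Unaffiliated ',
        " Don't know/refused (no information on religious affiliation) ",
    ],
})

# Literal (non-regex) substring replacements on the whole column.
rel = toy['religion'].str.replace('Churches', '', regex=False)
toy['religion'] = rel.str.replace('Protestant', 'Prot', regex=False)

# Overwrite `religion` where `q16` matches the exact (padded) answer text.
toy.loc[toy['q16'] == ' Atheist (do not believe in God) ', 'religion'] = 'Atheist'
toy.loc[toy['q16'] == ' Agnostic (not sure if there is a God) ', 'religion'] = 'Agnostic'

# regex=False keeps the parentheses from being read as a regex group.
unknown = toy['religion'].str.contains('(no information on religious affiliation)', regex=False)
toy.loc[unknown, 'religion'] = 'Unknown'

# Strip last, so whitespace left behind by the replacements is removed too.
toy['religion'] = toy['religion'].str.strip()
print(toy['religion'].tolist())
```

Stripping at the end matters here: removing `'Churches'` leaves extra spaces behind, which the final `str.strip()` cleans up.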

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert list(df['religion'].unique()) == \
['Evangelical Prot',
 'Mainline Prot',
 'Unaffiliated',
 'Jewish',
 'Unknown',
 'Other Faiths',
 'Historically Black Prot',
 "Jehovah's Witness",
 'Atheist',
 'Agnostic',
 'Catholic',
 'Buddhist',
 'Mormon',
 'Muslim',
 'Hindu',
 'Other Christian',
 'Orthodox',
 'Other World Religions']

Now do the following:

* Keep only the `religion` and `income` columns
* Convert the `religion` column to a categorical type
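
A minimal sketch of both steps on invented data, assuming a plain (unordered) categorical is all that is needed:

```python
import pandas as pd

# Invented rows; only the column names mirror the notebook.
toy = pd.DataFrame({
    'q16': ['a', 'b', 'c'],
    'religion': ['Catholic', 'Hindu', 'Catholic'],
    'income': ['$20-30k', '>150k', 'Unknown'],
})

# Keep two columns; .copy() avoids modifying a view of the original frame.
toy = toy[['religion', 'income']].copy()

# astype('category') stores each distinct string once and codes the rows as integers.
toy['religion'] = toy['religion'].astype('category')
print(toy['religion'].dtype.name)
print(list(toy['religion'].cat.categories))
```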

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
df.head()

In [None]:
assert list(df.columns)==['religion', 'income']
assert df.religion.dtype.name=='category'

Now make an appropriately labeled Seaborn `countplot` of the `religion` column on the y-axis, sorted by the number of people in each religion.
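
One possible way to get bars sorted by count is to take the order from `value_counts()` and pass it to `countplot` through its `order` argument. The sketch below uses invented data, and `plot_religion_counts` is a hypothetical helper name; the axis labels are placeholders.

```python
import pandas as pd

# Invented data with distinct counts per group, so the ordering is unambiguous.
toy = pd.DataFrame({
    'religion': ['Catholic', 'Hindu', 'Catholic', 'Mormon', 'Catholic', 'Hindu'],
})

# value_counts() sorts by descending frequency, so its index is a ready-made bar order.
order = list(toy['religion'].value_counts().index)


def plot_religion_counts(frame, order):
    """Draw a horizontal count plot with bars in the given order (hypothetical helper)."""
    # Imported here so the ordering logic above runs even without seaborn installed.
    import seaborn as sns

    ax = sns.countplot(data=frame, y='religion', order=order)
    ax.set_xlabel('Number of respondents')
    ax.set_ylabel('Religion')
    ax.set_title('Respondents per religion')
    return ax
```

In the notebook, calling `plot_religion_counts(toy, order)` in a cell would render the figure inline.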

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

## Income

Now work on the `income` column. Replace the existing income strings with the shorter labels listed in the test below:
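
A dictionary passed to `Series.replace` is one straightforward way to relabel values. In the sketch below the long labels on the left are invented placeholders, not the actual strings in the dataset; inspect `df.income.unique()` to find the real ones.

```python
import pandas as pd

# Invented long labels standing in for the raw income strings.
income = pd.Series(
    ['Less than $10,000', '75 to under $100,000', "Don't know/Refused"],
    name='income',
)

# Map each full raw string to its short label; values not in the dict stay unchanged.
mapping = {
    'Less than $10,000': '<$10k',
    '75 to under $100,000': '$75-100k',
    "Don't know/Refused": 'Unknown',
}
short = income.replace(mapping)
print(short.tolist())
```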

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert list(df.income.unique())==\
    ['$75-100k',
     '$20-30k',
     '$30-40k',
     '<$10k',
     '$50-75k',
     '>150k',
     '$40-50k',
     'Unknown',
     '$100-150k',
     '$10-20k']

Convert the `income` column to a category type:

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert df.income.dtype.name=='category'

Now make an appropriately labeled `countplot` of the `income` column, ordered by the income level:
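
Income labels do not sort meaningfully as plain strings, so the level order has to be spelled out somewhere. One option, sketched here on invented data with a shortened list of levels, is an ordered categorical whose categories double as the plot order. Placing `'Unknown'` last is a choice made for this example.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Shortened, hand-written level order for illustration; 'Unknown' is placed last by choice.
levels = ['<$10k', '$10-20k', '$20-30k', 'Unknown']

income = pd.Series(['$20-30k', '<$10k', 'Unknown', '$10-20k', '<$10k'], name='income')

# An ordered categorical sorts by position in `levels`, not alphabetically.
income = income.astype(CategoricalDtype(categories=levels, ordered=True))

order = list(income.cat.categories)
sorted_income = list(income.sort_values())
print(sorted_income)
```

The resulting `order` list can then be handed to `countplot` through its `order` argument.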

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

## Simple Analysis

Using a `groupby` and a custom aggregation, do the following:

* Extract rows where the income is not `'Unknown'`
* Compute the most commonly occurring income category for each religion (mode)
* Sort the result by the mode income
* Store the result in a single-column `DataFrame` with an index that is the religion and a column name of `mode_income`
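
The steps above might look roughly like this on a toy frame with plain string columns. The data are invented, and with categorical columns the details can differ (for example, how unobserved categories are handled and how the sort orders the values).

```python
import pandas as pd

# Invented rows; each remaining group has a single clear mode.
toy = pd.DataFrame({
    'religion': ['Catholic', 'Catholic', 'Catholic', 'Hindu', 'Hindu', 'Hindu', 'Mormon'],
    'income': ['$20-30k', '$20-30k', 'Unknown', '>150k', '>150k', '$20-30k', 'Unknown'],
})

# Drop rows with unknown income before grouping.
known = toy[toy['income'] != 'Unknown']

mode_income = (
    known.groupby('religion')['income']
         # Series.mode() can return several values on ties; take the first.
         .agg(lambda s: s.mode().iloc[0])
         # Plain strings sort lexically here, not by income level.
         .sort_values()
         .to_frame('mode_income')
)
print(mode_income)
```

Groups whose rows are all `'Unknown'` (here `'Mormon'`) disappear from the result because they are filtered out before the `groupby`.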

In [None]:
# YOUR CODE HERE
raise NotImplementedError()