# Welcome to data exploration with Python and Jupyter!

## Here's what to expect:


1. [Overview of Jupyter](#1.-Overview-of-Jupyter)


2. [Overview of Python syntax with calculation](#2.-Overview-of-Python-syntax)


3. [Refactoring our calculation using Python data structures](#3.-Overview-of-Python-data-structures)


4. [Introducing pandas](#4.-Introducing-pandas)
  - [Filtering](#Filtering)
  - [Aggregation](#Aggregation)
  
  
5. [Scraping HTML tables with a single command](#5.-Scraping-HTML-tables-with-a-single-command)


6. [Common cleaning/"munging" tasks](#6.-Common-cleaning/"munging"-tasks)
  - [String replacement or substitution](#String-replacement-or-substitution)
  - [Converting data types](#Converting-data-types)

### Note that this is the practice notebook, so you'll see empty cells and prompts for what you should type. However, `reference_notebook` in this same directory has the code largely already entered, and may be easier to follow along in. Either way is totally acceptable!

# 0. Imports

In [None]:
import pandas as pd

# 1. Overview of Jupyter

In [None]:
%pwd

In [None]:
%quickref

# 2. Overview of Python syntax

## Compare state populations in a reusable way

This is a simple calculation, but it's easy to forget what these values mean.

In [None]:
# Compare

7535591 / 4190713

In [None]:
# name variables

washington = 7535591
oregon = 4190713
idaho = 1754208

In [None]:
wa_vs_or = 

How do these values differ?

#### Check `type` of `washington`

In [None]:
type()

#### Check `type` of `wa_vs_or`

# 3. Overview of Python data structures

These include lists, dictionaries, tuples and sets. However, we only need to use the first two today.

#### Create a `list` of states in the Pacific Northwest

In [None]:
pnw = []

You can access items in a list through its `index`, which starts at 0.

In [None]:
pnw[]
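As a small illustration of zero-based indexing, here is a sketch using a made-up list (`rivers` is a hypothetical name, not part of the workshop data):

```python
# Hypothetical list, used only to illustrate indexing
rivers = ['Columbia', 'Snake', 'Willamette']

first = rivers[0]   # index 0 is the first item
last = rivers[-1]   # negative indexes count back from the end

print(first, last)  # Columbia Willamette
```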

#### Use a `dict` to allow comparison of multiple states' populations

In [None]:
population = {'Washington': 7535591, 'Oregon': 4190713, 'Idaho': 1754208}

#### Retrieve the population of Washington

In [None]:
population['']
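Dictionary lookups work by key rather than by position. A minimal sketch with made-up values (`basket` and its counts are hypothetical):

```python
# Hypothetical counts, used only to illustrate key lookups
basket = {'apples': 12, 'pears': 7}

# Square brackets with a key return the matching value
apples = basket['apples']

# Lookups can be used directly in a calculation
difference = basket['apples'] - basket['pears']

print(apples, difference)  # 12 5
```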

#### Calculate the difference between the population of Washington and the population of Idaho without typing out their values.

# 4. Introducing pandas

In [None]:
df = pd.read_csv('data/subset

In [None]:
df.head()

In [None]:
df.info()

#### Grab the first row in the dataframe, using the same syntax you did for lists

In [None]:
df.iloc[]

What is a `NaN`?

In [None]:
df.iloc[0]['search_basis']

#### Check the `type` of the value you just returned

Sometimes it's best to leave `NaN`s as they are, but sometimes it's better to replace them with empty strings or 0s, if the absence of data is equivalent to a zero count! We would do this with `.fillna()`.
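Here is a rough sketch of `.fillna()` on a tiny made-up dataframe. The `stops` frame and its column names are assumptions for illustration, not the workshop data:

```python
import numpy as np
import pandas as pd

# Hypothetical data: NaN marks a missing value
stops = pd.DataFrame({
    'officer_note': ['speeding', np.nan, 'expired tabs'],
    'citations': [1.0, np.nan, 2.0],
})

# A dict lets each column get a replacement suited to its type
filled = stops.fillna({'officer_note': '', 'citations': 0})

print(filled)
```

Note that `.fillna()` returns a new dataframe here; `stops` still contains its `NaN`s unless you reassign the result.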

<hr>

How messy is our data? One quick way to tell is variation among values of a known quantity.

In [None]:
df['county_name'].nunique()

Checking unique values is also helpful for understanding whether we have a unique identifier! For example: `raw_row_number`.

In [None]:
len(df)

#### Check how many unique values there are in the `raw_row_number` field.

To make absolutely sure they're equal, instead of eyeing the difference, you can use the comparison operator `==`. Other comparison operators include `<`, `>`, `<=` and `>=`.
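A minimal sketch of that comparison on made-up data (`records` and `record_id` are hypothetical names). The `==` operator returns `True` or `False`:

```python
import pandas as pd

# Hypothetical data with one repeated id
records = pd.DataFrame({'record_id': [101, 102, 102, 104]})

unique_ids = records['record_id'].nunique()
total_rows = len(records)

# True only when every row has its own id
is_unique_key = unique_ids == total_rows

print(unique_ids, total_rows, is_unique_key)  # 3 4 False
```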

### Filtering

You can use syntax like this to filter the data on the county you want.

In [None]:
df[df['county_name'] == 'Clark']
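To see what is happening inside the brackets, here is a sketch on a small made-up dataframe (`pets` and its columns are hypothetical). The comparison builds a column of `True`/`False` values, and indexing with it keeps only the `True` rows:

```python
import pandas as pd

# Hypothetical data, not the workshop dataset
pets = pd.DataFrame({
    'name': ['Biscuit', 'Mochi', 'Rex'],
    'weight_kg': [4.2, 11.0, 30.5],
})

# The comparison produces one True/False per row
mask = pets['weight_kg'] < 12

# Indexing with the mask keeps only the rows where it is True
small_pets = pets[mask]

print(small_pets)
```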

What if we want to look at just the data for minors? Adjust the above with the appropriate comparison operator and field:

### Aggregation

In [None]:
# Hint: Start with df.groupby() and use shift + tab to look at its parameters

# df.groupby('Year').count().sort_values(by='subject_race', ascending=False)

grouped = df.groupby('subject_race')

In [None]:
grouped

What is the GroupBy object?

You can read more in its documentation [here](http://pandas.pydata.org/pandas-docs/stable/groupby.html#groupby-object-attributes).

But you can also explore!

In [None]:
# Use tab to explore the `grouped` object

# grouped.
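If it helps, here is a sketch of exploring a GroupBy object on made-up data (`orders` and its columns are hypothetical). Creating the object only records how the rows are split; the work happens when you call an aggregation on it:

```python
import pandas as pd

# Hypothetical data, not the workshop dataset
orders = pd.DataFrame({
    'fruit': ['apple', 'pear', 'apple', 'plum'],
    'quantity': [3, 1, 2, 5],
})

by_fruit = orders.groupby('fruit')

# Number of distinct groups found in the 'fruit' column
print(by_fruit.ngroups)  # 3

# Aggregations return one result per group
sizes = by_fruit.size()               # rows per group
totals = by_fruit['quantity'].sum()   # summed quantity per group

print(totals)
```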

What does it look like to find the number of stops per `subject_race`?

In [None]:
grouped.count()

This is a start, but how do we sort what we've found?

In [None]:
grouped_by_count = grouped.count()

Find your method:

In [None]:
grouped_by_count.sort_values(by='raw_row_number')

This is useful! But it'd be more useful if it were sorted in descending order.

How do we figure out whether this is an option for the `sort_values` function?

In [None]:
grouped_by_count.sort_values(by='raw_row_number', 
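One way to check a function's options outside of `shift` + `tab` is to inspect its signature. The sketch below does that and then sorts a made-up dataframe in descending order (`visits` is a hypothetical example, not the workshop data):

```python
import inspect

import pandas as pd

# Hypothetical data, not the workshop dataset
visits = pd.DataFrame({'visits': [12, 40, 7]},
                      index=['north', 'south', 'east'])

# Look up the parameters that sort_values accepts
parameters = inspect.signature(pd.DataFrame.sort_values).parameters
has_ascending = 'ascending' in parameters

# ascending=False puts the largest values first
descending = visits.sort_values(by='visits', ascending=False)

print(has_ascending)
print(descending)
```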

In [None]:
stop_demographics = grouped_by_count['raw_row_number'].reset_index()

In [None]:
stop_demographics

You can create a new field or column on the fly by defining it like you would a variable and assigning a calculation to it.

In [None]:
stop_demographics['percent'] = stop_demographics[
    'raw_row_number'] / stop_demographics['raw_row_number'].sum()

In [None]:
stop_demographics

Why should we wait to report this finding?

# 5. Scraping HTML tables with a single command

Ideally, we'll join this on census data. But if we'd like a quick reference, we can do the following:

In [None]:
page = pd.read_html(
    'https://en.wikipedia.org/wiki/Washington_(state)#Demographics', header=0)

#### How do we check what type `page` is?

The table we want is the 10th on the page.

#### How do we retrieve it?

#### Assign this to variable `demographics`

In [None]:
demographics.head()

These column headers are a little unwieldy. How do we fix that?

In [None]:
# demographics.rename(columns={})
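As a sketch of how `rename` works, here is a made-up table with clumsy headers. The `table` frame and both sets of column names are assumptions for illustration:

```python
import pandas as pd

# Hypothetical table with unwieldy headers
table = pd.DataFrame({
    'Group name (self-reported)': ['A', 'B'],
    'Pct. of total (est.)': [60, 40],
})

# Map each old header to the new one you want
renamed = table.rename(columns={
    'Group name (self-reported)': 'group',
    'Pct. of total (est.)': 'percent',
})

print(renamed.columns.tolist())  # ['group', 'percent']
```

`rename` returns a new dataframe, so `table` keeps its original headers unless you reassign the result.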

You can also use a tool called a regular expression if there's a pattern in the text you want to manipulate. To test out regular expressions in a helpful learning environment, without worrying about changing your data in unexpected ways, I recommend playing with a sample of your data on [regex101.com](https://regex101.com).
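For a feel of what a regular expression does, here is a small sketch using Python's built-in `re` module on made-up header strings (the `headers` list is hypothetical):

```python
import re

# Hypothetical headers with footnote markers like "[12]"
headers = ['Group[1]', '1990[23]', '2000']

# \[ and \] match literal square brackets; \d+ matches one or more digits
footnote = re.compile(r'\[\d+\]')

# Replace every match with an empty string
cleaned = [footnote.sub('', header) for header in headers]

print(cleaned)  # ['Group', '1990', '2000']
```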

# 6. Common cleaning/"munging" tasks

### String replacement or substitution

In [None]:
# The r prefix makes this a raw string, so the backslashes reach the regex engine unchanged
demographics.columns = demographics.columns.str.replace(r'\[\d+\]', '', regex=True)

In [None]:
demographics

In [None]:
demographics.replace

In [None]:
demographics

### Converting data types

In [None]:
def convert(value):
    try:
        return float(value)/100
    except ValueError:
        return # what do we want to return if this fails?

In [None]:
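# convert can be passed directly; no lambda wrapper is needed.
# In pandas 2.1+, applymap is deprecated in favor of DataFrame.map.
demographics[['1990', '2000', '2010',
              '2018']] = demographics[['1990', '2000', '2010',
                                       '2018']].applymap(convert)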
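Here is one possible sketch of element-wise conversion on made-up data. The `to_fraction` helper, the `shares` frame, and the choice to return `NaN` on failure are all assumptions for illustration; a different fallback may suit your data better:

```python
import numpy as np
import pandas as pd

def to_fraction(value):
    """Turn a percentage like '12.5' into 0.125, or NaN if it isn't numeric."""
    try:
        return float(value) / 100
    except (ValueError, TypeError):
        # Assumption: treat cells that can't be parsed as missing
        return np.nan

# Hypothetical table of percentages stored as text
shares = pd.DataFrame({'2000': ['50', '12.5'], '2010': ['25', 'n/a']})

# Apply the helper to every cell, one column at a time
converted = shares.apply(lambda column: column.map(to_fraction))

print(converted)
```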

In [None]:
demographics

This is only the beginning! It's okay if not all of this makes sense yet. It's okay to copy and paste a bit of code, change one variable, and run it again to see what happens. Remember that `shift` + `tab` and following a command with `?` will bring up documentation in the notebook, or you can add the [pandas documentation](http://pandas.pydata.org/pandas-docs/stable/) to your bookmarks.