# Data Exploration, Cleaning, and Analysis Guide

## Step One: Import necessary modules

NumPy provides arrays, basic mathematical and statistical functions, and other useful tools. pandas lets you create DataFrames and add, update, or remove columns and rows within them.

`import numpy as np
import pandas as pd`

## Step Two: Read in a data source

First, determine what kind of file you have. You can tell by the file extension (for example, "dataset.csv" is a CSV file). Several file formats are used for data analysis, and different formats suit different kinds of analysis. The two most common are CSV (Comma-Separated Values) and JSON (JavaScript Object Notation, e.g. "dataset.json"). Others include Excel (.xlsx), XML (.xml), and plain text (usually .txt, often encoded as UTF-8). You can also build a DataFrame from a key-value dictionary created directly in a Jupyter notebook.
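As a rough sketch of how the reader function changes with the file type, the example below loads the same small table from CSV text and from JSON text. The data, the column names, and the variable names are made up for illustration, and in-memory buffers stand in for real files. pandas also provides `pd.read_excel` and `pd.read_xml` for the other formats mentioned above.

```python
import io

import pandas as pd

# Hypothetical in-memory data standing in for "dataset.csv" and "dataset.json".
csv_text = "name,amount\nAlice,10.5\nBob,20.0\n"
json_text = '[{"name": "Alice", "amount": 10.5}, {"name": "Bob", "amount": 20.0}]'

# Each format has its own reader, but both return a DataFrame.
df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv)
print(df_json)
```

With a real file, you would pass the file path as the first argument instead of a buffer.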

The hardest part of reading in a data source is figuring out the path to the file from your notebook, and a wrong path will likely be the most common error you hit at this stage. Make sure you know where your data source is located and where your notebook is located. When in doubt, use the absolute path, starting from the root (usually "/Users/..." on a Mac).
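One way to debug a path problem is to print where the notebook is running and where it is looking for the file. This is a minimal sketch using the standard-library `pathlib` module. It assumes the data file sits next to the notebook, which may not be true for your setup.

```python
from pathlib import Path

# Assumption: the data file is in the same folder as the notebook.
data_file = Path("2019Q3-house-disburse-detail.csv")

# Path.cwd() is the directory the notebook (or script) is running from.
print("Working directory:", Path.cwd())

# resolve() turns the relative path into an absolute one that starts at the root.
absolute_path = data_file.resolve()
print("Looking for:", absolute_path)
print("File exists:", absolute_path.exists())
```

If the last line prints `False`, the file is somewhere else, and the path needs to be adjusted before calling a pandas reader.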

Next, name the variable that will hold your DataFrame. Many people name their DataFrames "df", and you will see that name used frequently while doing research, but you can name the DataFrame anything you want. We'll be using "df" in this notebook to get you used to the "standard".

The data source we will be using is [expenditures from the U.S. House of Representatives](https://projects.propublica.org/represent/expenditures).

`df = pd.read_csv("2019Q3-house-disburse-detail.csv")`

When reading in a data source, you may also run into encoding issues. Sometimes a source is not encoded in UTF-8. In that case, passing an explicit encoding, as in `pd.read_{filetype}("filename.filetype", encoding="latin1")`, will usually solve the issue.
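The encoding fix can be sketched as a simple fallback: try the default (UTF-8) first, and retry with `latin1` only if decoding fails. The bytes below are made-up sample data standing in for a real file, and `latin1` is just one possible fallback. Your source may use a different encoding.

```python
import io

import pandas as pd

# Hypothetical Latin-1 encoded bytes standing in for a non-UTF-8 file on disk.
raw_bytes = "payee,amount\nJosé,100\n".encode("latin1")

try:
    # pandas assumes UTF-8 unless told otherwise.
    df_enc = pd.read_csv(io.BytesIO(raw_bytes))
except UnicodeDecodeError:
    # Retry with an explicit encoding when the default cannot decode the data.
    df_enc = pd.read_csv(io.BytesIO(raw_bytes), encoding="latin1")

print(df_enc)
```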

## Step Three: Data verification

In order to verify the data, you need to see it. So the first step here is to call:
`df.head()`

This returns the first five rows of the dataset. After you've seen the dataset for yourself, you can do a number of things:
1. use `df.describe()` to see summary statistics of the data
1. find unique values using `df['COLUMN NAME'].unique()`
1. check `df.columns` (an attribute, so no parentheses) to see the list of columns you're working with

This is not an exhaustive list by any means, but these are great tools with which to start understanding your dataset.
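The checks listed above can be tried on a small stand-in table before running them on the full dataset. The DataFrame below is made up for illustration. "AMOUNT" mirrors a column used later in this guide, while "PAYEE" and all of the values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; the real dataset has many more rows and columns.
toy = pd.DataFrame({
    "PAYEE": ["A", "B", "A", "C", "B", "A"],
    "AMOUNT": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

print(toy.head())             # first five rows
print(toy.describe())         # summary statistics for the numeric columns
print(toy["PAYEE"].unique())  # distinct values in one column
print(toy.columns)            # column labels
```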

After this, you can begin dropping any rows or columns you don't need.

pandas lets you drop rows that contain null values using `df.dropna()`, but be cautious with this--sometimes it makes more sense to keep your null values. An absence of data does not necessarily mean there is an absence of information.
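Before dropping anything, it can help to count the nulls and to limit the drop to the columns that matter. The sketch below uses a made-up table (the values and the "PAYEE" column are hypothetical) to compare dropping any row with a null against dropping only the rows where "AMOUNT" is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one null in each column.
toy_nulls = pd.DataFrame({
    "PAYEE": ["A", None, "C"],
    "AMOUNT": [10.0, 20.0, np.nan],
})

# Count nulls per column before deciding what to drop.
print(toy_nulls.isna().sum())

# Drop every row that has a null in any column.
dropped_any = toy_nulls.dropna()

# Drop only the rows where AMOUNT is null; a missing PAYEE is kept.
dropped_amount = toy_nulls.dropna(subset=["AMOUNT"])

print(len(dropped_any), len(dropped_amount))
```

Both calls return new DataFrames, so `toy_nulls` itself still holds all three rows.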

Additionally, you can drop rows according to certain conditions. Here is an example:
`df = df.drop(df[df['SORT SEQUENCE'] == "GRAND TOTAL FOR ORGANIZATION"].index)`

Make sure that you're not reassigning variables you'll need later!
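An alternative to `drop` is to filter with a boolean mask and store the result under a new name, which leaves the original DataFrame available for later. This is a sketch on made-up data: the "DETAIL" value and the amounts are hypothetical, and only the "GRAND TOTAL FOR ORGANIZATION" label comes from the example above.

```python
import pandas as pd

# Hypothetical toy data; "DETAIL" is an invented placeholder value.
toy_totals = pd.DataFrame({
    "SORT SEQUENCE": ["DETAIL", "GRAND TOTAL FOR ORGANIZATION", "DETAIL"],
    "AMOUNT": [10.0, 30.0, 20.0],
})

# True for the rows we want to remove.
is_total = toy_totals["SORT SEQUENCE"] == "GRAND TOTAL FOR ORGANIZATION"

# ~ inverts the mask, keeping every row that is not a grand total.
detail_only = toy_totals[~is_total]

print(detail_only)
```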

## Step Four: Visualization

There are several modules for visualization, but the most common are matplotlib and seaborn, which is built on top of matplotlib. You can import both of them with the following code:

`from matplotlib import pyplot as plt
import seaborn as sns`

After importing the modules, you're pretty much ready to start visualizing!

In this guide, we'll be focusing on matplotlib. Here is an example of one of the types of graphs you can make:

`plt.hist(df['AMOUNT'], color='blue', edgecolor='black', bins=int(7732336/500000))
plt.yscale("log")
plt.xlabel('Amount Paid')
plt.ylabel('# of Records')
plt.show()`

This is a large dataset, so the binning is not as standard as it usually would be, but in essence this will give you a histogram. `plt.hist` draws a histogram, and `df['AMOUNT']` selects the "AMOUNT" column from the dataset we loaded. `color` sets the fill color of the bars and `edgecolor` sets their border color. Finally, `bins` sets the number of bins, which here works out to `int(7732336/500000)`, or 15.
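The bin count in the example is hard-coded. One possible way to derive it from the data is sketched below with NumPy alone. It assumes the constant 7732336 is the largest value in the column and that 500000 is the desired bin width. The guide states neither, and the amounts are invented for illustration.

```python
import numpy as np

# Hypothetical amounts; the largest one matches the constant in the plot call.
amounts = np.array([100.0, 2_500.0, 480_000.0, 1_200_000.0, 7_732_336.0])

# Assumed target width of each bin.
bin_width = 500_000

# Same arithmetic as int(7732336/500000), computed from the data instead.
n_bins = int(amounts.max() / bin_width)

# np.histogram performs the same binning that plt.hist would draw.
counts, edges = np.histogram(amounts, bins=n_bins)

print(n_bins)
print(counts)
```

Passing `bins=n_bins` to `plt.hist` would then give the same number of bins without a magic number in the plotting call.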

Here, I've used a logarithmic scale on the Y axis to show the full extent of the values without making the shorter bars too small to see.

You can relabel the X and Y axes--these are sometimes automatically assigned by matplotlib--to more descriptive names.

Finally, call `plt.show()` to display the completed graph.

Congratulations! You've read data, verified it, and created a visualization!