# Dataset Descriptions
This notebook describes most of the datasets used in *Pandas Cookbook*, listing the name, type, and description of each column along with some summary statistics. The list is not exhaustive: several datasets used in the book are small and are explained in enough detail in the book itself. The datasets presented here are the ones that appear most frequently throughout the book.

## Datasets in order of appearance
* [Movie](#Movie-Dataset)
* [College](#College-Dataset)
* [Employee](#Employee-Dataset)
* [Flights](#Flights-Dataset)
* [Chinook Database](#Chinook-Database)
* [Crime](#Crime-Dataset)
* [Meetup Groups](#Meetup-Groups-Dataset)
* [Diamonds](#Diamonds-Dataset)

In [None]:
import pandas as pd
pd.options.display.max_columns = 80

# Movie Dataset

### Brief Overview
28 columns describing 4,916 movies scraped from the popular website IMDb. Each row contains information on a single movie released between 1916 and 2015. Actor and director Facebook likes should be constant across all movies. For instance, Johnny Depp should have the same number of Facebook likes regardless of which movie he is in. Because the movies were not all scraped at exactly the same time, there are some inconsistencies in these counts. The dataset **movie_altered.csv** is a much cleaner version of this dataset.

In [None]:
movie = pd.read_csv('data/movie.csv')
movie.head()

In [None]:
movie.shape

In [None]:
pd.read_csv('data/descriptions/movie_decsription.csv', index_col='Column Name')
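The inconsistency in Facebook likes mentioned in the overview can be checked by counting how many distinct like values are recorded for each person. The cell below is a minimal sketch on a tiny made-up table. The column names `actor_1_name` and `actor_1_facebook_likes` are assumed to match those in `movie.csv`, and the titles, names, and counts are purely illustrative.

```python
import pandas as pd

# Tiny made-up sample; values are illustrative and not taken from movie.csv.
sample = pd.DataFrame({
    "movie_title": ["Film A", "Film B", "Film C", "Film D"],
    "actor_1_name": ["Actor X", "Actor X", "Actor Y", "Actor Y"],
    "actor_1_facebook_likes": [40000, 40000, 1000, 1200],
})

# Count the distinct like values recorded for each actor.
distinct_counts = sample.groupby("actor_1_name")["actor_1_facebook_likes"].nunique()

# An actor with more than one distinct value has inconsistent counts.
inconsistent = distinct_counts[distinct_counts > 1]
print(inconsistent)
```

Run against the real data, the same `groupby` and `nunique` pattern would list every actor whose like count differs between movies.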

# College Dataset

### Brief Overview

U.S. Department of Education data on 7,535 colleges. Only a subset of the available columns is included in this dataset. Visit [the website](https://collegescorecard.ed.gov/data/) for more info. Data was pulled in January 2017.

In [None]:
college = pd.read_csv('data/college.csv')
college.head()

In [None]:
college.shape

In [None]:
pd.read_csv('data/descriptions/college_decsription.csv')

# Employee Dataset

### Brief Overview
The City of Houston provides information on all its employees to the public. This is a random sample of 2,000 employees with a few of the more interesting columns. For more, [visit the open Houston data website](http://data.houstontx.gov/). Data was pulled in December 2016.

In [None]:
employee = pd.read_csv('data/employee.csv')
employee.head()

In [None]:
employee.shape

In [None]:
pd.read_csv('data/descriptions/employee_description.csv')

# Flights Dataset

### Brief Overview
A random sample of three percent of U.S. domestic flights originating from the ten busiest airports. Data is from the U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics. [See here for more info](https://www.kaggle.com/usdot/flight-delays).

In [None]:
flights = pd.read_csv('data/flights.csv')
flights.head()

In [None]:
flights.shape

In [None]:
pd.read_csv('data/descriptions/flights_description.csv')
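The overview says the dataset is a three percent random sample but does not say how the sample was drawn. As a hedged illustration, one way to take such a sample in pandas is `DataFrame.sample` with `frac=0.03`. The `all_flights` table and its `flight_id` column below are hypothetical stand-ins, not part of the dataset.

```python
import pandas as pd

# Hypothetical stand-in for a full flights table: 1,000 numbered rows.
all_flights = pd.DataFrame({"flight_id": range(1000)})

# frac=0.03 keeps three percent of the rows without replacement;
# random_state makes the draw repeatable.
flights_sample = all_flights.sample(frac=0.03, random_state=1)

print(len(flights_sample))
```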

### Airline Codes

In [None]:
pd.read_csv('data/descriptions/airlines.csv')

### Airport Codes

In [None]:
pd.read_csv('data/descriptions/airports.csv').head()

# Chinook Database

### Brief Overview
This is a sample SQLite database of a music store with 11 tables. The table description image is an excellent way to get familiar with the database. [Visit the SQLite Tutorial website](http://www.sqlitetutorial.net/sqlite-sample-database/) for more detail.
![Chinook database entity-relationship diagram](data/descriptions/ch09_05_erd.png)
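Unlike the other datasets, a database is queried with SQL rather than read from a flat file. The cell below is a minimal sketch that builds a two-table, in-memory miniature and joins it with `pd.read_sql_query`. The table names, column names, and rows are made up for illustration and are only assumed to resemble the real Chinook schema.

```python
import sqlite3
import pandas as pd

# In-memory miniature database; the schema and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE artists (ArtistId INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE albums (AlbumId INTEGER PRIMARY KEY, Title TEXT, ArtistId INTEGER);
INSERT INTO artists VALUES (1, 'Artist One'), (2, 'Artist Two');
INSERT INTO albums VALUES
    (1, 'First Album', 1), (2, 'Second Album', 1), (3, 'Third Album', 2);
""")

# Join the two tables and count the albums for each artist.
query = """
SELECT ar.Name AS artist, COUNT(al.AlbumId) AS n_albums
FROM artists AS ar
JOIN albums AS al ON al.ArtistId = ar.ArtistId
GROUP BY ar.Name
ORDER BY ar.Name
"""
album_counts = pd.read_sql_query(query, conn)
conn.close()
print(album_counts)
```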

# Crime Dataset

### Brief Overview
All reported crimes and traffic accidents for the city of Denver from January to September of 2017. This dataset is stored in a binary format called *HDF5*. Pandas uses the PyTables library to read the data into a DataFrame. [Read the documentation](http://pandas.pydata.org/pandas-docs/stable/io.html#io-hdf5) for more info on HDF5-formatted data.

In [None]:
crime = pd.read_hdf('data/crime.h5')
crime.head()

In [None]:
crime.shape

In [None]:
pd.read_csv('data/descriptions/crime_description.csv')
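To show how the HDF5 format works with pandas, the cell below sketches a round trip: a small DataFrame is written with `to_hdf` and read back with `read_hdf`. The column names and values are made up for illustration, and the `crime_sample` key is an arbitrary choice. Both calls need the optional PyTables package, so the sketch skips the round trip if it is not installed.

```python
import os
import tempfile
import pandas as pd

# Small made-up table; the columns and values are illustrative only.
crime_sample = pd.DataFrame({
    "GEO_LON": [-104.99, -104.87, -105.02],
    "GEO_LAT": [39.74, 39.70, 39.68],
    "IS_CRIME": [1, 0, 1],
    "IS_TRAFFIC": [0, 1, 0],
})

restored = None
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "crime_sample.h5")
    try:
        # An HDF5 file can hold several objects, each stored under its own key.
        crime_sample.to_hdf(path, key="crime_sample", mode="w")
        restored = pd.read_hdf(path, key="crime_sample")
    except ImportError:
        # pandas raises ImportError when PyTables is missing.
        print("PyTables is not installed; skipping the HDF5 round trip.")

if restored is not None:
    print(restored.equals(crime_sample))
```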

# Meetup Groups Dataset

### Brief Overview
Data was collected through the [meetup.com API](https://www.meetup.com/meetup_api/) on five Houston-area data science meetup groups. Each row represents a member joining a particular group.

In [None]:
meetup = pd.read_csv('data/meetup_groups.csv')
meetup.head()

In [None]:
meetup.shape

In [None]:
pd.read_csv('data/descriptions/meetup_description.csv')
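Because each row is a single join event, counting rows gives membership totals, and a running count in date order tracks how each group grew. The cell below is a minimal sketch on made-up data. The column names `join_date` and `group` are assumed to match `meetup_groups.csv`, and the group names and dates are hypothetical.

```python
import pandas as pd

# Made-up join events; group names and dates are illustrative only.
joins = pd.DataFrame({
    "join_date": pd.to_datetime([
        "2017-01-05", "2017-01-20", "2017-02-11", "2017-02-15", "2017-03-02",
    ]),
    "group": ["Group A", "Group B", "Group A", "Group A", "Group B"],
})

# One row per join, so the number of rows per group is its member total.
members_per_group = joins.groupby("group").size()

# A running count within each group, in date order, shows growth over time.
joins = joins.sort_values("join_date")
joins["members_so_far"] = joins.groupby("group").cumcount() + 1

print(members_per_group)
print(joins)
```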

# Diamonds Dataset

### Brief Overview
Quality, size and price of nearly 54,000 diamonds scraped from the [Diamond Search Engine](http://www.diamondse.info/) by Hadley Wickham. [Visit Blue Nile](https://www.bluenile.com/ca/education/diamonds?track=SideNav) for a beginner's guide to diamonds.

In [None]:
diamonds = pd.read_csv('data/diamonds.csv')
diamonds.head()

In [None]:
diamonds.shape

In [None]:
pd.read_csv('data/descriptions/diamonds_description.csv')