# Pandas Basics
In Python, the `pandas` package is the most popular framework for manipulating data frames (i.e., 2-dimensional tables). You can think of it as `dplyr` for Python. This notebook will walk you through some of the basics of working with dataframes. As you complete this notebook, I **strongly suggest** that you read [this chapter](https://jakevdp.github.io/PythonDataScienceHandbook/03.00-introduction-to-pandas.html) in the _Python Data Science Handbook_, which describes the package in detail (you will need to read it to identify the appropriate syntax).

## Creating DataFrames
Let's create a dataframe representing 1000 houses that has the following columns:
- `house_id`: This should be `"house_N"`, where `N` is the house number (e.g., `house_1`)
- `neighborhood`: There should be 5 neighborhoods, `"a"` through `"e"` (200 of each)
- `price_2010`: The price (in USD) of the house in 2010 (can be _float_)
- `price_2018`: The price (in USD) of the house in 2018 (can be _float_)

This [section](https://jakevdp.github.io/PythonDataScienceHandbook/03.01-introducing-pandas-objects.html#Constructing-DataFrame-objects) may be helpful to review.
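
As a rough illustration of the pieces involved (a list comprehension, list multiplication, `np.random`, and `pd.DataFrame()`), here is a minimal sketch. It uses a small hypothetical dataset of students rather than the houses data, and all names in it (`students`, `student_ids`, `classes`, `scores`) are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the houses dataset): 6 students in 3 classes
rng = np.random.default_rng(seed=0)  # seeded so the numbers are repeatable

student_ids = ["student_" + str(n) for n in range(1, 7)]  # list comprehension
classes = ["x", "y", "z"] * 2                             # multiplying a list repeats it
scores = rng.uniform(low=50, high=100, size=6)            # uniform floats in [50, 100)

# Each dictionary key becomes a column name; each list becomes that column's values
students = pd.DataFrame({
    "student_id": student_ids,
    "class": classes,
    "score": scores,
})
print(students)
```

Passing a dictionary of equal-length lists is only one of several ways to build a dataframe; the linked section covers the others.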

In [1]:
# Import the pandas and numpy packages (as pd and np, respectively)

In [2]:
# Create a _list_ of houses ("house_N") 1 through 1000 (hint: use a list comprehension)

In [3]:
# Create a list of neighborhoods `a` through `e` that is 1000 elements long. Hint: *multiply* a list by 200

In [4]:
# Create a list of 1000 home prices (for the year 2010) that range uniformly from 50000 to 400000 using np.random

In [5]:
# Create a list of 1000 home prices (for the year 2018) by multiplying each 2010 price by a (different) random number
# The number should be drawn randomly from a normal distribution with mean 1.5 and standard deviation .05

In [6]:
# Create a dataframe `houses` with each list above as a column. Use the pd.DataFrame() function

In [7]:
# Write your `houses` dataframe to a .csv file "houses.csv", excluding row numbers

## Accessing DataFrames
In this section, you'll extract and compute information of interest using dataframe **properties** and **methods**. I suggest you review [this section](https://jakevdp.github.io/PythonDataScienceHandbook/03.02-data-indexing-and-selection.html) as you move through the following prompts. 
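
To make the difference between properties and methods concrete, here is a small sketch on a hypothetical `fruit` dataframe (the data and column names are invented for illustration, not taken from the exercise). Properties such as `.shape` are read without parentheses, while methods such as `.max()` are called.

```python
import pandas as pd

# Hypothetical toy data (not the houses dataset)
fruit = pd.DataFrame({
    "name": ["apple", "banana", "cherry", "date"],
    "color": ["red", "yellow", "red", "brown"],
    "weight_g": [180.0, 120.0, 8.0, 7.0],
})

print(fruit.shape)                 # property: (number of rows, number of columns)
print(fruit["weight_g"].max())     # method: largest value in one column
print(fruit["weight_g"].median())  # method: middle value of one column
print(fruit.describe())            # method: summary statistics for numeric columns

# Boolean subsetting: keep only the rows where the condition is True
lightest = fruit[fruit["weight_g"] == fruit["weight_g"].min()]
print(lightest["name"])
```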

In [8]:
# What is the _shape_ of the dataframe?

In [9]:
# What is the maximum house price in 2018? (Hint: use the `.max()` method)

In [10]:
# What are the summary statistics of the dataframe? (Hint: use the `.describe()` method)

In [11]:
# What was the median house price in 2018? (Hint: use the `.median()` method)

In [12]:
# In which neighborhood was the cheapest house in 2010?
# Hint: you can subset a dataframe using df[df.col == value]

## Aggregating data
Just like `dplyr`, pandas has a `groupby` method with which you can _group_ a dataframe by a column of interest, and then _aggregate_ (`agg`) your columns using a given function (e.g., `mean`, `sum`, `median`, etc.). Note that this will turn the grouping column into the row `index` (i.e., row names) of the resulting dataframe. See [this section](https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html) for details.
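
The group-then-aggregate pattern might look like the following sketch. The `sales` dataframe and its columns are hypothetical, chosen only to show how the grouping column ends up as the row index and how a row can then be selected by label with `.loc`.

```python
import pandas as pd

# Hypothetical toy data (not the houses dataset)
sales = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south"],
    "sales_2010": [10.0, 20.0, 5.0, 15.0, 25.0],
    "sales_2018": [12.0, 30.0, 6.0, 18.0, 40.0],
})

# Group by region, then take the median of every remaining column
medians = sales.groupby("region").agg("median")
print(medians)

# The grouping column is now the row index, so rows can be selected by label
print(medians.loc["south", "sales_2018"])
```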

In [13]:
# What was the median home price _in each neighborhood_ in each year? 
# Create a new variable of these values.

In [14]:
# Get the prices for neighborhood 'b'. 
# Note: you can now use `df.loc[rows, columns]` to select by row name