# Pandas Basics
In Python, the `pandas` package is the most popular framework for manipulating data frames (i.e., 2-dimensional tables). You can think of it as `dplyr` for Python. This notebook walks you through some of the basics of working with dataframes, as well as plotting with `matplotlib` (a popular graphics package).

## Creating DataFrames
Let's create a dataframe representing 1000 houses that has the following columns:
- `house_id`: This should be `"house_N"`, where `N` is the house number (i.e., `house_1`)
- `neighborhood`: There should be 5 neighborhoods, `"a"` through `"e"` (200 of each)
- `price_2010`: The price (in USD) of the house in 2010 (can be _float_)
- `price_2018`: The price (in USD) of the house in 2018 (can be _float_)
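
As a rough illustration of how columns like these can be assembled, here is a minimal sketch on a smaller frame of 10 houses (2 per neighborhood). The names `n_houses` and `toy_houses` and the fixed seed are assumptions made for this example, not part of the exercise.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumption: fixed seed so the sketch is reproducible

n_houses = 10  # toy size; the exercise itself uses 1000

# List comprehension: "house_1" ... "house_10"
house_ids = ["house_" + str(n) for n in range(1, n_houses + 1)]

# Multiplying a list repeats it: a, b, c, d, e, a, b, ...
neighborhoods = ["a", "b", "c", "d", "e"] * (n_houses // 5)

# Prices drawn uniformly between 50000 and 400000
price_2010 = np.random.uniform(50000, 400000, size=n_houses)

# Scale each 2010 price by a factor drawn from a normal
# distribution with mean 1.5 and standard deviation 0.5
price_2018 = price_2010 * np.random.normal(1.5, 0.5, size=n_houses)

# A dict of equal-length sequences becomes one column per key
toy_houses = pd.DataFrame({
    "house_id": house_ids,
    "neighborhood": neighborhoods,
    "price_2010": price_2010,
    "price_2018": price_2018,
})
```

NumPy arrays work as columns just as lists do; call `.tolist()` on them if you specifically want Python lists.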

In [1]:
# Import the pandas, numpy, and matplotlib.pyplot packages (as pd, np, and plt respectively)


In [2]:
# Create a _list_ of houses ("house_N") 1 through 1000 (hint: use a list comprehension)


In [3]:
# Create a list of neighborhoods `a` through `e` that is 1000 elements long. Hint: *multiply* a list...


In [4]:
# Create a list of 1000 home prices (for the year 2010) that range uniformly from 50000 to 400000 using np.random


In [5]:
# Create a list of 1000 home prices (for the year 2018) by multiplying the 2010 price by a random number
# The number should be drawn randomly from a normal distribution with mean 1.5 and standard deviation 0.5


In [6]:
# Create a dataframe `houses` with each list above as a column. Use the pd.DataFrame() function


## Accessing DataFrames
In this section, you'll extract and compute information of interest using dataframe **properties** and **methods**.
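
To make the distinction concrete, here is a minimal sketch on a made-up four-row frame. A property such as `.shape` is read without parentheses, while a method such as `.max()` is called. The frame `toy` and all of its values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data (values invented for illustration)
toy = pd.DataFrame({
    "house_id": ["house_1", "house_2", "house_3", "house_4"],
    "neighborhood": ["a", "b", "a", "b"],
    "price_2010": [100000.0, 250000.0, 80000.0, 300000.0],
    "price_2018": [150000.0, 260000.0, 90000.0, 500000.0],
})

# Property: read without parentheses
n_rows, n_cols = toy.shape

# Methods: called with parentheses
highest_2018 = toy["price_2018"].max()
median_2018 = toy["price_2018"].median()
summary = toy.describe()  # summary statistics of the numeric columns

# Boolean mask: keep only the rows where the condition is True
cheapest_2010 = toy[toy.price_2010 == toy.price_2010.min()]
cheapest_neighborhood = cheapest_2010["neighborhood"].iloc[0]
```

The mask `toy.price_2010 == toy.price_2010.min()` is itself a column of `True`/`False` values, which is why it can be used inside the square brackets to filter rows.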

In [7]:
# What is the _shape_ of the dataframe?


In [8]:
# What is the maximum house price in 2018? (Hint: use the `.max()` method)


In [9]:
# What are the summary statistics of the dataframe? (Hint: use the `.describe()` method)


In [10]:
# What was the median house price in 2018? (Hint: use the `.median()` method)


In [11]:
# In which neighborhood was the cheapest house in 2010?
# Hint: you can subset a dataframe using df[df.col == value]


## Aggregating data
Just like `dplyr`, pandas has a `groupby` method with which you can _group_ a dataframe by a column of interest, and then _aggregate_ (`agg`) your columns using a given function (e.g., `mean`, `sum`, `median`, etc.). Note that the grouping column becomes the row `index` (i.e., the row names) of the resulting dataframe.
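
The group-then-aggregate pattern can be sketched as follows, using a made-up five-row frame. The names `toy`, `medians`, `b_prices`, and `b_2018` and all values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data (values invented for illustration)
toy = pd.DataFrame({
    "neighborhood": ["a", "b", "a", "b", "a"],
    "price_2010": [100000.0, 250000.0, 80000.0, 300000.0, 120000.0],
    "price_2018": [150000.0, 260000.0, 90000.0, 500000.0, 200000.0],
})

# Group by neighborhood, then take the median of each price column
medians = toy.groupby("neighborhood")[["price_2010", "price_2018"]].agg("median")

# The grouping column is now the index, so rows can be selected by label
b_prices = medians.loc["b"]              # both median prices for "b"
b_2018 = medians.loc["b", "price_2018"]  # a single value
```

Because `neighborhood` is the index of `medians` rather than an ordinary column, `medians.loc["b"]` looks the row up by its label instead of by position.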

In [12]:
# What was the median home price _in each neighborhood_ in each year? 
# Create a new variable of these values.


In [13]:
# Get the prices for neighborhood 'b'.
# Note: you can now use `df.loc[rows, columns]` to select by row name


## Visualizing data
In this section, we'll use `matplotlib` to create a few plots. You may want to reference the [documentation](https://matplotlib.org/contents.html) or [examples](https://matplotlib.org/examples/).
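
As one example of the plotting workflow, the sketch below draws two adjacent bar charts that share a y-axis, using `plt.subplots` with `sharey=True`. The frame `medians` and its values are invented for illustration; in practice you would plot your own aggregated data.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical median prices indexed by neighborhood (invented values)
medians = pd.DataFrame(
    {"price_2010": [100000.0, 275000.0], "price_2018": [150000.0, 380000.0]},
    index=pd.Index(["a", "b"], name="neighborhood"),
)

# One row, two columns of axes; sharey=True puts both panels on one y scale
fig, (ax_2010, ax_2018) = plt.subplots(1, 2, sharey=True)

# The index supplies the x positions (one bar per neighborhood)
ax_2010.bar(medians.index, medians["price_2010"])
ax_2010.set_title("2010")
ax_2010.set_ylabel("Median price (USD)")

ax_2018.bar(medians.index, medians["price_2018"])
ax_2018.set_title("2018")

# Call plt.show() to display the figure when running as a script
```

Sharing the y-axis makes bar heights directly comparable across the two panels, at the cost of leaving unused space in the panel with the smaller values.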

In [14]:
# Create a histogram of prices in 2010 using `plt.hist()`, then `plt.show()`


In [15]:
# Create a scatterplot of prices in 2010 vs. prices in 2018 using `plt.scatter()`


In [16]:
# To compare median house prices in 2018 by neighborhood,
# make a bar chart of the median house price in each neighborhood
# (use your aggregated data from above).
# Hint: you'll use the `index` of the dataframe as the `x` of your bar chart


In [17]:
# Finally, let's create *two adjacent bar charts* of price by neighborhood (2010, 2018)
# Use the *same y axis*
# See: https://matplotlib.org/examples/pylab_examples/subplots_demo.html
