# Pandas Basics
In python, the `pandas` package is the most popular data-frame (i.e., 2-dimensional) manipulation framework. You can think of it like `dplyr` for python. This notebook will walk you through some of the basics of working with dataframes. As you complete this notebook, I **strongly suggest** that you read [this chapter](https://jakevdp.github.io/PythonDataScienceHandbook/03.00-introduction-to-pandas.html) in the _Python Data Science Handbook_, which describes the package in detail (you will need to read it to identify the appropriate syntax).

## Creating DataFrames
Let's create a dataframe representing 1000 houses that has the following columns:
- `house_id`: This should be `"house_N"`, where `N` is the house number (i.e., `house_1`)
- `neighborhood`: There should be 5 neighborhoods, `"a"` through `"e"` (200 of each)
- `price_2010`: The price (in USD) of the house in 2010 (can be _float_)
- `price_2018`: The price (in USD) of the house in 2010 (can be _float_)

This [section](https://jakevdp.github.io/PythonDataScienceHandbook/03.01-introducing-pandas-objects.html#Constructing-DataFrame-objects) may be helpful to review.

In [1]:
# Import the pandas and numpy packages (as pd and np, respectively)
import pandas as pd
import numpy as np

In [2]:
# Create a _list_ of houses ("house_N") 1 through 1000 
# (hint: use a list comprehension)
houses = ["house_" + str(i) for i in range(1, 1001)]

# houses

In [3]:
# Create a list of neighborhoods `a` thorugh `e` that is 1000 elements long. Hint: *multiply* a list by 200
neighborhoods = ['a', 'b', 'c', 'd', 'e'] * 200

In [4]:
# Creaet a list of 1000 home prices (for the year 2010) that range uniformly from 50000 to 400000 using np.random
# numpy.random?
# prices_2010 = np.random.randint(50000, 400000, size=1000)
prices_2010 = np.random.uniform(50000, 400000, size=1000)

In [5]:
# Create a list of 1000 home prices (for the year 2018) by multiplying each 2010 price by a (different) random number
# The number should be drawn randomly from a normal distribution with mean 1.5 and standard deviation .05
prices_2018 = np.random.normal(1.5, 0.05, 1000) * prices_2010

In [6]:
# Create a dataframe `houses` with each list above as a column. Use the pd.DataFrame() function
house_data = pd.DataFrame(data={ 
    'houses':houses,
    'neighborhoods':neighborhoods,
    'prices_2010':prices_2010,
    'prices_2018':prices_2018
})
#house_data

In [7]:
# Write your `houses` dataframe to a .csv file "houses.csv", excluding row numbers
house_data.to_csv("houses.csv", index=False)

# Accessing DataFrames
In this section, you'll extract and compute information of interest using dataframe **properties** and **methods**. I suggest you review [this section](https://jakevdp.github.io/PythonDataScienceHandbook/03.02-data-indexing-and-selection.html) as you move through the following prompts. 

In [8]:
# What is the _shape_ of the dataframe?
house_data.shape
house_data.prices_2018.max()

653649.2078148773

In [9]:
# What is the maximum house price in 2018 (hint: use the `.max()` method)
house_data.prices_2018.max()

653649.2078148773

In [10]:
# What are the summary statistics of the dataframe (hint: use the `.describe()` method)
house_data.describe()

Unnamed: 0,prices_2010,prices_2018
count,1000.0,1000.0
mean,226554.253867,340163.174466
std,100053.364156,151331.4754
min,50048.435855,70092.675921
25%,138924.10258,208880.95247
50%,222794.73298,334145.627036
75%,311711.29735,468332.456392
max,399521.492084,653649.207815


In [12]:
# What was the median house price in 2018 (hint: use the `.median()` method)
house_data.prices_2018.median()

350198.27783706324

In [13]:
# In which neighborhood was the cheapest house in 2010?
# Hint: you can subset a dataframe using df[df.col == value]
house_data[house_data.prices_2018 == house_data.prices_2018.min()]

Unnamed: 0,houses,neighborhoods,prices_2010,prices_2018
574,house_575,e,50864.474051,74107.501957


## Aggregating data
Just like `dplyr`, pandas has a `groupby` method in which you can _group_ a dataframe by a column of interest, and then _aggregate_ (`agg`) your columns using a given function (i.e., `mean`, `sum`, `median`, etc.). Note, this will create a row `index` (i.e., row name) for your dataframe. See [this section](https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html) for details.

In [17]:
# What was the median home price _in each neighborhood_ in each year? 
# Create a new variable of these values.
median_price = house_data.groupby('neighborhoods').median()


Unnamed: 0_level_0,prices_2010,prices_2018
neighborhoods,Unnamed: 1_level_1,Unnamed: 2_level_1
a,242083.768767,361844.90965
b,238649.192402,363859.836915
c,234336.170178,349136.655665
d,229763.716461,340587.120242
e,228081.689051,336215.964652


In [22]:
# Get the prices for neighborhood 'b'. 
# Note, you can now use the df.loc[rows, columns] to select by row name
print(median_price.loc['b',:])

prices_2010    238649.192402
prices_2018    363859.836915
Name: b, dtype: float64
