# Intro to pandas

####  Review and Outline

Great work! We have made it this far: we know some basic calculations, the built-in data types and structures (lists, tuples, strings, dictionaries), and some key operations such as if/else conditionals and for loops.

Where are we going now? We will get into the key data analysis package in Python: **`pandas`**.

What is `pandas`? "pan"el "da"ta "s"tructures. It is a powerful, intuitive data analysis
tool. This package convinced me to learn and start to use Python as a research tool.
It was developed at [AQR](https://www.aqr.com/) (a quantitative hedge fund) by [Wes McKinney](http://wesmckinney.com). They made it open source, and it quickly expanded, developed, and became widely used.

[This notebook largely follows the discussion in the Book.](https://nyudatabootcamp.gitbooks.io/data-bootcamp/content/py-fun2.html)

#### Python

First we need to import the `pandas` package. This is very similar to when we imported
our functions, but the package is MUCH larger. Furthermore, this is what makes pandas
a higher-level add-on to Python: at a lower level the objects, methods, and functions
are already created for us, so when we import pandas they are ready to go.

Then we will learn the key data structures in **Pandas** and their attributes and methods. Moreover, we will learn how to select data in a **`DataFrame`** and then do computations on it.

**Buzzwords.** DataFrame, Series


---
## Basics

The cell below says import the package `pandas`; the `as pd` part says call it `pd` (our alias).
This just simplifies our life: instead of always typing `pandas`, we just
type `pd`. If you're lost on this, go back to our chapter on [importing packages](https://nyudatabootcamp.gitbooks.io/data-bootcamp/content/packages.html).

Let's first get to know the two most important data structures in `pandas`.

In [None]:
import pandas as pd
import numpy as np

### Series

The Series is the primary building block of pandas. A Series represents a one-dimensional, labeled (indexed) array based on the NumPy ndarray. It can be created and initialized by passing a scalar value, a NumPy ndarray, a Python list, or a Python dict.
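
As a minimal sketch of the four kinds of input just listed, the snippet below builds a Series from each one. The variable names (`from_list`, `from_array`, `from_dict`, `from_scalar`) are made up for this illustration, and the numbers are the same GDP and CPI figures used throughout this notebook.

```python
import numpy as np
import pandas as pd

# From a Python list: pandas adds a default integer index 0, 1, 2
from_list = pd.Series([127.5, 169.3, 217.488], name="CPI")

# From a NumPy array, supplying our own labels as the index
from_array = pd.Series(np.array([5974.7, 10031.0, 14681.1]),
                       index=[1990, 2000, 2010], name="GDP")

# From a dictionary of scalars: the keys become the index
from_dict = pd.Series({1990: 5974.7, 2000: 10031.0, 2010: 14681.1}, name="GDP")

# From a scalar: the value is repeated once per index label
from_scalar = pd.Series("US", index=[1990, 2000, 2010], name="Country")

print(from_list)
print(from_dict)
print(from_scalar)
```

Note that a dictionary whose values are scalars gives one entry per key, which is usually what we want when the keys are natural labels such as years.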

In [None]:
# create a series from a dictionary
gdp = {"GDP": [5974.7, 10031.0, 14681.1]} 
# what kind of data structure is this
print(type(gdp))

It should tell us that it is a dictionary, with keys and values (here the single value is a list). How do we get those values into a `Series`?

In [None]:
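# Passing the dictionary itself would give a one-element Series whose single
# value is the whole list, so we pass the list stored under the "GDP" key.
# The 'name' parameter sets the name of the Series object.
gdp_s = pd.Series(gdp["GDP"], name='GDP')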
print(type(gdp_s))

gdp_s

In [None]:
# create a series from a numpy 1 dimension array
cpi = np.array([127.5, 169.3, 217.488])
cpi_s= pd.Series(cpi,name='CPI')

# create series from a list
year = [1990, 2000, 2010]
country = ["US", "US", "US"]
year_s = pd.Series(year,name='Year')
country_s = pd.Series(country,name='Country')

In [None]:
print(cpi_s)

#### From Series to DataFrame

A `DataFrame` is essentially a table of data, or a dictionary of `Series`, while a `Series` can be thought of as a one-column `DataFrame`.

Let's create a `DataFrame` from the Series created previously, using the pandas [concat](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html) function. It concatenates `pandas` objects, e.g., Series or DataFrames, along a particular axis.

I know it might seem a little overwhelming right now, with new functions and an **axis** argument whose meaning is not yet clear. Let's first look at what happens when we set different values of the axis parameter. We will come back to this over and over later.

In [None]:
Series_Df = pd.concat([year_s,country_s],axis=1)
print(Series_Df)

In [None]:
Series_Df = pd.concat([year_s,country_s],axis=0)
print(Series_Df)

We can also go the other way and get a `Series` from a `DataFrame` by selecting just one column of the `DataFrame`.
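
A short sketch of both directions, under the assumption that the two Series are rebuilt here so the snippet runs on its own. The names `side_by_side`, `stacked`, and `year_again` are made up for the illustration.

```python
import pandas as pd

year_s = pd.Series([1990, 2000, 2010], name="Year")
country_s = pd.Series(["US", "US", "US"], name="Country")

# axis=1: place the Series side by side, so each one becomes a column
side_by_side = pd.concat([year_s, country_s], axis=1)
print(type(side_by_side), side_by_side.shape)

# axis=0: stack the Series end to end, giving one longer Series
stacked = pd.concat([year_s, country_s], axis=0)
print(type(stacked), stacked.shape)

# Selecting a single column of the DataFrame gives back a Series
year_again = side_by_side["Year"]
print(type(year_again))
```

With `axis=0` the original index labels are kept, so the stacked Series has the labels 0, 1, 2 twice.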

---
## DataFrame

Now let's create a `DataFrame` object from a dictionary.

In [None]:
data = {"GDP": [5974.7, 10031.0, 14681.1],
                   "CPI": [127.5, 169.3, 217.488],
                   "Year": [1990, 2000, 2010],
                   "Country": ["US", "US", "US"]}

Now we are going to convert `data` to a `DataFrame`; this is the key object within pandas. (If you are familiar with R, this is similar to its data frame.)


In [None]:
df = pd.DataFrame(data)

In [None]:
print("\n", type(df))

# Now let's see how cool this is: return to the original data and look at it
print(data)

In [None]:
data

In [None]:
df

Remember that in Python the `DataFrame` is an object, and with that object come methods and attributes (we have seen fewer attributes, but lots of methods).


In [None]:
print(df.shape)
# Note that this is an attribute, not a method: a method takes arguments
# through (), whereas this just asks for the shape of df

In [None]:
print(df.columns) # which returns an object...but we can get it to a list.
print(df.columns.tolist())

In [None]:
print(df.index) # which is like a range type, but within pandas...
print(df.index.tolist())

In [None]:
print(df.dtypes) # this is an attribute of the dataframe, similar to type

So this is interesting. For the numerical columns it reports numeric types:
floating point values for GDP and CPI and integers for Year. That is great. For the
names, which are strings, it says that they are objects, NOT strings? Pandas does
this: (i) if all the data in a column are numbers, then the column gets a numeric
type; (ii) if not, then it is just going to be an object.
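
One practical use of `dtypes` is to pick out the numeric columns. The sketch below rebuilds the same `df` so that it runs on its own; `numeric` is a name made up for the illustration, and `select_dtypes` is the pandas method used for the filtering.

```python
import pandas as pd

data = {"GDP": [5974.7, 10031.0, 14681.1],
        "CPI": [127.5, 169.3, 217.488],
        "Year": [1990, 2000, 2010],
        "Country": ["US", "US", "US"]}
df = pd.DataFrame(data)

# One dtype per column
print(df.dtypes)

# Keep only the columns whose dtype is numeric (integers and floats)
numeric = df.select_dtypes(include="number")
print(numeric.columns.tolist())
```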

---
### Time to practice


**Exercises.** Consider the following data below:

In [None]:
pwt_data = {'countrycode': ['CHN', 'CHN', 'CHN', 'FRA', 'FRA', 'FRA'],
        'pop': [1124.8, 1246.8, 1318.2, 58.2, 60.8, 64.7],
        'rgdpe': [2.611, 4.951, 11.106, 1.294, 1.753, 2.032],
        'year': [1990, 2000, 2010, 1990, 2000, 2010]}

pwt = pd.DataFrame(pwt_data)

a) What are the dimensions of pwt?

b) What dtypes are the variables? What do they mean?

c) What does pwt.columns.tolist() do? How does it compare to list(pwt)?

d) Challenging. What is list(pwt)[0]? Why? What type is it?

e) Challenging. What would you say is the natural index? How would you set it?

---
## Understanding DataFrame

In [None]:
df

This lays out the data in a very intuitive way. Columns are labeled by name, just as they would be in an Excel file. Rows are labeled with unique identifiers as well, called the “index.” We have already learned how to retrieve the index of a `DataFrame` with `df.index`.

Amazing. You may be thinking...
so what? Well, there is a reason why Excel is popular: it is natural for people
to think about data in a table-like format, and a dataframe is always going to
present data in this intuitive, natural way. This is also important because it
helps us visualize and then implement calculations and operations on the table,
whereas this could be very hard to do with the `data` variable above.

Therefore, we can think of a `DataFrame` as a rectangular, indexed version of a spreadsheet.


### Play with Columns

#### Grab one column

As we'll see, there are three commonly used ways...

In [None]:
df.CPI

In [None]:
df['CPI']

In [None]:
df.iloc[:,0]

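Since a `DataFrame` is like an Excel sheet, think of the first input in the brackets above as the index for rows and the second as the index for columns. Here, we select all the rows in the first column. Which column comes first depends on the column order (check `df.columns`); in recent pandas versions it follows the order of the keys in `data`, so the first column here is `GDP`, not `CPI`.

Remember that Python indexing starts at **0**.

Regarding `:`, it works like what we learned in the **Slicing** section of Python Fundamentals 2.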
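
To make the `[rows, columns]` pattern concrete, here is a small sketch of a few `iloc` selections on the same data. The variable names on the left are made up for the illustration.

```python
import pandas as pd

data = {"GDP": [5974.7, 10031.0, 14681.1],
        "CPI": [127.5, 169.3, 217.488],
        "Year": [1990, 2000, 2010],
        "Country": ["US", "US", "US"]}
df = pd.DataFrame(data)

first_col = df.iloc[:, 0]         # all rows, column at position 0
first_row = df.iloc[0, :]         # row at position 0, all columns
one_cell = df.iloc[1, 0]          # row position 1, column position 0
first_two_rows = df.iloc[0:2, :]  # rows at positions 0 and 1 (end point excluded)

print(first_col.name)
print(first_row)
print(one_cell)
print(first_two_rows)
```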


#### Grab several columns

In [None]:
df[["CPI","Country"]]

In [None]:
# We can also select several columns by position with iloc.
df.iloc[:,0:2] 

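But why are only two columns selected? The slice `0:2` stops before position 2, so it returns the columns at positions 0 and 1.

**Note:** this is the same end-exclusive rule used by Python lists and NumPy arrays, but it differs from label-based slicing with `loc`, which includes both end points.

How about grabbing columns that are not next to each other, can we do this with **`iloc`** at once? Yes, by passing a list of positions such as `df.iloc[:, [0, 2]]`; a single slice cannot skip a column.

Now we might notice that we can use `iloc` and the other methods almost interchangeably to grab columns. However, specifying the column name makes the code easier to **debug** in the future. Think about it: once the order of the columns changes, position-based code silently selects different columns.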
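
A small sketch of why names are the safer choice: we reorder the columns and compare what a label-based and a position-based selection return. The name `reordered` is made up for the illustration.

```python
import pandas as pd

data = {"GDP": [5974.7, 10031.0, 14681.1],
        "CPI": [127.5, 169.3, 217.488],
        "Year": [1990, 2000, 2010],
        "Country": ["US", "US", "US"]}
df = pd.DataFrame(data)

# Same data, different column order
reordered = df[["Country", "Year", "CPI", "GDP"]]

# Selecting by label returns the same column in both frames
print(df["CPI"].equals(reordered["CPI"]))

# Selecting by position returns whatever sits at position 0
print(df.iloc[:, 0].name, reordered.iloc[:, 0].name)
```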

#### Reset the column name


In [None]:
df.columns = ["gdp", "cpi", "year", "country"]
# The new names are assigned by position, so they must follow the current
# column order (check df.columns first).
# What if this list had fewer elements than there are columns?

df.columns = [var.upper() for var in df.columns]
# Here we can use a list comprehension to change the column names in the way
# we want...

df

Another way is to rename specific columns. Note that if we did not have the `df =`
in front, nothing would fundamentally change: `rename` would just create and display
a renamed copy, but the saved `df` would stay the same.

In [None]:
df = df.rename(columns = {"GDP":"NGDP"})

df

In [None]:
namelist = ["NGDP","CPI"]

df[namelist]

### Play with Rows

Below is an example of setting the index. This is a feature that I'm slowly starting to embrace. The idea, essentially, is that by setting the index we can use `.loc`, the location-finding command, to pull out only specific rows. For example, if we only want year 2000, then we
- set the index to be the year
- then use `.loc` to pull out that particular year.

How would you do this for the country? Same idea: we set the index to `countrycode`, then select the country code that we want.

Two more points about this:
- One is that we can multi-index, that is, have layers of indexes. Why would we want to do this? This relates to the question of the "natural index". I would argue the natural index is at the level of an observation. What does that mean? Think about the data set above: what does an observation look like, and what are the variables associated with it? Here an observation is a country-time pair, so it has two dimensions: a country at a particular time. The variables associated with each observation are population and GDP. Given this argument, I would say that the natural index is a multi-index with countries and years.
- MTWN: This discussion relates to the concept of ["tidy data", which is discussed nicely here](http://vita.had.co.nz/papers/tidy-data.pdf).
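
As a hedged sketch of the multi-index idea, the snippet below sets a two-level index of country and year on the `pwt` data from the exercises and then pulls out one observation and one year. The names `pwt_mi`, `china_2000`, and `year_2000` are made up for the illustration.

```python
import pandas as pd

pwt_data = {'countrycode': ['CHN', 'CHN', 'CHN', 'FRA', 'FRA', 'FRA'],
            'pop': [1124.8, 1246.8, 1318.2, 58.2, 60.8, 64.7],
            'rgdpe': [2.611, 4.951, 11.106, 1.294, 1.753, 2.032],
            'year': [1990, 2000, 2010, 1990, 2000, 2010]}
pwt = pd.DataFrame(pwt_data)

# Two index levels: each row is now identified by a (country, year) pair
pwt_mi = pwt.set_index(["countrycode", "year"])

# One observation: China in 2000 (returned as a Series of its variables)
china_2000 = pwt_mi.loc[("CHN", 2000)]
print(china_2000)

# One year for every country: a cross-section on the "year" level
year_2000 = pwt_mi.xs(2000, level="year")
print(year_2000)
```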


In [None]:
pwt

In [None]:
pwt.set_index(["year"]).loc[2000]

#pwt.set_index(["year"], inplace = True)

#pwt.set_index(["countrycode","year"]).xs(2000, level = "year")


In [None]:
pwt.set_index?

Why is the index back to the original? It's just like string methods: the original data frame is not fundamentally changed. To change it you need to either (i) assign the modified data frame to itself or to a new name or (ii) use the `inplace = True` argument, which does not create a new object but sets the new index directly on the old object.

We can also achieve the above with just **`loc`** and a condition on the `year` column.
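
A minimal sketch of this point, using only the first of the two options (assignment). The name `pwt_by_year` is made up for the illustration.

```python
import pandas as pd

pwt_data = {'countrycode': ['CHN', 'CHN', 'CHN', 'FRA', 'FRA', 'FRA'],
            'pop': [1124.8, 1246.8, 1318.2, 58.2, 60.8, 64.7],
            'rgdpe': [2.611, 4.951, 11.106, 1.294, 1.753, 2.032],
            'year': [1990, 2000, 2010, 1990, 2000, 2010]}
pwt = pd.DataFrame(pwt_data)

# The result is not saved anywhere, so pwt keeps its original index
pwt.set_index("year")
print(pwt.index.tolist())

# Saving the result under a name keeps the new index
pwt_by_year = pwt.set_index("year")
print(pwt_by_year.index.tolist())
```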

In [None]:
pwt.loc[pwt['year']==2000]

In [None]:
# This will also work
pwt[pwt['year']==2000]

Can we use `loc` to achieve the same thing as `iloc`, e.g., selecting a column?
Yes, but we have to use different inputs: labels for `loc` and integer positions for `iloc`.
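
The difference is easiest to see when the index is not just 0, 1, 2. The sketch below uses a small, made-up frame `fra` holding the France rows of the `pwt` data, indexed by year, and selects the same row and the same column both ways.

```python
import pandas as pd

# France rows of the pwt data, with the year as the index
fra = pd.DataFrame({"year": [1990, 2000, 2010],
                    "pop": [58.2, 60.8, 64.7],
                    "rgdpe": [1.294, 1.753, 2.032]}).set_index("year")

row_by_label = fra.loc[2000]        # the row whose index label is 2000
row_by_position = fra.iloc[1]       # the row at position 1 (the second row)

col_by_label = fra.loc[:, "pop"]    # the column named "pop"
col_by_position = fra.iloc[:, 0]    # the column at position 0

print(row_by_label.equals(row_by_position))
print(col_by_label.equals(col_by_position))
```

Here `fra.iloc[2000]` would fail, because there is no row at position 2000, only a row labeled 2000.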

In [None]:
pwt.loc[:,'year']

In [None]:
# For iloc
pwt.iloc[:,3]

### Remove Stuff by Column or Row
How do we remove stuff? There is the `.drop` method. Here we come across the `axis` parameter again. Let's become familiar with it.

In [None]:
# Reset the df DataFrame
df=pd.DataFrame(data)
df

Can you guess what will happen, if...

In [None]:
df.drop("CPI", axis = 1) 

In [None]:
df.drop(0, axis = 0)  # the first 0 is the label of the row to drop: the first row, whose index is 0

In [None]:
df

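Now we can conclude: `axis` tells pandas which labels an operation works on. With **axis** = 1 we work with the column labels (here, dropping a column), while with **axis** = 0 we work with the row labels (here, dropping a row). Note also that `df` itself is unchanged, because `drop` returns a new `DataFrame`. We will see more examples in the `DataFrame` calculations part to help us grasp the idea.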
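
The sketch below collects these points about `drop` in one place. The `columns=` and `index=` keywords are alternative spellings that pandas accepts in place of `axis`; the names `no_cpi`, `no_first_row`, and `got_key_error` are made up for the illustration.

```python
import pandas as pd

data = {"GDP": [5974.7, 10031.0, 14681.1],
        "CPI": [127.5, 169.3, 217.488],
        "Year": [1990, 2000, 2010],
        "Country": ["US", "US", "US"]}
df = pd.DataFrame(data)

no_cpi = df.drop("CPI", axis=1)    # same as df.drop(columns="CPI")
no_first_row = df.drop(0, axis=0)  # same as df.drop(index=0)

# Without axis, pandas looks for a ROW labeled "CPI" and does not find one
got_key_error = False
try:
    df.drop("CPI")
except KeyError:
    got_key_error = True
print("KeyError raised:", got_key_error)

# df itself still has every column and every row
print(df.columns.tolist(), df.index.tolist())
```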

---
### Time to practice


**Exercise.** For the DataFrame df, create a variable diff equal to the difference of ngdp and rgdp. Verify that diff is now in df.

**Exercise.** How would you extract the variables ngdp and year? 


**Exercise.** How would you extract all variables for just one year, say 2000?


**Exercise.** How would you drop the variable diff? If you print your dataframe again, is it gone? If not, why do you think it is still there?


**Exercise.** How would you drop one year from the data set?

Hint: the key thing to recognize is the axis. `df.drop("CPI", axis = 1)` says drop a column named "CPI".
If you did this without the axis it would give an error. Why? The default is
axis = 0, which refers to rows, and there is no index label named "CPI".

---
## Calculations on a Dataframe

Below are a bunch of calculations. This is essentially the "Excel" functionality of the data frame.

In [None]:
print(type(df["GDP"]))

# then it is easy to do operations on a series...

print(df["GDP"] + df["GDP"])

print(df["GDP"] / df["CPI"]) # This would be real gdp

print(100*df["GDP"] / df["GDP"][0]) # what is this doing...if you remember from EGB
# This is a way to index GDP to its first entry (so the first entry equals 100)...

# Then it is super easy to create a new column based on an operation on existing
# columns, almost excel like...

df['RGDP'] = df['GDP']/df['CPI']

df['GDP_div_1000'] = df['GDP'] / 1000

print("\n",df) # so there is a new column called real gdp now...


In [None]:
df

### Operations across rows/columns

Here again, we need to set the **axis** parameter. Remember, to compute across rows (one result per column) we set it to 0, and vice versa.
Can you predict the results?

In [None]:
df.sum(axis=0)

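How about this one? Can it even be executed? Remember we have one column of strings.

In [None]:
df.sum(axis=1, numeric_only=True)

Yes, it can, once we pass `numeric_only=True` so that the string column is ignored. Older pandas versions skipped the string column on their own, but pandas 2.0 and later raise a `TypeError` for `df.sum(axis=1)` here.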
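
How string columns are treated in a row sum depends on the pandas version, so the sketch below states the intent explicitly with `numeric_only=True`, which should behave the same way across versions. The names `row_totals` and `col_totals` are made up for the illustration.

```python
import pandas as pd

data = {"GDP": [5974.7, 10031.0, 14681.1],
        "CPI": [127.5, 169.3, 217.488],
        "Year": [1990, 2000, 2010],
        "Country": ["US", "US", "US"]}
df = pd.DataFrame(data)

# Row sums over the numeric columns only; "Country" is left out
row_totals = df.sum(axis=1, numeric_only=True)
print(row_totals)

# Column sums over an explicit list of numeric columns
col_totals = df[["GDP", "CPI"]].sum(axis=0)
print(col_totals)
```

Adding GDP, CPI, and Year together is not economically meaningful, of course; the point is only how pandas handles mixed column types.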

Can you try the following?

In [None]:
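# numeric_only=True skips the string column, which has no variance
df.var(axis=0, numeric_only=True)
df.var(axis=1, numeric_only=True)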

---
### Time to practice


**Exercise.** Can you compute the mean of each column of df? 

**Exercise.** Can you select the year 2010 and compute the row sum of df?

**Exercise (Challenging).** Can you compute the mean of GDP where it is larger than 6000 of df?

---
## Simple Statistics

Here are some simple commands that can report basic summary statistics of the data:
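
As a small self-contained sketch of this kind of summary, the snippet below calls `describe` on the original four-column data and then grabs a single number from the result with `.loc`. The names `summary` and `mean_cpi` are made up for the illustration.

```python
import pandas as pd

data = {"GDP": [5974.7, 10031.0, 14681.1],
        "CPI": [127.5, 169.3, 217.488],
        "Year": [1990, 2000, 2010],
        "Country": ["US", "US", "US"]}
df = pd.DataFrame(data)

# describe() returns a DataFrame: statistics in the rows, numeric columns in the columns
summary = df.describe()
print(summary)

# Because it is a DataFrame, .loc[row label, column label] picks out one entry
mean_cpi = summary.loc["mean", "CPI"]
print(mean_cpi)
```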

In [None]:
test = pd.DataFrame(df.mean(axis=0, numeric_only=True))  # skip the string column

test.loc["CPI"]

In [None]:
sumstate = df.describe() # This one creates a dataframe. Could grab what we want from there

In [None]:
print(type(sumstate))
print(sumstate)

In [None]:
df

**Exercise.** Compute the summary statistics (for the pwt data frame). Write these summary stats to an excel sheet. Can you do this only for China?

---
## Output/Save Data

We can output data in an easy way as well with these commands. Note that they create the files in your working directory unless you specify otherwise...
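
A hedged sketch of a CSV round trip: it writes to a temporary folder instead of the working directory so that it leaves no files behind, and passes `index=False` so the row labels 0, 1, 2, ... are not written as an extra column. The names `path` and `pwt_back` are made up for the illustration.

```python
import os
import tempfile

import pandas as pd

pwt_data = {'countrycode': ['CHN', 'CHN', 'CHN', 'FRA', 'FRA', 'FRA'],
            'pop': [1124.8, 1246.8, 1318.2, 58.2, 60.8, 64.7],
            'rgdpe': [2.611, 4.951, 11.106, 1.294, 1.753, 2.032],
            'year': [1990, 2000, 2010, 1990, 2000, 2010]}
pwt = pd.DataFrame(pwt_data)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pwt.csv")
    pwt.to_csv(path, index=False)   # write without the index column
    pwt_back = pd.read_csv(path)    # read the file back in

print(pwt_back.columns.tolist())
print(pwt_back.shape)
```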

In [None]:
pwt.to_csv("pwt.csv")

pwt.to_excel("pwt.xlsx")  # needs an Excel writer package such as openpyxl installed

---
## Summary

**Congratulations!** First, it's amazing that you have made it this far. Reflect on what you knew before working through this notebook, namely what we did in the Python fundamentals notebooks. Now reflect on what you can do...AMAZING!!! Let us summarize some key things that we covered.

* **Pandas Core Objects**: A `DataFrame` is essentially just a table of data, while a `Series` can be thought of as a one-column `DataFrame`.

* **Understanding the `DataFrame`**:
    * Become familiar with the basic attributes (`.columns`, `.shape`) and methods (`.sum()`, `.mean()`) of the `DataFrame` data structure.
    * Know the different ways to grab columns and rows and their pros and cons, especially the differences between `iloc` and `loc`. They look similar, but their inputs are very different. `loc` gets rows (or columns) with particular labels from the index, while `iloc` gets rows (or columns) at particular positions in the index (so it only takes integers).
    * Learned how to perform basic mathematical/statistical computations in `pandas`.

* **Axis Understanding**: when setting **axis**, always think about the operation first and which labels it runs over. For `drop`, **axis** = 1 removes a column and **axis** = 0 removes a row. For reductions such as `sum`, **axis** = 0 combines the rows and returns one value per column, while **axis** = 1 combines the columns and returns one value per row. For this course and the majority of dataframe work, the **axis** will always be **0** or **1**.
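
As a closing sketch of the axis idea, the snippet below sums a small all-numeric frame both ways and prints the labels of each result. The frame is indexed by year so that row and column labels are easy to tell apart; the names `down_the_rows` and `across_the_columns` are made up for the illustration.

```python
import pandas as pd

df = pd.DataFrame({"GDP": [5974.7, 10031.0, 14681.1],
                   "CPI": [127.5, 169.3, 217.488]},
                  index=[1990, 2000, 2010])

# axis=0: combine the rows, one result per column
down_the_rows = df.sum(axis=0)
print(down_the_rows.index.tolist())

# axis=1: combine the columns, one result per row
across_the_columns = df.sum(axis=1)
print(across_the_columns.index.tolist())
```

A quick check when in doubt: the labels of the result are the labels of the axis that was not collapsed.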

