<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Pandas-DataFrames" data-toc-modified-id="Pandas-DataFrames-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Pandas DataFrames</a></span></li><li><span><a href="#Creating-dataframes-with-the-&quot;.DataFrame()&quot;-function" data-toc-modified-id="Creating-dataframes-with-the-&quot;.DataFrame()&quot;-function-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Creating dataframes with the ".DataFrame()" function</a></span></li><li><span><a href="#The-structure-of-a-DataFrame:-index,-columns,-values" data-toc-modified-id="The-structure-of-a-DataFrame:-index,-columns,-values-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>The structure of a DataFrame: index, columns, values</a></span></li><li><span><a href="#Manipulating-the-index" data-toc-modified-id="Manipulating-the-index-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Manipulating the index</a></span><ul class="toc-item"><li><span><a href="#.reset_index()" data-toc-modified-id=".reset_index()-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>.reset_index()</a></span></li><li><span><a href="#.set_index()" data-toc-modified-id=".set_index()-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>.set_index()</a></span></li><li><span><a href="#.rename()" data-toc-modified-id=".rename()-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>.rename()</a></span></li></ul></li><li><span><a href="#Manipulating-columns" data-toc-modified-id="Manipulating-columns-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Manipulating columns</a></span><ul class="toc-item"><li><span><a href="#.rename()" data-toc-modified-id=".rename()-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>.rename()</a></span></li><li><span><a href="#Accessing-and-creating-entire-columns" data-toc-modified-id="Accessing-and-creating-entire-columns-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Accessing and creating entire columns</a></span></li></ul></li><li><span><a 
href="#Pandas-Series" data-toc-modified-id="Pandas-Series-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Pandas Series</a></span></li></ul></div>

# Pandas DataFrames

Almost all our data analysis in this course will be performed using a new data structure called Pandas DataFrames. These data structures are implemented in the Pandas package, which comes with the Anaconda installation. It is customary to import the Pandas package as below. 

In [None]:
import pandas as pd

The Pandas package has very rich functionality. In this course we will cover only a subset of what one can do with this package. If you are interested in a higher level of detail than what is covered in this course, I strongly recommend the official user guide for the package: 

https://pandas.pydata.org/docs/user_guide/index.html

We will be referencing different parts of this user guide at different points in the class. You are not expected to know all the details in the user guide, only the parts that are covered in class or in the practice problems.

The pandas package comes with many different functions/attributes. We will introduce some of the more commonly used attributes gradually over the course of the semester. Remember, an object attribute is anything that you can put (using dot notation) after the name of that object. We will use the terms "attribute", "method" and "function" interchangeably.

# Creating dataframes with the ".DataFrame()" function

One of the most important functions in the pandas package is the DataFrame() function, which allows us to create a new dataframe (this is similar to using ``[]`` to create lists, and using ``{}`` to create dictionaries). 

The general syntax for DataFrame() is as follows:


```python
pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)
```

As an aside, every time we introduce a new Pandas function, you should check its official documentation to see how you're supposed to use it. Usually, this is the first result if you Google "pandas" followed by the name of that function. For example, Google "pandas DataFrame()" and you should get this https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html as one of the first results. 

If you imported pandas as pd (as above), you would replace "pandas" in the syntax above with "pd". 

For example:

In [None]:
df = pd.DataFrame(data=[['TSLA',1000],['AAPL',2000]], 
                  index = ['Tesla','Apple'], 
                  columns = ['ticker','price'])

The cell above creates a new variable ``df`` of type "pandas.core.frame.DataFrame":

In [None]:
print(type(df))

To print ``df``, we can use the ``print()`` function as usual:

In [None]:
print(df)

Simply typing the name of the dataframe as the last line of a cell also works, and the output looks nicer. We will use this method throughout the course:

In [None]:
df

From the syntax of the DataFrame() function (above), you should be able to tell that any of the three parameters we used (``data``, ``index`` and ``columns``) can be omitted, because each of them has a default value. 

For example:

In [None]:
df2 = pd.DataFrame(data=[['TSLA',1000],['AAPL',2000]], index = ['Tesla','Apple'])
df2

In [None]:
df3 = pd.DataFrame(data=[['TSLA',1000],['AAPL',2000]], columns = ['ticker','price'])
df3

In [None]:
df4 = pd.DataFrame( index = ['Tesla','Apple'], columns = ['ticker','price'])
df4
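
To make the effect of each omitted parameter concrete, here is a small sketch that rebuilds the three dataframes above and checks what pandas filled in by default. The variable names (``no_index_given``, ``no_columns_given``, ``no_data_given``) are illustrative choices only and are not used anywhere else in these notes.

```python
import pandas as pd

rows = [['TSLA', 1000], ['AAPL', 2000]]

# No columns supplied: pandas labels the columns 0, 1, ...
no_columns_given = pd.DataFrame(data=rows, index=['Tesla', 'Apple'])
print(list(no_columns_given.columns))   # [0, 1]

# No index supplied: pandas labels the rows 0, 1, ...
no_index_given = pd.DataFrame(data=rows, columns=['ticker', 'price'])
print(list(no_index_given.index))       # [0, 1]

# No data supplied: every cell is a missing value (NaN)
no_data_given = pd.DataFrame(index=['Tesla', 'Apple'], columns=['ticker', 'price'])
print(no_data_given.isna().all().all())  # True
```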

Finally, if you check the documentation of DataFrame(), you will see that there are several different types of objects you can supply to each of its parameters. 

For example, we could have created our initial dataframe ``df`` using a dictionary for the ``data`` parameter (instead of a list):

In [None]:
df5 = pd.DataFrame(data = {'ticker': ['TSLA','AAPL'], 'price': [1000, 2000]}, 
                   index = ['Tesla','Apple'])
df5

Note how the ``columns`` parameter is missing, because column names are specified as the keys in the dictionary supplied to the ``data`` parameter.
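
As a quick check that the list approach and the dictionary approach really produce the same table, the sketch below builds both and compares them label by label and cell by cell. The names ``from_list`` and ``from_dict`` are hypothetical and chosen only for this example.

```python
import pandas as pd

# Approach 1: a list of rows, with column names given separately
from_list = pd.DataFrame(data=[['TSLA', 1000], ['AAPL', 2000]],
                         index=['Tesla', 'Apple'],
                         columns=['ticker', 'price'])

# Approach 2: a dictionary whose keys become the column names
from_dict = pd.DataFrame(data={'ticker': ['TSLA', 'AAPL'], 'price': [1000, 2000]},
                         index=['Tesla', 'Apple'])

# Same row names and same column names
print(list(from_list.index) == list(from_dict.index))      # True
print(list(from_list.columns) == list(from_dict.columns))  # True

# Comparing two identically labeled dataframes with == checks every cell
print((from_list == from_dict).all().all())                # True
```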

We will not go through all the different ways to create a new dataframe. The key point is that **most Pandas functions offer a large degree of flexibility with respect to how they can be used**. This is very helpful for more advanced users, but it can be quite confusing for beginners. You can only get better at this with practice. Look at the documentation, and experiment with different ways to use that particular function. Check the results and see if you can explain what happened. 

**I will not expect you to know every little detail about the documentation of all the functions we introduce throughout the course**. But you should be able to use those functions in the form that I use them in the notes. For example, given the code above, I expect you to know how to create a new dataframe from a list (first approach at the top) and from a dictionary (see code cell right above this one). In the code below, I also show you how to use the "numpy" package to create a dataframe containing some randomly generated data (this comes in handy when you want to quickly generate an example dataframe to test some of your code, when the contents of the data don't really matter):

In [None]:
import numpy as np #usually we have all import statements at the top of the file

In [None]:
rdata = pd.DataFrame(data = np.random.rand(10,3), columns = list('abc'))
rdata
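
Because the numbers above change every time the cell is run, it can be handy to fix them when you want to compare results across runs. The sketch below is one possible way to do that, using numpy's ``default_rng`` generator with a seed. The seed value ``42`` and the names ``rdata_fixed`` and ``rdata_again`` are arbitrary choices made for this illustration.

```python
import numpy as np
import pandas as pd

# The seed (42) is an arbitrary choice; any integer works
rng = np.random.default_rng(seed=42)

# 10 rows and 3 columns of random numbers between 0 and 1
rdata_fixed = pd.DataFrame(data=rng.random((10, 3)), columns=list('abc'))

# Starting again from the same seed reproduces exactly the same numbers
rng_again = np.random.default_rng(seed=42)
rdata_again = pd.DataFrame(data=rng_again.random((10, 3)), columns=list('abc'))

print(rdata_fixed.equals(rdata_again))  # True
```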

# The structure of a DataFrame: index, columns, values 

The syntax of the DataFrame() function should make it clear what the main components of a dataframe are:
- index: the names of the rows (also referred to as axis 0 of the dataframe)
- columns: the names of the columns (also referred to as axis 1 of the dataframe)
- values: the data contained in the table itself

We can access each of these individual parts of the dataframe using the "index", "columns" and "values" attributes:

In [None]:
df

In [None]:
df.index

In [None]:
df.columns

In [None]:
df.values

Note that these attributes are not followed by parentheses. They are still attributes, but they do not compute anything; they simply return some property of the dataframe. It is important to pay attention to these details. For example, the ``index`` attribute above, which returns the row names of an existing dataframe, is very different from the ``pd.Index()`` function, which creates a new index object (https://pandas.pydata.org/docs/reference/api/pandas.Index.html). 
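
The difference between "with parentheses" and "without parentheses" can be seen in a short sketch. It uses a throwaway dataframe called ``df_demo`` (a name made up for this example) and Python's built-in ``callable()`` function, which tells us whether something still needs to be called with parentheses.

```python
import pandas as pd

df_demo = pd.DataFrame(data=[['TSLA', 1000], ['AAPL', 2000]],
                       index=['Tesla', 'Apple'],
                       columns=['ticker', 'price'])

# No parentheses: we read a property the dataframe already has
existing_index = df_demo.index
print(list(existing_index))         # ['Tesla', 'Apple']

# Parentheses: we call pd.Index() to build a brand new index object
new_index = pd.Index(['a', 'b'])
print(list(new_index))              # ['a', 'b']

# Leaving the parentheses off something that computes a result
# gives back the function itself, not the result
print(callable(df_demo.describe))   # True: it still needs to be called
print(callable(df_demo.shape))      # False: it is already a value
```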

Finally, note that the data inside a dataframe is represented as a numpy array:

In [None]:
type(df.values)

At this point in time, it is not important to know anything about the ``numpy`` package or the ``ndarray`` type that comes with it (we will introduce numpy later on in the course). I just wanted to point out a very important feature of the Python programming language: many packages are built using other existing packages (e.g. pandas uses the numpy and matplotlib packages to name a few).  

The **.info()** function gives us another way to quickly check the structure of our DataFrame:

In [None]:
df.info()

The **.describe()** function gives us summary statistics for the numerical columns in the dataframe:

In [None]:
df.describe()
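
By default, the text column ('ticker') is left out of the summary. The sketch below shows this on a throwaway copy of our dataframe (``df_demo``, a name chosen just for this example), and also shows the ``include='all'`` option, which asks pandas to summarize every column.

```python
import pandas as pd

df_demo = pd.DataFrame(data=[['TSLA', 1000], ['AAPL', 2000]],
                       index=['Tesla', 'Apple'],
                       columns=['ticker', 'price'])

# Default behavior: only the numerical column ('price') is summarized
numeric_summary = df_demo.describe()
print(list(numeric_summary.columns))   # ['price']

# include='all' summarizes every column, including text columns
full_summary = df_demo.describe(include='all')
print(list(full_summary.columns))      # ['ticker', 'price']
```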

The **.dtypes** attribute tells us the data type for each column:

In [None]:
df.dtypes

The **.shape** attribute tells us the shape of the dataframe:

In [None]:
df.shape
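
The shape comes back as a pair of numbers: first the number of rows, then the number of columns. A minimal sketch of how one might use it, again on a throwaway dataframe (``df_demo``, ``n_rows`` and ``n_cols`` are names introduced only for this example):

```python
import pandas as pd

df_demo = pd.DataFrame(data=[['TSLA', 1000], ['AAPL', 2000]],
                       index=['Tesla', 'Apple'],
                       columns=['ticker', 'price'])

# Unpack the (rows, columns) pair into two variables
n_rows, n_cols = df_demo.shape
print(n_rows, n_cols)                   # 2 2

# len() of a dataframe counts its rows
print(len(df_demo) == n_rows)           # True

# The number of column names matches the second entry of shape
print(len(df_demo.columns) == n_cols)   # True
```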

# Manipulating the index

In [None]:
df

## .reset_index()

We can switch to a numerical index for our dataframe using the **.reset_index()** function. 

Syntax:

```python
reset_index(level=None, drop=False, inplace=False, col_level=0, col_fill='')
```

For example:

In [None]:
df.reset_index()

Note how this pushed the old index inside the table itself, as a new column.

**Very important**: the code above did not actually change ``df``:

In [None]:
df

To do that, we have to set the 'inplace' parameter to True:

In [None]:
df.reset_index(inplace = True)

In [None]:
df

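Or we can simply overwrite the ``df`` variable with the result. Note that if you run the cell below right after the ``inplace = True`` cell above, the index gets reset a second time, which adds one more column (named ``level_0``) holding the previous numerical index: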

In [None]:
df = df.reset_index()

In [None]:
df
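
The syntax above also lists a ``drop`` parameter that we have not used yet. As a rough illustration, the sketch below compares the default behavior (old index kept as a column) with ``drop=True`` (old index thrown away). The names ``df_demo``, ``kept`` and ``dropped`` are made up for this example.

```python
import pandas as pd

df_demo = pd.DataFrame(data=[['TSLA', 1000], ['AAPL', 2000]],
                       index=['Tesla', 'Apple'],
                       columns=['ticker', 'price'])

# Default (drop=False): the old index is kept as a new column named 'index'
kept = df_demo.reset_index()
print(list(kept.columns))      # ['index', 'ticker', 'price']

# drop=True: the old index is discarded instead of becoming a column
dropped = df_demo.reset_index(drop=True)
print(list(dropped.columns))   # ['ticker', 'price']

# In both cases the new index is simply 0, 1, ...
print(list(dropped.index))     # [0, 1]
```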

## .set_index()

We can use the data in one of the columns as the new index of the table with the **.set_index()** function:

Syntax:

```python
set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)
```

For example:

In [None]:
df = df.set_index('ticker')

In [None]:
df

Note that, once we make the 'ticker' column into an index, it stops showing up as a column:

In [None]:
df.columns

So if we try to set the index to 'ticker' again, we will get an error (a ``KeyError``), because Python cannot find the 'ticker' column:

In [None]:
df.set_index('ticker') #this won't work: raises a KeyError because 'ticker' is no longer a column
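
If you want to see the error without stopping the rest of your code, one option is to wrap the call in a ``try``/``except`` block. The sketch below does that on a throwaway dataframe, and then shows one way to get 'ticker' back as a column by resetting the index first. The names ``df_demo``, ``by_ticker`` and ``restored`` are hypothetical.

```python
import pandas as pd

df_demo = pd.DataFrame(data=[['TSLA', 1000], ['AAPL', 2000]],
                       columns=['ticker', 'price'])

# 'ticker' moves out of the columns and becomes the index
by_ticker = df_demo.set_index('ticker')
print(list(by_ticker.columns))   # ['price']

# A second attempt fails, because 'ticker' is no longer a column
try:
    by_ticker.set_index('ticker')
except KeyError as err:
    print('KeyError:', err)

# Push the index back into the table first; then 'ticker' is a column again
restored = by_ticker.reset_index().set_index('ticker')
print(restored.index.name)       # ticker
```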

## .rename()

We can rename one or more of the index values using the **.rename()** function.

Syntax:
```python
rename(mapper=None, index=None, columns=None, axis=None, copy=True, inplace=False, level=None, errors='ignore')
```

We will supply a dictionary to the ``index`` parameter, where the keys are the current names we want to change in the index, and the values are the new names we want to use.

For example:

In [None]:
df.rename(index = {'TSLA': 'tsla'})

As with **.reset_index()** and **.set_index()**, **.rename()** does not actually change the ``df`` dataframe unless we set ``inplace = True`` or we redefine the ``df`` variable. This behavior should be expected of any pandas function that has an ``inplace`` parameter:

In [None]:
df
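
Here is a small sketch of how one might keep the renamed version, and of what the ``errors='ignore'`` default in the syntax above means in practice. The dataframe ``prices`` and the label 'MSFT' are made up for this example.

```python
import pandas as pd

prices = pd.DataFrame(data={'price': [1000, 2000]}, index=['TSLA', 'AAPL'])

# Store the result in a variable to keep the renamed version
renamed = prices.rename(index={'TSLA': 'tsla'})
print(list(renamed.index))     # ['tsla', 'AAPL']

# The original dataframe is untouched
print(list(prices.index))      # ['TSLA', 'AAPL']

# With the default errors='ignore', names that are not in the index
# are skipped silently instead of causing an error
unchanged = prices.rename(index={'MSFT': 'msft'})
print(list(unchanged.index))   # ['TSLA', 'AAPL']
```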

# Manipulating columns

Let's create a fresh copy of our original dataframe, called ``ndf``, to make sure we're on the same page at this point:

In [None]:
ndf = pd.DataFrame(data=[['TSLA',1000],['AAPL',2000]], 
                   index = ['Tesla','Apple'], 
                   columns = ['ticker','price'])
ndf

## .rename()

We can rename one or more of the columns using the **.rename()** function.

This time, we will supply a dictionary to the ``columns`` parameter, where the keys are the current names of the columns we want to change, and the values are the new names we want to use.

For example:

In [None]:
ndf = ndf.rename(columns = {'ticker': 't', 'price':'p'})

In [None]:
ndf

## Accessing and creating entire columns

We can access the data in an entire column of a dataframe using square brackets after the dataframe name:

In [None]:
ndf['t']

If we want to access multiple columns at once, we have to put their names in a list:

In [None]:
ndf[['t','p']]
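
One detail worth noticing is that the two styles of square brackets give back different types of objects, even when only one column is requested. The sketch below illustrates this on a throwaway dataframe (``ndf_demo``, ``one_column`` and ``one_column_table`` are names used only here). The Series type is introduced at the end of these notes.

```python
import pandas as pd

ndf_demo = pd.DataFrame(data=[['TSLA', 1000], ['AAPL', 2000]],
                        index=['Tesla', 'Apple'],
                        columns=['t', 'p'])

# A single name in the brackets gives back a Series
one_column = ndf_demo['t']
print(type(one_column).__name__)         # Series

# A list of names (even a list with one name) gives back a DataFrame
one_column_table = ndf_demo[['t']]
print(type(one_column_table).__name__)   # DataFrame
```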

We can create new columns by just supplying the data manually:

In [None]:
ndf['new'] = [1,2]
ndf

But most of the time, new columns will be created by bringing them over from other dataframes (we'll cover this in a future lecture) or by some calculation involving other existing columns from the dataframe.

For example:

In [None]:
ndf['somecalc'] = ndf['p'] + ndf['new']
ndf
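
Calculations like the one above are applied row by row: the first row of 'p' is added to the first row of 'new', and so on. A minimal sketch, using a throwaway dataframe ``ndf_demo`` and a made-up column name ``p_doubled``, shows the same idea with a single number applied to every row:

```python
import pandas as pd

ndf_demo = pd.DataFrame(data={'p': [1000, 2000], 'new': [1, 2]},
                        index=['Tesla', 'Apple'])

# Column + column: values are combined row by row
ndf_demo['somecalc'] = ndf_demo['p'] + ndf_demo['new']
print(ndf_demo['somecalc'].tolist())    # [1001, 2002]

# Column * number: the number is applied to every row
ndf_demo['p_doubled'] = ndf_demo['p'] * 2
print(ndf_demo['p_doubled'].tolist())   # [2000, 4000]
```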

# Pandas Series

The pandas package comes with another commonly used data structure: the Series. This is a data structure that contains a single column and an index.

For example:

In [None]:
s = pd.Series(data = [12,34])
s

In [None]:
type(s)

We will almost never use Series in our analysis. We only introduce it here because some pandas functions return a Series as a result. For the most part, we will convert series to dataframes using the **.to_frame()** function: 

In [None]:
s = s.to_frame()
s

In [None]:
type(s)
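
Like a dataframe, a Series can be given its own index, and it can also carry a name, which becomes the column name after conversion. The sketch below is one way to see this. The label 'shares' and the variable names are assumptions made for illustration only.

```python
import pandas as pd

# A Series with a custom index and a name
s_demo = pd.Series(data=[12, 34], index=['Tesla', 'Apple'], name='shares')

# The name of the Series becomes the column name of the dataframe
named_frame = s_demo.to_frame()
print(list(named_frame.columns))     # ['shares']
print(list(named_frame.index))       # ['Tesla', 'Apple']

# A Series without a name gets the column name 0,
# unless we pass one through the name parameter of to_frame()
unnamed_frame = pd.Series(data=[12, 34]).to_frame()
print(list(unnamed_frame.columns))   # [0]

relabeled_frame = pd.Series(data=[12, 34]).to_frame(name='shares')
print(list(relabeled_frame.columns)) # ['shares']
```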