# Pandas Basics

## Reading files into DataFrames

In Pandas, tabular data is handled through a [DataFrame](https://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe). A DataFrame is a two-dimensional data structure in which each column may hold a different data type, from numeric series to complex structures. In most cases, you can think of a DataFrame as a _table_. The Pandas documentation describes it as follows:

> DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects.

The [IO API](https://pandas.pydata.org/pandas-docs/stable/io.html) provides a reader for each supported format. The most common one is the delimited text file (CSV), read with `read_csv`:

In [45]:
import pandas as pd
data = pd.read_csv("goog.csv", index_col=0)
data.head()

| Date | Open | High | Low | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|
| 2016-07-01 | 692.200012 | 700.650024 | 692.130005 | 699.210022 | 1342700 | 699.210022 |
| 2016-06-30 | 685.469971 | 692.320007 | 683.650024 | 692.099976 | 1590500 | 692.099976 |
| 2016-06-29 | 683.0 | 687.429016 | 681.409973 | 684.109985 | 1928500 | 684.109985 |
| 2016-06-28 | 678.969971 | 680.330017 | 673.0 | 680.039978 | 2116600 | 680.039978 |
| 2016-06-27 | 671.0 | 672.299988 | 663.283997 | 668.26001 | 2629000 | 668.26001 |
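
The effect of `index_col` can be sketched without the original file. The example below is a minimal sketch that assumes an inline sample standing in for `goog.csv` (three rows copied from the output above). It also uses `parse_dates=True`, an option the original cell does not use, to show how the index labels change from plain strings to dates.

```python
import io

import pandas as pd

# Hypothetical inline sample standing in for goog.csv (rows copied from the output above).
csv_text = """Date,Open,High,Low,Close,Volume,Adj Close
2016-07-01,692.200012,700.650024,692.130005,699.210022,1342700,699.210022
2016-06-30,685.469971,692.320007,683.650024,692.099976,1590500,692.099976
2016-06-29,683.0,687.429016,681.409973,684.109985,1928500,684.109985
"""

# index_col=0 uses the first column (Date) as the row index.
# Without parse_dates, the index labels stay plain strings.
raw = pd.read_csv(io.StringIO(csv_text), index_col=0)

# With parse_dates=True, Pandas parses the index column as dates.
parsed = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

print(type(raw.index[0]).__name__)   # str
print(type(parsed.index).__name__)   # DatetimeIndex
```

Whether to parse dates at load time is a design choice: string labels are cheaper to read, while a date index enables date-aware selection.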


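The full list of methods is in the [DataFrame API documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html). A DataFrame is a collection of columns, and each column is a Series: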

In [4]:
type(data)

pandas.core.frame.DataFrame

In [5]:
type(data.Close)

pandas.core.series.Series
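
The "dict of Series objects" description quoted earlier can be made concrete. The sketch below assumes two small hand-built Series (values copied from the output above, variable names chosen for illustration) and shows that attribute access and bracket access return the same column.

```python
import pandas as pd

# Illustrative data: two rows copied from the output above.
dates = ["2016-07-01", "2016-06-30"]
open_prices = pd.Series([692.200012, 685.469971], index=dates)
close_prices = pd.Series([699.210022, 692.099976], index=dates)

# A DataFrame can be built from a dict that maps column names to Series.
frame = pd.DataFrame({"Open": open_prices, "Close": close_prices})

# Each column comes back as a Series, through either access style.
by_attribute = frame.Close
by_key = frame["Close"]

print(type(by_key).__name__)        # Series
print(by_attribute.equals(by_key))  # True
```

Bracket access is the more general form, since it also works for column names that are not valid Python identifiers, such as `Adj Close`.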

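By default, the index is not sorted: rows keep the order of the file, which here is newest first. A range query has to follow that order, so an ascending range such as `"2015":"2016"` returns no rows, while the reversed range `"2016":"2015"` does return data. Because the dates were loaded as plain strings, the bounds are compared as text, which is why the second result starts at the end of 2015: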

In [6]:
data["2015":"2016"].head()

| Date | Open | High | Low | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|


In [7]:
data["2016":"2015"].head()

| Date | Open | High | Low | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|
| 2015-12-31 | 769.5 | 769.5 | 758.340027 | 758.880005 | 1489600 | 758.880005 |
| 2015-12-30 | 776.599976 | 777.599976 | 766.900024 | 771.0 | 1293300 | 771.0 |
| 2015-12-29 | 766.690002 | 779.97998 | 766.429993 | 776.599976 | 1765000 | 776.599976 |
| 2015-12-28 | 752.919983 | 762.98999 | 749.52002 | 762.51001 | 1515300 | 762.51001 |
| 2015-12-24 | 749.549988 | 751.349976 | 746.619995 | 748.400024 | 527200 | 748.400024 |
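
One way to make ascending range queries behave as expected is to parse the dates and sort the index first. This is a minimal sketch, not the notebook's own approach: it assumes an inline sample in place of `goog.csv` (four rows copied from the outputs above) and uses `parse_dates=True`, `sort_index()` and `.loc`, none of which appear in the original cells.

```python
import io

import pandas as pd

# Hypothetical inline sample, newest first like the original file.
csv_text = """Date,Open,Close
2016-06-28,678.969971,680.039978
2016-06-27,671.0,668.26001
2015-12-31,769.5,758.880005
2015-12-30,776.599976,771.0
"""

data = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

# sort_index() returns a new DataFrame ordered by date, oldest first.
sorted_data = data.sort_index()

# With a sorted date index, partial date strings select whole periods.
both_years = sorted_data.loc["2015":"2016"]
only_2015 = sorted_data.loc["2015"]

print(len(both_years))  # 4
print(len(only_2015))   # 2
```

Sorting once after loading keeps later range queries independent of the order in which the file happened to be written.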


In [12]:
data["Diff"] = data.Close - data.Open
data[["Open", "Close", "Diff"]].head()

| Date | Open | Close | Diff |
|---|---|---|---|
| 2016-07-01 | 692.200012 | 699.210022 | 7.01001 |
| 2016-06-30 | 685.469971 | 692.099976 | 6.630005 |
| 2016-06-29 | 683.0 | 684.109985 | 1.109985 |
| 2016-06-28 | 678.969971 | 680.039978 | 1.070007 |
| 2016-06-27 | 671.0 | 668.26001 | -2.73999 |


In [13]:
del data["Diff"]
data.columns

Index(['Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close'], dtype='object')
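
The two cells above add a column by assignment and remove it with `del`, both of which modify `data` in place. As a possible alternative, the sketch below uses `assign` and `drop`, which return new DataFrames and leave the original untouched. The small `frame` is illustrative, with values copied from the output above.

```python
import pandas as pd

# Illustrative two-row frame with values copied from the output above.
frame = pd.DataFrame(
    {"Open": [692.200012, 685.469971], "Close": [699.210022, 692.099976]},
    index=["2016-07-01", "2016-06-30"],
)

# assign() returns a copy with the extra column; frame itself is unchanged.
with_diff = frame.assign(Diff=frame["Close"] - frame["Open"])

# drop(columns=...) returns a copy without the listed columns.
without_diff = with_diff.drop(columns=["Diff"])

print(list(with_diff.columns))     # ['Open', 'Close', 'Diff']
print(list(without_diff.columns))  # ['Open', 'Close']
print(list(frame.columns))         # ['Open', 'Close']
```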

---

## Fetching data from the web

Remote data access was originally included in Pandas and has since moved to its own package, [`pandas-datareader`](https://pandas-datareader.readthedocs.io/en/latest). We can install it using `pip`:

```
$ pip install pandas-datareader
```

In [39]:
import pandas_datareader.data as web
from datetime import datetime, timedelta
msft = web.DataReader("MSFT", "google", datetime(2010, 1, 1), datetime(2016, 12, 31))
msft[:1]

| Date | Open | High | Low | Close | Volume |
|---|---|---|---|---|---|
| 2010-01-04 | 30.62 | 31.1 | 30.59 | 30.95 | 38414185 |


### Caching
Data can be cached in a SQLite database to avoid repeating the query. You need to install the [`requests-cache`](https://requests-cache.readthedocs.io/en/latest/) package and pass a `session` argument to `DataReader`. Try executing this cell twice; the second run should be served from the cache:

In [37]:
import requests_cache
expire_after = timedelta(days=3)
session = requests_cache.CachedSession(cache_name='cache', backend='sqlite', expire_after=expire_after)
aapl = web.DataReader("AAPL", "google", datetime(2010, 1, 1), datetime(2016, 12, 31), session=session)
aapl[:1]

| Date | Open | High | Low | Close | Volume |
|---|---|---|---|---|---|
| 2010-01-04 | 30.49 | 30.64 | 30.34 | 30.57 | 123432050 |
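
To illustrate what a SQLite-backed cache with an expiry time does conceptually, here is a minimal sketch built only on the standard library. It is not how `requests_cache` is implemented, and `SqliteCache`, `get_or_fetch` and `fake_fetch` are hypothetical names introduced for this example. A stand-in function replaces the real network request so that the number of fetches can be counted.

```python
import sqlite3
import time


class SqliteCache:
    """Tiny key/value cache with expiry (illustrative, not requests_cache)."""

    def __init__(self, path=":memory:", expire_after=3 * 24 * 3600):
        self.expire_after = expire_after  # lifetime of an entry, in seconds
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache "
            "(key TEXT PRIMARY KEY, value TEXT, stored_at REAL)"
        )

    def get_or_fetch(self, key, fetch):
        row = self.conn.execute(
            "SELECT value, stored_at FROM cache WHERE key = ?", (key,)
        ).fetchone()
        # Cache hit: the entry exists and has not expired yet.
        if row is not None and time.time() - row[1] < self.expire_after:
            return row[0]
        # Miss or expired entry: fetch again and store the fresh value.
        value = fetch(key)
        self.conn.execute(
            "INSERT OR REPLACE INTO cache (key, value, stored_at) VALUES (?, ?, ?)",
            (key, value, time.time()),
        )
        self.conn.commit()
        return value


calls = []


def fake_fetch(symbol):
    """Stand-in for a network request; records every real fetch."""
    calls.append(symbol)
    return f"prices for {symbol}"


cache = SqliteCache()
first = cache.get_or_fetch("AAPL", fake_fetch)
second = cache.get_or_fetch("AAPL", fake_fetch)

print(first == second)  # True
print(len(calls))       # 1
```

The second lookup is answered from the database, which is the same effect the `session` argument is meant to have on repeated `DataReader` calls.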
