# Using Financial Data in Python

Data for financial analysis come from two sources:
1. A web server
2. Files stored on a computer

To access data from a web server, we connect to an API. Yahoo Finance provides a good API for market data; other providers include Morningstar and Alpha Vantage.

When working with data stored on a computer, we have to handle specific file formats. One format every analyst should know how to work with is `*.csv` (comma-separated values).

For most of the course, we'll be working with CSV files provided by the course.
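
As a minimal sketch of loading a CSV file with pandas, the example below uses `pd.read_csv`. The column names and the handful of rows are typed in purely for illustration, and `io.StringIO` stands in for a real file so the snippet runs on its own; with an actual course file you would pass its path instead.

```python
import io
import pandas as pd

# Illustrative CSV contents; in practice this text would live in a file on disk.
csv_text = """Date,Close,Volume
2020-06-22,117.75,5695600
2020-06-23,117.73,5340400
2020-06-24,116.42,6079123
"""

# read_csv accepts a path or any file-like object; StringIO plays the file here.
# index_col + parse_dates turn the Date column into a DatetimeIndex.
prices = pd.read_csv(io.StringIO(csv_text), index_col="Date", parse_dates=True)
print(prices)
```

Parsing the dates up front is a design choice: it lets the rows be treated as a time series rather than as plain strings.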

## Importing and Organizing Data pt 1

In [1]:
import numpy as np
import pandas as pd

ser = pd.Series(np.random.random(5), name = "Column 1")

One of the two main data types in pandas is the `Series`. A Series can be thought of as a single column of data: a set of observations of a single variable.

In [2]:
ser

0    0.523613
1    0.704642
2    0.322144
3    0.999826
4    0.156622
Name: Column 1, dtype: float64

In [3]:
ser[2]

0.32214367806906496
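
In `ser[2]`, the `2` is matched against the index labels, which coincide with row positions only because the series has the default 0, 1, 2, ... index. The sketch below uses a made-up series with string labels to show the two explicit accessors, `.loc` (by label) and `.iloc` (by position).

```python
import pandas as pd

# Hypothetical series with string labels instead of the default 0..n-1 index.
s = pd.Series([10.0, 20.0, 30.0], index=["a", "b", "c"], name="Column 1")

by_label = s.loc["c"]      # look up by index label
by_position = s.iloc[2]    # look up by integer position (third element)
print(by_label, by_position)
```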

The other main data type is the `DataFrame`. It is like a Series, but with several columns.
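
To make the relationship concrete, here is a small sketch that builds a two-column frame from random numbers and then pulls one column back out. The column names and the seed are arbitrary choices for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded generator so the values are repeatable

# Each dict key becomes a column name; each value becomes that column's data.
df = pd.DataFrame({
    "Column 1": rng.random(5),
    "Column 2": rng.random(5),
})

one_column = df["Column 1"]  # selecting a single column gives back a Series
print(type(df).__name__, type(one_column).__name__)  # prints: DataFrame Series
```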

In [5]:
from pandas_datareader import data as wb

PG = wb.DataReader('PG', data_source='yahoo', start='1995-1-1')
PG

| Date | High | Low | Open | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|
| 1995-01-03 | 15.625000 | 15.437500 | 15.468750 | 15.593750 | 3318400.0 | 6.320252 |
| 1995-01-04 | 15.656250 | 15.312500 | 15.531250 | 15.468750 | 2218800.0 | 6.269589 |
| 1995-01-05 | 15.437500 | 15.218750 | 15.375000 | 15.250000 | 2319600.0 | 6.180927 |
| 1995-01-06 | 15.406250 | 15.156250 | 15.156250 | 15.281250 | 3438000.0 | 6.193593 |
| 1995-01-09 | 15.406250 | 15.187500 | 15.343750 | 15.218750 | 1795200.0 | 6.168259 |
| ... | ... | ... | ... | ... | ... | ... |
| 2020-06-18 | 119.959999 | 117.370003 | 117.459999 | 119.279999 | 6274400.0 | 119.279999 |
| 2020-06-19 | 121.820000 | 118.830002 | 120.489998 | 118.919998 | 17506200.0 | 118.919998 |
| 2020-06-22 | 119.080002 | 117.339996 | 118.779999 | 117.750000 | 5695600.0 | 117.750000 |
| 2020-06-23 | 119.190002 | 117.650002 | 118.669998 | 117.730003 | 5340400.0 | 117.730003 |


Here, we are extracting data from Yahoo Finance for Procter & Gamble (ticker `PG`), starting from January 1, 1995.

`DataReader(ticker, data_source, start)`

## Importing and Organizing Data pt 2

The data we've extracted form a time series: P&G's prices are recorded for every trading day, as shown.

In the first year, the adjusted closing price is much smaller than the closing price. In the most recent data, the adjusted close and the close are essentially equal.

The gap arises because the adjusted close is restated to account for dividends paid to stock owners and for other changes to the stock price such as stock splits and increases of capital. These adjustments accumulate going back in time, so the oldest prices differ the most.
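
A worked example may help show how such adjustments push older prices down. The sketch below applies one common back-adjustment convention (scale earlier prices by the split ratio, and by `1 - dividend / previous close` for a cash dividend). All prices, dates, and events are made up, and this is not necessarily the exact procedure Yahoo Finance uses.

```python
import pandas as pd

# Hypothetical closing prices; all values and dates are made up for illustration.
dates = pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-06", "2020-01-07"])
close = pd.Series([100.0, 102.0, 51.0, 50.0], index=dates, name="Close")

split_date = pd.Timestamp("2020-01-06")   # assumed: 2-for-1 split takes effect
ex_div_date = pd.Timestamp("2020-01-07")  # assumed: ex-dividend date
dividend = 1.00                           # assumed: cash dividend per share

# Prices before the split are halved so they compare with post-split prices.
split_factor = 1 / 2
# Prices before the ex-dividend date are scaled by (1 - dividend / previous close).
prev_close = close.loc[close.index < ex_div_date].iloc[-1]
dividend_factor = 1 - dividend / prev_close

adj_close = close.copy()
adj_close.loc[adj_close.index < split_date] *= split_factor
adj_close.loc[adj_close.index < ex_div_date] *= dividend_factor
adj_close = adj_close.rename("Adj Close")

print(pd.concat([close, adj_close], axis=1))
```

In this toy series the raw close falls from 102 to 51 on the split date, while the adjusted close stays at 50 across the last three days. Only the most recent row is left untouched, which mirrors how the close and adjusted close agree at the end of the P&G data.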

There are a few pandas methods we use for analysis.

In [6]:
PG.info()
# summarizes the DataFrame: index range, columns, non-null counts, dtypes, memory usage

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6415 entries, 1995-01-03 to 2020-06-24
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   High       6415 non-null   float64
 1   Low        6415 non-null   float64
 2   Open       6415 non-null   float64
 3   Close      6415 non-null   float64
 4   Volume     6415 non-null   float64
 5   Adj Close  6415 non-null   float64
dtypes: float64(6)
memory usage: 350.8 KB


In [7]:
PG.head()
# shows the first five rows of data by default
# to get a different number of rows, pass it in the parentheses, e.g. PG.head(10)

| Date | High | Low | Open | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|
| 1995-01-03 | 15.625 | 15.4375 | 15.46875 | 15.59375 | 3318400.0 | 6.320252 |
| 1995-01-04 | 15.65625 | 15.3125 | 15.53125 | 15.46875 | 2218800.0 | 6.269589 |
| 1995-01-05 | 15.4375 | 15.21875 | 15.375 | 15.25 | 2319600.0 | 6.180927 |
| 1995-01-06 | 15.40625 | 15.15625 | 15.15625 | 15.28125 | 3438000.0 | 6.193593 |
| 1995-01-09 | 15.40625 | 15.1875 | 15.34375 | 15.21875 | 1795200.0 | 6.168259 |


In [8]:
PG.tail()
# for when we want to see the last five rows of data

| Date | High | Low | Open | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|
| 2020-06-18 | 119.959999 | 117.370003 | 117.459999 | 119.279999 | 6274400.0 | 119.279999 |
| 2020-06-19 | 121.82 | 118.830002 | 120.489998 | 118.919998 | 17506200.0 | 118.919998 |
| 2020-06-22 | 119.080002 | 117.339996 | 118.779999 | 117.75 | 5695600.0 | 117.75 |
| 2020-06-23 | 119.190002 | 117.650002 | 118.669998 | 117.730003 | 5340400.0 | 117.730003 |
| 2020-06-24 | 117.959999 | 116.279999 | 117.220001 | 116.419998 | 6079123.0 | 116.419998 |
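
Both `head` and `tail` accept an optional row count. The sketch below shows this on a small made-up frame (ten daily rows with values 0 to 9), so it runs without downloading anything.

```python
import pandas as pd

# Hypothetical ten-row frame standing in for PG; the values are just 0..9.
idx = pd.date_range("2020-01-01", periods=10, freq="D")
df = pd.DataFrame({"Close": range(10)}, index=idx)

first_three = df.head(3)  # first 3 rows instead of the default 5
last_two = df.tail(2)     # last 2 rows instead of the default 5
print(first_three)
print(last_two)
```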


What if we want data on multiple companies?

In [9]:
tickers = ['PG', 'MSFT', 'T', 'F', 'GE']
new_data = pd.DataFrame()
for t in tickers:
    new_data[t] = wb.DataReader(t, data_source='yahoo', start='1995-1-1')['Adj Close']
    
new_data.tail()

| Date | PG | MSFT | T | F | GE |
|---|---|---|---|---|---|
| 2020-06-18 | 119.279999 | 196.320007 | 30.35 | 6.33 | 7.28 |
| 2020-06-19 | 118.919998 | 195.149994 | 30.309999 | 6.23 | 7.15 |
| 2020-06-22 | 117.75 | 200.570007 | 30.110001 | 6.28 | 7.04 |
| 2020-06-23 | 117.730003 | 201.910004 | 30.25 | 6.15 | 7.0 |
| 2020-06-24 | 116.419998 | 197.839996 | 29.42 | 5.95 | 6.53 |
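
In the loop above, each new column is aligned to the dates already in `new_data`, so the first ticker's dates decide which rows exist. As an alternative sketch, `pd.concat` with `axis=1` keeps the union of all dates and fills the gaps with `NaN`. The tickers `AAA` and `BBB` and their prices are hypothetical, used only so the example runs offline.

```python
import pandas as pd

# Hypothetical adjusted-close series for two made-up tickers;
# BBB starts one trading day later than AAA.
aaa = pd.Series(
    [10.0, 10.5, 11.0],
    index=pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-06"]),
)
bbb = pd.Series(
    [20.0, 21.0],
    index=pd.to_datetime(["2020-01-03", "2020-01-06"]),
)

# Dict keys become column names; rows are the union of both date indexes.
combined = pd.concat({"AAA": aaa, "BBB": bbb}, axis=1)
print(combined)
```

The missing first value for `BBB` shows up as `NaN` rather than being dropped, which makes differences in date coverage between tickers easy to spot.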
