# Merging ordered or time series data

## 1.1 merge_ordered( )
### 1.1.1 .merge( ) method                                                    

columns to join on -  on, left_on, right_on

type of join - how(*left, right, inner, outer*)

default join is inner

overlapping column names - suffixes

calling the method - df1.merge(df2)
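As a minimal sketch of how these arguments fit together, the example below joins two small tables whose key columns have different names and which share a non-key column name. The tables and their column names (id, key, value) are made up for illustration.

```python
import pandas as pd

# Hypothetical toy tables: the key is called "id" on the left and "key" on the right,
# and both tables have a non-key column named "value".
df1 = pd.DataFrame({"id": [1, 2, 3], "value": [10, 20, 30]})
df2 = pd.DataFrame({"key": [2, 3, 4], "value": [200, 300, 400]})

# left_on / right_on name the key column in each table,
# how picks the join type, and suffixes disambiguates the shared "value" column.
merged = df1.merge(
    df2,
    left_on="id",
    right_on="key",
    how="left",
    suffixes=("_df1", "_df2"),
)
print(merged)
```

Because this is a left join, every row of df1 is kept and the row with no match in df2 gets missing values in the right-hand columns.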



### 1.1.2 merge_ordered( ) function

columns to join on - on, left_on, right_on

type of join - how(*left, right, inner, outer*)

default join is outer

overlapping column names - suffixes

calling the function - pd.merge_ordered(df1, df2)
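The difference in default join type is easiest to see side by side. The sketch below uses two made-up tables (the names left, right, key, left_val and right_val are assumptions for illustration) that share only one key value.

```python
import pandas as pd

# Hypothetical toy tables, already sorted by key; only key 2 appears in both.
left = pd.DataFrame({"key": [1, 2, 3], "left_val": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 4], "right_val": ["x", "y"]})

# .merge() defaults to an inner join: only keys found in both tables are kept.
inner = left.merge(right, on="key")

# merge_ordered() defaults to an outer join: every key from either table is kept,
# and the rows come back ordered by the key.
ordered = pd.merge_ordered(left, right, on="key")

print(inner)
print(ordered)
```

The outer result contains missing values wherever one table has no row for a key, which is what the forward fill option is meant to address.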



### forward filling technique
Forward filling fills each missing value with the most recent previous value in the merged, ordered result. "ffill" means forward fill.

pd.merge_ordered(df1,df2, on = "common_order_column", suffixes = ("_df1","_df2"), fill_method = "ffill")
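To make the effect concrete, here is a small sketch comparing the same merge with and without forward fill. The tables (quarterly, yearly) and their columns are invented for illustration.

```python
import pandas as pd

# Hypothetical data: the left table has a row for every period,
# the right table only has values for periods 1 and 3.
quarterly = pd.DataFrame({"period": [1, 2, 3, 4], "sales": [10, 12, 11, 15]})
yearly = pd.DataFrame({"period": [1, 3], "rate": [0.5, 0.7]})

# Without fill_method, periods 2 and 4 have no matching rate and are left missing.
no_fill = pd.merge_ordered(quarterly, yearly, on="period")

# With fill_method="ffill", each gap takes the value from the previous row.
filled = pd.merge_ordered(quarterly, yearly, on="period", fill_method="ffill")

print(no_fill)
print(filled)
```

Forward fill assumes the previous value is still valid for the later row, so it suits slowly changing series better than volatile ones.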

In [1]:
# import library
import pandas as pd

Question
Analyze stock returns from the S&P 500. You believe there may be a relationship between the returns of the S&P 500 and the GDP of the US. Merge the different datasets together to compute the correlation.


In [20]:
# load the dataframes
gdp = pd.read_csv("WorldBank_GDP.csv")
sp500 = pd.read_csv("S&P500.csv")
pop = pd.read_csv("WorldBank_POP.csv")

In [23]:
# view the dataframe
gdp.head()

Unnamed: 0,Country Name,Country Code,Indicator Name,Year,GDP
0,China,CHN,GDP (current US$),2010,6087160000000.0
1,Germany,DEU,GDP (current US$),2010,3417090000000.0
2,Japan,JPN,GDP (current US$),2010,5700100000000.0
3,United States,USA,GDP (current US$),2010,14992100000000.0
4,China,CHN,GDP (current US$),2011,7551500000000.0


In [6]:
# view the data frame
sp500.head()

Unnamed: 0,Date,Returns
0,2008,-38.49
1,2009,23.45
2,2010,12.78
3,2011,0.0
4,2012,13.41


In [21]:
# view the dataframe
pop.head()

Unnamed: 0,Country Name,Country Code,Indicator Name,Year,Pop
0,Aruba,ABW,"Population, total",2010,101669.0
1,Afghanistan,AFG,"Population, total",2010,29185507.0
2,Angola,AGO,"Population, total",2010,23356246.0
3,Albania,ALB,"Population, total",2010,2913021.0
4,Andorra,AND,"Population, total",2010,84449.0


In [14]:
# Use merge_ordered() to merge gdp and sp500 on year and date
gdp_sp500 = pd.merge_ordered(gdp, sp500, left_on="Year", right_on="Date", how="left")
gdp_sp500

Unnamed: 0,Country Name,Country Code,Indicator Name,Year,GDP,Date,Returns
0,China,CHN,GDP (current US$),2010,6087160000000.0,2010.0,12.78
1,Germany,DEU,GDP (current US$),2010,3417090000000.0,2010.0,12.78
2,Japan,JPN,GDP (current US$),2010,5700100000000.0,2010.0,12.78
3,United States,USA,GDP (current US$),2010,14992100000000.0,2010.0,12.78
4,China,CHN,GDP (current US$),2011,7551500000000.0,2011.0,0.0
5,Germany,DEU,GDP (current US$),2011,3757700000000.0,2011.0,0.0
6,Japan,JPN,GDP (current US$),2011,6157460000000.0,2011.0,0.0
7,United States,USA,GDP (current US$),2011,15542600000000.0,2011.0,0.0
8,China,CHN,GDP (current US$),2012,8532230000000.0,2012.0,13.41
9,Germany,DEU,GDP (current US$),2012,3543980000000.0,2012.0,13.41


In [16]:
# Use merge_ordered() to merge gdp and sp500, interpolate missing value
gdp_sp500 = pd.merge_ordered(gdp, sp500, left_on = "Year", right_on = "Date",how = "left", fill_method = "ffill")
gdp_sp500

Unnamed: 0,Country Name,Country Code,Indicator Name,Year,GDP,Date,Returns
0,China,CHN,GDP (current US$),2010,6087160000000.0,2010,12.78
1,Germany,DEU,GDP (current US$),2010,3417090000000.0,2010,12.78
2,Japan,JPN,GDP (current US$),2010,5700100000000.0,2010,12.78
3,United States,USA,GDP (current US$),2010,14992100000000.0,2010,12.78
4,China,CHN,GDP (current US$),2011,7551500000000.0,2011,0.0
5,Germany,DEU,GDP (current US$),2011,3757700000000.0,2011,0.0
6,Japan,JPN,GDP (current US$),2011,6157460000000.0,2011,0.0
7,United States,USA,GDP (current US$),2011,15542600000000.0,2011,0.0
8,China,CHN,GDP (current US$),2012,8532230000000.0,2012,13.41
9,Germany,DEU,GDP (current US$),2012,3543980000000.0,2012,13.41


In [19]:
# Subset the gdp and returns columns
gdp_returns = gdp_sp500[["GDP", "Returns"]]

# Print gdp_returns correlation
gdp_returns.corr()

Unnamed: 0,GDP,Returns
GDP,1.0,0.040669
Returns,0.040669,1.0


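When using merge_ordered() to merge on multiple columns, the order of the columns matters when you combine it with forward fill. The function sorts the result by the "on" columns in the order provided, and forward fill then copies values down that sorted order, so a different column order can fill a gap from a different row.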
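The sketch below illustrates why the column order matters, using two made-up tables (gdp_toy and pop_toy, with hypothetical countries "A" and "B"). Population is only known for 2010, so the 2011 rows must be forward filled.

```python
import pandas as pd

# Hypothetical toy data: GDP for two countries over two years,
# population known only for 2010.
gdp_toy = pd.DataFrame({
    "date": [2010, 2010, 2011, 2011],
    "country": ["A", "B", "A", "B"],
    "gdp": [1.0, 2.0, 1.1, 2.1],
})
pop_toy = pd.DataFrame({
    "date": [2010, 2010],
    "country": ["A", "B"],
    "pop": [100, 200],
})

# Sorted by date first: the 2011 rows follow country B's 2010 row,
# so country A's 2011 gap is filled from country B.
by_date = pd.merge_ordered(gdp_toy, pop_toy, on=["date", "country"], fill_method="ffill")

# Sorted by country first: each country's rows are adjacent,
# so each gap is filled from the same country's earlier year.
by_country = pd.merge_ordered(gdp_toy, pop_toy, on=["country", "date"], fill_method="ffill")

print(by_date)
print(by_country)
```

In cases like this one, putting the grouping column (country) before the ordering column (date) keeps the fill from leaking across groups.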

## 1.2 merge_asof( )
It is similar to an ordered left join and shares many features with merge_ordered( ). However, instead of requiring equal key values, merge_asof( ) matches each left row to the right row with the nearest key value (by default, the closest key that is less than or equal to the left key). The columns you merge on must be sorted in both tables.

pd.merge_asof(df1, df2, on = "ordered_column", suffixes = ("_df1", "_df2"))
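A minimal sketch of the default behaviour, using invented trades and quotes tables keyed on an integer time column (all names and values here are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data; both tables are sorted by "time" as merge_asof requires.
trades = pd.DataFrame({"time": [2, 5, 9], "qty": [100, 200, 300]})
quotes = pd.DataFrame({"time": [1, 4, 10], "price": [10.0, 11.0, 12.0]})

# Default direction: each trade takes the latest quote at or before its time.
# time 2 -> quote at 1, time 5 -> quote at 4, time 9 -> quote at 4
matched = pd.merge_asof(trades, quotes, on="time")
print(matched)
```

Every row of the left table is kept, which is why the result behaves like a left join.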


### direction argument as forward.
This changes the behaviour of the function to select the first row in the right table whose "on" key column is greater than or equal to the left's key column. The default value of the direction argument is "backward", which selects the last row in the right table whose key is less than or equal to the left's key.

pd.merge_asof(df1, df2, on = "ordered_column", suffixes = ("_df1", "_df2"), direction = "forward")

### direction argument as nearest.
Returns the row in the right table with the closest key, regardless of whether it is before or after the left's key.

pd.merge_asof(df1, df2, on = "ordered_column", suffixes = ("_df1", "_df2"), direction = "nearest")
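The three direction values can be compared on the same data. This sketch uses made-up trades and quotes tables (hypothetical names and values) and collects the matched price for each direction.

```python
import pandas as pd

# Hypothetical toy data, sorted by "time" in both tables.
trades = pd.DataFrame({"time": [2, 5, 9], "qty": [100, 200, 300]})
quotes = pd.DataFrame({"time": [1, 4, 10], "price": [10.0, 11.0, 12.0]})

# Run the same merge once per direction and keep the matched prices.
prices = {}
for direction in ["backward", "forward", "nearest"]:
    result = pd.merge_asof(trades, quotes, on="time", direction=direction)
    prices[direction] = result["price"].tolist()

# backward: latest quote at or before each trade time
# forward:  earliest quote at or after each trade time
# nearest:  quote with the smallest absolute time difference
for direction, values in prices.items():
    print(direction, values)
```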



## 1.3 selecting with .query( )
Accepts a string expression and returns the rows of the table for which it is true. The string is similar to the portion after WHERE in a SQL statement.

df.query('col1 > 90')

df.query('col1 > 90 and col4 < 140')

df.query('col1 > 90 or col4 < 140')


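### when checking text, use the double equal sign and double quotes.
Compare text with == and wrap the text value in double quotes. Because the whole statement is delimited with single quotes, using double quotes for the text avoids unintentionally ending the string early.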

df.query('col2 == "disney" or (col2 == "nike" and col3 < 90)')
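The query patterns above can be tried on a small table. The stocks table and its column names (company, close, volume) are invented for this sketch.

```python
import pandas as pd

# Hypothetical toy data.
stocks = pd.DataFrame({
    "company": ["disney", "nike", "nike", "apple"],
    "close": [95, 85, 120, 150],
    "volume": [130, 100, 150, 120],
})

# Single numeric condition.
high_close = stocks.query('close > 90')

# Two conditions combined with "and".
high_close_low_volume = stocks.query('close > 90 and volume < 140')

# Text comparison: == with the text in double quotes inside the single-quoted string.
text_match = stocks.query('company == "disney" or (company == "nike" and close < 90)')

print(high_close)
print(high_close_low_volume)
print(text_match)
```

The returned rows keep their original index labels, which makes it easy to see which rows passed each filter.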






## 1.4 Reshaping data with melt
Wide data - Every row relates to one subject, and each column holds a different attribute of that subject. It is easier for people to read.

Long data - Information about one subject is spread over many rows, and each row holds one attribute of that subject. It is easier for computers to work with.


### .melt( )
changes data from wide to long
#### melt arguments 
1. id_vars. These are columns to be used as identifier variables. They are columns in the original dataset that we don't want to change.

df_tall = df_wide.melt(id_vars = ["col1", "col2"])

2. value_vars. Controls which columns are unpivoted. 

df_tall = df_wide.melt(id_vars = ["col1", "col2"], value_vars = ["2017", "2018"])

#### melting with column names
var_name = "year" - this argument sets the name of the variable column in the output (here, the column that holds the years). pandas documents var_name as a scalar, so pass a plain string rather than a list.

value_name = "dollars" - this argument sets the name of the value column in the output.

df_tall = df_wide.melt(id_vars = ["col1", "col2"], value_vars = ["2017", "2018"], var_name = "year", value_name = "dollars")
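Putting the melt arguments together, here is a small sketch on an invented wide table. The column names company and metric and all values are assumptions for illustration. The "2019" column is left out of value_vars to show that unlisted columns are dropped.

```python
import pandas as pd

# Hypothetical wide table: one row per company, one column per year.
df_wide = pd.DataFrame({
    "company": ["A", "B"],
    "metric": ["revenue", "revenue"],
    "2017": [10, 20],
    "2018": [11, 21],
    "2019": [12, 22],
})

# id_vars stay as they are; only the columns in value_vars are unpivoted.
# var_name and value_name rename the default "variable" and "value" columns.
df_tall = df_wide.melt(
    id_vars=["company", "metric"],
    value_vars=["2017", "2018"],
    var_name="year",
    value_name="dollars",
)
print(df_tall)
```

Each company now appears once per selected year, so two rows and two year columns become four rows.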